Python beginner: read elements in one file and use them to modify another file
I'm an economist with no programming background. I'm trying to learn Python because I've been told it is very powerful for parsing data from websites. At the moment I'm stuck with the following code, and I would be extremely grateful for any suggestions.
First of all, I wrote some code to parse the data from this table:
http://www.webifel.it/sifl/Tavola07.asp?comune=MILANO&cod_istat=15146
The code I wrote is the following:
#!/usr/bin/env python

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os

def extract(soup):
    table = soup.find("table", cellspacing=2)
    for row in table.findAll('tr')[2:]:
        col = row.findAll('td')
        year = col[0].div.b.font.string
        detrazione = col[1].div.b.font.string
        ordinaria = col[2].div.b.font.string
        principale = col[3].div.b.font.string
        scopo = col[4].div.b.font.string
        record = (year, detrazione, ordinaria, principale, scopo)
        print >> outfile, "|".join(record)

outfile = open("milano.txt", "w")

br = Browser()
br.set_handle_robots(False)

url = "http://www.webifel.it/sifl/Tavola07.asp?comune=MILANO&cod_istat=15146"

page1 = br.open(url)
html1 = page1.read()
soup1 = BeautifulSoup(html1)

extract(soup1)

outfile.close()
The code reads the table, takes only the information I need, and creates a txt file. The code is pretty rudimentary, but it accomplishes the job.
My problem starts now. The URL posted above is just one of the roughly 200 from which I need to parse data. All the URLs differ in two elements only. Using the previous URL:
http://www.webifel.it/sifl/Tavola07.asp?comune=MILANO&cod_istat=15146
the two elements that uniquely identify this page are MILANO (the name of the city) and 15146 (a bureaucratic code).
What I wanted to do was, first, create a file with two columns:
- In the first the names of the cities I need;
- In the second the bureaucratic codes.
Then I wanted to write a loop in Python that reads each line of this file, substitutes the two elements into the URL in my code, and performs the parsing task separately for each city.
Do you have any suggestion about how to proceed? Thanks in advance for any help and suggestions!
[Update]
Thanks to all for the helpful suggestions. I found Thomas K's answer the easiest to implement with my knowledge of Python. I still have problems, though. I modified the code in the following way:
#!/usr/bin/env python

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os
import csv

def extract(soup):
    table = soup.find("table", cellspacing=2)
    for row in table.findAll('tr')[2:]:
        col = row.findAll('td')
        year = col[0].div.b.font.string
        detrazione = col[1].div.b.font.string
        ordinaria = col[2].div.b.font.string
        principale = col[3].div.b.font.string
        scopo = col[4].div.b.font.string
        record = (year, detrazione, ordinaria, principale, scopo)
        print >> outfile, "|".join(record)

citylist = csv.reader(open("citycodes.csv", "rU"), dialect=csv.excel)

for city in citylist:
    outfile = open("%s.txt", "w") % city

    br = Browser()
    br.set_handle_robots(False)

    url = "http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%s" % city

    page1 = br.open(url)
    html1 = page1.read()
    soup1 = BeautifulSoup(html1)

    extract(soup1)

    outfile.close()
where citycodes.csv is in the following format:
MILANO;12345
MODENA;67891
I get the following error:
Traceback (most recent call last):
  File "modena2.py", line 25, in <module>
    outfile = open("%s.txt", "w") % city
TypeError: unsupported operand type(s) for %: 'file' and 'list'
Thanks again!
One little thing you need to fix:
This:

for city in citylist:
    outfile = open("%s.txt", "w") % city
    #                             ^^^^^^

Should be this:

for city in citylist:
    outfile = open("%s.txt" % city, "w")
    #                       ^^^^^^

In the original version, open("%s.txt", "w") is evaluated first and returns a file object, so % is then applied to that file and the list of CSV fields, which is exactly what the TypeError is complaining about.
If the file is in CSV format then you can use the csv module to read it. Then just use urllib.urlencode() to generate the query string, and urlparse.urlunparse() to generate the full URL.
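Something along these lines, for example (a rough sketch only, assuming Python 2 and a semicolon-separated citycodes.csv as shown in the update, e.g. "MILANO;12345"):

# Rough sketch: build each URL with the standard library instead of string pasting.
import csv
import urllib
import urlparse

for comune, cod_istat in csv.reader(open("citycodes.csv", "rU"), delimiter=";"):
    # urlencode takes care of any characters that need escaping in the query string
    query = urllib.urlencode({"comune": comune, "cod_istat": cod_istat})
    url = urlparse.urlunparse(("http", "www.webifel.it", "/sifl/Tavola07.asp",
                               "", query, ""))
    # url now looks like
    # http://www.webifel.it/sifl/Tavola07.asp?comune=...&cod_istat=...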
No need to create a separate file; use a Python dictionary instead, in which there is a city -> code relationship.
See: http://docs.python.org/tutorial/datastructures.html#dictionaries
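For example (just a sketch; 15146 is the code from the question's URL, and any other entries would be filled in the same way):

# Sketch: keep the city -> code mapping directly in the script.
city_codes = {
    "MILANO": 15146,
}

for comune, cod_istat in city_codes.items():
    url = ("http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%d"
           % (comune, cod_istat))
    # open the page and extract the information, as in your existing code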
Quick and dirty:
import csv

citylist = csv.reader(open("citylist.csv"))

for city in citylist:
    # each row is a list, so convert it to a tuple before using it with %
    url = "http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%s" % tuple(city)
    # open the page and extract the information
Assuming you have a csv file looking like:
MILANO,15146
ROMA,12345
There are more powerful tools, like urllib.urlencode()
as Ignacio mentioned. But they're probably overkill for this.
P.S. Congratulations: you've done the hard bit - scraping data from HTML. Looping over a list is the easy bit.
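For completeness, here is roughly how the pieces could fit together (only a sketch: it reuses the extract() function and the mechanize/BeautifulSoup setup from the question, and assumes the comma-separated citylist.csv above):

import csv
from mechanize import Browser
from BeautifulSoup import BeautifulSoup

br = Browser()
br.set_handle_robots(False)

citylist = csv.reader(open("citylist.csv"))

for city, code in citylist:
    # one output file per city, e.g. MILANO.txt
    outfile = open("%s.txt" % city, "w")
    url = "http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%s" % (city, code)
    soup = BeautifulSoup(br.open(url).read())
    extract(soup)  # extract() writes to the module-level outfile, as in the question
    outfile.close()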
Just scratching out the basics...
#!/usr/bin/env python

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os

outfile = open("milano.txt", "w")

def extract(soup):
    global outfile
    table = soup.find("table", cellspacing=2)
    for row in table.findAll('tr')[2:]:
        col = row.findAll('td')
        year = col[0].div.b.font.string
        detrazione = col[1].div.b.font.string
        ordinaria = col[2].div.b.font.string
        principale = col[3].div.b.font.string
        scopo = col[4].div.b.font.string
        record = (year, detrazione, ordinaria, principale, scopo)
        print >> outfile, "|".join(record)

br = Browser()
br.set_handle_robots(False)

# fill in your cities here, like so
ListOfCityCodePairs = [('MILANO', 15146)]

for (city, code) in ListOfCityCodePairs:
    url = "http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%d" % (city, code)

    page1 = br.open(url)
    html1 = page1.read()
    soup1 = BeautifulSoup(html1)

    extract(soup1)

outfile.close()