Python beginner: read elements in one file and use them to modify another file
I'm an economist with no programming background. I'm trying to learn Python because I've been told it is very powerful for parsing data from websites. At the moment I'm stuck with the following code, and I would be extremely grateful for any suggestions.
First of all, I wrote some code to parse the data from this table:
http://www.webifel.it/sifl/Tavola07.asp?comune=MILANO&cod_istat=15146
The code I wrote is the following:
#!/usr/bin/env python

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os

def extract(soup):
    table = soup.find("table", cellspacing=2)
    for row in table.findAll('tr')[2:]:
        col = row.findAll('td')
        year = col[0].div.b.font.string
        detrazione = col[1].div.b.font.string
        ordinaria = col[2].div.b.font.string
        principale = col[3].div.b.font.string
        scopo = col[4].div.b.font.string
        record = (year, detrazione, ordinaria, principale, scopo)
        print >> outfile, "|".join(record)

outfile = open("milano.txt", "w")

br = Browser()
br.set_handle_robots(False)

url = "http://www.webifel.it/sifl/Tavola07.asp?comune=MILANO&cod_istat=15146"

page1 = br.open(url)
html1 = page1.read()
soup1 = BeautifulSoup(html1)

extract(soup1)

outfile.close()
The code reads the table, takes only the information I need, and creates a txt file. The code is pretty rudimentary, but it accomplishes the job.
My problem starts now. The URL posted above is just one of the roughly 200 from which I need to parse data. All the URLs differ in two elements only. Using the previous URL:
http://www.webifel.it/sifl/Tavola07.asp?comune=MILANO&cod_istat=15146
the two elements that uniquely identify this page are MILANO (the name of the city) and 15146 (a bureaucratic code).
What I wanted to do was, first, create a file with two columns:
- In the first the names of the cities I need;
- In the second the bureaucratic codes.
Then I wanted to write a loop in Python that reads each line of this file, substitutes the two elements into the URL in my code, and performs the parsing task separately for each city.
Do you have any suggestion about how to proceed? Thanks in advance for any help and suggestions!
[Update]
Thanks to all for the helpful suggestions. I found Thomas K's answer the easiest to implement with my knowledge of Python. I still have problems, though. I modified the code in the following way:
#!/usr/bin/env python

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os
import csv

def extract(soup):
    table = soup.find("table", cellspacing=2)
    for row in table.findAll('tr')[2:]:
        col = row.findAll('td')
        year = col[0].div.b.font.string
        detrazione = col[1].div.b.font.string
        ordinaria = col[2].div.b.font.string
        principale = col[3].div.b.font.string
        scopo = col[4].div.b.font.string
        record = (year, detrazione, ordinaria, principale, scopo)
        print >> outfile, "|".join(record)

citylist = csv.reader(open("citycodes.csv", "rU"), dialect=csv.excel)

for city in citylist:
    outfile = open("%s.txt", "w") % city

    br = Browser()
    br.set_handle_robots(False)

    url = "http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%s" % city

    page1 = br.open(url)
    html1 = page1.read()
    soup1 = BeautifulSoup(html1)

    extract(soup1)

    outfile.close()
where citycodes.csv is in the following format:
MILANO;12345
MODENA;67891
I get the following error:
Traceback (most recent call last):
  File "modena2.py", line 25, in <module>
    outfile = open("%s.txt", "w") % city
TypeError: unsupported operand type(s) for %: 'file' and 'list'
Thanks again!
One little thing you need to fix:
This:

for city in citylist:
    outfile = open("%s.txt", "w") % city
    #                             ^^^^^^

Should be this:

for city in citylist:
    outfile = open("%s.txt" % city, "w")
    #                       ^^^^^^

In the original version, open("%s.txt", "w") is evaluated first and returns a file object, so % is then applied to that file and the list of CSV fields, which is exactly what the TypeError is complaining about.
If the file is in CSV format then you can use the csv module to read it. Then just use urllib.urlencode() to generate the query string, and urlparse.urlunparse() to generate the full URL.
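Something along these lines, for example (a rough sketch only, assuming Python 2 and a semicolon-separated citycodes.csv as shown in the update, e.g. "MILANO;12345"):

# Rough sketch: build each URL with the standard library instead of string pasting.
import csv
import urllib
import urlparse

for comune, cod_istat in csv.reader(open("citycodes.csv", "rU"), delimiter=";"):
    # urlencode takes care of any characters that need escaping in the query string
    query = urllib.urlencode({"comune": comune, "cod_istat": cod_istat})
    url = urlparse.urlunparse(("http", "www.webifel.it", "/sifl/Tavola07.asp",
                               "", query, ""))
    # url now looks like
    # http://www.webifel.it/sifl/Tavola07.asp?comune=...&cod_istat=...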
No need to create a separate file; use a Python dictionary instead, in which there is a city -> code relationship.
See: http://docs.python.org/tutorial/datastructures.html#dictionaries
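For example (just a sketch; 15146 is the code from the question's URL, and any other entries would be filled in the same way):

# Sketch: keep the city -> code mapping directly in the script.
city_codes = {
    "MILANO": 15146,
}

for comune, cod_istat in city_codes.items():
    url = ("http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%d"
           % (comune, cod_istat))
    # open the page and extract the information, as in your existing code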
Quick and dirty:
import csv

citylist = csv.reader(open("citylist.csv"))

for city in citylist:
    # each row is a list, so convert it to a tuple before using it with %
    url = "http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%s" % tuple(city)
    # open the page and extract the information
Assuming you have a csv file looking like:
MILANO,15146
ROMA,12345
There are more powerful tools, like urllib.urlencode()
as Ignacio mentioned. But they're probably overkill for this.
P.S. Congratulations: you've done the hard bit - scraping data from HTML. Looping over a list is the easy bit.
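For completeness, here is roughly how the pieces could fit together (only a sketch: it reuses the extract() function and the mechanize/BeautifulSoup setup from the question, and assumes the comma-separated citylist.csv above):

import csv
from mechanize import Browser
from BeautifulSoup import BeautifulSoup

br = Browser()
br.set_handle_robots(False)

citylist = csv.reader(open("citylist.csv"))

for city, code in citylist:
    # one output file per city, e.g. MILANO.txt
    outfile = open("%s.txt" % city, "w")
    url = "http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%s" % (city, code)
    soup = BeautifulSoup(br.open(url).read())
    extract(soup)  # extract() writes to the module-level outfile, as in the question
    outfile.close()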
Just scratching out the basics...
#!/usr/bin/env python

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os

outfile = open("milano.txt", "w")

def extract(soup):
    global outfile
    table = soup.find("table", cellspacing=2)
    for row in table.findAll('tr')[2:]:
        col = row.findAll('td')
        year = col[0].div.b.font.string
        detrazione = col[1].div.b.font.string
        ordinaria = col[2].div.b.font.string
        principale = col[3].div.b.font.string
        scopo = col[4].div.b.font.string
        record = (year, detrazione, ordinaria, principale, scopo)
        print >> outfile, "|".join(record)

br = Browser()
br.set_handle_robots(False)

# fill in your cities here, like so
ListOfCityCodePairs = [('MILANO', 15146)]

for (city, code) in ListOfCityCodePairs:
    url = "http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%d" % (city, code)

    page1 = br.open(url)
    html1 = page1.read()
    soup1 = BeautifulSoup(html1)

    extract(soup1)

outfile.close()