How parsing works
I am trying the sample code for the piracy report. The line of code:
for incident in soup('td', width="90%"):
searches the soup for a td element with the attribute width="90%", correct? It invokes the __init__ method of the BeautifulStoneSoup class, which eventually invokes SGMLParser.__init__(self).
Am I correct about the class flow above?
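As far as I can tell from the BeautifulSoup 3 source, the __init__ chain runs when the soup is constructed with BeautifulSoup(page), while calling an existing soup object, as in soup('td', width="90%"), goes through __call__, which is a shortcut for findAll. The toy class below only illustrates that distinction; ToySoup, find_all, and init_calls are hypothetical names, not part of BeautifulSoup.

```python
class ToySoup(object):
    """Hypothetical stand-in for a soup object; not BeautifulSoup itself."""

    def __init__(self, cells):
        # Runs once, when the object is constructed (the "parsing" step).
        self.cells = cells
        self.init_calls = 1

    def find_all(self, name, **attrs):
        # Return the cells whose tag name and attributes all match.
        return [c for c in self.cells
                if c["name"] == name
                and all(c.get(k) == v for k, v in attrs.items())]

    def __call__(self, name, **attrs):
        # Calling the instance delegates to find_all; __init__ is not re-run.
        return self.find_all(name, **attrs)


soup = ToySoup([{"name": "td", "width": "90%"},
                {"name": "td", "width": "10%"}])
matches = soup("td", width="90%")
print(len(matches), soup.init_calls)
```

So the search line itself would not re-enter __init__; only building the soup does.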
The soup looks like this in the report now:
<td class="fabrik_row___jos_fabrik_icc-ccs-piracymap2010___narrations" ><p>22.09.2010: 0236 UTC: Posn: 03:49.9N – 006:54.6E: Off Bonny River: Nigeria.<p/>
<p>About 21 armed pirates in three crafts boarded a pipe layer crane vessel undertow. All crew locked themselves in accommodations. Pirates were able to take one crewmember as hostage. Master called Nigerian naval vessel in vicinity. Later pirates released the crew and left the vessel. All crew safe.<p/></td>
There is no width attribute in the markup any more, so I changed the line of code that does the search:
for incident in soup('td', class="fabrik_row___jos_fabrik_icc-ccs-piracymap2010___narrations"):
It appears that class is a reserved word, maybe?
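A quick way to check this in plain Python, without BeautifulSoup at all: class is a keyword, so it cannot be written as a keyword argument, but it can still be passed as a dictionary key. The find_all function below is a hypothetical stand-in that only collects its filters (it is not the real findAll), and the short class value "x" is a placeholder.

```python
import keyword


def find_all(name, attrs=None, **kwargs):
    """Hypothetical stand-in: merges the filters and returns them unchanged."""
    filters = dict(attrs or {})
    filters.update(kwargs)
    return name, filters


# class is a Python keyword, so class="..." is rejected by the compiler.
is_keyword = keyword.iskeyword("class")
try:
    compile('find_all("td", class="x")', "<example>", "eval")
    keyword_arg_ok = True
except SyntaxError:
    keyword_arg_ok = False

# Passing the attribute name as a dictionary key sidesteps the problem.
via_dict = find_all("td", {"class": "x"})
via_unpack = find_all("td", **{"class": "x"})
print(is_keyword, keyword_arg_ok, via_dict == via_unpack)
```

BeautifulSoup 3 accepts the dictionary form as the second argument, for example soup.findAll('td', {'class': '...'}); the newer bs4 package also accepts class_ as a keyword.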
How do I get the current example code to run, and has more changed in the application than just the HTML output?
The URL I am using:
urllib2.urlopen("http://www.icc-ccs.org/index.php?option=com_fabrik&view=table&tableid=534&calculations=0&Itemid=82")
There must be a better way....
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.icc-ccs.org/index.php?option=com_fabrik&view=table&tableid=534&calculations=0&Itemid=82")
soup = BeautifulSoup(page)
soup.find("table", {"class": "fabrikTable"})
list1 = soup.table.findAll('p', limit=50)
i = 0
imax = 0
for item in list1:
    imax = imax + 1
# Step through the <p> list four entries at a time,
# stopping before an index would run past the end.
while i + 3 < imax:
    Itime = list1[i]
    i = i + 2
    Incident = list1[i]
    i = i + 1
    Inext = list1[i]
    print "Time ", Itime
    print "Incident", Incident
    print " "
    i = i + 1
class is a reserved word and will not work with that method.
This method works but does not return the list:
soup.find("tr", { "class" : "fabrik_row___jos_fabrik_icc-ccs-piracymap2010___narrations" })
And I confirmed the class flow for the parse.
The example will run, but the HTML must be parsed with different methods because width='90%' is no longer in the HTML.
Still working on the proper methods; will post back when I get it working.
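In the meantime, here is a minimal sketch of selecting the narration cells by class instead of by width. It deliberately swaps in Python 3's standard-library html.parser rather than BeautifulSoup, and it runs on the sample cell quoted above rather than on the live page; NarrationParser and its attributes are names made up for this example.

```python
from html.parser import HTMLParser

NARRATION_CLASS = "fabrik_row___jos_fabrik_icc-ccs-piracymap2010___narrations"

# Sample cell copied from the report HTML quoted earlier in the thread.
SAMPLE = """<td class="fabrik_row___jos_fabrik_icc-ccs-piracymap2010___narrations" ><p>22.09.2010: 0236 UTC: Posn: 03:49.9N – 006:54.6E: Off Bonny River: Nigeria.<p/>
<p>About 21 armed pirates in three crafts boarded a pipe layer crane vessel undertow. All crew locked themselves in accommodations. Pirates were able to take one crewmember as hostage. Master called Nigerian naval vessel in vicinity. Later pirates released the crew and left the vessel. All crew safe.<p/></td>"""


class NarrationParser(HTMLParser):
    """Hypothetical helper: collects the text found inside matching <td> cells."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_cell = False
        self.incidents = []  # one list of text chunks per matching <td>

    def handle_starttag(self, tag, attrs):
        # Open a new incident when a <td> carries the narration class.
        if tag == "td" and dict(attrs).get("class") == NARRATION_CLASS:
            self.in_cell = True
            self.incidents.append([])

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Keep only non-blank text that appears inside a matching cell.
        text = data.strip()
        if self.in_cell and text:
            self.incidents[-1].append(text)


parser = NarrationParser()
parser.feed(SAMPLE)
parser.close()

for chunks in parser.incidents:
    print("Time     ", chunks[0])
    print("Incident ", chunks[1])
```

Grouping by cell avoids counting <p> elements by hand, since each matching td yields its own list of text chunks.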