
How parsing works

I am trying the sample code for the piracy report. The line of code:

for incident in soup('td', width="90%"):

searches the soup for a td element with the attribute width="90%", correct? It invokes the __init__ method of the BeautifulStoneSoup class, which eventually invokes SGMLParser.__init__(self).

Am I correct with the class flow above?

The soup looks like this in the report now:

<td class="fabrik_row___jos_fabrik_icc-ccs-piracymap2010___narrations" ><p>22.09.2010: 0236 UTC: Posn: 03:49.9N – 006:54.6E: Off Bonny River: Nigeria.<p/>
<p>About 21 armed pirates in three crafts boarded a pipe layer crane vessel undertow. All crew locked themselves in accommodations. Pirates were able to take one crewmember as hostage. Master called Nigerian naval vessel in vicinity. Later pirates released the crew and left the vessel. All crew safe.<p/></td>

There is no width markup in the text. I changed the line of code that is searching:

for incident in soup('td', class="fabrik_row___jos_fabrik_icc-ccs-piracymap2010___narrations"):

It appears that class is a reserved word, maybe?

How do I get the current example code to run, and has more changed in the application than just the HTML output?

The URL I am using:

urllib2.urlopen("http://www.icc-ccs.org/index.php?option=com_fabrik&view=table&tableid=534&calculations=0&Itemid=82")


There must be a better way....

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.icc-ccs.org/index.php?option=com_fabrik&view=table&tableid=534&calculations=0&Itemid=82")
soup = BeautifulSoup(page)
# Keep the result of find(); otherwise the lookup has no effect.
table = soup.find("table", {"class": "fabrikTable"})
paragraphs = table.findAll('p', limit=50)
# Assumed layout: each incident takes four <p> entries, with the time at
# offset 0 and the narration at offset 2. Stopping at len - 2 keeps a
# trailing partial group from raising IndexError.
for i in range(0, len(paragraphs) - 2, 4):
    print "Time    ", paragraphs[i]
    print "Incident", paragraphs[i + 2]
    print " "
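
The index arithmetic in the loop above (read entry i and entry i + 2, then advance by four) can also be written with slicing. This is only a sketch on a plain list of placeholder strings standing in for the findAll result; the four-entry layout is an assumption taken from that loop, not something checked against the live page.

```python
# Stand-in for the list returned by findAll('p'); placeholder strings only.
paragraphs = ["time 1", "", "incident 1", "",
              "time 2", "", "incident 2", "",
              "time 3", ""]  # trailing partial group with no narration

# Every 4th entry starting at 0 is a time; every 4th starting at 2 is a
# narration. zip() stops at the shorter slice, so the partial group is skipped.
pairs = list(zip(paragraphs[0::4], paragraphs[2::4]))

for time_line, narration in pairs:
    print("Time    ", time_line)
    print("Incident", narration)
    print(" ")
```

Pairing by slice avoids the manual counter and cannot run past the end of the list.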


class is a reserved word in Python, so it cannot be passed as a keyword argument to that method.

This form works, but find returns only the first match rather than a list of all matches:

soup.find("tr", { "class" : "fabrik_row___jos_fabrik_icc-ccs-piracymap2010___narrations" })
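
A small sketch of why the keyword form fails while the dictionary form does not. find_all_stub is a hypothetical stand-in that only mirrors the shape of such a call (tag name, attrs dict, keyword filters); it is not BeautifulSoup's code, and "narrations" is a shortened placeholder for the real class name.

```python
# Hypothetical stand-in for a search method that accepts a tag name, an
# attrs dict, and extra keyword filters. It is NOT BeautifulSoup itself.
def find_all_stub(name, attrs=None, **kwargs):
    filters = dict(attrs or {})
    filters.update(kwargs)
    return name, filters

# Writing class=... as a keyword is rejected by the Python parser before
# any library code runs, because class is a reserved word.
try:
    compile('find_all_stub("td", class="narrations")', "<demo>", "eval")
    keyword_form_ok = True
except SyntaxError:
    keyword_form_ok = False

# A dict key is just a string, so the same filter can be passed this way.
dict_form = find_all_stub("td", {"class": "narrations"})

# Attribute names that are not reserved words, such as width, are fine.
width_form = find_all_stub("td", width="90%")

print(keyword_form_ok)
print(dict_form)
print(width_form)
```

So the error comes from Python's grammar, not from the parser library, which is why the attrs dictionary gets around it.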

I also confirmed the class flow for the parse. The example still runs, but the search has to match on something other than width, because width="90%" no longer appears in the HTML.
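
To illustrate matching on the class attribute instead of width, here is a rough sketch that uses Python 3's standard-library html.parser rather than BeautifulSoup, so it is a swapped-in technique and not the original example's method. NarrationCollector is a name made up for this sketch, and the sample markup is a shortened copy of the cell quoted in the question.

```python
from html.parser import HTMLParser

CLASS_NAME = "fabrik_row___jos_fabrik_icc-ccs-piracymap2010___narrations"


class NarrationCollector(HTMLParser):
    """Collect the text of every <td> whose class equals CLASS_NAME."""

    def __init__(self):
        super().__init__()
        self.inside = False  # True while within a matching <td>
        self.cells = []      # one list of text fragments per matching cell

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "td" and dict(attrs).get("class") == CLASS_NAME:
            self.inside = True
            self.cells.append([])

    def handle_endtag(self, tag):
        if tag == "td":
            self.inside = False

    def handle_data(self, data):
        text = data.strip()
        if self.inside and text:
            self.cells[-1].append(text)


# Shortened sample; the real page was not fetched for this sketch.
sample = (
    '<table><tr>'
    '<td class="' + CLASS_NAME + '" >'
    '<p>22.09.2010: 0236 UTC: Off Bonny River: Nigeria.<p/>'
    '<p>About 21 armed pirates in three crafts boarded a vessel.<p/></td>'
    '<td class="other">ignored</td>'
    '</tr></table>'
)

collector = NarrationCollector()
collector.feed(sample)
collector.close()
for cell in collector.cells:
    print("Time    ", cell[0])
    print("Incident", cell[1])
```

Each matching cell yields its text fragments in document order, so the time line and the narration come out as a pair without counting paragraphs.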

Still working on the proper methods; will post back when I get it working.
