Python and Beautiful Soup - Search for tag A, return following tag B's until tag A is found
I have 2 variables, one with 'last volume' and the other with 'last issue'.
The HTML I am dealing with contains a list of all volumes and issues, most recent first.
I need to return the href links for all volumes and issues that are newer than what I have on file.
So, using the example below: if my last volume is 13 and my last issue is 1, I need to return the hrefs for volume 13, issue 2 and volume 14, issue 1.
I am having a hard time with this since the volume is on its own...
Here is what I have so far:
HTML:
<ul class="bobby">
<li><strong>Volume 14</strong></li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, September 2011">Issue 1, September 2011</a>
</li>
<li><strong>Volume 13</strong></li>
<li class="">
<a href="/content/ben/cchts/2010/00000013/00000002" title="Issue 2, December 2010">Issue 2, December 2010</a>
</li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, November 2011">Issue 1, November 2011</a>
</li>
</ul>
Script snippet:
results = soup.find('ul', attrs={'class' : 'bobby'})
#temp until I get it reading from file
lastVol = '13'
#find the last volume
findlastVol = results.findNext('strong', text= re.compile('Volume ' + lastVol))
#temp until I get it reading from file
lastIss = '2'
#find the last issue
findlastIss = findlastVol.findNext('a', text= re.compile('Issue ' + lastIss))
So I can get to the tag for the last volume and issue on file, but I have had several failed attempts at traversing back up and stopping at the first issue...
Or starting at the top and traversing down until that volume and issue condition is met...
Can someone please give me some assistance? Thanks.
I think you are looking for findPrevious, which you could use this way:
import BeautifulSoup
import re
content='''
<ul class="bobby">
<li><strong>Volume 14</strong></li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, September 2011">Issue 1, September 2011</a>
</li>
<li><strong>Volume 13</strong></li>
<li class="">
<a href="/content/ben/cchts/2010/00000013/00000002" title="Issue 2, December 2010">Issue 2, December 2010</a>
</li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, November 2011">Issue 1, November 2011</a>
</li>
</ul>
'''
last_volume=13
last_issue=1
soup=BeautifulSoup.BeautifulSoup(content)
results = soup.find('ul', attrs={'class' : 'bobby'})
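for a_string in results.findAll('a', text=re.compile('Issue')):
    # With text=..., BeautifulSoup 3 returns the matching text nodes,
    # so a_string is the link text and a_string.parent is the <a> tag.
    volume = a_string.findPrevious(text=re.compile('Volume'))
    volume = int(re.search(r'(\d+)', volume).group(1))
    issue = int(re.search(r'(\d+)', a_string).group(1))
    href = a_string.parent['href']
    # Newer means a later volume, or the same volume with a later issue.
    if (volume > last_volume) or (volume == last_volume and issue > last_issue):
        print(volume, issue, href)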
yields
(14, 1, u'/content/ben/cchts/2011/00000014/00000001')
(13, 2, u'/content/ben/cchts/2010/00000013/00000002')
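The "newer than what I have on file" test in the loop above can also be written as a tuple comparison, since Python compares tuples element by element. The following is only a sketch: the helper names parse_number and is_newer are made up for illustration and are not part of Beautiful Soup.

```python
import re


def parse_number(text):
    """Return the first integer found in text, e.g. 'Volume 14' -> 14.

    Hypothetical helper for illustration only.
    """
    match = re.search(r'(\d+)', text)
    if match is None:
        raise ValueError('no number found in %r' % (text,))
    return int(match.group(1))


def is_newer(volume, issue, last_volume, last_issue):
    """True if (volume, issue) comes after (last_volume, last_issue).

    Tuples compare left to right, so the volume decides first and the
    issue only breaks ties within the same volume.
    """
    return (volume, issue) > (last_volume, last_issue)
```

For example, is_newer(13, 2, 13, 1) is True, while is_newer(13, 1, 13, 1) is False, which matches the condition used in the loop.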
from BeautifulSoup import BeautifulSoup
content = '''<ul class="bobby">
<li><strong>Volume 14</strong></li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, September 2011">Issue 1, September 2011</a>
</li>
<li><strong>Volume 13</strong></li>
<li class="">
<a href="/content/ben/cchts/2010/00000013/00000002" title="Issue 2, December 2010">Issue 2, December 2010</a>
</li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, November 2011">Issue 1, November 2011</a>
</li>
</ul>
'''
soup = BeautifulSoup(content)
soup.prettify()
last_vol = 13
last_issue = 1
res = soup.find('ul',{"class":"bobby"})
lis = res.findAll('li')
for j in lis:
    if j.find('strong') is not None:
        # Heading item: 'Volume 14'[7:] -> '14'
        vol = int(j.contents[0].string[7:])
    elif (vol > last_vol) or (vol == last_vol and int(j.contents[1]['href'][33:]) > last_issue):
        # Issue item: contents[1] is the <a>; href[33:] is the zero-padded issue number
        print "Volume\t:%d" % vol
        print j.contents[1].string
        print "href\t:%s" % j.contents[1]['href']
Gives
Volume :14
Issue 1, September 2011
href :/content/ben/cchts/2011/00000014/00000001
Volume :13
Issue 2, December 2010
href :/content/ben/cchts/2010/00000013/00000002
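Since the list is ordered most recent first, it is also possible to walk it from the top and stop at the first issue that is not newer than the one on file, which is the "traverse down until the condition is met" idea from the question. The sketch below is an assumption-laden variant, not either answer's method: it targets Python 3 and uses the standard library's html.parser instead of Beautiful Soup so that it runs without extra packages, and the names IssueCollector and newer_issues are made up for illustration.

```python
import re
from html.parser import HTMLParser
from itertools import takewhile

# Shortened copy of the sample markup from the question.
content = '''<ul class="bobby">
<li><strong>Volume 14</strong></li>
<li class=""><a href="/content/ben/cchts/2011/00000014/00000001">Issue 1, September 2011</a></li>
<li><strong>Volume 13</strong></li>
<li class=""><a href="/content/ben/cchts/2010/00000013/00000002">Issue 2, December 2010</a></li>
<li class=""><a href="/content/ben/cchts/2011/00000014/00000001">Issue 1, November 2011</a></li>
</ul>'''


class IssueCollector(HTMLParser):
    """Collect (volume, issue, href) for every issue link, in document order."""

    def __init__(self):
        super().__init__()
        self.volume = None      # number from the latest 'Volume N' heading
        self.in_strong = False  # True while inside a <strong> element
        self.href = None        # href of the <a> currently open, if any
        self.entries = []

    def handle_starttag(self, tag, attrs):
        if tag == 'strong':
            self.in_strong = True
        elif tag == 'a':
            self.href = dict(attrs).get('href')

    def handle_endtag(self, tag):
        if tag == 'strong':
            self.in_strong = False
        elif tag == 'a':
            self.href = None

    def handle_data(self, data):
        match = re.search(r'(\d+)', data)
        if match is None:
            return  # whitespace or text without a number
        number = int(match.group(1))
        if self.in_strong and 'Volume' in data:
            self.volume = number
        elif self.href is not None and 'Issue' in data and self.volume is not None:
            self.entries.append((self.volume, number, self.href))


def newer_issues(html, last_volume, last_issue):
    """Return entries from the top of the list until one is not newer."""
    collector = IssueCollector()
    collector.feed(html)
    collector.close()
    last_seen = (last_volume, last_issue)
    # takewhile stops at the first entry that fails the test, which
    # relies on the list being sorted newest first.
    return list(takewhile(lambda entry: entry[:2] > last_seen, collector.entries))


for volume, issue, href in newer_issues(content, 13, 1):
    print(volume, issue, href)
```

The early stop is only safe if the page really is sorted newest first. If that ordering cannot be trusted, filtering every entry, as both answers above do, is the more robust choice.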