Problems with BeautifulSoup parsing
<h2 class="sectionTitle">BACKGROUND</h2>
Mr. Paul J. Fribourg has bla bla</span>
<div style="margin-top:8px;">
<a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
</div>
I would like to extract the text that runs from "Mr. Paul" to "bla bla".
Some pages have a <p> in front of "Mr. Paul", so I can use findNext('p').
However, other pages do not have the <p>, like the example above.
This is my code for the case where there is a <p>:
background = bs2.find(text=re.compile("BACKGROUND"))
bb = background.findNext('p').contents
When there is no <p>, how can I extract the information?
It's hard to tell from the example you have given us, but it looks to me as though you could just get the next node after the h2. In this example, Lewis Carroll has a <p> (paragraph) tag and your friend Paul has only a closing </span> tag:
>>> from BeautifulSoup import BeautifulSoup
>>>
>>> html = '''
... <h2 class="sectionTitle">BACKGROUND</h2>
... <p>Mr. Lewis Carroll has bla bla</p>
... <div style="margin-top:8px;">
... <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
... </div>
... <h2 class="sectionTitle">BACKGROUND</h2>
... Mr. Paul J. Fribourg has bla bla</span>
... <div style="margin-top:8px;">
... <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
... </div>
... '''
>>>
>>> soup = BeautifulSoup(html)
>>> headings = soup.findAll('h2', text='BACKGROUND')
>>> for section in headings:
...     p = section.findNext('p')
...     if p:
...         print '> ', p.string
...     else:
...         print '> ', section.parent.next.next.strip()
...
> Mr. Lewis Carroll has bla bla
> Mr. Paul J. Fribourg has bla bla
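The session above uses Python 2 and BeautifulSoup 3. As a rough sketch of the same idea for Python 3, assuming the bs4 package is installed, the loop below takes the first non-empty text that follows each heading, whatever tag (if any) wraps it. The helper name text_after_headings and its heading_text parameter are made up for illustration and are not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, NavigableString

HTML = """
<h2 class="sectionTitle">BACKGROUND</h2>
<p>Mr. Lewis Carroll has bla bla</p>
<div style="margin-top:8px;">
<a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
</div>
<h2 class="sectionTitle">BACKGROUND</h2>
Mr. Paul J. Fribourg has bla bla</span>
<div style="margin-top:8px;">
<a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
</div>
"""


def text_after_headings(html, heading_text="BACKGROUND"):
    """Hypothetical helper: first non-empty text after each matching <h2>."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for h2 in soup.find_all("h2"):
        if h2.get_text(strip=True) != heading_text:
            continue
        # Walk forward in document order. Tags are skipped; only text
        # nodes are considered, so no tag name is assumed.
        for node in h2.next_elements:
            if not isinstance(node, NavigableString):
                continue
            if node.parent is h2:
                continue  # this is the heading's own text
            text = node.strip()
            if text:
                results.append(text)
                break  # stop at the first non-empty text after this heading
    return results


for line in text_after_headings(HTML):
    print(">", line)
```

Walking next_elements instead of calling find_next('p') is what removes the dependence on the tag name; the trade-off is that it will pick up any text that happens to follow the heading, so it only suits pages where the biography comes first.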
Following up on the comments:
>>> from BeautifulSoup import BeautifulSoup
>>> from urllib2 import urlopen
>>> html = urlopen('http://investing.businessweek.com/research/stocks/private/person.asp?personId=668561&privcapId=160900&previousCapId=285930&previousTitle=LOEWS%20CORP')
>>> soup = BeautifulSoup(html)
>>> headings = soup.findAll('h2', text='BACKGROUND')
>>> for section in headings:
...     paragraph = section.findNext('p')
...     if paragraph and paragraph.string:
...         print '> ', paragraph.string
...     else:
...         print '> ', section.parent.next.next.strip()
...
> Mr. Paul J. Fribourg has been the President of Contigroup Companies Inc. (for [...]
You may, of course, wish to check copyright notices, et cetera...
"Some webpage has <p> in front of Mr. Paul, so I could use FindNext('p'). However, some webpages do not have <p> like the example above."
You're not giving enough information for us to know how to recognize your string. Some possibilities:
- a fixed node structure, e.g. getChildren()[1].getChildren()[0].text
- if it's preceded by the magic string 'BACKGROUND', as in your code, then your approach of finding the next node seems good; just don't build in an assumption that the tag name is 'p'
- a regex (e.g. "(Mr.|Ms.) ...")
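The regex option from the list above could look something like this minimal sketch, which uses only the standard library. The pattern, the list of honorifics, and the helper name find_bio_text are assumptions for illustration; note that the dot after the honorific is escaped, unlike in the shorthand "(Mr.|Ms.)".

```python
import re

# Hypothetical pattern: an honorific followed by everything up to the next
# tag. The set of honorifics is an assumption; extend it to fit your pages.
NAME_PATTERN = re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\.\s+[^<]+")


def find_bio_text(html):
    """Return the first honorific-led run of text, or None if none is found."""
    match = NAME_PATTERN.search(html)
    return match.group(0).strip() if match else None


sample = (
    '<h2 class="sectionTitle">BACKGROUND</h2>\n'
    "Mr. Paul J. Fribourg has bla bla</span>\n"
    '<div style="margin-top:8px;">'
)
print(find_bio_text(sample))  # prints: Mr. Paul J. Fribourg has bla bla
```

This works on the raw HTML string, so it does not care whether a <p> is present, but it will miss any biography that does not open with one of the listed honorifics.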
Can you show us an HTML example where there is no <p> in front of the name?