problems ... BeautifulSoup Parsing

<h2 class="sectionTitle">BACKGROUND</h2>
Mr. Paul J. Fribourg has bla bla</span>
<div style="margin-top:8px;">
    <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
</div>

I would like to extract the text that runs from "Mr. Paul" to "bla bla". Some web pages have a <p> in front of "Mr. Paul", so there I can use findNext('p'). However, other pages have no <p>, like the example above.

This is my code for the case where there is a <p>:

background = bs2.find(text=re.compile("BACKGROUND"))
bb= background.findNext('p').contents

But when there is no <p>, how can I extract the information?


It's hard to tell from the example you have given us, but it looks to me as if you could just get the next node after the h2. In this example, Lewis Carroll has a paragraph tag and your friend Paul has only a closing span tag:

>>> from BeautifulSoup import BeautifulSoup
>>>
>>> html = '''
... <h2 class="sectionTitle">BACKGROUND</h2>
... <p>Mr. Lewis Carroll has bla bla</p>
... <div style="margin-top:8px;">
...     <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
... </div>
... <h2 class="sectionTitle">BACKGROUND</h2>
... Mr. Paul J. Fribourg has bla bla</span>
... <div style="margin-top:8px;">
...     <a href="javascript:void(0)" onclick="show_more(this);">Read Full Background</a>
... </div>
... '''
>>>
>>> soup = BeautifulSoup(html)
>>> headings = soup.findAll('h2', text='BACKGROUND')
>>> for section in headings:
...     p = section.findNext('p')
...     if p:
...         print '> ',  p.string
...     else:
...         print '> ', section.parent.next.next.strip()
...
>  Mr. Lewis Carroll has bla bla
>  Mr. Paul J. Fribourg has bla bla

Following up on the comments, here is the same approach run against the live page:

>>> from BeautifulSoup import BeautifulSoup
>>> from urllib2 import urlopen
>>> html = urlopen('http://investing.businessweek.com/research/stocks/private/person.asp?personId=668561&privcapId=160900&previousCapId=285930&previousTitle=LOEWS%20CORP')
>>> soup = BeautifulSoup(html)
>>> headings = soup.findAll('h2', text='BACKGROUND')
>>> for section in headings:
...     paragraph = section.findNext('p')
...     if paragraph and paragraph.string:
...         print '> ', paragraph.string
...     else:
...         print '> ', section.parent.next.next.strip()
... 
>  Mr. Paul J. Fribourg has been the President of Contigroup Companies Inc. (for [...]

You may, of course, wish to check copyright notices, et cetera...
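
For readers on Python 3, here is a minimal sketch of the same "take the next node after the h2" idea, assuming the bs4 package (BeautifulSoup 4) is installed and using the standard html.parser backend. The names SAMPLE_HTML and background_texts are made up for illustration. Instead of looking for a <p>, it walks the heading's following siblings and keeps the first one that contains any text, whatever kind of node it is:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample: one section wraps the text in <p>, the other does not.
SAMPLE_HTML = """
<h2 class="sectionTitle">BACKGROUND</h2>
<p>Mr. Lewis Carroll has bla bla</p>
<div style="margin-top:8px;">
    <a href="javascript:void(0)">Read Full Background</a>
</div>
<h2 class="sectionTitle">BACKGROUND</h2>
Mr. Paul J. Fribourg has bla bla</span>
<div style="margin-top:8px;">
    <a href="javascript:void(0)">Read Full Background</a>
</div>
"""


def background_texts(markup):
    """Return the first non-empty text that follows each BACKGROUND heading."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for heading in soup.find_all("h2", string="BACKGROUND"):
        for node in heading.next_siblings:
            # A sibling is either a tag (e.g. <p>) or a bare string.
            text = node.get_text() if isinstance(node, Tag) else str(node)
            text = text.strip()
            if text:
                results.append(text)
                break  # only the first piece of text after the heading
    return results


for text in background_texts(SAMPLE_HTML):
    print(">", text)
```

This assumes the text you want is a sibling of the h2. If the real page nests the heading inside another element, as the section.parent.next.next workaround above suggests, the loop would need to start from that wrapper instead.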


"Some webpage has <p> infront of Mr. Paul, so I could use FindNext('p') However, some webpages do not have <p> like the example above."

You haven't given us enough information to say how the string you want can be recognized. Some possibilities:

  • a fixed node structure, e.g. getChildren()[1].getChildren()[0].text
  • if it is always preceded by the magic string 'BACKGROUND', as in your code, then your approach of finding the next node seems good; just don't build in an assumption that the tag name is 'p'
  • a regex (e.g. "(Mr.|Ms.) ...")

Can you show us an HTML example where there is no <p> in front of the name?
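
To make the regex option concrete, here is a rough sketch that uses only the standard library. It assumes the text always starts with one of a few honorifics and runs up to the next tag; the honorific list, HONORIFIC_RE and find_bio_sentences are illustrative guesses, not something taken from the real pages:

```python
import re

# Assumption: the biography starts with an honorific and ends at the next "<".
# Note the escaped dot; a bare "." would match any character.
HONORIFIC_RE = re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\.\s+[^<]+")


def find_bio_sentences(markup):
    """Return every run of text that starts with an honorific."""
    return [m.group(0).strip() for m in HONORIFIC_RE.finditer(markup)]


sample = (
    '<h2 class="sectionTitle">BACKGROUND</h2>\n'
    "Mr. Paul J. Fribourg has bla bla</span>\n"
    '<div style="margin-top:8px;"></div>'
)
print(find_bio_sentences(sample))
```

A pattern like this is brittle: it misses people without a listed honorific and stops at any inline tag inside the text, so it is probably best kept as a fallback for pages where the node-based approach fails.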
