Selecting only text within a div tag

I'm working on a web parser using urllib. I need to save only the text that lies within a certain div tag. For instance, I'm saving all the text in the div "body". This means all text within the div tags will be returned. It's fine if there are other divs nested inside it, but as soon as I reach the parent's closing tag, it should stop. Any ideas?

My Idea

  1. Search for the div you're looking for.

  2. Record the position.

  3. Keep track of any divs after that point: +1 for each new div, -1 for each closing div.

  4. When the count is back to 0, you're at the closing tag of the parent div. Save that location.

  5. Then save the data from the beginning position to the end position.
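
The steps above could be sketched roughly like this using only the standard library. The function name `extract_div` and the assumption that the target div is identified by an `id` attribute are mine, not part of the question, and a regex-based scan like this will be fooled by div tags inside comments or scripts, so treat it as an illustration of the counting idea rather than a robust parser.

```python
import re

# Matches opening and closing div tags; group 1 is "/" for a closing tag.
DIV_TAG = re.compile(r"<(/?)div\b[^>]*>", re.IGNORECASE)

def extract_div(html, div_id):
    """Return the inner HTML of the first <div id="div_id">, or None.

    Hypothetical helper: assumes the target div is identified by its id.
    """
    # Steps 1-2: find the target div and record where its content starts.
    opening = re.search(
        r"<div\b[^>]*\bid\s*=\s*[\"']%s[\"'][^>]*>" % re.escape(div_id),
        html,
        re.IGNORECASE,
    )
    if opening is None:
        return None
    start = opening.end()
    depth = 1  # we are now inside the target div

    # Step 3: +1 for each nested <div>, -1 for each </div>.
    for tag in DIV_TAG.finditer(html, start):
        depth += -1 if tag.group(1) else 1
        # Step 4: depth 0 means this </div> closes the target div.
        if depth == 0:
            # Step 5: slice from the recorded start to the closing tag.
            return html[start:tag.start()]
    return None  # unbalanced markup: the target div never closed

sample = ('<div id="header">top</div>'
          '<div id="body">AAAA<div>BBBB<div>CCCC</div>DDDD</div>EEEE</div>FFFF')
inner = extract_div(sample, "body")
print(inner)
```

The result still contains the nested div tags; if only the text is wanted, the tags have to be stripped in a separate pass.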


If you're not really excited at the idea of parsing the HTML code yourself, there are two good options:

Beautiful Soup

lxml

You'll probably find that lxml runs faster than BeautifulSoup, but in my uses, Beautiful Soup was very easy to learn and use, and handled typical crappy HTML as found in the wild well enough that I don't have need for anything else.

YMMV.


Using lxml:

import lxml.html as lh
content='''\
<body>
<div>AAAA
  <div>BBBB
     <div>CCCC
     </div>DDDD
  </div>EEEE
</div>FFFF
</body>
'''
doc=lh.document_fromstring(content)
div=doc.xpath('./body/div')[0]
print(div.text_content())
# AAAA
#   BBBB
#      CCCC
#      DDDD
#   EEEE

div=doc.xpath('./body/div/div')[0]
print(div.text_content())
# BBBB
#      CCCC
#      DDDD


Personally I prefer lxml in general, but there are times when its HTML handling is a bit off... Here's a BeautifulSoup recipe if it helps.

from bs4 import BeautifulSoup, NavigableString

def printText(tags):
    # Recursively join every text node found beneath the given tag.
    s = []
    for tag in tags:
        if isinstance(tag, NavigableString):
            s.append(tag)
        else:
            s.append(printText(tag))
    return "".join(s)

html = "<html><p>Para 1<div class='stuff'>Div Lead<p>Para 2<blockquote>Quote 1</div><blockquote>Quote 2"
soup = BeautifulSoup(html, "html.parser")

v = soup.find('div', attrs={'class': 'stuff'})

# Print the text of the matched div and everything nested inside it.
print(printText(v))