Use BeautifulSoup to extract sibling nodes between two nodes
I've got a document like this:
<p class="top">I don't want this</p>
<p>I want this</p>
<table>
<!-- ... -->
</table>
<img ... />
<p> and all that stuff too</p>
<p class="end">But not this and nothing after it</p>
I want to extract everything between the p[class=top] and p[class=end] paragraphs.
Is there a nice way I can do this with BeautifulSoup?
The node.nextSibling attribute is your solution:
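    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup(html)
    nextNode = soup.find('p', {'class': 'top'})
    while True:
        # Advance first, so the p[class=top] node itself is never processed
        nextNode = nextNode.nextSibling
        # Stop if there are no more siblings (no p[class=end] was found)
        if nextNode is None:
            break
        # Stop at the closing marker without processing it
        if getattr(nextNode, 'name', None) == 'p' and nextNode.get('class', None) == 'end':
            break
        # process nextNode here (a tag or a string between the two markers)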
The condition looks complicated because it has to make sure you are accessing attributes of an HTML tag and not of a string node.
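For readers on the newer bs4 package, the same sibling walk can be sketched roughly as follows. This is only an illustration under some assumptions: bs4 is installed, the built-in 'html.parser' is used, and the helper name siblings_between and the sample markup are made up for this example. Note that in bs4 the class attribute comes back as a list, so the end test checks membership instead of equality.

```python
from bs4 import BeautifulSoup, Tag


def siblings_between(start, is_end):
    """Collect the siblings after `start`, stopping before the first tag
    for which `is_end(tag)` is true. (Hypothetical helper, not part of bs4.)"""
    collected = []
    for node in start.next_siblings:
        # Only Tag objects have attributes; string nodes are kept as-is.
        if isinstance(node, Tag) and is_end(node):
            break
        collected.append(node)
    return collected


# Sample markup modelled on the question (table contents are invented).
html = (
    '<div>'
    '<p class="top">I don\'t want this</p>'
    '<p>I want this</p>'
    '<table><tr><td>cell</td></tr></table>'
    '<p> and all that stuff too</p>'
    '<p class="end">But not this and nothing after it</p>'
    '<p>nor this</p>'
    '</div>'
)

soup = BeautifulSoup(html, 'html.parser')
top = soup.find('p', class_='top')

# In bs4, tag.get('class') is a list of class names, hence the `in` test.
between = siblings_between(
    top,
    lambda tag: tag.name == 'p' and 'end' in tag.get('class', []),
)
names = [node.name for node in between if isinstance(node, Tag)]
print(names)
```

If no p[class=end] sibling exists, the loop simply runs to the last sibling and returns everything after the start node.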