BeautifulSoup: find th with text 'price', then get price from next th
My HTML looks like:
<td>
<table ..>
<tr>
<th ..>price</th>
<th>$99.99</th>
</tr>
</table>
</td>
So I am in the current table cell, how would I get the 99.99 value?
I have so far:
td[3].findChild('th')
But I need to do:
Find th with text 'price', then get next th tag's string value.
Think about it in "steps"... given that some x is the root of the subtree you're considering,
x.findAll(text='price')
is the list of all text nodes in that subtree whose text is exactly 'price'. The parents of those items then of course will be:
[t.parent for t in x.findAll(text='price')]
and if you only want to keep those whose "name" (tag) is 'th', then of course
[t.parent for t in x.findAll(text='price') if t.parent.name=='th']
and you want the "next siblings" of those (but only if they're also 'th's), so
[t.parent.nextSibling for t in x.findAll(text='price')
if t.parent.name=='th' and t.parent.nextSibling and t.parent.nextSibling.name=='th']
Here you see the problem with using a list comprehension: too much repetition, since we can't assign intermediate results to simple names. Let's therefore switch to a good old loop...:
Edit: added tolerance for a string of text between the parent th and the "next sibling", as well as tolerance for the latter being a td instead, per OP's comment.
for t in x.findAll(text='price'):
    p = t.parent
    if p.name != 'th': continue
    ns = p.nextSibling
    # skip a bare text node (e.g. whitespace) between the two cells
    if ns and not ns.name: ns = ns.nextSibling
    if not ns or ns.name not in ('td', 'th'): continue
    print(ns.string)
I've added ns.string, which will give the next sibling's contents if and only if they're just text (no further nested tags); of course you can instead analyze further at this point, depending on your application's needs!-). Similarly, I imagine you won't be doing just print but something smarter, but I'm giving you the structure.
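As one example of "something smarter", the sibling's text can be turned into a number. This is only a minimal sketch, assuming the text looks like "$99.99" (an optional currency symbol followed by a plain decimal number); parse_price is a hypothetical helper name, not part of BeautifulSoup:

```python
import re
from decimal import Decimal

def parse_price(text):
    """Return the first decimal number found in text as a Decimal, or None.

    Assumption: prices look like "$99.99" or "99". Thousands separators
    and locale-specific formats are not handled.
    """
    if text is None:  # ns.string is None when the cell holds nested tags
        return None
    match = re.search(r"\d+(?:\.\d+)?", text)
    return Decimal(match.group()) if match else None
```

Inside the loop you would then call parse_price(ns.string) instead of printing. Decimal is used rather than float so that money values are not subject to binary rounding.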
Talking about the structure, notice that twice I use if...: continue: this reduces nesting compared to the alternative of inverting the if's condition and indenting all the following statements in the loop, and "flat is better than nested" is one of the koans in the Zen of Python (import this at an interactive prompt to see them all and meditate;-).
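For readers on the newer bs4 package, the same steps might look like the sketch below. It rests on a few assumptions: BeautifulSoup 4 is installed (pip install beautifulsoup4), the built-in "html.parser" is good enough for the markup, and next_cell_texts is a hypothetical helper name. It also swaps the manual nextSibling hop for find_next_sibling, which skips over text nodes between cells:

```python
from bs4 import BeautifulSoup

def next_cell_texts(root, label="price"):
    """Yield the text of the th/td cell that follows each <th> whose text is label."""
    for t in root.find_all(string=label):    # text nodes exactly equal to label
        p = t.parent
        if p.name != "th":
            continue
        # next sibling *tag* that is a td or th; whitespace text nodes are skipped
        ns = p.find_next_sibling(["td", "th"])
        if ns is None:
            continue
        yield ns.get_text(strip=True)

# Sample markup modeled on the question (attributes are made up for illustration)
html = """
<td>
  <table>
    <tr>
      <th class="label">price</th>
      <th>$99.99</th>
    </tr>
  </table>
</td>
"""
soup = BeautifulSoup(html, "html.parser")
prices = list(next_cell_texts(soup))
print(prices)
```

Writing it as a generator keeps the "flat" loop structure while letting the caller decide what to do with each value.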
With pyparsing, it's easy to reach into the middle of some HTML for a tag pattern like this:
from pyparsing import makeHTMLTags, Combine, Word, nums
th,thEnd = makeHTMLTags("TH")
floatnum = Combine(Word(nums) + "." + Word(nums))
priceEntry = (th + "price" + thEnd +
              th + "$" + floatnum("price") + thEnd)
# html is assumed to hold the page source as a string
tokens, startloc, endloc = next(priceEntry.scanString(html))
print(tokens.price)
Pyparsing's makeHTMLTags helper returns a pair of pyparsing expressions, one for the start tag and one for the end tag. The start tag pattern does much more than add "<>"s around the given string: it also allows for extra whitespace, variable case, and the presence or absence of tag attributes. For instance, even though I specified "TH" as the table header tag, it will also match "th", "Th", "tH" and "TH". Pyparsing's default whitespace-skipping behavior also handles extra spaces between the tag and "$", between "$" and the numeric price, etc., without having to sprinkle "zero or more whitespace chars could go here" indicators. Lastly, assigning the results name "price" (following floatnum in the definition of priceEntry) makes it very simple to access that specific value from the full list of tokens matching the overall priceEntry expression.
(Combine is used for 2 purposes: it disallows whitespace between the components of the number, and it returns a single combined token "99.99" instead of the list ["99", ".", "99"].)
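The two effects of Combine can be checked in isolation. The following is a small sketch, assuming pyparsing is installed; the names loose and tight are made up for illustration, and the camelCase method names match the ones used above:

```python
from pyparsing import Combine, ParseException, Word, nums

loose = Word(nums) + "." + Word(nums)            # three separate tokens
tight = Combine(Word(nums) + "." + Word(nums))   # one token, no inner whitespace

loose_tokens = loose.parseString("99.99").asList()
tight_tokens = tight.parseString("99.99").asList()

# Without Combine, whitespace between the parts is silently skipped
loose_spaced = loose.parseString("99 . 99").asList()

# With Combine, the same input is rejected
try:
    tight.parseString("99 . 99")
    tight_accepts_spaces = True
except ParseException:
    tight_accepts_spaces = False

print(loose_tokens, tight_tokens, loose_spaced, tight_accepts_spaces)
```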