beautifulsoup, Find th with text 'price', then get price from next th

Posted 2023-01-09 11:25

My html looks like:

<td>
   <table ..>
      <tr>
         <th ..>price</th>
         <th>$99.99</th>
      </tr>
   </table>
</td>

So I am in the current table cell, how would I get the 99.99 value?

I have so far:

td[3].findChild('th')

But I need to do:

Find th with text 'price', then get next th tag's string value.

Think about it in "steps"... given that some x is the root of the subtree you're considering,

x.findAll(text='price')

is the list of all items in that subtree containing text 'price'. The parents of those items then of course will be:

[t.parent for t in x.findAll(text='price')]

and if you only want to keep those whose "name" (tag) is 'th', then of course

[t.parent for t in x.findAll(text='price') if t.parent.name=='th']

and you want the "next siblings" of those (but only if they're also 'th's), so

[t.parent.nextSibling for t in x.findAll(text='price')
 if t.parent.name=='th' and t.parent.nextSibling and t.parent.nextSibling.name=='th']

Here you see the problem with using a list comprehension: too much repetition, since we can't assign intermediate results to simple names. Let's therefore switch to a good old loop...:

Edit: added tolerance for a string of text between the parent th and the "next sibling" as well as tolerance for the latter being a td instead, per OP's comment.

for t in x.findAll(text='price'):
  p = t.parent
  if p.name != 'th': continue
  ns = p.nextSibling
  if ns and not ns.name: ns = ns.nextSibling
  if not ns or ns.name not in ('td', 'th'): continue
  print(ns.string)

I've added ns.string, which gives the next sibling's contents if and only if they're just text (no further nested tags) -- of course you can instead analyze further at this point, depending on your application's needs!-). Similarly, I imagine you won't be doing just print but something smarter, but I'm giving you the structure.

Talking about the structure, notice that twice I use if...: continue: this reduces nesting compared to the alternative of inverting the if's condition and indenting all the following statements in the loop -- and "flat is better than nested" is one of the koans in the Zen of Python (import this at an interactive prompt to see them all and meditate;-).
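For anyone on the current bs4 package, here is a minimal end-to-end sketch of the loop above. It assumes bs4 is installed and uses its newer spellings (find_all, string=, next_sibling). The sample markup, the class="label" attribute, and the function name values_after_label are made up for illustration. It also shows a shorter variant built on find_next_sibling, which is a different technique: it skips any number of intervening siblings rather than at most one text node.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample markup modelled on the question's snippet.
HTML = """
<td>
  <table>
    <tr>
      <th class="label">price</th>
      <th>$99.99</th>
    </tr>
  </table>
</td>
"""


def values_after_label(root, label="price"):
    """Collect the text of the th/td cell that follows each <th>label</th>."""
    found = []
    for t in root.find_all(string=label):
        p = t.parent
        if not isinstance(p, Tag) or p.name != "th":
            continue
        ns = p.next_sibling
        # Tolerate one text node (typically whitespace) between the two cells.
        if ns is not None and not isinstance(ns, Tag):
            ns = ns.next_sibling
        if not isinstance(ns, Tag) or ns.name not in ("td", "th"):
            continue
        # .string is None when the cell contains nested tags.
        found.append(ns.string)
    return found


soup = BeautifulSoup(HTML, "html.parser")
prices = values_after_label(soup)
print(prices)

# Alternative: let bs4 walk the siblings. This skips ANY intervening
# siblings, so it is looser than the loop above.
alt_cells = [th.find_next_sibling(["th", "td"])
             for th in soup.find_all("th", string="price")]
alt_prices = [cell.string for cell in alt_cells if cell is not None]
print(alt_prices)
```

Returning a list instead of printing keeps the "something smarter" step (stripping the "$", converting to a number) outside the tree-walking code.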

With pyparsing, it's easy to reach into the middle of some HTML for a tag pattern like this:

from pyparsing import makeHTMLTags, Combine, Word, nums

th,thEnd = makeHTMLTags("TH")
floatnum = Combine(Word(nums) + "." + Word(nums))
priceEntry = (th + "price" + thEnd + 
              th + "$" + floatnum("price") + thEnd)

# scanString yields (tokens, start, end) for each match; take the first one
tokens,startloc,endloc = next(priceEntry.scanString(html))

print(tokens.price)

Pyparsing's makeHTMLTags helper returns a pair of pyparsing expressions, one for the start tag and one for the end tag. The start tag pattern does much more than wrap the given string in "<>": it also allows for extra whitespace, variable case, and the presence or absence of tag attributes. For instance, note that even though I specified "TH" as the table header tag, it will also match "th", "Th", "tH" and "TH". Pyparsing's default whitespace-skipping behavior will also handle extra spaces between tag and "$", between "$" and numeric price, etc., without having to sprinkle "zero or more whitespace chars could go here" indicators. Lastly, assigning the results name "price" (following floatnum in the definition of priceEntry) makes it very simple to access that specific value from the full list of tokens matching the overall priceEntry expression.

(Combine is used for 2 purposes: it disallows whitespace between the components of the number; and returns a single combined token "99.99" instead of the list ["99", ".", "99"].)
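To see the case and whitespace tolerance described above in action, here is a small self-contained sketch. It assumes pyparsing is installed and still exposes the camelCase names used in this answer. The messy sample markup is invented for the demonstration.

```python
from pyparsing import makeHTMLTags, Combine, Word, nums

# Hypothetical input: mixed-case tags, an attribute, and stray whitespace.
html = """
<td>
  <table>
    <tr>
      <TH class="label"> price </Th>
      <th> $ 99.99 </th>
    </tr>
  </table>
</td>
"""

th, thEnd = makeHTMLTags("TH")
# Combine forbids whitespace inside the number and yields one token.
floatnum = Combine(Word(nums) + "." + Word(nums))
priceEntry = (th + "price" + thEnd +
              th + "$" + floatnum("price") + thEnd)

# scanString yields (tokens, start, end) for each match; take the first one.
tokens, startloc, endloc = next(priceEntry.scanString(html))
price = float(tokens.price)
print(price)
```

If the page may contain no such entry, next(..., None) with a check for None avoids a StopIteration.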
