How do I extract text from a web page with <br /> tags using Hpricot?

Posted 2023-01-29 18:51

I'm trying to parse an HTML file using Hpricot and Ruby, but I'm having trouble extracting "free floating" text that is not enclosed in any tags.

require 'hpricot'

text = <<SOME_TEXT
  <a href="http://www.somelink.com/foo/bar.html">Testing:</a><br />
  line 1<br />  
  line 2<br />
  line 3<br />
  line 4<br />
  line 5<br />
  <b>Here's some more text</b>
SOME_TEXT

parsed = Hpricot(text)

parsed = parsed.search('//a[@href="http://www.somelink.com/foo/bar.html"]').first.following_siblings
puts parsed

I would expect the result to be

<br />
line 1<br />  
line 2<br />
line 3<br />
line 4<br />
line 5<br />
<b>Here's some more text</b>

But I am getting

<br />
<br />
<br />
<br />
<br />
<br />
<b>Here's some more text</b>

How can I make Hpricot return line 1, line 2, etc?

Your first step is to read the following_siblings documentation:

Find sibling elements which follow the current one. Like the other “sibling” methods, this weeds out text and comment nodes.

Then you should use the Hpricot source to generalize how following_siblings works to get something that works like following_siblings but doesn't filter out non-container nodes:

parsed        = Hpricot(text)
link          = parsed.search('//a[@href="http://www.somelink.com/foo/bar.html"]').first
link_sibs     = link.parent.children
what_you_want = link_sibs[link_sibs.index(link) + 1 ... link_sibs.length]

puts what_you_want

That's pretty much following_siblings with parent.children instead of parent.containers. Having access to the source code of the libraries you use is pretty handy and studying it is to be encouraged.
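As a rough sketch of that idea, the slice can be wrapped in a small helper. The names following_nodes and StubNode are made up here for illustration; neither is part of Hpricot. The stub only stands in for parsed nodes so the example runs without the gem installed, and the helper assumes nothing beyond the parent and children calls used above.

```ruby
# Hypothetical helper, not part of Hpricot: returns every node that comes
# after `node` under the same parent, text and comment nodes included.
# It relies only on `parent` and `children`.
def following_nodes(node)
  siblings = node.parent.children
  position = siblings.index(node)
  return [] if position.nil? # node is not among its parent's children
  siblings[(position + 1)..-1]
end

# Minimal stand-in for a parsed node so the sketch runs without the gem.
# StubNode is an illustration only; real Hpricot nodes would be passed instead.
class StubNode
  attr_reader :label, :children
  attr_accessor :parent

  def initialize(label, children = [])
    @label = label
    @children = children
    children.each { |child| child.parent = self }
  end
end

link = StubNode.new("a")
root = StubNode.new("root", [link, StubNode.new("br"), StubNode.new("line 1"), StubNode.new("br")])

puts following_nodes(link).map(&:label).inspect
# prints ["br", "line 1", "br"]
```

With a real document, following_nodes(link) should give the same nodes as what_you_want.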

It's been a while since I've used Hpricot, but here are some things I remember that might help:

The quick way to get all the text:

irb(main):023:0> print parsed.inner_text
  Testing:
  line 1  
  line 2
  line 3
  line 4
  line 5
  Here's some more text

The downside is that you also get the text embedded in tags, such as the link text and the bold line.

Similarly, we can search for all 'text()' nodes:

irb(main):033:0> puts (parsed / 'text()')

Testing:

  line 1

  [...]

  line 5

So, we can do this:

irb(main):036:0> puts (parsed / 'text()')[2 .. -3]

  line 1

  line 2

  line 3

  line 4

  line 5

or:

irb(main):037:0> (parsed / 'text()')[2 .. -3]
=> #<Hpricot::Elements["\n  line 1", "  \n  line 2", "\n  line 3", "\n  line 4", "\n  line 5", "\n  "]>

or:

irb(main):039:0> (parsed / 'text()')[2 .. -3].map{ |t| t.inner_text.strip }
=> ["line 1", "line 2", "line 3", "line 4", "line 5", ""]
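The trailing "" in that last result comes from the whitespace-only text node between the final <br /> and the <b> tag. A minimal sketch of dropping such entries, using plain strings copied from the Hpricot::Elements output above as stand-ins for the text nodes, so it runs without Hpricot:

```ruby
# Raw text-node contents as shown in the Hpricot::Elements output above.
raw = ["\n  line 1", "  \n  line 2", "\n  line 3", "\n  line 4", "\n  line 5", "\n  "]

# Strip surrounding whitespace, then discard entries that were whitespace only.
lines = raw.map(&:strip).reject(&:empty?)

puts lines.inspect
# prints ["line 1", "line 2", "line 3", "line 4", "line 5"]
```

With the real nodes, the same reject(&:empty?) step could presumably be chained after the map { |t| t.inner_text.strip } call.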

The main idea for grabbing data or text from a web page is to look for landmarks you can use to navigate through the page. Often we can grab text from inside a <div> or another container tag. If a page doesn't give you landmarks, you have to use other tricks: looking for a series of text nodes followed by <br /> nodes, maybe, or the five lines following an <a> tag with a certain href attribute. That's the fun and challenge of dealing with HTML.

In the back of my mind there's a nagging thought that there is a more elegant way to do this, but this seems to be working. Dig around on the Hpricot Challenge page for variations on themes on digging out content.
