Searching Hpricot with Regex

I'm trying to use Hpricot to get the value within a span with a class name I don't know. I know that it follows the pattern "foo_[several digits]_bar".

Right now, I'm getting the entire containing element as a string and using a regex to parse the string for the tag. That solution works, but it seems really ugly.

doc = Hpricot(open("http://scrape.example.com/search?q=#{ticker_symbol}"))
elements = doc.search("//span[@class='pr']").inner_html
string = ""
elements.each do |attr|
  if(attr =~ /foo_\d+_bar/)
    string = attr
  end
end
# get rid of the span tags, just get the value
string.sub!(/<\/span>/, "")
string.sub!(/<span.+>/, "")

return string

It seems like there should be a better way to do that. I'd like to do something like:

elements = doc.search("//span[@class='" + /foo_\d+_bar/ + "']").inner_html

But that doesn't run. Is there a way to search with a regular expression?


This should do:

doc.search("span[@class^='foo'][@class$='bar']")
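Note that the selector above only checks the prefix and suffix, so it would also accept a class such as foo_abc_bar. If the digits matter, one option is to select all spans and filter them in Ruby with the regex from the question. The following is only a sketch: the helper name pattern_class? is made up for illustration, and the Hpricot usage in the trailing comment assumes elements expose attributes via [].

```ruby
# Pattern from the question: "foo_", one or more digits, "_bar".
CLASS_PATTERN = /\Afoo_\d+_bar\z/

# Hypothetical helper: true if any whitespace-separated class name
# in the attribute value matches the pattern. Handles nil safely.
def pattern_class?(class_attr)
  class_attr.to_s.split.any? { |name| name =~ CLASS_PATTERN }
end

# With Hpricot loaded, this could be used roughly as follows
# (assumption: Hpricot elements return attribute values via []):
#   spans  = doc.search("span").select { |el| pattern_class?(el["class"]) }
#   values = spans.map { |el| el.inner_html }
```

This keeps the regex in one place and avoids stripping the span tags by hand.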

In addition, here are some more examples of how similar expressions work.

For a document containing the following element:

<meta content="abcxy def ghi jklmn">

we get the following output for each query:

doc.search("//meta[@content='abcxy def ghi jklmn']")
=> #<Hpricot::Elements[{emptyelem <meta content="abcxy def ghi jklmn">}]>

This is what we would expect.

doc.search("//meta[@content='def']")
=> #<Hpricot::Elements[]>

As you can see, = looks for an exact match.

doc.search("//meta[@content~='def']")
=> #<Hpricot::Elements[{emptyelem <meta content="abcxy def ghi jklmn">}]>

With ~= we can match a single whitespace-separated word inside the value, which is not quite the substring match you might expect. For instance:

doc.search("//meta[@content~=' def ']")
=> #<Hpricot::Elements[]>

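It seems that spaces are treated specially: this is consistent with ~= comparing against whole whitespace-separated words, so an argument that itself contains spaces cannot match.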

With *= we can get around this problem and do true substring matching.

doc.search("//meta[@content*=' def ']")
=> #<Hpricot::Elements[{emptyelem <meta content="abcxy def ghi jklmn">}]>

We can also do string begin and string end matching as follows:

doc.search("//meta[@content^='def']")
=> #<Hpricot::Elements[]>

doc.search("//meta[@content^='ab']")
=> #<Hpricot::Elements[{emptyelem <meta content="abcxy def ghi jklmn">}]>

doc.search("//meta[@content$='mn']")
=> #<Hpricot::Elements[{emptyelem <meta content="abcxy def ghi jklmn">}]>

Note that space characters are not a problem for these operators.

doc.search("//meta[@content$=' jklmn']")
=> #<Hpricot::Elements[{emptyelem <meta content="abcxy def ghi jklmn">}]>
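The behaviour of the five operators shown above can be summarised with plain Ruby string methods. This is only an illustration that agrees with the outputs in this answer; it is not Hpricot's implementation, and the MATCHERS table is a name made up for the example.

```ruby
# Plain-Ruby equivalents of the attribute operators, for illustration only.
MATCHERS = {
  "="  => lambda { |value, arg| value == arg },                      # exact match
  "~=" => lambda { |value, arg| value.split(/\s+/).include?(arg) },  # whole word
  "*=" => lambda { |value, arg| value.include?(arg) },               # substring
  "^=" => lambda { |value, arg| value.start_with?(arg) },            # prefix
  "$=" => lambda { |value, arg| value.end_with?(arg) }               # suffix
}

# The attribute value used in the examples above.
content = "abcxy def ghi jklmn"

[["=", "def"], ["~=", "def"], ["~=", " def "],
 ["*=", " def "], ["^=", "ab"], ["$=", " jklmn"]].each do |op, arg|
  puts "#{op} #{arg.inspect} => #{MATCHERS[op].call(content, arg)}"
end
```

Each true result corresponds to a query above that returned the meta element, and each false result to one that returned an empty set.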



One could modify the incoming HTML before parsing.

html = open("http://scrape.example.com/search?q=#{ticker_symbol}").read
html.gsub!(/class="(foo_\d+_bar)"/){ |s| "class=\"foo_bar #{$1}\"" }
doc = Hpricot(html)

After that you can identify the elements using the foo_bar class. This is far from elegant or general but could prove to be more efficient.
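To see what the substitution does, here it is applied to a small inline string. The sample markup is hypothetical, since the real page's HTML is not shown in this thread.

```ruby
# Hypothetical sample markup standing in for the fetched page.
html = '<span class="foo_12345_bar">42.10</span><span class="pr">x</span>'

# Prepend a fixed marker class (foo_bar) while keeping the original class name.
tagged = html.gsub(/class="(foo_\d+_bar)"/) { "class=\"foo_bar #{$1}\"" }

puts tagged
# prints: <span class="foo_bar foo_12345_bar">42.10</span><span class="pr">x</span>
```

After this rewrite, a word-matching selector such as span[@class~='foo_bar'] should pick out the tagged spans, while spans with other classes are left untouched.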
