Searching Hpricot with Regex
I'm trying to use Hpricot to get the value within a span with a class name I don't know. I know that it follows the pattern "foo_[several digits]_bar".
Right now, I'm getting the entire containing element as a string and using a regex to parse the string for the tag. That solution works, but it seems really ugly.
doc = Hpricot(open("http://scrape.example.com/search?q=#{ticker_symbol}"))
elements = doc.search("//span[@class='pr']").inner_html
string = ""
elements.each do |attr|
if(attr =~ /foo_\d+_bar/)
string = attr
end
end
# get rid of the span tags, just get the value
string.sub!(/<\/span>/, "")
string.sub!(/<span.+>/, "")
return string
It seems like there should be a better way to do that. I'd like to do something like:
elements = doc.search("//span[@class='" + /foo_\d+_bar/ + "']").inner_html
But that doesn't run. Is there a way to search with a regular expression?
This should do:
doc.search("span[@class^='foo'][@class$='bar']")
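Note that the prefix/suffix pair is looser than the regex in the question: it accepts any class that starts with "foo" and ends with "bar". A minimal plain-Ruby sketch of the difference follows, assuming a hypothetical list of class names in place of real scraped spans; with Hpricot, the same regex could presumably be applied to each matched element's class attribute as a second pass.

```ruby
# Hypothetical class names standing in for the class attributes of scraped spans.
class_names = ["foo_123_bar", "foo_bar", "foo_12x_bar", "pr", "foo_4567_bar"]

# What the two attribute selectors check: a prefix and a suffix only.
loose = class_names.select { |c| c.start_with?("foo") && c.end_with?("bar") }

# The stricter pattern from the question, anchored to the whole string.
strict = loose.grep(/\Afoo_\d+_bar\z/)

p loose
p strict
```

If the page contains no other classes of the form foo...bar, the selector alone is enough and the second pass can be skipped.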
In addition, here are some examples of how the other, similar attribute operators work.
For a document containing a meta element whose content attribute is "abcxy def ghi jklmn",
we get the following output for each query:
doc.search("//meta[@content='abcxy def ghi jklmn']")
=> #<Hpricot::Elements[{emptyelem <meta content="abcxy def ghi jklmn">}]>
This is what we would expect.
doc.search("//meta[@content='def']")
=> #<Hpricot::Elements[]>
As you can see, = looks for an exact match.
doc.search("//meta[@content~='def']")
=> #<Hpricot::Elements[{emptyelem <meta content="abcxy def ghi jklmn">}]>
With ~= we get something close to substring matching, but not quite what you might expect.
For instance, see the following:
doc.search("//meta[@content~=' def ']")
=> #<Hpricot::Elements[]>
It seems that ~= matches whole whitespace-separated words, so a value containing spaces never matches.
With *= we can get around this problem and do true substring matching:
doc.search("//meta[@content*=' def ']")
=> #<Hpricot::Elements[{emptyelem <meta content="abcxy def ghi jklmn">}]>
We can also match the beginning and the end of the string, as follows:
doc.search("//meta[@content^='def']")
=> #<Hpricot::Elements[]>
doc.search("//meta[@content^='ab']")
=> #<Hpricot::Elements[{emptyelem <meta content="abcxy def ghi jklmn">}]>
doc.search("//meta[@content$='mn']")
=> #<Hpricot::Elements[{emptyelem <meta content="abcxy def ghi jklmn">}]>
Note that space characters are not a problem for these operators.
doc.search("//meta[@content$=' jklmn']")
=> #<Hpricot::Elements[{emptyelem <meta content="abcxy def ghi jklmn">}]>
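The behaviour shown in the outputs above can be summarised in plain Ruby. This is only a sketch of the matching rules as they appear from those results, not Hpricot's actual implementation; attr_match? is a hypothetical helper introduced for illustration.

```ruby
# Plain-Ruby model of the attribute operators, inferred from the outputs above.
# attr_match? is a hypothetical helper, not part of Hpricot.
def attr_match?(value, op, arg)
  case op
  when "="  then value == arg                      # exact match
  when "~=" then value.split(/\s+/).include?(arg)  # whole whitespace-separated word
  when "*=" then value.include?(arg)               # substring
  when "^=" then value.start_with?(arg)            # prefix
  when "$=" then value.end_with?(arg)              # suffix
  else raise ArgumentError, "unknown operator: #{op}"
  end
end

content = "abcxy def ghi jklmn"

# Replay the queries from the examples and print whether each would match.
[["=", "def"], ["~=", "def"], ["~=", " def "], ["*=", " def "],
 ["^=", "ab"], ["$=", " jklmn"]].each do |op, arg|
  puts "#{op} #{arg.inspect}: #{attr_match?(content, op, arg)}"
end
```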
Alternatively, one could modify the incoming HTML before parsing:
# .read works whether open-uri returns a StringIO or a Tempfile; .string exists only on StringIO
html = open("http://scrape.example.com/search?q=#{ticker_symbol}").read
html.gsub!(/class="(foo_\d+_bar)"/) { "class=\"foo_bar #{$1}\"" }
doc = Hpricot(html)
After that you can identify the elements using the foo_bar class. This is far from elegant or general, but it could prove to be more efficient.
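To see what the substitution does without fetching anything, here is a small sketch run against a hypothetical HTML fragment (the markup and values are made up for illustration):

```ruby
# Hypothetical HTML fragment standing in for the downloaded page.
html = '<span class="foo_123_bar">42.10</span><span class="pr">x</span>'

# Prepend a fixed marker class so the element can be found with a simple selector.
# The original numbered class is kept as the second class name.
tagged = html.gsub(/class="(foo_\d+_bar)"/) { "class=\"foo_bar #{$1}\"" }

puts tagged
```

Since the marker is added as a separate whitespace-delimited word, a word-matching selector such as [@class~='foo_bar'] should then pick out the rewritten spans. The regex only handles double-quoted class attributes holding exactly one class name, which is part of why this approach is not general.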