
Hpricot throws an exception when parsing a URL whose page has a noscript tag

I use the hpricot gem in Ruby on Rails to parse a webpage and extract the meta tag contents. But if the website has a <noscript> tag just after the <head> tag, it throws an exception:

Exception: undefined method `[]' for nil:NilClass

I even tried updating the gem to the latest version, but the result is still the same.

This is the sample code I use:

require 'rubygems'
require 'hpricot'
require 'open-uri'
begin
       index_page = Hpricot(open("http://sample.com"))
       puts index_page.at("/html/head/meta[@name='verification']")['content'].gsub(/\s/, "")
rescue Exception => e
       puts "Exception: #{e}"
end

I was thinking of removing the noscript tag before giving the webpage to Hpricot. Or is there any other way to do it?

My HTML snippet:

<html> 
<head> 
<noscript> 
<meta http-equiv="refresh" content="0; url=http://www.yoursite.com/noscripts.html"/> 
</noscript> 
<meta name="verification" content="7ff5e90iormq5niy6x98j75-o1yqwcds-c1b1pjpdxt3ngypzdg7p80d6l6xnz5v3buldmmjcd4hsoyagyh4w95-ushorff60-f2e9bzgwuzg4qarx4z8xkmefbe-0-f" /> 
</head> 
<body> 
<h1>Testing</h1> 
</body> 
</html>


I can't duplicate the exception with Hpricot. However, I do see problems with how you are trying to find the meta tag.

I shortened the HTML sample so that my sample code fits into the answer box here, then saved the HTML locally so I could use open-uri to get at it.

<html> 
<head> 
<noscript> 
<meta http-equiv="refresh" /> 
</noscript> 
<meta name="norton-safeweb-site-verification" /> 
</head> 
<body> 
<h1>Testing</h1> 
</body> 
</html>

Contemplate the results of the searches below:

#!/usr/bin/env ruby

require 'rubygems'
require 'hpricot'
require 'open-uri'

doc = Hpricot(open('http://localhost:3000/test.html'))

(doc / 'meta').size # => 2
(doc / 'meta')[1] # => {emptyelem <meta name="norton-safeweb-site-verification">}

(doc % 'meta[@name]') # => {emptyelem <meta name="norton-safeweb-site-verification">}

(doc % 'meta[@name="verification"]') # => nil
(doc % 'meta[@name*="verification"]') # => {emptyelem <meta name="norton-safeweb-site-verification">}

(doc % 'meta[@name="norton-safeweb-site-verification"]') # => {emptyelem <meta name="norton-safeweb-site-verification">}

Remember that '/' in Hpricot means .search() or "find all occurrences" and '%' means .at() or "find the first occurrence". Using a long path to get to the desired element is often less likely to find what you want. Look for unique things in the element or its siblings or parents. A long accessor breaks more easily because the preceding layout of the page is considered when searching; if something in the page changes, the accessor becomes invalid, so search atomically or in the smallest group of elements you can. Also, the Hpricot docs recommend using CSS accessors, so I'm using those in the example code.
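
To illustrate why a long accessor is fragile, here is a minimal sketch. It uses REXML from Ruby's standard library as a stand-in for Hpricot so that it runs without installing any gems; that substitution, the two sample documents, and the find_meta helper are assumptions made for the illustration, not part of the original code. The same idea should carry over to Hpricot searches.

```ruby
require 'rexml/document'

# Two hypothetical versions of the same page. In the second one the layout
# changed: the meta tag is now wrapped in an extra element.
ORIGINAL = '<html><head><meta name="verification" content="abc"/></head></html>'
CHANGED  = '<html><head><noscript><meta name="verification" content="abc"/></noscript></head></html>'

LONG_PATH  = "/html/head/meta[@name='verification']" # depends on the exact page layout
SHORT_PATH = "//meta[@name='verification']"          # depends only on the element itself

# Hypothetical helper: returns the first matching element, or nil if none matches.
def find_meta(html, path)
  REXML::XPath.first(REXML::Document.new(html), path)
end

long_original = find_meta(ORIGINAL, LONG_PATH)  # found: the layout matches the path
long_changed  = find_meta(CHANGED, LONG_PATH)   # nil: the layout no longer matches
short_changed = find_meta(CHANGED, SHORT_PATH)  # found: the layout does not matter
```

The long path only matches while the meta tag is a direct child of head, whereas the short search keeps working after the surrounding markup changes.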

Searching for any 'meta' tag found two occurrences. So far so good. Grabbing the second one was one way of getting at what you want.

Searching for "meta with a name attribute" found the target.

Searching for "meta with a name attribute equal to 'verification'" fails, because there isn't one. Searching inside the attribute value using "*=" works.

Searching for "meta with a name attribute equal to 'norton-safeweb-site-verification'" succeeds, because that is the full attribute value.
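
A search that finds nothing returns nil, and calling ['content'] on nil raises the "undefined method `[]' for nil:NilClass" error from the question, so it is worth checking the result before reading an attribute. The sketch below shows that guard together with the exact versus substring match. It again uses REXML from the standard library as a stand-in for Hpricot so that it runs without gems; the meta_content helper and the content value "abc 123" are made up for the illustration.

```ruby
require 'rexml/document'

# Shortened sample page; the content value is a placeholder, not real data.
HTML = <<~PAGE
  <html>
  <head>
  <noscript>
  <meta http-equiv="refresh" />
  </noscript>
  <meta name="norton-safeweb-site-verification" content="abc 123" />
  </head>
  </html>
PAGE

# Hypothetical helper: returns the content attribute of the first match with
# whitespace removed, or nil when the element or the attribute is missing.
def meta_content(doc, xpath)
  node = REXML::XPath.first(doc, xpath)
  return nil if node.nil?            # nothing matched, so do not index into nil

  value = node.attributes['content']
  value && value.gsub(/\s/, '')
end

doc = REXML::Document.new(HTML)

# Exact match on the name attribute: no such element, so the result is nil.
exact = meta_content(doc, "//meta[@name='verification']")

# Substring match, the XPath counterpart of the CSS "*=" selector.
partial = meta_content(doc, "//meta[contains(@name, 'verification')]")
```

With Hpricot the same guard would be applied to the value returned by .at() before calling ['content'] on it.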

Hpricot has a pretty good set of CSS selectors:

http://wiki.github.com/whymirror/hpricot/supported-css-selectors

Now, all that said, I recommend using Nokogiri over Hpricot. I have found cases where Hpricot silently failed but Nokogiri successfully parsed malformed XML and HTML.
