Hpricot throws an exception when parsing a URL whose page has a noscript tag

2022-12-26 17:27

I use the hpricot gem in Ruby on Rails to parse a webpage and extract the meta tag contents. But if the website has a <noscript> tag just after the <head> tag, it throws an exception:

Exception: undefined method `[]' for nil:NilClass

I even tried updating the gem to the latest version, but the result is still the same.

This is the sample code I use:

require 'rubygems'
require 'hpricot'
require 'open-uri'
begin
  index_page = Hpricot(open("http://sample.com"))
  puts index_page.at("/html/head/meta[@name='verification']")['content'].gsub(/\s/, "")
rescue Exception => e
  puts "Exception: #{e}"
end

I was thinking of removing the noscript tag before giving the webpage to Hpricot. Or is there any other way to do it?

My HTML snippet:

<html> 
<head> 
<noscript> 
<meta http-equiv="refresh" content="0; url=http://www.yoursite.com/noscripts.html"/> 
</noscript> 
<meta name="verification" content="7ff5e90iormq5niy6x98j75-o1yqwcds-c1b1pjpdxt3ngypzdg7p80d6l6xnz5v3buldmmjcd4hsoyagyh4w95-ushorff60-f2e9bzgwuzg4qarx4z8xkmefbe-0-f" /> 
</head> 
<body> 
<h1>Testing</h1> 
</body> 
</html>

I can't duplicate the exception with Hpricot. However, I do see problems with how you are trying to find the meta tag.

I shortened the HTML sample so my sample code would fit into the answer box here, then saved the HTML locally so I could use open-uri to get at it.

<html> 
<head> 
<noscript> 
<meta http-equiv="refresh" /> 
</noscript> 
<meta name="norton-safeweb-site-verification" /> 
</head> 
<body> 
<h1>Testing</h1> 
</body> 
</html>

Consider the results of the searches below:

#!/usr/bin/env ruby

require 'rubygems'
require 'hpricot'
require 'open-uri'

doc = Hpricot(open('http://localhost:3000/test.html'))

(doc / 'meta').size # => 2
(doc / 'meta')[1] # => {emptyelem <meta name="norton-safeweb-site-verification">}

(doc % 'meta[@name]') # => {emptyelem <meta name="norton-safeweb-site-verification">}

(doc % 'meta[@name="verification"]') # => nil
(doc % 'meta[@name*="verification"]') # => {emptyelem <meta name="norton-safeweb-site-verification">}

(doc % 'meta[@name="norton-safeweb-site-verification"]') # => {emptyelem <meta name="norton-safeweb-site-verification">}

Remember that '/' in Hpricot means .search() or "find all occurrences" and '%' means .at() or "find the first occurrence". A long path to the desired element is often less likely to find what you want, because it depends on the layout of everything that precedes the element; if something in the page changes, the accessor becomes invalid. Look for something unique in the element itself or in its siblings or parents, and search atomically or within the smallest group of elements you can. Also, the Hpricot docs recommend using CSS accessors, so I'm using those in the example code.

Searching for any 'meta' tag found two elements. So far so good. Taking the second one is one way of getting at what you want.

Searching for "meta with a name attribute" found the target.

Searching for "meta with a name attribute equal to 'verification'" fails, because there isn't one. Matching a substring of the attribute value with "*=" works.

Searching for "meta with a name attribute equal to 'norton-safeweb-site-verification'" succeeds, because that is the full attribute value.
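The exception in the question, undefined method `[]' for nil:NilClass, is what Ruby raises when a lookup such as .at() finds nothing and returns nil, and ['content'] is then called on that nil. Here is a minimal sketch of guarding against that case. The helper name meta_content is hypothetical and not part of Hpricot; it only assumes the document object responds to at and that the returned element responds to [].

```ruby
# Hypothetical helper, not part of Hpricot: returns the whitespace-stripped
# "content" attribute of the first element matching `selector`, or nil when
# no element matches or the attribute is missing.
def meta_content(doc, selector)
  element = doc.at(selector)      # nil when nothing matches
  return nil if element.nil?

  value = element['content']      # nil when the attribute is absent
  return nil if value.nil?

  value.gsub(/\s/, '')
end
```

With a parsed document in doc, a call such as meta_content(doc, 'meta[@name*="verification"]') would then give nil instead of raising when the tag cannot be found.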

Hpricot has a pretty good set of CSS selectors:

http://wiki.github.com/whymirror/hpricot/supported-css-selectors

Now, all that said, I recommend using Nokogiri over Hpricot. I have found cases where Hpricot silently failed but Nokogiri successfully parsed malformed XML and HTML.
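If you still want to remove the <noscript> block before parsing, as the question proposes, the following is a rough sketch using only a regular expression on the raw HTML string. The helper name strip_noscript and the NOSCRIPT_BLOCK constant are made up for illustration, and the sample HTML is a shortened copy of the question's snippet. Regex-based HTML editing is fragile (it will not cope with nested or unclosed tags), so treat this as a workaround rather than a robust solution.

```ruby
# Hypothetical helper, not part of Hpricot: drops every
# <noscript>...</noscript> block from a raw HTML string.
# /m lets "." match newlines, /i ignores tag case, ".*?" is non-greedy.
NOSCRIPT_BLOCK = %r{<noscript\b[^>]*>.*?</noscript\s*>}mi

def strip_noscript(html)
  html.gsub(NOSCRIPT_BLOCK, '')
end

# Shortened version of the HTML from the question.
sample = <<~HTML
  <html>
  <head>
  <noscript>
  <meta http-equiv="refresh" content="0; url=http://www.yoursite.com/noscripts.html"/>
  </noscript>
  <meta name="verification" content="abc" />
  </head>
  <body>
  <h1>Testing</h1>
  </body>
  </html>
HTML

cleaned = strip_noscript(sample)
puts cleaned
```

The cleaned string can then be handed to the parser in place of the original page source.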
