HTML returned by Nokogiri is different from actual HTML source code

I have been successfully screen-scraping certain sites but have come across some very odd behavior with Nokogiri today on a certain site.

If I compare the HTML pulled down by Nokogiri with the actual HTML source code from the site, it is truncated on certain pages. Some pages work just fine and all the data is there; others get cut off partway through.

www.bento.com/revj/0172.html (Doesn't work - truncated HTML returned by Nokogiri)
www.bento.com/revj/0101.html (Works great)

require 'nokogiri'
require 'open-uri'
scraped_jpage = Nokogiri::HTML(URI.open(page_to_scrape))
puts scraped_jpage

I have tried all sorts of different code, changed encoding (UTF-8, SHIFT_JIS etc) but I cannot see any reason whatsoever that Nokogiri truncates the returned HTML.

The English versions of these pages all work perfectly.

www.bento.com/rev/0172.html www.bento.com/rev/0101.html

Thanks for any help - hopefully it's something obvious I have missed and not a bug.


Most likely because that source page has a badly formed HTML structure.

Try printing the parse errors Nokogiri recorded for the document:

puts scraped_jpage.errors