HTML returned by Nokogiri is different from actual HTML source code
I have been successfully screen-scraping certain sites but have come across some very odd behavior with Nokogiri today on a certain site.
If I compare the HTML source pulled down by Nokogiri with the actual HTML source from the site, it is truncated on certain pages. Some pages work just fine and all the data is there; others get truncated.
www.bento.com/revj/0172.html (Doesn't work - truncated HTML returned by Nokogiri)
www.bento.com/revj/0101.html (Works great)
require 'nokogiri'
require 'open-uri'
scraped_jpage = Nokogiri::HTML(URI.open(page_to_scrape))
puts scraped_jpage
I have tried all sorts of different code and changed the encoding (UTF-8, Shift_JIS, etc.), but I cannot see any reason why Nokogiri truncates the returned HTML.
The English versions of these pages all work perfectly.
www.bento.com/rev/0172.html www.bento.com/rev/0101.html
Thanks for any help - hopefully it's something obvious I have missed and not a bug.
This is most likely because the source page has malformed HTML.
Try printing the parse errors:
puts scraped_jpage.errors