Screen scraper that follows redirects and encodes to UTF-8
I'm looking for a gem (or a combination of gems) that can, given a URL, return the page content as UTF-8. It should also follow redirects if the page has moved to a different URL.
Does anyone know of one?
Thanks!
Have you looked at Nokogiri? It seems to do what you are looking for in terms of encoding:
ENCODING:
Strings are always stored as UTF-8 internally. Methods that return text values will always return UTF-8 encoded strings. Methods that return XML (like to_xml, to_html and inner_html) will return a string encoded like the source document.
You can also automate some of your screen scraping with Mechanize (click links, submit forms, etc.). Mechanize builds on Nokogiri, so it's a nice complement to it, and it follows HTTP redirects by default, which covers the other half of your question.
Some webcasts you may want to look at:
- Nokogiri: http://railscasts.com/episodes/190-screen-scraping-with-nokogiri
- Mechanize: http://railscasts.com/episodes/191-mechanize
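If you would rather see the moving parts before reaching for a gem, here is a minimal sketch using only Ruby's standard library (Net::HTTP), not Nokogiri or Mechanize. The method names `fetch_utf8`, `to_utf8` and `charset_from_content_type` are made up for illustration, as is the redirect limit of 5. It trusts the charset in the Content-Type header and assumes UTF-8 when none is given. It does not look at `<meta>` tags in the HTML, so treat it as a starting point rather than a complete solution.

```ruby
require "net/http"
require "uri"

# Pull the charset out of a Content-Type header value, e.g.
# "text/html; charset=ISO-8859-1" gives "ISO-8859-1". Returns nil if absent.
def charset_from_content_type(content_type)
  return nil if content_type.nil?
  match = content_type.match(/charset\s*=\s*"?([^\s;"]+)/i)
  match && match[1]
end

# Re-encode raw bytes as UTF-8, interpreting them as `charset`
# (assumption: fall back to UTF-8 when the charset is missing or unknown).
# Invalid or unmappable bytes are replaced with "?" instead of raising.
def to_utf8(body, charset)
  source = begin
    Encoding.find(charset || "UTF-8")
  rescue ArgumentError
    Encoding::UTF_8
  end
  text = body.dup.force_encoding(source)
  if source == Encoding::UTF_8
    text.scrub("?")
  else
    text.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "?")
  end
end

# Fetch a URL, following up to `limit` redirects (hypothetical default of 5),
# and return the body as a UTF-8 string.
def fetch_utf8(url, limit = 5)
  raise ArgumentError, "too many redirects" if limit < 0

  response = Net::HTTP.get_response(URI.parse(url))
  case response
  when Net::HTTPSuccess
    to_utf8(response.body, charset_from_content_type(response["content-type"]))
  when Net::HTTPRedirection
    # Location may be relative, so resolve it against the current URL.
    fetch_utf8(URI.join(url, response["location"]).to_s, limit - 1)
  else
    response.value # raises an error for non-2xx responses
  end
end
```

Usage would be something like `html = fetch_utf8("http://example.com/")`, after which `html.encoding` is UTF-8 and the string can be handed to a parser of your choice.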