Universal "HTTP GET html page content and recode to UTF-8" procedure
For some time I have been trying to solve a fairly common problem that basically consists of three steps:
- fetch an HTML page from a specified URL and store its content in a String
- detect the content encoding, either from the HTML meta information or from the HTTP header
- recode the content into UTF-8 for further processing
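To make steps two and three concrete, here is a minimal sketch in plain Ruby. The helper names (charset_from_meta, charset_from_header, to_utf8) are made up for illustration and do not come from any gem. It assumes the page body arrives as raw bytes, tries the meta tag first and the HTTP header second, and falls back to a default charset:

```ruby
# Hypothetical helpers; the names are illustrative only.

# Pull a charset label out of a Content-Type header value such as
# "text/html; charset=ISO-8859-2". Returns nil when none is present.
def charset_from_header(content_type)
  content_type.to_s[/charset\s*=\s*["']?([^\s;"']+)/i, 1]
end

# Pull a charset label out of <meta charset=...> or
# <meta http-equiv="Content-Type" content="...; charset=...">.
# Only the first 4 KB are scanned, treated as raw bytes.
def charset_from_meta(html)
  head = html.byteslice(0, 4096).force_encoding(Encoding::BINARY)
  head[/<meta[^>]+charset\s*=\s*["']?([^\s;"'\/>]+)/i, 1]
end

# Recode raw page bytes to UTF-8. Candidates are tried in this order:
# meta tag, HTTP header, default. Unknown labels are skipped.
def to_utf8(body, content_type = nil, default: "UTF-8")
  labels = [charset_from_meta(body), charset_from_header(content_type), default]
  labels.compact.each do |label|
    begin
      encoding = Encoding.find(label)
    rescue ArgumentError
      next # the page declared a charset Ruby does not know
    end
    next if encoding.dummy?
    text = body.dup.force_encoding(encoding)
    next unless text.valid_encoding?
    return text.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
  end
  # Last resort: keep ASCII, replace every other byte with U+FFFD.
  body.dup.force_encoding(Encoding::BINARY)
      .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end
```

The validity check is there because pages sometimes declare one charset and deliver another; when the declared one does not fit the bytes, the next candidate gets a chance.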
In real usage the first step is a little more involved, with features like a "user-agent" instance with a cookie jar, a configurable timeout and number of GET attempts, a configurable limit on the request count per time frame, etc.
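The "number of GET attempts" part can be sketched independently of the HTTP library. The helper below is hypothetical (the name with_attempts and its parameters are my own, not from rest-client or curb); it simply re-runs a block until it succeeds or the attempts run out:

```ruby
# Hypothetical helper: run the block up to `attempts` times,
# sleeping `pause` seconds between tries. Only the error classes
# listed in `on` trigger another attempt; the last error is re-raised.
def with_attempts(attempts: 3, pause: 0.5, on: [StandardError])
  tries = 0
  begin
    tries += 1
    yield tries
  rescue *on
    raise if tries >= attempts
    sleep pause
    retry
  end
end
```

The actual GET call, whatever library performs it, would go inside the block, with the library's own timeout options set on the request.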
I implemented a rest-client wrapper, but I ran into several problems:
- class-global RestClient.proxy settings conflicting with e.g. couchrest (which uses rest-client itself)
- freezing: sometimes the timeout causes the process to freeze. AFAIK several of my friends have run into the same problem with rest-client
- redirect Location URI parsing: rest-client fails to fetch "http://www.ofertacarioca.com.br/index.aspx?cidade=4,Belo%20Horizonte" correctly, complaining about an invalid URI '/indexnew.aspx?cidade=4,Belo Horizonte' in the Location header of the 302 response, whereas curb follows the redirect through to the target page without trouble. I'm about to reimplement the wrapper using curb
- recoding problems in the third step: I attempted to detect the encoding from the HTML page meta information and the HTTP header (in this order), but for some pages this still fails
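For the Location problem, the offending part appears to be the unescaped space in the header value. If redirects are followed by hand, one possible workaround is to percent-encode anything that is not a legal URI character before resolving the value against the original URL. This is only a sketch using the standard uri library; resolve_location is a hypothetical name, and it assumes that any '%' already present starts a valid escape:

```ruby
require "uri"

# Hypothetical helper: resolve a possibly malformed Location header
# against the URL that produced the redirect.
def resolve_location(base_url, location)
  # Work on raw bytes and percent-encode every byte that is not a
  # legal URI character. Existing %XX escapes and delimiters are kept.
  cleaned = location.strip.b.gsub(/[^A-Za-z0-9\-._~:\/?\#\[\]@!$&'()*+,;=%]/) do |byte|
    format("%%%02X", byte.ord)
  end
  # The result is pure ASCII, so relabelling it as UTF-8 is safe.
  URI.join(base_url, cleaned.force_encoding(Encoding::UTF_8)).to_s
end

puts resolve_location(
  "http://www.ofertacarioca.com.br/index.aspx?cidade=4,Belo%20Horizonte",
  "/indexnew.aspx?cidade=4,Belo Horizonte"
)
```

The space becomes %20, so the relative Location resolves to an absolute URL that a strict parser accepts.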
I would love to hear about some cool gem out there that handles such needs, or any hints toward an intriguing universal solution.
As nobody has answered, I ended up implementing the curb-based solution myself: curburger. Perhaps somebody will find it useful.