Universal "HTTP GET html page content and recode to UTF-8" procedure

For some time I have been trying to solve a fairly common problem that basically consists of three steps:

  1. fetch the HTML page at a specified URL and store its content in a String
  2. detect the content encoding from either the HTML meta information or the HTTP header
  3. recode the content into UTF-8 for further processing
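
For illustration, here is a minimal sketch of the three steps using only Ruby's standard library. The helper names (`detect_charset`, `to_utf8`, `fetch_utf8`) are hypothetical, the 4096-byte scan window for the meta tag is an arbitrary assumption, and `fetch_utf8` does not follow redirects. It is not how any particular gem does it.

```ruby
require 'net/http'
require 'uri'

# Step 2: look for a charset label in the HTML meta information first,
# then in the HTTP Content-Type header; fall back to `default`.
def detect_charset(body, content_type = nil, default = Encoding::UTF_8)
  # Work on a binary copy so bytes in an unknown encoding cannot raise.
  head = body.b[0, 4096]
  meta = head[/<meta[^>]+charset\s*=\s*["']?\s*([\w.:-]+)/i, 1]
  header = content_type.to_s[/charset\s*=\s*["']?\s*([\w.:-]+)/i, 1]
  [meta, header].compact.each do |label|
    begin
      encoding = Encoding.find(label)
      return encoding if encoding
    rescue ArgumentError
      next # label unknown to Ruby, try the next candidate
    end
  end
  default
end

# Step 3: reinterpret the raw bytes in the detected encoding and
# transcode to UTF-8, replacing invalid or unmappable bytes.
def to_utf8(body, encoding)
  text = body.dup.force_encoding(encoding)
  if encoding == Encoding::UTF_8
    text.scrub # encode would be a no-op here, so clean up explicitly
  else
    text.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
  end
end

# Step 1 plus the two steps above (single request, no redirect handling).
def fetch_utf8(url)
  response = Net::HTTP.get_response(URI(url))
  body = response.body.to_s
  to_utf8(body, detect_charset(body, response['content-type']))
end
```

A page that declares a wrong or unsupported charset will still come out garbled; the sketch only guarantees that the result is valid UTF-8.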

In real usage the first step is a little more involved: I have a "user-agent" instance with a cookie jar, a configurable timeout and number of GET attempts, a configurable limit on the request count per time frame, etc.
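
The retry and rate-limit parts of such a user-agent could look roughly like the sketch below. `with_attempts` and `Throttle` are hypothetical names made up for this example, not part of rest-client or curb, and the sliding-window policy is just one possible reading of "request count per time frame".

```ruby
# Retry a block up to `attempts` times, re-raising the last error.
def with_attempts(attempts, wait: 0, on: [StandardError])
  tries = 0
  begin
    tries += 1
    yield tries
  rescue *on
    raise if tries >= attempts
    sleep(wait) if wait > 0
    retry
  end
end

# Allow at most `max_requests` request starts per `per_seconds` window.
class Throttle
  def initialize(max_requests, per_seconds)
    @max = max_requests
    @window = per_seconds
    @stamps = []
  end

  # Seconds to wait before a request may start at time `now`.
  def delay(now)
    @stamps.reject! { |t| now - t >= @window }
    @stamps.size < @max ? 0.0 : @window - (now - @stamps.first)
  end

  # Sleep until a slot is free, record the request start, return the pause.
  def acquire(now = Process.clock_gettime(Process::CLOCK_MONOTONIC))
    pause = delay(now)
    sleep(pause) if pause > 0
    @stamps << now + pause
    pause
  end
end
```

Usage would be something like `throttle.acquire` followed by `with_attempts(3, wait: 1) { ... }` around the actual GET. Timeout errors raised by the HTTP library are normally `StandardError` subclasses, so the default `on:` list should cover them.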

I implemented a rest-client wrapper but ran into several problems:

  • the class-global RestClient.proxy setting conflicts with other libraries, e.g. couchrest (which uses rest-client itself)
  • freezing: sometimes a timeout causes the whole process to freeze. AFAIK several of my friends have run into the same problem with rest-client
  • redirect Location URI parsing: rest-client fails to fetch "http://www.ofertacarioca.com.br/index.aspx?cidade=4,Belo%20Horizonte" correctly, complaining about the invalid URI '/indexnew.aspx?cidade=4,Belo Horizonte' in the Location header of the 302 response, whereas curb follows the redirect through to the target page without trouble. I'm about to reimplement the wrapper using curb
  • recoding problems in the third step: I tried detecting the encoding from the HTML meta information and then the HTTP header (in this order), but for some pages this still fails
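
Regarding the redirect problem, one possible workaround when following redirects by hand is to percent-encode the offending bytes in the Location value before parsing it. The sketch below uses a hypothetical helper, `resolve_location`, and only the standard `uri` library. It escapes spaces, control bytes and non-ASCII bytes and leaves everything else alone, so other characters that `URI` rejects would still raise. I'm not claiming this is what curb does internally.

```ruby
require 'uri'

# Resolve a possibly malformed Location header against the URL that
# produced the redirect, returning an absolute URL string.
def resolve_location(base_url, location)
  # Escape every byte outside printable ASCII (0x21..0x7E); existing
  # %XX escapes and reserved characters are left untouched.
  safe = location.strip.b.gsub(/[^\x21-\x7E]/n) { |ch| format('%%%02X', ch.ord) }
  URI.join(base_url, safe.force_encoding(Encoding::UTF_8)).to_s
end

base = 'http://www.ofertacarioca.com.br/index.aspx?cidade=4,Belo%20Horizonte'
puts resolve_location(base, '/indexnew.aspx?cidade=4,Belo Horizonte')
```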

I would love to hear of a gem that handles these needs, or any hints toward a universal solution.


As nobody answered, I ended up implementing a curb-based solution myself: curburger

Perhaps somebody will find it useful.
