开发者

Loading EUC-JP and other Japanese text encodings in Node.JS

I'm trying to scrape some Japanese websites for a personal project. Sites with text in UTF-8 work perfectly fine, as you'd expect, but I can't get any text out of sites specifying other international encodings, specifically EUC-JP. Node also seems to be interpreting the text and performing modifications rather than passing it on raw - I've tried setting the response to be interpreted as both ascii and binary, and then set my terminal application to EUC-JP, but after doing a console.log(), neithe开发者_C百科r result in the actual text.

I've had a scan through the Node documentation, and it seems to only support two main text encodings (apart from binary and base64.)

I'm using the inbuilt http client, and specifying the encoding through the response.setEncoding method, e.g. response.setEncoding('utf8');

How are other people working with international text in Node (especially with regard to situations where the original data is not in UTF-8?) Are binary buffers the only way?

While I've done a bit of research, I'm not hugely knowledgeable when it comes to character encoding, so simple answers would be appreciated. Thanks!


There is a module that adds iconv bindings to node.js. If you grab the response as a binary Buffer, you can use Iconv.convert to convert it from EUC-JP to UTF-8 (take a look at the README for an example).

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜