HTML file fetched using 'wget' reported as binary by 'less'

If I use wget to download this page:

wget http://www.aqr.com/ResearchDetails.htm -O page.html

and then attempt to view the page in less, less reports the file as being a binary.

less page.html 
"page.html" may be a binary file.  See it anyway? 

These are the response headers:

Accept-Ranges:bytes
Cache-Control:private
Content-Encoding:gzip
Content-Length:8295
Content-Type:text/html
Cteonnt-Length:44064
Date:Sun, 25 Sep 2011 12:15:53 GMT
ETag:"c0859e4e785ecc1:6cd"
Last-Modified:Fri, 19 Aug 2011 14:00:09 GMT
Server:Microsoft-IIS/6.0
X-Powered-By:ASP.NET

Opening the file in vim works fine.

Any clues as to why less cannot handle it?


It's a UTF-16 encoded file (you can check with the W3C Validator). You can convert it to UTF-8 with this command:

wget http://www.aqr.com/ResearchDetails.htm -q -O - | iconv -f utf-16 -t utf-8 > page.html

less usually handles UTF-8.
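The same conversion can be tried offline on a small test file. This is only a sketch: sample.html and sample-utf8.html are hypothetical names, and the printf line builds a tiny UTF-16LE file with a BOM as a stand-in for the downloaded page.

```shell
# Build a tiny UTF-16LE file with a BOM (ff fe) as a stand-in for page.html.
# sample.html and sample-utf8.html are hypothetical file names.
printf '\377\376<\000p\000>\000h\000i\000<\000/\000p\000>\000\n\000' > sample.html

# Same iconv call as above, reading from a file instead of wget's stdout.
# With "-f utf-16", iconv uses the BOM to pick the byte order.
iconv -f utf-16 -t utf-8 sample.html > sample-utf8.html

# The result is plain UTF-8 text that less can display.
cat sample-utf8.html
```

If iconv behaves as expected, sample-utf8.html holds the markup as ordinary UTF-8 without the BOM.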

edit:

As @Stephen C reported, less on Red Hat supports UTF-16. It looks to me as though Red Hat patched less to add UTF-16 support. On the official less site, UTF-16 support is currently an open issue (ref number 282).


Because it is UTF-16 encoded, as can be seen from the BOM ff fe in the first two octets (od -x shows them as the single 16-bit word feff on a little-endian machine):

$ od -x page.html | head -1
0000000 feff 003c 0021 0044 004f 0043 0054 0059

vim is smarter about this than less (it is more a product of the Unicode era).
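To read the bytes in file order rather than as 16-bit words, od can print one byte at a time. A minimal sketch, assuming a POSIX od; bom.bin is a hypothetical stand-in for the first few bytes of page.html:

```shell
# Four bytes: the UTF-16LE BOM (ff fe) followed by '<' encoded as 3c 00.
# bom.bin is a hypothetical file name used only for this illustration.
printf '\377\376<\000' > bom.bin

# -tx1 prints single bytes in file order, -An drops the offset column,
# -N4 stops after the first four bytes.
od -An -tx1 -N4 bom.bin
```

This should list the bytes as ff fe 3c 00. On the real download, od -An -tx1 -N2 page.html would show just the BOM.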

added:

See "Convert UTF-16 to UTF-8 under Windows and Linux, in C" for what to do about it. Or use vim to write it back out with UTF-8 encoding.


Firstly, it works for me. When I download the file using that command, I get a file that "less" shows me without any questions or problems. (I use Red Hat Fedora 14.)

Second, the "file" command reports "page.html" as:

page.html: Little-endian UTF-16 Unicode HTML document text, with very long lines, with CRLF line terminators

Maybe the UTF-16 encoding is the cause of the problems. (But I don't know why it would work with my version of "less" and not yours.)


@palacsint's solution works for me:

wget http://www.aqr.com/ResearchDetails.htm -q -O - | \
     iconv -f utf-16 -t utf-8 > page.html


Very likely this HTML file contains non-ASCII characters and your locale is not set correctly (try export LANG=en_US.UTF8 LESSCHARSET=utf-8). It may also be that the HTML contains invalid characters.

EDIT: After checking the file I can clearly see it is UTF-16. So you need to adjust your terminal settings accordingly (although I was able to see the text correctly with a UTF-8 setting; perhaps my terminal program is smart).
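If rewriting the file on disk is not wanted, one option is to decode it on the fly and pipe the result into the pager. The helper below is only a sketch: view_utf16 is a hypothetical name, and it assumes iconv is installed and that the input really is UTF-16 with a BOM.

```shell
# view_utf16 is a hypothetical helper, not part of less or iconv.
# It leaves the original file untouched and pages a UTF-8 copy of it.
view_utf16() {
    # Decode UTF-16 to UTF-8 and hand the stream to $PAGER (less by default).
    # LESSCHARSET=utf-8 asks less to treat its input as UTF-8.
    iconv -f utf-16 -t utf-8 "$1" | LESSCHARSET=utf-8 ${PAGER:-less}
}
```

Usage would be view_utf16 page.html. Because less reads from a pipe here, it sees only the converted UTF-8 stream and never the original UTF-16 bytes.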
