HTML file fetched using 'wget' reported as binary by 'less'
If I use wget to download this page:
wget http://www.aqr.com/ResearchDetails.htm -O page.html
and then attempt to view the page in less, less reports the file as being binary:
less page.html
"page.html" may be a binary file. See it anyway?
These are the response headers:
Accept-Ranges:bytes
Cache-Control:private
Content-Encoding:gzip
Content-Length:8295
Content-Type:text/html
Cteonnt-Length:44064
Date:Sun, 25 Sep 2011 12:15:53 GMT
ETag:"c0859e4e785ecc1:6cd"
Last-Modified:Fri, 19 Aug 2011 14:00:09 GMT
Server:Microsoft-IIS/6.0
X-Powered-By:ASP.NET
Opening the file in vim works fine.
Any clues as to why less cannot handle it?
It's a UTF-16 encoded file (you can check with the W3C Validator). You can convert it to UTF-8 with this command:
wget http://www.aqr.com/ResearchDetails.htm -q -O - | iconv -f utf-16 -t utf-8 > page.html
less usually handles UTF-8.
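If the page has already been downloaded, the same iconv conversion can be wrapped in a small helper. This is only a sketch: the function name utf16_to_utf8 and the temporary-file naming are made up for illustration, and it assumes the input starts with a BOM so that iconv can pick the byte order itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool): convert a UTF-16 file
# that starts with a BOM into UTF-8.
# Usage: utf16_to_utf8 SRC DST
utf16_to_utf8() {
    src=$1
    dst=$2
    # Convert into a temporary file first, so that a failed conversion
    # does not leave a truncated DST behind.
    tmp="$dst.tmp.$$"
    if iconv -f utf-16 -t utf-8 "$src" > "$tmp"; then
        mv "$tmp" "$dst"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

For example, after saving the page as page.html, running utf16_to_utf8 page.html page.utf8.html followed by less page.utf8.html should show the text without the binary-file prompt, provided the locale is a UTF-8 one.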
Edit:
As @Stephen C reported, less on Red Hat supports UTF-16. It looks to me as though Red Hat patched less for UTF-16 support. On the official less site, UTF-16 support is currently an open issue (reference number 282).
Because it is UTF-16 encoded, as can be seen from the BOM bytes ff fe in the first two octets (od -x on a little-endian machine prints them as the 16-bit word feff):
$ od -x page.html | head -1
0000000 feff 003c 0021 0044 004f 0043 0054 0059
vim is smarter about it than less (because it is more of a Unicode-era tool).
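Because od -x groups bytes into 16-bit words in the host's byte order, the BOM check above can be confusing to read. A sketch of a byte-by-byte check follows; the function name bom_encoding is hypothetical, and it deliberately does not try to tell UTF-32 apart from UTF-16.

```shell
#!/bin/sh
# Hypothetical helper: guess the encoding from a file's byte order mark.
# Note: a UTF-32LE BOM (ff fe 00 00) also starts with ff fe and would be
# reported as UTF-16LE here; this sketch ignores UTF-32.
bom_encoding() {
    # -An drops the offset column, -tx1 prints single bytes in hex,
    # -N3 reads only the first three bytes of the file.
    bytes=$(od -An -tx1 -N3 "$1" | tr -d ' \t\n')
    case $bytes in
        fffe*)  echo "UTF-16LE" ;;
        feff*)  echo "UTF-16BE" ;;
        efbbbf) echo "UTF-8" ;;
        *)      echo "no BOM" ;;
    esac
}
```

Calling bom_encoding page.html on the downloaded file should agree with what the file command reports about its byte order.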
Added:
See "Convert UTF-16 to UTF-8 under Windows and Linux, in C" for what to do about it. Or use vim to write it back out with UTF-8 encoding.
First, it works for me. When I download the file using that command, I get a file that less shows me without any questions or problems. (I use Red Hat Fedora 14.)
Second, the file command reports page.html as:
page.html: Little-endian UTF-16 Unicode HTML document text, with very long lines, with CRLF line terminators
Maybe the UTF-16 encoding is the cause of the problem. (But I don't know why it would work with my version of less and not yours.)
@palacsint's solution works for me:
wget http://www.aqr.com/ResearchDetails.htm -q -O - | \
iconv -f utf-16 -t utf-8 > page.html
Very likely this HTML file contains non-ASCII (Unicode) characters and your locale is not set correctly (export LANG=en_US.UTF8 LESSCHARSET=utf-8). It may also be that the HTML contains invalid characters.
EDIT: After checking the file, I can clearly see it is UTF-16. So you need to adjust your terminal settings accordingly (although I was able to see the text correctly with a UTF-8 setting; perhaps my terminal program is smart).