Best Way to determine MimeType from a String?

2023-01-10 16:31

I have a crawler that downloads pages and tries to parse the HTML. One of the issues I've been facing is how to reliably determine the MIME type of a downloaded file, in particular whether it is HTML.

Right now I'm using

is = new ByteArrayInputStream( htmlResult.getBytes( "UTF-8" ) );
mimeType = URLConnection.guessContentTypeFromStream(is);

but it misses sites like this one: http://www.artdaily.org/index.asp?int_sec%3D11%26int_new%3D39415 because of the extra space between the doctype declaration and the HTML tag in the source.

Does anyone know a good way to determine whether a string is HTML or not? Searching for <html> or some other tag wouldn't necessarily work, because I may come across binary files with text embedded in them.

Thanks.

Do you have control over the HTTP connection that your crawler uses? If so, how about checking the "Content-Type" HTTP response header? That's one way to determine the content type. I just did a quick test against the artdaily.org URL to see whether a Content-Type header was sent, and there is one, with the value text/html.
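
As a rough illustration of the header check described above, here is a minimal sketch using the standard HttpURLConnection. The class name ContentTypeCheck and its method names are made up for this example, and it assumes the crawler can open the connection itself. Servers can mislabel content, so this is probably best treated as a first check rather than a guarantee.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Locale;

// Hypothetical helper, not part of the original crawler.
final class ContentTypeCheck {

    // Extracts the bare media type from a Content-Type header value,
    // e.g. "text/html; charset=UTF-8" becomes "text/html".
    // Returns null if the header is missing or empty.
    static String mediaType(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return null;
        }
        int semicolon = contentTypeHeader.indexOf(';');
        String type = semicolon >= 0
                ? contentTypeHeader.substring(0, semicolon)
                : contentTypeHeader;
        type = type.trim().toLowerCase(Locale.ROOT);
        return type.isEmpty() ? null : type;
    }

    // True if the header declares HTML or XHTML.
    static boolean isHtml(String contentTypeHeader) {
        String type = mediaType(contentTypeHeader);
        return "text/html".equals(type) || "application/xhtml+xml".equals(type);
    }

    // Opens a connection and reports whether the server declares the body as HTML.
    static boolean serverSaysHtml(String pageUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        try {
            // getContentType() returns the raw Content-Type header value, or null.
            return isHtml(conn.getContentType());
        } finally {
            conn.disconnect();
        }
    }
}
```

A crawler could call ContentTypeCheck.serverSaysHtml(url) before handing the body to the HTML parser, and fall back to content sniffing only when the header is missing.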
