Identifying whether data is RSS or HTML in Python

2023-01-01 02:30 Q&A author:

Is there a function or method I could call in Python that would tell me if the data is RSS or HTML?

You could always inspect the data yourself and search for an xml tag (for RSS) or an html tag (for HTML).

File types should generally be determined out-of-band. For example, if you are fetching the file from a web server, the place to look is the Content-Type header of the HTTP response. If you're reading a local file, the filesystem has its own way of determining the file type; on Windows, that means looking at the file extension.
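As a rough illustration of the out-of-band approach, the sketch below classifies a Content-Type header value. The function name `classify_content_type` and the media-type tables are assumptions made for this example; real servers frequently send generic types such as text/xml for feeds, which is why that case is reported as ambiguous rather than guessed.

```python
# Hypothetical helper: the name and the mapping tables are assumptions
# for illustration, not part of any library.
HTML_TYPES = {"text/html", "application/xhtml+xml"}
RSS_TYPES = {"application/rss+xml", "application/rdf+xml"}
GENERIC_XML_TYPES = {"application/xml", "text/xml"}


def classify_content_type(header_value):
    """Classify a Content-Type header value as 'html', 'rss', 'xml' or 'unknown'."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    if media_type in HTML_TYPES:
        return "html"
    if media_type in RSS_TYPES:
        return "rss"
    if media_type in GENERIC_XML_TYPES:
        # Could be a feed or any other XML document; the header alone can't say.
        return "xml"
    return "unknown"


print(classify_content_type("text/html; charset=utf-8"))
print(classify_content_type("application/rss+xml"))
```

With `urllib.request.urlopen`, the header value is available as `response.headers.get("Content-Type", "")`. When the result is "xml" or "unknown", the content itself still has to be examined.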

If none of that is available, you'd have to resort to content sniffing. This is never wholly reliable, and RSS is particularly annoying because there are multiple incompatible versions of it, but about the best you could do would probably be:

1. Attempt to parse the content with an XML parser. If it fails, the content isn't well-formed XML so can't be RSS.
2. Look at the document.documentElement.namespaceURI. If it's http://www.w3.org/1999/xhtml, you've got XHTML. If it's http://www.w3.org/1999/02/22-rdf-syntax-ns#, you've got RSS (of one flavour).
3. If the document.documentElement.tagName is rss, you've got RSS (of a slightly different flavour).
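The steps above could be sketched with the standard library's `xml.dom.minidom`, which exposes the same `documentElement`, `namespaceURI` and `tagName` properties. The function name `sniff_xml` and its return values are assumptions chosen for this example.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

XHTML_NS = "http://www.w3.org/1999/xhtml"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def sniff_xml(data):
    """Return 'xhtml', 'rss', 'xml' (some other XML), or None if not well-formed.

    Hypothetical helper for illustration; `data` may be str or bytes.
    """
    try:
        document = minidom.parseString(data)
    except ExpatError:
        # Not well-formed XML, so it cannot be (valid) RSS.
        return None
    root = document.documentElement
    if root.namespaceURI == XHTML_NS:
        return "xhtml"
    if root.namespaceURI == RDF_NS:
        return "rss"  # RDF-based flavour
    if root.tagName == "rss":
        return "rss"  # <rss version="..."> flavour
    return "xml"


print(sniff_xml('<rss version="2.0"><channel><title>t</title></channel></rss>'))
print(sniff_xml("<html><body><p>tag soup<br></body></html>"))
```

A result of None only says the data is not well-formed XML; it may still be ordinary HTML. Python's documentation warns that its XML parsers can be vulnerable to maliciously constructed documents, so untrusted input deserves extra care.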

If the file couldn't be parsed as XML, it could well be HTML (or some tag-soup approximation of it). It's conceivable it might also be broken RSS, in which case most feed tools would reject it. If you still need to detect this case, you'd be reduced to looking for strings like <html or <rss or <rdf:RDF near the start of the file. This would be even more unreliable.
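A minimal sketch of that last-resort string check follows. The name `sniff_markers`, the 1024-byte window, and the case-insensitive match are all assumptions for this example. It looks for `<html`, `<rss`, or `<rdf:RDF` (the root element used by RDF-based RSS) and reports whichever appears first.

```python
import re

# Matches the opening of an <html>, <rss> or <rdf:RDF> tag.
# Case-insensitive because tag-soup HTML is often upper-case.
_MARKER = re.compile(rb"<(html|rss|rdf:RDF)\b", re.IGNORECASE)


def sniff_markers(data, limit=1024):
    """Guess 'html' or 'rss' from the first `limit` bytes, or return None.

    Hypothetical helper for illustration; `data` must be bytes.
    This is a heuristic and can easily be fooled.
    """
    match = _MARKER.search(data[:limit])
    if match is None:
        return None
    name = match.group(1).lower()
    return "html" if name == b"html" else "rss"


print(sniff_markers(b"<!DOCTYPE html>\n<HTML><body><p>hi"))
print(sniff_markers(b'<?xml version="1.0"?>\n<rss version="2.0"><channel>'))
```

Because the check only sees raw bytes, a marker inside a comment or a quoted string near the top of the file would produce a wrong answer, which is one reason this approach is less reliable than parsing.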
