Identifying whether data is RSS or HTML in Python

Is there a function or method I could call in Python that would tell me whether the data is RSS or HTML?


You could always inspect the data yourself and look for an xml tag (for RSS) or an html tag (for HTML).


File types should generally be determined out-of-band. For example, if you are fetching the file from a web server, the place to look is the Content-Type header of the HTTP response. If you're reading a local file, the filesystem has its own way of indicating the file type; on Windows, that means looking at the file extension.
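As a rough sketch of the out-of-band check, the helper below classifies a Content-Type header value. The function name `classify_content_type` and the sets of media types are my own assumptions, not anything from a library; in particular `application/rss+xml` is widely used for feeds but servers are free to send something else, and generic types such as `text/xml` don't settle the question either way.

```python
def classify_content_type(header_value):
    """Return 'rss', 'html', or None (undetermined) for a Content-Type value.

    Hypothetical helper: the media-type lists below are assumptions.
    """
    if not header_value:
        return None
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    if media_type in ("application/rss+xml", "application/rdf+xml"):
        return "rss"
    if media_type in ("text/html", "application/xhtml+xml"):
        return "html"
    # text/xml, application/xml and anything else are ambiguous.
    return None
```

With `urllib.request.urlopen`, the value to pass in would be `response.headers.get("Content-Type")`. A result of `None` means you are back to content sniffing.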

If none of that is available, you'd have to resort to content sniffing. This is never wholly reliable, and RSS is particularly annoying because there are multiple incompatible versions of it, but about the best you could do would probably be:

  1. Attempt to parse the content with an XML parser. If it fails, the content isn't well-formed XML, so it can't be RSS.

  2. Look at the document.documentElement.namespaceURI. If it's http://www.w3.org/1999/xhtml, you've got XHTML. If it's http://www.w3.org/1999/02/22-rdf-syntax-ns#, you've got RSS (of one flavour).

  3. If the document.documentElement.tagName is rss, you've got RSS (of a slightly different flavour).
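The three steps above could be sketched with the standard library's `xml.dom.minidom`, which exposes the same `documentElement`, `namespaceURI` and `tagName` attributes. The function name `sniff_xml` and its return labels are assumptions made for illustration. Note that an RDF root element only suggests RSS, and that for untrusted input a hardened XML parser may be preferable.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

XHTML_NS = "http://www.w3.org/1999/xhtml"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def sniff_xml(data):
    """Return 'rss', 'xhtml', 'other-xml', or None if data is not well-formed XML.

    Hypothetical helper; the return labels are illustrative.
    """
    # Step 1: anything that is not well-formed XML cannot be RSS.
    try:
        document = minidom.parseString(data)
    except ExpatError:
        return None
    root = document.documentElement
    # Step 2: check the namespace of the root element.
    if root.namespaceURI == XHTML_NS:
        return "xhtml"
    if root.namespaceURI == RDF_NS:
        return "rss"  # RDF-based flavour; could in principle be other RDF
    # Step 3: the non-RDF flavours use a plain <rss> root element.
    if root.tagName == "rss":
        return "rss"
    return "other-xml"
```

A `None` result here does not mean "HTML"; it only means the XML parser gave up, which is the case discussed next.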

If the file couldn't be parsed as XML, it could well be HTML (or some tag-soup approximation of it). It's conceivable that it might also be broken RSS, in which case most feed tools would reject it. If you still need to detect this case, you'd be reduced to looking for strings like <html or <rss or <rdf:RDF near the start of the file. This would be even more unreliable.
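A minimal sketch of that last-resort string search follows. The name `sniff_by_marker`, the 1024-byte window, and the case-insensitive matching are all arbitrary choices of mine. It looks for `<rdf:RDF` because that is the root element RSS 1.0 documents use. Treat the result as a guess, not a verdict.

```python
import re

# Opening tags that hint at the document type (matched case-insensitively).
MARKER = re.compile(rb"<(html|rss|rdf:RDF)\b", re.IGNORECASE)


def sniff_by_marker(data, limit=1024):
    """Guess 'html' or 'rss' from the first `limit` bytes, else return None.

    Hypothetical helper; the window size is an assumption.
    """
    if isinstance(data, str):
        data = data.encode("utf-8", "replace")
    # Only the start of the file is examined; the first marker found wins.
    match = MARKER.search(data[:limit])
    if match is None:
        return None
    return "html" if match.group(1).lower() == b"html" else "rss"
```

Because the earliest marker wins, an HTML page that merely mentions `<rss` further down is still reported as HTML, provided its own `<html` tag appears first.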
