Best Way to Parse HTML to XML

2023-01-22 14:27 问答作者：

Essentially, I currently have an iPhone app that can query and parse an XML file on my server. Right now, I currently have to manually update and upload my XML file every morning so my users can have the updated information. I would like to automate this process, which would essentially entail parsing various websites (NYTimes, iAmBored.com, etc), outputting the relevant information from each of these websites to an XML file, and uploading that file to my server.

Does anyone know the best way to accomplish this (parsing HTML to an XML file). Since I am a beginner, I'm not sure what languages th开发者_JS百科is requires or what is the best way to do this?

Thanks a lot in advance!

You can try to translate HTML to XHTML (XHTML is based on XML so it's XML with some rules defined in a DTD).

You can also try to parse directly HTML with a SGML parser (As XHTML is based on XML, HTML is based on SGML).

The links are provided as inspiration.

If the content you need to scrape is in XHTML then you can easily use the XSLT language to transform original content in what you need inside the XML you provide to your users.

Otherwise any kind of scraping and XML producing solution will be fine, every programming language has its support to do such things.. but you could use XPath to select the elements you need from the page and then save them inside the output file.

Can you get what you need from the RSS/Atom feeds? That will simplify things greatly because they are XML rather than HTML and can be parsed by a standard XML parser. Of course, descriptions embedded inside RSS feeds will be HTML, so depending on your application, that may be when you need to parse HTML.

XSLT is a domain-specific programming language designed for processing XML, but you can also use any programming language that includes an XML parser for the task.

TagSoup - Just Keep On Truckin'

...a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short.

TagSoup is designed for people who have to process this stuff using some semblance of a rational application design.

By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.

Also, Taggle, a TagSoup in C++, available now

继续阅读：html-parsing xml

Best Way to Parse HTML to XML

TagSoup - Just Keep On Truckin'

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

TagSoup - Just Keep On Truckin'

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？