Best Way to Parse HTML to XML

I have an iPhone app that queries and parses an XML file on my server. Right now, I have to manually update and upload that XML file every morning so my users get current information. I would like to automate this process, which would mean parsing various websites (NYTimes, iAmBored.com, etc.), writing the relevant information from each of them to an XML file, and uploading that file to my server.

Does anyone know the best way to accomplish this (parsing HTML into an XML file)? Since I am a beginner, I'm not sure what languages this requires or what the best approach is.

Thanks a lot in advance!


You can try converting the HTML to XHTML (XHTML is based on XML, so it is XML with some additional rules defined in a DTD).

You can also try parsing the HTML directly with an SGML parser (just as XHTML is based on XML, HTML is based on SGML).

The links are provided as inspiration.
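As a rough illustration of the HTML-to-XML idea, here is a minimal sketch in Python that uses only the standard library (html.parser and xml.etree.ElementTree). The names HtmlToXml and html_to_xml are made up for this example, and the recovery rules (auto-closing void tags, ignoring stray closing tags, closing whatever is still open) are a simplification of what a real HTML tidier does. It assumes tag and attribute names are already valid XML names.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HtmlToXml(HTMLParser):
    """Hypothetical helper: replays lenient HTML parse events into an XML tree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # stack of elements that still need closing

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. <input disabled>) become empty strings.
        self.builder.start(tag, {name: (value or "") for name, value in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close everything that was left open inside this element.
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.builder.end(open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # drop text outside the root element
            self.builder.data(data)

    def to_xml(self):
        while self.open_tags:  # close anything the page never closed
            self.builder.end(self.open_tags.pop())
        return ET.tostring(self.builder.close(), encoding="unicode")


def html_to_xml(html):
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.to_xml()
```

Calling html_to_xml('<html><body><p>Hello<br>world</body></html>') should give a string that a strict XML parser accepts. Note that this sketch nests an unclosed <p> inside the previous one instead of applying HTML's implicit-close rules, so a dedicated tool is a better fit for real pages.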


If the content you need to scrape is XHTML, you can use XSLT to transform the original content into whatever you need in the XML you provide to your users.

Otherwise, any scraping and XML-generating solution will do; most programming languages have libraries for this. You could, for example, use XPath to select the elements you need from the page and then write them to the output file.
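To make the XPath suggestion concrete, here is a small sketch using Python's xml.etree.ElementTree, which supports a limited XPath subset. The sample page, the "story" class, and the output element names (stories, story, url) are invented for illustration; a real site's markup will differ, and the input must already be well-formed XHTML.

```python
import xml.etree.ElementTree as ET

# Made-up XHTML standing in for a downloaded page.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="story"><h2><a href="/a">First headline</a></h2></div>
    <div class="story"><h2><a href="/b">Second headline</a></h2></div>
    <div class="ad"><h2><a href="/x">Sponsored</a></h2></div>
  </body>
</html>"""

# XHTML elements live in this namespace, so the path must be prefixed.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_stories(xhtml_text):
    """Select headline links and copy them into a new, simpler XML tree."""
    page = ET.fromstring(xhtml_text)
    feed = ET.Element("stories")
    for link in page.findall(".//x:div[@class='story']/x:h2/x:a", NS):
        story = ET.SubElement(feed, "story", url=link.get("href", ""))
        story.text = (link.text or "").strip()
    return feed


feed = extract_stories(XHTML)
print(ET.tostring(feed, encoding="unicode"))
```

To save the result instead of printing it, ET.ElementTree(feed).write("stories.xml", encoding="utf-8", xml_declaration=True) writes it to a file (the file name is just an example).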


Can you get what you need from the RSS/Atom feeds? That will simplify things greatly, because they are XML rather than HTML and can be parsed by a standard XML parser. Of course, descriptions embedded inside RSS feeds will be HTML, so depending on your application, that may be where you still need to parse HTML.

XSLT is a domain-specific programming language designed for processing XML, but you can also use any programming language that includes an XML parser for the task.
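For example, reading an RSS 2.0 feed with a standard XML parser might look like the sketch below. It uses Python's xml.etree.ElementTree; the sample feed and the parse_rss helper are made up for illustration, and Atom feeds use different element names and a namespace, so they would need a slightly different version.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 document standing in for a downloaded feed.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
      <description>&lt;p&gt;Some &lt;b&gt;HTML&lt;/b&gt; summary&lt;/p&gt;</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def parse_rss(rss_text):
    """Return one dict per <item>, with empty strings for missing fields."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # The description arrives as escaped HTML text, not as XML nodes.
            "description": item.findtext("description", default=""),
        })
    return items


for entry in parse_rss(SAMPLE_RSS):
    print(entry["title"], entry["link"])
```

In a real script the feed text would come from the network, for instance via urllib.request.urlopen(url).read(), and the description field is where an HTML parser would still be needed.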


TagSoup - Just Keep On Truckin'

...a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short.

TagSoup is designed for people who have to process this stuff using some semblance of a rational application design.

By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.

There is also Taggle, a version of TagSoup in C++.
