Strategy for parsing LOTS and LOTS of not-so-well formed SGML / XML documents

2023-01-26 05:24

I have thousands of SGML documents, some well-formed, some not so well-formed. I need to get at certain ELEMENTS in the documents, but every time I try to load them into an XDocument, XmlDocument, or even just a StreamReader, I get various XmlException errors.

Things like "'[' is an unexpected token.". Why? Because I have a document with DOCTYPE like

<!DOCTYPE RChapter PUBLIC "-//LSC//DTD R Chapter for Authoring//EN" [] >

and I have learned that the "[]" needs to have something valid inside. Again, I don't control the creation of the documents, but I DO HAVE to "crack" them and get at the data I want. Another example is having an "unclosed" ELEMENT, for example:

<Caption>Plants, and facilities<hardhyphen><hyphen>Inspection.</Caption>

The XmlException here is "The 'hyphen' start tag on line 27 does not match the end tag of 'Caption'. Line 27, position 58." Obvious, right?


But then the question is: how can you actually get at certain ELEMENTS in these documents without running into XmlExceptions? Is a SAX parser the right way? I basically want to open the document, go right to the element I want (without worrying about what might or might not be well-formed nearby), pull the data, and move on. Should I just forget parsing with XmlDocument or XDocument and do simple string replacements like

str.Replace("<hardhyphen><hyphen>", "-")

and then try to load the result into one of the XML parsers? Any tips on strategies?

The issue is that you're trying to parse SGML with an XML tool. They're not the same. If you want to use an XML tool/language to access the data, you will probably need to convert the SGML to XML before trying to parse it.

Ideally you'd either use a language/tool that supports SGML (like OmniMark) or something that can handle "XML like" data (like nokogiri from the first answer?).
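
As a rough illustration of the "handle XML-like data" route, here is a minimal sketch in Python using the standard library's html.parser, which is a forgiving tag tokenizer and not a real SGML parser. ElementTextCollector and extract_text are names made up for this example. It knows nothing about the DTD, it lowercases tag names, and it treats a few HTML element names (such as script and style) specially, so take it as a way to pull the text out of a named element and not as a conversion tool.

```python
from html.parser import HTMLParser


class ElementTextCollector(HTMLParser):
    """Collect the text inside every occurrence of one element (hypothetical helper)."""

    def __init__(self, target):
        super().__init__(convert_charrefs=True)
        self.target = target.lower()  # HTMLParser reports tag names in lowercase
        self.depth = 0                # > 0 while we are inside the target element
        self.buffer = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == self.target:
            self.depth += 1
            if self.depth == 1:
                self.buffer = []

    def handle_endtag(self, tag):
        if tag == self.target and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.buffer))

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def extract_text(sgml_text, element_name):
    """Return the text content of each <element_name>...</element_name> found."""
    parser = ElementTextCollector(element_name)
    parser.feed(sgml_text)
    parser.close()
    return parser.results


SAMPLE = (
    '<!DOCTYPE RChapter PUBLIC "-//LSC//DTD R Chapter for Authoring//EN" [] >\n'
    "<RChapter><Caption>Plants, and facilities<hardhyphen><hyphen>"
    "Inspection.</Caption></RChapter>"
)
captions = extract_text(SAMPLE, "Caption")
```

Note that the unclosed hardhyphen and hyphen tags are simply skipped, so whatever character they stand for is lost from the extracted text. If that matters, the tags have to be mapped to characters before extraction.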

This can be pretty straightforward, but it can get tricky at some points, especially if you're dealing with multiple doctypes (DTDs). (Also, there's no such thing as "well-formed" SGML. Yes, the elements etc. have to be nested correctly, but SGML has to have a DTD.)

Here are some differences between SGML and XML that you'd need to handle. (You may not want to go this route, but it may be helpful for informational purposes anyway.)

DOCTYPE declaration

The DOCTYPE declaration in your example is a perfectly valid SGML doctype. The [] (internal subset) doesn't have to have anything in it. If you do have declarations in the internal subset (usually entity declarations), you're more than likely going to have to keep a doctype declaration in the XML.

The issue the XML parser is having is that you don't have a system identifier in the declaration. In an XML doctype declaration, the system identifier is required if there is a public identifier. In an SGML doctype declaration, it's not required.

Bottom line: unless you need the XML to parse to a DTD/Schema or have declarations in the internal subset, strip the doctype declaration. If the XML does have to be valid, you'll at least need to add a system identifier. Don't forget to add the <?xml ...?> declaration at the top (strictly speaking it is the XML declaration, not a processing instruction).
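
A minimal sketch of that bottom line in Python, assuming the doctype has no unusual quoting (no ">" in the identifiers, no "]" inside the internal subset). strip_doctype is a hypothetical helper name. It refuses to touch a document whose internal subset actually contains declarations, since those are the cases where the doctype has to be kept.

```python
import re

XML_DECLARATION = '<?xml version="1.0"?>\n'

# <!DOCTYPE ... [internal subset] > with the internal subset captured in group 1.
DOCTYPE_RE = re.compile(r"<!DOCTYPE\b[^\[>]*(?:\[([^\]]*)\])?\s*>", re.IGNORECASE)


def strip_doctype(sgml_text):
    """Drop the doctype declaration and prepend an XML declaration."""
    match = DOCTYPE_RE.search(sgml_text)
    if match is None:
        body = sgml_text
    else:
        if (match.group(1) or "").strip():
            # Entity declarations etc. would be lost, so leave this one for a human.
            raise ValueError("internal subset has declarations; keep the doctype")
        body = sgml_text[:match.start()] + sgml_text[match.end():]
    return XML_DECLARATION + body.lstrip()


sample = (
    '<!DOCTYPE RChapter PUBLIC "-//LSC//DTD R Chapter for Authoring//EN" [] >\n'
    "<RChapter></RChapter>"
)
converted = strip_doctype(sample)
```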

Elements without end tags

The <hardhyphen> and <hyphen> elements are valid SGML. SGML DTDs allow you to specify tag minimization, which means you can say whether or not an end tag is required. (You can also make the start tag optional, but that's crazy talk.) In XML you have to close these elements (like <hardhyphen/> or <hardhyphen></hardhyphen>).

The best thing to do is to look at your SGML DTD and see what elements have optional end tags. The tag minimization is specified right after the element name in the element declaration. A '-' means the tag is required. An 'o' (letter 'oh') means that the tag is optional. For example if you see <!ELEMENT hyphen - o (#PCDATA)>, this means that the start tag is required (-) and the end tag is optional (o). If you see <!ELEMENT hyphen - - (#PCDATA)>, both the start and the end tags are required.
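
If the DTD is large, a quick scan can list the candidates. This is a rough sketch in Python, assuming every element declaration carries both minimization flags and that name groups such as (a|b) are written without spaces. optional_end_tag_elements is a made-up helper name, and the Caption declaration in the sample is invented for illustration.

```python
import re

# <!ELEMENT name-or-group start-flag end-flag content-model>
ELEMENT_DECL_RE = re.compile(
    r"<!ELEMENT\s+(\S+)\s+([-o])\s+([-o])\s", re.IGNORECASE
)


def optional_end_tag_elements(dtd_text):
    """Return the names of elements whose end tag is declared optional ('o')."""
    names = set()
    for name_part, _start_flag, end_flag in ELEMENT_DECL_RE.findall(dtd_text):
        if end_flag.lower() == "o":
            # A name group like (a|b) declares several elements at once.
            names.update(n for n in name_part.strip("()").split("|") if n)
    return names


dtd = """<!ELEMENT hyphen - o (#PCDATA)>
<!ELEMENT Caption - - (#PCDATA)>"""
optional = optional_end_tag_elements(dtd)
```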

Bottom line: properly close all of the elements that don't have end tags.
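
For elements that are only ever used as empty markers, like the two in the question, closing them can be as simple as rewriting the start tag as a self-closing tag. Here is a sketch in Python under that assumption. self_close is a hypothetical helper, names are matched case-sensitively, and an element that has an optional end tag but real content would need proper end-tag inference instead of this.

```python
import re
import xml.etree.ElementTree as ET


def self_close(text, empty_names):
    """Rewrite <name ...> as <name .../> for every name in empty_names."""
    if not empty_names:
        return text
    alternation = "|".join(re.escape(name) for name in sorted(empty_names))
    # The lookahead keeps "hyphen" from matching inside e.g. "<hyphenated>".
    pattern = re.compile(r"<(" + alternation + r")(?=[\s/>])([^<>]*?)/?>")
    return pattern.sub(
        lambda m: "<" + m.group(1) + m.group(2).rstrip() + "/>", text
    )


caption = "<Caption>Plants, and facilities<hardhyphen><hyphen>Inspection.</Caption>"
fixed = self_close(caption, {"hardhyphen", "hyphen"})
root = ET.fromstring(fixed)  # an XML parser accepts the fragment once the tags are closed
```

Running it a second time over already converted text leaves the text unchanged, which helps when some files in the batch are already XML.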

Processing instructions

Processing instructions (PIs) in SGML are closed with a plain >, without the second ? that XML requires. You'll need to add the second ?.

Example SGML PI: <?asdf jkl>

Example XML PI: <?asdf jkl?>
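
That conversion is a one-line substitution. A sketch in Python, assuming no PI contains a ">" in its data (in SGML the first ">" ends the PI anyway). close_processing_instructions is a made-up name.

```python
import re

# "<?" + data + optional "?" + ">"; the data is captured without the trailing "?".
SGML_PI_RE = re.compile(r"<\?([^>]*?)\??>")


def close_processing_instructions(text):
    """Turn SGML-style <?data> into XML-style <?data?>; leaves <?data?> alone."""
    return SGML_PI_RE.sub(r"<?\1?>", text)


converted_pi = close_processing_instructions("<?asdf jkl>")
```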

Inclusions/Exclusions

You probably won't have to worry about this, but in an SGML DTD an element declaration can specify that another element is allowed anywhere inside of that element (an inclusion) or not allowed anywhere inside it (an exclusion). This can be a pain if your target XML needs to parse to a DTD; XML DTDs do not allow inclusions/exclusions.

This is what an inclusion might look like:

<!ELEMENT chapter - - (section)+ +(revst|revend)>

This is saying that revst or revend are allowed anywhere inside of chapter. If the element declaration had -(revst|revend), it would mean that revst or revend is not allowed anywhere inside of chapter.
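
If you do need an XML DTD, it helps to know which declarations carry exceptions so you can rewrite their content models by hand. A small sketch in Python that pulls them out of a single declaration, assuming the +( or -( group is preceded by whitespace and contains no nested parentheses. find_exceptions is a hypothetical helper.

```python
import re

INCLUSION_RE = re.compile(r"\s\+\(([^)]*)\)")  # e.g. " +(revst|revend)"
EXCLUSION_RE = re.compile(r"\s-\(([^)]*)\)")   # e.g. " -(revst|revend)"


def find_exceptions(element_decl):
    """Return the inclusion and exclusion name lists of one <!ELEMENT ...> declaration."""

    def names(pattern):
        match = pattern.search(element_decl)
        if match is None:
            return []
        return [name.strip() for name in match.group(1).split("|")]

    return {"inclusions": names(INCLUSION_RE), "exclusions": names(EXCLUSION_RE)}


decl = "<!ELEMENT chapter - - (section)+ +(revst|revend)>"
found = find_exceptions(decl)
```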

Hope this helps.

Yeah, use Nokogiri.

Scroll down a bit on that page and copy the code under "Synopsis" into a file, say xml-parser.rb. Then, if you're on a Mac (Ruby comes already installed on Macs), open Terminal, run gem install nokogiri, and then run the file with: ruby xml-parser.rb.

You can also type irb right from Terminal, then require 'nokogiri', and start playing around with the Nokogiri API in real time. Gotta love interactive Ruby. :)

If you're on Windows, try this Ruby installer for Windows.
