Strategy for parsing LOTS and LOTS of not-so-well formed SGML / XML documents
I have thousands of SGML documents, some well-formed, some not so well-formed. I need to get at certain ELEMENTS in the documents, but every time I try to load them into an XDocument, an XmlDocument, or even just a StreamReader, I get various XmlException errors.
Things like "'[' is an unexpected token.". Why? Because I have a document with DOCTYPE like
<!DOCTYPE RChapter PUBLIC "-//LSC//DTD R Chapter for Authoring//EN" [] >
and I have learned that the "[]" needs to have something valid inside. Again, I don't control the creation of the documents, but I DO HAVE to "crack" them and get at the data I want. Another example is having an "unclosed" ELEMENT, for example:
<Caption>Plants, and facilities<hardhyphen><hyphen>Inspection.</Caption>
This XMLException is "The 'hyphen' start tag on line 27 does not match the end tag of 'Caption'. Line 27, position 58." Obvious, right?
But then the question is: how can you actually get at certain ELEMENTS in these documents without encountering XmlExceptions? Is a SAX parser the right way? I basically want to open the document, go right to the element I want (without worrying what might or might not be well-formed nearby), pull the data, and move on. Should I just forget parsing with XmlDocument or XDocument, and just do simple string replacements like
str.Replace("<hardhyphen><hyphen>", "-")
and then try to load it into one of the XML parsers. Any tips on strategies?
The issue is that you're trying to parse SGML with an XML tool. They're not the same. If you want to use an XML tool/language to access the data, you will probably need to convert the SGML to XML before trying to parse it.
Ideally you'd either use a language/tool that supports SGML (like OmniMark) or something that can handle "XML-like" data (perhaps Nokogiri, suggested in the other answer).
Converting SGML to XML can be pretty straightforward, but it can get tricky at some points, especially if you're dealing with multiple doctypes (DTDs). (Also, there's no such thing as "well-formed" SGML. Yes, the elements etc. have to be nested correctly, but SGML has to have a DTD.)
Here are some differences between SGML and XML that you'd need to handle. (You may not want to go this route, but it may be helpful for informational purposes anyway.)
DOCTYPE declaration
The DOCTYPE declaration in your example is a perfectly valid SGML doctype. The [] (internal subset) doesn't have to have anything in it. If you do have declarations in the internal subset (usually entity declarations), you're more than likely going to have to keep a doctype declaration in the XML.
The issue the XML parser is having is that you don't have a system identifier in the declaration. In an XML doctype declaration, the system identifier is required if there is a public identifier. In an SGML doctype declaration, it's not required.
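If none of the documents rely on declarations in the internal subset, dropping the DOCTYPE before parsing could look something like the following. This is only a rough Python sketch, not a tested converter: `strip_doctype` is a made-up helper name, the sample document is a cut-down stand-in for the one in the question, and the regular expression assumes the internal subset contains no `]` characters.

```python
import re
import xml.etree.ElementTree as ET

# Matches a DOCTYPE declaration, including an optional [...] internal subset.
# Assumption: the internal subset contains no ']' characters and holds
# nothing the document actually needs (for example entity declarations).
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>\[]*(\[[^\]]*\])?[^>]*>", re.IGNORECASE)


def strip_doctype(text):
    """Remove the first DOCTYPE declaration from the document text."""
    return DOCTYPE_RE.sub("", text, count=1)


# Hypothetical, cut-down document using the DOCTYPE from the question.
sample = (
    '<!DOCTYPE RChapter PUBLIC "-//LSC//DTD R Chapter for Authoring//EN" [] >\n'
    "<RChapter><Caption>Inspection.</Caption></RChapter>"
)

root = ET.fromstring(strip_doctype(sample).strip())
print(root.find("Caption").text)
```

If the internal subset does carry entity declarations, stripping it like this would leave undefined entity references behind, so in that case the declaration should be kept and given a system identifier instead.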
Bottom line: unless you need the XML to parse to a DTD/Schema or have declarations in the internal subset, strip the doctype declaration. If the XML does have to be valid, you'll at least need to add a system identifier. Don't forget to add the <?xml ...?> declaration.
Elements without end tags
The <hardhyphen> and <hyphen> elements are valid SGML. SGML DTDs allow you to specify tag minimization. What this means is that you can specify whether or not an end tag is required. (You can also make the start tag optional, but that's crazy talk.) In XML you have to close these elements (like <hardhyphen/> or <hardhyphen></hardhyphen>).
The best thing to do is to look at your SGML DTD and see what elements have optional end tags. The tag minimization is specified right after the element name in the element declaration. A '-' means the tag is required. An 'o' (letter 'oh') means that the tag is optional. For example, if you see
<!ELEMENT hyphen - o (#PCDATA)>
this means that the start tag is required (-) and the end tag is optional (o). If you see
<!ELEMENT hyphen - - (#PCDATA)>
both the start and the end tags are required.
Bottom line: properly close all of the elements that don't have end tags.
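As a rough illustration of that advice, the Python sketch below reads the minimization flags out of a DTD and self-closes the matching start tags. Everything named here is an assumption: the DTD fragment is invented (the real DTD from the question isn't shown), the helper names are made up, and simply self-closing is only safe for elements that never have content (declared EMPTY). For an element that can hold content, the place where it ends has to be worked out from the content model, which this does not attempt.

```python
import re
import xml.etree.ElementTree as ET

# Matches declarations such as: <!ELEMENT hyphen - o EMPTY>
# Groups: element name, start-tag minimization, end-tag minimization.
# Assumption: one element name per declaration (no name groups like (a|b)).
ELEMENT_DECL_RE = re.compile(
    r"<!ELEMENT\s+([\w.-]+)\s+([-o])\s+([-o])\s", re.IGNORECASE
)


def optional_end_tag_elements(dtd_text):
    """Return the names of elements whose end tag is declared optional ('o')."""
    return {
        name
        for name, _start, end in ELEMENT_DECL_RE.findall(dtd_text)
        if end.lower() == "o"
    }


def self_close(text, names):
    """Rewrite <name ...> as <name .../> for the given element names."""
    if not names:
        return text
    alternatives = "|".join(re.escape(n) for n in sorted(names))
    # The lookbehind leaves tags that are already self-closed untouched.
    pattern = re.compile(r"<(%s)(\s[^<>]*?)?(?<!/)>" % alternatives)
    return pattern.sub(
        lambda m: "<%s%s/>" % (m.group(1), m.group(2) or ""), text
    )


# Hypothetical DTD fragment, invented for this example.
HYPOTHETICAL_DTD = """
<!ELEMENT Caption    - - (#PCDATA | hardhyphen | hyphen)*>
<!ELEMENT hardhyphen - o EMPTY>
<!ELEMENT hyphen     - o EMPTY>
"""

line = "<Caption>Plants, and facilities<hardhyphen><hyphen>Inspection.</Caption>"
names = optional_end_tag_elements(HYPOTHETICAL_DTD)
fixed = self_close(line, names)
root = ET.fromstring(fixed)  # parses once the empty elements are closed
print(fixed)
```

Note that the match is case-sensitive, while SGML names usually are not, so real documents may need the tag names normalized first.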
Processing instructions
Processing instructions (PIs) in SGML are closed with a plain >, without the second ? that XML requires. You'll need to add the second ?.
Example SGML PI:
<?asdf jkl>
Example XML PI:
<?asdf jkl?>
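Adding the missing ? can be done with a single substitution. A minimal Python sketch, assuming the PI content never contains a > character (`fix_processing_instructions` is a made-up name):

```python
import re

# An SGML processing instruction is closed by '>' alone: <?asdf jkl>
# The lookbehind skips instructions that already end in '?>'.
# Assumption: the PI content itself contains no '>' character.
SGML_PI_RE = re.compile(r"<\?([^>]*)(?<!\?)>")


def fix_processing_instructions(text):
    """Close SGML-style processing instructions the XML way, with '?>'."""
    return SGML_PI_RE.sub(r"<?\1?>", text)


print(fix_processing_instructions("<?asdf jkl>"))   # SGML form gets the second '?'
print(fix_processing_instructions("<?asdf jkl?>"))  # XML form is left alone
```

Because instructions that already end in ?> are skipped, running this twice over the same text should do no harm.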
Inclusions/Exclusions
You probably won't have to worry about this, but in an SGML DTD you can specify in an element declaration that another element is allowed anywhere inside of that element (or not allowed). This can be a pain if your target XML needs to parse to a DTD; XML DTDs do not allow inclusions/exclusions.
This is what an inclusion might look like:
<!ELEMENT chapter - - (section)+ +(revst|revend)>
This is saying that revst or revend are allowed anywhere inside of chapter. If the element declaration had -(revst|revend), it would mean that revst or revend is not allowed anywhere inside of chapter.
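If you do end up needing an XML DTD, a quick way to find the declarations that need rework is to scan for these trailing groups. The Python sketch below is only that, a scan; `find_exceptions` is a made-up name, and it assumes the names inside the group are separated by | as in the example above.

```python
import re

# Exceptions trail the content model: +(a|b) is an inclusion and
# -(a|b) is an exclusion. The leading whitespace keeps occurrence
# indicators such as (section)+ from being mistaken for an inclusion.
EXCEPTION_RE = re.compile(r"\s([+-])\(([^)]*)\)")


def find_exceptions(declaration):
    """Return (inclusions, exclusions) named in one element declaration."""
    found = {"+": [], "-": []}
    for sign, group in EXCEPTION_RE.findall(declaration):
        found[sign].extend(name.strip() for name in group.split("|"))
    return found["+"], found["-"]


decl = "<!ELEMENT chapter - - (section)+ +(revst|revend)>"
inclusions, exclusions = find_exceptions(decl)
print(inclusions, exclusions)
```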
Hope this helps.
Yeah, use Nokogiri.
Scroll down a bit on that page and copy the code under "Synopsis" into a file, say xml-parser.rb. Then, if you're on a Mac (Ruby comes already installed on Macs), from Terminal, run gem install nokogiri, and then run the file with: ruby xml-parser.rb.
You can also then type irb right from Terminal and then require 'nokogiri' and start playing around with the Nokogiri API in real time. Gotta love interactive Ruby. :)
If you're on Windows, try this Ruby installer for Windows.
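For anyone who would rather not install Ruby, the same "go straight to the element and ignore what's broken nearby" idea can be sketched with a forgiving event-based parser. The example below swaps in Python's standard-library html.parser, which is an HTML tag-soup parser and not an SGML or XML tool, so treat it as an illustration only. `ElementText` is a made-up class name. Note that html.parser lowercases tag names, and that elements whose names collide with HTML special cases such as script or style get HTML treatment.

```python
from html.parser import HTMLParser


class ElementText(HTMLParser):
    """Collect the text inside each occurrence of one element.

    Assumption: the wanted element is not nested inside itself.
    """

    def __init__(self, name):
        super().__init__(convert_charrefs=True)
        self.name = name.lower()  # html.parser reports tag names in lowercase
        self.inside = False
        self.buffer = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == self.name:
            self.inside = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == self.name and self.inside:
            self.inside = False
            self.results.append("".join(self.buffer))

    def handle_data(self, data):
        if self.inside:
            self.buffer.append(data)


parser = ElementText("Caption")
parser.feed("<Caption>Plants, and facilities<hardhyphen><hyphen>Inspection.</Caption>")
parser.close()
print(parser.results)
```

The unclosed hardhyphen and hyphen tags are simply skipped here, so they contribute nothing to the collected text; if they stand for a character, that mapping would have to be added in handle_starttag.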