libxml2 HTML parsing problems
I'm using libxml2 to parse HTML:
static htmlSAXHandler simpleSAXHandlerStruct = {
    NULL, /* internalSubset */
    NULL, /* isStandalone */
    NULL, /* hasInternalSubset */
    NULL, /* hasExternalSubset */
    NULL, /* resolveEntity */
    NULL, /* getEntity */
    NULL, /* entityDecl */
    NULL, /* notationDecl */
    NULL, /* attributeDecl */
    NULL, /* elementDecl */
    NULL, /* unparsedEntityDecl */
    NULL, /* setDocumentLocator */
    NULL, /* startDocument */
    NULL, /* endDocument */
    NULL, /* startElement */
    NULL, /* endElement */
    NULL, /* reference */
    charactersFoundSAX, /* characters */
    NULL, /* ignorableWhitespace */
    NULL, /* processingInstruction */
    NULL, /* comment */
    NULL, /* warning */
    errorEncounteredSAX, /* error */
    NULL, /* fatalError: unused, error() gets all the errors */
    NULL, /* getParameterEntity */
    NULL, /* cdataBlock */
    NULL, /* externalSubset */
    XML_SAX2_MAGIC, /* initialized */
    NULL, /* _private */
    startElementSAXP, /* startElementNs */
    endElementSAXP, /* endElementNs */
    NULL, /* serror */
};
The charactersFoundSAX and errorEncounteredSAX functions do get called, but the startElementSAXP and endElementSAXP functions never get called.
If I change the parsing from HTML to XML instead (and change all the definitions containing 'html' to 'xml', e.g. htmlSAXHandler into xmlSAXHandler), the functions do get called correctly.
Why is that?
HTML is not namespace-aware, so the HTML parser only invokes the SAX1 startElement/endElement callbacks. If you fill in just the startElementNs/endElementNs slots, they are never called, which is exactly the behaviour you observed.
Simple fix: fill in the startElement/endElement slots.
You can use thin wrappers to adapt the two different callback signatures, so that both end up calling one underlying function in XML and HTML mode alike.
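A minimal sketch of the wrapper idea, not a definitive implementation. The names ParseState, handleStart, handleEnd, startElementSAX1 and endElementSAX1 are made up for illustration, and the context pointer is assumed to be your own user data. So that the snippet compiles on its own, xmlChar is redeclared locally; in real code it comes from the libxml2 headers. Attribute handling is left out because the two callback styles lay attributes out differently.

```c
#include <stdio.h>

/* Stand-in so the sketch compiles without libxml2 installed.
   In real code, drop this and include <libxml/HTMLparser.h>. */
typedef unsigned char xmlChar;

/* Hypothetical user state, assumed to be passed as the SAX context. */
typedef struct {
    int depth;    /* current nesting level */
    int started;  /* number of start events seen */
    int ended;    /* number of end events seen */
} ParseState;

/* The one underlying pair of functions shared by both modes. */
static void handleStart(ParseState *st, const xmlChar *name)
{
    (void)name;   /* a real handler would inspect the element name */
    st->started++;
    st->depth++;
}

static void handleEnd(ParseState *st, const xmlChar *name)
{
    (void)name;
    st->depth--;
    st->ended++;
}

/* SAX1-shaped wrappers: these go into the startElement/endElement
   slots, which are the ones the HTML parser calls. */
static void startElementSAX1(void *ctx, const xmlChar *name,
                             const xmlChar **atts)
{
    (void)atts;   /* NULL-terminated name/value pairs, ignored here */
    handleStart((ParseState *)ctx, name);
}

static void endElementSAX1(void *ctx, const xmlChar *name)
{
    handleEnd((ParseState *)ctx, name);
}

/* SAX2-shaped wrappers: these stay in the startElementNs/endElementNs
   slots and are used in namespace-aware XML mode. */
static void startElementSAXP(void *ctx, const xmlChar *localname,
                             const xmlChar *prefix, const xmlChar *URI,
                             int nb_namespaces, const xmlChar **namespaces,
                             int nb_attributes, int nb_defaulted,
                             const xmlChar **attributes)
{
    (void)prefix; (void)URI; (void)nb_namespaces; (void)namespaces;
    (void)nb_attributes; (void)nb_defaulted; (void)attributes;
    handleStart((ParseState *)ctx, localname);
}

static void endElementSAXP(void *ctx, const xmlChar *localname,
                           const xmlChar *prefix, const xmlChar *URI)
{
    (void)prefix; (void)URI;
    handleEnd((ParseState *)ctx, localname);
}
```

In the handler table from the question, startElementSAX1 and endElementSAX1 would replace the two NULL entries commented startElement and endElement, while startElementSAXP and endElementSAXP stay where they are for XML mode.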