libxml2 HTML parsing problems

I'm using libxml2 to parse HTML:

static htmlSAXHandler simpleSAXHandlerStruct = {
    NULL,                       /* internalSubset */
    NULL,                       /* isStandalone   */
    NULL,                       /* hasInternalSubset */
    NULL,                       /* hasExternalSubset */
    NULL,                       /* resolveEntity */
    NULL,                       /* getEntity */
    NULL,                       /* entityDecl */
    NULL,                       /* notationDecl */
    NULL,                       /* attributeDecl */
    NULL,                       /* elementDecl */
    NULL,                       /* unparsedEntityDecl */
    NULL,                       /* setDocumentLocator */
    NULL,                       /* startDocument */
    NULL,                       /* endDocument */
    NULL,                       /* startElement*/
    NULL,                       /* endElement */
    NULL,                       /* reference */
    charactersFoundSAX,         /* characters */
    NULL,                       /* ignorableWhitespace */
    NULL,                       /* processingInstruction */
    NULL,                       /* comment */
    NULL,                       /* warning */
    errorEncounteredSAX,        /* error */
    NULL,                       /* fatalError: unused, error() gets all the errors */
    NULL,                       /* getParameterEntity */
    NULL,                       /* cdataBlock */
    NULL,                       /* externalSubset */
    XML_SAX2_MAGIC,             /* initialized */
    NULL,                       /* _private */
    startElementSAXP,           /* startElementNs */
    endElementSAXP,             /* endElementNs */
    NULL,                       /* serror */
};

The charactersFoundSAX and errorEncounteredSAX functions do get called, but the startElementSAXP and endElementSAXP functions never get called.

If I change the parsing from HTML to XML (and change all the definitions containing 'html' to 'xml', e.g. to xmlSAXHandler), the functions do get called correctly.

Why is that?


HTML is not namespace-aware, so libxml2's HTML parser only invokes the SAX1 startElement/endElement callbacks. The startElementNs/endElementNs slots are used only by the XML parser in SAX2 mode, which is why your handlers fire for XML but never for HTML.

Simple fix: fill in the startElement/endElement slots.

Because the two pairs of callbacks have different signatures, you can write thin wrappers for each and have them call a single underlying function, so the same logic runs in both XML and HTML mode.
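
As a rough illustration of the wrapper idea, here is a minimal sketch. The names ElementLog, handleStart, handleEnd and the four wrapper functions are hypothetical and not from the question. The local xmlChar typedef is only a stand-in so the snippet compiles on its own; in real code you would include <libxml/HTMLparser.h> instead, which already defines xmlChar as unsigned char. The wrapper signatures are meant to follow libxml2's startElementSAXFunc, endElementSAXFunc, startElementNsSAX2Func and endElementNsSAX2Func.

```c
#include <stdio.h>
#include <string.h>

/* Stand-in for the libxml2 type (assumption: replace this typedef with
   #include <libxml/HTMLparser.h> when building against libxml2). */
typedef unsigned char xmlChar;

/* Hypothetical user data, passed to the parser as the SAX context. */
typedef struct {
    int starts;     /* number of start tags seen */
    int ends;       /* number of end tags seen */
    char last[64];  /* name of the most recent element */
} ElementLog;

/* The single underlying handlers shared by both modes. */
static void handleStart(void *ctx, const xmlChar *name)
{
    ElementLog *log = (ElementLog *)ctx;
    log->starts++;
    snprintf(log->last, sizeof log->last, "%s", (const char *)name);
}

static void handleEnd(void *ctx, const xmlChar *name)
{
    ElementLog *log = (ElementLog *)ctx;
    log->ends++;
    snprintf(log->last, sizeof log->last, "%s", (const char *)name);
}

/* SAX1 wrappers: these go in the startElement/endElement slots,
   which are the ones the HTML parser calls. */
static void startElementSAX1(void *ctx, const xmlChar *name,
                             const xmlChar **atts)
{
    (void)atts;  /* attributes are ignored in this sketch */
    handleStart(ctx, name);
}

static void endElementSAX1(void *ctx, const xmlChar *name)
{
    handleEnd(ctx, name);
}

/* SAX2 wrappers: these go in the startElementNs/endElementNs slots,
   which the XML parser calls in SAX2 mode. */
static void startElementSAX2(void *ctx, const xmlChar *localname,
                             const xmlChar *prefix, const xmlChar *URI,
                             int nb_namespaces, const xmlChar **namespaces,
                             int nb_attributes, int nb_defaulted,
                             const xmlChar **attributes)
{
    (void)prefix; (void)URI; (void)nb_namespaces; (void)namespaces;
    (void)nb_attributes; (void)nb_defaulted; (void)attributes;
    handleStart(ctx, localname);
}

static void endElementSAX2(void *ctx, const xmlChar *localname,
                           const xmlChar *prefix, const xmlChar *URI)
{
    (void)prefix; (void)URI;
    handleEnd(ctx, localname);
}
```

In the handler struct from the question, startElementSAX1 and endElementSAX1 would replace the NULL entries in the startElement and endElement slots, while the SAX2 wrappers stay in startElementNs and endElementNs. If you also need attributes, note that the two interfaces lay them out differently: SAX1 passes a NULL-terminated array of name/value pairs, whereas SAX2 passes five pointers per attribute (localname, prefix, URI, value start, value end), so the wrappers would have to convert them into a common form.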
