Parse a badly formatted XML document (like an HTML file)

2023-01-30 07:26 问答作者：

After the parse, I would like to remove dangerous code and write it out again properly formatted.

The purpose is to prevent scripts from entering through an email but still allow the slew of bad HTML to work (at least not fail completely).

Is there a library for that? Is there a better way to keep scripts away from the browser?

The important thing is that the program not throw a Parse Exception. The program may make best guesses and ev开发者_Go百科en if it is wrong it will be acceptable.

Edit: I would appreciate any comments on which parsers y'all think are better and why.

For flexible parsing you might want to look at JSoup. But white-listing is the way to go here. If you just disallow a bunch of "dangerous" elements, someone will likely find a way to sneak something by your parser. Instead you should only allow a small list of safe elements.

Use one of available tools that convert the HTML to XHTML.

for example

http://www.chilkatsoft.com/java-html.asp

http://java-source.net/open-source/html-parsers

http://htmlcleaner.sourceforge.net/

etc

Then use regular XML parser.

I use the Jericho HTML parser for this purpose.

Somewhat tweaked version of their sanitizer example:

public class HtmlSanitizer {

private HtmlSanitizer() {
}

private static final Set<String> VALID_ELEMENTS = Sets.newHashSet(DIV, BR,
        P, B, I, OL, UL, LI, A, STRONG, SPAN, EM, TT, IMG);


private static final Set<String> VALID_ATTRIBUTES = Sets.newHashSet("id",
        "class", "href", "target", "title", "src");

private static final Object VALID_MARKER = new Object();

public static void sanitize(Reader r, Writer w) {
    try {
        sanitize(new Source(r)).writeTo(w);
        w.flush();
        r.close();
    } catch (IOException ioe) {
        throw new RuntimeException("error during sanitize", ioe);
    }
}

public static OutputDocument sanitize(Source source) {
    source.fullSequentialParse();
    OutputDocument doc = new OutputDocument(source);
    List<Tag> tags = source.getAllTags();
    int pos = 0;
    for (Tag tag : tags) {
        if (processTag(tag, doc))
            tag.setUserData(VALID_MARKER);
        else
            doc.remove(tag);
        reencodeTextSegment(source, doc, pos, tag.getBegin());
        pos = tag.getEnd();
    }
    reencodeTextSegment(source, doc, pos, source.getEnd());
    return doc;
}

private static boolean processTag(Tag tag, OutputDocument doc) {
    String elementName = tag.getName();
    if (!VALID_ELEMENTS.contains(elementName))
        return false;
    if (tag.getTagType() == StartTagType.NORMAL) {
        Element element = tag.getElement();
        if (HTMLElements.getEndTagRequiredElementNames().contains(
                elementName)) {
            if (element.getEndTag() == null)
                return false;
        } else if (HTMLElements.getEndTagOptionalElementNames().contains(
                elementName)) {
            if (elementName == HTMLElementName.LI && !isValidLITag(tag))
                return false;
            if (element.getEndTag() == null)
                doc.insert(element.getEnd(), getEndTagHTML(elementName));

        }
        doc.replace(tag, getStartTagHTML(element.getStartTag()));
    } else if (tag.getTagType() == EndTagType.NORMAL) {
        if (tag.getElement() == null)
            return false;
        if (elementName == HTMLElementName.LI && !isValidLITag(tag))
            return false;
        doc.replace(tag, getEndTagHTML(elementName));
    } else {
        return false;
    }
    return true;
}

private static boolean isValidLITag(Tag tag) {
    Element parentElement = tag.getElement().getParentElement();
    if (parentElement == null
            || parentElement.getStartTag().getUserData() != VALID_MARKER)
        return false;
    return parentElement.getName() == HTMLElementName.UL
            || parentElement.getName() == HTMLElementName.OL;
}

private static void reencodeTextSegment(Source source, OutputDocument doc,
        int begin, int end) {
    if (begin >= end)
        return;
    Segment textSegment = new Segment(source, begin, end);
    String encodedText = encode(decode(textSegment));
    doc.replace(textSegment, encodedText);
}

private static CharSequence getStartTagHTML(StartTag startTag) {
    StringBuilder sb = new StringBuilder();
    sb.append('<').append(startTag.getName());
    for (Attribute attribute : startTag.getAttributes()) {
        if (VALID_ATTRIBUTES.contains(attribute.getKey())) {
            sb.append(' ').append(attribute.getName());
            if (attribute.getValue() != null) {
                sb.append("=\"");
                sb.append(CharacterReference.encode(attribute.getValue()));
                sb.append('"');
            }
        }
    }
    if (startTag.getElement().getEndTag() == null
            && !HTMLElements.getEndTagOptionalElementNames().contains(
                    startTag.getName()))
        sb.append('/');
    sb.append('>');
    return sb;
}

private static String getEndTagHTML(String tagName) {
    return "</" + tagName + '>';
}

}

Have a look at http://nekohtml.sourceforge.net/ it has an inbuilt capability of tag balancing. Also checkout the custom filter section for Nekohtml http://nekohtml.sourceforge.net/filters.html#filters.removing . It is a very good html parser.

继续阅读：xml xml-parsing

Parse a badly formatted XML document (like an HTML file)

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？