A java.io.Reader class that can skip HTML tags?

I need to strip HTML out of large volumes of text. It would be cool if I could find a class that implements java.io.Reader that would wrap another Reader, and transform the text so as to omit all of the HTML tags (or maybe replace them with spaces). It would need to be able to deal with badly-formed HTML.

Performance is important. I need to process many gigabytes of text as fast as possible. The normal way to do this would be to read my HTML into a String, parse it into a DOM tree, and iterate over the nodes extracting text as I go. Unfortunately that's much too slow. I think the implementation is going to have to be based on some kind of low-level lexer.

Anyone know of a library that can do this?


I am assuming you want all of the text, so a hackish regex that gets most things is unsuitable. This means you need to go through at least the first part of parsing but want the library to do as little as possible after that.

You could use TagSoup, which gives you a nice low-level SAX interface. Ignore the tag events and just collect the character data of the text nodes. Easy, and about as fast as reasonably possible.
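
A minimal sketch of that idea, assuming a standard SAX ContentHandler: the handler below keeps only character data and ignores all element events. The class name TextCollector and the extract helper are made up for illustration. To keep the example runnable with nothing but the JDK, it is driven by the JDK's own SAX parser on well-formed markup; with TagSoup on the classpath, the XMLReader would instead be an org.ccil.cowan.tagsoup.Parser, which is what copes with real-world HTML.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects character data, ignores every element event.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    String text() {
        return text.toString();
    }

    // Hypothetical helper: returns the text content of well-formed markup.
    static String extract(String markup) {
        try {
            // Stand-in parser from the JDK. With TagSoup this would be:
            //   XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            TextCollector handler = new TextCollector();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(markup)));
            return handler.text();
        } catch (Exception e) {
            // Wrapped so that demo callers need no checked-exception handling.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract("<p>Hello <b>bold</b> world</p>"));
    }
}
```

For gigabytes of input you would write each characters() chunk straight to the output instead of accumulating it in a StringBuilder.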


I've used JTidy successfully in the past.

It does more than what you need, since it is essentially a DOM parser for real-world HTML. What's nice is that it is robust; it can handle quirks in the markup much like a browser would.


For speed, you'll probably want a streaming parser. Maybe Validator.nu?
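
To make the streaming idea concrete, here is a rough sketch of the kind of Reader wrapper the question describes. It is a hand-rolled two-state filter, not Validator.nu's API, and the name TagStrippingReader is invented for the example. It replaces every `<...>` run with a single space and filters each chunk in place, so nothing is buffered beyond the caller's array. It is deliberately naive: comments, script and style bodies, and entities get no special treatment, a stray `<` swallows text up to the next `>`, and skip() and mark() are not adapted.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical class: passes text through, turns each <...> run into one space.
class TagStrippingReader extends FilterReader {
    private boolean inTag = false; // state survives across read() calls

    TagStrippingReader(Reader in) {
        super(in);
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = in.read(buf, off, len);
            if (n == -1) {
                return -1;
            }
            // Compact the chunk in place; output never outruns input.
            int out = off;
            for (int i = off; i < off + n; i++) {
                char c = buf[i];
                if (inTag) {
                    if (c == '>') {
                        inTag = false;
                        buf[out++] = ' ';
                    }
                } else if (c == '<') {
                    inTag = true;
                } else {
                    buf[out++] = c;
                }
            }
            if (out > off) {
                return out - off;
            }
            // The whole chunk was tag content: read another chunk.
        }
    }

    @Override
    public int read() throws IOException {
        char[] one = new char[1];
        return read(one, 0, 1) == -1 ? -1 : one[0];
    }

    // Hypothetical convenience helper for small inputs and quick checks.
    static String strip(String html) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new TagStrippingReader(new StringReader(html))) {
            char[] buf = new char[4096];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("Hello <b>big</b> world"));
    }
}
```

In real use you would wrap a BufferedReader over the file and read in large chunks; whether this is robust enough for badly-formed HTML depends on how much of the junk (scripts, comments, unescaped `<`) your data actually contains.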


Maybe a ParserCallback would be faster than creating a DOM?

import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.parser.*;
import javax.swing.text.html.*;

public class ParserCallbackText extends HTMLEditorKit.ParserCallback
{
    // Called by the parser for each run of text between tags.
    @Override
    public void handleText(char[] data, int pos)
    {
        System.out.println( data );
    }

    public static void main(String[] args)
        throws Exception
    {
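        // Close the reader once parsing is done.
        try (Reader reader = getReader(args[0]))
        {
            ParserCallbackText parser = new ParserCallbackText();
            // true: ignore any charset declared inside the document.
            new ParserDelegator().parse(reader, parser, true);
        }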
    }

    static Reader getReader(String uri)
        throws IOException
    {
        // Retrieve from Internet.
        if (uri.startsWith("http:") || uri.startsWith("https:"))
        {
            URLConnection conn = new URL(uri).openConnection();
            return new InputStreamReader(conn.getInputStream());
        }
        // Retrieve from file.
        else
        {
            return new FileReader(uri);
        }
    }
}


The normal way would actually be to parse the HTML directly from a file, with no intermediate String wasting time and space. But, as the other posters have said, you would have to tidy the HTML first, with JTidy, NekoHTML, etc. From there I would probably use XSLT, though maybe not if extreme performance were required. You still have a choice of parsers: a SAX or StAX parser would be faster and more space-efficient than a DOM parser.
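
As a sketch of the StAX route, assuming the HTML has already been tidied into well-formed XHTML: the loop below pulls events from the JDK's javax.xml.stream API and keeps only the text events, so no tree is ever built. The class name StaxTextExtractor and its extract method are made up for the example.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical class: pulls text events out of already well-formed XHTML.
class StaxTextExtractor {

    static String extract(Reader xhtml) {
        StringBuilder text = new StringBuilder();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(xhtml);
            try {
                while (reader.hasNext()) {
                    int event = reader.next();
                    // Keep character data, skip start/end tags and the rest.
                    if (event == XMLStreamConstants.CHARACTERS
                            || event == XMLStreamConstants.CDATA) {
                        text.append(reader.getText());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped so that demo callers need no checked-exception handling.
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        Reader in = new StringReader("<p>Hello <b>bold</b> world</p>");
        System.out.println(extract(in));
    }
}
```

The pull style keeps memory use flat regardless of document size; the cost that remains is the tidying pass that has to run before the XML parser sees the input.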
