A java.io.Reader class that can skip HTML tags?

2023-04-05 15:00 Q&A author:

I need to strip HTML out of large volumes of text. It would be cool if I could find a class that implements java.io.Reader that would wrap another Reader, and transform the text so as to omit all of the HTML tags (or maybe replace them with spaces). It would need to be able to deal with badly-formed HTML.

Performance is important. I need to process many gigabytes of text as fast as possible. The normal way to do this would be to read my HTML into a String, parse it into a DOM tree, and iterate over the nodes extracting text as I go. Unfortunately that's much too slow. I think the implementation is going to have to be based on some kind of low-level lexer.

Anyone know of a library that can do this?

I am assuming you want all of the text, so a hackish regex that gets most things is unsuitable. This means you need to go through at least the first part of parsing but want the library to do as little as possible after that.

You could use TagSoup, which gives you a nice low-level SAX interface. Ignore the tags and just collect the values of the text nodes. Easy, and about as fast as reasonably possible.
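A rough sketch of the "ignore tags, collect text" idea, using only the SAX interfaces that ship with the JDK. The class name SaxTextCollector and the extract helper are made up for illustration. The demo uses the JDK's own XML parser as a stand-in, so it only accepts well-formed markup; with TagSoup on the classpath you would pass its org.ccil.cowan.tagsoup.Parser (which is an XMLReader) instead.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: a SAX handler that ignores every element event
// and only accumulates character data.
class SaxTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Called for each run of text between tags.
        text.append(ch, start, length);
    }

    String text() {
        return text.toString();
    }

    // Works with any XMLReader, so the parser can be swapped without
    // touching the handler.
    static String extract(XMLReader reader, InputSource source) throws Exception {
        SaxTextCollector collector = new SaxTextCollector();
        reader.setContentHandler(collector);
        reader.parse(source);
        return collector.text();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in parser from the JDK (well-formed input only).
        // Assumption: with TagSoup you would use
        //   XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        String html = "<html><body><p>Hello <b>world</b></p></body></html>";
        System.out.println(extract(reader, new InputSource(new StringReader(html))));
    }
}
```

For gigabytes of input you would probably write the text to an output stream inside characters() rather than buffering it in a StringBuilder.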

I've used JTidy successfully in the past.

It does more than what you need, since it is essentially a DOM parser for real-world HTML. What's nice is that it is robust; it can handle quirks in the markup much like a browser would.

For speed, you'll probably want a streaming parser. Maybe Validator.nu?
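To make "streaming" concrete, here is a naive sketch of the kind of Reader wrapper the question asks for. This is not Validator.nu's API, just a hand-rolled state machine, and TagSkippingReader and strip are made-up names. It replaces everything from '<' to the next '>' with a single space. It does not treat comments, script/style bodies, or entities specially, a '>' inside an attribute value ends the tag early, and a stray '<' in text swallows everything up to the next '>'.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical sketch: replaces each <...> run with one space.
class TagSkippingReader extends FilterReader {
    private boolean inTag = false; // true between '<' and the next '>'

    TagSkippingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            if (!inTag) {
                if (c != '<') {
                    return c;      // ordinary text passes through
                }
                inTag = true;      // start skipping
            } else if (c == '>') {
                inTag = false;
                return ' ';        // the whole tag becomes one space
            }
        }
        return -1;                 // end of input (an unclosed tag is dropped)
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) {
                break;
            }
            buf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }

    // Convenience helper for small inputs and tests.
    static String strip(String html) throws IOException {
        StringBuilder out = new StringBuilder();
        try (Reader r = new TagSkippingReader(new StringReader(html))) {
            int c;
            while ((c = r.read()) != -1) {
                out.append((char) c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(strip("<p>Hello <b>world</b></p>"));
    }
}
```

Since this reads one char at a time from the wrapped Reader, you would want to wrap the source in a BufferedReader first. skip() and mark() are inherited from FilterReader unchanged, so they bypass the tag state and should not be relied on.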

Maybe a ParserCallback would be faster than creating a DOM?

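import java.io.*;
import java.net.*;
import java.nio.charset.StandardCharsets;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class ParserCallbackText extends HTMLEditorKit.ParserCallback
{
    // Called by the parser for each run of text between tags.
    @Override
    public void handleText(char[] data, int pos)
    {
        System.out.println( data );
    }

    public static void main(String[] args)
        throws Exception
    {
        // try-with-resources closes the reader when parsing is done.
        try (Reader reader = getReader(args[0]))
        {
            ParserCallbackText parser = new ParserCallbackText();
            // true = ignore any charset declared inside the document.
            new ParserDelegator().parse(reader, parser, true);
        }
    }

    static Reader getReader(String uri)
        throws IOException
    {
        // Retrieve from Internet.
        if (uri.startsWith("http:") || uri.startsWith("https:"))
        {
            URLConnection conn = new URL(uri).openConnection();
            // Assumption: the content is UTF-8; adjust if it is not.
            return new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8);
        }
        // Retrieve from file.
        else
        {
            // Same UTF-8 assumption as above.
            return new InputStreamReader(new FileInputStream(uri), StandardCharsets.UTF_8);
        }
    }
}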

The normal way would actually be to parse the HTML directly from a file, with no intermediate time- and space-wasting String. But, as the other posters have said, you would have to tidy the HTML first, with JTidy, NekoHTML, etc. From there I would probably use XSLT, though maybe not if extreme performance were required. You still have a choice of parsers: a SAX or StAX parser would be faster and more space-efficient than a DOM parser.
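For the StAX option, a minimal sketch using the javax.xml.stream API from the JDK, assuming the input has already been tidied into well-formed XHTML. StaxText and extract are made-up names for illustration.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: pull-parse tidied XHTML and keep only the text.
class StaxText {
    static String extract(Reader xhtml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not process DTDs; they are not needed for text extraction.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLStreamReader r = factory.createXMLStreamReader(xhtml);
        StringBuilder out = new StringBuilder();
        try {
            while (r.hasNext()) {
                int event = r.next();
                // Element events are skipped; only text events are copied.
                if (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) {
                    out.append(r.getTextCharacters(), r.getTextStart(), r.getTextLength());
                }
            }
        } finally {
            r.close();
        }
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        String xhtml = "<html><body><p>Tidy <i>output</i></p></body></html>";
        System.out.println(extract(new StringReader(xhtml)));
    }
}
```

No tree is built at any point, which is where the space saving over a DOM comes from. Whether it is fast enough for your volume is something you would have to measure.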
