How to convert HTML to text keeping linebreaks

2022-12-24 11:28 问答作者：

How may I conv开发者_C百科ert HTML to text keeping linebreaks (produced by elements like br,p,div, ...) possibly using NekoHTML or any decent enough HTML parser

Example:

Hello<br/>World

to:

Hello\n  
World

Here is a function I made to output text (including line breaks) by iterating over the nodes using Jsoup.

public static String htmlToText(InputStream html) throws IOException {
    Document document = Jsoup.parse(html, null, "");
    Element body = document.body();

    return buildStringFromNode(body).toString();
}

private static StringBuffer buildStringFromNode(Node node) {
    StringBuffer buffer = new StringBuffer();

    if (node instanceof TextNode) {
        TextNode textNode = (TextNode) node;
        buffer.append(textNode.text().trim());
    }

    for (Node childNode : node.childNodes()) {
        buffer.append(buildStringFromNode(childNode));
    }

    if (node instanceof Element) {
        Element element = (Element) node;
        String tagName = element.tagName();
        if ("p".equals(tagName) || "br".equals(tagName)) {
            buffer.append("\n");
        }
    }

    return buffer;
}

w3m -dump -no-cookie input.html > output.txt

I did find a relatively clever solution in html2txt: THE ASCIINATOR which does an admirable job of producing nroff like output (e.g. like man ls run on a terminal). It produces output in the Markdown style that StackOverflow uses as input.

For moderately complex pages like this page, the output is somewhat scattered as it tries mightily to turn non-linear layout into something linear. The output from less complicated markup is pretty readable.

If you don't mind hard-wrapped/designed-for-monospace output, lynx -dump produces good plain text from HTML.

HTML to Text: I am taking this statement to mean that all HTML formatting, except line-breaks, will be abandoned.

What I have done for such a venture is using regexp to detect any set of tag enclosure. If the value within the tags are br or br/, a line-break is inserted, otherwise the tag is discarded.

It works only for simple html pages. Tables will obviously be linearised.

I had been thinking of detecting the title value between the title tag enclosure, so that the converter automatically places the title at the top of the page. Needs to put in a little more algorithm. By my time is better spent with ...

I am reading on using Google Data APIs to upload a document to Google Docs and then using the same API to download/export it as text. Or, why text, when I could do pdf. But you have to get a Google account if you don't already have one.

Google docs data download/export

Google docs data api for java

Does it matter what language you use? You could always use pattern matching. Basically HTML lien break tags (br,p,div, ...) you can replace with "\n" and remove all the other tags. You could always store the tags in an array so you can easily check when you go through the HTML file. Then any other tags and all the other end tags (/p,..) can be replaced with an empty string therefore getting your result.

How to convert HTML to text keeping linebreaks

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？