how to convert HTML text to plain text? [duplicate]

2023-01-14 12:45 问答作者：

This question already has answers here: 开发者_运维百科 Remove HTML tags from a String (35 answers) Closed 1 year ago.

friend's I have to parse the description from url,where parsed content have few html tags,so how can I convert it to plain text.

Yes, Jsoup will be the better option. Just do like below to convert the whole HTML text to plain text.

String plainText= Jsoup.parse(yout_html_text).text();

Just getting rid of HTML tags is simple:

// replace all occurrences of one or more HTML tags with optional
// whitespace inbetween with a single space character 
String strippedText = htmlText.replaceAll("(?s)<[^>]*>(\\s*<[^>]*>)*", " ");

But unfortunately the requirements are never that simple:

Usually, <p> and <div> elements need a separate handling, there may be cdata blocks with > characters (e.g. javascript) that mess up the regex etc.

You can use this single line to remove the html tags and display it as plain text.

htmlString=htmlString.replaceAll("\\<.*?\\>", "");

Use Jsoup.

Add the dependency

<dependency>
  <!-- jsoup HTML parser library @ https://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.13.1</version>
</dependency>

Now in your java code:

public static String html2text(String html) {
        return Jsoup.parse(html).wholeText();
    }

Just call the method html2text with passing the html text and it will return plain text.

Use a HTML parser like htmlCleaner

For detailed answer : How to remove HTML tag in Java

I'd recommend parsing the raw HTML through jTidy which should give you output which you can write xpath expressions against. This is the most robust way I've found of scraping HTML.

If you want to parse like browser display, use:

import net.htmlparser.jericho.*;
import java.util.*;
import java.io.*;
import java.net.*;

public class RenderToText {
    public static void main(String[] args) throws Exception {
        String sourceUrlString="data/test.html";
        if (args.length==0)
          System.err.println("Using default argument of \""+sourceUrlString+'"');
        else
            sourceUrlString=args[0];
        if (sourceUrlString.indexOf(':')==-1) sourceUrlString="file:"+sourceUrlString;
        Source source=new Source(new URL(sourceUrlString));
        String renderedText=source.getRenderer().toString();
        System.out.println("\nSimple rendering of the HTML document:\n");
        System.out.println(renderedText);
  }
}

I hope this will help to parse table also in the browser format.

Thanks, Ganesh

I needed a plain text representation of some HTML which included FreeMarker tags. The problem was handed to me with a JSoup solution, but JSoup was escaping the FreeMarker tags, thus breaking the functionality. I also tried htmlCleaner (sourceforge), but that left the HTML header and style content (tags removed). http://stackoverflow.com/questions/1518675/open-source-java-library-for-html-to-text-conversion/1519726#1519726

My code:

return new net.htmlparser.jericho.Source(html).getRenderer().setMaxLineLength(Integer.MAX_VALUE).setNewLine(null).toString();

The maxLineLength ensures lines are not artificially wrapped at 80 characters. The setNewLine(null) uses the same new line character(s) as the source.

I use HTMLUtil.textFromHTML(value) from

<dependency>
    <groupId>org.clapper</groupId>
    <artifactId>javautil</artifactId>
    <version>3.2.0</version>
</dependency>

Using Jsoup, I got all the text in the same line.

So I used the following block of code to parse HTML and keep new lines:

private String parseHTMLContent(String toString) {
    String result = toString.replaceAll("\\<.*?\\>", "\n");
    String previousResult = "";
    while(!previousResult.equals(result)){
        previousResult = result;
        result = result.replaceAll("\n\n","\n");
    }
    return result;
}

Not the best solution but solved my problem :)

how to convert HTML text to plain text? [duplicate]

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？