How to clean JTextPane/JEditorPane HTML content to a string in Java?
I am trying to get clean text content from a JTextPane. Here is some example code:
JTextPane textPane = new JTextPane ();
textPane.setContentType ("text/html");
textPane.setText ("This <b>is</b> a <b>test</b>.");
String text = textPane.getText ();
System.out.println (text);
The text looks like this in the JTextPane:
This is a test.
This is what gets printed to the console:
<html>
  <head>
  </head>
  <body>
    This <b>is</b> a <b>test</b>.
  </body>
</html>
I've tried substring() and/or replace(), but that is awkward to use:
String text = textPane.getText ().replace ("<html> ... <body>\n    ", "");
Is there a simple function that removes all tags other than <b> tags from the string, while keeping their content?
Sometimes JTextPane adds <p> tags around the content, so I want to get rid of those as well.
Like this:
<html>
  <head>
  </head>
  <body>
    <p style="margin-top: 0">
      hdfhdfgh
    </p>
  </body>
</html>
I want to get only the text content, with its <b> tags:
This <b>is</b> a <b>test</b>.
I subclassed HTMLWriter and overrode startTag and endTag to skip all tags outside of <body>.
I did not test it much, but it seems to work. One drawback is that the output string contains quite a lot of whitespace; getting rid of that shouldn't be too hard.
import java.io.*;
import javax.swing.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
public class Foo {
    public static void main(String[] args) throws Exception {
        JTextPane textPane = new JTextPane();
        textPane.setContentType("text/html");
        textPane.setText("<p>This</p> <b>is</b> a <b>test</b>.");
        StringWriter writer = new StringWriter();
        HTMLDocument doc = (HTMLDocument) textPane.getStyledDocument();
        HTMLWriter htmlWriter = new OnlyBodyHTMLWriter(writer, doc);
        htmlWriter.write();
        System.out.println(writer.toString());
    }
    private static class OnlyBodyHTMLWriter extends HTMLWriter {
        public OnlyBodyHTMLWriter(Writer w, HTMLDocument doc) {
            super(w, doc);
        }
        private boolean inBody = false;
        private boolean isBody(Element elem) {
            // copied from HTMLWriter.startTag()
            AttributeSet attr = elem.getAttributes();
            Object nameAttribute = attr
                    .getAttribute(StyleConstants.NameAttribute);
            HTML.Tag name = null;
            if (nameAttribute instanceof HTML.Tag) {
                name = (HTML.Tag) nameAttribute;
            }
            return name == HTML.Tag.BODY;
        }
        @Override
        protected void startTag(Element elem) throws IOException,
                BadLocationException {
            if (inBody) {
                super.startTag(elem);
            }
            if (isBody(elem)) {
                inBody = true;
            }
        }
        @Override
        protected void endTag(Element elem) throws IOException {
            if (isBody(elem)) {
                inBody = false;
            }
            if (inBody) {
                super.endTag(elem);
            }
        }
    }
}
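For the whitespace drawback mentioned above, one possible cleanup is to collapse every run of whitespace into a single space and trim the ends. This is only a sketch: the class name CollapseWhitespace and the method collapse are made up for illustration, and the approach assumes the content has no <pre> blocks or other whitespace-sensitive markup.

```java
public class CollapseWhitespace {
    // Hypothetical helper: replaces each run of whitespace (spaces, tabs,
    // newlines) with one space and strips leading/trailing whitespace.
    // Assumption: no <pre> content, where whitespace would be significant.
    public static String collapse(String html) {
        return html.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String raw = "\n    This <b>is</b> a <b>test</b>.\n  ";
        System.out.println(collapse(raw)); // prints: This <b>is</b> a <b>test</b>.
    }
}
```

In the example above it would be applied to writer.toString() after htmlWriter.write().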
You could use the HTML parser that JEditorPane itself uses, javax.swing.text.html.parser.ParserDelegator, together with an HTMLEditorKit.ParserCallback.
See this example, and the API docs.
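A rough sketch of that idea, in case the linked example is unavailable: feed the HTML to a ParserDelegator and supply a ParserCallback that copies the text and re-emits only <b> tags. The class name KeepBoldOnly and the method keepBold are invented for this illustration, and I have not checked how the parser treats whitespace in every case, so treat the result as a starting point.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class KeepBoldOnly {
    // Hypothetical helper: returns the text of the given HTML and keeps
    // only <b> start/end tags; every other tag is silently dropped.
    public static String keepBold(String html) throws IOException {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.B) {
                    out.append("<b>");
                }
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if (t == HTML.Tag.B) {
                    out.append("</b>");
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Text arrives with HTML entities already decoded.
                out.append(data);
            }
        };
        // The third argument tells the parser to ignore any charset directive.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return out.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head></head><body><p style=\"margin-top: 0\">"
                + "This <b>is</b> a <b>test</b>.</p></body></html>";
        System.out.println(keepBold(html));
    }
}
```

With a JTextPane, the input would be textPane.getText(). Because handleText delivers decoded text, a literal & or < in the content comes back unescaped, which may or may not be what you want.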
I found a solution to this problem using the substring and replace methods:
// Get the textPane content as a string
String text = textPane.getText();
// Take a substring to remove the wrapper tags (html, head, body).
// The offsets 44 and 19 are hard-coded and assume the exact wrapper shown above.
text = text.substring(44, text.length() - 19);
// Sometimes the program adds <p style="margin-top: 0"> and </p> tags, so remove them.
// This step is optional.
text = text.replace("<p style=\"margin-top: 0\">\n      ", "").replace("\n    </p>", "");
// Convert HTML escape sequences back to characters, for example &amp; -> &
text = StringEscapeUtils.unescapeHtml(text);
Here is a link to the StringEscapeUtils library, which converts escape sequences back to normal characters. Thanks to Ozhan Duz for the suggestion.
(commons-lang - download)
String text = textPane.getDocument().getText(0, textPane.getDocument().getLength());
 