Cleaning a string containing HTML/server-side tags in Java
I have a text like:
I've got a date with this fellow tomorrow. Well me and thousands of others. <br /><br /><img src="http://www.newwest.net/images/thumbnails_feature/barack_obama_westerners.jpg"><br /><br />Tomorrow morning I will be getting up at stupid o'clock and driving up to Manchester, NH to see Barak Obama speak. <br /><br />You all should come too!<br /><br /><a href="http://nh.barackobama.com/manchesterchange">RSVP for the event</a>
I would like to clean it to:
I've got a date with this fellow tomorrow. Well me and thousands of others http://www.newwest.net/images/thumbnails_feature/barack_obama_westerners.jpg Tomorrow morning I will be getting up at stupid o'clock and driving up to Manchester, NH to see Barak Obama speak.You all should come too! h**p://nh.barackobama.com/manchesterchange RSVP for the event
I would like to write a Java program for this. Any pointers/suggestions would be appreciated. The tags aren't limited to those in the above post; this was just an example.
Thanks!
PS: Replace *'s by t's in the second hyperlink as Stack Overflow doesn't allow me to post more than one link.
JTidy will do what you want. I just tried it by saving the block of text in your post as test.txt and running JTidy with these options:
java -jar jtidy-r938.jar -asxml test.txt >test.html
It produced the following well-formed XHTML:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator"
content="HTML Tidy for Java (vers. 2009-12-01), see jtidy.sourceforge.net" />
<title></title>
</head>
<body>
I've got a date with this fellow tomorrow. Well me and thousands of
others. <br />
<br />
<img
src="http://www.newwest.net/images/thumbnails_feature/barack_obama_westerners.jpg" /><br />
<br />
Tomorrow morning I will be getting up at stupid o'clock and driving
up to Manchester, NH to see Barak Obama speak. <br />
<br />
You all should come too!<br />
<br />
<a href="http://nh.barackobama.com/manchesterchange">RSVP for the
event</a>
</body>
</html>
If you use the API instead of the command line, you will be able to extract the bits you are interested in and discard the rest.
The simplest way of 'tidying' text that contains XML tags is to use a regular expression that matches anything that looks like a tag (i.e. a '<', a '>', and everything in between). Note that this works whether or not the XML is well-formed, because it removes every tag regardless of whether opening tags have matching closing tags.
For example,
String noXmlString = xmlString.replaceAll("\\<.*?\\>", "");
will remove all tags from a given string. The downside is that it won't preserve the image link nor the hyperlink as per your example. Hope this helps though!
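If the image and hyperlink URLs need to survive, one possible workaround is to substitute each img/a tag with its URL before stripping the remaining tags. The following is only a minimal sketch under some assumptions: the class name TagStripper and the method clean are made up for illustration, attribute values are assumed to be quoted, and no tag is assumed to contain a literal '>' inside an attribute. A real HTML parser is more robust for anything beyond simple input.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class TagStripper {
    // Matches an <img ...> or <a ...> tag and captures the quoted src/href value.
    private static final Pattern LINK_TAG = Pattern.compile(
            "<(?:img|a)\\b[^>]*?\\b(?:src|href)\\s*=\\s*[\"']([^\"']*)[\"'][^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Matches any remaining tag, including closing tags such as </a>.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    static String clean(String html) {
        // Step 1: replace each img/a tag with the URL it carries.
        String withUrls = LINK_TAG.matcher(html).replaceAll(" $1 ");
        // Step 2: replace every other tag with a space so words don't run together.
        String noTags = ANY_TAG.matcher(withUrls).replaceAll(" ");
        // Step 3: collapse runs of whitespace left behind by the removed tags.
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "You all should come too!<br /><br />"
                + "<a href=\"http://nh.barackobama.com/manchesterchange\">RSVP for the event</a>";
        System.out.println(clean(html));
    }
}
```

Replacing tags with a space rather than an empty string is a deliberate choice here: it keeps text on either side of a <br /> from being glued together, at the cost of a whitespace-collapsing pass at the end.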
Edited 11:58 04/04/10: Try this to remove HTML-encoded HTML tags (i.e. anything that starts with &lt; and ends with &gt;)...
String noHtmlHtmlString = htmlHtmlString.replaceAll("&lt;.+?&gt;", "");
Then to remove any other HTML-encoded/formatted bits like &quot; (i.e. anything that starts with & and ends with ;, with only word characters in between and no spaces or breaks) use
String noHtmlEncodingString = htmlEncodingString.replaceAll("&\\w+?;", "");
Any malformed HTML/XML beyond those can't be caught unless it follows a known pattern.
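As a rough illustration, the two replacements can be chained, assuming the input has its tags HTML-encoded (written as &lt;...&gt; rather than <...>). The class name EntityStripper and the method clean are hypothetical names used only for this sketch.

```java
// Hypothetical helper, not part of any library.
class EntityStripper {
    static String clean(String encoded) {
        // Remove HTML-encoded tags: everything from "&lt;" up to the nearest "&gt;".
        String noTags = encoded.replaceAll("&lt;.+?&gt;", "");
        // Remove remaining named entities such as &quot; or &amp;.
        return noTags.replaceAll("&\\w+?;", "");
    }

    public static void main(String[] args) {
        System.out.println(clean("&lt;b&gt;Hello&lt;/b&gt; &quot;world&quot;"));
    }
}
```

The order matters: if the entity pattern ran first, it would strip the &lt; and &gt; markers and leave the tag names behind as plain text. Also note that \w does not match '#', so numeric entities such as &#39; are left untouched by this pattern.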
I would check out an HTML parser such as JTidy. Despite its name it will parse HTML and provide a useful API to allow you to extract what you need.