How to display only the content of web pages (no tags or links) using a regular expression in a Java program
I searched for this specific question and could not find one. I am writing a Java program that analyses content from web pages, so I need a regular expression that can weed out all the links and tags (href, img, etc.), so that I can display only the plain content that is written and visible on the page. Thanks a lot.
To make the question more specific, here is my code:
URLConnection connection = wordURL.openConnection("http://en.wikipedia.org/wiki/Bloom_filter");
BufferedReader br = new BufferedReader(new InputStreamReader(connection.getInputStream()));
String line;
String word = "bloom filter";
String regexp2 = word;
Pattern pattern2 = Pattern.compile(regexp2);
String HTML_REGEX = "(<.+?>)+"; // as per your answer(Martijn Courteaux)
while ((line = br.readLine()) != null)
{
String content;
if ( (content = line.replaceAll(HTML_REGEX, "\n") )!= null)
{
Matcher matcher2 = pattern2.matcher(line);
if(matcher2.find())
{
System.out.println(line);
}
}
}
But unfortunately it still prints the paragraph tag (<p>) and also the <li> tag with some rubbish inside it, up to </li>. I would like to restrict the output to only the text where "bloom filter" is present. Thanks again.
HTML isn't a regular language, so you can't do what you want with a regex, but you can use jsoup.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.
In particular you might like the following which is outlined in one of the examples...
String html = "<div><p>Lorem ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();
String text = body.text(); // text content of the body, without tags
Don't use RegEx for HTML parsing. Use an HTML parser, for example HTML Parser or jsoup.
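To make the "use a parser" advice concrete without adding a dependency, here is a minimal sketch that uses the old HTML parser bundled with the JDK (javax.swing.text.html) instead of HTML Parser or jsoup. That substitution is my assumption for the sake of a self-contained example; the class name ParserTextExtractor and the method extractText are hypothetical. The bundled parser is dated, so for real-world pages a maintained library such as jsoup is probably the better choice.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class ParserTextExtractor {

    // Returns the text nodes reported by the parser, one per line.
    // Tags and attributes never reach handleText, so they are not in the result.
    public static String extractText(String html) {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append('\n');
            }
        };
        try {
            // The last argument tells the parser to ignore any charset declared
            // in the document, since the input is already a character stream.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String html = "<div><p>Lorem ipsum.</p><a href=\"x.html\">a link</a></div>";
        System.out.print(extractText(html));
    }
}
```

The callback only ever sees text, so there is no tag-matching pattern to get wrong.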
I know it isn't a good idea to use a regex on HTML, but if you really want to, this might help:
String HTML_REGEX = "<.+?>";
String yourHTML = "<html><body><h1>Lorem Ipsum</h1>" +
"<p>Lorem <i>Ipsum</i> dolorem sedet. Set nihil amat. " +
"<sub>I don't know the text</sub></p></body></html>";
String content = yourHTML.replaceAll(HTML_REGEX, "\n");
System.out.println(content);
prints:



Lorem Ipsum

Lorem 
Ipsum
 dolorem sedet. Set nihil amat. 
I don't know the text
As you can see, it will work, but it is definitely not what you want.
You can reduce the number of newlines by using this regex:
String HTML_REGEX = "(<.+?>)+";
String yourHTML = "<html><body><h1>Lorem Ipsum</h1>" +
"<p>Lorem <i>Ipsum</i> dolorem sedet. Set nihil amat. " +
"<sub>I don't know the text</sub></p></body></html>";
String content = yourHTML.replaceAll(HTML_REGEX, "\n");
System.out.println(content);
prints:

Lorem Ipsum
Lorem 
Ipsum
 dolorem sedet. Set nihil amat. 
I don't know the text
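The pattern copes with the tidy sample above, but it may help to see where it stops working. The following sketch (the class name RegexTagStripLimits and the sample inputs are my own, not from the question) runs the same "(<.+?>)+" replacement on three inputs: a well-formed fragment, a tag whose attribute value contains '>', and a line that holds only the first half of a tag, as can happen when a page is read with readLine().

```java
public class RegexTagStripLimits {

    static final String HTML_REGEX = "(<.+?>)+";

    // Same replacement as above: each run of tags becomes one newline.
    public static String strip(String html) {
        return html.replaceAll(HTML_REGEX, "\n");
    }

    public static void main(String[] args) {
        // Well-formed input: every tag is removed.
        System.out.println(strip("<p>Lorem <i>Ipsum</i></p>"));

        // A '>' inside an attribute value ends the match too early,
        // so part of the attribute leaks into the output.
        System.out.println(strip("<a title=\"x > y\">link</a>"));

        // Only the first half of a tag is on this line, so there is
        // no closing '>' to match and the text is returned unchanged.
        System.out.println(strip("<a href=\"page.html\""));
    }
}
```

In the second case the output still contains the fragment y">link, and in the third case the half tag is printed as it is. This is the kind of leftover "rubbish" a line-by-line regex can produce on real pages.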
I tried your code, and indeed it didn't work. After some editing, this worked:
URLConnection connection = new URL("http://en.wikipedia.org/wiki/Bloom_filter").openConnection();
BufferedReader br = new BufferedReader(new InputStreamReader(connection.getInputStream()));
String line;
String word = "bloom filter".toLowerCase();
String HTML_REGEX = "(<.+?>)+"; // as per your answer(Martijn Courteaux)
while ((line = br.readLine()) != null) {
String content;
if ((content = line.replaceAll(HTML_REGEX, "\n")) != null) {
if (content.toLowerCase().contains(word)) /* Changed: regex match -> contains() */
{
System.out.println(content); /* CHANGED: line -> content */
}
}
}
What you did wrong was:
- You didn't print content, but line, which of course still contains the tags.
- You tried to find the word "bloom filter" using a regex, which is case sensitive. So, just lowercase the strings and use String.contains(CharSequence target), which tells you whether the target string is part of the whole string.
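To try both fixes without a network connection, here is a small sketch that reads from an in-memory string instead of a URLConnection. The class name KeywordLineFilter, the method matchingLines, and the sample page are assumptions made for illustration. It also replaces tags with a space rather than "\n", which is my own change so that each stripped line stays on one line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class KeywordLineFilter {

    static final String HTML_REGEX = "(<.+?>)+";

    // Returns the tag-stripped lines that contain the word, ignoring case.
    public static List<String> matchingLines(String html, String word) {
        String needle = word.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new StringReader(html))) {
            String line;
            while ((line = br.readLine()) != null) {
                // Strip the tags first, then search and keep the stripped text.
                String content = line.replaceAll(HTML_REGEX, " ").trim();
                if (content.toLowerCase(Locale.ROOT).contains(needle)) {
                    hits.add(content);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        String page = "<h1>Bloom filter</h1>\n"
                + "<p>A <b>Bloom filter</b> is a probabilistic data structure.</p>\n"
                + "<li><a href=\"/wiki/Hash\">Hash function</a></li>\n";
        for (String hit : matchingLines(page, "bloom filter")) {
            System.out.println(hit);
        }
    }
}
```

With the sample page, the first two lines are kept and the list item about hash functions is dropped. Because the input is still handled one line at a time, a tag that spans several lines would not be removed.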