How to display only the content of web pages (no tags or links) using a regular expression in a Java program

I searched for this specific question and could not find one. I am writing a Java program that analyses content from web pages, so I need a regular expression that can weed out all the links and tags (href, img, etc.), so that I can display only the text that is actually written and visible on the page. Thanks a lot.

Hi, I wanted to make it more specific:

URLConnection connection = wordURL.openConnection("http://en.wikipedia.org/wiki/Bloom_filter");
BufferedReader br = new BufferedReader(new InputStreamReader(connection.getInputStream()));
String line;
String word = "bloom filter";
String regexp2 = word;
Pattern pattern2 = Pattern.compile(regexp2);
String HTML_REGEX = "(<.+?>)+"; // as per your answer(Martijn Courteaux)
while ((line = br.readLine()) != null)
{
       String content;
       if ( (content = line.replaceAll(HTML_REGEX, "\n") )!= null)
       {
              Matcher matcher2 = pattern2.matcher(line);
              if(matcher2.find())
              {
                   System.out.println(line);
              }
        }
 }

Unfortunately, it still prints the paragraph (<p>) tags and also <li> tags with some rubbish inside them. I would like to restrict the output to only the text where "bloom filter" is present. Thanks again.


HTML isn't a regular language, so you can't reliably do what you want with a regex, but you can use jsoup.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.

In particular you might like the following which is outlined in one of the examples...

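// Requires jsoup on the classpath (org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element).
String html = "<div><p>Lorem ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();
String text = body.text(); // visible text only, tags stripped: "Lorem ipsum."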


Don't use RegEx for HTML parsing. Use an HTML parser, for example HTML Parser or jsoup.
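If you just want to try the parser approach without adding a dependency, the JDK ships a lenient HTML parser under javax.swing.text.html. Below is a minimal sketch of that idea, not what either library above does internally. The class name VisibleText and the method extract are made up for illustration. The Swing parser is old, so for real-world pages a dedicated library is probably still the better choice.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of any library.
class VisibleText {

    // Collects the text chunks the parser reports and ignores all tags.
    static String extract(String html) {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                out.append(data).append('\n');
            }
        };
        try {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract("<div><p>Lorem <i>ipsum</i>.</p></div>"));
    }
}
```

Each text chunk ends up on its own line, so inline tags such as <i> still split a sentence. Appending a space instead of '\n' in handleText would keep sentences together.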


I know it isn't a good idea to use a regex on HTML, but if you really want to, this might help:

String HTML_REGEX = "<.+?>";
String yourHTML = "<html><body><h1>Lorem Ipsum</h1>" + 
                  "<p>Lorem <i>Ipsum</i> dolorem sedet. Set nihil amat. " + 
                  "<sub>I don't know the text</sub></p></body></html>";

String content = yourHTML.replaceAll(HTML_REGEX, "\n");
System.out.println(content);

prints:




Lorem Ipsum

Lorem 
Ipsum
dolorem sedet. Set nihil amat. 
I don't know the text




As you can see, it will work, but it is definitely not what you want.


You can reduce the number of newlines by using this regex:

String HTML_REGEX = "(<.+?>)+";
String yourHTML = "<html><body><h1>Lorem Ipsum</h1>" + 
                  "<p>Lorem <i>Ipsum</i> dolorem sedet. Set nihil amat. " + 
                  "<sub>I don't know the text</sub></p></body></html>";

String content = yourHTML.replaceAll(HTML_REGEX, "\n");
System.out.println(content);

prints:


Lorem Ipsum
Lorem 
Ipsum
dolorem sedet. Set nihil amat. 
I don't know the text
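
One caveat with both patterns: '.' does not match line terminators by default, so a tag that spans several lines slips through, and the contents of <script> and <style> elements are kept as if they were visible text. Here is a rough sketch of a slightly more careful version, assuming the whole page is held in a single string. The names TagStripper and strip are made up, and this is still a heuristic, not a parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class TagStripper {
    // (?is): case-insensitive, and '.' also matches line terminators,
    // so a whole <script>...</script> or <style>...</style> block is removed.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // [^>]* also crosses line breaks, so tags split over several lines match.
    private static final Pattern TAGS = Pattern.compile("(<[^>]*>)+");
    // Any run of whitespace that contains a newline collapses to one newline.
    private static final Pattern BLANK_RUNS = Pattern.compile("\\s*\\n\\s*");

    static String strip(String html) {
        String s = SCRIPT_OR_STYLE.matcher(html).replaceAll("\n");
        s = TAGS.matcher(s).replaceAll("\n");
        s = BLANK_RUNS.matcher(s).replaceAll("\n");
        return s.trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("<h1>Lorem Ipsum</h1>\n<p\n class=\"x\">Lorem <i>Ipsum</i></p>"));
    }
}
```

Comments, entities such as &amp;, and a literal '>' inside an attribute value are still not handled, which is the usual reason to prefer a parser.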


I tried your code, and indeed it didn't work. After some editing, this worked:

URLConnection connection = new URL("http://en.wikipedia.org/wiki/Bloom_filter").openConnection();
BufferedReader br = new BufferedReader(new InputStreamReader(connection.getInputStream()));
String line;
String word = "bloom filter".toLowerCase();
String HTML_REGEX = "(<.+?>)+"; // as per your answer(Martijn Courteaux)
while ((line = br.readLine()) != null) {
    String content;
    if ((content = line.replaceAll(HTML_REGEX, "\n")) != null) {
        if (content.toLowerCase().contains(word)) /* Changed: regex match -> contains() */
        {
            System.out.println(content); /* CHANGED: line -> content */
        }
    }
}

What you did wrong was:

  1. You printed line instead of content, and line of course still contains the tags.
  2. You tried to find the phrase "bloom filter" using a regex, which is case sensitive by default. So just lowercase both strings and use String.contains(CharSequence s), which tells you whether s occurs anywhere in the string.
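
If you would rather keep a regex for the search itself, case sensitivity can be switched off with a flag instead of lowercasing both strings. This is a minimal sketch, assuming the tag-stripped text is already available as one string. PhraseFilter and linesContaining are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class PhraseFilter {

    // Returns the (trimmed) lines of text that contain phrase, ignoring case.
    static List<String> linesContaining(String text, String phrase) {
        // Pattern.quote makes the phrase a literal, so characters such as
        // '.' or '(' in the search term are not read as regex syntax.
        Pattern p = Pattern.compile(Pattern.quote(phrase), Pattern.CASE_INSENSITIVE);
        List<String> hits = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (p.matcher(line).find()) {
                hits.add(line.trim());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "A Bloom filter is a data structure.\nSomething unrelated.";
        for (String hit : linesContaining(text, "bloom filter")) {
            System.out.println(hit);
        }
    }
}
```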