How to extract web page textual content in Java? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.

Closed 7 years ago.

I am looking for a way to extract the text from a web page (initially HTML) using the JDK or another library. Please help.

Thanks.


Use jsoup. This is currently the most elegant library for screen scraping.

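import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

URL url = new URL("http://example.com/");
Document doc = Jsoup.parse(url, 3*1000); // fetch and parse, with a 3-second timeout
String title = doc.title();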

I just love its CSS selector syntax.


Use an HTML parser if at all possible; there are many available for Java.

Or you can use regex like many people do. This is generally not advisable, however, unless you're doing very simplistic processing.
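For what it's worth, the JDK itself ships a lenient HTML parser in the Swing packages, so the parser route does not strictly need a third-party library. Below is a minimal sketch, assuming the markup is already in a String; the class name HtmlTextExtractor and the method extractText are made up for illustration. It simply collects every text run the parser reports, so text inside elements such as title may show up in the result too.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text runs reported by the JDK's Swing HTML parser.
public class HtmlTextExtractor {

    public static String extractText(String html) {
        StringBuilder sb = new StringBuilder();

        // The parser calls handleText once for each run of character data it finds.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (sb.length() > 0) {
                    sb.append(' '); // separate consecutive text runs with a space
                }
                sb.append(data);
            }
        };

        try {
            // The last argument (true) tells the parser to ignore any charset declaration in the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected for an in-memory reader, but parse() declares it.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<html><body><p>Hello</p><p>World</p></body></html>"));
    }
}
```

The Swing parser targets old HTML versions, so for real-world pages a dedicated library is probably still the better choice.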

Related questions

  • Java HTML Parsing
  • Which Html Parser is best?
  • Any good Java HTML parsers?
  • recommendations for a java HTML parser/editor
  • What HTML parsing libraries do you recommend in Java

Text extraction:

  • Text Extraction from HTML Java
  • Text extraction with java html parsers

Tag stripping:

  • Stripping HTML tags in Java
  • How to strip HTML attributes except “src” and “alt” in JAVA
  • Removing HTML from a Java String
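
For the "very simplistic processing" case, a regex-based tag stripper might look like the sketch below. The class name TagStripper and the method stripTags are hypothetical. It assumes small, well-formed fragments: it does not decode entities such as &amp;, it leaves the contents of script and style elements in place, and it can be fooled by a > inside an attribute value.

```java
// Hypothetical helper: naive tag stripping for simple, well-formed fragments only.
public class TagStripper {

    public static String stripTags(String html) {
        return html
                .replaceAll("<[^>]*>", " ") // replace anything that looks like a tag with a space
                .replaceAll("\\s+", " ")    // collapse runs of whitespace into one space
                .trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>World</b></p>")); // prints "Hello World"
    }
}
```

Anything beyond fragments like this is better handled by a real parser.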


Here's a short method, based on java.util.Scanner, that downloads a page and returns its raw HTML (tags included) as a string:

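import java.net.URL;
import java.util.Scanner;

public static String get(String url) throws Exception {
   StringBuilder sb = new StringBuilder();
   // try-with-resources closes the stream; assumes the page is encoded as UTF-8
   try (Scanner sc = new Scanner(new URL(url).openStream(), "UTF-8")) {
      while (sc.hasNextLine())
         sb.append(sc.nextLine()).append('\n');
   }
   return sb.toString();
}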

And this is how it is used:

public static void main(String[] args) throws Exception {
   System.out.println(get("http://www.yahoo.com"));
}