How to extract the textual content of a web page in Java? [closed]
Closed 7 years ago.
I am looking for a way to extract the text from a web page (initially HTML), using the JDK or another library. Please help.
Thanks
Use jsoup. This is currently the most elegant library for screen scraping.
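// Needs org.jsoup.Jsoup, org.jsoup.nodes.Document and java.net.URL.
URL url = new URL("http://example.com/");
Document doc = Jsoup.parse(url, 3 * 1000); // fetch and parse, 3-second timeout
String title = doc.title();
String text = doc.body().text(); // the page's text with the markup removed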
I just love its CSS selector syntax.
Use an HTML parser if at all possible; there are many available for Java.
Or you can use a regex, like many people do. This is generally not advisable, however, unless you're doing very simple processing.
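To make the regex caveat concrete, here is a minimal sketch of the tag-stripping approach using only java.util.regex. The class and method names (RegexTagStripper, stripTags) are made up for illustration and are not from any library. It copes with tidy markup but breaks on fairly ordinary input, which is why a parser is the safer choice.

```java
import java.util.regex.Pattern;

public class RegexTagStripper {
    // Drop <script> and <style> blocks first, since their content is not page text.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Naive definition of a tag: everything from '<' up to the next '>'.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Hypothetical helper: removes tags and collapses whitespace. */
    public static String stripTags(String html) {
        String noScripts = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        String noTags = TAG.matcher(noScripts).replaceAll(" ");
        return WHITESPACE.matcher(noTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Simple markup works, apart from the stray space before '!'.
        System.out.println(stripTags("<p>Hello, <b>world</b>!</p>"));
        // A '>' inside an attribute value already defeats the pattern.
        System.out.println(stripTags("<a title=\"a > b\">link</a>"));
    }
}
```

The second call leaks part of the attribute into the output, and entities such as &amp;amp; are left undecoded. Those are the kinds of cases a real parser handles for you.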
Related questions
- Java HTML Parsing
- Which Html Parser is best?
- Any good Java HTML parsers?
- recommendations for a java HTML parser/editor
- What HTML parsing libraries do you recommend in Java
Text extraction:
- Text Extraction from HTML Java
- Text extraction with java html parsers
Tag stripping:
- Stripping HTML tags in Java
- How to strip HTML attributes except “src” and “alt” in JAVA
- Removing HTML from a Java String
Here's a short method that fetches a page's raw HTML as a string (based on java.util.Scanner):
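import java.net.URL;
import java.util.Scanner;

public static String get(String url) throws Exception {
    StringBuilder sb = new StringBuilder();
    // UTF-8 is an assumption; try-with-resources closes the stream.
    try (Scanner sc = new Scanner(new URL(url).openStream(), "UTF-8")) {
        while (sc.hasNextLine()) // pair hasNextLine() with nextLine()
            sb.append(sc.nextLine()).append('\n');
    }
    return sb.toString();
}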
And this is how it is used:
public static void main(String[] args) throws Exception {
    System.out.println(get("http://www.yahoo.com"));
}
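Note that the method above returns the page's markup, not its text. If you want to stay inside the JDK, one possible follow-up step is to run the returned string through the HTML parser that ships with Swing. This is only a rough sketch: the class and method names (HtmlTextExtractor, extractText) are made up for illustration, and the Swing parser targets old HTML (3.2), so modern pages may not parse cleanly.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HtmlTextExtractor {

    /** Hypothetical helper: collects the text nodes the JDK parser reports. */
    public static String extractText(String html) throws IOException {
        final StringBuilder text = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int styleDepth = 0; // > 0 while inside a <style> element

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.STYLE) styleDepth++;
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.STYLE && styleDepth > 0) styleDepth--;
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Skip stylesheet content; separate text runs with a space.
                if (styleDepth == 0) text.append(data).append(' ');
            }
        };

        try (Reader reader = new StringReader(html)) {
            // 'true' tells the parser to ignore any charset declared in the markup.
            new ParserDelegator().parse(reader, callback, true);
        }
        // Collapse the whitespace left over from joining the text runs.
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(html));
    }
}
```

You would call it as extractText(get(url)). Because every text run is followed by a space, a word split by inline tags (wor&lt;b&gt;ld&lt;/b&gt;) comes out as two words. For anything beyond quick jobs, a dedicated library such as jsoup is likely to be more robust.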