
Query URL and return the contents of a specific HTML ID

I am looking to write a Java app which queries multiple URLs (defined by a list of URIs) for their HTML source and returns the contents of a specific element with a defined id on each page.

As an example, let's say one starts with a list of blog post URLs such as...

  • http://www.myblog.com/post_three
  • http://www.myblog.com/post_two

...now, if a sample page looks like the following...

<html>
<body>
    <div class="content">
        <h2 id="post_title">Post Title</h2>
        <p class="post_paragraph">Here is the content of my post.</p>
    </div>
</body>
</html>

How can I grab the contents of the "post_title" id for each of my URLs, and print it to the console with the classic System.out.print(String s)?

Thanks for all input.


First, fetch the page behind each URL using Java's URLConnection API:

http://download.oracle.com/javase/6/docs/api/java/net/URLConnection.html

Then you will need to parse the HTML

http://www.google.be/search?q=java+html+parser

And finally you will need to walk the parsed document structure (that will depend on the parser you choose) to find an element with the given id.
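The fetching step above can be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the class name PageFetcher, the method name fetch, the 10-second timeouts, and the assumption that pages are UTF-8 encoded are all made up for the example and do not come from the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: downloads the source behind a URI as one string.
final class PageFetcher {

    static String fetch(URI uri) throws IOException {
        URLConnection connection = uri.toURL().openConnection();
        // Timeouts are an assumption; without them a dead host can block forever.
        connection.setConnectTimeout(10_000);
        connection.setReadTimeout(10_000);

        StringBuilder html = new StringBuilder();
        // Assumes UTF-8; a real page may declare a different charset.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        // Each command-line argument is treated as one URL to download.
        for (String arg : args) {
            System.out.print(fetch(URI.create(arg)));
        }
    }
}
```

Run with the blog post URLs as arguments, this prints the raw HTML of each page. Extracting the element with the wanted id is still left to whichever parser you pick.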


Java includes built-in support for parsing HTML. Take a look at javax.swing.text.html.HTMLEditorKit: http://download.oracle.com/javase/6/docs/api/javax/swing/text/html/HTMLEditorKit.html

A couple of examples of how to use it:

http://java.sun.com/products/jfc/tsc/articles/bookmarks/

http://www.java2s.com/Tutorial/Java/0120_Development/ParseHTML.htm
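As a rough sketch of how the built-in parser could be applied to the question, the callback below collects the text inside the first element whose id matches. The names IdTextExtractor, IdCallback, and textOfId are invented for this example. The id comparison is case-insensitive as a precaution, since the Swing parser may lowercase some attribute values. Note too that the Swing parser targets an old HTML dialect, so it may struggle with modern pages.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: extracts the text of the first element with a given id.
final class IdTextExtractor {

    private static final class IdCallback extends HTMLEditorKit.ParserCallback {
        private final String id;
        private final StringBuilder text = new StringBuilder();
        private HTML.Tag target;   // tag type of the matched element, once found
        private int depth;         // open tags of that type inside the match
        private boolean done;      // true once the matched element has closed

        IdCallback(String id) {
            this.id = id;
        }

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (done) {
                return;
            }
            if (target == null) {
                Object value = attrs.getAttribute(HTML.Attribute.ID);
                if (value != null && id.equalsIgnoreCase(value.toString())) {
                    target = tag;
                    depth = 1;
                }
            } else if (tag.equals(target)) {
                depth++; // nested element of the same type, e.g. a div in a div
            }
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int pos) {
            if (target != null && !done && tag.equals(target) && --depth == 0) {
                done = true;
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            if (target != null && !done) {
                text.append(data);
            }
        }
    }

    /** Returns the text inside the element with the given id, or null if absent. */
    static String textOfId(Reader html, String id) throws IOException {
        IdCallback callback = new IdCallback(id);
        // true = ignore charset declarations found inside the document
        new ParserDelegator().parse(html, callback, true);
        return callback.target != null ? callback.text.toString() : null;
    }

    public static void main(String[] args) throws IOException {
        String sample = "<html><body><div class=\"content\">"
                + "<h2 id=\"post_title\">Post Title</h2>"
                + "<p class=\"post_paragraph\">Here is the content of my post.</p>"
                + "</div></body></html>";
        System.out.print(textOfId(new StringReader(sample), "post_title"));
    }
}
```

To cover the whole task, pass a Reader over each downloaded page to textOfId and print the result with System.out.print.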
