
How to work with HTML code read in Java?

I know how to read the HTML code of a website. For example, the following Java code reads all the HTML from http://www.transfermarkt.co.uk/en/fc-barcelona/startseite/verein_131.html, a page that lists all the football players of F.C. Barcelona.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class ReadWebPage {
  public static void main(String[] args) throws IOException {
    String urltext = "http://www.transfermarkt.co.uk/en/fc-barcelona/startseite/verein_131.html";
    URL url = new URL(urltext);
    BufferedReader in = new BufferedReader(new InputStreamReader(url
      .openStream()));
    String inputLine;

    while ((inputLine = in.readLine()) != null) {
      // Process each line.
      System.out.println(inputLine);
    }
    in.close();
  }
}

OK, but now I need to work with the HTML code: I need to obtain the names ("Valdés, Victor", "Pinto, José Manuel", etc.) and the positions (Goalkeeper, Defence, Midfield, Striker) of each player on the team. For example, I need to create an ArrayList<String> PlayerNames and an ArrayList<String> PlayerPositions and fill these lists with the names and positions of all the players.

How can I do it? I can't find a code example for this on Google. Code examples are welcome.

Thanks.


I would recommend using HtmlUnit, which will give you access to the DOM tree of the HTML page, and even execute JavaScript in case the data are dynamically put in the page using AJAX.

You could also use JSoup: it does not execute JavaScript, but it is more lightweight and supports CSS selectors.


I think the best approach is first to clean up the HTML code into valid XHTML and then apply an XSL transformation. To retrieve specific pieces of information you can use XPath expressions. The best available HTML tag balancer is, in my opinion, NekoHTML (http://nekohtml.sourceforge.net/).
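The XPath step can be sketched with the JDK's own javax.xml.xpath classes. This is only a minimal sketch under stated assumptions: the input is taken to be already well-formed XHTML (for example, after a tag balancer has cleaned it), it evaluates XPath directly instead of running a full XSL transformation, and the class name XPathExtract, the helper select, and the "name"/"position" class attributes in the sample markup are made up for illustration. The real page's markup will differ, so the expressions would need adjusting.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: evaluates an XPath expression against well-formed XHTML.
class XPathExtract {

    // Returns the trimmed text content of every node matched by the expression.
    static List<String> select(String xhtml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (ParserConfigurationException | SAXException
                | IOException | XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample markup; the class attributes are assumptions, not the real page.
        String xhtml = "<table>"
                + "<tr><td class=\"name\">Vald\u00e9s, Victor</td><td class=\"position\">Goalkeeper</td></tr>"
                + "<tr><td class=\"name\">Pinto, Jos\u00e9 Manuel</td><td class=\"position\">Goalkeeper</td></tr>"
                + "</table>";

        List<String> playerNames = select(xhtml, "//td[@class='name']");
        List<String> playerPositions = select(xhtml, "//td[@class='position']");
        System.out.println(playerNames);
        System.out.println(playerPositions);
    }
}
```

Running the two expressions separately yields two lists in document order, which matches the two ArrayLists asked for in the question.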


You might like to take a look at htmlparser

I used this for something similar.

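Usage is something like this:

import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;

Parser fullWebpage = new Parser("WEBADDRESS");
// Keep only the nodes whose tag name matches.
NodeList nl = fullWebpage.extractAllNodesThatMatch(new TagNameFilter("<insert html tag>"));
// Search those nodes recursively for the <a> elements they contain.
NodeList tds = nl.extractAllNodesThatMatch(new TagNameFilter("a"), true);
String data = tds.toHtml();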


Java has its own built-in HTML parser. A positive feature of this parser is that it is error tolerant and will assume some tags even if they are missing or misspelled. Although it lives in javax.swing.text.html.parser, it actually shares nothing with Swing (and with text only as much as HTML is text). Use ParserDelegator. You need to write a callback for use with this parser; otherwise it is not complex to use. The code example (written as a ParserDelegator test) can be found here. Some say it is a remnant of the HotJava browser. The only problem with it is that it does not seem to have been updated for the most recent versions of HTML.

A simple code example would be:

Reader reader; // read HTML from somewhere, e.g. new InputStreamReader(url.openStream())
HTMLEditorKit.ParserCallback callback = new MyCallBack(); // Extend that class.
ParserDelegator delegator = new ParserDelegator();
// The last argument is ignoreCharSet.
delegator.parse(reader, callback, false);
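For illustration, here is one way the callback could look. It is a minimal sketch, not the code from the linked test: the class name LinkTextCollector and the helper collect are hypothetical, and it assumes the wanted text (such as player names) sits inside <a> elements, which may not hold for the real page. It passes true for ignoreCharSet so that a charset declaration in the page's <meta> tag does not interrupt parsing when a Reader is supplied.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback: collects the text found inside every <a> element.
class LinkTextCollector extends HTMLEditorKit.ParserCallback {
    private final List<String> linkTexts = new ArrayList<>();
    private StringBuilder current; // non-null only while inside an <a> element

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.A) {
            current = new StringBuilder();
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (current != null) {
            current.append(data);
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.A && current != null) {
            linkTexts.add(current.toString().trim());
            current = null;
        }
    }

    // Parses the HTML from the reader and returns the link texts in document order.
    static List<String> collect(Reader reader) {
        LinkTextCollector callback = new LinkTextCollector();
        try {
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.linkTexts;
    }

    public static void main(String[] args) {
        // Made-up sample markup standing in for the downloaded page.
        String html = "<html><body><table>"
                + "<tr><td><a href=\"#\">Vald\u00e9s, Victor</a></td><td>Goalkeeper</td></tr>"
                + "<tr><td><a href=\"#\">Pinto, Jos\u00e9 Manuel</a></td><td>Goalkeeper</td></tr>"
                + "</table></body></html>";
        System.out.println(collect(new StringReader(html)));
    }
}
```

To pick up the positions as well, the same pattern could track <td> elements instead of <a>, deciding by column index or attributes which list each cell belongs to.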


I've found a link that is just what you were looking for: http://tiny-url.org/work_with_html_java
