Java - Read a website and NOT the source

OK so I redefined my last program... here it is:

import java.io.BufferedReader; 
import java.io.InputStreamReader;
import java.net.URL; 
import java.net.URLConnection;


public class asp {
    public static void main(String[] args) {
        try {
            URL game = new URL("http://localhost/mystikrpg/post.php?players");
            URLConnection connection = game.openConnection();
            BufferedReader in = new BufferedReader(new
            InputStreamReader(connection.getInputStream()));
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
            }
            in.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The problem? When I run it... I get the WHOLE page... EVEN THE SOURCE CODE, from the opening html tag all the way to the closing body and html tags.

When really... all I want it to output is the 1.... The only way I can see to do it is to split the string between <body> and </body>...

Meh. Help?


Quoting the question: "The problem? When I run it... I get the WHOLE page... EVEN THE CODE SOURCE such as the beginning of the html tag all the way to the end of the body and html tag."

Well, that's basically what an HTML page is, so that's what you get. Now, if you don't want to parse the content manually, use an HTML parser. There are many of them, but I would recommend Jsoup, one of the most elegant libraries available (clean and nice API, jQuery-like CSS selectors, non-verbose element iteration, etc). Demo:

import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupDemo {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost/mystikrpg/post.php?players");
        // Fetch and parse the page, with a 3-second timeout.
        Document doc = Jsoup.parse(url, 3 * 1000);

        // text() returns the body's text content with the markup stripped.
        String text = doc.body().text();

        System.out.println(text); // outputs 1
    }
}

Look Ma, no hands!

PS: As a side note, I must say that I agree with some other answers here: you should maybe consider producing something other than HTML, such as XML, JSON or even raw text (at least as an alternative to the HTML version if you really need it).


Unless you have control over post.php and are able to make it return just what you need without the HTML tags (a la web services), you will have to parse the HTML document returned by it.

Use an HTML parser; regular expressions are not very reliable for this.


Rough Snippet to parse the <body> tag with HTMLParser:

(Make sure to include htmlparser.jar)

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;    
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.tags.BodyTag;    

public class HTMLParserTest {   
    public static String grabBodyTag (String url) {
        if(!url.startsWith("http://")){url = "http://" + url;}      
        Parser parser = new Parser();               
        TagNameFilter filter = new TagNameFilter("body");       
        try {
            parser.setResource(url);
            NodeList list = parser.parse(filter);
            Node node = list.elementAt(0);          
            if (node instanceof BodyTag) {
                BodyTag tag = (BodyTag) node;
                return   tag.toPlainTextString(); //other formats are available
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }       
        return "found no body tag...";
    }   
    public static void main(String... args){
        System.out.println(grabBodyTag("google.com"));
    }

}

This gives a String with "Web Images Videos Maps News Books Gmail more..." [omitted]. In your case it will return a String containing "1", possibly with surrounding whitespace (as your pastebin shows), so you have to trim it before converting it to a number.
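The trim-and-convert step could look like the following minimal sketch. The class name PlayerCount and the method parseCount are made up for illustration, and the input is assumed to be the plain body text with nothing but whitespace around the digits; Integer.parseInt throws a NumberFormatException for anything else.

```java
public class PlayerCount {
    // Strips leading/trailing whitespace (spaces, tabs, newlines) and parses the rest.
    // Note: trim() does not remove non-breaking spaces (U+00A0).
    static int parseCount(String bodyText) {
        return Integer.parseInt(bodyText.trim());
    }

    public static void main(String[] args) {
        // Simulated body text, padded with whitespace the way a page might return it.
        System.out.println(parseCount("  1 \n")); // prints 1
    }
}
```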

Closing note: making a post.php that contains only the following code will make your life much easier, if you don't need that script for anything other than returning this result.

<?php
$number = 1; // or whatever logic gets it.
echo $number;
?>


When you request a page you get the source. This is what's expected and normal. You'll have to parse this source to extract the content.
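For a page as simple as this one, the parsing step can be sketched with plain string searches. This is only a minimal sketch, not a general HTML parser: the names BodySlice and bodyOf are made up for illustration, and it assumes a single <body>...</body> pair with no '>' inside the tag's attribute values. Any markup nested inside the body is returned untouched, so a real parser is the safer choice for less predictable pages.

```java
public class BodySlice {
    // Case-insensitive indexOf, since the tag may be written <BODY> or <body>.
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns the trimmed content between <body ...> and </body>, or null if either is missing.
    static String bodyOf(String html) {
        int open = indexOfIgnoreCase(html, "<body", 0);
        if (open < 0) return null;
        int start = html.indexOf('>', open);          // end of the opening tag
        if (start < 0) return null;
        int end = indexOfIgnoreCase(html, "</body>", start);
        if (end < 0) return null;
        return html.substring(start + 1, end).trim();
    }

    public static void main(String[] args) {
        String page = "<html><head><title>players</title></head>\n<body>\n 1 \n</body></html>";
        System.out.println(bodyOf(page)); // prints 1
    }
}
```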


Scraping stuff out of an HTML-formatted response is unpleasant, and can make your code fragile.

Maybe the webapp / website you are trying to talk to has other ways to deliver the responses; e.g. in XML or JSON format.

Getting responses in an alternative format might entail setting an appropriate Accept header on the HTTP request, adding some extra parameter to the query, or changing the path.

  • Check the web API documentation for the webapp / website to see if there is any mention of this.
  • Or check the webapp source code ... if you have it.
  • Or if this is your code, consider changing it to support XML, JSON or even ad hoc text responses. (If you take this route, it would be a good idea to read up on media types and set the appropriate one in the "Content-Type" header of your responses.)
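Setting the Accept header with the URLConnection class from the question could look like the sketch below. The class name AcceptHeaderSketch and the method prepare are illustrative, and whether post.php actually honours an Accept of text/plain is an assumption; that depends entirely on how the script is written. The sketch only prepares the request and does not open a network connection.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;

public class AcceptHeaderSketch {
    // Builds a connection that asks for the given media type.
    // openConnection() does not contact the server; that happens on getInputStream().
    static URLConnection prepare(String address, String mediaType) {
        try {
            URLConnection connection = new URL(address).openConnection();
            connection.setRequestProperty("Accept", mediaType);
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection connection =
                prepare("http://localhost/mystikrpg/post.php?players", "text/plain");
        // Show the header that will be sent; the response would then be read
        // from connection.getInputStream() as in the original program.
        System.out.println("Accept: " + connection.getRequestProperty("Accept"));
    }
}
```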


When you retrieve a web page, what the server sends you is everything between the HTML tags, and more.

I think what you are looking for is an HTML parser, which will let you extract content from the web page. First you retrieve the web page as you are currently doing, then run the output through the parser, instructing the parser to extract the part that you want.

Here are some HTML parsers:

  • Swing HTML Parser - article shows how to use Java's Swing library to do some HTML parsing
  • HTML Parser
  • Java Mozilla HTML Parser
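
As a rough sketch of the first option in the list, the parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator) can collect the text inside <body> without any extra jars. The names SwingBodyText and bodyText are made up for illustration, and the page is assumed to have already been read into a String, for example with the BufferedReader loop from the question.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SwingBodyText {
    // Returns the text found between <body> and </body>, with the markup dropped.
    static String bodyText(String html) {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean inBody = false;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.BODY) inBody = true;
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.BODY) inBody = false;
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Ignore text outside the body, such as the <title>.
                if (inBody) out.append(data);
            }
        };
        try {
            // The third argument tells the parser to ignore any charset declared in the page.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        String page = "<html><head><title>players</title></head><body> 1 </body></html>";
        System.out.println(bodyText(page));
    }
}
```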