Why can I only get the HTML for the homepage of websites and not others?
I am writing a Java program that connects to a website and returns its HTML, but I am having problems with it. Right now I can only access a website if I do this:
String host = "www.google.com"; // example
But if I try to access a URL that is any more complicated, I get an UnknownHostException. At first I thought it might have something to do with certain characters in the URL not being recognized, but I'm not sure. For example, here is one of the URLs I'm trying to access:
host = "http://www.cyberspacei.com/englishwiz/library/name/etymology_of_first_names.htm";
int port = 80;
Socket s = new Socket(host, port);
....etc
and it won't return anything but an UnknownHostException.
Somebody please help me!!!
It is failing because the Socket constructor expects a hostname, not a URL like the one you are passing. If you want the document at that URL, use the URL class:
URL url = new URL("http://www.thesite.com/thefile.html");
Object doc = url.getContent();
Of course, you need to replace that "Object doc" with something that is prepared to store that content, such as a file.
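To make that a bit more concrete, here is a minimal sketch that reads the text behind a URL through url.openStream() instead of getContent(). The class name PageFetcher and the method name fetch are made up for illustration, and the code assumes the page is UTF-8 text, which a real site may not be.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
class PageFetcher {

    // Reads everything the URL points to and returns it as one String.
    // Assumption: the content is UTF-8 text; a real page may declare another charset.
    public static String fetch(URL url) throws IOException {
        StringBuilder html = new StringBuilder();
        try (InputStream in = url.openStream();
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            // readLine() strips the line terminator, so add a newline back.
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }
}
```

You would call it as String html = PageFetcher.fetch(new URL("http://www.thesite.com/thefile.html")); and the URL class takes care of the hostname lookup and the HTTP request for you.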
The "host" parameter for the Socket object specifies which machine to connect to on the network (internet). This is different from a URI used in a web browser which includes the protocol, server, and the directory structure of the file or object being requested.
Socket s = new Socket("www.cyberspacei.com", 80); will open a raw socket to the web server running on that machine (note that the port is an int, not a String). It is then up to you to speak HTTP over that socket and request "/englishwiz/library/name/etymology_of_first_names.htm".
You might save yourself some headaches by using a library such as HttpClient, which takes a lot of the legwork out of the HTTP negotiation, as long as you don't need raw access to the HTTP stream.
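If you do want to stay with raw sockets, one way to get the pieces out of a full URL is java.net.URI. This is only a sketch: the UrlParts class and its fields are names I made up, it falls back to port 80 when the URL has no explicit port, and it ignores any query string.

```java
import java.net.URI;

// Hypothetical holder for the parts of a URL that a raw Socket needs.
class UrlParts {
    final String host;
    final int port;
    final String path;

    UrlParts(String host, int port, String path) {
        this.host = host;
        this.port = port;
        this.path = path;
    }

    // Splits an http URL into host, port and path. The query string is ignored.
    static UrlParts parse(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        // getPort() returns -1 when no port is written in the URL; 80 is the HTTP default.
        int port = uri.getPort() == -1 ? 80 : uri.getPort();
        String path = uri.getRawPath();
        // A URL such as http://www.google.com has an empty path; request "/" in that case.
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        return new UrlParts(host, port, path);
    }
}
```

With the URL from the question, parse(...) gives you the host to pass to new Socket(host, port) and the path to put in the request line.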
http://hc.apache.org/httpclient-3.x/index.html
I'm not an expert in the field of Java, but I know what went wrong.
First, the host variable should contain only the host part of the URL.
The host of the URL http://www.cyberspacei.com/englishwiz/library/name/etymology_of_first_names.htm
is actually 'www.cyberspacei.com'
So you connect to the host, then send an HTTP request for the page you are looking for:
GET /englishwiz/library/name/etymology_of_first_names.htm HTTP/1.0
Host: www.cyberspacei.com
Accept: */*
Connection: close
The request ends with a blank line. Some web pages may need User-Agent or Referer headers to work, so add those fields as appropriate.
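In Java, sending a request like the one above over a Socket could look roughly like this. Treat it as a sketch: RawHttpGet, buildRequest and fetch are names I picked, there is no error handling or redirect support, and the response is returned raw, with the status line and headers still in front of the HTML.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical example class for a hand-written HTTP/1.0 GET.
class RawHttpGet {

    // Builds the request text. HTTP wants CRLF line endings and a blank
    // line after the last header to mark the end of the request.
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
                + "Host: " + host + "\r\n"
                + "Accept: */*\r\n"
                + "Connection: close\r\n"
                + "\r\n";
    }

    // Connects to host:port, sends the request and returns the raw response
    // (status line, headers and body) as one String.
    static String fetch(String host, int port, String path) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, path).getBytes(StandardCharsets.US_ASCII));
            out.flush();

            // Assumption: ISO-8859-1 is good enough to decode the response;
            // the real charset is announced in the Content-Type header.
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.ISO_8859_1));
            StringBuilder response = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                response.append(line).append('\n');
            }
            return response.toString();
        }
    }
}
```

For the page in the question that would be RawHttpGet.fetch("www.cyberspacei.com", 80, "/englishwiz/library/name/etymology_of_first_names.htm"). Note that only the hostname goes into the Socket; the rest of the URL goes into the GET line.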
@ONi is right here. You're using the Socket class, which means you're working with raw sockets and have to write the HTTP requests yourself. You want something more like the URL class, because that class 'understands' HTTP and just gives you the content of the page.
It's like the difference between printing out and reading an email from your computer (the URL class) vs. sticking the Ethernet cord in your mouth and trying to decipher the signals with your tongue. The Socket class is too low-level for what you're doing.