HTML scraping a website whose authentication details I have
I'm using the following code to get the HTML source of a specific URL:
import java.io.*;
import java.net.*;

public class SourceViewer {
    public static void main(String[] args) throws IOException {
        System.out.print("Enter url of local for viewing html source code: ");
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        String url = br.readLine();
        try {
            URL u = new URL(url);
            HttpURLConnection uc = (HttpURLConnection) u.openConnection();

            // Print the status line and all response headers.
            int code = uc.getResponseCode();
            String response = uc.getResponseMessage();
            System.out.println("HTTP/1.x " + code + " " + response);
            for (int j = 1; ; j++) {
                String header = uc.getHeaderField(j);
                String key = uc.getHeaderFieldKey(j);
                if (header == null || key == null)
                    break;
                System.out.println(key + ": " + header);
            }

            // Print the response body character by character.
            InputStream in = new BufferedInputStream(uc.getInputStream());
            Reader r = new InputStreamReader(in);
            int c;
            while ((c = r.read()) != -1) {
                System.out.print((char) c);
            }
        } catch (MalformedURLException ex) {
            System.err.println(url + " is not a valid URL.");
        } catch (IOException ie) {
            System.out.println("Input/Output Error: " + ie.getMessage());
        }
    }
}
This code works with Wikipedia and other sites, but for my URL it doesn't. For example:
INPUT:
Enter url of local for viewing html source code: http://ntu-edu-sg.campuspack.eu/Groups/SC207-SOFTWARE_ENGINEERING/WikiCPE207_Template_0/Week_11_Software_Testing
OUTPUT:
HTTP/1.x 403 Forbidden
Set-Cookie: ARPT=LWYYVUShyp1CKIQY; path=/
X-Powered-By: Servlet/2.5
Server: Sun GlassFish Enterprise Server v2.1
Set-Cookie: UGROUTE=4c5e7101a68101c06a712650c7352d98; Path=/
P3P: CP="ALL DSP COR CUR ADMa DEVa TAIa PSAa PSDa IVAa IVDa OUR BUS UNI COM NAV INT CNT STA PRE"
Set-Cookie: UG=zc2qAfg{; Path=/
Cache-Control: no-store, no-cache, must-revalidate
Pragma: no-cache
Expires: 0
X-Powered-By: JSF/1.2
X-Powered-By: JSF/1.2
Content-Type: text/html;charset=UTF-8
Content-Language: en-US
Transfer-Encoding: chunked
Date: Tue, 22 Feb 2011 16:09:48 GMT
Input/Output Error: Server returned HTTP response code: 403 for URL: http://ntu-edu-sg.campuspack.eu/Groups/SC207-SOFTWARE_ENGINEERING/WikiCPE207_Template_0/Week_11_Software_Testing
Response code 403 indicates that the server is denying me permission to scrape. I do have the authentication details required to log on, and if I try to access the URL from the browser, a window pops up asking me to redirect to the parent site. I was wondering if there is some way to make this window pop up from my code.
To circumvent the authentication problem, I tried logging in from the browser and then running the code while I was still logged in. However, on running the code, I get the same output. This confuses me, since copy-pasting the URL into another tab of the browser after logging in does not ask for authentication details but simply displays the data, implying that I already have permission. Can someone please advise me on how to scrape the URL?
First, you need to use a real, full-featured HTTP client (such as Apache HttpClient) that will handle the redirects and the authentication cookies the server sets before the redirect; you need something that emulates what the browser is doing. HttpURLConnection isn't going to be able to do that for you in this case.
A good place to start diagnosing what you need to send is something like Firebug and LiveHttpHeaders in Firefox, or Tools -> Developer Tools in Chrome, so you can see exactly which headers are involved, how the redirect works, and which cookies the server sets and expects to be present when the redirect happens.
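As a rough illustration, here is a minimal sketch using Apache HttpClient 4.x (assumed to be on the classpath). The class name and the use of LaxRedirectStrategy are my own choices, not anything specific to this site; the point is the cookie store and redirect handling that HttpURLConnection is not giving you:

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.client.LaxRedirectStrategy;
import org.apache.http.util.EntityUtils;

public class CookieAwareFetch {
    public static void main(String[] args) throws Exception {
        // The cookie store remembers every Set-Cookie the server sends, so the
        // cookies from the first response are replayed after the redirect.
        BasicCookieStore cookieStore = new BasicCookieStore();
        CloseableHttpClient client = HttpClientBuilder.create()
                .setDefaultCookieStore(cookieStore)
                .setRedirectStrategy(new LaxRedirectStrategy()) // follow redirects like a browser
                .build();

        HttpGet get = new HttpGet("http://ntu-edu-sg.campuspack.eu/Groups/"
                + "SC207-SOFTWARE_ENGINEERING/WikiCPE207_Template_0/Week_11_Software_Testing");
        CloseableHttpResponse response = client.execute(get);
        try {
            System.out.println(response.getStatusLine());
            System.out.println(EntityUtils.toString(response.getEntity()));
        } finally {
            response.close();
            client.close();
        }
    }
}

On its own this will still return 403 if the page requires a login; it only gives you the cookie and redirect handling that the browser provides for free.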
You're trying to emulate a browser with a very simplistic scraping program. In order to authenticate with the server, you need to use a library such as HttpClient to submit the form on the login page. You then need to maintain your session details so that every request your program makes to the website identifies it as authenticated.
Signing into the website using your browser and then running the program won't work, because the browser's private details that identify you (while using the browser) are different from the details used to identify your program.
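A minimal sketch of that flow with Apache HttpClient 4.x might look like the following. The login URL and the "username"/"password" field names are hypothetical; you would need to inspect the site's actual login form (action URL, field names, any hidden tokens) and substitute the real values:

import java.util.ArrayList;
import java.util.List;

import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class LoginThenScrape {
    public static void main(String[] args) throws Exception {
        CloseableHttpClient client = HttpClients.createDefault(); // keeps session cookies

        // 1. Submit the login form (hypothetical URL and field names).
        HttpPost login = new HttpPost("https://ntu-edu-sg.campuspack.eu/login");
        List<NameValuePair> form = new ArrayList<NameValuePair>();
        form.add(new BasicNameValuePair("username", "your-user"));
        form.add(new BasicNameValuePair("password", "your-password"));
        login.setEntity(new UrlEncodedFormEntity(form));
        CloseableHttpResponse loginResponse = client.execute(login);
        EntityUtils.consume(loginResponse.getEntity()); // discard body, keep the session cookie
        loginResponse.close();

        // 2. Request the protected page with the same client, so the session cookie is sent.
        HttpGet page = new HttpGet("http://ntu-edu-sg.campuspack.eu/Groups/"
                + "SC207-SOFTWARE_ENGINEERING/WikiCPE207_Template_0/Week_11_Software_Testing");
        CloseableHttpResponse pageResponse = client.execute(page);
        try {
            System.out.println(EntityUtils.toString(pageResponse.getEntity()));
        } finally {
            pageResponse.close();
            client.close();
        }
    }
}

The important design point is that both requests go through the same client instance, so the session cookie issued at login is automatically sent with the request for the protected page.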