Browser doesn't support frames
I am trying to write a Java program that logs in to an Achievo instance using screen scraping.
I manage to log in using the following code:
import java.io.BufferedReader;
import java.io.DataOutputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Iterator;
import javax.net.ssl.HttpsURLConnection;
import org.junit.Test;

@Test
public void testLogin() throws Exception {
    HashMap<String, String> data = new HashMap<String, String>();
    data.put("auth_user", "user");
    data.put("auth_pw", "password");
    doSubmit("https://someurl.com/achievo/index.php", data);
}

private void doSubmit(String url, HashMap<String, String> data) throws Exception {
    URL siteUrl = new URL(url);
    HttpsURLConnection conn = (HttpsURLConnection) siteUrl.openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    conn.setDoInput(true);
    //conn.setRequestProperty( "User-agent", "spider" );
    //conn.setRequestProperty("User-agent", "Opera/9.80 (X11; Linux i686; U; en) Presto/2.7.62 Version/11.01");
    conn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322; .NET CLR 1.2.30703)");
    DataOutputStream out = new DataOutputStream(conn.getOutputStream());

    // Build the URL-encoded form body: key1=value1&key2=value2
    Iterator<String> keyIter = data.keySet().iterator();
    StringBuilder content = new StringBuilder();
    for (int i = 0; keyIter.hasNext(); i++) {
        String key = keyIter.next();
        if (i != 0) {
            content.append("&");
        }
        content.append(URLEncoder.encode(key, "UTF-8"))
               .append("=")
               .append(URLEncoder.encode(data.get(key), "UTF-8"));
    }
    System.out.println(content.toString());
    out.writeBytes(content.toString());
    out.flush();
    out.close();

    // Print the response body line by line.
    BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    String line;
    while ((line = in.readLine()) != null) {
        System.out.println(line);
    }
    in.close();
}
However, when Achievo logs me in successfully, I get redirected to the main page, which contains:
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<title>Achievo</title>
</head>
<frameset rows="113,*" frameborder="0" border="0">
<frame name="top" scrolling="no" noresize src="top.php?atklevel=-1&atkprevlevel=0&achievo=37b552462afdfd248a21fedbf0eebe43" marginwidth="0" marginheight="0">
<frameset cols="210,*" frameborder="0" border="0">
<frame name="menu" scrolling="no" noresize src="menu.php?atklevel=-1&atkprevlevel=0&achievo=37b552462afdfd248a21fedbf0eebe43" marginwidth="0" marginheight="0">
<frame name="main" scrolling="auto" noresize src="dispatch.php?atknodetype=pim.pim&atkaction=pim&atklevel=-1&atkprevlevel=0&achievo=37b552462afdfd248a21fedbf0eebe43" marginwidth="0" marginheight="0">
</frameset>
<noframes>
<body bgcolor="#CCCCCC" text="#000000">
<p>Your browser doesnt support frames, but this is required to run Achievo</p>
</body>
</noframes>
</frameset>
As expected, I get the message: Your browser doesnt support frames, but this is required to run Achievo.
I have tried to access the dispatch.php frame directly, as this is probably what I want. However, it reports that my session has expired and that I need to log in again.
Is there some way to fake a frame? Or to somehow keep the connection, change the URL, and fetch the dispatch.php frame?
Using HtmlUnit, I have done the following:
WebClient webClient = new WebClient(BrowserVersion.FIREFOX_3);
HtmlPage page = webClient.getPage("https://someurl.com/index.php");
System.out.println(page.asXml());
List<HtmlForm> forms = page.getForms();
assertTrue(forms != null && !forms.isEmpty());
HtmlForm form = forms.get(0);
HtmlSubmitInput submit = form.getInputByName("login");
HtmlInput inputUsername = form.getInputByName("auth_user");
HtmlInput inputPw = form.getInputByName("auth_pw");
inputUsername.setValueAttribute("foo");
inputPw.setValueAttribute("bar");
HtmlPage page2 = submit.click();
CookieManager cookieManager = webClient.getCookieManager();
Set<Cookie> cookies = cookieManager.getCookies();
System.out.println("Is cookie " + cookieManager.isCookiesEnabled());
for (Cookie cookie : cookies) {
    System.out.println(cookie.toString());
}
System.out.println(page2.asXml());
webClient.closeAllWindows();
Here I fetch the form, submit it, and get the same message back. When I print the cookies, I can see that I have one. Now the question is, how do I proceed to get the dispatch.php frame using the logged-in cookie?
This kind of scraping is a bit complicated; there are several factors to think about.
- Does the Achievo app set any cookies? If so, you will need to accept them and send them with the next request.
- By the looks of things, you will need to parse that HTML page and extract the URL of the frame you wish to load. I suspect you're getting the session-expired message because you're not sending a cookie or something similar. Make sure you use the exact URL provided in the FRAMESET.
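The frame-extraction step described above could be sketched roughly as follows. This is a minimal sketch using only the JDK: the class and method names (`FrameSrcExtractor`, `frameSrc`) are made up for illustration, and the regex assumes double-quoted attributes with `name` appearing before `src`, as in the frameset shown in the question. A real HTML parser would be more robust.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Achievo or any library.
class FrameSrcExtractor {

    /**
     * Returns the absolute URL of the frame with the given name, or null if no
     * such frame is found. Assumes markup like <frame name="main" ... src="...">.
     */
    static String frameSrc(String html, String frameName, String baseUrl) {
        Pattern p = Pattern.compile(
                "<frame\\b[^>]*\\bname=\"" + Pattern.quote(frameName)
                        + "\"[^>]*\\bsrc=\"([^\"]*)\"",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        if (!m.find()) {
            return null;
        }
        // Some servers HTML-escape '&' inside attributes; undo that before resolving.
        String src = m.group(1).replace("&amp;", "&");
        // Resolve the relative src against the URL of the page that contained it.
        return URI.create(baseUrl).resolve(src).toString();
    }

    public static void main(String[] args) {
        // Shortened stand-in for the frameset returned after login.
        String html = "<frameset rows=\"113,*\">"
                + "<frame name=\"top\" scrolling=\"no\" noresize src=\"top.php?achievo=abc\">"
                + "<frame name=\"main\" scrolling=\"auto\" noresize "
                + "src=\"dispatch.php?atknodetype=pim.pim&atkaction=pim&achievo=abc\">"
                + "</frameset>";
        System.out.println(frameSrc(html, "main", "https://someurl.com/achievo/index.php"));
    }
}
```

Resolving against the login page's URL keeps the query string intact, including the `achievo=...` session parameter, which matters if the server checks it.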
I suggest using the Apache HttpClient module, which is more fully featured than the standard Java URL provider and can manage things like cookies for you.
You'll have to extract the URL of the main frame (dispatch.php?atknodetype=pim.pim&atkaction=pim&atklevel=-1&atkprevlevel=0&achievo=37b552462afdfd248a21fedbf0eebe43) and make a second request to this URL. If cookies are used to track sessions, you'll also have to send the cookies contained in the response to your login request.
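One way to carry the cookies from the login response over to the second request without leaving the JDK is `java.net.CookieManager`. The sketch below is an illustration under assumptions, not a tested Achievo client: the names `SessionFetch`, `install`, and `fetch` are hypothetical, it assumes the session is tracked with an ordinary `Set-Cookie` header, and it decodes the body as ISO-8859-1 only because the page in the question declares that charset. Once a default `CookieHandler` is installed, `HttpURLConnection` and `HttpsURLConnection` store and resend cookies on their own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Achievo or any library.
class SessionFetch {

    /** Installs a JVM-wide cookie store. Call once, before the login request. */
    static CookieManager install() {
        // ACCEPT_ALL is a permissive choice for a scraper; tighten it if needed.
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(manager);
        return manager;
    }

    /** GETs a URL; cookies stored from earlier responses are attached automatically. */
    static String fetch(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.ISO_8859_1))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }
}
```

Usage would be roughly: call `SessionFetch.install()`, run the login POST (for example the `doSubmit` method from the question), then call `SessionFetch.fetch(...)` with the exact frame URL taken from the frameset.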
I would use a higher-level API to do this (like Apache HttpClient), or even a programmatic browser like HtmlUnit.