Browser doesn't support frames

I am trying to write a Java program that logs in to an Achievo instance by screen scraping.

I managed to log in using the following code:

@Test
public void testLogin() throws Exception {
    HashMap<String, String> data = new HashMap<String, String>();
    data.put("auth_user", "user");
    data.put("auth_pw", "password");
    doSubmit("https://someurl.com/achievo/index.php", data);
}

private void doSubmit(String url, HashMap<String, String> data) throws Exception {
    URL siteUrl = new URL(url);
    HttpsURLConnection conn = (HttpsURLConnection) siteUrl.openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    conn.setDoInput(true);
    //conn.setRequestProperty( "User-agent", "spider" );
    //conn.setRequestProperty("User-agent", "Opera/9.80 (X11; Linux i686; U; en) Presto/2.7.62 Version/11.01");

    conn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322; .NET CLR 1.2.30703)");

    DataOutputStream out = new DataOutputStream(conn.getOutputStream());

    Set<String> keys = data.keySet();
    Iterator<String> keyIter = keys.iterator();
    StringBuilder content = new StringBuilder("");
    for(int i=0; keyIter.hasNext(); i++) {
        Object key = keyIter.next();
        if(i!=0) {
            content.append("&");
        }
        content.append(key + "=" + URLEncoder.encode(data.get(key), "UTF-8"));
    }
    System.out.println(content.toString());

    out.writeBytes(content.toString());
    out.flush();
    out.close();
    BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
    String line = "";
    while((line=in.readLine())!=null) {
        System.out.println(line);
    }
    in.close();
}

However, when Achievo logs me in successfully, I get redirected to the main page, which returns:

<head>
    <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
    <title>Achievo</title>
  </head>
    <frameset rows="113,*" frameborder="0" border="0">
    <frame name="top" scrolling="no" noresize src="top.php?atklevel=-1&atkprevlevel=0&achievo=37b552462afdfd248a21fedbf0eebe43" marginwidth="0" marginheight="0">
    <frameset cols="210,*" frameborder="0" border="0">
      <frame name="menu" scrolling="no" noresize src="menu.php?atklevel=-1&atkprevlevel=0&achievo=37b552462afdfd248a21fedbf0eebe43" marginwidth="0" marginheight="0">
      <frame name="main" scrolling="auto" noresize src="dispatch.php?atknodetype=pim.pim&atkaction=pim&atklevel=-1&atkprevlevel=0&achievo=37b552462afdfd248a21fedbf0eebe43" marginwidth="0" marginheight="0">
    </frameset>
    <noframes>
      <body bgcolor="#CCCCCC" text="#000000">
        <p>Your browser doesnt support frames, but this is required to run Achievo</p>
      </body>
    </noframes>
  </frameset>

Obviously I get the "Your browser doesnt support frames, but this is required to run Achievo" message.

I have tried to access the dispatch.php frame directly, as this is probably what I want. However, it reports that my session has expired and that I need to log in again.

Is there some way to fake a frame? Or to somehow keep the connection, change the URL, and try to get the dispatch.php frame?


Using HtmlUnit, I have done the following:

WebClient webClient = new WebClient(BrowserVersion.FIREFOX_3);
HtmlPage page = webClient.getPage("https://someurl.com/index.php");
System.out.println(page.asXml());

List<HtmlForm> forms = page.getForms();
assertTrue(forms != null && !forms.isEmpty());

HtmlForm form = forms.get(0);
HtmlSubmitInput submit = form.getInputByName("login");
HtmlInput inputUsername = form.getInputByName("auth_user");
HtmlInput inputPw = form.getInputByName("auth_pw");

inputUsername.setValueAttribute("foo");
inputPw.setValueAttribute("bar");

HtmlPage page2 = submit.click();

CookieManager cookieManager = webClient.getCookieManager();
Set<Cookie> cookies = cookieManager.getCookies();
System.out.println("Is cookie " + cookieManager.isCookiesEnabled());

for(Cookie cookie : cookies) {
    System.out.println(cookie.toString());
}

System.out.println(page2.asXml());
webClient.closeAllWindows();

Here I get the form, submit it, and receive the same message. When I also print the cookies, I can see that I have one. Now the question is: how do I proceed to get the dispatch.php frame using the logged-in cookie?


This kind of scraping is a bit complicated; there are several factors to think about.

  1. Does the Achievo app set any cookies? If so, you will need to accept them and send them with the next request.
  2. By the looks of things, you will need to parse that HTML page and extract the frame you wish to load. I suspect you're getting back a session expired message because you're not sending a cookie or something like that. You need to make sure you use the exact URL provided in the FRAMESET.

I suggest using the Apache HttpClient module which is a bit more fully-featured than the standard Java URL provider, and can manage things like cookies for you.
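To illustrate the cookie point without pulling in HttpClient, here is a minimal sketch that uses the JDK's own java.net.CookieManager instead (a swapped-in, stdlib-only technique, not the HttpClient API). Once a cookie jar is installed as the default handler, later URLConnection/HttpsURLConnection requests in the same JVM should send back any cookies stored from earlier responses. The class name SessionCookieDemo, the cookie name "achievo" and its value are assumptions made up for the demo; the login response is simulated so the example runs offline.

```java
import java.io.IOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; not part of Achievo or any library.
class SessionCookieDemo {

    // Installs a JVM-wide cookie jar. URLConnection-based requests made
    // after this call store response cookies in it and replay them later.
    static CookieManager installCookieJar() {
        CookieManager jar = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(jar);
        return jar;
    }

    // Returns the cookie headers the jar would attach to a request for url.
    static Map<String, List<String>> cookiesFor(CookieManager jar, String url)
            throws IOException {
        return jar.get(URI.create(url),
                Collections.<String, List<String>>emptyMap());
    }

    public static void main(String[] args) throws IOException {
        CookieManager jar = installCookieJar();

        // Simulate the Set-Cookie header of a login response.
        // Assumption: the session cookie is called "achievo".
        jar.put(URI.create("https://someurl.com/achievo/index.php"),
                Collections.singletonMap("Set-Cookie",
                        Collections.singletonList("achievo=abc123; Path=/achievo")));

        // The stored cookie is offered again for the frame URL.
        System.out.println(
                cookiesFor(jar, "https://someurl.com/achievo/dispatch.php"));
    }
}
```

In the question's doSubmit approach, calling installCookieJar() once before the login POST would probably be enough for a follow-up request to dispatch.php to carry the session cookie, provided Achievo tracks the session through cookies at all.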


You'll have to extract the URL of the main frame (dispatch.php?atknodetype=pim.pim&atkaction=pim&atklevel=-1&atkprevlevel=0&achievo=37b552462afdfd248a21fedbf0eebe43) and make a second request to this URL. If cookies are used to track sessions, you'll also have to send the cookies contained in the response to your login request.

I would use a higher-level API to do this (like Apache HttpClient), or even a programmatic browser like HtmlUnit.
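As a rough sketch of the "extract the URL of the main frame" step, the following stdlib-only helper pulls the src of a named frame out of the frameset HTML and resolves it against the page URL. The class and method names (FrameUrlExtractor, frameUrl) are hypothetical, and the regular expressions assume double-quoted attributes as in the markup shown in the question; a real HTML parser would be more robust.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes double-quoted attributes.
class FrameUrlExtractor {
    // "\b" keeps this from matching <frameset ...> tags.
    private static final Pattern FRAME_TAG =
            Pattern.compile("<frame\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern NAME_ATTR =
            Pattern.compile("\\bname\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern SRC_ATTR =
            Pattern.compile("\\bsrc\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Finds the frame called frameName and returns its absolute URL, if any.
    // URI.resolve throws IllegalArgumentException for a malformed src.
    static Optional<URI> frameUrl(String html, String frameName, URI pageUrl) {
        Matcher tags = FRAME_TAG.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            Matcher name = NAME_ATTR.matcher(tag);
            Matcher src = SRC_ATTR.matcher(tag);
            if (name.find() && name.group(1).equals(frameName) && src.find()) {
                // Undo HTML escaping of "&" before treating the value as a URL.
                String href = src.group(1).replace("&amp;", "&");
                return Optional.of(pageUrl.resolve(href));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<frameset rows=\"113,*\">"
                + "<frame name=\"top\" src=\"top.php?achievo=abc\">"
                + "<frame name=\"main\" scrolling=\"auto\" noresize "
                + "src=\"dispatch.php?atkaction=pim&achievo=abc\">"
                + "</frameset>";
        URI page = URI.create("https://someurl.com/achievo/index.php");
        System.out.println(frameUrl(html, "main", page));
    }
}
```

The resolved URL is what the second request should target, with the login cookies attached. With HtmlUnit, the same step can most likely be done without any parsing through page2.getFrameByName("main").getEnclosedPage(), since the WebClient already holds the session cookies.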
