
post a form using jsdom and node.js

I am using jsdom, jQuery and node.js to scrape websites. Is there any way I can post a form and get the resulting next page's window using jsdom?

Here is the code

var httpAgent = require('http-agent'),
    jsdom = require('jsdom'),
    request = require('request');

request({uri:'http://www.orbitz.com'}, function(error, response, body){
  // bail out on a transport error or on any non-200 response
  if(error || response.statusCode != 200) {
    console.log('Error on request');
    return;
  }

  jsdom.env({
    html: body,
    scripts: [
      'http://code.jquery.com/jquery-1.5.min.js'
    ]
  }, function(err, window) {
    var $ = window.jQuery;

    $('#airOneWay').attr('checked', true);
    $('#airRoundTrip').removeAttr('checked');
    $('#airOrigin').val('ATL');
    $('#airDestination').val('CHI');

    // here we need to submit the form $('#airbotForm') and get the resulting window
    //console.log($('#airbotForm').html());
  });
});

The form that needs to be submitted is $('#airbotForm'), and the resulting page has to be captured.

Can anybody help? Thanks


Oh man. This is where we get into crazy land.

As it stands, the key difference between jsdom and "the browser" is that we can access the window externally. For instance, in your example you set $ to window.jQuery, which is basically saying "hey, for this current window I want a reference to the jQuery object". You could have tens of windows and hold references to all of their $'s.

Now, let's say you load a new page due to a form submission or a link click...

jsdom would need to reload the window and update the JavaScript context (potentially injecting the scripts you provided in the original jsdom.env call). Unfortunately, the references you held from the last window would be gone or overwritten. In other words, calling $(...) after the page had reloaded would result in unexpected behavior (most likely a memory leak, or a selection of DOM elements on the previous page).

How do you get around this?

Since you are using jQuery already, do something like:

var form   = $('#airbotForm');
var data   = form.serialize();
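// a form with no action submits back to the page it came from
var url    = form.attr('action') || 'http://www.orbitz.com';
var type   = form.attr('enctype') || 'application/x-www-form-urlencoded';
// a form with no method attribute defaults to GET
var method = form.attr('method') || 'get';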

request({
  url    : url,
  method : method.toUpperCase(),
  body   : data,
  headers : {
    'Content-type' : type
  }
},function(error, response, body) {
  // this assumes no error for brevity.
  jsdom.env(body, [/* scripts */], function(errors, window) {
    // do your post-processing here; take a fresh $ from this new window
  });
});

YMMV, but this approach should work in non-ajax situations.
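
Two details the snippet above glosses over: a form's action is often a relative path, and a GET form carries its fields in the query string rather than in the body. Here is a minimal sketch of handling both, assuming a Node version that provides the global URL class. buildFormRequest and its parameter names are hypothetical, not part of jsdom or request, and the content type is fixed to URL-encoded for simplicity.

```javascript
// Hypothetical helper: turns values read from the form into options for request().
//   pageUrl : URL of the page the form was scraped from
//   action  : the form's action attribute (may be relative or missing)
//   method  : the form's method attribute (may be missing)
//   data    : the string returned by form.serialize()
function buildFormRequest(pageUrl, action, method, data) {
  // A missing action means "submit to the current page".
  var target = new URL(action || '', pageUrl);
  var verb = (method || 'get').toUpperCase();

  if (verb === 'GET') {
    // GET forms send their fields as the query string, replacing any existing one.
    target.search = data;
    return { url: target.href, method: verb };
  }

  return {
    url: target.href,
    method: verb,
    body: data,
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' }
  };
}
```

The returned object could then be passed to request() in place of the hand-built options, for example request(buildFormRequest('http://www.orbitz.com', form.attr('action'), form.attr('method'), form.serialize()), callback).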


You need something like https://github.com/driverdan/node-XMLHttpRequest, and you need to set up jsdom to use it for ajax-type requests. I've not quite seen this type of use in the wild, but it should be possible in theory.

The other way is to do your own POST directly, based on Node's http library (or on request, which you already seem to depend on).

Either https://github.com/mikeal/request/blob/master/main.js#L357

or http://nodejs.org/docs/v0.4.8/api/http.html#http.request with method POST.
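
As a rough sketch of the second option, a form POST through Node's built-in http module might look like the following. buildPost and postForm are hypothetical helper names, and the host, path and field names you pass in are assumptions that would have to be read from the real form.

```javascript
var http = require('http');
var querystring = require('querystring');

// Hypothetical helper: builds the http.request options and the
// URL-encoded body for a form POST.
function buildPost(host, path, fields) {
  var body = querystring.stringify(fields);
  return {
    body: body,
    options: {
      host: host,
      path: path,
      method: 'POST',
      headers: {
        'Content-Type': 'application/x-www-form-urlencoded',
        'Content-Length': Buffer.byteLength(body)
      }
    }
  };
}

// Hypothetical helper: sends the POST and hands the response body
// to callback(error, html).
function postForm(host, path, fields, callback) {
  var post = buildPost(host, path, fields);
  var req = http.request(post.options, function (res) {
    var chunks = [];
    res.on('data', function (chunk) { chunks.push(chunk); });
    res.on('end', function () {
      callback(null, Buffer.concat(chunks).toString('utf8'));
    });
  });
  req.on('error', callback);
  req.end(post.body);
}
```

The html handed to the callback can then be fed to a new jsdom.env call, the same way the body of the original request was. Note that this sketch does not follow redirects or carry cookies, which a real booking site may require.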

Josh
