Ruby Mechanize not returning JavaScript-built page correctly
I'm trying to write a script to fill out a multi-page "form" that I have to complete weekly (an unemployment form, actually). The 4th page gives you a checkbox and 2 radio buttons, all built by JavaScript. When I navigate to this page using Mechanize, I get HTML back without those 3 controls, so I can't go any further in the process.
Is this a common problem?
I'm filling out the form, then just calling
page = agent.submit(form, form.buttons.first)
and it comes back without those controls built.
Mechanize is an HTML parser, not a JavaScript interpreter. If it's not in the HTML, there's nothing it can do. You need a "proper" browser. (By "proper" I mean one which can at least parse HTML, run JavaScript and ideally also interpret CSS.)
There are tools like Selenium & Co. that let you "remote-control" a "real" browser (Firefox, Internet Explorer, …) and there are efforts to build completely scriptable GUI-less browsers for precisely this use case.
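As a rough illustration of the remote-control approach, here is a minimal sketch using the selenium-webdriver gem with Firefox. The CSS selectors and the `fill_js_page` helper are assumptions made up for this example; the real page will need its own selectors, and Firefox plus geckodriver must be installed for the browser part to run.

```ruby
# Hypothetical selectors for the JavaScript-built controls; replace them
# with the ones the real page uses.
JS_CONTROL_SELECTORS = ['input[type="checkbox"]', 'input[type="radio"]'].freeze

# Clicks the first element matching each selector and returns the selectors
# that were clicked. `driver` is any object that offers Selenium's
# find_element(css: ...) interface.
def click_all(driver, selectors)
  selectors.map do |css|
    driver.find_element(css: css).click
    css
  end
end

# Opens `url` in a real Firefox, waits until the page's JavaScript has
# built the first control, clicks the controls and returns the rendered HTML.
def fill_js_page(url)
  require 'selenium-webdriver' # loaded here so click_all works without the gem
  driver = Selenium::WebDriver.for(:firefox)
  begin
    driver.get(url)
    wait = Selenium::WebDriver::Wait.new(timeout: 10)
    wait.until { driver.find_elements(css: JS_CONTROL_SELECTORS.first).any? }
    click_all(driver, JS_CONTROL_SELECTORS)
    driver.page_source
  ensure
    driver.quit # always close the browser, even if a step fails
  end
end
```

Calling `fill_js_page` with the URL of the fourth page would return the HTML as the browser sees it after the scripts have run, which is the part Mechanize cannot produce.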
Note: Depending on what country you are in, the unemployment agency may be in violation of anti-discrimination laws (especially if it's a government agency), so you could maybe force them to offer a JavaScript-free version of the form, but that's a) not a short-term solution and b) a topic for your lawyer, not StackOverflow.
Are the values of the generated form predictable? I often find it convenient to bypass all the individual form-helpers and just post to the form directly:
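require 'mechanize'
browser = Mechanize.new
# some_url is the form's action URL; field1 => val1 etc. are its name/value pairs
browser.post(some_url, { field1 => val1, field2 => val2, ... })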
You might want to look into using Watir if you're on Windows, FireWatir on Mac/Linux, or SafariWatir on Mac only. All are basically the same code and are at the same site.
It's more oriented toward testing websites, but you can get at the content of the page using XPath and proceed from there. Hopefully the browser will have processed the JavaScript for you and will return the result. I've seen some browsers display the JS-rendered HTML in their source view while others don't, so I'm not sure what results you'll have.
As has been mentioned in other answers, you need to use something which drives a real web browser, as there are currently no libraries capable of parsing that level of JavaScript (some can follow JavaScript redirects, but that is pretty much it). Driving a real browser would be ideal and easier to maintain.
If you really want to stick with the mechanize approach then you should simply be able to add the post field manually.
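Adding the fields by hand could look roughly like the sketch below. It relies on Mechanize's `Form#has_field?`, `Form#[]=` and `Form#add_field!`; the field names and values, and the `submit_with_js_fields` helper, are hypothetical. The real names have to be read from the page's JavaScript or from the request a normal browser sends.

```ruby
# Hypothetical names and values for the three JavaScript-built controls.
JS_BUILT_FIELDS = {
  'certify'      => 'on',  # the checkbox
  'worked'       => 'no',  # first radio button group
  'able_to_work' => 'yes'  # second radio button group
}.freeze

# Makes sure every name in `fields` is present on the form with the given
# value: existing fields are overwritten, missing ones are added.
# `form` is expected to behave like a Mechanize::Form.
def add_js_built_fields(form, fields = JS_BUILT_FIELDS)
  fields.each do |name, value|
    if form.has_field?(name)
      form[name] = value
    else
      form.add_field!(name, value)
    end
  end
  form
end

# Fetches the page at `url`, patches its first form and submits it with the
# first button, mirroring the agent.submit call from the question.
def submit_with_js_fields(url)
  require 'mechanize' # loaded here so the helper above works without the gem
  agent = Mechanize.new
  form = add_js_built_fields(agent.get(url).forms.first)
  agent.submit(form, form.buttons.first)
end
```

This only works if the server accepts the plain name/value pairs; if the page's JavaScript also computes tokens or hidden values, those would have to be reproduced as well.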
If they use a CAPTCHA to block automated posting, then you may need to resort to a simple decaptcha service (10 dollars for 2000 credits should be enough).
Lastly, it may be prudent to just not go through all this trouble.