开发者

How to get mechanize requests to look like they originate from a real browser

OK, here's the header(just an example) info I got from Live HTTP Header while logging into an account:

http://example.com/login.html

POST /login.html HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Referer: http://example.com
Cookie: blahblahblah; blah = blahblah
Content-Type: application/x-www-form-urlencoded
Content-Length: 39
username=shane&password=123456&do=login

HTTP/1.1 200 OK
Date: Sat, 18 Dec 2010 15:41:02 GMT
Server: Apache/2.2.3 (CentOS)
X-Powered-By: PHP/5.2.14
Set-Cookie: blah = blahblah_blah; expires=Sun, 18-Dec-2011 15:41:02 GMT; path=/; domain=.example.com; HttpOnly
Set-Cookie: blah = blahblah; expires=Sun, 18-Dec-2011 15:41:02 GMT; path=/; domain=.example.com; HttpOnly
Set-Cookie: blah = blahblah; expires=Sun, 18-Dec-2011 15:41:02 GMT; path=/; domain=.example.com; HttpOnly
Cache-Control: private, no-cache="set-cookie"
Expires: 0
Pragma: no-cache
Content-Encoding: gzip
Vary: Accept-Encoding
Content-Length: 4135
Keep-Alive: timeout=10, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=UTF-8

Normally I would code like this:

import mechanize
import urllib2

MechBrowser = mechanize.Browser()
LoginUrl = "http://example.com/login.html"
LoginData = "username=shane&password=123456&do=login"
LoginHeader = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)", "Referer": "http://example.com"}

LoginRequest = urllib2.Request(LoginUrl, LoginData, Logi开发者_JAVA技巧nHeader)
LoginResponse = MechBrowser.open(LoginRequest)

Above code works fine. My question is, do I also need to add these following lines (and more in previous header infos) in LoginHeader to make it really looks like firefox's surfing, not mechanize?

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7

What parts/how many of header info need to be spoofed to make it looks "real"?


It depends on what you're trying to 'fool'. You can try some online services that do simple User Agent sniffing to gauge your success:

http://browserspy.dk/browser.php

http://www.browserscope.org (look for 'We think you're using...')

http://www.browserscope.org/ua

http://panopticlick.eff.org/ -> will help you to pick some 'too common to track' options

http://networking.ringofsaturn.com/Tools/browser.php

I believe a determined programmer could detect your game, but many log parsers and tools wouldn't once you echo what your real browser sends.

One thing you should consider is that lack of JS might raise red flags, so capture sent headers with JS disabled too.


Here's how you set the user agent for all requests made by mechanize.Browser

br = mechanize.Browser()
br.addheaders = [('User-agent', 'your user agent string here')]

Mechanize can fill in forms as well

br.open('http://yoursite.com/login')
br.select_form(nr=1) # select second form in page (0 indexed)
br['username'] = 'yourUserName' # inserts into form field with name 'username'
br['password'] = 'yourPassword'
response = br.submit()
if 'Welcome yourUserName' in response.get_data():
    # login was successful
else:
    # something went wrong
    print response.get_data()

See the mechanize examples for more info


If you are paranoid about keeping bots/scripts/non-real browsers out, you'd look for things like the order of HTTP requests, let one resource be added using JavaScript. If that resource is not requested, or requested before the JavaScript - then you know it's a "fake" browser. You could also look at number of requests per connection (keep-alive), or simply verify that all CSS files of the first page (given that they're at the top of the HTML) gets loaded.

YMMV but it can become pretty cumbersome to simulate enough to make some "fake" browser pass as a "real" one (used by humans).

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜