JSoup UserAgent, how to set it right?

I'm trying to parse the front page of Facebook with JSoup, but I always get the HTML for mobile devices and not the version served to desktop browsers (in my case Firefox 5.0).

I'm setting my User Agent like this:

doc = Jsoup.connect(url)
      .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0")
      .get();

Am I doing something wrong?

EDIT:

I just parsed http://whatsmyuseragent.com/ and it looks like the user agent is being sent correctly. Now it's even more confusing to me why http://www.facebook.com/ returns a different version to JSoup than to my browser, since both are using the same user agent.

I have now noticed this behaviour on some other sites too. If you could explain what the issue is, I would be more than happy.


You might try setting the referrer header as well:

doc = Jsoup.connect("https://www.facebook.com/")
      .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
      .referrer("http://www.google.com")
      .get();


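import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// location is the URL to fetch.
Response response = Jsoup.connect(location)
        .ignoreContentType(true)   // do not throw on non-HTML content types
        .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
        .referrer("http://www.google.com")
        .timeout(12000)            // milliseconds
        .followRedirects(true)
        .execute();

// Parse the response body into a Document.
Document doc = response.parse();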

User Agent

Use a recent user agent string. A complete list is available at http://www.useragentstring.com/pages/useragentstring.php.

Timeout

Don't forget to set a timeout, since a page can sometimes take longer than the default timeout to download.

Referer

Set the referrer to Google.

Follow redirects

Follow redirects so that you end up on the final page.

execute() instead of get()

Use execute() to get the Response object, which lets you check the content type and status code in case of an error.

You can then parse the Response object to obtain the Document.

The full example is hosted on GitHub.


It's likely that Facebook sets certain cookies and then expects them in later requests, and treats a request that lacks them as coming from a bot, a mobile user, a limited browser, or something else.

There are several questions about handling cookies with JSoup; however, you may find it simpler to use HttpURLConnection or Apache's HttpClient and then pass the result to JSoup. An excellent write-up on everything you need to know: Using java.net.URLConnection to fire and handle HTTP requests
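As a rough illustration of the HttpURLConnection route, here is a minimal sketch that uses only the JDK. It assumes Java 9 or later, and the class and method names (CookieAwareFetch, installCookieStore, open, fetch) are made up for this example. It also assumes the page is UTF-8 and that requesting the page twice is enough for the site to set and then receive its cookies, which may not hold for Facebook.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class CookieAwareFetch {

    // Install a JVM-wide in-memory cookie store so that cookies set by one
    // response are sent back on later HttpURLConnection requests.
    static void installCookieStore() {
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
    }

    // Prepare (but do not yet send) a GET request with a browser-like User-Agent.
    static HttpURLConnection open(String url, String userAgent) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestProperty("User-Agent", userAgent);
            conn.setInstanceFollowRedirects(true);
            conn.setConnectTimeout(12_000); // milliseconds
            conn.setReadTimeout(12_000);    // milliseconds
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Send the request and read the whole body as UTF-8 (an assumption; a real
    // page may declare a different charset).
    static String fetch(String url, String userAgent) throws IOException {
        HttpURLConnection conn = open(url, userAgent);
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        installCookieStore();
        String ua = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0";
        String url = "https://www.facebook.com/";
        fetch(url, ua);               // first request: the cookie store collects Set-Cookie values
        String html = fetch(url, ua); // second request: stored cookies are sent back
        System.out.println("Fetched " + html.length() + " characters");
        // With jsoup on the classpath, the markup can then be parsed:
        //   Document doc = Jsoup.parse(html, url);
    }
}
```

Whether the second response differs from the first depends entirely on the site, so treat this as a starting point for experimenting rather than a guaranteed fix.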

One useful way to debug the difference between your browser and JSoup is Chrome's network inspector. You can add headers from the browser to JSoup one at a time until you get the behavior you expect, then narrow down exactly which headers you need.
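The "one header at a time" approach could be organised roughly like this. It is only a sketch: the class name HeaderNarrowing and the method cumulativeSets are invented for the example, and the header values are placeholders that you would replace with the ones copied from your browser's network inspector.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HeaderNarrowing {

    // Build the header sets to try in order: the first header alone,
    // then the first two, and so on, preserving the browser's order.
    static List<Map<String, String>> cumulativeSets(Map<String, String> browserHeaders) {
        List<Map<String, String>> sets = new ArrayList<>();
        Map<String, String> soFar = new LinkedHashMap<>();
        for (Map.Entry<String, String> h : browserHeaders.entrySet()) {
            soFar.put(h.getKey(), h.getValue());
            // Copy so that later additions do not alter earlier sets.
            sets.add(new LinkedHashMap<>(soFar));
        }
        return sets;
    }

    public static void main(String[] args) {
        // Placeholder values; copy the real ones from the browser.
        Map<String, String> browserHeaders = new LinkedHashMap<>();
        browserHeaders.put("User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0");
        browserHeaders.put("Accept",
                "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        browserHeaders.put("Accept-Language", "en-US,en;q=0.5");

        for (Map<String, String> attempt : cumulativeSets(browserHeaders)) {
            // With jsoup on the classpath, each attempt could be sent by calling
            // connection.header(name, value) for every entry before get().
            System.out.println("Try with headers: " + attempt.keySet());
        }
    }
}
```

The first attempt that returns the desktop page tells you which header made the difference; you can then remove the earlier ones again to see which are actually required.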


I had the 403 problem, and setting .userAgent("Mozilla") worked for me (so it doesn't need to be super specific to work).
