How to spider a password-protected site in Python?

Currently I have a spider written in Java that logs into a supplier website and spiders the website (using HtmlUnit).

It keeps the session (cookie) and even lets me enable or disable JavaScript, etc.

I also use htmlparser (Java) to help parse the HTML and extract the relevant information.

Does Python have something similar for this?


Python has urllib2 to crawl pages, which supports password authentication and cookies.
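
As a rough sketch of the login-and-keep-the-session part: the code below uses `urllib.request` and `http.cookiejar`, which are the Python 3 homes of what urllib2 and cookielib provided in Python 2. The helper names `build_session_opener` and `form_login`, the URLs, and the form field names are all made up for illustration; a real site will have its own login URL and field names, and it may use a form login, HTTP Basic auth, or something else entirely.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def build_session_opener(base_url=None, username=None, password=None):
    """Return (opener, cookie_jar); the opener resends stored cookies on every request."""
    cookie_jar = CookieJar()
    handlers = [urllib.request.HTTPCookieProcessor(cookie_jar)]
    if username is not None:
        # Optional HTTP Basic auth, for sites that answer with a 401 challenge.
        # base_url must be given in this case (hypothetical example value below).
        password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
        password_mgr.add_password(None, base_url, username, password)
        handlers.append(urllib.request.HTTPBasicAuthHandler(password_mgr))
    return urllib.request.build_opener(*handlers), cookie_jar


def form_login(opener, login_url, fields):
    """POST a login form; any session cookie the site sets lands in the opener's jar."""
    data = urllib.parse.urlencode(fields).encode("utf-8")
    with opener.open(login_url, data=data, timeout=30) as response:
        return response.read()


# Hypothetical usage (URL and field names are assumptions, not a real site):
#   opener, jar = build_session_opener()
#   form_login(opener, "https://supplier.example.com/login",
#              {"username": "me", "password": "secret"})
#   page = opener.open("https://supplier.example.com/catalog").read()
```

Every later `opener.open(...)` call goes through the same cookie jar, which is what keeps the logged-in session alive while spidering. This does not execute JavaScript, so it is not a full replacement for HtmlUnit on script-heavy pages.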

There is also HTMLParser for parsing HTML, but some people prefer the more fully featured BeautifulSoup.
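
For a feel of the standard-library route, here is a minimal sketch that pulls link targets out of a page with `html.parser.HTMLParser` (the Python 3 location of the HTMLParser module). The class name `LinkExtractor` and the helper `extract_links` are invented for this example; which tags and attributes matter depends on the supplier site.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return all link targets found in html_text, in document order."""
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

Calling `extract_links(page_text)` on a decoded page gives the list of URLs to visit next. The event-style API gets tedious for anything beyond simple extraction, which is the usual reason people reach for BeautifulSoup instead.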


Scrapy is a crawling framework that does the downloading itself (it is built on Twisted rather than urllib2) and wires up several parsers and helper routines.
