Where should I start in making a scraper or a bot using Python? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.

Closed 7 years ago.

I'm not that new to programming (Python), but I have no clue where to start in making a bot or a scraper with it. Should I study CGI programming? Or does a scraper run as just a Python script? Should I build a server for that? Thanks for the help.


Here are some links to get you started.

  • Build a basic web scraper in Python
  • Scrapy: An open source web scraping framework for Python
  • Web scraping with Python. Part 1: Crawling


If you’re trying to access websites that make heavy use of JavaScript, you might, overall, find Selenium easier.

Selenium consists of a server that controls actual web browsers on your machine, and a client library (including a Python port) that lets you drive those browsers and inspect the pages in them.

It’s definitely more overhead up-front to configure (and figure out) the server and client library (and to make sure you have a working browser on your system), but if the website does a lot of stuff in JavaScript, your actual scraping code could be a lot less hairy.


Screen scraping often involves a lot of regular expressions to get the exact data you want. You also need to decide up front what sort of data you want to analyze and how you want to store it.

To get the pages, you'll need a library such as urllib (urllib2 on Python 2, urllib.request on Python 3). To pull the data out of them, use regular expressions (re), or let BeautifulSoup do your dirty work (http://www.crummy.com/software/BeautifulSoup/).
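
As a rough sketch of the urllib-plus-re approach, here is one way it could look on Python 3 using only the standard library. The function names `fetch` and `extract_links`, the `example-scraper/0.1` User-Agent string, and the link pattern are all made up for illustration, and the regex is deliberately naive: it will miss unquoted attributes and other odd markup, which is where a real parser such as BeautifulSoup earns its keep.

```python
import re
import urllib.request
from urllib.parse import urljoin

# Deliberately simple pattern (an assumption, not a full HTML parser):
# matches href="..." or href='...' inside an <a ...> tag.
LINK_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def fetch(url, timeout=10):
    """Download a page and return its body as text."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-scraper/0.1"}  # hypothetical name
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_links(html, base_url):
    """Return absolute URLs for every link the pattern finds in the HTML."""
    return [urljoin(base_url, href) for href in LINK_RE.findall(html)]
```

Something like `extract_links(fetch(url), url)` would then give you the links on one page. No server or CGI is involved; it runs as a plain script.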

If you want to build a pure bot that does what the search engines do, it also has to be smart enough not to keep pinging the same domain continuously, since that can amount to a denial-of-service (DoS) attack.
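
One simple way to avoid hammering a single domain is to remember when you last hit it and sleep until a minimum delay has passed. The sketch below assumes a fixed per-domain delay. The class name `DomainThrottle` and the 2-second default are arbitrary choices for illustration, not a recommended value.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, min_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock      # injectable so the class can be tested
        self._sleep = sleep
        self._last_hit = {}      # domain -> time of the most recent request

    def wait(self, url):
        """Block until it is polite to request `url`; return seconds slept."""
        domain = urlparse(url).netloc
        now = self._clock()
        last = self._last_hit.get(domain)
        delay = 0.0
        if last is not None:
            # Sleep only for whatever part of the delay has not elapsed yet.
            delay = max(0.0, self.min_delay - (now - last))
        if delay > 0:
            self._sleep(delay)
        self._last_hit[domain] = now + delay
        return delay
```

Call `throttle.wait(url)` right before each download. Requests to different domains are not delayed by each other, only repeat visits to the same one.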
