Writing a Faster Python Spider

2022-12-13 19:10 问答作者：

I'm writing a spider in Python to crawl a site. 开发者_JAVA百科Trouble is, I need to examine about 2.5 million pages, so I could really use some help making it optimized for speed.

What I need to do is examine the pages for a certain number, and if it is found to record the link to the page. The spider is very simple, it just needs to sort through a lot of pages.

I'm completely new to Python, but have used Java and C++ before. I have yet to start coding it, so any recommendations on libraries or frameworks to include would be great. Any optimization tips are also greatly appreciated.

You could use MapReduce like Google does, either via Hadoop (specifically with Python: 1 and 2), Disco, or Happy.

The traditional line of thought, is write your program in standard Python, if you find it is too slow, profile it, and optimize the specific slow spots. You can make these slow spots faster by dropping down to C, using C/C++ extensions or even ctypes.

If you are spidering just one site, consider using wget -r (an example).

Where are you storing the results? You can use PiCloud's cloud library to parallelize your scraping easily across a cluster of servers.

As you are new to Python, I think the following may be helpful for you :)

if you are writing regex to search for certain pattern in the page, compile your regex wherever you can and reuse the compiled object
BeautifulSoup is a html/xml parser that may be of some use for your project.

Spidering somebody's site with millions of requests isn't very polite. Can you instead ask the webmaster for an archive of the site? Once you have that, it's a simple matter of text searching.

You waste a lot of time waiting for network requests when spidering, so you'll definitely want to make your requests in parallel. I would probably save the result data to disk and then have a second process looping over the files searching for the term. That phase could easily be distributed across multiple machines if you needed extra performance.

What Adam said. I did this once to map out Xanga's network. The way I made it faster is by having a thread-safe set containing all usernames I had to look up. Then I had 5 or so threads making requests at the same time and processing them. You're going to spend way more time waiting for the page to DL than you will processing any of the text (most likely), so just find ways to increase the number of requests you can get at the same time.

继续阅读：python web-crawler

Writing a Faster Python Spider

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？