get many pages with pycurl?

2022-12-14 23:15 问答作者：

I want to get many pages from a website, like

curl "http://farmsubsidy.org/DE/browse?page=[0000-3603]" -o "de.#1"

but get the pages' data in python, not disk files. Can someone please post pycurl code to do this,

or fast urllib2 (not one-at-a-time) if that's possible,

or else say "forget it, curl is faster and more robust" ? Thanks

So you have 2 problem and let me show you in one example. Notice the pycurl already did the multithreading/not one-at-a-time w/o your hardwork.

#! /usr/bin/env python

import sys, select, time
import pycurl,StringIO

c1 = pycurl.Curl()
c2 = pycurl.Curl()
c3 = pycurl.Curl()
c1.setopt(c1.URL, "http://www.python.org")
c2.setopt(c2.URL, "http://curl.haxx.se")
c3.setopt(c3.URL, "http://slashdot.org")
s1 = StringIO.StringIO()
s2 = StringIO.StringIO()
s3 = StringIO.StringIO()
c1.setopt(c1.WRITEFUNCTION, s1.write)
c2.setopt(c2.WRITEFUNCTION, s2.write)
c3.setopt(c3.WRITEFUNCTION, s3.write)

m = pycurl.CurlMulti()
m.add_handle(c1)
m.add_handle(c2)
m.add_handle(c3)

# Number of seconds to wait for a timeout to happen
SELECT_TIMEOUT = 1.0

# Stir the state machine into action
while 1:
    ret, num_handles = m.perform()
    if ret != pycurl.E_CALL_MULTI_PERFORM:
        break

# Keep going until all the connections have terminated
while num_handles:
    # The select method uses fdset internally to determine which file descriptors
    # to check.
    m.select(SELECT_TIMEOUT)
    while 1:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break

# Cleanup
m.remove_handle(c3)
m.remove_handle(c2)
m.remove_handle(c1)
m.close()
c1.close()
c2.close()
c3.close()
print "http://www.python.org is ",s1.getvalue()
print "http://curl.haxx.se is ",s2.getvalue()
print "http://slashdot.org is ",s3.getvalue()

Finally, these code is mainly based on an example on the pycurl site =.=

may be you should really read doc. ppl spend huge time on it.

here is a solution based on urllib2 and threads.

import urllib2
from threading import Thread

BASE_URL = 'http://farmsubsidy.org/DE/browse?page='
NUM_RANGE = range(0000, 3603)
THREADS = 2

def main():
    for nums in split_seq(NUM_RANGE, THREADS):
        t = Spider(BASE_URL, nums)
        t.start()

def split_seq(seq, num_pieces):
    start = 0
    for i in xrange(num_pieces):
        stop = start + len(seq[i::num_pieces])
        yield seq[start:stop]
        start = stop

class Spider(Thread):
    def __init__(self, base_url, nums):
        Thread.__init__(self)
        self.base_url = base_url
        self.nums = nums
    def run(self):
        for num in self.nums:
            url = '%s%s' % (self.base_url, num)
            data = urllib2.urlopen(url).read()
            print data

if __name__ == '__main__':
    main()

You can just put that into a bash script inside a for loop.

However you may have better success at parsing each page using python. http://www.securitytube.net/Crawling-the-Web-for-Fun-and-Profit-video.aspx You will be able to get at the exact data and save it at the same time into a db. http://www.securitytube.net/Storing-Mined-Data-from-the-Web-for-Fun-and-Profit-video.aspx

If you want to crawl a website using python, you should have a look to scrapy http://scrapy.org

Using BeautifulSoup4 and requests -

Grab head page:

page = Soup(requests.get(url='http://rootpage.htm').text)

Create an array of requests:

from requests import async

requests = [async.get(url.get('href')) for url in page('a')]
responses = async.map(requests)

[dosomething(response.text) for response in responses]

Requests requires gevent to do this btw.

I can recommend you to user async module of human_curl

Look example:

from urlparse import urljoin 
from datetime import datetime

from human_curl.async import AsyncClient 
from human_curl.utils import stdout_debug

def success_callback(response, **kwargs):
    """This function call when response successed
    """
    print("success callback")
    print(response, response.request)
    print(response.headers)
    print(response.content)
    print(kwargs)

def fail_callback(request, opener, **kwargs):
    """Collect errors
    """
    print("fail callback")
    print(request, opener)
    print(kwargs)

with AsyncClient(success_callback=success_callback,
                 fail_callback=fail_callback) as async_client:
    for x in xrange(10000):
        async_client.get('http://google.com/', params=(("x", str(x)),)
        async_client.get('http://google.com/', params=(("x", str(x)),),
                        success_callback=success_callback, fail_callback=fail_callback)

Usage very simple. Then page success loaded of failed async_client call you callback. Also you can specify number on parallel connections.

继续阅读：curl pycurl python urllib2

get many pages with pycurl?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？