Scraping landing pages of a list of domains [closed]
I have a reasonably long list of websites for which I want to download the landing pages (index.html or equivalent). I am currently using Scrapy (much love to the guys behind it; this is a fabulous framework). Scrapy is slower on this particular task than I'd like, and I am wondering whether wget or another alternative would be faster, given how straightforward the task is. Any ideas?
(Here's what I am doing with Scrapy. Is there anything I can do to optimize Scrapy for this task?)
So, I have a start_urls list like:
start_urls = ['http://google.com', 'http://yahoo.com', 'http://aol.com']
I scrape the text from each response and store it in an XML file. I need to turn off the OffsiteMiddleware to allow for multiple domains.
Scrapy works as expected, but seems slow (about 1,000 pages an hour, or roughly one every 4 seconds). Is there a way to speed this up by increasing CONCURRENT_REQUESTS_PER_SPIDER while running a single spider? Anything else?
If you want to download multiple sites concurrently with Python, you can do so with the standard library, like this:
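import threading
import urllib.request

maxthreads = 4
sites = ['google.com', 'yahoo.com', ]  # etc.

class Download(threading.Thread):
    def run(self):
        while True:
            try:
                # list.pop() is atomic, so each site is claimed by one thread only
                site = sites.pop()
            except IndexError:
                break  # another thread took the last site
            print("start", site)
            # Save the landing page to a local file named after the domain
            urllib.request.urlretrieve('http://' + site, site)
            print("end  ", site)

# Start at most maxthreads workers; each keeps pulling sites until none are left
for x in range(min(maxthreads, len(sites))):
    Download().start()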
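On Python 3, the same worker idea can also be sketched with concurrent.futures from the standard library, which manages the thread pool and collects per-site failures for you. This is only a sketch under a few assumptions: the names fetch_landing_page and fetch_all, the 10-second timeout, and returning the page bytes instead of saving files are illustrative choices, not part of the snippet above.

```python
import concurrent.futures
import urllib.request


def fetch_landing_page(site, timeout=10):
    """Download the landing page of one domain and return its raw bytes.

    The 10-second timeout is an assumed value chosen for illustration.
    """
    with urllib.request.urlopen('http://' + site, timeout=timeout) as response:
        return response.read()


def fetch_all(sites, fetch=fetch_landing_page, max_workers=4):
    """Fetch every site using a pool of worker threads.

    Returns (pages, errors): pages maps site -> result of fetch(site),
    errors maps site -> the exception raised while fetching that site.
    """
    pages, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one job per site and remember which future belongs to which site.
        futures = {pool.submit(fetch, site): site for site in sites}
        for future in concurrent.futures.as_completed(futures):
            site = futures[future]
            try:
                pages[site] = future.result()
            except Exception as exc:  # one dead domain should not stop the rest
                errors[site] = exc
    return pages, errors
```

Calling fetch_all(['google.com', 'yahoo.com']) would return the downloaded pages keyed by domain, plus a second dict of the domains that failed. Since the work is almost entirely waiting on the network, raising max_workers well above 4 is the obvious knob to try first.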
You could also check out httplib2 or PycURL to do the downloading for you instead of urllib.
I'm not clear on exactly how you want the scraped text to look as XML, but you could use xml.etree.ElementTree from the standard library, or you could install BeautifulSoup (which would be better, as it handles malformed markup).
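Since the target XML layout isn't specified, here is one possible shape, written for Python 3 with only the standard library. Everything named here is an assumption for illustration: the pages and page element names, the url attribute, and the helpers TextExtractor, extract_text and pages_to_xml. html.parser is used in place of BeautifulSoup only to keep the sketch free of third-party dependencies.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the text of an HTML string as one space-separated line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.chunks)


def pages_to_xml(pages):
    """Build <pages><page url="...">text</page>...</pages> from {site: html}."""
    root = ET.Element('pages')
    for site, html in pages.items():
        page = ET.SubElement(root, 'page', url=site)
        page.text = extract_text(html)  # ElementTree escapes &, < and > on output
    return ET.tostring(root, encoding='unicode')
```

pages_to_xml takes a dict of domain to HTML string and returns the XML as a string, which can then be written to a file. Swapping extract_text for a BeautifulSoup-based version would leave the XML-building part unchanged.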