Dynamically add to allowed_domains in a Scrapy spider
I have a spider that starts with a small list of allowed_domains at the beginning of the crawl. I need to add more domains to this whitelist dynamically, from within a parser, as the crawl continues, but the following piece of code does not accomplish that, since subsequent requests are still being filtered. Is there another way of updating allowed_domains within the parser?
class APSpider(BaseSpider):
    name = "APSpider"
    allowed_domains = ["www.somedomain.com"]
    start_urls = [
        "http://www.somedomain.com/list-of-websites",
    ]

    ...

    def parse(self, response):
        soup = BeautifulSoup(response.body)
        for link_tag in soup.findAll('td', {'class': 'half-width'}):
            _website = link_tag.find('a')['href']
            u = urlparse.urlparse(_website)
            self.allowed_domains.append(u.netloc)
            yield Request(url=_website, callback=self.parse_secondary_site)

    ...
(At the moment this answer is written, the latest version of Scrapy is 1.0.3. This answer should work for all recent versions of Scrapy.)
The OffsiteMiddleware reads allowed_domains only once, when it handles the spider_opened signal and compiles the domain list into a regex; the values in allowed_domains are never accessed again afterwards. Simply updating the contents of allowed_domains therefore does not solve the problem.
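For reference, here is a lightly abridged sketch of the relevant methods, based on the Scrapy 1.0.x source (see the GitHub link below); note that the whitelist is compiled once in spider_opened and only the precompiled regex is consulted afterwards:

import re

from scrapy.utils.httpobj import urlparse_cached

class OffsiteMiddleware(object):
    def spider_opened(self, spider):
        # Called once, when the spider starts: allowed_domains is
        # compiled into a single regex here and never re-read.
        self.host_regex = self.get_host_regex(spider)
        self.domains_seen = set()

    def get_host_regex(self, spider):
        allowed_domains = getattr(spider, 'allowed_domains', None)
        if not allowed_domains:
            return re.compile('')  # allow all by default
        regex = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains)
        return re.compile(regex)

    def should_follow(self, request, spider):
        # Only the precompiled regex is consulted from now on.
        host = urlparse_cached(request).hostname or ''
        return bool(self.host_regex.search(host))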
Basically, two steps are required:

- Update the contents of allowed_domains according to your actual needs.
- Have the regex cache in OffsiteMiddleware refreshed.
Here is the code I use for step #2:
# Refresh the regex cache for `allowed_domains`
# (requires: import scrapy.spidermiddlewares.offsite)
for mw in self.crawler.engine.scraper.spidermw.middlewares:
    if isinstance(mw, scrapy.spidermiddlewares.offsite.OffsiteMiddleware):
        mw.spider_opened(self)
The code above is meant to be invoked inside a response callback, so self here is an instance of the spider class.
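Putting the two steps together, a parse callback in the spirit of the question's code might look like the sketch below. The td/a selectors and the parse_secondary_site callback are carried over from the question; treat them as placeholders for your actual page structure.

import urlparse

from bs4 import BeautifulSoup
from scrapy import Request
from scrapy.spidermiddlewares.offsite import OffsiteMiddleware

def parse(self, response):
    soup = BeautifulSoup(response.body)
    for link_tag in soup.findAll('td', {'class': 'half-width'}):
        _website = link_tag.find('a')['href']

        # Step 1: extend the whitelist with the new host.
        self.allowed_domains.append(urlparse.urlparse(_website).netloc)

        # Step 2: have the OffsiteMiddleware rebuild its compiled regex.
        for mw in self.crawler.engine.scraper.spidermw.middlewares:
            if isinstance(mw, OffsiteMiddleware):
                mw.spider_opened(self)

        yield Request(url=_website, callback=self.parse_secondary_site)

If many links are extracted, it is cheaper to collect all the new netlocs first and refresh the middleware once after the loop rather than once per link.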
See also:

- Source code of scrapy.spidermiddlewares.offsite.OffsiteMiddleware on GitHub
You could try something like the following:

class APSpider(BaseSpider):
    name = "APSpider"

    start_urls = [
        "http://www.somedomain.com/list-of-websites",
    ]

    def __init__(self):
        super(APSpider, self).__init__()
        # Start with an empty whitelist (not None, so append() works);
        # with no allowed domains, the OffsiteMiddleware filters nothing.
        self.allowed_domains = []

    def parse(self, response):
        soup = BeautifulSoup(response.body)

        if not self.allowed_domains:
            for link_tag in soup.findAll('td', {'class': 'half-width'}):
                _website = link_tag.find('a')['href']
                u = urlparse.urlparse(_website)
                self.allowed_domains.append(u.netloc)
                yield Request(url=_website, callback=self.parse_secondary_site)

        # Filter manually: compare the request's host, not the full URL,
        # against the whitelist.
        if urlparse.urlparse(response.url).netloc in self.allowed_domains:
            yield Request(...)

    ...
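The idea behind this approach: when allowed_domains is empty (or missing), the OffsiteMiddleware compiles an allow-all regex and filters nothing, so all offsite filtering moves into the spider itself and there is no middleware cache that needs refreshing.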