Python Web Scraping: A Scrapy Environment Setup Walkthrough
How to Set Up the Scrapy Environment
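First install Python. For setting up the Python environment, see: https://blog.csdn.net/alice_tl/article/details/76793590
Next, install Scrapy.
1. Install Scrapy by running pip install Scrapy in a terminal (preferably over a network connection outside China, since the packages are downloaded from overseas hosts).
The progress output looks like this: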
alicedeMacBook-Pro:~ alice$ pip install Scrapy
Collecting Scrapy
  Using cached https://files.pythonhosted.org/packages/5d/12/a6197eaf97385e96fd8ec56627749a6229a9b3178ad73866a0b1fb377379/Scrapy-1.5.1-py2.py3-none-any.whl
Collecting w3lib>=1.17.0 (from Scrapy)
  Using cached https://files.pythonhosted.org/packages/37/94/40c93ad0cadac0f8cb729e1668823c71532fd4a7361b141aec535acb68e3/w3lib-1.19.0-py2.py3-none-any.whl
Collecting six>=1.5.2 (from Scrapy)
xxxxxxxxxxxxxxxxxxxxx
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/setuptools/dist.py", line 380, in fetch_build_egg
    return cmd.easy_install(req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/setuptools/command/easy_install.py", line 632, in easy_install
    raise DistutilsError(msg)
distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('incremental>=16.10.1')
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-U_6VZF/Twisted/
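The install fails while setting up Twisted (the traceback shows that its requirement incremental>=16.10.1 could not be found):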
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-U_6VZF/Twisted/
2. Install Twisted. In the terminal, run: sudo pip install twisted==13.1.0
alicedeMacBook-Pro:~ alice$ pip install twisted==13.1.0
Collecting twisted==13.1.0
  Downloading https://files.pythonhosted.org/packages/10/38/0d1988d53f140ec99d37ac28c04f341060c2f2d00b0a901bf199ca6ad984/Twisted-13.1.0.tar.bz2 (2.7MB)
    100% |████████████████████████████████| 2.7MB 398kB/s
Requirement already satisfied: zope.interface>=3.6.0 in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from twisted==13.1.0) (4.1.1)
Requirement already satisfied: setuptools in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from zope.interface>=3.6.0->twisted==13.1.0) (18.5)
Installing collected packages: twisted
  Running setup.py install for twisted ... error
    Complete output from command /usr/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-inJwZ2/twisted/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-record-OmuVWF/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    creating build
    creating build/lib.macosx-10.13-intel-2.7
    creating build/lib.macosx-10.13-intel-2.7/twisted
    copying twisted/copyright.py -> build/lib.macosx-10.13-intel-2.7/twisted
    copying twisted/_version.py -> build/li
3. Run sudo pip install scrapy again. It still fails, this time because no lxml distribution could be found:
Could not find a version that satisfies the requirement lxml (from Scrapy) (from versions: )
No matching distribution found for lxml (from Scrapy)
alicedeMacBook-Pro:~ alice$ sudo pip install Scrapy
The directory '/Users/alice/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/alice/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting Scrapy
  Downloading https://files.pythonhosted.org/packages/5d/12/a6197eaf97385e96fd8ec56627749a6229a9b3178ad73866a0b1fb377379/Scrapy-1.5.1-py2.py3-none-any.whl (249kB)
    100% |████████████████████████████████| 256kB 210kB/s
Collecting w3lib>=1.17.0 (from Scrapy)
xxxxxxxxxxxx
  Downloading https://files.pythonhosted.org/packages/90/50/4c315ce5d119f67189d1819629cae7908ca0b0a6c572980df5cc6942bc22/Twisted-18.7.0.tar.bz2 (3.1MB)
    100% |████████████████████████████████| 3.1MB 59kB/s
Collecting lxml (from Scrapy)
  Could not find a version that satisfies the requirement lxml (from Scrapy) (from versions: )
No matching distribution found for lxml (from Scrapy)
4. Install lxml with: sudo pip install lxml
alicedeMacBook-Pro:~ alice$ sudo pip install lxml
The directory '/Users/alice/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/alice/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting lxml
  Downloading https://files.pythonhosted.org/packages/a1/2c/6b324d1447640eb1dd240e366610f092da98270c057aeb78aa596cda4dab/lxml-4.2.4-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (8.7MB)
    100% |████████████████████████████████| 8.7MB 187kB/s
Installing collected packages: lxml
Successfully installed lxml-4.2.4
5. Install Scrapy once more with sudo pip install scrapy. This time the install succeeds:
alicedeMacBook-Pro:~ alice$ sudo pip install Scrapy
The directory '/Users/alice/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/alice/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting Scrapy
  Downloading https://files.pythonhosted.org/packages/5d/12/a6197eaf97385e96fd8ec56627749a6229a9b3178ad73866a0b1fb377379/Scrapy-1.5.1-py2.py3-none-any.whl (249kB)
    100% |████████████████████████████████| 256kB 11.5MB/s
Collecting w3lib>=1.17.0 (from Scrapy)
xxxxxxxxx
Requirement already satisfied: lxml in /Library/Python/2.7/site-packages (from Scrapy) (4.2.4)
Collecting functools32; python_version < "3.0" (from parsel>=1.1->Scrapy)
  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out. (read timeout=15)",)': /simple/functools32/
  Downloading https://files.pythonhosted.org/packages/4b/2a/0276479a4b3caeb8a8c1af2f8e4355746a97fab05a372e4a2c6a6b876165/idna-2.7-py2.py3-none-any.whl (58kB)
    100% |████████████████████████████████| 61kB 66kB/s
Installing collected packages: w3lib, cssselect, functools32, parsel, queuelib, PyDispatcher, attrs, pyasn1-modules, service-identity, zope.interface, constantly, incremental, Automat, idna, hyperlink, PyHamcrest, Twisted, Scrapy
  Running setup.py install for functools32 ... done
  Running setup.py install for PyDispatcher ... done
  Found existing installation: zope.interface 4.1.1
    Uninstalling zope.interface-4.1.1:
      Successfully uninstalled zope.interface-4.1.1
  Running setup.py install for zope.interface ... done
  Running setup.py install for Twisted ... done
Successfully installed Automat-0.7.0 PyDispatcher-2.0.5 PyHamcrest-1.9.0 Scrapy-1.5.1 Twisted-18.7.0 attrs-18.1.0 constantly-15.1.0 cssselect-1.0.3 functools32-3.2.3.post2 hyperlink-18.0.0 idna-2.7 incremental-17.5.0 parsel-1.5.0 pyasn1-modules-0.2.2 queuelib-1.5.0 service-identity-17.0.0 w3lib-1.19.0 zope.interface-4.5.0
6. Check that Scrapy installed correctly by running scrapy --version
If Scrapy's version information appears, for example Scrapy 1.5.1 - no active project, the install worked.
alicedeMacBook-Pro:~ alice$ scrapy --version
Scrapy 1.5.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
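The same check can be made from Python, which also shows which of the dependencies from the earlier steps are present. This is a minimal sketch, assuming Python 3.8 or newer (for importlib.metadata, which the Python 2.7 used in the logs above does not have); the helper name installed_version and the list of packages are illustrative choices, not part of Scrapy.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Packages that appear in the install logs above.
for name in ("Scrapy", "Twisted", "lxml", "pyOpenSSL"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

A line reading "not installed" points at the package to install before retrying pip install Scrapy.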
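P.S. If the .org package hosts (pypi.org, files.pythonhosted.org) cannot be reached reliably during the install, or the install is not run with sudo (administrator) privileges, an error like the following appears: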
Exception:
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/pip/_internal/basecommand.py", line 141, in main
    status = self.run(options, args)
  File "/Library/Python/2.7/site-packages/pip/_internal/commands/install.py", line 299, in run
    resolver.resolve(requirement_set)
  File "/Library/Python/2.7/site-packages/pip/_internal/resolve.py", line 102, in resolve
    self._resolve_one(requirement_set, req)
  File "/Library/Python/2.7/site-packages/pip/_internal/resolve.py", line 256, in _resolve_one
    abstract_dist = self._get_abstract_dist_for(req_to_install)
  File "/Library/Python/2.7/site-packages/pip/_internal/resolve.py", line 209, in _get_abstract_dist_for
    self.require_hashes
  File "/Library/Python/2.7/site-packages/pip/_internal/operations/prepare.py", line 283, in prepare_linked_requirement
    progress_bar=self.progress_bar
  File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 836, in unpack_url
    progress_bar=progress_bar
  File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 673, in unpack_http_url
    progress_bar)
  File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 897, in _download_http_url
    _download_url(resp, link, content_file, hashes, progress_bar)
  File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 617, in _download_url
    hashes.check_against_chunks(downloaded_chunks)
  File "/Library/Python/2.7/site-packages/pip/_internal/utils/hashes.py", line 48, in check_against_chunks
    for chunk in chunks:
  File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 585, in written_chunks
    for chunk in chunks:
  File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 574, in resp_read
    decode_content=False):
  File "/Library/Python/2.7/site-packages/pip/_vendor/urllib3/response.py", line 465, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/Library/Python/2.7/site-packages/pip/_vendor/urllib3/response.py", line 430, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/Library/Python/2.7/site-packages/pip/_vendor/urllib3/response.py", line 345, in _error_catcher
    raise ReadTimeoutError(self._pool, None, 'Read timed out.')
ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.
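When the failure is a read timeout like the one above, one thing worth trying is giving pip a longer socket timeout and more retries. The sketch below only builds and prints such a command so it can be reviewed before being run. The helper name pip_install_command and the default values are illustrative assumptions; --timeout, --retries, and --user are standard pip options, and --user is shown as a possible alternative to sudo rather than something the original steps used. shlex.join needs Python 3.8 or newer.

```python
import shlex
import sys


def pip_install_command(package, timeout=60, retries=10, user=False):
    """Build a pip command line with a longer socket timeout and more retries."""
    cmd = [sys.executable, "-m", "pip", "install",
           "--timeout", str(timeout),
           "--retries", str(retries)]
    if user:
        # Install into the per-user site-packages instead of using sudo.
        cmd.append("--user")
    cmd.append(package)
    return cmd


# Print the command rather than running it, so it can be checked first.
print(shlex.join(pip_install_command("Scrapy", user=True)))
```

The returned list can be passed to subprocess.run once it looks right. Whether a longer timeout is enough depends on the network, so treat this as a first attempt, not a guaranteed fix.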
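With that, the Scrapy environment is set up as the guide describes.
Common Scrapy Runtime Errors and Fixes
Following the first-spider exercise, save this code as dmoz_spider.py in the tutorial/spiders directory: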
import scrapy


class DmozSpider(scrapy.Spider):
    # The name used on the command line: scrapy crawl dmoz
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Save each page body to a file named after the last path segment of its URL.
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
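The parse method names each output file with response.url.split("/")[-2]. The short sketch below applies the same expression to the two start URLs to show what that yields; page_filename is a hypothetical helper introduced only for this illustration.

```python
def page_filename(url):
    """Mirror the spider's naming rule: the second-to-last "/"-separated piece."""
    return url.split("/")[-2]


start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]

for url in start_urls:
    print(url, "->", page_filename(url))
```

The rule relies on the trailing slash: for a URL that ends without one, the same expression returns the parent segment instead of the last one.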
Run scrapy crawl dmoz in the terminal to try to start the spider.
Error 1:
Scrapy 1.6.0 - no active project
Unknown command: crawl
alicedeMacBook-Pro:~ alice$ scrapy crawl dmoz
Scrapy 1.6.0 - no active project

Unknown command: crawl

Use "scrapy" to see available commands
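Cause: running startproject on the command line generates scrapy.cfg automatically. When the spider is launched from the command line, crawl looks for scrapy.cfg in the shell's current directory (the official documentation describes this as well). If scrapy.cfg is not found, Scrapy concludes that there is no project.
Fix: cd into the root directory of the dmoz project, that is, the directory containing scrapy.cfg, and run scrapy crawl dmoz from there.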
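A quick way to see whether the current directory belongs to a project is to look for scrapy.cfg before running crawl. The following is a minimal sketch of such a lookup, not Scrapy's own code; find_scrapy_cfg is a hypothetical helper, and checking the parent directories as well as the current one is an assumption about how the project root is usually located.

```python
from pathlib import Path


def find_scrapy_cfg(start="."):
    """Return the nearest scrapy.cfg at or above `start`, or None if there is none."""
    current = Path(start).resolve()
    for directory in (current, *current.parents):
        candidate = directory / "scrapy.cfg"
        if candidate.is_file():
            return candidate
    return None


cfg = find_scrapy_cfg()
if cfg is None:
    print("No scrapy.cfg found from here: expect 'no active project'")
else:
    print(f"Project root: {cfg.parent}")
```

If the helper prints no project root, cd to the directory created by startproject and try again.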
Normally the output should look like this:
2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2014-01-23 18:13:07-0400 [dmoz] INFO: Spider opened
2014-01-23 18:13:08-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-01-23 18:13:09-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
In practice, it did not.
Error 2:
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiderloader.py", line 71, in load
raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: dmoz'
alicedeMacBook-Pro:tutorial alice$ scrapy crawl dmoz
2019-04-19 09:28:23 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: tutorial)
2019-04-19 09:28:23 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 16:39:00) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Darwin-17.3.0-x86_64-i386-64bit
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiderloader.py", line 69, in load
    return self._spiders[spider_name]
KeyError: 'dmoz'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiderloader.py", line 71, in load
    raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: dmoz'
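Cause: the shell is in the wrong directory. Scrapy found a project here (the log shows bot: tutorial) but no spider named dmoz in it, so the command has to be run from the project that actually contains the dmoz spider.
Fix: this one is simple. Check the path and cd into the right directory.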
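To confirm that the project you are in really contains a spider called dmoz, you can scan the spiders directory for name declarations. This is a rough sketch using only the standard library: spider_names is a hypothetical helper, the tutorial/spiders path is taken from the exercise above, and the regular expression only recognises simple name = "..." assignments, so it is a heuristic and not how Scrapy loads spiders.

```python
import re
from pathlib import Path

# Matches an indented assignment such as:    name = "dmoz"
NAME_PATTERN = re.compile(r'^\s+name\s*=\s*["\']([^"\']+)["\']', re.MULTILINE)


def spider_names(spiders_dir):
    """Map each spider name declared under spiders_dir to the file declaring it."""
    directory = Path(spiders_dir)
    names = {}
    if not directory.is_dir():
        return names
    for path in sorted(directory.glob("*.py")):
        for match in NAME_PATTERN.finditer(path.read_text(encoding="utf-8")):
            names[match.group(1)] = path.name
    return names


found = spider_names("tutorial/spiders")
print(found if found else "no spiders found under tutorial/spiders")
```

Inside a project, the scrapy list command prints the spider names that Scrapy itself can see, which is the authoritative check.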
Error 3:
File "/Library/Python/2.7/site-packages/twisted/internet/_sslverify.py", line 15, in <module>
from OpenSSL._util import lib as pyOpenSSLlib
ImportError: No module named _util
alicedeMacBook-Pro:tutorial alice$ scrapy crawl dmoz
2018-08-06 22:25:23 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: tutorial)
2018-08-06 22:25:23 [scrapy.utils.log] INFO: Versions: lxml 4.2.4.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.10 (default, Jul 15 2017, 17:16:57) - [GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)], pyOpenSSL 0.13.1 (LibreSSL 2.2.7), cryptography unknown, Platform Darwin-17.3.0-x86_64-i386-64bit
2018-08-06 22:25:23 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tutorial'}
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/Library/Python/2.7/site-packages/scrapy/cmdline.py", line 150, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Library/Python/2.7/site-packages/scrapy/cmdline.py", line 90, in _run_print_help
    func(*a, **kw)
  File "/Library/Python/2.7/site-packages/scrapy/cmdline.py", line 157, in _run_command
t/ssl.py", line 230, in <module>
    from twisted.internet._sslverify import (
  File "/Library/Python/2.7/site-packages/twisted/internet/_sslverify.py", line 15, in <module>
    from OpenSSL._util import lib as pyOpenSSLlib
ImportError: No module named _util
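I searched online for a long time without finding a fix. Some bloggers said that pyOpenSSL or Scrapy had been installed incorrectly, so I reinstalled both, but the same error came back and I was out of ideas.
Later I reinstalled pyOpenSSL and Scrapy once more, and that seems to have fixed it. The version lines in the logs fit this: the failing run reports pyOpenSSL 0.13.1 on Python 2.7.10, while the later Spider-not-found run reports pyOpenSSL 18.0.0 on Python 3.7.3. The spider now starts and runs to completion: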
2019-04-19 09:46:37 [scrapy.core.engine] INFO: Spider opened
2019-04-19 09:46:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-19 09:46:39 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://www.dmoz.org/robots.txt> (referer: None)
2019-04-19 09:46:39 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2019-04-19 09:46:40 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>: HTTP status code is not handled or not allowed
2019-04-19 09:46:40 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2019-04-19 09:46:40 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/>: HTTP status code is not handled or not allowed
2019-04-19 09:46:40 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-19 09:46:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 737,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 2103,
 'downloader/response_count': 3,
 'downloader/response_status_count/403': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 19, 1, 46, 40, 570939),
 'httperror/response_ignored_count': 2,
 'httperror/response_ignored_status_count/403': 2,
 'log_count/DEBUG': 3,
 'log_count/INFO': 9,
 'log_count/WARNING': 1,
 'memusage/max': 65601536,
 'memusage/startup': 65597440,
 'response_received_count': 3,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/403': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2019, 4, 19, 1, 46, 37, 468659)}
2019-04-19 09:46:40 [scrapy.core.engine] INFO: Spider closed (finished)
alicedeMacBook-Pro:tutorial alice$
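For an ImportError like the one in Error 3, it can help to test the failing import directly in the same interpreter that runs Scrapy, before and after reinstalling. A small sketch under that assumption: can_import is a hypothetical helper, and the module names are the ones that appear in the traceback. It catches every exception, not just ImportError, because a mismatched install can fail in other ways at import time.

```python
import importlib


def can_import(module_name):
    """Return True if module_name imports cleanly, False if the import raises."""
    try:
        importlib.import_module(module_name)
    except Exception:
        # Any failure at import time counts as "not usable" for this check.
        return False
    return True


# OpenSSL._util is the module the traceback failed on.
for module in ("OpenSSL", "OpenSSL._util", "twisted.internet.ssl"):
    print(f"{module}: {'ok' if can_import(module) else 'import failed'}")
```

If OpenSSL imports but OpenSSL._util does not, the installed pyOpenSSL is probably too old for the installed Twisted, which would be consistent with the reinstall having helped.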