A simple guide to installing Scrapy on Windows
Scrapy is a fast, high-level screen scraping and web crawling framework written in Python, used for crawling websites and extracting structured data from their pages. It has a wide range of applications, including data mining, monitoring, and automated testing.
Part of Scrapy's appeal is that it is a framework: anyone can adapt it to their own needs. It also provides base classes for several kinds of spiders, such as BaseSpider and sitemap spiders, and recent versions add support for crawling Web 2.0 sites.
A note before we start: this article uses Python 3.7 on 64-bit Windows; readers may substitute another version.
Step 1: Make sure the environment is ready
Install Python yourself, and add both the Python directory and its Scripts subdirectory to the system PATH environment variable.
Note: if you checked the option to add Python to PATH during installation, you can skip this step.
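To confirm that PATH is set up correctly, a quick standard-library check like the following can help (this is an illustrative sketch, not part of the installation itself):

```python
import shutil
import sys

# This guide assumes Python 3.7 on 64-bit Windows; print what is actually running.
print(sys.version)

# If the Python directory and its Scripts subdirectory are on PATH,
# both executables should resolve to a real location.
for tool in ("python", "pip"):
    location = shutil.which(tool)
    print(tool, "->", location if location else "NOT FOUND - check PATH")
```

If either line shows NOT FOUND, revisit the environment-variable step before continuing.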
Prepare the two required files: Twisted-18.7.0-cp37-cp37m-win_amd64.whl and lxml-4.2.3-cp37-cp37m-win_amd64.whl.
Download link: https://pan.baidu.com/s/1TC2q_oC5h6Z4ymRpmpSxsA (contains builds for both Python 3.5 and 3.7)
Alternatively, download them yourself from the unofficial Windows binaries for Python packages: https://www.lfd.uci.edu/~gohlke/pythonlibs/
Note: Scrapy is built on the Twisted framework and uses lxml to parse HTML. On Windows, a plain pip install of Scrapy often fails to build these two components, so we install them separately from pre-built wheels.
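The cp37 and win_amd64 parts of those filenames must match your interpreter. If you are unsure which wheel to pick, both tags can be derived from the running Python:

```python
import platform
import sys

# "cp37" in a wheel filename means CPython 3.7; it must match this tag.
print("Python tag: cp{}{}".format(sys.version_info.major, sys.version_info.minor))

# "win_amd64" wheels need a 64-bit interpreter; "win32" wheels need a 32-bit one.
print("Interpreter architecture:", platform.architecture()[0])
```

Installing a wheel whose tags do not match produces a "not a supported wheel on this platform" error from pip.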
Step 2: Install Scrapy
Change into the directory containing Twisted-18.7.0-cp37-cp37m-win_amd64.whl and lxml-4.2.3-cp37-cp37m-win_amd64.whl, then install both wheels with pip:
C:\Users\WU\Downloads\scrapyFile>pip install lxml-4.2.3-cp37-cp37m-win_amd64.whl
C:\Users\WU\Downloads\scrapyFile>pip install Twisted-18.7.0-cp37-cp37m-win_amd64.whl
Note: in this example the two files are stored in C:\Users\WU\Downloads\scrapyFile.
Once lxml and Twisted are installed successfully, run the following commands to install Scrapy itself:
pip install pywin32
pip install scrapy
Note: pywin32 is installed because Python will later need to call Windows system API libraries.
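After the pip commands finish, a small sanity check confirms that the manually installed wheels actually import; the helper below (a hypothetical name, standard library only) returns each package's version, or None if the import fails:

```python
def check_imports(names=("lxml", "twisted")):
    """Return each package's version string, or None if it fails to import."""
    status = {}
    for name in names:
        try:
            module = __import__(name)
            status[name] = getattr(module, "__version__", "installed")
        except ImportError:
            status[name] = None
    return status

print(check_imports())
```

A None value here means the corresponding wheel install did not succeed and should be repeated.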
Step 3: Verify that Scrapy installed correctly
Run scrapy in cmd; you should see output like the following:
C:\Users\WU\Downloads\scrapyFile>scrapy
Scrapy 1.5.1 - no active project
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
[ more ] More commands available when run from project directory
Use "scrapy <command> -h" to see more info about a command
Step 4: Create a Scrapy project
Change into your working directory and run the following command to generate a project:
D:\pythonplace\scrapy>scrapy startproject helloworld
New Scrapy project 'helloworld', using template directory 'd:\\software\\python3.7\\lib\\site-packages\\scrapy\\templates\\project', created in:
D:\pythonplace\scrapy\helloworld
You can start your first spider with:
cd helloworld
scrapy genspider example example.com
Appendix:
1. If launching a spider with the scrapy crawl xxx command fails with the following error:
Traceback (most recent call last):
File "/usr/local/bin/scrapy", line 11, in <module>
sys.exit(execute())
File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 150, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 90, in _run_print_help
func(*a, **kw)
File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 157, in _run_command
cmd.run(args, opts)
File "/usr/local/lib/python3.7/site-packages/scrapy/commands/crawl.py", line 57, in run
self.crawler_process.crawl(spname, **opts.spargs)
File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 170, in crawl
crawler = self.create_crawler(crawler_or_spidercls)
File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 198, in create_crawler
return self._create_crawler(crawler_or_spidercls)
File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 203, in _create_crawler
return Crawler(spidercls, self.settings)
File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 55, in __init__
self.extensions = ExtensionManager.from_crawler(self)
File "/usr/local/lib/python3.7/site-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/usr/local/lib/python3.7/site-packages/scrapy/middleware.py", line 34, in from_settings
mwcls = load_object(clspath)
File "/usr/local/lib/python3.7/site-packages/scrapy/utils/misc.py", line 44, in load_object
mod = import_module(module)
File "/usr/local/lib/python3.7/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
File "<frozen importlib._bootstrap>", line 983, in _find_and_load
File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 728, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/usr/local/lib/python3.7/site-packages/scrapy/extensions/telnet.py", line 12, in <module>
from twisted.conch import manhole, telnet
File "/usr/local/lib/python3.7/site-packages/twisted/conch/manhole.py", line 154
def write(self, data, async=False):
^
SyntaxError: invalid syntax
Open the Lib/site-packages/twisted/conch/manhole.py file under your Python directory and rename async on lines 154, 155, 240, 241, and 247, as follows:
154 def write(self, data, async1=False):
155 self.handler.addOutput(data, async1)
........
240 def addOutput(self, data, async1=False):
241 if async1:
........
247 if async1:
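The underlying cause is that async (and await) became reserved keywords in Python 3.7, so Twisted 18.7's use of async as a parameter name is a syntax error; any non-keyword name, such as async1 above, works. Newer Twisted releases renamed the parameter upstream, so upgrading Twisted may also resolve this. A quick check of the keyword status:

```python
import keyword

# "async" is a full keyword from Python 3.7 on, so it can no longer appear
# as a parameter name, as in "def write(self, data, async=False)".
print(keyword.iskeyword("async"))
```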