We can write crawlers directly with Requests, Selenium, and other libraries. If the crawl volume is small and the speed requirements are modest, this fully meets our needs. But as we write more crawlers, we find that much of the code and many internal components can be reused. If we extract these components and modularize each function, a framework prototype gradually takes shape, and over time a crawler framework is born.
With a framework, we no longer need to care about how certain functions are implemented; we only need to care about the crawling logic. This greatly reduces the amount of code, makes the architecture clearer, and improves crawling efficiency. So if you already have some crawler experience, using a framework is a good choice.
The crawler frameworks introduced in this book are PySpider and Scrapy. This section covers installing PySpider, Scrapy, and some of their extensions.
PySpider installation
PySpider is a powerful web crawler framework written by the Chinese developer binux. It comes with a powerful WebUI, script editor, task monitor, project manager, and result processor. It supports a variety of database back ends and multiple message queues, and it can also crawl JavaScript-rendered pages. It is very convenient to use. This section describes its installation process.
1. Links
- Official documentation: http://docs.pyspider.org/
- PyPI: https://pypi.python.org/pypi/...
- GitHub: https://github.com/binux/pysp...
- Official tutorial: http://docs.pyspider.org/en/l...
- Online demo: http://demo.pyspider.org
2. Preparation
PySpider supports JavaScript rendering, and this depends on PhantomJS, so install PhantomJS before installing PySpider. The installation method was covered earlier.
3. Pip installation
Installing with pip is recommended. The command is:
pip3 install pyspider
Once the command finishes, the installation is complete.
4. Common Errors
On Windows, this error may appear: Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-vXo1W3/pycurl
This is a PyCurl installation error that usually occurs on Windows. You need to install the PyCurl library first. The download link is: http://www.lfd.uci.edu/~gohlk ... Find the wheel file that matches your Python version and download it.
For example, for 64-bit Windows and Python 3.6, download pycurl-7.43.0-cp36-cp36m-win_amd64.whl, then install it with pip using the following command:
pip3 install pycurl-7.43.0-cp36-cp36m-win_amd64.whl
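Picking the right wheel comes down to reading the tags in its filename. The following is a minimal sketch, not part of PySpider or pip: the helper names wheel_tags and matches_interpreter are hypothetical, and the check assumes a CPython wheel for Windows named like the one above.

```python
def wheel_tags(filename):
    """Return (python_tag, abi_tag, platform_tag) from a wheel filename."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    stem = filename[: -len(".whl")]
    # The last three dash-separated fields are the python, ABI and platform tags.
    python_tag, abi_tag, platform_tag = stem.split("-")[-3:]
    return python_tag, abi_tag, platform_tag


def matches_interpreter(filename, major, minor, is_64bit):
    """Rough check (assumption: CPython on Windows) that a wheel fits an interpreter."""
    python_tag, _abi_tag, platform_tag = wheel_tags(filename)
    expected_platform = "win_amd64" if is_64bit else "win32"
    return python_tag == f"cp{major}{minor}" and platform_tag == expected_platform
```

For the running interpreter, the arguments could be taken from sys.version_info.major, sys.version_info.minor, and sys.maxsize > 2**32.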
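If you encounter PyCurl errors on Linux, refer to this article: https://imlonghao.com/19.html
On Mac, if you hit this error, perform the following steps. First install OpenSSL and check its version:
brew install openssl
openssl version
Then locate the ssl.h header:
find /usr/local -name ssl.h
You should see a path like:
/usr/local/Cellar/openssl/1.0.2s/include/openssl/ssl.h
Then set the environment variables to match that path and reinstall: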
export PYCURL_SSL_LIBRARY=openssl
export LDFLAGS=-L/usr/local/Cellar/openssl/1.0.2s/lib
export CPPFLAGS=-I/usr/local/Cellar/openssl/1.0.2s/include
pip3 install pyspider
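The version directory in those paths (1.0.2s here) will differ from machine to machine, so the flags have to be derived from wherever ssl.h was found. A minimal sketch of that derivation, assuming the Homebrew-style layout shown above (prefix/include/openssl/ssl.h); the function name openssl_build_env is hypothetical:

```python
from pathlib import PurePosixPath


def openssl_build_env(ssl_header):
    """Derive the build variables from the path of ssl.h.

    Assumes the layout <prefix>/include/openssl/ssl.h, so the prefix
    is three levels above the header file.
    """
    prefix = PurePosixPath(ssl_header).parents[2]
    return {
        "PYCURL_SSL_LIBRARY": "openssl",
        "LDFLAGS": f"-L{prefix / 'lib'}",
        "CPPFLAGS": f"-I{prefix / 'include'}",
    }


if __name__ == "__main__":
    env = openssl_build_env("/usr/local/Cellar/openssl/1.0.2s/include/openssl/ssl.h")
    for key, value in env.items():
        # Print lines that can be pasted into the shell.
        print(f"export {key}={value}")
```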
5. Verify Installation
After the installation is complete, you can start PySpider directly at the command line:
pyspider all
Figure 1-75 console
PySpider's web service then runs on local port 5000. Open http://localhost:5000/ in a browser to enter the PySpider WebUI management page, as shown in Figure 1-76:
Figure 1-76 Management page
If a similar page appears, PySpider has been installed successfully.
The usage of PySpider is covered in detail later.
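If no browser is at hand, the same check can be approximated from Python by testing whether anything accepts connections on port 5000. This is only a sketch: the helper name port_open is hypothetical, and an open port shows that some service is listening, not that it is PySpider.

```python
import socket


def port_open(host="localhost", port=5000, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False


if __name__ == "__main__":
    state = "reachable" if port_open() else "not reachable"
    print("WebUI port 5000 is", state)
```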
One pitfall: PySpider raises an error when run on Python 3.7:
File "/usr/local/lib/python3.7/site-packages/pyspider/run.py", line 231
async=True, get_object=False, no_input=False):
^
SyntaxError: invalid syntax
The cause is that async became a reserved keyword in Python 3.7, so using it as a parameter name is a syntax error.
The fix is to replace it manually.
Change async to mark_async at the following locations:
/usr/local/lib/python3.7/site-packages/pyspider/run.py: lines 231, 245 (two occurrences), 365
/usr/local/lib/python3.7/site-packages/pyspider/webui/app.py: line 95
/usr/local/lib/python3.7/site-packages/pyspider/fetcher/tornado_fetcher.py: lines 81, 89 (two occurrences), 95, 117
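The manual replacement can be sketched as a whole-word substitution. This is an illustration under assumptions, not a PySpider tool: rename_async and is_reserved are hypothetical helpers, and a plain regex also rewrites matches inside strings and comments and would break any genuine "async def", so review the result before saving it over the installed files.

```python
import keyword
import re


def is_reserved(name):
    """True if `name` is a keyword in the running Python (async is, from 3.7 on)."""
    return keyword.iskeyword(name)


def rename_async(source):
    """Replace the identifier `async` with `mark_async`, matching whole words only."""
    return re.sub(r"\basync\b", "mark_async", source)


if __name__ == "__main__":
    line = "async=True, get_object=False, no_input=False):"
    print(rename_async(line))
```

Because the pattern matches whole words only, names such as asynchronous are left alone, and running the function twice does not change the text again.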
Scrapy installation
Scrapy is a very powerful crawler framework with many dependencies; at a minimum it requires Twisted 14.0, lxml 3.4, and pyOpenSSL 0.14. The setup differs across platforms, so it is best to install the basic libraries before installing Scrapy. This section describes how to install Scrapy on different platforms.
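To see whether an environment already satisfies these minimums, the installed versions can be compared against them. A minimal sketch, assuming Python 3.8 or later for importlib.metadata; the helper names and the simplified numeric comparison are illustrative and ignore pre-release suffixes:

```python
from importlib import metadata

# Minimum versions as stated in the text above.
MINIMUMS = {"Twisted": "14.0", "lxml": "3.4", "pyOpenSSL": "0.14"}


def version_tuple(text):
    """Turn '17.9.0rc1' into (17, 9, 0), keeping only leading digits of each part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings numerically, padding the shorter with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_dependencies(minimums=MINIMUMS):
    """Map each package to True/False, or None if it is not installed."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
        else:
            report[name] = meets_minimum(installed, minimum)
    return report


if __name__ == "__main__":
    print(check_dependencies())
```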
1. Links
- Official website: https://scrapy.org
- Official documentation: https://docs.scrapy.org
- PyPI: https://pypi.python.org/pypi/...
- GitHub: https://github.com/scrapy/scrapy
- Chinese documentation: http://scrapy-chs.readthedocs.io
3. Installation on Mac
Building Scrapy's dependencies on a Mac requires a C compiler and development headers, which are generally provided by Xcode. Run the following command to install them:
xcode-select --install
Then install Scrapy with pip by running:
pip3 install Scrapy
Once the command finishes, Scrapy is installed.
4. Verify the installation
After installation, enter scrapy at the command line. If output similar to Figure 1-80 appears, Scrapy has been installed successfully:
Figure 1-80 Verifying the installation
5. Common Errors
pkg_resources.VersionConflict: (six 1.5.2 (/usr/lib/python3/dist-packages), Requirement.parse('six>=1.6.0'))
The installed version of six is too old. six is a library that provides compatibility between Python 2 and Python 3. Upgrade it with:
sudo pip3 install -U six
c/_cffi_backend.c:15:17: fatal error: ffi.h: No such file or directory
This error often appears on Linux and is caused by the missing libffi library. What is libffi? FFI stands for Foreign Function Interface, which generally refers to an interface that lets code written in one language call code written in another. The libffi library provides the lowest-level, architecture-dependent layer of a complete FFI.
Installing the appropriate library fixes it.
Ubuntu, Debian:
sudo apt-get install build-essential libssl-dev libffi-dev python3-dev
CentOS, RedHat:
sudo yum install gcc libffi-devel python-devel openssl-devel
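To make the term FFI concrete, here is a small sketch that calls the C function strlen from Python through the standard ctypes module, which in CPython is itself built on libffi. It assumes Linux or macOS, where ctypes.CDLL(None) exposes the C library already loaded into the process; the wrapper name c_strlen is hypothetical.

```python
import ctypes


def c_strlen(data):
    """Call the C library's strlen() on a bytes object and return the length."""
    libc = ctypes.CDLL(None)  # POSIX only: symbols of the running process
    # Declare the C signature so ctypes converts arguments and results correctly.
    libc.strlen.argtypes = [ctypes.c_char_p]
    libc.strlen.restype = ctypes.c_size_t
    return libc.strlen(data)


if __name__ == "__main__":
    print(c_strlen(b"scrapy"))
```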
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build/cryptography/
The cryptography components are missing; install them with pip:
pip3 install cryptography
ImportError: No module named 'packaging'
The packaging package is missing. It provides core utilities for Python packages and can be installed with pip (pip3 install packaging).
ImportError: No module named '_cffi_backend'
The cffi package is missing; install it with pip:
pip3 install cffi
ImportError: No module named 'pyparsing'
The pyparsing package is missing; install it with pip (the command below also installs appdirs):
pip3 install pyparsing appdirs
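Rather than meeting these ImportError messages one at a time, the modules named in them can be probed up front. A minimal sketch using the standard library; the helper name missing_modules is hypothetical, and the list contains only the top-level modules mentioned in the errors above:

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    wanted = ["packaging", "_cffi_backend", "pyparsing"]
    missing = missing_modules(wanted)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All modules are importable.")
```

Note that a module name is not always the pip package name: _cffi_backend, for example, is provided by the cffi package.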