Python 3 Web Crawler in Practice, Part 10: Installing the Crawler Frameworks PySpider and Scrapy

Writing crawlers directly with Requests, Selenium, and similar libraries is perfectly adequate when the crawl volume is small and speed requirements are modest. After writing several crawlers, though, you will notice that much of the code and many of the internal components can be reused. If you extract those components and modularize each function, a framework prototype gradually takes shape, and over time a crawler framework is born.

With a framework, we no longer need to care how particular features are implemented; we only need to care about the crawling logic. This greatly reduces the amount of code, makes the architecture clearer, and improves crawling efficiency considerably. So if you already have some crawler basics, using a framework is a good choice.
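To make the idea of pulling reusable components out into a framework concrete, here is a minimal sketch. Everything in it is hypothetical and for illustration only: the class names MiniCrawler and TitleCrawler, the in-memory FAKE_SITE dictionary standing in for real HTTP fetching, and the page format. It is not how PySpider or Scrapy are implemented; it only shows the split between a reusable scheduling loop and the site-specific parsing logic.

```python
from collections import deque


class MiniCrawler:
    """Reusable part: queueing, de-duplication, and the crawl loop."""

    def __init__(self, fetch):
        self.fetch = fetch      # callable taking a URL and returning page text
        self.queue = deque()
        self.seen = set()
        self.results = []

    def enqueue(self, url):
        # Skip URLs that have already been scheduled.
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def parse(self, url, text):
        """Site-specific part: return (item, list_of_links)."""
        raise NotImplementedError

    def run(self, start_url):
        self.enqueue(start_url)
        while self.queue:
            url = self.queue.popleft()
            item, links = self.parse(url, self.fetch(url))
            self.results.append(item)
            for link in links:
                self.enqueue(link)
        return self.results


class TitleCrawler(MiniCrawler):
    """Only the crawling logic lives here."""

    def parse(self, url, text):
        # Hypothetical page format: first line is the title,
        # the rest is a space-separated list of links.
        title, _, rest = text.partition("\n")
        return (url, title), rest.split()


# Hypothetical in-memory "website" so the sketch runs without a network.
FAKE_SITE = {
    "a": "Page A\nb c",
    "b": "Page B\na",
    "c": "Page C\n",
}

crawler = TitleCrawler(fetch=FAKE_SITE.__getitem__)
results = crawler.run("a")
print(results)
```

A real framework adds much more on top of this loop, such as concurrency, retries, and storage, but the division of labor is the same.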

The crawler frameworks introduced in this book are PySpider and Scrapy. This section covers installing PySpider, Scrapy, and some of their extensions.

PySpider installation

PySpider is a powerful web crawler framework written by the Chinese developer binux. It comes with a WebUI, a script editor, a task monitor, a project manager, and a result processor. It supports multiple database backends and multiple message queues, and it can crawl JavaScript-rendered pages, which makes it very convenient to use. This section walks through its installation.

1. Links

2. Preparation

PySpider supports JavaScript rendering, and that feature depends on PhantomJS. Install PhantomJS before installing PySpider; the installation method was covered earlier.
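Before moving on, it can help to confirm that PhantomJS is actually reachable from the command line. The following is a small sketch that assumes the executable is named phantomjs and is expected on PATH; the helper name find_executable is made up for this example.

```python
import shutil


def find_executable(name):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


# Assumption: the PhantomJS binary is called "phantomjs".
phantomjs_path = find_executable("phantomjs")
if phantomjs_path is None:
    print("phantomjs not found on PATH; install it before installing PySpider")
else:
    print("phantomjs found at", phantomjs_path)
```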

3. Pip installation

Installing with pip is recommended. The command is:

pip3 install pyspider

When the command finishes, the installation is complete.

4. Common Errors

On Windows you may see this error: Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-vXo1W3/pycurl
This is a PyCurl installation error, which usually occurs on Windows. You need to install the PyCurl library first. The download link is: http://www.lfd.uci.edu/~gohlk ... Find the wheel file that matches your Python version and download it.
For example, for 64-bit Windows with Python 3.6, download pycurl-7.43.0-cp36-cp36m-win_amd64.whl and then install it with pip:

pip3 install pycurl-7.43.0-cp36-cp36m-win_amd64.whl
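Picking the right wheel comes down to reading the tags in its filename. The sketch below splits a wheel filename into its parts following the standard wheel naming convention (name, version, optional build tag, Python tag, ABI tag, platform tag) and compares the Python tag with the running interpreter. The helper names parse_wheel_filename and current_cpython_tag are made up for this example, and the comparison is a rough check for CPython only, not a replacement for pip's own compatibility logic.

```python
import sys


def parse_wheel_filename(filename):
    """Split a wheel filename into its naming-convention fields."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    parts = filename[:-len(".whl")].split("-")
    # 5 fields normally, 6 when an optional build tag is present.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel filename: " + filename)
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


def current_cpython_tag():
    """Python tag of the running interpreter, assuming CPython (e.g. cp36)."""
    return "cp{}{}".format(sys.version_info.major, sys.version_info.minor)


info = parse_wheel_filename("pycurl-7.43.0-cp36-cp36m-win_amd64.whl")
print(info)
print("matches this interpreter:", info["python_tag"] == current_cpython_tag())
```

Here cp36 means CPython 3.6 and win_amd64 means 64-bit Windows, which is why this file suits the environment described above.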

If you run into PyCurl errors on Linux, refer to this article: https://imlonghao.com/19.html

On a Mac, if you run into this error, do the following:

brew install openssl

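Check the installed version:

openssl version

Locate the OpenSSL header file:

find /usr/local -name ssl.h

You should see a path like this:

/usr/local/Cellar/openssl/1.0.2s/include/openssl/ssl.h

Set the environment variables, adjusting the version directory to match the path found above:

export PYCURL_SSL_LIBRARY=openssl
export LDFLAGS=-L/usr/local/Cellar/openssl/1.0.2s/lib
export CPPFLAGS=-I/usr/local/Cellar/openssl/1.0.2s/include

Then install PySpider again:

pip3 install pyspider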
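The three export values follow mechanically from the ssl.h path that find prints. As a sketch of that relationship, the helper below (pycurl_build_env is a made-up name) derives them from the header path. It assumes the Homebrew layout seen above, where the headers live in <prefix>/include/openssl and the libraries in <prefix>/lib.

```python
from pathlib import PurePosixPath


def pycurl_build_env(ssl_header_path):
    """Derive the build variables from the location of openssl/ssl.h."""
    header = PurePosixPath(ssl_header_path)
    include_dir = header.parent.parent   # .../include
    prefix = include_dir.parent          # .../openssl/<version>
    return {
        "PYCURL_SSL_LIBRARY": "openssl",
        "LDFLAGS": "-L{}".format(prefix / "lib"),
        "CPPFLAGS": "-I{}".format(include_dir),
    }


env = pycurl_build_env("/usr/local/Cellar/openssl/1.0.2s/include/openssl/ssl.h")
for key, value in env.items():
    print("export {}={}".format(key, value))
```

If your OpenSSL version differs from 1.0.2s, the printed lines show the values to export instead.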

5. Verify Installation

After the installation is complete, you can start PySpider directly at the command line:

pyspider all

Figure 1-75 Console output
The PySpider web service then runs on local port 5000. Open http://localhost:5000/ in a browser to reach the PySpider WebUI management page, as shown in Figure 1-76:

Figure 1-76 Management page
If a similar page appears, PySpider has been installed successfully.
PySpider usage is covered in detail later.
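The browser check can also be done from a script. This is a rough sketch that only tests whether something answers with HTTP 200 at the WebUI address; the function name webui_is_up is made up, and the default URL assumes PySpider is running on port 5000 as described above.

```python
import http.client
import urllib.request


def webui_is_up(url="http://localhost:5000/", timeout=3):
    """Return True if the URL answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # URLError and HTTPError are subclasses of OSError, so connection
        # failures, timeouts, and error status codes all end up here.
        return False
```

With pyspider all running in another terminal, webui_is_up() should return True; if nothing is listening on the port it returns False.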

There is one pitfall: PySpider fails with the following error when run on Python 3.7:

File "/usr/local/lib/python3.7/site-packages/pyspider/run.py", line 231
    async=True, get_object=False, no_input=False):
        ^
SyntaxError: invalid syntax

The cause is that async became a reserved keyword in Python 3.7, so using it as a parameter name is a syntax error.
The fix is to manually rename async to mark_async at the following locations:

/usr/local/lib/python3.7/site-packages/pyspider/run.py: lines 231, 245 (two occurrences), and 365

/usr/local/lib/python3.7/site-packages/pyspider/webui/app.py: line 95

/usr/local/lib/python3.7/site-packages/pyspider/fetcher/tornado_fetcher.py: lines 81, 89 (two occurrences), 95, and 117
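The rename can be sketched as a small helper that touches only the listed lines and only whole-word occurrences of async. The function name rename_identifier_on_lines and the sample source text are made up for illustration; the sample is not PySpider's real code. Back up the files before applying anything like this to the actual installation.

```python
import keyword
import re


def rename_identifier_on_lines(source, line_numbers, old="async", new="mark_async"):
    """Replace whole-word `old` with `new`, only on the given 1-based lines."""
    pattern = re.compile(r"\b" + re.escape(old) + r"\b")
    lines = source.splitlines(keepends=True)
    for number in line_numbers:
        index = number - 1
        lines[index] = pattern.sub(new, lines[index])
    return "".join(lines)


# On Python 3.7 and later this prints True, which is why the old code breaks.
print("async is a keyword:", keyword.iskeyword("async"))

# Hypothetical snippet standing in for the affected source lines.
sample = (
    "def all(ctx,\n"
    "        async=True, get_object=False):\n"
    "    g.async = async\n"
    "    run_async = 1\n"
)
fixed = rename_identifier_on_lines(sample, [2, 3])
print(fixed)
```

The word-boundary match leaves names such as run_async alone, and restricting the change to explicit line numbers mirrors the manual edit described above.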

Scrapy installation

Scrapy is a very powerful crawler framework with many dependencies. At a minimum it requires Twisted 14.0, lxml 3.4, and pyOpenSSL 0.14. The requirements differ from platform to platform, so it is best to make sure some base libraries are installed before installing Scrapy. This section describes how to install Scrapy on different platforms.
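To see whether an environment already satisfies the minimum versions listed above, something like the following sketch can be used. It assumes Python 3.8 or later (for importlib.metadata) and compares only the leading numeric part of each version string, which is a simplification of real version ordering. The helper names and the MINIMUM_VERSIONS table are made up for this example; the numbers in the table come from the paragraph above.

```python
import re
from importlib import metadata  # available from Python 3.8

# Minimum versions quoted in the text above.
MINIMUM_VERSIONS = {"Twisted": "14.0", "lxml": "3.4", "pyOpenSSL": "0.14"}


def release_tuple(version):
    """Turn the leading numeric part of a version string into a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError("cannot parse version: " + version)
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed, minimum):
    """Compare two versions numerically, padding the shorter one with zeros."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_dependencies(minimums=MINIMUM_VERSIONS):
    """Report 'ok', 'missing', or 'too old (<version>)' for each package."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if meets_minimum(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = "too old ({})".format(installed)
    return report


print(check_dependencies())
```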

1. Links

3. Installation on Mac

Building Scrapy's dependencies on a Mac requires a C compiler and development headers, which are generally provided by Xcode. Run the following command to install them:

xcode-select --install

Then install Scrapy with pip:

pip3 install Scrapy

When the command finishes, Scrapy is installed.

4. Verify the installation

After installation, type scrapy at the command line. If you see output similar to Figure 1-80, Scrapy has been installed successfully:

Figure 1-80 Verifying the installation
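The same verification can be scripted by running the scrapy command and reading its first line of output. This is only a sketch: the helper name first_output_line is made up, and it assumes the scrapy executable is on PATH and that its version subcommand prints the version on the first line.

```python
import shutil
import subprocess
import sys


def first_output_line(command, timeout=30):
    """Run a command and return its first output line, or None if not on PATH."""
    if shutil.which(command[0]) is None:
        return None
    completed = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # merge stderr so nothing is lost
        universal_newlines=True,
        timeout=timeout,
    )
    lines = completed.stdout.splitlines()
    return lines[0] if lines else ""


# None here means the scrapy command was not found on PATH.
print("scrapy:", first_output_line(["scrapy", "version"]))
print("python:", first_output_line([sys.executable, "--version"]))
```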

5. Common Errors

pkg_resources.VersionConflict: (six 1.5.2 (/usr/lib/python3/dist-packages), Requirement.parse('six>=1.6.0'))

The installed version of six is too old. six is a library that provides compatibility between Python 2 and Python 3. Upgrade it with:

sudo pip3 install -U six

c/_cffi_backend.c:15:17: fatal error: ffi.h: No such file or directory

This error appears frequently on Linux and means the libffi library is missing. What is libffi? FFI stands for Foreign Function Interface, which generally refers to an interface that allows code written in one language to call code written in another. The libffi library provides only the lowest-level, architecture-dependent layer of a full FFI.
Installing the corresponding libraries fixes the error.
Ubuntu, Debian:

sudo apt-get install build-essential libssl-dev libffi-dev python3-dev

CentOS, RedHat:

sudo yum install gcc libffi-devel python-devel openssl-devel
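To make the FFI idea mentioned above more concrete: Python's standard ctypes module is itself an FFI layer built on libffi in CPython, and it can call a C function directly. The sketch below assumes Linux or macOS, where ctypes.CDLL(None) exposes the C library already loaded into the process; it does not work this way on Windows.

```python
import ctypes

# Assumption: a POSIX system. Passing None loads the symbols of the
# current process, which include the C standard library.
libc = ctypes.CDLL(None)

# Declare the C signature: int abs(int);
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int

# Python code calling a function written in C.
result = libc.abs(-5)
print(result)
```

The cffi package that fails to build in the error above plays a similar role, which is why it needs the libffi headers at install time.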

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build/cryptography/

The cryptography components are missing. Install them with pip:

pip3 install cryptography

ImportError: No module named 'packaging'

The packaging package, which provides core utilities for Python packages, is missing. Install it with pip:

pip3 install packaging

ImportError: No module named '_cffi_backend'

The cffi package is missing. Install it with pip:

pip3 install cffi

ImportError: No module named 'pyparsing'

The pyparsing package is missing. Install it, together with appdirs, using pip:

pip3 install pyparsing appdirs
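Rather than hitting these ImportError messages one at a time, the modules mentioned in this section can be checked in one pass. The sketch below is an illustration only: the MODULE_TO_PACKAGE table and the helper name missing_modules are made up, and the table simply maps each module named in the errors above to the pip package used to fix it.

```python
import importlib.util

# Module name as it appears in the ImportError -> pip package that provides it.
MODULE_TO_PACKAGE = {
    "six": "six",
    "cryptography": "cryptography",
    "packaging": "packaging",
    "_cffi_backend": "cffi",
    "pyparsing": "pyparsing",
    "appdirs": "appdirs",
}


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(MODULE_TO_PACKAGE)
if missing:
    packages = sorted({MODULE_TO_PACKAGE[name] for name in missing})
    print("suggested command: pip3 install " + " ".join(packages))
else:
    print("all helper modules are importable")
```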

Origin blog.51cto.com/14445003/2424880