Article Directory
- 1. Requests library
- 2. Selenium library
- 3. ChromeDriver
- 4. PhantomJS library
- 5. lxml library
- 6. beautifulsoup library
- 7. pyquery library
- 8. pymysql library
- 9. pymongo library
- 10. Redis library
- 11. Flask library
- 12. Jupyter library
- 1. MySQL installation and configuration
- 2. Installation and Configuration of MongoDB
- 3. Redis installation and configuration
As the saying goes, a craftsman who wants to do good work must first sharpen his tools. To carry out crawler work successfully, we need to install some Python libraries. Libraries are installed mainly with pip, or by downloading an installation package in whl format.
Note: to install from a whl file, first make sure the wheel library is available; it can be installed with pip install wheel (recent versions of pip can also install whl files on their own). Once that is ready, you can install or upgrade a library from its whl file with the following commands.
pip install xxx.whl  # install the xxx library
pip install -U xxx.whl  # upgrade the xxx library
1. Requests library
Requests makes it easy to send network requests, pass URL parameters, and fetch a web page. Install it directly with pip.
pip3 install requests
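As a minimal sketch of passing URL parameters, the snippet below builds a request without sending it, so the final URL can be inspected offline. The address https://example.com/search, the parameter names, and the helper fetch_page are placeholders chosen for illustration; they do not come from the original article.

```python
import requests

# Hypothetical endpoint and query parameters, used only for illustration.
URL = "https://example.com/search"
PARAMS = {"q": "python", "page": "2"}

# Preparing the request shows how Requests encodes the parameters into the
# URL without touching the network.
prepared = requests.Request("GET", URL, params=PARAMS).prepare()
print(prepared.url)


def fetch_page(url, params=None, timeout=10):
    """Send a GET request and return the response body as text."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text
```

Calling fetch_page(URL, PARAMS) would send the same request for real; the timeout keeps a crawler from hanging on a slow server.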
2. Selenium library
selenium is a cross-platform automated testing tool. It was originally intended for testing web applications, but it has turned out to be a good tool for crawlers as well. To sum up selenium in one sentence: it can drive your browser and "see" web pages the way a human does.
pip3 install selenium
3. ChromeDriver
ChromeDriver is an automated testing interface provided by Google for website developers. It is the bridge between selenium2 and the Chrome browser. selenium communicates with ChromeDriver through the JsonWire protocol; selenium is essentially a low-level wrapper around this protocol, and it exposes the WebDriver library on top for callers.
Download the version that matches your browser and operating system from the official website, and place it in a directory that is listed in the environment variables, such as the directory where Python or pip is located.
4. PhantomJS library
If you use Selenium, the program ends up launching a real Chrome browser. In many cases we do not need to open a browser window and only need a headless browser running in the background; PhantomJS does this. Download it from the official website in the same way. After the download completes, unzip it and add the bin directory to the environment variables, or copy it to a directory that is already listed there, such as the directory where Python or pip is located.
Latest update: after installing this library I tested PhantomJS and found that Selenium no longer supports it, so it is recommended to use the headless mode of Chrome or Firefox instead.
For example:
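from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
# Start in headless mode
chrome_options.add_argument('--headless')
# Google's documentation mentions adding this flag to work around a bug
chrome_options.add_argument('--disable-gpu')
# Create the driver instance (recent Selenium versions take `options`;
# the older `chrome_options` keyword has been removed)
driver = webdriver.Chrome(options=chrome_options)
# Request Baidu
driver.get("http://www.baidu.com")
# Close the browser when finished
driver.quit()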
5. lxml library
lxml provides XPath and other methods for parsing web pages.
pip install lxml
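A short sketch of XPath parsing with lxml follows. The HTML fragment and its id="links" list are made up for the example, so no network access is needed.

```python
from lxml import etree

# A small, made-up page used only for illustration.
HTML = """<html><head><title>Demo page</title></head>
<body>
<ul id="links">
  <li><a href="/page/1">First</a></li>
  <li><a href="/page/2">Second</a></li>
</ul>
</body></html>"""

# etree.HTML parses an HTML string and returns the root element.
tree = etree.HTML(HTML)

# text() selects the link text, @href selects the attribute value.
titles = tree.xpath('//ul[@id="links"]/li/a/text()')
hrefs = tree.xpath('//ul[@id="links"]/li/a/@href')

for title, href in zip(titles, hrefs):
    print(title, href)
```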
6. beautifulsoup library
beautifulsoup is used together with the lxml library, which serves as its parser, and makes parsing web pages convenient.
pip install beautifulsoup4
When using it, import BeautifulSoup from bs4:
>>> from bs4 import BeautifulSoup
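As a rough illustration, the sketch below parses a made-up HTML fragment with BeautifulSoup, using lxml as the parser as described above. The page content is an assumption for the example.

```python
from bs4 import BeautifulSoup

# A small, made-up page used only for illustration.
HTML = """<html><head><title>Demo page</title></head>
<body>
<ul id="links">
  <li><a href="/page/1">First</a></li>
  <li><a href="/page/2">Second</a></li>
</ul>
</body></html>"""

# The second argument names the parser; "lxml" requires the lxml library.
soup = BeautifulSoup(HTML, "lxml")

page_title = soup.title.string
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

print(page_title)
print(links)
```

If lxml is not installed, Python's built-in "html.parser" can be passed as the parser name instead.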
7. pyquery library
pyquery is a web page parsing library similar to beautifulsoup, with a syntax similar to jquery.
pip install pyquery
8. pymysql library
pymysql is a library for operating a mysql database from Python.
pip install pymysql
9. pymongo library
pymongo is used to operate the MongoDB database. MongoDB is a non-relational database: there is no need to create tables before using it, and there is no relational table structure.
pip install pymongo
10. Redis library
redis is the Python client for Redis, which is also a non-relational database and is mainly used to store data for distributed crawlers. Because all of its operations are performed in memory, read and write performance is very high: Redis can read at about 110,000 times/s and write at about 81,000 times/s. At the same time, note that because it is an in-memory database, the amount of data a single machine can store is limited by that machine's memory.
pip install redis
11. Flask library
Like django, flask is a web framework; it is lightweight and can be used to build a web server.
pip install flask
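A minimal flask application can be sketched as follows. The /ping route and its response are made up for the example.

```python
from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/ping")
def ping():
    """A hypothetical health-check endpoint that returns a small JSON body."""
    return jsonify(status="ok")
```

Adding app.run(port=5000) at the end of the file and running it with Python starts the development server, after which http://127.0.0.1:5000/ping returns the JSON response.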
12. Jupyter library
jupyter is an interactive notebook in which we can write code. It supports running more than 40 programming languages and supports Markdown text.
pip install jupyter
pip installs the dependent libraries automatically. After the installation is complete, use the following command to start jupyter; it automatically opens the default browser, where you can start writing code or documentation on the web page.
jupyter notebook
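Once the libraries above are installed, a quick check like the sketch below can confirm that Python can find them. The list uses import names, which are an assumption here and sometimes differ from the pip package names.

```python
from importlib.util import find_spec

# Import names, which can differ from the pip package names: beautifulsoup4
# is imported as bs4, and notebook is the package behind `jupyter notebook`.
CRAWLER_LIBRARIES = [
    "requests", "selenium", "lxml", "bs4", "pyquery",
    "pymysql", "pymongo", "redis", "flask", "notebook",
]


def check_libraries(names):
    """Return a dict mapping each import name to True if it can be found."""
    return {name: find_spec(name) is not None for name in names}


def report(names):
    """Print one line per library showing whether it is installed."""
    for name, installed in check_libraries(names).items():
        print(f"{name:10s} {'ok' if installed else 'MISSING'}")


report(CRAWLER_LIBRARIES)
```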
The three most popular databases are MySQL, MongoDB, and Redis. MySQL, an open source relational database, is widely used in web development. MongoDB and Redis are open source non-relational databases with no relational table structure, which makes them very convenient for development, especially for storing crawler data. Redis, being an in-memory database, has extremely high read and write efficiency and is widely used for data storage in concurrent scenarios.
1. MySQL installation and configuration
Download the installation package for your system from the official website, install and configure it, and be sure to record the root account credentials.
You can use MySQL-Front as a GUI management tool.
2. Installation and Configuration of MongoDB
2.1 Install MongoDB
Download the installation package from the official website and install it.
2.2 Configure the data storage directory
In the Windows installation directory C:\Program Files\MongoDB\Server\3.6\, create a new directory data\db to store data.
2.3 Configure the log storage directory and file
In the Windows installation directory C:\Program Files\MongoDB\Server\3.6\, create a new directory data\logs and, inside it, a new file mongo.log to store the MongoDB logs.
2.4 Configure MongoDB as a system service
- Run cmd in administrator mode.
- Execute the mongod command with the following parameters:
mongod --bind_ip 0.0.0.0 --logpath "C:\Program Files\MongoDB\Server\3.6\data\logs\mongo.log" --logappend --dbpath "C:\Program Files\MongoDB\Server\3.6\data\db" --port 27017 --serviceName "MongoDB" --serviceDisplayName "MongoDB" --install
Setting bind_ip to 0.0.0.0 means the server can be accessed from any address, logpath specifies the log file path, logappend specifies that log entries are appended rather than overwriting the file, dbpath specifies the data storage path, port specifies the port, serviceName specifies the name of the service, serviceDisplayName specifies the name displayed for the service in the system, and install indicates that MongoDB is installed as a system service. Paths that contain spaces must be enclosed in double quotes.
- Configure the service startup mode
Open Computer Management (right-click Computer and choose Manage), go to Services and Applications, click Services, select MongoDB, and right-click it to start the service. You can also set it to start automatically.
- Be sure to set and record the MongoDB database account, such as:
admin/admin
3. Redis installation and configuration
3.1 Install Redis
Download the redis file in msi format and install it directly.
3.2 Manage Redis
- After the installation is complete, the service is already registered in the system and starts automatically; it can be viewed under Services and Applications.
- Download and install the visual management tool redis desktop manager, and then you can manage the database visually.
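After the three services are installed, a quick way to confirm that they are listening is to try a TCP connection to each port. The sketch below assumes the default ports (3306 for MySQL, 27017 for MongoDB, 6379 for Redis) on the local machine; adjust them if your configuration differs.

```python
import socket

# Default ports; change these if the services were configured differently.
SERVICES = {"MySQL": 3306, "MongoDB": 27017, "Redis": 6379}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(host="127.0.0.1"):
    """Print whether each database service accepts connections."""
    for name, port in SERVICES.items():
        state = "running" if is_port_open(host, port) else "not reachable"
        print(f"{name:8s} port {port}: {state}")
```

Calling report() prints one line per service. An open port only shows that something is listening there, so a follow-up login with the client library is still the definitive check.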