[Python crawler series tutorial 1-100] Installing the crawler modules and three databases

To do a good job, a worker must first sharpen his tools. To carry out crawler work successfully, we need to install some Python libraries. Libraries are installed mainly with pip, or by downloading an installation package in whl format.

Note: pip can install whl files directly. The wheel library (pip install wheel) is needed only when pip has to build a wheel itself, so installing it is optional but harmless. You can install and upgrade a library from a whl file with the following commands.

pip install xxx.whl      # install the xxx library
pip install -U xxx.whl   # upgrade the xxx library

1. Requests library

The Requests library makes it easy to send network requests, pass URL parameters, and fetch a web page. Install it directly with pip.

pip3 install requests
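
As a minimal sketch of passing URL parameters, the snippet below builds a GET request without sending it, so you can see how Requests encodes the params dictionary into the final URL. The URL https://example.com/search and the parameter names are placeholders chosen for illustration, not a real endpoint.

```python
import requests

# Placeholder URL and parameters, chosen only for illustration.
url = "https://example.com/search"
params = {"q": "python", "page": 1}

# Build the request without sending it, to inspect the final URL.
prepared = requests.Request("GET", url, params=params).prepare()
print(prepared.url)

# To actually fetch the page, send it with a timeout:
# response = requests.get(url, params=params, timeout=10)
# print(response.status_code, len(response.text))
```

Setting a timeout on real requests is a common precaution so that a crawler does not hang on a slow server.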

2. Selenium library

Selenium is a cross-platform automated testing tool. It was originally intended for testing web applications, but it has turned out to be a good tool for crawlers as well. To sum up Selenium in one sentence: it can control your browser and "see" web pages the way a human does.

pip3 install selenium

3. ChromeDriver

ChromeDriver is an automated test interface that Google provides for website developers. It is the bridge between Selenium and the Chrome browser. Selenium communicates with ChromeDriver through the JSON Wire Protocol; Selenium is essentially a low-level wrapper around this protocol that exposes the higher-level WebDriver library for callers.
Go to the official website, download the version that matches your system and browser, and place it in a directory that is on your PATH environment variable, such as the directory where Python or pip is located.

4. Phantomjs library

If you use Selenium, the program ends up calling a real Chrome browser. In many cases we don't need to open a browser window; running a headless browser in the background is enough, and PhantomJS does this. In the same way as above, download it from the official website. After the download completes, unzip it and add the bin directory to the environment variables, or copy it to a directory that is already on the path, such as the directory where python or pip is located.

Latest update: After installing PhantomJS and testing it, I found that Selenium no longer supports it. Use the headless mode of Chrome or Firefox instead.

For example:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

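chrome_options = Options()
# Start in headless mode
chrome_options.add_argument('--headless')
# Google's documentation mentions adding this flag to work around a bug
chrome_options.add_argument('--disable-gpu')
# Create the driver instance (Selenium 4 accepts options=, not chrome_options=)
driver = webdriver.Chrome(options=chrome_options)
# Request Baidu
driver.get("http://www.baidu.com")
# Close the browser when finished
driver.quit()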

5. lxml library

lxml provides parsing methods, such as XPath, for analyzing web pages.

pip install lxml
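
A minimal sketch of XPath parsing with lxml follows. The HTML fragment is made up and stands in for a page you have already downloaded.

```python
from lxml import etree

# A small made-up HTML fragment standing in for a downloaded page.
html = """
<html><body>
  <ul>
    <li><a href="/page/1">First</a></li>
    <li><a href="/page/2">Second</a></li>
  </ul>
</body></html>
"""

# Parse the string into an element tree, then query it with XPath.
tree = etree.HTML(html)
links = tree.xpath("//li/a/@href")    # attribute values
titles = tree.xpath("//li/a/text()")  # text nodes
print(links)
print(titles)
```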

6. beautifulsoup library

BeautifulSoup is typically used together with the lxml library as its parser, which makes parsing web pages convenient.

pip install beautifulsoup4

Import BeautifulSoup from bs4 when using it:

>>> from bs4 import BeautifulSoup
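
As a short sketch of typical usage, the snippet below parses a made-up HTML string with the lxml parser and extracts two pieces of text. It assumes lxml is installed as described in the previous section.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration.
html = '<div id="main"><h1>Title</h1><p class="intro">Hello, crawler</p></div>'

# "lxml" selects the lxml parser; the standard library's "html.parser" also works.
soup = BeautifulSoup(html, "lxml")
heading = soup.find("h1").get_text()              # look up by tag name
intro = soup.select_one("p.intro").get_text()     # look up by CSS selector
print(heading, intro)
```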

7. pyquery library

Like BeautifulSoup, pyquery is a library for parsing web pages; its syntax is similar to jQuery.

pip install pyquery

8. pymysql library

pymysql is a library for operating MySQL databases from Python.

pip install pymysql

9. pymongo library

pymongo is used to operate MongoDB databases. MongoDB is a non-relational database: there is no need to create tables or define a relational table structure before using it.

pip install pymongo

10. Redis library

Redis is also a non-relational database. In crawling, it is mainly used to store data for distributed crawlers. Because all of its operations run in memory, read and write performance is very high: the commonly quoted figures are about 110,000 reads/s and 81,000 writes/s. Note, however, that because it is an in-memory database, the amount of data a single machine can store is limited by that machine's memory.

pip install redis

11. Flask library

Like Django, Flask is a web framework; it is lightweight and can be used to build a web server.

pip install flask
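
A minimal Flask server can be sketched as follows. The /ping route is a hypothetical endpoint used only for illustration.

```python
from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/ping")
def ping():
    # Hypothetical health-check endpoint used only for illustration.
    return jsonify(status="ok")


# To start the development server from a script, call:
# app.run(port=5000)
```

In a crawler project, a small server like this is often used to expose collected data or a proxy pool over HTTP.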

12. Jupyter library

Jupyter is an interactive notebook in which we can write code. It supports running more than 40 programming languages, as well as Markdown text.

pip install jupyter

pip automatically installs the dependent libraries. After the installation is complete, start Jupyter with the following command. It opens the default browser automatically, and you can start writing code or documentation in the web page.

jupyter notebook

The three most popular databases are MySQL, MongoDB, and Redis. MySQL is an open source relational database that is widely used in web development. MongoDB and Redis are open source non-relational databases with no relational table structure, which makes them convenient for development and especially for storing crawler data. Redis, being an in-memory database, has extremely high read and write efficiency and is widely used for data storage in concurrent scenarios.

1. Mysql installation and configuration

Download the installation package for your system from the official website, then install and configure it. Be sure to save the root account credentials.

You can use MySQL-Front as a GUI management tool.

2. Installation and Configuration of MongoDB

2.1 Install MongoDB

Download the installation package from the official website and install it.

2.2 Configuration data storage directory

In the Windows installation directory C:\Program Files\MongoDB\Server\3.6\, create a new directory data\db to store data.

2.3 Configuration log storage directory and file

In the Windows installation directory C:\Program Files\MongoDB\Server\3.6\, create a new directory data\logs and, inside it, a new file mongo.log to store the MongoDB logs.

2.4 Configure MongoDB as a system service

  1. Run cmd in administrator mode.
  2. Execute the mongod command with the following parameters:
mongod --bind_ip 0.0.0.0 --logpath "C:\Program Files\MongoDB\Server\3.6\data\logs\mongo.log" --logappend --dbpath "C:\Program Files\MongoDB\Server\3.6\data\db" --port 27017 --serviceName "MongoDB" --serviceDisplayName "MongoDB" --install

bind_ip set to 0.0.0.0 means the server can be accessed from any address; logpath specifies the log file path; logappend specifies that log entries are appended rather than overwriting the file; dbpath specifies the data storage path; port specifies the port; serviceName specifies the name of the service; serviceDisplayName specifies the name of the service as displayed in the system; install indicates that MongoDB is installed as a system service.

  3. Configure the service startup mode

Open Computer Management, go to Services and Applications, click Services, select MongoDB, and right-click to start it. You can also set it to start automatically.

  4. Be sure to set and record the MongoDB database account, such as: admin/admin

3. Redis installation and configuration

3.1 Install Redis

Download the Redis installer in msi format and install it directly.

3.2 Manage Redis

  • After the installation is complete, Redis is registered as a system service and starts automatically; you can see it under Services and Applications.
  • Download and install the visual management tool Redis Desktop Manager, and then you can manage the database visually.

Origin blog.csdn.net/weixin_54707168/article/details/114236780