[Python Crawler in Practice] Crawler Basics and Python Environment Setup

Foreword:

Crawlers are among the most common Python projects, and they can target many kinds of content (text, video, images, other files, and so on). In this video series we will walk through several hands-on crawler projects, helping you practice real crawling work and develop a feel for analyzing a crawler project. (Each video in this series is kept to roughly 5-6 minutes.)

Part 1: Crawler Basics and Python Environment Setup

[Python Crawler in Practice] Python Environment Setup and Crawler Module Installation

What is a crawler?

Web crawlers (also known as web spiders or web robots) are programs or scripts that automatically fetch information from the Internet according to certain rules.

First of all, we need to understand the legal risks of crawling. After all, we are just programmers, not people using this data for illegal gain.

The legal risks of crawlers:
1. Crawling and using content that the target site prohibits, against the website's wishes;
2. Interfering with the normal operation of the visited website;
3. Crawling specific types of data or information protected by law.
So, as a crawler developer, how do you avoid these risks?

1. Strictly follow the robots protocol (robots.txt) published by the website;
2. Avoid interfering with the normal operation of the visited website;
3. Avoid using crawled data commercially;
4. Review the captured information before using or distributing it; if any of it turns out to be users' personal information, private data, or someone else's trade secrets, stop crawling and delete it promptly.
That is a long list, but in practice it boils down to two promises.

1. Crawl only what Baidu can crawl, and do not crawl what Baidu cannot. While crawling, do not affect the normal operation of the target website.

2. Do not use the data you crawl directly for commercial purposes.
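Point 1 of the risk-avoidance list, respecting the robots protocol, can even be checked programmatically. Here is a minimal sketch using Python's standard `urllib.robotparser`; the rules and URLs below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)

# Ask before fetching: is this URL allowed for our user agent?
print(rp.can_fetch("*", "https://example.com/private/secret.html"))  # False
print(rp.can_fetch("*", "https://example.com/public/index.html"))    # True
```

In a real crawler you would call `rp.set_url("https://<site>/robots.txt")` followed by `rp.read()` instead of `parse()`, so the rules come from the live site.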

Python environment installation steps:
Purchase an Alibaba Cloud server (Windows edition), or prepare a Windows machine

Download the relevant software (Python) and install it

A related tutorial for reference: 1. Python installation and configuration

Download from: https://npm.taobao.org/mirrors/python/ and choose the 64-bit (x64) Windows version

After the installation completes, open cmd and run python to check that it starts successfully

Switch pip to a domestic (China) mirror source

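Switching the source means pointing pip at a PyPI mirror inside China. A minimal sketch of the config file, assuming the Tsinghua mirror (any domestic PyPI mirror works the same way):

```ini
; Windows: %APPDATA%\pip\pip.ini   (Linux/macOS: ~/.pip/pip.conf)
[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple
```

Equivalently, a single command writes this file for you: `pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple`.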

Install the related packages:

python -m pip install --upgrade pip
pip install jupyter
pip install selenium
pip install pyquery
pip install requests
# Run Jupyter Notebook
jupyter notebook
The modules installed above are:

requests is a basic HTTP library that we can use to request HTTP or HTTPS sites.
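A minimal sketch of what requests usage looks like. The URL and query parameter are placeholders; building the request without sending it keeps the example offline:

```python
import requests

# Build a GET request without sending it, to inspect the final URL.
req = requests.Request(
    "GET",
    "https://example.com/search",           # placeholder URL
    params={"q": "python"},                 # becomes the query string ?q=python
    headers={"User-Agent": "demo-crawler/0.1"},
)
prepared = req.prepare()
print(prepared.url)  # https://example.com/search?q=python

# To actually fetch a page you would do:
# resp = requests.get("https://example.com", timeout=10)
# print(resp.status_code, resp.text[:100])
```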

Selenium is actually an automated testing tool, but we can use its browser automation to drive a website as if a human were browsing it. In short, it simulates a person visiting the site.

pyquery is a very powerful and flexible web page parsing library. If you have used jQuery, you will feel right at home with it.

Jupyter is a web IDE that lets you run and debug code interactively.

Summary:

This article briefly introduced crawlers and the rules around crawling, and provided installation instructions via the video. In the next article we will start our first hands-on project: crawling the Baidu Fengyun (trending) list, to prepare for the projects that follow.


Origin: www.cnblogs.com/dfs23/p/12709893.html