Introduction to Python crawlers and their applications

What is a web crawler


A web crawler, also known as a web spider or a web robot, is a program or script that automatically browses and retrieves web page information according to certain rules. Web crawlers can automatically request web pages and grab the required data. By processing the captured data, valuable information can be extracted.
Getting to know crawlers
The search engines we are all familiar with, such as Baidu, Sogou, 360 Search, and Google, are essentially large-scale web crawlers. Each search engine has its own crawler: 360 Search's crawler is called 360Spider, for example, and Sogou's is called Sogou spider.


The Baidu search engine can, more vividly, be called "Baidu Spider". Every day it crawls high-quality information from the vast amount of content on the Internet and indexes it. When a user searches for a keyword on Baidu, Baidu first analyzes the keyword the user entered, then finds relevant pages among the pages it has indexed, sorts them according to its ranking rules, and finally presents the sorted results to the user. Throughout this process, Baidu Spider plays a crucial role.

Baidu's engineers have written crawler algorithms for Baidu Spider. By applying these algorithms, Baidu Spider can carry out its search strategies, such as filtering out duplicate pages and selecting high-quality pages. Different algorithms lead to different crawling efficiency and different results.


Crawler classification


Crawlers can be divided into three categories: general web crawlers, focused web crawlers, and incremental web crawlers.

General web crawler: an important component of a search engine, as introduced above, so it is not repeated here. General web crawlers need to comply with the robots protocol, through which a website tells search engines which pages may be crawled and which may not.
Robots protocol: a "conventional" agreement with no legal force. It reflects the "contract spirit" of the Internet community, and industry practitioners abide by it voluntarily, which is why it is also known as a "gentleman's agreement".

Focused web crawler: a crawler program built for a specific need. Unlike a general crawler, a focused crawler filters and processes page content while crawling, trying to ensure that only the information relevant to that need is collected. Focused web crawlers greatly save hardware and network resources, and because they store only a small number of pages, they can refresh them very quickly, which also satisfies specific groups of users who need information in specific fields.

Incremental web crawler: a crawler that incrementally updates the pages it has downloaded, crawling only newly generated or changed pages, which ensures to a certain extent that the crawled pages are up to date.


Crawler applications


With the rapid development of the Internet, the World Wide Web has become the carrier of a huge amount of information, and extracting and using that information effectively has become a major challenge. Crawlers arose to meet this challenge. They are used not only in the search engine field but also, on a large scale, in big data analysis and in commercial applications.


1) Data Analysis


In the field of data analysis, a web crawler is usually an essential tool for collecting large amounts of data. A data analyst must first have data sources before any analysis can begin, and learning to write crawlers opens up more of them. During collection, analysts can gather the data that is valuable for their own purposes and filter out the invalid data.


2) Commercial field


For an enterprise, obtaining market trends and product information in a timely manner is very important. An enterprise can buy data through third-party platforms, such as the Guiyang Big Data Exchange or Datatang. Of course, if the company has a crawler engineer, it can obtain the desired information through crawlers instead.


Crawlers are a double-edged sword


Crawlers are a double-edged sword: they bring us convenience, but they also create hidden dangers for network security. Some criminals use crawlers to illegally collect netizens' personal information on the Internet, or use them to maliciously attack other people's websites, crashing the sites and causing serious consequences. Regarding the lawful use of crawlers, it is recommended to read the Cybersecurity Law of the People's Republic of China.


To limit the dangers that crawlers bring, most websites have solid anti-crawling measures in place and additionally spell out their crawling rules in a robots.txt file. The following is an excerpt from Baidu's robots.txt:

User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
.....
User-agent: *
Disallow: /


As the content above shows, Baidu specifies which of its pages must not be crawled. Therefore, when you use crawlers, you must consciously abide by the robots protocol, and you must not illegally obtain other people's information or do anything that harms other people's websites.
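
A crawler can also check these rules programmatically before fetching a page. Below is a minimal sketch using Python's built-in urllib.robotparser module; the URL and paths are only illustrative, and the actual answers depend on the live robots.txt at the moment you run it.

# Minimal sketch: check robots.txt rules with the standard library before crawling.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.baidu.com/robots.txt")  # example target; use the site you plan to crawl
rp.read()  # download and parse the robots.txt file

# Ask whether a given User-agent may fetch a given URL
print(rp.can_fetch("Baiduspider", "https://www.baidu.com/s?wd=python"))  # False, given the rules shown above
print(rp.can_fetch("*", "https://www.baidu.com/"))                       # False: "Disallow: /" for other crawlers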


Why use Python to write crawlers


First of all, be clear that Python is not the only option: PHP, Java, and C/C++ can all be used to write crawler programs, but Python is the easiest to work with. Here is a brief comparison of their pros and cons:

PHP: its support for multi-threading and asynchronous execution is poor, and its concurrent processing ability is weak.
Java: also often used to write crawlers, but the language itself is cumbersome and verbose, so the barrier to entry for beginners is higher.
C/C++: runs efficiently, but the cost of learning and development is high, and even a small crawler can take a long time to write.

Python, by contrast, has elegant syntax, concise code, and high development efficiency, and it offers many crawler-related modules such as urllib, requests, and bs4 (Beautiful Soup). Its request and parsing modules are rich and mature, and it also provides the powerful Scrapy framework, which makes writing crawlers even easier. Python is therefore a very good choice for writing crawler programs.
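
As a quick illustration of this conciseness, the following sketch fetches a page with the requests library in just a few lines (it assumes requests has been installed with pip, and example.com is only a placeholder URL):

import requests

response = requests.get("http://example.com", timeout=10)  # send an HTTP GET request
print(response.status_code)   # HTTP status code, e.g. 200
print(response.text[:200])    # first 200 characters of the returned HTML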


The process of writing a crawler


Crawler programs differ from other programs in that their overall logic is largely the same from one crawler to the next, so we do not need to spend a lot of time designing it. The process of writing a crawler in Python is briefly as follows (a minimal sketch of these steps appears after the list):
1. Open the URL with the request method of the urllib module to obtain an HTML object for the page.
2. Open the page source in a browser to analyze the page structure and element nodes.
3. Extract the data with Beautiful Soup or regular expressions.
4. Store the data on the local disk or in a database.
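
Here is a minimal sketch of these four steps put together. It assumes the beautifulsoup4 package has been installed with pip; the URL and the output file name are placeholders chosen for illustration, not part of the original process.

from urllib import request
from bs4 import BeautifulSoup

# 1. Open the URL with urllib and obtain the HTML of the page
url = "http://example.com"
html = request.urlopen(url).read().decode("utf-8")

# 2. In practice, first inspect the page source in a browser to locate the target nodes

# 3. Extract data with Beautiful Soup
soup = BeautifulSoup(html, "html.parser")
title = soup.title.string if soup.title else ""
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

# 4. Store the data on the local disk
with open("result.txt", "w", encoding="utf-8") as f:
    f.write(title + "\n")
    f.write("\n".join(paragraphs))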

Of course, the process is not limited to the one above. Writing crawler programs requires solid Python skills so that the work goes smoothly. A crawler should also disguise itself as much as possible so that it visits a website like a human rather than a machine; otherwise it will be restricted by the site's anti-crawling measures, and its IP may even be blocked outright. The relevant knowledge will be introduced in later sections.
Note: The modules involved in the above process will be described in detail in the subsequent content.
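
As a first taste of the "disguise" mentioned above, the sketch below attaches a browser-like User-Agent header to the request, so the site sees something closer to a normal browser visit; the header string and URL are only example values.

from urllib import request

url = "http://example.com"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
req = request.Request(url, headers=headers)  # build a request carrying custom headers
html = request.urlopen(req).read().decode("utf-8")
print(len(html))  # length of the downloaded HTML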

 

The reason a crawler can collect data is that it can analyze a web page and extract the desired data from it. Before learning Python's crawler modules, we should become familiar with the basic structure of web pages, which is essential knowledge for writing crawler programs.
If you are familiar with front-end languages, you can easily master this section.

Web pages generally consist of three parts: HTML (Hypertext Markup Language), CSS (Cascading Style Sheets), and JavaScript (a dynamic scripting language, "JS" for short). Each takes on a different task in the page:


HTML is responsible for defining the content of the web page.
CSS is responsible for describing the layout of the web page.
JavaScript is responsible for the behavior of the web page.


HTML
HTML is the basic structure of a web page, comparable to the skeleton of the human body. Everything on a page enclosed in "<" and ">" symbols is an HTML tag. Common HTML tags are listed below:
 

<!DOCTYPE html>      declares an HTML5 document
<html>..</html>      the root element of the page
<head>..</head>      contains the document's meta data, e.g. <meta charset="utf-8"> sets the page encoding to utf-8
<title>..</title>    describes the document's title
<body>..</body>      the content visible to the user
<div>..</div>        a block-level container
<p>..</p>            a paragraph
<ul>..</ul>          defines an unordered list
<ol>..</ol>          defines an ordered list
<li>..</li>          a list item
<img src="" alt="">  an image
<h1>..</h1>          a heading
<a href="">..</a>    a hyperlink
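
To show how a crawler works with these tags, here is a small sketch that parses a made-up HTML snippet with Beautiful Soup and pulls out the title, heading, and link (it assumes beautifulsoup4 is installed; the HTML content is invented purely for illustration):

from bs4 import BeautifulSoup

html = """
<!DOCTYPE html>
<html>
  <head><meta charset="utf-8"><title>Demo page</title></head>
  <body>
    <h1>A heading</h1>
    <p>A paragraph of text.</p>
    <ul>
      <li><a href="https://example.com">A link</a></li>
    </ul>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)        # Demo page
print(soup.h1.get_text())       # A heading
print(soup.find("a")["href"])   # https://example.com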

Thank you for reading. If anything is unclear, you can leave a comment and I will reply to each one. I have found that many students are not yet familiar with these modules, so I will explain them gradually in future posts. The final right of interpretation of this article belongs to Zhengyin Studio.

 
