Getting Started with Python Crawlers 01

1. What is a crawler?

A web crawler, also known as a web spider or a web robot, is a program or script that automatically browses and retrieves web page information according to certain rules. Web crawlers can automatically request web pages and grab the required data. By processing the captured data, valuable information can be extracted.

2. Classification of crawlers

Crawlers can be divided into three categories: general-purpose web crawlers, focused web crawlers, and incremental web crawlers.

General web crawler: an important component of search engines, which was introduced earlier and will not be repeated here. General web crawlers need to abide by the robots protocol, through which a website tells search engines which pages may be crawled and which may not.

Robots protocol: a conventional agreement with no legal force. It embodies the "contract spirit" of the Internet community, and practitioners in the industry follow it voluntarily, so it is also known as a "gentlemen's agreement".

Focused web crawler: a web crawler program built for a specific need. Unlike a general crawler, a focused crawler screens and filters page content while crawling, trying to ensure that only pages relevant to the need are fetched. Focused web crawlers greatly save hardware and network resources; because the number of saved pages is small, updates are very fast, which also satisfies the needs of specific groups of people for information in specific fields.
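To make the "focused" idea concrete, here is a toy sketch that only keeps pages whose text contains a topic keyword. The keywords and the filter logic are purely illustrative, not a real relevance model:

  # A toy sketch of a focused crawler's filtering step: keep a page only if
  # its text mentions one of the topic keywords. Keywords are illustrative.
  TOPIC_KEYWORDS = ("python", "crawler")

  def is_relevant(html_text):
      text = html_text.lower()
      return any(keyword in text for keyword in TOPIC_KEYWORDS)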

Incremental web crawler: a crawler that incrementally updates the pages it has already downloaded, crawling only newly generated or changed web pages. To a certain extent, this guarantees that the crawled pages are up to date.
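One simple way to picture the incremental idea is to keep a hash of each previously downloaded page and only save a page again when its content has changed. The sketch below assumes an in-memory dictionary; all names are illustrative:

  # Conceptual sketch of incremental crawling: skip pages whose content
  # hash matches what we saw on the previous crawl.
  import hashlib

  seen_hashes = {}  # url -> hash of the last downloaded content

  def is_new_or_changed(url, html):
      digest = hashlib.md5(html.encode("utf-8")).hexdigest()
      if seen_hashes.get(url) == digest:
          return False           # unchanged since the last crawl, skip it
      seen_hashes[url] = digest  # record the new version
      return True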

3. Crawler applications

With the rapid development of the network, the World Wide Web has become the carrier of a huge amount of information, and effectively extracting and using that information has become a major challenge; this is why crawlers emerged. Crawlers are used not only in the search engine field, but also on a large scale in big data analysis and in the commercial field.

1) Data analysis

In the field of data analysis, web crawlers are usually an essential tool for collecting massive amounts of data. A data analyst must first have data sources, and learning to write crawlers provides more of them. During collection, analysts can gather the data most valuable for their own purposes and filter out invalid data.

2) Business field

For enterprises, obtaining market dynamics and product information in a timely manner is very important. Enterprises can purchase data from third-party platforms such as the Guiyang Big Data Exchange or Datatang; of course, if a company has its own crawler engineers, it can obtain the desired information through crawlers.

Crawlers are a double-edged sword: while they bring us convenience, they also pose hidden dangers to network security. Some lawbreakers use crawlers to illegally collect netizens' personal information, or use them to maliciously attack other people's websites, crashing the sites and causing serious consequences.

To limit the dangers posed by crawlers, most websites deploy solid anti-crawling measures and state their rules explicitly through the robots.txt protocol. The following is an excerpt from Baidu's robots.txt:

User-agent: Baiduspider 
Disallow: /baidu
Disallow: /s?
Disallow: /ulink? 
Disallow: /link? 
Disallow: /home/news/data/ 
Disallow: /bh
.....
User-agent: * 
Disallow: /

As the protocol shows, Baidu specifies which pages may not be crawled. Therefore, when you use crawlers, consciously abide by the robots protocol; do not illegally obtain other people's information or do anything that endangers other people's websites.
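Python's standard library can check a site's robots.txt for you before you fetch a page. The sketch below uses urllib.robotparser; the site URL and user-agent name are placeholders for illustration:

  # Check the robots protocol before crawling, using the standard library.
  from urllib import robotparser

  rp = robotparser.RobotFileParser()
  rp.set_url("https://example.com/robots.txt")   # placeholder site
  rp.read()

  # can_fetch() reports whether the given user agent may crawl the path.
  print(rp.can_fetch("MyCrawler", "https://example.com/some/page"))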

Why Python?

As a simple and elegant programming language, Python has become a popular choice for crawler development thanks to its readability and rich third-party libraries. Another major reason is the wealth of ready-made examples and learning resources that can be borrowed and reused directly.

Python is not the only language that can be used to write crawlers; PHP, Java, and C/C++ can all be used to write crawler programs, but in comparison Python is the easiest to work with. Here is a brief comparison of their advantages and disadvantages:

  • PHP: weak support for multithreading and asynchronous execution, so its concurrent processing ability is limited.
  • Java: also often used to write crawlers, but the language itself is cumbersome and verbose, which means a high entry threshold for beginners.
  • C/C++: runs efficiently, but learning and development costs are high; writing even a small crawler can take a long time.

Python has elegant syntax, concise code, and high development efficiency, and it supports many crawler-related modules such as urllib, requests, and Beautiful Soup (bs4). Its request and parsing modules are rich and mature, and it also provides the powerful Scrapy framework, which makes writing crawlers even easier. So Python is a very good choice for writing crawler programs.
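As a taste of the requests module mentioned above, here is a minimal sketch of fetching one page. It assumes the third-party requests package is installed (pip install requests), and the URL is a placeholder:

  # Fetch a page with the third-party requests package.
  import requests

  response = requests.get("https://example.com/", timeout=10)
  response.raise_for_status()            # raise an error if the request failed
  print(response.status_code)            # 200 on success
  print(response.text[:200])             # first 200 characters of the HTML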

A crawler program differs from other programs in that its logic is generally similar from one crawler to the next, so we do not need to spend a lot of time on logic design. The following briefly describes the process of writing a crawler program in Python (a minimal sketch follows the list):

  • First, open the URL with the urlopen() function of the urllib.request module to obtain the HTML of the page.
  • Open the page source in a browser to analyze the web page structure and element nodes.
  • Extract data via Beautiful Soup or regular expressions.
  • Store data to local disk or database.
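The sketch below walks through these four steps with the standard-library urllib and Beautiful Soup. It assumes the bs4 package is installed (pip install beautifulsoup4); the URL, the extracted field, and the output file name are placeholders for illustration:

  # Minimal sketch of the four steps: request, inspect, parse, store.
  from urllib import request

  from bs4 import BeautifulSoup

  url = "https://example.com/"          # placeholder URL

  # 1. Open the URL and fetch the HTML of the page.
  with request.urlopen(url) as response:
      html = response.read().decode("utf-8")

  # 2. In practice, inspect the page source in a browser here to decide
  #    which element nodes contain the data you need.

  # 3. Parse the HTML and extract data with Beautiful Soup.
  soup = BeautifulSoup(html, "html.parser")
  title = soup.title.string if soup.title else ""

  # 4. Store the result on the local disk.
  with open("result.txt", "w", encoding="utf-8") as f:
      f.write(title or "")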

Environment configuration and basic tools

1.  Python environment configuration

Install PyCharm and Anaconda; there are many tutorials online, so just pick one and follow it.
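Once installed, a quick way to confirm which interpreter PyCharm/Anaconda is using is to run the following in the Python console; this is only a sanity check, not part of any crawler:

  # Print the Python version and the path of the interpreter in use.
  import sys
  print(sys.version)
  print(sys.executable)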
