What is a crawler? Why does Python lead the field of crawlers? (69)

Hello kids, hello adults!

I am Cat Girl, a primary school student who has fallen in love with Python programming.

Follow me and have fun learning programming!

Basic concepts of crawlers

Have you heard of "reptiles"? Despite the creepy-crawly name, a crawler has nothing to do with animals.

A crawler, also known as a web crawler, web spider, or web robot, is a program that can automatically fetch (crawl) data from web pages.

What does a web page consist of?

Web pages are generally composed of text, images, audio, video and other elements.

These elements are arranged and combined using languages such as HTML, CSS, and JavaScript to produce the page. In other words, the text, pictures, and videos we see are mixed in with HTML markup and other code.

What a crawler does is extract the text, images, audio, and video we care about from the page. We don't care about the HTML itself, but we do need to parse the page according to HTML syntax to reach that content.
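
For instance, here is a tiny sketch of what "parsing according to HTML syntax" looks like in Python with the Beautiful Soup library; the HTML snippet below is made up for illustration.

```python
# A minimal sketch: extracting text and an image address from a small,
# made-up HTML snippet using the Beautiful Soup library.
from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <h1>Hello</h1>
    <p class="intro">This is the text we care about.</p>
    <img src="cat.png">
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.h1.text)                         # -> Hello
print(soup.find("p", class_="intro").text)  # -> This is the text we care about.
print(soup.img["src"])                      # -> cat.png
```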

The basic structure of a crawler

A simple crawler consists of four parts: a URL manager, a web page downloader, a web page parser, and a data store (a small sketch of these parts follows the list below).

  1. The URL manager keeps track of which pages you want to download, checks whether those pages contain hyperlinks that should also be downloaded, and removes duplicate URLs so the same page is not fetched twice.
  2. The web page downloader fetches a page's content and saves it to the local computer. Two commonly used HTTP request libraries are urllib and requests: the former is part of Python's standard library, while the latter is a widely used and more convenient third-party library.
  3. The web page parser parses the downloaded content and extracts the information we care about, typically using regular expressions, the lxml library, or the Beautiful Soup library.
  4. The data store saves the extracted data and persists it locally.
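
Here is a minimal sketch of how these four parts might look in Python. The names (URLManager, download, parse, store) are made up for illustration, not taken from any particular framework.

```python
# A minimal sketch of the four parts of a crawler; names are illustrative only.
from collections import deque
from urllib.request import urlopen

from bs4 import BeautifulSoup


class URLManager:
    """URL manager: tracks which URLs still need downloading and removes duplicates."""

    def __init__(self):
        self.to_visit = deque()
        self.seen = set()

    def add(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.to_visit.append(url)

    def has_next(self):
        return bool(self.to_visit)

    def get_next(self):
        return self.to_visit.popleft()


def download(url):
    """Web page downloader: fetch the raw HTML of a page using urllib."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="ignore")


def parse(html):
    """Web page parser: extract the information we care about (here, the page title)."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.text if soup.title else ""


def store(data, path="output.txt"):
    """Data store: persist the extracted data locally."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(data + "\n")
```

With these pieces, the main loop simply pulls a URL from the manager, downloads the page, parses it, and stores the result.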

Why does Python dominate the crawler field?

Because Python has many mature, easy-to-use crawler-related libraries that you can use right away, which saves you from reinventing the wheel.

Crawler workflow

The crawler workflow mainly consists of four steps (an end-to-end sketch follows the list):

  1. Request: the client sends a request and asks the server to respond.
  2. Response: the server sends the requested web page back to the client.
  3. Parse: use regular expressions, the lxml library, or the Beautiful Soup library to extract the target information.
  4. Save: store the parsed data locally; it can be text, audio, pictures, videos, and so on.
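
Putting the four steps together, here is a small end-to-end sketch using the requests and Beautiful Soup libraries; https://example.com is just a placeholder URL.

```python
import requests
from bs4 import BeautifulSoup

# 1. Request: the client sends a request to the server.
response = requests.get("https://example.com", timeout=10)

# 2. Response: the server returns the page; response.text holds the HTML.
html = response.text

# 3. Parse: extract the target information (here, the page title).
soup = BeautifulSoup(html, "html.parser")
title = soup.title.text

# 4. Save: persist the parsed data locally.
with open("result.txt", "w", encoding="utf-8") as f:
    f.write(title)
```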

How to limit crawlers

There are currently two main ways to restrict web crawlers:

1. Source review: inspect the User-Agent (a key-value pair in the request header). This value identifies the type of browser or program that initiated the request, so website maintainers can restrict requests based on it (a small sketch of setting this header appears after this list).

2. Release an announcement: the Robots protocol.
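
For the first method, here is a small sketch of how a program sets the User-Agent header with the requests library; the browser string shown is just an example value.

```python
import requests

# Send a request with a custom User-Agent header (example browser string).
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
}
response = requests.get("https://example.com", headers=headers, timeout=10)
print(response.status_code)
```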

The Robots protocol is a convention that website administrators use to tell search engine spiders which pages can be crawled and which cannot.

robots.txt is a plain-text file stored in the root directory of a website. It tells search engine robots (also known as web crawlers or spiders) which content on the site must not be accessed and which content may be fetched. When a well-behaved robot visits a website, it first checks whether robots.txt exists in the site's root directory; if it does, the robot accesses the site according to the rules specified in that file.

You can view the robots.txt file of almost any large website by adding /robots.txt after its domain name.
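
Python's standard library can also read robots.txt for you. Here is a minimal sketch using urllib.robotparser; the URL and user-agent name are just example values.

```python
from urllib.robotparser import RobotFileParser

# Load a site's robots.txt and ask whether a given page may be crawled.
rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

print(rp.can_fetch("MyLittleCrawler", "https://www.example.com/some/page"))
```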

Obey the law

robots.txt is an agreement and a matter of etiquette, not an order that can be enforced, so everyone needs to abide by it voluntarily.

Crawler technology automatically gathers information from the network, but if you do not comply with the relevant laws and regulations, you may break the law.

To avoid this situation, we can take the following measures:

1. Set access limits in the crawler program to avoid putting excessive load on the target website;

2. Set a reasonable request interval in the crawler program so the target website is not visited too frequently (see the sketch after this list);

3. Set a reasonable crawling depth in the crawler program to avoid crawling more of the target website than necessary;

4. Choose a reasonable data storage method in the crawler program and avoid collecting and storing more data than you need;

5. When using crawler technology, respect the privacy and intellectual property rights of the target website and do not infringe on its legitimate rights and interests.
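
For point 2, a polite request interval can be as simple as pausing between requests. Here is a minimal sketch using time.sleep; the URLs and the one-second delay are just example values.

```python
import time

import requests

urls = ["https://example.com/page1", "https://example.com/page2"]

for url in urls:
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    time.sleep(1)  # wait one second between requests to ease the load on the server
```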

Okay, that’s it for today’s sharing!

If you run into any problems, let's discuss them and solve them together.

I'm Cat Girl, see you next time!
