1 Basic concepts of crawlers

Table of Contents

1. What is a web crawler?

2. Crawler classification

3. How to write a crawler

4. Necessary skills for crawlers


1. What is a web crawler?

A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Once we know how to write crawlers, we can use them for:

1. Data collection

A Python crawler program can be used to collect data. This is also the most direct and most common use. Since a crawler is a program, it runs very fast and never tires of repetitive work, so obtaining large amounts of data with a crawler becomes simple and fast.
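A minimal sketch of what data collection looks like in code; the target URL and the choice of the requests library are illustrative assumptions, not something prescribed by this article:

```python
import requests  # third-party HTTP client; `pip install requests`

# Fetch a page and keep its HTML for later parsing and storage.
# "https://example.com" is a placeholder target.
response = requests.get("https://example.com", timeout=10)
response.raise_for_status()   # fail loudly on 4xx/5xx responses
html = response.text          # the raw page source as a string
print(len(html), "characters downloaded")
```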

2. Research

For example, suppose you want to investigate an e-commerce company and learn about its product sales. The company claims monthly sales of hundreds of millions of yuan. If you use a crawler to crawl the sales figures of every product on the company's website, you can calculate the company's actual total sales. Furthermore, if you crawl all the reviews and analyze them, you can also find out whether the reviews have been faked. Data does not lie: especially at scale, artificial fabrication always differs from what is produced naturally. In the past, collecting data at such volume was very difficult, but now, with the help of crawlers, many deceptive practices are laid bare.
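Once per-product figures have been crawled, the "actual total sales" estimate is just a sum over the records; a toy sketch, with invented field names and numbers:

```python
# Hypothetical records scraped from a product listing page
# (field names and figures are made up for illustration).
products = [
    {"name": "item A", "price": 199.0, "monthly_sales": 1200},
    {"name": "item B", "price": 49.9, "monthly_sales": 8300},
]

# Estimated monthly revenue = sum of price * units sold per product.
total = sum(p["price"] * p["monthly_sales"] for p in products)
print(f"estimated monthly sales: {total:,.2f} yuan")
```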

3. Traffic brushing and flash sales

Traffic brushing comes built in with a Python crawler: when a crawler visits a website, if it disguises itself well enough that the site cannot tell the visit comes from a crawler, the visit is counted as a normal one. As a result, the crawler "accidentally" inflates the site's traffic.
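"Disguising itself well" usually starts with sending browser-like request headers; a minimal sketch, where the User-Agent string and target URL are just examples:

```python
import requests

# Pretend to be an ordinary browser by sending a browser-like User-Agent.
# Many sites treat requests without one as bot traffic.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    )
}
response = requests.get("https://example.com", headers=headers, timeout=10)
print(response.status_code)  # counted as a normal visit if the disguise works
```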

2. Crawler classification

According to the system structure and implementation technology, web crawlers can be roughly divided into the following types: General Purpose Web Crawler, Focused Web Crawler, and Incremental Web Crawler.

1. General web crawler

A general-purpose web crawler is also called a scalable web crawler. Its crawl targets expand from a set of seed URLs to the entire Web, and it mainly collects data for portal search engines and large Web service providers, for example the Baidu, 360, Google, and Bing search engines.
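A toy breadth-first sketch of "expanding from seed URLs"; real search-engine crawlers add robots.txt handling, politeness delays, and large-scale deduplication, and the seed URL in the comment is just a placeholder:

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup  # `pip install beautifulsoup4`


def crawl(seed_urls, max_pages=50):
    """Breadth-first crawl that expands outward from a few seed URLs."""
    queue, seen = deque(seed_urls), set(seed_urls)
    while queue and len(seen) < max_pages:      # cap on discovered URLs
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue                             # skip unreachable pages
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])       # resolve relative links
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


# crawl(["https://example.com"])  # placeholder seed URL
```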

2. Focused crawler

A focused crawler (Focused Crawler), also known as a topical crawler, selectively crawls pages related to pre-defined topics. This is the kind of crawler we will be writing.
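A minimal sketch of the "selective" part, assuming the topic is naively defined by a keyword set (real focused crawlers often use relevance classifiers instead):

```python
# A focused crawler only keeps (and follows links from) pages that match its
# pre-defined topic. The keyword set below is an invented example.
TOPIC_KEYWORDS = {"python", "crawler", "scrapy"}


def is_on_topic(page_text):
    """Return True if the page looks relevant to the chosen topic."""
    text = page_text.lower()
    return any(keyword in text for keyword in TOPIC_KEYWORDS)
```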

3. Incremental crawler

An incremental web crawler (Incremental Web Crawler) incrementally updates the pages it has already downloaded and crawls only newly generated or changed pages, which ensures, to a certain extent, that the crawled pages are as fresh as possible.
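One simple way to implement "only crawl new or changed pages" is to remember a content fingerprint per URL; a sketch, with the hashing scheme chosen purely for illustration:

```python
import hashlib

# Remember a fingerprint of every page already downloaded; re-process a URL
# only when its content hash changes, i.e. the page is new or has been updated.
fingerprints = {}


def has_changed(url, html):
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if fingerprints.get(url) == digest:
        return False   # unchanged page: skip it
    fingerprints[url] = digest
    return True        # new or modified page: crawl/process it
```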

3. How to write a crawler

1. Fetch the page source (see the end-to-end sketch after this list)

  • urllib ---> requests

  • aiohttp / httpx

2. Parse the page to extract the desired information

  • Regular expression parsing: re

  • XPath parsing: lxml

  • CSS selector parsing: pyquery / beautifulsoup

3. Storage (persistence to MySQL, MongoDB, etc.) / compression / signing

4. Data cleaning, normalization ---> Data analysis ---> Generate statistical charts/reports
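Putting the four steps together, here is a minimal end-to-end sketch; the target URL, the XPath expression, and the use of SQLite in place of MySQL/MongoDB are all illustrative assumptions:

```python
import sqlite3

import requests
from lxml import html as lxml_html  # `pip install lxml`

# 1. Fetch the page source ("https://example.com" is a placeholder target).
page = requests.get("https://example.com", timeout=10)
page.raise_for_status()

# 2. Parse the page and extract the desired information with XPath.
tree = lxml_html.fromstring(page.text)
titles = tree.xpath("//h1/text()")

# 3. Persist the extracted data (SQLite stands in here for MySQL/MongoDB).
conn = sqlite3.connect("crawl.db")
conn.execute("CREATE TABLE IF NOT EXISTS titles (title TEXT)")
conn.executemany("INSERT INTO titles VALUES (?)", [(t,) for t in titles])
conn.commit()
conn.close()

# 4. Cleaning, analysis and charting would then run on the stored data,
#    for example with pandas and matplotlib.
```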

4. Necessary skills for crawlers

1. Basic Python syntax

2. How to fetch a page

      Python libraries used: urllib.request, urllib.parse, requests

3. Parse the content

        Regular expressions, XPath, bs4, jsonpath

4. Crawl dynamic HTML

       selenium

5. scrapy

       A high-performance asynchronous crawling framework (a minimal spider sketch follows this list)

6. Distributed crawler

       The scrapy-redis component, which adds a set of components on top of scrapy and uses Redis for request scheduling, deduplication, and storage
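As an illustration of item 5, a minimal Scrapy spider sketch; the target site quotes.toscrape.com is a public demo site used only as an example, not something from the original article:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """Minimal spider: extract quotes and follow pagination links."""
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # CSS selectors pull out each quote's text and author
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "Next" link, if there is one, and parse it the same way
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Saved as quotes_spider.py, it can be run with `scrapy runspider quotes_spider.py -o quotes.json`.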


Source: blog.csdn.net/chengshaolei2012/article/details/113897398