[Python Basics] What is an Internet crawler?

1. What is an Internet crawler?

If we compare the Internet to a giant spider web, then the data stored on each computer is prey caught in that web, and a crawler program is a small spider that moves along the web to grab the data it wants.

Explanation 1: a program that fetches web pages by URL and extracts useful information

Explanation 2: a program that simulates a browser, sends requests to a server, and receives the response
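Explanation 2 can be sketched with Python's standard library alone. A minimal, self-contained example: the tiny local web server below is a stand-in for a real website, so the request/response round trip runs without network access.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# A stand-in "website": serves one fixed HTML page on a free local port.
class Page(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<html><body>hello crawler</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Page)  # port 0 = pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# The "crawler": send an HTTP request the way a browser would,
# then read the server's response body.
url = f"http://127.0.0.1:{server.server_port}/"
with urllib.request.urlopen(url) as resp:
    html = resp.read().decode("utf-8")
server.shutdown()
print(html)  # <html><body>hello crawler</body></html>
```

Against a real site you would pass the site's URL to `urllib.request.urlopen` directly; everything else stays the same.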

2. The core of a crawler

1. Fetch the page: download the entire web page, including everything in it

2. Parse the data: extract the data you need from the fetched page

3. The hard part: the ongoing game between crawlers and anti-crawling measures

3. Uses of crawlers

  • Data analysis / dataset construction
  • Cold-starting social apps
  • Public opinion monitoring
  • Competitor monitoring

4. Classification of crawlers

General-purpose crawlers:

        Example: search engines such as Baidu, 360, Google, and Sogou - Bole Online

        Function

        Visit the page -> fetch the data -> store the data -> process the data -> provide search services

       The robots protocol

        A conventional agreement: a site places a robots.txt file stating which of its content must not be crawled. It is advisory only and imposes no technical restriction

        Crawlers you write yourself are not forced to follow it
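The advisory nature of robots.txt is easy to see with Python's built-in parser. A sketch using a made-up robots.txt body (no network access needed, since `RobotFileParser.parse` accepts the file's lines directly):

```python
from urllib.robotparser import RobotFileParser

# An invented robots.txt body: forbid /admin/, allow everything else.
robots_txt = """\
User-agent: *
Disallow: /admin/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# can_fetch() only *reports* the site's wishes; nothing stops a crawler
# from fetching the URL anyway -- which is exactly why the protocol
# is called a customary agreement rather than a restriction.
print(rp.can_fetch("*", "https://example.com/admin/secret"))  # False
print(rp.can_fetch("*", "https://example.com/index.html"))    # True
```

In practice you would call `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` to load a real site's file before checking URLs.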

Website Ranking (SEO)

        1. Ranking by PageRank value (based on metrics such as site traffic and click-through rate)

        2. Baidu PPC (paid bid ranking)

Shortcomings

        1. Most of the crawled data is useless

        2. The data cannot be precisely matched to a user's needs

Focused crawlers

Function

        Implement a crawler program that grabs exactly the data required

Design ideas

        1. Determine the URL to crawl

        How do we obtain the URL?

        2. Simulate a browser to access the URL over HTTP and obtain the HTML code returned by the server

        How do we access it?

        3. Parse the HTML string (extract the required data according to certain rules)

        How do we parse it?
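Step 3 can be sketched with the standard library alone. The example below parses a hard-coded HTML string (standing in for a page fetched in step 2) and extracts all link targets; in practice a library such as BeautifulSoup or lxml is the usual choice.

```python
from html.parser import HTMLParser

# Stand-in for the HTML returned by the server in step 2.
sample_html = '<html><body><a href="/page1">One</a><a href="/page2">Two</a></body></html>'

class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

parser = LinkExtractor()
parser.feed(sample_html)
print(parser.links)  # ['/page1', '/page2']
```

The extracted links can then feed back into step 1 as the next URLs to crawl, which is how a crawler "moves along the web".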

5. Anti-crawling measures

1. User-Agent

The User-Agent, abbreviated UA, is a special string header that lets the server identify the client's operating system and version, CPU type, browser and its version, rendering engine, browser language, plug-ins, and so on
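The UA matters because urllib's default value ("Python-urllib/3.x", added when the request is sent) immediately identifies the client as a script, which many sites block. Replacing it with a browser-like string is the usual first counter; the UA string below is just an example value, not tied to any real install.

```python
import urllib.request

url = "https://example.com/"  # stand-in target URL

# No headers: urllib will send "User-Agent: Python-urllib/3.x" at send time.
plain_req = urllib.request.Request(url)

# A browser-like UA string (example value) disguises the script as Chrome.
browser_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/120.0 Safari/537.36")
disguised_req = urllib.request.Request(url, headers={"User-Agent": browser_ua})

# Note: urllib normalizes stored header names to "User-agent".
print(plain_req.get_header("User-agent"))      # None (default UA added only at send time)
print(disguised_req.get_header("User-agent"))  # the browser-like string
```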

2. Proxy IP

        Xici Proxy (西刺代理)

        Kuaidaili (快代理)

        What are high-anonymity, anonymous, and transparent proxies, and how do they differ?

        1. Transparent proxy: the target server knows you are using a proxy and also knows your real IP

        2. Anonymous proxy: the target server knows you are using a proxy but does not know your real IP

        3. High-anonymity proxy: the target server does not know you are using a proxy, let alone your real IP
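The three levels differ mainly in which forwarding headers the proxy adds to the request the target server sees. A toy illustration with invented header values (in real code the proxy itself is selected by passing `proxies={"http": "http://host:port"}` to `requests.get`):

```python
# Invented example addresses for illustration only.
real_ip = "203.0.113.7"
proxy_ip = "198.51.100.9"

# Headers the target server might see under each proxy type:
transparent = {"Via": proxy_ip, "X-Forwarded-For": real_ip}  # real IP leaks through
anonymous   = {"Via": proxy_ip}                              # proxy visible, IP hidden
high_anon   = {}                                             # looks like a direct visit

def server_knows(headers):
    """What the target server can infer from the forwarding headers."""
    return {
        "using_proxy": "Via" in headers or "X-Forwarded-For" in headers,
        "real_ip": headers.get("X-Forwarded-For"),
    }

print(server_knows(transparent))  # {'using_proxy': True, 'real_ip': '203.0.113.7'}
print(server_knows(anonymous))    # {'using_proxy': True, 'real_ip': None}
print(server_knows(high_anon))    # {'using_proxy': False, 'real_ip': None}
```

This is why high-anonymity proxies are the only kind that reliably defeats IP-based anti-crawling checks.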

3. CAPTCHA (verification code) checks

        Countered with CAPTCHA-solving platforms

        Cloud-based solving platforms

        super

4. Dynamically loaded pages: the site returns JavaScript rather than the page's real data

        Counter: use Selenium to drive a real browser and send the requests

5. Data encryption

        Counter: analyze the site's JavaScript code


Origin blog.csdn.net/qq_48108092/article/details/126095482