"Reptile Prelude"

Table of contents

Overview

Preparation

1. Introduction to crawlers

2. HTTP and HTTPS

3. URL

4. Development tools

5. Crawler process



Overview


Preparation

1. Introduction to crawlers


  • Concept: a web crawler is a program that masquerades as a client to exchange data with a server.

    Colloquial definition: a program that automatically collects resources from the Internet.

  • Functions:

    1. Data collection

    2. Search engines

    3. Simulated operations: crawlers are widely used to simulate user actions, e.g., testing bots and forum-flooding (spam) bots.

  • Difficulties in crawler development:

    1. Data acquisition: servers set up Turing tests (e.g., CAPTCHAs) to block malicious crawling, so a large part of crawler development is spent dealing with anti-crawling strategies.

    2. Collection speed: addressed with multi-task crawling and distributed crawling, as sketched below.
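As a minimal sketch of multi-task crawling (the URLs below are placeholders, not from the original post), several pages can be fetched concurrently with a thread pool:

import requests
from concurrent.futures import ThreadPoolExecutor

# Placeholder URLs; substitute the pages you actually need to collect.
urls = [
    'https://www.example.com/page1',
    'https://www.example.com/page2',
    'https://www.example.com/page3',
]

def fetch(url):
    # One task: send a request and report the status code.
    response = requests.get(url, timeout=10)
    return url, response.status_code

# Run the fetch tasks in parallel to raise collection speed.
with ThreadPoolExecutor(max_workers=3) as pool:
    for url, status in pool.map(fetch, urls):
        print(url, status)

Distributed crawling extends the same idea across multiple machines, typically coordinated through a shared task queue.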

2. HTTP and HTTPS


Network architecture

  1. C/S: client/server

  2. B/S: browser/server

  3. M/S: mobile/server


HTTP protocol

  1. Reason: to ensure the effective exchange of information between computers, a protocol is required.

  2. Concept: HTTP (HyperText Transfer Protocol) is the protocol for transferring hypertext between client and server.

HTTPS protocol

HTTPS (HyperText Transfer Protocol over Secure Socket Layer), the secure version of HTTP, is HTTP + SSL: a security-oriented HTTP channel. It builds on HTTP and secures the transmission through encryption and identity authentication.
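The requests library used later in this post handles the SSL/TLS layer transparently; as a small illustration, certificate verification is on by default and is controlled by the verify parameter:

import requests

# requests negotiates TLS and verifies the server certificate by default.
response = requests.get('https://www.baidu.com/')
print(response.status_code)

# Passing verify=False would skip certificate verification
# (only sensible for debugging, never in production).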

3. URL


Locating network resources through URLs

A URL (Uniform Resource Locator) is the address used to identify a resource; it is what we commonly call a web address.

Structure: protocol + domain name (+ port; 80 is the default for HTTP) + path + parameters

A domain name, also known as a network domain, is the name of a computer or group of computers on the Internet, made up of labels separated by dots; it is used to locate the machine during data transmission. Domain names were designed because IP addresses are hard to remember and reveal nothing about the name or nature of the organization behind them.

A port can be thought of as the outlet through which a device communicates with the outside world. Ports divide into virtual ports (ports inside a computer or a switch/router, which are not visible) and physical ports (also called interfaces, which are visible).

The path identifies a directory or file on the host.
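As a quick check of this structure, Python's standard urllib.parse module splits a URL into exactly these parts (a minimal sketch; the sample URL is made up):

from urllib.parse import urlparse

# A made-up URL covering protocol, domain, port, path, and parameters.
url = 'https://www.example.com:443/search/index.html?keyword=crawler&page=1'
parts = urlparse(url)

print(parts.scheme)    # https -> protocol
print(parts.hostname)  # www.example.com -> domain name
print(parts.port)      # 443 -> port
print(parts.path)      # /search/index.html -> path
print(parts.query)     # keyword=crawler&page=1 -> parameters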


4. Development tools


Open the browser developer tools with F12 (or Fn+F12), or right-click the page and choose Inspect.


  • Elements: the page source as finally rendered; used to locate and extract data

  • Console: prints output from the page's scripts

  • Sources: the source files that make up the site

  • Network: network activity (packet capture), i.e., the data exchanged between the client and the server

5. Crawler process


We need a third-party library to help us send requests and receive responses: the requests module. Install it with pip (here via the Tsinghua mirror):

pip install requests -i https://pypi.tuna.tsinghua.edu.cn/simple

1. Determine the target URL

  • Static loading: the data is already in the HTML returned for the URL

  • Dynamic loading: the data is fetched afterwards by separate (Ajax/XHR) requests
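A hedged way to tell the two apart: fetch the raw HTML and check whether the data you want is already in it (static) or only arrives later via Ajax/XHR requests visible in the Network panel (dynamic). A minimal probe, with a placeholder URL and keyword:

import requests

# Placeholder target and keyword; replace them with the page and data you care about.
url = 'https://www.example.com/'
keyword = 'Example'

html = requests.get(url).text
if keyword in html:
    print('Found in the raw HTML: the data is likely statically loaded')
else:
    print('Not in the raw HTML: likely loaded dynamically (check the Network panel)')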

2. Simulate the browser to send a request and receive the response

Request methods:

GET: generally used to fetch information from the server; the query parameters are shown in the URL.

POST: generally used to submit or update information; the parameters travel in the request body and are not shown in the URL.
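A minimal sketch of the two methods (httpbin.org is an assumed public echo service, not part of the original post):

import requests

# GET: the query parameters are appended to the URL (?key=value).
r1 = requests.get('https://httpbin.org/get', params={'keyword': 'crawler'})
print(r1.url)  # the parameters are visible in the final URL

# POST: the form data travels in the request body, not in the URL.
r2 = requests.post('https://httpbin.org/post', data={'keyword': 'crawler'})
print(r2.status_code)

The basic example below simply sends a GET request to Baidu's home page: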

import requests

# Target URL
url = 'https://www.baidu.com/'
# Send a GET request; requests.get returns a Response object
response = requests.get(url)
print(response)

Output: <Response [200]>

Status codes:

  • 200: the request succeeded

  • 403: forbidden; the request may have been identified as a crawler and blocked

  • 404: the server cannot find the requested page

Getting the response content:

  • response.text: returns the body as a string (requests guesses the encoding)

  • response.content: returns the raw bytes of the body (binary)

  • response.content.decode('utf-8'): decode the bytes manually to get a string
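A small sketch contrasting the three, continuing from the response above:

import requests

response = requests.get('https://www.baidu.com/')

# .text: requests guesses the encoding and returns str.
print(type(response.text))                     # <class 'str'>

# .content: the raw bytes of the response body.
print(type(response.content))                  # <class 'bytes'>

# .decode('utf-8'): decode the bytes yourself when the guessed encoding is wrong.
print(type(response.content.decode('utf-8')))  # <class 'str'>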

User-Agent (UA for short): an identifier sent to the visited site describing the browser type and version, operating system and version, browser engine, and so on.

Cookie: used by some websites to identify the user (for example, to keep a login session).

Referer: shows the URL the request came from; websites use it for anti-hotlinking and to determine the origin of a request.
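These headers are passed to requests through the headers dictionary; a minimal sketch (the User-Agent string, Cookie, and Referer values are placeholders to be copied from a real browser session in the Network panel):

import requests

url = 'https://www.baidu.com/'
headers = {
    # Placeholder UA string; copy a real one from your browser.
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    # Placeholder cookie; take the real value from a logged-in session.
    'Cookie': 'sessionid=xxx',
    'Referer': 'https://www.baidu.com/',
}

response = requests.get(url, headers=headers)
print(response.status_code)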

3. Parse web pages

4. Save data


Source: blog.csdn.net/m0_63636799/article/details/128160121