Crawler (01) Preliminary Knowledge & Network Protocol
2020-12-07

1. Port

Each network application on a host is identified by a number called a port (also called a logical port). To exchange data with an application, you must know the port it is listening on.
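
As a rough illustration of ports as numeric identifiers, the sketch below asks the operating system for a free port on the loopback interface and reads back the number it assigned. The helper name pick_free_port is made up for this example.

```python
import socket

def pick_free_port() -> int:
    """Bind a TCP socket to port 0 so the OS assigns a free port, then return it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 means "let the OS choose"
        return sock.getsockname()[1]

port = pick_free_port()
print(port)  # a number between 1 and 65535, different from run to run
```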

2. Communication protocol

Applications need a common set of rules in order to communicate; these rules are called communication protocols. International standards bodies have defined a general-purpose set of such rules, the TCP/IP protocol suite.
A protocol is simply a set of rules that both communicating computers must follow.
HTTP (Hypertext Transfer Protocol) is one such communication protocol; its default port is 80.

3. Network Model

A computer network is a collection of autonomous computers connected to one another by communication links. Network models study the rules by which the components of a network communicate. The term usually refers to the OSI seven-layer reference model or the TCP/IP four-layer reference model, the two most widely used models.

3.1 OSI seven-layer model and TCP/IP four-layer model

OSI seven-layer model: the OSI (Open Systems Interconnection) reference model is a standard framework developed by the International Organization for Standardization (ISO) for interconnecting computers and communication systems.
TCP/IP four-layer model: the TCP/IP reference model is the model used by ARPANET, the ancestor of modern computer networks, and by its successor, the Internet.
Purpose of layering: it makes the network easier to manage.

3.2 Advantages of the seven-layer model:

1. It divides a complex network into layers that are easier to manage (one large, complex problem is split into several small, tractable ones).
2. No single vendor can provide a complete solution with all the equipment and protocols, so standard layers let products from different vendors work together.
3. Each layer completes its task independently, without affecting the others. The division of labor is clear, an upper layer does not need to know the details of the layers below it, and layering also makes network troubleshooting easier.
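
The independence described in point 3 can be sketched with a toy model in which each layer only adds or removes its own header. This is a deliberate simplification: the layer names and the bracketed text headers are invented for illustration and do not correspond to real packet formats.

```python
# Toy model: each layer only adds or removes its own header and
# knows nothing about what the other layers do.
LAYERS = ["transport", "network", "link"]  # applied top-down when sending

def encapsulate(payload: str) -> str:
    """Wrap the application payload in one header per lower layer."""
    for layer in LAYERS:
        payload = f"[{layer}]{payload}"
    return payload

def decapsulate(frame: str) -> str:
    """Strip the headers in reverse order, lowest layer first."""
    for layer in reversed(LAYERS):
        header = f"[{layer}]"
        if not frame.startswith(header):
            raise ValueError(f"expected {header} header")
        frame = frame[len(header):]
    return frame

frame = encapsulate("GET / HTTP/1.1")
print(frame)               # [link][network][transport]GET / HTTP/1.1
print(decapsulate(frame))  # GET / HTTP/1.1
```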

3.2.1 Function and representative equipment

Layer | Name | Function | Devices working at this layer
7 | Application layer | Provides the user interface | Applications such as QQ and IE
6 | Presentation layer | Represents data and performs processing such as encryption | -
5 | Session layer | Keeps the data of different applications separate | -
4 | Transport layer | Provides reliable or unreliable delivery; performs error correction before retransmission | -
3 | Network layer | Provides logical addressing, which routers use for path selection | Layer 3 switch, router
2 | Data link layer | Combines packets into bytes and bytes into frames; provides access to the media using MAC addresses; performs error detection but not correction | Layer 2 switch, network card
1 | Physical layer | Moves bits between devices; specifies voltage levels, wire speed, and cable pin-outs | Hub

3.2.2 Interaction: Why use the TCP/IP four-layer model in modern network communication instead of the OSI seven-layer model?

The OSI seven-layer model is a theoretical model, used mainly for study and research, and its layering is somewhat redundant, so practical networks use the four-layer TCP/IP model. The OSI model also has flaws of its own. Many people assume that its number of layers and the content of each layer must be the best possible choice, but that is not the case: the session and presentation layers are almost empty, while the data link and network layers contain so much that many sublayers have been inserted into them, each with a different function.

3.3 Common network-related protocols

DNS (Domain Name System): domain name resolution protocol; it resolves names such as www.baidu.com to IP addresses
SNMP (Simple Network Management Protocol): network management protocol
DHCP (Dynamic Host Configuration Protocol): lets clients obtain their configuration information on a TCP/IP network
FTP (File Transfer Protocol): a standard protocol and one of the simplest ways to exchange files between computers over a network
TFTP (Trivial File Transfer Protocol): a simplified file transfer protocol
HTTP (Hypertext Transfer Protocol)
HTTPS (Hypertext Transfer Protocol Secure): the secure version of HTTP, developed by Netscape and built into its browser; it encrypts and decrypts the data exchanged between browser and server
ICMP (Internet Control Message Protocol): used by ping; the message types it defines include TTL exceeded (timeout), address request and reply, information request and reply, and destination unreachable
SMTP (Simple Mail Transfer Protocol)
TELNET: virtual terminal protocol
UDP (User Datagram Protocol): a protocol that provides packet-switched communication between computers in an interconnected network environment
TCP (Transmission Control Protocol): a connection-oriented, reliable, byte-stream-based transport layer protocol; a connection is opened with a three-way handshake and closed with a four-way handshake ("four waves")

3.4 The difference between TCP protocol and UDP protocol

(1) TCP: TCP (Transmission Control Protocol) is a connection-oriented protocol. Before any data is sent or received, a reliable connection must be established with the other party.
(2) UDP: UDP (User Datagram Protocol) is a connectionless transport layer protocol that provides a simple, unreliable, transaction-oriented message delivery service.
Summary of the differences between TCP and UDP:
1. TCP is connection-oriented; UDP is connectionless.
2. TCP needs more system resources; UDP needs fewer.
3. UDP programs are structurally simpler. The UDP header is only 8 bytes, compared with at least 20 bytes for the TCP header, so the per-packet overhead is small and transmission can be faster.
4. TCP guarantees that data arrives correctly and in order; UDP may lose packets and does not guarantee ordering.
Scenarios: UDP suits video and voice communication, or environments where the network is good, such as communication within a local area network. Where integrity matters, it can be verified by application layer software.
TCP suits file transfer and other cases with strict data integrity requirements.
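
The connection-oriented versus connectionless distinction can be seen in a small loopback experiment with Python's standard socket module. This is only a sketch: the function names are invented for this example, everything runs on 127.0.0.1, and real code would need more error handling.

```python
import socket
import threading

def tcp_echo_once(payload: bytes) -> bytes:
    """TCP: the client must connect to a listening server before sending."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def serve():
        conn, _ = server.accept()      # completes the connection setup
        with conn:
            conn.sendall(conn.recv(1024))  # echo the data back

    worker = threading.Thread(target=serve)
    worker.start()
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(payload)
        reply = client.recv(1024)
    worker.join()
    server.close()
    return reply

def udp_send_once(payload: bytes) -> bytes:
    """UDP: no connection; a datagram is simply sent to an address."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))
    receiver.settimeout(2)             # a lost datagram would otherwise block forever
    port = receiver.getsockname()[1]
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender.sendto(payload, ("127.0.0.1", port))
    data, _ = receiver.recvfrom(1024)
    sender.close()
    receiver.close()
    return data

reply = tcp_echo_once(b"hello over tcp")
datagram = udp_send_once(b"hello over udp")
print(reply, datagram)
```

Note that the UDP half needs a timeout because nothing in the protocol itself reports a lost datagram.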

3.5 Common TCP and UDP port numbers

(1) TCP port allocation
21 ftp file transfer service
22 ssh secure remote connection service
23 telnet remote connection service
25 smtp email service
53 DNS domain name resolution service; uses both TCP port 53 and UDP port 53
80 http web service
443 https secure web service
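
For a crawler, these defaults matter mostly when a URL omits the port. A minimal sketch, assuming the port numbers from the list above and a hypothetical helper named effective_port:

```python
from urllib.parse import urlsplit

# Default ports copied from the list above.
WELL_KNOWN_PORTS = {
    "ftp": 21, "ssh": 22, "telnet": 23, "smtp": 25,
    "dns": 53, "http": 80, "https": 443,
}

def effective_port(url: str) -> int:
    """Return the explicit port of a URL, or the default for its scheme."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return WELL_KNOWN_PORTS[parts.scheme]

print(effective_port("http://www.baidu.com/"))     # 80
print(effective_port("https://www.baidu.com/"))    # 443
print(effective_port("http://example.com:8080/"))  # 8080
```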

4. HTTP request and response

HTTP communication consists of two parts:

  • The client's request
  • The server's response

A page load involves the following steps:
  • When the user enters a URL in the browser's address bar and presses Enter, the browser sends an HTTP request to the server. The two most common request methods are GET and POST. POST is generally used when the request carries sensitive information such as an account name and password; otherwise GET is generally used.
  • When we enter http://www.baidu.com in the browser, the browser sends a request for the HTML file at http://www.baidu.com, and the server sends the file back to the browser in its response.
  • The browser parses the HTML in the response and finds that it references many other files, such as images, CSS files, and JS files. The browser automatically sends further requests to fetch each referenced file.
  • Once all the files have been downloaded, the page is rendered in full according to the HTML structure.
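
The third step, finding the files a page references, can be sketched with the standard html.parser module. The class name, helper name, and sample HTML below are invented for illustration, and a real browser looks at many more tags and attributes than this.

```python
from html.parser import HTMLParser

class ResourceFinder(HTMLParser):
    """Collect URLs of images, scripts and stylesheets referenced by a page."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.resources.append(attrs["href"])

def find_resources(html: str) -> list:
    finder = ResourceFinder()
    finder.feed(html)
    return finder.resources

sample = """<html><head><link rel="stylesheet" href="/static/site.css">
<script src="/static/app.js"></script></head>
<body><img src="/static/logo.png"></body></html>"""

print(find_resources(sample))
# ['/static/site.css', '/static/app.js', '/static/logo.png']
```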

4.1 The client's HTTP request

A URL only identifies the location of a resource; HTTP is what is used to submit and retrieve it. The request message that a client sends to the server consists of four parts: the request line, the request headers, a blank line, and the request data.
[Figure: general format of an HTTP request message]
A typical HTTP request example:

GET / HTTP/1.1
Host: www.baidu.com
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.75 Safari/537.36
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
Sec-Fetch-Site: same-origin
Referer: https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=Python%20%20%E6%89%8B%E5%8A%A8%E5%9B%9E%E6%94%B6%E5%9E%83%E5%9C%BE&oq=Python%2520%25E6%2594%25B6%25E5%2588%25B0%25E5%259B%259E%25E6%2594%25B6%25E5%259E%2583%25E5%259C%25BE&rsv_pq=f5baabda0010c033&rsv_t=1323wLC5312ORKIcfWo4JroXu16WSW5HqZ183yRWRnjWHaeeseiUUPIDun4&rqlang=cn&rsv_enter=1&rsv_dl=tb&inputT=2315&rsv_sug3=48&rsv_sug2=0&rsv_sug4=2736
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9
Cookie: BIDUPSID=4049831E3DB8DE890DFFCA6103FF02C1;
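
The four-part structure (request line, headers, blank line, data) can be made concrete by splitting a raw request by hand. This is a minimal sketch with a hypothetical parse_request helper and a shortened version of the request above; it ignores details such as repeated or folded headers.

```python
def parse_request(raw: str):
    """Split a raw HTTP request into (method, path, version, headers, body)."""
    # The blank line (CRLF CRLF) separates the headers from the request data.
    head, _, body = raw.partition("\r\n\r\n")
    request_line, *header_lines = head.split("\r\n")
    method, path, version = request_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")  # split at the first colon only
        headers[name.strip()] = value.strip()
    return method, path, version, headers, body

raw = (
    "GET / HTTP/1.1\r\n"
    "Host: www.baidu.com\r\n"
    "Connection: keep-alive\r\n"
    "\r\n"
)
method, path, version, headers, body = parse_request(raw)
print(method, path, version)  # GET / HTTP/1.1
print(headers["Host"])        # www.baidu.com
print(repr(body))             # '' because a GET usually carries no data
```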

4.2 Request method

According to the HTTP standard, HTTP requests can use multiple request methods.
HTTP/0.9: only a basic GET for plain text.
HTTP/1.0: a complete request/response model; it defines three request methods: GET, POST, and HEAD.
HTTP/1.1: an update to 1.0 that adds five request methods: OPTIONS, PUT, DELETE, TRACE, and CONNECT.
HTTP/2 (not yet in universal use): the request and response header definitions are largely unchanged, but all header names must be lowercase, and the request line is replaced by separate pseudo-header key-value pairs: :method, :scheme, :authority, and :path.

No. | Method | Description
1 | GET | Requests the specified page and returns the entity body.
2 | HEAD | Like GET, except that the response contains no body; used to retrieve the headers.
3 | POST | Submits data to the specified resource for processing (for example, submitting a form or uploading a file). The data is carried in the request body. A POST may create new resources and/or modify existing ones.
4 | PUT | Replaces the content of the specified document with the data sent by the client.
5 | DELETE | Asks the server to delete the specified page.
6 | CONNECT | Reserved in HTTP/1.1 for proxy servers that can switch the connection to tunnel mode.
7 | OPTIONS | Lets the client query the capabilities of the server.
8 | TRACE | Echoes back the request as received by the server; mainly used for testing or diagnosis.
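
In Python's standard library, the method is a property of the request object. The sketch below only builds urllib.request.Request objects and prints the method each would use; nothing is sent over the network, and the URL is just a placeholder taken from earlier in the article.

```python
from urllib.request import Request

url = "http://www.baidu.com/"

get_req = Request(url)                      # no body: urllib defaults to GET
post_req = Request(url, data=b"kw=python")  # a body makes the default POST
head_req = Request(url, method="HEAD")      # any method can be set explicitly
delete_req = Request(url, method="DELETE")

for req in (get_req, post_req, head_req, delete_req):
    print(req.get_method())
# GET
# POST
# HEAD
# DELETE
```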

5. Crawler introduction

In one sentence: a crawler is a program that takes a person's place, simulating a browser to perform web operations and collect the required data.
Why do we need crawlers? Mainly to provide data sources for other applications. Companies generally obtain data in three ways: from their own data, by purchasing data, or by crawling. Python is a popular choice for writing crawlers: it has many supporting modules, concise code, and high development efficiency (for example, the Scrapy framework). Crawlers fall into two categories:

  • General crawlers: such as those behind search engines like Baidu and Google.
  • Focused crawlers: selectively crawl content according to a predefined goal.
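
To make the idea of a focused crawler concrete, here is a small offline sketch. Everything in it is an assumption made for illustration: the function name focused_crawl, the regular expression used to find links, and the in-memory FAKE_SITE that stands in for real HTTP responses. A real crawler would fetch pages over the network and use a proper HTML parser.

```python
import re
from collections import deque

LINK_RE = re.compile(r'href="([^"]+)"')

def focused_crawl(start_url, fetch, is_relevant, max_pages=10):
    """Breadth-first crawl that only follows links accepted by is_relevant.

    fetch is any callable mapping a URL to its HTML text; it is injected
    so the sketch can run offline against fake pages.
    """
    seen, queue, pages = {start_url}, deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        for link in LINK_RE.findall(html):
            if link not in seen and is_relevant(link):
                seen.add(link)
                queue.append(link)
    return pages

# Hypothetical in-memory "site" standing in for real HTTP responses.
FAKE_SITE = {
    "/news": '<a href="/news/1">one</a> <a href="/ads/1">ad</a>',
    "/news/1": '<a href="/news">back</a>',
    "/ads/1": "buy now",
}

pages = focused_crawl("/news", FAKE_SITE.__getitem__, lambda u: u.startswith("/news"))
print(sorted(pages))  # ['/news', '/news/1']
```

A general crawler would correspond to an is_relevant function that accepts every link.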

6. Several concepts

6.1 GET and POST

  • GET: the query parameters appear in the URL.
  • POST: the query parameters and the data to be submitted are carried in the request body (for example, as form data) and do not appear in the URL.
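
The difference is easy to see by building both kinds of request with the standard library. This sketch sends nothing; the parameter names are illustrative, and whether the server at this URL accepts a POST is not something the example assumes.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"ie": "utf-8", "kw": "python"}
query = urlencode(params)  # 'ie=utf-8&kw=python'

# GET: the parameters become part of the URL.
get_req = Request("https://tieba.baidu.com/f?" + query)
# POST: the same parameters travel in the request body instead.
post_req = Request("https://tieba.baidu.com/f", data=query.encode("ascii"))

print(get_req.full_url)   # https://tieba.baidu.com/f?ie=utf-8&kw=python
print(get_req.data)       # None
print(post_req.full_url)  # https://tieba.baidu.com/f
print(post_req.data)      # b'ie=utf-8&kw=python'
```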

6.2 Components of URL

URL stands for "Uniform Resource Locator". For example: https://new.qq.com/omn/TWF20200/TWF2020032502924000.html
https: the protocol
new.qq.com: the domain name
port: the port number. A default port implicitly follows new.qq.com (80 for http, 443 for https) and can be omitted.
/omn/TWF20200/TWF2020032502924000.html: the path of the resource
#anchor: the anchor, used by the front end to jump to a position within the page.
Note: when the browser requests a URL, it encodes it. Except for English letters, digits, and some symbols, every character is encoded as % followed by the hexadecimal value of each of its bytes.
For example: https://tieba.baidu.com/f?ie=utf-8&kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B&fr=search
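
Both the URL components and the percent-encoding can be checked with urllib.parse. In this sketch the "#anchor" fragment is appended to the example URL purely for illustration, and the text being encoded is the keyword from the tieba URL above.

```python
from urllib.parse import urlsplit, quote, unquote

parts = urlsplit("https://new.qq.com/omn/TWF20200/TWF2020032502924000.html#anchor")
print(parts.scheme)    # https
print(parts.hostname)  # new.qq.com
print(parts.port)      # None, because the default port was omitted
print(parts.path)      # /omn/TWF20200/TWF2020032502924000.html
print(parts.fragment)  # anchor

encoded = quote("海贼王")  # UTF-8 bytes, each written as % plus two hex digits
print(encoded)            # %E6%B5%B7%E8%B4%BC%E7%8E%8B
print(unquote(encoded))   # 海贼王
```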

6.3 User-Agent

Function: identifies the user's browser, operating system, and so on, so that the server can return the HTML page best suited to the client. For example:
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36

6.4 Referer

Indicates which URL the current request came from. Servers commonly check it as an anti-crawling measure.
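
A crawler that wants to look like a browser typically sets both of these headers itself. The sketch below attaches them to a urllib.request.Request without sending it; the search URL and the Referer value are placeholders chosen for this example.

```python
from urllib.request import Request

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
    "Referer": "https://www.baidu.com/",
}
req = Request("https://www.baidu.com/s?wd=python", headers=headers)

# urllib stores header names capitalized, e.g. "User-agent" and "Referer".
print(req.get_header("User-agent"))
print(req.get_header("Referer"))  # https://www.baidu.com/
```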

6.5 Common status codes

  • 200: OK; the request succeeded
  • 301: Permanent redirect
  • 302: Temporary redirect
  • 404: Not found; the requested resource does not exist on the server
  • 500: Internal server error
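
Python's standard library ships these codes as the http.HTTPStatus enumeration, which is handy when a crawler has to decide how to react to a response. The is_redirect helper is a hypothetical convenience function, not part of the library.

```python
from http import HTTPStatus

for code in (200, 301, 302, 404, 500):
    print(code, HTTPStatus(code).phrase)
# 200 OK
# 301 Moved Permanently
# 302 Found
# 404 Not Found
# 500 Internal Server Error

def is_redirect(code: int) -> bool:
    """3xx codes tell the client to look for the resource elsewhere."""
    return 300 <= code < 400

print(is_redirect(302))  # True
print(is_redirect(404))  # False
```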

6.6 Common panels in the browser developer tools

We can right-click on any web page and select "Inspect" in the menu that appears to open the developer tools. The most commonly used panels are:

  • Elements: the elements of the page
  • Console: the console (prints information)
  • Sources: the source files loaded by the whole site
  • Network: network activity (packet capture); shows the many requests a web page makes

That is enough to know for now.

Origin blog.csdn.net/m0_46738467/article/details/110873601