[Python web crawler] Notes on the paid course "Master Python Web Crawlers Easily in 150 Lectures", Part 1: Crawler basics

1. Crawler basics

1.1 Concept

A crawler is a program that simulates the behavior of a human requesting a website. It automatically requests a web page, grabs the data, and then uses certain rules to extract valuable data.
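The request-then-extract flow described above can be sketched with only the Python standard library. This is a minimal illustration, not the course's own code: the names `LinkExtractor`, `extract_links`, and `fetch`, the sample HTML, and the extraction rule ("collect the href of every <a> tag") are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Hypothetical extraction rule: collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Step 2: apply the rule to the page text and return the matches."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def fetch(url, timeout=10):
    """Step 1: request the page, as a browser would (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Offline demonstration on a literal page, so no network is required.
sample_html = '<p><a href="/news">News</a> and <a href="/about">About</a></p>'
links = extract_links(sample_html)
print(links)
```

In a real crawler, the string passed to `extract_links` would come from `fetch(some_url)` rather than from a literal.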

1.2 Crawler application scenarios:

  1. Search engine (Baidu or Google)
  2. Bole Online
  3. Huihui Shopping Assistant
  4. Data analysis
  5. Ticket grabbing software, etc.

1.3 Why use Python to write crawlers? By comparing multiple high-level languages:

 

2. Introduction to HTTP Protocol

2.1 HTTP protocol 

HTTP stands for HyperText Transfer Protocol, a protocol for publishing and receiving HTML pages. The default server port is 80.

HTTPS is the encrypted version of HTTP: an SSL layer is added beneath HTTP, and the default server port is 443.
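As a small illustration of the two default ports, the sketch below works out which port a URL would connect to. The helper name `server_port` and the example URLs are assumptions for this example; only `urllib.parse.urlsplit` comes from the standard library.

```python
from urllib.parse import urlsplit

# Default ports as stated in the notes: 80 for HTTP, 443 for HTTPS.
DEFAULT_PORTS = {"http": 80, "https": 443}


def server_port(url):
    """Return the explicit port in the URL, else the scheme's default."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    # Returns None for schemes this sketch does not know about.
    return DEFAULT_PORTS.get(parts.scheme)


print(server_port("http://example.com/index.html"))
print(server_port("https://example.com/index.html"))
print(server_port("http://example.com:8080/index.html"))
```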

2.2 URL 

2.3 Common request methods

HTTP/1.1 defines 8 request methods; the two most commonly used are GET and POST.

As an anti-crawler measure, some websites deliberately break convention, for example by requiring a POST request where a GET request would normally be used. In such cases, which method to use depends on how the target site actually behaves.
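A minimal sketch of how the same parameters travel in a GET request versus a POST request, using `urllib.request.Request` from the standard library. The helper `build_request`, the URL, and the parameter names are hypothetical; the requests are only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(url, params, method="GET"):
    """Put params in the query string for GET, in the body otherwise."""
    encoded = urlencode(params)
    if method == "GET":
        return Request(f"{url}?{encoded}", method="GET")
    # For POST, the encoded parameters are sent as the request body (bytes).
    return Request(url, data=encoded.encode("utf-8"), method=method)


get_req = build_request("https://example.com/search", {"q": "python"})
post_req = build_request("https://example.com/search", {"q": "python"}, method="POST")

print(get_req.get_method(), get_req.full_url, get_req.data)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

If a site expects POST where GET would be usual, switching the `method` argument is all this sketch needs; checking the browser's network panel is one way to see which method the site really uses.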

2.4 Common request header parameters

When a request is sent to the server over HTTP, its data is carried in three places: the URL, the request body, and the request headers.

  1. User-Agent: the name of the browser. It identifies the client, so crawlers set it to disguise themselves. When a web page is requested, the server can tell from this parameter which browser sent the request. If the User-Agent identifies the client as Python, a website with an anti-crawler mechanism can easily judge that the request comes from a crawler.
  2. Referer: indicates which URL the current request came from. It can also be used as an anti-crawler check.
  3. Cookie: used to determine whether multiple requests come from the same user, that is, to identify the user.
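The three headers above can be set on a request as in the following sketch. It assumes the standard library's `urllib.request`, whose default User-Agent has the form `Python-urllib/3.x`; the helper name `build_browser_like_request`, the User-Agent string, the URLs, and the cookie value are all made up for illustration.

```python
from urllib.request import Request

# Example browser-style User-Agent string (an assumption, not a required value).
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_browser_like_request(url, referer, cookie):
    """Build a request carrying User-Agent, Referer, and Cookie headers."""
    headers = {
        "User-Agent": BROWSER_UA,  # disguise the crawler as a browser
        "Referer": referer,        # the page the request claims to come from
        "Cookie": cookie,          # identifies the same user across requests
    }
    return Request(url, headers=headers)


req = build_browser_like_request(
    "https://example.com/list?page=2",
    referer="https://example.com/list?page=1",
    cookie="sessionid=abc123",
)

# urllib stores header names capitalized, e.g. "User-agent".
for name, value in req.header_items():
    print(name, ":", value)
```

The request is only built here; passing `req` to `urllib.request.urlopen` would actually send it.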

2.5 Common response status codes

 

Origin blog.csdn.net/weixin_44566432/article/details/108529784