Introduction to Python crawlers

Crawler-Introduction + Packet Capture

Uniform Resource Locator (URL)

https://www.baidu.com

Resource path: the part after the domain name and before the ?

https://www.baidu.com/s?wd=Mobile&pn=20 (wd is the keyword; pn=20 means page 3)

Query string: the part after the ?, which carries the payload (the request parameters)
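The URL parts described above can be pulled apart with Python's standard library. This is a minimal sketch: build_search_url is a hypothetical helper, and the page size of 10 is an assumption inferred from pn=20 corresponding to page 3.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

url = "https://www.baidu.com/s?wd=Mobile&pn=20"

parts = urlsplit(url)
print(parts.scheme)  # protocol
print(parts.netloc)  # domain name
print(parts.path)    # resource path: after the domain name, before the ?
print(parts.query)   # query string: everything after the ?

# parse_qs returns each parameter as a list of values
params = parse_qs(parts.query)


def build_search_url(keyword, page, page_size=10):
    """Hypothetical helper: page 1 -> pn=0, page 2 -> pn=10, page 3 -> pn=20."""
    query = urlencode({"wd": keyword, "pn": (page - 1) * page_size})
    return f"https://www.baidu.com/s?{query}"
```

urlencode also percent-encodes non-ASCII keywords, so the same helper should work for Chinese search terms.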

DNS: domain name resolution service

image-20221102150040140

The Baidu homepage can also be accessed directly through the address marked by the red box in the screenshot above

The page rendered by the browser was not constructed from a single request

Requests

Static requests

The URL entered in the browser's address bar + the various links in the HTML file

Look for them under the Doc (document) filter of the developer tools

Method: GET

Status codes: 200-299 normal response; 300-399 redirect; 400-499 client-side error, often a sign of anti-crawling measures; 500 and above, a problem with the remote server itself
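The ranges above can be written as a small lookup. This is only a sketch: classify_status is a hypothetical helper, and a 4xx code may have causes other than anti-crawling (a wrong URL, for example).

```python
def classify_status(code: int) -> str:
    """Map an HTTP status code to the rough categories used in these notes."""
    if 200 <= code < 300:
        return "normal response"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error (possibly anti-crawling)"
    if 500 <= code < 600:
        return "remote server problem"
    return "other"
```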

Request headers

user-agent: identifies the client; for a given browser the value is fixed and does not change between requests

cookie: identity marker; one means of anti-crawling

referer: which page (usually under the same domain name) the request was sent from

content-type: the type of the request message body
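A sketch of attaching these headers to a request, using only the standard library so it runs without sending anything. All header values are placeholders to be copied from the browser's developer tools. With the requests library, the same dictionary would be passed as headers=headers.

```python
from urllib.request import Request

# Placeholder values: copy the real ones from the browser's request headers
headers = {
    "User-Agent": "Mozilla/5.0 (placeholder user agent)",
    "Cookie": "name=value",
    "Referer": "https://www.baidu.com/",
}

# Building the Request object does not send it
req = Request("https://www.baidu.com/s?wd=Mobile", headers=headers)

# urllib stores header names capitalized, e.g. "User-agent"
sent = dict(req.header_items())
print(req.get_method(), req.full_url)
print(sent)
```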

Dynamic requests

Sent by JavaScript through its communication APIs: XMLHttpRequest | Fetch | jQuery

Look for them under the Fetch/XHR filter

A request consists of: request line + headers + request message (body)
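To make the three parts concrete, the sketch below assembles the raw text of an HTTP/1.1 request. build_raw_request, the host example.com, and the /login path are all hypothetical; real clients add more headers (Content-Length, for instance).

```python
def build_raw_request(method, path, headers, body=""):
    """Join request line, headers, a blank line, and the body with CRLF."""
    request_line = f"{method} {path} HTTP/1.1"
    header_lines = [f"{name}: {value}" for name, value in headers.items()]
    # The empty string produces the blank line that separates headers from body
    return "\r\n".join([request_line, *header_lines, "", body])


raw = build_raw_request(
    "POST",
    "/login",
    {"Host": "example.com", "Content-Type": "application/json"},
    '{"user": "demo"}',
)
print(raw)
```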

Anti-crawling measures

Request headers: cookie, referer; response message: JavaScript

How Fiddler simulates requests: it parses, copies, and resends the request packet sent by the browser (useful for checking whether the remote server performs anti-crawling verification)

The request message sent by the browser is identical to the request message sent by Fiddler

Important: user-agent and cookie

VS Code breakpoint debugging

Continue: the program keeps running until the next breakpoint, or until it completes and the process exits.

Step over: execute one step without entering functions, running directly to the next line of code

Step into: when the current line calls a function, enter that function; otherwise run to the next line of code

Step out: when inside a function, run to the end of the current function and return to the caller

Checks

Check the status code: res.status_code

Check the response message type: res.headers.get('content-type')

Check whether the target data is in the response message: 'Target data' in res.text
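The three checks can be combined into one function. This sketch assumes res behaves like a requests Response (status_code, headers, text); check_response is a hypothetical helper, and a stand-in object is used here so the example runs without network access.

```python
from types import SimpleNamespace


def check_response(res, target):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if not 200 <= res.status_code < 300:
        problems.append(f"unexpected status code {res.status_code}")
    content_type = res.headers.get("content-type", "")
    if "html" not in content_type and "json" not in content_type:
        problems.append(f"unexpected content type {content_type!r}")
    if target not in res.text:
        problems.append("target data not found in response")
    return problems


# Stand-in for a real response object
fake_res = SimpleNamespace(
    status_code=200,
    headers={"content-type": "text/html; charset=utf-8"},
    text="<html>Target data</html>",
)
print(check_response(fake_res, "Target data"))
```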

Save response message

Open cmd in the directory that contains the saved file, start Python's built-in HTTP server with python -m http.server 1234, then open http://localhost:1234 in the browser
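A minimal sketch of the saving step itself. save_response and the file name are hypothetical; the bytes would normally come from res.content.

```python
from pathlib import Path


def save_response(content: bytes, path: str) -> Path:
    """Write the raw response bytes to disk so the page can be served locally."""
    target = Path(path)
    target.write_bytes(content)
    return target


# Usage with a real response (assumed): save_response(res.content, "page.html")
```

Writing bytes rather than text avoids re-encoding the page, so the browser sees exactly what the server returned.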

Possible problems

A terminal that uses a different encoding may be unable to print res.text

Method 1: change the encoding setting in VS Code, or run the script directly with python

Method Two:

print(res.text.encode('utf8',errors='ignore').decode('utf-8',errors='ignore')) 
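A variant of Method Two, assuming the failure comes from characters that the terminal's own encoding cannot represent: encode with that encoding and drop whatever does not fit. printable is a hypothetical helper.

```python
import sys


def printable(text, encoding=None):
    """Drop characters that the given (or the terminal's) encoding cannot represent."""
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="ignore").decode(encoding, errors="ignore")


# Usage with a real response (assumed): print(printable(res.text))
print(printable("héllo 你好", "ascii"))
```

Characters that cannot be encoded are silently lost, so this is only suitable for inspecting output, not for saving data.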

Login verification code returned by the remote server

Session, token

JSON serialization: json.dumps()

The code is:

json.dumps(data)

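Chinese characters are escaped as \uXXXX sequences by default (pass ensure_ascii=False to keep them as-is)

Sending the data without serializing it can result in a 400 response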
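A short sketch of the serialization behavior described above. The payload is made up for illustration.

```python
import json

# Hypothetical payload
data = {"wd": "手机", "pn": 20}

# Default: non-ASCII characters are escaped as \uXXXX sequences
escaped = json.dumps(data)

# ensure_ascii=False keeps the Chinese characters readable
readable = json.dumps(data, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode back to the same dictionary, so a server that parses JSON should treat them identically.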

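Retrieve data

res.text  # response body, returned as a decoded (Unicode) string
res.json()  # response body parsed as JSON
res.content  # response body, returned as raw bytes
res.url  # the full URL
res.encoding  # the character encoding taken from the response headers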
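How these attributes relate to each other can be sketched without a network call. The variables below only mimic the attributes of a response object, and the body is made up.

```python
import json

content = '{"keyword": "手机"}'.encode("utf-8")  # like res.content: raw bytes
encoding = "utf-8"                               # like res.encoding
text = content.decode(encoding)                  # like res.text: decoded string
data = json.loads(text)                          # like res.json(): parsed object

print(type(content), type(text), type(data))
```

If res.text shows garbled characters, the usual cause is a wrong res.encoding; decoding res.content with the correct encoding is one way around it.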

Capture packets

Introduction to developer tools opened by F12

Key points

  • The request for a website's home page is usually the packet whose name is the domain name.

  • Request headers

Origin blog.csdn.net/jiuwencj/article/details/128569061