Crawler-Introduction + Packet Capture
Uniform Resource Locator (URL)
https://www.baidu.com
Resource path: the part after the domain name and before the ?
https://www.baidu.com/s?wd=Mobile&pn=20 (wd is the keyword; pn=20 corresponds to page 3)
Query string: the part after the ?, which carries the payload
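As a minimal sketch of the URL parts described above, the standard library can split the example URL into domain name, resource path, and query string. The variable names are only illustrative.

```python
from urllib.parse import urlsplit, parse_qs

# The example search URL from the notes, without the inline annotations.
url = "https://www.baidu.com/s?wd=Mobile&pn=20"
parts = urlsplit(url)

scheme = parts.scheme          # protocol, before "://"
domain = parts.netloc          # domain name
path = parts.path              # resource path: after the domain name, before "?"
query = parse_qs(parts.query)  # query string: after "?", parsed into a dict of lists

print(scheme, domain, path, query)
```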
DNS: domain name resolution service
Baidu homepage can be accessed directly through the red box section
The page rendered by the browser is not built from a single request
Requests
Static requests
The URL entered in the URL navigation bar + various links in the html file
Look for it under the Doc (document) filter
GET
Status codes: 200-299 normal response; 300-399 redirect; 400-499 client error, often a sign of anti-crawling; 500 and above problem with the remote server itself
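The status code ranges above can be sketched as a small helper. The function name and the returned labels are hypothetical, chosen only for this illustration.

```python
def classify_status(code):
    """Map an HTTP status code to the rough categories used in these notes."""
    if 200 <= code < 300:
        return "normal"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        # A 4xx is a client error; when crawling it often means the request was rejected.
        return "client error (possibly anti-crawling)"
    if 500 <= code < 600:
        return "server error"
    return "other"


for code in (200, 302, 403, 500):
    print(code, classify_status(code))
```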
Request headers
user-agent: the user agent; it can be hard-coded in the crawler because it does not change
cookie: identity identifier; one of the means of anti-crawling
referer: the page the request came from (which page under the same domain led here)
content-type: the type of the request body
Dynamic requests
JavaScript communication mechanisms: XMLHttpRequest | Fetch (native APIs) | jQuery (library)
Look for them under the Fetch/XHR filter
Request line + headers + request body
Anti-crawling measures
Headers: cookie, referer; response body: JavaScript
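To make the three parts concrete, here is a rough sketch that assembles a raw HTTP/1.1 request message as text. The helper name and the header values are placeholders, and a real client also handles details such as Content-Length that are left out here.

```python
def build_request(method, path, headers, body=""):
    """Assemble request line + headers + blank line + body into one message."""
    request_line = f"{method} {path} HTTP/1.1"
    header_lines = [f"{name}: {value}" for name, value in headers.items()]
    # An empty line separates the headers from the request body.
    return "\r\n".join([request_line, *header_lines, "", body])


message = build_request(
    "GET",
    "/s?wd=Mobile&pn=20",
    {"Host": "www.baidu.com", "User-Agent": "placeholder-agent"},
)
print(message)
```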
How Fiddler simulates requests: it parses, copies, and resends the request packet sent by the browser (used to find out whether the remote server performs anti-crawling verification)
The request message sent by Fiddler is identical to the request message sent by the browser
Important: user-agent, cookie
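The same replay idea can be sketched in Python: copy the request headers from the browser and turn them into a dict that a crawler can send. The helper name and all header values below are placeholders, not real credentials.

```python
def parse_headers(raw):
    """Turn header text copied from the browser ("name: value" per line) into a dict."""
    headers = {}
    for line in raw.strip().splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        # Split on the first colon only, so values such as URLs stay intact.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


copied = """
user-agent: Mozilla/5.0 (placeholder)
cookie: sessionid=PLACEHOLDER
referer: https://www.baidu.com/
"""
headers = parse_headers(copied)
print(headers)
```

The resulting dict could then be passed to an HTTP client, for example as the headers argument of requests.get.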
VS Code breakpoints
Continue: the program keeps running until the next breakpoint, or until it finishes and the process exits.
Step over: run the current line without entering any function it calls, then stop at the next line of code
Step into: if the current line calls a function, enter that function; otherwise run to the next line of code
Step out: when inside a function, run to the end of the current function and return to its caller
Checks
Check the status code: res.status_code
Check the response content type: res.headers.get('content-type')
Check whether the target data is in the response body: 'target data' in res.text
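The three checks above can be combined into one helper. This is a sketch that takes plain values so it runs without a network; in a crawler the arguments would come from res.status_code, res.headers.get('content-type'), and res.text. The function name and the expected_type default are assumptions.

```python
def check_response(status_code, content_type, text, target, expected_type="text/html"):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if not 200 <= status_code < 300:
        problems.append(f"status code is {status_code}")
    # content_type may be None when the header is missing.
    if expected_type not in (content_type or ""):
        problems.append(f"content-type is {content_type!r}")
    if target not in text:
        problems.append(f"{target!r} not found in the response body")
    return problems


print(check_response(200, "text/html; charset=utf-8", "<p>Mobile</p>", "Mobile"))
print(check_response(403, None, "", "Mobile"))
```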
Save the response body
In the directory where the file was saved, open cmd, start Python's built-in HTTP server, and open the page in the browser: python -m http.server 1234
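A minimal sketch of the saving step, assuming the response body is already available as a string (in a crawler, res.text). The helper name, the file name page.html, and the temporary directory are illustrative choices.

```python
import tempfile
from pathlib import Path


def save_response(text, directory, name="page.html"):
    """Write the response body to a file, always as UTF-8, and return its path."""
    path = Path(directory) / name
    path.write_text(text, encoding="utf-8")
    return path


# A temporary directory stands in for the project folder here.
demo_dir = tempfile.mkdtemp()
saved = save_response("<html><body>手机</body></html>", demo_dir)
print(saved)
```

Running python -m http.server 1234 in that directory would then serve the file at http://localhost:1234/page.html.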
Possible problems
A terminal that uses a different encoding may be unable to print res.text
Method 1: change the encoding setting in VS Code, or run the script directly with python
Method 2:
print(res.text.encode('utf8',errors='ignore').decode('utf-8',errors='ignore'))
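A variation on Method 2, offered as a sketch rather than the notes' own approach: encode to the terminal's encoding instead of UTF-8, replacing any character the terminal cannot display. The helper name is hypothetical.

```python
import sys


def printable(text, encoding=None):
    """Return text with characters the target encoding cannot represent replaced by '?'."""
    # sys.stdout.encoding can be None when output is redirected, so fall back to UTF-8.
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding, errors="replace")


# Forcing ASCII here only to show the replacement behaviour.
print(printable("price: 100\u20ac", "ascii"))
```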
The remote server returns a login CAPTCHA
Session, token
JSON serialization
The code is:
json.dumps(data)
Chinese characters are escaped as \uXXXX sequences by default
Sending the data without serializing it can result in a 400 response
Retrieving data
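The escaping behaviour can be seen in a short sketch; the payload below is made up for illustration.

```python
import json

data = {"wd": "手机", "pn": 20}

# Default: non-ASCII characters are escaped as \uXXXX sequences.
escaped = json.dumps(data)

# ensure_ascii=False keeps the characters as they are.
readable = json.dumps(data, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode back to the same data on the server side, so either form is a valid JSON request body.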
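res.text  # response content, returned as a Unicode string
res.json()  # response content parsed as JSON
res.content  # response content, returned as raw bytes
res.url  # the full URL
res.encoding  # character encoding taken from the response headers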
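How these attributes relate can be sketched with plain values standing in for a real response. This only approximates what the requests library does; for instance, it ignores the encoding detection used when the headers give no charset.

```python
import json

# Stand-ins for res.content and res.encoding (made-up payload).
content = '{"keyword": "手机", "page": 3}'.encode("utf-8")
encoding = "utf-8"

text = content.decode(encoding)  # roughly what res.text returns
data = json.loads(text)          # roughly what res.json() returns

print(type(content).__name__, type(text).__name__, data)
```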
Packet capture
Introduction to developer tools opened by F12
Key points
- The home page of a website is usually the packet whose name is the domain name.
- Request headers