Introductory crawler study notes, Day 2, plus a record of small problems encountered

1. Sending a post request with the requests module

1. Implementation: requests.post(url, data=data), where data is a dictionary.
2. Take the Kingsoft PowerWord web page as an example. Enter a word (here, the word for "dictionary") and the page translates it; "out": dictionary also appears in the response shown on the inspection page.
3. Then use the json analysis web page (json.cn): paste the response text you just saw into it, and the "out" under "content" holds the translation we want.
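
The steps above can be sketched roughly as follows. This is only an illustration: the form field names in build_form_data and the url passed to translate are hypothetical and must be copied from the real captured request. Only the "content" and "out" keys come from the response described above.

```python
import json


def build_form_data(word):
    """Return the form fields to post as a plain dictionary."""
    # Hypothetical field names; copy the real ones from the captured request.
    return {"from": "auto", "to": "auto", "word": word}


def extract_translation(response_text):
    """Parse the JSON response text and return the value under content -> out."""
    data = json.loads(response_text)
    return data["content"]["out"]


def translate(word, url):
    """Post the word to url (an assumed endpoint) and return the translation."""
    # Imported here so the helpers above work without the third-party package.
    import requests

    response = requests.post(url, data=build_form_data(word), timeout=10)
    response.raise_for_status()
    return extract_translation(response.text)
```

Calling extract_translation on a saved copy of the response text is a quick way to check the parsing step before sending any real request.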

2. Sources of post data

1. Fixed values: capture packets and find the values that never change.
2. Input values: capture packets and compare them to see which values change with your own input.
3. Preset values in static files: these must be obtained from the static html in advance.
4. Preset values that need a request: send a request to the specified address to get the data.
5. Values generated on the client (browser): analyze the js, then simulate it to generate the data.
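
As a minimal sketch of sources 1 to 3 working together, the example below merges a fixed value, two input values, and a preset value read out of static html. All field names (commit, username, password, token) are hypothetical, and the regular expression assumes the name attribute appears before the value attribute inside the input tag.

```python
import re

# Fixed value found by packet capture (hypothetical field and value).
FIXED_FIELDS = {"commit": "Sign in"}


def extract_hidden_value(html, field_name):
    """Return the value of an <input> with the given name, or None if absent."""
    # Assumes name="..." comes before value="..." inside the tag.
    pattern = r'<input[^>]*name="{}"[^>]*value="([^"]*)"'.format(
        re.escape(field_name)
    )
    match = re.search(pattern, html)
    return match.group(1) if match else None


def build_post_data(html, username, password):
    """Combine fixed values, input values, and a preset value from static html."""
    data = dict(FIXED_FIELDS)
    data["username"] = username  # input value
    data["password"] = password  # input value
    data["token"] = extract_hidden_value(html, "token")  # preset value
    return data
```

The resulting dictionary can then be passed as the data argument of requests.post.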

3. The requests module: session (using session for state maintenance)

The session class automatically stores the cookies generated while sending requests and receiving responses, and sends them with later requests, which maintains the state.

1. The role and application scenarios of session

Role: handles cookies automatically, so the next request carries the cookies from the previous ones.
Application scenario: automatically handling the cookies generated across multiple consecutive requests.

2. How to use session

session = requests.session()
response = session.get(url, headers=headers, …)
response = session.post(url, data=data, …)

The get and post methods of the session object take the same parameters as the get and post functions of the requests module.
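
A typical use is logging in and then requesting a page that requires the login cookie. The sketch below assumes a hypothetical helper, login_then_fetch, with made-up parameter names; it accepts any object offering the same post and get methods as a requests session.

```python
def login_then_fetch(session, login_url, page_url, form_data):
    """Log in with a post request, then fetch a page with the same session."""
    # The cookies set by the login response are stored on the session.
    login_response = session.post(login_url, data=form_data, timeout=10)
    login_response.raise_for_status()

    # The session sends those cookies automatically with this request.
    page_response = session.get(page_url, timeout=10)
    page_response.raise_for_status()
    return page_response.text
```

With the requests module installed, the session argument would be created with requests.session() and passed in together with the real urls and form data.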

4. Data extraction: classification of response content

1. Structured response content: once the url is found, the data can be obtained from it directly.
(1) json data (occurs frequently and carries more data): json module, re module, jsonpath module
(2) xml data: re module, lxml module

2. Unstructured response content: the structure varies from page to page (for example, each article has a different number of paragraphs).
(1) html (most commonly used): re module, lxml module
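
To make the two categories concrete, here is a small sketch that uses only the json and re modules from the list above. Both sample strings are made up for illustration; in a real crawler they would be the text of a response.

```python
import json
import re

# Structured content: json text is parsed into dictionaries and lists.
json_text = '{"content": {"out": "dictionary"}}'
data = json.loads(json_text)
out_value = data["content"]["out"]

# Unstructured content: html is searched with regular expressions.
html_text = (
    "<html><head><title>Study notes</title></head>"
    "<body><p>first paragraph</p><p>second paragraph</p></body></html>"
)
title = re.search(r"<title>(.*?)</title>", html_text).group(1)
paragraphs = re.findall(r"<p>(.*?)</p>", html_text)

print(out_value)
print(title)
print(paragraphs)
```

Regular expressions work for simple, predictable markup like this; for nested or irregular html, the lxml module listed above is usually the more robust choice.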

5. xml and html

1. xml

Extensible Markup Language.
Differences from html:
(1) Its function focuses on transmitting and storing data.
(2) Tags can be defined by the author.

2. html

Hypertext Markup Language.
Function: displaying data, with a focus on how to display it better.
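
The point about self-defined tags can be seen in a short sketch. The tag names bookstore, book, title, and price are invented for this example, and the parsing uses the standard library's xml.etree.ElementTree rather than the lxml module mentioned earlier, so that it runs without extra installation.

```python
import xml.etree.ElementTree as ET

# xml with self-defined tags: it stores data and says nothing about display.
xml_text = (
    "<bookstore>"
    "<book><title>Crawler basics</title><price>30</price></book>"
    "<book><title>Parsing data</title><price>25</price></book>"
    "</bookstore>"
)

root = ET.fromstring(xml_text)

# Collect the text of each self-defined tag under every <book> element.
titles = [book.findtext("title") for book in root.findall("book")]
prices = [int(book.findtext("price")) for book in root.findall("book")]

print(titles)
print(prices)
```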

6. Problems encountered/tips

1. How to enter incognito mode

2. Find Ajax data on inspection page

Origin blog.csdn.net/qq_51669241/article/details/122397875