Python crawler essentials: a detailed guide to the browser developer tools

Recently, many friends have said that they do not know how to use browser developer tools. Today we will take a closer look at the developer tools.

Take Google Chrome as an example

The Network panel in Chrome's developer tools is the one we use most often when learning to write crawlers, but do you know what each of its functions does?


Since I usually use 360 Speed Browser, which is built on the Chrome kernel, this article uses the Network panel of 360 Speed Browser as the example. It is basically the same as the Network panel in Chrome.

The Network panel can roughly help us do the following:

Look at the return value of an interface

Look at the request headers and response headers of an interface

Check the loading speed of resources

Check resource size, cache status, and response timing (CDN, waiting time, and so on)

The Network panel is mainly divided into 7 sections.

1. Functional area
2. Filtering area (funnel in the functional area needs to be turned on)
3. Snapshot area (screen capture needs to be turned on in the functional area)
4. Timeline area (overview needs to be turned on in the functional area)
5. Main display area
6. Information summary area
7. Console

as the picture shows

1. Functional area

1. The red dot shows whether network logging is turned on. Gray means it is off; red means it is on.

2. Clear network logs

3. The camera captures screenshots. It is turned off by default. After you click it, the icon turns blue and snapshots of the page at different moments are recorded. Area 3 (the snapshot area) is displayed only when this button is turned on.

4. The funnel shows whether the filter options are turned on. Area 2 (the filtering area) is displayed only when this button is turned on.

5. The magnifying glass is the quick search button. It searches the content of the requests recorded for the current page.

6. Controls whether request records are displayed in larger rows. I prefer the larger display, so I generally leave it turned on.

7. Controls whether the overview is displayed. Area 4 (the timeline area) is displayed only when this button is turned on.

8. Group by frame: after checking this option, network requests are grouped by the frame that issued them, as shown in the figure below.

9. Important: Preserve log controls whether the log is kept. If it is checked, the log is not cleared when the page is refreshed or jumps to another page. This is very useful when you want to see what was requested before a page jump, for example when studying another website's login flow: what is returned after a successful login, what requests are initiated after the login, and which address the page is redirected to. It helps in two ways:

(1) We can see the jumps through intermediate pages, which saves us the trouble of capturing packets.

(2) We can compare the current data with the data from the previous page.

10. Disable cache is the cache switch. When it is turned on, the browser's cache of JS, CSS, images, and other resources for the current website is bypassed, and every request is sent to the server again. Ctrl+F5 achieves the same effect for a single reload. Once Disable cache is checked there is no need to uncheck it: while the developer tools are open, every load skips the cache.

Grouping effect after checking Group by frame

11. Offline is the network connection switch. It simulates a disconnected network, which is useful when testing a PWA, for example, and it is a quick way to set up a bad-network scenario.

12. The Online drop-down list is the network throttling setting. It caps the upload and download speed so that you can see how a web page behaves under different network conditions.

2. Filtering area

1. Function: On pages that make many requests, we often need to filter the request list.

2. Function: The toolbar provides path filtering (regular expressions are supported) and type filtering (All, XHR, JS, Img, and so on) for quick viewing. Press and hold Ctrl to select several types at once.

3. The role of Hide data URLs: Website developers often embed some small images or CSS scripts into HTML in BASE64 format to reduce the number of HTTP requests. When the Hide data URLs option is checked, you can hide requests like data: or blob: in the request list.

4. Filter search box

In addition to the above filters provided by Chrome, you can also use filter attributes in the filter box to filter request logs very flexibly.

Plain text in the filter box performs a fuzzy search (on the URL only). Wrapping the text in / at both ends switches to regular-expression matching (which searches the URL and the returned content at the same time). Older versions of Chrome may show a Regex option to the right of the filter box; check it for the regular expression to take effect.

Common filter attributes can be found in the table below.

Text version:

domain: Show only requests for the specified domain name. It supports auto-completion and the * wildcard.

has-response-header: Show only requests whose response contains the specified header.

is: Use is:running to find WebSocket requests.

larger-than: Show only requests larger than the specified size in bytes, where 1000 is equivalent to 1k.

method: Show only requests that use the specified HTTP method, such as GET or POST.

mime-type: Show only requests of the specified MIME type.

mixed-content: Show only mixed-content requests, that is, resources loaded over insecure HTTP by a page that was itself served over HTTPS.

scheme: Show only requests that use the specified protocol, such as scheme:http or scheme:https.

set-cookie-domain: Show only requests whose Set-Cookie header has a Domain attribute matching the specified value.

set-cookie-name: Show only requests whose Set-Cookie header sets a cookie with the specified name.

set-cookie-value: Show only requests whose Set-Cookie header sets a cookie with the specified value.

status-code: Show only requests with the specified HTTP status code.

(1) Filter example 1: To filter the resources that a web page requests from different domain names, type [domain:] in the filter box, and Chrome will auto-complete the domain names for us.

(2) Filter example 2: To check which requests in the open page were served from the cache, use [is:from-cache]. It is used in the same way as [is:running].
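The filter attributes above can also be reproduced offline, which is handy when a crawler has to work through many requests. Below is a minimal sketch in Python, assuming the request log has been exported from the Network panel as a HAR file (a JSON format whose records live under log.entries). The helper name filter_entries and the sample data are hypothetical, and the matching rules only approximate what domain:, method:, status-code: and larger-than: do in the filter box.

```python
from fnmatch import fnmatch
from urllib.parse import urlsplit


def filter_entries(entries, domain=None, method=None, status_code=None, larger_than=None):
    """Return the HAR entries that match every filter that was given.

    Roughly mirrors domain:, method:, status-code: and larger-than:.
    This is an illustration, not the logic DevTools itself uses.
    """
    matched = []
    for entry in entries:
        request, response = entry["request"], entry["response"]
        host = urlsplit(request["url"]).hostname or ""
        # domain accepts * wildcards, like domain:*.example.com
        if domain is not None and not fnmatch(host, domain):
            continue
        if method is not None and request["method"].upper() != method.upper():
            continue
        if status_code is not None and response["status"] != status_code:
            continue
        # Assumption: compare against the uncompressed body size in bytes.
        if larger_than is not None and response["content"]["size"] <= larger_than:
            continue
        matched.append(entry)
    return matched


# Hypothetical sample in HAR shape. A real export would be read with
# json.load(open("example.har", encoding="utf-8"))["log"]["entries"].
sample_entries = [
    {"request": {"method": "GET", "url": "https://api.example.com/items?page=1"},
     "response": {"status": 200, "content": {"size": 5120, "mimeType": "application/json"}}},
    {"request": {"method": "POST", "url": "https://api.example.com/login"},
     "response": {"status": 302, "content": {"size": 0, "mimeType": "text/html"}}},
    {"request": {"method": "GET", "url": "https://static.example.org/logo.png"},
     "response": {"status": 200, "content": {"size": 800, "mimeType": "image/png"}}},
]

for entry in filter_entries(sample_entries, domain="*.example.com", method="GET"):
    print(entry["request"]["url"])
```

Passing several arguments at once combines the filters, in the same way that several attributes typed into the filter box narrow the list together.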

3. Snapshot area and 4. Timeline area

These two areas mainly break down how the web page loads over time.

The snapshot area lets you see more intuitively how the browser opens the web page step by step and how long the whole page takes to open.

In the timeline area, if the loading times are packed too closely to read, you can scroll the mouse wheel to zoom in, which is very helpful for finding slow-loading files on the web page.

5. Main display area

1. The main display area includes the columns Name, Status, Type, Initiator, Size, Time, and Waterfall (waterfall analysis).

Name: The name of the requested resource

Status: HTTP status code

Type: MIME type of the requested resource

Initiator: The object or process that initiated the request

Size: The size of the response returned by the server (headers plus body). The uncompressed size of the resource can be displayed as well.

Time: Total duration, from the start of the request to receiving the last byte of the response

Waterfall: A visual breakdown of the activity related to each request

2. By default, the request list is sorted in ascending order by the time each request was initiated. To sort by another column, click the corresponding column header.

3. Click the Name of a resource to view its details. The information displayed depends on the type of the selected resource and may include the following tabs:

Headers: The HTTP header information of the resource.

Preview: A preview rendered according to the type of the selected resource (JSON, image, text).

Response: The HTTP response content.

Cookies: The cookies sent and received in the HTTP request and response for the resource.

Timing: How much time the resource spent in each phase of its request lifecycle.

Let's go through each of the five tabs above in detail:

① View resource HTTP header information

In the Headers tab, you can see basic information such as HTTP Request URL, HTTP Method, Status Code, and Remote Address, as well as detailed information such as Response Headers, Request Headers, and Query String Parameters or Form Data.
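For crawler work, the values in the Headers tab are what we copy into our own code. The sketch below uses only Python's standard urllib and shows one way to rebuild a request from the Request URL, Query String Parameters, Form Data, and Request Headers. The function name build_request and every URL and header value are placeholders rather than values from a real site, and the request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(url, query=None, form=None, headers=None):
    """Build (but do not send) a request from values copied out of the Headers tab."""
    if query:
        # Query String Parameters are appended to the Request URL.
        url = f"{url}?{urlencode(query)}"
    # Form Data becomes the request body; urllib switches to POST when data is set.
    data = urlencode(form).encode("utf-8") if form else None
    return Request(url, data=data, headers=headers or {})


# All values below are placeholders used for illustration only.
request = build_request(
    "https://example.com/api/search",
    query={"q": "python", "page": 1},
    headers={
        "User-Agent": "Mozilla/5.0 (placeholder)",
        "Referer": "https://example.com/",
    },
)
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request, timeout=10)
```

If a site rejects the rebuilt request, comparing its headers one by one with the Request Headers shown in the panel is usually the quickest way to find what is missing.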

② View resource preview information

In the Preview tab, a preview is displayed according to the type of the selected resource (JSON, image, text, JS, CSS). The figure below shows the preview when the selected resource is in JSON format.

③ View response information of resource HTTP

In the Response tab, the raw response content (as a plain string) of the selected resource is displayed, whatever its type (JSON, image, text, JS, CSS). The image below shows the response content when the selected resource is in CSS format.

④ View resource cookie information

If cookies were sent or received during the request and response for the selected resource, the Cookies tab is displayed automatically, and all of the cookie information can be viewed there.

Name: The name of the cookie.
Value: The value of the cookie.
Domain: The domain name to which the cookie belongs.
Path: The URL path to which the cookie belongs.
Expires / Max-Age: The expiry time or lifetime of the cookie.
Size: The size of the cookie in bytes.
HttpOnly: Indicates that the cookie is only sent in HTTP requests and cannot be read or modified by JS.
Secure: Indicates that the cookie can only be transmitted over secure connections.
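The same fields can be read in a crawler from a raw Set-Cookie response header. Here is a minimal sketch using Python's standard http.cookies module. The function name describe_set_cookie and the sample header are hypothetical, and Size is only approximated as the length of the name plus the value.

```python
from http.cookies import SimpleCookie


def describe_set_cookie(header_value):
    """Parse one Set-Cookie header value into the fields listed above."""
    jar = SimpleCookie()
    jar.load(header_value)
    rows = []
    for name, morsel in jar.items():
        rows.append({
            "Name": name,
            "Value": morsel.value,
            "Domain": morsel["domain"],
            "Path": morsel["path"],
            # Either attribute may be present; show whichever one is set.
            "Expires / Max-Age": morsel["expires"] or morsel["max-age"],
            # Assumption: approximate the size as name length plus value length.
            "Size": len(name) + len(morsel.value),
            "HttpOnly": bool(morsel["httponly"]),
            "Secure": bool(morsel["secure"]),
        })
    return rows


# Hypothetical header value, for illustration only.
sample = "sessionid=abc123; Domain=example.com; Path=/; Max-Age=3600; Secure; HttpOnly"
for row in describe_set_cookie(sample):
    print(row)
```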

⑤ Analyze the time consumption information of each part of the resource in the life cycle of the request

The Timing tab shows how much time the resource spent in each phase of its request lifecycle. The following phases may appear:

Queueing: The time spent waiting in the queue. A request may be queued because the rendering engine treats it as a relatively low-priority resource (such as an image), because the server is unavailable, or because the browser's limit on concurrent connections has been reached (Chrome allows at most 6 concurrent connections per origin over HTTP/1.x).

Stalled: The time the request spent waiting before it could actually be sent (before data transfer starts). It includes time spent on proxy handling and, if an established connection exists, the time spent waiting for that connection to become available for reuse.

Proxy Negotiation: The time spent negotiating the connection with the proxy server.

DNS Lookup: The time spent performing the DNS lookup. Every new domain name on the page requires a DNS query. If the browser has the result cached on a later visit, this time is 0.

Initial Connection / Connecting: The time it takes to establish a connection, including the TCP handshake and any retries.

SSL: The time taken to complete the SSL handshake.

Request sent: The time spent sending the request.

Waiting (TTFB, time to first byte): The time spent waiting for the first byte of the response after the request has been sent. It covers the round trip to the server plus the time the server takes to prepare the response.

Content Download: The time it takes to receive the response data.

If the TTFB phase takes more than 200 ms, you should consider optimizing network performance, for example by following a network performance optimization guide and the reference documents it links to.
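The phases above can also be checked in a script. The sketch below assumes the data comes from an exported HAR file, where each entry has a timings object with the keys blocked, dns, connect, ssl, send, wait, and receive (in milliseconds, with -1 meaning the phase did not apply). The function names, the mapping to the Timing tab labels, and the sample numbers are illustrative assumptions. The 200 ms threshold is the one mentioned above.

```python
def summarize_timings(timings):
    """Map a HAR `timings` object to phase names similar to the Timing tab."""
    phases = {
        "Stalled / Blocked": timings.get("blocked", -1),
        "DNS Lookup": timings.get("dns", -1),
        "Initial Connection": timings.get("connect", -1),
        "Request sent": timings.get("send", -1),
        "Waiting (TTFB)": timings.get("wait", -1),
        "Content Download": timings.get("receive", -1),
    }
    # HAR uses -1 for phases that did not apply (e.g. a reused connection).
    phases = {name: ms for name, ms in phases.items() if ms is not None and ms >= 0}
    # The ssl time is already included in connect, so it is not added again.
    total = sum(phases.values())
    return phases, total


def slow_ttfb(timings, threshold_ms=200):
    """Return True when the waiting phase exceeds the threshold."""
    return timings.get("wait", -1) > threshold_ms


# Hypothetical timings for one request on a reused connection.
sample_timings = {"blocked": 2.5, "dns": -1, "connect": -1, "ssl": -1,
                  "send": 0.5, "wait": 320.0, "receive": 12.0}

phases, total = summarize_timings(sample_timings)
for name, ms in phases.items():
    print(f"{name}: {ms} ms")
print("total:", total, "ms, slow TTFB:", slow_ttfb(sample_timings))
```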

How to view the initiator (request source) and dependencies of a resource (hold down Shift)

Hold down Shift and move the cursor over a resource name to see which object or process initiated the resource (its request source) and which resources were requested as a result of it (its dependent resources).

The first resource marked in green above the hovered resource is its initiator (request source). If a second resource is marked in green above that one, it is the initiator of the initiator, and so on.

The resources marked in red below the hovered resource are its dependent resources.

6. Information summary area

The information summary area displays information such as [Number of requests], [Amount of data transferred], and [Loading time information].

The DOMContentLoaded event is triggered after the DOM on the page is completely loaded and parsed, and does not wait for CSS, images, and subframes to be loaded. The DOMContentLoaded event is marked with a blue vertical line in the Overview, and the exact time is displayed in blue text in the Summary.

The load event is triggered after all DOM, CSS, JS, and images on the page are completely loaded. The load event will also be marked with a red vertical line on the Overview and Requests Table, and the exact time will also be displayed in red text in the Summary.

Putting this together with the steps of loading a document, the DOMContentLoaded and load events are triggered at the following points:

Parse HTML structure.

Load external scripts and stylesheet files.

Parse and execute script code. // Some scripts will block the loading of the page

The DOM tree is constructed. //DOMContentLoaded event

Load external files such as images.

The page is loaded. //load event
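The figures shown in the information summary area can also be recomputed from an exported HAR file. This is a rough sketch that assumes the standard HAR fields are present: headersSize and bodySize on each response (with -1 meaning unknown) and onContentLoad and onLoad under pageTimings. The function name summarize_har and the sample data are hypothetical, and the totals may differ slightly from what the panel itself reports.

```python
def summarize_har(har):
    """Compute request count, transferred bytes, and page event times from HAR data."""
    log = har["log"]
    entries = log["entries"]
    transferred = 0
    for entry in entries:
        response = entry["response"]
        # headersSize and bodySize are -1 when unknown, so negatives are skipped.
        transferred += max(response.get("headersSize", -1), 0)
        transferred += max(response.get("bodySize", -1), 0)
    pages = log.get("pages") or []
    page_timings = pages[0].get("pageTimings", {}) if pages else {}
    return {
        "requests": len(entries),
        "transferred_bytes": transferred,
        "dom_content_loaded_ms": page_timings.get("onContentLoad"),
        "load_ms": page_timings.get("onLoad"),
    }


# Hypothetical HAR data; a real file would be read with json.load().
sample_har = {
    "log": {
        "pages": [{"pageTimings": {"onContentLoad": 480, "onLoad": 1320}}],
        "entries": [
            {"response": {"headersSize": 300, "bodySize": 1200}},
            {"response": {"headersSize": 250, "bodySize": -1}},
        ],
    }
}

print(summarize_har(sample_har))
```

In data like this, onContentLoad is smaller than onLoad, which matches the order of the steps above: the DOM tree is finished before images and other external files are.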

7. Console

This area was originally a separate panel in the F12 tools, but because it is closely related to the Network panel, it is shown together with it here. I will introduce it in a separate article later.


Okay, that’s the end of today’s sharing, see you next time~

Origin blog.csdn.net/fei347795790/article/details/132747974