Python crawler self-study series (4)


Preface

The last article talked about caching in crawlers, which is relatively difficult material, and since it doesn't deal directly with web pages, it may have felt a bit dry.

In this article, we will talk about another way of handling dynamic web pages: capturing and processing JSON packets.

After that, let's talk about things related to interface interaction.

Why can I cover so much at once? Because I've touched on these topics before, so here we'll summarize them and go a bit deeper.


About the JSON packets of dynamic web pages

Unlike a simple single-page form, a page that relies on JavaScript no longer downloads all of its content as soon as it loads. Because of this structure, much of the content shown in the browser may never appear in the HTML source code, so the crawling techniques we introduced earlier cannot extract the important information from such sites.

I've given several examples before to explain why JSON is used and how to capture it; a post many people liked is: I want to learn Python secretly, and then stun everyone (day 9).
That post talked about crawling my own article on CSDN: the page source came down fine, but the data in the comment section was 'hidden' and didn't show up!!!
Only after working with the JSON string did I finally find the 'lost' data.


Apart from the case where the information simply isn't in the page source, I use JSON parsing even more often when getting cookies.
Why? Because I don't know in advance the URL where the JSON string lives, which means I have to walk through the capture process myself!!!
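To make this concrete, here is a minimal sketch of the workflow (not my exact code from that post): open the browser's DevTools, watch the Network panel (XHR/Fetch) while the hidden content loads, copy the request URL, then fetch and parse the JSON with requests. COMMENT_API and the field names below are made-up placeholders, not a real endpoint.

import requests

# Hypothetical XHR URL copied from DevTools -> Network -> XHR/Fetch
COMMENT_API = "https://example.com/api/comments?article_id=123&page=1"

headers = {
    "User-Agent": "Mozilla/5.0",  # look like a normal browser
    "Referer": "https://example.com/article/123",
}

resp = requests.get(COMMENT_API, headers=headers)
resp.raise_for_status()
data = resp.json()  # the 'hidden' comment data comes back as JSON

for item in data.get("comments", []):  # field names depend on the real API
    print(item.get("user"), item.get("content"))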




Human-computer interaction

Take a look at my earlier introduction to cookies and sessions: I want to learn Python secretly, and then stun everyone (day 11)

Sensitive data should only be sent using POST requests to avoid exposing the data in the URL.

If you insist on being self-reliant and logging in with a POST request yourself, then I have to point out: the fields you can see on the page are not the only data that needs to be submitted; some input boxes are hidden.

You can inspect the 'input' elements on the page using the method mentioned before, but I still recommend using XPath to grab them all in one go, because it would be embarrassing to miss one by eyeballing the page.
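As a rough sketch (assuming a hypothetical LOGIN_URL and that the login form is an ordinary form element), you can collect every input name/value pair, hidden ones included, with a single XPath query:

import lxml.html
import requests

LOGIN_URL = "https://example.com/login"  # placeholder, not a real site

page = requests.get(LOGIN_URL)
tree = lxml.html.fromstring(page.text)

# One XPath grabs all <input> elements inside the form, including hidden ones
data = {}
for field in tree.xpath('//form//input'):
    name = field.get('name')
    if name:
        data[name] = field.get('value', '')

print(data)  # fill in the username/password keys before POSTing this dict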

Is that it? Not yet!
We still need cookies.
So how do we supply the cookie?

import requests

html = requests.get(LOGIN_URL)  # GET the login page first to pick up its cookies
second_response = requests.post(LOGIN_URL, data=data, cookies=html.cookies)  # send them back with the POST

Like this.


That's troublesome, though, so why not use my method: log in manually first, find the cookie in the browser after you've logged in, and then use it to crawl the data down.

When the login requires a verification code, you will discover just how effective this method is.
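Here is a minimal sketch of that approach, assuming you have already logged in by hand (captcha and all) and copied the Cookie header from DevTools; the cookie values and URL below are placeholders:

import requests

COOKIE_STR = "sessionid=xxxx; csrftoken=yyyy"  # copied from the browser after a manual login

headers = {
    "User-Agent": "Mozilla/5.0",
    "Cookie": COOKIE_STR,  # reuse the logged-in session directly
}

resp = requests.get("https://example.com/protected/data", headers=headers)
print(resp.status_code, len(resp.text))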


This article is still a bit short, so let's fold Selenium in here; otherwise, if Selenium got its own separate post later, it would have to link back to this post and that post, which would be messy.

Selenium automation


A brief introduction to Selenium can be found in this article: I want to learn Python secretly, and then stun everyone (day 11).

Later, I made a small project with Selenium. Here are the records from that time:
I want to learn Python secretly, and then stun everyone (day 12)
Optimized a piece of code overnight, asking for advice

Although installing and using Selenium with a common desktop browser is quite convenient, problems can arise when running those scripts on a server. On servers, headless browsers are more commonly used; they also tend to be faster and more configurable than full-featured web browsers.
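Here is a small sketch of running Selenium headlessly, assuming Chrome and a matching chromedriver (or Selenium 4's built-in Selenium Manager) are installed on the server; the target URL is a placeholder:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")     # no GUI, suitable for servers
options.add_argument("--disable-gpu")
options.add_argument("--no-sandbox")

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")      # the browser renders JavaScript for you
print(driver.title)
driver.quit()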

Another reason to use a browser-based parser like Selenium is that it behaves less like a crawler. Some websites use honeypot-style anti-crawling techniques: their pages may contain hidden trap links, and when your script clicks one, your crawler gets blocked. For such problems, Selenium's browser-based architecture makes it a more robust crawler. In addition, your request headers will report the exact browser you are using, and you can rely on normal browser features such as cookies, sessions, and the loading of images and interactive elements, which are sometimes required before a specific form or page will load.


That's it for this article.

