[Hands-on crawler tutorial] Quickly crawl thousands of short videos: a Haokan Video scraping walkthrough

One. The basic concepts of crawlers

For the benefit of newcomers, let's go over the basic concepts of crawlers before the project starts. Readers who already have crawler experience can skip this part and jump straight to the case study.

  • 1. Introduction to the basic concepts of crawlers

    • A crawler is essentially an automated data-capture program, written by a human, that simulates user behavior.

    • Crawlers grab data from web pages; audio, video, and text are all data.

    • Crawling is just one branch of technology, used mainly to capture data from websites and web applications. Among the many ways of acquiring data, it is one of the cheapest and most efficient.

    • From a purely technical point of view, "everything on the World Wide Web can be crawled", but in practice you must crawl in strict compliance with national laws and regulations and with the rules that sites publish (such as robots.txt).

  • 2. How does a crawler grab web data?

    This comes down to the three major characteristics of web pages (a tiny sketch follows the list):

    • 1. Every web page has its own unique URL

    • 2. Web pages all use HTML to describe the page's information

    • 3. Web pages all use the HTTP/HTTPS protocol to transmit HTML data
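
A minimal sketch that exercises all three characteristics at once, using the third-party `requests` library and example.com as a stand-in address:

```python
import requests

url = "https://example.com/"     # characteristic 1: every page has a unique URL
response = requests.get(url)     # characteristic 3: carried over HTTP/HTTPS
print(response.text[:200])       # characteristic 2: the body is HTML describing the page
```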

  • 3. The fixed routine, summarized

  • Four main steps (a skeleton sketch follows the list):

    • 1. Analyze the target page

    • 2. Send the request (request) and get the response (response) data

    • 3. Parse the data and extract the parts we really need

    • 4. Save the data to the target file
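
A minimal skeleton of these four steps, assuming a public test endpoint (httpbin.org) as a stand-in for the real data URL:

```python
import requests

# Step 1: page analysis gives us the real data URL (stand-in below)
url = "https://httpbin.org/json"

# Step 2: send the request and receive the response
response = requests.get(url)

# Step 3: parse the response and extract the part we need
data = response.json()                 # this endpoint returns JSON
title = data["slideshow"]["title"]     # walk the dict by key names

# Step 4: save the extracted data to a target file
with open("result.txt", "w", encoding="utf-8") as f:
    f.write(title)
```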

  • Request and response

    • 1. The browser sends information to the server where the URL lives. This process is called the HTTP Request.

    • 2. After the server receives the information sent by the browser, it processes it according to the content and sends information back to the browser. This process is called the HTTP Response.

    • 3. After the browser receives the Response from the server, it processes the information accordingly and then displays the result to the user. (A short round-trip example follows.)
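
One round trip in code, again with example.com as a stand-in address:

```python
import requests

# requests.get sends the HTTP Request; the returned object wraps the
# server's HTTP Response.
response = requests.get("https://example.com/")

print(response.status_code)               # 200 means the request succeeded
print(response.headers["Content-Type"])   # metadata the server sent back
print(response.text[:80])                 # start of the HTML the browser would render
```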


Two. Crawler case: crawling short videos from Haokan Video

Now let's formally get into today's hands-on topic: how to quickly crawl thousands of short videos, using Haokan Video as the example. Without further ado, let's look at the source code first; a detailed explanation follows.

Case code:
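
A minimal sketch of the case code, assembled from the logic explained in section three. The data URL and the exact JSON nesting are assumptions; substitute the Request URL and key names you find in your own DevTools capture.

```python
import requests

# Step 1: the dynamic-data URL found under Network -> XHR -> Headers.
# This is a placeholder; paste in the Request URL from your own capture.
DATA_URL = "https://haokan.baidu.com/"

# Disguise the script as a normal browser via the user-agent string
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/88.0.4324.150 Safari/537.36"),
}

# Step 2: send the request and get the response data
response = requests.get(DATA_URL, headers=headers)
response.raise_for_status()
data = response.json()

# Step 3: peel the onion layer by layer. The nesting below is an
# assumption; follow the structure shown in your own Preview tab.
videos = data["data"]["response"]["videos"]
for video in videos:                 # each video is one dictionary
    title = video["title"]           # value fetched by its key name
    play_url = video["play_url"]

    # Step 4: download the video bytes and save them to a file
    # (real titles may need sanitizing before use as file names)
    video_bytes = requests.get(play_url, headers=headers).content
    with open(f"{title}.mp4", "wb") as f:
        f.write(video_bytes)
    print("saved:", title)
```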


Three. Detailed explanation of the code logic

1. Page analysis and URL analysis

What exactly should you analyze about the target page? First, determine whether it is a static or a dynamic web page. How? (We'll use the "funny" category of Haokan Video as the example.)

  • Right-click on the page and choose "View page source".

  • What opens is the real, original data the server returned to the browser, before any page rendering.

  • Next, search the source for a video's title (global search: Ctrl+F), and you'll find it can't be found!

[Screenshot: a Ctrl+F search of the page source for a video title returns 0 results]

  • The search returns 0 results, which tells us that today's target is a dynamic web page: all of its data is loaded dynamically. Go back to the video page and scroll down, and the page keeps loading new videos for you. That is dynamic loading.

So how do we capture the data packets of a dynamic web page? Right-click the page and choose "Inspect".

A console like the one below pops up. Go to the Network tab to capture traffic; this is a packet-capture-like tool built into the browser. Because we are crawling a dynamic site, select the XHR filter underneath: it filters out just the dynamic data, meaning all of the page's dynamically loaded data is gathered under XHR.

[Screenshot: the DevTools Network panel with the XHR filter selected]

Click the second data packet, and the original data the server returned to the browser pops up. The Preview tab formats that raw data for you, so you can fold and expand it. Expand the entries and you can see that they correspond to the video titles on the page.

With the data packet found, the next step is to open its Headers tab and find the Request URL. The web address that follows it is the URL we are after today.

[Screenshot: the Headers tab showing the Request URL]

Now compare this Request URL with the one in the browser's address bar and notice that they differ. The address of the dynamic data packet is usually not the address shown in the navigation bar. This is why analyzing the site matters: you cannot assume that requesting a site means requesting whatever URL the address bar shows; you have to find the genuinely correct URL. (A hypothetical comparison follows the screenshot.)

[Screenshot: the packet's Request URL compared with the address-bar URL]
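
To make the difference concrete, here is a hypothetical comparison; both URLs below are made up for illustration:

```python
from urllib.parse import urlparse

page_url = "https://haokan.baidu.com/tab/gaoxiao"                    # address bar (made up)
packet_url = "https://haokan.baidu.com/api/feed?tab=gaoxiao&num=20"  # XHR packet (made up)

# Different paths and query strings mean different resources, even
# though the host is the same.
for url in (page_url, packet_url):
    parts = urlparse(url)
    print(parts.path, "|", parts.query)
```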

Next, locate Request Headers inside the Headers tab. The request header carries many parameters; the one we need today is user-agent. What does it do? As mentioned earlier, a crawler simulates a user requesting the server. To avoid being detected and blocked by the target server, the script has to disguise itself, and the user-agent is the browser's identification string.

[Screenshot: the user-agent field under Request Headers]

2. Send the request: use requests to simulate the browser sending a request and get the response data
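
A minimal request sketch, where DATA_URL is a placeholder for the Request URL captured above and the user-agent is an ordinary Chrome identification string:

```python
import requests

DATA_URL = "https://haokan.baidu.com/"   # placeholder: paste in the real packet URL

# Copy the user-agent from Request Headers so the script looks like a
# normal browser visit instead of a bare Python client.
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/88.0.4324.150 Safari/537.36"),
}

response = requests.get(DATA_URL, headers=headers)
print(response.status_code)   # 200 means the server accepted the request
```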

3. Parse the data

Today we are crawling videos, so we need each video's title and its playback URL. After locating title and play_url, we can get at the videos by peeling the onion layer by layer, because each video is one piece of video data stored in dictionary format: it can be read as key-value pairs, with each value fetched by its key name. (A parsing sketch follows the screenshot.)

[Screenshot: the JSON response expanded, showing title and play_url]
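
A self-contained parsing sketch; the sample dictionary below only models the kind of nesting seen in the Preview tab, so follow the structure of your own captured packet:

```python
# "Peeling the onion": each video is a dictionary, so we index key by key.
data = {
    "data": {
        "response": {
            "videos": [
                {"title": "funny clip 1", "play_url": "https://example.com/1.mp4"},
                {"title": "funny clip 2", "play_url": "https://example.com/2.mp4"},
            ]
        }
    }
}

videos = data["data"]["response"]["videos"]   # walk down layer by layer
for video in videos:                          # each `video` is one dict
    title = video["title"]                    # value fetched by its key name
    play_url = video["play_url"]
    print(title, play_url)
```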

4. Save the data

Write the crawler in Python and you can download hundreds of videos with one click, and the speed is quite fast.
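
A minimal saving sketch, assuming the title and play_url were extracted in the previous step (the URL here is a placeholder):

```python
import requests

title = "funny clip"                           # extracted in step 3
play_url = "https://example.com/video.mp4"     # placeholder play address

# Request the play address and write the raw bytes; "wb" (write binary)
# matters because video data is not text.
video_bytes = requests.get(play_url).content
with open(f"{title}.mp4", "wb") as f:
    f.write(video_bytes)
print("saved:", title)
```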

[Screenshot: the downloaded video files]



Origin: blog.csdn.net/zhiguigu/article/details/114927361