Python3 crawler: solving the problem of fetching asynchronously requested data

Table of Contents

Problem Description

Solutions

Option One

Option Two


Problem Description

When crawling a detail page, I also need the number of comments, but the comment count is not requested synchronously with the detail page; it is loaded afterwards by a separate request. If you fetch the page directly with urllib.request.urlopen, the comment count has not yet been filled in at the moment the page is retrieved, so the comment data cannot be obtained.
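A minimal sketch of the failing approach (the detail URL is the one that appears later in this post; the marker string checked for is hypothetical):

import urllib.request

# Fetch the detail page directly. The HTML returned here is the initial
# snapshot only; the comment count is filled in later by a separate
# asynchronous request, so it is not present yet.
url = "https://i.learn.hello.com/detail.htm?courseId=117292"
html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")
print("commentCount" in html)  # hypothetical field name; missing at this point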

Solutions

Option One

Since the comment count is filled in after the detail page's own data has loaded, one option is to wait a while, until the comment data has also been loaded and populated on the page, and only then grab the page. At that point the comment data is available in the page source.

Use Selenium: selenium-python documentation (Chinese)

  • Selenium was originally an automated testing tool. In crawlers it solves the problem that plain HTTP requests cannot execute JavaScript. Selenium drives a real browser and fully simulates browser operations such as navigation, input, clicks, and scrolling, so you get the rendered result of the page; multiple browsers are supported. A sketch follows this note.
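A minimal sketch of this approach, assuming Chrome and a hypothetical CSS selector for the comment-count element (neither is specified in the original post):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://i.learn.hello.com/detail.htm?courseId=117292")
    # Wait up to 10 seconds for the asynchronously loaded comment-count
    # element to appear; ".comment-count" is a hypothetical selector.
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".comment-count"))
    )
    print(element.text)
finally:
    driver.quit()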

In my case this would have solved the problem, but it is too slow and could hurt the program's performance: I only need a single data field. If the wait is too short, the request may finish before the data has loaded; if the wait is too long, the whole program slows down badly. After reading some introductory material and other people's write-ups, I did try Selenium in my own program, but there was too much to configure, so I gave up on it.

Option Two

The second option is to request the link that returns the comment count separately, and that is how I solved my problem. Sometimes such a data URL can simply be requested on its own, but that did not work here, because the link I needed depends on the previous step: if the detail page has not been loaded first, the comment-count request is not served and is instead redirected to the login page. The fix is to put the detail-page URL into the headers of the comment-count request (as the Referer), disguising it as a normal in-page data request.

Of course, I only learned to do it this way after trying many approaches and asking others for advice. Colleagues collected the data through the visual interface of a collection platform, but we were asked to collect it with code. After asking around, I learned that the URL of the preceding page (the one that triggers the asynchronous request) has to be added to the request headers.

At the beginning, I used this:


import urllib.request


def getcommentcount(preurl, suburl):
    # Note: these braces build Python *sets*, not (name, value) tuples --
    # this turns out to matter, as discussed at the end of this post.
    headers = {"User-Agent",
               "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"}
    prevheaders = {"Referer", preurl}
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]
    # This assignment *replaces* the headers set on the previous line
    # instead of adding to them.
    opener.addheaders = [prevheaders]
    urllib.request.install_opener(opener)
    file = urllib.request.urlopen(suburl)
    data = file.read()
    return data

But the program was very unstable and kept reporting errors:

Python ValueError: Invalid header name b'Https://I.Learn.Hello.Com/Detail.Htm?CourseId=117292'

The incoming preurl actually looks like this:

https://i.learn.hello.com/detail.htm?courseId=117292

Searching the error message in a browser turned up similar reports from others, e.g. Python ValueError: Invalid header name b':authority'

I tried the solutions suggested there in my own program, but they did not solve the problem. I began to suspect that the way I was setting the headers was wrong: everyone online used the method above to set a single header, while I was setting two. So I looked for another way to set the headers.

def getcommentcount(url, suburl):
    req = urllib.request.Request(suburl)
    # Set each header individually, as a proper (name, value) pair.
    req.add_header("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36")
    req.add_header("Referer", url)
    file = urllib.request.urlopen(req)
    data = file.read()
    return data
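Hypothetical usage (the real comment-count URL is not shown in this post, so suburl below is a placeholder):

detail_url = "https://i.learn.hello.com/detail.htm?courseId=117292"
count_url = "https://i.learn.hello.com/comment/count?courseId=117292"  # placeholder
print(getcommentcount(detail_url, count_url))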

With the headers set this way, the problem was solved and no more errors were reported. I did not get to the bottom of the cause at the time; a likely explanation is sketched below.
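Looking back at the first snippet, a plausible cause (my reading of the error, not a verified diagnosis): {"User-Agent", ...} and {"Referer", preurl} are Python set literals, while opener.addheaders expects a list of (name, value) tuples. A set iterates in arbitrary order, so sometimes the URL ended up being used as the header name, which is exactly what the Invalid header name error above complains about. A quick demonstration:

# Braces build a set, not a (name, value) tuple.
headers = {"Referer", "https://i.learn.hello.com/detail.htm?courseId=117292"}
print(type(headers))  # <class 'set'>
print(list(headers))  # arbitrary order: the URL may come out first and
                      # then gets treated as the header *name*

# What opener.addheaders actually expects: a list of 2-tuples.
correct = [("Referer", "https://i.learn.hello.com/detail.htm?courseId=117292")]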

Origin blog.csdn.net/someby/article/details/105056151