Programming Xiaobai's self-study notes 9 (introduction to Python web crawlers + detailed code explanation)

Series Article Directory

Programming Xiaobai's self-study notes 8 (multithreading in Python)

Programming Xiaobai's self-study notes 7 (inheritance of classes in Python)

Programming Xiaobai's self-study notes 6 (static methods and dynamic methods of classes in Python)

Programming Xiaobai's self-study notes 5 (Python class methods)

Programming Xiaobai's self-study notes 4 (regular expression module search function)


Foreword

Many novices get into Python because they have heard it is great for writing crawler scripts, and I have finally reached the crawler part myself. I won't rehash the earlier material on HTML or on how clients and servers interact; let's get straight to the topic.


1. Use the get method to request data

Developing a web crawler requires the third-party module requests, which we need to install first. The command is as follows:

pip install requests

After the installation is complete, we can use the get method to fetch a page. This is equivalent to entering the URL in a browser: the server returns a page to us.

The get method of the requests library sends a GET request to the server. Its commonly used parameters are as follows:

  • url: the URL to request.
  • params: query-string parameters to append to the URL.
  • headers: request header information.
  • cookies: cookie information.
  • proxies: proxy server addresses.
  • timeout: how long to wait for a response, in seconds.
  • verify: whether to verify the server's SSL certificate.
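
To see how several of these parameters fit together, here is a hedged sketch; httpbin.org is a public echo service used purely for illustration, and we will start with a much simpler example right after:

import requests

# A sketch combining several get() parameters; httpbin.org echoes
# the request back, which makes it convenient for experiments.
resp = requests.get('https://httpbin.org/get',
                    params={'q': 'python'},            # appended to the URL as ?q=python
                    headers={'user-agent': 'chrome'},  # sent as a request header
                    timeout=5,                         # give up after 5 seconds
                    verify=True)                       # check the SSL certificate (the default)
print(resp.url)          # the final URL including the query string
print(resp.status_code)  # 200 means success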

 Let's look at a small example:

import requests

url = 'http://www.baidu.com'
try:
    req = requests.get(url)
    print(req.text)
except requests.RequestException:
    print('Query failed')

The code is still very simple, and the returned result is:

<!DOCTYPE html>

<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

The response is quite long; I deleted some of the markup in the middle. This is the source code of the website. As you can see, we only had to pass a URL to the get method.
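
Besides .text, the response object returned by get has a few other attributes worth knowing. A minimal sketch, using the same Baidu URL:

import requests

req = requests.get('http://www.baidu.com')
print(req.status_code)               # 200 means the request succeeded
print(req.encoding)                  # the encoding requests guessed from the response headers
print(req.headers['Content-Type'])   # response headers behave like a dictionary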

2. Crawl the Kugou Music Ranking List

Now for some hands-on practice: crawling the data of the Kugou music ranking list. The page is Kugou TOP500_Ranking List_Leku Channel_Kugou.com, where you can see the song rankings.

[Screenshot: the Kugou TOP500 ranking page]

Right-click and choose Inspect to find where the song information lives in the HTML. The song name and singer are both in the title attribute of an <li> element, so once we fetch the full page content with the get method, we can use a regular expression to extract the information we need.

[Screenshot: browser dev tools showing the title attribute of the <li> elements]

Here is the code for this exercise:

import requests
import re

url = 'https://www.kugou.com/yy/rank/home/1-8888.html'
try:
    req = requests.get(url)
    songs = re.findall(r'<li.*?title="(.*?)".*?>', req.text)
    for song in songs:
        print(song)
except requests.RequestException:
    print('Query failed')

The program runs without errors, but it does not give the result we want: the output is empty, meaning there was no match. To investigate, I added print(req.text) to see what response we actually got.

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>403 Forbidden</title>
<style type="text/css">body{margin:5% auto 0 auto;padding:0 18px}.P{margin:0 22%}.O{margin-top:20px}.N{margin-top:10px}.M{margin:10px 0 30px 0}.L{margin-bottom:60px}.K{font-size:25px;color:#F90}.J{font-size:14px}.I{font-size:20px}.H{font-size:18px}.G{font-size:16px}.F{width:230px;float:left}.E{margin-top:5px}.D{margin:8px 0 0 -20px}.C{color:#3CF;cursor:pointer}.B{color:#909090;margin-top:15px}.A{line-height:30px}.hide_me{display:none}</style>
</head>
<body>
<div id="p" class="P">
<div class="K">403</div>
<div class="O I">Forbidden</div>
<p class="J A L">Error Times: Fri, 23 Jun 2023 06:40:43 GMT
<br>
<span class="F">IP: 60.174.21.124</span>Node information: CS-000-01uyG161
<br>URL: https://www.kugou.com/yy/rank/home/1-8888.html
<br>Request-Id: 64953e6b_CS-000-01uyG161_35678-151
<br>
<br>Check:
<span class="C G" onclick="s(0)">Details</span></p>
</div>
<div id="d" class="hide_me P H">
<div class="K">ERROR</div>
<p class="O I">The Requested URL could not be retrieved</p>
<div class="O">
<div>While trying to retrieve the URL:</div>
<pre class="B G">https://www.kugou.com/yy/rank/home/1-8888.html</pre></div>
<div class="M">
<span>The following error was encountered:</span>
<ul class="E">
<li class="D G">Invalid Request</li></ul>
</div>
<p class="M">The access control configuration prevents your request at this time.
<p></p>Please contact your service provider if you feel this is incorrect.</p>
<a class="N C" href="#" onclick="s(1)">return</a></div>
<script type="text/javascript">function e(i) {
return document.getElementById(i);
}
function d(i, t) {
e(i).style.display = (t ? 'block': 'none');
}
function s(e) {
d('p', e);
d('d', !e);
}</script>
</body>
</html>
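
Incidentally, printing the whole page is not the only way to spot this kind of failure: checking the HTTP status code is quicker. A small sketch, assuming the server refuses us with a 403 as above:

import requests

req = requests.get('https://www.kugou.com/yy/rank/home/1-8888.html')
print(req.status_code)    # 403 when the server refuses the request, as it did here
req.raise_for_status()    # or raise an exception automatically for any 4xx/5xx response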

The 403 page contains no singers or song content: the Kugou site has some anti-crawling measures in place. So we add headers={'user-agent':'chrome'} to the get call to simulate browser access. The code is as follows:

import requests
import re

url = 'https://www.kugou.com/yy/rank/home/1-8888.html'
try:
    req = requests.get(url, headers={'user-agent': 'chrome'})
    # print(req.text)
    songs = re.findall(r'<li.*?title="(.*?)"', req.text)
    for song in songs:
        print(song)
except requests.RequestException:
    print('Query failed')

This time the program successfully returns the results we want:

Su Xingjie - Listen to sad love songs
fingertip laugh - don't ask ciaga
Guo Ding - Poignantly
An Aries - Can't wait for you
Ren Xia - Sad Love
Jane Zhang, Wang Heye - It's You (Live)
Mae Stephens - If We Ever Broke Up (Explicit)
Kui Kui - what are you doing baby
Zhang Zihao - can you
Jay Chou - Promised happiness?
Jay Chou - Sunny Day
Wang Sulong, Jike Junyi - Letting Go (Live)
Seunghwan - I will wait
Tanya Chua - Letting Go
Ren Xia - Insomnia Love Song (Live Chorus Version)
Su Xingjie - Thinking of you in the evening wind
Jay Chou - I cry and my emotions are fragmented
Cloud Dog Egg - If there is love in the sky
Cheng Xiang - possible
A-Lin - If there is love in heaven
RE-D, Erhaya, masta - sure
GEM Deng Ziqi - I like you

Let's analyze it in detail:

  1. Use the get method to fetch the page content. This is the same as in the example at the beginning: the get method sends a request to the server, and the server returns the data.
  2. Add the headers parameter. Our first attempt returned nothing because the website restricts access: it checks whether a request was sent by a normal browser, and ours was clearly flagged as abnormal. Adding the headers parameter {'user-agent':'chrome'} declares the browser type as Chrome, which is enough to fool the server.
  3. Use a regular expression to extract the results we want, which first requires importing the re module. The original page content looks like <li class=" " title="Su Xingjie-Listen to Sad Love Songs" data-index="0" data-eid="8id4200b">, so we only need to match text that starts with <li and contains title="...". The regular expression <li.*?title="(.*?)" does exactly that: .*? is a non-greedy match of any characters except line breaks, and the parenthesized group (.*?) is the part findall returns, which here is the singer plus the song (see the sketch after this list).
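
To see the regular expression in isolation, here is a minimal sketch that runs it against a hard-coded copy of the <li> element quoted above:

import re

# One <li> element copied from the ranking page's source.
html = '<li class=" " title="Su Xingjie-Listen to Sad Love Songs" data-index="0" data-eid="8id4200b">'
print(re.findall(r'<li.*?title="(.*?)"', html))   # ['Su Xingjie-Listen to Sad Love Songs']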

Summary

The requests library is a third-party Python library for sending HTTP requests. It provides a simple, easy-to-use API that makes it straightforward to perform various HTTP operations such as GET, POST, PUT, and DELETE.
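
For completeness, the other request methods follow the same pattern as get. A hedged sketch, again using the httpbin.org echo service purely for illustration:

import requests

# Each HTTP verb has a matching top-level function in requests.
requests.post('https://httpbin.org/post', data={'key': 'value'})   # create
requests.put('https://httpbin.org/put', data={'key': 'value'})     # update
requests.delete('https://httpbin.org/delete')                      # delete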

The main features of the requests library are as follows:

1. Ease of use: the API of the requests library is concise and clear, making it easy to pick up.
2. High flexibility: the requests library supports multiple HTTP request methods and parameter settings, which can be configured flexibly as needed.
3. Good performance: the requests library uses an efficient HTTP parser and connection pooling, which improves the speed and stability of requests.
4. Support for multiple data formats: response bodies can be read as raw bytes, as text (HTML, XML, etc.), or parsed as JSON via the json() method.
5. Good cross-platform compatibility: the requests library runs on Windows, Linux, macOS, and other operating systems; older releases supported Python 2.x, while current versions require Python 3.

In short, the requests library is a very practical HTTP tool that helps developers implement all kinds of network requests quickly.

Origin: blog.csdn.net/m0_49914128/article/details/131712586