1. Import the necessary modules:
This post shows how to write a Python crawler that fetches image information from the Douyu live-streaming site and saves it locally. We will use the requests module to send HTTP requests and receive responses, and the os module to handle file and directory operations.
If a module import error occurs, open the console and run the following (a domestic mirror source is recommended for faster downloads in China):
pip install requests -i https://mirrors.aliyun.com/pypi/simple
Here is a list of common domestic mirror sources:
Tsinghua University
https://pypi.tuna.tsinghua.edu.cn/simple
Aliyun (Alibaba Cloud)
https://mirrors.aliyun.com/pypi/simple/
Douban
https://pypi.douban.com/simple/
Baidu Cloud
https://mirror.baidu.com/pypi/simple/
USTC (University of Science and Technology of China)
https://pypi.mirrors.ustc.edu.cn/simple/
Huawei Cloud
https://mirrors.huaweicloud.com/repository/pypi/simple/
Tencent Cloud
https://mirrors.cloud.tencent.com/pypi/simple/
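To avoid typing the -i flag on every install, pip's index URL can also be set once as a default. A minimal sketch, using the Tsinghua mirror as an example (any mirror from the list above works the same way):

```shell
# Set a default package index so every later "pip install" uses the mirror
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

# Verify the setting
pip config get global.index-url
```

This writes the setting to pip's user-level configuration file, so it persists across sessions.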
2. Send a GET request to obtain response data:
The request headers are set to simulate a browser request, and the function returns the response body parsed as JSON.
import requests

def get_html(url):
    header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    }
    response = requests.get(url=url, headers=header)
    # print(response.json())  # uncomment to inspect the raw response
    html = response.json()
    return html
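The os module mentioned at the start comes into play when saving the downloaded images to disk. Below is a minimal sketch; the save_image and download_image helpers, the douyu_images directory name, and the filename logic are illustrative assumptions, not Douyu's actual API:

```python
import os
import requests

def save_image(content: bytes, directory: str, filename: str) -> str:
    """Write raw image bytes to directory/filename, creating the directory if needed."""
    os.makedirs(directory, exist_ok=True)  # create the target folder if it does not exist
    path = os.path.join(directory, filename)
    with open(path, "wb") as f:  # binary mode: image data is bytes, not text
        f.write(content)
    return path

def download_image(url: str, directory: str = "douyu_images") -> str:
    """Fetch one image with requests and save it locally (hypothetical usage)."""
    header = {"user-agent": "Mozilla/5.0"}
    response = requests.get(url, headers=header)
    # derive a filename from the last path segment of the URL
    filename = url.rstrip("/").split("/")[-1] or "image.jpg"
    return save_image(response.content, directory, filename)
```

In a real run you would loop over the image URLs extracted from the JSON returned by get_html() and call download_image() for each one.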
How to get the request header (Firefox):
- Open the target page and right-click on an empty area of the page.
- Select the "Inspect Element" option, or press the shortcut Ctrl + Shift + C (Windows).
- In the developer tools window, switch to the Network tab.
- Refresh the page to capture all network requests.
- Select the request you are interested in from the request list.
- The header information appears in the "Request Headers" section on the right.
- Copy the User-Agent value from there into your code.
3. Crawling Kugou TOP500 ranking list
Extract each song's ranking, title, singer, duration, and other information from the Kugou music TOP500 chart.
Specific steps are as follows:
- Import the required modules: requests is used to send HTTP requests, BeautifulSoup is used to parse HTML, and time is used to control the crawler's speed.
- Set the request header information: set the User-Agent through the headers dictionary to simulate a browser request and avoid being blocked by the website.
- Define the function get_info(url): this function receives a URL parameter and crawls the information of the specified page.
- Send a network request and parse the HTML: use requests.get() to fetch the page's HTML content, then parse it with the BeautifulSoup module.
- Locate the required information through CSS selectors: use the select() method to locate the song ranking, song title, duration, and other elements.
- Loop through each record and store it in a dictionary: use zip() to pair up the ranking, title, and duration elements, then store each record in the data dictionary.
- Print the obtained information: use print() to output the data dictionary.
- Main program entry: use if __name__ == '__main__': to check whether the current file is being run directly, and if so execute the code below it.
- Build the list of pages to crawl: use a list comprehension to construct the list of page addresses.
- Call the function for each page: use a for loop over the URL list, calling get_info() to fetch each page's information.
- Control the crawler's speed: use time.sleep() to slow down requests and avoid getting the IP banned.
Source code:
import requests  # send network requests and fetch the HTML
from bs4 import BeautifulSoup  # parse the HTML and extract the needed information
import time  # control the crawler's speed to avoid getting the IP banned

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"
    # browser header information to simulate a real request
}

def get_info(url):
    # url: the address of the page to crawl
    web_data = requests.get(url, headers=headers)  # send the request and fetch the HTML
    soup = BeautifulSoup(web_data.text, 'lxml')  # parse the HTML
    # locate the needed information through CSS selectors
    ranks = soup.select('span.pc_temp_num')
    titles = soup.select('div.pc_temp_songlist > ul > li > a')
    times = soup.select('span.pc_temp_tips_r > span')
    # loop through each record and store it in a dictionary
    # (the loop variable is named "duration" so it does not shadow the time module)
    for rank, title, duration in zip(ranks, titles, times):
        data = {
            "rank": rank.get_text().strip(),  # song ranking
            "singer": title.get_text().replace("\n", "").replace("\t", "").split('-')[1],  # singer
            "song": title.get_text().replace("\n", "").replace("\t", "").split('-')[0],  # song title
            "time": duration.get_text().strip()  # song duration
        }
        print(data)  # print the scraped record

if __name__ == '__main__':
    urls = ["https://www.kugou.com/yy/rank/home/{}-8888.html".format(str(i)) for i in range(1, 24)]
    # build the list of page addresses to crawl
    for url in urls:
        get_info(url)  # call the function to get the page's information
        time.sleep(1)  # slow down to avoid getting the IP banned
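The title cleanup in the loop above (remove newlines and tabs, then split on '-') can be tried on its own. The sketch below wraps it in a hypothetical parse_title helper and adds a strip() on each part; the sample string is made up to illustrate the "part before '-' = song, part after = singer" convention the crawler uses, so adjust if the site's actual order differs:

```python
def parse_title(raw: str) -> dict:
    """Clean a raw title string the same way the crawler does:
    drop newlines/tabs, split on '-', then trim surrounding spaces."""
    cleaned = raw.replace("\n", "").replace("\t", "")
    parts = cleaned.split('-')
    return {"song": parts[0].strip(), "singer": parts[1].strip()}

# hypothetical raw text, as BeautifulSoup's get_text() might return it
raw = "\n\t晴天 - 周杰伦\n"
print(parse_title(raw))  # → {'song': '晴天', 'singer': '周杰伦'}
```

Note that split('-')[1] raises an IndexError if a title contains no '-' at all, so real pages may need a length check on parts first.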
Sample output:
Finally, a website recommendation: iToday, a one-stop IT information platform. It aggregates hundreds of IT websites and covers technical news, an IT community, job interviews, cutting-edge technology, and more. The team behind it is a group of developers who love building things and sharing programming knowledge; they select and organize trustworthy information to give you a valuable reading experience. Everyone is welcome to visit.