[python] Crawling Douyu live broadcast photos and saving them to a local directory [Source code attached + free book at the end of the article]

Yingjie Community: https://bbs.csdn.net/topics/617804998

1. Import the necessary modules:

    This blog post will introduce how to use Python to write a crawler program that obtains image information from the Douyu live broadcast website and saves it locally. We will use the requests module to send HTTP requests and receive responses, and the os module to handle file and directory operations.

        If a module import error occurs, install the missing module from the console (a domestic mirror source is recommended):

pip install requests -i https://mirrors.aliyun.com/pypi/simple

        Here are some commonly used domestic mirror sources:

        

Tsinghua University
https://pypi.tuna.tsinghua.edu.cn/simple

Alibaba Cloud
https://mirrors.aliyun.com/pypi/simple/

Douban
https://pypi.douban.com/simple/

Baidu Cloud
https://mirror.baidu.com/pypi/simple/

USTC (University of Science and Technology of China)
https://pypi.mirrors.ustc.edu.cn/simple/

Huawei Cloud
https://mirrors.huaweicloud.com/repository/pypi/simple/

Tencent Cloud
https://mirrors.cloud.tencent.com/pypi/simple/
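
If you do not want to pass -i on every install, pip can also be pointed at a mirror permanently. This is a minimal sketch, assuming pip 10 or newer (where the pip config command is available); the Tsinghua mirror is used only as an example:

# Make a mirror the default index for all future pip installs
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple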

    

2. Send a GET request to obtain response data:

        This function sends a GET request to the given URL. The request headers are set to simulate a browser request, and the function returns the response body parsed as JSON.

def get_html(url):
    header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    }
    response = requests.get(url=url, headers=header)
    # print(response.json())
    html = response.json()
    return html
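
If the network is slow or the server returns an error page, response.json() may raise an exception or the call may hang. A slightly more defensive variant is sketched below; the 10-second timeout and the call to raise_for_status() are my own additions, not part of the original code:

def get_html(url):
    header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    }
    # timeout keeps the request from hanging forever;
    # raise_for_status() turns HTTP error codes into exceptions
    response = requests.get(url=url, headers=header, timeout=10)
    response.raise_for_status()
    return response.json()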

        How to get request headers:

        Firefox browser:
  1. Open the target page and right-click on an empty area of the page.
  2. Select the "Inspect Element" option, or press Ctrl + Shift + C (Windows).
  3. In the developer tools window, switch to the Network tab.
  4. Refresh the page to capture all network requests.
  5. Select the request you are interested in from the list of requests.
  6. The request header information is shown in the "Request Headers" section on the right.

     Then copy the User-Agent value from the request headers into your code.
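
The original code only sets the User-Agent, which is enough for this interface. If a site rejects the request, you can copy additional fields from the same panel into the dictionary; the Referer and Accept values below are only illustrative examples, not taken from the original article:

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    # Illustrative extras; copy the real values from the Network panel if needed
    'referer': 'https://www.douyu.com/',
    'accept': 'application/json, text/plain, */*'
}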

3. Parse the image information in the response data

        This function parses the image information in the response data. Following the structure of the response, each entry's image URL and title are extracted into a dictionary, and the function returns the list of all dictionaries.

def parse_html(html):
    img_info_list = []
    # Each entry in html['data']['rl'] describes one live room
    for rl in html['data']['rl']:
        img_info = {
            'img_url': rl['rs1'],  # image URL
            'title': rl['nn']      # title
        }
        img_info_list.append(img_info)
    return img_info_list
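
Indexing the response with [] raises a KeyError if Douyu ever changes the structure. A defensive variant using .get() is sketched below; it is my own addition rather than part of the original code, and simply skips entries that are missing a field:

def parse_html(html):
    img_info_list = []
    # .get() with a default avoids a KeyError if the response structure changes
    for rl in html.get('data', {}).get('rl', []):
        img_url = rl.get('rs1')
        title = rl.get('nn')
        if img_url and title:
            img_info_list.append({'img_url': img_url, 'title': title})
    return img_info_list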

4. Save the image locally:

This function saves the images locally. It first creates the target directory "images" if it does not exist, then iterates over the image info list, downloads each image, and saves it to the directory with the title plus a ".jpg" suffix.

def save_to_images(img_info_list):
    directory = 'images'
    if not os.path.exists(directory):
        os.makedirs(directory)

    for img_info in img_info_list:
        image_url = img_info['img_url']
        title = img_info['title']
        response = requests.get(image_url)
        with open(os.path.join(directory, f'{title}.jpg'), 'wb') as f:
            f.write(response.content)
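
One caveat: the title comes straight from the response and may contain characters that are not allowed in Windows file names (such as \ / : * ? " < > |). A small helper like the one below, which is my own addition, can sanitize the name before saving:

import re

def safe_filename(title):
    # Replace characters that are invalid in Windows file names
    return re.sub(r'[\\/:*?"<>|]', '_', title)

# Inside save_to_images, use it like this:
# with open(os.path.join(directory, f'{safe_filename(title)}.jpg'), 'wb') as f: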

Source code:

If you are interested in internet monetization, you can follow: https://bbs.csdn.net/topics/617804998


# Import the required modules: requests and os
import requests
import os


# Define get_html(url), which sends a GET request to the given URL.
# The request headers are set to simulate a browser request.
# The function returns the response body parsed as JSON.
def get_html(url):
    header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    }
    response = requests.get(url=url, headers=header)
    # print(response.json())
    html = response.json()
    return html


# Define parse_html(html), which parses the image information in the response data.
# Following the structure of the response, it extracts each entry's image URL and title,
# stores them in a dictionary, and returns the list of all dictionaries.
def parse_html(html):
    rl_list = html['data']['rl']
    # print(rl_list)
    img_info_list = []
    for rl in rl_list:
        img_info = {}
        img_info['img_url'] = rl['rs1']
        img_info['title'] = rl['nn']
        # print(img_url)
        # exit()
        img_info_list.append(img_info)
    # print(img_info_list)
    return img_info_list


# Define save_to_images(img_info_list), which saves the images locally.
# It first creates the directory "directory" if it does not exist, then iterates over
# the image info list, downloads each image, and saves it as the title plus a ".jpg" suffix.
def save_to_images(img_info_list):
    dir_path = 'directory'
    if not os.path.exists(dir_path):
        os.makedirs(dir_path)
    for img_info in img_info_list:
        img_path = os.path.join(dir_path, img_info['title'] + '.jpg')
        res = requests.get(img_info['img_url'])
        res_img = res.content
        with open(img_path, 'wb') as f:
            f.write(res_img)
        # exit()

# In the main program, set the URL to crawl and call the functions defined above to fetch, parse, and save.
if __name__ == '__main__':
    url = 'https://www.douyu.com/gapi/rknc/directory/yzRec/1'
    html = get_html(url)
    img_info_list = parse_html(html)
    save_to_images(img_info_list)
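
The trailing /1 in the URL looks like a page index. If you want more than one page of results, you could replace the main program with a loop like the sketch below; this is only an assumption about the interface, so check in the browser's Network panel that later pages actually return data:

# Sketch: crawl several pages, assuming the trailing number is a page index
for page in range(1, 4):
    url = f'https://www.douyu.com/gapi/rknc/directory/yzRec/{page}'
    html = get_html(url)
    img_info_list = parse_html(html)
    save_to_images(img_info_list)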

Result screenshot:

        

[Book giveaway at the end of the article]

        If you are interested in getting books for free: https://bbs.csdn.net/topics/617804998

        

Brief introduction

        "Python Web Crawler from Beginner to Mastery" starts from the perspective of beginners, through easy-to-understand language and colorful examples, and introduces in detail the technologies that should be mastered to implement web crawler development using Python. The book is divided into 19 chapters, including getting to know web crawlers, understanding the web front-end, request module urllib, request module urllib3, request module requests, advanced network request module, regular expressions, XPath parsing, BeautifulSoup for parsing data, and crawling dynamic rendering Information, multi-threaded and multi-process crawlers, data processing, data storage, data visualization, App packet capture tool, identification verification code, Scrapy crawler framework, Scrapy_Redis distributed crawler, data detective. All the knowledge in the book is introduced with specific examples, and the involved program codes are given detailed annotations. Readers can easily understand the essence of web crawler program development and quickly improve development skills.

About the Author

        Tomorrow Technology, the full name of which is Jilin Tomorrow Technology Co., Ltd., is a high-tech company specializing in software development, education and training, and the integration of software development teaching resources. The textbooks it produces focus on selecting the content that is essential and commonly used in software development, and pay close attention to making that content easy to learn, convenient to use, and easy to extend, which has made them popular with readers. Its textbooks have repeatedly won awards such as "Excellent Industry Best-Seller" and "Excellent Best-Seller from a National University Press", and many titles have long ranked at the top of the sales charts for similar books.

        Purchase link: https://item.jd.com/13291912.html

How to participate

1️⃣ How to participate: follow, like, favorite, and comment (life is short, I use Python)
2️⃣ How winners are chosen: a program randomly selects 3 people, and each will receive one copy of the book
3️⃣ Activity period: until 2023-12-31 22:00:00

Note: The winners will be announced on my homepage as scheduled after the event, and the books will be delivered to your door.


Origin: blog.csdn.net/m0_73367097/article/details/135250658