Using ChatGPT to write a Python script for batch-downloading teaching-aid audio files

0. The requirement

        Over the weekend I bought a teaching-aid book for my child. The book comes with companion audio files that can be played in WeChat by scanning a QR code, but the child has no phone at home, so what to do? The boss of the house asked me to download the audio and put it on the Bluetooth speaker, so that is what I set out to do.

1. Page analysis

The audio page is a player at:
https://www.**************.cn/erweima/player.html?bqcg_id=2375, where bqcg_id identifies the chapter and ranges from 2375 to 2399. Each chapter has two audio files.
By inspecting the page source, I found the address of the mp3 files:
https://h********de1.********.aliyuncs.com/wapaudio/61010/1-1.mp3
Judging from the URL, the audio is hosted on Alibaba Cloud object storage. I assumed I could just download from this address directly, but opening it in a browser displays:

Clearly, Alibaba Cloud does not want you to download the files directly.
So can we have Python simulate a browser and download the files locally?
The answer is of course yes: this is essentially a web crawler!
I know crawlers reasonably well; just a few days ago I read up on the relationship between crawlers and criminal law:

Still, it is worth trying to see whether the technique works, and in particular how someone who only knows roughly what a crawler is, with average coding ability, can use ChatGPT to write code for the kind of small problem that comes up in daily life.

2. Code implementation

This time I used a domestic ChatGPT wrapper ("shell" site), running version 3.5.

The prompt:


Hello, you are a Python programmer. Help me write a script for batch-downloading files from a website. The page address is: https://www.********.cn/erweima/player.html?bqcg_id={ID}, where {ID} ranges from 2375 to 2399. Each page contains two mp3 files, at:
https://hld******e1.o**************ai.aliyuncs.com/wapaudio/61010/{number}-1.mp3
https://hld******e1.o**************ai.aliyuncs.com/wapaudio/61010/{number}-2.mp3
where {number} ranges from 1 to 25; that is, when {ID} is 2375, {number} is 1; when {ID} is 2376, {number} is 2; and so on.
Now please write a script that downloads all the files into a folder, with a 10-second interval between downloads.

Soon, ChatGPT gave the answer:

OK, I understand your requirements. Please try the following code:



import requests
import time

# Request headers, to avoid being blocked by the firewall
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

# Folder where downloaded files are saved
FILE_PATH = "downloaded_files"


def download_mp3_file(url, file_name):
    """
    Download an MP3 file
    :param url: MP3 file URL
    :param file_name: local file name
    :return:
    """
    # Send a GET request and save the file
    response = requests.get(url, headers=HEADERS)
    if response.status_code == 200:
        with open(file_name, 'wb') as f:
            f.write(response.content)
        print(f"Downloaded file: {file_name}")
    else:
        print(f"Failed to download file: {file_name}")


if __name__ == '__main__':
    for id in range(2375, 2400):
        for num in range(1, 26):
            # Build the file URLs
            mp3_url_1 = f"https://hld******e1.o**************ai.aliyuncs.com/wapaudio/61010/{num}-1.mp3"
            mp3_url_2 = f"https://hld******e1.o**************ai.aliyuncs.com/wapaudio/61010/{num}-2.mp3"

            # Build the file names
            file_name_1 = f"{FILE_PATH}/{id}_{num}_1.mp3"
            file_name_2 = f"{FILE_PATH}/{id}_{num}_2.mp3"

            # Download both MP3 files, then sleep for 10 seconds
            download_mp3_file(mp3_url_1, file_name_1)
            download_mp3_file(mp3_url_2, file_name_2)
            time.sleep(10)

In the script, I use the requests library to send GET requests and download the files, and the time library to sleep for 10 seconds between downloads so that frequent requests are not blocked by the firewall.

In addition, I used f-strings to build the file URLs and file names. You can copy the code locally, save it as mp3_downloader.py, and run it with the Python interpreter. While it runs, make sure there is a folder named downloaded_files in the current directory, because all downloaded files are saved there.

Obviously it didn't work; the download failed:

I reminded ChatGPT:

The download failed. When visiting the link https://hld******e1.o**************ai.aliyuncs.com/wapaudio/61010/1-1.mp3, the browser returns:

This XML file does not appear to have any style information associated with it. The document tree is shown below.

<Error>
  <Code>AccessDenied</Code>
  <Message>You are denied by bucket referer policy.</Message>
  <RequestId>646AC95511D2913233E98BCC</RequestId>
  <HostId>hld******e1.o************ai.aliyuncs.com</HostId>
  <BucketName>hld******e1</BucketName>
  <EC>0003-00000501</EC>
</Error>

ChatGPT:

This error message appears because Alibaba Cloud's anti-crawling mechanism intercepted your request: the server returns it when it detects that your request headers do not comply with its rules.

To avoid being intercepted, we need to add a `Referer` field to the request headers. Specifically, each request should carry the URL of the page that embeds the audio as its `Referer`. You can refer to the following code:


import requests
import time

# Request headers, to avoid being blocked by the firewall
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 11; SM-A105F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.66 Mobile Safari/537.36',
    'Referer': 'https://www.********.cn/erweima/player.html?bqcg_id=2375'
}

# Folder where downloaded files are saved
FILE_PATH = "downloaded_files"


def download_mp3_file(url, file_name):
    """
    Download an MP3 file
    :param url: MP3 file URL
    :param file_name: local file name
    :return:
    """
    # Send a GET request and save the file
    response = requests.get(url, headers=HEADERS)
    if response.status_code == 200:
        with open(file_name, 'wb') as f:
            f.write(response.content)
        print(f"Downloaded file: {file_name}")
    else:
        print(f"Failed to download file: {file_name}")


if __name__ == '__main__':
    for id in range(2375, 2400):
        # Build the Referer for this chapter's player page
        HEADERS['Referer'] = f"https://www.********.cn/erweima/player.html?bqcg_id={id}"

        for num in range(1, 26):
            # Build the file URLs
            mp3_url_1 = f"https://hld******e1.o**************ai.aliyuncs.com/wapaudio/61010/{num}-1.mp3"
            mp3_url_2 = f"https://hld******e1.o**************ai.aliyuncs.com/wapaudio/61010/{num}-2.mp3"

            # Build the file names
            file_name_1 = f"{FILE_PATH}/{id}_{num}_1.mp3"
            file_name_2 = f"{FILE_PATH}/{id}_{num}_2.mp3"

            # Download both MP3 files, then sleep for 10 seconds
            download_mp3_file(mp3_url_1, file_name_1)
            download_mp3_file(mp3_url_2, file_name_2)
            time.sleep(10)

Please change the `Referer` value in the code to the link of the corresponding webpage and execute the script again.

On execution an error occurred. The error is actually trivial, but I asked GPT anyway:

The system reports an error; please fix it:

File D:\untitled6.py:51 in <module>
    download_mp3_file(mp3_url_1, file_name_1)
File D:\untitled6.py:30 in download_mp3_file
    with open(file_name, 'wb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'downloaded_files/2375_1_1.mp3'

GPT answers:

This error means that the program cannot find the target folder while saving the file. You need to create a folder named "downloaded_files" manually and make sure its path matches the one used in the code. If you have already created it, check that the path is correct.

Finally, on Windows systems, slashes (/) in folder paths need to be replaced with backslashes (\). Therefore, you can modify it as follows:

FILE_PATH = "downloaded_files\\"

or

FILE_PATH = r"downloaded_files/"

Please modify the folder path in the code and execute the script again.
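For what it's worth, Python's open() accepts forward slashes on Windows as well, so the path separator is a red herring here; the actual cause is simply that the folder does not exist. A small sketch of a more robust fix, creating the folder from the script itself (folder name taken from the article's code):

```python
import os

FILE_PATH = "downloaded_files"  # download directory used in the article's script

# Creating the folder up front avoids the FileNotFoundError entirely;
# exist_ok=True makes this a no-op when the folder already exists.
os.makedirs(FILE_PATH, exist_ok=True)

# os.path.join picks the correct separator for the current platform,
# so the same script works on Windows, Linux, and macOS.
file_name = os.path.join(FILE_PATH, "2375_1_1.mp3")
```

With this in place, neither manual folder creation nor the backslash edit is needed.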

Modification completed:

As described above, I got a working program without typing a single line of code, and luckily it succeeded on the first full run.
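One caveat, though: the prompt said {ID} and {number} correspond one-to-one (2375 maps to 1, 2376 to 2, and so on), but ChatGPT's nested loops re-download the same 25 audio pairs once per chapter ID, producing 1250 files instead of 50. A sketch of the pairing done with enumerate; the bucket and player hosts below are placeholders, since the real addresses are masked throughout this article:

```python
import os

# Placeholder hosts; the real bucket and player URLs are masked in this article.
AUDIO_BASE = "https://example-bucket.aliyuncs.com/wapaudio/61010"
PLAYER_URL = "https://example.cn/erweima/player.html?bqcg_id={}"
FILE_PATH = "downloaded_files"

def build_jobs():
    """Pair each chapter ID (2375-2399) with its file number (1-25) one-to-one."""
    jobs = []
    for num, chapter_id in enumerate(range(2375, 2400), start=1):
        referer = PLAYER_URL.format(chapter_id)  # per-chapter Referer header value
        for part in (1, 2):  # two audio files per chapter
            url = f"{AUDIO_BASE}/{num}-{part}.mp3"
            file_name = os.path.join(FILE_PATH, f"{chapter_id}_{num}_{part}.mp3")
            jobs.append((url, referer, file_name))
    return jobs

jobs = build_jobs()
print(len(jobs))  # 50 download jobs rather than 1250
```

The download function itself (requests.get with the HEADERS dict, followed by time.sleep(10)) can stay exactly as in ChatGPT's version; only the loop that builds URLs, Referers, and file names changes.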

3. Conclusion

Having finished this not-so-difficult exercise, two reflections:

First, the code above is not difficult. Referer-based source verification is a common anti-crawling measure, and any crawler textbook that introduces the `HEADERS` request headers will mention it, so technically there is nothing special here. For a novice, though, there is still a threshold: exploring it alone could take a few hours. The arrival of AI really can improve efficiency.

Second, I believe applications like GPT will change how we live and think. When I was in college, the mobile internet was not yet widespread; in the dormitory at night we often argued over some question and finally agreed to check the facts in the library at dawn. Later, when a question came up, people would debate for ten minutes before pulling out their phones and searching the internet for evidence supporting their own views. In the future, when we hit a problem, will we turn to AI first? For example, when someone needs to hand in an article, we used to say: "Go search Baidu for an article and rework it" (a reminder: that is wrong, and I didn't do it, I wrote my own). In the future, will everyone say instead: "Go ask XXX to write the article for you; ask XXX to write a script and download XXX resources"?

And I feel that day is not far away.

A serious reminder: this article only discusses the technical implementation and GPT's capabilities. All website addresses have been masked, and I take no responsibility for any technical risks arising from the techniques discussed here!


Origin blog.csdn.net/m0_37771865/article/details/130803103