How to batch download papers from Sci-Hub with Python

This blog post shows how to quickly download the papers corresponding to a list of DOI numbers with Python, using Sci-Hub as the download source.

1. Preparing the libraries

Before we start, we need the following libraries:

  1. requests: sends HTTP requests and receives responses;
  2. beautifulsoup4: parses HTML pages;
  3. threading: provides multi-threaded processing (part of the Python standard library, so it needs no installation);

The first two can be installed with pip; the specific commands are as follows:

pip install requests
pip install beautifulsoup4

In addition, you need to create a folder named "papers" in the directory where the code is located to hold the downloaded papers, and prepare a txt file containing the DOI numbers, one DOI per line.
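For illustration, a doi.txt file might look like this (the entries below are placeholder DOIs, not ones from the original post):

10.1000/182
10.1000/xyz123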

2. Implementation steps

The entire download process can be roughly divided into the following steps:

  1. Read the txt file that stores the DOI numbers;
  2. Construct the Sci-Hub link and send an HTTP request;
  3. Parse the HTML page to obtain the paper's download link;
  4. Download the paper and save it to a local folder;
  5. Log the success or failure of the download.

3. Implementing the algorithm

The code reads each DOI number from the txt file, splices it into a Sci-Hub link, and then parses the response page to obtain the paper's download link and download the file.

First, define the request headers needed for the HTTP request. Then define a download_paper() function that downloads a paper and saves it locally; its doi parameter is the DOI number of the paper to be downloaded. Inside download_paper(), we first construct a Sci-Hub link from the DOI number and send an HTTP request; then, by parsing the HTML page, we obtain the paper's download link, download the file with the requests library, and print the success or failure information to the console or record it in a log file. Finally, we open the txt file that stores the DOI numbers, traverse it line by line, and call download_paper() for each entry.
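Here is a minimal single-threaded sketch of that flow (assuming the papers folder already exists, doi.txt sits next to the script, and the sci-hub.ren domain used in the full script below is still reachable; error logging is omitted here):

import requests
from bs4 import BeautifulSoup

# Request headers for the HTTP request
head = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36"
}

def download_paper(doi):
    # Construct the Sci-Hub link from the DOI and fetch the page
    r = requests.get("https://www.sci-hub.ren/" + doi, headers=head)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")

    # The PDF address sits in an <iframe> (or sometimes an <embed>) tag
    if soup.iframe is None:
        download_url = "https:" + soup.embed.attrs["src"]
    else:
        download_url = soup.iframe.attrs["src"]

    # Download the PDF, using the DOI (with "/" replaced) as the file name
    pdf = requests.get(download_url, headers=head)
    pdf.raise_for_status()
    with open("papers/" + doi.replace("/", "_") + ".pdf", "wb") as f:
        f.write(pdf.content)

with open("doi.txt", "r", encoding="utf-8") as f:
    for line in f:
        download_paper(line.strip())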

Note that Sci-Hub changes its domain name frequently, so in practice you need to visit Sci-Hub through a browser, find a currently available domain, and substitute it for the domain used in the code above.

4. Accelerating the download

Although the code above can already complete the download task, single-threaded downloading is slow, so we can use multiple threads to speed up the process. Specifically, we pass each paper's DOI number as an argument to the download_paper() function and create multiple threads to download papers in parallel. The following is the multi-threaded implementation:

import requests
from bs4 import BeautifulSoup
import os
import threading

# Create the papers folder to hold the downloaded files
path = "C:/Users/ypzhao/Desktop/papers/"
if not os.path.exists(path):
    os.mkdir(path)

# Request headers
head = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36"
}

# Function that downloads one paper
def download_paper(doi):
    # Splice the Sci-Hub link from the DOI
    url = "https://www.sci-hub.ren/" + doi + "#"

    try:
        download_url = ""

        # Send the HTTP request and parse the HTML page
        r = requests.get(url, headers=head)
        r.raise_for_status()
        soup = BeautifulSoup(r.text, "html.parser")

        # Extract the paper's download link from the <iframe> or <embed> tag
        if soup.iframe is None:
            download_url = "https:" + soup.embed.attrs["src"]
        else:
            download_url = soup.iframe.attrs["src"]

        # Download the paper and save it to a file
        print(doi + "\tdownloading...\nDownload link:\t" + download_url)
        download_r = requests.get(download_url, headers=head)
        download_r.raise_for_status()
        with open(path + doi.replace("/", "_") + ".pdf", "wb") as temp:
            temp.write(download_r.content)

        print(doi + "\tdownloaded successfully.\n")

    # Record error information when the download fails
    except Exception as e:
        with open("error.log", "a+") as error:
            error.write(doi + "\tdownload failed!\n")
            if download_url.startswith("https://"):
                error.write("Download URL: " + download_url + "\n")
            error.write(str(e) + "\n\n")

# Open the txt file containing the DOI numbers
with open(path + "doi.txt", "r", encoding="utf-8") as f:
    # Read the DOI numbers and create one download thread per DOI
    threads = []
    for line in f:
        doi = line.strip()
        t = threading.Thread(target=download_paper, args=(doi,))
        threads.append(t)

    # Start all threads
    for t in threads:
        t.start()

    # Wait for all threads to finish
    for t in threads:
        t.join()
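The version above starts one thread per DOI, which can spawn a very large number of threads for a long list. As an alternative sketch (not part of the original script), the standard library's concurrent.futures module can cap the number of concurrent downloads:

from concurrent.futures import ThreadPoolExecutor

with open(path + "doi.txt", "r", encoding="utf-8") as f:
    dois = [line.strip() for line in f if line.strip()]

# Reuse download_paper() from above, limited to 8 worker threads
# (8 is an arbitrary choice; tune it to your connection)
with ThreadPoolExecutor(max_workers=8) as pool:
    pool.map(download_paper, dois)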

5. Running results

When the script runs, the console prints each DOI's download link and status, and the downloaded PDFs appear in the papers folder.


Origin: blog.csdn.net/m0_58857684/article/details/131036359