[Road to AI] Using huggingface_hub to reliably download large models from Hugging Face


Foreword

Hugging Face hosts excellent resources, but download speeds from mainland China are very slow. For models several GB in size, downloads frequently time out and fail, which is discouraging when you just want to experiment with AI. (Some people suggest using Xunlei; try it and you will see why it doesn't help.)

After many tests, I finally got downloads working reliably: even when a request times out, the download resumes where it left off. Truly worry-free downloading! How exactly? Read on.


1. What is Hugging Face?

Hugging Face started as a chatbot company headquartered in New York. The chatbot business never took off, but the Transformers library they open-sourced on GitHub quickly became popular in the machine learning community. Today, more than 100,000 pre-trained models and 10,000 datasets are shared on the platform. In effect, Hugging Face has become the GitHub for AI developers, providing models, datasets (text/image/audio/video), libraries (such as transformers, peft, accelerate), tutorials, and more.

Official website URL: https://huggingface.co/

2. Preparation

Install the huggingface_hub package with pip:

C:\Users\Administrator>pip install huggingface_hub
Requirement already satisfied: huggingface_hub in d:\programdata\anaconda3\lib\site-packages (0.13.2)
Requirement already satisfied: pyyaml>=5.1 in d:\programdata\anaconda3\lib\site-packages (from huggingface_hub) (6.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in d:\programdata\anaconda3\lib\site-packages (from huggingface_hub) (4.4.0)
Requirement already satisfied: packaging>=20.9 in d:\programdata\anaconda3\lib\site-packages (from huggingface_hub) (22.0)
Requirement already satisfied: requests in d:\programdata\anaconda3\lib\site-packages (from huggingface_hub) (2.28.2)
Requirement already satisfied: tqdm>=4.42.1 in d:\programdata\anaconda3\lib\site-packages (from huggingface_hub) (4.64.1)
Requirement already satisfied: filelock in d:\programdata\anaconda3\lib\site-packages (from huggingface_hub) (3.12.0)
Requirement already satisfied: colorama in d:\programdata\anaconda3\lib\site-packages (from tqdm>=4.42.1->huggingface_hub) (0.4.6)
Requirement already satisfied: charset-normalizer<4,>=2 in d:\programdata\anaconda3\lib\site-packages (from requests->huggingface_hub) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in d:\programdata\anaconda3\lib\site-packages (from requests->huggingface_hub) (3.4)
Requirement already satisfied: certifi>=2017.4.17 in d:\programdata\anaconda3\lib\site-packages (from requests->huggingface_hub) (2022.12.7)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in d:\programdata\anaconda3\lib\site-packages (from requests->huggingface_hub) (1.26.14)
C:\Users\Administrator>

3. Download an entire repository or a single large model file

Find the repository you want to download and note its repo_id (the "owner/name" shown at the top of the repo page); the "Files and versions" tab lists the files in the repository.

1. Download the entire repository

Use snapshot_download to download a snapshot of the entire repository. Pay attention to the following parameters:

  • allow_patterns selects the file types to download; ignore_patterns sets the file types to skip.
  • resume_download=True enables resuming an interrupted download, which is essential here.
  • etag_timeout=100 raises the timeout threshold from its default of 10 seconds; adjust it to your situation.
    More parameter details: https://huggingface.co/docs/huggingface_hub/v0.16.3/guides/download
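allow_patterns and ignore_patterns are shell-style glob patterns. As a rough illustration of how such patterns select filenames (using Python's standard fnmatch module; the filename list below is invented for the example, and this only approximates the library's internal filtering):

```python
from fnmatch import fnmatch

allow = ["*.model", "*.json", "*.bin", "*.py", "*.md", "*.txt"]
ignore = ["*.safetensors", "*.msgpack", "*.h5", "*.ot"]

files = ["config.json", "pytorch_model.bin", "model.safetensors", "tokenizer.model"]

# A file is kept if it matches some allow pattern and no ignore pattern
kept = [f for f in files
        if any(fnmatch(f, p) for p in allow)
        and not any(fnmatch(f, p) for p in ignore)]
print(kept)  # -> ['config.json', 'pytorch_model.bin', 'tokenizer.model']
```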
import time
from huggingface_hub import snapshot_download

repo_id = "LinkSoul/Chinese-Llama-2-7b"
local_dir = 'd:/ai/models1'
cache_dir = local_dir + "/cache"

while True:
    try:
        snapshot_download(cache_dir=cache_dir,
                          local_dir=local_dir,
                          repo_id=repo_id,
                          local_dir_use_symlinks=False,
                          resume_download=True,
                          allow_patterns=["*.model", "*.json", "*.bin",
                                          "*.py", "*.md", "*.txt"],
                          ignore_patterns=["*.safetensors", "*.msgpack",
                                           "*.h5", "*.ot"],
                          )
    except Exception as e:
        print(e)
        time.sleep(5)  # wait a moment before retrying
    else:
        print('Download complete')
        break
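The while/try pattern above can be factored into a small reusable helper. A minimal sketch (the helper name and signature are mine, not part of huggingface_hub):

```python
import time

def retry_until_done(fn, wait_seconds=5, max_attempts=None):
    """Keep calling fn() until it succeeds and return its result.
    Re-raise the last error once max_attempts is exhausted
    (max_attempts=None means retry forever)."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return fn()
        except Exception as e:
            print(f"attempt {attempt} failed: {e}")
            if max_attempts is not None and attempt >= max_attempts:
                raise
            time.sleep(wait_seconds)
```

With this in place, the download above becomes something like `retry_until_done(lambda: snapshot_download(repo_id=repo_id, local_dir=local_dir, resume_download=True))`.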


2. Download a single large model file

In some cases we only need the large model file itself rather than the entire repository; hf_hub_download handles that. The parameters are as explained above.

import time
from huggingface_hub import hf_hub_download

repo_id = "BlinkDL/rwkv-4-pile-7b"  # repository ID
local_dir = 'd:/ai/models2'
cache_dir = local_dir + "/cache"
filename = "RWKV-4-Pile-7B-Chn-testNovel-done-ctx2048-20230404.pth"

while True:
    try:
        hf_hub_download(cache_dir=cache_dir,
                        local_dir=local_dir,
                        repo_id=repo_id,
                        filename=filename,
                        local_dir_use_symlinks=False,
                        resume_download=True,
                        etag_timeout=100,
                        )
    except Exception as e:
        print(e)
        time.sleep(5)  # wait a moment before retrying
    else:
        print('Download complete')
        break
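For reference, the file that hf_hub_download fetches lives at a predictable "resolve" URL on the Hub, which can be handy for checking a file exists in a browser first. This sketch just builds that URL as a plain string (following the Hub's URL layout; no network access, and the function name is mine for illustration):

```python
def hub_resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    # Hugging Face Hub serves raw files under /{repo_id}/resolve/{revision}/{filename}
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"

print(hub_resolve_url("BlinkDL/rwkv-4-pile-7b",
                      "RWKV-4-Pile-7B-Chn-testNovel-done-ctx2048-20230404.pth"))
```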



Summary

After several rounds of testing, model downloads finally work reliably, and large files are no longer a worry.
I hit countless pitfalls along the way; this article exists so that readers can avoid the same detours.
If you found it helpful, please give it a thumbs up. Thank you!

Appendix

Hugging Face Chinese community blog
https://huggingface.co/blog/zh

Origin blog.csdn.net/popboy29/article/details/131979434