[NLP] Hugging Face model/data file download methods

Problem Description

As a natural language processing engineer, I use Hugging Face's open-source transformers package in my daily work. Each new model has to be downloaded before use. If the training server has network access, you can download the model directly by calling the from_pretrained method. Convenient as that is, in my experience it has two problems:

  • If the network is poor, downloading takes a very long time; it is common for even a small model to take several hours.
  • If you switch to a different training server, you have to download everything again.

One Thunder (Xunlei) download scheme

In actual tests, Thunder (Xunlei) downloads much faster than the command line, and it is more convenient for resuming interrupted downloads when a repository contains many files. Highly recommended.

First run the following code to get the download URLs of all files in the repository:

from huggingface_hub import hf_hub_url
from huggingface_hub.utils import filter_repo_objects
from huggingface_hub.hf_api import HfApi

repo_id = "decapoda-research/llama-7b-hf"
repo_type = "model"  # use "dataset" for a dataset repository

# Connection errors happen occasionally; retry a few times if needed
repo_info = HfApi().repo_info(repo_id=repo_id, repo_type=repo_type)
files = list(filter_repo_objects(items=[f.rfilename for f in repo_info.siblings]))
urls = [hf_hub_url(repo_id, filename=file, repo_type=repo_type) for file in files]
print("\n".join(urls))

Then copy all the printed URLs into Xunlei and download them in a batch.
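To build a batch-import file for Xunlei, the printed URLs can be saved to a text file. The sketch below reproduces the `resolve/main` URL pattern that hf_hub_url emits for model repos, using a hypothetical file list so it runs without network access:

```python
# Sketch: write a URL list to urls.txt for batch import into Xunlei.
# The file names below are placeholders, not a real repo listing; the
# URL pattern mirrors what hf_hub_url produces for the main branch.
repo_id = "decapoda-research/llama-7b-hf"
files = ["config.json", "tokenizer.model"]  # hypothetical file names

urls = [f"https://huggingface.co/{repo_id}/resolve/main/{f}" for f in files]
with open("urls.txt", "w") as fh:
    fh.write("\n".join(urls))
```

Xunlei's "new batch task" dialog accepts one URL per line, which is exactly what this file contains.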

Two Git LFS model download scheme (elegant, but not flexible enough)

Preparation

mac: brew install git-lfs

The Git LFS solution is much more concise than scripting the download yourself. On top of git, you need to install git-lfs. Taking Windows as an example, after installation run the following command once to initialize it:

git lfs install


This solution also has a problem: it downloads every file in the repository, which greatly prolongs the download time. Looking at the directory, it contains model files for three different frameworks: flax_model.msgpack, tf_model.h5 and pytorch_model.bin; the bert-base-uncased repository additionally contains a Rust version, rust_model.ot. If you only want the model file for one framework, this solution cannot do that.

Three Hugging Face Hub model download scheme (elegant, highly recommended)

from huggingface_hub import snapshot_download
snapshot_download(repo_id="bert-base-chinese")

What if you only want to download some of the files? The snapshot_download method provides two parameters, allow_regex and ignore_regex (renamed allow_patterns and ignore_patterns in newer versions of huggingface_hub). Simply put, the former downloads only the matching files, while the latter ignores the matching files and downloads the rest. We only need one of them. Taking ignore_regex as an example, the following code downloads only the PyTorch version of the model.

snapshot_download(repo_id="bert-base-chinese", ignore_regex=["*.h5", "*.ot", "*.msgpack"])

You can see that this download contains fewer items than the full snapshot before. Opening the local directory confirms that the TensorFlow and Flax model files are gone.
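These patterns use shell-style globbing, so you can preview which files an ignore list would skip with Python's fnmatch (same glob semantics), entirely offline. The file names below are the ones from the bert-base-uncased example:

```python
from fnmatch import fnmatch

# A file is skipped if it matches any pattern in the ignore list.
files = ["pytorch_model.bin", "tf_model.h5", "flax_model.msgpack",
         "rust_model.ot", "config.json", "vocab.txt"]
ignore = ["*.h5", "*.ot", "*.msgpack"]

kept = [f for f in files if not any(fnmatch(f, p) for p in ignore)]
print(kept)  # ['pytorch_model.bin', 'config.json', 'vocab.txt']
```

Only the PyTorch weights and the framework-independent files survive the filter, matching what snapshot_download leaves on disk.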

How to elegantly download the huggingface-transformers model - Zhihu (zhihu.com)


Origin: blog.csdn.net/zwqjoy/article/details/131902493