Concept
The Hugging Face Hub is similar to GitHub: both are hubs (communities). Hugging Face can be thought of as the GitHub of machine learning. It offers users the following main features:
Model repositories: just as a Git repository lets you version and open-source code, a model repository lets you version and share models. Usage is similar to GitHub.
Models: Hugging Face hosts many pre-trained models for a wide range of machine learning tasks. These models are stored in model repositories.
Datasets: Hugging Face also hosts many public datasets.
Hugging Face is best known in the NLP field, and most of the models it provides are Transformer-based. For ease of use, Hugging Face also provides the following libraries:
Transformers: provides thousands of pre-trained models for tasks in text, audio, and computer vision. This library is the core of Hugging Face; learning Hugging Face largely means learning how to use it.
Datasets: a lightweight dataset framework with two main functions: ① download and preprocess common public datasets with one line of code; ② fast, easy-to-use data preprocessing.
Accelerate: helps PyTorch users easily run multi-GPU/TPU/fp16 training.
Spaces: hosts many interesting deep learning demo applications that you can try out.
Transformers
Transformers is Hugging Face's core library. You can use it to do the following things:
- Directly use pre-trained models for inference
- A large number of pre-trained models are available for use
- Transfer learning using pre-trained models
Install
pip install transformers
# or install the latest development version from source:
pip install git+https://github.com/huggingface/transformers
Usage
from transformers import pipeline
translator = pipeline("translation_en_to_fr")
print(translator("How old are you?"))
For some tasks the pipeline has no official default model, but you can search for a suitable model on the website and specify it explicitly. When loading a model you may get an error because a dependency is missing; in that case, just install the corresponding library and restart.
!pip install sentencepiece
translator = pipeline("translation_en_to_zh", model='Helsinki-NLP/opus-mt-en-zh')
translator("I'm learning deep learning.")
diffusers
diffusers is a general framework for diffusion models; it supports both using models directly and training them:
- With just a few lines of code, you can generate images with a diffusion model.
- Different noise schedulers can be swapped in to trade off generation speed against quality.
- Many model architectures are available as building blocks for end-to-end diffusion models.
Pipelines: High-level classes for quickly generating samples based on popular diffusion models in a user-friendly way
Models: Popular architectures for training new diffusion models, such as UNet
Schedulers: various techniques for stepping from noise toward an image at inference time, or for adding noise to an image during training.
pipeline
Using the Hugging Face model
The Transformers library provides several simple APIs, collectively called AutoClasses (see the official documentation), that help users work with Hugging Face models, including:
- AutoTokenizer: used for tokenization
- AutoFeatureExtractor: used for feature extraction
- AutoProcessor: used for data processing
- AutoModel: used to load models
They are all used the same way: AutoClass.from_pretrained("model name"), after which the object is ready to use. For example:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer("I'm learning deep learning.")
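To make the result of the call above concrete, here is the same snippet with its output inspected (it downloads the small bert-base-uncased tokenizer files on first use):

```python
# The tokenizer returns a dict with input_ids, token_type_ids and
# attention_mask; decode() maps ids back to text.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("I'm learning deep learning.")
print(list(encoded.keys()))
print(tokenizer.decode(encoded["input_ids"]))  # special tokens included
```

Note that the decoded text contains the [CLS] and [SEP] special tokens BERT expects around each sequence.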
A given model usually provides only some of the four components above. For example, bert-base-uncased provides a tokenizer and a model, which you can check with the "Use in Transformers" button on its model page.
Datasets
The Datasets library makes it very convenient to access and share datasets, and can also be used for evaluating NLP, CV, speech, and other tasks (evaluation metrics).
pip install datasets
# for audio datasets
pip install datasets[audio]
# for image datasets
pip install datasets[vision]
Find a dataset
A Hugging Face dataset usually contains several subsets, each split into train, validation, and test. You can inspect the subset you need in the preview area on the dataset page.
from datasets import load_dataset
# "glue" has multiple subsets, so a configuration name is required
dataset = load_dataset("glue", "sst2")
Hugging Face datasets are hosted on the Hub, which can be hard to reach from mainland China, so you will likely need to download a dataset once and then load it from local disk with load_dataset/load_from_disk. For how to download Hugging Face datasets for offline use, refer to this article.
Download
import datasets
dataset = datasets.load_dataset("glue", "sst2")
dataset.save_to_disk('your_path')
Load offline
from datasets import load_from_disk
dataset = load_from_disk("your_path")