Getting started notes on Hugging Face

Concept

The Hugging Face Hub is similar to GitHub: both are hubs (communities). Hugging Face can be described as the GitHub of machine learning. It provides users with the following main features:

Model repositories: just as a Git repository lets you version and open-source code, a model repository lets you version and open-source models. Usage is similar to GitHub (a download sketch follows this list).
Models: Hugging Face provides many pre-trained machine learning models for different tasks, ready for anyone to use. These models are stored in model repositories.
Datasets: Hugging Face hosts many public datasets.
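
For programmatic access to the Hub there is the huggingface_hub library. Here is a minimal download sketch, assuming huggingface_hub is installed; the model name is only an example:

from huggingface_hub import snapshot_download

# download a full model repository into the local cache and return its path
local_path = snapshot_download("bert-base-uncased")
print(local_path)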

Hugging Face is best known in the NLP field, and most of the models it provides are based on the Transformer architecture. For ease of use, Hugging Face also provides the following projects:

Transformers: provides thousands of pre-trained models for tasks in text, audio, and computer vision (CV). This project is the core of Hugging Face; learning Hugging Face largely means learning how to use this project.
Datasets: a lightweight dataset framework with two main functions: ① download and preprocess commonly used public datasets with one line of code; ② fast, easy-to-use data preprocessing.
Accelerate: helps PyTorch users easily run training on multiple GPUs/TPUs and with fp16 (see the sketch after this list).
Spaces: hosts many interesting deep learning applications that you can try out.
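
To illustrate Accelerate, here is a minimal, self-contained training-loop sketch; the toy model and data are made up purely for illustration:

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # detects available devices and mixed-precision settings

# toy model and data, just for illustration
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
dataloader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=8)

# wrap everything so it runs on whatever hardware is available
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()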

Transformers

Transformers is the core project of Hugging Face. With it you can:

  • Run inference directly with pre-trained models
  • Choose from a large number of ready-to-use pre-trained models
  • Do transfer learning on top of pre-trained models

Install

# latest development version; for the stable release use: pip install transformers
pip install git+https://github.com/huggingface/transformers

Usage

from transformers import pipeline

# load a default English-to-French translation model
translator = pipeline("translation_en_to_fr")
print(translator("How old are you?"))

For some tasks the library does not provide a default model, but you can search the Hub for a suitable one and specify it when creating the pipeline. Loading a model may also fail because a dependency is missing; in that case, simply install the missing library and restart the runtime.

!pip install sentencepiece  # tokenizer dependency required by this model
translator = pipeline("translation_en_to_zh", model='Helsinki-NLP/opus-mt-en-zh')
translator("I'm learning deep learning.")

Model card: Helsinki-NLP/opus-mt-en-zh · Hugging Face

diffusers

diffusers is a general framework for diffusion models. It supports both using pre-trained models directly and training your own:

  • With just a few lines of code you can generate images with a diffusion model, which is great news for anyone who would rather not build everything from scratch.
  • Different noise schedulers can be swapped in to trade off generation speed against quality.
  • Many model architectures are provided as building blocks for end-to-end diffusion models.

Pipelines: high-level classes for quickly generating samples from popular diffusion models in a user-friendly way.

Models: popular architectures for training new diffusion models, such as UNet.

Schedulers: various techniques for generating images from noise at inference time, or for adding noise to images during training.
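
A minimal sketch of the Pipelines and Schedulers ideas; the checkpoint name is only an example, and a CUDA GPU is assumed:

import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

# load a pre-trained text-to-image pipeline (example checkpoint)
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# swap in a faster scheduler to trade a little quality for speed
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe("a photo of an astronaut riding a horse").images[0]
image.save("astronaut.png")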

Using Hugging Face models

The Transformers project provides several simple APIs to help users load and use Hugging Face models. These are collectively called AutoClasses (see the official documentation) and include:

  • AutoTokenizer: used for tokenization
  • AutoFeatureExtractor: used for feature extraction
  • AutoProcessor: used for data processing
  • AutoModel: used to load models

They are all used the same way: AutoClass.from_pretrained("model name"). For example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer("I'm learning deep learning."))  # input ids, attention mask, etc.

A model usually provides only some of these four AutoClasses. For example, bert-base-uncased provides a tokenizer and a model. You can check what a given model supports via the code sample ("Use in Transformers") panel on its model page.
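
As an illustration, here is a minimal sketch combining AutoTokenizer and AutoModel for bert-base-uncased (PyTorch is assumed to be installed):

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I'm learning deep learning.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence length, hidden size)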

Datasets

The Datasets library makes it very convenient to access and share datasets, and it can also be used to evaluate NLP, CV, speech, and other tasks (evaluation metrics).

pip install datasets


# for audio datasets
pip install datasets[audio]

# for image datasets
pip install datasets[vision]

Find a dataset

A Hugging Face dataset usually contains several subsets and is split into three parts: train, validation, and test. You can inspect the subset you need in the preview area of the dataset page.

from datasets import load_dataset

# glue is a collection of tasks, so a config name such as "mrpc" is required
dataset = load_dataset("glue", "mrpc")
print(dataset)              # a DatasetDict with train/validation/test splits
print(dataset["train"][0])  # first training example

Hugging Face datasets are hosted on the Hugging Face Hub, which can be hard to reach from mainland China, so downloads may fail there. In that case you need to load a local copy of the dataset. For how to download Hugging Face datasets for offline use, please refer to this article.

Download

import datasets

# download once while you have network access, then save a local copy
dataset = datasets.load_dataset("glue", "mrpc")
dataset.save_to_disk('your_path')

Load offline

from datasets import load_from_disk

dataset = load_from_disk("your_path")

References

Quick start with Hugging Face (focusing on the model part (Transformers) and the dataset part (Datasets)), 51CTO blog.

Origin: blog.csdn.net/linzhiji/article/details/132721308