Machine Learning Insights | Mining the Value of Machine Learning from Multimodal Data

Over the past few years, we have witnessed many changes in machine learning and computer science. Artificial intelligence applications keep expanding and are rapidly becoming part of people's daily lives. As the core technology behind them, machine learning continues to develop and evolve, playing an increasingly important role in more and more fields. What new trends and directions will machine learning take? How should we plan ahead and keep up with the cutting edge of this popular technology?

The Amazon Cloud Technology developer community provides developers with global development technology resources, including technical documentation, development cases, technical columns, training videos, events, and competitions. It helps Chinese developers connect with the world's most cutting-edge technologies, ideas, and projects, and recommends outstanding Chinese developers and technologies to the global cloud community. If you have not followed or bookmarked it yet, click through and make it part of your technical toolbox!

 

The "Machine Learning Insights" series of articles will interpret and analyze the four potential evolution trends of machine learning in practice based on the current development status of machine learning, including multimodal machine learning, distributed training, serverless reasoning, and JAX is a newly emerging deep learning framework.

What is Machine Learning on Multimodal Data

The sounds people hear, the things they see, and the smells they perceive are all modal information; we live in an environment where multiple modalities blend with one another.

For artificial intelligence to understand the world better, it needs the ability to understand and reason about multimodal information. Machine learning on multimodal data refers to building models that let machines learn from information in multiple modalities and exchange and convert information between those modalities.

Multimodal applications cover a wide range of scenarios: not only everyday ones such as AI speakers, e-commerce product recommendation systems, and image recognition, but also industrial fields including navigation and autonomous driving, physiological and pathological research, environmental monitoring, and weather forecasting. In the future, they may also support communication between virtual humans and real people in Metaverse scenarios.

Evolution in the field of multimodal learning

As machine learning on multimodal data grows in popularity, research in this area has produced many innovative breakthroughs, notably the following three important advances:

Zero-Shot Learning (ZSL)

Zero-Shot Learning (ZSL) learns a set of attributes from training-set images, combines these attributes into fused features, and matches them against test-set images whose classes do not overlap with the training set in order to predict their categories. In other words, features learned from seen images are used to infer the categories of images that have never been seen.

Let's start with an analogy to the way humans reason:

Suppose Xiao Ming goes to the zoo with his father. First they see a horse, and his father tells Xiao Ming what a horse looks like. Then they see a tiger, and his father says, "Look, this animal with stripes on its body is a tiger." Finally, his father shows him a panda and says, "Look, this panda is black and white."

Then his father gives Xiao Ming a task: find an animal in the zoo that he has never seen before, called a zebra. He describes it for Xiao Ming: "A zebra has the outline of a horse, stripes on its body like a tiger, and is black and white like a panda." Following his father's hints, Xiao Ming finds the zebra in the zoo.

This example contains a human reasoning process: using past knowledge (descriptions of horses, tigers, and pandas) to infer in the mind what a new object (a zebra) looks like, so that the new object can be recognized.

For example, in the figure below, attributes such as horselike, stripe, and black&white are learned from the training-set images (Seen Classes Data) and combined into fused features. These fused features happen to match the zebra features in the test set, so the predicted result is Zebra:

Image source: the paper "Learning To Detect Unseen Object Classes by Between-Class Attribute Transfer" (University of Tübingen, 2009)

Zero-Shot Learning embeds this same reasoning process: using past knowledge to infer what a new object looks like, and thereby recognizing it.
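To make the idea concrete, here is a minimal sketch of attribute-based zero-shot classification in Python. The attribute names, class-attribute table, and scores are all hypothetical illustrations rather than values from the paper above; a real system would learn the attribute predictors from the seen-class images.

import numpy as np

# Hypothetical attribute vocabulary and class-attribute table (1 = class has the attribute).
unseen_classes = {
    "zebra": np.array([1, 1, 1]),   # horselike, stripe, black_and_white
    "panda": np.array([0, 0, 1]),
    "tiger": np.array([0, 1, 0]),
}

def predict_unseen_class(attribute_scores: np.ndarray) -> str:
    """Pick the unseen class whose attribute signature best matches the predicted scores."""
    best_class, best_score = None, -np.inf
    for name, signature in unseen_classes.items():
        score = float(attribute_scores @ signature)  # simple dot-product match
        if score > best_score:
            best_class, best_score = name, score
    return best_class

# Attribute scores that, in a real system, would come from per-attribute classifiers
# trained on the seen classes (horse, tiger, panda).
scores_for_test_image = np.array([0.9, 0.8, 0.7])
print(predict_unseen_class(scores_for_test_image))  # -> "zebra"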

Today, although deep learning and supervised learning achieve impressive results on many tasks, this way of learning also has the following drawbacks:

  • Large amounts of sample data are required, and labeling those samples is a huge amount of work
  • Classification results depend on the training samples, so the model cannot reason about or recognize new categories

This clearly falls short of what people ultimately imagine artificial intelligence to be. Zero-Shot Learning aims to simulate the human ability to recognize new categories through reasoning, giving computers the ability to recognize new things and moving closer to real intelligence.

CLIP: Pre-training on Contrastive Text-Image Pairs

Before 2021, many pre-training methods in Natural Language Processing (NLP) had already been successful. For example, GPT-3 175B was pre-trained on nearly 500 billion tokens collected from the Internet and achieved state-of-the-art (SOTA) performance and Zero-Shot Learning on many downstream tasks. This shows that learning from massive web-scale data can surpass learning from high-quality human-labeled NLP datasets.

However, pre-trained models in Computer Vision (CV) were still mainly trained on the manually labeled ImageNet dataset. Because manual labeling takes enormous effort, many researchers began to ask: how can we build a more efficient and convenient way to train visual representation models?

In the paper " Learning Transferable Visual Models From Natural Language Supervision " published in 2021 , the CLIP (Contrastive Language-Image Pre-training) model was grandly introduced, and it introduced in detail how to train transferable visual models through natural language processing supervisory signals. Model.

How does the paper handle the supervisory signal from natural language?

The paper mainly introduces the following three parts:

Image from the paper

Part 1: Contrastive pre-training.

During training, two encoders, a text encoder and an image encoder, encode the paired inputs to form an N x N matrix of image-text feature combinations. The feature pairs on the blue diagonal of the matrix shown above serve as positive samples, and the other (white) feature pairs serve as negative samples. CLIP performs contrastive learning on these features without any manual labeling.

It should be noted that this kind of contrastive learning without manual labels requires a huge amount of training data: the dataset used in the paper contains 400 million image-text pairs, which is what guarantees the quality of the results.
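What follows is a minimal PyTorch sketch of the symmetric contrastive objective described above, not the paper's exact implementation; the batch size, feature dimension, and temperature are placeholder values.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of N matched image-text pairs."""
    # Normalize so that dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # N x N similarity matrix; the diagonal holds the matched (positive) pairs.
    logits = image_features @ text_features.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image->text and text->image), then average.
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.T, targets)
    return (loss_i + loss_t) / 2

# Example with random features standing in for encoder outputs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))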

Part 2: Implementation of Zero-Shot Classification with CLIP.

For downstream tasks, CLIP avoids special classification heads, which allows transfer to new datasets without any fine-tuning. The paper designs a very ingenious technique, the prompt template, which uses natural language to cleverly cast the classification task into the existing training framework.
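As a rough sketch of the prompt-template idea (assuming the openai/CLIP package used in the code example later in this article, with a purely illustrative label set), a zero-shot classifier can be built from nothing but class names:

import clip
import torch

# Load the model; "ViT-B/32" matches the code example later in this article.
model, preprocess = clip.load("ViT-B/32")
class_names = ["cat", "dog", "zebra"]  # illustrative label set

# Wrap each class name in a natural-language prompt and encode it with the text encoder.
prompts = [f"a photo of a {name}" for name in class_names]
with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(prompts).cuda()).float()
    text_features /= text_features.norm(dim=-1, keepdim=True)

# text_features now plays the role of a classifier weight matrix: an image embedding
# is classified by picking the prompt embedding it is most similar to.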

Part 3: Zero-Shot Inference.

The sample image is fed into the image encoder to obtain its image features, the similarity between these image features and all the text features is computed, and the sentence corresponding to the most similar text feature is selected to complete the classification task and produce the final result image.

The results show that the model trained with CLIP is very effective:

Image from the paper

This experiment evaluates on the ImageNet dataset together with several variant versions derived from it.

The ResNet-101 model trained on the ImageNet dataset reaches 76.2% accuracy, and the ViT-Large model trained with CLIP also reaches 76.2%. However, when the ImageNet-trained model, with its fixed 1000-class classification head, is evaluated on the other dataset variants, its accuracy drops rapidly: on the last two rows in the figure above (sketch and adversarial samples) it reaches only 25.2% and 2.7%, little better than random guessing, so its transfer performance is terrible. The CLIP-trained model, in contrast, keeps its accuracy at roughly the same level across these variants.

This indirectly shows that, because CLIP is combined with natural language supervision, the visual features it learns are strongly tied to the objects we describe in language.

Code Example: Running a CLIP Model on Amazon SageMaker

This example shows how to download and run a CLIP model, compute similarity between arbitrary images and text inputs, and perform ZSL image classification.

  1. Download and load the CLIP model; after loading, we can inspect its basic properties:
import clip

# List the available pre-trained CLIP models, then load the ViT-B/32 variant.
clip.available_models()
model, preprocess = clip.load("ViT-B/32")
model.cuda().eval()

# Basic model properties: input image resolution, text context length, vocabulary size.
input_resolution = model.visual.input_resolution
context_length = model.context_length
vocab_size = model.vocab_size

Image source: ImageNet dataset

  2. Compute similarity: compute cosine similarity by normalizing the image and text features and taking the dot product of each pair (a minimal sketch of this step follows below).

Image source: CIFAR100 dataset
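The code for the similarity computation itself is not reproduced above, so here is a minimal sketch using the model loaded in step 1; the image paths and text strings are placeholders you would replace with your own data.

import clip
import torch
from PIL import Image

# Placeholder paths; replace with your own images, e.g. Image.open("cat.png").
images = [Image.open(path) for path in ["example1.png", "example2.png"]]
texts = ["a photo of a cat", "a photo of a dog"]

image_input = torch.stack([preprocess(img) for img in images]).cuda()
text_tokens = clip.tokenize(texts).cuda()

with torch.no_grad():
    image_features = model.encode_image(image_input).float()
    text_features = model.encode_text(text_tokens).float()

# Normalize and take dot products: each entry is the cosine similarity
# between one text and one image.
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = text_features.cpu().numpy() @ image_features.cpu().numpy().T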

  3. ZSL image classification: use the cosine similarity (multiplied by 100) as the logits for the softmax operation.
import os

import torch
from torchvision.datasets import CIFAR100

# Download CIFAR100 to obtain its 100 class names as candidate labels.
cifar100 = CIFAR100(os.path.expanduser("~/.cache"), transform=preprocess, download=True)

# Build a natural-language prompt for each class and encode it with the text encoder.
text_descriptions = [f"This is a photo of a {label}" for label in cifar100.classes]
text_tokens = clip.tokenize(text_descriptions).cuda()

with torch.no_grad():
    text_features = model.encode_text(text_tokens).float()
    text_features /= text_features.norm(dim=-1, keepdim=True)

# image_features are the normalized image embeddings computed in step 2 above.
# The scaled cosine similarities act as logits for the softmax over the 100 classes.
text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
top_probs, top_labels = text_probs.cpu().topk(5, dim=-1)

The final ZSL image classification result is shown below: the CLIP ViT-B/32 model matches images it has never seen against the class labels of the CIFAR100 dataset.

Image source: CIFAR100 dataset

The whole process requires no downstream fine-tuning and no training on the target dataset; the model is evaluated directly on the benchmark.

For the complete code of the CLIP paper, please refer to:  GitHub - openai/CLIP: CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image

ZESRec: A Zero-Shot Recommender System

In addition to CLIP, there are other explorations in machine learning on multimodal data, such as work from Amazon Cloud Technology's AI Labs.

In the paper " Zero-Shot Recommender Systems ", the basic idea is shared, that is, the product description information trained by BERT is used to replace the product id embedding as product input, the upper layer connects mlp to 300 dim, and then connects hmn to learn sequence features.

For recommending items to users that never appeared in the training data, this solves the cold-start problem very well, and it even performs well in zero-shot transfer across datasets. It suits the omni-channel scenarios of new retail, and its input can be extended with more dimensions, such as video and product images.
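As a rough, hypothetical sketch of this kind of architecture (not the exact model from the ZESRec paper), the code below replaces item-ID embeddings with pre-computed BERT embeddings of item descriptions, projects them with an MLP, and feeds the sequence of interacted items into a GRU to score candidate items, including ones never seen in training.

import torch
import torch.nn as nn

class ZeroShotSeqRecommender(nn.Module):
    """Sketch of a zero-shot sequential recommender: items are represented by
    text embeddings of their descriptions instead of learned ID embeddings."""

    def __init__(self, text_dim: int = 768, item_dim: int = 300, hidden_dim: int = 128):
        super().__init__()
        self.item_proj = nn.Sequential(nn.Linear(text_dim, item_dim), nn.ReLU())
        self.seq_model = nn.GRU(item_dim, hidden_dim, batch_first=True)
        self.out_proj = nn.Linear(hidden_dim, item_dim)

    def forward(self, item_text_emb: torch.Tensor) -> torch.Tensor:
        # item_text_emb: (batch, seq_len, text_dim) BERT embeddings of item descriptions.
        items = self.item_proj(item_text_emb)
        hidden, _ = self.seq_model(items)
        # Return a user representation in item space, taken from the last step.
        return self.out_proj(hidden[:, -1])

# Usage: score unseen candidate items against each user's interaction history.
model = ZeroShotSeqRecommender()
history = torch.randn(2, 5, 768)     # 2 users, 5 interactions each
candidates = torch.randn(10, 768)    # 10 candidate items, possibly unseen in training
scores = model(history) @ model.item_proj(candidates).T   # (2, 10) relevance scores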

Architecture Case: Model Training on Multimodal Data

The figure below shows a reference deployment architecture for a multimodal data pipeline in the life sciences field.

Image source: Official blog " Training Machine Learning Models with Multimodal Health Data on Amazon SageMaker "

The architecture processes genomic data, clinical data, and medical imaging data, and loads the processed features for each modality into the  Amazon SageMaker  Feature Store.

This example shows how to pool features from different modalities and train a predictive model that outperforms models trained on only one or two of the data modalities.
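To illustrate the feature-pooling step, here is a minimal, hypothetical sketch that assumes the per-modality features have already been exported from the feature store into pandas DataFrames keyed by a shared subject_id; the column names and model choice are illustrative only, not taken from the referenced blog.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-modality feature tables keyed by a shared subject_id.
genomic = pd.DataFrame({"subject_id": [1, 2, 3, 4], "gene_score": [0.1, 0.7, 0.4, 0.9]})
clinical = pd.DataFrame({"subject_id": [1, 2, 3, 4], "age": [54, 61, 47, 58], "bmi": [24.1, 31.0, 27.5, 29.3]})
imaging = pd.DataFrame({"subject_id": [1, 2, 3, 4], "tumor_volume": [1.2, 3.4, 0.8, 2.9]})
labels = pd.DataFrame({"subject_id": [1, 2, 3, 4], "outcome": [0, 1, 0, 1]})

# Pool the modalities into a single feature table by joining on subject_id.
dataset = genomic.merge(clinical, on="subject_id").merge(imaging, on="subject_id").merge(labels, on="subject_id")
X = dataset.drop(columns=["subject_id", "outcome"])
y = dataset["outcome"]

# Train a simple predictive model on the pooled multimodal features.
model = RandomForestClassifier(random_state=0).fit(X, y)
print(model.predict(X))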

We will continue to introduce distributed training, serverless inference, and the evolution of the JAX framework in subsequent articles. Please continue to follow the Build On Cloud WeChat official account.

As machine learning grows in importance and related research continues to mature, the technology around machine learning will keep improving, empowering artificial intelligence and other technical fields to reach a broader space for development and to benefit people.

 

 

 

Author: Huang Haowen

Senior developer evangelist at Amazon Cloud Technology, focusing on AI/ML and data science. He has more than 20 years of experience in architecture design, technology, and entrepreneurial management in the telecommunications, mobile Internet, and cloud computing industries. He has worked at Microsoft, Sun Microsystems, China Telecom, and other companies, providing consulting services in AI/ML, data analytics, and enterprise digital transformation for corporate clients in gaming, e-commerce, media, advertising, and other sectors.

Article source: https://dev.amazoncloud.cn/column/article/63e32a58e5e05b6ff897ca0c?sc_medium=regulartraffic&sc_campaign=crossplatform&sc_channel=CSDN
