A Chinese Ph.D. at Google released a new model at ICCV 2021! Machine learning can predict the future!

Reprinted from: Xinzhiyuan

Predicting the future has always been a human dream, and machine learning models happen to be good at prediction. Recently, a Chinese Ph.D. from Google and Brown University published new work at ICCV 2021: on a dataset of recipe videos, the model makes reasonable predictions about the future, with no fixed time horizon. Crack an egg, and it knows you are about to make pancakes!

As machine learning models are increasingly deployed in the real world, AI can also help people make decisions in their daily lives.

Prediction has long been a central problem for decision-making in computer vision.

Making reasonable predictions about the future on different time scales is one of the key capabilities such models need. It lets a model anticipate changes in the world around it, including the behavior of other agents, and plan its next actions and decisions.

More importantly, successful future prediction requires both capturing meaningful changes to objects in the environment and understanding how the environment evolves over time, so that decisions and predictions can be made.

Work on future prediction in computer vision has been largely limited by the form of its output: either the raw pixels of an image or manually predefined labels (e.g., predicting whether someone will keep walking, sit down, etc.).

Both choices fall short: pixel-level predictions are too detailed to get fully right, while coarse labels fail to exploit the richness of real-world information. That is, if a model predicts "jumping" without knowing why the person is jumping, or what they are jumping over, the prediction is little better than a random guess.

Furthermore, with very few exceptions, previous models were designed to forecast at a fixed offset into the future rather than over dynamic time intervals. This is a limiting assumption, since we rarely know when a meaningful future state will emerge.

In a video of making ice cream, for example, the transition from cream to ice cream takes 35 seconds, so a model that predicts this change needs to look 35 seconds ahead. But this interval varies widely across activities and videos: another creator might spend far longer on the same step in more detail, meaning the ice cream could appear at almost any point in the future.

In addition, such videos can be collected at large scale, in the millions, without frame-by-frame annotation, and many instructional videos come with voice-over narration that gives concise, general descriptions of what is happening. This data source can steer the model toward the important parts of a video, enabling flexible, data-driven prediction of future events without any manual annotation.

Based on this idea, Google published a paper at ICCV 2021 proposing a self-supervised approach trained on a large, unlabeled dataset of human activity videos. The resulting model works at a high level of abstraction, can make long-range predictions of the future at arbitrary time intervals, and chooses how far ahead to predict based on context.

The model is trained with a Multi-Modal Cycle Consistency (MMCC) objective function, which uses narrated instructional videos to learn a robust future prediction model. In the paper, the researchers also show how MMCC can be applied to a range of challenging tasks without fine-tuning, and evaluate its predictions quantitatively.

The paper's author, Chen Sun, works at Google and Brown University: he is an assistant professor of computer science at Brown University, researching computer vision, machine learning, and artificial intelligence, and a research scientist at Google Research.

He received his Ph.D. from the University of Southern California in 2016 under the supervision of Prof. Ram Nevatia, and earned his bachelor's degree in computer science from Tsinghua University in 2011.

His ongoing research projects include learning multimodal representations and visual common sense from unlabeled videos, recognizing human activities, objects, and their interactions over time, and transferring these representations to embodied agents.

The study addresses three core challenges of future prediction:

1. Manually labeling temporal relationships in videos is time-consuming and labor-intensive, and the correctness of such labels is hard to define. A practical model must therefore learn to discover event transitions on its own from large amounts of unlabeled data.

2. Encoding complex long-term event transitions in the real world requires learning higher-level concepts, which live in abstract latent representations rather than in raw image pixels.

3. Event transitions are highly context-dependent, so the model must be able to predict the future at variable time intervals.

To meet these needs, the researchers introduce a new self-supervised training objective, MMCC, along with a model that learns representations to solve this problem.

The model starts from a sample frame of a narrated video and learns to find the relevant language in all of the narration text. Combining the two modalities, the model uses the entire video to learn to predict a potential future event and to estimate the corresponding language description of that frame; it learns a function that predicts past frames in the same way.

The cycle constraint requires that the model's final prediction land back on the start frame.

On the other hand, since the model does not know which modality its input comes from, it must operate over both vision and language, and therefore cannot fall back on a lower-level, vision-only prediction shortcut.
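
In symbols (the notation here is ours, for illustration; the paper's exact formulation differs in detail): writing \hat{f} for the learned future-prediction function and \hat{p} for the past-prediction function, the cycle constraint amounts to a loss of the form

\mathcal{L}_{\text{cycle}} = \mathbb{E}_{x}\left[\, d\left(\hat{p}(\hat{f}(x)),\, x\right) \right]

where x ranges over video frames and narration sentences, and d is a distance in the shared embedding space: predicting forward and then backward should return to the start node.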

The model learns to embed all visual and text nodes, then attends over the other modality to find the cross-modal node corresponding to the start node. The representations of the two nodes are passed through fully connected layers, which predict the future frame using attention over the initial modality. The backward process is then repeated, and the model is trained with a loss that requires the final output of the cycle to come back to the start node.
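
As a rough illustration, here is a minimal PyTorch sketch of that cycle. This is our own simplification under assumed names and dimensions, not the authors' released code: nodes are pre-computed frame and sentence embeddings, soft attention stands in for node selection, and a cross-entropy loss closes the cycle.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMCCSketch(nn.Module):
    """Illustrative sketch of the MMCC cycle (our simplification).
    Inputs are pre-computed embeddings: frames (N, dim), sentences (M, dim)."""

    def __init__(self, dim=256):
        super().__init__()
        self.to_future = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.to_past = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    @staticmethod
    def attend(query, nodes):
        # Soft attention: weight candidate nodes by similarity to the query.
        return F.softmax(query @ nodes.t(), dim=-1) @ nodes   # (1, dim)

    def step(self, node, same, other, head):
        cross = self.attend(node, other)            # cross-modal counterpart
        query = head(torch.cat([node, cross], dim=-1))
        return self.attend(query, same)             # prediction, same modality

    def cycle_loss(self, start_idx, frames, sentences):
        start = frames[start_idx:start_idx + 1]                   # (1, dim)
        future = self.step(start, frames, sentences, self.to_future)
        # Going backward from the predicted future should hit the start frame.
        cross = self.attend(future, sentences)
        back = self.to_past(torch.cat([future, cross], dim=-1))  # (1, dim)
        logits = back @ frames.t()                                # (1, N)
        return F.cross_entropy(logits, torch.tensor([start_idx]))

model = MMCCSketch()
frames, sentences = torch.randn(40, 256), torch.randn(25, 256)
loss = model.cycle_loss(start_idx=3, frames=frames, sentences=sentences)
loss.backward()
```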

For the experiments, since most previous benchmarks focus on supervised action prediction with fixed categories and fixed time offsets, the researchers designed a series of new qualitative and quantitative evaluations to compare different methods.

First, the data: the researchers trained the model on unconstrained real-world video, using a subset of the HowTo100M dataset, which contains approximately 1.23 million videos along with automatically extracted audio transcripts. The videos are loosely categorized by topic, and only those categorized as Recipes, about a quarter of the dataset, were used.

Of the 338,033 recipe videos, 80% go to the training set, 15% to the validation set, and 5% to the test set. Recipe videos are rich in complex objects, actions, and state transitions, and this subset also lets the model be trained faster.
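
For reference, a split in those proportions could be produced like this (a hypothetical sketch; the authors' actual split is fixed by the dataset release):

```python
import random

# Hypothetical video ids; the real list comes from HowTo100M's Recipes subset.
videos = [f"video_{i}" for i in range(338_033)]
random.seed(0)
random.shuffle(videos)

n = len(videos)
train = videos[: int(0.80 * n)]               # 80% -> 270,426 videos
val   = videos[int(0.80 * n): int(0.95 * n)]  # 15% ->  50,705 videos
test  = videos[int(0.95 * n):]                #  5% ->  16,902 videos
```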

For more controlled tests, the researchers used the CrossTask dataset, which contains similar videos with task-specific annotations.

All videos are tied to tasks, such as making pancakes, and each task has a predefined sequence of high-level subtasks with rich long-term dependencies, e.g., first beat the eggs into a bowl, then add the syrup, and so on.

The model's ability to predict the correct future action was measured with the top-k recall metric (higher is better).
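
Top-k recall counts a prediction as correct when the ground-truth action appears among the model's k highest-scoring candidates. A minimal implementation (ours, for illustration):

```python
import numpy as np

def top_k_recall(scores, targets, k=5):
    """Fraction of examples whose ground-truth label is among the k
    highest-scoring predictions (higher is better).
    scores: (N, C) array of model scores; targets: length-N labels."""
    top_k = np.argsort(-scores, axis=1)[:, :k]          # best k per example
    hits = (top_k == np.asarray(targets)[:, None]).any(axis=1)
    return hits.mean()

# e.g. 3 examples scored over 10 candidate actions
scores = np.random.rand(3, 10)
print(top_k_recall(scores, targets=[2, 7, 5], k=5))
```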

To identify meaningful event transitions across a full video with MMCC, the researchers defined a likely-transition score for every pair of frames based on the model's predictions: the closer the model's predicted frame is to the actual frame, the higher the score.
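
One plausible implementation of such a score (our reading of the description; the paper's exact definition may differ) is the similarity between the future predicted from frame i and each actual frame j:

```python
import torch
import torch.nn.functional as F

def transition_scores(predicted, frames):
    """(i, j) entry is high when the future the model predicts from frame i
    is close to actual frame j, i.e. i -> j is a likely transition.
    predicted, frames: (N, dim) embedding tensors."""
    predicted = F.normalize(predicted, dim=-1)
    frames = F.normalize(frames, dim=-1)
    return predicted @ frames.t()   # (N, N) cosine-similarity score matrix
```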

References:

https://ai.googleblog.com/2021/11/making-better-future-predictions-by.html

