Reprinted from: Xinzhiyuan
Predicting the future has always been a human dream, and machine learning models happen to excel at prediction. Recently, a Chinese Ph.D. affiliated with Google and Brown University published new work at ICCV 2021: on a dataset of recipe videos, the model can make reasonable predictions about the future, with no fixed time horizon. Just crack an egg, and it knows you are going to make pancakes!
As machine learning models are increasingly used and deployed in the real world, AI decisions can also be used to help people make decisions in their daily lives.
Prediction has always been a central problem in the decision-making process in the field of computer vision.
How to make reasonable predictions about the future on different time scales is one of the important capabilities of such models. It allows a model to anticipate changes in the surrounding world, including the behavior of others, and to plan its next actions and decisions.
More importantly, successful future prediction requires both capturing meaningful object changes in the environment and understanding how the environment changes over time in order to make decisions and predictions.
The work of future prediction in computer vision is mainly limited by the form of its output, which may be pixels of an image or some manually predefined labels (such as predicting whether someone will continue to walk, sit, etc.).
Such outputs are either too fine-grained for prediction to fully succeed or fail to exploit the richness of real-world information. That is, if a model predicts "jumping" without knowing why someone is jumping, or what they are jumping over, the prediction cannot really succeed; the result is essentially a random guess.
Furthermore, with very few exceptions, previous models were designed to forecast at a fixed offset into the future rather than at dynamic time intervals. This is a limiting assumption, because we rarely know when a meaningful future state will emerge.
In a video of making ice cream, for example, the transition from cream to ice cream takes 35 seconds, so a model that predicts this change needs to look 35 seconds ahead. But this interval varies greatly across behaviors and videos: another creator may work in more detail and take far longer, meaning the ice cream could appear at almost any point in the future.
Moreover, such videos can be collected at large scale, in the millions, and many instructional videos come with voice-over narration that provides concise, general descriptions throughout the video. This data source can guide the model to attend to the important parts of a video, enabling flexible, data-driven prediction of future events without manual annotation.
Based on this idea, Google published a paper at ICCV 2021 proposing a self-supervised approach that uses a large, unlabeled dataset of human activities. The resulting model operates at a high level of abstraction, can make long-range predictions of the future at arbitrary time intervals, and can choose, based on context, how far into the future to predict.
The model is trained with a Multi-Modal Cycle Consistency (MMCC) objective, which makes it possible to learn a robust future prediction model from narrated instructional videos. In the paper, the researchers also show how MMCC can be applied to a variety of challenging tasks without fine-tuning, and they quantitatively evaluate its predictions.
The paper's author Chen Sun is affiliated with Google and Brown University: he is an assistant professor of computer science at Brown University, working on computer vision, machine learning, and artificial intelligence, and a research scientist at Google Research.
He received his Ph.D. from the University of Southern California in 2016 under the supervision of Prof. Ram Nevatia, and completed his bachelor's degree in computer science at Tsinghua University in 2011.
His ongoing research projects include learning multimodal representations and visual communication from unlabeled videos, recognizing human activities, objects, and their interactions over time, and transferring these representations to embodied agents.
The study addresses three core questions about future forecasting:
1. Manually labeling temporal relationships in videos is time-consuming and labor-intensive, and the correctness of such labels is hard to define. The model should therefore learn to discover event transformations autonomously from large amounts of unlabeled data, so that it can be used in practice.
2. Encoding complex long-term event transformations in the real world requires learning higher-level concepts, which live in abstract latent representations rather than raw image pixels.
3. Event transformations in time series are very context-dependent, so the model must be able to predict the future at variable time intervals.
To meet these needs, the researchers introduce a new self-supervised training objective, MMCC, together with a model that learns representations to solve this problem.
The model starts from a sampled frame of a narrated video and learns to find the relevant language utterance among all of the narration text. Combining the visual and textual modalities, the model uses the entire video to learn to predict a potential future event and the language description corresponding to that frame; an analogous function learns to predict the past frame.
The cycle constraint requires that the model's final prediction equal the starting frame.
Also, because the model does not know which modality its input comes from, it must operate on both vision and language, and therefore cannot fall back on a lower-level, single-modality scheme for future prediction.
The model learns an embedding for all visual and textual nodes, then uses attention to compute the cross-modal node in the other modality that corresponds to the start node. The representations of the two nodes are passed through fully connected layers that predict the future frame, again using attention in the initial modality. The backward process is then repeated, and the model is trained with a loss that closes the cycle: the final output must predict the starting node.
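The cycle described above can be sketched in a few lines of numpy. This is a toy illustration, not the paper's implementation: the embeddings are random, and the learned attention modules and fully connected predictors are replaced by a simple softmax-attention function and fixed random projections (`W_future`, `W_past` are hypothetical stand-ins).

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_attend(query, keys):
    """Soft attention: return a similarity-weighted average of the keys."""
    scores = keys @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ keys  # convex combination of the keys

# Toy embeddings: 5 video frames and 5 narration utterances, 8-dim each.
frames = rng.normal(size=(5, 8))
texts = rng.normal(size=(5, 8))

# Stand-ins for the learned "future" and "past" predictors.
W_future = rng.normal(size=(8, 8)) * 0.1 + np.eye(8)
W_past = rng.normal(size=(8, 8)) * 0.1 + np.eye(8)

start = frames[0]
# 1. Find the cross-modal counterpart of the start frame.
start_text = soft_attend(start, texts)
# 2. Predict a future state from the fused representation.
future = soft_attend((start + start_text) @ W_future, frames)
# 3. Walk back: predict the past of the predicted future state.
future_text = soft_attend(future, texts)
back = soft_attend((future + future_text) @ W_past, frames)
# 4. Cycle-consistency loss: the round trip should land on the start frame.
cycle_loss = np.sum((back - start) ** 2)
print(float(cycle_loss))
```

Training would minimize `cycle_loss` over many sampled start frames, which is what forces the forward and backward predictors to agree on meaningful transitions.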
For the experiments, since most previous benchmarks focus on supervised action prediction with fixed categories and temporal offsets, the researchers designed a series of new qualitative and quantitative experiments to evaluate the different methods.
First, the data: the researchers trained the model on unconstrained real-world video, using a subset of the HowTo100M dataset, which contains approximately 1.23 million videos along with their automatically extracted audio transcripts. The videos are roughly categorized by subject area, and only those labeled as Recipes were used, about a quarter of the dataset.
Of the 338,033 recipe videos, 80% went into the training set, 15% into the validation set, and 5% into the test set. Recipe videos are rich in complex objects, actions, and state transitions, and this subset also lets the model be trained faster.
For more controlled tests, the researchers used the CrossTask dataset, which contains similar videos with task-specific annotations.
All videos relate to a task, such as making pancakes, and each task comes with a predefined sequence of high-level subtasks with rich long-term dependencies, e.g., first beat the eggs into a bowl, later add the syrup, and so on.
The model's ability to predict the correct future was measured with a top-k recall metric, which assesses how often the true action appears among the model's top k predictions (higher is better).
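Top-k recall itself is straightforward to compute. Below is a minimal sketch (the scores and target indices are made-up toy values, not from the paper): a prediction counts as a hit if the ground-truth candidate is among the k highest-scoring ones.

```python
import numpy as np

def topk_recall(scores, targets, k):
    """Fraction of queries whose true target index is among the k
    highest-scoring candidates (higher is better)."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = [t in row for t, row in zip(targets, topk)]
    return sum(hits) / len(targets)

# Toy example: 3 queries scored against 4 candidate future states.
scores = np.array([[0.1, 0.7, 0.1, 0.1],
                   [0.4, 0.3, 0.2, 0.1],
                   [0.3, 0.2, 0.5, 0.0]])
targets = [1, 2, 0]  # index of the true future for each query
print(topk_recall(scores, targets, k=1))
print(topk_recall(scores, targets, k=2))
```

Increasing k can only raise recall, which is why results are usually reported at several values of k.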
For MMCC, to identify meaningful event changes over time across a video, the researchers defined a possible-transition score for every pair of frames in the video based on the model's predictions: the closer the predicted frame is to the actual frame, the higher the score.
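One simple way to realize such a pairwise score, sketched below with toy inputs, is a softmax over similarities between the model's predicted future of frame i and every candidate frame j. The orthogonal embeddings and the "predict frame i+1" model are assumptions for illustration only, not the paper's learned components.

```python
import numpy as np

def transition_score(pred_next, frames, i, j):
    """Score the pair (i, j): softmax similarity between the predicted
    future of frame i and candidate frame j, relative to all frames.
    The closer the prediction is to frame j, the higher the score."""
    sims = frames @ pred_next[i]
    probs = np.exp(sims - sims.max())
    probs /= probs.sum()
    return probs[j]

# Toy setup: six orthogonal "frame embeddings" and a hypothetical model
# whose future prediction from frame i is simply frame i+1.
frames = np.eye(6)
pred_next = np.roll(frames, -1, axis=0)

score_true = transition_score(pred_next, frames, i=2, j=3)   # actual successor
score_other = transition_score(pred_next, frames, i=2, j=5)  # unrelated frame
print(score_true, score_other)
```

Ranking frame pairs by such a score is what lets the model surface the most meaningful state transitions in a video.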
References:
https://ai.googleblog.com/2021/11/making-better-future-predictions-by.html