Using stories to explain the principles of artificial intelligence algorithms - computer vision

Computer Vision

Image Classification

The story begins at a school called AI Academy, where students are preparing for a visual art competition: everyone must demonstrate their understanding of the world by creating a painting. This mirrors how we use different models for image classification tasks.

LeNet (1998)

First, Xiaozao decided to use the simplest method to complete his painting: building up the overall image through a few accumulated layers of color. He found that while this worked, it was not efficient and could not handle complex images. This is just like LeNet, the earliest convolutional neural network, which consists of an input layer, two convolutional layers, two subsampling (pooling) layers, and fully connected layers, but falls short when processing complex images.
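To make the "few layers of color" concrete, here is a minimal sketch of how feature-map sizes shrink through LeNet's layers, assuming the classic 32×32 input and valid (no-padding) convolutions:

```python
def conv_out(n, k, s=1):
    """Spatial size after a valid convolution: (n - k) // s + 1."""
    return (n - k) // s + 1

def pool_out(n, k=2, s=2):
    """Spatial size after a 2x2 pooling layer with stride 2."""
    return (n - k) // s + 1

# Classic LeNet on a 32x32 input:
n = 32
n = conv_out(n, 5)   # conv1, 5x5 kernel -> 28
n = pool_out(n)      # pool1             -> 14
n = conv_out(n, 5)   # conv2, 5x5 kernel -> 10
n = pool_out(n)      # pool2             -> 5
# 16 channels of 5x5 maps are then flattened for the fully connected layers:
flattened = 16 * n * n  # 400 values
```

The same two helper functions apply to any of the deeper networks below; only the number and configuration of layers changes.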

AlexNet (2012)

Seeing Xiaozao's plight, Xiaofei decided to make some improvements. He used more layers of color and more complex techniques, such as shadows and gradients, to build richer visuals. His works were more colorful than Xiaozao's and could express complex themes better. This is like AlexNet, which adds more convolutional and fully connected layers on top of LeNet and introduces the ReLU activation function, enabling the network to handle more complex images while alleviating the vanishing-gradient problem in deep neural networks.
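ReLU itself is tiny; the point is its gradient. A sketch of why it helps with vanishing gradients:

```python
def relu(x):
    # ReLU passes positive values through unchanged and zeroes out negatives.
    return max(0.0, x)

def relu_grad(x):
    # The gradient is exactly 1 for positive inputs, so it does not shrink
    # as it is multiplied through many layers. A sigmoid's gradient is at
    # most 0.25, so products of many sigmoid gradients vanish quickly.
    return 1.0 if x > 0 else 0.0
```

Stacking many layers multiplies these per-layer gradients together, which is why a factor of 1 (ReLU on active units) survives depth while a factor of at most 0.25 does not.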

VGGNet (2014)

After Xiaomei saw Xiaofei's work, she thought she could do better. She decided to use many more layers of color, each with a consistent style and depth, so that her paintings would have greater depth and detail. This is like VGGNet: all of its convolutional layers use the same small 3×3 kernel and stride, and the number of convolutional layers is greatly increased, producing a richer feature representation.
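The payoff of VGG's uniform small kernels can be shown with a rough parameter count (weights only, biases ignored, assuming C input and C output channels): two stacked 3×3 convolutions cover the same 5×5 receptive field as one 5×5 convolution, with fewer parameters and an extra non-linearity in between.

```python
def conv_params(k, c_in, c_out):
    # Weight count of a k x k convolution, ignoring biases.
    return k * k * c_in * c_out

C = 64
one_5x5 = conv_params(5, C, C)      # 102400 weights
two_3x3 = 2 * conv_params(3, C, C)  # 73728 weights, same 5x5 receptive field
```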

ResNet (2015)

However, Sister Shuang reminded Xiaomei that simply piling up more color layers might make the painting too complicated and muddled. She suggested that Xiaomei add connections between the layers, so that each layer of color could directly influence the final effect. This is like ResNet, which introduces residual connections: shortcuts that let information and gradients flow directly across layers, solving the degradation problem that makes very deep networks hard to train.
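The residual idea fits in one line: the block learns a correction f(x) and the shortcut adds the input back. A minimal sketch with plain Python lists standing in for feature vectors:

```python
def residual_block(x, f):
    # The block outputs x + f(x). Because the input is added back through
    # the shortcut, the identity mapping is trivial to represent (f = 0),
    # and gradients flow straight through the addition.
    return [xi + fi for xi, fi in zip(x, f(x))]

# If the learned transform is zero, the block is exactly the identity:
out = residual_block([1.0, 2.0, 3.0], lambda x: [0.0] * len(x))
```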

DenseNet (2017)

Sister Zhou admired Xiaomei's work very much, but she also noticed a problem: in Xiaomei's paintings, the connection between colors was not very clear. Sister Zhou suggested that when adding colors, each new layer should connect directly with all the previous ones, so that information flows more smoothly between them. It's like DenseNet, where each layer has a direct connection to all previous layers, making the transfer of information (and gradients) through the network more efficient.
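Because every DenseNet layer concatenates its output onto everything before it, the channel count grows linearly with depth. A quick sketch of that bookkeeping (the 64/32/6 numbers are illustrative, not from a specific configuration):

```python
def dense_channels(c0, growth_rate, layers):
    # Each layer appends growth_rate new channels to the concatenation
    # of all earlier feature maps within the dense block.
    return c0 + growth_rate * layers

# e.g. a block entered with 64 channels, growth rate 32, 6 layers:
c = dense_channels(64, 32, 6)  # 256 channels at the block's output
```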

EfficientNet (2019)

Looking at the other students' works, Luo Xin had his own thoughts. He noticed that different color layers require different treatments: some need finer processing, while others need a larger field of view. He therefore recommended balancing the number, depth, and resolution of color layers to process different visual information more effectively. This is like EfficientNet, which achieves better performance by jointly scaling the depth, width, and input resolution of the network.
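EfficientNet's compound scaling rule can be written in a few lines. The constants below are the ones reported in the EfficientNet paper, found by a small grid search and chosen so that raising the compound coefficient phi by one roughly doubles FLOPs:

```python
# Compound scaling: a single coefficient phi scales depth, width and
# resolution together, instead of tuning each dimension separately.
alpha, beta, gamma = 1.2, 1.1, 1.15  # constants from the EfficientNet paper

def scale(phi):
    # Returns the multipliers for (depth, width, resolution) at level phi.
    return alpha ** phi, beta ** phi, gamma ** phi

# The constants approximately satisfy alpha * beta**2 * gamma**2 ~= 2,
# the condition tying one step of phi to a doubling of FLOPs.
depth, width, resolution = scale(2)
```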

At the end of the story, Xiaozao, Xiaofei, and Xiaomei all felt that they had benefited a lot. They no longer simply valued the number of color layers, but began to think about how to better handle and use them. In the end, their work was a huge success, and AI Academy became a true art academy.

Object Detection

R-CNN (2014)

The school organized a treasure hunt that began with teacher Rita releasing a series of clues, each representing a possible target location. This is like the region-proposal step in the R-CNN algorithm, which generates candidate boxes that may contain objects.
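Candidate boxes are judged against ground truth with intersection-over-union (IoU), the standard overlap measure used throughout the R-CNN family. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2):

```python
def iou(a, b):
    # Intersection rectangle of the two boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    # Union = sum of areas minus the overlap counted twice.
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

score = iou((0, 0, 10, 10), (5, 5, 15, 15))  # 25 / 175
```

A proposal with high IoU against a ground-truth box counts as a "found" clue; proposals below a threshold are treated as background.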

Fast R-CNN (2015)

Sister Shuang observed some efficiency problems in the treasure hunt: each clue had to be processed separately, which was time-consuming and labor-intensive. She proposed a faster plan: gather all the students once first, and then hand each student their individual clue, reducing duplicated work. This is like Fast R-CNN, which runs convolution over the entire image once and then classifies and regresses each candidate box from the shared feature map.

Faster R-CNN (2016)

After Xiaoyu saw Sister Shuang's method, she had an even better suggestion: why not let the students propose possible target locations themselves? This would not only improve efficiency but also exercise the students' observation and thinking skills. This is like the Region Proposal Network (RPN) in Faster R-CNN, which uses a convolutional network to propose candidate boxes automatically, greatly improving efficiency.

YOLO series (2015-2020)

Teacher Xiran put forward a bold idea. She believed the treasure hunt should be completed in one step, predicting the targets' locations directly rather than going through a series of complicated stages. This is like the YOLO series, which uses a single-stage detection approach that greatly improves detection speed.
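The "one step" is literal: the original YOLO reshapes its final layer into a fixed grid of predictions and emits everything at once. A sketch of that output size, using YOLOv1's numbers (7×7 grid, 2 boxes per cell, 20 classes):

```python
# Each grid cell predicts B boxes, each with (x, y, w, h, confidence),
# plus C class probabilities shared across the cell's boxes.
S, B, C = 7, 2, 20
outputs_per_cell = B * 5 + C              # 30
total_outputs = S * S * outputs_per_cell  # 1470 numbers, one forward pass
```

One forward pass fills this whole tensor, which is why YOLO is fast: there is no separate proposal stage to run per candidate box.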

SSD (2016)

Sister Zhou reminded everyone that each student has different abilities, so clues of different difficulty are needed to suit them. It's like SSD, which detects on feature maps at multiple scales and can therefore find both large and small objects.

RetinaNet (2017)

Luo Xin raised a question: what if a target is very hard to find? A balancing mechanism is needed so that students who find difficult objects get more reward. This is like RetinaNet, which introduces Focal Loss to rebalance the overwhelming number of easy background samples against the rare positives, enabling the model to better detect hard-to-find targets.
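Focal Loss is ordinary cross-entropy with a down-weighting factor for easy examples. A minimal sketch, where p_t is the model's probability for the correct class:

```python
import math

def focal_loss(p_t, alpha=0.25, gamma=2.0):
    # Plain cross-entropy would be -log(p_t). The (1 - p_t)**gamma factor
    # shrinks the loss of well-classified examples toward zero, so the
    # flood of easy background boxes cannot drown out the rare hard ones.
    return -alpha * (1 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.99)  # confidently correct -> almost no loss
hard = focal_loss(0.1)   # badly wrong -> dominates the gradient
```

With gamma = 0 the factor disappears and this reduces to (alpha-weighted) cross-entropy, which is the baseline RetinaNet compares against.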

After the treasure hunt, all the students felt they had grown, and the teachers better understood how to design such games. Similarly, these algorithms gradually improved the accuracy and speed of object detection through continuous experimentation and progress.

However, the game is not over yet. The thirst for knowledge of students such as Xiaozao, Xiaofei, and Xiaomei pushes them to explore more possibilities, just as researchers in the field of AI constantly tackle new problems and develop new algorithms.

We look forward to more algorithms joining our big family in the near future to provide better solutions to more complex problems. Just like in school, we always expect each student to reach their highest potential and contribute to this big family together.

So far, we have introduced object detection algorithms such as R-CNN, Fast R-CNN, Faster R-CNN, the YOLO series, SSD, and RetinaNet through the story of the treasure hunt. Each algorithm is like a stage in the game: each has its own characteristics and advantages, but their common goal is to find targets faster and more accurately.

Semantic Segmentation

FCN (2015)

The school's first large-scale team sports meet is about to begin, a contest that comprehensively tests the students' ability to cooperate. The contest consists of painting different subjects (representing different object categories) with different colors of paint (representing different class labels) on a large banner (representing the entire input image). The team of Xiaozao, Xiaofei, and Xiaomei went first. They adopted a strategy called FCN: first fold the banner into a small piece (representing convolution and pooling, which reduce resolution), then gradually unfold it (representing upsampling, which restores resolution), painting as they unfold.
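The "unfolding" step can be illustrated with the crudest possible upsampler, nearest-neighbor repetition, standing in for FCN's learned transposed convolutions (shown in 1-D for brevity):

```python
def upsample_nearest(row, factor):
    # Repeat each value `factor` times: a toy stand-in for the learned
    # upsampling FCN uses to bring coarse predictions back to the
    # input resolution so every pixel receives a class label.
    return [v for v in row for _ in range(factor)]

coarse = [1, 2]                      # low-resolution class predictions
fine = upsample_nearest(coarse, 2)   # restored to higher resolution
```

Real FCNs learn the upsampling weights and fuse predictions from several depths, but the shape bookkeeping is exactly this: shrink, predict, expand back.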

U-Net (2015)

Immediately afterwards, Sister Zhou and Xiran joined forces with a strategy called U-Net. Its characteristic is that while folding the banner it saves details from each folding step (representing encoder feature maps), and then adds these details back when unfolding (representing skip connections). This ensures that fine details of the painting are not lost.

DeepLab Series (2015-2018)

Next, Sister Shuang and Luo Xin formed a team with a strategy called DeepLab. They not only adopted a folding-and-unfolding strategy similar to U-Net's, but also added a technique called dilated (atrous) convolution, which gathers information from a larger area without shrinking the banner. Over successive years of competition their strategy kept improving, for example with the introduction of the ASPP (Atrous Spatial Pyramid Pooling) module, which further improved the accuracy of the painting.
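Dilation spreads a kernel's taps apart, enlarging its footprint without adding weights or reducing resolution. The effective size follows a simple formula:

```python
def effective_kernel(k, d):
    # A k x k kernel with dilation d covers k + (k - 1) * (d - 1) pixels
    # per side: same parameter count, larger receptive field, and no
    # downsampling of the feature map.
    return k + (k - 1) * (d - 1)

# A 3x3 kernel at dilations 1, 2, 4:
sizes = [effective_kernel(3, d) for d in (1, 2, 4)]  # 3, 5, 9 pixels wide
```

ASPP exploits exactly this: several dilation rates applied in parallel sample context at multiple scales from the same feature map.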

Instance Segmentation

Mask R-CNN (2017)

The last round of the competition was handled by Rita and her husband Brother Chao, and their task was instance segmentation. Their strategy was Mask R-CNN, which must not only recognize each subject on the banner but also distinguish different individuals of the same subject (for example, different people). They added a brand-new branch to Faster R-CNN that generates a precise contour for each individual (representing a pixel-level mask). This branch runs in parallel with the existing localization and classification branches, so each individual is accurately identified and its boundary accurately marked.

This is the story of semantic segmentation and instance segmentation at our school. Each algorithm is like a different teacher or student, each with its own characteristics and strategies, and together they make up this rich and colorful learning environment. However different their strategies are, their goal is the same: to better understand and interpret the world, and to help us better understand and interpret data.

Video Understanding

One day, while biology teacher Rita was preparing a course, a question occurred to her: how can students be made to understand the complex actions and events in a video? It's like understanding a complex biological process. At that moment, Brother Chao's ex-girlfriend You Yajiang (representing C3D), who used to be a professional video analyst, came over.

C3D (2014)

You Yajiang said: "My previous job was analyzing video. I think we can treat a video as a sequence of frames, like a serial, and analyze the frames together rather than one at a time. That way we can understand the actions and events in the video." Just like the C3D model, which was among the first to extend convolutional neural networks to 3D, processing the temporal and spatial information of a video simultaneously and thus better understanding the actions and events in it.

I3D (2017)

At this point, Xiran (representing I3D) joined the discussion and proposed: "We can actually look at this problem from two directions: time and space. Temporal analysis lets us understand how an action unfolds, and spatial analysis lets us understand its shape." This is like the I3D model, an improvement on C3D that "inflates" the filters of a pretrained 2D network into 3D, so it can process the temporal and spatial information of a video together and understand more complex actions and events.

TSN (2016)

Then Sister Shuang (representing TSN) said: "I think we should not only analyze individual frames but also consider the order and relationships between them. Just like solving a math problem: we consider each step, but also the logic connecting them." This is like the TSN model, which divides a video into several segments (three in the original setup), samples a snippet from each, learns features for each snippet, and finally fuses them, giving a better understanding of the actions and events in the video.

TRN (2018)

Finally, Xiaomei (representing TRN) said: "I think we can let each frame 'see' the other frames and 'talk' to them; that way we can better understand the actions and events in the video." This is like the TRN model, which introduces a relational network so that the model attends not only to the content of individual frames but also to the temporal relations between frames, and uses those relations to understand the actions and events in the video.
So Rita found her answer: she would combine the teachers' suggestions to design a lesson plan that lets students understand the actions and events in a video from different angles. Through this story, we have also covered the basic principles and ideas of four video understanding models: C3D, I3D, TSN, and TRN.

Let's remember, though, that whatever the model, it is designed to better understand the world. Just like the teachers at the school: although they teach different courses, their goal is the same, to help students understand the world and become better versions of themselves.

Self-Supervised Learning

Biology teacher Rita decided to introduce a new learning style to her students, which is called self-supervised learning.

SimCLR (2020)

Suppose you are invited to play board games at You Yajiang's house on the weekend, and she has prepared a game called "Find the Similarities". Everyone must find two cards from the pile that are similar in shape, color, size, and so on. Through this game, everyone not only strengthened their friendships but also exercised their observation skills. This game is like the SimCLR algorithm in self-supervised learning: two different augmentations of the same image are treated as similar, and the model's goal is to map them to nearby positions in feature space. In this way, the model can learn image features on its own, without human annotation.
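"Nearby positions" is usually measured with cosine similarity. A toy sketch with hand-picked embedding vectors (the numbers are illustrative, not from a trained model): two views of the same image score high, a different image scores low, and SimCLR's contrastive loss pushes these scores further apart.

```python
import math

def cosine_similarity(u, v):
    # Dot product of the vectors divided by the product of their norms.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

view_a = [1.0, 0.9, 0.1]   # embedding of one augmentation of an image
view_b = [0.9, 1.0, 0.0]   # the other augmentation of the same image
other  = [-1.0, 0.2, 0.8]  # embedding of a different image

pos = cosine_similarity(view_a, view_b)  # positive pair: high
neg = cosine_similarity(view_a, other)   # negative pair: low
```

SimCLR plugs similarities like these into a softmax over one positive and many negatives (the NT-Xent loss), so training raises `pos` relative to every `neg`.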

MoCo (2020)

Then Sister Shuang brought out a math game called "Memory Challenge". Everyone must remember a series of numbers and the order in which they appeared, which demands concentration and a trade-off between memorizing and discriminating. This is like the MoCo algorithm in self-supervised learning: MoCo maintains a large queue of encoded examples, updated by a slowly changing momentum encoder, as a dynamic dictionary of negatives. By comparing the current example against the examples in this "memory", the model learns rich features.

BYOL (2020)

Finally, Xiaozao came up with an idea. He said: "We don't need competitions, and we don't need to decide who is right and who is wrong. We just need to learn and grow together, and that's enough." Xiaozao's words deeply moved everyone, so they decided that each person would share something they had learned, and everyone would discuss and learn together. This is like the BYOL algorithm in self-supervised learning: BYOL does not rely on negative samples, but instead learns features by enforcing consistency between two views of the same image. In this way, the model can focus on learning the intrinsic properties of images rather than on distinguishing different images.
After this weekend of study, Xiaozao, Xiaofei, and Xiaomei not only improved their knowledge but also got to know their teachers better. They understood that finding similarities (SimCLR), memorizing (MoCo), and learning from each other (BYOL) are all important, both in study and in life.

Generative Models

At this school, the students' creativity and imagination are fully developed, and they begin to learn and apply generative models.

GAN (2014)

First, Sister Shuang introduced GANs (Generative Adversarial Networks) in class. She explained to the students that a GAN is like a contest between a forger (the generator) and a connoisseur (the discriminator) who specializes in spotting fakes. The forger's goal is to create works that the connoisseur mistakes for the real thing, while the connoisseur's task is to tell real works from fakes as reliably as possible.

DCGAN (2015)

Sister Zhou conveyed the concept of DCGAN (Deep Convolutional Generative Adversarial Network). She explained that it is like upgrading the forger's brushes and technique so he paints details better, while also sharpening the connoisseur's eyes so he better understands the structure and style of a painting.

Pix2Pix (2017)

Next, Xiaozao and Xiaofei were drawn to the principle of Pix2Pix. It is like a magic converter that changes one kind of image into another, such as turning a daytime landscape into a nighttime one. They felt the process was like placing a definitive guidebook between the painter and the connoisseur: Pix2Pix learns from paired examples that tell the painter exactly how each painting should look.

CycleGAN (2017)

Then came CycleGAN, which fascinated Xiaomei. Like Pix2Pix, it converts one kind of image into another, but it needs no matched before-and-after examples: it learns from two unpaired collections of images. Its trick is cycle consistency: after converting a daytime scene into a nighttime one, converting it back should recover the original daytime scene, which keeps the two converters honest.

BigGAN (2018)

Later, Rita showed everyone the principle of BigGAN. She said it was like giving the painter a huge drawing board and plenty of paint, allowing him to create larger, more detailed, and more realistic paintings.

StyleGAN (2018-2020)

Finally, Xiran and Brother Chao jointly demonstrated the principle of StyleGAN. They likened it to giving the painter more freedom and flexibility over his artistic style, even letting him mix and switch different styles.

However, while everyone was busy exploring and innovating, You Yajiang quietly observed from the sidelines. She knew this was only the beginning of innovation, and that more possibilities awaited them in the future.

OpenAI's contribution

OpenAI's contributions also have a significant impact on the school's learning process.

DALL-E (2021)

First came the principle of DALL-E. Rita told everyone that DALL-E is like a super painter: it can paint any image people imagine, even the most bizarre, never-before-seen ones. It's like giving Brother Chao a list of desired elements, after which he creates a unique painting precisely according to that list. But this time the list includes not only objects, but also scenes, styles, emotions, and even abstract concepts. This ability surprised everyone.

CLIP (2021)

Then came the principle of CLIP, which Xiaoyu explained from the perspective of language. She said CLIP is like an all-round scholar who can read both pictures and words: show it a text description or an image, and it can understand the meaning and find the association between the two. For example, if you tell it "this picture depicts a red umbrella in the rain", it can not only understand the text but also pick out the painting that best matches the description from a large collection. It is like Luo Xin picking the most suitable candidate for a game from a crowd of players by watching their actions and listening to their conversations.

However, You Yajiang watched all this calmly from the sidelines. She knew that although the abilities of DALL-E and CLIP seem very powerful, they are learned from enormous amounts of data, which brings its own challenges. You Yajiang believes this is like seeing the world in a mirror: it looks real, but it is not completely real. In her view, the real challenge for artificial intelligence is to make it truly understand the world, not merely imitate and associate.

Origin blog.csdn.net/weixin_42010722/article/details/130792531