Understanding the limitations of convolutional neural networks, one of AI's greatest achievements

Author | Ben Dickson

Translator | Champagne Supernova

Image | CSDN, downloaded from Vision China (VCG)

Published by | CSDN (ID: CSDNnews)

After a long period of silence, artificial intelligence is entering a new era of vigorous development, driven mainly by the rapid progress of deep learning and artificial neural networks in recent years. More precisely, the renewed interest in deep learning is largely due to the success of convolutional neural networks (CNNs), a neural network architecture that is especially good at processing visual data.

But what would you think if someone told you that convolutional neural networks have fundamental flaws? That is exactly the point raised by Professor Geoffrey Hinton, known as "the originator of deep learning" and "the father of neural networks", in his keynote speech at AAAI 2020. The AAAI (Association for the Advancement of Artificial Intelligence) conference is one of the top artificial intelligence conferences held each year.

Hinton attended the conference together with Yann LeCun and Yoshua Bengio. The three deep learning giants, joint winners of the Turing Award, are known in the industry as the "godfathers of deep learning". Hinton spoke about the limitations of convolutional neural networks (CNNs) and about capsule networks, which he suggested are his next breakthrough direction in artificial intelligence.

As in all of his speeches, Hinton went into many technical details about what makes convolutional neural networks inefficient and very different from the human visual system. This article elaborates on some of the main points he made at the conference. But before we get to those points, let us review some basics of artificial intelligence and the background of why convolutional neural networks (CNNs) became so important to the AI community.

Computer vision solutions

In the early days of artificial intelligence, scientists tried to create computers that could "see" the world the way humans do. Those efforts gave rise to an entirely new field of research: computer vision.

Early computer vision research relied on symbolic artificial intelligence, in which every rule must be specified by a human programmer. The problem is that not every function of the human visual apparatus can be broken down into explicit programmatic rules, so the applicability and success of this approach remained very limited.

A different approach is machine learning. In contrast to symbolic artificial intelligence, machine learning algorithms are given a general structure and develop their own behavior by examining training examples. However, most early machine learning algorithms still required a great deal of manual work to engineer the components that detect relevant features in images.


Convolutional neural networks (CNNs), unlike the two approaches above, are end-to-end artificial intelligence models that develop their own feature-detection mechanisms. A well-trained multi-layer convolutional neural network automatically recognizes features in a hierarchical way, from simple edges and corners up to complex objects such as human faces, chairs, cars, and dogs.
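
To make the idea concrete, here is a rough, illustrative PyTorch sketch (not from the original article, and assuming the torch package is installed) of such an end-to-end stack: each convolutional stage learns its own filters from data instead of relying on hand-engineered feature detectors.

```python
# A minimal convolutional feature hierarchy: early layers respond to simple
# patterns such as edges and corners, later layers combine them into
# increasingly complex features, and a linear head maps them to class scores.
import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)
classifier = nn.Sequential(nn.Flatten(), nn.Linear(64 * 28 * 28, 10))

x = torch.randn(1, 3, 224, 224)     # a dummy RGB image
logits = classifier(features(x))    # end to end: pixels in, class scores out
```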

Convolutional neural networks (CNNs) were first introduced in the 1980s by LeCun, then a postdoctoral research associate in Hinton's lab at the University of Toronto. But because of CNNs' huge demands on computation and data, they were put on hold and saw very limited adoption at the time. It took three decades of progress, along with tremendous advances in computing hardware and data-storage technology, for convolutional neural networks to begin realizing their full potential.

Today, thanks to large computing clusters, dedicated hardware, and massive amounts of data, convolutional neural networks are widely and usefully applied to image classification and object recognition.

Each layer of a convolutional neural network extracts specific features from the input image.

The difference between convolutional neural networks (CNNs) and human vision

In his speech at the AAAI conference, Hinton pointed out: "CNNs make full use of end-to-end learning. They get a big win from the fact that if a feature detector is good in one place, it is good elsewhere too, and that is why they have been so successful. It allows them to combine evidence and generalize nicely across positions. But they are very different from human perception."

One of the key challenges of computer vision is dealing with the variation of data in the real world. Our visual system can recognize objects from different angles, against different backgrounds, and under different lighting conditions. When objects are partially obscured by other objects or colored in eccentric ways, our visual system uses cues and other pieces of knowledge to fill in the missing information and reason about what we are seeing.

Creating artificial intelligence that replicates this object-recognition ability has turned out to be very difficult.

Hinton said: "CNNs are designed to cope with translations" of objects. This means that a well-trained convolutional neural network can identify an object regardless of where it appears in an image. But CNNs are not nearly as good at dealing with other effects of changing viewpoints, such as rotation and scaling.

According to Hinton, one way to solve this problem is to use 4D or 6D maps to train the AI and then perform object detection. "But that just gets prohibitively expensive," he added.

Currently, our best solution is to gather a massive number of images that show each object in various positions. We then train a convolutional neural network on this huge dataset, hoping that it will see enough examples of every object to generalize and detect objects with reliable accuracy in the real world. Datasets such as ImageNet, which contains more than 14 million annotated images, aim to achieve exactly that.

Hinton said: "That's not very efficient. We would like neural nets to generalize to new viewpoints effortlessly. If they learn to recognize something, and you make it ten times as big and rotate it 60 degrees, it shouldn't cause them any problem at all. We know computer graphics is like that, and we would like to make neural nets more like that."

In fact, ImageNet, which is currently the go-to benchmark for evaluating computer vision systems, has proven to be flawed. Despite its huge size, the dataset fails to capture all the possible angles and positions of objects. It is mostly composed of images taken under ideal lighting conditions and from known angles.

This is acceptable for the human visual system, which can easily generalize its knowledge. In fact, after we observe an object from a few angles, we can usually imagine what it would look like in new positions and under different visual conditions.

But convolutional neural networks (CNNs) need detailed examples of the cases they are expected to handle, and they do not have the creativity of the human mind. Deep learning developers usually try to work around this problem through a process called "data augmentation", in which they flip images or rotate them slightly before training the neural network. In effect, the CNN is trained on multiple copies of each image, each slightly different from the others. This helps the AI generalize over variations of the same object. To some extent, data augmentation makes the model more robust.
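
As a minimal illustration (not part of the original article), and assuming PyTorch and torchvision are installed, a typical augmentation pipeline might look like the sketch below; the folder path and the exact parameter values are hypothetical.

```python
# A minimal data-augmentation sketch with torchvision (assumed installed).
# Every epoch, each training image is randomly flipped, slightly rotated,
# and color-jittered, so the CNN effectively sees many variants of it.
import torch
from torchvision import datasets, transforms

train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),        # mirror half the time
    transforms.RandomRotation(degrees=15),         # small random rotation
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Hypothetical folder of training images, one subfolder per class.
train_set = datasets.ImageFolder("data/train", transform=train_transforms)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
```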

However, data augmentation does not cover the extreme cases that convolutional neural networks and other neural networks cannot handle, such as an upturned chair or a crumpled T-shirt lying on a bed. These are real-life variations that cannot be produced through pixel manipulation.

ImageNet vs. reality: In ImageNet (left column), the objects are placed neatly, under ideal background and lighting conditions. The real world is much more chaotic (source: objectnet.dev)

Some researchers have tried to tackle this generalization problem by creating computer vision benchmarks and training datasets that better represent the messy reality of the real world. But while these can improve the results of current artificial intelligence systems, they do not solve the fundamental problem of generalizing across viewpoints. There will always be new angles, new lighting conditions, new colorings, and new poses that these datasets do not contain, and such new situations will confound even the largest and most advanced AI systems.

Differences can be dangerous

From what has been presented above, it is clear that convolutional neural networks (CNNs) recognize objects in a way that is very different from humans. But these differences are not limited to weak generalization and the need for many more examples to learn an object. The internal representations that CNNs build of objects are also very different from those of the biological neural network in the human brain.

How does this manifest itself? "I can take an image and add a tiny bit of noise to it, and the convolutional neural network will recognize it as something completely different, while I can barely see any change at all. That seems like evidence that CNNs are actually using information that is completely different from ours to recognize images," Hinton said in his AAAI keynote.

These slightly modified images are known as "adversarial examples", and they are a hot area of research in artificial intelligence.


Adversarial examples can cause neural networks to misclassify images while appearing unchanged to the human eye.
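
One of the simplest recipes described in the research literature for crafting such perturbations is the fast gradient sign method (FGSM). The sketch below is illustrative only and is not the specific technique Hinton referred to; it assumes PyTorch, a pretrained classifier `model`, an input tensor `image` with pixel values in [0, 1], and its correct `label`.

```python
# A minimal FGSM sketch in PyTorch (assumed installed).
import torch
import torch.nn.functional as F

def fgsm_example(model, image, label, epsilon=0.01):
    """Return a slightly perturbed copy of `image` that the model may misclassify."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)   # loss w.r.t. the true label
    loss.backward()
    # Nudge every pixel a tiny step in the direction that increases the loss.
    perturbed = image + epsilon * image.grad.sign()
    return perturbed.detach().clamp(0, 1)         # keep pixels in a valid range
```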

Hinton said: "It's not that they are wrong; they are just doing it in a completely different way, and their completely different way generalizes somewhat differently."

But many examples show that adversarial perturbations can be extremely dangerous. It is cute and funny when your image classifier mislabels a panda as a gibbon. But when the computer vision system of a self-driving car misses a stop sign, when a malicious hacker bypasses a facial-recognition security system, or when Google Photos labels humans as gorillas, you are in serious trouble.

There has been a lot of research on detecting adversarial perturbations and on building artificial intelligence systems that are robust against them. But adversarial examples also remind us that our visual system has evolved over countless generations to handle the world around us, and that we have also built our world to accommodate our visual system. Therefore, as long as our computer vision systems work in ways that are fundamentally different from human vision, they will remain unpredictable and unreliable unless they are supported by complementary technologies such as lidar and radar mapping.

Coordinate systems and part-whole relationships are important

Another problem Geoffrey Hinton pointed out in his AAAI keynote is that convolutional neural networks cannot understand images in terms of objects and their parts. They recognize images as blobs of pixels arranged in distinct patterns, and they have no explicit internal representation of entities and their relationships.

"You can think of a CNN as centered on each pixel location: it builds up a richer and richer description of what is happening at that pixel location, based on more and more context. In the end, the descriptions get so rich that you know what objects are present in the image. But CNNs do not explicitly parse images," Hinton said.

Our understanding of how objects are composed of parts helps us make sense of the world and of things we have never seen before, such as this peculiar teapot.


Breaking down an object into multiple parts helps us understand its nature. Is this a toilet or a teapot? (Source: Smashing lists)

Convolutional neural networks also lack coordinate frames, which are a fundamental component of human vision. Basically, when we see an object, we develop a mental model of its orientation, and this helps us parse its different features. For example, in the image below, consider the face on the right. If you turn it upside down, you will see the face on the left. But in reality, you do not need to physically flip the image to see the other face: simply adjusting your mental coordinate frame lets you see both faces, regardless of the image's orientation.

Hinton pointed out: "You have completely different internal percepts depending on which coordinate frame you impose. Convolutional neural networks really can't explain that. You give them an input, they have one percept, and the percept doesn't depend on an imposed coordinate frame. I think that is related to adversarial examples, and to the fact that CNNs perceive in a way that is very different from people."

Learning from computer graphics

Hinton argued in his AAAI speech that a very sensible way to approach computer vision is to do inverse graphics. A 3D computer-graphics model is composed of a hierarchy of objects. Each object has a transformation matrix that defines its translation, rotation, and scale relative to its parent object. The transformation matrix of the top-level object in each hierarchy defines its coordinates and orientation relative to the world origin.

For example, consider the 3D model of a car. The base object has a 4 × 4 transformation matrix which says that the center of the car is at coordinates (X = 10, Y = 10, Z = 0) with rotation (X = 0, Y = 0, Z = 90). The car itself is composed of many objects, such as the wheels, chassis, steering wheel, windshield, gearbox, and engine. Each object has its own transformation matrix that defines its position and orientation relative to the parent matrix (the center of the car). For example, the center of the front-left wheel is located at (X = -1.5, Y = 2, Z = -0.3). The world coordinates of the front-left wheel can be obtained by multiplying its transformation matrix by the parent matrix.

Some of these objects may have their own set of children. For example, a wheel is composed of a tire, a rim, a hub, nuts, and other parts, and each of these children has its own transformation matrix.

Using this hierarchy of coordinate systems makes it very easy to locate and visualize objects regardless of their pose, orientation, or the viewpoint. When you want to render an object, each triangle in the 3D object is multiplied by its transformation matrix and those of its parent objects, then oriented to the viewpoint (another matrix multiplication), and finally transformed to screen coordinates before being rasterized into pixels.
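
As a small illustration (not from the talk), the matrix arithmetic in the car example above can be sketched in Python with numpy; the helper function, the Z-axis-only rotation, and the printed result are purely for demonstration.

```python
# A minimal numpy sketch of the hierarchy described above (values from the text).
# Each node stores a 4x4 transform relative to its parent; world coordinates
# come from multiplying transforms down the hierarchy.
import numpy as np

def transform(tx, ty, tz, rz_deg=0.0):
    """Build a 4x4 matrix for a translation plus a rotation about the Z axis."""
    r = np.radians(rz_deg)
    m = np.eye(4)
    m[:3, :3] = [[np.cos(r), -np.sin(r), 0],
                 [np.sin(r),  np.cos(r), 0],
                 [0,          0,         1]]
    m[:3, 3] = [tx, ty, tz]
    return m

car_to_world = transform(10, 10, 0, rz_deg=90)   # the car's base object
wheel_to_car = transform(-1.5, 2, -0.3)          # the front-left wheel

# World position of the wheel's center: parent matrix times child matrix,
# applied to the wheel's local origin.
wheel_center_world = car_to_world @ wheel_to_car @ np.array([0.0, 0.0, 0.0, 1.0])
print(wheel_center_world[:3])   # approximately [8.0, 8.5, -0.3]
```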

"If you say to someone who works in computer graphics, 'Can you show me that from another angle?', they won't say, 'Oh, well, I'd love to, but we didn't train from that angle, so we can't show it to you from that angle.' They just show it to you from another angle, because they have a 3D model: they model the spatial structure as relationships between parts and wholes, and those relationships don't depend on the viewpoint at all," Hinton said. "I think it's crazy not to make use of that beautiful structure when dealing with images of 3D objects."

The capsule network is another of Hinton's ambitious new projects, an attempt at inverse computer graphics. Although capsule networks deserve a separate discussion of their own, the basic idea behind them is to take an image, extract its objects and their parts, define their coordinate frames, and create a modular representation of the image's structure.

Capsule networks are still under development, and they have gone through several iterations since they were introduced in 2017. But if Hinton and his colleagues succeed in making them work, we will be much closer to replicating human vision.

This article is a CSDN translation; please credit the source when reposting.

【END】


Original post: blog.csdn.net/csdnnews/article/details/105672151