Computer Vision (CV)

Why is computer vision important?

A large share of the human cerebral cortex, by some estimates nearly 70%, is involved in processing visual information. Vision is the most important channel through which humans obtain information, bar none.

In the online world, photos and videos (which are sequences of images) are also exploding in volume.

The figure below shows the trend in the proportion of new data on the internet: gray is structured data, blue is unstructured data (mostly images and videos). Images and videos are clearly growing at an exponential rate.

Image and video data are growing rapidly

Before the advent of computer vision, images were a black box to computers.

To a machine, a picture is just a file. The machine does not know what the picture shows; it only knows the picture's dimensions, file size, and format.

 

If computers and artificial intelligence are to play an important role in the real world, they must be able to understand pictures. This is the problem that computer vision solves.

What is computer vision (CV)?

Computer vision is an important branch of artificial intelligence. The problem it solves is understanding the content of images.

For example:

  • Is the pet in the picture a cat or a dog?
  • Is the person in the picture Lao Zhang or Lao Wang?
  • What items are on the table in this photo?

 

What is the principle of computer vision?

The current mainstream computer vision methods, based on deep learning, work on principles fairly similar to those of the human brain.

Human vision works roughly as follows: it starts with raw signal intake (the eye takes in pixels), followed by preliminary processing (cells in the visual cortex detect edges and orientations), then abstraction (the brain determines that the object in front of you is round in shape), and then further abstraction (the brain concludes that the object is a balloon).

How the human brain sees pictures

The machine's approach is similar: build a multi-layer neural network in which the lower layers recognize primitive image features, several low-level features combine into higher-level features, and finally, through this layered composition, classification is performed at the top.
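As a minimal sketch of what the lowest layer of such a network does, the pure-Python snippet below applies a single hand-written 3×3 edge-detecting kernel to a tiny grayscale "image" (the image values and the Sobel-style kernel are illustrative assumptions; in a real CNN, many such kernels are learned from data rather than written by hand):

```python
# One "lowest-layer" operation of a convolutional network: a 3x3 kernel
# slid across the image, responding strongly where a vertical edge lies.

def conv2d(image, kernel):
    """Valid-mode 2D convolution (strictly, cross-correlation, as in CNNs)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            s = sum(image[y + j][x + i] * kernel[j][i]
                    for j in range(kh) for i in range(kw))
            row.append(s)
        out.append(row)
    return out

# A 5x5 "image": dark on the left (0), bright on the right (9).
image = [[0, 0, 0, 9, 9]] * 5

# Sobel-style kernel that responds to vertical edges.
kernel = [[-1, 0, 1],
          [-2, 0, 2],
          [-1, 0, 1]]

feature_map = conv2d(image, kernel)
print(feature_map[0])  # [0, 36, 36]: strongest where the dark/bright boundary is
```

Stacking many learned kernels, layer upon layer, turns edge responses like this into the "higher-level features" described above.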

 

Two major challenges in computer vision

Understanding a picture is a very simple thing for a human, but a very difficult thing for a machine. Here are two typical difficulties:

Features are difficult to extract

The same cat photographed from different angles, under different lighting, and in different poses produces very different pixels. Even the same photo, once rotated 90 degrees, is very different at the pixel level!

So the content of two pictures can be similar or even identical while the pixels differ enormously. This is a major challenge for feature extraction.
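A toy illustration of this, using a hypothetical 2×2 grayscale "image" in pure Python: rotating the image 90 degrees changes nothing about its content for a human viewer, yet every pixel position now holds a different value:

```python
# The "same" picture after a 90-degree rotation: identical content,
# but at the pixel level nothing matches up anymore.

def rotate90(image):
    """Rotate a 2D list of pixel values 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

image = [[10, 20],
         [30, 40]]

rotated = rotate90(image)
print(rotated)  # [[30, 10], [40, 20]]

# Count positions whose pixel value changed.
changed = sum(1 for y in range(2) for x in range(2)
              if image[y][x] != rotated[y][x])
print(changed, "of 4 pixels differ")  # 4 of 4 pixels differ
```

A feature extractor has to see past this kind of variation, which raw pixel comparison cannot.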

The amount of computation required is huge

A typical photo taken on a mobile phone is about 1000×2000 pixels, and each pixel carries 3 RGB values, for a total of 1000 × 2000 × 3 = 6,000,000 values. Every photo means processing 6 million parameters. Then consider the increasingly popular 4K video format, and you can see how daunting this level of computation becomes.
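The arithmetic above can be checked directly, and extended to a single 4K frame (assuming the common 3840×2160 resolution):

```python
# Raw values in one image: width x height x 3 color channels (R, G, B).
def raw_values(width, height, channels=3):
    return width * height * channels

print(raw_values(1000, 2000))   # 6000000  (the phone photo above)
print(raw_values(3840, 2160))   # 24883200 (one 4K frame, roughly 4x more)
```

And a 4K video delivers dozens of such frames every second, which is why naive per-pixel processing does not scale.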

CNNs solve these two major problems

CNNs (convolutional neural networks) belong to deep learning, and they address both of the difficulties above very well:

  1. CNNs can effectively extract features from images.
  2. CNNs can effectively reduce the dimensionality of massive data (without harming feature extraction), greatly lowering the demand on computing power.
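One mechanism behind point 2 is pooling: a pooling layer downsamples a feature map while keeping the strongest responses. A minimal pure-Python sketch of 2×2 max pooling (the feature-map values are made up for illustration):

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2: keep the strongest response in
    each 2x2 block, quartering the amount of data."""
    h, w = len(feature_map), len(feature_map[0])
    return [[max(feature_map[y][x], feature_map[y][x + 1],
                 feature_map[y + 1][x], feature_map[y + 1][x + 1])
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

fmap = [[1, 3, 2, 0],
        [4, 6, 1, 1],
        [0, 2, 9, 8],
        [1, 0, 7, 5]]

pooled = max_pool_2x2(fmap)
print(pooled)  # [[6, 2], [2, 9]]: 16 values reduced to 4
```

Each pooling layer shrinks the data by 4× while preserving where the features fired, which is why deep CNNs stay computationally tractable.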

8 major tasks of computer vision

 

Image classification

Image classification is an important foundational problem in computer vision; the other tasks described below all build on it.

A few typical examples: face recognition, pornographic-image screening, automatically grouping photo albums by person, and so on.

Image classification

Object detection

The goal of the object detection task is, given an image or a video frame, to have the computer find the positions of all objects in it and give the specific category of each one.

Object detection

Semantic segmentation

Semantic segmentation divides the entire image into groups of pixels, which are then labeled and classified. It attempts to understand, semantically, what each pixel in the image is (person, car, dog, tree, ...).

As shown in the figure below, besides identifying people, roads, cars, trees, and so on, we must also determine the boundary of each object.

Semantic segmentation
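Conceptually, a semantic segmentation model's output is just a label map: one class id per pixel, the same size as the input image. A toy pure-Python representation (the 4×4 map and the class names are hypothetical):

```python
# Semantic segmentation output: a label map with one class id per pixel.
CLASSES = {0: "background", 1: "person", 2: "car"}

label_map = [[0, 0, 2, 2],
             [0, 1, 2, 2],
             [0, 1, 0, 0],
             [0, 1, 0, 0]]

# Count how many pixels belong to each class.
counts = {}
for row in label_map:
    for cls in row:
        counts[cls] = counts.get(cls, 0) + 1

for cls, n in sorted(counts.items()):
    print(CLASSES[cls], n)
# background 9
# person 3
# car 4
```

The object boundaries mentioned above fall out of this representation for free: they are simply the places where adjacent pixels carry different class ids.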

Instance segmentation

Going beyond semantic segmentation, instance segmentation distinguishes individual instances of the same class, for example marking 5 cars with 5 different colors. In complex scenes with multiple overlapping objects against varied backgrounds, we must not only classify these objects but also determine their boundaries, their differences, and their relationships to one another!

Instance segmentation

Video classification

Unlike image classification, the object being classified is no longer a still image but a video composed of many frames, which also carries audio, motion information, and so on. Understanding video therefore requires more contextual information: we must not only understand what each frame is and what it contains, but also relate different frames to one another and grasp the context.

Video classification

Human body keypoint detection

Body keypoint detection recognizes human movement and behavior by locating and tracking key joints of the human body, which is crucial for describing human posture and predicting human behavior.

This technology is used in the Xbox Kinect.

Human body keypoint detection

Scene text recognition

Many photos contain some textual information, which plays an important role in understanding the image.

Scene text recognition is the process of converting the textual content of an image into a text sequence even when the image background is complex, the resolution is low, the fonts are diverse, and the layout is arbitrary.

License plate recognition in parking lots and toll stations is a typical application scenario.

Scene text recognition

Object tracking

Object tracking refers to following one or more specific objects of interest through a given scene: after the object is first detected, it is observed across subsequent frames. The classic application is the interaction between video and the real world.

This technology will be used in autonomous driving.

Object tracking

Application scenarios of CV in daily life

Computer vision has a wide range of application scenarios. Here are a few common application scenarios in life.

  1. Face recognition on access control and Alipay
  2. License plate recognition in parking lots and toll stations
  3. Risk identification when uploading images or videos to websites
  4. Various effects and filters on Douyin (which must first locate the face)

Note that scanning barcodes and QR codes is not considered computer vision.

That kind of recognition is based on fixed rules; it does not require processing complex images and uses no AI techniques at all.

Computer vision

Computer vision is the science of how to make machines "see". More concretely, it means using cameras and computers in place of human eyes to identify, track, and measure targets, and to further process the resulting images so that they become more suitable for human viewing or for transmission to instruments for inspection. As a scientific discipline, computer vision studies related theories and technologies, attempting to build artificial intelligence systems that can obtain "information" from images or multi-dimensional data. "Information" here is meant in Shannon's sense: information that can be used to help make a "decision". Since perception can be viewed as extracting information from sensory signals, computer vision can also be viewed as the science of how to make artificial systems "perceive" from images or multi-dimensional data.

Computer vision is an interdisciplinary scientific field that deals with how computers can be made to gain high-level understanding from digital images or videos. From an engineering perspective, it seeks to automate tasks that human vision systems can accomplish.

Computer vision tasks include methods for acquiring, processing, analyzing, and understanding digital images, as well as extracting high-dimensional data from the real world in order to produce numerical or symbolic information, for example, in the form of decisions.

Understanding in this context means converting visual images (input from the retina) into descriptions of the world that can interact with other thought processes and elicit appropriate actions. This kind of image understanding can be seen as unraveling symbolic information from image data using models built from geometry, physics, statistics, and learning theory.

As a scientific discipline, computer vision focuses on the theory behind artificial systems that extract information from images. Image data can take many forms, such as video sequences, views from multiple cameras, or multidimensional data from medical scanners. As a technical discipline, computer vision attempts to apply its theories and models to the construction of computer vision systems. Subfields of computer vision include scene reconstruction, event detection, video tracking, object recognition, 3D pose estimation, learning, indexing, motion estimation, and image restoration.


Origin blog.csdn.net/qq_38998213/article/details/132520987