An In-Depth Understanding of Cameras


Here we focus on digital video cameras. Since cameras provide images for vision and machine learning analysis, it is important to understand how cameras collect and distribute these images.

1. Introduction

Digital video cameras are everywhere. Billions of people have smartphones or tablets with built-in cameras, and hundreds of millions have webcams on their computers.

Digital video has a short history. The first semiconductor image sensor, the CCD, was invented at Bell Laboratories in 1969. The second type, the CMOS sensor, was invented in 1993 at the Jet Propulsion Laboratory in Pasadena, California. In the early 1990s, a convergence of technologies made it possible to stream digital video to consumer-grade computers. The first popular consumer webcam, the Connectix QuickCam, launched in 1994 at $100, with 320×240 resolution and 16 shades of gray. It was amazing at the time.

CMOS technology is now used in the vast majority of sensors in consumer digital video products. Over time, sensor resolution has improved and countless features have been added.

Even with a short history, there are plenty of abbreviations and acronyms to navigate to understand what people are talking about in a given context. It's hard to talk about something if you don't know the proper name.

Here we'll focus on the cameras attached to the Jetson, although you can, of course, attach these to lesser machines. As an example, here's a 4K video camera:


Arducam IMX477 for Jetson Nano and Xavier NX

You can think of the camera as several different parts. First is the image sensor, which collects light and digitizes it. The second part is the optics, which help focus the light on the sensor and provide the shutter. Then there are the electronic circuits that interface with the sensors, collect the images and transmit them.

2. Image sensor

There are two main types of image sensor in use today: CMOS and CCD. CMOS dominates in most low-cost applications. The raw sensor itself captures monochrome (grayscale) images.

Here is an image of the Sony IMX477 image sensor:


2.1 Color image

There are various ways to acquire color images from these sensors. By far the most common approach is a Bayer filter mosaic, an array of tiny color filters laid over the pixel array of the image sensor. The filter pattern is half green, one quarter red, and one quarter blue. The human eye is most sensitive to green, which is why the pattern allocates the extra share to green.

Each filter passes photons of a specific range of wavelengths through to its sensor pixel. For example, a blue filter makes its sensor pixel sensitive to blue light. The pixel generates a signal based on how many photons it receives, in this case how many blue-light photons.


There are other color filter arrays that use the same approach. The Bayer method was patented, so some manufacturers worked around it with alternatives such as CYGM (Cyan, Yellow, Green, Magenta) and RGBE (Red, Green, Blue, Emerald).

Within a Bayer filter, the colors can be arranged in different orders. To cover the combinations, you may see BGGR (Blue, Green, Green, Red), RGBG, GRBG, and RGGB. A demosaicing algorithm interpolates a full-color image from the filtered pixel values.

The raw output of a Bayer-filtered camera is called a Bayer-pattern image. Remember that each pixel is filtered to record only one of the three colors. The demosaicing algorithm examines each pixel and its surrounding neighbors to estimate that pixel's full RGB color. That is why it is important to understand the arrangement of colors in the filter.
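To make this concrete, here is a minimal sketch of demosaicing with OpenCV in Python. The file name is hypothetical, and COLOR_BayerRG2BGR is just one of several Bayer conversion codes; the right one depends on the sensor's actual color arrangement.

```python
import cv2

# Load a raw Bayer-pattern image as a single-channel (grayscale) array.
# "raw_bayer.png" is a hypothetical file; real raw captures may need unpacking first.
bayer = cv2.imread("raw_bayer.png", cv2.IMREAD_GRAYSCALE)
if bayer is None:
    raise SystemExit("could not read raw_bayer.png")

# Demosaic (debayer) into a 3-channel BGR image. The conversion constant must
# match the sensor's filter arrangement; COLOR_BayerRG2BGR is assumed here.
color = cv2.cvtColor(bayer, cv2.COLOR_BayerRG2BGR)

cv2.imwrite("color.png", color)
```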

These algorithms can be simple or complex, depending on the computational elements available on the camera. As you can imagine, this is a hard problem. The algorithms make trade-offs and assumptions about the scene being captured and must respect the time allowed for calculating color values. Depending on the scene and the chosen algorithm, there may be artifacts in the final color image.

Time is an important factor when you are estimating the color of every pixel in real time. Say you are streaming at 30 frames per second: that leaves about 33 milliseconds between frames, and each image should be finished before the next one arrives! With several megapixels to demosaic per frame, that is a lot of work, and accurate color estimation can be the enemy of speed, depending on the algorithm used.
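As a rough illustration of that budget, here is the arithmetic in a few lines of Python (the numbers simply restate the 30 fps, 1080p example above):

```python
fps = 30
frame_budget_ms = 1000 / fps                 # about 33.3 ms between frames
pixels = 1920 * 1080                         # pixels to demosaic per frame
ns_per_pixel = frame_budget_ms * 1e6 / pixels

print(f"Per-frame budget: {frame_budget_ms:.1f} ms")
print(f"Budget per pixel: {ns_per_pixel:.0f} ns")   # roughly 16 ns per pixel
```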

A sensor module contains only the image sensor. The Raspberry Pi Camera V2 (IMX219) and the High Quality camera (IMX477) are two such modules that work on the Jetson Nano and Xavier NX. These sensors transmit raw Bayer-pattern images over the Camera Serial Interface (CSI) bus. The Jetson then uses its onboard Image Signal Processor (ISP) to perform various tasks on the image; the Tegra ISP hardware can be configured to handle demosaicing, auto white balance, downscaling, and more. See the Image Processing and Management documentation for an overview of these capabilities.
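As a sketch of what this looks like in practice, the pipeline below pulls ISP-processed frames from a CSI sensor into OpenCV via GStreamer. The element names (nvarguscamerasrc, nvvidconv) follow typical JetPack releases, but the sensor-id, resolution, and frame rate are assumptions you would adjust for your own camera.

```python
import cv2

# nvarguscamerasrc delivers frames the Tegra ISP has already demosaiced and
# white-balanced; nvvidconv converts them out of NVMM memory so OpenCV can use them.
pipeline = (
    "nvarguscamerasrc sensor-id=0 ! "
    "video/x-raw(memory:NVMM), width=1920, height=1080, framerate=30/1 ! "
    "nvvidconv ! video/x-raw, format=BGRx ! "
    "videoconvert ! video/x-raw, format=BGR ! appsink drop=1"
)

cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)
ok, frame = cap.read()
if ok:
    print("Frame shape:", frame.shape)   # e.g. (1080, 1920, 3)
cap.release()
```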

Camera modules, on the other hand, include processing hardware on the module itself to perform these tasks. Some of these modules have a CSI interface, but more often they use an alternative interface such as USB. While some transmit raw Bayer-pattern images, the ones you are most likely to encounter deliver encapsulated video streams, raw color images, or compressed images.

2.2 Infrared light

Bayer filters are transparent to infrared light, and many image sensors can detect near-infrared wavelengths. For that reason, most color cameras add an IR filter to the lens to help estimate color more accurately.

However, sometimes it is useful to view a scene illuminated by infrared light! Security "night vision" systems typically pair an IR emitter with a camera whose sensor has no IR filter, which lets the camera "see in the dark". An example is the Raspberry Pi NoIR Camera Module V2. This Jetson-compatible camera uses the same IMX219 sensor as the aforementioned V2 RPi camera, but with the IR filter removed.

2.3 Optics

The optics of a digital video camera consist of a lens and a shutter. Most inexpensive cameras use plastic lenses and offer limited manual focus control. There are also fixed-focus lenses with no adjustment at all. Other cameras have glass lenses, and some have interchangeable lenses.

You will hear lenses classified by several terms. Typically, a lens is specified by its focal length. The focal length may be fixed, or variable, in which case it is called a zoom lens.

Another classification is aperture, denoted by an f-number such as f/2.8. Lenses can have a fixed or variable aperture. The size of the aperture determines how much light can reach the sensor: the larger the aperture, the more light passes through the lens, and the larger the aperture, the smaller the f-number.
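Numerically, the f-number N is the focal length divided by the aperture diameter, and light gathering scales with the aperture area, so halving the f-number roughly quadruples the light. A tiny sketch of that relationship:

```python
def relative_light(f_number: float, reference: float = 2.8) -> float:
    """Light gathered relative to a reference f-number (area scales as (1/N)^2)."""
    return (reference / f_number) ** 2

for n in (1.4, 2.0, 2.8, 4.0):
    print(f"f/{n}: {relative_light(n):.2f}x the light of f/2.8")
```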

The lens field of view (FoV) is also important. It is usually expressed in degrees, either in the horizontal and vertical dimensions or diagonally, with the optical center of the lens as the vertex of the angle.
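If you know the sensor's active dimensions and the lens focal length, a simple pinhole-model estimate of the field of view is possible. The sketch below uses hypothetical numbers (roughly a 1/2.3-inch sensor with a 6 mm lens); real lenses, especially wide-angle ones, deviate from this model.

```python
import math

def fov_degrees(sensor_dim_mm: float, focal_length_mm: float) -> float:
    """Pinhole-model field of view for one sensor dimension."""
    return math.degrees(2 * math.atan(sensor_dim_mm / (2 * focal_length_mm)))

# Hypothetical values: ~6.3 mm x 4.7 mm active sensor area, 6 mm focal length.
print(f"Horizontal FoV: {fov_degrees(6.3, 6.0):.1f} degrees")
print(f"Vertical FoV:   {fov_degrees(4.7, 6.0):.1f} degrees")
```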


A fourth consideration is the lens mount, for cameras with interchangeable lenses. Interchangeable lenses provide greater flexibility when capturing images. In the Jetson world, you have probably heard of the M12 mount, which uses a metric M12 thread with a 0.5 mm pitch and is also known as an S-mount; its lens holder typically attaches directly to the sensor's PCB. Another common mount is the C or CS mount. The Raspberry Pi High Quality camera uses this type of mount.


Camera shutters can be mechanical or electronic. The shutter exposes the sensor for a predetermined period of time, and there are two main exposure methods. The first is the rolling shutter, which scans across the sensor in steps, either horizontally or vertically. The second is the global shutter, which exposes the entire sensor at once. Rolling shutter is the most common because it tends to be less expensive to implement on CMOS devices, although it can produce artifacts such as smearing of fast-moving objects in the scene.

For scenes without fast-moving objects, a rolling shutter may be a fine choice. For other applications it may not be acceptable. For example, a mobile robot is already a shaky platform, and smeared images may not be good enough to work with; a global shutter is more appropriate there.

2.4 Electronic circuit

The electronic circuitry of a digital video camera controls image acquisition, interpolation, and image output. Some cameras place this circuitry on the sensor chip itself (many phone cameras do this to save space), while others use external circuitry to handle the task.

Sensor modules, on the other hand, only need to interface with a host that handles data acquisition directly. The Jetson's Tegra ISP hardware handles this task.

Data compression is an important task. Video data streams can be very large. Most cheap webcams have a built-in ASIC for image interpolation and video compression.

Newer "smart" cameras on the market may have additional circuitry to handle the video data stream, including more complex tasks such as computer vision or depth image processing. These more sophisticated cameras may combine multiple sensors in a single camera.

For example, an RGBD camera (Red, Green, Blue, Depth) may have two sensors for calculating depth and another sensor for capturing color images. Some of these cameras use infrared illuminators to help the depth sensor in low light situations.

Electronic circuitry also transfers the video data from the camera to the host device over one of several physical paths. On the Jetson, this is the MIPI Camera Serial Interface (MIPI CSI) or the familiar USB. Third parties offer GMSL (Gigabit Multimedia Serial Link) connectors on Jetson carrier boards. GMSL allows longer transmission distances than typical CSI ribbon cables by using serializer/deserializer buffers on the video data stream. You might see these types of connections in robots or cars, for example.


GMSL Camera Connector

2.5 Data Compression and Transmission

This is where things get interesting for us. The data travels over a wire or a network, so how do we interpret it?

We talked about creating full-color images. Usually we think of these as three channels: Red, Green, and Blue (RGB). The number of bits in each channel determines how much "true" color can be represented. Eight bits per channel is common, and you may see 10 bits; professional video uses even higher bit depths. The more bits, the more colors you can represent.

Assume 8 bits per color channel, or 24 bits (3 bytes) per pixel. A 1920×1080 image has 2,073,600 pixels, so 2,073,600 pixels × 3 bytes = 6,220,800 bytes per frame. At 30 frames per second, that is 186,624,000 bytes per second. And if you are using 4K video, it is four times that amount.
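The same arithmetic in Python, so you can plug in your own resolution and frame rate:

```python
def raw_bandwidth_bytes_per_sec(width: int, height: int, fps: int,
                                bytes_per_pixel: int = 3) -> int:
    """Uncompressed video bandwidth in bytes per second."""
    return width * height * bytes_per_pixel * fps

print(raw_bandwidth_bytes_per_sec(1920, 1080, 30))   # 186,624,000 bytes/s for 1080p30
print(raw_bandwidth_bytes_per_sec(3840, 2160, 30))   # 746,496,000 bytes/s for 4K30 (4x)
```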

As you have probably noticed by now, we take a Bayer-pattern image and expand it into full color. Couldn't we simply transmit the Bayer image itself, along with an identifier for the sensor's color arrangement? Of course we could! However, that forces the receiver to do the color conversion, which may not be the best solution.

2.5.1 Data Compression Types

There are many ways to reduce the amount of image data transferred from a video stream. Usually this is done by:

  • color space conversion
  • lossless compression
  • lossy compression
  • temporal compression

We will not delve deeply into these topics here; entire industries are dedicated to them. However, if you have used a camera before, you may already be familiar with some of the names that follow.

In color space conversion, YUV encoding converts an RGB signal into an intensity component (Y), ranging from black to white, and two components (U and V) that encode the color. This can be done losslessly or lossily: lossless means the image can be converted back to the original without any loss, while lossy means some data is lost.
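As an illustration, here is one common RGB-to-YUV transform (the analog BT.601 form) written out in NumPy. Coefficients and value ranges differ between standards (BT.601 vs. BT.709, full vs. limited range), so treat this as a sketch rather than the one true conversion:

```python
import numpy as np

def rgb_to_yuv_bt601(rgb: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 RGB image with values in 0..1 to YUV (BT.601 coefficients)."""
    m = np.array([
        [ 0.299,    0.587,    0.114  ],   # Y: luma
        [-0.14713, -0.28886,  0.436  ],   # U: blue-difference chroma
        [ 0.615,   -0.51499, -0.10001],   # V: red-difference chroma
    ])
    return rgb @ m.T

frame = np.random.rand(4, 4, 3)        # stand-in for a real frame
print(rgb_to_yuv_bt601(frame).shape)   # (4, 4, 3)
```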

Then there is image compression. You are probably familiar with PNG files, which use lossless bitmap compression, and JPEG files, which use a lossy compression method based on the discrete cosine transform. Roughly speaking, lossless compression can reduce the size by a factor of about 4, and lossy compression can go much further, although the quality of lossy-compressed images may suffer.
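An easy way to see the trade-off is to encode the same frame both ways with OpenCV and compare sizes. The synthetic gradient below compresses unusually well; real scenes give different ratios:

```python
import cv2
import numpy as np

# Build a synthetic 1080p gradient frame as a stand-in for a camera image.
row = np.linspace(0, 255, 1920, dtype=np.uint8)
frame = cv2.cvtColor(np.tile(row, (1080, 1)), cv2.COLOR_GRAY2BGR)

ok_png, png = cv2.imencode(".png", frame)                                   # lossless
ok_jpg, jpg = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 80])   # lossy

print("Raw: ", frame.nbytes, "bytes")
print("PNG: ", len(png), "bytes")
print("JPEG:", len(jpg), "bytes")
```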

Temporal compression usually works by designating one frame as a keyframe and measuring how subsequent frames differ from it. That way, you only need to send the keyframe and then the differences. New keyframes are typically generated after a given time interval or when the scene changes. For mostly static scenes, the size savings can be significant.
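A toy sketch of the keyframe-plus-difference idea is below. Real codecs add motion estimation, transforms, and entropy coding, so this only illustrates why static scenes compress so well:

```python
import numpy as np

def changed_pixels(keyframe: np.ndarray, frame: np.ndarray, threshold: int = 10):
    """Return a mask of pixels that differ meaningfully from the keyframe."""
    diff = np.abs(frame.astype(np.int16) - keyframe.astype(np.int16))
    return diff.max(axis=-1) > threshold

key = np.zeros((1080, 1920, 3), dtype=np.uint8)   # pretend keyframe
new = key.copy()
new[100:200, 100:200] = 255                       # a small object moves into view

mask = changed_pixels(key, new)
print(f"Pixels to send: {mask.sum()} of {mask.size}")   # 10,000 of 2,073,600
```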

There are various algorithms for this task, a process known as encoding. Common encoders include H.264, H.265, VP8, VP9, and MJPEG. A matching decoder at the receiving end reconstructs the video.
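On a Jetson, these encoders are usually reached through GStreamer so the hardware encoder does the work. The sketch below writes H.264 from OpenCV frames; the element names (nvvidconv, nvv4l2h264enc) match recent JetPack releases but vary between versions, so check gst-inspect-1.0 on your own system before relying on this exact pipeline.

```python
import cv2
import numpy as np

# Hypothetical pipeline: convert BGR frames, move them into NVMM memory,
# hardware-encode to H.264, and mux into an MP4 file.
pipeline = (
    "appsrc ! videoconvert ! video/x-raw, format=BGRx ! "
    "nvvidconv ! video/x-raw(memory:NVMM), format=NV12 ! "
    "nvv4l2h264enc ! h264parse ! qtmux ! filesink location=out.mp4"
)

writer = cv2.VideoWriter(pipeline, cv2.CAP_GSTREAMER, 0, 30.0, (1920, 1080))
for _ in range(90):                                   # three seconds of dummy frames
    writer.write(np.zeros((1080, 1920, 3), dtype=np.uint8))
writer.release()
```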

2.5.2 fourcc

The four-character code (fourcc) identifies how a video data stream is encoded. It is a throwback to the old Macintosh days: QuickTime built on the Apple File Manager idea of identifying containers with four characters, which fit conveniently into a 32-bit word. Audio streams use the same scheme.

Some fourcc codes are easy to guess, such as H264 and H265. MJPG indicates that each frame is JPEG encoded. Others are less obvious: YUYV is fairly common, and it is a packed YUV format with ½ horizontal chroma resolution, also known as YUV 4:2:2. Part of the confusion is that manufacturers can register these format names, and the same format may go by different aliases on different platforms over the years. For example, on Windows, YUYV is known as YUY2.
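With a V4L2 camera (a USB webcam, for example) you can request and inspect the fourcc through OpenCV. A small sketch; whether the request is honored depends on the formats the camera actually offers (v4l2-ctl --list-formats-ext will show them):

```python
import cv2

cap = cv2.VideoCapture(0)   # first V4L2 device; the index is machine-specific

# Ask the camera for MJPG; many UVC webcams only reach high frame rates this way.
cap.set(cv2.CAP_PROP_FOURCC, cv2.VideoWriter_fourcc(*"MJPG"))

# Read back the fourcc actually in use and unpack the 32-bit value into characters.
code = int(cap.get(cv2.CAP_PROP_FOURCC))
print("Negotiated fourcc:", "".join(chr((code >> (8 * i)) & 0xFF) for i in range(4)))
cap.release()
```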

References


https://jetsonhacks.com/2022/01/13/in-depth-cameras/
