Basic concepts related to audio

1. The nature of sound

Sound is the phenomenon of a wave propagating through a medium; a sound wave is the wave itself, a physical quantity. The two are different: sound is an abstraction, the phenomenon of sound-wave propagation, while the sound wave is the concrete physical quantity.

2. Three elements of sound

Loudness: the perceived strength of a sound (commonly known as volume), determined by the amplitude of the wave and the distance between the listener and the sound source. The larger the amplitude and the closer the source, the louder the sound.

Pitch: how high or low a sound is (treble vs. bass), determined by frequency: the higher the frequency, the higher the pitch (frequency is measured in Hz, hertz). The human hearing range is 20–20000 Hz; below 20 Hz is called infrasound, above 20000 Hz is called ultrasound.

Timbre: determined by the waveform. Because objects and materials differ in their characteristics, the sounds they make have different characters. Timbre itself is abstract, but the waveform is the intuitive, visible expression of that abstraction: waveforms differ from timbre to timbre, and different timbres can be distinguished by their waveforms.
Fourier theory (Jean Baptiste Joseph Fourier, 1768–1830) tells us that any periodic signal can be regarded as a superposition of sine and cosine waves; accordingly, any time-domain signal can be composed as a superposition of sine waves of appropriate frequency, amplitude, and phase.
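
As a quick illustration of this superposition idea, here is a minimal Java sketch; the component frequencies, amplitudes, and phases are arbitrary values chosen only for the example:

```java
// Minimal sketch: a composite signal as a superposition of sine waves.
// The component frequencies/amplitudes/phases below are arbitrary examples.
public class FourierSuperposition {
    public static void main(String[] args) {
        int sampleRate = 44100;                 // samples per second
        double[] freqs  = {440.0, 880.0};       // component frequencies (Hz)
        double[] amps   = {1.0, 0.5};           // component amplitudes
        double[] phases = {0.0, Math.PI / 4};   // component phases (radians)

        double[] signal = new double[sampleRate]; // one second of audio
        for (int n = 0; n < signal.length; n++) {
            double t = (double) n / sampleRate;   // time of sample n, in seconds
            for (int k = 0; k < freqs.length; k++) {
                signal[n] += amps[k] * Math.sin(2 * Math.PI * freqs[k] * t + phases[k]);
            }
        }
        System.out.printf("sample 100 = %.4f%n", signal[100]);
    }
}
```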

3. Several basic concepts

  • Bit rate (code rate): the number of bits of audio data transmitted or played per second, in bits per second (bps). For example, for a PCM stream with a sampling rate of 44100 Hz, a sample size of 16 bits, and 2 channels, the bit rate is 44100 × 16 × 2 = 1411200 bps.
    Calculation of audio file size: file size = sampling rate × recording time × sample size / 8 × number of channels (bytes)
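
These two formulas can be checked with a few lines of Java (a throwaway sketch; the parameter values are just the example numbers from the text):

```java
// Sketch of the bit-rate and file-size formulas above.
public class AudioSizeCalc {
    public static void main(String[] args) {
        int sampleRate = 44100; // Hz
        int sampleBits = 16;    // bits per sample
        int channels   = 2;     // stereo
        int seconds    = 60;    // recording time, chosen for the example

        long bitRate = (long) sampleRate * sampleBits * channels;          // bits per second
        long fileSizeBytes = (long) sampleRate * seconds * sampleBits / 8 * channels;

        System.out.println("bit rate  = " + bitRate + " bps");             // 1411200 bps
        System.out.println("file size = " + fileSizeBytes + " bytes");     // 10584000 bytes for 60 s
    }
}
```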

  • Sampling: converting a continuous-time signal into a discrete digital signal.

  • Sampling rate : Simply put, it is the number of times a sound sample is taken per second.

Sound is an energy wave characterized by frequency and amplitude. Sampling actually extracts the amplitude value of the signal at discrete points in time: the more points sampled per second, the more information is retained and the better the sound quality. But higher is not always better, because the range of human hearing is only 20 Hz–20 kHz; generally speaking, a sampling rate of 44100 Hz already meets the basic requirements.

  • Number of samples: determined by the sampling rate and the duration. For example, at a sampling rate of 44100 Hz, 1 s of audio contains 44100 samples.

  • Sample size: also called sample depth or quantization depth, it indicates how many bits are used for each sample point; audio quantization depth is generally 8, 16, or 32 bits. For example, with a quantization depth of 8 bits each sample can represent 256 different quantization values; with 16 bits, 65536 values.
      Quantization depth affects sound quality: the more bits, the closer the quantized waveform is to the original, the higher the sound quality, and the more storage is required; the fewer bits, the lower the sound quality and the less storage is required. CD quality uses 16 bits.
      
  • Number of channels: the number of sound channels; common configurations are mono and two-channel (stereo).
      Mono sound comes from a single channel; it can be played through one speaker, or the same channel can be output through two speakers. When mono audio is played back through two speakers, the sound seems to come from a point between the two speakers, and the specific location of the source cannot be determined.
      Two-channel means there are two sound channels. The principle: people judge the position of a sound source from the phase difference between the left and right ears. The sound is split into two independent channels during recording, which produces good sound localization.
      When recording, generating one stream of sample data at a time is called mono; generating two streams at a time is called two-channel (stereo). A stereo (two-channel) file is twice the size of a mono file.

  • Audio frame: audio differs from video. Each video frame is an image, but audio is a stream of samples, so there is no inherent concept of a frame. For example, for a PCM stream with a sampling rate of 44100 Hz, a sample size of 16 bits, and 2 channels, one second of audio occupies a fixed 44100 × 16 × 2 / 8 bytes. Still, a frame can be defined by convention; the AMR format, for instance, simply stipulates that every 20 ms of audio is one frame.
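
A quick sketch of the arithmetic for a hypothetical 20 ms frame at these PCM parameters:

```java
// Bytes per second of raw PCM, and the size of one 20 ms frame
// (the AMR-style frame length used as an example above).
public class FrameSize {
    public static void main(String[] args) {
        int sampleRate = 44100, sampleBits = 16, channels = 2;
        int bytesPerSecond = sampleRate * sampleBits / 8 * channels; // 176400 bytes
        int frameMs = 20;
        int samplesPerFrame = sampleRate * frameMs / 1000;           // 882 samples per channel
        int bytesPerFrame = bytesPerSecond * frameMs / 1000;         // 3528 bytes
        System.out.println(bytesPerSecond + " B/s, " + samplesPerFrame
                + " samples/frame, " + bytesPerFrame + " B/frame");
    }
}
```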

  • Nyquist sampling theorem: also known as the sampling theorem. When the sampling rate is greater than or equal to twice the highest frequency component of a continuous signal, the original continuous signal can be perfectly reconstructed from the samples. Common sampling rates are 44.1 kHz and 48 kHz.

  • PCM stream
    A PCM stream is the raw audio recording: the data is saved in a series of buffers stored in PCM format. The audio sampling process is also called pulse code modulation encoding, i.e., PCM (Pulse Code Modulation) encoding, and sample values are also called PCM values.
      On Windows, the raw data captured through WaveIn or Core Audio is a series of buffers in PCM format.

4. The process of coding

Encoding process: analog signal -> sampling -> quantization -> encoding -> digital signal

4.1 Sampling

So-called sampling digitizes the signal along the time axis only.

According to the Nyquist theorem (also known as the sampling theorem), the sampling rate must be at least twice the highest frequency of the sound. The frequency (pitch) range of human hearing is 20 Hz–20 kHz, so the sampling rate must be at least 40 kHz. 44.1 kHz is generally used, which ensures the sound can still be digitized even at 20 kHz; 44.1 kHz means 44100 samples per second.

4.2 Quantization

How should each sample be represented? This is where quantization comes in.
  Quantization digitizes the signal along the amplitude axis. If a 16-bit binary value is used to represent each sample (8-bit and 32-bit are also used), then the range a sample can represent is [-32768, 32767].
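
A minimal sketch of 16-bit quantization, mapping a normalized sample in [-1.0, 1.0] onto that integer range. Symmetric scaling by 32767 is one common convention; this is an illustration, not the only mapping:

```java
// Map a normalized sample in [-1.0, 1.0] to a signed 16-bit value.
public class Quantize {
    static short quantize16(double sample) {
        // clamp first so out-of-range input cannot overflow
        double clamped = Math.max(-1.0, Math.min(1.0, sample));
        return (short) Math.round(clamped * 32767.0);
    }

    public static void main(String[] args) {
        System.out.println(quantize16(0.5));   // 16384 (about half scale)
        System.out.println(quantize16(-1.0));  // -32767 (symmetric scaling)
        System.out.println(quantize16(2.0));   // 32767 (clamped)
    }
}
```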

4.3 Encoding

Each quantized value is a sample, and storing all of these samples is called encoding. Encoding records the sampled and quantized digital data in a certain format, such as sequential storage or compressed storage.
  The so-called raw audio data format is pulse code modulation (PCM) data. A piece of PCM data usually needs a quantization format (bit depth, usually 16 bit), a sampling rate, and a number of channels to describe it.
  For a sound format there is one more concept used to describe its size: the bit rate, i.e., the number of bits per second, which measures the amount of audio data per unit time.

4.4 Digital Signals

The encoded data is then simply represented with high and low voltage levels.

5. Audio processing related

5.1 Algorithm name and explanation of some functions

AEC (Acoustic Echo Cancellation): echo cancellation algorithm.
  During a voice or video call, after the local sound is transmitted to the remote end and played there, it is picked up again by the remote microphone, mixed with the remote speaker's voice, and transmitted back to be played locally. The locally played sound then contains the originally captured local sound, producing the subjective sensation of hearing your own echo. Taking WebRTC as an example, its echo-suppression module recommends the computationally light AECM algorithm for mobile devices.

AGC (Automatic Gain Control): gain control / automatic gain control.
The audio captured by phones and similar devices is sometimes louder and sometimes quieter, causing the volume to fluctuate and degrading the listener's subjective experience. An automatic gain control algorithm applies positive or negative gain adjustments to the input sound according to preconfigured parameters, so that the output level suits the human ear.
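
As a toy illustration of the idea (not WebRTC's actual algorithm), the sketch below measures a frame's RMS level and applies a gain nudging it toward a target; real AGC implementations adapt the gain smoothly over time to avoid pumping artifacts:

```java
// Toy AGC sketch: scale a frame of 16-bit samples toward a target RMS level.
public class SimpleAgc {
    static void applyAgc(short[] frame, double targetRms) {
        double sum = 0;
        for (short s : frame) sum += (double) s * s;
        double rms = Math.sqrt(sum / frame.length);
        if (rms < 1e-9) return;                  // silent frame: nothing to scale
        double gain = targetRms / rms;           // >1 boosts, <1 attenuates
        for (int i = 0; i < frame.length; i++) {
            double v = frame[i] * gain;
            frame[i] = (short) Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, v));
        }
    }
}
```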

VAD (Voice Activity Detection): endpoint detection / silence detection / voice endpoint detection / voice boundary detection.
The basic principle of silence detection: compute the power spectral density of the audio; if it is below a threshold, the segment is considered silence, otherwise it is considered sound. Silence detection is widely used in audio coding, AGC, AECM, and so on.
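
A minimal sketch of the threshold idea in Java; for simplicity it uses plain frame energy (mean power) rather than a full power-spectral-density estimate:

```java
// Energy-threshold VAD sketch: mean power of a frame vs. a threshold.
public class EnergyVad {
    static boolean isSpeech(short[] frame, double powerThreshold) {
        double power = 0;
        for (short s : frame) power += (double) s * s;
        power /= frame.length;          // mean power of the frame
        return power >= powerThreshold; // below threshold => silence
    }
}
```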

NS (Noise Suppression): noise suppression / noise reduction / noise cancellation.

The raw sound captured by phones and similar devices often contains background noise, which degrades the listener's experience and reduces audio compression efficiency. Taking Google's well-known open-source framework WebRTC as an example, rigorous tests of its noise-suppression algorithm show that it suppresses both white noise and colored noise well, meeting the requirements of video and voice calls. Other common noise-suppression algorithms, such as the one included in the open-source Speex project, also give good results; the Speex algorithm can be used at any sampling rate, giving it a wider range of application than WebRTC's.

CNG (Comfort Noise Generation): comfort noise generation.
  The basic principle: artificially reconstruct noise from its power spectral density. It is widely used in audio codecs. The encoder computes the power spectral density of the white noise during silent periods and encodes the silence interval together with that density; the decoder reconstructs random white noise from the timing and power-spectral-density information.

ANC (Active Noise Control): active noise control / active noise cancellation
ANS (Automatic Noise Suppression): automatic noise suppression / noise reduction
NC (Noise Cancellation): noise cancellation / noise reduction
AFC (Acoustic Feedback Cancellation): howling suppression / adaptive acoustic feedback cancellation
EQ: audio equalization
Dereverberation: reverberation removal
Beamforming: beam forming with a microphone array
ASR (Automatic Speech Recognition): speech recognition
KWS (Keyword Spotting): keyword spotting
Speech Enhancement: speech enhancement
Audio Encode: audio encoding
Microphone Array: microphone array
Voiceprint Recognition: voiceprint (speaker) recognition
Sound Source Localization: sound source localization

5.2 Some services

Compressor: reduces the output level of strong signals.
Automatic gain control (AGC): attenuates strong signals and boosts weak ones.
Feedback cancellation (AFC): quickly attenuates the input signal at a specific frequency point to prevent that frequency from passing through, thereby avoiding howling.
Echo cancellation (AEC): removes acoustic echo.
Ducker: ensures that only one input signal takes effect at a time.
Delay: delays the signal output time.
Speaker manager (main mixer): applies fine-tuning corrections to the output signal.
Limiter: caps the maximum value of the output signal.

6. What is the source of captured audio, and how is its size calculated?

First of all, the audio source is generally the microphone (MediaRecorder.AudioSource.MIC).

Sampling rate (unit: hertz)
  The number of audio sample points taken per second (e.g., 8000 Hz or 44100 Hz); this is the process of digitizing the analog signal into a digital signal represented as 0s and 1s.

Channels

  • AudioFormat.CHANNEL_IN_MONO: mono, sampled on one channel
  • AudioFormat.CHANNEL_IN_STEREO: two-channel, sampled on two channels

Audio sampling precision
  Specifies the format and size of each sample. The data returned is in PCM format with a bit width of 16 bits; AudioFormat.ENCODING_PCM_16BIT is generally used (the official documentation states that this precision is guaranteed to be supported on all devices).

Data size per second
  Sampling rate × sample size × number of channels
  Data per second = 44100 (samples per second) × 16 bit (bit width) × 2 (two channels) = 1411200 bit/s = 1411.2 kbps
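
A minimal Android capture sketch with exactly these parameters; error handling and the RECORD_AUDIO runtime permission are omitted for brevity:

```java
// Capture raw PCM from the microphone at 44100 Hz, stereo, 16-bit.
import android.media.AudioFormat;
import android.media.AudioRecord;
import android.media.MediaRecorder;

public class MicCapture {
    public static void capture() {
        int sampleRate = 44100;
        int channelConfig = AudioFormat.CHANNEL_IN_STEREO;
        int audioFormat = AudioFormat.ENCODING_PCM_16BIT;

        int minBufSize = AudioRecord.getMinBufferSize(sampleRate, channelConfig, audioFormat);
        AudioRecord recorder = new AudioRecord(MediaRecorder.AudioSource.MIC,
                sampleRate, channelConfig, audioFormat, minBufSize);

        byte[] pcmBuffer = new byte[minBufSize];
        recorder.startRecording();
        int bytesRead = recorder.read(pcmBuffer, 0, pcmBuffer.length); // raw PCM data
        recorder.stop();
        recorder.release();
    }
}
```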

7. Audio usage scenarios and applications

In real life, audio is mainly used in two scenarios: voice and music. Voice is mainly used for communication, such as phone calls; with the development of speech recognition, human-computer voice interaction has also become a voice application and is currently in the limelight, with many major manufacturers launching smart speakers. Music is mainly used for enjoyment, such as music playback.
Key applications for audio development:

7.1 Typical audio applications

Audio players
Recorders
Voice calling
Audio/video monitoring applications
Audio/video live-streaming applications
Audio editing/processing software (KTV sound effects, voice changing, ringtone conversion)
Bluetooth headsets/speakers

7.2 Specific content of audio development:

Audio capture/playback
Audio algorithm processing (noise removal, VAD, echo cancellation, sound effects, amplification/enhancement, mixing/separation, etc.)
Audio encoding/decoding and format conversion
Audio transmission protocol development (SIP, A2DP, AVRCP, etc.)

8. Introduction to sound mixing technology

Mixing: as the name implies, mixing two or more audio streams into a single audio stream.
Stream mixing: mixing an audio stream with a video stream, i.e., aligning the video images with the sound.

Not any two audio streams can be mixed directly.

8.1 Two audio streams to be mixed must meet the following conditions

  • The format must be the same: both streams need to be decompressed into PCM format.
  • The sampling rate must be the same: convert both to a common rate. Popular sampling rates include 16 kHz, 32 kHz, 44.1 kHz, and 48 kHz.
  • The frame length must be the same. Frame length is determined by the encoding format; PCM has no inherent frame length, so the developer chooses one. To stay consistent with mainstream audio encoding formats, 20 ms is recommended.
  • The bit depth (Bit-Depth) or sample format (Sample Format) must be the same: each sample point must carry the same number of bits.
  • The number of channels must be the same: both mono or both two-channel (stereo).

Once the format, sampling rate, frame length, bit depth, and number of channels are aligned, the two audio streams can be mixed, for example by summing samples as sketched below.
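
Here is a minimal mixing sketch for two aligned 16-bit PCM buffers: samples are widened to int, summed, and clipped back to the 16-bit range. Clipping is the simplest overflow strategy; real mixers often attenuate instead:

```java
// Mix two aligned PCM buffers (same rate, depth, channels) sample by sample.
public class PcmMixer {
    static short[] mix(short[] a, short[] b) {
        short[] out = new short[Math.min(a.length, b.length)];
        for (int i = 0; i < out.length; i++) {
            int sum = a[i] + b[i];                      // widen to int before summing
            if (sum > Short.MAX_VALUE) sum = Short.MAX_VALUE;
            if (sum < Short.MIN_VALUE) sum = Short.MIN_VALUE;
            out[i] = (short) sum;
        }
        return out;
    }
}
```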

8.2 Processing such as echo cancellation, noise suppression and silence detection

Before mixing, processing such as echo cancellation, noise suppression, and silence detection is also required. Echo cancellation and noise suppression belong to the category of speech pre-processing. Before encoding, the steps should run in order: capture, speech pre-processing, pre-mixing processing, mixing, and post-mixing processing. Silence suppression (VAD, Voice Activity Detection) is optional. For terminal-side mixing, the captured host voice must be mixed with the accompaniment read from an audio file. If the host stops speaking for a while and VAD detects this, the accompaniment data can be used directly during that period without mixing. For simplicity, however, VAD can also be omitted: while the host is silent, mixing can simply continue, since the host's signal has zero amplitude.

9. Audio resampling

Resampling is to resample audio to obtain audio with a new sampling rate.

The reason for resampling???
An audio system may contain multiple audio tracks, and each track's native sampling rate may differ. For example, if a prompt tone must be played while music is playing, the two must be mixed and sent to the codec, but the music's native sampling rate and the prompt tone's native sampling rate may not match. If the codec's sampling rate were set to the music's native rate, the prompt tone would be distorted. The simplest and most effective solution: fix the codec's sampling rate at one value (44.1 kHz or 48 kHz), resample all audio tracks to that rate, and then send them to the codec, ensuring that no track sounds distorted.
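
A minimal resampling sketch using linear interpolation for a mono 16-bit track; production resamplers apply proper low-pass filtering to avoid aliasing, so this is only to show the index arithmetic:

```java
// Convert a mono 16-bit track from srcRate to dstRate by linear interpolation.
public class Resampler {
    static short[] resample(short[] in, int srcRate, int dstRate) {
        int outLen = (int) ((long) in.length * dstRate / srcRate);
        short[] out = new short[outLen];
        for (int i = 0; i < outLen; i++) {
            double srcPos = (double) i * srcRate / dstRate; // position in the source
            int i0 = (int) srcPos;
            int i1 = Math.min(i0 + 1, in.length - 1);
            double frac = srcPos - i0;
            out[i] = (short) Math.round(in[i0] * (1 - frac) + in[i1] * frac);
        }
        return out;
    }
}
```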

10. Spectrum

The spectrum is the set of sine waves that, when properly combined, form the time-domain signal under consideration. Consider the waveform of a composite signal: we might expect to see a sine wave, yet if the signal is clearly not a pure sine wave, it is difficult to determine why by observing the time-domain waveform alone; the spectrum makes the component frequencies explicit.

11. Embedded Digital Signal Processor (EDSP)

It is an embedded processor that excels at executing digital signal processing operations (such as digital filtering and spectrum analysis) at high speed. Thanks to the special design of its hardware structure and instruction set, a DSP can execute a wide variety of digital signal processing algorithms at high speed.

11.1 Features

The strength of an embedded digital signal processor lies in computation-heavy data processing such as vector operations and linear pointer addressing.
An embedded DSP is a processor specialized for signal processing, with a system architecture and instruction set designed for that purpose, so it achieves high compilation efficiency and fast instruction execution. DSP chips adopt a Harvard architecture that separates program memory from data memory, provide dedicated hardware multipliers, make extensive use of pipelining, and offer special DSP instructions, all of which allow various digital signal processing algorithms to be implemented quickly.

Origin blog.csdn.net/bentao1997719/article/details/125303508