[Voice Interaction] Voice collection and processing

1. Three elements of sound

1.1 Pitch

        Pitch, at the perceptual level, describes how high or low a sound is; at the physical level, it corresponds to the vibration frequency of the sounding object. A low frequency sounds deep and muffled, while a high frequency sounds sharp.

        The vibration frequency of the human vocal cords is roughly 10 Hz–10 kHz, while the hearing range of the human ear is 20 Hz–20 kHz. Sound below 20 Hz is called infrasound, sound within this range lies in the audible band, and sound above 20 kHz is called ultrasound.

1.2 Loudness      

        Loudness describes how loud or soft a sound is; at the physical level, it corresponds to the vibration amplitude of the sounding object. The greater the amplitude, the louder the sound; the smaller the amplitude, the softer it is.

1.3 Timbre

        Timbre is what lets us tell apart "the sound of wind, the sound of rain, the sound of reading aloud": it gives each sound its distinct character. At the physical level, it stems from differences in the material and structure of the sounding object.

The overall vibration of an object emits the fundamental tone, but its individual parts also vibrate in composite modes of their own. These partial vibrations emit sounds as well, forming overtones. Different combinations of fundamental tone and overtones produce the diversity of timbres.
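As a rough sketch of this idea (the frequencies and overtone weights below are illustrative assumptions, not taken from the source), a timbre can be modeled as a fundamental sine wave plus weighted harmonic overtones:

```python
import math

def tone(t, fundamental_hz, overtone_weights):
    """Fundamental sine plus harmonic overtones (2nd, 3rd, ...) with given weights."""
    value = math.sin(2 * math.pi * fundamental_hz * t)
    for k, weight in enumerate(overtone_weights, start=2):
        value += weight * math.sin(2 * math.pi * fundamental_hz * k * t)
    return value

# Same fundamental (440 Hz), different overtone mixes -> different waveforms (timbres)
flute_like  = [tone(n / 8000, 440, [0.2])      for n in range(16)]
string_like = [tone(n / 8000, 440, [0.6, 0.4]) for n in range(16)]
```

Two instruments playing the same pitch share the fundamental but mix overtones differently, which is why their waveforms, and hence their timbres, differ.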

        A waveform graph lets us vary one factor at a time (frequency, amplitude, wave shape) and observe how each influences the resulting sound.

2. Audio digitization

        Sound consists of changing air pressure. The resulting sound waves drive the microphone's diaphragm to vibrate; the diaphragm in turn moves a coil in a magnetic field, producing an analog signal of the sound. Audio digitization is the process of converting this analog signal into a digital one, as illustrated in the figure below: in the waveform diagram, the horizontal axis is the time dimension and the vertical axis is the amplitude dimension.

2.1 Sampling: Digitize analog signals on the time axis at a certain sampling rate

        Sampling means taking values from the analog signal at fixed time intervals. At a fixed interval T (say T = 0.1 s), points are taken in sequence (the points on the wave corresponding to 1–10 in the figure). T is called the sampling period, and its reciprocal is the sampling rate (f = 1/T = 10 Hz). f is the number of samples taken per second, measured in hertz (Hz).
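The sampling step above can be sketched in a few lines of Python (the 2 Hz test signal is an illustrative assumption; the text's T = 0.1 s example is kept):

```python
import math

def sample(signal, T, n_points):
    """Take n_points samples of a continuous-time signal at fixed period T (seconds)."""
    return [signal(i * T) for i in range(n_points)]

def analog(t):
    """A 2 Hz sine wave standing in for the 'analog' signal."""
    return math.sin(2 * math.pi * 2 * t)

T = 0.1                          # sampling period, as in the text's example
fs = 1 / T                       # sampling rate: f = 1/T = 10 Hz
samples = sample(analog, T, 10)  # 10 samples cover one second
```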

        The frequency range the human ear can hear is 20 Hz–20 kHz. By the Nyquist–Shannon sampling theorem (to reconstruct the analog signal without distortion, the sampling frequency must be at least twice the highest frequency in the signal's spectrum: fs ≥ 2·fmax), the sampling frequency is commonly 44.1 kHz. The computer then measures the audio level at each sample: at 44.1 kHz it captures the signal 44,100 times per second, i.e. roughly every 23 microseconds.

2.2 Quantization: Digitize analog signals on the amplitude axis with a certain accuracy

        After completing the sampling, the second step of audio digitization is quantization. Sampling is to digitize the audio signal on the time axis to obtain multiple sampling points; while quantization is to digitize in the amplitude direction to obtain the amplitude value of each sampling point.

        As shown in the figure above, set the vertical axis range to 0–8 and read off the (rounded) vertical coordinate of each sampling point; this coordinate is the quantized amplitude value. Because the amplitude axis is divided into 8 segments, there are 8 values available for quantization and rounding; that is, the precision of this quantization is 8 levels. Clearly, with more segments the quantized amplitude is more accurate (the rounding error is smaller) and the original waveform is better represented. The term for this quantization precision of the amplitude is bit depth.
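A minimal sketch of this rounding step, assuming normalized amplitudes in [0, 1] (the 8-level axis mirrors the text's example; the helper name is illustrative):

```python
def quantize(amplitude, levels):
    """Map an amplitude in [0.0, 1.0] to the nearest of `levels` integer steps."""
    return round(amplitude * (levels - 1))

# 8 quantization levels, as in the text's 0-8 vertical axis example
sample_amplitudes = [0.12, 0.55, 0.93]
quantized = [quantize(a, 8) for a in sample_amplitudes]  # integer coordinates
```

With more levels (say 256 instead of 8), adjacent steps are closer together, so the rounding error per sample shrinks.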

        The computer converts these captured signals into numbers that record the changes in amplitude, then binary-encodes those numbers to represent the instantaneous voltage of the measured waveform, completing the quantization process.

2.3 Encoding: Record sampled/quantized data in a specific format

        After quantization we have the amplitude value of each sampling point. The final step of digitizing the audio signal is encoding: converting each sampling point's quantized amplitude into a binary byte sequence that the computer can process.

        Conversely, to play these sounds back, the digital signal is converted into an analog signal by a digital-to-analog converter, and an anti-aliasing (reconstruction) filter then smooths the step-like jumps in the analog signal, restoring an approximation of the original analog waveform.

        As shown in the figure above, referring to the table in the encoding section: the sample number is the sampling order, the sample value (decimal) is the quantized amplitude value, and the sample value (binary) is that amplitude value after conversion. The result is a binary byte sequence of "0"s and "1"s, i.e. a discrete digital signal. What we obtain here is a raw stream of uncompressed audio sample data, also called PCM (Pulse Code Modulation) audio data. In practice, other encoding algorithms are usually applied on top of this to compress it further.
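The encoding step can be sketched with Python's struct module. The 16-bit little-endian signed layout below is a common PCM convention, assumed here rather than specified by the source:

```python
import struct

def encode_pcm16(samples):
    """Encode floats in [-1.0, 1.0] as 16-bit little-endian signed PCM bytes."""
    ints = [max(-32768, min(32767, int(s * 32767))) for s in samples]
    return struct.pack('<%dh' % len(ints), *ints)

raw = encode_pcm16([0.0, 0.5, -0.5, 1.0])  # 4 samples -> 8 bytes (2 bytes each)
```

This raw byte stream is exactly the "naked" PCM data the text describes; container formats and codecs (WAV headers, MP3, AAC) are layered on top of it.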

3. Three elements of audio digital signal quality

3.1 Sampling rate

        Audio sampling rate refers to the number of samples taken from the sound signal per unit time (1 s).

        For an audio signal with maximum frequency f, suppose we sample it at rates f, 4f/3, and 2f; the results are shown in the figure below. Only at a sampling rate of 2f are the characteristics of the original signal effectively preserved; the results at f and 4f/3 differ greatly from the original waveform.
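The distortion the figure illustrates is aliasing, and it can be checked numerically: a tone above half the sampling rate produces exactly the same samples as a lower "alias" tone (the 10 Hz / 8 Hz numbers below are illustrative, not from the source):

```python
import math

fs = 8.0                        # sampling rate (Hz), below Nyquist for a 10 Hz tone
f_high, f_alias = 10.0, 2.0     # 10 Hz aliases to |10 - fs| = 2 Hz

high = [math.sin(2 * math.pi * f_high  * n / fs) for n in range(16)]
low  = [math.sin(2 * math.pi * f_alias * n / fs) for n in range(16)]
# sin(2π·10·n/8) = sin(2π·2·n/8 + 2π·n), so the two sample sequences are identical
```

Since the samples are indistinguishable, no reconstruction can recover the original 10 Hz tone, which is why fs ≥ 2·fmax is required.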

Figure: the effect of sampling rate on the sampling result

3.2 Sampling bit depth

        The "quantization" step of the audio digitization process introduced the concept of quantization precision, i.e. bit depth. Sampling bit depth refers to the precision of the amplitude value of each sampling point during audio collection and quantization, and is generally measured in bits. For example, at a sampling bit depth of 8 bits, each sampling point's amplitude can take one of 2^8 = 256 quantized values; at 16 bits, one of 2^16 = 65,536 quantized values. Clearly, 16 bits can store and represent more, and more detailed, data than 8 bits, with smaller error introduced during quantization. Bit depth affects the resolution and fineness of the sound; we can think of it as the "resolution" of the sound signal. The greater the bit depth, the more realistic and vivid the sound.
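A small sketch of how bit depth bounds the rounding error, assuming a full-scale amplitude range of [-1.0, 1.0] (an illustrative convention, not stated in the source):

```python
def quantization_levels(bit_depth):
    """Number of distinct amplitude values a given bit depth can represent."""
    return 2 ** bit_depth

def max_rounding_error(bit_depth):
    """Worst-case rounding error for a full-scale [-1.0, 1.0] amplitude range."""
    step = 2.0 / quantization_levels(bit_depth)   # spacing between adjacent levels
    return step / 2                               # nearest-level rounding: half a step
```

Doubling the bit depth squares the number of levels, so 16-bit audio has 256x finer steps than 8-bit audio and correspondingly smaller worst-case error.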

        The choice of sampling bit depth is similar to the choice of sampling rate: in theory, larger is better, but given bandwidth, storage, and the actual listening experience, different bit depths should be chosen for different scenarios.

3.3 Number of channels

        What we commonly call mono and stereo (dual channel) describes the number of channels of an audio signal. The number of channels generally refers to the number of sound sources during collection and recording, or the number of speakers during playback.

3.4 Audio code rate

        Audio bit rate, also called code rate, refers to the amount of audio data contained in unit time (generally 1 s), and can be computed by the formula:

raw bit rate = sampling rate (Hz) × bit depth (bits) × number of channels

For example, for two-channel PCM data with a 44.1 kHz sampling rate and a 16-bit depth, the raw bit rate is:

44.1 × 1000 × 16 × 2 = 1,411,200 bps = 1411.2 kbps ≈ 1.411 Mbps (bits per second)

If a PCM file is 1 minute long, the amount of data required to transmit/store it is: 1411.2 kbps × 60 s = 84,672 kb ≈ 84.67 Mb (about 10.58 MB).
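The bit-rate arithmetic above as a small helper (the function name is illustrative):

```python
def pcm_bit_rate_bps(sample_rate_hz, bit_depth_bits, channels):
    """Raw (uncompressed) PCM bit rate in bits per second."""
    return sample_rate_hz * bit_depth_bits * channels

bps = pcm_bit_rate_bps(44_100, 16, 2)          # 1,411,200 bps = 1411.2 kbps
one_minute_megabits = bps * 60 / 1_000_000     # data for a 1-minute file, in Mb
```

This is why compressed codecs matter: a 128 kbps MP3 of the same minute is roughly a tenth of the raw PCM size.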

Reference blog: Introduction to the basics of audio and video development | Sound collection and quantization, audio digital signal quality, audio bit rate — ZEGO Technology's blog, CSDN

Origin blog.csdn.net/weixin_44362628/article/details/128323464