VoIP voice coders

Chapter overview: this chapter covers the codecs used in conventional voice over IP (VoIP) technology. They are often referred to as voice coders, speech coders, or simply coders. There is a large body of knowledge on this subject.

This section briefly describes the main function of the coder and the classification of coders, and then describes coders used in VoIP, among them the ITU-T G.723.1 speech coder and the ITU-T G.729 speech coder.

First, the main function of the speech coder

The main function of the speech coder is to encode the PCM (pulse code modulation) samples of the user's voice into a small number of bits (frames). Done this way, the speech remains robust even when the link introduces errors, network jitter, and burst transmission. At the receiving end, the speech frames are first decoded into PCM voice samples and then converted back into a sound waveform.

Second, the classification of speech coders

Speech coders fall into three types: (a) waveform coders; (b) vocoders; (c) hybrid coders. A waveform coder reproduces the analog waveform as it is, including background noise. Since the coder operates on the whole waveform of the input signal, it produces high-quality samples. However, waveform coders work at high bit rates. For example, the ITU-T G.711 specification (PCM) has a bit rate of 64 kbps.
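As a rough check on the 64 kbps figure, the bit rate of a waveform coder is simply the sampling rate times the number of bits per sample. The sketch below assumes the 8000 Hz telephone sampling rate and 8 bits per G.711 sample; the function name is illustrative.

```python
def pcm_bit_rate_kbps(sample_rate_hz: int, bits_per_sample: int) -> float:
    """Bit rate of an uncompressed sample stream, in kbps."""
    return sample_rate_hz * bits_per_sample / 1000


# Assumed G.711 parameters: 8000 samples per second, 8 bits per sample.
g711_kbps = pcm_bit_rate_kbps(8000, 8)
print(g711_kbps)  # 64.0
```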

A vocoder does not reproduce the original waveform. Instead, the encoder extracts a set of parameters, which are sent to the receiving end and used to drive a speech production model. Linear predictive coding (LPC) is used to obtain the parameters of a time-varying digital filter. This filter models the output of the speaker's vocal tract [WEST96]. The voice quality of vocoders is not good enough for use in telephone systems.
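One common way to obtain LPC parameters is the autocorrelation method with the Levinson-Durbin recursion. The following is a minimal sketch of that method, not the procedure of any particular standard; the function names and the test signal are illustrative assumptions.

```python
def autocorrelation(x, max_lag):
    """r[k] = sum over n of x[n] * x[n + k], for k = 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Find predictor coefficients a[1..order] so that x[n] is approximated
    by sum_k a[k] * x[n - k]. Returns (coefficients, residual_energy).
    Assumes r[0] > 0 (the signal is not all zeros)."""
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        refl = acc / err  # reflection coefficient for stage i
        updated = a[:]
        updated[i] = refl
        for k in range(1, i):
            updated[k] = a[k] - refl * a[i - k]
        a = updated
        err *= 1.0 - refl * refl
    return a[1:], err


# Illustrative input: a decaying exponential, which a first-order predictor
# with coefficient 0.9 models almost exactly.
signal = [0.9 ** n for n in range(200)]
coeffs, residual = levinson_durbin(autocorrelation(signal, 2), 2)
print([round(c, 3) for c in coeffs])
```

For this input the first coefficient comes out very close to 0.9 and the second very close to zero, which is what a first-order decay should give.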


Figure: MOS score versus bit rate for low-bit-rate coders [WEST96]

The speech coders used in VoIP are hybrid coders. They combine the strengths of waveform coders and vocoders, and another feature is that they operate at very low bit rates (4 to 6 kbps). Hybrid coders are based on analysis-by-synthesis (AbS).

To illustrate, consider a model of how human speech is produced: when people speak, they produce voiced sounds (e.g., the phonemes pa, da) and unvoiced sounds (e.g., the phonemes sh, th). The excitation signal is derived from the input speech signal in such a way that the difference between the synthesized speech and the input speech is very small. An analysis-by-synthesis (AbS) system that uses LPC, excitation generation, and error checking is shown in Figure 4-1. Toll quality is readily achieved at a coder bit rate of 8 kbps, as shown in Figure 4-2. Toll-quality voice must score about 4 or higher on the Mean Opinion Score (MOS) scale. The quality of conventional PCM speech deteriorates seriously at bit rates below 32 kbps, so PCM is not discussed further here. Hybrid coders keep an acceptable MOS at relatively low bit rates. At present, most VoIP coders work in the range 5.2 to 8 kbps. Studies have shown that standard coders can provide acceptable MOS scores at a bit rate of 4 kbps, and some systems in use score 3.8 on the MOS scale at 4.8 kbps.

Vector quantization and code-excited linear prediction

A better method is to represent the input speech signal by a vector taken from a codebook of stored optimal prediction parameters (code vectors). This technique is called vector quantization (VQ). Combining VQ with AbS further improves coding performance, and AbS with VQ is the basis of CELP. The main difference between VQ inside an AbS loop and plain VQ is the distortion measure used when searching the codebook [WONG96].
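The core of vector quantization is a nearest-neighbor search over the codebook. The sketch below uses plain squared error as the distortion measure and a made-up codebook; a CELP coder would instead measure the error after synthesis, so treat this only as an illustration of the search itself.

```python
def nearest_codeword(vector, codebook):
    """Return (index, squared_error) of the codebook entry closest to vector."""
    best_index, best_error = -1, float("inf")
    for index, codeword in enumerate(codebook):
        error = sum((v - c) ** 2 for v, c in zip(vector, codeword))
        if error < best_error:
            best_index, best_error = index, error
    return best_index, best_error


# Hypothetical 2-bit codebook of 3-dimensional vectors (values are made up).
codebook = [
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [1.0, -1.0, 1.0],
    [-1.0, 0.5, -1.0],
]
index, error = nearest_codeword([0.9, -0.8, 1.2], codebook)
print(index, round(error, 2))  # 2 0.09
```

Only the index (here 2 bits) needs to be transmitted; the receiver looks up the same codebook entry.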

Third, linear-prediction analysis-by-synthesis coders

The most common speech coders with bit rates between 4.8 kbps and 16 kbps are model-based coders that use the linear-prediction analysis-by-synthesis (LPAS) method. To model a speech signal that changes with time, a linear-predictive speech production model must be excited with an appropriate signal. At fixed time intervals (e.g., every 20 ms), the speech model parameters and the excitation parameters are estimated and updated, and then used to control the speech model. Two kinds of LPAS coder are described here: forward-adaptive and backward-adaptive LPAS coders.

3.1 Forward-adaptive LPAS coders: the 8 kbps G.729 coder and the 6.3 kbps and 5.3 kbps G.723.1 coder

In a forward-adaptive AbS coder, the prediction filter coefficients and the gains are transmitted explicitly. To provide toll-quality speech, both coders depend on a source model. The excitation signal (in the form of information about the pitch period of the speech) is also transmitted. This model represents speech signals well, but it is not appropriate for some, or even most, noise signals. Therefore, with background noise or music, the quality of LPAS coders is worse than that of the G.726 and G.727 coders.

  ① G.723.1

  The ITU-T G.723.1 coder provides toll-quality speech at 6.3 kbps. G.723.1 also includes a lower-quality speech coder that works at 5.3 kbps. G.723.1 was designed for low-bit-rate videophones. In this application, because video coding delay is usually greater than speech coding delay, the delay requirement is not stringent. The G.723.1 frame length is 30 ms, with a look-ahead of 7.5 ms. Together with the processing delay, the total one-way delay of the coder is 67.5 ms. Further delay is added by system buffers and by the network.

  The G.723.1 coder first filters the speech signal to the conventional telephone bandwidth (per G.712), then samples it at the conventional rate of 8000 Hz (per G.711) and converts it to 16-bit linear PCM as input to the encoder. The decoder applies the inverse operations to its output to reconstruct the speech signal. G.723.1 encodes the speech signal frame by frame using LPAS coding. The coder can generate voice traffic at two rates: (a) a high rate of 6.3 kbps; (b) a low rate of 5.3 kbps.

  The high-rate coder uses Multipulse Maximum Likelihood Quantization (MP-MLQ), and the low-rate coder uses Algebraic-Code-Excited Linear Prediction (ACELP). Both the encoder and the decoder must support both rates and be able to switch between them from one frame to the next. The system can also compress and decompress music and other audio signals, but it is optimized for speech.

  The encoder operates on frames of 240 samples each, at a sampling rate of 8000 Hz. After preprocessing (a high-pass filter removes the DC component), each frame is divided into 4 subframes of 60 samples each. Various other operations follow, including computing the LPC filter coefficients and the unquantized LSP coefficients. This results in a packetization delay of 30 ms. For each subframe, an LPC filter is computed from the unprocessed input signal. The filter coefficients of the last subframe are quantized with a predictive split vector quantizer (PSVQ). As noted earlier, the look-ahead is 7.5 ms, so the total coding delay is 37.5 ms. Delay is a very important factor when evaluating coders, especially when voice is sent over a data network: the smaller the encoding and decoding delay, the more room is left to absorb the delay and jitter of the Internet. The decoder also works frame by frame. The decoding process is as follows (summarized from the G.723.1 algorithm):

  · The quantized LPC indices are decoded.

  · The LPC synthesis filter is constructed.

  · For each subframe, the adaptive codebook excitation and the fixed codebook excitation are decoded and then fed to the synthesis filter.

  · The excitation signal is processed by the pitch postfilter and then sent to the synthesis filter.

  The synthesized signal is fed to the formant postfilter, which uses a gain scaling unit to keep the output energy at the level of its input.
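The synthesis-filter step in the list above can be pictured as a plain all-pole filter driven by the decoded excitation. The sketch below is only an illustration of that idea, not the bit-exact G.723.1 decoder: it omits codebook decoding and both postfilters, uses a hypothetical first-order filter rather than the real LPC order, and all names are assumptions. The frame and subframe sizes (240 and 60 samples) are the ones given in the text.

```python
FRAME_SAMPLES = 240     # one 30 ms frame at 8000 Hz
SUBFRAME_SAMPLES = 60   # four subframes per frame


def synthesize_frame(excitation, lpc_per_subframe, history):
    """All-pole synthesis: s[n] = e[n] + sum_k a[k] * s[n - 1 - k].

    excitation        -- 240 decoded excitation samples
    lpc_per_subframe  -- four coefficient lists, one per subframe
    history           -- past output samples, most recent last; must hold
                         at least as many samples as the filter order
    """
    if len(excitation) != FRAME_SAMPLES:
        raise ValueError("expected one full frame of excitation")
    if len(lpc_per_subframe) != FRAME_SAMPLES // SUBFRAME_SAMPLES:
        raise ValueError("expected one coefficient set per subframe")
    past = list(history)
    output = []
    for subframe, coeffs in enumerate(lpc_per_subframe):
        start = subframe * SUBFRAME_SAMPLES
        for n in range(start, start + SUBFRAME_SAMPLES):
            predicted = sum(a * past[-1 - k] for k, a in enumerate(coeffs))
            sample = excitation[n] + predicted
            output.append(sample)
            past.append(sample)
    return output


# Illustrative input: a single pulse through a first-order filter.
excitation = [1.0] + [0.0] * (FRAME_SAMPLES - 1)
frame = synthesize_frame(excitation, [[0.5]] * 4, history=[0.0])
print(frame[:4])  # [1.0, 0.5, 0.25, 0.125]
```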

  Silence suppression has been in use for many years. It exploits the fact that silence accounts for about 50% of total conversation time. The basic idea is to reduce the number of bits transmitted during silence, thus saving on the total number of bits that must be transmitted. In the telephone network, time-assigned speech interpolation (TASI) was for many years the main method of handling analog voice signals in this way. The technique places other speech or data signals into the silent periods of a conversation, thereby providing additional capacity on multi-channel links. Today, TASI is applied to digital signals under new names; one example is time division multiple access (TDMA). Briefly, TDMA divides a conventional signal into small digital fragments (time slots). These slots are time-division multiplexed with other slots onto one channel.
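A very simple way to picture silence suppression is an energy threshold applied to each frame: frames whose energy falls below the threshold are not sent as speech. The sketch below is illustrative only. The threshold, the frame length, the sample values, and the function names are all assumptions, and real coders use far more elaborate voice activity detectors.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def mark_active_frames(samples, frame_len, threshold):
    """Split samples into frames and flag each as speech (True) or silence."""
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(frame_energy(frame) >= threshold)
    return flags


# Made-up signal: two loud frames followed by two near-silent frames.
FRAME_LEN = 4
samples = [0.5, -0.4, 0.6, -0.5, 0.3, -0.6, 0.4, -0.2,
           0.01, 0.0, -0.01, 0.0, 0.0, 0.01, 0.0, 0.0]
flags = mark_active_frames(samples, FRAME_LEN, threshold=0.01)
active_share = sum(flags) / len(flags)
print(flags, active_share)  # [True, True, False, False] 0.5
```

With half of the frames flagged as silence, as in the 50% figure quoted above, only half of the frames would need to be sent at the full speech rate.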

  G.723.1 uses silence compression to perform discontinuous transmission, which means that artificial noise is inserted into the bit stream during silent periods. Besides saving bandwidth, this technique keeps the transmitter's modem in continuous operation and avoids switching the carrier signal on and off.

  ② G.729

  The G.729 coder is designed for low-delay applications. Its frame length is only 10 ms and its processing delay is 10 ms; together with the 5 ms look-ahead, this gives G.729 a one-way delay of 25 ms at a bit rate of 8 kbps. Delay performance matters a great deal on the Internet, where anything that reduces latency is valuable.
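The frame length and bit rate together fix how many coded bits each frame carries. A small sketch of that arithmetic, using the G.729 figures from the text (the function name is illustrative):

```python
def bits_per_frame(bit_rate_bps: int, frame_ms: int) -> int:
    """Number of coded bits produced for one frame (integer arithmetic)."""
    return bit_rate_bps * frame_ms // 1000


# G.729: 8 kbps with 10 ms frames.
g729_bits = bits_per_frame(8000, 10)
print(g729_bits, g729_bits // 8)  # 80 10
```

So each 10 ms G.729 frame is 80 bits, or 10 bytes, before any packet headers are added.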

  There are two versions of G.729: G.729 and G.729A. G.729 is simpler than G.723.1. The two versions are interoperable, but their performance differs somewhat: the low-complexity version (G.729A) performs slightly worse. Both coders provide mechanisms for concealing frame erasures and packet loss, so both are good choices for voice transmission over the Internet. Cox et al. [COX98] consider G.729 to perform poorly when handling random bit errors. They do not recommend the coder for channels with random bit errors unless channel coding (forward error correction and convolutional codes, discussed in the wireless part) is used to protect the most sensitive bits.

  3.2 Backward-adaptive LPAS coding: 16 kbps G.728 low-delay code-excited linear prediction

  G.728 is a hybrid between the lower-bit-rate linear-prediction analysis-by-synthesis coders (G.729 and G.723.1) and the backward-adaptive ADPCM coders. G.728 is an LD-CELP coder that processes only 5 samples at a time.

  CELP is a speech coding technique in which the excitation signal is selected from a set of possible excitation signals by exhaustive search. The lower-rate speech coders use a forward-adaptive scheme for the prediction filter, whereas LD-CELP uses a backward-adaptive filter that is updated every 2.5 ms. There are 1024 possible excitation vectors in total. These vectors are further decomposed into four possible gains, two signs (+ and -), and 128 shape vectors.
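The numbers quoted for G.728 fit together arithmetically. The sketch below recomputes the codebook size and the resulting bit rate from the figures in the text, assuming the 8000 Hz telephone sampling rate; the variable names are illustrative.

```python
import math

SAMPLE_RATE_HZ = 8000      # telephone sampling rate (assumed, as for G.711)
VECTOR_SAMPLES = 5         # G.728 codes 5 samples at a time
SHAPES, GAINS, SIGNS = 128, 4, 2

codebook_size = SHAPES * GAINS * SIGNS               # possible excitation vectors
bits_per_vector = int(math.log2(codebook_size))      # index bits per vector
vectors_per_second = SAMPLE_RATE_HZ // VECTOR_SAMPLES
bit_rate_bps = bits_per_vector * vectors_per_second
vector_ms = 1000 * VECTOR_SAMPLES / SAMPLE_RATE_HZ   # duration of one vector

print(codebook_size, bits_per_vector, bit_rate_bps, vector_ms)
# 1024 10 16000 0.625
```

A 10-bit index every 0.625 ms gives the 16 kbps rate, and the 2.5 ms filter update interval corresponds to four such vectors.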

  G.728 is the recommended speech coder for low-rate (56 to 128 kbps) Integrated Services Digital Network (ISDN) videophones. Because of its backward-adaptive nature, G.728 is a low-delay coder, but it is more complex than the other coders, because the 50th-order LPC analysis must be repeated in the decoder. G.728 also uses an adaptive postfilter to improve its performance.

Fourth, parametric speech coders: 2.4 kbps mixed-excitation linear prediction (MELP)

Parametric speech coders use a simplified model of the excitation signal, which lets them work at the lowest bit rates. All of the speech coders discussed so far can be described as waveform-following: their output signals closely resemble the input signal in waveform and phase.

  Parametric speech coders are different: they do not follow the waveform. Such coders are based on an analysis-synthesis model in which the speech signal is represented by relatively few parameters. These parameters are typically extracted from the speech signal and quantized at intervals of 20 ms to 40 ms. At the receiving end, the parameters are used to generate a synthetic speech signal. Under ideal conditions, the synthesized speech sounds like the original. When background noise is high, any parametric coder will fail, because the input signal is not well described by the coder's built-in speech model. The US government selected 2.4 kbps MELP for secure telephony.

  For multimedia applications, [COX98] indicates that parametric coders are a good choice at low bit rates. For example, simple user games often use parametric coders, which reduces the storage space required. For the same reason, parametric coders are also a good choice for some types of multimedia messaging service. Across all types of speech environment, the absolute voice quality of parametric coders is low, especially in noisy environments. If the audio files can be carefully edited beforehand, this disadvantage can be overcome. At present, most of the parametric coders used in multimedia applications are not standardized; they are proprietary coders suited to these applications.

  For wireless communication, Annex C of G.723.1 specifies a channel coding scheme that can be used with the triple-rate speech coder. This variable-bit-rate channel coder, part of the overall H.324 family of standards, is designed for mobile multimedia applications.

  The channel coder supports bit rates ranging from 0.7 kbps to 4.3 kbps. It also supports the three operating modes of the G.723.1 codec: high-rate mode, low-rate mode, and discontinuous transmission mode. The channel coder uses punctured convolutional codes. The channel coder bits are allocated according to the channel bit rate and optimized for the subjective importance of each class of information bits. The allocation algorithm is known to both the encoder and the decoder. Each time the system control signals a change of rate, whether of the channel or of the G.723.1 coder, the algorithm adapts the channel coder to the new speech service configuration.

  If only a low channel coder rate is available, the subjectively most sensitive bits are protected first. As the channel coder bit rate increases, the additional redundancy bits are first used to protect more information bits, and then to strengthen the protection of the classes that are already protected. Before channel coding is applied, some speech parameters are modified as part of the channel adaptation layer to improve robustness against transmission errors.

Fifth, evaluating coders

Several important factors should be considered when evaluating the performance of a coder:

  · Frame size: the length of time of the voice traffic covered by one frame, also known as frame delay. Frames are discrete segments of the speech signal, and each frame is updated according to the speech samples. The coders described here process one frame at a time. The information in each frame is placed in a voice packet and transmitted to the receiving end.

  · Processing delay: the time the coder needs to run the coding algorithm on one frame. It is often simply taken to be equal to the frame delay. Processing delay is also known as algorithmic delay.

  · Look-ahead delay: to help encode the current frame, the coder examines part of the next frame; the length examined is called the look-ahead. The idea of look-ahead is to exploit the close correlation between neighboring speech frames.

  · Frame length: the number of bytes produced by the encoding process (not including headers).

  · Speech bit rate: the output rate of the codec when its input is a standard pulse code modulation voice bit stream (at a bit rate of 64 kbit/s).

  · DSP MIPS: the minimum DSP processor speed needed to support a particular coder. Note that the DSP MIPS figures quoted are not comparable with the MIPS ratings of other processors. Unlike the general-purpose processors in personal computers and workstations, DSPs are designed for specific tasks. Accordingly, a general-purpose processor needs a larger MIPS rating than a dedicated DSP to run the codecs described above.

  · RAM requirements: the amount of RAM needed to support a particular coding process.

  A key factor in evaluating coder performance is the time the coder needs to do its work. This time, the coder's buffering and processing delay, is called the one-way system delay. It equals frame size + processing delay + look-ahead delay. Clearly, decoding delay is also very important. In practice, decoding delay is about half the encoding delay.
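The one-way delay formula can be checked against the G.723.1 and G.729 figures quoted earlier. In this sketch the processing delay is taken to be equal to the frame size, as the text does for both coders; the function name is illustrative.

```python
def one_way_delay_ms(frame_ms, processing_ms, lookahead_ms):
    """Coder one-way delay = frame size + processing delay + look-ahead."""
    return frame_ms + processing_ms + lookahead_ms


# Values quoted in the text; processing delay is taken equal to the frame size.
g7231_delay = one_way_delay_ms(30, 30, 7.5)
g729_delay = one_way_delay_ms(10, 10, 5)
print(g7231_delay, g729_delay)  # 67.5 25
```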

Sixth, comparison of speech coders

To close the discussion of standard coders, Table 4-1 [RUDK97] summarizes several of them, comparing bit rate, MOS, complexity (relative to G.711), and delay (frame size plus look-ahead time).

Standard        Encoding type   Bit rate (kbps)   MOS                  Complexity   Delay (ms)
G.711           PCM             64                4.3                  1            0.125
G.726           ADPCM           32                4.0                  10           0.125
G.728           LD-CELP         16                4.0                  50           0.625
GSM             RPE-LTP         13                3.7                  5            20
G.729           CS-ACELP        8                 4.0                  30           15
G.729A                                                                 15
G.723.1         ACELP           6.3               3.8                  25           37.5
                MP-MLQ          6.3
US DoD FS1015   LPC-10          2.4               Synthesized speech   10           22.5

Seventh, summary

The speech coder is the engine that creates and processes VoIP packets. It is driven by a DSP. The original DS0, TDM-based G.711 64 kbps coder will eventually be phased out and replaced by low-bit-rate coders.

Reproduced from: https://www.cnblogs.com/leaway/archive/2007/10/27/939292.html

Origin blog.csdn.net/weixin_34354173/article/details/93840440