Introduction to Speech Recognition 01: Basic Concepts

1 Basic concepts

Audio sampling rate (sample rate)

  • The audio sampling rate is the number of times per second the recording device samples the sound signal. The higher the sampling rate, the more faithfully the sound is reproduced.
  • The speech synthesis service supports only two sampling rates, 16000 Hz and 8000 Hz; 8000 Hz is generally used for telephone audio in customer service scenarios.
  • The speech recognition service likewise supports only audio at 16000 Hz or 8000 Hz; 8000 Hz is generally used for telephone audio in customer service scenarios.
  • When calling a speech service you must set the sampling rate parameter, and it must match the actual audio data; otherwise recognition/synthesis quality will suffer. If the data's sampling rate is higher than 16000 Hz, downsample it to 16000 Hz before sending it to the recognition service. If the data's sampling rate is 8000 Hz, do not upsample it to 16000 Hz; choose the 8000 Hz model for recognition instead.
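As a minimal illustration of the downsampling advice above, the sketch below halves the rate of a 16-bit little-endian PCM stream (e.g. 32000 Hz to 16000 Hz) by keeping every other sample. This is only a toy: a real pipeline should low-pass filter before decimating to avoid aliasing, and use a proper resampler for non-integer ratios.

```python
import struct

def downsample_by_two(pcm_bytes: bytes) -> bytes:
    """Naive 2:1 decimation of 16-bit little-endian mono PCM.

    Keeps every other sample; a production pipeline should apply a
    low-pass (anti-aliasing) filter first.
    """
    n = len(pcm_bytes) // 2
    samples = struct.unpack("<%dh" % n, pcm_bytes)
    kept = samples[::2]
    return struct.pack("<%dh" % len(kept), *kept)

# Four samples at 32000 Hz become two samples at 16000 Hz.
src = struct.pack("<4h", 100, 200, 300, 400)
print(downsample_by_two(src) == struct.pack("<2h", 100, 300))  # True
```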

Audio sample size (sample size)

  • The sample size (bit depth) is the quantization precision of each sample's amplitude, i.e. the number of levels used to measure the sound's fluctuation; it can be thought of as the sound card's resolution. The larger the value, the higher the resolution. Speech recognition commonly uses 16-bit little-endian samples, so each sample is stored in 2 bytes.
  • Each sample records the amplitude at one instant in time; the sampling precision depends on the number of bits:
  • 1 byte (8 bit) can represent 256 values, i.e. the amplitude is divided into 256 levels.
  • 2 bytes (16 bit) can represent 65536 values.
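The "16-bit little-endian, 2 bytes per sample" convention above can be checked directly with Python's standard `struct` module (`<h` means little-endian signed 16-bit):

```python
import struct

# 16-bit PCM: each sample is 2 bytes, signed range -32768..32767.
sample = -12345
packed = struct.pack("<h", sample)       # little-endian signed 16-bit
print(len(packed))                       # 2 -> two bytes per sample
print(struct.unpack("<h", packed)[0])    # -12345 -> round-trips exactly

# 8-bit audio distinguishes 2**8 = 256 levels; 16-bit, 2**16 = 65536.
print(2 ** 8, 2 ** 16)                   # 256 65536
```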

Speech coding (format)

Speech coding is the way voice data is stored and transmitted; it is distinct from the file format. For example, the common WAV file format defines the encoding of its voice data in the file header, and the audio data is usually PCM or another encoding. Before calling a speech service, confirm that the encoding format of your voice data is supported by the service.
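The distinction between container and encoding can be seen with Python's standard `wave` module: the sketch below writes a short 16 kHz, 16-bit mono PCM WAV to memory, then reads the header back, which is the kind of check you would run on your data before calling a speech API.

```python
import io
import struct
import wave

# Write a tiny 16000 Hz, 16-bit, mono PCM WAV into an in-memory buffer.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)       # mono
    w.setsampwidth(2)       # 2 bytes per sample -> 16-bit
    w.setframerate(16000)   # 16000 Hz
    w.writeframes(struct.pack("<4h", 0, 1000, -1000, 0))

# Read the header back: these are the parameters the service must support.
buf.seek(0)
with wave.open(buf, "rb") as r:
    print(r.getnchannels(), r.getsampwidth(), r.getframerate())
    # 1 2 16000 -> mono, 16-bit, 16000 Hz PCM
```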

Channel

Sound is recorded as independent audio signals collected at different spatial locations, so the number of channels equals the number of signal sources at recording time. Common audio data is mono or two-channel (stereo). Except for recorded-file recognition, the recognition services support only mono voice data; if the data is stereo, convert it to mono first.
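A common way to do the stereo-to-mono conversion mentioned above is to average the interleaved left/right samples; a minimal sketch for 16-bit little-endian PCM:

```python
import struct

def stereo_to_mono(pcm_bytes: bytes) -> bytes:
    """Average interleaved L/R 16-bit little-endian samples into mono."""
    n = len(pcm_bytes) // 2
    samples = struct.unpack("<%dh" % n, pcm_bytes)
    # samples[0::2] are left-channel values, samples[1::2] right-channel.
    mono = [(l + r) // 2 for l, r in zip(samples[0::2], samples[1::2])]
    return struct.pack("<%dh" % len(mono), *mono)

# Two stereo frames (L, R): (100, 300) and (-200, 0) -> 200 and -100.
stereo = struct.pack("<4h", 100, 300, -200, 0)
print(stereo_to_mono(stereo) == struct.pack("<2h", 200, -100))  # True
```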

Speech recognition result

  • A non-streaming API returns the recognition result all at once.
  • A streaming API keeps returning text (intermediate recognition results) while recognition is in progress; these intermediate results can be used to render text on screen continuously.
    For example, for an utterance whose final result is "你好标贝科技" ("Hello Biaobei Technology"), the intermediate results might be returned several times:
    你
    你好
    你好标贝
    你好标贝科技
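The growing intermediate results above can be mimicked with a simple generator. The function below is a hypothetical stand-in for a streaming ASR API, not any real SDK call; it just yields ever-longer prefixes so you can see how a client would redraw the same line for the "continuous screen" effect.

```python
from typing import Iterator

def fake_streaming_results(final_text: str) -> Iterator[str]:
    """Hypothetical stand-in for a streaming ASR API.

    Yields growing intermediate results, ending with the final transcript.
    A real service would push these over a websocket or gRPC stream.
    """
    for i in range(1, len(final_text) + 1):
        yield final_text[:i]

# A client would overwrite the displayed line with each partial result.
for partial in fake_streaming_results("你好标贝科技"):
    print(partial)
# prints 你, then 你好, ..., ending with 你好标贝科技
```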


Origin blog.csdn.net/sdlypyzq/article/details/108334322