The Ultimate Guide to Python Speech Recognition


 

The huge success of Amazon's Alexa has proven that, in the not-too-distant future, some level of voice support will be a basic requirement for everyday technology. Python programs that incorporate speech recognition offer a level of interactivity and accessibility that few other technologies can match. Best of all, adding speech recognition to a Python program is very simple. With this guide, you will learn:

  • how speech recognition works;
  • which speech recognition packages are available on PyPI;
  • how to install and use the SpeechRecognition package, a full-featured and easy-to-use Python speech recognition library.

 

▌Overview of how speech recognition works

Speech recognition grew out of research done at Bell Labs in the early 1950s. Early systems could recognize only a single speaker and had vocabularies of only about a dozen words. Modern speech recognition systems have come a long way: they can recognize speech from multiple speakers and have large vocabularies spanning many languages.

The first part of speech recognition is, of course, speech. With a microphone, speech is converted from physical sound to electrical signals, which are then converted to data by an analog-to-digital converter. Once digitized, several models can be applied to transcribe audio to text.

Most modern speech recognition systems rely on Hidden Markov Models (HMMs). Their working principle is that on a very short time scale (e.g., 10 ms), a speech signal can be approximated as a stationary process, that is, a process whose statistical properties do not change over time.

Many modern speech recognition systems use neural networks to simplify the speech signal through feature transformation and dimensionality reduction before HMM recognition. Voice activity detectors (VADs) can also be used to reduce the audio signal to only the parts that are likely to contain speech.
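To make the short-time framing idea concrete, here is a toy sketch of an energy-based voice activity detector. It is not part of any speech recognition package; it assumes NumPy, a mono signal scaled to the range -1 to 1, and an arbitrary threshold. Real VADs are considerably more sophisticated.

import numpy as np

def simple_energy_vad(signal, sample_rate=16000, frame_ms=10, threshold=0.01):
    """Toy VAD: flag 10 ms frames whose RMS energy exceeds a threshold."""
    signal = np.asarray(signal, dtype=np.float64)
    frame_len = int(sample_rate * frame_ms / 1000)     # samples per 10 ms frame
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))          # energy of each frame
    return rms > threshold                             # True = frame likely has speech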

Fortunately for Python users, several speech recognition services are available online through APIs, and most of them also provide a Python SDK.

 

▌Choosing a Python speech recognition package

There are a handful of ready-made speech recognition packages on PyPI. These include:

  • apiai
  • google-cloud-speech
  • pocketsphinx
  • SpeechRecognition
  • watson-developer-cloud
  • wit

Some of these packages (such as wit and apiai) offer built-in capabilities beyond basic speech recognition, such as natural language processing for identifying a speaker's intent. Others, like google-cloud-speech, focus solely on speech-to-text conversion.

Among them, SpeechRecognition stands out because of its ease of use.

Recognizing speech requires audio input, and retrieving audio input with SpeechRecognition is simple: instead of building scripts from scratch to access microphones and process audio files, the library retrieves the input for you and has you up and running in just a few minutes.

The SpeechRecognition library acts as a wrapper for several popular speech APIs, making it extremely flexible. One of these, the Google Web Speech API, supports a default API key that is hard-coded into the SpeechRecognition library, so it can be used without registering for an account. Its flexibility and ease of use make SpeechRecognition an excellent choice for writing Python programs.

 

▌Install SpeechRecognition

SpeechRecognition is compatible with Python 2.6, 2.7 and 3.3+, but some additional installation steps are required to use it with Python 2. All code in this tutorial assumes Python 3.3+.

You can install SpeechRecognition from the terminal with pip:

$ pip install SpeechRecognition

Once the installation is complete, open an interpreter session and enter the following to verify the installation:

>>> import speech_recognition as sr
>>> sr.__version__
'3.8.1'

NOTE: Do not close this session, you will be using it in the next few steps.

SpeechRecognition can work with existing audio files out of the box, though some use cases require extra dependencies. In particular, the PyAudio package must be installed in order to capture microphone input.

 

▌The Recognizer class

The core of SpeechRecognition is the Recognizer class.

The main purpose of a Recognizer instance is, of course, to recognize speech. Each instance has seven methods for recognizing speech from an audio source, each tied to a specific API:

  • recognize_bing(): Microsoft Bing Speech
  • recognize_google(): Google Web Speech API
  • recognize_google_cloud(): Google Cloud Speech - requires installation of the google-cloud-speech package
  • recognize_houndify(): Houndify by SoundHound
  • recognize_ibm(): IBM Speech to Text
  • recognize_sphinx(): CMU Sphinx - requires installing PocketSphinx
  • recognize_wit(): Wit.ai

Of the above seven, only recognize_sphinx() works offline, using the CMU Sphinx engine; the other six require an internet connection.

SpeechRecognition comes with a default API key for the Google Web Speech API, which can be used right away. The other six APIs require authentication with an API key or a username/password combination, which is why this article uses the Web Speech API.
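For reference, the credentialed recognizers take their credentials as keyword arguments; recognize_google(), for instance, accepts an optional key argument. The snippet below is illustrative only: the key value is a placeholder, and audio stands for an AudioData instance of the kind created later in this guide.

>>> # Illustrative only: "YOUR_GOOGLE_API_KEY" is a placeholder
>>> r.recognize_google(audio, key="YOUR_GOOGLE_API_KEY")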

Now, to get started, create a Recognizer instance and call its recognize_google() method in the interpreter session:

>>> r = sr.Recognizer()
>>> r.recognize_google()

The following will appear on screen:

Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
TypeError: recognize_google() missing 1 required positional argument: 'audio_data'

You probably guessed this result: how could anything be recognized when no audio was supplied?

All seven recognize_*() methods require an audio_data argument, and this argument must be an instance of SpeechRecognition's AudioData class.

There are two ways to create an AudioData instance: from an audio file or from audio recorded with a microphone. Audio files are easier to get started with, so let's begin there.

 

▌Working with audio files

First, download the audio file (https://github.com/realpython/python-speech-recognition/tree/master/audio_files) and save it to the directory from which your Python interpreter session is running.

The AudioFile class can be initialized with a path to an audio file and provides a context manager interface for reading and manipulating the file contents.

Supported file types

The file types currently supported by SpeechRecognition are:

  • WAV: must be in PCM/LPCM format
  • FLAC: must be in native FLAC format; OGG-FLAC is not supported

If you are on x86-based Linux, macOS, or Windows, you should be able to work with FLAC files without any extra steps. On other platforms, you will need to install a FLAC encoder and make sure you have access to the flac command-line tool.
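As an aside, if your audio is in an unsupported format such as MP3, one common workaround, assuming the ffmpeg command-line tool is installed (input.mp3 is a placeholder name), is to convert the file to PCM WAV first:

$ ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav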

Use record() to get data from a file

Type the following into the interpreter session to process the contents of the "harvard.wav" file:

>>> harvard = sr.AudioFile('harvard.wav')
>>> with harvard as source:
...   audio = r.record(source)
...

The context manager opens the file and reads its contents, storing the data in the AudioFile instance source. The record() method then records the data from the entire file into an AudioData instance. You can confirm this by checking the type of audio:

>>> type(audio)
<class 'speech_recognition.AudioData'>

You can now call recognize_google() to try to recognize the speech in the audio:

>>> r.recognize_google(audio)
'the stale smell of old beer lingers it takes heat
to bring out the odor a cold dip restores health and
zest a salt pickle taste fine with ham tacos al
Pastore are my favorite a zestful food is the hot
cross bun'

Congratulations: you have just transcribed your first audio file.
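For reference, here are the steps so far collected into a standalone script; the file name in the comment is hypothetical, and harvard.wav is assumed to be in the same directory.

# transcribe_harvard.py (hypothetical file name)
import speech_recognition as sr

r = sr.Recognizer()
harvard = sr.AudioFile('harvard.wav')

with harvard as source:
    audio = r.record(source)     # read the entire file into an AudioData

print(r.recognize_google(audio))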

Capturing audio segments with offset and duration

What if you only want to capture part of the speech in a file? The record() method accepts a duration keyword argument that stops the recording after the specified number of seconds.

For example, the following gets only the first four seconds of speech in the file:

>>> with harvard as source:
...   audio = r.record(source, duration=4)
...
>>> r.recognize_google(audio)
'the stale smell of old beer lingers'

When record() is called inside a with block, the file stream moves forward. This means that if you record for four seconds and then record for four seconds again, the second call returns the four seconds of audio after the first four seconds.

>>> with harvard as source:
...   audio1 = r.record(source, duration=4)
...   audio2 = r.record(source, duration=4)
...
>>> r.recognize_google(audio1)
'the stale smell of old beer lingers'
>>> r.recognize_google(audio2)
'it takes heat to bring out the odor a cold dip'

In addition to specifying a recording duration, you can use the offset keyword argument to give record() a starting point; its value is the number of seconds from the beginning of the file at which to start recording. For example, to capture only the second phrase in the file, set an offset of 4 seconds and a duration of 3 seconds:

>>> with harvard as source:
...   audio = r.record(source, offset=4, duration=3)
...
>>> r.recognize_google(audio)
'it takes heat to bring out the odor'

The offset and duration keyword arguments are useful for splitting up an audio file when you know the structure of the speech in advance. For instance, you could transcribe a long recording in fixed-size chunks, as in the sketch below.
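This sketch rests on a few assumptions: long_audio.wav is a hypothetical file name, each chunk is sent to the API separately, and UnknownValueError (an exception covered later in this guide) is caught in case a chunk contains no intelligible speech.

# Sketch: transcribe a long recording in consecutive 10-second chunks
import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile('long_audio.wav') as source:
    while True:
        chunk = r.record(source, duration=10)   # the stream advances each call
        if not chunk.frame_data:                # end of file reached
            break
        try:
            print(r.recognize_google(chunk))
        except sr.UnknownValueError:
            print('[unintelligible chunk]')

Used carelessly, however, offset and duration can result in poor transcription: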

>>> with harvard as source:
...   audio = r.record(source, offset=4.7, duration=2.8)
...
>>> r.recognize_google(audio)
'Mesquite to bring out the odor Aiko'

This attempt starts recording at 4.7 seconds, which misses the "it t" at the beginning of the phrase "it takes heat to bring out the odor". The API received only "akes heat" as input and matched it to "Mesquite".

Likewise, the recording cut off partway through the final phrase, "a cold dip restores health and zest", so the API received only "a co", which it incorrectly matched to "Aiko".

Noise is another major culprit behind poor transcription accuracy. The examples above work well because the audio file is clean, but in the real world it is impossible to get noise-free audio unless the file has been processed beforehand.

The effect of noise on speech recognition

 

Noise is a fact of life: all recordings contain some degree of noise, and unhandled noise can wreck the accuracy of a speech recognition application.

To see how noise affects speech recognition, download the "jackhammer.wav" file (https://github.com/realpython/python-speech-recognition/tree/master/audio_files) and save it to the working directory of your interpreter session. In this file, the phrase "the stale smell of old beer lingers" is spoken over the sound of a loud jackhammer.

What happens when you try to transcribe this file?

>>> jackhammer = sr.AudioFile('jackhammer.wav')
>>> with jackhammer as source:
...   audio = r.record(source)
...
>>> r.recognize_google(audio)
'the snail smell of old gear vendors'

So how do you deal with this problem? You can try calling the adjust_for_ambient_noise() method of the Recognizer class:

>>> with jackhammer as source:
...   r.adjust_for_ambient_noise(source)
...   audio = r.record(source)
...
>>> r.recognize_google(audio)
'still smell of old beer vendors'

That is much closer to the correct result, but accuracy is still a problem, and the "the" at the beginning of the phrase is missing. Why?

By default, adjust_for_ambient_noise() reads the first second of the file stream and calibrates the recognizer to the noise level of the audio, so that portion of the stream is consumed before record() gets the data.

You can adjust the time window that adjust_for_ambient_noise() analyzes with the duration keyword argument, which is expressed in seconds and defaults to 1. Let's reduce this value to 0.5:

>>> with jackhammer as source:
...   r.adjust_for_ambient_noise(source, duration=0.5)
...   audio = r.record(source)
...
>>> r.recognize_google(audio)
'the snail smell like old Beer Mongers'

Now we've captured the "the" at the beginning of the phrase, but new problems have appeared. Sometimes the noise is simply too loud for its effects to be cancelled out.

If you run into these problems frequently, some preprocessing of the audio is required. This can be done with audio-editing software, or with a Python package such as SciPy that can apply filters to the file.
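As an illustration only, not a recipe from the SpeechRecognition documentation, a rough band-pass filter with SciPy might look like the sketch below. It assumes a 16-bit mono PCM WAV file, and whether it helps depends entirely on where the noise energy sits; the 300-3000 Hz cutoffs are arbitrary placeholders.

# Sketch: attenuate energy outside the main speech band with a band-pass filter
from scipy.io import wavfile
from scipy.signal import butter, sosfilt

rate, data = wavfile.read('jackhammer.wav')                 # 16-bit mono PCM assumed
sos = butter(10, [300, 3000], btype='bandpass', fs=rate, output='sos')
filtered = sosfilt(sos, data).astype(data.dtype)            # filter, then re-quantize
wavfile.write('jackhammer_filtered.wav', rate, filtered)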

When dealing with noisy files, it can also help to look at the actual API response. Most APIs return a JSON string containing several candidate transcriptions, but recognize_google() returns only the most likely transcription unless you request the full response. You can do that by setting the show_all keyword argument of recognize_google() to True:

>>> r.recognize_google(audio, show_all=True)
{'alternative': [
 {'transcript': 'the snail smell like old Beer Mongers'}, 
 {'transcript': 'the still smell of old beer vendors'}, 
 {'transcript': 'the snail smell like old beer vendors'},
 {'transcript': 'the stale smell of old beer vendors'}, 
 {'transcript': 'the snail smell like old beermongers'}, 
 {'transcript': 'destihl smell of old beer vendors'}, 
 {'transcript': 'the still smell like old beer vendors'}, 
 {'transcript': 'bastille smell of old beer vendors'}, 
 {'transcript': 'the still smell like old beermongers'}, 
 {'transcript': 'the still smell of old beer venders'}, 
 {'transcript': 'the still smelling old beer vendors'}, 
 {'transcript': 'musty smell of old beer vendors'}, 
 {'transcript': 'the still smell of old beer vendor'}
], 'final': True}

As you can see, recognize_google() returns a dictionary with the key 'alternative', which points to a list of all the possible transcriptions. The structure of this response varies by API and is mainly useful for debugging.
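For example, you could scan the candidates programmatically. This is a sketch that assumes the Google Web Speech API response structure shown above; note that with show_all=True, an empty result comes back instead of an exception when nothing is recognized:

>>> response = r.recognize_google(audio, show_all=True)
>>> if response:   # empty result means nothing was recognized
...   for alternative in response['alternative']:
...     print(alternative['transcript'])
...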

 

▌Working with microphones

To access the microphone with SpeechRecognition, the PyAudio package must be installed. Close the current interpreter window and do the following:

Install PyAudio

The process of installing PyAudio varies by operating system.

Debian Linux

If you are using a Debian-based Linux such as Ubuntu, you can install PyAudio using apt:

$ sudo apt-get install python-pyaudio python3-pyaudio

Even after installing with apt, you may still need to run pip install pyaudio, especially if you are working in a virtual environment.

macOS: macOS users first need to use Homebrew to install PortAudio, and then invoke the pip command to install PyAudio.

$ brew install portaudio
$ pip install pyaudio

Windows: Windows users can directly invoke pip to install PyAudio.

$ pip install pyaudio

Installation test: After installing PyAudio, you can test the installation from the console.

$ python -m speech_recognition

Make sure the default microphone is on and unmuted, if the installation works you should see something like this:

A moment of silence, please...
Set minimum energy threshold to 600.4452854381937
Say something!

Please speak into the microphone and watch how SpeechRecognition transcribes your speech.

Microphone class

Open another interpreter session and create an instance of the Recognizer class:

>>> import speech_recognition as sr
>>> r = sr.Recognizer()

Instead of an audio file, the default system microphone will now be used as the source. You can access it by creating an instance of the Microphone class:

>>> mic = sr.Microphone()

If your system has no default microphone (as on a Raspberry Pi), or if you want to use a microphone other than the default one, you will need to specify which device to use by supplying a device index. You can get a list of microphone names by calling the list_microphone_names() static method of the Microphone class:

>>> sr.Microphone.list_microphone_names()
['HDA Intel PCH: ALC272 Analog (hw:0,0)',
 'HDA Intel PCH: HDMI 0 (hw:0,3)',
 'sysdefault',
 'front',
 'surround40',
 'surround51',
 'surround71',
 'hdmi',
 'pulse',
 'dmix', 
 'default']

Note: Your output may differ from the example above.

list_microphone_names() returns a list of the names of the available microphone devices; a device's index is its position in that list. In the output above, the microphone named "front" has index 3, so you would create the microphone instance like this:

>>> # This is just an example; do not run
>>> mic = sr.Microphone(device_index=3)

In most cases, however, the system default microphone is what you want.
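To see which index belongs to which device, you can pair each name with its position; this small loop is just a convenience, not part of the library's API:

>>> for index, name in enumerate(sr.Microphone.list_microphone_names()):
...   print('{}: {}'.format(index, name))
...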

Use listen() to get microphone input data

 

Once the microphone instance is ready, you can capture some input.

Like the AudioFile class, Microphone is a context manager. Input from the microphone is captured with the listen() method of the Recognizer class inside a with block. This method takes an audio source as its first argument and records input from the source until silence is detected, at which point it stops automatically.

>>> with mic as source:
...   audio = r.listen(source)
...

After executing the with block, try saying "hello" into the microphone. Wait for the interpreter prompt to reappear; once the ">>>" prompt returns, the speech is ready to be recognized:

>>> r.recognize_google(audio)
'hello'

If the prompt never returns, your microphone is most likely picking up too much ambient noise. Interrupt the process with Ctrl+C to get the prompt back.

To handle ambient noise, call the adjust_for_ambient_noise() method of the Recognizer class, just as you did with the noisy audio file. Since microphone input is far less predictable than an audio file, it is a good idea to do this every time you listen for microphone input:

>>> with mic as source:
...   r.adjust_for_ambient_noise(source)
...   audio = r.listen(source)
...

After running the code above, wait a moment and try saying "hello" into the microphone. Likewise, you must wait for the interpreter prompt to return before attempting speech recognition.

Remember that adjust_for_ambient_noise() analyzes one second of audio from the source by default. If you find that too long, you can shorten it with the duration argument.

The SpeechRecognition documentation recommends using a duration of no less than 0.5 seconds. In some cases you may find that durations longer than the default of one second produce better results. The minimum value you need depends on the microphone's surroundings, but this information is usually unknown during development. In my experience, the default one-second duration is sufficient for most applications.

Handling difficult-to-recognize speech

Try typing the previous code example into the interpreter and then making some unintelligible noises into the microphone. You should get a response like this:

Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "/home/david/real_python/speech_recognition_primer/venv/lib/python3.5/site-packages/speech_recognition/__init__.py", line 858, in recognize_google
  if not isinstance(actual_result, dict) or len(actual_result.get("alternative", [])) == 0: raise UnknownValueError()
speech_recognition.UnknownValueError

Audio that the API cannot match to text raises an UnknownValueError exception, so you should always wrap calls to the API in try and except blocks to handle this case. The API does its best to turn any sound into text: short grunts may be transcribed as "how", and coughs, hand claps, and tongue clicks can all be turned into text or raise this exception.
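A typical pattern looks like the following sketch. It also catches RequestError, which SpeechRecognition raises when the API is unreachable or returns an error (a case not shown in the traceback above):

>>> try:
...   print(r.recognize_google(audio))
... except sr.UnknownValueError:
...   print('unable to recognize the audio')
... except sr.RequestError as e:
...   print('API request failed: {}'.format(e))
...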

▌Conclusion

In this tutorial we have been recognizing speech in English, which is the default language for each recognize_*() method of the SpeechRecognition package. However, recognizing speech in other languages is absolutely possible and easy to do. To recognize speech in a different language, set the language keyword argument of the recognize_*() method to a string corresponding to the desired language.
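For example, French speech can be recognized with the 'fr-FR' language tag (audio being an AudioData instance, as before):

>>> r.recognize_google(audio, language='fr-FR')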

Original link: http://www.aibbt.com/a/28552.html
