In this article, we will use Whisper to create a speech-to-text application. Whisper requires a Python backend, so we will build the server with Flask and use React Native as the framework for the mobile client. I hope you enjoy the process of creating this app, because I did. Let's dive right into it.
What is speech recognition?
Speech recognition enables programs to process human speech into a written format. Grammar, syntax, structure and audio are critical to understanding and processing human speech.
Speech recognition algorithms are among the most complex areas of computer science. Developments in artificial intelligence, machine learning, unsupervised pre-training techniques, and frameworks such as Wav2Vec 2.0, which are effective at self-supervised learning from raw audio, have improved their capabilities.
The speech recognizer consists of the following components:
- Voice input
- A decoder that relies on acoustic models, pronunciation dictionaries, and language models for output
- Word output
These components and technological advances enable the consumption of large datasets of unlabeled speech. Pre-trained audio encoders are capable of learning high-quality speech representations; their only drawback is their unsupervised nature.
What is a decoder?
A high-performance decoder maps speech representations to usable outputs and addresses the lack of supervision in audio encoders. However, decoders limit the effectiveness of frameworks like Wav2Vec for speech recognition: they can be complex to use and require a skilled practitioner, especially since technologies such as Wav2Vec 2.0 are difficult to work with.
The key is to combine as many high-quality speech recognition datasets as possible. Models trained in this way are more effective than models trained on a single source.
What is Whisper?
Whisper, or WSPSR, stands for Web-scale Supervised Pre-training for Speech Recognition. The Whisper model is trained to predict the text of transcripts.
Whisper relies on a sequence-to-sequence model to map between an utterance and its transcribed form, which makes the speech recognition pipeline more efficient. Whisper comes with an audio language detector, a fine-tuned model trained on VoxLingua107.
The Whisper dataset consists of audio paired with transcripts from the internet. The quality of the dataset is improved by using automated filtering methods.
Setting up Whisper
To use Whisper, we need Python as our backend. Whisper also requires the command-line tool ffmpeg, which enables our application to record, convert, and stream audio and video.
The following are the necessary commands to install ffmpeg on different machines:
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
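Since Whisper shells out to the ffmpeg binary when decoding audio, it can be useful for the backend to fail early with a clear message if the binary is missing. A minimal Python sketch (the helper name is my own):

```python
import shutil

def ffmpeg_available() -> bool:
    # Whisper invokes the ffmpeg binary to decode audio,
    # so it must be discoverable on the PATH.
    return shutil.which("ffmpeg") is not None

if not ffmpeg_available():
    print("ffmpeg not found; install it with one of the commands above")
```

You could call this check at server startup before loading any Whisper model.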
Create a backend application using Flask
In this section, we will create the backend services for our application. Flask is a web framework written in Python. I chose to use Flask for this application because of its ease of setup.
The Flask development team recommends using the latest version of Python, although Flask still supports Python ≥ 3.7.
Once the installation prerequisites are complete, we can create project folders to hold the client and backend applications.
mkdir translateWithWhisper && cd translateWithWhisper && mkdir backend && cd backend
Flask utilizes virtual environments to manage project dependencies; Python has an out-of-the-box venv module to create them.
Use the following command in a terminal window to create the venv folder. This folder contains our dependencies:
python3 -m venv venv
Specify project dependencies
Use a requirements.txt file to specify the necessary dependencies. This file is located in the root of the backend directory.
touch requirements.txt
code requirements.txt
Copy and paste the following code into the requirements.txt file:
numpy
tqdm
transformers>=4.19.0
ffmpeg-python==0.2.0
pyaudio
SpeechRecognition
pydub
git+https://github.com/openai/whisper.git
--extra-index-url https://download.pytorch.org/whl/cu113
torch
flask
flask_cors
Create a Bash shell script to install dependencies
In the root project directory, create a Bash shell script file. This script will handle the installation of dependencies for the Flask application.
In the root project directory, open a terminal window. Create a shell script using the following command:
touch install_dependencies.sh
code install_dependencies.sh
Copy and paste the following code block into the install_dependencies.sh file:
# install and run backend
cd backend && python3 -m venv venv
source venv/Scripts/activate  # use venv/bin/activate on Linux and macOS
pip install wheel
pip install -r requirements.txt
Now, open a terminal window in the root directory and run the following command:
sh install_dependencies.sh
Create the transcribe endpoint
Now, we will create a transcribe endpoint in the application that will receive audio input from the client. The application will transcribe the input and return the transcribed text to the client.
This endpoint accepts POST requests and handles the input. When the response is a 200 HTTP response, the client receives the transcribed text.
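To sanity-check this endpoint without the mobile client, you can build the same multipart POST in plain Python. This sketch uses only the standard library and assumes the three form fields the server reads (language, model_size, audio_data); the helper name and URL are illustrative:

```python
import io
import urllib.request
import uuid

def build_transcribe_request(url, wav_bytes, language="english", model_size="base"):
    # Assemble a multipart/form-data body by hand with the three
    # fields the /transcribe endpoint expects.
    boundary = uuid.uuid4().hex
    body = io.BytesIO()

    def text_field(name, value):
        body.write(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )

    text_field("language", language)
    text_field("model_size", model_size)
    body.write(
        f'--{boundary}\r\nContent-Disposition: form-data; '
        f'name="audio_data"; filename="temp.wav"\r\n'
        f'Content-Type: audio/wav\r\n\r\n'.encode()
    )
    body.write(wav_bytes)
    body.write(f"\r\n--{boundary}--\r\n".encode())
    return urllib.request.Request(
        url,
        data=body.getvalue(),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

With the Flask server running, `urllib.request.urlopen(req).read()` on the built request would return the transcribed text.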
Create an app.py file to hold the logic for processing the input. Open a new terminal window and create the app.py file in the backend directory:
touch backend/app.py
code backend/app.py
Copy and paste the following code block into the app.py file:
import os
import tempfile
import flask
from flask import request
from flask_cors import CORS
import whisper

app = flask.Flask(__name__)
CORS(app)

# endpoint for handling the transcribing of audio inputs
@app.route('/transcribe', methods=['POST'])
def transcribe():
    if request.method == 'POST':
        language = request.form['language']
        model = request.form['model_size']

        # there are no English-only models for large
        if model != 'large' and language == 'english':
            model = model + '.en'
        audio_model = whisper.load_model(model)

        temp_dir = tempfile.mkdtemp()
        save_path = os.path.join(temp_dir, 'temp.wav')

        wav_file = request.files['audio_data']
        wav_file.save(save_path)

        if language == 'english':
            result = audio_model.transcribe(save_path, language='english')
        else:
            result = audio_model.transcribe(save_path)

        return result['text']
    else:
        return "This endpoint only processes POST wav blob"
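The model-selection branch is worth isolating: Whisper publishes English-only checkpoints (tiny.en, base.en, small.en, medium.en) but no large.en, which is why the code skips the suffix for the large model. A small helper capturing that rule (the function name is my own):

```python
def resolve_model_name(model_size: str, language: str) -> str:
    # Whisper ships English-only variants for every size except "large";
    # they are generally more accurate for English-only audio.
    if language == "english" and model_size != "large":
        return model_size + ".en"
    return model_size
```

The resolved name is what gets passed to `whisper.load_model`.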
Run the Flask application
In a terminal window with the venv activated, run the following commands to start the application:
$ cd backend
$ flask run --port 8000
Expect the application to start without any errors; Flask will report the address it is serving on in the terminal window.
This concludes the process of creating the transcribe endpoint in your Flask application.
Host the server
To make a network request from the mobile client, we need to route it through an HTTPS server. ngrok solves the problem of creating these reroutes.
Download ngrok, then install the package and open it. A terminal window starts; enter the following command to host the server using ngrok:
ngrok http 8000
ngrok will generate a hosting URL that will be used in client applications for requests.
Create a speech recognition mobile app using React Native
For this part of the tutorial, you need to install a few things:
- Expo CLI: a command-line tool for interfacing with Expo tools
- Expo Go app for Android and iOS: used to open apps served through the Expo CLI
In a new terminal window, initialize the React Native project:
npx create-expo-app client
cd client
Now, start the development server:
npx expo start
To open the app on your iOS device, open the camera and scan the QR code on the terminal. On your Android device, press Scan QR Code on the Home tab of the Expo Go app.
Our Expo Go App
Process recordings
The expo-av package handles the recording of audio in our application. Our Flask server requires files in the .wav format, and expo-av allows us to specify the format before saving.
Install the necessary packages in the terminal:
yarn add axios expo-av react-native-picker-select
Create a model selector
The application must be able to select the model size. There are five options to choose from:
- tiny
- base
- small
- medium
- large
The selected size determines which model the input is run against on the server.
Again in the terminal, use the following commands to create a src folder with a components subfolder, and the Mode.tsx file inside it:
mkdir src
mkdir src/components
touch src/components/Mode.tsx
code src/components/Mode.tsx
Paste the following code block into the Mode.tsx file:
import React from "react";
import { View, Text, StyleSheet } from "react-native";
import RNPickerSelect from "react-native-picker-select";

const Mode = ({
  onModelChange,
  transcribeTimeout,
  onTranscribeTimeoutChanged,
}: any) => {
  function onModelChangeLocal(value: any) {
    onModelChange(value);
  }

  function onTranscribeTimeoutChangedLocal(event: any) {
    onTranscribeTimeoutChanged(event.target.value);
  }

  return (
    <View>
      <Text style={styles.title}>Model Size</Text>
      <View style={{ flexDirection: "row" }}>
        <RNPickerSelect
          onValueChange={(value) => onModelChangeLocal(value)}
          useNativeAndroidPickerStyle={false}
          placeholder={{ label: "Select model", value: null }}
          items={[
            { label: "tiny", value: "tiny" },
            { label: "base", value: "base" },
            { label: "small", value: "small" },
            { label: "medium", value: "medium" },
            { label: "large", value: "large" },
          ]}
          style={customPickerStyles}
        />
      </View>
      <View>
        <Text style={styles.title}>Timeout :{transcribeTimeout}</Text>
      </View>
    </View>
  );
};

export default Mode;

const styles = StyleSheet.create({
  title: {
    fontWeight: "200",
    fontSize: 25,
    float: "left",
  },
});

const customPickerStyles = StyleSheet.create({
  inputIOS: {
    fontSize: 14,
    paddingVertical: 10,
    paddingHorizontal: 12,
    borderWidth: 1,
    borderColor: "green",
    borderRadius: 8,
    color: "black",
    paddingRight: 30, // to ensure the text is never behind the icon
  },
  inputAndroid: {
    fontSize: 14,
    paddingHorizontal: 10,
    paddingVertical: 8,
    borderWidth: 1,
    borderColor: "blue",
    borderRadius: 8,
    color: "black",
    paddingRight: 30, // to ensure the text is never behind the icon
  },
});
Create the TranscribeOutput component
The server returns output containing the transcribed text. This component receives the output data and displays it.
touch src/components/TranscribeOutput.tsx
code src/components/TranscribeOutput.tsx
Paste the following code block into the TranscribeOutput.tsx file:
import React from "react";
import { Text, View, StyleSheet } from "react-native";

const TranscribedOutput = ({
  transcribedText,
  interimTranscribedText,
}: any) => {
  if (transcribedText.length === 0 && interimTranscribedText.length === 0) {
    return <Text>...</Text>;
  }

  return (
    <View style={styles.box}>
      <Text style={styles.text}>{transcribedText}</Text>
      <Text>{interimTranscribedText}</Text>
    </View>
  );
};

const styles = StyleSheet.create({
  box: {
    borderColor: "black",
    borderRadius: 10,
    marginBottom: 0,
  },
  text: {
    fontWeight: "400",
    fontSize: 30,
  },
});

export default TranscribedOutput;
Create client functionality
The application relies on Axios to send and receive data from the Flask server; we installed it in a previous section. The default language for testing the application is English.
In the App.tsx file, import the necessary dependencies:
import * as React from "react";
import {
  Text,
  StyleSheet,
  View,
  Button,
  ActivityIndicator,
} from "react-native";
import { Audio } from "expo-av";
import FormData from "form-data";
import axios from "axios";
import Mode from "./src/components/Mode";
import TranscribedOutput from "./src/components/TranscribeOutput";
Create state variables
The application needs to track recordings, transcribed data, status messages, and ongoing transcriptions. By default, the language, model, and timeout are set in state.
export default () => {
  const [recording, setRecording] = React.useState(false as any);
  const [recordings, setRecordings] = React.useState([]);
  const [message, setMessage] = React.useState("");
  const [transcribedData, setTranscribedData] = React.useState([] as any);
  const [interimTranscribedData] = React.useState("");
  const [isRecording, setIsRecording] = React.useState(false);
  const [isTranscribing, setIsTranscribing] = React.useState(false);
  const [selectedLanguage, setSelectedLanguage] = React.useState("english");
  const [selectedModel, setSelectedModel] = React.useState(1);
  const [transcribeTimeout, setTranscribeTimout] = React.useState(5);
  const [stopTranscriptionSession, setStopTranscriptionSession] =
    React.useState(false);
  const [isLoading, setLoading] = React.useState(false);

  return <View style={styles.root}></View>;
};

const styles = StyleSheet.create({
  root: {
    display: "flex",
    flex: 1,
    alignItems: "center",
    textAlign: "center",
    flexDirection: "column",
  },
});
Create reference, language, and model option variables
The useRef Hook enables us to keep track of currently initialized properties. We will use refs to track the transcription session, language, and model.
Paste the code block under the setLoading useState hook:
const [isLoading, setLoading] = React.useState(false);
const intervalRef: any = React.useRef(null);

const stopTranscriptionSessionRef = React.useRef(stopTranscriptionSession);
stopTranscriptionSessionRef.current = stopTranscriptionSession;

const selectedLangRef = React.useRef(selectedLanguage);
selectedLangRef.current = selectedLanguage;

const selectedModelRef = React.useRef(selectedModel);
selectedModelRef.current = selectedModel;

const supportedLanguages = [
  "english", "chinese", "german", "spanish", "russian", "korean",
  "french", "japanese", "portuguese", "turkish", "polish", "catalan",
  "dutch", "arabic", "swedish", "italian", "indonesian", "hindi",
  "finnish", "vietnamese", "hebrew", "ukrainian", "greek", "malay",
  "czech", "romanian", "danish", "hungarian", "tamil", "norwegian",
  "thai", "urdu", "croatian", "bulgarian", "lithuanian", "latin",
  "maori", "malayalam", "welsh", "slovak", "telugu", "persian",
  "latvian", "bengali", "serbian", "azerbaijani", "slovenian",
  "kannada", "estonian", "macedonian", "breton", "basque", "icelandic",
  "armenian", "nepali", "mongolian", "bosnian", "kazakh", "albanian",
  "swahili", "galician", "marathi", "punjabi", "sinhala", "khmer",
  "shona", "yoruba", "somali", "afrikaans", "occitan", "georgian",
  "belarusian", "tajik", "sindhi", "gujarati", "amharic", "yiddish",
  "lao", "uzbek", "faroese", "haitian creole", "pashto", "turkmen",
  "nynorsk", "maltese", "sanskrit", "luxembourgish", "myanmar",
  "tibetan", "tagalog", "malagasy", "assamese", "tatar", "hawaiian",
  "lingala", "hausa", "bashkir", "javanese", "sundanese",
];

const modelOptions = ["tiny", "base", "small", "medium", "large"];

React.useEffect(() => {
  return () => clearInterval(intervalRef.current);
}, []);

function handleTranscribeTimeoutChange(newTimeout: any) {
  setTranscribeTimout(newTimeout);
}
Create a recording function
In this section, we will write five functions to handle audio transcription.
The startRecording function
The first function is startRecording. This function enables the application to request permission to use the microphone. The required audio format is preset, and we use a ref to track timeouts:
async function startRecording() {
  try {
    console.log("Requesting permissions..");
    const permission = await Audio.requestPermissionsAsync();
    if (permission.status === "granted") {
      await Audio.setAudioModeAsync({
        allowsRecordingIOS: true,
        playsInSilentModeIOS: true,
      });
      alert("Starting recording..");
      const RECORDING_OPTIONS_PRESET_HIGH_QUALITY: any = {
        android: {
          extension: ".mp4",
          outputFormat: Audio.RECORDING_OPTION_ANDROID_OUTPUT_FORMAT_MPEG_4,
          audioEncoder: Audio.RECORDING_OPTION_ANDROID_AUDIO_ENCODER_AMR_NB,
          sampleRate: 44100,
          numberOfChannels: 2,
          bitRate: 128000,
        },
        ios: {
          extension: ".wav",
          audioQuality: Audio.RECORDING_OPTION_IOS_AUDIO_QUALITY_MIN,
          sampleRate: 44100,
          numberOfChannels: 2,
          bitRate: 128000,
          linearPCMBitDepth: 16,
          linearPCMIsBigEndian: false,
          linearPCMIsFloat: false,
        },
      };
      const { recording }: any = await Audio.Recording.createAsync(
        RECORDING_OPTIONS_PRESET_HIGH_QUALITY
      );
      setRecording(recording);
      console.log("Recording started");
      setStopTranscriptionSession(false);
      setIsRecording(true);
      intervalRef.current = setInterval(
        transcribeInterim,
        transcribeTimeout * 1000
      );
      console.log(recording);
    } else {
      setMessage("Please grant permission to app to access microphone");
    }
  } catch (err) {
    console.error("Failed to start recording", err);
  }
}
The stopRecording function
The stopRecording function enables users to stop recording. The recording state variable stores and saves the updated recordings.
async function stopRecording() {
  console.log("Stopping recording..");
  setRecording(undefined);
  await recording.stopAndUnloadAsync();
  const uri = recording.getURI();
  let updatedRecordings = [...recordings] as any;
  const { sound, status } = await recording.createNewLoadedSoundAsync();
  updatedRecordings.push({
    sound: sound,
    duration: getDurationFormatted(status.durationMillis),
    file: recording.getURI(),
  });
  setRecordings(updatedRecordings);
  console.log("Recording stopped and stored at", uri);
  // Fetch audio binary blob data
  clearInterval(intervalRef.current);
  setStopTranscriptionSession(true);
  setIsRecording(false);
  setIsTranscribing(false);
}
The getDurationFormatted and getRecordingLines functions
To get the duration of the recording and the length of the recorded text, create the getDurationFormatted and getRecordingLines functions:
function getDurationFormatted(millis: any) {
  const minutes = millis / 1000 / 60;
  const minutesDisplay = Math.floor(minutes);
  const seconds = Math.round((minutes - minutesDisplay) * 60);
  const secondDisplay = seconds < 10 ? `0${seconds}` : seconds;
  return `${minutesDisplay}:${secondDisplay}`;
}

function getRecordingLines() {
  return recordings.map((recordingLine: any, index) => {
    return (
      <View key={index} style={styles.row}>
        <Text style={styles.fill}>
          {" "}
          Recording {index + 1} - {recordingLine.duration}
        </Text>
        <Button
          style={styles.button}
          onPress={() => recordingLine.sound.replayAsync()}
          title="Play"
        ></Button>
      </View>
    );
  });
}
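For comparison, the same millis-to-minutes:seconds conversion can be sketched in Python; note that the fractional minutes must be multiplied by 60 before rounding (the function name is my own):

```python
def format_duration(millis: int) -> str:
    # Convert milliseconds to an M:SS string, zero-padding the seconds.
    minutes, seconds = divmod(round(millis / 1000), 60)
    return f"{minutes}:{seconds:02d}"
```

This is the format in which each recording's duration is shown next to its Play button.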
Create the transcribeRecording function
This function allows us to communicate with the Flask server. We access the audio we created using the getURI() function from the expo-av library. language, model_size, and audio_data are the key pieces of data we send to the server.

A 200 response indicates success. We store the response in the setTranscribedData useState Hook; this response contains our transcribed text.
function transcribeInterim() {
  clearInterval(intervalRef.current);
  setIsRecording(false);
}

async function transcribeRecording() {
  const uri = recording.getURI();
  const filetype = uri.split(".").pop();
  const filename = uri.split("/").pop();
  setLoading(true);
  const formData: any = new FormData();
  formData.append("language", selectedLangRef.current);
  formData.append("model_size", modelOptions[selectedModelRef.current]);
  formData.append(
    "audio_data",
    {
      uri,
      type: `audio/${filetype}`,
      name: filename,
    },
    "temp_recording"
  );
  axios({
    url: "https://2c75-197-210-53-169.eu.ngrok.io/transcribe",
    method: "POST",
    data: formData,
    headers: {
      Accept: "application/json",
      "Content-Type": "multipart/form-data",
    },
  })
    .then(function (response) {
      console.log("response:", response);
      setTranscribedData((oldData: any) => [...oldData, response.data]);
      setLoading(false);
      setIsTranscribing(false);
      intervalRef.current = setInterval(
        transcribeInterim,
        transcribeTimeout * 1000
      );
    })
    .catch(function (error) {
      console.log("error:", error);
    });

  if (!stopTranscriptionSessionRef.current) {
    setIsRecording(true);
  }
}
Assembling the application
Let's assemble all the parts created so far:
import * as React from "react";
import {
  Text,
  StyleSheet,
  View,
  Button,
  ActivityIndicator,
} from "react-native";
import { Audio } from "expo-av";
import FormData from "form-data";
import axios from "axios";
import Mode from "./src/components/Mode";
import TranscribedOutput from "./src/components/TranscribeOutput";

export default () => {
  const [recording, setRecording] = React.useState(false as any);
  const [recordings, setRecordings] = React.useState([]);
  const [message, setMessage] = React.useState("");
  const [transcribedData, setTranscribedData] = React.useState([] as any);
  const [interimTranscribedData] = React.useState("");
  const [isRecording, setIsRecording] = React.useState(false);
  const [isTranscribing, setIsTranscribing] = React.useState(false);
  const [selectedLanguage, setSelectedLanguage] = React.useState("english");
  const [selectedModel, setSelectedModel] = React.useState(1);
  const [transcribeTimeout, setTranscribeTimout] = React.useState(5);
  const [stopTranscriptionSession, setStopTranscriptionSession] =
    React.useState(false);
  const [isLoading, setLoading] = React.useState(false);
  const intervalRef: any = React.useRef(null);

  const stopTranscriptionSessionRef = React.useRef(stopTranscriptionSession);
  stopTranscriptionSessionRef.current = stopTranscriptionSession;

  const selectedLangRef = React.useRef(selectedLanguage);
  selectedLangRef.current = selectedLanguage;

  const selectedModelRef = React.useRef(selectedModel);
  selectedModelRef.current = selectedModel;

  const supportedLanguages = [
    "english", "chinese", "german", "spanish", "russian", "korean",
    "french", "japanese", "portuguese", "turkish", "polish", "catalan",
    "dutch", "arabic", "swedish", "italian", "indonesian", "hindi",
    "finnish", "vietnamese", "hebrew", "ukrainian", "greek", "malay",
    "czech", "romanian", "danish", "hungarian", "tamil", "norwegian",
    "thai", "urdu", "croatian", "bulgarian", "lithuanian", "latin",
    "maori", "malayalam", "welsh", "slovak", "telugu", "persian",
    "latvian", "bengali", "serbian", "azerbaijani", "slovenian",
    "kannada", "estonian", "macedonian", "breton", "basque", "icelandic",
    "armenian", "nepali", "mongolian", "bosnian", "kazakh", "albanian",
    "swahili", "galician", "marathi", "punjabi", "sinhala", "khmer",
    "shona", "yoruba", "somali", "afrikaans", "occitan", "georgian",
    "belarusian", "tajik", "sindhi", "gujarati", "amharic", "yiddish",
    "lao", "uzbek", "faroese", "haitian creole", "pashto", "turkmen",
    "nynorsk", "maltese", "sanskrit", "luxembourgish", "myanmar",
    "tibetan", "tagalog", "malagasy", "assamese", "tatar", "hawaiian",
    "lingala", "hausa", "bashkir", "javanese", "sundanese",
  ];

  const modelOptions = ["tiny", "base", "small", "medium", "large"];

  React.useEffect(() => {
    return () => clearInterval(intervalRef.current);
  }, []);

  function handleTranscribeTimeoutChange(newTimeout: any) {
    setTranscribeTimout(newTimeout);
  }

  async function startRecording() {
    try {
      console.log("Requesting permissions..");
      const permission = await Audio.requestPermissionsAsync();
      if (permission.status === "granted") {
        await Audio.setAudioModeAsync({
          allowsRecordingIOS: true,
          playsInSilentModeIOS: true,
        });
        alert("Starting recording..");
        const RECORDING_OPTIONS_PRESET_HIGH_QUALITY: any = {
          android: {
            extension: ".mp4",
            outputFormat: Audio.RECORDING_OPTION_ANDROID_OUTPUT_FORMAT_MPEG_4,
            audioEncoder: Audio.RECORDING_OPTION_ANDROID_AUDIO_ENCODER_AMR_NB,
            sampleRate: 44100,
            numberOfChannels: 2,
            bitRate: 128000,
          },
          ios: {
            extension: ".wav",
            audioQuality: Audio.RECORDING_OPTION_IOS_AUDIO_QUALITY_MIN,
            sampleRate: 44100,
            numberOfChannels: 2,
            bitRate: 128000,
            linearPCMBitDepth: 16,
            linearPCMIsBigEndian: false,
            linearPCMIsFloat: false,
          },
        };
        const { recording }: any = await Audio.Recording.createAsync(
          RECORDING_OPTIONS_PRESET_HIGH_QUALITY
        );
        setRecording(recording);
        console.log("Recording started");
        setStopTranscriptionSession(false);
        setIsRecording(true);
        intervalRef.current = setInterval(
          transcribeInterim,
          transcribeTimeout * 1000
        );
        console.log(recording);
      } else {
        setMessage("Please grant permission to app to access microphone");
      }
    } catch (err) {
      console.error("Failed to start recording", err);
    }
  }

  async function stopRecording() {
    console.log("Stopping recording..");
    setRecording(undefined);
    await recording.stopAndUnloadAsync();
    const uri = recording.getURI();
    let updatedRecordings = [...recordings] as any;
    const { sound, status } = await recording.createNewLoadedSoundAsync();
    updatedRecordings.push({
      sound: sound,
      duration: getDurationFormatted(status.durationMillis),
      file: recording.getURI(),
    });
    setRecordings(updatedRecordings);
    console.log("Recording stopped and stored at", uri);
    // Fetch audio binary blob data
    clearInterval(intervalRef.current);
    setStopTranscriptionSession(true);
    setIsRecording(false);
    setIsTranscribing(false);
  }

  function getDurationFormatted(millis: any) {
    const minutes = millis / 1000 / 60;
    const minutesDisplay = Math.floor(minutes);
    const seconds = Math.round((minutes - minutesDisplay) * 60);
    const secondDisplay = seconds < 10 ? `0${seconds}` : seconds;
    return `${minutesDisplay}:${secondDisplay}`;
  }

  function getRecordingLines() {
    return recordings.map((recordingLine: any, index) => {
      return (
        <View key={index} style={styles.row}>
          <Text style={styles.fill}>
            {" "}
            Recording {index + 1} - {recordingLine.duration}
          </Text>
          <Button
            style={styles.button}
            onPress={() => recordingLine.sound.replayAsync()}
            title="Play"
          ></Button>
        </View>
      );
    });
  }

  function transcribeInterim() {
    clearInterval(intervalRef.current);
    setIsRecording(false);
  }

  async function transcribeRecording() {
    const uri = recording.getURI();
    const filetype = uri.split(".").pop();
    const filename = uri.split("/").pop();
    setLoading(true);
    const formData: any = new FormData();
    formData.append("language", selectedLangRef.current);
    formData.append("model_size", modelOptions[selectedModelRef.current]);
    formData.append(
      "audio_data",
      {
        uri,
        type: `audio/${filetype}`,
        name: filename,
      },
      "temp_recording"
    );
    axios({
      url: "https://2c75-197-210-53-169.eu.ngrok.io/transcribe",
      method: "POST",
      data: formData,
      headers: {
        Accept: "application/json",
        "Content-Type": "multipart/form-data",
      },
    })
      .then(function (response) {
        console.log("response:", response);
        setTranscribedData((oldData: any) => [...oldData, response.data]);
        setLoading(false);
        setIsTranscribing(false);
        intervalRef.current = setInterval(
          transcribeInterim,
          transcribeTimeout * 1000
        );
      })
      .catch(function (error) {
        console.log("error:", error);
      });

    if (!stopTranscriptionSessionRef.current) {
      setIsRecording(true);
    }
  }

  return (
    <View style={styles.root}>
      <View style={{ flex: 1 }}>
        <Text style={styles.title}>Speech to Text.</Text>
        <Text style={styles.title}>{message}</Text>
      </View>
      <View style={styles.settingsSection}>
        <Mode
          disabled={isTranscribing || isRecording}
          possibleLanguages={supportedLanguages}
          selectedLanguage={selectedLanguage}
          onLanguageChange={setSelectedLanguage}
          modelOptions={modelOptions}
          selectedModel={selectedModel}
          onModelChange={setSelectedModel}
          transcribeTimeout={transcribeTimeout}
          onTranscribeTimeoutChanged={handleTranscribeTimeoutChange}
        />
      </View>
      <View style={styles.buttonsSection}>
        {!isRecording && !isTranscribing && (
          <Button onPress={startRecording} title="Start recording" />
        )}
        {(isRecording || isTranscribing) && (
          <Button
            onPress={stopRecording}
            disabled={stopTranscriptionSessionRef.current}
            title="stop recording"
          />
        )}
        <Button title="Transcribe" onPress={() => transcribeRecording()} />
        {getRecordingLines()}
      </View>
      {isLoading !== false ? (
        <ActivityIndicator
          size="large"
          color="#00ff00"
          hidesWhenStopped={true}
          animating={true}
        />
      ) : (
        <Text></Text>
      )}
      <View style={styles.transcription}>
        <TranscribedOutput
          transcribedText={transcribedData}
          interimTranscribedText={interimTranscribedData}
        />
      </View>
    </View>
  );
};

const styles = StyleSheet.create({
  root: {
    display: "flex",
    flex: 1,
    alignItems: "center",
    textAlign: "center",
    flexDirection: "column",
  },
  title: {
    marginTop: 40,
    fontWeight: "400",
    fontSize: 30,
  },
  settingsSection: {
    flex: 1,
  },
  buttonsSection: {
    flex: 1,
    flexDirection: "row",
  },
  transcription: {
    flex: 1,
    flexDirection: "row",
  },
  recordIllustration: {
    width: 100,
  },
  row: {
    flexDirection: "row",
    alignItems: "center",
    justifyContent: "center",
  },
  fill: {
    flex: 1,
    margin: 16,
  },
  button: {
    margin: 16,
  },
});
Run the application
Run the React Native application using the following command:
yarn start
The project repository is publicly available.
Conclusion
In this article, we learned how to create speech-to-text functionality in a React Native application. I foresee Whisper changing the way narration and dictation work in everyday life. The technology described in this article enables the creation of dictation applications.
I'm excited to see new and innovative ways developers extend Whisper, for example, using Whisper to perform actions on our mobile and web devices, or using Whisper to improve the accessibility of our websites and apps.