Guidebook to reduce latency for Azure Speech-To-Text (STT) and Text-To-Speech (TTS) applications (2024)

Latency in speech recognition and synthesis can be a significant hurdle in creating seamless and efficient applications. Reducing latency not only improves user experience but also enhances the overall performance of real-time applications. This blog post will explore strategies to reduce latency in general transcription, real-time transcription, file transcription, and speech synthesis.

1. Network Latency: Move the Speech Resource Closer to the App
One of the primary factors contributing to latency in speech recognition is network latency. To mitigate this, it's essential to minimize the distance between your application and the speech recognition resource. Here are some tips:

  • Speech Containers: Run the speech models on-premises or at the edge with Docker, eliminating the need to send audio data over the cloud and thereby reducing network latency. Link - Install and run Speech containers with Docker - Speech service - Azure AI services | Microsoft Learn
  • Leverage Cloud Regions: Choose cloud regions with data centers closer to your users. This reduces network latency significantly.
  • Use Embedded Speech: A compact model designed for on-device scenarios where internet connectivity is limited or unavailable, which significantly reduces network latency at the cost of a slight drop in accuracy. For optimal accuracy, consider a hybrid approach: use Azure AI Speech via the cloud when a network connection is available, and switch to embedded speech when it is not. This provides high-quality, accurate speech processing with a reliable fallback. Link - Embedded Speech - Speech service - Azure AI services | Microsoft Learn

2. Real-Time Transcription:

Real-time transcription requires immediate processing of audio input to provide instant feedback. Here are some recommendations to achieve low latency in real-time transcription:

2.1 Use Real-Time Streaming
Instead of recording the entire audio and then processing it, use real-time streaming to send audio data in small chunks to the speech recognition service. This allows for immediate processing and reduces the overall latency.

```python
import os

import azure.cognitiveservices.speech as speechsdk

def speech_recognize_continuous_async_from_microphone():
    """Performs continuous speech recognition asynchronously with input from microphone."""
    speech_config = speechsdk.SpeechConfig(subscription=os.getenv("SUBSCRIPTION_KEY"), region="centralindia")
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
    done = False

    def recognized_cb(evt: speechsdk.SpeechRecognitionEventArgs):
        print('RECOGNIZED: {}'.format(evt.result.text))

    def stop_cb(evt: speechsdk.SessionEventArgs):
        """Callback that signals to stop continuous recognition."""
        print('CLOSING on {}'.format(evt))
        nonlocal done
        done = True

    # Connect callbacks to the events fired by the speech recognizer
    speech_recognizer.recognized.connect(recognized_cb)
    speech_recognizer.session_stopped.connect(stop_cb)

    # Other tasks can be performed on this thread while recognition starts...
    result_future = speech_recognizer.start_continuous_recognition_async()
    result_future.get()  # Wait for the void future, so we know engine initialization is done.
    print('Continuous Recognition is now running, say something.')

    while not done:
        print('type "stop" then enter when done')
        stop = input()
        if stop.lower() == "stop":
            print('Stopping async recognition.')
            speech_recognizer.stop_continuous_recognition_async()
            break

    print("Recognition stopped, main thread can exit now.")

speech_recognize_continuous_async_from_microphone()
```

Azure Speech SDK also provides a way to stream audio into the recognizer as an alternative to microphone or file input. You can choose between PushAudioInputStream and PullAudioInputStream depending upon your requirement. For details - Speech SDK audio input stream concepts - Azure AI services | Microsoft Learn

2.2 Define the Default Language

If the default language is known, define it at the start of the transcription process. This eliminates the additional processing time required to detect the input language. If the default language is not known, set "SpeechServiceConnection_LanguageIdMode" to detect the language at the start of the transcription and specify the list of expected languages to reduce the processing time.

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourServiceRegion")
speech_config.speech_recognition_language = "en-US"  # Set default language

# OR: detect the language once at the start, from a known list of candidates
speech_config.set_property(
    property_id=speechsdk.PropertyId.SpeechServiceConnection_LanguageIdMode,
    value="AtStart")
auto_detect_source_language_config = speechsdk.languageconfig.AutoDetectSourceLanguageConfig(
    languages=["en-US", "gu-IN", "bn-IN", "mr-IN"])
audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
speech_recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    audio_config=audio_config,
    auto_detect_source_language_config=auto_detect_source_language_config)
```

2.3 Use Asynchronous Methods
Utilize asynchronous methods like start_continuous_recognition_async instead of start_continuous_recognition, and stop_continuous_recognition_async instead of stop_continuous_recognition. These methods allow for non-blocking operations and reduce latency.

```python
speech_recognizer.start_continuous_recognition_async()
# Perform other tasks ...
speech_recognizer.stop_continuous_recognition_async()
```

2.4 Use Fast Transcription

Fast Transcription transcribes audio significantly faster than real-time streaming transcription and is apt for scenarios where an immediate transcript is essential, such as call center analytics, meeting summarization, voice dubbing, and many others. It can transcribe 30 minutes of audio in less than a minute. Note that it is currently in public preview and supports only a handful of locales. For the complete list of supported languages, check out Language support - Speech service - Azure AI services | Microsoft Learn.
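Fast Transcription is exposed as a REST API rather than through the SDK. A minimal sketch of the call is below; note that the endpoint path and api-version shown are assumptions based on the preview API and should be verified against the current documentation, and the `requests` library is assumed to be available:

```python
import json

import requests  # assumption: third-party 'requests' library is installed

def fast_transcribe(audio_path, key, region, locale="en-US"):
    """Submit an audio file to the fast transcription REST API.

    The endpoint and api-version below are assumptions; check the Azure docs
    for the values current in your region.
    """
    url = (f"https://{region}.api.cognitive.microsoft.com/speechtotext/"
           f"transcriptions:transcribe?api-version=2024-11-15")
    headers = {"Ocp-Apim-Subscription-Key": key}
    with open(audio_path, "rb") as f:
        files = {
            "audio": f,
            # The transcription definition (locales, etc.) goes in a JSON form field
            "definition": (None, json.dumps({"locales": [locale]}), "application/json"),
        }
        response = requests.post(url, headers=headers, files=files)
    response.raise_for_status()
    # The JSON response contains the combined transcript and per-phrase details
    return response.json()
```

Unlike batch transcription, the result is returned synchronously in the response body, so there is no polling loop.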

3. File Transcription
For file transcription, processing large audio files can introduce significant latency. Here are some strategies to reduce latency:

3.1 Split the Audio into Small Chunks
Divide the audio file into smaller chunks and run the transcription for each chunk in parallel. This allows for faster processing and reduces the overall transcription time. One caveat with audio chunking is that it might cause a small drop in transcription quality, depending on the chunking strategy. However, if the transcription layer is followed by an LLM-based intelligence layer (for analytical insights, post-processing, etc.), the superior reasoning of the LLM should offset the drop in quality.

```python
import concurrent.futures

from pydub import AudioSegment

def transcribe_chunk(chunk):
    # Transcription logic for each chunk
    pass

audio = AudioSegment.from_file("large_audio_file.wav")
chunk_length_ms = 10000  # 10 seconds
chunks = [audio[i:i + chunk_length_ms] for i in range(0, len(audio), chunk_length_ms)]

with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = [executor.submit(transcribe_chunk, chunk) for chunk in chunks]
    results = [f.result() for f in concurrent.futures.as_completed(futures)]
```
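One way to reduce the quality drop at chunk boundaries is to overlap adjacent chunks slightly so that no word is cut in the middle, then deduplicate the overlapping text after transcription. A hypothetical boundary helper (the function name and default values are illustrative, not from the original post):

```python
def chunk_boundaries(total_ms, chunk_ms=10000, overlap_ms=500):
    """Return (start, end) millisecond ranges covering total_ms, with each
    chunk overlapping the previous one by overlap_ms to avoid cutting words."""
    boundaries = []
    start = 0
    while start < total_ms:
        end = min(start + chunk_ms, total_ms)
        boundaries.append((start, end))
        if end == total_ms:
            break
        start = end - overlap_ms  # step back so adjacent chunks share audio
    return boundaries

# Example: a 25 s file with 10 s chunks and 0.5 s overlap
print(chunk_boundaries(25000))  # [(0, 10000), (9500, 19500), (19000, 25000)]
```

Each (start, end) pair can then be used to slice the pydub segment, e.g. `audio[start:end]`, before submitting to the thread pool.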

3.2 Increase the Speed of Audio
Increase the playback speed of the audio file before sending it for transcription. This reduces the time taken to process the entire file with negligible compromise on the accuracy of transcription.

```python
from pydub import AudioSegment

def increase_audio_speed(filename, output_filename="modified_audio_file.wav", speed_change_factor=1.7):
    # Load your audio file (change to your file format as needed)
    audio = AudioSegment.from_file(filename)
    # Speed up by resampling the raw data at a higher frame rate
    # (increase the factor to make it faster, decrease to slow down)
    new_audio = audio._spawn(audio.raw_data,
                             overrides={'frame_rate': int(audio.frame_rate * speed_change_factor)})
    # Restore the original frame rate so players interpret the (now shorter) audio correctly
    new_audio = new_audio.set_frame_rate(audio.frame_rate)
    # Export the modified audio
    new_audio.export(output_filename, format="wav")  # Change to your desired format
```

3.3 Compress the Input Audio
Compress the input audio before sending it for transcription. This reduces the file size for faster transmission, optimizing bandwidth usage and storage efficiency in transcription.

```python
from pydub import AudioSegment

input_audio = 'gujrati_tts.wav'
output_audio = 'compressed_audio.mp3'

try:
    # Load the audio file
    audio = AudioSegment.from_file(input_audio)
    # Export the audio file with a lower bitrate to compress it
    audio.export(output_audio, format="mp3", bitrate="64k")
    print(f"Compressed audio saved as {output_audio}")
except Exception as e:
    print(f"An error occurred: {e}")
```

4. Speech Synthesis
Latency in speech synthesis can be a bottleneck, especially in real-time applications. Here are some recommendations to reduce latency:

4.1 Use Asynchronous Methods
Instead of using speak_text_async for speech synthesis, which blocks until the entire audio is processed, switch to the start_speaking_text_async method. This method starts streaming the audio output as soon as the first audio chunk is available, reducing latency significantly.

4.2 Text Streaming
Streaming text allows the TTS system to start processing and generating speech as soon as the initial part of the text is received, rather than waiting for the entire text to be available. This reduces the initial delay before speech output begins, making it ideal for interactive applications, live events, and responsive AI-driven dialogues.

```python
# TTS sentence end marks
tts_sentence_end = [".", "!", "?", ";", "。", "!", "?", ";", "\n"]

completion = gpt_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],  # 'prompt' holds your user message
    stream=True)

collected_messages = []
last_tts_request = None
for chunk in completion:
    if len(chunk.choices) > 0:
        chunk_text = chunk.choices[0].delta.content
        if chunk_text:
            collected_messages.append(chunk_text)
            if chunk_text in tts_sentence_end:
                # Join the received fragments to build a sentence
                text = "".join(collected_messages).strip()
                last_tts_request = speech_synthesizer.start_speaking_text_async(text).get()
                collected_messages.clear()
```

4.3 Optimize Audio Output Format
The payload size impacts latency. Use a compressed audio format to save network bandwidth, which is crucial when the network is unstable or has limited bandwidth. The default Riff24Khz16BitMonoPcm output has a 384 kbps bitrate; switching to a compressed format such as Audio24Khz48KBitRateMonoMp3 (48 kbps) substantially reduces the payload size and hence the latency.
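As a back-of-the-envelope check on why the output format matters, compare the payload for ten seconds of audio at an uncompressed PCM bitrate versus a 48 kbps compressed one (the helper function is purely illustrative; the arithmetic is the point):

```python
def payload_kb(bitrate_kbps, seconds):
    """Payload size in kilobytes for a given bitrate and duration."""
    return bitrate_kbps * seconds / 8

# 24 kHz * 16 bits * 1 channel = 384 kbps for 24 kHz 16-bit mono PCM
pcm_kbps = 24000 * 16 * 1 / 1000
print(pcm_kbps)                  # 384.0 kbps
print(payload_kb(pcm_kbps, 10))  # 480.0 KB for 10 s of PCM
print(payload_kb(48, 10))        # 60.0 KB for 10 s of 48 kbps compressed audio
```

An 8x smaller payload means correspondingly less time on the wire, which matters most on slow or unstable links.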

By following these strategies, you can significantly reduce the latency in STT and TTS applications, providing a smoother and more efficient user experience. Implementing these techniques will ensure that your applications are responsive and performant, even in real-time scenarios.
