Google Assistant, Alexa, and Siri are the ubiquitous voice assistants that serve most of the internet-connected population today. For the most part, English is the dominant language used with these voice assistants. However, for a voice assistant to be truly helpful, it must be able to understand the user as they naturally speak. In many parts of the world, especially in a diverse country like India, it is common for people to be multilingual and to switch between multiple languages in a single conversation. A truly smart assistant should be able to handle this.

Google Assistant offers the ability to add a second language, but only on certain devices and only for a limited set of major languages. For example, Google’s Nest Hub does not yet support bilingual capabilities for Tamil, a language spoken by over 80 million people. Alexa supports a bilingual mode as long as the combination is one of its supported language pairs; again, only a limited set of major languages is covered. Siri has no bilingual capability and allows only one language at a time.

In this article, I will discuss the approach I took to give my voice assistant bilingual capability, with English and Tamil as the languages. Using this approach, the voice assistant will be able to automatically detect the language a person is speaking by analyzing the audio directly. Using a “confidence score”-based algorithm, the system will determine whether English or Tamil is being spoken and respond in the corresponding language.

Approach to Bilingual Capability

To make the assistant understand both English and Tamil, there are a few potential solutions. The first approach would be to train a custom Machine Learning model from scratch, specifically on Tamil language data, and then integrate that model into the Raspberry Pi. While this would offer a high degree of customization, it is an incredibly time-consuming and resource-intensive process. Training a model requires a massive dataset and significant computational power. Furthermore, running a heavy custom model would likely slow down the Raspberry Pi, leading to a poor user experience.

fastText Approach

A more practical solution is to use an existing, pre-trained model that is already optimized for a specific task. For language identification, a great option is fastText.

fastText is an open-source library from Facebook AI Research designed for efficient text classification and word representation. It comes with pre-trained models that can quickly and accurately identify the language of a given piece of text from a large number of languages. Because it is lightweight and highly optimized, it is an excellent choice for running on a resource-constrained device like a Raspberry Pi without causing significant performance issues. The plan, therefore, was to use fastText to classify the user’s spoken language.

To use fastText, download the language-identification model (lid.176.bin), store it in your project folder, set its location as the MODEL_PATH, and load the model.

import fasttext
import speech_recognition as sr

# --- Configuration ---
MODEL_PATH = "./lid.176.bin" # This is the model file you downloaded and unzipped

# --- Main Application Logic ---
print("Loading fastText language identification model...")
try:
    # Load the pre-trained model
    model = fasttext.load_model(MODEL_PATH)
except Exception as e:
    print(f"FATAL ERROR: Could not load the fastText model. Error: {e}")
    exit()

The next step is to transcribe the voice command and pass the resulting text to the model to get a prediction back. This can be achieved through a dedicated function.

def identify_language(text, model):
    # The model.predict() function returns a tuple of labels and probabilities
    predictions = model.predict(text, k=1)
    language_code = predictions[0][0] # e.g., '__label__en'
    return language_code

# Set up the recognizer and microphone
recognizer = sr.Recognizer()
microphone = sr.Microphone()

try:
    with microphone as source:
        recognizer.adjust_for_ambient_noise(source, duration=1)
        print("\nPlease speak now...")
        audio = recognizer.listen(source, phrase_time_limit=8)

    print("Transcribing audio...")
    # Get a rough transcription without specifying a language
    transcription = recognizer.recognize_google(audio)
    print(f"Heard: \"{transcription}\"")

    # Identify the language from the transcribed text
    language = identify_language(transcription, model)

    if language == '__label__en':
        print("\n---> Result: The detected language is English. <---")
    elif language == '__label__ta':
        print("\n---> Result: The detected language is Tamil. <---")
    else:
        print(f"\n---> Result: Detected a different language: {language}")

except sr.UnknownValueError:
    print("Could not understand the audio.")
except sr.RequestError as e:
    print(f"Speech recognition service error; {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

The code block above follows a simple path. It uses the recognizer.recognize_google(audio) function to transcribe the voice command and then passes this transcription to the fastText model to get a language prediction. If the prediction is “__label__en”, English has been detected; if it is “__label__ta”, Tamil has been detected.
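fastText returns its predictions as labels of the form `__label__xx`. A small helper (my own naming, for illustration; not part of the fastText library) can strip that prefix and map the code to a readable language name:

```python
# fastText labels look like '__label__en'; this helper strips the prefix
# and maps the ISO code to a display name. The mapping is illustrative.
LANGUAGE_NAMES = {"en": "English", "ta": "Tamil"}

def label_to_name(label):
    """Convert a fastText label such as '__label__ta' to a display name."""
    code = label.replace("__label__", "")
    # Fall back to the raw code for languages not in the mapping
    return LANGUAGE_NAMES.get(code, code)

print(label_to_name("__label__en"))  # English
print(label_to_name("__label__ta"))  # Tamil
```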

This approach led to poor predictions, though. The problem is that the speech_recognition library defaults to English, so when I speak something in Tamil, it finds the closest (and incorrect) equivalent-sounding words in English and passes those to fastText.

For example, when I said “En peyar enna” (“What is my name?” in Tamil), speech_recognition understood it as “Empire NA”, and fastText accordingly predicted the language as English. To overcome this, I could hard-code the speech_recognition function to detect only Tamil, but this would defeat the idea of being truly ‘smart’ and ‘bilingual’. The assistant should detect the language based on what is spoken, not on what is hard-coded.


The ‘Confidence Score’ method

What we need is a more direct and data-driven method. The solution lies within a feature of the speech_recognition library. The recognizer.recognize_google() function calls the Google Speech Recognition API, which can transcribe audio in a vast number of languages, including both English and Tamil. A key feature of this API is that, for every transcription it provides, it can also return a confidence score: a numerical value between 0 and 1, indicating how certain it is that its transcription is correct.

This feature allows for a much more elegant and dynamic approach to language identification. Let’s take a look at the code.

def recognize_with_confidence(recognizer, audio_data):
    """Transcribe audio_data as both Tamil and English, then return the
    transcription (and language name) with the higher confidence score."""
    tamil_text = None
    tamil_confidence = 0.0
    english_text = None
    english_confidence = 0.0

    # 1. Attempt to recognize as Tamil and get confidence
    try:
        print("Attempting to transcribe as Tamil...")
        # show_all=True returns a dictionary with transcription alternatives
        response_tamil = recognizer.recognize_google(audio_data, language='ta-IN', show_all=True)
        # We only look at the top alternative
        if response_tamil and 'alternative' in response_tamil:
            top_alternative = response_tamil['alternative'][0]
            tamil_text = top_alternative['transcript']
            if 'confidence' in top_alternative:
                tamil_confidence = top_alternative['confidence']
            else:
                tamil_confidence = 0.8 # Assign a default high confidence if not provided
    except sr.UnknownValueError:
        print("Could not understand audio as Tamil.")
    except sr.RequestError as e:
        print(f"Tamil recognition service error; {e}")

    # 2. Attempt to recognize as English and get confidence
    try:
        print("Attempting to transcribe as English...")
        response_english = recognizer.recognize_google(audio_data, language='en-US', show_all=True)
        if response_english and 'alternative' in response_english:
            top_alternative = response_english['alternative'][0]
            english_text = top_alternative['transcript']
            if 'confidence' in top_alternative:
                english_confidence = top_alternative['confidence']
            else:
                english_confidence = 0.8 # Assign a default high confidence
    except sr.UnknownValueError:
        print("Could not understand audio as English.")
    except sr.RequestError as e:
        print(f"English recognition service error; {e}")

    # 3. Compare confidence scores and return the winner
    print(f"\nConfidence Scores -> Tamil: {tamil_confidence:.2f}, English: {english_confidence:.2f}")
    if tamil_confidence > english_confidence:
        return tamil_text, "Tamil"
    elif english_confidence > tamil_confidence:
        return english_text, "English"
    else:
        # If scores are equal (or both zero), return neither
        return None, None

The logic in this code block is simple. We pass the audio to the recognize_google() function with show_all=True and get back the whole list of transcription alternatives with their scores. First we transcribe the audio as Tamil and record the corresponding confidence score; then we transcribe the same audio as English and record its score. Once we have both, we compare the confidence scores and choose the language with the higher score as the one detected by the system.
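To make that structure concrete, the dictionary returned by recognize_google(..., show_all=True) looks roughly like the sample below. The values are illustrative, not captured from a real API call, and the helper function is my own, shown only to isolate the extraction step:

```python
# Approximate shape of the show_all=True response from the Google API.
# Values are illustrative, not from a real call.
sample_response = {
    "alternative": [
        {"transcript": "what is my name", "confidence": 0.94},
        {"transcript": "what's my name"},  # alternatives may lack a confidence
    ],
    "final": True,
}

def top_alternative(response, default_confidence=0.0):
    """Return (transcript, confidence) of the first alternative, if any."""
    if response and "alternative" in response:
        top = response["alternative"][0]
        return top.get("transcript"), top.get("confidence", default_confidence)
    return None, default_confidence

text, confidence = top_alternative(sample_response)
print(text, confidence)  # what is my name 0.94
```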

Below is the output of the function when I speak in English and when I speak in Tamil.

Screenshot from Visual Studio output (Tamil). Image owned by author.
Screenshot from Visual Studio output (English). Image owned by author.

The results above show how the code is able to understand the language spoken dynamically, based on the confidence score.

Putting it all together — The Bilingual Assistant

The final step is to integrate this approach into the code for the Raspberry Pi-based voice assistant. The full code can be found on my GitHub. Once integrated, the next step is to test the voice assistant by speaking in English and Tamil and observing how it responds to each language. The recordings below demonstrate the bilingual voice assistant answering a question in English and in Tamil.

Conclusion

In this article, we have seen how to successfully upgrade a simple voice assistant into a truly bilingual tool. By implementing a “confidence score” algorithm, the system can be made to determine whether a command is spoken in English or Tamil, allowing it to understand and reply in the user’s chosen language for that specific query. This creates a more natural and seamless conversational experience.

The key advantage of this method is its reliability and scalability. While this project focused on just two languages, the same confidence score logic could easily be extended to support three, four, or more by simply adding an API call for each new language and comparing all the results. The techniques explored here serve as a robust foundation for creating more advanced and intuitive personal AI tools.
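As a sketch of that extension (the function and names are mine, for illustration), the pairwise comparison generalizes to any number of languages by collecting each language's transcript and confidence in a dictionary and taking the maximum:

```python
def pick_language(results):
    """Pick the best transcription from {language: (transcript, confidence)}.

    Each entry holds the transcript and confidence obtained from a separate
    recognize_google() call for that language. Returns (None, None) if no
    language produced a usable transcription.
    """
    best_language, (best_text, best_confidence) = max(
        results.items(), key=lambda item: item[1][1]
    )
    if best_confidence == 0.0:
        # No language was understood at all
        return None, None
    return best_text, best_language

# Hypothetical scores for a three-language assistant
results = {
    "Tamil": ("en peyar enna", 0.91),
    "English": ("empire na", 0.62),
    "Hindi": (None, 0.0),
}
print(pick_language(results))  # ('en peyar enna', 'Tamil')
```

The comparison logic stays the same regardless of how many languages are added; only the number of API calls grows.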

