Crafting a Custom Voice Assistant with Perplexity

Google Assistant, Alexa, and Siri are the dominant voice assistants available for everyday use. These assistants have become ubiquitous in almost every home, carrying out tasks such as home automation, note taking, recipe guidance, and answering simple questions. When it comes to answering questions, though, getting a concise, context-based answer from these voice assistants in the age of LLMs can be difficult, if not impossible. For example, if you ask Google Assistant how the market is reacting to Jerome Powell’s speech in Jackson Hole on Aug 22, it will simply reply that it does not know the answer and give a few links that you can peruse. And that is only if you have the screen-based Google Assistant.

Often you just want a quick answer on current events, or you want to know if an apple tree would survive the winter in Ohio, and voice assistants like Google Assistant and Siri frequently fall short of providing a satisfying answer. This got me interested in building my own voice assistant, one that would give me a simple, single-sentence answer based on its search of the web.

Photo by Aerps.com on Unsplash

Of the various LLM-powered search engines available, I have been an avid user of Perplexity for more than a year now, and I use it for nearly all my searches, going back to Google or Bing only for simple ones. In addition to its live web index, which enables it to provide up-to-date, sourced answers, Perplexity gives users access to its functionality through an API. By integrating this API with a Raspberry Pi, I intended to create a voice assistant that would:

Respond to a wake word and be ready to answer my question
Answer my question in a simple, concise sentence
Go back to passive listening without selling my data or showing me unnecessary ads

The Hardware for the Assistant

To build our voice assistant, a few key hardware components are required. The core of the project is a Raspberry Pi 5, which serves as the central processor for our application. For the assistant’s audio input, I chose a simple USB gooseneck microphone. This type of microphone is omnidirectional, making it effective at hearing the wake word from different parts of a room, and its plug-and-play nature simplifies the setup. For the assistant’s output, a compact USB-powered speaker provides the audio output. A key advantage of this speaker is that it uses a single USB cable for both its power and audio signal, which minimizes cable clutter.

Block diagram showing the functionality of the custom voice assistant (image by author)

This approach of using readily available USB peripherals makes the hardware assembly straightforward, allowing us to focus our efforts on the software.

Getting the environment ready

To query Perplexity and to detect a wake word, we need to generate a couple of API keys. To generate a Perplexity API key, sign up for a Perplexity account, go to the Settings menu, select the API tab, and click “Generate API Key” to create and copy your personal key for use in applications. Access to API key generation usually requires a paid plan or a payment method, so make sure the account is eligible before proceeding.

Platforms that offer wake word customization include PicoVoice Porcupine, Sensory TrulyHandsfree, and Snowboy, with PicoVoice Porcupine providing an easy online console for generating, testing, and deploying custom wake words across desktop, mobile, and embedded devices. To generate a custom wake word for PicoVoice Porcupine, sign up for a free Picovoice Console account, navigate to the Porcupine page, select the desired language, type in the custom wake word, and click “Train” to produce and download the platform-specific model file (.ppn). Test the wake word before finalizing it to confirm reliable detection and minimal false positives. The wake word I have trained and will use is “Hey Krishna”.

Coding the Assistant

The complete Python script for this project is available on my GitHub repository. In this section, let’s look at the key components of the code to understand how the assistant functions.
The script is organized into a few core functions that handle the assistant’s senses and intelligence, all managed by a central loop.

Configuration and Initialization

The first part of the script is dedicated to setup. It handles loading the necessary API keys, model files, and initializing the clients for the services we’ll use.

# --- 1. Configuration ---
load_dotenv()
PICOVOICE_ACCESS_KEY = os.environ.get("PICOVOICE_ACCESS_KEY")
PERPLEXITY_API_KEY = os.environ.get("PERPLEXITY_API_KEY")
KEYWORD_PATHS = ["Krishna_raspberry-pi.ppn"] # Path to my custom wake word model
MODEL_NAME = "sonar"

This section uses the dotenv library to securely load your secret API keys from a .env file, which is a best practice that keeps them out of your source code. It also defines key variables like the path to your custom wake word file and the specific Perplexity model we want to query.
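Because os.environ.get() returns None when a variable is missing, a forgotten key only surfaces later as a confusing authentication error. A minimal sketch of a fail-fast check is shown below; require_env is a hypothetical helper of my own and is not part of the original script.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


if __name__ == "__main__":
    # Hypothetical usage: run this after load_dotenv() so values from .env are visible
    try:
        picovoice_key = require_env("PICOVOICE_ACCESS_KEY")
        perplexity_key = require_env("PERPLEXITY_API_KEY")
        print("Both API keys are set.")
    except RuntimeError as err:
        print(err)
```

Failing at startup keeps the error message close to its cause, which is easier to debug on a headless Raspberry Pi.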

Wake Word Detection

For the assistant to be truly hands-free, it needs to listen continuously for a specific wake word without using significant system resources. This is handled by the while True: loop in the main function, which uses the PicoVoice Porcupine engine.

# This is the main loop that runs continuously
while True:
    # Read a small chunk of raw audio data from the microphone
    pcm = audio_stream.read(porcupine.frame_length)
    pcm = struct.unpack_from("h" * porcupine.frame_length, pcm)
    
    # Feed the audio chunk into the Porcupine engine for analysis
    keyword_index = porcupine.process(pcm)

    if keyword_index >= 0:
        # Wake word was detected, proceed to handle the command...
        print("Wake word detected!")

This loop is the heart of the assistant’s “passive listening” state. It continuously reads small, raw audio frames from the microphone stream. Each frame is then passed to the porcupine.process() function. This is a highly efficient, offline process that analyzes the audio for the specific acoustic pattern of your custom wake word (“Hey Krishna”). If the pattern is detected, porcupine.process() returns a non-negative number (the index of the matched keyword), and the script proceeds to the active phase of listening for a full command.
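The struct.unpack_from step in the loop is easy to gloss over, so here is a small sketch of what it does to one frame. It runs without a microphone: the frame is synthetic, and the frame size of 512 samples is an assumption standing in for porcupine.frame_length. The unpack_frame helper is hypothetical.

```python
import struct

FRAME_LENGTH = 512  # assumed frame size; the script reads porcupine.frame_length instead


def unpack_frame(raw: bytes, frame_length: int) -> tuple:
    """Convert raw 16-bit PCM bytes into a tuple of signed integer samples."""
    expected = frame_length * 2  # 2 bytes per 16-bit sample
    if len(raw) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(raw)}")
    return struct.unpack_from("h" * frame_length, raw)


# Build a synthetic frame: a ramp of sample values standing in for microphone audio
samples_in = [i - 256 for i in range(FRAME_LENGTH)]
raw_frame = struct.pack("h" * FRAME_LENGTH, *samples_in)

samples_out = unpack_frame(raw_frame, FRAME_LENGTH)
print(len(raw_frame), len(samples_out), samples_out[:3])
# prints: 1024 512 (-256, -255, -254)
```

The format string "h" means one signed 16-bit integer, so repeating it frame_length times turns the byte buffer into the sequence of samples that porcupine.process() expects.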

Speech-to-Text — Converting user questions to text

After the wake word is detected, the assistant needs to listen for and understand the user’s question. This is handled by the Speech-to-Text (STT) component.

# --- This logic is inside the main 'if keyword_index >= 0:' block ---

print("Listening for command...")
frames = []
# Record audio from the stream for a fixed duration (~10 seconds)
for _ in range(0, int(porcupine.sample_rate / porcupine.frame_length * 10)):
    frames.append(audio_stream.read(porcupine.frame_length))

# Convert the raw audio frames into an object the library can use
audio_data = sr.AudioData(b"".join(frames), porcupine.sample_rate, 2)

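command = None  # Stays None if transcription fails, so the query step can be skipped
try:
    # Send the audio data to Google's service for transcription
    command = recognizer.recognize_google(audio_data)
    print(f"You (command): {command}")
except sr.UnknownValueError:
    # Audio was captured, but no speech could be recognized in it
    speak_text("Sorry, I didn't catch that.")
except sr.RequestError as e:
    # The speech service could not be reached (for example, no network)
    print(f"Speech recognition request failed: {e}")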

Once the wake word is detected, the code actively records audio from the microphone for approximately 10 seconds, capturing the user’s spoken command. It then packages this raw audio data and sends it to Google’s speech recognition service using the speech_recognition library. The service processes the audio and returns the transcribed text, which is then stored in the command variable.
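The “approximately 10 seconds” follows directly from the frame arithmetic in the recording loop. The sketch below works it through, assuming a 16 kHz sample rate and 512-sample frames; these numbers are my assumption for illustration, and the script itself reads them from porcupine.sample_rate and porcupine.frame_length.

```python
SAMPLE_RATE = 16000   # samples per second (assumed)
FRAME_LENGTH = 512    # samples per frame (assumed)
SAMPLE_WIDTH = 2      # bytes per sample for 16-bit audio


def frames_for_duration(sample_rate: int, frame_length: int, seconds: float) -> int:
    """Number of whole frames to read to cover roughly `seconds` of audio."""
    return int(sample_rate / frame_length * seconds)


num_frames = frames_for_duration(SAMPLE_RATE, FRAME_LENGTH, 10)
total_samples = num_frames * FRAME_LENGTH
actual_seconds = total_samples / SAMPLE_RATE
total_bytes = total_samples * SAMPLE_WIDTH

print(num_frames)      # 312
print(actual_seconds)  # 9.984
print(total_bytes)     # 319488
```

Truncating to whole frames is why the recording is slightly shorter than 10 seconds. The 2 passed to sr.AudioData is the same sample width in bytes used here.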

Getting Answers from Perplexity

Once the user’s command has been converted to text, it is sent to the Perplexity API to get an intelligent, up-to-date answer.

# --- This logic runs if a command was successfully transcribed ---

if command:
    # Define the instructions and context for the AI
    messages = [{"role": "system", "content": "You are an AI assistant. You are located in Twinsburg, Ohio. All answers must be relevant to Cleveland, Ohio unless asked for differently by the user.  You MUST answer all questions in a single and VERY concise sentence."}]
    messages.append({"role": "user", "content": command})
    
    # Send the request to the Perplexity API
    response = perplexity_client.chat.completions.create(
        model=MODEL_NAME, 
        messages=messages
    )
    assistant_response_text = response.choices[0].message.content.strip()
    speak_text(assistant_response_text)

This code block is the “brain” of the operation. It first constructs a messages list, which includes a critical system prompt. This prompt gives the AI its personality and rules, such as answering in a single sentence and being aware of its location in Ohio. The user’s command is then added to this list, and the entire package is sent to the Perplexity API. The script then extracts the text from the AI’s response and passes it to the speak_text function to be read aloud.
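The snippet above does not show how perplexity_client is created. One way to do it, sketched here under assumptions, is to point the openai Python package at Perplexity's endpoint, since the chat.completions.create call has the same shape. The helper names build_messages and make_perplexity_client are hypothetical, the system prompt is shortened, and I'm assuming the openai package is installed and that https://api.perplexity.ai is the base URL.

```python
SYSTEM_PROMPT = (
    "You are an AI assistant. "
    "You MUST answer all questions in a single and VERY concise sentence."
)


def build_messages(command: str, system_prompt: str = SYSTEM_PROMPT) -> list:
    """Assemble the chat payload: system instructions first, then the user's question."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": command.strip()},
    ]


def make_perplexity_client(api_key: str):
    """Create an OpenAI-style client aimed at Perplexity (assumed setup, not from the article)."""
    from openai import OpenAI  # imported lazily so the rest of the sketch runs without the package

    return OpenAI(api_key=api_key, base_url="https://api.perplexity.ai")


messages = build_messages("  Will an apple tree survive the winter in Ohio? ")
for message in messages:
    print(message["role"], "->", message["content"])
```

Keeping the message construction in its own function makes it easy to adjust the system prompt, for example the location hint, without touching the request code.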

Text-to-Speech — Converting Perplexity response to Voice

The speak_text function is what gives the assistant its voice.

def speak_text(text_to_speak, lang='en'):
    # Define a function that converts text to speech, default language is English
    
    print(f"Assistant (speaking): {text_to_speak}")
    # Print the text for reference so the user can see what is being spoken
    
    try:
        pygame.mixer.init()
        # Initialize the Pygame mixer module for audio playback
        
        tts = gTTS(text=text_to_speak, lang=lang, slow=False)
        # Create a Google Text-to-Speech (gTTS) object with the provided text and language
        # 'slow=False' makes the speech sound more natural (not slow-paced)
        
        mp3_filename = "response_audio.mp3"
        # Set the filename where the generated speech will be saved
        
        tts.save(mp3_filename)
        # Save the generated speech as an MP3 file
        
        pygame.mixer.music.load(mp3_filename)
        # Load the MP3 file into Pygame's music player for playback
        
        pygame.mixer.music.play()
        # Start playing the speech audio
        
        while pygame.mixer.music.get_busy():
            pygame.time.Clock().tick(10)
        # Keep the program running (by checking if playback is ongoing)
        # This prevents the script from ending before the speech finishes
        # The clock.tick(10) ensures it checks 10 times per second
        
        pygame.mixer.quit()
        # Quit the Pygame mixer once playback is complete to free resources
        
        os.remove(mp3_filename)
        # Delete the temporary MP3 file after playback to clean up
        
    except Exception as e:
        print(f"Error in Text-to-Speech: {e}")
        # Catch and display any errors that occur during the speech generation or playback

This function takes a text string, prints it for reference, then uses the gTTS (Google Text-to-Speech) library to generate a temporary MP3 audio file. It plays the file through the system’s speakers using the pygame library, waits until playback is finished, and then deletes the file. Error handling is included to catch issues during the process.
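One detail worth noting in speak_text is that os.remove() sits inside the try block, so the MP3 is left behind if playback raises an error. A possible variation, sketched below, uses a uniquely named temporary file and a finally clause so cleanup always runs. This is not the article's implementation: speak_via_temp_file is a hypothetical helper, and the synthesize and play callables are stand-ins for the gTTS and pygame steps so the sketch runs without audio hardware.

```python
import os
import tempfile
from typing import Callable


def speak_via_temp_file(
    text: str,
    synthesize: Callable[[str, str], None],
    play: Callable[[str], None],
) -> str:
    """Write speech to a unique temp file, play it, and always delete the file afterwards."""
    fd, path = tempfile.mkstemp(suffix=".mp3")
    os.close(fd)  # only the path is needed; the callables open the file themselves
    try:
        synthesize(text, path)
        play(path)
    finally:
        # Runs on success and on error, so no stale audio file is left on disk
        if os.path.exists(path):
            os.remove(path)
    return path


def fake_synthesize(text: str, path: str) -> None:
    """Stand-in for the gTTS step: writes the text bytes instead of real audio."""
    with open(path, "wb") as handle:
        handle.write(text.encode("utf-8"))


def fake_play(path: str) -> None:
    """Stand-in for the pygame step: reports the file size instead of playing it."""
    print(f"Playing {os.path.getsize(path)} bytes")


used_path = speak_via_temp_file("hello", fake_synthesize, fake_play)
print("File still exists:", os.path.exists(used_path))
```

In the real script, synthesize could wrap the gTTS save call and play could wrap the pygame mixer calls shown above.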

Testing the assistant

Below is a demonstration of the custom voice assistant in action. To compare its performance with Google Assistant, I asked both assistants the same question.

As you can see, Google provides links to the answer rather than a brief summary of what the user wants. The custom assistant goes further and provides a summary, which is more helpful and informative.

Conclusion

In this article, we looked at the process of building a fully functional, hands-free voice assistant on a Raspberry Pi. By combining a custom wake word with the Perplexity API in Python, we created a simple voice assistant device that helps in getting information quickly.

The key advantage of this LLM-based approach is its ability to deliver direct, synthesized answers to complex and current questions, a task where assistants like Google Assistant often fall short by simply providing a list of search links. Instead of acting as a mere voice interface for a search engine, our assistant functions as a true answer engine, parsing real-time web results to give a single, concise response. The future of voice assistants lies in this deeper, more intelligent integration, and building your own is the best way to explore it.