There has been a lot of development lately with large language models (LLMs). Much of the focus is on the question answering you can do with pure text-based models or with vision-language models (VLMs), where you can also input images.
However, another dimension has evolved a great deal over the last few years: audio. We now have models that can transcribe (speech -> text), synthesize speech (text -> speech), and even do speech-to-speech, where you hold a whole conversation with a language model, with audio going both in and out.
In this article, I’ll discuss how I use these advances in the audio model space to my advantage, becoming an even more efficient programmer.

Motivation
My primary motivation for writing this article is that I am continually seeking ways to become a more efficient programmer. After using the ChatGPT mobile app for a while, I discovered its transcription option (the microphone icon on the right of the input field). I tried it and quickly realized how much better this transcription is than others I have used before, such as Apple’s built-in iPhone dictation.
OpenAI’s transcription almost always captures all of my words, with very few mistakes. Even when I use less common words, such as computer science acronyms, it still picks up what I am saying.

This transcription was only available in the ChatGPT app. However, I know that OpenAI has an API endpoint for their Whisper model, which is (presumably) the same model they use for transcription in the app. I therefore wanted to set this model up on my Mac so that it is available via a shortcut.
(I know there are apps such as MacWhisper available, but I wanted to build a completely free solution, apart from the cost of the API calls themselves.)
Prerequisites
- Alfred (I will be using Alfred on the Mac to trigger some scripts. Alternatives exist; in general, you just need a way to trigger scripts on your Mac / PC from a hotkey.)
Pros
The main advantage of using this transcription is that you can input words into your computer more quickly. When I type as fast as I can, I cannot even reach 100 words per minute, and to sustain that speed I really have to focus. The average talking speed, on the other hand, is at least 110 words per minute, according to this article.
This means you can be a lot more effective if you are able to speak your words with transcription, instead of typing them out on the keyboard.
I think this is especially relevant after the rise of large language models such as ChatGPT. You now spend more time prompting language models, for example asking ChatGPT questions, or prompting Cursor to implement a feature or fix a bug. Plain English therefore makes up a much larger share of what you write than before, compared to programming languages such as Python.
Note: Of course, you will still be writing a lot of code, but from experience, I spend far more time writing extensive English prompts in Cursor, and that is where this transcription saves me a lot of time.
Cons
There can, however, be some downsides to using the transcription as well. One of the main ones is that you often do not want to speak out loud while programming. You might be sitting in an airport (as I am while writing this article), or in your office, and in those situations you probably don’t want to disturb the people around you. If you are sitting in a home office, however, this is naturally not a problem.
Another downside is that for short prompts, transcription might not be much faster. If you just want to write a single-sentence prompt, it will often be quicker to type it out by hand. This is because of the overhead of starting, stopping, and transcribing the audio: the API call takes a little time, and the shorter the prompt, the larger the fraction of the total time you spend waiting for the response.
How to implement
You can see the code I used in this article on my GitHub. However, you also need to add hotkeys to run the scripts.
First, you have to:
- Clone the GitHub repository:
git clone https://github.com/EivindKjosbakken/whisper-shortcut.git
- Create a virtual environment called .venv and install the required packages:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
- Get an OpenAI API Key. You can do that by:
- Going to the OpenAI API Overview and logging in or creating a profile
- Going to your profile, then to API Keys
- Creating a new key. Remember to copy the key, as you will not be able to see it again
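The scripts need this key at runtime. How exactly the repository reads it is not covered here, but a common approach (and the assumption in the sketch further below) is to export it as an OPENAI_API_KEY environment variable, for example in your ~/.zshrc. A quick way to confirm Python can see it:

import os

# Sanity check (not part of the repository's scripts): assumes the key
# is exported as the OPENAI_API_KEY environment variable, e.g. in ~/.zshrc
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise SystemExit("OPENAI_API_KEY is not set; export it in your shell profile first.")
print(f"Found API key ({len(api_key)} characters)")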
The scripts in the GitHub repository work as follows:
- start_recording.sh — starts recording your voice. The first time you use this, it will ask you for permission to use the microphone
- stop_recording.sh — sends a stop signal to the recording script, then sends the recorded audio to OpenAI for transcription. The transcribed text is copied to your clipboard and pasted into the text field you currently have focused (a sketch of this step follows below)
The entire repository is available with an MIT license.
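To make this more concrete, here is a minimal sketch of the transcription-and-paste step that stop_recording.sh performs. This is not the repository’s actual code: it assumes the openai Python package (v1 SDK) from requirements.txt, a placeholder path for the finished recording, and the macOS utilities pbcopy (clipboard) and osascript (to simulate Cmd+V):

import subprocess
from openai import OpenAI  # assumes the v1 openai package listed in requirements.txt

AUDIO_PATH = "recording.wav"  # placeholder path written by the recording script

# The client reads OPENAI_API_KEY from the environment
client = OpenAI()

# Send the recorded audio to the Whisper endpoint for transcription
with open(AUDIO_PATH, "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
text = transcript.text

# Copy the result to the macOS clipboard
subprocess.run("pbcopy", input=text.encode("utf-8"), check=True)

# Paste into the currently focused text field by simulating Cmd+V
subprocess.run(
    ["osascript", "-e",
     'tell application "System Events" to keystroke "v" using command down'],
    check=True,
)

The real scripts also have to coordinate the stop signal between start_recording.sh and stop_recording.sh, which is omitted here.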
Alfred
You can find the Alfred workflow in the GitHub repository here: Transcribe.alfredworkflow.
This is how I set up the Alfred workflow:

You can simply download it and add it to your Alfred.
Also, remember to have a terminal window open whenever you want to run this script, as the Python script is activated from the terminal. I had to do it this way because when the script was launched directly from Alfred, I ran into permission issues. The first time you run the script, you should be prompted to give your terminal access to the microphone, which you should approve.
Cost
An important consideration when using APIs such as OpenAI Whisper is the cost. I would consider the cost of using OpenAI’s Whisper model moderately high, though, as always, it depends entirely on how much you use it. I use the model up to 25 times a day, with recordings of up to 150 words each, and the cost is less than a dollar per day.
This does mean that if you use the model a lot, you can see costs of up to 30 dollars per month, which is a substantial amount. However, it is important to weigh that against the time savings: if each use saves you 30 seconds and you use the model 20 times per day, you have saved ten minutes of your day. Personally, I am willing to pay a dollar to save ten minutes spent on a task (typing on my keyboard) that doesn’t give me any other benefit. If anything, heavy keyboard use may even contribute to a higher risk of injuries such as carpal tunnel syndrome. Using the model is therefore definitely worth it for me.
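To make the trade-off explicit, here is a tiny back-of-the-envelope calculation using the rough figures from this section (the inputs are assumptions about my own usage, not measured values):

# Rough cost vs. time-saved estimate; all inputs are assumptions from the text above
COST_PER_DAY_USD = 1.0        # upper-bound daily cost from my own usage
USES_PER_DAY = 20             # transcriptions per day
SECONDS_SAVED_PER_USE = 30    # typing time saved per transcription

monthly_cost = COST_PER_DAY_USD * 30
minutes_saved_per_day = USES_PER_DAY * SECONDS_SAVED_PER_USE / 60

print(f"Worst-case cost: ~${monthly_cost:.0f}/month")
print(f"Time saved: ~{minutes_saved_per_day:.0f} minutes/day")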
Conclusion
In this article, I started off discussing the immense advances within language models in the last few years. This has helped us create powerful chatbots, saving us enormous amounts of time. However, with the advances of language models, we have also seen advances in voice models. Transcription using OpenAI Whisper is now near perfect (from personal experience), which makes it a powerful tool you can use to input words on your computer more effectively. I discussed the pros and cons of using OpenAI Whisper on your PC, and I also went step by step through how you can implement it on your own computer.