A Step-by-Step Guide
This article aims to guide you through creating a simple yet powerful voice assistant tailored to your preferences. We'll use two powerful tools, Whisper and GPT, to make this happen. You probably already know GPT and how powerful it is, but what is Whisper?
Whisper is an advanced speech recognition model from OpenAI that provides accurate audio-to-text transcription.
We'll walk you through each step, with coding instructions included. By the end, you'll have your very own voice assistant up and running.
OpenAI API keys
If you already have an OpenAI API key, you can skip this section.
Both the Whisper and GPT APIs require an OpenAI API key. Unlike ChatGPT, where the subscription is a fixed fee, the API is billed based on how much you use the service.
The costs are reasonable. At the time of writing, Whisper is priced at $0.006 / minute and GPT (with the model gpt-3.5-turbo) at $0.002 / 1K tokens (a token is roughly 0.75 words).
To get your key, first create an account on the OpenAI website. After signing in, click on your name in the top-right corner and select View API keys. When you click Create new secret key, your key is displayed. Make sure to save it, because you won't be able to see it again.
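Once you have the key, a common pattern is to expose it to your code through an environment variable rather than pasting it into a script; the value below is a placeholder, not a real key:

```shell
# Store the key in an environment variable so it never appears in your code.
# Replace the placeholder with the secret key you saved above.
export OPENAI_API_KEY="sk-your-key-here"
```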
Packages
The code chunk shows the required libraries for the project. The project uses OpenAI's Python library for the AI tasks, pyttsx3 for generating speech, SoundDevice for recording and playing back audio, and numpy and scipy for mathematical operations. As always, you should create a new virtual environment before installing packages when starting a new project.
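The setup might look like this (the environment name .venv is just a convention, and the package names are those listed above):

```shell
# Create and activate a fresh virtual environment for the project.
python3 -m venv .venv
. .venv/bin/activate

# Install the libraries the assistant relies on.
pip install openai pyttsx3 sounddevice numpy scipy
```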
Our code will be structured around a single class, and takes up roughly 90 lines of code in total. It assumes that you have a basic understanding of Python classes.
The listen method captures the user's spoken input and converts it to text using Whisper. The think method sends the text to GPT, which generates a natural-language response. The speak method converts the response text into audio that is played back. The cycle then repeats: the user can carry on a conversation by making another request.
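Putting those three responsibilities together, the overall shape of the class is roughly this (a sketch; the class name is our choice, and the method bodies are filled in as we go):

```python
class Assistant:
    """A minimal voice assistant: listen -> think -> speak, in a loop."""

    def listen(self):
        """Record the user's voice and return it as text (via Whisper)."""

    def think(self, text):
        """Send the text to GPT and return its reply."""

    def speak(self, text):
        """Read the reply aloud with text-to-speech."""
```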
This function takes care of initializing the history and setting up the API key.
We need a history that keeps track of the previous messages. It is essentially our assistant's short-term memory, allowing it to remember what you said earlier in the conversation.
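As a sketch, the initializer might read the key from the environment variable and seed the history with a system prompt; the prompt text here is just an example, and the full class would also hand the key to the openai library:

```python
import os

class Assistant:
    def __init__(self):
        # Read the API key from the environment instead of hard-coding it.
        # The full class would also pass this to the openai library.
        self.api_key = os.getenv("OPENAI_API_KEY")

        # Short-term memory: the list of chat messages exchanged so far,
        # seeded with a system prompt that sets the assistant's behavior.
        self.history = [
            {"role": "system", "content": "You are a helpful assistant."}
        ]
```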
This method is our assistant’s ears.
The listen function receives input from the user: it records audio from your microphone and transcribes it into text.
Here’s what it does:
- Prints Listening… while recording audio.
- Records audio for 3 seconds (or any duration you like) using sounddevice at a sample rate of 44100 Hz.
- Saves the recorded audio as a NumPy array in a temporary WAV file.
- Uses the OpenAI API's transcribe method to send the audio to Whisper, which transcribes it.
- Prints the transcribed text to the console to confirm that the transcription was successful.
- Returns the transcribed text as a string.
In the example, the assistant listens for 3 seconds, but you can change the duration as you like.
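A standalone sketch of that flow, assuming the pre-1.0 openai Python library (where transcription is done with openai.Audio.transcribe; newer versions use a different client interface):

```python
def listen(duration=3, samplerate=44100):
    """Record `duration` seconds from the microphone and transcribe it."""
    # Imported lazily so the sketch can be inspected without the audio
    # stack (sounddevice, scipy) or the openai library installed.
    import tempfile
    import openai
    import sounddevice as sd
    from scipy.io.wavfile import write

    print("Listening...")
    frames = int(duration * samplerate)  # 3 s at 44100 Hz -> 132300 samples
    recording = sd.rec(frames, samplerate=samplerate, channels=1, dtype="int16")
    sd.wait()  # block until the recording is finished

    # Dump the NumPy array to a temporary WAV file for the API.
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        write(tmp.name, samplerate, recording)
        with open(tmp.name, "rb") as audio_file:
            transcript = openai.Audio.transcribe("whisper-1", audio_file)

    text = transcript["text"]
    print(f"You said: {text}")
    return text
```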
Our assistant's brain is powered by GPT. The think function receives what the assistant hears and formulates a response. How?
The response isn't generated on your computer. The text is sent to OpenAI's servers and processed through the API. The response is then saved in the response variable, and both the user message and the response are added to the history, the assistant's short-term memory, which provides context to the GPT model for future responses.
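Under the same assumption (pre-1.0 openai library, model gpt-3.5-turbo), a standalone version of that step might look like:

```python
def think(history, user_text, model="gpt-3.5-turbo"):
    """Append the user's message, query GPT, and return the reply."""
    import openai  # lazy import; pre-1.0 interface assumed

    # The user message goes into the history first, so the model sees
    # the whole conversation, not just the latest request.
    history.append({"role": "user", "content": user_text})

    # The actual generation happens on OpenAI's servers.
    response = openai.ChatCompletion.create(model=model, messages=history)
    reply = response["choices"][0]["message"]["content"]

    # Remember the answer too: this is the short-term memory at work.
    history.append({"role": "assistant", "content": reply})
    return reply
```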
The speak function is responsible for converting text into speech and playing it back to the user. It takes a single parameter, text: a string containing the text to be converted to speech.
When the function is called with a text string as an argument, it first initializes the pyttsx3 speech engine with engine = pyttsx3.init(). This object, engine, is the main interface for converting text to speech. The function then instructs the speech engine to convert the provided text into speech with engine.say(text), which queues up the text to be spoken. Finally, engine.runAndWait() tells the engine to process the queued commands and play the audio.
Pyttsx3 handles all text-to-speech conversion locally, which can be a significant advantage in terms of latency.
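Put together, the whole function is only a few lines (a sketch; pyttsx3 is the only dependency):

```python
def speak(text):
    """Convert `text` to speech and play it through the speakers."""
    import pyttsx3  # lazy import so the sketch loads without pyttsx3

    engine = pyttsx3.init()  # the main text-to-speech interface
    engine.say(text)         # queue the text to be spoken
    engine.runAndWait()      # process the queue and play the audio locally
```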
The assistant is now ready. We just need to create an assistant object and start the conversation.
The conversation is an infinite loop that ends when the user says a sentence containing Goodbye.
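Assuming an assistant object with the three methods above, the loop can be sketched like this (the lowercase check makes the exit word case-insensitive, which is a choice, not a requirement):

```python
def run(assistant):
    """Keep the conversation going until the user says goodbye."""
    while True:
        request = assistant.listen()        # ears
        if "goodbye" in request.lower():    # exit condition
            assistant.speak("Goodbye!")
            break
        reply = assistant.think(request)    # brain
        assistant.speak(reply)              # voice
```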
Customizing your GPT assistant is a breeze! The code we built is very modular, and it allows you to customize it by adding a variety of features. Here are some ideas to get you started:
- Give a task to the assistant: Change the initial prompt to make your assistant act as your English teacher, motivational speaker, or anything else you can think of! Try Awesome ChatGPT Prompts for more ideas.
- Change the language: Want to use another language? No problem! Simply change english in the code to your desired language.
- Build an app: You can easily integrate the assistant into any application.
- Add personality: Give your assistant a unique personality by adding custom responses or using different tones and language styles.
- Integrate with other APIs: Integrate your assistant with other APIs to provide more advanced functionality, such as weather forecasts or news updates.
In this article, we explained how to retrieve your OpenAI API key and provided code examples for the listen, think, and speak functions, which are used to capture user input, generate responses, and convert text to speech for playback.
With this knowledge, you can begin creating your own unique voice assistant, suited to your specific needs. The possibilities are endless, from creating a personal assistant to help with daily tasks, to building a voice-controlled automation system. You can access all of the code in the linked GitHub repo.