How to Create a Speech-to-Text-to-Speech Program
It’s been exactly a decade since I began attending GeekCon (yes, a geeks’ conference 🙂), a weekend-long hackathon-makeathon where all projects have to be useless and just-for-fun, and this year there was an exciting twist: all projects were required to include some form of AI.
My group’s project was a speech-to-text-to-speech game, and here’s how it works: the user selects a character to talk with, and then says anything they’d like to the character out loud. This spoken input is transcribed and sent to ChatGPT, which responds as if it were the character. The response is then read aloud using text-to-speech technology.
Now that the game is up and running, bringing laughs and fun, I’ve crafted this how-to guide to show you how to create a similar game on your own. Throughout the article, we’ll also explore the various considerations and decisions we made during the hackathon.
Want to see the full code? Here is the link!
Once the server is running, the user will hear the app “talking”, prompting them to choose the figure they want to talk with and to begin conversing with their chosen character. Whenever they want to speak out loud, they should press and hold a key on the keyboard while talking. When they finish talking (and release the key), their recording is transcribed by Whisper (a speech-to-text model by OpenAI), and the transcription is sent to ChatGPT for a response. The response is read out loud using a text-to-speech library, and the user hears it.
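We won’t dig into the recording code below, but if you’re curious how a press-and-hold mechanic could work, here is a rough sketch using the keyboard and sounddevice libraries. Both library choices, the key, the sample rate, and the file name are assumptions for illustration, not necessarily what the project uses:

import keyboard          # hypothetical choice for push-to-talk detection
import numpy as np
import sounddevice as sd
from scipy.io.wavfile import write

SAMPLE_RATE = 16_000


def record_while_key_held(key: str = "space") -> str:
    # Block until the user presses the key, then capture audio frames
    # until they release it
    keyboard.wait(key)
    frames = []

    def callback(indata, frame_count, time_info, status):
        frames.append(indata.copy())

    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                        callback=callback):
        while keyboard.is_pressed(key):
            sd.sleep(50)  # poll the key state every 50 ms

    # Concatenate the captured frames and save them as a WAV file
    write("recording.wav", SAMPLE_RATE, np.concatenate(frames))
    return "recording.wav"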
Disclaimer
Note: The project was developed on a Windows operating system and uses the pyttsx3 library, which lacks compatibility with M1/M2 chips. As pyttsx3 isn’t supported on Mac, users are advised to explore alternative text-to-speech libraries that are compatible with macOS environments.
OpenAI Integration
I utilized two OpenAI models: Whisper, for speech-to-text transcription, and the ChatGPT API for generating responses based on the user’s input to their chosen figure. While doing so costs money, the pricing model is very cheap, and personally, my bill is still under $1 for all my usage. To start, I made an initial deposit of $5; so far I have not exhausted this deposit, and it won’t expire until a year from now.
I’m not receiving any payment or benefits from OpenAI for writing this.
Once you get your OpenAI API key, set it as an environment variable to use when making the API calls. Make sure not to push your key to the codebase or any public location, and never share it unsafely.
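For example, reading the key in Python looks like this (assuming you’ve exported it as OPENAI_API_KEY, the variable name used in the snippets below):

import os

import openai

# Read the key from the environment rather than hard-coding it
openai.api_key = os.environ.get("OPENAI_API_KEY")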
Speech to Text — Create Transcription
The implementation of the speech-to-text feature was achieved using Whisper, an OpenAI model.
Below is the code snippet for the function responsible for transcription:
import asyncio
import os
from threading import Thread
from typing import Optional

import openai


async def get_transcript(audio_file_path: str,
                         text_to_draw_while_waiting: str) -> Optional[str]:
    openai.api_key = os.environ.get("OPENAI_API_KEY")
    audio_file = open(audio_file_path, "rb")
    transcript = None

    async def transcribe_audio() -> None:
        # Store the result in the enclosing scope so the caller can read it
        nonlocal transcript
        try:
            response = openai.Audio.transcribe(
                model="whisper-1", file=audio_file, language="en")
            transcript = response.get("text")
        except Exception as e:
            print(e)

    # Print a "please wait" message in the background while transcribing
    draw_thread = Thread(target=print_text_while_waiting_for_transcription,
                         args=(text_to_draw_while_waiting,))
    draw_thread.start()

    transcription_task = asyncio.create_task(transcribe_audio())
    await transcription_task

    if transcript is None:
        print("Transcription not available within the specified timeout.")

    return transcript
This function is marked as asynchronous (async) because the API call may take a while to return a response, and we await it to ensure that the program doesn’t progress until the response is received.
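A hypothetical call site (the file name and waiting message here are just examples) awaits it from within an async function:

# Somewhere inside the game's async flow
transcript = await get_transcript("recording.wav",
                                  "Transcribing your speech...")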
As you can see, the get_transcript function also invokes the print_text_while_waiting_for_transcription function. Why? Since obtaining the transcription is a time-consuming task, we wanted to keep the user informed that the program is actively processing their request and not stuck or unresponsive. As a result, this text is printed progressively while the user awaits the next step.
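The real implementation lives in the repo; a minimal sketch of the idea, printing the waiting message one character at a time, could look like this:

import sys
import time


def print_text_while_waiting_for_transcription(text_to_draw: str) -> None:
    # Gradually print the message so the user sees ongoing activity
    for char in text_to_draw:
        sys.stdout.write(char)
        sys.stdout.flush()
        time.sleep(0.1)
    print()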
String Matching Using FuzzyWuzzy for Text Comparison
After transcribing the speech into text, we either used it as is, or attempted to compare it with an existing string.
The comparison use cases were: choosing a figure from a predefined list of options, deciding whether to continue playing or not, and, when opting to continue, deciding whether to choose a new figure or stick with the current one.
In such cases, we wanted to compare the user’s spoken input transcription with the options in our lists, and therefore we decided to use the FuzzyWuzzy library for string matching.
This enabled choosing the closest option from the list, as long as the matching score exceeded a predefined threshold.
Here’s a snippet of our function:
from typing import List

from fuzzywuzzy import fuzz


def detect_chosen_option_from_transcript(
        transcript: str, options: List[str]) -> str:
    best_match_score = 0
    best_match = ""

    for option in options:
        score = fuzz.token_set_ratio(transcript.lower(), option.lower())
        if score > best_match_score:
            best_match_score = score
            best_match = option

    # Only accept the best match if it clears the predefined threshold
    if best_match_score >= 70:
        return best_match
    else:
        return ""
If you want to learn more about the FuzzyWuzzy library and its functions, you can check out an article I wrote about it here.
Get ChatGPT Response
Once we have the transcription, we can send it over to ChatGPT to get a response.
For each ChatGPT request, we added a prompt asking for a short and funny response. We also told ChatGPT which figure to pretend to be.
So our function looked as follows:
import logging


def get_gpt_response(transcript: str, chosen_figure: str) -> str:
    system_instructions = get_system_instructions(chosen_figure)
    try:
        return make_openai_request(
            system_instructions=system_instructions,
            user_question=transcript).choices[0].message["content"]
    except Exception as e:
        logging.error(f"could not get ChatGPT response. error: {str(e)}")
        raise e
and the system instructions looked as follows:
def get_system_instructions(figure: str) -> str:
    return f"You provide funny and short answers. You are: {figure}"
Text to Speech
For the text-to-speech part, we opted for a Python library called pyttsx3. This choice was not only straightforward to implement but also offered several additional benefits: it’s free of charge, it provides two voice options (female and male), and it lets you select the speaking rate in words per minute (speech speed).
When a user starts the game, they pick a character from a predefined list of options. If we couldn’t find a match for what they said within our list, we’d randomly select a character from our “fallback figures” list. In both lists, each character was associated with a gender, so our text-to-speech function also received the voice ID corresponding to the chosen gender.
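For illustration, such a mapping might look like the following (the names here are hypothetical; the actual lists live in the repo):

# Hypothetical figure-to-gender mappings, for illustration only
FIGURES = {"Shrek": "male", "Oprah Winfrey": "female", "Trump": "male"}
FALLBACK_FIGURES = {"Yoda": "male", "Hermione Granger": "female"}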
This is what our text-to-speech function looked like:
import pyttsx3


def text_to_speech(text: str, gender: str = Gender.FEMALE.value) -> None:
    # Gender and WORDS_PER_MINUTE_RATE are defined elsewhere in the project
    engine = pyttsx3.init()
    engine.setProperty("rate", WORDS_PER_MINUTE_RATE)
    voices = engine.getProperty("voices")
    voice_id = voices[0].id if gender == "male" else voices[1].id
    engine.setProperty("voice", voice_id)
    engine.say(text)
    engine.runAndWait()
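Calling it is then a one-liner:

text_to_speech("Hello! Who would you like to talk with?", gender="female")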
The Main Flow
Now that we’ve more or less got all the pieces of our app in place, it’s time to dive into the gameplay! The main flow is outlined below. You might notice some functions we haven’t delved into (e.g. choose_figure, play_round), but you can explore the full code by checking out the repo. Essentially, most of these higher-level functions tie into the inner functions we’ve covered above.
Here’s a snippet of the main game flow:
import asyncio

from src.handle_transcript import text_to_speech
from src.main_flow_helpers import (choose_figure, start, play_round,
                                   is_another_round)


def farewell() -> None:
    farewell_message = ("It was great having you here, "
                        "hope to see you again soon!")
    print(f"\n{farewell_message}")
    text_to_speech(farewell_message)


async def get_round_settings(figure: str) -> dict:
    new_round_choice = await is_another_round()
    if new_round_choice == "new figure":
        return {"figure": "", "another_round": True}
    elif new_round_choice == "no":
        return {"figure": "", "another_round": False}
    elif new_round_choice == "yes":
        return {"figure": figure, "another_round": True}


async def main():
    start()
    another_round = True
    figure = ""
    while True:
        if not figure:
            figure = await choose_figure()
        while another_round:
            await play_round(chosen_figure=figure)
            user_choices = await get_round_settings(figure)
            figure, another_round = (user_choices.get("figure"),
                                     user_choices.get("another_round"))
            if not figure:
                break
        if another_round is False:
            farewell()
            break


if __name__ == "__main__":
    asyncio.run(main())
We had several ideas in mind that we didn’t get to implement during the hackathon, either because we didn’t find an API we were satisfied with during that weekend, or because time constraints prevented us from developing certain features. These are the paths we didn’t take for this project:
Matching the Response Voice with the Chosen Figure’s “Actual” Voice
Imagine if the user chose to talk with Shrek, Trump, or Oprah Winfrey. We wanted our text-to-speech library or API to articulate responses using voices that matched the chosen figure. However, we couldn’t find a library or API during the hackathon that offered this feature at a reasonable cost. We’re still open to suggestions if you have any =)
Let the Users Talk to “Themselves”
Another intriguing idea was to prompt users to provide a vocal sample of themselves speaking. We’d then train a model using this sample and have all the responses generated by ChatGPT read aloud in the user’s own voice. In this scenario, the user could select the tone of the responses (affirmative and supportive, sarcastic, angry, etc.), but the voice would closely resemble that of the user. However, we couldn’t find an API that supported this within the constraints of the hackathon.
Adding a Frontend to Our Application
Our initial plan was to include a frontend component in our application. However, due to a last-minute change in the number of participants in our group, we decided to prioritize the backend development. As a result, the application currently runs on the command-line interface (CLI) and doesn’t have a frontend.
Latency is what bothers me most at the moment.
There are several components in the flow with relatively high latency that, in my opinion, slightly harm the user experience. For example: the time between finishing the audio input and receiving a transcription, and the time from when the user presses a button until the system actually starts recording the audio. So if the user starts talking right after pressing the key, at least one second of audio won’t be recorded due to this lag.
Want to see the whole project? It’s right here!
Also, warm credit goes to Lior Yardeni, my hackathon partner, with whom I created this game.
In this article, we learned how to create a speech-to-text-to-speech game using Python, and intertwined it with AI. We used the Whisper model by OpenAI for speech recognition, played around with the FuzzyWuzzy library for text matching, tapped into ChatGPT’s conversational magic via their developer API, and brought it all to life with pyttsx3 for text-to-speech. While OpenAI’s services (Whisper and ChatGPT for developers) do come with a modest cost, they’re budget-friendly.
We hope you’ve found this guide enlightening and that it motivates you to embark on your own projects.
Cheers to coding and fun! 🚀