Now, more than ever, is the time for AI-powered voice-based systems. Consider a call to customer service. Soon all of the brittleness and inflexibility may be gone – the stiff robotic voices, the “press one for sales”-style constricting menus, the annoying experiences that have had us all frantically pressing zero in the hopes of talking to a human agent instead. (Or, given the long waiting times that being transferred to a human agent can entail, had us giving up on the call altogether.)
No more. Advances not only in transformer-based large language models (LLMs) but also in automatic speech recognition (ASR) and text-to-speech (TTS) systems mean that “next-generation” voice-based agents are here – if you know how to build them.
Today we take a look at the challenges facing anyone hoping to build such a state-of-the-art voice-based conversational agent.
Before jumping in, let’s take a quick look at the general appeal and relevance of voice-based agents (as opposed to text-based interactions). There are many reasons why a voice interaction might be more appropriate than a text-based one – these can include, in increasing order of severity:
- Preference or habit – speaking pre-dates writing, both developmentally and historically
- Slow text input – many people can speak faster than they can type
- Hands-free situations – such as driving, working out, or doing the dishes
- Illiteracy – at least in the language(s) the agent understands
- Disabilities – such as blindness or lack of non-vocal motor control
In an age seemingly dominated by website-mediated transactions, voice remains a powerful conduit for commerce. For example, a recent study by JD Power of customer satisfaction in the hotel industry found that guests who booked their room over the phone were more satisfied with their stay than those who booked through an online travel agency (OTA) or directly through the hotel’s website.
But interactive voice responses, or IVRs for short, are not enough. A 2023 study by Zippia found that 88% of customers prefer voice calls with a live agent rather than navigating an automated phone menu. The study also found that the things that annoy people most about phone menus include listening to irrelevant options (69%), inability to fully describe the issue (67%), inefficient service (33%), and confusing options (15%).
And there’s an openness to using voice-based assistants. According to a study by Accenture, around 47% of customers are already comfortable using voice assistants to interact with businesses, and around 31% of customers have already used a voice assistant to interact with a business.
Whatever the reason, for many people there is a preference and demand for spoken interaction – as long as it’s natural and comfortable.
Roughly speaking, a good voice-based agent should respond to the user in a way that is:
- Relevant: Based on an accurate understanding of what the user said/wanted. Note that in some cases, the agent’s response will not just be a spoken reply, but some form of action through integration with a backend (e.g., actually causing a hotel room to be booked when the caller says “Go ahead and book it”); a minimal sketch of such backend integration follows this list.
- Accurate: Based on the facts (e.g., only say there’s a room available at the hotel on January 19th if there is)
- Clear: The response should be understandable
- Timely: With the kind of latency one would expect from a human
- Safe: No offensive or inappropriate language, no revealing of protected information, etc.
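To make the backend-integration part of “Relevant” concrete, here is a minimal, hypothetical sketch of how a spoken confirmation could be turned into an actual action. The function names (`book_room`, `parse_agent_action`) and the JSON action format are illustrative assumptions, not any particular vendor’s API.

```python
import json

def book_room(guest_name: str, date: str) -> str:
    # Hypothetical backend call – in a real system this would hit the hotel's booking API.
    return f"Room booked for {guest_name} on {date}."

def parse_agent_action(llm_output: str) -> dict | None:
    # Assume the LLM has been instructed to emit a JSON "action" when the caller confirms,
    # e.g. {"action": "book_room", "guest_name": "Ada", "date": "2024-01-19"}.
    try:
        data = json.loads(llm_output)
        return data if "action" in data else None
    except json.JSONDecodeError:
        return None  # ordinary spoken reply, no backend action

def handle_llm_output(llm_output: str) -> str:
    action = parse_agent_action(llm_output)
    if action and action["action"] == "book_room":
        return book_room(action["guest_name"], action["date"])
    return llm_output  # nothing to execute; just speak the reply
```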
Current voice-based automated systems attempt to meet the above criteria at the expense of being a) very limited and b) very frustrating to use. Part of this is a result of the high expectations that a voice-based conversational context sets, expectations that only get higher as voice quality in TTS systems becomes indistinguishable from human voices. But those expectations are dashed by the systems that are widely deployed at the moment. Why?
In a word – inflexibility:
- Limited speech – the user is often forced to say things unnaturally: in short phrases, in a specific order, without spurious information, etc. This offers little or no advance over the old-school number-based menu system
- Narrow, non-inclusive notion of “acceptable” speech – low tolerance for slang, uhms and ahs, etc.
- No backtracking – if something goes wrong, there may be little chance of “repairing” or correcting the problematic piece of information; instead, the user has to start over or wait for a transfer to a human
- Strict turn-taking – no ability to interrupt or talk over the agent
It goes without saying that people find these constraints annoying or frustrating.
The good news is that modern AI systems are powerful and fast enough to vastly improve on the above kinds of experiences, approaching (or even exceeding!) human-based customer service standards. This is due to a variety of factors:
- Faster, more powerful hardware
- Improvements in ASR (higher accuracy, robustness to noise, accents, etc.)
- Improvements in TTS (natural-sounding and even cloned voices)
- The advent of generative LLMs (natural-sounding conversations)
That last point is a game-changer. The key insight was that a good predictive model can serve as a good generative model. An artificial agent can get close to human-level conversational performance if it says whatever a sufficiently good LLM predicts to be the most likely thing a good human customer service agent would say in the given conversational context.
Cue the arrival of dozens of AI startups hoping to solve the voice-based conversational agent problem simply by choosing, and then connecting, off-the-shelf ASR and TTS modules to an LLM core. On this view, the solution is merely a matter of choosing a combination that minimizes latency and cost. And of course, that’s important. But is it enough?
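In outline, this “just wire the modules together” approach looks something like the sketch below. The three callables are stand-ins for whichever off-the-shelf ASR, LLM, and TTS services you choose; the structure, not any particular vendor’s API, is the point.

```python
from typing import Callable

def handle_turn(
    audio_in: bytes,
    history: list[dict],
    asr_transcribe: Callable[[bytes], str],     # off-the-shelf ASR of your choice
    llm_complete: Callable[[list[dict]], str],  # LLM chat completion
    tts_synthesize: Callable[[str], bytes],     # off-the-shelf TTS of your choice
) -> bytes:
    """One conversational turn: caller audio in, agent audio out."""
    user_text = asr_transcribe(audio_in)
    history.append({"role": "user", "content": user_text})
    agent_text = llm_complete(history)          # the LLM predicts the agent's next utterance
    history.append({"role": "assistant", "content": agent_text})
    return tts_synthesize(agent_text)
```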
There are several specific reasons why that simple approach won’t work, but they derive from two general points:
- LLMs can’t, on their own, provide the kind of accurate, fact-based text conversations required for enterprise applications like customer service. So they can’t, on their own, do that for voice-based conversations either. Something else is needed.
- Even if you do supplement LLMs with what is needed to make a good text-based conversational agent, turning that into a good voice-based conversational agent requires more than just hooking it up to the best ASR and TTS modules you can afford.
Let’s look at a specific example of each of these challenges.
Challenge 1: Keeping it Real
As is now widely known, LLMs sometimes produce inaccurate or ‘hallucinated’ information. This is disastrous in the context of many business applications, even if it might make for a good entertainment application where accuracy may not be the point.
That LLMs sometimes hallucinate is only to be expected, on reflection. It’s a direct consequence of using models trained on data from a year (or more) ago to generate answers to questions about facts that are not part of, or entailed by, a data set (however huge) that might be a year or more old. When the caller asks “What’s my membership number?”, a simple pre-trained LLM can only generate a plausible-sounding answer, not an accurate one.
The most common ways of dealing with this problem are:
- Fine-tuning: Train the pre-trained LLM further, this time on all the domain-specific data that you want it to be able to respond to appropriately.
- Prompt engineering: Add the extra data/instructions as an input to the LLM, alongside the conversational history.
- Retrieval Augmented Generation (RAG): Like prompt engineering, except the data added to the prompt is determined on the fly by matching the current conversational context (e.g., the customer has asked “Does your hotel have a pool?”) against an embedding-encoded index of your domain-specific data (which includes, e.g., a file that says: “Here are the amenities available at the hotel: pool, sauna, EV charging station.”). A minimal sketch appears after the next paragraph.
- Rule-based control: Like RAG, but what is to be added to (or subtracted from) the prompt is not retrieved by matching against a neural memory; instead it is determined through hard-coded (and hand-coded) rules.
Note that one size doesn’t fit all. Which of these methods is appropriate depends, for example, on the domain-specific data that informs the agent’s answer. In particular, it depends on whether that data changes frequently (call to call, say – e.g., the customer’s name) or hardly at all (e.g., the initial greeting: “Hello, thank you for calling the Hotel Budapest. How may I help you today?”). Fine-tuning would not be appropriate for the former, and RAG would be a slipshod solution for the latter. So any working system will have to use a variety of these methods.
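As a rough illustration of how these pieces might be combined, here is a minimal sketch under the assumptions just described: slow-changing text is hard-coded in the system prompt, per-call data is injected directly, and facts are retrieved RAG-style from a small indexed document set. The toy `embed` function is a stand-in for a real embedding model, and all names are hypothetical.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a real embedding model (hashed bag of words)."""
    vec = np.zeros(256)
    for word in text.lower().split():
        vec[hash(word) % 256] += 1.0
    return vec

# Rarely changing: fine to hard-code (or fine-tune on).
SYSTEM_PROMPT = "You are the phone agent for the Hotel Budapest. Be concise and polite."

# Domain documents indexed for retrieval (RAG).
DOCUMENTS = [
    "Here are the amenities available at the hotel: pool, sauna, EV charging station.",
    "Check-in starts at 3 pm; check-out is at 11 am.",
]
DOC_VECTORS = np.stack([embed(d) for d in DOCUMENTS])

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k documents whose toy embeddings best match the query."""
    scores = DOC_VECTORS @ embed(query)
    return [DOCUMENTS[i] for i in np.argsort(-scores)[:k]]

def build_prompt(caller: dict, history: str, user_turn: str) -> str:
    per_call = f"Caller: {caller['name']} (membership {caller['membership_id']})."  # changes every call
    facts = "\n".join(retrieve(user_turn))  # changes with the conversation
    return (f"{SYSTEM_PROMPT}\n{per_call}\nRelevant facts:\n{facts}\n\n"
            f"{history}\nCustomer: {user_turn}\nAgent:")
```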
What’s more, integrating these methods with the LLM, and with one another, in a way that minimizes latency and cost requires careful engineering. For example, your model’s RAG performance might improve if you fine-tune it to facilitate that method.
It may come as no surprise that each of these methods in turn introduces its own challenges. Take fine-tuning, for example. Fine-tuning your pre-trained LLM on your domain-specific data will improve its performance on that data, yes. But fine-tuning modifies the parameters (weights) that are the basis of the pre-trained model’s (presumably fairly good) general performance. This modification therefore causes an unlearning (or “catastrophic forgetting”) of some of the model’s previous knowledge, which can result in the model giving incorrect or inappropriate (even unsafe) responses. If you want your agent to continue to respond accurately and safely, you need a fine-tuning method that mitigates catastrophic forgetting.
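One widely used family of mitigations is parameter-efficient fine-tuning, e.g., LoRA, which freezes the original weights and trains only small adapter matrices, so the base model’s general knowledge is largely preserved; another is mixing general-domain examples back into the fine-tuning set (“rehearsal”). Below is a minimal sketch using the Hugging Face peft library; the model id is a placeholder, and this is one possible mitigation, not the only one.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder model id – substitute whichever base LLM you are actually using.
base_model = AutoModelForCausalLM.from_pretrained("your-base-model")

# LoRA: freeze the pre-trained weights and train small low-rank adapters instead,
# which limits how much of the original knowledge can be overwritten.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adjust to your model's architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically a small fraction of the base model's weights
```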
Challenge 2: Endpointing and Interruptions
Determining when a customer has finished speaking (endpointing) is critical for natural conversation flow. Similarly, the system must handle interruptions gracefully, ensuring the conversation stays coherent and responsive to the customer’s needs. Achieving this to a standard comparable to human interaction is a complex task, but it is essential for creating natural and pleasant conversational experiences.
A solution that works requires the designers to consider questions like these:
- How long after the customer stops speaking should the agent wait before deciding that the customer has finished?
- Does the answer to the above depend on whether the customer has completed a full sentence?
- What should be done if the customer interrupts the agent?
- In particular, should the agent assume that what it was saying was not heard by the customer?
These issues, which largely have to do with timing, require careful engineering above and beyond what is involved in getting an LLM to give an accurate response.
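To give a flavor of what answers to these questions can look like in code, here is a deliberately simple, hypothetical endpointing heuristic: wait a shorter time after an utterance that looks like a completed sentence, longer after one that looks unfinished, and treat detected speech during playback as an interruption. Real systems use trained voice-activity and end-of-turn models; the thresholds and helper names here are made-up placeholders.

```python
# Assumed thresholds (seconds) – these would need tuning against real call data.
SILENCE_AFTER_COMPLETE = 0.7
SILENCE_AFTER_INCOMPLETE = 1.5

def looks_complete(transcript_so_far: str) -> bool:
    """Crude proxy for 'the caller has finished a thought'."""
    text = transcript_so_far.strip().lower()
    return text.endswith((".", "?", "!")) or text.endswith(("yes", "no", "thanks"))

def should_end_turn(transcript_so_far: str, silence_seconds: float) -> bool:
    """Decide whether the caller has finished speaking and the agent may reply."""
    threshold = SILENCE_AFTER_COMPLETE if looks_complete(transcript_so_far) else SILENCE_AFTER_INCOMPLETE
    return silence_seconds >= threshold

def on_caller_speech_during_playback(agent_utterance: str, chars_spoken: int) -> str:
    """Barge-in handling: stop playback and remember what the caller did NOT hear."""
    unheard = agent_utterance[chars_spoken:]
    # The dialogue manager can then decide whether to repeat, rephrase, or drop the unheard part.
    return unheard
```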
The evolution of AI-powered voice-based systems promises a revolutionary shift in customer service dynamics, replacing antiquated phone systems with advanced LLMs, ASR, and TTS technologies. However, overcoming the challenges of hallucinated information and seamless endpointing will be pivotal for delivering natural and efficient voice interactions.
Automating customer service has the power to become a real game changer for enterprises, but only if done properly. In 2024, particularly with all these recent technologies, we can finally build systems that feel natural and flowing, and that robustly understand us. The net effect will be reduced wait times and an improvement over our current experience with voice bots, marking a transformative era in customer engagement and service quality.