Optimizing the ASR Experience

Patrick Ehlen / Jan 2022 / Technology, Artificial Intelligence

Successfully engaging with contact centers through speech recognition is more complicated than talking to computers through the now-common personal devices.

You ask Siri if you need your umbrella, and she tells you it looks like rain. You ask Alexa to play some jazz, and Miles Davis flows out of the speaker. So, computers understand what we say now, right?

But the answer isn’t so simple, because to understand what we say, computers must solve two problems:

1. Taking the sounds people make and turning them into words, a task commonly known as automatic speech recognition (ASR).

2. Analyzing those words and deciding what they mean.

The Problem of Transcribing Human Conversations

Let’s look at the first problem of taking audio and converting it to text by using ASR, particularly in the context of the contact center.

Recognizing the speech of a person talking to a digital assistant like Siri or Alexa is one kind of problem. Here, a person speaks directly to a machine using computer-directed speech, usually in the form of a command, so the duration of that speech is limited and the syntax is usually simple.

The contact center poses another kind of problem. Here, one person (the customer) typically talks to another person (the contact center agent), who then talks back, and so on, in an ongoing activity commonly known as a conversation.

For a computer to “listen in” and attempt to transcribe that conversation is a lot more complicated than in the computer-directed speech case. Here’s why:

The duration of time that either person is speaking is usually longer and less predictable.
The speech they use not only has more complex syntax, but also includes many conversational artifacts that you don’t usually hear in computer-directed speech, such as interruptions, hesitations, hedges, restarts, repairs, and backchanneling.
The people speaking with each other may have dramatically different rates of speech, regional accents, audio volumes, and other factors that make automatic recognition more challenging.

Because automatically transcribing human-to-human conversations is so challenging, many companies that provide ASR services now offer models tuned specifically for transcribing contact center calls, videos, and other types of conversations. ASR that is tuned to these particularly difficult use cases is called conversational ASR.

Examining High Accuracy Claims

The good news is, while conversational ASR remains a difficult problem, it has seen significant improvements in accuracy over the past half decade.

In fact, some companies have reported benchmark accuracy results that appear high enough to lead one to believe that the problem of transcribing human-to-human speech is also mostly solved.

But is that true?

For companies to report benchmark results on conversational ASR, they need to do so on a standardized dataset that is publicly available, allowing them to make fair, apples-to-apples comparisons.

The high accuracies you might hear about for conversational ASR systems typically come from a single dataset, known as the Switchboard Corpus, which is a corpus of telephone calls collected by Texas Instruments (TI) from 1990 to 1991.

To collect the corpus, TI recruited ordinary people from around the United States and asked them to call each other on landline telephones and carry on casual conversations from a fixed set of general-interest topics.

The recordings from these conversations were then collected and transcribed by human transcribers, providing a “gold standard” transcription that future conversational ASR systems could test against.

While high accuracy on the Switchboard Corpus benchmark might indicate that a conversational ASR does well on the general task of understanding human conversations, this doesn’t necessarily imply that an ASR will exhibit the same level of accuracy when it comes to transcribing contact center calls.

Why? Because of the differing characteristics between the data in the Switchboard Corpus and contact center data, in particular:

1. Audio quality. The Switchboard Corpus audio data was collected at a time when most people used landline telephones and mobile phones were rare. Thus, the recordings were made from landline handsets that offered low-noise audio delivered from stationary locations within generally quiet rooms.

But in modern contact centers, customer audio frequently originates from cell phones and wireless headsets and is delivered from a wide variety of noisy mobile circumstances. Audio in the Switchboard Corpus is also normalized for the amplitude between the two speakers. But in contact centers there is often a big difference between the quality and volume of the audio for the customers as compared to the agents.

2. Speaker accents. Participants in the Switchboard Corpus data collection were overwhelmingly native U.S. English speakers with mostly similar dialects and accents. But this does not reflect the distribution of accents and dialects encountered in today’s international contact centers.

3. Vocabulary. The speakers in the Switchboard study were asked to have a conversation with a random stranger about a topic from a predefined list of general knowledge topics, such as gardening.

The result is that the language used throughout the corpus is very general. It does not, however, cover any of the obscure jargon specific to a company or an industry that frequently comes up in real-life contact center scenarios.

Because of these significant differences between the Switchboard Corpus and modern contact centers, we can’t assume that general-purpose ASRs that perform well on the Switchboard Corpus will also perform well in real-life contact center scenarios, such as those with poor audio quality, background noise, technical jargon, multiple accents, and other issues common to the contact center.

Instead, we need to look beyond these generalized benchmarks when evaluating ASR capabilities for contact center audio.

Improving Conversational ASR Performance

If we can’t rely on these general conversational ASR models to provide good performance on contact center transcription, then what can be done?

One answer is to train ASR models specifically for contact center calls, using only contact center data. By doing this, the ASR model can be tuned to perform particularly well by optimizing for certain aspects of contact center calls.

Another answer is optimizing for individual audio channels. By treating the customer and agent sides of calls differently, we can optimize for the variations in acoustics and accents that are common to each side of the call.

If customers typically have one type of accent and agents have another, these factors can be predicted ahead of time and baked into the ASR model’s training and expectations about what it will hear.

Contact centers can also train on specific vocabularies. If a center serves a particular company or industry, the vocabulary and jargon common in that center can be thoroughly represented in the model’s training, making it easier for the ASR to recognize these important words.

Finally, centers can optimize beyond word-for-word accuracy. Typically, the transcripts of contact center conversations produced by ASR are not the end result of the analytics pipeline. Contact center analysts want to know what issues people were calling about, what sentiments they expressed, and how the callers’ issues were resolved.

To this end, some words in the automatic transcripts are far more important than others. However, most ASR models are trained to optimize for a metric called word error rate (WER), which measures the proportion of words the ASR got wrong and treats every word error the same.

That is, in an ASR optimized for WER, mistaking the word “a” for “the” is counted as just as much of an error as mistaking the word “hate” for “great.”

But of course, the latter might pose a much bigger problem than the former for the end result of the system. Thus, ASR providers that control the entire pipeline of the call analytics platform can optimize the ASR to perform much better than those that lack insight into what the desired downstream results might be.
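To make the equal-weighting point concrete, here is a minimal Python sketch of WER, computed in the usual way as the word-level edit distance (substitutions, insertions, and deletions) divided by the number of words in the reference transcript. The function names and the sample sentences are illustrative assumptions and are not taken from any particular ASR toolkit.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Minimum number of word substitutions, insertions, and deletions
    needed to turn the hypothesis into the reference (edit distance over words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return word_errors(reference, hypothesis) / ref_len


# Hypothetical transcripts, invented for illustration only.
reference = "i hate the new billing page"
minor_slip = "i hate a new billing page"     # "the" heard as "a"
major_slip = "i great the new billing page"  # "hate" heard as "great"

print(wer(reference, minor_slip))
print(wer(reference, major_slip))
```

Both hypotheses contain exactly one substituted word out of six, so they receive the same WER, even though only the second would mislead a downstream sentiment analysis.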

The Bottom Line

As we’ve seen, producing accurate ASR transcripts from human-to-human conversations is a particularly difficult problem, and producing them from contact center conversations is more difficult still.

While some companies have shown significant improvements in accuracy on general conversational ASR benchmarks, the differences between the data used for those benchmarks and the data seen in contact centers are significant enough to warrant using specialized models for transcribing contact center calls.

An ASR system that is designed specifically for contact centers delivers maximum business value as part of a suite of technologies that aims to automate and enhance the customer-agent journey. Investment in these areas dramatically improves customer experience and brand perception.
