Speech Recognition for Dummies

OK, I often have to explain to people what I do and in most cases I get an enquiring and mystified look! What is Speech Recognition, let alone VUI Design! So I guess I have to go back to basics for a bit and explain what Speech Recognition is and what speech recognition applications involve.

What is Speech Recognition then?

Speech Recognition is the conversion of speech to text.  The words that you speak are turned into a written representation of those words for the computer to process further (figure out what you want in order to decide what to do or say next). This is not an exact science because – even among us humans – speech recognition is difficult and is fraught with misunderstandings or incomplete understanding. How many times have you had to repeat your name to someone (both in person and on the phone)? How many times have you had someone cracking up with laughter, because they thought you said something different to what you actually said? These are examples of human speech recognition failing magnificently! So it is no wonder that machines do it even less well. It’s all guesswork really.

In the case of machine speech recognition, the machine has a kind of lexicon at its disposal with possible words in the corresponding language (English, French, German etc.) and their phonetic representation. This phonetic representation describes the ways that people are most likely to pronounce this specific word (think of the Queen’s English, or Hochdeutsch for German, at this point). Now if you bring regional accents and foreigners speaking the language into the equation, things get even more complicated. The very same letter combinations or whole words are pronounced completely differently depending on whether you are from London, Liverpool, Newcastle, Edinburgh, Dublin, Sydney, New York, or New Orleans. Likewise, the very same English letter combinations and words will sound even more different when spoken by a Greek, a German or a Japanese person. In order to deal with those cases, speech recognition lexica are augmented with additional “pronunciations” for each problematic word. So the machine can hear 3 different versions of the same word spoken by different people and still recognise them as one and the same word. Sorted! Of course you don’t need to go to all this trouble for every possible word or phrase in the language you are covering with your speech application. You only need to go to such lengths for words and phrases that are relevant to your specific application (and domain), as well as for accents that are representative of your end-user population. If an app is going to be used mainly in England, you are better off covering Punjabi and Chinese pronunciations of your English app words rather than Japanese or German variants. There will of course be Japanese and German users of your system, but they represent a much smaller percentage of your user population and we can’t have everything!!
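To make the lexicon idea concrete, here is a minimal sketch in Python. The words, the ARPAbet-style phone symbols and the accent comments are all invented for illustration; a real recogniser’s lexicon is far bigger and is built with proper phonetic tools.

```python
# A toy pronunciation lexicon: each word maps to one or more possible
# pronunciations (written here as ARPAbet-style phone sequences).
# The alternative pronunciations cover accent and speaker variation.
LEXICON = {
    "tomato": [
        ["T", "AH", "M", "AA", "T", "OW"],   # British-style "tom-AH-to"
        ["T", "AH", "M", "EY", "T", "OW"],   # American-style "tom-AY-to"
    ],
    "bath": [
        ["B", "AA", "TH"],                   # Southern English long "a"
        ["B", "AE", "TH"],                   # Northern English short "a"
    ],
}

def matches_word(word: str, phones: list[str]) -> bool:
    """Return True if the recognised phone sequence matches ANY of the
    listed pronunciations of `word` - i.e. different spoken versions of
    the same word are still recognised as one and the same word."""
    return phones in LEXICON.get(word, [])

print(matches_word("bath", ["B", "AE", "TH"]))   # True: Northern variant
print(matches_word("bath", ["B", "IH", "TH"]))   # False: not in the lexicon
```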

Speech recognition may be based on text representations of words and their phonetic “translation” (pronunciations), but the whole process is actually statistical. What you say to the system is processed as a wave signal like this one here:

Speech signal for “.. and sadly crime experts predict that one day even a friendly conversation between mother and daughter will be conducted at gunpoint” 🙂  (Based on the Channel 4 comedy series “Brass Eye” – Season 1)
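As a rough illustration of that first step, here is a small Python sketch that chops an audio recording into short overlapping frames, which is what happens before any statistical processing. The file name and the frame and step sizes are just example values, and the sketch assumes 16-bit mono audio.

```python
import wave
import numpy as np

def frame_signal(path: str, frame_ms: float = 25.0, step_ms: float = 10.0) -> np.ndarray:
    """Read a 16-bit mono WAV file and slice it into short overlapping frames.
    A real recogniser then turns each frame into acoustic features
    (e.g. MFCCs) before doing any statistical matching."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        samples = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)
    frame_len = int(rate * frame_ms / 1000)   # samples per 25 ms window
    step = int(rate * step_ms / 1000)         # hop of 10 ms between windows
    n_frames = 1 + max(0, (len(samples) - frame_len) // step)
    return np.stack([samples[i * step : i * step + frame_len] for i in range(n_frames)])

# frames = frame_signal("gunpoint_quote.wav")   # hypothetical file name
# print(frames.shape)                           # (number_of_frames, samples_per_frame)
```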

So the machine will have to figure out what you’re saying by chopping this signal up into parts, each representing a word that makes sense in the context of the surrounding words. Unfortunately the same signal can potentially be chopped up in several different ways, each representing a different string of words and of course a different meaning! There’s a famous example of the following ambiguous string:

Signal for “How to Wreck a Nice Beach” err I mean “How to recognise Speech”!! (Taken from FNLP 2010: Lecture 1: Copyright (C) 2010 Henry S. Thompson)

The same speech signal can be heard as “How to wreck a nice beach” or … “How to recognise speech“!!! They sound very similar actually!! So you can see the types of problems that we humans, let alone a machine, are faced with when trying to recognise each other!
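Here is a toy Python sketch of exactly this segmentation problem: the same unbroken stream of “sounds” (crudely written as letters with no spaces) can be carved up into more than one valid word sequence. The word list and the letter stream are invented purely for the example.

```python
# Why segmentation is ambiguous: one unbroken stream can be split into
# several different word sequences, each with a different meaning.
WORDS = {"how", "to", "wreck", "a", "an", "ice", "nice", "beach"}

def segmentations(stream: str, prefix=()):
    """Yield every way of splitting `stream` into known words."""
    if not stream:
        yield prefix
        return
    for end in range(1, len(stream) + 1):
        candidate = stream[:end]
        if candidate in WORDS:
            yield from segmentations(stream[end:], prefix + (candidate,))

for hypothesis in segmentations("howtowreckanicebeach"):
    print(" ".join(hypothesis))
# -> "how to wreck a nice beach"
# -> "how to wreck an ice beach"
```

In a real recogniser, the acoustic scores and the statistical language model described further down are what decide which of the competing hypotheses wins.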

Speech Recognition Techniques

The approach to speech recognition described above, which uses hand-crafted lexica, is the standard “manual” approach. It is effective and sufficient for applications that cover very limited domains, e.g. ordering a printer or getting your account balance. The lexica and the corresponding manual “grammars” can describe most relevant phrases that are likely to be spoken by the user population. Any other phrases will just be irrelevant one-offs that can be ignored without negatively affecting the performance of the system.
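As a flavour of what such a hand-crafted grammar might look like, here is a minimal sketch for a pretend account-balance line. The phrases and intent names are made up, and real IVR grammars are usually written in dedicated formats such as SRGS rather than plain regular expressions.

```python
import re

# A miniature hand-crafted "grammar" for a narrow banking IVR domain.
# Each rule is a pattern over the recognised words; anything the rules
# do not cover is treated as an irrelevant one-off and ignored.
RULES = {
    "BALANCE": re.compile(r"^(what'?s|tell me|get)( me)? (my )?(account )?balance( please)?$"),
    "AGENT":   re.compile(r"^(speak|talk) to (an? )?(agent|advisor|human)( please)?$"),
}

def interpret(utterance: str) -> str:
    words = utterance.lower().strip()
    for intent, pattern in RULES.items():
        if pattern.match(words):
            return intent
    return "OUT_OF_GRAMMAR"   # ignored without hurting the rest of the system

print(interpret("What's my account balance please"))   # BALANCE
print(interpret("I'd like to book a holiday"))          # OUT_OF_GRAMMAR
```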

For anything more complex and advanced, there is the “statistical” approach. This involves the collection of large amounts of real-world speech data, preferably in your application domain: medical data for medical apps, online shopping data for a catalogue ordering app etc. The statistical recogniser is run over this data multiple times, resulting in statistical representations of the most likely and meaningful combinations of sounds in the specific human language (English, German, French, Urdu etc.). This type of speech recogniser is much more robust and accurate than a “symbolic” recogniser (which uses the manual approach), because it can accurately predict sound and word combinations that could not have been pre-programmed in a hand-crafted grammar. Thus statistical recognisers have much better coverage of what people actually say (rather than what the programmer or linguist thinks that people say). Sadly, most speech apps (the Interactive Voice Response systems or IVRs, for instance, used in Call Centre automation) are based on the manual symbolic approach rather than the fancy statistical one, because the latter requires considerable amounts of data and this data is not readily available (especially for a new app that has never existed before). A lot of time would need to be spent recording relevant human-2-human conversations, and even more time analysing them in a useful manner. Even when data is available, things such as cost and privacy protection get in the way of either acquiring it or putting it to use.
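A tiny sketch of the statistical idea, assuming a made-up three-sentence “corpus”: count which word pairs actually occur in the collected domain data, then use those counts to score competing recognition hypotheses. Real systems need vastly more data and proper smoothing, but the principle is the same.

```python
from collections import Counter

# Invented, far-too-small "corpus" of transcribed utterances in the domain.
corpus = [
    "i want to pay my bill",
    "i want to check my balance",
    "i need to pay my gas bill",
]

bigrams = Counter()
unigrams = Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()   # <s> marks the start of an utterance
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def score(sentence: str) -> float:
    """Product of bigram relative frequencies (no smoothing, for clarity)."""
    words = ["<s>"] + sentence.split()
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0
    return p

# The hypothesis actually seen in the data scores higher than an unseen one.
print(score("i want to pay my bill"))    # > 0
print(score("i want two pay my bill"))   # 0.0 - "want two" never observed
```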

Speech Recognition Applications

By now you should have realised how complex speech recognition is at the best of times, let alone how difficult it is to recognise people with different regional accents, linguistic backgrounds, and .. even moods or health conditions! (more on that later) Now let’s look at the different types of speech recognition applications. First of all, we should distinguish between speaker-dependent and speaker-independent apps.

Speaker-dependent applications involve the automatic speech recognition of a single person / speaker. It could be the dictation system that you’ve installed on your PC to take notes down, or to start writing emails and letters. It could be the hand-held dictation system that you carry around as a doctor or a lawyer, composing a medical report on your patients or talking to your clients, walking up and down the room. It could even be your standard mobile phone or smartphone / iPhone / Android that you use to call (voice dial) one of your saved contacts, search through your music library for a track with a simple voice command (or two), or even to tweet. All these are speaker-dependent applications in that the corresponding recogniser has been trained to work with your voice and your voice only. You may have trained it with as little as 5 minutes of speaking to it (or more or less in other cases), but it will work sufficiently well with your voice, even if you’ve got a cold (and therefore a hoarse voice) or you’re feeling low (and are therefore quieter than usual). Give it to your mate or colleague, though, and it will break down or misrecognise them in some way. The same recogniser will have to be retrained for any other speaker in order to work.

Enter speaker-independent speech recognition systems! They have been trained on huge amounts of real-world data with thousands of speakers of all kinds of different linguistic, ethnic, regional, or educational backgrounds. As a result, those systems can recognise anyone: you, your mate, all your colleagues, or anyone else you are likely to meet in the future. They are not tied to the way you pronounce things, your physiology or your voiceprint; they have been developed to work with any human (or indeed machine pretending to be a human, come to think of it!). So when you buy off-the-shelf speech recognition software, it is going to work immediately with any speaker, even if badly in some cases. You can later customise it to work for your specific app world and for your target user population, usually with some external help (enter Professional Services providers).

Speaker-independent applications can work on any phone (mobile or landline) and are used mainly to (partly) automate Call Centres and Helplines, e.g. speech and DTMF IVRs for online shopping, telephone banking or e-Government apps. OK, speech recognition on a mobile can be tricky, as the signal may not be good, i.e. intermittent, the line could be crackling, and of course there is the additional problem of background noise, since you are most likely to use it out in the busy streets or in some kind of loud environment. Speaker-independent recognition is also used to create voice portals, i.e. speech-enabled versions of websites for greater accessibility and usability (think of disabled Web users). Moreover, a speaker-independent recogniser is also used for voicemail transcription, that is when you get all the voicemails you have received on your phone transcribed automatically and sent to you as text messages, for instant and, importantly, discreet accessibility. These are B2B applications, which means that the solution is sold to a company (a Call Centre, a Bank, a Government organisation). In contrast, speaker-dependent apps are sold to an individual, so they are B2C apps, sold directly to the end customer.

Because speaker-independent apps have to work with any speaker calling from any device or channel (even the web, think of Skype), the corresponding speech recogniser is usually stored on a server or in the cloud somewhere. Speaker-dependent apps, on the other hand, are stored locally on your personal PC, laptop, Mac, mobile phone or handheld.

And to clear up any potential confusion beforehand: when you ring up an automated Call Centre IVR from your mobile (for instance to pay a utilities bill), you are using a speech recogniser stored in the server rooms of that Call Centre, the company, a reseller or a solution provider. So in that case, although you are using your unique voice on your personal mobile phone, the recogniser does not reside on it. The same holds for voicemail transcription, curiously! Although you are using your unique voiceprint on your personal phone to leave a voicemail on your mate’s phone, the speech recogniser used for the automatic transcription of your mate’s voicemail will be residing on some secret server somewhere, perhaps at the headquarters of their mobile provider or whoever is charging your mate for this handy service. In contrast, when you use a dictation / voice-to-text app on your smartphone to voice dial one of your contacts, your personal voiceprint, created during training and stored on the device, is used for the speech recognition process. So recognition is a built-in feature. Nowadays there is, however, a third case: if you are using your smartphone to search for an Indian restaurant on Google Maps, the recogniser actually resides in the cloud, on Google servers, rather than on the device. So there are ever more permutations of system configurations now!

There are many off-the-shelf speech recognition software packages out there. Nuance is one of the biggest technology providers for both speaker-independent and speaker-dependent / dictation apps. Other automatic speech recognition (ASR) software companies are Loquendo, Telisma, and LumenVox. Companies specialising in speaker-dependent / dictation systems are Philips, Grundig and Olympus, among others. However, Microsoft has also long been active in speech processing, and lately Google has been catching up very fast.

The sky is the limit, as the saying goes!

Responses to “Speech Recognition for Dummies”

  1. Alice Cudmore

    I liked your summary; it was useful as an introduction to speech recognition software, something I’m not totally familiar with.

    However, I did cringe at the term ‘voiceprint’, a term greatly frowned upon in my discipline, as voices are not unique entities at all and there is unfortunately no such thing.

    1. I’m afraid that’s the term used worldwide. What’s your discipline then??

      1. Alice Cudmore

        Perhaps unfortunately so. I’m only a postgraduate student in Forensic Speech Science, but I experience the amount of ‘plasticity’ prevalent in speech samples every day, drilling into me how individual voices are nowhere near unique. I wish they were; it would make everything a lot easier!

        As for the term itself, it seems unnecessarily adjacent to ‘fingerprint’, a concept which I’m sure you’ll agree is a much more rigid entity, arguably a fixed physical attribute as opposed to that of the flexibility of vocal organs and of course language itself.

        Thanks for replying!

      2. No, thank YOU for taking the time to explain (and object! :)). As a forensic scientist (even in the making), I’m sure you know more about voiceprints than we computational linguists and voice user interface designers, so I can’t win here. Still, it has to be said that usually those “voiceprints” are not used on their own or in isolation; rather, multiple criteria are taken into account, e.g. asking the user to repeat random sentences or letters or numbers, asking them for personal details and the answers to security questions. So the process as a whole can usually distinguish the frauds from the real users pretty well. The opposite is also true though. Your voice changes when you are under stress, in pain or have got a cold, so in a sense under those circumstances your “voiceprint” will be unique to you, i.e. not only does it reflect your physiology but also your emotional state at any one time. And if the change is extreme, there can even be false alarms or false negatives, i.e. discarding a user as a fraud just because their voice has changed so much because of a cold. It is a fine balance and there can never be 100% accuracy or certainty. Still, however politically incorrect the term “voiceprint” must be for you, I think it mirrors really well the process behind speaker identification and verification as used in speech interfaces. And yes, the parallels to fingerprints are intentional and quite effective in understanding how “voiceprints” work. I can see your point though!

  2. Jordan

    I like your description, but have a few comments.

    While it used to be true that speaker-dependent was local and independent non-local, more and more systems store speaker-specific training on the non-local server. So in effect, many of the non-local systems are speaker dependent (or more succinctly speaker adaptive).

    I think Google does most of its recognition in the cloud – might check to see if that’s true on Google maps. They don’t make it particularly easy to find out where things are done, and the networks are now fast enough to make the difference irrelevant.

    As to rule-based vs statistical, it’s a question of training and coverage. If your rule-based grammar actually covers the situation (digits, for instance), they can be excellent (“say your 16 digit credit card number…”). The statistical systems have as good coverage as their training data represents, so with a lot of data they do a better job.

    Regards

    Jordan

    1. I am really honoured that you read and commented on my blog post, Jordan.
      To a Speech veteran like yourself, my post must read as very simplistic. My purpose was in fact to present the basic technologies, trends and applications and that necessarily involves some simplistic generalisations. I am certain you would have written a much more erudite and technically detailed and sound article on the subject of Speech Recognition. That was not my goal with this introductory post, as you understand.

      To your comments:
      – You are right. I didn’t address the case of speaker adaptation as found in remotely hosted applications. This is a more specialised case but definitely worth a mention. Thanks.

      – You are probably right about Google maps too. A rather complex instance of device-integral / device-specific voice control.

      – Finally, I am all for rule-based grammars for well-specified domains and restricted applications, such as digit recognition. They can indeed work really well. And on the other hand, however wonderful statistical recognition can be, it hugely depends on the quantity and quality of the training data, so the performance can be less predictable / controllable. There are pros and cons to both methodologies.

      Many thanks again for reading my blog and for addressing some important points.

  3. Really well written introduction to this often over-complicated subject. I like the clear use of both technical and non-technical content, and the examples and references were well researched.

    Rgds

    Steve

    1. Thank you so much Steve! I feel really honoured by your kind comments!

  4. Hi Maria,

    Really a helpful post.

    We collect information like this and put it in our books and other publications. Let’s have a talk about working together for the next Voice Compass International.

    Detlev Artelt

    1. Many thanks for the kind comment Detlev and for the invitation to collaborate. I’ll get in touch with you about the next Voice Compass.

      Best regards from Manchester

  5. Many thanks, Christian! And thanks for the compliments! 😀

  6. Hi Maria,

    Excellent overview of the challenges and solutions in VUI design, a great post!

    1. Great explanation of the subject. Even though I was part of a team preparing lexical corpora and I’m familiar with this field, it’s really well and clearly written. Many thanks, Maria!

      1. Thank YOU for the kind comments, Miha! I’m glad you approve!
