Tag Archives: voice recognition

Voice control in Space!

20 Nov

I recently attended The Association for Conversational Interaction Design (ACIxD) Brown Bag “Challenges of Implementing Voice Control for Space Applications”, presented by the NASA authority in the field, George Salazar. George Salazar is Human Computing Interface Technical Discipline Lead at NASA, with over 30 years of experience and innovation in Space applications. Among a long list of achievements, he was involved in the development of the International Space Station internal audio system and has received several awards, including the John F. Kennedy Astronautics Award, a NASA Silver Achievement Medal and a Lifetime Achievement Award for his service and commitment to STEM. His acceptance speech for that last one brought tears to my eyes! An incredibly knowledgeable and experienced man with astounding modesty and willingness to pass his knowledge and passion on to younger generations.

George Salazar’s Acceptance Speech

Back to Voice Recognition.

Mr Salazar explained how space missions have slowly migrated over the years from ground control (with dozens of engineers involved) to vehicle control, and from just 50 buttons to hundreds. This put the onus of operating all those buttons on the 4-5-person space crew, which in turn brought in speech recognition as an invaluable interface that makes good sense in such a complex environment.

Screenshot from George Salazar’s ACIxD presentation

Factors affecting ASR accuracy in Space

He described how they have tested different Automatic Speech Recognition (ASR) software, both speaker-independent and speaker-dependent, to see which fared the best. As he noted, they all officially claim 99% accuracy, but that is never the case in practice! He listed many factors that affect recognition accuracy, including:

  • background noise (speaker vs background signal separation)
  • multiple speakers speaking simultaneously (esp. in such a noisy environment)
  • foreign accent recognition (e.g. Dutch crew speaking English)
  • intraspeaker speech variation due to psychological factors (being in space can, apparently, make you depressed, which in turn affects your voice!), but presumably also physiological ones (e.g. just having a cold)
  • astronaut gender (low pitch in males vs high pitch in females): the ASR software was designed around male voices, so male astronauts always had lower error rates!
  • the effects of microgravity (physiological effects) on voice quality, already observed on the first flight (using templates from ground testing as the baseline); these are impossible to separate from the effects of the environment and crew stress and can lead to a 10-30% error increase (see the word-error-rate sketch after this list)!
Screenshot from George Salazar’s ACIxD presentation

  • Even radiation can affect the ASR software, as well as the hardware (computing power). As a comparison, Amazon Alexa uses huge server farms, whereas in Space they rely on slow “radiation-hardened” processors: they can handle the radiation, but are actually 5-10 times slower than commercial processors!
Screenshot from George Salazar’s ACIxD presentation
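Those vendor figures only mean something relative to how recognition accuracy is actually measured. As a rough illustration (my own toy sketch, not NASA's evaluation code), accuracy is typically reported via word error rate (WER), the edit distance between the reference transcript and what the recogniser produced:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# "99% accuracy" on paper can degrade quickly in orbit:
print(word_error_rate("tilt camera two up", "tilt camera to up"))  # 0.25
```

A single substituted word in a four-word command already amounts to a 25% error rate, which is why the quoted lab figures evaporate so quickly under noise, stress and microgravity.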

Solutions to Space Challenges

To counter all these negative factors, a few different approaches and methodologies have been employed:

  • on-orbit retrain capability: rendering the system adaptive to changes in voice and background noise, resulting in up to 100% accuracy
  • macro-commanding: creating shortcuts for more complex command sequences (see the sketch after this list)
  • redundancy as fallback (i.e. pressing a button as a second modality)
Screenshot from George Salazar’s ACIxD presentation
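Macro-commanding is easy to picture: one spoken shortcut expands into a canned sequence of atomic commands, so the astronaut says less and the recogniser has fewer chances to get it wrong. Here is a purely hypothetical sketch (the command names are my own invention, not NASA's vocabulary):

```python
# Hypothetical macro table: one spoken shortcut -> several atomic commands.
MACROS = {
    "camera sweep": ["pan left", "pan right", "tilt up", "tilt down"],
    "lights out":   ["dim lights", "switch off lights"],
}

def expand(utterance: str) -> list:
    """Return the command sequence to execute for a recognised utterance."""
    return MACROS.get(utterance, [utterance])  # non-macros pass through unchanged

print(expand("camera sweep"))  # ['pan left', 'pan right', 'tilt up', 'tilt down']
print(expand("zoom in"))       # ['zoom in']
```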

Critical considerations

One of the challenges Mr Salazar mentioned in improving ASR accuracy is overadaptation, i.e. skewing the system towards a single astronaut.

In addition, he mentioned the importance of Dialog Design in NASA’s human-centred design (HCD) development approach. The astronauts should always be able to provide feedback to the system, particularly for error correction (confusability leads to misrecognitions).
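A common way to give users that feedback channel is to drive an explicit confirmation step from the recogniser's confidence score. The loop below is a generic, hypothetical sketch of the idea, not NASA's actual dialog design:

```python
def handle(hypothesis: str, confidence: float, ask):
    """Accept, confirm, or reject a recognition result.

    `ask` is a callback that poses a yes/no question to the user
    (e.g. "Did you say 'pan left'?") and returns True or False.
    """
    if confidence >= 0.90:
        return hypothesis                # confident: execute directly
    if confidence >= 0.50:
        return hypothesis if ask(f"Did you say '{hypothesis}'?") else None
    return None                          # too uncertain: ask the user to repeat

# Example: a mid-confidence result triggers an explicit confirmation.
result = handle("tilt up", 0.62, ask=lambda question: True)
print(result)  # 'tilt up'
```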

Screenshot from George Salazar’s ACIxD presentation

In closing, Mr Salazar stressed that speech recognition for Command and Control in Space applications is viable, especially in the context of a small crew navigating a complex habitat.

Moreover, he underlined the importance of the trust that the ASR system needs to inspire in its users, as in this case the astronauts may literally be staking their lives on its performance and accuracy.

Screenshot from George Salazar’s ACIxD presentation

Q & A

After Mr Salazar’s presentation, I couldn’t help but pose a couple of questions to him, given that I consider myself to be a Space junkie (and not in the sci-fi franchise sense either!).

So, I asked him to give us a few examples of the type of astronaut utterances and commands that their ASR needs to be able to recognise. Below are some such phrases:

  • zoom in, zoom out
  • tilt up, tilt down
  • pan left
  • please repeat

and their synonyms. He also mentioned the case of one astronaut who kept saying “Wow!” (How do you deal with that?!)

I asked whether the system ever had to deal with ambiguity in trying to determine which component to tilt, pan or zoom. He answered that, although they do carry out plenty of confusability studies, the context is quite deterministic: the astronaut selects the monitor by going to the monitor section and speaking the associated command. Thus, there is no real ambiguity as such.
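That deterministic context can be modelled very simply: only the commands of the currently selected section are active, so the recogniser never has to work out which device “tilt up” refers to. A hypothetical sketch (section and command names invented for illustration):

```python
# Hypothetical, context-scoped vocabulary: only the commands of the
# currently selected section are active, so "tilt up" is never ambiguous.
SECTIONS = {
    "monitor 1": {"zoom in", "zoom out", "tilt up", "tilt down", "pan left", "pan right"},
    "lighting":  {"lights on", "lights off", "dim lights"},
}

class CommandContext:
    def __init__(self):
        self.active = None

    def select(self, section: str):
        self.active = section

    def is_valid(self, command: str) -> bool:
        return self.active is not None and command in SECTIONS[self.active]

ctx = CommandContext()
ctx.select("monitor 1")
print(ctx.is_valid("tilt up"))    # True
print(ctx.is_valid("lights on"))  # False: outside the active context
```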

Screenshot from George Salazar’s ACIxD presentation

My second question to Mr Salazar was about the type of ASR they have gone for. I understood that the vocabulary is small and contained / unambiguous, but wasn’t sure whether they went for speaker-dependent or speaker-independent recognition in the end. He replied that the standard now is speaker-independent ASR, which, however, has been adapted to a small group of astronauts (i.e. “group-dependent”). Hence all the challenges of distinguishing between different speakers with different pitches and accents, all against the background noise and the radiation and microgravity effects! They must be really busy!

It was a great pleasure to listen to the talk and an incredible and rare honour to get to speak with such an awe-inspiring pioneer in Space Engineering!

Ad Astra!

UBIQUITOUS VOICE: Essays from the Field now on Kindle!

14 Oct

In 2018, a new book on “Voice First” came out on Amazon and I was proud and deeply honoured, as it includes one of my articles! Now it has come out on Kindle as an e-Book and we are even more excited at the prospect of a much wider reach!

“Ubiquitous Voice: Essays from the Field”: Thoughts, insights and anecdotes on Speech Recognition, Voice User Interfaces, Voice Assistants, Conversational Intelligence, VUI Design, Voice UX issues, solutions, Best practices and visions from the veterans!

I have been part of this effort since its inception, working alongside some of the pioneers in the field who now represent the market leaders (Google, Amazon, Nuance, Samsung VIV ...). Excellent job by our tireless and intrepid Editor, Lisa Falkson!

My contribution “Convenience + Security = Trust: Do you trust your Intelligent Assistant?” is on data privacy concerns and social issues associated with the widespread adoption of voice activation. It is thus platform-, ASR-, vendor- and company-agnostic.

You can get the physical book here and the Kindle version here.

Prepare to be enlightened, guided and inspired!

An Amazon Echo in every hotel room?

16 Dec

The Wynn Las Vegas Hotel just announced that it will be installing the Amazon Echo device in every one of its 4,748 guest rooms by Summer 2017. Apparently, hotel guests will be able to use Echo, Amazon’s hands-free voice-controlled speaker, to control room lights, temperature, and drapery, but also some TV functions.

 

CEO Steve Wynn:  “I have never, ever seen anything that was more intuitively dead-on to making a guest experience seamlessly delicious, effortlessly convenient than the ability to talk to your room and say .. ‘Alexa, I’m here, open the curtains, … lower the temperature, … turn on the news.‘ She becomes our butler, at the service of each of our guests”.

 

The announcement does, however, also raise security concerns. The Alexa device is always listening, at least for the “wake word”. This is, of course, necessary for it to work when you actually need it. It needs to know when it is being “addressed” to start recognising what you say and hopefully act on it afterwards. Interestingly, though, according to the Alexa FAQ:

 

When these devices detect the wake word, they stream audio to the cloud, including a fraction of a second of audio before the wake word.

That could get embarrassing or even dangerous, especially if the “wake word” was actually a “false alarm”, i.e. something the guest happened to say to someone else in the room that merely sounded like the wake word.
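Technically, capturing audio from before the wake word is straightforward: always-listening devices typically keep a short rolling buffer of recent audio, and once the wake word fires, that pre-roll is streamed along with whatever follows. A generic sketch of the idea (not Amazon's actual implementation):

```python
from collections import deque

SAMPLE_RATE = 16_000          # samples per second (assumed)
PRE_ROLL_SECONDS = 0.5        # keep roughly the last half second of audio

class PreRollBuffer:
    """Rolling buffer holding only the most recent audio samples."""
    def __init__(self):
        self.buffer = deque(maxlen=int(SAMPLE_RATE * PRE_ROLL_SECONDS))

    def push(self, samples):
        self.buffer.extend(samples)   # older samples fall off the left end

    def on_wake_word(self):
        # The pre-roll, plus everything captured after it, is what gets streamed.
        return list(self.buffer)

buf = PreRollBuffer()
buf.push([0] * 16_000)               # one second of (silent) audio
print(len(buf.on_wake_word()))       # 8000 samples = the last 0.5 s
```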

All commands are saved in the device’s History. The question is: will the hotel automatically wipe the device’s history once a guest has checked out? Or at least before the next guest arrives in the room! Could every guest perhaps have access to their own history of commands, so that they can delete it themselves just before check-out? These are crucial security aspects that the hotel needs to consider, because it would be a shame for this seamlessly delicious and effortlessly convenient experience to be cut short by paranoid guests switching the Echo off as soon as they enter the room!

Meet META, the Meta-cognitive skills Training Avatar!

16 Jun

METALOGUE logo

EU FP7 logo

 

Since November 2013, I’ve had the opportunity to participate in the EU-funded FP7 R & D project, METALOGUE, through my company DialogCONNECTION Ltd, one of 10 Consortium Partners. The project aims to develop a natural, flexible, and interactive Multi-perspective and Multi-modal Dialogue system with meta-cognitive abilities; a system that can:

  • monitor, reason about, and provide feedback on its own behaviour, intentions and strategies, and the dialogue itself,
  • guess the intentions of its interlocutor,
  • and accordingly plan the next step in the dialogue.

The system tries to dynamically adapt both its strategy and behaviour (speech and non-verbal aspects) in order to influence the dialogue partner’s reaction, and, as a result, the progress of the dialogue over time, and thereby also achieve its own goals in the most advantageous way for both sides.
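In dialogue-system terms, this amounts to a manager that not only picks the next move, but also monitors how well its current strategy is working and switches strategy when progress stalls. The snippet below is my own highly simplified, hypothetical sketch of such a loop, not the actual METALOGUE architecture:

```python
class MetaCognitiveDialogueManager:
    """Toy sketch: pick a move, monitor progress, adapt the strategy."""

    def __init__(self):
        self.strategy = "cooperative"
        self.history = []

    def guess_partner_intention(self, partner_move: str) -> str:
        # Crude stand-in for intention recognition.
        return "concede" if "agree" in partner_move else "hold firm"

    def monitor(self):
        # Meta-cognition: if the last few exchanges made no progress,
        # switch to a different negotiation strategy.
        if self.history[-3:].count("hold firm") == 3:
            self.strategy = "competitive"

    def next_move(self, partner_move: str) -> str:
        intention = self.guess_partner_intention(partner_move)
        self.history.append(intention)
        self.monitor()
        if intention == "concede":
            return "accept offer"
        return "make counter-offer" if self.strategy == "competitive" else "restate interests"

dm = MetaCognitiveDialogueManager()
for move in ["no ban", "still no", "absolutely not", "fine, I agree to a partial ban"]:
    reply = dm.next_move(move)
    print(dm.strategy, "->", reply)
```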

The project is in its 3rd and final year (ending in Oct 2016) and has a budget of € 3,749,000 (EU contribution: € 2,971,000). METALOGUE brings together 10 Academic and Industry partners from 5 EU countries (Germany, Netherlands, Greece, Ireland, and UK).

 

METALOGUE focuses on interactive and adaptive training situations, where negotiation skills play a key role in the decision-making processes. Reusable and customisable software components and algorithms have been developed, tested and integrated into a prototype platform, which provides learners with a rich and interactive environment that motivates them to develop meta-cognitive skills, by stimulating creativity and responsibility in the decision-making, argumentation, and negotiation process. The project is producing a virtual trainer, META, a Training Avatar capable of engaging in natural interaction in English (currently, with the addition of German and Greek in the future), using gestures, facial expressions, and body language.

METALOGUE Avatar

Pilot systems have been developed for 2 different user scenarios: a) debating and b) negotiation, both tested and evaluated by English-speaking students at the Hellenic Youth Parliament. We are currently targeting various industry verticals, in particular Call Centres, e.g. to semi-automate and enhance Call Centre Agent Training.

 

And here’s META in action!

 

In this video, our full-body METALOGUE Avatar is playing the role of a business owner, who is negotiating a smoking ban with a local Government Counsellor.   Still imperfect (e.g. there is some slight latency before replying – and an embarrassing repetition at some point!), but you can also see the realistic facial expressions, gaze, gestures, and body language, and even selective and effective pauses. It can process natural spontaneous speech in a pre-specified domain (smoking ban, in this case) and it has reached an ASR error rate below 24% (down from almost 50% 2 years ago!). The idea is to use such an Avatar in Call Centres to provide extra training support on top of existing training courses and workshops. It’s not about replacing the human trainer, but rather empowering and motivating Call Centre Trainee Agents who are trying to learn how to read their callers and how to successfully negotiate deals and even complaints with them in an optimal way.


 

My company, DialogCONNECTION, is charged with the task of attracting interest and feedback from industry to gauge the relevance and effectiveness of the METALOGUE approach in employee training contexts (esp. negotiation and decision-making). We are looking in particular for Call Centres; both small and agile (serving multiple small clients) and large (and probably plagued by the well-known agent burn-out syndrome). Ideally, you would give us access to real-world Call Centre Agent-Caller/Customer recordings or even simulated Trainer-Trainee phone calls that are used for situational Agent training (either already available or collected specifically for the project). A total of just 15 hours of audio (and video, if available) would suffice to train the METALOGUE speech recognisers and the associated acoustic and language models, as well as its metacognitive models.

However, if you don’t want to commit your organisation’s data, any type of input and feedback would make us happy! As an innovative pioneering research project, we really need guidance, evaluation and any input from the real world of industry! So, if we have sparked your interest in any way and you want to get involved and give it a spin, please get in touch!

The 2015 stats are in!

20 Jan

The WordPress.com stats monkeys prepared an annual report for this blog.

Top blog posts in 2015 were “A.I.: from Sci-Fi to Science reality”, the ever-popular older “Speech Recognition for Dummies”, and the classic “Voice-activated lift won’t do Scottish! (Burnistoun S1E1 – ELEVEN!)”.

Scottish Elevator – Voice Recognition – ELEVEN!

(YouTube – Burnistoun – Series 1 , Episode 1 [ Part 1/3 ])

Voice recognition technology? …  In a lift? … In Scotland? … You ever TRIED voice recognition technology? It don’t do Scottish accents!

🙂

 

So, we had 4,500 unique visitors in 2015! Thank you!

A New York City subway train holds 1,200 people. This blog was viewed about 4,500 times in 2015. If it were a NYC subway train, it would take about 4 trips to carry that many people.

Check out some more stats in the full WordPress report.

Happy 2016! 🙂

Develop your own Android voice app!

26 Dec

Voice application Development for Android

My colleague Michael F. McTear has got a new and very topical book out! Voice Application Development for Android, co-authored with Zoraida Callejas. Apart from a hands-on step-by-step but still condensed guide to voice application development, you get the source code to develop your own Android apps for free!

Get the book here or through Amazon. And have a look at the source code here.

Exciting times ahead for do-it-yourself Android speech app development!

The AVIxD 49 VUI Tips in 45 Minutes !

6 Nov


 

 

The illustrious Association for Voice Interaction Design (AVIxD) organised a Workshop in the context of SpeechTEK in August 2010, whose goal was “to provide VUI designers with as many tips as possible during the session”. Initially the goal was 30 Tips in 45 minutes, but they got overexcited and came up with a whopping 49 Tips in the end! The session was moderated by Jenni McKienzie, and the panelists were David Attwater, Jon Bloom, Karen Kaushansky, and Julie Underdahl. The list dates back 3 years now, but it’s by no means outdated. This is the soundest advice you will find for designing better voice recognition IVRs, and I hated it being buried in a PDF!

So I am audaciously plagiarising and bringing you here: the 49 VUI Tips for Better Voice User Interface Design! Or go and read the .PDF yourselves here:

[Images: the 49 VUI Tips, as captured from the AVIxD PDF]

 

Have you got a VUI Tip you can’t find in this list that you’d like to share? Tell us here!

 

XOWi: The wearable Voice Recognition Personal Assistant

30 Oct

I just found out about the new venture of my colleagues, Ahmed Bouzid and Weiye Ma, and I’m all excited and want to spread the word!

They came up with the idea of a Wearable and hence Ubiquitous Personal Voice Assistant, XOWi (pronounced Zoe). The basic concept is that XOWi is small and unobtrusive (you wear it like a badge or pin it somewhere near you) but still connects to your smartphone and, through that, to all kinds of apps and websites for communicating with people (Facebook, Twitter, Ebay) and controlling data and information (selecting TV channels, switching the aircon on). Moreover, it is completely voice-driven, so it is completely hands- and eyes-free. This means that it won’t distract you (if you’re driving, reading, working), and if you have a vision impairment or disability, you can still stay completely connected and communicate. So, XOWi truly turns Star Trek into reality! The video below explains the concept:


The type of application context is exemplified by the following diagram.

XOWi architecture

And here is how it works:

Ahmed and Weiye have turned to Kickstarter for crowdfunding. If they manage to raise $100,000 by 21st November, XOWi will become a product and I will get one for my birthday in March 2014! 😀 Join the innovators and support the next generation of smart communicators!

2012 in review – Not Bad! :)

9 Jan

It’s nice, flashy, and insightful, so I had to repost! 🙂 

Apparently, my Top 5 posts with the most views in 2012 were:

  1. The voice-activated lift won’t do Scottish! (Burnistoun S1E1 – ELEVEN!) – July 2010 (8 comments)
  2. Speech Recognition for Dummies – May 2010 (20 comments)
  3. TEDxSalford (28 Jan 2012): 10 hours of mind-blowing inspiration – March 2012 (2 comments)
  4. TEDxManchester (13 Feb 2012): Best of! – May 2012 (1 comment)
  5. What to do after a PhD? (University of Manchester Pathways 2011) – June 2011 (0 comments)

Happy New Year! 🙂

The WordPress.com stats helper monkeys prepared a 2012 annual report for this blog.

Here’s an excerpt:

600 people reached the top of Mt. Everest in 2012. This blog got about 6,900 views in 2012. If every person who reached the top of Mt. Everest viewed this blog, it would have taken 12 years to get that many views.

Click here to see the complete report.

TEDxManchester 2012: Voice Recognition FTW!

12 Sep

After the extensive TEDxSalford report, and the TEDxManchester Best-of, it’s about time I posted the YouTube video of my TEDxManchester talk!

TEDxManchester took place on Monday 13th February this year at one of the iconic Manchester locations – and my “local” – the Cornerhouse. Among the luminary speakers were people I have always admired, such as radio Goddess Mary Anne Hobbs, and people I have become very close friends with over the years – and have come to admire just as much – such as Ian Forrester (@cubicgarden to most of us). You can check out their respective talks, as well as some awesome others, in my TEDxManchester report below.

My TEDxManchester talk

I spoke about the weird and wonderful world of Voice Recognition (“Voice Recognition FTW!”): from the inaccurate – and far too often funny – simple voice-to-text apps and dictation systems on your smartphone, to the most frustrating automated Call Centres, to the next-generation, sophisticated SIRI, and everything in between. I explained why things go wrong and when things can go wonderfully right. The answer is “CONTEXT”: the more of it you have, the more accurate and relevant the interpretation of the user’s intention will be, and the more relevant and impressive the system’s reaction / reply will be.
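One way to picture what context does is as a prior that re-ranks the recogniser's competing hypotheses: the same audio ends up with a different best interpretation depending on what the system knows about the situation. A toy, hypothetical illustration:

```python
# Toy illustration: context acts as a prior that re-ranks ASR hypotheses.
hypotheses = [("recognise speech", 0.48), ("wreck a nice beach", 0.52)]

def rerank(hyps, context_words):
    def score(item):
        text, acoustic_score = item
        overlap = len(set(text.split()) & context_words)
        return acoustic_score + 0.2 * overlap   # boost in-context hypotheses
    return max(hyps, key=score)

# With no context, the acoustically better (but wrong) string wins;
# with topical context, the intended phrase comes out on top.
print(rerank(hypotheses, set()))                           # ('wreck a nice beach', 0.52)
print(rerank(hypotheses, {"speech", "recognise", "asr"}))  # ('recognise speech', 0.48)
```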

Here, finally, is my TEDxManchester video on YouTube.

And below are my TEDxManchester slides.