Explainable VUI Design

2 Feb

I was recently invited by FORD’s Shyamala Prayaga to talk about “Explainable VUI Design” on her Digital Assistant Academy podcast “The Future is Spoken”, and I was of course delighted to oblige! You can find our (audio and video) conversation on YouTube, Google Podcasts or wherever you get your podcasts, as well as on the DA Academy website.

There was a live LinkedIn event premiering the podcast on Tuesday 2nd February at 2pm UK / 9am ET. One of the participants, Ashish Handa, amazingly made – on the fly – incredibly detailed and comprehensive and yet very clear and insightful drawings of what we were discussing. You can find all 3 of them below. I am so impressed and honoured by his creativity and dedication! 🙏🙏 A true Master! 👏👏👏👏👏


Below are some highlights from our conversation taken from the Podcast Episode Notes:

[00:25] From Linguistics to NLP to VUI Design

  • Maria originally studied Linguistics and English Literature in Greece, so she naturally wanted to go and study and work in the UK.
  • She was looking for Sponsorship for her Masters, when she bumped into the field of Machine Translation and NLP.
  • During her PhD studies in 1993, she discovered the world of Artificial Neural Networks and got fascinated by their potential and decided to apply them to Automatic Text Summarisation.
  • It was through a Post-Doc in Spoken Dialogue Management for Speech IVRs that she got into the world of Voice and Speech Recognition back in 1996. From then on, she has been a VUI Designer!

[07:20] SIRI comes into play!

  • Maria explains how the iPhone was like having a full computer in our pockets and how SIRI was the beginning of a new era, making Speech Recognition and Voice mainstream.
  • She feels very proud about the Voice field, which she considers like her “baby” growing up to be an adult!

[09:40] Explainable VUI Design

  • Maria coined the term “Explainable VUI” in 2019 amid the myriad of Voice applications and Voice Assistant skills / actions / capsules designed by programmers or marketing people.
  • “Explainable VUI” means to design a Human-Computer interface bearing in mind both the complexities and imperfections of human language and the limitations of the technology (ASR / speech recognition / NLP).
  • A lot of her work with various companies and organisations creating VUI Designs from scratch or reviewing existing ones is carefully crafting system prompts.
  • She stressed the importance of knowing how the background technology works.

[20:23] Balancing UX with discoverability of new features

[22:16] How many menu options can a user take?

  • Voice helps the user to quickly bypass a big chunk of the menu tree, if they know what they want, especially power users. Menu options are for newbies or people who are not sure what to ask for.

[24:05] Challenges of VUI vs GUI

  • Maria elaborates on the different types of challenges that Voice interfaces have compared to graphical interfaces.
  • Voice can empower and engage many more disenfranchised people, if designed right.
  • She also explains how designers can address inclusivity in their conversation design, without necessarily showcasing the shortcomings of the interface.

[39:17] The importance of Data

  • Maria emphasises the need for a data-driven approach in Conversational Design.
  • Conversational designers certainly need to spend a lot of time thinking about the dialogues, but there are so many other factors to consider in designing a good dialogue system that enjoys high acceptability and adoption rates.
  • Designing dialogue flows for Voice or Chat is not the same as writing catchy mottos or gripping stories.

[47:20] Designing for Listenability rather than Readability

  • Chatbots vs Voicebots and Voice Assistants
  • Maria elaborates on how to ensure designing for listeners rather than for readers.
  • She stresses the need to provide concise but also clear and unambiguous information.
  • Voice input presents a great deal more challenges than Chat input, which is far less ambiguous.

 [52:14] Must Listen

  • Maria’s advice for aspiring Conversational Designers and people new to this field on how to start, learn, get ahead, and flourish.

You can find the full transcript here.


The WITLINGO Voice First Channel

27 Jan

WITLINGO has launched the Voice First Channel on their website and on YouTube, featuring bitesized interviews with people working in Voice, both Veterans and Newbies.

A couple of days ago I contributed some thoughts myself and was later proud to be called a “Voice First OG”! 🙂 You can find my recording here and the transcript below:

Hi, I’m Dr Maria Aretoulaki, CEO of DialogCONNECTION, a UK-based Consultancy that I founded 12 years ago, specialising in Voice AI, Conversational AI, VUI Design & Conversational Experience Design.

I started working in Voice actually 25 years ago, first in Academia with Spoken Dialogue Management, then in Industry and Start-ups with VUI Design – mainly for Speech IVRs for Call Centre Automation. In the past couple of years, I’ve also designed the first German Bixby Capsules for Samsung, as well as some Chatbots for telecoms.

I feel proud that Voice has finally gone mainstream after decades of obscurity. For 20 years, I would be met with confusion every time I told someone what I did for a living. VUI Designers were definitely a niche profession! The advent of Voice Assistants certainly shot Voice Design to prominence and Speech Recognition into everyone’s pockets!

I am currently fascinated by the new possibilities that context and personalisation bring to Voice experiences, and I’m interested in how massive amounts of personal data are used to better understand what users say (Speech Analytics), but also predict what they want or need. Tightly coupled with this, is of course my interest in Ethics: how we get the user’s consent, how we protect user data, how we prevent bias in the data we collect and how we prevent nefarious uses of the data. It’s a fine balancing act!

DialogCONNECTION offers what I call Explainable VUI Design, that is, Conversational Experiences that take into account both the technology (so, how Speech Recognition / ASR works) and how real users speak and type (i.e. Linguistics). Hint: they are both imperfect!

You can find me at dialogconnection.com (dialog spelt the American way, so without the -ue at the end), and you can also find me on Twitter, LinkedIn and several Voice Events and Podcasts.

Until then, have fun! 🙂

You, too, can contribute your thoughts here.

The unbearable Lightness of Voice First

6 Jan

Ahmed Bouzid wrote an interesting post on frequent unrealistic expectations from Voice First (“Voice First Sucks!”) and has started another one listing meaningful Voice First Use Cases. In that second post, he lists 6 dimensions that need to be considered when assessing the effectiveness and acceptability of Use Cases for Voice / Speech.

I was kindly invited to comment and I jumped at the opportunity 🙂

Voice First Use Cases - The 6 Dimensions

Like Ahmed, I, too, find the lack of knowledge among far too many people in the Voice First Community of how ASR / Speech Recognition actually works frustrating (if not criminal!). I go on about this here.

Voice / Speech is an interface of its own and yes, it is serial and invisible (and often relentless). Nevertheless, there are already plenty of use cases where it can help avoid taxing the user’s memory (e.g. not having to list all options upfront and expect the user to remember them all and choose 1) and, if the ASR has been trained well (i.e. on sufficiently large amounts of real-world and representative data), the user doesn’t even have to speak loudly or enunciate clearly. Heck, the user doesn’t even need to be patient, if they are allowed to barge in and interrupt the system’s prompts / instructions / factoids. And thankfully, Voice interfaces won’t get offended if you ask them to repeat something 2, 3, 10 times, so users don’t have to be that focused either. Good Voice interfaces, that is 🙂

Time Sensitivity is a bastard though! You pause for too long or at the wrong point in your utterance and you may find yourself down a path you hadn’t envisaged or wanted (Enter misrecognitions). But even that can be modulated by playing around with the various timeout settings (if you have access to them!) and ensuring there are enough implicit confirmations of what you want at critical points.

This is all part of Voice User Interface Design (or, as I call it, Explainable VUI Design); decisions you have to take early on, before the detailed design and certainly before you launch to millions of users.
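
To make the timeout and implicit-confirmation points a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration only: the setting names (no_input_ms, end_of_speech_ms), their default values and the helper functions are hypothetical and do not come from any particular ASR platform, which will name and scale these parameters differently.

```python
from dataclasses import dataclass


@dataclass
class TimeoutSettings:
    """Hypothetical timeout knobs; real ASR platforms name and scale these differently."""
    no_input_ms: int = 5000       # how long to wait for the user to start speaking
    end_of_speech_ms: int = 800   # length of silence that ends the utterance


def utterance_is_complete(trailing_silence_ms: int, settings: TimeoutSettings) -> bool:
    """Treat the utterance as finished once the pause reaches the end-of-speech timeout."""
    return trailing_silence_ms >= settings.end_of_speech_ms


def implicit_confirmation(understood: str, next_question: str) -> str:
    """Echo what was understood inside the next prompt, so the user can correct it by barging in."""
    return f"OK, {understood}. {next_question}"
```

With the default 800 ms setting, a one-second mid-sentence pause would cut the user off; raising end_of_speech_ms to 1500 would let them carry on. The implicit confirmation then gives them a cheap chance to catch a misrecognition before the dialogue moves down the wrong path.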

Here is the link to Ahmed’s LinkedIn post, where you can read other people’s comments.

AI Thought Leaders choose books that have impacted them

23 Sep

A few days ago Richard Foster-Fletcher 🌎 asked 15 AI leaders and influencers to choose a recent book that has impacted them the most. The answers ranged from books on Machine Learning coding, to AI Ethics and the future of AI, and even some visionary fiction.

Here is my own choice: Max Tegmark’s Life 3.0 (which, by chance, appeared 2 more times in this list!).

My book choice: Max Tegmark’s Life 3.0

If I were to name a book that deeply impacted me from any decade, it would be very difficult. Nevertheless, the following 3 books from 1991 – 1993 spring to mind:

Connectionist Approaches to Natural Language Processing


Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning and Expert Systems


Subsymbolic Natural Language Processing: An Integrated Model of Scripts, Lexicon and Memory


Thank you to all the others who contributed to Richard’s list, as they have definitely enriched my own reading wishlist!

Dr. Andree Bates, Giuseppe Bonaccorso, ∞ Ravit Jain ∞, Dr Ligia Catherine Arias-Barrera, Tonii Leach, delphine nyaboke, Danny Ma, Monika Sheoran Sangwan, Imtiaz Adam, Karen Silverman, SCOTT TAYLOR – The Data Whisperer, Barbara Fusinska, Dr. KaT Zarychta, Thom Ives, Ph.D.

How to Design & Optimise Conversational AI Dialogue Systems

26 Jun
Data Futurology EP122 with Dr Maria Aretoulaki

The latest episode of Data Futurology features me talking with Felipe Flores about how I got into AI, Machine Learning and Deep Learning (“When I discovered Artificial Neural Networks back in 1993, I thought I had discovered God!”) and later Data Science and Big Data Speech Analytics for Voice User Interface (VUI) and Voice First Design (for Speech IVRs and Voice Assistants).

In the podcast, I give an overview of all the different steps involved in VUI Design, the often tricky interaction with the different stakeholders (Coders, Business Units, Marketeers) and the challenges of working in an Agile PM framework, when skipping detailed Voice Design is suicide (and my worst nightmare!). I showcase some of the interesting and outright funny things you discover when you analyse real-world human-machine dialogues taken from Speech IVR systems (hint: the most original swearwords!) and I pinpoint the main differences between a Speech IVR and a Voice Assistant skill / action / capsule. I also contemplate where Voice is heading now with the ubiquity of Voice First and the prevalence of skills that were developed by software engineers, web designers and Marketing copywriters, without VUI Design expertise, Linguistics Training or knowledge of how Speech Recognition works.

Data Futurology Podcast EP122 – Optimising Conversational AI Dialogue Systems with Dr. Maria Aretoulaki
Episode list of contents

I even provide tips on how to survive working for a Start-up and strategies on how to stay focused and strong when you run your own SME Business. Below is a brief checklist of some of the core qualities required and lessons I have learned working for Start-ups:


You can listen to the whole episode on Data Futurology, on Apple Podcasts, on Google Podcasts, on Spotify, or wherever you get your Podcasts, or you can watch it on YouTube:

#122 – Optimising Conversational AI Dialogue Systems with Dr. Maria Aretoulaki

It was a delight speaking to Felipe! It made me think about and consolidate some of the – occasionally hard – lessons I have learned along the way about Voice Design, NLP, AI, ML and running a Business, so that others hopefully prepare for and ideally also avoid any heartaches!


The immortality of Data

10 Jun
Boundless Podcast Episode EP48

The latest Boundless Podcast Episode is out! It features me in a deep conversation with Richard Foster-Fletcher about Big Data and Speech Analytics for Voice First & Voice Assistants (but not only), BigTech, AI Ethics and the need for a new legal framework for Responsible AI and Explainable Machine Learning. Below is a snippet of the conversation:

Boundless Podcast Episode EP48 – snippet

“My data is being used and exploited, and I can do nothing about it. We need to modernise the legal system. Apart from all the ethical, moral discussions that need to be made, we need a legal system that takes into consideration the fact that intelligence doesn’t need to be visible to be acting against me.”

I wouldn’t call myself a technophobe. Quite the opposite. I was learning how to program (BASIC!) back in 1986 in the Summer after getting my degree in English Language & Literature; I was teaching computers how to translate between human languages and chatting online with mainframes in University buildings 2 kms away back in 1991; I was programming Artificial Neural Networks and using Parallel computers back in 1993; I was reading internet newsgroups and downloading music – very much legally! – on a web browser (Netscape!) in 1994; I was designing Voice interfaces already in 1996 and voice-controlled home assistants back in 1998. I have even been using LinkedIn since 2005.

Yet, I am very sceptical and pessimistic about our uninhibited sharing of personal data and sensitive information all day every day on multiple channels, to multiple audiences, much of it willingly, much more unwillingly, in the name of sharing and human connection, service and product personalisation and ultimately, far too often, monetisation.

What will that mean for our legacy as individuals? Who will own, control and curate all the generated data after our death? Who can clone us and why? Will there be a second life after death? Will there ever be freedom from a dictator or will there ever be any point in bequeathing anything in a will? These and many more questions are discussed in this episode. I had lots of fun recording this! Thank you so much to Richard for creating this 🙏

You can listen to a snippet of the conversation here (Download)


You can listen to the full episode here or wherever you get your podcasts.

Alternatively, you can listen to it on YouTube (in 2 parts):

Part 1

Boundless EP48 – Part 1

And Part 2

Boundless EP48 – Part 2

#BigData #BigTech #SpeechAnalytics #VoiceFirst #VoiceAssistants #Alexaskills #GoogleAssistant #Bixby #AIethics #responsibleAI #explainableVUI #AI #ArtificialIntelligence #MachineLearning #ML #DeepLearning #ANNs

‘statistics ≠ understanding’

7 May
‘statistics ≠ understanding’

I recently read an article on a new approach to Common Sense understanding, which uses a combination of traditional, Good Old-Fashioned AI (GOFAI) symbolic and the latest data-intensive Machine Learning (ML) / Deep Learning neural network approaches to deal with the hard problem of human reasoning. Here’s a link to the article (with thanks to Phillip Hunter for the pointer!):

My favourite quote from the article is:

‘statistics ≠ understanding’

That’s because (another favourite quote):

“common sense, like natural language, remains fundamentally fuzzy”

I was delighted to read about this research, especially because almost 30 years ago, when I was doing my PhD at the University of Manchester, I, too, realised that the only promising way to capture this fuzziness, ambiguity and complexity of language and meaning is through a hybrid approach, combining hand-crafted “rules” (human annotations, i.e. symbolic processing) with the automatic weight distribution and semi-supervised learning of a neural network (connectionist processing).

My PhD Thesis

Thus, I used text annotations generated by humans, which encoded morphosyntactic / grammatical, lexical-semantic and discourse pragmatic features of each sentence in a news article.

sentence annotations with discourse pragmatic features

I would then feed them into a basic feed-forward backpropagation neural network (ANN) that would calculate the degree of “importance” of each sentence in the whole article and generate a YES or NO answer to the question whether that specific sentence would be included (not necessarily verbatim) in the final summary of that news article.

ANN decides the degree of importance of a sentence in a summary

It was a neat idea, very imperfectly executed: the data set was not that large by today’s standards (1,100 sentences representing 55 news articles), and the ANN had barely 3 layers, with a single hidden layer of only 30 units (so very skin-deep learning!).
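
For readers who have never seen such a network, here is a minimal pure-Python sketch of the forward pass of a 3-layer feed-forward network with a single hidden layer of 30 units, turning a sentence’s feature vector into a YES / NO decision. It is not the thesis code: the number of input features (12), the sigmoid activation, the 0.5 threshold and the random, untrained weights are all assumptions for illustration, and the backpropagation training step is left out entirely.

```python
import math
import random

N_FEATURES = 12   # assumption: number of annotation features per sentence
N_HIDDEN = 30     # single hidden layer of 30 units, as described above


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def init_weights(n_in: int, n_out: int, rng: random.Random) -> list[list[float]]:
    """One row of small random weights per unit; the last entry of each row is the bias."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]


def layer(inputs: list[float], weights: list[list[float]]) -> list[float]:
    """Weighted sum plus bias for each unit, squashed by a sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + row[-1]) for row in weights]


def sentence_importance(features: list[float], w_hidden, w_out) -> float:
    """Forward pass: input features -> 30 hidden units -> one importance score in (0, 1)."""
    return layer(layer(features, w_hidden), w_out)[0]


def include_in_summary(features: list[float], w_hidden, w_out, threshold: float = 0.5) -> bool:
    """YES / NO decision: does this sentence belong in the summary?"""
    return sentence_importance(features, w_hidden, w_out) >= threshold


rng = random.Random(0)                     # fixed seed, so the sketch is repeatable
w_hidden = init_weights(N_FEATURES, N_HIDDEN, rng)
w_out = init_weights(N_HIDDEN, 1, rng)
example = [1.0, 0.0] * (N_FEATURES // 2)   # made-up binary feature vector
score = sentence_importance(example, w_hidden, w_out)
```

With untrained weights the score is meaningless, of course; training would adjust the weights so that sentences annotated as summary-worthy score above the threshold.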

You can find my PhD thesis as a PDF below:

My PhD thesis only scratched the surface. It’s awesome to see a similarly hybrid approach now gaining momentum! We now have both the huge data collections and the sophisticated Deep Learning algorithms to try out different things and better copy and simulate human intelligence in AI systems and, hence, achieve deeper understanding and generate more relevant and useful responses and actions. This will also contribute to more Explainable AI and, by extension, more Explainable Conversational AI for transparency and reusability in Voice User Interface (VUI) Design.

Neuro-Symbolic AI – combining Deep Learning with Symbolic Logic for common-sense understanding

13 Jan

I recently read an article in MIT Technology Review entitled “A hybrid AI model lets it reason about the world’s physics like a child”. In it, a so-called “Neuro-Symbolic model” is presented, a new AI approach that combines both Deep Learning and Symbolic Logic.

“it uses a neural network to recognize the colors, shapes, and materials of the objects and a symbolic system to understand the physics of their movements and the causal relationships between them. It outperformed existing models across all categories of questions.”

The motivation behind this approach was two-fold:

  1. Deep Learning is superb at pattern matching, but bad at actually understanding the data it processes.
  2. Symbolic Logic is really good at capturing and modelling human reasoning, correlations and interdependencies between actions, events, objects, people, and “common sense” or even objective rigorous Physics understanding.

Thus, the natural next step is to combine both and leverage the strengths of each in a Hybrid model:

“Deep Learning excels at scalability and pattern recognition; symbolic systems are better at abstraction and reasoning.”

Funnily enough, that was the approach I took myself for my PhD research back in 1993-96, when I conceived of a Hybrid architecture for automatic Text Summarisation that uses Artificial Neural Networks (ANNs – which have more recently turned into Deep Learning) and Symbolic Reasoning. It was a very ambitious undertaking, but it seemed to me to be the next natural step in AI methodologies. The title of my PhD was “COSY-MATS: A Hybrid Connectionist-Symbolic Approach To The Pragmatic Analysis Of Texts For Their Automatic Summarisation” (University of Manchester, UK, 1996) and you can find the whole Thesis here.


#NLP #NLU #Summarisation #Summarization #AI #ML #MachineLearning #DeepLearning #SymbolicAI #Logic #Hybrid #Connectionist #NeuroSymbolic #AImodels #CosyMats #ConnectionistSymbolic #ANNs #NeuralNetworks #Manchester

My baby, DialogCONNECTION, is 11!

4 Dec

This week, my company, DialogCONNECTION Limited, turned 11 years old! 🎉 🥂 😁

It feels like yesterday, when in December 2008 I registered it with Companies House and became Company Director (with multiple hats).

My very first client project was for the NHS Business Authority on their EHIC Helpline (which hopefully will survive the Brexit negotiations). Back then, whenever I was telling anyone what my company does (VUI Design for Speech IVRs), I was greeted by blank stares of confusion or incomprehension. It did feel a bit lonely at times!

Many more clients and thousands of long hours, long days and working weekends since, here we are in December 2019 and I suddenly find myself surrounded by VUI Designers and Voice Strategists who have now seen the potential and inescapable nature of speech interfaces and have followed in my footsteps. I feel vindicated, especially since I started in Voice back in 1996 with my Post-Doc in Spoken Dialogue Management at the University of Erlangen! 😎 (Yet another thing I’m hugely grateful to the EU for!)

We started with Voice-First VUI Design back in 1996, well before Samsung’s BIXBY (2017), Google’s ASSISTANT (2016), Amazon’s ALEXA (2014), Apple’s SIRI (2010) and even before the world started using GOOGLE for internet searches (1998)!


It’s quite frustrating when I realise that many of these newcomers have never heard of an IVR (Interactive Voice Response) system before, but they will eventually learn. 🤓 In the past 25 years it was the developers who insisted they could design conversational interfaces without any (Computational) Linguistics, Natural Language Processing (NLP) or Speech Recognition (ASR) background and didn’t need, therefore, a VUI Designer. And we were an allegedly superfluous luxury and rarity in those times. In the past couple of years it’s the shiny Marketing people, who make a living from their language mastery, and the edgy GUI Designers, who excel in visual design and think they can design voice interfaces too, but still know nothing about NLP or ASR.

What they don’t know is that, by modifying, for instance, just the wording of what your system says (prompt tuning), you can achieve dramatically better speech recognition and NLU accuracy, because the user is covertly “guided” to say what we expect (and have covered in the grammar). The same holds for tuned grammars (for out-of-vocabulary words), word pronunciations (for local and foreign accents), tuned VUI designs (for error recovery strategies) and tuned ASR engine parameters (for timeouts and barge-ins). It’s all about knowing how the ASR software and our human brain language software work.
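
One rough way to picture the effect of prompt tuning is to measure how many user replies fall inside the grammar before and after rewording a prompt. The Python sketch below does that with exact phrase matching, which is a deliberate simplification of what a real speech grammar does. The grammar phrases, the two prompts and all the user replies are invented for illustration and are not real data or results.

```python
GRAMMAR = {"balance", "payments", "lost card"}   # hypothetical in-grammar phrases


def in_grammar_rate(utterances: list[str], grammar: set[str]) -> float:
    """Fraction of utterances that exactly match a phrase covered by the grammar."""
    if not utterances:
        return 0.0
    normalised = [u.strip().lower() for u in utterances]
    return sum(u in grammar for u in normalised) / len(normalised)


# Invented replies to an open prompt such as "How can I help you?"
open_prompt_replies = [
    "i want to know how much money i have",
    "balance",
    "my card got stolen",
    "um payments please",
]

# Invented replies to a directed prompt such as "You can say balance, payments or lost card."
directed_prompt_replies = ["balance", "Lost card", "payments", "balance please"]

open_rate = in_grammar_rate(open_prompt_replies, GRAMMAR)
directed_rate = in_grammar_rate(directed_prompt_replies, GRAMMAR)
```

In this toy example the directed prompt pulls 3 out of 4 replies into the grammar, compared with 1 out of 4 for the open prompt. The remaining out-of-grammar reply (“balance please”) is the kind of thing grammar tuning would pick up next.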

Excited to see what the next decade is going to bring for DialogCONNECTION and the next quarter of a century for Voice! Stay tuned!

Towards EU collaboration on Conversational AI, Data & Robotics

22 Nov

I was really interested to read the BDVA – Big Data Value Association’s and euRobotics’ recent report on “Strategic Research, Innovation and Deployment Agenda for an AI PPP: A focal point for collaboration on Artificial Intelligence, Data and Robotics”, which you can find here.

Of particular relevance to me was the Section on Physical and Human Action and Interaction (pp. 39-41), which describes the dependencies, challenges and expected outcome of coordinated action on NLP, NLU and multimodal dialogue processing. The associated challenges are:

  • Natural interaction in unstructured contexts, which is the default in the case of voice assistants for instance, as they are expected to hold a conversation on any of a range of different topics and act on them
  • Improved natural language understanding, interaction and dialogue covering all European languages and age ranges, thus shifting the focus from isolated recognition to the interpretation of the semantic and cultural context, and the user intention
  • Development of verbal and non-verbal interaction models for people and machines, underlining the importance of gestures and emotion recognition and generation (and not only in embodied artificial agents)
  • Co-development of technology and regulation to assure safe interaction in safety-critical and unstructured environments, as the only way to assure trust and, hence, widespread citizen and customer adoption
  • The development of confidence measures for interaction and the interpretation of actions, leading to explainable AI and, hence, improved and more reliable decision-making

You can find the excellent and very comprehensive report here.