SDS 652: A.I. Speech for the Speechless

Podcast Guest: Jon Krohn

February 10, 2023

In this Five-Minute Friday, Jon Krohn investigates the technology that allows patients who have lost their ability to speak due to mechanical ventilation to communicate clearly.

 
New AI technology is voicing what mechanically ventilated patients want to say. An estimated 13 to 20 million patients suffering from respiratory failure must be mechanically ventilated each year. The procedure requires clinicians to insert a tube down the patient’s throat, removing their ability to speak. This can be a distressing time for a patient, who will want to talk with friends, family, and their doctors while they recover.
Developed by Arne Peine’s research team at the University of Aachen, the machine vision model predicts what a patient wants to say with only a 6.3% error rate. Trained on a 7,000-video dataset of German- and English-language speakers, the model’s first stage processes video frames of moving lips and matches them with probable sound sequences; from that estimated audio, a second stage then predicts the words the patient is trying to express. Both stages use relatively simple deep-learning architectures.
Listen in to hear Jon detail the technology behind the model.
Interested in sponsoring a SuperDataScience Podcast episode? Visit JonKrohn.com/podcast for sponsorship information.  

Podcast Transcript

(00:05):
This is Five-Minute Friday on A.I. Speech for the Speechless. 

(00:19):
In a given year, there are estimated to be between 13 and 20 million people worldwide who are mechanically ventilated in order to allow them to get enough oxygen into their blood while they suffer from respiratory failure. While this mechanical ventilation — air being forced into and out of the patient’s lungs by a machine — can be life-saving, in severe cases it requires a tube to be inserted 24/7 into the patient’s mouth or directly into the throat via surgery. In either case, the patient loses their ability to speak and, perhaps unsurprisingly, patients report this as the most distressing aspect of being mechanically ventilated. 
(01:00):
Well, thanks to cutting-edge machine learning research being carried out in Germany, A.I. is now providing speech for these speechless patients. The research, led by Dr. Arne Peine of the University of Aachen, involves a machine vision model that predicts what the patient is trying to say based on their lip movements — it’s a lip-reading algorithm. Even though this research is in its infancy, it works quite well, with only a 6.3% error rate — dramatically increasing the capacity for these mechanically-ventilated patients to communicate with healthcare workers and loved ones.
(01:33):
In order to train their model, the researchers collected their own dataset consisting of 7,000 short videos of patients uttering several hundred different sentences in German and in English. This relatively small dataset was effective because, while lip-reading can be tricky in normal social settings, patients on a mechanical ventilator are unable to move their heads around, so the lips remain fixed in one place, in full view of the camera.
(01:56):
In terms of the machine learning model itself, it involves two distinct stages, each of which consists of a relatively simple deep learning architecture. The first stage — called the Audio Feature Estimator — takes in several consecutive frames of video of the lips moving while the patient is speaking. Deep learning models consist of several layers of so-called artificial neurons that loosely mimic the way biological brain cells work. In this Audio Feature Estimator model, the video frames first pass through convolutional neural network layers, which are specialized for identifying spatial patterns in images, and then the information passes through gated recurrent unit layers, which are specialized for handling data that occur in a sequence over time. So there are essentially two kinds of artificial neurons in this Audio Feature Estimator: a convolutional style and a gated recurrent unit style. That’s it. The result is that the Audio Feature Estimator, the first of the model’s two stages, outputs a prediction of the sound the patient is trying to utter with their lips. In other words, the first stage of the model takes in video of lips moving and outputs a prediction of the audio that would be associated with that lip movement. 
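For readers who want to picture that first stage in code, here is a minimal PyTorch sketch of a convolutional-plus-GRU Audio Feature Estimator. It is an illustration under assumptions of my own (the layer sizes, grayscale lip crops, and a mel-spectrogram-style output are not taken from the paper), but it shows how convolutional layers handle each frame spatially while a gated recurrent unit handles the sequence of frames over time.

```python
# Illustrative sketch only: an assumed CNN + GRU "Audio Feature Estimator"
# mapping a sequence of lip-region video frames to per-frame audio features.
# Layer sizes, input format, and output dimensionality are assumptions,
# not taken from the published model.
import torch
import torch.nn as nn

class AudioFeatureEstimator(nn.Module):
    def __init__(self, n_audio_features=80, hidden_size=256):
        super().__init__()
        # Convolutional layers: spatial patterns within each (grayscale) frame
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        # Gated recurrent unit: how those per-frame features evolve over time
        self.gru = nn.GRU(64 * 4 * 4, hidden_size, batch_first=True)
        # Map each GRU state to a predicted audio-feature vector for that frame
        self.head = nn.Linear(hidden_size, n_audio_features)

    def forward(self, frames):  # frames: (batch, time, 1, height, width)
        b, t = frames.shape[:2]
        x = self.cnn(frames.flatten(0, 1))    # run the CNN on every frame
        x = x.flatten(1).reshape(b, t, -1)    # back to (batch, time, features)
        x, _ = self.gru(x)                    # temporal modelling
        return self.head(x)                   # (batch, time, n_audio_features)
```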
(02:59):
The second stage of the model — called the Speech-to-Text stage — takes the audio that was output by the first stage and then — again using a relatively straightforward deep learning architecture consisting of a convolutional neural network layer and a gated recurrent unit layer — outputs a prediction of the words that the patient is trying to say based on the audio that was fed into it by the first stage. So it’s a two-stage model: the first stage takes in video of lips moving and outputs an estimate of the associated audio, and the second stage takes that estimated audio and outputs text. Once we have that text, the words can simply be routed through standard off-the-shelf text-to-speech algorithms in order to allow mechanically-ventilated patients to speak in real-time, just by moving their lips! 
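To round out the picture, here is an equally hedged sketch of the second stage and of chaining the two stages together. It assumes the Speech-to-Text stage emits per-timestep character probabilities (which a decoder such as CTC beam search could turn into words), and it reuses the AudioFeatureEstimator class from the sketch above; again, none of these specifics come from the published model.

```python
# Illustrative sketch only: an assumed Speech-to-Text stage that maps the
# estimated audio features to per-timestep character log-probabilities.
import torch
import torch.nn as nn

class SpeechToText(nn.Module):
    def __init__(self, n_audio_features=80, n_chars=40, hidden_size=256):
        super().__init__()
        # 1-D convolution over time, then a GRU, then a character classifier
        self.conv = nn.Conv1d(n_audio_features, hidden_size, kernel_size=5, padding=2)
        self.gru = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_chars)

    def forward(self, audio_feats):                  # (batch, time, n_audio_features)
        x = self.conv(audio_feats.transpose(1, 2))   # (batch, hidden, time)
        x, _ = self.gru(x.transpose(1, 2))           # (batch, time, hidden)
        return self.head(x).log_softmax(dim=-1)      # character log-probabilities

# Chaining the two stages: lip video in, character probabilities out.
# (AudioFeatureEstimator is the class defined in the earlier sketch.)
frames = torch.randn(1, 75, 1, 64, 64)               # 75 dummy lip-crop frames
char_log_probs = SpeechToText()(AudioFeatureEstimator()(frames))
```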
(04:03):
Super cool, right? A.I. providing speech for the speechless. Having now prototyped their approach, the next step for these German clinical researchers is to expand their training dataset and explore more recent deep learning architectures such as those used for visual style transfer in order to get the 6% error rate lower and broaden the range of speechless patients that their approach is effective for. If you’d like to dig deeper on the research, check out the full paper — called Two-stage visual speech recognition for intensive care patients — a link to which is available in the show notes. 
(04:38):
I loved this practical, socially-beneficial application of A.I. by Dr. Peine and his associates at the University of Aachen and I hope you enjoyed hearing about it too. I encourage you to tag me in social-media posts on LinkedIn or Twitter to suggest super-cool new A.I. applications for future Five-Minute Friday episodes. The idea for today’s episode, for example, came from a LinkedIn comment I was tagged in by the brilliant A.I. product manager Alice Desthuilliers, with whom I’ve had the great pleasure of working regularly at my machine learning company Nebula. 
(05:09):
And if you don’t already know deep learning well — including the convolutional neural network layers and the gated recurrent units mentioned in today’s episode — you can learn all about these layer types and how they fit into deep neural networks from my book Deep Learning Illustrated, which is available in seven languages. A link to ordering physical copies is in the show notes. Alternatively, the book is available in a digital format within the O’Reilly learning platform, where you can also find a video-tutorial version of my deep learning content called Deep Learning with TensorFlow, Keras, and PyTorch. Many employers and universities provide access to O’Reilly; however, if you don’t already have access, you can grab a free 30-day trial of the platform using our special code SDSPOD23. We’ve got a link to that code ready for you in the show notes as well. 
(05:57):
All right that’s it for this Five-Minute Friday episode on A.I. speech for the speechless. Until next time, keep on rockin’ it out there, folks, and I’m looking forward to enjoying another round of the SuperDataScience podcast with you very soon. 