Mel Chua and Ian Smith on September 13, 2019
In the US alone, approximately 3% of the population (10 million people) are either deaf or have moderate to profound hearing loss. That is three times as many people as those who use wheelchairs, yet reasonable disability accommodations for deaf or hearing impaired people only require an ASL (American Sign Language) interpreter in certain circumstances, such as official political, legal, educational, law enforcement, and employment events and situations. The problem is that only a fraction of the functionally deaf (250-500 thousand) use ASL (these users are called "signers"), and situations with mandated accommodations are few and far between anyway. So how can deaf and hard of hearing people engage with events like everyone else, especially events that don't have interpreters, such as meetups, conferences, and debates?
This talk was liveblogged by two Deaf attendees (and native lipreaders), Mel Chua and Ian Smith. Commentary is in italics throughout.
Speaker: "What if you couldn’t hear me well?" (Deaf audience members, grinning: A completely hypothetical question, indeed!)
Hearing loss is a more common problem than most people think. The ADA was passed in 1990, and it (theoretically - not always in practice) prohibits discrimination against disabled people and requires accommodation for them. However, accommodation is not always present. What happens with PA systems, theatrical events, etc.? Hearing loss is an invisible disability. Three times as many people have hearing loss as use wheelchairs and canes. (Note: one Deaf audience member is a wheelchair user and started to laugh good-naturedly at this point.)
Reasonable accommodations for deaf and hard of hearing (DHH) people are not always provided proactively. Even if they were, DHH people often have issues with simple daily interactions, such as at the supermarket when the cashier says something. (Deaf audience members: Actually, those simple daily interactions are predictable and easy to manage via lipreading if you choose to do that, or via non-auditory/speech means if you'd rather -- pointing, writing, etc. Those are actually the easiest kinds of interactions you can have.)
Misconceptions:
Some examples of other assistive technologies:
Assistive tech for DHH people:
Speaker: “We attempted to solve this problem using our own toolset.” (Which was machine learning.)
(Deaf audience: This list seems correct to us.)
Additionally, this would be implemented on readily available technology. The idea is to have an augmented reality app that would run on a phone -- point it at something, and you’d see the camera image with the captions overlaid. Being able to run this software off a regular phone/tablet would be advantageous; there wouldn’t be a need to purchase specialized hardware.
In order to explain which specific technical domain this talk falls into, the speaker described three nested and increasingly specific domains. The broadest domain is artificial intelligence, which refers fairly generally to “things that enable machines to emulate human behavior.” A subset of that is machine learning, which uses statistical methods to enable machines to improve at what they do -- with the caution that it only learns from the data you give it. A subset of machine learning is deep learning, which can do more complex things than other kinds of machine learning.
(Deaf popcorn gallery: leans in closer)
Speech recognition is not unisensory; it draws on more senses than just hearing. For speakers, there is haptic feedback: you can physically feel yourself speaking. Audience members can see the speaker as they talk, so they have that visual channel as well, and so forth.
Combining both audio and visual data -- the sound and the visual (which isn't affected by background noise) -- might give us greater accuracy, especially in varying noise situations. (Deaf attendees: Much like human lipreaders do - neat!)
Lipreading and speech recognition have some similar challenges. For instance, the words “meteor,” “meatier,” and “meat eater” sound - and look! - very similar on the lips.
A tricky example to disambiguate:
Michael and “my call” are homophones. They sound alike, and they look alike on the lips -- how would you distinguish them?
The answer is what's called "long short-term memory" (LSTM). Each section of the network learns from what has happened in the time preceding the data it's currently processing -- but at the same time, it doesn't remember that information forever. It uses a probabilistic method to figure out which parts to focus on and which ones to ignore.
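To make the "remembers recent context, then lets it fade" idea concrete, here is a minimal sketch (ours, not the presenters' code) of pushing a sequence of per-frame features through an LSTM in PyTorch; the sequence length and feature sizes are invented for illustration.

```python
# Minimal LSTM sketch (illustrative sizes, not the presenters' model).
import torch
import torch.nn as nn

seq_len, batch, feat_dim = 75, 1, 128            # ~3 s of frames; assumed sizes
frames = torch.randn(seq_len, batch, feat_dim)   # per-frame features

lstm = nn.LSTM(input_size=feat_dim, hidden_size=256)
outputs, (h_n, c_n) = lstm(frames)

# outputs[t] depends on frames[0..t]: earlier context (was it "Mike-" or "my"?)
# shapes how the current frame ("-ael" vs. "call") is interpreted, while the
# gates let the network forget context that is no longer useful.
print(outputs.shape)  # torch.Size([75, 1, 256])
```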
In order to train this system, they started with 500 hours of TED talks, of which 400 hours were selected and used for training. The talks are pretty diverse in terms of speaker gender, ethnicity, voice types, and so forth, and have clear audio and video (good lighting, etc.).
The first step was to get a visual data stream, which required isolating the lips on the video. They took the video and cropped it so that it would be centered on the mouth and scaled it so each image of a mouth consisted of the same number of pixels.
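As a rough illustration of that step (the tooling here is our assumption -- the presenters did not say what they used), cropping a frame to a fixed-size patch centered on the mouth might look like this, using OpenCV plus the face_recognition landmark detector:

```python
# Sketch of mouth-centered cropping; library choice and sizes are assumptions.
import cv2
import numpy as np
import face_recognition

def mouth_crop(frame_bgr, out_size=(64, 64), half=48):
    """Return a fixed-size image patch centered on the speaker's mouth."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    faces = face_recognition.face_landmarks(rgb)
    if not faces:
        return None                               # no face found in this frame
    lips = np.array(faces[0]["top_lip"] + faces[0]["bottom_lip"])
    cx, cy = lips.mean(axis=0).astype(int)        # center of the lip landmarks
    patch = frame_bgr[max(cy - half, 0):cy + half,
                      max(cx - half, 0):cx + half]
    return cv2.resize(patch, out_size)            # same pixel count every frame
```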
The audio stream also needed to be processed; they used an existing algorithm to select which features of the audio data to focus in on.
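The talk did not name that algorithm. MFCCs (mel-frequency cepstral coefficients) are one common choice for this kind of step, so purely as an illustration (with librosa, and a hop length we picked to roughly match a video frame rate):

```python
# Illustrative audio feature extraction; the actual algorithm was not named.
import librosa

def audio_features(wav_path, sr=16000, hop_ms=40):
    y, sr = librosa.load(wav_path, sr=sr)
    hop = int(sr * hop_ms / 1000)                # one feature vector per ~40 ms
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    return mfcc.T                                # shape: (time_steps, 13)
```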
The machine learning model then had the video data (of the centered, scaled image of the mouth), the audio, and the transcript of the “correct” answer as a training dataset to work with. Each module in the system has a convolutional neural network (CNN) that runs through the data before it goes into the LSTM. And each module in the system has a (missed this term) that decides what data to continue paying attention to, and which data to discard before passing this on to the next step.
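As a schematic of that shape -- a CNN per modality feeding an LSTM, with the two streams combined before predicting the transcript -- here is a PyTorch sketch; the layer sizes and the simple concatenation-based fusion are our assumptions, not the presenters' actual model.

```python
# Schematic audio-visual recognizer sketch; sizes and fusion are assumptions.
import torch
import torch.nn as nn

class Stream(nn.Module):
    """One modality: a small CNN over per-frame features, then an LSTM."""
    def __init__(self, in_dim, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(in_dim, 64, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)

    def forward(self, x):                        # x: (batch, time, in_dim)
        x = self.cnn(x.transpose(1, 2)).transpose(1, 2)
        out, _ = self.lstm(x)
        return out                               # (batch, time, hidden)

class AVRecognizer(nn.Module):
    def __init__(self, video_dim, audio_dim, n_chars, hidden=256):
        super().__init__()
        self.video = Stream(video_dim, hidden)
        self.audio = Stream(audio_dim, hidden)
        self.classify = nn.Linear(2 * hidden, n_chars)  # per-step character scores

    def forward(self, video_feats, audio_feats):
        fused = torch.cat([self.video(video_feats),
                           self.audio(audio_feats)], dim=-1)
        return self.classify(fused)              # trained against the transcript
```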
Their training resulted in an 11.5% character error rate and a 17.6% word error rate, which is fairly impressive considering the small dataset. This project (which combined audio and video/lipreading data) performed favorably compared to a BBC project that relied on visual/lipreading data alone.
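For reference, both metrics are the edit distance (insertions, deletions, substitutions) between the prediction and the reference transcript, divided by the reference length -- over characters for CER, over words for WER. A small self-contained sketch:

```python
# Character / word error rate via Levenshtein edit distance.
def edit_distance(ref, hyp):
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                              # deletion
                          d[i][j - 1] + 1,                              # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])) # substitution
    return d[len(ref)][len(hyp)]

def cer(ref, hyp):
    return edit_distance(list(ref), list(hyp)) / len(ref)

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

print(wer("can you read my lips", "can you reed my lips"))  # 0.2
```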
A pre-recorded demo was shown of the presenter speaking, which the computer captured as "can you read my lips". They noted that if a human captioner were doing this, they would catch the intonation and add punctuation: "Can you read my lips?" would be written out as a question, and the exclamation "Wow!" afterwards would get its exclamation point.
This was followed by a few demos from the dataset:
We’d need a larger training set with annotated data in order to improve this -- the presenter is looking for such datasets, if anyone knows of any!