SpeechRecognition is likely by 2015. (See notes below on why.)
Applications of Speech Recognition
People imagine that Speech Recognition will be used to create Word documents. This is a limited perspective.
More likely, speech recognition will be used...
* ...to keep real-time transcripts during conversations.
* ...to annotate and to comment.
* ...to instruct and answer computers in a hands-free environment (while driving; see DrivingCars, though).
* ...to send instant messages.
* ...eventually, for most computer interaction: the LinguisticUserInterface.
Real-Time Transcripts: Discovering Conversation
Imagine that you are studying biology, in particular mitochondria. You study with a co-learner, by voice, over the Internet. Because the subject is educational, you let the conversation be public. (See OverHear for details.)
A computer program is transcribing your conversation in real-time, and another program is indexing your conversation in real-time.
A few states away, someone else is also studying biology. They perform a search, and discover the conversations you are having. They may leave a note at an information node representing your conversation (see ProjectSpaceNetwork, WikiAsCollage), or, if you are talking at that particular moment, opt to listen in. A small icon lets you know that someone is listening in on the conversation. You may invite her in, or she may knock, requesting to come in.
This is made possible by SpeechRecognition, but it is not a scenario people think of when they think of speech recognition. Most people imagine that they will be writing Word documents with speech recognition.
Instant Messages
Speaking is intuitive and fast.
Reading is fast, easier on memory, and easier to index.
"Index:" When you read something, it only takes a moment to go back to the beginning. You don't have to say: "Go back to the beginning," or "Wait; what did you just say?" Rather, your eye darts back to the beginning of the sentence. We call this, "indexing."
But listening is slow, hard to index, and forgettable.
And writing is slow and requires some intention.
With SpeechRecognition technology, we will not have to choose between one or the other. We will have the best of both.
That is, you will speak to tell someone something, and they will read to understand it. Your microphone will be connected to your instant messenger. When you say, "Jim, how are you doing?", the computer will recognize that you mean to talk with Jim, and will send the text "how are you doing?" to him. (Your identity will likely be recognized by VoiceRecognition.)
Jim may be gone at the moment. But when he returns in 10 minutes, he may speak "Dave, I'm doing fine. Work was a bit wearisome, but otherwise, I'm fine."
(TODO: Our CommunicationMores will be different. We will likely communicate suits or roles, and have different ways of collecting messages, in a more organized fashion.)
You are both speaking, but you are both reading each other's text.
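The routing step in the "Jim, how are you doing?" example could be sketched roughly as follows. This is only an illustration, not how any real messenger works: the `CONTACTS` set and the `route_utterance` function are hypothetical names, and the sketch assumes the recognizer hands back text with a comma after the spoken name.

```python
import re
from typing import Optional, Tuple

# Hypothetical contact list; a real messenger would supply its own.
CONTACTS = {"jim", "dave"}

def route_utterance(transcript: str, contacts=CONTACTS) -> Optional[Tuple[str, str]]:
    """Split 'Jim, how are you doing?' into ('Jim', 'how are you doing?').

    Returns None when the utterance does not begin with a known contact
    name, so ordinary speech is not sent to anyone by accident.
    """
    # Assumption: the transcript looks like "<name>, <message>".
    match = re.match(r"\s*([A-Za-z]+)\s*,\s*(.+)", transcript, flags=re.DOTALL)
    if not match:
        return None
    name, body = match.group(1), match.group(2).strip()
    if name.lower() not in contacts:
        return None
    return name, body

print(route_utterance("Jim, how are you doing?"))
```

Checking the leading word against a contact list is what keeps a sentence like "Well, I don't know" from being sent to somebody named Well.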
Comments
Similarly, when you attach a comment to Slashdot, you will just hold down the spacebar, and speak your mind. Comment attached. Similar for attaching comments to documents, songs that are playing, or anything you care to comment on.
Real-Time Transcripts: Everywhere and (almost) Always
Recording conversations will be the norm. There will be few conversations about who said or didn't say something at work; it'll all be automatically recorded, like having a court reporter in every room. There will be a searchable, time-indexed, tagged and annotated transcript of everything that is spoken. Everything.
When people have a hard time understanding a concept, because it's being poorly presented, we'll have all the evidence we need. "See, when you explain things this way, it usually takes 3 times longer to explain it, than when you explain it this other way."
All of this is unlocked when you have SpeechRecognition. SpeechRecognition is no small thing. Do not be one of those people who envision themselves writing Word documents with speech recognition.
Interesting Interactions
SubVocalRecording -- technology that can record thoughts "spoken" in your mind -- actually easier and closer than one might think (10-20 years)
ArgumentGraphs -- visual maps of arguments
AugmentedReality, HudInTheEye -- when you can float words above people's heads, all speech becomes visual; spoken words produce hanging refrigerator magnets
2005
Divide speech recognition into two types:
general (TODO: find real name for it?) -- take any spoken words by someone in a known language, and turn it into text
selective -- with a list of possible words, figure out what the person said (e.g., "I am expecting a number.")
Selective speech recognition is very good, and presently rolling out into corporate phone trees. It's hardly ubiquitous, but it's not rare either. Merely: uncommon, and expanding.
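To make the "selective" idea concrete, here is a minimal sketch of choosing from a short list of expected words. It is a text-only stand-in: real systems score the audio against the expected vocabulary, whereas this sketch assumes a rough text guess is already available and uses string similarity from Python's `difflib`. The `DIGITS` list and the `pick_expected_word` function are hypothetical names.

```python
import difflib
from typing import Optional

# Hypothetical vocabulary for a prompt that expects a single digit.
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def pick_expected_word(heard: str, vocabulary=DIGITS, cutoff: float = 0.6) -> Optional[str]:
    """Return the vocabulary word closest to what was heard, or None.

    String similarity stands in for acoustic scoring here (an assumption
    made for illustration only).
    """
    matches = difflib.get_close_matches(heard.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(pick_expected_word("for"))     # a rough guess that is close to "four"
print(pick_expected_word("banana"))  # nothing in the list is close enough
```

Knowing in advance that the answer must be one of ten words is what makes the problem so much easier than the general case.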
Philip Greenspun provides instructions online on how a developer can make a voice program today, that works with the existing plain-old telephone system.
General speech recognition is getting better, but is still bad. You still have to speak a little slower, and provide some corrections. But the computer is pretty good at recognizing context, and letting you correct it.
Jon Udell has a Flash video demonstrating what Speech Recognition was capable of as of November 2004.
(Associated article.)
2015
In the IntelDeveloperForum2005Keynote, JustinRattner was blasé about speech-to-text. He said that by 2015, computers will have "strong capabilities" in speech-to-text. Near the end of the keynote, he said (TODO: I can only say: "Something to the effect of") "Absolutely going to happen. No questions." (TODO: relisten, or find transcript.) He seemed bored to talk about it. He was far more interested in VideoAnalysis, where your computer knows it's you because it sees you through its camera, and 3DGraphics.
(older stuff)
On MarshallBrain's Robots In 2015 page, he writes:
"For a taste of just how good robotic voice recognition has gotten, call (800) 555-1212 and ask for the listing for American Airlines or Delta Airlines. A robotic system will give you the number. Then call American or Delta and navigate their voice-operating arrival and departure systems. In 10 years, these systems will be flawless and they will understand multiple languages with ease. By 2015, big box retailers will deploy voice-recognizing robots and kiosks throughout the store to help customers find the items they need."
I recommend actually doing this.
call (800) 555-1212, ask for "American Airlines"
call the number you were given, and go to the English flight times listing: (1-1-1 through the phone tree)
ask for info about any flight from your local airport, to some other airport
I was, personally, surprised by the quality of the voice recognition:
I wasn't speaking carefully, and yet it was able to figure out what I said.
When it asked "what time is the flight departing," I answered with, "Oh, I don't know- something around 2-3 o'clock?" ("Is that in the morning or afternoon?") "Oh- in the afternoon." (It gave me a flight time at 2:39 PM.)
It was never trained on my voice.
I was a little disappointed that I had to go through the initial phone tree (1-1-1), but my girlfriend tells me that there are systems in place that don't. She works in the health industry, and there are some places you call, and they ask: "What do you want to do?" "I want to refill my medication." (And then it goes from there.)
MarshallBrain's prediction doesn't seem far-fetched.
There are other things to note as well:
People in the voice recognition world are saying that we're going to chip out voice recognition functionality. (We should find the link to the info.) Voice recognition takes up a lot of cycles right now, but we want it to be transparent, as we move towards the LinguisticUserInterface.
Voice recognition doesn't need to be perfect to be useful. Suppose you wanted to construct a live index of all chat- if recognition falters here and there, it's still likely a useful index. There are cheesy hacks, as well: Assemble lists of "this word is spoken in the context of these other words."
We, humans, don't have perfect voice recognition. Just something to keep in mind.
In LionsTimelineFrom2004, I've listed "Mature LinguisticUserInterface" at around 2018-2022. By that, I mean: "Fluid communication with the computer," where you don't have to think about it.
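The "live index" and the "this word is spoken in the context of these other words" hack noted above can be sketched together. This is a toy illustration under stated assumptions: the recognizer is assumed to emit `(timestamp, text)` segments, the sample transcript is made up, and `build_index` and `build_context` are hypothetical helper names.

```python
from collections import Counter, defaultdict

def build_index(segments):
    """Map each word to the timestamps (in seconds) of the segments containing it."""
    index = defaultdict(list)
    for timestamp, text in segments:
        for word in set(text.lower().split()):
            index[word].append(timestamp)
    return index

def build_context(segments):
    """For each word, count the other words spoken in the same segment."""
    context = defaultdict(Counter)
    for _, text in segments:
        words = set(text.lower().split())
        for word in words:
            context[word].update(words - {word})
    return context

# Made-up transcript; "might a contria" stands in for a recognition error.
segments = [
    (0.0, "the mitochondria produce energy for the cell"),
    (4.5, "the might a contria have their own dna"),
    (9.0, "mitochondria divide inside the cell"),
]

index = build_index(segments)
context = build_context(segments)

print(index["mitochondria"])                 # segments where the word was recognized
print(context["mitochondria"].most_common(3))
```

The middle segment was misrecognized, yet a search for "mitochondria" still lands on two of the three segments, which is the point: the index stays useful even when recognition falters here and there.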
-- LionKimbro 2005-01-17 01:30:05
-- LionKimbro 2005-02-03 06:19:23
In the IntelDeveloperForum2005Keynote, JustinRattner sounded almost blasé about speech-to-text. He said that by 2015, we will have "strong capabilities" in speech-to-text. I believe he uttered something like, "Absolutely going to happen. No questions," towards the end. (I'd have to relisten to it.) He seemed to be positively bored talking about it. He was far more interested in talking about VideoAnalysis, where your computer knows it's you because it sees you through its camera, and 3DGraphics.
So, when I talk with people, I say, "Speech-to-text. 5-10 years. JustinRattner says so."
(That is, 2010-2015.)
Personally, I imagine he's already seen it.
Again, call American Airlines, or observe consumer-grade speech recognition. It would not surprise me if JustinRattner, "named Scientist of the Year by R&D Magazine for his leadership in parallel and distributed computer architecture," has seen better, and knows where we're headed.
-- LionKimbro 2005-03-27 07:04:59
I just saw how to write your own speech-to-text apps today.
-- LionKimbro 2005-04-15 05:37:34