Tuesday, May 14, 2013

Star Trek’s universal translator

Microsoft invents the universal translator... for voice!

We have all grown used to voice dictation systems that work reasonably well by now. But we are taking a real leap forward with this new system presented by Microsoft.
Going from voice to text is commonplace today. The big novelty that surprised everyone watching this presentation is the near-simultaneous translation from English into Mandarin, done not with a robotic, pre-recorded voice, but by generating the speech directly from the speaker’s own voice!
Until now, this kind of system belonged to science fiction. One thinks of poor C3PO who, despite his six million known forms of communication, will therefore remain confined to protocol duties.
Obviously, it would be too good to be true if the system were perfect. For the moment it still gets it wrong once every eight words (one word in eight). There is still room for improvement and certainly a few more years of development ahead. When you see how the “simple” text translators like Google’s fare, real human translators still have a bright future ahead of them!
Thankfully! I wouldn’t want to see Florent Gorges lose part of his livelihood!
To see the interesting part of the presentation, skip straight to 5:36.



A demonstration I gave in Tianjin, China at Microsoft Research Asia’s 21st Century Computing event has started to generate a bit of attention, and so I wanted to share a little background on the history of speech-to-speech technology and the advances we’re seeing today.
In the realm of natural user interfaces, the single most important one – yet also one of the most difficult for computers – is that of human speech.
For the last 60 years, computer scientists have been working to build systems that can understand what a person says when they talk.
In the beginning, the approach used could best be described as simple pattern matching. The computer would examine the waveforms produced by human speech and try to match them to waveforms that were known to be associated with particular words.
While this approach sometimes worked, it was extremely fragile. Everyone’s voice is different, and even the same person can say the same word in different ways. As a result, these early systems were not really usable for practical applications.
In the late 1970s, a group of researchers at Carnegie Mellon University made a significant breakthrough in speech recognition using a technique called hidden Markov modeling, which allowed them to use training data from many speakers to build statistical speech models that were much more robust. As a result, over the last 30 years speech systems have gotten better and better. In the last 10 years, the combination of better methods, faster computers and the ability to process dramatically more data has led to many practical uses.
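To make the idea a little more concrete, here is a deliberately tiny, purely illustrative Python sketch of Viterbi decoding over an invented two-word model. The states, probabilities and acoustic symbols are all made up for the example; real recognizers model phonemes and train their statistics on many hours of speech from many speakers.

```python
# Toy Viterbi decoding over a tiny hidden Markov model.
# States, probabilities and observation symbols are invented for illustration only.

states = ["yes", "no"]                       # hypothetical word states
start_p = {"yes": 0.5, "no": 0.5}            # prior probability of each word
trans_p = {                                  # state-transition probabilities
    "yes": {"yes": 0.7, "no": 0.3},
    "no":  {"yes": 0.4, "no": 0.6},
}
emit_p = {                                   # probability of each acoustic symbol
    "yes": {"high": 0.6, "mid": 0.3, "low": 0.1},
    "no":  {"high": 0.1, "mid": 0.2, "low": 0.7},
}

def viterbi(observations):
    """Return the most likely hidden word sequence for a list of acoustic symbols."""
    # best[t][s] = (probability, path) of the best path ending in state s at time t
    best = [{s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, path = max(
                (best[-1][prev][0] * trans_p[prev][s] * emit_p[s][obs],
                 best[-1][prev][1] + [s])
                for prev in states
            )
            column[s] = (prob, path)
        best.append(column)
    return max(best[-1].values())

prob, path = viterbi(["high", "mid", "low"])
print(path, round(prob, 4))   # ['yes', 'yes', 'no'] 0.0132
```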
Today, if you call a bank in the US, you are almost certainly talking to a computer that can answer simple questions about your account and connect you to a real person if necessary. Several products on the market today, including XBOX Kinect, use speech input to provide simple answers or navigate a user interface. In fact, our Microsoft Windows and Office products have included speech recognition since the late 1990s. This functionality has been invaluable to our customers with accessibility needs.
Until recently though, even the best speech systems still had word error rates of 20-25% on arbitrary speech.
Just over two years ago, researchers at Microsoft Research and the University of Toronto made another breakthrough. By using a technique called Deep Neural Networks, which is patterned after human brain behavior, researchers were able to train more discriminative and better speech recognizers than previous methods.
During my October 25 presentation in China, I had the opportunity to showcase the latest results of this work. We have been able to reduce the word error rate for speech by over 30% compared to previous methods. This means that rather than having one word in 4 or 5 incorrect, now the error rate is one word in 7 or 8. While still far from perfect, this is the most dramatic change in accuracy since the introduction of hidden Markov modeling in 1979, and as we add more data to the training we believe that we will get even better results.
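As a quick sanity check on those figures (a small, self-contained snippet of my own, not anything taken from the system itself), going from one error in four or five words to one in seven or eight does indeed correspond to a relative reduction of well over 30%:

```python
# Check the claim above: "one word in 4 or 5" wrong down to "one word in 7 or 8"
# wrong is a relative word-error-rate reduction of over 30%.
for before, after in ((4, 7), (5, 8)):
    old_wer, new_wer = 1 / before, 1 / after
    reduction = (old_wer - new_wer) / old_wer
    print(f"1 in {before} -> 1 in {after}: "
          f"{old_wer:.1%} -> {new_wer:.1%}, a {reduction:.0%} relative drop")
# 1 in 4 -> 1 in 7: 25.0% -> 14.3%, a 43% relative drop
# 1 in 5 -> 1 in 8: 20.0% -> 12.5%, a 38% relative drop
```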
Machine translation of text is similarly difficult. Just as with speech, the research community has been working on translation for the last 60 years, and, as with speech, the introduction of statistical techniques and Big Data has revolutionized machine translation over the last few years. Today, millions of people use products like Bing Translator every day to translate web pages from one language to another.
In my presentation, I showed how we take the text that represents my speech and run it through translation – in this case, turning my English into Chinese – in two steps. The first takes my words and finds the Chinese equivalents, and while non-trivial, this is the easy part. The second reorders the words to be appropriate for Chinese, an important step for correct translation between languages.
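A deliberately simplified sketch of those two steps may help illustrate the distinction. The miniature phrase table and the single hand-written reordering rule below are invented for this example; a production engine learns both statistically from large bilingual corpora.

```python
# Toy illustration of the two translation steps described above.
# The phrase table and the reordering rule are invented for this example.

phrase_table = {                   # step 1: English words/phrases -> Chinese equivalents
    "i": "我",
    "went to": "去了",
    "beijing": "北京",
    "yesterday": "昨天",
}

TIME_WORDS = {"yesterday"}         # step 2 uses one hand-written rule:
                                   # time expressions precede the verb in Chinese

def lookup(phrases):
    """Step 1: find a Chinese equivalent for each English word or phrase."""
    return [(p, phrase_table[p]) for p in phrases]

def reorder(pairs):
    """Step 2: move time expressions directly after the subject (Chinese word order)."""
    times = [pair for pair in pairs if pair[0] in TIME_WORDS]
    rest = [pair for pair in pairs if pair[0] not in TIME_WORDS]
    return rest[:1] + times + rest[1:]

english = ["i", "went to", "beijing", "yesterday"]
chinese = "".join(zh for _, zh in reorder(lookup(english)))
print(chinese)  # 我昨天去了北京 ("I yesterday went-to Beijing")
```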
Of course, there are still likely to be errors in both the English text and the translation into Chinese, and the results can sometimes be humorous. Still, the technology has developed to be quite useful.
Most significantly, we have attained an important goal by enabling an English speaker like me to present in Chinese in his or her own voice, which is what I demonstrated in China. It required a text-to-speech system that Microsoft researchers built using a few hours of speech from a native Chinese speaker, together with properties of my own voice taken from about one hour of pre-recorded (English) data – in this case, recordings of previous speeches I’d made.
Though it was a limited test, the effect was dramatic, and the audience came alive in response. When I spoke in English, the system automatically combined all the underlying technologies to deliver a robust speech-to-speech experience – my voice speaking Chinese. You can see the demo in the video above.
The results are still not perfect, and there is still much work to be done, but the technology is very promising, and we hope that in a few years we will have systems that can completely break down language barriers.
In other words, we may not have to wait until the 22nd century for a usable equivalent of Star Trek’s universal translator, and we can also hope that as barriers to understanding language are removed, barriers to understanding each other might also be removed. The cheers from the crowd of 2,000 mostly Chinese students, and the commentary that’s grown on China’s social media forums ever since, suggest a growing community of budding computer scientists who feel the same way.


Microsoft may have co-opted Star Trek a full century early by demonstrating an honest-to-goodness universal translator — one that not only renders what you’re saying into another language in real time, but that manages to sound like you while doing so.
In fact, assuming everything really is as it appears in the video above, Microsoft just pulled off something pretty amazing — much more impactful, in theory anyway, than a mere voice recognition app like Apple’s Siri.
Microsoft’s global head of research, Rick Rashid, demonstrated the surprisingly mature technology to a crowd of 2,000 students and teachers on Oct. 25 at the 14th annual Computing in the 21st Century Conference, held in Tianjin, China (the fourth largest city in the country).
Standing onstage with a large screen above him, Rashid spoke, and his words were rendered on the screen as English text, appearing as he said them with near-perfect accuracy. A “recognizability” percentage in the lower right-hand corner indicated how identifiable Rashid’s speech patterns were, staying well above 70% for most of the presentation.
After walking through a few watershed moments in speech-recognition research, Rashid shifted to live speech translation, explaining that Microsoft’s approach to the process happens in three steps. First, the company converts spoken English word-by-word into Chinese text. Next, the words are rearranged, since the word order of a Chinese sentence is different from its English analogue. Last, the newly translated Chinese text is converted back into speech, and — here’s the really clever part — made to sound as if the original speaker were vocalizing in the translated language (you can hear this yourself, starting around the video’s 7:30 mark).
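Sketched as code, the three stages chain together roughly as follows. The function names and signatures are placeholders of my own, not Microsoft APIs, and each stub stands in for a large trained statistical system.

```python
# Skeleton of the three-stage speech-to-speech pipeline described above.
# All names are hypothetical placeholders; every stage is stubbed out.

def recognize_english(audio: bytes) -> str:
    """Stage 1: speech recognition - English audio to English text."""
    raise NotImplementedError("acoustic and language models go here")

def translate_to_chinese(english_text: str) -> str:
    """Stage 2: translation - find Chinese equivalents, then reorder them."""
    raise NotImplementedError("statistical machine translation goes here")

def synthesize_in_speakers_voice(chinese_text: str, voice_model: object) -> bytes:
    """Stage 3: Mandarin text-to-speech shaped by a model of the original speaker's voice."""
    raise NotImplementedError("voice-adapted text-to-speech goes here")

def speech_to_speech(audio: bytes, voice_model: object) -> bytes:
    """Chain the stages: English speech in, Mandarin speech in the same voice out."""
    english_text = recognize_english(audio)
    chinese_text = translate_to_chinese(english_text)
    return synthesize_in_speakers_voice(chinese_text, voice_model)
```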
“In the realm of natural user interfaces, the single most important one — yet also one of the most difficult for computers — is that of human speech,” wrote Rashid in a followup blog post. “For the last 60 years, computer scientists have been working to build systems that can understand what a person says when they talk.”
In the course of its research, Microsoft says it’s been able to reduce errors by 30% – an improvement, according to the company, from one word in four or five being incorrect to just one word in seven or eight. Rashid calls that “the most dramatic change in accuracy since the introduction of hidden Markov modeling in 1979.” (Hidden Markov modeling is a statistical technique from probability theory, widely used to model sequences such as the sounds in speech.)
And with its speech-language translation engine, Rashid argues:
…we may not have to wait until the 22nd century for a usable equivalent of Star Trek’s universal translator, and we can also hope that as barriers to understanding language are removed, barriers to understanding each other might also be removed. The cheers from the crowd of 2,000 mostly Chinese students, and the commentary that’s grown on China’s social media forums ever since, suggest a growing community of budding computer scientists who feel the same way.
So that’s Microsoft’s pitch – impressive, real enough, and clearly promising for the future of engagement between speakers of different languages.
But would a universal translator also have downsides?
I can think of one in the Nicholas Carr “Is Google Making Us Stupid?” vein — the notion that externalizing so much of what we do mentally with computers and via the Internet is making us shallower, cognitively speaking. And academic research into the Internet’s role as an extension of our brains suggests that the more we’re sure of having access to information in the future, the less we’re able to summon it from memory.
So what happens if we outsource our brains linguistically? Would a universal translator render language instruction obsolete? Why, if you could just clip something onto your shirt, Star Trek-style, would you bother to actually learn a second or third or fourth language, when a computer could just play wingman and save you the effort? Doesn’t learning another language actually increase our brainpower? Would externalizing that diminish us somehow?
You’ve probably heard how learning a second language can be a serious brain booster. In an interview on the subject, Therese Sullivan Caccavale, president of the National Network for Early Language Learning, references a 2007 Harwich, Massachusetts study, which she explains “showed that students who studied a foreign language in an articulated sequence outperformed their non-foreign language learning peers on the Massachusetts Comprehensive Assessment System (MCAS) test after two-three years and significantly outperformed them after seven-eight years on all MCAS subtests.”
She continues:
Furthermore, there is research … that shows that children who study a foreign language, even when this second language study takes time away from the study of mathematics, outperform (on standardized tests of mathematics) students who do not study a foreign language and have more mathematical instruction during the school day. Again, this research upholds the notion that learning a second language is an exercise in cognitive problem solving and that the effects of second language instruction are directly transferable to the area of mathematical skill development.
Of course, futurists and brain-augmentation wonks will argue that we’re fast approaching a point at which the distinction between brains and computers becomes irrelevant. But in the meantime, amazing as technology like this is – in particular its promise to let us speak any language (estimates put the number of “living languages” in the world at just under 7,000) – it raises interesting new questions about the role of cognitive externalization: the pros and cons of handing more and more of what we used to do with our biological “human tech” over to computers.
