Google taps big data to improve universal translation
Google Translate is currently the most widely used quick tool for translating web pages or short text snippets. According to Germany's Der Spiegel, the core back-end technology behind the service will, in the near future, be developed into a universal translator like the one in "Star Trek."
Of course, Google is not the only company working on this. Everyone from Facebook to Microsoft harbors the same ambition: to build a service that finally eliminates language barriers altogether. Is that ambition realistic? And how much effort would it take to achieve?
Machine translation has been around for a long time, but it has always lagged far behind human translation. Much of the difficulty in building machine translation software lies in defining the grammars and vocabularies of different languages, problems that are not easy to solve.
Under the guidance of engineer Franz Och, Google's approach overturned all of that by adopting a purely statistical method. For example, by processing large volumes of available parallel translation data, English-French translation turned out far better than the old algorithm-driven approach. The larger the corpus of parallel texts, the better the results. (This also owes much to the enormous growth in data storage and computing power over the past few decades.)
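The core idea of the statistical approach can be illustrated with a toy sketch. The example below is not Google's actual system (real systems use millions of sentence pairs and far more sophisticated alignment models); it simply counts, over a tiny hypothetical English-French parallel corpus, which target word co-occurs most often with each source word:

```python
from collections import Counter, defaultdict

# Toy parallel corpus of (English, French) sentence pairs.
# Purely illustrative; real corpora contain millions of pairs.
corpus = [
    ("the house", "la maison"),
    ("a house", "une maison"),
    ("the blue house", "la maison bleue"),
    ("the flower", "la fleur"),
    ("blue flower", "fleur bleue"),
]

# Count how often each French word appears in sentences
# paired with each English word.
cooc = defaultdict(Counter)
for en_sent, fr_sent in corpus:
    for en_word in en_sent.split():
        for fr_word in fr_sent.split():
            cooc[en_word][fr_word] += 1

def translate_word(en_word):
    """Return the French word most often seen alongside en_word."""
    if en_word not in cooc:
        return en_word  # unknown word: pass through unchanged
    return cooc[en_word].most_common(1)[0][0]

print(translate_word("house"))   # -> maison
print(translate_word("flower"))  # -> fleur
```

The point of the sketch is that no grammar rule or dictionary entry is written by hand: the mapping emerges from the data, and adding more parallel sentences sharpens it, which is exactly why the size of the corpus matters so much.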
If Google's approach is to build its own technology from scratch, Facebook chose to borrow someone else's. Back in August, Facebook acquired language-translation software company Mobile Technologies, a deal that Facebook's director of product management described as "an investment in our long-term product roadmap." Among Mobile Technologies' products is an app called Jibbigo that performs speech translation.
From these two different approaches, a common element emerges: both are backed by massive stores of real-world linguistic data. Google and Microsoft both have search engines that harvest the web in real time; Facebook has more than a billion users chatting away. All of this constitutes a vast trove of data from which a translation corpus can be mined.
The biggest unanswered question so far: if Google, Facebook, Microsoft, and the rest use real-time conversations to build translation corpora, can that data be anonymized? An opt-in program that lets people consent to having their conversations collected seems like the best approach. But judging by past behavior, these companies are more likely to simply roll such data harvesting into their terms-of-service agreements.
Original article:
Google taps big data for universal translator
By InfoWorld Tech Watch
Google Translate is currently best known for being a quick and dirty way to render Web pages or short text snippets in another language. But according to Der Spiegel[1], the next step for the core technology behind that service is a device that amounts to the universal translator from "Star Trek."
Google isn't alone, either. Apparently everyone from Facebook to Microsoft is ramping up similar ambitions: to create services that eradicate language barriers as we currently know them. A realistic goal or still science fiction? And at what cost?
Machine translation has been around in one form or another for decades, but has always lagged far behind translations produced by human hands. Much of the software written to perform machine translation involved defining different languages' grammars and dictionaries, a difficult and inflexible process.
Google's approach, under the guidance of engineer Franz Och, was to replace all that with a purely statistical approach. Looking at masses of data in parallel -- for instance, the English and French translations of various public-domain texts -- produced far better translations than the old algorithm-driven method. The bigger the corpus, or body of parallel texts, the better the results. (The imploding costs of storage and processing power over the last couple of decades have also helped.)
If Google's plan is to create its own technology from scratch, Facebook's strategy appears to be to import it. Back in August, Facebook picked up language translation software company Mobile Technologies[2], which Facebook's product management director described[3] as "an investment in our long-term product roadmap." Among Mobile Technologies' products is an app named Jibbigo, which translates speech.
From these two projects alone, it's easy to see a common element: the backing of a company that has tons of real-world linguistic data at its disposal. Google and Microsoft both have search engines that harvest the Web in real time; Facebook has literally a billion users chatting away. All of this constitutes a massive data trove that can be harvested for the sake of a translation corpus.
The big unanswered question so far: If Google, Facebook, Microsoft, and the rest plan on using real-time conversations to generate a corpus for translations, will any of that data be anonymized? Is it even possible? An opt-in program that allows people to let their talk be used as part of the corpus seems like the best approach. But based on their previous behavior, isn't it more likely they'll simply roll such harvesting into a terms-of-service agreement?
This article, "Google taps big data for universal translator[4]," was originally published at InfoWorld.com[5].