What are the differences and connections between natural language processing (NLP) and computational linguistics?


Jason Eisner's answer on Quora covers this fairly comprehensively.

Note: Jason Eisner is one of the leading figures in NLP today.

http://qr.ae/7bCDJr

The original text is pasted below in case the link is inaccessible (apologies for not translating it).

I use these terms to indicate different research goals. The methodologies are often related and the communities overlap. We go to the same conferences (the strongest work in both fields appears at ACL, EMNLP, NAACL, etc.) and easily talk together about our problems and solutions.

Here's the distinction I personally make:

Computational linguistics is analogous to computational biology or any other computational fill-in-the-blank. It develops computational methods to answer the scientific questions of linguistics.

The core questions in linguistics involve the nature of linguistic representations and linguistic knowledge, and how linguistic knowledge is acquired and deployed in the production and comprehension of language. Answering these questions describes the human language ability and may help to explain the distribution of linguistic data and behavior that we actually observe.

In computational linguistics, we propose formal answers to these core questions. Linguists are really asking what humans are computing and how. So we mathematically define classes of linguistic representations and formal grammars (which are usually probabilistic models nowadays) that seem adequate to capture the range of phenomena in human languages. We study their mathematical properties, and devise efficient algorithms for learning, production, and comprehension. Because the algorithms can actually run, we can test our models and find out whether they make appropriate predictions.

Linguistics also considers a variety of questions beyond this core -- think of sociolinguistics, historical linguistics, psycholinguistics, and neurolinguistics. These scientific questions are fair game as well for computational linguists, who might use models and algorithms to make sense of the data. In this case, we are not trying to model the competence of everyday speakers in their native language, but rather to automate the special kind of reasoning that linguists do, potentially enabling us to work on bigger datasets (or even new kinds of data) and draw more accurate conclusions. Similarly, computational linguists may design software tools to help document endangered languages.

Natural language processing is the art of solving engineering problems that need to analyze (or generate) natural language text. Here, the metric of success is not whether you designed a better scientific theory or proved that languages X and Y were historically related. Rather, the metric is whether you got good solutions on the engineering problem.

For example, you don't judge Google Translate on whether it captures what translation "truly is" or explains how human translators do their job. You judge it on whether it produces reasonably accurate and fluent translations for people who need to translate certain things in practice. The machine translation community has ways of measuring this, and they focus strongly on improving those scores.

NLP is mainly used to help people navigate and digest large quantities of information that already exist in text form. It is also used to produce better user interfaces so that humans can better communicate with computers and with other humans.

By saying that NLP is engineering, I don't mean that it is always focused on developing commercial applications. NLP may be used for scientific ends within other academic disciplines such as political science (blog posts), economics (financial news and reports), medicine (doctor's notes), digital humanities (literary works, historical sources), etc. But then it is being used as a tool within computational X-ology in order to answer the scientific questions of X-ologists, rather than the scientific questions of linguists.

Both fields make use of formal training in CS, linguistics, and machine learning. If you want to truly advance either field in a lasting way, you should develop enough strength to do original research in all three of these areas. It might help to go to a school with a strong interdisciplinary culture, where many of the CS faculty and students are actively interested in linguistics for its own sake (or vice-versa).

That said, NLP people often get away with relatively superficial linguistics. They look at the errors made by their current system, and learn only as much linguistics as they need to understand and fix the most prominent types of errors. After all, their goal is not a full theory but rather the simplest, most efficient approach that will get the job done.

Conversely, if you study computational linguistics in a linguistics department, you will typically get a lot more linguistics and a lot less CS/ML. The students in those departments are technically adept, since linguistics is quite a technical field. But they tend to know much less math and CS. So the computational courses tend to be providing only some exposure to formal language theory, programming, and applied NLP. (These courses are popular among linguistics students who hope to improve their employability.)

Eventually I hope the two research programs will draw even closer together. If we can build a strong model of the human linguistic capacity, then that should solve a wide range of NLP problems for us. So today's computational linguistics is developing methods for tomorrow's NLP. That's been historically true too.


Computational linguistics approaches the subject from the linguistics side; it is a branch of linguistics whose aim is to propose linguistic theories, frameworks, and models that a computer can process. I would count projects such as WordNet, treebanks, and TimeML in this category.

Natural language processing approaches it from the computer-science side and can be considered a subfield of computer science. Its goal is efficient algorithms for processing natural language. Character-sequence-labeling Chinese word segmentation, HMM POS tagging, the CKY and Earley algorithms, N-gram models, and the noisy channel model should all count as achievements of NLP.
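To make one item on that list concrete, here is a minimal sketch of a bigram (N = 2) language model with add-one smoothing. The three-sentence corpus is made up purely for illustration; real N-gram models are trained on large text collections.

```python
from collections import Counter

# Toy corpus, invented for illustration.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]       # sentence boundary markers
    unigrams.update(tokens[:-1])              # contexts (everything but </s>)
    bigrams.update(zip(tokens[:-1], tokens[1:]))

vocab_size = len(set(unigrams) | {"</s>"})

def bigram_prob(prev, word):
    """P(word | prev) with add-one (Laplace) smoothing."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

# "cat" follows "the" more often than "dog" does, so it scores higher.
print(bigram_prob("the", "cat"))  # 0.3
print(bigram_prob("the", "dog"))  # 0.2
```

Even this toy version shows the essential trade the answer describes: pure counting plus smoothing, with no linguistic analysis at all.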

The overall trend, though, is that the boundary between the two is blurring. Statistical NLP has achieved great results, but it relies heavily on statistical methods, with comparatively little deep linguistic thinking. Facing today's bottlenecks, many NLP researchers are bringing in linguistic knowledge to help them extract more training features and find more sensible angles of attack. On the computational-linguistics side, the work is still theoretical, but to judge whether a theory actually works you still have to run experiments on real corpora.

For example, take text classification; concretely, classifying news articles. This uses classifiers from computer science. CS people have built many excellent classifiers: naive Bayes, maximum entropy models, support vector machines, and so on. We feed a "bag of words" into the classifier, look at the output, and most of it is correct. This simple classification problem can be treated as pure NLP: what matters is the quality of the classifier algorithm, and very simple features suffice.
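That setup can be sketched in a few lines: a multinomial naive Bayes classifier over bag-of-words counts. The four training documents and two labels below are invented for illustration:

```python
import math
from collections import Counter, defaultdict

# Tiny invented training set: (document, label).
train = [
    ("the team won the match", "sports"),
    ("the player scored a goal", "sports"),
    ("stocks fell on the market", "finance"),
    ("the bank raised interest rates", "finance"),
]

label_counts = Counter()
word_counts = defaultdict(Counter)  # label -> bag-of-words counts
vocab = set()
for doc, label in train:
    label_counts[label] += 1
    word_counts[label].update(doc.split())
    vocab.update(doc.split())

def classify(doc):
    """Multinomial naive Bayes with add-one smoothing over bag-of-words features."""
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # log P(label) + sum over tokens of log P(word | label)
        score = math.log(label_counts[label] / sum(label_counts.values()))
        total = sum(word_counts[label].values())
        for word in doc.split():
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("the player won"))  # → sports
```

Note that the features really are just word counts; word order and syntax are ignored entirely, which is exactly why the answer calls this a "pure NLP" problem solvable with very simple features.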

Now suppose we have a more complex text-classification problem: classifying sentences in a dialogue as imperative, interrogative, or declarative. A bag of words is clearly no longer enough as classifier input; we need more information, such as syntactic structure.

Or take an even more complex application: intelligent dialogue question answering. Given a sentence, we want the computer to understand what it means, so we have to bring in some semantics. In practice, semantics is still exploited only at a fairly shallow level; there is not much usable material, which makes intelligent QA look like a purely statistical affair, and the result is that today's "intelligent" QA is not intelligent at all. If some computational-linguistics heavyweight produced a highly accurate semantic parser, such systems would improve enormously. At this stage it is very hard to squeeze further progress out of the computational systems and algorithms themselves; computational linguistics is hard too, of course, but the room for development there is visibly huge. So, in order to build these advanced applications, many NLP researchers are also working in that area.

Therefore, an NLP or computational-linguistics researcher should have both a CS foundation and linguistic insight. Solid mathematics, plus a bit of cognitive psychology (for genuine intelligence), is even better.


Natural language processing is closely related to linguistics research, but there are important differences. NLP does not study natural language in general; rather, it studies techniques that let computers effectively analyze and process natural language, in particular using computing power to process large-scale text efficiently.

A simple example: text classification is a major NLP application, yet it has precious little to do with computational linguistics!


I discussed this question with a computational-linguistics professor last month. I used to conflate natural language processing with computational linguistics. The two fields overlap so much that it is hard to draw a clear boundary between them. To tell them apart, you have to start from their essence, that is, from their goals.

Computational linguistics is a branch of linguistics, and linguistics, traced back to its roots, is a branch of psychology. The goal of linguistics, put simply, is to build a model of how the human brain processes language. Phonology, semantics, and syntax have all proceeded by continually proposing models, testing them, and revising them. The tree structures and movement operations in syntax, and the rules in phonology, are all attempts to understand how the brain processes language. One of linguists' greatest goals is a model that applies to all languages, with no exceptions.

Natural language processing is a branch of computer science, statistics, and mathematics; it is an applied science. The goal of NLP is to process natural language "correctly".

Computational linguistics emphasizes the process: its aim is to understand how the human brain processes language.

Natural language processing emphasizes the result: its aim is to reproduce the output of human language processing, not the process itself.

So NLP employs plenty of "black magic": machine learning, fancy probability theory, and so on. Machine learning in particular is a kind of sorcery: feed it large amounts of training data, and it hands you remarkably accurate answers.

Computational linguistics, by contrast, must ask of any proposed model whether it is plausible, whether it is too complex to fit within the brain's cognitive load. Machine learning and enormously complex statistics are clearly not something the human brain can do with ease.

Finally, an example.

Ask a computational linguist and an NLP researcher to translate the following center-embedded sentence:

"The rat the cat the dog bit chased escaped"

The computational linguist will say: the human brain cannot process this sentence, because the cognitive load of parsing it is too great. So they will output an Error.

The NLP researcher, using various parsing tools and probabilistic tools, will produce the correct translation: 那隻被狗咬的貓追著的老鼠逃走了.

The example may not be perfect, but that is the general idea.
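The contrast can actually be demonstrated: a CKY chart parser, one of the NLP staples mentioned in an earlier answer, handles this center-embedded sentence without difficulty, precisely because it has no cognitive-load limit. The toy CNF grammar below is invented just to cover this one sentence:

```python
from itertools import product

# Tiny CNF grammar, invented for illustration:
#   S  -> NP VI        (clause: subject + intransitive verb)
#   NP -> Det N | NP RC
#   RC -> NP VT        (object-gap relative clause, e.g. "the dog bit")
binary_rules = {("NP", "VI"): "S", ("Det", "N"): "NP",
                ("NP", "RC"): "NP", ("NP", "VT"): "RC"}
lexicon = {"the": "Det", "rat": "N", "cat": "N", "dog": "N",
           "bit": "VT", "chased": "VT", "escaped": "VI"}

def cky(words):
    """CKY chart parsing; chart[i][j] holds nonterminals covering words[i:j]."""
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1].add(lexicon[w])          # fill in preterminals
    for span in range(2, n + 1):                 # grow spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):            # try every split point
                for left, right in product(chart[i][k], chart[k][j]):
                    if (left, right) in binary_rules:
                        chart[i][j].add(binary_rules[(left, right)])
    return chart

sentence = "the rat the cat the dog bit chased escaped".split()
chart = cky(sentence)
print("S" in chart[0][len(sentence)])  # → True: the machine parses the embedding
```

The dynamic program never "forgets" a partially built constituent, so depth of embedding costs it nothing but table cells; a human parser, by contrast, runs out of working memory after roughly one level of center embedding.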

Everyone defines these two fields a little differently; this is the distinction I find most convincing, and I hope it helps.


On one hand, in terms of academic organization, computational linguistics belongs to linguistics; in Chinese universities it is usually housed in the Chinese department of the humanities school, whereas natural language processing mostly sits in the computer science school.

On the other hand, the two fields have different application scenarios. Computational linguistics pursues a unified model that can explain how language is generated and probe how the brain understands language. NLP's fundamental starting point is to let machines understand human language: word segmentation, entity recognition, POS tagging, and so on all serve to decompose language into units that a computer can compute on directly.

In today's AI language field, the great majority of practitioners come from the NLP side, with backgrounds in computer science, mathematics, statistics, and related areas; computational-linguistics graduates mainly study linguistic theory and see less practical application.
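One of those unit-level tasks, POS tagging, can be sketched as an HMM decoded with the Viterbi algorithm. The states, probabilities, and mini-lexicon below are hand-invented for illustration; in practice they are estimated from a tagged corpus.

```python
# Hand-specified toy HMM parameters (invented for illustration).
states = ["Det", "N", "V"]
start_p = {"Det": 0.8, "N": 0.1, "V": 0.1}
trans_p = {"Det": {"Det": 0.05, "N": 0.9,  "V": 0.05},
           "N":   {"Det": 0.1,  "N": 0.2,  "V": 0.7},
           "V":   {"Det": 0.6,  "N": 0.3,  "V": 0.1}}
emit_p = {"Det": {"the": 1.0},
          "N":   {"dog": 0.5, "bites": 0.2, "man": 0.3},
          "V":   {"bites": 0.8, "dog": 0.2}}

def viterbi(words):
    """Most probable tag sequence under the HMM (Viterbi dynamic programming)."""
    # V[t][state] = (best probability of reaching state at time t, best path)
    V = [{s: (start_p[s] * emit_p[s].get(words[0], 0.0), [s]) for s in states}]
    for w in words[1:]:
        row = {}
        for s in states:
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s].get(w, 0.0),
                 V[-1][prev][1] + [s])
                for prev in states)
            row[s] = (prob, path)
        V.append(row)
    return max(V[-1].values())[1]

print(viterbi(["the", "dog", "bites"]))  # → ['Det', 'N', 'V']
```

Note how the ambiguous words ("dog" and "bites" can each be noun or verb) are resolved purely by the transition probabilities, with no appeal to a linguistic theory of the sentence, which is the engineering stance the answers above describe.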

