I use these terms to indicate different research goals. The methodologies are often related and the communities overlap. We go to the same conferences (the strongest work in both fields appears at ACL, EMNLP, NAACL, etc.) and easily talk together about our problems and solutions.

Here's the distinction I personally make:

Computational linguistics is analogous to computational biology or any other computational fill-in-the-blank. It develops computational methods to answer the scientific questions of linguistics.

The core questions in linguistics involve the nature of linguistic representations and linguistic knowledge, and how linguistic knowledge is acquired and deployed in the production and comprehension of language. Answering these questions describes the human language ability and may help to explain the distribution of linguistic data and behavior that we actually observe.

In computational linguistics, we propose formal answers to these core questions. Linguists are really asking what humans are computing and how. So we mathematically define classes of linguistic representations and formal grammars (which are usually probabilistic models nowadays) that seem adequate to capture the range of phenomena in human languages. We study their mathematical properties, and devise efficient algorithms for learning, production, and comprehension. Because the algorithms can actually run, we can test our models and find out whether they make appropriate predictions.
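To make "the algorithms can actually run" concrete, here is a minimal sketch of Viterbi-style CKY parsing with a probabilistic grammar. The grammar, its probabilities, and the function name `best_parse_prob` are invented for illustration; a real model would be far larger and learned from data.

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form. Rules and probabilities are invented.
BINARY = {  # (left child, right child) -> [(parent, rule probability)]
    ("NP", "VP"): [("S", 1.0)],
    ("Det", "N"): [("NP", 1.0)],
    ("V", "NP"): [("VP", 0.5)],
}
LEXICAL = {  # word -> [(label, rule probability)]
    "the": [("Det", 1.0)],
    "dog": [("N", 0.5)],
    "cat": [("N", 0.5)],
    "chased": [("V", 1.0)],
    "slept": [("VP", 0.5)],
}


def best_parse_prob(words, root="S"):
    """Return the probability of the most likely parse, or 0.0 if none exists."""
    n = len(words)
    # chart[(i, j)][label] = best probability of `label` spanning words[i:j]
    chart = defaultdict(dict)
    for i, word in enumerate(words):
        for label, p in LEXICAL.get(word, []):
            chart[(i, i + 1)][label] = p
    # Build wider spans from pairs of adjacent narrower spans.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point
                for left, p_left in chart[(i, k)].items():
                    for right, p_right in chart[(k, j)].items():
                        for parent, p_rule in BINARY.get((left, right), []):
                            cand = p_rule * p_left * p_right
                            if cand > chart[(i, j)].get(parent, 0.0):
                                chart[(i, j)][parent] = cand
    return chart[(0, n)].get(root, 0.0)


print(best_parse_prob("the dog chased the cat".split()))  # grammatical
print(best_parse_prob("dog the chased".split()))          # no parse: 0.0
```

The model assigns a nonzero probability to word strings the grammar can derive and zero to the rest, which is the kind of prediction one can then compare against observed linguistic data.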

Linguistics also considers a variety of questions beyond this core -- think of sociolinguistics, historical linguistics, psycholinguistics, and neurolinguistics. These scientific questions are fair game as well for computational linguists, who might use models and algorithms to make sense of the data. In this case, we are not trying to model the competence of everyday speakers in their native language, but rather to automate the special kind of reasoning that linguists do, potentially enabling us to work on bigger datasets (or even new kinds of data) and draw more accurate conclusions. Similarly, computational linguists may design software tools to help document endangered languages.

Natural language processing is the art of solving engineering problems that need to analyze (or generate) natural language text. Here, the metric of success is not whether you designed a better scientific theory or proved that languages X and Y were historically related. Rather, the metric is whether you got good solutions on the engineering problem.

For example, you don't judge Google Translate on whether it captures what translation "truly is" or explains how human translators do their job. You judge it on whether it produces reasonably accurate and fluent translations for people who need to translate certain things in practice. The machine translation community has ways of measuring this, and they focus strongly on improving those scores.
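As one illustration of such a score: a widely used automatic metric, BLEU, is built on clipped n-gram precision, which counts how many of the system's n-grams also occur in a human reference translation. The sketch below shows only that one ingredient, with a single reference; the function names are my own, and full BLEU additionally combines several n-gram orders and applies a brevity penalty.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also appear in the reference.

    Each candidate n-gram is credited at most as many times as it occurs
    in the reference, so repeating a correct word earns no extra credit.
    """
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


reference = "the cat sat".split()
print(clipped_precision("the the the cat".split(), reference, n=1))  # 0.5
print(clipped_precision("the cat sat".split(), reference, n=2))      # 1.0
```

Note that nothing here asks whether the system translates the way a human does; the score only compares outputs.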

NLP is mainly used to help people navigate and digest large quantities of information that already exist in text form. It is also used to produce better user interfaces so that humans can better communicate with computers and with other humans.

By saying that NLP is engineering, I don't mean that it is always focused on developing commercial applications. NLP may be used for scientific ends within other academic disciplines such as political science (blog posts), economics (financial news and reports), medicine (doctor's notes), digital humanities (literary works, historical sources), etc. But then it is being used as a tool within computational X-ology in order to answer the scientific questions of X-ologists, rather than the scientific questions of linguists.

Both fields make use of formal training in CS, linguistics, and machine learning. If you want to truly advance either field in a lasting way, you should develop enough strength to do original research in all three of these areas. It might help to go to a school with a strong interdisciplinary culture, where many of the CS faculty and students are actively interested in linguistics for its own sake (or vice-versa).

That said, NLP people often get away with relatively superficial linguistics. They look at the errors made by their current system, and learn only as much linguistics as they need to understand and fix the most prominent types of errors. After all, their goal is not a full theory but rather the simplest, most efficient approach that will get the job done.

Conversely, if you study computational linguistics in a linguistics department, you will typically get a lot more linguistics and a lot less CS/ML. The students in those departments are technically adept, since linguistics is quite a technical field. But they tend to know much less math and CS. So the computational courses tend to provide only some exposure to formal language theory, programming, and applied NLP. (These courses are popular among linguistics students who hope to improve their employability.)

Eventually I hope the two research programs will draw even closer together. If we can build a strong model of the human linguistic capacity, then that should solve a wide range of NLP problems for us. So today's computational linguistics is developing methods for tomorrow's NLP. That's been historically true too.

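Computational linguistics starts from the perspective of linguistics and is a branch of linguistics. Its aim is to propose linguistic theories, frameworks, and models that a computer can process. I think projects such as WordNet, treebanks, and TimeML all belong in this category.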

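Natural language processing starts from the perspective of computer science and counts as a subfield of computer science. Its aim is efficient algorithms that can be used to process natural language. Chinese word segmentation based on character sequence labeling, HMM part-of-speech tagging, the CKY and Earley algorithms, N-grams, and the noisy channel model should all count as achievements of NLP.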

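As another example, suppose we want to build a much more complex application: intelligent conversational question answering. Given a sentence, we need the computer to understand what it means, so you have to bring in some semantics. In practice, the exploitation of semantics is still at a fairly shallow level and there is not much to draw on, which makes intelligent question answering look like a purely statistical affair. The result is that today's intelligent question answering is not intelligent at all. If some leading figure in computational linguistics produced a semantic parser with extremely high accuracy, such a system would improve enormously. At this stage it is already very hard to make progress on computational systems or algorithms. Computational linguistics is certainly hard too, but the visible room for development is huge. So, in order to build these advanced applications, many NLP researchers are also working in this area.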

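Computational linguistics, by contrast, has to consider when proposing a model whether the model is plausible, and whether it is so complex that it exceeds the cognitive load the human brain can handle. Machine learning and enormously complex statistics are clearly not things the human brain can easily do.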


Ask a computational linguist and an NLP researcher to translate the following center-embedded sentence:

"The rat the cat the dog bit chased escaped"

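The computational linguist will say: the human brain cannot process this sentence, because the cognitive load while processing it is too great. So they will output Error.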
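A rough way to picture that load, as a toy sketch only: count how many subject nouns are waiting for their verb at the same time. The word lists, the function names, and the memory limit of 2 below are all assumptions made for illustration, not a claim about how any actual model of human parsing works.

```python
# Hand-written toy lexicon (an assumption; it covers only these examples).
NOUNS = {"rat", "cat", "dog"}
VERBS = {"bit", "chased", "escaped"}


def max_pending_subjects(sentence):
    """Largest number of subject nouns simultaneously waiting for a verb."""
    pending = 0
    deepest = 0
    for word in sentence.lower().split():
        if word in NOUNS:
            pending += 1                  # a new subject opens a clause
            deepest = max(deepest, pending)
        elif word in VERBS and pending > 0:
            pending -= 1                  # a verb closes the innermost clause
    return deepest


def comprehend(sentence, limit=2):
    """Return "Error" when the load exceeds an assumed memory limit."""
    return "Error" if max_pending_subjects(sentence) > limit else "OK"


print(comprehend("The rat the cat the dog bit chased escaped"))
print(comprehend("The dog bit the cat that chased the rat that escaped"))
```

The center-embedded sentence keeps three subjects open at once, while the second sentence describes the same events with never more than one open, so only the first exceeds the assumed limit.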
