Text Analysis Notes, Part 1

02-04

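Recently I wanted to analyze some text, so I decided to dig into text analysis techniques and see how far the analysis can actually go. Back when I was working on compiler-related things, I paid some attention to natural language analysis but never went very deep. Looking again over the past couple of years, there is now a lot of high-quality material available, and many directions have made solid progress.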

I used to take courses on Coursera (www.coursera.org) regularly, so that was the first place I looked. I found two courses from UIUC:

Text Mining and Analytics

Text Retrieval and Search Engines

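Both are taught by ChengXiang Zhai. He also recommends a number of related papers and books, including his own, as course references. All of them can currently be found online for free: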

1. C. Zhai and S. Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. ACM and Morgan & Claypool Publishers, 2016.

2. Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. MIT Press. Cambridge, MA: May 1999.

3. ChengXiang Zhai, Exploiting context to identify lexical atoms: A statistical view of linguistic context. Proceedings of the International and Interdisciplinary Conference on Modelling and Using Context (CONTEXT-97), Rio de Janeiro, Brazil, Feb. 4-6, 1997, pp.

4. Shan Jiang and ChengXiang Zhai, Random walks on adjacency graphs for mining lexical relations from big text data. Proceedings of IEEE BigData Conference 2014, pp.

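Natural language processing (NLP) is the foundation of text analysis. The overall NLP pipeline resembles the analysis of a programming language in its stages (lexical, syntactic, semantic, and so on), but the resemblance stops at the process; the core problems are very different. Because natural language is meant for communication between people, speakers implicitly leave out a great deal of common-sense knowledge, and that knowledge is a prerequisite for understanding what is said. People pick it up naturally through learning and interaction, whereas computers have a much harder time with it. On top of that, natural language expressions often contain ambiguity, which is another problem for computers.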

Ambiguity can be divided into two levels: word-level ambiguity and syntactic ambiguity.
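To make the two levels concrete for myself, here is a minimal sketch in Python. It is not from the courses or the books; the tiny grammar, the lexicon, and the function name `count_parses` are all made up for illustration. The lexicon gives "saw" two possible tags (word-level ambiguity), and a CYK-style chart counts how many parse trees the toy grammar allows for a sentence, which exposes the classic question of where a prepositional phrase attaches (syntactic ambiguity).

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form: (parent, left, right).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # PP attaches to a noun phrase
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to a verb phrase
    ("PP", "P", "NP"),
]

# Hypothetical lexicon: each word maps to its possible part-of-speech tags.
LEXICON = {
    "a": {"Det"},
    "man": {"N"},
    "boy": {"N"},
    "telescope": {"N"},
    "saw": {"V", "N"},    # word-level ambiguity: verb or noun
    "with": {"P"},
}


def count_parses(sentence, start="S"):
    """Count the distinct parse trees for `sentence` under the toy grammar."""
    words = sentence.lower().split()
    n = len(words)
    # chart[(i, j)][symbol] = number of trees deriving words[i:j] from symbol
    chart = defaultdict(lambda: defaultdict(int))

    # Fill in the single-word spans from the lexicon.
    for i, word in enumerate(words):
        for tag in LEXICON.get(word, ()):
            chart[(i, i + 1)][tag] += 1

    # Combine shorter spans into longer ones, shortest first.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    left_count = chart[(i, k)].get(left, 0)
                    right_count = chart[(k, j)].get(right, 0)
                    if left_count and right_count:
                        chart[(i, j)][parent] += left_count * right_count

    return chart[(0, n)].get(start, 0)


if __name__ == "__main__":
    print(sorted(LEXICON["saw"]))                            # ['N', 'V']
    print(count_parses("A man saw a boy"))                   # 1
    print(count_parses("A man saw a boy with a telescope"))  # 2
```

The second sentence gets two trees because "with a telescope" can attach either to "a boy" (the boy has the telescope) or to "saw a boy" (the man used the telescope). The noun reading of "saw" is in the lexicon but never survives into a full parse here, which is a small example of syntax resolving a word-level ambiguity.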

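PS: I have only just started on this topic, so these are just a few quick notes. Corrections from the experts are very welcome.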