Python Natural Language Processing Notes (1): Language Processing and Python

Background: These notes are compiled from the Chinese edition of "Natural Language Processing with Python" and the online edition of the book on the NLTK website. They cover only the nltk module, omitting the book's Python primer and its broader discussion of natural language processing. Since the online edition has been updated to Python 3 and NLTK 3, the code in these notes uses Python 3.

1. Environment Setup

1.1 Installing nltk

Omitted for now.

1.2 Downloading the book data

>>> import nltk
>>> nltk.download()

Calling nltk.download() opens the NLTK downloader. On the Collections tab, select "book" and click the Download button to fetch the data used in the book.

>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: texts() or sents() to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

2. Searching Text

2.1 Searching for a word and viewing its context

>>> text1.concordance("monstrous", 50)
Displaying 11 of 11 matches:
, one was of a most monstrous size . ... This cam
S . " Touching that monstrous bulk of the whale o
heathenish array of monstrous clubs and spears .
, and wondered what monstrous cannibal and savage
ed the flood ; most monstrous and most mountainou
t at Moby Dick as a monstrous fable , or still wo
" CHAPTER 55 Of the Monstrous Pictures of Whales
 connexion with the monstrous pictures of whales
on those still more monstrous stories of them whi
ummaged out of this monstrous cabinet there is no
s ; for Whales of a monstrous size are oftentimes

The concordance() method takes three parameters: the first is the string to search for, the second is the display width of each match with its surrounding context (default width=79), and the third, lines, caps how many matches are shown when there are many (default lines=25). The command above searches Moby Dick for 'monstrous' with the width set to 50.
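
The behavior of concordance() can be approximated in plain Python. The sketch below (function name and details are my own, not NLTK's; real concordance matching is token-based and case-insensitive) scans the text and prints each match with a fixed-width window of context:

```python
def simple_concordance(tokens, word, width=50, lines=25):
    """Rough sketch of a concordance: return each occurrence of `word`
    with roughly `width` characters of surrounding context, at most
    `lines` results. Substring matches are not filtered out here."""
    half = (width - len(word)) // 2
    text = " ".join(tokens)
    results = []
    start = 0
    while len(results) < lines:
        i = text.find(word, start)
        if i == -1:
            break
        left = text[max(0, i - half):i]
        right = text[i + len(word):i + len(word) + half]
        results.append(left + word + right)
        start = i + 1
    return results

snippets = simple_concordance(
    "we saw a monstrous whale of monstrous size".split(),
    "monstrous", width=30)
for s in snippets:
    print(s)
```

Each printed line centers the target word, which is what makes a concordance easy to scan by eye.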

>>> text3.concordance("lived", 70, 10)
Displaying 10 of 38 matches:
en they were created . And Adam lived an hundred and thirty years , a
rs : And all the days that Adam lived were nine hundred and thirty ye
irty yea and he died . And Seth lived an hundred and five years , and
ars , and begat Enos : And Seth lived after he begat Enos eight hundr
 years : and he died . And Enos lived ninety years , and begat Cainan
s , and begat Cainan : And Enos lived after he begat Cainan eight hun
ears : and he died . And Cainan lived seventy years and begat Mahalal
d begat Mahalaleel : And Cainan lived after he begat Mahalaleel eight
 : and he died . And Mahalaleel lived sixty and five years , and bega
nd begat Jared : And Mahalaleel lived after he begat Jared eight hund

The command above searches Genesis for 'lived' to see how long everyone lived.

2.2 Finding words that appear in similar contexts

>>> text2.similar("love", 10)
affection sister heart mother time see town life it dear

The similar() method lists words that appear in contexts similar to those of a target word. The first parameter is the target word; the second is how many similar words to return (default num=20). The command above finds 10 words that appear in contexts similar to 'love' in Sense and Sensibility.

2.3 Finding contexts shared by several words

>>> text2.common_contexts(["monstrous", "very"])
a_pretty am_glad a_lucky is_pretty be_glad

The common_contexts() method shows the contexts shared by all the words in a list.
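
The idea behind common_contexts() can be sketched in plain Python: collect the (previous word, next word) pair around every occurrence of each target, then intersect the sets (the helper name is my own; NLTK additionally lowercases tokens and formats each pair as "left_right"):

```python
def common_contexts(tokens, word1, word2):
    """Return the set of (left, right) neighbor pairs in which
    both word1 and word2 occur -- a rough sketch of what NLTK's
    common_contexts() computes."""
    def contexts(target):
        return {(tokens[i - 1], tokens[i + 1])
                for i in range(1, len(tokens) - 1)
                if tokens[i] == target}
    return contexts(word1) & contexts(word2)

tokens = "it is a monstrous deal and a very deal indeed".split()
shared = common_contexts(tokens, "monstrous", "very")
print(shared)
```

Here both words occur between 'a' and 'deal', so that pair is their one shared context.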

2.4 Plotting word positions with a dispersion plot

>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
Backend TkAgg is interactive backend. Turning interactive mode on.

The dispersion_plot() method shows word positions as a dispersion plot. The code above examines some notable patterns of word usage across the U.S. presidential inaugural addresses; the horizontal axis is each occurrence's offset from the beginning of the text, in words. The result is shown in the figure below. (The Matplotlib package must be installed first.)
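
The underlying data for a dispersion plot is simply the token offset of every occurrence. Without Matplotlib, those offsets can be computed directly (a minimal sketch, not NLTK's implementation):

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of positions (counted from the
    start of the text) at which it occurs -- the x-values that
    dispersion_plot() draws as vertical marks."""
    return {t: [i for i, w in enumerate(tokens) if w == t]
            for t in targets}

tokens = "freedom and duty , freedom for citizens".split()
offsets = word_offsets(tokens, ["freedom", "citizens"])
print(offsets)
```

Each list of offsets corresponds to one row of marks in the plot.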

3. Counting Words

3.1 Computing the length of a text

>>> len(text3)
44764

The length of a text counts words and punctuation together, i.e. the total number of "tokens". The code above counts the tokens in Genesis.

3.2 Vocabulary

>>> sorted(set(text3))
['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)',
'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech',
'Abr', 'Abrah', 'Abraham', 'Abram', 'Accad', 'Achbor', 'Adah', ...]
>>> len(set(text3))
2789

set() produces the text's vocabulary, i.e. the collection of distinct tokens. The code above sorts it with sorted() (the listing has been truncated by hand) and then computes the size of the vocabulary.

3.3 Lexical richness

>>> len(set(text3)) / len(text3)
0.06230453042623537
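
This ratio (distinct tokens over total tokens) is often wrapped in a small helper so it can be reused on any token list; a minimal sketch:

```python
def lexical_diversity(tokens):
    """Lexical richness: vocabulary size divided by text length."""
    return len(set(tokens)) / len(tokens)

# 4 distinct tokens out of 6 total
print(lexical_diversity(["to", "be", "or", "not", "to", "be"]))
```

A lower value means words are repeated more; Genesis above scores about 0.06, i.e. each word is used about 16 times on average.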

3.4 Word percentage

>>> 100 * text4.count('a') / len(text4)
1.4643016433938312

count() returns the number of occurrences of a word in a text; the code above computes the percentage of the inaugural address corpus taken up by the word 'a'.
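
The same calculation can be packaged as a helper (a sketch; the name is my own):

```python
def percentage(count, total):
    """What percent of `total` does `count` represent?"""
    return 100 * count / total

tokens = ["a", "whale", "is", "a", "mammal"]
print(percentage(tokens.count("a"), len(tokens)))  # 2 of 5 tokens -> 40.0
```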

4. Computing with Language: Simple Statistics

4.1 Frequency distributions

>>> fdist1 = FreqDist(text1)
>>> print(fdist1)
<FreqDist with 19317 samples and 260819 outcomes>
>>> fdist1.most_common(10)
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024),
('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982)]
>>> fdist1['whale']
906

Calling FreqDist() with a text as its argument shows that Moby Dick has 260,819 tokens ("outcomes") and a vocabulary of 19,317 ("samples"). The expression most_common(10) returns the 10 most frequent words together with their counts.
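
In NLTK 3, FreqDist is built on the standard library's collections.Counter, so its core behavior can be illustrated without NLTK at all (a sketch of the counting logic, not NLTK code):

```python
from collections import Counter

tokens = "the whale , the white whale".split()
fdist = Counter(tokens)

print(len(fdist))            # number of samples (distinct tokens)
print(sum(fdist.values()))   # number of outcomes (total tokens)
print(fdist.most_common(2))  # the two most frequent tokens with counts
print(fdist["whale"])        # frequency of one token
```

Looking up a token that never occurs returns 0 rather than raising an error, which FreqDist mirrors.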

4.2 Frequency distribution plots

>>> fdist1.plot(50, cumulative=True)

The code above plots the cumulative frequency of the 50 most common words in Moby Dick; these words account for nearly half of all tokens.

5. Fine-grained Selection of Words

5.1 Selecting words by character length

>>> V = set(text1)
>>> long_words = [w for w in V if len(w) > 15]
>>> sorted(long_words)
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically',
'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations',
'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness',
'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities',
'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness',
'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']

The commands above store the vocabulary of Moby Dick in the variable V, collect the words longer than 15 characters into long_words, and finally sort them.

5.2 Selecting words with multiple conditions

>>> fdist5 = FreqDist(text5)
>>> sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)
['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question',
'actually', 'anything', 'computer', 'cute.-ass', 'everyone', 'football',
'innocent', 'listening', 'remember', 'seriously', 'something', 'together',
'tomorrow', 'watching']

The commands above build the frequency distribution of the chat corpus, then select (and sort) the words in its vocabulary that are longer than 7 characters and occur more than 7 times.

6. Collocations and Bigrams

6.1 Bigrams

>>> list(nltk.bigrams(['more', 'is', 'said', 'than', 'done']))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]

A "collocation" is a sequence of words that frequently occur together. To find collocations, we start by extracting the bigrams of a text's words. The code above uses bigrams() to get the bigrams of the words passed in.
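
Under the hood, pairing a sequence with itself shifted by one is all a bigram extractor needs; a pure-Python equivalent of the call above:

```python
def bigrams(tokens):
    """Return each adjacent pair of tokens, like nltk.bigrams()."""
    return list(zip(tokens, tokens[1:]))

pairs = bigrams(["more", "is", "said", "than", "done"])
print(pairs)
```

zip stops at the shorter argument, so the last token is never left without a partner.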

6.2 Collocations

>>> text4.collocations()
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
>>> text8.collocations()
would like; medium build; social drinker; quiet nights; non smoker;
long term; age open; Would like; easy going; financially secure; fun
times; similar interests; Age open; weekends away; poss rship; well
presented; never married; single mum; permanent relationship; slim
build

The code above uses the collocations() method to extract collocations from two corpora; by default it returns 20.
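
NLTK ranks candidate bigrams with a statistical association measure and filters out rare and stopword pairs, but a crude frequency-only sketch conveys the basic idea (my own simplification, not NLTK's algorithm):

```python
from collections import Counter

def frequent_bigrams(tokens, min_count=2):
    """Crude collocation sketch: adjacent word pairs that recur at
    least min_count times. NLTK's collocations() instead scores pairs
    with an association measure so that frequent-but-uninteresting
    pairs like ('of', 'the') do not dominate."""
    counts = Counter(zip(tokens, tokens[1:]))
    return [pair for pair, n in counts.items() if n >= min_count]

tokens = "fellow citizens of the United States , fellow citizens".split()
found = frequent_bigrams(tokens)
print(found)
```

On real text this naive version mostly surfaces function-word pairs, which is exactly why collocation finders use association scores rather than raw counts.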

