Word2Vec in Practice with Gensim


This article on Word2Vec practice with Gensim is part of the author's Data Science and Machine Learning Handbook for Programmers; the accompanying code is in gensim.ipynb. Recommended prerequisite reading: Python Syntax Overview and Machine Learning Environment Setup, and the Scikit-Learn Cheat Sheet.

  • Word2Vec Tutorial

  • Getting Started with Word2Vec and GloVe in Python

Model Creation

Gensim's Word2Vec model expects a list of tokenized sentences as input, i.e. a two-dimensional array of tokens. Here we use a plain Python list for the moment, although it consumes a lot of RAM when the input dataset is large. Gensim only requires an iterable of ordered sentences, so in practice we can use a custom generator that keeps just a single sentence in memory at a time.

# import word2vec
from gensim.models import word2vec

# configure logging
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# toy dataset
raw_sentences = ["the quick brown fox jumps over the lazy dogs",
                 "yoyoyo you go home now to sleep"]

# tokenize: split each sentence into a list of words
sentences = [s.split() for s in raw_sentences]

# build the model
model = word2vec.Word2Vec(sentences, min_count=1)

# compare the similarity of two words
model.similarity('dogs', 'you')

Calling Word2Vec here actually makes two passes over the data: the first pass counts word frequencies to build the internal vocabulary structure, and the second pass trains the neural network. The two steps can also be run separately, which lets us control them manually for non-repeatable streams (e.g. streaming data from Kafka):

import gensim

model = gensim.models.Word2Vec(iter=1)  # an empty model, no training yet
model.build_vocab(some_sentences)       # can be a non-repeatable, 1-pass generator
model.train(other_sentences)            # can be a non-repeatable, 1-pass generator
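To make the "non-repeatable, 1-pass generator" idea concrete, here is a minimal sketch under the same older Gensim API (the file name corpus.txt and the helper stream_sentences are hypothetical; newer Gensim releases additionally expect total_examples and epochs arguments to train()). Each generator can be consumed only once, so we build one for the vocabulary pass and a fresh one for the training pass:

import gensim

def stream_sentences(path):
    # one-pass generator: yields each line as a list of tokens without loading the whole file
    with open(path) as f:
        for line in f:
            yield line.split()

model = gensim.models.Word2Vec(iter=1)             # empty model, no training yet
model.build_vocab(stream_sentences('corpus.txt'))  # first pass: collect word frequencies
model.train(stream_sentences('corpus.txt'))        # second pass: train on a fresh generator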

Word2Vec Parameters

  • min_count

model = Word2Vec(sentences, min_count=10)  # default value is 5

The frequency threshold we want depends on the size of the corpus. In a larger corpus, for example, we usually want to ignore words that occur only once or twice; the min_count parameter controls this cutoff. Reasonable values generally fall between 0 and 100.
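As a quick illustration of the cutoff (a minimal sketch reusing the toy sentences and the word2vec import from the first snippet; the model.vocab attribute assumes the same older Gensim API used throughout this article):

# with min_count=2, words seen fewer than 2 times are dropped from the vocabulary
model = word2vec.Word2Vec(sentences, min_count=2)

print('the' in model.vocab)   # True  -- "the" occurs twice in the toy corpus
print('fox' in model.vocab)   # False -- "fox" occurs only once, below the threshold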

  • size

The size parameter sets the dimensionality of the word vectors (the size of the hidden layer), not the number of layers; the default in Word2Vec is 100. Larger vectors need more training data to estimate well, but can improve overall accuracy; reasonable values range from 10 to a few hundred.

model = Word2Vec(sentences, size=200)  # default value is 100
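The setting is easy to verify: each word vector has exactly size dimensions. A minimal sketch, assuming a model trained with size=200 whose vocabulary contains the word dogs:

vec = model['dogs']   # raw vector for one in-vocabulary word
print(vec.shape)      # (200,) -- one entry per dimension requested via `size`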

  • workers

The workers parameter sets the number of threads used for parallel training; it only takes effect when Cython is installed:

model = Word2Vec(sentences, workers=4)  # default = 1 worker = no parallelization
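To check whether the compiled (Cython) training routines are actually available, Gensim exposes a FAST_VERSION flag. A minimal sketch; in the Gensim versions this article targets, -1 indicates the pure-Python fallback, in which case extra workers will not speed up training:

from gensim.models.word2vec import FAST_VERSION

print(FAST_VERSION)  # >= 0: compiled extension in use; -1: pure-Python fallback, no parallel speed-up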

External Corpora

In real training scenarios we usually work with much larger corpora. Taking the official Word2Vec text8 corpus as an example, we only need to change the corpus source fed to the model:

sentences = word2vec.Text8Corpus('text8')
model = word2vec.Word2Vec(sentences, size=200)

The sentences in this corpus are already tokenized, so they can be used directly. The author ran into an error the first time using this class, so the relevant Gensim source code is pasted below; it is also a useful reference when writing handlers for other custom corpora:

class Text8Corpus(object):
    """Iterate over sentences from the "text8" corpus, unzipped from http://mattmahoney.net/dc/text8.zip ."""
    def __init__(self, fname, max_sentence_length=MAX_WORDS_IN_BATCH):
        self.fname = fname
        self.max_sentence_length = max_sentence_length

    def __iter__(self):
        # the entire corpus is one gigantic line -- there are no sentence marks at all
        # so just split the sequence of tokens arbitrarily: 1 sentence = 1000 tokens
        sentence, rest = [], b''
        with utils.smart_open(self.fname) as fin:
            while True:
                text = rest + fin.read(8192)  # avoid loading the entire file (=1 line) into RAM
                if text == rest:  # EOF
                    words = utils.to_unicode(text).split()
                    sentence.extend(words)  # return the last chunk of words, too (may be shorter/longer)
                    if sentence:
                        yield sentence
                    break
                last_token = text.rfind(b' ')  # last token may have been split in two... keep for next iteration
                words, rest = (utils.to_unicode(text[:last_token]).split(),
                               text[last_token:].strip()) if last_token >= 0 else ([], text)
                sentence.extend(words)
                while len(sentence) >= self.max_sentence_length:
                    yield sentence[:self.max_sentence_length]
                    sentence = sentence[self.max_sentence_length:]

As mentioned above, for a large input corpus, or when we need to combine data spread across several directories on disk, we can read it through an iterator instead of loading everything into memory at once, which saves RAM:

import os
import gensim

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # iterate over every file in the directory, yielding one tokenized line at a time
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

sentences = MySentences('/some/directory')  # a memory-friendly iterator
model = gensim.models.Word2Vec(sentences)
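Because all preprocessing lives inside __iter__, that is also the natural place to normalize text before it reaches the model. A minimal sketch of a hypothetical NormalizedSentences variant that lowercases each line before splitting (any further cleaning, such as stripping punctuation, would go in the same place):

class NormalizedSentences(MySentences):
    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                # lowercase before tokenizing so "Dog" and "dog" share a single vector
                yield line.lower().split()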

Saving and Loading Models

model.save('text8.model')
2015-02-24 11:19:26,059 : INFO : saving Word2Vec object under text8.model, separately None
2015-02-24 11:19:26,060 : INFO : not storing attribute syn0norm
2015-02-24 11:19:26,060 : INFO : storing numpy array syn0 to text8.model.syn0.npy
2015-02-24 11:19:26,742 : INFO : storing numpy array syn1 to text8.model.syn1.npy

model1 = Word2Vec.load('text8.model')

model.save_word2vec_format('text.model.bin', binary=True)
2015-02-24 11:19:52,341 : INFO : storing 71290x200 projection weights into text.model.bin

model1 = word2vec.Word2Vec.load_word2vec_format('text.model.bin', binary=True)
2015-02-24 11:22:08,185 : INFO : loading projection weights from text.model.bin
2015-02-24 11:22:10,322 : INFO : loaded (71290, 200) matrix from text.model.bin
2015-02-24 11:22:10,322 : INFO : precomputing L2-norms of word weight vectors
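A model restored with load() keeps its full training state, so it can continue training on additional sentences, whereas save_word2vec_format() stores only the raw vectors and the result cannot be trained further. A minimal sketch (more_sentences is a hypothetical iterable of tokenized sentences; with this older API, words not already in the vocabulary are simply ignored):

model = word2vec.Word2Vec.load('text8.model')
model.train(more_sentences)  # resume training with additional tokenized sentences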

Model Prediction

Word2Vec's best-known capability is inferring related words in a semantic way:

model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]
model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
model.similarity('woman', 'man')
0.73723527
model.most_similar(['man'])
[(u'woman', 0.5686948895454407),
 (u'girl', 0.4957364797592163),
 (u'young', 0.4457539916038513),
 (u'luckiest', 0.4420626759529114),
 (u'serpent', 0.42716869711875916),
 (u'girls', 0.42680859565734863),
 (u'smokes', 0.4265017509460449),
 (u'creature', 0.4227582812309265),
 (u'robot', 0.417464017868042),
 (u'mortal', 0.41728296875953674)]

If we want the raw vector representation of a word, we can access it directly by indexing the model:

model['computer']  # raw NumPy vector of a word
array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)
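These raw vectors are exactly what similarity() operates on: it is the cosine similarity of the two word vectors. A minimal sketch reproducing it with NumPy (assuming both words are in the vocabulary):

import numpy as np

v1, v2 = model['woman'], model['man']
cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cosine)  # should closely match model.similarity('woman', 'man')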

Model Evaluation

Word2Vec training is unsupervised, so there is no objective evaluation criterion of the kind available in supervised learning; how good a model is depends more on the end application. Google has released roughly 20,000 syntactic and semantic test items, each following the format "A is to B as C is to D", commonly distributed as the questions-words.txt file used below.
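A few illustrative lines in the file's format (section headers begin with a colon, and each data line is one analogy A B C D):

: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
: family
boy girl brother sister

Passing the file to model.accuracy() then reports per-section and overall accuracy: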

model.accuracy('/tmp/questions-words.txt')
2014-02-01 22:14:28,387 : INFO : family: 88.9% (304/342)
2014-02-01 22:29:24,006 : INFO : gram1-adjective-to-adverb: 32.4% (263/812)
2014-02-01 22:36:26,528 : INFO : gram2-opposite: 50.3% (191/380)
2014-02-01 23:00:52,406 : INFO : gram3-comparative: 91.7% (1222/1332)
2014-02-01 23:13:48,243 : INFO : gram4-superlative: 87.9% (617/702)
2014-02-01 23:29:52,268 : INFO : gram5-present-participle: 79.4% (691/870)
2014-02-01 23:57:04,965 : INFO : gram7-past-tense: 67.1% (995/1482)
2014-02-02 00:15:18,525 : INFO : gram8-plural: 89.6% (889/992)
2014-02-02 00:28:18,140 : INFO : gram9-plural-verbs: 68.7% (482/702)
2014-02-02 00:28:18,140 : INFO : total: 74.3% (5654/7614)

It is worth stressing again that good performance on this test set does not necessarily mean Word2Vec will perform well in a real application; the model still needs to be adapted to the task at hand.
