Word Embeddings: The GloVe Code

Source code:

stanfordnlp/GloVe

Algorithm Implementation

1. vocab_count

The input to vocab_count is a sequence of whitespace-separated tokens, such as the output of the Stanford Tokenizer. vocab_count counts word frequencies, and it can truncate the vocabulary by a minimum word frequency and a maximum vocabulary size. Its usage is shown below:

$ ./vocab_count
Simple tool to extract unigram counts
Author: Jeffrey Pennington (jpennin@stanford.edu)
Usage options:
    -verbose <int>
        Set verbosity: 0, 1, or 2 (default)
    -max-vocab <int>
        Upper bound on vocabulary size, i.e. keep the <int> most frequent words. The minimum frequency words are randomly sampled so as to obtain an even distribution over the alphabet.
    -min-count <int>
        Lower limit such that words which occur fewer than <int> times are discarded.
Example usage:
./vocab_count -verbose 2 -max-vocab 100000 -min-count 10 < corpus.txt > vocab.txt

The output of vocab_count is a vocabulary file: one word per line, together with its frequency.
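As a rough sketch of the same logic (the real tool is written in C, and it handles ties at the -max-vocab cutoff by random sampling as described in the help text above, which this sketch skips), a minimal Python version could look like this; MAX_VOCAB and MIN_COUNT mirror -max-vocab and -min-count:

import sys
from collections import Counter

MAX_VOCAB = 100000  # mirrors -max-vocab
MIN_COUNT = 10      # mirrors -min-count

# Count whitespace-separated tokens read from stdin.
counts = Counter()
for line in sys.stdin:
    counts.update(line.split())

# Keep the MAX_VOCAB most frequent words, then drop anything below MIN_COUNT.
for word, count in counts.most_common(MAX_VOCAB):
    if count >= MIN_COUNT:
        print(word, count)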

2. cooccur (computing word co-occurrence counts)

The input to cooccur is the corpus plus the output of vocab_count, i.e. the vocabulary with frequencies; the output is a table of word-word co-occurrence counts. The command format is as follows:

./cooccur -verbose 2 -symmetric 0 -window-size 10 -vocab-file vocab.txt -memory 8.0 -overflow-file tempoverflow < corpus.txt > cooccurrence.bin
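To make the step concrete, here is an illustrative Python sketch of windowed co-occurrence counting. The 1/d distance weighting is from the paper (a pair of words d positions apart contributes 1/d to their count); the function and its in-memory dictionary are my own simplification, since the real C code spills overflow to disk, which is what -memory and -overflow-file control. With -symmetric 0, as in the command above, only the left context is counted.

from collections import defaultdict

def cooccurrence(tokens, vocab, window_size=10, symmetric=False):
    """Return {(center_id, context_id): weighted count} for one token stream."""
    counts = defaultdict(float)
    ids = [vocab[t] for t in tokens if t in vocab]
    for i, center in enumerate(ids):
        for d in range(1, window_size + 1):  # scan the left context window
            j = i - d
            if j < 0:
                break
            counts[(center, ids[j])] += 1.0 / d  # closer pairs weigh more
            if symmetric:  # with -symmetric 1, count the right context too
                counts[(ids[j], center)] += 1.0 / d
    return counts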

3. shuffle

shuffle randomizes the order of cooccur's output, which amounts to shuffling the training examples; many machine learning pipelines include such a shuffling step. The command format is as follows:

./shuffle -verbose 2 -memory 8.0 < cooccurrence.bin > cooccurrence.shuf.bin
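For intuition, here is a Python sketch of this step. It assumes each binary record is two 4-byte ints (the word ids) followed by an 8-byte double (the co-occurrence value), matching the CREC struct in the C source; that layout is an assumption about the default build. The real tool also shuffles in memory-sized chunks (controlled by -memory) rather than loading everything at once.

import random
import struct
import sys

rec = struct.Struct("iid")  # assumed layout: word1, word2, cooccurrence value

data = sys.stdin.buffer.read()  # simplification: the C code works in chunks
records = [data[i:i + rec.size] for i in range(0, len(data), rec.size)]
random.shuffle(records)
sys.stdout.buffer.write(b"".join(records))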

4. glove

glove is the actual model-training stage. Its inputs are the outputs of vocab_count and shuffle, and its final output is in a format similar to word2vec's: one vector per word. The glove code resembles word2vec's in places, but its logic is somewhat simpler. The command format is as follows:

./glove -input-file cooccurrence.shuf.bin -vocab-file vocab.txt -save-file vectors -gradsq-file gradsq -verbose 2 -vector-size 100 -threads 16 -alpha 0.75 -x-max 100.0 -eta 0.05 -binary 2 -model 2

How about that: does the command above look familiar? It is a lot like word2vec. Note, however, that the alpha parameter here is definitely not the learning rate; eta is the learning rate. For what alpha actually means, go back and check the paper.
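To make the distinction concrete, here is a simplified Python sketch of one training step. The weighting function f(x) = (x / x_max)^alpha for x < x_max (and 1 otherwise) and the weighted least-squares cost f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2 are from the paper, so alpha shapes how much frequent pairs count. The plain gradient step below is a simplification: the actual code uses AdaGrad (whose accumulators are what -gradsq-file saves).

import numpy as np

def glove_step(w, w_tilde, b, b_tilde, x_ij,
               alpha=0.75, x_max=100.0, eta=0.05):
    """One simplified gradient step on a single co-occurrence record."""
    # alpha shapes the weighting function; eta is the learning rate.
    f = (x_ij / x_max) ** alpha if x_ij < x_max else 1.0
    diff = w @ w_tilde + b + b_tilde - np.log(x_ij)  # error for this word pair
    g = f * diff                                     # shared gradient factor
    return (w - eta * g * w_tilde,   # updated center vector
            w_tilde - eta * g * w,   # updated context vector
            b - eta * g,             # updated center bias
            b_tilde - eta * g)       # updated context bias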

If you are interested, go read the source code; there is not much of it, and the logic is not too hard to follow.

