Word Embeddings: The GloVe Code

Source code:

stanfordnlp/GloVe

Algorithm Implementation

1. vocab_count

The input to vocab_count is a sequence of whitespace-separated tokens, such as the output of the Stanford Tokenizer. vocab_count counts word frequencies, and it can truncate the vocabulary by a minimum word frequency and a maximum vocabulary size. Its usage is shown below:

$ ./vocab_count
Simple tool to extract unigram counts
Author: Jeffrey Pennington (jpennin@stanford.edu)
Usage options:
    -verbose <int>
        Set verbosity: 0, 1, or 2 (default)
    -max-vocab <int>
        Upper bound on vocabulary size, i.e. keep the <int> most frequent words. The minimum frequency words are randomly sampled so as to obtain an even distribution over the alphabet.
    -min-count <int>
        Lower limit such that words which occur fewer than <int> times are discarded.
Example usage:
./vocab_count -verbose 2 -max-vocab 100000 -min-count 10 < corpus.txt > vocab.txt

The output of vocab_count is a vocabulary file: one word per line, together with its frequency.
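As a rough sketch of the same logic (the real tool is written in C, and it handles ties at the -max-vocab cutoff by random sampling as described in the help text above, which this sketch skips), a minimal Python version could look like this; MAX_VOCAB and MIN_COUNT mirror -max-vocab and -min-count:

import sys
from collections import Counter

MAX_VOCAB = 100000  # mirrors -max-vocab
MIN_COUNT = 10      # mirrors -min-count

# Count whitespace-separated tokens read from stdin.
counts = Counter()
for line in sys.stdin:
    counts.update(line.split())

# Keep the MAX_VOCAB most frequent words, then drop anything below MIN_COUNT.
for word, count in counts.most_common(MAX_VOCAB):
    if count >= MIN_COUNT:
        print(word, count)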

2. cooccur (computing word co-occurrence counts)

The input to cooccur is the corpus plus the output of vocab_count, i.e. the vocabulary with frequencies; the output is a table of word-word co-occurrence counts. The command format is as follows:

./cooccur -verbose 2 -symmetric 0 -window-size 10 -vocab-file vocab.txt -memory 8.0 -overflow-file tempoverflow < corpus.txt > cooccurrence.bin
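To make the step concrete, here is an illustrative Python sketch of windowed co-occurrence counting. The 1/d distance weighting is from the paper (a pair of words d positions apart contributes 1/d to their count); the function and its in-memory dictionary are my own simplification, since the real C code spills overflow to disk, which is what -memory and -overflow-file control. With -symmetric 0, as in the command above, only the left context is counted.

from collections import defaultdict

def cooccurrence(tokens, vocab, window_size=10, symmetric=False):
    """Return {(center_id, context_id): weighted count} for one token stream."""
    counts = defaultdict(float)
    ids = [vocab[t] for t in tokens if t in vocab]
    for i, center in enumerate(ids):
        for d in range(1, window_size + 1):  # scan the left context window
            j = i - d
            if j < 0:
                break
            counts[(center, ids[j])] += 1.0 / d  # closer pairs weigh more
            if symmetric:  # with -symmetric 1, count the right context too
                counts[(ids[j], center)] += 1.0 / d
    return counts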

3. shuffle

shuffle randomizes the order of cooccur's output, which amounts to shuffling the training examples; many machine learning pipelines include such a shuffling step. The command format is as follows:

./shuffle -verbose 2 -memory 8.0 < cooccurrence.bin > cooccurrence.shuf.bin
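For intuition, here is a Python sketch of this step. It assumes each binary record is two 4-byte ints (the word ids) followed by an 8-byte double (the co-occurrence value), matching the CREC struct in the C source; that layout is an assumption about the default build. The real tool also shuffles in memory-sized chunks (controlled by -memory) rather than loading everything at once.

import random
import struct
import sys

rec = struct.Struct("iid")  # assumed layout: word1, word2, cooccurrence value

data = sys.stdin.buffer.read()  # simplification: the C code works in chunks
records = [data[i:i + rec.size] for i in range(0, len(data), rec.size)]
random.shuffle(records)
sys.stdout.buffer.write(b"".join(records))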

4. glove

glove is the actual model-training stage. Its inputs are the outputs of vocab_count and shuffle, and its final output is in a format similar to word2vec's: one vector per word. The glove code resembles word2vec's in places, but its logic is somewhat simpler. The command format is as follows:

./glove -input-file cooccurrence.shuf.bin -vocab-file vocab.txt -save-file vectors -gradsq-file gradsq -verbose 2 -vector-size 100 -threads 16 -alpha 0.75 -x-max 100.0 -eta 0.05 -binary 2 -model 2

How about that: does the command above look familiar? It is a lot like word2vec. Note, however, that the alpha parameter here is definitely not the learning rate; eta is the learning rate. For what alpha actually means, go back and check the paper.
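To make the distinction concrete, here is a simplified Python sketch of one training step. The weighting function f(x) = (x / x_max)^alpha for x < x_max (and 1 otherwise) and the weighted least-squares cost f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2 are from the paper, so alpha shapes how much frequent pairs count. The plain gradient step below is a simplification: the actual code uses AdaGrad (whose accumulators are what -gradsq-file saves).

import numpy as np

def glove_step(w, w_tilde, b, b_tilde, x_ij,
               alpha=0.75, x_max=100.0, eta=0.05):
    """One simplified gradient step on a single co-occurrence record."""
    # alpha shapes the weighting function; eta is the learning rate.
    f = (x_ij / x_max) ** alpha if x_ij < x_max else 1.0
    diff = w @ w_tilde + b + b_tilde - np.log(x_ij)  # error for this word pair
    g = f * diff                                     # shared gradient factor
    return (w - eta * g * w_tilde,   # updated center vector
            w_tilde - eta * g * w,   # updated context vector
            b - eta * g,             # updated center bias
            b_tilde - eta * g)       # updated context bias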

If you are interested, go read the source code; there is not much of it, and the logic is not too hard to follow.

