Word Vector Visualization
Use visualization to judge, subjectively, whether the trained word vectors are any good.
Idea: reduce dimensionality first, then plot.
Method 1: TensorFlow TensorBoard
Method 2: SVD in Python, then plot with matplotlib
Environment: macOS, Python 3.6, Anaconda (Spyder)
Method 1: TensorFlow TensorBoard
TensorBoard's built-in projector handles all of this very well: you can choose between t-SNE and PCA for the dimensionality reduction, watch the iteration count, and stop the iterations by hand. The t-SNE view is even animated, which looks great. Both t-SNE and PCA are general-purpose dimensionality-reduction methods; if you search for a t-SNE demo you will see it is animated and impressive on its own.
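The two reduction methods TensorBoard offers can also be tried outside TensorBoard. A minimal sketch using scikit-learn (assumed installed; the toy vectors are made up, not the post's data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
vecs = rng.randn(50, 100)  # stand-in for 50 word vectors of dimension 100

# PCA: deterministic linear projection onto the top-2 variance directions
pca_2d = PCA(n_components=2).fit_transform(vecs)

# t-SNE: non-linear, iterative embedding that preserves local neighborhoods
tsne_2d = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(vecs)

print(pca_2d.shape, tsne_2d.shape)  # both (50, 2)
```

PCA gives the same answer every run; t-SNE is iterative, which is why TensorBoard can animate it and let you stop it mid-run.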
Implementation:
In a terminal: pip install tensorflow
Train the word2vec model and save it. The paths are left exactly as they are on my machine:
import gensim
from gensim.models import word2vec

dimention = 100  # word vector dimension
model100 = gensim.models.Word2Vec(sentence, sg=0, size=dimention, min_count=0, window=5)  # train word vectors
model100.save("/Users/caiyunxin/Desktop/word2vec_model_100")
Next we run a .py file. I first create a folder to hold its output: /Users/caiyunxin/Desktop/visualize/test
Then run the .py file; the only things you need to change in it are your own model path and the output path above. The file comes from the article "Keras 模型中使用預訓練的 gensim 詞向量和可視化", which seems to require a VPN to access, so here is the code:
import sys, os
from gensim.models import Word2Vec
import tensorflow as tf
import numpy as np
from tensorflow.contrib.tensorboard.plugins import projector

def visualize(model, output_path):
    meta_file = "w2x_metadata.tsv"
    placeholder = np.zeros((len(model.wv.index2word), 100))
    with open(os.path.join(output_path, meta_file), 'wb') as file_metadata:
        for i, word in enumerate(model.wv.index2word):
            placeholder[i] = model[word]
            # temporary solution for https://github.com/tensorflow/tensorflow/issues/9094
            if word == '':
                print("Empty line; replace it with anything else, or it will trigger a tensorboard bug")
                file_metadata.write("{0}".format('<Empty Line>').encode('utf-8') + b'\n')
            else:
                file_metadata.write("{0}".format(word).encode('utf-8') + b'\n')

    # define the model without training
    sess = tf.InteractiveSession()
    embedding = tf.Variable(placeholder, trainable=False, name='w2x_metadata')
    tf.global_variables_initializer().run()

    saver = tf.train.Saver()
    writer = tf.summary.FileWriter(output_path, sess.graph)

    # adding into projector
    config = projector.ProjectorConfig()
    embed = config.embeddings.add()
    embed.tensor_name = 'w2x_metadata'
    embed.metadata_path = meta_file

    projector.visualize_embeddings(writer, config)
    saver.save(sess, os.path.join(output_path, 'w2x_metadata.ckpt'))
    print('Run `tensorboard --logdir={0}` to run visualize result on tensorboard'.format(output_path))

if __name__ == "__main__":
    model = Word2Vec.load("/Users/caiyunxin/Desktop/word2vec_model_100")
    visualize(model, "/Users/caiyunxin/Desktop/visualize/test")
It is just one function plus a call to it; you only need to change the call's two arguments. As for the file w2x_metadata.tsv, it is generated automatically when the code runs, so you can ignore it.
Run `tensorboard --logdir=/Users/caiyunxin/Desktop/visualize/test` to run visualize result on tensorboard
When the script finishes it prints the line above. Copy tensorboard --logdir=/Users/caiyunxin/Desktop/visualize/test straight into a terminal and run it:
caiyunxindeMacBook-Pro:~ caiyunxin$ tensorboard --logdir=/Users/caiyunxin/Desktop/visualize/test
/Users/caiyunxin/anaconda/lib/python3.6/importlib/_bootstrap.py:205: RuntimeWarning: compiletime version 3.5 of module tensorflow.python.framework.fast_tensor_util does not match runtime version 3.6
  return f(*args, **kwds)
W0212 16:47:46.708952 Reloader tf_logging.py:86] Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
TensorBoard 0.4.0rc3 at http://caiyunxindeMacBook-Pro.local:6006 (Press CTRL+C to quit)
The terminal prints the above. Now open Chrome and enter http://localhost:6006/ in the address bar.
Once inside TensorBoard, refresh the page or click the search box on the right.
To bring up the word labels, just hover the mouse over any point, and you are done. In the lower-left panel you can run t-SNE or PCA.
Method 2: SVD in Python, then plot with matplotlib
word = []
for i in range(6318):
    word += [j for j in sentence0[i]]
for i in range(1068):
    word += [j for j in sentence1[i]]

import numpy as np

visualizeVecs = []
visualizeWords = []
word_list = list(set(word))  # word comes from the rumor-detection clean_data.py

import gensim
from gensim.models import word2vec

dimention = 100  # word vector dimension
model100 = gensim.models.Word2Vec(sentence, sg=0, size=dimention, min_count=5, window=5)  # train word vectors

# try/except because low-frequency words are filtered out by min_count
for i in word_list:
    try:
        visualizeVecs.append(model100[i])  # model100 is the w2v model
        visualizeWords.append(i)
    except KeyError:
        continue

visualizeVecs = np.array(visualizeVecs).astype(np.float64)  # matrix of word vectors

import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties

temp = visualizeVecs - np.mean(visualizeVecs, axis=0)
covariance = 1.0 / visualizeVecs.shape[0] * temp.T.dot(temp)
U, S, V = np.linalg.svd(covariance)
coord = temp.dot(U[:, 0:2])

myfont = FontProperties(fname='/Library/Fonts/Songti.ttc')  # a font that can render Chinese
for i in range(len(visualizeWords)):
    color = 'red'
    plt.text(coord[i, 0], coord[i, 1], visualizeWords[i],
             bbox=dict(facecolor=color, alpha=0.03),
             fontsize=6, fontproperties=myfont)
plt.xlim((np.min(coord[:, 0]) - 0.5, np.max(coord[:, 0]) + 0.5))
plt.ylim((np.min(coord[:, 1]) - 0.5, np.max(coord[:, 1]) + 0.5))
plt.savefig('/Users/caiyunxin/Desktop/visualize/w2v100_test.png', format='png', dpi=1000, bbox_inches='tight')
plt.show()
This visualization is of course not as pretty as TensorBoard's, and I don't fully understand the principle behind the SVD decomposition, so I can't say how reliable it is.
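For what it's worth, the SVD recipe above (center the vectors, form the covariance matrix, project onto its top two singular directions) is exactly a 2-component PCA, so it is a standard and reliable reduction. A minimal check on made-up data (variable names and sizes are mine), assuming scikit-learn is available:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
vecs = rng.randn(200, 10)  # stand-in for the word-vector matrix

# the post's recipe
temp = vecs - vecs.mean(axis=0)              # center
cov = temp.T.dot(temp) / vecs.shape[0]       # covariance matrix
U, S, V = np.linalg.svd(cov)                 # its singular/eigen directions
coord = temp.dot(U[:, :2])                   # project to 2-D

# scikit-learn's 2-component PCA on the same data
coord_pca = PCA(n_components=2).fit_transform(vecs)

# columns can only differ by sign, since an eigenvector's direction is arbitrary
for k in range(2):
    assert np.allclose(np.abs(coord[:, k]), np.abs(coord_pca[:, k]), atol=1e-6)
print("SVD projection equals PCA (up to sign)")
```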