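pyLDAvis module: code and usage
Background
pyLDAvis is a Python module for visualizing LDA topic models. The code in this post is adapted from a project on GitHub; many thanks to GitHub and to the authors of the original code!
Code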
import pandas as pd
import jieba
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pyLDAvis  # visualization module

try:
    import pyLDAvis.lda_model as sklearn_vis  # pyLDAvis >= 3.4
except ImportError:
    import pyLDAvis.sklearn as sklearn_vis  # older pyLDAvis releases

# Raw strings keep the backslashes in Windows paths from being read as escapes.
# encoding_errors requires pandas >= 1.3.
df = pd.read_csv(r"C:\Users\Desktop\neg.csv", encoding_errors="ignore")
print(df.head())
print(df.shape)

# Custom dictionary for word segmentation
jieba.load_userdict(r"C:\Users\Desktop\中文分詞詞庫整理\中文分詞詞庫整理\百度分詞詞庫.txt")

def chinese_word_cut(mytext):
    return " ".join(jieba.cut(mytext))

# Segment each document into space-separated words
df["content_cutted"] = df.content.apply(chinese_word_cut)
print(df.content_cutted.head())

n_features = 1000
tf_vectorizer = CountVectorizer(strip_accents="unicode",
                                max_features=n_features,
                                stop_words="english",
                                max_df=1.0,
                                min_df=0.1)
# Build the document-term count matrix
tf = tf_vectorizer.fit_transform(df.content_cutted)

n_topics = 3
# n_components replaced the n_topics argument in scikit-learn 0.19
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=50,
                                learning_method="online",
                                learning_offset=50.,
                                random_state=0)
# Train the LDA model
lda.fit(tf)

def print_top_words(model, feature_names, n_top_words):
    # Print the highest-weighted words of each topic
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

n_top_words = 25
# On scikit-learn < 1.0, use get_feature_names() instead
tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lda, tf_feature_names, n_top_words)

data = sklearn_vis.prepare(lda, tf, tf_vectorizer)
pyLDAvis.show(data)  # visualize the topic model
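The slice `topic.argsort()[:-n_top_words - 1:-1]` in `print_top_words` is easy to misread, so here is a minimal plain-Python sketch of what it appears to do: sort the word indices by ascending weight, then walk backwards from the end to pick the heaviest ones. The helper `top_indices`, the vocabulary, and the weights are all made up for illustration; they are not part of the original code or of any library.

```python
def top_indices(weights, n_top):
    """Return the indices of the n_top largest weights, largest first."""
    # Indices ordered by ascending weight, similar to what argsort() gives.
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # Same slice as in print_top_words: step backwards from the last
    # element and stop before position -n_top - 1, giving n_top indices.
    return ascending[:-n_top - 1:-1]


# Hypothetical vocabulary and topic weights, invented for this example.
feature_names = ["movie", "plot", "actor", "boring", "ticket"]
topic_weights = [0.9, 3.1, 0.2, 5.4, 1.7]

top = top_indices(topic_weights, 3)
top_words = [feature_names[i] for i in top]
print(top, top_words)
```

If `n_top` is larger than the vocabulary, the slice simply returns every index in descending order of weight, so the original function does not fail when a topic has fewer than 25 words.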
Results