Python Data Analysis and Visualization Examples: Bag of Words word2bow (28)
Series table of contents: Python Data Analysis and Visualization Examples Directory
1. Project background:
The previous installment used jieba to segment the text, producing a list of Chinese tokens.
But a computer does not understand Chinese characters, so the tokens must be converted into vectors before any analysis can happen.
Broadly, natural language processing is applied to topic extraction, text classification, sentiment analysis, and the like.
One step at a time: today we tackle the bag of words.
2. Analysis steps:
(1) Take a test document and segment it;
(2) Build a dictionary (the word bag);
(3) Use the dictionary to convert a test string (word2bow), as sketched right after this list;
(4) Next installment: text similarity.
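Since the toy corpus in the full script below is English and never exercises jieba, here is a minimal sketch of the same segment → dictionary → word2bow pipeline on Chinese text. It assumes jieba and gensim are installed; the two sample sentences are invented for illustration:

import jieba
from gensim import corpora

docs = ["我愛自然語言處理", "自然語言處理很有趣"]        # toy Chinese documents
texts = [jieba.lcut(doc) for doc in docs]                # step 1: segment each document into a token list
dictionary = corpora.Dictionary(texts)                   # step 2: build the dictionary (the word bag)
bows = [dictionary.doc2bow(text) for text in texts]      # step 3: word2bow, [(token_id, count), ...] per doc
print(bows)

Each document comes out as a sparse list of (token_id, count) pairs; tokens not in the dictionary are simply dropped.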
Reference: python+gensim | jieba segmentation, doc2bow bag-of-words, TFIDF text mining - CSDN blog
3. Source code (WeChat public account: 海豹戰隊):
# coding: utf-8
# Reposting implies you agree to help promote the WeChat public account: 海豹戰隊, heh heh......
# For the data source, follow the public account 海豹戰隊 and leave the message: 數據

# In[1]:

import logging
from gensim import corpora
import re
import jieba
from collections import defaultdict
from pprint import pprint  # pretty-printer

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


# In[2]:

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]


# In[3]:

stoplist = set('for a of the and to in'.split())  # a few simple English stop words; for Chinese, segment with jieba first
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
texts


# In[4]:

# Drop words that appear only once in the whole corpus
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1]
         for text in texts]
texts


# In[5]:

dictionary = corpora.Dictionary(texts)  # build the dictionary (word bag) from the documents


# In[8]:

get_ipython().magic('pinfo2 dictionary')  # notebook export of `dictionary??`, which shows its source and docstring


# In[6]:

dictionary.token2id  # token -> integer id mapping; combined with the frequencies below it can feed a word cloud


# In[7]:

dictionary.dfs  # token id -> number of documents containing that token


# In[9]:

dictionary.filter_tokens()  # with no arguments this removes nothing; pass bad_ids or good_ids to prune


# In[10]:

dictionary.compactify()  # reassign ids to remove gaps left by filtering


# In[12]:

dictionary.save('../../tmp/deerwester.dict')  # persist the dictionary to disk


# In[13]:

# Print each word in the dictionary with its id and document frequency
def PrintDictionary():
    token2id = dictionary.token2id
    dfs = dictionary.dfs
    token_info = {}
    for word in token2id:
        token_info[word] = dict(
            word=word,
            id=token2id[word],
            freq=dfs[token2id[word]]
        )
    token_items = token_info.values()
    token_items = sorted(token_items, key=lambda x: x['id'])
    print('The info of dictionary: ')
    pprint(token_items)


# In[14]:

# Test the dictionary's doc2bow: convert a new document into a sparse bag-of-words vector
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)  # "interaction" is not in the dictionary, so it is ignored


# In[15]:

# Convert every document into its doc2bow representation
corpus = [dictionary.doc2bow(text) for text in texts]


# In[16]:

corpus  # counts of dictionary words per document, e.g. (2, 1) means word id 2 appears once


# In[17]:

corpora.MmCorpus.serialize('../../tmp/deerwester.mm', corpus)  # save the corpus to disk
# Besides MmCorpus, there are SvmLightCorpus and other formats for writing to disk
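Since the next installment (text similarity) starts from the files saved above, here is a minimal sketch of loading them back, assuming the same ../../tmp/ paths as in the script:

from gensim import corpora

dictionary = corpora.Dictionary.load('../../tmp/deerwester.dict')  # restore the word bag
corpus = corpora.MmCorpus('../../tmp/deerwester.mm')               # streamed from disk, not read all at once
for bow in corpus:                                                 # each item is [(token_id, count), ...]
    print(bow)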
The glue language is vast and deep;
this author has grasped only a piece or two of it to guide newcomers.
Old hands can head over to another column: Python中文社區
Newcomers can browse the back catalogue:
Python Data Analysis and Visualization Examples Directory
Finally, don't just bookmark without following!