University Of Michigan - Generative Models and LDA

Vocabulary notes: multinomial; quota.

Suppose you decide that, for a particular document, 40% of the words come from topic A; then you use topic A's multinomial distribution to generate that 40% of the words. (This is a simplified way to explain LDA.) For the full mathematical treatment, see the original paper; for now it is enough to understand that LDA is also a generative model: it creates a document based on some notion of the document's length, the mixture of topics in that document, and the individual topics' multinomial distributions.
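The 40% example can be sketched directly: for each word position, first pick a topic according to the document's topic mixture, then draw a word from that topic's multinomial distribution. (A toy illustration with made-up vocabularies and probabilities, not gensim's actual implementation.)

```python
import random

random.seed(0)

# Toy per-topic multinomial distributions over words (made-up numbers).
topics = {
    "A": (["gene", "dna", "cell"],   [0.5, 0.3, 0.2]),
    "B": (["game", "team", "score"], [0.6, 0.2, 0.2]),
}

# The document's topic mixture: 40% topic A, 60% topic B.
topic_mixture = (["A", "B"], [0.4, 0.6])

def generate_document(num_words):
    """Generate a document of num_words words from the topic mixture."""
    doc = []
    for _ in range(num_words):
        # 1. Choose a topic according to the document's topic mixture.
        topic = random.choices(*topic_mixture)[0]
        # 2. Choose a word from that topic's multinomial distribution.
        words, probs = topics[topic]
        doc.append(random.choices(words, probs)[0])
    return doc

print(generate_document(10))
```

In real LDA the mixture itself is drawn from a Dirichlet prior per document; here it is fixed just to show the two sampling steps.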

Generating labels (naming the learned topics) is a very subjective task.

Stemming: reducing the different inflected forms of an English word to a single base form (meet, meeting, met → meet); Chinese does not have this problem.
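A crude illustration of the idea. A real system would use something like NLTK's PorterStemmer; this toy version just strips a couple of suffixes and special-cases the irregular form "met", which rule-based stemmers cannot handle by suffix-stripping alone.

```python
# Toy stemmer: strips a few common suffixes. Real stemmers (e.g. the
# Porter algorithm) use far more careful rules than this.
SUFFIXES = ("ings", "ing", "s")
IRREGULAR = {"met": "meet"}  # irregular forms need a lookup, not suffix rules

def toy_stem(word):
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        # Only strip if a reasonably long stem remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["meet", "meeting", "met"]])  # → ['meet', 'meet', 'meet']
```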

Convert the tokenized documents into a document-term matrix.
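What that conversion looks like concretely, sketched without any library (the example documents are made up): assign each word an integer ID, then represent each document as (word_id, count) pairs. Stacked together, those rows are the document-term matrix.

```python
from collections import Counter

# Tokenized documents: each document is already a list of words.
docs = [
    ["topic", "models", "learn", "topics"],
    ["topics", "describe", "documents"],
]

# Dictionary: a mapping between integer IDs and words.
word2id = {}
for doc in docs:
    for word in doc:
        word2id.setdefault(word, len(word2id))

# Each document as a bag of words: sorted (word_id, count) pairs.
corpus = [sorted(Counter(word2id[w] for w in doc).items()) for doc in docs]
print(corpus)
```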

First you create a dictionary: a mapping between IDs and words. Then you create the corpus by going through all the documents in the doc set and converting each document to a bag-of-words model; this is the step that creates the document-term matrix. Once you have that, you pass it to the LdaModel call (gensim.models.LdaModel), where you also specify the number of topics you want to learn. In this case, we said the number of topics is going to be four, and you also specify the id2word mapping, which is the dictionary learned two steps earlier. You can also say how many passes it should make over the corpus, and there are other parameters that I would encourage you to read up on.

Once you have trained this LDA model, you can use it to print the topics. In this particular case we learned four topics, and you can say, give me the top five words of each of these four topics, and it will bring those out for you. The LDA model can also be used to find the topic distributions of documents: when you have a new document, you apply the LDA model to it to infer what its topic distribution is across these four topics.

LDA stands for Latent Dirichlet Allocation.
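The "Dirichlet" in the name refers to the prior on the topic mixtures. In the standard formulation (Blei, Ng, and Jordan), each document d draws a topic mixture from a Dirichlet distribution, and each word is generated by first picking a topic from that mixture:

```latex
\begin{align*}
\theta_d   &\sim \mathrm{Dirichlet}(\alpha)
           && \text{topic mixture for document } d \\
z_{d,n}    &\sim \mathrm{Multinomial}(\theta_d)
           && \text{topic of the } n\text{-th word in } d \\
w_{d,n}    &\sim \mathrm{Multinomial}(\beta_{z_{d,n}})
           && \text{word drawn from that topic's distribution}
\end{align*}
```

This matches the informal description earlier in the notes: the mixture decides which topic each word comes from, and the topic's multinomial distribution produces the word itself.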

To summarize: the LDA model can be used for topic modeling.
