CS224N Lecture 2 Notes

Main topics:

How is word meaning represented in WordNet?
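
One concrete way to see this: WordNet represents a word's meaning as sets of synonyms (synsets) plus relations such as hypernymy. A minimal sketch of querying it via NLTK (an assumption here: NLTK is installed and its wordnet corpus has been downloaded):

```python
# Minimal WordNet query sketch; assumes `pip install nltk`
# followed by nltk.download('wordnet').
from nltk.corpus import wordnet as wn

# Each synset is one sense of the word, with a textual gloss.
for synset in wn.synsets("good")[:3]:
    print(synset.name(), "-", synset.definition())

# Relations between synsets, e.g. hypernyms ("is-a" parents).
panda = wn.synsets("panda")[0]
print(panda.hypernyms())
```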

What is a one-hot representation? What limitations does it have?
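
To make the limitation concrete, a toy NumPy sketch (the vocabulary here is made up): each word is a sparse vector with a single 1, so any two distinct words are orthogonal and the representation encodes no similarity at all:

```python
import numpy as np

# Hypothetical toy vocabulary.
vocab = ["hotel", "motel", "cat"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return the |V|-dimensional one-hot vector for `word`."""
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

# Limitation: similar words ("hotel"/"motel") have zero dot product,
# exactly like unrelated ones ("hotel"/"cat").
print(one_hot("hotel") @ one_hot("motel"))  # 0.0
print(one_hot("hotel") @ one_hot("cat"))    # 0.0
```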

What is the main idea of skip-gram models (for word2vec)?
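
In short: slide a fixed-size window over the corpus and, for each center word, train the model to predict the surrounding context words. A sketch of extracting the (center, context) training pairs (the corpus and window size are made-up examples):

```python
# Skip-gram turns a corpus into (center, context) pairs; the model
# is then trained to predict each context word given its center.
corpus = "the quick brown fox jumps over the lazy dog".split()
window = 2  # hypothetical window size

pairs = []
for i, center in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            pairs.append((center, corpus[j]))

print(pairs[:5])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'),
#  ('quick', 'brown'), ('quick', 'fox')]
```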

What is softmax? Why do we use it?
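
Softmax maps a vector of arbitrary real-valued scores (here, the dot products $u_x^\top v_c$) to a probability distribution, which is what lets us read the model's outputs as $p(x \mid c)$. A numerically stable NumPy version:

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Map real-valued scores to probabilities that sum to 1."""
    # Subtracting the max leaves the result unchanged mathematically
    # but prevents overflow in exp() for large scores.
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)        # the largest score gets amplified
print(probs.sum())  # 1.0
```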

How to train the model, i.e., optimize the negative log likelihood?
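
For reference, the objective in question (notation follows the lecture's convention of $v$ for center vectors and $u$ for outside vectors): maximize the likelihood of observed context words, i.e. minimize the average negative log likelihood

```latex
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log p(w_{t+j} \mid w_t),
\qquad
p(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}.
```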

What is the gradient of the model? How to interpret the compact form $u_o - \sum_{x} p(x \mid c)\, u_x$?
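
A short derivation recovers the compact form (standard notation: $c$ the center word, $o$ the observed outside word):

```latex
\frac{\partial}{\partial v_c} \log p(o \mid c)
  = \frac{\partial}{\partial v_c} \Big[ u_o^\top v_c - \log \sum_{w \in V} \exp(u_w^\top v_c) \Big]
  = u_o - \sum_{x \in V} p(x \mid c)\, u_x.
```

Interpretation: this is the observed context vector $u_o$ minus the model's expected context vector $\sum_x p(x \mid c)\, u_x$; a gradient step on the negative log likelihood therefore pulls $v_c$ toward what was actually observed and away from what the model currently expects.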

What are the benefits of using SGD rather than GD?
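
To make the contrast concrete, a toy sketch on a made-up least-squares problem: full GD computes the gradient over all examples per update, while SGD updates from one sampled example, so each step is far cheaper (and the noise is tolerable, often even helpful, on large corpora):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical regression data: y = 3x + noise.
X = rng.normal(size=1000)
y = 3.0 * X + 0.1 * rng.normal(size=1000)

w_gd, w_sgd, lr = 0.0, 0.0, 0.05

for step in range(200):
    # GD: gradient of the squared error over ALL examples -- O(N) per step.
    w_gd -= lr * np.mean(2 * (w_gd * X - y) * X)

    # SGD: gradient from ONE random example -- O(1) per step, noisy but cheap.
    i = rng.integers(len(X))
    w_sgd -= lr * 2 * (w_sgd * X[i] - y[i]) * X[i]

print(w_gd, w_sgd)  # both end near the true slope 3.0
```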
