Approximating softmax (log-linear) model
Motivation
Computing $Z$ and $\log Z$ is expensive in tasks such as language modeling and retrieval.
This is the place to keep reading notes and random thoughts I have.
NCE and IS
Noise-contrastive estimation [6,5] and a related variant based on importance sampling [1] reduce the problem to an easier classification task: identify the true item among randomly sampled negative ones.
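To make the reduction concrete, below is a minimal numpy sketch of the IS-style sampled softmax: score the true item and $n$ sampled negatives, correct each logit by $-\log Q(x)$, and apply cross-entropy over just those $n+1$ candidates. The sizes, the uniform proposal, and the dot-product scorer are illustrative assumptions, not details taken from [1].

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, n_neg = 10_000, 32, 20

W = rng.normal(scale=0.1, size=(vocab_size, dim))   # item embeddings (model parameters)
h = rng.normal(size=dim)                            # context representation
target = 42                                         # index of the true item

# Proposal Q: uniform over the vocabulary (a "simple fixed distribution").
negatives = rng.integers(0, vocab_size, size=n_neg)
candidates = np.concatenate(([target], negatives))  # true item in slot 0
log_q = np.full(n_neg + 1, -np.log(vocab_size))     # log Q(x) for each candidate

# IS-style sampled softmax: logits are s(x) - log Q(x); the softmax and the
# cross-entropy are taken over the n+1 candidates only, not the full vocabulary.
logits = W[candidates] @ h - log_q
logits -= logits.max()                              # numerical stability
probs = np.exp(logits) / np.exp(logits).sum()
loss = -np.log(probs[0])                            # true item sits in slot 0
print(f"sampled-softmax loss over {n_neg + 1} candidates: {loss:.4f}")
```

With a uniform $Q$ the $\log Q$ correction is just a constant shift of the logits, but it becomes essential when the proposal is non-uniform (e.g. a unigram distribution).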
A generative view: [1] explains the methods pretty well using a Bayesian generative view. Let $P(\cdot)$ and $Q(\cdot)$ be the true data distribution and the proposal distribution used to generate random noise. Consider the IS-based method. For each training example, one item is drawn from the true distribution $P(\cdot)$ and $n$ items are drawn from $Q(\cdot)$. The generative story of this multi-class classification task is
- draw a categorical variable $y$ from the uniform multinomial distribution $p(y=k) = 1/(n+1)$
- draw $n+1$ items: the $k$-th item from $P(\cdot)$ and the remaining items from $Q(\cdot)$
The goal is to infer $p(y \mid x)$. By Bayes' rule, this posterior is proportional to $P(\cdot)/Q(\cdot)$. In other words, after training, the learned model approximates $P(\cdot)/Q(\cdot)$. If $Q(\cdot)$ is a simple distribution such as the uniform distribution, we can re-parameterize and obtain a good approximation of $P(\cdot)$ (which is the distribution we want to learn) by offsetting the learned model by the constant factor introduced by $Q(\cdot)$.
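Spelling out the Bayes step behind this claim (a standard derivation, consistent with Section 3.1 of [1]): with the true item at a uniformly chosen slot $k$ among the observed candidates $x_0,\dots,x_n$,

$$
p(y=k \mid x_0,\dots,x_n)
= \frac{\frac{1}{n+1}\, P(x_k) \prod_{i \neq k} Q(x_i)}{\sum_{j=0}^{n} \frac{1}{n+1}\, P(x_j) \prod_{i \neq j} Q(x_i)}
= \frac{P(x_k)/Q(x_k)}{\sum_{j=0}^{n} P(x_j)/Q(x_j)} .
$$

So a model trained with a softmax over the $n+1$ candidates is pushed toward $\exp(s_\theta(x)) \propto P(x)/Q(x)$, i.e. $s_\theta(x) \approx \log P(x) - \log Q(x) + \text{const}$; when $Q$ is uniform, $\log Q(x)$ is itself a constant and $s_\theta$ recovers $\log P$ up to an additive constant.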
Thoughts:
$Q(\cdot)$ has to be a simple, fixed distribution. If $Q(\cdot)$ gets close to $P(\cdot)$, we might improve sampling efficiency and hence obtain faster training convergence. However, such a $Q(\cdot)$ would not be static, and I am not sure the same Bayesian argument still applies in that case. Perhaps directly approximate $\log Z$ and its derivative instead? Some papers aim to tackle this problem [3,4].
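For reference, the textbook importance-sampling estimator of the partition function (the generic starting point, not the specific LSH-based estimators of [3,4]):

$$
Z = \sum_x e^{s_\theta(x)} = \mathbb{E}_{x \sim Q}\!\left[\frac{e^{s_\theta(x)}}{Q(x)}\right] \approx \frac{1}{m} \sum_{i=1}^{m} \frac{e^{s_\theta(x_i)}}{Q(x_i)}, \qquad x_i \sim Q,
$$

and $\nabla_\theta \log Z = \mathbb{E}_{x \sim p_\theta}[\nabla_\theta s_\theta(x)]$ can be approximated with the same samples via self-normalized importance weights. The estimator of $Z$ is unbiased, but plugging it into the log gives an under-estimate of $\log Z$ in expectation (Jensen's inequality), which is part of why the choice of $Q$ matters.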
References:
[1] Exploring the Limits of Language Modeling (Section 3.1). https://arxiv.org/pdf/1602.02410.pdf
[2] Strategies for Training Large Vocabulary Neural Language Models. https://arxiv.org/pdf/1512.04906.pdf
[3] A New Unbiased and Efficient Class of LSH-Based Samplers and Estimators for Partition Function Computation in Log-Linear Models. https://arxiv.org/pdf/1703.05160.pdf
[4] LSH Softmax: Sub-Linear Learning and Inference of the Softmax Layer in Deep Architectures. https://openreview.net/pdf?id=SJ3dBGZ0Z
[5] A fast and simple algorithm for training neural probabilistic language models. https://www.cs.toronto.edu/~amnih/papers/ncelm.pdf
[6] Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. https://www.cs.helsinki.fi/u/ahyvarin/papers/Gutmann10AISTATS.pdf
[7] Approximating the Softmax for Learning Word Embeddings.