Approximating the softmax (log-linear) model

Motivation

Computing Z and log Z is expensive in tasks such as language modeling and retrieval, where the output vocabulary or candidate set is large.
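For concreteness (notation assumed here, not taken from the references): for a log-linear model with score function s(w, x) over an output vocabulary V, the partition function and the resulting softmax are

    Z(x) = \sum_{w \in V} \exp\big(s(w, x)\big), \qquad
    p(w \mid x) = \frac{\exp\big(s(w, x)\big)}{Z(x)},

so each evaluation of Z(x), and of its gradient, requires scoring all |V| items, which is the bottleneck when |V| is large.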

This is where I keep reading notes and random thoughts on the topic.

NCE and IS

Noise-contrastive estimation (NCE) [6,5] and a similar variant based on importance sampling (IS) [1] reduce the problem to an easier classification problem: identifying the true item among randomly sampled negative ones.
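As a concrete illustration of this sampled-classification idea, here is a minimal PyTorch sketch (not taken from the referenced papers; vocab_size, n_neg, output_emb, the uniform proposal, and sampled_softmax_loss are my own illustrative assumptions). It scores the true item together with n randomly sampled negatives, subtracts log Q(w) as the importance correction, and trains an (n+1)-way classifier whose correct class is always the true item:

    # Minimal sketch of an IS / sampled-softmax style loss (illustrative, not from [1]).
    import torch
    import torch.nn.functional as F

    vocab_size, embed_dim, n_neg = 10000, 128, 64
    output_emb = torch.nn.Embedding(vocab_size, embed_dim)   # output word embeddings

    # Uniform proposal Q: log Q(w) = -log |V| for every word (a constant here).
    log_q = -torch.log(torch.tensor(float(vocab_size)))

    def sampled_softmax_loss(hidden, target):
        """hidden: (batch, embed_dim) context vectors; target: (batch,) true word ids."""
        batch = hidden.size(0)
        neg = torch.randint(0, vocab_size, (batch, n_neg))     # negatives drawn from Q
        cand = torch.cat([target.unsqueeze(1), neg], dim=1)    # (batch, n_neg + 1); slot 0 = true item
        cand_emb = output_emb(cand)                            # (batch, n_neg + 1, embed_dim)
        logits = torch.einsum('bd,bkd->bk', hidden, cand_emb)  # unnormalized scores s(w, x)
        logits = logits - log_q                                # importance correction by log Q(w)
        labels = torch.zeros(batch, dtype=torch.long)          # true item always sits in slot 0
        # A real implementation would also remove "accidental hits" (negatives equal to the target).
        return F.cross_entropy(logits, labels)

    # Usage:
    #   hidden = torch.randn(32, embed_dim); target = torch.randint(0, vocab_size, (32,))
    #   loss = sampled_softmax_loss(hidden, target)

With a uniform Q the log Q correction is just a constant shift of all logits and does not change the loss; it is written out only because for a non-uniform Q (e.g., a unigram proposal) the per-candidate correction matters.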

A generative view: [1] explains these methods well using a Bayesian generative view. Let P() and Q() be the true data distribution and the proposal distribution used to generate random noise, respectively. Consider the IS-based method. For each training example, one item is drawn from the true distribution P() and n items are drawn from Q(). The generative story of this multi-class classification task is:

  • draw a categorical variable y from a uniform multinomial distribution, p(y = k) = 1/(n+1)
  • draw n+1 items: the y-th item from P() and the remaining n items from Q()

The goal is to infer p(y|x). By Bayes' rule, this posterior is proportional to P()/Q() evaluated at the candidate items. In other words, after training, the learned classifier approximates P()/Q(). If Q() is a simple distribution such as the uniform distribution, we can re-parameterize and obtain a good approximation of P() (the distribution we actually want to learn) by offsetting the learned model by the constant factor introduced by Q() (see the sketch below).
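Concretely (my own sketch of the derivation summarized in [1], writing x_1, ..., x_{n+1} for the n+1 drawn items):

    p(y = k \mid x_1, \dots, x_{n+1})
      = \frac{p(y = k)\, P(x_k) \prod_{i \neq k} Q(x_i)}
             {\sum_{j=1}^{n+1} p(y = j)\, P(x_j) \prod_{i \neq j} Q(x_i)}
      = \frac{P(x_k)/Q(x_k)}{\sum_{j=1}^{n+1} P(x_j)/Q(x_j)},

because the uniform prior p(y = k) = 1/(n+1) and the common factor \prod_i Q(x_i) cancel. The trained classifier therefore produces scores proportional to P/Q. If Q is uniform over a vocabulary V, then Q(x) = 1/|V| is a constant, so subtracting the constant log|V| from the learned logits and exponentiating recovers, up to normalization, an approximation of P.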

Thoughts:

  • Q() has to be a simple, fixed distribution. If Q() were closer to P(), we might improve sampling efficiency and hence get faster training convergence. However, such a Q() would not be static, and I'm not sure the same Bayesian argument applies in that case. Perhaps we should directly approximate log Z and its derivative instead? Some papers aim to tackle this problem [3][4].

References:

[1] Exploring the Limits of Language Modeling (Section 3.1). arxiv.org/pdf/1602.0241

[2] Strategies for Training Large Vocabulary Neural Language Models. arxiv.org/pdf/1512.0490

[3] A New Unbiased and Efficient Class of LSH-Based Samplers and Estimators for Partition Function Computation in Log-Linear Models. arxiv.org/pdf/1703.0516

[4] LSH Softmax: Sub-Linear Learning and Inference of the Softmax Layer in Deep Architectures. openreview.net/pdf?

[5] A fast and simple algorithm for training neural probabilistic language models. cs.toronto.edu/~amnih/p

[6] Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. cs.helsinki.fi/u/ahyvar

[7] Approximating the Softmax for Learning Word Embeddings
