Sigmoid vs Softmax: Choosing the Output Layer

05-17

From the column AI工程 (AI Engineering)

(Header image from Wikipedia, Sigmoid function)

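Someone raised this question today: why do deep learning classification models nowadays generally use Softmax for the final output layer rather than a plain Sigmoid?

Googling turned up two relevant answers: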

Sigmoid + cross-entropy (eq.57) follows the Bernoulli distribution, while softmax + log-likelihood (eq.80) follows the multinomial distribution with one observation (which is a multiclass version of the Bernoulli).

For binary classification problems, the softmax function outputs two values (between 0 and 1 and sum up to 1), to represent the probabilities of each class.

While the sigmoid function outputs one value between 0 and 1, to represent the probability of one class (so the probability of the other class is just 1-p).

dontloo ( neural networks )

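Sigmoid + cross-entropy treats the outputs as Bernoulli distributions, one per output (note: $P(y_1|X), P(y_2|X),...,P(y_n|X)$)

whereas Softmax outputs a single multinomial distribution over all classes (note: $P(y_1, y_2,...,y_n|X)$)

For a binary classification problem, Softmax outputs two values, and these two values sum to 1.

A Sigmoid output layer with two units also outputs two values, but they are not constrained to sum to 1. Each value is some number between 0 and 1 on its own, and for a value p, 1-p is the probability of the complementary outcome for that same unit.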
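For reference, the two output functions can be written out as follows. The symbols $z_1, ..., z_n$ (the raw scores fed into the output layer, often called logits) are notation assumed here and do not appear in the quoted answers.

$$\sigma(z_i) = \frac{1}{1 + e^{-z_i}}, \qquad \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

Each $\sigma(z_i)$ depends only on its own $z_i$, so nothing ties the outputs together. The Softmax outputs share one denominator, which is why $\sum_{i=1}^{n} \mathrm{softmax}(z)_i = 1$. With $n = 2$, dividing numerator and denominator by $e^{z_1}$ gives $\mathrm{softmax}(z)_1 = \sigma(z_1 - z_2)$, so in the binary case the two differ only in how the scores are parameterized.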

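For example:

Suppose we are predicting whether something is or is not the case. We can encode the targets like this:

An output of (0, 1) means "yes", and an output of (1, 0) means "no".

Softmax might output (0.3, 0.7), meaning the algorithm puts the probability of "yes" at 0.7 and the probability of "no" at 0.3, which sum to 1.

Sigmoid might output (0.4, 0.8), which does not sum to 1. The interpretation is that Sigmoid puts the probability of the first position being 1 at 0.4 and of it not being 1 at 0.6 (1-p), and the probability of the second position being 1 at 0.8 and of it not being 1 at 0.2.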
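The numbers in this example can be reproduced with a minimal Python sketch. The logits below are hypothetical, chosen only so that the outputs match (0.3, 0.7) and (0.4, 0.8); the helper names `softmax` and `sigmoid` are likewise illustrative and not taken from any particular library.

```python
import math

def softmax(logits):
    # Subtract the largest logit before exponentiating for numerical
    # stability; this does not change the result.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical logits, picked to reproduce the figures in the example.
softmax_logits = [0.0, math.log(0.7 / 0.3)]
sigmoid_logits = [math.log(0.4 / 0.6), math.log(0.8 / 0.2)]

p_softmax = softmax(softmax_logits)
p_sigmoid = [sigmoid(z) for z in sigmoid_logits]

print("softmax:", p_softmax, "sum =", sum(p_softmax))
print("sigmoid:", p_sigmoid, "sum =", sum(p_sigmoid))
```

The Softmax outputs are normalized against each other and therefore always sum to 1, while each Sigmoid output is computed from its own logit alone, so the pair here sums to about 1.2.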

Geoff Hinton covered exactly this topic in his coursera course on neural nets. The problem with sigmoids is that as you reach saturation (values get close to 1 or 0), the gradients vanish. This is detrimental to optimization speed. Softmax doesn't have this problem, and in fact if you combine softmax with a cross entropy error function the gradients are just (z-y), as they would be for a linear output with least squares error.

nkorslund ( https://www.reddit.com/r/MachineLearning/comments/32iyt9/question_comparison_between_softmax_and_sigmoid/ )

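This answer mentions that Hinton covered the topic in his Coursera course. Unfortunately I have not taken that course (although it is being prepared for a rerun in September 2016, https://www.coursera.org ). According to the answer, Hinton's view is that when an output of the Sigmoid function gets close to 1 or 0, the gradient vanishes, which severely slows down optimization, whereas Softmax does not have this problem.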
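The saturation argument can be made concrete with a small sketch. It assumes, purely for illustration, that the Sigmoid output is trained with squared error (a pairing the answer does not state), and compares it with Softmax plus cross-entropy. All function names and logits here are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    m = max(logits)  # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_mse_grad(z, y):
    # Gradient of 0.5 * (sigmoid(z) - y)**2 with respect to the logit z.
    # The factor s * (1 - s) is the Sigmoid derivative, which shrinks
    # toward 0 as s approaches 0 or 1.
    s = sigmoid(z)
    return (s - y) * s * (1.0 - s)

def softmax_ce_grad(logits, one_hot):
    # Gradient of -sum(y_i * log(softmax(z)_i)) with respect to the logits,
    # which simplifies to softmax(z) - y.
    p = softmax(logits)
    return [pi - yi for pi, yi in zip(p, one_hot)]

# A confidently wrong prediction: the logit strongly favors the wrong answer.
g_sigmoid = sigmoid_mse_grad(10.0, 0.0)               # target 0, output near 1
g_softmax = softmax_ce_grad([0.0, 10.0], [1.0, 0.0])  # true class is index 0

print("sigmoid + squared error gradient:", g_sigmoid)
print("softmax + cross-entropy gradient:", g_softmax)
```

In this setup the Sigmoid gradient is tiny (on the order of 1e-5) even though the prediction is badly wrong, while the Softmax gradient has magnitude close to 1 and still gives a strong learning signal. The vanishing factor s(1-s) comes from the Sigmoid derivative; whether it survives in the final gradient depends on the loss the output is paired with.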