Given two Gaussian distributions and the relationship between them, how do we find the conditional expectation?

Below, h and x are two Gaussian-distributed variables whose distribution functions are given, and x = Ph. How is the final conditional expectation E(h|x) obtained?

I tried to reason it out as follows, but could not get any further.


Thanks for the invite.

Again, I am passionate about this kind of technicality.

I do not redefine the variables in the question, but let me introduce a bit more notation. Note that $\mathbf{h}$ has one more dimension than $\mathbf{x}$: apart from the first component of $\mathbf{h}$, we have $x_i = \mu + \tilde{h}_i$. For later convenience I write

$$\mathbf{h} = \left[ \begin{array}{c} \mu \\ \tilde{\mathbf{h}} \end{array} \right],$$

where $\mu$ is the first component and $\tilde{\mathbf{h}}$ contains the rest. (Concretely, this means $\mathbf{P} = [\, \mathbf{1}_c \;\; I_n \,]$, with $\mathbf{1}_c$ the $n$-dimensional column of ones, and the calculation below takes $\Sigma_h = \operatorname{diag}(S_\mu, S_h, \dots, S_h)$, i.e. $\mu$ and the $\tilde{h}_i$ are independent with variances $S_\mu$ and $S_h$.)
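To make the setup concrete in code, here is a minimal numpy sketch; the values $n = 5$, $S_\mu = 2$, $S_h = 3$ are arbitrary placeholders of my own:

```python
import numpy as np

n, S_mu, S_h = 5, 2.0, 3.0  # arbitrary placeholder values

# P = [1_c | I_n]: applying it to h = (mu, h~) gives x_i = mu + h~_i
P = np.hstack([np.ones((n, 1)), np.eye(n)])

# Sigma_h = diag(S_mu, S_h, ..., S_h): mu and the h~_i are independent
Sigma_h = np.diag([S_mu] + [S_h] * n)

# sanity check on a concrete h
h = np.concatenate([[1.5], np.arange(n, dtype=float)])
assert np.allclose(P @ h, h[0] + h[1:])
```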

On the other hand, given the $\Sigma_x$ calculated above, one can easily show that its inverse is

$$\Sigma_x^{-1} = \left[ \begin{array}{ccc} \frac{1}{S_h} - \frac{S_\mu}{S_h} \frac{1}{n S_\mu + S_h} & -\frac{S_\mu}{S_h} \frac{1}{n S_\mu + S_h} & \dotsc \\ -\frac{S_\mu}{S_h} \frac{1}{n S_\mu + S_h} & \frac{1}{S_h} - \frac{S_\mu}{S_h} \frac{1}{n S_\mu + S_h} & \dotsc \\ \vdots & \vdots & \ddots \end{array} \right],$$

that is, $\Sigma_x^{-1} = \frac{1}{S_h} I_n - \frac{S_\mu}{S_h (n S_\mu + S_h)} \mathbf{1}_c \mathbf{1}_c^T$, which is just the Sherman–Morrison formula applied to $\Sigma_x = S_h I_n + S_\mu \mathbf{1}_c \mathbf{1}_c^T$.

(Note: I found this out by clumsily going through the calculation below, but verification is easy. And this is actually the main obstacle of this exercise.)
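If you would rather check the inverse (and the determinant derived further below) numerically than by hand, here is a minimal numpy sketch, again with arbitrary placeholder values:

```python
import numpy as np

n, S_mu, S_h = 5, 2.0, 3.0  # arbitrary placeholder values
ones = np.ones((n, 1))

# Sigma_x = P Sigma_h P^T = S_h I_n + S_mu 1_c 1_c^T
Sigma_x = S_h * np.eye(n) + S_mu * (ones @ ones.T)

# the closed-form inverse quoted above (Sherman–Morrison)
Sigma_x_inv = np.eye(n) / S_h - (S_mu / (S_h * (n * S_mu + S_h))) * (ones @ ones.T)
assert np.allclose(Sigma_x @ Sigma_x_inv, np.eye(n))

# the determinant formula derived further below
assert np.isclose(np.linalg.det(Sigma_x), S_mu * S_h**n * (1 / S_mu + n / S_h))
```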

First we define the joint probability density $P(\mathbf{x}, \mathbf{h})$. We know that $\mathbf{h}$ obeys a Gaussian distribution and that $\mathbf{x}$ is a linear superposition of the components of $\mathbf{h}$; the most natural way to describe this is with a Dirac delta function:

$$P(\mathbf{x}, \mathbf{h}) = \frac{1}{\sqrt{(2\pi)^{n+1} |\Sigma_h|}} \exp\left( -\frac{1}{2} \mathbf{h}^T \Sigma_h^{-1} \mathbf{h} \right) \delta^{(n)} (\mathbf{x} - \mathbf{P} \mathbf{h}).$$

As we know, the probability distribution of $\mathbf{x}$ has the form

$$P(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^n |\Sigma_x|}} \exp\left( -\frac{1}{2} \mathbf{x}^T \Sigma_x^{-1} \mathbf{x} \right).$$

However, we can also obtain $P(\mathbf{x})$ by integrating out $\mathbf{h}$:

$$\begin{aligned} P(\mathbf{x}) &= \int d^{n+1}\mathbf{h}\; P(\mathbf{x}, \mathbf{h}) \\ &= \frac{1}{\sqrt{(2\pi)^{n+1} |\Sigma_h|}} \int d\mu \int d^n \tilde{\mathbf{h}}\; \exp\left( -\frac{1}{2} \frac{\mu^2}{S_\mu} - \frac{1}{2} \tilde{\mathbf{h}}^T \Sigma_{\tilde{h}}^{-1} \tilde{\mathbf{h}} \right) \prod_{i=1}^n \delta(x_i - \mu - \tilde{h}_i) \\ &= \frac{1}{\sqrt{(2\pi)^{n+1} S_\mu S_h^n}} \int d\mu\; \exp\left[ -\frac{1}{2} \left( \frac{1}{S_\mu} + \frac{n}{S_h} \right) \left( \mu - \frac{1}{\frac{1}{S_\mu} + \frac{n}{S_h}} \sum_{i=1}^n \frac{x_i}{S_h} \right)^2 - \frac{1}{2} \mathbf{x}^T \Sigma_x^{-1} \mathbf{x} \right] \\ &= \frac{1}{\sqrt{(2\pi)^n S_\mu S_h^n \left( \frac{1}{S_\mu} + \frac{n}{S_h} \right)}} \exp\left( -\frac{1}{2} \mathbf{x}^T \Sigma_x^{-1} \mathbf{x} \right), \end{aligned}$$

from which you can read off the determinant of $\Sigma_x$: $|\Sigma_x| = S_\mu S_h^n \left( \frac{1}{S_\mu} + \frac{n}{S_h} \right)$.
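The only nontrivial step in the chain above is the completion of the square in $\mu$ once the delta functions have set $\tilde{h}_i = x_i - \mu$. Spelled out, with the shorthand $m$ (my own notation) for the stationary point:

$$-\frac{1}{2} \frac{\mu^2}{S_\mu} - \frac{1}{2} \sum_{i=1}^n \frac{(x_i - \mu)^2}{S_h} = -\frac{1}{2} \left( \frac{1}{S_\mu} + \frac{n}{S_h} \right) (\mu - m)^2 - \frac{1}{2} \mathbf{x}^T \Sigma_x^{-1} \mathbf{x}, \qquad m = \frac{S_\mu}{S_h + n S_\mu} \sum_{i=1}^n x_i,$$

where the $\mu$-independent leftovers collect into $-\frac{1}{2} \left[ \frac{1}{S_h} \mathbf{x}^T \mathbf{x} - \frac{S_\mu}{S_h (S_h + n S_\mu)} \left( \sum_i x_i \right)^2 \right]$, which is exactly $-\frac{1}{2} \mathbf{x}^T \Sigma_x^{-1} \mathbf{x}$ for the $\Sigma_x^{-1}$ quoted earlier.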

Then the conditional expectation can be calculated:

$$\begin{aligned} E(\mathbf{h} \mid \mathbf{x}) &= \int d^{n+1}\mathbf{h}\; \mathbf{h}\, P(\mathbf{h} \mid \mathbf{x}) = \int d^{n+1}\mathbf{h}\; \frac{\mathbf{h}\, P(\mathbf{x}, \mathbf{h})}{P(\mathbf{x})} \\ &= \frac{1}{\sqrt{2\pi\, |\Sigma_h| / |\Sigma_x|}} \int d\mu \int d^n \tilde{\mathbf{h}} \left[ \begin{array}{c} \mu \\ \tilde{\mathbf{h}} \end{array} \right] \exp\left( -\frac{1}{2} \frac{\mu^2}{S_\mu} - \frac{1}{2} \tilde{\mathbf{h}}^T \Sigma_{\tilde{h}}^{-1} \tilde{\mathbf{h}} + \frac{1}{2} \mathbf{x}^T \Sigma_x^{-1} \mathbf{x} \right) \prod_{i=1}^n \delta(x_i - \mu - \tilde{h}_i) \\ &= \sqrt{\frac{\frac{1}{S_\mu} + \frac{n}{S_h}}{2\pi}} \int d\mu \left[ \begin{array}{c} \mu \\ \mathbf{x} - \mu \mathbf{1}_c \end{array} \right] \exp\left( -\frac{1}{2} \frac{\mu^2}{S_\mu} - \frac{1}{2} (\mathbf{x} - \mu \mathbf{1}_c)^T \Sigma_{\tilde{h}}^{-1} (\mathbf{x} - \mu \mathbf{1}_c) + \frac{1}{2} \mathbf{x}^T \Sigma_x^{-1} \mathbf{x} \right). \end{aligned}$$

In the above we simply exploited the definition of conditional probability and integrated over $\tilde{\mathbf{h}}$ against the Dirac delta functions. There is a lot of algebra involved, which readers can diligently verify on their own; it is the same completion of the square as before. Continuing the calculation gives

$$\begin{aligned} E(\mathbf{h} \mid \mathbf{x}) &= \sqrt{\frac{\frac{1}{S_\mu} + \frac{n}{S_h}}{2\pi}} \int d\mu \left[ \begin{array}{c} \mu \\ \mathbf{x} - \mu \mathbf{1}_c \end{array} \right] \exp\left[ -\frac{1}{2} \left( \frac{1}{S_\mu} + \frac{n}{S_h} \right) \left( \mu - \frac{S_\mu}{S_h + n S_\mu} \sum_{i=1}^n x_i \right)^2 \right] \\ &= \left[ \begin{array}{c} \frac{S_\mu}{S_h + n S_\mu} \sum_{i=1}^n x_i \\ \mathbf{x} - \frac{S_\mu}{S_h + n S_\mu} \sum_{i=1}^n x_i\, \mathbf{1}_c \end{array} \right]. \end{aligned}$$

A careful bit of algebra shows that this is exactly equal to $\Sigma_h \mathbf{P}^T \Sigma_x^{-1} \mathbf{x}$.

I have skipped many details, but readers can verify this on their own.
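One such verification can be done numerically; here is a minimal numpy sketch of the final identity, with arbitrary placeholder values and a random test vector $\mathbf{x}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, S_mu, S_h = 5, 2.0, 3.0  # arbitrary placeholder values
ones = np.ones(n)

P = np.hstack([ones[:, None], np.eye(n)])  # x = P h with h = (mu, h~)
Sigma_h = np.diag([S_mu] + [S_h] * n)
Sigma_x = P @ Sigma_h @ P.T

x = rng.standard_normal(n)
m = S_mu / (S_h + n * S_mu) * x.sum()  # the coefficient in the component formula

componentwise = np.concatenate([[m], x - m * ones])  # [m; x - m 1_c]
matrix_form = Sigma_h @ P.T @ np.linalg.solve(Sigma_x, x)
assert np.allclose(componentwise, matrix_form)
```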

P.S.: For me the most difficult part was finding the inverse of $\Sigma_x$: I obtained its matrix elements by grinding through the integrals above, and then verified the result by multiplying the two matrices to check that the product is the identity. The mismatch between the dimensions of $\mathbf{x}$ and $\mathbf{h}$ does impose some inconvenience, but it is menial algebra rather than an intellectually challenging problem. If you want a taste without much algebraic labor, work in a small number of dimensions first and do the calculation in Mathematica to get a feeling for it.



First of all, let me plug The Matrix Cookbook. From now on, Mom never has to worry about my matrix derivations again!

http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf

Treat the vectors $h$ and $x$ stacked together as a single random vector $(h, x)$ following a multivariate Gaussian distribution. By formula (353) in the cookbook,

we get $E(h|x) = \mathrm{Cov}(h, x) \cdot \mathrm{Cov}(x, x)^{-1} x$. (1)
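For reference, the formula being cited is the conditional mean of a partitioned Gaussian; the general statement also carries the means of $h$ and $x$, which drop out here since both are zero-mean:

$$\left[ \begin{array}{c} h \\ x \end{array} \right] \sim \mathcal{N}\left( 0, \left[ \begin{array}{cc} \Sigma_{hh} & \Sigma_{hx} \\ \Sigma_{xh} & \Sigma_{xx} \end{array} \right] \right) \quad \Longrightarrow \quad E(h \mid x) = \Sigma_{hx} \Sigma_{xx}^{-1} x.$$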

Then, by formula (314) in the cookbook, $\mathrm{Cov}(Ax, By) = A\, \mathrm{Cov}(x, y)\, B^T$,

so from $x = Ph$ we obtain

$$\mathrm{Cov}(h, x) = \mathrm{Cov}(h, Ph) = \mathrm{Cov}(h, h)\, P^T = \Sigma_h P^T,$$

$$\mathrm{Cov}(x, x) = \Sigma_x = \mathrm{Cov}(Ph, Ph) = P \Sigma_h P^T.$$

Substituting into (1) immediately gives the result: $E(h|x) = \Sigma_h P^T \Sigma_x^{-1} x = \Sigma_h P^T (P \Sigma_h P^T)^{-1} x$. (2)
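Formula (2) also has a useful variational reading: given $x$, it is the point minimizing $h^T \Sigma_h^{-1} h$ subject to $Ph = x$. Here is a minimal numpy sketch checking the two against each other; the random $P$, $\Sigma_h$, and $x$ are placeholders, with $P$ full row rank almost surely:

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, n_h = 4, 6  # x shorter than h, so P has full row rank
P = rng.standard_normal((n_x, n_h))

A = rng.standard_normal((n_h, n_h))
Sigma_h = A @ A.T + n_h * np.eye(n_h)  # a generic SPD covariance
x = rng.standard_normal(n_x)

# formula (2): E(h|x) = Sigma_h P^T (P Sigma_h P^T)^{-1} x
h_map = Sigma_h @ P.T @ np.linalg.solve(P @ Sigma_h @ P.T, x)

# the same point from the KKT system of:  min h^T Sigma_h^{-1} h  s.t.  P h = x
KKT = np.block([[np.linalg.inv(Sigma_h), P.T],
                [P, np.zeros((n_x, n_x))]])
sol = np.linalg.solve(KKT, np.concatenate([np.zeros(n_h), x]))
assert np.allclose(h_map, sol[:n_h])
```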

Remark: note that $P$ here is a general real matrix whose shape is determined by the dimensions of $h$ and $x$, so $P$ is not necessarily invertible and need not even be square; the parenthesized factor in (2) therefore cannot be expanded any further. But what if $P$ is invertible? Then let's keep going:

$$E(h|x) = \Sigma_h P^T \Sigma_x^{-1} x = \Sigma_h P^T (P \Sigma_h P^T)^{-1} x = \Sigma_h P^T (P^T)^{-1} \Sigma_h^{-1} P^{-1} x = P^{-1} x.$$

So after going all the way around, $x = Ph \Rightarrow h = P^{-1} x$...

Looking at it from a Bayesian angle instead:

if $P$ is not square but has full row rank, the Moore–Penrose pseudoinverse gives $h = P^T (P P^T)^{-1} x$;

if it has full column rank, $h = (P^T P)^{-1} P^T x$.

This is the minimum-norm least-squares solution of the linear system $x = Ph$; it is also the maximum-likelihood estimate under the assumption that the random vector is multivariate Gaussian. For the asker's problem, $P$ has full row rank, and the full-row-rank form looks quite similar to (2). The difference is precisely the prior on $h$: (2) incorporates the prior distribution of $h$, correcting the maximum-likelihood estimate into the corresponding maximum a posteriori (MAP) estimate.
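To see the correction concretely: with a flat prior, $\Sigma_h \propto I$, formula (2) collapses to the minimum-norm solution. A quick numpy check, with a random full-row-rank $P$ and $x$ as placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
n_x, n_h = 4, 6
P = rng.standard_normal((n_x, n_h))  # full row rank almost surely
x = rng.standard_normal(n_x)

# minimum-norm least-squares solution for full row rank: P^T (P P^T)^{-1} x
h_mn = P.T @ np.linalg.solve(P @ P.T, x)
assert np.allclose(h_mn, np.linalg.pinv(P) @ x)

# with Sigma_h = I, the MAP formula (2) reduces to the same point
Sigma_h = np.eye(n_h)
h_map = Sigma_h @ P.T @ np.linalg.solve(P @ Sigma_h @ P.T, x)
assert np.allclose(h_map, h_mn)
```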

