Machine Learning Notes 24: Recommender Systems

Recommender systems:
Whenever we browse Taobao, listen to music, or watch movies, the front page relies on a recommender system behind the scenes to suggest things you might like.
We will use a movie-rating system as our running example.
Four users have rated five movies (an unseen movie is marked "?"). The first three movies are romance films and the last two are action films.
We use $n_u$ for the number of users and $n_m$ for the number of movies. If user $j$ has rated movie $i$, then $r(i,j) = 1$, and $y^{(i,j)}$ denotes the rating given.
Based on the movies a user has already rated, we automatically estimate ratings for the movies they have not seen.
Exercise 1
In our notation, $r(i,j) = 1$ if user $j$ has rated movie $i$, and $y^{(i,j)}$ is his rating on that movie. Consider the following example (with the number of movies $n_m$ and number of users $n_u$ as given in the table).
What is $r(2,1)$? How about $y^{(2,1)}$?
A. B. C. D.
Answer: C. Analysis: the first user's rating of the second movie is unknown.
So how do we actually make a prediction?
As an example, let us predict the rating the first user might give the third movie.
First, let $x_1$ denote the romance genre and $x_2$ the action genre. The right-hand side of the table above gives each movie's degree of association with these two genres. By convention $x_0 = 1$, so the first movie's association with the two genres can be written as a feature vector, $x^{(1)} = [1,\ 0.9,\ 0]^T$. We then use $\theta^{(j)}$ to denote user $j$'s parameter vector over these genres. Assume for now that we already know $\theta^{(1)} = [0,\ 5,\ 0]^T$ (how to obtain it is explained below); then $(\theta^{(1)})^T x^{(3)}$ gives the predicted rating of the third movie by the first user, which with $x^{(3)} = [1,\ 0.99,\ 0]^T$ comes out to 4.95.
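This prediction step can be sketched in a few lines of Python. The feature and parameter values below reproduce the example numbers above and are assumptions, not learned values:

```python
# Predicting a rating as the inner product (theta^(1))^T x^(3).
# Feature vectors include the intercept term x0 = 1.
x3 = [1.0, 0.99, 0.0]      # movie 3: [x0, romance, action] (example values)
theta1 = [0.0, 5.0, 0.0]   # user 1's parameters (assumed known here)

def predict(theta, x):
    """Predicted rating = theta^T x."""
    return sum(t * xk for t, xk in zip(theta, x))

print(predict(theta1, x3))  # ≈ 4.95
```

The inner product multiplies the user's weight for each genre by the movie's value for that genre, which is why $\theta_1 = 5$ against $x_1 = 0.99$ yields 4.95.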
Exercise 2
Consider the following set of movie ratings:
Which of the following is a reasonable value for $\theta$? Recall that $x_0 = 1$.
A. B. C. D.
Answer: D. Analysis: this is just the previous example in reverse.
So how is $\theta^{(j)}$ determined?
Let $m^{(j)}$ denote the number of movies that user $j$ has already rated.
As in linear regression, the optimization objective for one user's $\theta^{(j)}$ is:

$$\min_{\theta^{(j)}}\ \frac{1}{2m^{(j)}}\sum_{i:r(i,j)=1}\left((\theta^{(j)})^{T}x^{(i)}-y^{(i,j)}\right)^{2}+\frac{\lambda}{2m^{(j)}}\sum_{k=1}^{n}\left(\theta_{k}^{(j)}\right)^{2}$$

Since $m^{(j)}$ is just a constant scalar, it can be dropped to simplify the computation.
The general objective for a single user is therefore:

$$\min_{\theta^{(j)}}\ \frac{1}{2}\sum_{i:r(i,j)=1}\left((\theta^{(j)})^{T}x^{(i)}-y^{(i,j)}\right)^{2}+\frac{\lambda}{2}\sum_{k=1}^{n}\left(\theta_{k}^{(j)}\right)^{2}$$

and for all users $\theta^{(1)},\dots,\theta^{(n_u)}$ together:

$$\min_{\theta^{(1)},\dots,\theta^{(n_u)}}\ \frac{1}{2}\sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}\left((\theta^{(j)})^{T}x^{(i)}-y^{(i,j)}\right)^{2}+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}\left(\theta_{k}^{(j)}\right)^{2}$$

As with the linear regression material earlier, we minimize this objective with gradient descent; the updates are

$$\theta_{k}^{(j)}:=\theta_{k}^{(j)}-\alpha\sum_{i:r(i,j)=1}\left((\theta^{(j)})^{T}x^{(i)}-y^{(i,j)}\right)x_{k}^{(i)}\qquad(k=0)$$

$$\theta_{k}^{(j)}:=\theta_{k}^{(j)}-\alpha\left(\sum_{i:r(i,j)=1}\left((\theta^{(j)})^{T}x^{(i)}-y^{(i,j)}\right)x_{k}^{(i)}+\lambda\theta_{k}^{(j)}\right)\qquad(k\neq0)$$
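The per-user objective and updates above can be sketched as follows. The ratings, features, and hyperparameters are made-up toy values, not the course's data:

```python
# Minimal sketch: learning user parameters theta^(j) by gradient descent,
# with the movie features x^(i) held fixed. All numbers are toy values.

# ratings[i][j] = user j's rating of movie i, None where r(i,j) = 0
ratings = [
    [5, 1],
    [4, None],
    [None, 5],
]
# Movie features [x0 = 1, romance, action], assumed known for now
X = [[1.0, 0.9, 0.0],
     [1.0, 0.0, 0.8],
     [1.0, 0.2, 0.9]]

def learn_thetas(X, ratings, n_users, lam=0.0, alpha=0.05, iters=3000):
    n = len(X[0])
    thetas = [[0.0] * n for _ in range(n_users)]
    for _ in range(iters):
        for j in range(n_users):
            grad = [0.0] * n
            for i, x in enumerate(X):
                if ratings[i][j] is None:   # sum only over i with r(i,j) = 1
                    continue
                err = sum(t * xk for t, xk in zip(thetas[j], x)) - ratings[i][j]
                for k in range(n):
                    grad[k] += err * x[k]
            for k in range(n):
                reg = lam * thetas[j][k] if k != 0 else 0.0  # no penalty on theta_0
                thetas[j][k] -= alpha * (grad[k] + reg)
    return thetas

thetas = learn_thetas(X, ratings, n_users=2)
# Fill in a missing entry: user 1's predicted rating of movie 3
pred = sum(t * xk for t, xk in zip(thetas[0], X[2]))
```

Note that, exactly as in the update rules, the intercept term $\theta_0$ is not regularized, and each user's gradient sums only over the movies that user has rated.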
But where do the movies' degrees of association with each genre come from in the first place?
For this we need a method called collaborative filtering (Collaborative Filtering).
The premise here is that we know $\theta^{(1)},\dots,\theta^{(n_u)}$, i.e. each user's preference for each genre. Then, from each user's ratings of each movie, rating $= (\theta^{(j)})^T x^{(i)}$, we can work backwards to infer $x^{(i)}$.
Exercise 3
Consider the following movie ratings:
Note that there is only one feature $x_1$. Suppose that:
What would be a reasonable value for the entry denoted "?" in the table above?
A. 0.5 B. 1 C. 2 D. Any of these values would be equally reasonable.
Answer: A. Analysis: the answer follows from rating $= (\theta^{(j)})^T x^{(i)}$.
When users have told us the genres they like, i.e. $\theta^{(1)},\dots,\theta^{(n_u)}$, we can obtain a single $x^{(i)}$ from:

$$\min_{x^{(i)}}\ \frac{1}{2}\sum_{j:r(i,j)=1}\left((\theta^{(j)})^{T}x^{(i)}-y^{(i,j)}\right)^{2}+\frac{\lambda}{2}\sum_{k=1}^{n}\left(x_{k}^{(i)}\right)^{2}$$

and all of $x^{(1)},\dots,x^{(n_m)}$ from:

$$\min_{x^{(1)},\dots,x^{(n_m)}}\ \frac{1}{2}\sum_{i=1}^{n_m}\sum_{j:r(i,j)=1}\left((\theta^{(j)})^{T}x^{(i)}-y^{(i,j)}\right)^{2}+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}\left(x_{k}^{(i)}\right)^{2}$$
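This direction can be checked numerically. With the $\theta$'s fixed and a single feature (and ignoring regularization), minimizing the squared error for one movie has the closed form $x = \sum_j \theta^{(j)} y^{(i,j)} / \sum_j (\theta^{(j)})^2$. The numbers below are hypothetical:

```python
# With user parameters theta^(j) fixed, fit a single-feature x for one movie
# by minimizing sum_j (theta_j * x - y_j)^2: x = sum(theta*y) / sum(theta^2).
thetas = [2.0, 2.0, 4.0]   # one parameter per user (toy values)
ratings = [1.0, 1.0, 2.0]  # that movie's ratings from the same users

x = sum(t * y for t, y in zip(thetas, ratings)) / sum(t * t for t in thetas)
print(x)  # 0.5
```

Here $x = 0.5$ reproduces every rating exactly ($2 \times 0.5 = 1$, $4 \times 0.5 = 2$), mirroring the reasoning in Exercise 3.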
Exercise 4
Suppose you use gradient descent to minimize the objective above. Which of the following is a correct gradient descent update rule?
A. B. C. D. Answer: D
This is also why many apps ask you to pick the genres you like when you first open them: that way they can recommend more new items of the kinds you enjoy.
In other words, given either $x$ or $\theta$, we can derive the other.
The basic idea of collaborative filtering is: when we have one of the two sets of quantities, we derive the other, then use the derived one to re-derive and refine the first, optimizing back and forth in a loop. The details follow below.
To recap the concepts above: given feature values describing each movie's genres, we can use the ratings data to obtain each user's parameters and predict the movies they have not rated. Conversely, given the users' rating data, we can infer each movie's genre features. Merging these two ideas gives us the collaborative filtering algorithm.
For efficiency, we combine the two objectives into a single one:

$$J\!\left(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)}\right)=\frac{1}{2}\sum_{(i,j):r(i,j)=1}\left((\theta^{(j)})^{T}x^{(i)}-y^{(i,j)}\right)^{2}+\frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}\left(x_{k}^{(i)}\right)^{2}+\frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}\left(\theta_{k}^{(j)}\right)^{2}$$

That is:

$$\min_{x^{(1)},\dots,x^{(n_m)},\ \theta^{(1)},\dots,\theta^{(n_u)}}\ J\!\left(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)}\right)$$

Note that in this formulation there is no longer the earlier $x_0 = 1$ convention: every feature is learned and regularized, and the algorithm can learn a constant feature by itself if one is needed, so $x \in \mathbb{R}^n$ and $\theta \in \mathbb{R}^n$.
The steps of the collaborative filtering algorithm are:
First, randomly initialize $x^{(1)},\dots,x^{(n_m)}$ and $\theta^{(1)},\dots,\theta^{(n_u)}$ to small values.
Then minimize $J$ with gradient descent to obtain $x^{(1)},\dots,x^{(n_m)}$ and $\theta^{(1)},\dots,\theta^{(n_u)}$.
Then predict a user's likely rating of a movie as $(\theta^{(j)})^T x^{(i)}$.
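The three steps above can be sketched end to end in plain Python. This is a minimal toy implementation under assumed data and hyperparameters, not the course's Octave code:

```python
# Collaborative filtering sketch: learn x^(i) and theta^(j) jointly by
# gradient descent on the combined cost J. Toy data; no x0 = 1 term.
import random

random.seed(0)
Y = [
    [5, 4, None],
    [None, 1, 5],
    [4, None, 4],
]                      # Y[i][j]: rating of movie i by user j, None if unrated
n_m, n_u, n = 3, 3, 2  # movies, users, number of latent features
lam, alpha = 0.02, 0.05

# Step 1: initialize x and theta to small random values (symmetry breaking).
X = [[random.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(n_m)]
Theta = [[random.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(n_u)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Step 2: minimize J(x, theta) with simultaneous gradient steps.
for _ in range(5000):
    gx = [[lam * X[i][k] for k in range(n)] for i in range(n_m)]
    gt = [[lam * Theta[j][k] for k in range(n)] for j in range(n_u)]
    for i in range(n_m):
        for j in range(n_u):
            if Y[i][j] is None:            # only entries with r(i,j) = 1
                continue
            err = dot(Theta[j], X[i]) - Y[i][j]
            for k in range(n):
                gx[i][k] += err * Theta[j][k]
                gt[j][k] += err * X[i][k]
    for i in range(n_m):
        for k in range(n):
            X[i][k] -= alpha * gx[i][k]
    for j in range(n_u):
        for k in range(n):
            Theta[j][k] -= alpha * gt[j][k]

# Step 3: predict a missing rating with (theta^(j))^T x^(i).
pred = dot(Theta[2], X[0])  # user 3's predicted rating of movie 1
```

Because every feature is regularized and none is pinned to 1, both $X$ and $\Theta$ are free to take whatever shape best reconstructs the observed entries of $Y$.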
Exercise 5
In the algorithm we described, we initialized $x^{(1)},\dots,x^{(n_m)}$ and $\theta^{(1)},\dots,\theta^{(n_u)}$ to small random values. Why is this?
A. This is optional. Initializing to all 0s would work just as well.
B. Random initialization is always necessary when using gradient descent on any problem.
C. This ensures that for any .
D. This serves as symmetry breaking (similar to the random initialization of a neural network's parameters) and ensures the algorithm learns features that are different from each other.
Answer: D
Next we vectorize the collaborative filtering algorithm.
Again take movie ratings as the example. First write all the users' ratings as a matrix $Y$.
In more detail, as shown in the figure above, the matrix of predicted ratings can be expressed as $X\Theta^{T}$, where row $i$ of $X$ is $(x^{(i)})^T$ and row $j$ of $\Theta$ is $(\theta^{(j)})^T$. This algorithm is also called low-rank matrix factorization (Low Rank Matrix Factorization).
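A tiny sketch of this low-rank view, with made-up $X$ and $\Theta$:

```python
# Low-rank factorization view: the full matrix of predicted ratings is
# X * Theta^T, where row i of X is x^(i) and row j of Theta is theta^(j).
X = [[0.75, 0.25],
     [0.25, 0.75]]      # n_m x n movie features (toy values)
Theta = [[4, 0],
         [0, 4],
         [2, 2]]        # n_u x n user parameters (toy values)

# predictions[i][j] = (theta^(j))^T x^(i), i.e. entry (i, j) of X * Theta^T
predictions = [[sum(xk * tk for xk, tk in zip(x, t)) for t in Theta] for x in X]
print(predictions)  # [[3.0, 1.0, 2.0], [1.0, 3.0, 2.0]]
```

Every predicted rating in the $n_m \times n_u$ matrix is produced by one inner product, which is exactly what the single matrix product $X\Theta^T$ computes.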
Exercise 6
Let $X$ and $\Theta$ be as defined above. What is another way of writing the following expression?
A. B. C. D. Answer: C
Mean normalization (Mean Normalization):
If there is a user who has not watched any movies at all, the predictions $(\theta^{(j)})^T x^{(i)}$ we compute for them come out identical for every movie, which gives us no basis for recommending one movie over another.
What mean normalization does is first compute the mean of each row, then subtract that row's mean from every entry, giving a new ratings matrix. We then fit $\theta^{(j)}$ and $x^{(i)}$ on this matrix, and finally add the mean back to the prediction: $(\theta^{(j)})^T x^{(i)} + \mu_i$. For the user who has rated nothing, this mean-based value serves as the default score used for recommendation.
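A minimal sketch of mean normalization with toy ratings:

```python
# Mean normalization: subtract each movie's mean rating before learning,
# then add it back when predicting. For a brand-new user (all "?"), the
# prediction falls back to the movie's mean. Toy ratings assumed.
ratings = [
    [5, 4, None],   # movie 1: the third user hasn't rated it
    [1, None, None],
]

means = []
normalized = []
for row in ratings:
    rated = [r for r in row if r is not None]
    mu = sum(rated) / len(rated)
    means.append(mu)
    normalized.append([None if r is None else r - mu for r in row])

# After training on `normalized`, a prediction is (theta^(j))^T x^(i) + means[i].
# A user with no ratings ends up with theta near zero, so the prediction is
# approximately means[i]:
new_user_pred = [0.0 + mu for mu in means]
print(means)          # [4.5, 1.0]
print(new_user_pred)  # [4.5, 1.0]
```

The new user is thus recommended the movies with the highest average ratings, rather than getting the same score for everything.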
Exercise 7
We talked about mean normalization. However, unlike some other applications of feature scaling, we did not scale the movie ratings by dividing by the range (max minus min value). This is because:
A. This sort of scaling is not useful when the value being predicted is real-valued.
B. All the movie ratings are already comparable (e.g., 0 to 5 stars), so they are already on similar scales.
C. Subtracting the mean is mathematically equivalent to dividing by the range.
D. This makes the overall algorithm significantly more computationally efficient.
Answer: B. Analysis: the difference from general feature scaling is that here the data already lie within a fixed range (0 to 5 stars).
Below are the exercises for this chapter:
Exercise 1
Suppose you run a bookstore, and have ratings (1 to 5 stars) of books. Your collaborative filtering algorithm has learned a parameter vector $\theta^{(j)}$ for each user $j$, and a feature vector $x^{(i)}$ for each book. You would like to compute the "training error", meaning the average squared error of your system's predictions on all the ratings that you have gotten from your users. Which of these are correct ways of doing so (check all that apply)? For this problem, let $m$ be the total number of ratings you have gotten from your users. (Another way of saying this is that $m = \sum_{(i,j):\,r(i,j)=1} 1$.) [Hint: Two of the four options below are correct.]
A. B. C. D.
Answer: A, B. Analysis: a conceptual question.

Exercise 2
In which of the following situations will a collaborative filtering system be the most appropriate learning algorithm (compared to linear or logistic regression)?
A. You're an artist and hand-paint portraits for your clients. Each client gets a different portrait (of themselves) and gives you 1-5 star rating feedback, and each client purchases at most 1 portrait. You'd like to predict what rating your next customer will give you.
B. You manage an online bookstore and you have the book ratings from many users. You want to learn to predict the expected sales volume (number of books sold) as a function of the average rating of a book.
C. You own a clothing store that sells many styles and brands of jeans. You have collected reviews of the different styles and brands from frequent shoppers, and you want to use these reviews to offer those shoppers discounts on the jeans you think they are most likely to purchase.
D. You run an online bookstore and collect the ratings of many users. You want to use this to identify what books are "similar" to each other (i.e., if one user likes a certain book, what are other books that she might also like?)
Answer: C, D. Analysis: a hallmark of collaborative filtering is having many features and much data. In A, each client purchases at most one portrait and we want to predict the next customer's rating, which is clearly better handled by logistic regression. In B, predicting sales volume as a function of a book's average rating is clearly a linear regression problem. In C, we have many styles, brands, and shoppers' reviews: plenty of features. In D, many users' ratings over many books: likewise a good fit for collaborative filtering.

Exercise 3
You run a movie empire, and want to build a movie recommendation system based on collaborative filtering. There were three popular review websites (which we'll call A, B and C) which users go to to rate movies, and you have just acquired all three companies that run these websites. You'd like to merge the three companies' datasets together to build a single/unified system. On website A, users rank a movie as having 1 through 5 stars. On website B, users rank on a scale of 1-10, and decimal values (e.g., 7.5) are allowed. On website C, the ratings are from 1 to 100. You also have enough information to identify users/movies on one website with users/movies on a different website. Which of the following statements is true?
A. Assuming that there is at least one movie/user in one database that doesn't also appear in a second database, there is no sound way to merge the datasets, because of the missing data.
B. You can combine all three training sets into one as long as you perform mean normalization and feature scaling after you merge the data.
C. You can merge the three datasets into one, but you should first normalize each dataset's ratings (say, rescale each dataset's ratings to a 0-1 range).
D. It is not possible to combine these websites' data. You must build three separate recommendation systems.
Answer: C (in some versions phrased as: you can merge the three datasets into one, but you should first normalize each dataset separately by subtracting the mean and then dividing by (max - min), where the max and min are (5-1), (10-1) or (100-1) for the three websites respectively). Analysis: for all the wording, the question is simply whether three different rating scales can be combined into one, and how. Just apply feature scaling to each dataset first.

Exercise 4
Which of the following are true of collaborative filtering systems? Check all that apply.
A. If you have a dataset of users' ratings on some products, you can use these to predict one user's preferences on products he has not rated.
B. When using gradient descent to train a collaborative filtering system, it is okay to initialize all the parameters ($x^{(i)}$ and $\theta^{(j)}$) to zero.
C. To use collaborative filtering, you need to manually design a feature vector for every item (e.g., movie) in your dataset that describes that item's most important properties.
D. Recall the cost function for the content-based recommendation system. Suppose there is only one user and he has rated every movie in the training set. This implies that $n_u = 1$ and $r(i,j) = 1$ for every $i$. In this case, the cost function is equivalent to the one used for regularized linear regression.
Answer: A, D. Analysis: A: a collaborative filtering system can indeed infer ratings for products a user has not rated. B: the parameters should be initialized randomly, not all to zero. C: there is no need to design the feature vectors by hand. D: with a single user who has rated everything, the cost function reduces to regularized linear regression; correct.

Exercise 5
Suppose you have two matrices A and B, where A is 5x3 and B is 3x5. Their product is C = AB, a 5x5 matrix. Furthermore, you have a 5x5 matrix R where every entry is 0 or 1. You want to find the sum of all elements C(i,j) for which the corresponding R(i,j) is 1, and ignore all elements C(i,j) where R(i,j) = 0. One way to do so is a double for-loop that accumulates C(i,j) wherever R(i,j) equals 1.
Which of the following pieces of Octave code will also correctly compute this total? Check all that apply. Assume all options are in code.
A. total = sum(sum((A * B) .* R))
B. C = (A * B) .* R; total = sum(C(:));
C. total = sum(sum((A * B) * R))
D. C = (A * B) * R; total = sum(C(:));
Answer: A, B. Analysis: R contains only 0s and 1s, so combining C with R element-wise zeroes out the positions where R is 0 and keeps C's original values where R is 1. The key idea: the mask is implemented by an element-wise product, not a matrix product.
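For readers not using Octave, the masked sum from options A/B can be reproduced in plain Python (with tiny 2x2 toy matrices instead of the 5x5 ones in the question, for brevity):

```python
# The Octave idiom sum(sum((A * B) .* R)) in plain Python: compute C = A*B,
# then sum only the entries where the 0/1 mask R is 1.
A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]
R = [[1, 0],
     [0, 1]]   # 0/1 mask (toy values)

def matmul(A, B):
    """Plain matrix product: C[i][j] = sum_k A[i][k] * B[k][j]."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

C = matmul(A, B)   # [[19, 22], [43, 50]]
total = sum(C[i][j] for i in range(len(C)) for j in range(len(C[0])) if R[i][j] == 1)
print(total)  # 19 + 50 = 69
```

The conditional in the comprehension plays the role of the element-wise `.* R`: entries where the mask is 0 simply never enter the sum.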
These notes are compiled from Andrew Ng's Machine Learning course on Coursera.
To keep the notes from becoming unwieldy and hard to search through, they are split into several parts; if you are interested, check out the other notes:
Machine Learning Notes 1: Definition of machine learning, supervised and unsupervised learning
Machine Learning Notes 2: Linear models, cost function, and gradient descent
Machine Learning Notes 3: Linear algebra basics
Machine Learning Notes 4: Multivariate linear regression
Machine Learning Notes 5: The normal equation
Machine Learning Notes 6: Matlab programming basics
Machine Learning Notes 7: Programming assignment 1
Machine Learning Notes 8: Cost function and gradient descent for logistic regression
Machine Learning Notes 9: Overfitting and regularization
Machine Learning Notes 10: Programming assignment 2
Machine Learning Notes 11: Neural networks
Machine Learning Notes 12: Programming assignment 3
Machine Learning Notes 13: Neural network cost function and backpropagation (BP)
Machine Learning Notes 14: Backpropagation programming and programming assignment 4
Machine Learning Notes 15: Evaluating algorithm performance
Machine Learning Notes 16: Programming assignment 5, evaluating linear regression
Machine Learning Notes 17: Spam classifiers, precision and recall
Machine Learning Notes 18: Support vector machines and kernels
Machine Learning Notes 19: SVM exercises and programming assignment 6, SVMs and spam
Machine Learning Notes 20: K-means clustering
Machine Learning Notes 21: PCA for dimensionality reduction
Machine Learning Notes 22: Programming assignment 7, K-means and PCA
Machine Learning Notes 23: Anomaly detection