Coursera Machine Learning: Questions and Answers, Part 0 – Week 2 Assignments
I make sure to set aside at least an hour a day for machine learning.
Comparing my GitHub contribution graph with that of an expert I follow, the gap is obvious.
Time to learn from the pros and turn my GitHub calendar green...
Repository:
willwinworld/Coursera-Andrew-Ng-Machine-Learning
I previously tried to jump straight into cs224n and cs231n, but my fundamentals were too weak and there are no shortcuts, so I went back to Andrew Ng's classic Machine Learning course on Coursera.
My current progress is at the Week 2 material on Linear Regression.
First, notes. I was too lazy to write my own, so I searched around and found a good set written by someone else that I can consult from time to time:
http://daniellaah.github.io/2016/Machine-Learning-Andrew-Ng-My-Notes.html
Another good set of notes (in Chinese, with a PDF download included):
fengdu78/Coursera-ML-AndrewNg-Notes
What follows are my questions and the answers I worked out. Some of it may be wrong, so please be gentle and help clear things up for me.
- Implementing the cost function. A sentence from those notes puts it well: we choose the mean squared error as our measure of fit, so we want to make the mean squared error as small as possible, i.e., we want the average of the squared differences between each example's predicted value and its true value to be minimal:

$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
- First, a quick note: in Octave and MATLAB, .* is element-wise multiplication, e.g. a = [1,2,3,4] .* [1,2,3,4] = [1,4,9,16], and sum(a) = 30. The hypothesis is $h_\theta(x) = \theta_0 + \theta_1 x$. The computeCost function takes three arguments, X, y, and theta, and $\theta_0$ is not passed separately; as someone pointed out to me, $\theta_0$ can be thought of as multiplying the column of ones in X, and the theta argument is exactly the parameter vector $\theta = [\theta_0; \theta_1]$. So when computing the term $h_\theta(x^{(i)}) - y^{(i)}$, first form the column vector H = X * theta - y, which is exactly what sits inside the parentheses; then square it element by element with .*, which multiplies each entry of the column vector by itself; and finally add everything up with sum, which sums all the elements of the column vector.
function J = computeCost(X, y, theta)
%COMPUTECOST Compute cost for linear regression
%   J = COMPUTECOST(X, y, theta) computes the cost of using theta as the
%   parameter for linear regression to fit the data points in X and y

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly 
J = 0;

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta
%               You should set J to the cost.

h_theta = X * theta - y;                      % residuals h_theta(x^(i)) - y^(i), where h_theta(x) = theta_0 + theta_1 * x
J = (1 / (2 * m)) * sum(h_theta .* h_theta);  % h_theta is a column vector, so sum adds up all of its elements
                                              % .* multiplies element-wise, e.g. [1,2,3,4] .* [1,2,3,4] = [1,4,9,16]

% =========================================================================

end
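As a quick sanity check, here is a hypothetical toy example (made-up numbers, not from ex1data1.txt); with theta set to all zeros the cost is just the mean of the squared targets divided by two:

% Toy data: three examples, one feature, plus the column of ones for theta_0
X = [1 1; 1 2; 1 3];
y = [1; 2; 3];

computeCost(X, y, [0; 0])   % (1^2 + 2^2 + 3^2) / (2*3) = 2.3333
computeCost(X, y, [0; 1])   % perfect fit y = x, so the cost is 0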
- Implementing gradientDescent. This one threw me at first, but after reading the notes it started to make sense. The gradient descent update rule is

$\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1)$

but I was still unsure how to take the partial derivative of $J(\theta_0, \theta_1)$ with respect to each parameter, so I went back and reviewed the chain rule.

Here is how I understand it. For j = 0 we differentiate with respect to $\theta_0$, and the sum runs over i = 1 to m. Bringing the exponent 2 down cancels the 1/2 factor, leaving $\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})$ times the derivative of the inner function. The inner function is $h_\theta(x^{(i)}) - y^{(i)} = \theta_0 + \theta_1 x^{(i)} - y^{(i)}$; the $\theta_0$ term differentiates to 1 and the remaining terms $\theta_1 x^{(i)} - y^{(i)}$ do not depend on $\theta_0$, so the inner derivative is 1, each term just gets multiplied by 1, and we obtain

$\frac{\partial}{\partial\theta_0}J(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$

Then, differentiating with respect to $\theta_1$, every term from i = 1 to m picks up a factor $x^{(i)}$. For example, for i = 1 the term is $(\theta_0 + \theta_1 x^{(1)} - y^{(1)})^2$, whose derivative with respect to $\theta_1$ is clearly $2(\theta_0 + \theta_1 x^{(1)} - y^{(1)})\,x^{(1)}$; the same holds for every i, which gives

$\frac{\partial}{\partial\theta_1}J(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}$

and with that the derivation works out nicely.
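Writing the chain-rule step out once in general form (same notation as above, with $x_0^{(i)} = 1$ so the j = 0 and j = 1 cases both fall out of it):

$\frac{\partial}{\partial\theta_j}J(\theta) = \frac{\partial}{\partial\theta_j}\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)\frac{\partial h_\theta(x^{(i)})}{\partial\theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$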
Another thing that puzzled me: in Octave you write X * theta, and apparently the reverse order does not work. My thinking was, isn't this just a column vector multiplied by a scalar, so either order should be fine? But only X * theta works, which confused me.
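I think the answer is simply matrix dimensions: theta is a 2-by-1 column vector rather than a scalar, and X is m-by-2, so only one order of the product is defined. A quick shape check with made-up numbers (purely illustrative, not part of the assignment scripts):

X = [ones(3,1), (1:3)'];   % 3x2: column of ones plus one feature
theta = [0.5; 2];          % 2x1 parameter vector, not a scalar

size(X * theta)            % ans = 3 1, since (3x2)*(2x1) is defined
% theta * X                % error: (2x1)*(3x2) dimensions do not agree
theta' * X'                % transposing both factors gives the same numbers as a 1x3 row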
The code is as follows:
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
%   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha

% Initialize some useful values
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);

for iter = 1:num_iters

    % ====================== YOUR CODE HERE ======================
    % Instructions: Perform a single gradient step on the parameter vector
    %               theta.
    %
    % Hint: While debugging, it can be useful to print out the values
    %       of the cost function (computeCost) and gradient here.
    %

    temp1 = alpha * (1 / m) * sum(X * theta - y);
    temp2 = alpha * (1 / m) * sum((X * theta - y) .* X(:,2));
    theta(1) = theta(1) - temp1;
    theta(2) = theta(2) - temp2;

    % ============================================================

    % Save the cost J in every iteration
    J_history(iter) = computeCost(X, y, theta);

end

end
From the two partial derivatives above it is not hard to get the result, but note that $\theta_0$ and $\theta_1$ correspond to theta(1) and theta(2).
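For context, a minimal driver along the lines of what ex1.m does (the file name, learning rate, and iteration count are what I remember the script using, so treat them as assumptions):

data = load('ex1data1.txt');            % assumed data file from the exercise
m = size(data, 1);
X = [ones(m, 1), data(:, 1)];           % add the column of ones for theta_0
y = data(:, 2);

theta = zeros(2, 1);
alpha = 0.01;                           % learning rate
num_iters = 1500;

[theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters);
fprintf('theta: %f %f\n', theta(1), theta(2));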
- Next comes multivariate linear regression. The first step is to implement featureNormalize.m, i.e., feature normalization (subtract each feature's mean and divide by its standard deviation).
This one is not hard; the code is as follows:
function [X_norm, mu, sigma] = featureNormalize(X)
%FEATURENORMALIZE Normalizes the features in X
%   FEATURENORMALIZE(X) returns a normalized version of X where
%   the mean value of each feature is 0 and the standard deviation
%   is 1. This is often a good preprocessing step to do when
%   working with learning algorithms.

% You need to set these values correctly
X_norm = X;
mu = zeros(1, size(X, 2));
sigma = zeros(1, size(X, 2));

% ====================== YOUR CODE HERE ======================
% Instructions: First, for each feature dimension, compute the mean
%               of the feature and subtract it from the dataset,
%               storing the mean value in mu. Next, compute the
%               standard deviation of each feature and divide
%               each feature by its standard deviation, storing
%               the standard deviation in sigma.
%
%               Note that X is a matrix where each column is a
%               feature and each row is an example. You need
%               to perform the normalization separately for
%               each feature.
%
% Hint: You might find the 'mean' and 'std' functions useful.
%

num_features = size(X, 2);
for x = 1:num_features
    mu(x) = mean(X(:,x));
    sigma(x) = std(X(:,x));
    X_norm(:,x) = (X_norm(:,x) - mu(x)) / sigma(x);
end

% ============================================================

end
Note that array indexing in Octave/MATLAB starts at 1, which I am still not used to...
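As an aside, the loop over features can also be written in a fully vectorized way; this is just an alternative sketch that should produce the same mu, sigma, and X_norm as the loop version above:

% Vectorized alternative to the per-feature loop
mu = mean(X);                                        % 1 x n row vector of column means
sigma = std(X);                                      % 1 x n row vector of column standard deviations
X_norm = (X - repmat(mu, size(X, 1), 1)) ./ repmat(sigma, size(X, 1), 1);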
- Next we need multivariate gradient descent and the multivariate cost function. The formula for the multivariate cost function is

$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2, \qquad h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n$

and the code only has to follow the formula:
function J = computeCostMulti(X, y, theta)
%COMPUTECOSTMULTI Compute cost for linear regression with multiple variables
%   J = COMPUTECOSTMULTI(X, y, theta) computes the cost of using theta as the
%   parameter for linear regression to fit the data points in X and y

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly 
J = 0;

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta
%               You should set J to the cost.

% h_theta(x) = theta_0 + theta_1 * x_1 + ... + theta_n * x_n
predictions = X * theta;
square_errors = (predictions - y).^2;
J = 1 / (2 * m) * sum(square_errors);
% the multivariate cost function is essentially the same as the single-variable one

% =========================================================================

end
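The same cost can also be computed as a single matrix product, without the element-wise square; an equivalent alternative, just for reference:

% Equivalent quadratic-form version: J = (X*theta - y)'(X*theta - y) / (2m)
errors = X * theta - y;
J = (errors' * errors) / (2 * m);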
Next is multivariate gradient descent. This one confused me at first, so I will show the code first and then explain what puzzled me and how I resolved it.
function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
%GRADIENTDESCENTMULTI Performs gradient descent to learn theta
%   theta = GRADIENTDESCENTMULTI(x, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha

% Initialize some useful values
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);

for iter = 1:num_iters

    % ====================== YOUR CODE HERE ======================
    % Instructions: Perform a single gradient step on the parameter vector
    %               theta.
    %
    % Hint: While debugging, it can be useful to print out the values
    %       of the cost function (computeCostMulti) and gradient here.
    %

    predictions = X * theta;
    updates = X' * (predictions - y);          % X' * (X*theta - y): note the transpose
    theta = theta - alpha * (1 / m) * updates;

    % Equivalent one-line form:
    % theta = theta - (alpha/m) * (X' * (X * theta - y));
    % Earlier attempt based on sqerrors from computeCost.m (see below):
    % theta = theta - alpha * (1/m) * sum(sqrt(sqerrors)) * X;

    % ============================================================

    % Save the cost J in every iteration
    J_history(iter) = computeCostMulti(X, y, theta);

end

end
The uncommented answer and the one-line form below it are equivalent; the last commented line was an earlier attempt that turned out to be a dead end. When I first saw the vectorized answer, especially the updates line, I did not understand it, so I went back to the update formula

$\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \qquad \text{(updating every } \theta_j \text{ simultaneously)}$

and compared it with that earlier attempt, in which sqerrors came from the previous computeCost.m:
function J = computeCost(X, y, theta)
%COMPUTECOST Compute cost for linear regression
%   J = COMPUTECOST(X, y, theta) computes the cost of using theta as the
%   parameter for linear regression to fit the data points in X and y

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly 
J = 0;

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta
%               You should set J to the cost.

predictions = X * theta;
sqerrors = (predictions - y).^2;
J = 1 / (2 * m) * sum(sqerrors);

% =========================================================================

end
Seeing this, it finally clicked: $X^T(X\theta - y)$ and $\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$ (evaluated for every j) are equivalent; the former is simply the matrix form of the same computation.
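A small numeric check of that equivalence (the numbers are made up purely for illustration):

% Hypothetical data: 4 examples, intercept column plus 2 features
X = [1 2 3; 1 4 1; 1 0 5; 1 7 2];
y = [1; 2; 3; 4];
theta = [0.1; 0.2; 0.3];
m = length(y);

grad_vec = (1/m) * X' * (X * theta - y);      % vectorized gradient, all j at once

grad_loop = zeros(size(theta));               % the same thing, one component at a time
for j = 1:length(theta)
    grad_loop(j) = (1/m) * sum((X * theta - y) .* X(:, j));
end

disp([grad_vec, grad_loop])                   % the two columns match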
- Next, we vary the learning rate $\alpha$ and plot the cost curves to observe the speed of convergence. Here I learned something new from the exercise text: "We recommend trying values of the learning rate on a log-scale, at multiplicative steps of about 3 times the previous value (i.e., 0.3, 0.1, 0.03, 0.01 and so on)." In other words, try values spaced by roughly a factor of 3, then plot and compare. I also changed the number of iterations; the code is as follows:
% Choose some alpha value
alpha = 1;
num_iters = 50;

% Init Theta and Run Gradient Descent 
theta = zeros(3, 1);
[theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters);

% Plot the convergence graph
figure;
plot(1:numel(J_history), J_history, '-b', 'LineWidth', 2);
xlabel('Number of iterations');
ylabel('Cost J');

% Display gradient descent's result
fprintf('Theta computed from gradient descent: \n');
fprintf(' %f \n', theta);
fprintf('\n');

%hold on;
%theta = zeros(3, 1);
%[theta, J_history] = gradientDescentMulti(X, y, theta, alpha*3, num_iters);
%plot(1:numel(J_history), J_history, '-g', 'LineWidth', 2);
%fprintf('Theta computed from gradient descent: \n');
%fprintf(' %f \n', theta);
%fprintf('\n');

%theta = zeros(3, 1);
%[theta, J_history] = gradientDescentMulti(X, y, theta, alpha*9, num_iters);
%plot(1:numel(J_history), J_history, '-y', 'LineWidth', 2);
%fprintf('Theta computed from gradient descent: \n');
%fprintf(' %f \n', theta);
%fprintf('\n');

%theta = zeros(3, 1);
%[theta, J_history] = gradientDescentMulti(X, y, theta, 1, num_iters);
%plot(1:numel(J_history), J_history, '-r', 'LineWidth', 2);
%fprintf('Theta computed from gradient descent: \n');
%fprintf(' %f \n', theta);
%fprintf('\n');
The hold on command lets several curves be drawn on the same figure.
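The repeated commented-out blocks above could also be folded into a loop. This is just a sketch of how that might look, assuming X and y are the normalized multivariate training data (with the intercept column) and using the log-scale alpha values suggested by the exercise text:

alphas = [0.3, 0.1, 0.03, 0.01];          % learning rates on a roughly log scale
colors = {'-b', '-g', '-y', '-r'};
num_iters = 50;

figure; hold on;
for k = 1:length(alphas)
    theta = zeros(3, 1);
    [theta, J_history] = gradientDescentMulti(X, y, theta, alphas(k), num_iters);
    plot(1:numel(J_history), J_history, colors{k}, 'LineWidth', 2);
end
xlabel('Number of iterations');
ylabel('Cost J');
legend('alpha = 0.3', 'alpha = 0.1', 'alpha = 0.03', 'alpha = 0.01');
hold off;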
- Next, use the learned $\theta$ to estimate the price of a house.
% Estimate the price of a 1650 sq-ft, 3 br house
% ====================== YOUR CODE HERE ======================
% Recall that the first column of X is all-ones. Thus, it does
% not need to be normalized.
price = 0; % You should change this

elements = [1, (1650 - mu(1)) / sigma(1), (3 - mu(2)) / sigma(2)]; % the first column of X is all ones, so no normalization there
price = elements * theta;
I was a little confused here: why is the first element of elements equal to 1? Is it because of the comment "Recall that the first column of X is all-ones. Thus, it does not need to be normalized."?
The other two elements, the house's floor area and its number of bedrooms, both have to be normalized (using the same mu and sigma computed during training), and then the assumed linear model gives the predicted price.
I later asked someone about this, and my understanding above was a bit off. In multivariate linear regression the hypothesis is $h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n$ with $x_0 = 1$, which vectorizes into the two parts $\theta = [\theta_0, \theta_1, \ldots, \theta_n]^T$ and $x = [x_0, x_1, \ldots, x_n]^T$, so that $h_\theta(x) = \theta^T x$. That is why the first element is 1.
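In code that inner product can be written either way round; a tiny check, reusing the mu, sigma, and theta from the training run above:

x = [1, (1650 - mu(1)) / sigma(1), (3 - mu(2)) / sigma(2)];  % 1 x 3 query point, x_0 = 1
price_row = x * theta;        % (1x3)*(3x1) gives a scalar
price_dot = theta' * x';      % the same inner product written as theta^T x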
- The last problem solves for $\theta$ with the normal equation, $\theta = (X^TX)^{-1}X^Ty$. Writing it from the instructions and the lecture notes is fairly straightforward. The one extra point worth making is that the normal equation is a basic least-squares method; I honestly had no idea until someone pointed it out to me. Still plenty to learn...
The course notes already explain this part very clearly, so I will not repeat them here; what I do want to add are a few pointers that extend into least squares:
- https://zh.wikipedia.org/wiki/%E6%9C%80%E5%B0%8F%E4%BA%8C%E4%B9%98%E6%B3%95
- Normal equation和Least squares以及最小二乘法擬合圓
- 王留行:掰開揉碎推導Normal Equation
- 機器學習:Normal Equation 的理解

The Zhihu article among these explains the derivation in the most detail; it is worth reading a few times and digesting properly.
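For completeness, here is a sketch of what normalEqn.m ends up as, based on the formula above (using pinv is my own preference for numerical safety, and the usage lines at the end are assumptions about how the driver script calls it rather than the official solution). Note that with the normal equation there is no feature normalization, so the 1650 sq-ft, 3-bedroom prediction uses the raw feature values:

function [theta] = normalEqn(X, y)
%NORMALEQN Computes the closed-form solution to linear regression
%   NORMALEQN(X,y) computes the closed-form solution to linear
%   regression using the normal equation.

theta = pinv(X' * X) * X' * y;    % theta = (X^T X)^(-1) X^T y

end

% Example usage (raw, un-normalized features):
% theta = normalEqn(X, y);
% price = [1, 1650, 3] * theta;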
That wraps up the Week 2 assignment. For this first pass I only implemented it in Octave; on a second pass I will redo it in Python.