
Coursera Machine Learning Questions and Answers - Part 0 - Week 2 Assignments

I make sure to spend at least one hour a day on machine learning.

The upper graph is my GitHub contribution count and the lower one belongs to a far more prolific developer; the gap is obvious.

Time to learn from the pros and turn my GitHub graph all green...

Project repository:

willwinworld/Coursera-Andrew-Ng-Machine-Learning (github.com)

I had previously tried to jump straight into cs224n and cs231n, but my fundamentals were too weak and there is no way to get there in one leap, so I started with Andrew Ng's classic Machine Learning course on Coursera instead.

My current progress is at Week 2, Linear Regression.

First, notes: I was too lazy to write my own, so I searched online and found someone else's well-written notes that I can consult from time to time:

http://daniellaah.github.io/2016/Machine-Learning-Andrew-Ng-My-Notes.html (daniellaah.github.io)

Another good set of notes (in Chinese, with a PDF download included):

fengdu78/Coursera-ML-AndrewNg-Notes (github.com)

Next come my questions and answers. Some of them may be wrong, so please go easy on me and help clear up my confusion, haha.

  • Implementing the cost function

    The cost function: J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2

    About the cost function, another person's notes put it well: we choose the mean squared error as the measure of fit, so we want to make the mean squared error as small as possible. In other words, we want the mean of the squared differences between each example's predicted value and its true value to be minimal.
  • First, a note: in Octave and MATLAB, .* denotes element-wise multiplication of matrix entries

    The hypothesis is h_\theta(x) = \theta_0 + \theta_1 x

    The computeCost function here is given three arguments, X, y, and theta, with no separate \theta_0. As someone pointed out to me, \theta_0 is taken care of by the elements of X that equal 1: the first column of X is all ones, so in X * theta the parameter \theta_0 is simply multiplied by 1 for every example. Also recall that a = [1,2,3,4] .* [1,2,3,4] = [1,4,9,16], and sum(a) = 30. So to compute the \sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^{2} part, first compute the column vector H = X * theta - y, which corresponds to the term inside the parentheses, (h_\theta(x^{(i)}) - y^{(i)}); then square it element-wise with MATLAB's .*, which multiplies each element of the column vector by itself; and finally take sum, which adds up all the elements of the column vector.

function J = computeCost(X, y, theta)
%COMPUTECOST Compute cost for linear regression
%   J = COMPUTECOST(X, y, theta) computes the cost of using theta as the
%   parameter for linear regression to fit the data points in X and y

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly
J = 0;

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta
%               You should set J to the cost.

h_theta = X * theta - y;                     % error vector h_theta(x^(i)) - y^(i), with h_theta(x) = theta_0 + theta_1 * x
J = (1 / (2 * m)) * sum(h_theta .* h_theta); % h_theta is a column vector, so sum adds up all of its elements
% .* is element-wise multiplication, e.g. [1,2,3,4] .* [1,2,3,4] = [1,4,9,16]

% =========================================================================

end

  • Implementing gradientDescent. When I first saw this exercise I was a bit lost, but after reading the notes it started to make sense.

I was still a bit confused when taking the partial derivatives of J with respect to \theta_0 and \theta_1 for the gradient-descent update rule, so I went back and reviewed the chain rule. Here is how I understand it. First take j = 0, i.e. the partial derivative with respect to \theta_0, while the summation index i runs from 1 to m. Bringing the exponent 2 down and multiplying by \frac{1}{2m} gives \frac{1}{m}; then we differentiate the inner function h_\theta(x^{(i)}) - y^{(i)}, where h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)} (note the summation index i has nothing to do with j = 0). Differentiating the inside with respect to \theta_0 gives 1, while the \theta_1 x^{(i)} term differentiates to 0, so the derivative of the inner function is just 1, and multiplying by 1 everywhere gives the result.

Then take the partial derivative with respect to \theta_1: every term of the sum from i = 1 to m picks up a factor of x^{(i)}. For example, when i = 1 the inner expression is \theta_0 + \theta_1 x^{(1)}, whose derivative with respect to \theta_1 is obviously x^{(1)}; the same holds for every i, which is where the trailing x^{(i)} factor comes from. With that, the mathematical derivation is settled.
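Written out, the partial derivatives from this derivation and the corresponding update rules (the standard course formulas) are:

\frac{\partial J}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})

\frac{\partial J}{\partial \theta_1} = \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})\,x^{(i)}

\theta_j := \theta_j - \alpha\,\frac{\partial J}{\partial \theta_j}, \quad j = 0, 1 \text{ (updated simultaneously)}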

Another thing that puzzled me: in Octave you write X * theta, and apparently the reverse order does not work. My thinking was, isn't this just a column vector times a scalar number, so either order should be fine? But only X * theta works, which confused me.
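One way to see why only this order works: theta here is a 2x1 column vector rather than a scalar, so the matrix dimensions only line up one way. A minimal Octave check with made-up numbers:

X = [1 1; 1 2; 1 3];   % m x 2: a column of ones plus the feature values
theta = [0.5; 1.5];    % 2 x 1 column vector, not a scalar
X * theta              % (3x2) * (2x1) -> 3x1 column of predictions, works
% theta * X            % (2x1) * (3x2) -> error: inner dimensions 1 and 3 disagree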

Here is the code:

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
%   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha

% Initialize some useful values
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);

for iter = 1:num_iters

    % ====================== YOUR CODE HERE ======================
    % Instructions: Perform a single gradient step on the parameter vector
    %               theta.
    %
    % Hint: While debugging, it can be useful to print out the values
    %       of the cost function (computeCost) and gradient here.
    %

    temp1 = alpha * (1 / m) * sum(X * theta - y);
    temp2 = alpha * (1 / m) * sum((X * theta - y) .* X(:,2));
    theta(1) = theta(1) - temp1;
    theta(2) = theta(2) - temp2;

    % ============================================================

    % Save the cost J in every iteration
    J_history(iter) = computeCost(X, y, theta);

end

end

Using the update formulas above, the result is not hard to obtain; just note that \theta_0 and \theta_1 correspond to theta(1) and theta(2) in the code.

  • Next is the multivariate linear regression part. The first step is to implement featureNormalize.m, i.e. mean normalization.

This one is not hard; the code is as follows:

function [X_norm, mu, sigma] = featureNormalize(X)
%FEATURENORMALIZE Normalizes the features in X
%   FEATURENORMALIZE(X) returns a normalized version of X where
%   the mean value of each feature is 0 and the standard deviation
%   is 1. This is often a good preprocessing step to do when
%   working with learning algorithms.

% You need to set these values correctly
X_norm = X;
mu = zeros(1, size(X, 2));
sigma = zeros(1, size(X, 2));

% ====================== YOUR CODE HERE ======================
% Instructions: First, for each feature dimension, compute the mean
%               of the feature and subtract it from the dataset,
%               storing the mean value in mu. Next, compute the
%               standard deviation of each feature and divide
%               each feature by its standard deviation, storing
%               the standard deviation in sigma.
%
%               Note that X is a matrix where each column is a
%               feature and each row is an example. You need
%               to perform the normalization separately for
%               each feature.
%
% Hint: You might find the 'mean' and 'std' functions useful.
%

num_features = size(X, 2);
for x = 1:num_features
    mu(x) = mean(X(:,x));
    sigma(x) = std(X(:,x));
    X_norm(:,x) = (X_norm(:,x) - mu(x)) / sigma(x);
end

% ============================================================

end

Note that array indexing in Octave/MATLAB starts at 1, which I am still not used to...
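A two-line illustration of the 1-based indexing:

v = [10 20 30];
v(1)   % returns 10, the first element; v(0) is an error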

  • Next we need to implement the multivariate version: gradient descent with multiple variables and the corresponding cost function.

The multivariate cost function J(\theta_0, \theta_1, \ldots, \theta_n) is written out below; the code only needs to follow the formula.
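In formula form (the same mean-squared-error cost as before, only with a longer hypothesis):

J(\theta_0, \theta_1, \ldots, \theta_n) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2, \qquad h_\theta(x) = \theta_0 + \theta_1 x_1 + \ldots + \theta_n x_n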

function J = computeCostMulti(X, y, theta)
%COMPUTECOSTMULTI Compute cost for linear regression with multiple variables
%   J = COMPUTECOSTMULTI(X, y, theta) computes the cost of using theta as the
%   parameter for linear regression to fit the data points in X and y

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly
J = 0;

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta
%               You should set J to the cost.

% h_theta(x) = theta_0 + theta_1 * x_1 + ... + theta_n * x_n
predictions = X * theta;
square_errors = (predictions - y).^2;
J = 1 / (2 * m) * sum(square_errors);
% The cost function for multivariate linear regression is essentially
% the same as in the single-variable case.

% =======================================================================

end

Next is multivariate gradient descent. This one confused me at first; the code comes first, and then I will explain what puzzled me and how I resolved it.

function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
%GRADIENTDESCENTMULTI Performs gradient descent to learn theta
%   theta = GRADIENTDESCENTMULTI(x, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha

% Initialize some useful values
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);

for iter = 1:num_iters

    % ====================== YOUR CODE HERE ======================
    % Instructions: Perform a single gradient step on the parameter vector
    %               theta.
    %
    % Hint: While debugging, it can be useful to print out the values
    %       of the cost function (computeCostMulti) and gradient here.
    %

    predictions = X * theta;
    updates = X' * (predictions - y);
    theta = theta - alpha * (1 / m) * updates;
    % theta = theta - alpha * (1/m) * sum(sqrt(sqerrors)) * X;
    % theta - (alpha/m) * (X' * (X * theta - y));
    % theta = theta - (alpha/m) * (X' * (X * theta - y));

    % ============================================================

    % Save the cost J in every iteration
    J_history(iter) = computeCostMulti(X, y, theta);

end

end

These answers are meant to be equivalent ways of writing the same update (although, as noted below, the commented-out sum(sqrt(...)) attempt is not actually right). When I first saw the first answer, especially the updates line, I did not understand it, so I went back to the formula.

Then compare the second answer: the sqerrors it refers to comes from the earlier computeCost.m,

function J = computeCost(X, y, theta)
%COMPUTECOST Compute cost for linear regression
%   J = COMPUTECOST(X, y, theta) computes the cost of using theta as the
%   parameter for linear regression to fit the data points in X and y

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly
J = 0;

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta
%               You should set J to the cost.

predictions = X * theta;
sqerrors = (predictions - y).^2;
J = 1 / (2 * m) * sum(sqerrors);

% =========================================================================

end

Seeing this, it clicked: X' * (predictions - y) computes, for every parameter \theta_j at once, the sum \sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} that appears in the per-parameter update rule, so the vectorized matrix operation and the element-wise updates from the single-variable version are equivalent. (The sum(sqrt(sqerrors)) * X attempt, on the other hand, is not equivalent: taking the square root of the squared errors gives absolute errors, and the dimensions do not even match, which is why that line stays commented out.)
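Spelled out, the per-parameter gradient and its vectorized form (standard results that match the code above) are:

\frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})\,x_j^{(i)}, \qquad \nabla_\theta J = \frac{1}{m}X^{T}(X\theta - y), \qquad \theta := \theta - \frac{\alpha}{m}X^{T}(X\theta - y)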

  • Next, we vary the learning rate \alpha and plot the results to observe how fast gradient descent converges. I learned something new here:

"We recommend trying values of the learning rate alpha on a log-scale, at multiplicative steps of about 3 times the previous value (i.e., 0.3, 0.1, 0.03, 0.01 and so on)." In other words, try values spaced by roughly a factor of 3, then plot the curves and compare. I also changed the number of iterations; the code is as follows:

% Choose some alpha value
alpha = 1;
num_iters = 50;

% Init Theta and Run Gradient Descent
theta = zeros(3, 1);
[theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters);

% Plot the convergence graph
figure;
plot(1:numel(J_history), J_history, '-b', 'LineWidth', 2);
xlabel('Number of iterations');
ylabel('Cost J');

% Display gradient descent's result
fprintf('Theta computed from gradient descent:\n');
fprintf(' %f\n', theta);
fprintf('\n');

%hold on;
%theta = zeros(3, 1);
%[theta, J_history] = gradientDescentMulti(X, y, theta, alpha*3, num_iters);
%plot(1:numel(J_history), J_history, '-g', 'LineWidth', 2);
%fprintf('Theta computed from gradient descent:\n');
%fprintf(' %f\n', theta);
%fprintf('\n');

%theta = zeros(3, 1);
%[theta, J_history] = gradientDescentMulti(X, y, theta, alpha*9, num_iters);
%plot(1:numel(J_history), J_history, '-y', 'LineWidth', 2);
%fprintf('Theta computed from gradient descent:\n');
%fprintf(' %f\n', theta);
%fprintf('\n');

%theta = zeros(3, 1);
%[theta, J_history] = gradientDescentMulti(X, y, theta, 1, num_iters);
%plot(1:numel(J_history), J_history, '-r', 'LineWidth', 2);
%fprintf('Theta computed from gradient descent:\n');
%fprintf(' %f\n', theta);
%fprintf('\n');

The hold on command lets multiple curves be drawn on the same figure.
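Building on that, here is one compact way to sweep several learning rates on a single plot. This is only a sketch: it assumes the normalized X, y and the gradientDescentMulti above are already in the workspace.

alphas = [0.01 0.03 0.1 0.3 1];        % log-scale values, roughly 3x apart
colors = {'-b', '-r', '-g', '-k', '-m'};
num_iters = 50;
figure; hold on;
for k = 1:numel(alphas)
    theta = zeros(3, 1);
    [theta, J_history] = gradientDescentMulti(X, y, theta, alphas(k), num_iters);
    plot(1:num_iters, J_history, colors{k}, 'LineWidth', 2);
end
xlabel('Number of iterations');
ylabel('Cost J');
legend('0.01', '0.03', '0.1', '0.3', '1');
hold off;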

  • Next, we use the computed \theta to estimate the price of a house.

% Estimate the price of a 1650 sq-ft, 3 br house
% ====================== YOUR CODE HERE ======================
% Recall that the first column of X is all-ones. Thus, it does
% not need to be normalized.
price = 0; % You should change this

elements = [1, (1650 - mu(1)) / sigma(1), (3 - mu(2)) / sigma(2)]; % the first column of X is all ones, so it is not normalized
price = elements * theta;

I was a bit puzzled here: why is the first element of elements equal to 1? Is it because of the comment, "Recall that the first column of X is all-ones. Thus, it does not need to be normalized."?

The other two elements, the house's floor area and its number of bedrooms, both have to be normalized; plugging them into the hypothesized linear equation then gives the predicted price.

I later asked someone, and it turns out my understanding above was off. In multivariate linear regression the hypothesis is h_\theta(x) = \theta_0 + \theta_1 x_1 + \ldots + \theta_n x_n, which can be vectorized into two parts, x = (1, x_1, \ldots, x_n) and \theta = (\theta_0, \theta_1, \ldots, \theta_n), so that h_\theta(x) = \theta^T x. The leading 1 is what multiplies \theta_0, which answers the question above.

  • The last exercise solves for \theta with the Normal Equation. Writing it from the instructions and the lecture notes is fairly straightforward; what I want to point out is that the Normal Equation is a basic least-squares method, which I honestly did not know until someone pointed it out, haha. Still such a noob...

The notes already explain this part very clearly, so I will not repeat them; what I do want to mention is the connection to least squares:

https://zh.wikipedia.org/wiki/%E6%9C%80%E5%B0%8F%E4%BA%8C%E4%B9%98%E6%B3%95 (最小二乘法 / least squares, zh.wikipedia.org)

Normal equation和Least squares以及最小二乘法擬合圓 (sinb.github.io)

王留行: 掰開揉碎推導Normal Equation (zhuanlan.zhihu.com)

機器學習:Normal Equation 的理解 (panjunwen.com)

The Zhihu article in particular explains it in detail; it is worth reading a few times over to digest properly.
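For completeness, here is a minimal sketch of what the normalEqn.m part can look like, using the standard closed-form solution \theta = (X^{T}X)^{-1}X^{T}y (pinv instead of inv is a choice made here for numerical robustness):

function [theta] = normalEqn(X, y)
%NORMALEQN Computes the closed-form solution to linear regression
%   using the normal equation: theta = pinv(X' * X) * X' * y

theta = pinv(X' * X) * X' * y;

end

A nice property of the normal equation is that no feature scaling and no learning rate are needed, so the prediction can be computed directly from the raw features, e.g. price = [1, 1650, 3] * theta.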

With that, the Week 2 assignment is done. On this first pass I only implemented everything in Octave; on the second pass I will redo it in Python.

