Machine Learning Math: Gradient Descent

03-02

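Preface: when I started writing about the perceptron, I found it necessary to give gradient descent a post of its own, because many of the machine learning models that come later use gradient descent to minimize their loss functions.
PS: I just saw the caption I wrote for the cover image of an earlier post. So pretentious, how embarrassing...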

I thought I could start writing right away, but the more I read, the more confused I became and the less I dared to begin.

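Gradient descent: someone can show you the picture that is easy to find online and say that, at a given point, moving against the gradient at that point is the direction of steepest descent. Or they can give an example: you are standing on a hillside, and without knowing any algorithm or math you can still find the fastest way down to the bottom. Understanding it from a picture is easy, but truly understanding every formula is not.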

Algebraic definition

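In calculus, take the partial derivative of a multivariate function with respect to each of its variables and write those partial derivatives as a vector: that vector is the gradient. For a function $f(x,y)$, taking the partial derivatives with respect to x and y gives the gradient vector $(\partial f/\partial x, \partial f/\partial y)^T$, written $\mathrm{grad}\, f(x,y)$ or $\nabla f(x,y)$ for short. The gradient vector at a specific point $(x_0,y_0)$ is $(\partial f/\partial x_0, \partial f/\partial y_0)^T$, or $\nabla f(x_0,y_0)$. For a function of three variables the gradient is $(\partial f/\partial x, \partial f/\partial y, \partial f/\partial z)^T$, and so on.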
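As a small illustration of the definition above, the sketch below compares a hand-derived gradient with a central-difference approximation. The function `f`, the point (1, 2), and the helper names are made up for this example and are not from the original post.

```python
import numpy as np

def f(x, y):
    # Illustrative function (an assumption for this sketch): f(x, y) = x^2 + 3xy
    return x**2 + 3 * x * y

def analytic_grad(x, y):
    # Partial derivatives written out by hand: (df/dx, df/dy) = (2x + 3y, 3x)
    return np.array([2 * x + 3 * y, 3 * x])

def numeric_grad(func, x, y, h=1e-6):
    # Central differences approximate each partial derivative separately
    dfdx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    dfdy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return np.array([dfdx, dfdy])

g_exact = analytic_grad(1.0, 2.0)      # gradient at the point (1, 2)
g_approx = numeric_grad(f, 1.0, 2.0)   # should agree up to rounding error
print(g_exact, g_approx)
```

Stacking the partial derivatives into one array is all that "the gradient" means here; the numerical version is only a sanity check.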

Geometric definition

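The gradient of a function at a point is a vector. Geometrically, it points in the direction in which the function increases fastest. Concretely, for a function $f(x,y)$ at the point $(x_0,y_0)$, the direction of the gradient vector $(\partial f/\partial x_0, \partial f/\partial y_0)^T$ is the direction in which $f(x,y)$ increases fastest. In other words, moving along the gradient makes it easier to find a maximum of the function. Conversely, moving in the opposite direction, $-(\partial f/\partial x_0, \partial f/\partial y_0)^T$, is the direction in which the function decreases fastest, which makes it easier to find a minimum.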
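To make the "move against the gradient" idea concrete, here is a minimal sketch of gradient descent on a simple bowl-shaped function. The function, the starting point, the step size `alpha`, and the step count are all assumptions chosen for illustration.

```python
import numpy as np

def f(p):
    # Illustrative bowl: f(x, y) = (x - 1)^2 + 2 * (y + 2)^2, minimum at (1, -2)
    x, y = p
    return (x - 1) ** 2 + 2 * (y + 2) ** 2

def grad_f(p):
    # Gradient of the bowl, derived by hand
    x, y = p
    return np.array([2 * (x - 1), 4 * (y + 2)])

def gradient_descent(start, alpha=0.1, steps=200):
    # Repeatedly step in the direction opposite to the gradient
    p = np.array(start, dtype=float)
    history = [f(p)]
    for _ in range(steps):
        p = p - alpha * grad_f(p)
        history.append(f(p))
    return p, history

p_min, history = gradient_descent([5.0, 5.0])
print(p_min, history[0], history[-1])
```

With this step size the function value shrinks at every step and the iterate approaches (1, -2). A step size that is too large would overshoot instead, which is why the learning rate matters in the experiments later on.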

Personal understanding

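A directional derivative is the derivative along one particular direction, and there is one for every direction.
If the partial derivatives are continuous, the function is differentiable and the gradient determines every directional derivative.
The direction of the gradient is the direction in which the directional derivative is largest, and the magnitude of the gradient is that largest value.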
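The last point can be checked numerically. The sketch below scans unit directions around a point, estimates the directional derivative in each, and compares the best direction with the gradient. The function, the point, and the angular grid are assumptions made for this example.

```python
import numpy as np

def f(x, y):
    # Illustrative function (an assumption for this sketch): f(x, y) = x^2 * y + y
    return x**2 * y + y

def grad_f(x, y):
    # Hand-derived gradient: (2xy, x^2 + 1)
    return np.array([2 * x * y, x**2 + 1])

def directional_derivative(x, y, angle, h=1e-6):
    # Central difference along the unit vector u = (cos(angle), sin(angle))
    ux, uy = np.cos(angle), np.sin(angle)
    return (f(x + h * ux, y + h * uy) - f(x - h * ux, y - h * uy)) / (2 * h)

x0, y0 = 1.0, 2.0
angles = np.linspace(0.0, 2 * np.pi, 3600, endpoint=False)
derivs = np.array([directional_derivative(x0, y0, a) for a in angles])

best_angle = angles[np.argmax(derivs)]            # direction with the largest derivative
g = grad_f(x0, y0)
grad_angle = np.arctan2(g[1], g[0]) % (2 * np.pi)  # direction of the gradient
print(best_angle, grad_angle, derivs.max(), np.linalg.norm(g))
```

Up to the resolution of the angular grid, the best direction coincides with the gradient direction and the largest directional derivative matches the gradient's norm.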

Continued: code example

The logistic regression

Goal: build a classifier (solve for the three parameters θ0, θ1, θ2)

Set a threshold and decide the admission result based on it
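The thresholding step can be sketched as follows. The `predict` helper, the 0.5 default threshold, and the demo parameters and rows are hypothetical and only illustrate how a probability is turned into an admit/reject label.

```python
import numpy as np

def sigmoid(z):
    # Map a real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, theta, threshold=0.5):
    # Label 1 (admitted) when the predicted probability reaches the threshold
    probs = sigmoid(X @ theta.T).ravel()
    return (probs >= threshold).astype(int)

# Made-up parameters and rows, each row being (1, feature 1, feature 2)
theta_demo = np.array([[-1.0, 2.0, 0.5]])
X_demo = np.array([[1.0, 0.0, 0.0],    # score -1, probability below 0.5
                   [1.0, 1.0, 0.0],    # score  1, probability above 0.5
                   [1.0, 0.5, 0.0]])   # score  0, probability exactly 0.5

labels = predict(X_demo, theta_demo)
strict_labels = predict(X_demo, theta_demo, threshold=0.8)
print(labels, strict_labels)
```

Raising the threshold makes the classifier more conservative about predicting admission. The right value depends on how the two kinds of mistakes are weighed.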

Modules to implement, in the order they appear below: sigmoid, model, cost, gradient, descent.

import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Read the data
path = 'data' + os.sep + 'LogiReg_data.txt'
data = pd.read_csv(path, header=None, names=['Exam 1', 'Exam 2', 'Admitted'])
data.head()

# Look at the data
positive = data[data['Admitted'] == 1]  # rows with Admitted = 1, i.e. the positive examples
negative = data[data['Admitted'] == 0]  # rows with Admitted = 0, i.e. the negative examples
# positive.head()

fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(positive['Exam 1'], positive['Exam 2'], s=60, c='b', marker='o', label='Admitted')
ax.scatter(negative['Exam 1'], negative['Exam 2'], s=60, c='r', marker='x', label='Not Admitted')
ax.legend()
ax.set_xlabel('Exam 1 Score')
ax.set_ylabel('Exam 2 Score')

# Function that maps a score to a probability
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Plot of the sigmoid function
nums = np.arange(-10, 10, step=1)  # 20 equally spaced values from -10 to 9
fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(nums, sigmoid(nums), 'r')

# Return the predicted value
def model(X, theta):
    # Dot product of X with the weight vector, mapped to a probability
    return sigmoid(np.dot(X, theta.T))

# Add the first component x0 = 1
data.insert(0, 'Ones', 1)
data.head()

# Convert the pandas representation of the data to an array for further computation
# (to_numpy() replaces as_matrix(), which was removed in pandas 1.0)
orig_data = data.to_numpy()
cols = orig_data.shape[1]  # 4

# Set up the training features and the labels
X = orig_data[:, 0:cols-1]
y = orig_data[:, cols-1:cols]

# Initialize the weight vector
theta = np.zeros([1, 3])

X.shape, y.shape, theta.shape
# ((100, 3), (100, 1), (1, 3))

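Following the derivation of logistic regression (Logistic Regression), the loss function is the average negative log-likelihood over the training samples: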

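# Loss function: average negative log-likelihood.
# z is the linear score X·theta (not the sigmoid output);
# the per-sample loss is -y*z + log(1 + e^z).
def cost(X, y, theta):
    z = np.dot(X, theta.T)
    left = np.multiply(-y, z)
    right = np.log(1 + np.exp(z))
    return np.sum(left + right) / len(X)

cost(X, y, theta)
# With theta = 0 every z is 0, so the cost is ln 2 ≈ 0.6931.
# (The originally reported 0.67407698418010642 came from passing the sigmoid output in place of z.)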
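For reference, here is a sketch of how the usual cross-entropy loss of logistic regression reduces to a form written in terms of the linear score. The symbols $h_\theta$, $z_i$, and $n$ (the number of samples) are notation introduced here, not taken from the original post.

```latex
\begin{aligned}
h_\theta(x_i) &= \sigma(z_i) = \frac{1}{1 + e^{-z_i}}, \qquad z_i = \theta^T x_i \\
J(\theta) &= -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log h_\theta(x_i) + (1 - y_i)\log\big(1 - h_\theta(x_i)\big) \Big] \\
\log h_\theta(x_i) &= z_i - \log\big(1 + e^{z_i}\big), \qquad
\log\big(1 - h_\theta(x_i)\big) = -\log\big(1 + e^{z_i}\big) \\
J(\theta) &= \frac{1}{n}\sum_{i=1}^{n}\Big[ -y_i z_i + \log\big(1 + e^{z_i}\big) \Big]
\end{aligned}
```

Note that the last line uses the linear score $z_i$, not the sigmoid output. At $\theta = 0$ every $z_i$ is 0, so $J = \log 2 \approx 0.693$ regardless of the labels.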

Compute the gradient:

def gradient(X, y, theta):
    grad = np.zeros(theta.shape)
    # Prediction error, flattened because X[:, j] is one-dimensional
    error = (model(X, theta) - y).ravel()
    # Partial derivative with respect to each component of theta
    for j in range(len(theta.ravel())):
        term = np.multiply(error, X[:, j])
        grad[0, j] = np.sum(term) / len(X)
    return grad
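The gradient of the logistic loss is $\partial J/\partial \theta_j = \frac{1}{n}\sum_i (h_\theta(x_i) - y_i)\, x_{ij}$. A common way to gain confidence in such a formula is to compare it with finite differences of the cost. The sketch below does this on synthetic data; the random data, the seed, and the vectorized helper functions are assumptions for this check, not the post's dataset.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, theta):
    # Average negative log-likelihood written in terms of the linear score z
    z = X @ theta.T
    return np.sum(-y * z + np.log(1 + np.exp(z))) / len(X)

def gradient(X, y, theta):
    # Analytic gradient: mean of (prediction - label) * feature, shape (1, d)
    error = sigmoid(X @ theta.T) - y
    return (error.T @ X) / len(X)

def numeric_gradient(X, y, theta, h=1e-6):
    # Central differences of the cost, one parameter at a time
    grad = np.zeros_like(theta)
    for j in range(theta.shape[1]):
        step = np.zeros_like(theta)
        step[0, j] = h
        grad[0, j] = (cost(X, y, theta + step) - cost(X, y, theta - step)) / (2 * h)
    return grad

# Synthetic data: a column of ones plus two random features, random 0/1 labels
rng = np.random.default_rng(0)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])
y = (rng.random((50, 1)) < 0.5).astype(float)
theta = rng.normal(size=(1, 3))

max_diff = np.max(np.abs(gradient(X, y, theta) - numeric_gradient(X, y, theta)))
print(max_diff)
```

If the analytic gradient and the cost are consistent, the largest difference is tiny. A mismatch between the two is a quick signal that one of them contains a bug.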

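STOP_ITER = 0  # stop after a given number of iterations
STOP_COST = 1  # stop when the change in loss is small enough
STOP_GRAD = 2  # stop when the norm of the gradient is small enough

def stopCriterion(type, value, threshold):
    # Three different stopping strategies
    if type == STOP_ITER:
        return value > threshold
    elif type == STOP_COST:
        return abs(value[-1] - value[-2]) < threshold
    elif type == STOP_GRAD:
        return np.linalg.norm(value) < threshold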

import numpy.random

# Shuffle the rows (in place), then split into features and labels
def shuffleData(data):
    np.random.shuffle(data)
    cols = data.shape[1]
    X = data[:, 0:cols-1]
    y = data[:, cols-1:]
    return X, y

import time

def descent(data, theta, batchSize, stopType, thresh, alpha):
    # Solve by gradient descent
    init_time = time.time()
    i = 0  # iteration count
    k = 0  # start index of the current batch
    n = data.shape[0]  # number of samples (n was undefined in the original code)
    X, y = shuffleData(data)
    grad = np.zeros(theta.shape)  # the computed gradient
    costs = [cost(X, y, theta)]  # the initial loss

    while True:
        grad = gradient(X[k:k+batchSize], y[k:k+batchSize], theta)
        k += batchSize  # advance by one batch
        if k >= n:
            k = 0
            X, y = shuffleData(data)  # reshuffle
        theta = theta - alpha * grad  # parameter update
        costs.append(cost(X, y, theta))  # compute the new loss
        i += 1

        if stopType == STOP_ITER:
            value = i
        elif stopType == STOP_COST:
            value = costs
        elif stopType == STOP_GRAD:
            value = grad
        if stopCriterion(stopType, value, thresh):
            break

    return theta, i-1, costs, grad, time.time() - init_time

def runExpe(data, theta, batchSize, stopType, thresh, alpha):
    # import pdb; pdb.set_trace()
    theta, iter, costs, grad, dur = descent(data, theta, batchSize, stopType, thresh, alpha)
    n = data.shape[0]  # number of samples (n was undefined in the original code)
    rate = "learning rate: {}".format(alpha)

    # Name the variant of gradient descent by its batch size
    if batchSize == n:
        strDescType = "Gradient"
    elif batchSize == 1:
        strDescType = "Stochastic"
    else:
        strDescType = "Mini-batch ({})".format(batchSize)

    if stopType == STOP_ITER:
        strStop = "{} iterations".format(thresh)
    elif stopType == STOP_COST:
        strStop = "costs change < {}".format(thresh)
    else:
        strStop = "gradient norm < {}".format(thresh)

    iters = "iter {}".format(iter)
    lastcost = "last cost:{}".format(costs[-1])
    dur = "dur time:{}".format(dur)
    print(strDescType)
    print(rate)
    print("stop: " + strStop)
    print(iters)
    print(lastcost)
    print(dur)

    fig, ax = plt.subplots(figsize=(12, 4))
    ax.plot(np.arange(len(costs)), costs, 'r')
    ax.set_xlabel('Iterations')
    ax.set_ylabel('Cost')
    ax.set_title('Error vs. Iteration')
    return theta

The plots of loss versus iteration count for different parameter settings are as follows:

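Next, try standardizing the data: for each attribute (column by column), subtract its mean and then divide by its standard deviation. As a result, for each attribute/column the values are centered around 0 and have variance 1.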
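The standardization described above can be sketched by hand with NumPy. The `standardize` helper and the small score table are made up for illustration. The population standard deviation (`ddof=0`) is used here, which is also what `sklearn.preprocessing.scale` uses.

```python
import numpy as np

def standardize(columns):
    # Subtract each column's mean, then divide by its standard deviation
    mean = columns.mean(axis=0)
    std = columns.std(axis=0)  # population standard deviation (ddof=0)
    return (columns - mean) / std

# Made-up exam scores: one row per student, one column per exam
scores = np.array([[34.6, 78.0],
                   [30.3, 43.9],
                   [35.8, 72.9],
                   [60.2, 86.3],
                   [79.0, 75.3]])

scaled = standardize(scores)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1. This puts the features on a comparable scale, so one learning rate suits all components of theta.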

from sklearn import preprocessing as pp

# Standardize the two exam-score columns; the column of ones is left untouched
scaled_data = orig_data.copy()
scaled_data[:, 1:3] = pp.scale(orig_data[:, 1:3])

runExpe(scaled_data, theta, 10, STOP_ITER, thresh=50000, alpha=0.001)

Since you have read this far, how about leaving a like? Thanks!

done 2017-12-27 10:31:15