Logistic Regression 3 (Code Walkthrough)

05-08

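1. Gradient ascent algorithm

The code comes from Machine Learning in Action (《機器學習實戰》).

As before, assume there are 100 samples, each with 3 feature values and a corresponding result, converted into two Python lists:

dataMatrix is the feature list. It contains 100 sub-lists, one per sample; each sub-list has 3 elements, the sample's 3 variables $x_{j}$ (where j=1,2,3), and the 100 sub-lists correspond to the 100 samples $x^{(i)}$ (where i=1,2,3,...,100);

labelMat is the result list. It contains 100 elements, one $y^{(i)}$ per sample.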

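from numpy import mat, shape, ones, exp

def sigmoid(inX):
    # Standard logistic function, used by every function in this post
    return 1.0 / (1 + exp(-inX))

def gradAscent(dataMatIn, classLabels):
    """
    Function: gradient ascent for logistic regression
    Input:  dataMatIn: data list, 100*3
            classLabels: label list, 1*100
    Output: weights: weight parameter matrix
    """
    # Convert to a NumPy matrix, dataMatrix: 100*3
    dataMatrix = mat(dataMatIn)
    # Convert to a NumPy matrix and transpose, labelMat: 100*1
    labelMat = mat(classLabels).transpose()
    # Number of rows and columns
    m, n = shape(dataMatrix)
    # Step size
    alpha = 0.001
    # Number of iterations
    maxCycles = 500
    # Initialize the weights with ones(): a 3*1 array of 1s
    weights = ones((n, 1))
    # Iterate to compute the parameters
    for k in range(maxCycles):
        # (100*3) * (3*1) gives a 100*1 matrix, i.e. h_w(x)
        h = sigmoid(dataMatrix * weights)
        # Error y - h_w(x), a 100*1 matrix
        error = labelMat - h
        # Update: w_j = w_j + alpha * sum_i (y_i - h_w(x_i)) * x_ij
        # (3*100) * (100*1) gives 3*1
        weights = weights + alpha * dataMatrix.transpose() * error
    # Return the weight parameter matrix
    return weights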
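The shapes in the update step are easy to get wrong, so it can help to check them on a tiny example. The following is a minimal sketch, not the book's code: it assumes `sigmoid` is the standard logistic function, uses plain NumPy arrays with `@` instead of `mat`, and the four samples are made up purely for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: 4 samples, 3 features (first column is a constant 1).
X = np.array([[1.0, 0.5, 1.2],
              [1.0, -1.0, 0.3],
              [1.0, 2.0, -0.7],
              [1.0, -0.4, -1.5]])
y = np.array([[1.0], [0.0], [1.0], [0.0]])   # column vector, 4 x 1

weights = np.ones((3, 1))
error = y - sigmoid(X @ weights)             # 4 x 1
gradient = X.T @ error                       # (3 x 4) @ (4 x 1) -> 3 x 1
weights = weights + 0.001 * gradient         # one gradient ascent step

print(gradient.shape)   # (3, 1)
```

The transposed data matrix has to come first in the product: only then do the inner dimensions match and the result has one entry per weight.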

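The best-fit line obtained with this algorithm is shown below:

The fit is good. The only problem is that every update has to go through all the samples, so when there is more data and there are more parameters, gradient ascent becomes too expensive to compute. Stochastic gradient ascent, given next, addresses this.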

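2. Stochastic gradient ascent algorithm

That is, the regression coefficients are updated using only one sample at a time. Stochastic gradient ascent is an online algorithm: it can update the parameters as each sample arrives, without re-reading the whole data set. The concrete steps follow.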
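To make the "online" idea concrete, here is a minimal sketch of a single-sample update. The function name `online_update` and the two-sample stream are hypothetical, introduced only for this illustration; the update rule itself is the same one used in the code of this section.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_update(weights, x, y, alpha=0.01):
    """One stochastic gradient ascent step using a single sample (x, y)."""
    error = y - sigmoid(np.dot(x, weights))   # scalar error for this sample
    return weights + alpha * error * x

weights = np.ones(3)
# Hypothetical stream of (features, label) pairs arriving one at a time.
stream = [(np.array([1.0, 0.5, 1.2]), 1.0),
          (np.array([1.0, -1.0, 0.3]), 0.0)]
for x, y in stream:
    weights = online_update(weights, x, y)
```

Each step touches only the current sample, so the cost per update does not depend on how many samples have been seen so far.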

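Again assume there are 100 samples, each with 3 feature values and a corresponding result, converted into dataMatrix and labelMat as in section 1. Here dataMatrix must be a NumPy array rather than a plain list, because each row is multiplied element-wise by the weights.

labelMat is the result list. It contains 100 elements, one $y^{(i)}$ per sample.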

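from numpy import shape, ones

def stocGradAscent0(dataMatrix, classLabels):
    """
    Function: stochastic gradient ascent
    Input:  dataMatrix: data array, 100*3
            classLabels: label list, 1*100
    Output: weights: weight parameter vector
    """
    # Size of the data array
    m, n = shape(dataMatrix)
    # Step size set to 0.01
    alpha = 0.01
    # Initialize the weights, all 1s
    weights = ones(n)
    # Go through the data one row at a time
    for i in range(m):
        # (1*3) times (3*1): element-wise product, then sum
        h = sigmoid(sum(dataMatrix[i] * weights))
        # Error for this single sample; uses its own y value, no matrix operations
        error = classLabels[i] - h
        # Update the weights
        weights = weights + alpha * error * dataMatrix[i]
    # Return the weight parameter vector
    return weights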

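The best-fit line obtained with this algorithm is shown below:

This fit is less satisfactory. The algorithm can be improved by increasing the number of iterations and by damping local fluctuations.

3. Improved stochastic gradient ascent

The improved stochastic gradient ascent gives results similar to full gradient ascent, but it uses fewer computing resources and runs faster.

Again assume there are 100 samples, each with 3 feature values and a corresponding result, converted into dataMatrix and labelMat as before.

labelMat is the result list. It contains 100 elements, one $y^{(i)}$ per sample.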

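import random
from numpy import shape, ones

def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    """
    Function: improved stochastic gradient ascent
    Input:  dataMatrix: data array, 100*3
            classLabels: label list, 1*100
            numIter: number of passes over the data
    Output: weights: weight parameter vector
    """
    # Size of the data array (a NumPy array, not converted to a matrix)
    m, n = shape(dataMatrix)
    # Initialize the weights, all 1s
    weights = ones(n)
    # numIter is a parameter and can be passed in by the caller (default 150)
    for j in range(numIter):
        # In Python 3 range() does not produce a list,
        # so convert it before deleting entries from it
        dataIndex = list(range(m))
        # Go through every row once per pass
        for i in range(m):
            # Decaying step size: damps high-frequency fluctuations,
            # and the constant term keeps it from ever reaching 0
            alpha = 4 / (1.0 + j + i) + 0.0001
            # Pick a random position among the samples not yet used in this pass,
            # which reduces periodic fluctuations
            randIndex = int(random.uniform(0, len(dataIndex)))
            sampleIndex = dataIndex[randIndex]
            # Element-wise product with the weights, sum, then sigmoid
            h = sigmoid(sum(dataMatrix[sampleIndex] * weights))
            # Error for this sample; uses its own y value, no matrix operations
            error = classLabels[sampleIndex] - h
            # Update the weights
            weights = weights + alpha * error * dataMatrix[sampleIndex]
            # Remove the sample from the index list so it is not reused in this pass
            del dataIndex[randIndex]
    # Return the weight parameter vector
    return weights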
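The step size in stocGradAscent1 depends on both the pass number j and the position i within the pass. A small sketch that evaluates the same formula at a few (j, i) pairs shows how it behaves; `step_size` is a name introduced here for illustration only.

```python
def step_size(j, i):
    """Step size at pass j, inner step i (same formula as in stocGradAscent1)."""
    return 4 / (1.0 + j + i) + 0.0001

for j, i in [(0, 0), (0, 9), (0, 99), (149, 99)]:
    print(j, i, round(step_size(j, i), 4))
```

The step starts large and shrinks quickly, and the constant 0.0001 keeps it above zero so later samples still have some influence. Because i restarts at 0 in every pass, the step size jumps back up at the start of each pass instead of decreasing strictly.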

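The best-fit line obtained with this algorithm is shown below:

4. Classification with logistic regression

Here we use the book's example code for "predicting the mortality of horses with colic".

The overall workflow is as follows:

1. Collect data.
2. Prepare data: parse the text with Python and fill in missing values.
3. Analyze data: visualize and inspect the data.
4. Train the algorithm: use an optimization algorithm to find the best coefficients.
5. Test the algorithm: to quantify the regression, look at the error rate. Depending on the error rate, decide whether to go back to the training stage and adjust parameters such as the number of iterations and the step size to get better regression coefficients.
6. Use the algorithm.

Only the training and testing steps are shown here.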
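Step 2 is not shown in this post. Purely as an illustration of what "filling in missing values" could look like, here is a sketch under two assumptions that the text does not state: that missing entries are marked with `?` in the raw file, and that they are replaced by 0. The helper name `parse_row` is hypothetical.

```python
def parse_row(line, missing_marker="?", fill_value=0.0):
    """Split a whitespace-delimited row into floats, filling missing entries."""
    return [fill_value if tok == missing_marker else float(tok)
            for tok in line.split()]

row = parse_row("2 ? 38.5 66 ?")
print(row)   # [2.0, 0.0, 38.5, 66.0, 0.0]
```

One reason 0 is a convenient fill value for this model: in the update `weights + alpha * error * x`, a feature equal to 0 leaves its own weight unchanged.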

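First define the function classifyVector(), which takes a feature vector and the regression coefficients as input, computes the corresponding sigmoid value, and returns 0 or 1.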

def classifyVector(inX, weights):
    """
    Function: classifier
    Input:  inX: feature vector of one sample
            weights: weight parameter vector
    Output: class label, 1.0 or 0.0
    """
    # Compute the sigmoid value
    prob = sigmoid(sum(inX * weights))
    # Return the class label
    if prob > 0.5:
        return 1.0
    else:
        return 0.0
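A quick way to see the decision rule at work is to call it with hand-picked numbers. The sketch below restates the classifier in a self-contained form; the name `classify_vector` and the weights are made up for illustration and are not trained values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify_vector(in_x, weights):
    """Return 1.0 if the predicted probability exceeds 0.5, else 0.0."""
    prob = sigmoid(np.sum(in_x * weights))
    return 1.0 if prob > 0.5 else 0.0

weights = np.array([0.5, -1.0, 2.0])   # hypothetical weights
print(classify_vector(np.array([1.0, 0.0, 1.0]), weights))   # weighted sum 2.5  -> 1.0
print(classify_vector(np.array([1.0, 2.0, 0.0]), weights))   # weighted sum -1.5 -> 0.0
```

Since the sigmoid equals 0.5 exactly at 0, the threshold on the probability is the same as checking whether the weighted sum is positive.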

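Next define the function colicTest(), which opens the test set and the training set and formats the data.

It first loads the training set and uses stocGradAscent1() to compute the vector of regression coefficients. Once the coefficients are computed, it loads the test set and computes the classification error rate.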

from numpy import array

def colicTest():
    """
    Function: train and test
    Input:  training-set and test-set text files
    Output: classification error rate
    """
    # Open the training set
    frTrain = open('horseColicTraining.txt')
    # Open the test set
    frTest = open('horseColicTest.txt')
    # Training data and training labels
    trainingSet = []
    trainingLabels = []
    # Go through the training data
    for line in frTrain.readlines():
        # Split the line on whitespace
        currLine = line.strip().split()
        # Rebuild the 21 features as a list of floats
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        # Store the features and the class label (column 22)
        trainingSet.append(lineArr)
        trainingLabels.append(float(currLine[21]))
    # Compute the weight vector
    trainWeights = stocGradAscent1(array(trainingSet), trainingLabels, 500)
    # Misclassification count and number of test samples
    errorCount = 0
    numTestVec = 0.0
    # Go through the test data
    for line in frTest.readlines():
        numTestVec += 1.0
        currLine = line.strip().split()
        # Rebuild the 21 features as a list of floats
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        # Count an error if the prediction differs from the label
        if int(classifyVector(array(lineArr), trainWeights)) != int(float(currLine[21])):
            errorCount += 1
    # Compute, print and return the classification error rate
    errorRate = float(errorCount) / numTestVec
    print("the error rate of this test is: %f" % errorRate)
    return errorRate
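The parsing loop appears twice in colicTest(), once per file. One possible way to share it is a small loader, sketched below. `load_rows` is a hypothetical helper, not part of the book's code, and the sketch uses an in-memory `io.StringIO` with two made-up rows instead of the real data files.

```python
import io
import numpy as np

def load_rows(file_obj, num_features=21):
    """Read whitespace-delimited rows: num_features floats followed by a label."""
    features, labels = [], []
    for line in file_obj:
        tokens = line.strip().split()
        if not tokens:          # skip blank lines
            continue
        features.append([float(t) for t in tokens[:num_features]])
        labels.append(float(tokens[num_features]))
    return np.array(features), labels

# Made-up stand-in for a data file: 2 features per row, then the label.
fake_file = io.StringIO("1.0 2.0 1\n0.5 0.1 0\n")
X, y = load_rows(fake_file, num_features=2)
print(X.shape, y)
```

With real files, the same function could be called inside a `with open(...) as f:` block so that each file is closed once it has been read.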

The last function, multiTest(), calls colicTest() 10 times and averages the results.

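def multiTest():
    """
    Function: average over repeated runs
    Input:  none
    Output: average error rate over ten runs
    """
    # Number of runs
    numTests = 10
    # Running sum of error rates
    errorSum = 0.0
    # Call colicTest() ten times and accumulate the error rate
    for k in range(numTests):
        errorSum += colicTest()
    # Print the average error rate
    print("after %d iterations the average error rate is: %f" % (numTests, errorSum / float(numTests)))

Call the function: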

multiTest()

Output:

the error rate of this test is: 0.358209

the error rate of this test is: 0.417910

the error rate of this test is: 0.268657

the error rate of this test is: 0.253731

the error rate of this test is: 0.417910

the error rate of this test is: 0.328358

the error rate of this test is: 0.283582

the error rate of this test is: 0.328358

the error rate of this test is: 0.283582

the error rate of this test is: 0.402985

after 10 iterations the average error rate is: 0.334328
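The printed numbers can be checked by hand. The sketch below recomputes the average from the ten rates above. It also shows that each rate is, within print rounding, a whole number of errors out of 67; a 67-sample test set is an inference from the numbers here, not something the text states.

```python
rates = [0.358209, 0.417910, 0.268657, 0.253731, 0.417910,
         0.328358, 0.283582, 0.328358, 0.283582, 0.402985]

average = sum(rates) / len(rates)
print("%f" % average)   # 0.334328

# Assumed test-set size of 67: each rate times 67 rounds to an integer count.
counts = [round(r * 67) for r in rates]
print(counts)           # [24, 28, 18, 17, 28, 22, 19, 22, 19, 27]
```

The spread between runs, from 17 to 28 errors under that assumption, comes from the random sample order in stocGradAscent1, which is why multiTest() averages several runs.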