5_1 Building a Simple Linear Regression Model with sklearn

05-08

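Steps for data analysis with a machine learning algorithm:

Pose the question
Understand the data
Clean the data
Split the data
Determine the model hyperparameters
Train the model
Test the model
Apply the model
Visualize the results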

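1. Pose the question: define the object and goal of the analysis. Here we analyze the relationship between study time and exam score.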

# Import the data analysis packages
from collections import OrderedDict
import pandas as pd

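2. Understand the data: load the data and inspect its feature dimensions and label. This dataset has a single feature, '學習時間' (study time in hours); '成績' (exam score) is the dependent variable.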

# Dataset: '學習時間' is study time in hours, '成績' is the exam score
examDict = {
    '學習時間': [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
             2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50],
    '成績': [10, 22, 13, 43, 20, 22, 33, 50, 62, 48,
           55, 75, 62, 73, 81, 76, 64, 82, 90, 93]
}
examOrderDict = OrderedDict(examDict)
examDf = pd.DataFrame(examOrderDict)
examDf.head()

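3. Clean the data: in machine learning data analysis, cleaning mainly means handling missing values and standardizing the data (this dataset is used as-is, without any processing).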
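The exam data needs no cleaning, but the two operations mentioned above can be sketched on a small made-up frame. The column names `hours` and `score`, the values, and the choice to drop rows rather than impute them are all assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame with one missing value (not the exam dataset)
df = pd.DataFrame({
    'hours': [0.5, 1.0, np.nan, 2.0, 2.5],
    'score': [10, 13, 20, 50, 48],
})

# Count the missing values in each column
missing_counts = df.isnull().sum()
print(missing_counts)

# Drop every row that contains a missing value (imputing is another option)
clean_df = df.dropna()

# Standardize the feature column to zero mean and unit variance
scaler = StandardScaler()
hours_scaled = scaler.fit_transform(clean_df[['hours']])
print(hours_scaled.ravel())
```

In a real project the scaler would normally be fitted on the training set only and then reused to transform the test set.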

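4. Split the data: decide which columns are the feature data and which are the label data, then split the dataset into a training set and a test set. Ratios such as 8:2 (used here) or 9:1 are common. The data is also reshaped so that it can be fed into the model.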

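# Feature data
exam_x = examDf.loc[:, '學習時間']
# Label data
exam_y = examDf.loc[:, '成績']
# sklearn's linear regression expects 2-D input, so reshape() is used to change the dimensions;
# -1 means that dimension is inferred from the others
exam_x = exam_x.values.reshape(-1, 1)
exam_y = exam_y.values.reshape(-1, 1)
# Split the data with train_test_split
# (the old sklearn.cross_validation module was removed in scikit-learn 0.20; use sklearn.model_selection)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(exam_x, exam_y, train_size=0.8)
# Print the data shapes
print('Original data features:', exam_x.shape, ', training data features:', x_train.shape, ', test data features:', x_test.shape)
print('Original data labels:', exam_y.shape, ', training data labels:', y_train.shape, ', test data labels:', y_test.shape)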

Original data features: (20, 1) , training data features: (16, 1) , test data features: (4, 1)

Original data labels: (20, 1) , training data labels: (16, 1) , test data labels: (4, 1)
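Because `train_test_split` shuffles the rows, every run of the split above selects different samples, so the fitted coefficients and score change from run to run. A minimal sketch of making the split repeatable with the `random_state` parameter follows. The stand-in arrays and the seed value 0 are assumptions, not part of the original walkthrough.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape as the exam data: 20 samples, 1 feature
x = np.arange(20, dtype=float).reshape(-1, 1)
y = 2 * x + 1

# Fixing random_state fixes the shuffle, so both calls produce the same split
x_train_a, x_test_a, y_train_a, y_test_a = train_test_split(
    x, y, train_size=0.8, random_state=0)
x_train_b, x_test_b, y_train_b, y_test_b = train_test_split(
    x, y, train_size=0.8, random_state=0)

print(x_train_a.shape, x_test_a.shape)
print(np.array_equal(x_test_a, x_test_b))
```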

# Scatter plot of the training data and the test data
import matplotlib.pyplot as plt
plt.scatter(x_train[:, 0], y_train[:, 0], color='b', label='train data')
plt.scatter(x_test[:, 0], y_test[:, 0], color='r', label='test data')
plt.legend(loc=2)
plt.xlabel('Hours')
plt.ylabel('Score')
plt.show()

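5. Determine the hyperparameters: use methods such as cross-validation to choose the model's hyperparameters. A simple linear regression model has no hyperparameters, so this step is skipped.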
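For a model that does have a hyperparameter, this step could look roughly like the sketch below. It swaps in `Ridge` regression, whose `alpha` controls the regularization strength, purely for illustration. The candidate values, the 5-fold setup, and the seed are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# The exam data from this post: study hours and scores
hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]).reshape(-1, 1)
scores = np.array([10, 22, 13, 43, 20, 22, 33, 50, 62, 48,
                   55, 75, 62, 73, 81, 76, 64, 82, 90, 93], dtype=float)

# Candidate values for the hyperparameter (illustrative choices)
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0]}

# Shuffle before folding because the rows are sorted by study time
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Try each alpha with 5-fold cross-validation and keep the best one
search = GridSearchCV(Ridge(), param_grid, cv=cv, scoring='neg_mean_squared_error')
search.fit(hours, scores)

print(search.best_params_)
print(search.best_score_)
```

The scoring is the negated mean squared error because `GridSearchCV` always maximizes its score.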

6. Train the model: train the model on the training set with the chosen hyperparameters to determine the model's parameters.

# Import linear regression
from sklearn.linear_model import LinearRegression
# Instantiate the model: linear regression
model = LinearRegression()
# Train on the training data
model.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

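# Read the model parameters from the trained model
"""
Best-fit line: y = wx + b
Regression coefficient: w
Intercept: b
"""
# Regression coefficient
w = model.coef_
# Intercept
b = model.intercept_
print('Best-fit line: regression coefficient w=', w, 'intercept b=', b)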

Best-fit line: regression coefficient w= [[15.87988981]] intercept b= [9.27906336]
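For a single feature, the least-squares coefficients also have a closed form: w is the covariance of x and y divided by the variance of x, and b is mean(y) - w * mean(x). The sketch below checks that formula against `LinearRegression`. It fits on all 20 samples rather than on a random training split, so its numbers will differ from the output above. The names `w_manual`, `b_manual`, and `model_full` are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The exam data from this post: study hours and scores
hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50])
scores = np.array([10, 22, 13, 43, 20, 22, 33, 50, 62, 48,
                   55, 75, 62, 73, 81, 76, 64, 82, 90, 93], dtype=float)

# Closed-form least squares for y = w*x + b
x_mean, y_mean = hours.mean(), scores.mean()
w_manual = np.sum((hours - x_mean) * (scores - y_mean)) / np.sum((hours - x_mean) ** 2)
b_manual = y_mean - w_manual * x_mean

# The same fit with sklearn, on the full dataset
model_full = LinearRegression().fit(hours.reshape(-1, 1), scores)

print(w_manual, b_manual)
print(model_full.coef_[0], model_full.intercept_)
```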

# Correlation coefficient: corr() returns a DataFrame holding the correlation coefficient matrix
rDf = examDf.corr()
print('Correlation coefficient matrix:')
rDf
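The off-diagonal entry of that matrix is the Pearson correlation coefficient, which `corr()` computes by default. As a rough check, it can be recomputed by hand as the covariance divided by the product of the standard deviations. The English column names `hours` and `score` are stand-ins chosen for this sketch.

```python
import numpy as np
import pandas as pd

# The exam data from this post, with stand-in English column names
df = pd.DataFrame({
    'hours': [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
              2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50],
    'score': [10, 22, 13, 43, 20, 22, 33, 50, 62, 48,
              55, 75, 62, 73, 81, 76, 64, 82, 90, 93],
})

# Pearson r as reported by pandas
r_pandas = df.corr().loc['hours', 'score']

# Pearson r by hand: sum of co-deviations over the product of deviation norms
dx = df['hours'] - df['hours'].mean()
dy = df['score'] - df['score'].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(r_pandas, r_manual)
```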

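# The score method of linear regression returns the coefficient of determination, R squared
# Evaluate the model: coefficient of determination R squared
model.score(x_test, y_test)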

0.8943676194711957
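The value returned by `score` is R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals on the test set and SS_tot is the sum of squared deviations of the test labels from their mean. A minimal sketch that recomputes it by hand follows. The seed `random_state=0` is an assumption, so the number it prints will not match the output above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# The exam data from this post: study hours and scores
hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]).reshape(-1, 1)
scores = np.array([10, 22, 13, 43, 20, 22, 33, 50, 62, 48,
                   55, 75, 62, 73, 81, 76, 64, 82, 90, 93], dtype=float)

x_train, x_test, y_train, y_test = train_test_split(
    hours, scores, train_size=0.8, random_state=0)
model = LinearRegression().fit(x_train, y_train)
y_pred = model.predict(x_test)

# R squared by hand
ss_res = np.sum((y_test - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(x_test, y_test))
```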

7. Test the model: run the trained model on the test set and compute the error of its predictions.
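The post does not say which error measure it has in mind. One common choice for regression is to report the mean absolute error (MAE) and the root mean squared error (RMSE) on the test set, sketched below with `sklearn.metrics`. The seed and the choice of metrics are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# The exam data from this post: study hours and scores
hours = np.array([0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                  2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]).reshape(-1, 1)
scores = np.array([10, 22, 13, 43, 20, 22, 33, 50, 62, 48,
                   55, 75, 62, 73, 81, 76, 64, 82, 90, 93], dtype=float)

x_train, x_test, y_train, y_test = train_test_split(
    hours, scores, train_size=0.8, random_state=0)
model = LinearRegression().fit(x_train, y_train)
y_pred = model.predict(x_test)

# Average size of the prediction error, in score points
mae = mean_absolute_error(y_test, y_pred)
# Root mean squared error: penalizes large misses more heavily than MAE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print('MAE:', mae)
print('RMSE:', rmse)
```

Both values are in the same unit as the score, which makes them easier to interpret than R² alone.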

8. Apply the model: load new data and use the model to analyze it.

import numpy as np
# Feature values to predict
predict_x = np.linspace(0.5, 5.5, 5)
predict_x = predict_x.reshape(-1, 1)
# Predicted results
predict_y = model.predict(predict_x)
predict_y

9. Visualize the results

# Plot the fitted line together with the scatter plot
import matplotlib.pyplot as plt
plt.scatter(exam_x, exam_y, color='blue', label='exam data')  # original data
plt.scatter(predict_x[:, 0], predict_y[:, 0], color='red', label='predict data')  # predicted data
plt.plot(predict_x[:, 0], predict_y[:, 0], color='black', label='predict line')  # prediction line
plt.legend(loc=2)
plt.xlabel('Hours')
plt.ylabel('Score')
plt.show()