KNN Study: Predicting Diabetes in Patients

From the column 數據分析之路 (The Road to Data Analysis)

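1. Dataset: The dataset originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases. (It can be downloaded by searching habook.com for 黃永昌's scikit-learn機器學習.) The goal is to predict, from the diagnostic measurements in the dataset, whether a patient has diabetes. The dataset consists of several medical predictor variables and one target variable, Outcome.

2. Feature selection: The eight predictors are the patient's number of pregnancies, plasma glucose concentration, diastolic blood pressure, triceps skinfold thickness, 2-hour serum insulin, BMI, diabetes pedigree function, and age.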
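The column layout described above can be sketched as follows. The column names are an assumption (they are the ones used in the commonly distributed diabetes.csv; the article itself only names Outcome), and the two rows are typed in by hand purely for illustration.

```python
import pandas as pd

# Assumed column names; the article only confirms "Outcome".
FEATURES = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
TARGET = "Outcome"

# Two illustrative rows, typed in by hand for this sketch.
sample = pd.DataFrame(
    [[6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
     [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0]],
    columns=FEATURES + [TARGET],
)

x_sample = sample[FEATURES]  # equivalent to sample.iloc[:, 0:8]
y_sample = sample[TARGET]    # equivalent to sample.iloc[:, 8]
print(x_sample.shape, y_sample.shape)
```

Selecting by name rather than by position makes it harder to pick up the wrong column if the file layout ever changes.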

# Load the data
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Raw string so the backslashes in the Windows path are not read as escapes
data = pd.read_csv(r"C:\Users\nanafighting\Desktop\code\datasets\pima-indians-diabetes\diabetes.csv", encoding="utf8")
print("dataset shape {}".format(data.shape))
data.head()
print(data.groupby("Outcome").size())
x = data.iloc[:, 0:8]  # the eight feature columns
y = data.iloc[:, 8]    # the Outcome column
print("shape of x {}; shape of y {}".format(x.shape, y.shape))

# Split into training data and test data
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
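Because the split above is random, the scores change from run to run. A minimal sketch of a reproducible, class-balanced split is shown below; the synthetic arrays and the choice of random_state=42 are assumptions for illustration, not part of the original article.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 8 features, 35% positive labels.
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 8)
y_demo = np.array([0] * 65 + [1] * 35)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

With stratify set, both the training and the test part keep the same share of positive cases as the full data, which matters when the classes are unbalanced.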

3. Model comparison

from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

# Three neighbor-based models to compare
models = []
models.append(("KNN", KNeighborsClassifier(n_neighbors=2)))
models.append(("KNN with weights", KNeighborsClassifier(n_neighbors=2, weights="distance")))
# RadiusNeighborsClassifier takes a radius rather than n_neighbors
models.append(("Radius Neighbors", RadiusNeighborsClassifier(radius=500.0)))

# Fit each model and record its accuracy on the test set
results = []
for name, model in models:
    model.fit(x_train, y_train)
    results.append((name, model.score(x_test, y_test)))
for i in range(len(results)):
    print("name:{};score:{}".format(results[i][0], results[i][1]))
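To see how the three models above can disagree, here is a small sketch on a hand-made one-dimensional toy set (the data and the radius of 0.5 are assumptions chosen for illustration). The query point sits next to a single class-1 sample, while the two class-0 samples are farther away.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

# Toy data: two class-0 points and one class-1 point on a line.
X_toy = np.array([[0.0], [1.0], [2.0]])
y_toy = np.array([0, 0, 1])
query = np.array([[1.9]])

# Uniform weights: each of the 3 neighbors gets one vote, so class 0 wins 2 to 1.
uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_toy, y_toy)
# Distance weights: votes are scaled by 1/distance, so the very close
# class-1 point outweighs the two farther class-0 points.
weighted = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_toy, y_toy)
# Radius neighbors: only points within 0.5 of the query vote at all.
radius = RadiusNeighborsClassifier(radius=0.5).fit(X_toy, y_toy)

pred_uniform = uniform.predict(query)[0]
pred_weighted = weighted.predict(query)[0]
pred_radius = radius.predict(query)[0]
print(pred_uniform, pred_weighted, pred_radius)  # 0 1 1
```

On real data the features are on very different scales (for example insulin versus the pedigree function), so the distances, and therefore the weighted and radius-based votes, are dominated by the large-valued columns unless the features are rescaled first.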

4. Train and analyze the model with the plain K-nearest neighbors (KNN) algorithm

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# 10-fold cross-validation of every model on the full data set
results = []
for name, model in models:
    kfold = KFold(n_splits=10)
    cv_results = cross_val_score(model, x, y, cv=kfold)
    results.append((name, cv_results))
for i in range(len(results)):
    print("name:{};cross val score:{}".format(results[i][0], results[i][1].mean()))

# Train the plain KNN model and compare training and test accuracy
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(x_train, y_train)
train_score = knn.score(x_train, y_train)
test_score = knn.score(x_test, y_test)
print("train score:{};test score:{}".format(train_score, test_score))
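The same cross-validation loop can also be used to compare several values of n_neighbors instead of fixing it at 2. The sketch below runs on synthetic data from make_classification, and the candidate values (1, 2, 5, 10) are an arbitrary assumption; it is not a result from the article's data set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the diabetes data: 200 samples, 8 features.
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)

kfold = KFold(n_splits=10)
mean_scores = {}
for k in (1, 2, 5, 10):
    # One accuracy value per fold; keep the mean for each k.
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_syn, y_syn, cv=kfold)
    mean_scores[k] = scores.mean()

best_k = max(mean_scores, key=mean_scores.get)
for k, score in mean_scores.items():
    print("n_neighbors={}; mean cross val score: {:.3f}".format(k, score))
print("best n_neighbors:", best_k)
```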

5. Plotting

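# Learning curve
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import learning_curve

def plot_learning_curve(plt, estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    # Assumed minimal body: the source shows only the signature.
    # Plots mean training and cross-validation scores against training-set size.
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    plt.plot(sizes, train_scores.mean(axis=1), "o-", color="r", label="Training score")
    plt.plot(sizes, test_scores.mean(axis=1), "o-", color="g", label="Cross-validation score")
    plt.grid()
    plt.legend(loc="best")
    return plt

knn = KNeighborsClassifier(n_neighbors=2)
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
plt.figure(figsize=(10, 6), dpi=200)
plot_learning_curve(plt, knn, "Learn Curve for KNN Diabetes", x, y, ylim=(0.0, 1.01), cv=cv)

# Keep the two features most related to the outcome and plot them
from sklearn.feature_selection import SelectKBest
selector = SelectKBest(k=2)
x_new = selector.fit_transform(x, y)
print(x_new[0:5])
plt.figure(figsize=(10, 6), dpi=200)
plt.ylabel("BMI")
plt.xlabel("Glucose")
plt.scatter(x_new[y == 0][:, 0], x_new[y == 0][:, 1], c="r", s=20, marker="o")  # Outcome 0
plt.scatter(x_new[y == 1][:, 0], x_new[y == 1][:, 1], c="g", s=20, marker="^")  # Outcome 1
plt.show()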
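The axis labels in the scatter plot above are written by hand, so it is worth checking which columns SelectKBest actually kept. A minimal sketch using get_support() follows; the data frame is synthetic (two columns are deliberately tied to the label, two are pure noise) and only borrows column names for readability.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest

# Synthetic data: "Glucose" and "BMI" depend on the label, the rest is noise.
rng = np.random.RandomState(0)
y_demo = np.repeat([0, 1], 100)
frame = pd.DataFrame({
    "Glucose": 3.0 * y_demo + rng.randn(200),
    "BloodPressure": rng.randn(200),
    "BMI": 2.0 * y_demo + rng.randn(200),
    "Age": rng.randn(200),
})

selector = SelectKBest(k=2)  # default score function is the ANOVA F-test
x_selected = selector.fit_transform(frame, y_demo)

# get_support() returns a boolean mask over the input columns.
selected_names = list(frame.columns[selector.get_support()])
print(selected_names)
```

The selected columns come back in their original column order, so the first name belongs to column 0 of the transformed array and the second to column 1.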

6. Results
