Learning Algorithms by Dissecting Code: A Detailed Walkthrough of Implementing KNN in Python

Editor's note: "learning algorithms by dissecting code" means working through the code step by step to understand the meaning and purpose of each part. Code is the machine-level expression of an algorithm's computation and prediction process, so a detailed reading of every step both teaches how the algorithm is implemented and deepens understanding of the algorithm itself. In machine learning, visualisation, and other data-analysis work, this kind of code study is an indispensable and crucial part of the learning process. This article borrows the currently popular "book-dissecting" study method and applies it to code.

The KNN algorithm, i.e. the k-nearest neighbours method, has a simple core idea: for a sample of unknown class in a feature space, its class can be determined from the classes of the k samples nearest to it. Its basic steps are as follows:

In Figure 0, there are three known classes (orange, yellow, green) with their sample distributions, plus one grey sample of unknown class. In Figure 1, the distances between the grey unknown sample and the known samples are computed, in particular the distances 2.1/2.4/3.1/4.5 to its nearest neighbours. Figure 2 shows these four neighbour distances sorted by size. In Figure 3, a vote among the k nearest neighbours shows that the yellow class appears most often, so the grey sample is assigned to the yellow class.

KNN itself will not be explained at length here, as plenty of references are available. On the Python side a ready-made implementation also exists (neighbors.KNeighborsClassifier), but this article does not use it; instead, we construct the corresponding functions from the KNN principle itself to carry out the analysis and prediction, and in doing so come to understand how KNN actually works. The Python version used is 2.X.
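For reference, a minimal sketch of how the same task would look with the off-the-shelf classifier (assuming the X_train/X_test/y_train/y_test split produced in Section 1 below; this is not used in the rest of the article):

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=5)   # k = 5, matching Section 5
clf.fit(X_train, y_train)                   # fit on the training set
print clf.score(X_test, y_test)             # mean accuracy on the test set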

1. Import the data and generate the training and test sets

# Import the sklearn and numpy modules
from sklearn.datasets import load_iris
from sklearn import cross_validation
import numpy as np
# Load the iris dataset
iris = load_iris()
X_train, X_test, y_train, y_test = cross_validation.train_test_split(iris.data, iris.target, test_size=0.4, random_state=1)
# Repackage the train/test datasets
train = np.array(zip(X_train, y_train))
test = np.array(zip(X_test, y_test))

The train/test split command divides the data 60:40, with 60% going to the training set (test_size=0.4). Here iris.data holds the numeric attributes of the flowers (4 features), and iris.target holds the class labels 0, 1, 2. random_state=1 is a random seed, which fixes this particular split so it is reproducible.
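A quick sanity check on the split sizes (the iris set has 150 samples, so a 0.4 test fraction yields 90 training and 60 test instances):

>>> len(X_train), len(X_test)
(90, 60)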

2. Measure the distances between samples

For an iris of unknown species, we judge which of the 3 known species it belongs to by how similar it is to the existing samples (its nearest neighbours). This requires a similarity measure; one common choice is the Euclidean distance, i.e. the square root of the sum of squared differences across the attributes: d = sqrt((a1-b1)^2 + (a2-b2)^2 + (a3-b3)^2 + (a4-b4)^2).

Here we construct the Euclidean distance function:

import math

def get_distance(data1, data2):
    points = zip(data1, data2)
    diffs_squared_distance = [pow(a - b, 2) for (a, b) in points]
    return math.sqrt(sum(diffs_squared_distance))

The zip function pairs the two samples' values element by element; a list comprehension then subtracts each pair and squares the difference, and finally the squares are summed and square-rooted. For example, computing the Euclidean distance between the attribute values of the first two training samples gives:

>>> get_distance(train[0][0], train[1][0])

4.052159917870962
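As an aside, NumPy can compute the same quantity in a single call; a minimal sketch, using the same two feature arrays as above:

>>> import numpy as np
>>> np.linalg.norm(train[0][0] - train[1][0])
4.052159917870962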

3. Compute the distances to all neighbours

Next we construct a function that computes the distance between each test instance and every instance in the training set.

from operator import itemgetter  # fetch an item at a given index from a multi-element object

def get_neighbours(training_set, test_instance, k):
    distances = [_get_tuple_distance(training_instance, test_instance) for training_instance in training_set]
    # index 1 is the calculated distance between training_instance and test_instance
    sorted_distances = sorted(distances, key=itemgetter(1))
    # extract only training instances
    sorted_training_instances = [tuple[0] for tuple in sorted_distances]
    # select first k elements
    return sorted_training_instances[:k]

def _get_tuple_distance(training_instance, test_instance):
    return (training_instance, get_distance(test_instance, training_instance[0]))

In the code above, the _get_tuple_distance function (the leading underscore marks it as internal-use only) is a simple wrapper around the Euclidean distance function get_distance: it packs the computed distance together with the corresponding training instance into one tuple, as below (using train[0] and test[0][0] as an example):

>>> _get_tuple_distance(train[0], test[0][0])

(array([array([ 4.8, 3.4, 1.6, 0.2]), 0], dtype=object), 1.2328828005937953)

Inside the get_neighbours function, a list comprehension traverses the training instances and computes each distance (illustrated here with train[0:3] against test[0][0]), giving:

>>> [_get_tuple_distance(training_instance, test[0][0]) for training_instance in train[0:3]]

[(array([array([ 4.8, 3.4, 1.6, 0.2]), 0], dtype=object), 1.2328828005937953),

(array([array([ 5.7, 2.5, 5. , 2. ]), 2], dtype=object), 4.465422712353221),

(array([array([ 6.3, 2.7, 4.9, 1.8]), 2], dtype=object), 4.264973622427225)]

Then sorted_distances sorts the tuples by distance. Here key=itemgetter(1) specifies that the sort key is the second element of each tuple, i.e. the distance 1.2328828005937953 in (array([array([ 4.8, 3.4, 1.6, 0.2]), 0], dtype=object), 1.2328828005937953).
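To see what itemgetter(1) does on its own (it builds a callable that fetches the element at the given index, here used as the sort key):

>>> from operator import itemgetter
>>> itemgetter(1)(('some_instance', 1.23))
1.23
>>> sorted([(1, 3.0), (2, 0.5), (3, 1.7)], key=itemgetter(1))
[(2, 0.5), (3, 1.7), (1, 3.0)]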

Next, sorted_training_instances = [tuple[0] for tuple in sorted_distances] keeps only the training instances themselves, now in order of increasing distance, and return sorted_training_instances[:k] retains the first k of them. For example:

>>> distances = [_get_tuple_distance(training_instance, test[0][0]) for training_instance in train[0:3]]

>>> sorted_distances = sorted(distances, key=itemgetter(1))

>>> [tuple[0] for tuple in sorted_distances]

[array([array([ 4.8, 3.4, 1.6, 0.2]), 0], dtype=object),

array([array([ 6.3, 2.7, 4.9, 1.8]), 2], dtype=object),

array([array([ 5.7, 2.5, 5. , 2. ]), 2], dtype=object)]

4. Vote among the k nearest neighbours to pick the test sample's class

from collections import Counter

# Build the majority-vote function
def get_majority_vote(neighbours):
    # index 1 is the class
    classes = [neighbour[1] for neighbour in neighbours]
    count = Counter(classes)
    return count.most_common()[0][0]

Counter is a dict subclass for counting how often each object occurs; most_common(n) returns the n most frequent elements with their counts, or all of them when n is omitted. For example:

>>> Counter([7,7,7,6,6,9])

Counter({7: 3, 6: 2, 9: 1})

>>> Counter('bananas')

Counter({'a': 3, 'n': 2, 's': 1, 'b': 1})

>>> Counter('bananas').most_common()

[('a', 3), ('n', 2), ('s', 1), ('b', 1)]

The function above counts how often each class appears among the k nearest neighbours of a test instance and takes the class with the highest count (count.most_common()[0][0]) as that instance's predicted class.
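As a quick check of the voting logic, a sketch with hand-made (instance, label) pairs standing in for real training rows:

>>> fake_neighbours = [([5.0, 3.4, 1.5, 0.2], 0),
...                    ([4.8, 3.1, 1.6, 0.2], 0),
...                    ([5.1, 3.8, 1.9, 0.4], 0),
...                    ([6.0, 2.2, 5.0, 1.5], 2),
...                    ([6.3, 2.7, 4.9, 1.8], 2)]
>>> get_majority_vote(fake_neighbours)
0

Note that most_common() orders tied classes arbitrarily, so when two classes appear equally often among the k neighbours the vote may go either way; choosing an odd k reduces (though, with three classes, does not eliminate) such ties.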

5. Run the code on the iris data

# Set the value of k
from sklearn.metrics import classification_report, accuracy_score
predictions = []
k = 5
# For each test instance, predict its class with the functions above and append it to predictions
for x in range(len(X_test)):
    print "Classifying test instance number " + str(x) + ":",
    neighbours = get_neighbours(training_set=train, test_instance=test[x][0], k=5)
    majority_vote = get_majority_vote(neighbours)
    predictions.append(majority_vote)
    print "Predicted label=" + str(majority_vote) + ", Actual label=" + str(test[x][1])
# Analyse and report the classification results as a whole
print "\nThe overall accuracy of the model is: " + str(accuracy_score(y_test, predictions)) + "\n"
report = classification_report(y_test, predictions, target_names=iris.target_names)
print "A detailed classification report: \n\n" + report

Here the classification_report and accuracy_score functions from the sklearn.metrics module are used to compute the overall accuracy and the per-class classification report.
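On toy labels the accuracy metric behaves as follows (a minimal illustration with made-up y_true and y_pred lists):

>>> from sklearn.metrics import accuracy_score
>>> accuracy_score([0, 1, 2, 2], [0, 1, 1, 2])   # 3 of the 4 predictions match
0.75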

6. Results

The final output consists of two parts: first, a comparison of the predicted label with the actual label for each test instance; second, the prediction accuracy and the detailed classification report.

Precision is the ratio tp / (tp + fp): of the instances predicted positive, the fraction that actually are positive.

Recall is the ratio tp / (tp + fn): of the instances that actually are positive, the fraction that was found.

The F-beta score can be interpreted as a weighted harmonic mean of precision and recall.
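A quick worked example with hypothetical confusion counts:

>>> tp, fp, fn = 8, 2, 4   # hypothetical counts
>>> precision = float(tp) / (tp + fp)
>>> recall = float(tp) / (tp + fn)
>>> f1 = 2 * precision * recall / (precision + recall)
>>> round(precision, 3), round(recall, 3), round(f1, 3)
(0.8, 0.667, 0.727)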

The complete code is as follows:

from sklearn.datasets import load_iris
from sklearn import cross_validation
from sklearn.metrics import classification_report, accuracy_score
from operator import itemgetter
import numpy as np
import math
from collections import Counter

# 1) given two data points, calculate the euclidean distance between them
def get_distance(data1, data2):
    points = zip(data1, data2)
    diffs_squared_distance = [pow(a - b, 2) for (a, b) in points]
    return math.sqrt(sum(diffs_squared_distance))

# 2) given a training set and a test instance, use get_distance to calculate all pairwise distances
def get_neighbours(training_set, test_instance, k):
    distances = [_get_tuple_distance(training_instance, test_instance) for training_instance in training_set]
    # index 1 is the calculated distance between training_instance and test_instance
    sorted_distances = sorted(distances, key=itemgetter(1))
    # extract only training instances
    sorted_training_instances = [tuple[0] for tuple in sorted_distances]
    # select first k elements
    return sorted_training_instances[:k]

def _get_tuple_distance(training_instance, test_instance):
    return (training_instance, get_distance(test_instance, training_instance[0]))

# 3) given an array of nearest neighbours for a test case, tally up their classes to vote on test case class
def get_majority_vote(neighbours):
    # index 1 is the class
    classes = [neighbour[1] for neighbour in neighbours]
    count = Counter(classes)
    return count.most_common()[0][0]

# setting up main executable method
def main():
    # load the data and create the training and test sets
    # random_state = 1 is just a seed to permit reproducibility of the train/test split
    iris = load_iris()
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(iris.data, iris.target, test_size=0.4, random_state=1)
    # reformat train/test datasets for convenience
    train = np.array(zip(X_train, y_train))
    test = np.array(zip(X_test, y_test))
    predictions = []
    # let's arbitrarily set k equal to 5, meaning that to predict the class of new instances
    k = 5
    # for each instance in the test set, get nearest neighbours and majority vote on predicted class
    for x in range(len(X_test)):
        print "Classifying test instance number " + str(x) + ":",
        neighbours = get_neighbours(training_set=train, test_instance=test[x][0], k=5)
        majority_vote = get_majority_vote(neighbours)
        predictions.append(majority_vote)
        print "Predicted label=" + str(majority_vote) + ", Actual label=" + str(test[x][1])
    # summarize performance of the classification
    print "\nThe overall accuracy of the model is: " + str(accuracy_score(y_test, predictions)) + "\n"
    report = classification_report(y_test, predictions, target_names=iris.target_names)
    print "A detailed classification report: \n\n" + report

if __name__ == "__main__":
    main()
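Note that this listing targets Python 2 and an old scikit-learn. Under Python 3 and scikit-learn 0.20 or later, the cross_validation module has been removed in favour of model_selection, print is a function, and zip returns a lazy iterator; a sketch of the minimal changes:

from sklearn.model_selection import train_test_split   # replaces sklearn.cross_validation

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=1)
# zip() is lazy in Python 3, so materialise it before handing it to numpy
train = np.array(list(zip(X_train, y_train)), dtype=object)
test = np.array(list(zip(X_test, y_test)), dtype=object)
# ...and every `print "..."` statement becomes a print("...") function call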
