

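The KNN algorithm, or k-nearest neighbors, rests on a simple idea: given a sample of unknown class in a feature space, its class can be inferred from the classes of the k samples closest to it. The basic steps are walked through below.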




# Import sklearn and numpy
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

# Load the iris dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=1)

# Pair each feature vector with its label to rebuild the train/test datasets
# (on Python 3, zip returns an iterator, so it is wrapped in list())
train = np.array(list(zip(X_train, y_train)), dtype=object)
test = np.array(list(zip(X_test, y_test)), dtype=object)

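The command that builds the training and test sets splits the data 6:4: setting test_size=0.4 leaves 60% of the samples for training and 40% for testing. iris.data holds the measurements of each flower (4 variables per sample), and iris.target holds the class labels 0, 1, 2. random_state=1 is the random seed; fixing it makes the shuffle, and therefore the resulting train/test split, reproducible.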
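To make the role of test_size and the seed concrete, here is a minimal pure-Python sketch of a seeded shuffle-and-split. The helper name split_indices is hypothetical, and this is not how sklearn implements the split internally; it only illustrates that a fixed seed yields the same 60/40 partition on every run.

```python
import random

def split_indices(n_samples, test_size, seed):
    """Hypothetical helper: shuffle indices with a fixed seed, then cut off the test share."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle order
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]  # (train indices, test indices)

# The iris dataset has 150 samples
train_idx, test_idx = split_indices(150, test_size=0.4, seed=1)
print(len(train_idx), len(test_idx))  # 90 60
```

Calling split_indices again with the same seed returns exactly the same two index lists, which is the property random_state provides.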


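For an iris of unknown species, we judge which class it belongs to by how similar it is to samples of the 3 known iris species (its nearest neighbours). This requires a similarity measure. One option is the Euclidean distance, the square root of the sum of squared differences across the attributes: d = sqrt((a1-b1)^2 + (a2-b2)^2 + (a3-b3)^2 + (a4-b4)^2).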


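import math

def get_distance(data1, data2):
    # Pair up the corresponding attributes of the two samples
    points = zip(data1, data2)
    # Squared difference for each attribute
    diffs_squared_distance = [pow(a - b, 2) for (a, b) in points]
    # The square root of the sum is the Euclidean distance
    return math.sqrt(sum(diffs_squared_distance))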


>>> get_distance(train[0][0], train[1][0])
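As a quick sanity check of the formula, the sketch below repeats the get_distance function (so the block runs on its own) and applies it to two made-up 4-attribute points whose distance is easy to verify by hand. It also compares the result with math.dist from the standard library (Python 3.8+). The sample values are illustrative and are not taken from the iris data.

```python
import math

def get_distance(data1, data2):
    # Square root of the sum of squared attribute differences
    return math.sqrt(sum(pow(a - b, 2) for a, b in zip(data1, data2)))

a = [1, 2, 3, 4]
b = [2, 4, 5, 8]

# Differences are 1, 2, 2, 4 -> squares 1 + 4 + 4 + 16 = 25 -> sqrt(25) = 5
d = get_distance(a, b)
print(d)  # 5.0

# The standard library computes the same Euclidean distance
assert abs(d - math.dist(a, b)) < 1e-12
```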




from operator import itemgetter  # fetches one item from an indexable object

def get_neighbours(training_set, test_instance, k):
    distances = [_get_tuple_distance(training_instance, test_instance) for training_instance in training_set]
    # index 1 is the calculated distance between training_instance and test_instance
    sorted_distances = sorted(distances, key=itemgetter(1))
    # extract only training instances
    sorted_training_instances = [tuple[0] for tuple in sorted_distances]
    # select first k elements
    return sorted_training_instances[:k]

def _get_tuple_distance(training_instance, test_instance):
    return (training_instance, get_distance(test_instance, training_instance[0]))

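The code above defines _get_tuple_distance (the leading underscore marks it as intended for internal use only) as a thin wrapper around the Euclidean distance function get_distance: it puts the computed distance and the corresponding training instance into one tuple, as shown below (using train[0] and test[0][0] as an example):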

>>> _get_tuple_distance(train[0], test[0][0])

(array([array([ 4.8, 3.4, 1.6, 0.2]), 0], dtype=object), 1.2328828005937953)

Inside get_neighbours, a list comprehension iterates over all training instances and computes each distance. Using train[0:3] and test[0][0] as an example, the result is:

>>>[_get_tuple_distance(training_instance, test[0][0]) for training_instance in train[0:3]]

[(array([array([ 4.8, 3.4, 1.6, 0.2]), 0], dtype=object), 1.2328828005937953),

(array([array([ 5.7, 2.5, 5. , 2. ]), 2], dtype=object), 4.465422712353221),

(array([array([ 6.3, 2.7, 4.9, 1.8]), 2], dtype=object), 4.264973622427225)]

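The distances are then sorted into sorted_distances. The argument key=itemgetter(1) tells sorted to order the tuples by their distance, i.e. the second element of a tuple such as (array([array([ 4.8, 3.4, 1.6, 0.2]), 0], dtype=object), 1.2328828005937953).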
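A small standalone sketch of how itemgetter(1) drives the sort. The pairs are made up: a string label stands in for the training instance, and the numbers stand in for distances.

```python
from operator import itemgetter

# Made-up (instance, distance) pairs
pairs = [("a", 4.47), ("b", 1.23), ("c", 4.26)]

# Sort on the element at index 1 of each pair (the distance)
by_distance = sorted(pairs, key=itemgetter(1))
print(by_distance)  # [('b', 1.23), ('c', 4.26), ('a', 4.47)]

# itemgetter(1) behaves like lambda pair: pair[1]
assert by_distance == sorted(pairs, key=lambda pair: pair[1])
```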

Next, sorted_training_instances = [tuple[0] for tuple in sorted_distances] keeps only the sorted training instances, and return sorted_training_instances[:k] returns the first k of them. For example:

>>> distances = [_get_tuple_distance(training_instance, test[0][0]) for training_instance in


>>> sorted_distances = sorted(distances, key=itemgetter(1))

>>> [tuple[0] for tuple in sorted_distances]

[array([array([ 4.8, 3.4, 1.6, 0.2]), 0], dtype=object),

array([array([ 6.3, 2.7, 4.9, 1.8]), 2], dtype=object),

array([array([ 5.7, 2.5, 5. , 2. ]), 2], dtype=object)]
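Sorting every distance works, but only the k smallest are needed. As an alternative sketch (not the approach used in this article), heapq.nsmallest from the standard library selects the k nearest items without fully sorting the list. The function name k_nearest and the toy data are hypothetical, and math.dist requires Python 3.8+.

```python
import heapq
import math
from operator import itemgetter

def k_nearest(training_set, test_instance, k):
    """Hypothetical variant of get_neighbours built on heapq.nsmallest."""
    # Pair each (features, label) training instance with its distance to the test point
    scored = [(inst, math.dist(inst[0], test_instance)) for inst in training_set]
    # Keep only the k pairs with the smallest distance (index 1 of each pair)
    nearest = heapq.nsmallest(k, scored, key=itemgetter(1))
    # Drop the distances and return the training instances only
    return [inst for inst, _ in nearest]

toy_train = [([0.0, 0.0], 0), ([5.0, 5.0], 1), ([1.0, 1.0], 0), ([6.0, 5.0], 1)]
print(k_nearest(toy_train, [0.2, 0.1], k=2))  # [([0.0, 0.0], 0), ([1.0, 1.0], 0)]
```

For a dataset as small as iris the difference is negligible; the selection approach mainly pays off when the training set is large and k is small.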


from collections import Counter

# Majority-vote function
def get_majority_vote(neighbours):
    # index 1 is the class
    classes = [neighbour[1] for neighbour in neighbours]
    count = Counter(classes)
    return count.most_common()[0][0]


>>> Counter([7,7,7,6,6,9])

Counter({7: 3, 6: 2, 9: 1})

>>> Counter('bananas')

Counter({'a': 3, 'n': 2, 's': 1, 'b': 1})

>>> Counter('bananas').most_common()

[('a', 3), ('n', 2), ('s', 1), ('b', 1)]
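The same idea applied to class labels: the sketch below tallies made-up neighbour labels and picks the most frequent one. It also shows what happens on a tie. On Python 3.7+, elements with equal counts are returned in the order they were first encountered, so the label seen first wins.

```python
from collections import Counter

# Made-up classes of k = 5 nearest neighbours
labels = [2, 2, 0, 1, 2]
winner = Counter(labels).most_common(1)[0][0]
print(winner)  # 2

# Two votes each: the label encountered first (1) is returned
tied = [1, 0, 0, 1]
tie_winner = Counter(tied).most_common(1)[0][0]
print(tie_winner)  # 1
```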



# Set the value of k
from sklearn.metrics import classification_report, accuracy_score

predictions = []
k = 5

# For each test instance, predict its class with the functions defined above
# and append the prediction to the predictions list
for x in range(len(X_test)):
    print('Classifying test instance number ' + str(x) + ':', end=' ')
    neighbours = get_neighbours(training_set=train, test_instance=test[x][0], k=k)
    majority_vote = get_majority_vote(neighbours)
    predictions.append(majority_vote)
    print('Predicted label=' + str(majority_vote) + ', Actual label=' + str(test[x][1]))

# Summarize and report the overall classification results
print('\nThe overall accuracy of the model is: ' + str(accuracy_score(y_test, predictions)) + '\n')
report = classification_report(y_test, predictions, target_names=iris.target_names)
print('A detailed classification report: \n\n' + report)

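This step uses the classification_report and accuracy_score functions from the sklearn.metrics module to compute the overall accuracy and to produce a per-class classification report.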





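#The precision is the ratio tp / (tp + fp): of the samples predicted as positive, how many are actually positive.

#The recall is the ratio tp / (tp + fn): of the samples that are actually positive, how many were predicted as positive.

#The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall.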
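To make these definitions concrete, here is a small pure-Python sketch that counts tp, fp, and fn for one class and derives precision, recall, and F-beta from them. The helper name and the label lists are made up for illustration, and the sketch assumes the chosen class appears at least once in both the true and the predicted labels (otherwise a division by zero occurs).

```python
def precision_recall_fbeta(y_true, y_pred, positive, beta=1.0):
    """Hypothetical helper: metrics for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)  # correctly predicted positive
    fp = sum(1 for t, p in pairs if p == positive and t != positive)  # predicted positive, actually not
    fn = sum(1 for t, p in pairs if p != positive and t == positive)  # actually positive, missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Weighted harmonic mean of precision and recall; beta = 1 gives the F1 score
    fbeta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, fbeta

y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 2]

# For class 1: tp = 2, fp = 1, fn = 1
p, r, f1 = precision_recall_fbeta(y_true, y_pred, positive=1)
print(p, r, f1)
```

With tp = 2, fp = 1, and fn = 1, precision and recall both equal 2/3, and so does their harmonic mean.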


from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from operator import itemgetter
import numpy as np
import math
from collections import Counter

# 1) given two data points, calculate the euclidean distance between them
def get_distance(data1, data2):
    points = zip(data1, data2)
    diffs_squared_distance = [pow(a - b, 2) for (a, b) in points]
    return math.sqrt(sum(diffs_squared_distance))

# 2) given a training set and a test instance, use get_distance to calculate all pairwise distances
def get_neighbours(training_set, test_instance, k):
    distances = [_get_tuple_distance(training_instance, test_instance) for training_instance in training_set]
    # index 1 is the calculated distance between training_instance and test_instance
    sorted_distances = sorted(distances, key=itemgetter(1))
    # extract only training instances
    sorted_training_instances = [tuple[0] for tuple in sorted_distances]
    # select first k elements
    return sorted_training_instances[:k]

def _get_tuple_distance(training_instance, test_instance):
    return (training_instance, get_distance(test_instance, training_instance[0]))

# 3) given an array of nearest neighbours for a test case, tally up their classes to vote on test case class
def get_majority_vote(neighbours):
    # index 1 is the class
    classes = [neighbour[1] for neighbour in neighbours]
    count = Counter(classes)
    return count.most_common()[0][0]

# setting up main executable method
def main():
    # load the data and create the training and test sets
    # random_state = 1 is just a seed to permit reproducibility of the train/test split
    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.4, random_state=1)

    # reformat train/test datasets for convenience
    # (on Python 3, zip returns an iterator, so it is wrapped in list())
    train = np.array(list(zip(X_train, y_train)), dtype=object)
    test = np.array(list(zip(X_test, y_test)), dtype=object)

    predictions = []

    # arbitrarily set k = 5: the class of each new instance is decided by its 5 nearest neighbours
    k = 5

    # for each instance in the test set, get nearest neighbours and majority vote on predicted class
    for x in range(len(X_test)):
        print('Classifying test instance number ' + str(x) + ':', end=' ')
        neighbours = get_neighbours(training_set=train, test_instance=test[x][0], k=k)
        majority_vote = get_majority_vote(neighbours)
        predictions.append(majority_vote)
        print('Predicted label=' + str(majority_vote) + ', Actual label=' + str(test[x][1]))

    # summarize performance of the classification
    print('\nThe overall accuracy of the model is: ' + str(accuracy_score(y_test, predictions)) + '\n')
    report = classification_report(y_test, predictions, target_names=iris.target_names)
    print('A detailed classification report: \n\n' + report)

if __name__ == "__main__":
    main()


