Python 3 Machine Learning Cookbook - Chapter 4: Clustering (15)

05-19

From the column: Python Machine Learning Classic Practice - Study Notes

Building a mean shift clustering model

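Mean shift is a powerful unsupervised learning algorithm for clustering data points. It treats the distribution of the data points as a probability density function and looks for the "modes" of that function in the feature space. These modes are the local maxima of the density, that is, the regions where the points are most densely packed. The main advantage of mean shift is that the number of clusters does not have to be fixed in advance. Suppose we have a set of input points and want to find the clusters in them without knowing how many there are. Mean shift treats the points as samples drawn from a probability density function; if the data contains clusters, they correspond to the peaks of that function. The algorithm starts from a random point and repeatedly shifts it toward the mean of the nearby points, so it gradually converges on a peak. You can learn more at http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf.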
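To make the "shift toward the peak" idea concrete, here is a minimal NumPy sketch of the update for a single starting point. It assumes a flat kernel (every sample within `bandwidth` counts equally) and uses synthetic data; the function name `mean_shift_point` and the demo values are made up for illustration and are not how scikit-learn implements it internally.

```python
import numpy as np

def mean_shift_point(start, X, bandwidth, max_iter=300, tol=1e-4):
    """Shift one point uphill until it settles on a density peak (flat kernel)."""
    point = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        # Neighbors are all samples within `bandwidth` of the current position
        distances = np.linalg.norm(X - point, axis=1)
        neighbors = X[distances <= bandwidth]
        if len(neighbors) == 0:
            break
        # The "mean shift" step: move to the mean of the neighbors
        new_point = neighbors.mean(axis=0)
        converged = np.linalg.norm(new_point - point) < tol
        point = new_point
        if converged:
            break
    return point

# Synthetic data (an assumption for illustration): two well-separated blobs
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.3, size=(100, 2)),
])

mode_a = mean_shift_point(X_demo[0], X_demo, bandwidth=1.5)
mode_b = mean_shift_point(X_demo[-1], X_demo, bandwidth=1.5)
print("Mode reached from the first sample:", mode_a)
print("Mode reached from the last sample:", mode_b)
```

Running the same update from many starting points and merging the end points that land close together gives the cluster centers; points are then assigned to the nearest center.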

Step-by-step code

Import the required libraries:

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

import utilities

Load the input data from the data_multivar.txt file:

# Load data from input file
X = utilities.load_data('data_multivar.txt')
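The `utilities` module is a local helper and is not shown in these notes. If you do not have it, something along these lines should work as a stand-in. This is only a sketch: it assumes the data file holds one sample per line with comma-separated numeric values, and the demo file written below is made up for illustration.

```python
import os
import tempfile

import numpy as np

def load_data(input_file):
    """Hypothetical stand-in for utilities.load_data.

    Assumes one sample per line, with comma-separated numeric features.
    """
    return np.loadtxt(input_file, delimiter=",", ndmin=2)

# Demo with a throwaway file so the sketch runs on its own
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("1.0,2.0\n3.5,4.5\n")
    demo_path = f.name

X_loaded = load_data(demo_path)
os.remove(demo_path)
print(X_loaded.shape)
```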

Create a mean shift model by specifying the input parameters:

# Estimate the kernel bandwidth from the data
bandwidth = estimate_bandwidth(X, quantile=0.1, n_samples=len(X))

# Compute clustering with MeanShift
# (the constructor argument is `bandwidth`, without a trailing underscore)
meanshift_estimator = MeanShift(bandwidth=bandwidth, bin_seeding=True)
meanshift_estimator.fit(X)
labels = meanshift_estimator.labels_
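The `quantile` argument has a big influence on the result: a larger quantile gives a wider bandwidth, and a wider bandwidth tends to merge nearby peaks into fewer clusters. The sketch below tries a few values on synthetic data from `make_blobs`; the blob centers, spread, and quantile values are assumptions chosen only for illustration, not values from the original data set.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic stand-in for the real data: three 2-D blobs
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [0, 6]],
    cluster_std=0.6,
    random_state=0,
)

results = {}
for quantile in (0.1, 0.3, 0.5):
    # Larger quantile -> more neighbors considered -> wider bandwidth
    bandwidth = estimate_bandwidth(X_demo, quantile=quantile)
    model = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(X_demo)
    n_found = len(np.unique(model.labels_))
    results[quantile] = (bandwidth, n_found)
    print(f"quantile={quantile}: bandwidth={bandwidth:.3f}, clusters={n_found}")
```

If the number of clusters looks too high or too low on your own data, adjusting `quantile` is usually the first thing to try.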

Extract the cluster centers from the model, then print the number of clusters:

centroids = meanshift_estimator.cluster_centers_
num_clusters = len(np.unique(labels))
print("Number of clusters in input data =", num_clusters)
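Besides the number of clusters, it is often useful to see how many points ended up in each one. A small sketch with `np.unique`, using a hand-written label array (`labels_demo` is made up for illustration) in place of the real `labels`:

```python
import numpy as np

# Hypothetical labels, standing in for meanshift_estimator.labels_
labels_demo = np.array([0, 1, 0, 2, 1, 0])

# return_counts=True gives the size of each cluster alongside its id
cluster_ids, sizes = np.unique(labels_demo, return_counts=True)
for cluster_id, size in zip(cluster_ids, sizes):
    print(f"cluster {cluster_id}: {size} points")
```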

Plot the clustered data:

import matplotlib.pyplot as plt

plt.figure()

# Specify marker shapes for different clusters
# (four markers, so at most four clusters are drawn)
markers = '.*xv'

for i, marker in zip(range(num_clusters), markers):
    # Plot the points that belong to the current cluster
    plt.scatter(X[labels == i, 0], X[labels == i, 1],
                marker=marker, color='mediumaquamarine')

    # Plot the centroid of the current cluster
    centroid = centroids[i]
    plt.plot(centroid[0], centroid[1], marker='o',
             markerfacecolor='pink', markeredgecolor='k', markersize=15)

plt.title('Clusters and their centroids')
plt.show()

Output

Number of clusters in input data = 4