聚類演算法——K-Means

05-11

介紹經典的聚類演算法——K-Means演算法，重點研究不同的初始化方式對於聚類效果的影響。

聚類，是機器學習的任務之一。同分類演算法一樣，聚類演算法也被廣泛的應用在各個領域，如根據話題，對文章、網頁和搜索結果做聚類；根據社區發現對社交網路中的用戶做聚類；根據購買歷史記錄對消費者做聚類。和分類演算法不同的是，聚類演算法的樣本是沒有標籤的，也就是說，我們並不知道樣本有哪些類別，演算法需要根據樣本的特徵，對樣本進行聚類，形成不同的聚類中心點。這篇文章，主要介紹比較著名的聚類演算法——K-means演算法。

首先，我們看一下基於目標來做聚類的演算法定義:

Input : A set S of n points, also a distance/dissimilarity measure specifying the distance d(x, y) between pairs (x, y).

Goal: output a partition of the data

基於這個定義，選擇不同的距離計算公式，有以下三種具體的演算法:

k-means: find center partitions $c_1, c_2, …, c_k$ to minimize
$sum min_{j in{i, …,k}}d^2(x^i, c_j)$
k-median: find center partitions $c_1, c_2, …, c_k$ to minimize
$sum min_{j in{i, …,k}}d(x^i, c_j)$
k-center: find partition to minimize the maximum radius

Euclidean k-means clustering

採用歐拉距離公式的k-means演算法定義如下:

Input: A set of n datapoints $x^1, x^2, …, x^n$ in $R^d$ (target #clusters k)

Output: k representatives $c1, c2, …, c_k in R^d$

Objective: choose $c1, c2, …, ck in R^d$ to minimize

$sum min{j in {1,…,k}}||x^i - c_j||^2$

求解該演算法的最優解是一個NP難的問題，所有我們沒有辦法獲得最優解，當然，當k=1或d=1這種特殊情況下，是可以獲得最優解，有興趣的可以自行推導一下，這裡不在贅述，這裡我們主要介紹Lloyds method[1]，該方法的核心演算法如下:

Input: A set of n datapoints $x^1, x^2, …, x^n in R^d$

Initialize centers $c1, c2, …, ck in R^d$ and clusters $C1, C2, …, Ck$ in any way.

Repeat until there is no further change in the cost.

For each j: $Cj leftarrow {x in S whose closest center is c_j}$
For each j: $c_j leftarrow mean of C_j$

對於該演算法，難度不是特別大，最重要的地方在Repeat中的1，2兩個步驟，其中，步驟1將固定住聚類中心 $c_1, c_2, …, c_k$ ，更新聚類集 $C_1, C_2, …, C_k$ 。步驟2固定住聚類 $C_1, C_2, …, C_k$ ，更新聚類中心 $c_1, c_2, …, c_k$ （本文用大寫字元表示集合，這裡用C指聚類簇；用小寫字元表示單個點，這裡用c指聚類中心。在本人的其他博文中，也會採用這種表述方式）。

大部分學習k-means演算法的人理解了步驟1和步驟2就覺得已經理解了k-means了，其實不然，先不說k-means中比較重要的聚類中心的初始化問題，任何一個機器學習演算法，它要是有效的，必須證明其可收斂，也需要給出其時間複雜度和空間複雜度。

Converges

目標函數的值在每一輪的迭代都會降低，這個特性由演算法中步驟1和步驟2保證，因為對於每個樣本點，我們每次都是選擇最接近的聚類中心；而且，在每個聚類簇里，我們選擇平均值作為其聚類中心。
目標函數有最小值0。

由於目標函數有最小值，而且在每一輪中都是值都是減少的，所有演算法必然會收斂。

Running Time

O(nkd) n為樣本數 k為聚類中心數 d為維度

Initialization

介紹完了整個演算法過程、收斂性和時間複雜度之後，該演算法的兩個核心點需要我們思考: 1. 如何選擇k的值; 2. 演算法剛開始，並沒有聚類中心，如何初始化聚類中心。對於問題1，我目前還沒有過多的認識。這裡主要介紹問題2，如何初始化聚類中心。

Random Initialization

這種初始化方式是最簡單的方式，就是隨機選k個點作為聚類中心，雖然簡單，但是會存在問題，我們看下面的這個例子:

由於，我們採用了隨機初始化的方式，對於這個樣本，我們隨機初始化的三個點如上圖的綠、紅、黑三個樣本點，再後面的迭代中，我們最後的聚類簇如上圖的箭頭所示，這樣的效果好嗎？顯然是不好的，為什麼呢？因為很顯然最左邊三個、中間三個、最右邊三個應該是被歸為一個聚類簇的:

我們可以看到，聚類中心初始化得不好，直接影響我們最後聚類的效果，可能上面舉的例子樣本分布和初始化聚類中心太極端，不能說明問題，我們現在假設樣本的分布是多個高斯分布的情況下，聚類中心初始化不好導致的最後聚類的效果:

我們現在計算一下假設有k個高斯分布，我們隨機初始化正確的概率有大(所謂正確是指任何兩個隨機初始化中心不在同一個高斯分布中): $frac {k!}{k^k} approx frac {1}{e^k}$ ，當k增大時，這個概率會越來越低。

Furthest Point Heuristic

這種方法是一個中心點一個中心點依次進行初始化的，首先隨機初始化 $c_1$ ，然後選擇距離 $c_1$ 最遠的點來初始化 $c_2$ ，以此類推。

演算法描述如下:

Choose $c_1$ arbitrarily (or at random).

For j = 2, …, k

Pick $c_j$ among datapoints $x^1, x^2, …, x^n$ that is farthest from previously chosen $c1, c2, …, c{j-1}$

這種方法解決了隨機初始化高斯分布例子中的問題:

但是，這種方法的問題是容易受雜訊點干擾，請看下面的例子:

所以這種方式進行初始化也是不行的，一旦出現雜訊點，就極大的影響了最後聚類的結果。雖然實際上，幾乎沒有哪一個k-means演算法會採用上面兩種初始化方式，但是這裡這樣介紹是順著我們的思維方式進行的，一般思考的方式都是從簡單到複雜，所以下面，我們也順理成章的引出k-means++這個初始化演算法，該演算法很好的反映出隨機化思想在演算法中的重要性。

k-means++

演算法描述如下:

Choose $c_1$ at random.
For j = 2, …, k
Pick $c_j$ among $x^1, x^2, …, x^n$ according to the distribution

$Pr(cj = x^i) propto min{j < j}left | x^i - c_{j} ight |^2$

這就是k-means++的初始化過程，這個過程比較不好理解。關於這個過程，作以下幾點說明:

這個初始化演算法引入隨機化，下一個被選為中心點的樣本不是固定的，而是一個概率值，這個概率值正比於「離最近中心點的距離「。
」離最近中心點的距離「如何計算，實際上非常簡單，就是當前樣本距離各個中心點的距離中，最小的那個距離。
既然概率正比於」距離「，那麼離群點的」距離「肯定是最大的，它的概率肯定是最大的，可是為什麼演算法不一定會選擇它呢？舉個例子說明，如果我們現在有一個聚類集合 $S={x_1,x_2,x_3}$ ,和離群點 $x_o$ ，假設選中 $x_o$ 的概率為 $1/3$ , 選中 $x_1, x_2, x_3$ 的概率分別為 2/9，這樣看，即使 $x_o$ 的概率很大，但是它只有1個，而 $x_1, x_2, x_3$ 即使每個概率不大，但是我們只要隨便選中其中一個都是可以的(這是因為它們都在一個聚類簇中，只要選擇聚類簇中任何一個點當聚類中心都可以)，所以可以把他們的概率相加，最後得到的概率就大於選中 $x_o$ 的概率。

In Action

當然，在實際項目中，我們可能不會自己實現k-means演算法，一般我們都會用現成的比較好的一些機器學習庫，我們這裡結合scikit-learn來看一下，它是如何實現k-means演算法的。

首先看一下，sklearn.cluster.k_means模塊下的函數k_means方法:

def k_means(X, n_clusters, init=k-means++, precompute_distances=auto, n_init=10, max_iter=300, verbose=False, tol=1e-4, random_state=None, copy_x=True, n_jobs=1, algorithm="auto", return_n_iter=False):

首先，我們看到參數有一個init，這裡是指定k-means初始化方法，這裡我們看下注釋:

""" init : {k-means++, random, or ndarray, or a callable}, optional Method for initialization, default to k-means++: k-means++ : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. See section Notes in k_init for more details. random: generate k centroids from a Gaussian with mean and variance estimated from the data. If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers. If a callable is passed, it should take arguments X, k and and a random state and return an initialization."""

可以看到，sklearn實現了2種初始化演算法，一個是隨機初始化演算法，另一個是k-means++演算法，默認採用的是k-means++演算法。然後，我們先看一下sklearn實現k-means++的代碼:

def _k_init(X, n_clusters, x_squared_norms, random_state, n_local_trials=None): """Init n_clusters seeds according to k-means++ Parameters ----------- X : array or sparse matrix, shape (n_samples, n_features) The data to pick seeds for. To avoid memory copy, the input data should be double precision (dtype=np.float64). n_clusters : integer The number of seeds to choose x_squared_norms : array, shape (n_samples,) Squared Euclidean norm of each data point. random_state : numpy.RandomState The generator used to initialize the centers. n_local_trials : integer, optional The number of seeding trials for each center (except the first), of which the one reducing inertia the most is greedily chosen. Set to None to make the number of trials depend logarithmically on the number of seeds (2+log(k)); this is the default. Notes ----- Selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. see: Arthur, D. and Vassilvitskii, S. "k-means++: the advantages of careful seeding". ACM-SIAM symposium on Discrete algorithms. 2007 Version ported from http://www.stanford.edu/~darthur/kMeansppTest.zip, which is the implementation used in the aforementioned paper. """ n_samples, n_features = X.shape centers = np.empty((n_clusters, n_features), dtype=X.dtype) assert x_squared_norms is not None, x_squared_norms None in _k_init # Set the number of local seeding trials if none is given if n_local_trials is None: # This is what Arthur/Vassilvitskii tried, but did not report # specific results for other than mentioning in the conclusion # that it helped. n_local_trials = 2 + int(np.log(n_clusters)) # Pick first center randomly center_id = random_state.randint(n_samples) if sp.issparse(X): centers[0] = X[center_id].toarray() else: centers[0] = X[center_id] # Initialize list of closest distances and calculate current potential closest_dist_sq = euclidean_distances( centers[0, np.newaxis], X, Y_norm_squared=x_squared_norms, squared=True) current_pot = closest_dist_sq.sum() # Pick the remaining n_clusters-1 points for c in range(1, n_clusters): # Choose center candidates by sampling with probability proportional # to the squared distance to the closest existing center rand_vals = random_state.random_sample(n_local_trials) * current_pot candidate_ids = np.searchsorted(stable_cumsum(closest_dist_sq), rand_vals) # Compute distances to center candidates distance_to_candidates = euclidean_distances( X[candidate_ids], X, Y_norm_squared=x_squared_norms, squared=True) # Decide which candidate is the best best_candidate = None best_pot = None best_dist_sq = None for trial in range(n_local_trials): # Compute potential when including center candidate new_dist_sq = np.minimum(closest_dist_sq, distance_to_candidates[trial]) new_pot = new_dist_sq.sum() # Store result if it is the best local trial so far if (best_candidate is None) or (new_pot < best_pot): best_candidate = candidate_ids[trial] best_pot = new_pot best_dist_sq = new_dist_sq # Permanently add best center candidate found in local tries if sp.issparse(X): centers[c] = X[best_candidate].toarray() else: centers[c] = X[best_candidate] current_pot = best_pot closest_dist_sq = best_dist_sq return centers

該演算法的是基於 k-means++:the advantages of careful seeding[2]實現的，有興趣的可以看一下這篇論文。代碼第49行，可以看到，第一個初始中心是隨機初始化的。代碼62行，通過循環，依次初始化其他的聚類中心。

References

Lloyd, Stuart P. Least squares quantization in PCM[J]. IEEE Transactions on Information Theory, 1982, 28(2):129-137.
Arthur D, Vassilvitskii S. k-means++:the advantages of careful seeding[C]// Eighteenth Acm-Siam Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2007:1027-1035.
Julyedu 機器學習演算法班