深入機器學習系列2-支持向量機

04-27

轉載請註明出處，該文章的官方來源：

Support Vector Machine | Teaching ML

寫在前面的：

SVM演算法是在深度學習大火之前最受歡迎的機器學習演算法，也是廣大機器學習愛好者的入門演算法。請大家系好安全帶，當心老司機一言不合就甩你一臉代碼！

1 SVM

1.1 介紹

Support Vector Machine 支持向量機是一種機器學習演算法。

給定一個訓練集 $S = { (x_i, y_i) }_{i=1}^{m}$ 其中 $x_{i}∈?^{n}$ 並且 $y_i in { +1, -1 }$ 圖1展示了一個 SVM 需要解決的問題。我們標記 $w cdot x - b = 0$ 為超平面， w 代表該超平面的向量。我們需要做的是找到能將 $y_i=1$ 的點和 $y_i=-1$ 的點分開的邊際最大的超平面. 這就意味著 $y_i(w cdot x_i -b ) geq 1$ ，對於所有 $1≤i≤n$

所以優化問題可以寫成：

$frac{2}{|w|}$

這等價於最小化

$frac{1}{2} | w |^2$

subject to $y_{i}(w?x_{i}?b)≥1$ for all $1 leq i leq n$

Figure1: SVM

事實上，它可以被看作一個帶有懲罰項的最小化損失問題。最終，我們希望找到以下問題的最小解

$min_{w} frac{lambda}{2}|w|^2 + frac{1}{m}sum_{(x, y) in S} ell(w; (x, y))$

其中 λ 是正規化參數, $ell(w, (x, y))$ 是 hinge 損失函數:

$ell(w, (x, y)) = max{0, 1-y langle w, x angle }$

對於這一最優化問題，我們可以使用梯度下降演算法來達到最小值。

目標函數為：

$f(w) = frac{lambda}{2}|w|^2 + frac{1}{m}sum_{(x_i, y_i) in S}ell(w; (x_i, y_i))$

所以，迭代 t 時的梯度為：

$abla_t = lambda w_t - frac{1}{m}sum_{(x_i, y_i) in S}mathbb I[y_i langle w, x_i angle < 1]y_i x_i$

於是，我們可以更新 w, 其中 $eta_t$ 是下降速度

$w_{t+1}←w_{t}?η_{t}?_{t}$

1.2 SGD

從上一節我們可以看到每次迭代我們都需要所有的數據點來計算梯度。而當數據集變大後，無疑會耗費大量的計算時間。這就是為什麼在大規模梯度下降演算法中，我們總會使用 SGD（隨機梯度下降）。SDG 在每次迭代時只使用一部分數據而不是全部，從而降低了計算量。

所以，現在目標函數變成了：

$f(w, A_t) = frac{lambda}{2}|w|^2 + frac{1}{k}sum_{(x_i, y_i) in A_t}ell(w; (x_i, y_i))$

where $A_t subset S$ , $|A_t| = k$ . At each iteration, we take a subset of data point.

然後梯度為:

$abla_t = lambda w_t - frac{1}{k}sum_{(x_i, y_i) in A_t}mathbb I[y_i langle w, x_i angle < 1]y_i x_i$

1.3 Pegasos and MLlib implementation

Pegasos 是 SVM 使用梯度下降演算法的一種實現。Spark MLlib 也提供了 SVM 的梯度下降實現，於 Pegasos 稍有不同。主要是梯度的更新速度不同。

$w_{t+1}←w_{t}?η_{t}?_{t}$

在 Pegasos 演算法中, 更新速度為

$eta_t = frac{alpha}{tlambda}$

而在 MLlib 中，為：

$eta_t = frac{alpha}{sqrt{t}}$

其中 α 是更新速度參數。

2 SGD in Spark

2.1 treeAggregate

Spark 來計算 SGD 的主要優勢使可以分散式地計算梯度，然後將它們累加起來。在 Spark 中，這一任務是通過 RDD 的 treeAggregate 方法來完成的。 Aggregate 可被視為泛化的 Map 和 Reduce 的組合。 treeAggregate 的定義為

RDD.treeAggregate(zeroValue: U)( seqOp: (U, T) => U, combOp: (U, U) => U, depth: Int = 2): U

在此方法中有三個參數，其中前兩個對我們更重要：

seqOp: 計算每隔 partition 中的子梯度
combOp: 將 seqOp 或上層 combOp 的值合併
depth: 控制 tree 的深度

Figure 2: tree aggregate

2.2 實現

SGD 是一個求最優化的演算法，許多機器學習演算法都可以用 SGD 來求解。所以 Spark 對其做了抽象。

class SVMWithSGD private ( private var stepSize: Double, private var numIterations: Int, private var regParam: Double, private var miniBatchFraction: Double) extends GeneralizedLinearAlgorithm[SVMModel] with Serializable { private val gradient = new HingeGradient() private val updater = new SquaredL2Updater() @Since("0.8.0") override val optimizer = new GradientDescent(gradient, updater) .setStepSize(stepSize) .setNumIterations(numIterations) .setRegParam(regParam) .setMiniBatchFraction(miniBatchFraction)

可以看到 SVMWithSGD 繼承了 GeneralizedLinearAlgorithm ，並定義 optimizer 來確定如何獲得優化解。而 optimizer 即是 SGD 演算法的實現。正如上節所述，線性 SVM 實際上是使用 hinge 損失函數和一個 L2 懲罰項的線性模型，因此這裡使用了 HingeGradient 和 SquaredL2Updater 作為 GradientDescent 的參數。

class HingeGradient extends Gradient { override def compute(data: Vector, label: Double, weights: Vector): (Vector, Double) = { val dotProduct = dot(data, weights) // Our loss function with {0, 1} labels is max(0, 1 - (2y - 1) (f_w(x))) // Therefore the gradient is -(2y - 1)*x val labelScaled = 2 * label - 1.0 if (1.0 > labelScaled * dotProduct) { val gradient = data.copy scal(-labelScaled, gradient) (gradient, 1.0 - labelScaled * dotProduct) } else { (Vectors.sparse(weights.size, Array.empty, Array.empty), 0.0) } } override def compute( data: Vector, label: Double, weights: Vector, cumGradient: Vector): Double = { val dotProduct = dot(data, weights) // Our loss function with {0, 1} labels is max(0, 1 - (2y - 1) (f_w(x))) // Therefore the gradient is -(2y - 1)*x val labelScaled = 2 * label - 1.0 if (1.0 > labelScaled * dotProduct) { axpy(-labelScaled, data, cumGradient) 1.0 - labelScaled * dotProduct } else { 0.0 } }}

/** * :: DeveloperApi :: * Updater for L2 regularized problems. * R(w) = 1/2 ||w||^2 * Uses a step-size decreasing with the square root of the number of iterations. */@DeveloperApiclass SquaredL2Updater extends Updater { override def compute( weightsOld: Vector, gradient: Vector, stepSize: Double, iter: Int, regParam: Double): (Vector, Double) = { // add up both updates from the gradient of the loss (= step) as well as // the gradient of the regularizer (= regParam * weightsOld) // w = w - thisIterStepSize * (gradient + regParam * w) // w = (1 - thisIterStepSize * regParam) * w - thisIterStepSize * gradient val thisIterStepSize = stepSize / math.sqrt(iter) val brzWeights: BV[Double] = weightsOld.asBreeze.toDenseVector brzWeights :*= (1.0 - thisIterStepSize * regParam) brzAxpy(-thisIterStepSize, gradient.asBreeze, brzWeights) val norm = brzNorm(brzWeights, 2.0) (Vectors.fromBreeze(brzWeights), 0.5 * regParam * norm * norm) }}

此節中, Code 1展示了 GradientDescent 的主要執行邏輯。重複執行 numIterations 次以獲得最終的 $w$ 。

首先, data.sample 通過 miniBatchFraction 取一部分樣本. 然後使用 treeAggregate 。在 seqOp 中, gradientSum 會通過 axpy(y, b_x, c._1) 更新，如果 $ylangle w, x angle < 1$ ，即分類錯誤。在 combOp 中, gradientSum 通過 c1._1 += c2._1 被集合起來。當獲得 gradientSum 後, 我們就可以計算 step 和 gradient 了。最後, 我們使用 axpy(-step, gradient, weights) 更新 weights 。

while (!converged && i <= numIterations) { val bcWeights = data.context.broadcast(weights) // Sample a subset (fraction miniBatchFraction) of the total data // compute and sum up the subgradients on this subset (this is one map-reduce) val (gradientSum, lossSum, miniBatchSize) = data.sample(false, miniBatchFraction, 42 + i) .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))( seqOp = (c, v) => { // c: (grad, loss, count), v: (label, features) val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1)) (c._1, c._2 + l, c._3 + 1) }, combOp = (c1, c2) => { // c: (grad, loss, count) (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3) }) if (miniBatchSize > 0) { /** * lossSum is computed using the weights from the previous iteration * and regVal is the regularization value computed in the previous iteration as well. */ stochasticLossHistory.append(lossSum / miniBatchSize + regVal) val update = updater.compute( weights, Vectors.fromBreeze(gradientSum / miniBatchSize.toDouble), stepSize, i, regParam) weights = update._1 regVal = update._2 previousWeights = currentWeights currentWeights = Some(weights) if (previousWeights != None && currentWeights != None) { converged = isConverged(previousWeights.get, currentWeights.get, convergenceTol) } } else { logWarning(s"Iteration ($i/$numIterations). The size of sampled batch is zero") } i += 1

Code 1: GradientDescent 代碼片斷

3.1 正確性驗證

我們模擬了一些簡單的 2D 和 3D 數據來驗證正確性。

Figure 3: 2D linear

Figure 4: 3D linear

3.2 收斂速度

我們比較兩種實現的收斂速度差異。這裡，我們使用 5GB 帶有 1000 個特徵的模擬數據。使用 4 個 executors 并迭代 100 次。

Figure 5: before aligning Y axis

Figure 6: after aligning Y axis

4 參考文獻

Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing." Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, 2012
Zaharia, Matei, et al. "Spark: cluster computing with working sets." Proceedings of the 2nd USENIX conference on Hot topics in cloud computing. Vol. 10. 2010
Shalev-Shwartz, Shai, et al. "Pegasos: Primal estimated sub-gradient solver for svm." Mathematical programming 127.1 (2011): 3-30