Practical Advice for Applying Machine Learning Algorithms
These are my notes from Ng's Machine Learning course. I originally recorded them in English and added some explanations when writing this post, keeping the tone as conversational as possible. Some terms may not be translated precisely; if you spot a mistake, please point it out. For the implementation, see https://github.com/wrymax/machine-learning-assignments/tree/master/week6/machine-learning-ex5/ex5 (load it in Matlab / Octave and run ex5).
This post collects practical techniques for machine learning work, worth reviewing before you start coding.
Let's get started.
Key terms:
- Training set
- Cost function on the training set, Jtrain(theta)
- Cross-validation set (CV)
- Cost function on the cross-validation set, Jcv(theta)
- Test set
- Cost function on the test set, Jtest(theta)
- Prediction error / cost (the value of the cost function)
- Bias
- Variance
- Under-fitting
- Over-fitting
- Regularisation
- Precision
- Recall
- F score
Core Concepts
- Machine Learning Diagnostics
- To improve an algorithm's performance, you might:
- Collect more training examples => addresses over-fitting
- Get additional features => addresses high bias
- Add polynomial features => addresses high bias
- Reduce the number of features => addresses over-fitting
- Increase lambda => addresses over-fitting
- Decrease lambda => addresses under-fitting
- However, picking one of these at random may cost you six months of work with nothing to gain.
- A machine learning diagnostic is:
- A test that you can run to gain insight into what is or isn't working with a learning algorithm, and to get guidance on how best to improve its performance.
- It takes time to implement.
- It can sometimes rule out certain courses of action (changes to your learning algorithm) as being unlikely to improve performance significantly.
- Training/Testing Procedure
- Split the dataset 70/30: 70% becomes the training set, 30% becomes the test set.
- Learn the parameters theta from the training data.
- Compute the test set error by running the learned hypothesis on the test set (see the sketch below):
- Jtest(theta) = 1 / (2 * m_test) * sum over the test examples of (h(x_test) - y_test)^2, i.e. the squared-error cost evaluated on the test data.
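A minimal Octave sketch of this split-train-evaluate procedure, assuming `X` and `y` hold your features and labels (the variable names are illustrative, not from the course code):

```matlab
% Minimal sketch (not the course code): 70/30 split, fit by the normal
% equation, then compute Jtest(theta).
m = size(X, 1);
idx = randperm(m);                          % shuffle before splitting
m_train = round(0.7 * m);
X_train = X(idx(1:m_train), :);      y_train = y(idx(1:m_train));
X_test  = X(idx(m_train+1:end), :);  y_test  = y(idx(m_train+1:end));

Xt = [ones(m_train, 1), X_train];           % add the bias column
theta = pinv(Xt' * Xt) * Xt' * y_train;     % normal equation

Xs = [ones(size(X_test, 1), 1), X_test];
J_test = sum((Xs * theta - y_test) .^ 2) / (2 * size(X_test, 1));
```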
- Model Selection and Train/Validation/Test Sets
- Degree of polynomial, d: the quantity we need to choose here (the "d" in the figure).
- As the figure shows, we try d = 1 up to d = 10, compute the cost function for each candidate on the test set, and pick the best d (d = 5 in the figure); the steps below describe how to carry this out properly, using a separate validation set.
- Evaluating your hypothesis
- Training set: 60% of the data
- Cross-validation set (CV): 20% of the data
- Test set: 20% of the data
- Use the cross-validation set to select the model:
- Step 1: compute the cross-validation cost Jcv(theta) for each candidate degree.
- Step 2: choose the degree with the lowest Jcv as the target degree, e.g. d = 4.
- Step 3: estimate the generalisation error on the test set, Jtest(theta(4)).
- A sketch of this selection loop follows below.
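A minimal sketch of steps 1 and 2 for a single input feature, using the built-in `polyfit`/`polyval` rather than the course's feature-mapping code; `x_train`, `y_train`, `x_cv`, `y_cv` are assumed column vectors and the names are illustrative:

```matlab
% Pick the polynomial degree d that minimises the cross-validation cost.
max_degree = 10;
J_cv = zeros(max_degree, 1);
for d = 1:max_degree
  p = polyfit(x_train, y_train, d);              % fit a degree-d polynomial on the training set
  err = polyval(p, x_cv) - y_cv;                 % residuals on the cross-validation set
  J_cv(d) = sum(err .^ 2) / (2 * length(y_cv));  % squared-error CV cost
end
[~, best_d] = min(J_cv);                         % step 2: degree with the lowest Jcv
% Step 3: compute Jtest for best_d on the held-out test set.
```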
- Bias vs. Variance
- Bias => systematic error that remains even after fitting: the model is too simple for the data.
- High bias => (e.g. lambda too high) => under-fitting
- Variance => the model is too flexible for the data (too many features, too high a degree).
- High variance => (e.g. lambda too low) => over-fitting
- Diagnosing Bias vs. Variance
- Plot prediction error against the polynomial degree d (x-axis: number of features / degree d, y-axis: error).
- Train on the training set (the 60% split): as d grows, the training error Jtrain(theta) keeps shrinking towards 0.
- Evaluate the trained models on the cross-validation set: past a certain degree the model over-fits, and from that point on Jcv(theta) keeps rising.
- (Figure: Jtrain and Jcv plotted against the polynomial degree d.)
- Question
- How can we tell whether the model is suffering from high bias or from high variance?
- Bias => under-fitting, the left part of the diagram (too few features / too low a degree)
- Jtrain(theta) will be high
- Jcv(theta) ≈ Jtrain(theta), and both are high
- Variance => over-fitting, the right part of the diagram (too many features / too high a degree)
- Jtrain(theta) will be low
- Jcv(theta) >> (much higher than) Jtrain(theta): after passing its minimum the cross-validation error keeps rising and ends up far above the training error. This is the classic over-fitting signature: the model fits new data poorly (a plotting sketch follows below).
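A minimal sketch of this diagnostic plot, reusing the single-feature `polyfit` setup and illustrative variable names from the model-selection sketch above:

```matlab
% Plot Jtrain and Jcv against the polynomial degree d.
max_degree = 10;
J_train = zeros(max_degree, 1);
J_cv    = zeros(max_degree, 1);
for d = 1:max_degree
  p = polyfit(x_train, y_train, d);
  J_train(d) = sum((polyval(p, x_train) - y_train) .^ 2) / (2 * length(y_train));
  J_cv(d)    = sum((polyval(p, x_cv)    - y_cv)    .^ 2) / (2 * length(y_cv));
end
plot(1:max_degree, J_train, 1:max_degree, J_cv);
legend('Jtrain', 'Jcv'); xlabel('degree d'); ylabel('error');
% Left side  (both errors high and close together) => high bias.
% Right side (Jtrain low, Jcv much higher)         => high variance.
```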
- Choosing the regularisation parameter lambda
- Try lambda from small to large, roughly doubling at each step, e.g. 0, 0.01, 0.02, 0.04, 0.08, ..., 10.
- Train theta for each lambda and pick the value whose parameters give the lowest cross-validation cost, say theta(5).
- Report the generalisation error by computing Jtest(theta) on the test set (a code sketch follows below).
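A minimal sketch of this lambda sweep for regularised linear regression, solved in closed form; `Xt`/`yt` are the training set and `Xc`/`yc` the cross-validation set, each with a leading column of ones (illustrative names, not course code):

```matlab
% Choose lambda by cross-validation.
lambdas = [0 0.01 0.02 0.04 0.08 0.16 0.32 0.64 1.28 2.56 5.12 10];
n = size(Xt, 2);
L = eye(n); L(1, 1) = 0;                     % do not regularise the bias term
J_cv = zeros(size(lambdas));
for i = 1:length(lambdas)
  theta = pinv(Xt' * Xt + lambdas(i) * L) * Xt' * yt;         % regularised normal equation
  J_cv(i) = sum((Xc * theta - yc) .^ 2) / (2 * size(Xc, 1));  % CV cost, no regularisation term
end
[~, best] = min(J_cv);
best_lambda = lambdas(best);                 % then report Jtest with this lambda
```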
- Bias/variance as a function of the regularisation parameter lambda
- When lambda is 0
- You can fit the training set relatively well, since there is no regularisation.
- When lambda is small
- You get a small value of Jtrain.
- When lambda grows large
- The bias grows with it, so Jtrain becomes much larger (the model under-fits).
- Learning Curves
- As the training set size m grows:
- It gets harder to fit every example perfectly, so the training error grows.
- With more examples the model generalises better to new data, so the cross-validation error decreases.
- High bias
- High bias means low variance: h(theta) is a low-degree function that cannot fit the dataset well.
- When m is small, Jcv is high and Jtrain is low; as the dataset grows, the two converge towards a similar value.
- Both Jcv and Jtrain end up fairly HIGH.
- Conclusion
- If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much: the error converges to a high plateau and stops falling.
- High variance
- High variance means lambda is small and/or the polynomial hypothesis has very many features.
- When m is small:
- Jtrain is small; as m grows, the high-degree hypothesis finds it harder to fit every example, so Jtrain rises gradually, but it stays fairly low.
- Jcv is large (over-fitting makes predictions on unseen points poor); as m grows, Jcv falls and converges towards Jtrain.
- The indicative diagnostic of a high-variance problem:
- As m grows there remains a large gap between the training error and the cross-validation error.
- Conclusion
- If a learning algorithm is suffering from high variance, getting more training data is likely to help (see the learning-curve sketch below).
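A minimal learning-curve sketch: train on the first i examples, then compare the training error with the error on the full CV set. `Xt`/`yt` and `Xc`/`yc` carry a leading column of ones, and `trainTheta` stands in for whatever fitting routine you use (e.g. the regularised normal equation above); all of these names are illustrative, not course functions.

```matlab
% Learning curves: Jtrain and Jcv as a function of the training set size.
m = size(Xt, 1);
J_train = zeros(m, 1);
J_cv    = zeros(m, 1);
for i = 1:m
  theta = trainTheta(Xt(1:i, :), yt(1:i));                          % fit on the first i examples
  J_train(i) = sum((Xt(1:i, :) * theta - yt(1:i)) .^ 2) / (2 * i);  % error on those i examples
  J_cv(i)    = sum((Xc * theta - yc) .^ 2) / (2 * size(Xc, 1));     % error on the whole CV set
end
plot(1:m, J_train, 1:m, J_cv);
legend('Jtrain', 'Jcv'); xlabel('m (training set size)'); ylabel('error');
```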
- Deciding what to try next (revisited)
- While debugging a learning algorithm, you find your model makes unacceptably large errors in its predictions. What should you try next?
- Get more training examples
- When Jcv is much higher than Jtrain, this fixes high variance.
- Try smaller sets of features
- Dropping features that carry little information also fixes high variance.
- Try getting additional features
- The model may be under-fitting (high bias); additional features help it fit the training set better.
- Try adding polynomial features (x1 * x1, x2 * x2, x1 * x2, etc.)
- This also addresses high bias (under-fitting).
- Try decreasing lambda
- This also addresses high bias (under-fitting).
- Try increasing lambda
- This addresses high variance (over-fitting).
- Neural networks and over-fitting
- Small neural network
- Fewer parameters (few layers, few units), more prone to under-fitting.
- Computationally cheaper.
- Large neural network
- Type 1: few layers with many units per layer.
- Type 2: many layers with few units per layer.
- More parameters, more prone to over-fitting.
- Computationally more expensive.
- Use regularisation to address over-fitting.
- Try one, two, three hidden layers, and so on, and compute Jcv(theta) to decide how many layers to use (see the sketch below).
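A minimal sketch of selecting the architecture by cross-validation; `trainNetwork` and `networkCost` are hypothetical stand-ins for your own neural-network training and cost code (e.g. adapted from the week-5 assignment), not course-provided functions, and the other names are illustrative:

```matlab
% Pick the architecture with the lowest cross-validation cost.
architectures = {[25], [25 25], [25 25 25]};   % 1, 2 and 3 hidden layers
J_cv = zeros(length(architectures), 1);
for i = 1:length(architectures)
  params  = trainNetwork(X_train, y_train, architectures{i}, lambda);  % hypothetical helper
  J_cv(i) = networkCost(params, X_cv, y_cv, architectures{i});         % hypothetical helper
end
[~, best] = min(J_cv);                         % architecture with the lowest Jcv
```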
- Building a Spam Classifier
- Prioritising What to Work On
- Recommended approach
- Handling Skewed Data
- Error metrics for skewed data
- Precision
- Precision = TruePositive / (No. of predicted positive)
- No. of predicted positive = TruePositive + FalsePositive
- Recall
- Recall = TruePositive / (No. of actual positive)
- No. of actual positive = TruePositive + FalseNegative
- Trading off precision and recall
- High precision, low recall: suppose we want to predict y = 1 only if very confident, so we raise the threshold:
- Predict 1 if h(x) >= 0.9
- Predict 0 if h(x) < 0.9
- High recall, low precision: suppose we want to avoid false negatives (predict y = 0 only if very confident), so we lower the threshold:
- Predict 1 if h(x) >= 0.3
- Predict 0 if h(x) < 0.3
- F score
- Used to decide which (precision, recall) pair is better.
- Measure P and R on the cross-validation set and choose the threshold that maximises the F score, F = 2 * P * R / (P + R), as sketched below.
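A minimal sketch of these metrics on the cross-validation set; `pred_prob` holds h(x) for each CV example and `y_cv` the true 0/1 labels (illustrative names):

```matlab
% Precision, recall and F score for a given threshold.
threshold = 0.5;                       % sweep this to trade precision against recall
pred = pred_prob >= threshold;

tp = sum((pred == 1) & (y_cv == 1));   % true positives
fp = sum((pred == 1) & (y_cv == 0));   % false positives
fn = sum((pred == 0) & (y_cv == 1));   % false negatives

precision = tp / (tp + fp);            % tp / no. of predicted positives
recall    = tp / (tp + fn);            % tp / no. of actual positives
f_score   = 2 * precision * recall / (precision + recall);
```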
- Using Large Data Sets
- First ask yourself two questions:
- Question 1: Can a human expert accurately predict y from the given features x?
- E.g. can a realtor predict a house's price from its size alone?
- Question 2: Can we get a training set large enough to train a model with this many features?
- If the answer to both is yes, then:
- Use a learning algorithm with many parameters (many features / many hidden units).
- Use a very large training set.