Starting from Zero: The Road to Data Scientist, Day 3
1) Course website: Introduction to Machine Learning
2) Michael Jordan's lecture on "Computational Thinking, Inferential Thinking, and Data Science"
3) A collection of course projects from Stanford CS 231 Convolutional Neural Networks for Visual Recognition.
Course Projects Winter 2015
Course Topics
- classification: perceptrons, support vector machines (SVMs), Gaussian discriminant analysis (including linear discriminant analysis, LDA, and quadratic discriminant analysis, QDA), logistic regression, decision trees, neural networks, convolutional neural networks, nearest neighbor search;
- regression: least-squares linear regression, logistic regression, polynomial regression, ridge regression, Lasso;
- dimensionality reduction: principal components analysis (PCA), latent factor analysis; and
- clustering: k-means clustering, hierarchical clustering, spectral graph clustering.
Core Material
--Finding patterns in data: using them to make predictions.
--Models and statistics help us understand patterns.
--Optimization algorithms "learn" the patterns.
--Collect training data: reliable debtors and defaulted debtors.
--Evaluate new applicants (prediction)
--Train a classifier: it learns to distinguish 7 from not 7
--Test the classifier on NEW images
2 kinds of error
-Training set error: fraction of training images not classified correctly.
-Test set error: fraction of misclassified NEW images, not seen during training.
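Both errors are the same calculation applied to different data: count the misclassified examples and divide by the total. Here is a minimal sketch of that calculation. The label and prediction lists are made up for illustration (True means "this image is a 7"), and `error_rate` is a hypothetical helper name, not something from the course.

```python
def error_rate(predictions, labels):
    """Fraction of examples whose predicted label differs from the true label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute an error rate on an empty set")
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return wrong / len(labels)

# Made-up data: True means "this image is a 7".
train_labels = [True, False, True, False, False]
train_preds = [True, False, True, False, True]   # 1 of 5 wrong

# NEW images that the classifier never saw during training.
test_labels = [True, True, False, False]
test_preds = [False, True, False, True]          # 2 of 4 wrong

train_error = error_rate(train_preds, train_labels)
test_error = error_rate(test_preds, test_labels)
print("training set error:", train_error)
print("test set error:", test_error)
```

The only thing that separates the two numbers is which images go in: images used for training, or images held out from it.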
Outliers: points whose labels are atypical.
Overfitting: when the test error deteriorates because the classifier becomes too sensitive to outliers or other spurious patterns.
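A toy illustration of this, assuming a k-nearest-neighbor classifier on hand-made 1-D data (my choice for the sketch, not an example from the lecture). The true rule is "label 1 when x is above 4.5", and the training point at x = 2 is a deliberately planted outlier. With k = 1 the classifier memorizes the outlier; with k = 3 the outlier is outvoted by its neighbors.

```python
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def knn_error(train_x, train_y, xs, ys, k):
    """Fraction of points in (xs, ys) that k-NN misclassifies."""
    wrong = sum(1 for x, y in zip(xs, ys)
                if knn_predict(train_x, train_y, x, k) != y)
    return wrong / len(xs)

# Hand-made training data; the label at x = 2 is the planted outlier.
train_x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
train_y = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1]

# New points, labeled by the true rule (1 when x > 4.5).
test_x = [1.8, 2.2, 3.6, 5.4, 6.7, 8.2]
test_y = [0, 0, 0, 1, 1, 1]

results = {}
for k in (1, 3):
    results[k] = (knn_error(train_x, train_y, train_x, train_y, k),
                  knn_error(train_x, train_y, test_x, test_y, k))
    print("k =", k, "training error:", results[k][0], "test error:", results[k][1])
```

On this data k = 1 has zero training error but misclassifies the two test points next to the outlier, while k = 3 gets the outlier itself "wrong" on the training set and makes no test mistakes. This is one small hand-built case, not a general result about k.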
Most ML algorithms have a few hyperparameters that control over/underfitting.
We select them by validation:
--Hold back a subset of training data, called the validation set.
--Train the classifier multiple times with different hyperparameter settings.
--Choose the settings that work best on the validation set.
Now we have 3 sets:
Training set used to learn model weights.
Validation set used to tune hyperparameters, choose among different models.
Test set used as FINAL evaluation of model. Keep in a vault. Run ONCE, at the very end.
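The validation procedure above can be sketched as follows. Everything specific here is an assumption made for illustration: the synthetic 1-D data with 10% flipped labels, the 120/40/40 split, the k-nearest-neighbor classifier, and the candidate values of k. Only the overall flow (tune on the validation set, touch the test set once at the end) comes from the notes.

```python
import random
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda point: abs(point[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def error_rate(train, data, k):
    """Fraction of (x, y) pairs in `data` that k-NN misclassifies."""
    wrong = sum(1 for x, y in data if knn_predict(train, x, k) != y)
    return wrong / len(data)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic data: label is 1 when x >= 5, with about 10% of labels flipped.
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    y = int(x >= 5)
    if rng.random() < 0.1:
        y = 1 - y  # simulate an outlier
    data.append((x, y))

# Split into the 3 sets; the test set goes "in the vault".
rng.shuffle(data)
train_set = data[:120]
validation_set = data[120:160]
test_set = data[160:]

# Try each hyperparameter setting, scoring on the validation set only.
candidate_ks = [1, 3, 5, 9, 15]  # odd values avoid tied votes
validation_errors = {k: error_rate(train_set, validation_set, k)
                     for k in candidate_ks}
best_k = min(candidate_ks, key=lambda k: validation_errors[k])

# FINAL evaluation: run once, with the chosen setting.
final_test_error = error_rate(train_set, test_set, best_k)
print("validation errors:", validation_errors)
print("chosen k:", best_k, "test error:", final_test_error)
```

The test set appears in exactly one line, after `best_k` is fixed. If it were used to pick k, the reported test error would no longer be an honest estimate of performance on new data.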
---Classification: is this email spam?
---Regression: how likely is this patient to have cancer?
---Clustering: which DNA sequences are similar to each other?
---Dimensionality reduction: what are the common features of faces? Common differences?
