From Zero: The Road to Data Scientist, Day 3
Here are some interesting resources to share:
1) Course website: Introduction to Machine Learning
2) Michael Jordan's lecture on "Computational Thinking, Inferential Thinking, and Data Science"
3) A showcase of course projects from Stanford CS231n, Convolutional Neural Networks for Visual Recognition:
Course Projects Winter 2015
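One of the projects does face recognition to tell Chinese, Korean, and Japanese people apart. Super cute!
4) Class notes from the first lecture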
Course Topics
- classification: perceptrons, support vector machines (SVMs), Gaussian discriminant analysis (including linear discriminant analysis, LDA, and quadratic discriminant analysis, QDA), logistic regression, decision trees, neural networks, convolutional neural networks, nearest neighbor search;
- regression: least-squares linear regression, logistic regression, polynomial regression, ridge regression, Lasso;
- dimensionality reduction: principal components analysis (PCA), latent factor analysis; and
- clustering: k-means clustering, hierarchical clustering, spectral graph clustering.
Core Material
--Finding patterns in data: using them to make predictions.
--Models and statistics help us understand patterns.
--Optimization algorithms "learn" the patterns.
CLASSIFICATION
--Collect training data: reliable debtors and defaulted debtors.
--Evaluate new applicants (prediction).
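The two steps above can be sketched in a few lines of Python. The notes do not say which classifier was used, so this sketch swaps in a perceptron (one of the classifiers listed under Course Topics). The features (income, debt), the data values, and the names train_perceptron and predict are all hypothetical, chosen only for illustration.

```python
def train_perceptron(examples, max_epochs=1000):
    """Perceptron learning rule; labels are +1 (reliable) or -1 (defaulted)."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for features, label in examples:
            score = sum(w * f for w, f in zip(weights, features)) + bias
            if label * score <= 0:  # misclassified, or exactly on the boundary
                weights = [w + label * f for w, f in zip(weights, features)]
                bias += label
                mistakes += 1
        if mistakes == 0:  # every training point is now classified correctly
            break
    return weights, bias


def predict(weights, bias, features):
    """Return +1 (predicted reliable) or -1 (predicted to default)."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else -1


# Hypothetical training data: (income, debt) in arbitrary units.
training_data = [
    ((5, 1), 1), ((6, 2), 1), ((7, 1), 1), ((8, 3), 1),      # reliable debtors
    ((2, 4), -1), ((1, 3), -1), ((3, 6), -1), ((2, 5), -1),  # defaulted debtors
]
weights, bias = train_perceptron(training_data)

# Evaluate new applicants (prediction).
print(predict(weights, bias, (7, 2)))  # 1: predicted reliable
print(predict(weights, bias, (1, 5)))  # -1: predicted to default
```

The perceptron only stops early here because the toy data is linearly separable; on real data the max_epochs cap is what ends the loop.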
Validation
--Train a classifier: it learns to distinguish 7 from not 7.
--Test the classifier on NEW images.
2 kinds of error:
--Training set error: fraction of training images not classified correctly.
--Test set error: fraction of misclassified NEW images, not seen during training.
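Both errors are the same computation applied to different data. A minimal sketch, assuming each image has been reduced to a single made-up feature and that the "trained" classifier is just a fixed threshold (the data, the threshold, and the names error_rate and is_seven are hypothetical):

```python
def error_rate(classifier, examples):
    """Fraction of (feature, label) examples the classifier gets wrong."""
    wrong = sum(1 for x, label in examples if classifier(x) != label)
    return wrong / len(examples)


def is_seven(x):
    """Stand-in for a trained classifier: True means "this is a 7"."""
    return x > 0.5


# Hypothetical data: one feature per image, label True means "is a 7".
training_images = [(0.9, True), (0.8, True), (0.2, False), (0.1, False), (0.6, False)]
new_images = [(0.85, True), (0.45, True), (0.3, False), (0.15, False)]

training_error = error_rate(is_seven, training_images)  # 1 of 5 wrong
test_error = error_rate(is_seven, new_images)           # 1 of 4 wrong
print(training_error, test_error)  # 0.2 0.25
```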
Outliers: points whose labels are atypical.
Overfitting: when the test error deteriorates because the classifier becomes too sensitive to outliers or other spurious patterns.
Most ML algorithms have a few hyperparameters that control over/underfitting.
We select them by validation:
--Hold back a subset of training data, called the validation set.
--Train the classifier multiple times with different hyperparameter settings.
--Choose the settings that work best on validation set.
Now we have 3 sets:
--Training set: used to learn model weights.
--Validation set: used to tune hyperparameters, choose among different models.
--Test set: used as FINAL evaluation of model. Keep in a vault. Run ONCE, at the very end.
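The validation procedure can be sketched with a k-nearest-neighbor classifier, where k is the hyperparameter. This is an illustrative assumption, not the method from the lecture: the 1-D data (including one deliberate outlier), the candidate values of k, and the function names are all made up.

```python
def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (labels are 0 or 1)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def error_rate(train, examples, k):
    """Fraction of examples misclassified by k-NN built on the training set."""
    wrong = sum(1 for x, label in examples if knn_predict(train, x, k) != label)
    return wrong / len(examples)


# Hypothetical data: label is 1 for large x, 0 for small x.
# (2.0, 1) is an outlier whose label is atypical.
train = [(1.0, 0), (1.5, 0), (2.0, 1), (2.5, 0), (3.0, 0), (4.0, 0),
         (6.0, 1), (7.0, 1), (7.5, 1), (8.0, 1), (9.0, 1)]
validation = [(1.9, 0), (2.2, 0), (3.5, 0), (6.5, 1), (8.5, 1), (4.4, 0)]
test = [(2.1, 0), (1.2, 0), (7.2, 1), (5.8, 1)]

# Try each hyperparameter setting and score it on the validation set.
candidate_ks = [1, 3, 11]
validation_errors = {k: error_rate(train, validation, k) for k in candidate_ks}

# Choose the setting that works best on the validation set.
best_k = min(candidate_ks, key=lambda k: validation_errors[k])

# Only now touch the test set, once, for the final evaluation.
final_test_error = error_rate(train, test, best_k)
print(validation_errors, best_k, final_test_error)
```

In this toy setup k=1 overfits: its training error is zero because it memorizes the outlier, yet it misclassifies the validation points near x=2. k=11 underfits, since it always predicts the majority class. k=3 is the setting validation picks.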
Supervised learning:
---Classification: is this email spam?
---Regression: how likely is it that this patient has cancer?
Unsupervised learning:
---Clustering: which DNA sequences are similar to each other?
---Dimensionality reduction: what are the common features of faces? Common differences?
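The key difference is that unsupervised methods get no labels. As a rough illustration of clustering, here is a minimal 1-D k-means sketch (k-means is listed under Course Topics). The points, the initial centers, and the name kmeans_1d are hypothetical, and real DNA sequences would need a proper similarity measure rather than plain numbers.

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd's algorithm in one dimension: alternate assignment and mean update."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        centers = [sum(c) / len(c) if c else old
                   for c, old in zip(clusters, centers)]
    return centers, clusters


# Hypothetical unlabeled data with two obvious groups.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
initial_centers = [points[0], points[-1]]  # assumed, simplistic initialization

centers, clusters = kmeans_1d(points, initial_centers)
print(centers)
print(clusters)
```

No labels appear anywhere in the input; the two groups emerge purely from which points are close to each other.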