Essentials of Machine Learning Algorithms (with Python and R Codes)

Cited from: Essentials of Machine Learning Algorithms (with Python and R Codes)

Introduction

Google's self-driving cars and robots get a lot of press, but the company's real future is in machine learning, the technology that enables computers to get smarter and more personal.

– Eric Schmidt (Google Chairman)

We are probably living in the most defining period of human history: the period when computing moved from large mainframes to PCs to the cloud. But what makes it defining is not what has happened, but what is coming our way in the years to come.

What makes this period exciting for someone like me is the democratization of the tools and techniques that followed the boost in computing. Today, as a data scientist, I can build data-crunching machines with complex algorithms for a few dollars per hour. But reaching here wasn't easy! I had my dark days and nights.

Who can benefit the most from this guide?

What I am giving out today is probably the most valuable guide I have ever created.

The idea behind creating this guide is to simplify the journey of aspiring data scientists and machine learning enthusiasts across the world. Through this guide, I will enable you to work on machine learning problems and gain from experience. I am providing a high-level understanding of various machine learning algorithms along with R & Python code to run them. These should be sufficient to get your hands dirty.

I have deliberately skipped the statistics behind these techniques, as you don't need to understand them at the start. So, if you are looking for a statistical understanding of these algorithms, you should look elsewhere. But if you are looking to equip yourself to start building machine learning projects, you are in for a treat.

Broadly, there are 3 types of machine learning algorithms:

1. Supervised Learning

How it works: This class of algorithms consists of a target / outcome variable (or dependent variable) which is to be predicted from a given set of predictors (independent variables). Using this set of variables, we generate a function that maps inputs to the desired outputs. The training process continues until the model achieves a desired level of accuracy on the training data. Examples of Supervised Learning: Regression, Decision Tree, Random Forest, KNN, Logistic Regression, etc.

2. Unsupervised Learning

How it works: In these algorithms, we do not have any target or outcome variable to predict / estimate. They are used for clustering a population into different groups, which is widely applied for segmenting customers into different groups for specific interventions. Examples of Unsupervised Learning: Apriori algorithm, K-means.

3. Reinforcement Learning

How it works: Using these algorithms, the machine is trained to make specific decisions. It works this way: the machine is exposed to an environment where it trains itself continually using trial and error. The machine learns from past experience and tries to capture the best possible knowledge to make accurate business decisions. Example of Reinforcement Learning: Markov Decision Process.
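The article provides no code for this category, so here is a purely illustrative, hedged sketch of the trial-and-error idea: a tiny tabular Q-learning loop on a made-up two-state problem (the states, actions, rewards and hyperparameters below are all assumptions, not from the original guide).

# Illustrative tabular Q-learning on a made-up 2-state, 2-action problem (not from the article).
import numpy as np

n_states, n_actions = 2, 2
# transitions[state][action] -> (next_state, reward); values chosen arbitrarily for illustration
transitions = {
    0: {0: (0, 0.0), 1: (1, 1.0)},
    1: {0: (0, 0.0), 1: (1, 2.0)},
}

Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

state = 0
for _ in range(5000):
    # epsilon-greedy choice: mostly exploit, sometimes explore (trial and error)
    action = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[state]))
    next_state, reward = transitions[state][action]
    # update the estimate from the observed reward plus the best value of the next state
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

print(Q)  # both states should end up preferring action 1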

List of Common Machine Learning Algorithms

Here is the list of commonly used machine learning algorithms. These algorithms can be applied to almost any data problem:

1. Linear Regression

2. Logistic Regression

3. Decision Tree

4. SVM

5. Naive Bayes

6. KNN

7. K-Means

8. Random Forest

9. Dimensionality Reduction Algorithms

10. Gradient Boosting & AdaBoost

1. Linear Regression

It is used to estimate real values (cost of houses, number of calls, total sales, etc.) based on continuous variable(s). Here, we establish a relationship between the independent and dependent variables by fitting a best-fit line. This best-fit line is known as the regression line and is represented by the linear equation Y = a*X + b.

The best way to understand linear regression is to relive this experience of childhood. Let us say you ask a child in fifth grade to arrange the people in his class in increasing order of weight, without asking them their weights! What do you think the child will do? He / she would likely look (visually analyze) at the height and build of people and arrange them using a combination of these visible parameters. This is linear regression in real life! The child has actually figured out that height and build are correlated to weight by a relationship, which looks like the equation above.

In this equation:

  • Y – Dependent Variable

  • a – Slope

  • X – Independent variable

  • b – Intercept

These coefficients a and b are derived by minimizing the sum of the squared differences of distance between the data points and the regression line.

Look at the example below. Here we have identified the best-fit line with the linear equation y = 0.2811x + 13.9. Now, using this equation, we can find the weight, knowing the height of a person.

Linear Regression is mainly of two types: Simple Linear Regression and Multiple Linear Regression. Simple Linear Regression is characterized by one independent variable, and Multiple Linear Regression (as the name suggests) is characterized by multiple (more than 1) independent variables. While finding the best-fit line, you can also fit a polynomial or curvilinear relationship, and these are known as polynomial or curvilinear regression.

Python Code:

#Import Library

#Import other necessary libraries like pandas, numpy...

from sklearn import linear_model

#Load Train and Test datasets

#Identify feature and response variable(s); values must be numeric numpy arrays

x_train=input_variables_values_training_datasets

y_train=target_variables_values_training_datasets

x_test=input_variables_values_test_datasets

# Create linear regression object

linear = linear_model.LinearRegression()

# Train the model using the training sets and check score

linear.fit(x_train, y_train)

linear.score(x_train, y_train)

#Equation coefficient and Intercept

print('Coefficient: \n', linear.coef_)

print('Intercept: \n', linear.intercept_)

#Predict Output

predicted= linear.predict(x_test)
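The snippet above keeps the datasets as placeholders. As a minimal runnable sketch with made-up height / weight numbers (the data values are assumptions for illustration only), the same steps look like this:

# Runnable illustration with made-up height (cm) / weight (kg) data.
import numpy as np
from sklearn import linear_model

x_train = np.array([[150], [160], [170], [180], [190]])   # heights
y_train = np.array([50, 56, 62, 69, 74])                  # weights
x_test = np.array([[165], [175]])

linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)

print('R^2 on training data:', linear.score(x_train, y_train))
print('Coefficient:', linear.coef_)      # slope a
print('Intercept:', linear.intercept_)   # intercept b

predicted = linear.predict(x_test)
print('Predicted weights:', predicted)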

R Code

#Load Train and Test datasets

#Identify feature and response variable(s); values must be numeric

x_train <- input_variables_values_training_datasets

y_train <- target_variables_values_training_datasets

x_test <- input_variables_values_test_datasets

x <- cbind(x_train,y_train)

# Train the model using the training sets and check score

linear <- lm(y_train ~ ., data = x)

summary(linear)

#Predict Output

predicted= predict(linear,x_test)

2. Logistic Regression

Don't get confused by its name! It is a classification algorithm, not a regression algorithm. It is used to estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variable(s). In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function. Hence, it is also known as logit regression. Since it predicts the probability, its output values lie between 0 and 1 (as expected).

Again, let us try and understand this through a simple example.

Let's say your friend gives you a puzzle to solve. There are only 2 outcome scenarios – either you solve it or you don't. Now imagine that you are being given a wide range of puzzles / quizzes in an attempt to understand which subjects you are good at. The outcome of this study would be something like this – if you are given a trigonometry-based tenth grade problem, you are 70% likely to solve it. On the other hand, if it is a fifth grade history question, the probability of getting an answer is only 30%. This is what Logistic Regression provides you.

Coming to the math, the log odds of the outcome is modeled as a linear combination of the predictor variables.

odds = p / (1-p) = probability of event occurrence / probability of event non-occurrence

ln(odds) = ln(p/(1-p))

logit(p) = ln(p/(1-p)) = b0 + b1X1 + b2X2 + b3X3 + ... + bkXk

Above, p is the probability of the presence of the characteristic of interest. Logistic regression chooses parameters that maximize the likelihood of observing the sample values, rather than parameters that minimize the sum of squared errors (as in ordinary regression).

Now, you may ask, why take a log? For the sake of simplicity, let's just say that this is one of the best mathematical ways to replicate a step function. I could go into more detail, but that would defeat the purpose of this article.
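To see how the log-odds turn into a probability, here is a tiny hedged illustration; the coefficient values b0 and b1 below are invented for the example, not fitted from any data:

# Turning a linear combination of predictors (the log-odds) into a probability.
# The coefficients below are arbitrary illustrative values, not fitted ones.
import numpy as np

b0, b1 = -4.0, 0.05     # hypothetical intercept and slope
x1 = 100                # hypothetical predictor value

log_odds = b0 + b1 * x1          # logit(p) = b0 + b1*X1
odds = np.exp(log_odds)          # odds = p / (1 - p)
p = odds / (1 + odds)            # equivalently, p = 1 / (1 + exp(-log_odds))

print(round(p, 3))               # 0.731 -> about a 73% chance of the event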

Python Code

#Import Library

from sklearn.linear_model import LogisticRegression

#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset

# Create logistic regression object

model = LogisticRegression()

# Train the model using the training sets and check score

model.fit(X, y)

model.score(X, y)

#Equation coefficient and Intercept

print('Coefficient: \n', model.coef_)

print('Intercept: \n', model.intercept_)

#Predict Output

predicted= model.predict(x_test)

R Code

x <- cbind(x_train,y_train)

# Train the model using the training sets and check score

logistic <- glm(y_train ~ ., data = x,family=binomial)

summary(logistic)

#Predict Output

predicted= predict(logistic, x_test, type = "response")  # type = "response" returns probabilities

Furthermore..

There are many different steps that could be tried in order to improve the model:

  • including interaction terms

  • removing features

  • regularization techniques

  • using a non-linear model

3. Decision Tree

This is one of my favorite algorithms and I use it quite frequently. It is a type of supervised learning algorithm that is mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets. This is done based on the most significant attributes / independent variables, to make the groups as distinct as possible. For more details, you can read: Decision Tree Simplified.

source: statsexchange

In the image above, you can see that the population is classified into four different groups based on multiple attributes, to identify 'if they will play or not'. To split the population into distinct groups, it uses various techniques like Gini, Information Gain, Chi-square, and entropy (a short Gini example appears below).

The best way to understand how a decision tree works is to play Jezzball – a classic game from Microsoft (image below). Essentially, you have a room with moving walls and you need to create walls such that the maximum area gets cleared off without the balls.

So, every time you split the room with a wall, you are trying to create 2 different populations within the same room. Decision trees work in a very similar fashion, by dividing a population into groups that are as different as possible.

More: Simplified Version of Decision Tree Algorithms
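Here is the short Gini example referred to above — a hedged illustration of one splitting criterion, with made-up class counts rather than numbers from any real dataset:

# Gini impurity of a node: 1 - sum(p_i^2) over the class proportions p_i.
# A split is preferred when it lowers the weighted impurity of the child nodes.
def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

parent = gini([10, 10])                    # hypothetical node: 10 "play" vs 10 "don't play" -> 0.5
left, right = gini([8, 2]), gini([2, 8])   # hypothetical children after a split -> 0.32 each
weighted_children = 0.5 * left + 0.5 * right

print(parent, weighted_children)           # 0.5 vs 0.32 -> the split makes the groups purer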

Python Code:

#Import Library

#Import other necessary libraries like pandas, numpy...

from sklearn import tree

#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset

# Create tree object

model = tree.DecisionTreeClassifier(criterion='gini') # for classification; you can change the criterion to gini or entropy (information gain), by default it is gini

# model = tree.DecisionTreeRegressor() for regression

# Train the model using the training sets and check score

model.fit(X, y)

model.score(X, y)

#Predict Output

predicted= model.predict(x_test)

R Code:

library(rpart)

x <- cbind(x_train,y_train)

# grow tree

fit <- rpart(y_train ~ ., data = x,method="class")

summary(fit)

#Predict Output

predicted= predict(fit,x_test)

4. SVM (Support Vector Machine)

It is a classification method. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate.

For example, if we only had two features like Height and Hair length of an individual, we'd first plot these two variables in two-dimensional space, where each point has two coordinates (the points lying closest to the separating boundary are known as support vectors).

Now, we will find some line that splits the data between the two differently classified groups of data. This will be the line such that the distances from the closest point in each of the two groups are the farthest away.

In the example shown above, the line which splits the data into two differently classified groups is the black line, since the two closest points are the farthest apart from the line. This line is our classifier. Then, depending on which side of the line the testing data lands, that is the class we assign to the new data.

More: Simplified Version of Support Vector Machine
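As a minimal hedged sketch (the height / hair-length values and labels are invented for illustration), scikit-learn exposes the fitted support vectors and the resulting classifier directly:

# Tiny illustrative SVM on made-up (height, hair length) data; the labels 0/1 are arbitrary.
import numpy as np
from sklearn import svm

X = np.array([[150, 30], [155, 35], [160, 32],    # class 0
              [175, 5],  [180, 8],  [185, 4]])    # class 1
y = np.array([0, 0, 0, 1, 1, 1])

model = svm.SVC(kernel='linear')
model.fit(X, y)

print(model.support_vectors_)        # the boundary-defining points
print(model.predict([[165, 10]]))    # classify a new observation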

Think of this algorithm as playing JezzBall in n-dimensional space. The tweaks in the game are:

  • You can draw lines / planes at any angle (rather than just horizontal or vertical as in the classic game)

  • The objective of the game is to segregate balls of different colors into different rooms.

  • And the balls are not moving.

Python Code:

#Import Library

from sklearn import svm

#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset

# Create SVM classification object

model = svm.SVC() # there are various options associated with it; this is a simple one for classification. You can refer to the link for more detail.

# Train the model using the training sets and check score

model.fit(X, y)

model.score(X, y)

#Predict Output

predicted= model.predict(x_test)

R Code:

library(e1071)

x <- cbind(x_train,y_train)

# Fitting model

fit <-svm(y_train ~ ., data = x)

summary(fit)

#Predict Output

predicted= predict(fit,x_test)

5. Naive Bayes

It is a classification technique based on Bayes' theorem, with an assumption of independence between predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier would consider all of these properties to independently contribute to the probability that this fruit is an apple.

A Naive Bayesian model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.

Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c). Look at the equation below:

P(c|x) = P(x|c) * P(c) / P(x)

Here,

  • P(c|x) is the posterior probability of class (target) given predictor (attribute).

  • P(c) is the prior probability of class.

  • P(x|c) is the likelihood, which is the probability of predictor given class.

  • P(x) is the prior probability of predictor.

Example: Let's understand it using an example. Below I have a training data set of weather and the corresponding target variable 'Play'. Now, we need to classify whether players will play or not based on the weather condition. Let's follow the steps below to perform it.

Step 1: Convert the data set to a frequency table.

Step 2: Create a Likelihood table by finding the probabilities, like Overcast probability = 0.29 and probability of playing = 0.64.

Step 3: Now, use the Naive Bayesian equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.

Problem: Players will play if the weather is sunny. Is this statement correct?

We can solve it using the method discussed above: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)

Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64

Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher probability.
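A quick way to check this arithmetic, using the same counts quoted above:

# Reproduce the worked example: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_sunny_given_yes = 3 / 9    # sunny days among the 9 "Yes" days
p_yes = 9 / 14               # prior probability of playing
p_sunny = 5 / 14             # prior probability of a sunny day

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))   # 0.6 -> "play" is the more likely class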

Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This algorithm is mostly used in text classification and with problems having multiple classes.

Python Code:

#Import Library

from sklearn.naive_bayes import GaussianNB

#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset

# Create Naive Bayes classification object. There are other variants for other feature distributions, like Bernoulli Naive Bayes; refer to the link.

model = GaussianNB()

# Train the model using the training sets and check score

model.fit(X, y)

#Predict Output

predicted= model.predict(x_test)

R Code:

library(e1071)

x <- cbind(x_train,y_train)

# Fitting model

fit <-naiveBayes(y_train ~ ., data = x)

summary(fit)

#Predict Output

predicted= predict(fit,x_test)

6. KNN (K-Nearest Neighbors)

It can be used for both classification and regression problems. However, it is more widely used for classification problems in industry. K-nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of its k neighbors. The case is assigned to the class most common amongst its K nearest neighbors, measured by a distance function.

These distance functions can be Euclidean, Manhattan, Minkowski and Hamming distance. The first three are used for continuous variables and the fourth one (Hamming) for categorical variables. If K = 1, then the case is simply assigned to the class of its nearest neighbor. At times, choosing K turns out to be a challenge while performing KNN modeling.

More: Introduction to k-nearest neighbors: Simplified.
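As a small hedged illustration (the two points below are arbitrary), the continuous distance functions differ only in how the coordinate differences are aggregated:

# Euclidean, Manhattan and Minkowski distances between two arbitrary points.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))          # straight-line distance: 5.0
manhattan = np.sum(np.abs(a - b))                  # sum of absolute differences: 7.0
minkowski = np.sum(np.abs(a - b) ** 3) ** (1 / 3)  # general form; here with p = 3

print(euclidean, manhattan, round(minkowski, 3))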

KNN can easily be mapped to our real lives. If you want to learn about a person of whom you have no information, you might like to find out about his close friends and the circles he moves in, and gain access to his/her information!

Things to consider before selecting KNN:

  • KNN is computationally expensive

  • Variables should be normalized, else higher-range variables can bias it

  • Work more on the pre-processing stage (e.g., outlier and noise removal) before going for KNN

Python Code:

#Import Library

from sklearn.neighbors import KNeighborsClassifier

#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset

# Create KNeighbors classifier object

model = KNeighborsClassifier(n_neighbors=6) # default value for n_neighbors is 5

# Train the model using the training sets and check score

model.fit(X, y)

#Predict Output

predicted= model.predict(x_test)

R Code:

library(class)

# Fitting model: knn() from the class package fits and predicts in a single step

#Predict Output

predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)

summary(predicted)

7. K-Means

It is a type of unsupervised algorithm which solves the clustering problem. Its procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters). Data points inside a cluster are homogeneous, and heterogeneous with respect to peer groups.

Remember figuring out shapes from ink blots? K-means is somewhat similar to this activity. You look at the shape and spread to decipher how many different clusters / populations are present!

How K-means forms clusters:

1. K-means picks k points, known as centroids, one for each cluster.

2. Each data point forms a cluster with the closest centroid, i.e. k clusters.

3. Find the centroid of each cluster based on the existing cluster members. Here we have new centroids.

4. As we have new centroids, repeat steps 2 and 3. Find the closest distance for each data point from the new centroids and get associated with the new k clusters. Repeat this process until convergence occurs, i.e. the centroids do not change.

How to determine the value of K:

In K-means, we have clusters, and each cluster has its own centroid. The sum of squares of the differences between the centroid and the data points within a cluster constitutes the within-cluster sum of squares for that cluster. And when the within-cluster sums of squares for all the clusters are added, it becomes the total within-cluster sum of squares for the cluster solution.

We know that as the number of clusters increases, this value keeps decreasing, but if you plot the result you may see that the sum of squared distances decreases sharply up to some value of k, and then much more slowly after that. Here, we can find the optimum number of clusters.
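A hedged sketch of this "elbow" idea, assuming scikit-learn and some synthetic blob data (the dataset and the range of k are assumptions for illustration):

# Plot-free "elbow" check: inertia_ is the total within-cluster sum of squares.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # synthetic data with 4 groups

for k in range(1, 8):
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    print(k, round(km.inertia_, 1))   # the drop flattens out around k = 4 here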

Python Code:

#Import Library

from sklearn.cluster import KMeans

#Assumed you have X (attributes) for the training data set and x_test (attributes) of the test dataset

# Create KMeans clustering object

k_means = KMeans(n_clusters=3, random_state=0)

# Train the model using the training set

k_means.fit(X)

#Predict Output

predicted= k_means.predict(x_test)

R Code

library(cluster)

fit <- kmeans(X, 3) # 3 cluster solution

8. Random Forest

Random Forest is a trademarked term for an ensemble of decision trees. In Random Forest, we have a collection of decision trees (hence known as a "Forest"). To classify a new object based on its attributes, each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest).

Each tree is planted & grown as follows:

1. If the number of cases in the training set is N, then a sample of N cases is taken at random, but with replacement. This sample will be the training set for growing the tree.

2. If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M, and the best split on these m is used to split the node. The value of m is held constant while the forest is grown.

3. Each tree is grown to the largest extent possible. There is no pruning. (A short scikit-learn sketch after the Python code below shows how these steps map onto model parameters.)

For more details on this algorithm, on comparing it with decision trees, and on tuning model parameters, I would suggest you read these articles:

1. Introduction to Random Forest – Simplified

2. Comparing a CART model to Random Forest (Part 1)

3. Comparing a Random Forest to a CART model (Part 2)

4. Tuning the parameters of your Random Forest model

Python Code:

#Import Library

from sklearn.ensemble import RandomForestClassifier

#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset

# Create Random Forest object

model= RandomForestClassifier()

# Train the model using the training sets and check score

model.fit(X, y)

#Predict Output

predicted= model.predict(x_test)
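The snippet above keeps the data as placeholders. Below is a minimal hedged sketch on synthetic data showing how the tree-growing steps described earlier map onto scikit-learn parameters (the dataset and parameter values are illustrative assumptions):

# Runnable illustration on synthetic data; parameters mirror the three steps above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

model = RandomForestClassifier(
    n_estimators=100,       # number of trees in the forest
    bootstrap=True,         # step 1: each tree sees a bootstrap sample of N cases
    max_features='sqrt',    # step 2: m << M variables considered at each split
    max_depth=None,         # step 3: trees grown to the largest extent, no pruning
    random_state=0,
)
model.fit(X, y)

print(model.score(X, y))            # training accuracy
predicted = model.predict(X[:5])
print(predicted)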

R Code

library(randomForest)

x <- cbind(x_train,y_train)

# Fitting model

fit <- randomForest(y_train ~ ., data = x, ntree = 500)

summary(fit)

#Predict Output

predicted= predict(fit,x_test)

9. Dimensionality Reduction Algorithms

In the last 4-5 years, there has been an exponential increase in data capturing at every possible stage. Corporates, government agencies and research organisations are not only coming up with new sources but are also capturing data in great detail.

For example: e-commerce companies are capturing more details about customers, like their demographics, web crawling history, what they like or dislike, purchase history, feedback and many others, to give them personalized attention, more so than your nearest grocery shopkeeper.

As data scientists, the data we are offered also consists of many features. This sounds good for building a robust model, but there is a challenge. How would you identify highly significant variable(s) out of 1000 or 2000? In such cases, dimensionality reduction algorithms help us, along with various other methods like Decision Tree, Random Forest, PCA, Factor Analysis, identification based on the correlation matrix, missing value ratio, and others.

To know more about these algorithms, you can read "Beginners Guide To Learn Dimension Reduction Techniques".

Python Code:

#Import Library

from sklearn import decomposition

#Assumed you have training and test data set as train and test

# Create PCA object

pca = decomposition.PCA(n_components=k) # default value of k = min(n_samples, n_features)

# For Factor analysis

#fa= decomposition.FactorAnalysis()

# Reduced the dimension of training dataset using PCA

train_reduced = pca.fit_transform(train)

#Reduced the dimension of test dataset

test_reduced = pca.transform(test)

#For more detail on this, please refer to this link.
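As a minimal runnable sketch (the random data and the choice of 2 components are assumptions for illustration), PCA also reports how much variance each retained component explains, which helps decide how far to reduce:

# Illustrative PCA on random data; explained_variance_ratio_ guides how many components to keep.
import numpy as np
from sklearn import decomposition

rng = np.random.default_rng(0)
train = rng.normal(size=(100, 5))    # 100 samples, 5 features (synthetic)
test = rng.normal(size=(20, 5))

pca = decomposition.PCA(n_components=2)
train_reduced = pca.fit_transform(train)
test_reduced = pca.transform(test)

print(train_reduced.shape)               # (100, 2)
print(pca.explained_variance_ratio_)     # share of variance kept by each component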

R Code:

library(stats)

pca <- princomp(train, cor = TRUE)

train_reduced <-predict(pca,train)

test_reduced <-predict(pca,test)

10. Gradient Boosting & AdaBoost

GBM & AdaBoost are boosting algorithms used when we deal with plenty of data and want to make a prediction with high predictive power. Boosting is an ensemble learning technique which combines the predictions of several base estimators in order to improve robustness over a single estimator. It combines multiple weak or average predictors to build a strong predictor. These boosting algorithms always work well in data science competitions like Kaggle, AV Hackathon, and CrowdAnalytix.

More: Know about Gradient and AdaBoost in detail

Python Code

#Import Library

from sklearn.ensemble import GradientBoostingClassifier

#Assumed you have X (predictor) and Y (target) for the training data set and x_test (predictor) of the test dataset

# Create Gradient Boosting Classifier object

model= GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)

# Train the model using the training sets and check score

model.fit(X, y)

#Predict Output

predicted= model.predict(x_test)
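The snippet above covers GBM only; for completeness, here is a hedged AdaBoost sketch (the synthetic data and parameter values are illustrative assumptions):

# Illustrative AdaBoost classifier on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

model = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=0)
model.fit(X, y)

print(model.score(X, y))          # training accuracy
predicted = model.predict(X[:5])
print(predicted)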

R Code:

library(caret)

x <- cbind(x_train,y_train)

# Fitting model

fitControl <- trainControl(method = "repeatedcv", number = 4, repeats = 4)

fit <- train(y_train ~ ., data = x, method = "gbm", trControl = fitControl, verbose = FALSE)

predicted= predict(fit, x_test, type = "prob")[,2]

GradientBoostingClassifier and Random Forest are two different tree-ensemble classifiers (boosting versus bagging), and people often ask about the difference between these two algorithms.

End Notes

By now, I am sure you have an idea of the commonly used machine learning algorithms. My sole intention behind writing this article and providing the codes in R and Python is to get you started right away. If you are keen to master machine learning, start right away. Take up problems, develop a physical understanding of the process, apply these codes, and see the fun!

Did you find this article useful? Share your views and opinions in the comments section below.

If you like what you just read and want to continue your analytics learning, subscribe to our emails, follow us on Twitter, or like our Facebook page.
