Machine Learning Algorithm Basics 2 NOTES
1. Supervised Learning
How it works: This algorithm consists of a target / outcome variable (or dependent variable)
which is to be predicted from a given set of predictors (independent variables). Using this set
of variables, we generate a function that maps inputs to desired outputs. The training process
continues until the model achieves a desired level of accuracy on the training data. Examples of
Supervised Learning: Regression, Decision Tree, Random Forest, KNN, Logistic Regression etc.
2. Unsupervised Learning
How it works: In this algorithm, we do not have any target or outcome variable to predict /
estimate. It is used for clustering a population into different groups, which is widely done for
segmenting customers into different groups for specific interventions. Examples of Unsupervised
Learning: Apriori algorithm, K-means.
3. Reinforcement Learning
How it works: Using this algorithm, the machine is trained to make specific decisions. It works
this way: the machine is exposed to an environment where it trains itself continually using trial
and error. The machine learns from past experience and tries to capture the best possible
knowledge to make accurate business decisions. Example of Reinforcement Learning: Markov
Decision Process.
Here is the list of commonly used machine learning algorithms. These algorithms can be applied
to almost any data problem:
1. Linear Regression
2. Logistic Regression
3. Decision Tree
4. SVM
5. Naive Bayes
6. KNN
7. K-Means
8. Random Forest
9. Dimensionality Reduction Algorithms
10. Gradient Boosting algorithms
1. GBM
2. XGBoost
3. LightGBM
4. CatBoost
1. Linear Regression
It is used to estimate real values (cost of houses, number of calls, total sales etc.) based on
continuous variable(s). Here, we establish a relationship between the independent and dependent
variables by fitting a best fit line. This best fit line is known as the regression line and is
represented by the linear equation Y = a*X + b.
The best way to understand linear regression is to relive an experience from childhood. Let us
say you ask a child in fifth grade to arrange the people in his class in increasing order of weight,
without asking them their weights! What do you think the child will do? He or she would likely
look at (visually analyze) the height and build of people and arrange them using a combination
of these visible parameters. This is linear regression in real life! The child has actually figured
out that height and build are correlated with weight by a relationship, which looks like
the equation above.
In this equation:
Y – Dependent Variable
a – Slope
X – Independent variable
b – Intercept
The coefficients a and b are derived by minimizing the sum of the squared differences
between the data points and the regression line (the method of least squares).
For example, suppose we have identified the best fit line y = 0.2811x + 13.9 for some height (x)
and weight (y) data. Using this equation, we can predict the weight of a person, knowing their
height.
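As a rough sketch of this in code (with made-up height/weight pairs, since the original chart data is not reproduced here), NumPy's polyfit can estimate the least-squares line directly:

```python
import numpy as np

# Hypothetical height (cm) / weight (kg) pairs -- illustrative only,
# not the data behind the y = 0.2811x + 13.9 line mentioned above.
height = np.array([150, 155, 160, 165, 170, 175, 180])
weight = np.array([56, 58, 59, 61, 62, 63, 64])

# polyfit with degree 1 minimizes the sum of squared residuals and
# returns the slope a and intercept b of Y = a*X + b.
a, b = np.polyfit(height, weight, 1)
print(f"weight = {a:.4f} * height + {b:.2f}")

# Predict the weight of a person 172 cm tall.
print(a * 172 + b)
```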
Linear Regression is of mainly two types: Simple Linear Regression and Multiple Linear
Regression. Simple Linear Regression is characterized by one independent variable, and
Multiple Linear Regression (as the name suggests) is characterized by multiple (more than 1)
independent variables. While finding the best fit line, you can also fit a polynomial or curved
function; this is known as polynomial or curvilinear regression.
2. Logistic Regression
Don't get confused by its name! It is a classification algorithm, not a regression algorithm. It is
used to estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set
of independent variable(s). In simple words, it predicts the probability of occurrence of an event
by fitting data to a logit function. Hence, it is also known as logit regression. Since it predicts a
probability, its output values lie between 0 and 1 (as expected).
Let's say your friend gives you a puzzle to solve. There are only 2 outcome scenarios: either
you solve it or you don't. Now imagine that you are being given a wide range of puzzles / quizzes
in an attempt to understand which subjects you are good at. The outcome of this study would
be something like this: if you are given a trigonometry-based tenth-grade problem, you are 70%
likely to solve it. On the other hand, if it is a fifth-grade history question, the probability of getting
an answer is only 30%. This is what Logistic Regression provides you.
Coming to the math, the log odds of the outcome is modeled as a linear combination of the
predictor variables.
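In symbols, with p the probability of the event and b0, b1, ..., bk the coefficients:

odds = p / (1 - p)
ln(odds) = ln(p / (1 - p)) = b0 + b1*X1 + b2*X2 + ... + bk*Xk

Solving for p gives the S-shaped sigmoid curve p = 1 / (1 + e^-(b0 + b1*X1 + ... + bk*Xk)), which is why the output always lies between 0 and 1.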
Furthermore, there are many different steps that could be tried in order to improve the model,
such as including interaction terms, removing features, applying regularization techniques, or
using a non-linear model.
3. Decision Tree
This is one of my favorite algorithms and I use it quite frequently. It is a type of supervised
learning algorithm that is mostly used for classification problems. Surprisingly, it works for
both categorical and continuous dependent variables. In this algorithm, we split the population
into two or more homogeneous sets, based on the most significant attributes / independent
variables, so as to make the groups as distinct as possible. For more details, you can
read: Decision Tree Simplified.
4. SVM (Support Vector Machine)
This is a classification method in which we plot each data item as a point in n-dimensional space
(where n is the number of features) and look for the line (or hyperplane) that best splits the
classes. The best line is the one for which the two closest points, one from each class, are
farthest from the line; this line is our classifier. Then, depending on which side of the line new
testing data lands, that is the class we assign to it.
Think of the algorithm as a game of segregating balls of different colors into different rooms, where:
1. You can draw lines / planes at any angle (rather than just horizontal or vertical as in a classic game).
2. The objective of the game is to segregate balls of different colors in different rooms.
3. The balls are not moving.
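A minimal sketch with scikit-learn (the toy 2-D points are made up for illustration):

```python
from sklearn import svm

# Two classes of "balls" in 2-D space -- illustrative data only.
X = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

# A linear-kernel SVM finds the separating line with the widest margin,
# i.e. the line whose closest points (the support vectors) are farthest from it.
clf = svm.SVC(kernel="linear")
clf.fit(X, y)

# New points are classified by which side of the line they land on.
print(clf.predict([[2, 1], [7, 7]]))  # -> [0 1]
```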
5. Naive Bayes
Naive Bayes is a classification technique based on Bayes' theorem, with an assumption of
independence between predictors. The model is easy to build and particularly useful for very
large data sets. Along with its simplicity, Naive Bayes is known to sometimes outperform even
highly sophisticated classification methods.
Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and
P(x|c):
P(c|x) = P(x|c) * P(c) / P(x)
Here, P(c|x) is the posterior probability of class c given predictor x, P(c) is the prior probability
of the class, P(x|c) is the likelihood (the probability of the predictor given the class), and P(x) is
the prior probability of the predictor.
Step 1: Convert the data set into a frequency table.
Step 2: Create a Likelihood table by finding the probabilities, like Overcast probability = 0.29 and
probability of playing = 0.64.
Step 3: Now, use the Naive Bayes equation to calculate the posterior probability for each class.
The class with the highest posterior probability is the outcome of the prediction.
Problem: players will play if the weather is sunny. Is this statement correct?
We can solve it using the method discussed above: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny).
Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64.
Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher probability.
Naive Bayes uses a similar method to predict the probability of different classes based on various
attributes. This algorithm is mostly used in text classification and with problems having multiple
classes.
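The weather example above can be checked in a few lines of plain Python:

```python
# Counts taken from the 14-row play/weather table used in the text.
p_sunny_given_yes = 3 / 9   # P(Sunny | Yes)
p_yes = 9 / 14              # P(Yes)
p_sunny = 5 / 14            # P(Sunny)

# Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))  # 0.6 -> playing is the more likely outcome
```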
6. KNN (k-Nearest Neighbors)
It can be used for both classification and regression problems. However, it is more widely used
for classification problems in industry. K nearest neighbors is a simple algorithm that stores
all available cases and classifies new cases by a majority vote of its k neighbors. The case is
assigned to the class most common amongst its K nearest neighbors, as measured by a distance
function.
These distance functions can be Euclidean, Manhattan, Minkowski or Hamming distance. The
first three are used for continuous variables and the fourth one (Hamming) for categorical
variables. If K = 1, then the case is simply assigned to the class of its nearest neighbor. At times,
choosing K turns out to be a challenge while performing KNN modeling.
KNN can easily be mapped to our real lives. If you want to learn about a person of whom you
have no information, you might like to find out about their close friends and the circles they
move in, and gain access to their information that way!
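A minimal scikit-learn sketch, again with made-up data:

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy cases with two features each and two classes -- illustrative only.
X = [[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 9]]
y = [0, 0, 0, 1, 1, 1]

# k = 3 neighbors, Euclidean distance (the default metric).
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# The new case is assigned by majority vote of its 3 nearest neighbors.
print(knn.predict([[2, 1]]))  # -> [0]
```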
7. K-Means
It is a type of unsupervised algorithm which solves the clustering problem. Its procedure
follows a simple and easy way to classify a given data set through a certain number of clusters
(assume k clusters). Data points inside a cluster are homogeneous, and heterogeneous with
respect to other clusters.
Remember figuring out shapes from ink blots? K-means is somewhat similar to this activity. You
look at the shape and spread to decipher how many different clusters / populations are present!
How K-means forms clusters:
1. K-means picks k points for each cluster, known as centroids.
2. Each data point forms a cluster with the closest centroid, giving k clusters.
3. The centroid of each cluster is recomputed based on the existing cluster members.
4. Steps 2 and 3 are repeated with the new centroids until the assignments stop changing.
In K-means, each cluster has its own centroid. The sum of squared differences between the
centroid and the data points within a cluster constitutes the within-cluster sum of squares for
that cluster. When the values for all the clusters are added, the result is the total within-cluster
sum of squares for the cluster solution.
We know that as the number of clusters increases, this value keeps decreasing, but if you plot
the result you may see that the sum of squared distances decreases sharply up to some value of
k, and then much more slowly after that. This "elbow" indicates the optimum number of clusters.
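A short sketch of this elbow check, assuming synthetic blob data; scikit-learn's KMeans exposes the total within-cluster sum of squares as inertia_:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic 2-D blobs -- illustrative only.
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in (0, 5, 10)])

# Print the total within-cluster sum of squares for each k; plotted,
# the sharp-to-slow transition (the elbow) suggests the optimum k.
for k in range(1, 7):
    wss = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(wss, 1))
```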
8. Random Forest
Random Forest is a trademark term for an ensemble of decision trees. In a Random Forest,
we have a collection of decision trees (hence "Forest"). To classify a new object based on its
attributes, each tree gives a classification, and we say the tree "votes" for that class. The forest
chooses the classification having the most votes (over all the trees in the forest). Each tree is
planted and grown as follows:
1. If the number of cases in the training set is N, then a sample of N cases is taken at random,
but with replacement. This sample will be the training set for growing the tree.
2. If there are M input variables, a number m<<M is specified such that at each node, m
variables are selected at random out of the M and the best split on these m is used to
split the node. The value of m is held constant during the forest growing.
3. Each tree is grown to the largest extent possible. There is no pruning.
For more details on this algorithm, on how it compares with decision trees, and on tuning the
model parameters, see the source articles on Analytics Vidhya.
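A minimal scikit-learn sketch of the procedure above (the bundled iris data is used purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# n_estimators is the number of trees in the forest; max_features="sqrt"
# is a common choice for the m << M variables tried at each split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)
forest.fit(X, y)

# Each tree votes; the forest returns the majority class.
print(forest.predict(X[:3]))
```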
9. Dimensionality Reduction Algorithms
In the last 4-5 years, there has been an exponential increase in data capture at every possible
stage. Corporates, government agencies and research organisations are not only coming up with
new data sources, but are also capturing data in great detail.
For example: e-commerce companies are capturing more details about customers, like their
demographics, web crawling history, what they like or dislike, purchase history and feedback,
in order to give them personalized attention beyond what your nearest grocery shopkeeper can
offer.
As data scientists, the data we are offered also consists of many features. This sounds good for
building a robust model, but there is a challenge: how do you identify the highly significant
variable(s) out of 1000 or 2000? In such cases, dimensionality reduction algorithms help us,
along with techniques like decision trees, random forests, PCA, factor analysis, selection based
on the correlation matrix, the missing value ratio and others.
To know more about these algorithms, you can read the "Beginners Guide To Learn Dimension
Reduction Techniques".
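As one concrete example of dimensionality reduction, a PCA sketch with scikit-learn on random illustrative data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # 200 samples, 50 features -- illustrative only

# Keep just enough principal components to explain 90% of the variance.
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_[:5])  # variance carried by the top components
```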
10. Gradient Boosting Algorithms
10.1. GBM
GBM is a boosting algorithm used when we deal with plenty of data and want to make a
prediction with high predictive power. Boosting is an ensemble of learning algorithms
which combines the predictions of several base estimators in order to improve robustness over a
single estimator; it combines multiple weak or average predictors to build a strong predictor.
These boosting algorithms tend to do well in data science competitions like Kaggle, AV
Hackathon and CrowdAnalytix.
GradientBoostingClassifier and Random Forest are two different ensemble tree methods
(boosting versus bagging, respectively), and people often ask about the difference between
these two algorithms.
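A minimal sketch with scikit-learn's GradientBoostingClassifier on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each stage fits a small tree to the residual errors of the stages before it.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=0)
gbm.fit(X_tr, y_tr)
print(gbm.score(X_te, y_te))  # held-out accuracy
```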
10.2. XGBoost
Another classic gradient boosting algorithm that’s known to be the decisive choice between
winning and losing in some Kaggle competitions.
XGBoost has immensely high predictive power, which makes it a strong choice for accuracy,
as it possesses both a linear model solver and the tree learning algorithm. It is also reported to
be almost 10x faster than older gradient boosting implementations.
The support includes various objective functions, including regression, classification and
ranking.
One of the most interesting things about XGBoost is that it is also called a regularized
boosting technique; the built-in regularization helps to reduce overfitting. It also has support
for a range of languages such as Scala, Java, R, Python, Julia and C++.
It supports distributed training on many machines, including GCE, AWS, Azure and YARN
clusters. XGBoost can also be integrated with Spark, Flink and other dataflow systems, with
built-in cross-validation at each iteration of the boosting process.
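A minimal sketch, assuming the xgboost Python package is installed (pip install xgboost):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# reg_lambda (L2) and reg_alpha (L1) are the penalties behind the
# "regularized boosting" description above.
model = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=4,
                      reg_lambda=1.0)
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))
```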
10.3. LightGBM
LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is
designed to be distributed and efficient, with the following advantages:
1. Faster training speed and higher efficiency
2. Lower memory usage
3. Better accuracy
4. Support for parallel and GPU learning
5. Capable of handling large-scale data
The framework is a fast and high-performance gradient boosting one based on decision tree
algorithms, used for ranking, classification and many other machine learning tasks. It was
developed under the Distributed Machine Learning Toolkit Project of Microsoft.
Since LightGBM is based on decision tree algorithms, it splits the tree leaf-wise with the
best fit, whereas most other boosting algorithms split the tree depth-wise or level-wise. So
when growing on the same leaf, the leaf-wise algorithm can reduce more loss than the
level-wise algorithm, which often results in better accuracy for the same number of leaves.
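A minimal sketch, assuming the lightgbm Python package is installed (pip install lightgbm):

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# num_leaves bounds the leaf-wise growth described above.
model = LGBMClassifier(n_estimators=200, num_leaves=31, learning_rate=0.1)
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))
```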
10.4. Catboost
CatBoost is a recently open-sourced machine learning algorithm from Yandex. It can easily
integrate with deep learning frameworks like Google’s TensorFlow and Apple’s Core ML.
The best part about CatBoost is that it can give good results without the extensive data
training that other ML models require, and it can work on a variety of data formats without
undermining how robust it can be.
Make sure you handle missing data well before you proceed with the implementation.
CatBoost can automatically deal with categorical variables without throwing a type conversion
error, which helps you focus on tuning your model rather than on sorting out trivial errors.
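A minimal sketch, assuming the catboost Python package is installed (pip install catboost); the raw string column is passed straight in via cat_features, with no manual encoding:

```python
import pandas as pd
from catboost import CatBoostClassifier

# Tiny illustrative frame with one categorical and one numeric feature.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "green"],
    "size": [1, 2, 2, 3, 1, 3],
    "label": [0, 1, 0, 1, 1, 0],
})

model = CatBoostClassifier(iterations=50, verbose=False)
model.fit(df[["color", "size"]], df["label"], cat_features=["color"])
print(model.predict(df[["color", "size"]].head(2)))
```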
Reference:- https://www.analyticsvidhya.com/
Simple Linear Regression Tutorial for Machine Learning
x  y
1  1
2  3
4  3
3  2
5  5
The attribute x is the input variable and y is the output variable that we are trying to predict. If
we got more data, we would only have x values and we would be interested in predicting y
values.
Plotting x against y shows a roughly linear relationship. This is a good indication that using
linear regression might be appropriate for this little dataset.
When we have a single input attribute (x) and we want to use linear regression, this is called
simple linear regression.
If we had multiple input attributes (e.g. x1, x2, x3, etc.), this would be called multiple linear
regression. The procedure for simple linear regression is different and simpler than that for
multiple linear regression, so it is a good place to start.
In this section we are going to create a simple linear regression model from our training data,
then make predictions for our training data to get an idea of how well the model learned the
relationship in the data.
With simple linear regression we model our data with the line: y = B0 + B1 * x
This is a line where y is the output variable we want to predict, x is the input variable we know,
and B0 and B1 are coefficients that we need to estimate and that move the line around.
Technically, B0 is called the intercept because it determines where the line intercepts the y-
axis. In machine learning we can call this the bias, because it is added to offset all predictions
that we make. The B1 term is called the slope because it defines the slope of the line or how x
translates into a y value before we add our bias.
The goal is to find the best estimates for the coefficients to minimize the errors in predicting y
from x.
Simple regression is great, because rather than having to search for values by trial and error or
calculate them analytically using more advanced linear algebra, we can estimate them directly
from our data.
The slope B1 can be estimated as:
B1 = sum((xi - mean(x)) * (yi - mean(y))) / sum((xi - mean(x))^2)
Where mean() is the average value for the variable in our dataset. The xi and yi refer to the fact
that we need to repeat these calculations across all values in our dataset, and i refers to the i'th
value of x or y.
We can calculate B0 using B1 and some statistics from our dataset, as follows:
B0 = mean(y) - B1 * mean(x)
Not that bad right? We can calculate these right in our spreadsheet.
First we need to calculate the mean value of x and y. The mean is calculated as:
1/n * sum(x)
Where n is the number of values (5 in this case). You can use the AVERAGE() function in your
spreadsheet. Let’s calculate the mean value of our x and y variables:
mean(x) = 3
mean(y) = 2.8
Now we need to calculate the error of each variable from the mean. Let’s do this with x first:
x mean(x) x - mean(x)
1 3 -2
2 3 -1
4 3 1
3 3 0
5 3 2
y mean(y) y - mean(y)
1 2.8 -1.8
3 2.8 0.2
3 2.8 0.2
2 2.8 -0.8
5 2.8 2.2
We now have the parts for calculating the numerator. All we need to do is multiply the error for
each x with the error for each y and calculate the sum of these multiplications.
x - mean(x)  y - mean(y)  product
-2  -1.8  3.6
-1  0.2  -0.2
1  0.2  0.2
0  -0.8  0
2  2.2  4.4
Summing the product column gives the numerator: 8.
Now we need to calculate the bottom part of the equation for calculating B1, or the
denominator. This is calculated as the sum of the squared differences of each x value from the
mean.
We have already calculated the difference of each x value from the mean, all we need to do is
square each value and calculate the sum.
x - mean(x)  squared
-2  4
-1  1
1  1
0  0
2  4
The sum of the squared differences is 10 (the denominator), so:
B1 = 8 / 10
B1 = 0.8
We can now calculate B0:
B0 = mean(y) - B1 * mean(x)
or
B0 = 2.8 - 0.8 * 3
or
B0 = 0.4
Easy.
Making Predictions
We now have the coefficients for our simple linear regression equation.
y = B0 + B1 * x
or
y = 0.4 + 0.8 * x
Let’s try out the model by making predictions for our training data.
x y predicted y
1 1 1.2
2 3 2
4 3 3.6
3 2 2.8
5 5 4.4
We can plot these predictions as a line with our data. This gives us a visual idea of how well the
line models our data.
Estimating Error
We can calculate an error for our predictions, called the Root Mean Squared Error or RMSE.
RMSE = sqrt( sum( (pi - yi)^2 ) / n )
Where sqrt() is the square root function, p is the predicted value, y is the actual value, i is the
index of a specific instance and n is the number of predictions, because we must calculate the
error across all predicted values.
First we must calculate the difference between each model prediction and the actual y values.
predicted y  y  error
1.2 1 0.2
2 3 -1
3.6 3 0.6
2.8 2 0.8
4.4 5 -0.6
We can easily calculate the square of each of these error values (error*error or error^2):
error  squared error
0.2  0.04
-1  1
0.6  0.36
0.8  0.64
-0.6  0.36
The sum of these squared errors is 2.4; dividing by n (5) gives 0.48, and taking the square root
gives us:
RMSE = 0.693
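The whole worked example can be reproduced in a few lines of plain Python:

```python
from math import sqrt

x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]

mean_x = sum(x) / len(x)   # 3
mean_y = sum(y) / len(y)   # 2.8

# B1 = numerator / sum of squared x deviations = 8 / 10
b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
      / sum((xi - mean_x) ** 2 for xi in x))
b0 = mean_y - b1 * mean_x  # 0.4

predictions = [b0 + b1 * xi for xi in x]   # [1.2, 2.0, 3.6, 2.8, 4.4]
rmse = sqrt(sum((p - yi) ** 2 for p, yi in zip(predictions, y)) / len(y))

print(b0, b1, predictions, round(rmse, 3))  # rmse ~= 0.693
```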
Shortcut
Before we wrap up I want to show you a quick shortcut for calculating the coefficients.
Simple linear regression is the simplest form of regression and the most studied. There is a
shortcut that you can use to quickly estimate the values for B0 and B1.
Really it is a shortcut for calculating B1. The calculation of B1 can be re-written as:
B1 = corr(x, y) * stdev(y) / stdev(x)
Where corr(x, y) is the correlation between x and y, and stdev() is the calculation of the standard
deviation for a variable.
Correlation (also known as Pearson's correlation coefficient) is a measure of how related two
variables are, in the range of -1 to 1. A value of 1 indicates that the two variables are perfectly
positively correlated (they both move in the same direction), and a value of -1 indicates that they
are perfectly negatively correlated (when one moves up, the other moves down).
Standard deviation is a measure of how much on average the data is spread out from the mean.
You can use the function PEARSON() in your spreadsheet to calculate the correlation of x and y
as 0.852 (highly correlated) and the function STDEV() to calculate the standard deviation of x as
1.5811 and y as 1.4832.
B1 = 0.852 * 1.4832 / 1.5811 = 0.799
Close enough to the above value of 0.8. Note that we get 0.8 if we use the fuller precision in our
spreadsheet for the correlation and standard deviation equations.
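The shortcut can be checked in plain Python (statistics.correlation requires Python 3.10 or newer):

```python
from statistics import correlation, stdev

x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]

# B1 = corr(x, y) * stdev(y) / stdev(x)
b1 = correlation(x, y) * stdev(y) / stdev(x)
print(round(b1, 3))  # 0.8
```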
Summary
In this post you discovered how to implement simple linear regression step-by-step in a
spreadsheet. You learned how to estimate the coefficients B0 and B1 from the training data,
how to make predictions with the fitted line, and how to evaluate the model with RMSE.
After reading this section, you will have a much better understanding of the most popular
machine learning algorithms, particularly for supervised learning, and how they are related.
There are only a few main learning styles or learning models that an algorithm can have and
we’ll go through them here with a few examples of algorithms and problem types that they
suit.
This taxonomy or way of organizing machine learning algorithms is useful because it forces you
to think about the roles of the input data and the model preparation process and select one
that is the most appropriate for your problem in order to get the best result.
Let’s take a look at three different learning styles in machine learning algorithms:
1. Supervised Learning
Input data is called training data and has a known label or result such as spam/not-spam or a
stock price at a time.
A model is prepared through a training process in which it is required to make predictions and
is corrected when those predictions are wrong. The training process continues until the model
achieves a desired level of accuracy on the training data.
Example algorithms include Logistic Regression and the Back Propagation Neural Network.
2. Unsupervised Learning
Input data is not labeled and does not have a known result.
A model is prepared by deducing structures present in the input data. This may be to extract
general rules. It may be through a mathematical process to systematically reduce redundancy,
or it may be to organize data by similarity.
Example problems are clustering, dimensionality reduction and association rule learning.
3. Semi-Supervised Learning
Input data is a mixture of labeled and unlabeled examples. There is a desired prediction
problem, but the model must also learn the structures in the data in order to organize it as well
as make predictions.
Example algorithms are extensions to other flexible methods that make assumptions about
how to model the unlabeled data.
Overview
When crunching data to model business decisions, you are most typically using supervised and
unsupervised learning methods.
A hot topic at the moment is semi-supervised learning methods in areas such as image
classification where there are large datasets with very few labeled examples.
I think this is the most useful way to group algorithms and it is the approach we will use here.
This is a useful grouping method, but it is not perfect. There are still algorithms that could just
as easily fit into multiple categories, like Learning Vector Quantization, which is both a neural-
network-inspired method and an instance-based method. There are also categories that use
the same name to describe both the problem and the class of algorithm, such as Regression and
Clustering.
We could handle these cases by listing algorithms twice or by selecting the group that
subjectively is the “best” fit. I like this latter approach of not duplicating algorithms to keep
things simple.
In this section, I list many of the popular machine learning algorithms grouped the way I think is
the most intuitive. The list is not exhaustive in either the groups or the algorithms, but I think it
is representative and will be useful to you to get an idea of the lay of the land.
Please Note: There is a strong bias towards algorithms used for classification and regression,
the two most prevalent supervised machine learning problems you will encounter.
If you know of an algorithm or a group of algorithms not listed, put it in the comments and
share it with us. Let’s dive in.
Regression Algorithms
Regression is concerned with modeling the relationship between variables that is iteratively
refined using a measure of error in the predictions made by the model.
Regression methods are a workhorse of statistics and have been co-opted into statistical
machine learning. This may be confusing because we can use regression to refer to the class of
problem and the class of algorithm. Really, regression is a process.
Instance-based Algorithms
Such methods typically build up a database of example data and compare new data to the
database using a similarity measure in order to find the best match and make a prediction. For
this reason, instance-based methods are also called winner-take-all methods and memory-
based learning. Focus is put on the representation of the stored instances and similarity
measures used between instances.
Regularization Algorithms
An extension made to another method (typically regression methods) that penalizes models
based on their complexity, favoring simpler models that are also better at generalizing.
I have listed regularization algorithms separately here because they are popular, powerful and
generally simple modifications made to other methods.
Ridge Regression
Least Absolute Shrinkage and Selection Operator (LASSO)
Elastic Net
Least-Angle Regression (LARS)
Decision Tree Algorithms
Decisions fork in tree structures until a prediction decision is made for a given record. Decision
trees are trained on data for classification and regression problems. Decision trees are often
fast and accurate and a big favorite in machine learning.
Bayesian Algorithms
Bayesian methods are those that explicitly apply Bayes’ Theorem for problems such as
classification and regression.
Naive Bayes
Gaussian Naive Bayes
Multinomial Naive Bayes
Averaged One-Dependence Estimators (AODE)
Bayesian Belief Network (BBN)
Bayesian Network (BN)
Clustering Algorithms
Clustering, like regression, describes the class of problem and the class of methods.
Clustering methods are typically organized by their modeling approach, such as centroid-based
and hierarchical. All methods are concerned with using the inherent structures in the data to best
organize the data into groups of maximum commonality.
k-Means
k-Medians
Expectation Maximisation (EM)
Hierarchical Clustering
Association Rule Learning Algorithms
Association rule learning methods extract rules that best explain observed relationships
between variables in data. These rules can discover important and commercially useful associations in large
multidimensional datasets that can be exploited by an organization.
Apriori algorithm
Eclat algorithm
Artificial Neural Network Algorithms
Artificial Neural Networks are models that are inspired by the structure and/or function of
biological neural networks.
They are a class of pattern matching that are commonly used for regression and classification
problems but are really an enormous subfield comprised of hundreds of algorithms and
variations for all manner of problem types.
Note that I have separated out Deep Learning from neural networks because of the massive
growth and popularity in the field. Here we are concerned with the more classical methods.
Perceptron
Back-Propagation
Hopfield Network
Radial Basis Function Network (RBFN)
Deep Learning Algorithms
Deep Learning methods are a modern update to Artificial Neural Networks that exploit
abundant cheap computation.
They are concerned with building much larger and more complex neural networks and, as
commented on above, many methods are concerned with semi-supervised learning problems
where large datasets contain very little labeled data.
Dimensionality Reduction Algorithms
Like clustering methods, dimensionality reduction seeks and exploits the inherent structure in
the data, but in an unsupervised manner, to summarize or describe data using less information.
This can be useful to visualize high-dimensional data or to simplify data which can then be used in a
supervised learning method. Many of these methods can be adapted for use in classification
and regression.
Ensemble Algorithms
Ensemble methods are models composed of multiple weaker
models that are independently trained and whose predictions are combined in some way to
make the overall prediction.
Much effort is put into what types of weak learners to combine and the ways in which to
combine them. This is a very powerful class of techniques and as such is very popular.
Boosting
Bootstrapped Aggregation (Bagging)
AdaBoost
Stacked Generalization (blending)
Gradient Boosting Machines (GBM)
Gradient Boosted Regression Trees (GBRT)
Random Forest
Other Algorithms
Many algorithms were not covered.
For example, what group would Support Vector Machines go into? Its own?
I did not cover algorithms from specialty tasks in the process of machine learning, such as
feature selection, algorithm accuracy evaluation and performance measures.
A curated list of awesome machine learning frameworks, libraries and software (by language).
Inspired by awesome-php.
If you want to contribute to this list (please do), send me a pull request or contact
me @josephmisiti. Also, a listed repository should be deprecated if:
o Repository's owner explicitly says that "this library is not maintained".
o Not committed to for a long time (2-3 years).
Further resources:
For a list of free machine learning books available for download, go here.
For a list of (mostly) free machine learning courses available online, go here.
Table of Contents
APL
o General-Purpose Machine Learning
C
o General-Purpose Machine Learning
o Computer Vision
C++
o Computer Vision
o General-Purpose Machine Learning
o Natural Language Processing
o Sequence Analysis
o Gesture Recognition
Common Lisp
o General-Purpose Machine Learning
Clojure
o Natural Language Processing
o General-Purpose Machine Learning
o Data Analysis / Data Visualization
Crystal
o General-Purpose Machine Learning
Elixir
o General-Purpose Machine Learning
o Natural Language Processing
Erlang
o General-Purpose Machine Learning
Go
o Natural Language Processing
o General-Purpose Machine Learning
o Data Analysis / Data Visualization
Haskell
o General-Purpose Machine Learning
Java
o Natural Language Processing
o General-Purpose Machine Learning
o Data Analysis / Data Visualization
o Deep Learning
Javascript
o Natural Language Processing
o Data Analysis / Data Visualization
o General-Purpose Machine Learning
o Misc
Julia
o General-Purpose Machine Learning
o Natural Language Processing
o Data Analysis / Data Visualization
o Misc Stuff / Presentations
Lua
o General-Purpose Machine Learning
o Demos and Scripts
Matlab
o Computer Vision
o Natural Language Processing
o General-Purpose Machine Learning
o Data Analysis / Data Visualization
.NET
o Computer Vision
o Natural Language Processing
o General-Purpose Machine Learning
o Data Analysis / Data Visualization
Objective C
o General-Purpose Machine Learning
OCaml
o General-Purpose Machine Learning
Perl
o Data Analysis / Data Visualization
o General-Purpose Machine Learning
Perl 6
PHP
o Natural Language Processing
o General-Purpose Machine Learning
Python
o Computer Vision
o Natural Language Processing
o General-Purpose Machine Learning
o Data Analysis / Data Visualization
o Misc Scripts / iPython Notebooks / Codebases
o Kaggle Competition Source Code
o Neural Networks
o Reinforcement Learning
Ruby
o Natural Language Processing
o General-Purpose Machine Learning
o Data Analysis / Data Visualization
o Misc
Rust
o General-Purpose Machine Learning
R
o General-Purpose Machine Learning
o Data Analysis / Data Visualization
SAS
o General-Purpose Machine Learning
o Data Analysis / Data Visualization
o High Performance Machine Learning (MPP)
o Natural Language Processing
o Demos and Scripts
Scala
o Natural Language Processing
o Data Analysis / Data Visualization
o General-Purpose Machine Learning
Swift
o General-Purpose Machine Learning
TensorFlow
o General-Purpose Machine Learning
Credits
C
General-Purpose Machine Learning
Darknet - Darknet is an open source neural network framework written in C and CUDA.
It is fast, easy to install, and supports CPU and GPU computation.
Recommender - A C library for product recommendations/suggestions using
collaborative filtering (CF).
Hybrid Recommender System - A hybrid recommender system based upon scikit-learn
algorithms.
Computer Vision
Speech Recognition
HTK -The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and
manipulating hidden Markov models.
C++
Computer Vision
DLib - DLib has C++ and Python interfaces for face detection and training general object
detectors.
EBLearn - Eblearn is an object-oriented C++ library that implements various machine
learning models
OpenCV - OpenCV has C++, C, Python, Java and MATLAB interfaces and supports
Windows, Linux, Android and Mac OS.
VIGRA - VIGRA is a generic cross-platform C++ computer vision and machine learning
library for volumes of arbitrary dimensionality with Python bindings.
Natural Language Processing
BLLIP Parser - BLLIP Natural Language Parser (also known as the Charniak-Johnson parser)
colibri-core - C++ library, command line tools, and Python binding for extracting and
working with basic linguistic constructions such as n-grams and skipgrams in a quick and
memory-efficient way.
CRF++ - Open source implementation of Conditional Random Fields (CRFs) for
segmenting/labeling sequential data & other Natural Language Processing tasks.
CRFsuite - CRFsuite is an implementation of Conditional Random Fields (CRFs) for
labeling sequential data.
frog - Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser,
dependency parser, NER, shallow parser, morphological analyzer.
libfolia - C++ library for the FoLiA format
MeTA - MeTA : ModErn Text Analysis is a C++ Data Sciences Toolkit that facilitates
mining big text data.
MIT Information Extraction Toolkit - C, C++, and Python tools for named entity
recognition and relation extraction
ucto - Unicode-aware regular-expression based tokenizer for various languages. Tool
and C++ library. Supports FoLiA format.
Speech Recognition
Kaldi - Kaldi is a toolkit for speech recognition written in C++ and licensed under the
Apache License v2.0. Kaldi is intended for use by speech recognition researchers.
Sequence Analysis
Gesture Detection
Common Lisp
Clojure
Natural Language Processing
Crystal
Elixir
General-Purpose Machine Learning
Erlang
Go
Haskell
Java
Speech Recognition
CMU Sphinx - Open Source Toolkit For Speech Recognition purely based on Java speech
recognition library.
Data Analysis / Data Visualization
Flink - Open source platform for distributed stream and batch data processing.
Hadoop - Hadoop/HDFS
Onyx - Distributed, masterless, high performance, fault tolerant data processing.
Written entirely in Clojure.
Spark - Spark is a fast and general engine for large-scale data processing.
Storm - Storm is a distributed realtime computation system.
Impala - Real-time Query for Hadoop
DataMelt - Mathematics software for numeric computation, statistics, symbolic
calculations, data analysis and data visualization.
Dr. Michael Thomas Flanagan's Java Scientific Library
Deep Learning
Javascript
Data Analysis / Data Visualization
D3.js
High Charts
NVD3.js
dc.js
chartjs
dimple
amCharts
D3xter - Straight forward plotting built on D3
statkit - Statistics kit for JavaScript
datakit - A lightweight framework for data analysis in JavaScript
science.js - Scientific and statistical computing in JavaScript.
Z3d - Easily make interactive 3d plots built on Three.js
Sigma.js - JavaScript library dedicated to graph drawing.
C3.js - customizable library based on D3.js for easy chart drawing.
Datamaps - Customizable SVG map/geo visualizations using D3.js.
ZingChart - library written on Vanilla JS for big data visualization.
cheminfo - Platform for data visualization and analysis, using the visualizer project.
Learn JS Data
AnyChart
FusionCharts
Misc
Julia
General-Purpose Machine Learning
Lua
Torch7
o cephes - Cephes mathematical functions library, wrapped for Torch. Provides
and wraps the 180+ special mathematical functions from the Cephes
mathematical library, developed by Stephen L. Moshier. It is used, among many
other places, at the heart of SciPy.
o autograd - Autograd automatically differentiates native Torch code. Inspired by
the original Python version.
o graph - Graph package for Torch
o randomkit - Numpy's randomkit, wrapped for Torch
o signal - A signal processing toolbox for Torch-7. FFT, DCT, Hilbert, cepstrums, stft
o nn - Neural Network package for Torch
o torchnet - framework for torch which provides a set of abstractions aiming at
encouraging code re-use as well as encouraging modular programming
o nngraph - This package provides graphical computation for nn library in Torch7.
o nnx - A completely unstable and experimental package that extends Torch's
builtin nn library
o rnn - A Recurrent Neural Network library that extends Torch's nn. RNNs, LSTMs,
GRUs, BRNNs, BLSTMs, etc.
o dpnn - Many useful features that aren't part of the main nn package.
o dp - A deep learning library designed for streamlining research and development
using the Torch7 distribution. It emphasizes flexibility through the elegant use of
object-oriented design patterns.
o optim - An optimization library for Torch. SGD, Adagrad, Conjugate-Gradient,
LBFGS, RProp and more.
o unsup - A package for unsupervised learning in Torch. Provides modules that are
compatible with nn (LinearPsd, ConvPsd, AutoEncoder, ...), and self-contained
algorithms (k-means, PCA).
o manifold - A package to manipulate manifolds
o svm - Torch-SVM library
o lbfgs - FFI Wrapper for liblbfgs
o vowpalwabbit - An old vowpalwabbit interface to torch.
o OpenGM - OpenGM is a C++ library for graphical modeling, and inference. The
Lua bindings provide a simple way of describing graphs, from Lua, and then
optimizing them with OpenGM.
o sphagetti - Spaghetti (sparse linear) module for torch7 by @MichaelMathieu
o LuaSHKit - A lua wrapper around the Locality sensitive hashing library SHKit
o kernel smoothing - KNN, kernel-weighted average, local linear regression
smoothers
o cutorch - Torch CUDA Implementation
o cunn - Torch CUDA Neural Network Implementation
o imgraph - An image/graph library for Torch. This package provides routines to
construct graphs on images, segment them, build trees out of them, and convert
them back to images.
o videograph - A video/graph library for Torch. This package provides routines to
construct graphs on videos, segment them, build trees out of them, and convert
them back to videos.
o saliency - code and tools around integral images. A library for finding interest
points based on fast integral histograms.
o stitch - allows us to use hugin to stitch images and apply same stitching to a
video sequence
o sfm - A bundle adjustment/structure from motion package
o fex - A package for feature extraction in Torch. Provides SIFT and dSIFT modules.
o OverFeat - A state-of-the-art generic dense feature extractor
Numeric Lua
Lunatic Python
SciLua
Lua - Numerical Algorithms
Lunum
Matlab
Computer Vision
Contourlets - MATLAB source code that implements the contourlet transform and its
utility functions.
Shearlets - MATLAB code for shearlet transform
Curvelets - The Curvelet transform is a higher dimensional generalization of the Wavelet
transform designed to represent images at different scales and different angles.
Bandlets - MATLAB code for bandlet transform
mexopencv - Collection and a development kit of MATLAB mex functions for OpenCV
library
.NET
Computer Vision
OpenCVDotNet - A wrapper for the OpenCV project to be used with .NET applications.
Emgu CV - Cross platform wrapper of OpenCV which can be compiled in Mono to be run
on Windows, Linux, Mac OS X, iOS, and Android.
AForge.NET - Open source C# framework for developers and researchers in the fields of
Computer Vision and Artificial Intelligence. Development has now shifted to GitHub.
Accord.NET - Together with AForge.NET, this library can provide image processing and
computer vision algorithms to Windows, Windows RT and Windows Phone. Some
components are also available for Java and Android.
Natural Language Processing
Stanford.NLP for .NET - A full port of Stanford NLP packages to .NET, also available
precompiled as a NuGet package.
General-Purpose Machine Learning
numl - numl is a machine learning library intended to ease the use of standard
modeling techniques for both prediction and clustering.
Math.NET Numerics - Numerical foundation of the Math.NET project, aiming to provide
methods and algorithms for numerical computations in science, engineering and every
day use. Supports .Net 4.0, .Net 3.5 and Mono on Windows, Linux and Mac; Silverlight 5,
WindowsPhone/SL 8, WindowsPhone 8.1 and Windows 8 with PCL Portable Profiles 47
and 344; Android/iOS with Xamarin.
Sho - Sho is an interactive environment for data analysis and scientific computing that
lets you seamlessly connect scripts (in IronPython) with compiled code (in .NET) to
enable fast and flexible prototyping. The environment includes powerful and efficient
libraries for linear algebra as well as data visualization that can be used from any .NET
language, as well as a feature-rich interactive shell for rapid development.
Objective C
General-Purpose Machine Learning
YCML - A Machine Learning framework for Objective-C and Swift (OS X / iOS).
MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X.
MLPNeuralNet predicts new examples by trained neural network. It is built on top of the
Apple's Accelerate Framework, using vectorized operations and hardware acceleration if
available.
MAChineLearning - An Objective-C multilayer perceptron library, with full support for
training through backpropagation. Implemented using vDSP and vecLib, it's 20 times
faster than its Java equivalent. Includes sample code for use from Swift.
BPN-NeuralNetwork - It implemented 3 layers neural network ( Input Layer, Hidden
Layer and Output Layer ) and it named Back Propagation Neural Network (BPN). This
network can be used in products recommendation, user behavior analysis, data mining
and data analysis.
Multi-Perceptron-NeuralNetwork - it implemented multi-perceptrons neural network
(ニューラルネットワーク) based on Back Propagation Neural Network (BPN) and
designed unlimited-hidden-layers.
KRHebbian-Algorithm - It is a non-supervisor and self-learning algorithm (adjust the
weights) in neural network of Machine Learning.
KRKmeans-Algorithm - It implemented K-Means the clustering and classification
algorithm. It could be used in data mining and image compression.
KRFuzzyCMeans-Algorithm - It implemented Fuzzy C-Means (FCM) the fuzzy clustering /
classification algorithm on Machine Learning. It could be used in data mining and image
compression.
OCaml
Perl
Perl Data Language, a pluggable architecture for data and image processing, which can
be used for machine learning.
General-Purpose Machine Learning
Perl 6
Perl Data Language, a pluggable architecture for data and image processing, which can
be used for machine learning.
PHP
PHP-ML - Machine Learning library for PHP. Algorithms, Cross Validation, Neural
Network, Preprocessing, Feature Extraction and much more in one library.
PredictionBuilder - A library for machine learning that builds predictions using a linear
regression.
Python
Computer Vision
Natural Language Processing
NLTK - A leading platform for building Python programs to work with human language
data.
Pattern - A web mining module for the Python programming language. It has tools for
natural language processing, machine learning, among others.
Quepy - A python framework to transform natural language questions to queries in a
database query language
TextBlob - Providing a consistent API for diving into common natural language
processing (NLP) tasks. Stands on the giant shoulders of NLTK and Pattern, and plays
nicely with both.
YAlign - A sentence aligner, a friendly tool for extracting parallel sentences from
comparable corpora.
jieba - Chinese Words Segmentation Utilities.
SnowNLP - A library for processing Chinese text.
spammy - A library for email Spam filtering built on top of nltk
loso - Another Chinese segmentation library.
genius - A Chinese segment base on Conditional Random Field.
KoNLPy - A Python package for Korean natural language processing.
nut - Natural language Understanding Toolkit
Rosetta - Text processing tools and wrappers (e.g. Vowpal Wabbit)
BLLIP Parser - Python bindings for the BLLIP Natural Language Parser (also known as the
Charniak-Johnson parser)
PyNLPl - Python Natural Language Processing Library. General purpose NLP library for
Python. Also contains some specific modules for parsing common NLP formats, most
notably for FoLiA, but also ARPA language models, Moses phrasetables, GIZA++
alignments.
python-ucto - Python binding to ucto (a unicode-aware rule-based tokenizer for various
languages)
python-frog - Python binding to Frog, an NLP suite for Dutch. (pos tagging,
lemmatisation, dependency parsing, NER)
python-zpar - Python bindings for ZPar, a statistical part-of-speech-tagger, constiuency
parser, and dependency parser for English.
colibri-core - Python binding to C++ library for extracting and working with with basic
linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient
way.
spaCy - Industrial strength NLP with Python and Cython.
PyStanfordDependencies - Python interface for converting Penn Treebank trees to
Stanford Dependencies.
Distance - Levenshtein and Hamming distance computation
Fuzzy Wuzzy - Fuzzy String Matching in Python
jellyfish - a python library for doing approximate and phonetic matching of strings.
editdistance - fast implementation of edit distance
textacy - higher-level NLP built on Spacy
stanford-corenlp-python - Python wrapper for Stanford CoreNLP
CLTK - The Classical Language Toolkit
rasa_nlu - turn natural language into structured data
yase - Transcode sentence (or other sequence) to list of word vector
Polyglot - Multilingual text (NLP) processing toolkit
DrQA - Reading Wikipedia to answer open-domain questions
General-Purpose Machine Learning
auto_ml - Automated machine learning for production and analytics. Lets you focus on
the fun parts of ML, while outputting production-ready code, and detailed analytics of
your dataset and results. Includes support for NLP, XGBoost, LightGBM, and soon, deep
learning.
machine learning - automated build consisting of a web-interface, and set
of programmatic-interface API, for support vector machines. Corresponding dataset(s)
are stored into a SQL database, then generated model(s) used for prediction(s), are
stored into a NoSQL datastore.
XGBoost - Python bindings for eXtreme Gradient Boosting (Tree) Library
Bayesian Methods for Hackers - Book/iPython notebooks on Probabilistic Programming
in Python
Featureforge - A set of tools for creating and testing machine learning features, with a
scikit-learn compatible API
MLlib in Apache Spark - Distributed machine learning library in Spark
Hydrosphere Mist - a service for deployment Apache Spark MLLib machine learning
models as realtime, batch or reactive web services.
scikit-learn - A Python module for machine learning built on top of SciPy.
metric-learn - A Python module for metric learning.
SimpleAI - Python implementation of many of the artificial intelligence algorithms
described on the book "Artificial Intelligence, a Modern Approach". It focuses on
providing an easy to use, well documented and tested library.
astroML - Machine Learning and Data Mining for Astronomy.
graphlab-create - A library with various machine learning models (regression, clustering,
recommender systems, graph analytics, etc.) implemented on top of a disk-backed
DataFrame.
BigML - A library that contacts external servers.
pattern - Web mining module for Python.
NuPIC - Numenta Platform for Intelligent Computing.
Pylearn2 - A Machine Learning library based on Theano.
keras - Modular neural network library based on Theano.
Lasagne - Lightweight library to build and train neural networks in Theano.
hebel - GPU-Accelerated Deep Learning Library in Python.
Chainer - Flexible neural network framework
prophet - Fast and automated time series forecasting framework by Facebook.
gensim - Topic Modelling for Humans.
topik - Topic modelling toolkit
PyBrain - Another Python Machine Learning Library.
Brainstorm - Fast, flexible and fun neural networks. This is the successor of PyBrain.
Crab - A flexible, fast recommender engine.
python-recsys - A Python library for implementing a Recommender System.
thinking bayes - Book on Bayesian Analysis
Image-to-Image Translation with Conditional Adversarial Networks - Implementation of
image to image (pix2pix) translation from the paper by isola et al.[DEEP LEARNING]
Restricted Boltzmann Machines -Restricted Boltzmann Machines in Python. [DEEP
LEARNING]
Bolt - Bolt Online Learning Toolbox
CoverTree - Python implementation of cover trees, near-drop-in replacement for
scipy.spatial.kdtree
nilearn - Machine learning for NeuroImaging in Python
neuropredict - Aimed at novice machine learners and non-expert programmers, this
package offers easy (no coding needed) and comprehensive machine learning
(evaluation and full report of predictive performance WITHOUT requiring you to code)
in Python for NeuroImaging and any other type of features. This is aimed at absorbing
the much of the ML workflow, unlike other packages like nilearn and pymvpa, which
require you to learn their API and code to produce anything useful.
imbalanced-learn - Python module to perform under sampling and over sampling with
various techniques.
Shogun - The Shogun Machine Learning Toolbox
Pyevolve - Genetic algorithm framework.
Caffe - A deep learning framework developed with cleanliness, readability, and speed in
mind.
breze - Theano based library for deep and recurrent neural networks
pyhsmm - library for approximate unsupervised inference in Bayesian Hidden Markov
Models (HMMs) and explicit-duration Hidden semi-Markov Models (HSMMs), focusing
on the Bayesian Nonparametric extensions, the HDP-HMM and HDP-HSMM, mostly with
weak-limit approximations.
mrjob - A library to let Python program run on Hadoop.
SKLL - A wrapper around scikit-learn that makes it simpler to conduct experiments.
neurolab - https://github.com/zueve/neurolab
Spearmint - Spearmint is a package to perform Bayesian optimization according to the
algorithms outlined in the paper: Practical Bayesian Optimization of Machine Learning
Algorithms. Jasper Snoek, Hugo Larochelle and Ryan P. Adams. Advances in Neural
Information Processing Systems, 2012.
Pebl - Python Environment for Bayesian Learning
Theano - Optimizing GPU-meta-programming code generating array oriented optimizing
math compiler in Python
TensorFlow - Open source software library for numerical computation using data flow
graphs
yahmm - Hidden Markov Models for Python, implemented in Cython for speed and
efficiency.
python-timbl - A Python extension module wrapping the full TiMBL C++ programming
interface. Timbl is an elaborate k-Nearest Neighbours machine learning toolkit.
deap - Evolutionary algorithm framework.
pydeep - Deep Learning In Python
mlxtend - A library consisting of useful tools for data science and machine learning tasks.
neon - Nervana's high-performance Python-based Deep Learning framework [DEEP
LEARNING]
Optunity - A library dedicated to automated hyperparameter optimization with a simple,
lightweight API to facilitate drop-in replacement of grid search.
Neural Networks and Deep Learning - Code samples for my book "Neural Networks and
Deep Learning" [DEEP LEARNING]
Annoy - Approximate nearest neighbours implementation
skflow - Simplified interface for TensorFlow, mimicking Scikit Learn.
TPOT - Tool that automatically creates and optimizes machine learning pipelines using
genetic programming. Consider it your personal data science assistant, automating a
tedious part of machine learning.
pgmpy - A python library for working with Probabilistic Graphical Models.
DIGITS - The Deep Learning GPU Training System (DIGITS) is a web application for
training deep learning models.
Orange - Open source data visualization and data analysis for novices and experts.
MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with
Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Go, Javascript
and more.
milk - Machine learning toolkit focused on supervised classification.
TFLearn - Deep learning library featuring a higher-level API for TensorFlow.
REP - an IPython-based environment for conducting data-driven research in a consistent
and reproducible way. REP is not trying to substitute scikit-learn, but extends it and
provides better user experience.
rgf_python - Python bindings for Regularized Greedy Forest (Tree) Library.
skbayes - Python package for Bayesian Machine Learning with scikit-learn API
fuku-ml - Simple machine learning library, including Perceptron, Regression, Support
Vector Machine, Decision Tree and more, it's easy to use and easy to learn for
beginners.
Xcessiv - A web-based application for quick, scalable, and automated hyperparameter
tuning and stacked ensembling
PyTorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration
ML-From-Scratch - Implementations of Machine Learning models from scratch in Python
with a focus on transparency. Aims to showcase the nuts and bolts of ML in an
accessible way.
Neural Networks
Reinforcement Learning
Ruby
Natural Language Processing
Awesome NLP with Ruby - Curated link list for practical natural language processing in
Ruby.
Treat - Text REtrieval and Annotation Toolkit, definitely the most comprehensive toolkit
I’ve encountered so far for Ruby
Ruby Linguistics - Linguistics is a framework for building linguistic utilities for Ruby
objects in any language. It includes a generic language-independent front end, a module
for mapping language codes into language names, and a module which contains various
English-language utilities.
Stemmer - Expose libstemmer_c to Ruby
Ruby Wordnet - This library is a Ruby interface to WordNet
Raspel - raspell is an interface binding for ruby
UEA Stemmer - Ruby port of UEALite Stemmer - a conservative stemmer for search and
indexing
Twitter-text-rb - A library that does auto linking and extraction of usernames, lists and
hashtags in tweets
Awesome Machine Learning with Ruby - Curated list of ML related resources for Ruby
Ruby Machine Learning - Some Machine Learning algorithms, implemented in Ruby
Machine Learning Ruby
jRuby Mahout - JRuby Mahout is a gem that unleashes the power of Apache Mahout in
the world of JRuby.
CardMagic-Classifier - A general classifier module to allow Bayesian and other types of
classifications.
rb-libsvm - Ruby language bindings for LIBSVM which is a Library for Support Vector
Machines
Random Forester - Creates Random Forest classifiers from PMML files
Misc
Rust
SAS
General-Purpose Machine Learning
Enterprise Miner - Data mining and machine learning that creates deployable models
using a GUI or code.
Factory Miner - Automatically creates deployable machine learning models across
numerous market or customer segments using a GUI.
Data Analysis / Data Visualization
High Performance Data Mining - Data mining and machine learning that creates
deployable models using a GUI or code in an MPP environment, including Hadoop.
High Performance Text Mining - Text mining using a GUI or code in an MPP
environment, including Hadoop.
Scala
Swift
Bender - Fast Neural Networks framework built on top of Metal. Supports TensorFlow
models.
Swift AI - Highly optimized artificial intelligence and machine learning library written in
Swift.
BrainCore - The iOS and OS X neural network framework
swix - A bare bones library that includes a general matrix language and wraps some
OpenCV for iOS development.
DeepLearningKit - an Open Source Deep Learning Framework for Apple's iOS, OS X and
tvOS. It currently allows using deep convolutional neural network models trained in
Caffe on Apple operating systems.
AIToolbox - A toolbox framework of AI modules written in Swift: Graphs/Trees, Linear
Regression, Support Vector Machines, Neural Networks, PCA, KMeans, Genetic
Algorithms, MDP, Mixture of Gaussians.
MLKit - A simple Machine Learning Framework written in Swift. Currently features
Simple Linear Regression, Polynomial Regression, and Ridge Regression.
Swift Brain - The first neural network / machine learning library written in Swift. This is a
project for AI algorithms in Swift for iOS and OS X development. This project includes
algorithms focused on Bayes theorem, neural networks, SVMs, Matrices, etc..
Perfect TensorFlow - Swift Language Bindings of TensorFlow. Using native TensorFlow
models on both macOS / Linux.
Awesome CoreML - A curated list of pretrained CoreML models
Awesome Core ML Models - A curated list of machine learning models in CoreML
format.
TensorFlow