MLT - Module 3
Bayesian learning
• Bayes' theorem is also known as Bayes' rule or Bayes' law. It helps determine the probability of an event under uncertain knowledge: it is used to calculate the probability of one event occurring given that another event has already occurred, and it relates conditional probability to marginal probability.
• Bayes' theorem is one of the most popular concepts in machine learning because it lets us compute the probability of one event, under uncertain knowledge, once another event is known to have occurred.
• Bayes' theorem can be derived from the product rule and the conditional probability of event X given a known event Y:
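The standard derivation, using the same events X and Y, applies the product rule to the joint probability in both orders and then divides by P(Y):

```latex
% Product rule written in both orders for the joint probability:
\[ P(X \cap Y) = P(X \mid Y)\,P(Y) = P(Y \mid X)\,P(X) \]
% Dividing by P(Y) (assuming P(Y) > 0) gives Bayes' theorem:
\[ P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)} \]
```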
K-Nearest Neighbor (KNN) Algorithm
• K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
• The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
• The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using K-NN.
• K-NN can be used for Regression as well as Classification, but it is mostly used for Classification problems.
• K-NN is a non-parametric algorithm, which means it makes no assumption about the underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on it.
• At the training phase the KNN algorithm just stores the dataset; when it gets new data, it classifies that data into the category most similar to it (a minimal sketch follows after this list).
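A minimal sketch of the K-NN procedure using only NumPy; the toy dataset, the value k = 3, and the Euclidean distance are illustrative assumptions, not prescribed by the notes:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query point to every stored training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority class label among those neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny illustrative dataset (assumed, not from the notes)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0
```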
Minimum Description Length Principle
• The minimum description length (MDL) principle is a
powerful method of inductive inference, the basis of
statistical modeling, pattern recognition, and machine
learning. It holds that the best explanation, given a
limited set of observed data, is the one that permits the
greatest compression of the data.
• MDL methods are particularly well-suited for dealing
with model selection, prediction, and estimation
problems in situations where the models under
consideration can be arbitrarily complex, and overfitting
the data is a serious concern.
• Minimum Description Length (MDL) is a model
selection principle where the shortest description of the data is
the best model. MDL methods learn through a data
compression perspective and are sometimes described as
mathematical applications of Occam's razor.
• The MDL principle can be extended to other forms of inductive
inference and learning, for example to estimation and
sequential prediction, without explicitly identifying a single
model of the data.
• MDL has its origins mostly in information theory and has been further developed within the
general fields of statistics, theoretical computer science and machine learning, and more
narrowly computational learning theory.
• Selecting the minimum length description of the available data as the best model observes
the principle identified as Occam's razor. Prior to the advent of computer programming,
generating such descriptions was the intellectual labor of scientific theorists. It was far less
formal than it has become in the computer age. If two scientists had a theoretic disagreement,
they rarely could formally apply Occam's razor to choose between their theories. They would
have different data sets and possibly different descriptive languages. Nevertheless, science
advanced as Occam's razor was an informal guide in deciding which model was best.
• With the advent of formal languages and computer programming Occam's razor was
mathematically defined. Models of a given set of observations, encoded as bits of data, could
be created in the form of computer programs that output that data. Occam's razor could
then formally select the shortest program, measured in bits of this algorithmic information, as
the best model.
• MDL applies in machine learning when algorithms (machines)
generate descriptions. Learning occurs when an algorithm
generates a shorter description of the same data set.
• The theoretic minimum description length of a data set, called
its Kolmogorov complexity, cannot, however, be computed. That
is to say, even if by random chance an algorithm generates the
shortest program of all that outputs the data set, an automated
theorem prover cannot prove there is no shorter such program.
Nevertheless, given two programs that output the dataset, the
MDL principle selects the shorter of the two as embodying the
best model.
• Minimum description length (MDL), which originated
from algorithmic coding theory in computer science,
regards both model and data as codes. The idea is that
any data set can be appropriately encoded with the help
of a model, and that the more we compress the data by
extracting redundancy from it, the more we uncover
underlying regularities in the data.
• Thus, code length is directly related to the
generalization capability of the model, where the model
that provides the shortest description of the data should
be chosen.
Formally, the total description length of the data with the help of a model is given by the two-part code L(D, M) = L(M) + L(D | M), where L(M) is the number of bits needed to describe the model and L(D | M) is the number of bits needed to describe the data when it is encoded with the help of the model.
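A toy sketch of two-part MDL model selection; the polynomial candidates, the noisy data, and the crude code-length approximation below are illustrative assumptions. The model cost grows with the number of parameters, the data cost shrinks as the fit improves, and the degree with the smallest total description length is chosen.

```python
import numpy as np

def description_length(x, y, degree):
    """Crude two-part MDL score for a degree-d polynomial fit (in nats)."""
    n, k = len(x), degree + 1
    coeffs = np.polyfit(x, y, degree)
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
    data_cost = 0.5 * n * np.log(rss / n)   # L(D | M): cost of encoding the residuals
    model_cost = 0.5 * k * np.log(n)        # L(M): cost of encoding the k parameters
    return data_cost + model_cost

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x ** 2 + 0.05 * rng.standard_normal(50)   # assumed noisy quadratic data
best = min(range(1, 8), key=lambda d: description_length(x, y, d))
print("Polynomial degree selected by MDL:", best)   # typically 2
```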
Applications of Instance-Based Learning
Instance-based learning is used in various fields, including computer vision,
pattern recognition, and recommendation systems. It is particularly useful in
situations where the data may change rapidly, or where the relationships
between features are complex and hard to model with traditional methods.
Locally Weighted Regression
• Locally Weighted Linear Regression is a non-parametric method/algorithm.
• In Linear Regression the data should be distributed linearly, whereas Locally Weighted Regression is suitable for non-linearly distributed data.
• Generally, in Locally Weighted Regression, points closer to the query point are given more weight than points farther away (a minimal sketch follows below).
NOTE: For Locally Weighted Linear Regression, the data must always be available on the machine, as the model does not learn from the whole dataset in a single shot. In Linear Regression, by contrast, the training set can be erased after training because the model has already learned the required parameters.
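A minimal sketch of locally weighted linear regression with a Gaussian kernel; the bandwidth tau, the kernel choice, and the toy sine data are illustrative assumptions, not from the notes:

```python
import numpy as np

def lwlr_predict(x_query, X, y, tau=0.2):
    """Predict y at x_query by fitting a weighted least-squares line around it."""
    # Design matrix with a bias column
    Xb = np.column_stack([np.ones_like(X), X])
    xq = np.array([1.0, x_query])
    # Gaussian kernel: nearby points get weights close to 1, distant points near 0
    w = np.exp(-((X - x_query) ** 2) / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equations: theta = (X^T W X)^(-1) X^T W y
    theta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    return xq @ theta

# Toy non-linear data (assumed for illustration)
X = np.linspace(0, 2 * np.pi, 100)
y = np.sin(X) + 0.1 * np.random.randn(100)
print(lwlr_predict(np.pi / 2, X, y, tau=0.2))   # approximately sin(pi/2) = 1
```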
Parametric and Non-Parametric Models
• Time Series Analysis: LWLR is particularly useful in time series analysis, where
the relationship between variables may change over time. By adapting to the
local patterns and trends, LWLR can capture the dynamics of time-varying data
and make accurate predictions.
• Anomaly Detection: LWLR can be employed for anomaly detection in various
domains, such as fraud detection or network intrusion detection. By identifying
deviations from the expected patterns in a localized manner, LWLR helps detect
abnormal behavior that may go unnoticed using traditional regression models.
• Robotics and Control Systems: In robotics and control systems, LWLR can be
utilized to model and predict the behavior of complex systems. By adapting to
local conditions and variations, LWLR enables precise control and decision-
making in dynamic environments.
Benefits of Locally Weighted Linear Regression
Disadvantages of Locally Weighted Linear Regression
Radial Basis Function (RBF) Networks
When there is more than one predictor variable, the Radial Basis Function (RBF) neural network has the same number of dimensions as there are variables. If three neurons sit in a space with two predictor variables, we can predict the value of a new point from the RBF functions: the best predicted value is obtained by adding the output values of the RBF functions multiplied by the weights computed for each neuron (see the sketch below).
The radial basis function of a neuron consists of a center and a radius (also called the spread), and the radius may vary between neurons. In DTREG-generated RBF networks, the radius can differ in each dimension.
As the spread grows larger, neurons farther away from a point have more influence.
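A minimal sketch of how an RBF network forms a prediction from its neurons; the centers, spreads, and output weights below are illustrative assumptions, not trained values:

```python
import numpy as np

def rbf_predict(x, centers, spreads, weights):
    """Predicted value = weighted sum of Gaussian RBF activations."""
    # One Gaussian activation per neuron, based on the distance to its center
    dists = np.linalg.norm(centers - x, axis=1)
    activations = np.exp(-(dists ** 2) / (2 * spreads ** 2))
    # The output is the weighted sum of the neuron activations
    return activations @ weights

# Three neurons in a space with two predictor variables (assumed values)
centers = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
spreads = np.array([0.5, 0.7, 0.6])     # one spread per neuron
weights = np.array([1.0, -0.5, 2.0])    # output weight per neuron

print(rbf_predict(np.array([0.9, 1.1]), centers, spreads, weights))
```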
Case-based reasoning (CBR)
In case-based reasoning, a new problem is solved by adapting solutions that were used for similar problems in the past. It is therefore also referred to as an experience-based, intelligent problem-solving approach: it means learning from past experiences and using that knowledge to approach new problems.
Lazy Learning:
One of the most popular lazy learning algorithms is the k-nearest neighbors (k-NN) algorithm. In k-NN, the k closest training instances to the query point are considered, and their class labels are used to determine the class of the query. Lazy learning methods excel in situations where the underlying data distribution is complex or the training data is noisy.
Benefits of Lazy Learning
Limitations of Lazy Learning
Eager Learning:
Eager learning algorithms, such as decision trees, neural networks, and support vector machines, work by
constructing a predictive model based on the entire training dataset. The model is built during the training
phase, which means that the learning process is completed before the prediction phase begins.
For instance, a decision tree algorithm will analyze the training data and construct a tree-like model of decisions
based on the features of the data. Similarly, a neural network will use the training data to adjust the weights and
biases of the network during the training phase. Once the model is built, it can quickly generate predictions for
new data points.
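A minimal sketch of the two phases of eager learning; a simple nearest-centroid classifier stands in for the decision trees and neural networks mentioned above, and the toy data is assumed for illustration. All the work happens in fit(), so predictions for new points are fast and no longer need the training set.

```python
import numpy as np

class NearestCentroidClassifier:
    """A simple eager learner: the model is built entirely during training."""
    def fit(self, X, y):
        # Training phase: compute one centroid per class and keep only those
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Prediction phase: only the stored centroids are needed
        dists = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[np.argmin(dists, axis=1)]

# Toy data (assumed for illustration)
X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [4.8, 5.2]])
y = np.array([0, 0, 1, 1])
model = NearestCentroidClassifier().fit(X, y)
print(model.predict(np.array([[1.1, 1.0], [5.1, 5.0]])))   # -> [0 1]
```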
Eager learning is suitable when the entire training dataset is available and can be processed efficiently. It is
preferable in scenarios where the training data is relatively small and can fit into memory, as eager learning
algorithms require the entire dataset to construct the model. Additionally, eager learning is advantageous when
the prediction phase needs to be fast, as the model is already built and can quickly generate predictions for new
data points. This makes eager learning ideal for real-time or time-sensitive applications where immediate
predictions are required.
Examples of Real-World Use Cases
of Eager Learning
Benefits of Eager Learning
Limitations of Eager Learning
Genetic Algorithm
Genetic Algorithms (GAs) are adaptive heuristic search algorithms that belong to the larger class of evolutionary algorithms. Genetic algorithms are based on the ideas of natural selection and genetics. They are an intelligent exploitation of random search, supplied with historical data to direct the search into regions of better performance in the solution space. They are commonly used to generate high-quality solutions to optimization and search problems.
Genetic algorithms simulate the process of natural selection, meaning that the species that can adapt to changes in their environment are able to survive, reproduce, and pass on to the next generation. In simple words, they simulate "survival of the fittest" among the individuals of consecutive generations to solve a problem. Each generation consists of a population of individuals, and each individual represents a point in the search space and a possible solution. Each individual is represented as a string of characters/integers/floats/bits; this string is analogous to a chromosome.
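A minimal sketch of a genetic algorithm; the bit-string encoding, the "count the 1-bits" fitness function, and the hyperparameters are illustrative assumptions, not from the notes. Each individual is a bit string (the chromosome), the fitter individuals are selected to reproduce, and crossover plus mutation produce the next generation.

```python
import random

GENES, POP, GENERATIONS, MUTATION_RATE = 20, 30, 50, 0.01

def fitness(ind):
    # Toy objective (assumed): maximize the number of 1-bits ("OneMax")
    return sum(ind)

def select(pop):
    # Tournament selection: the fitter of two random individuals survives
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    # Single-point crossover combines two parent chromosomes
    point = random.randint(1, GENES - 1)
    return p1[:point] + p2[point:]

def mutate(ind):
    # Each gene flips with a small probability
    return [1 - g if random.random() < MUTATION_RATE else g for g in ind]

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP)]

print("Best fitness:", max(map(fitness, population)))   # approaches GENES
```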
Foundation of Genetic Algorithms
Genetic algorithms are based on an analogy with the genetic structure and behaviour of chromosomes in a population. The following is the foundation of GAs based on this analogy –