
Module 3

Bayesian learning

• Bayes' theorem is also known by other names such as Bayes' rule or Bayes' law. It helps determine the probability of an event under uncertain knowledge: it is used to calculate the probability that one event occurs given that another event has already occurred, and it relates conditional probability to marginal probability.
• Bayes' theorem is one of the most popular concepts in machine learning for calculating the probability of one event, under uncertain knowledge, when another event is already known to have occurred.
• Bayes' theorem can be derived using the product rule and the conditional probability of event X given a known event Y:
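In standard notation, writing the product rule for the joint probability of X and Y and rearranging gives:

\[
P(X \cap Y) = P(X \mid Y)\,P(Y) = P(Y \mid X)\,P(X)
\quad\Longrightarrow\quad
P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)}
\]

Here P(X | Y) is the posterior probability of X given Y, P(Y | X) is the likelihood, P(X) is the prior, and P(Y) is the marginal (evidence) probability.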
K-Nearest Neighbor(KNN) Algorithm
• K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
• The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category most similar to the available categories.
• The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
• The K-NN algorithm can be used for regression as well as classification, but it is mostly used for classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at classification time, performs an action on it.
• At the training phase the KNN algorithm just stores the dataset; when it gets new data, it classifies that data into the category most similar to it.
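As a concrete illustration of the store-then-compare idea, here is a minimal k-NN classification sketch (the function name, the toy data, and k = 3 are illustrative assumptions, not taken from the slides):

```python
# Minimal k-NN classification sketch (illustrative only).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest neighbours
    nearest = np.argsort(distances)[:k]
    # Majority vote over their class labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage: two well-separated classes
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [6.0, 6.2], [5.8, 6.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([5.5, 6.0]), k=3))  # -> 1
```

Because all the work happens inside knn_predict, "training" is nothing more than keeping X_train and y_train in memory, which is exactly the lazy-learner behaviour described above.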
Minimum Description Length Principle
• The minimum description length (MDL) principle is a
powerful method of inductive inference, the basis of
statistical modeling, pattern recognition, and machine
learning. It holds that the best explanation, given a
limited set of observed data, is the one that permits the
greatest compression of the data.
• MDL methods are particularly well-suited for dealing
with model selection, prediction, and estimation
problems in situations where the models under
consideration can be arbitrarily complex, and overfitting
the data is a serious concern.
• Minimum Description Length (MDL) is a model
selection principle where the shortest description of the data is
the best model. MDL methods learn through a data
compression perspective and are sometimes described as
mathematical applications of Occam's razor.
• The MDL principle can be extended to other forms of inductive
inference and learning, for example to estimation and
sequential prediction, without explicitly identifying a single
model of the data.
• MDL has its origins mostly in information theory and has been further developed within the
general fields of statistics, theoretical computer science and machine learning, and more
narrowly computational learning theory.
• Selecting the minimum length description of the available data as the best model observes
the principle identified as Occam's razor. Prior to the advent of computer programming,
generating such descriptions was the intellectual labor of scientific theorists. It was far less
formal than it has become in the computer age. If two scientists had a theoretic disagreement,
they rarely could formally apply Occam's razor to choose between their theories. They would
have different data sets and possibly different descriptive languages. Nevertheless, science
advanced as Occam's razor was an informal guide in deciding which model was best.
• With the advent of formal languages and computer programming Occam's razor was
mathematically defined. Models of a given set of observations, encoded as bits of data, could
be created in the form of computer programs that output that data. Occam's razor could
then formally select the shortest program, measured in bits of this algorithmic information, as
the best model.
• MDL applies in machine learning when algorithms (machines)
generate descriptions. Learning occurs when an algorithm
generates a shorter description of the same data set.
• The theoretic minimum description length of a data set, called
its Kolmogorov complexity, cannot, however, be computed. That
is to say, even if by random chance an algorithm generates the
shortest program of all that outputs the data set, an automated
theorem prover cannot prove there is no shorter such program.
Nevertheless, given two programs that output the dataset, the
MDL principle selects the shorter of the two as embodying the
best model.
• Minimum description length (MDL), which originated
from algorithmic coding theory in computer science,
regards both model and data as codes. The idea is that
any data set can be appropriately encoded with the help
of a model, and that the more we compress the data by
extracting redundancy from it, the more we uncover
underlying regularities in the data.
• Thus, code length is directly related to the
generalization capability of the model, where the model
that provides the shortest description of the data should
be chosen.
Formally, the total description length of a model is given by

L = L(model) + L(data | model),

where L(data | model) is the description length of the data given the model, measured in bits of information, and L(model) is the description length of the model itself.
• The above definition of minimum description length is broad enough to be applied to any well-defined model, even verbally defined, qualitative models.
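As a rough illustration of the two-part code, the sketch below scores two polynomial fits by a crude L(model) + L(data | model); the 32-bits-per-parameter cost, the Gaussian code for residuals, and the fixed precision are simplifying assumptions rather than a prescribed MDL coding scheme:

```python
# Rough two-part MDL comparison of two polynomial models (illustrative sketch).
import numpy as np

def description_length(y, y_hat, n_params, bits_per_param=32, precision=0.01):
    """Crude L(model) + L(data | model) in bits."""
    residuals = y - y_hat
    sigma2 = max(np.var(residuals), precision ** 2)
    n = len(y)
    # L(data | model): Gaussian code length for residuals quantised to +/- precision
    data_bits = n * (0.5 * np.log2(2 * np.pi * np.e * sigma2) - np.log2(precision))
    # L(model): a fixed cost per real-valued parameter
    model_bits = n_params * bits_per_param
    return model_bits + data_bits

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 2 * x + 1 + rng.normal(0, 0.05, size=x.shape)   # data generated by a line plus noise

for degree in (1, 5):
    coeffs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coeffs, x)
    print(degree, round(description_length(y, y_hat, n_params=degree + 1), 1))
```

Under this crude scheme the simpler degree-1 model typically wins: the higher-degree fit shrinks the residual code only slightly while paying a much larger model cost, which is the compression trade-off MDL formalizes.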
Maximum likelihood and least squared error
hypothesis
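A standard sketch of the connection, assuming each training target d_i = f(x_i) + e_i with independent zero-mean Gaussian noise e_i of fixed variance \sigma^2, is:

\[
h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i \mid h)
       = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}
         \exp\!\left(-\frac{(d_i - h(x_i))^2}{2\sigma^2}\right)
       = \arg\min_{h \in H} \sum_{i=1}^{m} \big(d_i - h(x_i)\big)^2,
\]

obtained by taking logarithms and dropping the terms that do not depend on h. Under these assumptions the maximum likelihood hypothesis is exactly the hypothesis that minimizes the sum of squared errors over the training examples.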
Instance-based learning

Instance-based learning is a type of machine learning paradigm that operates by comparing new problem instances with instances seen in training. It is also known as memory-based learning or lazy learning, as it delays the generalization process until prediction time. This approach is fundamentally different from other learning methods, which build a general model during training to apply to new instances.
• The Machine Learning systems categorized as instance-based learning are systems that learn the training examples by heart and then generalize to new instances based on some similarity measure. The approach is called instance-based because it builds its hypotheses from the training instances. It is also known as memory-based learning or lazy learning (because processing is delayed until a new instance must be classified). The time complexity of this algorithm depends on the size of the training data: each time a new query is encountered, the previously stored data is examined and a target function value is assigned to the new instance.
• The worst-case time complexity of this algorithm is O(n), where n is the number of training instances. For example, if we were to create a spam filter with an instance-based learning algorithm, instead of just flagging emails identical to those already marked as spam, our spam filter would also flag emails that are very similar to them. This requires a measure of resemblance between two emails; a similarity measure could be the same sender, repetitive use of the same keywords, or something else.
Advantages:
1. Instead of estimating for the entire instance set, local approximations can be made to the target function.
2. The algorithm can easily adapt to new data, which is collected as we go.
Disadvantages:
1. Classification costs are high.
2. A large amount of memory is required to store the data, and each query involves starting the identification of a local model from scratch.

Applications
Instance-based learning is used in various fields, including computer vision,
pattern recognition, and recommendation systems. It is particularly useful in
situations where the data may change rapidly, or where the relationships
between features are complex and hard to model with traditional methods.
Locally Weighted Regression

Linear Regression is a supervised learning algorithm used for computing linear relationships between input (X) and output (Y).
Locally Weighted Linear Regression (LWLR):
Locally weighted linear regression is a non-parametric algorithm, that
is, the model does not learn a fixed set of parameters as is done in
ordinary linear regression.

NOTE: For Locally Weighted Linear Regression, the data must always be available on the machine, as the model does not learn from the whole dataset in a single shot. In Linear Regression, by contrast, the training set can be erased from the machine after training because the model has already learned the required parameters.
Locally Weighted Linear Regression is a non-parametric method/algorithm.
In Linear Regression, the data should be distributed linearly, whereas Locally Weighted Regression is suitable for non-linearly distributed data.
Generally, in Locally Weighted Regression, points closer to the query point are given more weight than points farther away from it.
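In symbols, one common choice (a Gaussian kernel with bandwidth \tau, which is an illustrative assumption rather than the only option) weights each training point x_i relative to a query point x as

\[
w_i = \exp\!\left(-\frac{(x_i - x)^2}{2\tau^2}\right),
\]

so the weight decays smoothly to zero as a training point moves away from the query.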
Parametric and Non-Parametric Models

Parametric Methods: The basic idea behind parametric methods is that there is a fixed set of parameters used to determine a probability model in Machine Learning.
The model has a collection of parameters that summarize the data.
These parameters are fixed in number, which means the number of parameters is decided in advance and does not depend on the data; it is also independent of the number of training samples.
Non-Parametric
• Non-parametric algorithms do not make particular assumptions about the kind of mapping function. These algorithms do not accept a specific form of the mapping function between input and output data as true.
• They have the freedom to choose any functional form from the training data. As a result, non-parametric models require much more data than parametric ones to estimate the mapping function.
• The basic idea behind the non-parametric method is that there is no need to make any assumptions about parameters for the given population or the population we are studying. In fact, the methods do not depend on the population.
Principles of Locally Weighted Linear
Regression

LWLR works on the premise that the relationship between the dependent and independent variables is linear, but this relationship is allowed to vary across different regions of the dataset. This is achieved by fitting an individual linear regression model for each prediction, using a weighted least squares technique.
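A minimal sketch of this per-query weighted least squares fit for one input variable (the Gaussian kernel, the bandwidth tau, and the toy data are illustrative assumptions):

```python
# Minimal locally weighted linear regression sketch (Gaussian kernel weights).
import numpy as np

def lwlr_predict(x_query, X, y, tau=0.5):
    """Fit a weighted least-squares line around x_query and return its prediction."""
    # Design matrix with a bias column
    Xb = np.column_stack([np.ones_like(X), X])
    xq = np.array([1.0, x_query])
    # Gaussian weights: points near the query count more
    w = np.exp(-(X - x_query) ** 2 / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equations: theta = (X^T W X)^{-1} X^T W y
    theta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    return xq @ theta

# Toy usage on non-linearly distributed data
X = np.linspace(0, 6, 60)
y = np.sin(X) + 0.1 * np.random.default_rng(1).normal(size=X.shape)
print(lwlr_predict(3.0, X, y, tau=0.5))  # roughly sin(3.0)
```

Note that a fresh parameter vector theta is solved for at every query point, which is why the training data must stay available at prediction time.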
Applications of Locally Weighted Linear
Regression

• Time Series Analysis: LWLR is particularly useful in time series analysis, where
the relationship between variables may change over time. By adapting to the
local patterns and trends, LWLR can capture the dynamics of time-varying data
and make accurate predictions.
• Anomaly Detection: LWLR can be employed for anomaly detection in various
domains, such as fraud detection or network intrusion detection. By identifying
deviations from the expected patterns in a localized manner, LWLR helps detect
abnormal behavior that may go unnoticed using traditional regression models.
• Robotics and Control Systems: In robotics and control systems, LWLR can be
utilized to model and predict the behavior of complex systems. By adapting to
local conditions and variations, LWLR enables precise control and decision-
making in dynamic environments.
Benefits of Locally Weighted
Linear Regression
Disadvantages of Locally weighted Linear Regression

• This process is highly exhaustive and may consume a huge amount of resources.
• We may simply avoid the Locally Weighted algorithm for linearly related problems, which are simpler.
• It cannot accommodate a large number of features.
Radial Basis Functions
Radial Basis Function (RBF) networks are a special class of feed-forward neural networks consisting of three layers: an input layer, a hidden layer, and an output layer. This is fundamentally different from most neural network architectures, which are composed of many layers and introduce nonlinearity by repeatedly applying non-linear activation functions. The input layer receives input data and passes it into the hidden layer, where the computation occurs. The hidden layer of an RBF network is its most powerful part and is very different from that of most neural networks. The output layer is designated for prediction tasks such as classification or regression.
RBF Networks Working
RBF Neural networks are conceptually similar to K-Nearest Neighbor (k-NN)
models, though the implementation of both models is starkly different. The
fundamental idea of Radial Basis Functions is that an item's predicted target value
is likely to be the same as other items with close values of predictor variables. An
RBF Network places one or many RBF neurons in the space described by the
predictor variables. The space has multiple dimensions corresponding to the
number of predictor variables present. We calculate the Euclidean distance from
the evaluated point to the center of each neuron. A Radial Basis Function (RBF),
also known as kernel function, is applied to the distance to calculate every
neuron's weight (influence). The name of the Radial Basis Function comes from the radius distance, which is the argument to the function: Weight = RBF(distance). The greater the distance of a neuron from the point being evaluated, the less influence (weight) it has.
A Radial Basis Function is a real-valued function, the value of which depends only on the distance from the
origin. Although we use various types of radial basis functions, the Gaussian function is the most common.

In the instance of more than one predictor variable, the Radial basis Functions Neural Network has the
same number of dimensions as there are variables. If three neurons are in a space with two predictor
variables, we can predict the value from the RBF functions. We can calculate the best-predicted value for
the new point by adding the output values of the RBF functions multiplied by the weights processed for
each neuron.

The radial basis function for a neuron consists of a center and a radius (also called the spread). The radius
may vary between different neurons. In DTREG-generated RBF networks, each dimension's radius can
differ.

As the spread grows larger, neurons at a distance from a point have more influence.
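A minimal sketch of the forward pass just described, with Gaussian basis functions (the centers, spreads, and output weights below are illustrative assumptions; in practice they are chosen or trained from the data):

```python
# Minimal RBF-network prediction sketch with Gaussian basis functions.
import numpy as np

def rbf_predict(x_query, centers, spreads, out_weights):
    """Weighted sum of Gaussian RBF activations, one per hidden neuron."""
    # Euclidean distance from the query point to every neuron's center
    d = np.linalg.norm(centers - x_query, axis=1)
    # Gaussian radial basis: influence shrinks with distance, grows with the spread
    phi = np.exp(-(d ** 2) / (2 * spreads ** 2))
    # Output layer: linear combination of the hidden activations
    return phi @ out_weights

# Toy usage: three neurons in a space of two predictor variables
centers = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
spreads = np.array([0.5, 0.5, 0.8])
out_weights = np.array([1.0, -2.0, 0.5])
print(rbf_predict(np.array([0.9, 1.1]), centers, spreads, out_weights))
```

The larger a neuron's spread, the farther its influence reaches, which matches the statement above.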
Case-based reasoning (CBR)
In case-based reasoning, a new problem is solved by adapting solutions that were useful for similar problems in the past. It is therefore also referred to as an experience-based approach or an intelligent problem-solving method: it means learning from past experiences and using that knowledge to approach new problems.

For example, assume there is a CBR mechanism for an e-commerce application that provides services to its customers. The CBR mechanism can be used to improve the customer experience based on past experiences. If a customer likes a particular category, the CBR mechanism helps find similar cases for that customer.
Case-Based Reasoning VS Other
Techniques
Case-based reasoning is a technique that uses past experiences to improve its performance. Let's discuss how it differs from other techniques.
Rule-based systems
In rule-based reasoning, pre-defined rules designed by experts in the field are used for solving problems. CBR, on the other hand, is effective in situations where such rules are inadequate or not available.
Decision trees
Decision trees are a type of algorithm used for solving classification problems and are widely used in Machine Learning and data mining. CBR, in contrast, uses past experiences instead of building a decision tree from the given data.
Neural networks
Neural networks are machine learning algorithms that use past information to make predictions on new inputs. They are used in natural language processing, stock market prediction, intelligent searching, and image recognition, and are inspired by the structure of the human brain. CBR's primary focus, by contrast, is learning from stored past cases.
Lazy & Eager Learning
Lazy Learning:
Lazy learning algorithms work by memorizing the training data rather than
constructing a general model. When a new query is received, lazy learning
retrieves similar instances from the training set and uses them to generate a
prediction. The similarity between instances is usually calculated using distance
metrics, such as Euclidean distance or cosine similarity.

One of the most popular lazy learning algorithms is the k-nearest neighbors (k-NN)
algorithm. In k-NN, the k closest training instances to the query point are
considered, and their class labels are used to determine the class of the query. Lazy
learning methods excel in situations where the underlying data distribution is
complex or where the training data is noisy.
Benefits of Lazy Learning
Limitations of Lazy Learning
Eager Learning:
Eager learning algorithms, such as decision trees, neural networks, and support vector machines, work by
constructing a predictive model based on the entire training dataset. The model is built during the training
phase, which means that the learning process is completed before the prediction phase begins.

For instance, a decision tree algorithm will analyze the training data and construct a tree-like model of decisions
based on the features of the data. Similarly, a neural network will use the training data to adjust the weights and
biases of the network during the training phase. Once the model is built, it can quickly generate predictions for
new data points.

Eager learning is suitable when the entire training dataset is available and can be processed efficiently. It is
preferable in scenarios where the training data is relatively small and can fit into memory, as eager learning
algorithms require the entire dataset to construct the model. Additionally, eager learning is advantageous when
the prediction phase needs to be fast, as the model is already built and can quickly generate predictions for new
data points. This makes eager learning ideal for real-time or time-sensitive applications where immediate
predictions are required.
Examples of Real-World Use Cases
of Eager Learning
Benefits of Eager Learning
Limitations of Eager Learning
Genetic Algorithm
Genetic Algorithms (GAs) are adaptive heuristic search algorithms that belong to the larger class of evolutionary algorithms. Genetic algorithms are based on the ideas of natural selection and genetics. They are an intelligent exploitation of random search, using historical data to direct the search into regions of better performance in the solution space. They are commonly used to generate high-quality solutions for optimization and search problems.
Genetic algorithms simulate the process of natural selection, which means that those species that can adapt to changes in their environment are able to survive, reproduce, and pass on to the next generation. In simple words, they simulate "survival of the fittest" among the individuals of consecutive generations to solve a problem. Each generation consists of a population of individuals, and each individual represents a point in the search space and a possible solution. Each individual is represented as a string of characters/integers/floats/bits. This string is analogous to a chromosome.
Foundation of Genetic
Algorithms
Genetic algorithms are based on an analogy with the genetic structure and behaviour of chromosomes in a population. Following is the foundation of GAs based on this analogy –

1. Individuals in a population compete for resources and mates.
2. Those individuals who are successful (fittest) then mate to create more offspring than others.
3. Genes from the "fittest" parents propagate throughout the generation; that is, sometimes parents create offspring that are better than either parent.
4. Thus each successive generation becomes better suited to its environment.
Fitness Score
• A Fitness Score is given to each individual, which shows the ability of that individual to "compete". Individuals having an optimal (or near-optimal) fitness score are sought.
• The GA maintains a population of n individuals (chromosomes/solutions) along with their fitness scores. Individuals with better fitness scores are given more chances to reproduce than others: they are selected to mate and produce better offspring by combining the chromosomes of the parents.
• So some individuals die and are replaced by new arrivals, eventually creating a new generation once all the mating opportunities of the old population are exhausted. The hope is that over successive generations better solutions will arrive.
• Each new generation has, on average, more "good genes" than the individuals (solutions) of previous generations; thus each new generation has better "partial solutions" than previous generations. Once the offspring produced show no significant difference from the offspring produced by previous populations, the population has converged, and the algorithm is said to have converged to a set of solutions for the problem.
Operators of Genetic Algorithms
Once the initial generation is created, the algorithm evolves the generation using the following operators –

1) Selection Operator: The idea is to give preference to individuals with good fitness scores and allow them to pass their genes to successive generations.
2) Crossover Operator: This represents mating between individuals. Two individuals are selected using the selection operator and crossover sites are chosen randomly; the genes at these crossover sites are then exchanged, creating a completely new individual (offspring).
3) Mutation Operator: The key idea is to insert random genes into offspring in order to maintain diversity in the population and avoid premature convergence. A compact sketch of all three operators is given below.
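A compact sketch of the three operators on bit-string individuals (the one-max fitness function, the population size, the tournament size, and the mutation rate are illustrative assumptions):

```python
# Minimal genetic algorithm sketch: selection, crossover, mutation on bit strings.
import random

GENES = 20
fitness = lambda ind: sum(ind)            # toy "one-max" fitness: number of 1-bits

def select(population, k=3):
    """Tournament selection: pick the fittest of k randomly chosen individuals."""
    return max(random.sample(population, k), key=fitness)

def crossover(p1, p2):
    """Single-point crossover at a randomly chosen site."""
    point = random.randint(1, GENES - 1)
    return p1[:point] + p2[point:]

def mutate(ind, rate=0.05):
    """Flip each gene independently with a small probability."""
    return [1 - g if random.random() < rate else g for g in ind]

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(30)]
for generation in range(50):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(len(population))]
print(max(fitness(ind) for ind in population))   # approaches GENES as evolution proceeds
```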
Hypothesis Space Search
• Hypothesis space is the set of all possible legal hypotheses. This is the set from which the machine learning algorithm determines the single best hypothesis, the one that best describes the target function or the outputs.
• Hypothesis space search is a key concept in machine learning that
refers to the process of searching for the best possible model or
hypothesis that can fit the given data. In machine learning, the
hypothesis space is the set of all possible models that the algorithm
can learn from the data.
• GAs search the hypothesis space by generating successor hypotheses
which repeatedly mutate and recombine parts of the best currently
known hypotheses.
Genetic Programming

Genetic programming is a domain-independent method that genetically breeds a population of computer programs to solve a problem. Specifically, genetic programming iteratively transforms a population of computer programs into a new generation of programs by applying analogs of naturally occurring genetic operations.

In artificial intelligence, genetic programming is a technique for evolving programs, starting from a population of unfit programs, until they are fit for a particular task, by applying operations analogous to natural genetic processes to the population of programs.
Parallelizing Genetic Algorithms
Sometimes, if a problem is complex and individuals are heavy, it might not be possible to implement an efficient genetic algorithm, because computations take too much time and it is not possible to store all the needed data in memory. To overcome the second issue one may store some individuals on disk and load them into memory only when they are needed. This does not solve the performance issue; moreover, it adds another one: reading (and possibly deserializing) an individual from disk into memory may make the entire system even slower. Moreover, all the individuals an algorithm has ever seen were created using the same genetic operators and thus may have too much in common, so the algorithm may end up walking around a local optimum. Is there a way to solve this? Possibly, yes: we can use a parallel or a distributed genetic algorithm.
These genetic algorithms do not depend on each other; as a result, they can run in parallel, taking advantage of a multicore CPU. Each algorithm has its own set of individuals, so these individuals may differ from the individuals of another algorithm because they have a different mutation/crossover history.

A parallel genetic algorithm can use two independent algorithms to improve its performance. The difference between the two algorithms is the way individuals are selected for mutation and crossover. Moreover, some creatures with the highest fitness are allowed to 'migrate' from one algorithm to another. While this may sometimes be sufficient, when a task is very hard to solve or an individual is a complex entity, we may need even more diversity within individuals.
In order to achieve this we may use as many algorithms as possible (say, two times more than the number of CPU cores we have) and vary almost every property of each algorithm. As a result, each algorithm has its own set of individuals created using methods that differ from those used by the other algorithms. This is also an 'island model', even though the 'islands' are more different from each other.
The only restriction is that all algorithms use the same fitness function to assess individuals, so that it is possible to compare individuals that belong to different algorithms. As a result, these genetic algorithms may differ in the following aspects:

• how new individuals are created;
• how individuals mutate and how they are selected for mutation;
• how individuals cross over and how they are selected for crossover;
• how many individuals take part in a single crossover event and how many individuals are created as a result;
• how many individuals survive after each algorithm iteration and how these creatures are selected;
• how many individuals each generation contains;
• how many generations an algorithm walks through.
• By varying these characteristics it is possible to make many different genetic algorithms that solve the same task but have entirely different individuals.
• It is important to note that these independent genetic algorithms have the same structure as any other conventional genetic algorithm, so they can be extracted from a parallel genetic algorithm and used on their own. A rough island-model sketch follows.
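A rough island-model sketch using Python's multiprocessing (the number of islands, the epoch length, the ring migration of best individuals, and the toy one-max fitness are all illustrative assumptions):

```python
# Rough island-model parallel GA sketch: several independent populations evolve
# in worker processes and periodically exchange their best individuals.
import random
from multiprocessing import Pool

GENES, POP, EPOCHS, GENS_PER_EPOCH, ISLANDS = 20, 30, 5, 10, 4
fitness = lambda ind: sum(ind)            # toy "one-max" fitness shared by every island

def evolve(population):
    """Run one epoch of a plain GA on a single island."""
    for _ in range(GENS_PER_EPOCH):
        new = []
        for _ in range(POP):
            # Tournament selection of two parents
            p1 = max(random.sample(population, 3), key=fitness)
            p2 = max(random.sample(population, 3), key=fitness)
            # Single-point crossover followed by per-gene mutation
            point = random.randint(1, GENES - 1)
            child = p1[:point] + p2[point:]
            child = [1 - g if random.random() < 0.05 else g for g in child]
            new.append(child)
        population = new
    return population

if __name__ == "__main__":
    islands = [[[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
               for _ in range(ISLANDS)]
    with Pool(ISLANDS) as pool:
        for _ in range(EPOCHS):
            islands = pool.map(evolve, islands)          # islands evolve in parallel
            # Migration: each island receives the best individual of its neighbour
            bests = [max(isl, key=fitness) for isl in islands]
            for i, isl in enumerate(islands):
                isl[0] = bests[(i + 1) % ISLANDS]
    print(max(fitness(ind) for isl in islands for ind in isl))
```

Because every island shares the same fitness function but can use different operator settings, this structure can be extended to the more heterogeneous variants described above.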
