UNIT-1 Machine Learning
Prepared by
Dr. Syeda Husna Mehanoor
Associate Professor
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
R22 B.Tech. CSE Syllabus JNTU Hyderabad CS601PC: MACHINE
LEARNING B. Tech III Year II Sem.
L T P C
3 0 0 3
Course Objectives:
• To introduce students to the basic concepts and techniques of Machine Learning.
• To have a thorough understanding of the Supervised and Unsupervised learning techniques
• To study the various probability-based learning techniques
Course Outcomes:
• Distinguish between supervised, unsupervised and semi-supervised learning
• Understand algorithms for building classifiers applied on datasets of non-linearly separable
classes
• Understand the principles of evolutionary computing algorithms
• Design an ensemble to increase the classification accuracy
UNIT - I
Learning – Types of Machine Learning – Supervised Learning – The Brain and the Neuron – Design a
Learning System – Perspectives and Issues in Machine Learning – Concept Learning Task – Concept
Learning as Search – Finding a Maximally Specific Hypothesis – Version Spaces and the Candidate
Elimination Algorithm – Linear Discriminants: – Perceptron – Linear Separability – Linear Regression.
UNIT - II
Multi-layer Perceptron– Going Forwards – Going Backwards: Back Propagation Error – Multi-layer
Perceptron in Practice – Examples of using the MLP – Overview – Deriving Back-Propagation – Radial
Basis Functions and Splines – Concepts – RBF Network – Curse of Dimensionality – Interpolations and
Basis Functions – Support Vector Machines
UNIT - III
Learning with Trees – Decision Trees – Constructing Decision Trees – Classification and Regression
Trees – Ensemble Learning – Boosting – Bagging – Different ways to Combine Classifiers – Basic
Statistics – Gaussian Mixture Models – Nearest Neighbor Methods – Unsupervised Learning – K means
Algorithms
UNIT - IV
Dimensionality Reduction – Linear Discriminant Analysis – Principal Component Analysis – Factor
Analysis – Independent Component Analysis – Locally Linear Embedding – Isomap – Least Squares
Optimization Evolutionary Learning – Genetic algorithms – Genetic Offspring: - Genetic Operators –
Using Genetic Algorithms
UNIT - V
Reinforcement Learning – Overview – Getting Lost Example Markov Chain Monte Carlo Methods –
Sampling – Proposal Distribution – Markov Chain Monte Carlo – Graphical Models – Bayesian
Networks – Markov Random Fields – Hidden Markov Models – Tracking Methods
TEXT BOOKS:
1. Stephen Marsland, “Machine Learning – An Algorithmic Perspective”, Second Edition,
Chapman and Hall/CRC Machine Learning and Pattern Recognition Series, 2014
UNIT - I
LEARNING
Definition of learning
A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with
experience E.
Examples
i) Handwriting recognition learning problem
• Task T: Recognising and classifying handwritten words within images
• Performance P: Percent of words correctly classified
• Training experience E: A dataset of handwritten words with given classifications
Therefore, a computer program which learns from experience is called a machine learning
program or simply a learning program. Such a program is sometimes also referred to as a learner.
Machine Learning
Machine learning enables a machine to automatically learn from data, improve performance from
experience, and make predictions without being explicitly programmed.
A Machine Learning system learns from historical data, builds prediction models, and
whenever it receives new data, predicts the output for it. The accuracy of the predicted output
depends in part on the amount and quality of the data: more representative data generally helps
to build a better model which predicts the output more accurately. Suppose we have a complex
problem where we need to make predictions. Instead of writing code for it by hand, we feed the
data to generic algorithms, and with the help of these algorithms the machine builds the logic
from the data and predicts the output.
Arthur Samuel, an early American leader in the field of computer gaming and artificial
intelligence, coined the term “Machine Learning” in 1959 while at IBM. He defined machine
learning as “the field of study that gives computers the ability to learn without being explicitly
programmed.” However, there is no universally accepted definition for machine learning.
Different authors define the term differently.
Benefit | Application | Description
Enhanced Accuracy and Precision | Medical image analysis for disease diagnosis | ML models detect subtle patterns in medical images with high accuracy, aiding in early diagnosis.
Improved Efficiency and Scalability | Facial recognition in security systems | ML processes large volumes of images quickly, making it ideal for real-time surveillance.
Adaptability and Continuous Learning | Image classification for product categorization in e-commerce | ML models improve over time by adapting to new data, ensuring accurate categorization of new products.
TYPES OF LEARNING
Here are brief definitions for different types of machine learning:
1. Supervised Learning: A type of machine learning where the model is trained on labeled
data, meaning both input and output are provided. Example: Spam email detection.
2. Unsupervised Learning: The model learns patterns and structures from unlabeled data
without explicit outputs. Example: Customer segmentation.
3. Semi-Supervised Learning: Combines aspects of both supervised and unsupervised
learning by using a small amount of labeled data along with a large amount of unlabeled
data. Example: Medical diagnosis with limited labeled samples.
4. Reinforcement Learning: The model learns through trial and error by interacting with
an environment and receiving rewards or penalties. Example: Training an AI to play
chess.
5. Evolutionary Learning: A type of machine learning inspired by natural selection, where
algorithms evolve over generations by selecting the best solutions and applying mutations
or crossovers. Example: Genetic algorithms used for optimization problems.
SUPERVISED LEARNING
Supervised learning is a type of machine learning in which machines are trained using well
"labelled" training data, and on the basis of that data, machines predict the output. Labelled data
means that some input data is already tagged with the correct output.
In supervised learning, the training data provided to the machine works as the supervisor that
teaches the machine to predict the output correctly. It applies the same concept as a student
learning under the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data to the
machine learning model. The aim of a supervised learning algorithm is to find a mapping
function to map the input variable(x) with the output variable(y). In the real-world, supervised
learning can be used for Risk Assessment, Image classification, Fraud Detection, spam filtering,
etc.
In supervised learning, models are trained using a labelled dataset, where the model learns about
each type of data. Once the training process is completed, the model is tested on test
data (a held-out portion of the labelled data that was not used for training), and then it
predicts the output.
The working of supervised learning can be understood through the following example:
Suppose we have a dataset of different types of shapes which includes squares, rectangles,
triangles, and hexagons. The first step is to train the model on each shape.
• If the given shape has four sides, and all the sides are equal, then it will be labelled as
a Square.
• If the given shape has three sides, then it will be labelled as a triangle.
• If the given shape has six equal sides, then it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify
the shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it classifies
the shape on the basis of the number of sides and predicts the output.
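The shape example can be sketched in code as follows. This is only a minimal illustration: the two features (number of sides, whether all sides are equal), the tiny dataset and the 1-nearest-neighbour rule are assumptions chosen for brevity, not a method prescribed by these notes.

```python
# Labelled training data: (number_of_sides, all_sides_equal) -> shape label.
# The feature encoding and the examples are hypothetical.
train = [
    ((4, 1), "square"),
    ((4, 0), "rectangle"),
    ((3, 0), "triangle"),
    ((3, 1), "triangle"),
    ((6, 1), "hexagon"),
]

def predict(features):
    """Return the label of the closest training example (1-nearest neighbour)."""
    def squared_distance(example):
        known, _ = example
        return sum((a - b) ** 2 for a, b in zip(known, features))
    _, label = min(train, key=squared_distance)
    return label

# New shapes, described by the same features, with their true labels.
test = [((4, 1), "square"), ((3, 0), "triangle"), ((6, 1), "hexagon")]
correct = sum(predict(x) == y for x, y in test)
accuracy = correct / len(test)
print(f"accuracy = {accuracy:.2f}")
```

The training step here is trivial (the labelled examples are simply stored); the point is the workflow of learning from labelled data and then measuring accuracy on separate test examples.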
1. Regression
Regression algorithms are used when the output variable is continuous and depends on the input
variables. They are used for the prediction of continuous quantities, such as in weather
forecasting and market trends. Below are some popular Regression algorithms which come under
supervised learning:
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which means it takes
one of two or more classes, such as Yes-No, Male-Female, True-False, etc. Below are some popular
Classification algorithms which come under supervised learning:
• Random Forest
• Decision Trees
• Logistic Regression
• Support vector Machines
Advantages of Supervised learning:
• With the help of supervised learning, the model can predict the output on the basis of
prior experiences.
• In supervised learning, we can have an exact idea about the classes of objects.
• Supervised learning model helps us to solve various real-world problems such as fraud
detection, spam filtering, etc.
Disadvantages of Supervised learning:
• Supervised learning models are not well suited to some highly complex tasks.
• Supervised learning may not predict the correct output if the test data differs substantially
from the training dataset.
• Training requires a lot of computation time.
• In supervised learning, we need enough knowledge about the classes of objects.
THE BRAIN AND THE NEURON
The brain is an amazing system that can handle messy and complicated information (like
pictures) and give quick and accurate answers. It’s made up of simple building blocks called
neurons, which send signals when activated. These signals travel through connections called
synapses, creating a huge network of about 100 trillion links. Even as we age and lose neurons,
the brain keeps working well.
Each neuron acts like a tiny decision-maker in a massive network of about 100 billion neurons.
This has inspired scientists to create AI systems that try to copy how the brain learns. The brain
learns through plasticity: its ability to change and adapt by modifying the strength of the
connections (called synapses) between neurons or by forming new connections altogether. This is
how the brain learns and remembers things. One famous idea, suggested by Donald
Hebb in 1949, is that learning happens when neurons that frequently fire together strengthen
their connection.
Hebb’s Rule
Hebb's rule is a simple idea: if two neurons fire at the same time repeatedly, their connection
becomes stronger. On the other hand, if they never fire together, their connection weakens and
might disappear. This is how the brain learns to associate things.
Here’s an example: Imagine you always see your grandmother when she gives you chocolate.
Neurons in your brain that recognize your grandmother and neurons that make you happy about
chocolate will fire at the same time. Over time, their connection strengthens. Eventually, just
seeing your grandmother (even in a photo) makes you think of chocolate. This is similar to
classical conditioning, where Pavlov trained dogs to associate a bell with food. When the bell
and food were paired repeatedly, the dogs began to salivate at the sound of the bell alone because
the "bell" neurons and "salivation" neurons became strongly connected.
This idea is called long-term potentiation or neural plasticity, and it’s a real process in our
brains that helps us learn and form memories.
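Hebb's rule can be written as a simple weight update. The sketch below is illustrative only: the learning rate, the decay term (added here so that unused connections weaken, as described above) and the function name are assumptions, not part of Hebb's original statement.

```python
def hebbian_update(w, pre, post, lr=0.1, decay=0.01):
    """One Hebbian step: strengthen w when both neurons fire, otherwise let it decay.

    pre and post are 1 if the neuron fired and 0 if it did not.
    lr and decay are hypothetical constants chosen for illustration.
    """
    return w + lr * pre * post - decay * w

# "Grandmother" and "chocolate" neurons fire together 20 times.
w = 0.0
for _ in range(20):
    w = hebbian_update(w, pre=1, post=1)
w_paired = w

# Then the first neuron fires alone 20 times: the connection weakens.
for _ in range(20):
    w = hebbian_update(w, pre=1, post=0)
w_unpaired = w

print(w_paired > 0, w_unpaired < w_paired)
```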
Scientists have studied neurons and created mathematical models of them to simplify
understanding. Real neurons are tiny and hard to study, but Hodgkin and Huxley studied large
neurons in squid to measure how they work, earning them a Nobel Prize. Earlier, in 1943,
McCulloch and Pitts had created a simplified model of a neuron that focused on the essential parts.
Imagine the neuron model as a simple flowchart with four main parts:
1. Inputs and Weights
• Inputs (x₁, x₂, x₃, ...): These are signals coming into the neuron from other neurons.
Think of them as messages or pieces of information.
• Weights (w₁, w₂, w₃, ...): Each input has a weight that represents the strength or
importance of that input. A higher weight means the input has a stronger influence on the
neuron's decision to fire.
Example:
• x1=1 (active)
• x2=0 (inactive)
• x3=0.5 (partially active)
• Weights: w1=1, w2=−0.5, w3=−1
2. Summation (Adder)
• The neuron adds up all the inputs after they’ve been multiplied by their respective
weights.
• Formula: h = w1x1 + w2x2 + w3x3 + …
• For the example above: h = (1×1) + (−0.5×0) + (−1×0.5) = 1 + 0 − 0.5 = 0.5
3. Activation Function (Threshold)
• After summing the inputs, the neuron decides whether to "fire" (send a signal) or not
based on a threshold value (θ).
• Decision Rule:
o If h > θ, the neuron fires (output = 1)
o If h ≤ θ, the neuron does not fire (output = 0)
4. Output
• The result of the activation function is the neuron's output, which can be sent to other
neurons.
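The four parts above can be put together in a few lines. This is a minimal sketch using the inputs and weights from the example; the threshold values (θ = 0 and θ = 0.5) are assumptions, since the example does not fix θ.

```python
def mcp_neuron(inputs, weights, theta):
    """McCulloch-Pitts style neuron: weighted sum followed by a hard threshold."""
    h = sum(w * x for w, x in zip(weights, inputs))  # summation (adder)
    output = 1 if h > theta else 0                   # fire only if h exceeds theta
    return h, output

inputs = [1, 0, 0.5]
weights = [1, -0.5, -1]

h, fires = mcp_neuron(inputs, weights, theta=0.0)        # assumed threshold
_, fires_high = mcp_neuron(inputs, weights, theta=0.5)   # h is not above 0.5
print(h, fires, fires_high)
```

With θ = 0 the neuron fires because h = 0.5 > 0; with θ = 0.5 it does not, because the rule requires h to be strictly greater than θ.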
Key Features:
• Simple Decision-Maker: Despite its simplicity, this model can perform basic decisions
based on input signals.
• Foundation for Neural Networks: Multiple such neurons can be connected to form
complex networks capable of more advanced computations.
• Adjusting Weights: Learning in neural networks involves adjusting these weights to
improve decision-making based on data.
Analogy: a sensor-controlled light
• Inputs (x₁, x₂, x₃): Different sensors detecting things (like motion, light, sound).
• Weights (w₁, w₂, w₃): The importance of each sensor in deciding whether to turn on the
light.
• Summation (h): Adding up the signals from all sensors.
• Threshold (θ): The level of combined signals needed to decide to turn the light on.
• Output: The light is either on (1) or off (0).
By adjusting the weights, you can make the system more or less sensitive to certain sensors, just
like training a neural network to recognize patterns.
The McCulloch and Pitts (M&P) neuron model is a simplified version of how real neurons work.
While it has been influential in early neural network models, it has several limitations when
compared to actual biological neurons.
1. Simplified Summing: In the McCulloch and Pitts model, inputs to the neuron are simply
added together in a linear fashion. Real neurons, however, may have non-linear
interactions, meaning their inputs don’t just add up but interact in more complex ways.
2. Single Output vs. Spike Train: The M&P neuron produces just one output, either firing
or not firing, based on a threshold. Real neurons, however, send out a series of pulses,
called a "spike train," to represent information. So, real neurons don't just decide whether
to fire or not—they generate a sequence of signals that encode data.
3. Changing Thresholds: In the M&P model, the threshold for firing is constant. In real
neurons, the threshold can change depending on the current state of the organism, like
how much neurotransmitter is available, which influences the neuron’s sensitivity.
4. Asynchronous vs. Synchronous Updates: The M&P model updates neurons in a
regular, clocked sequence (synchronously). Real neurons don't work this way; they
update asynchronously, meaning they fire at different times, influenced by random
factors, not just a regular time cycle.
5. Excitatory and Inhibitory Weights: The M&P model allows weights (connections
between neurons) to change from positive to negative, which isn’t seen in real neurons. In
the brain, synaptic connections are either excitatory (increase the likelihood of firing) or
inhibitory (decrease the likelihood of firing), and they don't switch from one type to the
other.
6. Feedback Loops: Real neurons can have feedback connections where a neuron connects
back to itself. The M&P model typically doesn't include this, although it’s a feature in
some more advanced models.
7. Biological Complexity Ignored: The M&P model focuses on the basic idea of deciding
whether a neuron fires or not, leaving out more complex biological factors, such as
chemical concentrations or refractory periods (the time it takes for a neuron to reset
before firing again).
DESIGN A LEARNING SYSTEM
According to Tom Mitchell, “A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P, if its performance at tasks in T,
as measured by P, improves with experience E.”
Example: In Spam E-Mail detection,
• Task, T: To classify mails into Spam or Not Spam.
• Performance measure, P: Total percent of mails being correctly classified as being
“Spam” or “Not Spam”.
• Experience, E: A set of mails labelled “Spam” or “Not Spam”.
Step 1- Choosing the Training Experience: The first and most important task is to choose the
training data or training experience which will be fed to the Machine Learning Algorithm. The
data or experience that we feed to the algorithm has a significant impact on the Success or
Failure of the Model, so it should be chosen wisely.
Below are the attributes of the training experience which impact the Success or Failure of the
Model:
• The first attribute is whether the training experience provides direct or indirect feedback
regarding the choices made. For example: while playing chess, direct feedback tells the
learner the correct move for a given board position, whereas indirect feedback consists
only of move sequences and the final outcome of the game, from which the quality of
each move must be inferred.
• The second attribute is the degree to which the learner controls the sequence of
training examples. For example: the learner may rely on a teacher to select board
positions and supply the correct moves, or it may generate its own examples by playing
again and again against itself or an opponent, using the feedback to improve its
initially low accuracy.
• The third attribute is how well the training experience represents the distribution of
examples over which performance will be measured. For example, a Machine Learning
algorithm gains experience by going through many different cases and examples; its
performance improves most reliably when those cases resemble the ones it will face
when it is finally evaluated.
Step 2- Choosing the target function: The next important step is choosing the target function,
that is, deciding exactly what type of knowledge will be learned. For example, the learner can
aim to learn a NextMove function which, given the current board, selects the best move from
among the legal moves. For example: while playing chess, after the opponent moves, the
algorithm considers the possible legal moves and uses NextMove to decide which one is most
likely to lead to success.
Step 3- Choosing a Representation for the Target function: Once the algorithm knows all the
possible legal moves, the next step is to choose a representation for the function that picks
the optimized move, e.g. linear equations, a hierarchical graph representation, a tabular form,
etc. The NextMove function then selects, out of the available moves, the one with the highest
expected success rate. For example: if the machine has 4 possible moves while playing chess,
it chooses the move that gives it the best chance of success.
Step 5- Final Design: The final design is created at the end, after the system has gone through a
number of examples, failures and successes, and correct and incorrect decisions about what the
next step should be. Example: Deep Blue, IBM's chess computer, won a match against the world
chess champion Garry Kasparov in 1997, becoming the first computer to defeat a reigning world
champion in a match.
PERSPECTIVES AND ISSUES IN MACHINE LEARNING
One useful perspective on machine learning is that it involves searching a very large space of
possible hypotheses to determine one that best fits the observed data and any prior knowledge
held by the learner. For example, consider the space of hypotheses that could in principle be
output by a checkers learner whose evaluation function is a weighted sum of board features.
This hypothesis space consists of all evaluation functions that can be represented by some
choice of values for the weights w0 through w6. The learner's
task is thus to search through this vast space to locate the hypothesis that is most consistent with
the available training examples. The LMS algorithm for fitting weights achieves this goal by
iteratively tuning the weights, adding a correction to each weight each time the hypothesized
evaluation function predicts a value that differs from the training value. This algorithm works
well when the hypothesis representation considered by the learner defines a continuously
parameterized space of potential hypotheses.
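The LMS correction described above can be sketched as follows. This is a minimal illustration: the feature values, the training value, the learning rate and the use of three weights instead of seven are all assumptions made to keep the example short.

```python
def lms_update(weights, features, v_train, lr=0.1):
    """One LMS step: nudge each weight in proportion to the prediction error."""
    v_hat = sum(w * x for w, x in zip(weights, features))  # current estimate
    error = v_train - v_hat
    return [w + lr * error * x for w, x in zip(weights, features)]

# Hypothetical board description: x0 = 1 acts as a constant (bias) feature.
features = [1.0, 2.0, 0.0]
v_train = 10.0                 # assumed training value for this board
weights = [0.0, 0.0, 0.0]

for _ in range(50):
    weights = lms_update(weights, features, v_train)

v_hat = sum(w * x for w, x in zip(weights, features))
print(round(v_hat, 6))
```

A feature whose value is 0 leaves its weight unchanged, so only the features present on the board are adjusted.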
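Issues in Machine Learning
The field of machine learning is concerned with questions such as the following:
• What algorithms exist for learning general target functions from specific training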
examples? In what settings will particular algorithms converge to the desired function,
given sufficient training data? Which algorithms perform best for which types of
problems and representations?
• How much training data is sufficient? What general bounds can be found to relate the
confidence in learned hypotheses to the amount of training experience and the character
of the learner's hypothesis space?
• When and how can prior knowledge held by the learner guide the process of generalizing
from examples? Can prior knowledge be helpful even when it is only approximately
correct?
• What is the best strategy for choosing a useful next training experience, and how does the
choice of this strategy alter the complexity of the learning problem?
• What is the best way to reduce the learning task to one or more function approximation
problems? Put another way, what specific functions should the system attempt to learn?
Can this process itself be automated?
• How can the learner automatically alter its representation to improve its ability to
represent and learn the target function?
CONCEPT LEARNING TASK
Concept learning is a fundamental task in machine learning that involves training a model to
recognize and categorize patterns or concepts from a set of examples or data points. It's like
teaching a machine to understand the underlying rules of a specific concept, such as identifying a
cat in an image or predicting whether a customer will make a purchase.
Key Concepts
• Target Concept: The underlying rule or pattern that the model aims to learn.
• Training Data: A set of labeled examples used to train the model. Each example consists
of an input and its corresponding output (label).
• Hypothesis: A proposed rule or function that the model learns from the training data.
• Generalization: The ability of the model to accurately classify new, unseen data based on
the learned concept.
Concept learning involves exploring a hypothesis space to identify the hypothesis that best
explains the training examples. This hypothesis space is implicitly defined by the hypothesis
representation chosen by the learning algorithm designer. By selecting a specific representation,
the designer determines the space of all hypotheses the program can represent and learn.
Example: EnjoySport Learning Task
In the EnjoySport learning task, we aim to find a hypothesis (rule) that determines whether the
weather conditions are favorable for enjoying sports. Let's break it down step by step.
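Instance space (X): This represents all possible combinations of weather attributes. The
attributes and their possible values are:
• Sky: Sunny, Cloudy, Rainy
• Temp: Warm, Cold
• Humidity: Normal, High
• Wind: Strong, Weak
• Water: Warm, Cool
• Forecast: Same, Change
To find the total number of possible weather conditions (instances in X), multiply the
number of possible values for each attribute:
|X| = 3 × 2 × 2 × 2 × 2 × 2 = 3 × 32 = 96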
Hypothesis space (H): A hypothesis is a rule that classifies instances as positive or negative.
For each attribute, a hypothesis can use a specific value (e.g., "Sunny") or one of two
additional symbols: ? (accepts any value; a hypothesis made only of ? is the most general
hypothesis) and Ø (rejects every value; a hypothesis made only of Ø is the most specific
hypothesis).
Syntactically distinct hypotheses: Sky has 3 + 2 = 5 options, and each of the other 5 attributes
has 2 + 2 = 4 options. The total number of syntactically distinct hypotheses is:
|H| = 5 × 4 × 4 × 4 × 4 × 4 = 5120
Semantically distinct hypotheses: Any hypothesis containing one or more "Ø" classifies all
instances as negative, so all such hypotheses are equivalent and are counted once. The remaining
hypotheses use only specific values and "?", giving:
1 + (4 × 3 × 3 × 3 × 3 × 3) = 1 + (4 × 243) = 1 + 972 = 973
Having counted the syntactically and semantically distinct hypotheses, the learner searches
among them for the hypothesis that best matches the training examples.
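The three counts can be checked with a short calculation. This sketch assumes only the number of values per attribute used in the products above (3 for Sky, 2 for each of the others).

```python
from math import prod

# Number of possible values for each attribute.
values_per_attribute = {
    "Sky": 3, "Temp": 2, "Humidity": 2, "Wind": 2, "Water": 2, "Forecast": 2,
}
counts = list(values_per_attribute.values())

n_instances = prod(counts)                      # every combination of values
n_syntactic = prod(c + 2 for c in counts)       # each attribute also allows ? and Ø
n_semantic = 1 + prod(c + 1 for c in counts)    # one "all negative" hypothesis + those using values and ?

print(n_instances, n_syntactic, n_semantic)
```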
The FIND-S algorithm is a simple way to find a rule (or hypothesis) that matches all the positive
examples in a dataset while ignoring the negative ones. It works step by step, starting with a
very specific rule and gradually making it more general to include all positive examples. Here's
how it works in an easy way:
FIND-S Algorithm
The FIND-S algorithm is like starting with the most specific guess and slowly relaxing it until it
fits all the examples.
1. Start small: Begin with the most specific rule (e.g., "Only this exact weather works").
2. Fix the rule: For each good (positive) example, check if your rule matches it:
o If it does, great—do nothing!
o If it doesn’t, make the rule a bit more general (e.g., "Okay, maybe it works if the wind
isn’t strong").
3. Finish: When you’re done, you have a rule that matches all the good examples.
Example:
Imagine you're trying to figure out what kind of weather makes you enjoy playing a sport, using
this data:
• Start with the most specific rule: h=(Ø, Ø, Ø, Ø, Ø, Ø), which means "no instance is
accepted yet."
• Look at the first positive example: (Sunny, Warm, Normal, Strong, Warm, Same)
o Rule becomes: h=(Sunny, Warm, Normal, Strong, Warm, Same)
• Look at the second positive example: (Sunny, Warm, High, Strong, Warm, Same)
o Update the rule to match both examples:
h=(Sunny, Warm, ?, Strong, Warm, Same)
• Ignore the third example, which is negative.
• Look at the fourth example, which is positive: (Sunny, Warm, High, Strong, Cool, Change)
o Update the rule again: h=(Sunny, Warm, ?, Strong, ?, ?)
Final rule: (Sunny, Warm, ?, Strong, ?, ?). This means you enjoy playing sports if it’s sunny,
warm, and windy, regardless of the other conditions.
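The walkthrough above can be reproduced with a short sketch of FIND-S. The three positive examples are the ones listed in the walkthrough; the attribute values of the negative example are an assumption (taken from the standard EnjoySport data) and do not affect the result, because FIND-S ignores negative examples.

```python
def find_s(examples):
    """Return the most specific hypothesis that covers every positive example."""
    h = None  # None stands for the all-Ø hypothesis, which accepts nothing
    for instance, label in examples:
        if label != "Yes":
            continue  # FIND-S ignores negative examples
        if h is None:
            h = list(instance)  # first positive example: copy it exactly
        else:
            # Keep matching attribute values, replace mismatches with "?"
            h = [hv if hv == xv else "?" for hv, xv in zip(h, instance)]
    return h

examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "Yes"),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), "Yes"),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), "No"),  # assumed values
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), "Yes"),
]

final_h = find_s(examples)
print(final_h)
```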
Properties and Limitations of FIND-S
• Good for Clean Data: Works well if the data is perfect (no mistakes or noise).
• Ignores Negatives: It doesn't use negative examples to refine the rule.
• May Miss Other Rules: If there are multiple valid rules, it picks the most specific one
but doesn’t explore other options.
In short, FIND-S is like a detective who focuses only on positive clues and tries to make the
most specific case for what’s true!
VERSION SPACES AND THE CANDIDATE ELIMINATION ALGORITHM
VERSION SPACES
A version space is a set of all hypotheses (rules) that are consistent with the given training data.
It represents everything the learner currently knows about the target concept.
The "version" refers to the different possibilities or hypotheses that might explain the data.
How It Works:
A version space has two boundaries:
1. Specific boundary (S): The most specific hypotheses consistent with the data.
2. General boundary (G): The most general hypotheses consistent with the data.
• Efficient Representation: Instead of listing all hypotheses, it tracks only the boundaries
S and G.
• Keeps Track of Knowledge: Helps understand what the learner knows and doesn’t know
yet.
• Flexible Search: Allows for adding or removing examples to refine the boundaries.
A version space is the range of hypotheses consistent with the training data, bounded by the
most specific (S) and the most general (G) hypotheses. It narrows down as you process more
examples, zeroing in on the true concept.
CANDIDATE ELIMINATION ALGORITHM
It’s a way to learn rules for when something happens (like "Play Sport = Yes") by narrowing
down possibilities. The algorithm works by keeping two boundaries:
1. Specific Boundary (S): The most specific hypotheses that still cover every positive
example seen so far.
2. General Boundary (G): The most general hypotheses that still exclude every negative
example seen so far.
The Data
Sky Temp Humidity Wind Water Forecast Play Sport?
Our goal is to find the rule for when "Play Sport" is Yes.
Step-by-Step Execution
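Start with Initial S and G: S=⟨Ø,Ø,Ø,Ø,Ø,Ø⟩ (the most specific hypothesis) and
G=⟨?,?,?,?,?,?⟩ (the most general hypothesis).
First example: It’s a positive example, so S is generalized just enough to cover it:
S=⟨Sunny,Warm,Normal,Strong,Warm,Same⟩. G is unchanged.
Second example: It’s another positive example. S is too specific, so we generalize it to fit
both positive examples: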
• Compare each attribute in S to the new example:
o Sky = Sunny: Matches, no change.
o Temp = Warm: Matches, no change.
o Humidity = Normal vs High: Doesn’t match, so generalize to ?.
o Wind = Strong: Matches, no change.
o Water = Warm: Matches, no change.
o Forecast = Same: Matches, no change.
• Updated S:
S=⟨Sunny,Warm,?,Strong,Warm,Same⟩
(This means Humidity can be anything now.)
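Third example: It’s a negative example. S is unchanged, and we specialize G just enough to
exclude this negative example while staying as general as possible.
Fourth example: It’s a positive example. S needs to generalize further to fit this example:
compare it with the previous S and replace every attribute that does not match with ?.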
Why is G=⟨Sunny,?,?,?,?,?⟩?
1. The General Boundary (G) starts very broad (because it initially includes everything).
2. Negative examples force G to become more specific, and positive examples remove any
hypothesis in G that fails to cover them. What remains includes only the conditions that
must be true for playing sports ("Yes").
3. The Sky condition (Sunny) is the only attribute that must always be true in the general
rule.
4. The other attributes (Temperature, Humidity, Wind, Water, Forecast) can be anything
since the general rule still covers all the positive examples we’ve seen so far.
When we look at all the positive examples (the ones where Play Sport = Yes), we find that they
all have Sky = Sunny. Since G should cover all positive examples, we make Sky = Sunny a
condition in the rule. The other attributes (Temperature, Humidity, Wind, etc.) are left
unconstrained in the general rule, so we leave them as wildcards ( ? ).
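The idea of a hypothesis being "consistent with the training data", and of S and G bounding the version space, can be checked with a small sketch. The positive examples are those used in the walkthrough; the values of the negative example are an assumption (taken from the standard EnjoySport data), and the helper names are hypothetical.

```python
def covers(h, instance):
    """A hypothesis covers an instance if every attribute is '?' or matches exactly."""
    return all(hv == "?" or hv == xv for hv, xv in zip(h, instance))

def consistent(h, examples):
    """Consistent means: covers every positive example and no negative example."""
    return all(covers(h, x) == (label == "Yes") for x, label in examples)

def more_general_or_equal(h1, h2):
    """True if h1 accepts every instance that h2 accepts (hypotheses without Ø)."""
    return all(a == "?" or a == b for a, b in zip(h1, h2))

examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "Yes"),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), "Yes"),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), "No"),  # assumed values
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), "Yes"),
]

S = ("Sunny", "Warm", "?", "Strong", "?", "?")
G = ("Sunny", "?", "?", "?", "?", "?")
between = ("Sunny", "?", "?", "Strong", "?", "?")               # lies between S and G
too_specific = ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same")

print(consistent(S, examples), consistent(G, examples),
      consistent(between, examples), consistent(too_specific, examples))
```

Every hypothesis that is at least as general as S and at least as specific as G, such as `between`, is also consistent, which is why tracking only the two boundaries is enough.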
LINEAR DISCRIMINANTS
Machine learning models are often used to solve supervised learning tasks,
particularly classification problems, where the goal is to assign data points to specific categories
or classes. However, as datasets grow larger with more features, it becomes challenging for
models to process the data effectively. This is where dimensionality reduction techniques like
Linear Discriminant Analysis (LDA) come into play.
LDA not only helps to reduce the number of features but also ensures that the important class-
related information is retained, making it easier for models to differentiate between classes.
Linear Discriminant Analysis (LDA) is a supervised learning technique used for classification
tasks. It helps distinguish between different classes by projecting data points onto a lower-
dimensional space, maximizing the separation between those classes.
LDA performs two key roles: it reduces the dimensionality of the data, and it separates the
classes for classification.
The core idea of Linear Discriminant Analysis (LDA) is to find a new axis that best separates
different classes by maximizing the distance between them. LDA achieves this by reducing the
dimensionality of the data while retaining the class-discriminative information.
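For two classes, the "new axis" can be computed with Fisher's linear discriminant, where the direction is proportional to the inverse within-class scatter matrix times the difference of the class means. The sketch below is a minimal pure-Python illustration for 2-D data; the data points and function names are hypothetical.

```python
def mean(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]

def scatter(points, m):
    """2x2 scatter matrix: sum of outer products of deviations from the mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class0, class1):
    """Return w = Sw^-1 (m1 - m0), the axis that best separates the two classes."""
    m0, m1 = mean(class0), mean(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]  # within-class scatter
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def project(w, p):
    """Reduce a 2-D point to a single number along the axis w."""
    return w[0] * p[0] + w[1] * p[1]

class0 = [(1, 2), (2, 3), (3, 3)]   # hypothetical data
class1 = [(6, 5), (7, 8), (8, 7)]
w = fisher_direction(class0, class1)
proj0 = [project(w, p) for p in class0]
proj1 = [project(w, p) for p in class1]
print(max(proj0) < min(proj1))
```

After projection each point is described by one number instead of two, and on this data the two classes do not overlap along the new axis.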
Applications:
1. Face Recognition: LDA helps extract features from facial images, classifying them based on
individuals. It is commonly used in biometric systems to identify or verify users.
2. Disease Diagnosis in Healthcare: LDA is used to analyze medical data for classifying diseases,
such as distinguishing between different stages of cancer or predicting the presence of heart
disease.
4. Credit Risk Assessment in Finance: Financial institutions use LDA to assess credit risk by
analyzing customer data to predict the likelihood of loan defaults or creditworthiness.
PERCEPTRON
Perceptrons Are Based on Biological Neurons
Originally proposed in 1957, perceptrons were one of the earliest components of artificial
neural networks. The structure of the perceptron is based on the anatomy of neurons. Neurons
have several parts, but for our purposes, the most important parts are the dendrites, which
receive inputs from other neurons, and the axon, which produces outputs.
Neuron Activation
Neurons “fire” – that is, produce an output – in an all or nothing way. The outputs of a neuron
are essentially 0 or 1. On or off. A neuron will “fire” if the input signals at the dendrites are
sufficiently large, collectively. If the amount of input signal at the dendrites is high enough, the
neuron will “fire” and produce an output. But if the amount of input signal is insufficient, the
neuron will not produce an output. Put simply, the neuron sums up the inputs, and if the collective
input signals meet a certain threshold, then it will produce an output. If the collective input
signals are under the threshold, it will not produce an output. This is, of course, a very simple
explanation of how a neuron works (because they are very complex at the chemical level), but
it’s roughly accurate.
What is Perceptron?
The perceptron is a fundamental concept in machine learning and one of the
simplest types of artificial neural networks. It is primarily used for binary classification tasks
and is based on the idea of learning a linear decision boundary to separate data points into two
classes. The perceptron algorithm was introduced by Frank Rosenblatt (proposed in 1957 and
published in 1958). It operates on a set of input features and produces an output that is either
1 or −1 (or 1 or 0, depending on the implementation). The model is trained iteratively, adjusting
its weights based on the error between predicted and actual labels.
1. Input Features: Take a vector of input features (x1, x2, …, xn) from the dataset.
2. Compute Weighted Sum: Calculate z = w1x1 + w2x2 + … + wnxn + b, where w1, …, wn are the
weights and b is the bias.
3. Apply Activation Function: Use the step function on z to decide the output (1 or 0).
4. Update Weights (During Training):
• If the predicted output is incorrect, adjust the weights and bias using the Perceptron
Learning Algorithm.
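The four steps above can be written compactly as follows. This is the commonly taught form of the perceptron learning rule; the learning rate symbol η and the choice of 0 as the firing threshold are assumptions made here for illustration.

```latex
% Weighted sum of the inputs plus the bias
z = \sum_{i=1}^{n} w_i x_i + b

% Step activation with outputs coded as 1 or 0
\hat{y} =
\begin{cases}
1 & \text{if } z > 0 \\
0 & \text{otherwise}
\end{cases}

% Update applied after each training example (\eta is the learning rate)
w_i \leftarrow w_i + \eta\,(y - \hat{y})\,x_i, \qquad
b \leftarrow b + \eta\,(y - \hat{y})
```

When the prediction is correct, y − ŷ is zero and nothing changes. When it is wrong, each weight moves in the direction that would have made the weighted sum closer to the correct side of the threshold.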
A perceptron is like a very basic "brain" for a machine. It looks at input data (numbers) and
makes a decision: Class A or Class B (e.g., "yes" or "no").
Step 1: Inputs
• Imagine you have some input features (e.g., x1, x2).
Step 2: Weights
• Each input has a weight (w1, w2) that tells the perceptron how important that input is.
Step 3: Weighted Sum
• Multiply each input by its weight, and add them all together. Then, add a bias (b), which is like a
nudge to adjust the sum.
Step 4: Activation
• Use a simple rule: If the weighted sum (z) is positive, output 1 (e.g., "yes").
• If it’s negative or zero, output −1 (e.g., "no").
Step 5: Compare and Update
• Compare the perceptron’s guess (ŷ) with the actual answer (y).
• If it’s correct, you’re good! If it’s wrong, adjust the weights and bias.
Step 6: Repeat
• Go through the dataset multiple times, adjusting weights and bias each time the perceptron makes
a mistake.
Key Features
• Linear Model: The perceptron can only separate data that is linearly separable.
• Supervised Learning: It requires labeled data for training.
• Binary Classification: It predicts one of two possible classes (1 or −1).
Strengths
Limitations
• Cannot Handle Non-linear Data: It fails when data is not linearly separable.
• Binary Outputs: Limited to binary classification tasks.
• Sensitive to Feature Scaling: Requires normalization or scaling for effective learning.
Applications
EXAMPLE: Here's an example of a perceptron for the logical AND function with the given
parameters:
x1 x2 AND Output y
0 0 0
0 1 0
1 0 0
1 1 1
For each input, compute the weighted sum and update weights if necessary, using:
2. For (0,1):
✅ No update needed.
3. For (1,0):
4. For (1,1):
✅ Correct.
Final Weights
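One way to run the AND example end to end is sketched below. The zero initial weights and bias, the learning rate of 1.0, the 0/1 output coding and the update rule w ← w + lr·(y − ŷ)·x are all assumptions chosen for illustration, so the weights this sketch reaches need not match the values used in the worked steps above.

```python
# Perceptron trained on the logical AND truth table.
# ASSUMPTIONS (not from the notes): zero initial weights and bias,
# learning rate 1.0, outputs coded as 0/1, threshold at 0.

AND_TABLE = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def step(z):
    """Step activation: fire (1) only when the weighted sum is positive."""
    return 1 if z > 0 else 0

def predict(w, b, x):
    """Weighted sum of the two inputs plus bias, passed through the step function."""
    return step(w[0] * x[0] + w[1] * x[1] + b)

def train(table, lr=1.0, max_epochs=20):
    """Repeat passes over the table until one full pass makes no mistakes."""
    w = [0.0, 0.0]
    b = 0.0
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for x, y in table:
            error = y - predict(w, b, x)
            if error != 0:
                # Move weights and bias toward the correct answer.
                w[0] += lr * error * x[0]
                w[1] += lr * error * x[1]
                b += lr * error
                mistakes += 1
        if mistakes == 0:
            return w, b, epoch
    return w, b, max_epochs

weights, bias, epochs_used = train(AND_TABLE)
predictions = [predict(weights, bias, x) for x, _ in AND_TABLE]
print("weights:", weights, "bias:", bias, "epochs:", epochs_used)
print("predictions:", predictions)
```

Because AND is linearly separable, the loop stops after a finite number of passes with every row classified correctly. Different starting values or learning rates generally lead to different, equally valid, final weights.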
LINEAR SEPARABILITY
Linear separability means that you can draw a straight line (or a flat surface, or a hyperplane)
that separates two groups of data points perfectly without any overlap.
You can draw a straight line between the two classes, and the points on one side belong to Class
1, while the points on the other side belong to Class 2.
This line is the decision boundary, and the data is linearly separable because the line separates
the two classes without overlap.
Real-World Applications
1. Image Classification:
o Linear separability is rare; deep learning handles non-linear boundaries.
2. Medical Diagnosis:
o Linearly separable cases may involve straightforward conditions; complex
diseases often require advanced methods.
3. Spam Detection:
o Simple keyword-based filters assume linear separability, while modern techniques
use non-linear models.
Why is Linear Separability Important?
The concept of linear separability helps us decide which machine learning algorithms to use.
Some algorithms work well when the data is linearly separable, while others are better for more
complex, non-linearly separable data.
• Linear Models (e.g., Perceptron, linear SVM): These work best when the data is linearly
separable. They try to find a straight line or plane (a hyperplane) that divides the data.
• Non-Linear Models (e.g., Neural Networks, Decision Trees): These are more flexible
and can handle non-linearly separable data. They can create complex decision
boundaries.
x1 x2 OR Output y
0 0 0
0 1 1
1 0 1
1 1 1
x1 x2 XOR Output y
0 0 0
0 1 1
1 0 1
1 1 0
1. Linearly Separable Data:
o Blue points (+1) are separated from red points (−1) by the dashed green line
(x2=x1+1).
o The data is perfectly separable by this straight line.
2. Non-Linearly Separable Data (XOR):
o Blue points (+1) and red points (−1) are arranged in such a way that no single
straight line can separate the classes.
o This is a classic XOR problem where the decision boundary requires a more
complex, non-linear solution.
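The difference between the OR and XOR tables can be checked with a small brute-force search: try a grid of candidate lines w1·x1 + w2·x2 + b = 0 and see whether any of them puts every row on the correct side. The function name and the integer grid are illustrative assumptions. Note that an empty search result on a finite grid is not a proof in general; for XOR it simply agrees with the known fact that no separating line exists.

```python
from itertools import product

# Truth tables as ((x1, x2), y) rows, matching the OR and XOR tables above.
OR_TABLE = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
XOR_TABLE = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def separating_line(table, grid=(-2, -1, 0, 1, 2)):
    """Return some (w1, w2, b) from the grid that classifies every row, else None.

    A row is classified as 1 when w1*x1 + w2*x2 + b > 0 and as 0 otherwise.
    ASSUMPTION: the small integer grid is an arbitrary choice for illustration.
    """
    for w1, w2, b in product(grid, repeat=3):
        if all((1 if w1 * x1 + w2 * x2 + b > 0 else 0) == y
               for (x1, x2), y in table):
            return (w1, w2, b)
    return None

or_line = separating_line(OR_TABLE)
xor_line = separating_line(XOR_TABLE)
print("OR  separating line:", or_line)
print("XOR separating line:", xor_line)
```

The search finds a line for OR but returns None for XOR, which is why a single perceptron can learn OR and AND but not XOR.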
Advantages of Linear Separability
1. Simplicity:
o Linear separability allows using simple models with fewer parameters.
2. Faster Training:
o Models converge quickly during training due to straightforward optimization.
3. Interpretability:
o Easy to visualize and understand the decision boundary.
4. Optimal Solution:
o Algorithms like SVM find the maximum-margin boundary, the separating hyperplane that
lies farthest from the nearest points of both classes.
5. Good Generalization:
o Models are less likely to overfit due to their simplicity.
Limitations of Linear Separability
1. Limited Applicability:
o Many real-world datasets are not linearly separable.
2. Lack of Flexibility:
o Cannot capture complex patterns in the data.
3. Over-Simplification:
o May miss subtle relationships or nuances.
4. Sensitive to Noise:
o Outliers or noisy data near the boundary can disrupt the model.
5. Feature Dependence:
o Requires feature transformations for non-linearly separable data.
6. Failure for Non-Linearly Separable Data:
o Cannot separate inherently non-linear datasets without additional techniques.
LINEAR REGRESSION
Linear regression is a fundamental supervised learning algorithm used in machine learning for
modeling the relationship between one or more independent variables (features) and a dependent
variable (target). The goal is to find the best-fit line (or hyperplane in higher dimensions) that
minimizes the error in predicting the dependent variable.
In machine learning, labeled datasets contain input data (features) and output labels (target
values). For linear regression, we represent features as independent variables
and target values as the dependent variable. The model predicts a continuous output variable from
the independent input variables, for example predicting house prices from parameters
such as house age, distance from the main road, location, and area.
In the above data, the target House Price is the dependent variable represented by Y, and the
feature, Square Feet, is the independent variable represented by X. The input features (X) are
used to predict the target label (Y). So, the independent variables are also known as predictor
variables, and the dependent variable is known as the response variable.
The main goal of the linear regression model is to find the best-fitting straight line (often called a
regression line) through a set of data points.
Line of Regression
A straight line that shows a relation between the dependent variable and independent variables is
known as the line of regression or regression line.
Simple linear regression is a type of regression analysis in which a single independent variable
(also known as a predictor variable) is used to predict the dependent variable. In other words, it
models the linear relationship between the dependent variable and a single independent variable.
In the above image, the straight line represents the simple linear regression line where Ŷ is the
predicted value, and X is the input value.
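Mathematically, the relationship can be modelled as a linear equation:
Y=w0+w1X+ϵ
Where,
• Y is the dependent variable (target).
• X is the independent variable (feature).
• w0 is the intercept of the line.
• w1 is the slope (coefficient) of the line.
• ϵ is the error term.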
Multiple linear regression extends simple linear regression to predict a response using two or
more independent variables (features). The model is expressed as:
Y=w0+w1X1+w2X2+⋯+wpXp+ϵ
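Where,
• Y is the dependent variable (target).
• X1, X2, …, Xp are the independent variables (features).
• w0 is the intercept.
• w1, w2, …, wp are the coefficients (weights) of the features.
• ϵ is the error term.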
The main goal of linear regression is to find the best-fit line through a set of data points that
minimizes the difference between the actual values and predicted values. This is done by
estimating the parameters w0, w1, etc.
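For simple linear regression, one standard way to estimate w0 and w1 is the closed-form ordinary least squares solution: the slope is the covariance of X and Y divided by the variance of X, and the intercept makes the line pass through the point of means. The notes do not say which estimation method they use, so treat this as one possible sketch. The square-feet and price numbers are made up for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares estimates (w0, w1) for the model y = w0 + w1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    w1 = sxy / sxx
    w0 = y_mean - w1 * x_mean
    return w0, w1

# HYPOTHETICAL data: house size in square feet and price in thousands.
# The prices were generated from price = 50 + 0.1 * square_feet.
square_feet = [1000, 1500, 2000, 2500]
price = [150, 200, 250, 300]

w0, w1 = fit_simple_linear_regression(square_feet, price)
predicted_price = w0 + w1 * 1800
print("intercept w0:", w0, "slope w1:", w1)
print("predicted price for 1800 sq ft:", predicted_price)
```

Because the toy prices lie exactly on a line, the fit recovers that line up to floating-point rounding. With real, noisy data the estimated line would pass between the points rather than through all of them.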
The working of linear regression in machine learning can be broken down into three steps:
• Hypothesis: We assume that there is a linear relation between input and output.
• Cost Function: Define a loss or cost function. The cost function quantifies the model's
prediction error. It takes the model's predicted values and the actual values
and returns a single scalar value that represents the cost of the model's predictions.
• Optimization: Optimize (minimize) the model's cost function by updating the model's
parameters.
It continues updating the model's parameters until the cost or error of the model's prediction is
optimized (minimized).
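The hypothesis, cost function and optimization steps can be sketched with batch gradient descent on the mean squared error. Gradient descent is one common optimizer, used here as an assumption since the notes do not name a specific method. The learning rate, step count and toy data are likewise illustrative choices.

```python
def predict(w0, w1, x):
    """Hypothesis: a straight line with intercept w0 and slope w1."""
    return w0 + w1 * x

def cost(w0, w1, xs, ys):
    """Cost function: mean squared error over all data points."""
    return sum((predict(w0, w1, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Optimization: repeatedly move (w0, w1) against the gradient of the cost."""
    w0, w1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [predict(w0, w1, x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_w0 = (2.0 / n) * sum(errors)
        grad_w1 = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        w0 -= lr * grad_w0
        w1 -= lr * grad_w1
    return w0, w1

# HYPOTHETICAL data generated from y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

initial_cost = cost(0.0, 0.0, xs, ys)
gd_w0, gd_w1 = gradient_descent(xs, ys)
final_cost = cost(gd_w0, gd_w1, xs, ys)
print("w0:", gd_w0, "w1:", gd_w1)
print("cost before:", initial_cost, "cost after:", final_cost)
```

The learning rate matters: if it is too large for the data's scale, the updates overshoot and the cost grows instead of shrinking, which is one reason features are often rescaled before training.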
In linear regression problems, we assume that there is a linear relationship between input features
(X) and predicted value (Ŷ). The hypothesis function returns the predicted value for a given
input value. Generally we represent a hypothesis by hw(X) and it is equal to Ŷ.
For different values of the parameters (weights), we can find many regression lines. The main goal
is to find the best-fit line.
A regression line is said to be the best fit if the error between actual and predicted values is
minimal.
The image below shows a regression line with error (ε) at input data point X. The error is
calculated for all data points, and our goal is to minimize the average error (loss). We can use
different types of loss functions, such as mean squared error (MSE), mean absolute error (MAE),
L1 loss, L2 loss, etc.
Loss Function for Linear Regression
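The error between actual and predicted values can be quantified using a loss function, also
called the cost function. The cost function takes the model's predicted values and actual values
and returns a single scalar value that represents the cost of the model's predictions. Our main
goal is to minimize the cost function.
The most commonly used cost function is the mean squared error (MSE):
J(w) = (1/n) Σ (Yi − Ŷi)², summed over i = 1, …, n
Where,
• n is the number of data points.
• Yi is the actual value for the i-th data point.
• Ŷi is the predicted value for the i-th data point.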
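As a small illustration of how a cost function turns a set of predictions into one number, the sketch below computes the mean squared error and, for comparison, the mean absolute error on the same predictions. The function names and the three sample values are assumptions made for illustration.

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences; large errors are penalized more heavily."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_error(actual, predicted):
    """Average of absolute differences; every unit of error counts equally."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# HYPOTHETICAL values: the individual errors are 1, 0 and -2.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

mse_value = mean_squared_error(actual, predicted)
mae_value = mean_absolute_error(actual, predicted)
print("MSE:", mse_value)
print("MAE:", mae_value)
```

Squaring makes the single error of size 2 contribute four times as much as the error of size 1, which is why MSE is more sensitive to outliers than MAE.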
1. Predictive Modeling: Linear regression is widely used for predictive modeling. For instance,
in real estate, predicting house prices based on features such as size, location, and number of
bedrooms can help buyers, sellers, and real estate agents make informed decisions.
2. Feature Selection: In multiple linear regression, analyzing the coefficients can help in feature
selection. When the features are on comparable scales, features with small or zero coefficients
might be considered less important and can be dropped to simplify the model.
3. Financial Forecasting: In finance, linear regression models predict stock prices, economic
indicators, and market trends. Accurate forecasts can guide investment strategies and financial
planning.
4. Risk Management: Linear regression helps in risk assessment by modeling the relationship
between risk factors and financial metrics. For example, in insurance, it can model the
relationship between policyholder characteristics and claim amounts.
1. Overfitting: Overfitting occurs when the regression model performs well on training data but
lacks generalization on test data. Overfitting leads to poor prediction on new, unseen data.
2. Multicollinearity: When the independent variables (predictor or feature variables) are
correlated with one another, the situation is known as multicollinearity. In this case, the
estimates of the parameters (coefficients) can be unstable.
3. Outliers and Their Impact: Outliers can cause the regression line to be a poor fit for the
majority of data points.
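A quick, informal check for the multicollinearity issue listed above is to compute the Pearson correlation between pairs of features: a value close to +1 or −1 suggests the two features carry nearly the same information. This is only a first screening step under that assumption, not a complete diagnostic. The feature names and values are hypothetical.

```python
from math import sqrt

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    x_var = sum((x - x_mean) ** 2 for x in xs)
    y_var = sum((y - y_mean) ** 2 for y in ys)
    if x_var == 0 or y_var == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return cov / sqrt(x_var * y_var)

# HYPOTHETICAL house features: rooms grows in lockstep with area, age does not.
area_sqft = [1000, 1500, 2000, 2500, 3000]
rooms = [2, 3, 4, 5, 6]
age_years = [30, 5, 20, 12, 8]

corr_area_rooms = pearson_correlation(area_sqft, rooms)
corr_area_age = pearson_correlation(area_sqft, age_years)
print("area vs rooms:", corr_area_rooms)
print("area vs age:  ", corr_area_age)
```

Here area and rooms are perfectly correlated, so a regression that includes both cannot tell their individual effects apart, and dropping or combining one of them is a common remedy.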