Maharana Pratap Group of Institutions, Mandhana, Kanpur
(Approved By AICTE, New Delhi And Affiliated To AKTU, Lucknow)
Digital Notes
Department of Computer Science & Engineering
Course: B.TECH
Branch: CSE 3rd Yr
Subject: Machine Learning Techniques
Subject code: BCDS062
UNIT 1
• INTRODUCTION –
• Learning, Types of Learning, Well defined learning
problems, Designing a Learning System, History of
ML, Introduction of Machine Learning Approaches
–(Artificial Neural Network, Clustering,
Reinforcement Learning, Decision Tree Learning,
Bayesian networks, Support Vector Machine,
Genetic Algorithm), Issues in Machine Learning
and Data Science Vs Machine Learning;
UNIT 2
• REGRESSION: Linear Regression and Logistic
Regression BAYESIAN LEARNING - Bayes
theorem, Concept learning, Bayes Optimal
Classifier, Naïve Bayes classifier, Bayesian belief
networks, EM algorithm. SUPPORT VECTOR
MACHINE: Introduction, Types of support vector
kernel – (Linear kernel, Polynomial kernel, and
Gaussian kernel), Hyperplane – (Decision
surface), Properties of SVM, and Issues in SVM.
UNIT 3
• DECISION TREE LEARNING –
• Decision tree learning algorithm, Inductive
bias, Inductive inference with decision trees,
Entropy and information theory, Information
gain, ID-3 Algorithm, Issues in Decision tree
learning. INSTANCE-BASED LEARNING – k-
Nearest Neighbour Learning, Locally Weighted
Regression, Radial basis function networks,
Case-based learning
UNIT 4
• ARTIFICIAL NEURAL NETWORKS –
• Perceptrons, Multilayer perceptron, Gradient descent
and the Delta rule, Multilayer networks, Derivation of
Backpropagation Algorithm, Generalization,
Unsupervised Learning – SOM Algorithm and its variant;
DEEP LEARNING - Introduction, concept of convolutional
neural network, Types of layers – (Convolutional layers,
Activation function, Pooling, Fully connected), Concept
of Convolution (1D and 2D) layers, Training of network,
Case study of CNN, e.g. on Diabetic Retinopathy,
Building a smart speaker, Self-driving car etc.
UNIT 5
• REINFORCEMENT LEARNING – Introduction to
Reinforcement Learning, Learning Task, Example of
Reinforcement Learning in Practice, Learning Models
for Reinforcement – (Markov Decision process , Q
Learning - Q Learning function, Q Learning Algorithm),
Application of Reinforcement Learning, Introduction to
Deep Q Learning. GENETIC ALGORITHMS: Introduction,
Components, GA cycle of reproduction, Crossover,
Mutation, Genetic Programming, Models of Evolution
and Learning, Applications.
Machine Learning
• Machine Learning: Machine learning is a branch of artificial intelligence that
enables algorithms to uncover hidden patterns within datasets. It allows them
to predict new, similar data without explicit programming for each task.
Machine learning finds applications in diverse fields such as image and speech
recognition, natural language processing, recommendation systems, fraud
detection, portfolio optimization, and automating tasks.
• Machine learning’s impact extends to autonomous vehicles, drones, and
robots, enhancing their adaptability in dynamic environments. This approach
marks a breakthrough where machines learn from data examples to generate
accurate outcomes, closely intertwined with data mining and data science.
Need for Machine Learning
• Here are some specific areas where machine
learning is being used:
1. Predictive modeling
2. Natural language processing
3. Computer vision
4. Fraud detection
5. Recommendation systems
Difference between Machine Learning, Traditional
Programming and Artificial Intelligence
Machine Learning Lifecycle
• Defining the Problem: Clearly identify the real-world problem to be solved.
• Data Collection: Gather necessary data from various sources.
• Data Cleaning and Preprocessing: Resolve data quality issues and prepare the data for
analysis.
• Exploratory Data Analysis (EDA): Analyze data to identify patterns, outliers, and trends.
• Feature Engineering and Selection: Enhance data features and select relevant ones to
improve model performance.
• Model Selection: Choose suitable models based on the problem type and data
characteristics.
• Model Training: Train the model using a split of training and validation datasets.
• Model Evaluation and Tuning: Assess and optimize the model using relevant metrics.
• Model Deployment: Implement the model in a production environment for real-time
predictions.
• Model Monitoring and Maintenance: Regularly check and update the model to
maintain accuracy.
Types of Machine Learning
• 1. Supervised Machine Learning
• Supervised learning algorithms are trained on the
labeled dataset. They learn to map input features to
targets based on labeled training data. There are two
main types of supervised learning:
• Regression: Regression algorithm learns to predict
continuous values based on input features.
• Classification: Classification algorithm learns to assign
input data to a specific category or class based on input
features. The output labels in classification are discrete
values.
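• As a quick illustrative sketch (not part of the syllabus text, and using the iris dataset and logistic regression purely as an example), a supervised classifier learns the mapping from labelled input features to class labels:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# labeled dataset: features X and target classes y
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# the classifier learns a mapping from input features to class labels
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))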
2. Unsupervised Machine Learning
• Unsupervised learning algorithm learns to recognize patterns in data
without being explicitly trained using labeled examples. The goal is to
discover the underlying structure or distribution in the data.
• There are two main types of unsupervised learning:
• Clustering: Clustering algorithms group similar data points together
based on their characteristics. The goal is to identify groups, or
clusters, of data points that are similar to each other, while being
distinct from other groups.
• Dimensionality reduction: Dimensionality reduction algorithms
reduce the number of input variables in a dataset while preserving as
much of the original information as possible. This is useful for
reducing the complexity of a dataset and making it easier to visualize
and analyze.
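• A minimal sketch of dimensionality reduction, assuming scikit-learn's PCA and the iris data purely for illustration; no labels are used, since the learning is unsupervised:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)   # labels are ignored: unsupervised
pca = PCA(n_components=2)           # project the 4 original features down to 2 components
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
print("Variance retained:", pca.explained_variance_ratio_.sum().round(3))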
3. Reinforcement Machine Learning
• In Reinforcement Learning, an agent learns to interact with an environment by performing
actions and receiving rewards or penalties based on its actions. The goal of reinforcement
learning is to learn a policy, which is a mapping from states to actions, that maximizes the
expected cumulative reward over time.
• There are two main types of reinforcement learning:
• Model-based reinforcement learning: The agent learns a model of the environment, including
the transition probabilities between states and the rewards associated with each state-action
pair. The agent then uses this model to plan its actions in order to maximize its expected reward.
• Model-free reinforcement learning: The agent learns a policy directly from experience without
explicitly building a model of the environment. The agent interacts with the environment and
updates its policy based on the rewards it receives.
Various Applications of Machine Learning
• Automation: Machine learning can work entirely autonomously in many fields without the need for human intervention. For example, robots perform the essential process steps in manufacturing plants.
• Finance Industry: Machine learning is growing in popularity in the finance industry. Banks are mainly using
ML to find patterns inside the data but also to prevent fraud.
• Government organization: The government makes use of ML to manage public safety and utilities. Take the
example of China with its massive face recognition. The government uses Artificial intelligence to prevent
jaywalking.
• Healthcare industry: Healthcare was one of the first industries to use machine learning with image detection.
• Marketing: AI is used broadly in marketing thanks to abundant access to data. Before the age of mass
data, researchers developed advanced mathematical tools like Bayesian analysis to estimate the value of a
customer. With the boom of data, the marketing department relies on AI to optimize customer relationships
and marketing campaigns.
• Retail industry: Machine learning is used in the retail industry to analyze customer behavior, predict demand,
and manage inventory. It also helps retailers to personalize the shopping experience for each customer by
recommending products based on their past purchases and preferences.
• Transportation: Machine learning is used in the transportation industry to optimize routes, reduce fuel
consumption, and improve the overall efficiency of transportation systems. It also plays a role in autonomous
vehicles, where ML algorithms are used to make decisions about navigation and safety.
Limitations of Machine Learning
• Data Availability: Machines require sufficient data to
learn; without it, learning cannot occur.
• Diversity in Data: A lack of diversity within the dataset
can significantly hinder machine learning processes.
• Need for Heterogeneity: Diverse and varied data are
crucial for extracting meaningful insights.
• Impact of Low Variation: Algorithms struggle to derive
information from datasets with minimal variation.
• Observations Per Group: It is recommended to have at
least 20 observations per group to ensure effective
learning.
Well Defined Learning Problem
• A computer program is said to learn from
experience E with respect to some class of tasks
T and performance measure P, if its performance
in tasks T, as measured by P, improves with
experience E.
• Features in a Learning Problem
• The class of tasks (T)
• The measure of performance to be improved (P)
• The source of experience (E)
Examples of Well Defined Learning
Problem
• Checkers Learning Problem
• Task (T): Playing Checkers
• Performance Measure (P): Percent of games
won against opponents.
• Training Experience (E): Playing practice games
against itself.
Handwriting Recognition Learning Problem
• Task (T): Recognizing and classifying
handwritten words within images.
• Performance Measure (P): Percent of words
correctly classified.
• Training Experience (E): A dataset of
handwritten words with given classifications.
Robot Driving Learning Problem
• Task (T): Driving on public four-lane highways
using vision cameras.
• Performance Measure (P): Average distance
travelled before an error (as judged by a
human observer).
• Training Experience (E): A sequence of images
and steering commands recorded while
observing a human driver.
Design a Learning System in Machine
Learning
• When we feed the training data to a machine learning algorithm, the algorithm produces a mathematical model. With the help of this model, the machine makes predictions and takes decisions without being explicitly programmed.
• Example: In a driverless car, the training data fed to the algorithm covers how to drive the car on highways and on busy and narrow streets, with factors like speed limits, parking, stopping at signals, etc. A logical and mathematical model is created on this basis, and the car then works according to that model. Also, the more data that is fed, the more efficient the output produced.
Designing a Learning System in Machine
Learning :
• According to Tom Mitchell, “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
• Example: In Spam E-Mail detection,
• Task, T: To classify mails into Spam or Not Spam.
• Performance measure, P: Total percent of mails being
correctly classified as being “Spam” or “Not Spam”.
• Experience, E: A set of mails with their labels (“Spam” / “Not Spam”)
• Step 1 – Choosing the Training Experience: The first and most important task is to choose the training data or training experience which will be fed to the machine learning algorithm.
• Below are the attributes which impact the success or failure of the learner:
• First, whether the training experience provides direct or indirect feedback regarding the choices made. For example: while playing chess, the training experience can give feedback such as “instead of this move, choosing that move would increase the chances of success.”
• The second important attribute is the degree to which the learner controls the sequence of training examples. For example: when training data is first fed to the machine, its accuracy is very low, but as it gains experience by playing again and again against itself or an opponent, the algorithm receives feedback and controls the chess game accordingly.
• The third important attribute is how well the training experience represents the distribution of examples over which the final performance will be measured. A machine learning algorithm gains experience by going through a number of different cases and examples; the more (and more varied) examples it passes through, the better its performance becomes.
• Step 2 – Choosing the Target Function: The next important step is choosing the target function. This means that, based on the knowledge fed to the algorithm, the machine learning system will choose a NextMove function which describes what type of legal move should be taken. For example: while playing chess against an opponent, when the opponent plays, the machine learning algorithm decides which of the possible legal moves to take in order to succeed.
• Step 3 – Choosing a Representation for the Target Function: Once the machine knows all the possible legal moves, the next step is to choose a representation for the target function, e.g. linear equations, a hierarchical graph representation, tabular form, etc. Out of the candidate moves, the NextMove function then selects the move that offers the highest success rate. For example: if the machine has 4 possible moves while playing chess, it will choose the optimized move which provides the most success.
• Step 4 – Choosing a Function Approximation Algorithm: An optimized move cannot be chosen from the training data alone. The learner has to go through a set of examples; from these examples it approximates which steps to choose, and the feedback on the outcome is used to adjust the function. Example: when training data for playing chess is fed to the algorithm, the machine will at first fail or succeed more or less at random; from each failure or success it estimates, for the next move, which step should be chosen and what its success rate is.
• Step 5 – Final Design: The final design is created at last, when the system has gone through a number of examples, failures and successes, and correct and incorrect decisions, and has learned what the next step should be. Example: Deep Blue, an intelligent chess computer, won a match against the chess expert Garry Kasparov and became the first computer to beat a reigning human world chess champion.
Artificial Neural Network (ANN)
• Artificial neural networks (ANNs) are created to replicate how the human
brain processes data in computer systems.
• Neurons within interconnected units collaborate to identify patterns, acquire
knowledge from data, and generate predictions. Artificial neural networks
(ANNs) are commonly employed in activities such as identifying images,
processing language, and making decisions.
• Like human brains, artificial neural networks are made up of neurons that are
connected like brain cells. These neurons process and receive information
from nearby neurons before sending it to other neurons.
Artificial Neural Networks Architecture
• There are three kinds of layers in the network architecture: the input layer, the hidden layers (one or more), and the output layer. A typical feedforward network processes information in one direction, from input to output. Because of its numerous layers, such a network is often referred to as an MLP (Multi-Layer Perceptron).
• It is possible to think of the hidden layer as a “distillation layer,”
which extracts some of the most relevant patterns from the
inputs and sends them on to the next layer for further analysis.
• It accelerates and improves the efficiency of the network by
recognizing just the most important information from the inputs
and discarding the redundant information.
Activation function
• The activation function is important for two reasons: first, it introduces non-linearity, allowing the model to capture non-linear relationships between the inputs; second, it contributes to converting the input into a more usable final output.
• Finding the “optimal values of W — weights” that minimize prediction error is
critical to building a successful model. The “backpropagation algorithm” is a
method by which neural networks work, converting ANN into a learning
algorithm by learning from mistakes.
• The optimization approach uses a “gradient descent” technique to reduce prediction errors. This technique is a cornerstone of supervised learning, as it iteratively adjusts the weights to minimize errors. In order to find the optimum values of W, we try small adjustments in W and examine the impact on prediction errors; ultimately, we choose as ideal those W values beyond which further changes in W do not reduce the errors.
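• A rough illustration of this idea (a toy sketch, not from these notes): one sigmoid neuron trained by repeated small gradient-descent adjustments of W; the data values and learning rate below are assumptions chosen only for the example.
import numpy as np

# toy data: 4 samples, 2 features, binary targets (assumed values for illustration)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 0., 0., 1.])

rng = np.random.default_rng(0)
W = rng.normal(size=2)   # weights to be learned
b = 0.0                  # bias
lr = 0.5                 # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(1000):
    y_hat = sigmoid(X @ W + b)        # forward pass
    error = y_hat - y                 # prediction error
    grad_W = X.T @ error / len(y)     # gradient of the loss w.r.t. W
    grad_b = error.mean()             # gradient w.r.t. b
    W -= lr * grad_W                  # small adjustment in W ...
    b -= lr * grad_b                  # ... in the direction that reduces error

print(W, b, sigmoid(X @ W + b).round(2))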
Types of Artificial Neural Networks
Five Types of Artificial Neural Networks:
• Feedforward Neural Networks (FNNs): These are straightforward networks
where information flows in one direction, like from the input to the output.
They’re used for tasks like identifying patterns in data or making predictions,
making them ideal for pattern recognition.
• Convolutional Neural Networks (CNNs): Think of these as networks designed
specifically for understanding images. They’re great at recognizing patterns in
pictures, making them perfect for tasks like identifying objects in photos or
videos.
• Recurrent Neural Networks (RNNs): These networks are good with sequences,
like predicting the next word in a sentence or understanding the context of
words. They remember previous information, which helps them understand
the current data better.
• Long Short-Term Memory Networks (LSTMs): LSTMs are a type of RNN
that are really good at remembering long sequences of data. They’re
often used in tasks where understanding context over time is
important, like translating languages or analyzing time-series data.
• Generative Adversarial Networks (GANs): These networks are like
artists. One part of the network generates new data, like images or
music, while the other part critiques it to make sure it looks or sounds
realistic. GANs are a key technology in generative AI. GANs are used for
creating new content, enhancing images, or even generating
deepfakes.
How to Perform An Artificial Neural
Network Step-by-step
• Now it is time to build the ANN. There are many Python platforms that can be used to perform machine learning. Here, Google Colab is used since no library installation is required; you just need a Gmail account.
• Step 1: Upload Dataset
• Step 1 of any data analysis in Google Colab is to upload the
data set. Here, the Pima Indian diabetes dataset is uploaded
first.
• This dataset is available on Kaggle (link). The diabetes
dataset contains several columns that may play an
important role in the risk of diabetes. These columns are:
• Number of pregnancies
• Glucose
• Blood pressure
• Skin thickness
• Insulin
• BMI
• Diabetes pedigree function
• Age
• Outcome (diabetic or non-diabetic)
Step 2: Importing The Libraries
• Libraries in Python play a pivotal role in data analysis. Here, NumPy, pandas, and TensorFlow are imported. It is important to remember that you may face different types of challenges in different datasets; hence, you may have to use a different set of libraries or many other libraries.
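• A minimal sketch of the imports described above (the library choice simply follows the text):
import numpy as np
import pandas as pd
import tensorflow as tf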
Step 3: Data Preprocessing
• It is difficult to imagine any data analysis without preprocessing of the data set. Again, every dataset comes with a different challenge; hence, good knowledge of Python is essential. The most common challenges in data preprocessing are the exclusion of inconsequential columns and the conversion of categorical data, such as gender, into 0 and 1.
• The current diabetes dataset only requires separation of the independent (X) and dependent (y) variables. The pandas library is used to read the dataset and to form the dependent and independent variables.
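• A small sketch of this step; the file name 'diabetes.csv' is an assumption about how the uploaded Kaggle file was saved:
dataset = pd.read_csv('diabetes.csv')
X = dataset.iloc[:, :-1].values   # all columns except the last: independent variables
y = dataset.iloc[:, -1].values    # last column 'Outcome': dependent variable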
Step 4: Splitting the Dataset into the Training set and Test set
• Splitting of the dataset into training and test sets is
necessary. The training set is used to apply the machine
learning model. Hence, a large portion of data is
randomly selected as a training set from the whole
dataset. Here, 80% of data is selected as a training set
(test_size = 0.2).
• The random state is required to achieve the same split of the dataset each time we run the ANN; otherwise, there will be a slight change in results. The test set is used to evaluate the model. The scikit-learn library is required to split the dataset.
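• A minimal sketch of the split described above (80% training, 20% test, with a fixed random state):
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)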
Step 5: Feature Scaling
• The most frequent question asked in machine learning is when to use feature scaling: before splitting the dataset or after splitting the dataset?
• You must have got a clue that it should be done after splitting the dataset into training and test sets. Feature scaling is performed to normalize or standardize all independent variables. Some variables, such as age and salary, are on totally different scales and hence may have a different effect on Euclidean distance (there are many other ways to calculate distance, such as Manhattan distance). Therefore, all independent variables should be on the same scale. If we perform feature scaling before splitting the dataset, the mean value of the whole dataset will influence the result; hence, feature scaling should be done after splitting the dataset. Feature scaling is an essential step, as there is going to be a lot of computation in the ANN and you wouldn’t want any independent variable to dominate any other variable.
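• A minimal sketch of feature scaling with scikit-learn's StandardScaler (fitted on the training set only, as discussed above):
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit the scaler on the training set only
X_test = sc.transform(X_test)        # reuse the same scaling for the test set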
• Step 6: Building the ANN model
The next step of ANN is to build a model.
• Step 6.1: Initializing the ANN
Earlier you were loading Tensorflow and Keras separately to create a
sequential model. Keras library is now integrated into the new version of
TensorFlow (2.0). Sequential class is used to initialize the ANN as a sequence
of layers.
Step 6.2: Adding the Input Layer and the First
Hidden Layer
• The next few steps require building the sequence of layers for the ANN. In this step, you add the input layer and the first hidden layer. To add these layers, you use the Dense class; this class is used in whichever phase of the neural network we are in, as is also evident from the next few steps. The add method of the Sequential class (via the ‘ann’ variable created in the previous step) is used to add a layer such as a hidden layer.
• The layers module provides the classes, i.e. the layers you want to add to the ANN. The number of units here is the number of hidden neurons in the layer; the input neurons are simply all of our independent variables.
• The next most important question in ANN is: how many hidden neurons (and hidden layers) do you want?
• Is there any specific rule, or should it be based on trial and error? There is no specific rule to choose the number of units; it is found experimentally, by trying different hyperparameters. Here, four units are used, based on several trials. Many different functions may be required depending on the approach; in these notes you will see just two of the unit and activation function choices. For a fully connected hidden layer, the rectifier activation function (relu) is used. The rectifier function is the one most commonly suggested for hidden layers. Nonetheless, it is good to have a basic understanding of all activation functions, as things may change with different datasets.
• Step 6.3: Adding the Second Hidden Layer
The next step is adding the second hidden layer. All required steps
given below are explained above (step 6.2)
• Step 6.4: Adding the output layer
• Adding the output layer of an ANN is slightly different from adding a hidden layer. It is important to know the dimension of the output layer. In this dataset, you are predicting a binary variable (0 or 1), hence the dimension is one; you just need one neuron to predict the final output. Remember, this is an example of a classification approach. Another important change in this layer is the activation function. Here, the sigmoid function is used, because it not only gives a better prediction than the rectifier function for this output but also provides probabilities. Hence, you will get a prediction of whether someone has diabetes or not, along with the probability.
• Now all layers required for ANN are ready.
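• A minimal sketch pulling Steps 6.1–6.4 together (4 units per hidden layer follows the trial-and-error discussion above; the rest is standard Keras usage):
ann = tf.keras.models.Sequential()                              # Step 6.1: initialize the ANN
ann.add(tf.keras.layers.Dense(units=4, activation='relu'))      # Step 6.2: first hidden layer
ann.add(tf.keras.layers.Dense(units=4, activation='relu'))      # Step 6.3: second hidden layer
ann.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))   # Step 6.4: output layer (binary outcome)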
• Step 7: Training the ANN
The next step is to train the created model on our training set. Training of ANN requires
two steps.
• Step 7.1: Compiling the ANN
The first step is to compile the ANN with an optimizer, a loss function, and metrics. Here, in metrics, you are using the accuracy function, as you are doing classification (binary dependent variable). The first argument is the optimizer. The optimizer is the optimization algorithm we want to use to find the optimal set of weights in the neural network. Here, you will be using the ‘adam’ optimizer, which is an extension of stochastic gradient descent. If you check the mathematical details of stochastic gradient descent, you can find that it is based on the loss function that you need to optimize to find the optimal weights. The loss function is not the sum of squared errors as in linear regression; it is going to be a logarithmic function called the logarithmic loss. When the weights are updated after each observation, or after each batch of many observations, the algorithm uses this loss criterion to improve the model’s performance. The name adam is derived from adaptive moment estimation. Next comes computing the cross-entropy loss between true labels and predicted labels; use this cross-entropy loss when there are only two label classes (assumed to be 0 and 1).
• Step 7.2: Training the ANN on the Training set
• The next step is fitting the ANN on the training dataset. Two arguments in this example are used to fit the training model: epochs and batch size. The sample size for the diabetes dataset is 768. A batch size of 50 is the number of samples processed before our ANN model is updated. One epoch is when the entire dataset is passed forward and backward through the neural network exactly once.
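• A minimal sketch of Steps 7.1 and 7.2 together; the batch size of 50 follows the text, while the epoch count of 100 is an assumption (any reasonable value can be used):
# Step 7.1: compile with the adam optimizer, binary cross-entropy loss and accuracy metric
ann.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Step 7.2: train on the training set
ann.fit(X_train, y_train, batch_size=50, epochs=100)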
• Step 8: Making the Predictions and Evaluating the Model
Now the ANN model is ready and has been trained on the training dataset. But how would we know if it is good? We will check our model using the test dataset, however, only using the independent variables. Results predicted for the test dataset will then be compared with the original results.
• Step 8.1: Predicting the Test set results
As the dataset being used has a binary outcome, a classification approach of supervised machine learning is used here. The first step is to predict the outcome of the test dataset using the independent variables, here X_test, with the ‘ann’ model trained on the training dataset. As you used the sigmoid activation function for the output, the results are probabilities. Next, you have to convert those probabilities into true (> 0.5) or false (< 0.5), and then change those true and false values into 1 and 0.
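• A minimal sketch of this prediction step, thresholding the sigmoid probabilities at 0.5:
y_pred = ann.predict(X_test)                    # probabilities from the sigmoid output
y_pred = (y_pred > 0.5).astype(int).reshape(-1) # convert to 1 (diabetic) or 0 (non-diabetic)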
Step 8.2: Making the Confusion Matrix
• The confusion matrix is used to compare the predicted results of the test dataset with the original results of the test dataset. The scikit-learn library is required to create the confusion matrix, here a 2 x 2 table.
Result:
True positive: 92 (Diabetic)
True negative: 30 (non-diabetic)
False-positive: 15 (Non-diabetic but predicted as diabetic)
False-negative: 17 (diabetic but predicted as non-diabetic)
Step 10: Calculate the Accuracy score
Accuracy calculation also requires the Scikit library. It shows
how good the model is.
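A minimal sketch covering Step 8.2 and the accuracy calculation with scikit-learn:
from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)   # Step 8.2: 2 x 2 table of true vs predicted labels
print(cm)
print("Accuracy:", accuracy_score(y_test, y_pred))  # fraction of correctly classified samples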
What is Clustering?
• The task of grouping data points based on their similarity with each
other is called Clustering or Cluster Analysis. This method is defined
under the branch of unsupervised learning, which aims at gaining
insights from unlabelled data points.
• Think of it this way: you have a dataset of customers’ shopping habits. Clustering can help you group customers with similar purchasing behaviors, which can then be used for targeted marketing, product recommendations, or customer segmentation.
Types of Clustering
• Broadly speaking, there are 2 types of clustering that can be performed
to group similar data points:
• Hard Clustering:
• In this type of clustering, each data point either belongs to a cluster completely or not at all. For example, let’s say there are 4 data points and we have to cluster them into 2 clusters. Each data point will then belong either to cluster 1 or to cluster 2.
• Soft Clustering: In this type of clustering, instead of assigning each data point to a single cluster, a probability or likelihood of that point belonging to each cluster is evaluated. For example, let’s say there are 4 data points and we have to cluster them into 2 clusters. We will then evaluate the probability of each data point belonging to both clusters; this probability is calculated for all data points.
Types of Clustering Methods
• Various types of clustering algorithms are:
• Centroid-based Clustering (Partitioning methods)
• Density-based Clustering (Model-based methods)
• Connectivity-based Clustering (Hierarchical clustering)
1. Centroid-based Clustering (Partitioning
methods)
• K-means and
• K-medoids clustering
2. Density-based Clustering (Model-based
methods)
• DBSCAN and
• OPTICS (Ordering Points To Identify Clustering
Structure).
Connectivity-based Clustering (Hierarchical
clustering)
• Divisive Clustering: It follows a top-down approach; here we consider all data points to be part of one big cluster, and then this cluster is divided into smaller groups.
• Agglomerative Clustering: It follows a bottom-up approach; here we consider all data points to be parts of individual clusters, and then these clusters are merged together to make one big cluster with all data points.
Implementation of K-Means Clustering in
Python
• Step 1: Importing the necessary libraries
• We are importing Numpy for statistical
computations, Matplotlib to plot the graph,
and make_blobs from sklearn.datasets.
Step 2: Create the custom dataset with make_blobs and plot it
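A minimal sketch covering Steps 1 and 2; the make_blobs parameters (500 samples, 3 centers, random_state) are assumptions chosen only for illustration:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# synthetic 2-D dataset with three natural groups
X, _ = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)

plt.scatter(X[:, 0], X[:, 1])
plt.grid(True)
plt.show()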
Step 3: Initialize the random centroids
• The code initializes three clusters for K-means
clustering. It sets a random seed and
generates random cluster centers within a
specified range, and creates an empty list of
points for each cluster.
Step 4: Plot the random initialize center with data points
• The plot displays a scatter plot of data points
(X[:,0], X[:,1]) with grid lines. It also marks the
initial cluster centers (red stars) generated for
K-means clustering.
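A sketch consistent with the description of Steps 3 and 4; the seed and the [-2, 2] initialization range are assumptions:
# Step 3: random initialization of k = 3 cluster centers with an empty list of points each
k = 3
np.random.seed(23)
clusters = {}
for idx in range(k):
    center = 2 * (2 * np.random.random((X.shape[1],)) - 1)  # random point in [-2, 2] x [-2, 2]
    clusters[idx] = {'center': center, 'points': []}

# Step 4: plot the data points together with the randomly initialized centers (red stars)
plt.scatter(X[:, 0], X[:, 1])
plt.grid(True)
for idx in clusters:
    plt.scatter(*clusters[idx]['center'], marker='*', color='red')
plt.show()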
Step 5: Define Euclidean distance
def distance(p1, p2):
    return np.sqrt(np.sum((p1 - p2) ** 2))
Step 6: Create the function to Assign and Update the cluster
center
• The E-step assigns data points to the nearest cluster center, and the M-step updates the cluster centers based on the mean of the points assigned to them, as in K-means clustering.
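A sketch of the two functions described above; the names assign_clusters and update_clusters are assumptions for illustration:
# E-step: assign every point to its nearest center
def assign_clusters(X, clusters):
    for idx in clusters:
        clusters[idx]['points'] = []
    for point in X:
        nearest = min(clusters, key=lambda idx: distance(point, clusters[idx]['center']))
        clusters[nearest]['points'].append(point)
    return clusters

# M-step: move every center to the mean of its assigned points
def update_clusters(X, clusters):
    for idx in clusters:
        if clusters[idx]['points']:
            clusters[idx]['center'] = np.mean(clusters[idx]['points'], axis=0)
    return clusters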
Step 7: Create the function to Predict the cluster for the
datapoints
Step 9: Plot the data points with their predicted cluster center
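A sketch covering the prediction and plotting steps; the choice of 10 E/M iterations is an assumption:
# Step 7: predict the cluster of each data point (index of the nearest center)
def pred_cluster(X, clusters):
    return np.array([min(clusters, key=lambda idx: distance(point, clusters[idx]['center']))
                     for point in X])

# run a few E/M iterations, then plot points coloured by predicted cluster
for _ in range(10):
    clusters = assign_clusters(X, clusters)
    clusters = update_clusters(X, clusters)
pred = pred_cluster(X, clusters)

plt.scatter(X[:, 0], X[:, 1], c=pred)
for idx in clusters:
    plt.scatter(*clusters[idx]['center'], marker='^', color='red')
plt.show()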
Reinforcement Learning
• Reinforcement Learning (RL) is a branch of machine
learning focused on making decisions to maximize
cumulative rewards in a given situation. Unlike
supervised learning, which relies on a training
dataset with predefined answers, RL involves
learning through experience.
• In RL, an agent learns to achieve a goal in an
uncertain, potentially complex environment by
performing actions and receiving feedback through
rewards or penalties.
• Reinforcement Learning is a type of machine learning paradigm in which a learning algorithm is trained not on preset data but rather through a feedback system. These algorithms are touted as the future of machine learning, as they eliminate the cost of collecting and cleaning the data.
Types of Reinforcement
• Positive: Positive reinforcement is defined as an event, occurring because of a particular behavior, that increases the strength and frequency of that behavior. In other words, it has a positive effect on behavior.
Advantages of positive reinforcement:
– Maximizes performance
– Sustains change for a long period of time
Drawback: too much reinforcement can lead to an overload of states, which can diminish the results.
• Negative: Negative reinforcement is defined as the strengthening of a behavior because a negative condition is stopped or avoided.
Advantages of negative reinforcement:
– Increases behavior
– Provides defiance to a minimum standard of performance
Drawback: it only provides enough to meet the minimum behavior.
Elements of Reinforcement Learning
i) Policy: Defines the agent’s behavior at a given
time.
ii) Reward Function: Defines the goal of the RL
problem by providing feedback.
iii) Value Function: Estimates long-term rewards
from a state.
iv) Model of the Environment: Helps in
predicting future states and rewards for planning.
Example: CartPole Environment in OpenAI Gym
The CartPole environment is a classic reinforcement learning problem where the goal is
to balance a pole on a cart by applying forces to the left or right.
import gym
import numpy as np
import warnings

# Suppress specific deprecation warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Load the environment with render mode specified
env = gym.make('CartPole-v1', render_mode="human")

# Initialize the environment to get the initial state
state = env.reset()

# Print the state space and action space
print("State space:", env.observation_space)
print("Action space:", env.action_space)

# Run a few steps in the environment with random actions
for _ in range(10):
    env.render()                        # Render the environment for visualization
    action = env.action_space.sample()  # Take a random action

    # Take a step in the environment
    step_result = env.step(action)

    # Check the number of values returned and unpack accordingly
    if len(step_result) == 4:
        next_state, reward, done, info = step_result
        terminated = done
    else:
        next_state, reward, done, truncated, info = step_result
        terminated = done or truncated

    print(f"Action: {action}, Reward: {reward}, Next State: {next_state}, Done: {done}, Info: {info}")

    if terminated:
        state = env.reset()  # Reset the environment if the episode is finished

env.close()  # Close the environment when done
Decision Tree
• A decision tree is a supervised learning algorithm used for
both classification and regression tasks.
• They break complex decisions into smaller steps, making
them easy to understand and implement.
• It models decisions as a tree-like structure where internal
nodes represent attribute tests, branches
represent attribute values, and leaf nodes represent final
decisions or predictions.
• Decision trees are widely used in machine learning for
predictive modeling.
Types of Decision Tree
• Classification trees: They are designed to predict categorical outcomes, i.e. they classify data into different classes. For example, they can determine whether an email is “spam” or “not spam” based on various features of the email.
• Regression trees: These are used when the target variable is continuous. They predict numerical values rather than categories. For example, a regression tree can estimate the price of a house based on its size, location, and other features.
Example of Decision Tree
• Imagine you’re deciding whether to buy an umbrella:
• Step 1 – Ask a Question (Root Node):
Is it raining?
If yes, you might decide to buy an umbrella. If no, you move
to the next question.
• Step 2 – More Questions (Internal Nodes):
If it’s not raining, you might ask:
Is it likely to rain later?
If yes, you buy an umbrella; if no, you don’t.
• Step 3 – Decision (Leaf Node):
Based on your answers, you either buy or skip the umbrella
Decision Tree Algorithms
• ID3: Ross Quinlan is credited with the development of ID3, which is shorthand for “Iterative Dichotomiser 3.” This algorithm leverages entropy and information gain as metrics to evaluate candidate splits; Quinlan’s research on this algorithm dates from 1986.
• This algorithm measures how mixed up the data is at a node using entropy. It then chooses the feature that helps to clarify the data the most.
• C4.5: This is an improved version of ID3 that can handle missing data and continuous attributes. This algorithm is considered a later iteration of ID3 and was also developed by Quinlan. It can use information gain or gain ratios to evaluate split points within the decision trees.
• CART: The term CART is an abbreviation for “classification and regression trees” and was introduced by Leo Breiman. This algorithm typically utilizes Gini impurity to identify the ideal attribute to split on. Gini impurity measures how often a randomly chosen data point would be misclassified if labelled according to the class distribution of the node; when evaluating using Gini impurity, a lower value is more ideal. CART can be used for both classification (sorting data into categories) and regression (predicting continuous values) tasks.
Decision Tree Terminologies
• Root Node: The initial node at the beginning of a decision tree, where the entire
population or dataset starts dividing based on various features or conditions.
• Decision Nodes: Nodes resulting from the splitting of root nodes are known as decision
nodes. These nodes represent intermediate decisions or conditions within the tree.
• Leaf Nodes: Nodes where further splitting is not possible, often indicating the final
classification or outcome. Leaf nodes are also referred to as terminal nodes.
• Sub-Tree: Just as a subsection of a graph is called a sub-graph, a sub-section of a decision tree is referred to as a sub-tree. It represents a specific portion of the decision tree.
• Pruning: The process of removing or cutting down specific nodes in a tree to prevent overfitting and simplify the model.
• Branch / Sub-Tree: A subsection of the entire tree is referred to as a branch or sub-tree. It represents a specific path of decisions and outcomes within the tree.
• Parent and Child Node: In a decision tree, a node that is divided into sub-nodes is
known as a parent node, and the sub-nodes emerging from it are referred to as child
nodes. The parent node represents a decision or condition, while the child nodes
represent the potential outcomes or further decisions based on that condition.
Advantages of Decision Trees
• Easy to Understand: They are simple to visualize and
interpret, making them easy to understand even for non-
experts.
• Handles Both Numerical and Categorical Data: They can work
with both types of data without needing much preprocessing.
• No Need for Data Scaling: These trees do not require
normalization or scaling of data.
• Automated Feature Selection: They automatically identify the
most important features for decision-making.
• Handles Non-Linear Relationships: They can capture non-
linear patterns in the data effectively.
Disadvantages of Decision Trees
• Overfitting Risk: It can easily overfit the training data,
especially if they are too deep.
• Unstable with Small Changes: Small changes in data can lead
to completely different trees.
• Biased with Imbalanced Data: They tend to be biased if one
class dominates the dataset.
• Limited to Axis-Parallel Splits: They struggle with diagonal or
complex decision boundaries.
• Can Become Complex: Large trees can become hard to
interpret and may lose their simplicity.
How to choose the best attribute at each node
• Entropy: entropy measures the "impurity" or randomness of
a dataset, helping the algorithm decide how to split nodes to
create the purest subsets, leading to more accurate
predictions.
• Information gain: Information Gain tells us how useful a
question (or feature) is for splitting data into groups.
• Entropy(S) = − Σc p(c) log2 p(c), where:
• S represents the data set for which entropy is calculated
• c represents the classes in set S
• p(c) represents the proportion of data points that belong to class c out of the total number of data points in set S
• Entropy values can fall between 0 and 1. If all samples
in data set, S, belong to one class, then entropy will
equal zero. If half of the samples are classified as one
class and the other half are in another class, entropy
will be at its highest at 1. In order to select the best
feature to split on and find the optimal decision tree,
the attribute with the smallest amount of entropy
should be used.
• Information gain represents the difference in entropy before and after a split on a given attribute. The attribute with the highest information gain will produce the best split, as it does the best job of classifying the training data according to its target classification. Information gain is usually represented with the following formula:
• Gain(S, A) = Entropy(S) − Σv∈Values(A) (|Sv| / |S|) · Entropy(Sv), where A is the attribute being split on and Sv is the subset of S for which attribute A has value v.
Gini Impurity
• Gini impurity is the probability of incorrectly classifying a random data point in the dataset if it were labeled based on the class distribution of the dataset. Similar to entropy, if set S is pure (i.e. all points belong to one class), then its impurity is zero. It is denoted by the following formula:
• Gini(S) = 1 − Σc p(c)²
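• A small Python sketch of these three measures on a toy label array (the helper names are assumptions for illustration):
import numpy as np

def entropy(labels):
    # Entropy(S) = -sum p(c) * log2 p(c)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini(S) = 1 - sum p(c)^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def information_gain(parent_labels, child_label_groups):
    # Gain = Entropy(parent) - weighted sum of child entropies
    n = len(parent_labels)
    weighted = sum(len(child) / n * entropy(child) for child in child_label_groups)
    return entropy(parent_labels) - weighted

# toy example: a perfectly mixed node split into two pure children
parent = np.array([0, 0, 1, 1])
print(entropy(parent), gini(parent))                       # 1.0 and 0.5
print(information_gain(parent, [parent[:2], parent[2:]]))  # 1.0: the split removes all impurity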
Bayesian networks
• A Bayesian network is a probabilistic graphical model which represents a set of variables and their conditional dependencies using a directed acyclic graph.
• It is also called a Bayes network, belief network, decision network, or Bayesian model.
• Bayesian networks are probabilistic, because these networks are built from a probability
distribution, and also use probability theory for prediction and anomaly detection.
• Real world applications are probabilistic in nature, and to represent the relationship
between multiple events, we need a Bayesian network. It can also be used in various
tasks including prediction, anomaly detection, diagnostics, automated insight,
reasoning, time series prediction, and decision making under uncertainty.
• Bayesian Network can be used for building models from data
and experts opinions, and it consists of two parts:
• Directed Acyclic Graph
• Table of conditional probabilities.
• The generalized form of Bayesian network that represents and
solve decision problems under uncertain knowledge is known
as an Influence diagram.
• A Bayesian network graph is made up of nodes and Arcs
(directed links), where:
• Each node corresponds to the random variables, and a variable
can be continuous or discrete.
• Arcs or directed arrows represent the causal relationships or conditional probabilities between random variables. These directed links or arrows connect pairs of nodes in the graph; a link means that one node directly influences the other node, and if there is no directed link between two nodes, they are independent of each other.
– In the above diagram, A, B, C, and D are random variables
represented by the nodes of the network graph.
– If we are considering node B, which is connected with node A by a
directed arrow, then node A is called the parent of Node B.
– Node C is independent of node A.
• Example: the classic Rain – Sprinkler – Grass wet network, with its conditional probability tables:

Rain:
  P(R) = 0.2, P(~R) = 0.8

Sprinkler (given Rain):
  Rain = T: P(S) = 0.01, P(~S) = 0.99
  Rain = F: P(S) = 0.4,  P(~S) = 0.6

Grass wet (given Rain and Sprinkler):
  ~R, ~S: P(W) = 0
  R, ~S:  P(W) = 0.8
  ~R, S:  P(W) = 0.9
  R, S:   P(W) = 0.99
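• A minimal Python sketch (variable names are assumptions) that uses the tables above to compute the marginal probability of wet grass by summing over all rain/sprinkler combinations:
# Conditional probability tables from the Rain / Sprinkler / Grass-wet example above
P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {True: {True: 0.01, False: 0.99},   # P(S | Rain = T)
               False: {True: 0.4, False: 0.6}}    # P(S | Rain = F)
P_wet = {(False, False): 0.0, (True, False): 0.8,
         (False, True): 0.9, (True, True): 0.99}  # P(W = T | Rain, Sprinkler)

# Marginal P(grass wet) = sum over all (rain, sprinkler) combinations
p_wet = sum(P_rain[r] * P_sprinkler[r][s] * P_wet[(r, s)]
            for r in (True, False) for s in (True, False))
print("P(grass wet) =", round(p_wet, 4))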
Support Vector Machine (SVM) Algorithm
• Support Vector Machine (SVM) is a supervised machine
learning algorithm used for classification and regression
tasks. While it can handle regression problems, SVM is
particularly well-suited for classification tasks.
• SVM aims to find the optimal hyperplane in an N-
dimensional space to separate data points into different
classes. The algorithm maximizes the margin between
the closest points of different classes
SVM Terminology
• Hyperplane: A decision boundary separating different classes in
feature space, represented by the equation wx + b = 0 in linear
classification.
• Support Vectors: The closest data points to the hyperplane, crucial for
determining the hyperplane and margin in SVM.
• Margin: The distance between the hyperplane and the support
vectors. SVM aims to maximize this margin for better classification
performance.
• Kernel: A function that maps data to a higher-dimensional space,
enabling SVM to handle non-linearly separable data.
• Hard Margin: A maximum-margin hyperplane that perfectly separates
the data without misclassifications.
• Soft Margin: Allows some misclassifications by introducing slack
variables, balancing margin maximization and misclassification
penalties when data is not perfectly separable.
• C: A regularization term balancing margin maximization and
misclassification penalties. A higher C value enforces a stricter penalty
for misclassifications.
• Hinge Loss: A loss function penalizing misclassified points or margin
violations, combined with regularization in SVM.
• Dual Problem: Involves solving for Lagrange multipliers associated with
support vectors, facilitating the kernel trick and efficient computation.
How does Support Vector Machine Algorithm Work?
• The key idea behind the SVM
algorithm is to find the
hyperplane that best separates
two classes by maximizing the
margin between them. This
margin is the distance from the
hyperplane to the nearest data
points (support vectors) on each
side.
• The best hyperplane, also known as the “hard
margin,” is the one that maximizes the distance
between the hyperplane and the nearest data points
from both classes. This ensures a clear separation
between the classes. So, from the above figure, we
choose L2 as hard margin.
• A soft margin allows for some misclassifications or
violations of the margin to improve generalization.
Kernel
• A kernel is a function that maps data points into a higher-
dimensional space without explicitly computing the
coordinates in that space. This allows SVM to work efficiently
with non-linear data by implicitly performing the mapping.
• For example, consider data points that are not linearly
separable. By applying a kernel function, SVM transforms the
data points into a higher-dimensional space where they
become linearly separable.
• Linear Kernel: For linear separability.
• Polynomial Kernel: Maps data into a polynomial space.
• Radial Basis Function (RBF) Kernel: Transforms data into a
space based on distances between data points.
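• A minimal scikit-learn sketch comparing the three kernels named above; the breast cancer dataset and default parameters are assumptions chosen only for illustration:
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# two-class toy data to compare kernels; features are scaled, as SVM is sensitive to scale
X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

for kernel in ('linear', 'poly', 'rbf'):
    clf = SVC(kernel=kernel)
    clf.fit(X_train, y_train)
    print(kernel, "accuracy:", round(clf.score(X_test, y_test), 3))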
Advantages of SVM
• High-Dimensional Performance: SVM excels in high-dimensional spaces,
making it suitable for image classification and gene expression analysis.
• Nonlinear Capability: Utilizing kernel functions like RBF and polynomial,
SVM effectively handles nonlinear relationships.
• Outlier Resilience: The soft margin feature allows SVM to ignore outliers,
enhancing robustness in spam detection and anomaly detection.
• Binary and Multiclass Support: SVM is effective for both binary
classification and multiclass classification, suitable for applications
in text classification.
• Memory Efficiency: SVM focuses on support vectors, making it memory
efficient compared to other algorithms.
Disadvantages of SVM
• Slow Training: SVM can be slow for large datasets, affecting
performance in SVM in data mining tasks.
• Parameter Tuning Difficulty: Selecting the right kernel and
adjusting parameters like C requires careful tuning,
impacting SVM algorithms.
• Noise Sensitivity: SVM struggles with noisy datasets and
overlapping classes, limiting effectiveness in real-world
scenarios.
• Limited Interpretability: The complexity of the hyperplane in
higher dimensions makes SVM less interpretable than other
models.
• Feature Scaling Sensitivity: Proper feature scaling is essential;
otherwise, SVM models may perform poorly.
Types of Support Vector Machine
• Based on the nature of the decision boundary, Support Vector Machines
(SVM) can be divided into two main parts:
• Linear SVM: Linear SVMs use a linear decision boundary to separate the data
points of different classes. When the data can be precisely linearly separated,
linear SVMs are very suitable. This means that a single straight line (in 2D) or
a hyperplane (in higher dimensions) can entirely divide the data points into
their respective classes. A hyperplane that maximizes the margin between
the classes is the decision boundary.
• Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot
be separated into two classes by a straight line (in the case of 2D). By using
kernel functions, nonlinear SVMs can handle nonlinearly separable data. The
original input data is transformed by these kernel functions into a higher-
dimensional feature space, where the data points can be linearly separated. A
linear SVM is used to locate a nonlinear decision boundary in this modified
space.
Genetic Algorithm (GA)
• A genetic algorithm (GA) is a computer programming
technique that uses biological evolution to solve
optimization problems. It's a type of evolutionary algorithm
that mimics natural selection and genetics.
• GA starts with a population of individuals, each representing
a potential solution
• GA randomly searches for better solutions by mutation and
crossover
• GA selects the individuals with the best traits to survive and
reproduce
• GA uses genetic operators inspired by biological evolution
Features of GA
• GA can solve smooth or non-smooth optimization problems
• GA can handle any type of constraints, including integer
constraints
• GA is a stochastic, population-based algorithm
• GA has a crossover operator that generates new individuals
• GA has a genotype that represents the underlying structure of a
potential solution
• Related GA variations include the Fitness Scaling Genetic Algorithm, Rank Selection GA, and Boltzmann Selection GA.
• GA use cases
• GA is used to solve complex optimization problems in many fields,
including engineering and other problem-solving domains.
What Is the Genetic Algorithm?
• The genetic algorithm is a method for solving both constrained and unconstrained
optimization problems that is based on natural selection, the process that drives
biological evolution. The genetic algorithm repeatedly modifies a population of
individual solutions. At each step, the genetic algorithm selects individuals from the
current population to be parents and uses them to produce the children for the next
generation. Over successive generations, the population "evolves" toward an optimal
solution. You can apply the genetic algorithm to solve a variety of optimization
problems that are not well suited for standard optimization algorithms, including
problems in which the objective function is discontinuous, nondifferentiable,
stochastic, or highly nonlinear. The genetic algorithm can address problems of mixed
integer programming, where some components are restricted to be integer-valued.
• The genetic algorithm uses three main types of rules at
each step to create the next generation from the current
population:
• Selection rules select the individuals, called parents, that
contribute to the population at the next generation. The
selection is generally stochastic, and can depend on the
individuals' scores.
• Crossover rules combine two parents to form children for
the next generation.
• Mutation rules apply random changes to individual parents
to form children.
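• A minimal toy sketch of this selection / crossover / mutation cycle; the bit-string fitness function, population size, rates, and generation count are assumptions chosen only for illustration:
import random

# Toy problem: maximize the number of 1-bits in a 20-bit string
GENES = 20
POP_SIZE = 30

def fitness(individual):
    return sum(individual)

def select(population):
    # Selection rule: pick the fitter of two random individuals (tournament selection)
    a, b = random.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    # Crossover rule: combine two parents at a random cut point
    cut = random.randint(1, GENES - 1)
    return p1[:cut] + p2[cut:]

def mutate(child, rate=0.05):
    # Mutation rule: randomly flip bits with a small probability
    return [1 - g if random.random() < rate else g for g in child]

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP_SIZE)]
for generation in range(50):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]

best = max(population, key=fitness)
print("Best fitness after 50 generations:", fitness(best))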
Thank You