Objectives
To introduce the basic concepts and techniques of
Machine Learning
To introduce various supervised and unsupervised
algorithms
To introduce various ensemble techniques for
combining ML models.
To introduce the concept of dimensionality reduction
and its techniques.
August 12, 2025 1
Outcomes
Identify a Machine Learning technique for the given problem and understand the
concepts of Training Error, Generalization Error, Overfitting and Underfitting.
Apply Regression and Decision Tree techniques on the given data and examine
the performance of the model
Compare and Contrast Ensemble approaches for combining multiple Machine
Learning Techniques
Determine the type of Support Vector Machine variant which can be applied on the
given data
Apply Unsupervised Learning technique on the given data for getting insights
from unlabeled data
Use Dimensionality Reduction techniques for dealing with data with large
number of attributes
Syllabus
What is machine learning
Machine learning is an area of
computer science which involves teaching
computers to do things naturally by
learning through experience
A computer program is said to learn from
experience (E) with respect to some class
of tasks (T) and performance measure (P)
if its performance at tasks in T, as
measured by P, improves with experience E
Features
Class of task
Performance measure
Source of experience
Example:-Robot navigation in a maze
Common definitions
Machine learning is used to parse data,
learn from it and then make a
determination or prediction about
something in the world
Machine learning lies at the intersection
of computer science, engineering and
statistics and often appears in other
disciplines
Components of ML study
Computer Science
Statistics
Engineering
Examples of machine learning
Facebook continuously notes the friends in
your list, the profiles you often visit, your
interests, workplace, the groups you are in and
so on. Based on the information retrieved,
Facebook gives you friend suggestions
Consider that you purchased an item from
Amazon. If you purchased a mobile phone
online, then the site where you
purchased it immediately recommends a
cover for that phone
How does ML algorithm work?
Machine learning uses algorithms to find
patterns in data,
then uses a model that recognizes those
patterns to make predictions on new data
Diagram: Training Data -> Algorithm (finds the patterns) -> Model (recognizes the patterns)
Diagram: New Data -> Model -> Predictions
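The train-then-predict flow above can be sketched with a very small example. This is only an illustration, assuming a least-squares line fit as the "algorithm"; the data points and the names train and predict are hypothetical and not part of the original slides.

```python
# Toy illustration: "training" finds a pattern, the "model" applies it to new data.
# The data points are hypothetical, chosen so that the pattern is y = 2x.

def train(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x_new):
    """Apply the learned pattern to a new input."""
    slope, intercept = model
    return slope * x_new + intercept


training_x = [1, 2, 3]
training_y = [2, 4, 6]

model = train(training_x, training_y)   # the pattern found in the training data
prediction = predict(model, 4)          # prediction on new data
print(model, prediction)                # (2.0, 0.0) 8.0
```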
Machine learning can be
implemented in the
Healthcare sector
Pharmaceutical companies
Where is machine learning used
Marketing and sales
Search engines
Transportation: Based on travel history
and patterns of travel across various
routes, machine learning can help
transportation companies predict
potential problems that could arise on
certain routes and accordingly advise
their customers to opt for a different
route
Types of machine learning
Supervised learning: Suppose you have a
fruit basket and your task is to arrange
the fruit by type
You can group the fruits based on any
physical characteristic, such as color or size
Rule 1: If the color of the fruit is red and
the size of the fruit is small then the fruit is
a cherry
Rule 2: If the color of the fruit is red and the size of
the fruit is big then the fruit is an apple
Rule 3: If the color of the fruit is green and the size of
the fruit is small then the fruit is a grape
Rule 4: If the color of the fruit is green and the size of
the fruit is big then the fruit is a
watermelon
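The four fruit rules can be written down directly as a lookup. This is a minimal sketch; the function name classify_fruit and the "unknown" fallback for combinations not covered by the rules are assumptions added for illustration.

```python
# Hand-written version of the fruit rules (color, size) -> fruit.
# In supervised learning such rules would be learned from labelled examples;
# here they are simply hard-coded for illustration.

RULES = {
    ("red", "small"): "cherry",
    ("red", "big"): "apple",
    ("green", "small"): "grape",
    ("green", "big"): "watermelon",
}


def classify_fruit(color, size):
    """Return the fruit for a (color, size) pair, or 'unknown' (assumed fallback)."""
    return RULES.get((color.lower(), size.lower()), "unknown")


print(classify_fruit("Red", "small"))   # cherry
print(classify_fruit("green", "Big"))   # watermelon
print(classify_fruit("yellow", "big"))  # unknown
```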
Decision Tree Induction
Reinforcement Learning
This learning is similar to supervised learning.
In supervised learning the correct target
output values are known for each input
pattern.
But in some cases, less information might
be available.
For example, the network might be told only that its
actual output is 50% correct.
Thus only critic information is available here,
not the exact information.
Learning based on this critic information is
called reinforcement learning and the
feedback sent is called the reinforcement
signal.
Reinforcement learning is treated here as a form of
supervised learning because the network
receives some feedback from its environment.
The reinforcement signals are processed in
the critic signal generator and the resulting
critic signals are sent to the network for
adjustment of the weights.
Reinforcement learning is also called
learning with a critic, as opposed to
learning with a teacher, which indicates
supervised learning
Reinforcement learning works very much like
how you learn by yourself without any
guidance, basically through trial and error.
When you get something right, you get a
reward, you feel happy and you move
ahead
When you get something wrong, you get
a penalty. You take a step back and then
you try to avoid the incorrect path while
exploring another, correct path.
Example: robots equipped with sensors
that learn about their surrounding
environment
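The reward and penalty idea can be sketched with the simplest reinforcement learning setting, a two-action bandit. This is only an illustration under stated assumptions: the two actions, the rewards of +1 and -1, the epsilon-greedy exploration rule and all names below are hypothetical and do not come from the slides.

```python
import random

# Hypothetical environment: one path is rewarded, the other is penalized.
REWARDS = {"wrong_path": -1.0, "correct_path": 1.0}


def run_trials(n_trials=200, epsilon=0.2, seed=0):
    """Learn by trial and error which action earns the reward."""
    rng = random.Random(seed)
    estimates = {action: 0.0 for action in REWARDS}  # learned value of each action
    counts = {action: 0 for action in REWARDS}       # how often each was tried
    for _ in range(n_trials):
        if rng.random() < epsilon:
            # Explore: try an action at random.
            action = rng.choice(list(REWARDS))
        else:
            # Exploit: pick the action that currently looks best.
            action = max(estimates, key=estimates.get)
        reward = REWARDS[action]  # reinforcement signal from the environment
        counts[action] += 1
        # Running average of the rewards observed for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


estimates, counts = run_trials()
best_action = max(estimates, key=estimates.get)
print(estimates, best_action)
```

The learner is never told which path is correct; it only receives a reward or a penalty after acting, and its value estimates steer it toward the rewarded path.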
Unsupervised Learning
K-Means clustering
Issues in machine learning
Data labelling: Today there is a large
amount of data that is unlabelled and
raw. As you know, supervised machine
learning works on labelled data.
Without adequate data labels in the
training dataset, it is not feasible to build
a robust machine learning model.
Companies are putting in thousands of person
hours to label data so that it can be
used for machine learning
This is an active area of research where
the labels can be attached to the data as
it is used
Shortage of experts
Machine learning is an emerging field and
there are not many experts around the
world. You require experts who can
1) Understand the wide variety of data
2) Model the data correctly so as to meet
the desired objectives
3) Build and manage software and
hardware tools and techniques required
for machine learning
Obtaining massive training
datasets
It is difficult to obtain massive training
datasets for various areas of machine
learning
You may lack historical data, and the
quality of the data in the training dataset
also matters.
If the dataset obtained is not a fair,
sufficiently large sample then the resultant
machine learning model could be
erroneous
Suppose you are trying to build a machine learning
model to predict a particular type of
cancer from a given set of symptoms,
lifestyle and blood related parameters.
Then you may require quality data for
thousands of patients who have had that
particular type of cancer, with the details
of their symptoms, lifestyle and blood
related parameters.
HARD TO EXPLAIN PROBLEMS
AND RESULTS
Complex machine learning models, often
built by experts, may not be self
explanatory when used by non-experts
in the field
For example, if you tell a healthy person
that she has an 80% chance of getting a
particular disease then she may require
additional details behind that statement
Limited possibilities to reuse the
model
It is difficult to reuse an existing machine
learning model for other use cases.
Companies have to invest time and
resources to build a new model for each
new use case
Steps in developing machine
learning application
Collect data: Some of the popular,
publicly available dataset resources are
as follows
1) Kaggle datasets
2) Amazon Web Services
3) Machine learning repository
4) Google TensorFlow
5) Microsoft
6) OpenML
Prepare the input data
Once you have the data, you need to
ensure that it is in the right format such
that it can be processed by the chosen
algorithm and computer programs
Data Preprocessing
Data preprocessing
Processing that involves transformation of raw
data into useful information.
Why is pre-processing required?
1. Real world data are generally
incomplete
noisy
2. Tasks in Data Preprocessing
Data cleaning: fill in missing values, smooth noisy data, identify or
remove outliers
Data integration: integrating data from multiple sources
Data transformation: normalization
Data reduction: obtains a reduced representation in volume but produces
the same or similar analytical results
Data Cleaning
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy
data
How to Handle Missing
Data?
Ignore the tuple: usually done when the class label is missing
(assuming the task is classification)
Fill in the missing value manually: tedious and often infeasible
Use a global constant to fill in the missing value: e.g.,
“unknown”
Use the attribute mean or median to fill in the missing value
Use the most probable value to fill in the missing value:
using techniques like regression, Bayesian
classification, decision trees or clustering algorithms
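Three of the strategies above (global constant, attribute mean, attribute median) can be sketched in a few lines. This is a minimal illustration, assuming missing entries are represented by None; the function name fill_missing and the humidity readings are hypothetical.

```python
from statistics import mean, median


def fill_missing(values, strategy="mean", constant="unknown"):
    """Replace None entries using the chosen strategy."""
    known = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(known)
    elif strategy == "median":
        fill = median(known)
    elif strategy == "constant":
        fill = constant
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical attribute with two missing readings.
humidity = [85, None, 90, 70, None, 95]

filled_mean = fill_missing(humidity, "mean")
filled_median = fill_missing(humidity, "median")
filled_constant = fill_missing(humidity, "constant")
print(filled_mean)
print(filled_median)
print(filled_constant)
```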
Example of Weather Data
Outlook Temperature Humidity Windy Class
sunny hot high false N
sunny hot high true N
overcast hot high false P
rain mild high false P
rain cool normal false P
rain cool normal true N
overcast cool normal true P
sunny mild high false N
sunny cool normal false P
rain mild normal false P
sunny mild normal true P
overcast mild high true P
overcast hot normal false P
rain mild high true N
How to Handle Noisy Data?
Binning method:
first sort the data and partition it into (equi-depth)
bins
then smooth by bin means, by
bin medians, by bin boundaries, etc.
Clustering:
detect and remove outliers
Regression
Binning
Consider sorted data, for example prices in
INR
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Number of bins N = 3 (equi-depth, 4 values per bin)
Bin 1: 4, 8, 9, 15
Bin 2: 21, 21, 24, 25
Bin 3: 26, 28, 29, 34
Smooth by bin means
Replace each value in a bin with the bin mean
(rounded to the nearest integer here)
Bin 1: 9, 9, 9, 9 (mean 9)
Bin 2: 23, 23, 23, 23 (mean 22.75)
Bin 3: 29, 29, 29, 29 (mean 29.25)
Smoothing by bin median
Bin 1: 8.5, 8.5, 8.5, 8.5
Bin 2: 22.5, 22.5, 22.5, 22.5
Bin 3: 28.5, 28.5, 28.5, 28.5
Smoothing by bin boundaries
Replace each value in a bin with the closer of the bin's
minimum and maximum values
Bin 1: 4, 4, 4, 15
Bin 2: 21, 21, 25, 25
Bin 3: 26, 26, 26, 34
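The binning example can be reproduced with a short sketch. The function names are hypothetical, the partition assumes the number of values divides evenly by the number of bins, and ties in boundary smoothing go to the lower boundary (an assumption; the example has no ties). The code keeps exact means (22.75 and 29.25) rather than rounding them.

```python
from statistics import median

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted


def equi_depth_bins(sorted_values, n_bins):
    """Split sorted data into bins holding the same number of values."""
    depth = len(sorted_values) // n_bins  # assumes an even split
    return [sorted_values[i * depth:(i + 1) * depth] for i in range(n_bins)]


def smooth_by_means(bins):
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_medians(bins):
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace each value by the closer of the bin minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


bins = equi_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_medians = smooth_by_medians(bins)
by_boundaries = smooth_by_boundaries(bins)
print(bins)
print(by_means)
print(by_medians)
print(by_boundaries)
```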
Data Integration
Karl Pearson’s Correlation Coefficient
Covariance
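Both measures can be computed directly from their definitions. This is a sketch under assumptions: it takes the coefficient named on this slide to be Karl Pearson's correlation coefficient, uses population formulas (dividing by n), and the two attributes x and y are hypothetical values chosen so that y = 2x.

```python
from math import sqrt


def covariance(xs, ys):
    """Population covariance: mean of the products of deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def pearson(xs, ys):
    """Correlation coefficient: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical attributes from two sources; y is exactly 2 * x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

cov_xy = covariance(x, y)
r_xy = pearson(x, y)
print(cov_xy, r_xy)
```

A coefficient close to +1 or -1 suggests that one attribute is largely redundant given the other, which is useful to know when integrating data from multiple sources.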
Data Reduction
Dimension reduction technique
Example of Decision Tree Induction
Initial attribute set:
{A1, A2, A3, A4, A5, A6}
Induced tree: the root tests A4; its two branches test A1 and A6;
each of those tests ends in the leaves Class 1 and Class 2
Reduced attribute set: {A1, A4, A6}
Numerosity Reduction
Numerosity reduction
technique refers to
reducing the volume of
data by choosing smaller
forms for data
representation.
1. Histograms
A popular data
reduction technique
Divide data into buckets
Range of bucket is
called as width.
[Figure: histogram, x-axis from 10000 to 90000, y-axis from 0 to 40]
Histogram
D = [1,2,3,4,2,2,3,3,3,3,1,1,1,1,1,4,4,5,5,5,6,6,6,7,7,7,1]
Value: number of occurrences
1: 7
2: 3
3: 5
4: 3
5: 3
6: 3
7: 3
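The counts for D can be checked with a few lines. The first part reproduces the one-value-per-bucket histogram; the second part, with buckets of width 2 starting at 1, is an added illustration of bucket width and is not taken from the slides.

```python
from collections import Counter

D = [1, 2, 3, 4, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1,
     4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 7, 1]

# Singleton buckets: one bucket per distinct value.
value_counts = dict(sorted(Counter(D).items()))
print(value_counts)


def bucket_counts(data, width, start):
    """Count values per bucket; each bucket is keyed by its lower bound."""
    counts = Counter(start + ((v - start) // width) * width for v in data)
    return dict(sorted(counts.items()))


# Assumed width of 2: buckets 1-2, 3-4, 5-6 and 7-8.
wide_counts = bucket_counts(D, width=2, start=1)
print(wide_counts)
```

Storing the bucket counts instead of all 27 values is what makes the histogram a reduced representation of the data.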
Data Transformation
Normalization
Z-score: $v' = \dfrac{v - \mathrm{mean}_A}{\mathrm{stand\_dev}_A}$
Sample data [10,20,30]
Mean = 20
Std dev = square root of variance
Variance = ((10-20)^2 + (20-20)^2 + (30-20)^2) / 3
= 66.67
Std dev = 8.16
v' = -1.22, 0, 1.22
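The worked example can be verified with a short sketch. It assumes the population standard deviation (dividing by n), which is what the figures 66.67 and 8.16 correspond to; the function name z_scores is hypothetical.

```python
from math import sqrt


def z_scores(values):
    """Normalize values to zero mean and unit (population) standard deviation."""
    n = len(values)
    mean_a = sum(values) / n
    variance = sum((v - mean_a) ** 2 for v in values) / n  # population variance
    stand_dev_a = sqrt(variance)
    return [(v - mean_a) / stand_dev_a for v in values]


sample = [10, 20, 30]
normalized = z_scores(sample)
print([round(z, 2) for z in normalized])  # [-1.22, 0.0, 1.22]
```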
Analyze the input data
You need to ensure that examples are
complete (there are no missing values)
Train the algorithm
Test the algorithm
August 12, 2025 Data Mining: Concepts and Techniques 63
Use the algorithm
You spent a lot of time collecting and
cleaning the data and then building and
testing the model
Periodic revisit
You should periodically review the results
that the model is producing and evaluate
whether there are opportunities for improving it
in light of new data. You may make
minor adjustments to the model or
retrain it with the latest data to fine-tune it