LECTURE 2
MACHINE LEARNING PROCESS
Machine Learning?
Machine learning is used to make decisions based on data.
By building models from historical data,
algorithms find patterns and relationships that are difficult
for humans to detect.
These patterns are then used to predict
outcomes for new, unseen problems.
Machine learning process
Get data
Pre-processing – cleaning
Feature Selection and extraction
Data splitting
Choosing a learning algorithm and model building
Model Training
Model testing and evaluation
Deployment (classification, clustering, prediction,
regression)
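The process steps above can be sketched end to end with scikit-learn; the Iris dataset and logistic regression below are illustrative choices, not part of the lecture:

```python
# A minimal sketch of the ML process (dataset and model choices
# here are illustrative assumptions).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1) Get data
X, y = load_iris(return_X_y=True)

# 2-4) Pre-processing (scaling) and data splitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 5-6) Choose a learning algorithm and train the model
model = LogisticRegression(max_iter=200).fit(X_train, y_train)

# 7) Test and evaluate
accuracy = accuracy_score(y_test, model.predict(X_test))
```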
What is a dataset?
A dataset is a collection of data records, gathered for a
specific purpose, for computer processing.
For example, the test scores of each student in a particular
class are a dataset.
There are many ways in which data can be
collected—
for example, as part of surveys, interviews, observations, and
so on.
Dataset representation
A dataset is usually presented in tabular form.
Each column represents a particular variable
(attribute, feature, dimension).
Each row corresponds to a given member of the
dataset in question (instance, record, object).
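As a quick illustration of the tabular view (the values below are made up), each column is a feature and each row is an instance:

```python
import pandas as pd

# Columns are variables (features); rows are instances (records).
scores = pd.DataFrame({
    "student": ["Ali", "Sara", "Omar"],
    "math":    [85, 92, 78],
    "science": [90, 88, 95],
})
# scores.shape gives (number of instances, number of attributes)
```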
Types of Data (features)
Numeric Data Types
Binary… ex: gender
Integers… ex: number of students
Floats… ex: length
Text Data Type (categorical)… ex: country
Image data
Audio data
Video data
Data Samples
Open datasets
These ready-to-use datasets are freely available online
for anyone to download, modify, and distribute without
legal or financial restrictions. These datasets are
regularly updated and are compatible with most ML
frameworks. The only drawback is that open datasets
lack personalization.
Google dataset search
AWS public datasets
Kaggle datasets
Data cleaning
Data cleaning is the act of removing all flawed or
irrelevant parts of the data so that what remains is
better suited to a particular goal, typically data
science or machine learning.
Dirty Data Problems
1) Naming conventions: e.g., NYC vs. New York
2) Missing required fields (e.g., key fields)
3) Different representations (2 vs. Two)
4) Fields too long (get truncated)
5) Redundant records (exact match or otherwise)
6) Formatting issues, especially dates
Example 1: Missing Values
Remove Rows with Missing Values
Remove columns with Missing Values
Drop features with many missing values: If a given column in a dataset has a lot
of missing values, you may be able to drop it completely without losing much
information.
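Both strategies can be sketched with pandas (the toy table below is an assumption):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25, np.nan, 31],         # one missing value
    "name": ["Ali", "Sara", "Omar"],  # complete column
})

rows_dropped = df.dropna()        # remove rows with any missing value
cols_dropped = df.dropna(axis=1)  # remove columns with any missing value
```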
Example 2: Remove Duplicates
keep only unique rows in a data set.
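With pandas, one way to keep only the unique rows (toy data assumed):

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["Ali", "Ali", "Sara"],
    "score": [85, 85, 92],           # first two rows are exact duplicates
})
unique_rows = df.drop_duplicates()   # keeps the first of each duplicate group
```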
Example 3: Detect & Remove Outliers
Errors can occur during measurement and data
entry. During data entry, typos can produce weird
values.
Imagine that we’re measuring the height of adult
men and gather the following dataset.
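Since the slide's dataset itself is not reproduced here, the sketch below uses made-up heights (in cm) with one obvious typo, and flags outliers with the common 1.5×IQR rule, one of several possible choices:

```python
import numpy as np

# Made-up adult male heights in cm; 1800 is a data-entry typo.
heights = np.array([175, 180, 172, 178, 169, 174, 177, 1800])

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(heights, [25, 75])
iqr = q3 - q1
mask = (heights >= q1 - 1.5 * iqr) & (heights <= q3 + 1.5 * iqr)
cleaned = heights[mask]  # the typo 1800 is removed
```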
Drop features with low variance: If a given column in a
dataset has values that change very little, you may be able
to drop it since it’s unlikely to offer as much useful
information about a response variable compared to other
features.
Drop features with low correlation with the response
variable: If a certain feature is not highly correlated with
the response variable of interest, you can likely drop it from
the dataset since it’s unlikely to be a useful feature in a
model
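Both heuristics can be sketched with pandas; the thresholds and the toy data are assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "sq_ft":    [1000, 1500, 2000, 2500],
    "has_roof": [1, 1, 1, 1],          # zero variance: drop candidate
    "price":    [200, 300, 400, 500],  # response variable
})

# Drop features with (near-)zero variance.
low_var = [c for c in df.columns if df[c].var() == 0]
df = df.drop(columns=low_var)

# Drop features with low correlation to the response.
corr = df.corr()["price"].abs()
low_corr = [c for c in corr.index if c != "price" and corr[c] < 0.1]
```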
Conventional Definition of Data Quality
Accuracy
The data was recorded correctly.
Completeness
All relevant data was recorded.
Uniqueness
Entities are recorded once.
Consistency
The data agrees with itself.
Splitting Data
Training set: a subset used to train (build) the model.
Test set: a subset used to test the trained model.
Make sure that your test set meets the following two conditions:
Is large enough to yield statistically meaningful results.
Is representative of the data set as a whole. In other words, don't pick
a test set with different characteristics than the training set.
For example, consider a model that predicts
whether an email is spam, using the subject line,
email body, and sender's email address as
features. We apportion the data into training
and test sets, with an 80-20 split.
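The 80-20 split described above can be done with scikit-learn's `train_test_split`; toy arrays stand in for the spam data, which is not shown:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)  # 100 toy examples
y = X.ravel() % 2                   # toy binary labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # 80-20 split
```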
Validation Set: Another Partition
The validation set is used during the training phase of the model to provide
an unbiased evaluation of the model's performance and to fine-tune the
model's parameters. The test set, on the other hand, is used after the model
has been fully trained to assess the model's performance on completely
unseen data.
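One common way to carve out all three partitions is two successive splits; the 60/20/20 proportions below are an illustrative assumption:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)
y = np.zeros(100)

# First split off 40% for validation + test, then halve that part.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0)
```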
Types of Learning Errors
What is Overfitting?(over complicating)
What is Underfitting? (oversimplification)
Examples of Overfitting
Now, assume we train a model from a dataset of 10,000 resumes and their
outcomes.
Next, we try the model out on the original dataset, and it predicts outcomes with
99% accuracy… wow!
But now comes the bad news.
When we run the model on a new (“unseen”) dataset of resumes, we only get
50% accuracy… uh-oh!
Our model doesn’t generalize well from our training data to unseen data.
This is known as overfitting, and it’s a common problem in machine learning and
data science.
In fact, overfitting occurs in the real world all the time. You only need to turn on
the news
Underfitting
Underfitting occurs when a model is too simple –
informed by too few features or regularized too much –
which makes it inflexible in learning from the dataset.
Simple learners tend to have less variance in their
predictions but more bias towards wrong outcomes (see:
The Bias-Variance Tradeoff).
What Is a Good Fit In Machine Learning?
To find the good fit model, you need to look at the performance of a
machine learning model over time with the training data.
As the algorithm learns over time, the error for the model on the training data
reduces, as well as the error on the test dataset.
If you train the model for too long, the model may learn the unnecessary
details and the noise in the training set and hence lead to overfitting.
To achieve a good fit, you need to stop training at the point
where the error on the test dataset starts to increase.
Traditional machine learning uses hand-crafted features.
What is a feature?
A feature in machine learning is an individual measurable property or
characteristic of an object.
Features are the input that you feed to your machine learning model to
output a prediction or classification.
Suppose you want to predict the price of a house, your input features
(properties) might include: square_foot, number_of_rooms, bathrooms, etc.
and the model will output the predicted price based on the values of your
features.
Selecting good features that clearly distinguish your objects increases the
predictive power of machine learning algorithms.
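The house-price example above as code; the numbers are made up, and plain linear regression is just one possible model choice:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Feature columns: square_foot, number_of_rooms, bathrooms
X = np.array([[1000, 2, 1],
              [1500, 3, 2],
              [2000, 4, 2],
              [2500, 4, 3]])
y = np.array([200_000, 300_000, 400_000, 500_000])  # prices

model = LinearRegression().fit(X, y)
# Predict the price of a new house from its feature values.
predicted_price = model.predict(np.array([[1800, 3, 2]]))[0]
```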
Features must be...
Identifiable
Easily tracked and compared
Consistent across different scales, lighting conditions, and viewing angles
Still identifiable in noisy environment
If I give you a feature like a wheel and ask you to guess whether
the object is a motorcycle or a dog, what would your guess be?
A motorcycle. Correct! In this case,
the wheel is a strong feature that clearly distinguishes between
motorcycles and dogs.
Now suppose I give you the same feature (a wheel) and ask you to guess
whether the object is a bicycle or a motorcycle.
In this case, the feature isn't strong enough to distinguish between
the two objects. We then need to look for more features, like a mirror,
a license plate, or maybe a pedal, that collectively describe the object.
What makes a good (useful) feature?
Machine learning models are only as good as the features you
provide.
That means coming up with good features is an important job in
building ML models.
But what makes a good feature? And how can you tell?
What makes a good (useful) feature?
Let’s discuss this by an example:
Suppose we want to build a classifier to
tell the difference between two types of
dogs, Greyhound and Labrador.
Let’s take two features and evaluate
them:
1) the dogs’ height and
2) their eye color.
Let’s begin with height.
Well, on average, Greyhounds tend to be a couple
of inches taller than Labradors, but not always. A lot
of variation exists in the world. Let’s evaluate this
feature across different values in both breeds'
populations.
We can visualize the height distribution on a toy
example in the histogram below:
if height <= 20: return higher probability of
Labrador
if height >= 30: return higher probability of
Greyhound
if 20 < height < 30: look for other features to
classify the object.
Let’s look at eye color. imagine
that we have only two eye colors,
blue and brown. Here’s what a
histogram might look like for this
example:
It’s clear that for most values, the
distribution is about 50/50 for
both types.
Practically this feature tells us
nothing because it doesn’t
correlate with the type of dog.
Hence, it doesn’t distinguish
between Greyhounds and
Labradors.
Disadvantages of high dimensional data
Cannot be visualized
Needs more time for processing
Needs large storage
Noisy data
Dimensionality reduction (simplification)
What are the new axes?
[Figure: new axes PC 1 and PC 2 overlaid on original variables A and B]
• Orthogonal directions of greatest variance in data
• Projections along PC1 spread out (discriminate) the data the most along any single axis
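A minimal PCA sketch of these ideas with scikit-learn (the toy two-variable data below is an assumption):

```python
import numpy as np
from sklearn.decomposition import PCA

# Two correlated variables: B is roughly 2*A plus small noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, 2 * a + rng.normal(scale=0.1, size=200)])

pca = PCA(n_components=2).fit(X)
# PC1 points along the direction of greatest variance and
# captures almost all of it for this data.
ratios = pca.explained_variance_ratio_
```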
Example