LECTURE 2

MACHINE LEARNING PROCESS


Machine Learning?
 Machine learning is used to make decisions based on data.
 By modeling algorithms on the basis of historical data, algorithms find patterns and relationships that are difficult for humans to detect.
 These patterns are then used to predict solutions to unseen problems.
Machine learning process
 Get data
 Pre-processing – cleaning
 Feature Selection and extraction
 Data splitting
 Choosing a learning algorithm and model building
 Model Training
 Model testing and evaluation
 Deployment (classification, clustering, prediction,
regression)
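
A minimal sketch of these steps in Python with scikit-learn (the file name dataset.csv, the column name "target", and the choice of logistic regression are hypothetical placeholders):

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # 1) Get data (hypothetical CSV file)
    data = pd.read_csv("dataset.csv")

    # 2) Pre-processing: drop rows with missing values
    data = data.dropna()

    # 3) Feature selection: separate inputs from the target column
    X = data.drop(columns=["target"])  # assumes a label column named "target"
    y = data["target"]

    # 4) Data splitting
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # 5-6) Choose a learning algorithm, build and train the model
    model = LogisticRegression()
    model.fit(X_train, y_train)

    # 7) Test and evaluate
    print(accuracy_score(y_test, model.predict(X_test)))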
What is a dataset?
A dataset is a collection of data records, collected for a specific purpose, for computer processing.
For example, the test scores of the students in a particular class form a dataset.

There are many ways in which data can be collected: for example, through surveys, interviews, observations, and so on.
Dataset representation
 A dataset is usually presented in tabular form.
 Each column represents a particular variable (attribute, feature, dimension).
 Each row corresponds to a given member of the dataset in question (instance, record, object).
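
For instance, a tiny tabular dataset in pandas, where each column is a feature and each row is an instance (all values are made up):

    import pandas as pd

    # Columns = variables (features); rows = members of the dataset (instances)
    df = pd.DataFrame({
        "name":   ["Ali", "Mona", "Omar"],
        "height": [1.75, 1.62, 1.80],
        "score":  [88, 92, 75],
    })
    print(df)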
Types of Data (features)
 Numeric Data Types
 Binary… ex: gender
 Integers… ex: number of students
 Floats… ex: length
 Text Data Type (categorical)… ex: country
 Image data
 Audio data
 Video data
Data Samples
Open datasets
 These ready-to-use datasets are freely available online
for anyone to download, modify, and distribute without
legal or financial restrictions. These datasets are
regularly updated and are compatible with most ML
frameworks. The only drawback is that open datasets
are not tailored to your specific problem.
 Google dataset search
 AWS public datasets
 Kaggle datasets
Data cleaning
 Data Cleaning is the act of removing all flawed or irrelevant parts of the data so that what remains is better suited to a particular goal, typically data science or machine learning.
Dirty Data Problems
1) Naming conventions (e.g., NYC vs. New York)
2) Missing required fields (e.g., a key field)
3) Different representations (2 vs. Two)
4) Fields too long (get truncated)
5) Redundant records (exact match or otherwise)
6) Formatting issues, especially dates
Example 1: Missing Values
Remove Rows with Missing Values
Remove Columns with Missing Values
 Drop features with many missing values: If a given column in a dataset has a lot
of missing values, you may be able to drop it completely without losing much
information.
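
A minimal pandas sketch of these options (the column names and the 50% threshold are illustrative choices):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [25, np.nan, 40],
                       "city": ["Cairo", "Giza", None]})

    rows_kept = df.dropna()        # remove rows with any missing value
    cols_kept = df.dropna(axis=1)  # remove columns with any missing value

    # Drop a feature only if most of its values are missing
    mostly_missing = df.columns[df.isna().mean() > 0.5]
    df = df.drop(columns=mostly_missing)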
Example 2: Remove Duplicates
 Keep only unique rows in a dataset.
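
In pandas, for example, this is a one-liner over made-up data:

    import pandas as pd

    df = pd.DataFrame({"id": [1, 1, 2], "score": [90, 90, 75]})
    df = df.drop_duplicates()  # keeps only unique rows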
Example 3: Detect & Remove Outliers
 Errors can occur during measurement and data
entry. During data entry, typos can produce weird
values.
 Imagine that we’re measuring the height of adult
men and gather the following dataset.
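
A minimal sketch that flags such an error with the interquartile-range (IQR) rule, which is robust to extreme values (the heights, in centimeters, are made up, with one obvious data-entry typo):

    import numpy as np

    heights = np.array([175, 180, 178, 182, 177, 1800])  # 1800 is a typo

    q1, q3 = np.percentile(heights, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    cleaned = heights[(heights >= lower) & (heights <= upper)]
    print(cleaned)  # the 1800 value is dropped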
 Drop features with low variance: If a given column in a
dataset has values that change very little, you may be able
to drop it since it’s unlikely to offer as much useful
information about a response variable compared to other
features.
 Drop features with low correlation with the response
variable: If a certain feature is not highly correlated with
the response variable of interest, you can likely drop it from
the dataset since it’s unlikely to be a useful feature in a
model.
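
A minimal sketch of both ideas with pandas and scikit-learn (the columns and thresholds are illustrative):

    import pandas as pd
    from sklearn.feature_selection import VarianceThreshold

    df = pd.DataFrame({
        "sq_ft":    [900, 1500, 2000, 1200],
        "constant": [1, 1, 1, 1],          # zero variance: a candidate to drop
        "price":    [100, 180, 250, 140],  # response variable
    })

    # Drop features with zero variance
    selector = VarianceThreshold(threshold=0.0)
    X = selector.fit_transform(df.drop(columns=["price"]))

    # Inspect how strongly each feature correlates with the response
    print(df.corr()["price"])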
Conventional Definition of Data Quality

 Accuracy
 The data was recorded correctly.
 Completeness
 All relevant data was recorded.
 Uniqueness
 Entities are recorded once.
 Consistency
 The data agrees with itself.
Splitting Data

 training set: a subset used to train (build) the model.
 test set: a subset used to test the trained model.
 Make sure that your test set meets the following two conditions:
 Is large enough to yield statistically meaningful results.
 Is representative of the data set as a whole. In other words, don't pick
a test set with different characteristics than the training set.
For example, consider a model that predicts
whether an email is spam, using the subject line,
email body, and sender's email address as
features. We apportion the data into training
and test sets, with an 80-20 split.
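
Such an 80-20 split can be made with scikit-learn's train_test_split (the toy data below stands in for the real email features):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=100, random_state=0)  # toy stand-in

    # 80% for training, 20% for testing; random_state makes the split reproducible
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    print(len(X_train), len(X_test))  # 80 20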
Validation Set: Another Partition
 The validation set is used during the training phase of the model to provide
an unbiased evaluation of the model's performance and to fine-tune the
model's parameters. The test set, on the other hand, is used after the model
has been fully trained to assess the model's performance on completely
unseen data.
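
One common way to obtain all three partitions is two successive calls to train_test_split; the 60/20/20 proportions below are just one conventional choice:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=100, random_state=0)

    # First split off the test set, then split the rest into train and validation
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=0)
    # 0.25 of the remaining 80% yields a 60/20/20 train/validation/test split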
Types of Learning Errors

 What is Overfitting? (overcomplication)

 What is Underfitting? (oversimplification)


Examples of Overfitting
 Now, assume we train a model from a dataset of 10,000 resumes and their
outcomes.
 Next, we try the model out on the original dataset, and it predicts outcomes with
99% accuracy… wow!
 But now comes the bad news.
 When we run the model on a new (“unseen”) dataset of resumes, we only get
50% accuracy… uh-oh!
 Our model doesn’t generalize well from our training data to unseen data.
 This is known as overfitting, and it’s a common problem in machine learning and
data science.
 In fact, overfitting occurs in the real world all the time; you only need to turn on the news.
Underfitting
 Underfitting occurs when a model is too simple –
informed by too few features or regularized too much –
which makes it inflexible in learning from the dataset.
 Simple learners tend to have less variance in their
predictions but more bias towards wrong outcomes (see:
The Bias-Variance Tradeoff).
What Is a Good Fit In Machine Learning?

 To find a good-fit model, you need to look at the performance of a machine learning model over time as it learns from the training data.
 As the algorithm learns, the error for the model on the training data decreases, as does the error on the test dataset.
 If you train the model for too long, the model may learn unnecessary details and noise in the training set, which leads to overfitting.
 To achieve a good fit, you need to stop training at the point where the error on the test data starts to increase.
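
A minimal sketch of this stopping rule, tracking the validation error during training (the model, toy data, and patience value are illustrative; many libraries offer this as a built-in early-stopping option):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=0)

    model = SGDClassifier(random_state=0)
    best_error, patience = np.inf, 0
    for epoch in range(100):
        model.partial_fit(X_train, y_train, classes=np.unique(y))
        val_error = 1 - model.score(X_val, y_val)
        if val_error < best_error:
            best_error, patience = val_error, 0
        else:
            patience += 1
        if patience >= 5:  # validation error stopped improving: stop training
            break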
Traditional machine learning uses hand-crafted features
What is a feature?
 A feature in machine learning is an individual measurable property or
characteristic of an object.
 Features are the input that you feed to your machine learning model to
output a prediction or classification.
 Suppose you want to predict the price of a house. Your input features (properties) might include square_foot, number_of_rooms, bathrooms, etc., and the model will output the predicted price based on the values of your features.
 Selecting good features that clearly distinguish your objects increases the
predictive power of machine learning algorithms.
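
A minimal sketch of this house-price example (the feature values, prices, and choice of linear regression are made up for illustration):

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Each feature is a measurable property of a house
    houses = pd.DataFrame({
        "square_foot":     [850, 1200, 2000],
        "number_of_rooms": [2, 3, 5],
        "bathrooms":       [1, 2, 3],
    })
    prices = [150_000, 220_000, 400_000]

    model = LinearRegression().fit(houses, prices)
    new_house = pd.DataFrame({"square_foot": [1500],
                              "number_of_rooms": [4],
                              "bathrooms": [2]})
    print(model.predict(new_house))  # predicted price for the new house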
Features must be...

 Identifiable
 Easily tracked and compared
 Consistent across different scales, lighting conditions, and viewing angles
 Still identifiable in noisy environments
 If I give you a feature like a wheel and ask you to guess whether the object is a motorcycle or a dog, what would your guess be?
 A motorcycle. Correct! In this case, the wheel is a strong feature that clearly distinguishes between motorcycles and dogs.
 Now, if I give you the same feature (a wheel) and ask you to guess whether the object is a bicycle or a motorcycle:
 In this case, this feature isn't strong enough to distinguish between the two objects. We then need to look for more features, like a mirror, a license plate, or maybe a pedal, that collectively describe an object.
What makes a good (useful) feature?
 Machine learning models are only as good as the features you
provide.
 That means coming up with good features is an important job in
building ML models.
 But what makes a good feature? And how can you tell?
What makes a good (useful) feature?
 Let’s discuss this with an example:
Suppose we want to build a classifier to
tell the difference between two types of
dogs, Greyhound and Labrador.
 Let’s take two features and evaluate them:
 1) the dogs’ height, and
 2) their eye color.
 Let’s begin with height.
 Well, on average, Greyhounds tend to be a couple of inches taller than Labradors, but not always; a lot of variation exists in the world. Let’s evaluate this feature across different values in the populations of both breeds.
 We can visualize the height distribution on a toy
example in the histogram below:
 if height <= 20: return higher probability of Labrador
 if height >= 30: return higher probability of Greyhound
 if 20 < height < 30: look for other features to classify the object.
 Let’s look at eye color. Imagine
that we have only two eye colors,
blue and brown. Here’s what a
histogram might look like for this
example:
 It’s clear that for most values, the
distribution is about 50/50 for
both types.
 Practically this feature tells us
nothing because it doesn’t
correlate with the type of dog.
Hence, it doesn’t distinguish
between Greyhounds and
Labradors.
Disadvantages of high-dimensional data
 Hard to visualize
 Longer processing time
 Requires large storage
 Noisy data

Dimensionality reduction (simplification)


What are the new axes?

[Figure: data points plotted against Original Variable A and Original Variable B, with new orthogonal axes PC 1 and PC 2 drawn along the directions of greatest variance]

• Orthogonal directions of greatest variance in data


• Projections along PC 1 discriminate the data most along any one axis
Example
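
A minimal PCA sketch with scikit-learn on made-up correlated 2-D data, showing that PC 1 captures most of the variance:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    a = rng.normal(size=100)
    b = 2 * a + rng.normal(scale=0.3, size=100)  # B is correlated with A
    data = np.column_stack([a, b])

    pca = PCA(n_components=2)
    projected = pca.fit_transform(data)  # coordinates along PC 1 and PC 2

    # PC 1 explains most of the variance; PC 2 is orthogonal to it
    print(pca.explained_variance_ratio_)  # expect something like [0.98, 0.02]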
