LECTURE 2
MACHINE LEARNING PROCESS
Machine Learning?
Machine learning is used to make decisions based on data.
By building models from historical data,
algorithms find patterns and relationships that are difficult
for humans to detect.
These patterns are then used to predict
outcomes for new, unseen problems.
Machine learning process
Get data
Pre-processing – cleaning
Feature Selection and extraction
Data splitting
Choosing a learning algorithm and model building
Model Training
Model testing and evaluation
Deployment (classification, clustering, prediction,
regression)
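The process steps above can be sketched end to end with scikit-learn; the Iris dataset and logistic regression below are illustrative choices, not part of the lecture:

```python
# A minimal sketch of the ML process (dataset and model choices
# here are illustrative assumptions).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1) Get data
X, y = load_iris(return_X_y=True)

# 2-4) Pre-processing (scaling) and data splitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 5-6) Choose a learning algorithm and train the model
model = LogisticRegression(max_iter=200).fit(X_train, y_train)

# 7) Test and evaluate
accuracy = accuracy_score(y_test, model.predict(X_test))
```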
What is a dataset?
A dataset is a collection of data records, gathered for a
specific purpose, for computer processing.
For example, the test scores of each student in a particular
class are a dataset.
There are many ways in which data can be
collected—
for example, as part of surveys, interviews, observations, and
so on.
Dataset representation
A dataset is usually presented in tabular form.
Each column represents a particular variable
(attribute, feature, dimension).
Each row corresponds to a given member of the
dataset in question (instance, record, object).
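As a quick illustration of the tabular view (the values below are made up), each column is a feature and each row is an instance:

```python
import pandas as pd

# Columns are variables (features); rows are instances (records).
scores = pd.DataFrame({
    "student": ["Ali", "Sara", "Omar"],
    "math":    [85, 92, 78],
    "science": [90, 88, 95],
})
# scores.shape gives (number of instances, number of attributes)
```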
Types of Data (features)
Numeric Data Types
Binary… ex: gender
Integers… ex: number of students
Floats… ex: length
Text Data Type (categorical)… ex: country
Image data
Audio data
Video data
Data Samples
Open datasets
These ready-to-use datasets are freely available online
for anyone to download, modify, and distribute without
legal or financial restrictions. These datasets are
regularly updated and are compatible with most ML
frameworks. The only drawback is that open datasets
lack personalization.
Google dataset search
AWS public datasets
Kaggle datasets
Data cleaning
Data cleaning is the act of removing all flawed or
irrelevant parts of the data so that what remains is
better suited to a particular goal, typically data
science or machine learning.
Dirty Data Problems
1) Naming conventions: e.g., NYC vs. New York
2) Missing required fields (e.g., key fields)
3) Different representations (2 vs. Two)
4) Fields too long (get truncated)
5) Redundant records (exact match or otherwise)
6) Formatting issues, especially dates
Example 1: Missing Values
Remove Rows with Missing Values
Remove columns with Missing Values
Drop features with many missing values: If a given column in a dataset has a lot
of missing values, you may be able to drop it completely without losing much
information.
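Both strategies can be sketched with pandas (the toy table below is an assumption):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25, np.nan, 31],         # one missing value
    "name": ["Ali", "Sara", "Omar"],  # complete column
})

rows_dropped = df.dropna()        # remove rows with any missing value
cols_dropped = df.dropna(axis=1)  # remove columns with any missing value
```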
Example 2: Remove Duplicates
keep only unique rows in a data set.
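With pandas, one way to keep only the unique rows (toy data assumed):

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["Ali", "Ali", "Sara"],
    "score": [85, 85, 92],           # first two rows are exact duplicates
})
unique_rows = df.drop_duplicates()   # keeps the first of each duplicate group
```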
Example 3: Detect & Remove Outliers
Errors can occur during measurement and data
entry. During data entry, typos can produce weird
values.
Imagine that we’re measuring the height of adult
men and gather the following dataset.
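Since the slide's dataset itself is not reproduced here, the sketch below uses made-up heights (in cm) with one obvious typo, and flags outliers with the common 1.5×IQR rule, one of several possible choices:

```python
import numpy as np

# Made-up adult male heights in cm; 1800 is a data-entry typo.
heights = np.array([175, 180, 172, 178, 169, 174, 177, 1800])

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(heights, [25, 75])
iqr = q3 - q1
mask = (heights >= q1 - 1.5 * iqr) & (heights <= q3 + 1.5 * iqr)
cleaned = heights[mask]  # the typo 1800 is removed
```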
Drop features with low variance: If a given column in a
dataset has values that change very little, you may be able
to drop it since it’s unlikely to offer as much useful
information about a response variable compared to other
features.
Drop features with low correlation with the response
variable: If a certain feature is not highly correlated with
the response variable of interest, you can likely drop it from
the dataset since it’s unlikely to be a useful feature in a
model
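Both heuristics can be sketched with pandas; the thresholds and the toy data are assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "sq_ft":    [1000, 1500, 2000, 2500],
    "has_roof": [1, 1, 1, 1],          # zero variance: drop candidate
    "price":    [200, 300, 400, 500],  # response variable
})

# Drop features with (near-)zero variance.
low_var = [c for c in df.columns if df[c].var() == 0]
df = df.drop(columns=low_var)

# Drop features with low correlation to the response.
corr = df.corr()["price"].abs()
low_corr = [c for c in corr.index if c != "price" and corr[c] < 0.1]
```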
Conventional Definition of Data Quality
Accuracy
The data was recorded correctly.
Completeness
All relevant data was recorded.
Uniqueness
Entities are recorded once.
Consistency
The data agrees with itself.
Splitting Data
Training set: a subset used to train (build) the model.
Test set: a subset used to test the trained model.
Make sure that your test set meets the following two conditions:
Is large enough to yield statistically meaningful results.
Is representative of the data set as a whole. In other words, don't pick
a test set with different characteristics than the training set.
For example, consider a model that predicts
whether an email is spam, using the subject line,
email body, and sender's email address as
features. We apportion the data into training
and test sets, with an 80-20 split.
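The 80-20 split described above can be done with scikit-learn's `train_test_split`; toy arrays stand in for the spam data, which is not shown:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)  # 100 toy examples
y = X.ravel() % 2                   # toy binary labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # 80-20 split
```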
Validation Set: Another Partition
The validation set is used during the training phase of the model to provide
an unbiased evaluation of the model's performance and to fine-tune the
model's parameters. The test set, on the other hand, is used after the model
has been fully trained to assess the model's performance on completely
unseen data.
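One common way to carve out all three partitions is two successive splits; the 60/20/20 proportions below are an illustrative assumption:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)
y = np.zeros(100)

# First split off 40% for validation + test, then halve that part.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0)
```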
Types of Learning Errors
What is Overfitting?(over complicating)
What is Underfitting? (oversimplification)
Examples of Overfitting
Now, assume we train a model from a dataset of 10,000 resumes and their
outcomes.
Next, we try the model out on the original dataset, and it predicts outcomes with
99% accuracy… wow!
But now comes the bad news.
When we run the model on a new (“unseen”) dataset of resumes, we only get
50% accuracy… uh-oh!
Our model doesn’t generalize well from our training data to unseen data.
This is known as overfitting, and it’s a common problem in machine learning and
data science.
In fact, overfitting occurs in the real world all the time. You only need to turn on
the news
Underfitting
Underfitting occurs when a model is too simple –
informed by too few features or regularized too much –
which makes it inflexible in learning from the dataset.
Simple learners tend to have less variance in their
predictions but more bias towards wrong outcomes (see:
The Bias-Variance Tradeoff).
What Is a Good Fit In Machine Learning?
To find the good fit model, you need to look at the performance of a
machine learning model over time with the training data.
As the algorithm learns over time, the error for the model on the training data
reduces, as well as the error on the test dataset.
If you train the model for too long, the model may learn the unnecessary
details and the noise in the training set and hence lead to overfitting.
To achieve a good fit, you need to stop training at the point
where the error on the test dataset starts to increase.
Traditional machine learning uses hand-crafted features.
What is a feature?
A feature in machine learning is an individual measurable property or
characteristic of an object.
Features are the input that you feed to your machine learning model to
output a prediction or classification.
Suppose you want to predict the price of a house, your input features
(properties) might include: square_foot, number_of_rooms, bathrooms, etc.
and the model will output the predicted price based on the values of your
features.
Selecting good features that clearly distinguish your objects increases the
predictive power of machine learning algorithms.
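The house-price example above as code; the numbers are made up, and plain linear regression is just one possible model choice:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Feature columns: square_foot, number_of_rooms, bathrooms
X = np.array([[1000, 2, 1],
              [1500, 3, 2],
              [2000, 4, 2],
              [2500, 4, 3]])
y = np.array([200_000, 300_000, 400_000, 500_000])  # prices

model = LinearRegression().fit(X, y)
# Predict the price of a new house from its feature values.
predicted_price = model.predict(np.array([[1800, 3, 2]]))[0]
```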
Features must be...
Identifiable
Easily tracked and compared
Consistent across different scales, lighting conditions, and viewing angles
Still identifiable in noisy environment
If I give you a feature like a wheel and ask you to guess whether
the object is a motorcycle or a dog, what would your guess be?
A motorcycle. Correct! In this case,
the wheel is a strong feature that clearly distinguishes between
motorcycles and dogs.
Now suppose I give you the same feature (a wheel) and ask you to guess
whether the object is a bicycle or a motorcycle.
In this case, the feature isn't strong enough to distinguish between
the two objects. We then need to look for more features, like a mirror,
a license plate, or maybe a pedal, that collectively describe the object.
What makes a good (useful) feature?
Machine learning models are only as good as the features you
provide.
That means coming up with good features is an important job in
building ML models.
But what makes a good feature? And how can you tell?
What makes a good (useful) feature?
Let’s discuss this by an example:
Suppose we want to build a classifier to
tell the difference between two types of
dogs, Greyhound and Labrador.
Let’s take two features and evaluate
them:
1) the dogs’ height and
2) their eye color.
Let’s begin with height.
Well, on average, Greyhounds tend to be a couple
of inches taller than Labradors, but not always. A lot
of variation exists in the world. Let’s evaluate this
feature across different values in both breeds'
populations.
We can visualize the height distribution on a toy
example in the histogram below:
if height <= 20: return higher probability of
Labrador
if height >= 30: return higher probability of
Greyhound
if 20 < height < 30: look for other features to
classify the object.
Let’s look at eye color. imagine
that we have only two eye colors,
blue and brown. Here’s what a
histogram might look like for this
example:
It’s clear that for most values, the
distribution is about 50/50 for
both types.
Practically this feature tells us
nothing because it doesn’t
correlate with the type of dog.
Hence, it doesn’t distinguish
between Greyhounds and
Labradors.
Disadvantages of high dimensional data
Cannot be visualized
Needs more time for processing
Needs large storage
Noisy data
Dimensionality reduction (simplification)
What are the new axes?
[Figure: new axes PC 1 and PC 2 overlaid on original variables A and B]
• Orthogonal directions of greatest variance in data
• Projections along PC1 spread out (discriminate) the data the most along any single axis
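A minimal PCA sketch of these ideas with scikit-learn (the toy two-variable data below is an assumption):

```python
import numpy as np
from sklearn.decomposition import PCA

# Two correlated variables: B is roughly 2*A plus small noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, 2 * a + rng.normal(scale=0.1, size=200)])

pca = PCA(n_components=2).fit(X)
# PC1 points along the direction of greatest variance and
# captures almost all of it for this data.
ratios = pca.explained_variance_ratio_
```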
Example