
Semi-supervised Learning

• Labelled + unlabelled data


• A learning setting in which the training data contains a small number of labelled
examples and a large number of unlabelled examples.
• Cluster analysis is a method that seeks to partition a dataset into homogenous
subgroups, meaning grouping similar data together with the data in each group
being different from the other groups.
• Semi-supervised clustering uses some known cluster information in order to
classify other unlabelled data; that is, it uses both labelled and unlabelled
data, just like semi-supervised learning in general.
• Example: A text document classifier => Semi-supervised learning allows
for the algorithm to learn from a small amount of labelled text documents
while still classifying a large amount of unlabelled text documents in the
training data.
• Common algorithms: Co-training, Semi-supervised SVM
Reinforcement Learning
• Describes a class of problems where an agent
operates in an environment and must learn to
operate using feedback.
• Reinforcement Learning has four essential
elements:
• Agent: The program you train, with the
aim of doing a job you specify.
• Environment: The world, real or virtual,
in which the agent performs actions.
• Action: A move made by the agent,
which causes a status change in the
environment.
• Rewards: The evaluation of an action,
which can be positive or negative.
Applications of Reinforcement Learning
Advantages of Reinforcement Learning
1. It can solve higher-order and complex problems. Also, the solutions obtained
will be very accurate.
2. The reason for this accuracy is that the approach is very similar to the way humans
learn.
3. This model will undergo a rigorous training process that can take time. This can
help to correct any errors.
4. Due to its learning ability, it can be used with neural networks. This can be
termed as deep reinforcement learning.
5. Since the model learns constantly, a mistake made earlier would be unlikely to
occur in the future.
6. Various problem-solving models are possible to build using reinforcement
learning.
7. When it comes to creating simulators, object detection in automatic cars,
robots, etc., reinforcement learning plays a great role in the models.
8. For various problems, which might seem complex to us, it provides the perfect
models to tackle them.
Feature

What is a feature and why do we need to engineer it?


• Basically, all machine learning algorithms use some input data to create outputs.
• This input data comprises features, which are usually in the form of structured columns.
Algorithms require features with specific characteristics to work properly.
• Here, the need for feature engineering arises.

Two goals of Feature Engineering:


• Preparing the proper input dataset, compatible with the machine learning algorithm
requirements.
• Improving the performance of machine learning models.
Feature

According to a survey reported in Forbes, data scientists spend about 80% of their time on data
preparation.
Feature Engineering

“Coming up with features is difficult, time-consuming,


requires expert knowledge. ‘Applied machine learning’
is basically feature engineering.”
— Prof. Andrew Ng.
Steps in Machine Learning Pipeline

• Data Collection:
 Where machine learning pipeline begins.
 Most common data formats:
CSV (Comma-Separated Values), JSON, HTML, XML, web scraping, SQL

• Data Description:
 Data collected from different sources and stored in different formats is described
using the following data types:
Numeric, Text, Categorical (Nominal, Ordinal)

• Data Wrangling:
Data wrangling or data munging is the process of cleaning, transforming,
and mapping data from one form to another to utilize it for tasks such as analytics,
summarization, reporting, visualization, and so on.
Data Wrangling ...

Data Wrangling: Six steps

1. Discovering: Understand what is in your data, which will inform how you want
to analyze it.
2. Structuring: Organizing the data.
A single column may turn into several rows for easier analysis.
One column may become two.
Data is moved or reshaped for easier computation and analysis.
3. Cleaning: Null values are handled and standard formatting is applied.

4. Enriching: Addition of new data to enrich the existing data.

5. Validating: In validation you verify data consistency, quality, and security

6. Publishing: Analysts prepare the wrangled data for use downstream – whether
by a particular user or software – and document any particular steps taken or logic
used to wrangle said data.
Data Cleaning ...

Data Cleaning: Data cleaning refers to identifying and correcting errors in the
dataset that may negatively impact a predictive model

• Using statistics to define normal data and identify outliers.


• Identifying columns that have the same value or no variance and removing
them.
• Identifying duplicate rows of data and removing them.
• Marking empty values as missing.
• Imputing missing values using statistics or a learned model.
Data Cleaning ...

Basics:
1. Identify and remove redundant samples
2. Identify and remove redundant features
3. Identify and remove features with low variance
4. Identify and remove features with a single value.

Outlier:
An outlier is an observation (data point) that lies an abnormal
distance from the other values in a given population (the odd man out).
e.g. Age: 18, 22, 45, 67, 89, 125, 30
List of Cities: New York, Los Angeles, London, France, Delhi
Data Cleaning ...

Finding Outliers:
1. Univariate Analysis
2. Multivariate Analysis
1. Univariate Analysis
- Using Box-Plot
Data Cleaning ...
Finding Outliers: Multivariate Analysis

Scatter plots are also used for identifying the outliers


Data Cleaning ...
Missing Values : 1. Mark 2. Imputation
2. Imputation: (Numerical Data)
• Mean, median, mode imputation

• Regression Imputation:
 Mean, median or mode imputation only look at the distribution of the values of the
variable with missing entries.
 If there is a correlation between the missing value and other variables, we can
often get better guesses by regressing the missing variable on other variables.

• K-Nearest Neighbour Imputation:

• For a discrete variable, the KNN imputer uses the most frequent value among the k nearest
neighbours.
• For a continuous variable, it uses the mean of the k nearest neighbours.
Data Cleaning ...

Missing Values : 1. Mark 2. Imputation


2. Imputation: (Categorical Data)
• Replacing the missing values with the most frequently occurring value in a
column is a good option for handling categorical columns.
• But if you think the values in the column are distributed uniformly and
there is no dominant value, imputing a category such as “Other” may be
more sensible.
Feature Engineering...

• A feature is typically a specific representation on top of raw data: an
individual, measurable attribute.
• It is typically depicted by a column in a dataset.
• Considering a generic two-dimensional dataset, each observation is
depicted by a row and each feature by a column, which will have a specific
value for an observation.

• Each row typically indicates a feature vector and the entire set of features
across all the observations forms a two-dimensional feature matrix also known
as a feature-set.
Feature Engineering...

A standard pipeline for feature engineering, scaling, and selection

“Feature engineering is the process of transforming raw data into features that better represent
the underlying problem to the predictive models, resulting in improved model accuracy on
unseen data.”
— Dr. Jason Brownlee
Feature Engineering

• Raw data: This is data in its native form after data retrieval from source. Typically some
amount of data processing and wrangling is done before the actual process of feature
engineering.
• Features: These are specific representations obtained from the raw data after the process
of feature engineering.
• The underlying problem: This refers to the specific business problem or use case we want
to solve with the help of Machine Learning. The business problem is typically converted
into a Machine Learning task.
• The predictive models: Typically feature engineering is used for extracting features to
build Machine Learning models that learn about the data and the problem to be solved
from these features. Supervised predictive models are widely used for solving diverse
problems.
• Model accuracy: This refers to model performance metrics that are used to evaluate the
model.
• Unseen data: This is basically new data that was not used previously to build or train the
model. The model is expected to learn and generalize well for unseen data based on good
quality features.
Feature Engineering

• Feature engineering is indeed both an art and a science to transform data into
features for feeding into models.
• For feature engineering you need combination of domain knowledge,
experience, intuition, and mathematical transformations to give you the features
you need.

Examples of engineering features

• Deriving a person’s age from birth date and the current date
• Getting the average and median view count of specific songs and music videos
• Extracting word and phrase occurrence counts from text documents
• Extracting pixel information from raw images
• Tabulating occurrences of various grades obtained by students
Feature Engineering

Why Feature Engineering?


• Better representation of data: Features are basically various representations of the
underlying raw data. These representations can be better understood by Machine Learning
algorithms.
• Better performing models: The right features tend to give models that outperform other
models no matter how complex the algorithm is.
In general if you have the right feature set, even a simple model will perform well and
give desired results. In short, better features make better models.
• Essential for model building and evaluation:
Raw data cannot be used to build Machine Learning models.
Get our data, extract features, and start building models!
Also on evaluating model performance and tuning the models, you can reiterate
over your feature set to choose the right set of features to get the best model.
• More flexibility on data types: Feature engineering helps us build models on diverse data
types by applying necessary transformations and enables us to work even on complex
unstructured data.
• Emphasis on the business and domain: Feature engineering emphasizes focusing on
the business and the domain of the problem when building features.
Feature Engineering...

Feature Engineering on Numeric Data


1. Raw Measures:
 Raw numeric data can often be fed directly to machine learning
models based on the context and data format.
 Raw measures are typically indicated using numeric variables
directly as features without any form of transformation or
engineering.
 Typically these features can indicate values or counts.
2. Values
 Several attributes represent numeric raw values which can be used
directly.
3. Counts
 Another form of raw measures include features which represent
frequencies, counts or occurrences of specific attributes.
Feature Engineering...

4. Binarization:
 Often the raw frequencies or counts are not relevant for building a model
for the problem being solved; only the presence or absence of the attribute
matters, so the count can be converted into a binary 1/0 feature.

5. Rounding:
Often when dealing with continuous numeric attributes like
proportions or percentages, we may not need the raw values
having a high amount of precision.
 Hence it often makes sense to round off these high precision
percentages into numeric integers.
Feature Engineering...

6. Binning (Discretization)
 The problem of working with raw, continuous numeric features is that often
the distribution of values in these features will be skewed.
 This signifies that some values will occur quite frequently while some will
be quite rare.
 Problem of the varying range of values in any of these features.
 There are strategies to deal with this, which include binning and
transformations.
 Binning, also known as quantization is used for transforming continuous
numeric features into discrete ones (categories).
 These discrete values or numbers can be thought of as categories or bins
into which the raw, continuous numeric values are binned or grouped.
 Each bin represents a specific degree of intensity, and hence a specific
range of continuous numeric values falls into it.
Feature Engineering...

6. Binning (Discretization)
Feature Engineering...

Binning Types
 Two types of Binning: 1. Fixed-Width 2. Adaptive Binning
Fixed-Width Binning: We manually create fixed-width bins based on some rules
and domain knowledge.
Ex.
age = [12, 15, 13, 78, 65, 42, 98, 24, 26, 38, 27, 32, 22, 45, 27]
Now, let's create bins of fixed width (say 10):
bins = [0 {0-9}, 1 {10-19}, 2 {20-29}, 3 {30-39}, 4 {40-49}, 5 {50-59}, 6 {60-69}, 7 {70-79}, 8 {80-89}, 9 {90-99}]
After binning, the age variable looks like this:
age = [1, 1, 1, 7, 6, 4, 9, 2, 2, 3, 2, 3, 2, 4, 2]
Limitation:
• The bins can be irregular and non-uniform in terms of the number of data
points or values that fall into each bin.
• Some of the bins might be densely populated and some of them might be
sparsely populated or even empty.
Feature Engineering...

Adaptive Binning
• In adaptive binning, the data distribution itself decides the bin ranges.
• No manual intervention is required.
• Quantile-based binning is a good strategy to use for adaptive binning.
• Quantiles are values that divide the data into equal portions.
• Thus, q-Quantiles help in partitioning a numeric attribute into q equal
partitions.
• Popular examples of quantiles include the 2-Quantile, known as the median,
which divides the data distribution into two equal halves.
• The 4-Quantiles, known as the quartiles, divide the data into 4 equal-frequency bins.
• The 10-Quantiles, known as the deciles, create 10 equal-frequency bins.
Feature Engineering...
• #Numerical Binning Example
• Value Bin
• 0-30 -> Low
• 31-70 -> Mid
• 71-100 -> High

• #Categorical Binning Example


• Value Bin
• Spain -> Europe
• Italy -> Europe
• Chile -> South America
• Brazil -> South America
Feature Engineering...

7. Statistical Transformations
• Statistical or mathematical transformations are applied mainly to stabilize
variance and make skewed distributions closer to normal.
a. Log Transform
• The log transform belongs to the power transform family of functions.
• It helps to handle skewed data; after transformation, the distribution
becomes more approximately normal.
• In most cases the order of magnitude of the data changes within the
range of the data.
• It also decreases the effect of outliers, due to the normalization of
magnitude differences, and the model becomes more robust.
• Mathematically it is represented as y = log(x), often applied as y = log(1 + x)
so that zero values can be handled.
Feature Engineering...

Feature Engineering on Categorical Data


 Feature Encoding is used for the transformation of a categorical feature into a
numerical variable

1. Label Encoding:
 It is a technique to transform categorical variables into numerical variables by
assigning a numerical value to each of the categories.
 Label encoding can be used for Ordinal variables
e.g. assigning each category an integer code (see the combined encoding sketch later in this section)
Feature Engineering...
2. Ordinal encoding:
 Ordinal encoding is an encoding technique to transform an original categorical variable
into a numerical variable while preserving the order of the categories.

3. Frequency encoding:
 Frequency encoding is an encoding technique to transform an original categorical variable to
a numerical variable by considering the frequency distribution of the data.
Feature Engineering...

4. Binary encoding:
Binary encoding is an encoding technique to transform an original categorical variable into a
numerical variable by encoding the categories as integers and then converting them into
binary code.
Feature Engineering...

5. One hot encoding:


 One-hot encoding splits each category into its own column.
 It creates k different columns, one per category, and places a 1 in the matching
column and a 0 in the rest.
Feature Engineering...
Feature Scaling:
 When dealing with numeric features, we have specific attributes which may be
completely unbounded in nature, like view counts of a video or web page hits.
 Using the raw values as input features might make models biased toward
features having really high magnitude values.
Feature Scaling Techniques:
1. Standardized Scaling:
 The standard scaler standardizes each value in a feature column by
removing the mean and scaling to unit variance.
 This is also known as centering and scaling, and can be denoted mathematically
as z = (x − μ) / σ, where μ is the mean and σ the standard deviation of the feature.

2. Min-Max Scaling
 Here we transform and scale our feature values such that each value lies within the
range [0, 1], using x' = (x − min(x)) / (max(x) − min(x)).
Feature Engineering...

Robust Scaling
 The disadvantage of min-max scaling is that often the presence of outliers affects the
scaled values for any feature.
 Robust scaling uses specific statistical measures to scale features without being
affected by outliers. Mathematically this scaler can be represented as
X_scaled = (X − median(X)) / IQR(X)
where X is each feature value and IQR is the Inter-Quartile Range of X, which is the range
(difference) between the first quartile (25th percentile) and the third quartile (75th percentile).
Dimensionality Reduction

• The higher the number of features, the harder it gets to visualize the training set and
then work on it.
• Sometimes, most of these features are correlated, and hence redundant.
• Dimensionality reduction is the process of reducing the total number of features
in our feature set using strategies like feature selection or feature extraction

 Feature Selection methods: A subset of the original list of features is selected
and the other features are discarded.
No new features are generated in this process.

 Feature Extraction methods: We engineer or extract new features from the original list
of features in the data.
Thus the reduced set of features will contain newly generated features.
Feature Selection...

1. Filter methods:
 These techniques select features purely based on metrics like correlation,
mutual information and so on. (measured using statistical analysis)
 These methods do not depend on results obtained from any model and usually
check the relationship of each feature with the response variable to be
predicted.
 Popular methods include threshold based methods and statistical tests.
2. Wrapper methods:
 These techniques try to capture interaction between multiple features by using
a recursive approach to build multiple models using feature subsets and select
the best subset of features giving us the best performing model.
 Methods like forward selection and backward elimination are popular
wrapper-based methods.
Feature selection Techniques:

3. Embedded methods:
 These techniques try to combine the benefits of the other two methods.
 Rank and score feature variables based on their importance.
 Tree based methods like decision trees and ensemble methods like random forests
are popular examples of embedded methods
Filter methods:
Different Filter methods:
1. Information Gain
• Information gain calculates the reduction in entropy from the transformation
of a dataset.
• It can be used for feature selection by evaluating the Information gain of each
variable in the context of the target variable.
Different Filter methods:

2. Chi-square Test
• The Chi-square test is used for categorical features in a dataset.
• The Chi-square statistic is calculated between each feature and the target, and the
desired number of features with the best Chi-square scores is selected (see the sketch
after this list of methods).
3. Fisher’s Score:
• Fisher score is one of the most widely used supervised feature selection
methods.
• The algorithm returns the ranks of the variables based on the Fisher score in
descending order.

• The Fisher score is the distance between the sample means for each class per feature,
divided by their variances.
4.Correlation Coefficient:
• Correlation is a measure of the linear relationship of 2 or more variables.
• Through correlation, we can predict one variable from the other.
• The logic behind using correlation for feature selection is that the good variables are
highly correlated with the target.
• Variables should be correlated with the target but should be uncorrelated among
themselves.
• If two variables are correlated, we can predict one from the other.
• If two features are correlated, the model only really needs one of them, as the second
one does not add additional information.

• e.g. Pearson Correlation.


5. Variance Threshold:
• The variance threshold is a simple baseline approach to feature selection
(see the sketch after this list of methods).
• It removes all features whose variance doesn’t meet some threshold.
• By default, it removes all zero-variance features, i.e., features that have the
same value in all samples.
6.Mean Absolute Difference (MAD):
The mean absolute difference (MAD) computes the absolute difference from the
mean value.
• The main difference between the variance and MAD measures is the absence
of the square in the latter.
• The MAD, like the variance, is also scale-variant.
• This means that the higher the MAD, the higher the discriminatory power.
Advantages of Filter methods

• Filter methods are model agnostic


• Rely entirely on features in the data set
• Computationally very fast
• Based on different statistical methods

Disadvantage of Filter methods

• The filter method looks at individual features to identify their relative
importance.
• A feature may not be useful on its own but maybe an important influencer
when combined with other features.
• Filter methods may miss such features.
2.Wrapper Methods:
• Search the space of all possible subsets of features, assessing their quality by
learning and evaluating a classifier with that feature subset.
• The feature selection process is based on a specific machine learning algorithm
that we are trying to fit on a given dataset.
• It follows a greedy search approach by evaluating all the possible combinations
of features against the evaluation criterion.
• The wrapper methods usually result in better predictive accuracy than filter
methods.
Types of Wrapper Methods:
a) Recursive Feature Elimination
Recursive Feature Elimination (RFE) recursively removes the weakest (redundant) features
until the desired number of features is reached, improving the performance and accuracy
of the model (both wrapper approaches are illustrated in the sketch after this list).
b) Forward Feature Selection:
• Starts with a single feature.
• This is an iterative method in which we start with the best-performing
variable against the target.
• Next, select another variable that gives the best performance in combination
with the first selected variable.
• This process continues until the preset criterion is achieved.
c) Backward Feature Elimination
This method works exactly opposite to the Forward Feature Selection method.
• Start with all the features available and build a model.
• Next, eliminate a feature and check the model performance.
• This process is continued until the preset criterion is achieved.
Algorithm for Forward Selection
Algorithm for Backward Selection

Backward search has the same order of complexity as forward search, except that
training a system with more features is costlier than training a system with fewer
features, so forward search may be preferable, especially if we expect many useless
features.
Terminologies for Algorithm
•In either case, checking the error should be done on a validation set which is
distinct from the training set.

•With more features, generally training error can be reduced, but validation error
may not be reduced.

• Let F denote a feature set of input dimensions x_i, i = 1, ..., d.

• E(F) denotes the error incurred on the validation sample when only the inputs
in F are used.
•Depending on the application, the error is either the mean square error or
misclassification error.
Terminologies for Algorithm
Forward (or backward) search may be costly: to decrease the dimensionality from d to k, the
system is trained and tested d + (d−1) + (d−2) + ··· + (d−k) times, and the time required is
O(d²).

•Local search procedure which does not guarantee finding the optimal subset, namely,
the minimal subset causing the smallest error.

• For example, x_i and x_j individually may not have a good effect, but together they may
decrease the error significantly. In this situation forward selection is not a good choice:
because the algorithm is greedy and adds attributes one by one, it may not be able to
detect the combined effect of more than one feature.
Feature Extraction Techniques
Feature Extraction: Feature extraction creates new features as functions of the
original features.
PCA:

• PCA is a method of obtaining important variables (in form of components)


from a large set of variables available in a data set.
• It tends to find the direction of maximum variation (spread) in data.
• PCA is more useful when dealing with 3 or higher-dimensional data.
• PCA will try to reduce dimensionality by exploring how one feature of the
data is expressed in terms of the other features(linear dependency).
Principal Component Analysis
PCA converts all correlations among all the cells into 2-D graphs.
What is Principal Component Analysis?

• The Principal Component Analysis is a popular unsupervised learning


technique for reducing the dimensionality of data.
• It increases interpretability yet, at the same time, it minimizes information
loss.
• It helps to find the most significant features in a dataset and makes the data
easy for plotting in 2D and 3D.
• PCA helps in finding a sequence of linear combinations of variables.

What is a Principal Component?

• A Principal Component is a straight line (direction) that captures most of the
variance of the data.
• Principal components have a direction and a magnitude.
• Principal components are orthogonal (perpendicular) projections of the data onto a
lower-dimensional space.
Principal Component Analysis…
Mathematics behind PCA:
1. Take the whole dataset consisting of d+1 dimensions and ignore the labels
such that our new dataset becomes d dimensional.

2. Compute the mean for every dimension of the whole dataset.

3. Compute the covariance matrix of the whole dataset.

4. Compute eigenvectors and the corresponding eigenvalues.

5. Sort the eigenvectors by decreasing eigenvalues and choose k eigenvectors


with the largest eigenvalues to form a d × k dimensional matrix W.

6. Use this d × k eigenvector matrix to transform the samples onto the new
subspace.
Principal Component Analysis…
Applications of PCA in Machine Learning
• PCA is used to visualize multidimensional data.
• It is used to reduce the number of dimensions in healthcare data.
• PCA can help resize an image.
• It can be used in finance to analyze stock data and forecast returns.
• PCA helps to find patterns in the high-dimensional datasets.
Dataset Preparation
Background :
• Machine learning is at the peak of its popularity today.
• Despite this, a lot of decision-makers are in the dark about what exactly is
needed to design, train, and successfully deploy a machine learning algorithm.
• The details about collecting the data, building a dataset, and annotation
specifics are neglected as supportive tasks.

Sources of Data:
Dataset Preparation
The Features of a Proper, High-Quality Dataset in Machine Learning:

Quality of a Dataset: Relevance and Coverage


• High quality is the essential thing to take into consideration when you collect a
dataset for a machine learning project

Sufficient Quantity of a Dataset in Machine Learning


• Not only quality but quantity matters, too.
• It's important to have enough data to train your algorithm properly.
Dataset Preparation
Dataset :
• Definition: The Oxford Dictionary defines a dataset as “a collection of data that is
treated as a single unit by a computer”.
• Dataset contains a lot of separate pieces of data but can be used to train an
algorithm with the goal of finding predictable patterns inside the whole
dataset.
Dataset Preparation
• In data science, training data and testing data play two major roles.
• Evaluating the performance of a built model is just as significant as training
and building the model
• To ensure the accuracy of the predictions, you must test and validate the
model well enough.
Dataset Preparation
Training Data
• Training data are the sub-dataset which we use to train a model.
• These datasets contain data observations in a particular domain.
• Algorithms study the hidden patterns and insights which are hidden inside
these observations and learn from them.
• The model is trained over and over again using the data in the training set
and continues to learn the features of this data.
• Later, the trained model is deployed and is expected to make accurate predictions on
new data.
Dataset Preparation
Test data
• The sample of data used to provide an unbiased evaluation of a final model fit
on the training dataset.
• It is only used once a model is completely trained (using the train and
validation sets).
• Although both train and test data are extracted from the same dataset, the test
set should not contain any data from the training set.
• The purpose of creating a model is to predict unknown results.
• The test data is used to check the performance, accuracy, and precision of the
model created using training data.
Commonly used training data / testing data split ratios:
• Train: 80%, Test: 20%
• Train: 67%, Test: 33%
• Train: 50%, Test: 50%
Dataset Preparation
Validation Dataset:
• The sample of data used to provide an unbiased evaluation of a model fit on the training
dataset while tuning model hyperparameters.
• The evaluation becomes more biased as skill on the validation dataset is incorporated into
the model configuration.
• The validation set is used to evaluate a given model, but this is for frequent evaluation.
• This data is used to fine-tune the model hyperparameters.
• Hence the model occasionally sees this data, but never does it “Learn” from this.
• The validation set affects a model, but only indirectly. The validation set is also known as
the Dev set or the Development set.
• This makes sense since this dataset helps during the “development” stage of the model.
Dataset Preparation
• What is cross-validation?
• Cross-Validation is a resampling technique that helps to make a model efficient and accurate
on unseen data.
• It is a method for evaluating Machine Learning models by training several models on subsets
of the available input data and evaluating them on the complementary subset of the data.
Hold-out Method for Training Machine Learning Models:
• The hold-out method for training machine learning model is the process of splitting the data
in different splits and using one split for training the model and other splits for validating
and testing the models.
• The hold-out method is used for both model evaluation and model selection.
Dataset Preparation
• Hold-out method for Model Evaluation
• Hold-out method for model evaluation represents the mechanism of splitting the
dataset into training and test dataset and evaluating the model performance in
order to get the most optimal model.
Dataset Preparation
Process of using hold-out method for model evaluation:

• Split the dataset into two parts (typically a 70-30% split, though the
percentage can vary).
• Train the model on the training dataset; while training the model, a fixed set
of hyperparameters is selected.
• Test or evaluate the model on the held-out test dataset.
• Train the final model on the entire dataset to get a model which can generalize
better on unseen or future data.
Dataset Preparation
Hold-out method for Model Selection
• The model selection process is also referred to as hyperparameter tuning.
• In the hold-out method for model selection, the dataset is split into three different sets.
Dataset Preparation
Process of Hold-out method for Model Selection
Dataset Preparation
Process of Hold-out method for Model Selection
• Split the dataset in three parts – Training dataset, validation dataset and test
dataset.
• Train different models using different machine learning algorithms. For
example, train the classification model using logistic regression, random forest,
XGBoost.
• For each of the algorithms mentioned in step 2, change the hyperparameter
settings and come up with multiple models.
• Test the performance of each of these models (belonging to each of the
algorithms) on the validation dataset.
• Select the most optimal model out of the models tested on the validation dataset.
The most optimal model will have the most optimal hyperparameter settings
for a specific algorithm.
• Test the performance of the most optimal model on the test dataset.
The hold-out method is one of the cross-validation techniques.
Dataset Preparation
Pros and Cons of Hold-out Method

Pros
• Simple, easy to understand, and implement.
• This Method is Fully independent of data.
Cons:
• Not suitable for an imbalanced dataset.
• A lot of data is isolated from training the model
Dataset Preparation
Cross-Validation Techniques:
1. k-fold cross-validation:
• In k-fold cross-validation, the original dataset is equally partitioned into k subparts or folds.
• Out of the k-folds or groups, for each iteration, one group is selected as validation data, and
the remaining (k-1) groups are selected as training data.

• The process is repeated for k times until each group is treated as validation and remaining as training
data.
Dataset Preparation

The final accuracy of the model is computed by taking the mean of the accuracies obtained
on the k validation folds.

Pros:

• The model has low bias


• Low time complexity
• The entire dataset is utilized for both training and validation.
• Models may not be affected much if an outlier is present in data.
• It helps us overcome the problem of variability

Cons:
• Not suitable for an imbalanced dataset.
Dataset Preparation
2.Leave-one-out cross-validation:

• Leave-one-out cross-validation (LOOCV) is an exhaustive cross-validation


technique.
• It is a special case of Leave-p-Out cross-validation (LpOCV) with p = 1 (p observations are used
as validation data, and the remaining data is used to train the model).

• For a dataset having n rows, 1st row is selected for validation, and the rest (n-1)
rows are used to train the model.
• For the next iteration, the 2nd row is selected for validation and rest to train the
model.
• Similarly, the process is repeated until n steps or the desired number of
operations.
Dataset Preparation

Pros:
• Simple, easy to understand, and implement.
Cons:
• The error estimate can have high variance (although its bias is low).
• The computation time required is high.
Thank You
