Semi-Supervised Learning
According to a survey in Forbes, data scientists spend 80% of their time on data
preparation:
Feature Engineering
• Data Collection:
Where the machine learning pipeline begins.
Most common data formats and sources:
CSV (Comma-Separated Values), JSON, HTML, XML, SQL, and web scraping
• Data Description:
Data collected from different sources and stored in different formats is described
using the following data types:
Numeric, Text, Categorical (Nominal, Ordinal)
• Data Wrangling:
Data wrangling or data munging is the process of cleaning, transforming,
and mapping data from one form to another to utilize it for tasks such as analytics,
summarization, reporting, visualization, and so on.
Data Wrangling ...
1. Discovering: Understand what is in your data, which will inform how you want
to analyze it.
2. Structuring: Organizing the data
A single column may turn into several rows for easier analysis.
One column may become two.
Data is rearranged to make computation and analysis easier.
3. Cleaning: Null values are handled and standard formatting is applied so that entries are consistent.
4. Enriching: Augmenting the data with additional data that makes it more useful.
5. Validating: Verifying the consistency and quality of the data with repeatable checks.
6. Publishing: Analysts prepare the wrangled data for use downstream – whether
by a particular user or software – and document any particular steps taken or logic
used to wrangle said data.
Data Cleaning ...
Data Cleaning: Data cleaning refers to identifying and correcting errors in the
dataset that may negatively impact a predictive model.
Basics:
1. Identify and remove duplicate (redundant) samples.
2. Identify and remove redundant features.
3. Identify and remove features with low variance.
4. Identify and remove features with a single value.
Outlier:
An outlier is an observation that lies an abnormal distance from the other
values in a given population (the "odd man out").
e.g. Age: 18, 22, 45, 67, 89, 125, 30
List of cities: New York, Los Angeles, London, France, Delhi (France, a country rather than a city, is the odd entry)
Data Cleaning ...
Finding Outliers:
1. Univariate Analysis
2. Multivariate Analysis
1. Univariate Analysis
- Using Box-Plot
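As an illustration of the box-plot rule, here is a minimal Python sketch of IQR-based univariate outlier detection; the age values extend the earlier example with a few extra hypothetical entries.

```python
import numpy as np

# Hypothetical age column; 89 and 125 are implausibly far from the rest.
ages = np.array([18, 22, 45, 67, 89, 125, 30, 25, 31, 40, 28, 36, 21, 33])

# Box-plot rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(ages[(ages < lower) | (ages > upper)])   # -> [ 89 125]
```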
Data Cleaning ...
Finding Outliers: Multivariate Analysis
• Regression Imputation:
Mean, median, or mode imputation looks only at the distribution of values of the
variable with missing entries.
If the variable with missing values is correlated with other variables, we can
often get better guesses by regressing the missing variable on the other variables.
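A minimal sketch of regression imputation, assuming scikit-learn is available; IterativeImputer regresses each feature with missing values on the other features (by default with a Bayesian ridge regressor). The toy matrix is hypothetical.

```python
import numpy as np
# IterativeImputer is still marked experimental, so this enabling import is required.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical feature matrix: the second column is roughly 2x the first,
# and one of its entries is missing.
X = np.array([[1.0, 2.1],
              [2.0, 3.9],
              [3.0, np.nan],
              [4.0, 8.2]])

# Each feature with missing values is regressed on the other features,
# and the model's prediction fills the gap.
print(IterativeImputer(random_state=0).fit_transform(X))
```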
• Each row typically represents a feature vector, and the entire set of features
across all observations forms a two-dimensional feature matrix, also known
as a feature set.
Feature Engineering...
“Feature engineering is the process of transforming raw data into features that better represent
the underlying problem to the predictive models, resulting in improved model accuracy on
unseen data.”
— Dr. Jason Brownlee
Feature Engineering
• Raw data: This is data in its native form after data retrieval from source. Typically some
amount of data processing and wrangling is done before the actual process of feature
engineering.
• Features: These are specific representations obtained from the raw data after the process
of feature engineering.
• The underlying problem: This refers to the specific business problem or use case we want
to solve with the help of Machine Learning. The business problem is typically converted
into a Machine Learning task.
• The predictive models: Typically feature engineering is used for extracting features to
build Machine Learning models that learn about the data and the problem to be solved
from these features. Supervised predictive models are widely used for solving diverse
problems.
• Model accuracy: This refers to model performance metrics that are used to evaluate the
model.
• Unseen data: This is basically new data that was not used previously to build or train the
model. The model is expected to learn and generalize well for unseen data based on good
quality features.
Feature Engineering
• Feature engineering is indeed both an art and a science to transform data into
features for feeding into models.
• For feature engineering you need a combination of domain knowledge,
experience, intuition, and mathematical transformations to obtain the features
you need.
• Deriving a person’s age from birth date and the current date (see the sketch after this list)
• Getting the average and median view count of specific songs and music videos
• Extracting word and phrase occurrence counts from text documents
• Extracting pixel information from raw images
• Tabulating occurrences of various grades obtained by students
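A minimal pandas sketch of the first example, deriving age from birth date; the column name and dates are hypothetical, and the year arithmetic is deliberately approximate.

```python
import pandas as pd

# Hypothetical raw column of birth dates.
df = pd.DataFrame({"birth_date": pd.to_datetime(["1990-05-17", "2001-11-02", "1985-02-28"])})

# Engineered feature: approximate age in whole years as of today
# (integer division by 365 ignores leap-year details).
df["age"] = (pd.Timestamp.today() - df["birth_date"]).dt.days // 365
print(df)
```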
Feature Engineering...
4. Binarization:
Binarization converts a numeric feature into a binary (0/1) feature by applying a
threshold: values above the threshold become 1 and the rest become 0.
5. Rounding:
Often when dealing with continuous numeric attributes like
proportions or percentages, we may not need the raw values
having a high amount of precision.
Hence it often makes sense to round off these high precision
percentages into numeric integers.
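A short sketch of binarization and rounding, assuming scikit-learn and NumPy; the listen-count and proportion values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical "listen count" feature, binarized to "has listened at least once".
listen_counts = np.array([[0], [3], [0], [12], [1]])
print(Binarizer(threshold=0).fit_transform(listen_counts).ravel())   # -> [0 1 0 1 1]

# Hypothetical high-precision proportions, rounded off to integer percentages.
proportions = np.array([0.230712, 0.871234, 0.505555])
print(np.round(proportions * 100).astype(int))                       # -> [23 87 51]
```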
Feature Engineering...
6. Binning (Discretization)
The problem of working with raw, continuous numeric features is that often
the distribution of values in these features will be skewed.
This signifies that some values will occur quite frequently while some will
be quite rare.
There is also the problem of the varying range of values across these features.
There are strategies to deal with this, including binning and transformations.
Binning, also known as quantization, is used for transforming continuous
numeric features into discrete ones (categories).
These discrete values or numbers can be thought of as categories or bins
into which the raw, continuous numeric values are grouped.
Each bin represents a specific degree of intensity, and hence a specific
range of continuous numeric values falls into it.
Feature Engineering...
Binning Types
Two types of binning: 1. Fixed-Width Binning 2. Adaptive Binning
Fixed-Width Binning: We manually create fixed-width bins based on rules
and domain knowledge.
Ex.
age = [12, 15, 13, 78, 65, 42, 98, 24, 26, 38, 27, 32, 22, 45, 27]
Now, lets create bins of fixed width (say 10):
bins = [0 {0-9}, 1 {10-19}, 2 {20-29}, 3 {30-39}, 4 {40-49}, 5 {50-59}, 6 {60-69}, 7 {70-79}, 8 {80-89}, 9 {90-99}]
After binning, age variable looks like this:
age = [1, 1, 1, 7, 6, 4, 9, 2, 2, 3, 2, 3, 2, 4, 2]
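The same fixed-width binning can be reproduced with a one-line NumPy sketch; integer division by the bin width yields the bin index.

```python
import numpy as np

age = np.array([12, 15, 13, 78, 65, 42, 98, 24, 26, 38, 27, 32, 22, 45, 27])

# Fixed-width bins of size 10: bin index = floor(age / 10),
# so 0-9 -> 0, 10-19 -> 1, 20-29 -> 2, and so on.
print((age // 10).tolist())
# -> [1, 1, 1, 7, 6, 4, 9, 2, 2, 3, 2, 3, 2, 4, 2]  (matches the list above)
```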
Limitation:
• Fixed-width bins are irregular in terms of the number of data points or values
that fall into each bin.
• Some bins may be densely populated while others may be sparsely populated
or even empty.
Feature Engineering...
Adaptive Binning
• In Adaptive Binning, data distribution itself decides bin ranges for itself.
• No manual intervention is required.
• Quantile-based binning is a good strategy to use for adaptive binning (see the sketch after this list).
• Quantiles are values that divide the data into equal-sized portions, each containing the same number of observations.
• Thus, q-Quantiles help in partitioning a numeric attribute into q equal partitions.
• Popular examples of quantiles include the 2-Quantile, known as the median, which divides the data distribution into two equal bins.
• 4-Quantiles, known as quartiles, divide the data into 4 equal bins.
• 10-Quantiles, also known as deciles, divide the data into 10 bins, each containing roughly the same number of observations (rather than 10 equal-width bins).
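A minimal quantile-binning sketch with pandas, applied to the same age values as before; the quartile labels are hypothetical names.

```python
import pandas as pd

age = pd.Series([12, 15, 13, 78, 65, 42, 98, 24, 26, 38, 27, 32, 22, 45, 27])

# 4-Quantile (quartile) binning: bin edges adapt to the data distribution,
# so each bin holds roughly the same number of values.
quartiles = pd.qcut(age, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(quartiles.value_counts().sort_index())
```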
Feature Engineering...
#Numerical Binning Example
Value      Bin
0-30   ->  Low
31-70  ->  Mid
71-100 ->  High
7. Statistical Transformations
• Their main significance is that statistical or mathematical transformations
help in stabilizing variance.
• They use statistical and mathematical transformations such as the following:
a. Log Transform
• The log transform belongs to the power transform family of functions.
• It helps to handle skewed data and after transformation, the distribution
becomes more approximate to normal.
• In most cases it compresses the order of magnitude of the data, so values fall
within a much narrower range.
• It also decreases the effect of outliers, due to the normalization of
magnitude differences, and the model becomes more robust.
• Mathematically it can be represented as y = log_b(x), where b is the base of the
logarithm; in practice log(1 + x) is often used so that zero values remain valid.
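A minimal log-transform sketch with NumPy; the skewed view-count values are hypothetical, and log1p (log of 1 + x) is used so that zeros would remain valid.

```python
import numpy as np

# Hypothetical right-skewed feature, e.g. video view counts.
views = np.array([3.0, 10.0, 45.0, 120.0, 900.0, 15000.0, 250000.0])

# log(1 + x) compresses the large magnitudes and pulls the
# distribution closer to normal.
print(np.round(np.log1p(views), 2))
```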
Feature Engineering...
1. Label Encoding:
It is a technique to transform categorical variables into numerical variables by
assigning a numerical value to each of the categories.
Label encoding can be used for Ordinal variables
e.g. T-shirt size: Small → 0, Medium → 1, Large → 2 (see the sketch below)
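A minimal label/ordinal encoding sketch with pandas, using the hypothetical T-shirt size variable; the explicit mapping keeps the category order intact.

```python
import pandas as pd

# Hypothetical ordinal variable: T-shirt size.
df = pd.DataFrame({"size": ["Small", "Large", "Medium", "Small", "Large"]})

# Map each category to an integer that preserves the natural order.
size_order = {"Small": 0, "Medium": 1, "Large": 2}
df["size_encoded"] = df["size"].map(size_order)
print(df)
```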
Feature Engineering...
2. Ordinal encoding:
Ordinal encoding is an encoding technique to transform an original categorical variable
into a numerical variable while preserving the order of the categories.
3. Frequency encoding:
Frequency encoding is an encoding technique to transform an original categorical variable into
a numerical variable by replacing each category with its frequency (or relative frequency) in the data.
Feature Engineering...
4. Binary encoding:
Binary encoding is an encoding technique to transform an original categorical variable into a
numerical representation by first encoding the categories as integers and then converting those
integers into binary code, with each binary digit becoming a separate column.
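A minimal sketch of frequency and binary encoding done by hand with pandas (packages such as category_encoders also ship a ready-made BinaryEncoder); the city column is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Delhi", "London", "Delhi", "Paris", "Delhi", "London"]})

# Frequency encoding: replace each category with its relative frequency.
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

# Binary encoding: assign an integer code, then spell it out as binary digits.
codes, _ = pd.factorize(df["city"])            # Delhi -> 0, London -> 1, Paris -> 2
df["city_bin"] = [format(c, "02b") for c in codes]
print(df)
```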
Feature Engineering...
2. Min-Max Scaling
Here we transform and scale our feature values so that each value lies within the
range [0, 1], using X_scaled = (X − min(X)) / (max(X) − min(X)).
Feature Engineering...
Robust Scaling
The disadvantage of min-max scaling is that often the presence of outliers affects the
scaled values for any feature.
Robust scaling uses statistical measures that are not affected by outliers to scale
features. Mathematically this scaler can be represented as
X_scaled = (X − median(X)) / IQR(X)
where X is each feature value and IQR is the inter-quartile range of X, i.e. the
difference between the first quartile (25th percentile) and the third quartile (75th percentile).
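A short scikit-learn sketch contrasting min-max and robust scaling on a hypothetical feature with one large outlier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical single feature with one large outlier (500).
X = np.array([[10.0], [20.0], [30.0], [40.0], [500.0]])

# Min-max scaling squeezes everything into [0, 1], but the outlier
# pushes the ordinary values towards 0.
print(MinMaxScaler().fit_transform(X).ravel())

# Robust scaling centres on the median and divides by the IQR,
# so the ordinary values keep a usable spread.
print(RobustScaler().fit_transform(X).ravel())
```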
Dimensionality Reduction
• The higher the number of features, the harder it becomes to visualize the training set and
work with it.
• Sometimes, most of these features are correlated, and hence redundant.
• Dimensionality reduction is the process of reducing the total number of features
in our feature set using strategies like feature selection or feature extraction.
Feature Selection methods: A subset of the original features is selected and the
remaining features are discarded.
No new features are generated in this process.
Feature Extraction methods: We engineer or extract new features from the original list
of features in the data.
Thus the reduced set of features will contain newly generated features.
Feature Selection Techniques:
1. Filter methods:
These techniques select features purely based on metrics like correlation,
mutual information and so on. (measured using statistical analysis)
These methods do not depend on results obtained from any model and usually
check the relationship of each feature with the response variable to be
predicted.
Popular methods include threshold based methods and statistical tests.
2. Wrapper methods:
These techniques try to capture interaction between multiple features by using
a recursive approach to build multiple models using feature subsets and select
the best subset of features giving us the best performing model.
Methods like forward selection and backward elimination are popular
wrapper-based methods.
Feature selection Techniques:
3. Embedded methods:
These techniques try to combine the benefits of the other two methods.
They rank and score feature variables based on their importance as part of the model
training process.
Tree-based methods like decision trees and ensemble methods like random forests
are popular examples of embedded methods.
Different Filter methods:
1. Information Gain
• Information gain calculates the reduction in entropy from the transformation
of a dataset.
• It can be used for feature selection by evaluating the Information gain of each
variable in the context of the target variable.
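A minimal scikit-learn sketch of this idea using mutual information (the information-gain analogue for feature selection) on the bundled Iris data.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

# Mutual information between each feature and the target: a higher score
# means the feature removes more uncertainty about the class.
X, y = load_iris(return_X_y=True)
scores = mutual_info_classif(X, y, random_state=0)
print(dict(zip(load_iris().feature_names, scores.round(2))))
```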
Different Filter methods:
2. Chi-square Test
• The Chi-square test is used for categorical features in a dataset.
• Calculate Chi-square between each feature and the target and select the
desired number of features with the best Chi-square scores.
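A corresponding chi-square sketch with SelectKBest, again on the Iris data; chi2 expects non-negative feature values (counts or min-max scaled features).

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Keep the two features with the highest chi-square score against the target.
X, y = load_iris(return_X_y=True)
X_best = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)
print(X_best.shape)   # -> (150, 2)
```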
3. Fisher’s Score:
• Fisher score is one of the most widely used supervised feature selection
methods.
• The algorithm returns the ranks of the variables based on the Fisher score in
descending order.
• The Fisher score of a feature is the distance between the sample means for each
class, divided by their variances: for feature j, F(j) = Σ_k n_k (μ_kj − μ_j)² / Σ_k n_k σ_kj²,
where n_k is the number of samples in class k, μ_kj and σ_kj are the class-wise mean and
standard deviation of feature j, and μ_j is its overall mean.
4.Correlation Coefficient:
• Correlation is a measure of the linear relationship of 2 or more variables.
• Through correlation, we can predict one variable from the other.
• The logic behind using correlation for feature selection is that the good variables are
highly correlated with the target.
• Variables should be correlated with the target but should be uncorrelated among
themselves.
• If two variables are correlated, we can predict one from the other.
• If two features are correlated, the model only really needs one of them, as the second
one does not add additional information.
• The filter method looks at individual features to identify their relative
importance.
• A feature may not be useful on its own but maybe an important influencer
when combined with other features.
• Filter methods may miss such features.
2.Wrapper Methods:
• Search the space of all possible subsets of features, assessing their quality by
learning and evaluating a classifier with that feature subset.
• The feature selection process is based on a specific machine learning algorithm
that we are trying to fit on a given dataset.
• It typically follows a greedy search approach, evaluating candidate combinations
of features against the evaluation criterion.
• The wrapper methods usually result in better predictive accuracy than filter
methods.
Types of Wrapper Methods:
a) Recursive Feature Elimination
Recursive Feature Elimination (RFE) recursively removes the weakest features
until the desired number of features is reached, which can improve the
performance and accuracy of the model.
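A minimal RFE sketch with scikit-learn, using logistic regression as the hypothetical base estimator.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Recursively drop the weakest feature until only two remain.
X, y = load_iris(return_X_y=True)
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)
print(rfe.support_)    # boolean mask of the selected features
print(rfe.ranking_)    # rank 1 = selected
```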
b) Forward Feature Selection:
• Starts with a single feature.
• This is an iterative method in which we start with the best-performing
variable against the target.
• Next, select another variable that gives the best performance in combination
with the first selected variable.
• This process continues until the preset criterion is achieved.
c) Backward Feature Elimination
This method works exactly opposite to the Forward Feature Selection method.
• Start with all the features available and build a model.
• Next, eliminate the least useful feature and check the model performance.
• This process is continued until the preset criterion is achieved.
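Forward and backward selection can be sketched with scikit-learn's SequentialFeatureSelector (available in scikit-learn 0.24+); the estimator and feature budget are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# "forward" adds one feature at a time starting from none; "backward" starts
# from all features and removes one at a time, both scored by cross-validation.
forward = SequentialFeatureSelector(model, n_features_to_select=2, direction="forward")
backward = SequentialFeatureSelector(model, n_features_to_select=2, direction="backward")
print(forward.fit(X, y).get_support())
print(backward.fit(X, y).get_support())
```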
Algorithm for Forward Selection
Algorithm for Backward Selection
• With more features, training error can generally be reduced, but validation error
may not be.
• E(F) denotes the error incurred on the validation sample when only the inputs
in F are used.
• Depending on the application, the error is either the mean squared error or the
misclassification error.
Terminologies for Algorithm
Forward selection may be costly: to decrease the number of dimensions from d to k, we
need to train and test the system d + (d − 1) + (d − 2) + · · · + (d − k) times, so the time
required is O(d²).
• Forward selection is a local search procedure and does not guarantee finding the optimal
subset, namely, the minimal subset causing the smallest error.
• For example, x_i and x_j individually may not be very useful, but together they may
decrease the error significantly. In this situation forward selection is not a good choice:
because the algorithm is greedy and adds attributes one by one, it may not be able to
detect the combined effect of more than one feature.
Feature Extraction Techniques
Feature Extraction: Feature extraction creates new features from functions of the
original features.
PCA:
• The principal components are straight lines (directions) that capture most of the
variance in the data.
• They have a direction and a magnitude.
• Principal components are orthogonal projections (perpendicular) of data onto
lower-dimensional space.
Principal Component Analysis…
Mathematics behind PCA:
1. Take the whole dataset consisting of d+1 dimensions and ignore the labels,
so that our new dataset becomes d-dimensional.
2. Compute the mean of every dimension of the dataset.
3. Compute the covariance matrix of the whole dataset.
4. Compute the eigenvectors and the corresponding eigenvalues of the covariance matrix.
5. Sort the eigenvectors by decreasing eigenvalue and choose the k eigenvectors with the
largest eigenvalues to form a d × k matrix W.
6. Use this d × k eigenvector matrix to transform the samples onto the new
subspace.
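A compact NumPy sketch of these steps on synthetic data; the intermediate steps (mean-centering, covariance matrix, eigendecomposition, sorting) follow the recipe outlined above, and the data itself is random and purely illustrative.

```python
import numpy as np

# Synthetic data: 100 samples, d = 3 correlated features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.1, 0.1, 0.2]])

k = 2
X_centered = X - X.mean(axis=0)              # centre each dimension on its mean
cov = np.cov(X_centered, rowvar=False)       # d x d covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # eigendecomposition (symmetric matrix)
order = np.argsort(eigvals)[::-1]            # sort by explained variance
W = eigvecs[:, order[:k]]                    # d x k projection matrix
X_reduced = X_centered @ W                   # step 6: project onto the new subspace
print(X_reduced.shape)                       # -> (100, 2)
```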
Principal Component Analysis…
Applications of PCA in Machine Learning
• PCA is used to visualize multidimensional data.
• It is used to reduce the number of dimensions in healthcare data.
• PCA can be used to compress images.
• It can be used in finance to analyze stock data and forecast returns.
• PCA helps to find patterns in the high-dimensional datasets.
Dataset Preparation
Background :
• Machine learning is at the peak of its popularity today.
• Despite this, a lot of decision-makers are in the dark about what exactly is
needed to design, train, and successfully deploy a machine learning algorithm.
• The details of collecting the data, building a dataset, and annotating it are often
neglected as mere supporting tasks.
Sources of Data:
Dataset Preparation
The Features of a Proper, High-Quality Dataset in Machine Learning:
• Split the dataset into two parts (preferably a 70-30% split, though the exact
percentage will vary).
• Train the model on the training dataset; while training the model, some fixed set
of hyper-parameters is selected.
• Test or evaluate the model on the held-out test dataset
• Train the final model on the entire dataset to get a model which can generalize
better on the unseen or future dataset.
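A minimal hold-out sketch with scikit-learn; the 70-30 split, the random forest model, and the random seed are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 70-30 hold-out split; the 30% test set is kept aside for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(model.score(X_test, y_test))

# After evaluation, the final model can be retrained on the entire dataset.
final_model = RandomForestClassifier(random_state=42).fit(X, y)
```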
Dataset Preparation
Hold-out method for Model Selection
• The model selection process is also referred to as hyper-parameter tuning.
• In the hold-out method for model selection, the dataset is split into three different sets.
Dataset Preparation
Process of Hold-out Method for Model Selection
• Split the dataset into three parts – training dataset, validation dataset, and test
dataset.
• Train different models using different machine learning algorithms. For
example, train the classification model using logistic regression, random forest,
XGBoost.
• For the models trained with different algorithms, tune the hyper-parameters and
come up with different models.
• For each of the algorithms mentioned in step 2, change the hyper-parameter
settings and come up with multiple models.
• Test the performance of each of these models (belonging to each of the
algorithms) on the validation dataset.
• Select the most optimal model out of the models tested on the validation dataset.
The most optimal model will have the best hyper-parameter settings for its
specific algorithm.
• Test the performance of the most optimal model on the test dataset.
The hold-out method is one of the cross-validation techniques.
Dataset Preparation
Pros and Cons of Hold-out Method
Pros
• Simple, easy to understand, and implement.
• This method is fully independent of the data.
Cons:
• Not suitable for an imbalanced dataset.
• A lot of data is withheld from training the model.
Dataset Preparation
Cross - Validation Techniques :
1. k-fold cross-validation:
• In k-fold cross-validation, the original dataset is equally partitioned into k subparts or folds.
• Out of the k-folds or groups, for each iteration, one group is selected as validation data, and
the remaining (k-1) groups are selected as training data.
• The process is repeated k times so that each group is used once as validation data and the
rest as training data.
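A minimal k-fold sketch with scikit-learn, using k = 5 and logistic regression as illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each fold serves once as validation data,
# and the final score is the mean over the k folds.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```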
Dataset Preparation
The final accuracy of the model is computed by taking the mean of the validation
accuracies of the k models.
Pros:
• The entire dataset is used for both training and validation, so the evaluation is less
dependent on one particular split.
Cons:
• Not suitable for an imbalanced dataset.
Dataset Preparation
2.Leave-one-out cross-validation:
• For a dataset having n rows, 1st row is selected for validation, and the rest (n-1)
rows are used to train the model.
• For the next iteration, the 2nd row is selected for validation and rest to train the
model.
• Similarly, the process is repeated for n steps (or the desired number of
iterations).
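The same scikit-learn pattern covers leave-one-out: passing a LeaveOneOut splitter runs n train/validate rounds, one per row.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# One iteration per row: a single sample is held out for validation each time.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(scores.mean())
```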
Dataset Preparation
Pros:
• Simple, easy to understand, and implement.
Cons:
• The estimate has low bias but can have high variance.
• The computation time required is high.
Thank You