Programming for Data
Analysis
Week 8
Dr. Ferdin Joe John Joseph
Faculty of Information Technology
Thai – Nichi Institute of Technology, Bangkok
Today’s lesson
• Feature Engineering
• Feature Selection
• Feature Construction
• Laboratory
Importance
Features
Real Data features
Feature Engineering
Feature Engineering
• Feature engineering is the process of using domain knowledge to
extract features from raw data via data mining techniques.
• These features can be used to improve the performance of machine
learning algorithms.
Features
• A feature is an attribute or property shared by all of the independent
units on which analysis or prediction is to be done. Any attribute
could be a feature, as long as it is useful to the model.
• The purpose of a feature is easiest to understand in the context of a
specific problem: a feature is a characteristic that might help when
solving that problem.
Process of Feature Engineering
• Brainstorming or testing features
• Deciding what features to create
• Creating features
• Checking how the features work with your model
• Improving your features if needed
• Going back to brainstorming/creating more features until the work is done
Techniques in Feature Engineering
• Imputation
• Handling Outliers
• Binning
• Log Transform
• One-Hot Encoding
• Grouping Operations
• Feature Split
• Scaling
• Extracting Date
Imputation
• Missing values are one of the most common problems you can
encounter when you try to prepare your data for machine learning.
• The reasons for missing values include human error, interruptions in
the data flow, privacy concerns, and so on.
• Missing values degrade the performance of machine learning models.
Imputation
• Dropping every column that contains missing values throws away useful
information and can reduce model performance
• A practical rule is a completeness threshold of 70%:
• Keep columns that are at least 70% complete; remove columns having
more than 30% missing values
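The 70% completeness rule can be sketched in pandas as follows (the toy DataFrame here is invented for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25.0, 30.0, np.nan, 40.0],            # 25% missing -> kept
    "income": [np.nan, np.nan, np.nan, 50000.0],  # 75% missing -> dropped
})

threshold = 0.7  # keep columns that are at least 70% non-missing
df_reduced = df.loc[:, df.notna().mean() >= threshold]
```

`df.notna().mean()` gives each column's fraction of non-missing rows, so the boolean mask selects only sufficiently complete columns.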
Numerical Imputation
• Fill missing values with a constant (e.g. 0)
• Fill missing values with a statistic such as the mean or median of the column
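Both options can be sketched with `fillna` (the series below is invented for illustration; the median is used as the statistic):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

filled_const = s.fillna(0)            # fill with a constant
filled_median = s.fillna(s.median())  # fill with a statistic (median here)
```

The median is often preferred over the mean because it is less sensitive to outliers.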
Categorical imputation
• Replace missing values with the most frequent value (the mode) of that column
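A minimal sketch of mode imputation (the color data is invented for illustration):

```python
import pandas as pd
import numpy as np

colors = pd.Series(["red", "blue", np.nan, "red", np.nan])

most_frequent = colors.mode()[0]          # "red" occurs most often
colors_filled = colors.fillna(most_frequent)
```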
Handling Outliers
• The best way to detect outliers is to visualize the data
Statistical ways to handle outliers
• Standard Deviation
• Percentiles
Handling outliers – Standard Deviation
• If a value's distance from the mean is greater than x × the standard
deviation, it can be treated as an outlier.
• x between 2 and 4 is practical in most cases; equivalently, the z-score can be used
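A minimal NumPy sketch of this rule with x = 2 (the data is invented, with one obvious outlier):

```python
import numpy as np

data = np.array([10.0, 11.0, 9.0, 10.0, 12.0, 50.0])  # 50 is an outlier

x = 2
mean, std = data.mean(), data.std()
# Flag values whose distance from the mean exceeds x standard deviations
outlier_mask = np.abs(data - mean) > x * std
```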
Handling Outliers - Percentile
• If your data ranges from 0 to 100, the top 5% is not simply the values
between 95 and 100.
• Here, the top 5% means the values that lie above the 95th percentile of
the data.
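The distinction can be seen in a short NumPy sketch (the data is invented; note that the 95th percentile of the *data* is well above the value 95):

```python
import numpy as np

data = np.array([1, 5, 20, 40, 60, 75, 88, 92, 96, 300])

upper = np.percentile(data, 95)   # 95th percentile of the data
top_5_percent = data[data > upper]
```

Values above `upper` could then be dropped, or capped at `upper` with `np.where(data > upper, upper, data)`.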
Binning
• Binning is applied to numerical data
• Categorical values can also be binned by mapping them into coarser categories
Binning - Example
#Numerical Binning Example
Value    Bin
0-30   -> Low
31-70  -> Mid
71-100 -> High

#Categorical Binning Example
Value    Bin
Spain  -> Europe
Italy  -> Europe
Chile  -> South America
Brazil -> South America
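Both examples above can be sketched in pandas, using `pd.cut` for numerical binning and `map` for categorical binning:

```python
import pandas as pd

# Numerical binning: 0-30 -> Low, 31-70 -> Mid, 71-100 -> High
scores = pd.Series([15, 45, 90])
score_bins = pd.cut(scores, bins=[0, 30, 70, 100],
                    labels=["Low", "Mid", "High"])

# Categorical binning: map countries to continents
countries = pd.Series(["Spain", "Italy", "Chile", "Brazil"])
continent_map = {"Spain": "Europe", "Italy": "Europe",
                 "Chile": "South America", "Brazil": "South America"}
continents = countries.map(continent_map)
```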
Motivation for binning
• Makes the model more robust
• Prevents overfitting (at the cost of some information)
Log Transform
• Logarithmic Transformation
• The data you apply a log transform to must contain only positive
values; otherwise you get an error.
• A common workaround is to add 1 to the data before transforming it, i.e. log(1 + x).
• This ensures the argument of the logarithm is positive for all non-negative data.
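NumPy provides `log1p`, which computes log(1 + x) directly and so handles zeros safely (the data is invented for illustration):

```python
import numpy as np

data = np.array([0.0, 9.0, 99.0])

log_data = np.log1p(data)  # log(1 + x); safe for zeros, errors on x <= -1
```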
Example
One hot encoding
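One-hot encoding turns a categorical column into one binary indicator column per category. A minimal pandas sketch (the city data is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Bangkok", "Tokyo", "Bangkok"]})

# One indicator column per distinct city value
encoded = pd.get_dummies(df, columns=["city"])
```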
Grouping Operations
• Categorical Column Grouping
• Numerical Column Grouping
Categorical Column Grouping
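One common way to aggregate a categorical column per group is to take its most frequent value (the mode); this sketch assumes that approach, with invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "city": ["Rome", "Rome", "Paris", "Tokyo", "Tokyo"],
})

# Most frequent city per user
top_city = df.groupby("user")["city"].agg(lambda s: s.mode()[0])
```

Alternatives include one-hot encoding first and then summing the indicator columns per group.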
Numerical Column Grouping
• Numerical columns are grouped using sum and mean functions in most cases.
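A minimal sketch of grouping a numerical column with both sum and mean (the data is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["a", "a", "b", "b"],
    "amount": [10.0, 20.0, 5.0, 15.0],
})

# Aggregate the amount per customer with sum and mean
grouped = df.groupby("customer")["amount"].agg(["sum", "mean"])
```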
Feature Split
• Splitting features is a good way to make them more useful for machine
learning.
• By extracting the usable parts of a column into new features, we:
• Enable machine learning algorithms to comprehend them
• Make it possible to bin and group them
• Improve model performance by uncovering hidden information
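A typical feature split takes string columns apart, e.g. extracting first and last names from a full-name column (the names below are invented for illustration):

```python
import pandas as pd

names = pd.Series(["Luther N. Gonzalez", "Charlotte M. Cooper"])

# Split the full name and keep the usable parts as new features
first_name = names.str.split(" ").str[0]
last_name = names.str.split(" ").str[-1]
```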
Example
Scaling
• In real life, it is unreasonable to expect age and income columns to have
the same range.
• Scaling solves this problem.
• In particular, algorithms based on distance calculations, such as k-NN
or k-Means, need scaled continuous features as model input.
Scaling Methods
• Normalization
• Standardization
Normalization
• Normalization (or min-max normalization) scales all values to a fixed
range, typically between 0 and 1.
• This transformation does not change the shape of the feature's
distribution, but because the standard deviation shrinks, the effect of
outliers increases.
• Therefore, it is recommended to handle outliers before normalizing.
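Min-max normalization is (x − min) / (max − min); a minimal pandas sketch on invented data:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 50.0])

# Scale to [0, 1]: (x - min) / (max - min)
normalized = (s - s.min()) / (s.max() - s.min())
```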
Normalization - Example
Standardization
• Also known as z-score normalization
• Scales values by centering on the mean and dividing by the standard deviation.
• If features have different standard deviations, their scaled ranges will
still differ from each other.
• This reduces (but does not eliminate) the effect of outliers in the features.
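Standardization is (x − mean) / std, giving the feature zero mean and unit standard deviation; a minimal pandas sketch on invented data:

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 6.0, 8.0])

# z-score: (x - mean) / std (pandas uses the sample std, ddof=1)
standardized = (s - s.mean()) / s.std()
```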
Example
Extracting Date
• Extracting parts of the date into separate columns: year, month, day, etc.
• Extracting the time elapsed between the current date and a date
column, in years, months, days, etc.
• Extracting specific features from the date: name of the weekday,
weekend or not, holiday or not, etc.
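The pandas `.dt` accessor covers most of these extractions (the dates below are invented for illustration):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2019-01-07", "2020-12-25"]))

date_features = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day": dates.dt.day,
    "weekday": dates.dt.day_name(),
    "is_weekend": dates.dt.dayofweek >= 5,  # Monday=0 ... Sunday=6
})
```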
DSA 207 – Feature Engineering