Jawaharlal Nehru Engineering College,
Chh. Sambhajinagar

Course: Introduction to Data Science

Unit 2: Feature Generation & Extraction

Sandip S. Kankal
Assistant Professor, CSE, JNEC
Course: Introduction to Data Science
Semester III
• Course code: CSE21MDL201
• Course name: Introduction to Data Science
• Course category: MDM
• Credits: 2
• Teaching scheme: L-2 hrs/week
• Evaluation scheme: CA–60, ESE–40
Pre-requisite
• Basics of any programming language

Course Objectives:
• To provide the knowledge and expertise needed to become a proficient data scientist.
• To demonstrate an understanding of the statistics and machine learning concepts that are vital for data science.
• To critically evaluate data visualisations based on their design and their use for communicating stories from data.
Course Outcomes:
• At the end of the course, the students will be able to:
CO1: Explain how data is collected, managed, and stored for data science.
CO2: Understand the key concepts in data science, including their real-world applications and the toolkit used by data scientists.
CO3: Understand the different tools and languages used for data science.
Syllabus
• Unit 1: Introduction to Data Science
• Unit 2: Feature Generation & Extraction
• Unit 3: Data Visualization
• Unit 4: Applications & Tools used in Data Science
Unit 2: Feature Generation and Extraction
• Feature Generation and Feature Selection (Extracting Meaning from Data)
• Motivating application: user (customer) retention
• Feature Generation (brainstorming, role of domain expertise, and place for imagination)
• Feature Selection algorithms
Introduction
• Dataset
• Example
• Feature

Dataset
• A dataset is essentially the backbone of all the operations, techniques, and models that developers use to interpret data.
• Datasets consist of a large number of data points grouped into one table.
• Datasets are used in almost all industries today, for a wide variety of reasons.
Dataset
• Many universities and platforms publicly release datasets that developers can work with to get the outputs they need, for example:
– the UCI Machine Learning Repository,
– websites like Kaggle, and
– GitHub.
What is a Dataset?
• A dataset is a set of data grouped into a collection with which developers can work to meet their goals.
• In a dataset, the rows represent the data points (observations) and the columns represent the features of the dataset.
• Datasets are mostly used in fields like machine learning, business, and government to gain insights, make informed decisions, or train algorithms.
• Datasets vary in size and complexity, and they usually require cleaning and preprocessing to ensure data quality and suitability for analysis or modeling.
Dataset Example: Diabetes
• This is the Diabetes dataset.
• Since this is a dataset with which we build models, it has input features and an output feature.
• The input features are Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age.
• Outcome is the output feature.
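A minimal pandas sketch of how such a dataset might be loaded and split into input features and the output feature; the file name diabetes.csv is an assumption about where the data lives:

```python
import pandas as pd

# Load the dataset (assumes a local CSV with the columns listed above).
df = pd.read_csv("diabetes.csv")

# Separate the input features from the output feature.
X = df[["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
        "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]]
y = df["Outcome"]

print(X.shape, y.shape)
```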
Why are datasets used?
• Datasets are used to train and test AI models,
analyze trends, and gain insights from data.
• They provide the raw material for computers to
learn patterns and make predictions.

Types of Datasets
• Numerical Dataset
• Categorical Dataset
• Web Dataset
• Time series Dataset
• Image Dataset
• Ordered Dataset
• Partitioned Dataset
• File-Based Dataset
• Bivariate Dataset
• Multivariate Dataset
Feature
• A feature (or column) represents a measurable piece of data, such as a name, age, or gender.
• It is the basic building block of a dataset.
• The quality of a feature can vary significantly and has an immense effect on model performance.
• We can improve the quality of a dataset’s features in the pre-processing stage using processes like Feature Generation and Feature Selection.
What is a Feature in Machine Learning
and Data Science?
• A feature is an individual measurable property within a
recorded dataset.
• In machine learning and statistics, features are often
called “variables” or “attributes.”
• Relevant features have a correlation or bearing (called
feature importance) on a model’s use case.
• In a patient medical dataset, features could be age,
gender, blood pressure, cholesterol level, and other
observed characteristics relevant to the patient.
Feature in Data Science
• Features can be individual variables, derived variables, or combined attributes constructed from underlying data elements.
• For example, from measures of blood pressure, cholesterol level, and other contributing factors, we can create an “engineered” categorical feature that groups observations into risk categories for stroke or heart disease.
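As a sketch of such an engineered feature, the pandas snippet below bins a combined score into risk categories; the column names, weights, and cut-off values are illustrative assumptions, not clinical thresholds:

```python
import pandas as pd

patients = pd.DataFrame({
    "systolic_bp": [118, 142, 165, 130],   # hypothetical measurements
    "cholesterol": [180, 215, 260, 199],
})

# Combine two numeric measurements into a single score, then bin the
# score into an ordered categorical "risk" feature.
score = 0.5 * patients["systolic_bp"] + 0.5 * patients["cholesterol"]
patients["risk_category"] = pd.cut(
    score,
    bins=[-float("inf"), 150, 180, float("inf")],
    labels=["low", "medium", "high"],
)
print(patients)
```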
Extracting Meaning from Data
• Motivating application: user (customer) retention
• Customer retention measures a business’s ability to keep customers over a given period of time.
• The opposite of customer retention is customer churn, a metric that shows how many customers a company has lost over that same period of time.
Motivating application: user (customer) retention
• Customer retention means
– “to maintain existing customers”
• Customer churn means
– “losing existing customers”
• Churn rate is the percentage of customers that leave within a given amount of time, whereas retention rate is the percentage of customers that stay with you.
Motivating Application: User (Customer) Retention
• Suppose an app called Chasing Dragons charges a monthly subscription fee, with revenue increasing with more users.
• However, only 10% of new users return after the first month.
• To boost revenue, there are two options: increase the retention rate of existing users or acquire new ones.
• Generally, retaining existing customers is cheaper than acquiring new ones.
Motivating Application: User (Customer)
Retention
• Focusing on retention, a model could be built to predict if a new user will
return next month based on their behavior this month.
• This model could help in providing targeted incentives, such as a free
month, to users predicted to need extra encouragement to stay.
• A good crude model: Logistic Regression – gives the probability that the user returns in their second month, conditional on their activities in the first month.
• User behavior is recorded for the first 30 days after sign-up, logging every
action with timestamps: for example, a user clicked "level 6" at 5:22 a.m.,
slew a dragon at 5:23 a.m., earned 22 points at 5:24 a.m., and was shown
an ad at 5:25 a.m.
• This phase involves collecting data on every possible user action.

Motivating Application: User (Customer)
Retention
• User actions, ranging from thousands to just a few,
are stored in timestamped event logs.
• These logs need to be processed into a dataset with
rows representing users and columns representing
features.
• This phase, known as feature generation, involves
brainstorming potential features without being
selective.
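A sketch of how timestamped event logs could be rolled up into one row per user with pandas; the log schema (user_id, timestamp, action, points) and the aggregations are assumptions for illustration:

```python
import pandas as pd

# Hypothetical raw event log: one row per user action.
events = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime(["2024-08-01 05:22", "2024-08-01 05:24",
                                 "2024-08-03 10:00", "2024-08-02 09:00",
                                 "2024-08-02 09:05"]),
    "action":    ["click_level", "earn_points", "click_level",
                  "slay_dragon", "earn_points"],
    "points":    [0, 22, 0, 0, 5],
})

# Roll the log up into a user-level table: rows = users, columns = features.
features = events.groupby("user_id").agg(
    n_events=("action", "size"),
    n_days_visited=("timestamp", lambda t: t.dt.date.nunique()),
    total_points=("points", "sum"),
)
print(features)
```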
Motivating Application: User (Customer)
Retention
• The data science team, including game designers, software engineers,
statisticians, and marketing experts, collaborates to identify relevant features.
• Here are some examples:
– ✓ Number of days the user visited in the first month
– ✓ Amount of time until second visit
– ✓ Number of points on day j for j=1, . . .,30 (this would be 30 separate features)
– ✓ Total number of points in first month (sum of the other features)
– ✓ Did user fill out Chasing Dragons profile (binary 1 or 0)
– ✓ Age and gender of user
– ✓ Screen size of device
Notice there are redundancies and correlations between these features; that’s OK.

Motivating Application: User (Customer) Retention
• To construct a logistic regression model predicting user return behavior, the initial focus is on getting a functional model before refining it.
• Whatever time frame is chosen, the classification c_i = 1 designates a returning user.
• The logistic regression formula targeted is the standard one: the probability that user i returns, given their first-month feature vector x_i, is

P(c_i = 1 | x_i) = 1 / (1 + exp(−(α + β · x_i)))

or equivalently logit(P(c_i = 1 | x_i)) = α + β · x_i, where α is the intercept and β is the vector of feature coefficients.
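A minimal scikit-learn sketch of fitting such a model; the feature matrix is randomly generated stand-in data, not real Chasing Dragons logs, and the three features are only examples from the brainstormed list:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_users = 500

# Stand-in first-month features: days visited, total points, profile filled out.
X = np.column_stack([
    rng.integers(0, 31, n_users),
    rng.poisson(40, n_users),
    rng.integers(0, 2, n_users),
])
# Stand-in labels: 1 = user returned in the second month.
c = (rng.random(n_users) < 0.1 + 0.02 * X[:, 0]).astype(int)

model = LogisticRegression().fit(X, c)

# Probability that each user returns, conditional on first-month activity.
p_return = model.predict_proba(X)[:, 1]
print(p_return[:5])
```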
Motivating Application: User (Customer)
Retention
• Initially, a comprehensive set of features is gathered, encompassing user
behavior, demographics, and platform interactions.
• Following data collection, feature subsets must be refined for optimal
predictive power during model scaling and production.
• Three main methods guide feature subset selection: filters, wrappers,
and embedded methods.
– Filters independently evaluate feature relevance,
– wrappers use model performance to assess feature subsets, and
– embedded methods incorporate feature selection within model training.

Feature Generation

Feature Generation
• Feature generation is the process of constructing new features
from existing ones.
• The goal of feature generation is to derive new combinations
and representations of our data that might be useful to the
machine learning model.

Feature Generation
• Feature Generation (also known as feature
construction, feature extraction or feature engineering)
is the process of transforming features into new
features that better relate to the target.
• This can involve mapping a feature into a new feature
using a function like log, or creating a new feature
from one or multiple features using multiplication or
addition.
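A small pandas sketch of the transformations described above: mapping a feature through log and combining two features by multiplication; the column names are illustrative assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income":      [32000, 54000, 120000, 76000],
    "num_visits":  [3, 10, 1, 6],
    "basket_size": [2.5, 4.0, 1.0, 3.2],
})

# Map a skewed feature through a log function to create a new feature.
df["log_income"] = np.log1p(df["income"])

# Create an interaction feature by multiplying two existing features.
df["visits_x_basket"] = df["num_visits"] * df["basket_size"]

print(df)
```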
Feature Generation
• Feature Generation can improve model performance when there
is a feature interaction.
• Two or more features interact if their combined effect is greater or less than the sum of their individual effects.
• It is possible to make interactions with three or more features, but
this tends to result in diminishing returns.
• Feature Generation is often overlooked as it is assumed that the
model will learn any relevant relationships between features to
predict the target variable.
• However, the generation of new flexible features is important
as it allows us to use less complex models that are faster to run
and easier to understand and maintain.

Feature Generation
• Feature generation, also known as feature extraction, is the
process of transforming raw data into a structured format
where each column represents a specific characteristic or
attribute (feature) of the data, and each row represents an
observation or instance.
• This involves identifying, creating, and selecting meaningful
variables from the raw data that can be used in machine
learning models to make predictions or understand patterns.

Feature Generation
• This process is both an art and a science. Having a domain
expert involved is beneficial, but using creativity and
imagination is equally important.
• Remember, feature generation is constrained by two factors: the feasibility of capturing certain information, and the awareness to even consider capturing it.
Feature Generation
• Information can be categorized into the following
buckets:
– Relevant and useful, but it’s impossible to capture it
– Relevant and useful, possible to log it, and you did
– Relevant and useful, possible to log it, but you didn’t
– Not relevant or useful, but you don’t know that and log it
– Not relevant or useful, and you either can’t capture it or it
didn’t occur to you

Feature Generation
• Relevant and useful, but it’s impossible to capture it.
– Keep in mind that much user information isn't captured, like free time, other apps, employment
status, or insomnia, which might predict their return. Some captured data may act as proxies for
these factors, such as playing the game at 3 a.m. indicating insomnia or night shifts.
• Relevant and useful, possible to log it, and you did.
– The decision to log this information during the brainstorming session was crucial. However, mere
logging doesn't guarantee understanding its relevance or usefulness. The feature selection process
aims to uncover this information.
• Relevant and useful, possible to log it, but you didn’t.
– Human limitations can lead to overlooking crucial information, emphasizing the need for creative
feature selection. Usability studies help identify key user actions for better feature capture.
• Not relevant or useful, but you don’t know that and log it.
– This is what feature selection aims to address: you have logged certain information without knowing whether it is actually needed.
• Not relevant or useful, and you either can’t capture it or it didn’t occur to you

Feature Selection
• Feature selection is a way of selecting the subset of the most relevant features from the original feature set by removing redundant, irrelevant, or noisy features.
Feature Selection
• While developing a machine learning model, only a few of the variables in the dataset are useful for building the model; the rest are either redundant or irrelevant.
• If we input the dataset with all these redundant and
irrelevant features, it may negatively impact and reduce
the overall performance and accuracy of the model.
• Hence it is very important to identify and select the most
appropriate features from the data and remove the
irrelevant or less important features, which is done with the
help of feature selection in machine learning.
Feature Selection
• Feature selection is one of the important concepts
of machine learning, which highly impacts the
performance of the model.
• Machine learning works on the principle of “garbage in, garbage out”, so we always need to feed the model the most appropriate and relevant data in order to get better results.
What is Feature Selection?
• A feature is an attribute that has an impact on a
problem or is useful for the problem, and choosing
the important features for the model is known as
feature selection.
• Every machine learning process depends on feature engineering, which mainly consists of two processes: Feature Selection and Feature Extraction.
Feature Selection
• Although feature selection and extraction processes may
have the same objective, both are completely different
from each other.
• The main difference between them is that feature
selection is about selecting the subset of the original
feature set, whereas feature extraction creates new
features.
• Feature selection is a way of reducing the input variable
for the model by using only relevant data in order to
reduce overfitting in the model.
Feature Selection
• So, we can define feature selection as “a process of automatically or manually selecting the subset of the most appropriate and relevant features to be used in model building.”
• Feature selection is performed by either including the
important features or excluding the irrelevant features in the
dataset without changing them.

Need for Feature Selection
• Dimensionality Reduction: High-dimensional datasets with many
features can lead to overfitting, increased computational complexity, and
decreased model interpretability. Selecting the most relevant features can
mitigate these issues.
• Enhanced Model Performance: Removing irrelevant or redundant
features can improve a model’s predictive accuracy, generalization, and
robustness.
• Reduced Training Time: Fewer features mean faster training times,
making it practical to work with large datasets.

Benefits of using feature selection in machine learning:
• It helps in avoiding the curse of dimensionality.
• It helps to simplify the model so that it can be easily interpreted by researchers.
• It reduces training time.
• It reduces overfitting and hence enhances generalization.
Feature Selection Techniques
• Supervised Feature Selection techniques consider the target variable and can be used for labelled datasets.
• Unsupervised Feature Selection techniques ignore the target variable and can be used for unlabelled datasets.
1. Wrapper Methods
• In the wrapper methodology, feature selection is treated as a search problem: different combinations of features are made, evaluated, and compared with other combinations.
• The algorithm is trained iteratively using different subsets of features.
Techniques of wrapper methods
• Forward selection
• Backward elimination
• Exhaustive Feature Selection
• Recursive Feature Elimination
Forward Selection
• Starting from Scratch: Begin with an empty set of features and iteratively
add one feature at a time.
• Model Evaluation: At each step, train and evaluate the machine learning
model using the selected features.
• Stopping Criterion: Continue until a predefined stopping criterion is met,
such as a maximum number of features or a significant drop in
performance.

Backward elimination
• Starting with Everything: Start with all available features.
• Iterative Removal: In each iteration, remove the least important feature
and evaluate the model.
• Stopping Criterion: Continue until a stopping condition is met.

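Both forward selection and backward elimination are available in scikit-learn through SequentialFeatureSelector; a sketch on a built-in dataset, where the choice of estimator and the target of 5 features are assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
estimator = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Forward selection: start empty and add one feature at a time.
forward = SequentialFeatureSelector(
    estimator, n_features_to_select=5, direction="forward", cv=5
).fit(X, y)

# Backward elimination: start with all features and drop one at a time.
backward = SequentialFeatureSelector(
    estimator, n_features_to_select=5, direction="backward", cv=5
).fit(X, y)

print(forward.get_support(indices=True))
print(backward.get_support(indices=True))
```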
Exhaustive Feature Selection
• Exploring All Possibilities: Evaluate all possible combinations of
features, which ensures finding the best subset for model performance.
• Computational Cost: This can be computationally expensive, especially
with a large number of features.

Recursive Feature Elimination
• Ranking Features: Start with all features and rank them based on their
importance or contribution to the model.
• Iterative Removal: In each iteration, remove the least important
feature(s).
• Stopping Criterion: Continue until a desired number of features is
reached.

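scikit-learn's RFE implements this rank-and-eliminate loop; a sketch where the target of 5 features and the choice of estimator are assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# RFE needs an estimator that exposes coef_ or feature_importances_
# so that features can be ranked at each iteration.
rfe = RFE(estimator=LogisticRegression(max_iter=10000),
          n_features_to_select=5, step=1).fit(X, y)

print(rfe.support_)   # boolean mask of the kept features
print(rfe.ranking_)   # 1 = selected; larger values were eliminated earlier
```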
2. Filter Methods
• In filter methods, features are selected on the basis of statistical measures.
• These methods do not depend on the learning algorithm; they choose the features as a pre-processing step.
• The filter method filters out irrelevant features and redundant columns from the model by ranking them with different metrics.
• The advantage of filter methods is that they need little computational time and do not overfit the data.
Techniques of Filter methods
• Information Gain
• Chi-square Test
• Fisher's Score
• Missing Value Ratio

Information Gain
• Information gain is defined as the amount of information a feature provides for identifying the target value; it measures the reduction in entropy.
• The information gain of each attribute is calculated with respect to the target values for feature selection.
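scikit-learn estimates this quantity as mutual information, which corresponds to the information-gain idea above; a sketch on a built-in dataset used purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

data = load_iris()
X, y = data.data, data.target

# Estimated mutual information (information gain) of each feature
# with respect to the target class.
scores = mutual_info_classif(X, y, random_state=0)
for name, score in zip(data.feature_names, scores):
    print(f"{name}: {score:.3f}")
```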
Chi-square Test
• The chi-square (χ²) method is generally used to test the relationship between categorical variables.
• It compares the observed values of different attributes of the dataset with their expected values.
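A sketch using scikit-learn's chi2 score with SelectKBest; chi2 expects non-negative feature values (counts or frequencies), so a count-like toy matrix is used, and keeping k=2 features is an arbitrary assumption:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy non-negative (count-like) features and a categorical target.
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(100, 4))
y = rng.integers(0, 2, size=100)

selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
print(selector.scores_)                    # chi-square statistic per feature
print(selector.get_support(indices=True))  # indices of the 2 kept features
```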
Fisher's Score
• Fisher's Score evaluates each feature independently according to its score under the Fisher criterion, which leads to a suboptimal set of features.
• The larger the Fisher's score, the better the selected feature.
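Plain scikit-learn has no Fisher score function, so the NumPy sketch below implements one common form of the criterion (between-class variance of the feature means divided by the within-class variance), assuming a categorical target y:

```python
import numpy as np

def fisher_score(X, y):
    """One common form of the Fisher criterion, computed per feature."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    overall_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for cls in np.unique(y):
        Xc = X[y == cls]
        n_c = Xc.shape[0]
        num += n_c * (Xc.mean(axis=0) - overall_mean) ** 2
        den += n_c * Xc.var(axis=0)
    return num / den

# Toy example: feature 0 separates the classes, feature 1 is mostly noise.
X = np.array([[1.0, 5.0], [1.2, 4.0], [3.0, 5.5], [3.1, 4.2]])
y = np.array([0, 0, 1, 1])
print(fisher_score(X, y))   # the larger the score, the better the feature
```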
Missing Value Ratio
• The missing value ratio can be used to evaluate each feature against a threshold value.
• The missing value ratio is the number of missing values in a column divided by the total number of observations.
• Variables whose missing value ratio is higher than the threshold can be dropped.
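A pandas sketch of computing the ratio and dropping columns above a threshold; the 0.3 threshold and the toy columns are assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 38],
    "income": [np.nan, 54000, np.nan, np.nan, 61000],
    "city":   ["Pune", "Mumbai", "Nagpur", "Pune", None],
})

# Missing value ratio = missing values per column / total observations.
missing_ratio = df.isnull().mean()
print(missing_ratio)

# Drop features whose ratio exceeds the chosen threshold.
threshold = 0.3
df_reduced = df.loc[:, missing_ratio <= threshold]
print(df_reduced.columns.tolist())   # 'income' (ratio 0.6) is dropped
```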
3. Embedded Methods
• Embedded methods combine the advantages of both filter and wrapper methods by considering the interaction of features while keeping the computational cost low.
• They are fast, like filter methods, but more accurate than filter methods.
• These methods are also iterative: they evaluate each iteration of the model training process and extract the features that contribute the most to that iteration.
Techniques of Embedded methods
• Regularization
• Tree Based Method
Regularization
• This method adds a penalty to the parameters of the machine learning model to avoid over-fitting.
• This approach to feature selection uses Lasso (L1 regularization) and Elastic Net (L1 and L2 regularization).
• The penalty is applied to the coefficients, driving some coefficients down to zero.
• Features with a zero coefficient can be removed from the dataset.
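A sketch of Lasso-based embedded selection using scikit-learn's SelectFromModel; the built-in diabetes regression dataset is used only as a convenient example, and the regularization strength alpha=1.0 is an assumption that would normally be tuned:

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# L1 regularization drives some coefficients exactly to zero;
# SelectFromModel keeps only the features with non-zero coefficients.
lasso = Lasso(alpha=1.0).fit(X, y)
selector = SelectFromModel(lasso, prefit=True)

print(lasso.coef_)                          # several entries are exactly 0
print(selector.get_support(indices=True))   # indices of the surviving features
```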
Tree Based Method
• Tree-based methods such as Random Forest and Gradient Boosting provide feature importance scores that can also be used to select features.
• Feature importance tells us which features have the greatest impact on the target feature.
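A sketch of tree-based selection with a Random Forest; SelectFromModel's default threshold (the mean importance) decides which features to keep, and the dataset and number of trees are assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

data = load_breast_cancer()
X, y = data.data, data.target

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Feature importances: how much each feature contributes to the predictions.
ranked = sorted(zip(data.feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")

# Keep only features whose importance exceeds the mean importance (default).
selector = SelectFromModel(forest, prefit=True)
print(selector.transform(X).shape)
```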
Techniques of Embedded methods
• Regularization
• Random Forest Importance
How to choose a Feature Selection Method?
• For machine learning engineers, it is very important to understand which feature selection method will work well for their model.
• The more we know about the data types of the variables, the easier it is to choose the appropriate statistical measure for feature selection.
• To do this, we first need to identify the types of the input and output variables.
Input Variable  | Output Variable | Feature Selection Technique
Numerical       | Numerical       | Pearson's correlation coefficient (linear correlation); Spearman's rank coefficient (non-linear correlation)
Numerical       | Categorical     | ANOVA correlation coefficient (linear); Kendall's rank coefficient (non-linear)
Categorical     | Numerical       | Kendall's rank coefficient (linear); ANOVA correlation coefficient (non-linear)
Categorical     | Categorical     | Chi-Squared test (contingency tables); Mutual Information
References
• https://www.javatpoint.com/feature-selection-techniques-in-machine-learning
