NOTES
Feature Engineering is the process of creating new features or transforming existing features
to improve the performance of a machine-learning model. It involves selecting relevant
information from raw data and transforming it into a format that can be easily understood by
a model. The goal is to improve model accuracy by providing more meaningful and relevant
information.
What is Feature Engineering?
Feature engineering is the process of transforming raw data into features that are suitable
for machine learning models. In other words, it is the process of selecting, extracting, and
transforming the most relevant features from the available data to build more accurate and
efficient machine learning models.
The success of machine learning models heavily depends on the quality of the features used
to train them. Feature engineering involves a set of techniques that enable us to create new
features by combining or transforming the existing ones. These techniques help to highlight
the most important patterns and relationships in the data, which in turn helps the machine
learning model to learn from the data more effectively.
What is a Feature?
In the context of machine learning, a feature (also known as a variable or attribute) is an
individual measurable property or characteristic of a data point that is used as input for a
machine learning algorithm. Features can be numerical, categorical, or text-based, and they
represent different aspects of the data that are relevant to the problem at hand.
For example, in a dataset of housing prices, features could include the number of
bedrooms, the square footage, the location, and the age of the property. In a dataset of
customer demographics, features could include age, gender, income level, and occupation.
The choice and quality of features are critical in machine learning, as they can greatly
impact the accuracy and performance of the model.
Need for Feature Engineering in Machine Learning
We engineer features for various reasons, and some of the main reasons include:
Improve User Experience: The primary reason we engineer features is to enhance the
user experience of a product or service. By adding new features, we can make the product
more intuitive, efficient, and user-friendly, which can increase user satisfaction and
engagement.
Competitive Advantage: Another reason we engineer features is to gain a competitive
advantage in the marketplace. By offering unique and innovative features, we can
differentiate our product from competitors and attract more customers.
Meet Customer Needs: We engineer features to meet the evolving needs of customers.
By analyzing user feedback, market trends, and customer behavior, we can identify areas
where new features could enhance the product’s value and meet customer needs.
Increase Revenue: Features can also be engineered to generate more revenue. For
example, a new feature that streamlines the checkout process can increase sales, or a
feature that provides additional functionality could lead to more upsells or cross-sells.
Future-Proofing: Engineering features can also be done to future-proof a product or
service. By anticipating future trends and potential customer needs, we can develop
features that ensure the product remains relevant and useful in the long term.
Processes Involved in Feature Engineering
Feature engineering in machine learning mainly consists of five processes: Feature Creation,
Feature Transformation, Feature Extraction, Feature Selection, and Feature Scaling. It is an
iterative process that requires experimentation and testing to find the best combination of
features for a given problem. The success of a machine learning model largely depends on the
quality of the features used in the model.
1. Feature Creation
Feature Creation is the process of generating new features based on domain knowledge or by
observing patterns in the data. It is a form of feature engineering that can significantly
improve the performance of a machine-learning model.
Types of Feature Creation:
1. Domain-Specific: Creating new features based on domain knowledge, such as creating
features based on business rules or industry standards.
2. Data-Driven: Creating new features by observing patterns in the data, such as
calculating aggregations or creating interaction features.
3. Synthetic: Generating new features by combining existing features or synthesizing
new data points.
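To make the three types above concrete, here is a minimal sketch using pandas; the data frame and its columns (price, quantity, city) are hypothetical and only for illustration.

import pandas as pd

# Hypothetical transaction data
df = pd.DataFrame({
    "price": [10.0, 20.0, 15.0, 30.0],
    "quantity": [2, 1, 4, 3],
    "city": ["NY", "LA", "NY", "LA"],
})

# Domain-specific: a business rule such as revenue = price * quantity
df["revenue"] = df["price"] * df["quantity"]

# Data-driven: an aggregation observed in the data, e.g. mean revenue per city
df["city_mean_revenue"] = df.groupby("city")["revenue"].transform("mean")

# Synthetic: combine existing features into a new ratio feature
df["price_to_city_mean"] = df["price"] / df.groupby("city")["price"].transform("mean")

print(df)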
Why Feature Creation?
1. Improves Model Performance: By providing additional and more relevant information
to the model, feature creation can increase the accuracy and precision of the model.
2. Increases Model Robustness: By adding additional features, the model can become more
robust to outliers and other anomalies.
3. Improves Model Interpretability: By creating new features, it can be easier to
understand the model’s predictions.
4. Increases Model Flexibility: By adding new features, the model can be made more
flexible to handle different types of data.
2. Feature Transformation
Feature Transformation is the process of transforming the features into a more suitable
representation for the machine learning model. This is done to ensure that the model can
effectively learn from the data.
Types of Feature Transformation:
1. Normalization: Rescaling the features to have a similar range, such as between 0 and 1,
to prevent some features from dominating others.
2. Scaling: Rescaling numerical features to a similar scale, such as a standard deviation
of 1, so that they can be compared more easily and the model considers all features
equally.
3. Encoding: Transforming categorical features into a numerical representation.
Examples are one-hot encoding and label encoding.
4. Transformation: Transforming the features using mathematical operations to change
the distribution or scale of the features. Examples are logarithmic, square root, and
reciprocal transformations.
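As a rough sketch of these transformations with scikit-learn, numpy, and pandas (the data frame and column names here are hypothetical):

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder

df = pd.DataFrame({
    "income": [25_000, 40_000, 120_000, 60_000],
    "age": [22, 35, 58, 41],
    "segment": ["basic", "premium", "basic", "gold"],
})

# Normalization: rescale income into the [0, 1] range
df["income_norm"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()

# Scaling: rescale age to zero mean and unit standard deviation
df["age_std"] = StandardScaler().fit_transform(df[["age"]]).ravel()

# Encoding: one-hot encode the categorical column
segment_onehot = OneHotEncoder().fit_transform(df[["segment"]]).toarray()

# Transformation: log transform to reduce skew (log1p also handles zeros)
df["income_log"] = np.log1p(df["income"])

print(df)
print(segment_onehot)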
Why Feature Transformation?
1. Improves Model Performance: By transforming the features into a more suitable
representation, the model can learn more meaningful patterns in the data.
2. Increases Model Robustness: Transforming the features can make the model more
robust to outliers and other anomalies.
3. Improves Computational Efficiency: The transformed features often require fewer
computational resources.
4. Improves Model Interpretability: By transforming the features, it can be easier to
understand the model’s predictions.
3. Feature Extraction
Feature Extraction is the process of creating new features from existing ones to provide more
relevant information to the machine learning model. This is done by transforming,
combining, or aggregating existing features.
Types of Feature Extraction:
1. Dimensionality Reduction: Reducing the number of features by transforming the data
into a lower-dimensional space while retaining important information. Examples
are PCA and t-SNE.
2. Feature Combination: Combining two or more existing features to create a new one.
For example, the interaction between two features.
3. Feature Aggregation: Aggregating features to create a new one. For example, calculating
the mean, sum, or count of a set of features.
4. Feature Transformation: Transforming existing features into a new representation.
For example, log transformation of a feature with a skewed distribution.
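For instance, a minimal PCA-based extraction sketch with scikit-learn (using the built-in iris dataset so the snippet is self-contained):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)        # 150 samples, 4 numeric features

# Dimensionality reduction: project the 4 features onto 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)     # (150, 4) -> (150, 2)
print("explained variance:", pca.explained_variance_ratio_)

# Feature aggregation: a simple row-wise mean of the original features
mean_feature = X.mean(axis=1)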
Why Feature Extraction?
1. Improves Model Performance: By creating new and more relevant features, the model
can learn more meaningful patterns in the data.
2. Reduces Overfitting: By reducing the dimensionality of the data, the model is less
likely to overfit the training data.
3. Improves Computational Efficiency: The transformed features often require fewer
computational resources.
4. Improves Model Interpretability: By creating new features, it can be easier to
understand the model’s predictions.
4. Feature Selection
Feature Selection is the process of selecting a subset of relevant features from the dataset to
be used in a machine-learning model. It is an important step in the feature engineering
process as it can have a significant impact on the model’s performance.
Types of Feature Selection:
1. Filter Method: Based on the statistical measure of the relationship between the feature
and the target variable. Features with a high correlation are selected.
2. Wrapper Method: Based on the evaluation of the feature subset using a specific machine
learning algorithm. The feature subset that results in the best performance is selected.
3. Embedded Method: Based on the feature selection as part of the training process of the
machine learning algorithm.
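A minimal sketch of the three selection strategies with scikit-learn, using the built-in breast cancer dataset so the snippet runs as-is:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)        # 30 numeric features

# Filter method: keep the 10 features with the highest ANOVA F-score
X_filter = SelectKBest(f_classif, k=10).fit_transform(X, y)

# Wrapper method: recursive feature elimination driven by a logistic regression
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10)
X_wrapper = rfe.fit_transform(X, y)

# Embedded method: selection happens as part of training a random forest
sfm = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
X_embedded = sfm.fit_transform(X, y)

print(X.shape, X_filter.shape, X_wrapper.shape, X_embedded.shape)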
Why Feature Selection?
1. Reduces Overfitting: By using only the most relevant features, the model can generalize
better to new data.
2. Improves Model Performance: Selecting the right features can improve the accuracy,
precision, and recall of the model.
3. Decreases Computational Costs: A smaller number of features requires less computation
and storage resources.
4. Improves Interpretability: By reducing the number of features, it is easier to understand
and interpret the results of the model.
5. Feature Scaling
Feature Scaling is the process of transforming the features so that they have a similar scale.
This is important in machine learning because the scale of the features can affect the
performance of the model.
Types of Feature Scaling:
1. Min-Max Scaling: Rescaling the features to a specific range, such as between 0 and 1, by
subtracting the minimum value and dividing by the range.
2. Standard Scaling: Rescaling the features to have a mean of 0 and a standard deviation of
1 by subtracting the mean and dividing by the standard deviation.
3. Robust Scaling: Rescaling the features to be robust to outliers by subtracting the
median and dividing by the interquartile range.
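The difference between the three scalers is easiest to see on a single feature that contains an outlier (the values below are made up):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])   # one feature with an outlier

print(MinMaxScaler().fit_transform(x).ravel())    # min-max: squeezed into [0, 1]
print(StandardScaler().fit_transform(x).ravel())  # standard: zero mean, unit std dev
print(RobustScaler().fit_transform(x).ravel())    # robust: median/IQR, outlier-resistant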
Why Feature Scaling?
1. Improves Model Performance: By transforming the features to have a similar scale, the
model can learn from all features equally and avoid being dominated by a few large
features.
2. Increases Model Robustness: By transforming the features to be robust to outliers,
the model can become more robust to anomalies.
3. Improves Computational Efficiency: Many machine learning algorithms, such as k-
nearest neighbors, are sensitive to the scale of the features and perform better with
scaled features.
4. Improves Model Interpretability: By transforming the features to have a similar
scale, it can be easier to understand the model’s predictions.
What are the Steps in Feature Engineering?
The steps in feature engineering vary between ML engineers and data scientists. Some of
the common steps involved in most machine learning workflows are:
1. Data Cleansing
Data cleansing (also known as data cleaning or data scrubbing) involves identifying
and removing or correcting any errors or inconsistencies in the dataset. This step is
important to ensure that the data is accurate and reliable.
2. Data Transformation
Data transformation involves converting the features into a representation that is
suitable for modelling, for example through normalization, scaling, or encoding of
categorical variables.
3. Feature Extraction
Feature extraction involves deriving new features from the existing ones, for example
by combining, aggregating, or reducing the dimensionality of the data (e.g., with PCA).
4. Feature Selection
Feature selection involves selecting the most relevant features from the dataset for use
in machine learning. This can include techniques like correlation analysis, mutual
information, and stepwise regression.
5. Feature Iteration
Feature iteration involves refining and improving the features based on the
performance of the machine learning model. This can include techniques like adding
new features, removing redundant features and transforming features in different ways.
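As a rough sketch of steps 1 and 4 above (data cleansing, then feature selection via correlation analysis and mutual information), assuming a small hypothetical housing data frame:

import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Hypothetical raw data with a duplicate row and a missing value
df = pd.DataFrame({
    "sqft":     [1400, 1600, 1600, np.nan, 2400, 1100, 2000],
    "bedrooms": [3, 3, 3, 2, 4, 2, 3],
    "age":      [10, 5, 5, 30, 1, 40, 8],
    "price":    [240_000, 300_000, 300_000, 180_000, 450_000, 150_000, 360_000],
})

# Step 1 - data cleansing: drop duplicate rows and fill the missing value
df = df.drop_duplicates()
df["sqft"] = df["sqft"].fillna(df["sqft"].median())

# Step 4 - feature selection: correlation analysis and mutual information
X, y = df[["sqft", "bedrooms", "age"]], df["price"]
print(X.corrwith(y))                                 # simple correlation filter
print(mutual_info_regression(X, y, random_state=0))  # mutual information scores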
These steps, and feature selection in particular, offer several benefits:
1. Improved Model Performance: By focusing on the most relevant features, you can
enhance the accuracy of your model in predicting new, unseen data.
2. Reduced Overfitting: Fewer redundant features mean less noise in your data,
decreasing the chances of making decisions based on irrelevant information.
3. Faster Training Times: With a reduced feature set, your algorithms can train more
quickly, which is particularly important for large-scale applications.
4. Enhanced Interpretability: By focusing on the most important features, you can
gain better insights into the factors driving your model’s predictions.
5. Dimensionality Reduction: Feature selection helps to reduce the complexity of your
model by decreasing the number of input variables.
Since 2016, automated feature engineering has also been available in various machine learning
software packages that help extract features from raw data automatically. Feature engineering
in ML is also often summarized as four main processes: Feature Creation, Transformations,
Feature Extraction, and Feature Selection.
These processes are described below:
1. Feature Creation: Feature creation is finding the most useful variables to be used in
a predictive model. The process is subjective, and it requires human creativity and
intervention. New features are created by combining existing features using operations
such as addition, subtraction, and ratios, and these new features offer great flexibility.
2. Transformations: The transformation step of feature engineering involves adjusting
the predictor variables to improve the accuracy and performance of the model. For
example, it ensures that the model can accept a variety of input data and that all the
variables are on the same scale, making the model easier to understand. It also
improves the model's accuracy and ensures that all the features are within an
acceptable range to avoid computational errors.
3. Feature Extraction: Feature extraction is an automated feature engineering process
that generates new variables by extracting them from the raw data. The main aim of
this step is to reduce the volume of data so that it can be easily used and managed for
data modelling. Feature extraction methods include cluster analysis, text analytics,
edge detection algorithms, and principal components analysis (PCA).
4. Feature Selection: While developing a machine learning model, only a few
variables in the dataset are useful for building the model; the remaining features are
either redundant or irrelevant. Training on a dataset that includes all these redundant
and irrelevant features may negatively impact and reduce the overall performance and
accuracy of the model. Hence it is very important to identify and select the most
appropriate features from the data and remove the irrelevant or less important
features, which is done with the help of feature selection in machine
learning. "Feature selection is a way of selecting the subset of the most relevant
features from the original feature set by removing the redundant, irrelevant, or
noisy features."
In addition to these processes, the following steps are common in feature engineering:
Data Preparation: The first step is data preparation. In this step, raw data acquired
from different sources is prepared and put into a suitable format so that it can be
used in the ML model. Data preparation may include cleaning, delivery, augmentation,
fusion, ingestion, or loading of data.
Exploratory Analysis: Exploratory analysis or exploratory data analysis (EDA) is an
important step of feature engineering, mainly used by data scientists. This step
involves analyzing and investigating the data set and summarizing its main
characteristics. Different data visualization techniques are used to better
understand how to manipulate data sources, to find the most appropriate statistical
technique for data analysis, and to select the best features for the data.
Benchmark: Benchmarking is the process of setting a standard baseline for accuracy
against which all the variables are compared. The benchmarking process is used to
improve the predictability of the model and reduce the error rate.
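A minimal exploratory analysis pass with pandas might look like the following (the data frame is made up for illustration):

import pandas as pd

df = pd.DataFrame({
    "age":    [22, 35, 58, 41, 29],
    "income": [25_000, 40_000, 120_000, 60_000, 32_000],
    "spend":  [500, 900, 2_000, 1_200, 700],
})

print(df.describe())      # summary statistics for each feature
print(df.isna().sum())    # missing values per column
print(df.corr())          # pairwise correlations between features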
Feature Engineering: A Comprehensive Guide for Numerical, Categorical, and Text Data
Feature engineering is the art of creating new features from existing raw data to improve the
performance of machine learning models. It's a crucial step in the data science pipeline, often
making the difference between a good model and a great one.
Numerical Data
Numerical data can be continuous (e.g., age, weight) or discrete (e.g., number of children,
house number). Here are some common techniques:
Scaling:
Normalization: Scales features to a specific range (e.g., 0-1). Useful for algorithms sensitive
to feature magnitudes.
Standardization: Scales features to have zero mean and unit variance. Often preferred for
algorithms that assume normally distributed data.
Transformation:
Log Transformation: Compresses large ranges and can handle skewed data.
Square Root Transformation: Useful for non-negative data with a skewed distribution.
Box-Cox Transformation: A family of power transformations that can handle various
distributions.
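For example, the log, square root, and Box-Cox transformations can be applied with numpy and scipy (the skewed values below are hypothetical):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 2.5, 3.0, 50.0, 120.0])   # right-skewed, strictly positive

log_x = np.log(x)                   # log transform compresses large values
sqrt_x = np.sqrt(x)                 # square root, for non-negative skewed data
boxcox_x, lam = stats.boxcox(x)     # Box-Cox fits a power transform by maximum likelihood

print(lam)
print(np.round(boxcox_x, 3))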
Feature Creation:
Interaction Features: Create new features by combining existing ones (e.g., multiplying age
and income).
Polynomial Features: Create polynomial features from numerical features (e.g., square, cube).
Time-Based Features: Extract features like day of week, month, or time of day from
timestamps.
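A short sketch of these three kinds of created features, using pandas and scikit-learn on hypothetical columns:

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({
    "age": [25, 40, 33],
    "income": [30_000, 80_000, 55_000],
    "signup": pd.to_datetime(["2023-01-15", "2023-06-03", "2023-11-21"]),
})

# Interaction feature: product of two existing numerical features
df["age_x_income"] = df["age"] * df["income"]

# Polynomial features: squares and pairwise products of age and income
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_feats = poly.fit_transform(df[["age", "income"]])

# Time-based features extracted from a timestamp column
df["signup_month"] = df["signup"].dt.month
df["signup_dayofweek"] = df["signup"].dt.dayofweek

print(df)
print(poly.get_feature_names_out())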
Categorical Data
Categorical data represents categories or groups (e.g., gender, country). Here are some
techniques:
Encoding:
One-Hot Encoding: Creates binary features for each category. Suitable for nominal
categorical data.
Label Encoding: Assigns a unique integer to each category. Suitable for ordinal categorical
data.
Target Encoding: Encodes categories based on the target variable. Can be effective but prone
to overfitting.
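The three encodings can be sketched with plain pandas (the columns color, size, and sold are hypothetical; note that naive target encoding computed on the training data itself is exactly where the overfitting risk mentioned above comes from):

import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],   # nominal category
    "size":  ["S", "M", "L", "M"],              # ordinal category
    "sold":  [1, 0, 1, 1],                      # target variable
})

# One-hot encoding for the nominal feature
onehot = pd.get_dummies(df["color"], prefix="color")

# Label encoding for the ordinal feature, using an explicit order
df["size_encoded"] = df["size"].map({"S": 0, "M": 1, "L": 2})

# Target encoding: replace each category with the mean of the target
df["color_target_enc"] = df.groupby("color")["sold"].transform("mean")

print(pd.concat([df, onehot], axis=1))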
Feature Creation:
Frequency Encoding: Replaces categories with their frequency in the dataset.
Grouping: Combine similar categories into a single category.
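Both ideas take only a couple of lines in pandas; the country column is hypothetical:

import pandas as pd

df = pd.DataFrame({"country": ["US", "US", "IN", "FR", "IN", "US", "MC"]})

# Frequency encoding: replace each category with how often it appears
df["country_freq"] = df["country"].map(df["country"].value_counts())

# Grouping: collapse rare categories (fewer than 2 occurrences) into "other"
counts = df["country"].value_counts()
rare = counts[counts < 2].index
df["country_grouped"] = df["country"].where(~df["country"].isin(rare), "other")

print(df)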
Text Data
Text data requires specific techniques to extract meaningful features:
Text Preprocessing:
Tokenization: Split text into words or tokens.
Stop Word Removal: Remove common words like "the," "and," "is."
Stemming/Lemmatization: Reduce words to their root form.
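A minimal preprocessing sketch in Python, using a whitespace tokenizer, a small illustrative stop-word list (in practice a full list such as NLTK's stopwords corpus would be used), and NLTK's Porter stemmer:

from nltk.stem import PorterStemmer

text = "The cats are running quickly and the dogs were barking"

# Tokenization: lowercase and split on whitespace
tokens = text.lower().split()

# Stop word removal with a small illustrative stop-word set
stop_words = {"the", "and", "is", "are", "were"}
tokens = [t for t in tokens if t not in stop_words]

# Stemming: reduce each word to its root form
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

print(stems)   # e.g. ['cat', 'run', 'quickli', 'dog', 'bark']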