Feature Engineering
Feature Engineering & Preparation
Transforming Data for Machine Learning Models
What is Feature Scaling?
Feature scaling is a method used to standardize the range of independent
variables or features of data. Since the range of values of raw data varies
widely, some machine learning algorithms might not work correctly without
it.
Why scale features?
Improves Performance: Algorithms that compute distances (like KNN,
SVM, PCA) are sensitive to the scale of features, as the sketch below illustrates.
Faster Convergence: Gradient descent-based algorithms converge faster
when features are on similar scales.
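A tiny illustrative sketch (hypothetical numbers) of the first point: without scaling, the feature with the largest range dominates any distance computation.
import numpy as np
# Two customers described by (annual income in dollars, age in years)
a = np.array([50_000, 25])
b = np.array([52_000, 60])
# The 2,000-dollar income gap dominates the Euclidean distance,
# so the 35-year age difference barely registers.
print(np.linalg.norm(a - b))  # ~2000.3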
Feature Scaling: Standardization
This method rescales the data to have a mean of 0 and a standard
deviation of 1. It's useful when your data follows a Gaussian (normal)
distribution.
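Concretely, each value is transformed as z = (x − mean) / std, so the rescaled feature is centered at 0 with unit spread.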
from sklearn.preprocessing import StandardScaler
# Initialize the scaler object
scaler = StandardScaler()
# Use fit_transform to learn the scaling parameters (mean, std)
# from the data and then apply the transformation.
# Note: We pass a DataFrame [['feature']] to get a 2D array.
df['scaled_feature'] = scaler.fit_transform(df[['feature']])
Feature Scaling: Normalization
This method rescales the data to a fixed range, typically 0 to 1. It's also
known as Min-Max scaling. This is useful when the distribution of your data
is unknown.
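Concretely, each value is transformed as x_scaled = (x − min) / (max − min), so the minimum maps to 0 and the maximum to 1.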
from sklearn.preprocessing import MinMaxScaler
# Initialize the scaler object
scaler = MinMaxScaler()
# Use fit_transform to learn the min and max values
# from the data and then apply the transformation.
df['normalized_feature'] = scaler.fit_transform(df[['feature']])
Standardization vs. Normalization
Standardization: The output is not bounded to a fixed range, which some
algorithms that expect bounded inputs (e.g., certain neural networks) may
not handle well. It is less affected by outliers.
Normalization: Scales to a fixed range [0, 1]. It is sensitive to outliers:
a single extreme value can squash all the other values into a narrow sub-range.
Rule of thumb: Start with Standardization. If that doesn't work well, try
Normalization.
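A small side-by-side sketch (illustrative values) of how the two scalers treat the same feature containing an outlier:
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# One feature with an outlier (100)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
# Standardization: unbounded output with mean 0 and standard deviation 1
print(StandardScaler().fit_transform(x).ravel())
# Normalization: everything forced into [0, 1]; the outlier becomes 1
# and squashes the first four values close together near 0
print(MinMaxScaler().fit_transform(x).ravel())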
Categorical Encoding: One-Hot vs. Ordinal
Choosing the right encoding for categorical data is crucial.
Ordinal Data: Categories have a clear order (e.g., Low < Medium < High).
Use Ordinal Encoding, which maps them to integers (0, 1, 2).
Nominal Data: Categories have no order (e.g., USA, India, UK). Use
One-Hot Encoding, which creates a new binary column for each category.
Categorical Encoding: The Code
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
# For Ordinal Data (e.g., 'size' column with ['S', 'M', 'L'])
encoder = OrdinalEncoder(categories=[['S', 'M', 'L']])
df['size_encoded'] = encoder.fit_transform(df[['size']])
# For Nominal Data (e.g., 'color' column)
# Pandas get_dummies is a very convenient way to do this.
color_dummies = pd.get_dummies(df['color'], prefix='color')
df = pd.concat([df, color_dummies], axis=1)
Handling Outliers
Outliers are extreme values that can skew your model. After detecting them
(e.g., with a box plot), you can handle them.
Common Strategies:
1. Removal: Simply drop the outlier rows. (Use with caution!)
2. Capping (Winsorizing): Cap the outliers at a certain percentile.
3. Transformation: Use a mathematical function (e.g., log) to reduce the
impact of the outlier.
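For strategy 3, a minimal sketch using a log transform (this assumes df['feature'] is non-negative; np.log1p handles zeros safely):
import numpy as np
# Compress large values so extreme observations have less influence
df['feature_log'] = np.log1p(df['feature'])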
Handling Outliers: The Code
# Capping: Replace values above the 95th percentile and
# below the 5th percentile with those percentile values.
upper_limit = df['feature'].quantile(0.95)
lower_limit = df['feature'].quantile(0.05)
df['feature_capped'] = df['feature'].clip(lower=lower_limit, upper=upper_limit)
Feature Binning (Discretization)
This technique involves converting continuous numerical features into
discrete categorical bins.
Why do this?
Can help the model learn non-linear patterns.
Makes the model more robust to outliers.
import pandas as pd
# Example: Binning 'age' into categories
bins = [0, 18, 35, 60, 100]
labels = ['Child', 'Young Adult', 'Adult', 'Senior']
# right=False makes each bin closed on the left: [0,18), [18,35), [35,60), [60,100)
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)
Feature Engineering: Datetime Features
Raw datetime columns are not directly usable by most models. You need to
extract meaningful features from them.
import pandas as pd
# First, ensure the column is in datetime format
df['date_column'] = pd.to_datetime(df['date_column'])
# Extract useful features
df['year'] = df['date_column'].dt.year
df['month'] = df['date_column'].dt.month
df['day_of_week'] = df['date_column'].dt.dayofweek  # Monday=0, Sunday=6
Basic Text Feature Extraction: Bag-of-Words
To use text data, we must convert it into numbers. The simplest method is
the Bag-of-Words (BoW) model. It counts the occurrences of each word in
a document.
from sklearn.feature_extraction.text import CountVectorizer
# Sample text data
corpus = ['This is the first document.', 'This document is the second document.']
# Initialize the vectorizer
vectorizer = CountVectorizer()
# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)
# The result is a sparse matrix of word counts
print(X.toarray())
print(vectorizer.get_feature_names_out())
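With the default settings, the printed vocabulary for this corpus should be ['document', 'first', 'is', 'second', 'the', 'this'], and the count matrix [[1 1 1 0 1 1], [2 0 1 1 1 1]] (note that 'document' appears twice in the second sentence).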
Bringing It All Together: Pipelines
Pipelines are the best practice for chaining all your preprocessing steps.
Why use Pipelines?
Prevents Data Leakage: Ensures you only fit on the training data.
Simplifies Workflow: Combines many steps into one.
Improves Readability & Reproducibility.
Pipelines: The Code
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# numeric_features and categorical_features are lists of column names,
# e.g. numeric_features = ['age'] and categorical_features = ['color', 'size']

# Define preprocessing for numeric and categorical features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Use ColumnTransformer to apply different transformations to different columns
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)])

# Now you can just call preprocessor.fit_transform(X_train)
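A hedged sketch of how the preprocessor is typically chained with an estimator (LogisticRegression is just a placeholder model, and X_train/X_test/y_train/y_test are assumed to come from the train/test split shown later):
from sklearn.linear_model import LogisticRegression
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())])
# fit() learns the imputation, scaling, and encoding parameters
# from the training data only, then trains the model.
model.fit(X_train, y_train)
print(model.score(X_test, y_test))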
What is Feature Engineering?
This is the process of using domain knowledge to create new features from
existing data. Well-engineered features can significantly improve the
performance of a model.
Types of Feature Engineering:
Creating interaction or polynomial features (see the sketch after this list).
Binning numerical data into categories.
Extracting information from date/time data (e.g., day of the week).
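A minimal sketch of the first item using scikit-learn's PolynomialFeatures (the column names 'feature1' and 'feature2' are placeholders):
from sklearn.preprocessing import PolynomialFeatures
# degree=2 adds squared terms and the pairwise product of the inputs
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['feature1', 'feature2']])
print(poly.get_feature_names_out())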
Feature Engineering: Interaction Example
An interaction feature is a feature created by combining two or more other
features. For example, if you have height and width, you could create an
area feature.
# Create a new feature by multiplying two existing ones.
# This can help the model capture a relationship that is
# dependent on both features together.
df['interaction_feat'] = df['feature1'] * df['feature2']
# Display the original and new features to see the result
print(df[['feature1', 'feature2', 'interaction_feat']].head())
Data Splitting: Training & Testing
This is one of the most critical steps. We split our data to train the model on
one subset and evaluate its performance on a completely separate, unseen
subset.
Training Set: The data the model learns from.
Testing Set: The data used to assess the model's accuracy.
Data Splitting: The Code
We use train_test_split from scikit-learn.
from sklearn.model_selection import train_test_split
# X contains all our independent variables (features).
# We drop the target variable because it's what we want to predict.
X = df.drop('target_variable', axis=1)
# y contains only our dependent variable (the target).
y = df['target_variable']
# test_size=0.2 means 20% of the data will be for testing.
# random_state=42 ensures we get the same split every time.
# This makes our results reproducible.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
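For classification targets, it is also common to pass stratify=y so the class proportions stay the same in both splits:
# Keep the class balance identical in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)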