Feature Engineering
Feature Engineering & Preparation
Transforming Data for Machine Learning Models
What is Feature Scaling?
Feature scaling is a method used to standardize the range of independent
variables or features of data. Since the range of values of raw data varies
widely, some machine learning algorithms might not work correctly without
it.
Why scale features?
Improves Performance: Algorithms that compute distances (like KNN,
SVM, PCA) are sensitive to the scale of features, as the sketch below illustrates.
Faster Convergence: Gradient descent-based algorithms converge faster
when features are on similar scales.
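A tiny illustrative sketch (hypothetical numbers) of the first point: without scaling, the feature with the largest range dominates any distance computation.
import numpy as np
# Two customers described by (annual income in dollars, age in years)
a = np.array([50_000, 25])
b = np.array([52_000, 60])
# The 2,000-dollar income gap dominates the Euclidean distance,
# so the 35-year age difference barely registers.
print(np.linalg.norm(a - b))  # ~2000.3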
Feature Scaling: Standardization
This method rescales the data to have a mean of 0 and a standard
deviation of 1. It's useful when your data follows a Gaussian (normal)
distribution.
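Concretely, each value is transformed as z = (x − mean) / std, so the rescaled feature is centered at 0 with unit spread.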
from sklearn.preprocessing import StandardScaler
# Initialize the scaler object
scaler = StandardScaler()
# Use fit_transform to learn the scaling parameters (mean, std)
# from the data and then apply the transformation.
# Note: We pass a DataFrame [['feature']] to get a 2D array.
df['scaled_feature'] = scaler.fit_transform(df[['feature']])
Feature Scaling: Normalization
This method rescales the data to a fixed range, typically 0 to 1. It's also
known as Min-Max scaling. This is useful when the distribution of your data
is unknown.
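Concretely, each value is transformed as x_scaled = (x − min) / (max − min), so the minimum maps to 0 and the maximum to 1.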
from sklearn.preprocessing import MinMaxScaler
# Initialize the scaler object
scaler = MinMaxScaler()
# Use fit_transform to learn the min and max values
# from the data and then apply the transformation.
df['normalized_feature'] = scaler.fit_transform(df[['feature']])
Standardization vs. Normalization
Standardization: The output is not bounded to a fixed range, which some
algorithms that expect bounded inputs (e.g., certain neural networks) may
not handle well. It is less affected by outliers.
Normalization: Scales to a fixed range [0, 1]. It is sensitive to outliers:
a single extreme value can squash all the other values into a narrow sub-range.
Rule of thumb: Start with Standardization. If that doesn't work well, try
Normalization.
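A small side-by-side sketch (illustrative values) of how the two scalers treat the same feature containing an outlier:
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# One feature with an outlier (100)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
# Standardization: unbounded output with mean 0 and standard deviation 1
print(StandardScaler().fit_transform(x).ravel())
# Normalization: everything forced into [0, 1]; the outlier becomes 1
# and squashes the first four values close together near 0
print(MinMaxScaler().fit_transform(x).ravel())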
Categorical Encoding: One-Hot vs. Ordinal
Choosing the right encoding for categorical data is crucial.
Ordinal Data: Categories have a clear order (e.g., Low < Medium < High).
Use Ordinal Encoding, which maps them to integers (0, 1, 2).
Nominal Data: Categories have no order (e.g., USA, India, UK). Use
One-Hot Encoding, which creates a new binary column for each category.
Categorical Encoding: The Code
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
# For Ordinal Data (e.g., 'size' column with ['S', 'M', 'L'])
encoder = OrdinalEncoder(categories=[['S', 'M', 'L']])
df['size_encoded'] = encoder.fit_transform(df[['size']])
# For Nominal Data (e.g., 'color' column)
# Pandas get_dummies is a very convenient way to do this.
color_dummies = pd.get_dummies(df['color'], prefix='color')
df = pd.concat([df, color_dummies], axis=1)
Handling Outliers
Outliers are extreme values that can skew your model. After detecting them
(e.g., with a box plot), you can handle them.
Common Strategies:
1. Removal: Simply drop the outlier rows. (Use with caution!)
2. Capping (Winsorizing): Cap the outliers at a certain percentile.
3. Transformation: Use a mathematical function (e.g., log) to reduce the
impact of the outlier.
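For strategy 3, a minimal sketch using a log transform (this assumes df['feature'] is non-negative; np.log1p handles zeros safely):
import numpy as np
# Compress large values so extreme observations have less influence
df['feature_log'] = np.log1p(df['feature'])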
Handling Outliers: The Code
# Capping: Replace values above the 95th percentile and
# below the 5th percentile with those percentile values.
upper_limit = df['feature'].quantile(0.95)
lower_limit = df['feature'].quantile(0.05)
df['feature_capped'] = df['feature'].clip(lower=lower_limit, upper=upper_limit)
Feature Binning (Discretization)
This technique involves converting continuous numerical features into
discrete categorical bins.
Why do this?
Can help the model learn non-linear patterns.
Makes the model more robust to outliers.
import pandas as pd
# Example: Binning 'age' into categories
bins = [0, 18, 35, 60, 100]
labels = ['Child', 'Young Adult', 'Adult', 'Senior']
# right=False makes each bin closed on the left: [0,18), [18,35), [35,60), [60,100)
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)
Feature Engineering: Datetime Features
Raw datetime columns are not directly usable by most models. You need to
extract meaningful features from them.
import pandas as pd
# First, ensure the column is in datetime format
df['date_column'] = pd.to_datetime(df['date_column'])
# Extract useful features
df['year'] = df['date_column'].dt.year
df['month'] = df['date_column'].dt.month
df['day_of_week'] = df['date_column'].dt.dayofweek  # Monday=0, Sunday=6
Basic Text Feature Extraction: Bag-of-Words
To use text data, we must convert it into numbers. The simplest method is
the Bag-of-Words (BoW) model. It counts the occurrences of each word in
a document.
from sklearn.feature_extraction.text import CountVectorizer
# Sample text data
corpus = ['This is the first document.', 'This document is the second document.']
# Initialize the vectorizer
vectorizer = CountVectorizer()
# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)
# The result is a sparse matrix of word counts
print(X.toarray())
print(vectorizer.get_feature_names_out())
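With the default settings, the printed vocabulary for this corpus should be ['document', 'first', 'is', 'second', 'the', 'this'], and the count matrix [[1 1 1 0 1 1], [2 0 1 1 1 1]] (note that 'document' appears twice in the second sentence).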
Bringing It All Together: Pipelines
Pipelines are the best practice for chaining all your preprocessing steps.
Why use Pipelines?
Prevents Data Leakage: Ensures you only fit on the training data.
Simplifies Workflow: Combines many steps into one.
Improves Readability & Reproducibility.
Pipelines: The Code
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# numeric_features and categorical_features are lists of column names,
# e.g. numeric_features = ['age'] and categorical_features = ['color', 'size']

# Define preprocessing for numeric and categorical features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Use ColumnTransformer to apply different transformations to different columns
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)])

# Now you can just call preprocessor.fit_transform(X_train)
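A hedged sketch of how the preprocessor is typically chained with an estimator (LogisticRegression is just a placeholder model, and X_train/X_test/y_train/y_test are assumed to come from the train/test split shown later):
from sklearn.linear_model import LogisticRegression
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())])
# fit() learns the imputation, scaling, and encoding parameters
# from the training data only, then trains the model.
model.fit(X_train, y_train)
print(model.score(X_test, y_test))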
What is Feature Engineering?
This is the process of using domain knowledge to create new features from
existing data. Well-engineered features can significantly improve the
performance of a model.
Types of Feature Engineering:
Creating interaction or polynomial features (see the sketch after this list).
Binning numerical data into categories.
Extracting information from date/time data (e.g., day of the week).
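A minimal sketch of the first item using scikit-learn's PolynomialFeatures (the column names 'feature1' and 'feature2' are placeholders):
from sklearn.preprocessing import PolynomialFeatures
# degree=2 adds squared terms and the pairwise product of the inputs
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['feature1', 'feature2']])
print(poly.get_feature_names_out())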
Feature Engineering: Interaction Example
An interaction feature is a feature created by combining two or more other
features. For example, if you have height and width, you could create an
area feature.
# Create a new feature by multiplying two existing ones.
# This can help the model capture a relationship that is
# dependent on both features together.
df['interaction_feat'] = df['feature1'] * df['feature2']
# Display the original and new features to see the result
print(df[['feature1', 'feature2', 'interaction_feat']].head())
Data Splitting: Training & Testing
This is one of the most critical steps. We split our data to train the model on
one subset and evaluate its performance on a completely separate, unseen
subset.
Training Set: The data the model learns from.
Testing Set: The data used to assess the model's accuracy.
Data Splitting: The Code
We use train_test_split from scikit-learn.
from sklearn.model_selection import train_test_split
# X contains all our independent variables (features).
# We drop the target variable because it's what we want to predict.
X = df.drop('target_variable', axis=1)
# y contains only our dependent variable (the target).
y = df['target_variable']
# test_size=0.2 means 20% of the data will be for testing.
# random_state=42 ensures we get the same split every time.
# This makes our results reproducible.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
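For classification targets, it is also common to pass stratify=y so the class proportions stay the same in both splits:
# Keep the class balance identical in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)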