Feature Engineering & Preparation

Transforming Data for Machine Learning Models

The document covers feature engineering techniques essential for preparing data for machine learning models, including feature scaling, categorical encoding, handling outliers, and creating new features. It emphasizes the importance of standardization and normalization, as well as the use of pipelines to streamline preprocessing steps. Additionally, it discusses data splitting for training and testing models to ensure accurate performance evaluation.

What is Feature Scaling?


Feature scaling is a method used to standardize the range of independent
variables or features of data. Since the range of values of raw data varies
widely, some machine learning algorithms might not work correctly without
it.

Why scale features?

Improves Performance: Algorithms that compute distances (like KNN, SVM, PCA) are sensitive to the scale of features, as the sketch below illustrates.
Faster Convergence: Gradient descent-based algorithms converge faster when features are on similar scales.

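As a quick illustration of the first point, here is a minimal sketch (not taken from the slides; the income and age values are made up) of how a feature with a large range dominates Euclidean distance until the data is standardized.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows of [income, age]
X = np.array([[30_000.0, 25.0],
              [32_000.0, 60.0],
              [90_000.0, 26.0]])

# Raw distances: the income column (tens of thousands) swamps the age column (tens),
# so row 0 looks far closer to row 1 than to row 2 regardless of the age values.
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(X[0] - X[2]))

# After standardization both features contribute on comparable scales.
X_scaled = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]), np.linalg.norm(X_scaled[0] - X_scaled[2]))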

Feature Scaling: Standardization


This method rescales the data to have a mean of 0 and a standard deviation of 1 by transforming each value as z = (x - mean) / std. It's useful when your data follows a Gaussian (normal) distribution.

from sklearn.preprocessing import StandardScaler

# Initialize the scaler object
scaler = StandardScaler()

# Use fit_transform to learn the scaling parameters (mean, std)
# from the data and then apply the transformation.
# Note: We pass a DataFrame [['feature']] to get a 2D array.
df['scaled_feature'] = scaler.fit_transform(df[['feature']])


Feature Scaling: Normalization


This method rescales the data to a fixed range, typically 0 to 1, using x_scaled = (x - min) / (max - min). It's also known as Min-Max scaling. This is useful when the distribution of your data is unknown.

from sklearn.preprocessing import MinMaxScaler

# Initialize the scaler object
scaler = MinMaxScaler()

# Use fit_transform to learn the min and max values
# from the data and then apply the transformation.
df['normalized_feature'] = scaler.fit_transform(df[['feature']])


Standardization vs. Normalization


Standardization: Not bounded to a specific range, which can be a problem for some algorithms (e.g., neural networks). Less affected by outliers.
Normalization: Scales to a fixed range [0, 1]. Can be sensitive to outliers, which can squash the other values into a small range (see the sketch below).

Rule of thumb: Start with Standardization. If that doesn't work well, try Normalization.

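To make the outlier point concrete, here is a minimal sketch (not from the slides; the numbers are made up) comparing the two scalers on a column that contains one extreme value.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is the outlier

# Min-Max scaling maps the outlier to 1.0 and squashes the other values near 0.
print(MinMaxScaler().fit_transform(x).ravel())    # ~[0.00, 0.01, 0.02, 0.03, 1.00]

# Standardization is not bounded to [0, 1]: the outlier lands about 2 standard
# deviations out while the remaining values sit near -0.5.
print(StandardScaler().fit_transform(x).ravel())  # ~[-0.54, -0.51, -0.49, -0.46, 2.00]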

Categorical Encoding: One-Hot vs. Ordinal


Choosing the right encoding for categorical data is crucial.

Ordinal Data: Categories have a clear order (e.g., Low < Medium < High). Use Ordinal Encoding, which maps them to integers (0, 1, 2).
Nominal Data: Categories have no order (e.g., USA, India, UK). Use One-Hot Encoding, which creates a new binary column for each category.


Categorical Encoding: The Code

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# For Ordinal Data (e.g., a 'size' column with ['S', 'M', 'L'])
encoder = OrdinalEncoder(categories=[['S', 'M', 'L']])
df['size_encoded'] = encoder.fit_transform(df[['size']])

# For Nominal Data (e.g., a 'color' column):
# pandas get_dummies is a very convenient way to one-hot encode.
color_dummies = pd.get_dummies(df['color'], prefix='color')
df = pd.concat([df, color_dummies], axis=1)

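scikit-learn's OneHotEncoder (which also appears in the Pipelines code later) does the same job as get_dummies and can be fitted on the training data only, then reused on new data. A minimal sketch, assuming the same hypothetical 'color' column:

from sklearn.preprocessing import OneHotEncoder

# handle_unknown='ignore' means categories unseen during fit become all-zero rows
# instead of raising an error.
ohe = OneHotEncoder(handle_unknown='ignore')
color_onehot = ohe.fit_transform(df[['color']])   # sparse matrix of 0/1 columns
print(ohe.get_feature_names_out())                # e.g. ['color_blue', 'color_green', ...]
print(color_onehot.toarray()[:5])                 # dense view of the first rows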

Handling Outliers
Outliers are extreme values that can skew your model. After detecting them
(e.g., with a box plot), you can handle them.

Common Strategies:

1. Removal: Simply drop the outlier rows. (Use with caution!)
2. Capping (Winsorizing): Cap the outliers at a certain percentile.
3. Transformation: Apply a mathematical function (e.g., log) to reduce the impact of the outlier (see the log-transform sketch after the capping code below).


Handling Outliers: The Code

# Capping: Replace values above the 95th percentile and
# below the 5th percentile with those percentile values.
upper_limit = df['feature'].quantile(0.95)
lower_limit = df['feature'].quantile(0.05)
upper_limit = df['feature'].quantile(0.95)
lower_limit = df['feature'].quantile(0.05)

df['feature_capped'] = df['feature'].clip(lower=lower_limit, upper=upper_limit)

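For strategy 3 (transformation) from the list above, here is a minimal sketch using NumPy's log1p; it assumes the hypothetical 'feature' column is non-negative, which the slides do not state.

import numpy as np

# log1p computes log(1 + x), so it is defined at 0; large values are compressed,
# which reduces the influence of high outliers.
df['feature_log'] = np.log1p(df['feature'])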

Feature Binning (Discretization)


This technique involves converting continuous numerical features into
discrete categorical bins.

Why do this?

Can help the model learn non-linear patterns.
Makes the model more robust to outliers.

# Example: Binning 'age' into categories
bins = [0, 18, 35, 60, 100]
labels = ['Child', 'Young Adult', 'Adult', 'Senior']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)

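pd.cut above uses fixed, hand-picked edges. If you would rather let the data choose the edges, pandas also provides qcut for equal-frequency bins; a minimal sketch, assuming the same hypothetical 'age' column:

# Four bins, each holding roughly 25% of the rows; qcut picks the edges from the data.
df['age_quartile'] = pd.qcut(df['age'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])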

Feature Engineering: Datetime Features


Raw datetime columns are not directly usable by most models. You need to
extract meaningful features from them.

# First, ensure the column is in datetime format
df['date_column'] = pd.to_datetime(df['date_column'])

# Extract useful features
df['year'] = df['date_column'].dt.year
df['month'] = df['date_column'].dt.month
df['day_of_week'] = df['date_column'].dt.dayofweek  # Monday=0, Sunday=6

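Simple flags can be derived from these parts as well; for example, a minimal sketch of a weekend indicator built on the day_of_week column above (the column names are the slide's hypothetical ones).

# Saturday=5 and Sunday=6 under pandas' Monday=0 convention.
df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)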

Basic Text Feature Extraction: Bag-of-Words


To use text data, we must convert it into numbers. The simplest method is
the Bag-of-Words (BoW) model. It counts the occurrences of each word in
a document.

from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
corpus = ['This is the first document.', 'This document is the second document.']

# Initialize the vectorizer
vectorizer = CountVectorizer()

# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)

# The result is a sparse matrix of word counts
print(X.toarray())
print(vectorizer.get_feature_names_out())
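
For this two-sentence corpus the script should print roughly the following (CountVectorizer lowercases the text and sorts the vocabulary alphabetically):

[[1 1 1 0 1 1]
 [2 0 1 1 1 1]]
['document' 'first' 'is' 'second' 'the' 'this']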

Bringing It All Together: Pipelines


Pipelines are the best practice for chaining all your preprocessing steps.

Why use Pipelines?

Prevents Data Leakage: Ensures you only fit on the training data.
Simplifies Workflow: Combines many steps into one.
Improves Readability & Reproducibility.


Pipelines: The Code


from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Lists of column names to treat as numeric / categorical
numeric_features = ['feature1', 'feature2']    # replace with your numeric columns
categorical_features = ['color', 'size']       # replace with your categorical columns

# Define preprocessing for numeric and categorical features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Use ColumnTransformer to apply different transformations to different columns
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)])

# Now you can just call preprocessor.fit_transform(X_train)


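In practice the preprocessor is usually chained with an estimator so that a single fit call runs every step; a minimal sketch (LogisticRegression is an arbitrary example model, not something the slides specify, and X_train/X_test come from the train/test split covered later):

from sklearn.linear_model import LogisticRegression

# One pipeline: impute + scale + encode, then train the classifier.
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))])

# fit() learns the preprocessing parameters and the model from X_train only,
# which is what prevents data leakage into the test set.
model.fit(X_train, y_train)
print(model.score(X_test, y_test))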

What is Feature Engineering?


This is the process of using domain knowledge to create new features from
existing data. Well-engineered features can significantly improve the
performance of a model.

Types of Feature Engineering:

Creating interaction or polynomial features (examples below).
Binning numerical data into categories.
Extracting information from date/time data (e.g., day of the week).


Feature Engineering: Interaction Example


An interaction feature is a feature created by combining two or more other features. For example, if you have height and width, you could create an area feature.

# Create a new feature by multiplying two existing ones.
# This can help the model capture a relationship that is
# dependent on both features together.
df['interaction_feat'] = df['feature1'] * df['feature2']

# Display the original and new features to see the result
print(df[['feature1', 'feature2', 'interaction_feat']].head())

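The "What is Feature Engineering?" section above also mentions polynomial features; scikit-learn can generate squares and pairwise products in one step. A minimal sketch, assuming the same hypothetical feature1 and feature2 columns:

from sklearn.preprocessing import PolynomialFeatures

# degree=2 produces feature1, feature2, feature1^2, feature1*feature2, feature2^2.
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_feats = poly.fit_transform(df[['feature1', 'feature2']])
print(poly.get_feature_names_out())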

Data Splitting: Training & Testing


This is one of the most critical steps. We split our data to train the model on
one subset and evaluate its performance on a completely separate, unseen
subset.

Training Set: The data the model learns from.
Testing Set: The data used to assess the model's accuracy.


Data Splitting: The Code


We use train_test_split from scikit-learn.

from sklearn.model_selection import train_test_split

# X contains all our independent variables (features).
# We drop the target variable because it's what we want to predict.
X = df.drop('target_variable', axis=1)

# y contains only our dependent variable (the target).
y = df['target_variable']

# test_size=0.2 means 20% of the data will be for testing.
# random_state=42 ensures we get the same split every time.
# This makes our results reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
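
Two quick follow-ups, offered as usage notes rather than part of the slides: a shape check to confirm the 80/20 split, and the stratify argument, which keeps class proportions similar in both subsets for classification targets.

# Sanity check: about 80% of the rows should be in the training set.
print(X_train.shape, X_test.shape)

# For a classification target, stratify=y preserves the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)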