Encoding Categorical Data in Sklearn
Last Updated: 30 Jul, 2025
Categorical data is a common occurrence in many datasets, especially in fields like marketing, finance and social sciences. Unlike numerical data, categorical data represents discrete values or categories such as gender, country or product type. Machine learning algorithms require numerical input, making it essential to convert categorical data into a numerical format. This process is known as encoding.
Implementation
Let's see how these techniques are used on a real-life dataset.
Step 1: Loading the Dataset
Here we will load our dataset. Click here to download the dataset used.
Python
import pandas as pd

# Load the dataset and preview the first rows
df = pd.read_csv('data.csv')
print(df.head())
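Before encoding, it can be useful to confirm which columns are actually categorical. A minimal sketch (assuming df is the DataFrame loaded above):
Python
# Columns stored as object dtype are the categorical candidates
print(df.dtypes)
print(df.select_dtypes(include='object').columns.tolist())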
Step 2: Label Encoding
Label encoding maps each category to an integer. Here we apply it to the target column class.
- fit_transform: learns the mapping and applies it in one step.
- classes_: shows the learned categories; their position in this array is the encoded integer.
Python
from sklearn.preprocessing import LabelEncoder

# Fit the encoder on the target column and store the integer codes in a new column
le = LabelEncoder()
df['class_encoded'] = le.fit_transform(df['class'])

print("Class labels mapping:", dict(zip(le.classes_, le.transform(le.classes_))))
print(df[['class', 'class_encoded']].head())
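If you later need the original labels back, for example when reporting predictions, the same fitted encoder can reverse the mapping with inverse_transform. A minimal sketch:
Python
# Recover the original class labels from the encoded integers
decoded = le.inverse_transform(df['class_encoded'].head())
print(decoded)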
Step 3: One-Hot Encoding
One-Hot encoding converts nominal categorical variables like buying, maint, lug_boot and safety into binary columns, one per category.
- fit_transform: finds all unique categories and encodes each one as a binary column.
- df_ohe.drop(columns=categorical_cols, inplace=True): drops the original categorical columns if you want to keep only the encoded values.
Python
from sklearn.preprocessing import OneHotEncoder

categorical_cols = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']

# Learn the unique categories of each column and expand them into binary columns
ohe = OneHotEncoder(sparse_output=False)
ohe_array = ohe.fit_transform(df[categorical_cols])
print("OHE feature names:", ohe.get_feature_names_out(categorical_cols))

# Wrap the encoded array in a DataFrame and attach it to the original data
ohe_df = pd.DataFrame(ohe_array, columns=ohe.get_feature_names_out(categorical_cols))
df_ohe = pd.concat([df.reset_index(drop=True), ohe_df], axis=1)
print(df_ohe.head())
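In practice, new data can contain categories that were not present when the encoder was fitted. OneHotEncoder's handle_unknown='ignore' option makes transform produce all-zero indicator columns for such values instead of raising an error; a minimal sketch:
Python
# handle_unknown='ignore' outputs all zeros for categories unseen during fit
ohe_safe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
ohe_safe.fit(df[categorical_cols])
print(ohe_safe.get_feature_names_out(categorical_cols)[:5])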
Step 4: Ordinal Encoding
Ordinal encoding is used for features where order matters (here, let's treat safety as ordinal: low < med < high). Explicitly supplying the category order ensures the model sees the true underlying order.
Python
from sklearn.preprocessing import OrdinalEncoder

# Supply the category order explicitly so low < med < high is preserved
ordinal_cols = ['safety']
categories_order = [['low', 'med', 'high']]
oe = OrdinalEncoder(categories=categories_order)
df['safety_ord'] = oe.fit_transform(df[['safety']])
print(df[['safety', 'safety_ord']].head())
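If unseen values could appear at prediction time, OrdinalEncoder can map them to a sentinel instead of failing. A minimal sketch (unknown_value=-1 is an illustrative choice):
Python
# Map categories not listed in categories_order to -1 instead of raising an error
oe_safe = OrdinalEncoder(categories=categories_order,
                         handle_unknown='use_encoded_value', unknown_value=-1)
oe_safe.fit(df[['safety']])
print(oe_safe.categories_)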
Step 5: Combining Encoders with ColumnTransformer
- This approach cleanly handles both ordinal and nominal encoding and fits directly into any sklearn modeling pipeline.
- Suitable for any supervised learning task (classification/regression) with categorical inputs.
Python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

ordinal_features = ['safety']
ordinal_categories = [['low', 'med', 'high']]
nominal_features = ['buying', 'maint', 'doors', 'persons', 'lug_boot']

# Apply ordinal encoding to ordered features and one-hot encoding to the rest
preprocessor = ColumnTransformer(
    transformers=[
        ('ord', OrdinalEncoder(categories=ordinal_categories), ordinal_features),
        ('nom', OneHotEncoder(sparse_output=False), nominal_features)
    ]
)

features = ordinal_features + nominal_features
X = df[features]
X_prepared = preprocessor.fit_transform(X)
print("Transformed shape:", X_prepared.shape)
Output:
Transformed shape: (1728, 19)
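Because the preprocessor is a standard sklearn transformer, it can be chained with any estimator. A minimal sketch using a DecisionTreeClassifier with the class column as target (the model choice and split parameters are illustrative assumptions):
Python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Split the raw features and target, then let the pipeline handle encoding
X_train, X_test, y_train, y_test = train_test_split(
    X, df['class'], test_size=0.2, random_state=42)

model = Pipeline(steps=[
    ('preprocess', preprocessor),               # same ColumnTransformer as above
    ('clf', DecisionTreeClassifier(random_state=42))
])
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))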
Step 6: Inspection and Resulting Dataset
- Always fit the encoder objects on the training data and reuse the same fitted encoders on the test data to ensure consistency (a short sketch of this follows after the output below).
- For categorical variable exploration and encoding in a deployed or production ML pipeline, prefer maintaining the category order explicitly for any ordinal features.
Python
import numpy as np

# Combine the prepared features with the encoded target into one DataFrame
final_df = pd.DataFrame(
    np.hstack([X_prepared, df[['class_encoded']].values]),
    columns=list(preprocessor.get_feature_names_out()) + ['class_encoded']
)
print(final_df.head())
Output:
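As noted above, encoders should be fitted on the training split only and then reused unchanged on the test split. A minimal sketch of that pattern with a one-hot encoder:
Python
from sklearn.model_selection import train_test_split

# Fit the encoder on the training data only, then reuse it on the test data
X_tr, X_te = train_test_split(df[categorical_cols], test_size=0.2, random_state=42)
enc = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_tr_enc = enc.fit_transform(X_tr)   # learn categories from the training split
X_te_enc = enc.transform(X_te)       # apply the same mapping to the test split
print(X_tr_enc.shape, X_te_enc.shape)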
Advantages and Disadvantages of each Encoding Technique
| Encoding Technique | Advantages | Disadvantages |
|---|---|---|
| Label Encoding | Simple and easy to implement; suitable for ordinal data | Introduces arbitrary ordinal relationships for nominal data; may not work well with outliers |
| One-Hot Encoding | Suitable for nominal data; avoids introducing ordinal relationships; maintains information on the values of each variable | Can lead to increased dimensionality and sparsity; may cause overfitting, especially with many categories and small sample sizes |
| Ordinal Encoding | Preserves the order of categories; suitable for ordinal data | Not suitable for nominal data; assumes equal spacing between categories, which may not be true |
Encoding categorical data is essential to make it usable for machine learning models. By applying label, one-hot or ordinal encoding, we turn categories into numbers, helping models learn effectively and produce accurate results.