How to Normalize Data Using scikit-learn in Python
Last Updated: 20 Jun, 2024
Data normalization is a crucial preprocessing step in machine learning. It ensures that features contribute equally to the model by scaling them to a common range. This process helps in improving the convergence of gradient-based optimization algorithms and makes the model training process more efficient. In this article, we'll explore how to normalize data using scikit-learn, a popular Python library for machine learning.
What is Data Normalization?
Data normalization involves transforming data into a consistent format. There are several normalization techniques, but the most common ones include:
- Min-Max Scaling: Rescales data to a range of [0, 1] or [-1, 1].
- Standardization (Z-score normalization): Rescales data to have a mean of 0 and a standard deviation of 1.
- Robust Scaling: Uses median and interquartile range, making it robust to outliers.
Why Normalize Data?
Normalization is essential for:
- Improving model performance: Algorithms like gradient descent converge faster on normalized data.
- Fair comparison of features: Ensures that features with larger ranges do not dominate the model.
- Consistent interpretation: Makes coefficients in linear models more interpretable.
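The "fair comparison" point can be illustrated with a small sketch. The income and age values below are made up for illustration; the idea is that a distance computation on raw data is dominated by the feature with the larger numeric range, and scaling removes that dominance.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two hypothetical samples: income (large range) and age (small range)
a = np.array([50000.0, 25.0])
b = np.array([52000.0, 60.0])

# Raw Euclidean distance is dominated almost entirely by the income feature
raw_dist = np.linalg.norm(a - b)

# After Min-Max scaling, both features contribute on a comparable scale
X = np.array([[30000.0, 18.0], [50000.0, 25.0], [52000.0, 60.0], [90000.0, 70.0]])
scaled = MinMaxScaler().fit_transform(X)
scaled_dist = np.linalg.norm(scaled[1] - scaled[2])

print(raw_dist)     # roughly 2000: the age difference of 35 barely registers
print(scaled_dist)  # well under 1: both features now matter
```

This is why distance-based models (KNN, K-Means) and gradient-based optimizers are particularly sensitive to unscaled features.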
Using scikit-learn for Normalization
scikit-learn provides several transformers for normalization, including MinMaxScaler, StandardScaler, and RobustScaler. Let's go through each of these with examples.
1. Min-Max Scaling
Min-Max Scaling transforms features by scaling them to a given range, typically [0, 1]. The formula used is:
X' = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}
Example code:
Python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# Sample data
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
# Initialize the scaler
scaler = MinMaxScaler()
# Fit and transform the data
normalized_data = scaler.fit_transform(data)
print("Normalized Data (Min-Max Scaling):")
print(normalized_data)
Output:
Normalized Data (Min-Max Scaling):
[[0.         0.        ]
 [0.33333333 0.33333333]
 [0.66666667 0.66666667]
 [1.         1.        ]]
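A common pattern is to fit the scaler on training data and then apply the same learned parameters to new data with transform. The new_data values below are hypothetical; note that values outside the fitted range map outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Same sample data as above
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

scaler = MinMaxScaler()
scaler.fit(data)  # learn X_min and X_max from the training data

# New data is scaled with the *training* min/max, so values can fall
# outside [0, 1] if they exceed the fitted range
new_data = np.array([[4, 5], [9, 10]])
print(scaler.transform(new_data))
# [[0.5        0.5       ]
#  [1.33333333 1.33333333]]
```

Fitting only on training data and reusing the scaler on test data avoids leaking information from the test set into preprocessing.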
2. Standardization
Standardization scales data to have a mean of 0 and a standard deviation of 1. The formula used is:
X' = \frac{X - \mu}{\sigma}
where \mu is the mean and \sigma is the standard deviation of the feature.
Example code:
Python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same sample data as above
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Initialize the scaler
scaler = StandardScaler()
# Fit and transform the data
standardized_data = scaler.fit_transform(data)
print("Standardized Data (Z-score Normalization):")
print(standardized_data)
Output:
Standardized Data (Z-score Normalization):
[[-1.34164079 -1.34164079]
 [-0.4472136  -0.4472136 ]
 [ 0.4472136   0.4472136 ]
 [ 1.34164079  1.34164079]]
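We can verify that standardization did what the formula promises: the fitted StandardScaler exposes the learned \mu and \sigma as the mean_ and scale_ attributes, and each column of the result has mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
scaler = StandardScaler()
standardized = scaler.fit_transform(data)

# The fitted parameters mu and sigma are exposed as attributes
print(scaler.mean_)   # per-feature mean: [4. 5.]
print(scaler.scale_)  # per-feature standard deviation

# Each column now has mean ~0 and standard deviation ~1
print(standardized.mean(axis=0))
print(standardized.std(axis=0))
```

Note that StandardScaler uses the population standard deviation (dividing by n, not n - 1).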
3. Robust Scaling
Robust Scaling uses the median and the interquartile range to scale the data, making it robust to outliers.
Example Code:
Python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Same sample data as above
data = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

# Initialize the scaler
scaler = RobustScaler()
# Fit and transform the data
robust_scaled_data = scaler.fit_transform(data)
print("Robust Scaled Data:")
print(robust_scaled_data)
Output:
Robust Scaled Data:
[[-1.         -1.        ]
 [-0.33333333 -0.33333333]
 [ 0.33333333  0.33333333]
 [ 1.          1.        ]]
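The robustness claim is easy to demonstrate. Below, the sample data is extended with one extreme outlier (the value 1000 is made up for illustration): Min-Max scaling squashes the inliers into a tiny sliver near 0, while Robust Scaling leaves them spread out because the median and IQR barely move.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Same data as above plus one extreme outlier in each column
data_out = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [1000, 1000]])

minmax = MinMaxScaler().fit_transform(data_out)
robust = RobustScaler().fit_transform(data_out)

# Min-Max: the outlier sets X_max, so the four inliers land near 0
print(minmax[:4, 0])
# Robust: the inliers stay evenly spread, since the median and IQR
# are barely affected by the outlier
print(robust[:4, 0])
```

This makes RobustScaler the usual choice when the data contains outliers that should not dominate the scaling.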
Conclusion
Data normalization is a vital step in the preprocessing pipeline of any machine learning project. Using scikit-learn, we can easily apply different normalization techniques such as Min-Max Scaling, Standardization, and Robust Scaling. Choosing the right normalization method can significantly impact the performance of your machine learning models.
By incorporating these normalization techniques, you can ensure that your data is well-prepared for modeling, leading to more accurate and reliable predictions.