What is Layer Normalization?
Layer Normalization stabilizes and accelerates the training process in deep learning. In a typical neural network, the activations of each layer can vary drastically, which leads to issues like exploding or vanishing gradients and slows down training. Layer Normalization addresses this by normalizing the output of each layer, helping to keep the activations within a stable range.
It works by normalizing the input to each neuron so that the mean activation becomes 0 and the variance becomes 1. Unlike Batch Normalization, which normalizes over the batch (i.e. across all samples in the batch), it normalizes over the features of each individual data point.
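To make the difference in the axis of normalization concrete, here is a minimal sketch in PyTorch (the toy tensor and its shape are made up for illustration): the first computation normalizes each feature across the batch, Batch Normalization style, while the second normalizes each sample across its features, Layer Normalization style.
Python
import torch

# A toy batch: 4 samples with 3 features each (values chosen only for illustration)
x = torch.tensor([[3.0, 5.0, 2.0],
                  [1.0, 3.0, 5.0],
                  [3.0, 2.0, 7.0],
                  [4.0, 6.0, 1.0]])
eps = 1e-5

# Batch Normalization style: statistics per feature, computed across the batch (dim=0)
bn_style = (x - x.mean(dim=0, keepdim=True)) / torch.sqrt(x.var(dim=0, unbiased=False, keepdim=True) + eps)

# Layer Normalization style: statistics per sample, computed across the features (dim=1)
ln_style = (x - x.mean(dim=1, keepdim=True)) / torch.sqrt(x.var(dim=1, unbiased=False, keepdim=True) + eps)

print(bn_style)  # every column now has mean ~0 and variance ~1
print(ln_style)  # every row now has mean ~0 and variance ~1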
Working of Layer Normalization
Let's consider an example where we have three vectors:
- x_1 = [3.0, 5.0, 2.0, 8.0]
- x_2 = [1.0, 3.0, 5.0, 8.0]
- x_3 = [3.0, 2.0, 7.0, 9.0]
For each input x of the layer, Layer Normalization computes the following:
1. Compute Mean and Variance for Each Data Point
The mean and variance are computed for each input, not across the batch but over its features (i.e. per data point):
\mu = \frac{1}{H} \sum_{i=1}^{H} x_i
\sigma^2 = \frac{1}{H} \sum_{i=1}^{H} (x_i - \mu)^2
where H is the number of features (neurons) in the layer, x_i is the value of the i-th feature, and \mu and \sigma^2 are the computed mean and variance.
Now, let's compute the mean and variance for each data point. For x_1 = [3.0, 5.0, 2.0, 8.0]:
- Mean (\mu_1): \mu_1 = \frac{1}{4} (3.0 + 5.0 + 2.0 + 8.0) = \frac{18.0}{4} = 4.5
- Variance (\sigma_1^2): \sigma_1^2 = \frac{1}{4} \left[ (3.0 - 4.5)^2 + (5.0 - 4.5)^2 + (2.0 - 4.5)^2 + (8.0 - 4.5)^2 \right] = \frac{21.0}{4} = 5.25
Similarly, for x_2 = [1.0, 3.0, 5.0, 8.0] we get \mu_2 = 4.25 and \sigma_2^2 = 6.6875, and for x_3 = [3.0, 2.0, 7.0, 9.0] we get \mu_3 = 5.25 and \sigma_3^2 = 8.1875.
2. Normalize Each Feature
Each feature is then normalized using the formula:
\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
Here \epsilon is a small constant added for numerical stability; we use \epsilon = 1e-5.
Now we normalize each feature in each vector by subtracting the mean and dividing by the standard deviation (the square root of the variance).
For x_1 = [3.0, 5.0, 2.0, 8.0]:
x_1' = \left[ \frac{3.0 - 4.5}{\sqrt{5.25 + 1e-5}}, \frac{5.0 - 4.5}{\sqrt{5.25 + 1e-5}}, \frac{2.0 - 4.5}{\sqrt{5.25 + 1e-5}}, \frac{8.0 - 4.5}{\sqrt{5.25 + 1e-5}} \right]
We calculate each normalized value for x_1:
x_1' = [-0.6547, 0.2182, -1.0911, 1.5275]
For x_2 = [1.0, 3.0, 5.0, 8.0]:
x_2' = \left[ \frac{1.0 - 4.25}{\sqrt{6.6875 + 1e-5}}, \frac{3.0 - 4.25}{\sqrt{6.6875 + 1e-5}}, \frac{5.0 - 4.25}{\sqrt{6.6875 + 1e-5}}, \frac{8.0 - 4.25}{\sqrt{6.6875 + 1e-5}} \right]
We calculate each normalized value for x_2:
x_2' = [-1.2568, -0.4834, 0.2900, 1.4501]
For x_3 = [3.0, 2.0, 7.0, 9.0]:
x_3' = \left[ \frac{3.0 - 5.25}{\sqrt{8.1875 + 1e-5}}, \frac{2.0 - 5.25}{\sqrt{8.1875 + 1e-5}}, \frac{7.0 - 5.25}{\sqrt{8.1875 + 1e-5}}, \frac{9.0 - 5.25}{\sqrt{8.1875 + 1e-5}} \right]
We calculate each normalized value for x_3:
x_3' = [-0.7863, -1.1358, 0.6116, 1.3106]
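The arithmetic of steps 1 and 2 can be checked with a few lines of PyTorch; this is only a verification of the worked example using the same \epsilon = 1e-5, not the library's built-in LayerNorm module.
Python
import torch

# The three example vectors stacked into one tensor
x = torch.tensor([[3.0, 5.0, 2.0, 8.0],
                  [1.0, 3.0, 5.0, 8.0],
                  [3.0, 2.0, 7.0, 9.0]])
eps = 1e-5

# Per data point (per row) mean and variance over the features
mu = x.mean(dim=1, keepdim=True)                  # [[4.5], [4.25], [5.25]]
var = x.var(dim=1, unbiased=False, keepdim=True)  # [[5.25], [6.6875], [8.1875]]

x_hat = (x - mu) / torch.sqrt(var + eps)
print(x_hat)  # matches x_1', x_2' and x_3' above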
3. Apply Scaling and Shifting
To ensure that the normalized activations can still represent a wide range of values, learnable parameters \gamma (scaling) and \beta (shifting) are introduced. The final output y is computed as:
y_i = \gamma \hat{x}_i + \beta
This allows the network to scale and shift the normalized activations during training.
Here let’s assume \gamma = 1.5 and \beta = 0.5. We can apply this scaling and shifting to the normalized values to get the final output for each vector.
- For x_1: y_1 = [-0.4820, 0.8273, -1.1366, 2.7913]
- For x_2: y_2 = [-1.3851, -0.2250, 0.9350, 2.6751]
- For x_3: y_3 = [-0.6795, -1.2037, 1.4174, 2.4658]
These are the normalized values and the final outputs after applying Layer Normalization with the assumed \gamma and \beta.
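Continuing the same worked example, scaling and shifting is a single extra line; \gamma = 1.5 and \beta = 0.5 are the illustrative scalar values assumed above, while in practice both are learnable tensors with one entry per feature.
Python
import torch

# Normalized values from the previous step
x_hat = torch.tensor([[-0.6547,  0.2182, -1.0911,  1.5275],
                      [-1.2568, -0.4834,  0.2900,  1.4501],
                      [-0.7863, -1.1358,  0.6116,  1.3106]])

gamma, beta = 1.5, 0.5      # scalars here for simplicity; normally per-feature parameters
y = gamma * x_hat + beta
print(y)                    # matches y_1, y_2 and y_3 above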
Implementation of Layer Normalization in a Simple Neural Network with PyTorch
We will use the PyTorch library for the implementation.
- nn.Linear(input_size, output_size): Creates a fully connected layer with the specified input and output dimensions.
- nn.LayerNorm(128): Applies Layer Normalization on the input of size 128.
- forward(self, x): Defines forward pass for the model by applying transformations to the input x step by step.
- torch.randn(10, 64): Generates a tensor of size (10, 64) filled with random values from a normal distribution.
- torch.relu(x): Applies ReLU (Rectified Linear Unit) activation function element-wise to x.
Python
import torch
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self, input_size, output_size):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_size, 128)
        self.layer_norm = nn.LayerNorm(128)   # normalizes the 128 activations of each sample
        self.fc2 = nn.Linear(128, output_size)

    def forward(self, x):
        x = self.fc1(x)
        x = self.layer_norm(x)
        x = torch.relu(x)
        x = self.fc2(x)
        return x

input_data = torch.randn(10, 64)   # batch of 10 samples with 64 features each
model = SimpleNN(64, 10)
output = model(input_data)
print(output)
Output:
A tensor of shape (10, 10), one row of 10 output values for each of the 10 input samples (the exact numbers vary from run to run because the inputs and weights are random).
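A quick way to see the effect of the nn.LayerNorm layer in this model is to inspect the activations right after it; the sketch below reuses the model and input_data defined above and only checks the per-sample statistics.
Python
# After nn.LayerNorm, each sample's 128 activations have mean ~0 and standard deviation ~1
with torch.no_grad():
    h = model.layer_norm(model.fc1(input_data))
print(h.mean(dim=1))  # close to 0 for every one of the 10 samples
print(h.std(dim=1))   # close to 1 for every one of the 10 samples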
Advantages of Layer Normalization
- Works with Small Batches: It does not depend on the batch size, which makes it ideal for small batches or for reinforcement learning settings where each input is processed individually.
- Stabilizes Learning: It normalizes activations within a layer helping to prevent issues like exploding and vanishing gradients which ensures smoother and more efficient training.
- Independent of Batch Size: Unlike Batch Normalization, it is computed for each individual data point, which makes it more flexible and suitable for situations with variable batch sizes (see the short check after this list).
- Works Well with RNNs: It is useful in Recurrent Neural Networks to maintain stable gradient flow which helps in enhancing performance in sequential tasks.
- Faster Convergence: By normalizing activations it helps the model converge faster which allows for quicker training without the need for fine-tuning batch size or learning rate.
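As a small check of the batch-size independence mentioned above, the sketch below passes the same sample through nn.LayerNorm on its own and inside a larger batch; the tensor sizes are arbitrary, and the normalized output is identical in both cases because the statistics are computed per data point.
Python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer_norm = nn.LayerNorm(8)                      # 8 features, arbitrary choice

sample = torch.randn(1, 8)                        # a single data point
batch = torch.cat([sample, torch.randn(5, 8)])    # the same point inside a batch of 6

out_alone = layer_norm(sample)
out_in_batch = layer_norm(batch)[:1]

print(torch.allclose(out_alone, out_in_batch))    # True: the result does not depend on the batch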
Applications of Layer Normalization
Layer Normalization is commonly used in various deep learning architectures:
- Recurrent Neural Networks (RNNs): In RNNs, LSTMs and GRUs it is used to stabilize training as Batch Normalization struggles with sequential data. It normalizes activations at each time step to maintain stable gradient flow.
- Transformers: Models like BERT and GPT apply Layer Normalization after the attention layers to normalize activations, which improves the stability and efficiency of training (a minimal sketch of this pattern follows the list).
- Generative Models: In Generative Adversarial Networks it stabilizes the training of both the generator and discriminator networks.
- Speech Recognition: It is commonly applied in speech recognition systems to boost performance by normalizing the activations of the model's recurrent layers.
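To make the Transformer point concrete, the fragment below sketches the common residual-plus-LayerNorm pattern around a self-attention sub-layer; nn.MultiheadAttention and the chosen dimensions are illustrative assumptions, not a complete Transformer block.
Python
import torch
import torch.nn as nn

d_model, num_heads = 16, 4                        # illustrative sizes
attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
norm = nn.LayerNorm(d_model)

x = torch.randn(2, 5, d_model)                    # (batch, sequence length, features)
attn_out, _ = attn(x, x, x)                       # self-attention
x = norm(x + attn_out)                            # residual connection followed by LayerNorm
print(x.shape)                                    # torch.Size([2, 5, 16])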
Layer Normalization is effective in scenarios where Batch Normalization is not practical, such as with small batch sizes or sequential models like RNNs. It helps ensure a smoother and faster training process, leading to better performance across a wide range of applications.