
What is Layer Normalization?

Last Updated : 01 May, 2025

Layer Normalization stabilizes and accelerates the training process in deep learning. In a typical neural network, the activations of each layer can vary drastically, which leads to issues such as exploding or vanishing gradients that slow down training. Layer Normalization addresses this by normalizing the output of each layer, ensuring that the activations stay within a stable range.

It works by normalizing the activations of a layer so that, for each data point, their mean becomes 0 and their variance becomes 1. Unlike Batch Normalization, which normalizes each feature across all samples in the batch, Layer Normalization normalizes across the features of each individual data point.
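
To make the difference concrete, here is a minimal sketch (assuming PyTorch; the two sample vectors are just illustrative) showing which dimension each method computes its statistics over:

Python
import torch

# A small batch: 2 samples, 4 features each
x = torch.tensor([[3.0, 5.0, 2.0, 8.0],
                  [1.0, 3.0, 5.0, 8.0]])

# Layer Normalization statistics: one mean per sample, computed across its features
ln_mean = x.mean(dim=-1)   # tensor([4.5000, 4.2500])

# Batch Normalization statistics: one mean per feature, computed across the batch
bn_mean = x.mean(dim=0)    # tensor([2.0000, 4.0000, 3.5000, 8.0000])

print(ln_mean)
print(bn_mean)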

Working of Layer Normalization

Let's consider an example where we have three vectors:

  1. x_1 = [3.0, 5.0, 2.0, 8.0]
  2. x_2 = [1.0, 3.0, 5.0, 8.0]
  3. x_3 = [3.0, 2.0, 7.0, 9.0]

For each input x of the layer, Layer Normalization computes the following:

1. Compute Mean and Variance for Each Data Point

The mean and variance are calculated for each input, but across its features rather than across the batch (i.e. per data point):

\mu = \frac{1}{H} \sum_{i=1}^{H} x_i

\sigma^2 = \frac{1}{H} \sum_{i=1}^{H} (x_i - \mu)^2

where H is the number of features (neurons) in the layer, x_i is the i-th feature of the input, and \mu and \sigma^2 are the computed mean and variance.

Now, let's compute the mean and variance for each data point. For x_1 = [3.0, 5.0, 2.0, 8.0]:

  1. Mean (\mu_1): \mu_1 = \frac{1}{4} (3.0 + 5.0 + 2.0 + 8.0) = \frac{18.0}{4} = 4.5
  2. Variance (\sigma_1^2​): \sigma_1^2 = \frac{1}{4} \left[ (3.0 - 4.5)^2 + (5.0 - 4.5)^2 + (2.0 - 4.5)^2 + (8.0 - 4.5)^2 \right] = \frac{21.0}{4} = 5.25

Similarly, compute for x_2 = [1.0, 3.0, 5.0, 8.0]:

  1. Mean (\mu_2): \mu_2 = \frac{1}{4} (1.0 + 3.0 + 5.0 + 8.0) = \frac{17.0}{4} = 4.25
  2. Variance (\sigma_2^2): \sigma_2^2 = \frac{1}{4} \left[ (1.0 - 4.25)^2 + (3.0 - 4.25)^2 + (5.0 - 4.25)^2 + (8.0 - 4.25)^2 \right] = \frac{26.75}{4} = 6.6875

And for x_3 = [3.0, 2.0, 7.0, 9.0]:

  1. Mean (\mu_3): \mu_3 = \frac{1}{4} (3.0 + 2.0 + 7.0 + 9.0) = \frac{21.0}{4} = 5.25
  2. Variance (\sigma_3^2): \sigma_3^2 = \frac{1}{4} \left[ (3.0 - 5.25)^2 + (2.0 - 5.25)^2 + (7.0 - 5.25)^2 + (9.0 - 5.25)^2 \right] = \frac{32.75}{4} = 8.1875
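
These per-sample statistics can be checked with a short PyTorch sketch (a quick verification, not part of the original walkthrough); note the biased variance, i.e. division by H rather than H - 1:

Python
import torch

x = torch.tensor([[3.0, 5.0, 2.0, 8.0],
                  [1.0, 3.0, 5.0, 8.0],
                  [3.0, 2.0, 7.0, 9.0]])

# Mean and biased variance over the feature dimension: one value per data point
mu = x.mean(dim=-1)
var = x.var(dim=-1, unbiased=False)

print(mu)    # tensor([4.5000, 4.2500, 5.2500])
print(var)   # tensor([5.2500, 6.6875, 8.1875])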

2. Normalize the Input

Each feature is then normalized using the formula:

\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}

Here \epsilon is a small constant added for numerical stability.

Now, we will normalize each feature in each vector by subtracting the mean and dividing by the standard deviation (square root of the variance) with a small constant \epsilon = 1e-5 added for numerical stability.

For x_1 = [3.0, 5.0, 2.0, 8.0]:

x_1' = \left[ \frac{3.0 - 4.5}{\sqrt{5.25 + 1e-5}}, \frac{5.0 - 4.5}{\sqrt{5.25 + 1e-5}}, \frac{2.0 - 4.5}{\sqrt{5.25 + 1e-5}}, \frac{8.0 - 4.5}{\sqrt{5.25 + 1e-5}} \right]

We calculate each normalized value for x_1​:

x_1' = [-0.6547, 0.2182, -1.0911, 1.5275]

For x_2 = [1.0, 3.0, 5.0, 8.0]:

x_2' = \left[ \frac{1.0 - 4.25}{\sqrt{6.6875 + 1e-5}}, \frac{3.0 - 4.25}{\sqrt{6.6875 + 1e-5}}, \frac{5.0 - 4.25}{\sqrt{6.6875 + 1e-5}}, \frac{8.0 - 4.25}{\sqrt{6.6875 + 1e-5}} \right]

We calculate each normalized value for x_2:

x_2' = [-1.2568, -0.4834, 0.2900, 1.4501]

For x_3 = [3.0, 2.0, 7.0, 9.0]:

x_3' = \left[ \frac{3.0 - 5.25}{\sqrt{8.1875 + 1e-5}}, \frac{2.0 - 5.25}{\sqrt{8.1875 + 1e-5}}, \frac{7.0 - 5.25}{\sqrt{8.1875 + 1e-5}}, \frac{9.0 - 5.25}{\sqrt{8.1875 + 1e-5}} \right]

We calculate each normalized value for x_3:

x_3' = [-0.7863, -1.1358, 0.6116, 1.3106]
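
The normalization step can be reproduced with a few lines of PyTorch (a sketch using the three example vectors and \epsilon = 1e-5):

Python
import torch

x = torch.tensor([[3.0, 5.0, 2.0, 8.0],
                  [1.0, 3.0, 5.0, 8.0],
                  [3.0, 2.0, 7.0, 9.0]])
eps = 1e-5

# Per-sample mean and biased variance; keepdim=True so they broadcast over the features
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)

x_hat = (x - mu) / torch.sqrt(var + eps)
print(x_hat)
# tensor([[-0.6547,  0.2182, -1.0911,  1.5275],
#         [-1.2568, -0.4834,  0.2900,  1.4501],
#         [-0.7863, -1.1358,  0.6116,  1.3106]])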

3. Apply Scaling and Shifting

To ensure that the normalized activations can still represent a wide range of values, learnable parameters \gamma (scaling) and \beta (shifting) are introduced. The final output y is computed as:

y_i = \gamma \hat{x}_i + \beta

This allows the network to scale and shift the normalized activations during training.

Here let’s assume \gamma = 1.5 and \beta = 0.5. We can apply this scaling and shifting to the normalized values to get the final output for each vector.

  • For x_1: y_1 = [-0.4820, 0.8273, -1.1366, 2.7913]
  • For x_2: y_2 = [-1.3851, -0.2250, 0.9350, 2.6751]
  • For x_3: y_3 = [-0.6795, -1.2037, 1.4174, 2.4658]

These are the normalized values and the final outputs (rounded to four decimal places) after applying Layer Normalization.
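
As a cross-check, the whole worked example can be reproduced with PyTorch's nn.LayerNorm by setting its learnable parameters to the assumed \gamma = 1.5 and \beta = 0.5 (a sketch; in a real network these parameters are learned during training):

Python
import torch
import torch.nn as nn

x = torch.tensor([[3.0, 5.0, 2.0, 8.0],
                  [1.0, 3.0, 5.0, 8.0],
                  [3.0, 2.0, 7.0, 9.0]])

layer_norm = nn.LayerNorm(normalized_shape=4, eps=1e-5)
with torch.no_grad():
    layer_norm.weight.fill_(1.5)   # gamma
    layer_norm.bias.fill_(0.5)     # beta

y = layer_norm(x)
print(y)   # matches the hand-computed outputs above, up to rounding in the last decimal place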

Implementation of Layer Normalization in a Simple Neural Network with PyTorch

We will use the PyTorch library for the implementation.

  • nn.Linear(input_size, output_size): Creates a fully connected layer with the specified input and output dimensions.
  • nn.LayerNorm(128): Applies Layer Normalization over the 128 features of each sample.
  • forward(self, x): Defines the forward pass of the model, applying each transformation to the input x step by step.
  • torch.randn(10, 64): Generates a tensor of size (10, 64) filled with random values from a normal distribution.
  • torch.relu(x): Applies ReLU (Rectified Linear Unit) activation function element-wise to x.
Python
import torch
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self, input_size, output_size):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_size, 128)    # first fully connected layer
        self.layer_norm = nn.LayerNorm(128)      # normalizes the 128 activations of each sample
        self.fc2 = nn.Linear(128, output_size)   # output layer

    def forward(self, x):
        x = self.fc1(x)
        x = self.layer_norm(x)   # apply Layer Normalization before the activation
        x = torch.relu(x)
        x = self.fc2(x)
        return x

input_data = torch.randn(10, 64)   # 10 samples with 64 features each
model = SimpleNN(64, 10)
output = model(input_data)
print(output)

Output:

The script prints a tensor of shape (10, 10): one row of 10 raw output values for each of the 10 random input samples. The exact values vary from run to run because the inputs and weights are randomly initialized.

Advantages of Layer Normalization

  1. Works with Small Batches: It does not depend on the batch size, which makes it ideal for small batches and for settings such as reinforcement learning where inputs are often processed one at a time.
  2. Stabilizes Learning: It normalizes the activations within a layer, helping to prevent exploding and vanishing gradients and ensuring smoother, more efficient training.
  3. Independent of Batch Size: Unlike Batch Normalization, it is computed separately for each data point, which makes it more flexible and suitable for variable batch sizes.
  4. Works Well with RNNs: It helps Recurrent Neural Networks maintain stable gradient flow, improving performance on sequential tasks.
  5. Faster Convergence: By normalizing activations it helps the model converge faster, without the need to fine-tune the batch size or learning rate.

Applications of Layer Normalization

Layer Normalization is commonly used in various deep learning architectures:

  1. Recurrent Neural Networks (RNNs): In RNNs, LSTMs and GRUs it is used to stabilize training as Batch Normalization struggles with sequential data. It normalizes activations at each time step to maintain stable gradient flow.
  2. Transformers: Models like BERT and GPT apply Layer Normalization within each Transformer block (around the attention and feed-forward sub-layers), which improves the stability and efficiency of training.
  3. Generative Models: In Generative Adversarial Networks it stabilizes the training of both the generator and discriminator networks.
  4. Speech Recognition: It is commonly applied in speech recognition systems to boost performance by normalizing the activations of the model's recurrent layers.

Layer Normalization is effective in scenarios where Batch Normalization is not practical, such as small batch sizes or sequential models like RNNs. It helps ensure a smoother and faster training process, leading to better performance across a wide range of applications.

