
Positional Encoding in Transformers

Last Updated : 06 May, 2025

In natural language processing, the order of words is essential for understanding meaning in tasks like translation and text generation. Transformers process all tokens in parallel, which speeds up training, but this also means they do not naturally capture the order of tokens. Positional encoding addresses this by adding information about each token's position in the sequence, helping the model understand the order of, and relationships between, tokens. In this article, we will look at positional encoding and its core concepts.

Example of Positional Encoding

Suppose we have a Transformer model that translates English sentences into French and is given the sentence:

"The cat sat on the mat."

Before the sentence is fed into the Transformer model, it is tokenized, i.e. each word is converted into a token. Let's assume the tokens for this sentence are:

["The", "cat" , "sat", "on", "the" ,"mat"]

Each token is then mapped to a high-dimensional vector representation through an embedding layer. These embeddings encode semantic information about the words in the sentence, but they carry no information about the order of the words.

\text{Embeddings} = \{E_1, E_2, E_3, E_4, E_5, E_6\}

where each E_i is a 4-dimensional vector. This is where positional encoding plays an important role: positional encodings are added to the word embeddings, assigning each token a unique representation based on its position and helping the model understand the order of words in the sequence.
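As a quick sketch of this embedding step (the tiny vocabulary size and hand-assigned token ids below are made up purely for illustration), an embedding layer simply maps each token id to a learned 4-dimensional vector:

Python
import tensorflow as tf

# Illustrative only: hand-assigned token ids for ["The", "cat", "sat", "on", "the", "mat"]
token_ids = tf.constant([[1, 2, 3, 4, 5, 6]])

# A tiny embedding layer producing 4-dimensional vectors (weights are random at initialization)
embedding_layer = tf.keras.layers.Embedding(input_dim=10, output_dim=4)

embeddings = embedding_layer(token_ids)
print(embeddings.shape)   # (1, 6, 4) -> one 4-dimensional vector E1..E6 per token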

Calculating Positional Encodings

  • Let's say the embedding dimensionality is 4 for simplicity.
  • We'll use sine and cosine functions to generate positional encodings. Consider the following positional encodings for the tokens in our example sentence:

\text{PE}(1) = \left[\sin\left(\frac{1}{10000^{2 \times 0/4}}\right), \cos\left(\frac{1}{10000^{2 \times 0/4}}\right), \sin\left(\frac{1}{10000^{2 \times 1/4}}\right), \cos\left(\frac{1}{10000^{2 \times 1/4}}\right)\right] \\ \text{PE}(2) = \left[\sin\left(\frac{2}{10000^{2 \times 0/4}}\right), \cos\left(\frac{2}{10000^{2 \times 0/4}}\right), \sin\left(\frac{2}{10000^{2 \times 1/4}}\right), \cos\left(\frac{2}{10000^{2 \times 1/4}}\right)\right] \\ \dots \\ \text{PE}(6) = \left[\sin\left(\frac{6}{10000^{2 \times 0/4}}\right), \cos\left(\frac{6}{10000^{2 \times 0/4}}\right), \sin\left(\frac{6}{10000^{2 \times 1/4}}\right), \cos\left(\frac{6}{10000^{2 \times 1/4}}\right)\right]

  • These positional encodings are added element-wise to the word embeddings. The resulting vectors contain both semantic and positional information, which allows the Transformer model to understand not only the meaning of each word but also its position in the sequence (see the sketch below).
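A minimal NumPy sketch of this calculation for the example above (positions 1 to 6, encoding dimension 4; the word embeddings are random placeholders) could look as follows:

Python
import numpy as np

d_model = 4
positions = np.arange(1, 7)[:, np.newaxis]                    # positions 1..6 as a column vector

# PE(p, 2i) = sin(p / 10000^(2i/d_model)),  PE(p, 2i+1) = cos(p / 10000^(2i/d_model))
i = np.arange(d_model)[np.newaxis, :] // 2
angles = positions / np.power(10000.0, 2 * i / d_model)
pe = np.where(np.arange(d_model) % 2 == 0, np.sin(angles), np.cos(angles))

word_embeddings = np.random.randn(6, d_model)                 # placeholder embeddings E1..E6
encoded = word_embeddings + pe                                # element-wise addition
print(pe.round(3))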

Positional Encoding Layer in Transformers

The positional encoding layer is important in a Transformer because it provides positional information to the model. Since Transformers process sequences in parallel and have no built-in understanding of token order, this layer helps the model capture the structure of the sequence.

Using a mathematical formula, it generates a unique representation for each position in the sequence, which allows the Transformer to understand token order while still processing tokens in parallel.

Formula for Positional Encoding: For each position p in the sequence and for each pair of dimensions 2i and 2i+1 in the encoding vector:

  1. Even-indexed dimensions: PE(p, 2i) = \sin\left(\frac{p}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
  2. Odd-indexed dimensions: PE(p, 2i+1) = \cos\left(\frac{p}{10000^{\frac{2i}{d_{\text{model}}}}}\right)

These formulas use sine and cosine functions to create wave-like patterns that change across sequence positions. Using sine for even indices and cosine for odd indices gives a combination of features that can effectively represent positional information across different sequence lengths.
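As a quick numeric check of these formulas (the position and dimensionality are chosen only for illustration), take position p = 3 with d_model = 4:

Python
import numpy as np

p, d_model = 3, 4
pe = np.zeros(d_model)
for i in range(d_model // 2):
    pe[2 * i] = np.sin(p / 10000 ** (2 * i / d_model))        # even index: sine
    pe[2 * i + 1] = np.cos(p / 10000 ** (2 * i / d_model))    # odd index: cosine

print(pe.round(4))   # approximately [0.1411, -0.99, 0.03, 0.9996]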

Implementation of Positional Encoding in Transformers

Here we will be using the NumPy and TensorFlow libraries for the implementation.

  • angle_rads = np.arange(position)[:, np.newaxis] / np.power(10000, (2 * (np.arange(d_model)[np.newaxis, :] // 2)) / np.float32(d_model)) : Calculates the angle values based on position and model dimension.
  • position = 50, d_model = 512 : Set the sequence length (number of positions) and the dimensionality of the model, respectively.
Python
import numpy as np
import tensorflow as tf

def positional_encoding(position, d_model):
    # Angle for every (position, dimension) pair: pos / 10000^(2i / d_model)
    angle_rads = np.arange(position)[:, np.newaxis] / np.power(
        10000, (2 * (np.arange(d_model)[np.newaxis, :] // 2)) / np.float32(d_model))

    # Apply sine to even indices and cosine to odd indices of the encoding
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

    # Add a batch dimension so the encoding can be broadcast over a batch of embeddings
    pos_encoding = angle_rads[np.newaxis, ...]

    return tf.cast(pos_encoding, dtype=tf.float32)

position = 50   # sequence length (number of positions)
d_model = 512   # model (embedding) dimensionality
pos_encoding = positional_encoding(position, d_model)

print(pos_encoding.shape)

Output:

(1, 50, 512)

The output is the shape of the positional encodings generated by the positional_encoding function for a sequence of length 50 and a model dimensionality of 512. The leading 1 is a batch dimension; each row of the encoding corresponds to a position in the sequence and each column to a dimension of the model.
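As a rough usage sketch (continuing from the code above; the vocabulary size and random token ids are placeholders, not part of the original example), the returned encoding is typically broadcast-added to a batch of token embeddings:

Python
# Illustrative usage: add the encoding to a batch of token embeddings
vocab_size = 1000                                             # placeholder vocabulary size
token_ids = tf.random.uniform((2, position), maxval=vocab_size, dtype=tf.int32)
embedding_layer = tf.keras.layers.Embedding(vocab_size, d_model)

token_embeddings = embedding_layer(token_ids)                 # shape (2, 50, 512)
output = token_embeddings + pos_encoding                      # broadcasts over the batch dimension
print(output.shape)                                           # (2, 50, 512)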

Importance of Positional Encoding

Positional encodings are important in Transformer models for several reasons:

  1. Contextual Understanding: In natural language, the meaning of a word depends on its position in the sentence, and positional encoding helps the model distinguish these differences.
  2. Better Generalization: It allows Transformer models to handle input sequences of different lengths, which makes them more flexible for tasks like document summarization or question answering.
  3. Preventing Symmetry Issues: Without positional encoding, the model would treat identical tokens at different positions as the same. Positional encoding gives tokens at different positions distinct representations, which improves the model's ability to capture long-range dependencies.

By adding positional information, positional encodings allow Transformer models to understand the relationships and order of tokens, ensuring they can handle sequential data while still processing it in parallel.

