
An Introduction to Transformers

https://arxiv.org/pdf/2304.10557

Input to Transformer

A sequence of tokens is a generic representation to use as an input – many different types of data can be
“tokenised” and transformers are then immediately applicable.

The representation of the input sequence is then produced by iteratively applying a transformer block to this array of token embeddings.
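As a concrete illustration (a sketch, not taken from the paper), a sequence of N token indices can be mapped to a D × N array of embeddings; this array is what the transformer block repeatedly refines. The embedding matrix E below is a random stand-in for learned parameters.

% A minimal sketch: turning a token sequence into a D-by-N array of embeddings.
D = 8;                        % embedding dimension (arbitrary choice for illustration)
vocabSize = 10;               % number of distinct tokens (assumed)
E = randn(D,vocabSize);       % stand-in for a learned embedding matrix
tokens = [3 7 7 1 9];         % an example sequence of N = 5 token indices
X0 = E(:,tokens);             % column n holds the embedding of token n (D-by-N)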

Stage 1: self-attention across the sequence

Attention

The output of the attention stage at location n is a weighted combination of the features at every location in the sequence,

    y_n = Σ_{n'=1}^{N} A_{n,n'} x_{n'}    (2)

where A is the N × N self-attention matrix, normalised so that the weights used for each output sum to one, Σ_{n'} A_{n,n'} = 1.

Where does it come from?

The neat idea in the first stage of the transformer is that the attention matrix is generated from the input sequence itself – so-called self-attention. A simple way of generating the attention matrix from the input would be to measure the similarity between two locations by the dot product between the features at those two locations, and then use a softmax function to handle the normalisation:

    A_{n,n'} = exp(x_n^T x_{n'}) / Σ_{n''=1}^{N} exp(x_n^T x_{n''})

However, this naïve approach entangles information about the similarity between locations in the sequence with the content of the sequence itself: words with similar embeddings score highly with one another, rather than the score reflecting a learned association between words.

An alternative is to perform the same operation on a linear transformation of the sequence, so that

    A_{n,n'} = exp((U x_n)^T (U x_{n'})) / Σ_{n''=1}^{N} exp((U x_n)^T (U x_{n''}))

Typically, U will project to a lower-dimensional space, i.e. U is K × D with K < D.

In this way only some of the features in the input sequence need be used to compute the similarity, the
others being projected out, thereby decoupling the attention computation from the content. However, the
numerator in this construction is symmetric. This could be a disadvantage. For example, we might want the
word ‘caulking iron’ to be strongly associated with the word ‘tool’ (as it is a type of tool), but have the word
‘tool’ more weakly associated with the word ‘caulking iron’ (because most of us rarely encounter it).

Fortunately, it is simple to generalise the attention mechanism above to be asymmetric, by applying two different linear transformations to the original sequence:

    A_{n,n'} = exp((U_q x_n)^T (U_k x_{n'})) / Σ_{n''=1}^{N} exp((U_q x_n)^T (U_k x_{n''}))    (3)

The two quantities that are dot-producted together here, q_n = U_q x_n and k_{n'} = U_k x_{n'}, are typically known as the queries and the keys, respectively. Together, equations 2 and 3 define the self-attention mechanism. Notice that the K × D matrices U_q and U_k are the only parameters of this mechanism.
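As a concrete illustration, here is a minimal MATLAB sketch of equations 2 and 3 for a single head (no scaling, masking, or value projection; Uq and Uk are random stand-ins for the learned parameters):

% Minimal single-head self-attention sketch of equations (2) and (3).
% X is D-by-N (one column per sequence location); Uq and Uk are K-by-D stand-ins
% for the learned query and key projections.
D = 8; N = 5; K = 4;
X  = randn(D,N);
Uq = randn(K,D);
Uk = randn(K,D);
Q  = Uq*X;                           % queries, K-by-N
Km = Uk*X;                           % keys,    K-by-N
S  = Q'*Km;                          % S(n,n') = q_n . k_n'  (N-by-N scores)
A  = exp(S) ./ sum(exp(S),2);        % softmax over n', so each row of A sums to one (eq. 3)
Y  = X*A';                           % y_n = sum over n' of A(n,n') x_n'  (eq. 2), D-by-N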

Multi-Head Self-Attention (MHSA)

In practice, transformers apply H such attention operations ("heads") in parallel, each with its own projection matrices, and combine the heads' outputs linearly; this lets different heads attend to different relationships in the sequence.
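A rough MATLAB sketch of multi-head self-attention, continuing the single-head sketch above. The per-head value projections Uvh and the output mixing matrix W follow the standard (Vaswani-style) formulation and are random stand-ins for learned parameters:

% Multi-head sketch: H heads, each with its own projections, combined linearly.
H  = 2; Kh = 4;                                               % number of heads and per-head width
Yh = cell(1,H);
for h = 1:H
    Uqh = randn(Kh,D); Ukh = randn(Kh,D); Uvh = randn(Kh,D);  % per-head projections (stand-ins)
    Sh  = (Uqh*X)'*(Ukh*X);                                   % N-by-N similarity scores for head h
    Ah  = exp(Sh) ./ sum(exp(Sh),2);                          % row-normalised softmax, as in eq. 3
    Yh{h} = (Uvh*X)*Ah';                                      % Kh-by-N output of head h
end
W   = randn(D,H*Kh);                                          % stand-in for the learned head-mixing matrix
Ymh = W*vertcat(Yh{:});                                       % combine heads back to a D-by-N array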

Stage 2: multi-layer perceptron across features

The second stage of processing in the transformer block operates across features, refining the representation using a non-linear transform. To do this, we simply apply a multi-layer perceptron (MLP) to the vector of features at each location n in the sequence,

    x_n ← MLP_θ(y_n)

Notice that the parameters of the MLP, θ, are the same for each location n.
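Continuing the sketch, the second stage applies the same MLP weights to every column of the attention output Y. A single hidden layer with a ReLU is assumed here; W1, b1, W2, b2 are random stand-ins for learned parameters:

% Position-wise MLP sketch: the same weights are applied at every location n.
hiddenSize = 16;                                      % MLP hidden width (arbitrary)
W1 = randn(hiddenSize,D); b1 = zeros(hiddenSize,1);   % stand-ins for learned parameters
W2 = randn(D,hiddenSize); b2 = zeros(D,1);
Z  = W2*max(W1*Y + b1, 0) + b2;                       % ReLU MLP applied to every column; Z is D-by-N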

The Transformer Block:


Putting it all together with residual connections and layer normalisation
We can now stack MHSA and MLP layers to produce the transformer block. Rather than doing this directly,
we make use of two ubiquitous transformations to produce a more stable model that trains more easily:
Residual connections and normalisation.

Residual connections.

Rather than letting each stage output an entirely new representation, its result is added to its input, x ← x + res(x), so that the stage only has to learn a refinement of the current representation. This is found to make deep models far easier to train.

Token normalisation.

Transformers perform layer normalisation, which normalises the mean and standard deviation of each individual token in each sequence in the batch. Batch normalisation, which normalises over the feature and batch dimensions together, is found to be far less stable [Shen et al., 2020]. Other flavours of normalisation are possible and potentially under-explored; e.g. instance normalisation would normalise across the sequence dimension instead.
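Putting the two stages together with residuals and per-token normalisation, a post-norm block (the attention → add → layer norm → MLP → add → layer norm ordering also used in the MATLAB encoder example later in this document) can be sketched by reusing A, W1, b1, W2 and b2 from above; layernorm here is a hand-rolled helper with the learnable gain and offset omitted:

% One post-norm transformer block applied to X (D-by-N), reusing the pieces above.
layernorm = @(Z) (Z - mean(Z,1)) ./ std(Z,0,1);     % normalise each token (column); gain/offset omitted
Y1 = layernorm(X + X*A');                           % self-attention stage + residual + layer norm
Z1 = layernorm(Y1 + (W2*max(W1*Y1 + b1,0) + b2));   % MLP stage + residual + layer norm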

Position encoding
Positional information is key in many problems, and the transformer as described so far has thrown it out: the self-attention and MLP stages make no use of the order of the tokens. Thankfully, there is a simple fix: the location of each token within the original sequence should be included in the token itself, or through the way it is processed.

The position information can be fixed, for example by adding a vector of sinusoids of different frequencies and phases that encodes the position of a word in a sentence [Vaswani et al., 2017], or it can be a free parameter which is learned [Devlin et al., 2019], as is often done in image transformers. There are also approaches that include relative distance information between pairs of tokens by modifying the self-attention mechanism [Wu et al., 2021], which connects to equivariant transformers.
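As an illustration, here is a minimal MATLAB sketch of the fixed sinusoidal encoding of Vaswani et al., producing a D × N array P that is added to the token embeddings (the frequency base of 10000 follows the original paper; X0 is the embedding array from the earlier sketch, and D is assumed to be even):

% Sinusoidal position encoding: P is D-by-N and is added to the token embeddings.
[D,N] = size(X0);                          % match the embedding array from the earlier sketch
P = zeros(D,N);
pos = 0:N-1;                               % token positions
for i = 0:D/2-1
    freq = 1/10000^(2*i/D);                % geometric progression of frequencies
    P(2*i+1,:) = sin(freq*pos);            % odd rows: sine at this frequency
    P(2*i+2,:) = cos(freq*pos);            % even rows: cosine at this frequency
end
Xpos = X0 + P;                             % position-aware input to the first transformer block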

Building a Transformer from Scratch in MATLAB

You can use selfAttentionLayer to build the encoder from layers.

The general structure of an intermediate encoder block looks like this:

selfAttentionLayer(numHeads,numKeyChannels)      % self-attention
additionLayer(2,Name="attention_add")            % residual connection around attention
layerNormalizationLayer(Name="attention_norm")   % layer norm
fullyConnectedLayer(feedforwardHiddenSize)       % feedforward part 1
reluLayer                                        % nonlinear activation
fullyConnectedLayer(attentionHiddenSize)         % feedforward part 2
additionLayer(2,Name="feedforward_add")          % residual connection around feedforward
layerNormalizationLayer()                        % layer norm

You would need to hook up the connections to the addition layers appropriately.

Typically you would have multiple copies of this encoder block in a transformer encoder.

You also typically need an embedding at the start of the model. For text data it's common to use wordEmbeddingLayer, whereas for image data you would use patchEmbeddingLayer.

Also, the encoder block above makes no use of positional information, so if your training task requires positional information, you would typically inject it via a positionEmbeddingLayer or sinusoidalPositionEncodingLayer.

Finally, the last encoder block typically feeds into a model "head" that maps the encoder output back to the dimensions of the training targets. Often this can just be one or more simple fullyConnectedLayer layers.

Note that for both image and sequence input data the output of the encoder is still an image or sequence, so for image classification and sequence-to-one tasks you need some way to map that sequence of encoder outputs to a fixed-size representation. For this you could use indexing1dLayer or pooling layers such as globalMaxPooling1dLayer.

% Create model
% We will use 2 encoder blocks.
numHeads = 1;
numKeyChannels = 20;
feedforwardHiddenSize = 100;
modelHiddenSize = 20;
% Since the values in the sequence can be 1, 2, ..., 10, the "vocabulary" size is 10.
vocabSize = 10;
inputSize = 1;
encoderLayers = [
    sequenceInputLayer(1,Name="in")                                % input
    wordEmbeddingLayer(modelHiddenSize,vocabSize,Name="embedding") % token embedding
    positionEmbeddingLayer(modelHiddenSize,vocabSize)              % position embedding
    additionLayer(2,Name="embed_add")                              % add the data and position embeddings
    selfAttentionLayer(numHeads,numKeyChannels)                    % encoder block 1: self-attention
    additionLayer(2,Name="attention_add")                          % residual connection around attention
    layerNormalizationLayer(Name="attention_norm")                 % layer norm
    fullyConnectedLayer(feedforwardHiddenSize)                     % feedforward part 1
    reluLayer                                                      % nonlinear activation
    fullyConnectedLayer(modelHiddenSize)                           % feedforward part 2
    additionLayer(2,Name="feedforward_add")                        % residual connection around feedforward
    layerNormalizationLayer(Name="encoder1_out")                   % layer norm
    selfAttentionLayer(numHeads,numKeyChannels)                    % encoder block 2: self-attention
    additionLayer(2,Name="attention2_add")                         % residual connection around attention
    layerNormalizationLayer(Name="attention2_norm")                % layer norm
    fullyConnectedLayer(feedforwardHiddenSize)                     % feedforward part 1
    reluLayer                                                      % nonlinear activation
    fullyConnectedLayer(modelHiddenSize)                           % feedforward part 2
    additionLayer(2,Name="feedforward2_add")                       % residual connection around feedforward
    layerNormalizationLayer()                                      % layer norm
    indexing1dLayer                                                % select a single sequence element (sequence-to-one)
    fullyConnectedLayer(inputSize)];                               % output head
net = dlnetwork(encoderLayers,Initialize=false);
% Hook up the second inputs of the addition layers (embedding add and residual connections).
net = connectLayers(net,"embed_add","attention_add/in2");
net = connectLayers(net,"embedding","embed_add/in2");
net = connectLayers(net,"attention_norm","feedforward_add/in2");
net = connectLayers(net,"encoder1_out","attention2_add/in2");
net = connectLayers(net,"attention2_norm","feedforward2_add/in2");
net = initialize(net);
% Analyze the network to see how data flows through it.
analyzeNetwork(net)
% Create toy training data: 10,000 sequences of length 10
% with values that are random integers 1-10.
numObs = 10000;
seqLen = 10;
x = randi([1,10],[seqLen,numObs]);
% Loop over observations to create y(i) = x(x(1,i),i) + x(x(2,i),i)
y = zeros(numObs,1);
for i = 1:numObs
    idx = x(1:2,i);
    y(i) = sum(x(idx,i));
end
x = num2cell(x,1);
% Specify training options and train
opts = trainingOptions("adam", ...
    MaxEpochs=200, ...
    MiniBatchSize=numObs/10, ...
    Plots="training-progress", ...
    Shuffle="every-epoch", ...
    InitialLearnRate=1e-2, ...
    LearnRateDropFactor=0.9, ...
    LearnRateDropPeriod=10, ...
    LearnRateSchedule="piecewise");
net = trainnet(x,y,net,"mse",opts);
% Test the network on a new input
x = randi([1,10],[seqLen,1]);
ypred = predict(net,x)
yact = x(x(1)) + x(x(2))

https://in.mathworks.com/matlabcentral/answers/2014811-is-there-any-documentation-on-how-to-build-a-transformer-encoder-from-scratch-in-matlab

https://in.mathworks.com/help/deeplearning/ug/train-vision-transformer-network-for-image-classification.html

