
An Introduction to Transformers

https://arxiv.org/pdf/2304.10557

Input to Transformer

A sequence of tokens is a generic representation to use as an input – many different types of data can be
“tokenised” and transformers are then immediately applicable.

The representation of the input sequence is then produced by iteratively applying a transformer block to this array of token embeddings.
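As a concrete illustration (a sketch, not taken from the paper), a sequence of N token indices can be mapped to a D × N array of embeddings; this array is what the transformer block repeatedly refines. The embedding matrix E below is a random stand-in for learned parameters.

% A minimal sketch: turning a token sequence into a D-by-N array of embeddings.
D = 8;                        % embedding dimension (arbitrary choice for illustration)
vocabSize = 10;               % number of distinct tokens (assumed)
E = randn(D,vocabSize);       % stand-in for a learned embedding matrix
tokens = [3 7 7 1 9];         % an example sequence of N = 5 token indices
X0 = E(:,tokens);             % column n holds the embedding of token n (D-by-N)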

Stage 1: self-attention across the sequence

Attention

The output of the attention stage at location n is a weighted combination of the features at every location in the sequence,

    y_n = Σ_{n'=1}^{N} A_{n,n'} x_{n'}    (2)

where A is the N × N self-attention matrix, normalised so that the weights used for each output sum to one, Σ_{n'} A_{n,n'} = 1.

Where does it come from?

The neat idea in the first stage of the transformer is that the attention matrix is generated from the input sequence itself – so-called self-attention. A simple way of generating the attention matrix from the input would be to measure the similarity between two locations by the dot product between the features at those two locations, and then use a softmax function to handle the normalisation:

    A_{n,n'} = exp(x_n^T x_{n'}) / Σ_{n''=1}^{N} exp(x_n^T x_{n''})

However, this naïve approach entangles information about the similarity between locations in the sequence with the content of the sequence itself: words with similar embeddings score highly with one another, rather than the score reflecting a learned association between words.

An alternative is to perform the same operation on a linear transformation of the sequence, so that

    A_{n,n'} = exp((U x_n)^T (U x_{n'})) / Σ_{n''=1}^{N} exp((U x_n)^T (U x_{n''}))

Typically, U will project to a lower-dimensional space, i.e. U is K × D with K < D.

In this way only some of the features in the input sequence need be used to compute the similarity, the
others being projected out, thereby decoupling the attention computation from the content. However, the
numerator in this construction is symmetric. This could be a disadvantage. For example, we might want the
word ‘caulking iron’ to be strongly associated with the word ‘tool’ (as it is a type of tool), but have the word
‘tool’ more weakly associated with the word ‘caulking iron’ (because most of us rarely encounter it).

Fortunately, it is simple to generalise the attention mechanism above to be asymmetric, by applying two different linear transformations to the original sequence:

    A_{n,n'} = exp((U_q x_n)^T (U_k x_{n'})) / Σ_{n''=1}^{N} exp((U_q x_n)^T (U_k x_{n''}))    (3)

The two quantities that are dot-producted together here, q_n = U_q x_n and k_{n'} = U_k x_{n'}, are typically known as the queries and the keys, respectively. Together, equations 2 and 3 define the self-attention mechanism. Notice that the K × D matrices U_q and U_k are the only parameters of this mechanism.
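As a concrete illustration, here is a minimal MATLAB sketch of equations 2 and 3 for a single head (no scaling, masking, or value projection; Uq and Uk are random stand-ins for the learned parameters):

% Minimal single-head self-attention sketch of equations (2) and (3).
% X is D-by-N (one column per sequence location); Uq and Uk are K-by-D stand-ins
% for the learned query and key projections.
D = 8; N = 5; K = 4;
X  = randn(D,N);
Uq = randn(K,D);
Uk = randn(K,D);
Q  = Uq*X;                           % queries, K-by-N
Km = Uk*X;                           % keys,    K-by-N
S  = Q'*Km;                          % S(n,n') = q_n . k_n'  (N-by-N scores)
A  = exp(S) ./ sum(exp(S),2);        % softmax over n', so each row of A sums to one (eq. 3)
Y  = X*A';                           % y_n = sum over n' of A(n,n') x_n'  (eq. 2), D-by-N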

Multi-Head Self-Attention (MHSA)

In practice, transformers apply H such attention operations ("heads") in parallel, each with its own projection matrices, and combine the heads' outputs linearly; this lets different heads attend to different relationships in the sequence.
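A rough MATLAB sketch of multi-head self-attention, continuing the single-head sketch above. The per-head value projections Uvh and the output mixing matrix W follow the standard (Vaswani-style) formulation and are random stand-ins for learned parameters:

% Multi-head sketch: H heads, each with its own projections, combined linearly.
H  = 2; Kh = 4;                                               % number of heads and per-head width
Yh = cell(1,H);
for h = 1:H
    Uqh = randn(Kh,D); Ukh = randn(Kh,D); Uvh = randn(Kh,D);  % per-head projections (stand-ins)
    Sh  = (Uqh*X)'*(Ukh*X);                                   % N-by-N similarity scores for head h
    Ah  = exp(Sh) ./ sum(exp(Sh),2);                          % row-normalised softmax, as in eq. 3
    Yh{h} = (Uvh*X)*Ah';                                      % Kh-by-N output of head h
end
W   = randn(D,H*Kh);                                          % stand-in for the learned head-mixing matrix
Ymh = W*vertcat(Yh{:});                                       % combine heads back to a D-by-N array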

Stage 2: multi-layer perceptron across features

The second stage of processing in the transformer block operates across features, refining the representation using a non-linear transform. To do this, we simply apply a multi-layer perceptron (MLP) to the vector of features at each location n in the sequence,

    x_n ← MLP_θ(y_n)

Notice that the parameters of the MLP, θ, are the same for each location n.
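Continuing the sketch, the second stage applies the same MLP weights to every column of the attention output Y. A single hidden layer with a ReLU is assumed here; W1, b1, W2, b2 are random stand-ins for learned parameters:

% Position-wise MLP sketch: the same weights are applied at every location n.
hiddenSize = 16;                                      % MLP hidden width (arbitrary)
W1 = randn(hiddenSize,D); b1 = zeros(hiddenSize,1);   % stand-ins for learned parameters
W2 = randn(D,hiddenSize); b2 = zeros(D,1);
Z  = W2*max(W1*Y + b1, 0) + b2;                       % ReLU MLP applied to every column; Z is D-by-N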

The Transformer Block:


Putting it all together with residual connections and layer normalisation
We can now stack MHSA and MLP layers to produce the transformer block. Rather than doing this directly,
we make use of two ubiquitous transformations to produce a more stable model that trains more easily:
Residual connections and normalisation.

Residual connections.

Rather than letting each stage output an entirely new representation, its result is added to its input, x ← x + res(x), so that the stage only has to learn a refinement of the current representation. This is found to make deep models far easier to train.

Token normalisation.

Transformers perform layer normalisation, which normalises the mean and standard deviation of each individual token in each sequence in the batch. Batch normalisation, which normalises over the feature and batch dimensions together, is found to be far less stable [Shen et al., 2020]. Other flavours of normalisation are possible and potentially under-explored; e.g. instance normalisation would normalise across the sequence dimension instead.
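Putting the two stages together with residuals and per-token normalisation, a post-norm block (the attention → add → layer norm → MLP → add → layer norm ordering also used in the MATLAB encoder example later in this document) can be sketched by reusing A, W1, b1, W2 and b2 from above; layernorm here is a hand-rolled helper with the learnable gain and offset omitted:

% One post-norm transformer block applied to X (D-by-N), reusing the pieces above.
layernorm = @(Z) (Z - mean(Z,1)) ./ std(Z,0,1);     % normalise each token (column); gain/offset omitted
Y1 = layernorm(X + X*A');                           % self-attention stage + residual + layer norm
Z1 = layernorm(Y1 + (W2*max(W1*Y1 + b1,0) + b2));   % MLP stage + residual + layer norm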

Position encoding
Positional information is key in many problems, and the transformer as described so far has thrown it out: the self-attention and MLP stages make no use of the order of the tokens. Thankfully, there is a simple fix: the location of each token within the original sequence should be included in the token itself, or through the way it is processed.

The position information can be fixed, for example by adding a vector of sinusoids of different frequencies and phases that encodes the position of a word in a sentence [Vaswani et al., 2017], or it can be a free parameter which is learned [Devlin et al., 2019], as is often done in image transformers. There are also approaches that include relative distance information between pairs of tokens by modifying the self-attention mechanism [Wu et al., 2021], which connects to equivariant transformers.
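As an illustration, here is a minimal MATLAB sketch of the fixed sinusoidal encoding of Vaswani et al., producing a D × N array P that is added to the token embeddings (the frequency base of 10000 follows the original paper; X0 is the embedding array from the earlier sketch, and D is assumed to be even):

% Sinusoidal position encoding: P is D-by-N and is added to the token embeddings.
[D,N] = size(X0);                          % match the embedding array from the earlier sketch
P = zeros(D,N);
pos = 0:N-1;                               % token positions
for i = 0:D/2-1
    freq = 1/10000^(2*i/D);                % geometric progression of frequencies
    P(2*i+1,:) = sin(freq*pos);            % odd rows: sine at this frequency
    P(2*i+2,:) = cos(freq*pos);            % even rows: cosine at this frequency
end
Xpos = X0 + P;                             % position-aware input to the first transformer block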

Building a Transformer from Scratch in MATLAB

You can use selfAttentionLayer to build the encoder from layers.

The general structure of an intermediate encoder block looks like this:

selfAttentionLayer(numHeads,numKeyChannels)      % self-attention
additionLayer(2,Name="attention_add")            % residual connection around attention
layerNormalizationLayer(Name="attention_norm")   % layer norm
fullyConnectedLayer(feedforwardHiddenSize)       % feedforward part 1
reluLayer                                        % nonlinear activation
fullyConnectedLayer(attentionHiddenSize)         % feedforward part 2
additionLayer(2,Name="feedforward_add")          % residual connection around feedforward
layerNormalizationLayer()                        % layer norm

You would need to hook up the connections to the addition layers appropriately.

Typically you would have multiple copies of this encoder block in a transformer encoder.

You also typically need an embedding at the start of the model. For text data it's common to use wordEmbeddingLayer, whereas for image data you would use patchEmbeddingLayer.

Also, the encoder block above makes no use of positional information, so if your training task requires positional information, you would typically inject it via a positionEmbeddingLayer or sinusoidalPositionEncodingLayer.

Finally, the last encoder block typically feeds into a model "head" that maps the encoder output back to the dimensions of the training targets. Often this can just be one or more simple fullyConnectedLayer layers.

Note that for both image and sequence input data the output of the encoder is still an image or sequence, so for image classification and sequence-to-one tasks you need some way to map that sequence of encoder outputs to a fixed-size representation. For this you could use indexing1dLayer or pooling layers such as globalMaxPooling1dLayer.

% Create model
% We will use 2 encoder blocks.
numHeads = 1;
numKeyChannels = 20;
feedforwardHiddenSize = 100;
modelHiddenSize = 20;
% Since the values in the sequence can be 1, 2, ..., 10, the "vocabulary" size is 10.
vocabSize = 10;
inputSize = 1;
encoderLayers = [
    sequenceInputLayer(1,Name="in")                                % input
    wordEmbeddingLayer(modelHiddenSize,vocabSize,Name="embedding") % token embedding
    positionEmbeddingLayer(modelHiddenSize,vocabSize)              % position embedding
    additionLayer(2,Name="embed_add")                              % add the data and position embeddings
    selfAttentionLayer(numHeads,numKeyChannels)                    % encoder block 1: self-attention
    additionLayer(2,Name="attention_add")                          % residual connection around attention
    layerNormalizationLayer(Name="attention_norm")                 % layer norm
    fullyConnectedLayer(feedforwardHiddenSize)                     % feedforward part 1
    reluLayer                                                      % nonlinear activation
    fullyConnectedLayer(modelHiddenSize)                           % feedforward part 2
    additionLayer(2,Name="feedforward_add")                        % residual connection around feedforward
    layerNormalizationLayer(Name="encoder1_out")                   % layer norm
    selfAttentionLayer(numHeads,numKeyChannels)                    % encoder block 2: self-attention
    additionLayer(2,Name="attention2_add")                         % residual connection around attention
    layerNormalizationLayer(Name="attention2_norm")                % layer norm
    fullyConnectedLayer(feedforwardHiddenSize)                     % feedforward part 1
    reluLayer                                                      % nonlinear activation
    fullyConnectedLayer(modelHiddenSize)                           % feedforward part 2
    additionLayer(2,Name="feedforward2_add")                       % residual connection around feedforward
    layerNormalizationLayer()                                      % layer norm
    indexing1dLayer                                                % select a single sequence element (sequence-to-one)
    fullyConnectedLayer(inputSize)];                               % output head
net = dlnetwork(encoderLayers,Initialize=false);
% Hook up the second inputs of the addition layers (embedding add and residual connections).
net = connectLayers(net,"embed_add","attention_add/in2");
net = connectLayers(net,"embedding","embed_add/in2");
net = connectLayers(net,"attention_norm","feedforward_add/in2");
net = connectLayers(net,"encoder1_out","attention2_add/in2");
net = connectLayers(net,"attention2_norm","feedforward2_add/in2");
net = initialize(net);
% Analyze the network to see how data flows through it.
analyzeNetwork(net)
% Create toy training data: 10,000 sequences of length 10
% with values that are random integers 1-10.
numObs = 10000;
seqLen = 10;
x = randi([1,10],[seqLen,numObs]);
% Loop over observations to create y(i) = x(x(1,i),i) + x(x(2,i),i)
y = zeros(numObs,1);
for i = 1:numObs
    idx = x(1:2,i);
    y(i) = sum(x(idx,i));
end
x = num2cell(x,1);
% Specify training options and train
opts = trainingOptions("adam", ...
    MaxEpochs=200, ...
    MiniBatchSize=numObs/10, ...
    Plots="training-progress", ...
    Shuffle="every-epoch", ...
    InitialLearnRate=1e-2, ...
    LearnRateDropFactor=0.9, ...
    LearnRateDropPeriod=10, ...
    LearnRateSchedule="piecewise");
net = trainnet(x,y,net,"mse",opts);
% Test the network on a new input
x = randi([1,10],[seqLen,1]);
ypred = predict(net,x)
yact = x(x(1)) + x(x(2))

https://in.mathworks.com/matlabcentral/answers/2014811-is-there-any-documentation-on-how-to-build-a-transformer-encoder-from-scratch-in-matlab

https://in.mathworks.com/help/deeplearning/ug/train-vision-transformer-network-for-image-classification.html

