Transformer Architecture
A Large Language Model (LLM) is a neural network trained on text with self-supervised learning.
Self-supervised learning (SSL) is a machine learning technique where models are trained to predict part of the input using the rest of the data.
Masked Language Modeling (MLM)
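In MLM, some input tokens are hidden and the model is trained to predict them from the surrounding tokens. A rough sketch of building one such training pair (the toy sentence and the "[MASK]" token are just illustrative):

# Toy illustration of masked language modeling: hide one token, predict it from the rest
sentence = ["the", "cat", "sat", "on", "the", "mat"]
masked_input = sentence.copy()
masked_input[2] = "[MASK]"            # the model sees this as input
target = sentence[2]                  # ...and is trained to predict "sat" at that position
print(masked_input, "-> predict:", target)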
ATTENTION MECHANISM
Midterm Project for the LLM course: use Attention and a Transformer to build a simple LLM model.
Self-Attention
import torch
import torch.nn as nn

d_in, d_out = 768, 64                     # example dimensions (assumed for illustration)
W_q = nn.Linear(d_in, d_out, bias=False)  # query projection
W_k = nn.Linear(d_in, d_out, bias=False)  # key projection
W_v = nn.Linear(d_in, d_out, bias=False)  # value projection
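A quick usage sketch of these projections, assuming a toy sequence of 6 token embeddings:

x = torch.randn(6, d_in)   # 6 tokens, each a d_in-dimensional embedding (toy input)
queries = W_q(x)           # (6, d_out)
keys    = W_k(x)           # (6, d_out)
values  = W_v(x)           # (6, d_out)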
TRANSFORMER MODEL
Embedding
Positional Embedding
Attention
Dense Layer
Residual Connections: output = F(x) + x
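A minimal sketch of how these pieces could be wired together; the dimensions, layer names, and the use of nn.MultiheadAttention here are assumptions for illustration, not the course's reference model:

class MiniTransformerBlock(nn.Module):
    # toy block: attention + dense layer, each wrapped in a residual connection F(x) + x
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.dense = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)       # self-attention
        x = attn_out + x                       # residual connection
        x = self.dense(x) + x                  # residual connection around the dense layer
        return x

class MiniLLM(nn.Module):
    # token embedding + learned positional embedding + one transformer block
    def __init__(self, vocab_size=1000, d_model=64, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.block = MiniTransformerBlock(d_model)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.shape[1], device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(positions)
        return self.block(x)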
Softmax attention: attention scores -> softmax -> attention weights -> context vector
(for a given query/word embedding)
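Concretely, for one query token (continuing the queries, keys, and values computed above):

query = queries[0]                       # pick one query (token 0)
scores = keys @ query                    # attention scores: one per input token
weights = torch.softmax(scores, dim=0)   # softmax over the scores -> attention weights
context = weights @ values               # context vector for this query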
Dot-product Attention:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
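The same computation in matrix form, as a small helper (a sketch; Q, K, V are assumed to have shape (seq_len, d_k)):

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k**0.5   # pairwise attention scores
    weights = torch.softmax(scores, dim=-1)       # each query's weights sum to 1
    return weights @ V                            # one context vector per query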
Extending Single-head Attention
We simply stack multiple single-head attention modules to obtain a multi-head attention module, as sketched below.
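A minimal sketch of that stacking idea, reusing the scaled_dot_product_attention helper above; the class names and dimensions are illustrative assumptions:

class SelfAttentionHead(nn.Module):
    # one head: project the input to queries/keys/values, then apply dot-product attention
    def __init__(self, d_in, d_head):
        super().__init__()
        self.W_q = nn.Linear(d_in, d_head, bias=False)
        self.W_k = nn.Linear(d_in, d_head, bias=False)
        self.W_v = nn.Linear(d_in, d_head, bias=False)

    def forward(self, x):                         # x: (seq_len, d_in)
        return scaled_dot_product_attention(self.W_q(x), self.W_k(x), self.W_v(x))

class MultiHeadAttention(nn.Module):
    # stack several single-head modules and concatenate their context vectors
    def __init__(self, d_in, d_head, n_heads):
        super().__init__()
        self.heads = nn.ModuleList([SelfAttentionHead(d_in, d_head) for _ in range(n_heads)])

    def forward(self, x):
        return torch.cat([head(x) for head in self.heads], dim=-1)   # (seq_len, n_heads * d_head)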