Attention Mechanism, Transformers, BERT, and GPT: Tutorial and Survey
Abstract

This is a tutorial and survey paper on the attention mechanism, transformers, BERT, and GPT. We first explain the attention mechanism, the sequence-to-sequence model without and with attention, self-attention, and attention in different areas such as natural language processing and computer vision. Then, we explain transformers, which do not use any recurrence. We explain all the parts of the encoder and decoder in the transformer, including positional encoding, multihead self-attention and cross-attention, and masked multihead attention. Thereafter, we introduce the Bidirectional Encoder Representations from Transformers (BERT) and the Generative Pre-trained Transformer (GPT) as stacks of the encoders and decoders of the transformer, respectively. We explain their characteristics and how they work.

Figure 1. Attention in the visual system for (a) seeing a picture by attending to more important parts of the scene and (b) reading a sentence by attending to more informative words in the sentence.
1. Introduction

When looking at a scene or picture, our visual system, as well as a machine learning model (Li et al., 2019b), focuses on or attends to some specific parts of the scene/image with more information and importance and ignores the less informative or less important parts. For example, when we look at the Mona Lisa portrait, our visual system attends to Mona Lisa's face and smile, as Fig. 1 illustrates. Moreover, when reading a text, especially when we want to try fast reading, one technique is skimming (Xu, 2011), in which our visual system or a model skims the data with high pacing and only attends to the more informative words of the sentences (Yu et al., 2018). Figure 1 shows a sample sentence and highlights the words on which our visual system focuses more in skimming.

The concept of attention can be modeled in machine learning, where attention is a simple weighting of data. In the attention mechanism, explained in this tutorial paper, the more informative or more important parts of data are given larger weights for the sake of more attention. Many of the state-of-the-art Natural Language Processing (NLP) (Indurkhya & Damerau, 2010) and deep learning techniques in NLP (Socher et al., 2012) make use of the attention mechanism.

Transformers are also autoencoders, which encode the input data to a hidden space and then decode it to another domain. Transfer learning is widely used in NLP (Wolf et al., 2019b), and transformers can also be used for transfer learning. Recently, transformers were proposed which are merely composed of attention modules and do not include any recurrent modules (Vaswani et al., 2017). This was a great breakthrough. Prior to the proposal of transformers with only the attention mechanism, recurrent models such as the Long Short-Term Memory (LSTM) network (Hochreiter & Schmidhuber, 1997) were used for learning sequences.
the i-th context vector is:

c_i = \sum_{j=1}^{n} a_{ij} h_j, \quad (3)

where a_{ij} \geq 0 is the weight of h_j for the i-th context vector. This weighted sum for a specific word in the sequence determines which words in the sequence have more effect on that word. In other words, it determines which words this specific word "attends" to more. This notion of weighted impact, used to see which parts have more impact, is called "attention". It is noteworthy that the original idea of an arithmetic linear combination of vectors for the purpose of word embedding, similar to Eq. (3), was in the Word2Vec method (Mikolov et al., 2013a;b).

The sequence-to-sequence model with attention considers a notion of similarity between the latent vector l_{i-1} of the decoder and the hidden vector h_j of the encoder (Bahdanau et al., 2015):

\mathbb{R} \ni s_{ij} := \text{similarity}(l_{i-1}, h_j). \quad (4)

The intuition for this similarity score is as follows. The output word y_i depends on the previous latent vector l_{i-1} (see Fig. 3), and the hidden vector h_j depends on the input word x_j. Hence, this similarity score relates to the impact of the input x_j on the output y_i. In this way, the score s_{ij} shows the impact of the j-th input word on generating the i-th output word of the sequence. This similarity notion can be a neural network learned by backpropagation (Rumelhart et al., 1986).

In order to make this score a probability, these scores should sum to one; hence, we take its softmax form (Chorowski et al., 2014):

\mathbb{R} \ni a_{ij} := \frac{e^{s_{ij}}}{\sum_{k=1}^{n} e^{s_{ik}}}. \quad (5)

In this way, the score vector [a_{i1}, a_{i2}, \dots, a_{in}]^\top behaves as a discrete probability distribution. Therefore, in Eq. (3), the weights sum to one, and the weights with higher values attend more to their corresponding hidden vectors.
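As a concrete illustration of Eqs. (3), (4), and (5), the following minimal sketch computes the attention weights and the context vector, assuming a simple dot product as the similarity function (the similarity can be any other notion, e.g., a learned neural network); the dimensions and random vectors are illustrative assumptions.

```python
# A minimal NumPy sketch of Eqs. (3)-(5), assuming a dot product as the similarity.
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

n, p = 4, 8                  # sequence length and hidden dimension (illustrative)
H = np.random.randn(n, p)    # encoder hidden vectors h_1, ..., h_n
l_prev = np.random.randn(p)  # decoder latent vector l_{i-1}

s_i = H @ l_prev             # Eq. (4): similarity scores s_{ij}, j = 1, ..., n
a_i = softmax(s_i)           # Eq. (5): weights that sum to one
c_i = a_i @ H                # Eq. (3): context vector as a weighted sum of the h_j
```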
2.4. Self-Attention

2.4.1. The Need for Composite Embedding

Many of the previous methods for NLP, such as word2vec (Mikolov et al., 2013a;b; Goldberg & Levy, 2014; Mikolov et al., 2015) and GloVe (Pennington et al., 2014), used to learn a representation for every word. However, for understanding how the words relate to each other, we can have a composite embedding where the compositions of words also have some embedding representation (Cheng et al., 2016). For example, Fig. 4 shows a sentence which highlights the relation of words. This figure shows, when reading a word in a sentence, which previous words in the sentence we remember more. This relation of words shows that we need a composite embedding for natural language embedding.

2.4.2. Query-Retrieval Modeling

Consider a database with keys and values where a query is searched through the keys to retrieve a value (Garcia-Molina et al., 1999).
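As a toy illustration of this hard, exact-match retrieval (an assumption for illustration, not from the paper; the soft generalization is described next), a query must match exactly one key to retrieve its value:

```python
# A toy "hard" query-retrieval: the query must exactly match one key.
database = {"I": 1.2, "am": 0.4, "a": 0.9, "student": 2.5}  # illustrative keys and values
query = "student"
print(database[query])  # an exact match on a key retrieves its value
```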
Figure 5 shows such a database. We can generalize this hard definition of query-retrieval to a soft query-retrieval where several keys, rather than only one key, can correspond to the query. For this, we calculate the similarity of the query with all the keys to see which keys are more similar to the query. This soft query-retrieval is formulated as:

\text{attention}(q, \{k_i\}_{i=1}^{n}, \{v_i\}_{i=1}^{n}) := \sum_{i=1}^{n} a_i v_i, \quad (6)

where:

\mathbb{R} \ni s_i := \text{similarity}(q, k_i), \quad (7)

\mathbb{R} \ni a_i := \text{softmax}(s_i) = \frac{e^{s_i}}{\sum_{k=1}^{n} e^{s_k}}, \quad (8)

and q, \{k_i\}_{i=1}^{n}, and \{v_i\}_{i=1}^{n} denote the query, keys, and values, respectively. Recall that the context vector of a sequence-to-sequence model with attention, introduced by Eq. (3), was also a linear combination with weights of normalized similarity (see Eq. (5)). The same linear combination appears in Eq. (6), where the weights are the similarities of the query with the keys. An illustration of Eqs. (6), (7), and (8) is shown in Fig. 6. Note that the similarity s_i can be any notion of similarity. Some of the well-known similarity measures are the inner product, the scaled inner product, and parameterized similarities using some learnable matrices and vectors W \in \mathbb{R}^{p \times p}, w_q \in \mathbb{R}^p, and w_k \in \mathbb{R}^p. Among these similarity measures, the scaled inner product is used most.

Eq. (6) calculates the attention of a target word (or query) with respect to every input word (or key), which are the previous and forthcoming words. As Fig. 4 illustrates, when processing a word which is considered as the query, the other words in the sequence are the keys. Using Eq. (6), we see how similar the other words of the sequence are to that word. In other words, we see how impactful the other previous and forthcoming words are for generating a missing word in the sequence.

We provide an example for Eq. (6) here. Consider the sentence "I am a student". Assume we are processing the word "student" in this sequence. Hence, we have a query corresponding to the word "student". The keys and values correspond to the previous words, which are "I", "am", and "a". Assume we calculate the normalized similarity of the query and the keys and obtain the weights 0.7, 0.2, and 0.1 for "I", "am", and "a", respectively, where the weights sum to one. Then, the attention value for the word "student" is 0.7 v_I + 0.2 v_am + 0.1 v_a.
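The following minimal sketch implements Eqs. (6), (7), and (8) for this example, assuming a dot product as the similarity; the embedding vectors are random placeholders, so the resulting weights will not be exactly 0.7, 0.2, and 0.1.

```python
# A minimal NumPy sketch of Eqs. (6)-(8) for the sentence "I am a student".
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

p = 8                               # embedding dimension (illustrative)
words = ["I", "am", "a"]            # the previous words act as keys and values
K = np.random.randn(len(words), p)  # keys k_1, ..., k_n
V = np.random.randn(len(words), p)  # values v_1, ..., v_n
q = np.random.randn(p)              # query for the word "student"

s = K @ q                           # Eq. (7): similarities of the query and the keys
a = softmax(s)                      # Eq. (8): normalized weights (cf. 0.7, 0.2, 0.1)
attention_value = a @ V             # Eq. (6): weighted sum of the values
```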
2.4.3. Attention Formulation

Let the words of a sequence of words be in a d-dimensional space, i.e., the sequence is \{x_i \in \mathbb{R}^d\}_{i=1}^{n}.
Figure 8. Using attention in computer vision: (a) transformer of image to caption with CNN for its encoder and RNN or LSTM for
its decoder. The caption for this image is “Elephant is in water”, (b) the convolutional filters take the values from the image for the
query-retrieval modeling in attention mechanism.
x_i \leftarrow x_i + p_i. \quad (18)
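Eq. (18) adds a positional encoding p_i to the embedding x_i of the i-th word. The following is a minimal sketch, assuming the sinusoidal encoding used by (Vaswani et al., 2017) and illustrative dimensions; the function name and shapes are assumptions for illustration.

```python
# A minimal sketch of Eq. (18): add a positional encoding to every word embedding.
import numpy as np

def positional_encoding(n, d):
    # Sinusoidal encoding: even dimensions use sine, odd dimensions use cosine.
    P = np.zeros((n, d))
    pos = np.arange(n)[:, None]
    i = np.arange(0, d, 2)[None, :]
    P[:, 0::2] = np.sin(pos / 10000 ** (i / d))
    P[:, 1::2] = np.cos(pos / 10000 ** (i / d))
    return P

n, d = 10, 512                     # sequence length and embedding dimension
X = np.random.randn(n, d)          # word embeddings x_1, ..., x_n
X = X + positional_encoding(n, d)  # Eq. (18): x_i <- x_i + p_i
```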
Table 1. Comparison of complexities between self-attention and recurrence (Vaswani et al., 2017).

                 Complexity per layer   Sequential operations   Maximum path length
Self-Attention   O(n^2 p)               O(1)                    O(1)
Recurrence       O(n p^2)               O(n)                    O(n)
3.3.2. Multihead Attention with Cross-Attention

As Figs. 9 and 11 illustrate, the output of the masked multihead attention module is fed to a multihead attention module with cross-attention. This module is not self-attention because its values, keys, and queries do not all come from the same sequence; rather, its values and keys come from the output of the encoder and its queries come from the output of the masked multihead attention module in the decoder. In other words, the values and keys come from the processed input embeddings and the queries come from the processed output embeddings. The calculated multihead attention determines how much every output embedding attends to the input embeddings, how much every pair of output embeddings attends to the pairs of input embeddings, how much every pair of pairs of output embeddings attends to the pairs of pairs of input embeddings, and so on. This shows the connection between the input sequence and the generated output sequence.
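To make this flow of queries, keys, and values concrete, here is a minimal PyTorch sketch of cross-attention; the tensor shapes, variable names, and the use of torch.nn.MultiheadAttention are illustrative assumptions rather than the exact implementation of (Vaswani et al., 2017).

```python
# A minimal sketch of cross-attention: queries come from the decoder,
# keys and values come from the encoder output.
import torch

p, h = 512, 8  # embedding dimension and number of heads (illustrative)
cross_attn = torch.nn.MultiheadAttention(embed_dim=p, num_heads=h, batch_first=True)

enc_out = torch.randn(1, 10, p)     # processed input embeddings (from the encoder)
dec_queries = torch.randn(1, 7, p)  # processed output embeddings (from masked self-attention)

out, weights = cross_attn(query=dec_queries, key=enc_out, value=enc_out)
print(out.shape)      # (1, 7, p): one attended vector per output position
print(weights.shape)  # (1, 7, 10): how much each output position attends to each input position
```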
3.3.3. Feedforward Layer and Softmax Activation

Again, the output of the multihead attention module with cross-attention is normalized and added to its input. Then, it is fed to a feedforward neural network, followed by layer normalization and addition to its input. Note that the masked multihead attention, the multihead attention with cross-attention, and the feedforward network are stacked N = 6 times.

The output of the feedforward network passes through a linear layer (a linear projection), and a softmax activation function is applied at the end. The number of output neurons with the softmax activation function equals the number of words in the dictionary, which is a large number. The outputs of the decoder sum to one and are the probabilities of every word in the dictionary being the generated next word. For the sake of sequence generation, the token or word with the largest probability is taken as the next word.
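A minimal sketch of this output layer follows; the hidden size p and vocabulary size V are illustrative assumptions, not the configuration of (Vaswani et al., 2017).

```python
# A minimal sketch of the decoder output layer: linear projection to the dictionary,
# softmax over the dictionary, and a greedy choice of the next word.
import torch

p, V = 512, 30000                                # hidden size and dictionary size (illustrative)
to_vocab = torch.nn.Linear(p, V)                 # linear projection to dictionary size
hidden = torch.randn(1, p)                       # feedforward output for one position
probs = torch.softmax(to_vocab(hidden), dim=-1)  # probabilities that sum to one
next_token = probs.argmax(dim=-1)                # greedy choice: the most probable next word
```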
3.4. Attention is All We Need!

3.4.1. No Need for RNN!

As Fig. 9 illustrates, the output of the decoder is fed to the masked multihead attention module of the decoder with some shift. Note that this is not a notion of recurrence because it can be interpreted as the procedure of teacher forcing (Kolen & Kremer, 2001). Hence, we see that there is not any recurrent module like the RNN (Kombrink et al., 2011) and LSTM (Hochreiter & Schmidhuber, 1997) in the transformer. We showed that we can learn a sequence using the transformer. Therefore, attention is all we need to learn a sequence, and there is no need for any recurrence module. The proposal of transformers (Vaswani et al., 2017) was a breakthrough in NLP; the state-of-the-art NLP methods are all based on transformers nowadays.

3.4.2. Complexity Comparison

Table 1 reports the complexity of operations in the self-attention mechanism and compares it with that of recurrence such as the RNN. In self-attention, we learn the attention of every word to every other word in the sequence of n words. Also, we learn a p-dimensional embedding for every word. Hence, the complexity of operations per layer is O(n^2 p), while the complexity per layer in recurrence is O(n p^2). Although the complexity per layer in self-attention is worse than in recurrence, many of its operations can be performed in parallel because all the words of the sequence are processed simultaneously, as also explained in the following. Hence, the O(n^2 p) complexity is not very problematic, as it can be parallelized; recurrence, in contrast, cannot be parallelized because of its sequential nature.

As for the number of sequential operations, the self-attention mechanism processes all the n words simultaneously, so its sequential operations are of order O(1). As recurrence should process the words sequentially, the number of its sequential operations is of order O(n). As for the maximum path length between every two words, self-attention learns the attention between every two words; hence, its maximum path length is of order O(1). However, in recurrence, every word requires a path with a length of a fraction of the sequence (a length of n in the worst case) to reach the processing of another word, so its maximum path length is O(n). This shows that attention reduces both the sequential operations and the maximum path length, compared to recurrence.

4. BERT: Bidirectional Encoder Representations from Transformers

BERT (Devlin et al., 2018) is one of the state-of-the-art methods for NLP. It is a stack of encoders of the transformer (see Fig. 9). In other words, it is built using transformer encoder blocks. Although some NLP methods, such as XLNet (Yang et al., 2019b), have slightly outperformed it, BERT is still one of the best models for different NLP tasks such as question answering (Qu et al., 2019), natural language understanding (Dong et al., 2019), and sentiment analysis.
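As a hedged illustration (not taken from the paper itself), the following sketch obtains contextual token embeddings from a pre-trained stack of BERT encoders through the HuggingFace transformers package (Wolf et al., 2019a); the model name and example sentence are assumptions.

```python
# A minimal sketch of using a pre-trained BERT encoder stack to embed a sentence.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I am a student", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, number of tokens, 768): one embedding per token
```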
However, the GPT model is used for language modeling (Rosenfeld, 2000; Jozefowicz et al., 2016; Jing & Xu, 2019), whose objective is to predict the next word in an incomplete sentence, given all of the previous words. The predicted new word is then added to the sequence and is fed to GPT as input again, and the next word is predicted. This goes on until the sentences are completed with their coming words. In other words, the GPT model takes some document and continues the text in the most related way. For example, if the input sentences are about psychology, the trained GPT model generates the next sentences also about psychology to complete the document. Note that, as any text without labels can be used for predicting the next words in sentences, GPT is an unsupervised method, which makes it possible to train it on a huge amount of Internet data.
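As a concrete illustration of this next-word prediction loop, the following is a minimal sketch using a pre-trained GPT-2 model from the HuggingFace transformers Python package (Wolf et al., 2019a), which is discussed below; the prompt, the greedy decoding, and the number of generated tokens are illustrative assumptions.

```python
# A minimal sketch of greedy next-word generation with a pre-trained GPT-2 model.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("Psychology is the study of", return_tensors="pt")
with torch.no_grad():
    for _ in range(20):                   # generate 20 tokens, one at a time
        logits = model(input_ids).logits  # scores over the vocabulary for every position
        next_id = logits[0, -1].argmax()  # greedy choice of the next token
        # append the predicted token and feed the extended sequence back to the model
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```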
The successors of GPT-1 (Radford et al., 2018) are GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020). GPT-2 and GPT-3 are extensions of GPT-1 with a larger number of stacks of the transformer decoder. Hence, they have more learnable parameters and can be trained with more data for better language modeling and inference. For example, GPT-2 has 1.5 billion parameters. GPT-2 and especially GPT-3 have been trained with much more Internet data on various general and academic subjects to be able to generate text in any subject and style of interest. For example, GPT-2 has been trained on 8 million web pages which contain 40GB of Internet text data.

GPT-2 is a quite large model and cannot be easily used in embedded systems because it requires large memory. Hence, different sizes of GPT-2, such as small, medium, large, XLarge, and DistilGPT-2, are provided for usage in embedded systems, where the number of stacks and learnable parameters differs between these versions. These versions of GPT-2 can be found and used in the HuggingFace transformers Python package (Wolf et al., 2019a).

GPT-2 has been used in many different applications such as dialogue systems (Budzianowski & Vulić, 2019), patent claim generation (Lee & Hsiang, 2019), and medical text simplification (Van et al., 2020). A combination of GPT-2 and BERT has been used for question answering (Klein & Nabi, 2019). It is noteworthy that GPT can be seen as a few-shot learner (Brown et al., 2020). A comparison of GPT and BERT can also be found in (Ethayarajh, 2019).

GPT-3 is a very huge version of GPT with many more stacks and learnable parameters. For comparison, note that GPT-2, NVIDIA Megatron (Shoeybi et al., 2019), Microsoft Turing-NLG (Microsoft, 2020), and GPT-3 (Brown et al., 2020) have 1.5 billion, 8 billion, 17 billion, and 175 billion learnable parameters, respectively. This huge number of parameters allows GPT-3 to be trained on a very huge amount of Internet text data on various subjects and topics. Hence, GPT-3 has been able to learn almost all topics of documents, and some people are even discussing whether it can pass the writer's Turing test (Elkins & Chun, 2020; Floridi & Chiriatti, 2020). Note that GPT-3 has, in a sense, memorized the texts of all subjects, but not in a bad way, i.e., overfitting; rather, in a good way. This memorization is because of the complexity of its huge number of learnable parameters (Arpit et al., 2017), and it is not overfitted because it is trained on big enough Internet data.

GPT-3 has had many different interesting applications such as fiction and poetry generation (Branwen, 2020). Of course, it also poses some risks (McGuffie & Newhouse, 2020).

6. Conclusion

Transformers are very essential tools in natural language processing and computer vision. This paper was a tutorial and survey paper on the attention mechanism, transformers, BERT, and GPT. We explained the attention mechanism, the sequence-to-sequence model with and without attention, and self-attention. The different parts of the encoder and decoder of a transformer were explained. Finally, BERT and GPT were introduced as stacks of the encoders and decoders of the transformer, respectively.

Acknowledgment

The authors hugely thank Prof. Pascal Poupart, whose course partly covered some of the materials in this tutorial paper. Some of the materials of this paper can also be found in Prof. Ali Ghodsi's course videos.

References

Alsentzer, Emily, Murphy, John R, Boag, Willie, Weng, Wei-Hung, Jin, Di, Naumann, Tristan, and McDermott, Matthew. Publicly available clinical BERT embeddings. arXiv preprint arXiv:1904.03323, 2019.

Arpit, Devansh, Jastrzebski, Stanisław, Ballas, Nicolas, Krueger, David, Bengio, Emmanuel, Kanwal, Maxinder S, Maharaj, Tegan, Fischer, Asja, Courville, Aaron, Bengio, Yoshua, and Lacoste-Julien, Simon. A closer look at memorization in deep networks. In International Conference on Machine Learning, 2017.

Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.

Ben-David, Shai, Blitzer, John, Crammer, Koby, Kulesza, Alex, Pereira, Fernando, and Vaughan, Jennifer Wortman. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.

Branwen, Gwern. GPT-3 creative fiction. https://www.gwern.net/GPT-3, 2020.
Brown, Tom B, Mann, Benjamin, Ryder, Nick, Subbiah, Melanie, Kaplan, Jared, Dhariwal, Prafulla, Neelakantan, Arvind, Shyam, Pranav, Sastry, Girish, Askell, Amanda, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, 2020.

Budzianowski, Paweł and Vulić, Ivan. Hello, it's GPT-2 — how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. arXiv preprint arXiv:1907.05774, 2019.

Chan, William, Jaitly, Navdeep, Le, Quoc V, and Vinyals, Oriol. Listen, attend and spell. arXiv preprint arXiv:1508.01211, 2015.

Chan, William, Jaitly, Navdeep, Le, Quoc, and Vinyals, Oriol. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4960–4964. IEEE, 2016.

Cheng, Jianpeng, Dong, Li, and Lapata, Mirella. Long short-term memory-networks for machine reading. In Conference on Empirical Methods in Natural Language Processing, pp. 551–561, 2016.

Chorowski, Jan, Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. End-to-end continuous speech recognition using attention-based recurrent NN: First results. arXiv preprint arXiv:1412.1602, 2014.

Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, and Toutanova, Kristina. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Dong, Li, Yang, Nan, Wang, Wenhui, Wei, Furu, Liu, Xiaodong, Wang, Yu, Gao, Jianfeng, Zhou, Ming, and Hon, Hsiao-Wuen. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pp. 13063–13075, 2019.

Dou, Qi, Coelho de Castro, Daniel, Kamnitsas, Konstantinos, and Glocker, Ben. Domain generalization via model-agnostic learning of semantic features. Advances in Neural Information Processing Systems, 32:6450–6461, 2019.

Elkins, Katherine and Chun, Jon. Can GPT-3 pass a writer's Turing test? Journal of Cultural Analytics, 2371:4549, 2020.

Ethayarajh, Kawin. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. arXiv preprint arXiv:1909.00512, 2019.

Floridi, Luciano and Chiriatti, Massimo. GPT-3: Its nature, scope, limits, and consequences. Minds and Machines, pp. 1–14, 2020.

Garcia-Molina, Hector, D. Ullman, Jeffrey, and Widom, Jennifer. Database systems: The complete book. Prentice Hall, 1999.

Goldberg, Yoav and Levy, Omer. word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722, 2014.

Gretton, Arthur, Borgwardt, Karsten, Rasch, Malte, Schölkopf, Bernhard, and Smola, Alex J. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, pp. 513–520, 2007.

Gretton, Arthur, Borgwardt, Karsten M, Rasch, Malte J, Schölkopf, Bernhard, and Smola, Alexander. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Indurkhya, Nitin and Damerau, Fred J. Handbook of natural language processing, volume 2. CRC Press, 2010.

Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Jiao, Xiaoqi, Yin, Yichun, Shang, Lifeng, Jiang, Xin, Chen, Xiao, Li, Linlin, Wang, Fang, and Liu, Qun. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351, 2019.

Jing, Kun and Xu, Jungang. A survey on neural network language models. arXiv preprint arXiv:1906.03591, 2019.

Jozefowicz, Rafal, Vinyals, Oriol, Schuster, Mike, Shazeer, Noam, and Wu, Yonghui. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.

Kazemi, Vahid and Elqursh, Ali. Show, ask, attend, and answer: A strong baseline for visual question answering. arXiv preprint arXiv:1704.03162, 2017.
Klein, Tassilo and Nabi, Moin. Learning to answer by learning to ask: Getting the best of GPT-2 and BERT worlds. arXiv preprint arXiv:1911.02365, 2019.

Kolen, John F and Kremer, Stefan C. A field guide to dynamical recurrent networks. John Wiley & Sons, 2001.

Kombrink, Stefan, Mikolov, Tomáš, Karafiát, Martin, and Burget, Lukáš. Recurrent neural network based language modeling in meeting recognition. In Twelfth Annual Conference of the International Speech Communication Association, 2011.

LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Lee, Honglak, Grosse, Roger, Ranganath, Rajesh, and Ng, Andrew Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In International Conference on Machine Learning, pp. 609–616, 2009.

Lee, Jieh-Sheng and Hsiang, Jieh. Patent claim generation by fine-tuning OpenAI GPT-2. arXiv preprint arXiv:1907.02052, 2019.

Lee, Jinhyuk, Yoon, Wonjin, Kim, Sungdong, Kim, Donghyeon, Kim, Sunkyu, So, Chan Ho, and Kang, Jaewoo. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, 2020.

Li, Da, Yang, Yongxin, Song, Yi-Zhe, and Hospedales, Timothy M. Deeper, broader and artier domain generalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5542–5550, 2017.

Li, Hui, Wang, Peng, Shen, Chunhua, and Zhang, Guyu. Show, attend and read: A simple and strong baseline for irregular text recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 8610–8617, 2019a.

Li, Yang, Kaiser, Lukasz, Bengio, Samy, and Si, Si. Area attention. In International Conference on Machine Learning, pp. 3846–3855. PMLR, 2019b.

Lim, Jian Han, Chan, Chee Seng, Ng, Kam Woh, Fan, Lixin, and Yang, Qiang. Protect, show, attend and tell: Image captioning model with ownership protection. arXiv preprint arXiv:2008.11009, 2020.

Luong, Minh-Thang, Pham, Hieu, and Manning, Christopher D. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.

McGuffie, Kris and Newhouse, Alex. The radicalization risks of GPT-3 and advanced neural language models. arXiv preprint arXiv:2009.06807, 2020.

Microsoft. Turing-NLG: A 17-billion-parameter language model by Microsoft. Microsoft Blog, 2020.

Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient estimation of word representations in vector space. In International Conference on Learning Representations, 2013a.

Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119, 2013b.

Mikolov, Tomas, Chen, Kai, Corrado, Gregory S, and Dean, Jeffrey A. Computing numeric representations of words in a high-dimensional space, May 19 2015. US Patent 9,037,464.

Pennington, Jeffrey, Socher, Richard, and Manning, Christopher D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1532–1543, 2014.

Perlovsky, Leonid I. Toward physics of the mind: Concepts, emotions, consciousness, and symbols. Physics of Life Reviews, 3(1):23–55, 2006.

Qu, Chen, Yang, Liu, Qiu, Minghui, Croft, W Bruce, Zhang, Yongfeng, and Iyyer, Mohit. BERT with history answer embedding for conversational question answering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1133–1136, 2019.

Qureshi, Ahmed Hussain, Nakamura, Yutaka, Yoshikawa, Yuichiro, and Ishiguro, Hiroshi. Show, attend and interact: Perceivable human-robot social interaction through neural attention Q-network. In 2017 IEEE International Conference on Robotics and Automation, pp. 1639–1645. IEEE, 2017.

Radford, Alec, Narasimhan, Karthik, Salimans, Tim, and Sutskever, Ilya. Improving language understanding by generative pre-training. Technical report, OpenAI, 2018.

Radford, Alec, Wu, Jeffrey, Child, Rewon, Luan, David, Amodei, Dario, and Sutskever, Ilya. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.

Rosenfeld, Ronald. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8):1270–1278, 2000.
Rumelhart, David E, Hinton, Geoffrey E, and Williams, Ronald J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

Sanh, Victor, Debut, Lysandre, Chaumond, Julien, and Wolf, Thomas. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

Shoeybi, Mohammad, Patwary, Mostofa, Puri, Raul, LeGresley, Patrick, Casper, Jared, and Catanzaro, Bryan. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.

Socher, Richard, Bengio, Yoshua, and Manning, Chris. Deep learning for NLP. Tutorial at Association for Computational Linguistics (ACL), 2012.

Song, Youwei, Wang, Jiahai, Liang, Zhiwei, Liu, Zhiyue, and Jiang, Tao. Utilizing BERT intermediate layers for aspect based sentiment analysis and natural language inference. arXiv preprint arXiv:2002.04815, 2020.

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Staliūnaitė, Ieva and Iacobacci, Ignacio. Compositional and lexical semantics in RoBERTa, BERT and DistilBERT: A case study on CoQA. arXiv preprint arXiv:2009.08257, 2020.

Summerfield, Christopher and Egner, Tobias. Expectation (and attention) in visual cognition. Trends in Cognitive Sciences, 13(9):403–409, 2009.

Tsai, Henry, Riesa, Jason, Johnson, Melvin, Arivazhagan, Naveen, Li, Xin, and Archer, Amelia. Small and practical BERT models for sequence labeling. arXiv preprint arXiv:1909.00100, 2019.

Van, Hoang, Kauchak, David, and Leroy, Gondy. AutoMeTS: The autocomplete for medical text simplification. arXiv preprint arXiv:2010.10573, 2020.

Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, Łukasz, and Polosukhin, Illia. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

particular: Multi-domain translation with domain transformation networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 9233–9241, 2020.

Wolf, Thomas, Debut, Lysandre, Sanh, Victor, Chaumond, Julien, Delangue, Clement, Moi, Anthony, Cistac, Pierric, Rault, Tim, Louf, Rémi, Funtowicz, Morgan, et al. HuggingFace's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019a.

Wolf, Thomas, Sanh, Victor, Chaumond, Julien, and Delangue, Clement. TransferTransfo: A transfer learning approach for neural network based conversational agents. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019b.

Xu, Jun. On the techniques of English fast-reading. In Theory and Practice in Language Studies, volume 1, pp. 1416–1419. Academy Publisher, 2011.

Xu, Kelvin, Ba, Jimmy, Kiros, Ryan, Cho, Kyunghyun, Courville, Aaron, Salakhudinov, Ruslan, Zemel, Rich, and Bengio, Yoshua. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pp. 2048–2057, 2015.

Yang, Chao, Kim, Taehwan, Wang, Ruizhe, Peng, Hao, and Kuo, C-C Jay. Show, attend, and translate: Unsupervised image translation with self-regularization and attention. IEEE Transactions on Image Processing, 28(10):4845–4856, 2019a.

Yang, Zhilin, Dai, Zihang, Yang, Yiming, Carbonell, Jaime, Salakhutdinov, Russ R, and Le, Quoc V. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pp. 5753–5763, 2019b.

Yu, Keyi, Liu, Yang, Schwing, Alexander G, and Peng, Jian. Fast and accurate text classification: Skimming, rereading and early stopping. In International Conference on Learning Representations, 2018.

Zhang, Honglun, Chen, Wenqing, Tian, Jidong, Wang, Yongkun, and Jin, Yaohui. Show, attend and translate: Unpaired multi-domain image-to-image translation with visual attention. arXiv preprint arXiv:1811.07483, 2018.