Contrastive Learning for Sentence Representation
Figure 1: The CLEAR framework. An original sentence s (Tok1 ... TokN) is transformed by two augmentations sampled from the augmentation set A (AUG ∼ A) into two augmented token sequences (Tok′1 ... Tok′N and Tok′′1 ... Tok′′N), each prefixed with a [CLS] token.
inter-sentence coherence. Their work shows that coherence prediction is a better choice than topic prediction, which is what NSP uses. DeCLUTR (Giorgi et al., 2020) is the first work to combine Contrastive Learning (CL) with MLM in pre-training. However, it requires extremely long input documents, i.e., 2048 tokens, which restricts the model to being pre-trained on limited data. Further, DeCLUTR trains from existing pre-trained models, so it remains unknown whether it could achieve the same performance when trained from scratch.

Drawing on recent advances in pre-trained language models and contrastive learning, we propose a new framework, CLEAR, which combines a word-level MLM objective with a sentence-level CL objective to pre-train a language model. The MLM objective enables the model to capture word-level hidden features, while the CL objective gives the model the capacity to recognize sentences with similar meanings by training an encoder to minimize the distance between the embeddings of different augmentations of the same sentence. In this paper, we present a novel design of augmentations that can be used to pre-train a language model at the sentence level. Our main findings and contributions can be summarized as follows:

• We proposed and tested four basic sentence augmentations: random word deletion, span deletion, synonym substitution, and reordering, which fills a large gap in NLP regarding what kinds of augmentations can be used in contrastive learning.

• We showed that a model pre-trained by our proposed method outperforms several strong baselines (including RoBERTa and BERT) on both the GLUE (Wang et al., 2018) and SentEval (Conneau and Kiela, 2018) benchmarks. For example, we showed a +2.2% absolute improvement on 8 GLUE tasks and a +5.7% absolute improvement on 7 SentEval semantic textual similarity tasks compared to the RoBERTa model.

2 Related Work

There are three lines of literature closely related to our work: sentence representation, large-scale pre-trained language representation models, and contrastive learning.

2.1 Sentence Representation

Learning the representation of a sentence has been studied in many existing works. Applying various pooling strategies to word embeddings as the representation of a sentence is a common baseline (Iyyer et al., 2015; Shen et al., 2018; Reimers and Gurevych, 2019). Skip-Thoughts (Kiros et al., 2015) trains an encoder-decoder model that tries to reconstruct the surrounding sentences. Quick-Thoughts (Logeswaran and Lee, 2018) trains an encoder-only model with the ability to select the correct context of a sentence out of other contrastive sentences.
Later on, many pre-trained language models such as BERT (Devlin et al., 2019) propose to use a manually inserted token (the [CLS] token) as the representation of the whole sentence, and they have become the new state of the art on a variety of downstream tasks. One recent paper, Sentence-BERT (Reimers and Gurevych, 2019), compares the average BERT embedding with the CLS-token embedding and, surprisingly, finds that computing the mean of all output vectors at the last layer of BERT outperforms the CLS token marginally.

2.2 Large-scale Pre-trained Language Representation Models

Deep pre-trained language models have proven their power in capturing implicit language features, even with different model architectures, pre-training tasks, and loss functions. Two of the early works are GPT (Radford et al., 2018) and BERT (Devlin et al., 2019): GPT uses a left-to-right Transformer while BERT designs a bidirectional Transformer. Both established an impressive new state of the art on many downstream tasks. Following this observation, a tremendous number of works have recently been published in the pre-trained language model domain. Some extend previous models to a sequence-to-sequence structure (Song et al., 2019; Lewis et al., 2019; Liu et al., 2020), which strengthens the model's capability for language generation. Others (Yang et al., 2019; Liu et al., 2019; Clark et al., 2020) explore different pre-training objectives to either improve the model's performance or accelerate pre-training.

2.3 Contrastive Learning

DeCLUTR (Giorgi et al., 2020) regards different spans inside one document as similar to each other. Our model differs from CERT (Fang and Xie, 2020) in adopting an encoder-only structure, which decreases the noise brought by the decoder. Further, unlike DeCLUTR, which only tests one augmentation and trains the model from an existing pre-trained model, we pre-train all models from scratch, which provides a straightforward comparison with the existing pre-trained models.

3 Method

This section proposes a novel framework and several sentence augmentation methods for contrastive learning in NLP.

3.1 The Contrastive Learning Framework

Borrowing from SimCLR (Chen et al., 2020), we propose a new contrastive learning framework, named CLEAR, to learn sentence representations. There are four main components in CLEAR, as outlined in Figure 1.

• An augmentation component AUG(·), which applies a random augmentation to the original sentence. For each original sentence s, we generate two random augmentations s̃1 = AUG(s, seed1) and s̃2 = AUG(s, seed2), where seed1 and seed2 are two random seeds. Note that, to test each augmentation's effect in isolation, we adopt the same augmentation to generate s̃1 and s̃2. Testing models with mixed augmentations requires more computational resources, which we leave for future work. We detail the proposed augmentation set A in Section 3.3. A minimal sketch of this two-view setup and the resulting contrastive objective is given below.
(a) Word Deletion: Tok1, Tok2, and Tok4 are deleted; the sentence after augmentation is [Tok[del], Tok3, Tok[del], Tok5, ..., TokN].
(b) Span Deletion: the span [Tok1, Tok2, Tok3, Tok4] is deleted; the sentence after augmentation is [Tok[del], Tok5, ..., TokN].
(c) Reordering: the two spans [Tok1, Tok2] and [Tok4] are reordered; the sentence after augmentation is [Tok4, Tok3, Tok1, Tok2, Tok5, ..., TokN].
(d) Synonym Substitution: Tok2, Tok3, and TokN are substituted by their synonyms Tok′2, Tok′3, and Tok′N, respectively; the sentence after augmentation is [Tok1, Tok′2, Tok′3, Tok4, Tok5, ..., Tok′N].

Figure 2: Four sentence augmentation methods in the proposed contrastive learning framework CLEAR.
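The four augmentations in Figure 2 act directly on token lists. Below is a minimal sketch of each; the deletion rate, span lengths, and the get_synonym helper are illustrative assumptions (the paper's exact hyperparameters are not restated here), while the [del] placeholder convention follows the figure.

```python
# Minimal sketches of the four augmentations in Figure 2, operating on token
# lists. The deletion rate, span lengths, and the get_synonym helper are
# illustrative assumptions; the [del] placeholder convention follows the figure.
import random

DEL = "[del]"

def word_deletion(tokens, rate=0.15, seed=0):
    rng = random.Random(seed)
    kept = [t if rng.random() > rate else DEL for t in tokens]
    out = []
    for t in kept:                      # collapse consecutive [del] (Fig. 2a)
        if t == DEL and out and out[-1] == DEL:
            continue
        out.append(t)
    return out

def span_deletion(tokens, span_len=4, seed=0):
    rng = random.Random(seed)
    start = rng.randrange(max(1, len(tokens) - span_len))
    return tokens[:start] + [DEL] + tokens[start + span_len:]   # Fig. 2b

def reordering(tokens, seed=0):
    # Swap two short, non-overlapping spans (Fig. 2c); assumes a sentence
    # of at least a handful of tokens.
    rng = random.Random(seed)
    a, la = rng.randrange(len(tokens) // 2), rng.randint(1, 2)
    b, lb = rng.randrange(a + la, len(tokens)), rng.randint(1, 2)
    return (tokens[:a] + tokens[b:b + lb] + tokens[a + la:b]
            + tokens[a:a + la] + tokens[b + lb:])

def synonym_substitution(tokens, get_synonym, rate=0.2, seed=0):
    # get_synonym is a hypothetical callable, e.g. a WordNet-backed lookup.
    rng = random.Random(seed)
    return [get_synonym(t) if rng.random() < rate else t for t in tokens]
```

For example, word_deletion("the cat sat on the mat".split(), seed=1) keeps most tokens and replaces each deleted run with a single [del] placeholder.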
Table 1: Results of the baselines and our proposed models on GLUE tasks.
Method MNLI QNLI QQP RTE SST-2 MRPC CoLA STS Avg
Baselines
BERT-base (Devlin et al., 2019) 84.0 89.0 89.1 61.0 93.0 86.3 57.3 89.5 81.2
RoBERTa-base (Liu et al., 2019) 87.2 93.2 88.2 71.8 94.4 87.8 56.1 89.4 83.5
MLM+1-CL-objective
MLM+ del-word 86.8 93.0 90.2 79.4 94.2 89.7 62.1 90.5 85.7
MLM+ del-span 87.3 92.8 90.1 79.8 94.4 89.9 59.8 90.3 85.6
MLM+2-CL-objective
MLM+ subs+ del-word 87.3 93.1 90.0 73.3 93.7 90.2 62.1 90.1 85.0
MLM+ subs+ del-span 87.0 93.4 90.3 74.4 94.3 90.5 63.3 90.5 85.5
MLM+ del-word+ reorder 87.0 92.7 89.5 76.5 94.5 90.6 59.1 90.4 85.0
MLM+ del-span+ reorder 86.7 92.9 90.0 78.3 94.5 89.2 64.3 89.8 85.7
Table 2: Results on the seven SentEval semantic textual similarity (STS) tasks with mean-pooling and [CLS]-pooling; the last column is the average.
Baselines
RoBERTa-base-mean 74.1 65.6 47.2 38.3 46.7 55.0 49.5 53.8
RoBERTa-base-[CLS] 75.9 71.9 47.4 37.5 47.9 55.1 57.6 56.1
MLM+1-CL-objective
MLM+ del-word-mean 75.9 69.0 50.6 40.0 50.2 58.9 52.4 56.7
MLM+ del-span-mean 71.0 62.6 49.3 41.7 48.9 58.1 52.3 54.8
MLM+ del-word-[CLS] 77.1 71.6 50.6 44.5 48.3 58.4 56.1 58.1
MLM+ del-span-[CLS] 62.7 57.4 34.4 20.4 24.3 32.0 31.5 37.5
MLM+2-CL-objective
MLM+ del-word+ reorder-mean 75.8 66.2 51.1 45.7 51.8 61.3 57.0 58.4
MLM+ del-span+ reorder-mean 75.4 67.8 48.3 50.3 54.9 60.4 56.8 59.1
MLM+ subs+ del-word-mean 73.6 63.4 44.6 39.8 50.1 55.5 49.6 53.8
MLM+ subs+ del-span-mean 75.5 67.0 48.3 45.0 54.6 60.9 58.5 58.5
MLM+ del-word+ reorder-[CLS] 71.9 63.8 41.9 30.9 37.4 48.9 52.1 49.6
MLM+ del-span+ reorder-[CLS] 75.0 68.7 49.4 54.3 57.6 64.0 61.4 61.5
MLM+ subs+ del-word-[CLS] 73.6 62.9 44.5 35.8 47.6 55.8 59.6 54.3
MLM+ subs+ del-span-[CLS] 75.6 72.5 49.0 48.9 57.4 63.6 65.6 61.8
As we can see in Table 1, several of our proposed models outperform the baselines on GLUE. Note that different tasks adopt different evaluation metrics; our two best models, MLM+del-word and MLM+del-span+reorder, both improve over the best baseline, RoBERTa-base, by 2.2% on the average score. Besides, a more important observation is that the best performance on every task comes from one of our proposed models. On CoLA and RTE, our best model exceeds the baseline by 7.0% and 8.0%, respectively. Further, we also find that different downstream tasks benefit from different augmentations. We make a more specific analysis in Section 5.2.

One notable point is that we do not show the results of MLM+subs, MLM+reorder, and MLM+subs+reorder in Table 1. We observe that the pre-training for these three models either converges quickly or suffers from a gradient explosion problem, which indicates that these three augmentations are too easy to distinguish.

4.3 SentEval Results for Semantic Textual Similarity Tasks

SentEval is a popular benchmark for evaluating general-purpose sentence representations (Conneau and Kiela, 2018). The specialty of this benchmark is that it does not perform fine-tuning as in GLUE. We evaluate the performance of our proposed methods on common Semantic Textual Similarity (STS) tasks in SentEval. Note that some previous models on the SentEval leaderboard (e.g., Sentence-BERT (Reimers and Gurevych, 2019)) train on specific datasets such as Stanford NLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2017), which makes a direct comparison hard. To make it easier, we compare our proposed models with RoBERTa-base directly on SentEval. According to Sentence-BERT, using the mean of all output vectors in the last layer is more effective than using the CLS-token output. We test both pooling strategies for each model.

From Table 2, we observe that the mean-pooling strategy does not show much advantage. In many cases, CLS-pooling is better than mean-pooling for our proposed models. The underlying reason is that contrastive learning directly updates the representation of the [CLS] token. Besides that, we find that adding the CL loss makes the model especially good at the Semantic Textual Similarity (STS) task, beating the best baseline by a large margin (+5.7%). We think this is because the pre-training objective of contrastive learning is to find similar sentence pairs, which aligns with the STS task.
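For reference, the two pooling strategies compared in Table 2 can be written compactly. The sketch below assumes last-layer hidden states of shape (batch, seq_len, dim), the [CLS] token at position 0, and an attention mask marking non-padding tokens; it is an illustration, not the paper's evaluation code.

```python
# The two pooling strategies compared in Table 2, assuming last-layer hidden
# states of shape (batch, seq_len, dim), the [CLS] token at position 0, and an
# attention mask that is 1 for real tokens and 0 for padding.
import torch

def cls_pooling(hidden_states):
    return hidden_states[:, 0]                        # the [CLS] vector

def mean_pooling(hidden_states, attention_mask):
    mask = attention_mask.unsqueeze(-1).float()       # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)        # ignore padding positions
    return summed / mask.sum(dim=1).clamp(min=1e-9)   # average over real tokens
```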
Table 3: Ablation study for several methods evaluated on the GLUE dev set. All models are pre-trained on WikiText-103 data for 500 epochs.
Method MNLI-m QNLI QQP RTE SST-2 MRPC CoLA STS Avg
RoBERTa-base 80.4 87.5 87.4 61.4 91.4 82.4 38.9 81.9 76.4
MLM-variant
Double-batch RoBERTa-base 80.3 88.0 87.1 59.9 91.9 82.1 43.0 82.0 76.8
Double MLM RoBERTa-base 80.5 87.6 87.3 57.4 90.4 77.7 42.2 83.0 75.8
MLM+CL-objective
MLM+ del-span 80.6 88.8 87.3 62.1 92.1 77.8 44.1 81.4 76.8
MLM+ del-span + reorder 81.1 88.7 87.5 58.1 90.0 80.4 43.3 87.4 77.1
MLM+ subs + del-word + reorder 80.5 87.7 87.3 59.6 90.4 80.2 45.1 87.1 77.2
This could explain why our proposed models show such large improvements on STS.

5 Discussion

This section presents an ablation study comparing the CL loss with the MLM loss, and shares some observations about what different augmentations learn.

5.1 Ablation Study

Our proposed CL-based models outperform the MLM-based models, so one remaining question is: where does the benefit come from? Does it come from the CL loss, or from the larger batch size (since calculating the CL loss requires storing extra information per batch)? To answer this question, we set up two extra baselines: Double MLM RoBERTa-base adopts an MLM+MLM loss, where each MLM is performed with a different mask on the same original sentence; the other, Double-batch RoBERTa-base, uses a single MLM loss with a double-size batch (a minimal sketch of both baselines is given at the end of this section).

Due to the limitation of computational resources, we conduct the ablation study on a smaller pre-training corpus, i.e., the WikiText-103 dataset (Merity et al., 2016). All the models listed in Table 3 are pre-trained for 500 epochs on 64 NVIDIA Tesla V100 32GB GPUs. Three of our proposed models are reported in the table. The general performance of the variants does not differ much from the original RoBERTa-base, with a +0.4% increase in average score for Double-batch RoBERTa-base, which confirms the idea that a larger batch benefits representation training, as proposed by previous work (Liu et al., 2019). Yet, the best-performing baseline is still not as good as our best proposed model. This tells us that the proposed model does not benefit solely from a larger batch; the CL loss also helps.

5.2 Different Augmentations Learn Different Features

In Table 1, we find an interesting phenomenon: different proposed models are good at specific tasks.

One example is that MLM+subs+del-span helps the model deal with similarity and paraphrase tasks. On QQP and STS, it achieves the highest score; on MRPC, it ranks second. We infer that MLM+subs+del-span performs well on this kind of task because synonym substitution translates the original sentence into sentences with similar meanings, while deleting different spans makes a greater variety of similar sentences visible. Combining them enhances the model's capacity to deal with many unseen sentence pairs.

We also notice that MLM+del-span achieves good performance on inference tasks (MNLI, QNLI, RTE). The underlying reason is that, with span deletion, the model has already been pre-trained well to infer other similar sentences. The ability to identify similar sentence pairs helps it recognize contradiction. Therefore, the gap between the pre-training task and this downstream task narrows.

Overall, we observe that different augmentations learn different features, and some specific augmentations are especially good at certain downstream tasks. Designing task-specific augmentations or exploring meta-learning to adaptively select different CL objectives is a promising future direction.
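For clarity, the two MLM-variant baselines described in Section 5.1 can be sketched as follows; mask_tokens, mlm_loss, and the per-step batching details are assumed helpers, not the paper's exact implementation.

```python
# Illustrative sketch of the two MLM-variant ablation baselines in Section 5.1;
# mask_tokens and mlm_loss are assumed helpers, not the paper's exact code.
def double_mlm_step(batch, model, mask_tokens, mlm_loss):
    # Double MLM RoBERTa-base: two MLM losses on the SAME sentences,
    # each using a different random mask.
    return (mlm_loss(model, mask_tokens(batch, seed=0)) +
            mlm_loss(model, mask_tokens(batch, seed=1)))

def double_batch_step(batch_a, batch_b, model, mask_tokens, mlm_loss):
    # Double-batch RoBERTa-base: a single MLM loss over twice as many
    # distinct sentences per step.
    return mlm_loss(model, mask_tokens(batch_a + batch_b, seed=0))
```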
6 Conclusion

In this work, we presented an instantiation of contrastive sentence representation learning. By carefully designing and testing different data augmentations and their combinations, we demonstrate the proposed methods' effectiveness on the GLUE and SentEval benchmarks across diverse pre-training corpora. The experimental results indicate that the pre-trained model becomes more robust when leveraging adequate sentence-level supervision. More importantly, we reveal that different augmentations learn different features for the model. Finally, we demonstrate that the performance improvement comes from both the larger batch size and the contrastive loss.

References

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.

Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Hongchao Fang and Pengtao Xie. 2020. CERT: Contrastive self-supervised learning for language understanding. arXiv preprint arXiv:2005.12766.

John M. Giorgi, Osvald Nitski, Gary D. Bader, and Bo Wang. 2020. DeCLUTR: Deep contrastive learning for unsupervised textual representations. arXiv preprint arXiv:2006.03659.

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738.

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1681–1691.

Paras Jain, Ajay Jain, Tianjun Zhang, Pieter Abbeel, Joseph E. Gonzalez, and Ion Stoica. 2020. Contrastive code representation learning. arXiv preprint arXiv:2007.04973.

Robin Jia, Aditi Raghunathan, Kerem Göksel, and Percy Liang. 2019. Certified robustness to adversarial word substitutions. arXiv preprint arXiv:1909.00986.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Ryan Kiros, Yukun Zhu, Russ R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. arXiv preprint arXiv:2001.08210.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

Ishan Misra and Laurens van der Maaten. 2020. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6707–6717.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/languageunsupervised/language understanding paper.pdf.

Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. 2019. Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE International Conference on Computer Vision, pages 6002–6012.