
CLEAR: Contrastive Learning for Sentence Representation

Zhuofeng Wu1* Sinong Wang2 Jiatao Gu2 Madian Khabsa2 Fei Sun3 Hao Ma2

1 School of Information, University of Michigan, [email protected]
2 Facebook AI, {sinongwang, jgu, mkhabsa, haom}@fb.com
3 Institute of Computing Technology, Chinese Academy of Sciences, [email protected]

* Work done while the author was an intern at Facebook AI.

arXiv:2012.15466v1 [cs.CL] 31 Dec 2020

Abstract

Pre-trained language models have proven their unique power in capturing implicit language features. However, most pre-training approaches focus on word-level training objectives, while sentence-level objectives are rarely studied. In this paper, we propose Contrastive LEArning for sentence Representation (CLEAR), which employs multiple sentence-level augmentation strategies in order to learn a noise-invariant sentence representation. These augmentations include word and span deletion, reordering, and substitution. Furthermore, we investigate the key reasons that make contrastive learning effective through numerous experiments. We observe that different sentence augmentations during pre-training lead to different performance improvements on various downstream tasks. Our approach is shown to outperform multiple existing methods on both the SentEval and GLUE benchmarks.

1 Introduction

Learning a better sentence representation model has always been a fundamental problem in Natural Language Processing (NLP). Taking the mean of word embeddings as the representation of a sentence (also known as mean pooling) was a common baseline in the early stage. Later on, pre-trained models such as BERT (Devlin et al., 2019) proposed to insert a special token (the [CLS] token) during pre-training and to take its embedding as the representation of the sentence. Because of the tremendous improvement brought by BERT (Devlin et al., 2019), people seemed to agree that the CLS-token embedding is better than averaging word embeddings. Nevertheless, a recent paper, Sentence-BERT (Reimers and Gurevych, 2019), observed that averaging all output word vectors marginally outperforms the CLS-token embedding. Sentence-BERT's results suggest that models like BERT learn a better representation at the token level. One natural question is how to better learn sentence representations.

Inspired by the success of contrastive learning in computer vision (Zhuang et al., 2019; Tian et al., 2019; He et al., 2020; Chen et al., 2020; Misra and Maaten, 2020), we are interested in exploring whether it can also help language models produce better sentence representations. The key ingredient of contrastive learning is augmenting positive samples during training. However, data augmentation for text is not as fruitful as for images: an image can be augmented easily by rotating, cropping, resizing, or applying cutout (Chen et al., 2020), whereas only a few augmentation strategies for text have been studied in the literature (Giorgi et al., 2020; Fang and Xie, 2020). The main reason is that every word in a sentence may play an essential role in expressing the whole meaning; additionally, the order of the words also matters.

Most existing pre-trained language models (Devlin et al., 2019; Liu et al., 2019; Lewis et al., 2019) add different kinds of noise to the text and try to restore it at the word level. Sentence-level objectives are rarely studied. BERT (Devlin et al., 2019) combines the word-level loss, masked language modeling (MLM), with a sentence-level loss, next sentence prediction (NSP), and observes that MLM+NSP is essential for some downstream tasks. RoBERTa (Liu et al., 2019) drops the NSP objective during pre-training yet achieves much better performance on a variety of downstream tasks. ALBERT (Lan et al., 2019) proposes a self-supervised loss for Sentence-Order Prediction (SOP), which models inter-sentence coherence. Their work shows that coherence prediction is a better choice than topic prediction, which is what NSP effectively learns.
Figure 1: The proposed contrastive learning framework CLEAR. Two augmentations s̃1 = AUG(s, seed1) and s̃2 = AUG(s, seed2) of the original sentence s (with AUG drawn from the augmentation set A) are fed through a shared transformer encoder f(·); the [CLS] representations are mapped by a projection head g(·) to z1 and z2, whose agreement is maximized.

DeCLUTR (Giorgi et al., 2020) is the first work to combine Contrastive Learning (CL) with MLM in pre-training. However, it requires an extremely long input document (2048 tokens), which restricts the model to being pre-trained on limited data. Further, DeCLUTR trains from existing pre-trained models, so it remains unknown whether it could achieve the same performance when trained from scratch.

Drawing on the recent advances in pre-trained language models and contrastive learning, we propose a new framework, CLEAR, which combines a word-level MLM objective with a sentence-level CL objective to pre-train a language model. The MLM objective enables the model to capture word-level hidden features, while the CL objective gives the model the capacity to recognize sentences with similar meanings by training an encoder to minimize the distance between the embeddings of different augmentations of the same sentence. In this paper, we present a novel design of augmentations that can be used to pre-train a language model at the sentence level. Our main findings and contributions can be summarized as follows:

• We proposed and tested four basic sentence augmentations: random word deletion, span deletion, synonym substitution, and reordering, which fills a large gap in NLP about what kinds of augmentations can be used in contrastive learning.

• We showed that a model pre-trained by our proposed method outperforms several strong baselines (including RoBERTa and BERT) on both the GLUE (Wang et al., 2018) and SentEval (Conneau and Kiela, 2018) benchmarks. For example, we showed a +2.2% absolute improvement on 8 GLUE tasks and a +5.7% absolute improvement on 7 SentEval semantic textual similarity tasks compared to the RoBERTa model.

2 Related Work

Three lines of literature are closely related to our work: sentence representation, large-scale pre-trained language representation models, and contrastive learning.

2.1 Sentence Representation

Learning the representation of a sentence has been studied in many existing works. Applying various pooling strategies to word embeddings as the representation of a sentence is a common baseline (Iyyer et al., 2015; Shen et al., 2018; Reimers and Gurevych, 2019). Skip-Thoughts (Kiros et al., 2015) trains an encoder-decoder model that tries to reconstruct the surrounding sentences. Quick-Thoughts (Logeswaran and Lee, 2018) trains an encoder-only model with the ability to select the correct context of a sentence out of other contrastive sentences. Later on, pre-trained language models such as BERT (Devlin et al., 2019) proposed to use a manually inserted token (the [CLS] token) as the representation of the whole sentence, and they became the new state of the art on a variety of downstream tasks. A recent paper, Sentence-BERT (Reimers and Gurevych, 2019), compares the averaged BERT embeddings with the CLS-token embedding and surprisingly finds that computing the mean of all output vectors at the last layer of BERT marginally outperforms the CLS-token embedding.

2.2 Large-scale Pre-trained Language Representation Models

Deep pre-trained language models have proven their power in capturing implicit language features, even with different model architectures, pre-training tasks, and loss functions. Two of the early works are GPT (Radford et al., 2018) and BERT (Devlin et al., 2019): GPT uses a left-to-right Transformer while BERT designs a bidirectional Transformer, and both set an impressive new state of the art on many downstream tasks. Following this observation, a tremendous number of research works have since been published in the pre-trained language model domain. Some extend previous models to a sequence-to-sequence structure (Song et al., 2019; Lewis et al., 2019; Liu et al., 2020), which strengthens the model's capability for language generation. Others (Yang et al., 2019; Liu et al., 2019; Clark et al., 2020) explore different pre-training objectives to either improve the model's performance or accelerate pre-training.

2.3 Contrastive Learning

Contrastive learning has become a rising domain because of its significant success on various computer vision tasks and datasets. Several researchers (Zhuang et al., 2019; Tian et al., 2019; Misra and Maaten, 2020; Chen et al., 2020) proposed to make the representations of different augmentations of an image agree with each other, and they showed positive results. The main difference between these works is their varying definitions of image augmentation.

Researchers in the NLP domain have also started to work on finding suitable augmentations for text. CERT (Fang and Xie, 2020) applies back-translation to create augmentations of original sentences, while DeCLUTR (Giorgi et al., 2020) regards different spans inside one document as similar to each other. Our model differs from CERT in adopting an encoder-only structure, which avoids the noise brought by a decoder. Further, unlike DeCLUTR, which only tests one augmentation and trains the model from an existing pre-trained model, we pre-train all models from scratch, which provides a straightforward comparison with the existing pre-trained models.

3 Method

This section proposes a novel framework and several sentence augmentation methods for contrastive learning in NLP.

3.1 The Contrastive Learning Framework

Borrowing from SimCLR (Chen et al., 2020), we propose a new contrastive learning framework, named CLEAR, to learn sentence representations. There are four main components in CLEAR, as outlined in Figure 1.

• An augmentation component AUG(·), which applies a random augmentation to the original sentence. For each original sentence s, we generate two random augmentations s̃1 = AUG(s, seed1) and s̃2 = AUG(s, seed2), where seed1 and seed2 are two random seeds. Note that, to test each augmentation's effect in isolation, we adopt the same augmentation type to generate s̃1 and s̃2. Testing models that mix augmentation types requires more computational resources, which we plan to leave for future work. We detail the proposed augmentation set A in Section 3.3.

• A transformer-based encoder f(·) that learns the representations H1 = f(s̃1) and H2 = f(s̃2) of the input augmented sentences. Any encoder that learns a sentence representation can be used here to replace our encoder. We choose the current state-of-the-art architecture, the transformer (Vaswani et al., 2017), and use the representation of a manually inserted token as the vector of the sentence (i.e., [CLS], as used in BERT and RoBERTa).

• A nonlinear neural network projection head g(·) that maps the encoded augmentations H1 and H2 to the vectors z1 = g(H1) and z2 = g(H2) in a new space. According to observations in SimCLR (Chen et al., 2020), adding a nonlinear projection head can significantly improve the representation quality for images.
Figure 2: Four sentence augmentation methods in the proposed contrastive learning framework CLEAR. (a) Word Deletion: Tok1, Tok2, and Tok4 are deleted; the sentence after augmentation is [Tok[del], Tok3, Tok[del], Tok5, ..., TokN]. (b) Span Deletion: the span [Tok1, Tok2, Tok3, Tok4] is deleted; the sentence after augmentation is [Tok[del], Tok5, ..., TokN]. (c) Reordering: the two spans [Tok1, Tok2] and [Tok4] are reordered; the sentence after augmentation is [Tok4, Tok3, Tok1, Tok2, Tok5, ..., TokN]. (d) Synonym Substitution: Tok2, Tok3, and TokN are substituted by their synonyms Tok'2, Tok'3, and Tok'N; the sentence after augmentation is [Tok1, Tok'2, Tok'3, Tok4, Tok5, ..., Tok'N].

• A contrastive learning loss function defined for a contrastive prediction task, i.e., trying to predict the positive augmentation pair (s̃1, s̃2) within the set {s̃}. We construct the set {s̃} by randomly augmenting every sentence in a minibatch twice (assuming a minibatch is a set {s} of size N), which yields a set {s̃} of size 2N. The two variants derived from the same original sentence form a positive pair, while all other instances from the same minibatch are regarded as their negative samples. This contrastive loss has been used extensively in previous work (Wu et al., 2018; Chen et al., 2020; Giorgi et al., 2020; Fang and Xie, 2020). The loss function for a positive pair is defined as:

l(i, j) = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbf{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}    (1)

where 1_{[k ≠ i]} is an indicator function that equals 1 iff k ≠ i, τ is a temperature parameter, and sim(u, v) = u^\top v / (\|u\|_2 \|v\|_2) denotes the cosine similarity of two vectors u and v. The overall contrastive learning loss is defined as the sum of the losses of all positive pairs in a minibatch:

L_{CL} = \sum_{i=1}^{2N} \sum_{j=1}^{2N} m(i, j)\, l(i, j)    (2)

where m(i, j) is a function that returns 1 when i and j form a positive pair and 0 otherwise.
and j is a positive pair, returns 0 otherwise.
from the same original sentence form the
positive pair, while all other instances from 3.2 The Combined Loss for Pre-training
the same minibatch are regarded as negative Similar to (Giorgi et al., 2020), for the purpose of
samples for them. The contrastive learning grabbing both token-level and sentence-level fea-
loss has been tremendously used in previ- tures, we use a combined loss of MLM objective
ous work (Wu et al., 2018; Chen et al., 2020; and CL objective to get the overall loss:
Giorgi et al., 2020; Fang and Xie, 2020). The
Ltotal = LMLM + LCL (3)
loss function for a positive pair is defined as:
where LMLM is calculated through predicting
exp (sim(zi , zj )/τ )
the random-masked tokens in set {s} as de-
l(i, j)=− log P2N
k=1 1[k6=i] exp (sim(zi , zk )/τ )
scribed in BERT and RoBERTa (Devlin et al.,
(1) 2019; Liu et al., 2019). Our pre-training target is
where 1[k6=i] is the indicator function to judge to minimize the Ltotal .
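The following sketch ties Figure 1 to Equation (3) in one hypothetical pre-training step. The callables augment, encoder, mlm_head, and mask_tokens are placeholders for the components described in Sections 3.1 and 4.1, contrastive_loss is the function sketched above, and the two-layer projection head mirrors SimCLR rather than any architecture specified in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Nonlinear projection g(.) applied to the [CLS] representation (SimCLR-style assumption)."""
    def __init__(self, hidden: int = 768, proj: int = 768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden, proj), nn.ReLU(), nn.Linear(proj, proj))

    def forward(self, h_cls: torch.Tensor) -> torch.Tensor:
        return self.net(h_cls)

def pretraining_step(token_ids, augment, encoder, mlm_head, proj_head, mask_tokens):
    """One step of the combined objective L_total = L_MLM + L_CL (Eq. 3).

    token_ids: (N, L) batch of original sentences as token ids; the other
    arguments are placeholder callables standing in for the framework components.
    """
    # Sentence-level branch: two augmented views per sentence (same augmentation type).
    view1 = augment(token_ids)                                  # s~1 = AUG(s, seed1)
    view2 = augment(token_ids)                                  # s~2 = AUG(s, seed2)
    # Interleave so that rows 2i and 2i+1 are the positive pair expected by contrastive_loss.
    views = torch.stack([view1, view2], dim=1).reshape(2 * view1.size(0), -1)
    hidden = encoder(views)                                     # (2N, L, d) token representations
    z = proj_head(hidden[:, 0])                                 # project the [CLS] vector -> (2N, d)
    cl_loss = contrastive_loss(z)                               # Eq. (2), from the sketch above

    # Word-level branch: standard 15% MLM on the original (masked) sentences.
    masked, labels = mask_tokens(token_ids)
    logits = mlm_head(encoder(masked))                          # (N, L, vocab)
    mlm_loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                               labels.view(-1), ignore_index=-100)
    return mlm_loss + cl_loss                                   # Eq. (3)
```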
3.3 Design Rationale for Sentence Augmentations

Data augmentation is crucial for learning image representations (Tian et al., 2019; Jain et al., 2020). However, in language modeling, it remains unknown whether data (sentence) augmentation benefits representation learning and what kinds of data augmentation can be applied to text. To answer these questions, we explore and test four basic augmentations (shown in Figure 2) and their combinations in our experiments. We believe more potential augmentations exist, which we plan to leave for future exploration.

One type of augmentation we consider is deletion, which is based on the hypothesis that deleting some content from a sentence does not affect its original semantic meaning too much. In some cases, deleting certain words may lead the sentence to a different meaning (e.g., the word not). However, we believe that including a proper amount of such noise makes the model more robust. We consider two different deletions, i.e., word deletion and span deletion.

• Word deletion (shown in Figure 2a) randomly selects tokens in the sentence and replaces them with a special token [DEL], which is similar to the token [MASK] in BERT (Devlin et al., 2019).

• Span deletion (shown in Figure 2b) applies the deletion at the span level. Generally, span deletion is a special case of word deletion that focuses on deleting consecutive words.

To prevent the model from trivially distinguishing the two augmentations by the remaining words at the same locations, we collapse consecutive [DEL] tokens into a single token.

Reordering (shown in Figure 2c) is another widely studied augmentation that preserves the original sentence's features. BART (Lewis et al., 2019) has explored restoring the original sentence from a randomly reordered sentence. In our implementation, we randomly sample several pairs of spans and switch them pairwise to construct the reordering augmentation.

Substitution (shown in Figure 2d) has been proven efficient in improving a model's robustness (Jia et al., 2019). Following their work, we sample some words and replace them with synonyms to construct an augmentation. The synonym list comes from the vocabulary they used. In our pre-training corpus, roughly 40% of tokens have at least one similar-meaning token in the list.
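The sketch below is one possible implementation of the four augmentations of Figure 2 on a list of tokens, including the collapsing of consecutive [DEL] tokens described above. The [DEL] literal, the uniform random sampling, and the synonym dictionary argument are our assumptions; the paper does not release this code.

```python
import random
from typing import Dict, List

DEL = "[DEL]"

def collapse_del(tokens: List[str]) -> List[str]:
    """Merge consecutive [DEL] tokens into one, as described in Section 3.3."""
    out = []
    for tok in tokens:
        if tok == DEL and out and out[-1] == DEL:
            continue
        out.append(tok)
    return out

def word_deletion(tokens: List[str], ratio: float) -> List[str]:
    drop = set(random.sample(range(len(tokens)), int(len(tokens) * ratio)))
    return collapse_del([DEL if i in drop else t for i, t in enumerate(tokens)])

def span_deletion(tokens: List[str], n_spans: int, span_ratio: float) -> List[str]:
    tokens = list(tokens)
    span_len = max(1, int(len(tokens) * span_ratio))
    for _ in range(n_spans):
        start = random.randrange(max(1, len(tokens) - span_len + 1))
        tokens[start:start + span_len] = [DEL] * span_len
    return collapse_del(tokens)

def reordering(tokens: List[str], n_pairs: int, span_ratio: float) -> List[str]:
    tokens = list(tokens)
    span_len = max(1, int(len(tokens) * span_ratio))
    if len(tokens) < 2 * span_len + 1:
        return tokens                      # too short to swap two disjoint spans
    for _ in range(n_pairs):
        i, j = sorted(random.sample(range(len(tokens) - span_len + 1), 2))
        if i + span_len <= j:              # only swap non-overlapping spans
            tokens[i:i + span_len], tokens[j:j + span_len] = (
                tokens[j:j + span_len], tokens[i:i + span_len])
    return tokens

def synonym_substitution(tokens: List[str], ratio: float,
                         synonyms: Dict[str, List[str]]) -> List[str]:
    out = list(tokens)
    candidates = [i for i, t in enumerate(tokens) if t in synonyms]
    chosen = random.sample(candidates, min(len(candidates), int(len(tokens) * ratio)))
    for i in chosen:
        out[i] = random.choice(synonyms[out[i]])
    return out
```

In actual pre-training these operations would act on subword token ids rather than raw word strings; the string version above is only meant to make the four transformations easy to follow.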
4 Experiment

This section presents empirical experiments that compare the proposed methods with various baselines and alternative approaches.

4.1 Setup

Model configuration: We use the Transformer (12 layers, 12 heads, and 768 hidden size) as our primary encoder (Vaswani et al., 2017). Models are pre-trained for 500K updates, with mini-batches containing 8,192 sequences of maximum length 512 tokens. For the first 24,000 steps, the learning rate is warmed up to a peak value of 6e-4 and then linearly decayed for the rest of training. All models are optimized by Adam (Kingma and Ba, 2014) with β1 = 0.9, β2 = 0.98, ε = 1e-6, and L2 weight decay of 0.01. We use 0.1 dropout on all layers and in attention. All of the models are pre-trained on 256 NVIDIA Tesla V100 32GB GPUs.

Pre-training data: We pre-train all the models on a combination of the BookCorpus (Zhu et al., 2015) and English Wikipedia datasets, the data BERT used for pre-training. For more statistics of the datasets and processing details, one can refer to BERT (Devlin et al., 2019).

Hyperparameters for MLM: For calculating the MLM loss, we randomly mask 15% of the tokens of the input text s and use the surrounding tokens to predict them. To close the gap between fine-tuning and pre-training, we also adopt BERT's 10%-random-replacement and 10%-keep-unchanged setting for the masked tokens.
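For reference, a small sketch of the 15% masking with the 80/10/10 split just described, assuming integer token ids and a generic vocabulary size; it mirrors the published BERT recipe rather than CLEAR-specific code.

```python
import torch

def mask_tokens(input_ids: torch.Tensor, mask_id: int, vocab_size: int,
                mask_prob: float = 0.15):
    """Return (masked_inputs, labels) for MLM; labels are -100 at unmasked positions."""
    labels = input_ids.clone()
    selected = torch.rand_like(input_ids, dtype=torch.float) < mask_prob
    labels[~selected] = -100                         # loss is computed only on selected tokens

    masked = input_ids.clone()
    roll = torch.rand_like(input_ids, dtype=torch.float)
    masked[selected & (roll < 0.8)] = mask_id        # 80%: replace with [MASK]
    random_ids = torch.randint_like(input_ids, vocab_size)
    use_random = selected & (roll >= 0.8) & (roll < 0.9)
    masked[use_random] = random_ids[use_random]      # 10%: random token
    # remaining 10% of selected positions keep the original token unchanged
    return masked, labels
```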
Hyperparameters for CL: To compute the CL loss, we set up different hyperparameters for each augmentation (a usage sketch follows the list):

• For Word Deletion (del-word), we delete 70% of the tokens.

• For Span Deletion (del-span), we delete 5 spans (each 5% of the input text length).

• For Reordering (reorder), we randomly pick 5 pairs of spans (each also roughly 5% of the length) and switch the spans pairwise.

• For Substitution (subs), we randomly select 30% of the tokens and replace each with one of its similar-meaning tokens.
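For concreteness, this is how the settings above could map onto the augmentation functions sketched in Section 3.3; the function names and argument choices are carried over from that hypothetical sketch.

```python
# Hypothetical driver reusing the Section 3.3 sketch functions.
def augment(tokens, kind, synonyms=None):
    if kind == "del-word":
        return word_deletion(tokens, ratio=0.70)                     # delete 70% of tokens
    if kind == "del-span":
        return span_deletion(tokens, n_spans=5, span_ratio=0.05)     # 5 spans of 5% length
    if kind == "reorder":
        return reordering(tokens, n_pairs=5, span_ratio=0.05)        # swap 5 pairs of 5% spans
    if kind == "subs":
        return synonym_substitution(tokens, ratio=0.30, synonyms=synonyms)
    raise ValueError(f"unknown augmentation: {kind}")
```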
Table 1: Performance of competing methods evaluated on the GLUE dev set. Following GLUE's setting (Wang et al., 2018), unweighted average accuracy on the matched and mismatched dev sets is reported for MNLI. The unweighted average of accuracy and F1 is reported for MRPC and QQP. The unweighted average of Pearson and Spearman correlation is reported for STS-B. The Matthews correlation is reported for CoLA. For all other tasks we report accuracy.

Method MNLI QNLI QQP RTE SST-2 MRPC CoLA STS Avg

Baselines
BERT-base (Devlin et al., 2019) 84.0 89.0 89.1 61.0 93.0 86.3 57.3 89.5 81.2
RoBERTa-base (Liu et al., 2019) 87.2 93.2 88.2 71.8 94.4 87.8 56.1 89.4 83.5

MLM+1-CL-objective
MLM+ del-word 86.8 93.0 90.2 79.4 94.2 89.7 62.1 90.5 85.7
MLM+ del-span 87.3 92.8 90.1 79.8 94.4 89.9 59.8 90.3 85.6
MLM+2-CL-objective
MLM+ subs+ del-word 87.3 93.1 90.0 73.3 93.7 90.2 62.1 90.1 85.0
MLM+ subs+ del-span 87.0 93.4 90.3 74.4 94.3 90.5 63.3 90.5 85.5
MLM+ del-word+ reorder 87.0 92.7 89.5 76.5 94.5 90.6 59.1 90.4 85.0
MLM+ del-span+ reorder 86.7 92.9 90.0 78.3 94.5 89.2 64.3 89.8 85.7

Some of the above hyperparameters were lightly tuned on the WiKiText-103 dataset (Merity et al., 2016), trained for 100 epochs and evaluated on the GLUE dev benchmark. For example, we find that the 70% deletion model performs best among the {30%, 40%, 50%, 60%, 70%, 80%, 90%} deletion models. Models using mixed augmentations, like the MLM+2-CL-objective models in Table 1, use the same optimized hyperparameters as the corresponding single-augmentation models. For instance, our notation MLM+subs+del-span represents a model combining the MLM loss with the CL loss: for MLM, it masks 15% of the tokens; for CL, it first substitutes 30% of the tokens and then deletes 5 spans to generate the augmented sentences.

Note that the hyperparameters we used might not be the most optimized ones. It is unknown whether hyperparameters optimized on a 1-CL-objective model carry over consistently to a 2-CL-objective model. Additionally, it is unclear whether the hyperparameters optimized for WiKiText-103 remain optimal on the BookCorpus and English Wikipedia datasets. However, it is hard to tune every possible hyperparameter due to the extensive computational resources required for pre-training. We leave these questions to future work.

4.2 GLUE Results

We mainly evaluate all the models on the General Language Understanding Evaluation (GLUE) benchmark development set (Wang et al., 2018). GLUE is a benchmark containing several different types of NLP tasks: natural language inference (MNLI, QNLI, and RTE), similarity (QQP, MRPC, STS), sentiment analysis (SST), and linguistic acceptability (CoLA). It provides a comprehensive evaluation for pre-trained language models.

To fit the different downstream tasks' requirements, we follow RoBERTa's hyperparameters to fine-tune our model for the various tasks. Specifically, we add an extra fully connected layer and then fine-tune the whole model on each training set (a sketch of this classification head is given at the end of this subsection).

The primary baselines we include are BERT-base and RoBERTa-base. The results for BERT-base are from huggingface's reimplementation.1 A fairer comparison is with RoBERTa-base, since we use the same hyperparameters RoBERTa-base used for the MLM loss. Note that because our models all combine two losses, it is still not entirely fair to compare an MLM-only model with an MLM+CL model. To address this, we set up two other baselines in Section 5.1 for a stricter comparison: one combines two MLM losses, and the other adopts a double batch size.

1 https://huggingface.co/transformers/v1.1.0/examples.html
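A minimal sketch of the fine-tuning setup described above: one extra fully connected layer on the [CLS] representation, with the whole model updated. The class name, dropout value, and argument names are our placeholders, not details from the paper.

```python
import torch
import torch.nn as nn

class GlueClassifier(nn.Module):
    """Pre-trained encoder plus one extra fully connected layer on [CLS]."""
    def __init__(self, encoder: nn.Module, hidden: int = 768, num_labels: int = 2):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(hidden, num_labels)   # the extra FC layer

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(input_ids)        # (batch, seq_len, hidden)
        cls = hidden[:, 0]                      # [CLS] token representation
        return self.classifier(self.dropout(cls))

# During fine-tuning, all parameters (encoder included) receive gradients:
# loss = nn.functional.cross_entropy(model(batch_ids), labels); loss.backward()
```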
Table 2: Performance of competing methods evaluated on SentEval. All results are pre-trained on BookCorpus
and English Wikipedia datasets for 500k steps.

Method SICK-R STS-B STS12 STS13 STS14 STS15 STS16 Avg

Baselines
RoBERTa-base-mean 74.1 65.6 47.2 38.3 46.7 55.0 49.5 53.8
RoBERTa-base-[CLS] 75.9 71.9 47.4 37.5 47.9 55.1 57.6 56.1
MLM+1-CL-objective
MLM+ del-word-mean 75.9 69.0 50.6 40.0 50.2 58.9 52.4 56.7
MLM+ del-span-mean 71.0 62.6 49.3 41.7 48.9 58.1 52.3 54.8
MLM+ del-word-[CLS] 77.1 71.6 50.6 44.5 48.3 58.4 56.1 58.1
MLM+ del-span-[CLS] 62.7 57.4 34.4 20.4 24.3 32.0 31.5 37.5
MLM+2-CL-objective
MLM+ del-word+ reorder-mean 75.8 66.2 51.1 45.7 51.8 61.3 57.0 58.4
MLM+ del-span+ reorder-mean 75.4 67.8 48.3 50.3 54.9 60.4 56.8 59.1
MLM+ subs+ del-word-mean 73.6 63.4 44.6 39.8 50.1 55.5 49.6 53.8
MLM+ subs+ del-span-mean 75.5 67.0 48.3 45.0 54.6 60.9 58.5 58.5
MLM+ del-word+ reorder-[CLS] 71.9 63.8 41.9 30.9 37.4 48.9 52.1 49.6
MLM+ del-span+ reorder-[CLS] 75.0 68.7 49.4 54.3 57.6 64.0 61.4 61.5
MLM+ subs+ del-word-[CLS] 73.6 62.9 44.5 35.8 47.6 55.8 59.6 54.3
MLM+ subs+ del-span-[CLS] 75.6 72.5 49.0 48.9 57.4 63.6 65.6 61.8

As we can see in Table 1, several of our proposed models outperform the baselines on GLUE. Note that different tasks adopt different evaluation metrics; our two best models, MLM+del-word and MLM+del-span+reorder, both improve over the best baseline, RoBERTa-base, by 2.2% in average score. A more important observation is that the best performance on every task comes from one of our proposed models. On CoLA and RTE, our best model exceeds the baseline by 7.0% and 8.0%, respectively. Further, we also find that different downstream tasks benefit from different augmentations; we give a more specific analysis in Section 5.2.

One notable point is that we do not show results for MLM+subs, MLM+reorder, or MLM+subs+reorder in Table 1. We observe that pre-training for these three models either converges too quickly or suffers from gradient explosion, which indicates that these three augmentations on their own are too easy to distinguish.

4.3 SentEval Results for Semantic Textual Similarity Tasks

SentEval is a popular benchmark for evaluating general-purpose sentence representations (Conneau and Kiela, 2018). Its specialty is that it does not perform fine-tuning as GLUE does. We evaluate the performance of our proposed methods on the common Semantic Textual Similarity (STS) tasks in SentEval. Note that some previous models on the SentEval leaderboard (e.g., Sentence-BERT (Reimers and Gurevych, 2019)) train on specific datasets such as Stanford NLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2017), which makes a direct comparison difficult. To simplify matters, we compare our proposed models with RoBERTa-base directly on SentEval. According to Sentence-BERT, using the mean of all output vectors in the last layer is more effective than using the CLS-token output, so we test both pooling strategies for each model (both are sketched below).

From Table 2, we observe that the mean-pooling strategy does not show much of an advantage. In many cases, CLS-pooling is better than mean-pooling for our proposed models. The underlying reason is that contrastive learning directly updates the representation of the [CLS] token. Beyond that, we find that adding the CL loss makes the model especially good at the Semantic Textual Similarity (STS) tasks, beating the best baseline by a large margin (+5.7%). We attribute this to the fact that the contrastive pre-training task is to find similar sentence pairs, which aligns well with the STS tasks and could explain why our proposed models show such large improvements on STS.
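The two pooling strategies compared in Table 2, sketched for clarity; the attention-mask handling for padding is an assumption on our part, not something the paper spells out.

```python
import torch

def cls_pooling(last_hidden: torch.Tensor) -> torch.Tensor:
    """[CLS]-pooling: take the first token's vector. (batch, seq, dim) -> (batch, dim)"""
    return last_hidden[:, 0]

def mean_pooling(last_hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pooling: average all token vectors of the last layer, ignoring padding."""
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq, 1)
    summed = (last_hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts
```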
Table 3: Ablation study for several methods evaluated on GLUE dev set. All results are pre-trained on wiki-103
data for 500 epochs.

Method MNLI-m QNLI QQP RTE SST-2 MRPC CoLA STS Avg
RoBERTa-base 80.4 87.5 87.4 61.4 91.4 82.4 38.9 81.9 76.4
MLM-variant
Double-batch RoBERTa-base 80.3 88.0 87.1 59.9 91.9 82.1 43.0 82.0 76.8
Double MLM RoBERTA-base 80.5 87.6 87.3 57.4 90.4 77.7 42.2 83.0 75.8
MLM+CL-objective
MLM+ del-span 80.6 88.8 87.3 62.1 92.1 77.8 44.1 81.4 76.8
MLM+ del-span + reorder 81.1 88.7 87.5 58.1 90.0 80.4 43.3 87.4 77.1
MLM+ subs + del-word + reorder 80.5 87.7 87.3 59.6 90.4 80.2 45.1 87.1 77.2

5 Discussion

This section presents an ablation study comparing the CL loss and the MLM loss, and discusses observations about what different augmentations learn.

5.1 Ablation Study

Our proposed CL-based models outperform MLM-based models; one remaining question is where the benefit comes from. Does it come from the CL loss, or from the larger effective batch (since calculating the CL loss requires storing extra information per batch)? To answer this question, we set up two extra baselines (sketched below): Double MLM RoBERTa-base adopts an MLM+MLM loss, where each MLM is performed on a different mask of the same original sentence; Double-batch RoBERTa-base uses a single MLM loss with a double-size batch.
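A compact sketch of how the Double MLM baseline differs from the combined objective; it reuses the hypothetical mask_tokens helper from the Section 4.1 sketch and reflects our reading of the described setup, not released code.

```python
import torch

# Double MLM RoBERTa-base: two independent maskings of the same sentences,
# so the loss is L_MLM(mask_1) + L_MLM(mask_2) with no contrastive term.
def double_mlm_loss(batch_ids, encoder, mlm_head, mask_id, vocab_size):
    total = 0.0
    for _ in range(2):  # two different random masks of the same batch
        masked, labels = mask_tokens(batch_ids, mask_id, vocab_size)
        logits = mlm_head(encoder(masked))
        total = total + torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100)
    return total

# Double-batch RoBERTa-base instead keeps the single MLM loss but doubles the batch
# size, matching the number of sequences seen per step by the CL-based models.
```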
Due to the limitation of computational resources, we conduct the ablation study on a smaller pre-training corpus, the WiKiText-103 dataset (Merity et al., 2016). All the models listed in Table 3 are pre-trained for 500 epochs on 64 NVIDIA Tesla V100 32GB GPUs, and three of our proposed models are reported in the table. The overall performance of the MLM variants does not differ much from the original RoBERTa-base, with a +0.4% increase in average score for Double-batch RoBERTa-base, which confirms the idea that a larger batch benefits representation training, as proposed by previous work (Liu et al., 2019). Yet the best-performing baseline is still not as good as our best proposed model, which tells us that the proposed model does not benefit solely from a larger batch; the CL loss also helps.

5.2 Different Augmentations Learn Different Features

In Table 1, we find an interesting phenomenon: different proposed models are good at specific tasks.

One example is that MLM+subs+del-span helps the model deal well with similarity and paraphrase tasks. On QQP and STS it achieves the highest score, and on MRPC it ranks second. We infer that the strength of MLM+subs+del-span on this kind of task comes from synonym substitution translating the original sentence into similar-meaning sentences, while deleting different spans exposes a greater variety of similar sentences; combining them enhances the model's capacity to deal with many unseen sentence pairs.

We also notice that MLM+del-span achieves good performance on inference tasks (MNLI, QNLI, RTE). The underlying reason is that, with span deletion, the model has already been pre-trained to infer other similar sentences. The ability to identify similar sentence pairs helps it recognize contradictions, which narrows the gap between the pre-training task and these downstream tasks.

Overall, we observe that different augmentations learn different features, and some augmentations are especially good at certain downstream tasks. Designing task-specific augmentations, or exploring meta-learning to adaptively select different CL objectives, is a promising future direction.
6 Conclusion

In this work, we presented an instantiation of contrastive sentence representation learning. By carefully designing and testing different data augmentations and their combinations, we demonstrate the proposed methods' effectiveness on the GLUE and SentEval benchmarks under a diverse pre-training corpus. The experimental results indicate that the pre-trained model is more robust when it leverages adequate sentence-level supervision. More importantly, we reveal that different augmentations teach the model different features. Finally, we demonstrate that the performance improvement comes from both the larger batch size and the contrastive loss.
References

Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.

Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.

Alexis Conneau and Douwe Kiela. 2018. Senteval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Hongchao Fang and Pengtao Xie. 2020. Cert: Contrastive self-supervised learning for language understanding. arXiv preprint arXiv:2005.12766.

John M Giorgi, Osvald Nitski, Gary D Bader, and Bo Wang. 2020. Declutr: Deep contrastive learning for unsupervised textual representations. arXiv preprint arXiv:2006.03659.

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738.

Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1681–1691.

Paras Jain, Ajay Jain, Tianjun Zhang, Pieter Abbeel, Joseph E Gonzalez, and Ion Stoica. 2020. Contrastive code representation learning. arXiv preprint arXiv:2007.04973.

Robin Jia, Aditi Raghunathan, Kerem Göksel, and Percy Liang. 2019. Certified robustness to adversarial word substitutions. arXiv preprint arXiv:1909.00986.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Ryan Kiros, Yukun Zhu, Russ R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems, pages 3294–3302.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. arXiv preprint arXiv:2001.08210.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

Ishan Misra and Laurens van der Maaten. 2020. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6707–6717.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/languageunsupervised/language understanding paper.pdf.

Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084.

Dinghan Shen, Guoyin Wang, Wenlin Wang, Martin Renqiang Min, Qinliang Su, Yizhe Zhang, Chunyuan Li, Ricardo Henao, and Lawrence Carin. 2018. Baseline needs more love: On simple word-embedding-based models and associated pooling mechanisms. arXiv preprint arXiv:1805.09843.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. Mass: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450.

Yonglong Tian, Dilip Krishnan, and Phillip Isola. 2019. Contrastive multiview coding. arXiv preprint arXiv:1906.05849.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Adina Williams, Nikita Nangia, and Samuel R Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.

Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. 2018. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27.

Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. 2019. Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE International Conference on Computer Vision, pages 6002–6012.
