
[Figure 2: four-panel illustration. (a) Baseline: NER with a linear classifier; (b) Prototype-based method; (c) Noisy supervised pre-training; (d) Self-training.]
Figure 2: Illustration of different methods for few-shot NER. In this example, each token in the input sentence is categorized into one of the four entity types. (a) A typical NER system, where a linear classifier is built on top of unsupervised pre-trained Transformer-based networks such as BERT/RoBERTa. (b) A prototype set is constructed by averaging the features of all tokens belonging to a given entity type in the support set (e.g., the prototype for Person is an average of three tokens: Mr., Bush and Jobs). For a token in the query set, its distances from the different prototypes are computed, and the model is trained to maximize the likelihood of assigning the query token to its target prototype. (c) The Wikipedia dataset is employed for supervised pre-training, whose entity types are related to but different from those of the downstream task (e.g., Musician and Artist are more fine-grained types of Person in the downstream task). The associated types on each token can be noisy. (d) Self-training: an NER system (teacher model) trained on a small labeled dataset is used to predict soft labels for sentences in a large unlabeled dataset. The union of the predicted dataset and the original dataset is used to train a student model.
How to leverage unlabeled in-domain sentences in a semi-supervised manner? Note that these three directions are complementary to each other and can be used jointly to further expand the methodology space in Figure 1.
3.1 Prototype-based Methods
Meta-learning (Ravi and Larochelle, 2017) has shown promising results for few-shot image classification (Tian et al., 2020) and sentence classification (Yu et al., 2018; Geng et al., 2019). It is natural to adapt this idea to few-shot NER. The core idea is to use an episodic classification paradigm to simulate few-shot settings during model training. Specifically, in each episode, $M$ entity types (usually $M < |\mathcal{Y}|$) are randomly sampled from $\mathcal{D}_L$ to form a support set $\mathcal{S} = \{(X_i, Y_i)\}_{i=1}^{M \times K}$ ($K$ sentences per type) and a query set $\mathcal{Q} = \{(\hat{X}_i, \hat{Y}_i)\}_{i=1}^{M \times K'}$ ($K'$ sentences per type).
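As a concrete illustration of this episodic sampling, the following minimal Python sketch builds one episode; it is not the paper's implementation, and the mapping `sentences_by_type` (entity type to labeled sentences) is a hypothetical data structure assumed only for this example.

```python
# Minimal episode-sampling sketch (illustrative only, not the authors' code).
# Assumes a hypothetical dict `sentences_by_type`: entity type -> labeled sentences.
import random

def sample_episode(sentences_by_type, M, K, K_query):
    """Draw M entity types, then K support and K' query sentences per type."""
    types = random.sample(list(sentences_by_type), M)
    support, query = [], []
    for t in types:
        picked = random.sample(sentences_by_type[t], K + K_query)
        support.extend(picked[:K])   # K sentences of type t for the support set
        query.extend(picked[K:])     # K' sentences of type t for the query set
    return types, support, query
```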
We build our method on the prototypical network (Snell et al., 2017), which introduces the notion of prototypes, representing entity types as vectors in the same representation space as individual tokens. To construct the prototype for the $m$-th entity type, $c_m$, the average of the representations of all tokens belonging to this type in the support set $\mathcal{S}$ is computed:
$$
c_m = \frac{1}{|\mathcal{S}_m|} \sum_{x \in \mathcal{S}_m} f_{\theta_0}(x), \tag{3}
$$
where $\mathcal{S}_m$ is the set of tokens of the $m$-th type in $\mathcal{S}$, and $f_{\theta_0}$ is defined in (2). For an input token $x \in \mathcal{Q}$ from the query set, its prediction distribution is computed by a softmax over its distances to all the entity prototypes. For example, the prediction probability for the $m$-th prototype is:
$$
q(y = \mathbb{I}_m \mid x) = \frac{\exp\left(-d(f_{\theta_0}(x), c_m)\right)}{\sum_{m'} \exp\left(-d(f_{\theta_0}(x), c_{m'})\right)} \tag{4}
$$
where $\mathbb{I}_m$ is the one-hot vector with 1 for the $m$-th coordinate and 0 elsewhere, and $d(f_{\theta_0}(x), c_m) = \|f_{\theta_0}(x) - c_m\|_2$ is used in our implementation.
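To make Eqs. (3) and (4) concrete, here is a minimal PyTorch sketch (not the released implementation). It assumes `support_feats[m]` already holds the $f_{\theta_0}$ embeddings of all support tokens of the $m$-th type, and `query_feat` is the embedding of a single query token.

```python
# Sketch of Eqs. (3)-(4); assumed inputs: support_feats is a list of M tensors,
# each of shape (n_m, hidden), and query_feat has shape (hidden,).
import torch

def build_prototypes(support_feats):
    # Eq. (3): each prototype is the mean of its type's support-token embeddings.
    return torch.stack([feats.mean(dim=0) for feats in support_feats])  # (M, hidden)

def query_distribution(query_feat, protos):
    # Eq. (4): softmax over negative Euclidean distances to the M prototypes.
    dists = torch.norm(query_feat.unsqueeze(0) - protos, dim=-1)  # (M,)
    return torch.softmax(-dists, dim=-1)
```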
We provide a simple example to illustrate the prototype method in Figure 2(b). In each training iteration, a new episode is sampled, and the model parameter $\theta_0$ is updated by plugging (4) into (1). In the testing phase, the label of a new token $x$ is assigned using the nearest-neighbor criterion $\arg\min_m d(f_{\theta_0}(x), c_m)$.
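Continuing the sketch above, one training step and the test-time assignment could look as follows; treating the objective in (1) as a token-level cross-entropy here is a simplifying assumption made only for illustration.

```python
# Training: plug the query distribution of Eq. (4) into a cross-entropy loss
# (a simplified stand-in for Eq. (1)); `target` is the index of the true type
# among the M sampled prototypes.
def episode_loss(query_feat, protos, target):
    probs = query_distribution(query_feat, protos)
    return -torch.log(probs[target])

# Testing: nearest-neighbor assignment, arg min_m d(f(x), c_m).
def predict_type(query_feat, protos):
    dists = torch.norm(query_feat.unsqueeze(0) - protos, dim=-1)
    return int(torch.argmin(dists))
```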
3.2 Noisy Supervised Pre-training
Generic representations via self-supervised
pre-trained language models (Devlin et al.,