
Neural Architectures for Named Entity Recognition

Guillaume Lample♠   Miguel Ballesteros♣♠   Sandeep Subramanian♠   Kazuya Kawakami♠   Chris Dyer♠

♠Carnegie Mellon University   ♣NLP Group, Pompeu Fabra University
{glample,sandeeps,kkawakam,cdyer}@cs.cmu.edu, [email protected]

arXiv:1603.01360v1 [cs.CL] 4 Mar 2016

Abstract

State-of-the-art named entity recognition systems rely heavily on hand-crafted features and domain-specific knowledge in order to learn effectively from the small, supervised training corpora that are available. In this paper, we introduce two new neural architectures—one based on bidirectional LSTMs and conditional random fields, and the other that constructs and labels segments using a transition-based approach inspired by shift-reduce parsers. Our models rely on two sources of information about words: character-based word representations learned from the supervised corpus and unsupervised word representations learned from unannotated corpora. Our models obtain state-of-the-art performance in NER in four languages without resorting to any language-specific knowledge or resources such as gazetteers.¹

1 Introduction

Named entity recognition (NER) is a challenging learning problem. On the one hand, in most languages and domains, there is only a very small amount of supervised training data available. On the other, there are few constraints on the kinds of words that can be names, so generalizing from this small sample of data is difficult. As a result, carefully constructed orthographic features and language-specific knowledge resources, such as gazetteers, are widely used for solving this task. Unfortunately, language-specific resources and features are costly to develop in new languages and new domains, making NER a challenge to adapt. Unsupervised learning from unannotated corpora offers an alternative strategy for obtaining better generalization from small amounts of supervision. However, even systems that have relied extensively on unsupervised features (Collobert et al., 2011; Turian et al., 2010; Lin and Wu, 2009; Ando and Zhang, 2005b, inter alia) have used these to augment, rather than replace, hand-engineered features (e.g., knowledge about capitalization patterns and character classes in a particular language) and specialized knowledge resources (e.g., gazetteers).

In this paper, we present neural architectures for NER that use no language-specific resources or features beyond a small amount of supervised training data and unlabeled corpora. Our models are designed to capture two intuitions. First, since names often consist of multiple tokens, reasoning jointly over tagging decisions for each token is important. We compare two models here: (i) a bidirectional LSTM with a sequential conditional random field layer above it (LSTM-CRF; §2), and (ii) a new model that constructs and labels chunks of input sentences using an algorithm inspired by transition-based parsing, with states represented by stack LSTMs (S-LSTM; §3). Second, token-level evidence for "being a name" includes both orthographic evidence (what does the word being tagged as a name look like?) and distributional evidence (where does the word being tagged tend to occur in a corpus?). To capture orthographic sensitivity, we use a character-based word representation model (Ling et al., 2015b); to capture distributional sensitivity, we combine these representations with distributional representations (Mikolov et al., 2013b). Our word representations combine both of these, and dropout training is used to encourage the model to learn to trust both sources of evidence (§4).

Experiments in English, Dutch, German, and Spanish show that we are able to obtain state-of-the-art NER performance with the LSTM-CRF model in Dutch, German, and Spanish, and very near the state-of-the-art in English without any

¹ Code is available at https://github.com/glample/tagger
hand-engineered features or gazetteers (§5). The transition-based algorithm likewise surpasses the best previously published results in several languages, although it performs less well than the LSTM-CRF model.

2 LSTM-CRF Model

We provide a brief description of LSTMs and CRFs, and present a hybrid tagging architecture.

2.1 LSTM

Recurrent neural networks (RNNs) are a family of neural networks that operate on sequential data. They take as input a sequence of vectors (x_1, x_2, \dots, x_n) and return another sequence (h_1, h_2, \dots, h_n) that represents some information about the sequence at every step in the input. Although RNNs can, in theory, learn long dependencies, in practice they fail to do so and tend to be biased towards their most recent inputs in the sequence (Bengio et al., 1994). Long Short-term Memory Networks (LSTMs) have been designed to combat this issue by incorporating a memory cell, and have been shown to capture long-range dependencies. They do so using several gates that control the proportion of the input to give to the memory cell, and the proportion from the previous state to forget (Hochreiter and Schmidhuber, 1997). We use the following implementation:

    i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)
    c_t = (1 - i_t) \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
    o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)
    h_t = o_t \odot \tanh(c_t),

where \sigma is the element-wise sigmoid function and \odot is the element-wise product.

For a given sentence (x_1, x_2, \dots, x_n) containing n words, each represented as a d-dimensional vector, an LSTM computes a representation \overrightarrow{h}_t of the left context of the sentence at every word t. Naturally, generating a representation of the right context \overleftarrow{h}_t as well should add useful information. This can be achieved using a second LSTM that reads the same sequence in reverse. We will refer to the former as the forward LSTM and the latter as the backward LSTM. These are two distinct networks with different parameters. This forward and backward LSTM pair is referred to as a bidirectional LSTM (Graves and Schmidhuber, 2005).

The representation of a word using this model is obtained by concatenating its left and right context representations, h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]. These representations effectively include a representation of a word in context, which is useful for numerous tagging applications.
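To make the recurrence and the concatenation concrete, the sketch below re-implements the LSTM update from the equations above (without the peephole terms W_ci and W_co, for brevity) and builds h_t = [forward h_t ; backward h_t] in plain NumPy. It is an illustrative re-implementation, not the released implementation at https://github.com/glample/tagger; the helper names (init_lstm, lstm_run, bilstm_context) and the weight shapes are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm(d_in, d_hid, rng):
    # One weight matrix and bias per gate, acting on [x_t ; h_{t-1}].
    shape = (d_hid, d_in + d_hid)
    return {
        "Wi": rng.standard_normal(shape) * 0.1, "bi": np.zeros(d_hid),
        "Wc": rng.standard_normal(shape) * 0.1, "bc": np.zeros(d_hid),
        "Wo": rng.standard_normal(shape) * 0.1, "bo": np.zeros(d_hid),
    }

def lstm_run(params, xs):
    """Run the (peephole-free) LSTM over a list of input vectors, returning every h_t."""
    d_hid = params["bi"].shape[0]
    h, c, hs = np.zeros(d_hid), np.zeros(d_hid), []
    for x in xs:
        z = np.concatenate([x, h])
        i = sigmoid(params["Wi"] @ z + params["bi"])                      # input gate
        c = (1 - i) * c + i * np.tanh(params["Wc"] @ z + params["bc"])    # coupled input/forget update
        o = sigmoid(params["Wo"] @ z + params["bo"])                      # output gate
        h = o * np.tanh(c)
        hs.append(h)
    return hs

def bilstm_context(fwd, bwd, xs):
    """h_t = [forward h_t ; backward h_t], one vector per word."""
    h_fwd = lstm_run(fwd, xs)
    h_bwd = lstm_run(bwd, xs[::-1])[::-1]
    return [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]

rng = np.random.default_rng(0)
sentence = [rng.standard_normal(100) for _ in range(5)]   # five 100-dimensional word embeddings
fwd, bwd = init_lstm(100, 100, rng), init_lstm(100, 100, rng)
print(bilstm_context(fwd, bwd, sentence)[0].shape)        # (200,)
```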
2.2 CRF Tagging Models

A very simple—but surprisingly effective—tagging model is to use the h_t's as features to make independent tagging decisions for each output y_t (Ling et al., 2015b). Despite this model's success in simple problems like POS tagging, its independent classification decisions are limiting when there are strong dependencies across output labels. NER is one such task, since the "grammar" that characterizes interpretable sequences of tags imposes several hard constraints (e.g., I-PER cannot follow B-LOC; see §2.4 for details) that would be impossible to model with independence assumptions.

Therefore, instead of modeling tagging decisions independently, we model them jointly using a conditional random field (Lafferty et al., 2001). For an input sentence

    X = (x_1, x_2, \dots, x_n),

we consider P to be the matrix of scores output by the bidirectional LSTM network. P is of size n × k, where k is the number of distinct tags, and P_{i,j} corresponds to the score of the j-th tag of the i-th word in a sentence. For a sequence of predictions

    y = (y_1, y_2, \dots, y_n),

we define its score to be

    s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i},

where A is a matrix of transition scores such that A_{i,j} represents the score of a transition from tag i to tag j. y_0 and y_n are the start and end tags of a sentence, which we add to the set of possible tags. A is therefore a square matrix of size k + 2.
A softmax over all possible tag sequences yields a probability for the sequence y:

    p(y \mid X) = \frac{e^{s(X,y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X,\tilde{y})}}.

During training, we maximize the log-probability of the correct tag sequence:

    \log(p(y \mid X)) = s(X, y) - \log\Big( \sum_{\tilde{y} \in Y_X} e^{s(X,\tilde{y})} \Big)
                      = s(X, y) - \operatorname{logadd}_{\tilde{y} \in Y_X} s(X, \tilde{y}),    (1)

where Y_X represents all possible tag sequences (even those that do not verify the IOB format) for a sentence X. From the formulation above, it is evident that we encourage our network to produce a valid sequence of output labels. While decoding, we predict the output sequence that obtains the maximum score, given by:

    y^* = \operatorname{argmax}_{\tilde{y} \in Y_X} s(X, \tilde{y}).    (2)

Since we are only modeling bigram interactions between outputs, both the summation in Eq. 1 and the maximum a posteriori sequence y^* in Eq. 2 can be computed using dynamic programming.
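To illustrate why bigram interactions keep these computations tractable, the sketch below evaluates s(X, y) for a given tag sequence and recovers the argmax of Eq. 2 with the standard Viterbi recursion over the emission matrix P and transition matrix A. It is our own minimal NumPy rendering of the paper's notation, with the extra start and end tags handled as explicit row/column indices; it is not the authors' released code.

```python
import numpy as np

def sequence_score(P, A, y, start, end):
    """s(X, y): transition scores over the padded path plus emission scores P[i, y_i]."""
    path = [start] + list(y) + [end]
    trans = sum(A[a, b] for a, b in zip(path[:-1], path[1:]))
    emit = sum(P[i, t] for i, t in enumerate(y))
    return trans + emit

def viterbi_decode(P, A, start, end):
    """Return the highest-scoring tag sequence via dynamic programming."""
    n, k = P.shape
    score = A[start, :k] + P[0]                    # best score of each tag at position 0
    back = []
    for i in range(1, n):
        cand = score[:, None] + A[:k, :k] + P[i][None, :]   # previous tag x next tag
        back.append(cand.argmax(axis=0))
        score = cand.max(axis=0)
    score = score + A[:k, end]                     # transition into the end tag
    best = [int(score.argmax())]
    for bp in reversed(back):
        best.append(int(bp[best[-1]]))
    return best[::-1]

# Toy example: 4 words, 3 ordinary tags; start/end occupy indices 3 and 4 of A (size k + 2).
rng = np.random.default_rng(1)
P = rng.standard_normal((4, 3))
A = rng.standard_normal((5, 5))
y_hat = viterbi_decode(P, A, start=3, end=4)
print(y_hat, sequence_score(P, A, y_hat, start=3, end=4))
```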
2.3 Parameterization and Training

The scores associated with each tagging decision for each token (i.e., the P_{i,y}'s) are defined to be the dot product between the embedding of a word-in-context computed with a bidirectional LSTM—exactly the same as the POS tagging model of Ling et al. (2015b)—and these are combined with bigram compatibility scores (i.e., the A_{y,y'}'s). This architecture is shown in Figure 1. Circles represent observed variables, diamonds are deterministic functions of their parents, and double circles are random variables.

Figure 1: Main architecture of the network. Word embeddings are given to a bidirectional LSTM. l_i represents the word i and its left context, r_i represents the word i and its right context. Concatenating these two vectors yields a representation of the word i in its context, c_i.

The parameters of this model are thus the matrix of bigram compatibility scores A, and the parameters that give rise to the matrix P, namely the parameters of the bidirectional LSTM, the linear feature weights, and the word embeddings. As in part 2.2, let x_i denote the sequence of word embeddings for every word in a sentence, and y_i be their associated tags. We return to a discussion of how the embeddings x_i are modeled in Section 4. The sequence of word embeddings is given as input to a bidirectional LSTM, which returns a representation of the left and right context for each word as explained in 2.1.

These representations are concatenated (c_i) and linearly projected onto a layer whose size is equal to the number of distinct tags. Instead of using the softmax output from this layer, we use a CRF as previously described to take into account neighboring tags, yielding the final predictions for every word y_i. Additionally, we observed that adding a hidden layer between c_i and the CRF layer marginally improved our results. All results reported with this model incorporate this extra layer. The parameters are trained to maximize Eq. 1 of observed sequences of NER tags in an annotated corpus, given the observed words.
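Read literally, P is produced by a small feed-forward head on top of each concatenated context vector c_i. A sketch of that projection, including the optional extra hidden layer mentioned above, is given below; the layer sizes, the tanh nonlinearity, and the function names are our assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def tag_scores(C, W_hidden, b_hidden, W_out, b_out):
    """Map context vectors C (n_words x 2*hidden) to the score matrix P (n_words x n_tags).

    The tanh layer plays the role of the optional hidden layer between c_i and the
    CRF; dropping it reduces this to a single linear projection onto the tag space.
    """
    H = np.tanh(C @ W_hidden + b_hidden)
    return H @ W_out + b_out

rng = np.random.default_rng(0)
n, ctx, hid, k = 5, 200, 100, 17          # 17 = IOBES tags for 4 entity types plus O
C = rng.standard_normal((n, ctx))
P = tag_scores(C, rng.standard_normal((ctx, hid)), np.zeros(hid),
               rng.standard_normal((hid, k)), np.zeros(k))
print(P.shape)   # (5, 17)
```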
2.4 IOBES Tagging Scheme

The task of named entity recognition is to assign a named entity label to every word in a sentence. A single named entity could span several tokens within a sentence. Sentences are usually represented in the IOB format (Inside, Outside, Beginning), where every token is labeled as B-label if the token is the beginning of a named entity, I-label if it is inside a named entity but not the first token within the named entity, or O otherwise. However, we decided to use the IOBES tagging scheme, a variant of IOB, which encodes information about singleton entities (S) and explicitly marks the end of named entities (E). Using this scheme, tagging a word as I-label with high confidence narrows down the choices for the subsequent word to I-label or E-label; the IOB scheme, however, is only capable of determining that the subsequent word cannot be the interior of another label. Ratinov and Roth (2009) and Dai et al. (2015) showed that using a more expressive tagging scheme like IOBES improves model performance marginally. We observed a similar improvement to theirs using the IOBES notation.
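As a small worked example of the scheme (our own helper, not part of the paper's released code), the function below converts IOB-style tags into IOBES: one-token entities become S- tags and the final token of a multi-token entity becomes an E- tag.

```python
def iob_to_iobes(tags):
    """Convert IOB tags (e.g. B-PER, I-PER, O) into IOBES."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        entity_continues = nxt == "I-" + label
        if prefix == "B":
            out.append(("B-" if entity_continues else "S-") + label)
        else:  # prefix == "I"
            out.append(("I-" if entity_continues else "E-") + label)
    return out

print(iob_to_iobes(["B-PER", "I-PER", "O", "B-LOC"]))
# ['B-PER', 'E-PER', 'O', 'S-LOC']
```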
3 Transition-Based Chunking Model

As an alternative to the LSTM-CRF discussed in the previous section, we explore a new architecture that chunks and labels a sequence of inputs using an algorithm similar to transition-based dependency parsing. This model directly constructs representations of the multi-token names (e.g., the name Mark Watney is composed into a single representation). This model relies on a stack data structure to incrementally construct chunks of the input. To obtain representations of this stack used for predicting subsequent actions, we use the Stack-LSTM presented by Dyer et al. (2015), in which the LSTM is augmented with a "stack pointer." While sequential LSTMs model sequences from left to right, stack LSTMs permit embedding of a stack of objects that are both added to (using a push operation) and removed from (using a pop operation). This allows the Stack-LSTM to work like a stack that maintains a "summary embedding" of its contents. We refer to this model as the Stack-LSTM or S-LSTM model for simplicity.

3.1 Chunking Algorithm

We designed a transition inventory, given in Figure 2, that is inspired by transition-based parsers, in particular the arc-standard parser of Nivre (2004). In this algorithm, we make use of two stacks (designated output and stack, representing, respectively, completed chunks and scratch space) and a buffer that contains the words that have yet to be processed. The transition inventory contains the following transitions: the SHIFT transition moves a word from the buffer to the stack; the OUT transition moves a word from the buffer directly into the output stack; and the REDUCE(y) transition pops all items from the top of the stack, creating a "chunk," labels this chunk with label y, and pushes a representation of it onto the output stack. The algorithm completes when the stack and buffer are both empty. The algorithm is depicted in Figure 3, which shows the sequence of operations required to process the sentence Mark Watney visited Mars.

The model is parameterized by defining a probability distribution over actions at each time step, given the current contents of the stack, buffer, and output, as well as the history of actions taken. Following Dyer et al. (2015), we use stack LSTMs to compute a fixed-dimensional embedding of each of these, and take a concatenation of these to obtain the full algorithm state. This representation is used to define a distribution over the possible actions that can be taken at each time step. The model is trained to maximize the conditional probability of sequences of reference actions (extracted from a labeled training corpus) given the input sentences. To label a new input sequence at test time, the maximum probability action is chosen greedily until the algorithm reaches a termination state. Although this is not guaranteed to find a global optimum, it is effective in practice. Since each token is either moved directly to the output (1 action) or first to the stack and then the output (2 actions), the total number of actions for a sequence of length n is maximally 2n.
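The sketch below is a deliberately stripped-down version of this inference loop: it assumes some scoring function over the algorithm state (here a stub argument) in place of the stack-LSTM encoders, and it shows only the SHIFT/OUT/REDUCE(y) bookkeeping described above. The function and variable names are ours, and the OUT branch stores the raw word rather than the g(u, r_∅) representation used in the paper.

```python
def chunk_sentence(words, score_actions):
    """Greedy transition-based chunking.

    `score_actions(stack, buffer, output)` stands in for the stack-LSTM state
    encoder; it must return the chosen action, e.g. ("SHIFT",), ("OUT",) or
    ("REDUCE", "PER").
    """
    buffer = list(words)
    stack, output, segments = [], [], []
    while buffer or stack:
        action = score_actions(stack, buffer, output)
        if action[0] == "SHIFT" and buffer:
            stack.append(buffer.pop(0))          # word moves buffer -> stack
        elif action[0] == "OUT" and buffer:
            output.append(buffer.pop(0))         # word moves buffer -> output
        else:                                    # ("REDUCE", y): close the chunk on the stack
            _, label = action
            chunk = tuple(stack)
            stack.clear()
            output.append((chunk, label))
            segments.append((chunk, label))
    return output, segments

# Hand-scripted action sequence reproducing Figure 3 (normally the model predicts these).
script = iter([("SHIFT",), ("SHIFT",), ("REDUCE", "PER"), ("OUT",), ("SHIFT",), ("REDUCE", "LOC")])
out, segs = chunk_sentence(["Mark", "Watney", "visited", "Mars"], lambda *state: next(script))
print(segs)   # [(('Mark', 'Watney'), 'PER'), (('Mars',), 'LOC')]
```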
3.2 Representing Labeled Chunks

When the REDUCE(y) operation is executed, the algorithm shifts a sequence of tokens (together with their vector embeddings) from the stack to the output buffer as a single completed chunk. To compute an embedding of this sequence, we run a bidirectional LSTM over the embeddings of its constituent tokens together with a token representing the type of the chunk being identified (i.e., y). This function is given as g(u, ..., v, r_y), where r_y is a learned embedding of a label type. Thus, the output buffer contains a single vector representation for each labeled chunk that is generated, regardless of its length.
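A minimal rendering of the composition function g is sketched below; for brevity the two LSTMs are replaced by a simple recurrent stand-in, since the point is only that a chunk's token embeddings plus the label embedding r_y are reduced to one fixed-size vector regardless of chunk length. The reduced dimensionality, the stand-in cell, and the names are assumptions made for the example.

```python
import numpy as np

def simple_rnn_final(vectors, dim):
    # Stand-in for an LSTM pass: fold the sequence into a single final state.
    state = np.zeros(dim)
    for v in vectors:
        state = np.tanh(0.5 * state + 0.5 * v[:dim])
    return state

def g(token_embs, label_emb, dim=20):
    """Compose a labeled chunk (u, ..., v, r_y) into one fixed-size representation."""
    seq = list(token_embs) + [label_emb]
    fwd = simple_rnn_final(seq, dim)
    bwd = simple_rnn_final(seq[::-1], dim)
    return np.concatenate([fwd, bwd])      # same size for a 1-token or a 10-token chunk

rng = np.random.default_rng(0)
chunk = [rng.standard_normal(100) for _ in ("Mark", "Watney")]
r_per = rng.standard_normal(100)           # learned label embedding for PER
print(g(chunk, r_per).shape)               # (40,)
```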
Out_t | Stack_t | Buffer_t | Action | Out_{t+1} | Stack_{t+1} | Buffer_{t+1} | Segments
O | S | (u, u), B | SHIFT | O | (u, u), S | B | —
O | (u, u), ..., (v, v), S | B | REDUCE(y) | g(u, ..., v, r_y), O | S | B | (u ... v, y)
O | S | (u, u), B | OUT | g(u, r_∅), O | S | B | —

Figure 2: Transitions of the Stack-LSTM model indicating the action applied and the resulting state. Bold symbols indicate (learned) embeddings of words and relations, script symbols indicate the corresponding words and relations.

Transition | Output | Stack | Buffer | Segment
 | [] | [] | [Mark, Watney, visited, Mars] |
SHIFT | [] | [Mark] | [Watney, visited, Mars] |
SHIFT | [] | [Mark, Watney] | [visited, Mars] |
REDUCE(PER) | [(Mark Watney)-PER] | [] | [visited, Mars] | (Mark Watney)-PER
OUT | [(Mark Watney)-PER, visited] | [] | [Mars] |
SHIFT | [(Mark Watney)-PER, visited] | [Mars] | [] |
REDUCE(LOC) | [(Mark Watney)-PER, visited, (Mars)-LOC] | [] | [] | (Mars)-LOC

Figure 3: Transition sequence for Mark Watney visited Mars with the Stack-LSTM model.

4 Input Word Embeddings

The input layers to both of our models are vector representations of individual words. Learning independent representations for word types from the limited NER training data is a difficult problem: there are simply too many parameters to reliably estimate. Since many languages have orthographic or morphological evidence that something is a name (or not a name), we want representations that are sensitive to the spelling of words. We therefore use a model that constructs representations of words from representations of the characters they are composed of (4.1). Our second intuition is that names, which may individually be quite varied, appear in regular contexts in large corpora. Therefore we use embeddings learned from a large corpus that are sensitive to word order (4.2). Finally, to prevent the models from depending on one representation or the other too strongly, we use dropout training and find this is crucial for good generalization performance (4.3).

4.1 Character-based models of words

An important distinction of our work from most previous approaches is that we learn character-level features while training instead of hand-engineering prefix and suffix information about words. Learning character-level embeddings has the advantage of learning representations specific to the task and domain at hand. They have been found useful for morphologically rich languages and to handle the out-of-vocabulary problem for tasks like part-of-speech tagging and language modeling (Ling et al., 2015b) or dependency parsing (Ballesteros et al., 2015).

Figure 4: The character embeddings of the word "Mars" are given to a bidirectional LSTM. We concatenate their last outputs to an embedding from a lookup table to obtain a representation for this word.

Figure 4 describes our architecture to generate a word embedding for a word from its characters. A character lookup table initialized at random contains an embedding for every character. The character embeddings corresponding to every character in a word are given in direct and reverse order to a forward and a backward LSTM.
The embedding for a word derived from its characters is the concatenation of its forward and backward representations from the bidirectional LSTM. This character-level representation is then concatenated with a word-level representation from a word lookup table. During testing, words that do not have an embedding in the lookup table are mapped to a UNK embedding. To train the UNK embedding, we replace singletons with the UNK embedding with a probability 0.5. In all our experiments, the hidden dimension of the forward and backward character LSTMs is 25 each, which results in our character-based representation of words being of dimension 50.

Recurrent models like RNNs and LSTMs are capable of encoding very long sequences; however, they have a representation biased towards their most recent inputs. As a result, we expect the final representation of the forward LSTM to be an accurate representation of the suffix of the word, and the final state of the backward LSTM to be a better representation of its prefix. Alternative approaches—most notably convolutional networks—have been proposed to learn representations of words from their characters (Zhang et al., 2015; Kim et al., 2015). However, convnets are designed to discover position-invariant features of their inputs. While this is appropriate for many problems, e.g., image recognition (a cat can appear anywhere in a picture), we argue that important information is position dependent (e.g., prefixes and suffixes encode different information than stems), making LSTMs an a priori better function class for modeling the relationship between words and their characters.
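A compact sketch of how these pieces fit together is shown below. The character LSTMs are replaced by a stub returning fixed-size final states, since the point of the example is the concatenation of the 50-dimensional character-level vector with the word-level vector and the singleton-to-UNK replacement used to train the UNK embedding; all names and the stub itself are our assumptions, not the released code.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
CHAR_DIM, CHAR_HID, WORD_DIM = 25, 25, 100

char_table = {c: rng.standard_normal(CHAR_DIM)
              for c in "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0"}

def char_lstm_final(embs, reverse=False):
    # Stand-in for the forward/backward character LSTM: returns a CHAR_HID-dim "final state".
    seq = embs[::-1] if reverse else embs
    state = np.zeros(CHAR_HID)
    for e in seq:
        state = np.tanh(0.5 * state + 0.5 * e[:CHAR_HID])
    return state

def word_representation(word, word_table, train=False, singletons=frozenset()):
    chars = [char_table.get(c, np.zeros(CHAR_DIM)) for c in word]
    char_vec = np.concatenate([char_lstm_final(chars),
                               char_lstm_final(chars, reverse=True)])   # dimension 50
    key = word
    if train and word in singletons and rng.random() < 0.5:
        key = "<UNK>"                      # train the UNK embedding on singletons
    if key not in word_table:
        key = "<UNK>"                      # unseen test words map to UNK
    return np.concatenate([char_vec, word_table[key]])                  # 50 + 100 dims

corpus = ["Mark", "Watney", "visited", "Mars", "Mars"]
counts = Counter(corpus)
singletons = {w for w, c in counts.items() if c == 1}
word_table = {w: rng.standard_normal(WORD_DIM) for w in counts}
word_table["<UNK>"] = rng.standard_normal(WORD_DIM)
print(word_representation("Mark", word_table, train=True, singletons=singletons).shape)  # (150,)
```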
4.2 Pretrained embeddings

As in Collobert et al. (2011), we use pretrained word embeddings to initialize our lookup table. We observe significant improvements using pretrained word embeddings over randomly initialized ones. Embeddings are pretrained using skip-n-gram (Ling et al., 2015a), a variation of word2vec (Mikolov et al., 2013a) that accounts for word order. These embeddings are fine-tuned during training.

Word embeddings for Spanish, Dutch, German and English are trained using the Spanish Gigaword version 3, the Leipzig corpora collection, the German monolingual training data from the 2010 Machine Translation Workshop and the English Gigaword version 4 (with the LA Times and NY Times portions removed), respectively.² We use an embedding dimension of 100 for English, 64 for other languages, a minimum word frequency cutoff of 4, and a window size of 8.

² (Graff, 2011; Biemann et al., 2007; Callison-Burch et al., 2010; Parker et al., 2009)
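Assuming the pretrained vectors are stored in the common one-word-per-line text format, initializing the lookup table might look like the sketch below; the file name is hypothetical, the skip-n-gram training itself is outside the scope of this snippet, and words without a pretrained vector simply keep their random initialization before everything is fine-tuned by backpropagation.

```python
import numpy as np

def init_lookup_table(vocab, pretrained_path, dim=100, scale=0.1, seed=0):
    """Random-initialize a |V| x dim lookup table, then overwrite rows with pretrained vectors."""
    rng = np.random.default_rng(seed)
    table = rng.uniform(-scale, scale, size=(len(vocab), dim))
    word2id = {w: i for i, w in enumerate(vocab)}
    hits = 0
    with open(pretrained_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if parts[0] in word2id and len(parts) == dim + 1:
                table[word2id[parts[0]]] = np.asarray(parts[1:], dtype=float)
                hits += 1
    print(f"initialized {hits}/{len(vocab)} rows from pretrained embeddings")
    return table, word2id

# Hypothetical usage (vocabulary and embedding file are placeholders):
# table, word2id = init_lookup_table(train_vocab, "skip_n_gram.en.100d.txt", dim=100)
```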
4.3 Dropout training

Initial experiments showed that character-level embeddings did not improve our overall performance when used in conjunction with pretrained word representations. To encourage the model to depend on both representations, we use dropout training (Hinton et al., 2012), applying a dropout mask to the final embedding layer just before the input to the bidirectional LSTM in Figure 1. We observe a significant improvement in our model's performance after using dropout (see Table 5).
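The dropout placement described here amounts to masking the concatenated word representation before it enters the bidirectional LSTM. A minimal sketch of that mask is given below; the inverted-dropout rescaling is our own convention for the example and is not stated in the paper.

```python
import numpy as np

def embedding_dropout(embeddings, rate=0.5, train=True, rng=None):
    """Apply an (inverted) dropout mask to the final embedding layer.

    `embeddings` is an (n_words, dim) matrix of concatenated character-level and
    pretrained word representations; at test time it is returned unchanged.
    """
    if not train or rate == 0.0:
        return embeddings
    rng = rng or np.random.default_rng()
    mask = (rng.random(embeddings.shape) >= rate).astype(embeddings.dtype)
    return embeddings * mask / (1.0 - rate)   # rescale so expected activations match test time

x = np.ones((4, 150))
print(embedding_dropout(x, rate=0.5, rng=np.random.default_rng(0)).sum())
```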
5 Experiments

This section presents the methods we use to train our models, the results we obtained on various tasks, and the impact of our networks' configuration on model performance.

5.1 Training

For both models presented, we train our networks using the back-propagation algorithm, updating our parameters on every training example, one at a time, using stochastic gradient descent (SGD) with a learning rate of 0.01 and a gradient clipping of 5.0. Several methods have been proposed to enhance the performance of SGD, such as Adadelta (Zeiler, 2012) or Adam (Kingma and Ba, 2014). Although we observe faster convergence using these methods, none of them perform as well as SGD with gradient clipping.

Our LSTM-CRF model uses a single layer for the forward and backward LSTMs, whose dimensions are set to 100. Tuning this dimension did not significantly impact model performance. We set the dropout rate to 0.5. Using higher rates negatively impacted our results, while smaller rates led to longer training time.

The stack-LSTM model uses two layers, each of dimension 100, for each stack. The embeddings of the actions used in the composition functions have 16 dimensions each, and the output embedding is of dimension 20. We experimented with different dropout rates and reported the scores using the best dropout rate for each language.³

³ English (D=0.2), German, Spanish and Dutch (D=0.3)
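A bare-bones version of this update rule is sketched below, interpreting "a gradient clipping of 5.0" as a global-norm threshold (one reasonable reading; the paper does not spell out the clipping variant). Gradient computation itself is assumed to come from whatever backward pass the model uses.

```python
import numpy as np

def sgd_clip_update(params, grads, lr=0.01, clip=5.0):
    """In-place SGD step with global gradient-norm clipping (threshold 5.0, learning rate 0.01)."""
    norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads.values()))
    scale = clip / norm if norm > clip else 1.0
    for name, g in grads.items():
        params[name] -= lr * scale * g
    return norm

params = {"W": np.ones((2, 2)), "b": np.zeros(2)}
grads = {"W": np.full((2, 2), 10.0), "b": np.full(2, 10.0)}
print(sgd_clip_update(params, grads))   # reports the pre-clipping gradient norm
```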
5.2 Data Sets

We test our model on different datasets for named entity recognition. To demonstrate our model's ability to generalize to different languages, we present results on the CoNLL-2002 and CoNLL-2003 datasets (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003) that contain independent named entity labels for English, Spanish, German and Dutch. All datasets contain four different types of named entities: locations, persons, organizations, and miscellaneous entities that do not belong in any of the three previous categories. Although POS tags were made available for all datasets, we did not include them in our models. We did not perform any dataset preprocessing, apart from replacing every digit with a zero in the English NER dataset.

5.3 Results

Table 1 presents our comparisons with other models for named entity recognition in English. To make the comparison between our model and others fair, we report the scores of other models with and without the use of external labeled data such as gazetteers and knowledge bases. Our models do not use gazetteers or any external labeled resources. The best score reported on this task is by Luo et al. (2015). They obtained an F1 of 91.2 by jointly modeling the NER and entity linking tasks (Hoffart et al., 2011). Their model uses a lot of hand-engineered features including spelling features, WordNet clusters, Brown clusters, POS tags, chunk tags, as well as stemming and external knowledge bases like Freebase and Wikipedia. Our LSTM-CRF model outperforms all other systems, including the ones using external labeled data like gazetteers. Our Stack-LSTM model also outperforms all previous models that do not incorporate external features, apart from the one presented by Chiu and Nichols (2015).

Tables 2, 3 and 4 present our results on NER for German, Dutch and Spanish respectively, in comparison to other models. On these three languages, the LSTM-CRF model significantly outperforms all previous methods, including the ones using external labeled data. The only exception is Dutch, where the model of Gillick et al. (2015) can perform better by leveraging the information from other NER datasets. The Stack-LSTM also consistently presents state-of-the-art (or close to) results compared to systems that do not use external data.

Table 1: English NER results (CoNLL-2003 test set). * indicates models trained with the use of external labeled data.

Model | F1
Collobert et al. (2011)* | 89.59
Lin and Wu (2009) | 83.78
Lin and Wu (2009)* | 90.90
Huang et al. (2015)* | 90.10
Passos et al. (2014) | 90.05
Passos et al. (2014)* | 90.90
Luo et al. (2015)* + gaz | 89.9
Luo et al. (2015)* + gaz + linking | 91.2
Chiu and Nichols (2015) | 90.69
Chiu and Nichols (2015)* | 90.77
LSTM-CRF (no char) | 90.20
LSTM-CRF | 90.94
S-LSTM (no char) | 87.96
S-LSTM | 90.33

Table 2: German NER results (CoNLL-2003 test set). * indicates models trained with the use of external labeled data.

Model | F1
Florian et al. (2003)* | 72.41
Ando and Zhang (2005a) | 75.27
Qi et al. (2009) | 75.72
Gillick et al. (2015) | 72.08
Gillick et al. (2015)* | 76.22
LSTM-CRF – no char | 75.06
LSTM-CRF | 78.76
S-LSTM – no char | 65.87
S-LSTM | 75.66

Table 3: Dutch NER (CoNLL-2002 test set). * indicates models trained with the use of external labeled data.

Model | F1
Carreras et al. (2002) | 77.05
Nothman et al. (2013) | 78.6
Gillick et al. (2015) | 78.08
Gillick et al. (2015)* | 82.84
LSTM-CRF – no char | 73.14
LSTM-CRF | 81.74
S-LSTM – no char | 69.90
S-LSTM | 79.88

Table 4: Spanish NER (CoNLL-2002 test set). * indicates models trained with the use of external labeled data.

Model | F1
Carreras et al. (2002)* | 81.39
Santos and Guimarães (2015) | 82.21
Gillick et al. (2015) | 81.83
Gillick et al. (2015)* | 82.95
LSTM-CRF – no char | 83.44
LSTM-CRF | 85.75
S-LSTM – no char | 79.46
S-LSTM | 83.93

5.4 Network architectures

Our models had several components that we could tweak to understand their impact on the overall performance. We explored the impact that the CRF, the character-level representations, pretraining of our word embeddings and dropout had on our LSTM-CRF model. We observed that pretraining our word embeddings gave us the biggest improvement in overall performance of +7.31 in F1. The CRF layer gave us an increase of +1.79, while using dropout resulted in a difference of +1.17, and finally learning character-level word embeddings resulted in an increase of about +0.74. For the Stack-LSTM we performed a similar set of experiments. Results with different architectures are given in Table 5.
Table 5: English NER results with our models, using different configurations. "pretrain" refers to models that include pretrained word embeddings, "char" refers to models that include character-based modeling of words, "dropout" refers to models that include dropout training.

Model | Variant | F1
LSTM | char + dropout + pretrain | 89.15
LSTM-CRF | char + dropout | 83.63
LSTM-CRF | pretrain | 88.39
LSTM-CRF | pretrain + char | 89.77
LSTM-CRF | pretrain + dropout | 90.20
LSTM-CRF | pretrain + dropout + char | 90.94
S-LSTM | char + dropout | 80.88
S-LSTM | pretrain | 86.67
S-LSTM | pretrain + char | 89.32
S-LSTM | pretrain + dropout | 87.96
S-LSTM | pretrain + dropout + char | 90.33

6 Related Work

In the CoNLL-2002 shared task, Carreras et al. (2002) obtained among the best results on both Dutch and Spanish by combining several small fixed-depth decision trees. The following year, in the CoNLL-2003 shared task, Florian et al. (2003) obtained the best score on German by combining the output of four diverse classifiers. Qi et al. (2009) later improved on this with a neural network by doing unsupervised learning on a massive unlabeled corpus.

Several other neural architectures have previously been proposed for NER. For instance, Collobert et al. (2011) use a CNN over a sequence of word embeddings with a CRF layer on top. This can be thought of as our first model without character-level embeddings and with the bidirectional LSTM being replaced by a CNN. More recently, Huang et al. (2015) presented a model similar to our LSTM-CRF, but using hand-crafted spelling features. Lin and Wu (2009) used a linear chain CRF with L2 regularization, adding phrase cluster features extracted from web data and spelling features. Passos et al. (2014) also used a linear chain CRF with spelling features and gazetteers.

Language-independent NER models like ours have also been proposed in the past. Cucerzan and Yarowsky (1999; 2002) present semi-supervised bootstrapping algorithms for named entity recognition by co-training character-level (word-internal) and token-level (context) features. Eisenstein et al. (2011) use Bayesian nonparametrics to construct a database of named entities in an almost unsupervised setting. Ratinov and Roth (2009) quantitatively compare several approaches for NER and build their own supervised model using a regularized average perceptron and aggregating context information.

Finally, there is currently a lot of interest in models for NER that use letter-based representations. Gillick et al. (2015) model the task of sequence labeling as a sequence-to-sequence learning problem and incorporate character-based representations into their encoder model. Chiu and Nichols (2015) employ an architecture similar to ours, but instead use CNNs to learn character-level features, in a way similar to the work by Santos and Guimarães (2015).

7 Conclusion

This paper presents two neural architectures for sequence labeling that provide the best NER results ever reported in standard evaluation settings, even compared with models that use external resources, such as gazetteers.

A key aspect of our models is that they model output label dependencies, either via a simple CRF architecture, or using a transition-based algorithm to explicitly construct and label chunks of the input.
Word representations are also crucially important for success: we use both pre-trained word representations and "character-based" representations that capture morphological and orthographic information. To prevent the learner from depending too heavily on one representation class, dropout is used.

Acknowledgments

Miguel Ballesteros is supported by the European Commission under the contract numbers FP7-ICT-610411 (project MULTISENSOR) and H2020-RIA-645012 (project KRISTINA).
References

[Ando and Zhang2005a] Rie Kubota Ando and Tong Zhang. 2005a. A framework for learning predictive structures from multiple tasks and unlabeled data. The Journal of Machine Learning Research, 6:1817–1853.

[Ando and Zhang2005b] Rie Kubota Ando and Tong Zhang. 2005b. Learning predictive structures. JMLR, 6:1817–1853.

[Ballesteros et al.2015] Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based dependency parsing by modeling characters instead of words with LSTMs. In Proceedings of EMNLP.

[Bengio et al.1994] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157–166.

[Biemann et al.2007] Chris Biemann, Gerhard Heyer, Uwe Quasthoff, and Matthias Richter. 2007. The Leipzig corpora collection: monolingual corpora of standard size. In Proceedings of Corpus Linguistics.

[Callison-Burch et al.2010] Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and Omar F. Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 17–53. Association for Computational Linguistics.

[Carreras et al.2002] Xavier Carreras, Lluís Màrquez, and Lluís Padró. 2002. Named entity extraction using AdaBoost. In Proceedings of the 6th Conference on Natural Language Learning, pages 1–4.

[Chiu and Nichols2015] Jason P.C. Chiu and Eric Nichols. 2015. Named entity recognition with bidirectional LSTM-CNNs. arXiv preprint arXiv:1511.08308.

[Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

[Cucerzan and Yarowsky1999] Silviu Cucerzan and David Yarowsky. 1999. Language independent named entity recognition combining morphological and contextual evidence. In Proceedings of the 1999 Joint SIGDAT Conference on EMNLP and VLC, pages 90–99.

[Cucerzan and Yarowsky2002] Silviu Cucerzan and David Yarowsky. 2002. Language independent NER using a unified model of internal and contextual evidence. In Proceedings of the 6th Conference on Natural Language Learning, Volume 20, pages 1–4. Association for Computational Linguistics.

[Dai et al.2015] Hong-Jie Dai, Po-Ting Lai, Yung-Chun Chang, and Richard Tzong-Han Tsai. 2015. Enhancing of chemical compound and drug name recognition using representative tag scheme and fine-grained tokenization. Journal of Cheminformatics, 7(Suppl 1):S14.

[Dyer et al.2015] Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proc. ACL.

[Eisenstein et al.2011] Jacob Eisenstein, Tae Yano, William W. Cohen, Noah A. Smith, and Eric P. Xing. 2011. Structured databases of named entities from Bayesian nonparametrics. In Proceedings of the First Workshop on Unsupervised Learning in NLP, pages 2–12. Association for Computational Linguistics.

[Florian et al.2003] Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang. 2003. Named entity recognition through classifier combination. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, pages 168–171. Association for Computational Linguistics.

[Gillick et al.2015] Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2015. Multilingual language processing from bytes. arXiv preprint arXiv:1512.00103.

[Graff2011] David Graff. 2011. Spanish Gigaword third edition (LDC2011T12). Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA.

[Graves and Schmidhuber2005] Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM networks. In Proc. IJCNN.

[Hinton et al.2012] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.

[Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

[Hoffart et al.2011] Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 782–792. Association for Computational Linguistics.

[Huang et al.2015] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991.

[Kim et al.2015] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2015. Character-aware neural language models. CoRR, abs/1508.06615.

[Kingma and Ba2014] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Lafferty et al.2001] John Lafferty, Andrew McCallum, and Fernando C.N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML.

[Lin and Wu2009] Dekang Lin and Xiaoyun Wu. 2009. Phrase clustering for discriminative learning. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, pages 1030–1038. Association for Computational Linguistics.

[Ling et al.2015a] Wang Ling, Lin Chu-Cheng, Yulia Tsvetkov, Silvio Amir, Ramón Fernandez Astudillo, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2015a. Not all contexts are created equal: Better word representations with variable attention. In Proc. EMNLP.

[Ling et al.2015b] Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2015b. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

[Luo et al.2015] Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Zaiqing Nie. 2015. Joint named entity recognition and disambiguation. In Proc. EMNLP.

[Mikolov et al.2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

[Mikolov et al.2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Proc. NIPS.

[Nivre2004] Joakim Nivre. 2004. Incrementality in deterministic dependency parsing. In Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together.

[Nothman et al.2013] Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy, and James R. Curran. 2013. Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence, 194:151–175.

[Parker et al.2009] Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2009. English Gigaword fourth edition (LDC2009T13). Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA.

[Passos et al.2014] Alexandre Passos, Vineet Kumar, and Andrew McCallum. 2014. Lexicon infused phrase embeddings for named entity resolution. arXiv preprint arXiv:1404.5367.

[Qi et al.2009] Yanjun Qi, Ronan Collobert, Pavel Kuksa, Koray Kavukcuoglu, and Jason Weston. 2009. Combining labeled and unlabeled data with word-class distribution learning. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 1737–1740. ACM.

[Ratinov and Roth2009] Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pages 147–155. Association for Computational Linguistics.

[Santos and Guimarães2015] Cicero Nogueira dos Santos and Victor Guimarães. 2015. Boosting named entity recognition with neural character embeddings. arXiv preprint arXiv:1505.05008.

[Tjong Kim Sang and De Meulder2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proc. CoNLL.

[Tjong Kim Sang2002] Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In Proc. CoNLL.

[Turian et al.2010] Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proc. ACL.

[Zeiler2012] Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.

[Zhang et al.2015] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.
