NER LSTM
Abstract

State-of-the-art named entity recognition systems rely heavily on hand-crafted features and domain-specific knowledge in order to learn effectively from the small, supervised training corpora that are available. In this paper, we introduce two new neural architectures: one based on bidirectional LSTMs and conditional random fields, and one that constructs and labels segments using a transition-based approach inspired by shift-reduce parsers. Our models rely on two sources of information about words: character-based word representations learned from the supervised corpus, and unsupervised word representations learned from unannotated corpora. Our models obtain state-of-the-art performance in NER in four languages without resorting to any language-specific knowledge or resources such as gazetteers.¹

1 Introduction

Named entity recognition (NER) is a challenging learning problem. On the one hand, in most languages and domains, there is only a very small amount of supervised training data available. On the other, there are few constraints on the kinds of words that can be names, so generalizing from this small sample of data is difficult. As a result, carefully constructed orthographic features and language-specific knowledge resources, such as gazetteers, are widely used for solving this task. Unfortunately, language-specific resources and features are costly to develop in new languages and new domains, making NER a challenge to adapt. Unsupervised learning from unannotated corpora offers an alternative strategy for obtaining better generalization from small amounts of supervision. However, even systems that have relied on unsupervised features (Lin and Wu, 2009; Ando and Zhang, 2005b, inter alia) have used these to augment, rather than replace, hand-engineered features (e.g., knowledge about capitalization patterns and character classes in a particular language) and specialized knowledge resources (e.g., gazetteers).

In this paper, we present neural architectures for NER that use no language-specific resources or features beyond a small amount of supervised training data and unlabeled corpora. Our models are designed to capture two intuitions. First, since names often consist of multiple tokens, reasoning jointly over tagging decisions for each token is important. We compare two models here: (i) a bidirectional LSTM with a sequential conditional random field layer above it (LSTM-CRF; §2), and (ii) a new model that constructs and labels chunks of input sentences using an algorithm inspired by transition-based parsing, with states represented by stack LSTMs (S-LSTM; §3). Second, token-level evidence for "being a name" includes both orthographic evidence (what does the word being tagged as a name look like?) and distributional evidence (where does the word being tagged tend to occur in a corpus?). To capture orthographic sensitivity, we use a character-based word representation model (Ling et al., 2015b); to capture distributional sensitivity, we combine these representations with distributional representations (Mikolov et al., 2013b). Our word representations combine both of these, and dropout training is used to encourage the model to learn to trust both sources of evidence (§4).

Experiments in English, Dutch, German, and Spanish show that we are able to obtain state-of-the-art NER performance with the LSTM-CRF model in Dutch, German, and Spanish, and very near the state-of-the-art in English, without any hand-engineered features or gazetteers (§5). The transition-based algorithm likewise surpasses the best previously published results in several languages, although it performs less well than the LSTM-CRF model.

¹ Code is available at https://github.com/glample/tagger
2 LSTM-CRF Model

We provide a brief description of LSTMs and CRFs, and present a hybrid tagging architecture.

2.1 LSTM

Recurrent neural networks (RNNs) are a family of neural networks that operate on sequential data. They take as input a sequence of vectors (x_1, x_2, ..., x_n) and return another sequence (h_1, h_2, ..., h_n) that represents some information about the sequence at every step in the input. Although RNNs can, in theory, learn long dependencies, in practice they fail to do so and tend to be biased towards their most recent inputs in the sequence (Bengio et al., 1994). Long Short-term Memory networks (LSTMs) have been designed to combat this issue by incorporating a memory cell, and have been shown to capture long-range dependencies. They do so using several gates that control the proportion of the input to give to the memory cell, and the proportion of the previous state to forget (Hochreiter and Schmidhuber, 1997). We use the following implementation:

    i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)
    c_t = (1 - i_t) ⊙ c_{t-1} + i_t ⊙ tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
    o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)
    h_t = o_t ⊙ tanh(c_t),

where σ is the element-wise sigmoid function and ⊙ is the element-wise product.
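As a concrete reading of these equations, here is a minimal NumPy sketch of a single step; note the coupled gates (the forget gate is 1 - i_t) and the peephole terms involving the memory cell. The parameter dictionary, its key names, and its shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(p, x_t, h_prev, c_prev):
    """One step of the LSTM variant above.

    p is a dict of weight matrices and bias vectors (hypothetical names,
    shapes chosen only so the matrix-vector products line up):
    the forget gate is tied to the input gate as (1 - i_t), and the
    memory cell feeds into the input and output gates (peepholes)."""
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["W_ci"] @ c_prev + p["b_i"])
    c_t = (1.0 - i_t) * c_prev + i_t * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["W_co"] @ c_t + p["b_o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```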
For a given sentence (x_1, x_2, ..., x_n) containing n words, each represented as a d-dimensional vector, an LSTM computes a representation →h_t of the left context of the sentence at every word t. Naturally, generating a representation of the right context ←h_t as well should add useful information. This can be achieved using a second LSTM that reads the same sequence in reverse. We will refer to the former as the forward LSTM and the latter as the backward LSTM. These are two distinct networks with different parameters. This forward and backward LSTM pair is referred to as a bidirectional LSTM (Graves and Schmidhuber, 2005).

The representation of a word using this model is obtained by concatenating its left and right context representations, h_t = [→h_t ; ←h_t]. These representations effectively include a representation of a word in context, which is useful for numerous tagging applications.
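A short sketch of how the bidirectional representation could be assembled from two such LSTMs, reusing the hypothetical lstm_step above; the function names and parameter handling are again assumptions for illustration only.

```python
import numpy as np

def run_lstm(params, xs, hidden_dim):
    """Run one LSTM over a list of input vectors, returning h_1..h_n."""
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    hs = []
    for x_t in xs:
        h, c = lstm_step(params, x_t, h, c)
        hs.append(h)
    return hs

def bilstm_representations(fwd_params, bwd_params, xs, hidden_dim):
    """h_t = [forward h_t ; backward h_t]: concatenate the left-context
    and right-context representations of each word."""
    fwd = run_lstm(fwd_params, xs, hidden_dim)                  # reads x_1..x_n
    bwd = run_lstm(bwd_params, list(reversed(xs)), hidden_dim)  # reads x_n..x_1
    bwd = list(reversed(bwd))                                   # realign to positions 1..n
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```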
2.2 CRF Tagging Models

A very simple—but surprisingly effective—tagging model is to use the h_t's as features to make independent tagging decisions for each output y_t (Ling et al., 2015b). Despite this model's success in simple problems like POS tagging, its independent classification decisions are limiting when there are strong dependencies across output labels. NER is one such task, since the "grammar" that characterizes interpretable sequences of tags imposes several hard constraints (e.g., I-PER cannot follow B-LOC; see §2.4 for details) that would be impossible to model with independence assumptions.

Therefore, instead of modeling tagging decisions independently, we model them jointly using a conditional random field (Lafferty et al., 2001). For an input sentence

    X = (x_1, x_2, ..., x_n),

we consider P to be the matrix of scores output by the bidirectional LSTM network. P is of size n × k, where k is the number of distinct tags, and P_{i,j} corresponds to the score of the j-th tag of the i-th word in the sentence. For a sequence of predictions

    y = (y_1, y_2, ..., y_n),

we define its score to be

    s(X, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i},

where A is a matrix of transition scores such that A_{i,j} represents the score of a transition from tag i to tag j. y_0 and y_n are the start and end tags of the sentence, which we add to the set of possible tags; A is therefore a square matrix of size k + 2.

A softmax over all possible tag sequences yields a probability for the sequence y:

    p(y|X) = e^{s(X,y)} / Σ_{ỹ ∈ Y_X} e^{s(X,ỹ)}.

During training, we maximize the log-probability of the correct tag sequence:

    log(p(y|X)) = s(X, y) - log Σ_{ỹ ∈ Y_X} e^{s(X,ỹ)}.
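To make the scoring function and the training criterion concrete, here is a brute-force sketch of s(X, y) and log p(y|X). A practical implementation would compute the partition function with the forward algorithm and decode with Viterbi; enumerating Y_X here simply keeps the correspondence with the formulas above explicit. The function names and the start/end tag arguments are illustrative, not the authors' code.

```python
import itertools
import numpy as np

def sequence_score(P, A, y, start, end):
    """s(X, y): transition scores over (start, y_1, ..., y_n, end)
    plus the emission scores P[i, y_i] for each position."""
    tags = [start] + list(y) + [end]
    trans = sum(A[tags[i], tags[i + 1]] for i in range(len(tags) - 1))
    emit = sum(P[i, y_i] for i, y_i in enumerate(y))
    return trans + emit

def log_prob(P, A, y, start, end, k):
    """log p(y|X) = s(X, y) - log sum_{y'} exp(s(X, y')),
    with the normalizer computed by brute force over all k^n tag sequences."""
    n = P.shape[0]
    scores = [sequence_score(P, A, y_alt, start, end)
              for y_alt in itertools.product(range(k), repeat=n)]
    log_Z = np.logaddexp.reduce(scores)
    return sequence_score(P, A, y, start, end) - log_Z
```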
[Figure 2 (not reproduced): Transitions of the Stack-LSTM model, indicating the action applied and the resulting state. Bold symbols indicate (learned) embeddings of words and relations; script symbols indicate the corresponding words and relations.]

[Figure 3 (not reproduced): Transition sequence for "Mark Watney visited Mars" with the Stack-LSTM model.]
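The sections describing the transition-based (Stack-LSTM) model are not reproduced above, so the following is only a hypothetical sketch of the kind of chunking transition sequence that Figures 2 and 3 illustrate, assuming a SHIFT / OUT / REDUCE(label) action inventory; the action names and their semantics are inferred from the figure captions, not taken from the paper's §3.

```python
def apply_transitions(words, actions):
    """Replay a chunking transition sequence (assumed action inventory):
    SHIFT moves the next word onto the stack, OUT emits it unlabeled,
    and ("REDUCE", label) pops the whole stack as one labeled chunk."""
    buffer = list(words)
    stack, output = [], []
    for action in actions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "OUT":
            output.append((buffer.pop(0), "O"))
        else:                        # ("REDUCE", label)
            _, label = action
            output.append((" ".join(stack), label))
            stack = []
    return output

# Hypothetical replay of the Figure 3 example sentence:
print(apply_transitions(
    ["Mark", "Watney", "visited", "Mars"],
    ["SHIFT", "SHIFT", ("REDUCE", "PER"), "OUT", "SHIFT", ("REDUCE", "LOC")]))
# [('Mark Watney', 'PER'), ('visited', 'O'), ('Mars', 'LOC')]
```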
Related Work

In the CoNLL-2002 shared task, Carreras et al. (2002) obtained among the best results on both Dutch and Spanish by combining several small fixed-depth decision trees. The next year, in the CoNLL-2003 shared task, Florian et al. (2003) obtained the best score on German by combining the output of four diverse classifiers. Qi et al. (2009) later improved on this with a neural network by doing unsupervised learning on a massive unlabeled corpus.

Conclusion

This paper presents two neural architectures for sequence labeling that provide the best NER results ever reported in standard evaluation settings, even compared with models that use external resources, such as gazetteers.

A key aspect of our models is that they model output label dependencies, either via a simple CRF architecture or using a transition-based algorithm to explicitly construct and label chunks of the input. Word representations are also crucially important for success: we use both pre-trained word representations and "character-based" representations that capture morphological and orthographic information. To prevent the learner from depending too heavily on one representation class, dropout is used.

Acknowledgments

Miguel Ballesteros is supported by the European Commission under contract numbers FP7-ICT-610411 (project MULTISENSOR) and H2020-RIA-645012 (project KRISTINA).
References

[Ando and Zhang2005a] Rie Kubota Ando and Tong Zhang. 2005a. A framework for learning predictive structures from multiple tasks and unlabeled data. The Journal of Machine Learning Research, 6:1817–1853.

[Ando and Zhang2005b] Rie Kubota Ando and Tong Zhang. 2005b. Learning predictive structures. JMLR, 6:1817–1853.

[Ballesteros et al.2015] Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based dependency parsing by modeling characters instead of words with LSTMs. In Proceedings of EMNLP.

[Bengio et al.1994] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.

[Biemann et al.2007] Chris Biemann, Gerhard Heyer, Uwe Quasthoff, and Matthias Richter. 2007. The Leipzig corpora collection: Monolingual corpora of standard size. In Proceedings of Corpus Linguistics.

[Callison-Burch et al.2010] Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and Omar F. Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 17–53. Association for Computational Linguistics.

[Carreras et al.2002] Xavier Carreras, Lluís Màrquez, and Lluís Padró. 2002. Named entity extraction using AdaBoost. In Proceedings of the 6th Conference on Natural Language Learning, pages 1–4.

[Chiu and Nichols2015] Jason P. C. Chiu and Eric Nichols. 2015. Named entity recognition with bidirectional LSTM-CNNs. arXiv preprint arXiv:1511.08308.

[Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

[Cucerzan and Yarowsky1999] Silviu Cucerzan and David Yarowsky. 1999. Language independent named entity recognition combining morphological and contextual evidence. In Proceedings of the 1999 Joint SIGDAT Conference on EMNLP and VLC, pages 90–99.

[Cucerzan and Yarowsky2002] Silviu Cucerzan and David Yarowsky. 2002. Language independent NER using a unified model of internal and contextual evidence. In Proceedings of the 6th Conference on Natural Language Learning, Volume 20, pages 1–4. Association for Computational Linguistics.

[Dai et al.2015] Hong-Jie Dai, Po-Ting Lai, Yung-Chun Chang, and Richard Tzong-Han Tsai. 2015. Enhancing of chemical compound and drug name recognition using representative tag scheme and fine-grained tokenization. Journal of Cheminformatics, 7(Suppl 1):S14.

[Dyer et al.2015] Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proc. ACL.

[Eisenstein et al.2011] Jacob Eisenstein, Tae Yano, William W. Cohen, Noah A. Smith, and Eric P. Xing. 2011. Structured databases of named entities from Bayesian nonparametrics. In Proceedings of the First Workshop on Unsupervised Learning in NLP, pages 2–12. Association for Computational Linguistics.

[Florian et al.2003] Radu Florian, Abe Ittycheriah, Hongyan Jing, and Tong Zhang. 2003. Named entity recognition through classifier combination. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 168–171. Association for Computational Linguistics.

[Gillick et al.2015] Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2015. Multilingual language processing from bytes. arXiv preprint arXiv:1512.00103.

[Graff2011] David Graff. 2011. Spanish Gigaword third edition (LDC2011T12). Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA.

[Graves and Schmidhuber2005] Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM networks. In Proc. IJCNN.

[Hinton et al.2012] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.

[Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

[Hoffart et al.2011] Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 782–792. Association for Computational Linguistics.

[Huang et al.2015] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991.

[Kim et al.2015] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2015. Character-aware neural language models. CoRR, abs/1508.06615.

[Kingma and Ba2014] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Lafferty et al.2001] John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML.

[Lin and Wu2009] Dekang Lin and Xiaoyun Wu. 2009. Phrase clustering for discriminative learning. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1030–1038. Association for Computational Linguistics.

[Ling et al.2015a] Wang Ling, Lin Chu-Cheng, Yulia Tsvetkov, Silvio Amir, Ramón Fernandez Astudillo, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2015a. Not all contexts are created equal: Better word representations with variable attention. In Proc. EMNLP.

[Ling et al.2015b] Wang Ling, Tiago Luís, Luís Marujo, Ramón Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W. Black, and Isabel Trancoso. 2015b. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

[Luo et al.2015] Gang Luo, Xiaojiang Huang, Chin-Yew Lin, and Zaiqing Nie. 2015. Joint named entity recognition and disambiguation. In Proc. EMNLP.

[Mikolov et al.2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

[Mikolov et al.2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Proc. NIPS.

[Nivre2004] Joakim Nivre. 2004. Incrementality in deterministic dependency parsing. In Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together.

[Nothman et al.2013] Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy, and James R. Curran. 2013. Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence, 194:151–175.

[Parker et al.2009] Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2009. English Gigaword fourth edition (LDC2009T13). Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA.

[Passos et al.2014] Alexandre Passos, Vineet Kumar, and Andrew McCallum. 2014. Lexicon infused phrase embeddings for named entity resolution. arXiv preprint arXiv:1404.5367.

[Qi et al.2009] Yanjun Qi, Ronan Collobert, Pavel Kuksa, Koray Kavukcuoglu, and Jason Weston. 2009. Combining labeled and unlabeled data with word-class distribution learning. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 1737–1740. ACM.

[Ratinov and Roth2009] Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pages 147–155. Association for Computational Linguistics.

[Santos and Guimarães2015] Cicero Nogueira dos Santos and Victor Guimarães. 2015. Boosting named entity recognition with neural character embeddings. arXiv preprint arXiv:1505.05008.

[Tjong Kim Sang and De Meulder2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proc. CoNLL.

[Tjong Kim Sang2002] Erik F. Tjong Kim Sang. 2002. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In Proc. CoNLL.

[Turian et al.2010] Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proc. ACL.

[Zeiler2012] Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.

[Zhang et al.2015] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.