
Hyperparameter Tuning for Deep Learning in Natural Language Processing

Ahmad Aghaebrahimian
Zurich University of Applied Sciences, Switzerland
[email protected]

Mark Cieliebak
Zurich University of Applied Sciences, Switzerland
[email protected]

Abstract

Deep Neural Networks have advanced rapidly over the past several years. However, using them efficiently still seems like a black art to many people. The reason for this complexity is that obtaining consistent, outstanding results from a deep architecture requires optimizing many parameters known as hyperparameters. Hyperparameter tuning is an essential task in deep learning which can make significant changes in network performance. This paper is the essence of over 3000 GPU hours spent on optimizing a network for a text classification task on a wide array of hyperparameters. We provide a list of hyperparameters to tune together with their tuning impact on network performance. The hope is that such a listing will give interested researchers a means to prioritize their efforts and to modify their deep architecture for getting the best performance with the least effort.

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The application of Deep Neural Networks (DNN) such as Convolutional Neural Networks (CNN) (LeCun et al., 1989) or Recurrent Neural Networks (RNN) (Rumelhart et al., 1986) and their variants (e.g., Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) or Gated Recurrent Unit (GRU) (Cho et al., 2014)) has accelerated since the beginning of this decade, partly due to the abundance of data available for training. Over the past several years, DNNs have found their way into many areas of Artificial Intelligence (AI), such as image processing or Natural Language Processing (NLP), and have yielded superior performance in almost all of them. However, a DNN comes with a series of hyperparameters which need to be tuned if one expects to obtain state-of-the-art or even better results. Some of these hyperparameters, such as the number of layers or the number of neurons per layer, are bound directly to the deep neural architecture, while others, such as the drop-out rate, are independent of the architecture. In addition to these hyperparameters, there are other network choices, such as the classifier type, that affect network performance to a large extent. Our list of parameters to tune includes both of these hyperparameters and network choices. Since none of these parameters, including network choices and hyperparameters, can be learned within the network directly, from now on we use the term hyperparameter to refer to both.

Recognizing the best choice of hyperparameters is often a cumbersome process, to the point that some people consider it a "black art" (Snoek et al., 2012). The scarcity of proper research on the impact of these parameters on network performance often leads to a lot of wasted time, especially for younger researchers with little experience. In this paper, we adopt a state-of-the-art multi-label classifier to investigate the impact of 12 categories of hyperparameters on the task of multi-label text classification. The task in multi-label text classification is to assign one or more labels to each text.

Word embedding types, word embedding sizes, word embedding updating, character embeddings, deep architectures (CNN, LSTM, GRU), optimizers, gradient control, classifiers, drop-out, deep vs. wide networks, and pooling are the settings studied in this work. To make the experiment manageable, several groups of these parameters are set on individual grids to serve as an ad-hoc grid search scheme for finding the most promising hyperparameters by focusing on the most promising optimized area.

We provide the readers with an insight into the impact of each hyperparameter on this specific task. This study is performed by running over 400 different configurations in over 3000 GPU hours. The contribution of this work is to provide a prioritized list of hyperparameters to optimize.

2 Related Work
Hyperparameter tuning is often performed using grid search (brute force), where all possible combinations of the hyperparameters with all of their values form a grid and an algorithm is trained for each combination. However, this method becomes computationally intractable even for small numbers of hyperparameters. For instance, in our study with 12 categories of hyperparameters, each with four instances on average, we would have a grid with several million nodes, which would be prohibitively expensive to evaluate. To address this issue, Bergstra et al. (2013) proposed a method for randomized parameter tuning and showed that for each of their datasets there are only a few impactful parameters on which more values should be tried. However, due to the random mechanism in this approach, each trial is independent of the others; hence, it does not learn anything from other experiments. To address this problem, Snoek et al. (2012) proposed a Bayesian optimization method that uses a statistical model to map hyperparameters to an objective function. However, Bayesian optimization adds another layer of complexity to the problem, and therefore this method has not gained much popularity since its proposal.

The most effective and straightforward method for hyperparameter tuning is still ad-hoc grid search (Hutter et al., 2015), where the researcher manually tries the most correlated parameters on the same grid to gradually and iteratively find the most impactful set of hyperparameters with the best values.
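To make such an ad-hoc grid concrete, the sketch below (our illustration, not the paper's code; the hyperparameter names and the dummy train_and_evaluate stub are placeholders) sweeps one small grid of correlated parameters while everything else stays fixed:

```python
# Minimal sketch of an ad-hoc grid over one group of correlated
# hyperparameters (our illustration, not the paper's code).
# train_and_evaluate() is a hypothetical stand-in that should train the
# classifier with the given configuration and return validation micro-F1.
import random
from itertools import product

grid = {
    "architecture": ["bi-gru", "bi-lstm", "cnn"],
    "units": [64, 128, 256],
    "optimizer": ["adam", "nadam", "sgd"],
}

def train_and_evaluate(**config):
    # placeholder: replace with an actual training run
    return random.random()

best_score, best_config = float("-inf"), None
for values in product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    score = train_and_evaluate(**config)
    if score > best_score:
        best_score, best_config = score, config

print(best_config, best_score)
```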
3 Multi-Label Classification

Multi-label text classification is the task of assigning one or more labels to each text. News classification is an example of such a task. For this task, we adopted a state-of-the-art architecture for multi-label classification (Aghaebrahimian and Cieliebak, 2019). The schema of the model is illustrated in Figure 1.

Figure 1: The system architecture

The architecture consists of two channels of bi-GRU deep structures with an attention mechanism and a dense sigmoid layer on top. The illustrated schema is the optimized network which created the best results for the task. One channel is devoted to the most informative words given each class, which are extracted using the χ2 method. The other channel is used for the input tokens. For more information about the architecture, please refer to Aghaebrahimian and Cieliebak (2019).

The dataset used for this experiment is a proprietary dataset with roughly 60K articles and a total number of 28 labels. The dataset contains about 250K different words and assigns 2.5 labels to each article on average. It is randomly divided into 80%, 10%, and 10% parts for training, validation, and testing, respectively.

The textual data is preprocessed by removing non-alphanumeric values and replacing numeric values with a unique symbol. The resulting strings are tokenized and truncated to 3k tokens. Shorter texts are padded with 0 to fix all texts to the same length.

Two measures are used for evaluation. F1 (Micro) is used as a measure of performance. It is computed by calculating F1 scores for each article and averaging them over all articles in the test data. The second metric, Epochs, is reported as a measure of the time required for the network with a specific setting to converge. Early stopping is used as the criterion for convergence; convergence is declared when no decrease in validation loss is observed for three consecutive epochs. All models are trained in batches of 64 instances.
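The preprocessing and convergence criterion described above can be sketched as follows, assuming a Keras/TensorFlow pipeline (the paper does not publish its implementation); the 3,000-token limit, patience of three epochs, and batch size of 64 come from the text, while everything else is illustrative:

```python
# Sketch of the preprocessing and training setup described above, assuming a
# Keras/TensorFlow pipeline (not the authors' published code). Tokenizer
# fitting and model construction are elided.
import re
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping

MAX_LEN = 3000  # truncate/pad every article to 3k tokens

def clean(text: str) -> str:
    text = re.sub(r"\d+", "<num>", text)     # replace numeric values with a unique symbol
    return re.sub(r"[^\w\s<>]", " ", text)   # drop non-alphanumeric characters

def encode(texts, tokenizer: Tokenizer):
    seqs = tokenizer.texts_to_sequences([clean(t) for t in texts])
    # shorter texts are padded with 0, longer ones truncated, so all inputs share one length
    return pad_sequences(seqs, maxlen=MAX_LEN, padding="post", truncating="post")

early_stop = EarlyStopping(monitor="val_loss", patience=3,
                           restore_best_weights=True)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           batch_size=64, epochs=100, callbacks=[early_stop])
```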
4 Experimental results

There are 12 categories of hyperparameters which are tuned in this study. Some of the hyperparameters, such as the deep architecture or the classifier type, are network choices, while others, such as the embedding type or the dropout rate, are variables pertaining to different parts of the network. The results of hyperparameter optimization for each criterion are reported in the following subsections.

All parameters except the parameter under investigation in each experiment are kept constant. All other parameters that are not part of this study, such as the random seed or the batch size, are also kept constant throughout all the experiments.

4.1 Word Embeddings Grid

In this grid, we tune the word embedding type, the size, and the method of updating. Low-dimensional dense word vectors known as word embeddings have proven to be highly effective in representing words and often lead to significantly better performance (Collobert et al., 2011). Depending on the method used for their training, they can provide different levels of syntactic and semantic information about each word. Many factors can affect the quality of word embeddings, including the data on which they were trained, their number of dimensions, their domain, and the preprocessing steps involved in the training. We investigated five widely studied pre-trained word embeddings: Word2Vec (Mikolov et al., 2013) trained on the Google News dataset with 100 billion tokens, Glove (Pennington et al., 2014) with three variants (one trained on Wikipedia with 6 billion tokens and two others trained on the Common Crawl, one on 42 and the other on 840 billion tokens), FastText (Bojanowski et al., 2016), dependency-based embeddings (Levy and Goldberg, 2014), and ELMo (Peters et al., 2018). As shown in Table 1, the Glove embeddings trained on the Common Crawl yield significantly better results compared to the other embeddings, except for ELMo. ELMo and Glove-840 yield roughly similar results; however, due to the much larger word vector size in ELMo, it is much more computationally expensive and takes much longer to converge.

Word embedding type                    Epochs   Results
Word2Vec (Mikolov et al., 2013)        26       81.9 %
Glove-6 (Pennington et al., 2014)      25       81.7 %
Glove-42 (Pennington et al., 2014)     26       82.9 %
Glove-840 (Pennington et al., 2014)    29       84.5 %
FastText (Bojanowski et al., 2016)     24       79.2 %
Dependency (Levy and Goldberg, 2014)   22       81.4 %
ELMo (Peters et al., 2018)             32       84.6 %

Table 1: Embedding type tuning results. Embedding types, sizes, and update methods are on the same grid (26 configurations).

Each pre-trained embedding comes with a specific vector size. The Glove embeddings are available as 50, 100, 200, and 300-dimensional word vectors. ELMo provides 1024-dimensional vectors, and all other embeddings come with 300-dimensional word vectors. The results for size tuning are reported in Table 2. Except for the 50-dimensional vectors, which are sub-optimal, all other dimensions yield superior results with a negligible difference in the number of Epochs.

Word embedding size   Epochs   Results
50                    22       81.8 %
100                   25       82.9 %
200                   27       83.6 %
300                   29       84.3 %
1024                  32       84.6 %

Table 2: Embedding size tuning results. Embedding types, sizes, and update methods are on the same grid search (26 configurations).

Word embeddings provide a means of transfer learning: word vectors are initially learned on a large dataset containing several billion tokens and are fine-tuned on a smaller dataset for a specific task afterwards. This mechanism can be controlled by keeping the word vectors frozen or fine-tuning them during training. Depending on the size of the dataset on which the word embeddings are being refined, updating them can improve the performance. However, as observed in Table 3, fine-tuning the word vectors yielded no significant improvement over the original pre-trained ones, since our dataset was not large enough.

Word embedding updating   Epochs   Results
Disabled                  29       84.3 %
Enabled                   31       84.5 %

Table 3: Embedding update method tuning results. Embedding types, sizes, and update methods are on the same grid search (26 configurations).
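In frameworks such as Keras, the frozen vs. fine-tuned setting of Table 3 typically corresponds to a single flag on the embedding layer. The sketch below is ours (the paper does not specify its framework), and embedding_matrix is assumed to hold one pre-trained vector per vocabulary entry:

```python
# Sketch of the "Disabled"/"Enabled" updating setting in Table 3, assuming a
# Keras Embedding layer (not the authors' code). `embedding_matrix` is assumed
# to hold one pre-trained Glove/Word2Vec vector per row of the vocabulary.
import numpy as np
from tensorflow.keras.layers import Embedding

vocab_size, embedding_dim = 250_000, 300
embedding_matrix = np.zeros((vocab_size, embedding_dim), dtype="float32")
# ... fill each row i with the pre-trained vector of word i ...

embedding_layer = Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    weights=[embedding_matrix],   # initialize from pre-trained vectors
    trainable=False,              # False = updating "Disabled"; True = fine-tune
)
```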
4.2 Character embedding

Word-level features are not the only features used in text analytics. Character-level features are also reported to improve model performance, especially in tasks such as Named Entity Recognition (NER) (Akbik et al., 2018) or Part-of-Speech (POS) tagging (Anastasiev et al., 2018), where knowing the function of individual characters such as prefixes, suffixes, or even infixes is beneficial.

We used two different character encoding mechanisms, one CNN-based (Ma and Hovy, 2016) and the other LSTM-based (Lample et al., 2016), to investigate the impact of character-level features on the network performance. As we expected, using character-level features had no added value in the label classification task, where labels are bound to words and their syntactic and semantic attributes rather than to their characters.

Character embeddings and the best of the embeddings grid were tuned on the same grid. This means that in this grid we disregard the sub-optimal settings of the embeddings grid and only focus on the winning setting. Given the winning setting, we tune the character embedding settings to investigate the impact of character embeddings (Table 4).

Character embedding                   Epochs   Results
Disabled                              29       84.3 %
Enabled-CNN (Ma and Hovy, 2016)       31       84.7 %
Enabled-LSTM (Lample et al., 2016)    36       84.8 %

Table 4: Character embedding tuning results. Character embeddings and the best of embeddings are in the same grid search (14 configurations).
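A CNN-based character encoder in the spirit of Ma and Hovy (2016) can be sketched as follows, again assuming Keras; the character-embedding size, window width, and filter count are illustrative values, not ones reported in the paper:

```python
# Rough sketch of a CNN-based character encoder in the spirit of
# Ma and Hovy (2016), assuming Keras; 30-dimensional character embeddings,
# a window of width 3, and 64 filters are illustrative choices.
from tensorflow.keras import Input, Model, layers

MAX_WORDS, MAX_CHARS, N_CHARS = 3000, 25, 100  # tokens per text, chars per token, charset size

char_in = Input(shape=(MAX_WORDS, MAX_CHARS), dtype="int32")
x = layers.Embedding(N_CHARS, 30)(char_in)                  # (batch, words, chars, 30)
x = layers.TimeDistributed(
    layers.Conv1D(64, 3, padding="same", activation="relu"))(x)
# one character-based feature vector per word; it would then be concatenated
# with the corresponding word embedding before the recurrent layers
char_repr = layers.TimeDistributed(layers.GlobalMaxPooling1D())(x)

char_encoder = Model(char_in, char_repr)
```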
4.3 Deep architectures

The choice of deep architecture, either a Convolutional Neural Network (CNN) (LeCun et al., 1989) or a variant of Recurrent Neural Networks (RNN) such as Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) or Gated Recurrent Unit (GRU) (Cho et al., 2014), can have a huge effect on the performance of a model.

The deep architecture type, the number of deep layers, and the number of units in each layer, as well as the optimizers, are highly dependent on each other. Therefore, we optimize all of them on the same grid with 270 different configurations.

For the CNN model, we adapted the model of Kim (2014), and for the RNN models, we used both variants, LSTM and GRU, as single and bidirectional architectures. As seen in Table 5, although the CNN models converge faster than the RNNs, they cannot beat the RNNs' performance. Among the RNN models, the bidirectional GRU yields significantly better results.

Deep architectures                          Epochs   Results
LSTM (Hochreiter and Schmidhuber, 1997)     30       78.2 %
Bi-LSTM                                     37       82.9 %
GRU (Cho et al., 2014)                      21       79.8 %
Bi-GRU                                      29       84.3 %
CNN (single channel) (Kim, 2014)            18       81.7 %
CNN (double channel) (Kim, 2014)            23       82.5 %

Table 5: Deep architecture tuning results. Deep architectures, deep and wide networks, and optimizers are in the same grid (270 configurations).
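A minimal single-channel sketch of the winning recurrent setting (one bidirectional GRU layer with 128 units, concatenated max and average pooling, and a sigmoid output; cf. Tables 5, 6, and 8) might look as follows in Keras; the paper's actual model additionally uses two channels and an attention mechanism, which are omitted here:

```python
# Minimal single-channel sketch of the winning recurrent setting, assuming
# Keras (the paper's two-channel attention model is simplified away). The
# constants come from the text: 3k tokens, ~250K vocabulary, 300-dim
# embeddings, 28 labels, 128 GRU units.
from tensorflow.keras import Input, Model, layers

MAX_LEN, VOCAB, EMB_DIM, N_LABELS = 3000, 250_000, 300, 28

tokens = Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB, EMB_DIM)(tokens)
x = layers.Bidirectional(layers.GRU(128, return_sequences=True))(x)
# concatenation of max- and average-pooled states (the "Both" setting in Table 8)
pooled = layers.Concatenate()([layers.GlobalMaxPooling1D()(x),
                               layers.GlobalAveragePooling1D()(x)])
outputs = layers.Dense(N_LABELS, activation="sigmoid")(pooled)

model = Model(tokens, outputs)
```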
4.4 Deep vs. wide networks

Using more deep layers and more units in each layer has been beneficial for some tasks. Adding more layers helps in more complex tasks by generating more layers of abstraction, while adding more units to each layer contributes to generating more features. Still, adding extra layers in depth and width without enough training data usually leads to overfitting. In all of our configurations, we got the best performance with 128 units per layer and only one layer in depth (Table 6).

Deep vs. wide network   Epochs   Results
Deep-1                  29       84.3 %
Deep-2                  26       83.7 %
Deep-3                  18       74.6 %
Wide-64                 30       82.9 %
Wide-128                29       84.3 %
Wide-256                25       83.5 %

Table 6: Deep and wide networks tuning. Deep and wide networks, deep architectures, and optimizers are in the same grid (270 configurations).

4.5 Optimizer

The job of an optimizer is to minimize the loss of the objective function. Gradient-based methods in general, and Stochastic Gradient Descent (SGD) in particular, are among the most widely used classes of optimizers for minimizing objective functions in machine learning. Due to the high sensitivity of SGD to the learning rate, other variants such as Adagrad (Duchi et al., 2011), RMSProp (Hinton, 2012), Adam (Kingma and Ba, 2015), and Nadam (Dozat, 2015) have been proposed in recent years. In all our configurations, we got the best performance using Adam. Nadam yields almost the same performance while converging faster (Table 7).

Optimizer                      Epochs   Results
SGD                            22       78.4 %
Adagrad (Duchi et al., 2011)   25       82.7 %
RMSProp (Hinton, 2012)         27       83.9 %
Adam (Kingma and Ba, 2015)     29       84.3 %
Nadam (Dozat, 2015)            24       84.2 %

Table 7: Optimizer tuning results. Optimizers, deep and wide networks, and deep architectures are in the same grid (270 configurations).
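Switching between the optimizers of Table 7 is usually a one-line change; the sketch below assumes Keras and uses the library's default learning rates, since the paper does not report them:

```python
# Sketch of switching the optimizer setting from Table 7 in Keras (not the
# authors' code); the learning rates are library defaults, since the paper
# does not report them.
from tensorflow.keras.optimizers import SGD, Adagrad, RMSprop, Adam, Nadam

optimizers = {
    "sgd": SGD(learning_rate=0.01),
    "adagrad": Adagrad(learning_rate=0.001),
    "rmsprop": RMSprop(learning_rate=0.001),
    "adam": Adam(learning_rate=0.001),      # best result in Table 7
    "nadam": Nadam(learning_rate=0.001),    # similar score, faster convergence
}

# multi-label setup: sigmoid outputs paired with a binary cross-entropy loss
# model.compile(optimizer=optimizers["adam"],
#               loss="binary_crossentropy",
#               metrics=["binary_accuracy"])
```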
4.6 Pooling

Either in a CNN after the convolutional filters or in an RNN after the recurrent layers, pooling has proven to be a useful tool for extracting the most relevant features for a given task. We investigated three types of pooling, namely average, max, and the concatenation of both, with the best of the optimizer configurations on the same grid with 15 settings. The results are reported in Table 8, which shows that using the concatenation of both yields the best performance for our task.

Pooling   Epochs   Results
Average   29       83.2 %
Max       29       83.5 %
Both      29       84.2 %

Table 8: Pooling tuning results. Pooling and the best of optimizers are in the same grid search (15 configurations).

4.7 Gradient control

The derivatives computed during backpropagation at training time in a DNN with many layers get smaller and smaller to the point of vanishing. This is particularly true for RNNs, which have a large number of layers. This makes training difficult and time-consuming. There are two widely practiced mechanisms, gradient clipping (Mikolov, 2012) and gradient normalization (Pascanu et al., 2013), to address this issue, known as gradient vanishing. We set the gradient control mechanism with the best of the deep architectures from Subsection 4.3 on the same grid with 18 configurations. In all of these configurations, we got better results using gradient normalization (Table 9).

Gradient control                       Epochs   Results
Disabled                               28       82.9 %
Clipping (Mikolov, 2012)               31       83.1 %
Normalization (Pascanu et al., 2013)   29       84.2 %

Table 9: Gradient control tuning results. Gradient control and the best of deep architectures are in the same grid search (18 configurations).
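Keras optimizers expose both gradient-control mechanisms through the clipvalue and clipnorm arguments; reading "Normalization" as clipping by the gradient norm (Pascanu et al., 2013) is our interpretation, since the paper gives no implementation details, and the thresholds below are illustrative:

```python
# Sketch of the three gradient-control settings in Table 9, assuming Keras
# optimizers; interpreting "Normalization" as norm-based clipping is our
# assumption, and the thresholds are illustrative.
from tensorflow.keras.optimizers import Adam

opt_disabled = Adam()                  # no gradient control
opt_clipping = Adam(clipvalue=5.0)     # clip each gradient component to [-5, 5]
opt_normalized = Adam(clipnorm=1.0)    # rescale gradients whose norm exceeds 1.0
```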
4.8 Classifier

The last layer in a classification model is considered the most crucial layer, since all the computed features in this layer are projected to their appropriate classes. Therefore, the choice of this layer has an essential impact on the network performance. The choice is highly dependent on the assumptions we make about the task at hand. If the labels are independently distributed, the Sigmoid and the Softmax yield better results, while if they are conditioned on their adjacent labels (e.g., POS tagging), the Conditional Random Field (CRF) (Lafferty et al., 2001) works better. If we expect a multinomial distribution over the labels, the Softmax is the best classifier to choose, while if we expect a Bernoulli distribution, the Sigmoid is the right choice. All of these statements follow from the assumptions behind each of these statistical functions.

We investigated the performance of these three classifiers with the best of the deep architectures from Subsection 4.3 on the same grid with 18 configurations. As observed in the results presented in Table 10, the Sigmoid obtains a statistically significantly better result compared to the two other functions. As expected, due to the independence among the labels of different samples, the CRF did not perform very well. Likewise, due to the freedom among labels within each sample, the Softmax also performed poorly.

Classifier   Epochs   Results
Softmax      30       78.4 %
Sigmoid      29       84.2 %
CRF          31       77.1 %

Table 10: Classifier tuning results. The classifiers and the best of deep architectures are in the same grid search (18 configurations).
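The Sigmoid and Softmax output layers of Table 10 differ only in the activation and the loss they are paired with. The sketch below assumes Keras, treats each of the 28 labels as an independent Bernoulli variable in the Sigmoid case, and omits the CRF variant, which would need an additional dependency:

```python
# Sketch of the Sigmoid vs. Softmax output layers from Table 10, assuming
# Keras; `features` stands for the pooled representation of a document.
from tensorflow.keras import layers

N_LABELS = 28

def sigmoid_head(features):
    # one independent Bernoulli per label -> pair with binary_crossentropy
    return layers.Dense(N_LABELS, activation="sigmoid")(features)

def softmax_head(features):
    # single multinomial over all labels -> pair with categorical_crossentropy
    return layers.Dense(N_LABELS, activation="softmax")(features)
```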
4.9 Drop out value

Deep neural networks tend to memorize or overfit, which is not a desirable behavior, since we are mostly interested in the ability of the network to generalize. Drop out (Srivastava et al., 2014) is an effective tool to enhance generalizability. The first technique, known as simple or naive drop out, was proposed as a mechanism which randomly removes the connections between deep layers. Gal and Ghahramani (2016) proposed a new mechanism for drop out called variational drop out, which improves the simple drop out by defining static masks for removing the connections between deep layers ('interlayer') as well as between the units inside deep layers ('intralayer'). We placed the drop out methods with the best of the deep architectures from Subsection 4.3 on the same grid with 90 configurations. The results are reported in Table 11 and Table 12. As expected, the configuration with both inter- and intralayer variational drop out yields the best performance.

Drop out value   Epochs   Results
Disabled         24       80.2 %
Simple 0.2       26       83.2 %
Simple 0.5       27       83.8 %
Simple 0.7       29       81.5 %
Variational      32       84.2 %

Table 11: Simple drop out tuning results. The drop out and the best of deep architectures are in the same grid search (90 configurations).

Variational drop out method   Epochs   Results
Inter                         31       83.5 %
Intra                         30       83.2 %
Both                          32       84.2 %

Table 12: Variational drop out value tuning results.
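In Keras recurrent layers, the inter- and intralayer variational masks roughly correspond to the dropout and recurrent_dropout arguments, which reuse the same mask at every time step following Gal and Ghahramani (2016); the sketch below is ours and the 0.5 rate is illustrative:

```python
# Sketch of the inter/intralayer variational drop out settings in Table 12,
# assuming Keras recurrent layers; the 0.5 rate is illustrative.
from tensorflow.keras import layers

def bi_gru(rate_inter=0.5, rate_intra=0.5):
    return layers.Bidirectional(layers.GRU(
        128,
        return_sequences=True,
        dropout=rate_inter,            # mask on the layer inputs ("interlayer")
        recurrent_dropout=rate_intra,  # mask inside the recurrent units ("intralayer")
    ))

inter_only = bi_gru(rate_intra=0.0)
intra_only = bi_gru(rate_inter=0.0)
both = bi_gru()   # the best setting in Table 12
```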
5 Conclusion

In this study, we investigated various settings for a Deep Neural Network for multi-label classification. Considering the characteristics of the dataset and the task, we observed the following results. Using Sigmoid in the last layer yields statistically significantly better results compared to CRF or Softmax. The Glove embeddings (Pennington et al., 2014) with more than 100-dimensional vectors and without updating yield statistically significantly better results compared to other word vectors. Compared to other deep architectures, bi-GRU yields better results when it is used as a one-depth layer with 128 units. Adam and Nadam obtain roughly the same results, while Nadam converges much faster. Pooling is best used as the concatenation of both max- and average-pooled tensors, and it is better to use normalization (Pascanu et al., 2013) as a means of gradient control to counter gradient vanishing. It is also good practice to use variational drop out (Gal and Ghahramani, 2016) both between layers and inside recurrent units to control over-fitting. Finally, we did not observe any improvement from using character embeddings.

The order in which these parameters are mentioned reflects the magnitude of their importance for the final performance. Parameters not mentioned here did not have any noticeable impact on the system results.

References

Ahmad Aghaebrahimian and Mark Cieliebak. 2019. Towards integration of statistical hypothesis tests into deep neural networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL). Florence, Italy.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, New Mexico, USA.

D. G. Anastasiev, I. O. Gusev, and E. M. Indenbom. 2018. Improving part-of-speech tagging via multi-task learning and character-level word representations. In Proceedings of the International Conference Dialogue, Computational Linguistics and Intellectual Technologies.

James Bergstra, Daniel Yamins, and David D. Cox. 2013. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on Machine Learning (ICML). Atlanta, GA, USA.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of the Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12:2493–2537.

Timothy Dozat. 2015. Incorporating Nesterov momentum into Adam.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12:2121–2159.

Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16). Curran Associates Inc., USA.

Geoffrey Hinton. 2012. Neural networks for machine learning. Lecture 6a: Overview of mini-batch gradient descent.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9.

Frank Hutter, Jörg Lücke, and Lars Schmidt-Thieme. 2015. Beyond manual tuning of hyperparameters. KI - Künstliche Intelligenz.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. CoRR abs/1412.6980.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML).

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, California.

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation.

Omer Levy and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 1064–1074.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv:1301.3781.

Tomas Mikolov. 2012. Statistical language models based on neural networks. Ph.D. thesis, Brno University of Technology.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning (ICML).

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1986. Learning representations by back-propagating errors. Nature 323.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS). USA.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research.