Hyperparameter Tuning For Deep Learning in Natural Language Processing
© Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
impact of each hyperparameter on this specific
task. This study is performed by running over 400
different configurations in over 3000 GPU hours.
The contribution of this work is to provide a prior-
itized list of hyperparameters to optimize.
2 Related Work
Hyperparameter tuning is often performed using grid search (brute force), where all possible combinations of the hyperparameters with all of their values form a grid, and an algorithm is trained for each combination. However, this method becomes computationally infeasible even for small numbers of hyperparameters. For instance, in our study with 12 categories of hyperparameters, each with four instances on average, we would have a grid with several million nodes, which would be highly computationally expensive. To address this issue, Bergstra et al. (2013) proposed a method for randomized parameter tuning and showed that for each of their datasets there are only a few impactful parameters for which more values should be tried. However, due to the random mechanism in this approach, each trial is independent of the others; hence, it learns nothing from the other experiments. To address this problem, Snoek et al. (2012) proposed a Bayesian optimization method that uses a statistical model to map hyperparameters to an objective function. However, Bayesian optimization adds another layer of complexity to the problem, and the method has therefore not gained much popularity since its proposal.

The most effective and straightforward method for hyperparameter tuning is still ad-hoc grid search (Hutter et al., 2015), where the researcher manually tries the most correlated parameters on the same grid to gradually and iteratively find the most impactful set of hyperparameters with the best values.

3 Multi-Label Classification

Multi-label text classification is the task of assigning one or more labels to each text. News classification is an example of such a task. For this task, we adopted a state-of-the-art architecture for multi-label classification (Aghaebrahimian and Cieliebak, 2019). The schema of the model is illustrated in Figure 1.

Figure 1: The system architecture

The architecture consists of two channels of bi-GRU deep structures with an attention mechanism and a dense sigmoid layer on top. The illustrated schema is the optimized network which created the best results for the task. One channel is devoted to the most informative words given each class, which are extracted using the χ2 method. The other channel is used for the input tokens. For more information about the architecture, please refer to Aghaebrahimian and Cieliebak (2019).

The dataset used for this experiment is a proprietary dataset with roughly 60K articles and a total of 28 labels. The dataset contains about 250K distinct words and assigns 2.5 labels to each article on average. It is randomly divided into 80%, 10%, and 10% parts for training, validation, and testing, respectively.

The textual data is preprocessed by removing non-alphanumeric characters and replacing numeric values with a unique symbol. The resulting strings are tokenized and truncated to 3K tokens; shorter texts are padded with 0 so that all texts have the same length.

Two measures are used for evaluation. F1 (Micro) is used as a measure of performance; it is computed by calculating an F1 score for each article and averaging the scores over all articles in the test data. The second metric, Epochs, is reported as a measure of the time required for a network with a specific setting to converge. Early stopping is used as the convergence criterion: convergence is recognized when no decrease in validation loss is observed for three consecutive epochs. All models are trained in batches of 64 instances.

4 Experimental results

There are 12 categories of hyperparameters which are tuned in this study. Some of the hyperparameters, such as the deep architecture or the classifier type, are network choices, while others, such as the embedding type or the dropout rate, are variables pertaining to different parts of the network. The results of hyperparameter optimization for each criterion are reported in the following subsections.

All parameters except the parameter under investigation in each experiment are kept constant. All other parameters that are not part of this study, such as the seed number or the batch size, are also kept constant throughout all the experiments.

4.1 Word Embeddings Grid

In this grid, we tune the word embedding type, the size, and the method of updating.

Word embedding type                   Epochs  Results
Word2Vec (Mikolov et al., 2013)       26      81.9 %
Glove-6 (Pennington et al., 2014)     25      81.7 %
Glove-42 (Pennington et al., 2014)    26      82.9 %
Glove-840 (Pennington et al., 2014)   29      84.5 %
FastText (Bojanowski et al., 2016)    24      79.2 %
Dependency (Levy and Goldberg, 2014)  22      81.4 %
ELMo (Peters et al., 2018)            32      84.6 %

Table 1: Embedding type tuning results. Embedding types, sizes, and update methods are on the same grid (26 configurations).

Low-dimensional dense word vectors known as word embeddings [...] 50-dimensional vectors, which is sub-optimal; all other dimensions yield superior results with an unnoticeable difference in the number of Epochs.

Word embedding size  Epochs  Results
50                   22      81.8 %
100                  25      82.9 %
200                  27      83.6 %
300                  29      84.3 %
1024                 32      84.6 %
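To make the combinatorial argument from Section 2 concrete: a full grid over 12 hyperparameter categories with four candidate values each contains 4^12 ≈ 16.8 million nodes. A minimal sketch (the hyperparameter names and values here are illustrative, not the paper's actual search space):

```python
# Illustrative grid: 12 hyperparameter categories, four candidate values each
# (names and values are hypothetical placeholders).
grid = {f"hyperparam_{i}": [f"value_{j}" for j in range(4)] for i in range(12)}

# The grid size is the product of the number of values per hyperparameter.
n_combinations = 1
for values in grid.values():
    n_combinations *= len(values)

print(n_combinations)  # 4**12 = 16777216 configurations
```

Training one model per node at even a few minutes each is what makes exhaustive grid search infeasible here.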
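The preprocessing described in Section 3 (strip non-alphanumeric characters, map numeric values to a unique symbol, truncate to 3K tokens, pad shorter texts with 0) can be sketched as below. The `<num>` symbol, the whitespace tokenizer, and the string `"0"` pad token are assumptions; the paper does not specify them:

```python
import re

MAX_LEN = 3000  # truncation/padding length from the paper (3K tokens)

def preprocess(text, max_len=MAX_LEN):
    # Replace numeric values with a unique symbol (the symbol is an assumption).
    text = re.sub(r"\d+", "<num>", text)
    # Remove non-alphanumeric characters, keeping whitespace and <num> markers.
    text = re.sub(r"[^\w\s<>]", " ", text)
    tokens = text.split()                      # whitespace tokenization as a stand-in
    tokens = tokens[:max_len]                  # truncate long texts
    tokens += ["0"] * (max_len - len(tokens))  # pad short texts with 0
    return tokens

tokens = preprocess("Prices rose 12% in 2019, analysts said.")
print(len(tokens))  # 3000
```

Fixing every text to the same length is what allows the batched training (64 instances per batch) described above.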
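The convergence criterion in Section 3 (stop when validation loss has not decreased for three consecutive epochs) is early stopping with a patience of 3. A framework-agnostic sketch:

```python
def train_with_early_stopping(val_losses, patience=3):
    """Return the number of epochs run before early stopping triggers.

    `val_losses` stands in for the validation loss observed after each epoch;
    in practice these come from evaluating the model on the validation set.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch  # no improvement for `patience` consecutive epochs
    return len(val_losses)

print(train_with_early_stopping([0.9, 0.7, 0.6, 0.65, 0.64, 0.61]))  # → 6
```

The epoch count returned by this loop is what the Epochs column in the result tables reports.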
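The performance metric as described in Section 3 — an F1 score per article over its predicted and gold label sets, averaged over all test articles — can be sketched as below. Note this is sample-averaged F1; whether the paper's "F1 (Micro)" is computed this way or by global micro-averaging of label counts is our reading of the description, not something the text settles:

```python
def article_f1(gold, predicted):
    """F1 between the gold and predicted label sets of a single article."""
    gold, predicted = set(gold), set(predicted)
    if not gold and not predicted:
        return 1.0
    overlap = len(gold & predicted)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def averaged_f1(gold_sets, predicted_sets):
    """Average the per-article F1 scores over the test data."""
    scores = [article_f1(g, p) for g, p in zip(gold_sets, predicted_sets)]
    return sum(scores) / len(scores)

gold = [{"politics", "economy"}, {"sports"}]
pred = [{"politics"}, {"sports", "economy"}]
print(round(averaged_f1(gold, pred), 3))  # → 0.667
```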