Ensemble Learning For Named Entity Recognition
Introduction
One of the first research papers in the field of named entity recognition (NER) was presented in 1991 [32]. Today, more than two decades later, this research field is still highly relevant for manifold communities, including the Semantic Web community, where the need to capture and to translate the content of natural language (NL) with the help of NER tools arises in many semantic applications [15, 19, 20, 24, 34]. The NER tools that resulted from more than two decades of research now implement a diversity of algorithms that rely on a large number of heterogeneous formalisms. Consequently, these algorithms have diverse strengths and weaknesses.
Currently, several services and frameworks that consume NL to generate semi-structured or even structured data rely solely on one of the formalisms developed for NER or simply merge the results of several tools (e.g., by using simple voting). By doing so, current approaches fail to make use of the diversity of current NER algorithms. On the other hand, it is a well-known fact that algorithms with diverse strengths and weaknesses can be aggregated in various ways to create a system that outperforms the best individual algorithm within the system [44]. This learning paradigm is known as ensemble learning. While previous works have already suggested that ensemble learning can be used to improve NER [34], no comparison of the performance of existing supervised machine-learning approaches for ensemble learning on the NER task has been presented so far.
We address this research gap by presenting and evaluating an open-source framework for NER that makes use of ensemble learning. In this evaluation, we use four state-of-the-art NER algorithms, fifteen different machine learning algorithms and five datasets. The statistical significance of our results is ensured by using Wilcoxon signed-rank tests.
The goal of our evaluation is to answer the following questions:
1. Does NER based on ensemble learning achieve higher f-scores than the best NER
tool within the system?
2. Does NER based on ensemble learning achieve higher f-scores than simple voting
based on the results of the NER tools?
3. Which ensemble learning approach achieves the best f-score for the NER task?
The rest of this paper is structured as follows. After reviewing related work in Section 2, we give an overview of our approach in Section 3. In particular, we present the theoretical framework that underlies our approach. Subsequently, in Section 4, we present our evaluation pipeline and its setup. Thereafter, in Section 5, we present the results of a series of experiments in which we compare several machine learning algorithms with state-of-the-art NER tools. We conclude by discussing our results and elaborating on some future work in Section 6. The results of this paper were integrated into the open-source NER framework FOX. Our framework provides a free-to-use RESTful web service for the community. A documentation of the framework as well as a specification of the RESTful web service can be found at FOX's project page.
Related Work
NER tools and frameworks implement a broad spectrum of approaches, which can
be subdivided into three main categories: dictionary-based, rule-based and machine-learning approaches [31]. The first systems for NER implemented dictionary-based approaches, which relied on a list of named entities (NEs) and tried to identify these in text [2, 43]. Following work then showed that these approaches did not perform well for
NER tasks such as recognizing proper names [39]. Thus, rule-based approaches were
introduced. These approaches rely on hand-crafted rules [8, 42] to recognize NEs. Most
rule-based approaches combine dictionary and rule-based algorithms to extend the list
of known entities. Nowadays, hand-crafted rules for recognizing NEs are usually implemented when no training examples are available for the domain or language to process [32]. When training examples are available, the methods of choice are borrowed
from supervised machine learning. Approaches such as Hidden Markov Models [46],
Maximum Entropy Models [10] and Conditional Random Fields [14] have been applied
to the NER task. Due to the scarcity of the large training corpora required by supervised machine learning approaches, the semi-supervised [31, 35] and unsupervised machine learning paradigms [13, 33] have also been used to extract NEs from text. In [44], a system for language-independent NER was presented that combines classifiers, trained on several languages, by stacking and voting. [31] gives an exhaustive
overview of approaches for the NER task.
Over the last years, several benchmarks for NER have been proposed. For example, [9] presents a benchmark for NER and entity linking approaches. In particular, the authors define the named entity annotation task. Other benchmark datasets include the manually annotated datasets presented in [38]. Here, the authors present annotated datasets extracted from RSS feeds as well as datasets retrieved from news platforms. Other authors designed datasets to evaluate their own systems. For example, the Web dataset (which we use in our evaluation) is a particularly noisy dataset designed to evaluate the system presented in [37]. The Reuters dataset, which we also use, consists of annotated documents chosen from the Reuters-21578 corpus and was used in [4].
Overview
3.1 Named Entity Recognition
NER encompasses two main tasks: (1) the identification of names such as Germany, University of Leipzig and G. W. Leibniz in a given unstructured text and (2) the classification of these names into predefined entity types, such as Location, Organization and Person. In general, the NER task can be viewed as the sequential prediction problem of estimating the probabilities $P(y_i \mid x_{i-k}, \ldots, x_{i+l}, y_{i-m}, \ldots, y_{i-1})$, where $x = (x_1, \ldots, x_n)$ is an input sequence (i.e., the preprocessed input text) and $y = (y_1, \ldots, y_n)$ the output sequence (i.e., the entity types) [37].
3.2 Ensemble Learning
The idea behind ensemble learning is to combine the output of several classifiers $C_i$ by training a classifier $S$ on their results. However, classical ensemble learning approaches have the disadvantage of relying on some form of weighted vote on the output of the classifiers.
Thus, if all classifiers $C_i$ return wrong results, classical ensemble learning approaches
are bound to make the same mistake [12]. In addition, voting does not take the different
levels of accuracy of classifiers for different entity types into consideration. Rather, it
assigns a global weight to each classifier that describes its overall accuracy. Based on
these observations, we decided to apply ensemble learning for NER at the entity-type
level. The main advantage of this ensemble-learning setting is that we can now assign
different weights to each tool-type pair.
Formally, we model the ensemble learning task at hand as follows: let the matrix $M_{mtn}$ (Equation 1) illustrate the input data for $S$, where $P^{m}_{n,t}$ is the prediction of the $m$-th NER tool that the $n$-th token is of the $t$-th type:

\[
M_{mtn} =
\begin{pmatrix}
P^{1}_{1,1} \cdots P^{1}_{1,t} & P^{2}_{1,1} \cdots P^{2}_{1,t} & \cdots & P^{m}_{1,1} \cdots P^{m}_{1,t} \\
\vdots & \vdots & \ddots & \vdots \\
P^{1}_{n,1} \cdots P^{1}_{n,t} & P^{2}_{n,1} \cdots P^{2}_{n,t} & \cdots & P^{m}_{n,1} \cdots P^{m}_{n,t}
\end{pmatrix}
\tag{1}
\]
The goal of ensemble learning for NER is to detect a classifier that leads to a correct
classification of each of the $n$ tokens into one of the $t$ types.
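To make Equation 1 concrete, the following sketch (a minimal illustration, not the FOX implementation; the NerTool interface and its predictTypes method are hypothetical) builds the matrix M row by row: each token yields one row whose column blocks hold the binary type predictions of the individual tools.

```java
// Minimal sketch (not the FOX code): build the matrix M from Equation 1,
// one row per token, one column per (tool, type) pair. Tool predictions are
// assumed to be binary, as in the evaluation (confidence scores unavailable).
import java.util.List;

public class PredictionMatrix {

    /** Hypothetical interface: each NER tool returns, for every token,
     *  a binary score per entity type (Location, Organization, Person). */
    public interface NerTool {
        double[] predictTypes(String token, List<String> sentence);
    }

    /** Builds the n x (m*t) matrix: rows are tokens, column blocks are tools. */
    public static double[][] buildMatrix(List<String> tokens, List<NerTool> tools, int numTypes) {
        double[][] m = new double[tokens.size()][tools.size() * numTypes];
        for (int n = 0; n < tokens.size(); n++) {
            for (int toolIdx = 0; toolIdx < tools.size(); toolIdx++) {
                double[] p = tools.get(toolIdx).predictTypes(tokens.get(n), tokens);
                // P^{toolIdx}_{n,t} occupies columns [toolIdx*numTypes, (toolIdx+1)*numTypes)
                System.arraycopy(p, 0, m[n], toolIdx * numTypes, numTypes);
            }
        }
        return m;
    }
}
```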
Evaluation
We performed a thorough evaluation of ensemble learning approaches by using five different datasets and running a 10-fold cross-validation for 15 algorithms. In this section,
we present the pipeline and the setup for our evaluation as well as our results.
4.1 Pipeline
Figure 1 shows the workflow chart of our evaluation pipeline. In the first step of
our evaluation pipeline, we preprocessed our reference dataset to extract the input text
for the NER tools as well as the correct NEs, which we used to create training and
testing data. In the second step, we ran all NER tools on this input text to calculate the predictions of all entity types for each token in this input. At this point, we represented the output of the tools as a matrix (see Equation 1). Thereafter, the matrix
was randomly split into 10 disjoint sets as preparation for a 10-fold cross-validation.
We trained the different classifiers at hand (i.e., S) with the training dataset (i.e., with 9
of 10 sets) and tested the trained classifier with the testing dataset (i.e., with the leftover
set). To use each of the 10 sets as testing set once, we repeated training and testing of
the classifiers 10 times and used the disjoint sets accordingly. Furthermore, the pipeline
was repeated 10 times to deal with non-deterministic classifiers. In the last step, we
compared the classification of the 10 testing datasets with the oracle dataset to calculate
measures for the evaluation.
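The cross-validation step can be illustrated with the Weka API that the evaluation relies on. The sketch below is an illustration under stated assumptions, not the original pipeline code: the ARFF file name and attribute layout are hypothetical, and Weka's weightedFMeasure is used only as a stand-in for the macro-averaged F1 of Equation 2.

```java
// Sketch of the repeated 10-fold cross-validation step with Weka (assumed ARFF
// layout: one instance per token, tool/type predictions as attributes, the
// gold entity type as class attribute). Not the original pipeline code.
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("news-token-matrix.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        AdaBoostM1 abm1 = new AdaBoostM1();
        abm1.setClassifier(new J48()); // ABM1 with J48 as base classifier, default parameters

        // Repeat the 10-fold cross-validation 10 times with different seeds
        // to account for non-deterministic classifiers.
        for (int run = 0; run < 10; run++) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(abm1, data, 10, new Random(run));
            System.out.printf("run %d: weighted F-measure = %.4f, error rate = %.4f%n",
                    run, eval.weightedFMeasure(), eval.errorRate());
        }
    }
}
```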
We ran our pipeline on 15 ensemble learning algorithms. We carried out both a
token-based evaluation and an entity-based evaluation. In the token-based evaluation,
we regarded partial matches of multi-word units as being partially correct. For example,
our gold standard considered Federal Republic of Germany as being an
instance of Location. If a tool generated Germany as being a location and omitted
Federal Republic of, it was assigned 1 true positive and 3 false negatives. The
entity-based evaluation only regarded exact matches as correct. In the example above,
the entity was simply considered to be incorrect. To provide transparent results, we only used open-source libraries in our evaluation. Given that some of the tools at hand do not allow accessing their confidence scores without major alterations of their code,
we considered the output of the tools to be binary (i.e., either 1 or 0).
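The token-based treatment of partial matches can be illustrated as follows. This is a toy sketch with hypothetical type labels (LOC, O), not the evaluation code; it reproduces the Federal Republic of Germany example, yielding 1 true positive and 3 false negatives.

```java
// Illustrative token-level scoring for partial matches (not the evaluation code):
// the gold annotation marks "Federal Republic of Germany" as Location, the
// prediction marks only "Germany" -> 1 true positive and 3 false negatives.
import java.util.List;

public class TokenLevelScoring {

    public static int[] score(List<String> goldTypes, List<String> predTypes, String type) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < goldTypes.size(); i++) {
            boolean gold = type.equals(goldTypes.get(i));
            boolean pred = type.equals(predTypes.get(i));
            if (gold && pred) tp++;
            else if (!gold && pred) fp++;
            else if (gold && !pred) fn++;
        }
        return new int[]{tp, fp, fn};
    }

    public static void main(String[] args) {
        List<String> gold = List.of("LOC", "LOC", "LOC", "LOC"); // Federal Republic of Germany
        List<String> pred = List.of("O", "O", "O", "LOC");       // only "Germany" recognized
        int[] s = score(gold, pred, "LOC");
        System.out.printf("TP=%d FP=%d FN=%d%n", s[0], s[1], s[2]); // TP=1 FP=0 FN=3
    }
}
```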
We integrated four NER tools so far: the Stanford Named Entity Recognizer (Stanford) [14], the Illinois Named Entity Tagger (Illinois) [37], the Ottawa Baseline Information Extraction (Balie) [30] and the Apache OpenNLP Name Finder (OpenNLP) [3]. We only considered the performance of these tools on the classes Location, Organization and Person. To this end, we mapped the entity types of each of the NER tools to these three classes. We utilized the Waikato Environment for Knowledge Analysis (Weka) [21] and the implemented classifiers with default parameters: AdaBoostM1 (ABM1) [16] and Bagging (BG) [5] with J48 [36] as base classifier, Decision Table (DT) [26], Functional Trees (FT) [18, 27], J48 [36], Logistic Model Trees (LMT) [27, 41], Logistic Regression (Log) [28], Additive Logistic Regression (LogB) [17], Multilayer Perceptron (MLP), Naive Bayes (NB) [23], Random Forest (RF) [6], Support Vector Machine (SVM) [7] and Sequential Minimal Optimization (SMO) [22]. In addition, we used voting at class level (CVote) and a simple voting (Vote) approach [44] with equal weights for all NER tools. CVote selects the NER tool with the highest prediction performance for each type according to the evaluation and uses that particular tool for the given class. Vote, as a naive approach, combines the results of the NER tools with the Majority Vote Rule [25] and serves as the baseline ensemble learning technique in our evaluation.
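For illustration, the sketch below shows how a few of the listed Weka classifiers can be instantiated with default parameters and how a majority-vote combiner is configured. It is not the framework's code; note also that in the paper the Vote baseline combines the outputs of the four NER tools rather than Weka classifiers.

```java
// Sketch: instantiating some of the Weka classifiers used in the evaluation
// with default parameters, plus a majority-vote combiner. Illustration of the
// Weka API only, not the framework's actual configuration.
import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.meta.Bagging;
import weka.classifiers.meta.Vote;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;
import weka.core.SelectedTag;

public class ClassifierSetup {

    public static Classifier[] ensembleLearners() {
        AdaBoostM1 abm1 = new AdaBoostM1();
        abm1.setClassifier(new J48());   // ABM1 with J48 as base classifier
        Bagging bg = new Bagging();
        bg.setClassifier(new J48());     // BG with J48 as base classifier
        return new Classifier[]{abm1, bg, new RandomForest(),
                new MultilayerPerceptron(), new NaiveBayes(), new J48()};
    }

    /** Majority voting over member classifiers with equal weights
     *  (in the paper, the members are the four NER tools). */
    public static Vote majorityVote(Classifier[] members) {
        Vote vote = new Vote();
        vote.setClassifiers(members);
        vote.setCombinationRule(new SelectedTag(Vote.MAJORITY_VOTING_RULE, Vote.TAGS_RULES));
        return vote;
    }
}
```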
4.2 Experimental Setup
We used five datasets and five measures for our evaluation. We used the recommended Wilcoxon signed-rank test to measure the statistical significance of our results [11]. For this purpose, we used the measurements of the ten 10-fold cross-validation runs as the underlying distributions and set up a 95% confidence interval.
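As an illustration of such a significance check, the sketch below compares two hypothetical per-run F-score samples with the Wilcoxon signed-rank test from Apache Commons Math. The paper does not state which implementation was used, so the library choice and the numbers are assumptions.

```java
// Illustrative significance check between two classifiers' per-run F-scores
// using Apache Commons Math (an assumption; not the authors' statistics code).
import org.apache.commons.math3.stat.inference.WilcoxonSignedRankTest;

public class SignificanceCheck {
    public static void main(String[] args) {
        // Hypothetical F-scores of two classifiers over the ten cross-validation runs.
        double[] fScoresA = {95.2, 95.1, 95.3, 95.2, 95.0, 95.2, 95.3, 95.1, 95.2, 95.2};
        double[] fScoresB = {94.7, 94.8, 94.6, 94.9, 94.7, 94.8, 94.7, 94.6, 94.8, 94.7};

        WilcoxonSignedRankTest test = new WilcoxonSignedRankTest();
        double p = test.wilcoxonSignedRankTest(fScoresA, fScoresB, true); // exact p-value
        System.out.println("significant at the 95% level: " + (p < 0.05));
    }
}
```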
Datasets An overview of the datasets is shown in Table 1. The Web dataset consists of 20 annotated Web sites as described in [37] and contains the most noise compared to the other datasets. The Reuters dataset consists of 50 documents randomly chosen from the Reuters-21578 corpus [4]. The second News dataset is a small subset of the first News dataset, which consists of text from newspaper articles, and was re-annotated manually by the authors to ensure high data quality. Likewise, Reuters was extracted and annotated manually by the authors. The last dataset, All, consists of the datasets mentioned before merged into one and allows for measuring how well the ensemble learning approaches perform when presented with data from heterogeneous sources.
[Table 1: overview of the datasets with the number of annotations per dataset. News: 5117, 6899 and 3899 annotations (15915 in total); re-annotated News subset: 341, 434 and 254 annotations (1029 in total).]
Measures To assess the performance of the different algorithms, we computed the following values on the test datasets: the number of true positives $TP_t$, the number of true negatives $TN_t$, the number of false positives $FP_t$ and the number of false negatives $FN_t$. These numbers were collected for each entity type $t$ and averaged over the ten runs of the 10-fold cross-validations. Then, we applied the one-against-all approach [1] to convert the multi-class confusion matrix of each dataset into a binary confusion matrix.
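A minimal sketch of this one-against-all conversion is given below (illustration only, not the evaluation code): each entity type's binary counts are derived from a multi-class confusion matrix.

```java
// Sketch of the one-against-all conversion [1]: a multi-class confusion
// matrix (gold type x predicted type) is turned into binary counts
// {TP_t, TN_t, FP_t, FN_t} for each entity type t.
public class OneAgainstAll {

    /** confusion[i][j] = number of tokens with gold type i predicted as type j. */
    public static long[][] toBinaryCounts(long[][] confusion) {
        int numTypes = confusion.length;
        long total = 0;
        for (long[] row : confusion) {
            for (long v : row) total += v;
        }
        long[][] binary = new long[numTypes][4]; // TP, TN, FP, FN per type
        for (int t = 0; t < numTypes; t++) {
            long tp = confusion[t][t];
            long fn = 0, fp = 0;
            for (int j = 0; j < numTypes; j++) {
                if (j != t) {
                    fn += confusion[t][j]; // gold type t, predicted something else
                    fp += confusion[j][t]; // gold other type, predicted t
                }
            }
            binary[t] = new long[]{tp, total - tp - fp - fn, fp, fn};
        }
        return binary;
    }
}
```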
Subsequently, we determined the classical measures recall (rec), precision (pre) and f-score ($F_1$) with macro-averaging as follows:
\[
rec = \frac{\sum_{t \in T} \frac{TP_t}{TP_t + FN_t}}{|T|}, \quad
pre = \frac{\sum_{t \in T} \frac{TP_t}{TP_t + FP_t}}{|T|}, \quad
F_1 = \frac{\sum_{t \in T} \frac{2\, pre_t\, rec_t}{pre_t + rec_t}}{|T|}
\tag{2}
\]
For the sake of completeness, we averaged the error rate (error) (Equation 3) and the Matthews correlation coefficient (MCC) [29] (Equation 4) similarly.
\[
error = \frac{\sum_{t \in T} \frac{FP_t + FN_t}{TP_t + TN_t + FP_t + FN_t}}{|T|}
\tag{3}
\]

\[
MCC = \frac{\sum_{t \in T} \frac{TP_t\, TN_t - FP_t\, FN_t}{\sqrt{(TP_t + FP_t)(TP_t + FN_t)(TN_t + FP_t)(TN_t + FN_t)}}}{|T|}
\tag{4}
\]
The error rate monitors the fraction of positive and negative classifications for which the classifier failed. The Matthews correlation coefficient considers both the true positives and the true negatives as successful classifications and is rather unaffected by sampling biases. Higher values indicate better classifications.
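The macro-averaged measures of Equations 2 to 4 can be computed from the per-type counts as in the following sketch (illustration only; the confusion counts in the example are made up).

```java
// Sketch: macro-averaged measures from Equations (2)-(4), computed from
// per-type confusion counts (TP, TN, FP, FN). Not the framework's evaluation code.
public class MacroMeasures {

    /** counts[t] = {TP_t, TN_t, FP_t, FN_t} for entity type t. */
    public static double[] macro(double[][] counts) {
        double rec = 0, pre = 0, f1 = 0, error = 0, mcc = 0;
        for (double[] c : counts) {
            double tp = c[0], tn = c[1], fp = c[2], fn = c[3];
            double r = tp / (tp + fn);
            double p = tp / (tp + fp);
            rec += r;
            pre += p;
            f1 += 2 * p * r / (p + r);
            error += (fp + fn) / (tp + tn + fp + fn);
            mcc += (tp * tn - fp * fn)
                    / Math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
        }
        int t = counts.length;
        return new double[]{rec / t, pre / t, f1 / t, error / t, mcc / t};
    }

    public static void main(String[] args) {
        // Hypothetical counts for Location, Organization and Person.
        double[][] counts = {{900, 9000, 50, 50}, {800, 9100, 60, 40}, {950, 8950, 40, 60}};
        double[] m = macro(counts);
        System.out.printf("rec=%.4f pre=%.4f F1=%.4f error=%.4f MCC=%.4f%n",
                m[0], m[1], m[2], m[3], m[4]);
    }
}
```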
Results
Table 2 to Table 11 show the results of our evaluation for the 15 classifiers we used within our pipeline and the four NER tools we integrated so far. The best results are marked in bold and the NER tools are underlined. Figure 2 to Figure 4 depict the f-scores, separated according to classes, of the four NER tools, the simple voting approach Vote and the best classifier for the depicted dataset.
[Table 2 and Table 3: token-based and entity-based evaluation results (rec, pre, F1, error, MCC) on the News dataset for the 15 classifiers and the four NER tools.]
We reached the highest f-scores on the News dataset (Table 2 and Table 3) for both the token-based and the entity-based evaluation. In the token-based evaluation, the MLP and RF classifiers perform best for precision (95.28%), error rate (0.32%) and Matthews correlation coefficient (0.951). MLP performs best for f-score (95.23%) with 0.04% more recall than RF. The baseline classifier (i.e., simple voting) is clearly outperformed by MLP by up to +5.21% recall, +12.31% precision, +9.31% f-score, -0.62% error rate and +0.094 MCC. Furthermore, the best single approach is Stanford, which is outperformed by up to +2.83% recall, +4.27% precision, +3.55% f-score, -0.21% error rate (that is, a reduction by 40%) and +0.037 MCC. Slightly poorer results are achieved in the entity-based evaluation, where MLP is second to FT with 0.01% less f-score.
On the News dataset (Table 4 and Table 5), which was the largest homogeneous dataset in our evaluation, we repeatedly achieved high f-scores.
[Table 4 and Table 5: token-based and entity-based evaluation results (rec, pre, F1, error, MCC) on the News dataset for the 15 classifiers and the four NER tools.]
The best approach w.r.t. the
token-based evaluation is LMT with an f-score of 92.94%. Random Forest again follows the best approach with respect to f-score. Moreover, the best single tool Stanford and the baseline classifier Vote are again outperformed, by up to +2.6% and +19.91% f-score, respectively. Once again, the entity-based results are approximately 2% poorer, with LMT leading the table as in the token-based evaluation.
On the Web dataset (Table 6 and Table 7), which is the worst-case dataset for NER tools as it contains several incomplete sentences, the different classifiers reached their lowest values. For the token-based evaluation, AdaBoostM1 with J48 achieves the best f-score (69.04%) and Matthews correlation coefficient (0.675) and is again followed by Random Forest with respect to f-score. Naive Bayes performs best for recall (96.64%), Logistic Regression for precision (77.89%) and MLP and RF for the error rate (3.33%). Simple voting is outperformed by ABM1 by up to +3.5% recall, +20.08% precision, +10.45% f-score, -2.64% error rate and +0.108 MCC, while Stanford (the best tool for this dataset) is outperformed by up to +3.83% recall, +2.64% precision, +3.21% f-score, -0.13% error rate and +0.032 MCC. Similar insights can be won from the entity-based evaluation, with some classifiers like RF being approximately 10% poorer than at token level.
On the Reuters dataset (Table 8 and Table 9), which was the smallest dataset in our evaluation, Support Vector Machine performs best. In the token-based evaluation, SVM achieves an f-score of 87.78%, an error rate of 0.89% and a Matthews correlation coefficient of 0.875. It is followed by Random Forest with respect to f-score once again. Naive Bayes performs best for recall (86.54%).
[Table 6 and Table 7: token-based and entity-based evaluation results (rec, pre, F1, error, MCC) on the Web dataset for the 15 classifiers and the four NER tools.]
[Table 8 and Table 9: token-based and entity-based evaluation results (rec, pre, F1, error, MCC) on the Reuters dataset for the 15 classifiers and the four NER tools.]
In comparison, ensemble learning with SVM outperforms Vote by up to +4.46% recall, +3.48% precision, +2.43% f-score, -0.54% error rate and +0.082 MCC. Moreover, the best NER tool for this dataset, Illinois, is outperformed by up to +0.83% recall, +3.48% precision, +2.43% f-score, -0.20% error rate and +0.024 MCC. In Figure 3a, we barely see a learning effect, as ABM1 is almost equal to one of the integrated NER tools assessed at class level, especially for the class Organization on the Web dataset; in Figure 3c, on the Reuters dataset, we clearly see a learning effect for the classes Organization and Person with the SVM approach.
On the All dataset for the token-based evaluation (Table 10), the Random Forest approach performs best for f-score (91.27%), error rate (0.64%) and Matthews correlation coefficient (0.909). Support Vector Machine achieves the best precision (91.24%) and Naive Bayes again the best recall (91.00%).
[Table 10 and Table 11: token-based and entity-based evaluation results (rec, pre, F1, error, MCC) on the All dataset for the 15 classifiers and the four NER tools.]
In comparison, ensemble learning with RF outperformed Vote by up to +9.71% recall, +21.01% precision, +18.37% f-score, -1.8% error rate and +0.176 MCC, and Stanford, the best tool for this dataset, by up to +0.83% recall, +3.24% precision, +2.06% f-score, -0.14% error rate and +0.021 MCC. Again, compared to the token-based evaluation, the f-score of J48, the best ensemble learning approach in the entity-based evaluation (Table 11), is approximately 1% poorer, with higher recall but lower precision. In Figure 4, we clearly see a learning effect for RF and J48 at class level.
Overall, ensemble learning outperforms all included NER tools and the simple voting approach on all datasets with respect to f-score, which answers our first and second questions. Here, it is worth mentioning that Stanford and Illinois are the best tools in our framework. The three best classifiers with respect to the f-scores averaged over our datasets are, for the token-based evaluation, the Random Forest classifier with the highest value, closely followed by Multilayer Perceptron and AdaBoostM1 with J48, and, for the entity-based evaluation, AdaBoostM1 with J48 with the highest value, closely followed by MLP and J48. We cannot observe a significant difference between these.
In Table 12 and Table 13, we depict the f-scores of these three classifiers at class
level for our datasets. The statistically significant differences are marked in bold. Note
that two out of three scores being marked bold for the same setting in a column means
that the corresponding approaches are significantly better than the third one yet not
significantly better than each other. In the token-based evaluation, the Multilayer Perceptron and Random Forest classifiers surpass AdaBoostM1 with J48 on the News and Web datasets. On the News dataset, MLP surpasses RF for Location but RF surpasses MLP for Person. On the Web dataset, RF is better than MLP for Location, while the two are not significantly different from one another for Person. Also, for the Organization class, no significant difference could be determined on either dataset. On the Reuters dataset, MLP and RF are better than ABM1 for Location and Organization, but do not differ from one another. For the class Person, no significant difference could be determined among the three classifiers. On the News and All datasets, Random Forest is significantly best for Location. Random Forest and AdaBoostM1 with J48 surpass the Multilayer Perceptron for Organization but are not significantly different from each other. For the class Person, ABM1 is significantly best on the News dataset and RF is best on the All dataset. The entity-level results also suggest shifts amongst the best systems depending on the dataset. Interestingly, MLP and ABM1 are the only two classes of algorithms that appear as top algorithms in both evaluation schemes.
Consequently, our results suggest that while the four approaches RF, MLP, ABM1
and J48 perform best over the datasets at hand, MLP and ABM1 are to be favored. Note
that significant differences can be observed across the different datasets and that all four
paradigms RF, MLP, ABM1 and J48 should be considered when applying ensemble
learning to NER. This answers the last and most important question of this evaluation.
[Table 12 and Table 13: token-based and entity-based f-scores at class level (Location, Organization, Person) of the three best classifiers on the two News datasets, Web, Reuters and All; statistically significant differences are marked in bold.]
References
1. Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. J. Mach. Learn. Res., 1:113–141, September 2001.
2. R. Amsler. Research towards the development of a lexical knowledge base for natural language processing. SIGIR Forum, 23:12, 1989.
3. J. Baldridge. The OpenNLP project, 2005.
4. S. D. Bay and S. Hettich. The UCI KDD Archive [http://kdd.ics.uci.edu], 1999.
5. Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
6. Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
7. Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. The Weka classifier works with version 2.82 of LIBSVM.
8. Sam Coates-Stephens. The analysis and acquisition of proper names for the understanding of free text. Computers and the Humanities, 26:441–456, 1992. 10.1007/BF00136985.
9. Marco Cornolti, Paolo Ferragina, and Massimiliano Ciaramita. A framework for benchmarking entity-annotation systems. In Proceedings of the 22nd International Conference on World Wide Web, pages 249–260. International World Wide Web Conferences Steering Committee, 2013.
10. James R. Curran and Stephen Clark. Language independent NER using a maximum entropy tagger. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pages 164–167, 2003.
11. Janez Demšar. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res., 7:1–30, December 2006.
12. Thomas G. Dietterich. Ensemble methods in machine learning. In Proceedings of the First International Workshop on Multiple Classifier Systems, MCS '00, pages 1–15, London, UK, 2000. Springer-Verlag.
13. Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. Unsupervised named-entity extraction from the web: an experimental study. Artif. Intell., 165:91–134, June 2005.
14. Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL, pages 363–370, 2005.
15. Nuno Freire, José Borbinha, and Pavel Calado. An approach for named entity recognition in poorly structured data. In Elena Simperl, Philipp Cimiano, Axel Polleres, Oscar Corcho, and Valentina Presutti, editors, The Semantic Web: Research and Applications, volume 7295 of Lecture Notes in Computer Science, pages 718–732. Springer Berlin Heidelberg, 2012.
16. Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In International Conference on Machine Learning, pages 148–156, 1996.
17. J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Technical report, Stanford University, 1998.
18. João Gama. Functional trees. Machine Learning, 55(3):219–250, 2004.
19. Aldo Gangemi. A comparison of knowledge extraction tools for the semantic web. In Philipp Cimiano, Oscar Corcho, Valentina Presutti, Laura Hollink, and Sebastian Rudolph, editors, ESWC, volume 7882 of Lecture Notes in Computer Science, pages 351–366. Springer, 2013.
20. Sherzod Hakimov, Salih Atilay Oto, and Erdogan Dogdu. Named entity recognition and disambiguation using linked data and graph-based centrality scoring. In Proceedings of the 4th International Workshop on Semantic Web Information Management, SWIM '12, pages 4:1–4:7, New York, NY, USA, 2012. ACM.
21. Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: An update. SIGKDD Explor. Newsl., 11(1):10–18, November 2009.
22. Trevor Hastie and Robert Tibshirani. Classification by pairwise coupling. In Michael I. Jordan, Michael J. Kearns, and Sara A. Solla, editors, Advances in Neural Information Processing Systems, volume 10. MIT Press, 1998.
23. George H. John and Pat Langley. Estimating continuous distributions in Bayesian classifiers. In Eleventh Conference on Uncertainty in Artificial Intelligence, pages 338–345, San Mateo, 1995. Morgan Kaufmann.
24. Ali Khalili and Sören Auer. RDFaCE: The RDFa content editor. ISWC 2011 demo track, 2011.
25. J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas. On combining classifiers. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(3):226–239, March 1998.
26. Ron Kohavi. The power of decision tables. In 8th European Conference on Machine Learning, pages 174–189. Springer, 1995.
27. Niels Landwehr, Mark Hall, and Eibe Frank. Logistic model trees. Machine Learning, 95(1-2):161–205, 2005.
28. S. le Cessie and J. C. van Houwelingen. Ridge estimators in logistic regression. Applied Statistics, 41(1):191–201, 1992.
29. B. W. Matthews. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta, 405:442–451, 1975.