Crowdsourcing Ground Truth for Relation
Extraction in the Medical Domain
Anca Dumitrache, Lora Aroyo, Chris Welty
26th January 2015
Introduction
problem: cognitive computing systems need annotated data for training,
testing, evaluation
solution: human annotation through crowdsourcing augmented with machine
processing
What's wrong with the gold standard?
● algorithmic performance is measured on test sets vetted by human
experts→never perfectly correct
● gold standards are created assuming that for each annotated instance there
is a single right answer→does not account for alternative interpretations
or for varying clarity of the input
● gold standard quality is measured in inter-annotator agreement→what
happens when disagreeing annotators are both right?
The fallacy of the “one truth” assumption that pervades computational
semantics
Examples of disagreement
Does each sentence express the TREAT relation?
● ANTIBIOTICS are the first line treatment for indications of TYPHUS.
● Patients with TYPHUS who were given ANTIBIOTICS exhibited several
side-effects.
● With ANTIBIOTICS in short supply, DDT was used during World War II
to control the insect vectors of TYPHUS.
Examples of disagreement
Does each sentence express the TREAT relation?
● ANTIBIOTICS are the first line treatment for indications of TYPHUS.
→agreement 99%
● Patients with TYPHUS who were given ANTIBIOTICS exhibited several
side-effects.→agreement 80%
● With ANTIBIOTICS in short supply, DDT was used during World War II
to control the insect vectors of TYPHUS.→agreement 60%
Why disagreement happens
[figure: the triangle of reference linking input sentence, worker, and annotation,
illustrated with the ANTIBIOTICS/TYPHUS example sentences; disagreement can stem
from the relation semantics ("cause" vs. "side effect") or from the workers
(spammers vs. high quality workers)]
CrowdTruth
● use crowdsourcing tasks to improve machine-generated data
● adaptable to new annotation tasks, new domains
● disagreement is an indicator of quality for:
○ sentences, relations, workers
● capture and interpret disagreement through CrowdTruth metrics
● open source & available as a web service:
http://CrowdTruth.org
CrowdTruth for relation extraction
goal: use the CrowdTruth approach for collecting a relation extraction gold
standard, to improve the performance of a relation extraction classifier
issue: how to interpret crowd disagreement for relation extraction?
approach:
● run relation extraction crowdsourcing task on 900 medical sentences
● measure disagreement with CrowdTruth metrics
● train & evaluate classifier with CrowdTruth scores
Research questions
1. what is the threshold between a negative and positive relation in
crowdsourced data?
2. can CrowdTruth outperform experts in training a relation extraction
classifier?
3. can CrowdTruth be used in evaluating a relation extraction classifier?
Relation extraction task
Unit vector
Sentence vector
Sentence-relation score
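These slides are figure-only in the original deck. A minimal Python sketch, assuming the usual CrowdTruth formulation (each worker judgment becomes a binary vector over the candidate relations, the sentence vector is the element-wise sum of the worker vectors, and the sentence-relation score is the cosine similarity between the sentence vector and the relation's unit vector); the relation set below is illustrative:

```python
import numpy as np

# Illustrative relation subset; the actual task used the full set of medical relations.
RELATIONS = ["TREATS", "CAUSES", "SYMPTOM", "MANIFESTATION",
             "SIDE_EFFECT", "ASSOCIATED_WITH"]

def worker_vector(selected):
    """One worker's judgment for one sentence: a binary vector over the relations."""
    return np.array([1.0 if r in selected else 0.0 for r in RELATIONS])

def sentence_vector(worker_vectors):
    """Sentence vector: element-wise sum of all worker vectors for the sentence."""
    return np.sum(worker_vectors, axis=0)

def sentence_relation_score(sent_vec, relation):
    """Sentence-relation score: cosine similarity between the sentence vector
    and the unit vector of the relation (assumed CrowdTruth definition)."""
    unit = np.array([1.0 if r == relation else 0.0 for r in RELATIONS])
    denom = np.linalg.norm(sent_vec) * np.linalg.norm(unit)
    return float(sent_vec @ unit / denom) if denom else 0.0

# Example: five workers annotate one sentence.
judgments = [{"TREATS"}, {"TREATS"}, {"TREATS", "ASSOCIATED_WITH"},
             {"ASSOCIATED_WITH"}, {"CAUSES"}]
vec = sentence_vector([worker_vector(j) for j in judgments])
print(sentence_relation_score(vec, "TREATS"))  # high score = strong crowd agreement
```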
Experiment setup
1. train relation extraction classifier with x-validation for one relation
datasets to compare (see the sketch after this slide):
● baseline: original distant supervision relations
● single annotator: for each sentence-relation pair, randomly sample a
binary decision from the sentence-relation vector
● expert annotator: each sentence annotated by 1 medical expert
● CrowdTruth: use scaled sentence-relation score as confidence value
2. perform evaluation on manually vetted data
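A minimal sketch of how the four training signals could be assembled; the field names, the (label, confidence) output format, and the hard threshold used for the crowd variant are assumptions for illustration, since the slides only name the variants:

```python
import random

def build_training_labels(examples, variant, threshold=0.5):
    """Assemble (label, confidence) training pairs for one relation under the four
    setups compared above. `examples` is assumed to be a list of dicts with keys
    'ds_label', 'expert_label', 'worker_decisions' (one 0/1 per worker for this
    relation) and 'sent_rel_score'; these field names are illustrative."""
    labels = []
    for ex in examples:
        if variant == "baseline":      # original distant supervision label
            labels.append((ex["ds_label"], 1.0))
        elif variant == "single":      # one randomly sampled worker decision
            labels.append((random.choice(ex["worker_decisions"]), 1.0))
        elif variant == "expert":      # one medical expert per sentence
            labels.append((ex["expert_label"], 1.0))
        elif variant == "crowd":       # CrowdTruth: sen-rel score as confidence
            labels.append((int(ex["sent_rel_score"] >= threshold),
                           ex["sent_rel_score"]))
        else:
            raise ValueError(f"unknown variant: {variant}")
    return labels
```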
Crowdsourcing results
Top relations

Relation           Frequency
ASSOCIATED WITH    69.4%
TREATS             45.23%
SYMPTOM            43.56%
CAUSE              41.35%
MANIFESTATION      39.57%
Data cleanup
● tuning parameters:
○ how much disagreement do we accept before labeling a worker as a
spammer?
○ what is the sen-rel score threshold between positive and negative?
● spam removal
● data clustering: relation similarity
A pairwise metric between two relations, used to express how likely one relation is to
appear in a sentence given that the other one is present. It is based on causal power:
for every two relations i and j, the causal power of i over j is given by the formula on
the slide (shown only as an image; a sketch follows), where P(i) is the probability
that relation i appears in the sentence.
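A sketch using the classic power-PC definition of causal power, CP(i -> j) = (P(j|i) - P(j|not i)) / (1 - P(j|not i)); both this exact form and the way the two directed powers are combined into a symmetric similarity are assumptions:

```python
import numpy as np

def causal_power(i, j, sentence_matrix):
    """Causal power of relation i over relation j, estimated from the crowd data.
    Uses the classic power-PC form (an assumption; the slide's formula is an image).
    `sentence_matrix` is a (sentences x relations) matrix of sentence vectors;
    a relation counts as present if at least one worker selected it."""
    present = np.asarray(sentence_matrix) > 0
    with_i, without_i = present[present[:, i]], present[~present[:, i]]
    if len(with_i) == 0 or len(without_i) == 0:
        return 0.0
    p_j_given_i = with_i[:, j].mean()
    p_j_given_not_i = without_i[:, j].mean()
    if p_j_given_not_i >= 1.0:
        return 0.0
    return (p_j_given_i - p_j_given_not_i) / (1.0 - p_j_given_not_i)

def relation_similarity(i, j, sentence_matrix):
    """Pairwise relation similarity; combining the two directed causal powers
    by their product is an assumption for illustration."""
    return causal_power(i, j, sentence_matrix) * causal_power(j, i, sentence_matrix)
```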
Relation similarity
Relation clustering
Top relations

Relation                                                 Frequency
CAUSE (cause + symptom + manifestation + side effect)    74.72%
ASSOCIATED WITH                                          69.4%
TREATS                                                   45.23%
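A sketch of how relations could then be merged from the pairwise similarities; the single-link strategy, the threshold value, and the union-based frequency of a merged relation are assumptions for illustration:

```python
import numpy as np

def cluster_relations(similarity, names, threshold=0.5):
    """Greedy single-link clustering: relations whose pairwise similarity exceeds
    `threshold` end up in the same cluster (strategy and threshold are assumptions)."""
    clusters = [{i} for i in range(len(names))]
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if any(similarity[i][j] > threshold
                       for i in clusters[a] for j in clusters[b]):
                    clusters[a] |= clusters.pop(b)
                    merged = True
                    break
            if merged:
                break
    return [{names[i] for i in c} for c in clusters]

def cluster_frequency(cluster_indices, sentence_matrix):
    """Frequency of a merged relation: share of sentences where at least one member
    of the cluster is present (union semantics, an assumption consistent with the
    merged CAUSE frequency exceeding each individual member's frequency)."""
    present = np.asarray(sentence_matrix) > 0
    return present[:, list(cluster_indices)].any(axis=1).mean()
```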
Final relation overlap
Building a test set
[figure: test set construction + manual evaluation]
X-validation results
Dataset            Kendall tau   Spearman footrule
baseline           0.62          0.513
single             0.64          0.501
expert             0.664         0.568
crowd thresh 0.1   0.656         0.545
crowd thresh 0.2   0.654         0.54
crowd thresh 0.3   0.672         0.554
crowd thresh 0.4   0.662         0.543
crowd thresh 0.5   0.666         0.53
crowd thresh 0.6   0.654         0.535
crowd thresh 0.7   0.651         0.543
crowd thresh 0.8   0.649         0.536
crowd thresh 0.9   0.655         0.549

Rank correlation between classifier output and CrowdTruth
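A sketch of the two rank-correlation measures above, applied to classifier confidences and CrowdTruth sentence-relation scores; the normalization of the Spearman footrule is an assumption:

```python
import numpy as np
from scipy.stats import kendalltau, rankdata

def spearman_footrule(x, y):
    """Spearman footrule between two score lists, normalized to [0, 1] and flipped
    so that 1 means identical rankings (the exact normalization is an assumption)."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(rx)
    max_dist = n * n / 2 if n % 2 == 0 else (n * n - 1) / 2  # max possible footrule
    return 1.0 - np.abs(rx - ry).sum() / max_dist

def rank_correlation(classifier_scores, crowdtruth_scores):
    """Compare the classifier's ranking of sentences with the CrowdTruth
    sentence-relation scores, as in the table above."""
    tau, _ = kendalltau(classifier_scores, crowdtruth_scores)
    return {"kendall_tau": tau,
            "spearman_footrule": spearman_footrule(classifier_scores,
                                                   crowdtruth_scores)}
```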
Results
1. what is the threshold between a negative and positive relation in
crowdsourced data?
● agreement with experts: sen-rel score = 0.5
● best performance: sen-rel score = 0.7
2. can CrowdTruth outperform experts in training a relation extraction
classifier?
● not yet, but it comes close
3. can CrowdTruth be used in evaluating a relation extraction classifier?
● yes for rank correlation methods
● TODO: weighted precision, recall, F1, accuracy
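The weighted metrics are left as future work above; one possible formulation (an assumption, not the authors' method) weights each example by its sentence-relation score:

```python
import numpy as np

def weighted_precision_recall_f1(pred, sent_rel_scores, threshold=0.5):
    """Score-weighted evaluation: a crowd positive counts with weight equal to its
    sentence-relation score, a crowd negative with weight (1 - score). This is one
    possible formulation, sketched here because the slides leave it as TODO."""
    pred = np.asarray(pred, dtype=bool)
    scores = np.asarray(sent_rel_scores, dtype=float)
    pos_weight = np.where(scores >= threshold, scores, 0.0)
    neg_weight = np.where(scores < threshold, 1.0 - scores, 0.0)
    tp = pos_weight[pred].sum()
    fp = neg_weight[pred].sum()
    fn = pos_weight[~pred].sum()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```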
