Crowdsourcing Ground Truth for Relation
Extraction in the Medical Domain
Anca Dumitrache, Lora Aroyo, Chris Welty
26th January 2015
Introduction
problem: cognitive computing systems need annotated data for training,
testing, evaluation
solution: human annotation through crowdsourcing augmented with machine
processing
What's wrong with the gold standard?
● algorithmic performance is measured on test sets vetted by human
experts→never perfectly correct
● gold standards are created assuming that for each annotated instance there
is a single right answer→does not account for alternative interpretations
or for varying clarity of the input
● gold standard quality is measured in inter-annotator agreement→what
happens when disagreeing annotators are both right?
The fallacy of the “one truth” assumption that pervades computational
semantics
Examples of disagreement
Does each sentence express the TREAT relation?
● ANTIBIOTICS are the first line treatment for indications of TYPHUS.
● Patients with TYPHUS who were given ANTIBIOTICS exhibited several
side-effects.
● With ANTIBIOTICS in short supply, DDT was used during World War II
to control the insect vectors of TYPHUS.
Examples of disagreement
Does each sentence express the TREAT relation?
● ANTIBIOTICS are the first line treatment for indications of TYPHUS.
→agreement 99%
● Patients with TYPHUS who were given ANTIBIOTICS exhibited several
side-effects.→agreement 80%
● With ANTIBIOTICS in short supply, DDT was used during World War II
to control the insect vectors of TYPHUS.→agreement 60%
Why disagreement happens
[figure: the triangle of reference linking input sentence, worker, and annotation,
illustrated with the ANTIBIOTICS/TYPHUS example sentences; disagreement can stem
from the relation semantics ("cause" vs. "side effect") or from the workers
(spammers vs. high quality workers)]
CrowdTruth
● use crowdsourcing tasks to improve machine-generated data
● adaptable to new annotation tasks, new domains
● disagreement is an indicator of quality for:
○ sentences, relations, workers
● capture and interpret disagreement through CrowdTruth metrics
● open source & available as a web service:
http://CrowdTruth.org
CrowdTruth for relation extraction
goal: use the CrowdTruth approach for collecting a relation extraction gold
standard, to improve the performance of a relation extraction classifier
issue: how to interpret crowd disagreement for relation extraction?
approach:
● run relation extraction crowdsourcing task on 900 medical sentences
● measure disagreement with CrowdTruth metrics
● train & evaluate classifier with CrowdTruth scores
Research questions
1. what is the threshold between a negative and positive relation in
crowdsourced data?
2. can CrowdTruth outperform experts in training a relation extraction
classifier?
3. can CrowdTruth be used in evaluating a relation extraction classifier?
Relation extraction task
Unit vector
Sentence vector
Sentence-relation score
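These slides are figure-only in the original deck. A minimal Python sketch, assuming the usual CrowdTruth formulation (each worker judgment becomes a binary vector over the candidate relations, the sentence vector is the element-wise sum of the worker vectors, and the sentence-relation score is the cosine similarity between the sentence vector and the relation's unit vector); the relation set below is illustrative:

```python
import numpy as np

# Illustrative relation subset; the actual task used the full set of medical relations.
RELATIONS = ["TREATS", "CAUSES", "SYMPTOM", "MANIFESTATION",
             "SIDE_EFFECT", "ASSOCIATED_WITH"]

def worker_vector(selected):
    """One worker's judgment for one sentence: a binary vector over the relations."""
    return np.array([1.0 if r in selected else 0.0 for r in RELATIONS])

def sentence_vector(worker_vectors):
    """Sentence vector: element-wise sum of all worker vectors for the sentence."""
    return np.sum(worker_vectors, axis=0)

def sentence_relation_score(sent_vec, relation):
    """Sentence-relation score: cosine similarity between the sentence vector
    and the unit vector of the relation (assumed CrowdTruth definition)."""
    unit = np.array([1.0 if r == relation else 0.0 for r in RELATIONS])
    denom = np.linalg.norm(sent_vec) * np.linalg.norm(unit)
    return float(sent_vec @ unit / denom) if denom else 0.0

# Example: five workers annotate one sentence.
judgments = [{"TREATS"}, {"TREATS"}, {"TREATS", "ASSOCIATED_WITH"},
             {"ASSOCIATED_WITH"}, {"CAUSES"}]
vec = sentence_vector([worker_vector(j) for j in judgments])
print(sentence_relation_score(vec, "TREATS"))  # high score = strong crowd agreement
```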
Experiment setup
1. train relation extraction classifier with x-validation for one relation
datasets to compare (see the sketch after this slide):
● baseline: original distant supervision relations
● single annotator: for each sentence-relation pair, randomly sample a
binary decision from the sentence-relation vector
● expert annotator: each sentence annotated by 1 medical expert
● CrowdTruth: use scaled sentence-relation score as confidence value
2. perform evaluation on manually vetted data
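A minimal sketch of how the four training signals could be assembled; the field names, the (label, confidence) output format, and the hard threshold used for the crowd variant are assumptions for illustration, since the slides only name the variants:

```python
import random

def build_training_labels(examples, variant, threshold=0.5):
    """Assemble (label, confidence) training pairs for one relation under the four
    setups compared above. `examples` is assumed to be a list of dicts with keys
    'ds_label', 'expert_label', 'worker_decisions' (one 0/1 per worker for this
    relation) and 'sent_rel_score'; these field names are illustrative."""
    labels = []
    for ex in examples:
        if variant == "baseline":      # original distant supervision label
            labels.append((ex["ds_label"], 1.0))
        elif variant == "single":      # one randomly sampled worker decision
            labels.append((random.choice(ex["worker_decisions"]), 1.0))
        elif variant == "expert":      # one medical expert per sentence
            labels.append((ex["expert_label"], 1.0))
        elif variant == "crowd":       # CrowdTruth: sen-rel score as confidence
            labels.append((int(ex["sent_rel_score"] >= threshold),
                           ex["sent_rel_score"]))
        else:
            raise ValueError(f"unknown variant: {variant}")
    return labels
```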
Crowdsourcing results
Top relations

Relation           Frequency
ASSOCIATED WITH    69.4%
TREATS             45.23%
SYMPTOM            43.56%
CAUSE              41.35%
MANIFESTATION      39.57%
Data cleanup
● tuning parameters:
○ how much disagreement do we accept before labeling a worker as a
spammer?
○ what is the sen-rel score threshold between positive and negative?
● spam removal
● data clustering: relation similarity
A pairwise metric between two relations, used to express how likely one relation is to
appear in a sentence given that the other one is present. It is based on causal power:
for every two relations i and j, the causal power of i over j is given by the formula on
the slide (shown only as an image; a sketch follows), where P(i) is the probability
that relation i appears in the sentence.
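A sketch using the classic power-PC definition of causal power, CP(i -> j) = (P(j|i) - P(j|not i)) / (1 - P(j|not i)); both this exact form and the way the two directed powers are combined into a symmetric similarity are assumptions:

```python
import numpy as np

def causal_power(i, j, sentence_matrix):
    """Causal power of relation i over relation j, estimated from the crowd data.
    Uses the classic power-PC form (an assumption; the slide's formula is an image).
    `sentence_matrix` is a (sentences x relations) matrix of sentence vectors;
    a relation counts as present if at least one worker selected it."""
    present = np.asarray(sentence_matrix) > 0
    with_i, without_i = present[present[:, i]], present[~present[:, i]]
    if len(with_i) == 0 or len(without_i) == 0:
        return 0.0
    p_j_given_i = with_i[:, j].mean()
    p_j_given_not_i = without_i[:, j].mean()
    if p_j_given_not_i >= 1.0:
        return 0.0
    return (p_j_given_i - p_j_given_not_i) / (1.0 - p_j_given_not_i)

def relation_similarity(i, j, sentence_matrix):
    """Pairwise relation similarity; combining the two directed causal powers
    by their product is an assumption for illustration."""
    return causal_power(i, j, sentence_matrix) * causal_power(j, i, sentence_matrix)
```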
Relation similarity
Relation clustering
Top relations

Relation                                                 Frequency
CAUSE (cause + symptom + manifestation + side effect)    74.72%
ASSOCIATED WITH                                          69.4%
TREATS                                                   45.23%
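A sketch of how relations could then be merged from the pairwise similarities; the single-link strategy, the threshold value, and the union-based frequency of a merged relation are assumptions for illustration:

```python
import numpy as np

def cluster_relations(similarity, names, threshold=0.5):
    """Greedy single-link clustering: relations whose pairwise similarity exceeds
    `threshold` end up in the same cluster (strategy and threshold are assumptions)."""
    clusters = [{i} for i in range(len(names))]
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                if any(similarity[i][j] > threshold
                       for i in clusters[a] for j in clusters[b]):
                    clusters[a] |= clusters.pop(b)
                    merged = True
                    break
            if merged:
                break
    return [{names[i] for i in c} for c in clusters]

def cluster_frequency(cluster_indices, sentence_matrix):
    """Frequency of a merged relation: share of sentences where at least one member
    of the cluster is present (union semantics, an assumption consistent with the
    merged CAUSE frequency exceeding each individual member's frequency)."""
    present = np.asarray(sentence_matrix) > 0
    return present[:, list(cluster_indices)].any(axis=1).mean()
```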
Final relation overlap
Building a test set
[figure: test set construction + manual evaluation]
X-validation results
Dataset            Kendall tau   Spearman footrule
baseline           0.62          0.513
single             0.64          0.501
expert             0.664         0.568
crowd thresh 0.1   0.656         0.545
crowd thresh 0.2   0.654         0.54
crowd thresh 0.3   0.672         0.554
crowd thresh 0.4   0.662         0.543
crowd thresh 0.5   0.666         0.53
crowd thresh 0.6   0.654         0.535
crowd thresh 0.7   0.651         0.543
crowd thresh 0.8   0.649         0.536
crowd thresh 0.9   0.655         0.549

Rank correlation between classifier output and CrowdTruth
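A sketch of the two rank-correlation measures above, applied to classifier confidences and CrowdTruth sentence-relation scores; the normalization of the Spearman footrule is an assumption:

```python
import numpy as np
from scipy.stats import kendalltau, rankdata

def spearman_footrule(x, y):
    """Spearman footrule between two score lists, normalized to [0, 1] and flipped
    so that 1 means identical rankings (the exact normalization is an assumption)."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(rx)
    max_dist = n * n / 2 if n % 2 == 0 else (n * n - 1) / 2  # max possible footrule
    return 1.0 - np.abs(rx - ry).sum() / max_dist

def rank_correlation(classifier_scores, crowdtruth_scores):
    """Compare the classifier's ranking of sentences with the CrowdTruth
    sentence-relation scores, as in the table above."""
    tau, _ = kendalltau(classifier_scores, crowdtruth_scores)
    return {"kendall_tau": tau,
            "spearman_footrule": spearman_footrule(classifier_scores,
                                                   crowdtruth_scores)}
```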
Results
1. what is the threshold between a negative and positive relation in
crowdsourced data?
● agreement with experts: sen-rel score = 0.5
● best performance: sen-rel score = 0.7
2. can CrowdTruth outperform experts in training a relation extraction
classifier?
● not yet, but it comes close
3. can CrowdTruth be used in evaluating a relation extraction classifier?
● yes for rank correlation methods
● TODO: weighted precision, recall, F1, accuracy
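The weighted metrics are left as future work above; one possible formulation (an assumption, not the authors' method) weights each example by its sentence-relation score:

```python
import numpy as np

def weighted_precision_recall_f1(pred, sent_rel_scores, threshold=0.5):
    """Score-weighted evaluation: a crowd positive counts with weight equal to its
    sentence-relation score, a crowd negative with weight (1 - score). This is one
    possible formulation, sketched here because the slides leave it as TODO."""
    pred = np.asarray(pred, dtype=bool)
    scores = np.asarray(sent_rel_scores, dtype=float)
    pos_weight = np.where(scores >= threshold, scores, 0.0)
    neg_weight = np.where(scores < threshold, 1.0 - scores, 0.0)
    tp = pos_weight[pred].sum()
    fp = neg_weight[pred].sum()
    fn = pos_weight[~pred].sum()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```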
