Designing at the Intersection of HCI & AI:
Misinformation & Crowdsourced Annotation
Matt Lease
School of Information, University of Texas at Austin
@mattlease • ml@utexas.edu
Slides:
slideshare.net/mattlease
What’s an Information School?
“The place where people & technology meet”
~ Wobbrock et al., 2009
“iSchools” now exist at over 100 universities around the world
2
3
UT Austin “Moonshot” Project
Goal: design a future of AI & autonomous technologies
that are beneficial — not detrimental — to society.
http://goodsystems.utexas.edu
Part I: Design for
Crowdsourced Annotation
4
Motivation 1: Supervised Learning
• AI accuracy greatly impacted by amount of training data
• Want labels that are reliable, inexpensive, & easy to collect
• Snow et al., EMNLP 2008
– Ensure label quality by assigning same task
to multiple workers & aggregating responses
– Can we ensure quality without reliance on
redundant work?
5
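To make the redundancy baseline above concrete, here is a minimal sketch of aggregating redundant worker labels by majority vote (the example labels and helper name are hypothetical, not from Snow et al.):

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label; ties broken by first-seen order."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical example: five workers judge the same (query, page) pair.
worker_labels = ["relevant", "relevant", "non-relevant", "relevant", "non-relevant"]
print(majority_vote(worker_labels))  # -> "relevant"
```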
Motivation 2:
Human Computation
6
“Software developers with innovative ideas for businesses and
technologies are constrained by the limits of artificial
intelligence… If software developers could programmatically
access and incorporate human intelligence into their
applications, a whole new class of innovative businesses and
applications would be possible. This is the goal of Amazon
Mechanical Turk… people are freer to innovate because they
can now imbue software with real human intelligence.”
Collecting Annotator Rationales
for Relevance Judgments
Tyler McDonnell, Matthew Lease, Mucahid Kutlu, Tamer Elsayed
2016 AAAI Conference on Human Computation & Crowdsourcing (HCOMP)
Follow-on work: Kutlu et al., SIGIR 2018
Search Relevance
What are the symptoms of jaundice?
8
Search Relevance
9
What are the symptoms of jaundice?
Search Relevance
10
25 Years of the National Institute of
Standards & Technology Text REtrieval
Conference (NIST TREC)
● Expert assessors provide relevance
labels for web pages.
● Task is highly subjective: even expert
assessors disagree often.
Google: Quality Rater Guidelines
(150 pages of instructions!)
What are the symptoms of jaundice?
A First Experiment
• Collected sample of relevance judgments on Mechanical Turk.
• Labeled some data ourselves.
• Checked agreement.
11
● Between workers.
● Between workers vs. our own labels.
● Between workers vs. NIST gold.
● Between our labels vs. NIST gold.
● Why do our labels disagree with NIST? Who knows…
Can we do better?
The Rationale
12
What are the symptoms of jaundice?
The Rationale
13
What are the symptoms of jaundice?
Zaidan, Eisner, & Piatko.
NAACL 2007.
Why Rationales?
14
1. Transparency
● Focused context for interpreting
objective or subjective answers.
● Workers can justify decisions and
establish other valid answers.
● Scalable gold creation w/o experts.
● Can verify labels both now & in future.
● e.g., imagine NIST gold labels with rationales attached.
What are the symptoms of jaundice?
Why Rationales?
15
2. Reliability & Verifiability
● Increased accountability reduces
temptation to cheat.
● Enables iterative task design.
(more to come…)
● Enables dual-supervision, both when
aggregating answers and training
model for actual task. (more to come…)
● Better quality assurance could reduce
need to aggregate redundant work.
What are the symptoms of jaundice?
Why Rationales?
16
3. Increased Inclusivity
Hypothesis: With improved transparency
and accountability, we can remove all
traditional barriers to participation so
anyone interested is allowed to work.
● Scalability
● Diversity
● Equal Opportunity
What are the symptoms of jaundice?
Experimental Setup
• Collected 10K relevance judgments through Mechanical Turk.
• Evaluated two main task types.
– Standard Task (Baseline): Assessors provide a relevance judgment
– Rationale Task: Assessors provide a relevance judgment & rationale.
– Two other variant designs will be mentioned later in talk...
• No worker qualifications or “honey-pot” questions used.
• Equal pay across all evaluated tasks.
17
Results - Accuracy
• Requiring rationales yields
much higher quality work.
• Accuracy with one rationale
(80%) not far off from five
standard judgments (86%)
18
Results - Cost-Efficiency
• Rationale tasks initially
slower, but the difference
becomes negligible with
task familiarity.
• Rationales make explicit the
implicit reasoning process
underlying labeling.
19
But wait, there’s more!
What about using the collected rationales?
20
Using Rationales: Overlap
21
Assessor 1 Rationale Assessor 2 Rationale
Using Rationales: Overlap
22
Assessor 1 Rationale Assessor 2 Rationale Overlap
Idea: Filter judgments based on pairwise rationale overlap among assessors.
Motivation: Workers who converge on similar rationales are likely to agree on labels too.
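A minimal sketch of such a filter, assuming rationales are free-text snippets and overlap is token-level Jaccard similarity (the paper's exact overlap measure and threshold may differ):

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two rationale snippets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

def filter_by_overlap(judgments, threshold=0.2):
    """Keep a label only if its rationale sufficiently overlaps some peer's rationale.

    judgments: list of (label, rationale_text) pairs for one (query, page) task.
    threshold: hypothetical cutoff; in practice tuned on held-out data.
    """
    kept = []
    for i, (label, rationale) in enumerate(judgments):
        peers = [r for j, (_, r) in enumerate(judgments) if j != i]
        if any(jaccard(rationale, p) >= threshold for p in peers):
            kept.append(label)
    return kept  # surviving labels are then aggregated, e.g., by majority vote
```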
Results - Accuracy (Overlap)
Filtering collected judgments
by rationale overlap before
aggregation increases quality.
23
Using Rationales: Two-Stage Task Design
24
Assessor 1 Rationale
Assessor 1: Relevant Assessor 2:
?
Idea: The reviewer must confirm or refute the initial assessor's judgment.
Motivation: The reviewer must consider their response in the
context of a peer's reasoning.
82% of Stage 1 errors fixed.
No new errors introduced.
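A sketch of how the two-stage outcome could be resolved (hypothetical function and argument names; the deployed task logic may differ):

```python
def resolve_two_stage(assessor_label, reviewer_agrees, reviewer_label=None):
    """Stage 2: a reviewer sees the stage 1 label and rationale, then confirms or refutes.

    If the reviewer confirms, the original judgment stands; if the reviewer
    refutes it, the reviewer's corrected label is used instead.
    """
    return assessor_label if reviewer_agrees else reviewer_label
```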
Results - Accuracy (Two-Stage)
• One review achieves same
accuracy as using four
extra standard judgments.
• Aggregating reviewers
reaches same accuracy as
filtered approaches.
25
1 Assessor +
1 Reviewer
1 Assessor +
4 Reviewers
The Big Picture
• Transparency
– Context for understanding and validating subjective answers.
– Convergence on justification-based crowdsourcing.
• Improved Accuracy
– Rationales make implicit explicit and hold workers accountable.
• Improved Cost-Efficiency
– No additional cost for collection once workers are familiar with task.
• Improved Aggregation
– Rationales can be used for filtering or aggregating judgments.
26
Future Work
27
• Dual Supervision: How can we further leverage
rationales for aggregation?
– Supervised learning over labels/rationales.
Zaidan, Eisner, & Piatko. NAACL 2007.
• Task Design: What about other sequential task
designs? (e.g., multi-stage)
• Generalizability: How far can we generalize
rationales to other tasks? (e.g., beyond images)
Donahue & Grauman. Annotator Rationales
for Visual Recognition. ICCV 2011.
Part II: Misinformation
&
Human-AI Partnerships
28
“Truthiness” is not a new problem
“Truthiness is tearing apart our country... It used to be,
everyone was entitled to their own opinion, but not
their own facts. But that’s not the case anymore.”
– Stephen Colbert (Jan. 25, 2006)
“You furnish the pictures and I’ll furnish the war.”
– William Randolph Hearst (Jan. 25, 1898)
29
Information Literacy
National Information Literacy Awareness Month,
US Presidential Proclamation, October 1, 2009.
“Though we may know how to find the information
we need, we must also know how to evaluate it.
Over the past decade, we have seen a crisis of
authenticity emerge. We now live in a world where
anyone can publish an opinion or perspective, true
or not, and have that opinion amplified…”
30
31
Automatic Fact-Checking
32
33
34
Design Challenge: How to interact with ML models?
35
Brief Case Study: Facebook
(simpler case: journalist fact-checking)
36
Tessa Lyons, a Facebook News Feed product manager:
“…putting a strong image, like a red flag, next to an
article may actually entrench deeply held beliefs —
the opposite effect to what we intended.”
37
Alternative Design
38
Another Alternative Design
39
AI & HCI for Misinformation
“A few classes in ‘use and users of information’ … could
have helped social media platforms avoid the common
pitfalls of the backfire effect in their fake news efforts
and perhaps even avoided … mob rule, virality-based
algorithmic prioritization in the first place.”
Kalev Leetaru, Forbes, August 5, 2019
https://www.forbes.com/sites/kalevleetaru/
40
Believe it or not: Designing a Human-AI
Partnership for Mixed-Initiative Fact-Checking
Joint work with
An Thanh Nguyen (UT), Byron Wallace (Northeastern), & more…
Matt Lease
School of Information, University of Texas at Austin
@mattlease • ml@utexas.edu
Slides:
slideshare.net/mattlease
42
Automatic Fact-Checking
43
Design Challenges
• Fair, Accountable, & Transparent (FAT) AI
– Why trust “black box” classifier?
– How do we reason about potential bias?
– Do people really only want to know “fact” vs. “fake”?
– How to integrate human knowledge/experience?
• Joint AI + Human Reasoning, Correct Errors, Personalization
• How to design strong Human + AI Partnerships?
– Horvitz, CHI’99: mixed-initiative design
– Dove et al., CHI’17 “Machine Learning As a Design Material”
44
• Crowdsourced stance labels
– Hybrid AI + Human (near real-time) Prediction
• Joint graphical model of stance, veracity, & annotators
– Interaction between variables
– Interpretable
• Source code on GitHub
Nguyen et al., AAAI’18
45
46
Demo!
Nguyen et al., UIST’18
Primary Interface
47
Source Reputation
48
System Architecture
• Google Search API
• Two logistic regression models
– average accuracy > 70% but with variance
– Stance (Ferreira & Vlachos ’16) w/ same features
– Veracity (Popat et al. ‘17)
– Scikit-learn, L1 regularization, Liblinear solver, & default parameters
49
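The slide's model configuration can be sketched as follows; the TF-IDF features below are a stand-in for the engineered features of Ferreira & Vlachos '16 and Popat et al. '17:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# L1-regularized logistic regression with the liblinear solver, as on the slide;
# scikit-learn defaults are used for all other parameters.
stance_model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(penalty="l1", solver="liblinear"),
)

# Hypothetical usage: article texts paired with stance labels (e.g., for/against/observing).
# stance_model.fit(train_texts, train_stances)
# predictions = stance_model.predict(test_texts)
```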
Data: Train & Test
Emergent (Ferreira & Vlachos ’16)
Accuracy of prediction models
50
Findings
• User studies on MTurk, ~100 participants per experiment
• Experiment 1: Whether or not they see model predictions
51
2 Groups: Control vs. System
52
2 Groups: Control vs. System
53
2 Groups: Control vs. System
55
2 Groups: Control vs. System
56
Summary of Findings
• User studies on MTurk, ~100 participants per experiment
• Experiment 1: Whether or not reputation/stance shown
– Predict claim veracity before & after seeing model predictions
– Result: human accuracy roughly follows model accuracy
• Experiment 2: Whether or not user can override predictions
– Predict claim veracity and give confidence in prediction
– Not statistically significant on average, interaction sometimes hurts
58
What about user bias?
59
New form of echo chamber?
Interaction promotes transparency & trust, but can affirm user bias
60
CobWeb: A Research Prototype for Exploring User Bias in Political Fact-Checking
Anubrata Das, Kunjan Mehta, and Matthew Lease
SIGIR 2019 Workshop on Fair, Accountable, Confidential, Transparent, and Safe Information Retrieval (FACTS-IR), July 25, 2019
62
We introduce “political leaning” (bias) as a function of the adjusted reputation of the news sources.
63
A user can alter the source reputation.
Changing reputation scores changes the predicted correctness.
Changing reputation scores changes the overall political leaning.
64
A user can alter the overall political leaning.
Changing the overall political leaning changes the predicted correctness.
Changing the overall political leaning changes the source reputations.
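One way to read this relationship is as a reputation-weighted average of per-source bias scores. Below is a minimal sketch under assumed scales (source bias in [-1, +1] from left- to right-leaning, reputation in [0, 1]); CobWeb's actual formula may differ:

```python
def overall_leaning(sources):
    """sources: list of (bias, reputation) pairs, bias in [-1, +1], reputation in [0, 1].

    Raising a source's reputation pulls the overall leaning toward that
    source's bias, mirroring the interaction the slides describe.
    """
    total_rep = sum(rep for _, rep in sources)
    if total_rep == 0:
        return 0.0
    return sum(bias * rep for bias, rep in sources) / total_rep

# Hypothetical example: boosting a right-leaning source's reputation.
print(overall_leaning([(-0.8, 0.9), (0.6, 0.2)]))  # ~ -0.55: leans left
print(overall_leaning([(-0.8, 0.9), (0.6, 0.9)]))  # ~ -0.10: shifts toward center
```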
66
Post-task Questionnaire
01 Accuracy: Knowing the overall political leaning would help me accurately predict the truthfulness of claims.
02 Effectiveness: Knowing the overall political leaning would enable me to effectively predict the truthfulness of claims.
03 Ease of Use: Knowing the overall political leaning would make it easier to predict the truthfulness of claims.
04 Usefulness: Communicating overall political leaning is useful in predicting the truthfulness of claims.
67
8/10 participants correctly identified the correctness of an imaginary claim.
6/10 participants found the change in overall political leaning as a function of the change in reputation score intuitive.
8/10 participants found it useful to have an indicator of their overall political leaning in a claim-checking scenario.
6/10 participants found the relationship between the change in overall political leaning and the source reputation score intuitive.
68
01 Contribution: An interface that communicates a user's own political biases in a fact-checking context.
02 Conclusion: Communicating a user's own bias in fact-checking helps the user assess the credibility of a claim.
03 Future Work: Bias detection on real data; extensive user study; evaluation design for search bias.
CobWeb: A Research Prototype for Exploring User Bias in Political Fact-Checking (working paper; more to come!)
https://arxiv.org/abs/1907.03718
A Conceptual Framework for Evaluating Fairness in Search
https://arxiv.org/abs/1907.09328
Wrap-up on
Misinformation
• Fact-checking more than black-box prediction: Interaction, exploration, trust
– Useful problem for grounding work on Fair, Accountable, & Transparent (FAT) AI
• Mixed-initiative human + AI partnership for fact-checking
– backend NLP + front-end interaction
• Fact Checking & IR (Lease, DESIRES’18)
– How to diversify search results for controversial topics?
– Information evaluation (e.g., vaccination & autism)
• Potential harm as well as good
– Potential added confusion, data / algorithmic bias
– Potential for personal “echo chamber”
– Adversarial settings
69
Thank You!
Slides: slideshare.net/mattlease
Lab: ir.ischool.utexas.edu
70