Designing at the Intersection of HCI & AI:
Misinformation & Crowdsourced Annotation
Matt Lease
School of Information, University of Texas at Austin
@mattlease • ml@utexas.edu
Slides:
slideshare.net/mattlease
What’s an Information School?
“The place where people & technology meet”
~ Wobbrock et al., 2009
“iSchools” now exist at over 100 universities around the world
2
3
UT Austin “Moonshot” Project
Goal: design a future of AI & autonomous technologies
that are beneficial — not detrimental — to society.
http://goodsystems.utexas.edu
Part I: Design for
Crowdsourced Annotation
4
Motivation 1: Supervised Learning
• AI accuracy greatly impacted by amount of training data
• Want labels that are reliable, inexpensive, & easy to collect
• Snow et al., EMNLP 2008
– Ensure label quality by assigning same task
to multiple workers & aggregating responses
– Can we ensure quality without reliance on
redundant work?
5
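To make the redundancy baseline above concrete, here is a minimal sketch of aggregating redundant worker labels by majority vote (the example labels and helper name are hypothetical, not from Snow et al.):

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label; ties broken by first-seen order."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical example: five workers judge the same (query, page) pair.
worker_labels = ["relevant", "relevant", "non-relevant", "relevant", "non-relevant"]
print(majority_vote(worker_labels))  # -> "relevant"
```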
Motivation 2:
Human Computation
6
“Software developers with innovative ideas for businesses and
technologies are constrained by the limits of artificial
intelligence… If software developers could programmatically
access and incorporate human intelligence into their
applications, a whole new class of innovative businesses and
applications would be possible. This is the goal of Amazon
Mechanical Turk… people are freer to innovate because they
can now imbue software with real human intelligence.”
Collecting Annotator Rationales
for Relevance Judgments
Tyler McDonnell, Matthew Lease, Mucahid Kutlu, Tamer Elsayed
2016 AAAI Conference on Human Computation & Crowdsourcing (HCOMP)
Follow-on work: Kutlu et al., SIGIR 2018
Search Relevance
What are the symptoms of jaundice?
8
Search Relevance
9
What are the symptoms of jaundice?
Search Relevance
10
25 Years of the National Institute of
Standards & Technology Text REtrieval
Conference (NIST TREC)
● Expert assessors provide relevance
labels for web pages.
● Task is highly subjective: even expert
assessors disagree often.
Google: Quality Rater Guidelines
(150 pages of instructions!)
What are the symptoms of jaundice?
A First Experiment
• Collected sample of relevance judgments on Mechanical Turk.
• Labeled some data ourselves.
• Checked agreement.
11
● Between workers.
● Between workers vs. our own labels.
● Between workers vs. NIST gold.
● Between our labels vs. NIST gold.
● Why do our labels disagree with NIST? Who knows…
Can we do better?
The Rationale
12
What are the symptoms of jaundice?
The Rationale
13
What are the symptoms of jaundice?
Zaidan, Eisner, & Piatko.
NAACL 2007.
Why Rationales?
14
1. Transparency
● Focused context for interpreting
objective or subjective answers.
● Workers can justify decisions and
establish other valid answers.
● Scalable gold creation w/o experts.
● Can verify labels both now & in future.
● e.g., imagine NIST gold labels with rationales attached.
What are the symptoms of jaundice?
Why Rationales?
15
2. Reliability & Verifiability
● Increased accountability reduces
temptation to cheat.
● Enables iterative task design.
(more to come…)
● Enables dual-supervision, both when
aggregating answers and training
model for actual task. (more to come…)
● Better quality assurance could reduce
need to aggregate redundant work.
What are the symptoms of jaundice?
Why Rationales?
16
3. Increased Inclusivity
Hypothesis: With improved transparency
and accountability, we can remove all
traditional barriers to participation so
anyone interested is allowed to work.
● Scalability
● Diversity
● Equal Opportunity
What are the symptoms of jaundice?
Experimental Setup
• Collected 10K relevance judgments through Mechanical Turk.
• Evaluated two main task types.
– Standard Task (Baseline): Assessors provide a relevance judgment
– Rationale Task: Assessors provide a relevance judgment & rationale.
– Two other variant designs will be mentioned later in talk...
• No worker qualifications or “honey-pot” questions used.
• Equal pay across all evaluated tasks.
17
Results - Accuracy
• Requiring rationales yields
much higher quality work.
• Accuracy with one rationale
(80%) not far off from five
standard judgments (86%)
18
Results - Cost-Efficiency
• Rationale tasks initially
slower, but the difference
becomes negligible with
task familiarity.
• Rationales make explicit the
implicit reasoning process
underlying labeling.
19
But wait, there’s more!
What about using the collected rationales?
20
Using Rationales: Overlap
21
Assessor 1 Rationale Assessor 2 Rationale
Using Rationales: Overlap
22
Assessor 1 Rationale Assessor 2 Rationale Overlap
Idea: Filter judgments based on pairwise rationale overlap among assessors.
Motivation: Workers who converge on similar rationales are likely to agree on labels too.
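A minimal sketch of such a filter, assuming rationales are free-text snippets and overlap is token-level Jaccard similarity (the paper's exact overlap measure and threshold may differ):

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two rationale snippets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

def filter_by_overlap(judgments, threshold=0.2):
    """Keep a label only if its rationale sufficiently overlaps some peer's rationale.

    judgments: list of (label, rationale_text) pairs for one (query, page) task.
    threshold: hypothetical cutoff; in practice tuned on held-out data.
    """
    kept = []
    for i, (label, rationale) in enumerate(judgments):
        peers = [r for j, (_, r) in enumerate(judgments) if j != i]
        if any(jaccard(rationale, p) >= threshold for p in peers):
            kept.append(label)
    return kept  # surviving labels are then aggregated, e.g., by majority vote
```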
Results - Accuracy (Overlap)
Filtering collected judgments
by rationale overlap before
aggregation increases quality.
23
Using Rationales: Two-Stage Task Design
24
Assessor 1 Rationale
Assessor 1: Relevant Assessor 2:
?
Idea: The reviewer must confirm or refute the initial assessor's judgment.
Motivation: The reviewer must consider their response in the
context of a peer's reasoning.
82% of Stage 1 errors fixed.
No new errors introduced.
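A sketch of how the two-stage outcome could be resolved (hypothetical function and argument names; the deployed task logic may differ):

```python
def resolve_two_stage(assessor_label, reviewer_agrees, reviewer_label=None):
    """Stage 2: a reviewer sees the stage 1 label and rationale, then confirms or refutes.

    If the reviewer confirms, the original judgment stands; if the reviewer
    refutes it, the reviewer's corrected label is used instead.
    """
    return assessor_label if reviewer_agrees else reviewer_label
```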
Results - Accuracy (Two-Stage)
• One review achieves same
accuracy as using four
extra standard judgments.
• Aggregating reviewers
reaches same accuracy as
filtered approaches.
25
1 Assessor +
1 Reviewer
1 Assessor +
4 Reviewers
The Big Picture
• Transparency
– Context for understanding and validating subjective answers.
– Convergence on justification-based crowdsourcing.
• Improved Accuracy
– Rationales make implicit explicit and hold workers accountable.
• Improved Cost-Efficiency
– No additional cost for collection once workers are familiar with task.
• Improved Aggregation
– Rationales can be used for filtering or aggregating judgments.
26
Future Work
27
• Dual Supervision: How can we further leverage
rationales for aggregation?
– Supervised learning over labels/rationales.
Zaidan, Eisner, & Piatko. NAACL 2007.
• Task Design: What about other sequential task
designs? (e.g., multi-stage)
• Generalizability: How far can we generalize
rationales to other tasks? (e.g., beyond images)
Donahue & Grauman. Annotator Rationales
for Visual Recognition. ICCV 2011.
Part II: Misinformation
&
Human-AI Partnerships
28
“Truthiness” is not a new problem
“Truthiness is tearing apart our country... It used to be,
everyone was entitled to their own opinion, but not
their own facts. But that’s not the case anymore.”
– Stephen Colbert (Jan. 25, 2006)
“You furnish the pictures and I’ll furnish the war.”
– William Randolph Hearst (Jan. 25, 1898)
29
Information Literacy
National Information Literacy Awareness Month,
US Presidential Proclamation, October 1, 2009.
“Though we may know how to find the information
we need, we must also know how to evaluate it.
Over the past decade, we have seen a crisis of
authenticity emerge. We now live in a world where
anyone can publish an opinion or perspective, true
or not, and have that opinion amplified…”
30
31
Automatic Fact-Checking
32
33
34
Design Challenge: How to interact with ML models?
35
Brief Case Study: Facebook
(simpler case: journalist fact-checking)
36
Tessa Lyons, a Facebook News Feed product manager:
“…putting a strong image, like a red flag, next to an
article may actually entrench deeply held beliefs —
the opposite effect to what we intended.”
37
Alternative Design
38
Another Alternative Design
39
AI & HCI for Misinformation
“A few classes in ‘use and users of information’ … could
have helped social media platforms avoid the common
pitfalls of the backfire effect in their fake news efforts
and perhaps even avoided … mob rule, virality-based
algorithmic prioritization in the first place.”
Kalev Leetaru, Forbes, August 5, 2019
https://www.forbes.com/sites/kalevleetaru/
40
Believe it or not: Designing a Human-AI
Partnership for Mixed-Initiative Fact-Checking
Joint work with
An Thanh Nguyen (UT), Byron Wallace (Northeastern), & more…
Matt Lease
School of Information, University of Texas at Austin
@mattlease • ml@utexas.edu
Slides:
slideshare.net/mattlease
42
Automatic Fact-Checking
43
Design Challenges
• Fair, Accountable, & Transparent (FAT) AI
– Why trust “black box” classifier?
– How do we reason about potential bias?
– Do people really only want to know “fact” vs. “fake”?
– How to integrate human knowledge/experience?
• Joint AI + Human Reasoning, Correct Errors, Personalization
• How to design strong Human + AI Partnerships?
– Horvitz, CHI’99: mixed-initiative design
– Dove et al., CHI’17 “Machine Learning As a Design Material”
44
• Crowdsourced stance labels
– Hybrid AI + Human (near real-time) Prediction
• Joint graphical model of stance, veracity, & annotators
– Interaction between variables
– Interpretable
• Source code on GitHub
Nguyen et al., AAAI’18
45
46
Demo!
Nguyen et al., UIST’18
Primary Interface
47
Source Reputation
48
System Architecture
• Google Search API
• Two logistic regression models
– average accuracy > 70% but with variance
– Stance (Ferreira & Vlachos ’16) w/ same features
– Veracity (Popat et al. ‘17)
– Scikit-learn, L1 regularization, Liblinear solver, & default parameters
49
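The slide's model configuration can be sketched as follows; the TF-IDF features below are a stand-in for the engineered features of Ferreira & Vlachos '16 and Popat et al. '17:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# L1-regularized logistic regression with the liblinear solver, as on the slide;
# scikit-learn defaults are used for all other parameters.
stance_model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(penalty="l1", solver="liblinear"),
)

# Hypothetical usage: article texts paired with stance labels (e.g., for/against/observing).
# stance_model.fit(train_texts, train_stances)
# predictions = stance_model.predict(test_texts)
```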
Data: Train & Test
Emergent (Ferreira & Vlachos ’16)
Accuracy of prediction models
50
Findings
• User studies on MTurk, ~100 participants per experiment
• Experiment 1: Whether or not they see model predictions
51
2 Groups: Control vs. System
52
2 Groups: Control vs. System
53
2 Groups: Control vs. System
55
2 Groups: Control vs. System
56
Summary of Findings
• User studies on MTurk, ~100 participants per experiment
• Experiment 1: Whether or not reputation/stance shown
– Predict claim veracity before & after seeing model predictions
– Result: human accuracy roughly follows model accuracy
• Experiment 2: Whether or not user can override predictions
– Predict claim veracity and give confidence in prediction
– Not statistically significant on average, interaction sometimes hurts
58
What about user bias?
59
New form of echo chamber?
Interaction promotes transparency & trust, but can affirm user bias
60
CobWeb: A Research Prototype for Exploring User Bias in Political Fact-Checking
Anubrata Das, Kunjan Mehta, and Matthew Lease
SIGIR 2019 Workshop on Fair, Accountable, Confidential, Transparent, and Safe Information Retrieval (FACTS-IR), July 25, 2019
62
We introduce “political leaning” (bias) as a function of the adjusted reputation of the news sources.
63
A user can alter the source reputation.
Changing reputation scores changes the predicted correctness.
Changing reputation scores changes the overall political leaning.
64
A user can alter the overall political leaning.
Changing the overall political leaning changes the predicted correctness.
Changing the overall political leaning changes the source reputations.
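One way to read this relationship is as a reputation-weighted average of per-source bias scores. Below is a minimal sketch under assumed scales (source bias in [-1, +1] from left- to right-leaning, reputation in [0, 1]); CobWeb's actual formula may differ:

```python
def overall_leaning(sources):
    """sources: list of (bias, reputation) pairs, bias in [-1, +1], reputation in [0, 1].

    Raising a source's reputation pulls the overall leaning toward that
    source's bias, mirroring the interaction the slides describe.
    """
    total_rep = sum(rep for _, rep in sources)
    if total_rep == 0:
        return 0.0
    return sum(bias * rep for bias, rep in sources) / total_rep

# Hypothetical example: boosting a right-leaning source's reputation.
print(overall_leaning([(-0.8, 0.9), (0.6, 0.2)]))  # ~ -0.55: leans left
print(overall_leaning([(-0.8, 0.9), (0.6, 0.9)]))  # ~ -0.10: shifts toward center
```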
66
Post-task Questionnaire
01 Accuracy: Knowing the overall political leaning would help me accurately predict the truthfulness of claims.
02 Effectiveness: Knowing the overall political leaning would enable me to effectively predict the truthfulness of claims.
03 Ease of Use: Knowing the overall political leaning would make it easier to predict the truthfulness of claims.
04 Usefulness: Communicating overall political leaning is useful in predicting the truthfulness of claims.
67
8/10 participants correctly identified the correctness of an imaginary claim.
6/10 participants found the change in overall political leaning as a function of the change in reputation score intuitive.
8/10 participants found it useful to have an indicator of their overall political leaning in a claim-checking scenario.
6/10 participants found the relationship between the change in overall political leaning and the source reputation score intuitive.
68
01 Contribution: An interface that communicates a user's own political biases in a fact-checking context.
02 Conclusion: Communicating a user's own bias in fact-checking helps the user assess the credibility of a claim.
03 Future Work: Bias detection on real data; extensive user study; evaluation design for search bias.
CobWeb: A Research Prototype for Exploring User Bias in Political Fact-Checking (working paper; more to come!)
https://arxiv.org/abs/1907.03718
A Conceptual Framework for Evaluating Fairness in Search
https://arxiv.org/abs/1907.09328
Wrap-up on
Misinformation
• Fact-checking more than black-box prediction: Interaction, exploration, trust
– Useful problem for grounding work on Fair, Accountable, & Transparent (FAT) AI
• Mixed-initiative human + AI partnership for fact-checking
– backend NLP + front-end interaction
• Fact Checking & IR (Lease, DESIRES’18)
– How to diversify search results for controversial topics?
– Information evaluation (e.g., vaccination & autism)
• Potential harm as well as good
– Potential added confusion, data / algorithmic bias
– Potential for personal “echo chamber”
– Adversarial settings
69
Thank You!
Slides: slideshare.net/mattlease
Lab: ir.ischool.utexas.edu
70