The Search for Truth in
Objective & Subjective Crowdsourcing
Matt Lease
School of Information
University of Texas at Austin
ir.ischool.utexas.edu
@mattlease
ml@utexas.edu
Roadmap
• Two quick items
– What’s an iSchool & why pursue graduate study there?
– MTurk: anonymity & human subjects research
• Finding Consensus for Objective Tasks
• Subjective Relevance & Psychometrics
2
Matt Lease <ml@utexas.edu>
“The place where people & technology meet”
~ Wobbrock et al., 2009
www.ischools.org
4
FYI: MTurk & Human Subjects Research
• “What are the characteristics of MTurk workers?... the MTurk system is set up to strictly protect workers’ anonymity….”
5
An MTurk worker’s ID is also their customer ID on Amazon. Public profile pages can link worker ID to name.
Lease et al., SSRN’13 6
Roadmap
• Two quick items
– What’s an iSchool & why pursue graduate study there?
– MTurk: anonymity & human subjects research
• Finding Consensus for Objective Tasks
• Subjective Relevance & Psychometrics
7
Matt Lease <ml@utexas.edu>
Finding Consensus in Human Computation
• For an objective labeling task, how do we resolve
disagreement between respondents?
– e.g., majority voting, weighted voting
– Contrast cases: subjective, polling, & ideation
• Research pre-dates crowdsourcing (e.g. experts)
– Dawid and Skene’79, Smyth et al., ’95
• One of the most studied problems in HCOMP
– Quality control of crowd labeling via plurality
– Methods in many areas: ML, Vision, NLP, IR, DB, …
– With all the time & $$$ invested, what have we learned?
8
Matt Lease <ml@utexas.edu>
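To make the simplest baseline on this slide concrete, here is a minimal majority-voting sketch in Python; the data format and names are hypothetical, not taken from any system discussed here.

```python
from collections import Counter

def majority_vote(labels_by_item):
    """Return the most frequent label per item (ties broken arbitrarily)."""
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in labels_by_item.items()}

# Hypothetical worker labels collected per item
labels_by_item = {
    "doc1": ["rel", "rel", "nonrel"],
    "doc2": ["nonrel", "nonrel", "rel", "nonrel"],
}
print(majority_vote(labels_by_item))  # {'doc1': 'rel', 'doc2': 'nonrel'}
```

Weighted voting follows the same pattern, scaling each worker’s vote by an estimated reliability instead of counting every vote equally.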
Value of Benchmarking
• “If you cannot measure it, you cannot improve it.”
• Drive field innovation by clear challenge tasks
– e.g., David Tse’s FIST 2012 Keynote (Comp. Biology)
• Tackling important questions
– What is the current state-of-the-art?
– How do current methods compare?
– What works, what doesn’t, and why?
– How has the field progressed over time? 9
Matt Lease <ml@utexas.edu>
10
Matt Lease <ml@utexas.edu>
SQUARE: A Benchmark for Research on Computing Crowd Consensus
@HCOMP’13
ir.ischool.utexas.edu/square
(open source)
11
“Real” Crowdsourcing Datasets
12
How does the crowd behave?
Methods
Includes popular and/or open-source methods
• Task / Model / Supervision / Estimation & sparsity
• Task-independent
– Majority Voting
– ZenCrowd (Demartini et al., 2012), EM-based
– GLAD (Whitehill et al., 2009)
• Classification-specific (confusion matrices)
– Snow et al., 2008, Naïve Bayes
– Dawid & Skene (1979), EM-based
– Raykar et al. (2012)
– CUBAM (Welinder et al., 2010)
Matt Lease <ml@utexas.edu>
13
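As a concrete reference point for the EM-based methods listed above, below is a compact, binary-label sketch in the style of Dawid & Skene (1979); the input format and simplifications are mine for illustration, not the SQUARE implementation.

```python
import numpy as np
from collections import defaultdict

def dawid_skene_binary(triples, n_iter=50):
    """EM estimate of true binary labels and per-worker confusion matrices.
    triples: list of (worker, item, label) with label in {0, 1}."""
    workers = sorted({w for w, _, _ in triples})
    items = sorted({i for _, i, _ in triples})
    w_idx = {w: k for k, w in enumerate(workers)}
    i_idx = {i: k for k, i in enumerate(items)}

    votes = defaultdict(list)                    # item index -> [(worker index, label)]
    for w, i, l in triples:
        votes[i_idx[i]].append((w_idx[w], l))

    # Initialize posteriors P(true label = 1) with soft majority votes
    T = np.array([np.mean([l for _, l in votes[i]]) for i in range(len(items))])

    for _ in range(n_iter):
        # M-step: class prior & worker confusion matrices pi[worker, true, observed]
        prior1 = T.mean()
        pi = np.full((len(workers), 2, 2), 1e-6)  # small smoothing to avoid zeros
        for i, obs in votes.items():
            for w, l in obs:
                pi[w, 1, l] += T[i]
                pi[w, 0, l] += 1.0 - T[i]
        pi /= pi.sum(axis=2, keepdims=True)

        # E-step: update each item's posterior given the prior & confusion matrices
        for i, obs in votes.items():
            p1, p0 = prior1, 1.0 - prior1
            for w, l in obs:
                p1 *= pi[w, 1, l]
                p0 *= pi[w, 0, l]
            T[i] = p1 / (p1 + p0)

    consensus = {item: int(T[i_idx[item]] > 0.5) for item in items}
    return consensus, pi
```

The other EM-based methods on this slide build on the same skeleton with different parameterizations (e.g., priors over worker reliability, item difficulty).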
Results: Unsupervised Accuracy
Relative effectiveness vs. majority voting
15
[Bar chart: relative accuracy of DS, ZC, RY, GLAD, and CUBAM vs. majority voting, ranging roughly -15% to +15%, across datasets BM, HCB, SpamCF, WVSCM, WB, RTE, TEMP, WSD, AC2, HC, and ALL]
Results: Varying Supervision
16
Matt Lease <ml@utexas.edu>
Findings
• Majority voting never best, but rarely much worse
• No method performs far better than others
• Each method often best for some condition
– e.g., on the dataset a method was originally designed for
• DS & RY tend to perform best (RY adds priors)
– ZC (also EM-based) does well with injected noise
17
Matt Lease <ml@utexas.edu>
Provocative: So Where’s the Progress?
• Sure, progress is not only empirical, but…
• Maybe gold is too noisy to detect improvement?
– Cormack & Kolcz’09, Klebanov & Beigman’10
• Might we see bigger differences from
– Different tasks/scenarios? Larger data scales?
– Better methods or tuning? Better benchmark tests?
– Spammer detection and filtering?
• We invite community contributions!
18
Matt Lease <ml@utexas.edu>
Roadmap
• Two quick items
– What’s an iSchool & why pursue graduate study there?
– MTurk: anonymity & human subjects research
• Finding Consensus for Objective Tasks
• Subjective Relevance & Psychometrics
19
Matt Lease <ml@utexas.edu>
Multidimensional Relevance Modeling
via Psychometrics and Crowdsourcing
Joint work with
Yinglong Zhang Jin Zhang Jacek Gwizdka
Paper @ SIGIR 2014
Matt Lease <ml@utexas.edu>
20
How to Evaluate a Search Engine?
• 3 complementary approaches (with tradeoffs)
– Log analysis (“big data”): e.g., infer relevance from clicks
– User study: users perform controlled search task(s)
– Annotate: 1) create a set of queries, 2) label document
relevance to each, & 3) measure algorithmic effectiveness
• Cranfield (Cleverdon et al., 1966), simplified topical relevance
• Examples from Google
– Video: How Google makes improvements to its search
– Video: How does Google use human raters in web search?
– Search Quality Rating Guidelines (November 2, 2012) 21
Matt Lease <ml@utexas.edu>
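As a minimal illustration of step 3 above (measuring algorithmic effectiveness from relevance labels), here is a precision@k sketch; the document IDs and judgments are hypothetical.

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k ranked documents that annotators judged relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

# Hypothetical judgments & system output for one query
relevant = {"d3", "d7", "d9"}                  # docs labeled relevant for the query
ranking = ["d7", "d1", "d3", "d4", "d9"]       # ranked list returned by the system
print(precision_at_k(ranking, relevant, k=5))  # 0.6
```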
Saracevic’s 1997 Salton Award address
“…the human-centered side was often highly critical
of the systems side for ignoring users... [when]
results have implications for systems design &
practice. Unfortunately… beyond suggestions,
concrete design solutions were not delivered.
“…the systems side by and large ignores the user
side and user studies… the stance is ‘tell us what
to do and we will.’ But nobody is telling...
“Thus, there are not many interactions…”
Matt Lease <ml@utexas.edu> 22/20
RQs: Information Retrieval
• What is relevance?
– What factors constitute it? Can we quantify their
relative importance? How do they interact?
• Old question, many studies, little agreement
• Significance
– Increase fundamental understanding of relevance
– Foster multi-dimensional evaluation of IR systems
– Bridge human & system-centered relevance modeling
• Create multi-dimensional judgment data for training & eval
• Motivate research to automatically infer underlying factors
Matt Lease <ml@utexas.edu> 23/20
RQs: Crowdsourcing Subjective Tasks
• How can we measure/ensure the quality of
subjective judgments (especially online)?
– Traditional, trusted personnel often disagree in
judging even simplified topical relevance
– How to distinguish valid subjectivity vs. human error?
• Significance
– Promote systematic study of quality assurance for
subjective tasks in HCOMP community
– Help explain/reduce observed labeling disagreements
Matt Lease <ml@utexas.edu> 24/20
Why Eytan Adar hates MTurk Research
(CHI 2011 CHC Workshop)
• Missing/ignoring prior work in other disciplines
– It turns out other fields have thought (a lot) about
a number of problems that show up in HCOMP!
• And other stuff (fun read…)
25
Social Sciences have been…
• …collecting reliable, subjective data from online
participants before “crowdsourcing” was coined
• …inferring latent factors and relationships from
noisy, observed data using powerful modeling
techniques that are positivist and data-driven
• …using MTurk to reproduce many traditional
behavioral studies with university students
Maybe we can learn something from them?
Matt Lease <ml@utexas.edu>
26
Psychology to the Rescue!
• A Guide to Behavioral Experiments
on Mechanical Turk
– W. Mason and S. Suri (2010). SSRN online.
• Crowdsourcing for Human Subjects Research
– L. Schmidt (CrowdConf 2010)
• Crowdsourcing Content Analysis for Behavioral Research:
Insights from Mechanical Turk
– Conley & Tosti-Kharas (2010). Academy of Management
• Amazon's Mechanical Turk : A New Source of
Inexpensive, Yet High-Quality, Data?
– M. Buhrmester et al. (2011). Perspectives… 6(1):3-5.
– see also: Amazon Mechanical Turk Guide for Social Scientists
27/20
Key Ideas from Psychometrics
• Use established survey techniques to collect
subjective relevance judgments
– Ask repeated, similar questions, & change polarity
• Analyze via Structural Equation Modeling (SEM)
– Cousin to graphical models in statistics/AI
– Posit questions associated with latent factors
– Use Exploratory Factor Analysis (EFA) to assess
question-factor relationships & prune “bad” questions
– Use Confirmatory Factor Analysis (CFA) to assess
correlations, test significance, & compare models
Matt Lease <ml@utexas.edu>
28
Collecting multi-dimensional relevance judgments
• Participant picks one of several pre-defined topics
– You want to plan a one week vacation in China
• Participant assigned a Web page to judge
– We wrote a query for each topic, submitted to a popular
search engine, and did stratified sampling of results
• Participant answers a set of Likert-scale questions
– I think the information in this page is incorrect
– It’s difficult to understand the information in this page
Matt Lease <ml@utexas.edu> 29/20
How do we ask the questions?
• Ask 3+ questions per hypothesized dimension
– Ask repeated, similar questions, & change polarity
– Randomize question order (don’t group questions)
– Over-generate questions to allow for later pruning
– Exclude participants failing self-consistency checks
• Survey design principles: tailor, engage, QA
– Use clear, familiar, non-leading wording
– Balance response scale and question polarity
– Pre-test survey in-house, then pilot study online
Matt Lease <ml@utexas.edu> 30/20
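A minimal sketch of the polarity reversal and self-consistency screening described above, assuming a 5-point Likert scale; the item names and tolerance threshold are hypothetical choices.

```python
def reverse_code(score, scale_max=5):
    """Map a negatively worded item onto the same direction as positive items."""
    return scale_max + 1 - score

def passes_consistency_check(responses, polarity_pairs, max_gap=2):
    """responses: item id -> raw score; polarity_pairs: (positive item, negated item).
    Reject participants whose paired answers disagree by more than max_gap
    after reverse-coding the negated item."""
    for pos, neg in polarity_pairs:
        if abs(responses[pos] - reverse_code(responses[neg])) > max_gap:
            return False
    return True

# Hypothetical participant: q1/q2 and q3/q4 probe the same factor with opposite polarity
participant = {"q1_pos": 5, "q2_neg": 1, "q3_pos": 4, "q4_neg": 3}
pairs = [("q1_pos", "q2_neg"), ("q3_pos", "q4_neg")]
print(passes_consistency_check(participant, pairs))  # True
```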
What Questions might we ask?
• What factors might determine relevance?
• We adopt the same 5 factors as (Xu & Chen, 2006)
– Topicality, reliability, novelty, understandability, & scope
– Choosing the same factors makes the revised mechanics & any difference in findings maximally clear
• Assume factors are incomplete & imperfect
– Positivist approach: do these factors explain
observed data better than other alternatives:
uni-dimensional relevance or another set of factors?
Matt Lease <ml@utexas.edu> 31/20
Structural Equation Modeling (SEM)
• Based on Sewell Wright’s path analysis (1921)
– A factor model is parameterized by factor loadings,
covariances, & residual error terms
• Graphical representation: path diagram
– Observed variables in boxes
– Latent variables in ovals
– Directed edges denote
causal relationships
– Residual error terms
implicitly assumed
Matt Lease <ml@utexas.edu> 32/20
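In equation form, the measurement part of such a factor model is conventionally written as below (a standard SEM formulation, not notation copied from the paper): observed question responses x load on latent factors ξ through the loading matrix Λ, plus residual error δ.

```latex
\[
  \mathbf{x} = \boldsymbol{\Lambda}\,\boldsymbol{\xi} + \boldsymbol{\delta},
  \qquad
  \operatorname{Cov}(\mathbf{x}) = \boldsymbol{\Lambda}\,\boldsymbol{\Phi}\,\boldsymbol{\Lambda}^{\top} + \boldsymbol{\Theta}_{\delta}
\]
% Lambda: factor loadings; Phi: factor covariances; Theta_delta: residual error variances
```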
Exploratory Factor Analysis (EFA) – 1 of 2
• Is the sample large enough for EFA?
– Kaiser-Meyer-Olkin (KMO) Measure of Sampling Adequacy
– Bartlett’s Test of Sphericity
• Principal Axis Factoring (PAF) to find eigenvalues
– Assume some large, constant # of latent factors
– Assume each factor has connecting edge to each question
– Estimate factor model parameters by least-squares fit
• Prune factors via Parallel Analysis
– Create random data with same # of respondents & questions
– Create correlation matrix and find eigenvalues
Matt Lease <ml@utexas.edu> 33/20
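A sketch of the sample-adequacy checks and eigenvalue extraction above. The paper reports using standard R libraries; the Python factor_analyzer package used here is my substitution for illustration, and the file name is hypothetical.

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# participants x questions matrix of Likert responses (hypothetical file)
responses = pd.read_csv("relevance_survey_responses.csv")

chi_square, p_value = calculate_bartlett_sphericity(responses)  # want p < .05
kmo_per_item, kmo_overall = calculate_kmo(responses)            # want overall KMO > ~.6

fa = FactorAnalyzer(rotation=None)  # extract factors from the full question set first
fa.fit(responses)
eigenvalues, _ = fa.get_eigenvalues()
print(p_value, kmo_overall, eigenvalues)
```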
Exploratory Factor Analysis (EFA) – 2 of 2
• Perform Parallel Analysis
– Create random data w/ same # of respondents & questions
– Create correlation matrix and find eigenvalues
• Create Scree Plot of Eigenvalues
• Re-run EFA for reduced factors
• Compute Pearson correlations
• Discard questions with:
– Weak factor loading
– Strong cross-factor loading
– Lack of logical interpretation
• Kenny’s Rule: need >= 2 questions per factor for EFA
Matt Lease <ml@utexas.edu> 34/20
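A numpy-only sketch of the parallel analysis step above: eigenvalues of the observed correlation matrix are compared against eigenvalues of random data with the same number of respondents and questions. Using the mean of the random eigenvalues as the cutoff is one common convention; the 95th percentile is another.

```python
import numpy as np

def parallel_analysis(data, n_sims=100, seed=0):
    """Keep factors whose observed eigenvalues exceed those of random data
    of the same shape (respondents x questions)."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    observed = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]
    random_eigs = np.empty((n_sims, p))
    for s in range(n_sims):
        noise = rng.standard_normal((n, p))
        random_eigs[s] = np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False))[::-1]
    cutoff = random_eigs.mean(axis=0)
    return int(np.sum(observed > cutoff)), observed, cutoff
```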
Question-Factor Loadings (Weights)
Matt Lease <ml@utexas.edu> 35/20
CFA: Assess and Compare Models
• First-order baseline model uses a single latent factor to explain observed data
• Posited hierarchical factor model uses 5 relevance dimensions
Matt Lease <ml@utexas.edu> 36/20
Confirmatory Factor Analysis (CFA)
• Null model assumes observations are independent
– Covariance between questions fixed at 0; means & variances left free
• Comparison stats
– Non-Normed Fit Index (NNFI)
– Comparative Fit Index (CFI)
– Root Mean Square Error of Approximation (RMSEA)
– Standardized Root Mean Square Residual (SRMR)
Matt Lease <ml@utexas.edu> 37/20
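For reference, two of the fit statistics above have simple closed forms; these are the standard textbook definitions (not reproduced from the paper), with χ²_M, df_M for the fitted model, χ²_0, df_0 for the null model, and N the sample size.

```latex
\[
  \mathrm{RMSEA} = \sqrt{\frac{\max(\chi^2_M - df_M,\ 0)}{df_M\,(N-1)}},
  \qquad
  \mathrm{CFI} = 1 - \frac{\max(\chi^2_M - df_M,\ 0)}{\max(\chi^2_0 - df_0,\ \chi^2_M - df_M,\ 0)}
\]
```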
Contributions
• Simple, reliable, scalable way to collect diverse (subjective),
multi-dimensional judgments from online participants
– Online survey techniques from psychometrics
– Doesn’t require objective task, gold labels, or N+ judges
– Helps distinguish subjectivity vs. error
• Describe a rigorous, positivist, data-driven framework for
inferring & modeling multi-dimensional relevance
– Structural equation modeling (SEM) from psychometrics
– Run the experiment & let the data speak for itself
• Implemented in standard R libraries, data available online
Matt Lease <ml@utexas.edu> 38/20
Future Directions
• More data-driven positivist research into factors
– Different user groups, search scenarios, devices, etc.
– Need more data to support normative claims
• Train/test operational systems for varying factors
– Identify/extend detected features for each dimension
– Personalize search results for individual preferences
• Improve agreement by making the task more natural and/or analyze latent factors when disagreement occurs
• Intra-subject vs. inter-subject aggregation?
– Other methods for ensuring subjective data quality?
• SEM vs. graphical models?
39/20
Thank You!
ir.ischool.utexas.edu
40
Slides: www.slideshare.net/mattlease
