The Search for Truth in
Objective & Subjective Crowdsourcing
Matt Lease
School of Information
University of Texas at Austin
ir.ischool.utexas.edu
@mattlease
ml@utexas.edu
Roadmap
• Two quick items
– What’s an iSchool & why pursue graduate study there?
– MTurk: anonymity & human subjects research
• Finding Consensus for Objective Tasks
• Subjective Relevance & Psychometrics
2
Matt Lease <ml@utexas.edu>
“The place where people & technology meet”
~ Wobbrock et al., 2009
www.ischools.org
4
FYI: MTurk & Human Subjects Research
• “What are the characteristics of MTurk workers?... the MTurk system is set up to strictly protect workers’ anonymity….”
5
An MTurk worker’s ID is also their customer ID on Amazon. Public profile pages can link worker ID to name.
Lease et al., SSRN’13 6
Roadmap
• Two quick items
– What’s an iSchool & why pursue graduate study there?
– MTurk: anonymity & human subjects research
• Finding Consensus for Objective Tasks
• Subjective Relevance & Psychometrics
7
Matt Lease <ml@utexas.edu>
Finding Consensus in Human Computation
• For an objective labeling task, how do we resolve
disagreement between respondents?
– e.g., majority voting, weighted voting
– Contrast cases: subjective, polling, & ideation
• Research pre-dates crowdsourcing (e.g. experts)
– Dawid and Skene’79, Smyth et al., ’95
• One of the most studied problems in HCOMP
– Quality control of crowd labeling via plurality
– Methods in many areas: ML, Vision, NLP, IR, DB, …
– With all the time & $$$ invested, what have we learned?
8
Matt Lease <ml@utexas.edu>
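To make the simplest baseline on this slide concrete, here is a minimal majority-voting sketch in Python; the data format and names are hypothetical, not taken from any system discussed here.

```python
from collections import Counter

def majority_vote(labels_by_item):
    """Return the most frequent label per item (ties broken arbitrarily)."""
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in labels_by_item.items()}

# Hypothetical worker labels collected per item
labels_by_item = {
    "doc1": ["rel", "rel", "nonrel"],
    "doc2": ["nonrel", "nonrel", "rel", "nonrel"],
}
print(majority_vote(labels_by_item))  # {'doc1': 'rel', 'doc2': 'nonrel'}
```

Weighted voting follows the same pattern, scaling each worker’s vote by an estimated reliability instead of counting every vote equally.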
Value of Benchmarking
• “If you cannot measure it, you cannot improve it.”
• Drive field innovation by clear challenge tasks
– e.g., David Tse’s FIST 2012 Keynote (Comp. Biology)
• Tackling important questions
– What is the current state-of-the-art?
– How do current methods compare?
– What works, what doesn’t, and why?
– How has the field progressed over time? 9
Matt Lease <ml@utexas.edu>
10
Matt Lease <ml@utexas.edu>
SQUARE: A Benchmark for Research on Computing Crowd Consensus
@HCOMP’13
ir.ischool.utexas.edu/square
(open source)
11
“Real” Crowdsourcing Datasets
12
How does the crowd behave?
Methods
Includes popular and/or open-source methods
• Task / Model / Supervision / Estimation & sparsity
• Task-independent
– Majority Voting
– ZenCrowd (Demartini et al., 2012), EM-based
– GLAD (Whitehill et al., 2009)
• Classification-specific (confusion matrices)
– Snow et al., 2008, Naïve Bayes
– Dawid & Skene (1979), EM-based
– Raykar et al. (2012)
– CUBAM (Welinder et al., 2010)
Matt Lease <ml@utexas.edu>
13
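As a concrete reference point for the EM-based methods listed above, below is a compact, binary-label sketch in the style of Dawid & Skene (1979); the input format and simplifications are mine for illustration, not the SQUARE implementation.

```python
import numpy as np
from collections import defaultdict

def dawid_skene_binary(triples, n_iter=50):
    """EM estimate of true binary labels and per-worker confusion matrices.
    triples: list of (worker, item, label) with label in {0, 1}."""
    workers = sorted({w for w, _, _ in triples})
    items = sorted({i for _, i, _ in triples})
    w_idx = {w: k for k, w in enumerate(workers)}
    i_idx = {i: k for k, i in enumerate(items)}

    votes = defaultdict(list)                    # item index -> [(worker index, label)]
    for w, i, l in triples:
        votes[i_idx[i]].append((w_idx[w], l))

    # Initialize posteriors P(true label = 1) with soft majority votes
    T = np.array([np.mean([l for _, l in votes[i]]) for i in range(len(items))])

    for _ in range(n_iter):
        # M-step: class prior & worker confusion matrices pi[worker, true, observed]
        prior1 = T.mean()
        pi = np.full((len(workers), 2, 2), 1e-6)  # small smoothing to avoid zeros
        for i, obs in votes.items():
            for w, l in obs:
                pi[w, 1, l] += T[i]
                pi[w, 0, l] += 1.0 - T[i]
        pi /= pi.sum(axis=2, keepdims=True)

        # E-step: update each item's posterior given the prior & confusion matrices
        for i, obs in votes.items():
            p1, p0 = prior1, 1.0 - prior1
            for w, l in obs:
                p1 *= pi[w, 1, l]
                p0 *= pi[w, 0, l]
            T[i] = p1 / (p1 + p0)

    consensus = {item: int(T[i_idx[item]] > 0.5) for item in items}
    return consensus, pi
```

The other EM-based methods on this slide build on the same skeleton with different parameterizations (e.g., priors over worker reliability, item difficulty).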
Results: Unsupervised Accuracy
Relative effectiveness vs. majority voting
15
[Bar chart: relative accuracy of DS, ZC, RY, GLAD, and CUBAM vs. majority voting, ranging roughly -15% to +15%, across datasets BM, HCB, SpamCF, WVSCM, WB, RTE, TEMP, WSD, AC2, HC, and ALL]
Results: Varying Supervision
16
Matt Lease <ml@utexas.edu>
Findings
• Majority voting never best, but rarely much worse
• No method performs far better than others
• Each method often best for some condition
– e.g., on the dataset a method was originally designed for
• DS & RY tend to perform best (RY adds priors)
– ZC (also EM-based) does well with injected noise
17
Matt Lease <ml@utexas.edu>
Provocative: So Where’s the Progress?
• Sure, progress is not only empirical, but…
• Maybe gold is too noisy to detect improvement?
– Cormack & Kolcz’09, Klebanov & Beigman’10
• Might we see bigger differences from
– Different tasks/scenarios? Larger data scales?
– Better methods or tuning? Better benchmark tests?
– Spammer detection and filtering?
• We invite community contributions!
18
Matt Lease <ml@utexas.edu>
Roadmap
• Two quick items
– What’s an iSchool & why pursue graduate study there?
– MTurk: anonymity & human subjects research
• Finding Consensus for Objective Tasks
• Subjective Relevance & Psychometrics
19
Matt Lease <ml@utexas.edu>
Multidimensional Relevance Modeling
via Psychometrics and Crowdsourcing
Joint work with
Yinglong Zhang Jin Zhang Jacek Gwizdka
Paper @ SIGIR 2014
Matt Lease <ml@utexas.edu>
20
How to Evaluate a Search Engine?
• 3 complementary approaches (with tradeoffs)
– Log analysis (“big data”): e.g., infer relevance from clicks
– User study: users perform controlled search task(s)
– Annotate: 1) create a set of queries, 2) label document
relevance to each, & 3) measure algorithmic effectiveness
• Cranfield (Cleverdon et al., 1966), simplified topical relevance
• Examples from Google
– Video: How Google makes improvements to its search
– Video: How does Google use human raters in web search?
– Search Quality Rating Guidelines (November 2, 2012) 21
Matt Lease <ml@utexas.edu>
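As a minimal illustration of step 3 above (measuring algorithmic effectiveness from relevance labels), here is a precision@k sketch; the document IDs and judgments are hypothetical.

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k ranked documents that annotators judged relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

# Hypothetical judgments & system output for one query
relevant = {"d3", "d7", "d9"}                  # docs labeled relevant for the query
ranking = ["d7", "d1", "d3", "d4", "d9"]       # ranked list returned by the system
print(precision_at_k(ranking, relevant, k=5))  # 0.6
```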
Saracevic’s 1997 Salton Award address
“…the human-centered side was often highly critical
of the systems side for ignoring users... [when]
results have implications for systems design &
practice. Unfortunately… beyond suggestions,
concrete design solutions were not delivered.
“…the systems side by and large ignores the user
side and user studies… the stance is ‘tell us what
to do and we will.’ But nobody is telling...
“Thus, there are not many interactions…”
Matt Lease <ml@utexas.edu> 22/20
RQs: Information Retrieval
• What is relevance?
– What factors constitute it? Can we quantify their
relative importance? How do they interact?
• Old question, many studies, little agreement
• Significance
– Increase fundamental understanding of relevance
– Foster multi-dimensional evaluation of IR systems
– Bridge human & system-centered relevance modeling
• Create multi-dimensional judgment data for training & eval
• Motivate research to automatically infer underlying factors
Matt Lease <ml@utexas.edu> 23/20
RQs: Crowdsourcing Subjective Tasks
• How can we measure/ensure the quality of
subjective judgments (especially online)?
– Traditional, trusted personnel often disagree in
judging even simplified topical relevance
– How to distinguish valid subjectivity vs. human error?
• Significance
– Promote systematic study of quality assurance for
subjective tasks in HCOMP community
– Help explain/reduce observed labeling disagreements
Matt Lease <ml@utexas.edu> 24/20
Why Eytan Adar hates MTurk Research
(CHI 2011 CHC Workshop)
• Missing/ignoring prior work in other disciplines
– It turns out other fields have thought (a lot) about
a number of problems that show up in HCOMP!
• And other stuff (fun read…)
25
Social Sciences have been…
• …collecting reliable, subjective data from online
participants before “crowdsourcing” was coined
• …inferring latent factors and relationships from
noisy, observed data using powerful modeling
techniques that are positivist and data-driven
• …using MTurk to reproduce many traditional
behavioral studies with university students
Maybe we can learn something from them?
Matt Lease <ml@utexas.edu>
26
Psychology to the Rescue!
• A Guide to Behavioral Experiments
on Mechanical Turk
– W. Mason and S. Suri (2010). SSRN online.
• Crowdsourcing for Human Subjects Research
– L. Schmidt (CrowdConf 2010)
• Crowdsourcing Content Analysis for Behavioral Research:
Insights from Mechanical Turk
– Conley & Tosti-Kharas (2010). Academy of Management
• Amazon's Mechanical Turk : A New Source of
Inexpensive, Yet High-Quality, Data?
– M. Buhrmester et al. (2011). Perspectives… 6(1):3-5.
– see also: Amazon Mechanical Turk Guide for Social Scientists
27/20
Key Ideas from Psychometrics
• Use established survey techniques to collect
subjective relevance judgments
– Ask repeated, similar questions, & change polarity
• Analyze via Structural Equation Modeling (SEM)
– Cousin to graphical models in statistics/AI
– Posit questions associated with latent factors
– Use Exploratory Factor Analysis (EFA) to assess
question-factor relationships & prune “bad” questions
– Use Confirmatory Factor Analysis (CFA) to assess
correlations, test significance, & compare models
Matt Lease <ml@utexas.edu>
28
Collecting multi-dimensional relevance judgments
• Participant picks one of several pre-defined topics
– You want to plan a one week vacation in China
• Participant assigned a Web page to judge
– We wrote a query for each topic, submitted to a popular
search engine, and did stratified sampling of results
• Participant answers a set of Likert-scale questions
– I think the information in this page is incorrect
– It’s difficult to understand the information in this page
Matt Lease <ml@utexas.edu> 29/20
How do we ask the questions?
• Ask 3+ questions per hypothesized dimension
– Ask repeated, similar questions, & change polarity
– Randomize question order (don’t group questions)
– Over-generate questions to allow for later pruning
– Exclude participants failing self-consistency checks
• Survey design principles: tailor, engage, QA
– Use clear, familiar, non-leading wording
– Balance response scale and question polarity
– Pre-test survey in-house, then pilot study online
Matt Lease <ml@utexas.edu> 30/20
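A minimal sketch of the polarity reversal and self-consistency screening described above, assuming a 5-point Likert scale; the item names and tolerance threshold are hypothetical choices.

```python
def reverse_code(score, scale_max=5):
    """Map a negatively worded item onto the same direction as positive items."""
    return scale_max + 1 - score

def passes_consistency_check(responses, polarity_pairs, max_gap=2):
    """responses: item id -> raw score; polarity_pairs: (positive item, negated item).
    Reject participants whose paired answers disagree by more than max_gap
    after reverse-coding the negated item."""
    for pos, neg in polarity_pairs:
        if abs(responses[pos] - reverse_code(responses[neg])) > max_gap:
            return False
    return True

# Hypothetical participant: q1/q2 and q3/q4 probe the same factor with opposite polarity
participant = {"q1_pos": 5, "q2_neg": 1, "q3_pos": 4, "q4_neg": 3}
pairs = [("q1_pos", "q2_neg"), ("q3_pos", "q4_neg")]
print(passes_consistency_check(participant, pairs))  # True
```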
What Questions might we ask?
• What factors might determine relevance?
• We adopt the same 5 factors as (Xu & Chen, 2006)
– Topicality, reliability, novelty, understandability, & scope
– Choosing the same factors makes the revised mechanics & any difference in findings maximally clear
• Assume factors are incomplete & imperfect
– Positivist approach: do these factors explain
observed data better than other alternatives:
uni-dimensional relevance or another set of factors?
Matt Lease <ml@utexas.edu> 31/20
Structural Equation Modeling (SEM)
• Based on Sewell Wright’s path analysis (1921)
– A factor model is parameterized by factor loadings,
covariances, & residual error terms
• Graphical representation: path diagram
– Observed variables in boxes
– Latent variables in ovals
– Directed edges denote
causal relationships
– Residual error terms
implicitly assumed
Matt Lease <ml@utexas.edu> 32/20
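In equation form, the measurement part of such a factor model is conventionally written as below (a standard SEM formulation, not notation copied from the paper): observed question responses x load on latent factors ξ through the loading matrix Λ, plus residual error δ.

```latex
\[
  \mathbf{x} = \boldsymbol{\Lambda}\,\boldsymbol{\xi} + \boldsymbol{\delta},
  \qquad
  \operatorname{Cov}(\mathbf{x}) = \boldsymbol{\Lambda}\,\boldsymbol{\Phi}\,\boldsymbol{\Lambda}^{\top} + \boldsymbol{\Theta}_{\delta}
\]
% Lambda: factor loadings; Phi: factor covariances; Theta_delta: residual error variances
```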
Exploratory Factor Analysis (EFA) – 1 of 2
• Is the sample large enough for EFA?
– Kaiser-Meyer-Olkin (KMO) Measure of Sampling Adequacy
– Bartlett’s Test of Sphericity
• Principal Axis Factoring (PAF) to find eigenvalues
– Assume some large, constant # of latent factors
– Assume each factor has connecting edge to each question
– Estimate factor model parameters by least-squares fit
• Prune factors via Parallel Analysis
– Create random data with same # of respondents & questions
– Create correlation matrix and find eigenvalues
Matt Lease <ml@utexas.edu> 33/20
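A sketch of the sample-adequacy checks and eigenvalue extraction above. The paper reports using standard R libraries; the Python factor_analyzer package used here is my substitution for illustration, and the file name is hypothetical.

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# participants x questions matrix of Likert responses (hypothetical file)
responses = pd.read_csv("relevance_survey_responses.csv")

chi_square, p_value = calculate_bartlett_sphericity(responses)  # want p < .05
kmo_per_item, kmo_overall = calculate_kmo(responses)            # want overall KMO > ~.6

fa = FactorAnalyzer(rotation=None)  # extract factors from the full question set first
fa.fit(responses)
eigenvalues, _ = fa.get_eigenvalues()
print(p_value, kmo_overall, eigenvalues)
```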
Exploratory Factor Analysis (EFA) – 2 of 2
• Perform Parallel Analysis
– Create random data w/ same # of respondents & questions
– Create correlation matrix and find eigenvalues
• Create Scree Plot of Eigenvalues
• Re-run EFA for reduced factors
• Compute Pearson correlations
• Discard questions with:
– Weak factor loading
– Strong cross-factor loading
– Lack of logical interpretation
• Kenny’s Rule: need >= 2 questions per factor for EFA
Matt Lease <ml@utexas.edu> 34/20
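A numpy-only sketch of the parallel analysis step above: eigenvalues of the observed correlation matrix are compared against eigenvalues of random data with the same number of respondents and questions. Using the mean of the random eigenvalues as the cutoff is one common convention; the 95th percentile is another.

```python
import numpy as np

def parallel_analysis(data, n_sims=100, seed=0):
    """Keep factors whose observed eigenvalues exceed those of random data
    of the same shape (respondents x questions)."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    observed = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]
    random_eigs = np.empty((n_sims, p))
    for s in range(n_sims):
        noise = rng.standard_normal((n, p))
        random_eigs[s] = np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False))[::-1]
    cutoff = random_eigs.mean(axis=0)
    return int(np.sum(observed > cutoff)), observed, cutoff
```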
Question-Factor Loadings (Weights)
Matt Lease <ml@utexas.edu> 35/20
CFA: Assess and Compare Models
• First-order baseline model uses a single latent factor to explain observed data
• Posited hierarchical factor model uses 5 relevance dimensions
Matt Lease <ml@utexas.edu> 36/20
Confirmatory Factor Analysis (CFA)
• Null model assumes observations are independent
– Covariance between questions fixed at 0; means & variances left free
• Comparison stats
– Non-Normed Fit Index (NNFI)
– Comparative Fit Index (CFI)
– Root Mean Square Error of Approximation (RMSEA)
– Standardized Root Mean Square Residual (SRMR)
Matt Lease <ml@utexas.edu> 37/20
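For reference, two of the fit statistics above have simple closed forms; these are the standard textbook definitions (not reproduced from the paper), with χ²_M, df_M for the fitted model, χ²_0, df_0 for the null model, and N the sample size.

```latex
\[
  \mathrm{RMSEA} = \sqrt{\frac{\max(\chi^2_M - df_M,\ 0)}{df_M\,(N-1)}},
  \qquad
  \mathrm{CFI} = 1 - \frac{\max(\chi^2_M - df_M,\ 0)}{\max(\chi^2_0 - df_0,\ \chi^2_M - df_M,\ 0)}
\]
```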
Contributions
• Simple, reliable, scalable way to collect diverse (subjective),
multi-dimensional judgments from online participants
– Online survey techniques from psychometrics
– Doesn’t require objective task, gold labels, or N+ judges
– Helps distinguish subjectivity vs. error
• Describe a rigorous, positivist, data-driven framework for
inferring & modeling multi-dimensional relevance
– Structural equation modeling (SEM) from psychometrics
– Run the experiment & let the data speak for itself
• Implemented in standard R libraries, data available online
Matt Lease <ml@utexas.edu> 38/20
Future Directions
• More data-driven positivist research into factors
– Different user groups, search scenarios, devices, etc.
– Need more data to support normative claims
• Train/test operational systems for varying factors
– Identify/extend detected features for each dimension
– Personalize search results for individual preferences
• Improve agreement by making the task more natural and/or analyze latent factors when disagreement occurs
• Intra-subject vs. inter-subject aggregation?
– Other methods for ensuring subjective data quality?
• SEM vs. graphical models?
39/20
Thank You!
ir.ischool.utexas.edu
40
Slides: www.slideshare.net/mattlease
