The
Uncanny Valley
of ML
Dr June Andrews Delphi Data Nov 2019
Human Decision Systems
Simple Paradigm Represents:
• Judges setting bail
• Doctors processing images
• DMV clerks renewing licenses
• Muni train drivers stopping & going
• Administrators admitting students
• … you coming to MLConf
Information In, Decision Out, Works Pretty Well
Hype: Technology Will Replace People Overnight
???
Far More Likely Progression of Technology
Machine Learning Will Augment Decision Making with Recommendations
Template Design for ML + Decision Systems
Recommendation: Yes
Feature A is high
Keep the Chain of Responsibility Intact
Integrate as soon as ML Accuracy ≈ Human Accuracy
Decreasing
cost of ML
Pressures for Introducing ML into Decision Systems
Increasing data
from revolutions in
sensors, records &
infrastructure
Fewer experts
graduating in
‘older fields’
Increasing number
of decisions
created by more
people
Ideal Result of ML + Human Decisions
As ML Accuracy Approaches Human Accuracy, System Performance Improves
[Figure: Accuracy and Time & Cost (Low to High) vs. ML Accuracy (Terrible, Human, Perfect)]
The Uncanny Valley of ML
As ML Accuracy Approaches Human Accuracy, System Performance Degrades
[Figure: Accuracy and Time & Cost (Low to High) vs. ML Accuracy (Terrible, Human, Perfect)]
Finding
the
Uncanny
Valley
of ML
Finding An Uncanny Valley of ML
Use Test Environments To Avoid The Uncanny Valley in Production
1. Create a simple labeling task with ground truth labels
2. Measure human accuracy & speed
3. Add a recommended decision from a ‘Model’
4. Simulate models of different accuracy near human accuracy by perturbing the ground truth labels
5. Assign each person to a simulated model and run test labels for normalization
6. Measure system accuracy & speed as a function of ML accuracy
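The perturbation step above can be sketched roughly as follows. This is a minimal illustration, not the code used for the talk: the function names, the choice to make a "wrong" recommendation by shifting the true count up or down by one, and the example ground truth are all assumptions.

```python
import random

def simulate_model_labels(truth, target_accuracy, seed=0):
    """Return labels that agree with `truth` on about `target_accuracy` of items.

    Assumption: labels are non-negative integer counts, and a wrong
    recommendation is the true count shifted by +1 or -1.
    """
    rng = random.Random(seed)
    n_wrong = round(len(truth) * (1.0 - target_accuracy))
    wrong_idx = set(rng.sample(range(len(truth)), n_wrong))
    labels = []
    for i, t in enumerate(truth):
        if i in wrong_idx:
            # Shift by one, never going below zero.
            labels.append(t + 1 if t == 0 or rng.random() < 0.5 else t - 1)
        else:
            labels.append(t)
    return labels

def accuracy(pred, truth):
    """Fraction of items where the prediction matches the ground truth."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

# Hypothetical ground truth: mug counts for 100 photos.
truth = [0, 1, 2, 3, 1, 2, 0, 4, 1, 2] * 10

# One simulated 'Model' per accuracy level around human accuracy.
for target in (0.85, 0.90, 0.94, 0.98):
    sim = simulate_model_labels(truth, target, seed=42)
    print(target, accuracy(sim, truth))
```

Each participant would then be shown the recommendations from one simulated model, so that system accuracy and speed can be compared across accuracy levels.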
Easy Street - How many coffee mugs do you see?
Throwback to the first demos of Neural Nets for Computer Vision @ Cornell
100+ photos hand labeled
with the number of coffee
mugs for ground truth
Label quality is then
perturbed to simulate
different ML accuracies, with
a bias to perturbing the
images with many mugs
…used Amazon Mechanical
Turk workers
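The bias toward perturbing images with many mugs could be sketched like this. The talk does not say how the bias was defined, so the weight of (1 + true mug count) is an assumption, as are the function name and the example data. The sampling uses the Efraimidis-Spirakis method for weighted sampling without replacement.

```python
import random

def biased_error_indices(truth, n_wrong, seed=0):
    """Pick `n_wrong` distinct indices, favouring images with more mugs.

    Assumption: the sampling weight is (1 + true mug count); the talk does
    not state how its bias was defined.
    """
    rng = random.Random(seed)
    # Efraimidis-Spirakis weighted sampling without replacement:
    # draw key u ** (1 / w) per item and keep the n_wrong largest keys.
    keys = [(rng.random() ** (1.0 / (1 + t)), i) for i, t in enumerate(truth)]
    keys.sort(reverse=True)
    return sorted(i for _, i in keys[:n_wrong])

# Hypothetical data: half the photos have no mugs, half have 5 mugs.
truth = [0] * 50 + [5] * 50
picked = biased_error_indices(truth, n_wrong=20, seed=7)
many = sum(1 for i in picked if truth[i] == 5)
print(f"{many} of {len(picked)} perturbed images have many mugs")
```

The returned indices are the photos whose recommended label would be made wrong; the rest keep their ground truth label.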
Ideal Results
[Figure: Accuracy and Time & Cost (Low to High) vs. ML Accuracy (Terrible, Human, Perfect); labels: 2 mugs, 1 mug]
Actual System Behavior
[Figure: System Accuracy (Low to High) vs. ML Accuracy (Terrible, Human (94%), Perfect); labels: 2 mugs, 1 mug]
Actual System Behavior
[Figure: Decision Time (Low to High) vs. ML Accuracy (Terrible, Human (94%), Perfect); labels: 2 mugs, 1 mug]
Uncanny Valley for ML in
Counting Coffee Mugs?
Do people want machines to be wrong?
We trust machines more than we trust
ourselves when they are near but not over
human accuracy?
We’re lazy and want to defer decision
making (people varied from the ML when
it was correct)
The Uncanny
Valley of ML
in the
Judicial
System
Decreasing
cost of ML
Pressures for Introducing ML into Court Systems
Increasing data
from records &
social media
Fewer experts
graduating in
‘older fields’
Increasing number of decisions
created by more people
Estonia is building a ‘Robot
Judge’ to settle disputes
under $8,000 - DailyMail
Broader initiative of e-government.
France wants to match Estonia's level by 2022
160,000 parking tickets
overturned in the UK & US
with a chatbot -Guardian
on DoNotPay
Risk Score Print Outs in Cleveland. Includes features
like ‘how often are you bored’ -Quartz
Note - arraignment hearings are often under 5
minutes.
ML has a Growing Presence in Courts
Countries are Comparing Notes & Learning How to Use AI in the Courts
Locally ML is used in the
Judicial System for Bail
In California 49 of 58 counties
use a Pretrial Assessment System
(yes SF is one) [courts.ca.gov]
SB 10 signed in 2018 would
make it mandatory in October of
2019, but a 2020 referendum
contradicting SB 10 has created
a temporary pause
Just a sec, does Bail Matter?
• 20% of jail inmates in US are
awaiting trial
• Misdemeanors can take several
months for trial, felonies can take
years. Average wait time in the
Bronx is 642 days for a non-jury
trial and 827 days for a jury trial.
• Pretrial detention leads to 13%
increase in plea agreements, 42%
increase in length of sentence and
41% increase in court fees
-Stevenson The Journal of Law,
Economics, and Organization
8th Amendment (Bill of Rights)
‘Excessive bail shall not be
required, nor excessive fines
imposed, nor cruel and unusual
punishments inflicted.’
Finding An Uncanny Valley of ML
Unknown System Accuracy, Show Manipulation of a Single Label
1. Take a Real Case
2. Simulate different UIs & different model deliverables
3. Compare label distribution with actual outcome
Finding An Uncanny Valley of ML
Unknown System Accuracy, Instead Show Manipulation of a Single Label
Details Taken from Machine Bias by ProPublica - 2016
Summary - high schooler stole a bike for a few blocks, had a
High Risk COMPAS Score by Equivant. Bail was set at $1000
400+ Survey Participants From Amazon MTurk
[Figure: Proportion of Responses ($0 Bail, $1000 Bail, No Release) by condition: No ML; ML - Low Risk; ML - Medium Risk; ML - High Risk; ML - High Risk Positive Support; ML - High Risk Negative Support]
Power of Suggestion of High Risk ML (no reason) results in +14% in bail denied
Power of Suggestion of Low Risk ML (no reason) results in +14% in $0 bail
High Risk ML with Negative Features results in +40% in denying bail
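A comparison like the ones above can be computed as a shift in response share between the 'No ML' condition and an ML condition. The sketch below assumes the reported figures are percentage-point differences, which the slide does not state. The counts are invented placeholders for illustration only and are not the survey's data.

```python
def proportions(counts):
    """Convert raw response counts into shares of the total."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def shift_in_points(baseline, treatment):
    """Percentage-point change in each response share, treatment minus baseline."""
    b, t = proportions(baseline), proportions(treatment)
    return {k: round(100 * (t[k] - b[k]), 1) for k in b}

# Invented placeholder counts, NOT the talk's survey results.
no_ml = {"$0 Bail": 30, "$1000 Bail": 50, "No Release": 20}
high_risk = {"$0 Bail": 20, "$1000 Bail": 45, "No Release": 35}

print(shift_in_points(no_ml, high_risk))
```

Comparing shares rather than raw counts keeps conditions with different numbers of participants comparable.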
104 Survey Participants From June’s Network
[Figure: Proportion of Responses ($0 Bail, $1000 Bail, No Release) by condition: No ML; ML - Positive Support; ML - Negative Support]
While an overall more forgiving group, still a +40% increase in denying bail
A Higher Level of Machine Learning Knowledge Does NOT Change the Trend
Fundamental Design Flaw in
SB 10 & COMPAS Scores
• Need to allow for the ML system to
return ‘Uncertain, not enough data’
• The Bureau of Justice Statistics has
a warning of ‘Interpret data with
caution. Estimate based on 10 or
fewer sample cases’ for someone
with Brisha’s details
• … also, effectiveness of requiring
ML in the California courts is not
slated to be measured until 2023,
4 years after release
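One way to picture the 'Uncertain, not enough data' option is a recommendation function that abstains when too few comparable cases back the estimate. This is only a sketch: the `recommend` function, the `Recommendation` type, the 0.5 score threshold and the support cutoff of 10 (chosen to echo the Bureau of Justice Statistics warning) are assumptions, not part of SB 10 or COMPAS.

```python
from dataclasses import dataclass

# Assumption: mirrors the "10 or fewer sample cases" warning quoted above.
MIN_SUPPORT = 10

@dataclass
class Recommendation:
    label: str
    support: int  # number of comparable historical cases behind the estimate

def recommend(risk_score, support, threshold=0.5):
    """Return a risk label, or abstain when too few comparable cases exist."""
    if support <= MIN_SUPPORT:
        return Recommendation("Uncertain, not enough data", support)
    label = "High Risk" if risk_score >= threshold else "Low Risk"
    return Recommendation(label, support)

# Hypothetical inputs: a high score backed by only 8 comparable cases.
print(recommend(0.9, support=8).label)
print(recommend(0.9, support=500).label)
```

The point of the design is that a score with little supporting data never reaches the decision maker as a confident label.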
Aside:
The
Uncanny
Valley of
ML
- Why does it exist?
Uncanny Valley of AI
Discovered by Masahiro Mori (1970)
Box office success of movies is
potentially related to the Uncanny
Valley:
• Final Fantasy
• Polar Express
• Beowulf
• The Incredible Hulk
Uncanny Valley of AI
Why it exists is an open field of research:
• Mismatch between expectations
and observations [Tinwell]
• Difficult to classify objects that
move between the boundaries of
categories [Looser & Wheatley]
• Recognizing a similar cognitive
[Gray & Wegner]
• Ambiguity about the presence of
threat [McAndrew]
When it exists is also a debate.
2013 Activision Animation
Uncanny Valley of ML
Additional Theories to Consider:
• People blame biased decisions on
the machine
• People want machines to be wrong
• Disagreeing is a different skill set
than analyzing
• Providing explanations of reasoning
suppresses intuitive decisions
When it exists should also be studied
further.
Thought Experiment
Yes / No
Imagine if the ML system was
another person, who wasn’t
quite as bright as the first
person
It would take longer, the bright
person would question
themselves more
Small Group Communication: A
Theoretical Approach has
additional details of when
groups underperform
individuals
Lean on the field of Team Research to Bootstrap Expectations on Integration
Crossing
the
Chasm
- Avoiding the Uncanny Valley
Self-Driving Cars - Headed for Uncanny Valley
People viewed as backups who would stay behind the wheel and intervene to
avoid accidents in unpredictable or computer-confusing instances. Self-driving
option should be included as soon as possible for competitive advantage.
[Figure labels: Left, No-op, Right]
Decreasing cost of ML
Increasing data
from revolutions in
sensors & records
Aging population
Increasing number of
drivers, commutes
increasing
Best Practice: Bet on the Power of ML
Volvo changed from targeting Level 2 to
Levels {4, 5} after including executives in a
simulation of driving a Level 2 car [Wired]
Delaying Release Until Performance Crosses the Valley
Best Practice: Build Simulators
Nuclear Power Plants, Aviation,
Moon Landings … all use
simulators to refine product
designs before launch
**include actual judges/experts
in simulations
Identify location and impact of the Valley before building
Best Practice: Avoid by Redefining Success
Repurpose - ML
designed for 1 system,
may work well for
another
Reset expectations
Relabel Bad Labels
The Uncanny Valley
Doesn’t Always Exist
First UI, people completely ignored the ML suggestion
You could design a system no one uses to avoid the Valley
[Figure: System Accuracy vs. ML Accuracy (Terrible, Human, Perfect); labels: 2 mugs, 1 mug]
Call to Action:
Source a New Field of Research ‘ML Integration’
• HCI, Team Research, Data
Science, AI, Psychology, User
Research & Application Fields are
all trying to understand
integrating ML into Human
Decision Systems …
independently and slowly
• Binding efforts into a single
discipline will rapidly increase
development and possibly meet
demand
* I am not qualified to give this talk … but who is?
Bootstrapping ML Integration
Funding Sources {Military, Accenture,
Academia, …?}
Initial Areas of Research:
• How to calculate the speed and
accuracy of large distributed
human + ML decision systems
• How to safely train and roll out
new decision processes to experts
• How to fairly explain an ML
decision. Beyond explainable AI.
• Design and run experiments in
these systems
Machine Learning
DS
AI
Politics
Team Research
Effective
ML Integration
The Hard Part - Does The Uncanny Valley matter?
[Figure: Accuracy and Time & Cost (Low to High) vs. ML Accuracy (Terrible, Human, Perfect)]
Delivering Results in
the Uncanny Valley
Can Lead to Early
Project Termination
• These are critical systems
where a 5% drop in accuracy
undoes years of research and
investments. Mistakes are not
treated kindly in these fields.
• Funding is much more tightly
controlled and hard to obtain
after failed launches.
• Legal modifications may
become barriers
Call to Action: Implement Guardrail Metrics for
Vulnerable Members of the Population
• Guardrail Metrics are
used to allow ML to
optimize as much as it
can within a specified
business boundary
• Let’s define a boundary
in tech to not make
systems worse for black
girls
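A guardrail of this kind could be sketched as a release check: a candidate model is accepted only if it improves the overall metric and does not make the metric worse for any protected group. The function names, the dictionary layout, the group names and the numbers are all hypothetical.

```python
def passes_guardrail(baseline_by_group, candidate_by_group, protected_groups,
                     tolerance=0.0):
    """True if the candidate is no worse than baseline for every protected group."""
    return all(
        candidate_by_group[g] >= baseline_by_group[g] - tolerance
        for g in protected_groups
    )

def choose_model(baseline, candidate, protected_groups):
    """Pick the candidate only if it wins overall AND stays inside the guardrail."""
    better_overall = candidate["overall"] > baseline["overall"]
    inside_boundary = passes_guardrail(
        baseline["by_group"], candidate["by_group"], protected_groups
    )
    return "candidate" if better_overall and inside_boundary else "baseline"

# Hypothetical accuracies, for illustration only.
baseline = {"overall": 0.90, "by_group": {"group_a": 0.91, "group_b": 0.88}}
candidate = {"overall": 0.93, "by_group": {"group_a": 0.95, "group_b": 0.85}}

# The candidate is better overall but worse for group_b, so it is rejected.
print(choose_model(baseline, candidate, protected_groups=["group_b"]))
```

Optimization is free to improve anything it can, but the boundary is checked separately and cannot be traded away for overall gains.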
Thank You.
Slides at /drandrews
