This book is most appropriate for project managers, data scientists, and other practitioners in high-
stakes domains who care about the broader impact of their work, have the patience to think about
what they’re doing before they jump in, and do not shy away from a little math.
In writing the book, I have taken advantage of the dual nature of my job as an applied data scientist part of the time and a machine learning researcher the other part of the time. Each chapter focuses on a different use case that technologists tend to face when developing algorithms for financial services, health care, workforce management, social change, and other areas. These use cases are fictionalized versions of real engagements I’ve worked on. The contents bring in the latest research from trustworthy machine learning, including some that I’ve personally conducted as a machine learning researcher.
—Kush R. Varshney
Trustworthy
Machine Learning
Kush R. Varshney
ISBN 979-8-41-190395-9
Cover image by W. T. Lee, United States Geological Survey, circa 1925. The photograph shows a person
standing on top of a horse, which is standing on a precarious section of a desolate rock formation. The
horse must really be worthy of the person’s trust.
Contents
Preface
Part 2 Data
4 Data Modalities, Sources, and Biases
5 Privacy and Consent
Part 4 Reliability
9 Distribution Shift
10 Fairness
11 Adversarial Robustness
Part 5 Interaction
12 Interpretability and Explainability
13 Transparency
14 Value Alignment
Part 6 Purpose
15 Ethics Principles
16 Lived Experience
17 Social Good
18 Filter Bubbles and Disinformation
Preface
Decision making in high-stakes applications, such as educational assessment, credit, employment,
health care, and criminal justice, is increasingly data-driven and supported by machine learning
models. Machine learning models are also enabling critical cyber-physical systems such as self-
driving automobiles and robotic surgery. Recommendations of content and contacts on social media
platforms are determined by machine learning systems.
Advancements in the field of machine learning over the last several years have been nothing short
of amazing. Nonetheless, even as these technologies become increasingly integrated into our lives,
journalists, activists, and academics uncover characteristics that erode the trustworthiness of these
systems. For example, a machine learning model that supports judges in pretrial detention decisions
was reported to be biased against black defendants. Similarly, a model supporting resume screening
for employment at a large technology company was reported to be biased against women. Machine
learning models for computer-aided diagnosis of disease from chest x-rays were shown to give
importance to markers contained in the image, rather than details of the patients’ anatomy. Self-
driving car fatalities have occurred in unusual confluences of conditions that the underlying machine
learning algorithms had not been trained on. Social media platforms have knowingly and
surreptitiously promoted harmful content. In short, while each day brings a new story of a machine
learning algorithm achieving superhuman performance on some task, these marvelous results are
only in the average case. The reliability, safety, security, and transparency required for us to trust these
algorithms in all cases remain elusive. As a result, there is growing popular will for more fairness,
robustness, interpretability, and transparency in these systems.
They say “history doesn't repeat itself, but it often rhymes.” We have seen the current state of
affairs many times before with technologies that were new to their age. The 2016 book Weapons of Math
Destruction by Cathy O’Neil catalogs numerous examples of machine learning algorithms gone amok.
In the conclusion, O’Neil places her work in the tradition of Progressive Era muckrakers Upton Sinclair
and Ida Tarbell. Sinclair's classic 1906 book The Jungle tackled the processed food industry. It helped
spur the passage of the Federal Meat Inspection Act and the Pure Food and Drug Act, which together
required that all foods be cleanly prepared and free from adulteration.
In the 1870s, Henry J. Heinz founded what is today one of the largest food companies in the world. At a time
when food companies were adulterating their products with wood fibers and other fillers, Heinz
started selling horseradish, relishes, and sauces made of natural and organic ingredients. Heinz
offered these products in transparent glass containers when others were using dark containers. His
company innovated processes for sanitary food preparation, and was the first to offer factory tours that
were open to the public. The H. J. Heinz Company lobbied for the passage of the aforementioned Pure
Food and Drug Act, which became the precursor to regulations for food labels and tamper-resistant
packaging. These practices increased trust and adoption of the products. They provided Heinz a
competitive advantage, but also advanced industry standards and benefited society.
And now to the rhyme. What is the current state of machine learning and how do we make it more
trustworthy? What are the analogs to natural ingredients, sanitary preparation, and tamper-resistant
packages? What are machine learning's transparent containers, factory tours, and food labels? What is
the role of machine learning in benefiting society?
The aim of this book is to answer these questions and present a unified perspective on trustworthy
machine learning. There are several excellent books on machine learning in general from various
perspectives. Excellent texts are also starting to appear on individual topics of trustworthy machine learning, such as fairness1 and explainability.2 However, to the best of my knowledge, there is no single
self-contained resource that defines trustworthy machine learning and takes the reader on a tour of all
the different aspects it entails.
I have tried to write the book I would like to read if I were an advanced technologist working in a
high-stakes domain who does not shy away from some applied mathematics. The goal is to impart a
way of thinking about putting together machine learning systems that regards safety, openness, and
inclusion as first-class concerns. We will develop a conceptual foundation that will give you the
confidence and a starting point to dive deeper into the topics that are covered.
“Many people see computer scientists as builders, as engineers, but I think there’s a
deeper intellectual perspective that many CS people share, which sees computation
as a metaphor for how we think about the world.”
We will neither go into extreme depth on any one topic nor work through software code examples, but
will lay the groundwork for how to approach real-world development. To this end, each chapter
contains a realistic, but fictionalized, scenario drawn from my experience that you might have already
faced or will face in the future. The book contains a mix of narrative and mathematics to elucidate the
increasingly sociotechnical nature of machine learning and its interactions with society. The contents
rely on some prior knowledge of mathematics at an undergraduate level as well as statistics at an
introductory level.3
“If you want to make a difference, you have to learn how to operate within imperfect
systems. Burning things down rarely works. It may allow for personal gains. But if
you care about making the system work for many, you have to do it from the
inside.”
The topic of the book is intimately tied to social justice and activism, but I will primarily adopt the
Henry Heinz (developer) standpoint rather than the Upton Sinclair (activist) standpoint. This choice is
not meant to disregard or diminish the essential activist perspective, but represents my perhaps naïve technological solutionist ethos and optimism for improving things from the inside. Moreover, most of the theory and methods described herein are only small pieces of the overall puzzle for making machine learning worthy of society’s trust; there are procedural, systemic, and political interventions in the sociotechnical milieu that may be much more powerful.
1. Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning: Limitations and Opportunities. URL: https://fairmlbook.org, 2020.
2. Christoph Molnar. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. URL: https://christophm.github.io/interpretable-ml-book, 2019.
3. A good reference for mathematical background is: Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. Mathematics for Machine Learning. Cambridge, England, UK: Cambridge University Press, 2020.
This book stems from my decade-long professional career as a researcher working on high-stakes
applications of machine learning in human resources, health care, and sustainable development as
well as technical contributions to fairness, explainability, and safety in machine learning and decision
theory. It draws on ideas from a large number of people I have interacted with over many years, filtered
through my biases. I take responsibility for all errors, omissions, and misrepresentations. I hope you
find it useful in your work and life.
1
Establishing Trust
Artificial intelligence is the study of machines that exhibit traits associated with a human mind such as
perception, learning, reasoning, planning, and problem solving. Although it had a prior history under
different names (e.g. cybernetics and automata studies), we may consider the genesis of the field of
artificial intelligence to be the Dartmouth Summer Research Project on Artificial Intelligence in the
summer of 1956. Soon thereafter, the field split into two camps: one focused on symbolic systems,
problem solving, psychology, performance, and serial architectures, and the other focused on
continuous systems, pattern recognition, neuroscience, learning, and parallel architectures.1 This
book is primarily focused on the second of the two partitions of artificial intelligence, namely machine
learning.
The term machine learning was popularized in Arthur Samuel’s description of his computer system
that could play checkers,2 not because it was explicitly programmed to do so, but because it learned
from the experiences of previous games. In general, machine learning is the study of algorithms that
take data and information from observations and interactions as input and generalize from specific
inputs to exhibit traits of human thought. Generalization is a process by which specific examples are
abstracted to more encompassing concepts or decision rules.
One can subdivide machine learning into three main categories: (1) supervised learning, (2)
unsupervised learning, and (3) reinforcement learning. In supervised learning, the input data includes
observations and labels; the labels represent some sort of true outcome or common human practice in
reacting to the observation. In unsupervised learning, the input data includes only observations. In
reinforcement learning, the inputs are interactions with the real world and rewards accrued through
those actions rather than a fixed dataset.
1. Allen Newell. “Intellectual Issues in the History of Artificial Intelligence.” In: The Study of Information: Interdisciplinary Messages. Ed. by Fritz Machlup and Una Mansfield. New York, New York, USA: John Wiley & Sons, 1983, pp. 187–294.
2. A. L. Samuel. “Some Studies in Machine Learning Using the Game of Checkers.” In: IBM Journal of Research and Development 3.3 (Jul. 1959), pp. 210–229.
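Although this book does not work through code in depth, a minimal sketch may help fix the distinction among these learning categories. The snippet below is illustrative only: it assumes scikit-learn and synthetic data, and it omits reinforcement learning because that setting requires an interactive environment rather than a fixed dataset.

# Illustrative sketch (not from the book): supervised vs. unsupervised learning.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                # observations (features)
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # labels: a stand-in for the true outcome

# Supervised learning: generalize from (observation, label) pairs.
clf = LogisticRegression().fit(X, y)
print("predicted labels:", clf.predict(X[:5]))

# Unsupervised learning: only observations are given; find structure such as clusters.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster assignments:", clusters[:5])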
The applications of machine learning may be divided into three broad categories: (1) cyber-
physical systems, (2) decision sciences, and (3) data products. Cyber-physical systems are engineered
systems that integrate computational algorithms and physical components, e.g. surgical robots, self-
driving cars, and the smart grid. Decision sciences applications use machine learning to aid people in
making important decisions and informing strategy, e.g. pretrial detention, medical treatment, and
loan approval. Data products applications are the use of machine learning to automate informational
products, e.g. web advertising placement and media recommendation. These settings vary widely in
terms of their interaction with people, the scale of data, the time scale of operation and consequence,
and the severity of consequences. Trustworthy machine learning is important in all three application categories, but its importance is typically more pronounced in the first two: cyber-physical systems and decision sciences. In data products applications, trustworthy machine learning contributes to a functioning, non-violent society.
Just a few years ago, the example applications in all of the categories would have been unheard of.
In recent years, however, machine learning has achieved superlative performance on several
narrowly-defined tasks across domains (often surpassing the abilities of human experts on those same
tasks) and invaded the popular imagination due to the confluence of three factors: data, algorithms,
and computation. The amount of data that is captured digitally and thus available to machine learning
algorithms has increased exponentially. Algorithms such as deep neural networks have been
developed to generalize well from that data. Computational paradigms such as graphical processing
units and cloud computing have allowed machine learning algorithms to tractably learn from very
large datasets.
The end result is that machine learning has become a general purpose technology that can be used
in many different application domains for many different uses. Like other general purpose
technologies before it,3 such as the domestication of plants, the wheel, and electricity, machine
learning is starting to remake all parts of society. In some parts of the world, machine learning already
has an incipient role in every part of our lives, including health and wellness, law and order,
commerce, entertainment, finance, human capital management, communication, transportation, and
philanthropy.
Despite artificial intelligence’s promise to reshape different sectors, there has not yet been wide
adoption of the technology except in certain pockets such as electronic commerce and media. Like
other general purpose technologies, there are many short-term costs to the changes required in
infrastructure, organizations, and human capital.4 In particular, it is difficult for many businesses to
collect and curate data from disparate sources. Importantly, corporations do not trust artificial
intelligence and machine learning in critical enterprise workflows because of a lack of transparency
into the inner workings and a potential lack of reliability. For example, a recent study of business
decision makers found that only 21% of them have a high level of trust in different types of analytics;5 the number is likely smaller for machine learning, which is a part of analytics in business parlance.
3. List of general purpose technologies: domestication of plants, domestication of animals, smelting of ore, wheel, writing, bronze, iron, waterwheel, three-masted sailing ship, printing, steam engine, factory system, railway, iron steamship, internal combustion engine, electricity, motor vehicle, airplane, mass production, computer, lean production, internet, biotechnology, nanotechnology. Richard G. Lipsey, Kenneth I. Carlaw, and Clifford T. Bekar. Economic Transformations. Oxford, England, UK: Oxford University Press, 2005.
4. Brian Bergstein. “This Is Why AI Has Yet to Reshape Most Businesses.” In: MIT Technology Review (Feb. 2019). URL: https://www.technologyreview.com/s/612897/this-is-why-ai-has-yet-to-reshape-most-businesses.
This book is being written at a juncture in time when there is a lot of enthusiasm for machine
learning. It is also a time when many societies are reckoning with social justice. Many claim that it is
the beginning of the age of artificial intelligence, but others are afraid of impending calamity. The
technology is poised to graduate from the experimental sandboxes of academic and industrial
laboratories to truly widespread adoption across domains, but only if the gap in trust can be overcome.
I refrain from attempting to capture the zeitgeist of the age and instead provide a concise and self-
contained treatment of the technical aspects of machine learning. The goal is not to mesmerize you,
but to get you to think things through.6 There is a particular focus on mechanisms for increasing the
trustworthiness of machine learning systems. As you’ll discover throughout the journey, there is no
single best approach toward trustworthy machine learning applicable across all applications and
domains. Thus, the text focuses on helping you develop the thought process for weighing the different
considerations rather than giving you a clear-cut prescription or recipe to follow. Toward this end, I
provide an operational definition of trust in the next section and use it as a guide on our conceptual
development of trustworthy machine learning. I tend to present evergreen concepts rather than
specific tools and tricks that may soon become dated.
“What is trust? I could give you a dictionary definition, but you know it when you
feel it. Trust happens when leaders are transparent, candid, and keep their word.
It’s that simple.”
It is not enough to simply be satisfied with ‘you know it when you feel it.’ The concept of trust is defined
and studied in many different fields including philosophy, psychology, sociology, economics, and
organizational management. Trust is the relationship between a trustor and a trustee: the trustor trusts
the trustee. A definition of trust from organizational management is particularly appealing and
relevant for defining trust in machine learning because machine learning systems in high-stakes applications are typically used within organizational settings. Trust is the willingness of a party to be vulnerable to the actions of another party based on the expectation that the other will perform a particular action important to the trustor, irrespective of the ability to monitor or control that other party.7 This definition can be put into practice as a foundation for desiderata of machine learning systems.
5. Maria Korolov. “Explainable AI: Bringing Trust to Business AI Adoption.” In: CIO (Sep. 2019). URL: https://www.cio.com/article/3440071/explainable-ai-bringing-trust-to-business-ai-adoption.html.
6. The curious reader should research the etymology of the word ‘mesmerize.’
“The toughest thing about the power of trust is that it’s very difficult to build and
very easy to destroy.”
Moreover, the trustor’s expectation of the trustee can evolve over time, even if the trustworthiness
of the trustee remains constant. A typical dynamic of increasing trust over time begins with the trustor’s expectation of performance being based on (1) the predictability of individual acts, moves on to (2) expectation based on dependability captured in summary statistics, and finally culminates in (3) the trustor’s expectation of performance based on faith that dependability will continue in the future.8
Predictability could arise from some sort of understanding of the trustee by the trustor (for example
their motivations or their decision-making procedure) or by low variance in the trustee’s behavior. The
expectation referred to in dependability is the usual notion of expectation in probability and statistics.
In much of the literature on the topic, both the trustor and the trustee are people. For our purposes,
however, an end-user or other person is the trustor and the machine learning system is the trustee.
Although the specifics may differ, there are not many differences between a trustworthy person and a
trustworthy machine learning system. However, the final trust of the trustor, subject to cognitive
biases, may be quite different for a human trustee and machine trustee depending on the task.9
7. Roger C. Mayer, James H. Davis, and F. David Schoorman. “An Integrative Model of Organizational Trust.” In: Academy of Management Review 20.3 (Jul. 1995), pp. 709–734.
8. John K. Rempel, John G. Holmes, and Mark P. Zanna. “Trust in Close Relationships.” In: Journal of Personality and Social Psychology 49.1 (Jul. 1985), pp. 95–112.
9. Min Kyung Lee. “Understanding Perception of Algorithmic Decisions: Fairness, Trust, and Emotion in Response to Algorithmic Management.” In: Big Data & Society 5.1 (Jan.–Jun. 2018).
A trustworthy person has many attributes: openness, promise fulfilment, and receptivity, to name a few.10 Similarly, you can list several attributes
of a trustworthy information system, such as: correctness, privacy, reliability, safety, security, and
survivability.11 The 2019 International Conference on Machine Learning (ICML) listed the following
topics under trustworthy machine learning: adversarial examples, causality, fairness, interpretability,
privacy-preserving statistics and machine learning, and robust statistics and machine learning. The
European Commission’s High Level Expert Group on Artificial Intelligence listed the following
attributes: lawful, ethical, and robust (both technically and socially).
Such long and disparate lists give us some sense of what people deem to be trustworthy
characteristics, but are difficult to use as anything but a rough guide. However, we can distill these
attributes into a set of separable sub-domains that provide an organizing framework for
trustworthiness. Several pieces of work converge onto a nearly identical set of four such separable
attributes; a selected listing is provided in Table 1.1. The first three rows of Table 1.1 are attributes of
trustworthy people. The last two rows are attributes of trustworthy artificial intelligence. Importantly,
through separability, it is implied that each of the qualities is conceptually different and we can
examine each of them in isolation from the others.
10. Graham Dietz and Deanne N. Den Hartog. “Measuring Trust Inside Organisations.” In: Personnel Review 35.5 (Sep. 2006), pp. 557–588.
11. Fred B. Schneider, ed. Trust in Cyberspace. Washington, DC, USA: National Academy Press, 1999.
12. Aneil K. Mishra. “Organizational Responses to Crisis: The Centrality of Trust.” In: Trust in Organizations. Ed. by Roderick M. Kramer and Thomas Tyler. Newbury Park, California, USA: Sage, 1996, pp. 261–287.
13. David H. Maister, Charles H. Green, and Robert M. Galford. The Trusted Advisor. New York, New York, USA: Touchstone, 2000.
14. Sandra J. Sucher and Shalene Gupta. “The Trust Crisis.” In: Harvard Business Review (Jul. 2019). URL: https://hbr.org/cover-story/2019/07/the-trust-crisis.
15. Ehsan Toreini, Mhairi Aitken, Kovila Coopamootoo, Karen Elliott, Carlos Gonzalez Zelaya, and Aad van Moorsel. “The Relationship Between Trust in AI and Trustworthy Machine Learning Technologies.” In: Proceedings of the ACM Conference on Fairness, Accountability, and Transparency. Barcelona, Spain, Jan. 2020, pp. 272–283.
16. Maryam Ashoori and Justin D. Weisz. “In AI We Trust? Factors That Influence Trustworthiness of AI-Infused Decision-Making Processes.” arXiv:1912.02675, 2019.
1. basic performance,
2. reliability,
3. human interaction, and
4. aligned purpose.
We keep the focus on making machine learning systems worthy of trust rather than touching on other
(possibly duplicitous) ways of making them trusted.
17. Kiri L. Wagstaff. “Machine Learning that Matters.” In: Proceedings of the International Conference on Machine Learning. Edinburgh, Scotland, UK, Jun.–Jul. 2012, pp. 521–528.
The process of creating trustworthy machine learning systems, given the high consequence of
considerations like safety and reliability, should also be done in a thoughtful manner without
overzealous haste. Taking shortcuts can come back and bite you.
Highlighted in Figure 1.1, the remainder of Part 1 discusses the book’s limitations and works
through a couple of preliminary topics that are important for understanding the concepts of
trustworthy machine learning: the personas and lifecycle of developing machine learning systems in
practice, and quantifying the concept of safety in terms of uncertainty.
Figure 1.1. Organization of the book. This first part focuses on introducing the topic of trustworthy machine
learning and covers a few preliminary topics. Accessible caption. A flow diagram from left to right with six
boxes: part 1: introduction and preliminaries; part 2: data; part 3: basic modeling; part 4: reliability;
part 5: interaction; part 6: purpose. Part 1 is highlighted. Parts 3–4 are labeled as attributes of safety.
Parts 3–6 are labeled as attributes of trustworthiness.
Part 2 is a discussion of data, the prerequisite for doing machine learning. In addition to providing
a short overview of different data modalities and sources, the part touches on three topics relevant for
trustworthy machine learning: biases, consent, and privacy.
Part 3 relates to the first attribute of trustworthy machine learning: basic performance. It describes
optimal detection theory and different formulations of supervised machine learning. It teaches several
different learning algorithms such as discriminant analysis, naïve Bayes, k-nearest neighbor, decision
trees and forests, logistic regression, support vector machines, and neural networks. The part concludes with methods for causal discovery and causal inference.
18. Kahneman and Tversky described two ways in which the brain forms thoughts, which they call ‘System 1’ and ‘System 2.’ System 1 is fast, automatic, emotional, stereotypic, and unconscious. System 2 is slow, effortful, logical, calculating, and conscious. Please engage the ‘System 2’ parts of your thought processes and be deliberate when you develop trustworthy machine learning systems.
Part 4 is about the second attribute of trustworthy machine learning: reliability. This attribute is
discussed through three specific topics: distribution shift, fairness, and adversarial robustness. The
descriptions of these topics not only define the problems, but also provide solutions for detecting and
mitigating the problems.
Part 5 is about the third attribute: human interaction with machine learning systems in both
directions—understanding the system and giving it instruction. The part begins with interpretability
and explainability of models. It moves on to methods for testing and documenting aspects of machine
learning algorithms that can then be transparently reported, e.g. through factsheets. The final topic of
this part is on the machine eliciting the policies and values of people and society to govern its behavior.
Part 6 discusses the fourth attribute: what those values of people and society may be. It begins by
covering the ethics principles assembled by different parties as their paradigms for machine learning.
Next, it discusses how the inclusion of creators of machine learning systems with diverse lived
experiences broadens the values, goals, and applications of machine learning, leading in some cases to
the pursuit of social good through the technology. Finally, it shows how the prevailing paradigm of
machine learning in information recommendation platforms leads to filter bubbles and
disinformation, and suggests alternatives. The final chapter about platforms is framed in terms of
trustworthy institutions, which have different attributes than individual trustworthy people or
individual trustworthy machine learning systems.
1.3 Limitations
Machine learning is an increasingly vast topic of study that requires several volumes to properly
describe. The elements of trust in machine learning are also now becoming quite vast. In order to keep
this book manageable for both me (the author) and you (the reader) it is limited in its depth and
coverage of topics. Parts of the book are applicable both to simpler data analysis paradigms that do not
involve machine learning and to explicitly programmed computer-based decision support systems,
but for the sake of clarity and focus, they are not called out separately.
Significantly, despite trustworthy machine learning being a topic at the intersection of technology
and society, the focus is heavily skewed toward technical definitions and methods. I recognize that
philosophical, legal, political, sociological, psychological, and economic perspectives may be even
more important to understanding, analyzing, and affecting machine learning’s role in society than the
technical perspective. Nevertheless, these topics are outside the scope of the book. Insights from the
field of human-computer interaction are also extremely relevant to trustworthy machine learning; I
discuss these to a limited extent at various points in the book, particularly Part 5.
Within machine learning, I focus on supervised learning at the expense of unsupervised and
reinforcement learning. I do, however, cover graphical representations of probability and causality as
well as their inference. Within supervised learning, the primary focus is on classification problems in
which the labels are categorical. Regression, ordinal regression, ranking, anomaly detection,
recommendation, survival analysis, and other problems without categorical labels are not the focus.
The depth in describing various classification algorithms is limited and focused on high-level concepts
rather than more detailed accounts or engineering tricks for using the algorithms.
Several different forms and modalities of data are briefly described in Part 2, such as time series,
event streams, graphs, and parsed natural language. However, the primary focus of subsequent
chapters is on forms of data represented as feature vectors.19 Structured, tabular data as well as
images are naturally represented as feature vectors. Natural language text is also often represented by
a feature vector for further analysis.
An important ongoing direction of machine learning research is transfer learning, a paradigm in
which previously learned models are repurposed for new uses and contexts after some amount of fine-
tuning with data from the new context. A related concept for causal models is statistical
transportability. Nonetheless, this topic is beyond the scope of the book except in passing in a couple
of places. Similarly, the concepts of multi-view machine learning and causal data fusion, which involve
the modeling of disparate sets of features, are not broached. In addition, the paradigm of active
learning, in which the labeling of data is done sequentially rather than in batch before modeling, is not
discussed in the book.
As a final set of technical limitations, the depth of the mathematics is limited. For example, I do not
present the concepts of probability at a depth requiring measure theory. Moreover, I stop at the posing
of optimization problems and do not go into specific algorithms for conducting the optimization.20
Discussions of statistical learning theory, such as generalization bounds, are also limited.
“Science currently is taught as some objective view from nowhere (a term I learned
about from reading feminist studies works), from no one’s point of view.”
I encourage you, the reader, to create your own positionality statement as you embark on your journey
to create trustworthy machine learning systems.
19. A feature is an individual measurable attribute of an observed phenomenon. Vectors are mathematical objects that can be added together and multiplied by numbers.
20. Mathematical optimization is the selection of a best element from some set of alternatives based on a desired criterion.
statistics, operations research, signal processing, information theory, and information systems
venues, as well as the industry-oriented venues I mentioned earlier.
More importantly for trustworthy machine learning, I would like to mention my privileges and
personal biases. I was born and raised in the 1980s and 1990s in predominantly white upper middle-
class suburbs of Syracuse, a medium-sized city in upstate New York located on the traditional lands of
the Onöñda’gaga’ people. The city is one of the most racially segregated in the United States. Other places I
have lived for periods of three months or longer are Ithaca, Elmsford, Ossining, and Chappaqua in New
York; Burlington and Cambridge in Massachusetts; Livermore, California; Ludhiana, New Delhi, and
Aligarh in northern India; Manila, Philippines; Paris, France; and Nairobi, Kenya. I am a cis male,
second-generation American of South Asian descent. To a large extent, I am an adherent of dharmic
religious practices and philosophies. One of my great-great-grandfathers was the first Indian to study
at MIT in 1905. My father and his parents lived hand-to-mouth at times, albeit with access to the social
capital of their forward caste group. My twin brother, father, and both grandfathers are or were
professors of electrical engineering. My mother was a public school teacher. I studied in privileged
public schools for my primary and secondary education and an Ivy League university for my
undergraduate education. My employer, IBM, is a powerful and influential corporation. As such, I have
been highly privileged in understanding paths to academic and professional success and having an
enabling social network. Throughout my life, however, I have been a member of a minority group with
limited political power. I have had some visibility into hardship beyond the superficial level, but none
of this experience has been lived experience, where I would not have a chance to leave if I wanted to.
1.4.3 Interaction
I wrote the book with some amount of transparency. While I was writing the first couple of chapters in
early 2020, anyone could view them through Overleaf (https://v2.overleaf.com/read/bzbzymggkbzd).
After I signed a book contract with Manning Publications, chapters were posted to the Manning Early
Access Program as I wrote them, with readers having an opportunity to engage via the Manning
liveBook Discussion Forum. After the publisher and I parted ways in September 2021, I posted
chapters of the in-progress manuscript to http://www.trustworthymachinelearning.com. I received
several useful comments from various individuals throughout the drafting process via email
([email protected]), Twitter direct message (@krvarshney), telephone (+1-914-945-1628), and
personal meetings. When I completed version 0.9 of the book at the end of December 2021, I posted it
at the same site. On January 28, 2022, I convened a panel of five people with lived experiences
different from mine to provide their perspectives on the content contained in version 0.9 using a
modified Diverse Voices method.21 An electronic version of this edition of the book will continue to be
available at no cost at the same website: http://www.trustworthymachinelearning.com.
21. Lassana Magassa, Meg Young, and Batya Friedman. “Diverse Voices: A How-To Guide for Facilitating Inclusiveness in Tech Policy.” Tech Policy Lab, University of Washington, 2017. The panelists who provided impartial input were Mashael Alzaid, Kenya Andrews, Noah Chasek-Macfoy, Scott Fancher, and Timothy Odonga. As a central part of the Diverse Voices method, they were offered honoraria, which some declined. The funds came from an honorarium I received for participating in an AI Documentation Summit convened by The Data Nutrition Project in January 2022.
1.5 Summary
▪ Machine learning systems are influencing critical decisions that have consequences to our
daily lives, but society lacks trust in them.
▪ Trustworthiness is composed of four attributes: competence, reliability, openness, and
selflessness.
▪ The book is organized to match this decomposition of the four components of trust.
▪ Despite my limitations and the limitations of the contents, the book endeavors to develop a conceptual understanding not only of the principles and theory behind how machine learning systems can achieve these goals and become more trustworthy, but also of the algorithmic and non-algorithmic methods to pursue them in practice.
▪ By the end of the book, your thought process should naturally be predisposed to including elements of trustworthiness throughout the lifecycle of machine learning solutions you develop.
22. List of yamas: ahiṃsā (non-harm), satya (benevolence and truthfulness), asteya (responsibility and non-stealing), brahmacarya (good direction of energy), and aparigraha (simplicity and generosity).
23. List of niyamas: śauca (clarity and purity), santoṣa (contentment), tapas (sacrifice for others), svādhyayā (self-study), and īsvara-praṇidhāna (humility and service to something bigger).
24. Carrie J. Cai and Philip J. Guo. “Software Developers Learning Machine Learning: Motivations, Hurdles, and Desires.” In: Proceedings of the IEEE Symposium on Visual Languages and Human-Centric Computing. Memphis, Tennessee, USA, Oct. 2019, pp. 25–34.
2
Machine Learning Lifecycle
Imagine that you are a project manager on the innovation team of m-Udhār Solar, a (fictional) pay-as-
you-go solar energy provider to poor rural villages that is struggling to handle a growing load of
applications. The company is poised to expand from installing solar panels in a handful of pilot districts
to all the districts in the state, but only if it can make loan decisions for 25 times as many applications
per day with the same number of loan officers. You think machine learning may be able to help.
Is this a problem to address with machine learning? How would you begin the project? What steps
would you follow? What roles would be involved in carrying out the steps? Which stakeholders’ buy-in
would you need to win? And importantly, what would you need to do to ensure that the system is
trustworthy? Making a machine learning system trustworthy should not be an afterthought or add-on,
but should be part of the plan from the beginning.
The end-to-end development process or lifecycle involves several steps:
1. problem specification,
2. data understanding,
3. data preparation,
4. modeling,
5. evaluation, and
6. deployment and monitoring.
Narrow definitions consider only the modeling step to be the realm of machine learning. They consider
the other steps to be part of the broader endeavor of data science and engineering. Most books and
research on machine learning are similarly focused on the modeling stage. However, you cannot really
execute the development and deployment of a trustworthy machine learning system without focusing
on all parts of the lifecycle. There are no shortcuts. This chapter sketches out the master plan.
Figure 2.1. Steps of the machine learning lifecycle codified in CRISP-DM. Different personas participate in differ-
ent phases of the lifecycle. Accessible caption. A series of six steps arranged in a circle: (1) problem speci-
fication; (2) data understanding; (3) data preparation; (4) modeling; (5) evaluation; (6) deployment and
monitoring. There are some backward paths: from data understanding to problem specification; from
modeling to data preparation; from evaluation to problem specification. Five personas are associated
with different steps: problem owner with steps 1–2; data engineer with steps 2–3; data scientist with
steps 1–4; model validator with step 5; ML operations engineer with step 6. A diverse stakeholders per-
sona is on the side overseeing all steps.
Because the modeling stage is often put on a pedestal, there is a temptation to use the analogy of an
onion in working out the project plan: start with the core modeling, work your way out to data
understanding/preparation and evaluation, and then further work your way out to problem specification
and deployment/monitoring. This analogy works well for a telecommunications system for example,1
both pedagogically and in how the technology is developed, but a sequential process is more appropriate
for a trustworthy machine learning system. Always start with understanding the use case and specifying
the problem.
“People are involved in every phase of the AI lifecycle, making decisions about which
problem to address, which data to use, what to optimize for, etc.”
The different steps are carried out by different parties with different personas including problem
owners, data engineers, data scientists, model validators, and machine learning (ML) operations
engineers. Problem owners are primarily involved with problem specification and data understanding.
Data engineers work on data understanding and data preparation. Data scientists tend to play a role in
all of the first four steps. Model validators perform evaluation. ML operations engineers are responsible
for deployment and monitoring.
Additional important personas in the context of trustworthiness are the potential trustors of the
system: human decision makers being supported by the machine learning model (m-Udhār loan
officers), affected parties about whom the decisions are made (rural applicants; they may be members
of marginalized groups), regulators and policymakers, and the general public. Each stakeholder has
different needs, concerns, desires, and values. Systems must meet those needs and align with those
values to be trustworthy. Multi-stakeholder engagement is essential and cannot be divorced from the
technical aspects of design and development. Documenting and transparently reporting the different
steps of the lifecycle help build trust among stakeholders.
1. C. Richard Johnson, Jr. and William A. Sethares. Telecommunication Breakdown: Concepts of Communication Transmitted via Software-Defined Radio. Upper Saddle River, New Jersey, USA: Prentice Hall, 2003.
2. Andrew D. Selbst, danah boyd, Sorelle A. Friedler, Suresh Venkatasubramanian, and Janet Vertesi. “Fairness and Abstraction in Sociotechnical Systems.” In: Proceedings of the ACM Conference on Fairness, Accountability, and Transparency. Barcelona, Spain, Jan. 2020, pp. 59–68.
“We all have a responsibility to ask not just, ‘can we do this?’, but ‘should we do this?’”
Problem identification and understanding is best done as a dialogue between problem owners and
data scientists because problem owners might not have the imagination of what is possible through
machine learning and data scientists do not have a visceral understanding of the pain points that
problem owners are facing. Problem owners should also invite representatives of marginalized groups
for a seat at the problem understanding table to voice their pain points.3 Problem identification is
arguably the most important and most difficult thing to do in the entire lifecycle. An inclusive design
process is imperative. Finding the light at the end of the tunnel is actually not that hard, but finding the
tunnel can be very hard. The best problems to tackle are ones that have a benefit to humanity, like
helping light up the lives and livelihoods of rural villagers.
Once problem owners have identified a problem worth solving, they need to specify metrics of
success. Being a social enterprise, the metric for m-Udhār Solar is number of households served with
acceptable risk of defaulting. In general, these metrics should be in real-world terms relevant to the use
case, such as lives saved, time reduced, or cost avoided.4 The data scientist and problem owner can then
map the real-world problem and metrics to machine learning problems and metrics. This specification
should be as crisp as possible, including both the quantities to be measured and their acceptable values.
The goals need not be specified only as traditional key performance indicators, but can also include
objectives for maintenance of performance across varying conditions, fairness of outcomes across
groups and individuals, resilience to threats, or number of insights provided to users. Defining what is
meant by fairness and specifying a threat model are part of this endeavor. For example, m-Udhār aims
not to discriminate by caste or creed. Again, these real-world goals must be made precise through a
conversation between problem owners, diverse voices, and data scientists. The process of eliciting
objectives is known as value alignment.
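As one hypothetical illustration (not drawn from an actual engagement), such a crisp specification might be recorded as a checkable artifact. The metric names and threshold values below are placeholders for the fictional m-Udhār Solar use case, not recommendations.

# Hypothetical sketch of a problem specification with acceptable values.
problem_spec = {
    "real_world_metric": "households served at acceptable default risk",
    "ml_metrics": {
        "balanced_accuracy": {"minimum": 0.80},            # basic performance
        "accuracy_drop_under_shift": {"maximum": 0.05},    # reliability across districts
        "disparate_impact_ratio": {"range": (0.8, 1.25)},  # fairness across groups
    },
    "protected_attributes": ["caste", "religion"],
    "threat_model": "attackers are assumed to have no access to training data",
}

def meets_spec(measured, spec=problem_spec):
    """Check measured metric values against the specification's acceptable values."""
    for name, req in spec["ml_metrics"].items():
        value = measured[name]
        if "minimum" in req and value < req["minimum"]:
            return False
        if "maximum" in req and value > req["maximum"]:
            return False
        if "range" in req and not (req["range"][0] <= value <= req["range"][1]):
            return False
    return True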
One important consideration in problem scoping is resource availability, both in computing and
human resources. A large national or multinational bank will have many more resources than m-Udhār
Solar. A large technology company will have the most of all. What can reasonably be accomplished is
gated by the skill of the development team, the computational power for training models and evaluating
new samples, and the amount of relevant data.
Machine learning is not a panacea. Even if the problem makes sense, machine learning may not be
the most appropriate solution to achieve the metrics of success. Oftentimes, back-of-the-envelope
calculations can indicate the lack of fit of a machine learning solution before other steps are undertaken.
A common reason for machine learning to not be a viable solution is lack of appropriate data, which
brings us to the next step: data understanding.
3. Meg Young, Lassana Magassa, and Batya Friedman. “Toward Inclusive Tech Policy Design: A Method for Underrepresented Voices to Strengthen Tech Policy Documents.” In: Ethics and Information Technology 21.2 (Jun. 2019), pp. 89–103. The input of diverse stakeholders, especially those from marginalized groups, should be monetarily compensated.
4. Kiri L. Wagstaff. “Machine Learning that Matters.” In: Proceedings of the International Conference on Machine Learning. Edinburgh, Scotland, UK, Jun.–Jul. 2012, pp. 521–528.
▪ grouping or recoding categorical feature values to deal with rarely occurring values or to combine
semantically similar values, and
▪ dropping features that induce leakage or should not be used for legal, ethical, or privacy reasons.
Feature engineering is mathematically transforming features to derive new features, including through
interactions of several raw features. Apart from the initial problem specification, feature engineering is
the point in the lifecycle that requires the most creativity from data scientists. Data cleaning and feature
engineering require the data engineer and data scientist to make many choices that have no right or
wrong answer. Should m-Udhār’s data engineer and data scientist group together all non-zero values of the number of motorcycles owned by the household? How should they encode the profession of the
applicant? The data scientist and data engineer should revisit the project goals and continually consult
with subject matter experts and stakeholders with differing perspectives to help make appropriate
choices. When there is ambiguity, they should work towards safety, reliability, and aligning with elicited
values.
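A minimal pandas sketch of these kinds of choices follows. The column names for the m-Udhār application data are hypothetical, and the particular encodings are only one defensible option among many.

# Illustrative sketch of data cleaning and feature engineering choices.
import pandas as pd

applications = pd.DataFrame({
    "num_motorcycles": [0, 1, 3, 0],
    "profession": ["farmer", "teacher", "shopkeeper", "farmer"],
    "loan_officer_note": ["approve?", "", "", ""],   # free text that may leak the outcome
})

# Recode a rarely varying count into a coarser binary feature.
applications["owns_motorcycle"] = (applications["num_motorcycles"] > 0).astype(int)

# Encode a categorical feature (one-hot encoding is one of several defensible choices).
applications = pd.get_dummies(applications, columns=["profession"])

# Drop features that induce leakage or have been superseded by derived features.
applications = applications.drop(columns=["loan_officer_note", "num_motorcycles"])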
2.5 Modeling
The modeling step receives a clear problem specification (including metrics of success) and a fixed,
clean training dataset. A mental model for trustworthy modeling includes three main parts:
1. pre-processing,
2. model training, and
3. post-processing.
This idea is diagrammed in Figure 2.2. Details of this step will be covered in depth throughout the book,
but an overview is provided here.
Figure 2.2. Main parts of trustworthy machine learning modeling. Distribution shift, unfairness, adversarial at-
tacks, and lack of explainability can be mitigated using the various techniques listed below each part. Details of
these methods are presented in the remainder of the book. Accessible caption. A block diagram with a train-
ing dataset as input to a pre-processing block with a pre-processed dataset as output. The pre-pro-
cessed dataset is input to a model training block with an initial model as output. The initial model is
input to a post-processing block with a final model as output. The following techniques are examples of
pre-processing: domain adaptation (distribution shift); bias mitigation pre-processing (unfairness);
data sanitization (adversarial attacks); disentangled representation (lack of explainability). The follow-
ing techniques are examples of model training: domain robustness (distribution shift); bias mitigation
in-processing (unfairness); smoothing/adversarial training (adversarial attacks); directly interpretable
models (lack of explainability). The following techniques are examples of post-processing: bias mitiga-
tion post-processing (unfairness); patching (adversarial attacks); post hoc explanations (lack of ex-
plainability).
Different from data preparation, data pre-processing is meant to alter the statistics or properties of the
dataset to achieve certain goals. Domain adaptation overcomes a lack of robustness to changing
environments, including temporal bias. Bias mitigation pre-processing changes the dataset to overcome
social bias and population bias. Data sanitization aims to remove traces of data poisoning attacks by
malicious actors. Learning disentangled representations overcomes a lack of human interpretability of the
features. All should be performed as required by the problem specification.
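As one concrete instance of bias mitigation pre-processing, the sketch below implements reweighing, which assigns each training sample the weight P(group) × P(label) / P(group, label) so that group membership and the label are statistically independent under the weighted data. The column names are hypothetical and the snippet is an illustration, not the book's prescribed method.

# Illustrative sketch of reweighing as a bias mitigation pre-processing step.
import pandas as pd

def reweighing_weights(df, group_col, label_col):
    """Compute per-sample weights P(group) * P(label) / P(group, label)."""
    weights = pd.Series(1.0, index=df.index)
    for g, p_g in df[group_col].value_counts(normalize=True).items():
        for y, p_y in df[label_col].value_counts(normalize=True).items():
            mask = (df[group_col] == g) & (df[label_col] == y)
            p_gy = mask.mean()
            if p_gy > 0:
                weights[mask] = (p_g * p_y) / p_gy
    return weights

train = pd.DataFrame({"group": ["a", "a", "b", "b", "b"],
                      "label": [1, 0, 1, 1, 0]})
train["sample_weight"] = reweighing_weights(train, "group", "label")
# The weights can then be passed to any learning algorithm that accepts sample weights.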
The main task in the modeling step is to use an algorithm that finds the patterns in the training
dataset and generalizes from them to fit a model that will predict labels for new unseen data points with
good performance. (The term predict does not necessarily imply forecasting into the future, but simply
refers to providing a guess for an unknown value.) There are many different algorithms for fitting
models, each with a different inductive bias or set of assumptions it uses to generalize. Many machine
learning algorithms explicitly minimize the objective function that was determined in the problem
specification step of the lifecycle. Some algorithms minimize an approximation to the specified
objective to make the mathematical optimization easier. This common approach to machine learning is
known as risk minimization.
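For reference, using standard notation that is not specific to this book, risk minimization chooses the model parameters $\theta$ that minimize the average loss over the $n$ training samples $(x_i, y_i)$:

$$\hat{\theta} = \arg\min_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f_\theta(x_i)\big),$$

where $f_\theta$ is the model being fit and the loss function $L$ encodes, or approximates, the objective determined in the problem specification step.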
The no free lunch theorem of machine learning says that there is no one best machine learning
algorithm for all problems and datasets.5 Which one works best depends on the characteristics of the
dataset. Data scientists try out several different methods, tune their parameters, and see which one performs best empirically. The empirical comparison is conducted by randomly splitting the training dataset into a training partition on which the model is fit and a testing partition on which the model’s performance is validated. The partitioning and validation can be done once, or they can be done several times. When done several times, the procedure is known as cross-validation; it is useful because it characterizes the stability of the results. Cross-validation should be done for datasets with a small number of samples.
5. David H. Wolpert. “The Lack of A Priori Distinctions Between Learning Algorithms.” In: Neural Computation 8.7 (Oct. 1996), pp. 1341–1390.
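A brief scikit-learn sketch of this empirical comparison on synthetic data might look as follows; the two candidate algorithms are arbitrary choices for illustration.

# Illustrative sketch: comparing two learning algorithms by 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
candidates = [("logistic regression", LogisticRegression(max_iter=1000)),
              ("decision tree", DecisionTreeClassifier(random_state=0))]
for name, model in candidates:
    scores = cross_val_score(model, X, y, cv=5)  # accuracy on each of 5 folds
    print(f"{name}: mean accuracy {scores.mean():.3f}, std {scores.std():.3f}")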
The basic machine learning algorithm can be enhanced in several ways to satisfy additional
objectives and constraints captured in the problem specification. One way to increase reliability across
operating conditions is known as domain robustness. Machine learning algorithms that reduce unwanted
biases are known as bias mitigation in-processing. One example category of methods for defending against
data poisoning attacks is known as smoothing. A defense against a different kind of adversarial attack,
the evasion attack, is adversarial training. Certain machine learning algorithms produce models that are
simple in form and thus directly interpretable and understandable to people. Once again, all of these
enhancements should be done according to the problem specification.
Post-processing rules change the predicted label of a sample or compute additional information to
accompany the predicted label. Post-processing methods can be divided into two high-level categories:
open-box and closed-box. Open-box methods utilize information from the model such as its parameters
and functions of its parameters. Closed-box methods can only see the output predictions arising from
given inputs. Open-box methods should be used if possible, such as when there is close integration of
model training and post-processing in the system. In certain scenarios, post-processing methods, also
known as post hoc methods, are isolated from the model for logistical or security reasons. In these
scenarios, only closed-box methods are tenable. Post-processing techniques for increasing reliability,
mitigating unwanted biases, defending against adversarial attacks, and generating explanations should
be used judiciously to achieve the goals of the problem owner. For example, post hoc explanations are
important to provide to m-Udhār Solar’s loan officers so that they can better discuss the decision with
the applicant.
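As a simple closed-box illustration, the sketch below applies bias mitigation post-processing by thresholding the model's output scores with group-specific thresholds. The threshold values here are hypothetical placeholders; in practice they would be chosen on a validation set to meet the fairness goal in the problem specification.

# Illustrative sketch of a closed-box post-processing rule over model output scores.
import numpy as np

def post_process(scores, groups, thresholds):
    """Return final labels by thresholding each score with its group's threshold."""
    scores = np.asarray(scores)
    return np.array([int(s >= thresholds[g]) for s, g in zip(scores, groups)])

# Hypothetical thresholds chosen to equalize positive prediction rates across groups.
final_labels = post_process([0.45, 0.62, 0.41], ["a", "b", "b"],
                            thresholds={"a": 0.5, "b": 0.4})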
The specification of certain use cases calls for causal modeling: finding generalizable instances of
cause-and-effect from the training data rather than only correlative patterns. These are problems in
which input interventions are meant to change the outcome. For example, when coaching an employee
for success, it is not good enough to identify the pattern that putting in extra hours is predictive of a
promotion. Good advice represents a causal relationship: if the employee starts working extra hours,
then they can expect to be promoted. It may be that there is a common cause (e.g. conscientiousness)
for both doing quality work and working extra hours, but it is only doing quality work that causes a
promotion. Working long hours while doing poor quality work will not yield a promotion; causal
modeling will show that.
2.6 Evaluation
Once m-Udhār’s data scientists have a trained and tested model that they feel best satisfies the problem
owner’s requirements, they pass it on to model validators. A model validator conducts further
independent testing and evaluation of the model, often with a completely separate held-out dataset that
the data scientist did not have access to. It is important that the held-out set not have any leakage from
the training set. To stress-test the model’s safety and reliability, the model validator can and should
evaluate it on data collected under various conditions and data generated to simulate unlikely events.
The model validator persona is part of model risk management. Model risk is the chance of decisions
supported by statistical or machine learning models yielding gross harms. Issues can come from any of
the preceding lifecycle steps: from bad problem specification to data quality problems to bugs in the
machine learning algorithm software. Even this late in the game, it is possible that the team might have
to start over if issues are discovered. It is only after the model validator signs off on the model that it is
put into production. Although not standard practice yet in machine learning, this ‘signing off’ can be
construed as a declaration of conformity, a document often used in various industries and sectors
certifying that a product is operational and safe.
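A sketch of such slice-based stress testing follows. The model, held-out data, and slice definitions are hypothetical stand-ins for data collected under various conditions.

# Illustrative sketch: held-out evaluation overall and on named slices of the data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def evaluate_by_slice(model, X_heldout, y_heldout, slice_masks):
    """Report accuracy overall and on each named slice of the held-out data."""
    report = {"overall": accuracy_score(y_heldout, model.predict(X_heldout))}
    for name, mask in slice_masks.items():
        report[name] = accuracy_score(y_heldout[mask], model.predict(X_heldout[mask]))
    return report

# Hypothetical usage: slices stand in for conditions such as districts outside the
# pilot or rainy-season applications; large gaps between slices flag reliability issues.
X, y = make_classification(n_samples=400, n_features=8, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X[:300], y[:300])
X_held, y_held = X[300:], y[300:]
slices = {"condition A": X_held[:, 0] > 0, "condition B": X_held[:, 0] <= 0}
print(evaluate_by_slice(model, X_held, y_held, slices))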
2.8 Summary
▪ The machine learning lifecycle consists of six main sequential steps: (1) problem specification,
(2) data understanding, (3) data preparation, (4) modeling, (5) evaluation, and (6) deployment
and monitoring, performed by people in different roles.
▪ The modeling step has three parts: (1) pre-processing, (2) model training, and (3) post-
processing.
▪ To operationalize a machine learning system, plan for the different attributes of trustworthiness
starting from the first step of problem specification. Considerations beyond basic performance
should not be sprinkled on at the end like pixie dust, but developed at every step of the way with
input from diverse stakeholders, including affected users from marginalized groups.
3
Safety
Imagine that you are a data scientist at the (fictional) peer-to-peer lender ThriveGuild. You are in the
problem specification phase of the machine learning lifecycle for a system that evaluates and approves
borrowers. The problem owners, diverse stakeholders, and you yourself want this system to be
trustworthy and not cause harm to people. Everyone wants it to be safe. But what is harm and what is
safety in the context of a machine learning system?
Safety can be defined in very domain-specific ways, like safe toys not having lead paint or small parts
that pose choking hazards, safe neighborhoods having low rates of violent crime, and safe roads having
a maximum curvature. But these definitions are not particularly useful in helping define safety for
machine learning. Is there an even more basic definition of safety that could be extended to the machine
learning context? Yes, based on the concepts of (1) harm, (2) aleatoric uncertainty and risk, and (3) epistemic
uncertainty.1 (These terms are defined in the next section.)
This chapter teaches you how to approach the problem specification phase of a trustworthy machine
learning system from a safety perspective. Specifically, by defining safety as minimizing two different
types of uncertainty, you can collaborate with problem owners to crisply specify safety requirements
and objectives that you can then work towards in the later parts of the lifecycle.2 The chapter covers:
▪ Constructing the concept of safety from more basic concepts applicable to machine learning:
harm, aleatoric uncertainty, and epistemic uncertainty.
▪ Charting out how to distinguish between the two types of uncertainty and articulating how to
quantify them using probability theory and possibility theory.
▪ Specifying problem requirements in terms of summary statistics of uncertainty.
1
Niklas Möller and Sven Ove Hansson. “Principles of Engineering Safety: Risk and Uncertainty Reduction.” In: Reliability Engi-
neering and System Safety 93.6 (Jun. 2008), pp. 798–805.
2
Kush R. Varshney and Homa Alemzadeh. “On the Safety of Machine Learning: Cyber-Physical Systems, Decision Sciences, and
Data Products.” In: Big Data 5.3 (Sep. 2017), pp. 246–255.
3
Alon Jacovi, Ana Marasović, Tim Miller, and Yoav Goldberg. “Formalizing Trust in Artificial Intelligence: Prerequisites, Causes
and Goals of Human Trust in AI.” In: Proceedings of the ACM Conference on Fairness, Accountability, and Transparency. Mar. 2021, pp.
624–635.
4
Eyke Hüllermeier and Willem Waegeman. “Aleatoric and Epistemic Uncertainty in Machine Learning: An Introduction to Con-
cepts and Methods.” In: Machine Learning 110.3 (Mar. 2021), pp. 457–506.
reduce the epistemic uncertainty. ThriveGuild’s epistemic uncertainty about an applicant’s loan-
worthiness can be reduced by doing an employment verification.
“Not knowing the chance of mutually exclusive events and knowing the chance to be
equal are two quite different states of knowledge.”
Whereas aleatoric uncertainty is inherent, epistemic uncertainty depends on the observer. Do all
observers have the same amount of uncertainty? If yes, you are dealing with aleatoric uncertainty. If
some observers have more uncertainty and some observers have less uncertainty, then you are dealing
with epistemic uncertainty.
The two uncertainties are quantified in different ways. Aleatoric uncertainty is quantified using
probability and epistemic uncertainty is quantified using possibility. You have probably learned
probability theory before, but it is possible that possibility theory is new to you. We’ll dive into the details
in the next section. To repeat the definition of safety in other words: safety is the reduction of the probability
of expected harms and the possibility of unexpected harms. Problem specifications for trustworthy machine
learning need to include both parts, not just the first part.
The reduction of aleatoric uncertainty is associated with the first attribute of trustworthiness (basic
performance). The reduction of epistemic uncertainty is associated with the second attribute of
trustworthiness (reliability). A summary of the characteristics of the two types of uncertainty is shown
in Table 3.1. Do not take the shortcut of focusing only on aleatoric uncertainty when developing your
machine learning model; make sure that you focus on epistemic uncertainty as well.
For a finite sample space Ω of possible outcomes, the probability of an event 𝐴 (a subset of outcomes) is the fraction of outcomes that belong to 𝐴:
𝑃(𝐴) = ‖𝐴‖⁄‖Ω‖.
Equation 3.1
A probability function satisfies three properties:
1. 𝑃(𝐴) ≥ 0,
2. 𝑃(Ω) = 1, and
3. if 𝐴 and 𝐵 are disjoint events (they have no outcomes in common; 𝐴 ∩ 𝐵 = ∅), then 𝑃(𝐴 ∪ 𝐵) =
𝑃(𝐴) + 𝑃(𝐵).
5
Equation 3.1 is only valid for finite sample spaces, but the same high-level idea holds for infinite sample spaces.
These three properties are pretty straightforward and just formalize what we normally mean by
probability. A probability of an event is a number between zero and one. The probability of one event or
another event happening is the sum of their individual probabilities as long as the two events don’t
contain any of the same outcomes.
The probability mass function (pmf) makes life easier in describing probability for discrete sample
spaces. It is a function 𝑝 that takes outcomes ω as input and gives back probabilities for those outcomes.
The sum of the pmf across all outcomes in the sample space is one, ∑ω∈Ω 𝑝(ω) = 1, which is needed to
satisfy the second property of probability.
The probability of an event is the sum of the pmf values of its constituent outcomes. For example, if
the pmf of employment status is 𝑝(employed) = 0.60, 𝑝(unemployed) = 0.05, and 𝑝(other) = 0.35, then
the probability of event {employed, other} is 𝑃({employed, other}) = 0.60 + 0.35 = 0.95. This way of
adding pmf values to get an overall probability works because of the third property of probability.
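If you want to check this arithmetic yourself, here is a minimal Python sketch; the pmf values are the hypothetical ones from the example above:

```python
# Hypothetical pmf over employment status from the example above
pmf = {"employed": 0.60, "unemployed": 0.05, "other": 0.35}

def event_probability(event, pmf):
    """Probability of an event is the sum of the pmf values of its outcomes."""
    return sum(pmf[outcome] for outcome in event)

print(event_probability({"employed", "other"}, pmf))  # 0.95, up to floating-point rounding
print(sum(pmf.values()))                              # 1.0, the second property of probability
```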
Random variables are a really useful concept in specifying the safety requirements of machine
learning problems. A random variable 𝑋 takes on a specific numerical value 𝑥 when 𝑋 is measured or
observed; that numerical value is random. The set of all possible values of 𝑋 is 𝒳. The probability
function for the random variable 𝑋 is denoted 𝑃𝑋 . Random variables can be discrete or continuous. They
can also represent categorical outcomes by mapping the outcome values to a finite set of numbers, e.g.
mapping {employed, unemployed, other} to {0, 1, 2}. The pmf of a discrete random variable is written as
𝑝𝑋 (𝑥).
Pmfs don’t exactly make sense for uncountably infinite sample spaces. So the cumulative distribution
function (cdf) is used instead. It is the probability that a continuous random variable 𝑋 takes a value less
than or equal to some sample point 𝑥, i.e. 𝐹𝑋 (𝑥) = 𝑃(𝑋 ≤ 𝑥). An alternative representation is the
probability density function (pdf) 𝑝𝑋(𝑥) = 𝑑𝐹𝑋(𝑥)⁄𝑑𝑥, the derivative of the cdf with respect to 𝑥.6 The value of a
pdf is not a probability, but integrating a pdf over a set yields a probability.
To better understand cdfs and pdfs, let’s look at one of the ThriveGuild features you’re going to use
in your machine learning lending model: the income of the applicant. Income is a continuous random
variable whose cdf may be, for example:7
𝐹𝑋(𝑥) = 1 − 𝑒^{−0.5𝑥} for 𝑥 ≥ 0, and 𝐹𝑋(𝑥) = 0 otherwise.
Equation 3.2
Figure 3.1 shows what this distribution looks like and how to compute probabilities from it. It shows that
the probability that the applicant’s income is less than or equal to 2 (in units such as ten thousand
dollars) is 1 − 𝑒 −0.5∙2 = 1 − 𝑒 −1 ≈ 0.63. Most borrowers tend to earn less than 2. The pdf is the derivative
of the cdf:
6
I overload the notation 𝑝𝑋 ; it should be clear from the context whether I’m referring to a pmf or pdf.
7
This specific choice is an exponential distribution. The general form of an exponential distribution is 𝑝𝑋(𝑥) = λ𝑒^{−λ𝑥} for 𝑥 ≥ 0 and 𝑝𝑋(𝑥) = 0 otherwise, for any λ > 0.
𝑝𝑋(𝑥) = 0.5𝑒^{−0.5𝑥} for 𝑥 ≥ 0, and 𝑝𝑋(𝑥) = 0 otherwise.
Equation 3.3
Figure 3.1. An example cdf and corresponding pdf from the ThriveGuild income distribution example. Accessi-
ble caption. A graph at the top shows the cdf and a graph at the bottom shows its corresponding pdf.
Differentiation is the operation to go from the top graph to the bottom graph. Integration is the opera-
tion to go from the bottom graph to the top graph. The top graph shows how to read off a probability
directly from the value of the cdf. The bottom graph shows that obtaining a probability requires inte-
grating the pdf over an interval.
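As a small illustrative sketch, assuming the exponential cdf of Equation 3.2, the following Python snippet reads the probability off the cdf and confirms it by numerically integrating the pdf (the integration grid is an arbitrary choice):

```python
import numpy as np

def cdf(x):
    """cdf of the assumed exponential income distribution (rate 0.5)."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, 1.0 - np.exp(-0.5 * x), 0.0)

def pdf(x):
    """pdf of the same distribution: the derivative of the cdf."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, 0.5 * np.exp(-0.5 * x), 0.0)

# Probability that income is at most 2, read directly from the cdf
print(cdf(2.0))                    # approximately 0.63

# The same probability obtained by integrating the pdf over [0, 2] with a Riemann sum
xs = np.linspace(0.0, 2.0, 100_001)
dx = xs[1] - xs[0]
print(np.sum(pdf(xs)) * dx)        # also approximately 0.63
```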
Joint pmfs, cdfs, and pdfs of more than one random variable are multivariate functions and can
contain a mix of discrete and continuous random variables. For example, 𝑝𝑋,𝑌,𝑍 (𝑥, 𝑦, 𝑧) is the notation for
the pdf of three random variables 𝑋, 𝑌, and 𝑍. To obtain the pmf or pdf of a subset of the random
variables, you sum the pmf or integrate the pdf over the rest of the variables outside of the subset you
want to keep. This act of summing or integrating is known as marginalization and the resulting
probability distribution is called the marginal distribution. You should contrast the use of the term
‘marginalize’ here with the social marginalization that leads individuals and groups to be made
powerless by being treated as insignificant.
The employment status feature and the loan approval label in the ThriveGuild model are random
variables that have a joint pmf. For example, this multivariate function could be 𝑝(employed, approve) =
0.20, 𝑝(employed, deny) = 0.40, 𝑝(unemployed, approve) = 0.01, 𝑝(unemployed, deny) = 0.04,
𝑝(other, approve) = 0.10, and 𝑝(other, deny) = 0.25. This function is visualized as a table of probability
values in Figure 3.2. Summing loan approval out from this joint pmf, you recover the marginal pmf for
employment status given earlier. Summing employment status out, you get the marginal pmf for loan
approval as 𝑝(approve) = 0.31 and 𝑝(deny) = 0.69.
Figure 3.2. Examples of marginalizing a joint distribution by summing out one of the random variables. Acces-
sible caption. A table of the joint pmf has employment status as the columns and loan approval as the
rows. The entries are the probabilities. Adding the numbers in the columns gives the marginal pmf of
employment status. Adding the numbers in the rows gives the marginal pmf of loan approval.
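A minimal Python sketch of this marginalization, using the hypothetical joint pmf values from the example:

```python
from collections import defaultdict

# Hypothetical joint pmf over (employment status, loan approval) from the example
joint = {
    ("employed", "approve"): 0.20, ("employed", "deny"): 0.40,
    ("unemployed", "approve"): 0.01, ("unemployed", "deny"): 0.04,
    ("other", "approve"): 0.10, ("other", "deny"): 0.25,
}

def marginal(joint, position):
    """Sum out every variable except the one at the given position."""
    m = defaultdict(float)
    for outcome, p in joint.items():
        m[outcome[position]] += p
    return dict(m)

print(marginal(joint, 0))  # employment status marginal: 0.60, 0.05, 0.35 (up to rounding)
print(marginal(joint, 1))  # loan approval marginal: approve 0.31, deny 0.69
```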
Probabilities, pmfs, cdfs, and pdfs are all tools for quantifying aleatoric uncertainty. They are used
to specify the requirements for the accuracy of models, which is critical for the first of the two parts of
safety: risk minimization. A correct prediction is an event and the probability of that event is the
accuracy. For example, working with the problem owner, you may specify that the ThriveGuild lending
model must have at least a 0.92 probability of being correct. The accuracy of machine learning models
and other similar measures of basic performance are the topic of Chapter 6 in Part 3 of the book.
The possibility function Π satisfies three properties that parallel those of probability:
1. Π(∅) = 0,
2. Π(Ω) = 1, and
3. if 𝐴 and 𝐵 are disjoint events (they have no outcomes in common; 𝐴 ∩ 𝐵 = ∅), then Π(𝐴 ∪ 𝐵) =
max(Π(𝐴), Π(𝐵)).
One difference is that the third property of possibility contains maximum, whereas the third property of
probability contains addition. Probability is additive, but possibility is maxitive. The probability of an
event is the sum of the probabilities of its constituent outcomes, but the possibility of an event is the
maximum of the possibilities of its constituent outcomes. This is because possibilities can only be zero
or one. If you have two events, both of which have possibility equal to one, and you want to know the
possibility of one or the other occurring, it does not make sense to add one plus one to get two, you should
take the maximum of one and one to get one.
You should use possibility in specifying requirements for the ThriveGuild machine learning system
to address the epistemic uncertainty (reliability) side of the two-part definition of safety. For example,
there will be epistemic uncertainty in what the best possible model parameters are if there is not enough
of the right training data. (The data you ideally want to have is from the present, from a fair and just
world, and that has not been corrupted. However, you’re almost always out of luck and have data from
the past, from an unjust world, or that has been corrupted.) The data that you have can bracket the
possible set of best parameters through the use of the possibility function. Your data tells you that one
set of model parameters is possibly the best set of parameters, and that it is impossible for other different
sets of model parameters to be the best. Problem specifications can place limits on the cardinality of the
possibility set. Dealing with epistemic uncertainty in machine learning is the topic of Part 4 of the book
in the context of generalization, fairness, and adversarial robustness.
An important summary statistic of a random variable is its expected value or mean. For a continuous random variable 𝑋 with pdf 𝑝𝑋(𝑥), the expected value is:
𝐸[𝑋] = ∫_{−∞}^{∞} 𝑥 𝑝𝑋(𝑥) 𝑑𝑥.
Equation 3.4
Recall that in the example earlier, ThriveGuild borrowers had the income pdf 0.5𝑒^{−0.5𝑥} for 𝑥 ≥ 0 and zero elsewhere. The expected value of income is thus ∫_{0}^{∞} 𝑥 ⋅ 0.5𝑒^{−0.5𝑥} 𝑑𝑥 = 2.⁸ When you have a bunch of samples drawn from the probability distribution of 𝑋, denoted {𝑥1, 𝑥2, …, 𝑥𝑛}, then you can compute an empirical version of the expected value, the sample mean, as 𝑥̅ = (1⁄𝑛) ∑_{𝑗=1}^{𝑛} 𝑥𝑗. Not only can you compute the expected
value of a random variable alone, but also the expected value of any function of a random variable. It is
the integral of the pdf multiplied by the function. Through expected values of performance measures, also known as risk, you can specify that the average behavior of a system stays within certain ranges for the purposes of safety.
How much variability in income should you plan for among ThriveGuild applicants? An important
expected value is the variance 𝐸[(𝑋 − 𝐸[𝑋])²], which measures the spread of a distribution and helps answer the question. Its sample version, the sample variance, is computed as (1⁄(𝑛−1)) ∑_{𝑗=1}^{𝑛} (𝑥𝑗 − 𝑥̅)². The
correlation between two random variables 𝑋 (e.g., income) and 𝑌 (e.g., loan approval) is also an expected
value, 𝐸[𝑋𝑌], which tells you whether there is some sort of statistical relationship between the two
random variables. The covariance, 𝐸[(𝑋 − 𝐸[𝑋])(𝑌 − 𝐸[𝑌])] = 𝐸[𝑋𝑌] − 𝐸[𝑋]𝐸[𝑌], tells you whether, when one random variable increases, the other tends to increase or decrease along with it. These different expected values
and summary statistics give different insights about aleatoric uncertainty that are to be constrained in
the problem specification.
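As a sketch of how these summary statistics are estimated in practice, the following snippet draws hypothetical samples from the exponential income distribution and computes the sample mean, sample variance, and a sample covariance; the second variable and its relationship to income are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical incomes drawn from the exponential distribution with rate 0.5 (mean 2)
income = rng.exponential(scale=2.0, size=100_000)

print(income.mean())          # sample mean, close to E[X] = 2
print(income.var(ddof=1))     # sample variance with the 1/(n-1) normalization, close to 4

# A made-up second variable that partially depends on income
loan_score = 0.5 * income + rng.normal(size=income.size)
print(np.cov(income, loan_score)[0, 1])  # sample covariance, close to 0.5 * Var[X] = 2
```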
8
The expected value of a generic exponentially-distributed random variable is 1⁄λ.
The information of an outcome 𝑥 is −log 𝑝𝑋(𝑥), so outcomes with probability close to zero are very informative. For probabilities close to one, the information is close to zero because common
occurrences are not informative. Do you go around telling everyone that you did not win the lottery?
Probably not, because it is not informative. The expected value of the information of 𝑋 is its entropy:
𝐻(𝑋) = −∑_{𝑥∈𝒳} 𝑝𝑋(𝑥) log 𝑝𝑋(𝑥).
Equation 3.5
Uniform distributions with equal probability for all outcomes have maximum entropy among all
possible distributions. The difference between the maximum entropy achieved by the uniform
distribution and the entropy of a given random variable is the redundancy. It is known as the Theil index
when used to summarize inequality in a population. For a discrete random variable 𝑋 taking non-
negative values, which is usually the case when measuring assets, income, or wealth of individuals, the
Theil index is:
Theil index = ∑_{𝑥∈𝒳} 𝑝𝑋(𝑥) (𝑥⁄𝐸[𝑋]) log(𝑥⁄𝐸[𝑋]),
Equation 3.6
where 𝒳 = {0,1, … , ∞} and the logarithm is the natural logarithm. The index’s values range from zero to
one. The entropy-maximizing distribution in which all members of a population have the same value,
which is the mean value, has zero Theil index and represents the most equality. A Theil index of one
represents the most inequality. It is achieved by a pmf with one non-zero value and all other zero values.
(Think of one lord and many serfs.) In Chapter 10, you’ll see how to use the Theil index to specify
machine learning systems in terms of their individual fairness and group fairness requirements
together.
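A minimal sketch of the Theil index computation for a small, made-up population of income values (not real data):

```python
import numpy as np

def theil_index(values):
    """Empirical Theil index of a population of non-negative values."""
    x = np.asarray(values, dtype=float)
    ratio = x / x.mean()
    terms = np.zeros_like(ratio)
    positive = ratio > 0            # 0 * log(0) is taken to be 0
    terms[positive] = ratio[positive] * np.log(ratio[positive])
    return terms.mean()

print(theil_index([2, 2, 2, 2]))    # 0.0: everyone has the mean value, the most equality
print(theil_index([0, 0, 0, 8]))    # much larger: one 'lord' and many 'serfs'
```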
The Kullback–Leibler (K-L) divergence between two pmfs 𝑝(𝑥) and 𝑞(𝑥) defined on the same sample space is:
𝐷(𝑝 ∥ 𝑞) = −∑_{𝑥∈𝒳} 𝑝(𝑥) log(𝑞(𝑥)⁄𝑝(𝑥)).
Equation 3.7
It measures how similar or different two distributions are. Similarity of one distribution to a reference
distribution is often a requirement in machine learning systems.
The cross-entropy is another quantity defined for two random variables on the same sample space. It represents the average information in a random variable with pmf 𝑝(𝑥) when it is described using a different pmf 𝑞(𝑥):
𝐻(𝑝 ∥ 𝑞) = −∑_{𝑥∈𝒳} 𝑝(𝑥) log 𝑞(𝑥).
Equation 3.8
As such, it is the entropy of the first pmf plus the K-L divergence between the two pmfs:
𝐻(𝑝 ∥ 𝑞) = 𝐻(𝑝) + 𝐷(𝑝 ∥ 𝑞).
Equation 3.9
When 𝑝 = 𝑞, then 𝐻(𝑝 ∥ 𝑞) = 𝐻(𝑝) because the K-L divergence term goes to zero and there is no
remaining mismatch between 𝑝 and 𝑞. Cross-entropy is used as an objective for training neural networks
as you’ll see in Chapter 7.
The mutual information of two random variables 𝑋 and 𝑌 is the K-L divergence between their joint pmf and the product of their marginal pmfs:
𝐼(𝑋; 𝑌) = 𝐷(𝑝𝑋,𝑌(𝑥, 𝑦) ∥ 𝑝𝑋(𝑥)𝑝𝑌(𝑦)).
Equation 3.10
It is symmetric in its two arguments and measures how much information is shared between 𝑋 and 𝑌.
In Chapter 5, mutual information is used to set a constraint on privacy: the goal of not sharing
information. It crops up in many other places as well.
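The following Python sketch computes the K-L divergence, cross-entropy, and mutual information for small pmfs, reusing the hypothetical joint distribution of employment status and loan approval from Figure 3.2; the function names are my own, not from any particular library:

```python
import numpy as np

def kl_divergence(p, q):
    """K-L divergence D(p || q) between two pmfs over the same outcomes."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def cross_entropy(p, q):
    """Cross-entropy H(p || q), which equals H(p) + D(p || q)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(q[mask])))

def mutual_information(joint):
    """I(X;Y): K-L divergence between the joint pmf and the product of its marginals."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1, keepdims=True)
    p_y = joint.sum(axis=0, keepdims=True)
    return kl_divergence(joint.ravel(), (p_x * p_y).ravel())

# Hypothetical joint pmf: rows are employment status, columns are approve/deny
joint = np.array([[0.20, 0.40],   # employed
                  [0.01, 0.04],   # unemployed
                  [0.10, 0.25]])  # other
print(mutual_information(joint))  # small but positive, so the variables are dependent
```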
Suppose you learn that a loan applicant is employed and want to update the probability of their approval to reflect that knowledge. This updated probability is known as a conditional probability and is used to quantify a probability
when you have additional information that the outcome is part of some event. The conditional
probability of event 𝐴 given event 𝐵 is the ratio of the cardinality of the joint event 𝐴 and 𝐵, to the
cardinality of the event 𝐵:9
𝑃(𝐴 ∣ 𝐵) = ‖𝐴 ∩ 𝐵‖⁄‖𝐵‖ = 𝑃(𝐴 ∩ 𝐵)⁄𝑃(𝐵).
Equation 3.11
In other words, the sample space changes from Ω to 𝐵, so that is why the denominator of Equation 3.1
(‖𝐴‖/‖Ω‖) changes from Ω to 𝐵 in Equation 3.11. The numerator ‖𝐴 ∩ 𝐵‖ captures the part of the event 𝐴
that is within the new sample space 𝐵. There are similar conditional versions of pmfs, cdfs, and pdfs
defined for random variables.
Through conditional probability, you can reason not only about distributions and summaries of
uncertainty, but also how they change when observations are made, outcomes are revealed, and
evidence is collected. Using a machine learning model is similar to getting the conditional probability of
the label given the feature values of an input data point. The probability of loan approval given the
features for one specific applicant being employed with an income of 15,000 dollars is a conditional
probability.
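A small sketch of such a conditional probability computation, again reusing the hypothetical joint pmf of employment status and loan approval:

```python
# Hypothetical joint pmf over (employment status, loan approval), as in Figure 3.2
joint = {
    ("employed", "approve"): 0.20, ("employed", "deny"): 0.40,
    ("unemployed", "approve"): 0.01, ("unemployed", "deny"): 0.04,
    ("other", "approve"): 0.10, ("other", "deny"): 0.25,
}

def conditional_pmf(joint, status):
    """P(loan approval | employment status) = joint probability / marginal of the status."""
    p_status = sum(p for (s, _), p in joint.items() if s == status)
    return {decision: p / p_status for (s, decision), p in joint.items() if s == status}

print(conditional_pmf(joint, "employed"))  # approve: 0.20/0.60 = 1/3, deny: 0.40/0.60 = 2/3
```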
In terms of summary statistics, the conditional entropy of 𝑌 given 𝑋 is:
𝐻(𝑌 ∣ 𝑋) = −∑_{𝑦∈𝒴} ∑_{𝑥∈𝒳} 𝑝𝑌,𝑋(𝑦, 𝑥) log(𝑝𝑌,𝑋(𝑦, 𝑥)⁄𝑝𝑋(𝑥)).
Equation 3.12
The mutual information can equivalently be written using the conditional entropy:
𝐼(𝑋; 𝑌) = 𝐻(𝑌) − 𝐻(𝑌 ∣ 𝑋).
Equation 3.13
In this form, you can see that mutual information quantifies the reduction in entropy in a random
variable by conditioning on another random variable. In this role, it is also known as information gain,
and used as a criterion for learning decision trees in Chapter 7. Another common criterion for learning
decision trees is the Gini index:
9
Event 𝐵 has to be non-empty and the sample space has to be finite for this definition to be applicable.
Gini index = ∑_{𝑥∈𝒳} 𝑝𝑋(𝑥)(1 − 𝑝𝑋(𝑥)).
Equation 3.14
Two events 𝐴 and 𝐵 are independent, denoted 𝐴 ⫫ 𝐵, when conditioning on 𝐵 does not change the probability of 𝐴:
𝐴 ⫫ 𝐵 ⇔ 𝑃(𝐴 ∣ 𝐵) = 𝑃(𝐴).
Equation 3.15
The tendency of 𝐴 to occur is not changed by knowledge that 𝐵 has occurred. If in ThriveGuild’s data, 𝑃(employed ∣ deny) = 0.50 and 𝑃(employed) = 0.60, then since the two numbers 0.50 and 0.60 are not the same, employment status and loan approval are not independent; they are dependent. Employment status is used in loan approval decisions. The definition of conditional
probability further implies that:
𝐴 ⫫ 𝐵 ⇔ 𝑃(𝐴, 𝐵) = 𝑃(𝐴)𝑃(𝐵).
Equation 3.16
The probability of the joint event is the product of the marginal probabilities. Moreover, if two random
variables are independent, their mutual information is zero.
The concept of independence can be extended to more than two events. Mutual independence
among several events is more than simply a collection of pairwise independence statements; it is a
stronger notion. A set of events is mutually independent if any of the constituent events is independent
of all subsets of events that do not contain that event. The pdfs, cdfs, and pmfs of mutually independent
random variables can be written as the products of the pdfs, cdfs, and pmfs of the individual constituent
random variables. One commonly used assumption in machine learning is of independent and identically
distributed (i.i.d.) random variables, which in addition to mutual independence, states that all of the
random variables under consideration have the same probability distribution.
A further concept is conditional independence, which involves at least three events. The events 𝐴 and 𝐵 are conditionally independent given 𝐶, denoted 𝐴 ⫫ 𝐵 ∣ 𝐶, when, once it is known that 𝐶 has occurred, the tendency of 𝐴 to occur is not changed by knowledge that 𝐵 has also occurred. Similar to the unconditional case, the probability of the joint conditional event is the product of the marginal conditional probabilities under conditional independence.
𝐴 ⫫ 𝐵 ∣ 𝐶 ⇔ 𝑃( 𝐴 ∩ 𝐵 ∣ 𝐶 ) = 𝑃( 𝐴 ∣ 𝐶 )𝑃( 𝐵 ∣ 𝐶 ).
Equation 3.17
Conditional independence also extends to random variables and their pmfs, cdfs, and pdfs.
Figure 3.3. An example graphical model consisting of four events. The employment status and gender nodes have
no parents; employment status is the parent of income, and thus there is an edge from employment status to in-
come; both income and employment status are the parents of loan approval, and thus there are edges from income
and from employment status to loan approval. The graphical model is shown on the left with the names of the
events and on the right with their symbols.
The statistical relationships are determined by the graph structure. The joint probability of several events 𝐴1, …, 𝐴𝑛 is the product of the probabilities of the events conditioned on their parents 𝑝𝑎(𝐴𝑗):
𝑃(𝐴1, …, 𝐴𝑛) = ∏_{𝑗=1}^{𝑛} 𝑃(𝐴𝑗 ∣ 𝑝𝑎(𝐴𝑗)).
Equation 3.18
As a special case of Equation 3.18 for the graphical model in Figure 3.3, the corresponding probability
may be written as 𝑃(𝐴1 , 𝐴2 , 𝐴3 , 𝐴4 ) = 𝑃( 𝐴1 ∣ 𝐴2 )𝑃(𝐴2 )𝑃( 𝐴3 ∣ 𝐴1 , 𝐴2 )𝑃(𝐴4 ). Valid probability distributions
lead to directed acyclic graphs. Graphs are acyclic if you follow a path of arrows and can never return to
nodes you started from. An ancestor of a node is any node that is its parent, parent of its parent, parent
of its parent of its parent, and so on recursively.
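As an illustrative sketch of Equation 3.18 for the graph in Figure 3.3, the following snippet multiplies together hypothetical conditional probability tables; the specific numbers are made up and not from the text:

```python
# A2 = employment status, A1 = income level, A3 = loan approval, A4 = gender
p_a2 = {"employed": 0.6, "not employed": 0.4}
p_a1_given_a2 = {                  # P(income | employment status), made-up values
    ("high", "employed"): 0.7, ("low", "employed"): 0.3,
    ("high", "not employed"): 0.2, ("low", "not employed"): 0.8,
}
p_approve_given_a1_a2 = {          # P(approve | income, employment status), made-up values
    ("high", "employed"): 0.9, ("low", "employed"): 0.5,
    ("high", "not employed"): 0.6, ("low", "not employed"): 0.1,
}
p_a4 = {"woman": 0.5, "man": 0.5}

def joint(a1, a2, approve, a4):
    """P(A1, A2, A3, A4) = P(A1 | A2) P(A2) P(A3 | A1, A2) P(A4)."""
    p_approve = p_approve_given_a1_a2[(a1, a2)]
    p_a3 = p_approve if approve else 1.0 - p_approve
    return p_a1_given_a2[(a1, a2)] * p_a2[a2] * p_a3 * p_a4[a4]

print(joint("high", "employed", True, "woman"))  # 0.7 * 0.6 * 0.9 * 0.5 = 0.189
```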
From the small and simple graph structure in Figure 3.3, it is clear that the loan approval depends
on both income and employment status. Income depends on employment status. Gender is independent
of everything else. Making independence statements is more difficult in larger and more complicated
graphs, however. Determining all of the different independence relationships among all the events or
random variables is done through the concept of d-separation: a subset of nodes 𝑆1 is independent of
another subset of nodes 𝑆2 conditioned on a third subset of nodes 𝑆3 if 𝑆3 d-separates 𝑆1 and 𝑆2 . One way
to explain d-separation is through the three different motifs of three nodes each shown in Figure 3.4,
known as a causal chain, common cause, and common effect. The differences among the motifs are in the
directions of the arrows. The configurations on the left have no node that is being conditioned upon, i.e.
no node’s value is observed. In the configurations on the right, node 𝐴3 is being conditioned upon and is
thus shaded. The causal chain and common cause motifs without conditioning are connected. The causal
chain and common cause with conditioning are separated: the path from 𝐴1 to 𝐴2 is blocked by the
knowledge of 𝐴3 . The common effect motif without conditioning is separated; in this case, 𝐴3 is known
as a collider. Common effect with conditioning is connected; moreover, conditioning on any descendant
of 𝐴3 yields a connected path between 𝐴1 and 𝐴2 . Finally, a set of nodes 𝑆1 and 𝑆2 is d-separated
conditioned on a set of nodes 𝑆3 if and only if each node in 𝑆1 is separated from each node in 𝑆2 .10
Figure 3.4. Configurations of nodes and edges that are connected and separated. Nodes colored gray have been
observed. Accessible caption. The causal chain is 𝐴1 → 𝐴3 → 𝐴2 ; it is connected when 𝐴3 is unobserved
and separated when 𝐴3 is observed. The common cause is 𝐴1 ← 𝐴3 → 𝐴2; it is connected when 𝐴3 is unobserved and separated when 𝐴3 is observed. The common effect is 𝐴1 → 𝐴3 ← 𝐴2; it is separated when 𝐴3 is unobserved and connected when 𝐴3 or any of its descendants are observed.
10
There may be dependence not captured in the structure if one random variable is a deterministic function of another.
Although d-separation among two sets of nodes can be checked by checking all three-node motifs
along all paths between the two sets, there is a more constructive algorithm to check for d-separation.
1. Construct the ancestral graph of 𝑆1 , 𝑆2 , and 𝑆3 . This is the subgraph containing the nodes in 𝑆1 , 𝑆2 ,
and 𝑆3 along with all of their ancestors and all of the edges among these nodes.
2. For each pair of nodes with a common child, draw an undirected edge between them. This step
is known as moralization.11
3. Make all edges undirected.
4. Delete all 𝑆3 nodes.
5. If 𝑆1 and 𝑆2 are separated in the undirected sense, then they are d-separated.
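A sketch of this five-step procedure in Python, assuming the graph is represented as a dictionary mapping each node to the set of its parents; the representation and function name are my own choices:

```python
def d_separated(parents, s1, s2, s3):
    """Return True if node sets s1 and s2 are d-separated given s3."""
    # Step 1: ancestral graph of s1, s2, and s3 (the sets plus all of their ancestors)
    keep, stack = set(), list(s1 | s2 | s3)
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, set()))
    # Steps 2 and 3: moralize (connect pairs of parents of a common child) and drop directions
    edges = set()
    for child in keep:
        pa = parents.get(child, set()) & keep
        edges.update(frozenset({p, child}) for p in pa)
        edges.update(frozenset({p, q}) for p in pa for q in pa if p != q)
    # Step 4: delete the conditioning nodes and any edges touching them
    nodes = keep - s3
    edges = {e for e in edges if e <= nodes}
    # Step 5: check undirected separation by searching for a path from s1 to s2
    reached, stack = set(), list(s1)
    while stack:
        node = stack.pop()
        if node in reached:
            continue
        reached.add(node)
        stack.extend((set(e) - {node}).pop() for e in edges if node in e)
    return not (reached & s2)

# The example of Figure 3.5: A1 -> A3 <- A2, A3 -> A4, A3 -> A5, A4 -> A6
parents = {"A3": {"A1", "A2"}, "A4": {"A3"}, "A5": {"A3"}, "A6": {"A4"}}
print(d_separated(parents, {"A4"}, {"A5"}, {"A2", "A3"}))  # True, matching the walkthrough
```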
Figure 3.5. An example of running the constructive algorithm to check for d-separation. Accessible caption.
The original graph has edges from 𝐴1 and 𝐴2 to 𝐴3 , from 𝐴3 to 𝐴4 and 𝐴5 , and from 𝐴4 to 𝐴6 . 𝑆1 contains
only 𝐴4 , 𝑆2 contains only 𝐴5 , and 𝑆3 contains 𝐴2 and 𝐴3 . After step 1, 𝐴6 is removed. After step 2, an un-
directed edge is drawn between 𝐴1 and 𝐴2 . After step 3, all edges are undirected. After step 4, only 𝐴1 ,
𝐴4 , and 𝐴5 remain and there are no edges. After step 5, only 𝐴4 and 𝐴5 , and equivalently 𝑆1 and 𝑆2 , re-
main and there is no edge between them. They are separated, so 𝑆1 and 𝑆2 are d-separated conditioned
on 𝑆3 .
11
The term moralization reflects a value of some but not all societies: that it is moral for the parents of a child to be married.
3.5.3 Conclusion
Independence and conditional independence allow you to know whether random variables affect one
another. They are fundamental relationships for understanding a system and knowing which parts can
be analyzed separately while determining a problem specification. One of the main benefits of graphical
models is that statistical relationships are expressed through structural means. Separations are more
clearly seen and computed efficiently.
3.6 Summary
▪ The first two attributes of trustworthiness, accuracy and reliability, are captured together
through the concept of safety.
▪ Safety is the minimization of the aleatoric uncertainty and the epistemic uncertainty of
undesired high-stakes outcomes.
▪ Aleatoric uncertainty is inherent randomness in phenomena. It is well-modeled using
probability theory.
▪ Epistemic uncertainty is lack of knowledge that can, in principle, be reduced. Often in practice,
however, it is not possible to reduce epistemic uncertainty. It is well-modeled using possibility
theory.
▪ Problem specifications for trustworthy machine learning systems can be quantitatively
expressed using probability and possibility.
▪ It is easier to express these problem specifications using statistical and information-theoretic
summaries of uncertainty than full distributions.
▪ Conditional probability allows you to update your beliefs when you receive new measurements.
▪ Independence and graphical models encode random variables not affecting one another.
4
Data Sources and Biases
The mission of the (fictional) non-profit organization Unconditionally is charitable giving. It collects
donations and distributes unconditional cash transfers—funds with no strings attached—to poor
households in East Africa. The recipients are free to do whatever they like with the money.
Unconditionally is undertaking a new machine learning project to identify the poorest of the poor
households to select for the cash donations. The faster they can complete the project, the faster and
more efficiently they can move much-needed money to the recipients, some of whom need to replace
their thatched roofs before the rainy season begins.
The team is in the data understanding phase of the machine learning lifecycle. Imagine that you are
a data scientist on the team pondering which data sources to use as features and labels to estimate the
wealth of households. You examine all sorts of data including daytime satellite imagery, nighttime
illumination satellite imagery, national census data, household survey data, call detail records from
mobile phones, mobile money transactions, social media posts, and many others. What will you choose
and why? Will your choices lead to unintended consequences or to a trustworthy system?
The data understanding phase is a really exciting time in the lifecycle. The problem goals have been
defined; working with the data engineers and other data scientists, you cannot wait to start acquiring
data and conducting exploratory analyses. Having data is a prerequisite for doing machine learning, but
not any data will do. It is important for you and the team to be careful and intentional at this point. Don’t
take shortcuts. Otherwise, before you know it, you will have a glorious edifice built upon a rocky
foundation.
This chapter begins Part 2 of the book focused on all things data (remember the organization of the
book shown in Figure 4.1).
Figure 4.1. Organization of the book. This second part focuses on different considerations of trustworthiness
when working with data. Accessible caption. A flow diagram from left to right with six boxes: part 1: in-
troduction and preliminaries; part 2: data; part 3: basic modeling; part 4: reliability; part 5: interaction;
part 6: purpose. Part 2 is highlighted. Parts 3–4 are labeled as attributes of safety. Parts 3–6 are labeled
as attributes of trustworthiness.
The chapter digs into how you and Unconditionally’s data engineers and other data scientists should appraise candidate data modalities and sources and the biases they may carry.
Appraising data sets for biases is critical for trustworthiness and is the primary focus of the chapter. The
better job done at this stage, the less correction and mitigation of harms needs to be done in later stages
of the lifecycle. Bias evaluation should include input from affected individuals of the planned machine
learning system. If all possible relevant data is deemed too biased, a conversation with the problem
owner and other stakeholders on whether to even proceed with the project is a must. (Data privacy and
consent are investigated in Chapter 5.)
4.1 Modalities
Traditionally, when most people imagine data, they imagine tables of numbers in an accounting
spreadsheet coming out of some system of record. However, data for machine learning systems can
include digital family photographs, surveillance videos, tweets, legislative documents, DNA strings,
event logs from computer systems, sensor readings over time, structures of molecules, and any other
information in digital form. In the machine learning context, data is assumed to be a finite number of
samples drawn from any underlying probability distribution.
The examples of data given above come from different modalities (images, text, time series, etc.). A
modality is a category of data defined by how it is received, represented, and understood. Figure 4.2
presents a mental model of different modalities. There are of course others that are missing from the
figure.
Figure 4.2. A mental model of different modalities of data. Accessible caption. A hierarchy diagram with
data at its root. Data has children structured and semi-structured. Structured has children tabular and
graphs. Tabular has children static tabular data, time series, and event streams. Graphs has children
social networks, physical networks, and molecules. Semi-structured has children signals and se-
quences. Signals has children images, audio, and video. Sequences has children natural language, bio-
logical sequences, and software source code.
One of Unconditionally’s possible datasets is from a household survey. It is an example of the static
tabular data modality and part of the structured data category of modalities.1 It is static because it is not
following some time-varying phenomenon. The columns are different attributes that can be used as
features and labels, and the rows are different records or sample points, i.e. different people and
households. The columns contain numeric values, ordinal values, categorical values, strings of text, and
special values such as dates. Although tabular data might look official, pristine, and flawless at first
glance due to its nice structure, it can hide all sorts of false assumptions, errors, omissions, and biases.
Time series constitute another modality that can be stored in tabular form. As measurements at
regular intervals in time (usually of numeric values), such data can be used to model trends and forecast
quantities in time. Longitudinal or panel data, repeated measurements of the same individuals over time,
are often time series. Household surveys are rarely longitudinal however, because they are logistically
difficult to conduct. Cross-sectional surveys, simply several tabular datasets taken across time but without
any linking, are logistically much easier to collect because the same individuals do not have to be tracked
down.
Another of Unconditionally’s possible datasets is mobile money transactions. Time stamps are a critical part of transaction data, but transactions are not a time series because they do not occur at regular intervals.
Every mobile money customer asynchronously generates an event whenever they receive or disburse
funds, not mediated by any common clock across customers. Transaction data is an example of the event
stream modality. In addition to a time stamp, event streams contain additional values that are measured
such as monetary amount, recipient, and items purchased. Other event streams include clinical tests
conducted in a hospital and social services received by clients.
1
There are modalities with even richer structure than tabular data, such as graphs that can represent social networks and the
structure of chemical molecules.
Unconditionally can estimate poverty using satellite imagery. Digital images are the modality that
spurred a lot of the glory showered upon machine learning in the past several years. They are part of the
semi-structured branch of modalities. In general, images can be regular optical images or ones measured
in other ranges of the electromagnetic spectrum. They are composed of numeric pixel values across
various channels in their raw form and tend to contain a lot of spatial structure. Video, an image sequence
over time, has a lot of spatiotemporal structure. Modern machine learning techniques learn these spatial
and spatiotemporal representations by being trained on vast quantities of data, which may themselves
contain unwanted biases and unsuitable content. (The model for the specified problem is a fine-tuned
version of the model pre-trained on the large-scale, more generic dataset. These large pre-trained
models are referred to as foundation models.) Videos may also contain audio signals.
One of your colleagues at Unconditionally imagines that although less likely, the content of text
messages and social media posts might predict a person’s poverty level. This modality is natural language
or text. Longer documents, including formal documents and publications, are a part of the same
modality. The syntax, semantics, and pragmatics of human language are complicated. One way of dealing
with text includes parsing the language and creating a syntax tree. Another way is representing text as
sparse structured data by counting the existence of individual words, pairs of words, triplets of words,
and so on in a document. These bag-of-words or n-gram representations are currently being superseded
by a third way: sophisticated large language models, a type of foundation model, trained on vast corpora
of documents. Just like in learning spatial representations of images, the learning of language models
can be fraught with many different biases, especially when the norms of the language in the training
corpus do not match the norms of the application. A language model trained on a humongous pile of
newspaper articles from the United States will typically not be a good foundation for a representation
for short, informal, code-mixed text messages in East Africa.2
Typically, structured modalities are their own representations for modeling and correspond to
deliberative decision making by people, whereas semi-structured modalities require sophisticated
transformations and correspond to instinctive perception by people. These days, the sophisticated
transformations for semi-structured data tend to be learned using deep neural networks that are trained
on unimaginably large datasets. This process is known as representation learning. Any biases present in
the very large background datasets carry over to models fine-tuned on a problem-specific dataset
because of the originally opaque and uncontrollable representation learning leading to the foundation
model. As such, with semi-structured data, it is important that you not only evaluate the problem-
specific dataset, but also the background dataset. With structured datasets, it is more critical that you
analyze data preparation and feature engineering.3
2
Other strings with language-like characteristics such as DNA or amino acid sequences and software source code are currently
being approached through techniques similar to natural language processing.
3
There are new foundation models for structured modalities. Inkit Padhi, Yair Schiff, Igor Melnyk, Mattia Rigotti, Youssef
Mroueh, Pierre Dognin, Jerret Ross, Ravi Nair, and Erik Altman. “Tabular Transformers for Modeling Multivariate Time Se-
ries.” In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. Jun. 2021, pp. 3565–3569.
You and the other data scientists at Unconditionally must evaluate your data sources carefully to separate the wheat from the chaff: only including the good stuff.
including the good stuff. There are many different categories of data sources, which imply different
considerations in assessing their quality.
4
Alexandra Olteanu, Carlos Castillo, Fernando Diaz, and Emre Kıcıman. “Social Data: Biases, Methodological Pitfalls, and Ethi-
cal Boundaries.” In: Frontiers in Big Data 2.13 (Jul. 2019).
administrative data, these sources are not produced for the problem specification, but are repurposed
for predictive or causal modeling. Many a time, just like administrative data, social data is only a proxy
for what the problem specification requires and can be misleading or even outright wrong. The social
media content of potential recipients of Unconditionally’s cash transfer you analyze may be like this.
Since social data is created for purposes like communicating, seeking jobs, and maintaining
friendships, the quality, accuracy, and reliability of this data source may be much less than
administrative data. Text may include various slang, non-standard dialects, misspellings, and biases.
Other modalities of social data are riddled with vagaries of their own. The information content of
individual data points might not be very high. Also, there can be a large amount of sampling bias
because not all populations participate in social platforms to the same extent. In particular,
marginalized populations may be invisible in some types of social data.
4.2.4 Crowdsourcing
Supervised learning requires both features and labels. Unlabeled data is much easier to acquire than
labeled data. Crowdsourcing is a way to fill the gap: crowd workers label the sentiment of sentences,
determine whether a piece of text is hate speech, draw boxes around objects in images, and so on.5 They
evaluate explanations and the trustworthiness of machine learning systems. They help researchers
better understand human behavior and human-computer interaction. Unconditionally contracted with
crowd workers to label the type of roof of homes in satellite images.
In many crowdsourcing platforms, the workers are low-skill individuals whose incentive is
monetary. They sometimes communicate with each other outside of the crowdsourcing platform and
behave in ways that attempt to game the system to their benefit. The wages of crowd workers may be
low, which raises ethical concerns. They may be unfamiliar with the task or the social context of the task,
which may yield biases in labels. For example, crowd workers may not have the context to know what
constitutes a household in rural East Africa and may thus introduce biases in roof labeling. (More details
on this example later.) Gaming the system may also yield biases. Despite some platforms having quality
control mechanisms, if you design the labeling task poorly, you will obtain poor quality data. In some
cases, especially those involving applications with a positive social impact, the crowdworkers may have
higher skill and be intrinsically motivated to do a conscientious job. Nevertheless, they may still be
unfamiliar with the social context or have other biases.
5
Jennifer Wortman Vaughan. “Making Better Use of the Crowd: How Crowdsourcing Can Advance Machine Learning Re-
search.” In: Journal of Machine Learning Research 18.193 (May 2018).
Another way to perform data augmentation is through generative machine learning: using the given
data to train a data generator that outputs many more samples to then be used for training a classifier.
Ideally, these generated data points should be as diverse as the given dataset. However, a big problem
known as mode collapse, which produces samples from only one part of the probability distribution of the
given data, can yield severe biases in the resulting dataset.
4.2.6 Conclusion
Different data sources are useful in addressing various problem specifications, but all have biases of one
kind or the other. Most data sources are repurposed. You must take care when selecting among data
sources by paying attention to the more prevalent biases for any given data source. The next section
describes biases from the perspective of their different kinds and where in the lifecycle they manifest.
Figure 4.3. A mental model of spaces, validities, and biases. Accessible caption. A sequence of four spaces,
each represented as a cloud. The construct space leads to the observed space via the measurement
process. The observed space leads to the raw data space via the sampling process. The raw data space
leads to the prepared data space via the data preparation process. The measurement process contains
social bias, which threatens construct validity. The sampling process contains representation bias and
temporal bias, which threatens external validity. The data preparation process contains data prepara-
tion bias and data poisoning, which threaten internal validity.
There are three main kinds of validity: (1) construct validity, (2) external validity, and (3) internal
validity.6 Construct validity is whether the data really measures what it ought to measure. External
validity is whether analyzing data from a given population generalizes to other populations. Internal
validity is whether there are any errors in the data processing.
The various kinds of validity are threatened by various kinds of bias. There are many categorizations
of types of bias, but for simplicity, let’s focus on just five.7 Social bias threatens construct validity,
representation bias and temporal bias threaten external validity, and data preparation bias and data poisoning
threaten internal validity. These biases are detailed throughout this section.
It is useful to also imagine different spaces in which various abstract and concrete versions of the
data exist: a construct space, an observed space, a raw data space, and a prepared data space. The construct
space is an abstract, unobserved, theoretical space in which there are no biases. Hakuna matata, the East
African problem-free philosophy, reigns in this ideal world. The construct space is operationalized to
the observed space through the measurement of features and labels.8 Data samples collected from a
specific population in the observed space live in the raw data space. The raw data is processed to obtain
the final prepared data to train and test machine learning models.
6
Alexandra Olteanu, Carlos Castillo, Fernando Diaz, and Emre Kıcıman. “Social Data: Biases, Methodological Pitfalls, and Ethi-
cal Boundaries.” In: Frontiers in Big Data 2.13 (Jul. 2019).
7
Harini Suresh and John Guttag. “A Framework for Understanding Sources of Harm Throughout the Machine Learning Life
Cycle.” In: Proceedings of the ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization. Oct. 2021, p. 17.
8
Abigail Z. Jacobs and Hanna Wallach. “Measurement and Fairness.” In: Proceedings of the ACM Conference on Fairness, Accounta-
bility, and Transparency. Mar. 2021, pp. 375–385.
9
Lav R. Varshney and Kush R. Varshney. “Decision Making with Quantized Priors Leads to Discrimination.” In: Proceedings of the
IEEE 105.2 (Feb. 2017), pp. 241–255.
some of these social biases by taking input from a diverse panel of informants in the data understanding
phase of the lifecycle. (The role of a diverse team in data understanding is covered in greater depth in
Chapter 16.)
4.3.6 Conclusion
The different categories of bias threaten different types of validity. Appraising data and preparing data
are difficult tasks that must be done comprehensively without taking shortcuts. More diverse teams may
be able to brainstorm more threats to validity than less diverse teams. Assessing data requires a careful
consideration not only of the modality and source, but also of the measurement, sampling, and
preparation. The mental model of biases provides you with a checklist to go through before using a
dataset to train a machine learning model. Have you evaluated social biases? Is your dataset
representative? Could there be any temporal dataset shifts over time? Have any data preparation steps
accidentally introduced any subtle biases? Has someone snuck in, accessed the data, and changed it for
their malicious purpose?
What should you do if any bias is found? Some biases can be overcome by collecting better data or
redoing preparation steps better. Some biases will slip through and contribute to epistemic uncertainty
in the modeling phase of the machine learning lifecycle. Some of the biases that have slipped through
can be mitigated in the modeling step explicitly through defense algorithms or implicitly by being robust
to them. You’ll learn how in Part 4 of the book.
4.4 Summary
▪ Data is the prerequisite for modeling in machine learning systems. It comes in many forms from
various sources and can pick up many different biases along the way.
▪ It is critical to ascertain which biases are present in a dataset because they jeopardize the validity
of the system solving the specified problem.
▪ Evaluating structured datasets involves evaluating the dataset itself, including a focus on data preparation. Evaluating semi-structured datasets that are represented by foundation models also requires evaluating the large background datasets on which those foundation models were pre-trained.
▪ No matter how careful one is, there is no completely unbiased dataset. Nevertheless, the more
effort put in to catching and fixing biases before modeling, the better.
▪ Trustworthy machine learning systems should be designed to mitigate biases that slip through
the data understanding and data preparation phases of the lifecycle.
5
Privacy and Consent
A global virus pandemic is starting to abate, and different organizations are scrambling to put together
‘back-to-work’ plans to allow employees to return to their workplace after several months in lockdown
at home. Toward this end, organizations are evaluating a (fictional) machine learning-based mobile app
named TraceBridge. It supports the return to the office by collecting and modeling location traces,
health-related measurements, other social data (e.g. internal social media and calendar invitations
among employees), and administrative data (e.g. space planning information and org charts), to
facilitate digital contact tracing: the process of figuring out disease-spreading interactions between an
infected person and others. Is TraceBridge the right solution? Will organizations be able to re-open
safely or will the employees be homebound for even more seemingly unending months?
The data that TraceBridge collects, even if free from many biases investigated in Chapter 4, is not
free from concern. Does TraceBridge store the data from all employees in a centralized database? Who
has access to the data? What would be revealed if there were a data breach? Have the employees been
informed about possible uses of the data and agreed to them? Does the organization have permission to
share their data with other organizations? Can employees opt out of the app or would that jeopardize
their livelihood? Who gets to know that an employee has tested positive for the disease? Who gets to
know their identity and their contacts?
The guidance to data scientists in Chapter 4 was to be wary of biases that creep into data and problem
formulations because of the harms they can cause. In this chapter, the thing to be wary about is whether
it is even right to use certain data for reasons of consent, power, and privacy.1 Employers are now
evaluating the app. However, when the problem owners, developers, and data scientists of TraceBridge
were creating the app, they had to:
1
Eun Seo Jo and Timnit Gebru. “Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning.” In:
Proceedings of the ACM Conference on Fairness, Accountability, and Transparency. Barcelona, Spain, Jan. 2020, pp. 306–316.
2
Shakir Mohamed, Marie-Therese Png, and William Isaac. “Decolonial AI: Decolonial Theory as Sociotechnical Foresight in
Artificial Intelligence.” In: Philosophy and Technology 33 (Jul. 2020), pp. 659–684.
even argue that people should be compensated for their personal data because they are selling their
privacy.3 In short, data is power.
Data used in machine learning is often fraught with power and consent issues because it is often
repurposed from other uses or is so-called data exhaust: byproducts from people’s digital activities. For
example, many large-scale image datasets used for training computer vision models are scraped from
the internet without explicit consent from the people who posted the images.4 Although there may be
implicit consent through vehicles such as Creative Commons licenses, a lack of explicit consent can
nevertheless be problematic. Sometimes copyright laws are violated in scraped and repurposed data.
Why does this happen? It is almost always due to system designers taking shortcuts to gather large
datasets and show value quickly without giving thought to power and consent. And it is precisely the
most powerful who tend to be least cognizant of issues of power. People from marginalized, minoritized,
and otherwise less powerful backgrounds tend to have more knowledge of the perspectives of both the
powerful and the powerless.5 This concept, known as the epistemic advantage of people with lived
experience of marginalization, is covered in greater detail in Chapter 16. Similarly, except in regulated
application domains such as health care, privacy issues have usually been an afterthought due to
convenience. Things have started to change due to comprehensive laws such as the General Data
Protection Regulation enacted in the European Economic Area in 2018.
In summary, problem owners and data scientists should not have any calculus to weigh issues of
power, consent and privacy against conveniences in data collection. For the fourth attribute of trust
(aligned purpose), trustworthy machine learning systems require that data be used consensually,
especially from those who could be subject to exploitation. No ifs, ands, or buts!
3
Nicholas Vincent, Yichun Li, Renee Zha, and Brent Hecht. “Mapping the Potential and Pitfalls of ‘Data Dividends’ as a Means of
Sharing the Profits of Artificial Intelligence.” arXiv:1912.00757, 2019.
4
Abeba Birhane and Vinay Uday Prabhu. “Large Image Datasets: A Pyrrhic Win for Computer Vision?” In: Proceedings of the IEEE
Winter Conference on Applications of Computer Vision. Jan. 2021, pp. 1536–1546.
5
Miliann Kang, Donovan Lessard, and Laura Heston. Introduction to Women, Gender, Sexuality Studies. Amherst, Massachusetts,
USA: University of Massachusetts Amherst Libraries, 2017.
preserving data mining is also known as interactive anonymization. TraceBridge may want to do either or
both: publishing datasets for examination by organizational leaders or state regulators, and issuing
contact tracing alerts without exposing individually-identifiable data. They have to go down both paths
and learn about the appropriate technical approaches in each case: syntactic anonymity for data
publishing and differential privacy for data mining.6
There are three main categories of variables when dealing with privacy: (1) identifiers, (2) quasi-
identifiers, and (3) sensitive attributes. Identifiers directly reveal the identity of a person. Examples
include the name of the person, national identification numbers such as the social security number, or
employee serial numbers. Identifiers should be dropped from a dataset to achieve privacy, but such
dropping is not the entire solution. In contrast, quasi-identifiers do not uniquely identify people on their
own, but can reveal identity when linked together through a process known as re-identification. Examples
are gender, birth date, postal code, and group membership. Sensitive attributes are features that people
do not want revealed. Examples are health status, voting record, salary, and movement information.
Briefly, syntactic anonymity works by modifying quasi-identifiers to reduce their information content,
including suppressing them, generalizing them, and shuffling them. Differential privacy works by
adding noise to sensitive attributes. A mental model for the two modes of privacy is given in Figure 5.1.
To make this mental model more concrete, let’s see how it applies to an actual sample dataset of
employees and their results on a diagnostic test for the virus (specifically the cycle threshold (CT) value
of a polymerase chain reaction test), which we treat as sensitive. The original dataset, the transformed
dataset after k-anonymity with 𝑘 = 3, and the transformed dataset after differential privacy are shown
in Table 5.1, Table 5.2, and Table 5.3 (details on k-anonymity and differential privacy are forthcoming).
Organization CT Value
AI 12
AI 20
AI 35
Hybrid Cloud 31
Hybrid Cloud 19
Hybrid Cloud 27
6
John S. Davis II and Osonde A. Osoba. “Privacy Preservation in the Age of Big Data: A Survey.” RAND Justice, Infrastructure, and
Environment Working Paper WR-1161, 2016.
Table 5.3. The values returned for queries under differential privacy with Laplace noise added to the sensitive at-
tribute in the sample original dataset.
Figure 5.1. A mental model of privacy-preservation broken down into two branches: data publishing with syntac-
tic anonymity and data mining with differential privacy. Accessible caption. A hierarchy diagram with pri-
vacy-preservation at its root. One child is data publishing, which is done when you release dataset. The
only child of data publishing is syntactic anonymity. Syntactic anonymity is illustrated by a table with
columns for quasi-identifiers and sensitive attributes. By suppressing, generalizing, or shuffling quasi-
identifiers, some rows have been reordered and others have taken on a different value. The other child
of privacy-preservation is data mining, which is done when you query dataset in a controlled manner.
The only child of data mining is differential privacy. Differential privacy is also illustrated by a table
with columns for quasi-identifiers and sensitive attributes. By adding noise to sensitive attributes, all
the rows are noisy.
Through k-anonymity,7 the re-identification risk is reduced from that of the full dataset to the number of clusters. With l-diversity8 or t-closeness9 added on top of k-anonymity, the predictability of the sensitive attributes from the anonymized quasi-identifiers is constrained. These relationships are valuable ways of reasoning about what information is and is not revealed due to anonymization. By expressing them in the common statistical language of information theory,10 they can be examined and studied alongside other problem specifications and requirements of trustworthiness in broader contexts.

7. Latanya Sweeney. “k-Anonymity: A Model for Protecting Privacy.” In: International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10.5 (Oct. 2002), pp. 557–570.
8. Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam. “l-Diversity: Privacy Beyond k-Anonymity.” In: ACM Transactions on Knowledge Discovery from Data 1.1 (Mar. 2007), p. 3.
9. Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian. “t-Closeness: Privacy Beyond k-Anonymity and l-Diversity.” In: Proceedings of the IEEE International Conference on Data Engineering. Istanbul, Turkey, Apr. 2007, pp. 106–115.
10. Michele Bezzi. “An Information Theoretic Approach for Privacy Metrics.” In: Transactions on Data Privacy 3.3 (Dec. 2010), pp. 199–215.
A randomized query result 𝑌̃ satisfies 𝜖-differential privacy if, for any two datasets 𝑊1 and 𝑊2 that differ in only one person's row and any query output value 𝑦:

$P\big(\tilde{Y}(W_1) = y\big) \leq e^{\epsilon}\, P\big(\tilde{Y}(W_2) = y\big).$

Equation 5.1

The 𝜖 is a tiny positive parameter saying how much privacy we want. The value of $e^{\epsilon}$ becomes closer and closer to one as 𝜖 gets closer and closer to zero. When 𝜖 is zero, the two probabilities are required to be
equal and thus the two datasets have to be indistinguishable, which is exactly the sense of anonymity
that differential privacy is aiming for.12 You can’t tell the difference in the query result when you add the
new person in, so you can’t figure out their sensitive attribute any more than what you could have figured
out in general from the dataset.
11. https://rebootrx.org/covid-cancer
12. We should really write $P(\tilde{Y}(W_1) \in S) \leq e^{\epsilon}\, P(\tilde{Y}(W_2) \in S)$ for some interval or other set 𝑆 because if 𝑌̃ is a continuous random variable, then its probability of taking any specific value is always zero. It only has a specific probability when defined over a set.
The main trick in differential privacy is solving for the kind of noise and its strength to add to 𝑦(𝑊).
For lots of query functions, the best kind of noise comes from the Laplace distribution.13 As stated
earlier, some queries are easier than others. This easiness is quantified using global sensitivity, which
measures how much a single row of a dataset impacts the query value. Queries with smaller global
sensitivity need lower strength noise to achieve 𝜖-differential privacy.
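As a minimal sketch of this idea, assuming numpy and a made-up counting query (an illustration, not a production mechanism), the Laplace mechanism adds noise whose scale is the global sensitivity divided by 𝜖:

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(true_answer, global_sensitivity, epsilon):
    """Return an epsilon-differentially private answer by adding Laplace noise
    with scale b = global_sensitivity / epsilon (variance 2 * b**2)."""
    b = global_sensitivity / epsilon
    return true_answer + rng.laplace(loc=0.0, scale=b)

# Hypothetical query: how many employees have a CT value below 25?
ct_values = np.array([12, 20, 35, 31, 19, 27])
true_count = int(np.sum(ct_values < 25))  # = 3

# A counting query changes by at most one when a single row is added or
# removed, so its global sensitivity is 1.
for epsilon in [0.1, 1.0, 10.0]:
    print(epsilon, laplace_mechanism(true_count, global_sensitivity=1, epsilon=epsilon))
```

Smaller 𝜖 produces larger noise and therefore stronger privacy but noisier answers, which foreshadows the privacy–utility tradeoff discussed at the end of this chapter.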
Just like with syntactic privacy, it can be easier to think about differential privacy alongside other
problem specifications in trustworthy machine learning like accuracy, fairness, robustness, and
explainability when expressed in terms of information theory rather than the more specialized
terminology used in defining it earlier. To do so, we also need to say that the dataset 𝑊 is a random
variable, so the probabilities that we want to be close to each other are the noised query results
conditioned on the dataset realizations 𝑃(𝑌̃ ∣ 𝑊 = 𝑤1 ) and 𝑃(𝑌̃ ∣ 𝑊 = 𝑤2 ). Then we can pose our objective
of differential privacy as wanting the mutual information between the dataset and noisy query result
𝐼(𝑊, 𝑌̃) to be minimized. With some more specifics added to minimizing the mutual information, we can
get back a relationship in terms of the 𝜖 of 𝜖-differential privacy.14 The idea of examining the mutual
information between the dataset and the query is as follows. Since mutual information measures the
reduction in uncertainty about the dataset by the knowledge of the query, having zero (or small) mutual
information indicates that we don’t learn anything about the dataset’s composition from the query
result, which is exactly the idea of differential privacy.
Differential privacy is intended to prevent attributing the change of the query’s value to any one
person’s row of data. Nevertheless, one criticism of differential privacy as imagined in its information
theory presentation is that it can allow the value of sensitive attributes to be inferred if there are
correlations or associations among the rows. Stricter information-theoretic conceptions of differential
privacy have been developed that require the rows of the dataset to be independent.
5.2.3 Conclusion
TraceBridge has a few different use cases as part of their app and contact tracing system. One is
publishing sensitive data for the leaders of the organization, external regulators, or auditors to look at.
A second is to interactively query statistics from the sensitive data without revealing things about
individuals. Each use has different appropriate approaches: syntactic anonymity for the first and
differential privacy for the second, along with different requirements on the system design and the
infrastructure required. Existing legal protections in various jurisdictions and application domains are
mostly for the first use case (data publishing), but the regulations themselves are usually unclear on
their precise notion of privacy. TraceBridge may have to go with both approaches to privacy in
developing a trusted system.
We’ve reached the end of this section and haven’t talked about the tradeoff of privacy with utility. All measures and approaches for providing privacy should be evaluated in conjunction with how the data is going to be used. It is a balance. The tradeoff parameters are there for a reason. The usefulness of a dataset after k-anonymization is usually pretty good for a decent-sized 𝑘, but might not be so great after achieving t-closeness for a decent 𝑡. Similarly, a criticism of differential privacy for typical queries is that the usefulness of the query results is not that great for a small 𝜖 (adding a large magnitude of noise). However, there are no blanket statements to be made: these intuitions have to be appraised for specific scenarios and datasets, and privacy parameters have to be chosen carefully, without taking shortcuts and with input from multiple stakeholders.

13. The pdf of the Laplace distribution is $p_X(x) = \frac{1}{2b} \exp\!\left(-\frac{|x-\mu|}{b}\right)$, where 𝜇 is the mean and 𝑏 is a scale parameter such that the variance is 2𝑏².
14. Darakshan J. Mir. “Information-Theoretic Foundations of Differential Privacy.” In: Foundations and Practice of Security. Montreal, Canada, Oct. 2012, pp. 374–381.
5.4 Summary
▪ Data is a valuable resource that comes from people. The use of this data should be consensually
obtained. If there is no consent, do not proceed.
▪ It is easy for data scientists to set up machine learning systems that exploit and subjugate
vulnerable individuals and groups. Do not do it. Instead, be careful, thoughtful, and take input
from powerless groups.
▪ By consenting to the use of their data, people give up their privacy. Various methods can be used
to preserve their privacy.
▪ Syntactic anonymization methods group together individuals with similar quasi-identifiers and
then obfuscate those quasi-identifiers. These methods are useful when publishing individual-
level data.
▪ Differential privacy methods add noise to queries about sensitive attributes when users can only
interact with the data through known and fixed queries. These methods are useful when
statistically analyzing the data or computing models from the data.
6
Detection Theory
Let’s continue from Chapter 3, where you are the data scientist building the loan approval model for the
(fictional) peer-to-peer lender ThriveGuild. As then, you are in the first stage of the machine learning
lifecycle, working with the problem owner to specify the goals and indicators of the system. You have
already clarified that safety is important, and that it is composed of two parts: basic performance
(minimizing aleatoric uncertainty) and reliability (minimizing epistemic uncertainty). Now you want to
go into greater depth in the problem specification for the first part: basic performance. (Reliability
comes in Part 4 of the book.)
What are the different quantitative metrics you could use in translating the problem-specific goals
(e.g. expected profit for the peer-to-peer lender) to machine learning quantities? Once you’ve reached
the modeling stage of the lifecycle, how would you know you have a good model? Do you have any special
considerations when producing a model for risk assessment rather than simply offering an
approve/deny output?
Machine learning models are decision functions: based on the borrower’s features, they decide a
response that may lead to an autonomous approval/denial action or be used to support the decision
making of the loan officer. The use of decision functions is known as statistical discrimination because
we are distinguishing or differentiating one class label from the other. You should contrast the use of the
term ‘discrimination’ here with unwanted discrimination that leads to systematic advantages to certain
groups in the context of algorithmic fairness in Chapter 10. Discrimination here is simply telling the
difference between things. Your favorite wine snob talking about their discriminative palate is a distinct
concept from racial discrimination.
This chapter begins Part 3 of the book on basic modeling (see Figure 6.1 to remind yourself of the lay
of the land) and uses detection theory, the study of optimal decision making in the case of categorical
output responses,1 to answer the questions above that you are struggling with.
1. Estimation theory is the study of optimal decision making in the case of continuous output responses.
Figure 6.1. Organization of the book. This third part focuses on the first attribute of trustworthiness, competence
and credibility, which maps to machine learning models that are well-performing and accurate. Accessible cap-
tion. A flow diagram from left to right with six boxes: part 1: introduction and preliminaries; part 2:
data; part 3: basic modeling; part 4: reliability; part 5: interaction; part 6: purpose. Part 3 is high-
lighted. Parts 3–4 are labeled as attributes of safety. Parts 3–6 are labeled as attributes of trustworthi-
ness.
In working through this chapter, you will be:
▪ selecting metrics to quantify the basic performance of your decision function (including ones
that summarize performance across operating conditions),
▪ testing whether your decision function is as good as it could ever be, and
▪ differentiating performance in risk assessment problems from performance in binary decision
problems.
2. For ease of explanation in this chapter and in later parts of the book, we mostly stick with the case of two label values and do not delve much into the case with more than two label values.
3. This is also the basic task of supervised machine learning. In supervised learning, the decision function is based on data samples from (𝑋, 𝑌) rather than on the distributions; supervised learning is coming up soon enough in the next chapter, Chapter 7.
These are known as true negatives (TN), false negatives (FN), true positives (TP), and false positives (FP),
respectively. A true negative is denying an applicant who should be denied according to some ground
truth, a false negative is denying an applicant who should be approved, a true positive is approving an
applicant who should be approved, and a false positive is approving an applicant who should be denied.
Let’s organize these events in a table known as the confusion matrix:
                       𝑌 = 1      𝑌 = 0
𝑦̂(𝑋) = 1               TP         FP
𝑦̂(𝑋) = 0               FN         TN

Equation 6.1

The probabilities of these four events, conditioned on the true label, are:

$p_{\mathrm{TN}} = P(\hat{y}(X) = 0 \mid Y = 0), \quad p_{\mathrm{FN}} = P(\hat{y}(X) = 0 \mid Y = 1), \quad p_{\mathrm{TP}} = P(\hat{y}(X) = 1 \mid Y = 1), \quad p_{\mathrm{FP}} = P(\hat{y}(X) = 1 \mid Y = 0).$

Equation 6.2
These conditional probabilities are nothing more than a direct implementation of the definitions of the
events. The probability 𝑝TN is known as the true negative rate as well as the specificity and the selectivity.
The probability 𝑝FN is known as the false negative rate as well as the probability of missed detection and
the miss rate. The probability 𝑝TP is known as the true positive rate as well as the probability of detection,
the recall, the sensitivity, and the power. The probability 𝑝FP is known as the false positive rate as well as
the probability of false alarm and the fall-out. The probabilities can be organized in a slightly different
table as well:
𝑃( 𝑦̂(𝑋) ∣ 𝑌 )          𝑌 = 1      𝑌 = 0
𝑦̂(𝑋) = 1               𝑝TP        𝑝FP
𝑦̂(𝑋) = 0               𝑝FN        𝑝TN

Equation 6.3
These probabilities give you some quantities by which to understand the performance of the decision
function 𝑦̂. Selecting one over the other involves thinking about the events themselves and how they
relate to the real-world problem. A false positive, approving an applicant who should be denied, means
that a ThriveGuild lender has to bear the cost of a default, so it should be kept small. A false negative,
denying an applicant who should be approved, is a lost opportunity for ThriveGuild to make a profit
through the interest they charge.
The events above are conditioned on the true label. Conditioning on the predicted label also yields
events and probabilities of interest in characterizing performance:
𝑃( 𝑌 ∣ 𝑦̂(𝑋) )          𝑌 = 1      𝑌 = 0
𝑦̂(𝑋) = 1               𝑝PPV       𝑝FDR
𝑦̂(𝑋) = 0               𝑝FOR       𝑝NPV

Equation 6.4
These conditional probabilities are reversed from Equation 6.2. The probability 𝑝𝑁𝑃𝑉 is known as the
negative predictive value. The probability 𝑝𝐹𝑂𝑅 is known as the false omission rate. The probability 𝑝𝑃𝑃𝑉 is
known as the positive predictive value as well as the precision. The probability 𝑝𝐹𝐷𝑅 is known as the false
discovery rate. If you care about the quality of the decision function, focus on the first set (𝑝TN , 𝑝FN , 𝑝TP ,
𝑝FP ). If you care about the quality of the predictions, focus on the second set (𝑝NPV , 𝑝FOR , 𝑝PPV , 𝑝FDR ).
When you need to numerically compute these probabilities, apply the decision function to several
i.i.d. samples of (𝑋, 𝑌) and denote the number of TN, FN, TP, and FP events as 𝑛TN , 𝑛FN , 𝑛TP , and 𝑛FP ,
respectively. Then use the following estimates of the probabilities:
$p_{\mathrm{TP}} \approx \frac{n_{\mathrm{TP}}}{n_{\mathrm{TP}} + n_{\mathrm{FN}}} \qquad p_{\mathrm{FP}} \approx \frac{n_{\mathrm{FP}}}{n_{\mathrm{FP}} + n_{\mathrm{TN}}}$

$p_{\mathrm{FN}} \approx \frac{n_{\mathrm{FN}}}{n_{\mathrm{FN}} + n_{\mathrm{TP}}} \qquad p_{\mathrm{TN}} \approx \frac{n_{\mathrm{TN}}}{n_{\mathrm{TN}} + n_{\mathrm{FP}}}$

$p_{\mathrm{PPV}} \approx \frac{n_{\mathrm{TP}}}{n_{\mathrm{TP}} + n_{\mathrm{FP}}} \qquad p_{\mathrm{FDR}} \approx \frac{n_{\mathrm{FP}}}{n_{\mathrm{FP}} + n_{\mathrm{TP}}}$

$p_{\mathrm{FOR}} \approx \frac{n_{\mathrm{FN}}}{n_{\mathrm{FN}} + n_{\mathrm{TN}}} \qquad p_{\mathrm{NPV}} \approx \frac{n_{\mathrm{TN}}}{n_{\mathrm{TN}} + n_{\mathrm{FN}}}$

Equation 6.5
As an example, let’s say that ThriveGuild makes the following number of decisions: 𝑛TN = 1234, 𝑛FN =
73, 𝑛TP = 843, and 𝑛FP = 217. You can estimate the various performance probabilities by plugging these
numbers into the respective expressions above. The results are 𝑝TN ≈ 0.85, 𝑝FN ≈ 0.08, 𝑝TP ≈ 0.92, 𝑝FP ≈
0.15, 𝑝NPV ≈ 0.94, 𝑝FOR ≈ 0.06, 𝑝PPV ≈ 0.80, and 𝑝FDR ≈ 0.20. These are all reasonably good values, but must
ultimately be judged according to the ThriveGuild problem owner's goals and objectives.
The probability of error, a first summary metric of performance, is the prior-weighted combination of the false positive rate and the false negative rate:

$p_{\mathrm{E}} = p_0\, p_{\mathrm{FP}} + p_1\, p_{\mathrm{FN}},$

Equation 6.6

where 𝑝0 = 𝑃(𝑌 = 0) and 𝑝1 = 𝑃(𝑌 = 1) are the prior probabilities of the two class labels.
The balanced probability of error, also known as the balanced error rate, is the unweighted average of the
false negative rate and false positive rate:
$p_{\mathrm{BE}} = \tfrac{1}{2}\, p_{\mathrm{FP}} + \tfrac{1}{2}\, p_{\mathrm{FN}}.$
Equation 6.7
They summarize the basic performance of the decision function. Balancing is useful when there are a
lot more data points with one label than the other, and you care about each type of error equally. Accuracy,
the complement of the probability of error: 1 − 𝑝E , and balanced accuracy, the complement of the balanced
probability of error: 1 − 𝑝BE, are sometimes easier for problem owners to appreciate than error rates.
The 𝐹1-score, the harmonic mean of 𝑝TP and 𝑝PPV , is an accuracy-like summary measure to
characterize the quality of a prediction rather than the decision function:
$F_1 = 2\, \frac{p_{\mathrm{TP}}\, p_{\mathrm{PPV}}}{p_{\mathrm{TP}} + p_{\mathrm{PPV}}}.$
Equation 6.8
Continuing the example from before with 𝑝TP ≈ 0.92 and 𝑝PPV ≈ 0.80, let ThriveGuild’s prior
probability of receiving applications to be denied according to some ground truth be 𝑝0 = 0.65 and
applications to be approved be 𝑝1 = 0.35. Then, plugging in to the relevant equations above, you’ll find
ThriveGuild to have 𝑝E ≈ 0.13, 𝑝BE ≈ 0.11, and 𝐹1 ≈ 0.86. Again, these are reasonable values that may be
deemed acceptable to the problem owner.
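The following sketch, in plain Python, reproduces the ThriveGuild numbers above directly from the four counts using Equations 6.5 through 6.8; small differences from the values quoted in the text are just rounding.

```python
n_TN, n_FN, n_TP, n_FP = 1234, 73, 843, 217

p_TP = n_TP / (n_TP + n_FN)   # true positive rate (recall): ~0.92
p_FP = n_FP / (n_FP + n_TN)   # false positive rate: ~0.15
p_FN = n_FN / (n_FN + n_TP)   # false negative rate: ~0.08
p_TN = n_TN / (n_TN + n_FP)   # true negative rate: ~0.85
p_PPV = n_TP / (n_TP + n_FP)  # precision: ~0.80
p_NPV = n_TN / (n_TN + n_FN)  # negative predictive value: ~0.94

# Prior probabilities assumed in the running example.
p_0, p_1 = 0.65, 0.35

p_E = p_0 * p_FP + p_1 * p_FN           # probability of error: ~0.13
p_BE = 0.5 * p_FP + 0.5 * p_FN          # balanced error rate: ~0.11
f1 = 2 * p_TP * p_PPV / (p_TP + p_PPV)  # F1-score (harmonic mean of recall and precision)

print(round(p_E, 2), round(p_BE, 2), round(f1, 2))
```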
As the data scientist, you can get pretty far with these abstract TN, FN, TP, and FP events, but they
have to be put in the context of the problem owner’s goals. ThriveGuild cares about making good bets on
borrowers so that they are profitable. More generally across real-world applications, error events yield
significant consequences to affected people including loss of life, loss of liberty, loss of livelihood, etc.
Therefore, to truly characterize the performance of a decision function, it is important to consider the
costs associated with the different events. You can capture these costs through a cost function 𝑐(𝑦̂(𝑋), 𝑌) and denote the costs as 𝑐(0,0) = 𝑐00 for a true negative, 𝑐(1,0) = 𝑐10 for a false positive, 𝑐(1,1) = 𝑐11 for a true positive, and 𝑐(0,1) = 𝑐01 for a false negative.
Taking costs into account, the characterization of performance for the decision function is known as
the Bayes risk 𝑅:
𝑅 = (𝑐10 − 𝑐00 )𝑝0 𝑝𝐹𝑃 + (𝑐01 − 𝑐11 )𝑝1 𝑝𝐹𝑁 + 𝑐00 𝑝0 + 𝑐11 𝑝1 .
Equation 6.9
Breaking the equation down, you’ll see that the two error probabilities, 𝑝𝐹𝑃 and 𝑝𝐹𝑁, are the main components, multiplied by their relevant prior probabilities and costs. The costs of the correct (non-error) events appear multiplied only by the prior probabilities. The Bayes risk is the performance metric most often used in
finding optimal decision functions. Actually finding the decision function is known as solving the
Bayesian detection problem. Eliciting the cost function 𝑐(⋅,⋅) for a given real-world problem from the
problem owner is part of value alignment, described in Chapter 14.
A mental model or roadmap to hold throughout the rest of the chapter, shown in Figure 6.2, is that the Bayes risk and the Bayesian detection problem form the central concept, and all other concepts relate to it in various ways and for various purposes. The terms and concepts that have not yet been defined are coming up soon.
Figure 6.2. A mental model for different concepts in detection theory surrounding the central concept of Bayes risk and Bayesian detection. Accessible caption. A diagram with Bayes risk and Bayesian detection at the center and four other groups of concepts radiating outwards. False positive rate, false negative rate, error rate, and accuracy are special cases. Receiver operating characteristic, recall-precision curve, and area under the curve arise when examining all operating points. Brier score and calibration curve arise in probabilistic risk assessment. False discovery rate, false omission rate, and 𝐹1-score relate to performance of predictions.
Because getting things right is a good thing, it is often assumed that there is no cost to correct
decisions, i.e., 𝑐00 = 0 and 𝑐11 = 0, which is also assumed in this book going forward. In this case, the
Bayes risk simplifies to:
$R = c_{10}\, p_0\, p_{\mathrm{FP}} + c_{01}\, p_1\, p_{\mathrm{FN}}.$

Equation 6.10
To arrive at this simplified equation, just insert zeros for 𝑐00 and 𝑐11 in Equation 6.9. The Bayes risk with
𝑐10 = 1 and 𝑐01 = 1 is the probability of error.
We are implicitly assuming that 𝑐(⋅,⋅) does not depend on 𝑋 except through 𝑦̂(𝑋). This assumption is
not required, but made for simplicity. You can easily imagine scenarios in which the cost of a decision
depends on the feature. For example, if one of the features used in the loan approval decision by
ThriveGuild is the value of the loan, the cost of an error (monetary loss) depends on that feature.
Nevertheless, for simplicity, we usually make the assumption that the cost function does not explicitly
depend on the feature value. For example, under this assumption, the cost of a false positive (a default that the lender must absorb) may be 𝑐10 = 100,000 dollars and the cost of a false negative (a forgone opportunity to earn interest) 𝑐01 = 50,000 dollars for all applicants.
Figure 6.3. An example receiver operating characteristic (ROC). Accessible caption. A plot with 𝑝TP on the vertical axis and 𝑝FP on the horizontal axis. Both axes range from 0 to 1. A dashed diagonal line goes from (0,0) to (1,1) and corresponds to random guessing. A solid concave curve, the ROC, goes from (0,0) to (1,1) staying above and to the left of the diagonal line.

4. The recall-precision curve is an alternative to understand performance across operating points. It is the curve traced out on the 𝑝PPV–𝑝TP plane starting at (𝑝PPV = 0, 𝑝TP = 1) and ending at (𝑝PPV = 1, 𝑝TP = 0). It has a one-to-one mapping with the ROC and is more easily understood by some people. Jesse Davis and Mark Goadrich. “The Relationship Between Precision-Recall and ROC Curves.” In: Proceedings of the International Conference on Machine Learning. Pittsburgh, Pennsylvania, USA, Jun. 2006, pp. 233–240.
Let us denote the best possible decision function as 𝑦̂ ∗ (⋅) and its corresponding Bayes risk as 𝑅 ∗. They
are specified using the minimization of the expected cost:
$\hat{y}^*(\cdot) = \arg\min_{\hat{y}(\cdot)} \mathbb{E}\big[c\big(\hat{y}(X), Y\big)\big] \qquad \text{and} \qquad R^* = \min_{\hat{y}(\cdot)} \mathbb{E}\big[c\big(\hat{y}(X), Y\big)\big],$

Equation 6.11
where the expectation is over both 𝑋 and 𝑌. Because it achieves the minimal cost, the function 𝑦̂ ∗ (⋅) is
the best possible 𝑦̂(⋅) by definition. Whatever Bayes risk 𝑅 ∗ it has, no other decision function can have a
lower Bayes risk 𝑅.
We aren’t going to work it out here, but the solution to the minimization problem in Equation 6.11 is
the Bayes optimal decision function, and takes the following form:
$\hat{y}^*(x) = \begin{cases} 0, & \Lambda(x) \leq \eta \\ 1, & \Lambda(x) > \eta \end{cases}$

Equation 6.12

where the likelihood ratio is

$\Lambda(x) = \frac{p_{X \mid Y}(x \mid Y = 1)}{p_{X \mid Y}(x \mid Y = 0)}$

Equation 6.13

and the threshold is

$\eta = \frac{c_{10}\, p_0}{c_{01}\, p_1}.$

Equation 6.14
The likelihood ratio is as its name says: it is the ratio of the likelihood functions. It is a scalar value even
if the features 𝑋 are multivariate. As the ratio of two non-negative pdf values, it has the range [0, ∞) and
can be viewed as a random variable. The threshold is made up of both costs and prior probabilities. This
optimal decision function 𝑦̂ ∗ (⋅) given in Equation 6.12 is known as the likelihood ratio test.
6.2.1 Example
As an example, let ThriveGuild’s loan approval decision be determined solely by one feature 𝑋: the
income of the applicant. Recall that we modeled income to be exponentially-distributed in Chapter 3.
Specifically, let 𝑝𝑋∣𝑌 ( 𝑥 ∣ 𝑌 = 1 ) = 0.5𝑒 −0.5𝑥 and 𝑝𝑋∣𝑌 ( 𝑥 ∣ 𝑌 = 0 ) = 𝑒 −𝑥 , both for 𝑥 ≥ 0. Like earlier in this
chapter, 𝑝0 = 0.65, 𝑝1 = 0.35, 𝑐10 = 100000, and 𝑐01 = 50000. Then simply plugging in to Equation 6.13,
you’ll get:
$\Lambda(x) = \frac{0.5\, e^{-0.5x}}{e^{-x}} = 0.5\, e^{0.5x}, \qquad x \geq 0$
Equation 6.15
$\eta = \frac{100000}{50000} \cdot \frac{0.65}{0.35} \approx 3.7.$
Equation 6.16
Plugging these expressions into the Bayes optimal decision function given in Equation 6.12, you’ll get:
$\hat{y}^*(x) = \begin{cases} 0, & 0.5\, e^{0.5x} \leq 3.7 \\ 1, & 0.5\, e^{0.5x} > 3.7 \end{cases}$

Equation 6.17

which simplifies to

$\hat{y}^*(x) = \begin{cases} 0, & x \leq 4 \\ 1, & x > 4 \end{cases}$

Equation 6.18
by multiplying both sides of the inequalities in both cases by 2, taking the natural logarithm, and then
multiplying by 2 again. Applicants with an income less than or equal to 4 are denied and applicants with
an income greater than 4 are approved. The expected value of 𝑋 ∣ 𝑌 = 1 is 2 and the expected value of
𝑋 ∣ 𝑌 = 0 is 1. Thus in this example, an applicant's income has to be quite a bit higher than the mean to
be approved.
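A short numpy sketch, for illustration only, that recomputes the threshold 𝜂 and the equivalent income cutoff in this example:

```python
import numpy as np

# Cost and prior parameters from the running ThriveGuild example
# (Equations 6.14 and 6.16).
c_10, c_01 = 100_000, 50_000
p_0, p_1 = 0.65, 0.35

eta = (c_10 / c_01) * (p_0 / p_1)   # ~3.71, the threshold on the likelihood ratio

def likelihood_ratio(x):
    """Lambda(x) for the exponential likelihoods in Equation 6.15."""
    return (0.5 * np.exp(-0.5 * x)) / np.exp(-x)   # equals 0.5 * exp(0.5 * x)

# Solve Lambda(x) > eta for x: the income level above which applicants are approved.
income_threshold = 2.0 * np.log(2.0 * eta)   # ~4.01

incomes = np.array([1.0, 2.5, 4.5, 7.0])
approve = likelihood_ratio(incomes) > eta    # equivalently, incomes > income_threshold
print(eta, income_threshold, approve)
```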
You should use the Bayes-optimal risk 𝑅∗ as a lower bound on the risk of any machine learning classifier that you might try for a given data distribution.5 No matter how hard you work or how creative you are, you can never overcome the Bayes limit. So you should be happy if you get close. If the Bayes-optimal risk itself is too high, then the thing to do is to go back to the data understanding and data preparation stages of the machine learning lifecycle and get more informative data.
5. There are techniques for estimating the Bayes risk of a dataset without having access to its underlying probability distribution. Ryan Theisen, Huan Wang, Lav R. Varshney, Caiming Xiong, and Richard Socher. “Evaluating State-of-the-Art Classification Models Against Bayes Optimality.” In: Advances in Neural Information Processing Systems 34 (Dec. 2021).
The likelihood ratio ranges from zero to infinity and the threshold value 𝜂 = 1 is optimal for equal
priors and equal costs. Applying any monotonically increasing function to both the likelihood ratio and
the threshold still yields a Bayes optimal decision function with the same risk 𝑅 ∗. That is,
$\hat{y}^*(x) = \begin{cases} 0, & g(\Lambda(x)) \leq g(\eta) \\ 1, & g(\Lambda(x)) > g(\eta) \end{cases}$
Equation 6.19
Decision functions can also output a continuous-valued score 𝑆 ∈ [0,1] rather than a hard 0 or 1 decision; such scores are evaluated with the Brier score:

$\text{Brier score} = \mathbb{E}\big[(S - Y)^2\big].$

Equation 6.20
It is the mean-squared error of the score S with respect to the true label Y. For a finite number of samples
{(𝑠1 , 𝑦1 ), … , (𝑠𝑛 , 𝑦𝑛 )}, you can compute it as:
$\text{Brier score} = \frac{1}{n} \sum_{j=1}^{n} (s_j - y_j)^2.$
Equation 6.21
The Brier score decomposes into the sum of two separable components: calibration and refinement.6
The concept of calibration is that the predicted score corresponds to the proportion of positive true
labels. For example, a bunch of data points all having a calibrated score of 𝑠 = 0.7 implies that 70% of
them have true label 𝑦 = 1 and 30% of them have true label 𝑦 = 0. Said another way, perfect calibration
implies that the probability of the true label 𝑌 being 1 given the predicted score 𝑆 being 𝑠 is the value 𝑠
itself: 𝑃( 𝑌 = 1 ∣ 𝑆 = 𝑠 ) = 𝑠. Calibration is important for probabilistic risk assessments: a perfectly
calibrated score can be interpreted as a probability of predicting one class or the other. It is also an
important concept for evaluating causal inference methods, described in Chapter 8, for algorithmic
fairness, described in Chapter 10, and for communicating uncertainty, described in Chapter 13.
6. José Hernández-Orallo, Peter Flach, and Cèsar Ferri. “A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss.” In: Journal of Machine Learning Research 13 (Oct. 2012), pp. 2813–2869.
Since any monotonically increasing transformation 𝑔(⋅) can be applied to a decision function
without changing its ability to discriminate, you can improve the calibration of a decision function by
finding a better 𝑔(⋅). The calibration loss quantitatively captures how close a decision function is to
perfect calibration. The refinement loss is a sort of variance of how tightly the true labels distribute
around a given score. For {(𝑠1 , 𝑦1 ), … , (𝑠𝑛 , 𝑦𝑛 )} that have been sorted by their score values and binned into
𝑘 groups {ℬ1 , … , ℬ𝑘 } with average values {(𝑠̅1 , 𝑦̅1 ), … , (𝑠̅𝑘 , 𝑦̅𝑘 )} within the bins
$\text{calibration loss} = \frac{1}{n} \sum_{i=1}^{k} |\mathcal{B}_i|\, (\bar{s}_i - \bar{y}_i)^2$

$\text{refinement loss} = \frac{1}{n} \sum_{i=1}^{k} |\mathcal{B}_i|\, \bar{y}_i (1 - \bar{y}_i).$
Equation 6.22
As stated earlier, the sum of the calibration loss and refinement loss is the Brier score.
A calibration curve, also known as a reliability diagram, shows the (𝑠̅𝑘 , 𝑦̅𝑘 ) values as a plot. One
example is shown in Figure 6.4. The closer to a straight diagonal from (0,0) to (1,1), the better. Plotting
this curve is a good diagnostic tool for you to understand the calibration of a decision function.
Figure 6.4. An example calibration curve. Accessible caption. A plot with 𝑃(𝑌 = 1) on the vertical axis and
𝑠 on the horizontal axis. Both axes range from 0 to 1. A dashed diagonal line goes from (0,0) to (1,1) and
corresponds to perfect calibration. A solid S-shaped curve, the calibration curve, goes from (0,0) to
(1,1) starting below and to the right of the diagonal line before crossing over to being above and to the
left of the diagonal line.
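Here is a sketch, assuming numpy and made-up scores and labels, of the binned calibration and refinement losses of Equation 6.22; the choice of ten equal-width bins is arbitrary, and with such coarse binning the two losses only approximately sum to the Brier score.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical scores and binary labels; a well-calibrated scorer would have
# P(Y = 1 | S = s) close to s. Here the labels are deliberately miscalibrated.
s = rng.uniform(size=1000)
y = (rng.uniform(size=1000) < s**1.5).astype(float)

brier = np.mean((s - y) ** 2)                    # Equation 6.21

# Bin by score value and compute per-bin averages (s_bar_i, y_bar_i).
bins = np.linspace(0, 1, 11)
idx = np.digitize(s, bins[1:-1])
calibration_loss, refinement_loss = 0.0, 0.0
for i in range(10):
    in_bin = idx == i
    n_i = np.sum(in_bin)
    if n_i == 0:
        continue
    s_bar, y_bar = s[in_bin].mean(), y[in_bin].mean()
    calibration_loss += n_i * (s_bar - y_bar) ** 2 / len(s)
    refinement_loss += n_i * y_bar * (1 - y_bar) / len(s)

# The two losses approximately decompose the Brier score (Equation 6.22).
print(brier, calibration_loss + refinement_loss)
```

The per-bin pairs (s̄ᵢ, ȳᵢ) computed in the loop are exactly the points plotted in a calibration curve like Figure 6.4.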
6.4 Summary
▪ Four possible events result from binary decisions: false negatives, true negatives, false positives,
and true positives.
▪ Different ways to combine the probabilities of these events lead to classifier performance metrics such as the error rate, balanced error rate, and 𝐹1-score.
▪ Detection theory, the study of optimal decisions, provides fundamental limits on how well machine learning models may ever perform and is a tool for you to assess the basic performance of your models.
▪ Decision functions may output continuous-valued scores rather than only hard, zero or one,
decisions. Scores indicate confidence in a prediction. Calibrated scores are those for which the
score value is the probability of a sample belonging to a label class.
7
Supervised Learning
The (fictional) information technology company JCN Corporation is reinventing itself and changing its
focus to artificial intelligence and cloud computing. As part of managing its talent during this enterprise
transformation, it is conducting a machine learning project to estimate the expertise of its employees
from a variety of data sources such as self-assessments of skills, work artifacts (patents, publications,
software documentation, service claims, sales opportunities, etc.), internal non-private social media
posts, and tabular data records including the employee’s length of service, reporting chain, and pay
grade. A random subset of the employees has been explicitly evaluated on a binary yes/no scale for
various AI and cloud skills, which constitute the labeled training data for machine learning. JCN
Corporation’s data science team has been given the mission to predict the expertise evaluation for all
the other employees in the company. For simplicity, let’s focus on only one of the expertise areas:
serverless architecture.
Imagine that you are on JCN Corporation’s data science team and have progressed beyond the
problem specification, data understanding, and data preparation phases of the machine learning
lifecycle and are now at the modeling phase. By applying detection theory, you have chosen an
appropriate quantification of performance for predicting an employee’s skill in serverless architecture:
the error rate—the Bayes risk with equal costs for false positives and false negatives—introduced in
Chapter 6.
It is now time to get down to the business of learning a decision function (a classifier) from the
training data that generalizes well to predict expertise labels for the unlabeled employees. Deep learning
is one family of classifiers that is on the tip of everyone’s tongue. It is certainly one option for you, but
there are many other kinds of classifiers too. How will you evaluate different classification algorithms
to select the best one for your problem?
“My experience in industry strongly confirms that deep learning is a narrow sliver of
methods needed for solving complex automated decision making problems.”
A very important concept in practicing machine learning, first mentioned in Chapter 2, is the no free
lunch theorem. There is no one single machine learning method that is best for all datasets.1 What is a
good choice for one dataset might not be so great for another dataset. It all depends on the characteristics
of the dataset and the inductive bias of the method: the assumptions on how the classifier should generalize
outside the training data points. The challenge in achieving good generalization and a small error rate
is protecting against overfitting (learning a model that too closely matches the idiosyncrasies of the
training data) and underfitting (learning a model that does not adequately capture the patterns in the
training data). The goal is to get to the Goldilocks point where things are not too hot (overfitting) and not
too cold (underfitting), but just right.
An implication of the no free lunch theorem is that you must try several different methods for the
JCN Corporation expertise problem and see how they perform empirically before deciding on one over
another. Simply brute forcing it—training all the different methods and computing their test error to see
which one is smallest—is common practice, but you decide that you want to take a more refined
approach and analyze the inductive biases of different classifiers. Your analysis will determine the
domains of competence of various classifiers: what types of datasets they perform well on and what types of datasets they perform poorly on.2 Recall that competence or basic accuracy is the first attribute of
trustworthy machine learning as well as the first half of safety.
Why would you want to take this refined approach instead of simply applying a bunch of machine
learning methods from software packages such as scikit-learn, tensorflow, and pytorch without
analyzing their inductive biases? First, you have heeded the admonitions from earlier chapters to be
safe and to not take shortcuts. More importantly, however, you know you will later be creating new
algorithms that respect the second (reliability) and third (interaction) attributes of trustworthiness. You
must not only be able to apply algorithms, you must be able to analyze and evaluate them before you can
create. Now go forth and analyze classifiers for inventorying expertise in the JCN Corporation workforce.
1. David H. Wolpert. “The Lack of A Priori Distinctions Between Learning Algorithms.” In: Neural Computation 8.7 (Oct. 1996), pp. 1341–1390.
2. Tin Kam Ho and Ester Bernadó-Mansilla. “Classifier Domains of Competence in Data Complexity Space.” In: Data Complexity in Pattern Recognition. Ed. by Mitra Basu and Tin Kam Ho. London, England, UK: Springer, 2006, pp. 135–152.
3. Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. “Do We Need Hundreds of Classifiers to Solve Real World Classification Problems?” In: Journal of Machine Learning Research 15 (Oct. 2014), pp. 3133–3181.
water receives the classification 𝑦̂ = 1 (employee is skilled in serverless architecture). Sea level is the
threshold value and the coastline is the level set or decision boundary. An example of a decision
boundary for a two-dimensional feature space is shown in Figure 7.1.
Figure 7.1. An example of a decision boundary in a feature space. The gray regions correspond to feature values
for which the decision function predicts employees are skilled in serverless architecture. The white regions corre-
spond to features for which the decision function predicts employees are unskilled in serverless architecture. The
black lines are the decision boundary. Accessible caption. A stylized plot with the first feature dimension
𝑥 (1) on the horizontal axis and the second feature dimension 𝑥 (2) on the vertical axis. The space is par-
titioned into a couple of blob-like gray regions labeled 𝑦̂ = 1 and a white region labeled 𝑦̂ = 0. The
boundary between the regions is marked as the decision boundary. Classifier regions do not have to be
all one connected component.
Three key characteristics of a dataset determine how well the inductive biases of a classifier match
the dataset:
1. overlap of data points from the two class labels near the decision boundary,
2. linearity or nonlinearity of the decision boundary, and
3. number of data points, their density, and amount of clustering.
Classifier domains of competence are defined in terms of these three considerations.4 Importantly,
domains of competence are relative notions: does one classification algorithm work better than others? 5
They are not absolute notions, because at the end of the day, the absolute performance is limited by the
Bayes optimal risk defined in Chapter 6. For example, one classification method that you tried may work
better than others on datasets with a lot of class overlap near the decision boundary, nearly linear shape
of the decision boundary, and not many data points. Another classification method may work better than
4. In the scope of this chapter, the JCN team use these characteristics qualitatively as a means of gaining intuition. Quantitative measures for these characteristics are described by Tin Kam Ho and Mitra Basu. “Complexity Measures of Supervised Classification Problems.” In: IEEE Transactions on Pattern Analysis and Machine Intelligence 24.3 (Mar. 2002), pp. 289–300.
5. For the purposes of this chapter, ‘work better’ is only in terms of basic performance (the first attribute of trustworthiness), not reliability or interaction (the second and third attributes of trustworthiness). Classifier domains of reliability and domains of quality interaction can also be defined.
others on datasets without much class overlap near a tortuously-shaped decision boundary. Yet another
classification method may work better than others on very large datasets. In the remainder of this
chapter, you will analyze many different supervised learning algorithms. The aim is not only describing
how they work, but analyzing their inductive biases and domains of competence.
There are specific methods within these two broad categories of supervised classification algorithms. A
mental model for different ways of doing supervised machine learning is shown in Figure 7.2.
Figure 7.2. A mental model for different ways of approaching supervised machine learning. Accessible caption. A hierarchy diagram with supervised learning at its root. Supervised learning has children plug-in and risk minimization. Plug-in has children parametric and nonparametric. Parametric has children linear discriminant
tion. Plug-in has children parametric and nonparametric. Parametric has children linear discriminant
analysis and quadratic discriminant analysis. Nonparametric has children k-nearest neighbor and na-
ïve Bayes. Risk minimization has children empirical risk minimization and structural risk minimiza-
tion. Structural risk minimization has children decision trees and forests, margin-based methods, and
neural networks.
If the assumed parametric form for the likelihood functions is multivariate Gaussian in 𝑑 dimensions
with mean parameters 𝜇0 and 𝜇1 and covariance matrix parameters Σ0 and Σ1,6 then the first step is to
compute their empirical estimates 𝜇̂ 0 , 𝜇̂ 1, Σ̂0 , and Σ̂1 from the training data, which you know how to do
from Chapter 3. The second step is to plug those estimates into the likelihood ratio to get the classifier
decision function. Under the Gaussian assumption, the method is known as quadratic discriminant
analysis because after rearranging and simplifying the likelihood ratio, the quantity compared to a
threshold turns out to be a quadratic function of 𝑥. If you further assume that the two covariance
matrices Σ0 and Σ1 are the same matrix Σ, then the quantity compared to a threshold is even simpler: it
is a linear function of 𝑥, and the method is known as linear discriminant analysis.
Figure 7.3 shows examples of linear and quadratic discriminant analysis classifiers in 𝑑 = 2
dimensions trained on the data points shown in the figure. The red diamonds are the employees in the
training set unskilled at serverless architecture. The green squares are the employees in the training set
skilled at serverless architecture. The domain of competence for linear and quadratic discriminant
analysis is datasets whose decision boundary is mostly linear, with a dense set of data points of both
class labels near that boundary.
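A minimal scikit-learn sketch of the two plug-in discriminant analysis methods on synthetic interleaving-moons data like that in Figure 7.3 (the dataset is a stand-in, not the JCN expertise data):

```python
from sklearn.datasets import make_moons
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the two classes of employees.
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Plug in Gaussian estimates: a shared covariance gives a linear boundary,
# per-class covariances give a quadratic boundary.
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)

print("LDA accuracy:", lda.score(X_test, y_test))
print("QDA accuracy:", qda.score(X_test, y_test))
```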
6. The mathematical form of the likelihood functions is $p_{X \mid Y}(x \mid y=0) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma_0)}}\, e^{-\frac{1}{2}(x-\mu_0)^{T} \Sigma_0^{-1} (x-\mu_0)}$ and $p_{X \mid Y}(x \mid y=1) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma_1)}}\, e^{-\frac{1}{2}(x-\mu_1)^{T} \Sigma_1^{-1} (x-\mu_1)}$, where det is the matrix determinant function.
Figure 7.3. Examples of linear discriminant analysis (left) and quadratic discriminant analysis (right) classifiers.
Accessible Caption. Stylized plot showing two classes of data points arranged in a noisy yin yang or in-
terleaving moons configuration. The linear discriminant decision boundary is a straight line that cuts
through the middle of the two classes. The quadratic discriminant decision boundary is a smooth
curve that turns a little to enclose one of the classes.
instead. The k-nearest neighbor method works better than other classifiers when the decision boundary
is very wiggly and broken up into lots of components, and when there is not much overlap in the classes.
Figure 7.4 shows examples of naïve Bayes and k-nearest neighbor classifiers in two dimensions. The k-
nearest neighbor classifier in the figure has 𝑘 = 5.
Figure 7.4. Examples of naïve Bayes (left) and k-nearest neighbor (right) classifiers. Accessible caption. Styl-
ized plot showing two classes of data points arranged in a noisy yin yang or interleaving moons config-
uration. The naïve Bayes decision boundary is a smooth curve that turns a little to enclose one of the
classes. The k-nearest neighbor decision boundary is very jagged and traces out the positions of the
classes closely.
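And a companion sketch, again assuming scikit-learn and synthetic data, for the two nonparametric plug-in methods shown in Figure 7.4:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Naive Bayes treats the feature dimensions as independent within each class;
# k-nearest neighbor votes among the k closest training points.
nb = GaussianNB().fit(X_train, y_train)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print("naive Bayes accuracy:", nb.score(X_test, y_test))
print("5-nearest neighbor accuracy:", knn.score(X_test, y_test))
```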
Recall from Chapter 6 that the probability of error of a decision function is

𝑝E = 𝑝0 𝑃( 𝑦̂(𝑋) = 1 ∣ 𝑌 = 0 ) + 𝑝1 𝑃( 𝑦̂(𝑋) = 0 ∣ 𝑌 = 1 ).

Equation 7.1
The prior probabilities of the class labels 𝑝0 and 𝑝1 multiply the probabilities of the events when the
decision function is wrong 𝑃( 𝑦̂(𝑋) = 1 ∣ 𝑌 = 0 ) and 𝑃( 𝑦̂(𝑋) = 0 ∣ 𝑌 = 1 ). You cannot directly compute the
error probability because you do not have access to the full underlying probability distribution. But is
there an approximation to the error probability that you can compute using the training data?
First, because the training data is sampled i.i.d. from the underlying distribution, the proportion of
employees in the training data set skilled and unskilled at serverless architecture will approximately
match the prior probabilities 𝑝0 and 𝑝1 , so you do not have to worry about them explicitly. Second, the
probabilities of both the false positive event 𝑃( 𝑦̂(𝑋) = 1 ∣ 𝑌 = 0 ) and false negative event
𝑃( 𝑦̂(𝑋) = 0 ∣ 𝑌 = 1 ) can be expressed collectively as 𝑃(𝑦̂(𝑋) ≠ 𝑌), which corresponds to 𝑦̂(𝑥𝑗 ) ≠ 𝑦𝑗
for training data samples. The zero-one loss function 𝐿 (𝑦𝑗 , 𝑦̂(𝑥𝑗 )) captures this by returning the value 1 for
𝑦̂(𝑥𝑗 ) ≠ 𝑦𝑗 and the value 0 for 𝑦̂(𝑥𝑗 ) = 𝑦𝑗 . Putting all these things together, the empirical approximation to
the error probability, known as the empirical risk 𝑅emp , is:
$R_{\mathrm{emp}} = \frac{1}{n} \sum_{j=1}^{n} L\big(y_j, \hat{y}(x_j)\big).$
Equation 7.2
Minimizing the empirical risk over all possible decision functions 𝑦̂ is a possible classification algorithm,
but not one that you and the other JCN Corporation data scientists evaluate just yet. Let’s understand
why not.
Figure 7.5. Illustration of the structural risk minimization principle. Accessible caption. A plot with 𝑝E or
𝑅emp on the vertical axis and increasing complexity of hypothesis spaces on the horizontal axis. The
empirical risk decreases all the way to zero with increasing complexity. The generalization error first
decreases and then increases. It has a sweet spot in the middle.
The hypothesis space ℱ is the inductive bias of the classifier. Thus, within the paradigm of the
structural risk minimization principle, different choices of hypothesis spaces yield different domains of
competence. In the next section, you and your team of JCN data scientists analyze several different risk
minimization classifiers popularly used in practice, including decision trees and forests, margin-based
classifiers (logistic regression, support vector machines, etc.), and neural networks.
$\hat{y}(\cdot) = \arg\min_{f \in \mathcal{F}}\; \frac{1}{n} \sum_{j=1}^{n} L\big(y_j, f(x_j)\big).$
Equation 7.3
This equation may look familiar because it is similar to the Bayesian detection problem in Chapter 6.
The function in the hypothesis space that minimizes the sum of the losses on the training data is 𝑦̂.
Different methods have different hypothesis spaces ℱ and different loss functions 𝐿(⋅,⋅). An alternative
way to control the complexity of the classifier is not through changing the hypothesis space ℱ, but
through a complexity penalty or regularization term 𝐽(⋅) weighted by a regularization parameter 𝜆:
$\hat{y}(\cdot) = \arg\min_{f \in \mathcal{F}}\; \frac{1}{n} \sum_{j=1}^{n} L\big(y_j, f(x_j)\big) + \lambda J(f).$
Equation 7.4
The choice of regularization term 𝐽 also yields an inductive bias for you to analyze.
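As one concrete, hedged instance of Equation 7.4, the scikit-learn sketch below uses the logistic loss with an ℓ2 penalty on the weights; in this library the parameter C is inversely related to the regularization weight 𝜆, so smaller C means stronger regularization.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Regularized empirical risk minimization: logistic loss plus an l2 penalty
# on the weight vector, with the penalty strength swept over three values.
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(penalty="l2", C=C).fit(X_train, y_train)
    print(C, clf.score(X_test, y_test))
```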
Figure 7.6. An example of a decision stump classifier. Accessible caption. On the left is a decision node
𝑥 (2) ≤ −0.278. When it is true, 𝑦̂ = 1 and when it is false, 𝑦̂ = 0. On the right is a stylized plot showing
two classes of data points arranged in a noisy yin yang or interleaving moons configuration. The deci-
sion boundary is a horizontal line.
The hypothesis space of decision trees includes decision functions with more complexity than
decision stumps. A decision tree is created by splitting on single feature dimensions within each branch
of the decision stump, splitting within those splits, and so on. An example of a decision tree with two
levels is shown in Figure 7.7. Decision trees can go much deeper than two levels to create fairly complex
decision boundaries. An example of a complex decision boundary from a decision tree classifier is
shown in Figure 7.8.
Figure 7.7. An example of a two-level decision tree classifier. Accessible caption. On the left is a decision
node 𝑥 (2) ≤ −0.278. When it is true, there is another decision node 𝑥 (1) ≤ −1.151. When this decision
node is true, 𝑦̂ = 0 and when it is false, 𝑦̂ = 1. When the top decision node is false, there is another deci-
sion node 𝑥 (1) ≤ 1.085. When this decision node is true, 𝑦̂ = 0 and when it is false, 𝑦̂ = 1. On the right is
a stylized plot showing two classes of data points arranged in a noisy yin yang or interleaving moons
configuration. The decision boundary is made up of three line segments: the first segment is vertical,
it turns right into a horizontal segment, and then up into another vertical segment.
Figure 7.8. An example decision tree classifier with many levels. Accessible caption. A stylized plot showing
two classes of data points arranged in a noisy yin yang or interleaving moons configuration. The deci-
sion boundary is made up of several vertical and horizontal segments.
The hypothesis space of decision forests is made up of ensembles of decision trees that vote for their
prediction, possibly with an unequal weighting given to different trees. The weighted majority vote from
the decision trees is the overall classification. The mental model for a decision forest is illustrated in
Figure 7.9 and an example decision boundary is given in Figure 7.10.
Figure 7.9. A mental model for a decision forest. Accessible caption. Three individual decision trees each
predict separately. Their predictions feed into a vote node which outputs 𝑦̂. The predictions from each
tree are combined using a (weighted) majority vote.
Figure 7.10. An example decision forest classifier. Accessible caption. Stylized plot showing two classes of
data points arranged in a noisy yin yang or interleaving moons configuration. The decision boundary is
fairly jagged with mostly axis-aligned segments and traces out the positions of the classes closely.
Decision stumps and decision trees can be directly optimized for the zero-one loss function that
appears in the empirical risk.7 More commonly, however, greedy heuristic methods are employed for
learning decision trees in which the split for each node is done one at a time, starting from the root and
progressing to the leaves. The split is chosen so that each branch is as pure as can be for the two classes:
mostly just employees skilled at serverless architecture on one side of the split, and mostly just
employees unskilled at serverless architecture on the other side of the split. The purity can be quantified
by two different information-theoretic measures, information gain and Gini index, which were
introduced in Chapter 3. Two decision tree algorithms are popularly-used: the C5.0 decision tree that uses
information gain as its splitting criterion, and the classification and regression tree (CART) that uses Gini
index as its splitting criterion. The depth of decision trees is controlled to prevent overfitting. The
domain of competence of C5.0 and CART decision trees is tabular datasets in which the phenomena
represented in the features tend to have threshold and clustering behaviors without much class overlap.
Decision forests are made up of a lot of decision trees. C5.0 or CART trees are usually used as these
base classifiers. There are two popular ways to train decision forests: bagging and boosting. In bagging,
different subsets of the training data are presented to different trees and each tree is trained separately.
All trees have equal weight in the majority vote. In boosting, a sequential procedure is followed. The first
tree is trained in the standard manner. The second tree is trained to focus on the training samples that
the first tree got wrong. The third tree focuses on the errors of the first two trees, and so on. Earlier trees
receive greater weight. Decision forests have good competence because of the diversity of their base
classifiers. As long as the individual trees are somewhat competent, any unique mistake that any one
tree makes is washed out by the others for an overall improvement in generalization.
The random forest classifier is the most popular bagged decision forest and the XGBoost classifier is the most popular boosted decision forest. Both have very large domains of competence. They are robust and work extremely well for almost all kinds of structured datasets. They are the first-choice algorithms for practicing data scientists to achieve good accuracy models with little to no tuning of parameters.

7. Oktay Günlük, Jayant Kalagnanam, Minhan Li, Matt Menickelly, and Katya Scheinberg. “Optimal Generalized Decision Trees via Integer Programming.” arXiv:1612.03225, 2019.
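A brief sketch, assuming scikit-learn and synthetic data, of the bagging and boosting recipes described above (the xgboost package provides XGBClassifier with a similar fit/score interface if you prefer it):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: a forest of CART-style trees, each trained on a bootstrap sample
# and given equal weight in the vote.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Boosting: trees trained sequentially, with later trees focusing on the
# training samples that earlier trees got wrong.
boosted = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

print("bagged forest accuracy:", forest.score(X_test, y_test))
print("boosted trees accuracy:", boosted.score(X_test, y_test))
```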
8. The nonlinear functions of the features are usually kernel functions, which satisfy certain mathematical properties that allow for efficient optimization during training.
9. In the nonlinear case, replace 𝑤ᵀ𝑥𝑗 with 𝑤ᵀ𝑘(𝑥𝑗) for a kernel function 𝑘. To avoid cluttering up the mathematical notation, always assume that 𝑥𝑗 or 𝑘(𝑥𝑗) has a column of all ones to allow for a constant shift.
10. Computing (2𝑦𝑗 − 1) is the inverse of applying the sign function, adding one, and dividing by two. It is performed to get values −1 and +1 from the class labels. When a classification is correct, 𝑤ᵀ𝑥𝑗 and (2𝑦𝑗 − 1) have the same sign. Multiplying two numbers with the same sign results in a positive number. When a classification is incorrect, 𝑤ᵀ𝑥𝑗 and (2𝑦𝑗 − 1) have different signs. Multiplying two numbers with different signs results in a negative number.
Figure 7.11. Margin-based loss functions. Accessible caption. A plot with loss on the vertical axis and mar-
gin on the horizontal axis. The logistic loss decreases smoothly. The hinge loss decreases linearly until
the point (1,0), after which it is 0 for all larger values of the margin.
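For reference, a tiny numpy sketch of one common parameterization of the two losses plotted in Figure 7.11, written as functions of the margin:

```python
import numpy as np

def logistic_loss(margin):
    # Smoothly decreasing in the margin; used by logistic regression.
    return np.log(1.0 + np.exp(-margin))

def hinge_loss(margin):
    # Linearly decreasing until margin 1, then exactly zero; used by SVMs.
    return np.maximum(0.0, 1.0 - margin)

margins = np.linspace(-2, 3, 6)
print(logistic_loss(margins))
print(hinge_loss(margins))
```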
Figure 7.12. Example linear logistic regression (top left), linear SVM (top right), nonlinear polynomial SVM (bot-
tom left), and nonlinear radial basis function SVM (bottom right) classifiers. Accessible caption. Stylized plot
showing two classes of data points arranged in a noisy yin yang or interleaving moons configuration.
The linear logistic regression and linear SVM decision boundaries are diagonal lines through the mid-
dle of the moons. The polynomial SVM decision boundary is a diagonal line with a smooth bump to bet-
ter follow the classes. The radial basis function SVM decision boundary smoothly encircles one of the
classes with a blob-like region.
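The four classifiers in Figure 7.12 can be trained with a few lines of scikit-learn; this sketch uses synthetic moons data rather than the JCN dataset:

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Linear decision boundaries versus nonlinear kernelized boundaries.
models = {
    "linear logistic regression": LogisticRegression(),
    "linear SVM": SVC(kernel="linear"),
    "polynomial SVM": SVC(kernel="poly", degree=3),
    "radial basis function SVM": SVC(kernel="rbf"),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```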
Figure 7.13. Diagram of a neural network. Accessible caption. Three nodes on the left form the input
layer. They are labeled 𝑥 (1), 𝑥 (2), and 𝑥 (3) . To the right of the input layer is hidden layer 1 with four
nodes. To the right of hidden layer 1 is hidden layer 2 with four hidden nodes. To the right of hidden
layer 2 is one node labeled 𝑦̂ constituting the output layer. There are edges between each node of one
layer and each node of the adjacent layer. Each edge has a weight. Each hidden node sums all of its in-
puts and applies an activation function. The output node sums all of its inputs and applies a hard
threshold.
Logistic regression is actually a very simple neural network with just an input layer and an output
node, so let’s start there. The input layer is simply a set of nodes, one for each of the 𝑑 feature dimensions
𝑥 (1), … , 𝑥 (𝑑) relevant for predicting the expertise of employees. They have weighted edges coming out of
them, going into the output node. The weights on the edges are the coefficients in 𝑤, i.e. 𝑤 (1) , … , 𝑤 (𝑑) . The
output node sums the weighted inputs, so computes 𝑤 𝑇 𝑥, and then passes the sum through the step
function. This overall procedure is exactly the same as logistic regression described earlier, but
described in a graphical way.
In the regular case of a neural network with one or more hidden layers, nodes in the hidden layers
also start with a weighted summation. However, instead of following the summation with an abrupt step
function, hidden layer nodes use softer, more gently changing activation functions. A few different
activation functions are used in practice, whose choice contributes to the inductive bias. Two examples,
the sigmoid or logistic activation function 1/(1 + 𝑒 −𝑧 ) and the rectified linear unit (ReLU) activation
function max{ 0, 𝑧}, are shown in Figure 7.14. The ReLU activation is typically used in all hidden layers of
deep neural networks because it has favorable properties for optimization techniques that involve the
gradient of the activation function.
Figure 7.14. Activation functions. Accessible caption. A plot with activation function on the vertical axis
and input on the horizontal axis. The sigmoid function is a gently rolling S-shaped curve that equals 0.5
at input value 0, approaches 0 as the input goes to negative infinity, and approaches 1 as the input goes
to positive infinity. The ReLU function is 0 for all negative inputs and increases linearly starting at 0.
When there are several hidden layers, the outputs of nodes in one hidden layer feed into the nodes
of the next hidden layer. Thus, the neural network’s computation is a sequence of compositions of
weighted sum, activation function, weighted sum, activation function, and so on until reaching the
output layer, which finally applies the step function. The number of nodes per hidden layer and the
number of hidden layers is a design choice for JCN Corporation’s data scientists to make.
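To make the composition of weighted sums and activations concrete, here is a small numpy sketch of a forward pass through a hypothetical network with the layer sizes of Figure 7.13 (bias terms are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def forward(x, weights):
    """Compose weighted sums and ReLU activations layer by layer, then apply a
    hard threshold at the output node, as described in the text."""
    a = x
    for W in weights[:-1]:
        a = relu(W @ a)              # hidden layer: weighted sum, then activation
    score = weights[-1] @ a          # output node: weighted sum
    return (score > 0).astype(int)   # step function gives the hard 0/1 decision

# Hypothetical weights: 3 input features, two hidden layers of 4 nodes, 1 output.
weights = [rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(1, 4))]
x = rng.normal(size=3)
print(forward(x, weights))
```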
You and your team have analyzed the hypothesis space. Cool beans. The next thing for you to analyze
is the loss function of neural networks. Recall that margin-based loss functions multiply the true label
𝑦𝑗 by the distance 𝑤 𝑇 𝑥𝑗 (not by the predicted label 𝑦̂(𝑥𝑗 )), before applying the step function. The cross-
entropy loss, the most common loss function used in neural networks, does kind of the same thing. It
compares the true label 𝑦𝑗 to a soft prediction 𝜑(𝑥𝑗 ) in the range [0,1] computed in the output node before
the step function has been applied to it. The cross-entropy loss function is:
$L\big(y_j, \varphi(x_j)\big) = -y_j \log\big(\varphi(x_j)\big) - (1 - y_j)\log\big(1 - \varphi(x_j)\big).$

Equation 7.5
The form of the expression comes from cross-entropy, the average information in the true label
random variable 𝑦 when described using the predicted distance random variable 𝜑, introduced in
Chapter 3. Cross-entropy should be minimized because you want the description in terms of the
prediction to be matched to the ground truth. It turns out that the cross-entropy loss is equivalent to the
margin-based logistic loss function in binary classification problems, but it is pretty involved to show it
mathematically because the margin-based loss function is a function of one variable that multiplies the
prediction and the true label, whereas the two arguments are kept separate in cross-entropy loss.11
The last question to ask is about regularization. Although ℓ1-norm, ℓ2 -norm, or other penalties can
be added to the cross-entropy loss, the most common way to regularize neural networks is dropout. The
idea is to randomly remove some nodes from the network on each iteration of an optimization procedure
during training. Dropout’s goal is somewhat similar to bagging, but instead of creating an ensemble of
several neural networks explicitly, dropout makes each iteration appear like a different neural network
of an ensemble, which helps diversity and generalization. An example neural network classifier with
one hidden layer and ReLU activation functions is shown in Figure 7.15. Repeating the statement from
the beginning of this section, the domain of competence for artificial neural networks is semi-structured
datasets with a large number of data points.
Figure 7.15. Example neural network classifier. Accessible caption. Stylized plot showing two classes of
data points arranged in a noisy yin yang or interleaving moons configuration. The decision boundary is
mostly smooth and composed of two almost straight diagonal segments that form a slightly bent elbow
in the middle of the two moons.
7.5.4 Conclusion
You have worked your way through several different kinds of classifiers to compare and contrast their
domains of competence and evaluate their appropriateness for your expertise assessment prediction
task. Your dataset consists of mostly structured data, is of moderate size, and has a lot of feature axis-
aligned separations between employees skilled and unskilled at serverless architecture. For these
11
Tyler Sypherd, Mario Diaz, Lalitha Sankar, and Peter Kairouz. “A Tunable Loss Function for Binary Classification.” In: Pro-
ceedings of the IEEE International Symposium on Information Theory. Paris, France, Jul. 2019, pp. 2479–2483.
reasons, you can expect that XGBoost will be a competent classifier for your problem. But you should
nevertheless do some amount of empirical testing of a few different methods.
7.6 Summary
▪ There are many different methods for finding decision functions from a finite number of training
samples, each with their own inductive biases for how they generalize.
▪ Different classifiers have different domains of competence: what kinds of datasets they have
lower generalization error on than other methods.
▪ Parametric and non-parametric plug-in methods (discriminant analysis, naïve Bayes, k-nearest
neighbor) and risk minimization methods (decision trees and forests, margin-based methods,
neural networks) all have a role to play in practical machine learning problems.
▪ It is important to analyze their inductive biases and domains of competence not only to select the
most appropriate method for a given problem, but also to be prepared to extend them for
fairness, robustness, explainability, and other elements of trustworthiness.
8
Causal Modeling
In cities throughout the United States, the difficulty of escaping poverty is exacerbated by the difficulty
in obtaining social services such as job training, mental health care, financial education classes, legal
advice, child care support, and emergency food assistance. They are offered by different agencies in
disparate locations with different eligibility requirements. It is difficult for poor individuals to navigate
this fragmented system and avail themselves of the services that they are entitled to. To counteract this situation, the
(fictional) integrated social service provider ABC Center takes a holistic approach by housing many
individual social services in one place and having a centralized staff of social workers guide their clients.
To better advise clients on how to advance themselves in various aspects of life, the center’s director
and executive staff would like to analyze the data that the center collects on the services used by clients
and the life outcomes they achieved. As problem owners, they do not know what sort of data modeling
they should do. Imagine that you are a data scientist collaborating with the ABC Center problem owners
to analyze the situation and suggest appropriate problem specifications, understand and prepare the
data available, and finally perform modeling. (This chapter covers a large part of the machine learning
lifecycle whereas other chapters so far have mostly focused on smaller parts.)
Your first instinct may be to suggest that ABC Center take a machine learning approach that predicts
life outcomes (education, housing, employment, etc.) from a set of features that includes classes taken
and sessions attended. Examining the associations and correlations in the resulting trained model may
yield some insights, but misses something very important. Do you know what it is? It’s causality! If you
use a standard machine learning formulation of the problem, you can’t say that taking an automobile
repair training class causes an increase in the wages of the ABC Center client. When you want to
understand the effect of interventions (specific actions that are undertaken) on outcomes, you have to do
more than machine learning, you have to perform causal modeling.1 Cause and effect are central to
understanding the world, but standard supervised learning is not a method for obtaining them.
1
Ruocheng Guo, Lu Cheng, Jundong Li, P. Richard Hahn, and Huan Liu. “A Survey of Learning Causality from Data: Problems
and Methods.” In: ACM Computing Surveys 53.4 (Jul. 2020), p. 75.
Toward the goal of suggesting problem formulations to ABC Center, understanding the relevant data,
and creating models for them, in this chapter you will:
“While probabilities encode our beliefs about a static world, causality tells us
whether and how probabilities change when the world changes, be it by intervention
or by act of imagination.”
child care causing both wages and stable housing. (If a client has child care, they can more easily search
for jobs and places to live since they don’t have to take their child around with them.)
Figure 8.1. An example causal graph of how the clients of ABC Center respond to interventions and life changes.
Accessible caption. A graph with six nodes: counseling sessions, anxiety, wages, child care, used car,
and stable housing. There are edges from counseling sessions to anxiety, anxiety to wages, wages to
used car, wages to stable housing, and child care to both wages and stable housing.
Nodes represent random variables in structural causal models just as they do in Bayesian networks.
However, edges don’t just represent statistical dependencies, they also represent causal relationships.
A directed edge from random variable 𝑇 (e.g. counseling sessions) to 𝑌 (e.g. anxiety) indicates a causal
effect of 𝑇 (counseling sessions) on 𝑌 (anxiety). Since structural causal models are a generalization of
Bayesian networks, the Bayesian network calculations for representing probabilities in factored form
and for determining conditional independence through d-separation continue to hold. However,
structural causal models capture something more than the statistical relationships because of the
structural equations.
Structural equations, also known as functional models, tell you what happens when you do something.
Doing or intervening is the act of forcing a random variable to take a certain value. Importantly, it is not
just passively observing what happens to all the random variables when the value of one of them has
been revealed—that is simply a conditional probability. Structural causal modeling requires a new
operator 𝑑𝑜(∙), which indicates an intervention. The interventional distribution 𝑃( 𝑌 ∣∣ 𝑑𝑜(𝑡) ) is the
distribution of the random variable 𝑌 when the random variable 𝑇 is forced to take value 𝑡. For a causal
graph with only two nodes 𝑇 (counseling session) and 𝑌 (anxiety) with a directed edge from 𝑇 to 𝑌, 𝑇 →
𝑌, the structural equation takes the functional form:
𝑃( 𝑌 ∣∣ 𝑑𝑜(𝑡) ) = 𝑓𝑌 (𝑡, 𝑛𝑜𝑖𝑠𝑒𝑌 ).

Equation 8.1
where 𝑛𝑜𝑖𝑠𝑒𝑌 is some noise or randomness in 𝑌 and 𝑓𝑌 is any function. There is an exact equation relating
an intervention on counseling sessions, like starting counseling sessions (changing the variable from 0
to 1), to the probability of a client’s anxiety. The key point is that the probability can truly be expressed
as an equation with the treatment as an argument on the right-hand side. Functional models for
variables with more parents would have those parents as arguments in the function 𝑓𝑌 , for example
𝑃( 𝑌 ∣∣ 𝑑𝑜(𝑡) ) = 𝑓𝑌 (𝑡1 , 𝑡2 , 𝑡3 , 𝑛𝑜𝑖𝑠𝑒𝑌 ) if 𝑌 has three parents. (Remember from Chapter 3 that directed edges
begin at parent nodes and end at child nodes.)
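To make the structural equation concrete, here is a small simulation sketch of the two-node graph where counseling sessions 𝑇 cause anxiety 𝑌. The functional form and the effect sizes are invented for illustration; forcing 𝑡 to 0 or 1 for everyone is what the 𝑑𝑜(∙) operator means.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

def f_Y(t, noise):
    """Hypothetical structural equation for anxiety: starting counseling (t=1)
    lowers the chance of high anxiety from about 0.7 to about 0.4."""
    base = 0.7 - 0.3 * t
    return (noise < base).astype(int)   # 1 = high anxiety, 0 = low anxiety

noise_Y = rng.random(n)

# Interventional distributions: force t = 0 and then t = 1 for everyone
y_do0 = f_Y(np.zeros(n), noise_Y)
y_do1 = f_Y(np.ones(n), noise_Y)

print("P(high anxiety | do(t=0)) =", y_do0.mean())   # about 0.7
print("P(high anxiety | do(t=1)) =", y_do1.mean())   # about 0.4
```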
and observational data does not. Causal modeling with interventional data is usually straightforward
and causal modeling with observational data is much more involved.
In the modeling phase when dealing with observational data, the two problem formulations
correspond to two different categories of methods. Causal discovery is to learn the structural causal model.
Causal inference is to estimate the causal effect. Specific methods for conducting causal discovery and
causal inference from observational data are the topic of Sections 8.4 and 8.5, respectively. A mental
model of the modeling methods for the two formulations is given in Figure 8.2.
Figure 8.2. Classes of methods for causal modeling from observational data. Accessible caption. A hierarchy
diagram with causal modeling from observational data at its root with children causal discovery and
causal inference. Causal discovery has children conditional independence-based methods and func-
tional model-based methods. Causal inference has children treatment models and outcome models.
𝜏 = 𝐸[ 𝑌 ∣ 𝑑𝑜(𝑡 = 1) ] − 𝐸[ 𝑌 ∣ 𝑑𝑜(𝑡 = 0) ].
Equation 8.2
This difference of the expected value of the outcome label under the two values of the intervention
precisely shows how the outcome changes due to the treatment. How much do wages change because of
the automobile repair class? The terminology contains average because of the expected value.
For example, if 𝑌 ∣ 𝑑𝑜(𝑡 = 0) is a Gaussian random variable with mean 13 dollars per hour and
standard deviation 1 dollar per hour,2 and 𝑌 ∣ 𝑑𝑜(𝑡 = 1) is a Gaussian random variable with mean
18 dollars per hour and standard deviation 2 dollars per hour, then the average treatment effect is 18 −
13 = 5 dollars per hour. Being trained in automobile repair increases the earning potential of clients by
5 dollars per hour. The standard deviation doesn’t matter here.
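As a sanity check on this worked example, the following sketch simulates the two interventional distributions with the stated means and standard deviations (the sample size is arbitrary) and recovers the 5 dollars per hour average treatment effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000   # hypothetical number of simulated clients

# Hourly wages under the two interventions, using the means and standard deviations above
y_do0 = rng.normal(loc=13.0, scale=1.0, size=n)   # do(t = 0): no automobile repair class
y_do1 = rng.normal(loc=18.0, scale=2.0, size=n)   # do(t = 1): automobile repair class

print(round(y_do1.mean() - y_do0.mean(), 2))   # about 5.0; the standard deviations don't enter
```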
Figure 8.3. Motifs that block and do not block paths between the treatment node 𝑇 and the outcome label node 𝑌.
Backdoor paths are not blocked. Accessible caption. If an entire path is made up of causal chains without
observation (𝑋1 → 𝑋3 → 𝑋2 ), it is not a backdoor path. Backdoor paths begin with an edge coming into
the treatment node and can contain common causes without observation (𝑋1 ← 𝑋3 → 𝑋2 ; 𝑋3 is a con-
founding variable) and common effects with observation (𝑋1 → 𝑋3 ← 𝑋2 ; the underline indicates that 𝑋3
or any of its descendants is observed). In this case, 𝑋3 in the common cause without observation is a
confounding variable. The other three motifs—causal chain with observation (𝑋1 → 𝑋3 → 𝑋2 ), common
cause with observation (𝑋1 ← 𝑋3 → 𝑋2 ), and common effect without observation (𝑋1 → 𝑋3 ← 𝑋2 )—are
blockers. If any of them exist along the way from the treatment to the outcome label, there is no open
backdoor path.
2
The pdf of a Gaussian random variable 𝑋 with mean 𝜇 and standard deviation 𝜎 is 𝑝𝑋 (𝑥) = (1/(𝜎√2𝜋)) 𝑒^(−½((𝑥−𝜇)/𝜎)²). Its expected value is 𝜇.
1. a causal chain motif with the middle node observed, i.e. the middle node is conditioned upon,
2. a common cause motif with the middle node observed, or
3. a common effect motif with the middle node not observed (this is a collider)
anywhere between 𝑇 and 𝑌. Backdoor paths can contain (1) the common cause without observation motif
and (2) the common effect with observation motif between 𝑇 and 𝑌. The motifs that block and do not
block a path are illustrated in Figure 8.3.
The lack of equality between the interventional distribution 𝑃( 𝑌 ∣∣ 𝑑𝑜(𝑡) ) and the associational
distribution 𝑃( 𝑌 ∣ 𝑡 ) is known as confounding bias.3 Any middle nodes of common cause motifs along a
backdoor path are confounding variables or confounders. Confounding is the central challenge to be
overcome when you are trying to infer the average treatment effect in situations where intervening is
not possible (you cannot 𝑑𝑜(𝑡)). Section 8.5 covers how to mitigate confounding while estimating the
average treatment effect.
8.2.2 An Example
Figure 8.4 shows an example of using ABC Center’s causal graph (introduced in Figure 8.1) while
quantifying a causal effect. The center wants to test whether reducing a client’s high anxiety to low
anxiety affects their stable housing status. There is a backdoor path from anxiety to stable housing going
through wages and child care. The path begins with an arrow going into anxiety. A common cause
without observation, wages ← child care → stable housing, is the only other motif along the path to stable
housing. It does not block the path. Child care, as the middle node of a common cause, is a confounding
variable. If you can do the treatment, that is, intervene on anxiety (represented diagrammatically with a
hammer), the incoming edges to anxiety from counseling sessions and wages
are removed. Now there is no backdoor path anymore, and you can proceed with the treatment effect
quantification.
Often, however, you cannot do the treatment. These are observational rather than interventional
settings. The observational setting is a completely different scenario than the interventional setting.
Figure 8.5 shows how things play out. Since you cannot make the edge between anxiety and wages go
away through intervention, you have to include the confounding variable of whether the client has child
care or not in your model, and only then will you be able to do a proper causal effect quantification
between anxiety and stable housing. Including, observing, or conditioning upon confounding variables
3
There can be confounding bias without a backdoor path in special cases involving selection bias. Selection bias is when the
treatment variable and another variable are common causes for the outcome label.
is known as adjusting for them. Adjusting for wages rather than child care is an alternative way to block
the backdoor path in the ABC Center graph.
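To see why adjusting matters, here is a small simulation sketch in which the backdoor path is collapsed into a single binary confounder (child care) with invented probabilities. The naive difference in outcomes is biased, while stratifying on the confounder and averaging over its distribution recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical data-generating process (numbers invented for illustration):
# child care influences both the anxiety reduction treatment and stable housing.
child_care = rng.random(n) < 0.4
treated = rng.random(n) < np.where(child_care, 0.7, 0.3)      # anxiety reduction treatment
p_housing = 0.2 + 0.2 * treated + 0.3 * child_care            # true causal effect is +0.2
stable_housing = rng.random(n) < p_housing

# Naive associational difference: biased because child care is a confounder
naive = stable_housing[treated].mean() - stable_housing[~treated].mean()

# Backdoor adjustment: compare within each child-care stratum, then average over P(child care)
adjusted = 0.0
for cc in (False, True):
    stratum = child_care == cc
    diff = (stable_housing[stratum & treated].mean()
            - stable_housing[stratum & ~treated].mean())
    adjusted += diff * stratum.mean()

print("naive:", round(naive, 3))        # noticeably larger than 0.2
print("adjusted:", round(adjusted, 3))  # about 0.2, the true causal effect
```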
Figure 8.4. The scenario of causal effect quantification when you can intervene on the treatment. Accessible
caption. The causal graph of Figure 8.1 is marked with anxiety as the treatment and stable housing as
the outcome. A backdoor path is drawn between the two passing through wages and child care, which
is marked as a confounder. Intervening on anxiety is marked with a hammer. Its effect is the removal
of edges into anxiety from counseling sessions and wages, and the removal of the backdoor path.
Figure 8.5. The scenario of causal effect quantification when you cannot intervene on the treatment and thus
have to adjust for a variable along a backdoor path. Accessible caption. The causal graph of Figure 8.1 is
marked with anxiety as the treatment and stable housing as the outcome. A backdoor path is drawn
between the two passing through wages and child care, which is marked as a confounder. Adjusting for
child care colors its node gray and removes the backdoor path.
At the end of the day, the whole point of computing causal effects is to inform decision making. If
there are two competing social services that ABC Center can offer a client, causal models should
recommend the one with the largest effect on the outcome that they care about. In the next sections, you
will proceed to find models for this task from data.
you actually do the treatment. It is data collected as part of an experiment that has already been thought
out beforehand. An experiment that ABC Center might conduct to obtain interventional data is to enroll
one group of clients in a financial education seminar and not enroll another group of clients. The group
receiving the treatment of the financial education seminar is the treatment group and the group not
receiving the seminar is the control group. ABC Center would collect data about those clients along many
feature dimensions, and this would constitute the dataset to be modeled in the next phase of the
lifecycle. It is important to collect data for all features that you think could possibly be confounders.
As already seen in the previous sections and irrespective of whether collected interventionally or
observationally, the random variables are: the treatment 𝑇 that designates the treatment and control
groups (anxiety intervention), the outcome label 𝑌 (stable housing), and other features 𝑋 (child care and
others). A collection of samples from these random variables constitute the dataset in average treatment
effect estimation: {(𝑡1 , 𝑥1 , 𝑦1 ), … , (𝑡𝑛 , 𝑥𝑛 , 𝑦𝑛 )}. An example of such a dataset is shown in Figure 8.6. In
estimating a structural causal model, you just have random variables 𝑋 and designate a treatment and
outcome label later if needed.
Figure 8.6. A dataset for treatment effect estimation. The axes are two feature dimensions of 𝑥. The unfilled data
points are the control group 𝑡 = 0 and the filled data points are the treatment group 𝑡 = 1. The diamond data
points have the outcome label 𝑦 = 0 and the square data points have the outcome label 𝑦 = 1.
The causal graph can be estimated from the entire data if the goal of ABC Center is to get a general
understanding of all of the causal relationships. Alternatively, causal effects can be estimated from the
treatment, confounding, and outcome label variables if the director wants to test a specific hypothesis
such as anxiety reduction seminars having a causal effect on stable housing. A special kind of
experiment, known as a randomized trial, randomly assigns clients to the treatment group and control
group, thereby mitigating confounding within the population being studied.
It is not possible to observe both the outcome label 𝑦 under the control 𝑡 = 0 and its counterfactual 𝑦
under the treatment 𝑡 = 1 because the same individual cannot both receive and not receive the
treatment at the same time. This is known as the fundamental problem of causal modeling. The fundamental
problem of causal modeling is lessened by randomization because, on average, the treatment group and
control group contain matching clients that almost look alike. Randomization does not prevent a lack of
external validity however (recall from Chapter 4 that external validity is the ability of a dataset to
generalize to a different population). It is possible that some attribute of all clients in the population is
commonly caused by some other variable that is different in other populations.
Randomized trials are considered to be the gold standard in causal modeling that should be done if
possible. Randomized trials and interventional data collection more broadly, however, are often
prohibited by ethical or logistical reasons. For example, it is not ethical to withhold a treatment known
to be beneficial such as a job training class to test a hypothesis. From a logistical perspective, ABC Center
may not, for example, have the resources to give half its clients a $1000 cash transfer (similar to
Unconditionally’s modus operandi in Chapter 4) to test the hypothesis that this intervention improves
stable housing. Even without ethical or logistical barriers, it can also be the case that ABC Center’s
director and executive staff think up a new cause-and-effect relationship they want to investigate after
the data has already been collected.
In all of these cases, you are in the setting of observational data rather than interventional data. In
observational data, the treatment variable’s value has not been forced to a given value; it has just taken
whatever value it happened to take, which could be dependent on all sorts of other variables and
considerations. In addition, because observational data is often data of convenience that has not been
purposefully collected, it might be missing a comprehensive set of possible confounding variables.
The fundamental problem of causal modeling is very apparent in observational data and because of
it, testing and validating causal models becomes challenging. The only (unsatisfying) ways to test causal
models are (1) through simulated data that can produce both a factual and counterfactual data point, or
(2) to collect an interventional dataset from a very similar population in parallel. Regardless, if all you
have is observational data, all you can do is work with it in the modeling phase of the lifecycle.
The first, manual option is reasonable, but it can introduce human biases and is not scalable to problems
with a large number of variables, more than twenty or thirty. The second, experimental option is also
reasonable, but it too does not scale to a large number of variables because interventional experiments
would have to be conducted for every possible edge. The third option, known as causal discovery, is the
most tractable in practice and what you should pursue with ABC Center.5
You’ve probably heard the phrase “those who can, do; those who can’t, teach” which is shortened to
“those who can’t do, teach.” In causal modeling from observational data when you can’t intervene, the
4
Clark Glymour, Kun Zhang, and Peter Spirtes. “Review of Causal Discovery Methods Based on Graphical Models.” In: Frontiers
in Genetics 10 (Jun. 2019), p. 524.
5
There are advanced methods for causal discovery that start with observational data and tell you a few important experiments
to conduct to get an even better graph, but they are beyond the scope of the book.
phrase to keep in mind is “those who can’t do, assume.” Causal discovery has two branches, shown back
in Figure 8.2, each with a different assumption that you need to make. The first branch is based on
conditional independence testing and relies on the faithfulness assumption. The main idea of faithfulness
is that the conditional dependence and independence relationships among the random variables
encode the causal relationships. There is no coincidental or deterministic relationship among the
random variables that masks a causal relationship. The Bayesian network edges are the edges of the
structural causal model. Faithfulness is usually true in practice.
One probability distribution can be factored in many ways by choosing different sets of variables to
condition on, which leads to different graphs. Arrows pointing in different directions also lead to
different graphs. All of these different graphs arising from the same probability distribution are known
as a Markov equivalence class. One especially informative example of a Markov equivalence class is the
setting with just two random variables, say anxiety and wages.6 The graph with anxiety as the parent and
wages as the child and the graph with wages as the parent and anxiety as the child lead to the same
probability distribution, but with opposite cause-and-effect relationships. One important point about
the conditional independence testing branch of causal discovery methods is that they find Markov
equivalence classes of graph structures rather than finding single graph structures.
The second branch of causal discovery is based on making assumptions on the form of the structural
equations 𝑃( 𝑌 ∣∣ 𝑑𝑜(𝑡) ) = 𝑓𝑌 (𝑡, 𝑛𝑜𝑖𝑠𝑒𝑌 ) introduced in Equation 8.1. Within this branch, there are several
different varieties. For example, some varieties assume that the functional model has a linear function
𝑓𝑌 , others assume that the functional model has a nonlinear function 𝑓𝑌 with additive noise 𝑛𝑜𝑖𝑠𝑒𝑌 , and
even others assume that the probability distribution of the noise 𝑛𝑜𝑖𝑠𝑒𝑌 has small entropy. Based on the
assumed functional form, a best fit to the observational data is made. The assumptions in this branch
are much stronger than in conditional independence testing, but lead to single graphs as the solution
rather than Markov equivalence classes. These characteristics are summarized in Table 8.1.
In the remainder of this section, you’ll see an example of each branch of causal discovery in action:
the PC algorithm for conditional independence testing-based methods and the additive noise model-
based approach for functional model-based methods.
6
Matthew Ridley, Gautam Rao, Frank Schilbach, and Vikram Patel. “Poverty, Depression, and Anxiety: Causal Evidence and Mechanisms.” In: Science 370.6522 (Dec. 2020).
0. The overall PC algorithm starts with a complete undirected graph with edges between all pairs
of nodes.
1. As a first step, the algorithm tests every pair of nodes; if they are independent, it deletes the
edge between them. Next it continues to test conditional independence for every pair of nodes
conditioning on larger and larger subsets, deleting the edge between the pair of nodes if any
conditional independence is found. The end result is the undirected skeleton of the causal
graph.
The reason for this first step is as follows. There is an undirected edge between nodes 𝑋1 and 𝑋2
if and only if 𝑋1 and 𝑋2 are dependent conditioned on every possible subset of all other nodes.
(So if a graph has three other nodes 𝑋3 , 𝑋4 , and 𝑋5 , then you’re looking for 𝑋1 and 𝑋2 to be (1) un-
conditionally dependent given no other variables, (2) dependent given 𝑋3 , (3) dependent given
𝑋4 , (4) dependent given 𝑋5 , (5) dependent given 𝑋3 , 𝑋4 , (6) dependent given 𝑋3 , 𝑋5 , (7) dependent
given 𝑋4 , 𝑋5 , and (8) dependent given 𝑋3 , 𝑋4 , 𝑋5 .) These conditional dependencies can be figured
out using d-separation, which was introduced back in Chapter 3.
2. The second step puts arrowheads on as many edges as it can. The algorithm conducts condi-
tional independence tests between the first and third nodes of three-node chains. If they’re de-
pendent conditioned on some set of nodes containing the middle node, then a common effect
(collider) motif with arrows pointing into the middle node is created. The end result is a partially-oriented causal graph. The
direction of edges that the algorithm cannot figure out remain unknown. All choices of all of
those orientations give you the different graphs that make up the Markov equivalence class.
The reason for the second step is that an undirected chain of nodes 𝑋1 , 𝑋2 , and 𝑋3 can be made
directed into 𝑋1 → 𝑋2 ← 𝑋3 if and only if 𝑋1 and 𝑋3 are dependent conditioned on every possible
subset of nodes containing 𝑋2 . These conditional dependencies can also be figured out using d-
separation.
At the end of the example in Figure 8.7, the Markov equivalence class contains four possible graphs.
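To make step 1 tangible, here is a simplified sketch of the skeleton-finding step on simulated linear-Gaussian data that follows the Figure 8.7 structure. It uses partial-correlation tests as the conditional independence test and, for brevity, only conditioning sets of size zero or one; a full PC implementation searches larger conditioning sets and then orients colliders as in step 2. The variable names and coefficients are invented.

```python
import numpy as np
from scipy import stats
from itertools import combinations

rng = np.random.default_rng(7)
n = 5000

# Hypothetical linear-Gaussian data following the Figure 8.7 skeleton:
# child care -> wages, wages -> used car, child care -> stable housing, wages -> stable housing
child_care = rng.normal(size=n)
wages = 0.8 * child_care + rng.normal(size=n)
used_car = 0.9 * wages + rng.normal(size=n)
stable_housing = 0.6 * child_care + 0.7 * wages + rng.normal(size=n)

data = {"child care": child_care, "wages": wages,
        "used car": used_car, "stable housing": stable_housing}
names = list(data)

def partial_corr_pvalue(x, y, z=None):
    """p-value for the correlation of x and y after regressing out z (if given)."""
    if z is not None:
        Z = np.column_stack([np.ones_like(z), z])
        x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
        y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return stats.pearsonr(x, y)[1]

# Step 1: start from the complete graph and delete edges that show (conditional) independence
edges = set(combinations(names, 2))
for a, b in list(edges):
    others = [c for c in names if c not in (a, b)]
    pvals = [partial_corr_pvalue(data[a], data[b])]
    pvals += [partial_corr_pvalue(data[a], data[b], data[c]) for c in others]
    if max(pvals) > 0.001:        # some conditioning set makes the pair look independent
        edges.remove((a, b))

print(sorted(edges))   # the undirected skeleton of Figure 8.7
```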
In Chapter 3, d-separation was presented in the ideal case when you know the dependence and
independence of each pair of random variables perfectly well. But when dealing with data, you don’t
have that perfect knowledge. The specific computation you do on data to test for conditional
independence between random variables is often based on an estimate of the mutual information
between them. This seemingly straightforward problem of conditional independence testing among
continuous random variables has a lot of tricks of the trade that continue to be researched and are
beyond the scope of this book.7
7
Rajat Sen, Ananda Theertha Suresh, Karthikeyan Shanmugam, Alexandros G. Dimakis, and Sanjay Shakkottai. “Model-Pow-
ered Conditional Independence Test.” In: Advances in Neural Information Processing Systems 31 (Dec. 2017), pp. 2955–2965.
Figure 8.7. An example of the steps of the PC algorithm. Accessible caption. In step 0, there is a fully con-
nected undirected graph with the nodes child care, wages, stable housing, and used car. In step 1, the
edges between child care and used car, and between stable housing and used car have been removed
because they exhibit conditional independence. The undirected skeleton is left. In step 2, the edge be-
tween child care and stable housing is oriented to point from child care to stable housing, and the edge
between wages and stable housing is oriented to point from wages to stable housing. The edges be-
tween child care and wages and between wages and used car remain undirected. This is the partially-
oriented graph. There are four possible directed graphs which constitute the Markov equivalence class
solution: with edges pointing from child care to wages and used car to wages, with edges pointing from
child care to wages and wages to used car, with edges pointing from wages to child care and from used
car to wages, and with edges pointing from wages to child care and wages to used car.
anxiety. You might think that a change in wages causes a change in anxiety (𝑇 causes 𝑌), but it could be
the other way around (𝑌 causes 𝑇).
One specific method in this functional model-based branch of causal discovery methods is known
as the additive noise model. It requires that 𝑓𝑌 not be a linear function and that the noise be additive:
𝑃( 𝑌 ∣∣ 𝑑𝑜(𝑡) ) = 𝑓𝑌 (𝑡) + 𝑛𝑜𝑖𝑠𝑒𝑌 ; here 𝑛𝑜𝑖𝑠𝑒𝑌 should not depend on 𝑡. The plot in Figure 8.8 shows an
example nonlinear function along with a quantification of the noise surrounding the function. This noise
band is equal in height around the function for all values of 𝑡 since the noise does not depend on 𝑡. Now
look at what’s going on when 𝑡 is the vertical axis and 𝑦 is the horizontal axis. Is the noise band equal in
height around the function for all values of 𝑦? It isn’t, and that’s the key observation. The noise has
constant height around the function when the cause is the horizontal axis and it doesn’t when the effect
is the horizontal axis. There are ways to test for this phenomenon from data, but if you understand this
idea shown in Figure 8.8, you’re golden. If ABC Center wants to figure out whether a decrease in anxiety
causes an increase in wages, or if an increase in wages causes a decrease in anxiety, you know what
analysis to do.
Figure 8.8. Example of using the additive noise model method to tell cause and effect apart. Since the height of
the noise band is the same across all 𝑡 and different across all 𝑦, 𝑡 is the cause and 𝑦 is the effect. Accessible cap-
tion. Two plots of the same nonlinear function and noise bands around it. The first plot has 𝑦 on the
vertical axis and 𝑡 on the horizontal axis; the second has 𝑡 on the vertical axis and 𝑦 on the horizontal
axis. In the first plot, the height of the noise band is consistently the same for different values of 𝑡. In
the second plot, the height of the noise band is consistently different for different values of 𝑦.
This phenomenon does not happen when the function is linear. Try drawing Figure 8.8 for a linear
function in your mind, and then flip it around as a thought experiment. You’ll see that the height of the
noise is the same both ways and so you cannot tell cause and effect apart.
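Here is a rough sketch of that analysis on simulated data: fit a nonlinear regression in both directions and check whether the size of the residuals depends on the input, which is a crude stand-in for the noise-band picture in Figure 8.8. The data-generating function and the dependence measure are choices made purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 4000

# Hypothetical additive noise model: t causes y through a nonlinear function
t = rng.uniform(-2, 2, size=n)
y = t ** 3 + 0.3 * rng.normal(size=n)

def residual_dependence(cause, effect, degree=3):
    """Fit a polynomial regression of effect on cause and measure how much the size
    of the residuals still depends on the cause (a crude proxy for the noise-band test)."""
    coeffs = np.polyfit(cause, effect, degree)
    residuals = effect - np.polyval(coeffs, cause)
    return abs(stats.spearmanr(np.abs(cause), np.abs(residuals))[0])

print("assuming t -> y:", round(residual_dependence(t, y), 3))  # close to zero
print("assuming y -> t:", round(residual_dependence(y, t), 3))  # noticeably larger
```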
Figure 8.5, you know that child care is a confounding variable, and because of proper foresight and not
taking shortcuts, you have collected data on it. The 𝑡𝑖 values report those clients who received an anxiety
reduction treatment, the 𝑥𝑖 values report data on clients’ child care situation and other possible
confounders, and the 𝑦𝑖 values are the client’s outcome label on stable housing.
Remember our working phrase: “those who can’t do, assume.” Just like in causal discovery, causal
inference from observational data requires assumptions. A basic assumption in causal inference is
similar to the independent and identically distributed (i.i.d.) assumption in machine learning,
introduced in Chapter 3. This causal inference assumption, the stable unit treatment value assumption,
simply says that the outcome of one client only depends on the treatment made to that client, and is not
affected by treatments to other clients. There are two important assumptions:
1. No unmeasured confounders also known as ignorability. The dataset needs to contain all the con-
founding variables within 𝑋.
2. Overlap also known as positivity. The probability of the treatment 𝑇 given the confounding varia-
bles must be neither equal to 0 nor equal to 1. It must take a value strictly greater than 0 and
strictly less than 1. This definition explains the name positivity because the probability has to
be positive, not identically zero. Another perspective on the assumption is that the probability
distribution of 𝑋 for the treatment group and the probability distribution of 𝑋 for the control
group should overlap; there should not be any support for one of the two distributions where
there isn’t support for the other. Overlap and the lack thereof is illustrated in Figure 8.9 using a
couple of datasets.
Figure 8.9. On the left, there is much overlap in the support of the treatment and control groups, so average
treatment effect estimation is possible. On the right, there is little overlap, so average treatment effect estimation
should not be pursued. Accessible caption. Plots with data points from the treatment group and control
group overlaid with regions indicating the support of their underlying distributions. In the left plot,
there is much overlap in the support and in the right plot, there isn’t.
Both assumptions together go by the name strong ignorability. If the strong ignorability assumptions
are not true, you should not conduct average treatment effect estimation from observational data. Why
are these two assumptions needed and why are they important? If the data contains all the possible
confounding variables, you can adjust for them to get rid of any confounding bias that may exist. If the
data exhibits overlap, you can manipulate or balance the data to make it look like the control group and
the treatment group are as similar as can be.
If you’ve just been given a cleaned and prepared version of ABC Center’s data to perform average
treatment effect estimation on, what are your next steps? There are four tasks for you to do in an iterative
manner, illustrated in Figure 8.10. The first task is to specify a causal method, choosing between (1)
treatment models and (2) outcome models. These are the two main branches of conducting causal inference
from observational data and were shown back in Figure 8.2. Their details are coming up in the next
subsections. The second task in the iterative approach is to specify a machine learning method to be
plugged in within the causal method you choose. Both types of causal methods, treatment models and
outcome models, are based on machine learning under the hood. The third task is to train the model.
The fourth and final task is to evaluate the assumptions to see whether the result can really be viewed
as a causal effect.8 Let’s go ahead and run the four tasks for the ABC Center problem.
Figure 8.10. The steps to follow while conducting average treatment effect estimation from observational data.
Outside of this loop, you may also go back to the data preparation phase of the machine learning lifecycle if
needed. Accessible caption. A flow diagram starting with specify causal method, leading to specify ma-
chine learning method, leading to train model, leading to evaluate assumptions. If assumptions are not
met, flow back to specify causal method. If assumptions are met, return average treatment effect.
8
Yishai Shimoni, Ehud Karavani, Sivan Ravid, Peter Bak, Tan Hung Ng, Sharon Hensley Alford, Denise Meade, and Ya’ara Gold-
schmidt. “An Evaluation Toolkit to Guide Model Selection and Cohort Definition in Causal Inference.” arXiv:1906.00442, 2019.
probability weighting, is to give more weight to clients in the treatment group that were much more likely
to be assigned to the control group and vice versa. Clients given the anxiety reduction treatment 𝑡𝑗 = 1
are given weight inversely proportional to their propensity score 𝑤𝑗 = 1/𝑃( 𝑇 = 1 ∣∣ 𝑋 = 𝑥𝑗 ). Clients not
given the treatment 𝑡𝑗 = 0 are similarly weighted 𝑤𝑗 = 1/𝑃( 𝑇 = 0 ∣∣ 𝑋 = 𝑥𝑗 ) which also equals 1/ (1 −
𝑃( 𝑇 = 1 ∣∣ 𝑋 = 𝑥𝑗 )). The average treatment effect of anxiety reduction on stable housing is then simply
the weighted mean difference of the outcome label between the treatment group and the control group.
If you define the treatment group as 𝒯 = {𝑗 ∣ 𝑡𝑗 = 1} and the control group as 𝒞 = {𝑗 ∣ 𝑡𝑗 = 0}, then the
average treatment effect estimate is
𝜏 = (1/‖𝒯‖) ∑_(𝑗∈𝒯) 𝑤𝑗 𝑦𝑗 − (1/‖𝒞‖) ∑_(𝑗∈𝒞) 𝑤𝑗 𝑦𝑗 .
Equation 8.3
Getting the propensity score 𝑃( 𝑇 ∣ 𝑋 ) from training data samples {(𝑥1 , 𝑡1 ), … , (𝑥𝑛 , 𝑡𝑛 )} is a machine
learning task with features 𝑥𝑗 and labels 𝑡𝑗 in which you want a (calibrated) continuous score as output
(the score was called 𝑠(𝑥) in Chapter 6). The learning task can be done with any of the machine learning
algorithms from Chapter 7. The domains of competence of the different choices of machine learning
algorithms for estimating the propensity score are the same as for any other machine learning task, e.g.
decision forests for structured datasets and neural networks for large semi-structured datasets.
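As an illustration, the following scikit-learn sketch simulates a confounded dataset with invented coefficients, fits a logistic-regression propensity score, and forms the weighted mean difference of outcomes. The group means here are normalized by the sum of the weights in each group, a common self-normalized variant of the weighted means in Equation 8.3.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 50_000

# Hypothetical observational data: child care confounds the treatment and the outcome
child_care = (rng.random(n) < 0.4).astype(float)
other_feat = rng.normal(size=n)
X = np.column_stack([child_care, other_feat])

p_treat = 1 / (1 + np.exp(-(-1.0 + 2.0 * child_care)))   # treatment depends on child care
t = (rng.random(n) < p_treat).astype(int)
p_housing = 0.2 + 0.2 * t + 0.3 * child_care              # true treatment effect is +0.2
y = (rng.random(n) < p_housing).astype(int)

# Propensity score model P(T = 1 | X)
propensity = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]
w = np.where(t == 1, 1 / propensity, 1 / (1 - propensity))

# Weighted mean difference of the outcome between the treatment and control groups
treated, control = t == 1, t == 0
tau = (np.sum(w[treated] * y[treated]) / np.sum(w[treated])
       - np.sum(w[control] * y[control]) / np.sum(w[control]))
print(round(tau, 3))   # close to the true +0.2

naive = y[treated].mean() - y[control].mean()
print(round(naive, 3))  # confounded, noticeably larger than +0.2
```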
Once you’ve trained a propensity score model, the next step is to evaluate it to see whether it meets
the assumptions for causal inference. (Just because you can compute an average treatment effect
doesn’t mean that everything is hunky-dory and that your answer is actually the causal effect.) There
are four main evaluations of a propensity score model: (1) covariate balancing, (2) calibration, (3) overlap
of propensity distribution, and (4) area under the receiver operating characteristic (AUC). Calibration
and AUC were introduced in Chapter 6 as ways to evaluate typical machine learning problems, but
covariate balancing and overlap of propensity distribution are new here. Importantly, the use of AUC to
evaluate propensity score models is different than its use to evaluate typical machine learning problems.
Since the goal of inverse probability weighting is to make the potential confounding variables 𝑋 look
alike in the treatment and control groups, the first evaluation, covariate balancing, tests whether that has
been accomplished. This is done by computing the standardized mean difference (SMD) going one-by-
one through the 𝑋 features (child care and other possible confounders). Just subtract the mean value of
the feature for the control group data from the mean value for the treatment group data, and divide by
the square root of the average variance of the feature for the treatment and control groups.
‘Standardized’ refers to the division at the end, which is done so that you don’t have to worry about the
absolute scale of different features. An absolute value of SMD greater than about 0.1 for any feature
should be a source of concern. If you see this happening, your propensity score model is not good and
you shouldn’t draw causal conclusions from the average treatment effect.
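Here is a minimal sketch of that SMD computation (the feature values are invented); passing weights would give the weighted version you would check after inverse probability weighting.

```python
import numpy as np

def standardized_mean_difference(x_treat, x_control, w_treat=None, w_control=None):
    """SMD of one feature: difference in (possibly weighted) group means divided by
    the square root of the average of the two group variances."""
    mean_t = np.average(x_treat, weights=w_treat)
    mean_c = np.average(x_control, weights=w_control)
    pooled = np.sqrt((np.var(x_treat) + np.var(x_control)) / 2)
    return (mean_t - mean_c) / pooled

# Hypothetical feature (e.g. hours of child care per week) that is imbalanced across groups
rng = np.random.default_rng(2)
x_treat = rng.normal(loc=12.0, scale=4.0, size=1000)
x_control = rng.normal(loc=10.0, scale=4.0, size=1500)

print(round(standardized_mean_difference(x_treat, x_control), 3))  # > 0.1: a source of concern
```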
The second evaluation is calibration. Since the propensity score model is used as an actual
probability in inverse probability weighting, it has to have good calibration to be effective. Introduced in
Chapter 6, the calibration loss needs to be small and the calibration curve needs to be as much of a
straight line as possible. If they aren’t, you shouldn’t draw causal conclusions from the average
treatment effect you compute and need to go back to step 1.
The third evaluation is based on the distributions of the propensity score for the treatment group
and the control group, illustrated in Figure 8.11. Spikes in the distribution near 0 or 1 are bad because
they indicate a possible large set of 𝑋 values that can be almost perfectly classified by the propensity
score model. Perfect classification means that there is almost no overlap of the treatment group and
control group in that region, which is not desired to meet the positivity assumption. If you see such
spikes, you should not proceed with this model. (This evaluation doesn’t tell you what the non-overlap
region is, but just that it exists.)
Figure 8.11. Example propensity score distributions; the one on the left indicates a possible overlap violation,
whereas the one on the right does not. Accessible caption. Plots with density on the vertical axis and pro-
pensity score on the horizontal axis. Each plot overlays a control pdf and a treatment pdf. The pdfs in
the left plot do not overlap much and the control group distribution has a spike near 0. The pdfs in the
right plot are almost completely on top of each other.
The fourth evaluation of a treatment model is the AUC. Although its definition and computation are
the same as in Chapter 6, good values of the AUC of a propensity score model are not near the perfect
1.0. Intermediate values of the AUC, like 0.7 or 0.8, are just right for average treatment effect estimation.
A poor AUC of nearly 0.5 remains bad for a propensity score model. If the AUC is too high or too low, do
not proceed with this model. Once you’ve done all the diagnostic evaluations and none of them raise an
alert, you should proceed with reporting the average treatment effect that you have computed as an
actual causal insight. Otherwise, you have to go back and specify different causal methods and/or
machine learning methods.
Why is learning 𝐸[ 𝑌 ∣ 𝑇, 𝑋 ] models from data useful? How do you get the average treatment effect of
anxiety reduction on stable housing from them? Remember that the definition of the average treatment
effect is 𝜏 = 𝐸[ 𝑌 ∣ 𝑑𝑜(𝑡 = 1) ] − 𝐸[ 𝑌 ∣ 𝑑𝑜(𝑡 = 0) ]. Also, remember that when there is no confounding, the
associational distribution and interventional distribution are equal, so 𝐸[ 𝑌 ∣ 𝑑𝑜(𝑡) ] = 𝐸[ 𝑌 ∣ 𝑇 = 𝑡 ]. Once
you have 𝐸[ 𝑌 ∣ 𝑇 = 𝑡, 𝑋 ], you can use something known as the law of iterated expectations to adjust for 𝑋
and get 𝐸[ 𝑌 ∣ 𝑇 = 𝑡 ]. The trick is to take an expectation over 𝑋 because 𝐸𝑋 [𝐸𝑌 [ 𝑌 ∣ 𝑇 = 𝑡, 𝑋 ]] =
𝐸𝑌 [ 𝑌 ∣ 𝑇 = 𝑡 ]. (The subscripts on the expectations tell you which random variable you’re taking the
expectation with respect to.) To take the expectation over 𝑋, you sum the outcome model over all the
values of 𝑋 weighted by the probabilities of each of those values of 𝑋. It is clear sailing after that to get
the average treatment effect because you can compute the difference 𝐸[ 𝑌 ∣ 𝑇 = 1 ] − 𝐸[ 𝑌 ∣ 𝑇 = 0 ]
directly.
You have the causal model; now on to the machine learning model. When the outcome label 𝑌 takes
binary values 0 and 1 corresponding to the absence and presence of stable housing, then the expected
values are equivalent to the probabilities 𝑃( 𝑌 ∣ 𝑇 = 1, 𝑋 = 𝑥 ) and 𝑃( 𝑌 ∣ 𝑇 = 0, 𝑋 = 𝑥 ). Learning these
probabilities is a job for a calibrated machine learning classifier with continuous score output trained
on labels 𝑦𝑗 and features (𝑡𝑗 , 𝑥𝑗 ). You can use any machine learning method from Chapter 7 with the same
guidelines for domains of competence. Traditionally, it has been common practice to use linear margin-
based methods for the classifier, but nonlinear methods should be tried especially for high-dimensional
data with lots of possible confounding variables.
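Here is a sketch of the outcome-model route on simulated data with invented coefficients: fit a classifier on features (𝑡, 𝑥), with an interaction term so the simple model can match each (𝑡, 𝑥) cell, then apply the law of iterated expectations by averaging its predictions over the empirical distribution of 𝑋 with the treatment forced to 1 and to 0.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)
n = 50_000

# Hypothetical confounded data, same flavor as the treatment-model sketch
child_care = (rng.random(n) < 0.4).astype(float)
t = (rng.random(n) < np.where(child_care == 1, 0.7, 0.3)).astype(float)
y = (rng.random(n) < 0.2 + 0.2 * t + 0.3 * child_care).astype(int)   # true effect is +0.2

def design(t_col, x_col):
    # features (t, x) plus an interaction so the model can match every (t, x) cell
    return np.column_stack([t_col, x_col, t_col * x_col])

outcome_model = LogisticRegression().fit(design(t, child_care), y)

# Law of iterated expectations: average E[Y | T = t, X] over the empirical distribution of X,
# with the treatment forced to 1 and then to 0
e_y_t1 = outcome_model.predict_proba(design(np.ones(n), child_care))[:, 1].mean()
e_y_t0 = outcome_model.predict_proba(design(np.zeros(n), child_care))[:, 1].mean()
print(round(e_y_t1 - e_y_t0, 3))   # close to the true +0.2
```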
Just like with treatment models, being able to compute an average treatment effect using outcome
models does not automatically mean that your result is a causal inference. You still have to evaluate. A
first evaluation, which is also an evaluation for treatment models, is calibration. You want small
calibration loss and a straight line calibration curve. A second evaluation for outcome models is
accuracy, for example measured using AUC. With outcome models, just like with regular machine
learning models but different from treatment models, you want the AUC to be as large as possible
approaching 1.0. If the AUC is too small, do not proceed with this model and go back to step 1 in the
iterative approach to average treatment effect estimation illustrated in Figure 8.10.
A third evaluation for outcome models examines the predictions they produce to evaluate
ignorability or no unmeasured confounders. The predicted 𝑌 ∣ 𝑡 = 1 and 𝑌 ∣ 𝑡 = 0 values coming out of
the outcome models should be similar for clients who were actually part of the treatment group (received
the anxiety reduction intervention) and clients who were part of the control group (did not receive the
anxiety reduction intervention). If the predictions are not similar, there is still some confounding left
over after adjusting for 𝑋 (child care and other variables), which means that the assumption of no
unmeasured confounders is violated. Thus, if the predicted 𝑌 ∣ 𝑡 = 1 and 𝑌 ∣ 𝑡 = 0 values for the two
groups do not mostly overlap, then do not proceed and go back to the choice of causal model and
machine learning model.
8.5.3 Conclusion
You’ve evaluated two options for causal inference: treatment models and outcome models. Which option
is better in what circumstances? Treatment and outcome modeling are inherently different problems
with different features and labels. You may simply end up with better evaluations for one causal method
than the other, given the machine learning method options available. So try both branches
and see how well the results correspond. Some methods will be better matched to the relevant modeling tasks,
depending on the domains of competence of the machine learning methods under the hood.
But do you know what? You’re in luck and don’t have to choose between the two branches of causal
inference. There’s a hybrid approach called doubly-robust estimation in which the propensity score
values are added as an additional feature in the outcome model.9 Doubly-robust models give you the best
of both worlds! ABC Center’s director is waiting to decide whether he should invest more in anxiety
reduction interventions. Once you’re done with your causal modeling analysis, he’ll be able to make an
informed decision.
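Reading that description literally, here is a sketch on simulated data (coefficients invented) that appends the propensity score as an extra feature of the outcome model and then averages predictions with the treatment forced on and off. It follows the description in the text rather than the full doubly-robust estimation machinery from the reference.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(9)
n = 50_000

# Hypothetical confounded data (true treatment effect +0.2, as in the earlier sketches)
x = (rng.random(n) < 0.4).astype(float)                         # child care
t = (rng.random(n) < np.where(x == 1, 0.7, 0.3)).astype(int)    # anxiety reduction treatment
y = (rng.random(n) < 0.2 + 0.2 * t + 0.3 * x).astype(int)       # stable housing

# Step 1: propensity score model
prop = LogisticRegression().fit(x.reshape(-1, 1), t).predict_proba(x.reshape(-1, 1))[:, 1]

# Step 2: outcome model with the propensity score appended as an extra feature
feats = np.column_stack([t, x, prop])
outcome_model = LogisticRegression().fit(feats, y)

# Step 3: average predictions with the treatment forced to 1 and to 0
f1 = np.column_stack([np.ones(n), x, prop])
f0 = np.column_stack([np.zeros(n), x, prop])
tau = (outcome_model.predict_proba(f1)[:, 1] - outcome_model.predict_proba(f0)[:, 1]).mean()
print(round(tau, 3))   # approximately recovers the true +0.2 effect
```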
8.6 Summary
▪ Causality is a fundamental concept that expresses how changing one thing (the cause) results in
another thing changing (the effect). It is different than correlation, predictability, and
dependence.
▪ Causal models are critical to inform decisions involving interventions and treatments with
expected effects on outcomes. Predictive associational models are not sufficient when you are
‘doing’ something to an input.
▪ In addition to informing decisions, causal modeling is a way to avoid harmful spurious
relationships in predictive models.
▪ Structural causal models extend Bayesian networks by encoding causal relationships in addition
to statistical relationships. Their graph structure allows you to understand what causes what, as
well as chains of causation, among many variables. Learning their graph structure is known as
causal discovery.
▪ Causal inference between a hypothesized pair of treatment and outcome is a different problem
specification. To validly conduct causal inference from observational data, you must control for
confounding.
▪ Causal modeling requires assumptions that are difficult to validate, but there is a set of
evaluations you should perform as part of modeling to do the best that you can.
9
Miguel A. Hernán and James M. Robins. Causal Inference: What If. Boca Raton, Florida, USA: Chapman & Hall/CRC, 2020.
9
Distribution Shift
Wavetel is a leading (fictional) mobile telephony provider in India that expanded its operations to several
East and Central African countries in recent years. One of its profit centers in the African markets is
credit enabled by mobile money that it runs through partnerships with banks in each of the nations. The
most straightforward application of mobile money is savings, first started in Kenya in 2007 under the
name M-Pesa. With mobile money savings, customers can deposit, withdraw, and transfer funds
electronically without a formal bank account, all through their mobile phones. (Remember that these
transactions are one of the data sources that Unconditionally evaluated in Chapter 4.) More advanced
financial services such as credit and insurance later emerged. In these advanced services, the bank
takes on financial risk and can’t just hand out accounts without an application process and some amount
of due diligence.
Having seen how profitable mobile money-enabled credit can be, Wavetel strongly lobbied for it to
be allowed in its home country of India and has just seen the regulations signed into law. Partnering with
the (fictional) Bank of Bulandshahr, Wavetel is ready to deploy this new service under the name Phulo.
Based on market research, Wavetel and the Bank of Bulandshahr expect Phulo to receive tens of
thousands of applications per day when first launched. They have to be ready to approve or deny those
applications in near real-time. To deal with this load, imagine that they have hired your data science
team as consultants to create a machine learning model that makes the decisions.
The task you face, approving and denying mobile phone-enabled loans for unbanked customers in
India has never been done before. The Bank of Bulandshahr’s historical loan approval data will not be
useful for making decisions on Phulo applicants. However, Wavetel has privacy-preserved data from
mobile money-enabled credit systems in several East and Central African countries that it has the rights
and consent to use in its India operations. Can you train the Phulo machine learning model using the
African datasets? What could go wrong?
If you’re not careful, there’s a lot that could go wrong. You could end up creating a really harmful and
unreliable system, because of the big lie of machine learning: the core assumption that training data and
testing data are independent and identically distributed (i.i.d.). This assumption is almost never true in the real world,
where there tends to be some sort of difference in the probability distributions of the training data and
the data encountered during the model’s deployment. This difference in distributions is known as
distribution shift. A competent model that achieves high accuracy when tested through cross-validation
might not maintain that competence in the real world. Too much epistemic uncertainty sinks the ship
of even a highly risk-minimizing model.
“All bets are off if there is a distribution shift when the model is deployed. (There's
always a distribution shift.)”
This chapter begins Part 4 of the book on reliability and dealing with epistemic uncertainty, which
constitutes the second of four attributes of trustworthiness (the others are basic performance, human
interaction, and aligned purpose) as well as the second of two attributes of safety (the first is minimizing
risk and aleatoric uncertainty). As shown in Figure 9.1, you’re halfway home to creating trustworthy
machine learning systems!
Figure 9.1. Organization of the book. This fourth part focuses on the second attribute of trustworthiness, reliabil-
ity, which maps to machine learning models that are robust to epistemic uncertainty. Accessible caption. A flow
diagram from left to right with six boxes: part 1: introduction and preliminaries; part 2: data; part 3:
basic modeling; part 4: reliability; part 5: interaction; part 6: purpose. Part 4 is highlighted. Parts 3–4
are labeled as attributes of safety. Parts 3–6 are labeled as attributes of trustworthiness.
In this chapter, while working through the modeling phase of the machine learning lifecycle to create
a safe and reliable Phulo model, you will:
▪ examine how epistemic uncertainty leads to poor machine learning models, both with and
without distribution shift,
▪ judge which kind of distribution shift you have, and
▪ mitigate the effects of distribution shift in your model.
to differentiate aleatoric uncertainty from epistemic uncertainty. Now’s the time to apply what you
learned and figure out where epistemic uncertainty is rearing its head!
First, in Figure 9.2, let’s expand on the picture of the different biases and validities from Figure 4.3
to add a modeling step that takes you from the prepared data space to a prediction space where the
output predictions of the model live. As you learned in Chapter 7, in modeling, you’re trying to get the
classifier to generalize from the training data to the entire set of features by using an inductive bias,
without overfitting or underfitting. Working backwards from the prediction space, the modeling process
is the first place where epistemic uncertainty creeps in. Specifically, if you don’t have the information to
select a good inductive bias and hypothesis space, but you could obtain it in principle, then you have
epistemic uncertainty.1 Moreover, if you don’t have enough high-quality data to train the classifier even
if you have the perfect hypothesis space, you have epistemic uncertainty.
Figure 9.2. Different spaces and what can go wrong due to epistemic uncertainty throughout the machine learn-
ing pipeline. Accessible caption. A sequence of five spaces, each represented as a cloud. The construct
space leads to the observed space via the measurement process. The observed space leads to the raw
data space via the sampling process. The raw data space leads to the prepared data space via the data
preparation process. The prepared data space leads to the prediction space via the modeling process.
The measurement process contains social bias, which threatens construct validity. The sampling pro-
cess contains representation bias and temporal bias, which threatens external validity. The data prep-
aration process contains data preparation bias and data poisoning, which threaten internal validity.
The modeling process contains underfitting/overfitting and poor inductive bias, which threaten gener-
alization.
The epistemic uncertainty in the model has a few different names, including the Rashomon effect2 and
underspecification.3 The main idea, illustrated in Figure 9.3, is that a lot of models perform similarly well
1
Eyke Hüllermeier and Willem Waegeman. “Aleatoric and Epistemic Uncertainty in Machine Learning: An Introduction to Con-
cepts and Methods.” In: Machine Learning 110.3 (Mar. 2021), pp. 457–506.
2
Aaron Fisher, Cynthia Rudin, and Francesca Dominici. “All Models Are Wrong, but Many Are Useful: Learning a Variable’s
Importance by Studying an Entire Class of Prediction Models Simultaneously.” In: Journal of Machine Learning Research 20.177
(Dec. 2019). Rashomon is the title of a film in which different witnesses give different descriptions of the same event.
3
Alexander D’Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan
Deaton, Jacob Eisenstein, Matthew D. Hoffman, Farhad Hormozdiari, Neil Houlsby, Shaobo Hou, Ghassen Jerfel, Alan
in terms of aleatoric uncertainty and risk, but have different ways of generalizing because you have not
minimized epistemic uncertainty. They all have the possibility of being competent and reliable models.
They have a possibility value of 1. (Other models that do not perform well have a possibility value of 0 of
being competent and reliable models.) However, many of these possible models are unreliable and they
take shortcuts by generalizing from spurious characteristics in the data that people would not naturally
think are relevant features to generalize from.4 They are not causal. Suppose one of the African mobile
money-enabled credit datasets just happens to have a spurious feature like the application being
submitted on a Tuesday that predicts the credit approval label very well. In that case, a machine learning
training algorithm will not know any better and will use it as a shortcut to fit the model. And you know
that taking shortcuts is a no-no for you when building trustworthy machine learning systems, so you don’t
want to let your models take shortcuts either.
Figure 9.3. Among all the models you are considering, many of them can perform well in terms of accuracy and
related measures; they are competent and constitute the Rashomon set. However, due to underspecification and
the epistemic uncertainty that is present, many of the competent models are not safe and reliable. Accessible
caption. A nested set diagram with reliable models being a small subset of competent models
(Rashomon set), which are in turn a small subset of models in the hypothesis space.
How can you know that a competent high-accuracy model is one of the reliable, safe ones and not
one of the unreliable, unsafe ones? The main way is to stress test it by feeding in data points that are
edge cases beyond the support of the training data distribution. More detail about how to test machine
learning systems in this way is covered in Chapter 13.
The main way to reduce epistemic uncertainty in the modeling step that goes from the prepared data
space to the prediction space is data augmentation. If you can, you should collect more data from more
environments and conditions, but that is probably not possible in your Phulo prediction task.
Karthikesalingam, Mario Lucic, Yian Ma, Cory McLean, Diana Mincu, Akinori Mitani, Andrea Montanari, Zachary Nado, Vivek
Nataragan, Christopher Nielson, Thomas F. Osborne, Rajiv Raman, Kim Ramasamy, Rory Sayres, Jessica Schrouff, Mertin Sen-
eviratne, Shannon Sequeira, Harini Suresh, Victor Veitch, Max Vladymyrov, Xuezhi Wang, Kellie Webster, Steve Yadlowsky,
Taedong Yun, Xiaohua Zhai, and D. Sculley. “Underspecification Presents Challenges for Credibility in Modern Machine Learn-
ing.” arXiv:2011.03395, 2020.
4
Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A.
Wichmann. “Shortcut Learning in Deep Neural Networks.” In: Nature Machine Intelligence 2.11 (Nov. 2020), pp. 665–673.
Alternatively, you can augment your training dataset by synthetically generating training data points,
especially those that push beyond the margins of the data distribution you have. Data augmentation can
be done for semi-structured data modalities by flipping, rotating, and otherwise transforming the data
points you have. In structured data modalities, like you have with Phulo, you can create new data points by
changing categorical values and adding noise to continuous values. However, be careful that you do not
introduce any new biases in the data augmentation process.
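To make this concrete, here is a minimal sketch of structured-data augmentation with NumPy and pandas. The function name, column arguments, and perturbation sizes are hypothetical placeholders rather than part of the Phulo pipeline; in practice you would choose perturbations that respect the meaning of each feature and check that they do not introduce new biases.

```python
import numpy as np
import pandas as pd

def augment_tabular(df, categorical_cols, continuous_cols, n_new=1000, noise_scale=0.05, seed=0):
    """Create synthetic rows by resampling categorical values and jittering continuous ones."""
    rng = np.random.default_rng(seed)
    base = df.sample(n=n_new, replace=True, random_state=seed).reset_index(drop=True)
    for col in categorical_cols:
        # Replace a random subset of categorical values with other observed categories.
        mask = rng.random(n_new) < 0.3
        base.loc[mask, col] = rng.choice(df[col].unique(), size=mask.sum())
    for col in continuous_cols:
        # Add small Gaussian noise scaled to each (numeric) feature's standard deviation.
        base[col] = base[col] + rng.normal(0, noise_scale * df[col].std(), size=n_new)
    return pd.concat([df, base], ignore_index=True)
```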
Distribution shift comes in three specially-named types.

1. Prior probability shift, also known as label shift, is when the label distributions are different but the features given the labels are the same: $p_Y^{(\mathrm{train})}(y) \neq p_Y^{(\mathrm{deploy})}(y)$5 and $p_{X \mid Y}^{(\mathrm{train})}(x \mid y) = p_{X \mid Y}^{(\mathrm{deploy})}(x \mid y)$.
2. Covariate shift is the opposite, when the feature distributions are different but the labels given the features are the same: $p_X^{(\mathrm{train})}(x) \neq p_X^{(\mathrm{deploy})}(x)$ and $p_{Y \mid X}^{(\mathrm{train})}(y \mid x) = p_{Y \mid X}^{(\mathrm{deploy})}(y \mid x)$.
3. Concept drift is when the labels given the features are different but the features are the same: $p_{Y \mid X}^{(\mathrm{train})}(y \mid x) \neq p_{Y \mid X}^{(\mathrm{deploy})}(y \mid x)$ and $p_X^{(\mathrm{train})}(x) = p_X^{(\mathrm{deploy})}(x)$, or when the features given the labels are different but the labels are the same: $p_{X \mid Y}^{(\mathrm{train})}(x \mid y) \neq p_{X \mid Y}^{(\mathrm{deploy})}(x \mid y)$ and $p_Y^{(\mathrm{train})}(y) = p_Y^{(\mathrm{deploy})}(y)$.
All other distribution shifts do not have special names like these three types. The first two types of distribution shift come from sampling differences, whereas the third type of distribution shift comes from measurement differences. The three different types of distribution shift are summarized in Table 9.1.

5
You can also write this as $p_0^{(\mathrm{train})} \neq p_0^{(\mathrm{deploy})}$.
Let’s go through an example of each type to see which one (or more than one) affects the Phulo
situation. There will be prior probability shift if there are different proportions of creditworthy people in
present-day India and historical African countries, maybe because of differences in the overall
economy. There will be covariate shift if the distribution of features is different. For example, maybe
people in India have more assets in gold than in East and Central African countries. There will be concept
drift if the actual mechanism connecting the features and creditworthiness is different. For example,
people in India who talk or SMS with many people may be more creditworthy while in East and Central
Africa, people who talk or SMS with few people may be more creditworthy.
One way to describe the different types of distribution shifts is through the context or environment.
What does changing the environment in which the data was measured and sampled do to the features
and label? And if you’re talking about doing, you’re talking about causality. If you treat the environment
as a random variable 𝐸, then the different types of distribution shifts have the causal graphs shown in
Figure 9.4.6
Figure 9.4. Causal graph representations of the different types of distribution shift. Accessible caption. Graphs of prior probability shift (𝐸 → 𝑌 → 𝑋), covariate shift (𝐸 → 𝑋 → 𝑌), concept drift (𝐸 → 𝑌 ← 𝑋), and concept drift (𝐸 → 𝑋 ← 𝑌).
6
Meelis Kull and Peter Flach. “Patterns of Dataset Shift.” In: Proceedings of the International Workshop on Learning over Multiple
Contexts. Nancy, France, Sep. 2014.
These graphs illustrate a nuanced point. When you have prior probability shift, the label causes the
feature and when you have covariate shift, the features cause the label. This is weird to think about, so
let’s slow down and work through this concept. In the first case, 𝑌 → 𝑋, the label is known as an intrinsic
label and the machine learning problem is known as anticausal learning. A prototypical example is a
disease with a known pathogen like malaria that causes specific symptoms like chills, fatigue, and fever.
The label of a patient having a disease is intrinsic because it is a basic property of the infected patient,
which then causes the observed features. In the second case, 𝑋 → 𝑌, the label is known as an extrinsic
label and the machine learning problem is known as causal learning. A prototypical example of this case
is a syndrome, a collection of symptoms such as Asperger’s that isn’t tied to a pathogen. The label is just
a label to describe the symptoms like compulsive behavior and poor coordination; it doesn’t cause the
symptoms. The two different versions of concept drift correspond to anticausal and causal learning,
respectively. Normally, in the practice of doing supervised machine learning, the distinction between
anticausal and causal learning is just a curiosity, but it becomes important when figuring out what to do
to mitigate the effect of distribution shift. It is not obvious which situation you’re in with the Phulo model,
and you’ll have to really think about it.
There are two broad types of distribution shift detection methods that are applicable at two different points of the machine learning modeling pipeline, shown in Figure
9.5.7 Data distribution-based shift detection is done on the training data before the model training.
Classifier performance-based shift detection is done afterwards on the model. Data distribution-based shift detection, as its name implies, directly compares $p_{X,Y}^{(\mathrm{train})}(x, y)$ and $p_{X,Y}^{(\mathrm{deploy})}(x, y)$ to see if they are
similar or different. A common way is to compute their K-L divergence, which was introduced in Chapter
3. If it is too high, then there is distribution shift. Classifier performance-based shift detection examines
the Bayes risk, accuracy, 𝐹1-score, or other model performance measure. If it is much poorer than the
performance during cross-validation, there is distribution shift.
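As a rough illustration of data distribution-based shift detection, the sketch below estimates a histogram-based K-L divergence for a single continuous feature. The function name, bin count, and decision threshold are assumptions made for illustration; a full implementation would compare the joint feature–label distributions.

```python
import numpy as np

def kl_divergence_histogram(train_feature, deploy_feature, bins=20, eps=1e-9):
    """Histogram-based estimate of KL(deploy || train) for a single continuous feature."""
    edges = np.histogram_bin_edges(np.concatenate([train_feature, deploy_feature]), bins=bins)
    p_train, _ = np.histogram(train_feature, bins=edges)
    p_deploy, _ = np.histogram(deploy_feature, bins=edges)
    # Smooth and normalize the counts into probability mass functions.
    p_train = (p_train + eps) / (p_train + eps).sum()
    p_deploy = (p_deploy + eps) / (p_deploy + eps).sum()
    return float(np.sum(p_deploy * np.log(p_deploy / p_train)))

# Flag a shift when the divergence exceeds a threshold chosen on held-out data, e.g.
# shifted = kl_divergence_histogram(x_train[:, 0], x_deploy[:, 0]) > 0.1
```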
That is all well and good, but did you notice something about the two methods of distribution shift
detection that make them unusable for your Phulo development task? They require the deployed
distribution: both its features and labels. But you don’t have it! If you did, you would have used it to train
the Phulo model. Shift detection methods are really meant for monitoring scenarios in which you keep
getting data points to classify over time and you keep getting ground truth labels soon thereafter.
7
Jie Lu, Anjin Liu, Fan Dong, Feng Gu, João Gama, and Guangquan Zhang. “Learning Under Concept Drift: A Review.” In: IEEE
Transactions on Knowledge and Data Engineering 31.12 (Dec. 2019), pp. 2346–2363.
Figure 9.5. Modeling pipeline for detecting and mitigating distribution shift. Accessible caption. A block dia-
gram with a training dataset as input to a pre-processing block labeled adaptation with a pre-pro-
cessed dataset as output. The pre-processed dataset is input to a model training block labeled robust-
ness with a model as output. A data distribution-based shift detection block is applied to the training
dataset. A classifier performance-based shift detection block is applied to the model.
If you’ve started to collect unlabeled feature data from India, you can do unsupervised data
distribution-based shift detection by comparing the India feature distribution to the Africa feature
distributions. But you kind of already know that they’ll be different, and this unsupervised approach will
not permit you to determine which type of distribution shift you have. Thus, in a Phulo-like scenario,
you just have to assume the existence and type of distribution shift based on your knowledge of the
problem. (Remember our phrase from Chapter 8: “those who can’t do, assume.”)
“Machine learning systems need to robustly model the range of situations that occur
in the real-world.”
The two approaches take place in different parts of the modeling pipeline shown in Figure 9.5.
Adaptation is done on the training data as a pre-processing step, whereas robustness is introduced as
part of the model training process. The two kinds of mitigation are summarized in Table 9.2.
Type       | Where in the Pipeline | Known Deployment Environment | Approach for Prior Probability and Covariate Shifts | Approach for Concept Drift
adaptation | pre-processing        | yes                          | sample weights                                      | obtain labels
robustness | model training        | no                           | min-max formulation                                 | invariant risk minimization
The next sections work through adaptation and robustness for the different types of distribution
shift. Mitigating prior probability shift and covariate shift is easier than mitigating concept drift because
the relationship between the features and labels does not change in the first two types. Thus, the
classifier you learned on the historical African training data continues to capture that relationship even
on India deployment data; it just needs a little bit of tuning.
9.3 Adaptation
The first mitigation approach, adaptation, is done as a pre-processing of the training data from East and
Central African countries using information available in unlabeled feature data 𝑋 (𝑑𝑒𝑝𝑙𝑜𝑦) from India. To
perform adaptation, you must know that India is where you’ll be deploying the model and you must be
able to gather some features.
Under the assumption of prior probability shift, the adaptation procedure has four steps.8

1. Train a classifier on one random split of the training data to get $\hat{y}^{(\mathrm{train})}(x)$ and compute the classifier's confusion matrix on another random split of the training data: $C = \begin{bmatrix} p_{TP} & p_{FP} \\ p_{FN} & p_{TN} \end{bmatrix}$.
2. Run the unlabeled features of the deployment data through the classifier, $\hat{y}^{(\mathrm{train})}(X^{(\mathrm{deploy})})$, and compute the probabilities of positives and negatives in the deployment data as a vector: $a = \begin{bmatrix} P(\hat{y}^{(\mathrm{train})}(X^{(\mathrm{deploy})}) = 1) \\ P(\hat{y}^{(\mathrm{train})}(X^{(\mathrm{deploy})}) = 0) \end{bmatrix}$.
3. Compute weights $w = C^{-1} a$. This is a vector of length two.
4. Apply the weights to the training data points in the first random split and retrain the classifier. The first of the two weights multiplies the loss function of the training data points with label 1. The second of the two weights multiplies the loss function of the training data points with label 0.
The retrained classifier is what you want to use when you deploy Phulo in India under the assumption
of prior probability shift.
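A minimal sketch of these four steps using scikit-learn is shown below, assuming a binary label encoded as 0/1. The function name and the choice of logistic regression as the black-box classifier are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def label_shift_weights(X_train, y_train, X_deploy, seed=0):
    """Estimate per-class weights for prior probability shift from unlabeled deployment features."""
    X_fit, X_hold, y_fit, y_hold = train_test_split(X_train, y_train, test_size=0.5, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)

    # Confusion matrix C[i, j] = P(predict class i, true class j) on held-out training data
    # (ordering classes as 0, 1; equivalent to the book's matrix up to reordering).
    y_hat_hold = clf.predict(X_hold)
    C = np.zeros((2, 2))
    for i in (0, 1):
        for j in (0, 1):
            C[i, j] = np.mean((y_hat_hold == i) & (y_hold == j))

    # a[i] = P(predict class i) on the unlabeled deployment features.
    y_hat_deploy = clf.predict(X_deploy)
    a = np.array([np.mean(y_hat_deploy == 0), np.mean(y_hat_deploy == 1)])

    # w = C^{-1} a gives one weight per class; regularize C if it is near singular.
    w = np.linalg.solve(C, a)
    return np.clip(w, 0, None), (X_fit, y_fit)

# The classifier is then retrained with per-point weights, e.g.
# w, (X_fit, y_fit) = label_shift_weights(X_train, y_train, X_deploy)
# LogisticRegression(max_iter=1000).fit(X_fit, y_fit, sample_weight=w[y_fit])
```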
8
Zachary C. Lipton, Yu-Xiang Wang, and Alexander J. Smola. “Detecting and Correcting for Label Shift with Black Box Predic-
tors.” In: Proceedings of the International Conference on Machine Learning. Stockholm, Sweden, Jul. 2018, pp. 3122–3130.
Under covariate shift, adaptation is instead done by importance weighting: retrain the classifier with a weight $w_j$ on each training data point,

$\hat{y}(\cdot) = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{j=1}^{n} w_j\, L\big(y_j, f(x_j)\big).$

Equation 9.1
(Compare this to Equation 7.3, which is the same thing, but without the weights.) The importance weight is the ratio of the probability density of the features $x_j$ under the deployment distribution and the training distribution: $w_j = p_X^{(\mathrm{deploy})}(x_j) / p_X^{(\mathrm{train})}(x_j)$.9 The weighting scheme tries to make the African features look more like the Indian features by emphasizing those that are less likely in East and Central African countries but more likely in India.
How do you compute the weight from the training and deployment datasets? You could first try to
estimate the two pdfs separately and then evaluate them at each training data point and plug them into
the ratio. But that usually doesn’t work well. The better way to go is to directly estimate the weight.
The most straightforward technique is similar to computing propensity scores in Chapter 8. You learn a
classifier with a calibrated continuous score 𝑠(𝑥) such as logistic regression or any other classifier from
Chapter 7. The dataset to train this classifier is a concatenation of the deployment and training datasets.
The labels are 1 for the data points that come from the deployment dataset and the labels are 0 for the
data points that come from the training dataset. The features are the features. Once you have the
continuous output of the classifier as a score, the importance weight is:10
$w_j = \frac{n^{(\mathrm{train})}}{n^{(\mathrm{deploy})}} \cdot \frac{s(x_j)}{1 - s(x_j)},$

Equation 9.2

where $n^{(\mathrm{train})}$ and $n^{(\mathrm{deploy})}$ are the number of data points in the training and deployment datasets, respectively.
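Here is a brief sketch of this direct weight estimation with a scikit-learn logistic regression as the domain classifier. The function name and the small constant added for numerical stability are assumptions; in practice you would also calibrate the score and clip extreme weights.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def covariate_shift_weights(X_train, X_deploy):
    """Estimate importance weights p_deploy(x)/p_train(x) with a domain classifier (Equation 9.2)."""
    X_both = np.vstack([X_deploy, X_train])
    d = np.concatenate([np.ones(len(X_deploy)), np.zeros(len(X_train))])  # 1 = deployment, 0 = training
    domain_clf = LogisticRegression(max_iter=1000).fit(X_both, d)

    s = domain_clf.predict_proba(X_train)[:, 1]   # score s(x) = P(deployment | x)
    ratio = len(X_train) / len(X_deploy)
    return ratio * s / (1.0 - s + 1e-9)           # one weight per training point

# The weights can then be passed to most estimators, e.g.
# model.fit(X_train, y_train, sample_weight=covariate_shift_weights(X_train, X_deploy))
```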
9
Hidetoshi Shimodaira. “Improving Predictive Inference Under Covariate Shift by Weighting the Log-Likelihood Function.” In:
Journal of Statistical Planning and Inference 90.2 (Oct. 2000), pp. 227–244.
10
Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density Ratio Estimation in Machine Learning. Cambridge, England,
UK: Cambridge University Press, 2012.
9.4 Robustness
Often, you do not have any data from the deployment environment, so adaptation is out of the question.
You might not even know what the deployment environment is going to be. It might not even end up
being India for all that you know. Robustness does not require you to have any deployment data. It
modifies the learning objective and procedure.
Suppose the decision threshold of the classifier is set using the training prior probability $p_0^{(\mathrm{train})}$, but the deployment prior probability $p_0^{(\mathrm{deploy})}$ is different. The risk you actually incur is the mismatched Bayes risk:11

$R\big(p_0^{(\mathrm{deploy})}, p_0^{(\mathrm{train})}\big) = c_{10}\, p_0^{(\mathrm{deploy})}\, p_{FP}\big(p_0^{(\mathrm{train})}\big) + c_{01}\, \big(1 - p_0^{(\mathrm{deploy})}\big)\, p_{FN}\big(p_0^{(\mathrm{train})}\big).$

Equation 9.3

You lose out on performance. Your epistemic uncertainty in knowing the right prior probabilities has hurt the Bayes risk.

To be robust to the uncertain prior probabilities in present-day India, choose a value for $p_0^{(\mathrm{train})}$ so that the worst-case performance is as good as possible. Known as a min-max formulation, the problem is to find a min-max optimal prior probability point that you're going to use when you deploy the Phulo model. Specifically, you want:

$\arg\min_{p_0^{(\mathrm{train})}} \max_{p_0^{(\mathrm{deploy})}} R\big(p_0^{(\mathrm{deploy})}, p_0^{(\mathrm{train})}\big).$

Equation 9.4

11
In Equation 6.10, $R = c_{10} p_0 p_{FP} + c_{01} p_1 p_{FN}$, the dependence of $p_{FP}$ and $p_{FN}$ on $p_0$ was not explicitly noted, but this dependence is important here.
Normally in the book, we stop at the posing of the formulation. In this instance, however, since the min-max optimal solution has nice geometric properties, let's carry on. The mismatched Bayes risk function $R(p_0^{(\mathrm{deploy})}, p_0^{(\mathrm{train})})$ is a linear function of $p_0^{(\mathrm{deploy})}$ for a fixed value of $p_0^{(\mathrm{train})}$. When $p_0^{(\mathrm{train})} = p_0^{(\mathrm{deploy})}$, the Bayes optimal threshold is recovered and $R(p_0^{(\mathrm{deploy})}, p_0^{(\mathrm{deploy})})$ is the optimal Bayes risk defined in Chapter 6. It is a concave function that is zero at the endpoints of the interval [0,1].12 The linear mismatched Bayes risk function is tangent to the optimal Bayes risk function at $p_0^{(\mathrm{train})} = p_0^{(\mathrm{deploy})}$ and greater than it everywhere else.13 This relationship is shown in Figure 9.6.
Figure 9.6. An example mismatched (dashed line) and matched Bayes risk function (solid curve). Accessible caption. A plot with $R(p_0^{(\mathrm{deploy})}, p_0^{(\mathrm{train})})$ on the vertical axis and $p_0^{(\mathrm{deploy})}$ on the horizontal axis. The matched Bayes risk is 0 at $p_0^{(\mathrm{deploy})} = 0$, increases to a peak in the middle and decreases back to 0 at $p_0^{(\mathrm{deploy})} = 1$. Its shape is concave. The mismatched Bayes risk is a line tangent to the matched Bayes risk at the point $p_0^{(\mathrm{deploy})} = p_0^{(\mathrm{train})}$, which in this example is at a point greater than the peak of the matched Bayes risk. There's a large gap between the matched and mismatched Bayes risk, especially towards $p_0^{(\mathrm{deploy})} = 0$.
The solution is the prior probability value at which the matched Bayes risk function has zero slope.
It turns out that the correct answer is the place where the mismatched Bayes risk tangent line is flat—at
the top of the hump as shown in Figure 9.7. Once you have it, use it in the threshold of the Phulo decision
function to deal with prior probability shift.
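One simple way to find this point numerically is a grid search over candidate priors, re-thresholding a calibrated score model at each candidate and evaluating the matched Bayes risk. The sketch below assumes calibrated scores for the positive class, the cost setup from Chapter 6 with zero costs for correct decisions, and hypothetical function and variable names.

```python
import numpy as np

def minmax_prior(scores, y, c10=1.0, c01=1.0, n_grid=99):
    """Grid search for the min-max optimal prior: the peak of the matched Bayes risk curve.

    `scores` are calibrated estimates of P(Y = 1 | x) from a classifier trained on the
    historical training data, `y` are the training labels (0/1), and c10/c01 are the
    costs of false positives and false negatives.
    """
    grid = np.linspace(0.01, 0.99, n_grid)
    p0_emp = np.mean(y == 0)  # empirical training prior of the negative class
    # Likelihood ratio recovered from the calibrated score and the empirical prior.
    lr = (scores / (1 - scores + 1e-9)) * (p0_emp / (1 - p0_emp))

    risks = []
    for p0 in grid:
        threshold = (c10 * p0) / (c01 * (1 - p0))   # Bayes-optimal likelihood ratio threshold
        y_hat = (lr > threshold).astype(int)
        p_fp = np.mean(y_hat[y == 0] == 1)          # false positive rate at this threshold
        p_fn = np.mean(y_hat[y == 1] == 0)          # false negative rate at this threshold
        risks.append(c10 * p0 * p_fp + c01 * (1 - p0) * p_fn)

    return grid[int(np.argmax(risks))]              # prior at the top of the hump
```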
12
This is true under the ongoing assumption that the costs of correct classifications $c_{00} = 0$ and $c_{11} = 0$.
13
Kush R. Varshney. “Bayes Risk Error is a Bregman Divergence.” In: IEEE Transactions on Signal Processing 59.9 (Sep. 2011), pp. 4470–4472.
Figure 9.7. The most robust prior probability to use in the decision function is the place where the matched Bayes risk function is maximum. Accessible caption. A plot with $R(p_0^{(\mathrm{deploy})}, p_0^{(\mathrm{train})})$ on the vertical axis and $p_0^{(\mathrm{deploy})}$ on the horizontal axis. The matched Bayes risk is 0 at $p_0^{(\mathrm{deploy})} = 0$, increases to a peak in the middle and decreases back to 0 at $p_0^{(\mathrm{deploy})} = 1$. Its shape is concave. The mismatched Bayes risk is a horizontal line tangent to the matched Bayes risk at the optimal point $p_0^{(\mathrm{deploy})} = p_0^{(\mathrm{train})}$, which is at the peak of the matched Bayes risk. The maximum gap between the matched and mismatched Bayes risk is as small as can be.
For covariate shift, robustness is achieved with a min-max formulation over the sample weights:14

$\hat{y}(\cdot) = \arg\min_{f \in \mathcal{F}} \max_{w} \frac{1}{n} \sum_{j=1}^{n} w_j\, L\big(y_j, f(x_j)\big),$

Equation 9.5

where $w$ ranges over the set of possible weights that are non-negative and sum to one. The classifier that optimizes the objective in Equation 9.5 is the robust classifier you'll want to use for the Phulo model to deal with covariate shift.
14
Junfeng Wen, Chun-Nam Yu, and Russell Greiner. “Robust Learning under Uncertain Test Distributions: Relating Covariate Shift to Model Misspecification.” In: Proceedings of the International Conference on Machine Learning. Beijing, China, Jun. 2014, pp. 631–639. Weihua Hu, Gang Niu, Issei Sato, and Masashi Sugiyama. “Does Distributionally Robust Supervised Learning Give Robust Classifiers?” In: Proceedings of the International Conference on Machine Learning. Stockholm, Sweden, Jul. 2018, pp. 2029–2037.
Concept drift cannot be mitigated with this kind of min-max formulation because the training data from East and Central African countries is not indicating the right relationship between the features and label in India. A model robust to concept drift must extrapolate outside of what the training data can tell you. And that is too open-ended of a task to do well unless you make some more assumptions.
One reasonable assumption you can make is that the set of features splits into two types: (1) causal or
stable features, and (2) spurious features. You don’t know which ones are which beforehand. The causal
features capture the intrinsic parts of the relationship between features and labels, and are the same set
of features in different environments. In other words, this set of features is invariant across the
environments. Spurious features might be predictive in one environment or a few environments, but
not universally so across environments. You want the Phulo model to rely on the causal features whose
predictive relationship with labels holds across Tanzania, Rwanda, Congo, and all the other countries
that Wavetel has data from and ignore the spurious features. By doing so, the hope is that the model will
not only perform well for the countries in the training set, but also any new country or environment that
it encounters, such as India. It will be robust to the environment in which it is deployed.
Invariant risk minimization is a variation on the standard risk minimization formulation of machine
learning that helps the model focus on the causal features and avoid the spurious features when there
is data from more than one environment available for training. The formulation is:15
$\hat{y}(\cdot) = \arg\min_{f \in \mathcal{F}} \sum_{e \in \mathcal{E}} \frac{1}{n_e} \sum_{j=1}^{n_e} L\big(y_j^{(e)}, f(x_j^{(e)})\big)$

such that $f \in \arg\min_{g \in \mathcal{F}} \frac{1}{n_e} \sum_{j=1}^{n_e} L\big(y_j^{(e)}, g(x_j^{(e)})\big)$ for all $e \in \mathcal{E}$.

Equation 9.6

15
Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. “Invariant Risk Minimization.” arXiv:1907.02893, 2020.
Let’s break this equation down bit by bit to understand it more. First, the set $\mathcal{E}$ is the set of all environments or countries from which we have training data (Tanzania, Rwanda, Congo, etc.) and each country is indexed by $e$. There are $n_e$ training samples $\{(x_1^{(e)}, y_1^{(e)}), \ldots, (x_{n_e}^{(e)}, y_{n_e}^{(e)})\}$ from each country. The
inner summation in the top line is the regular risk expression that we’ve seen before in Chapter 7. The
outer summation in the top line is just adding up all the risks for all the environments, so that the
classifier minimizes the total risk. The interesting part is the constraint in the second line. It is saying that the classifier that is the solution of the top line must simultaneously minimize the risk for each of the environments or countries separately as well. As you know from earlier in the chapter, there can be many different classifiers that minimize the loss—they are the Rashomon set—and that is why the second line has the ‘element of’ symbol ∈. The invariant risk minimization formulation adds extra specification to reduce the epistemic uncertainty and allow for better out-of-distribution generalization.
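Equation 9.6 is a hard constrained problem; the cited paper relaxes it into a penalized objective (IRMv1) that penalizes the gradient of each environment's risk with respect to a dummy classifier scale. The PyTorch sketch below follows that relaxation rather than the exact constrained formulation; the function names, penalty weight, and batching scheme are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def irm_penalty(logits, labels):
    """IRMv1 penalty: squared gradient of the environment risk w.r.t. a dummy classifier scale."""
    scale = torch.ones(1, requires_grad=True)
    loss = F.binary_cross_entropy_with_logits(logits * scale, labels)
    grad = torch.autograd.grad(loss, [scale], create_graph=True)[0]
    return (grad ** 2).sum()

def irm_training_step(model, optimizer, env_batches, penalty_weight=10.0):
    """One update on data from several environments (countries), trading off average risk
    against the invariance penalty."""
    total_risk, total_penalty = 0.0, 0.0
    for x_e, y_e in env_batches:                       # one (features, labels) batch per country
        logits = model(x_e).squeeze(-1)
        y_e = y_e.float()
        total_risk = total_risk + F.binary_cross_entropy_with_logits(logits, y_e)
        total_penalty = total_penalty + irm_penalty(logits, y_e)
    loss = (total_risk + penalty_weight * total_penalty) / len(env_batches)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```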
You might ask yourself whether the constraint in the second line does anything useful. Shouldn’t the
first line alone give you the same solution? This is a question that machine learning researchers are
currently struggling with.16 They find that usually, the standard risk minimization formulation of
machine learning from Chapter 7 is the most robust to general distribution shifts, without the extra
invariant risk minimization constraints. However, when the problem is an anticausal learning problem
and the feature distributions across environments have similar support, invariant risk minimization
may outperform standard machine learning (remember from earlier in the chapter that the label causes
the features in anticausal learning).17
In a mobile money-enabled credit approval setting like you have with Phulo, it is not entirely clear
whether the problem is causal learning or anticausal learning: do the features cause the label or do the
labels cause the features? In a traditional credit scoring problem, you are probably in the causal setting
because there are strict parameters on features like salary and assets that cause a person to be viewed
by a bank as creditworthy or not. In the mobile money and unbanked setting, you could also imagine the
problem to be anticausal if you think that a person is inherently creditworthy or not, and the features
you’re able to collect from their mobile phone usage are a result of the creditworthiness. As you’re
developing the Phulo model, you should give invariant risk minimization a shot because you have
datasets from several countries, require robustness to concept drift and generalization to new countries,
and likely have an anticausal learning problem. You and your data science team can be happy that you’ve
given Wavetel and the Bank of Bulandshahr a model they can rely on during the launch of Phulo.
9.5 Summary
▪ Machine learning models should not take shortcuts if they are to be reliable. You must minimize
epistemic uncertainty in modeling, data preparation, sampling, and measurement.
▪ Data augmentation is a way to reduce epistemic uncertainty in modeling.
▪ Distribution shift—the mismatch between the probability distribution of the training data and the
data you will see during deployment—has three special cases: prior probability shift, covariate
shift, and concept drift. Often, you can’t detect distribution shift. You just have to assume it.
▪ Prior probability shift and covariate shift are easier to overcome than concept drift because they
arise from sampling bias rather than measurement bias.
▪ A pre-processing strategy for mitigating prior probability shift and covariate shift is adaptation, in which sample weights multiply the training loss during the model learning process. Finding the weights requires a fixed target deployment distribution and unlabeled data from it.
16
Ishaan Gulrajani and David Lopez-Paz. “In Search of Lost Domain Generalization.” In: Proceedings of the International Conference on Learning Representations. May 2021. Pritish Kamath, Akilesh Tangella, Danica J. Sutherland, and Nathan Srebro. “Does Invariant Risk Minimization Capture Invariance?” arXiv:2010.01134, 2021.
17
Kartik Ahuja, Jun Wang, Karthikeyan Shanmugam, Kush R. Varshney, and Amit Dhurandhar. “Empirical or Invariant Risk Minimization? A Sample Complexity Perspective.” In: Proceedings of the International Conference on Learning Representations. May 2021.
▪ A strategy for mitigating prior probability and covariate shift during model training is min-max
robustness, which changes the learning formulation to try to do the best in the worst-case
environment that could be encountered during deployment.
▪ Adapting to concept drift requires the acquisition of some labeled data from the deployment
environment.
▪ Invariant risk minimization is a strategy for mitigating concept drift and achieving distributional
robustness that focuses the model’s efforts on causal features and ignores spurious features. It
may work well in anticausal learning scenarios in which the label causes the features.
10
Fairness
Sospital is a leading (fictional) health insurance company in the United States. Imagine that you are the
lead data scientist collaborating with a problem owner in charge of transforming the company’s care
management programs. Care management is the set of services that help patients with chronic or
complex conditions manage their health and have better clinical outcomes. Extra care management is
administered by a dedicated team composed of physicians, other clinicians, and caregivers who come
up with and execute a coordinated plan that emphasizes preventative health actions. The problem
owner at Sospital has made a lot of progress in implementing software-based solutions for the care
coordination piece and has changed the culture to support them, but is still struggling with the patient
intake process. The main struggle is in identifying the members of health plans that need extra care
management. This is a mostly manual process right now that the problem owner would like to automate.
You begin the machine learning lifecycle through an initial set of conversations with the problem
owner and determine that it is not an exploitative use case that could immediately be an instrument of
oppression. It is also a problem in which machine learning may be helpful. You next consult a paid panel
of diverse voices that includes actual patients. You learn from them that black Americans have not been
served well by the health care system historically and have a deep-seated mistrust of it. Therefore, you
should ensure that the machine learning model does not propagate systematic disadvantage to the black
community. The system should be fair and not contain unwanted biases.
Your task now is to develop a detailed problem specification for a fair machine learning system for
allocating care management programs to Sospital members and proceed along the different phases of
the machine learning lifecycle without taking shortcuts. In this chapter, you will:
All of the different forms of justice have important roles in society and sociotechnical systems. In the
problem specification phase of a model that determines who receives Sospital’s care management and
who doesn’t, you need to focus on distributive justice. This focus on distributive justice is generally true
in designing machine learning systems because machine learning itself is focused on outcomes. The
other kinds of justice are important in setting the context in which machine learning is and is not used.
They are essential in promoting accountability and holistically tamping down racism, sexism, classism,
ageism, ableism, and other unwanted discriminatory behaviors.
“Don’t conflate CS/AI/tech ethics and social justice issues. They’re definitely related,
but not interchangeable.”
Why would different individuals and groups receive an unequal allocation of care management?
Since it is a limited resource, not everyone can receive it.1 The more chronically ill that patients are, the
more likely they should be to receive care management. This sort of discrimination is generally
acceptable, and is the sort of task machine learning systems are suited for. It becomes unacceptable and
unfair when the allocation gives a systematic advantage to certain privileged groups and individuals and a
systematic disadvantage to certain unprivileged groups and individuals. Privileged groups and
individuals are defined to be those who have historically been more likely to receive the favorable label in
a machine learning binary classification task. Receiving care management is a favorable label because
patients are given extra services to keep them healthy. Other favorable labels include being hired, not
being fired, being approved for a loan, not being arrested, and being granted bail. Privilege is a result of
power imbalances, and the same groups may not be privileged in all contexts, even within the same
society. In some narrow societal contexts, it may even be the elite who are without power.
1
You can argue that this way of thinking is flawed and society should be doing whatever it takes so that care management is not
a limited resource, but it is the reality today.
Privileged and unprivileged groups are delineated by protected attributes such as race, ethnicity,
gender, religion, and age. There is no one universal set of protected attributes. They are determined
from laws, regulations, or other policies governing a particular application domain in a particular
jurisdiction. As a health insurer in the United States, Sospital is regulated under Section 1557 of the
Patient Protection and Affordable Care Act with the specific protected attributes of race, color, national
origin, sex, age, and disability. In health care in the United States, non-Hispanic whites are usually a
privileged group due to multifaceted reasons of power. For ease of explanation and conciseness, the
remainder of the chapter uses whites as the privileged group and blacks as the unprivileged group.
There are two main types of fairness you need to be concerned about: (1) group fairness and (2)
individual fairness. Group fairness is the idea that the average classifier behavior should be the same
across groups defined by protected attributes. Individual fairness is the idea that individuals similar in
their features should receive similar model predictions. Individual fairness includes the special case of
two individuals who are exactly the same in every respect except for the value of one protected attribute
(this special case is known as counterfactual fairness). Given the regulations Sospital is operating under,
group fairness is the more important notion to include in the care management problem specification,
but you should not forget to consider individual fairness in your problem specification.
“If humans didn’t behave the way we do there would be no behavior data to correct.
The training data is society.”
2
See https://www.cms.gov/files/document/blueprint-codes-code-systems-value-sets.pdf for details about these coding
schemes.
Figure 10.1. Bias in measurement and sampling are the most obvious sources of unfairness in machine learning,
but not the only ones. Accessible caption. A sequence of five spaces, each represented as a cloud. The
construct space leads to the observed space via the measurement process. The observed space leads to
the raw data space via the sampling process. The raw data space leads to the prepared data space via
the data preparation process. The prepared data space leads to the prediction space via the modeling
process. The measurement process contains social bias, which threatens construct validity. The sam-
pling process contains representation bias and temporal bias, which threatens external validity. The
data preparation process contains data preparation bias and data poisoning, which threaten internal
validity. The modeling process contains underfitting/overfitting and poor inductive bias, which
threaten generalization.
Social bias enters claims data in a few ways. First, you might think that patients who visit doctors a
lot and get many prescriptions filled, i.e. utilize the health care system a lot, are sicker and thus more
appropriate candidates for care management. While it is directionally true that greater health care
utilization implies a sicker patient, it is not true when comparing patients across populations such as
whites and blacks. Blacks tend to be sicker for an equal level of utilization due to structural issues in the
health care system.3 The same is true when looking at health care cost instead of utilization. Another
social bias can be in the codes. For example, black people are less-often treated for pain than white
people in the United States due to false beliefs among clinicians that black people feel less pain.4
Moreover, there can be social bias in the human-determined labels of selection for care management in
the past due to implicit cognitive biases or prejudice on the part of the decision maker. Representation
bias enters claims data because it is only from Sospital’s own members. This population may, for
example, undersample blacks if Sospital offers its commercial plans primarily in counties with larger
white populations.
Besides the social and representation biases given above that are already present in raw data, you
need to be careful that you don’t introduce other forms of unfairness in the problem specification and
data preparation phases. For example, suppose you don’t have the labels from human decision makers
in the past. In that case, you might decide to use a threshold on utilization or cost as a proxy outcome
3
Moninder Singh and Karthikeyan Natesan Ramamurthy. “Understanding Racial Bias in Health Using the Medical Expenditure
Panel Survey Data.” In: Proceedings of the NeurIPS Workshop on Fair ML for Health. Vancouver, Canada, Dec. 2019.
4
Oluwafunmilayo Akinlade. “Taking Black Pain Seriously.” In: New England Journal of Medicine 383.e68 (Sep. 2020).
variable, but that would make blacks less likely to be selected for care management at equal levels of
infirmity for the reasons described above. Also, as part of feature engineering, you might think to
combine individual cost or utilization events into more comprehensive categories, but if you aren’t
careful you could make racial bias worse. It turns out that combining all kinds of health system
utilization into a single feature yields unwanted racial bias, but keeping inpatient hospital nights and
frequent emergency room utilization as separate kinds of utilization keeps the bias down in nationally-
representative data.5
“As AI is embedded into our day to day lives it’s critical that we ensure our models
don’t inadvertently incorporate latent stereotypes and prejudices.”
You might be thinking that you already know how to measure and mitigate biases in measurement,
sampling, and data preparation from Chapter 9, distribution shift. What’s different about fairness?
Although there is plenty to share between distribution shift and fairness,6 there are two main technical
differences between the two topics. First is access to the construct space. You can get data from the
construct space in distribution shift scenarios. Maybe not immediately, but if you wait, collect, and label
data from the deployment environment, you will have data reflecting the construct space. However, you
never have access to the construct space in fairness settings. The construct space reflects a perfect
egalitarian world that does not exist in real life, so you can’t get data from it. (Recall that in Chapter 4, we
said that hakuna matata reigns in the construct space (it means no worries).) Second is the specification
of what is sought. In distribution shift, there is no further specification beyond just trying to match the
shifted distribution. In fairness, there are precise policy-driven notions and quantitative criteria that
define the desired state of data and/or models that are not dependent on the data distribution you have.
You’ll learn about these precise notions and how to choose among them in the next chapter.
Related to causal and anticausal learning covered in Chapter 9, the protected attribute is like the
environment variable. Fairness and distributive justice are usually conceived in a causal (rather than
anticausal) learning framework in which the outcome label is extrinsic: the protected attribute may
cause the other features, which in turn cause the selection for care management. However, this setup is
not always the case.
5
Moninder Singh. “Algorithmic Selection of Patients for Case Management: Alternative Proxies to Healthcare Costs.” In: Pro-
ceedings of the AAAI Workshop on Trustworthy AI for Healthcare. Feb. 2021.
6
Elliot Creager, Jörn-Henrik Jacobsen, and Richard Zemel. “Exchanging Lessons Between Algorithmic Fairness and Domain
Generalization.” arXiv:2010.07249, 2020.
Let's dig deeper into group fairness. Group fairness is about comparing members of the privileged group and members of the unprivileged group on average. The most basic group fairness metric, the statistical parity difference, compares the rates at which the two groups receive the favorable label, where 𝑍 denotes the protected attribute:

statistical parity difference = 𝑃( 𝑦̂(𝑋) = fav ∣ 𝑍 = unpr ) − 𝑃( 𝑦̂(𝑋) = fav ∣ 𝑍 = priv ).

Equation 10.1
A value of 0 means that members of the unprivileged group (blacks) and the privileged group (whites)
are getting selected for extra care management at equal rates, which is considered a fair situation. A
negative value of statistical parity difference indicates that the unprivileged group is at a disadvantage
and a positive value indicates that the privileged group is at a disadvantage. A requirement in a problem
specification may be that the learned model must have a statistical parity difference close to 0. An
example calculation of statistical parity difference is shown in Figure 10.2.
Figure 10.2. An example calculation of statistical parity difference. Accessible caption. 3 members of the
unprivileged group are predicted with the favorable label (receive care management) and 7 are pre-
dicted with the unfavorable label (don’t receive care management). 4 members of the privileged group
are predicted with the favorable label and 6 are predicted with the unfavorable label. The selection rate
for the unprivileged group is 3/10 and for the privileged group is 4/10. The difference, the statistical
parity difference is −0.1.
disparate impact ratio = 𝑃( 𝑦̂(𝑋) = fav ∣ 𝑍 = unpr )/𝑃( 𝑦̂(𝑋) = fav ∣ 𝑍 = priv ).
Equation 10.2
Here, a value of 1 indicates fairness, values less than 1 indicate disadvantage faced by the unprivileged
group, and values greater than 1 indicate disadvantage faced by the privileged group. The disparate
impact ratio is also sometimes known as the relative risk ratio or the adverse impact ratio. In some
application domains such as employment, a value of the disparate impact ratio less than 0.8 is
considered unfair and values greater than 0.8 are considered fair. This so-called four-fifths rule problem
specification is asymmetric because it does not speak to disadvantage experienced by the privileged
group. It can be symmetrized by considering disparate impact ratios between 0.8 and 1.25 to be fair.
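Both metrics are straightforward to compute from predictions and protected attribute values. The sketch below assumes the favorable label is encoded as 1 and uses hypothetical function and variable names; it reproduces the example from Figure 10.2.

```python
import numpy as np

def statistical_parity_difference(y_hat, z, unpriv, priv):
    """Equation 10.1: selection-rate difference between unprivileged and privileged groups."""
    rate_unpriv = np.mean(y_hat[z == unpriv] == 1)   # favorable label encoded as 1
    rate_priv = np.mean(y_hat[z == priv] == 1)
    return rate_unpriv - rate_priv

def disparate_impact_ratio(y_hat, z, unpriv, priv):
    """Equation 10.2: ratio of the two selection rates (the four-fifths rule compares this to 0.8)."""
    rate_unpriv = np.mean(y_hat[z == unpriv] == 1)
    rate_priv = np.mean(y_hat[z == priv] == 1)
    return rate_unpriv / rate_priv

# Reproducing the example in Figure 10.2:
y_hat = np.array([1] * 3 + [0] * 7 + [1] * 4 + [0] * 6)
z = np.array(["unpriv"] * 10 + ["priv"] * 10)
print(statistical_parity_difference(y_hat, z, "unpriv", "priv"))  # -0.1
```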
Statistical parity difference and disparate impact ratio can be understood as measuring a form of
independence between the prediction 𝑦̂(𝑋) and the protected attribute 𝑍.7 Besides statistical parity
difference and disparate impact ratio, another way to quantify the independence between 𝑦̂(𝑋) and 𝑍 is
their mutual information.
Both statistical parity difference and disparate impact ratio can also be defined on the training data
instead of the model predictions by replacing 𝑦̂(𝑋) with 𝑌. Thus, they can be measured and tested (1) on
the dataset before model training, as a dataset fairness metric, as well as (2) on the learned classifier after
model training as a classifier fairness metric, shown in Figure 10.3.
Figure 10.3. Two types of fairness metrics in different parts of the machine learning pipeline. Accessible cap-
tion. A block diagram with a training dataset as input to a pre-processing block with a pre-processed
dataset as output. The pre-processed dataset is input to a model training block with an initial model as
output. The initial model is input to a post-processing block with a final model as output. A dataset
fairness metric block is applied to the training dataset and pre-processed dataset. A classifier fairness
metric block is applied to the initial model and final model.
7
Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning: Limitations and Opportunities. URL:
https://fairmlbook.org, 2020.
To quantify separation unfairness, compute the difference of the true favorable rates between the unprivileged and privileged groups and the difference of the false favorable rates between the unprivileged and privileged groups, and average them:

average odds difference = ½[(𝑃( 𝑦̂(𝑋) = fav ∣ 𝑌 = fav, 𝑍 = unpr ) − 𝑃( 𝑦̂(𝑋) = fav ∣ 𝑌 = fav, 𝑍 = priv )) + (𝑃( 𝑦̂(𝑋) = fav ∣ 𝑌 = unf, 𝑍 = unpr ) − 𝑃( 𝑦̂(𝑋) = fav ∣ 𝑌 = unf, 𝑍 = priv ))].

Equation 10.3
Figure 10.4. An example calculation of average odds difference. The crosses below the members indicate a true
need for care management. Accessible caption. In the unprivileged group, 2 members receive true favora-
ble outcomes and 2 receive false unfavorable outcomes, giving a 2/4 true favorable rate. In the privi-
leged group, 3 members receive true favorable outcomes and 1 receives a false unfavorable outcome,
giving a 3/4 true favorable rate. The true favorable rate difference is −0.25. In the unprivileged group, 1
member receives a false favorable outcome and 5 receive a true unfavorable outcome, giving a 1/6
false favorable rate. In the privileged group, 1 member receives a false favorable outcome and 5 re-
ceive a true unfavorable outcome, giving a 1/6 false favorable rate. The false favorable rate difference
is 0. Averaging the two differences gives a −0.125 average odds difference.
In the average odds difference, the true favorable rate difference and the false favorable rate
difference can cancel out and hide unfairness, so it is better to take the absolute value before averaging:
average absolute odds difference = ½[|𝑃( 𝑦̂(𝑋) = fav ∣ 𝑌 = fav, 𝑍 = unpr ) − 𝑃( 𝑦̂(𝑋) = fav ∣ 𝑌 = fav, 𝑍 = priv )| + |𝑃( 𝑦̂(𝑋) = fav ∣ 𝑌 = unf, 𝑍 = unpr ) − 𝑃( 𝑦̂(𝑋) = fav ∣ 𝑌 = unf, 𝑍 = priv )|].

Equation 10.3
The average odds difference is a way to measure the separation of the prediction 𝑦̂(𝑋) and the protected
attribute 𝑍 by the true label 𝑌 in any of the three Bayesian networks shown in Figure 10.5. A value of 0
average absolute odds difference indicates independence of 𝑦̂(𝑋) and 𝑍 conditioned on 𝑌. This is deemed
a fair situation and termed equality of odds.
Figure 10.5. Illustration of the true label 𝑌 separating the prediction and the protected attribute in various Bayesian networks. Accessible caption. Three networks that show separation: 𝑌̂ → 𝑌 → 𝑍, 𝑌̂ ← 𝑌 ← 𝑍, and 𝑌̂ ← 𝑌 → 𝑍.
8
Sorelle A. Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian. “On the (Im)possibility of Fairness: Different Value
Systems Require Different Mechanisms for Fair Decision Making.” In: Communications of the ACM 64.4 (Apr. 2021), pp. 136–143.
Figure 10.6. Illustration of the predicted label 𝑌̂ separating the true label and the protected attribute in various Bayesian networks, which is known as sufficiency. Accessible caption. Three networks that show sufficiency: 𝑌 → 𝑌̂ → 𝑍, 𝑌 ← 𝑌̂ ← 𝑍, and 𝑌 ← 𝑌̂ → 𝑍.
Since sufficiency and separation are somewhat opposites of each other with 𝑌 and 𝑌̂ reversed, their
quantifications are also opposites with 𝑌 and 𝑌̂ reversed. Recall from Chapter 6 that the positive
predictive value is the reverse of the true positive rate: 𝑃( 𝑌 = fav ∣ 𝑦̂(𝑋) = fav ) and that the false
omission rate is the reverse of the false positive rate: 𝑃( 𝑌 = fav ∣ 𝑦̂(𝑋) = unf ). To quantify sufficiency
unfairness, compute the average difference of the positive predictive value and false omission rate
across the unprivileged (black) and privileged (white) groups:
average predictive value difference = ½[(𝑃( 𝑌 = fav ∣ 𝑦̂(𝑋) = fav, 𝑍 = unpr ) − 𝑃( 𝑌 = fav ∣ 𝑦̂(𝑋) = fav, 𝑍 = priv )) + (𝑃( 𝑌 = fav ∣ 𝑦̂(𝑋) = unf, 𝑍 = unpr ) − 𝑃( 𝑌 = fav ∣ 𝑦̂(𝑋) = unf, 𝑍 = priv ))].

Equation 10.4
An example calculation for average predictive value difference is shown in Figure 10.7. The example
illustrates a case in which the two halves of the metric cancel out because they have opposite sign, so a
version with absolute values before averaging makes sense:
average absolute predictive value difference = ½[|𝑃( 𝑌 = fav ∣ 𝑦̂(𝑋) = fav, 𝑍 = unpr ) − 𝑃( 𝑌 = fav ∣ 𝑦̂(𝑋) = fav, 𝑍 = priv )| + |𝑃( 𝑌 = fav ∣ 𝑦̂(𝑋) = unf, 𝑍 = unpr ) − 𝑃( 𝑌 = fav ∣ 𝑦̂(𝑋) = unf, 𝑍 = priv )|].

Equation 10.5
Figure 10.7. An example calculation of average predictive value difference. The crosses below the members indi-
cate a true need for care management. Accessible caption. In the unprivileged group, 2 members receive
true favorable outcomes and 1 receives a false unfavorable outcome, giving a 2/3 positive predictive
value. In the privileged group, 3 members receive true favorable outcomes and 1 receives a false unfa-
vorable outcome, giving a 3/4 positive predictive value. The positive predictive value difference is
−0.08. In the unprivileged group, 2 members receive a false unfavorable outcome and 5 receive a true
unfavorable outcome, giving a 2/7 false omission rate. In the privileged group, 1 member receives a
false unfavorable outcome and 5 receive a true unfavorable outcome, giving a 1/6 false omission rate.
The false omission rate difference is 0.12. Averaging the two differences gives a 0.02 average predictive
value difference.
10.3.5 Choosing Between Average Odds and Average Predictive Value Difference
What’s the difference between separation and sufficiency? Which one makes more sense for the Sospital
care management model? This is not a decision based on politics and worldviews like the decision
between independence and separation. It is a decision based on what the favorable label grants the
affected user: is it assistive or simply non-punitive?9 Getting a loan is assistive, but not getting arrested
is non-punitive. Receiving care management is assistive. In assistive cases like receiving extra care,
separation (equalized odds) is the preferred fairness metric because it relates to recall (true positive
rate), which is of primary concern in these settings. If receiving care management had been a non-
punitive act, then sufficiency (calibration) would have been the preferred fairness metric because
precision is of primary concern in non-punitive settings. (Precision is equivalent to positive predictive
value, which is one of the two components of the average predictive value difference.)
10.3.6 Conclusion
You can construct all sorts of different group fairness metrics by computing differences or ratios of the
various confusion matrix entries and other classifier performance metrics detailed in Chapter 6, but
independence, separation, and sufficiency are the three main ones. They are summarized in Table 10.1.
9
Karima Makhlouf, Sami Zhioua, and Catuscia Palamidessi. “On the Applicability of ML Fairness Notions.” arXiv:2006.16745,
2020. Boris Ruf and Marcin Detyniecki. “Towards the Right Kind of Fairness in AI.” arXiv:2102.08453, 2021.
Type                      | Statistical Relationship | Fairness Metric                     | Can Be A Dataset Metric? | Social Bias in Measurement | Favorable Label
independence              | 𝑌̂ ⫫ 𝑍                    | statistical parity difference       | yes                      | yes                        | assistive or non-punitive
separation                | 𝑌̂ ⫫ 𝑍 ∣ 𝑌                | average odds difference             | no                       | no                         | assistive
sufficiency (calibration) | 𝑌 ⫫ 𝑍 ∣ 𝑌̂                | average predictive value difference | no                       | no                         | non-punitive
Based on the different properties of the three group fairness metrics, and the likely social biases in the
data you’re using to create the Sospital care management model, you should focus on independence and
statistical parity difference.
10.4.1 Consistency
The consistency metric is quantified as follows:
$\mathrm{consistency} = 1 - \frac{1}{n} \sum_{j=1}^{n} \Big| \hat{y}_j - \frac{1}{k} \sum_{j' \in \mathcal{N}_k(x_j)} \hat{y}_{j'} \Big|.$

Equation 10.6
For each of the 𝑛 Sospital members, the prediction 𝑦̂𝑗 is compared to the average prediction of the 𝑘
nearest neighbors. When the predicted labels of all of the 𝑘 nearest neighbors match the predicted label
of the person themselves, you get 0. If all of the nearest neighbor predicted labels are different from the
predicted label of the person, the absolute value is 1. Overall, because of the ‘one minus’ at the beginning
of Equation 10.6, the consistency metric is 1 if all similar points have similar labels and less than 1 if
similar points have different labels.
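A minimal sketch of this computation with scikit-learn's nearest-neighbor search is shown below. The function name, the choice of k, and the use of an unmodified Euclidean distance on the feature matrix are assumptions; you would adjust the distance (and whether protected attributes are included) according to your worldview, as discussed next.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def consistency(X, y_hat, k=5):
    """Equation 10.6: one minus the average gap between each prediction and its k nearest neighbors'."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)    # +1 because each point is its own nearest neighbor
    _, idx = nn.kneighbors(X)
    neighbor_means = y_hat[idx[:, 1:]].mean(axis=1)    # drop the point itself, average the k others
    return 1.0 - np.mean(np.abs(y_hat - neighbor_means))
```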
The biggest question in individual fairness is deciding the distance metric by which the nearest
neighbors are determined. Which kind of distance makes sense? Should all features be used in the
distance computation? Should protected attributes be excluded? Should some feature dimensions be
corrected for in the distance computation? These choices are where politics and worldviews come into
play.10 Typically, protected attributes are excluded, but they don’t have to be. If you believe there is no
bias during measurement (the “what you see is what you get” worldview), then you should simply use
the features as is. In contrast, suppose you believe that there are structural social biases in measurement
(the “we’re all equal” worldview). In that case, you should attempt to undo those biases by correcting the
features as they’re fed into a distance computation. For example, if you believe that blacks with three
outpatient doctor visits are equal in health to whites with five outpatient doctor visits, then your distance
metric can add two outpatient visits to the black members as a correction.
10
Reuben Binns. “On the Apparent Conflict Between Individual and Group Fairness.” In: Proceedings of the ACM Conference on
Fairness, Accountability, and Transparency. Barcelona, Spain, Jan. 2020, pp. 514–524.
11
Joshua R. Loftus, Chris Russell, Matt J. Kusner, and Ricardo Silva. “Causal Reasoning for Algorithmic Fairness.”
arXiv:1805.05859, 2018.
12
Alice Xiang. “Reconciling Legal and Technical Approaches to Algorithmic Bias.” In: Tennessee Law Review 88.3 (2021).
A value of 1 indicates a totally unfair society where one person holds all the wealth and a value of 0 indicates an
egalitarian society where all people have the same amount of wealth.
What is the equivalent of wealth in the context of machine learning and distributive justice in health
care management? It has to be some sort of non-negative benefit value 𝑏𝑗 that you want to be equal for
different Sospital members. Once you’ve defined the benefit 𝑏𝑗 , plug it into the Theil index expression
and use it as a combined group and individual fairness metric:
$\mathrm{Theil\ index} = \frac{1}{n} \sum_{j=1}^{n} \frac{b_j}{\bar{b}} \log \frac{b_j}{\bar{b}}.$

Equation 10.7
The equation averages the benefit divided by the mean benefit 𝑏̅, multiplied by its natural log, across all
people.
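A small sketch of the computation is shown below; the function name and the example benefit function are illustrative, with the lambda encoding the benefit assignment suggested by the cited research group.

```python
import numpy as np

def theil_index(y_true, y_pred, benefit_fn):
    """Equation 10.7 applied to per-person benefits b_j derived from the classification outcome."""
    b = np.array([benefit_fn(t, p) for t, p in zip(y_true, y_pred)], dtype=float)
    b = b / b.mean()                                   # divide by the mean benefit
    # 0 * log(0) is taken as 0 for people with zero benefit.
    return float(np.mean(b * np.log(np.maximum(b, 1e-12))))

# Benefit suggested by the cited group: 2 for false favorable, 1 for correct, 0 for false unfavorable.
benefit = lambda t, p: 2.0 if (p == 1 and t == 0) else (0.0 if (p == 0 and t == 1) else 1.0)
```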
That’s all well and good, but benefit to whom and under which worldview? The research group that
proposed using the Theil index in algorithmic fairness suggested that 𝑏𝑗 be 2 for false favorable labels
(false positives), 1 for true favorable labels (true positives), 1 for true unfavorable labels (true negatives),
and 0 for false unfavorable labels (false negatives).13 This recommendation is seemingly consistent with
the “what you see is what you get” worldview because it is examining model performance, assumes the
costs of false positives and false negatives are the same, and takes the perspective of affected members
who want to get care management even if they are not truly suitable candidates. More appropriate
benefit functions for the problem specification of the Sospital model may be 𝑏𝑗 that are (1) 1 for true
favorable and true unfavorable labels and 0 for false favorable and false unfavorable labels (“what you
see is what you get” while balancing societal needs), or (2) 1 for true favorable and false favorable labels
and 0 for true unfavorable and false unfavorable labels (“we’re all equal”).
10.4.4 Conclusion
Individual fairness consistency and Theil index are both excellent ways to capture various nuances of
fairness in different contexts. Just like group fairness metrics, they require you to clarify your worldview
and aim for the same goals in a bottom-up way. Since the Sospital care management setting is regulated
using group fairness language, it behooves you to use group fairness metrics in your problem
specification and modeling. Counterfactual or causal fairness is a strong requirement from the
perspective of the philosophy and science of law, but the regulations are only just catching up. So you
might need to utilize causal fairness in problem specifications in the future, but not just yet. As you’ve
learned so far, the problem specification and data phases are critical for fairness. But that makes the
modeling phase no less important. The next section focuses on bias mitigation to improve fairness as
part of the modeling pipeline.
13
Till Speicher, Hoda Heidari, Nina Grgić-Hlača, Krishna P. Gummadi, Adish Singla, Adrian Weller, and Muhammad Bilal
Zafar. “A Unified Approach to Quantifying Algorithmic Unfairness: Measuring Individual & Group Unfairness via Inequality
Indices.” In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. London, England, UK,
Jul. 2018, pp. 2239–2248.
14
The assumption that training datasets contain protected attributes can be violated for regulatory or privacy reasons. The
situation is known as fairness under unawareness. See: Jiahao Chen, Nathan Kallus, Xiaojie Mao, Geoffry Svacha, and Madeleine
Udell. “Fairness Under Unawareness.” In: Proceedings of the ACM Conference on Fairness, Accountability, and Transparency. Atlanta,
Georgia, USA, Jan. 2019, pp. 339–348.
15
Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. “Fairness Without Demographics in Re-
peated Loss Minimization.” In: Proceedings of the International Conference on Machine Learning. Stockholm, Sweden, Jul. 2018, pp.
1929–1938. Preethi Lahoti, Alex Beutel, Jilin Chen, Kang Lee, Flavien Prost, Nithum Thain, Xuezhi Wang, and Ed H. Chi. “Fair-
ness Without Demographics through Adversarially Reweighted Learning.” In: Advances in Neural Information Processing Systems
33 (Dec. 2020), pp. 728–740.
because that understanding is used in this and other later chapters. The reason to dive into the details
of bias mitigation algorithms is different. In choosing a bias mitigation algorithm, you have to (1) know
where in the pipeline you can intervene, (2) consider your worldview, and (3) understand whether
protected attributes are allowed as features and will be available in the deployment data when you are
scoring new Sospital members.
Figure 10.8. Three types of bias mitigation algorithms in different parts of the machine learning pipeline. Acces-
sible caption. A block diagram with a training dataset as input to a bias mitigation pre-processing
block with a pre-processed dataset as output. The pre-processed dataset is input to a bias mitigation
in-processing block with an initial model as output. The initial model is input to a bias mitigation post-
processing block with a final model as output.
10.5.1 Pre-Processing
At the pre-processing stage of the modeling pipeline, you don’t have the trained model yet. So pre-
processing methods cannot explicitly include fairness metrics that involve model predictions.
Therefore, most pre-processing methods are focused on the “we’re all equal” worldview, but not
exclusively so. There are several ways for pre-processing a training data set: (1) augmenting the dataset
with additional data points, (2) applying instance weights to the data points, and (3) altering the labels.
One of the simplest algorithms for pre-processing the training dataset is to append additional rows
of made-up members that do not really exist. These imaginary members are constructed by taking
existing member rows and flipping their protected attribute values (like counterfactual fairness).16 The
augmented rows are added sequentially based on a distance metric so that ‘realistic’ data points close
to modes of the underlying dataset are added first. This ordering maintains the fidelity of the data
distribution for the learning task. A plain uncorrected distance metric takes the “what you see is what
you get” worldview and only overcomes sampling bias, not measurement bias. A corrected distance
metric like the example described in the previous section (adding two outpatient visits to the black
members) takes the “we’re all equal” worldview and can overcome both measurement and sampling
bias (threats to both construct and external validity). This data augmentation approach needs to have
protected attributes as features of the model and they must be available in deployment data.
Another way to pre-process the training data set is through sample weights, similar to inverse
probability weighting and importance weighting seen in Chapter 8 and Chapter 9, respectively. The
reweighing method is geared toward improving statistical parity (“we’re all equal” worldview), which can
be assessed before the care management model is trained and is a dataset fairness metric.17 The goal of
16
Shubham Sharma, Yunfeng Zhang, Jesús M. Ríos Aliaga, Djallel Bouneffouf, Vinod Muthusamy, and Kush R. Varshney. “Data
Augmentation for Discrimination Prevention and Bias Disambiguation.” In: Proceedings of the AAAI/ACM Conference on AI, Ethics,
and Society. New York, New York, USA, Feb. 2020, pp. 358–364.
17
Faisal Kamiran and Toon Calders. “Data Preprocessing Techniques for Classification without Discrimination.” In: Knowledge
and Information Systems 33.1 (Oct. 2012), pp. 1–33.
independence between the label and protected attribute corresponds to their joint probability being the
product of their marginal probabilities. This product probability appears in the numerator and the
actual observed joint probability appears in the denominator of the weight:
$w_j = \frac{P(Z = z_j)\, P(Y = y_j)}{P(Z = z_j, Y = y_j)}.$

Equation 10.8
Protected attributes are required in the training data to learn the model, but they don’t have to be part
of the model or the deployment data.
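A brief sketch of the reweighing computation is shown below; the function name is hypothetical, and the resulting weights are intended to be passed as sample weights to whatever standard training algorithm you use.

```python
import numpy as np

def reweighing_weights(y, z):
    """Reweighing (Equation 10.8): w = P(Z=z)P(Y=y) / P(Z=z, Y=y) for each data point's group and label."""
    n = len(y)
    w = np.empty(n)
    for zv in np.unique(z):
        for yv in np.unique(y):
            mask = (z == zv) & (y == yv)
            p_joint = mask.mean()                      # observed joint probability
            w[mask] = (np.mean(z == zv) * np.mean(y == yv)) / max(p_joint, 1e-12)
    return w

# Pass as sample weights to a standard training algorithm, e.g.
# model.fit(X, y, sample_weight=reweighing_weights(y, z))
```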
Whereas data augmentation and reweighing do not change the training data you have from historical
care management decisions, other methods do. One simple method, only for statistical parity and the
“we’re all equal” worldview, known as massaging flips unfavorable labels of unprivileged group members
to favorable labels and favorable labels of privileged group members to unfavorable labels.18 The chosen
data points are those closest to the decision boundary, where the classifier has the least confidence.
Massaging does not require protected attributes in the deployment data.
A different approach, the fair score transformer, works on (calibrated) continuous score labels 𝑆 =
𝑝𝑌∣𝑋 (𝑌 = 𝑓𝑎𝑣 ∣ 𝑥) rather than binary labels.19 It is posed as an optimization in which you find
transformed scores 𝑆′ that have small cross-entropy with the original scores 𝑆, i.e. 𝐻(𝑆 ∥ 𝑆′), while
constraining the statistical parity difference, average odds difference, or other group fairness metrics of
your choice to be of small absolute value. You convert the pre-processed scores back into binary labels
with weights to feed into a standard training algorithm. You can take the “what you see is what you get”
worldview with the fair score transformer because it assumes that the classifier later trained on the pre-
processed dataset is competent, so that the pre-processed score it produces is a good approximation to
the score predicted by the trained model. Although there are pre-processing methods that alter both the
labels and (structured or semi-structured) features,20 the fair score transformer proves that you only
need to alter the labels. It can deal with deployment data that does not come with protected attributes.
Data augmentation, reweighing, massaging, and fair score transformer all have their own domains
of competence. Some perform better than others on different fairness metrics and dataset
characteristics. You’ll have to try different ones to see what happens on the Sospital data.
18
Faisal Kamiran and Toon Calders. “Data Preprocessing Techniques for Classification without Discrimination.” In: Knowledge
and Information Systems 33.1 (Oct. 2012), pp. 1–33.
19
Dennis Wei, Karthikeyan Natesan Ramamurthy, and Flavio P. Calmon. “Optimized Score Transformation for Fair Classifica-
tion.” In: Proceedings of the International Conference on Artificial Intelligence and Statistics. Aug. 2020, pp. 1673–1683.
20
Some examples are the methods described in the following three papers. Michael Feldman, Sorelle A. Friedler, John Moeller,
Carlos Scheidegger, and Suresh Venkatasubramanian. “Certifying and Removing Disparate Impact.” In: Proceedings of the ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining. Sydney, Australia, Aug. 2015, pp. 259–268. Flavio P.
Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R. Varshney. “Optimized Pre-Pro-
cessing for Discrimination Prevention.” In: Advances in Neural Information Processing Systems 30 (Dec. 2017), pp. 3992–4001.
Prasanna Sattigeri, Samuel C. Hoffman, Vijil Chenthamarakshan, and Kush R. Varshney. “Fairness GAN: Generating Datasets
with Fairness Properties Using a Generative Adversarial Network.” In: IBM Journal of Research and Development 63.4/5 (Jul./Sep.
2019), p. 3.
10.5.2 In-Processing
In-processing bias mitigation algorithms are straightforward to state, but often more difficult to actually
optimize. The statement is as follows: take an existing risk minimization supervised learning algorithm,
such as (a repetition of Equation 7.4):
\hat{y}(\cdot) = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{j=1}^{n} L\big(y_j, f(x_j)\big) + \lambda J(f)
Equation 10.9
and regularize or constrain it using a fairness metric. The algorithm can be logistic regression and the
regularizer can be statistical parity difference, in which case you have the prejudice remover.21 More
recent fair learning algorithms are broader and allow for any standard risk minimization algorithm
along with a broad set of group fairness metrics as constraints that cover the different types of fairness.22
A recent in-processing algorithm regularizes the objective function using a causal fairness term. Under
strong ignorability assumptions (remember from Chapter 8 that these are no unmeasured confounders
and overlap), the regularizer is an average treatment effect-like term 𝐽 = 𝐸[ 𝑌 ∣ 𝑑𝑜(𝑧 = 1), 𝑋 ] −
𝐸[ 𝑌 ∣ 𝑑𝑜(𝑧 = 0), 𝑋 ].23
Once trained, the resulting models can be used on new unseen Sospital members. These in-
processing algorithms do not require the deployment data to contain the protected attribute. The trick
with all of them is structuring the bias mitigating regularization term or constraint so that the objective
function can tractably be minimized through an optimization algorithm.
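To make the idea concrete, here is a small sketch of fairness-regularized logistic regression in the spirit of the prejudice remover (the actual method regularizes a prejudice index rather than the squared statistical parity difference used here). X, y, and z are assumed to be numpy arrays, and lam controls the strength of the fairness regularizer.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fair_logistic_regression(X, y, z, lam=1.0, lr=0.1, epochs=500):
    """Gradient descent on log-loss plus a squared statistical parity difference penalty."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        s = sigmoid(X @ w)                          # predicted scores
        grad_loss = X.T @ (s - y) / n               # gradient of the log-loss
        gap = s[z == 1].mean() - s[z == 0].mean()   # statistical parity difference
        ds = s * (1 - s)                            # derivative of the sigmoid
        grad_gap = (X[z == 1] * ds[z == 1, None]).mean(axis=0) \
                 - (X[z == 0] * ds[z == 0, None]).mean(axis=0)
        w -= lr * (grad_loss + lam * 2 * gap * grad_gap)  # d/dw of gap**2
    return w
```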
10.5.3 Post-Processing
If you’re in the situation that the Sospital care management model has already been trained and you
cannot change it or touch the training data (for example if you are purchasing a pre-trained model from
a vendor to include in your pipeline), then the only option you have is to mitigate unwanted biases using
post-processing. You can only alter the output predictions 𝑌̂ to meet the group fairness metrics you
desire based on your worldview (i.e. flipping the predicted labels from receiving care management to
not receiving care management and vice versa). If you have some validation data with labels, you can
post-process with the “what you see is what you get” worldview. You can always post-process with the
“we’re all equal” worldview, with or without validation data.
21
Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. “Fairness-Aware Classifier with Prejudice Remover
Regularizer.” In: Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Bristol, Eng-
land, UK, Sep. 2012, pp. 35–50.
22
Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. “A Reductions Approach to Fair Clas-
sification.” In: Proceedings of the International Conference on Machine Learning. Stockholm, Sweden, Jul. 2018, pp. 60–69. L. Elisa
Celis, Lingxiao Huang, Vijay Kesarwani, and Nisheeth K. Vishnoi. “Classification with Fairness Constraints: A Meta-Algorithm
with Provable Guarantees.” In: Proceedings of the ACM Conference on Fairness, Accountability, and Transparency. Atlanta, Georgia,
USA, Jan. 2019, pp. 319–328. Ching-Yao Chuang and Youssef Mroueh. “Fair Mixup: Fairness via Interpolation.” In: Proceedings
of the International Conference on Learning Representations. May 2021.
23
Pietro G. Di Stefano, James M. Hickey, and Vlasios Vasileiou. “Counterfactual Fairness: Removing Direct Effects Through Reg-
ularization.” arXiv:2002.10774, 2020.
Since group fairness metrics are computed on average, flipping any random member’s label within
a group is the same as flipping any other random member’s.24 A random selection of people, however,
seems to be procedurally unfair. To overcome this issue, similar to massaging, you can prioritize flipping
the labels of members whose data points are near the decision boundary and are thus low confidence
samples.25 You can also choose people within a group so that you reduce individual counterfactual
unfairness.26 All of these approaches require the protected attribute in the deployment data.
The fair score transformer described in the pre-processing section also has a post-processing
version, which does not require the protected attribute and should be the first choice among post-processing
bias mitigation algorithms if the base classifier outputs continuous scores. It performs well empirically
and is not computationally intensive. Just like the pre-processing
version, the idea is to find an optimal transformation of the predicted score output into a new score,
which can then be thresholded to a binary prediction for the final care management decision that
Sospital makes.
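A simple sketch of the low-confidence flipping idea (reject option classification) is below. It assumes numpy arrays of scores and binary protected attributes; in practice the margin would be tuned until the chosen group fairness metric reaches an acceptable value.

```python
import numpy as np

def reject_option_postprocess(scores, z, margin=0.1, favorable=1):
    """Flip predictions whose scores fall near the 0.5 decision boundary:
    unprivileged members (z == 0) get the favorable label, privileged
    members (z == 1) get the unfavorable label."""
    y_hat = (scores >= 0.5).astype(int)
    uncertain = np.abs(scores - 0.5) <= margin
    y_hat[uncertain & (z == 0)] = favorable
    y_hat[uncertain & (z == 1)] = 1 - favorable
    return y_hat
```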
10.5.4 Conclusion
All of the different bias mitigation algorithms are options as you’re deciding what to finally do in the care
management modeling pipeline. The things you have to think about are:
1. where in the pipeline can you make alterations (this will determine the category pre-, in-, or
post-processing)
2. which worldview you and the problem owner have decided on (this will rule out algorithms
that do not match that worldview)
3. whether the deployment data contains the protected attributes (if not, this will disallow some
algorithms that require them).
These different decision points are summarized in Table 10.2. After that, you can just go with the
algorithm that gives you the best quantitative results. But what is best? It is simply the pipeline with the
best value for the fairness metric you’ve chosen in your problem specification.
But you might ask, shouldn’t I consider a tradeoff of fairness and accuracy when I choose the
pipeline? Balancing tradeoffs and relationships among different elements of trustworthy machine
learning is more fully covered in Chapter 14, but before getting there, it is important to note one
important point. Even though it is a convenient shortcut, measuring classification accuracy on data from
the prepared data space, which already contains social bias, representation bias, and data preparation
bias, is not the right thing to do. Just as you should measure the performance of distribution shift
adaptation on data from the new environment (its construct space), you should measure accuracy after
bias mitigation in the construct space, where there is no unfairness. There is a tradeoff between fairness
and accuracy measured in the prepared data space, but importantly there is no tradeoff between
24
Geoff Pleiss, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q. Weinberger. “On Fairness and Calibration.” In: Ad-
vances in Neural Information Processing Systems 31 (Dec. 2017), pp. 5684–5693.
25
Faisal Kamiran, Asim Karim, and Xiangliang Zhang. “Decision Theory for Discrimination-Aware Classification.” In: Proceed-
ings of the IEEE International Conference on Data Mining. Brussels, Belgium, Dec. 2012, pp. 924–929.
26
Pranay K. Lohia, Karthikeyan Natesan Ramamurthy, Manish Bhide, Diptikalyan Saha, Kush R. Varshney, and Ruchir Puri.
“Bias Mitigation Post-Processing for Individual and Group Fairness.” In: IEEE International Conference on Acoustics, Speech, and
Signal Processing. Brighton, England, UK, May 2019, pp. 2847–2851.
accuracy and fairness in the construct space.27 You can approximate a construct space test set by using
the data augmentation pre-processing method.
In your Sospital problem, you have almost complete flexibility because you do control the training
data and model training, are focused on independence and the “we’re all equal” worldview, and are able
to include protected attributes for Sospital’s members in the deployment data. Try everything, but start
with the fair score transformer pre-processing.
27
Michael Wick, Swetasudha Panda, and Jean-Baptiste Tristan. “Unlocking Fairness: A Trade-Off Revisited.” In: Advances in
Neural Information Processing Systems 32 (Dec. 2019), pp. 8783–8792. Kit T. Rodolfa, Hemank Lamba, and Rayid Ghani. “Empiri-
cal Observation of Negligible Trade-Offs in Machine Learning for Public Policy.” In: Nature Machine Intelligence 3 (Oct. 2021), pp.
896–904.
“I continue to worry that in CS (as in psychology), debates about bias have become a
powerful distraction—drawing attention away from what's most important toward
what's more easily measurable.”
The second issue is as follows. Have we too easily swept the important considerations of algorithmic
fairness under the rug of mathematics? Yes and no. If you have truly thought through the different
sources of inequity arising throughout the machine learning lifecycle utilizing a panel of diverse voices,
then applying the quantitative metrics and mitigation algorithms is actually pretty straightforward. It is
straightforward because of the hard work you’ve done before getting to the modeling phase of the
lifecycle and you should feel confident in going forward. If you have not done the hard work earlier in
the lifecycle (including problem specification), blindly applying bias mitigation algorithms might not
reduce harms and can even exacerbate them. So don’t take shortcuts.
10.7 Summary
▪ Fairness has many forms, arising from different kinds of justice. Distributive justice is the most
appropriate for allocation decisions made or supported by machine learning systems. It asks for
some kind of sameness in the outcomes across individuals and groups.
▪ Unfairness can arise from problem misspecification (including inappropriate proxy labels),
feature engineering, measurement of features from the construct space to the observed space,
and sampling of data points from the observed space to the raw data space.
▪ There are two important worldviews in determining which kind of sameness is most appropriate
for your problem.
▪ If you believe there are social biases in measurement (not only representation biases in
sampling), then you have the “we’re all equal” worldview; independence and statistical parity
difference are appropriate notions of group fairness.
▪ If you believe there are no social biases in measurement, only representation biases in sampling,
then you have the “what you see is what you get” worldview; separation, sufficiency, average odds
difference, and average predictive value difference are appropriate notions of group fairness.
▪ If the favorable label is assistive, separation and average odds difference are appropriate notions
of group fairness. If the favorable label is non-punitive, sufficiency and average predictive value
difference are appropriate notions of group fairness.
▪ Individual fairness is a limiting version of group fairness with finer and finer groups. Worldviews
play a role in determining distance metrics between individuals.
▪ Bias mitigation algorithms can be applied as pre-processing, in-processing, or post-processing
within the machine learning pipeline. Different algorithms apply to different worldviews. The
choice of algorithm should consider the worldview in addition to empirical performance.
11
Adversarial Robustness
Imagine that you are a data scientist at a (fictional) new player in the human resources (HR) analytics
space named HireRing. The company creates machine learning models that analyze resumes and
metadata in job application forms to prioritize candidates for hiring and other employment decisions.
They go in and train their algorithms on each of their corporate clients’ historical data. As a major value
proposition, the executives of HireRing have paid extra attention to ensuring robustness to distribution
shift and ensuring fairness of their machine learning pipelines and are now starting to focus their
problem specification efforts on securing models from malicious acts. You have been entrusted to lead
the charge in this new area of machine learning security. Where should you begin? What are the different
threats you need to be worried about? What can you do to defend against potential adversarial attacks?
Adversaries are people trying to achieve their own goals to the detriment of the goals of HireRing and
their clients, usually in a secretive way. For example, they may simply want to make the accuracy of an
applicant prioritization model worse. They may be more sophisticated and want to trick the machine
learning system into putting some small group of applicants at the top of the priority list irrespective of
the employability expressed in their features while leaving the model’s behavior unchanged for most
applicants.
This chapter teaches you all about defending and certifying the models HireRing builds for its clients
by:
▪ distinguishing different threat models based on what the adversary attacks (training data or
models), their goals, and what they are privy to know and change,
▪ defending against different types of attacks through algorithms that add robustness to models,
and
▪ certifying the robustness of models against evasion attacks using quantitative metrics.
The topic of adversarial robustness relates to the other two chapters in this part of the book on
reliability (distribution shift and fairness) because it also involves a mismatch between the training data
and the deployment data. You do not know what that difference is going to be, so you have epistemic
uncertainty that you want to adapt to or be robust against. In distribution shift, the difference in
distributions is naturally occurring; in fairness, the difference between a just world and the world we
live in is because of encompassing societal reasons; in adversarial robustness, the difference between
distributions is because of a sneaky adversary. Another viewpoint on adversarial attacks is not through
the lens of malicious actors, but from the lens of probing system reliability—pushing machine learning
systems to their extremes—by testing them in worst case scenarios. This alternative viewpoint is not the
framing of the chapter, but you should keep it in the back of your mind and we will return to it in Chapter
13.
HireRing has just been selected by a large (fictional) retail chain based in the midwestern United
States named Kermis to build them a resume and job application screening model. This is your first
chance to work with a real client on the problem specification phase for adversarial robustness and not
take any shortcuts. To start, you need to work through the different types of malicious attacks and decide
how you can make the HireRing model being developed for Kermis the most reliable and trustworthy it
can be. Later you’ll work on the modeling phase too.
1
Anirban Chakraborty, Manaar Alam, Vishal Dey, Anupam Chattopadhyay, and Debdeep Mukhopadhyay. “Adversarial Attacks
and Defences: A Survey.” arXiv:1810.00069, 2018. Ximeng Liu, Lehui Xie, Yaopeng Wang, Jian Zou, Jinbo Xiong, Zuobin Ying,
and Athanasios V. Vasilakos. “Privacy and Security Issues in Deep Learning: A Survey.” In: IEEE Access 9 (Dec. 2021), pp. 4566–
4593.
Figure 11.1. A mental model for the different types of adversarial attacks, according to their target, their capa-
bility, and their goal. A hierarchy diagram with adversarial attacks at its root. Adversarial attacks has
children poisoning and evasion, both of which are in the target dimension. Poisoning has children data
injection, data modification, and logic corruption, which are in the capability dimension. Evasion has
children strict closed-box, adaptive closed-box, non-adaptive closed-box, and open-box, which are in
the capability dimension. Below the hierarchy diagram are items in the goal dimension: confidence
reduction, misclassification, targeted misclassification, and source/target misclassification, which ap-
ply to the whole diagram.
11.1.1 Target
Adversaries may target either the modeling phase or the deployment phase of the machine learning
lifecycle. By attacking the modeling phase, they can corrupt the training data or model so that it is
mismatched from the data seen in deployment. These are known as poisoning attacks and have
similarities with distribution shift, covered in Chapter 9, as they change the statistics of the training data
or model. Evasion attacks, which target the deployment phase, are a different beast: they have no direct
parallel with distribution shift, but bear a loose similarity to individual fairness, covered in Chapter
10. These attacks are focused on altering individual examples (individual resumes) that are fed into the
machine learning system to be evaluated. As such, modifications to single data points may not affect the
deployment probability distribution much at all, but can nevertheless achieve the adversary’s goals for
a given input resume.
One way to understand poisoning and evasion attacks is by way of the decision boundary, shown in
Figure 11.2. Poisoning attacks shift the decision boundary in a way that the adversary wants. In contrast,
evasion attacks do not shift the decision boundary, but shift data points across the decision boundary in
ways that are difficult to detect. An original data point, the features of the resume 𝑥, shifted by δ becomes
𝑥 + δ. A basic mathematical description of an evasion attack is the following:
\hat{y}(x + \delta) \neq \hat{y}(x) \quad \text{subject to} \quad \|\delta\| < \epsilon
Equation 11.1
The adversary wants to find a small perturbation 𝛿 to add to the resume 𝑥 so that the predicted label
changes (𝑦̂(𝑥 + 𝛿) ≠ 𝑦̂(𝑥)) from select to reject or vice versa. In addition, the perturbation should be
smaller in length or norm ‖⋅‖ than some small value 𝜖. The choice of norm and value depend on the
application domain. For semi-structured data modalities, the norm should capture human perception
so that the perturbed data point and the original data point look or sound almost the same to people.
Figure 11.2. Examples of a poisoning attack (left) and an evasion attack (right). In the poisoning attack, the ad-
versary has injected a new data point, the square with the light border into the training data. This action shifts the
decision boundary from what it would have been: the solid black line, to something else that the adversary desires:
the dashed black line. More diamond deployment data points will now be misclassified. In the evasion attack, the
adversary subtly perturbs a deployment data point across the decision boundary so that it is now misclassified.
Accessible caption. The stylized plot illustrating a poisoning attack shows two classes of data points
arranged in a noisy yin yang or interleaving moons configuration and a decision boundary smoothly
encircling one of the classes with a blob-like region. A poisoning data point with the label of the inside
of the region is added outside the region. It causes a new decision boundary that puts it inside, while
also causing the misclassification of another data point. The stylized plot illustrating the evasion attack
has a data point inside the blob-like region. The attack pushes it outside the region.
Another way to write the label change 𝑦̂(𝑥 + 𝛿) ≠ 𝑦̂(𝑥) is through the zero-one loss function:
𝐿(𝑦̂(𝑥), 𝑦̂(𝑥 + 𝛿)) = 1. (Remember that the zero-one loss takes value 0 when both arguments are the same
and value 1 when the arguments are different.) Because the zero-one loss can only take the two values 0
and 1, you can also write the adversarial example using a maximum as:
x_{\text{adv}} = x + \delta^*, \quad \delta^* = \arg\max_{\|\delta\| < \epsilon} L\big(\hat{y}(x), \hat{y}(x + \delta)\big)
Equation 11.2
In this notation, you can also put in other loss functions such as cross-entropy loss, logistic loss, and
hinge loss from Chapter 7.
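As a concrete (open-box) illustration of the maximization in Equation 11.2, the one-step fast gradient sign method perturbs the input in the direction that increases the loss, subject to an infinity-norm budget. The sketch below assumes a differentiable PyTorch classifier; it is not how a closed-box attacker, who must estimate gradients from queries, would operate.

```python
import torch

def fgsm_example(model, x, y, eps=0.05):
    """One-step approximation to maximizing the loss over perturbations of size at most eps."""
    loss_fn = torch.nn.CrossEntropyLoss()
    x_adv = x.clone().detach().requires_grad_(True)
    loss_fn(model(x_adv), y).backward()
    with torch.no_grad():
        x_adv = x_adv + eps * x_adv.grad.sign()   # move along the sign of the loss gradient
    return x_adv.detach()
```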
11.1.2 Capability
Some adversaries are more capable than others. In the poisoning category, adversaries change the
training data or model somehow, so they have to have some access inside Kermis’ information
technology infrastructure. The easiest thing they can do is slip in some additional resumes that get
added to the training data. This is known as data injection. More challenging is data modification, in which
the adversary changes labels or features in the existing training dataset. The most challenging of all is
logic corruption, in which the adversary changes the code and behavior of the machine learning algorithm
or model. You can think of the data injection and data modification attacks as somewhat similar to bias
mitigation pre-processing and logic corruption as somewhat similar to bias mitigation in-processing,
except for a nefarious purpose.
In the evasion category, the adversary does not need to change anything at all inside Kermis’
systems. So these are easier attacks to carry out. The attackers just have to create adversarial examples:
specially crafted resumes designed in clever ways to fool the machine learning system. But how
adversaries craft these tricky resumes depends on what information they have about how the model
makes its predictions. The easiest thing for an adversary to do is just submit a bunch of resumes into
the HireRing model and see whether they get selected or not; this gives the adversary a labeled dataset.
When adversaries cannot change the set of resumes and just have to submit a batch that they have, it is
called strict closed-box access. When they can change the input resumes based on the previous ones
they’ve submitted and the predicted labels they’ve observed, it is called adaptive closed-box access.
Adaptivity is a bit harder because the attacker might have to wait a while for Kermis to select or not select
the resumes that they’ve submitted. You might also be able to catch on that something strange is
happening over time. The next more difficult kind of information that adversaries can have about the
HireRing model trained for Kermis is known as non-adaptive closed-box access. Here, the adversary knows
the training data distribution 𝑝𝑋,𝑌 (𝑥, 𝑦) but cannot submit resumes. Finally, the classifier decision
function itself 𝑦̂(⋅) is the most difficult-to-obtain information about a model for an adversary. This full
knowledge of the classifier is known as open-box access.
Since Kermis has generally good cybersecurity overall, you should be less worried about poisoning
attacks, especially logic corruption attacks. Even open-box access for an evasion attack seems less
likely. Your biggest fear should be one of the closed-box evasion attacks. Nevertheless, you shouldn’t let
your guard down and you should still think about defending against all of the threats.
11.1.3 Goal
The third dimension of threats is the goal of the adversary, which applies to both poisoning and evasion
attacks. Different adversaries try to do different things. The easiest goal is confidence reduction: to shift
classifier scores so that they are closer to the middle of the range [0,1] and thus less confident. The next
goal is misclassification: trying to get the classifier to make incorrect predictions. (This is the formulation
given in Equation 11.1.) Job applications to Kermis that should be selected are rejected and vice versa.
When you have a binary classification problem like you do in applicant screening, there is only one way
to be wrong: predicting the other label. However, when you have more than two possible labels,
misclassification can produce any other label that is not the true one. Targeted misclassification goes a
step further and ensures that the misclassification isn’t just any other label, but a specific one of the
attacker’s choice. Finally, and most sophisticated of all, source/target misclassification attacks are
designed so that misclassification only happens for some input job applications and the label of the
incorrect prediction also depends on the input. Backdoor or Trojan attacks are an example of
source/target misclassification in which a small subset of inputs (maybe ones whose resumes include
some special keyword) trigger the application to be accepted. The more sophisticated goals are harder
to pull off, but also the most dangerous for Kermis and HireRing if successful. The problem specification
should include provisions to be vigilant for all these different goals of attacks.
Figure 11.3. Different categories of defenses against poisoning attacks in the machine learning pipeline. Accessi-
ble caption. A block diagram with a training dataset as input to a data sanitization block with a pre-pro-
cessed dataset as output. The pre-processed dataset is input to a smoothing block with an initial model
as output. The initial model is input to a patching block with a final model as output.
Machine learning defenses against poisoning attacks are an active area of research. Specific attacks
and defenses are continually improving in an arms race. Since by the time this book comes out, all
presently known attacks and defenses are likely to have been superseded, only the main ideas are given
rather than in-depth accounts.
2
Micah Goldblum, Dmitris Tsipras, Chulin Xie, Xinyun Chen, Avi Schwarzschild, Dawn Song, Aleksander Mądry, Bo Li, and
Tom Goldstein. “Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses.” arXiv:2012.10544,
2021.
The main idea of data sanitization, which builds on robust statistics (the branch of statistics concerned with detecting outliers), is as follows. The set of outliers is assumed to have small cardinality compared
to the clean, unpoisoned training resumes. The two sets of resumes, poison and clean, are differentiated
by having differing means normalized by their variances. Recent methods are able to differentiate the
two sets efficiently even when the number of features is large.3 For high-dimensional semi-structured
data, the anomaly detection should be done in a representation space rather than in the original input
feature space. Remember from Chapter 4 that learned representations and language models compactly
represent images and text data, respectively, using the structure they contain. Anomalies are more
apparent when the data is well-represented.
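A toy version of this kind of data sanitization, using per-class robust statistics (median and median absolute deviation) on tabular features, might look like the following; the cited methods are considerably more sophisticated and scale to high dimensions.

```python
import numpy as np

def sanitize(X, y, thresh=3.5):
    """Drop training points whose worst feature deviates too far from the
    per-class median, measured in median-absolute-deviation units."""
    keep = np.ones(len(y), dtype=bool)
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        Xc = X[idx]
        med = np.median(Xc, axis=0)
        mad = np.median(np.abs(Xc - med), axis=0) + 1e-12
        score = np.max(np.abs(Xc - med) / mad, axis=1)  # worst per-feature robust z-score
        keep[idx[score > thresh]] = False
    return X[keep], y[keep]
```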
11.2.2 Smoothing
When HireRing is training the Kermis classifier, defenses against data poisoning make the model more
robust by smoothing the score function. The general idea is illustrated in Figure 11.4, which compares
a smooth and less smooth score function. By preferring smooth score functions during training, there is
a lower chance for adversaries to succeed in their attacks.
Figure 11.4. A comparison of a smooth (left) and less smooth (right) score function. The value of the score function
is indicated by shading: it is 0 where the shading is white and 1 where the shading is black. The decision boundary,
where the score function takes value 0.5 is indicated by red lines. The less smooth score function may have been
attacked. Accessible caption. Stylized plot showing a decision boundary smoothly encircling one of the
classes with a blob-like region. The underlying score function is indicated by shading, becoming
smoothly whiter in the inside the region and smoothly blacker outside the region. This is contrasted
with another decision boundary that has some tiny enclaves of the opposite class inside the blob-like
region. Its underlying score function is not smooth.
Smoothing can be done, for example, by applying a k-nearest neighbor prediction on top of another
underlying classifier. By doing so, the small number of poisoned resumes are never in the majority of a
3
Pang Wei Koh, Jacob Steinhardt, and Percy Liang. “Stronger Data Poisoning Attacks Break Data Sanitization Defenses.” In:
neighborhood and their effect is ignored. Any little shifts in the decision boundary stemming from the
poisoned data points are removed. Another way to end up with a smooth score function, known as
gradient shaping, is by directly constraining or regularizing its slope or gradient within the learning
algorithm. When the magnitude of the gradient of a decision function is almost the same throughout the
feature space, it is resilient to perturbations caused by a small number of anomalous points: it is more
like the left score function in Figure 11.4 than the right score function. Smoothing can also be
accomplished by averaging together the score functions of several independent classifiers.
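One simple realization of the k-nearest neighbor smoothing idea is sketched below; score_fn is a hypothetical function returning the underlying classifier's continuous scores (or an intermediate representation), and the vote over true training labels washes out the influence of a few poisoned points.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_smoothed_classifier(score_fn, X_train, y_train, k=15):
    """Predict by a k-nearest-neighbor vote in the base model's score space."""
    train_scores = np.asarray(score_fn(X_train)).reshape(len(X_train), -1)
    knn = KNeighborsClassifier(n_neighbors=k).fit(train_scores, y_train)

    def predict(X):
        return knn.predict(np.asarray(score_fn(X)).reshape(len(X), -1))

    return predict
```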
11.2.3 Patching
Patching, primarily intended for neural network models, mitigates the effect of backdoor attacks as a
post-processing step. Backdoors show up as anomalous edge weights and node activations in neural
networks. There is something statistically weird about them. Say that you have already trained an initial
model on a poisoned set of Kermis job applications that has yielded a backdoor. The idea of the patching
is similar to how you fix a tear in a pair of pants. First you ‘cut’ the problem out of the ‘fabric’: you prune
the anomalous neural network nodes. Then you ‘sew’ a patch over it: you fine-tune the model with some
clean resumes or a set of resumes generated to approximate a clean distribution.
Figure 11.5. Different defenses against evasion attacks. Accessible caption. A hierarchy diagram with de-
fenses against evasion attacks at its root, which has denoising and adversarial training as its children.
Denoising has children input domain, frequency domain, and latent domain.
4
Zhonghan Niu, Zhaoxi Chen, Linyi Li, Yubin Yang, Bo Li, and Jinfeng Yi. “On the Limitations of Denoising Strategies as Adver-
sarial Defenses.” arXiv:2012.09384, 2020.
If your machine learning model is a neural network, then the values of the data as it passes through
intermediate layers constitute a latent representation. The third denoising category works in this latent
representation space. Like in the frequency domain, this category also squashes the values in certain
dimensions. However, these techniques do not simply assume that the adversarial noise is concentrated
in a certain part of the space (e.g. high frequencies), but learn these dimensions using clean job
applications and their corresponding adversarial examples that you create.
The second category of defense against evasion attacks, adversarial training, bakes robustness into model training by solving a min-max problem:

\hat{y}(\cdot) = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{j=1}^{n} \max_{\|\delta_j\| < \epsilon} L\big(y_j, f(x_j + \delta_j)\big)

Equation 11.3
Notice that the inner maximization is the same expression as finding adversarial examples given in
Equation 11.2. Thus, to carry out adversarial training, all you have to do is produce adversarial examples
for the Kermis training resumes and use those adversarial examples as a new training data set in a
typical machine learning algorithm. HireRing must become a good attacker to become a good defender.
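A minimal sketch of one adversarial training step in PyTorch is shown below; it uses a single-step attack for the inner maximization, whereas the cited work uses multiple projected gradient steps.

```python
import torch

def adversarial_training_step(model, optimizer, x_batch, y_batch, eps=0.05):
    """Inner maximization: craft adversarial examples for the current model.
    Outer minimization: take an ordinary gradient step on those examples."""
    loss_fn = torch.nn.CrossEntropyLoss()

    x_adv = x_batch.clone().detach().requires_grad_(True)
    loss_fn(model(x_adv), y_batch).backward()
    x_adv = (x_adv + eps * x_adv.grad.sign()).detach()

    optimizer.zero_grad()
    loss = loss_fn(model(x_adv), y_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```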
5
Aleksander Mądry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. “Towards Deep Learning Mod-
els Resistant to Adversarial Attacks.” In: Proceedings of the International Conference on Learning Representations. Vancouver, Can-
ada, Apr.–May 2018.
In contrast, a way to characterize the adversarial robustness of a classifier that is agnostic to the
evasion attack is the CLEVER score.6 An acronym for cross-Lipschitz extreme value for network
robustness, the CLEVER score (indirectly) analyzes the distance from a job application data point to the
classifier decision boundary. Misclassification attacks will be unsuccessful if this distance is too far
because it will exceed 𝜖, the bound on the norm of the perturbation 𝛿. The higher the CLEVER score, the
more robust the model. Generally speaking, smooth, non-complicated decision boundaries without
many small islands (like the left score function in Figure 11.4) have large distances from data points on
average, have large average CLEVER scores, and are robust to all kinds of evasion attacks. In the problem
specification phase with the Kermis problem owners, you can set an acceptable minimum value for the
average CLEVER score. If the model achieves it, HireRing can confidently certify a level of security and
robustness.
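The sketch below conveys the flavor of such a certificate without being the actual CLEVER computation: it estimates the distance from a point to the decision boundary as the score margin divided by a sampled estimate of the local Lipschitz constant, whereas CLEVER fits an extreme value distribution to the sampled gradient norms. The model is assumed to be a PyTorch classifier returning logits.

```python
import torch

def rough_robustness_score(model, x, n_samples=100, radius=0.1):
    """Margin between the top two logits divided by the largest gradient norm of
    that margin over random points near x (a crude local Lipschitz estimate)."""
    x = x.detach()
    logits = model(x.unsqueeze(0)).squeeze(0)
    top2 = torch.topk(logits, 2)
    margin = (top2.values[0] - top2.values[1]).item()

    grad_norms = []
    for _ in range(n_samples):
        xp = (x + radius * torch.randn_like(x)).requires_grad_(True)
        lp = model(xp.unsqueeze(0)).squeeze(0)
        g = torch.autograd.grad(lp[top2.indices[0]] - lp[top2.indices[1]], xp)[0]
        grad_norms.append(g.norm().item())

    return margin / (max(grad_norms) + 1e-12)
```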
11.4 Summary
▪ Adversaries are actors with bad intentions who try to attack machine learning models by
degrading their accuracy or fooling them.
▪ Poisoning attacks are implemented during model training by corrupting either the training data
or model.
▪ Evasion attacks are implemented during model deployment by creating adversarial examples
that appear genuine, but fool models into making misclassifications.
▪ Adversaries may just want to worsen model accuracy in general or may have targeted goals that
they want to achieve, such as obtaining specific predicted labels for specific inputs.
▪ Adversaries have different capabilities of what they know and what they can change. These
differences in capabilities and goals determine the threat.
▪ Defenses for poisoning attacks take place at different parts of the machine learning pipeline: data
sanitization (pre-processing), smoothing (model training), and patching (post-processing).
▪ Defenses for evasion attacks include denoising that attempts to remove adversarial
perturbations from inputs and adversarial training which induces min-max robustness.
▪ Models can be certified for robustness to evasion attacks using the CLEVER score.
▪ Even without malicious actors, adversarial attacks are a way for developers to test machine
learning systems in worst case scenarios.
6
Tsui-Wei Weng, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Dong Su, Yupeng Gao, Cho-Jui Hsieh, and Luca Daniel. “Evaluating the
Robustness of Neural Networks: An Extreme Value Theory Approach.” In: Proceedings of the International Conference on Learning
Representations. Vancouver, Canada, Apr.–May 2018.
12
Interpretability and Explainability
Hilo is a (fictional) startup company trying to shake up the online second home mortgage market. A type
of second mortgage known as a home equity line of credit (HELOC) allows customers to borrow
intermittently using their house as collateral. Hilo is creating several unique propositions to
differentiate itself from other companies in the space. The first is that it integrates the different
functions involved in executing a second mortgage, including a credit check of the borrower and an
appraisal of the value of the home, in one system. Second, its use of machine learning throughout these
human decision-making processes is coupled with a maniacal focus on robustness to distribution shift,
fairness, and adversarial robustness. Third, it has promised to be scrutable to anyone who would like to
examine the machine learning models it will use and to provide avenues for recourse if the machine’s
decisions are problematic in any respect. Imagine that you are on the data science team assembled by
Hilo and have been tasked with addressing the third proposition by making the machine learning
models interpretable and explainable. The platform’s launch date is only a few months away, so you had
better get cracking.
Interpretability of machine learning models aims to let people understand how the machine makes its
predictions. It is a challenge because many of the machine learning approaches in Chapter 7 have
complicated functional forms that are not easy for people to understand. Interpretability and
explainability are a form of interaction between the machine and a human, specifically communication
from the machine to the human, that allow the machine and human to collaborate in decision making.1
This topic and chapter lead off Part 5 of the book on interaction, which is the third attribute of
trustworthiness of machine learning. Remember that the organization of the book matches the
attributes of trustworthiness, shown in Figure 12.1.
1
Ben Green and Yiling Chen. “The Principles and Limits of Algorithm-in-the-Loop Decision Making.” In: Proceedings of the ACM
Conference on Computer-Supported Cooperative Work and Social Computing. Austin, Texas, USA, Nov. 2019, p. 50.
Figure 12.1. Organization of the book. The fifth part focuses on the third attribute of trustworthiness, intimacy
or interaction, which maps to machine learning models that can communicate with people and receive instruction
from people about their values. Accessible caption. A flow diagram from left to right with six boxes: part 1:
introduction and preliminaries; part 2: data; part 3: basic modeling; part 4: reliability; part 5: interac-
tion; part 6: purpose. Part 5 is highlighted. Parts 3–4 are labeled as attributes of safety. Parts 3–6 are
labeled as attributes of trustworthiness.
The typical output of a machine learning model is the predicted label 𝑌̂, but this label is not enough
to communicate how the machine makes its predictions. Something more, in the form of an explanation,
is also needed. The machine is the transmitter of information and the human is the receiver or consumer
of that information. As shown in Figure 12.2, the communication process has to overcome human
cognitive biases—the limitations that people have in receiving information—that threaten human-
machine collaboration. This is sometimes known as the last mile problem.2 The figure completes the
picture of biases and validities you’ve seen starting in Chapter 4. The final space is the perceived space,
which is the final understanding that the human consumer has of the predictions from Hilo’s machine
learning models.
You will not be able to create a single kind of explanation that appeals to all of the different potential
consumers of explanations for Hilo’s models. Even though the launch date is only a few months away,
don’t take the shortcut of assuming that any old explanation will do. The cognitive biases of different
people are different based on their persona, background, and purpose. As part of the problem
specification phase of the machine learning lifecycle, you’ll first have to consider all the different types
of explanations at your disposal before going into more depth on any of them during the modeling phase.
2
James Guszcza. “The Last-Mile Problem: How Data Science and Behavioral Science Can Work Together.” In: Deloitte Review 16
(2015), pp. 64–79.
Figure 12.2. A mental model of spaces, validities, and biases. The final space is the perceived space, which is what
the human understands from the machine’s output. Accessible caption. A sequence of six spaces, each rep-
resented as a cloud. The construct space leads to the observed space via the measurement process.
The observed space leads to the raw data space via the sampling process. The raw data space leads to
the prepared data space via the data preparation process. The prepared data space leads to the predic-
tion space via the modeling process. The prediction space leads to the perceived space via the commu-
nication process. The measurement process contains social bias, which threatens construct validity.
The sampling process contains representation bias and temporal bias, which threatens external valid-
ity. The data preparation process contains data preparation bias and data poisoning, which threaten
internal validity. The modeling process contains underfitting/overfitting and poor inductive bias,
which threaten generalization. The communication process contains cognitive bias, which threatens
human-machine collaboration.
“If we don’t know what is happening in the black box, we can’t fix its mistakes to
make a better model and a better world.”
Note that unlike the other three personas, the primary concern of the data scientist persona is not
building interaction and intimacy for trustworthiness. The four different personas and their goals are
summarized in Table 12.1.
Table 12.1. The four main personas of consumers of explanations and their goals.
▪ The third dichotomy is feature-based vs. sample-based: is the explanation given as a statement about
the features or is it given by pointing to other data points in their entirety. Feature-based
explanations require that the underlying features be meaningful and understandable by the
consumer. If they are not already meaningful, a pre-processing step known as disentangled
representation may be required. This pre-processing finds directions of variation in semi-
structured data that are not necessarily aligned to the given features but have some human
interpretation, and is expanded upon in Section 12.2.
Since there are three dichotomies, there are eight possible combinations of explanation types.
Certain types of explanations are more appropriate for certain personas to meet their goals. The fourth
persona, data scientists from your own team at Hilo, may need to use any and all of the types of
explanations to debug and improve the model.
▪ Local, exact, feature-based explanations help affected users such as HELOC applicants gain
recourse and understand precisely which feature values they have to change in order to pass the
credit check.
▪ Global and local approximate explanations help decision makers such as appraisers and credit
3
Vijay Arya, Rachel K. E. Bellamy, Pin-Yu Chen, Amit Dhurandhar, Michael Hind, Samuel C. Hoffman, Stephanie Houde, Q. Vera
Liao, Ronny Luss, Aleksandra Mojsilović, Sami Mourad, Pablo Pedemonte, Ramya Raghavendra, John Richards, Prasanna
Sattigeri, Karthikeyan Shanmugam, Moninder Singh, Kush R. Varshney, Dennis Wei, and Yunfeng Zhang. “One Explanation
Does Not Fit All: A Toolkit and Taxonomy of AI Explainability Techniques.” arXiv:1909.03012, 2019. Q. Vera Liao and Kush R.
Varshney. “Human-Centered Explainable AI (XAI): From Algorithms to User Experiences.” arXiv:2110.10790, 2021.
officers achieve their dual goals of roughly understanding how the overall model works to
develop trust in it (global) and having enough information about a machine-predicted property
value to combine with their own information to produce a final appraisal (local).
▪ Global and local, exact, sample-based explanations and global, exact, feature-based explanations
help regulators understand the behavior and predictions of the model as a safeguard. By being
exact, the explanations apply to all data points, including edge cases that might be washed out in
approximate explanations. Of these, the local, exact, sample-based and global, exact, feature-
based explanations that appeal to regulators come from directly interpretable models.
▪ Regulators and decision makers can both benefit from global, approximate, sample-based
explanations to gain understanding.
Table 12.2. The three dichotomies of explanations and their mapping to personas.
Another dichotomy that you might consider in the problem specification phase is whether you will
allow the explanation consumer to interactively probe the Hilo machine learning system to gain further
insight, or whether the system will simply produce static output explanations that the consumer cannot
further interact with. The interaction can be through natural language dialogue between the consumer
and the machine, or it could be by means of visualizations that the consumer adjusts and drills down
into.4 The variety of static explanations is already plenty for you to deal with without delving into
interaction, so you decide to proceed only with static methods.
4
Josua Krause, Adam Perer, and Kenney Ng. “Interacting with Predictions: Visual Inspection of Black-Box Machine Learning
Models.” In: Proceedings of the CHI Conference on Human Factors in Computing Systems. San Jose, California, USA, May 2016, pp.
5686–5697.
Mirroring the three points of intervention in the modeling pipeline seen in Part 4 of the book for
distributional robustness, fairness, and adversarial robustness, Figure 12.3 shows different actions for
interpretability and explainability. As mentioned earlier, disentangled representation is a pre-
processing step. Directly interpretable models arise from training decision functions in specific
constrained hypothesis classes (recall that the concept of hypothesis classes was introduced in Chapter
7). Finally, many methods of explanation are applied on top of already-trained, non-interpretable
models such as neural networks in a post hoc manner.
Figure 12.3. Pipeline view of explanation methods. Accessible caption. A block diagram with a training da-
taset as input to a disentangled representation block with a pre-processed dataset as output. The pre-
processed dataset is input to a directly interpretable model block with an initial model as output. The
initial model is input to a post hoc explanation block with a final model as output.
12.1.3 Conclusion
Now you have the big picture view of different explanation methods, how they help consumers meet
their goals, and how they fit into the machine learning pipeline steps. The appraisal and HELOC approval
systems you’re developing for Hilo require you to appeal to all of the different consumer types, and you
have the ability to intervene on all parts of the pipeline, so you should start putting together a
comprehensive toolkit of interpretable and explainable machine learning techniques.
etc.5 that are uncorrelated with each other and also provide information not captured in other input data.
(For example, even though the size and number of floors of the home could be estimated from images,
it will already be captured in other tabular data.) Such a representation is known as a disentangled
representation. The word disentangled is used because in such a representation, intervening on one
dimension does not cause other dimensions to also change. Recently developed methods can learn
disentangled representations directly from unlabeled data.6 Although usually not the direct objective of
disentangling, such representations tend to yield meaningful dimensions that people can provide
semantics to, such as the example of ‘foliage in the neighborhood’ mentioned above. Therefore,
disentangled representation is a way of pre-processing the training data features to make them more
human-interpretable. Modeling and explanation methods later in the pipeline take the new features as
input.
Sometimes, a disentangled representation of the features is still not enough to provide meaning to
consumers. Similarly, sometimes tabular data features are just not sufficient to provide
meaning to a consumer. In these cases, an alternative pre-processing step is to directly elicit meaningful
explanations from consumers, append them to the dataset as an expanded cardinality label set, and
train a model to predict both the original appraisal or creditworthiness as well as the explanation.7
5
Stephen Law, Brooks Paige, and Chris Russell. “Take a Look Around: Using Street View and Satellite Images to Estimate House
Prices.” In: ACM Transactions on Intelligent Systems and Technology 10.5 (Nov. 2019), p. 54.
6
Xinqi Zhu, Chang Xu, and Dacheng Tao. “Where and What? Examining Interpretable Disentangled Representations.” In: Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Jun. 2021, pp. 5857–5866.
7
Michael Hind, Dennis Wei, Murray Campbell, Noel C. F. Codella, Amit Dhurandhar, Aleksandra Mojsilović, Karthikeyan Nate-
san Ramamurthy, and Kush R. Varshney. “TED: Teaching AI to Explain its Decisions.” In: Proceedings of the AAAI/ACM Conference
on AI, Ethics, and Society. Honolulu, Hawaii, USA, Jan. 2019, pp. 123–129.
8
Kush R. Varshney and Homa Alemzadeh. “On the Safety of Machine Learning: Cyber-Physical Systems, Decision
Sciences, and Data Products.” In: Big Data 5.3 (Sep. 2017), pp. 246–255.
This is a very compact rule set in which regulators can easily see the features involved and their
thresholds. They can reason that the model is more lenient on external risk when the number of
satisfactory trades is higher. They can also reason that the model does not include any objectionable
features. (Once decision trees or Boolean rule sets become too large, they start becoming less
interpretable.)
One common refrain that you might hear is of a tradeoff between accuracy and interpretability. This
argument is false.11 Due to the Rashomon effect introduced in Chapter 9, many kinds of models,
including decision trees and rule sets, have almost equally high accuracy on many datasets. The domain
of competence for decision trees and rule sets is broad (recall that the domain of competence introduced
in Chapter 7 is the set of dataset characteristics on which a type of model performs well compared to
other models). While it is true that scalably training these models has traditionally been challenging due
to their discrete nature (discrete optimization is typically more difficult than continuous optimization),
the challenges have recently been overcome.12
When trained using advanced discrete optimization, decision trees and Boolean rule set classifiers show
competitive accuracies across many datasets.
9
It is important to note that interpretability is about consumers understanding how the model makes its predictions, but not
necessarily why. Consumers can supplement the how with the why based on their common-sense knowledge.
10
The example HELOC explanations throughout the chapter are based on the tutorial https://github.com/Trusted-
AI/AIX360/blob/master/examples/tutorials/HELOC.ipynb and demonstration http://aix360.mybluemix.net/data developed by
Vijay Arya, Amit Dhurandhar, Q. Vera Liao, Ronny Luss, Dennis Wei, and Yunfeng Zhang.
11
Cynthia Rudin. “Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models
Instead.” In: Nature Machine Intelligence 1.5 (May 2019), pp. 206–215.
12
Oktay Günlük, Jayant Kalagnanam, Minhan Li, Matt Menickelly, and Katya Scheinberg. “Optimal Generalized Decision Trees
via Integer Programming.” arXiv:1612.03225, 2019. Sanjeeb Dash, Oktay Günlük, and Dennis Wei. “Boolean Decision Rules via
Column Generation.” In: Advances in Neural Information Processing Systems 31 (Dec. 2018), pp. 4655–4665.
P(\hat{Y} = 1 \mid X = x) = \frac{1}{1 + e^{-\left(w^{(1)} x^{(1)} + \cdots + w^{(d)} x^{(d)}\right)}},
Equation 12.1
which you’ve seen before as the logistic activation function for neural networks in Chapter 7. It can be
rearranged to the following:
\log\left(\frac{P(\hat{Y} = 1 \mid X = x)}{1 - P(\hat{Y} = 1 \mid X = x)}\right) = w^{(1)} x^{(1)} + \cdots + w^{(d)} x^{(d)}.
Equation 12.2
The left side of Equation 12.2 is called the log-odds. When the log-odds is positive, 𝑌̂ = 1 is the more likely
prediction: creditworthy. When the log-odds is negative, 𝑌̂ = 0 is the more likely prediction: non-
creditworthy.
The way to understand the behavior of the classifier is by examining how the probability, the score,
or the log-odds change when you increase an individual feature attribute’s value by 1. Examining the
response to changes is a general strategy for explanation that recurs throughout the chapter. In the case
of linear logistic regression, an increase of feature value 𝑥 (𝑖) by 1 while leaving all other feature values
constant adds 𝑤 (𝑖) to the log-odds. The weight value has a clear effect on the score. The most important
features per unit change of feature values are those with the largest absolute values of the weights. To
more easily compare feature importance using the weights, you should first standardize each of the
features to zero mean and unit standard deviation. (Remember that standardization was first introduced
when evaluating the covariate balancing of causal models in Chapter 8.)
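As a small illustration (a sketch, not tied to any particular toolkit), the following fits a logistic regression on standardized features and ranks features by the absolute value of their weights, i.e. by the change in log-odds per one-standard-deviation increase.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def logodds_importance(X, y, feature_names):
    """Rank features by |weight| of a logistic regression fit on standardized inputs."""
    Xs = StandardScaler().fit_transform(X)
    weights = LogisticRegression().fit(Xs, y).coef_.ravel()
    order = np.argsort(-np.abs(weights))
    return [(feature_names[i], weights[i]) for i in order]
```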
training data. Usually, smooth spline functions are chosen. (A spline is a function made up of piecewise
polynomials strung together.)
Table 12.3. An example generalized linear rule model for HELOC credit checks.
The three plain features (‘external risk estimate’, ‘revolving balance divided by credit limit’, and
‘number of satisfactory trades’) were standardized before doing anything else, so you can compare the
weight values to see which features are important. The decision stump of ‘months since most recent
inquiry’ being greater than zero is the most important because it has the largest coefficient. The decision
stump of ‘external risk estimate’ being greater than 69 is the least important because it has the smallest
coefficient. This is the same kind of understanding that you would apply to a linear logistic regression
model.
The way to further understand this model is by remembering that the weight contributes to the log-
odds for every unit change of the feature. Taking the ‘external risk estimate’ feature as an example, the
GLRM tells you that:
▪ for every increase of External Risk Estimate by 1, increase the log-odds by 0.0266 (this number is
obtained by undoing the standardization on the weight 0.6542);
13
Dennis Wei, Sanjeeb Dash, Tian Gao, and Oktay Günlük. “Generalized Linear Rule Models.” In: Proceedings of the International
Conference on Machine Learning. Long Beach, California, USA, Jul. 2019, pp. 6687–6696.
14
This feature excludes inquiries made in the last 7 days to remove inquiries that are likely due to price comparison shopping.
The rule is fairly straightforward for consumers such as regulators to understand while being an
expressive model for generalization. As shown in Figure 12.4, you can plot the contributions of the
‘external risk estimate’ feature to the log-odds to visually see how the Hilo classifier depends on it. Plots
of 𝑓 (𝑖) (𝑥 (𝑖) ) for other GAMs look similar, but can be nonlinear in different ways.
Figure 12.4. Contribution of the ‘external risk estimate’ feature to the log-odds of the classifier. Accessible cap-
tion. A plot with contribution to log-odds on the vertical axis and external risk estimate on the horizon-
tal axis. The contribution to log-odds function increases linearly with three jump discontinuities.
You can also ‘undo’ the log-odds to get the actual probability (Equation 12.1 instead of Equation 12.2),
but it is not additive like the log-odds. Nevertheless, the shape of the probability curve is informative in
the same way, and is shown in Figure 12.5.
Figure 12.5. Contribution of the ‘external risk estimate’ feature to the probability of the classifier. Accessible
caption. A plot with probability on the vertical axis and external risk estimate on the horizontal axis.
The probability function increases linearly with three jump discontinuities.
GA2Ms, equivalently known as explainable boosting machines, are directly interpretable models that
work the same as GAMs, but with two-dimensional nonlinear interaction terms 𝑓 (𝑖,𝑖’) (𝑥 (𝑖) , 𝑥 (𝑖’) ).15 Visually
showing their contribution to the log-odds of the classifier requires two-dimensional plots. It is generally
difficult for people to understand interactions involving more than two dimensions and therefore
higher-order GA2Ms are not used in practice.16 However, if you allow higher-degree rules in GLRMs, you
end up with GA2Ms of AND-rules or OR-rules involving multiple interacting feature dimensions that, unlike general higher-order GA2Ms, are still directly interpretable because rules involving many
features can be understood by consumers.
15
Yin Lou, Rich Caruana, Johannes Gehrke, and Giles Hooker. “Accurate Intelligible Models with Pairwise Interactions.” In:
Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Chicago, Illinois, USA, Aug. 2013,
pp. 623–631.
16
A recently developed neural network architecture has full interactions between dimensions, but can still be decoded into the
effects of individual features using very special properties of continued fractions, based on which the architecture is designed. A
continued fraction is a representation of a number as the sum of its integer part and the reciprocal of another number; this
other number is represented as the sum of its integer part and the reciprocal of another number; and so on. Isha Puri, Amit
Dhurandhar, Tejaswini Pedapati, Karthikeyan Shanmugam, Dennis Wei, and Kush R. Varshney. “CoFrNets: Interpretable Neu-
ral Architecture Inspired by Continued Fractions.” In: Advances in Neural Information Processing Systems 34 (Dec. 2021).
17
Pang Wei Koh and Percy Liang. “Understanding Black-Box Predictions via Influence Functions.” In: Proceedings of the Interna-
tional Conference on Machine Learning. Sydney, Australia, Aug. 2017, pp. 1885–1894.
Hessian matrices across all the training data points is also computed and denoted by 𝐻. Then the influence of sample 𝑥𝑗 on 𝑥𝑡𝑒𝑠𝑡 is −∇𝐿(𝑦𝑡𝑒𝑠𝑡 , 𝑦̂(𝑥𝑡𝑒𝑠𝑡 ))ᵀ 𝐻⁻¹ ∇𝐿(𝑦𝑗 , 𝑦̂(𝑥𝑗 )).
The expression takes this form for the following reason. First, the −𝐻⁻¹ ∇𝐿(𝑦𝑗 , 𝑦̂(𝑥𝑗 )) part of
the expression is a step in the direction toward the minimum of the loss function at 𝑥𝑗 (for those who
have heard about it before, this is the Newton direction). Taking a step toward the minimum affects the
model parameters just like deleting 𝑥𝑗 from the training dataset, which is what the deletion diagnostics
method does explicitly. The expression involves both the slope and the local curvature because the
steepest direction indicated by the slope is bent towards the minimum of quadratic functions by the
Hessian. Second, the ∇𝐿(𝑦𝑡𝑒𝑠𝑡 , 𝑦̂(𝑥𝑡𝑒𝑠𝑡 )) part of the expression maps the overall influence of 𝑥𝑗 to the 𝑥𝑡𝑒𝑠𝑡
sample. Once you have all the influence values for a set of held-out test houses or applicants, you can
average, rank, and present them to the regulator to gain global understanding about the model.
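To make the computation tangible, here is a minimal sketch of these influence values for a logistic regression model, where the gradient and Hessian of the loss have simple closed forms; X_train, y_train, x_test, and y_test are hypothetical arrays, and a small ridge term is added before inverting the Hessian for numerical stability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# X_train, y_train, x_test, y_test are assumed to exist; labels are 0/1
model = LogisticRegression().fit(X_train, y_train)

def add_bias(x):
    return np.append(x, 1.0)

def grad_loss(x, y):
    # gradient of the logistic loss at one point w.r.t. the model parameters
    p = model.predict_proba(x.reshape(1, -1))[0, 1]
    return (p - y) * add_bias(x)

# Hessian of the average logistic loss over the training data
P = model.predict_proba(X_train)[:, 1]
Xb = np.hstack([X_train, np.ones((len(X_train), 1))])
H = (Xb * (P * (1 - P))[:, None]).T @ Xb / len(X_train)
H_inv = np.linalg.inv(H + 1e-6 * np.eye(H.shape[0]))    # small ridge for stability

g_test = grad_loss(x_test, y_test)
influences = np.array([-g_test @ H_inv @ grad_loss(xj, yj)
                       for xj, yj in zip(X_train, y_train)])
most_influential = np.argsort(-np.abs(influences))[:5]  # top training points
```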
the learning objective of the directly interpretable model from the standard risk minimization objective
to an objective of matching the uninterpretable model as closely as possible.18
The second sub-approach for approximation using directly interpretable models, known as SRatio,
computes training data weights based on the uninterpretable model and interpretable model. Then it
trains the directly interpretable model with the instance weights.19 You’ve seen reweighing of data
points repeatedly in the book: inverse probability weighting for causal inference, confusion matrix-
based weights to adapt to prior probability shift, importance weights to adapt to covariate shift, and
reweighing as a pre-processing bias mitigation algorithm. The general idea here is the same, and is
almost a reversal of importance weights for covariate shift.
Remember from Chapter 9 that in covariate shift settings, the training and deployment feature distributions are different, but the labels given the features are the same: 𝑝𝑋^(𝑡𝑟𝑎𝑖𝑛)(𝑥) ≠ 𝑝𝑋^(𝑑𝑒𝑝𝑙𝑜𝑦)(𝑥) and 𝑝𝑌∣𝑋^(𝑡𝑟𝑎𝑖𝑛)(𝑦 ∣ 𝑥) = 𝑝𝑌∣𝑋^(𝑑𝑒𝑝𝑙𝑜𝑦)(𝑦 ∣ 𝑥). The importance weights are then: 𝑤𝑗 = 𝑝𝑋^(𝑑𝑒𝑝𝑙𝑜𝑦)(𝑥𝑗)/𝑝𝑋^(𝑡𝑟𝑎𝑖𝑛)(𝑥𝑗). For
explanation, there is no separate training and deployment distribution; there is an uninterpretable and
an interpretable model. Also, since you’re explaining the prediction process, not the data generating
process, you care about the predicted label 𝑌̂ instead of the true label 𝑌. The feature distributions are the
same because you train the uninterpretable and interpretable models on the same training data houses
or applicants, but the predicted labels given the features are different since you’re using different
models: 𝑝𝑋^(𝑖𝑛𝑡𝑒𝑟𝑝)(𝑥) = 𝑝𝑋^(𝑢𝑛𝑖𝑛𝑡𝑒𝑟𝑝)(𝑥) and 𝑝𝑌̂∣𝑋^(𝑖𝑛𝑡𝑒𝑟𝑝)(𝑦̂ ∣ 𝑥) ≠ 𝑝𝑌̂∣𝑋^(𝑢𝑛𝑖𝑛𝑡𝑒𝑟𝑝)(𝑦̂ ∣ 𝑥).
So following the same pattern as adapting to covariate shift by computing the ratio of the
probabilities that are different, the weights are: 𝑤𝑗 = 𝑝𝑌̂∣𝑋^(𝑢𝑛𝑖𝑛𝑡𝑒𝑟𝑝)(𝑦̂ ∣ 𝑥)/𝑝𝑌̂∣𝑋^(𝑖𝑛𝑡𝑒𝑟𝑝)(𝑦̂ ∣ 𝑥). You want the
interpretable model to look like the uninterpretable model. In the weight expression, the numerator
comes from the classifier score of the trained uninterpretable model and the denominator comes from
the score of the directly interpretable model trained without weights.
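Here is a minimal sketch of one plausible reading of SRatio, assuming a hypothetical trained classifier black_box with a predict_proba method, a hypothetical training matrix X_train with binary 0/1 predicted labels, and a shallow decision tree as the directly interpretable model.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# black_box is an already-trained uninterpretable classifier with predict_proba
# and X_train is the training feature matrix (both assumed to exist)
y_hat = black_box.predict(X_train)                # predicted labels to be explained

# Step 1: train the directly interpretable model without weights
tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_hat)

# Step 2: SRatio weights = uninterpretable score / interpretable score, both
# evaluated at each training point's predicted label (assumed to be 0/1)
rows = np.arange(len(X_train))
p_uninterp = black_box.predict_proba(X_train)[rows, y_hat]
p_interp = tree.predict_proba(X_train)[rows, y_hat]
weights = p_uninterp / np.clip(p_interp, 1e-6, None)

# Step 3: retrain the directly interpretable model with the instance weights
tree_sratio = DecisionTreeClassifier(max_depth=3).fit(X_train, y_hat,
                                                      sample_weight=weights)
```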
12.4.2 LIME
Global feature-based explanation using model approximation has an extension to the local explanation
case known as local interpretable model-agnostic explanations (LIME). The idea is similar to the global
method from the previous subsection. First you train an uninterpretable model and then you
approximate it by fitting a simple interpretable model to it. The difference is that you do this
approximation around each deployment data point separately rather than trying to come up with one
overall approximate model.
To do so, you get the uninterpretable model’s prediction on the deployment data point you care
about, but you don’t stop there. You add a small amount of noise to the deployment data point’s features
several times to create a slew of perturbed input samples and classify each one. You then use this new
set of data points to train the directly interpretable model. The directly interpretable model is a local
approximation because it is based only on a single deployment data point and a set of other data points
created around it. The interpretable model can be a logistic regression or decision tree and is simply
shown to the decision maker, the Hilo appraiser or credit officer.
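Here is a minimal sketch of this perturb-and-refit idea, assuming a hypothetical trained classifier black_box and a deployment point x0 with roughly standardized features; the proximity weighting of the perturbed samples follows the original LIME method so that the simple model only has to match the black box locally.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# black_box is a trained uninterpretable classifier and x0 is the deployment
# point to explain (both assumed to exist)
rng = np.random.default_rng(0)

# create a slew of perturbed samples around x0 and label them with the black box
X_perturbed = x0 + 0.1 * rng.standard_normal((1000, len(x0)))
y_perturbed = black_box.predict(X_perturbed)

# weight samples by proximity to x0 so the fit stays local
proximity = np.exp(-np.linalg.norm(X_perturbed - x0, axis=1) ** 2)

local_model = LogisticRegression()
local_model.fit(X_perturbed, y_perturbed, sample_weight=proximity)
print("local feature attributions:", local_model.coef_[0])
```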
18
Sarah Tan, Rich Caruana, Giles Hooker, Paul Koch, and Albert Gordo. “Learning Global Additive Explanations for Neural Nets
Using Model Distillation.” arXiv:1801.08640, 2018.
19
Amit Dhurandhar, Karthikeyan Shanmugam, and Ronny Luss. “Enhancing Simple Models by Exploiting What They Already
Know.” In: Proceedings of the International Conference on Machine Learning. Jul. 2020, pp. 2525–2534.
Figure 12.6. Partial dependence plot of the ‘external risk estimate’ feature for some uninterpretable classifier
model. Accessible caption. A plot with partial dependence on the vertical axis and external risk esti-
mate on the horizontal axis. The partial dependence smoothly increases in a sigmoid-like shape.
The plot of an individual feature’s exact contribution to the probability in Figure 12.5 for GAMs looks
similar to a partial dependence plot in Figure 12.6 for an uninterpretable model, but is different for one
important reason. The contributions of the individual features exactly combine to recreate a GAM
because the different features are unlinked and do not interact with each other. In uninterpretable
models, there can be strong correlations and interactions among input feature dimensions exploited by
the model for generalization. By not visualizing the joint behaviors of multiple features in partial
dependence plots, an understanding of those correlations is lost. The set of all 𝑑 partial dependence
functions is not a complete representation of the classifier. Together, they are only an approximation to
the complete underlying behavior of the creditworthiness classifier.
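Here is a minimal sketch of the partial dependence computation itself, assuming a hypothetical trained classifier black_box with a predict_proba method and a hypothetical feature matrix X; scikit-learn's sklearn.inspection module provides a ready-made version of the same computation.

```python
import numpy as np

# black_box is a trained classifier with predict_proba and X is a matrix of
# houses or applicants (both assumed to exist)
def partial_dependence(black_box, X, i, grid):
    pd_values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, i] = v                       # set feature i to v for every row
        pd_values.append(black_box.predict_proba(X_mod)[:, 1].mean())
    return np.array(pd_values)

i = 0                                         # e.g. the 'external risk estimate' column
grid = np.linspace(X[:, i].min(), X[:, i].max(), 50)
pd_curve = partial_dependence(black_box, X, i, grid)
```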
12.4.4 SHAP
Just like LIME is a local version of global model approximation, a method known as SHAP is a local
version of partial dependence plots. The partial dependence plot shows the entire curve of partial
dependence across all feature values, whereas SHAP focuses on the precise point on the feature axis
corresponding to a particular applicant in the deployment data. The SHAP value is simply the difference
between the partial dependence value for that applicant and the average probability, shown in Figure
12.7.
Figure 12.7. Example showing the SHAP value as the difference between the partial dependence and average
probability for a given applicant’s ‘external risk estimate’ value. Accessible caption. A plot with partial de-
pendence on the vertical axis and external risk estimate on the horizontal axis. The partial depend-
ence smoothly increases in a sigmoid-like shape. A horizontal line passing through the partial depend-
ence function marks the average probability. The difference between the partial dependence and the
average probability is the SHAP value.
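Here is a minimal sketch of this simplified notion of a SHAP value (the gap between the partial dependence at the applicant's feature value and the average predicted probability), assuming hypothetical black_box, background data X, and applicant x_applicant; note that the shap library computes full Shapley values, which also account for interactions among features.

```python
import numpy as np

# black_box, background data X, and an applicant x_applicant are assumed to exist
i = 0                                          # e.g. the 'external risk estimate' column

X_mod = X.copy()
X_mod[:, i] = x_applicant[i]                   # fix feature i at the applicant's value
pd_at_applicant = black_box.predict_proba(X_mod)[:, 1].mean()

average_probability = black_box.predict_proba(X)[:, 1].mean()
shap_value = pd_at_applicant - average_probability
```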
20
François Chollet. “Xception: Deep Learning with Depthwise Separable Convolutions.” In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. Honolulu, Hawaii, USA, Jul. 2017, pp. 1251–1258.
21
Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. “Grad-
CAM: Visual Explanations from Deep Networks via Gradient-Based Localization.” In: Proceedings of the IEEE International Confer-
ence on Computer Vision. Venice, Italy, Oct. 2017, pp. 618–626. The implementation https://keras.io/examples/vision/grad_cam/
by François Chollet was used to create the figure.
Figure 12.8. Two examples of grad-CAM applied to an image classification model. The left column is the input
image, the middle column is the grad-CAM saliency map with white indicating higher attribution, and the right
column superimposes the attribution on top of the image. The top row image is classified by the model as a ‘mobile
home’ and the bottom row image is classified as a ‘palace.’ (Both classifications are incorrect.) The salient archi-
tectural elements are correctly highlighted by the explanation algorithm in both cases. Accessible caption. In
the first example, the highest attribution on a picture of a townhouse is on the windows, stairs, and
roof. In the second example, the highest attribution on a picture of a colonial-style house is on the front
portico.
12.4.6 Prototypes
Another kind of explanation useful for the decision maker persona, appraiser or credit officer, is through
local sample-based approximations of uninterpretable models. Remember that local directly
interpretable models, such as the k-nearest neighbor classifier, work by averaging the labels of nearby
HELOC applicant data points. The explanation is just the list of those other applicants and their labels.
However, it is not required that a sample-based explanation only focus on nearby applicants. In this
section, you will learn an approach for approximate local sample-based explanation that presents
prototypical applicants as its explanation.
Prototypes—data points in the middle of a cluster of other data points shown in Figure 12.9—are useful
ways for consumers to perform case-based reasoning to gain understanding of a classifier.22 This
reasoning is as follows. To understand the appraised value of a house, compare it to the most
prototypical other house in the neighborhood that is average in every respect: average age, average
square footage, average upkeep, etc. If the appraised value of the house in question is higher than the
22
Been Kim, Rajiv Khanna, and Oluwasanmi Koyejo. “Examples Are Not Enough, Learn to Criticize!” In: Advances in Neural Infor-
mation Processing Systems 29, (Dec. 2016), pp. 2288–2296. Karthik S. Gurumoorthy, Amit Dhurandhar, Guillermo Cecchi, and
Charu Aggarwal. “Efficient Data Representation by Selecting Prototypes with Importance Weights.” In: Proceedings of the IEEE
International Conference on Data Mining. Beijing, China, Nov. 2019, pp. 260–269.
prototype, you can see which features have better values and thus get a sense of how the classifier works,
and vice versa.
Figure 12.9. Example of a dataset with three prototype samples marked. Accessible caption. A plot with sev-
eral data points, some of which are clustered into three main clusters. Central datapoints within those
clusters are marked as prototypes.
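Here is a minimal sketch of prototype selection using k-means clustering as a simple stand-in for the prototype-selection algorithms cited above, assuming hypothetical arrays X_train and x_deploy.

```python
import numpy as np
from sklearn.cluster import KMeans

# X_train holds the training applicants and x_deploy is the point to explain
# (both assumed to exist); select three prototypes
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

# use the real training point closest to each cluster center as the prototype
prototype_idx = [int(np.argmin(np.linalg.norm(X_train - c, axis=1)))
                 for c in kmeans.cluster_centers_]
prototypes = X_train[prototype_idx]

# explanation: show the prototypes nearest to the deployment point, ideally
# after filtering to prototypes whose predicted label matches the point's
order = np.argsort(np.linalg.norm(prototypes - x_deploy, axis=1))
nearest_prototypes = prototypes[order]
```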
However, just showing the one nearest prototype is usually not enough. You’ll also want to show a
few other nearby prototypes so that the consumer can gain even more intuition. Importantly, listing
several nearby prototypes to explain an uninterpretable model and listing several nearby data points to
explain the k-nearest neighbor classifier is not the same. It is often the case that the several nearby house
data points are all similar to each other and do not provide any further intuition than any one of them
alone. With nearby prototype houses, each one is quite different from the others and therefore does
provide new understanding.
Let’s look at examples of applicants in deployment data whose creditworthiness was predicted by an
uninterpretable Hilo model along with three of their closest prototypes from the training data. As a first
example, examine an applicant predicted to be creditworthy by the model. The labels of the prototypes
must match that of the data point. The example creditworthy applicant’s prototype explanation is given
in Table 12.4.
The data point and the nearest prototype are quite similar to each other, but with the applicant
having a slightly lower ‘external risk estimate’ and slightly longer time since the oldest trade. It makes
sense that the applicant would be predicted to be creditworthy just like the first prototype, even with
those differences in ‘external risk estimate’ and ‘months since oldest trade open.’ The second nearest
prototype represents applicants who have been in the system longer but have executed fewer trades,
and have a lower ‘external risk estimate.’ The decision maker can understand from this that the model
is willing to predict applicants as creditworthy with lower ‘external risk estimate’ values if they
counteract that low value with longer time and fewer trades. The third nearest prototype represents
applicants who have been in the system even longer, executed even fewer trades, have never been
delinquent, and have a very high ‘external risk estimate’: the really solid applicants.
Table 12.4. An example prototype explanation for a HELOC applicant predicted to be creditworthy.
Table 12.5. An example prototype explanation for a HELOC applicant predicted to be non-creditworthy.
The contrastive explanations method (CEM) pulls out such local exact explanations from
uninterpretable models in a way that leads directly to avenues for recourse by applicants.23 CEM yields
two complementary explanations that go together: (1) pertinent negatives and (2) pertinent positives. The
terminology comes from medical diagnosis. A pertinent negative is something in the patient’s history
that helps a diagnosis because the patient denies that it is present. A pertinent positive is something that
is necessarily present in the patient. For example, a patient with abdominal discomfort, watery stool,
and without fever will be diagnosed with likely viral gastroenteritis rather than bacterial gastroenteritis.
The abdominal discomfort and watery stool are pertinent positives and the lack of fever is a pertinent
negative. A pertinent negative explanation is the minimum change needed in the features to change the
predicted label. Changing no fever to fever will change the diagnosis from viral to bacterial.
The mathematical formulation of CEM is almost the same as an adversarial example that you learned
about in Chapter 11: find the smallest sparse perturbation 𝛿 so that 𝑦̂(𝑥 + 𝛿) is different from 𝑦̂(𝑥). For
pertinent negatives, you want the perturbation to be sparse or concentrated in a few features to be
interpretable and understandable. This contrasts with adversarial examples whose perturbations
should be diffuse and spread across a lot of features to be imperceptible. A pertinent positive
explanation is also a sparse perturbation that is removed from 𝑥 and maintains the predicted label.
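Here is a minimal sketch of the pertinent-negative idea using a greedy closed-box search over candidate feature values, assuming a hypothetical binary classifier black_box, a deployment point x, and per-feature candidate grids; the actual CEM method solves a regularized optimization using (estimated) gradients rather than this simple search.

```python
import numpy as np

# black_box (with predict and predict_proba) and a deployment point x are assumed
# to exist; feature_grids[i] holds candidate values for feature i (e.g. quantiles)
def pertinent_negative(black_box, x, feature_grids, max_changes=3):
    original_label = int(black_box.predict(x.reshape(1, -1))[0])
    other = 1 - original_label                    # assumes binary 0/1 labels
    x_pn = x.copy()
    for _ in range(max_changes):                  # keep the perturbation sparse
        best_score, best_candidate = -np.inf, None
        for i, grid in enumerate(feature_grids):
            for v in grid:
                candidate = x_pn.copy()
                candidate[i] = v
                score = black_box.predict_proba(candidate.reshape(1, -1))[0, other]
                if score > best_score:
                    best_score, best_candidate = score, candidate
        x_pn = best_candidate                     # greedily take the most helpful change
        if int(black_box.predict(x_pn.reshape(1, -1))[0]) == other:
            return x_pn - x                       # the sparse perturbation delta
    return None                                   # no sparse label flip found
```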
Contrastive explanations are computed in a post hoc manner after an uninterpretable model has already
been trained. Just like for adversarial examples, there are two cases for the computation: open-box when
23
Amit Dhurandhar, Pin-Yu Chen, Ronny Luss, Chun-Chen Tu, Paishun Ting, Karthikeyan Shanmugam, and Payel Das. “Explanations Based on the Missing: Towards Contrastive Explanations with Pertinent Negatives.” In: Advances in Neural Information Processing Systems 32 (Dec. 2018), pp. 590–601.
the gradients of the model are made available and closed-box when the gradients are not made available
and must be estimated.
Table 12.6. An example contrastive explanation for a HELOC applicant predicted to be creditworthy.
Table 12.7. An example contrastive explanation for a HELOC applicant predicted to be non-creditworthy.
Examples of contrastive explanations for the same two applicants presented in the prototype section
are given in Table 12.6 (creditworthy; pertinent positive) and Table 12.7 (non-creditworthy; pertinent
negative). To remain creditworthy, the pertinent positive states that this HELOC applicant must
maintain the values of ‘external risk estimate,’ ‘number of satisfactory trades,’ and ‘percent trades never
delinquent.’ The ‘average months in file’ is allowed to drop to 91, which is a similar behavior seen in the
first prototype of the prototype explanation. For the non-creditworthy applicant, the pertinent negative
perturbation is sparse as desired, with only three variables changed. This minimal change to the
applicant’s features tells them that if they improve their ‘external risk estimate’ by 16 points, wait 14
months to increase their ‘average months in file’, and increase their ‘number of satisfactory trades’ by
5, the model will predict them to be creditworthy. The recourse for the applicant is clear.
What are these quantitative proxy metrics for interpretability? Some measure simplicity, like the
number of operations needed to make a prediction using a model. Others compare an explanation
method’s ordering of feature attributions to some ground-truth ordering. (These explainability metrics
only apply to feature-based explanations.) An explainability metric known as faithfulness is based on this
idea of comparing feature orderings.25 Instead of requiring a true ordering, however, it measures the
correlation between a given method’s feature ordering and the ordering in which the accuracy of a model drops
the most when the corresponding feature is deleted. A correlation value of 1 is the best faithfulness.
Unfortunately, when faithfulness is applied to saliency map explanations, it is unreliable.26 You should
24
Finale Doshi-Velez and Been Kim. “Towards a Rigorous Science of Interpretable Machine Learning.” arXiv:1702.08608,
2017.
25
David Alvarez-Melis and Tommi S. Jaakkola. “Towards Robust Interpretability with Self-Explaining Neural Networks.” In:
Advances in Neural Information Processing Systems 32 (Dec. 2018), pp. 7786–7795.
26
Richard Tomsett, Dan Harborne, Supriyo Chakraborty, Prudhvi Gurram, and Alun Preece. “Sanity Checks for Saliency Met-
rics.” In: Proceedings of the AAAI Conference on Artificial Intelligence. New York, New York, USA, Feb. 2020, pp. 6021–6029.
be wary of functionally-grounded evaluation and always try to do at least a little human-grounded and
application-grounded evaluation before the Hilo platform goes live.
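Here is a minimal sketch of one common variant of faithfulness, which correlates a method's feature attributions with the drop in the predicted score when each feature is replaced by a baseline value; black_box, x, attributions, and baseline are hypothetical, and binary 0/1 labels are assumed.

```python
import numpy as np
from scipy.stats import pearsonr

# black_box, a data point x, its feature attributions (e.g. from LIME), and a
# per-feature baseline (e.g. the training means) are assumed to exist
def faithfulness(black_box, x, attributions, baseline):
    label = int(black_box.predict(x.reshape(1, -1))[0])      # assumes 0/1 labels
    base_score = black_box.predict_proba(x.reshape(1, -1))[0, label]
    drops = []
    for i in range(len(x)):
        x_removed = x.copy()
        x_removed[i] = baseline[i]                # 'delete' feature i
        new_score = black_box.predict_proba(x_removed.reshape(1, -1))[0, label]
        drops.append(base_score - new_score)
    # a faithful method gives its largest attributions to the features whose
    # removal hurts the predicted score the most; 1 is perfect correlation
    return pearsonr(attributions, drops)[0]
```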
You’ve put together an explainability toolkit for both Hilo house appraisal and credit checking
models and implemented the appropriate methods for the right touchpoints of the applicants,
appraisers, credit officers, and regulators. You haven’t taken shortcuts and have gone through a few
rounds of functionally-grounded evaluations. Your contributions to the Hilo platform help make it a
smashing success when it is launched.
12.7 Summary
▪ Interpretability and explainability are needed to overcome cognitive biases in the last-mile
communication problem between the machine learning model and the human consumer.
▪ There is no one best explanation method. Different consumers have different personas with
different needs to achieve their goals. The important personas are the affected user, the decision
maker, and the regulator.
▪ Human interpretability of machine learning requires features that people can understand to
some extent. If the features are not understandable, disentangled representation can help.
▪ Explanation methods can be divided into eight categories by three dichotomies. Each category
tends to be most appropriate for one consumer persona. The first dichotomy is whether the
explanation is for the entire model or a specific input data point (global/local). The second
dichotomy is whether the explanation is an exact representation of the underlying model or it
contains some approximation (exact/approximate). The third dichotomy is whether the language
used in creating the explanation is based on the features or on entire data points (feature-
based/sample-based).
▪ Ideally, you want to quantify how good an explanation method is by showing explanations to
consumers in the context of the actual task and eliciting their feedback. Since this is expensive
and difficult, proxy quantitative metrics have been developed, but they are far from perfect.
13
Transparency
Imagine that you are a model validator in the model risk management department at JCN Corporation,
the (fictional) information technology company undergoing an enterprise transformation first
encountered in Chapter 7. In addition to using machine learning for estimating the skills of its
employees, JCN Corporation is rolling out machine learning in another human resources effort:
proactive retention. Using historical employee administrative data, JCN Corporation is developing a
system to predict employees at risk of voluntarily resigning in the next six months and offering
incentives to retain them. The data includes internal corporate information on job roles and
responsibilities, compensation, market demand for jobs, performance reviews, promotions, and
management chains. JCN Corporation has consent to use the employee administrative data for this
purpose through employment contracts. The data was made available to JCN Corporation’s data science
team under institutional control after a syntactic anonymity transformation was performed.
The team has developed several attrition prediction models using different machine learning
algorithms, keeping accuracy, fairness, distributional robustness, adversarial robustness, and
explainability as multiple goals. If the attrition prediction models are fair, the proactive retention system
could make employment at JCN Corporation more equitable than it is right now. The project has moved
beyond the problem specification, data understanding, data preparation, and modeling phases of the
development lifecycle and is now in the evaluation phase.
“The full cycle of a machine learning project is not just modeling. It is finding the
right data, deploying it, monitoring it, feeding data back [into the model], showing
safety—doing all the things that need to be done [for a model] to be deployed. [That
goes] beyond doing well on the test set, which fortunately or unfortunately is what we
in machine learning are great at.”
Your job as the model validator is to test out and compare the models to ensure at least one of them
is safe and trustworthy before it is deployed. You also need to obtain buy-in from various parties before
you can sign your name and approve the model’s deployment. To win the support of internal JCN
Corporation executives and compliance officers, external regulators,1 and members of a panel of diverse
employees and managers within the company you’ll assemble, you need to provide transparency by
communicating not only the results of independent tests you conduct, but also what happened in the
earlier phases of the lifecycle. (Transparent reporting to the general public is also something you should
consider once the model is deployed.) Such transparency goes beyond model interpretability and
explainability because it is focused on model performance metrics and their uncertainty
characterizations, various pieces of information about the training data, and the suggested uses and
possible misuses of the model.2 All of these pieces of information are known as facts.
Not all of the various consumers of your transparent reporting are looking for the same facts or the
same level of detail. Modeling tasks besides predicting voluntary attrition may require different facts.
Transparency has no one-size-fits-all solution. Therefore, you should first run a small design exercise
to understand which facts and details are relevant for the proactive retention use case and for each
consumer, and the presentation style preferred by each consumer.3 (Such an exercise is related to value
alignment, which is elaborated upon in Chapter 14.) The artifact that ultimately presents a collection of
facts to a consumer is known as a factsheet. After the design exercise, you can be off to the races with
creating, collecting, and communicating information about the lifecycle.
You are shouldering a lot of responsibility, so you don’t want to perform your job in a haphazard way
or take any shortcuts. To enable you to properly evaluate and validate the JCN Corporation voluntary
resignation models and communicate your findings to various consumers, this chapter teaches you to:
▪ conduct tests that measure the probability of expected harms and the possibility of unexpected
harms to generate quantitative facts,
▪ communicate these test result facts and their uncertainty, and
▪ defend your efforts against people who are not inclined to trust you.
You’re up to the task, padawan, so let’s start equipping you with the tools you need.
13.1 Factsheets
Transparency should reveal several kinds of facts that come from different parts of the lifecycle.4 From
the problem specification phase, it is important to capture the goals, intended uses, and possible
1
Regulations play a role in the company’s employee retention programs because they are subject to fair employment laws.
2
Q. Vera Liao and Kush R. Varshney. “Human-Centered Explainable AI (XAI): From Algorithms to User Experiences.”
arXiv:2110.10790, 2021.
3
John Richards, David Piorkowski, Michael Hind, Stephanie Houde, and Aleksandra Mojsilović. “A Methodology for Creating AI
FactSheets.” arXiv:2006.13796, 2020.
4
Matthew Arnold, Rachel K. E. Bellamy, Michael Hind, Stephanie Houde, Sameep Mehta, Aleksandra Mojsilović, Ravi Nair,
Karthikeyan Natesan Ramamurthy, Alexandra Olteanu, David Piorkowski, Darrell Reimer, John Richards, Jason Tsay, and
misuses of the system along with who was involved in making those decisions (e.g. were diverse voices
included?). From the data understanding phase, it is important to capture the provenance of the data,
including why it was originally collected. From the data preparation phase, it is important to catalog the
data transformations and feature engineering steps employed by the data engineers and data scientists,
as well as any data quality analyses that were performed. From the modeling phase, it is important to
understand what algorithmic choices were made and why, including which mitigations were employed.
From the evaluation phase, it is important to test for trust-related metrics and their uncertainties
(details are forthcoming in the next section). Overall, there are two types of facts for you to transparently
report: (1) (qualitative) knowledge from inside a person’s head that must be explicitly asked about, and
(2) data, processing steps, test results, models, or other artifacts that can be grabbed digitally.
How do you get access to all this information coming from all parts of the machine learning
development lifecycle and from different personas? Wouldn’t it be convenient if it were documented
and transparently reported all along? Because of the tireless efforts of your predecessors in the model
risk management department, JCN Corporation has instrumented the entire lifecycle with a mandatory
tool that manages machine learning development by creating checklists and pop-up reminders for
different personas to enter qualitative facts at the time they should be top-of-mind for them. The tool
also automatically collects and version-controls digital artifacts as facts as soon as they are generated.
Let’s refer to the tool as fact flow, which is shown in Figure 13.1.
Figure 13.1. The fact flow captures qualitative and quantitative facts generated by different people and processes
throughout the machine learning development lifecycle and renders them into factsheets appropriate for different
consumers. Accessible caption. Facts from people and technical steps in the development lifecycle go
into a renderer which may output a detailed factsheet, a label factsheet, or a SDoC factsheet.
Kush R. Varshney. “FactSheets: Increasing Trust in AI Services through Supplier’s Declarations of Conformity.” In: IBM Journal
of Research and Development 63.4/5 (Jul./Sep. 2019), p. 6.
Since machine learning is a general purpose technology (recall the discussion in Chapter 1), there is
no universal set of facts that applies to all machine learning models irrespective of their use and
application domain. The facts to validate the machine learning systems for m-Udhār Solar,
Unconditionally, ThriveGuild, and Wavetel (fictional companies discussed in previous chapters) are not
exactly the same; more precision is required.5 Moreover, the set of facts that make it to a factsheet and
their presentation depends on the consumer. As the model validator, you need a full dump of all the
facts. You should adjust the factsheet to a summary label, document, or presentation slides for personas
who, to overcome their cognitive biases, need fewer details. You should broadly disseminate simpler
factsheets among JCN Corporation managers (decision makers), employees (affected users), and the
general public who yearn for transparency. You will have determined the set of facts, their level of detail,
and their presentation style for different personas through your initial design exercise. Fact flow has a
renderer for you to create different factsheet presentations.
You should also sign and release a factsheet rendered as a supplier’s declaration of conformity (SDoC)
for external regulators. An SDoC is a written assurance that a product or service conforms to a standard
or technical regulation. Your declaration is based on your confidence in the fact flow tool and the
inspection of the results you have conducted.6 Conformity is one of several related concepts (compliance,
impact, and accountability), but different from each of them.7 Conformity is abiding by specific
regulations whereas compliance is abiding by broad regulatory frameworks. Conformity is a statement
on abiding by regulations at the current time whereas impact is abiding by regulations into an uncertain
future. Conformity is a procedure by which to show abidance whereas accountability is a responsibility
to do so. As such, conformity is the narrowest of definitions and is the one that forms the basis for the
draft regulation of high-risk machine learning systems in the European Economic Area and may become
a standard elsewhere too. Thus SDoCs represent an up-and-coming requirement for machine learning
systems used in high-stakes decision making, including proactive retention at JCN Corporation.
5
Ryan Hagemann and Jean-Marc Leclerc. “Precision Regulation for Artificial Intelligence.” In: IBM Policy Lab Blog (Jan. 2020).
URL: https://www.ibm.com/blogs/policy/ai-precision-regulation.
6
National Institute of Standards and Technology. “The Use of Supplier’s Declaration of Conformity.” URL:
https://www.nist.gov/system/files/documents/standardsgov/Sdoc.pdf.
7
Nikolaos Ioannidis and Olga Gkotsopoulou. “The Palimpsest of Conformity Assessment in the Proposed Artificial Intelligence
Act: A Critical Exploration of Related Terminology.” In: European Law Blog (Jul. 2021). URL: https://europeanlaw-
blog.eu/2021/07/02/the-palimpsest-of-conformity-assessment-in-the-proposed-artificial-intelligence-act-a-critical-explora-
tion-of-related-terminology.
data scientists completely isolated their held-out data set and didn’t incur any leakage into modeling.8
As the model validator, you can ensure such isolation in your testing.
Importantly, testing machine learning systems is different from testing other kinds of software
systems.9 Since the whole point of machine learning systems is to generalize from training data to label
new unseen input data points, they suffer from the oracle problem: not knowing what the correct answer
is supposed to be for a given input.10 The way around this problem is not by looking at a single
employee’s input data point and examining its corresponding output attrition prediction, but by looking
at two or more variations that should yield the same output. This approach is known as using
metamorphic relations.
For example, a common test for counterfactual fairness (described in Chapter 10) is to input two data
points that are the same in every way except having different values of a protected attribute. If the
predicted label is not the same for both of them, the test for counterfactual fairness fails. The important
point is that the actual predicted label value (will voluntarily resign/won’t voluntarily resign) is not the
key to the test, but whether that predicted value is equal for both inputs. As a second example for
competence, if you multiply a feature’s value by a constant in all training points, train the model, and
then score a test point that has been scaled by the same constant, you should get the same prediction of
voluntary resignation as if you had not done any scaling at all. In some other application involving semi-
structured data, a metamorphic relation for an audio clip may be to speed it up or slow it down while
keeping the pitch the same. Coming up with such metamorphic relations requires ingenuity; automating
this process is an open research question.
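Here is a minimal sketch of both metamorphic relations, assuming a hypothetical trained model, held-out data with a binary protected attribute in a known column, and a hypothetical train_fn that fits and returns a model.

```python
import numpy as np

# model is a trained attrition classifier, X_test is held-out employee data, and
# PROT is the column index of a binary protected attribute (all assumed to exist)
PROT = 3

def counterfactual_fairness_test(model, X_test):
    X_flipped = X_test.copy()
    X_flipped[:, PROT] = 1 - X_flipped[:, PROT]   # flip the protected attribute
    passed = model.predict(X_test) == model.predict(X_flipped)
    return passed.mean()                          # fraction of metamorphic tests passed

def scaling_invariance_test(train_fn, X_train, y_train, X_test, i=0, c=2.0):
    # metamorphic relation: scaling feature i by c in both training and test data
    # should leave the predictions unchanged
    base = train_fn(X_train, y_train).predict(X_test)
    Xs_train, Xs_test = X_train.copy(), X_test.copy()
    Xs_train[:, i] *= c
    Xs_test[:, i] *= c
    scaled = train_fn(Xs_train, y_train).predict(Xs_test)
    return np.mean(base == scaled)
```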
In addition to the oracle problem of machine learning, there are three factors you need to think about
that go beyond the typical testing done by JCN Corporation data scientists while generating facts:
1. testing for dimensions beyond accuracy, such as fairness, robustness, and explainability,
2. pushing the system to its limits so that you are not only testing average cases, but also covering
edge cases, and
8
Sebastian Schelter, Yuxuan He, Jatin Khilnani, and Julia Stoyanovich. “FairPrep: Promoting Data to a First-Class Citizen in
Studies of Fairness-Enhancing Interventions.” In: Proceedings of the International Conference on Extending Database Technology.
Copenhagen, Denmark, Mar.–Apr. 2020, pp. 395–398.
9
P. Santhanam. “Quality Management of Machine Learning Systems.” In: Proceedings of the AAAI Workshop on Engineering Depend-
able and Secure Machine Learning Systems. New York, New York, USA, Feb. 2020.
10
Jie M. Zhang, Mark Harman, Lei Ma, and Yang Liu. “Machine Learning Testing: Survey, Landscapes and Horizons.” In: IEEE
Transactions on Software Engineering 48.1 (Jan. 2022), pp. 1–36.
using metrics such as faithfulness (described in Chapter 12).11 You also need to test for accuracy under
distribution shifts (described in Chapter 9). Since the JCN Corporation data science team has created
multiple attrition prediction models, you can compare the different options. Once you have computed
the metrics, you can display them in the factsheet as a table such as Table 13.1 or in visual ways to be
detailed in Section 13.3 to better understand their domains of competence across dimensions of
trustworthiness. (Remember that domains of competence for accuracy were a main topic of Chapter 7.)
Table 13.1. Result of testing several attrition models for multiple trust-related metrics.
In these results, the decision forest with boosting has the best accuracy and robustness to
distribution shift, but the poorest adversarial robustness, and poor fairness and explainability. In
contrast, the logistic regression model has the best adversarial robustness and explainability, while
having poorer accuracy and distributional robustness. None of the models have particularly good
fairness (disparate impact ratio), and so the data scientists should go back and do further bias
mitigation. The example emphasizes how looking only at accuracy leaves you with blind spots in the
evaluation phase. As the model validator, you really do need to test for all the different metrics.
11
Moninder Singh, Gevorg Ghalachyan, Kush R. Varshney, and Reginald E. Bryant. “An Empirical Study of Accuracy, Fairness,
Explainability, Distributional Robustness, and Adversarial Robustness.” In: Proceedings of the KDD Workshop on Measures and Best
Practices for Responsible AI. Aug. 2021.
12
Aniya Aggarwal, Samiulla Shaikh, Sandeep Hans, Swastik Haldar, Rema Ananthanarayanan, and Diptikalyan Saha. “Testing
Framework for Black-Box AI Models.” In: Proceedings of the IEEE/ACM International Conference on Software Engineering. May 2021,
pp. 81–84.
the probability distribution of the training and held-out datasets. And in fact, you can think about
crafting adversarial examples for fairness and explainability as well as for accuracy.13 Another way to
find edge cases in machine learning systems is by using a crowd of human testers who are challenged
to ‘beat the machine.’14 They get points in a game for coming up with rare but catastrophic data points.
Importantly, the philosophy of model validators such as yourself who are testing the proactive
retention system is different from the philosophy of malicious actors and ‘machine beaters.’ These
adversaries need to succeed just once to score points, whereas model validators need to efficiently
generate test cases that have good coverage and push the system from many different sides. You and
other model validators have to be obsessed with failure; if you’re not finding flaws, you have to think that
you’re not trying hard enough.15 Toward this end, coverage metrics have been developed for neural
networks that measure if every neuron in the model has been tested. However, such coverage metrics
can be misleading and do not apply to other kinds of machine learning models.16 Developing good
coverage metrics and test case generation algorithms to satisfy those coverage metrics remains an open
research area.
“I can live with doubt and uncertainty and not knowing. I think it’s much more
interesting to live not knowing than to have answers which might be wrong.”
The total predictive uncertainty includes both aleatoric and epistemic uncertainty. It is indicated by
the score for well-calibrated classifiers (remember the definition of calibration, Brier score, and
calibration loss17 from Chapter 6). When the attrition prediction classifier is well-calibrated, the score is
13
Botty Dimanov, Umang Bhatt, Mateja Jamnik, and Adrian Weller. “You Shouldn’t Trust Me: Learning Models Which Conceal
Unfairness from Multiple Explanation Methods.” In: Proceedings of the European Conference on Artificial Intelligence. Santiago de
Compostela, Spain, Aug.–Sep. 2020. Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. “Fooling
LIME and SHAP: Adversarial Attacks on Post Hoc Explanation Methods.” In: Proceedings of the AAAI/ACM Conference on AI, Ethics,
and Society. New York, New York, USA, Feb. 2020, pp. 180–186.
14
Joshua Attenberg, Panos Ipeirotis, and Foster Provost. “Beat the Machine: Challenging Humans to Find a Predictive Model’s
‘Unknown Unknowns.’” In: Journal of Data and Information Quality 6.1 (Mar. 2015), p. 1.
15
Thomas G. Dietterich. “Robust Artificial Intelligence and Robust Human Organizations.” In: Frontiers of Computer Science 13.1
(2019), pp. 1–3.
16
Dusica Marijan and Arnaud Gotlieb. “Software Testing for Machine Learning.” In: Proceedings of the AAAI Conference on Artificial
Intelligence. New York, New York, USA, Feb. 2020, pp. 13576–13582.
17
A popular variation of the calibration loss detailed in Chapter 6, known as the expected calibration error, uses the average abso-
lute difference rather than the average squared difference.
also the probability that the employee will voluntarily resign (the label being 1); scores close to 0 and 1 are certain
predictions and scores close to 0.5 are uncertain predictions. Nearly all of the classifiers that we’ve
talked about in the book give continuous-valued scores as output, but many of them, such as the naïve
Bayes classifier and modern deep neural networks, tend not to be well-calibrated.18 They have large
values of calibration loss because their calibration curves are not straight diagonal lines like they ideally
should be (remember the picture of a calibration curve dropping below and pushing above the ideal
diagonal line in Figure 6.4).
Figure 13.2. Different methods for quantifying the uncertainty of classifiers. Accessible caption. Hierarchy
diagram with uncertainty quantification as the root. Uncertainty quantification has children total pre-
dictive uncertainty, and separate aleatoric and epistemic uncertainty. Total predictive uncertainty has
children mitigate miscalibration and estimate uncertainty. Mitigate miscalibration has children Platt
scaling and isotonic regression. Estimate uncertainty has children jackknife and infinitesimal jack-
knife. Separate aleatoric and epistemic uncertainty has children Bayesian methods and ensemble
methods.
Just like in other pillars of trustworthiness, algorithms for obtaining uncertainty estimates and
mitigating poor calibration apply at different stages of the machine learning pipeline. Unlike other topic
areas, there is no pre-processing for uncertainty quantification. There are, however, methods that apply
during model training and in post-processing. Two post-processing methods for mitigating poor
calibration, Platt scaling and isotonic regression, both take the classifier’s existing calibration curve and
straighten it out. Platt scaling assumes that the existing calibration curve looks like a sigmoid or logistic
18
Alexandru Niculescu-Mizil and Rich Caruana. “Predicting Good Probabilities with Supervised Learning.” In: Proceedings of the
International Conference on Machine Learning. Bonn, Germany, Aug. 2005, pp. 625–632. Chuan Guo, Geoff Pleiss, Yu Sun, and
Kilian Q. Weinberger. “On Calibration of Modern Neural Networks.” In: Proceedings of the International Conference on Machine
Learning. Sydney, Australia, Aug. 2017, pp. 1321–1330.
activation function whereas isotonic regression can work with any shape of the existing calibration
curve. Isotonic regression requires more data than Platt scaling to work effectively.
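Here is a minimal sketch of both mitigations using scikit-learn's CalibratedClassifierCV, assuming hypothetical training and validation splits; the 'sigmoid' method is Platt scaling and the 'isotonic' method is isotonic regression.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train, X_val, y_val are assumed to already exist
uncalibrated = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Platt scaling: fit a sigmoid correction to the existing calibration curve
platt = CalibratedClassifierCV(RandomForestClassifier(random_state=0),
                               method='sigmoid', cv=5).fit(X_train, y_train)

# Isotonic regression: fit a monotone, free-form correction (needs more data)
iso = CalibratedClassifierCV(RandomForestClassifier(random_state=0),
                             method='isotonic', cv=5).fit(X_train, y_train)

# compare rough calibration errors on held-out data; well-calibrated scores
# produce a calibration curve that hugs the diagonal
for name, clf in [('uncalibrated', uncalibrated), ('Platt', platt), ('isotonic', iso)]:
    frac_pos, mean_pred = calibration_curve(y_val, clf.predict_proba(X_val)[:, 1],
                                            n_bins=10)
    print(name, np.abs(frac_pos - mean_pred).mean())
```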
A post-processing method for total predictive uncertainty quantification that does not require you
to start with an existing classifier score works in almost the same way as computing deletion diagnostics
described in Chapter 12 for explanation. You train many attrition models, leaving one training data point
out each time. You compute the standard deviation of the accuracies of these leave-one-out models and report
this number as an indication of predictive uncertainty. In the uncertainty quantification context, this is
known as a jackknife estimate. You can do the same thing for other metrics of trustworthiness as well,
yielding an extended table of results that goes beyond Table 13.1 to also contain uncertainty
quantification, shown in Table 13.2. Such a table should be displayed in a factsheet.
Table 13.2. Result of testing several attrition models for multiple trust-related metrics with uncertainty quantified
using standard deviation below the metric values.
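Here is a minimal sketch of the jackknife estimate for the accuracy fact described above, assuming hypothetical training and test splits; the same loop can wrap any other trust metric.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# X_train, y_train, X_test, y_test are assumed to already exist
accuracies = []
for j in range(len(X_train)):
    mask = np.ones(len(X_train), dtype=bool)
    mask[j] = False                              # leave training point j out
    model_j = LogisticRegression().fit(X_train[mask], y_train[mask])
    accuracies.append(accuracy_score(y_test, model_j.predict(X_test)))

# report the standard deviation as the jackknife uncertainty of the accuracy fact
print("accuracy:", np.mean(accuracies), "+/-", np.std(accuracies))
```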
Chapter 12 noted that deletion diagnostics are costly to compute directly, which motivated influence
functions as an approximation for explanation. The same kind of approximation involving gradients and
Hessians, known as an infinitesimal jackknife, can be done for uncertainty quantification.19 Influence
functions and infinitesimal jackknives may also be derived for some fairness, explainability, and
robustness metrics.20
Using a calibrated score or (infinitesimal) jackknife-based standard deviation as the quantification
of uncertainty does not allow you to decompose the total predictive uncertainty into aleatoric and
epistemic uncertainty, which can be important as you decide to approve the JCN Corporation proactive
retention system. There are, however, algorithms applied during model training that let you estimate
the aleatoric and epistemic uncertainties separately. These methods are like directly interpretable
models (Chapter 12) and bias mitigation in-processing (Chapter 10) in terms of their place in the
19
Ryan Giordano, Will Stephenson, Runjing Liu, Michael I. Jordan, and Tamara Broderick. “A Swiss Army Infinitesimal Jack-
knife.” In: Proceedings of the International Conference on Artificial Intelligence and Statistics. Naha, Okinawa, Japan, Apr. 2019, pp.
1139–1147.
20
Hao Wang, Berk Ustun, and Flavio P. Calmon. “Repairing without Retraining: Avoiding Disparate Impact with Counterfactual
Distributions.” In: Proceedings of the International Conference on Machine Learning. Long Beach, California, USA, Jul. 2019, pp.
6618–6627. Brianna Richardson and Kush R. Varshney. “Addressing the Design Needs of Implementing Fairness in AI via In-
fluence Functions.” In: INFORMS Annual Meeting. Anaheim, California, USA, Oct. 2021.
pipeline. The basic idea to extract the two uncertainties is as follows.21 The total uncertainty of a
prediction, i.e. the predicted label 𝑌̂ given the features 𝑋, is measured using the entropy 𝐻( 𝑌̂ ∣ 𝑋 )
(remember entropy from Chapter 3). This prediction uncertainty includes both epistemic and aleatoric
uncertainty; it is general and does not fix the choice of the actual classifier function 𝑦̂ ∗ (⋅) within a
hypothesis space ℱ. The epistemic uncertainty component captures the lack of knowledge of a good
hypothesis space and a good classifier within a hypothesis space. Therefore, epistemic uncertainty goes
away once you fix the choice of hypothesis space and classifier. All that remains is aleatoric uncertainty.
The aleatoric uncertainty is measured by another entropy 𝐻( 𝑌̂ ∣ 𝑋, 𝑓 ), averaged across classifiers 𝑓(⋅) ∈
ℱ whose probability of being a good classifier is based on the training data. The epistemic uncertainty is
then the difference between 𝐻( 𝑌̂ ∣ 𝑋 ) and the average 𝐻( 𝑌̂ ∣ 𝑋, 𝑓 ).
There are a couple ways to obtain these two entropies and thereby the aleatoric and epistemic
uncertainty. Bayesian methods, including Bayesian neural networks, are one large category of methods
that learn full probability distributions for the features and labels, and thus the entropies can be
computed from the probability distribution. The details of Bayesian methods are beyond the scope of
this book.22 Another way to obtain the aleatoric and epistemic uncertainty is through ensemble methods,
including ones involving bagging and dropout that explicitly or implicitly create several independent
machine learning models that are aggregated (bagging and dropout were described in Chapter 7).23 The
average classifier-specific entropy for characterizing aleatoric uncertainty is estimated by simply
averaging the entropy of several data points for all the models in the trained ensemble considered
separately. The total uncertainty is estimated by computing the entropy of the entire ensemble together.
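Here is a minimal sketch of the ensemble-based decomposition using a random forest, assuming hypothetical training and test arrays: the entropy of the averaged ensemble prediction estimates the total uncertainty, the average per-tree entropy estimates the aleatoric part, and their difference estimates the epistemic part.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train, X_test are assumed to already exist; labels are binary
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

def entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# per-tree probabilities of the positive class for each test point
member_probs = np.stack([tree.predict_proba(X_test)[:, 1]
                         for tree in forest.estimators_])

total_uncertainty = entropy(member_probs.mean(axis=0))   # entropy of the ensemble
aleatoric = entropy(member_probs).mean(axis=0)           # average per-member entropy
epistemic = total_uncertainty - aleatoric                # what the members disagree on
```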
21
Stefan Depeweg, José Miguel Hernández-Lobato, Finale Doshi-Velez, and Steffen Udluft. “Decomposition of Uncertainty in
Bayesian Deep Learning for Efficient and Risk-Sensitive Learning.” In: Proceedings of the International Conference on Machine
Learning. Stockholm, Sweden, Jul. 2018, pp. 1184–1193.
22
Alex Kendall and Yarin Gal. “What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?” In: Advances
in Neural Information Processing Systems 31 (Dec. 2017), pp. 5580–5590.
23
Yarin Gal and Zoubin Ghahramani. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learn-
ing.” In: Proceedings of the International Conference on Machine Learning. New York, New York, USA, Jun. 2016, pp. 1050–1059.
Aryan Mobiny, Pengyu Yuan, Supratik K. Moulik, Naveen Garg, Carol C. Wu, and Hien Van Nguyen. “DropConnect is Effective
in Modeling Uncertainty of Bayesian Deep Networks.” In: Scientific Reports 11.5458 (Mar. 2021). Mohammad Hossein Shaker
and Eyke Hüllermeier. “Aleatoric and Epistemic Uncertainty with Random Forests.” In: Proceedings of the International Sympo-
sium on Intelligent Data Analysis. Apr. 2020, pp. 444–456.
24
Po-Ming Law, Sana Malik, Fan Du, and Moumita Sinha. “The Impact of Presentation Style on Human-in-the-Loop Detection of
Algorithmic Bias.” In: Proceedings of the Graphics Interface Conference. May 2020, pp. 299–307.
Figure 13.3. Bar graph of trust metrics for four different models.
An alternative is the parallel coordinate plot, which is a line graph of the different metric dimensions
next to each other, but normalized separately.26 An example is shown in Figure 13.4. The separate
normalization per metric permits you to flip the direction of the axis so that, for example, higher is
always better. (This flipping has been done for empirical robustness in the figure.) Since the lines can
25
Javier Antorán, Umang Bhatt, Tameem Adel, Adrian Weller, and José Miguel Hernández-Lobato. “Getting a CLUE: A Method
for Explaining Uncertainty Estimates.” In: Proceedings of the International Conference on Learning Representations. May 2021.
26
Parallel coordinate plots have interesting mathematical properties. For more details, see: Rida E. Moustafa. “Parallel Coordi-
nate and Parallel Coordinate Density Plots.” In: WIREs Computational Statistics 3 (Mar./Apr. 2011), pp. 134–148.
overlap, there is less of a crowding effect from too many models being compared than with bar graphs.
(If there are so many models that even the parallel coordinate plot becomes unreadable, an alternative
is the parallel coordinate density plot, which gives an indication of how many lines there are in every part
of the plot using shading.) The main purpose of parallel coordinate plots is precisely to compare items
along several categories with different metrics. Conditional parallel coordinate plots, an interactive version
of parallel coordinate plots, allow you to expand upon submetrics within a higher-level metric.27 For
example, if you create an aggregate metric that combines several adversarial robustness metrics
including empirical robustness, CLEVER score, and others, an initial visualization will only contain the
aggregate robustness score, but can be expanded to show the details of the other metrics it is composed
of. Parallel coordinate plots can be wrapped around a polygon to yield a radar chart, an example of which
is shown in Figure 13.5.
Figure 13.4. Parallel coordinate plot of trust metrics for four different models.
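Here is a minimal sketch of a parallel coordinate plot with matplotlib; the model names, metric names, and metric values are made-up placeholders, and each metric is normalized separately with one axis flipped to illustrate the point about direction.

```python
import numpy as np
import matplotlib.pyplot as plt

# placeholder metric values: rows are models, columns are trust metrics
models = ['logistic regression', 'decision forest', 'boosted forest', 'neural network']
metric_names = ['accuracy', 'disparate impact', 'empirical robustness', 'faithfulness']
values = np.array([[0.71, 0.85, 0.32, 0.90],
                   [0.78, 0.80, 0.25, 0.70],
                   [0.81, 0.78, 0.20, 0.60],
                   [0.79, 0.82, 0.28, 0.65]])

# normalize each metric to [0, 1] separately; flip any axis whose direction
# you want reversed so that higher always reads as better
normed = (values - values.min(axis=0)) / (values.max(axis=0) - values.min(axis=0))
normed[:, 2] = 1 - normed[:, 2]

for name, row in zip(models, normed):
    plt.plot(metric_names, row, marker='o', label=name)
plt.ylabel('normalized metric (higher is better)')
plt.legend()
plt.show()
```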
27
Daniel Karl I. Weidele. “Conditional Parallel Coordinates.” In: Proceedings of the IEEE Visualization Conference. Vancouver, Canada, Oct. 2019.
Figure 13.5. Radar chart of trust metrics for four different models.
It is not easy to visualize metrics such as disparate impact ratio in which both small and large values
indicate poor performance and intermediate values indicate good values. In these cases, and also to
appeal to less technical consumers in the case of all metrics, simpler non-numerical visualizations
involving color patches (e.g. green/yellow/red that indicate good/medium/poor performance),
pictograms (e.g. smiley faces or stars), or Harvey balls (○/◔/◑/◕/●) may be used instead. See Figure
13.6 for an example. However, these visualizations require thresholds to be set in advance on what
constitutes a good, medium, or poor value. Eliciting these thresholds is part of value alignment, covered
in Chapter 14.
Figure 13.6. Simpler non-numeric visualization of trust metrics for four different models.
28
Umang Bhatt, Javier Antorán, Yunfeng Zhang, Q. Vera Liao, Prasanna Sattigeri, Riccardo Fogliato, Gabrielle Gauthier
Melançon, Ranganath Krishnan, Jason Stanley, Omesh Tickoo, Lama Nachman, Rumi Chunara, Madhulika Srikumar, Adrian
Weller, and Alice Xiang. “Uncertainty as a Form of Transparency: Measuring, Communicating, and Using Uncertainty.” In:
Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. Jul. 2021, pp. 401–413.
29
Anne Marthe van der Bles, Sander van der Linden, Alexandra L. J. Freeman, James Mitchell, Ana B. Galvao, Lisa Zaval, and
David J. Spiegelhalter. “Communicating Uncertainty About Facts, Numbers and Science.” In: Royal Society Open Science
6.181870 (Apr. 2019).
Figure 13.7. Bar graph with error bars of trust metrics for four different models.
Figure 13.8. Box-and-whisker plot of trust metrics for four different models.
Figure 13.9. Violin plot of trust metrics for four different models.
more than half of the computer owners to change a fact that has been written, which is a difficult
endeavor. Blockchains provide a form of distributed trust.
There are two kinds of blockchains: (1) permissioned (also known as private) and (2) permissionless (also
known as public). Permissioned blockchains restrict reading and writing of information and ownership
of machines to only those who have signed up with credentials. Permissionless blockchains are open to
anyone and can be accessed anonymously. Either may be an option for maintaining the provenance of
facts while making the attrition prediction model more trustworthy. If all consumers are within the
corporation or are among a fixed set of regulators, then a permissioned blockchain network will do the
trick. If the general public or others external to JCN Corporation are the consumers of the factsheet, then
a permissionless blockchain is the preferred solution.
Posting facts to a blockchain solves the problem of maintaining the provenance of facts, but what if
there is tampering in the creation of the facts themselves? For example, what if a data scientist discovers
a small bug in the feature engineering code that shouldn’t affect model performance very much and
fixes it? Retraining the entire model will go on through the night, but there’s a close-of-business deadline
to submit facts. So the data scientist submits facts from a previously trained model. Shortcuts like this
can also be prevented with blockchain technologies.30 Since the training of many machine learning
models is done in a deterministic way by an iterative procedure (such as gradient descent), other
computers in the blockchain network can endorse and verify that the training computation was actually
run by locally rerunning small parts of the computation starting from checkpoints of the iterations
posted by the data scientist. The details of how to make such a procedure tractable in terms of
computation and communication costs is beyond the scope of the book.
In your testing, you found that all of the models were lacking in fairness, so you sent them back to
the data scientists to add better bias mitigation, which they did to your satisfaction. The various
stakeholders are satisfied now as well, so you can go ahead and sign for the system’s conformity and
push it on to the deployment stage of the lifecycle. Alongside the deployment efforts, you also release a
factsheet for consumption by the managers within JCN Corporation who will be following through on
the machine’s recommended retention actions. Remember that one of the promises of this new machine
learning system was to make employment at JCN Corporation more equitable, but that will only happen
if the managers adopt the system’s recommendations.31 Your efforts at factsheet-based transparency
have built enough trust among the managers so they are willing to adopt the system, and JCN
Corporation will have fairer decisions in retention actions.
30
Ravi Kiran Raman, Roman Vaculin, Michael Hind, Sekou L. Remy, Eleftheria K. Pissadaki, Nelson Kibichii Bore, Roozbeh
Daneshvar, Biplav Srivastava, and Kush R. Varshney. “A Scalable Blockchain Approach for Trusted Computation and Verifiable
Simulation in Multi-Party Collaborations.” In: Proceedings of the IEEE International Conference on Blockchain and Cryptocurrency.
May 2019, Seoul, Korea, pp. 277–284.
31
There have been instances where machine learning algorithms designed to reduce inequity, but lacking transparency, were adopted to a greater extent by privileged decision makers and to a lesser extent by unprivileged decision makers, which ended up exacerbating inequity instead of tamping it down. See: Shunyung Zhang, Kannan Srinivasan, Param Vir Singh,
and Nitin Mehta. “AI Can Help Address Inequity—If Companies Earn Users’ Trust.” In: Harvard Business Review (Sep. 2021). URL:
https://hbr.org/2021/09/ai-can-help-address-inequity-if-companies-earn-users-trust.
Transparency | 203
13.5 Summary
▪ Transparency is a key means for increasing the third attribute of trustworthiness in machine
learning (openness and human interaction).
▪ Fact flow is a mechanism for automatically collecting qualitative and quantitative facts about a
development lifecycle. A factsheet is a collection of facts, appropriately rendered for a given
consumer, that enables transparency and conformity assessment.
▪ Model validation and risk management involve testing models across dimensions of trust,
computing the uncertainties of the test results, capturing qualitative facts about the development
lifecycle, and documenting and communicating these items transparently via factsheets.
▪ Testing machine learning models is a unique endeavor different from other software testing
because of the oracle problem: not knowing in advance what the behavior should be.
▪ Visualization helps make test results and their uncertainties more accessible to various
consumer personas.
▪ Facts and factsheets become more trustworthy if their provenance can be maintained and
verified. Immutable ledgers implemented using blockchain networks provide such capabilities.
204 | Trustworthy Machine Learning
14
Value Alignment
The first two chapters in this part of the book on interaction were focused on the communication from
the machine system to the human consumer. This chapter is focused on the other direction of
interaction: from humans to the machine system. Imagine that you’re the director of the selection
committee of Alma Meadow, a (fictional) philanthropic organization that invests in early-stage social
enterprises and invites the founders of those mission-driven organizations to participate in a two-year
fellowship program. Alma Meadow receives about three thousand applications per year and selects
about thirty of them to be fellowship recipients. As the director of this process, you are considering using
machine learning in some capacity to improve the way it works. As such, you are a problem owner in the
problem specification phase of an incipient machine learning lifecycle. Your main concern is that you
do not sacrifice Alma Meadow’s mission or values in selecting social impact startups.
“We need to have more conversations where we're doing this translation between
policy, world outcome impact, what we care about and then all the math and data and
tech stuff is in the back end trying to achieve these things.”
—Rayid Ghani, machine learning and public policy researcher at Carnegie Mellon
University
Values are fundamental beliefs that guide actions. They indicate the importance of various things and
actions to a person or group of people, and determine the best ways to live and behave. Embedding Alma
Meadow’s values in the machine learning system that you are contemplating is known as value alignment
and has two parts.1 The first part is technical: how to encode and elicit values in such a way that machine
learning systems can access them and behave accordingly. The second part is normative: what the actual
values are. (The word normative refers to norms in the social rather than mathematical sense: standards
1
Iason Gabriel. “Artificial Intelligence, Values, and Alignment.” In: Minds and Machines 30 (Oct. 2020), pp. 411–437.
Value Alignment | 205
or principles of right action.) The focus of this chapter is on the first part of value alignment: the technical
aspects for you, your colleagues, and other stakeholders to communicate your values (likely influenced
by laws and regulations). The chapters in the sixth and final part of the book on purpose delve into the
values themselves.
Before diving into the technical details of value alignment, let’s first take a step back and talk about
two ways of expressing values: (1) deontological and (2) consequentialist.2 At a simplified level,
deontological values are about defining good actions without concern for their outcomes, and
consequentialist values are focused on defining outcomes that are good for all people. As an example, Alma
Meadow has two deontological values: at least one of the recipients of the fellowship per year will be a
formerly incarcerated individual and fellowship recipients’ social change organizations cannot promote
a specific religious faith. These explicit rules or constraints on the action of awarding fellowships do not
look into the effect on any outcome. In contrast, one of Alma Meadow’s consequentialist values is that
a fellowship recipient chosen from the applicant pool leads a social impact startup that will most
improve the worldwide disability-adjusted life-years (DALY) in the next ten years. DALY is a metric that
indicates the combined morbidity and mortality of the global disease burden. (It cannot be perfectly
known which applicant satisfies this at the time the decision is made due to uncertainty, but it can still
be a value.) It is a consequentialist value because it is in terms of an outcome (DALY).
There is some overlap between deontology and procedural justice (described in Chapter 10), and
between consequentialism and distributive justice. One important difference between consequentialism and distributive justice is that in operationalizing distributive justice through group fairness as done in Chapter 10, the population over whom good outcomes are sought is the affected users, and the justice/fairness is limited in time and scope to just the decision itself.3 In contrast, in consequentialism, the good is for all people throughout the broader society and the outcomes of
interest are not only the immediate ones, but the longer term ones as well. Just like distributive justice
was the focus in Chapter 10 rather than procedural justice because of its more natural operationalization
in supervised classification, consequentialism is the focus here rather than deontology. However, it
should be noted that deontological values may be elicited from people as rules and used as additional
constraints to the Alma Meadow applicant screening model. In certain situations, such constraints can
be easily added to the model without retraining.4
2
Joshua Greene, Francesca Rossi, John Tasioulas, Kristen Brent Venable, and Brian Williams. “Embedding Ethical Principles
in Collective Decision Support Systems.” In: Proceedings of the AAAI Conference on Artificial Intelligence. Phoenix, Arizona, USA,
Feb. 2016, pp. 4147–4151.
3
Dallas Card and Noah A. Smith. “On Consequentialism and Fairness.” In: Frontiers in Artificial Intelligence 3.34 (May 2020).
4
Elizabeth M. Daly, Massimiliano Mattetti, Öznur Alkan, and Rahul Nair. “User Driven Model Adjustment via Boolean Rule Ex-
planations.” In: Proceedings of the AAAI Conference on Artificial Intelligence. Feb. 2021, pp. 5896–5904.
206 | Trustworthy Machine Learning
It is critical not to take any shortcuts in value alignment because it forms the foundation for the other
parts of the lifecycle. By going through the value alignment process, you arrive at problem specifications
that data scientists try to satisfy using machine learning models, bias mitigation algorithms,
explainability algorithms, adversarial defenses, etc. during the modeling phase of the lifecycle.
One thing to be wary of is underspecification that allows machine learning models to take shortcuts
(also known as specification gaming and reward hacking in the value alignment literature).5 This concept
was covered in detail in Chapter 9, but is worth repeating. Any values that are left unsaid are free
dimensions for machine learning algorithms to use as they please. So for example, even if the values you
provide to the machine don’t prioritize fairness, you might still be opposed to an extremely
unfair model in spirit. If you don’t include at least some specification for a minimal level of fairness, the
model may very well learn to be extremely unfair if it helps achieve specified values in accuracy,
uncertainty quantification, and privacy.
In the remainder of the chapter, you will go through the problem specification phase for selecting
Alma Meadow’s fellows using supervised machine learning, insisting on value alignment. By the end,
you’ll have a better handle on the following questions.
▪ What are the different levels of consequentialist values that you should consider?
▪ How should these values be elicited from individual people and fused together when elicited from
a group of people?
▪ How do you put together elicited values with transparent documentation covered in Chapter 13
to govern machine learning systems?
The first question you should ask is whether you should even work on a problem. The answer may
be no. If you stop and think for a minute, many problems are not problems to be solved. At face value,
5
Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Kumar, Zac Kenton, Jan Leike,
and Shane Legg. “Specification Gaming: The Flip Side of AI Ingenuity.” In: DeepMind Blog (Apr. 2020). URL: https://deep-
mind.com/blog/article/Specification-gaming-the-flip-side-of-AI-ingenuity.
Value Alignment | 207
evaluating three thousand applications and awarding fellowships seems not to be oppressive, harmful,
misguided, or useless, but nevertheless, you should think deeply before answering.
“Technical audiences are never satisfied with the fix being ‘just don’t do it.’”
Even if a problem is one that should be solved, machine learning is not always the answer. Alma Meadow
has used a manual process to sort through applications for over thirty years, and has not been worse for
wear. So why make the change now? Are there only some parts of the overall evaluation process for
which machine learning makes sense?
The second question is more detailed. Among the different aspects of trustworthiness covered in the
book so far, such as privacy, consent, accuracy, distributional robustness, fairness, adversarial
robustness, interpretability, and uncertainty quantification, which ones are of the greatest concern? Are
some essential and others only nice-to-haves? The third question takes the high-level elements of
trustworthiness and brings them down to the level of specific metrics. Is accuracy, balanced accuracy,
or AUC a more appropriate metric? How about the choice between statistical parity difference and
average absolute odds difference? Lastly, the fourth question focuses on the preferred ranges of values
of the metrics selected in the third question. Is a Brier score less than or equal to 0.25 acceptable?
Importantly, there are relationships among the different pillars; you cannot create a system that is
perfect in all respects. For example, typical differential privacy methods worsen fairness and
uncertainty quantification.6 Explainability may be at odds with other dimensions of trustworthiness.7
Thus in the fourth question, it is critical to understand the relationships among metrics of different
pillars and only specify ranges that are feasible.
6
Marlotte Pannekoek and Giacomo Spigler. “Investigating Trade-Offs in Utility, Fairness and Differential Privacy in Neural Net-
works.” arXiv:2102.05975, 2021. Zhiqi Bu, Hua Wang, Qi Long, and Weijie J. Su. “On the Convergence of Deep Learning with
Differential Privacy.” arXiv:2106.07830, 2021.
7
Adrian Weller. “Transparency: Motivations and Challenges.” In: Explainable AI: Interpreting, Explaining and Visualizing Deep
Learning. Ed. by Wojciech Samek, Grégoire Montavon, Andrea Vedaldi, Lars Kai Hansen, and Klaus-Robert Müller. Cham, Swit-
zerland: Springer, 2019, pp. 23–40.
208 | Trustworthy Machine Learning
One helpful resource for the first question is the Ethical OS Toolkit,8 which lists eight different broad consequences of the machine learning system that you should ponder:
1. disinformation,
2. addiction,
3. economic inequality,
4. algorithmic bias,
5. surveillance state,
6. loss of data control,
7. surreptitious behavior, and
8. hate and crime.
Links to case studies accompany each of these checklist items in the Ethical OS Toolkit. Some of the case
studies show when the item has happened in the real-world, and some show actions taken to prevent
such items from happening. Another source of case studies is the continually-updated AI Incident
Database.9 Part 6 of the book, which is focused on purpose, touches on some of the items and case studies
as well.
Starting with the checklist, your first step is to decide which items are good and which items are bad.
In practice, you will read through the case studies, compare them to the Alma Meadow use case, spend
some time thinking, and come up with your judgement. Many people, including you, will mark each of
the eight items as bad, and judge the overall system to be too bad to proceed if any of them is true. But
values are not universal. Some people may mark some of the checklist items as good. Some judgements
may even be conditional. For example, with all else being equal, you might believe that algorithmic bias
(item 4) is good if economic inequality (item 3) is false. In this second case and in even more complicated
cases, reasoning about your preferences is not so easy.
CP-nets are a representation of values, including conditional ones, that help you figure out your
overall preference for the system and communicate it to the machine.10 (The ‘CP’ stands for ‘conditional
preference.’) CP-nets are directed graphical models with each node representing one attribute (checklist
item) and arrows indicating conditional relationships. Each node also has a conditional preference table
that gives the preferred values. (In this way, they are similar to causal graphs and structural equations
you learned about in Chapter 8.) The symbol ≻ represents a preference relation; the argument on the
left is preferred to the one on the right. The CP-net of the first case above (each of the eight items is bad)
is given in Figure 14.1. It has an additional node at the bottom capturing the overall preference for
working on the problem, which is conditioned on the eight items. There is a simple, greedy algorithm
8
URL: https://ethicalos.org/wp-content/uploads/2018/08/Ethical-OS-Toolkit-2.pdf
9
Sean McGregor. “Preventing Repeated Real World AI Failures by Cataloging Incidents: The AI Incident Database.” In: Proceed-
ings of the AAAI Conference on Artificial Intelligence. Feb. 2021, pp. 15458–15463.
10
Craig Boutilier, Ronen I. Brafman, Carmel Domshlak, Holger H. Hoos, and David Poole. “CP-Nets: A Tool for Representing and
Reasoning with Conditional Ceteris Paribus Preference Statements.” In: Journal of Artificial Intelligence Research 21.1 (Jan. 2004),
pp. 135–191.
Value Alignment | 209
for figuring out the most preferred instantiation of the values from CP-nets. However, in this case it is
easy to figure out the answer without an algorithm: it is the system that does not satisfy any of the eight
checklist items and says to go ahead and work on the problem. In more general cases with complicated
CP-nets, the inference algorithm is helpful.
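A minimal sketch of that greedy algorithm for an acyclic CP-net is shown below, using the Figure 14.1 example: visit variables in topological order and give each one its most preferred value given the values already chosen for its parents. The dictionary-and-lambda encoding of conditional preference tables is an illustrative assumption, not a standard CP-net format.

```python
# Greedy "sweep forward" inference for an acyclic CP-net: assign each variable
# its most preferred value given the values already chosen for its parents.

def most_preferred_outcome(variables, parents, cpt):
    """variables: list in topological order.
    parents: dict mapping variable -> list of parent variables.
    cpt: dict mapping variable -> function(parent_values) -> ordered list of
         values, most preferred first."""
    outcome = {}
    for var in variables:
        parent_values = tuple(outcome[p] for p in parents[var])
        outcome[var] = cpt[var](parent_values)[0]  # take the top preference
    return outcome


risk_items = ["disinformation", "addiction", "economic inequality",
              "algorithmic bias", "surveillance state", "loss of data control",
              "surreptitious", "hate and crime"]

variables = risk_items + ["work on problem"]
parents = {item: [] for item in risk_items}
parents["work on problem"] = risk_items

# Each risk item is unconditionally preferred to be absent: no is preferred to yes.
cpt = {item: (lambda pv: ["no", "yes"]) for item in risk_items}
# Prefer working on the problem only if none of the eight harms is present.
cpt["work on problem"] = (
    lambda pv: ["yes", "no"] if all(v == "no" for v in pv) else ["no", "yes"]
)

print(most_preferred_outcome(variables, parents, cpt))
# every risk item comes out 'no' and 'work on problem' comes out 'yes'
```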
Figure 14.1. An example CP-net for whether Alma Meadow should work on the application evaluation problem.
At the top is the graphical model. At the bottom are the conditional preference tables. Accessible caption. Eight
nodes disinformation, addiction, economic inequity, algorithmic bias, surveillance state, loss of data
control, surreptitious, and hate and crime all have the node work on problem as their child. All prefer-
ences for the top eight nodes are no ≻ yes. In all configurations of yeses and noes, the work on problem
preference is no ≻ yes, except when all top eight nodes have configuration no, when it is yes ≻ no.
With the values decided, it is time to go through the checklist items and determine whether they are
consistent with your most preferred values:
1. Disinformation = no: evaluating applications from social entrepreneurs is unlikely to subvert
the truth.
2. Addiction = no: this use of machine learning is not likely to lead to addiction.
3. Economic inequality = partly yes, partly no: it is possible the system could only select applica-
tions that have very technical descriptions of the social impact startup’s value proposition and
have been professionally polished. However, this possibility is not enough of a concern to com-
pletely stop the use of machine learning. What this concern does suggest, though, is that ma-
chine learning only be used to prioritize semi-finalists rather than later in the evaluation pro-
cess because human evaluators may find gems that seem unusual to the machine.
210 | Trustworthy Machine Learning
4. Algorithmic bias = no: Alma Meadow has been extremely proactive in preventing social bias
with respect to common protected attributes in its human evaluations in past years, so the
training data will not yield much social bias in models.
5. Surveillance state = no: the machine learning system is unlikely to be an instrument of oppres-
sion.
6. Loss of data control = no: by sharing their ideas in the application, budding social entrepre-
neurs could feel that they are giving up their intellectual property, but Alma Meadow has gone
to great lengths to ensure that is not the case. In fact, toward one of its values, Alma Meadow
provides information to applicants on how to construct confidential information assignment
agreements.
7. Surreptitious = no: the system is unlikely to do anything users don’t know about.
8. Hate and crime = no: the system is unlikely to enable criminal activities.
None of the items are properties of the system, including economic inequality when restricting the use
of machine learning only to a first-round prioritization. This is consistent with your most-preferred
values, so you should work on this problem.
To address the second question, which pillars of trustworthiness to prioritize, you can build another CP-net whose top-level nodes are seven properties of the system and its intended deployment:
1. Disadvantage (no, yes): the decisions have the possibility of giving systematic disadvantage to certain groups or individuals.
2. Human-in-the-loop (no, yes): a human decision maker reviews the model’s recommendations before acting on them.
3. Regulator (no, yes): the system will be subject to audit by an external regulator.
4. Recourse (no, yes): affected individuals are given an opportunity for recourse.
5. Retraining (no, yes): the model is retrained frequently to match the time scale of distribution shift.
6. People data (not about people, about people but not SPI, SPI): the system may use data about people which may be sensitive personal information (SPI).
7. Security (external, internal and not secure, secure): the data, model interface, or software code are available either externally or only internally, and may be kept highly secured.
Value Alignment | 211
Once you have given these seven system preferences, giving conditional preferences for the different
elements of trustworthiness is more compact. They can simply be given as high or low priority values
based on just a few of the system preferences. For example, if there is a possibility of systematic
disadvantage and the problem involves people data, then giving attention to fairness may be highly
valued. Putting everything together yields a CP-net like the one in Figure 14.2.
Figure 14.2. An example CP-net for which pillars of trustworthiness Alma Meadow should prioritize when devel-
oping a model for the application evaluation problem. At the top is the graphical model. At the bottom are the con-
ditional preference tables. Accessible caption. In the graphical model, there are edges from disadvantage
to fairness, people data to fairness, human-in-the-loop to explainability, regulator to explainability,
recourse to explainability, human-in-the-loop to uncertainty quantification, regulator to uncertainty
quantification, retraining to uncertainty quantification, retraining to distributional robustness, people
data to privacy, security to privacy, and security to adversarial robustness. The conditional preference
tables list many different complicated preferences.
212 | Trustworthy Machine Learning
The top-level system property preferences will be highly specific to your Alma Meadow application
evaluation use case. You and other problem owners have the requisite knowledge at your fingertips to
provide your judgements. The conditional preferences connecting the top-level properties with the
specific elements of trustworthiness (fairness, explainability, etc.) are more generic and generalizable.
Even if the edges and conditional preference tables given in the figure are not 100% universal, they are
close to universal and can be used as-is in many different application domains.
In the Alma Meadow example in Figure 14.2, your specific judgements are: systematic disadvantage
is possible, you prefer a human decision-maker in the loop, there will not be a regulator audit, you prefer
that social entrepreneur applicants have an opportunity for recourse, you prefer the system not be
retrained frequently, you prefer that the applications contain data about people (both about the
applicant and the population their organization serves) but not anything personally-sensitive, and you
prefer that the data and models be secured. Based on these values and the conditional preferences lower
in the CP-net, the following pillars are inferred to be higher priority: fairness, explainability, uncertainty
quantification, and distributional robustness. Privacy and adversarial robustness are inferred to be
lower priority.
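One way to picture the second-level conditional preferences is as a small lookup from the seven system properties to a high or low priority for each pillar. The rules below are rough approximations of the conditional preference tables sketched in Figure 14.2, written from the examples given in the text; the real tables are richer and the function name pillar_priorities is hypothetical.

```python
# Illustrative, approximate conditional preferences: map answers about the
# seven system properties to a high or low priority for each pillar.

def pillar_priorities(p):
    def level(condition):
        return "high" if condition else "low"
    return {
        "fairness": level(p["disadvantage"] == "yes"
                          and p["people data"] != "not about people"),
        "explainability": level(p["human-in-the-loop"] == "yes"
                                or p["regulator"] == "yes"
                                or p["recourse"] == "yes"),
        "uncertainty quantification": level(p["human-in-the-loop"] == "yes"
                                            or p["regulator"] == "yes"
                                            or p["retraining"] == "yes"),
        "distributional robustness": level(p["retraining"] == "no"),
        "privacy": level(p["people data"] == "SPI" or p["security"] != "secure"),
        "adversarial robustness": level(p["security"] != "secure"),
    }


alma_meadow = {
    "disadvantage": "yes", "human-in-the-loop": "yes", "regulator": "no",
    "recourse": "yes", "retraining": "no",
    "people data": "about people but not SPI", "security": "secure",
}
print(pillar_priorities(alma_meadow))
# fairness, explainability, uncertainty quantification, and distributional
# robustness come out high; privacy and adversarial robustness come out low.
```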
11
Gaurush Hiranandani, Harikrishna Narasimhan, and Oluwasanmi Koyejo. “Fair Performance Metric Elicitation.” In: Advances
in Neural Information Processing Systems 33 (Dec. 2020), pp. 11083–11095.
Value Alignment | 213
contextual information may be used instead.12 Another approach based on pairwise comparisons is
known as the analytical hierarchy process; it asks for numerical ratings (one to nine) in the comparison so
that you not only indicate which metric is better, but by roughly how much as well.13
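As a sketch of the analytical hierarchy process, the snippet below turns a hypothetical 1-to-9 pairwise comparison matrix over three candidate metrics into priority weights using the common geometric-mean approximation of the principal eigenvector, along with a simple consistency check. The matrix entries are invented for illustration.

```python
import numpy as np

# Hypothetical 1-to-9 pairwise comparisons among three candidate metrics:
# entry (i, j) says how much more important metric i is than metric j.
metrics = ["balanced accuracy", "statistical parity difference", "Brier score"]
A = np.array([
    [1.0,   3.0,  5.0],   # balanced accuracy vs. the others
    [1/3.,  1.0,  2.0],
    [1/5.,  1/2., 1.0],
])

# Geometric-mean approximation of the principal eigenvector gives the weights.
weights = np.prod(A, axis=1) ** (1.0 / A.shape[0])
weights /= weights.sum()
for name, w in zip(metrics, weights):
    print(f"{name}: {w:.2f}")

# Consistency check: lambda_max close to n means the judgements are coherent.
lambda_max = float(np.mean((A @ weights) / weights))
consistency_index = (lambda_max - A.shape[0]) / (A.shape[0] - 1)
print("consistency index:", round(consistency_index, 3))
```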
12
Hong Shen, Haojian Jin, Ángel Alexander Cabrera, Adam Perer, Haiyi Zhu, and Jason I. Hong. “Designing Alternative Repre-
sentations of Confusion Matrices to Support Non-Expert Public Understanding of Algorithm Performance.” In: Proceedings of
the ACM on Human-Computer Interaction 4.CSCW2 (Oct. 2020), p. 153.
13
Yunfeng Zhang, Rachel K. E. Bellamy, and Kush R. Varshney. “Joint Optimization of AI Fairness and Utility: A Human-Cen-
tered Approach.” In: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. New York, New York, USA, Feb. 2020, pp.
400–406.
14
Visar Berisha, Alan Wisler, Alfred O. Hero, III, and Andreas Spanias. “Empirically Estimable Classification Bounds Based on a
Nonparametric Divergence Measure.” In: IEEE Transactions on Signal Processing 64.3 (Feb. 2016), pp. 580–591. Ryan Theisen,
Huan Wang, Lav R. Varshney, Caiming Xiong, and Richard Socher. “Evaluating State-of-the-Art Classification Models Against
Bayes Optimality.” In: Advances in Neural Processing Systems 34 (Dec. 2021).
15
Frank Nielsen. “An Information-Geometric Characterization of Chernoff Information.” In: IEEE Signal Processing Letters 20.3
(Mar. 2013), pp. 269–272.
16
Sanghamitra Dutta, Dennis Wei, Hazar Yueksel, Pin-Yu Chen, Sijia Liu, and Kush R. Varshney, “Is There a Trade-Off Between
Fairness and Accuracy? A Perspective Using Mismatched Hypothesis Testing.” In: Proceedings of the International Conference on
Machine Learning. Jul. 2020, pp. 2803–2813. Kush R. Varshney, Prashant Khanduri, Pranay Sharma, Shan Zhang, and Pramod
K. Varshney, “Why Interpretability in Machine Learning? An Answer Using Distributed Detection and Data Fusion Theory.” In:
Proceedings of the ICML Workshop on Human Interpretability in Machine Learning. Stockholm, Sweden, Jul. 2018, pp. 15–20. Zuxing
Li, Tobias J. Oechtering, and Deniz Gündüz. “Privacy Against a Hypothesis Testing Adversary.” In: IEEE Transactions on Infor-
mation Forensics and Security 14.6 (Jun. 2019), pp. 1567–1581.
214 | Trustworthy Machine Learning
Figure 14.3. Schematic diagram of feasible set of trust-related metrics. Accessible caption. A shaded region
enclosed by three curved segments is labeled feasible. It is surrounded by five axes: accuracy, Brier
score, empirical robustness, faithfulness, and disparate impact ratio.
The feasible set is a good starting point, but there is still the question of deciding on the preferred
ranges of the metrics. Two approaches may help. First, a value alignment system can automatically
collect or create a corpus of many models for the same or similar prediction task and compute their
metrics. This will yield an empirical characterization of the interrelationships among the metrics.17 You
can better understand your choice of metric values based on their joint distribution in the corpus. The
joint distribution can be visualized using a parallel coordinate density plot mentioned in Chapter 13.
Second, the value alignment system can utilize a variation of so-called trolley problems for supervised
machine learning. A trolley problem is a thought experiment about a fictional situation in which you can
save the lives of five people who’ll otherwise be hit by a trolley by swerving and killing one person.
Whether you choose to divert the trolley reveals your values. Variations of trolley problems change the
number of people who die under each option and associate attributes with the people.18 They are also
pairwise comparisons. Trolley problems are useful for value elicitation because humans are more easily
able to reason about small numbers than the long decimals that usually appear in trust metrics.
Moreover, couching judgements in terms of an actual scenario helps people internalize the
consequences of the decision and relate them to their use case.
As an example, consider the two scenarios shown in Figure 14.4. Which one do you prefer? Would
you rather have an adversarial example fool the system or have a large disparate impact ratio? The
actual numbers also play a role because a disparate impact ratio of 2 in scenario 2 is quite high. There is
no right or wrong answer, but whatever you select indicates your values.
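Answers to such pairwise comparisons can be turned into approximate metric weights. The sketch below takes a Bradley-Terry-style approach, assuming each scenario is summarized by a small vector of trust metrics and fitting a logistic regression on metric differences; the scenarios, metric values, and recorded answers are all invented for illustration and are not the elicitation procedure from the cited work.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each scenario is summarized by a few trust metrics (values invented here).
metric_names = ["accuracy", "disparate impact ratio", "adversarial error"]
scenarios = np.array([
    # accuracy, disparate impact ratio, adversarial error
    [0.85, 1.0, 0.10],
    [0.90, 2.0, 0.00],
    [0.80, 1.1, 0.05],
    [0.92, 1.8, 0.20],
])

# Answers to pairwise comparisons: (index preferred, index not preferred).
answers = [(0, 1), (2, 1), (0, 3), (2, 3), (0, 2)]

# Bradley-Terry-style setup: the feature is the difference of metric vectors,
# the label says whether the first scenario of the pair was the preferred one.
X, y = [], []
for preferred, other in answers:
    X.append(scenarios[preferred] - scenarios[other]); y.append(1)
    X.append(scenarios[other] - scenarios[preferred]); y.append(0)

model = LogisticRegression().fit(np.array(X), np.array(y))
for name, coef in zip(metric_names, model.coef_[0]):
    # The sign and size of each coefficient indicate how the respondent
    # implicitly weighs that metric when choosing between scenarios.
    print(f"{name}: weight {coef:+.2f}")
```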
17
Moninder Singh, Gevorg Ghalachyan, Kush R. Varshney, and Reginald E. Bryant. “An Empirical Study of Accuracy, Fairness,
Explainability, Distributional Robustness, and Adversarial Robustness.” In: KDD Workshop on Measures and Best Practices for Re-
sponsible AI. Aug. 2021.
18
Edmond Awad, Sohan Dsouza, Richard Kim, Jonathan Schulz, Joseph Henrich, Azim Shariff, Jean-François Bonnefon, and
Iyad Rahwan. “The Moral Machine Experiment.” In: Nature 563.7729 (Oct. 2018), pp. 59–64.
Value Alignment | 215
Figure 14.4. A pairwise comparison of illustrated scenarios. Accessible caption. Two scenarios each have
different small numbers of members of unprivileged and privileged groups receiving and not receiving
the fellowship. The first scenario also has an adversarial example.
preferences.19 Other methods for aggregating individual preferences into collective preferences are also
based on voting.20
Voting methods typically aim to choose the value that is preferred by the majority in every pairwise
comparison with other possible values (this majority-preferred set of values is known as the Condorcet
winner). However, it is not clear if such majoritarianism is really what you want when combining the
preferences of the various stakeholders. Minority voices may raise important points that shouldn’t be
drowned out by the majority, which is apt to happen in independent individual elicitation followed by a
voting-based preference fusion. The degree of participation by members of minoritized groups should
not be so weak as to be meaningless or even worse: extractive (the idea of extraction conceived in
postcolonialism is covered in Chapter 15).21 This shortcoming of voting systems suggests that an
alternative process be pursued that does not reproduce existing power dynamics. Participatory design—
various stakeholders, data scientists and engineers working together in facilitated sessions to
collectively come up with a single CP-net and set of pairwise comparisons—is a suggested remedy, but may in
fact also reproduce existing power dynamics if not conducted well. So in your role at Alma Meadow, don’t
skimp on well-trained facilitators for participatory design sessions.
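Returning to the voting approach, checking for a Condorcet winner among a handful of candidate value settings is straightforward; the sketch below does so from hypothetical stakeholder rankings. The candidate labels and rankings are invented for illustration.

```python
# Each stakeholder ranks candidate value settings from most to least preferred.
rankings = [
    ["strict fairness", "balanced", "accuracy first"],
    ["balanced", "strict fairness", "accuracy first"],
    ["strict fairness", "accuracy first", "balanced"],
    ["accuracy first", "balanced", "strict fairness"],
    ["balanced", "strict fairness", "accuracy first"],
]
candidates = set(rankings[0])


def beats(a, b):
    """True if a majority of stakeholders rank candidate a above candidate b."""
    wins = sum(r.index(a) < r.index(b) for r in rankings)
    return wins > len(rankings) / 2


def condorcet_winner():
    """Return the candidate that beats every other head-to-head, if one exists."""
    for c in candidates:
        if all(beats(c, other) for other in candidates if other != c):
            return c
    return None  # preferences may be cyclic, so a winner is not guaranteed


print(condorcet_winner())  # prints 'balanced' for the rankings above
```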
14.4 Governance
You’ve come to an agreement with the stakeholders on the values that should be expressed in Alma
Meadow’s application screening system. You’ve specified them as feasible ranges of quantitative
metrics that the machine learning system can incorporate. Now how do you ensure that those desired
values are realized by the deployed machine learning model? Through control or governance.22 Viewing
the lifecycle as a control system, illustrated in Figure 14.5, the values coming out of value alignment are
the reference input, the data scientists are the controllers that try to do all they can so the machine
learning system meets the desired values, and model facts (described in Chapter 13 as part of
transparency) are the measured output of testing that indicate whether the values are met. Any
difference between the facts and the values is a signal of misalignment to the data scientists; they must
do a better job in modeling. In this way, the governance of machine learning systems requires both the
elicitation of the system’s desired behavior (value alignment) and the reporting of facts that measure
those behaviors (transparency).
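Viewed this way, the misalignment signal can be computed mechanically: the agreed values are acceptable ranges for each trust metric, the facts are the measured test results, and any metric that falls outside its range is flagged back to the data scientists. The metric names and ranges below are invented for illustration.

```python
# Values expressed as acceptable ranges for each metric, compared against the
# measured facts from testing; out-of-range metrics are the misalignment signal.

values = {
    "balanced accuracy": (0.80, 1.00),
    "disparate impact ratio": (0.80, 1.25),
    "brier score": (0.00, 0.25),
}

facts = {
    "balanced accuracy": 0.83,
    "disparate impact ratio": 0.72,   # outside the acceptable range
    "brier score": 0.21,
}


def misalignment(values, facts):
    """Return the metrics whose measured facts violate the agreed value ranges."""
    out = {}
    for metric, (low, high) in values.items():
        observed = facts[metric]
        if not (low <= observed <= high):
            out[metric] = {"observed": observed, "acceptable": (low, high)}
    return out


print(misalignment(values, facts))
# {'disparate impact ratio': {'observed': 0.72, 'acceptable': (0.8, 1.25)}}
```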
19
Lirong Xia, Vincent Conitzer, and Jérôme Lang. “Voting on Multiattribute Domains with Cyclic Preferential Dependencies.”
In: Proceedings of the AAAI Conference on Artificial Intelligence. Chicago, Illinois, USA, Jul. 2008, pp. 202–207. Indrani Basak and
Thomas Saaty. “Group Decision Making Using the Analytic Hierarchy Process.” In: Mathematical and Computer Modelling 17.4–5
(Feb.–Mar. 1993), pp. 101–109.
20
Ritesh Noothigattu, Snehalkumar ‘Neil’ S. Gaikwad, Edmond Awad, Sohan Dsouza, Iyad Rahwan, Pradeep Ravikumar, and
Ariel D. Procaccia. “A Voting-Based System for Ethical Decision Making.” In: Proceedings of the AAAI Conference on Artificial Intelli-
gence. New Orleans, Louisiana, USA, Feb. 2018, pp. 1587–1594. Min Kyung Lee, Daniel Kusbit, Anson Kahng, Ji Tae Kim,
Xinran Yuan, Allissa Chan, Daniel See, Ritesh Noothigattu, Siheon Lee, Alexandros Psomas, and Ariel D. Procaccia.
“WeBuildAI: Participatory Framework for Algorithmic Governance.” In: Proceedings of the ACM on Human-Computer Interaction
3.181 (Nov. 2019).
21
Sasha Costanza-Chock. Design Justice: Community-Led Practices to Build the Worlds We Need. Cambridge, Massachusetts, USA:
MIT Press, 2020.
22
Osonde A. Osoba, Benjamin Boudreaux, and Douglas Yeung. “Steps Towards Value-Aligned Systems.” In: Proceedings of the
AAAI/ACM Conference on AI, Ethics, and Society. New York, New York, USA, Feb. 2020, pp. 332–336.
Value Alignment | 217
Figure 14.5. Transparent documentation and value alignment come together to help in the governance of ma-
chine learning systems. Accessible caption. A block diagram that starts with a value alignment block out
of which come values. Facts are subtracted from values to yield misalignment. Misalignment is input
to a data scientists block with modeling as output. Modeling is input to a machine learning model with
output that is fed into a testing block. The output of testing is the same facts that were subtracted from
values, creating a feedback loop.
In Chapter 13, factsheets contained not only quantitative test results, but also intended uses and
other qualitative knowledge about the development process. However, in the view of governance
presented here, only the quantitative test results seem to be used. So, is governance concerned only with
test outcomes, which are of a consequentialist nature, or is it also concerned with the development
process, which is of a deontological nature? Since the controllers—the data scientists—are people with
inherent quirks and biases, both kinds of facts together help them see the big picture goals without
losing track of their lower-level, day-to-day duties for resolving misalignment. Thus, a codification of
processes to be followed during development is an integral part of governance. Toward this end, you
have instituted a set of checklists for Alma Meadow’s data scientists to follow, resulting in a well-
governed system overall.
14.5 Summary
▪ Interaction between people and machine learning systems is not only from the machine learning
system to a human via explainability and transparency. The other direction from humans to the
machine, known as value alignment, is just as critical so that people can instruct the machine on
acceptable behaviors.
▪ There are two kinds of values: consequentialist values that are concerned with outcomes and
deontological values that are concerned with actions. Consequentialist values are more natural
in value alignment for supervised machine learning systems.
▪ Value alignment for supervised classification consists of four levels. Should you work on a
problem? Which pillars of trustworthiness are high priority? What are the appropriate metrics?
What are acceptable metric value ranges?
▪ CP-nets and pairwise comparisons are tools for structuring the elicitation of preferences of
values across the four levels.
▪ The preferences of a group of stakeholders, including those from traditionally marginalized
backgrounds, may be combined using either voting or participatory design sessions.
▪ Governance of machine learning systems combines value alignment to elicit desired behaviors
with factsheet-based transparency to measure whether those elicited behaviors are being met.
218 | Trustworthy Machine Learning
15
Ethics Principles
Figure 15.1. Organization of the book. The sixth part focuses on the fourth attribute of trustworthiness, purpose,
which maps to the use of machine learning that is uplifting. Accessible caption. A flow diagram from left to
right with six boxes: part 1: introduction and preliminaries; part 2: data; part 3: basic modeling; part 4:
reliability; part 5: interaction; part 6: purpose. Part 6 is highlighted. Parts 3–4 are labeled as attributes
of safety. Parts 3–6 are labeled as attributes of trustworthiness.
Benevolence implies the application of machine learning for good purposes. From a
consequentialist perspective (defined in Chapter 14), we should broadly be aiming for good outcomes
for all people. But a single sociotechnical system surely cannot do that. So we must ask: whose good?
Whose interests will machine learning serve? Who can machine learning empower to achieve their
goals?
Ethics Principles | 219
The values encoded into machine learning systems are an ultimate expression of power. The most
powerful can push for their version of ‘good.’ However, for machine learning systems to be worthy of
trust, the final values cannot only be those that serve the powerful, but must also include the values of
the most vulnerable. Chapter 14 explains technical approaches for bringing diverse voices into the value
alignment process; here we try to understand what those voices have to say.
But before getting there, let’s take a step back and think again about the governance of machine
learning as a control system. What do we have to do to make it selfless and empowering for all? As shown
in Figure 15.2, which extends Figure 14.5, there is a paradigm—a normative theory of how things should
be done—that yields principles out of which values arise. The values then influence modeling.
Figure 15.2. A paradigm determines the principles by which values are constructed. The paradigm is one of the
most effective points in the system to intervene to change its behavior. Accessible caption. A block diagram
that starts with a paradigm block with output principles. Principles are input to a value alignment
block with output values. Facts are subtracted from values to yield misalignment. Misalignment is in-
put to a data scientists block with modeling as output. Modeling is input to a machine learning model
with output that is fed into a testing block. The output of testing is the same facts that were subtracted
from values, creating a feedback loop. Paradigm is intervened upon, shown using a hammer.
There are many leverage points in such a complex system to influence how it behaves.1 Twiddling with
parameters in the machine learning model is a leverage point that may have some small effect.
Computing facts quickly and bringing them back to data scientists is a leverage point that may have
some slightly larger effect. But the most effective leverage point to intervene on is the paradigm
producing the principles.2 Therefore, in this chapter, we focus on different paradigms and the
principles, codes, and guidelines that come from them.
1
Donella H. Meadows. Thinking in Systems: A Primer. White River Junction, Vermont, USA: Chelsea Green Publishing, 2008.
2
More philosophically, Meadows provides an even more effective leverage point: completely transcending the idea of para-
digms through enlightenment.
220 | Trustworthy Machine Learning
Organizations in more economically-developed countries have been more active in publishing AI ethics principles than those in less economically-developed
countries, which may exacerbate power imbalances. Moreover, the entire framing of ethics principles
for machine learning is based on Western philosophy rather than alternative conceptions of ethics.3
There are many similarities across the different sets of principles, but also key differences.4
First, let’s look at the similarities. At a coarse-grained level, five principles commonly occur in ethics
guidelines from different organizations:
1. privacy,
2. fairness and justice,
3. safety and reliability,
4. transparency (which usually includes interpretability and explainability), and
5. social responsibility and beneficence.
This list is not dissimilar to the attributes of trustworthiness that have guided the progression of the
book. Some topics are routinely omitted from ethics principles, such as artificial general intelligence
and existential threats (machines taking over the world), and the psychological impacts of machine
learning systems.
Differences manifest when looking across sectors: governments, NGOs, and private corporations.
Compared to the private sector, governments and NGOs take a more participatory approach to coming
up with their principles. They also have longer lists of ethical principles beyond the five core ones listed
above. Furthermore, the documents espousing their principles contain greater depth.
The topics of emphasis are different across the three sectors. Governments emphasize
macroeconomic concerns of the adoption of machine learning, such as implications on employment and
economic growth. NGOs emphasize possible misuse of machine learning. Private companies emphasize
trust, transparency, and social responsibility. The remainder of the chapter drills down into these high-
level patterns.
15.2 Governments
What is the purpose of government? Some of the basics are law and order, defense of the country from
external threats, and general welfare, which includes health, well-being, safety, and morality of the
people. Countries often create national development plans that lay out actions toward improving general
welfare. In 2015, the member countries of the United Nations ratified a set of 17 sustainable
development goals to achieve by 2030 that harmonize a unified purpose for national development.
These global goals are:
3
Abeba Birhane. “Algorithmic Injustice: A Relational Ethics Approach.” In: Patterns 2.2 (Feb. 2021), p. 100205. Ezinne
Nwankwo and Belona Sonna. “Africa’s Social Contract with AI.” In: ACM XRDS Magazine 26.2 (Winter 2019), pp. 44–48.
4
Anna Jobin, Marcello Ienca, and Effy Vayena. “The Global Landscape of AI Ethics Guidelines.” In: Nature Machine Intelligence 1
(Sep. 2019), pp. 389–399. Daniel Schiff, Jason Borenstein, Justin Biddle, and Kelly Laas. “AI Ethics in the Public, Private, and
NGO Sectors: A Review of a Global Document Collection.” In IEEE Transactions on Technology and Society 2.1 (Mar. 2021), pp. 31–
42.
Ethics Principles | 221
1. end poverty in all its forms everywhere,
2. end hunger, achieve food security and improved nutrition and promote sustainable agriculture,
3. ensure healthy lives and promote well-being for all at all ages,
4. ensure inclusive and equitable quality education and promote lifelong learning opportunities
for all,
5. achieve gender equality and empower all women and girls,
6. ensure availability and sustainable management of water and sanitation for all,
7. ensure access to affordable, reliable, sustainable and modern energy for all,
8. promote sustained, inclusive and sustainable economic growth, full and productive employ-
ment and decent work for all,
9. build resilient infrastructure, promote inclusive and sustainable industrialization and foster
innovation,
10. reduce inequality within and among countries,
11. make cities and human settlements inclusive, safe, resilient and sustainable,
12. ensure sustainable consumption and production patterns,
13. take urgent action to combat climate change and its impacts,
14. conserve and sustainably use the oceans, seas and marine resources for sustainable develop-
ment,
15. protect, restore and promote sustainable use of terrestrial ecosystems, sustainably manage for-
ests, combat desertification, and halt and reverse land degradation and halt biodiversity loss,
16. promote peaceful and inclusive societies for sustainable development, provide access to justice
for all and build effective, accountable and inclusive institutions at all levels,
17. strengthen the means of implementation and revitalize the global partnership for sustainable
development.
Toward satisfying the purpose of government, governmental AI ethics principles are grounded in the
kinds of concerns stated in the sustainable development goals. Fairness and justice are a part of many
of the goals, including goals five, ten, and sixteen, and also appear as a core tenet of ethics principles.
Several other goals relate to social responsibility and beneficence.
Economic growth and productive employment are main aspects of goal eight and play a role in goals
nine and twelve. Governments have an overriding fear that machine learning technologies will eliminate
jobs through automation without creating others in their place. Therefore, as mentioned in the previous
section, the economic direction is played up in governmental AI ethics guidelines and not so much in
those of other sectors.
As part of this goal, there are increasing calls for a paradigm shift towards AI systems that complement
or augment human intelligence instead of imitating it.5
Furthermore, towards both economic competitiveness and defense from external threats, some
countries have now started engaging in a so-called arms race. Viewing the development of machine
learning as a race may encourage taking shortcuts in safety and governance, which is cautioned against
throughout this book.6
“There is one and only one social responsibility of business: to engage in activities
designed to increase its profits.”
—Milton Friedman, economist
In 2019, however, the Business Roundtable, an association of the chief executives of 184 large
companies headquartered in the United States, stated a broader purpose for corporations:
1. Delivering value to our customers. We will further the tradition of American companies leading
the way in meeting or exceeding customer expectations.
2. Investing in our employees. This starts with compensating them fairly and providing important
benefits. It also includes supporting them through training and education that help develop
new skills for a rapidly changing world. We foster diversity and inclusion, dignity and respect.
3. Dealing fairly and ethically with our suppliers. We are dedicated to serving as good partners to
the other companies, large and small, that help us meet our missions.
4. Supporting the communities in which we work. We respect the people in our communities and
protect the environment by embracing sustainable practices across our businesses.
5. Generating long-term value for shareholders, who provide the capital that allows companies to
invest, grow and innovate. We are committed to transparency and effective engagement with
shareholders.
Shareholder value is listed only in the last item. Other items deal with fairness, transparency and
sustainable development. AI ethics principles coming from corporations are congruent with this
broadening purpose of the corporation itself, and are also focused on fairness, transparency and
sustainable development.7
5
Daron Acemoglu, Michael I. Jordan, and E. Glen Weyl. “The Turing Test is Bad for Business.” In: Wired (Nov. 2021).
6
Stephen Cave and Seán S. ÓhÉigeartaigh. “An AI Race for Strategic Advantage: Rhetoric and Risks.” In: Proceedings of the
AAAI/ACM Conference on AI, Ethics, and Society. New Orleans, Louisiana, USA, Feb. 2018, pp. 36–40.
7
In January 2022, the Business Roundtable came out with 10 AI ethics principles of their own: (1) innovate with and for diversity, (2) mitigate the potential for unfair bias, (3) design for and implement transparency, explainability and interpretability, (4) invest in a future-ready AI workforce, (5) evaluate and monitor model fitness and impact, (6) manage data collection and data use responsibly, (7) design and deploy secure AI systems, (8) encourage a company-wide culture of responsible AI, (9) adapt existing governance structures to account for AI, and (10) operationalize AI governance throughout the whole organization.
Ethics Principles | 223
“I think we're in the third era, which is the age of integrated impact where we have
created social impact that is part of the core value and function of the company
overall.”
The 2019 statement by the Business Roundtable is not without criticism. Some argue that it is simply
a public relations effort without accompanying actions that could lead to a paradigm change. Others
argue it is a way for chief executives to lessen their accountability to investors.8 AI ethics principles by
corporations, especially those by companies developing machine learning technologies, face a similar
criticism known as ethics washing—creating a façade of developing ethical or responsible machine
learning that hides efforts that are actually very shallow.9 An extreme criticism is that technology
companies actively mislead the world about their true purpose and intentions with machine learning.10
8
Lucian A. Bebchuk and Roberto Tallarita. “The Illusory Promise of Stakeholder Governance.” In: Cornell Law Review 106 (2020),
pp. 91–178.
9
Elettra Bietti. “From Ethics Washing to Ethics Bashing: A View on Tech Ethics from Within Moral Philosophy.” In: Proceedings
of the ACM Conference on Fairness, Accountability, and Transparency. Barcelona, Spain, Jan. 2020, pp. 210–219.
10
Mohamed Abdalla and Moustafa Abdalla. “The Grey Hoodie Project: Big Tobacco, Big Tech, and the Threat on Academic In-
tegrity.” In: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. Jul. 2021, pp. 287–297.
224 | Trustworthy Machine Learning
From the perspective of critical theory, machine learning systems tend to be instruments that
reinforce hegemony (power exerted by a dominant group).11 They extract data from vulnerable groups
and at the same time, deliver harm to those same and other vulnerable groups. Therefore, the AI ethics
principles coming from civil society often call for a disruption of the entrenched balance of power,
particularly by centering the contexts of the most vulnerable and empowering them to pursue their
goals.
As an example, the AI principles stated by an NGO that supports the giving of humanitarian relief to
vulnerable populations are the following:12
11
Shakir Mohamed, Marie-Therese Png, and William Isaac. “Decolonial AI: Decolonial Theory as Sociotechnical Foresight in
Artificial Intelligence” In: Philosophy & Technology 33 (Jul. 2020), pp. 659–684. Alex Hanna, Emily Denton, Andrew Smart, and
Jamila Smith-Loud. “Towards a Critical Race Methodology in Algorithmic Fairness.” In: Proceedings of the ACM Conference on
Fairness, Accountability, and Transparency. Barcelona, Spain, Jan. 2020, pp. 501–512. Emily M. Bender, Timnit Gebru, Angelina
McMillan-Major, and Shmargaret Shmitchell. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜” In:
Proceedings of the ACM Conference on Fairness, Accountability, and Transparency. Mar. 2021, pp. 610–623.
12
Jasmine Wright and Andrej Verity. “Artificial Intelligence Principles for Vulnerable Populations in Humanitarian Contexts.”
Digital Humanitarian Network, Jan. 2020.
13
Abeba Birhane, Pratyusha Kalluri, Dallas Card, William Agnew, Ravit Dotan, Michelle Bao. “The Values Encoded in Machine
Learning Research.” arXiv:2106.15590, 2021.
Ethics Principles | 225
With the growing body of research literature on fairness, explainability, and robustness (they are ‘hot topics’), the
incentives for researchers are starting to align with the pursuit of research in technical trustworthy
machine learning algorithms. Several open-source and commercial software tools have also been
created in recent years to make the algorithms from research labs accessible to data scientists. But
having algorithmic tools is also only one leverage point for putting ethical AI principles into practice.
Practitioners also need the know-how for effecting change in their organizations and managing various
stakeholders. One approach for achieving organizational change is a checklist of harms co-developed
with stakeholders.14 Research is needed to further develop more playbooks for organization change.
Putting principles to practice is a process that has its own lifecycle.15 The first step is a series of small
efforts such as ad hoc risk assessments initiated by tempered radicals (people within the organization who
believe in the change and continually take small steps toward achieving it). The second step uses the
small efforts to demonstrate the importance of trustworthy machine learning and obtain the buy-in of
executives to agree to ethics principles. The executives then invest in educating the entire organization
on the principles and also start valuing the work of individuals who contribute to trustworthy machine
learning practices in their organization. The impetus for executives may also come from external forces
such as the news media, brand reputation, third-party audits, and regulations. The third step is the
insertion of fact flow tooling (remember this was a way to automatically capture facts for transparency
in Chapter 13) and fairness/robustness/explainability algorithms throughout the lifecycle of the
common development infrastructure that the organization uses. The fourth step is instituting the
requirement that diverse stakeholders be included in problem specification (value alignment) and
evaluation of machine learning systems with veto power to modify or stop the deployment of the system.
Simultaneously, this fourth step includes the budgeting of resources to pursue trustworthy machine
learning in all model development throughout the organization.
15.6 Summary
▪ The purpose of trustworthy machine learning systems is to do good, but there is no single
definition of good.
▪ Different definitions are expressed in ethics principles from organizations across the
government, private, and social sectors.
▪ Common themes are privacy, fairness, reliability, transparency, and beneficence.
14
Michael A. Madaio, Luke Stark, Jennifer Wortman Vaughan, and Hanna Wallach. “Co-Designing Checklists to Understand
Organizational Challenges and Opportunities around Fairness in AI.” In: Proceedings of the CHI Conference on Human Factors in
Computing Systems. Honolulu, Hawaii, USA, Apr. 2020, p. 318.
15
Bogdana Rakova, Jingying Yang, Henriette Cramer, and Rumman Chowdhury. “Where Responsible AI Meets Reality: Practi-
tioner Perspectives on Enablers for Shifting Organizational Practices.” In: Proceedings of the ACM on Human-Computer Interaction
5.CSCW1 (Apr. 2021), p. 7. Kathy Baxter. “AI Ethics Maturity Model.” Sep. 2021.
226 | Trustworthy Machine Learning
▪ A series of small actions can push an organization to adopt AI ethics paradigms and principles.
The adoption of principles is an effective start for an organization to adopt trustworthy machine
learning as standard practice, but not the only intervention required.
▪ Going from principles to practice also requires organization-wide education, tooling for
trustworthy machine learning throughout the organization’s development lifecycle, budgeting of
resources to put trustworthy machine learning checks and mitigations into all models, and veto
power for diverse stakeholders at the problem specification and evaluation stages of the lifecycle.
Lived Experience | 227
16
Lived Experience
Recall Sospital, the leading (fictional) health insurance company in the United States that tried to
transform its care management system in Chapter 10 with fairness as a top concern. Imagine that you
are a project manager at Sospital charged with reducing the misuse of opioid pain medications by
members. An opioid epidemic began in the United States in the late 1990s and is now at a point that over
81,000 people die per year from opioid overdoses. As a first step, you analyze member data to
understand the problem better. Existing machine learning-based opioid overdose risk models trained
on data from state prescription drug monitoring programs (which may include attributes originating in
law enforcement databases) have severe issues with data quality, consent, bias, interpretability, and
transparency.1 Also, the existing risk models are predictive machine learning models that can easily
pick up spurious correlations instead of causal models that do not. So you don’t want to take the shortcut
of using the existing models. You want to start from scratch and develop a model that is trustworthy.
Once you have such a model, you plan to deploy it to help human decision makers intervene in fair,
responsible, and supportive ways.
You are starting to put a team together to carry out the machine learning lifecycle for the opioid
model. You have heard the refrain that diverse teams are better for business.2 For example, a 2015 study
found that the top quartile of companies in gender and racial/ethnic diversity had 25% better financial
performance than other companies.3 In experiments, diverse teams have focused more on facts and
been more innovative.4 But do diverse teams create better, less biased, and more trustworthy machine
1
Maia Szalavitz. “The Pain Was Unbearable. So Why Did Doctors Turn Her Away?” In: Wired (Aug. 2021).
2
Among many other points that are used throughout this chapter, Fazelpour and De-Arteaga emphasize that the business case
view on diversity is problematic because it takes the lack of diversity as a given and burdens people from marginalized groups
to justify their presence. Sina Fazelpour and Maria De-Arteaga. “Diversity in Sociotechnical Machine Learning Systems.”
arXiv:2107.09163, 2021.
3
The study was of companies in the Americas and the United Kingdom. Vivian Hunt, Dennis Layton, and Sara Prince. “Why
Diversity Matters.” McKinsey & Company, Jan. 2015.
4
David Rock and Heidi Grant. “Why Diverse Teams are Smarter.” In: Harvard Business Review (Nov. 2016).
228 | Trustworthy Machine Learning
learning models?5 How and why? What kind of diversity are we talking about? In which roles and phases
of the machine learning lifecycle is diversity a factor?
“I believe diversity in my profession will lead to better technology and better benefits
to humanity from it.”
The first question is whether the team even affects the models that are created. Given the same data,
won’t all skilled data scientists produce the same model and have the same inferences? A real-world
experiment assigned 29 teams of skilled data scientists an open-ended causal inference task of
determining whether soccer referees are biased against players with dark skin, all using exactly the
same data.6 Due to different subjective choices the teams made in the problem specification and
analysis, the results varied. Twenty teams found a significant bias against dark-skinned players, which
means that nine teams did not. In another real-world example, 25 teams of data scientists developed
mortality prediction models from exactly the same health data and had quite variable results, especially
in terms of fairness with respect to race and gender.7 In open-ended lifecycles, models and results may
depend a lot on the team.
If the team matters, what are the characteristics of the team that matter? What should you be looking
for as you construct a team for modeling individual patients’ risk of opioid misuse? Let’s focus on two
team characteristics: (1) information elaboration, or how team members work together, and (2) cognitive
diversity, or what individual team members know. Regarding the first characteristic, information elaboration:
socioculturally non-homogeneous teams are more likely to slow down and consider critical and
contentious issues; they are less apt to take shortcuts.8 Such a slowdown is not prevalent in
homogeneous teams and, importantly, does not depend on the team members having different sets of
knowledge. All of the team members could know the critical issues, but still not consider them if the
members are socioculturally homogeneous.
You have probably noticed quotations sprinkled throughout the book that raise issues relevant to the
topic of a given section. You may have also noticed that the people quoted have different sociocultural
backgrounds, which may be different than yours. This is an intentional feature of the book. Even if they
are not imparting knowledge that’s different from the main text of the book, the goal of the quotes is for
you to hear these voices so that you are pushed to slow down and not take shortcuts. (Not taking
shortcuts is a primary theme of the book.)
5
Caitlin Kuhlman, Latifa Jackson, and Rumi Chunara. “No Computation Without Representation: Avoiding Data and Algorithm
Biases Through Diversity.” arXiv:2002.11836, 2020.
6
Raphael Silberzahn and Eric L. Uhlmann. “Crowdsourced Research: Many Hands Make Tight Work.” In: Nature 526 (Oct.
2015), pp. 189–191.
7
Timothy Bergquist, Thomas Schaffter, Yao Yan, Thomas Yu, Justin Prosser, Jifan Gao, Guanhua Chen, Łukasz Charzewski,
Zofia Nawalany, Ivan Brugere, Renata Retkute, Alidivinas Prusokas, Augustinas Prusokas, Yonghwa Choi, Sanghoon Lee, Jun-
seok Choe, Inggeol Lee, Sunkyu Kim, Jaewoo Kang, Patient Mortality Prediction DREAM Challenge Consortium, Sean D.
Mooney, and Justin Guinney. “Evaluation of Crowdsourced Mortality Prediction Models as a Framework for Assessing AI in
Medicine.” medRxiv:2021.01.18.21250072, 2021.
8
Daniel Steel, Sina Fazelpour, Bianca Crewe, and Kinley Gillette. “Information Elaboration and Epistemic Effects of Diversity.”
In: Synthese 198.2 (Feb. 2021), pp. 1287–1307.
In modern Western science and engineering, knowledge derived from lived experience is typically
seen as invalid; often, only knowledge obtained using the scientific method is seen as valid. This
contrasts with critical theory, which has knowledge from the lived experience of marginalized people at
its very foundation. Given the many ethics principles founded in critical theory covered in Chapter 15,
it makes sense to consider lived experience in informing your development of a model for opioid misuse
risk. Toward this end, in the remainder of the chapter, you will:
▪ map the cognitive benefit of the lived experience of team members to the needs and
requirements of different phases of the machine learning lifecycle, and
▪ formulate lifecycle roles and architectures that take advantage of that mapping.
9
Neurodiversity is not touched upon in this chapter, but is another important dimension that could be expanded upon.
10
Natalie Alana Ashton and Robin McKenna. “Situating Feminist Epistemology.” In: Episteme 17.1 (Mar. 2020), pp. 28–47.
“New perspectives ask new questions and that's a fact. This is exactly why inclusion
matters!”
The second phase, data understanding, requires digging into the available data and its provenance
to identify the possible bias and consent issues detailed in Chapter 4 and Chapter 5. This is another
phase in which it is important for the team to be critical, and it is useful to have members with epistemic
advantage. In Chapter 10, we already saw that the team developing the Sospital care management
system needed to recognize the bias against African Americans when using health cost as a proxy for
health need. Similarly, a diagnosis for opioid addiction in a patient’s data implies that the patient
actually interacted with Sospital for treatment, which will also be biased against groups that are less
likely to utilize the health care system. Problem owners, stakeholders, and data scientists from
marginalized groups are more likely to recognize this issue. Furthermore, a variety of lived experiences
will help discover that large dosage opioid prescriptions from veterinarians in a patient’s record are for
their pets, not for them; prescription claims for naltrexone, an opioid antagonist, represent treatment for
opioid addiction, not evidence of further misuse; and so on.
The third phase in developing an opioid misuse model is data preparation. You can think of data
preparation in two parts: (1) data integration and (2) feature engineering. Critique stemming from lived
experience has little role to play in data integration because of its mechanical and rote nature. Is this
also the case in the more creative feature engineering part? Remember from Chapter 10 that biases may
be introduced in feature engineering, such as by adding together different health costs to create a single
column. Such biases may be spotted by team members who are advantaged in looking for them.
However, if dataset constraints, such as dataset fairness metric constraints, have already been included
in the problem specification of the opioid misuse model in anticipation of possible harms, then no
additional epistemic advantage is needed to spot the issues. Thus, the lived experience of marginalization
among team members is less useful in the data preparation stage of the lifecycle.
11
Vinodkumar Prabhakaran and Donald Martin Jr. “Participatory Machine Learning Using Community-Based System Dynam-
ics.” In: Health and Human Rights Journal 22.2 (Dec. 2020), pp. 71–74.
In the fourth phase of the lifecycle, the team will take the prepared data and develop an
individualized causal model of factors that lead to opioid misuse.12 Coming after the problem
specification phase that sets forth the modeling task and the performance metrics, and after the data
understanding and data preparation phases that finalize the dataset, the modeling phase is not open-
ended like the soccer referee and mortality prediction tasks described in the previous section. The
modeling is quite constrained from the perspective of the data scientist.
A recent study tasked 399 data scientists, each working alone, with developing models of the
mathematical literacy of people based on approximately five hundred of their biographical features; the
dataset and basic performance metrics were clearly specified (no fairness metric was specified).13
Importantly, the dataset had many data points and was purposefully and carefully collected as a
representative sample without population biases. Thus, the dataset had negligible epistemic
uncertainty. The study analyzed the 399 models that were created and found no significant relationship
between the unwanted bias of the models and the sociocultural characteristics of the data scientists that
produced them.
In this example and other similar regimented and low-epistemic uncertainty modeling tasks, the
lived experience of the team is seemingly of low importance. In contrast, when there is great epistemic
uncertainty like you may have in analyzing opioid abuse, the inductive bias of the model chosen by the
data scientist has a great role to play and the lived experience of the data scientist can become important.
However, mirroring the argument made earlier about an explicit problem specification lessening the
epistemic advantage for members of marginalized groups in feature engineering, a clear specification
of all relevant trust metric dimensions also lessens the usefulness of lived experience in modeling.
Evaluating the opioid risk model once it has been created is not as straightforward as simply testing
it for the specified allowable trust metric ranges in the ways described in Chapter 14. Once a model is
tangible, you can manipulate it in various ways and better imagine the harms it could lead to. Thus,
being critical of the model during evaluation is also a job better done by a team that has members who
have experienced systematic disadvantage and are attuned to the negative impacts it may have if it is
deployed within Sospital’s operations.
Finally, if the model has passed the evaluation stage, the ML operations engineers on the team carry
out the deployment and monitoring phase of the lifecycle. Their role is primarily to ensure technical
integration with Sospital’s other systems and to note when the trust metric ranges elicited during value
alignment are violated over time. This is another phase of the lifecycle in which there is not much
epistemic advantage to be had by a team containing engineers with lived experience of marginalization.
Overall, as shown in Figure 16.1, three lifecycle phases (problem specification, data understanding,
and evaluation) can take advantage of having a diverse team containing members that have lived
experience of marginalization. The other three phases (data preparation, modeling, and deployment
and monitoring) benefit less from the epistemic advantage of team members with lived experience of
12
Chirag Nagpal, Dennis Wei, Bhanukiran Vinzamuri, Monica Shekhar, Sara E. Berger, Subhro Das, and Kush R. Varshney. “In-
terpretable Subgroup Discovery in Treatment Effect Estimation with Application to Opioid Prescribing Guidelines.” In: Proceed-
ings of the ACM Conference on Health, Inference, and Learning. Apr. 2020, pp. 19–29.
13
Bo Cowgill, Fabrizio Dell’Acqua, Samuel Deng, Daniel Hsu, Nakul Verma, and Augustin Chaintreau. “Biased Programmers?
Or Biased Data? A Field Experiment in Operationalizing AI Ethics.” In: Proceedings of the ACM Conference on Economics and Compu-
tation. Jul. 2020, pp. 679–681.
systematic harm. This conclusion suggests a particular lifecycle architecture for developing your opioid
risk model, discussed in the next section.
Figure 16.1. The different phases of the machine learning lifecycle delineated by how useful knowledge from lived
experience is. Knowledge from lived experience is more useful in problem specification, data understanding, and
evaluation. Knowledge from lived experience is less useful in data preparation, modeling, and deployment and
monitoring. Accessible caption. A diagram of the development lifecycle is marked according to which
phases find lived experience more useful and less useful.
If the problem specification, data understanding, and evaluation phases are informed and validated by people with lived
experience of oppression,14 then any competent, reliable, communicative, and selfless data engineers,
data scientists, and ML operations engineers equipped with the tools and training in trustworthy
machine learning will create a trustworthy opioid misuse risk model irrespective of their lived
experience. The pool of skilled data scientists at Sospital does not include many individuals with lived
experience, and you also don’t want to levy a ‘minority tax’—the burden of extra responsibilities placed
on minority employees in the name of diversity—on the ones there are. So you go with the best folks
available, and that is perfectly fine. (Machine learning researchers creating the tools for practitioners
should have a variety of lived experiences because researchers have to both pose and answer the
questions. Fortuitously, though their numbers are small overall, researchers from groups traditionally
underrepresented in machine learning and associated with marginalization are seemingly
overrepresented in research on trustworthy machine learning, as opposed to other areas of machine
learning research.15)
If the lived experience of the data scientists and engineers on the team is less relevant for building
trustworthy machine learning systems, what if the data scientists and engineers are not living beings at
all? Technology advances are leading to a near-future state in which feature engineering and modeling
will be mostly automated, using so-called auto ML. Algorithms will construct derived features, select
hypothesis classes, tune hyperparameters of machine learning algorithms, and so on. As long as these
auto ML algorithms are themselves trustworthy,16 then it seems as though they will seamlessly enter the
lifecycle, interact with problem owners and model validators, and successfully create a trustworthy
model for opioid misuse.
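To make this concrete, below is a minimal, hypothetical sketch of the kind of search an auto ML system automates: trying different hypothesis classes and hyperparameters against a fixed problem specification. It uses scikit-learn’s off-the-shelf model selection utilities as a stand-in for a full auto ML pipeline; the synthetic data, candidate models, and grids are illustrative assumptions rather than a recommendation for the opioid model.

```python
# Sketch only: scikit-learn model selection standing in for auto ML.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a prepared (hypothetical) risk dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

pipeline = Pipeline([("scale", StandardScaler()),
                     ("clf", LogisticRegression(max_iter=1000))])

# Search over both the hypothesis class and its hyperparameters.
param_grid = [
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1.0, 10.0]},
    {"clf": [RandomForestClassifier(random_state=0)],
     "clf__n_estimators": [100, 300]},
]

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

A genuine auto ML system would also automate feature construction, and it would have to be held to the same trust metrics (fairness, robustness, explainability) as a human-built model.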
As shown in Figure 16.2, in this near future, auto ML rather than data scientists is the controller in the
control theory perspective on governance introduced in Chapter 14 and Chapter 15. And this is a-okay.
Such an architecture involving auto ML empowers problem owners and marginalized communities to
pursue their goals without having to rely on scarce and expensive data scientists. This architecture
enables more democratized and accessible machine learning for Sospital problem owners when paired
with low-code/no-code interfaces (visual software development environments that allow users to create
applications with little or no knowledge of traditional computer programming).
“It's about humans at the center, it's about those unnecessary barriers, where people
have domain expertise but have difficulty teaching the machine about it.”
14
Those specifications and validations must also be given true power. This point is discussed later using the terminology ‘par-
ticipation washing’.
15
Yu Tao and Kush R. Varshney. “Insiders and Outsiders in Research on Machine Learning and Society.” arXiv:2102.02279,
2021.
16
Jaimie Drozdal, Justin Weisz, Dakuo Wang, Gaurav Dass, Bingsheng Yao, Changruo Zhao, Michael Muller, Lin Ju, and Hui Su.
“Trust in AutoML: Exploring Information Needs for Establishing Trust in Automated Machine Learning Systems.” In: Proceed-
ings of the International Conference on Intelligent User Interfaces. Mar. 2020, pp. 297–307.
Figure 16.2. The control theory perspective of AI governance with auto ML technologies serving as the controller
instead of data scientists. Accessible caption. A block diagram that starts with a paradigm block with out-
put principles. Principles are input to a value alignment block with output values. Facts are subtracted
from values to yield misalignment. Misalignment is input to an auto ML block with modeling as output.
Modeling is input to a machine learning model with output that is fed into a testing block. The output of
testing is the same facts that were subtracted from values, creating a feedback loop.
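As an illustration of the loop in the figure, here is a deliberately toy Python sketch. Every function is a hypothetical stub standing in for an organizational process (value alignment, testing) or for the auto ML controller, and the metric names and numbers are made up; the point is only the shape of the feedback, not a real governance implementation.

```python
# Toy sketch of the Figure 16.2 feedback loop; all functions are stubs.

def elicit_values():
    # Value alignment: minimum acceptable level for each trust metric.
    return {"auc": 0.80, "disparate_impact": 0.90}

def test(model):
    # Testing: measured facts about the current model (numbers are made up).
    return {"auc": 0.78, "disparate_impact": 0.85} if model == "model_v1" \
        else {"auc": 0.82, "disparate_impact": 0.93}

def auto_ml(misalignment):
    # Auto ML as controller: respond to misalignment by re-modeling (stub).
    return "model_v2"

values = elicit_values()
model = "model_v1"
for _ in range(10):                                   # governance loop
    facts = test(model)
    misalignment = {k: values[k] - facts[k] for k in values}
    if all(gap <= 0 for gap in misalignment.values()):
        break                                         # facts meet the values
    model = auto_ml(misalignment)
print(model, facts)
```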
A recent survey of professionals working within the machine learning lifecycle asked respondents
their preference for auto ML in different lifecycle phases.17 The respondents held different lifecycle
personas. The preferred lifecycle phases for automation were precisely those in which lived experience
is less important: data preparation, modeling, and deployment and monitoring. The phases of the
lifecycle that respondents did not care to see automation take hold were the ones where lived experience
is more important: problem specification, data understanding, and evaluation. Moreover, respondents
from the problem owner persona desired the greatest amount of automation, probably because of the
empowerment it provides them. These results lend further credence to an architecture for machine
learning development that emphasizes inclusive human involvement in the ‘takeoff’ (problem
specification and data understanding) and ‘landing’ (evaluation) phases of the lifecycle while permitting
‘auto pilot’ (auto ML) in the ‘cruising’ (data preparation and modeling) phase.
Another recent survey showed that machine learning experts were more likely to call for strong
governance than machine learning non-experts.18 This result suggests that problem owners may not
realize the need for explicit value alignment in an automated lifecycle. Therefore, the empowerment of
problem owners should only be enabled in architectures that place the elicitation of paradigms and
values at the forefront.
17
Dakuo Wang, Q. Vera Liao, Yunfeng Zhang, Udayan Khurana, Horst Samulowitz, Soya Park, Michael Muller, and Lisa Amini.
“How Much Automation Does a Data Scientist Want?” arXiv:2101.03970, 2021.
18
Matthew O'Shaughnessy, Daniel Schiff, Lav R. Varshney, Christopher Rozell, and Mark Davenport. “What Governs Attitudes
Toward Artificial Intelligence Adoption and Governance?” osf.io/pkeb8, 2021.
19
Mona Sloane, Emanuel Moss, Olaitan Awomolo, and Laura Forlano. “Participation is Not a Design Fix for Machine Learning.”
arXiv:2007.02423, 2020. Bas Hofstra, Vivek V. Kulkarni, Sebastian Munoz-Najar Galvez, Bryan He, Dan Jurafsky, and Daniel A.
McFarland. “The Diversity–Innovation Paradox in Science.” In: Proceedings of the National Academy of Sciences of the United States
of America 117.17 (Apr. 2020), pp. 9284–9291.
Participatory design sessions that include diverse voices, especially those with lived experience of
marginalization, have to be credited and compensated. The sessions are not enough if they turn out to
be merely for show. The outcomes of those sessions have to be backed by power and upheld throughout
the lifecycle of developing the opioid abuse model. Otherwise, the entire architecture falls apart and the
need for team members with lived experience returns to all phases of the lifecycle.
Leaving aside the difficult task of backing the inputs of marginalized people with the power they need
to be given, how should you even go about bringing together a diverse panel? From a practical
perspective, what if you are working under constraints?20 Broad advertising and solicitations from
entities that vulnerable people don’t know may not yield many candidates. More targeted recruitment
in specific social media groups and job listing sites may be somewhat better, but will still miss certain
groups. Unfortunately, there are no real shortcuts. You have to develop relationships with institutions
serving different communities and with members of those communities. Only then will you be able to
recruit people to participate in the problem specification, data understanding, and evaluation phases
(either as employees or simply as one-time panelists) and be able to do what you know that you should.
16.3 Summary
▪ The model produced in a machine learning lifecycle depends on characteristics of the team.
▪ Teams that are socioculturally heterogeneous tend to slow down and not take shortcuts.
20
Fernando Delgado, Stephen Yang, Michael Madaio, and Qian Yang. “Stakeholder Participation in AI: Beyond ‘Add Diverse Stakeholders and Stir.’” In: Proceedings of the NeurIPS Human-Centered AI Workshop. Dec. 2021.
17
Social Good
As you know from Chapter 7 and Chapter 13, the (fictional) information technology company JCN
Corporation has several data science teams that work on different problems faced by the company and
its customers. JCN Corporation’s chief executive is a member of the Business Roundtable and a
signatory to broadening the values of private industry from being solely driven by shareholders to being
driven by other stakeholders too (this was covered in Chapter 15). In this environment, you recall that
the fourth attribute of trustworthiness includes beneficence and helping others. Toward this end, you
have the idea to start a data science for social good program at JCN to engage those data science teams part-
time to conduct projects that directly contribute to uplifting humanity.
“Imagine what the world would look like if we built products that weren't defined by
what the market tells us is profitable, but instead what our hearts tell us is essential.”
Taking a consequentialist view (remember consequentialism from Chapter 14), ‘social impact’ or
‘making a difference’ is promoting the total wellbeing of humanity (in expected value over the long term
without sacrificing anything that might be of comparable moral importance).1 But what does that really
mean? And whose good or whose value of wellbeing are we talking about?
“The phrase ‘data science for social good’ is a broad umbrella, ambiguously defined.
As many others have pointed out, the term often fails to specify good for whom.”
It is dangerous for you to think that you or the data science teams at JCN Corporation are in a position
to determine what is an appropriate problem specification to uplift the most vulnerable people in the
world. Data science for social good is littered with examples of technologists taking the shortcut of being
paternalistic and making that determination themselves. If your data science teams are diverse and
include people with lived experience of marginalization (see Chapter 16), then maybe they will be less
paternalistic and push to have diverse, external problem owners.
“Most technologists from the Global North are often not self-aware and thus look at
problems in the Global South through the lens of technology alone. In doing so, they
inevitably silence the plurality of perspectives.”
But who should those external problem owners be? Your first inclination is to look towards
international development experts from large well-established governmental and non-governmental
organizations, and to consult the seventeen UN Sustainable Development Goals (SDGs) listed in Chapter
15. But as you investigate further, you realize that there were a lot of struggles of power and politics that
went into determining the SDGs; in particular, the lower-level targets beneath the seventeen goals may
not represent the views of the most vulnerable.2 You also learn that international development overall
has many paternalistic tendencies and is also littered with projects that make no sense. Some may even
be harmful to the people they intend to uplift.
Thus, while taking inspiration from the high-level topics touched on by the SDGs, you decide on the
following theory of change for the JCN data science for social good program you are creating. Using
machine learning, you will empower smaller, innovative social change organizations that explicitly
include the knowledge of the vulnerable people they intend to uplift when they work towards social
impact. (Collectively, civil society organizations and social enterprises—for-profit businesses that have
social impact as their main goal—are known as social change organizations.) Toward developing a social
good program within JCN Corporation, in this chapter you will:
2
Serge Kapto. “Layers of Politics and Power Struggles in the SDG Indicator Process.” In: Global Policy 10.S1 (Jan. 2019), pp. 134–
136.
▪ formulate a lifecycle for achieving a successful data science for social good program, and
▪ sketch out empowering machine learning architectures and platforms for promoting social good.
Before jumping into it, a few words on how you can gain internal support within JCN Corporation to
devote resources to the program. There are several value propositions that go beyond appealing to the
broadening stakeholder values that JCN Corporation is adopting and beyond appealing to the potential
for positive public relations. First, machine learning problem specifications in social impact
applications tend to have different constraints than those found in information technology and
enterprise applications. Constraints are the mother of innovation, and so working on these problems
will lead to new innovations for JCN. Second, by partnering with civil society organizations, JCN
Corporation will receive valuable feedback and public references about its machine learning tools that
enterprise customers may be unwilling to provide. Public references that allow JCN to tout its
capabilities are distinct from positive public relations because they do not depend on the goodness of
the application. Third, working on these projects attracts, retains, and grows the skills of talented data
scientists in JCN Corporation. Fourth, if the program is run on JCN Corporation’s cloud computing
platform, the platform’s usage will grow. Tax deductions for charitable giving are conspicuously absent
from the value propositions because JCN Corporation will be receiving product feedback and possible
cloud usage from the social change organizations.
3
Hugo Gerard, Kamalesh Rao, Mark Simithraaratchy, Kush R. Varshney, Kunal Kabra, and G. Paul Needham. “Predictive Mod-
eling of Customer Repayment for Sustainable Pay-As-You-Go Solar Power in Rural India.” In: Proceedings of the Data for Good
Exchange Conference. New York, New York, USA. Sep. 2015. Brian Abelson, Kush R. Varshney, and Joy Sun. “Targeting Direct
Cash Transfers to the Extremely Poor.” In: Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining. New
York, New York, USA, Aug. 2014, pp. 1563–1572. Debarun Bhattacharjya, Karthikeyan Shanmugam, Tian Gao, Nicholas Mattei,
Kush R. Varshney, and Dharmashankar Subramanian. “Event-Driven Continuous Time Bayesian Networks.” In: Proceedings of
the AAAI Conference on Artificial Intelligence. New York, New York, USA, Feb. 2020, pp. 3259–3266. Aditya Garg, Alexandra Ol-
teanu, Richard B. Segal, Dmitriy A. Katz-Rogozhnikov, Keerthana Kumar, Joana Maria, Liza Mueller, Ben Beers, and Kush R.
Varshney. “Demystifying Social Entrepreneurship: An NLP Based Approach to Finding a Social Good Fellow.” In: Proceedings of
the Data Science for Social Good Conference. Chicago, Illinois, USA, Sep. 2017. Skyler Speakman, Srihari Sridharan, and Isaac
Markus. “Three Population Covariate Shift for Mobile Phone-based Credit Scoring.” In: Proceedings of the ACM SIGCAS Conference
on Computing and Sustainable Societies. Menlo Park, California, USA, Jun. 2018, p. 20.
▪ accessibility,
▪ agriculture,
▪ education,
▪ environment,
▪ financial inclusion,
▪ health care,
▪ infrastructure (e.g. urban planning and transportation),
▪ information verification and validation,
▪ public safety and justice, and
▪ social work,
▪ supervised learning,
▪ reinforcement learning,
▪ computer vision,
▪ natural language processing,
▪ robotics,
▪ knowledge representation and reasoning,
4
Michael Chui, Martin Harryson, James Manyika, Roger Roberts, Rita Chung, Ashley van Heteren, and Pieter Nel. “Notes from
the AI Frontier: Applying AI for Social Good.” McKinsey & Company, Dec. 2018. Zheyuan Ryan Shi, Claire Wang, and Fei Fang.
“Artificial Intelligence for Social Good: A Survey.” arXiv:2001.01818, 2020.
currently have an inventory of kale; that is not data science for social good.5 Data science for social good
requires social change organizations to be problem owners who state the problem specification based
on the lived experiences of their beneficiaries (and even better, bring their beneficiaries to a panel of
diverse voices to inform the project).
Needless to say, the data science for social good you do in your program at JCN Corporation must be
imbued with data privacy and consent, along with the first three attributes of trustworthy machine
learning: competence, reliability (including fairness and robustness), and interaction (including
explainability, transparency, and value alignment). This is especially the case because these systems
are affecting the most vulnerable members of society.
17.1.2 How Has Data Science for Social Good Been Conducted?
Surveys of the data science for social good landscape find that nearly all efforts have been conducted as
one-off projects that involve the development of a custom-tailored solution, irrespective of whether they
are carried out as data science competitions, weekend volunteer events, longer term volunteer-based
consulting engagements, student fellowship programs, corporate philanthropy, specialized non-
governmental organizations, or dedicated innovation teams of social change organizations.
Creating such one-off solutions requires a great deal of time and effort both from the social change
organization and the data scientists. There is limited reuse of assets and learnings from one project to
the next because (1) every new project involves a different social change organization and (2) data
scientists acting as volunteers are unable to conduct a sequence of several projects over time. Moreover,
these projects typically require the social change organization to integrate machine learning solutions
with their other systems and practices, to deploy those solutions, and monitor and maintain the
solutions over time themselves. Very few social change organizations are equipped to do such ‘last-mile’
implementation, partly because their funding typically does not allow them to invest time and resources
into building up technological capacity.
The confluence of all these factors has led to the state we are in: despite the data science for social
good movement being nearly a decade long, most projects continue to only be demonstrations without
meaningful and lasting impact on social change organizations and their constituents. 6 A project lasting
a few months may show initial promise, but then is not put into practice and does not ‘make a difference.’
5
Jake Porway. “You Can’t Just Hack Your Way to Social Change.” In: Harvard Business Review (Mar. 2013). URL:
https://hbr.org/2013/03/you-cant-just-hack-your-way-to.
6
Kush R. Varshney and Aleksandra Mojsilović. “Open Platforms for Artificial Intelligence for Social Good: Common Patterns as
a Pathway to True Impact.” In: Proceedings of the ICML AI for Social Good Workshop. Long Beach, California, USA, Jul. 2019.
7
Lu Liu, Nima Dehmamy, Jillian Chown, C. Lee Giles, and Dashun Wang. “Understanding the Onset of Hot Streaks Across Artis-
tic, Cultural, and Scientific Careers.” In: Nature Communications 12 (Sep. 2021), p. 5392.
diversity of topics (which is then followed by a narrowly-focused exploitation phase that produces the
impact),8 a data science for social good program needs to begin broadly and go through the following
three-step lifecycle, illustrated in Figure 17.1.
Figure 17.1. Illustration of the three phases of the lifecycle of a data science for social good program: (1) piloting
and innovating with a portfolio of projects, (2) reusing and hardening solutions to the common patterns, and (3)
creating a usable platform that can reach a lot of social change organizations. Accessible caption. Step 1, pilot
and innovate, shows several different development lifecycles with icons for different social good appli-
cations in their center, colored gray to indicate they are not yet hardened. Step 2, reuse and harden,
shows a sequence of three development lifecycles in which the social good application icon gets pro-
gressively darker to black to indicate hardening. Step 3, deliver at scale shows a development lifecycle
inside a computer window illustrating its incorporation into a platform, touching tens of social good
applications.
1. Pilot and innovate. You should conduct several individual projects to learn about the needs of so-
cial change organizations that may be addressed by machine learning. In this phase, your data
scientists will also gain the experience of conducting multiple projects and start seeing com-
monalities across them. While doing so, JCN Corporation will gain from new innovations under
new constraints. You can choose to be somewhat intentional in the application area of social
good to match corporate values or in the technical area of machine learning to match technical
areas of interest, but not overly so.
2. Reuse and harden. Once you have several projects under your belt, you must step back and ana-
lyze the common patterns that emerge. Your goal at this stage is to develop common algorithms
or algorithmic toolkits to address those common patterns in as reusable a way as possible. You
want to meet the needs of multiple social change organizations using a common model or algo-
rithm. This type of machine learning innovation is unique; most data scientists and machine
learning researchers are not trained to step back and abstract things in this way, so it will be a
8
The word ‘exploit’ is used in a positive sense here, but is used in a negative sense later in the chapter.
challenge. However, this sort of insight and innovation is precisely the feedback that will be
helpful for JCN Corporation’s teams developing software tools and products for conducting data
science.
3. Deliver at scale. Those common reusable algorithms will not make a high impact until they are
made available within an environment that low-resourced and low-skilled social change organizations
are empowered to tweak, use, and maintain. (Refer to inclusive low-code/no-code
architectures in Chapter 16 for a related discussion.) The delivery will likely be ‘as-a-service’ on
JCN Corporation’s cloud-based environment. Software-as-a-service is software that is licensed
as a subscription, is centrally hosted, and is accessed by users using a web browser. Therefore,
integration with other systems is greatly simplified and the responsibility for maintenance falls
on JCN Corporation rather than the social change organization.
You are probably comfortable with the first phase of this data science for social good program
lifecycle. As long as you ensure that social change organizations—representing the interests of their
beneficiaries who have lived experience of vulnerability—are the problem owners and involved in
evaluation, then the JCN Corporation data scientists can approach the portfolio of projects in this phase
in a manner they are used to.
The second phase presupposes that there are common patterns in social good projects that can be
addressed using common models or algorithms. Evidence is starting to mount that this is indeed the
case. For example, the same algorithm for bandit data-driven optimization is used in social good
applications as varied as feeding the hungry and stopping wildlife poachers.9 As a second example, most
of the social good use cases (fictionally) covered in the book are quite different from each other, but are
all fair allocation problems posed as binary classification that can be addressed using a common
algorithmic toolkit such as AI Fairness 360, a library of fairness metrics and bias mitigation
algorithms.10 Moreover, large language models have been fine-tuned for several disparate social good
domains such as collecting evidence for drug repurposing and simplifying text for people with low-
literacy or cognitive disability.11 (Large language models are a kind of foundation model; introduced in
Chapter 4, foundation models are machine learning models trained on large-scale data that can be fine-
tuned for specific problems.)
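As a hedged illustration of what such a common algorithmic toolkit looks like in use, the sketch below applies AI Fairness 360’s dataset, metric, and pre-processing classes to a tiny toy dataframe. The column names, groups, and numbers are hypothetical; a real engagement would start from the social change organization’s own problem specification and data.

```python
# Toy sketch using the AI Fairness 360 toolkit; data and names are made up.
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.algorithms.preprocessing import Reweighing

df = pd.DataFrame({
    "group":    [0, 0, 0, 0, 1, 1, 1, 1],     # protected attribute
    "income":   [30, 45, 50, 25, 60, 40, 55, 35],
    "approved": [0, 0, 1, 0, 1, 1, 1, 0],     # favorable outcome = 1
})

dataset = BinaryLabelDataset(df=df, label_names=["approved"],
                             protected_attribute_names=["group"])
unpriv, priv = [{"group": 0}], [{"group": 1}]

metric = BinaryLabelDatasetMetric(dataset, unprivileged_groups=unpriv,
                                  privileged_groups=priv)
print("disparate impact before:", metric.disparate_impact())

# Reweighing is one of the toolkit's pre-processing bias mitigation algorithms.
reweighed = Reweighing(unprivileged_groups=unpriv,
                       privileged_groups=priv).fit_transform(dataset)
metric_after = BinaryLabelDatasetMetric(reweighed, unprivileged_groups=unpriv,
                                        privileged_groups=priv)
print("disparate impact after:", metric_after.disparate_impact())
```

The same few calls cover many of the fair allocation problems mentioned above, which is what makes a reusable toolkit possible.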
The third phase of the lifecycle of a social good program is mostly unproven as yet, but is what you
should be working toward in the program you intend to start at JCN Corporation. The result is an
accessible and inclusive data science for social good platform that is described in the next section.
9
Zheyuan Ryan Shi, Zhiwei Steven Wu, Rayid Ghani, and Fei Fang. “Bandit Data-Driven Optimization: AI for Social Good and
Beyond.” arXiv:2008.11707, 2020.
10
Rachel K. E. Bellamy, Kuntal Dey, Michael Hind, Samuel C. Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia,
Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilovic, Seema Nagar, Karthikeyan Natesan Ramamurthy, John Richards,
Diptikalyan Saha, Prasanna Sattigeri, Moninder Singh, Kush R. Varshney, and Yunfeng Zhang. “AI Fairness 360: An Extensible
Toolkit for Detecting and Mitigating Algorithmic Bias.” In: IBM Journal of Research and Development 63.4/5 (Jul./Sep. 2019), p. 4.
11
Shivashankar Subramanian, Ioana Baldini, Sushma Ravichandran, Dmitriy A. Katz-Rogozhnikov, Karthikeyan Natesan
Ramamurthy, Prasanna Sattigeri, Kush R. Varshney, Annmarie Wang, Pradeep Mangalath, and Laura B. Kleiman. “A Natural
Language Processing System for Extracting Evidence of Drug Repurposing from Scientific Publications.” In: Proceedings of the
AAAI Conference on Artificial Intelligence. New York, New York, USA, Feb. 2020, pp. 13369–13381. Sanja Stajner. “Automatic Text
Simplification for Social Good: Progress and Challenges.” In: Findings of the Association for Computational Linguistics. Aug. 2021,
pp. 2637–2652.
Before getting there, two comments on the term ‘scale.’ Scaling is a paradigm seen as paramount in
much of the technology industry, and is the main reason to pursue a digital platform that can be built
once and used by many. However, scaling is not the mission of many social change organizations;
although some would like to grow, many would like to remain small with a very pointed mission.12
Moreover, scaling as an overriding paradigm is not free from criticism and can be seen as a means for
exploiting the most vulnerable.13 In creating a social good program and platform with JCN Corporation,
your goal is to make the work of all social change organizations easier, irrespective of whether they
would like to scale themselves. You can control any possible exploitation by centering the values of the
most vulnerable throughout the development lifecycle.
12
Anne-Marie Slaughter. “Thinking Big for Social Enterprise Can Mean Staying Small.” In: Financial Times (Apr. 2018). URL:
https://www.ft.com/content/86061a82-46ce-11e8-8c77-ff51caedcde6.
13
Katherine Ye. “Silicon Valley and the English Language.” URL: https://book.affecting-technologies.org/silicon-valley-and-the-
english-language/. Jul. 2020.
14
C. K. Prahalad. The Fortune at the Bottom of the Pyramid: Eradicating Poverty Through Profits. Upper Saddle River, New Jersey,
USA: Wharton School Publishing, 2005.
11. distribution methods designed to reach both highly dispersed rural markets and highly dense
urban markets; and
12. focus on broad architecture, enabling quick and easy incorporation of new features.
What are the important points among these principles for a machine learning platform that empowers
social change organizations and what is a platform anyway?
A digital platform is a collection of people, processes, and internet-based tools that enable users to
develop and run something of value. Therefore, a machine learning platform contains web-based
software tools to carry out the entire machine learning development lifecycle from the problem
specification phase all the way to the deployment and monitoring phase, with roles for all the personas
including problem owners, data scientists, data engineers, model validators, operations engineers, and
diverse voices. Importantly, a machine learning platform is more than simply an off-the-shelf
programming library for modeling.
In fact, there are three kinds of machine learning capabilities: (1) off-the-shelf machine learning
packages, (2) machine learning platforms, and (3) bespoke machine learning builds.15 At one extreme,
off-the-shelf packages are useful for top-of-the-pyramid organizations with a high level of data science
skill and a high level of resources, but not for bottom-of-the-pyramid social change organizations. At the
other extreme, bespoke or custom-tailored development (which has been the predominant mode of data
science for social good over the last decade) should only be used for extremely complex problems or
when an organization needs a technological competitive advantage. These are not the circumstances in
which social change organizations typically operate; usually their problems, although having unique
constraints, are not overly complicated from a machine learning perspective and usually their
advantages in serving their beneficiaries are non-technological. Thus, it makes sense to be just right and
serve social change organizations using machine learning platforms.
What does a machine learning platform for the bottom of the pyramid entail? Translating the twelve
general principles to a machine learning platform for social change implies a focus on appropriate
functionality, adaptable user interfaces, deskilling, broad architecture, distribution methods, and
education. You’ll obtain appropriate functionality by paring down the machine learning capabilities to
the core model, algorithm, or toolkit that is reusable by many different social change organizations with
similar needs, as discussed earlier. Such foundational capabilities mean that the algorithms have to be
created only once and can be improved by a dedicated machine learning team that is not reliant on, or
part of, any one social change organization.
The remaining aspects touch on the last-mile problem. You can achieve adaptable user interfaces
and deskilling by following the inclusive architecture presented in Chapter 16 for people with lived
experience of marginalization. Such an architecture takes the scarce and expensive skill of data
scientists out of the development lifecycle through low-code/no-code and auto ML. Low-code/no-code
and auto ML should make it easy to configure and fine-tune the machine learning capability for the
specific task being approached by the social change organization. It should also be easy to slightly
change the definition of an outcome variable and apply the model to a new setting with slightly different
features. The interface should also provide a data catalog and tools for managing data. Moreover, the
interface should include meaningful and easy to consume visualizations of the output predictions. The
focus should be to simplify, simplify, simplify, but not so much that you are left with something
meaningless.
15
Andrew Burgess. The Executive Guide to Artificial Intelligence: How to Identify and Implement Applications for AI in Your Organization.
A web and cloud-based platform is specifically designed to support quick and easy incorporation of
new capabilities. Any improvements to the machine learning diffuse to social change organizations
automatically. Similarly, cloud-based platforms are designed in a way that allow broad distribution to
any device anywhere there is an internet connection. This method of delivery is starting to lead to
turnkey deployment and monitoring of machine learning systems.
Finally, the last component of a machine learning platform for social impact is education: teaching
and reference materials, tutorials, how-to guides, examples, etc. presented in the language of social
change organizations. It must be presented in a way that people starting at different skill levels all have
an on-ramp to the content. An important part of the education for members of social change
organizations is sparking the imagination of what’s possible using machine learning in the social impact
sector. A persona that has not come up in the book so far, a broker who bridges the gap between members
of social change organizations and the data science world by translating and aligning the concepts used
in each field, is very useful in the education component of the platform.16
Have you noticed something? All of the desirable attributes of a machine learning platform seem to
be desirable not only for empowering social change organizations, but also desirable for any
organization, including ones at the top of the pyramid. And that is the beauty of bottom-of-the-pyramid
innovation: it is good old innovation that is useful for everyone including JCN Corporation’s enterprise
customers.
Beyond the design and the ease of use of the platform, a critical aspect for you to sustainably bring
the platform and overall data science for social good program to fruition is winning the support of large
grantmaking foundations that fund social change organizations. First, the foundations must give some
sort of implicit permission to social change organizations to use the platform and provide them enough
leeway in their budgets to get started. Second, in a similar vein to international development projects
specified without the perspective of vulnerable people, there are many international development
efforts whose funding did not provide for maintenance and long-term support beyond the initial
headline. JCN Corporation will not be able to sustain a data science for social good platform you create
without grants for its maintenance, so you’ll need to line up funding. Foundations are beginning to see
the need to support technology efforts among their grantees,17 but are not yet ready to fund a platform
operated by a private corporation.
You have your work cut out for you to launch a data science for social good program at JCN
Corporation and push it along the lifecycle beyond just the initial set of projects to common algorithms
16
Youyang Hou and Dakuo Wang. “Hacking with NPOs: Collaborative Analytics and Broker Roles in Civic Data Hackathons.” In:
Proceedings of the ACM on Human-Computer Interaction 1.CSCW (Nov. 2017), p. 53.
17
Michael Etzel and Hilary Pennington. “Time to Reboot Grantmaking.” In: Stanford Social Innovation Review. URL:
https://ssir.org/articles/entry/time_to_reboot_grantmaking, Jun. 2017.
and then a scalable platform. But with enough conviction, wherewithal, and luck, you just might be able
to pull it off. Go forth, you genuine do-gooder!18
17.4 Summary
▪ Data science for social good—using machine learning in a beneficent way—is not an application
area for machine learning, but a paradigm and value system.
▪ The goal is to empower social change organizations in the development of machine learning
systems that help uplift vulnerable people on their own terms.
▪ The decade-long experience with data science for social good has rarely yielded truly impactful
results because individual projects fail to overcome the last-mile problem.
▪ Social change organizations are typically low-resourced and need much more than just code or
a custom solution to be able to use machine learning in their operations.
▪ Machine learning platforms that are specifically designed to deskill data science needs and
minimize the effort for deployment, maintenance, and support are the solution. Such platforms
should be built around common algorithmic patterns in the social impact space that you start
seeing by conducting several projects over a lifecycle.
▪ All the attributes of trustworthy machine learning are essential in applying machine learning for
social impact, including fairness, robustness, explainability, and transparency.
18
William D. Coplin. How You Can Help: An Easy Guide to Doing Good Deeds in Your Everyday Life. New York, New York, USA: Routledge, 2000.
18
Filter Bubbles and Disinformation
Imagine that you’re a technology executive who is unhappy with the stranglehold that a handful of
companies have on how people receive information via ad-supported social media timelines,
recommendations, and search engines. Your main issue with these ‘big tech’ companies is the filter
bubbles, disinformation, and hate speech festering on their platforms that threaten a functioning non-
violent society. Many of these phenomena result from machine learning systems that help the platforms
maximize engagement and revenue. Economists use the term negative externalities for such effects, which
extend beyond revenue maximization for the company and are detrimental to society. According to your
values, recommendation and search to maximize engagement are problems that should not even be
worked on in their currently prevailing paradigm because they have consequences for several of the
items listed in Chapter 14 (e.g. disinformation, addiction, surveillance state, hate and crime).
“The best minds of my generation are thinking about how to make people click ads.
That sucks.”
In recent months, you have seen an upstart search engine enter the fray that is not ad-driven and is
focused on ‘you,’ with ‘you’ referring to the user and the user’s information needs. This upstart gives you
a glimmer of hope that something new and different can possibly break through the existing
monopolies. However, your vision for something new is not centered on the singular user ‘you’, but on
plural society. Therefore, you start planning a (fictional) search engine and information
recommendation site of your own with a paradigm that aims to keep the negative externalities of the
current ad/engagement paradigm at bay. Recalling a phrase that the conductor of your symphonic band
used to say before concerts: “I nod to you and up we come,” you name your site Upwe.com.
Does Upwe.com have legs? Can a search engine company really focus on serving a broader and
selfless purpose? Many would argue that it is irrational neither to focus solely on serving the user (to
make it attractive for paying subscribers) nor to maximize the platform’s engagement (to maximize the
company’s ad revenue). However, as you learned in Chapter 15, corporations are already moving toward
broadening their purpose from maximizing shareholder value to maximizing the value for a larger set
of stakeholders. And by focusing on the collective ‘we,’ you are appealing to a different kind of ethics:
relationality instead of rationality. Relational ethics asks people to include considerations beyond
themselves (which is the scope of rational ethics), especially their relationships with other people and
the environment, in determining the right action. One effect of relational thinking is bringing negative
externalities to the forefront and mitigating an extractive or colonial mindset, including in the context
of machine learning.1
So coming back to the original question: is Upwe.com tenable? Does your vision for it have any hope?
In this chapter, you’ll work toward an answer by:
▪ sketching the reasons why society is so reliant on the digital platforms of ‘big tech,’
▪ examining the paradigm that leads to echo chambers, disinformation, and hate speech in greater
detail, and
▪ evaluating possible means for countering the negative externalities.
1
Sabelo Mhlambi. “From Rationality to Relationality: Ubuntu as an Ethical and Human Rights Framework for Artificial Intelli-
gence Governance.” Harvard University Carr Center Discussion Paper Series 2020-009, Jul. 2020.
2
Matthew Hutson. “What Do You Know? The Unbearable Vicariousness of Knowledge.” In: MIT Technology Review 123.6
(Nov./Dec. 2020), pp. 74–79.
closed-box system bringing it to you.3 Nonetheless, people cannot entirely abdicate their epistemic
responsibility to try to verify either the knowledge itself, its source, or the system bringing it forward.
From the very beginning of the book, the trustworthiness of machine learning systems has been
equated to the trustworthiness of individual other people, such as coworkers, advisors, or decision
makers. This framing has followed you throughout the journey of becoming familiar with trustworthy
machine learning: going from competence and reliability to interaction and selflessness. However, when
discussing the trustworthiness of the machine learning backing information filtering in digital
platforms, this correspondence breaks down. To the general public, the machine learning is beyond the
limits of their knowledge and interaction to such a degree that the machine learning model is not an
individual person any longer, but an institution like a bank, post office, or judicial system. It is just there.
Members of the public are not so much users of machine learning as they are subject to machine
learning.4 And institutional trust is different from interpersonal trust.
Public trust in institutions is not directed towards a specific aspect, component or interaction with
the institution, but is an overarching feeling about something pervasive. The general public does not go
in and test specific measures of the trustworthiness of an institution as they might with a person, i.e.
assessing a person’s ability, fairness, communication, beneficence, etc. (or even care to know the results
of such an assessment). Members of the public rely on the system itself having the mechanisms in place
to ensure that it is worthy of trust. The people’s trust is built upon mechanisms such as governance and
control described in Chapter 14, so these mechanisms need to be understandable and not require
epistemic dependence. To understand governance, people need to understand and agree with the values
that the system is working to align itself toward. Thus as you envision Upwe.com, you must give your
utmost attention to getting the paradigm right and making the values understandable to anyone. Putting
these two things in place will enable the public to make good on their epistemic responsibility.
Remember from Chapter 15 that intervening on the paradigm is the most effective leverage point of a
system and is why the focus of this chapter is on the paradigm rather than on tackling negative
externalities more directly, such as methods for detecting hate speech.
3
Boaz Miller and Isaac Record. “Justified Belief in a Digital Age: On the Epistemic Implications of Secret Internet Technolo-
gies.” In: Episteme 10.2 (Jun. 2013), pp. 117–134.
4
Bran Knowles and John T. Richards. “The Sanction of Authority: Promoting Public Trust in AI.” In: Proceedings of the ACM Con-
ference on Fairness, Accountability, and Transparency. Mar. 2021, pp. 262–271.
First, let’s see how single-mindedly valuing engagement leads to the harms of echo chambers,
disinformation, and hate speech. The end of the section will briefly mention some alternatives to
engagement maximization.
“When you see perspectives that are different from yours, it requires thinking and
creates aggravations. As a for-profit company that's selling attention to advertisers,
Facebook doesn't want that, so there's a risk of algorithmic reinforcement of
homogeneity, and filter bubbles.”
In an echo chamber, a person is repeatedly presented with the same information without any differences
of opinion. This situation leads to their believing in that information to an extreme degree, even when it
is false. Filter bubbles often lead to echo chambers. Although filter bubbles may be considered a helpful
act of curation, by being in one, the user is not exposed to a diversity of ideas. They suffer from epistemic
inequality.5 Recall from Chapter 16 that diversity leads to information elaboration—slowing down to think
about contentious issues. Thus, by being in a filter bubble, people are apt to take shortcuts, which can
lead to a variety of harms.
5
Shoshana Zuboff. “Caveat Usor: Surveillance Capitalism as Epistemic Inequality.” In: After the Digital Tornado. Ed. by Kevin
Werbach. Cambridge, England, UK: Cambridge University Press, 2020.
6
Laurent Itti and Pierre Baldi. “Bayesian Surprise Attracts Human Attention.” In: Vision Research 49.10 (Jun. 2009), pp. 1295–
1306.
7
Lav R. Varshney. “Limit Theorems for Creativity with Intentionality.” In: Proceedings of the International Conference on Computa-
tional Creativity. Sep. 2020, pp. 390–393.
engaged on a platform. Moreover, people spread false news significantly faster on social media
platforms than true news.8
Clickbait is one example of false, surprising, and attractive content that drives engagement. It is a
kind of misinformation (a falsehood that may or may not have been deliberately created to mislead) and
also a kind of disinformation (a falsehood that was purposefully created to mislead). In fact, ‘big tech’
companies have been found to finance so-called clickbait farms to drive up their platforms’
engagement.9
Another type of disinformation enabled by machine learning is deepfakes. These are images or videos
created with the help of generative modeling that make it seem as though a known personality is saying
or doing something that they did not say or do. Deepfakes are used to create credible messaging that is
false.
Although some kinds of misinformation can be harmless, many kinds of disinformation can be
extremely harmful to individuals and societies. For example, Covid-19 anti-vaccination disinformation
on social media in 2021 led to vaccine hesitancy in many countries, which in turn led to greater spread
of the disease and more deaths. Other disinformation has political motives and is meant to destabilize nations.
8. Soroush Vosoughi, Deb Roy, and Sinan Aral. “The Spread of True and False News Online.” In: Science 359.6380 (Mar. 2018), pp. 1146–1151.
9. Karen Hao. “How Facebook and Google Fund Global Misinformation.” In: MIT Technology Review. URL: https://www.technologyreview.com/2021/11/20/1039076/facebook-google-disinformation-clickbait, 2021.
Messages on social media platforms and actions in the real world are closely intertwined.10 Hate
speech, offensive speech, and messages inciting violence on digital platforms foment many harms in
the physical world. Several recent instances of hateful violence, such as the violence against the
Rohingya minority in Myanmar in 2018 and the attack on the United States Capitol Building in 2021,
have been traced back to social media.
18.2.4 Alternatives
You’ve seen how maximizing engagement leads to negative externalities in the form of real-world
harms. But are there proven alternatives you could use instead in the machine learning algorithm
running Upwe.com’s information retrieval system? Partly because researchers within ‘big tech’ have few
incentives to work on the problem, and because researchers elsewhere lack the ability to try out or
implement any ideas they may have, alternatives have been few and far between.11
Nevertheless, as you develop the paradigm for Upwe.com, the following are a few concepts that you
may include. You may want the platform to maximize the truth of the factual information that the user
receives. You may want the platform to always return content from a diversity of perspectives and
expose users to new relations with which they may form a diverse social network.12 You may wish to
maximize some longer-term enjoyment for the user that they themselves might not realize is
appropriate for them at the moment; this paradigm is known as extrapolated volition. Such concepts may
be pursued as pre-processing, during model training, or as post-processing, but they would be limited
to only those that you yourself came up with.13 An even better approach is a participatory value
alignment process that includes members of marginalized groups to come up with all of the concepts
that should go into Upwe.com’s paradigm.
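As one sketch of how a diversity concept could enter as post-processing (the items, scores, similarity proxy, and weight lam below are all assumptions made for illustration, not the method of any cited paper), an engagement-ranked candidate list can be greedily re-ranked with a redundancy penalty in the spirit of maximal marginal relevance:

```python
# Minimal post-processing sketch (illustrative only): greedily re-rank an
# engagement-ordered candidate list with a diversity penalty.
from dataclasses import dataclass

@dataclass
class Item:
    item_id: str
    topic: str
    engagement_score: float   # platform's predicted engagement for this user

def similarity(a: Item, b: Item) -> float:
    # Crude proxy: items on the same topic are treated as maximally similar.
    return 1.0 if a.topic == b.topic else 0.0

def rerank(candidates, k=3, lam=0.5):
    """Maximal-marginal-relevance-style selection: pick the item maximizing
    engagement minus lam times its redundancy with items picked so far."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr_score(item):
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            return item.engagement_score - lam * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected

candidates = [
    Item("a", "politics", 0.95), Item("b", "politics", 0.94),
    Item("c", "science", 0.70),  Item("d", "sports", 0.60),
]
# Pure engagement ranking would return a, b, c; the diversity penalty yields a, c, d.
print([item.item_id for item in rerank(candidates)])
```

The weight lam is exactly the kind of value-laden choice that a participatory process, rather than a single developer, should set.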
Furthermore, you need to have transparency in the paradigm you adopt so that all members of
society can understand it. Facts and factsheets (covered in Chapter 13) are useful for presenting the
lower-level test results of individual machine learning models, but not so much for institutional trust
(except as a means for trained auditors to certify a system). CP-nets (covered in Chapter 14) are
understandable representations of values, but do not reach all the way back to the value system or
paradigm. It is unclear how to document and report the paradigm itself; this is a topic you should
experiment with as you work on Upwe.com.
10. Alexandra Olteanu, Carlos Castillo, Jeremy Boy, and Kush R. Varshney. “The Effect of Extremist Violence on Hateful Speech Online.” In: Proceedings of the AAAI International Conference on Web and Social Media. Stanford, California, USA, Jun. 2018, pp. 221–230.
11. Ivan Vendrov and Jeremy Nixon. “Aligning Recommender Systems as Cause Area.” In: Effective Altruism Forum. May 2019.
12. Jianshan Sun, Jian Song, Yuanchun Jiang, Yezheng Liu, and Jun Li. “Prick the Filter Bubble: A Novel Cross Domain Recommendation Model with Adaptive Diversity Regularization.” In: Electronic Markets (Jul. 2021).
13. Jonathan Stray, Ivan Vendrov, Jeremy Nixon, Steven Adler, and Dylan Hadfield-Menell. “What Are You Optimizing For? Aligning Recommender Systems with Human Values.” In: Proceedings of the ICML Participatory Approaches to Machine Learning Workshop. Jul. 2020.
1. High reach content disclosure. Companies must regularly report on the content, source, and reach
of pieces of knowledge that receive high engagement on their platform.
2. Content moderation disclosure. Companies must report the content moderation policies of their
platform and provide examples of moderated content to qualified individuals.
3. Ad transparency. Companies must regularly report key information about every ad that appears
on their platform.
4. Superspreader accountability. People who spread disinformation that leads to real-world negative
consequences are penalized.
5. Communications decency control on ads and recommendation systems. Make companies liable for
hateful content that spreads on their platform due to the information filtering algorithm, even if
it is an ad.
Many of these recommended regulations enforce transparency since it is a good way of building
institutional trust. However, they do not provide governance on the paradigm underlying the platform
because it is difficult to measure the paradigm. Nevertheless, they will control the paradigm to some
extent. If social media platforms are deemed public utilities or common carriers, like telephone and
electricity providers, then even more strict regulations are possible.
14. Brian Martin. Nonviolence versus Capitalism. London, England, UK: War Resisters’ International, 2001.
15. Daron Acemoglu. “AI’s Future Doesn’t Have to Be Dystopian.” In: Boston Review. URL: https://bostonreview.net/forum/ais-future-doesnt-have-to-be-dystopian/, 2021.
16. Emily Saltz, Soubhik Barari, Claire Leibowicz, and Claire Wardle. “Misinformation Interventions are Common, Divisive, and Poorly Understood.” In: Harvard Kennedy School Misinformation Review 2.5 (Sep. 2021).
17. Throughout the chapter, the governance of platforms is centered on the needs of the general public, but the needs of legitimate content creators are just as important. See: Li Jin and Katie Parrott. “Legitimacy Lost: How Creator Platforms Are Eroding Their Most Important Resource.” URL: https://every.to/means-of-creation/legitimacy-lost, 2021.
18. Katie Couric, Chris Krebs, and Rashad Robinson. Aspen Digital Commission on Information Disorder Final Report. Nov. 2021.
Importantly, if you have designed Upwe.com to already be on the right side of regulations when they
become binding, you will have a leg up on other platforms and might have a chance of being sustainable.
In parallel, you should also try to push for direct ways of controlling the paradigm rather than
controlling the negative externalities because doing so will be more powerful. Regulations are one
recognized way of limiting negative externalities; Pigouvian taxes are the other main method recognized
by economists. A Pigouvian tax is precisely a tax on a negative externality to discourage the behaviors
that lead to it. A prominent example is a tax on carbon emissions levied on companies that pollute the
air. In the context of social media platforms, the tax would be on every ad that was delivered based on a
targeting model driven by machine learning.19 Such a tax would directly push ‘big tech’ companies to
change their paradigm while leaving the Upwe.com paradigm alone.
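To see the incentive effect in rough numbers (every figure below is invented purely for illustration), a small per-impression Pigouvian tax on ads delivered by ML targeting can flip the profit comparison toward untargeted, contextual advertising:

```python
# Back-of-the-envelope sketch; every number here is made up for illustration.
def profit(impressions: int, revenue_per_impression: float,
           tax_per_impression: float = 0.0) -> float:
    return impressions * (revenue_per_impression - tax_per_impression)

impressions = 1_000_000
targeted_rev = 0.012      # assumed revenue per ML-targeted ad impression
contextual_rev = 0.009    # assumed revenue per untargeted (contextual) impression

for tax in (0.000, 0.002, 0.004):   # Pigouvian tax applied only to targeted ads
    print(f"tax ${tax:.3f}/impression: "
          f"targeted ${profit(impressions, targeted_rev, tax):,.0f} "
          f"vs contextual ${profit(impressions, contextual_rev):,.0f}")
```

Once the per-impression tax exceeds the targeting premium (here $0.003), targeted delivery stops being the profit-maximizing choice, which is exactly the behavioral push a Pigouvian tax is meant to create.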
Seeing out your vision of an Upwe.com that contributes to the wellbeing of all members of society
may seem like an insurmountable challenge, but do not lose hope. Societal norms are starting to push
for what you want to build, and that is the key.
18.4 Conclusion
▪ There is so much and such complicated knowledge in our world today that it is impossible for
anyone to understand it all, or even to verify it. We all have epistemic dependence on others.
▪ Much of that dependence is satisfied by content on the internet that comes to us on information
platforms filtered by machine learning algorithms. The paradigm driving those algorithms is
maximizing the engagement of the user on the platform.
▪ The engagement maximization paradigm inherently leads to side effects such as filter bubbles,
disinformation, and hate speech, which have real-world negative consequences.
▪ The machine learning models supporting content recommendation on the platforms are so
disconnected from the experiences of the general public that it does not make sense to focus on
the models’ interpersonal trustworthiness, which has been the definition of trustworthiness
throughout the book. An alternative notion of institutional trustworthiness is required.
▪ Institutional trustworthiness is based on governance mechanisms and their transparency, which
can be required by government regulations if there is enough societal pressure for them.
Transparency may help change the underlying paradigm, but taxes may be a stronger direct
push.
▪ A new paradigm based on relational ethics is needed, which centers truth, a diversity of
perspectives, and wellbeing for all.
19. Paul Romer. “A Tax To Fix Big Tech.” In: New York Times (7 May 2019), p. 23.
Shortcut
Even though I have admonished you throughout the entire book to slow down, think, and not take
shortcuts, I know some of you will still want to take shortcuts. Don’t do it. But if you’re adamant about it
and are going to take a shortcut anyway, I might as well equip you properly.
Here is a picture showing how I structured the book, going from bottom to top. This direction
makes sense pedagogically because you need to understand the concepts at the bottom before you can
understand the nuances of the concepts that are higher up. For example, it is difficult to understand
fairness metrics without first covering detection theory, and it is difficult to understand value
elicitation about fairness metrics without first covering fairness. However, if you want to jump right
into things, you should notionally start at the top and learn things from below as you go along.
Accessible caption. A stack of items in 8 layers. Top layer: ethics principles; layer 2: governance; layer
3: transparency, value alignment; layer 4: interpretability/explainability, testing, uncertainty
quantification; layer 5: distributional robustness, fairness, adversarial robustness; layer 6: detection
theory, supervised learning, causal modeling; layer 7: data biases, data consent, data privacy; bottom
layer: probability and possibility theory, data. An upward arrow is labeled pedagogical direction. A
downward arrow is labeled notional shortcut.
Preparation Steps:
1. Assemble socioculturally diverse team of problem owners, data engineers and model validators
including members with lived experience of marginalization.
2. Determine ethics principles, making sure to center the most vulnerable people.
3. Set up data science development and deployment environment that includes fact flow tool to
automatically collect and version-control digital artifacts.
4. Install software libraries in environment for testing and mitigating issues related to fairness
and robustness, and computing explanations and uncertainties (a minimal setup sketch follows this list).
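As a minimal sketch of what preparation step 4 might look like (assuming, for illustration, the open-source AIF360 toolkit and its BinaryLabelDataset API; substitute whichever fairness, robustness, explainability, and uncertainty libraries your environment standardizes on), you can smoke-test the installation by computing a single group fairness metric on a tiny hand-made dataset:

```python
# Smoke test for a freshly provisioned environment (illustrative; assumes the
# aif360 package, e.g. installed alongside aix360, uq360, and the Adversarial
# Robustness Toolbox).
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

# Tiny hand-made dataset: 'sex' is the protected attribute, 'label' the outcome.
df = pd.DataFrame({
    "sex":   [0, 0, 0, 1, 1, 1],   # 0 = unprivileged group, 1 = privileged group
    "label": [0, 1, 0, 1, 1, 0],
})
dataset = BinaryLabelDataset(df=df, label_names=["label"],
                             protected_attribute_names=["sex"])
metric = BinaryLabelDatasetMetric(dataset,
                                  unprivileged_groups=[{"sex": 0}],
                                  privileged_groups=[{"sex": 1}])

# Statistical parity difference: favorable-outcome rate of the unprivileged
# group minus that of the privileged group (0 means parity).
print("statistical parity difference:", metric.statistical_parity_difference())
```

If this prints a number (about −0.33 for the toy data), the fairness tooling is wired up; analogous one-line smoke tests can exercise the robustness, explainability, and uncertainty libraries.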
Lifecycle Steps:
1. Identify problem.
2. Conduct facilitated participatory design session including panel of diverse stakeholders to answer the following four questions according to ethics principles:
5. Ensure that dataset has been obtained with consent and does not violate privacy standards.
6. Understand semantics of dataset in detail, including potential unwanted biases.
7. Prepare data and conduct exploratory data analysis with a particular focus on unwanted biases.
11. Deploy model, compute explanations or uncertainties along with predictions if of concern, and
keep monitoring model for metrics of trustworthiness of concern.