Data Science Ethics Concepts Techniques and Cautionary Tales


Journal of the American Statistical Association

ISSN: 0162-1459 (Print) 1537-274X (Online) Journal homepage: amstat.tandfonline.com/journals/uasa20

Data Science Ethics: Concepts, Techniques and Cautionary Tales
David Martens, Oxford, UK: Oxford University Press, 2022, xii + 255 pp.,
$80.00(H), ISBN 978-0-19-284726-3.

Sabrina Giordano

To cite this article: Sabrina Giordano (2023) Data Science Ethics: Concepts, Techniques and
Cautionary Tales, Journal of the American Statistical Association, 118:541, 774-776, DOI:
10.1080/01621459.2022.2163898

To link to this article: https://doi.org/10.1080/01621459.2022.2163898

Published online: 05 Apr 2023.


JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2023, VOL. 118, NO. 541, 774–777
https://doi.org/10.1080/01621459.2022.2163898

BOOK REVIEWS

Data Science Ethics: Concepts, Techniques and Cautionary Tales, David Martens. Reviewed by Sabrina Giordano, 775
Handbook of Measurement Error Models, Grace Y. Yi, Aurore Delaigle, and Paul Gustafson. Reviewed by Kwun Chuen Gary Chan, 776

© 2023 American Statistical Association



Data Science Ethics: Concepts, Techniques and Cautionary Tales, David Martens, Oxford, UK: Oxford University Press, 2022, xii+255 pp., $80.00(H), ISBN 978-0-19-284726-3.

We live in the era of digital transformation, which has magnified the ability to access all types of information and has empowered data science to turn any kind of data into business, security, health, and economic advantages, and much more. However, this comes at the expense of an increasing invasion of privacy and a widespread experience of data-driven decisions that are often unexplained and sometimes discriminatory. The debate on what is right and what is wrong to do with data is still open, and this book makes an outstanding contribution to it.

Throughout the book, the cautionary tales educate readers on the consequences for people, companies, and society of overlooking ethical aspects. In addition, several discussion exercises stimulate questions about the right balance between the usefulness of data science practice and its ethical implications.

The content of the book is structured in seven chapters. The introduction (Chapter 1) outlines the role of each chapter as consecutive steps in a data science project: from collecting and processing data to modeling it and making use of the results. Each stage entails specific ethical issues, which the author evaluates against three criteria: Fairness, Accountability, and Transparency (FAT). Within the FAT framework, every stage has to be fair, in terms of privacy and discrimination against sensitive groups (defined, for example, by gender, race, or religion), and transparent, in relation to the way data is collected, used, and made accessible, and to the clear explanation of model predictions and the consequences of their use in practice. Accountability relates to demonstrable measures of effective fairness and transparency. What is meant by fairness and transparency changes according to the perspective of the individual, depending on whether he or she is a manager, a data scientist, a person on whom the information is collected (data subject), or a person to whom the model is applied (model subject). The three ethical criteria and the four roles of the subjects are the keys to reading all chapters of the book.

The ethical aspects of the data gathering process are the core topics of Chapter 2 and raise questions about the type of data to collect and use, for which purposes, and for how long to keep it available. Answers range from the legal principles that protect privacy as a human right to the key cryptographic mechanisms of data protection, such as encryption, hashing, and obfuscation, and the techniques of differential privacy, where noise is added to data before use. Different scenarios and discussion points present the thin line between what the user should know (with explicit informed consent) and what can be allowed for the legitimate interest of controllers in assessing, for example, health risk or security. The need for a balance between privacy and security is discussed by uncovering the potentially dangerous consequences of government backdoors to access digital personal data, which are meant to increase citizens’ security while exposing them to a high risk of abuse. Bias in data is also presented as a fairness issue (unfair representativeness of the population or of certain sensitive groups), whereas ethical issues, which historically and currently arise from the classical method of collecting data on individuals, are examined at length with ad hoc cautionary tales and case studies.

After gathering data, fairness issues of privacy violation and discrimination can still emerge, and ethical data preprocessing methods are therefore needed, as illustrated in Chapter 3. Removing personal identifiers from a dataset does not eliminate the risk of reidentifying persons or unveiling their sensitive (sexual, political, religious, etc.) information. On the one hand, grouping data, making continuous variables discrete, suppressing some values, or applying techniques like k-anonymity, l-diversity, and t-closeness helps to reduce the probability of linking a person to a specific data instance. On the other hand, these steps unfortunately reduce the informative content of the datasets, thereby diminishing the predictive performance of models. Highly appropriate real cases and cautionary tales warn us of the problem of reidentifying a person or revealing a sensitive attribute of a person based on information that can potentially be obtained through additional sources, for instance, external datasets, locations or webpages visited, or social media actions. Another objective of the proposed ethical data preprocessing is to measure and remove the bias against sensitive groups that is potentially present in the original dataset, as a means to prevent the results of the prediction model, applied to biased data, from being discriminatory. For practical purposes, measures of dataset fairness and methods to remove such bias are provided and exemplified.

Chapter 4 delves into the ethical aspects of modeling related to fairness (privacy, discrimination) and transparency (explainability). The privacy-preserving methodologies mentioned in Chapter 4 consist of adding noise to the model outcomes, analyzing encrypted data directly in a cloud computing service, and performing either a joint data analysis among multiple parties without sharing data, or a deep learning algorithm with a centralized model that uses data from multiple clients. Discrimination-aware approaches both clarify how to measure potential discrimination against sensitive groups in the model predictions, and provide a range of solutions to detect and eliminate bias during model building by looking for a tradeoff between model accuracy and fairness. Cautionary tales on historical ethnic discrimination show that data modeling is not always free of unfair practices. The third part of the chapter is dictated by the need to justify a decision made based on a prediction model without having to say: “Computer algorithm says so!” All subjects have the right to know why a decision has been made about them (explanation of instance-based prediction) and managers often want to understand how the prediction model
comes to its decisions over a large dataset (global explanation). Therefore, a set of examples in the chapter highlights the need for explainable and comprehensible model predictions in order to avoid skepticism and reluctance to use the model to make a real decision, to identify errors in the model itself, and to provide further directions to improve its performance.

Chapter 5 emphasizes the importance of an ethical evaluation of the model. It encourages the use of appropriate measures of predictive performance (e.g., showing misclassification rates as well as accuracy metrics), of fairness (e.g., assessing the privacy of the dataset and reporting transparently the sensitive groups involved), and of the extent to which an explanation of the model can be provided. The author’s criticism of the unethical use of data (data dredging) and interpretation of results (p-value hacking, missed multiple comparisons) suggests that researchers should report outcomes (good and bad) transparently and ensure reproducibility. With cautionary tales and discussion points, the author makes one reflect on ethical conduct in line with the principles of research integrity.

The last stage is model deployment, which is not exempt from ethical concerns, some of which are discussed in Chapter 6. This part of the book draws the reader’s attention to the following issues: access to the data science system can be, for various reasons, limited, and this constraint can give power to those who have access; the predictions generated by the model may lead to different treatment of different people; models may be vulnerable and lend themselves to dishonest use, thereby negatively affecting people and society. Such aspects highlight the need for a data science ethics policy and the advisability of creating an ad hoc committee to ensure its implementation.

The author composed the book as a sequence of questions, raised by the continuing need to derive knowledge from data while respecting its protection, and as a set of techniques and measures in response to them. Nevertheless, the book is not a cookbook on what to do to be ethically correct in each step of a data science practice; it reveals the ethical implications of data science applications and proposes a range of solutions, highlighting their merits and risks, in the ongoing search for a balance between the practical utility of data analysis and compliance with ethically right choices. As a matter of fact, ethical data science is not a checklist but a way of thinking and acting when working on data.

The book highlights the author’s ability to draw attention to ethical concerns through real-life examples. In fact, the most striking ones set precedents that often caused milestone changes in company and governmental choices in the direction of an ethical practice of data science. Moreover, in all chapters, ethical concerns are introduced by an opening story, and the underlying concepts are presented by immersing the reader in existing cases linked to world-famous company names, but also to ordinary people with whom the reader can identify.

The book does not outline in-depth technical, mathematical, and computational details, but the bibliographical references are timely and give a fairly complete overview of the measures and techniques currently on offer. Furthermore, the legal references to European and other legislation highlight regulatory developments.

The book is suitable for students in data science and business, data scientists who transform data into knowledge and innovation, or managers who derive business and competitiveness from data. Yet, this is a book aimed at all those who want researchers, companies, and governments to be ethically responsible when making decisions using their own data and for purposes that might involve them. Each of us can become a data subject and a model subject; thus, each of us should read this book to become aware of the ethical thinking that should orient data-driven decisions, which affect us much more closely than we might expect.

Sabrina Giordano
Department of Economics, Statistics and Finance,
University of Calabria, Cosenza, Italy
sabrina.giordano@unical.it

Handbook of Measurement Error Models, Grace Y. Yi, Aurore Delaigle, and Paul Gustafson, eds. Boca Raton, FL: Chapman & Hall/CRC Press, 2022, xiv+577 pp., $270.00(H), ISBN: 978-1-138-10640-6.

This is a new addition to the Chapman and Hall/CRC Handbooks of Modern Statistical Methods Series, which has 27 volumes as of December 2022. This volume focuses on the topic of measurement error, which appears ubiquitously in many practical problems exemplified in the book. Unfortunately, there has not been a widespread focus on measurement error in graduate education, leaving the topic relatively underappreciated by general audiences. However, I believe this handbook has an appeal to most statisticians due to its breadth, as I will explain by going through the chapter contents.

The handbook has a total of 24 chapters organized into seven parts, each containing two to five chapters. The introductory part contains two chapters. The first chapter starts with an overview of the challenges of covariate measurement error problems and includes a history of the methodological development since the 1980s, citing an extensive list of key references. Chapter 2 discusses the impact of ignoring measurement error under various measurement error assumptions and includes both continuous and categorical predictors. The two chapters lay the groundwork for the rest of the discussions.

Since the measurement error problem is inherently related to latent variable models, identifiability is a key challenge and is discussed in Part 2. Chapter 3 discusses identifiability results for covariate measurement error in linear and nonlinear models. When point identification is lacking, set identification in terms of bounds can be attained and often has tractable solutions for discrete data, which is discussed in Chapter 4. Chapter 5 discusses identification by collecting an additional instrumental variable, one that correlates with the error-free variable but does not correlate with the regression and measurement errors.

Part 3 surveys parametric and semiparametric models for handling measurement error. Chapter 6 discusses likelihood methods and their implementation using the expectation-maximization algorithm. Chapter 7 focuses on regression calibration, which replaces the mismeasured covariate by a prediction from a regression model for the mismeasured covariate. Chapter 8
