
Information Processing and Management 58 (2021) 102646


Fairness metrics and bias mitigation strategies for rating predictions


Ashwathy Ashokan a,∗, Christian Haas a,b

a University of Nebraska at Omaha, 6001 Dodge St, Omaha, NE, 68182, USA
b WU Vienna University of Economics and Business, Vienna, Austria

ARTICLE INFO

Keywords: Recommender systems; Fairness metrics; Bias mitigation; Algorithmic fairness

ABSTRACT

Algorithm fairness is an established line of research in the machine learning domain with substantial work, while the equivalent in the recommender system domain is relatively new. In this article, we consider rating-based recommender systems, which model the recommendation process as a prediction problem. We consider different types of biases that can occur in this setting, discuss various fairness definitions, and also propose a novel bias mitigation strategy to address potential unfairness in a rating-based recommender system. Based on an analysis of fairness metrics used in machine learning and a discussion of their applicability in the recommender system domain, we map the proposed metrics from the two domains and identify commonly used concepts and definitions of fairness. Finally, to address unfairness and potential bias against certain groups in a recommender system, we develop a bias mitigation algorithm and conduct case studies on one synthetic and one empirical dataset to show its effectiveness. Our results show that unfairness can be significantly lowered through our approach and that bias mitigation is a fruitful area of research for recommender systems.

1. Introduction

Recommender systems provide personalized recommendations to users and are applied in many different scenarios such as
entertainment, social-networking, and item recommendation. Based on user preferences, behavior, and constraints, they create a
list of recommendations that are considered of relevance to the user (Ricci, Rokach, & Shapira, 2011). Due to their widespread
use, different algorithms for recommender systems have been developed over time, with the most popular ones using collaborative
filtering approach, content-based approach, or a hybrid of the two. In the current era of data explosion, recommender systems are
an example of data-driven approaches that are increasingly utilized in many decision processes. For example, students deciding
what course to enroll in based on a course recommendation system, users deciding which brand of product to purchase from
an e-commerce website based on product recommendations, or researchers choosing an article to read based on document
recommendations.
Machine learning and related approaches are being increasingly used in this digital era, especially in systems that use and
process big data, including recommender systems. Recommender systems are evaluated based on how useful or relevant the provided
recommendation was to the user. Some of the conventional methods used for evaluating recommender systems are adapted from
the machine learning (classification) domain. These include metrics such as accuracy, root mean square error (RMSE), mean average precision at K (MAP@K), mean average recall at K (MAR@K), etc. (Ricci et al., 2011; Shi et al., 2012). Recently, a new concern about
the decision-making ability of the machine learning systems (Dwork, Hardt, Pitassi, Reingold, & Zemel, 2012; Feldman, Friedler,
Moeller, Scheidegger, & Venkatasubramanian, 2015; Kusner, Loftus, Russell, & Silva, 2017; Zafar, Valera, Gomez Rodriguez, &

∗ Corresponding author.
E-mail address: [email protected] (A. Ashokan).

https://doi.org/10.1016/j.ipm.2021.102646
Received 1 December 2020; Received in revised form 2 April 2021; Accepted 19 May 2021
Available online 12 June 2021
0306-4573/© 2021 Elsevier Ltd. All rights reserved.

Gummadi, 2017a), especially when dealing with human subject data, has led to the rise of a new research area called Algorithmic
Fairness. This research area focuses on developing fairness definitions and measures as well as bias mitigation strategies when existing
algorithms and decision processes are found to be biased. Algorithmic bias, hereafter referred to simply as bias, is a systematic and repeatable error in the output of an algorithm; it can arise from factors such as an unrepresentative training sample, incorrect or incomplete labeling of training data, or data that reflects historical inequalities (Lee,
Resnick, & Barton, 2019). Biases in various forms are prevalent in these data-driven systems. Examples include gender bias embodied
in natural language processing algorithms (Caliskan, Bryson, & Narayanan, 2017), job recommendations (Deshpande, Pan, & Foulds,
2020), political bias in news recommendation (Bernhardt, Krasa, & Polborn, 2008), and racial bias in ad recommendations (Sweeney, 2013). Search and recommender systems are two further examples of data-hungry applications where bias is created
and propagated on the Web, potentially creating a vicious circle of bias (Baeza-Yates, 2020).
Similar to the machine learning domain, the concern of fairness is highly relevant in recommender systems; for example, bias can
exist against certain groups of users, categories of items, etc. A more specific example is potential gender-based discrimination in STEM course recommendation (Pollack, 2013), or possible bias in item recommendation (Zhu, Wang, & Caverlee, 2020) based on the age or gender of the user or the demographics of the seller. Other examples found in the literature are restaurant recommendation systems not recommending Mexican restaurants often enough, or bias in search results, such as a scholarly article search engine ranking articles from certain publishers higher (Ekstrand, Burke, & Diaz, 2019).
As defined earlier, algorithmic fairness is a stream of research that defines, analyzes, and mitigates these potential biases, with
prominent applications in machine learning tasks such as classification and regression (hereafter referred to simply as machine learning tasks). While there is a substantial amount of work on algorithmic fairness and bias mitigation in machine learning (in the classification and regression setting), the question of fairness and bias in recommender systems (a prediction and ranking problem) is relatively
novel. Several fairness definitions and metrics have been suggested for recommender systems (e.g., Burke, Sonboli, & Ordonez-
Gauger, 2018; Lee et al., 2014; Verma & Rubin, 2018; Yao & Huang, 2017; Zhu, Hu, & Caverlee, 2018), yet a systematic mapping
of different types of fairness and bias in recommender systems is currently missing. Due to the body of existing work in the machine
learning setting, we aim to leverage these machine learning-based definitions and analyze their applicability to model and measure
fairness in recommender systems. Specifically, we aim to answer the following two research questions:

• RQ 1 – What concepts from Algorithmic Fairness, specifically classification, can be mapped to fairness metrics in (rating-based)
recommender system settings?
• RQ 2 – Can bias mitigation approaches successfully increase different types of fairness in rating-based recommender systems?

The goal of this work is to synergize the bodies of work on algorithmic fairness in the machine learning and recommender system domains, and to advance research on recommender system fairness by proposing a novel bias mitigation approach for rating-based recommender systems.
The contributions of this article are twofold. First, we provide a mapping of various fairness concepts and definitions between
the machine learning and recommender system domains, identifying similarities and gaps. Second, we define and investigate a bias
mitigation strategy aimed at increasing fairness in recommender systems, and study its effectiveness in a use case with both synthetic
and empirical data.
The remainder of the article is structured as follows. Section 2 introduces relevant concepts and related work from recommender
systems and machine learning. Various fairness metrics from machine learning and classification as well as recommender systems
are discussed in Section 3. Section 4 discusses bias mitigation strategies and Section 5 introduces a novel bias mitigation approach to
reduce unfairness. Section 6 provides an evaluation using synthetic and empirical datasets to study the mitigation strategy. Finally,
Section 7 summarizes the results and provides an outlook on further work.

2. Background and related work

2.1. Methodology

For assessing the current state of the field of fairness in machine learning and recommender systems, a semi-systematic literature
review was conducted as illustrated in Fig. 1.
The first step in the process was to formulate a research question to define the aim of the study and map answers from the
literature. The research question that this semi-systematic literature review addresses is RQ1, presented in Section 1. It aims
to determine what concepts from classification-based fair machine learning can be mapped to fairness metrics in rating-based
recommender system settings.
The next step was the search process which involved searching and identifying literature that answers the research question.
The search process was conducted using the following keywords: ‘‘fairness’’, ‘‘algorithmic fairness’’, ‘‘recommender systems’’, ‘‘fair
recommendations’’, ‘‘Ethical AI’’, ‘‘unfair’’, ‘‘fair machine learning’’, ‘‘bias’’, ‘‘types of bias’’, ‘‘bias in machine learning’’, ‘‘fairness
and bias’’, ‘‘consequence of machine learning’’, ‘‘search engine bias’’ etc.
The third step of the literature review process involved filtering the search results for relevance by reviewing their abstracts, introductions, and conclusions. A paper was considered relevant if it contains information that answers RQ1, such as definitions of fairness and bias in the machine learning and recommender system domains, measures or metrics to quantify fairness and bias, or discussions of various bias mitigation techniques. The key references of some of the relevant papers were also reviewed. Most of the selected papers came from major conferences such as SIGIR, ITCS, SIGKDD, WWW, FairWare, FAT*, FATML, RecSys, NIPS, ICML, DTL, FATREC, FairUMAP, UMAP, DAB, CIKM, TOIS, ICWSM, ASONAM, EuroSys, ECIR, ICDM, PMLR, SIAM, WSDM, and from highly rated book and journal publishers such as Springer, Communications of the ACM, Information Retrieval Journal, etc.

Fig. 1. Illustration of the semi-systematic literature review process for RQ1.
In the final step, selected literature was critically read and examined to map out answers to RQ1. This step focuses on extracting
information related to the definitions of fairness and bias in machine learning and recommender system domain, evaluation measures
or some form of quantification of fairness/unfairness/bias, and approaches to mitigate fairness/bias.

2.2. Types, sources, and effects of bias

With the increasing use of data-driven and algorithmic decision making, it is crucial that these decisions treat all users fairly.
Services such as search engines and recommender systems that filter information are not merely neutral algorithms, since humans influence their design and filtering process. The resulting tailored results create a filter bubble effect and reflect societal and internal biases in subtle ways (Baeza-Yates, 2018).
From a general computer system perspective, a system is biased if it ‘‘systematically and unfairly discriminate[s] against certain individuals or groups of individuals in favor of others’’ (Friedman & Nissenbaum, 1996). Bias in a computing system can be of many
types — pre-existing bias, technical bias and emergent bias. Pre-existing bias reflects potential societal biases that can affect the
system design. Technical bias stems from technical limitations and their effect on the system outcome, and emergent bias can be
introduced after the system is implemented (Friedman & Nissenbaum, 1996). From an algorithmic perspective, bias in the output
of an algorithm can stem from different sources: bias in the data, or bias in the algorithm. Algorithmic bias can exist even when
there is no discrimination intent by the creators of the algorithm. When biased data is fed into algorithms we potentially get biased
outputs even though the algorithm itself might not be biased (Baeza-Yates, 2018). These biased outputs are again fed into machine learning models and get propagated, forming the vicious circle of bias (Baeza-Yates, 2020), also referred to as the feedback loop in machine learning systems (Mehrabi, Morstatter, Saxena, Lerman & Galstyan, 2019). Baeza-Yates (2020) suggests that biases are of three main types: statistical bias, cultural bias, and cognitive bias. Table 1 captures emerging definitions of some of the types of bias
that can exist in machine learning systems categorized based on their occurrence in the machine learning pipeline.
Biases in algorithmic decision making can lead to unfair predictions for users in different groups, e.g., based on gender, age,
or race. Fairness is generally used as a (often quantitative) measure to describe the effects of bias in such settings. Though many
definitions have been coined for fairness (Verma & Rubin, 2018), generically it is the concept that all groups and/or users are
treated equally; i.e. with equality or equity or justice (Baeza-Yates, 2020). For the subsequent considerations, we use the term
biased interchangeably with unfair, and unbiased interchangeably with fair.

2.3. Algorithmic fairness in machine learning

Machine learning models are used in a variety of domains, from risk modeling and decision making in insurance, admission and
success prediction in education, to credit-scoring and criminal investigation. Transparency can help identify potential biases that
get introduced in the machine learning process. Enhancing transparency makes the model more interpretable, increasing user trust
and confidence in the use of machine learning systems for decision making (Zhou & Chen, 2018).
Fairness is a key criterion in machine learning that has gained significant recent attention and is a matter of active scientific
discourse. Abdollahi and Nasraoui (2018) discuss that there are multiple ways to achieve fairness:

1. by manipulating the input data of the machine learning model,

2. by adapting the algorithms that are used to train and build the model, and

3. by adjusting the outputs/predictions that are made by the models.

Table 1
Emerging definitions of types of biases, grouped by the stage of the machine learning pipeline in which they occur.

Data Generation bias:
• Historical Bias: Bias in the data generation process due to already existing bias from socio-technical issues. (Baeza-Yates, 2018; Suresh & Guttag, 2019)
• Representation Bias: Bias created when defining and sampling a population. (Baeza-Yates, 2018; Suresh & Guttag, 2019)
• Measurement Bias: Bias created from the way a particular feature is chosen, utilized, and measured. (Suresh & Guttag, 2019)
• Population Bias: Bias due to the difference between the user population of the dataset and the original target population. (Hargittai, 2007; Olteanu, Castillo, Diaz, & Kiciman, 2016)
• Sampling Bias: Bias that arises when subgroups are not sampled at random. (Baeza-Yates, 2018; Mehrabi, Morstatter, Saxena et al., 2019)
• Simpson’s Paradox: Bias from the difference in behavior of population sub-groups when aggregated and when taken individually. (Bickel, Hammel, & O’Connell, 1975; Blyth, 1972)

Model Building and Evaluation bias:
• Evaluation Bias: Bias that happens during model evaluation. (Buolamwini, 2017; Suresh & Guttag, 2019)
• Aggregation Bias: Bias that occurs due to false assumptions about a population affecting the model’s outcome. (Suresh & Guttag, 2019)
• Popularity Bias: Bias that occurs from more popular items being exposed more. (Abdollahpouri, Mansoury, Burke, & Mobasher, 2019; Ciampaglia, Nematzadeh, Menczer, & Flammini, 2018; Introna & Nissenbaum, 2000; Jannach, Lerche, Kamehkhosh, & Jugovac, 2015; Kowald, Schedl, & Lex, 2020)
• Algorithmic Bias: Bias that gets added purely by the algorithm even when the input data is unbiased. (Baeza-Yates, 2018)
• Omitted Variable Bias: Bias that occurs from leaving out one or more important variables from the model. (Mehrabi, Morstatter, Saxena et al., 2019)
• Demographic Bias: Bias that occurs from users of different demographic groups (age and gender) being treated differently. (Drozdowski, Rathgeb, Dantcheva, Damer, & Busch, 2020; Ekstrand, Tian, Azpiazu et al., 2018)

Deployment and User Interaction bias:
• Temporal Bias: Bias that arises from behavior and population differences over time. (Olteanu et al., 2016; Tufekci, 2014)
• Behavioral Bias: Bias from differences in user behavior across datasets. (Miller et al., 2016; Olteanu et al., 2016)
• Content Production Bias: Structural, lexical, semantic, and syntactic differences in the content generated by users. (Nguyen, Gravel, Trieschnigg, & Meder, 2013; Olteanu et al., 2016)
• Linking Bias: Bias that arises when network attributes derived from user activities misrepresent the true user behavior. (Mehrabi, Morstatter, Peng, & Galstyan, 2019; Olteanu et al., 2016; Wilson, Boe, Sala, Puttaswamy, & Zhao, 2009)
• Presentation Bias: Bias that occurs from the way information is presented. (Baeza-Yates, 2018)
• Social Bias: Bias that occurs from other people influencing one’s judgment. (Baeza-Yates, 2018; Wang & Wang, 2014)
• Emergent Bias: Bias arising due to differences in the behavior of the real users and the users in the dataset. (Friedman & Nissenbaum, 1996)
• Observer Bias: Bias that gets introduced when a researcher inadvertently projects their expectations onto the research data. (Mester, 2021)
• Interaction Bias: Bias created due to differences in the way users interact with a device/system. (Baeza-Yates, 2018)
• Ranking Bias: Bias that occurs from top-ranked results being clicked on more. (Baeza-Yates, 2018; Lerman & Hogg, 2014)

As mentioned before, fairness in machine learning is often defined as treating users from different groups equally, where the
group identifier is part of the input data (i.e., a feature such as gender, age, or race). Simply disregarding these ‘sensitive’ features
when building machine learning models is not sufficient to eliminate bias, i.e., an algorithm that is simply unaware of these group aspects does not necessarily achieve fair outcomes, as the ‘sensitive’ feature (such as race, gender, or age) is often highly correlated with other unprotected features (such as zip code, college attended, socioeconomic status, education level, and other predictor variables) in the data, perpetuating inequality (Pedreshi, Ruggieri, & Turini, 2008). Hence, different approaches have been
suggested to alter or obfuscate sensitive attributes to reduce discrimination while retaining a good overall performance/accuracy
(e.g., Kamiran, Karim, & Zhang, 2012; Zemel, Wu, Swersky, Pitassi, & Dwork, 2013).
Several researchers provided definitions of fairness and designed fair algorithms (see, e.g., Burke et al., 2018; Doshi-Velez & Kim,
2017; Hardt et al., 2016). Table 2 provides an overview and formulation of the common and most frequently used machine learning classification fairness measures.

Table 2
Common fairness definitions in machine learning.

• Statistical Parity: Sometimes referred to as demographic parity; the likelihood of a positive outcome should be equal for the privileged and unprivileged groups. (Dwork et al., 2012; Gajane & Pechenizkiy, 2017; Kusner et al., 2017; Verma & Rubin, 2018)
• Equalized Odds: The privileged and unprivileged groups should have equal rates for true positives and false positives. (Gajane & Pechenizkiy, 2017; Hardt, Price, & Srebro, 2016; Verma & Rubin, 2018)
• Equal Opportunity: Unlike equalized odds, equal opportunity only concerns the true positives and states that the privileged and unprivileged groups should have equal true positive rates. (Gajane & Pechenizkiy, 2017; Hardt et al., 2016; Verma & Rubin, 2018)
• Disparate Impact: Similar to statistical parity, but takes the ratio of the likelihoods of positive outcomes for the privileged and unprivileged groups instead. (Feldman et al., 2015; Zafar et al., 2017a)
• Disparate Mistreatment: Misclassification rates (training error rates) for the privileged and unprivileged groups should be the same for a model to be fair. (Zafar et al., 2017a)
• Treatment Equality: A model satisfies treatment equality if the ratio of false negatives to false positives is equal for both the privileged and unprivileged groups. (Berk et al., 2017)
• Generalized Entropy Index: A measure of the inequality at a group or individual level with respect to the fairness of the algorithmic outcome. (Speicher et al., 2018)
• Individual Fairness (formerly Fairness Through Awareness): Any two individuals who are similar, as defined by a similarity metric (considered as ground truth, e.g., inverse distance metrics), should receive similar treatment. (Binns, 2020; Dwork et al., 2012; Kusner et al., 2017)
• Fairness Through Unawareness: An algorithm or model can be considered fair as long as it does not explicitly use any of the protected attributes or group-defining attributes in the decision-making process. (Chen, Kallus, Mao, Svacha, & Udell, 2019; Gajane & Pechenizkiy, 2017; Grgic-Hlaca, Zafar, Gummadi, & Weller, 2016; Kusner et al., 2017; Pedreshi et al., 2008)
• Counterfactual Fairness: The outcome should remain the same in both the real/actual world and a counterfactual world in which the individual belongs to a different group with respect to the protected attribute. (Kusner et al., 2017; Gajane & Pechenizkiy, 2017)

In addition, Kamishima, Akaho, Asoh, and Sakuma (2012) formulate a regularization approach
which penalizes the classifier for discrimination, aiming to remove bias during model building. This concept can be extended for
use in logistic regression classifiers and a variety of other probabilistic models. Fish, Kun, and Lelkes (2016) propose a method that
shifts the decision boundary in the learning algorithm, which provides a trade-off between bias and accuracy.
From a biased data input perspective, several approaches have been suggested. Pedreshi et al. (2008) introduce the notion
of discriminatory classification rules and show that simply leaving out the discriminatory variable (e.g., race) is not sufficient in
removing discrimination. They provide characterizations of direct and indirect discrimination for classification rules and a measure
of discrimination for these classification rules. Kamiran and Calders (2012) propose a method to achieve fairness by changing the
data before training the model. They aim to convert biased data into unbiased data by leaving out the sensitive attribute based on
a ranking function learned on the biased data. Lum and Johndrow (2016) provide a probabilistic definition of algorithmic bias and
propose bias removal from predictive models independent of the specific data types by removing all sensitive information from the
training data. Zafar, Valera, Gomez-Rodriguez, and Gummadi (2017b) introduce the concept of decision boundary fairness while
balancing disparate treatment and disparate impact measures. Section 4 provides additional information on various bias mitigation
strategies used in machine learning.

2.4. Algorithmic fairness in recommender systems

Fairness metrics and potential bias in recommender systems have been studied by several researchers. Lee et al. (2014) study
fairness-aware loan recommendation systems and argue that fairness and recommendation are two contradicting tasks. They measure
fairness as the standard deviation of the top-N recommendations, where a low standard deviation signifies a fair recommendation
without compromising accuracy. Burke et al. (2018) highlight the importance of recognizing the role of personalization when extending the concept of fairness to recommender systems, as the most popular recommender systems are personalized systems. They view fairness as a multi-sided concept in multi-stakeholder recommender systems, where the end user is not the only party whose interests are considered. They introduce the notion of ‘‘C-fairness’’ for fair user/consumer recommendation, and ‘‘P-fairness’’ for fairness of producer recommendation. In this context, fair recommendations should be made to both job seekers and employers in a job recommender system, to both parties in a matchmaking website, and similarly on advertising websites, scientific collaboration sites, etc. Burke et al. (2018) also show that defining generalized approaches to multi-sided fairness in recommendation is hard due to the domain specificity of the multi-stakeholder environment. Ekstrand, Tian, Kazi et al. (2018) extend this concept by presenting
an empirical analysis of P-fairness for several collaborative filtering algorithms.
Steck (2018) shows that recommender systems trained in an offline setting with the goal of accuracy suffer from the problem of unbalanced recommendations, which leads to the filter bubble effect (Baeza-Yates, 2020). The author extends the general concept of
calibration in machine learning to create fair recommendations to the users of recommender systems. Calibration in a classification
algorithm ensures that the predicted proportions of various classes are in tune with the actual proportion of data points in the
dataset. Calibrated recommendations ensure that the recommended list reflects the various interests of the user according to their
representation in the dataset. Kamishima, Akaho, Asoh, and Sakuma (2018) use the concept of recommendation independence to
create fair recommendations. Recommendation independence excludes the influence of sensitive information from a recommended
outcome. Zhu et al. (2018) define the fairness goal for recommender systems as overcoming algorithmic bias and making neutral
recommendations independent of group membership (e.g., based on gender or age). Another recent approach by Rastegarpanah,
Gummadi, and Crovella (2019) proposes adding specifically designed ‘antidote’ data to the input instead of manipulating the actual
input data. Their aim is to improve the social desirability of the calculated recommendations. Weydemann, Sacharidis, and Werthner
(2019) study a location-based recommender system to analyze fairness concerns of both users and locations subject to unfair treatment. Abdollahpouri, Mansoury et al. (2019) and Kowald et al. (2020) look into addressing the unfairness caused to users by popularity bias in recommendations.
From the perspective of measuring fairness, Beutel et al. (2019) propose an unbiased way to measure recommender system
ranking fairness. The proposed method learns user preference through pairwise comparison and defines the notion of pairwise
fairness, intra-group pairwise fairness, and inter-group pairwise fairness to study if a model systematically mis-ranks or under-ranks
items from a particular group. Singh and Joachims (2018) and Biega, Gummadi, and Weikum (2018) also define fairness measures
over ranking that apply to recommender systems. From a rating prediction perspective, however, the work of Yao and Huang (2017) is seminal: they propose four metrics that capture different forms of unfairness and address potential biases in collaborative filtering recommender systems stemming from population imbalance or observation bias. We use their definitions
as baseline for the fairness metrics comparison in Section 3. An alternate approach is to use a fair regression model for calculating
the predicted rating (Agarwal, Dudik, & Wu, 2019).

3. Fairness metrics in recommender systems and machine learning

Fairness in algorithmic decision making is a topic of significant research interest, leading to the definition of a variety of fairness
metrics in the past years. Much of the work stems from the domain of machine learning, specifically (binary) classification. While
we provide a definition of these concepts, they are not always directly applicable to a rating-based recommender system. Hence,
this section also provides a mapping of metrics between the two domains and adapts several concepts such as individual fairness to
the context of recommender systems.

3.1. General notation

For the definitions of fairness metrics in recommender system and machine learning, we will use the following notation. Let
𝑖 ∈ {1, … , 𝑛} be the index of users, and 𝑗 ∈ {1, … , 𝑚} be the index of items to be recommended. As fairness metrics compare
different groups, we define a group index for the users. While it is also possible to define a group index for items, the extension to
the definitions are straightforward and thus omitted here. Following previous notation by Yao and Huang (2017), let 𝑔𝑖 ∈ {𝑝𝑟, 𝑛𝑝𝑟}
denote the group of user 𝑖, where 𝑝𝑟 indicates the privileged group, and 𝑛𝑝𝑟 the unprivileged group. Let 𝑟𝑖𝑗 be the rating of item 𝑗
by user 𝑖, and 𝑦̂𝑖𝑗 be the predicted score of item 𝑗 for user 𝑖 (e.g., calculated via matrix factorization in collaborative filtering). 𝑟̄𝑗
and 𝑦̂̄𝑗 denote the average rating and average predicted score for item 𝑗, respectively. Similarly, for the case of binary classification
in machine learning, let 𝑦̂𝑖 ∈ {0, 1} be the predicted class for user 𝑖, and 𝑦𝑖 ∈ {0, 1} be the actual class of user 𝑖. Without loss of
generality, class 1 is considered the positive, favorable class, e.g., indicating that a user received a loan.

3.2. Fairness in machine learning

As a variety of fairness metrics and concepts for (mostly binary) classification has been introduced in the past years, the following
section provides the definitions of commonly used metrics and their interpretation. Table 2 lists some of these key metrics and their
references in literature.


3.2.1. Group fairness metrics


Group fairness metrics define and compare the fairness of a classifier with respect to two or more groups. One of the earliest
definitions of fairness is the concept of statistical parity, which compares the percentage of favorable predictions (i.e., with 𝑦̂ = 1)
between two groups (see, e.g., Corbett-Davies, Pierson, Feller, Goel, & Huq, 2017; Kamishima et al., 2012; Zemel et al., 2013). The
idea behind achieving statistical parity is that the chance of getting classified with the positive label is independent of the user’s
group membership.
$$SP_{diff} = \left| \Pr(\hat{y}_{pr} = 1) - \Pr(\hat{y}_{npr} = 1) \right| \qquad (1)$$
Similar to statistical parity, the concept of Disparate Impact (Feldman et al., 2015) compares the probability of favorable outcome
between two groups. However, it uses a relative instead of an absolute comparison and has its background in a legal doctrine that
aims to avoid unintended consequences and unfairness based on group membership. Formally, Disparate Impact is defined as:

$$DisparateImpact = \frac{\Pr(\hat{y}_{pr} = 1)}{\Pr(\hat{y}_{npr} = 1)} \qquad (2)$$
Whereas the previous two metrics look at the probability of getting the favorable label and do not consider potentially
existing group differences (e.g., a different percentage of class 1 observations between the groups), the following concepts of equal
opportunity and equalized odds consider additional metrics based on binary classification that take these potential differences into
account. Specifically, they look at true positive rates (TPR) and false positive rates (FPR). Equal Opportunity considers the true
positive rate (TPR), whereas Equalized Odds require both TPR and FPR to be independent of group membership (Hardt et al., 2016;
Pleiss, Raghavan, Wu, Kleinberg, & Weinberger, 2017):
$$EqOppDiff = \left| TPR_{pr} - TPR_{npr} \right|, \quad \text{where } TPR = \frac{\Pr(\hat{y} = 1 \mid y = 1)}{\Pr(\hat{y} = 0 \mid y = 1) + \Pr(\hat{y} = 1 \mid y = 1)}$$

$$EqOddsDiff = 0.5 \left( \left| FPR_{pr} - FPR_{npr} \right| + \left| TPR_{pr} - TPR_{npr} \right| \right), \quad \text{where } FPR = \frac{\Pr(\hat{y} = 1 \mid y = 0)}{\Pr(\hat{y} = 1 \mid y = 0) + \Pr(\hat{y} = 0 \mid y = 0)}$$
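To make these definitions concrete, the following minimal sketch (our illustration, not part of the original article; function and variable names are hypothetical) computes the four group fairness metrics from binary predictions and a boolean privileged-group mask:

import numpy as np

def group_fairness_metrics(y_true, y_pred, privileged):
    # y_true, y_pred: arrays of 0/1 labels; privileged: boolean mask (True = privileged group)
    def rates(mask):
        yt, yp = y_true[mask], y_pred[mask]
        pos = yp.mean()                  # Pr(y_hat = 1) within the group
        tpr = yp[yt == 1].mean()         # Pr(y_hat = 1 | y = 1)
        fpr = yp[yt == 0].mean()         # Pr(y_hat = 1 | y = 0)
        return pos, tpr, fpr

    pos_pr, tpr_pr, fpr_pr = rates(privileged)
    pos_npr, tpr_npr, fpr_npr = rates(~privileged)
    return {
        "statistical_parity_diff": abs(pos_pr - pos_npr),   # Eq. (1)
        "disparate_impact": pos_pr / pos_npr,                # Eq. (2)
        "equal_opportunity_diff": abs(tpr_pr - tpr_npr),
        "equalized_odds_diff": 0.5 * (abs(fpr_pr - fpr_npr) + abs(tpr_pr - tpr_npr)),
    }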

3.2.2. Distribution-based fairness metrics - inequality index


Whereas the previous metrics consider fairness from a group perspective, Speicher et al. (2018) consider the general distribution of fairness for classification using an inequality index. Specifically, they define a Generalized Entropy Index (GEI) that measures the inequality between all users with respect to how fairly they are treated by the algorithm. Entropy-based metrics such as the GEI are a family of inequality indices that can be used to measure fairness at both the group and the individual level. The Theil index is the most commonly used variant of the GEI (Speicher et al., 2018).
In general, the GEI uses a parameter 𝛼 in its definition, where 𝛼 = 1 corresponds to the special case of the Theil index:

$$GEI = \frac{1}{n\,\alpha(\alpha - 1)} \sum_{i=1}^{n} \left[ \left( \frac{b_i}{\mu} \right)^{\alpha} - 1 \right] \qquad (3)$$

$$Theil = \frac{1}{n} \sum_{i=1}^{n} \frac{b_i}{\mu} \log\left( \frac{b_i}{\mu} \right) \qquad (4)$$

𝜇 represents the average benefit across all users: $\mu = \frac{1}{n}\sum_{i=1}^{n} b_i$. The definition of the benefit function 𝑏𝑖 needs to be determined for the specific problem. In classification, Speicher et al. (2018) propose 𝑏𝑖 = 𝑦̂𝑖 − 𝑦𝑖 + 1. When transferring this concept to the
recommender system domain, we need to adjust the benefit function to properly represent features of the new domain.
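As an illustration (a minimal sketch, not the authors' code), the GEI and its Theil special case can be computed from a vector of individual benefits as follows:

import numpy as np

def generalized_entropy_index(benefits, alpha=2):
    # benefits: array of individual benefit values b_i; mu is their mean
    b = np.asarray(benefits, dtype=float)
    mu = b.mean()
    if alpha == 1:                                                   # Theil index, Eq. (4)
        return np.mean((b / mu) * np.log(b / mu))
    return np.mean((b / mu) ** alpha - 1) / (alpha * (alpha - 1))    # Eq. (3)

# Benefit proposed by Speicher et al. (2018) for classification: b_i = y_hat_i - y_i + 1
# theil = generalized_entropy_index(y_pred - y_true + 1, alpha=1)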

3.3. Fairness in recommender systems

Building on concepts from machine learning, Yao and Huang (2017) introduced several fairness metrics for rating prediction
recommender systems. Most of these metrics are inspired from the concept of equal opportunity, i.e., the quality of recommendations
should be independent of the group membership. Specifically, we will use following fairness metrics for rating-based recommender
systems.
Value unfairness. Yao and Huang (2017) introduce the following definition of 𝑈𝑣𝑎𝑙, which measures the consistent difference
between predicted and actual ratings on a per-item basis. Hence, it measures if the predictions/recommendations for one group are
consistently higher or lower than their actual ratings.

$$U_{val} = \frac{1}{m} \sum_{j=1}^{m} \left| \left( \bar{\hat{y}}_{j,pr} - \bar{r}_{j,pr} \right) - \left( \bar{\hat{y}}_{j,npr} - \bar{r}_{j,npr} \right) \right| \qquad (5)$$


The previous definition of value unfairness includes the direction of predicted vs. expected rating differences for the groups. For
example, value unfairness is high when the first group is overestimated (𝑦̂̄𝑗,𝑝𝑟 −𝑟̄𝑗,𝑝𝑟 is positive) and the second group is underestimated
(𝑦̂̄𝑗,𝑛𝑝𝑟 − 𝑟̄𝑗,𝑛𝑝𝑟 is negative). An example cited in Yao & Huang is where male students are consistently recommended STEM courses
even when they are not interested in STEM topics. According to the formula, value unfairness is high when the prediction for the privileged group is overestimated and the prediction for the unprivileged group is underestimated. However, the prediction errors balance out when both groups of users have the same direction and magnitude of error, thus leading to a low value unfairness (Yao & Huang, 2017).
The closely related definition of absolute unfairness 𝑈𝑎𝑏𝑠 does not consider the direction of error, but only the absolute magnitude.
In the previous example, value unfairness would be high, but absolute unfairness could be small. Absolute unfairness is defined as:

$$U_{abs} = \frac{1}{m} \sum_{j=1}^{m} \Bigl| \, \bigl| \bar{\hat{y}}_{j,pr} - \bar{r}_{j,pr} \bigr| - \bigl| \bar{\hat{y}}_{j,npr} - \bar{r}_{j,npr} \bigr| \, \Bigr| \qquad (6)$$

For example, absolute unfairness will be zero when the privileged group of male students is given predictions, say, 0.5 above their true preferences and the unprivileged group of female students is given predictions 0.5 below their true preferences. Value unfairness, in contrast, would be 1 in this example.
Both 𝑈𝑣𝑎𝑙 and 𝑈𝑎𝑏𝑠 consider both positive and negative differences between predicted and actual ratings. Yao and Huang (2017)
provide a straightforward extension by looking at two closely related metrics for unfairness: underestimation unfairness 𝑈𝑢𝑛𝑑𝑒𝑟 and
overestimation unfairness 𝑈𝑜𝑣𝑒𝑟. The idea behind these two metrics is to focus on either underestimation or overestimation of the actual ratings, i.e., they consider whether one of the groups is consistently under- or overestimated with respect to its actual ratings.

$$U_{under} = \frac{1}{m} \sum_{j=1}^{m} \Bigl| \max\bigl\{0, \bar{r}_{j,pr} - \bar{\hat{y}}_{j,pr}\bigr\} - \max\bigl\{0, \bar{r}_{j,npr} - \bar{\hat{y}}_{j,npr}\bigr\} \Bigr| \qquad (7)$$

$$U_{over} = \frac{1}{m} \sum_{j=1}^{m} \Bigl| \max\bigl\{0, \bar{\hat{y}}_{j,pr} - \bar{r}_{j,pr}\bigr\} - \max\bigl\{0, \bar{\hat{y}}_{j,npr} - \bar{r}_{j,npr}\bigr\} \Bigr| \qquad (8)$$

Finally, Yao and Huang (2017) define the non-parity metric 𝑈𝑝𝑎𝑟 for recommender system as the absolute difference in average
predicted ratings between the two groups, adapting the concept of statistical parity from classification (Kamishima et al., 2012).
I.e., if one group is consistently given lower or higher predicted ratings than the other group, non-parity would be considered high:

$$U_{par} = \bigl| \bar{\hat{y}}_{pr} - \bar{\hat{y}}_{npr} \bigr| \qquad (9)$$
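The following sketch (illustrative only; the column names 'item', 'group', 'rating', and 'pred' are our assumptions, not from the paper) computes these five metrics from a table of per-user, per-item predictions using pandas:

import pandas as pd

def rating_fairness_metrics(df):
    # df columns: 'item', 'group' ('pr'/'npr'), 'rating' (r_ij), 'pred' (y_hat_ij)
    g = df.groupby(['item', 'group'])[['pred', 'rating']].mean().unstack('group')
    err_pr = g[('pred', 'pr')] - g[('rating', 'pr')]      # per-item error, privileged group
    err_npr = g[('pred', 'npr')] - g[('rating', 'npr')]   # per-item error, unprivileged group
    return {
        'U_val':   (err_pr - err_npr).abs().mean(),                                    # Eq. (5)
        'U_abs':   (err_pr.abs() - err_npr.abs()).abs().mean(),                        # Eq. (6)
        'U_under': ((-err_pr).clip(lower=0) - (-err_npr).clip(lower=0)).abs().mean(),  # Eq. (7)
        'U_over':  (err_pr.clip(lower=0) - err_npr.clip(lower=0)).abs().mean(),        # Eq. (8)
        'U_par':   abs(df.loc[df.group == 'pr', 'pred'].mean()
                       - df.loc[df.group == 'npr', 'pred'].mean()),                    # Eq. (9)
    }

Items rated by only one of the two groups produce undefined per-item terms; pandas skips these NaN values when averaging.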

3.4. Domain mapping

As mentioned before, some metrics in recommender systems are based on respective ideas from machine learning. Fig. 2 provides
an overview of related metrics and their relationships.
The boxes in Fig. 2 without any outgoing arrows are the fairness metrics in the machine learning domain that are relevant to the recommender system domain and are worth adapting.

4. Bias mitigation strategies

Defining and quantifying potential unfairness in algorithmic decision making is a necessary first step, but does not answer the
question of how we can increase the fairness of a given solution. This is the focus of bias mitigation strategies, which define procedures
that aim to decrease the bias and unfairness. Typically, bias mitigation strategies focus on a specific definition or metric for fairness,
and adjust the solution calculation such that fairness is increased.

4.1. Mitigation strategies in classification problems

Table 1 mentions the various types of bias. Numerous bias mitigation strategies have been proposed; at a high level, they
can be distinguished based on the process step in which the adjustments to the calculated solutions are made. Fig. 3 shows the three
categories of strategies that are often considered in literature.

4.1.1. Pre-processing techniques: Data transformations


Data-based models crucially depend on the quality of the input data to achieve good (and relevant) solutions as they learn and
replicate patterns they find in the data. Hence, the first type of approach targets adjustments to the input data to increase fairness. In
particular, based on the fact that the data used to train and build a model can include certain biases itself, pre-processing approaches
aim to adjust the input data such that the subsequently calculated solutions have a lower unfairness than before. Strategies for this
type of pre-processing can be a reweighing of the input data (Kamiran & Calders, 2012), learning data transformations to reduce
discrimination (Calmon, Wei, Vinzamuri, Ramamurthy, & Varshney, 2017; Feldman et al., 2015), or learning fair representations of
the original data while encoding protected attributes, which can be used instead of the original data to create predictions (Zemel
et al., 2013).
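As a concrete example of the reweighing idea of Kamiran and Calders (2012), the following sketch (our simplified illustration, not their original implementation) assigns each training instance a weight so that group membership and class label become statistically independent in the weighted data:

import numpy as np

def reweighing_weights(groups, labels):
    # weight w(g, y) = P(g) * P(y) / P(g, y): up-weights under-represented (group, label) pairs
    groups, labels = np.asarray(groups), np.asarray(labels)
    w = np.ones(len(labels), dtype=float)
    for g in np.unique(groups):
        for y in np.unique(labels):
            mask = (groups == g) & (labels == y)
            if mask.any():
                w[mask] = (groups == g).mean() * (labels == y).mean() / mask.mean()
    return w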


Fig. 2. Mapping of fairness metrics from machine learning to recommender systems.

Fig. 3. Types of Bias Mitigation Strategies (Ruggieri, Pedreschi, & Turini, 2010).

4.1.2. In-processing techniques: Algorithmic adjustments


The second type of bias mitigation strategies targets adjustments to the algorithms themselves. This is often achieved by looking
at bias mitigation as an optimization problem that needs to balance the fairness of a solution with other relevant performance
metrics (e.g., the accuracy of a solution). In these in-processing approaches, a specific fairness metric is commonly added to the
algorithm’s objective or as an additional constraint, and then solved to find a good trade-off. For example, Calders and Verwer (2010) suggest an adjusted Naive Bayes classifier that achieves discrimination-free classification on a protected attribute. Adding the fairness
objective as regularization term which penalizes unfair solutions in probabilistic prediction models is proposed by Kamishima et al.
(2012). Celis, Huang, Keswani, and Vishnoi (2019) present a meta-algorithm that accepts a variety of fairness constraints, and show
that it is able to find near-optimal solutions with respect to certain fairness metrics.
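A minimal sketch of the in-processing idea (our simplification, not any specific published implementation; it uses a covariance-based penalty in the spirit of the decision-boundary fairness of Zafar et al. (2017b)) adds a fairness term to a standard logistic loss:

import numpy as np

def fair_logistic_loss(w, X, y, s, lam=1.0):
    # y in {-1, +1}; s is the sensitive attribute; lam trades off accuracy vs. fairness
    scores = X @ w
    log_loss = np.mean(np.log1p(np.exp(-y * scores)))
    cov = np.mean((s - s.mean()) * (scores - scores.mean()))   # covariance of s and the scores
    return log_loss + lam * cov ** 2

Minimizing this objective (e.g., with a generic numerical optimizer) pushes the decision scores to be uncorrelated with the sensitive attribute while retaining predictive accuracy.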

4.1.3. Post-processing techniques: Solution adjustments


While the previous two approaches either alter the input data or the algorithms themselves, the third type of bias mitigation
strategy considers the already calculated solutions. These post-processing techniques are algorithm-agnostic and thus can be applied
to any classification outcome. In classification, these post-processing adjustments are based on the calculated probabilities that an
observation is classified as a particular class. More specifically, the approaches aim to achieve a higher fairness of the solution by
changing the classification threshold used for certain groups or instances. As examples for this type of bias mitigation, Kamiran
et al. (2012) suggest a reject-option classification where certain instances are re-labeled (i.e., their predicted class switched) such
that overall unfairness is reduced. Using a similar concept, Hardt et al. (2016) develop a linear model to determine the specific
re-labeling that optimizes the fairness of the solution, and Pleiss et al. (2017) use a calibrated version of the previous concept.
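To illustrate the threshold-adjustment idea behind these post-processing methods (a sketch of the general concept, not of any specific cited algorithm), group-specific decision thresholds can be chosen so that both groups receive the favorable label at the same rate:

import numpy as np

def parity_thresholds(scores, privileged, target_rate):
    # choose per-group thresholds so that roughly target_rate of each group is labeled positive
    thresholds = {}
    for name, mask in (("pr", privileged), ("npr", ~privileged)):
        thresholds[name] = np.quantile(scores[mask], 1 - target_rate)
    return thresholds

# usage: t = parity_thresholds(scores, privileged, 0.3)
#        labels = scores >= np.where(privileged, t["pr"], t["npr"])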


4.2. Transferring mitigation concepts to recommender systems

Common types of bias encountered in the recommender system domain are popularity bias (Abdollahpouri, Mansoury et al., 2019;
Jannach et al., 2015), demographic bias (Drozdowski et al., 2020), observation bias (Yao & Huang, 2017), sparsity bias (Bellogín,
Castells, & Cantador, 2017), inductive bias, selection bias, conformity bias, exposure bias and position bias (Chen et al., 2020).
Mitigating bias in recommender systems can improve system usability and user satisfaction. For example, ensuring women interested
in STEM courses are sufficiently recommended STEM courses will improve the users’ trust and satisfaction. Even though a number
of bias mitigation approaches have been suggested in machine learning, the design and analysis of bias mitigation strategies in
recommender systems is a fairly novel area of research. The main concept behind bias mitigation is similar, however: given a set
of fairness metrics, the goal is to either adapt the algorithm, the predictions, or the data to increase fairness.
Yao and Huang (2017) provide an in-processing approach for including fairness aspects when building collaborative filtering
recommender systems. Specifically, they include the respective fairness metric into the derivative formulation used in the matrix
factorization step, and adjust the derivatives accordingly. They show that this in-processing approach can increase the fairness of the
recommendations to a certain degree while not having a significant impact on the error rate. Islam, Keya, Pan, and Foulds (2019) also present an in-processing approach for a collaborative filtering recommender system trained on social media data. Their de-biasing
approach is based on a word vector attenuation technique borrowed from the field of Natural Language Processing. Abdollahpouri,
Burke, and Bamshad (2019) propose a post-processing approach to mitigate popularity bias in recommender systems. Popularity
bias in the recommender system occurs when the more popular items are frequently recommended, and the less popular items
rarely show up as recommendations. They propose a post-processing approach to balance out the exposure of each item in the item
catalog by re-ranking, and show that the approach efficiently manages popularity bias without compromising accuracy. Edizel, Bonchi, Hajian, Panisson, and Tassa (2020) highlight the link between the predictability of sensitive attributes and bias and propose a post-processing approach for mitigating bias with minimal impact on the utility of recommendations.
As practically all classification-based mitigation strategies described in Section 4.1 are developed for binary classification
scenarios, a direct application of these approaches in a rating-based collaborative filtering system is not possible. While a conversion
of the rating-based (e.g., ratings 1–5) to a binary prediction (e.g., like/ not like) could be used to convert the rating-based to a
binary recommendation setting, it is unclear how such a conversion and subsequent application of bias mitigation strategies would
affect the initial rating-based recommender system. Hence, we focus on the need to develop mitigation strategies for rating-based
recommender systems in general. Specifically, we introduce and discuss a flexible, metric-agnostic bias mitigation approach in the
next section.

5. A novel bias mitigation approach for recommender systems

We propose a new, fairness metric-agnostic bias mitigation approach to adjust predicted scores based on learned/observed
differences between privileged and unprivileged groups after training a recommender system. Specifically, we aim to adjust the
predicted scores 𝑦̂𝑖 for the users such that the (applied definition of) fairness of the solution increases.
This type of approach, as compared to integrated in-processing approaches such as the one suggested by Yao and Huang (2017),
has two main advantages: First, the algorithm itself does not have to be adjusted, i.e., it can be applied to any algorithm used for
calculating the predicted ratings. Second, it is agnostic with respect to the applied fairness metric and thus can be readily applied
to any quantitative fairness metric definition that is considered important for a given scenario.
The adjustment of the predicted scores and thus the specific type of bias mitigation depends on the targeted fairness metric,
similar to the mitigation strategies developed for machine learning classification. The algorithm can be applied both as an in-
processing approach, by learning the adjustments during the training procedure and applying them on the separate test data, or as a post-processing approach, which directly calculates the necessary adjustments on the test data itself (similar to post-processing
approaches in Machine Learning). For the following description, we adopt an in-processing version where the algorithm learns
the adjustments on the training data.
In general, the bias mitigation strategy aims to learn a difference 𝛿 = 𝑦−𝑟 between predicted ratings 𝑦 and observed ratings 𝑟 based
on the results on the training set. This difference 𝛿 can represent different types of unfairness, e.g., if 𝛿 is different for subgroups of
users (thus indicating potential unfairness in predicted ratings). Assuming that the data used for training the recommender system is
comparable to the test data/new data the recommender system is applied on, the ratings can be actively adjusted using the learned
differences 𝛿.
Here, we consider several versions of the bias mitigation approach depending on the type of fairness adjustment. Let 𝑦̂̄𝑗,𝑔,𝑡𝑟𝑎𝑖𝑛 be the average predicted rating for item 𝑗 on the training set for a specific group of users, and 𝑦̂̄𝑔,𝑡𝑟𝑎𝑖𝑛 the average predicted rating across all items for that group. Similarly, 𝑟̄𝑗,𝑔,𝑡𝑟𝑎𝑖𝑛 is the average actual rating for item 𝑗 on the training dataset for a specific group, and 𝑟̄𝑔,𝑡𝑟𝑎𝑖𝑛 the average actual rating across all items.

Value-based adjustment: Here, we look at differences 𝛿𝑗,𝑔 in predicted vs actual ratings for each group and item in the training
set. I.e.,
𝛿𝑗,𝑔,𝑡𝑟𝑎𝑖𝑛 = 𝑦̂̄𝑗,𝑔,𝑡𝑟𝑎𝑖𝑛 − 𝑟̄𝑗,𝑔,𝑡𝑟𝑎𝑖𝑛 (10)
These learned differences are then used to adjust the test set predictions for each group, with the goal to reduce value and
absolute unfairness:
𝑦̂𝑖,𝑗,𝑔,𝑡𝑒𝑠𝑡 = 𝑦̂𝑖,𝑗,𝑔,𝑡𝑒𝑠𝑡 + 𝛿𝑗,𝑔,𝑡𝑟𝑎𝑖𝑛 (11)


Parity-based adjustment: the overall difference 𝛿 between predicted ratings for two groups 𝑔1 and 𝑔2 is learned on the training
set:

𝛿𝑡𝑟𝑎𝑖𝑛 = 𝑦̂̄𝑔1 ,𝑡𝑟𝑎𝑖𝑛 − 𝑦̂̄𝑔2 ,𝑡𝑟𝑎𝑖𝑛 (12)

Then, it is used to adjust the test set predictions to reduce non-parity.

𝑦̂𝑖,𝑗,𝑔1 ,𝑡𝑒𝑠𝑡 = 𝑦̂𝑖,𝑗,𝑔1 ,𝑡𝑒𝑠𝑡 + 𝛿𝑡𝑟𝑎𝑖𝑛 (13)

Algorithm 1: Bias mitigation approach to increase fairness in rating-based recommender systems


Input: 𝑘: number of folds
Result: Adjusted predicted ratings 𝑦̂ with optimized fairness, assuming 𝛿𝑡𝑒𝑠𝑡 ∼ 𝛿𝑡𝑟𝑎𝑖𝑛
Step 1: Divide the data into k-folds;
Step 2: Build collaborative filtering model and adjust predicted scores:
for 𝑡𝑟𝑎𝑖𝑛, 𝑡𝑒𝑠𝑡 in 𝑓 𝑜𝑙𝑑𝑠 do
build collaborative filtering model on 𝑡𝑟𝑎𝑖𝑛 fold;
learn difference 𝛿 in the fairness metric between the privileged and unprivileged groups on this training set:
if value-based adjustment then
calculate item-based fairness difference for each group in the training set:
𝛿𝑗,𝑝𝑟,𝑡𝑟𝑎𝑖𝑛 = 𝑦̂̄𝑗,𝑝𝑟,𝑡𝑟𝑎𝑖𝑛 − 𝑟̄𝑗,𝑝𝑟,𝑡𝑟𝑎𝑖𝑛 and 𝛿𝑗,𝑛𝑝𝑟,𝑡𝑟𝑎𝑖𝑛 = 𝑦̂̄𝑗,𝑛𝑝𝑟,𝑡𝑟𝑎𝑖𝑛 − 𝑟̄𝑗,𝑛𝑝𝑟,𝑡𝑟𝑎𝑖𝑛 ;
calculate the predicted scores 𝑦̂𝑖,𝑗 for the test set;
adjust the predicted scores with the rules:
if 𝑖 ∈ 𝑝𝑟 then
𝑦̂𝑖,𝑗,𝑝𝑟,𝑡𝑒𝑠𝑡 = 𝑦̂𝑖,𝑗,𝑝𝑟,𝑡𝑒𝑠𝑡 + 𝛿𝑗,𝑝𝑟,𝑡𝑟𝑎𝑖𝑛
else if 𝑖 ∈ 𝑛𝑝𝑟 then
𝑦̂𝑖,𝑗,𝑛𝑝𝑟,𝑡𝑒𝑠𝑡 = 𝑦̂𝑖,𝑗,𝑛𝑝𝑟,𝑡𝑒𝑠𝑡 + 𝛿𝑗,𝑛𝑝𝑟,𝑡𝑟𝑎𝑖𝑛
else if parity adjustment then
calculate overall fairness difference:
𝛿𝑡𝑟𝑎𝑖𝑛 = 𝑦̂̄𝑝𝑟,𝑡𝑟𝑎𝑖𝑛 − 𝑦̂̄𝑛𝑝𝑟,𝑡𝑟𝑎𝑖𝑛 ;
calculate the predicted scores 𝑦̂𝑖,𝑗 for the test set;
adjust the predicted scores with the rule:
if 𝑖 ∈ 𝑛𝑝𝑟 then
𝑦̂𝑖,𝑗,𝑛𝑝𝑟,𝑡𝑒𝑠𝑡 = 𝑦̂𝑖,𝑗,𝑛𝑝𝑟,𝑡𝑒𝑠𝑡 + 𝛿𝑡𝑟𝑎𝑖𝑛
end

The pseudocode of the bias mitigation strategy is described in Algorithm 1. The rationale behind this bias mitigation approach
is straightforward: Under the assumption that training and test sets are reasonably similar and that the unfairness characteristics
found in the training set are similar to the ones found in the test set (i.e., 𝛿𝑡𝑒𝑠𝑡 ∼ 𝛿𝑡𝑟𝑎𝑖𝑛 , where 𝛿 refers to the differences between the
groups), the differences 𝛿𝑡𝑟𝑎𝑖𝑛 between calculated scores 𝑦̂𝑡𝑟𝑎𝑖𝑛 and actual ratings 𝑟𝑡𝑟𝑎𝑖𝑛 can be learned and then applied to the test set
in a post-calculation step. The differences 𝛿 for both the privileged and unprivileged groups can use a global perspective (the parity
consideration), or alternatively a per-item basis which follows the value/ absolute fairness metric. These differences can be applied
to the predictions on the test set, aiming to reduce the amount of bias found between the privileged and unprivileged groups.
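A compact implementation sketch of Algorithm 1 is shown below (assuming pandas DataFrames with illustrative column names 'item', 'group' ('pr'/'npr'), 'rating', and 'pred'; the actual code used for the paper is not reproduced here):

import pandas as pd

def mitigate(train, test, mode="value"):
    # learn delta on the training fold and adjust the test-fold predictions (Algorithm 1)
    test = test.copy()
    if mode == "value":
        grp = train.groupby(["item", "group"])
        delta = (grp["pred"].mean() - grp["rating"].mean()).rename("delta").reset_index()
        test = test.merge(delta, on=["item", "group"], how="left").fillna({"delta": 0.0})
        test["pred"] = test["pred"] + test["delta"]        # y_hat_{i,j,g} += delta_{j,g,train}
        test = test.drop(columns="delta")
    else:  # parity-based adjustment
        means = train.groupby("group")["pred"].mean()
        delta = means["pr"] - means["npr"]                 # delta_train = mean pred (pr) - mean pred (npr)
        test.loc[test["group"] == "npr", "pred"] += delta  # shift the unprivileged group's predictions
    return test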
As the bias mitigation strategy focuses on reducing unfairness between the privileged and unprivileged group, the choice of these
privileged and unprivileged groups is crucial. In practice, this choice will be scenario-dependent and can factor in different types of
biases that might be present (see e.g., Table 1). For example, historical and demographic biases can manifest in unfairness due to
race or gender, hence these (combinations of) attributes can be used to define what constitutes privileged and unprivileged groups.
We will investigate the effects of these mitigation strategies on the fairness metrics and other rating-based performance metrics
such as root mean squared error (RMSE) and the mean absolute error (MAE) in the next section. Specifically, we investigate the
effects of the adjustment on synthetic and empirical data.

6. Evaluation

6.1. Datasets

For the subsequent evaluation, we use two datasets: a synthetically created dataset following Yao and Huang (2017) and the
MovieLens 1M dataset (Harper & Konstan, 2016). The synthetic dataset represents a course recommendation scenario where fairness
considers if the ratings for specific courses are consistently under- or over-predicted for the unprivileged and privileged groups. The
empirical dataset provides a well-known setting where the existence of biases and unfairness is less obvious, yet still of importance
when predicting ratings.
For both evaluation scenarios, we first calculate the recommendations without any bias mitigation considerations. These
‘unmitigated’ recommendations are used as the baseline for the subsequent comparisons. Based on this, we apply different versions
of the novel bias mitigation strategy to calculate ‘mitigated’ recommendations. Their impact on various fairness and performance
metrics will then be used to determine the effectiveness of the mitigation strategies.


6.1.1. Synthetic data: Course recommendation


For the synthetic data, we replicate the scenario proposed by Yao and Huang (2017) which considers course recommendation
(in terms of rating prediction) for male and female students for different types of courses. Students have different preferences for
specific types of courses. The scenario considers four different user types: Male students (M), male students preferring STEM course
(MS), female students (W), and female students preferring STEM courses (WS). In addition, the scenario considers three different
courses (item type): STEM courses, courses mostly appealing to female students (Fem), and courses mostly appealing to male students
(Masc).
The data itself is created using a block model with different user distributions, probabilities for rating high or low (𝐿), and
probabilities for observing the ratings (𝑂). This allows for a flexible creation of synthetic data which represents different biases.
Specifically, we consider the following scenario. The user population is skewed in the sense that 40% belong to type W, 10% to
type WS, 40% to type MS, and 10% to type M. This can be thought of as reflecting user distributions seen in courses from certain
domains. Ratings can either be 2 (for liking a course) or 1 (for not liking the course). We use the following probabilities for ratings
𝐿 and observations 𝑂 which reflect an observation bias. For example, 𝐿𝑊 𝑆,𝑆𝑇 𝐸𝑀 indicates that a female student preferring STEM
courses gives a positive rating of 2 with 80% probability. Similarly, 𝑂𝑊 𝑆,𝑆𝑇 𝐸𝑀 specifies that there is a 40% probability that we
observe this rating, i.e., that the rating will appear in the observed (training) data.

$$L = \begin{array}{l|ccc}
 & Fem & STEM & Masc \\ \hline
W & 0.8 & 0.2 & 0.2 \\
WS & 0.8 & 0.8 & 0.2 \\
MS & 0.2 & 0.8 & 0.8 \\
M & 0.2 & 0.2 & 0.8
\end{array}
\qquad
O = \begin{array}{l|ccc}
 & Fem & STEM & Masc \\ \hline
W & 0.6 & 0.2 & 0.1 \\
WS & 0.3 & 0.4 & 0.2 \\
MS & 0.1 & 0.3 & 0.5 \\
M & 0.05 & 0.5 & 0.35
\end{array}$$
Together, these parameters represent an extreme case in which there are far fewer female students in STEM, and these female students are less likely to rate STEM courses than male students. Fairness in this case considers whether the ratings for specific courses are consistent between male and female students. For the following evaluation, we use 400 users and 300 items, similar to the evaluation scenario of Yao and Huang (2017).
To evaluate the synthetic dataset and the effects of the bias mitigation approach, we use the observed ratings as training data and compare the predictions for the unobserved ratings to their expected value $(L_{user,item} + 1)$. For the purpose of defining privileged and unprivileged groups, we define male students (M, MS) as the privileged group and female students (W, WS) as the unprivileged group.
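As an illustration, the block-model sampling described above can be sketched as follows. The random seed, the uniform distribution of item types, and all variable names are our own assumptions, so this is not the exact generator used in the evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

user_types = ["W", "WS", "MS", "M"]      # row order of L and O
item_types = ["Fem", "STEM", "Masc"]     # column order of L and O
user_dist = [0.4, 0.1, 0.4, 0.1]         # skewed user population (W, WS, MS, M)

# P(rating = 2) and P(rating observed) per (user type, item type), as above.
L = np.array([[0.8, 0.2, 0.2], [0.8, 0.8, 0.2], [0.2, 0.8, 0.8], [0.2, 0.2, 0.8]])
O = np.array([[0.6, 0.2, 0.1], [0.3, 0.4, 0.2], [0.1, 0.3, 0.5], [0.05, 0.5, 0.35]])

n_users, n_items = 400, 300
u_type = rng.choice(len(user_types), size=n_users, p=user_dist)
i_type = rng.choice(len(item_types), size=n_items)   # item-type mix assumed uniform

p_like = L[u_type][:, i_type]                         # n_users x n_items matrix
true_ratings = 1 + (rng.random((n_users, n_items)) < p_like).astype(int)  # 1 or 2
observed_mask = rng.random((n_users, n_items)) < O[u_type][:, i_type]

# Observed ratings form the training data; for unobserved entries the ground
# truth used in the evaluation is the expected rating L[user, item] + 1.
expected_rating = 1 + p_like
```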

6.1.2. Empirical data: MovieLens


For the evaluation with an empirical dataset, we use the well-known MovieLens dataset with 1 million observations. The dataset provides user ratings for movies on a scale from 1 to 5 and includes information about movie categories and user demographics. Adopting the same gender-based group definition as Yao and Huang (2017), we define a privileged (male) and an unprivileged (female) group for this scenario.
For data pre-processing, we follow the steps described in Yao and Huang (2017) to make the new metrics easily comparable to theirs. We only include users with at least 50 ratings and focus on movies from five categories (Action, Crime, Musical, Romance, and Sci-Fi), as these categories exhibit differences in ratings between the privileged and unprivileged groups. In addition, we only include movies with at least 100 ratings in order to obtain reasonably sized test sets for the cross-validation. Overall, the filtered MovieLens dataset used in the following evaluation contains 2788 unique users and 815 unique movies.

6.2. Simulation setting, algorithms, and evaluation metrics

We use the previously described datasets to investigate the variety of fairness metrics as well as the proposed bias mitigation strategy. For both datasets, we adopt an evaluation strategy similar to Yao and Huang (2017) and follow the general idea of 5-fold cross-validation. Specifically, the MovieLens dataset is divided into 5 folds, and each fold is used exactly once for testing the collaborative filtering approach (trained on the other 4 folds). For the synthetic data, we independently create the synthetic dataset 5 times. In both cases, the reported results represent the average over the 5 test folds/evaluation datasets.1
We use two standard collaborative filtering approaches to build the recommender system: a KNN-based item–item model and an Alternating Least Squares (ALS) approach (Zhou, Wilkinson, Schreiber, & Pan, 2008). While these are standard approaches, the focus of this work is the investigation of the fairness metrics and the bias mitigation strategy, not a specific algorithm. For the bias mitigation strategy to increase fairness, we use the approach described in Algorithm 1 with the value-based and parity-based adjustments. The two approaches are referred to as ‘val_adj’ and ‘parity_adj’, respectively, and compared against the original solution (‘_orig’).
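A condensed sketch of this setup is shown below, assuming the LensKit (LKPY) 0.x API, in which crossfold.partition_rows, batch.predict, and the item_knn.ItemItem / als.BiasedMF algorithms provide the cross-validation and prediction primitives; the hyper-parameter values are placeholders and not taken from the paper.

```python
import pandas as pd
from lenskit import batch, crossfold as xf
from lenskit.algorithms import als, item_knn

# ratings: DataFrame loaded elsewhere with the LensKit columns 'user', 'item', 'rating'.

def cross_validate(ratings: pd.DataFrame, make_algo) -> pd.DataFrame:
    """5-fold CV: each fold is predicted once by a model trained on the other folds."""
    folds = []
    for train, test in xf.partition_rows(ratings, 5):
        algo = make_algo()
        algo.fit(train)
        folds.append(batch.predict(algo, test))  # adds a 'prediction' column
    return pd.concat(folds, ignore_index=True)

preds_itemitem = cross_validate(ratings, lambda: item_knn.ItemItem(20))
preds_als = cross_validate(ratings, lambda: als.BiasedMF(50))
```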
The evaluation uses several performance- and fairness-related metrics. For fairness, the previously described metrics are used. For performance, we use two traditional evaluation metrics, the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE), to evaluate the rating predictions. The Python-based implementation of the simulation uses the LensKit package (Ekstrand, 2019) as well as the AIF360 package for algorithmic fairness (Bellamy et al., 2018).
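For reference, the group-based unfairness measures of Yao and Huang (2017) can be computed directly from such a prediction table. The sketch below shows value unfairness and non-parity, together with RMSE and MAE, assuming a DataFrame with the columns rating, prediction, item, and a binary privileged indicator (column names are ours).

```python
import numpy as np
import pandas as pd

def value_unfairness(df: pd.DataFrame) -> float:
    """Average over items of the absolute difference between the mean signed
    prediction errors of the privileged and unprivileged groups."""
    signed_err = df.assign(err=df["prediction"] - df["rating"])
    per_item = signed_err.pivot_table(index="item", columns="privileged",
                                      values="err", aggfunc="mean").dropna()
    # dropna() removes items rated by only one of the two groups.
    return float((per_item[1] - per_item[0]).abs().mean())

def non_parity(df: pd.DataFrame) -> float:
    """Absolute difference between the groups' overall average predicted ratings."""
    group_means = df.groupby("privileged")["prediction"].mean()
    return float(abs(group_means.loc[1] - group_means.loc[0]))

def rmse(df: pd.DataFrame) -> float:
    return float(np.sqrt(((df["prediction"] - df["rating"]) ** 2).mean()))

def mae(df: pd.DataFrame) -> float:
    return float((df["prediction"] - df["rating"]).abs().mean())
```

The absolute, under-, and overestimation unfairness measures follow the same per-item pattern, using the absolute or one-sided errors instead of the signed error.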

1 To investigate the impact of the randomly created synthetic data on the evaluation results, we also repeated the evaluation using 10 independently created synthetic datasets. The results were consistent with the presented evaluation.


Fig. 4. Average performance of different algorithms and bias mitigation approaches on performance and fairness for the synthetic dataset.

6.3. Synthetic data: Fairness metrics and effects of bias mitigation

For the synthetic data, Fig. 4 provides an overview of the results of 5 independently repeated runs of the simulation.
We can observe several interesting results. First, the choice of algorithm has a substantial effect on the observed performance and fairness metrics. Specifically, for the synthetic dataset, the ItemItem approach leads to better results with respect to the traditional performance metrics and to lower unfairness in many of the fairness categories. However, this advantage is not universal: for the inequality indices, ALS provides fairer (i.e., less biased) solutions over the 5 independent repetitions.
Second, the fairness metrics introduced in Section 3 provide a valuable addition when comparing the performance of different algorithms. There is a clear trade-off: for selected fairness considerations one type of algorithm can outperform the other, whereas for other metrics the opposite holds. Depending on the fairness metric considered most relevant for a given scenario, this comparison can help a recommender system developer select the algorithm that achieves the best fairness in recommendations. On the other hand, these results reflect the complexity of considering different types of fairness definitions and mitigation strategies, and underline that achieving a good result in one fairness metric (e.g., the ItemItem approach leads to lower value unfairness) can come with worse results in another (e.g., the Theil index performance of ItemItem vs. ALS).
Considering the effectiveness of the bias mitigation approach, the results indicate that we can successfully increase the fairness of the recommendations for the fairness metric on which the mitigation algorithm focuses.


Fig. 5. Average performance of different algorithms and bias mitigation approaches on performance and fairness metrics for the MovieLens dataset.

For example, the value-based adjustment significantly lowers the value and absolute unfairness compared to the original solution, in addition to lowering the overestimation unfairness. Parity-based bias mitigation successfully lowers non-parity, yet has mixed effects on the other metrics (some are increased, some are decreased).
Increasing fairness is one aspect; however, we also need to analyze the impact of such a strategy on traditional performance metrics. For the synthetic data, the bias mitigation approaches even yield a lower average RMSE and MAE than the original, unmitigated predictions. This is a promising result, as it indicates that in certain scenarios bias mitigation and performance are not competing goals. Yet, as the evaluation of the empirical dataset will show, this does not hold across all datasets.

6.4. Empirical data

Fig. 5 shows the results for the MovieLens dataset. Similar to before, we can see differences between the two recommender algorithms themselves, albeit small ones. More noticeable, however, is that the bias mitigation approach produces positive results on this dataset as well. Considering the mitigation of value unfairness, we see that value unfairness can be (slightly) lowered when using the value adjustment strategy. The improvement for the parity-based adjustment is more noticeable: the results indicate that almost perfect parity (and, correspondingly, an almost perfect disparate impact score) can be achieved. However, this slightly increases the value and overestimation unfairness measures.


Taking the performance measures RMSE and MAE into account, the parity-based bias mitigation strategy seems to incur a small loss of predictive performance, as indicated by the slight increase in RMSE. In contrast, the bias mitigation strategies do not seem to affect the MAE values.
Overall, the suggested bias mitigation approach significantly lowers selected unfairness metrics, in some cases at the cost of slightly worse performance metrics. This trade-off between fairness and performance, as well as the choice of which fairness metric to focus on, needs to be considered when selecting a specific type of bias mitigation strategy for a given scenario.

6.5. Discussion and limitations

The evaluation using the synthetic and MovieLens datasets provided several interesting insights. First, in both datasets the bias mitigation approach is successful in increasing fairness according to the specific target metric; i.e., value-based bias mitigation reduces value unfairness, and parity-based adjustment reduces non-parity in the empirical MovieLens dataset. Second, the effects of the bias mitigation on other metrics are complex, and thus representative of the complexity of using and comparing different definitions of fairness. In general, while the target fairness metric can be improved by the bias mitigation approach in most cases, other fairness metrics might be decreased or increased at the same time. Third, the impact of fairness improvements on predictive performance in the rating prediction scenario, measured by RMSE and MAE, also depends on the specific bias mitigation strategy. Specifically, value-based adjustment showed improved RMSE values for the synthetic dataset and slight improvements for the MovieLens dataset, whereas parity-based adjustment led to a slight increase in RMSE and MAE for the MovieLens dataset. Hence, the impact of fairness improvements on predictive performance needs to be studied more extensively and could well be scenario-dependent.
Overall, referring back to the research questions posed in Section 1, for RQ 1 we showed that a mapping of fairness metrics between the machine learning and recommender system domains is possible and useful for identifying synergies and leveraging concepts that help define and improve fairness in both machine learning and recommender systems.
Given the number of different fairness metrics that have been proposed in the literature, our mapping does not cover an exhaustive list of fairness concepts in machine learning but focuses on core concepts commonly used in fair classification. Most decision-making systems that use machine learning address a binary classification problem, while this paper considers fairness concepts applicable to a rating-based recommender system, which is a multi-class problem. Exploring the implications of adapting fair machine learning concepts to recommender systems remains a novel avenue for future work.
With respect to RQ 2, we show that our proposed bias mitigation approach, inspired by the machine learning domain, can successfully increase the fairness of recommendations. The evaluation further highlights apparent trade-offs between achieving higher fairness and better performance in certain scenarios.
A limitation of the proposed bias mitigation approach is that it rests on the strong assumption that the training and test sets are similar in their unfairness characteristics; it will potentially not work well when this assumption does not hold. Also, the current case study looks at fairness only from a user-fairness perspective, while ensuring fairness from an item perspective is equally essential. For a better generalization of the performance of the proposed approach, the case study can be extended to other types of datasets as well. Finally, as the current study only considers rating predictions, an extension to ranking-based recommendation is a natural next step. Overall, these results provide useful insights into how fairness can be integrated into recommender systems, benefiting both the users and the system itself.

7. Conclusion

Algorithmic decision making is becoming ubiquitous, yet the increased adoption and potential impact on users and individuals
also requires the consideration of potential discrimination of these algorithmic decisions against (groups of) users. In this article,
we contribute to the emerging literature on fairness in recommender systems in two ways.
First, we develop a mapping of fairness metrics from machine learning to the recommender system domain to leverage synergies
between the two domains and tap into the extensive history of fairness approaches in machine learning. The mapping between the
two domains identifies fairness concepts that are equivalent or similar in both domains as well as concepts that do not extend to
the other domain.
Second, we propose a novel bias mitigation strategy for recommender systems that aims to increase the fairness of the provided rating predictions. We show the effectiveness of the proposed bias mitigation approach through two case studies, using a synthetic dataset which represents an observation bias in ratings and an empirical dataset (MovieLens 1M). The case studies analyze the effect of the bias mitigation approach on the considered fairness and performance metrics. Our results indicate that the fairness of the user recommendations can be increased using the bias mitigation approach, whereas the effect on performance metrics such as RMSE and MAE is scenario-dependent.
Our categorization and mapping of fairness metrics, as well as the analysis of bias mitigation strategies, allow both researchers and recommender system practitioners to understand and address different fairness aspects in recommendations, as well as trade-offs between system performance and fairness. The suggested bias mitigation approach can be used to lower specific biases in user rating predictions and make the resulting recommender system fairer from a user perspective.
There are several topics that we aim to investigate next. First, while the bias mitigation approach seems to work quite well, we
aim to consider additional pre-processing and in-processing techniques and their effectiveness in reducing bias. For example, Yao and
Huang (2017) suggest an in-processing technique that includes specific fairness metrics in the calculation of the matrix factorization,
and show that this can lower specific fairness metrics as well. Pre-processing is successfully used in machine learning to reduce bias,


and its potential for recommender systems is an interesting research question. Second, we aim to further investigate the potential trade-offs between fairness and performance metrics. The evaluation in Section 6 showed that increasing fairness can lead to both lower and higher RMSE and MAE, depending on the scenario. An extended study that considers different parameterizations and adjustments to our suggested bias mitigation approach can shed additional light on the ‘cost’ of reducing bias, similar to a study performed on the cost of achieving fairness in machine learning (Haas, 2019). Third, we plan to extend the current evaluation by considering additional datasets and scenarios. While the MovieLens dataset lends itself to this analysis due to its popularity and the demographic information that can be used to define privileged and unprivileged groups, the effectiveness of the bias mitigation strategy in other scenarios is worthy of investigation.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared
to influence the work reported in this paper.

References

Abdollahi, B., & Nasraoui, O. (2018). Transparency in fair machine learning: the case of explainable recommender systems (pp. 21–35).
Abdollahpouri, H., Burke, R., & Mobasher, B. (2019). Managing popularity bias in recommender systems with personalized re-ranking. arXiv preprint
arXiv:1901.07555.
Abdollahpouri, H., Mansoury, M., Burke, R., & Mobasher, B. (2019). The unfairness of popularity bias in recommendation. arXiv preprint arXiv:1907.13286.
Agarwal, A., Dudik, M., & Wu, Z. S. (2019). Fair regression: Quantitative definitions and reduction-based algorithms. In K. Chaudhuri, & R. Salakhutdinov (Eds.),
Proceedings of machine learning research: vol. 97, Proceedings of the 36th international conference on machine learning (pp. 120–129). Long Beach, California,
USA: PMLR, http://proceedings.mlr.press/v97/agarwal19d.html.
Baeza-Yates, R. (2018). Bias on the web. Communications of the ACM, 61(6), 54–61. http://dx.doi.org/10.1145/3209581.
Baeza-Yates, R. (2020). Bias in search and recommender systems. In Fourteenth ACM conference on recommender systems (p. 2). New York, NY, USA: Association
for Computing Machinery.
Bellamy, R. K., Dey, K., Hind, M., Hoffman, S. C., Houde, S., Kannan, K., et al. (2018). AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias.
Bellogín, A., Castells, P., & Cantador, I. (2017). Statistical biases in information retrieval metrics for recommender systems. Information Retrieval Journal, 20(6),
606–634.
Berk, R., Heidari, H., Jabbari, S., Joseph, M., Kearns, M., Morgenstern, J., et al. (2017). A convex framework for fair regression. arXiv:1706.02409.
Bernhardt, D., Krasa, S., & Polborn, M. (2008). Political polarization and the electoral effects of media bias. Journal of Public Economics, 92(5–6), 1092–1104.
Beutel, A., Chen, J., Doshi, T., Qian, H., Wei, L., Wu, Y., et al. (2019). Fairness in recommendation ranking through pairwise comparisons. In Proceedings of
the 25th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 2212–2220). New York, NY, USA: Association for Computing
Machinery, http://dx.doi.org/10.1145/3292500.3330745.
Bickel, P. J., Hammel, E. A., & O'Connell, J. W. (1975). Sex bias in graduate admissions: Data from Berkeley. Science, 187(4175), 398–404. http://dx.doi.org/
10.1126/science.187.4175.398, arXiv:https://science.sciencemag.org/content/187/4175/398.full.pdf.
Biega, A. J., Gummadi, K. P., & Weikum, G. (2018). Equity of attention: Amortizing individual fairness in rankings. In The 41st international ACM SIGIR conference
on research and development in information retrieval (pp. 405–414). New York, NY, USA: Association for Computing Machinery, http://dx.doi.org/10.1145/
3209978.3210063.
Binns, R. (2020). On the apparent conflict between individual and group fairness. In Proceedings of the 2020 conference on fairness, accountability, and transparency
(pp. 514–524). New York, NY, USA: Association for Computing Machinery, http://dx.doi.org/10.1145/3351095.3372864.
Blyth, C. R. (1972). On simpson’s paradox and the sure-thing principle. Journal of the American Statistical Association, 67(338), 364–366, http://www.jstor.org/
stable/2284382.
Buolamwini, J. (2017). Gender shades: intersectional phenotypic and demographic evaluation of face datasets and gender classifiers (Ph.D. thesis).
Burke, R., Sonboli, N., & Ordonez-Gauger, A. (2018). Balanced neighborhoods for multi-sided fairness in recommendation. In FAT.
Calders, T., & Verwer, S. (2010). Three naive Bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery, 21(2), 277–292.
Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334),
183–186.
Calmon, F., Wei, D., Vinzamuri, B., Ramamurthy, K. N., & Varshney, K. R. (2017). Optimized pre-processing for discrimination prevention. In Advances in neural
information processing systems (pp. 3992–4001).
Celis, L. E., Huang, L., Keswani, V., & Vishnoi, N. K. (2020). Classification with fairness constraints: A meta-algorithm with provable guarantees. In Proceedings
of the conference on fairness, accountability, and transparency (pp. 319–328).
Chen, J., Dong, H., Wang, X., Feng, F., Wang, M., & He, X. (2020). Bias and debias in recommender system: A survey and future directions. arXiv:2010.03240.
Chen, J., Kallus, N., Mao, X., Svacha, G., & Udell, M. (2019). Fairness under unawareness: Assessing disparity when protected class is unobserved. In
Proceedings of the conference on fairness, accountability, and transparency (pp. 339–348). New York, NY, USA: Association for Computing Machinery,
http://dx.doi.org/10.1145/3287560.3287594.
Ciampaglia, G. L., Nematzadeh, A., Menczer, F., & Flammini, A. (2018). How algorithmic popularity bias hinders or promotes quality. Scientific Reports, 8(1),
http://dx.doi.org/10.1038/s41598-018-34203-2.
Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., & Huq, A. (2017). Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM
SIGKDD international conference on knowledge discovery and data mining (pp. 797–806). New York, New York, USA: ACM Press.
Deshpande, K. V., Pan, S., & Foulds, J. R. (2020). Mitigating demographic bias in AI-based resume filtering. In Adjunct publication of the 28th ACM conference on
user modeling, adaptation and personalization (pp. 268–275).
Doshi-Velez, F., & Kim, B. (2017). Towards a rigorous science of interpretable machine learning.
Drozdowski, P., Rathgeb, C., Dantcheva, A., Damer, N., & Busch, C. (2020). Demographic bias in biometrics: A survey on an emerging challenge. IEEE Transactions
on Technology and Society, 1(2), 89–103. http://dx.doi.org/10.1109/TTS.2020.2992344.
Dwork, C., Hardt, M., Pitassi, T., Reingold, O., & Zemel, R. (2012). Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science
conference (pp. 214–226). New York, NY, USA: Association for Computing Machinery, http://dx.doi.org/10.1145/2090236.2090255.
Edizel, B., Bonchi, F., Hajian, S., Panisson, A., & Tassa, T. (2020). FaiRecSys: mitigating algorithmic bias in recommender systems. International Journal of Data
Science and Analytics, 9(2), 197–213.
Ekstrand, M. D. (2019). The LKPY package for recommender systems experiments: Next-generation tools and lessons learned from the lenskit project.


Ekstrand, M. D., Burke, R., & Diaz, F. (2019). Fairness and discrimination in recommendation and retrieval. In Proceedings of the 13th ACM conference on
recommender systems (pp. 576–577). New York, NY, USA: Association for Computing Machinery, http://dx.doi.org/10.1145/3298689.3346964.
Ekstrand, M. D., Tian, M., Azpiazu, I. M., Ekstrand, J. D., Anuyah, O., McNeill, D., et al. (2018). All the cool kids, how do they fit in?: Popularity and demographic
biases in recommender evaluation and effectiveness. In Conference on fairness, accountability and transparency (pp. 172–186).
Ekstrand, M. D., Tian, M., Kazi, M. R. I., Mehrpouyan, H., & Kluver, D. (2018). Exploring author gender in book rating and recommendation. In Proceedings of
the 12th ACM conference on recommender systems (pp. 242–250). New York, New York, USA: ACM Press.
Feldman, M., Friedler, S. A., Moeller, J., Scheidegger, C., & Venkatasubramanian, S. (2015). Certifying and removing disparate impact. In Proceedings of the 21th
ACM SIGKDD international conference on knowledge discovery and data mining (pp. 259–268). New York, New York, USA: ACM Press.
Fish, B., Kun, J., & Lelkes, Á. D. (2016). A confidence-based approach for balancing fairness and accuracy. In Proceedings of the 2016 SIAM international conference
on data mining (pp. 144–152). SIAM.
Friedman, B., & Nissenbaum, H. (1996). Bias in computer systems. ACM Transactions on Information Systems, 14(3), 330–347. http://dx.doi.org/10.1145/230538.
230561.
Gajane, P., & Pechenizkiy, M. (2017). On formalizing fairness in prediction with machine learning. arXiv:1710.03184.
Grgic-Hlaca, N., Zafar, M., Gummadi, K., & Weller, A. (2016). The case for process fairness in learning: Feature selection for fair decision making.
Haas, C. (2019). The price of fairness - A framework to explore trade-offs in algorithmic fairness. In 2019 international conference on information systems.
Hardt, M., Price, E., & Srebro, N. (2016). Equality of opportunity in supervised learning. In Proceedings of the 30th international conference on neural information
processing systems (pp. 3323–3331). USA: Curran Associates Inc.
Hargittai, E. (2007). Whose space? Differences among users and non-users of social network sites. J. Comp.-Med. Commun., 13(1), 276–297. http://dx.doi.org/
10.1111/j.1083-6101.2007.00396.x.
Harper, F. M., & Konstan, J. A. (2016). The movielens datasets: History and context. Acm Transactions on Interactive Intelligent Systems (Tiis), 5(4), 19.
Introna, L., & Nissenbaum, H. (2000). Defining the web: The politics of search engines. Computer, 33, 54–62. http://dx.doi.org/10.1109/2.816269.
Islam, R., Keya, K. N., Pan, S., & Foulds, J. (2019). Mitigating demographic biases in social media-based recommender systems. KDD (Social Impact Track).
Jannach, D., Lerche, L., Kamehkhosh, I., & Jugovac, M. (2015). What recommenders recommend: An analysis of recommendation biases and possible
countermeasures. User Modeling and User-Adapted Interaction, 25(5), 427–491. http://dx.doi.org/10.1007/s11257-015-9165-3.
Kamiran, F., & Calders, T. (2012). Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1), 1–33.
Kamiran, F., Karim, A., & Zhang, X. (2012). Decision theory for discrimination-aware classification. In 2012 IEEE 12th international conference on data mining (pp.
924–929). IEEE.
Kamishima, T., Akaho, S., Asoh, H., & Sakuma, J. (2012). Fairness-aware classifier with prejudice remover regularizer. In Joint European conference on machine
learning and knowledge discovery in databases (pp. 35–50). Berlin, Heidelberg: Springer.
Kamishima, T., Akaho, S., Asoh, H., & Sakuma, J. (2020). Recommendation independence. In Conference on fairness, accountability and transparency (pp. 187–201).
Kowald, D., Schedl, M., & Lex, E. (2020). The unfairness of popularity bias in music recommendation: A reproducibility study. In European conference on information
retrieval (pp. 35–42). Springer.
Kusner, M. J., Loftus, J., Russell, C., & Silva, R. (2017). Counterfactual fairness. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan,
& R. Garnett (Eds.), Advances in neural information processing systems (vol. 30) (pp. 4066–4076). Curran Associates, Inc., URL https://proceedings.neurips.cc/
paper/2017/file/a486cd07e4ac3d270571622f4f316ec5-Paper.pdf.
Lee, E. L., Lou, J.-K., Chen, W.-M., Chen, Y.-C., Lin, S.-D., Chiang, Y.-S., et al. (2014). Fairness-aware loan recommendation for microfinance services. In Proceedings
of the 2014 international conference on social computing (pp. 1–4). New York, New York, USA: ACM Press.
Lee, N. T., Resnick, P., & Barton, G. (2019). Algorithmic bias detection and mitigation: Best practices and policies to reduce consumer harms.
Lerman, K., & Hogg, T. (2014). Leveraging position bias to improve peer recommendation. Communications of the ACM, 61(6), 54–61. http://dx.doi.org/10.1145/
3209581.
Lum, K., & Johndrow, J. (2016). A statistical framework for fair predictive algorithms. arXiv e-prints. arXiv:1610.08077.
Mehrabi, N., Morstatter, F., Peng, N., & Galstyan, A. (2019). Debiasing community detection: The importance of lowly connected nodes. In Proceedings of the
2019 IEEE/ACM international conference on advances in social networks analysis and mining (pp. 509–512). New York, NY, USA: Association for Computing
Machinery, http://dx.doi.org/10.1145/3341161.3342915.
Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2019). A survey on bias and fairness in machine learning. arXiv:1908.09635.
Mester, T. (2021). Artificial intelligence partiality | computing. URL https://data36.com/statistical-bias-types-explained/.
Miller, H., Thebault-Spieker, J., Chang, S., Johnson, I., Terveen, L., & Hecht, B. (2016). ‘‘Blissfully happy’’ or ‘‘ready to fight’’: Varying interpretations of emoji.
URL https://www.aaai.org/ocs/index.php/ICWSM/ICWSM16/paper/view/13167.
Nguyen, D., Gravel, R., Trieschnigg, D., & Meder, T. (2013). ‘‘How old do you think I am?’’: A study of language and age in twitter. In Proceedings of the 7th
international conference on weblogs and social media (pp. 439–448).
Olteanu, A., Castillo, C., Diaz, F., & Kiciman, E. (2016). Social data: Biases, methodological pitfalls, and ethical boundaries. SSRN Electronic Journal, http://dx.doi.org/10.2139/ssrn.2886526.
Pedreshi, D., Ruggieri, S., & Turini, F. (2008). Discrimination-aware data mining. In Proceedings of the 14th ACM SIGKDD international conference on knowledge
discovery and data mining (pp. 560–568). New York, NY, USA: ACM.
Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J., & Weinberger, K. Q. (2017). On fairness and calibration. In Advances in neural information processing systems (pp.
5680–5689).
Pollack, E. (2013). Why are there still so few women in science. The New York Times Magazine, 3.
Rastegarpanah, B., Gummadi, K. P., & Crovella, M. (2019). Fighting fire with fire. In Proceedings of the twelfth ACM international conference on web search and
data mining (pp. 231–239). New York, New York, USA: ACM Press.
Ricci, F., Rokach, L., & Shapira, B. (2011). Introduction to recommender systems handbook. In Recommender systems handbook (pp. 1–35). Boston, MA: Springer
US.
Ruggieri, S., Pedreschi, D., & Turini, F. (2010). Data mining for discrimination discovery. ACM Transactions on Knowledge Discovery from Data, 4(2), http://dx.doi.org/10.1145/1754428.1754432.
Shi, Y., Karatzoglou, A., Baltrunas, L., Larson, M., Hanjalic, A., & Oliver, N. (2012). TFMAP: Optimizing MAP for top-n context-aware recommendation. In
Proceedings of the 35th international ACM SIGIR conference on research and development in information retrieval (pp. 155–164). NY, USA: ACM.
Singh, A., & Joachims, T. (2018). Fairness of exposure in rankings. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery and
data mining (pp. 2219–2228). New York, NY, USA: Association for Computing Machinery, http://dx.doi.org/10.1145/3219819.3220088.
Speicher, T., Heidari, H., Grgic-Hlaca, N., Gummadi, K. P., Singla, A., Weller, A., et al. (2018). A unified approach to quantifying algorithmic unfairness. In
Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 2239–2248). New York, New York, USA: ACM Press.
Steck, H. (2018). Calibrated recommendations. In Proceedings of the 12th ACM conference on recommender systems (pp. 154–162). New York, NY, USA: Association
for Computing Machinery, http://dx.doi.org/10.1145/3240323.3240372.
Suresh, H., & Guttag, J. V. (2019). A framework for understanding unintended consequences of machine learning. arXiv:1901.10002.
Sweeney, L. (2013). Discrimination in online ad delivery. Queue, 11(3), 10.
Tufekci, Z. (2014). Big questions for social media big data: Representativeness, validity and other methodological pitfalls. arXiv:1403.7400.


Verma, S., & Rubin, J. (2018). Fairness definitions explained. In Proceedings of the international workshop on software fairness (pp. 1–7). New York, NY, USA: ACM,
http://dx.doi.org/10.1145/3194770.3194776, http://doi.acm.org/10.1145/3194770.3194776.
Wang, T., & Wang, D. (2014). Why Amazon’s ratings might mislead you: The story of herding effects. Big Data, 2, 196–204. http://dx.doi.org/10.1089/big.2014.
0063.
Weydemann, L., Sacharidis, D., & Werthner, H. (2019). Defining and measuring fairness in location recommendations. In Proceedings of the 3rd ACM SIGSPATIAL
international workshop on location-based recommendations, geosocial networks and geoadvertising. New York, NY, USA: Association for Computing Machinery,
http://dx.doi.org/10.1145/3356994.3365497.
Wilson, C., Boe, B., Sala, A., Puttaswamy, K. P., & Zhao, B. Y. (2009). User interactions in social networks and their implications. In Proceedings of the 4th
ACM european conference on computer systems (pp. 205–218). New York, NY, USA: Association for Computing Machinery, http://dx.doi.org/10.1145/1519065.
1519089.
Yao, S., & Huang, B. (2017). Beyond parity: Fairness objectives for collaborative filtering. In Advances in neural information processing systems 30 (pp. 2921–2930).
Zafar, M. B., Valera, I., Gomez Rodriguez, M., & Gummadi, K. P. (2017a). Fairness beyond disparate treatment & disparate impact. In Proceedings of the 26th
international conference on world wide web. international world wide web conferences steering committee, http://dx.doi.org/10.1145/3038912.3052660.
Zafar, M., Valera, I., Gomez-Rodriguez, M., & Gummadi, K. (2020). Fairness constraints: Mechanisms for fair classification. In AISTATS.
Zemel, R., Wu, Y., Swersky, K., Pitassi, T., & Dwork, C. (2020). Learning fair representations. In Proceedings of the 30th international conference on machine learning
(pp. 325–333).
Zhou, J., & Chen, F. (2018). Human and machine learning e-book (pp. 21–35).
Zhou, Y., Wilkinson, D., Schreiber, R., & Pan, R. (2008). Large-scale parallel collaborative filtering for the netflix prize. In International conference on algorithmic
applications in management (pp. 337–348). Springer.
Zhu, Z., Hu, X., & Caverlee, J. (2018). Fairness-aware tensor-based recommendation. In Proceedings of the 27th ACM international conference on information and
knowledge management (pp. 1153–1162). New York, New York, USA: ACM Press.
Zhu, Z., Wang, J., & Caverlee, J. (2020). Measuring and mitigating item under-recommendation bias in personalized ranking systems. In Proceedings of the
43rd international ACM SIGIR conference on research and development in information retrieval (pp. 449–458). New York, NY, USA: Association for Computing
Machinery, http://dx.doi.org/10.1145/3397271.3401177.
