
Evaluating recommender systems

Introduction
Recommender systems require that users interact with computer systems as well
as with other users. Therefore, many methods used in sociobehavioral research
are applicable when answering research questions such as the following:
 Do users find interactions with a recommender system useful?
 Are they satisfied with the quality of the recommendations they receive?
 What drives people to contribute knowledge such as ratings and comments that boost the quality of a system’s predictions?
 What is it exactly that users like about receiving recommendations? Is it the degree of serendipity and novelty, or just the fact that they are spared from having to search for them?
Many more questions like these could be formulated and researched to evaluate
whether a technical system is efficient with respect to a specified goal, such as
increasing customer satisfaction or ensuring the economic success of an
e-commerce platform.

Basic characteristics of evaluation designs


 The above table differentiates empirical research based on the units that are subjected to the research methods, such as people or computer hardware.
 Furthermore, it denotes the top-level taxonomy of empirical research methods, namely experimental and non-experimental research, as well as the distinction between real-world and lab scenarios in which evaluations can be conducted.
General properties of evaluation research
(i) General remarks
 Thoroughly describing the methodology, following a
systematic procedure, and documenting the decisions made
during the course of the evaluation exercise ensure that the
research can be repeated and the results verified. This answers
the question of how the research has been done.
 Furthermore, criteria such as the
(a) validity,
(b) reliability, and
(c) sensibility of the constructs used and measured
relate to the subject matter of the research, questioning
what is done.
 Internal validity refers to the extent to which the effects
observed are due to the controlled test conditions (e.g., the
varying of a recommendation algorithm’s parameters)
instead of differences in the set of participants
(predispositions) or uncontrolled/unknown external effects.

 In contrast, external validity refers to the extent to which
results are generalizable to other user groups or situations.
 It examines whether the evaluated recommendation scenario is
representative of real-world situations and whether the findings
of the evaluation exercise are transferable to them.
 Reliability is another postulate of rigorous empirical work,
requiring the absence of inconsistencies and errors in the data
and measurements.

 Sensibility requires that differences in the observed aspects
are also reflected in differences in the measured values.
(ii) Subjects of evaluation design

People are typically the subjects of sociobehavioral research studies – that is, the focus of observers.
Obviously, in recommender systems research, the populations of interest are primarily specific subgroups such as online customers, web users, or students who receive adaptive and personalized item suggestions.

An experimental setup that is widespread in machine learning (ML) and information retrieval (IR) is the use of datasets with synthetic or historical user interaction data.
 Synthetic datasets tend to be biased toward the design of a specific algorithm and therefore treat other algorithms unfairly.

 Natural datasets include historical interaction records of real users. They can be categorized based on the type of user actions recorded.
 For example, the most prominent datasets from the movie domain contain explicit user ratings on a multipoint Likert scale.

 The sparsity of a dataset is derived from the ratio of empty to total entries in the user–item matrix and is computed as follows:

sparsity = 1 − |R| / (|U| · |I|)

where |R| is the number of recorded ratings, |U| the number of users, and |I| the number of items.
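A minimal sketch of this computation in Python (the toy matrix below is an assumption, with 0 marking an empty entry):

```python
import numpy as np

# Toy user-item rating matrix: rows are users, columns are items.
# 0 marks an empty (unrated) entry; nonzero values are ratings.
R = np.array([
    [5, 0, 0, 3],
    [0, 4, 0, 0],
    [1, 0, 2, 0],
])

ratings = np.count_nonzero(R)   # |R|: number of recorded ratings
total = R.size                  # |U| * |I|: total entries in the matrix
sparsity = 1 - ratings / total  # ratio of empty to total entries

print(f"sparsity = {sparsity:.2f}")  # 1 - 5/12 -> 0.58
```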
Nevertheless, the results of evaluating recommender systems using
historical datasets cannot be compared directly to studies with real
users and vice versa.

 Consider the classification scheme depicted in the above figure. If an item that was proposed by the recommender is actually liked by the user, it is classified as a correct prediction.
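A minimal sketch of the underlying classification, assuming the usual 2x2 layout (the function name below is hypothetical):

```python
# An item that is recommended AND actually liked counts as a correct
# prediction (true positive); the other cells follow from the 2x2 scheme.
def classify(recommended: bool, liked: bool) -> str:
    if recommended:
        return "true positive (correct prediction)" if liked else "false positive"
    return "false negative" if liked else "true negative"

print(classify(recommended=True, liked=True))  # -> true positive (correct prediction)
```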

 If a recommender is evaluated using historical user data, preference information is only known for those items that have actually been rated by the users.

 No assumptions can be made for the unrated items, because users may simply not have been aware of their existence.
 Thus, one needs to be aware that evaluating recommender systems using either online users or historical data has some shortcomings.
 These shortcomings can be overcome only by providing a marketplace (i.e., the set of all recommendable items) that is completely transparent to users who, therefore, rate all items.
Research methods
Defining the goals of research and identifying which aspects of the users or
subjects of the scientific inquiry are relevant in the context of recommendation
systems lie at the starting point of any evaluation.

These observed or measured aspects are termed variables in empirical research; they can be assumed to be either independent or dependent.
Independent Variables:
Examples include gender, income, education, or personality traits, as they are, in principle, static throughout the course of the scientific inquiry.

Further variables are independent if they are controlled by the evaluation design,
such as the type of recommendation algorithm that is applied to users or the
items that are recommended to them.
Dependent variables are those that are assumed to be influenced
by the independent variables – for instance, user satisfaction,
perceived utility, or click-through rate can be measured.
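For example, click-through rate as a dependent variable might be computed like this (a minimal sketch; the log format is an assumption):

```python
# Hypothetical interaction log: (user, recommended_item, clicked)
log = [
    ("u1", "movie_1", True),
    ("u1", "movie_2", False),
    ("u2", "movie_1", True),
    ("u2", "movie_3", False),
]

impressions = len(log)                               # recommendations shown
clicks = sum(1 for _, _, clicked in log if clicked)  # recommendations clicked
ctr = clicks / impressions                           # click-through rate

print(f"CTR = {ctr:.2f}")  # 2 clicks / 4 impressions -> 0.50
```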
Experimental Design:
In an experimental research design, one or more of the independent variables are manipulated to ascertain their impact on the dependent variables:
 An experiment is a study in which at least one variable is manipulated and units are randomly assigned to the different levels or categories of the manipulated variables.
 The following figure illustrates such an experimental design, in which subjects (i.e., units) are randomly assigned to different treatments – for instance, different recommendation algorithms.
 Thus, the type of algorithm would constitute the manipulated variable.
 The dependent variables (e.g., v1 and v2 in the figure) are measured before and after the treatment – for instance, with the help of a questionnaire or by implicitly observing user behavior.
An example Experimental Design
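A minimal sketch of such a design in Python; the treatment labels and measurement functions below are hypothetical stand-ins for real algorithms and questionnaires:

```python
import random

# Random assignment of subjects to treatments; measure_satisfaction and
# apply_treatment are placeholders for a questionnaire and for switching
# a user to a given recommendation algorithm, respectively.

def measure_satisfaction(user: str) -> float:
    return random.random()  # placeholder measurement

def apply_treatment(user: str, treatment: str) -> None:
    pass  # e.g., serve recommendations from the given algorithm

subjects = [f"user_{i}" for i in range(100)]
treatments = ["algorithm_A", "algorithm_B"]

random.shuffle(subjects)  # random assignment rules out selection bias
groups = {t: subjects[i::len(treatments)] for i, t in enumerate(treatments)}

for treatment, group in groups.items():
    v1 = [measure_satisfaction(u) for u in group]  # measured before treatment
    for u in group:
        apply_treatment(u, treatment)
    v2 = [measure_satisfaction(u) for u in group]  # measured after treatment
```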
 A quasi-experimental design distinguishes itself from a real experiment
by its lack of random assignment of subjects to the different treatments – in
other words, subjects decide on their own about their treatment.

 This might introduce uncontrollable bias because subjects may make the
decision based on unknown reasons.

 For instance, when comparing mortality rates between populations being treated in hospitals and those staying at home, it is obvious that higher mortality rates in hospitals do not allow us to conclude that these medical treatments are a threat to people’s lives.

 However, when comparing the purchase rates of e-commerce users who used a recommender system with the purchase rates of those who did not, such a methodological flaw is less obvious.
 Non-experimental designs include all other forms of quantitative
research, as well as qualitative research.

 Quantitative research relies on numerical measurements of different aspects of objects, such as asking users different questions about the perceived utility of a recommendation application with answers on a seven-point Likert scale, requiring them to rate a recommended item, or measuring the viewing time of different web pages.

 In contrast, qualitative research approaches would conduct interviews with open-ended questions, record think-aloud protocols when users interact with a web site, or employ focus group discussions to find out about users’ motives for using a recommender system.
Examples of Non-experimental Designs:

Longitudinal research:
• The entity under investigation is observed repeatedly as it evolves over time.
• Such a design allows criteria such as the impact of recommendations on the customer’s lifetime value to be measured.

Cross-sectional research:
• Relations among variables that are measured simultaneously in different groups are analyzed, allowing generalizable findings from different application domains to be identified.
 Case studies
• represent an additional way of collecting and analyzing empirical evidence that can be applied to recommender systems research when researchers are interested in more principled questions.
• They focus on answering research questions about how and why, and combine whichever types of quantitative and qualitative methods are necessary to investigate contemporary phenomena in their real-life contexts.
Example:
How did recommendation technology contribute to Amazon.com becoming the world’s largest book retailer?
Evaluation settings:

 The evaluation setting is another basic characteristic of evaluation research.
 In principle, we can differentiate between lab studies and field studies.
 A lab situation is created expressly for the purpose of the study, whereas a field study is conducted in a preexisting real-world environment.
