Machine Learning Explanations:
LIME framework
Giorgio Visani
About Me
Giorgio Visani
PhD Student @ Bologna University, Computer Science & Engineering
Department (DISI)
Data Scientist @ Crif S.p.A.
Find me on: LinkedIn, Bologna University, GitHub
Why do we need Explanations?
It is difficult to understand on what grounds the algorithm took its
decision.
Right to Explanation concept: each individual affected by an
algorithm’s decisions has the right to know the model’s rationale.
Especially in Europe, there are quite strict regulatory requirements for
using Machine Learning models in sensitive fields:
• GDPR [6]
• ”Ethical Guidelines for trustworthy AI” [5]
• Report from the ”European Banking Authority” [1]
Other purposes of explanations:
• Help Data Scientists better understand how the model behaves
(e.g. for better parameter tuning)
• Interpret the powerful patterns discovered by Machine
Learning models (in order to gain a better understanding of the
phenomenon)
Background on Explanations
LIME
Generation Step
Weighting Step
Local Model Step
How to use LIME?
Models in the space
Almost any kind of information can be encoded as numbers:
• Images: intensity of the colour for each pixel
• Words: transform them with word embeddings
• Categories: create a variable with one value per category
(see the encoding sketch at the end of this slide)
The Learning Model Y = f(X1, ..., Xp)
is a surface in the space of the variables: it encodes the relationship
between the variable of interest Y and the other variables X.
Any prediction technique builds the best function f(x) to approximate
our data, given its constraints. Some techniques produce a simple f(x)
and return the formula; black-box techniques create very
complicated functions and do not provide the mathematical formula.
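As a minimal illustration of the categorical encoding mentioned above (one column per category), here is a small sketch using pandas; the column name and values are illustrative only.

```python
# One-hot encoding: each category becomes its own 0/1 column.
import pandas as pd

df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})
encoded = pd.get_dummies(df, columns=["colour"])
# -> colour_blue, colour_green, colour_red columns with 0/1 values
```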
Giorgio Visani 2 / 23
Machine vs Statistical Learning I
Why is ML more powerful than classical statistical learning?
[Figure: Statistical Learning (Linear Regression, Logistic Regression) imposes
constraints on the shape of the fitted function; Machine Learning (1D and 2D
ML models) is more flexible and adapts better to the data.]
ML handles Interactions and Correlation between variables well.
Giorgio Visani 3 / 23
Machine vs Statistical Learning II
Statistical methods:
• Constraints on the shape
• Less powerful: they adapt less to the data
• Simplicity: simple surfaces
• Parametric: they provide the f formula, given by a set of parameters.
We can understand how the function behaves without looking at a
picture of the geometrical space!

ML models:
• No constraints on the shape
• Better prediction: they capture correlation and interaction
• Complex functions
• They do not provide the f formula!
With 1 or 2 variables we can draw the f surface; with more variables
we have no way to understand the surface.
Giorgio Visani 4 / 23
Interpretability Issue
Interpretability: “the ability to explain or to present the results, in
understandable terms, to a human” [4].
Why is ML difficult to interpret?
• No formula for the surface, or a formula that is too complex
• The only alternative is to inspect the graph of the surface,
but no graph is possible with more than two X variables.
Imagine such a complicated surface in a space of 50 variables:
impossible to represent.
Giorgio Visani 5 / 23
Interpretable Models
Decision Tree
Linear Regression
Giorgio Visani 6 / 23
Interpretable Tools
Two main approaches:
• Transparent ML models:
build powerful models with simple-to-understand formulas
• Post-Hoc Techniques:
to be used on difficult black-box models
Post-Hoc methods can be further sub-divided (taxonomy credits to [9]).
Giorgio Visani 7 / 23
Surrogate Models
Surrogate:
A simpler model that mimics the ML model on the geometrical space,
but remains understandable.
Global Surrogates mimic the ML model on the entire geometrical
space
Local Surrogates focus on a small region and approximate the ML
model only in that part.
Giorgio Visani 8 / 23
Model Agnostic Techniques
Exploit the geometrical foundations of Prediction Tools:
each model (be it Statistical or Machine/Deep Learning) tries to
approximate an unknown function f(x) in the R^(p+1) geometrical space
spanned by the p independent variables X and the dependent
variable Y.
The dataset observations are points lying on the surface f(x).
The model produces a function f'(x), built using the dataset points
(y, x).
Model Agnostic Tools: the general idea is to gain insights about the
function f'(x). This can be done for any model.
Giorgio Visani 9 / 23
Background on Explanations
LIME
Generation Step
Weighting Step
Local Model Step
How to use LIME?
LIME
Model agnostic, Local technique, developed in 2016 [8]
Objective: find the tangent plane to the ML surface at the point
(y_i, x_i) we want to explain.
The tangent formula is human-understandable and it should be a good approximation
of the ML function in the neighbourhood of (y_i, x_i) (Taylor Theorem).
Analytically unfeasible:
• we don’t have a parametric formulation of the ML function
• the ML surface may have a huge number of discontinuity points → non-differentiable
Solution: sample points on the ML surface,
approximate the tangent with a linear model
through the points (Ridge Regression), in the
neighbourhood of the reference individual.
Giorgio Visani 10 / 23
LIME Intuition
LIME’s goal is to find the tangent at a precise point (the reference
individual).
From the Taylor Theorem, we know that any sufficiently smooth function f can be
approximated using a polynomial. The approximation error depends on the distance from
the reference point and on the degree of the polynomial (a higher degree
ensures a lower error).
Taylor polynomial of degree 1:
$f_{\mathrm{Taylor}}(x) = f(x_0) + f'(x_0)\,(x - x_0) + O\big((x - x_0)^2\big)$
The tangent corresponds to the Taylor polynomial of degree 1, the
simplest one. Since it is the lowest polynomial degree, to obtain a good
approximation we should consider a small region around the reference point:
the smaller the region, the more accurate the linear approximation of f(x).
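As a toy illustration (not taken from the slides), consider f(x) = x^2 around the reference point x_0 = 1:
$f_{\mathrm{Taylor}}(x) = f(1) + f'(1)(x - 1) = 1 + 2(x - 1) = 2x - 1, \qquad \text{error} = x^2 - (2x - 1) = (x - 1)^2$
The error grows quadratically with the distance from x_0, which is exactly why the linear approximation is only trusted locally.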
Giorgio Visani 11 / 23
All in all, it is just about computing the tangent.
Why doesn’t LIME compute the tangent analytically?
It would only require calculating the derivative of f(x): simpler (no need for
the generation step) and less time-consuming.
Unfortunately, for an ML model we have no formula for f(x). Without the formula, it is
impossible to calculate the derivative!
LIME reconstructs a part of the f(x) function, using the generation
step, and approximates its derivative with Linear Regression.
Giorgio Visani 12 / 23
LIME in a nutshell
Giorgio Visani 13 / 23
Background on Explanations
LIME
Generation Step
Weighting Step
Local Model Step
How to use LIME?
Generation Step
LIME generates n points x'_i all over the R^p space of the X variables,
also in regions far away from our red (reference) point.
x'_i denotes the LIME-generated points, while x_i denotes the observations of the
original dataset.
We generate only the X values for the n points x'_i, so we are missing the Y
value of the new units. We therefore plug each x'_i into the ML model and
obtain its prediction for the new point: ŷ'_i. In this way we actually generate a
brand-new dataset.
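A minimal sketch of this generation step (illustrative helper names, not the library’s internals): sample new X values and let the black-box model supply the missing Y values.

```python
# Hypothetical sketch of the generation step: sample points in feature space,
# then label them with the black-box model's predictions.
import numpy as np

def generate_dataset(X_train, black_box_predict, n_samples=5000, seed=0):
    rng = np.random.default_rng(seed)
    mean, std = X_train.mean(axis=0), X_train.std(axis=0)
    # Draw each feature from a Gaussian fitted on the original data
    X_new = rng.normal(loc=mean, scale=std, size=(n_samples, X_train.shape[1]))
    y_new = black_box_predict(X_new)   # the ML model provides y-hat' for each x'
    return X_new, y_new                # the brand-new dataset used by LIME
```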
Giorgio Visani 14 / 23
How does LIME do the Sampling?
First of all, LIME standardizes all the features (using (x − mean(x)) / stdev(x)).
This is important to make the variables comparable.
Then we sample each variable separately, as if it were Gaussian.
This assumption is not a problem, because it only influences how the points
are placed in the geometrical space. In particular, we will have more points
around the variable mean, while the concentration decreases moving away
from it.
For categorical variables, we sample the category ID at random; the
probability of obtaining a given category is the same as in the original
dataset.
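A hedged sketch of this per-feature sampling (an illustrative helper, not the library’s own code): numeric features are drawn from a Gaussian in standardized space, categorical ones from the empirical category frequencies.

```python
# Numeric features: Gaussian sampling around the feature mean/std.
# Categorical features: random draw with the dataset's category frequencies.
import numpy as np

def sample_feature(column, n_samples, is_categorical, rng):
    if is_categorical:
        cats, counts = np.unique(column, return_counts=True)
        return rng.choice(cats, size=n_samples, p=counts / counts.sum())
    mu, sigma = column.mean(), column.std()
    z = rng.normal(size=n_samples)   # sample in standardized space
    return mu + sigma * z            # map back to the original scale

rng = np.random.default_rng(42)
income = sample_feature(np.array([25e3, 32e3, 47e3, 51e3]), 1000, False, rng)
sex = sample_feature(np.array(["M", "F", "F", "M"]), 1000, True, rng)
```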
Giorgio Visani 15 / 23
Why not just generate points close to the reference?
This is a delicate subject.
In principle, it would be better to
consider only the points in the
region of interest, although the
proper size of the region is not
fixed but depends on the reference
point.
In fact, the neighborhood should
include all the linear area of the
ML curve around the reference
point, therefore it depends on the
local curvature of f(x).
Different reference points have
different proper sizes for the local
linear region.
Figure 1: The best neighborhood size
depends on the reference point and
on the curvature of the ML function
around it.
Giorgio Visani 16 / 23
Background on Explanations
LIME
Generation Step
Weighting Step
Local Model Step
How to use LIME?
Weighting Step
Since we are not interested in far-away points (LIME is a local
method), we must ignore them. How to do it? LIME gives a weight to
each generated point, using a Gaussian (RBF) Kernel:
$\mathrm{RBF}\big(x^{(i)}\big) = \exp\left(-\,\frac{\big\|x^{(i)} - x^{(\mathrm{ref})}\big\|^{2}}{kw}\right)$
Giorgio Visani 17 / 23
The Kernel Width parameter
The Gaussian Kernel assigns a value in the range [0, 1]: the higher the value,
the closer the point is to the reference. The kernel width kw parameter
decides how large the circle of meaningful weights around the red dot is.
RBF Kernel formula:
$\mathrm{RBF}\big(x^{(i)}\big) = \exp\left(-\,\frac{\big\|x^{(i)} - x^{(\mathrm{ref})}\big\|^{2}}{kw}\right)$
The Kernel Width (kw) is the only free parameter; it defines the radius of the weights.
Thanks to the weights, we can tell whether a point is far away from or
close to the red dot.
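A minimal sketch of the weighting computation (illustrative code, not the library’s own; note that LIME’s Python implementation typically squares the kernel width in the denominator):

```python
# RBF weights for the generated points, following the kernel above.
import numpy as np

def rbf_weights(X_generated, x_ref, kw):
    sq_dist = ((X_generated - x_ref) ** 2).sum(axis=1)  # squared distances
    return np.exp(-sq_dist / kw)   # weights in (0, 1]; close points get ~1

X_gen = np.array([[0.1, 0.2], [2.5, -1.0], [0.0, 0.0]])
weights = rbf_weights(X_gen, x_ref=np.array([0.0, 0.0]), kw=1.0)
```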
Giorgio Visani 18 / 23
Background on Explanations
LIME
Generation Step
Weighting Step
Local Model Step
How to use LIME?
Local Explainable Model
As the last step, LIME uses a surrogate model to approximate the ML
model in the small region around our reference red dot, determined
by the weights.
We may choose any kind of explainable model for the approximation
(Decision Trees, Logistic Regression, GLM, GAM, etc.), although my
preference goes to Linear Regression (it can be viewed as the tangent
to the ML model).
The default surrogate model in LIME’s Python implementation is Ridge
Regression, which belongs to the Linear Regression class of models
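A hedged end-to-end sketch of this local-model step (names are illustrative): fit a weighted Ridge regression on the generated dataset, so that points near the reference dominate the fit.

```python
# Weighted Ridge surrogate: the sample weights come from the RBF weighting step.
from sklearn.linear_model import Ridge

def fit_local_surrogate(X_generated, y_black_box, weights, alpha=1.0):
    surrogate = Ridge(alpha=alpha)
    surrogate.fit(X_generated, y_black_box, sample_weight=weights)
    return surrogate   # surrogate.coef_ holds the local explanation
```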
Giorgio Visani 19 / 23
A brief digression on Ridge Regression
Ridge Regression is just a linear model: $E(Y) = \alpha + \sum_{j=1}^{d} \beta_j X_j$
But the coefficients are estimated using a penalty based on their
norm:
$\hat{\beta}_R = \left(X^{\top} X + \lambda I\right)^{-1} X^{\top} y$
Figure 2: Ridge line tangent to the ML model
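A small numeric sketch of the closed-form estimator above (toy data; weights and intercept omitted for brevity):

```python
# beta_hat = (X'X + lambda*I)^(-1) X'y on simulated data
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

lam = 1.0
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
```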
Giorgio Visani 20 / 23
Background on Explanations
LIME
Generation Step
Weighting Step
Local Model Step
How to use LIME?
How to use LIME: Feature Importance
The explainable model is usually
exploited to understand which
variables are the most important
for the ML prediction on the
specific individual
(the largest coefficients, in absolute value, highlight the more
important variables).
Figure 3: LIME trained on a Medical
Dataset. We can see which are the
major death-risk factors.
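A hedged sketch with the lime Python package, reading off the local feature importances for one individual (the dataset and model below are toy placeholders, not the medical example from the slide):

```python
# Train a toy black box, then ask LIME for the local explanation of one unit.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

explainer = LimeTabularExplainer(
    X_train, feature_names=["x1", "x2", "x3", "x4"], mode="classification"
)
exp = explainer.explain_instance(X_train[0], model.predict_proba, num_features=4)
print(exp.as_list())   # [(feature condition, local weight), ...]
```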
Giorgio Visani 21 / 23
How to use LIME: What-If Tool
LIME models can also be used to test what-if scenarios:
If I were to earn $500 more a year, how many points would I gain on
my credit score?
It is important to remember that the LIME model is valid only locally:
the scenario we test should not be too distant from our reference.
Such a what-if tool is available only for surrogate models: it cannot be
done with other explanation methods, such as the ones based on
feature attribution, because they do not rely on prediction models.
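A toy what-if sketch using the fitted local surrogate from the previous steps (assumes the weighted Ridge surrogate; the feature index and the 500 increment are illustrative, and any standardization applied during training must also be applied to the increment):

```python
# Effect of changing one feature, estimated with the local linear surrogate.
import numpy as np

def what_if(surrogate, x_ref, feature_idx, delta):
    x_new = x_ref.copy()
    x_new[feature_idx] += delta
    # For a linear surrogate this equals surrogate.coef_[feature_idx] * delta
    return surrogate.predict([x_new])[0] - surrogate.predict([x_ref])[0]

# e.g. score_change = what_if(surrogate, x_ref, feature_idx=2, delta=500.0)
```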
Giorgio Visani 22 / 23
References I
[1] European Banking Authority EBA. Report on Big Data and Advanced Analytics. In:
(2020).
[2] Damien Garreau and Ulrike Luxburg. Explaining the Explainer: A First Theoretical Analysis of
LIME. In: International Conference on Artificial Intelligence and Statistics. PMLR, 2020,
pp. 1287–1296. ISSN: 2640-3498.
[3] Riccardo Guidotti et al. Local Rule-Based Explanations of Black Box Decision Systems.
en. In: arXiv:1805.10820 [cs] (May 2018). arXiv: 1805.10820 [cs].
[4] Patrick Hall and Navdeep Gill. An Introduction to Machine Learning Interpretability-Dataiku
Version. O’Reilly Media, Incorporated, 2018.
[5] AI HLEG. Ethics Guidelines for Trustworthy AI. In: (2019).
[6] John Kingston. Using Artificial Intelligence to Support Compliance with the General Data
Protection Regulation. en. In: Artificial Intelligence and Law 25.4 (Dec. 2017), pp. 429–443.
ISSN: 0924-8463, 1572-8382. DOI: 10.1007/s10506-017-9206-9.
[7] Thibault Laugel et al. Defining Locality for Surrogates in Post-Hoc Interpretablity. In: arXiv
preprint arXiv:1806.07498 (2018). arXiv: 1806.07498.
[8] Marco Tulio Ribeiro, Sameer Singh and Carlos Guestrin. Why Should I Trust You?: Explaining
the Predictions of Any Classifier. In: Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 1135–1144. ISBN:
1-4503-4232-9.
[9] Gregor Stiglic et al. Interpretability of Machine Learning Based Prediction Models in
Healthcare. In: arXiv preprint arXiv:2002.08596 (2020). arXiv: 2002.08596.
Giorgio Visani 23 / 23