
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/328096312

Machine learning projects workflow

Presentation · September 2018
DOI: 10.13140/RG.2.2.19844.37767
Author: Giorgio Alfredo Spedicato, Unipol Gruppo Finanziario

All content following this page was uploaded by Giorgio Alfredo Spedicato on 05 October 2018.


Machine Learning and Actuarial Science
Core Concepts

GIORGIO ALFREDO SPEDICATO, PHD FCAS FSA CSPA


UNISACT 2018
Introduction
The terms «machine learning» and «data mining» refer to the use of algorithms to extract insights from large amounts of data.
A common subdivision of the algorithms is:
▪ Supervised learning:
▪ Regression;
▪ Binary or multinomial classification
▪ Unsupervised learning:
▪ Clustering
▪ Dimensionality reduction
▪ Association rules, network analysis, …

«Statistical significance» in the classical sense is of little use in the ML context, because models are trained on such large amounts of data that almost any effect is significant (very high «statistical power»).
ML in Actuarial Science: use cases
▪Fine tuning of frequency and severity modeling for non-life pricing, since ML models better handle interactions between variables and non-linearities.
▪Individual claim reserving («claim-level analytics»)
▪Likewise, retention and conversion modeling
▪Fraud risk assessment
▪Marketing analytics
▪Recommender systems
ML Projects workflow
▪Business scope definition:
▪ Business context
▪ Type of approach (supervised / unsupervised), available predictions, potential deployment issues

▪Data preparation:
▪ ETL: extraction, transformation and load;
▪ Initial descriptive analysis (univariate and bivariate plots and statistics, possible variable transformations)

▪Modeling and deployment


▪ Selection of candidate models
▪ Models’ fit
▪ Performance Assessment
▪ Deployment
Models validation

▪ML focuses on «predictive» performance rather than «explicative» power.

▪Performance metrics depend on the nature of the outcome:
▪ Regression: RMSE, R², MAE, …
▪ Classification: AUC/Gini, LogLoss, …

▪«Predictive performance» shall be evaluated in terms of generalizability: will the fitted model work well on unseen data?
Models validation
▪«Hold-out» approach: random (or reasoned) split of the available data into train, validation and/or test sets. Models are fit on the train set, possibly selected on the validation set, and predictive performance is evaluated on the test set.
▪«Cross-validation» approach:
▪ An integer k (e.g. 10, 5, …) is chosen and the original sample is split into k random folds (k-fold CV);
▪ k model «runs» are fit, each time holding out one fold on which predictive performance is calculated;
▪ The estimated predictive performance is the average of the k estimates.

▪«k-fold» CV is more precise, but more computationally demanding.
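The k-fold procedure above can be sketched in plain Python; the mean-predicting "model" and the MAE metric below are toy choices purely for illustration:

```python
import random

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k random folds (k-fold CV splitting)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, metric, k=5):
    """Fit k model runs, each time holding out one fold; return the k scores."""
    scores = []
    for fold in kfold_indices(len(xs), k):
        held = set(fold)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(metric([model(xs[i]) for i in fold],
                             [ys[i] for i in fold]))
    return scores

# Toy example: the "model" just predicts the training mean; metric is MAE.
xs = list(range(20))
ys = [2.0 * x for x in xs]
fit_mean = lambda X, Y: (lambda x, m=sum(Y) / len(Y): m)
mae = lambda preds, truth: sum(abs(p - t) for p, t in zip(preds, truth)) / len(preds)

scores = cross_validate(xs, ys, fit_mean, mae, k=5)
cv_estimate = sum(scores) / len(scores)  # average of the k estimates
```

Each of the k scores is computed only on data the run never trained on, which is exactly what makes the average an estimate of generalizability.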


Performance assessment: continuous outcomes
Performance assessment: binary outcomes
▪Metrics (TP = true positives, FN = false negatives, FP = false positives, TN = true negatives; P = TP + FN, N = FP + TN):
▪ Accuracy = (TP + TN) / (P + N)
▪ Sensitivity / TPR / Recall = TP / (TP + FN)
▪ Specificity / 1 − FPR = TN / (TN + FP)
▪ Precision / PPV = TP / (TP + FP)
▪ NPV = TN / (TN + FN)
▪ F1-score = 2 · (PPV · TPR) / (PPV + TPR)
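The metrics follow directly from the four confusion-matrix counts; a minimal sketch, with counts invented purely for illustration:

```python
# Hypothetical confusion-matrix counts, purely illustrative.
TP, FN, FP, TN = 40, 10, 5, 45
P, N = TP + FN, FP + TN          # actual positives / negatives

accuracy    = (TP + TN) / (P + N)
sensitivity = TP / (TP + FN)     # TPR / recall
specificity = TN / (TN + FP)     # 1 - FPR
precision   = TP / (TP + FP)     # PPV
npv         = TN / (TN + FN)
f1          = 2 * precision * sensitivity / (precision + sensitivity)
```

Note that the F1-score simplifies algebraically to 2·TP / (2·TP + FP + FN), so it ignores TN entirely, which is why it is preferred on imbalanced classes.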
Performance assessment: ROC, AUC and Gini
Performance assessment: loss metrics
Continuous outcomes
◦ RMSE = √( Σᵢ₌₁..ₙ (ŷᵢ − yᵢ)² / n )
◦ R² = 1 − Σᵢ₌₁..ₙ (ŷᵢ − yᵢ)² / Σᵢ₌₁..ₙ (yᵢ − ȳ)²
◦ MAE = Σᵢ₌₁..ₙ |ŷᵢ − yᵢ| / n

Binary and multinomial outcomes:

◦ Gini = 2 · AUC − 1
◦ logLoss = −(1/N) Σᵢ₌₁..ₙ Σⱼ₌₁..ₘ yᵢⱼ · log(pᵢⱼ)
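These loss metrics are a few lines each in plain Python; a minimal sketch (binary log-loss shown, the two-class case of the multinomial formula):

```python
import math

def rmse(y_hat, y):
    """Root mean squared error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y))

def mae(y_hat, y):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(y_hat, y)) / len(y)

def r_squared(y_hat, y):
    """R²: one minus the ratio of residual to total sum of squares."""
    y_bar = sum(y) / len(y)
    ss_res = sum((p - t) ** 2 for p, t in zip(y_hat, y))
    ss_tot = sum((t - y_bar) ** 2 for t in y)
    return 1 - ss_res / ss_tot

def log_loss(p_hat, y):
    """Binary log-loss; y holds 0/1 labels, p_hat the predicted P(y = 1)."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for p, t in zip(p_hat, y)) / len(y)
```

RMSE penalizes large errors more heavily than MAE, which is why the two can rank models differently on heavy-tailed outcomes such as claim severities.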

Suggestions:
◦ Use «hold-out» data when possible
◦ Try to evaluate the performance of the current approach as a benchmark
Deployment & Life - cycle
▪Implement «on-line» all data preparation and scoring steps on the production IT infrastructure
▪Necessary checks:
▪ Reasonableness of results
▪ Numerical checks in IT testing environments

▪Models' life cycle:

▪ How often should models be refit on fresher data?
▪ How often should the modeling approach be deeply reviewed to reflect a changing business environment?
Supervised learning: regression
▪Linear models:
▪ Normal multivariate regression;
▪ Generalized Linear Models (GLM)
▪ Linear Support Vector Machines (SVM).

▪Non-linear models:

▪ Generalized non-linear models;
▪ MARS splines;
▪ Radial and polynomial SVM;
▪ K Nearest Neighbors (KNN)
▪ (Deep) Neural Networks

▪Trees:
▪ Single trees: e.g. C5.0
▪ Bagging: bagged trees, Random Forest
▪ Boosted trees: gbm, xgboost, lightgbm
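The first item in the linear family, ordinary least-squares regression, has a closed form in the single-predictor case; a minimal illustrative sketch (data and coefficients invented):

```python
def fit_ols(xs, ys):
    """Closed-form simple linear regression y ≈ a + b·x (illustrative)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return lambda x: a + b * x

model = fit_ols([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])  # data lie on y = 1 + 2x
```

The other families in the list trade this transparency for flexibility: trees and neural networks have no closed-form fit and are trained iteratively.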
Supervised learning: classification
▪Linear models:
▪ GLM (logistic, multinomial) possibly using non-linear or additive (splines) terms;
▪ Linear/Quadratic Discriminant Analysis
▪Non-linear models:
▪ MARS Splines
▪ KNN
▪ SVM
▪ Naive Bayes
▪ (Deep) Neural Networks
▪Tree based approaches:
▪ Single trees: CHAID, C50
▪ Bagging (Random Forest)
▪ Boosting (GBM, XGBoost, LightGBM)
Unsupervised modeling
Clustering:
◦ Hierarchical clustering
◦ KMeans
◦ DBSCAN, OPTICS, …

Dimension reduction:
◦ PCA, Factor Analysis,…
◦ GLRM

Hybrid models:
◦ Arules
◦ Word2vec
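As a sketch of the clustering family above, Lloyd's algorithm for KMeans in one dimension (toy data and starting centers, purely illustrative):

```python
def kmeans_1d(points, centers, iters=10):
    """Lloyd's algorithm in one dimension: assign each point to its
    nearest center, then move each center to its group's mean."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

centers = kmeans_1d([0.0, 1.0, 2.0, 10.0, 11.0, 12.0], centers=[0.0, 12.0])
```

The O(n·k·d) per-iteration cost of this assign-then-average loop is what makes KMeans cheaper than hierarchical clustering on large data sets.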
Linear Discriminant Analysis
Support Vector Machines
▪A mathematical function (possibly non-linear) creating separating regions in the variable space.
▪Can be used for both classification and regression problems.
▪Issues:
▪ Computational complexity (O(n³))
▪ No automatic feature selection
▪ Hard to interpret
SVM: Kernels
MARS Splines
▪Multivariate Adaptive Regression Splines are based on hinge functions, e.g. k₁ · max(0, x − c) + k₂ · max(0, c − x), to model the relation between predictors and the outcome.
▪Pros:
▪ Handling of both numeric and categorical data, interpretable non-linearity handling
▪ Allow for feature selection

▪Cons:
▪ More performant ML models exist
▪ Computational complexity reduces their ability to handle large data sets.
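The hinge-function idea is simple to sketch; the knot c and coefficients below are made up for illustration, not fitted:

```python
def hinge(x, c):
    """max(0, x - c): the basic MARS basis function."""
    return max(0.0, x - c)

def predict(x, b0=1.0, k1=2.0, k2=0.5, c=3.0):
    """A MARS-style fit with a single knot at c:
    b0 + k1·max(0, x - c) + k2·max(0, c - x).
    Coefficients are hypothetical, chosen for illustration only."""
    return b0 + k1 * hinge(x, c) + k2 * hinge(c, x)
```

The pair of mirrored hinges gives a piecewise-linear curve with a different slope on each side of the knot, which is exactly the interpretable non-linearity the slide refers to.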
MARS Splines
KNN
▪KNN uses the average outcome of the k nearest neighbors to predict new samples; the best k depends on the data set.
▪Can be used for both regression and classification
▪Pros:
▪ Easy and intuitive

▪Cons:
▪ Usually the fit is inferior to that of other models.
▪ Computational complexity: O(n(d + k))
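The prediction rule fits in a few lines; a one-dimensional regression sketch on toy data:

```python
def knn_predict(x_new, xs, ys, k=3):
    """Regression KNN: average the outcomes of the k nearest training points."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x_new))[:k]
    return sum(ys[i] for i in nearest) / k

xs = [0.0, 1.0, 2.0, 10.0]
ys = [0.0, 2.0, 4.0, 20.0]
pred = knn_predict(1.2, xs, ys, k=3)
```

Note there is no training step at all: every prediction scans the whole training set, which is where the O(n(d + k)) per-query cost comes from.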
KNN
Hierarchical Clustering
General algorithm:
1. A distance metric is defined
2. All n(n − 1)/2 pairwise distances are computed
3. The closest pair is merged and the algorithm starts again

▪Pros:
▪ Different distance metrics can be used
▪ Visual output (dendrogram)

▪Cons:
▪ O(n³·d) vs the O(n·k·d) of KMeans
▪ Subjective choice of the distance threshold that defines the clusters.
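The three steps above can be sketched with single-linkage distance on one-dimensional toy points (stopping at a target cluster count rather than a distance threshold, for simplicity):

```python
def single_linkage(points, n_clusters):
    """Agglomerative clustering sketch: repeatedly merge the two closest
    clusters, with single-linkage (minimum pairwise) distance."""
    clusters = [[p] for p in points]
    link = lambda ci, cj: min(abs(a - b) for a in ci for b in cj)
    while len(clusters) > n_clusters:
        # Steps 2-3: compute all pairwise distances, merge the closest pair.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

groups = single_linkage([0.0, 0.5, 0.6, 9.0, 9.2], n_clusters=2)
```

Recomputing pairwise distances at every merge is what drives the cubic cost the slide mentions; production implementations cache and update the distance matrix instead.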
Hierarchical Clustering: dendrogram
ARULES

Market basket analysis. Typically used to suggest the most probable element that completes a set (e.g. different insurance covers for personal business).
It infers rules from a binary transaction set based on probabilistic measures.
It can infer if-then rules of the form «If you own A and B, then you may be interested in C».
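The two basic measures behind such rules, support and confidence, reduce to simple counting; the transactions below are hypothetical:

```python
# Toy transaction set: covers held by four hypothetical customers.
transactions = [
    {"motor", "home"},
    {"motor", "home", "life"},
    {"motor", "life"},
    {"home"},
]

def support(itemset):
    """Fraction of transactions containing the whole itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Strength of the rule «if antecedent then consequent»:
    support of the union divided by support of the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

conf_motor_life = confidence({"motor"}, {"life"})  # estimated P(life | motor)
```

Algorithms such as Apriori simply search the space of itemsets for rules whose support and confidence exceed chosen thresholds.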
ARULES
TOOLS: H2O
▪Java-based ML library that implements efficiently optimized versions of broadly used ML algorithms:
▪ Supervised: GLM, Random Forest, GBM, XGBoost, Deep Learning, Naive Bayes and Stacked Ensembles.
▪ Unsupervised: PCA, KMeans, GLRM
▪Features:
▪ Open-source tool;
▪ It interfaces with R and Python using dedicated libraries;
▪ Docs: http://docs.h2o.ai/
▪It can be used:
▪ On desktop workstation (using multicore approach);
▪ On PC clusters;
▪ A dedicated version implements GPU calculations.
TOOLS: ML Models wrappers
• ML wrapper libraries allow a tidy implementation of ETL, tuning and model performance assessment.
• Different ML models can be fit and compared using a unified approach.
• Available libraries are:
• R: caret and mlr
• Python: scikit-learn
ML Interpretability
▪ML model opacity has negatively affected their diffusion and popularity in many contexts, despite the significantly superior performance they often offer compared to traditional methods.
▪Thus, recent research has focused on algorithms that ease model assessment and interpretability (global and local). Statistical libraries that implement such algorithms are LIME and DALEX.
▪Tools that ease interpretability are:
▪ Residuals distribution;
▪ Variable importance analysis;
▪ Partial dependency plots;
▪ Prediction breakdowns
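A model-agnostic variable importance can be sketched by permuting one feature and measuring how much the error grows; here the permutation is a deterministic reversal for reproducibility, whereas real implementations shuffle at random, and the model and data are toy constructions:

```python
def permutation_importance(model, X, y, metric, col):
    """Increase in error after permuting one column's values
    (deterministically reversed here; LIME/DALEX-style tools and
    real permutation importance use random shuffles and repeats)."""
    base = metric([model(row) for row in X], y)
    column = [row[col] for row in X][::-1]          # permute one feature
    X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, column)]
    return metric([model(row) for row in X_perm], y) - base

# Toy model that uses only feature 0, so feature 1 should score 0.
model = lambda row: 3.0 * row[0]
X = [[float(i), float(i % 2)] for i in range(10)]
y = [3.0 * i for i in range(10)]
mae = lambda preds, truth: sum(abs(p - t) for p, t in zip(preds, truth)) / len(preds)

imp0 = permutation_importance(model, X, y, mae, col=0)
imp1 = permutation_importance(model, X, y, mae, col=1)
```

Because it only needs predictions, not model internals, the same procedure works for a GLM, a GBM or a neural network alike, which is what makes it useful for comparing opaque models.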
ML Interpretability: residuals analysis
ML interpretability: variable importance
analysis
ML interpretability: marginal effects plot
(Figure panels: marginal effects plot; groups of categorical predictors)
ML interpretability: marginal effects plot
