2-ML Principles
«Statistical significance» as intended in classical statistics is of little use in the ML context: the
large amount of data on which models are trained yields very high «statistical power», so even negligible effects turn out «significant».
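A minimal sketch of this point on synthetic data (the effect size of 0.005 and the sample size are arbitrary assumptions): with a million observations per group, a practically irrelevant mean shift still produces a tiny p-value.

```python
# Sketch: with very large samples even a negligible effect is
# "statistically significant" (synthetic data, assumed shift of 0.005).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 1_000_000
a = rng.normal(loc=0.0, scale=1.0, size=n)
b = rng.normal(loc=0.005, scale=1.0, size=n)  # practically irrelevant shift

t, p = stats.ttest_ind(a, b)
print(f"t = {t:.2f}, p = {p:.1e}")  # p far below 0.05 despite the tiny effect
```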
ML in Actuarial Science: use cases
▪Fine tuning of frequency and severity modeling for non-life pricing, since ML models better
handle interactions between variables and non-linearities.
▪Individual claim reserving («claim level analytics»)
▪As above, for retention and conversion modeling
▪Fraud risk assessment
▪Marketing analytics
▪Recommender systems
ML Projects workflow
▪Business scope definition:
▪ business context
▪ Type of approach (supervised / unsupervised), available predictions, potential deployment issues
▪Data preparation:
▪ ETL: extraction, transformation and load;
▪ Initial descriptive analysis (univariate, bivariate plots and statistics, possible variable transformation)
▪«Predictive performance» shall be evaluated in terms of generalizability (will the fitted model
work well on unseen data?).
Model validation
▪«hold-out» approach: random (or reasoned) split of the available data into train, validation and/or
test sets. Models are fit on the train set, possibly chosen on the validation one, and predictive
performance is evaluated on the test set
▪«cross-validation» approach:
▪ An integer k (e.g. 10, 5, …) is chosen and the original sample is split into k random folds (k-fold CV);
▪ k model «runs» are fit, each time leaving out one «hold-out» fold on which predictive performance is
calculated;
▪ The estimated predictive performance is the average of the k estimates (see the sketch below).
Suggestions:
◦ Use «hold-out» data when possible
◦ Try to evaluate the performance of the current approach (as a baseline)
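Both ideas in a minimal scikit-learn sketch, on a synthetic data set (the model and the choice of k = 5 are arbitrary here):

```python
# Sketch: hold-out split plus 5-fold cross-validation with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

# Hold-out: keep a test set aside for the final assessment.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)

# k model "runs": each fold is left out once and scored on it;
# the estimated performance is the average of the k scores.
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print("cross-validated accuracy:", scores.mean())

# Final check on the untouched hold-out data.
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```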
Deployment & Life - cycle
▪Implement all data preparation and scoring «on-line» on the production IT
infrastructure
▪Necessary checks:
▪ Reasonableness of results
▪ Numerical checks on IT testing environments
▪Trees:
▪ Single trees: e.g. C5.0
▪ Bagging: Bagged Trees, Random Forest
▪ Boosted Trees: gbm, xgboost, lightgbm
Supervised learning: classification
▪Linear models:
▪ GLM (logistic, multinomial) possibly using non-linear or additive (splines) terms;
▪ Linear/Quadratic Discriminant Analysis
▪Non-linear models:
▪ MARS Splines
▪ KNN
▪ SVM
▪ Naive Bayes
▪ (Deep) Neural Networks
▪Tree based approaches:
▪ Single trees: CHAID, C50
▪ Bagging (Random Forest)
▪ Boosting (GBM, XGBoost, LightGBM)
Unsupervised modeling
Clustering:
◦ Hierarchical clustering
◦ KMeans
◦ DBSCAN, OPTICS, …
Dimension reduction:
◦ PCA, Factor Analysis,…
◦ GLRM
Hybrid models:
◦ Arules
◦ Word2vec
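A minimal sketch of two of the techniques above, PCA for dimension reduction followed by k-means clustering (the iris data set and the choice of 3 clusters are arbitrary assumptions):

```python
# Sketch: dimension reduction (PCA) followed by clustering (k-means).
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

X_2d = PCA(n_components=2).fit_transform(X)  # project onto 2 principal components
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)
print(labels[:10])  # cluster assignment of the first samples
```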
Linear Discriminant Analysis
Support Vector Machines
▪A mathematical function (possibly non-linear) creating separating regions in the variable space.
▪Can be used in both classification and regression problems.
▪Issues:
▪ Computational complexity (O(n³))
▪ No automatic feature selection
▪ Hard to interpret
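A minimal sketch with scikit-learn (RBF kernel on a toy non-linear problem; scaling is included because SVMs are sensitive to feature scale, and the C and gamma values are defaults, not tuned choices):

```python
# Sketch: SVM classification with a non-linear (RBF) kernel.
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

# The kernel implicitly maps the data into a space where the
# separating region becomes (close to) linear.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)
print("train accuracy:", clf.score(X, y))
```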
SVM: Kernels
MARS Splines
▪Multivariate Adaptive Regression Splines are based on hinge functions, e.g.
k₁·max(0, x − c) + k₂·max(0, c − x), to model the relation between predictors and the outcome.
▪Pros:
▪ Handling both numeric and categorical data; interpretable non-linearity handling
▪ Allow for feature selection
▪Cons:
▪ More performant ML models exist
▪ Computational complexity reduces their ability to handle large data sets.
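A minimal sketch of the hinge-function building block, with the two basis functions built by hand at an assumed knot c (a real MARS implementation searches for the knots automatically):

```python
# Sketch: the MARS building block -- two hinge functions at knot c,
# whose coefficients k1, k2 are fit by ordinary least squares.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=500)
y = 2 * np.maximum(0, x - 1) + 0.5 * np.maximum(0, 1 - x) + rng.normal(0, 0.1, size=500)

c = 1.0  # assumed knot location
H = np.column_stack([np.maximum(0, x - c),   # max(0, x - c)
                     np.maximum(0, c - x)])  # max(0, c - x)

fit = LinearRegression().fit(H, y)
print("k1, k2 =", fit.coef_.round(2))  # recovers roughly (2, 0.5)
```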
KNN
▪KNN predicts new samples from the average value of their k nearest neighbours; the choice of k depends on the data set.
▪Can be used both for regression and classification
▪Pros:
▪ Easy and intuitive
▪Cons:
▪ Usually the fit is inferior to that of other models.
▪ Computational complexity: O(n·d + k·n) per prediction (brute-force search)
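A minimal sketch with scikit-learn (the data set and the values of k tried are arbitrary; features are scaled because KNN is distance-based):

```python
# Sketch: k-nearest-neighbours classification for a few values of k.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

for k in (1, 5, 15):
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    print(k, cross_val_score(knn, X, y, cv=5).mean().round(3))
```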
Hierarchical Clustering
General algorithm:
1. A distance metric is defined
2. All n·(n − 1)/2 distance pairs are computed
3. Closest pairs are combined and the algorithm starts again
▪Pros:
▪ Different distance metrics can be used
▪ Visual output (dendrogram)
▪Cons:
▪ O(n³·d) vs the O(n·k·d) of k-means
▪ Subjective choice of distance threshold to define the cluster.
Hierarchical Clustering: dendrogram
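A minimal sketch of the algorithm with scipy (synthetic two-group data; Ward linkage and the cut at 2 clusters are arbitrary choices, and the dendrogram plot needs matplotlib):

```python
# Sketch: agglomerative clustering -- pairwise distances, iterative
# merging of closest pairs, dendrogram as visual output.
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

d = pdist(X, metric="euclidean")   # steps 1-2: all n(n-1)/2 distances
Z = linkage(d, method="ward")      # step 3: repeatedly merge closest pairs
labels = fcluster(Z, t=2, criterion="maxclust")  # subjective cut: 2 clusters
dendrogram(Z)                      # the visual output (requires matplotlib)
```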
ARULES
Market basket analysis. Typically used to suggest the most probable element that completes a set (e.g.
different insurance covers for personal business).
It infers rules from a binary transaction set based on probabilistic rules.
It can infer if-then rules of the form «If you own A and B, then you may be interested in C».
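arules is an R package; a comparable minimal sketch in Python uses mlxtend (the transaction table and cover names below are hypothetical toy data):

```python
# Sketch: if-then association rules from a binary transaction table.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Toy data: which insurance covers each client owns (hypothetical).
df = pd.DataFrame(
    {"motor": [1, 1, 1, 0, 1], "home": [1, 1, 0, 1, 1], "life": [1, 1, 0, 0, 1]},
    dtype=bool,
)

frequent = apriori(df, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```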
TOOLS: H2O
▪Java-based ML library that efficiently implements optimized versions of broadly used ML algorithms:
▪ Supervised: GLM, Random Forest, GBM, XGBoost, Deep Learning, Naive Bayes and Stacked
Ensemble.
▪ Unsupervised: PCA, KMEANS, GLRM
▪Features:
▪ Open-source tool;
▪ It interfaces with R and Python using dedicated libraries;
▪ Docs: http://docs.h2o.ai/
▪It can be used:
▪ On desktop workstation (using multicore approach);
▪ On PC clusters;
▪ A dedicated version implements GPU calculations.
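A minimal sketch of the Python interface (the file name, predictors and response are hypothetical; a local Java runtime is assumed):

```python
# Sketch: training an H2O GBM from Python.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()  # starts a local (multicore) H2O cluster

frame = h2o.import_file("claims.csv")            # hypothetical data set
train, test = frame.split_frame(ratios=[0.8], seed=1)

gbm = H2OGradientBoostingEstimator(ntrees=100, seed=1)
gbm.train(x=["age", "region", "vehicle_type"],   # hypothetical predictors
          y="claim_count",                       # hypothetical response
          training_frame=train)
print(gbm.model_performance(test))
```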
TOOLS: ML Models wrappers
• ML libraries that allow a tidy implementation of ETL, model tuning and performance
assessment.
• Different ML models can be fit and compared using a unified approach.
• Available libraries are:
• R: caret and mlr
• Python: scikit-learn
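A minimal sketch of the unified approach with scikit-learn (the data set, model and tuning grid are arbitrary):

```python
# Sketch: preprocessing + model in one Pipeline, tuned and assessed
# through the same unified scikit-learn API.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),
                 ("model", GradientBoostingClassifier(random_state=0))])

grid = GridSearchCV(pipe,
                    param_grid={"model__n_estimators": [100, 300],
                                "model__learning_rate": [0.05, 0.1]},
                    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```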
ML Interpretability
▪The opacity of ML models has negatively affected their diffusion and popularity in many contexts,
despite the significantly superior performance they often offer compared to traditional methods.
▪Thus, recent research has focused on implementing algorithms that ease model assessment and
interpretability (both global and local). Statistical libraries that implement such algorithms are LIME
and DALEX.
▪Tools to ease interpretability are:
▪ Residuals distribution;
▪ Variable importance analysis;
▪ Partial dependency plots;
▪ Predictions breakdown
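LIME and DALEX are the libraries named above; a comparable minimal sketch with scikit-learn's own model-agnostic tools covers two of the listed techniques, variable importance and a partial dependence plot (the data set and model are arbitrary, and the plot needs matplotlib):

```python
# Sketch: variable importance analysis and a partial dependence plot.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay, permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Variable importance: drop in score when one column is shuffled.
imp = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(dict(zip(X.columns, imp.importances_mean.round(3))))

# Partial dependence: marginal effect of a single predictor.
PartialDependenceDisplay.from_estimator(model, X_test, ["bmi"])
```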
ML Interpretability: residuals analysis
ML interpretability: variable importance analysis
ML interpretability: marginal effects plot
(Figures: marginal effects plot; partial groups of categorical predictors)