1st edition | July 8-11, 2019
BigML, Inc #DutchMLSchool 2
Feature Engineering
Creating Features that Make Machine Learning Work
Poul Petersen
CIO, BigML, Inc

Gaming the ML Performance
• Use ML to improve performance automatically
  • OptiML
  • Unsupervised Feature Engineering (PCA, Topic Models, Clustering, Anomaly Detection, etc.)
  • Automated feature selection
• Use domain knowledge to improve performance manually
  • Bespoke features (requires expertise)
  • Fusions of models
  • Manual feature selection
A Tale of Two Strategies…

What is Feature Engineering?
Feature Engineering: applying domain knowledge of the data to create new features that allow ML algorithms to work better, or to work at all.
• This is really, really important - more than algorithm selection!
  • In fact, so important that BigML often does it automatically
• ML algorithms have no deeper understanding of data
  • Numerical: have a natural order, can be scaled, etc.
  • Categorical: have discrete values, etc.
  • The "magic" is the ability to find patterns quickly and efficiently
• ML algorithms only know what you tell or show them with data
  • Medical: kg and m, but BMI = kg/m² is better
  • Lending: Debt and Income, but DTI (debt-to-income ratio) is better
• Intuition can be risky: remember to prove it with an evaluation!

Built-in Transformations
Date-Time Fields
DATE-TIME: 2013-09-25 10:02
…  year  month  day  hour  minute  …
…  2013  Sep    25   10    2       …
…  …     …      …    …     …       …
   NUM   CAT    NUM  NUM   NUM
• Date-Time fields have a lot of information "packed" into them
• Splitting out the time components allows ML algorithms to discover time-based patterns.
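The date-time split above can be sketched outside BigML in a few lines of Python. This is only an illustration of the idea, not BigML's implementation; the function name `split_datetime` and the fixed month abbreviations are assumptions made for the example.

```python
from datetime import datetime

# Fixed abbreviations so the result does not depend on the system locale.
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def split_datetime(value: str) -> dict:
    """Expand a 'YYYY-MM-DD HH:MM' string into separate features."""
    ts = datetime.strptime(value, "%Y-%m-%d %H:%M")
    return {
        "year": ts.year,                     # numeric
        "month": MONTH_ABBR[ts.month - 1],   # categorical
        "day": ts.day,                       # numeric
        "hour": ts.hour,                     # numeric
        "minute": ts.minute,                 # numeric
    }

features = split_datetime("2013-09-25 10:02")
```

Each component becomes its own column, so a model can pick up patterns such as "hour of day" that are invisible in the raw timestamp.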

Built-in Transformations
Categorical Fields for Clustering/LR
…  alchemy_category  …
…  business          …
…  recreation        …
…  health            …
…  …                 …
   CAT
…  business  health  recreation  …
…  1         0       0           …
…  0         0       1           …
…  0         1       0           …
…  …         …       …           …
   NUM       NUM     NUM
• Clustering and Logistic Regression require numeric fields for inputs
• Categorical values are transformed to numeric vectors automatically*
• *Note: In BigML, clustering uses k-prototypes and the encoding used for LR can be configured.
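The table above shows a one-hot style encoding. A minimal Python sketch of that encoding follows; it is one possible encoding under the assumption of alphabetically ordered categories, and `one_hot` is a hypothetical helper, not a BigML function.

```python
def one_hot(values):
    """Turn a list of category labels into 0/1 indicator rows."""
    categories = sorted(set(values))  # one output column per distinct label
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

categories, rows = one_hot(["business", "recreation", "health"])
```

With the three sample values from the slide, the columns come out as business, health, recreation, and each row has exactly one 1.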

Built-in Transformations
Text Fields
TEXT: "Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon 'em."
…  great  afraid  born  achieve  …
…  4      1       1     1        …
…  …      …       …     …        …
   NUM    NUM     NUM   NUM
• Unstructured text contains a lot of potentially interesting patterns
• Bag-of-words analysis happens automatically and extracts the "interesting" tokens in the text
• Another option is Topic Modeling to extract thematic meaning
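A rough Python sketch of a bag-of-words count for the quote above. The stopword list and the crude "strip -ness" stemmer are assumptions chosen so that "great" and "greatness" share one token, as in the slide's table; BigML's actual tokenizer and stemmer are not described in the slides.

```python
import re
from collections import Counter

TEXT = ("Be not afraid of greatness: some are born great, some achieve "
        "greatness, and some have greatness thrust upon 'em.")

# Hypothetical stopword list, just large enough for this sentence.
STOPWORDS = {"be", "not", "of", "some", "are", "and", "have", "upon", "em"}

def crude_stem(token: str) -> str:
    """Very rough stemming: 'greatness' -> 'great'."""
    return token[:-4] if token.endswith("ness") else token

def bag_of_words(text: str) -> Counter:
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOPWORDS)

counts = bag_of_words(TEXT)
```

Each remaining token becomes a numeric column holding its count for that row of text.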

Help ML to Work Better
When text is not actually unstructured
TEXT:
{
  "url": "cbsnews",
  "title": "Breaking News Headlines Business Entertainment World News ",
  "body": " news covering all the latest breaking national and world news headlines, including politics, sports, entertainment, business and more."
}
title           body
Breaking News…  news covering…
…               …
TEXT            TEXT
• In this case, the text field has structure (key/value pairs)
• Extracting the structure as new features may allow the ML algorithm to work better
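Outside of Flatline, extracting the key/value structure is a one-liner with a JSON parser. The sketch below assumes the field holds valid JSON; `extract_fields` is a hypothetical helper name.

```python
import json

raw = ('{"url": "cbsnews", '
       '"title": "Breaking News Headlines Business Entertainment World News ", '
       '"body": " news covering all the latest breaking national and world news '
       'headlines, including politics, sports, entertainment, business and more."}')

def extract_fields(raw_text: str, keys=("title", "body")) -> dict:
    """Parse the JSON text and return the selected keys as separate text features."""
    record = json.loads(raw_text)
    return {key: record.get(key, "").strip() for key in keys}

new_features = extract_fields(raw)
```

The single text column becomes two text columns, `title` and `body`, that can be analyzed separately.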

FE Demo #1

Help ML to Work at all
When the pattern does not exist
Highway Number  Direction    Is Long
2               East-West    FALSE
4               East-West    FALSE
5               North-South  TRUE
8               East-West    FALSE
10              East-West    TRUE
…               …            …
Goal: Predict principal direction from highway number
(= (mod (field "Highway Number") 2) 0)
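The same parity feature, written as a small Python sketch and checked against the rows shown on the slide. The function name `is_even` is chosen for the example.

```python
def is_even(highway_number: int) -> bool:
    """Python equivalent of (= (mod (field "Highway Number") 2) 0)."""
    return highway_number % 2 == 0

# Rows copied from the slide: (Highway Number, Direction)
rows = [(2, "East-West"), (4, "East-West"), (5, "North-South"),
        (8, "East-West"), (10, "East-West")]

is_even_feature = [is_even(number) for number, _ in rows]
```

In this sample, the new boolean lines up exactly with the direction, which is the pattern the raw number alone does not expose to the model.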

FE Demo #2

Feature Engineering
Discretization
Total Spend  Spend Category
7.342,99     Top 33%
304,12       Bottom 33%
4,56         Bottom 33%
345,87       Middle 33%
8.546,32     Top 33%
NUM          CAT
NUM target: "Predict will spend $3,521 with error $1,232"
CAT target: "Predict customer will be Top 33% in spending"
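A sketch of tercile discretization in Python. The slide does not say how the cut points are computed, so this example assumes a midpoint percentile rank, (rank + 0.5) / n, over the five sample rows; on a full dataset the thresholds would come from the whole distribution.

```python
def discretize_terciles(values):
    """Label each value Bottom/Middle/Top 33% by its percentile rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [None] * len(values)
    for rank, index in enumerate(order):
        position = (rank + 0.5) / len(values)  # assumed rank convention
        if position < 1 / 3:
            labels[index] = "Bottom 33%"
        elif position < 2 / 3:
            labels[index] = "Middle 33%"
        else:
            labels[index] = "Top 33%"
    return labels

total_spend = [7342.99, 304.12, 4.56, 345.87, 8546.32]
spend_category = discretize_terciles(total_spend)
```

Turning the numeric target into a category changes the problem from regression to classification.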

FE Demo #3

Built-ins for FE
• Discretize: Converts a numeric value to categorical
• Replace missing values: fixed/max/mean/median/etc.
• Normalize: Adjust a numeric value to a specific range of values while preserving the distribution
• Math: Exponentiation, Logarithms, Squares, Roots, etc.
• Types: Force a field value to categorical, integer, or real
• Random: Create random values for introducing noise
• Statistics: Mean, Population
• Refresh Fields:
  • Types: recomputes field types. Ex: #classes > 1000
  • Preferred: recomputes preferred status
Flatline Add Fields
15
Computing with Existing Features
Debt Income
10.134 100.000
85.234 134.000
8.112 21.500
0 45.900
17.534 52.000
NUM NUM
(/ (field "Debt") (field "Income"))
Debt
Income
Debt to Income Ratio
0,10
0,64
0,38
0
0,34
NUM
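The ratio can be checked against the table with a short Python sketch. The zero-income guard is an assumption added for the example; the slides do not say how Flatline treats division by zero.

```python
def debt_to_income(debt: float, income: float):
    """Python equivalent of (/ (field "Debt") (field "Income"))."""
    if income == 0:
        return None  # assumed handling; not taken from the slides
    return debt / income

# (Debt, Income) rows from the slide, written without thousands separators.
rows = [(10134, 100000), (85234, 134000), (8112, 21500), (0, 45900), (17534, 52000)]

ratios = [round(debt_to_income(debt, income), 2) for debt, income in rows]
```

Rounded to two decimals, the results match the Debt to Income Ratio column.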

FE Demo #4

What is Flatline?
• DSL:
  • Invented by BigML - Programmatic / Optimized for speed
  • Transforms datasets into new datasets
    • Adding new fields / Filtering
  • Transformations are written in Lisp-style syntax
• Feature Engineering
  • Computing new fields: (/ (field "Debt") (field "Income"))
• Programmatic Filtering:
  • Filtering datasets according to functions that evaluate to true/false using the row of data as an input.
Flatline: a domain specific language for feature engineering and programmatic filtering
Flatline
18
• Lisp style syntax: Operators come first
• Correct: (+ 1 2) => NOT Correct: (1 + 2)
• Dataset Fields are first-class citizens
• (field “diabetes pedigree”)
• Limited programming language structures
• let, cond, if, map, list operators, */+-, etc.
• Built-in transformations
• statistics, strings, timestamps, windows

Flatline s-expressions
Adding Simple Labels to Data
(= 0 (+ (abs (f "Month - 3")) (abs (f "Month - 2")) (abs (f "Month - 1"))))
Un-Labelled Data
Name        Month - 3  Month - 2  Month - 1
Joe Schmo   123,23     0          0
Jane Plain  0          0          0
Mary Happy  0          55,22      243,33
Tom Thumb   12,34      8,34       14,56
Labelled Data
Name        Month - 3  Month - 2  Month - 1  Default
Joe Schmo   123,23     0          0          FALSE
Jane Plain  0          0          0          TRUE
Mary Happy  0          55,22      243,33     FALSE
Tom Thumb   12,34      8,34       14,56      FALSE
Define "default" as missing three payments in a row
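The labeling rule can be mirrored in Python to confirm the Default column. This is a sketch; the dictionary layout and the name `is_default` are assumptions made for the example.

```python
MONTHS = ("Month - 3", "Month - 2", "Month - 1")

def is_default(row: dict) -> bool:
    """True when the absolute payments of all three months sum to zero."""
    return sum(abs(row[month]) for month in MONTHS) == 0

customers = [
    {"Name": "Joe Schmo",  "Month - 3": 123.23, "Month - 2": 0,     "Month - 1": 0},
    {"Name": "Jane Plain", "Month - 3": 0,      "Month - 2": 0,     "Month - 1": 0},
    {"Name": "Mary Happy", "Month - 3": 0,      "Month - 2": 55.22, "Month - 1": 243.33},
    {"Name": "Tom Thumb",  "Month - 3": 12.34,  "Month - 2": 8.34,  "Month - 1": 14.56},
]

labels = {c["Name"]: is_default(c) for c in customers}
```

Only Jane Plain, with three zero payments in a row, is labeled as a default.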

FE Demo #5

Flatline s-expressions
date volume price
1 34353 314
2 44455 315
3 22333 315
4 52322 321
5 28000 320
6 31254 319
7 56544 323
8 44331 324
9 81111 287
10 65422 294
11 59999 300
12 45556 302
13 19899 301
Shock: Deviations from a Trend
Shock = (Current - 4-day avg) / std dev
date  day-4  day-3  day-2  day-1  4davg
1     -      -      -      -      -
2     -      -      -      314    -
3     -      -      314    315    -
4     -      314    315    315    -
5     314    315    315    321    316,25
6     315    315    321    320    317,75
7     315    321    320    319    318,75

Flatline s-expressions
Shock: Deviations from a Trend
Shock = (Current - 4-day avg) / std dev
Current: (field "price")
4-day avg: (avg-window "price" -4 -1)
std dev: (standard-deviation "price")
(/ (- (f "price") (avg-window "price" -4 -1)) (standard-deviation "price"))
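A Python sketch of the same "shock" feature over the price column from the slides. Two assumptions are made here and are not confirmed by the slides: rows with fewer than four previous days get no value (`None`), and the standard deviation is the sample standard deviation of the whole column.

```python
from statistics import mean, stdev

prices = [314, 315, 315, 321, 320, 319, 323, 324, 287, 294, 300, 302, 301]

def trailing_mean(values, i, window=4):
    """Average of the `window` rows before row i, like (avg-window "price" -4 -1)."""
    if i < window:
        return None  # assumed: not enough history
    return mean(values[i - window:i])

def shock(values, window=4):
    """(current - trailing average) / standard deviation, per row."""
    sd = stdev(values)  # assumed: sample standard deviation over all rows
    result = []
    for i, current in enumerate(values):
        avg = trailing_mean(values, i, window)
        result.append(None if avg is None else (current - avg) / sd)
    return result

shocks = shock(prices)
```

The trailing averages for days 5 to 7 reproduce the 4davg column, and the largest negative shock falls on day 9, where the price drops from 324 to 287.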

FE Demo #6

Advanced s-expressions
Highway isEven?
(= (mod (field "Highway Number") 2) 0)

Advanced s-expressions
Moon Phase%
(/
  (mod
    (-
      (/ (epoch (field "date-field")) 1000)
      621300)
    2551443)
  2551442)
https://gist.github.com/petersen-poul/0cf5022ed1768837fe13af72b2488329

Home Price Feature
[Figure labels: Worth More, Worth Less]

Home Price Feature
LATITUDE   LONGITUDE    REFERENCE LATITUDE  REFERENCE LONGITUDE  Distance (m)
44,583     -123,296775  44,5638             -123,2794            700
44,604414  -123,296129  44,5638             -123,2794            30,4
44,600108  -123,29707   44,5638             -123,2794            19,38
44,603077  -123,295004  44,5638             -123,2794            37,8
44,589587  -123,301154  44,5638             -123,2794            23,39

Haversine Formula
https://en.wikipedia.org/wiki/Haversine_formula

Advanced s-expressions
( let
( R 6371000
latA (to-radians {lat-ref})
latB (to-radians ( field "LATITUDE" ) )
latD ( - latB latA )
longD ( to-radians ( - ( field "LONGITUDE" ) {long-ref} )
)
a ( +
( square ( sin ( / latD 2 ) ) )
( *
(cos latA)
(cos latB)
(square ( sin ( / longD 2)))
)
)
c ( * 2 ( asin ( min (list 1 (sqrt a)))))
)
( * R c )
)
Distance Lat/Long <=> Ref (Haversine)
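For comparison, here is the same Haversine computation as a Python sketch, following the Flatline `let` bindings step by step. The function name and argument order are assumptions for the example; `{lat-ref}` and `{long-ref}` become ordinary parameters.

```python
import math

EARTH_RADIUS_M = 6371000  # R in the Flatline expression

def haversine_m(lat, lon, lat_ref, lon_ref):
    """Great-circle distance in meters between (lat, lon) and the reference point."""
    lat_a = math.radians(lat_ref)
    lat_b = math.radians(lat)
    lat_d = lat_b - lat_a
    lon_d = math.radians(lon - lon_ref)
    a = (math.sin(lat_d / 2) ** 2
         + math.cos(lat_a) * math.cos(lat_b) * math.sin(lon_d / 2) ** 2)
    # min(1, sqrt(a)) guards against rounding pushing the asin argument above 1
    c = 2 * math.asin(min(1.0, math.sqrt(a)))
    return EARTH_RADIUS_M * c

same_point = haversine_m(44.5638, -123.2794, 44.5638, -123.2794)
one_degree = haversine_m(1.0, 0.0, 0.0, 0.0)
```

A point compared with itself gives 0, and one degree of latitude comes out at roughly 111.2 km, which is a quick sanity check for the formula.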

WhizzML + Flatline
[Diagram: a WHIZZML SCRIPT takes an INPUT DATASET plus LAT Ref and LONG Ref, applies the HAVERSINE FLATLINE expression, and produces an OUTPUT DATASET]
https://bigml.com/gallery/scripts

Advanced s-expressions
JSON Parser???
• Remember, Flatline is not a full programming language
  • No loops
  • No accumulated values
  • Code executes on one row at a time and has a limited view into other rows
https://gist.github.com/petersen-poul/504c62ceaace76227cc6d8e0c5f1704b

Feature Engineering
Fix Missing Values in a "Meaningful" Way
[Diagram labels: Original Dataset, Filter Zeros, Clean Dataset, Model insulin, Predict insulin, Amended Dataset, Select insulin, Fixed Dataset]
( if ( = (field "insulin") 0) (field "predicted insulin") (field "insulin"))
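The final selection step can be sketched in Python as follows. It assumes each row already carries a "predicted insulin" value from a model trained on the rows without zeros; the helper name `fix_insulin` and the sample values are hypothetical.

```python
def fix_insulin(row: dict) -> dict:
    """Replace a zero insulin reading with the model's prediction; keep real readings."""
    fixed = dict(row)  # copy so the original row is left untouched
    if fixed["insulin"] == 0:
        fixed["insulin"] = fixed["predicted insulin"]
    return fixed

# Hypothetical sample rows for illustration only.
rows = [
    {"insulin": 0,  "predicted insulin": 112.5},
    {"insulin": 94, "predicted insulin": 101.0},
]

fixed_rows = [fix_insulin(row) for row in rows]
```

Zeros, which stand in for missing values here, are filled with a prediction, while measured values pass through unchanged.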

FE Demo #7

Feature Selection

Feature Selection
• Model Summary
  • Field Importance
• Algorithmic
  • Best-First Feature Selection
  • Boruta
• Leakage
  • Tight Correlations (AD, Plot, Correlations)
  • Test Data
  • Perfect future knowledge
Care must be taken when creating features!

Feature Selection
Leakage
• sales pipeline where step n-1 has no other outcome than step n.
• stock close predicts stock open
• churn retention: the worst rep is actually the best (correlation != causation)
• cancer prediction where one input is a doctor-ordered test for the condition
• account ID predicts fraud (because only new accounts are fraudsters)

Summary
• Feature Engineering: what it is / why it is important
• Automatic transformations: date-time, text, etc.
• Built-in functions: filtering and feature engineering
  • Discretization / Normalization / etc.
• Flatline: programmatic feature engineering / filtering
  • Structure
  • Examples: Adding fields / filtering
• When building features it is important to watch for leakage

OptiML and Fusions
Automating Machine Learning
Poul Petersen
CIO, BigML, Inc

[Figure: model types placed along two axes, "Increasing Data Size / Complexity" and "Decreasing Interpretability / Better Representation / Longer Training". Stages: Early Stage (Rapid Prototyping), Mid Stage (Proven Application), Late Stage (Critical Performance). Models shown: Single Tree Model, Logistic Regression, Random Decision Forest, Decision Forest, Boosted Trees, Deepnets. One region is marked "TOO HARD".]

BigML Deepnets
Remember this?
• The success of a Deepnet is dependent on getting the right network structure for the dataset
• But, there are too many parameters:
  • Nodes, layers, activation function, learning rate, etc…
• And setting them takes significant expert knowledge
• Solution:
  • Metalearning (a good initial guess)
  • Network search (try a bunch)

OptiML
• Each resource has several parameters that impact quality
  • Number of trees, missing splits, nodes, weight
• Rather than trial and error, we can use ML to find ideal parameters
• Why not make the model type (Decision Tree, Boosted Tree, etc.) a parameter as well?
• Similar to Deepnet network search, but finds the optimum machine learning algorithm and parameters for your data automatically
Key Insight: We can solve any parameter selection problem in a similar way.

The Challenge…
• We will start with a dataset from StumbleUpon
  • Train/Test split with seed "bigml"
• Build and Evaluate:
  • 1-click Model, LR, Ensemble, Deepnet
  • Top model from OptiML output
• Compare the results using the phi coefficient
• Explore other ideas for improving performance further
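For readers unfamiliar with the phi coefficient used to compare models, here is a minimal Python sketch computed from a binary confusion matrix. The counts in the example are hypothetical and are not results from the talk.

```python
import math

def phi_coefficient(tp: int, fp: int, fn: int, tn: int) -> float:
    """Phi coefficient for a binary confusion matrix; ranges from -1 to 1."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator

perfect = phi_coefficient(tp=50, fp=0, fn=0, tn=50)
random_guess = phi_coefficient(tp=25, fp=25, fn=25, tn=25)
inverted = phi_coefficient(tp=0, fp=50, fn=50, tn=0)
```

A value of 1 means perfect agreement with the holdout labels, 0 is no better than chance, and -1 is perfectly wrong, so higher phi is better when comparing the models.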

OptiML Demo

Results…
All scores are phi, evaluated against a holdout
• 1-Click Decision Tree: 0.36
• 1-Click LR: 0.47
• 1-Click Ensemble: 0.58
• Best OptiML Model (LR): 0.66
• 1-Click Deepnet: 0.67
What else can we try?

Fusions Inside
• Fuse any set of models into a new "fusion"
  • Must have the same objective type
  • Inputs and feature space can differ
• Weights can be added
  • Give more importance to individual models
• Fusions can be fused as well
• Especially useful for fusing OptiML models
Key Insight: ML algorithms each have unique strengths and weaknesses

Performance thru Diversity
[Diagram: one Dataset feeds an Optimized Deepnet, an Optimized Ensemble, and an Optimized Logistic Regression; the combined result is labeled "Better?"]

Fusion Demo #1

Results…
All scores are phi, evaluated against a holdout
• 1-Click Decision Tree: 0.36
• 1-Click LR: 0.47
• 1-Click Ensemble: 0.58
• Best OptiML Model (LR): 0.66
• 1-Click Deepnet: 0.67
• Fusion of top Model Types: 0.68

Fusions: Under the Hood
Classification
Model     Prediction  Probability  Weight
Ensemble  TRUE        56%          1
Deepnet   FALSE       67%          1
Model     TRUE        78%          2
Fusion    TRUE        61%
P(TRUE) = [56 + (100 - 67) + 2*78] / 4
Regression
Model     Prediction  Error  Weight
Ensemble  156,78      12,56  1
Deepnet   139,55      9,88   1
Model     172,10      23,76  2
Fusion    160,13      17,49
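The numbers on this slide are consistent with a plain weighted average, which the Python sketch below reproduces. Treating the regression error as a weighted average as well is an inference from the slide's figures, not a documented description of how BigML combines errors.

```python
def weighted_mean(values, weights):
    """Weighted average: sum(value * weight) / sum(weights)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

weights = [1, 1, 2]  # Ensemble, Deepnet, Model

# Classification: express every model's vote as P(TRUE) in percent.
# The Deepnet predicts FALSE at 67%, so its P(TRUE) is 100 - 67.
p_true = weighted_mean([56, 100 - 67, 78], weights)

# Regression: combine predictions and errors with the same weights.
prediction = weighted_mean([156.78, 139.55, 172.10], weights)
error = weighted_mean([12.56, 9.88, 23.76], weights)
```

The fused probability comes out at 61.25, shown as 61% on the slide, and the regression values round to 160.13 and 17.49.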

Fusions: Like any BigML Model
• Fully accessible through API and WhizzML
• Bindings have support for local predictions

Decision Boundary Smoothness
Single Tree:
• Outcome changes abruptly near decision boundary
• And not at all parallel to the boundary
• This can be "surprising"
Single Tree + Deepnet:
• Keep the interpretability of the tree
• But with a more nuanced decision boundary

Feature Stability
Feature Importance: Different subsets of features may have similar modeling performance
Fusing models gives better resilience against missing values as well as ensuring that all relevant features are utilized.

Weighting over Time
Data significance over time:
• Some data may change significance at different times
• Short-term user behavior versus long-term
• Weights can be set to account for significance of time
[Diagram: models built on 1 Day, 1 Week, and 1 Month of data, with weights w=8, w=4, and w=2]

Improved Class Separation
Consider a 3-class objective
• Really only care about "yes" versus "not yes"
• A single model may struggle to separate the two negative classes
[Diagram: classes Yes, No, Maybe; models: yes/no/maybe, yes/no, yes/maybe]

Feature Space Optimization
Model Skills: Some ML algorithms "generally" do better on some feature types:
• RDF for sparse text vectors
• LR/Deepnets for numeric features
• Trees for categorical features
[Diagram labels: Full, Numeric, Text]

Fusions Demo #2

Results…
All scores are phi, evaluated against a holdout
• 1-Click Decision Tree: 0.36
• 1-Click LR: 0.47
• 1-Click Ensemble: 0.58
• Best OptiML Model (LR): 0.66
• 1-Click Deepnet: 0.67
• Fusion of top Model Types: 0.68
• Custom Feature Fusion: 0.70

PCA
Principal Component Analysis
Poul Petersen
CIO, BigML

Issues with High Dimensionality
• Implicitly increases model complexity, prone to overfitting
• Requires more observations in order to generalize well
• Contains correlated or useless variables
• Data is difficult to visualize
• Takes a longer time to train models or make predictions
Principal Component Analysis addresses all of these issues
Other Approaches
60
MODEL Pruning, Node threshold
ENSEMBLE Bagging, Randomization
LOGISTIC
REGRESSION
L1 and L2 penalties
DEEPNET Dropout

Dimensionality Reduction
Feature Selection
• Preserves the original variables and selects a subset
• Often uses recursive methods or statistical thresholds
• Examples: RFE, Chi-Squared Test, Boruta
Feature Extraction
• Transforms original variables into variables better suited for modeling
• Examples: word vectors, clustering
• PCA falls into this category
Manual Approach

When to use PCA
1. You want to reduce the number of variables in your model, but it is not clear which should be eliminated
2. You want to generate variables that are not correlated
3. You are okay with sacrificing some amount of interpretability for potential downstream performance gains

How Does PCA Work?
Each PC is a linear combination of the original variables, with its own set of weights:
PC1 = w11·F1 + w12·F2 + w13·F3 + … + w1N·FN
PC2 = w21·F1 + w22·F2 + w23·F3 + … + w2N·FN
…
PCN = wN1·F1 + wN2·F2 + wN3·F3 + … + wNN·FN

PCA Output
These principal components are not correlated

PCA Workflow
[Diagram, built up over four slides: SOURCE → DATASET → TRAIN / TEST; a PCA is created and used in two BATCH PROJECTIONs, one for TRAIN and one for TEST, giving NEW TRAIN FEATURES and NEW TEST FEATURES]
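The workflow can be illustrated with a tiny two-feature PCA in pure Python. This is a sketch under stated assumptions: the PCA is fitted on the training rows only and then applied to both splits, the data values are made up, and the closed-form rotation angle only works for exactly two features (real PCA implementations use a general eigendecomposition or SVD).

```python
import math

def fit_pca_2d(rows):
    """Return (means, components) for two numeric features, strongest component first."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    # Rotation angle that diagonalizes the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    pc1 = (math.cos(theta), math.sin(theta))    # weights of PC1 on F1, F2
    pc2 = (-math.sin(theta), math.cos(theta))   # weights of PC2 on F1, F2
    return (mean_x, mean_y), (pc1, pc2)

def project(rows, means, components):
    """Batch projection: center with the training means, then apply each PC's weights."""
    mean_x, mean_y = means
    return [
        tuple(w1 * (x - mean_x) + w2 * (y - mean_y) for w1, w2 in components)
        for x, y in rows
    ]

# Hypothetical data for illustration only.
train = [(2.0, 1.9), (4.0, 4.2), (6.0, 5.8), (8.0, 8.1)]
test = [(3.0, 3.1), (7.0, 6.8)]

means, components = fit_pca_2d(train)           # PCA built from TRAIN
new_train = project(train, means, components)   # NEW TRAIN FEATURES
new_test = project(test, means, components)     # NEW TEST FEATURES
```

Each new feature is a weighted sum of the original ones, and on the training rows the two new features have zero covariance, which is the "not correlated" property of principal components.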

PCA Demo

BigML PCA
• Standard PCA only applies to numerical data
• BigML uses three different data transformation methods in order to handle different data types
  • Numeric data: Principal Component Analysis (PCA)
  • Categorical data: Multiple Correspondence Analysis (MCA)
  • Mixed data: Factorial Analysis of Mixed Data (FAMD)
• BigML will automatically handle numeric, text, items, and categorical data without needing user input
DutchMLSchool. Automating Decision Making

  • 1. 1st edition | July 8-11, 2019
  • 2. BigML, Inc #DutchMLSchool 2 Feature Engineering Creating Features that Make Machine Learning Work Poul Petersen CIO, BigML, Inc
  • 3. BigML, Inc #DutchMLSchool Gaming the ML Performance 3 • Use ML to improve performance automatically • OptiML • Unsupervised Feature Engineering (PCA, Topic Models, Clustering, Anomaly Detection, etc) • Automated feature selection • Use domain knowledge to improve performance manually • Bespoke features (requires expertise) • Fusions of models • Manual feature selection A Tale of Two Strategies…
  • 4. BigML, Inc #DutchMLSchool what is Feature Engineering 4 Feature Engineering: applying domain knowledge of the data to create new features that allow ML algorithms to work better, or to work at all. • This is really, really important - more than algorithm selection! • In fact, so important that BigML often does it automatically • ML Algorithms have no deeper understanding of data • Numerical: have a natural order, can be scaled, etc • Categorical: have discrete values, etc. • The "magic" is the ability to find patterns quickly and efficiently • ML Algorithms only know what you tell/show it with data • Medical: Kg and M, but BMI = Kg/M2 is better • Lending: Debt and Income, but DTI is better • Intuition can be risky: remember to prove it with an evaluation!
  • 5. BigML, Inc #DutchMLSchool Built-in Transformations 5 2013-09-25 10:02 Date-Time Fields … year month day hour minute … … 2013 Sep 25 10 2 … … … … … … … … NUM NUMCAT NUM NUM • Date-Time fields have a lot of information "packed" into them • Splitting out the time components allows ML algorithms to discover time-based patterns. DATE-TIME
  • 6. BigML, Inc #DutchMLSchool Built-in Transformations 6 Categorical Fields for Clustering/LR … alchemy_category … … business … … recreation … … health … … … … CAT business health recreation … … 1 0 0 … … 0 0 1 … … 0 1 0 … … … … … … NUM NUM NUM • Clustering and Logistic Regression require numeric fields for inputs • Categorical values are transformed to numeric vectors automatically* • *Note: In BigML, clustering uses k-prototypes and the encoding used for LR can be configured.
  • 7. BigML, Inc #DutchMLSchool Built-in Transformations 7 Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon ‘em. TEXT Text Fields … great afraid born achieve … … 4 1 1 1 … … … … … … … NUM NUM NUM NUM • Unstructured text contains a lot of potentially interesting patterns • Bag-of-words analysis happens automatically and extracts the "interesting" tokens in the text • Another option is Topic Modeling to extract thematic meaning
  • 8. BigML, Inc #DutchMLSchool Help ML to Work Better 8 { “url":"cbsnews", "title":"Breaking News Headlines Business Entertainment World News “, "body":" news covering all the latest breaking national and world news headlines, including politics, sports, entertainment, business and more.” } TEXT title body Breaking News… news covering… … … TEXT TEXT When text is not actually unstructured • In this case, the text field has structure (key/value pairs) • Extracting the structure as new features may allow the ML algorithm to work better
  • 10. BigML, Inc #DutchMLSchool Help ML to Work at all 10 When the pattern does not exist Highway Number Direction Is Long 2 East-West FALSE 4 East-West FALSE 5 North-South TRUE 8 East-West FALSE 10 East-West TRUE … … … Goal: Predict principle direction from highway number ( = (mod (field "Highway Number") 2) 0)
  • 12. BigML, Inc #DutchMLSchool Feature Engineering 12 Discretization Total Spend 7.342,99 304,12 4,56 345,87 8.546,32 NUM “Predict will spend $3,521 with error $1,232” Spend Category Top 33% Bottom 33% Bottom 33% Middle 33% Top 33% CAT “Predict customer will be Top 33% in spending”
  • 14. BigML, Inc #DutchMLSchool Built-ins for FE 14 • Discretize: Converts a numeric value to categorical • Replace missing values: fixed/max/mean/median/etc • Normalize: Adjust a numeric value to a specific range of values while preserving the distribution • Math: Exponentiation, Logarithms, Squares, Roots, etc • Types: Force a field value to categorical, integer, or real • Random: Create random values for introducing noise • Statistics: Mean, Population • Refresh Fields: • Types: recomputes field types. Ex: #classes > 1000 • Preferred: recomputes preferred status
  • 15. BigML, Inc #DutchMLSchool Flatline Add Fields 15 Computing with Existing Features Debt Income 10.134 100.000 85.234 134.000 8.112 21.500 0 45.900 17.534 52.000 NUM NUM (/ (field "Debt") (field "Income")) Debt Income Debt to Income Ratio 0,10 0,64 0,38 0 0,34 NUM
  • 17. BigML, Inc #DutchMLSchool What is Flatline? 17 • DSL: • Invented by BigML - Programmatic / Optimized for speed • Transforms datasets into new datasets • Adding new fields / Filtering • Transformations are written in lisp-style syntax • Feature Engineering • Computing new fields: (/ (field "Debt") (field “Income”)) • Programmatic Filtering: • Filtering datasets according to functions that evaluate to true/false using the row of data as an input. Flatline: a domain specific language for feature engineering and programmatic filtering
  • 18. BigML, Inc #DutchMLSchool Flatline 18 • Lisp style syntax: Operators come first • Correct: (+ 1 2) => NOT Correct: (1 + 2) • Dataset Fields are first-class citizens • (field “diabetes pedigree”) • Limited programming language structures • let, cond, if, map, list operators, */+-, etc. • Built-in transformations • statistics, strings, timestamps, windows
  • 19. BigML, Inc #DutchMLSchool Flatline s-expressions 19 Adding Simple Labels to Data
      Define "default" as missing three payments in a row:
      (= 0 (+ (abs (f "Month - 3")) (abs (f "Month - 2")) (abs (f "Month - 1"))))
      Un-labelled data:
      Name | Month - 3 | Month - 2 | Month - 1
      Joe Schmo | 123,23 | 0 | 0
      Jane Plain | 0 | 0 | 0
      Mary Happy | 0 | 55,22 | 243,33
      Tom Thumb | 12,34 | 8,34 | 14,56
      Labelled data:
      Name | Month - 3 | Month - 2 | Month - 1 | Default
      Joe Schmo | 123,23 | 0 | 0 | FALSE
      Jane Plain | 0 | 0 | 0 | TRUE
      Mary Happy | 0 | 55,22 | 243,33 | FALSE
      Tom Thumb | 12,34 | 8,34 | 14,56 | FALSE
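As a rough cross-check of the labelling rule, here is a Python sketch of the same logic: a row is a default when the absolute payments of the last three months sum to zero. The dictionary layout and the function name `is_default` are assumptions made for illustration.

```python
def is_default(row):
    """True when all three monthly payments are zero (sum of absolute values is 0)."""
    months = ("Month - 3", "Month - 2", "Month - 1")
    return sum(abs(row[m]) for m in months) == 0


# Rows taken from the slide.
customers = [
    {"Name": "Joe Schmo", "Month - 3": 123.23, "Month - 2": 0, "Month - 1": 0},
    {"Name": "Jane Plain", "Month - 3": 0, "Month - 2": 0, "Month - 1": 0},
    {"Name": "Mary Happy", "Month - 3": 0, "Month - 2": 55.22, "Month - 1": 243.33},
    {"Name": "Tom Thumb", "Month - 3": 12.34, "Month - 2": 8.34, "Month - 1": 14.56},
]
labels = [is_default(c) for c in customers]
```

Taking absolute values before summing guards against a positive and a negative entry cancelling out, which is presumably why the Flatline expression uses `abs`.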
  • 21. BigML, Inc #DutchMLSchool Flatline s-expressions 21 Shock: Deviations from a Trend
      Shock = (Current - 4-day avg) / std dev
      date | volume | price
      1 | 34353 | 314
      2 | 44455 | 315
      3 | 22333 | 315
      4 | 52322 | 321
      5 | 28000 | 320
      6 | 31254 | 319
      7 | 56544 | 323
      8 | 44331 | 324
      9 | 81111 | 287
      10 | 65422 | 294
      11 | 59999 | 300
      12 | 45556 | 302
      13 | 19899 | 301
      Window values for dates 1 to 7:
      day-4 | day-3 | day-2 | day-1 | 4davg
        |   |   |   | -
        |   |   | 314 | -
        |   | 314 | 315 | -
        | 314 | 315 | 315 | -
      314 | 315 | 315 | 321 | 316,25
      315 | 315 | 321 | 320 | 317,75
      315 | 321 | 320 | 319 | 318,75
  • 22. BigML, Inc #DutchMLSchool Flatline s-expressions 22 Shock: Deviations from a Trend
      Shock = (Current - 4-day avg) / std dev
      Current: (field "price")
      4-day avg: (avg-window "price" -4 -1)
      std dev: (standard-deviation "price")
      (/ (- (f "price") (avg-window "price" -4 -1)) (standard-deviation "price"))
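The shock feature can be sketched in plain Python to make the window arithmetic concrete. Two choices here are assumptions rather than documented Flatline behavior: rows without four full prior days get `None`, and the standard deviation is taken as the population standard deviation over the whole column.

```python
import statistics


def shock(prices, window=4):
    """(current - mean of the previous `window` prices) / std dev of all prices."""
    sd = statistics.pstdev(prices)  # assumption: population standard deviation
    result = []
    for i, price in enumerate(prices):
        if i < window:
            # Assumption: no value until a full window of prior days exists.
            result.append(None)
        else:
            avg = sum(prices[i - window:i]) / window
            result.append((price - avg) / sd)
    return result


# Price column from the slide (dates 1 to 13).
prices = [314, 315, 315, 321, 320, 319, 323, 324, 287, 294, 300, 302, 301]
shocks = shock(prices)
```

On this data the largest negative shock lands on date 9, where the price drops from 324 to 287 against a 4-day average of 321.5.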
  • 24. BigML, Inc #DutchMLSchool Advanced s-expressions 24 ( = (mod (field "Highway Number") 2) 0) Highway isEven?
  • 25. BigML, Inc #DutchMLSchool Advanced s-expressions 25 Moon Phase % (/ (mod (- (/ (epoch (field "date-field")) 1000) 621300) 2551443) 2551442) https://gist.github.com/petersen-poul/0cf5022ed1768837fe13af72b2488329
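A Python sketch of the same arithmetic, assuming the timestamp is in milliseconds since the Unix epoch (which the division by 1000 suggests). The constants are copied from the slide; 2551443 seconds is roughly 29.53 days, the length of one lunar cycle. The function name is hypothetical.

```python
def moon_phase_fraction(epoch_ms):
    """Fraction of the lunar cycle elapsed, using the constants from the slide."""
    seconds = epoch_ms / 1000          # assumption: input is in milliseconds
    offset = 621300                    # reference offset in seconds (from the slide)
    cycle = 2551443                    # lunar cycle length in seconds (~29.53 days)
    return ((seconds - offset) % cycle) / 2551442


start_of_cycle = moon_phase_fraction(621300 * 1000)
half_cycle = moon_phase_fraction((621300 + 2551443 / 2) * 1000)
```

The value climbs from 0 towards 1 over each cycle and then wraps, so a model can pick up on cyclical patterns that a raw timestamp hides.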
  • 26. BigML, Inc #DutchMLSchool Home Price Feature 26 Worth More Worth Less
  • 27. BigML, Inc #DutchMLSchool Home Price Feature 27
      LATITUDE | LONGITUDE | REFERENCE LATITUDE | REFERENCE LONGITUDE
      44,583 | -123,296775 | 44,5638 | -123,2794
      44,604414 | -123,296129 | 44,5638 | -123,2794
      44,600108 | -123,29707 | 44,5638 | -123,2794
      44,603077 | -123,295004 | 44,5638 | -123,2794
      44,589587 | -123,301154 | 44,5638 | -123,2794
      Distance (m) 700 30,4 19,38 37,8 23,39
  • 28. BigML, Inc #DutchMLSchool Haversine Formula 28 https://en.wikipedia.org/wiki/Haversine_formula
  • 29. BigML, Inc #DutchMLSchool Advanced s-expressions 29 ( let ( R 6371000 latA (to-radians {lat-ref}) latB (to-radians ( field "LATITUDE" ) ) latD ( - latB latA ) longD ( to-radians ( - ( field "LONGITUDE" ) {long-ref} ) ) a ( + ( square ( sin ( / latD 2 ) ) ) ( * (cos latA) (cos latB) (square ( sin ( / longD 2))) ) ) c ( * 2 ( asin ( min (list 1 (sqrt a))))) ) ( * R c ) ) Distance Lat/Long <=> Ref (Haversine)
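For readers who want to check the Flatline expression step by step, here is a line-for-line Python sketch of the same Haversine computation. The function and parameter names are illustrative; `lat_ref` and `lon_ref` stand in for the `{lat-ref}` and `{long-ref}` placeholders on the slide.

```python
import math


def haversine_m(lat, lon, lat_ref, lon_ref, radius_m=6371000):
    """Great-circle distance in meters between (lat, lon) and a reference point."""
    lat_a = math.radians(lat_ref)
    lat_b = math.radians(lat)
    lat_d = lat_b - lat_a
    lon_d = math.radians(lon - lon_ref)
    a = (math.sin(lat_d / 2) ** 2
         + math.cos(lat_a) * math.cos(lat_b) * math.sin(lon_d / 2) ** 2)
    # min(1, ...) guards asin against rounding slightly above 1, as on the slide.
    c = 2 * math.asin(min(1.0, math.sqrt(a)))
    return radius_m * c


# Distance from the first row on the Home Price Feature slide to the reference point.
d = haversine_m(44.583, -123.296775, 44.5638, -123.2794)
```

One degree of latitude along a meridian comes out at about 111.2 km with this radius, which is a quick sanity check for the formula.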
  • 30. BigML, Inc #DutchMLSchool WhizzML + Flatline 30 [Diagram: an INPUT DATASET plus LAT Ref and LONG Ref are passed to a WHIZZML SCRIPT that applies the HAVERSINE FLATLINE and produces an OUTPUT DATASET] https://bigml.com/gallery/scripts
  • 31. BigML, Inc #DutchMLSchool Advanced s-expressions 31 JSON Parser??? • Remember, Flatline is not a full programming language • No loops • No accumulated values • Code executes on one row at a time and has a limited view into other rows https://gist.github.com/petersen-poul/504c62ceaace76227cc6d8e0c5f1704b
  • 32. BigML, Inc #DutchMLSchool Feature Engineering 32 Fix Missing Values in a “Meaningful” Way [Workflow diagram: Original Dataset -> Filter Zeros -> Clean Dataset -> Model insulin -> Predict insulin -> Amended Dataset -> Select insulin -> Fixed Dataset] (if (= (field "insulin") 0) (field "predicted insulin") (field "insulin"))
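The final selection step of this workflow can be sketched in Python. It assumes each row already carries a "predicted insulin" value produced by a model trained on the rows where insulin was not zero; the sample rows and the function name `fixed_insulin` are made up for illustration.

```python
def fixed_insulin(row):
    """Use the model's prediction only where the recorded insulin value is 0."""
    if row["insulin"] == 0:
        return row["predicted insulin"]
    return row["insulin"]


# Hypothetical rows: a zero stands in for a missing measurement.
amended = [
    {"insulin": 94, "predicted insulin": 101.5},
    {"insulin": 0, "predicted insulin": 120.0},
]
fixed = [fixed_insulin(r) for r in amended]
```

Measured values pass through untouched, so the model only fills the gaps rather than overwriting real data.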
  • 35. BigML, Inc #DutchMLSchool Feature Selection 35 • Model Summary • Field Importance • Algorithmic • Best-First Feature Selection • Boruta • Leakage • Tight Correlations (AD, Plot, Correlations) • Test Data • Perfect future knowledge Care must be taken when creating features!
  • 36. BigML, Inc #DutchMLSchool Feature Selection 36 Leakage • sales pipeline where step n-1 has no other outcome than step n • stock close predicts stock open • churn retention: the worst rep is actually the best (correlation != causation) • cancer prediction where one input is a doctor-ordered test for the condition • account ID predicts fraud (because only new accounts are fraudsters)
  • 37. BigML, Inc #DutchMLSchool Summary 37 • Feature Engineering: what is it / why it is important • Automatic transformations: date-time, text, etc • Built-in functions: filtering and feature engineering • Discretization / Normalization / etc. • Flatline: programmatic feature engineering / filtering • Structure • Examples: Adding fields / filtering • When building features it is important to watch for leakage
  • 38. BigML, Inc #DutchMLSchool 38 OptiML and Fusions Automating Machine Learning Poul Petersen CIO, BigML, Inc
  • 39. BigML, Inc #DutchMLSchool Title 39 [Chart: one axis reads "Decreasing Interpretability / Better Representation / Longer Training", the other "Increasing Data Size / Complexity". Stages: Early Stage (Rapid Prototyping), Mid Stage (Proven Application), Late Stage (Critical Performance). Models plotted: Single Tree Model, Logistic Regression, Decision Forest, Random Decision Forest, Boosted Trees, Deepnets. One region is marked "TOO HARD".]
  • 40. BigML, Inc #DutchMLSchool BigML Deepnets 40 • The success of a Deepnet is dependent on getting the right network structure for the dataset • But, there are too many parameters: • Nodes, layers, activation function, learning rate, etc… • And setting them takes significant expert knowledge • Solution: • Metalearning (a good initial guess) • Network search (try a bunch) Remember this?
  • 41. BigML, Inc #DutchMLSchool OptiML 41 • Each resource has several parameters that impact quality • Number of trees, missing splits, nodes, weight • Rather than trial and error, we can use ML to find ideal parameters • Why not make the model type, Decision Tree, Boosted Tree, etc, a parameter as well? • Similar to Deepnet network search, but finds the optimum machine learning algorithm and parameters for your data automatically Key Insight: We can solve any parameter selection problem in a similar way.
  • 42. BigML, Inc #DutchMLSchool The Challenge… 42 • We will start with a dataset from StumbleUpon • Train/Test split with seed “bigml” • Build and Evaluate: • 1-click Model, LR, Ensemble, Deepnet • Top model from OptiML output • Compare the results using the phi coefficient • Explore other ideas for improving performance further
  • 44. BigML, Inc #DutchMLSchool Results… 44 All scores are phi, evaluated against a holdout • 1-Click Decision Tree: 0.36 • 1-Click LR: 0.47 • 1-Click Ensemble: 0.58 • Best OptiML Model (LR): 0.66 • 1-Click Deepnet: 0.67 • What else can we try?
  • 45. BigML, Inc #DutchMLSchool Fusions Inside 45 • Fuse any set of models into a new “fusion” • Must have the same objective type • Inputs and feature space can differ • Weights can be added • Give more importance to individual models • Fusions can be fused as well • Especially useful for fusing OptiML models Key Insight: ML algorithms each have unique strengths and weaknesses
  • 46. BigML, Inc #DutchMLSchool Performance thru Diversity 46 Dataset Optimized Deepnet Optimized Ensemble Optimized Logistic Regression Better?
  • 48. BigML, Inc #DutchMLSchool Results… 48 All scores are phi, evaluated against a holdout • 1-Click Decision Tree: 0.36 • 1-Click LR: 0.47 • 1-Click Ensemble: 0.58 • Best OptiML Model (LR): 0.66 • 1-Click Deepnet: 0.67 • Fusion of top Model Types: 0.68
  • 49. BigML, Inc #DutchMLSchool Fusions: Under the Hood 49
      Classification:
      Model | Prediction | Probability | Weight
      Ensemble | TRUE | 56% | 1
      Deepnet | FALSE | 67% | 1
      Model | TRUE | 78% | 2
      Fusion | TRUE | 61% |
      P(TRUE) = [56 + (100 - 67) + 2*78] / 4
      Regression:
      Model | Prediction | Error | Weight
      Ensemble | 156,78 | 12,56 | 1
      Deepnet | 139,55 | 9,88 | 1
      Model | 172,10 | 23,76 | 2
      Fusion | 160,13 | 17,49 |
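The arithmetic on this slide is a weighted average, which the following Python sketch reproduces. It assumes a binary objective, so a model predicting FALSE with probability p contributes 1 - p towards TRUE, exactly as in the slide's formula. The function names are illustrative, and BigML's actual fusion implementation may handle more cases than this.

```python
def fuse_classification(votes, positive="TRUE"):
    """Weighted average of P(positive) over (label, probability, weight) votes."""
    total_weight = sum(w for _, _, w in votes)
    weighted = sum(
        w * (p if label == positive else 1 - p)  # assumption: binary objective
        for label, p, w in votes
    )
    return weighted / total_weight


def fuse_regression(predictions):
    """Weighted average over (value, weight) pairs."""
    total_weight = sum(w for _, w in predictions)
    return sum(v * w for v, w in predictions) / total_weight


# Values from the slide.
p_true = fuse_classification([("TRUE", 0.56, 1), ("FALSE", 0.67, 1), ("TRUE", 0.78, 2)])
fused_value = fuse_regression([(156.78, 1), (139.55, 1), (172.10, 2)])
fused_error = fuse_regression([(12.56, 1), (9.88, 1), (23.76, 2)])
```

These give P(TRUE) of about 0.61, a fused prediction of about 160.13, and a fused error of 17.49, matching the Fusion rows on the slide.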
  • 50. BigML, Inc #DutchMLSchool Fusions: Like any BigML Model 50 • Fully accessible thru API and WhizzML • Bindings have support for local predictions
  • 51. BigML, Inc #DutchMLSchool Decision Boundary Smoothness 51 Single Tree: • Outcome changes abruptly near decision boundary • And not at all parallel to the boundary • This can be “surprising” Single Tree + Deepnet: • Keep the interpretability of the tree • But with a more nuanced decision boundary
  • 52. BigML, Inc #DutchMLSchool Feature Stability 52 Feature Importance: Different subsets of features may have similar modeling performance Fusing models gives better resilience against missing values as well as ensuring that all relevant features are utilized.
  • 53. BigML, Inc #DutchMLSchool Weighting over Time 53 Data significance over time: • Some data may change in significance over time • Short-term user behavior versus long-term • Weights can be set to account for the significance of time [Diagram labels: 1 Day, 1 Week, 1 Month; w=8, w=4, w=2]
  • 54. BigML, Inc #DutchMLSchool Improved Class Separation 54 Consider a 3-class objective • Really only care about “yes” versus “not yes” • A single model may struggle to separate the two negative classes Yes No Maybe yes/no/maybe yes/no yes/maybe
  • 55. BigML, Inc #DutchMLSchool Feature Space Optimization 55 Model Skills: Some ML algorithms “generally” do better on some feature types: • RDF for sparse text vectors • LR/Deepnets for numeric features • Trees for categorical features Full Numeric Text
  • 57. BigML, Inc #DutchMLSchool Results… 57 All scores are phi, evaluated against a holdout • 1-Click Decision Tree: 0.36 • 1-Click LR: 0.47 • 1-Click Ensemble: 0.58 • Best OptiML Model (LR): 0.66 • 1-Click Deepnet: 0.67 • Fusion of top Model Types: 0.68 • Custom Feature Fusion: 0.70
  • 58. BigML, Inc #DutchMLSchool PCA Principal Component Analysis Poul Petersen CIO, BigML 58
  • 59. BigML, Inc #DutchMLSchool Issues with High Dimensionality 59 • Implicitly increases model complexity, prone to overfitting • Requires more observations in order to generalize well • Contains correlated or useless variables • Data is difficult to visualize • Takes a longer time to train models or make predictions Principal Component Analysis addresses all of these issues
  • 60. BigML, Inc #DutchMLSchool Other Approaches 60 MODEL Pruning, Node threshold ENSEMBLE Bagging, Randomization LOGISTIC REGRESSION L1 and L2 penalties DEEPNET Dropout
  • 61. BigML, Inc #DutchMLSchool Dimensionality Reduction 61 Feature Selection • Preserves the original variables and selects a subset • Often uses recursive methods or statistical thresholds • Examples: RFE, Chi-Squared Test, Boruta Feature Extraction • Transforms original variables into variables better suited for modeling • Examples: word vectors, clustering • PCA falls into this category Manual Approach
  • 62. BigML, Inc #DutchMLSchool When to use PCA 62 1. You want to reduce the number of variables in your model, but it is not clear which should be eliminated 2. You want to generate variables that are not correlated 3. You are okay with sacrificing some amount of interpretability for potential downstream performance gains
  • 63. BigML, Inc #DutchMLSchool How Does PCA Work? 63 Each PC is a linear combination of the original variables, with its own set of weights:
      PC1 = w(1,1)·F1 + w(1,2)·F2 + w(1,3)·F3 + … + w(1,N)·FN
      PC2 = w(2,1)·F1 + w(2,2)·F2 + w(2,3)·F3 + … + w(2,N)·FN
      …
      PCN = w(N,1)·F1 + w(N,2)·F2 + w(N,3)·F3 + … + w(N,N)·FN
  • 64. BigML, Inc #DutchMLSchool PCA Output 64 These principal components are not correlated
  • 65. BigML, Inc #DutchMLSchool PCA Workflow 65 SOURCE DATASET TRAIN TEST
  • 66. BigML, Inc #DutchMLSchool PCA Workflow 66 PCA SOURCE DATASET TRAIN TEST
  • 67. BigML, Inc #DutchMLSchool PCA Workflow 67 BATCH PROJECTION BATCH PROJECTION SOURCE DATASET TRAIN TEST PCA
  • 68. BigML, Inc #DutchMLSchool PCA Workflow 68 NEW TRAIN FEATURES NEW TEST FEATURES BATCH PROJECTION BATCH PROJECTION SOURCE DATASET TRAIN TEST PCA
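The workflow in these slides (fit the PCA on TRAIN only, then project both TRAIN and TEST) can be illustrated with a tiny two-feature example in pure Python. This is a sketch of textbook PCA using the closed-form rotation angle for a 2x2 covariance matrix; it does no scaling, the data is invented, and it is not a description of BigML's PCA resource or its batch projections.

```python
import math


def fit_pca_2d(rows):
    """Fit a two-feature PCA on training rows; returns (means, components)."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)
    # Angle of the direction of maximum variance for a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    pc1 = (math.cos(theta), math.sin(theta))    # weights of PC1
    pc2 = (-math.sin(theta), math.cos(theta))   # weights of PC2, orthogonal to PC1
    return (mx, my), (pc1, pc2)


def project(rows, means, components):
    """Each new feature is a weighted sum of the centered original features."""
    return [
        tuple(w[0] * (r[0] - means[0]) + w[1] * (r[1] - means[1]) for w in components)
        for r in rows
    ]


train = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8)]
test = [(2.5, 5.2), (6.0, 11.7)]

means, components = fit_pca_2d(train)               # fit on TRAIN only
new_train = project(train, means, components)       # projection of TRAIN
new_test = project(test, means, components)         # projection of TEST, same weights
```

Reusing the training means and weights for the test rows is the point of the workflow: the test data never influences the transformation it is evaluated with.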
  • 70. BigML, Inc #DutchMLSchool BigML PCA 70 • Standard PCA only applies to numerical data • BigML uses three different data transformation methods in order to handle different data types • Numeric data: Principal Component Analysis (PCA) • Categorical data: Multiple Correspondence Analysis (MCA) • Mixed data: Factorial Analysis of Mixed Data (FAMD) • BigML will automatically handle numeric, text, items, and categorical data without needing user input