Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product Manager, H2O.ai
Mark Landry
Data Scientist, H2O.ai
GitHub: mlandry22, Email: mark@h2o.ai
Driver vs Driverless
An Analysis of Driverless AI Feature Creation
Overview
• Analyze two recent problems with an emphasis on feature generation
• Overview of the problem
• Discussion of my approach
• Show the features Driverless created
• Feature creation
• Feature representation
• Results
Who am I?
• Data scientist @ H2O since 2015, user since 2014
• 15 top 20 finishes in Kaggle, highest rank 33
• R | H2O | data.table | GBM
• The “driver”
Problem #1
How Many Attempts will a Student Make
• Online question/answer platform with computer science problems
• Predict the number of attempts a particular student will make on a particular problem
• Data
• Student: level, ranking, highest ranking
• Problem: type, 3-tier level, points awarded
• Training: 124,000 attempt counts
• Testing: 60,000 attempt counts
Regression as Classification
• Natural problem is numerical
• End user prefers buckets
• Volume
• 1: 53%
• 2: 31%
• 3: 9%
• 4: 4%
• 5: 1%
• 6: 2%
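The bucketing described on this slide can be sketched as follows. This is a minimal illustration, not the competition's actual definition: the slide does not say how the buckets are formed, so the sketch assumes the last bucket absorbs every count at or above 6. The function name `to_bucket` and the sample counts are made up.

```python
from collections import Counter

def to_bucket(attempts: int, top_bucket: int = 6) -> int:
    """Map a raw attempt count to a class label in 1..top_bucket."""
    if attempts < 1:
        raise ValueError("attempt counts start at 1")
    # Assumption: the top bucket is open-ended ("top_bucket or more").
    return min(attempts, top_bucket)

raw_attempts = [1, 1, 2, 3, 1, 9, 2, 14, 1, 4]  # made-up counts
labels = [to_bucket(a) for a in raw_attempts]

# Share of rows per bucket, comparable to the "Volume" list on the slide.
share = {k: v / len(labels) for k, v in sorted(Counter(labels).items())}
print(labels)
print(share)
```

Treating the buckets as classes turns the numeric target into a multinomial one, which is why the later slides talk about a target encoding "of class 0".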
The Driver Approach
• Think of it like a recommender problem
• Standard: Matrix factorization, collaborative filtering
• GBM: use deep categorical encodings
• Frequent use of target encoding
(Figure annotations: three features labeled "interaction")
The Driver Approach
A messy chain of hierarchical target encoding and if/else statements
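The slide shows only a screenshot, so the following is a guess at the general shape of such a chain, not the author's code. It assumes a two-level hierarchy (student-problem pair, then problem) with a hypothetical `min_count` threshold deciding when to fall back; all names and data are illustrative.

```python
from collections import defaultdict

def fit_rates(rows, key_fn):
    """Return {key: (row count, mean target)} for rows grouped by key_fn."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        key = key_fn(row)
        sums[key] += row["y"]
        counts[key] += 1
    return {key: (counts[key], sums[key] / counts[key]) for key in counts}

def pair_key(row):
    return (row["user"], row["problem"])

def problem_key(row):
    return row["problem"]

def hierarchical_te(row, levels, global_mean, min_count=3):
    """Use the most specific level with enough rows, else fall back."""
    for key_fn, table in levels:
        count, mean = table.get(key_fn(row), (0, 0.0))
        if count >= min_count:
            return mean
    return global_mean

# y = 1 if the row belongs to the class being encoded (made-up data).
train = [
    {"user": "u1", "problem": "p1", "y": 1},
    {"user": "u1", "problem": "p1", "y": 1},
    {"user": "u1", "problem": "p1", "y": 0},
    {"user": "u2", "problem": "p1", "y": 0},
    {"user": "u2", "problem": "p2", "y": 0},
    {"user": "u3", "problem": "p2", "y": 1},
    {"user": "u3", "problem": "p2", "y": 1},
]
global_mean = sum(r["y"] for r in train) / len(train)
levels = [
    (pair_key, fit_rates(train, pair_key)),
    (problem_key, fit_rates(train, problem_key)),
]

seen_pair = hierarchical_te({"user": "u1", "problem": "p1"}, levels, global_mean)
sparse_pair = hierarchical_te({"user": "u2", "problem": "p1"}, levels, global_mean)
unseen = hierarchical_te({"user": "u9", "problem": "p9"}, levels, global_mean)
print(seen_pair, sparse_pair, unseen)
```

The rates here are computed in-sample for brevity; when the encoding feeds a model, they would normally be computed out-of-fold to avoid leaking the target.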
The Driver Approach
h2o.gbm feature importance – primarily using three target encodings
The Driverless Approach
Top 5 Features: divided into components
• {16} {CV TE} {problem_id} {0}
• {51} {CV TE} {points * problem_id} {0}
• {3} {max_rating}
• {32} {freq} {last_online_time_seconds * rating}
• {61} {NumToCatTE} {rating * max_rating * points * last_online_time_seconds} {0}
The Driverless Approach
Feature #1: same base target encoding I found to be the best
• {16} {CV TE} {problem_id} {0}
• 16: indicator of base features – 16 is later used twice more in the top 15
• CV TE: Cross-Fold target encoding
• Problem_id: feature used as basis for target encoding
• 0: the target; multinomial, so the two other uses are for class 1 & 5
• {51} {CV TE} {points * problem_id} {0}
• {3} {max_rating}
• {32} {freq} {last_online_time_seconds * rating}
• {61} {NumToCatTE} {rating * max_rating * points * last_online_time_seconds} {0}
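Cross-fold target encoding can be sketched as below. This is a generic out-of-fold implementation under simple assumptions (deterministic round-robin folds, global-rate fallback for unseen keys); it is not Driverless AI's internal code, and `cv_target_encode` is an illustrative name.

```python
from collections import defaultdict

def cv_target_encode(keys, targets, n_folds=3):
    """Out-of-fold mean target per key: a row never sees its own fold."""
    n = len(keys)
    fold_of = [i % n_folds for i in range(n)]  # simple deterministic folds
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        total, seen = 0.0, 0
        # Accumulate statistics from every row outside the current fold.
        for i in range(n):
            if fold_of[i] != fold:
                sums[keys[i]] += targets[i]
                counts[keys[i]] += 1
                total += targets[i]
                seen += 1
        prior = total / seen if seen else 0.0
        # Encode the rows inside the current fold with those statistics.
        for i in range(n):
            if fold_of[i] == fold:
                key = keys[i]
                # Key absent from the other folds: use their overall rate.
                encoded[i] = sums[key] / counts[key] if counts[key] else prior
    return encoded

problem_id = ["p1", "p1", "p1", "p2", "p2", "p3"]
is_class_0 = [1, 0, 1, 0, 0, 1]  # 1 if the row's label is class 0
encoded = cv_target_encode(problem_id, is_class_0)
print(encoded)
```

For a multinomial target, the same routine would be run once per class indicator, which matches the slide's note that the same base feature appears again for other classes.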
The Driverless Approach
Feature #2: deeper interaction
• {16} {CV TE} {problem_id} {0}
• {51} {CV TE} {points * problem_id} {0}
• 51: base feature ID
• CV TE: out-of-sample target encoding result
• points * problem_id: interaction of two different features, both related to the problem; it subdivides the problem further
• 0: again, rate of class 0 as the target
• {3} {max_rating}
• {32} {freq} {last_online_time_seconds * rating}
• {61} {NumToCatTE} {rating * max_rating * points * last_online_time_seconds} {0}
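One way to read the `points * problem_id` notation is as a composite categorical key. The sketch below assumes the `*` means "combine the levels of both columns"; how Driverless AI builds the interaction internally is not stated in the slides, and `interaction_key` is a hypothetical helper.

```python
from collections import Counter

def interaction_key(row, fields):
    """Combine several columns into one categorical key."""
    return tuple(row[f] for f in fields)

rows = [  # made-up rows
    {"problem_id": "p1", "points": 500},
    {"problem_id": "p1", "points": 500},
    {"problem_id": "p1", "points": 1000},
    {"problem_id": "p2", "points": 1000},
]

by_problem = Counter(interaction_key(r, ["problem_id"]) for r in rows)
by_pair = Counter(interaction_key(r, ["points", "problem_id"]) for r in rows)

# The composite key splits the rows into finer groups than problem_id alone.
print(len(by_problem), len(by_pair))
```

The composite key can then be target encoded exactly like a single categorical column.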
The Driverless Approach
Feature #3: no transformation
• {16} {CV TE} {problem_id} {0}
• {51} {CV TE} {points * problem_id} {0}
• {3} {max_rating}
• max rating
• used as is – no alternate encoding; was first natural feature in my model as well
• this is the first variable of the student dimension
• a 4-digit number with close to a normal distribution
• {32} {freq} {last_online_time_seconds * rating}
• {61} {NumToCatTE} {rating * max_rating * points * last_online_time_seconds} {0}
The Driverless Approach
Feature #4: frequency encoding
• {16} {CV TE} {problem_id} {0}
• {51} {CV TE} {points * problem_id} {0}
• {3} {max_rating}
• {32} {freq} {last_online_time_seconds * rating}
• Counts how often each combination of the two fields occurs
• Last online time & rating are both numerics in the student dimension
• {61} {NumToCatTE} {rating * max_rating * points * last_online_time_seconds} {0}
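Frequency encoding of a pair of fields can be sketched as follows. The column names come from the slide; the values and the helper name `frequency_encode` are made up.

```python
from collections import Counter

def frequency_encode(rows, fields):
    """Replace each row's field combination with how often it occurs."""
    keys = [tuple(r[f] for f in fields) for r in rows]
    counts = Counter(keys)
    return [counts[k] for k in keys]

rows = [  # made-up rows
    {"last_online_time_seconds": 1500000000, "rating": 1450},
    {"last_online_time_seconds": 1500000000, "rating": 1450},
    {"last_online_time_seconds": 1500000600, "rating": 1450},
    {"last_online_time_seconds": 1500000000, "rating": 1450},
    {"last_online_time_seconds": 1500000600, "rating": 1720},
]
freq = frequency_encode(rows, ["last_online_time_seconds", "rating"])
print(freq)
```

Because both fields describe the student, the count plausibly acts as a proxy for how many rows a given student has in the data, though the slides do not confirm that interpretation.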
The Driverless Approach
Feature #5: four-way interaction with target encoding
• {16} {CV TE} {problem_id} {0}
• {51} {CV TE} {points * problem_id} {0}
• {3} {max_rating}
• {32} {freq} {last_online_time_seconds * rating}
• {61} {NumToCatTE} {rating * max_rating * points * last_online_time_seconds} {0}
• Target encoding of class 0
• Finds the class 0 rate for each distinct value of the four-way interaction
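The name NumToCatTE suggests that numeric columns are first converted to categories and then target encoded. The sketch below assumes the conversion is binning with fixed cut points; the cut points, data, and helper names are hypothetical, and the actual binning used by Driverless AI is not described in the slides.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical cut points; a real tool would derive them from the data.
CUTS = {
    "rating": [1400, 1600],
    "max_rating": [1500, 1700],
    "points": [750],
    "last_online_time_seconds": [1500000300],
}
FIELDS = list(CUTS)

def to_bin(value, cut_points):
    """Index of the interval that value falls into."""
    return bisect_right(cut_points, value)

def binned_key(row):
    """Four-way interaction of the binned numeric columns."""
    return tuple(to_bin(row[f], CUTS[f]) for f in FIELDS)

def class_rate_by_key(rows, target_class=0):
    """Share of rows with label == target_class for each binned key."""
    hits, counts = defaultdict(int), defaultdict(int)
    for r in rows:
        key = binned_key(r)
        hits[key] += int(r["label"] == target_class)
        counts[key] += 1
    return {key: hits[key] / counts[key] for key in counts}

rows = [  # made-up rows
    {"rating": 1350, "max_rating": 1450, "points": 500,
     "last_online_time_seconds": 1500000000, "label": 0},
    {"rating": 1380, "max_rating": 1490, "points": 500,
     "last_online_time_seconds": 1500000100, "label": 1},
    {"rating": 1650, "max_rating": 1800, "points": 1000,
     "last_online_time_seconds": 1500000600, "label": 0},
]
rates = class_rate_by_key(rows)
print(rates)
```

The rates are computed in-sample here to keep the example short; an out-of-fold scheme would be the safer choice for model training.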
The Driverless Approach
Top 5 Features: did I try?
• YES {16} {CV TE} {problem_id} {0}
• NO {51} {CV TE} {points * problem_id} {0}
• YES {3} {max_rating}
• NO {32} {freq} {last_online_time_seconds * rating}
• NO {61} {NumToCatTE} {rating * max_rating * points * last_online_time_seconds} {0}
Problem #2
Bank Customer Churn
• Identify customers likely to reduce their balances by 50% in the next quarter
• Data
• 300,000 training rows; 200,000 testing rows
• 377 columns
• Customer: age, gender, demographics
• Reported assets, liabilities
• Monthly balance history
The Driver Approach
Exploit before Explore
• 377 columns made [quick] manual investigation harder
• Rather than iterate {analyze > model > analyze > … }, I changed to {model > analyze > model > … }
(Figure annotation: lagged features)
The Driver Approach
After lagging, try differences and ratios
• Lagged features present the balance at several time steps
• But often the interesting part is not the raw balance itself but whether it is growing or shrinking
• Decision trees have a hard time “seeing” this, so it is wise to engineer arithmetic features: + - * /
The Driver Approach
After lagging, try differences and ratios
• I used the leading monthly feature from the model and created new features representing month-over-month differences and a binary indicator
• One field, one specific length (1 month), two calculations
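The two calculations can be sketched as below. The slide does not say what the binary indicator flags, so this sketch assumes it marks a month-over-month decline; the column names `bal_prev1` and `bal_prev2` are placeholders, not the dataset's real field names.

```python
def month_over_month(current, previous):
    """Return (difference, 1 if the balance fell else 0)."""
    diff = current - previous
    # Assumption: the binary indicator flags a shrinking balance.
    dropped = int(diff < 0)
    return diff, dropped

customers = [  # made-up balances for the two most recent months
    {"bal_prev1": 1200.0, "bal_prev2": 1500.0},
    {"bal_prev1": 800.0, "bal_prev2": 650.0},
]
for c in customers:
    c["bal_diff_1m"], c["bal_dropped_1m"] = month_over_month(
        c["bal_prev1"], c["bal_prev2"]
    )

print(customers)
```

A tree can split on `bal_diff_1m` directly, whereas recovering the same signal from the two raw balances would take many splits.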
The Driverless Approach
It knows math!
(Figure annotations: three features labeled "subtraction")
The Driverless Approach
Top 10 Features: divided into categories
• (3) Subtraction: #1, #6, #8
• (1) Truncated SVD components: #2
• (2) Cluster Distances: #3, #9
• (1) Target encoding: #4
• (3) Direct features: #5, #7, #10
The Driverless Approach
Lagged balances also used in clusters & SVD
• Distance to cluster #1 after segmenting columns into 6 clusters
• BAL_prev6
• D_prev1
• D_prev2
• I_AQB_PrevQ1
• Component #1 of truncated SVD of
• D_prev1
• D_prev2
• EOP_prev1_1
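A cluster-distance feature of this kind can be sketched with plain k-means. The example below is only an illustration of the idea: it uses two made-up columns and two clusters rather than the four columns and six clusters on the slide, skips any scaling, and does not cover the truncated SVD component.

```python
import math

def kmeans(points, k, n_iter=10):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[nearest].append(p)
        for j, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def cluster_distance_features(point, centroids):
    """One new feature per cluster: Euclidean distance to its centroid."""
    return [math.dist(point, c) for c in centroids]

# Made-up values standing in for two lagged columns of four customers.
points = [(0, 0), (10, 10), (0, 2), (10, 12)]
centroids = kmeans(points, k=2)
features = cluster_distance_features((0, 1), centroids)
print(centroids)
print(features)
```

Each distance summarizes several lagged columns in a single number, which is one way a model can pick up joint patterns across months.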
The Driverless Approach
Final Analysis
• On the first iteration, Driverless AI had already surpassed my manual modeling
• Features were well beyond what I would have ever attempted
• Accuracy was stable: Driverless AI self-reported scores were within 1% of the competition submission
The Driverless Approach
Competition Results
(Leaderboard annotations: Kaggle Grandmaster; Kaggle Grandmaster; Also Driverless AI; 100% Driverless AI)
The End
Thank You

