Driver vs Driverless AI - Mark Landry, Competitive Data Scientist and Product Manager, H2O.ai

Mark Landry
Data Scientist, H2O.ai
GitHub: mlandry22, Email: mark@h2o.ai

An Analysis of Driverless AI Feature Creation
Driver vs Driverless

Overview
• Analyze two recent problems with an emphasis on feature
generation
• Overview of the problem
• Discussion of my approach
• Show the features Driverless created
• Feature creation
• Feature representation
• Results

Who am I?
• Data scientist @ H2O since 2015, user since 2014
• 15 top 20 finishes in Kaggle, highest rank 33
• R | H2O | data.table | GBM
• The “driver”

Problem #1
How Many Attempts will a Student Make
• Online question/answer platform with computer science
problems
• Predict the number of attempts a particular student will make on
a particular problem
• Data
• Student: level, ranking, highest ranking
• Problem: type, 3-tier level, points awarded
• Training: 124,000 attempt counts
• Testing: 60,000 attempt counts

Regression as Classification
• Natural problem is numerical
• End user prefers buckets
• Volume
• 1: 53%
• 2: 31%
• 3: 9%
• 4: 4%
• 5: 1%
• 6: 2%
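
A minimal sketch of turning the numeric attempt count into a class label, as described above. Folding counts above 6 into the last bucket is an assumption (the slide lists only buckets 1 through 6), and the function names and toy data are hypothetical.

```python
from collections import Counter

def to_bucket(attempts, max_bucket=6):
    """Map a raw attempt count to a class label in 1..max_bucket."""
    if attempts < 1:
        raise ValueError("attempt counts start at 1")
    # Assumption: anything above max_bucket is folded into the last bucket.
    return min(attempts, max_bucket)

def bucket_shares(counts, max_bucket=6):
    """Return the share of rows that fall in each bucket."""
    tally = Counter(to_bucket(c, max_bucket) for c in counts)
    total = len(counts)
    return {b: tally.get(b, 0) / total for b in range(1, max_bucket + 1)}

# Toy data, not the competition data.
shares = bucket_shares([1, 1, 2, 3, 9, 1, 2, 12])
```

With labels in place, the problem can be handed to any multinomial classifier instead of a regressor.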

The Driver Approach
• Think of it like a recommender problem
• Standard: Matrix factorization, collaborative filtering
• GBM: use deep categorical encodings
• Frequent use of target encoding
[Figure: three interactions highlighted]

The Driver Approach
A messy chain of hierarchical target encoding and if/else statements

The Driver Approach
h2o.gbm feature importance – primarily using three target encodings

The Driverless Approach
Top 5 Features: divided into components
• {16} {CV TE} {problem_id} {0}
• {51} {CV TE} {points * problem_id} {0}
• {3} {max_rating}
• {32} {freq} {last_online_time_seconds * rating}
• {61} {NumToCatTE} {rating * max_rating * points *
last_online_time_seconds} {0}

Feature #1: same base target encoding I found to be the best
• 16: indicator of base features – 16 is later used twice more in the top
15
• CV TE: Cross-Fold target encoding
• Problem_id: feature used as basis for target encoding
• 0: the target; multinomial, so the two other uses are for class 1 & 5
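
A sketch of cross-fold (out-of-fold) target encoding, to make the idea concrete. It is not Driverless AI's implementation, which the slides do not describe: the round-robin fold assignment, the `prior_weight` shrinkage, and all names are assumptions for illustration.

```python
from collections import defaultdict

def cv_target_encode(keys, targets, n_folds=5, prior_weight=1.0):
    """Out-of-fold mean of `targets` per key.

    Each row's encoding is computed from rows in the other folds only,
    so a row never sees its own target. Keys with little out-of-fold
    support are shrunk toward the out-of-fold global mean.
    """
    n = len(keys)
    fold_of = [i % n_folds for i in range(n)]  # simple deterministic folds
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        total, seen = 0.0, 0
        # Accumulate statistics from every row outside the current fold.
        for i in range(n):
            if fold_of[i] != fold:
                sums[keys[i]] += targets[i]
                counts[keys[i]] += 1
                total += targets[i]
                seen += 1
        global_mean = total / seen if seen else 0.0
        # Encode the held-out rows with those statistics.
        for i in range(n):
            if fold_of[i] == fold:
                k = keys[i]
                encoded[i] = (sums[k] + prior_weight * global_mean) / (counts[k] + prior_weight)
    return encoded

# Toy example: 1 marks rows whose label is class 0, per problem_id.
problem_id = ["p1", "p1", "p2", "p2", "p1", "p2"]
is_class0 = [1, 1, 0, 0, 1, 0]
te = cv_target_encode(problem_id, is_class0, n_folds=2)
```

An interaction can be encoded the same way by passing tuples such as `(points, problem_id)` as the keys.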

Feature #2: deeper interaction
• 51: base feature ID
• CV TE: out-of-sample target encoding result
• points * problem: interaction of two different features, both related to
the problem; it is subdividing the problem further
• 0: again, rate of class 0 as the target

Feature #3: no transformation
• max rating
• used as is – no alternate encoding; was first natural feature in my model as well
• this is the first variable of the student dimension
• a 4-digit number with close to a normal distribution

Feature #4: frequency encoding
• Counting the occurrences of two fields
• Last online & rating are both numerics in the student dimension
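
A sketch of frequency encoding over a pair of fields: each row gets the number of rows that share its combination of values. Whether Driverless AI uses raw counts or a normalized rate is not stated in the slides, so raw counts here are an assumption; the function name and toy values are hypothetical.

```python
from collections import Counter

def frequency_encode(col_a, col_b):
    """Replace each (a, b) pair with the number of rows sharing that pair."""
    pairs = list(zip(col_a, col_b))
    counts = Counter(pairs)
    return [counts[p] for p in pairs]

# Toy values standing in for last_online_time_seconds and rating.
last_online = [100, 100, 250, 100]
rating = [1500, 1500, 1700, 1400]
freq = frequency_encode(last_online, rating)
```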

Feature #5: four-way interaction w/ target encoding
• Target encoding of class 0
• Finding the rate for each value of the result of a four-way interaction
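
A sketch of how numeric columns can be turned into one categorical interaction key before target encoding. Quantile binning, the bin count, and every name below are assumptions for illustration; the slides do not say how NumToCatTE bins its inputs.

```python
from bisect import bisect_right

def quantile_edges(values, n_bins):
    """Interior cut points that split `values` into roughly equal-sized bins."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * q) // n_bins] for q in range(1, n_bins)]

def to_bins(values, edges):
    """Map each numeric value to the index of the bin it falls in."""
    return [bisect_right(edges, v) for v in values]

def interaction_keys(*binned_columns):
    """Combine per-column bin indices into one categorical key per row."""
    return list(zip(*binned_columns))

# Toy values standing in for two of the four numeric fields.
rating = [1200, 1500, 1800, 2100]
points = [10, 10, 20, 20]
rating_bins = to_bins(rating, quantile_edges(rating, 2))
points_bins = to_bins(points, quantile_edges(points, 2))
keys = interaction_keys(rating_bins, points_bins)
```

The resulting keys behave like any other categorical, so the class 0 rate can be computed per key.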

Top 5 Features: did I try?
• YES {16} {CV TE} {problem_id} {0}
• NO {51} {CV TE} {points * problem_id} {0}
• YES {3} {max_rating}
• NO {32} {freq} {last_online_time_seconds * rating}
• NO {61} {NumToCatTE} {rating * max_rating * points *
last_online_time_seconds} {0}

Problem #2
Bank Customer Churn
• Identify customers whose balances are likely to drop by 50% in
the next quarter
• Data
• 300,000 training rows; 200,000 testing rows
• 377 columns
• Customer: age, gender, demographics
• Reported assets, liabilities
• Monthly balance history

The Driver Approach
Exploit before Explore
• 377 columns made [quick] manual investigation harder
• Rather than iterate: {analyze > model > analyze > … },
I changed to {model > analyze > model > … }
[Figure: lagged features highlighted]

The Driver Approach
After lagging, try differences and ratios
• Lagging features present the balance features at
several time steps.
• But often, the interesting part is not the raw balance
itself, but whether it is growing or shrinking
• Decision trees have a hard time “seeing” this so it is
wise to engineer mathematical features: + - * /

The Driver Approach
After lagging, try differences and ratios
• I used the leading monthly feature from the model and
created new features representing month-over-month
differences and a binary indicator
• One field, one specific length (1 month), two calculations
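
A sketch of the month-over-month calculations on a single balance field. The definition of the binary indicator (1 when the balance grew) is an assumption, since the slides do not define it; the ratio is included because the slide title mentions ratios. Column names and values are hypothetical.

```python
def month_over_month(current, previous):
    """Return (difference, ratio, grew_flag) for two monthly balances."""
    diff = current - previous
    # Guard against dividing by a zero balance; None marks "undefined".
    ratio = current / previous if previous != 0 else None
    # Assumption: the binary indicator flags a growing balance.
    grew = 1 if diff > 0 else 0
    return diff, ratio, grew

# Hypothetical rows: balances one and two months back.
bal_prev1 = [1200.0, 300.0, 0.0]
bal_prev2 = [1000.0, 600.0, 0.0]
features = [month_over_month(c, p) for c, p in zip(bal_prev1, bal_prev2)]
```

A tree model would need several splits on both raw columns to approximate what the single difference column states directly.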

It knows math!
[Figure: three subtraction features highlighted]

Top 10 Features: divided into categories
• (3) Subtraction: #1, #6, #8
• (1) Truncated SVD components: #2
• (2) Cluster Distances: #3, #9
• (1) Target encoding: #4
• (3) Direct features: #5, #7, #10

Lagged balances also used in clusters & SVD
• Distance to cluster #1 after segmenting columns into 6 clusters
• BAL_prev6
• D_prev1
• D_prev2
• I_AQB_PrevQ1
• Component #1 of truncated SVD of
• D_prev1
• D_prev2
• EOP_prev1_1
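
A sketch of a cluster-distance feature: the distance from each row to a cluster centroid, computed over a chosen subset of columns. The centroids are assumed to have been fitted elsewhere (for example with k-means), and Euclidean distance is an assumption; the slides do not state how Driverless AI fits or scales the clusters.

```python
import math

def distance_to_centroid(row, centroid):
    """Euclidean distance between one row and one cluster centroid."""
    return math.sqrt(sum((x - c) ** 2 for x, c in zip(row, centroid)))

def cluster_distance_features(rows, centroids):
    """One distance feature per centroid, for every row."""
    return [[distance_to_centroid(r, c) for c in centroids] for r in rows]

# Toy rows over two columns and two hypothetical centroids.
rows = [[0.0, 0.0], [3.0, 4.0]]
centroids = [[0.0, 0.0], [3.0, 0.0]]
distances = cluster_distance_features(rows, centroids)
```

Each column of `distances` corresponds to one cluster, so "distance to cluster #1" is the first entry of every row.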

Final Analysis
• On first iteration, Driverless AI had surpassed my manual
modeling
• Features were well beyond what I would have ever attempted
• Accuracy was stable: Driverless AI self-reported scores within
1% of competition submission

Competition Results
Kaggle Grandmaster
Kaggle Grandmaster
Also Driverless AI
100% Driverless AI
