A Practical Enterprise Feature Store on Delta Lake

A Practical Feature
Store on Delta Lake
Nathan Buesgens
ML Operations
Bryan Christian
Data Science

Agenda
§ What is a Feature Store?
▪ MLOps for Acceleration and
Governance in the Enterprise
▪ Feature Store: Use Cases
▪ Edge Cases: 80/20
▪ Relation to the Data Warehouse
§ Design Reference
▪ Logical Data Model & Access
Patterns
▪ Physical Representation in the Delta
Lake

75%
Reduction in Feature Engineering
“Data Wrangling” Time
15X
Accelerated Model Delivery
with MLOps Automation and
Governance
END-TO-END VALUE DELIVERY
TIME TO VALUE & CONCURRENCY
SCALABLE INFRASTRUCTURE
I.E. AVOID:
“PROOF OF CONCEPT FACTORY”
MLOps: Data Science at Scale

BOTTLENECK
Feature
Engineering
Modelling
The feature store serves as the
consumption layer for ML
applications. It provides:
• Acceleration: pre-“hardened”
features reduce data wrangling
time for the Data Scientist.
• Governance: a common
consumption pattern ensures
nothing is lost in the translation
to production.
[Diagram labels: Curated Data, Feature Engineering, Modelling, Feature Store, Predictions]
Example: Feature Store
Infrastructure to support DS + MLE

The Feature Store is built on the following data science requirements that are relevant to predictive
analytics in Financial Services use cases.
Correct and consistently applied
joins across multiple Delta
files without loss of processing
speed
Aggregations, window functions,
and transformations of data
Granularity of point in time and
level of the prediction (e.g.
individual, account, etc.)
customer_id as_of feature_name_last_0-30_days_prior feature_name_last_31-60_days_prior feature_name_next_1-30_days
12345 2021-05-01 0.43 0.32 0.21
23456 2021-05-01 0.99 0.94 0.98
34567 2021-05-01 0.03 0.92 0.13
45678 2021-05-01 0.42 0.59 0.50
The Feature Store uses the “as_of” date as the point-in-time granularity for both backward- and forward-facing
windows. Code-embedded metadata makes it easy to exclude forward-facing windows from the
“independent” variables, which prevents feature leakage.
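As a minimal sketch of how metadata could separate forward-facing windows from the independent variables: the metadata layout and the "window" tag below are assumptions made for illustration, not the feature store's actual schema.

```python
# Hypothetical metadata: each feature records whether its window looks
# backwards ("past") or forwards ("future") relative to the as_of date.
FEATURE_METADATA = {
    "feature_name_last_0-30_days_prior": {"window": "past"},
    "feature_name_last_31-60_days_prior": {"window": "past"},
    "feature_name_next_1-30_days": {"window": "future"},
}


def split_by_window(metadata):
    """Return (independent, future_facing) lists of feature names."""
    independent = [n for n, m in metadata.items() if m["window"] == "past"]
    future_facing = [n for n, m in metadata.items() if m["window"] == "future"]
    return independent, future_facing


independent, future_facing = split_by_window(FEATURE_METADATA)
print("independent variables:", independent)
print("candidate targets:", future_facing)
```

Only the backward-facing columns would be passed to a model as inputs; the forward-facing ones remain available as targets.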
Data Science Use Cases

§ Many ML use cases don’t have an
online requirement, esp.
“Human + AI”
§ Extending the MVP:
▪ Some online use cases can be
reframed as streaming use cases.
▪ Online use cases can be met with
extension to the Delta Lake design.
▪ See: feast.dev
§ Low-code & citizen science expands
user base, doesn’t necessarily
accelerate existing users.
§ 80/20 value from:
Optimizing Access vs. Optimizing
ETL Development
“Online” Features
Ultra-Low-Latency, Ultra-Timely Point Reads
Low-Code ETL
Configuration Based, AutoML, FeatureFlow, etc.
Edge Cases
Opportunities to Simplify for an 80/20 Feature Store MVP

▪ “Golden” aggregates of curated data.
▪ Highly structured, well-defined
granularities (esp. as 80/20 solution).
▪ Similar non-functional requirements for
strong governance standards, metadata
management, discovery, etc.
▪ Different Use Case: BI vs. Modelling
▪ Different Access Patterns, therefore:
▪ Different Data Model
▪ Different Technology Stack
▪ Supervised learning creates complex
requirements for:
“point in time accurate data”
• Differences
• Similarities
Comparison with Data Warehouse
i.e. Dimensional Model

▪ Window functions
▪ Watermark
▪ Feature leakage
Point in Time Accurate Data
Three Ways Inconsistency Sneaks In

Structured Streaming Programming Guide

§ The thing being modelled.
The “Entity”
Term borrowed from Feast
Granularity
“As of”
Every feature for an entity “as of” a date.
Columns
§ Discrete granularity (daily, hourly, etc.), not an
“event time”.
§ 80/20 solution.
§ For “continuous” granularity see: Feast.
Features
Un-vectorized (80/20)
Targets
Necessarily at same granularity as features.
Predictions
One model’s prediction is often another’s feature.
Feature Store Logical Model
Data Model for Feature Store Access
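The logical model above (entity, “as of” date, features, targets) can be sketched as one row built from raw events. This is an illustration only: the event data, the column names, and the choice to include the as_of day in the backward window are all assumptions.

```python
from datetime import date, timedelta

# Hypothetical raw events: (customer_id, event_date, amount).
EVENTS = [
    (12345, date(2021, 4, 10), 20.0),
    (12345, date(2021, 4, 25), 30.0),
    (12345, date(2021, 5, 1), 5.0),
    (12345, date(2021, 5, 15), 40.0),
    (23456, date(2021, 3, 1), 99.0),
]


def window_sum(events, customer_id, start, end):
    """Sum amounts for one customer with start <= event_date <= end."""
    return sum(a for c, d, a in events if c == customer_id and start <= d <= end)


def build_row(events, customer_id, as_of):
    """One logical row: entity key, as_of date, one feature, one target."""
    return {
        "customer_id": customer_id,
        "as_of": as_of,
        # Backward-facing window: the 30 days up to and including as_of.
        "spend_last_0-30_days_prior": window_sum(
            events, customer_id, as_of - timedelta(days=30), as_of
        ),
        # Forward-facing window: starts the day after as_of, so it never
        # overlaps the feature window.
        "spend_next_1-30_days": window_sum(
            events, customer_id, as_of + timedelta(days=1), as_of + timedelta(days=30)
        ),
    }


row = build_row(EVENTS, 12345, date(2021, 5, 1))
print(row)
```

Because the feature and the target share the same entity key and as_of date, they are necessarily at the same granularity.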

No need to rebuild the whole
feature store when new features
are added.
(Certain sets of features might be rebuilt
at times, though with much
shorter downtime.)
The SDK indexes the available features and, upon request, builds the joins that combine all desired features
into one cohesive data frame, providing a production-grade feature selection tool.
Keyword searching enabled for
features so you can find any
feature you're looking for using
"human" logic
Tuning can be specific to each set
of features allowing more optimal
feature creation.
find(): Search through all columns and metadata for the features you want to use by giving keys, keywords or regex.
select(): When you know exactly the features you want.
select_by(): Select columns and return a dataframe you want to use by giving a date, keys, keywords or regex.
Core Functionality
SDK for Feature Store
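A rough sketch of the indexing-and-joining idea, using plain dictionaries in place of Delta tables. The table names, the index structure, and this `select` helper are hypothetical and are not the SDK's real implementation.

```python
# Hypothetical feature tables, each keyed on (customer_id, as_of).
TABLES = {
    "table_a": {
        (12345, "2021-05-01"): {"feature_name_1": 0.43},
        (23456, "2021-05-01"): {"feature_name_1": 0.99},
    },
    "table_b": {
        (12345, "2021-05-01"): {"feature_name_qwerty_1": 0.32},
        (23456, "2021-05-01"): {"feature_name_qwerty_1": 0.94},
    },
}


def build_index(tables):
    """Map each feature name to the table that holds it."""
    index = {}
    for table_name, rows in tables.items():
        for values in rows.values():
            for feature in values:
                index[feature] = table_name
    return index


def select(tables, index, features):
    """Inner-join only the tables needed for the requested features."""
    if not features:
        return []
    needed = sorted({index[f] for f in features})
    keys = set.intersection(*(set(tables[t]) for t in needed))
    result = []
    for key in sorted(keys):
        row = {"customer_id": key[0], "as_of": key[1]}
        for f in features:
            row[f] = tables[index[f]][key][f]
        result.append(row)
    return result


index = build_index(TABLES)
rows = select(TABLES, index, ["feature_name_1", "feature_name_qwerty_1"])
print(rows)
```

Adding a new feature table only adds entries to the index, which is one way to read the claim that the whole store need not be rebuilt when features are added.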

find(): Search through all columns and metadata for the features you want to use by giving keys, keywords or regex.
Arguments
regexp: A regular expression.
kwrds: A list of key words to look for.
keys: A dictionary (str to any) pointing to tags in the metadata of features, e.g. {"model_output": True}.
kwrds_exclude: A list of words to exclude from the search.
partial: If kwrds is used, this decides if it should find all or any of them when searching.
partial_exclude: If kwrds_exclude is used, this decides if it will exclude all or any of them when searching.
verbose: If True, prints out results; otherwise just returns them.
case_sensitive: If True, an exact match is required to return results.
Example
Calling the feature store with “fs”, a command could be:
fs.find(regexp="^(?=.*asdf)(?=.*qwerty).+")
With a returned result of…
Your search returned 20 results…
feature_name_1: {'comment': 'Flag if asdf > 0.3 at any point within the last 3 months.'}
feature_name_qwerty_1: {'comment': 'Average number of widgets customer purchased in the last 0-1 months.'}
...
The find method searches through all features given a set of criteria and returns any matches within the name or metadata
of columns. It is a great tool to explore the data without pulling in massive datasets
Value to Data Scientist
Explore what features are in
the feature store via metadata
and leverage metadata to
enforce governance (e.g., no
PI, 3rd party data, etc. as
needed)
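The example pattern uses two lookaheads, so it matches only text containing both "asdf" and "qwerty", in any order. The sketch below shows that behaviour with Python's `re` module. The catalogue contents and this standalone `find` function are hypothetical, and searching the name and comment as one string is an assumption about how the metadata search could work.

```python
import re

# Hypothetical catalogue: feature name -> metadata.
CATALOGUE = {
    "feature_asdf_qwerty_1": {"comment": "Example feature."},
    "feature_asdf_1": {"comment": "Mentions qwerty only in the comment."},
    "feature_other_1": {"comment": "Unrelated."},
}


def find(catalogue, regexp, case_sensitive=False):
    """Return entries whose name or comment matches the regular expression."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile(regexp, flags)
    matches = {}
    for name, meta in catalogue.items():
        # Search the name and the comment together (an assumption).
        haystack = f"{name} {meta.get('comment', '')}"
        if pattern.search(haystack):
            matches[name] = meta
    return matches


results = find(CATALOGUE, r"^(?=.*asdf)(?=.*qwerty).+")
for name, meta in results.items():
    print(f"{name}: {meta}")
```

Only metadata is scanned here, which is consistent with the point that exploration does not require pulling in the underlying datasets.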

select(): When you know exactly the features you want.
The select method will return a dataframe of all selected features with the given date.
Arguments
date: Return features for a specific date, or use "latest" to return the last updated feature date. For specific dates, pass a dictionary with an operator and a date, e.g. {">": "2021-05-01"}.
*features: Feature names as strings.
Example
dataframe_name = fs.select(
    "latest",  # Give a date {"=": "2021-05-01"} or "latest" for the newest available features
    "feature_name_last_0-30_days_prior",  # List the features you want
    "feature_name_last_31-60_days_prior",
    "feature_name_next_1-30_days",
)
display(dataframe_name)
customer_id as_of feature_name_last_0-30_days_prior feature_name_last_31-60_days_prior feature_name_next_1-30_days
12345 2021-05-01 0.43 0.32 0.21
23456 2021-05-01 0.99 0.94 0.98
Consistent way of selecting the
same feature set from the feature
store when creating a dataframe,
in dev and when deployed in production
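One way the date argument could be interpreted is sketched below. Only "latest" and the "=" and ">" operators appear in the slides; the other operators and the `resolve_dates` helper are assumptions added for illustration.

```python
import operator
from datetime import date

# Comparison operators accepted in the date dictionary (">=", "<", "<=" are assumed).
OPERATORS = {
    "=": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def resolve_dates(available, spec):
    """Return the as_of dates selected by "latest" or an {operator: date} dict."""
    if spec == "latest":
        return [max(available)]
    # The dictionary is expected to hold exactly one operator/date pair.
    ((op, value),) = spec.items()
    target = date.fromisoformat(value)
    return sorted(d for d in available if OPERATORS[op](d, target))


AVAILABLE = [date(2021, 4, 1), date(2021, 5, 1), date(2021, 6, 1)]
print(resolve_dates(AVAILABLE, "latest"))
print(resolve_dates(AVAILABLE, {">": "2021-05-01"}))
print(resolve_dates(AVAILABLE, {"=": "2021-05-01"}))
```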

customer_id as_of feature_name_1 feature_name_qwerty_1 …
12345 2021-05-01 0.43 0.32 …
23456 2021-05-01 0.99 0.94 …
select_by(): Select columns and return a dataframe you want to use by giving a date, keys, keywords or regex.
Arguments
date: Return features for a specific date, or use "latest" to return the last updated feature date. For specific dates, pass a dictionary with an operator and a date, e.g. {">": "2021-05-01"}.
regexp: A regular expression.
kwrds: A list of key words to look for.
keys: A dictionary (str to any) pointing to tags in the metadata of features, e.g. {"model_output": True}.
kwrds_exclude: A list of words to exclude from the search.
partial: If kwrds is used, this decides if it should find all or any of them when searching.
partial_exclude: If kwrds_exclude is used, this decides if it will exclude all or any of them when searching.
case_sensitive: If True, an exact match is required to return results.
Example
dataframe_name = fs.select_by({"=": "2021-05-01"},
                              regexp="^(?=.*asdf)(?=.*qwerty).+")
display(dataframe_name)
The select_by method searches through all features given a set of criteria and returns a dataframe including all the
features that match the criteria within the name or metadata.
Consistent way of exploring
the feature store and
leveraging metadata for
selection while simultaneously
creating a dataframe with the
selected features
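The keyword arguments can be read as any/all filters over feature names and metadata. The slides do not say which value of partial means "any" and which means "all", so the sketch below assumes True means "any"; the `matches_keywords` helper and its default values are hypothetical.

```python
def matches_keywords(text, kwrds=(), kwrds_exclude=(), partial=False, partial_exclude=True):
    """Decide whether text passes the include and exclude keyword filters.

    Assumption: partial=True means "any keyword is enough" and
    partial=False means "all keywords are required"; the same reading
    applies to partial_exclude for the exclusion list.
    """
    text = text.lower()
    include_hits = [k.lower() in text for k in kwrds]
    exclude_hits = [k.lower() in text for k in kwrds_exclude]
    if kwrds:
        included = any(include_hits) if partial else all(include_hits)
    else:
        included = True  # no include filter given
    if kwrds_exclude:
        excluded = any(exclude_hits) if partial_exclude else all(exclude_hits)
    else:
        excluded = False  # no exclude filter given
    return included and not excluded


name = "feature_name_qwerty_1"
print(matches_keywords(name, kwrds=["qwerty", "asdf"]))                # all required
print(matches_keywords(name, kwrds=["qwerty", "asdf"], partial=True))  # any is enough
print(matches_keywords(name, kwrds=["qwerty"], kwrds_exclude=["name"]))
```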

[Diagram: The Delta Lake. Bronze → (ETL) → Silver → (ETL) → Gold. Consumption layers: BI Consumption (Dimensional Model) and ML Consumption (Feature Store). Optional Consumption Optimized Databases, each fed by a Mirror: Low Latency Memory Cache and High Concurrency Data Warehouse.]
Implementation on the Data Lake

[Diagram: The Delta Lake. Bronze → (ETL) → Silver → (ETL) → ML Consumption: Feature Store. Optional Consumption Optimized Database, fed by a Mirror: Low Latency Memory Cache.]
SDK (Data Access Layer)
• Consistent view of “online” and “historic” features.
• Separation of logical and physical models.
• Metadata focused query interface for data science
exploration.
Historic Feature
Queries
Online Point
Reads
Implementation on the Data Lake

§ Simplifies “point in time joins”.
§ Not as flexible or timely.
Pre-defined time aggregations
“As Of” Granularity
“Dynamic Point in Time Joins”
Demonstrated by Feast
More flexible, improved timeliness.
Multiple feature tables
Technically possible to use a single wide table.
§ Simplifies:
▪ Schema Migration
▪ Query Planning & Optimization
▪ Scheduling
Physical Feature Tables
Two Choices
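To make the contrast concrete: with “as of” granularity, a lookup is an equality match on the as_of date, whereas a dynamic point-in-time join has to find the latest feature value at or before an arbitrary timestamp. The sketch below shows the latter for a single entity using a binary search. The data and the `value_as_of` helper are hypothetical, and this is not Feast's implementation.

```python
from bisect import bisect_right

# Hypothetical feature history for one entity: (timestamp, value),
# sorted by timestamp. ISO-8601 strings in one format sort chronologically.
FEATURE_HISTORY = [
    ("2021-04-01T00:00", 0.32),
    ("2021-04-15T12:00", 0.43),
    ("2021-05-03T09:30", 0.51),
]


def value_as_of(history, ts):
    """Latest feature value with timestamp <= ts, or None if none exists."""
    timestamps = [t for t, _ in history]
    i = bisect_right(timestamps, ts)
    return history[i - 1][1] if i else None


print(value_as_of(FEATURE_HISTORY, "2021-05-01T00:00"))
print(value_as_of(FEATURE_HISTORY, "2021-03-01T00:00"))
```

Values recorded after the requested timestamp are never returned, which is what keeps the join point-in-time accurate.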

Summary
1
Feature stores accelerate data science & enable
better governance.
2
Most design complexity stems from machine
learning requirements for point in time accurate data.
3
80/20 solutions possible by carefully considering
“online” requirements.

