A Practical Feature
Store on Delta Lake
Nathan Buesgens
ML Operations
Bryan Christian
Data Science
Agenda
§ What is a Feature Store?
▪ MLOps for Acceleration and
Governance in the Enterprise
▪ Feature Store: Use Cases
▪ Edge Cases: 80/20
▪ Relation to the Data Warehouse
§ Design Reference
▪ Logical Data Model & Access
Patterns
▪ Physical Representation in the Delta
Lake
What is a Feature Store?
75%
Reduction in Feature Engineering
“Data Wrangling” Time
15X
Accelerated Model Delivery
with MLOps Automation and
Governance
END-TO-END VALUE DELIVERY
TIME TO VALUE & CONCURRENCY
SCALABLE INFRASTRUCTURE
I.E. AVOID:
“PROOF OF CONCEPT FACTORY”
MLOps: Data Science at Scale
BOTTLENECK
Feature
Engineering
Modelling
The feature store serves as the
consumption layer for ML
applications. It provides:
• Acceleration: pre-“hardened”
features reduce data wrangling
time for the Data Scientist.
• Governance: a common
consumption pattern ensures
nothing is lost in the translation
to production.
[Diagram labels: Curated Data; Feature Engineering and Modelling, repeated per model; Feature Store feeding multiple Modelling steps; Predictions]
Example: Feature Store
Infrastructure to support DS + MLE
The Feature Store is built on the following data science requirements that are relevant to predictive
analytics in Financial Services use cases.
Correct and consistently applied
joins across multiple Delta
files without loss of processing
speed
Aggregations, window functions,
and transformations of data
Granularity of point in time and
level of the prediction (e.g.
individual, account, etc.)
customer_id as_of feature_name_last_0-30_days_prior feature_name_last_31-60_days_prior feature_name_next_1-30_days
12345 2021-05-01 0.43 0.32 0.21
23456 2021-05-01 0.99 0.94 0.98
34567 2021-05-01 0.03 0.92 0.13
45678 2021-05-01 0.42 0.59 0.50
The Feature Store uses “as_of” date for the point in time granularity for both backwards- and forward-
facing windows. Code-embedded metadata allows easy removal of future facing windows as
“independent” variables to prevent feature leakage.
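The leakage guard described above can be sketched as follows. This is a minimal illustration, not the presenters' SDK: the FEATURE_METADATA dictionary, its "forward_facing" tag, and the two helper functions are hypothetical names, assuming each column carries a metadata flag that marks windows looking past the as_of date.

```python
# Hypothetical per-column metadata; the "forward_facing" tag is an assumption.
FEATURE_METADATA = {
    "feature_name_last_0-30_days_prior": {"forward_facing": False},
    "feature_name_last_31-60_days_prior": {"forward_facing": False},
    "feature_name_next_1-30_days": {"forward_facing": True},
}


def independent_variables(columns, metadata=FEATURE_METADATA):
    """Return only the columns that are safe to use as model inputs.

    Columns tagged as forward-facing describe the period after "as_of",
    so they may serve as targets but must never be used as features.
    """
    return [c for c in columns if not metadata[c]["forward_facing"]]


def targets(columns, metadata=FEATURE_METADATA):
    """Return the forward-facing columns, which are candidates for targets."""
    return [c for c in columns if metadata[c]["forward_facing"]]


# One row from the example table above.
row = {
    "customer_id": 12345,
    "as_of": "2021-05-01",
    "feature_name_last_0-30_days_prior": 0.43,
    "feature_name_last_31-60_days_prior": 0.32,
    "feature_name_next_1-30_days": 0.21,
}
feature_columns = [c for c in row if c in FEATURE_METADATA]
X_columns = independent_variables(feature_columns)
y_columns = targets(feature_columns)
```

Keeping the flag next to the column definition means the split between inputs and targets is derived from metadata, not from a hand-maintained exclusion list.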
Data Science Use Cases
§ Many ML use cases don’t have an
online requirement, esp.
“Human + AI”.
§ Extending the MVP:
▪ Some online use cases can be
reframed as streaming use cases.
▪ Online use cases can be met with
extension to the Delta Lake design.
▪ See: feast.dev
§ Low-code & citizen science expands
user base, doesn’t necessarily
accelerate existing users.
§ 80/20 value from:
Optimizing Access vs. Optimizing
ETL Development
“Online” Features
Ultra-Low-Latency, Ultra-Timely Point Reads
Low-Code ETL
Configuration Based, AutoML, FeatureFlow, etc.
Edge Cases
Opportunities to Simplify for an 80/20 Feature Store MVP
▪ “Golden” aggregates of curated data.
▪ Highly structured, well-defined
granularities (esp. as 80/20 solution).
▪ Similar non-functional requirements for
strong governance standards, metadata
management, discovery, etc.
▪ Different Use Case: BI vs. Modelling
▪ Different Access Patterns, therefore:
▪ Different Data Model
▪ Different Technology Stack
▪ Supervised learning creates complex
requirements for:
“point in time accurate data”
• Differences
• Similarities
Comparison with Data Warehouse
i.e. Dimensional Model
Design
WINDOW FUNCTIONS
WATERMARK
1
2
3
FEATURE LEAKAGE
Point in Time Accurate Data
Three Ways Inconsistency Sneaks In
Structured Streaming Programming Guide
§ The thing being modelled.
The “Entity”
Term borrowed from Feast
Granularity
“As of”
Every feature for an entity “as of” a date.
Columns
§ Discrete granularity (daily, hourly, etc.), not an
“event time”.
§ 80/20 solution.
§ For “continuous” granularity see: Feast.
Features
Un-vectorized (80/20)
Targets
Necessarily at same granularity as features.
Predictions
One model’s prediction is often another’s feature.
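As an illustration of the discrete "as of" granularity, the sketch below computes backward-looking window aggregates relative to an as_of date. It is a plain-Python example under stated assumptions: the window_sum helper, the event data, and the inclusive day boundaries are hypothetical and not taken from the presenters' implementation.

```python
from datetime import date, timedelta


def window_sum(events, as_of, start_days, end_days):
    """Sum event amounts that fall start_days..end_days before `as_of`.

    `events` is a list of (event_date, amount) pairs. Both window ends are
    inclusive, and with non-negative offsets the window never extends past
    `as_of`, so the aggregate only uses information known on that date.
    """
    newest = as_of - timedelta(days=start_days)
    oldest = as_of - timedelta(days=end_days)
    return sum(amount for day, amount in events if oldest <= day <= newest)


# Hypothetical raw events for one entity.
events = [
    (date(2021, 4, 25), 10.0),
    (date(2021, 4, 2), 5.0),
    (date(2021, 3, 10), 7.0),
]
as_of = date(2021, 5, 1)
last_0_30 = window_sum(events, as_of, 0, 30)    # 2021-04-01 .. 2021-05-01
last_31_60 = window_sum(events, as_of, 31, 60)  # 2021-03-02 .. 2021-03-31
```

Each (entity, as_of) pair then gets one row holding these aggregates, which is what makes the later joins simple equality joins on the two key columns.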
Feature Store Logical Model
Data Model for Feature Store Access
No need to rebuild the whole
feature store when new features
are added.
(Certain sets of features might be rebuilt
at times, though they will have much
shorter downtime.)
The SDK indexes the available features and upon request builds the joins to combine all desired features
into one cohesive data frame to provide a production grade feature selection tool.
Keyword searching enabled for
features so you can find any
feature you're looking for using
"human" logic
Tuning can be specific to each set
of features allowing more optimal
feature creation.
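The join-building behaviour described above could look roughly like the sketch below. It is not the SDK itself: combine_features, the table names, and the inner-join semantics are assumptions, and rows are plain dictionaries in place of Spark or Delta tables.

```python
def combine_features(tables, requested, keys=("customer_id", "as_of")):
    """Join only the tables that hold requested features into one row set.

    `tables` maps a table name to a list of row dicts. Every row carries
    the key columns plus that table's feature columns.
    """
    combined = {}
    for rows in tables.values():
        wanted = [c for c in requested if rows and c in rows[0]]
        if not wanted:
            continue  # this table contributes nothing, so it is skipped
        for row in rows:
            key = tuple(row[k] for k in keys)
            target = combined.setdefault(key, dict(zip(keys, key)))
            for column in wanted:
                target[column] = row[column]
    # Inner-join semantics: keep keys that received every requested feature.
    return [r for r in combined.values() if all(c in r for c in requested)]


# Hypothetical physical feature tables sharing the same keys.
tables = {
    "table_a": [
        {"customer_id": 12345, "as_of": "2021-05-01", "feature_name_1": 0.43},
        {"customer_id": 23456, "as_of": "2021-05-01", "feature_name_1": 0.99},
    ],
    "table_b": [
        {"customer_id": 12345, "as_of": "2021-05-01", "feature_name_qwerty_1": 0.32},
    ],
}
frame = combine_features(tables, ["feature_name_1", "feature_name_qwerty_1"])
```

Because every table shares the same keys, adding a new feature table changes only the index of available features and leaves existing tables untouched.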
find(): To search through all columns and metadata for the features you want to use by giving keys, keywords or regex.
select(): When you know exactly the features you want.
select_by(): Selecting columns and returning a dataframe you want to use by giving a date, keys, keywords or regex.
Core Functionality
SDK for Feature Store
find() To search through all columns and metadata for the features you want to use by giving keys, keywords or regex.
Arguments
regexp: A regular expression
kwrds: A list of key words to look for
keys: A dictionary of str, any pointing to tags in the metadata of features, i.e. {"model_output": True}
kwrds_exclude: A list of words to exclude from search
partial: If kwrds is used, this decides if it should find all or any of them when searching.
partial_exclude: If kwrds_exclude is used, this decides if it will exclude all or any of them when searching.
verbose: If True, prints out results; otherwise just returns them.
case_sensitive: If True, an exact match is required to return results.
fs.find(regexp="^(?=.*asdf)(?=.*qwerty).+")
Your search returned 20 results…
feature_name_1: {'comment': 'Flag if asdf > 0.3 at any point within the last 3 months.'}
feature_name_qwerty_1: {'comment': 'Average number of widgets customer purchased in the last 0-1 months.'}
...
Example
Calling the feature store with “fs”, a command could be:
With a returned result of…
The find method searches through all features given a set of criteria and returns any matches within the name or metadata
of columns. It is a great tool to explore the data without pulling in massive datasets
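A metadata search of this kind might be sketched as below. The function is a simplified stand-in, not the SDK's find(): the FEATURES index and its third entry are hypothetical, only a subset of the arguments is covered, and two readings are assumptions: partial=True is taken to mean "match any keyword", and case_sensitive is taken to control case-sensitive matching.

```python
import re

# Hypothetical feature index: name -> metadata.
FEATURES = {
    "feature_name_1": {
        "comment": "Flag if asdf > 0.3 at any point within the last 3 months."},
    "feature_name_qwerty_1": {
        "comment": "Average number of widgets customer purchased in the last 0-1 months."},
    "feature_asdf_qwerty_score": {
        "comment": "Hypothetical model score.", "model_output": True},
}


def find(features, regexp=None, kwrds=None, keys=None,
         partial=False, case_sensitive=False):
    """Return features whose name or metadata match every given criterion."""
    flags = 0 if case_sensitive else re.IGNORECASE
    results = {}
    for name, meta in features.items():
        # Search the name and all metadata values as one string.
        text = name + " " + " ".join(str(v) for v in meta.values())
        if regexp and not re.search(regexp, text, flags):
            continue
        if kwrds:
            hits = [re.search(re.escape(k), text, flags) is not None
                    for k in kwrds]
            if not (any(hits) if partial else all(hits)):
                continue
        if keys and any(meta.get(k) != v for k, v in keys.items()):
            continue
        results[name] = meta
    return results


any_keyword = find(FEATURES, kwrds=["asdf", "qwerty"], partial=True)
all_keywords = find(FEATURES, kwrds=["asdf", "qwerty"])
model_outputs = find(FEATURES, keys={"model_output": True})
```

Only names and metadata are scanned, so the search stays cheap regardless of how many rows the feature tables hold.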
Value to Data Scientist
Explore what features are in
the feature store via metadata
and leverage metadata to
enforce governance (e.g., no
PI, 3rd party data, etc. as
needed)
SDK for Feature Store
Arguments
date: Return features given a specific date or use "latest" to return the last updated feature date. For specific dates, please include a dictionary with an operator and a date, i.e. {">": "2021-05-01"}
*features: Feature names as strings
dataframe_name = fs.select(
    "latest",  # Give a date {"=": "2021-05-01"} or "latest" for the newest available features
    # List the features you want
    "feature_name_last_0-30_days_prior", "feature_name_last_31-60_days_prior", "feature_name_next_1-30_days",
)
display(dataframe_name)
Example
Calling the feature store with “fs”, a command could be:
With a returned result of…
The select method will return a dataframe of all selected features with the given date.
select() When you know exactly the features you want
customer_id as_of feature_name_last_0-30_days_prior feature_name_last_31-60_days_prior feature_name_next_1-30_days
12345 2021-05-01 0.43 0.32 0.21
23456 2021-05-01 0.99 0.94 0.98
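The date argument accepts either "latest" or an operator dictionary. One way such an argument could be resolved is sketched below. The filter_by_date helper and the set of supported operators are assumptions for illustration, and rows are plain dictionaries, not a Spark dataframe.

```python
import operator

# Operators accepted in the date dictionary; the supported set is an assumption.
OPERATORS = {
    "=": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def filter_by_date(rows, date):
    """Keep rows whose "as_of" satisfies `date`.

    `date` is either the string "latest" or a one-entry dictionary such as
    {">": "2021-05-01"}. ISO-8601 date strings compare correctly as text.
    """
    if date == "latest":
        newest = max(row["as_of"] for row in rows)
        return [row for row in rows if row["as_of"] == newest]
    (symbol, value), = date.items()  # exactly one operator is expected
    compare = OPERATORS[symbol]
    return [row for row in rows if compare(row["as_of"], value)]


rows = [
    {"customer_id": 12345, "as_of": "2021-04-01"},
    {"customer_id": 12345, "as_of": "2021-05-01"},
    {"customer_id": 23456, "as_of": "2021-05-01"},
]
latest = filter_by_date(rows, "latest")
after_april_first = filter_by_date(rows, {">": "2021-04-01"})
exact = filter_by_date(rows, {"=": "2021-04-01"})
```

Using the same resolver in development and production is one way to get the consistency the slide describes: the caller states intent ("latest") and the resolver picks the partition.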
Consistent way of selecting the
same feature set from the feature
store – consistent in dev and when
deployed in production
Value to Data Scientist
Consistent way of selecting
(in dev and prod) the same
feature set from the feature
store when creating a
dataframe
SDK for Feature Store
customer_id as_of feature_name_1 feature_name_qwerty_1 …
12345 2021-05-01 0.43 0.32 …
23456 2021-05-01 0.99 0.94 …
select_by() Selecting columns and returning a dataframe you want to use by giving a date, keys, keywords or regex
Arguments
date: Return features given a specific date or use "latest" to return the last updated feature date. For specific dates, please include a dictionary with an operator and a date, i.e. {">": "2021-05-01"}
regexp: A regular expression
kwrds: A list of key words to look for
keys: A dictionary of str, any pointing to tags in the metadata of features, i.e. {"model_output": True}
kwrds_exclude: A list of words to exclude from search
partial: If kwrds is used, this decides if it should find all or any of them when searching.
partial_exclude: If kwrds_exclude is used, this decides if it will exclude all or any of them when searching.
case_sensitive: If True, an exact match is required to return results.
dataframe_name = fs.select_by({"=": "2021-05-01"},
    regexp="^(?=.*asdf)(?=.*qwerty).+")
display(dataframe_name)
Example
Calling the feature store with “fs”, a command could be:
With a returned result of…
The select_by method searches through all features given a set of criteria and returns a dataframe including all the
features that match the criteria within the name or metadata.
Value to Data Scientist
Consistent way of exploring
the feature store and
leveraging metadata for
selection while simultaneously
creating a dataframe with the
selected features
SDK for Feature Store
[Diagram: The Delta Lake with Bronze, Silver, and Gold layers connected by ETL. Gold holds BI Consumption (Dimensional Model) and ML Consumption (Feature Store). Optional consumption-optimized databases are mirrored from Gold: a high-concurrency Data Warehouse and a low-latency Memory Cache.]
Implementation on the Data Lake
[Diagram: The Delta Lake (Bronze, Silver, ML Consumption: Feature Store) connected by ETL, with an optional low-latency Memory Cache mirror. The SDK serves Historic Feature Queries and Online Point Reads.]
SDK (Data Access Layer)
• Consistent view of “online” and “historic” features.
• Separation of logical and physical models.
• Metadata focused query interface for data science
exploration.
Implementation on the Data Lake
§ Simplifies “point in time joins”.
§ Not as flexible or timely.
Pre-defined time aggregations
“As Of” Granularity
“Dynamic Point in Time Joins”
Demonstrated by Feast
More flexible, improved timeliness.
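To make the contrast concrete, here is a generic sketch of a dynamic point-in-time lookup: for each entity and requested time, take the most recent value recorded at or before that time. It illustrates the idea only and is not Feast's API. The point_in_time_lookup function and the sample history are hypothetical.

```python
from bisect import bisect_right


def point_in_time_lookup(history, entity, when):
    """Return the latest feature value observed at or before `when`.

    `history` maps an entity id to a list of (timestamp, value) pairs sorted
    by timestamp. Values recorded after `when` are never returned, which is
    what keeps training rows free of future information.
    """
    observations = history.get(entity, [])
    timestamps = [ts for ts, _ in observations]
    position = bisect_right(timestamps, when)
    if position == 0:
        return None  # nothing was known about this entity yet
    return observations[position - 1][1]


# Hypothetical event-time history for one entity (ISO dates sort as text).
history = {
    12345: [("2021-03-15", 0.10), ("2021-04-20", 0.43), ("2021-05-03", 0.77)],
}
value_on_may_1 = point_in_time_lookup(history, 12345, "2021-05-01")
value_on_mar_1 = point_in_time_lookup(history, 12345, "2021-03-01")
```

With pre-defined "as of" granularity this search disappears, because every table already has one row per entity and date; the trade-off is that values are only as fresh as the last scheduled aggregation.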
Multiple feature tables
Technically possible to use a single wide table.
§ Simplifies:
▪ Schema Migration
▪ Query Planning & Optimization
▪ Scheduling
Physical Feature Tables
Two Choices
Summary
1
Feature stores accelerate data science & enable
better governance.
2
Most design complexity stems from machine
learning requirements for point in time accurate data.
3
80/20 solutions possible by carefully considering
“online” requirements.

More Related Content

What's hot (20)

What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine Learning
Databricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
Alex Ivy
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
Airbyte @ Airflow Summit - The new modern data stack
Airbyte @ Airflow Summit - The new modern data stackAirbyte @ Airflow Summit - The new modern data stack
Airbyte @ Airflow Summit - The new modern data stack
Michel Tricot
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
Dalibor Wijas
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
Databricks
 
CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®
confluent
 
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
Databricks
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Databricks
 
Introduction to Azure Data Lake
Introduction to Azure Data LakeIntroduction to Azure Data Lake
Introduction to Azure Data Lake
Antonios Chatzipavlis
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
Databricks
 
Presto: SQL-on-anything
Presto: SQL-on-anythingPresto: SQL-on-anything
Presto: SQL-on-anything
DataWorks Summit
 
Frame - Feature Management for Productive Machine Learning
Frame - Feature Management for Productive Machine LearningFrame - Feature Management for Productive Machine Learning
Frame - Feature Management for Productive Machine Learning
David Stein
 
MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflow
Databricks
 
Data mesh
Data meshData mesh
Data mesh
ManojKumarR41
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
James Serra
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine Learning
Databricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
Alex Ivy
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
Airbyte @ Airflow Summit - The new modern data stack
Airbyte @ Airflow Summit - The new modern data stackAirbyte @ Airflow Summit - The new modern data stack
Airbyte @ Airflow Summit - The new modern data stack
Michel Tricot
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
Dalibor Wijas
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
Databricks
 
CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®
confluent
 
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
Databricks
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Databricks
 
Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
Databricks
 
Frame - Feature Management for Productive Machine Learning
Frame - Feature Management for Productive Machine LearningFrame - Feature Management for Productive Machine Learning
Frame - Feature Management for Productive Machine Learning
David Stein
 
MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflow
Databricks
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
James Serra
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 

Similar to A Practical Enterprise Feature Store on Delta Lake (20)

NoSQL Endgame DevoxxUA Conference 2020
NoSQL Endgame DevoxxUA Conference 2020NoSQL Endgame DevoxxUA Conference 2020
NoSQL Endgame DevoxxUA Conference 2020
Thodoris Bais
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Jim Dowling
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Grega Kespret
 
Overview of query evaluation
Overview of query evaluationOverview of query evaluation
Overview of query evaluation
avniS
 
Elasticsearch an overview
Elasticsearch   an overviewElasticsearch   an overview
Elasticsearch an overview
Amit Juneja
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Chester Chen
 
Catalyst optimizer
Catalyst optimizerCatalyst optimizer
Catalyst optimizer
Ayub Mohammad
 
Analysis Services en SQL Server 2008
Analysis Services en SQL Server 2008Analysis Services en SQL Server 2008
Analysis Services en SQL Server 2008
Eduardo Castro
 
Practical catalyst
Practical catalystPractical catalyst
Practical catalyst
dwm042
 
Machine Learning on the Microsoft Stack
Machine Learning on the Microsoft StackMachine Learning on the Microsoft Stack
Machine Learning on the Microsoft Stack
Lynn Langit
 
Making sense of your data jug
Making sense of your data   jugMaking sense of your data   jug
Making sense of your data jug
Gerald Muecke
 
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
DataStax
 
DataFinder concepts and example: General (20100503)
DataFinder concepts and example: General (20100503)DataFinder concepts and example: General (20100503)
DataFinder concepts and example: General (20100503)
Data Finder
 
Productionalizing ML : Real Experience
Productionalizing ML : Real ExperienceProductionalizing ML : Real Experience
Productionalizing ML : Real Experience
Ihor Bobak
 
MIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome MeasuresMIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome Measures
Steven Johnson
 
Odtug2011 adf developers make the database work for you
Odtug2011 adf developers make the database work for youOdtug2011 adf developers make the database work for you
Odtug2011 adf developers make the database work for you
Luc Bors
 
Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.com
Syed Hadoop
 
At the core you will have KUSTO
At the core you will have KUSTOAt the core you will have KUSTO
At the core you will have KUSTO
Riccardo Zamana
 
Compass Framework
Compass FrameworkCompass Framework
Compass Framework
Lukas Vlcek
 
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...
Piyush Kumar
 
NoSQL Endgame DevoxxUA Conference 2020
NoSQL Endgame DevoxxUA Conference 2020NoSQL Endgame DevoxxUA Conference 2020
NoSQL Endgame DevoxxUA Conference 2020
Thodoris Bais
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Jim Dowling
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Grega Kespret
 
Overview of query evaluation
Overview of query evaluationOverview of query evaluation
Overview of query evaluation
avniS
 
Elasticsearch an overview
Elasticsearch   an overviewElasticsearch   an overview
Elasticsearch an overview
Amit Juneja
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Chester Chen
 
Analysis Services en SQL Server 2008
Analysis Services en SQL Server 2008Analysis Services en SQL Server 2008
Analysis Services en SQL Server 2008
Eduardo Castro
 
Practical catalyst
Practical catalystPractical catalyst
Practical catalyst
dwm042
 
Machine Learning on the Microsoft Stack
Machine Learning on the Microsoft StackMachine Learning on the Microsoft Stack
Machine Learning on the Microsoft Stack
Lynn Langit
 
Making sense of your data jug
Making sense of your data   jugMaking sense of your data   jug
Making sense of your data jug
Gerald Muecke
 
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
DataStax | Data Science with DataStax Enterprise (Brian Hess) | Cassandra Sum...
DataStax
 
DataFinder concepts and example: General (20100503)
DataFinder concepts and example: General (20100503)DataFinder concepts and example: General (20100503)
DataFinder concepts and example: General (20100503)
Data Finder
 
Productionalizing ML : Real Experience
Productionalizing ML : Real ExperienceProductionalizing ML : Real Experience
Productionalizing ML : Real Experience
Ihor Bobak
 
MIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome MeasuresMIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome Measures
Steven Johnson
 
Odtug2011 adf developers make the database work for you
Odtug2011 adf developers make the database work for youOdtug2011 adf developers make the database work for you
Odtug2011 adf developers make the database work for you
Luc Bors
 
Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.com
Syed Hadoop
 
At the core you will have KUSTO
At the core you will have KUSTOAt the core you will have KUSTO
At the core you will have KUSTO
Riccardo Zamana
 
Compass Framework
Compass FrameworkCompass Framework
Compass Framework
Lukas Vlcek
 
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...
MetaConfig driven FeatureStore : MakeMyTrip | Presented at Data Con LA 2019 b...
Piyush Kumar
 
Ad

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
Databricks
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

A Practical Enterprise Feature Store on Delta Lake

  • 1. A Practical Feature Store on Delta Lake Nathan Buesgens ML Operations Bryan Christian Data Science
  • 2. Agenda § What is a Feature Store? ▪ MLOps for Acceleration and Governance in the Enterprise ▪ Feature Store: Use Cases ▪ Edge Cases: 80/20 ▪ Relation to the Data Warehouse § Design Reference ▪ Logical Data Model & Access Patterns ▪ Physical Representation in the Delta Lake
  • 3. What is a Feature Store?
  • 4. 75% Reduction in Feature Engineering “Data Wrangling” Time 15X Accelerated Model Delivery with MLOps Automation and Governance END-TO-END VALUE DELIVERY TIME TO VALUE & CONCURRENCY SCALABLE INFRASTRUCTURE I.E. AVOID: “PROOF OF CONCEPT FACTORY” MLOps: Data Science at Scale
  • 5. Example: Feature Store. Infrastructure to support DS + MLE. The feature store serves as the consumption layer for ML applications. It provides: • Acceleration: pre-“hardened” features reduce data wrangling time for the Data Scientist. • Governance: a common consumption pattern ensures nothing is lost in the translation to production. Diagram labels: BOTTLENECK, Curated Data, Feature Engineering, Modelling, Feature Store, Predictions.
  • 6. Data Science Use Cases. The Feature Store is built on the following data science requirements that are relevant to predictive analytics in Financial Services use cases: correct and consistently applied joins across multiple Delta files without loss of processing speed; aggregations, window functions, and transformations of data; granularity of point in time and level of the prediction (e.g. individual, account, etc.). The Feature Store uses the “as_of” date for the point in time granularity for both backward- and forward-facing windows. Code-embedded metadata allows easy removal of future-facing windows from the “independent” variables to prevent feature leakage. Example rows:
      customer_id | as_of | feature_name_last_0-30_days_prior | feature_name_last_31-60_days_prior | feature_name_next_1-30_days
      12345 | 2021-05-01 | 0.43 | 0.32 | 0.21
      23456 | 2021-05-01 | 0.99 | 0.94 | 0.98
      34567 | 2021-05-01 | 0.03 | 0.92 | 0.13
      45678 | 2021-05-01 | 0.42 | 0.59 | 0.50
  • 7. Edge Cases: Opportunities to Simplify for an 80/20 Feature Store MVP. Edge case 1: “Online” Features (ultra-low-latency, ultra-timely point reads). § Many ML use cases don’t have an online requirement, esp. “Human + AI”. § Extending the MVP: ▪ Some online use cases can be reframed as streaming use cases. ▪ Online use cases can be met with an extension to the Delta Lake design. ▪ See: feast.dev. Edge case 2: Low-Code ETL (configuration based, AutoML, FeatureFlow, etc.). § Low-code & citizen science expands the user base, but doesn’t necessarily accelerate existing users. § 80/20 value from: Optimizing Access vs. Optimizing ETL Development.
  • 8. Comparison with Data Warehouse, i.e. Dimensional Model. Similarities: ▪ “Golden” aggregates of curated data. ▪ Highly structured, well-defined granularities (esp. as 80/20 solution). ▪ Similar non-functional requirements for strong governance standards, metadata management, discovery, etc. Differences: ▪ Different Use Case: BI vs. Modelling. ▪ Different Access Patterns, therefore: ▪ Different Data Model ▪ Different Technology Stack. ▪ Supervised learning creates complex requirements for “point in time accurate data”.
  • 10. Point in Time Accurate Data: Three Ways Inconsistency Sneaks In. Window functions; watermark; feature leakage.
  • 11. Point in Time Accurate Data: Three Ways Inconsistency Sneaks In. Window functions; watermark; feature leakage. Reference: Structured Streaming Programming Guide.
  • 12. Feature Store Logical Model: Data Model for Feature Store Access. Every feature for an entity “as of” a date. Granularity: The “Entity”: § The thing being modelled. Term borrowed from Feast. “As of”: § Discrete granularity (daily, hourly, etc.), not an “event time”. § 80/20 solution. § For “continuous” granularity see: Feast. Columns: Features: un-vectorized (80/20). Targets: necessarily at same granularity as features. Predictions: one model’s prediction is often another’s feature.
  • 13. SDK for Feature Store: Core Functionality. No need to rebuild the whole feature store when new features are added. (Certain sets of features might be rebuilt at times, though they will have much shorter downtime.) The SDK indexes the available features and, upon request, builds the joins to combine all desired features into one cohesive data frame, providing a production-grade feature selection tool. Keyword searching is enabled for features, so you can find any feature you're looking for using "human" logic. Tuning can be specific to each set of features, allowing more optimal feature creation. Methods: find(): to search through all columns and metadata for the features you want to use by giving keys, keywords or regex. select(): when you know exactly the features you want. select_by(): selecting columns and returning a dataframe you want to use by giving a date, keys, keywords or regex.
  • 14. SDK for Feature Store: find(). To search through all columns and metadata for the features you want to use by giving keys, keywords or regex. Arguments: regexp: a regular expression; kwrds: a list of key words to look for; keys: a dictionary of str, any pointing to tags in the metadata of features, i.e. {"model_output": True}; kwrds_exclude: a list of words to exclude from search; partial: if kwrds is used, this decides if it should find all or any of them when searching; partial_exclude: if kwrds_exclude is used, this decides if it will exclude all or any of them when searching; verbose: if True, prints out results, otherwise just returns them; case_sensitive: if True, an exact match is required to return results. Example: calling the feature store with “fs”, a command could be: fs.find(regexp="^(?=.*asdf)(?=.*qwerty).+") With a returned result of: Your search returned 20 results… feature_name_1: {'comment': 'Flag if asdf > 0.3 at any point within the last 3 months.'} feature_name_qwerty_1: {'comment': 'Average number of widgets customer purchased in the last 0-1 months.'} ... The find method searches through all features given a set of criteria and returns any matches within the name or metadata of columns. It is a great tool to explore the data without pulling in massive datasets. Value to Data Scientist: explore what features are in the feature store via metadata and leverage metadata to enforce governance (e.g., no PI, 3rd party data, etc. as needed).
  • 15. SDK for Feature Store: select(). When you know exactly the features you want. Arguments: date: return features given a specific date or use "latest" to return the last updated feature date; for specific dates, please include a dictionary with an operator and a date, i.e. {">": "2021-05-01"}; *features: feature names as strings. Example: calling the feature store with “fs”, a command could be: dataframe_name = fs.select("latest", "feature_name_last_0-30_days_prior", "feature_name_last_31-60_days_prior", "feature_name_next_1-30_days") followed by display(dataframe_name). The first argument gives a date, either {"=": "2021-05-01"} or "latest" for the newest available features; the remaining arguments list the features you want. The select method will return a dataframe of all selected features with the given date. Value to Data Scientist: consistent way of selecting (in dev and prod) the same feature set from the feature store when creating a dataframe. With a returned result of:
      customer_id | as_of | feature_name_last_0-30_days_prior | feature_name_last_31-60_days_prior | feature_name_next_1-30_days
      12345 | 2021-05-01 | 0.43 | 0.32 | 0.21
      23456 | 2021-05-01 | 0.99 | 0.94 | 0.98
  • 16. SDK for Feature Store: select_by(). Selecting columns and returning a dataframe you want to use by giving a date, keys, keywords or regex. Arguments: date: return features given a specific date or use "latest" to return the last updated feature date; for specific dates, please include a dictionary with an operator and a date, i.e. {">": "2021-05-01"}; regexp: a regular expression; kwrds: a list of key words to look for; keys: a dictionary of str, any pointing to tags in the metadata of features, i.e. {"model_output": True}; kwrds_exclude: a list of words to exclude from search; partial: if kwrds is used, this decides if it should find all or any of them when searching; partial_exclude: if kwrds_exclude is used, this decides if it will exclude all or any of them when searching; case_sensitive: if True, an exact match is required to return results. Example: calling the feature store with “fs”, a command could be: dataframe_name = fs.select_by({"=": "2021-05-01"}, regexp="^(?=.*asdf)(?=.*qwerty).+") followed by display(dataframe_name). The select_by method searches through all features given a set of criteria and returns a dataframe including all the features that match the criteria within the name or metadata. Value to Data Scientist: consistent way of exploring the feature store and leveraging metadata for selection while simultaneously creating a dataframe with the selected features. With a returned result of:
      customer_id | as_of | feature_name_1 | feature_name_qwerty_1 | …
      12345 | 2021-05-01 | 0.43 | 0.32 | …
      23456 | 2021-05-01 | 0.99 | 0.94 | …
  • 17. Implementation on the Data Lake. Diagram labels: The Delta Lake: Bronze; Silver; Gold, with BI Consumption: Dimensional Model and ML Consumption: Feature Store. Optional: Consumption Optimized Databases: Low Latency Memory Cache; High Concurrency Data Warehouse. Connectors: ETL, ETL, Mirror, Mirror.
  • 18. Implementation on the Data Lake. Diagram labels: The Delta Lake: Bronze; Silver; ML Consumption: Feature Store. Optional: Consumption Optimized Databases: Low Latency Memory Cache. Connectors: ETL, ETL, Mirror. SDK (Data Access Layer), serving Historic Feature Queries and Online Point Reads: • Consistent view of “online” and “historic” features. • Separation of logical and physical models. • Metadata focused query interface for data science exploration.
  • 19. Physical Feature Tables: Two Choices. Choice 1: “As Of” Granularity (pre-defined time aggregations): § Simplifies “point in time joins”. § Not as flexible or timely. Alternative: “Dynamic Point in Time Joins”, demonstrated by Feast: more flexible, improved timeliness. Choice 2: Multiple feature tables (technically possible to use a single wide table): § Simplifies: ▪ Schema Migration ▪ Query Planning & Optimization ▪ Scheduling.
  • 20. Summary 1 Feature stores accelerate data science & enable better governance. 2 Most design complexity stems from machine learning requirements for point in time accurate data. 3 80/20 solutions possible by carefully considering “online” requirements.
  • 21. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
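
The ideas on slides 6 and 14 (a metadata search like find(), and metadata-driven removal of future-facing windows to avoid feature leakage) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the SDK described in the talk: the `FEATURE_METADATA` registry, the `future_facing` tag, the comment strings, and the `split_features_and_targets` helper are hypothetical names introduced here. The feature names and values are taken from the example table on slide 6.

```python
import re

# Hypothetical metadata registry: feature name -> tags.
# The "future_facing" tag and the comment strings are assumptions for illustration.
FEATURE_METADATA = {
    "feature_name_last_0-30_days_prior": {
        "comment": "Illustrative: value over the 0-30 days before as_of.",
        "future_facing": False,
    },
    "feature_name_last_31-60_days_prior": {
        "comment": "Illustrative: value over the 31-60 days before as_of.",
        "future_facing": False,
    },
    "feature_name_next_1-30_days": {
        "comment": "Illustrative: value over the 1-30 days after as_of.",
        "future_facing": True,
    },
}

# Two rows from the example table: one row per entity (customer_id) and as_of date.
ROWS = [
    {"customer_id": 12345, "as_of": "2021-05-01",
     "feature_name_last_0-30_days_prior": 0.43,
     "feature_name_last_31-60_days_prior": 0.32,
     "feature_name_next_1-30_days": 0.21},
    {"customer_id": 23456, "as_of": "2021-05-01",
     "feature_name_last_0-30_days_prior": 0.99,
     "feature_name_last_31-60_days_prior": 0.94,
     "feature_name_next_1-30_days": 0.98},
]

KEY_COLUMNS = ("customer_id", "as_of")


def find(regexp, metadata=FEATURE_METADATA):
    """Return features whose name or comment matches the regular expression."""
    pattern = re.compile(regexp)
    return {
        name: tags
        for name, tags in metadata.items()
        if pattern.search(name) or pattern.search(tags["comment"])
    }


def split_features_and_targets(rows, metadata=FEATURE_METADATA):
    """Separate backward-facing columns (X) from future-facing windows (y)."""
    future = {name for name, tags in metadata.items() if tags["future_facing"]}
    X = [{k: v for k, v in row.items() if k not in future} for row in rows]
    y = [{k: v for k, v in row.items() if k in KEY_COLUMNS or k in future}
         for row in rows]
    return X, y


if __name__ == "__main__":
    # Lookahead regex in the style of the slides: both words must appear.
    print(sorted(find(r"^(?=.*last)(?=.*prior).+")))
    X, y = split_features_and_targets(ROWS)
    print(X[0])
    print(y[0])
```

Keeping the future-facing flag in metadata rather than in the column name means the same rule can be applied wherever features are selected, which is one way to read the "code-embedded metadata" remark on slide 6.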