Dr. Jim Dowling
CEO / Co-Founder
Logical Clocks
Managed Feature Store
for ML Webinar
[ Presenter ]
Leadership & Offices
Stockholm
Box 1263,
Isafjordsgatan 22
Kista,
Sweden
London
IDEALondon,
69 Wilson St,
London,
UK
Silicon Valley
470 Ramona St
Palo Alto
California,
USA
Dr. Jim Dowling
CEO
Theo Kakantousis
COO
Prof. Seif Haridi
Chief Scientist
Fabio Buso
VP Engineering
Steffen Grohsschmiedt
Head Of Cloud
www.logicalclocks.com
Shraddha Chouhan
Head Of Marketing
Hopsworks - Award Winning Platform
Today’s Journey to a Feature Store and Beyond
Ad-hoc Scripts
and Jobs
Shared Feature
Pipelines
Feature Store
MLOps with a
Feature Store
Known Feature Stores in Production
● Logical Clocks – Hopsworks (world’s first open source)
● Uber Michelangelo
● Airbnb – Bighead/Zipline
● Comcast
● Twitter
● GO-JEK Feast (GCE, open-source layer over BigTable/BigQuery)
● Branch
● Conde Nast
● Facebook FB Learner
● Netflix
Reference: www.featurestore.org
What is a Feature?
A feature is a measurable property of a phenomenon under observation and
(part of) an input to an ML model.
Example features:
● A raw word, a pixel, a sound
wave, a sensor value;
● An aggregate
(mean, max, sum, min)
● A window
(last_hour, last_day, etc)
● A derived representation
(embedding or cluster)
numbers
(in arrays)
A Data Engineer’s perspective on Feature Engineering
numbers
arrays
(of numbers)
one-hot
encoding
Databases
Schemas
varchar, charsets
integer, blob,
varbinary
Feature Engineering is about Transforming Data
Feature Engineering is about Transforming Data
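from pyspark.ml.feature import Normalizer
# Read the input DataFrame; it must contain a vector column named "features"
scaledDF = spark.read.parquet("…")
# p=1 selects the L1 norm: each vector is divided by the sum of its absolute values
l1_norm = Normalizer().setP(1).setInputCol("features").setOutputCol("l1_norm")
l1_norm.transform(scaledDF)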
Normalize
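To make the Normalize step concrete, here is a minimal pure-Python sketch of what L1 normalization does to a single feature vector. The function name l1_normalize is hypothetical, and leaving an all-zero vector unchanged is an assumption of this sketch, not a statement about Spark's behaviour.

```python
def l1_normalize(vector):
    """Scale a vector so that the absolute values of its entries sum to 1."""
    norm = sum(abs(x) for x in vector)
    if norm == 0:
        # Assumption: an all-zero vector is returned unchanged.
        return list(vector)
    return [x / norm for x in vector]


print(l1_normalize([1.0, 2.0, 1.0]))
```

Each entry ends up as its share of the vector's total magnitude, which is the transformation a feature pipeline would apply row by row.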
Consistent Features between Training and Inference
It’s not always trivial to ensure features are engineered
consistently between training and inference
Features
Training
Labels Model
Features
Inference
Model Labels
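One common way to keep features consistent is to define the feature logic once and call it from both the training and the inference path. The sketch below illustrates that idea only; the function names and the is_female feature are hypothetical and do not come from Hopsworks.

```python
def engineer_features(passenger):
    """Single definition of the feature logic, shared by both paths."""
    return {
        "pclass": int(passenger["Pclass"]),
        "is_female": 1 if passenger["Sex"] == "female" else 0,
    }


def build_training_rows(passengers):
    """Training path: features plus the label for each passenger."""
    return [(engineer_features(p), p["Survived"]) for p in passengers]


def build_inference_row(passenger):
    """Inference path: the same features, no label available yet."""
    return engineer_features(passenger)
```

Because both paths call engineer_features, a change to the feature definition cannot reach training without also reaching inference.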
Feature Store – Reuse Cached Features
One
Feature
Pipeline
Get
Get
Features
Training
Labels Model
Features
Inference
Model Labels
Feature
Store
Features name Pclass Sex Survive Name Balance
Train / Test
Datasets
Survive name PClass Sex Balance
Join key
Feature
Groups
Titanic
Passenger List
Passenger
Bank Account
File format
.tfrecords
.npy
.csv
.hdf5,
.petastorm, etc
Storage
GCS
Amazon S3
HopsFS
Features, FeatureGroups, and Train/Test Datasets are all versioned
Feature Store Concepts
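The relationship between feature groups, the join key, and a train/test dataset can be sketched in plain Python. The rows below are made-up illustration data, join_feature_groups is a hypothetical helper, and using name as the join key is an assumption read off the slide.

```python
# Two toy feature groups that share the column "name".
passenger_list = [
    {"name": "Alice", "Pclass": 1, "Sex": "female", "Survive": 1},
    {"name": "Bob", "Pclass": 3, "Sex": "male", "Survive": 0},
]
bank_account = [
    {"name": "Alice", "Balance": 1200.0},
    {"name": "Bob", "Balance": 35.5},
]


def join_feature_groups(left, right, key):
    """Inner-join two lists of feature rows on a shared key column."""
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


dataset = join_feature_groups(passenger_list, bank_account, "name")
```

The joined rows are what would then be written out in a file format such as .csv or .tfrecords.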
Streaming App pushes click features every 5 secs
Streaming App pushes CDC data every 30 secs
Pandas App pushes user profile updates every hour
Batch App pushes featurized weblogs data every day
Online
Feature
Store
Offline
Feature
Store
SQL DW
S3, HDFS
SQL
Event Data
Real-Time Data
Real-time feature transformations (<2 secs) Online
App
Low
Latency
Features
High
Latency
Features
Train,
Batch App
FeatureGroups are ingested at different Cadences
Feature Store
No existing database is both scalable (PBs) and low latency (<10ms). Hence, online + offline Feature Stores.
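The split into an online and an offline store can be pictured with a toy in-memory model: the offline side keeps every row for training, while the online side keeps only the latest row per key for low-latency lookups. ToyFeatureStore and its methods are hypothetical names, and the dual write shown here is a sketch of the idea rather than how Hopsworks implements it.

```python
class ToyFeatureStore:
    """In-memory stand-in for an online + offline feature store pair."""

    def __init__(self):
        self.offline = []  # append-only history, used for training
        self.online = {}   # latest row per key, used for serving

    def insert(self, key, features, event_time):
        # Offline: keep every version of the row.
        self.offline.append({"key": key, "event_time": event_time, **features})
        # Online: overwrite only if this row is at least as recent.
        current = self.online.get(key)
        if current is None or event_time >= current["event_time"]:
            self.online[key] = {"event_time": event_time, **features}

    def get_online(self, key):
        return self.online[key]

    def get_offline(self, key):
        return [row for row in self.offline if row["key"] == key]
```

A serving application would call get_online for a single key, while a training job would scan get_offline to see the full history.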
Feature Store
ClickFeatureGroup
TableFeatureGroup
UserFeatureGroup
LogsFeatureGroup
Event Data
SQL DW
S3, HDFS
SQL
DataFrameAPI
Kafka Input
Flink
RTFeatureGroup
Online
App
Train,
Batch App
FeatureGroup ingestion in Hopsworks
User Clicks
DB Updates
User Profile Updates
Weblogs
Real-time features
Kafka Output
Simplify access to the online/offline Feature Stores by providing a general-purpose DataFrame API.
Register a Feature Group with the Feature Store
from hops import featurestore as fs
df = ...  # Spark or Pandas DataFrame
# Do feature engineering on 'df'
# Register the DataFrame as a FeatureGroup
fs.create_featuregroup(df, "titanic_df")
HOPSWORKS
Rest API
1 Add Metadata
2 Add Statistics
….
Offline FS
Apache Hive
HopsFS
(External)
Spark Cluster
.parquet, .orc (TLS)
Online FS
MySQL Cluster
fs.create_featuregroup(df, "titanic_df",
                       offline=True, online=True)
Feature Ingestion with Spark
Online
Feature Store
(Serving)
Offline
Feature Store
(Training & Batch)
Online Apps
Model Training
Batch Apps
Event Data
SQL DW
S3, HDFS
SQL
Ingest
Data
From
Used
By
Hopsworks Feature Store
Create Training Datasets using the Feature Store
from hops import featurestore as fs
sample_data = fs.get_features(["name", "Pclass", "Sex", "Balance", "Survived"])
fs.create_training_dataset(sample_data, "titanic_training_dataset",
                           data_format="tfrecords", training_dataset_version=1)
HOPSWORKS
Offline FS
Apache Hive
HopsFS
Join Features <<TLS>>
Online FS
MySQL Cluster
(External)
Spark Cluster
sample_data = fs.get_features(["name",
                               "Pclass", "Sex", "Balance", "Survived"])
Create Training Datasets with (External) Spark
Storage
GCS Amazon S3 HopsFS
.npy, .tfrecords, .csv
commit-0097
….
commit-0002
commit-0001
FeatureGroup
atomic
update
Feature Store
Time-Travel Queries for Creating Training Datasets
df = fs.get_features(…., from="2017", to="2019")
Storage
GCS Amazon S3 HopsFS
.tfrecords
.csv
.npy
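The commit log on this slide suggests how a time-travel query could work: every atomic update is stored as a commit with a timestamp, and a query reads only the commits inside the requested range. The sketch below is an assumption-laden toy, not the Hopsworks implementation; the commit data is made up, read_between is a hypothetical helper, both range bounds are treated as inclusive, and the parameters are named start and end because from is a reserved word in Python.

```python
from datetime import date

# Each commit is (commit date, rows added by that atomic update).
commits = [
    (date(2016, 6, 1), [{"name": "a", "Balance": 10}]),
    (date(2017, 3, 1), [{"name": "b", "Balance": 20}]),
    (date(2018, 9, 1), [{"name": "c", "Balance": 30}]),
    (date(2020, 1, 1), [{"name": "d", "Balance": 40}]),
]


def read_between(commits, start, end):
    """Return the rows of all commits whose date lies in [start, end]."""
    rows = []
    for committed_on, batch in commits:
        if start <= committed_on <= end:
            rows.extend(batch)
    return rows


rows = read_between(commits, date(2017, 1, 1), date(2019, 12, 31))
```

Only the 2017 and 2018 commits fall inside the range, so a training dataset built from rows would ignore both older and newer data.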
US-West-1a
MySQL
NDB1 Model
Online Application
1. JDBC 2. Predict
1. Build a Feature Vector using the Online Feature Store
US-West-1c
MySQL
NDB3 Model
~5-50ms
Online Feature Store: High Availability & Low-Latency
US-West-1b
MySQL
NDB2 Model
2-20ms
2. Send the Feature Vector to a Model for Prediction
HOPSWORKS
Rest API
Return JDBC Query
….
Offline FS
Apache Hive
HopsFS
Online FS
MySQL Cluster SELECT .. FROM WHERE … in [keys]
<<TLS>>
getQuery(“model”)
<<API-Key>> Online
Application
Online Feature Store: JDBC API
[keys]
user_id,
session_id,
timestamp, etc
Model
Prediction
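The lookup step, a SELECT filtered by a list of keys, can be tried out locally with Python's built-in sqlite3 module standing in for MySQL Cluster. This is a sketch under stated assumptions: the table user_features, its columns, and the inserted rows are all hypothetical, and a real online application would use a JDBC connection and the query returned by the Hopsworks REST API instead.

```python
import sqlite3

# In-memory database as a stand-in for the online feature store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_features ("
    "user_id INTEGER PRIMARY KEY, clicks_last_hour INTEGER, avg_basket REAL)"
)
conn.executemany(
    "INSERT INTO user_features VALUES (?, ?, ?)",
    [(1, 4, 20.5), (2, 0, 7.25), (3, 9, 51.0)],
)


def get_feature_vectors(conn, keys):
    """Fetch one feature row per requested key with a parameterized IN query."""
    placeholders = ", ".join("?" for _ in keys)
    query = (
        "SELECT user_id, clicks_last_hour, avg_basket "
        "FROM user_features "
        "WHERE user_id IN (" + placeholders + ") "
        "ORDER BY user_id"
    )
    return conn.execute(query, list(keys)).fetchall()


vectors = get_feature_vectors(conn, [1, 3])
```

Each returned tuple is a feature vector that the application would then send to the model for prediction. Binding the keys as parameters rather than formatting them into the SQL string keeps the query safe from injection.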
HOPSWORKS
APPLICATIONS
API
DASHBOARDS
HOPSWORKS
DATASOURCES
ORCHESTRATION
In Airflow
BATCH
Apache Beam
Apache Spark
STREAMING
Apache Beam
Apache Spark
Apache Flink
HOPSWORKS
FEATURE
STORE
DISTRIBUTED
ML & DL
Pip
Conda
Tensorflow
scikit-learn
PyTorch
Jupyter
Notebooks
Tensorboard
FILESYSTEM & METADATA STORAGE
HopsFS
MODEL
SERVING
Kubernetes
MODEL
MONITORING
Kafka
+
Spark Streaming
Data Preparation
& Ingestion
Experimentation
& Model Training
Deploy
& Productionalize
Apache
Kafka
1
Feature
Engineering
2
Feature
Selection
3
Training &
Validation
4 Serving 5 Prediction
Train/Test Data
(S3, HDFS, etc)
Online
Application
Batch
Application
Data Warehouse
Data Lake
Feature
Engineering
Offline
Feature Store
Feature
Selection
Scoring &
Validation
Train
Model
Serving
Online
Feature Store
Model
Repository
Monitor
Experiments
Deploy
Feature Vector
Kafka
ML Lifecycle
Stage 1. Data Engineer
Models
Stage 2. Data Scientist
Model APIs
Stage 3. ML Engineer
Intelligent App
Stage 4. App Developer
Features
Model Hyperparameters
Model Candidates
Feature
Selection
Training Data
Test Data
Model
Design
Model
Architecture
Model
Architecture
Model
Architecture
Model
Architecture
Model Repository
Model
Architecture
Model
Architecture
Model
Architecture
Trial
Data Scientist
Experiments
Model Validation
Batch Apps
Online
Application
Predict
Get Online Features
App Developer
Redshift S3 Cassandra Hadoop
Feature
Engineering
Feature Store
Data Engineer
Kubernetes / Serverless
KPI Dashboards
Alerts
Actions
Model
Architecture
Model
Architecture
Model
Architecture
Model
Architecture
Model
Kafka
Model Inference API
Log Predictions
Predict
Streaming or
Serverless
Monitoring App
Log Predictions and
Join Outcomes
Online Model Serving
ML Engineer
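The monitoring step "Log Predictions and Join Outcomes" amounts to matching each logged prediction with the outcome that arrives later and computing a metric over the matched pairs. The following is a minimal sketch of that join; the request IDs, the dictionaries, and both function names are hypothetical, and a production setup would read these records from Kafka rather than from in-memory dictionaries.

```python
def join_predictions_with_outcomes(predictions, outcomes):
    """Pair each logged prediction with its outcome, if one has arrived."""
    joined = []
    for request_id, predicted in predictions.items():
        if request_id in outcomes:
            joined.append((request_id, predicted, outcomes[request_id]))
    return joined


def accuracy(joined):
    """Fraction of joined pairs where prediction and outcome agree."""
    if not joined:
        return None  # no outcomes yet, so no metric can be computed
    correct = sum(1 for _, predicted, actual in joined if predicted == actual)
    return correct / len(joined)


predictions = {"r1": 1, "r2": 0, "r3": 1}
outcomes = {"r1": 1, "r2": 1}  # the outcome for r3 has not arrived yet
joined = join_predictions_with_outcomes(predictions, outcomes)
```

A metric such as accuracy(joined) is the kind of value that would feed the KPI dashboards and alerts shown on the slide.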
Feature Store
Offline Features (Hive)
Secure Multi-Tenancy
Role-based Access Control
Encryption At-Rest, In-Motion
TLS/SSL everywhere
AI-Asset Governance
Models, experiments, data, GPUs
Data/Model/Feature Lineage
Discover/track dependencies
Real-Time, HA Database
MySQL Cluster (NDB)
JDBC API for Serving Clients
Online apps only need JDBC
In-Memory or NVMe data
Single-digit ms query times
Apache Hive on HopsFS
Scalable Data warehouse
Spark for Feature Computing
Fast backfilling of Training Data
HopsFS
NVMe speed with Big Data
HA and Horizontally Scalable
From 1 to 100s of nodes and
PBs of data
Hive
HA and Horizontally Scalable
Add nodes with no downtime
and scale to 10s of TBs
JDBC
NDB
NVMe
Security & Governance
Online Features (NDB)
Agenda for demo
Feature Store Overview
Access control / governance / statistics
Creating Features
Online vs Offline Features
Search for Features
Create training dataset
Query planner and hints
Online Feature Store
JDBC API for the Online Feature Store
Hopsworks Subscription Models
Full Featured
AGPL-v3 License Model
Hopsworks Community
Kubernetes Support
• Model Serving
• Other services for robustness (Jupyter, more coming)
Authentication (LDAP, Kerberos, OAuth2)
Github support
Hopsworks Enterprise
Try it out!
www.hopsworks.ai
Stockholm
Box 1263,
Isafjordsgatan 22
Kista,
Sweden
London
IDEALondon,
69 Wilson St,
London,
UK
Silicon Valley
470 Ramona St
Palo Alto
California,
USA
www.logicalclocks.com


Managed Feature Store for Machine Learning
