0% found this document useful (0 votes)
143 views23 pages

OCI - Data - Flow - Spark - Streaming - and - Machine - Learning

OCI Data Flow now supports Spark Streaming, allowing customers to perform ETL on continuously streaming data at cloud scale. This includes use cases like predicting the remaining useful life (RUL) of manufacturing equipment from sensor data streams using Spark ML survival regression models. A simulator application generates simulated sensor data, which is streamed to a predictor application that applies a pre-trained RUL model to output RUL predictions and maintenance status/actions in real-time.

Uploaded by

meera2k10
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
143 views23 pages

OCI - Data - Flow - Spark - Streaming - and - Machine - Learning

OCI Data Flow now supports Spark Streaming, allowing customers to perform ETL on continuously streaming data at cloud scale. This includes use cases like predicting the remaining useful life (RUL) of manufacturing equipment from sensor data streams using Spark ML survival regression models. A simulator application generates simulated sensor data, which is streamed to a predictor application that applies a pre-trained RUL model to output RUL predictions and maintenance status/actions in real-time.

Uploaded by

meera2k10
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 23

Machine Learning on Streaming Data with

OCI Data Flow

Sujoy Chowdhury
Product Manager, OCI Data Flow

Sivanesh Selvanataraj
Software Engineer, OCI Data Flow

May 5th, 2022


Introducing OCI Data Flow Streaming

Oracle Cloud Infrastructure Data Flow, a fully managed serverless Apache


Spark service, now supports Spark Streaming. Customers can now use OCI
Data Flow to do cloud scale ETL on their continuously produced stream data.

2
Data Lakehouse Platform on OCI
Open & flexible: analyze any database, any application, from anywhere

Data Lakehouse on OCI

Managed Open Source Data Warehouse


Data sources Data target

Data Stores
Any Database Big Data Service Autonomous Data
Data Flow Warehouse
Any BI Tool
Search Insights MySQL HeatWave

Data Definition
Any Application Object Storage
Data Movement & Discovery
Relational Data
Machine Learning
& Data Science

Any Cloud

Data Integration Data Catalog


GoldenGate
OCI Streaming
Data Science Platform Any Application
Any Events/Sensors
AI Services
Lakehouse: Cloud Scale ETL & ML using OCI Data Flow
Use all data to innovate in a lakehouse.
1. Cloud Scale ETL & ML
Cloud Scale ETL & ML
• Cloud Scale ETL at ‘landing zone’ for data warehouse
• Cloud Scale ETL at Data Lake (object store to object Raw data/ Lakehouse Insights
store) to optimize ADW stream

• Data Lake to optimize ADW storage (inexpensive, easily


accessible data store for important but rarely used data,
processed data & meta-data)
• Data Catalog harvests the lake (technical metadata) Autonomous Oracle
and applies a business context. This includes ETL OCI Data Flow ETL Data Warehouse Analytics
Cloud
data/metadata created by Spark applications.
ML
• Data Integration service for ETL orchestration,
Object Storage
workflows Data Lake

2. Stream data processing OCI Data


Catalog
Object Storage OCI Data Science
OCI AI Services
Data Lake
Oracle Machine
• Streamed logs, sensor data, CDR, clickstream that Learning
needs to be processed before ADW (ETL & ML)
3. Big Data processing using Python, Java OCI Data
• Data Engineers/ Data Scientists can use their preferred Integration
programming language for processing big data

4
Analyzing Streams with OCI Data Flow Spark Streaming

OCI Streaming
OCI Streaming

OCI Data Flow

Object Storage
Data Lake Object Storage
Data Lake

• Continuously retrieve input data, not just hourly/ daily.


• Support Kafka-compatible stream/ message queue as source and sink, in addition to batch file-
formats
• Processed data made available continuously (<1 min), not just after job completion
• Outage does not require job restart, can continue from last checkpoint
• Handle late data and watermarking: can catch up backlogged data over time
• Heavy-weight stream processing: Stream-static Joins, Stream-stream Joins, Inner Joins with optional
watermarking. Outer Joins with Watermarking, streaming Deduplication
• Machine Learning on stream data using MLLib.
Spark Streaming Machine Learning Use Cases

Manufacturing Financial Services Telecom

• Predicting RUL • Real time fraud • Churn prediction


‘Remaining Useful Life’ detection

6
Manufacturing: Predicting Remaining Useful Life (RUL) from stream data
Manufacturing
Lakehouse

OCI Streaming OCI Streaming

OCI Data Flow Autonomous


Data Warehouse

Object Storage
Data Lake
Object Storage
Data Lake

Color RUL Status, Action


Red 10 – 30 days Degraded, perform maintenance.
Yellow 30 – 60 days Degrading, watch.
Green >60 days Healthy, no action needed.

7
Manufacturing: Primer on Predictive Maintenance

• Maintenance, repair and overhaul (MRO) of equipment is mission-critical for


Manufacturing industries
• 3 types of maintenance
• Preventive maintenance (PM) : Planned/Scheduled Maintenance
• Reactive maintenance (Run To Failure) : Maintenance after machine Runs To Failure
• Predictive maintenance (PdM) : Predict the trend and estimation when maintenance
is required
• Predictive maintenance
• Survival analysis:
• Spark ML Survival Regression Model: Accelerated Failure Model (AFT)

8
Spark ML Survival Regression Model

9
RUL Training dataset

• Turbofan Engine Degradation Simulation Data Set


A. Saxena and K. Goebel (2008 NASA Ames Prognostics
Data Repository (http://ti.arc.nasa.gov/project/prognostic-
data-repository), NASA Ames Research Center, Moffett
Field, CA
• Engine degradation Run-to-Failure simulation was carried
out using C-MAPSS (Commercial Modular Aero- Propulsion
System Simulation).

10
Trainer: OCI Data Flow application: RULSurvivalModelTrainer

11
Trainer: OCI Data Flow application: RULSurvivalModelTrainer

12
Simulator: OCI Data Flow Streaming application SensorDataSimulator

13
Simulating Sensor Data for Spark Streaming

Color RUL Status, Action


Red 10 – 30 days Degraded, perform maintenance.
Yellow 30 – 60 days Degrading, watch.
14 Green >60 days Healthy, no action needed.
Predictor: OCI Data Flow streaming application RealTimeRULPredictor

15
Real time RUL prediction using OCI Data Flow Spark Streaming

16
Real time RUL prediction using OCI Data Flow Spark Streaming

Color RUL Status, Action


Red 10 – 30 days Degraded, perform maintenance.
Yellow 30 – 60 days Degrading, watch.
Green 60 – 100 days Healthy, no action needed.

17
Demo: Predicting Remaining Useful Life (RUL) from stream data
Manufacturing
Lakehouse

OCI Streaming OCI Streaming

OCI Data Flow Autonomous


Data Warehouse

Object Storage
Data Lake
Object Storage
Data Lake

Color RUL Status, Action


Red 10 – 30 days Degraded, perform maintenance.
Yellow 30 – 60 days Degrading, watch.
Green >60 days Healthy, no action needed.

18
Spark Streaming with OCI Data Flow: Managed Spark UI view

19
Spark Oracle Datasource for connecting to Oracle DB

20
Spark Streaming with OCI Data Flow: OCI Logging integration

21
Fully managed Spark Streaming with OCI Data Flow

Data Flow continues to be the fully managed Spark experience with zero administration overhead:
• End-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead
Logs from Spark
• Cloud native authentication: Resource principal-based authentication enables Data Flow Streaming
applications to run beyond 24 hours
• Managed automatic security patching so that customers can focus building their application, not
update/ upgrade / infra operation.
• Managed automatic run resubmission by Data Flow for additional fault tolerance on top of Spark
• Deep OCI integration with OCI Logging, OCI Metrics for simpler troubleshooting, along with other
1P OCI services
• Pay for infra only: Data Flow Streaming feature will not introduce any additional meters. Customers
will continue to pay only for the infra their Data Flow runs use.

22
Resources
Manufacturing
Lakehouse

OCI Streaming OCI Streaming

OCI Data Flow Autonomous


Data Warehouse
• RUL Survival Model
Trainer
Object Storage • Real time RUL
Data Lake Predictor
Object Storage
Data Lake

Learn more
https://docs.oracle.com/en-us/iaas/data-flow/using/spark-streaming.htm
https://docs.oracle.com/en-us/iaas/data-flow/using/spark_oracle_datasource.htm
Training Dataset: https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/#turbofan

Sample code https://github.com/oracle-samples/oracle-dataflow-samples

Contact
[email protected]
23

You might also like