Devendra Tagare <devtagare@gmail.com>
Data Engineer @ DataTorrent Inc
Committer @ Apache Software Foundation for Apex
@devtagare
ApacheCon North America, 2017
Dimensions Computation With Apache Apex
What is Apex?
✓ Platform and Runtime Engine - enables development of scalable, fault-tolerant distributed applications for processing streaming and batch data
✓ Highly Scalable - scales linearly to billions of events per second with statically defined or dynamic partitioning, advanced locality & affinity
✓ Highly Performant - in-memory computations; can reach single-digit millisecond end-to-end latency
✓ Fault Tolerant - automatically recovers from failures without manual intervention
✓ Stateful - guarantees that no state will be lost
✓ YARN Native - uses the Hadoop YARN framework for resource negotiation
✓ Developer Friendly - exposes an easy API for developing Operators, which can include any custom business logic written in Java, and provides the Malhar library of many popular operators and application examples. A high-level API is available for data scientists and analysts.
Apex In the Wild
[Figure: Data Sources -> streaming computation (operators Op1 to Op4 running on Hadoop, YARN + HDFS) -> Data Targets, producing real-time analytics & visualizations and actions & insights]
The Apex Ecosystem
[Figure: layered stack diagram with the following labels]
Solutions for Business: Ingestion & Data Prep, ETL Pipelines
Tools: Real-Time Data Visualization, Management & Monitoring, GUI Application Assembly
Application Templates: Kafka-to-HDFS, JDBC-to-HDFS, HDFS-to-HDFS, S3-to-HDFS
Apex-Malhar Operator Library
High-level API: Transformation, ML & Score, SQL, Analytics, FileSync
Core: Apache Apex Core, Dev Framework, Batch Support
Big Data Infrastructure: Hadoop 2.x, YARN + HDFS, On Prem & Cloud
Application Development Model
● A Stream is a sequence of data tuples
● A typical Operator takes one or more input streams, performs computations & emits one or more output streams
■ Each Operator is YOUR custom business logic in Java, or a built-in operator from the open source Malhar library
■ An Operator can have many instances that run in parallel; each instance is single-threaded
● A Directed Acyclic Graph (DAG) is made up of operators and streams
Directed Acyclic Graph (DAG)
[Figure: a DAG of operators connected by streams; tuples flow through a filtered stream and enriched streams to an output stream]
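The operator/stream model can be illustrated in plain Java without the Apex runtime. This is only a sketch: `MiniOperator` and `MiniDag` are hypothetical names, not Apex classes, and the nested callbacks stand in for the streams that the engine would manage between operator instances.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical stand-in for an operator: consumes a tuple and may emit tuples downstream.
interface MiniOperator<I, O> {
    void process(I tuple, Consumer<O> out);
}

final class MiniDag {
    // Operator that forwards only the tuples matching a predicate.
    static <T> MiniOperator<T, T> filter(Predicate<T> keep) {
        return (tuple, out) -> {
            if (keep.test(tuple)) {
                out.accept(tuple);
            }
        };
    }

    // Operator that transforms each tuple, for example to enrich it.
    static <I, O> MiniOperator<I, O> map(Function<I, O> fn) {
        return (tuple, out) -> out.accept(fn.apply(tuple));
    }

    // Wires input -> filter -> enrich -> output; each arrow plays the role of a stream.
    static List<String> run(List<Integer> input) {
        MiniOperator<Integer, Integer> filterOp = filter(n -> n % 2 == 0);
        MiniOperator<Integer, String> enrichOp = map(n -> "even:" + n);
        List<String> output = new ArrayList<>();
        for (Integer tuple : input) {
            filterOp.process(tuple, filtered -> enrichOp.process(filtered, output::add));
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(1, 2, 3, 4)));
    }
}
```

Because the graph is acyclic, every tuple moves strictly downstream, which is what lets each operator instance stay single-threaded.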
Stream Locality
• By default, operators are deployed in containers (processes) randomly on different nodes across the Hadoop cluster
• Custom locality for streams
■ Rack local: data does not traverse network switches
■ Node local: data is passed via the loopback interface, which frees up network bandwidth
■ Container local: messages are passed between operators via in-memory queues and do not require serialization
■ Thread local: messages are passed between operators in the same thread, equivalent to calling a subsequent function on the message
Fault Tolerance
• Operator state is checkpointed to a persistent store
■ Performed automatically by the engine; no additional work needed by the operator
■ On failure, operators are restarted from the checkpointed state
■ Frequency is configurable per operator
■ Asynchronous and distributed by default
■ Default store is HDFS
• Automatic detection and recovery of failed operators
■ Heartbeat mechanism
• A buffering mechanism ensures replay of data from the recovered point, so that no data is lost
• Application master state is checkpointed
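The interplay of checkpointing and replay can be shown with a small stand-alone Java sketch. All names here (`SummingOperator`, `CheckpointDemo`, `Checkpoint`) are hypothetical, and the list of windows stands in for the upstream buffer; the real engine checkpoints asynchronously to a persistent store such as HDFS.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stateful operator: keeps a running sum of the tuples it has seen.
final class SummingOperator {
    long sum;                       // operator state
    long lastProcessedWindow = -1;  // id of the last fully processed window

    void processWindow(long windowId, List<Integer> tuples) {
        for (int t : tuples) {
            sum += t;
        }
        lastProcessedWindow = windowId;
    }
}

final class CheckpointDemo {
    // Hypothetical checkpoint: a copy of the operator state plus the window it covers.
    static final class Checkpoint {
        final long sum;
        final long windowId;

        Checkpoint(long sum, long windowId) {
            this.sum = sum;
            this.windowId = windowId;
        }
    }

    static Checkpoint save(SummingOperator op) {
        return new Checkpoint(op.sum, op.lastProcessedWindow);
    }

    // Restart from the checkpoint, then replay every buffered window after it.
    static SummingOperator recover(Checkpoint cp, List<List<Integer>> bufferedWindows) {
        SummingOperator op = new SummingOperator();
        op.sum = cp.sum;
        op.lastProcessedWindow = cp.windowId;
        for (int w = (int) cp.windowId + 1; w < bufferedWindows.size(); w++) {
            op.processWindow(w, bufferedWindows.get(w));
        }
        return op;
    }

    static long demo() {
        List<List<Integer>> windows = Arrays.asList(
            Arrays.asList(1, 2), Arrays.asList(3), Arrays.asList(4, 5));
        SummingOperator op = new SummingOperator();
        op.processWindow(0, windows.get(0));
        Checkpoint cp = save(op);            // state after window 0
        op.processWindow(1, windows.get(1));
        // Simulated crash: the in-memory operator is lost at this point.
        return recover(cp, windows).sum;     // replays windows 1 and 2
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The recovered operator ends with the same sum as an uninterrupted run, because work done after the checkpoint is redone from the buffer rather than lost.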
Processing Guarantees
At-least once
• On recovery, data is replayed from a previous checkpoint
■ Messages will not be lost
■ Default mechanism, suitable for most applications
• Can be combined with the following mechanisms to achieve exactly-once behavior in fault recovery scenarios
■ Transactions with meta information, rewinding output, feedback from an external entity, idempotent operations
At-most once
• On recovery, the latest data is made available to the operator
■ Useful where some data loss is acceptable and the latest data is sufficient
Exactly once
• At-least once + state recovery + operator logic to achieve end-to-end exactly once
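One of the mechanisms listed above, idempotent operations, can be sketched as follows. `IdempotentSink` is a hypothetical output store keyed by window id, not an Apex class; the point is that a window replayed after recovery overwrites its earlier result instead of being counted twice.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical output store keyed by window id; rewriting a window overwrites, never appends.
final class IdempotentSink {
    private final Map<Long, Long> totalsByWindow = new HashMap<>();

    // Writing the same window twice leaves the store unchanged (idempotent).
    void write(long windowId, long windowTotal) {
        totalsByWindow.put(windowId, windowTotal);
    }

    long grandTotal() {
        long total = 0;
        for (long v : totalsByWindow.values()) {
            total += v;
        }
        return total;
    }

    static long demo() {
        IdempotentSink sink = new IdempotentSink();
        sink.write(0, 3);
        sink.write(1, 3);
        // Recovery replays window 1 from the last checkpoint (at-least-once delivery).
        sink.write(1, 3);
        sink.write(2, 9);
        return sink.grandTotal();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With an append-only sink the replayed window would inflate the total; keying the write by window id gives an exactly-once result on top of at-least-once delivery.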
Apex Operator API
Input Adapters - read from external systems & emit tuples to downstream operators; no input ports
Generic Operators - process incoming data received from input adapters or other generic operators; have both input & output ports
Output Adapters - write to external systems; no output ports
Dimensions Compute Reference Architecture
[Figure: Kafka/HDFS -> Parser -> Enrich & Transform -> Dimensional Compute -> Store, with three parallel partitions of the Parser, Enrich & Transform and Dimensional Compute stages. Streams: input tuples, parsed tuples, enriched tuples, aggregates. Visualization queries enter through Query-In as aggregate queries to the Store; aggregate results return through Results as visualization results.]
Dimensional Model - Key Concepts
Metrics: pieces of information we want to collect statistics about.
Dimensions: variables that can affect our metrics.
Combinations: sets of dimensions for which one or more metrics are aggregated. They are subsets of the dimensions.
Aggregations: the aggregate functions, e.g. SUM, TOPN, standard deviation.
Time Buckets: windows of time. The aggregations for a time bucket comprise only events whose timestamp falls into that bucket.
With managed state and the high-level API, windowed operations are also supported: fixed, sliding and session windows over event time, system time or ingestion time.
Example: Ad-Tech: aggregate over key dimensions for revenue metrics
Dimensions - campaignId, advertiserId, time
Metrics - cost, revenue, clicks, impressions
Aggregate functions - SUM, AM, etc.
Combinations:
1. campaignId x time - cost, revenue
2. advertiserId - revenue, impressions
3. campaignId x advertiserId x time - revenue, clicks, impressions
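To make the terms concrete, here is a minimal plain-Java sketch that computes SUM(revenue) for a chosen combination of dimensions with a 1-hour time bucket. The `DimensionsSketch` class, its trimmed-down `AdEvent` and the string key format are assumptions made for illustration; they are not the Malhar implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DimensionsSketch {
    static final long HOUR_MILLIS = 60L * 60L * 1000L;

    // Hypothetical, trimmed-down ad event: three dimensions and one metric.
    static final class AdEvent {
        final String campaignId;
        final String advertiserId;
        final long timeMillis;
        final double revenue;

        AdEvent(String campaignId, String advertiserId, long timeMillis, double revenue) {
            this.campaignId = campaignId;
            this.advertiserId = advertiserId;
            this.timeMillis = timeMillis;
            this.revenue = revenue;
        }
    }

    // SUM(revenue) for one combination; the key holds only the chosen dimensions.
    static Map<String, Double> sumRevenue(List<AdEvent> events, List<String> combination) {
        Map<String, Double> aggregates = new TreeMap<>();
        for (AdEvent e : events) {
            StringBuilder key = new StringBuilder();
            if (combination.contains("campaignId")) {
                key.append("campaignId=").append(e.campaignId).append('|');
            }
            if (combination.contains("advertiserId")) {
                key.append("advertiserId=").append(e.advertiserId).append('|');
            }
            // The time dimension is reduced to its 1-hour bucket.
            if (combination.contains("time")) {
                key.append("hour=").append(e.timeMillis / HOUR_MILLIS).append('|');
            }
            aggregates.merge(key.toString(), e.revenue, Double::sum);
        }
        return aggregates;
    }

    static List<AdEvent> sampleEvents() {
        return Arrays.asList(
            new AdEvent("c1", "a1", 0L, 1.0),
            new AdEvent("c1", "a2", 10 * 60 * 1000L, 2.0),
            new AdEvent("c2", "a1", HOUR_MILLIS + 1, 4.0));
    }

    public static void main(String[] args) {
        System.out.println(sumRevenue(sampleEvents(), Arrays.asList("campaignId", "time")));
        System.out.println(sumRevenue(sampleEvents(), Arrays.asList("advertiserId")));
    }
}
```

Each combination yields its own set of aggregates from the same events, which is why the number of combinations and time buckets drives the aggregate footprint.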
Phases of Dimensional Compute
Aggregations in reality...
Why break dimensional compute into stages?
• Aggregate footprint in memory generally rises exponentially over time
• Scalable implementations of dimensions compute need to handle 100K+ events/sec
Phases of dimensions compute
• The pre-aggregation phase
• The unification phase
• The aggregation storage phase
The Pre-aggregation Phase
Unique Aggregates: Dimensions Computation scales by reducing the number of events entering the system.
Example: 'n' events flowing through the system translate to a lower number of unique aggregates, e.g. 500,000 adEvents translate to around 10,000 aggregates due to repeating keys.
Partitioning: use partitioning to scale up the dimensional compute.
Example: if a partition can handle 500,000 events/second, then 8 partitions can handle 4,000,000 events/second, which are effectively combined into 80,000 aggregates/second.
The problem of incomplete aggregations
• Aggregate values from previous batches are not factored in; this is corrected in the Aggregation Storage phase.
• Different partitions may share the same key and time buckets, producing partial aggregates; this is corrected in the Unification phase.
Setting up the Pre-Aggregation phase of Dimensions Computation involves configuring a Dimensions Computation operator: DimensionsComputationFlexibleSingleSchemaPOJO
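The effect of pre-aggregation across partitions can be sketched in plain Java. `PreAggregationSketch`, the round-robin distribution and the synthetic keys are assumptions for illustration only; the sketch shows many events collapsing into few aggregates, and partitions ending up with partial aggregates for the same key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PreAggregationSketch {
    // Spreads events round-robin over partitions; each partition counts events per key.
    static List<Map<String, Long>> preAggregate(List<String> eventKeys, int partitions) {
        List<Map<String, Long>> partials = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            partials.add(new HashMap<>());
        }
        for (int i = 0; i < eventKeys.size(); i++) {
            partials.get(i % partitions).merge(eventKeys.get(i), 1L, Long::sum);
        }
        return partials;
    }

    // Number of (partial) aggregates emitted by all partitions together.
    static int totalAggregates(List<Map<String, Long>> partials) {
        int n = 0;
        for (Map<String, Long> partial : partials) {
            n += partial.size();
        }
        return n;
    }

    // Hypothetical workload: 'events' events cycling over 'distinctKeys' keys.
    static List<String> sampleKeys(int events, int distinctKeys) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < events; i++) {
            keys.add("campaign-" + (i % distinctKeys));
        }
        return keys;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> partials = preAggregate(sampleKeys(1000, 20), 8);
        System.out.println("events=1000, partial aggregates=" + totalAggregates(partials));
    }
}
```

The partitions together emit more aggregates than there are distinct keys, because several partitions hold a partial value for the same key; merging those partials is the job of the unification phase.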
The Dimensional Model
Ad Event
public AdEvent(String publisherId,
               String campaignId,
               String location,
               double cost,
               double revenue,
               long impressions,
               long clicks,
               long time /* … */)
{
    this.publisherId = publisherId;
    this.campaignId = campaignId;
    this.location = location;
    this.cost = cost;
    this.revenue = revenue;
    this.impressions = impressions;
    this.clicks = clicks;
    this.time = time;
    /* … */
}
/* Getters and setters go here */
Schema
{
  "keys": [
    {"name": "campaignId", "type": "integer"},
    {"name": "adId", "type": "integer"},
    {"name": "creativeId", "type": "integer"},
    {"name": "publisherId", "type": "integer"},
    {"name": "adOrderId", "type": "integer"}
  ],
  "timeBuckets": ["1h", "1d"],
  "values": [
    {"name": "impressions", "type": "integer", "aggregators": ["SUM"]},
    {"name": "clicks", "type": "integer", "aggregators": ["SUM"]},
    {"name": "revenue", "type": "integer"}
  ],
  "dimensions": [
    {"combination": ["campaignId", "adId"]},
    {"combination": ["creativeId", "campaignId"]},
    {"combination": ["campaignId"]},
    {"combination": ["publisherId", "adOrderId", "campaignId"],
     "additionalValues": ["revenue:SUM"]}
  ]
}
The Unification Phase
Combines outputs: combines the outputs of all the partitions in the Pre-Aggregation phase into a single stream, which can be passed on to the storage phase.
Why combine?
• To reduce the number of aggregations even further: lower memory footprint, higher throughput.
• Aggregations produced by different partitions that share the same key and time bucket can be combined into a single aggregation, which gives completeness for point-to-point queries.
Example: if the Unification phase receives 80,000 aggregations/second, you can expect 20,000 aggregations/second after unification.
Implementation: add a unifier that can be set on your dimensions computation operator:
dimensions.setUnifier(new DimensionsComputationUnifierImpl<InputEvent, Aggregate>());
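What a unifier does can be sketched in a few lines of plain Java. `UnifierSketch` and its string keys of the form "key@timeBucket" are hypothetical and only mimic the idea; in an Apex application this role is played by the unifier set on the dimensions computation operator.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class UnifierSketch {
    // Merges partial aggregates that share the same key and time bucket.
    static Map<String, Long> unify(List<Map<String, Long>> partials) {
        Map<String, Long> unified = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            for (Map.Entry<String, Long> e : partial.entrySet()) {
                unified.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return unified;
    }

    static Map<String, Long> demo() {
        // Output of two hypothetical pre-aggregation partitions for the same 1h bucket.
        Map<String, Long> partition0 = new HashMap<>();
        partition0.put("c1@1h", 5L);
        partition0.put("c2@1h", 2L);
        Map<String, Long> partition1 = new HashMap<>();
        partition1.put("c1@1h", 7L);
        return unify(Arrays.asList(partition0, partition1));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Three partial aggregates go in and two come out, since both partitions contributed to the same campaign and time bucket.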
The Aggregation Storage Phase
Aggregation Persistence: aggregations are persisted to HDFS using HDHT.
The Dimensions Store persists aggregates and serves two functions:
• Storage from which aggregations can be retrieved for visualization.
• Storage that allows persisted aggregations to be combined with the incomplete aggregates produced by Unification.
Visualization
The Dimensions Store lets you visualize your aggregations over time. Queries are received from the UI, and responses are sent back to it, via websocket.
Aggregation
The store produces complete aggregations by combining the incomplete aggregations received from the Unification stage with the aggregations persisted to HDFS.
Why have the previous phases?
• The Dimensions Store is I/O intensive and may cause bottlenecks.
• The previous phases reduce the cardinality of events so that the Store always receives fewer events.
Other variants & the new way: use managed state instead of HDHT.
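The store's two roles, completing aggregates across batches and answering queries, can be sketched as below. `AggregateStoreSketch` is a hypothetical in-memory stand-in; the actual store persists to HDFS through HDHT (or managed state) and is queried over websocket.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the Dimensions Store.
final class AggregateStoreSketch {
    private final Map<String, Long> persisted = new HashMap<>();

    // Combines a batch of incomplete aggregates with what is already persisted.
    void mergeBatch(Map<String, Long> unifiedBatch) {
        for (Map.Entry<String, Long> e : unifiedBatch.entrySet()) {
            persisted.merge(e.getKey(), e.getValue(), Long::sum);
        }
    }

    // Serves a query for one key; returns 0 when nothing has been stored yet.
    long query(String key) {
        return persisted.getOrDefault(key, 0L);
    }

    static AggregateStoreSketch demo() {
        AggregateStoreSketch store = new AggregateStoreSketch();
        Map<String, Long> batch1 = new HashMap<>();
        batch1.put("c1@1h", 12L);
        Map<String, Long> batch2 = new HashMap<>();
        batch2.put("c1@1h", 3L);
        batch2.put("c2@1h", 2L);
        store.mergeBatch(batch1);
        store.mergeBatch(batch2);   // later batch for the same time bucket
        return store;
    }

    public static void main(String[] args) {
        System.out.println(demo().query("c1@1h"));
    }
}
```

Every merge touches stored state, which hints at why the store is the I/O-heavy stage and why the earlier phases aim to send it as few aggregates as possible.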
Visualization with Apex
[Figure: AdEvents over time]
Query
The browser creates a websocket connection with the pubsub server hosted by a webserver. UI widgets based on the Malhar Angular dashboard can send queries to the pubsub server over this connection, on a specific topic. These queries are parsed by the Query operator and passed on to the DimensionsStore to fetch data from the HDHT store.
QueryResult
The QueryResult operator gets the result for a given query from the DimensionsStore operator, formats it and renders it to the widget.
Sample Visualization
Q & A
Thank You !!!
Resources
Apache Apex - http://apex.apache.org/
References: http://docs.datatorrent.com/
Subscribe to forums: Apex - http://apex.apache.org/community.html
Download - http://apex.apache.org/downloads
Twitter: @ApacheApex; Follow - https://twitter.com/apacheapex
Meetups - http://meetup.com/topics/apache-apex
Tableau Cloud - what to consider before making the move update 2025.pdfTableau Cloud - what to consider before making the move update 2025.pdf
Tableau Cloud - what to consider before making the move update 2025.pdf
elinavihriala
 
Multi-Agent-Solution-Architecture-for-Unified-Loan-Platform.pptx
Multi-Agent-Solution-Architecture-for-Unified-Loan-Platform.pptxMulti-Agent-Solution-Architecture-for-Unified-Loan-Platform.pptx
Multi-Agent-Solution-Architecture-for-Unified-Loan-Platform.pptx
VikashVats1
 
Mastering Data Science: Unlocking Insights and Opportunities at Yale IT Skill...
Mastering Data Science: Unlocking Insights and Opportunities at Yale IT Skill...Mastering Data Science: Unlocking Insights and Opportunities at Yale IT Skill...
Mastering Data Science: Unlocking Insights and Opportunities at Yale IT Skill...
smrithimuralidas
 
Data Analytics and visualization-PowerBi
Data Analytics and visualization-PowerBiData Analytics and visualization-PowerBi
Data Analytics and visualization-PowerBi
Krishnapriya975316
 
llm lecture 4 stanford blah blah blah blah
llm lecture 4 stanford blah blah blah blahllm lecture 4 stanford blah blah blah blah
llm lecture 4 stanford blah blah blah blah
saud140081
 
Tableau Finland User Group June 2025.pdf
Tableau Finland User Group June 2025.pdfTableau Finland User Group June 2025.pdf
Tableau Finland User Group June 2025.pdf
elinavihriala
 
llm lecture 3 stanford blah blah blah blah
llm lecture 3 stanford blah blah blah blahllm lecture 3 stanford blah blah blah blah
llm lecture 3 stanford blah blah blah blah
saud140081
 
Math arihant handbook.pdf all formula is here
Math arihant handbook.pdf all formula is hereMath arihant handbook.pdf all formula is here
Math arihant handbook.pdf all formula is here
rdarshankumar84
 
IST606_SecurityManagement-slides_ 4 pdf
IST606_SecurityManagement-slides_ 4  pdfIST606_SecurityManagement-slides_ 4  pdf
IST606_SecurityManagement-slides_ 4 pdf
nwanjamakane
 
egc.pdf tài liệu tiếng Anh cho học sinh THPT
egc.pdf tài liệu tiếng Anh cho học sinh THPTegc.pdf tài liệu tiếng Anh cho học sinh THPT
egc.pdf tài liệu tiếng Anh cho học sinh THPT
huyenmy200809
 
Comprehensive Roadmap of AI, ML, DS, DA & DSA.pdf
Comprehensive Roadmap of AI, ML, DS, DA & DSA.pdfComprehensive Roadmap of AI, ML, DS, DA & DSA.pdf
Comprehensive Roadmap of AI, ML, DS, DA & DSA.pdf
epsilonice
 
EPC UNIT-V forengineeringstudentsin.pptx
EPC UNIT-V forengineeringstudentsin.pptxEPC UNIT-V forengineeringstudentsin.pptx
EPC UNIT-V forengineeringstudentsin.pptx
ExtremerZ
 
Glary Utilities Pro 5.157.0.183 Crack + Key Download [Latest]
Glary Utilities Pro 5.157.0.183 Crack + Key Download [Latest]Glary Utilities Pro 5.157.0.183 Crack + Key Download [Latest]
Glary Utilities Pro 5.157.0.183 Crack + Key Download [Latest]
Designer
 
Human body make Structure analysis the part of the human
Human body make Structure analysis the part of the humanHuman body make Structure analysis the part of the human
Human body make Structure analysis the part of the human
ankit392215
 
How Data Annotation Services Drive Innovation in Autonomous Vehicles.docx
How Data Annotation Services Drive Innovation in Autonomous Vehicles.docxHow Data Annotation Services Drive Innovation in Autonomous Vehicles.docx
How Data Annotation Services Drive Innovation in Autonomous Vehicles.docx
sofiawilliams5966
 
Geospatial Data_ Unlocking the Power for Smarter Urban Planning.docx
Geospatial Data_ Unlocking the Power for Smarter Urban Planning.docxGeospatial Data_ Unlocking the Power for Smarter Urban Planning.docx
Geospatial Data_ Unlocking the Power for Smarter Urban Planning.docx
sofiawilliams5966
 
How to Choose the Right Online Proofing Software
How to Choose the Right Online Proofing SoftwareHow to Choose the Right Online Proofing Software
How to Choose the Right Online Proofing Software
skalatskayaek
 
1022_ExtendEnrichExcelUsingPythonWithTableau_04_16+04_17 (1).pdf
1022_ExtendEnrichExcelUsingPythonWithTableau_04_16+04_17 (1).pdf1022_ExtendEnrichExcelUsingPythonWithTableau_04_16+04_17 (1).pdf
1022_ExtendEnrichExcelUsingPythonWithTableau_04_16+04_17 (1).pdf
elinavihriala
 
15 Benefits of Data Analytics in Business Growth.pdf
15 Benefits of Data Analytics in Business Growth.pdf15 Benefits of Data Analytics in Business Growth.pdf
15 Benefits of Data Analytics in Business Growth.pdf
AffinityCore
 
Understanding Tree Data Structure and Its Applications
Understanding Tree Data Structure and Its ApplicationsUnderstanding Tree Data Structure and Its Applications
Understanding Tree Data Structure and Its Applications
M Munim
 
Tableau Cloud - what to consider before making the move update 2025.pdf
Tableau Cloud - what to consider before making the move update 2025.pdfTableau Cloud - what to consider before making the move update 2025.pdf
Tableau Cloud - what to consider before making the move update 2025.pdf
elinavihriala
 
Multi-Agent-Solution-Architecture-for-Unified-Loan-Platform.pptx
Multi-Agent-Solution-Architecture-for-Unified-Loan-Platform.pptxMulti-Agent-Solution-Architecture-for-Unified-Loan-Platform.pptx
Multi-Agent-Solution-Architecture-for-Unified-Loan-Platform.pptx
VikashVats1
 
Mastering Data Science: Unlocking Insights and Opportunities at Yale IT Skill...
Mastering Data Science: Unlocking Insights and Opportunities at Yale IT Skill...Mastering Data Science: Unlocking Insights and Opportunities at Yale IT Skill...
Mastering Data Science: Unlocking Insights and Opportunities at Yale IT Skill...
smrithimuralidas
 
Data Analytics and visualization-PowerBi
Data Analytics and visualization-PowerBiData Analytics and visualization-PowerBi
Data Analytics and visualization-PowerBi
Krishnapriya975316
 
llm lecture 4 stanford blah blah blah blah
llm lecture 4 stanford blah blah blah blahllm lecture 4 stanford blah blah blah blah
llm lecture 4 stanford blah blah blah blah
saud140081
 
Tableau Finland User Group June 2025.pdf
Tableau Finland User Group June 2025.pdfTableau Finland User Group June 2025.pdf
Tableau Finland User Group June 2025.pdf
elinavihriala
 
llm lecture 3 stanford blah blah blah blah
llm lecture 3 stanford blah blah blah blahllm lecture 3 stanford blah blah blah blah
llm lecture 3 stanford blah blah blah blah
saud140081
 
Math arihant handbook.pdf all formula is here
Math arihant handbook.pdf all formula is hereMath arihant handbook.pdf all formula is here
Math arihant handbook.pdf all formula is here
rdarshankumar84
 
IST606_SecurityManagement-slides_ 4 pdf
IST606_SecurityManagement-slides_ 4  pdfIST606_SecurityManagement-slides_ 4  pdf
IST606_SecurityManagement-slides_ 4 pdf
nwanjamakane
 
egc.pdf tài liệu tiếng Anh cho học sinh THPT
egc.pdf tài liệu tiếng Anh cho học sinh THPTegc.pdf tài liệu tiếng Anh cho học sinh THPT
egc.pdf tài liệu tiếng Anh cho học sinh THPT
huyenmy200809
 
Comprehensive Roadmap of AI, ML, DS, DA & DSA.pdf
Comprehensive Roadmap of AI, ML, DS, DA & DSA.pdfComprehensive Roadmap of AI, ML, DS, DA & DSA.pdf
Comprehensive Roadmap of AI, ML, DS, DA & DSA.pdf
epsilonice
 
EPC UNIT-V forengineeringstudentsin.pptx
EPC UNIT-V forengineeringstudentsin.pptxEPC UNIT-V forengineeringstudentsin.pptx
EPC UNIT-V forengineeringstudentsin.pptx
ExtremerZ
 
Glary Utilities Pro 5.157.0.183 Crack + Key Download [Latest]
Glary Utilities Pro 5.157.0.183 Crack + Key Download [Latest]Glary Utilities Pro 5.157.0.183 Crack + Key Download [Latest]
Glary Utilities Pro 5.157.0.183 Crack + Key Download [Latest]
Designer
 
Human body make Structure analysis the part of the human
Human body make Structure analysis the part of the humanHuman body make Structure analysis the part of the human
Human body make Structure analysis the part of the human
ankit392215
 
How Data Annotation Services Drive Innovation in Autonomous Vehicles.docx
How Data Annotation Services Drive Innovation in Autonomous Vehicles.docxHow Data Annotation Services Drive Innovation in Autonomous Vehicles.docx
How Data Annotation Services Drive Innovation in Autonomous Vehicles.docx
sofiawilliams5966
 
Geospatial Data_ Unlocking the Power for Smarter Urban Planning.docx
Geospatial Data_ Unlocking the Power for Smarter Urban Planning.docxGeospatial Data_ Unlocking the Power for Smarter Urban Planning.docx
Geospatial Data_ Unlocking the Power for Smarter Urban Planning.docx
sofiawilliams5966
 
How to Choose the Right Online Proofing Software
How to Choose the Right Online Proofing SoftwareHow to Choose the Right Online Proofing Software
How to Choose the Right Online Proofing Software
skalatskayaek
 
1022_ExtendEnrichExcelUsingPythonWithTableau_04_16+04_17 (1).pdf
1022_ExtendEnrichExcelUsingPythonWithTableau_04_16+04_17 (1).pdf1022_ExtendEnrichExcelUsingPythonWithTableau_04_16+04_17 (1).pdf
1022_ExtendEnrichExcelUsingPythonWithTableau_04_16+04_17 (1).pdf
elinavihriala
 
15 Benefits of Data Analytics in Business Growth.pdf
15 Benefits of Data Analytics in Business Growth.pdf15 Benefits of Data Analytics in Business Growth.pdf
15 Benefits of Data Analytics in Business Growth.pdf
AffinityCore
 

Actionable Insights with Apache Apex at Apache Big Data 2017 by Devendra Tagare

  • 1. Devendra Tagare <devtagare@gmail.com>, Data Engineer @ DataTorrent Inc, Committer @ Apache Software Foundation for Apex, @devtagare. ApacheCon North America, 2017. Dimensions Computation With Apache Apex
  • 2. What is Apex?
       ✓ Platform and Runtime Engine - enables development of scalable and fault-tolerant distributed applications for processing streaming and batch data
       ✓ Highly Scalable - scales linearly to billions of events per second with statically defined or dynamic partitioning, advanced locality & affinity
       ✓ Highly Performant - in-memory computations; can reach single-digit millisecond end-to-end latency
       ✓ Fault Tolerant - automatically recovers from failures, without manual intervention
       ✓ Stateful - guarantees that no state will be lost
       ✓ YARN Native - uses the Hadoop YARN framework for resource negotiation
       ✓ Developer Friendly - exposes an easy API for developing Operators, which can include any custom business logic written in Java, and provides the Malhar library of many popular operators and application examples. High-level API for data scientists/analysts.
  • 3. Apex In the Wild
       [Figure labels: Data Sources; Streaming Computation (Op1, Op2, Op3, Op4) on Hadoop (YARN + HDFS); Real-time Analytics & Visualizations; Actions & Insights; Data Targets]
  • 4. The Apex Ecosystem
       [Figure labels: Solutions for Business; Ingestion & Data Prep; ETL Pipelines; Tools; Real-Time Data Visualization; Management & Monitoring; GUI Application Assembly; Application Templates; Apex-Malhar Operator Library; Big Data Infrastructure; Hadoop 2.x – YARN + HDFS – On Prem & Cloud; Core; High-level API; Transformation; ML & Score; SQL; Analytics; FileSync; Dev Framework; Batch Support; Apache Apex Core; Kafka-to-HDFS; JDBC-to-HDFS; HDFS-to-HDFS; S3-to-HDFS]
  • 5. Application Development Model
       ● A Stream is a sequence of data tuples
       ● A typical Operator takes one or more input streams, performs computations & emits one or more output streams
         ■ Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
         ■ An Operator has many instances that run in parallel, and each instance is single-threaded
       ● A Directed Acyclic Graph (DAG) is made up of operators and streams
       [Figure: Directed Acyclic Graph (DAG) of operators exchanging tuples over filtered, enriched, and output streams]
  • 6. Stream Locality
       • By default, operators are deployed in containers (processes) randomly on different nodes across the Hadoop cluster
       • Custom locality for streams
         Rack local: data does not traverse network switches
         Node local: data is passed via the loopback interface, which frees up network bandwidth
         Container local: messages are passed via in-memory queues between operators and do not require serialization
         Thread local: messages are passed between operators in the same thread, equivalent to calling a subsequent function on the message
  • 7. Fault Tolerance
       • Operator state is checkpointed to a persistent store
         Automatically performed by the engine, no additional work needed by the operator
         In case of failure, operators are restarted from the checkpointed state
         Frequency configurable per operator
         Asynchronous and distributed by default
         Default store is HDFS
       • Automatic detection and recovery of failed operators
         Heartbeat mechanism
       • Buffering mechanism to ensure replay of data from the recovered point so that there is no loss of data
       • Application master state is checkpointed
  • 8. Processing Guarantees
       At-least once
       • On recovery, data will be replayed from a previous checkpoint
         Messages will not be lost
         Default mechanism, suitable for most applications
       • Can be used in conjunction with the following mechanisms to achieve exactly-once behavior in fault recovery scenarios
         Transactions with meta information, rewinding output, feedback from an external entity, idempotent operations
       At-most once
       • On recovery, the latest data is made available to the operator
         Useful in use cases where some data loss is acceptable and the latest data is sufficient
       Exactly once
       • At-least once + state recovery + operator logic to achieve end-to-end exactly once
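To illustrate the "idempotent operations" route to exactly-once behavior listed on slide 8, here is a minimal plain-Java sketch; it is not Apex API code. It assumes every output batch is tagged with a monotonically increasing window id, and the class `IdempotentSink` and its methods are hypothetical names chosen for this example. When at-least-once recovery replays a window that was already committed, the sink skips it, so the stored result matches a single write per window.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sink used only for illustration; not part of Apex or Malhar.
final class IdempotentSink {
    private long lastCommittedWindow = -1;           // highest window id already written
    private final List<String> store = new ArrayList<>();

    /** Writes a window's batch once; returns false when the window is a replay. */
    boolean writeWindow(long windowId, List<String> batch) {
        if (windowId <= lastCommittedWindow) {
            return false;                            // replayed window: ignore it
        }
        store.addAll(batch);
        lastCommittedWindow = windowId;
        return true;
    }

    List<String> contents() {
        return store;
    }

    public static void main(String[] args) {
        IdempotentSink sink = new IdempotentSink();
        sink.writeWindow(1, Arrays.asList("a", "b"));
        sink.writeWindow(2, Arrays.asList("c"));
        // Simulated recovery: window 2 is replayed from the last checkpoint.
        boolean written = sink.writeWindow(2, Arrays.asList("c"));
        System.out.println("replay written? " + written + ", stored=" + sink.contents());
    }
}
```

In a real pipeline the data and the last committed window id would need to be persisted atomically (the "transactions with meta information" option on the slide); the in-memory list here only stands in for that store.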
  • 9. Apex Operator API
       Input Adapters - read from external systems & emit tuples to downstream operators; no input ports
       Generic Operators - process incoming data received from input adapters or other generic operators; have both input & output ports
       Output Adapters - write to external systems; no output ports
  • 10. Dimensions Compute Reference Architecture
       [Figure: Kafka/HDFS → Parser → Enrich & Transform → Dimensional Compute (three parallel instances of each stage) → Store. Streams carry input tuples, parsed tuples, enriched tuples, and aggregates. Query-In and Results operators connect the Store to Visualization, exchanging visualization queries/results and aggregate queries/results.]
  • 11. Dimensional Model - Key Concepts
       Metrics: pieces of information we want to collect statistics about.
       Dimensions: variables which can impact our measures.
       Combinations: sets of dimensions for which one or more metrics are aggregated. They are subsets of the dimensions.
       Aggregations: the aggregate functions, e.g. SUM, TOPN, standard deviation.
       Time Buckets: windows of time. Aggregations for a time bucket comprise only events whose timestamp falls into that time bucket.
       With managed state and the high-level API, windowed operations are also supported: fixed, sliding, and session windows over event time, system time, or ingestion time.
       Example (Ad-Tech): aggregate over key dimensions for revenue metrics
         Dimensions - campaignId, advertiserId, time
         Metrics - cost, revenue, clicks, impressions
         Aggregate functions - SUM, AM, etc.
         Combinations:
           1. campaignId x time - cost, revenue
           2. advertiser - revenue, impressions
           3. campaignId x advertiser x time - revenue, clicks, impressions
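The time-bucket rule on slide 11 (an event belongs to the bucket its timestamp falls into) can be sketched in a few lines of plain Java. This is an illustration under assumptions, not code from Apex: the class `TimeBuckets`, the method `bucketStart`, and the choice of hourly and daily bucket widths are all made up for the example, and buckets are aligned to the Unix epoch in UTC.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not part of Apex or Malhar.
final class TimeBuckets {
    /** Returns the start (epoch millis) of the bucket that contains the timestamp. */
    static long bucketStart(long timestampMillis, long bucketMillis) {
        // floorDiv keeps the result correct for timestamps before the epoch as well
        return Math.floorDiv(timestampMillis, bucketMillis) * bucketMillis;
    }

    public static void main(String[] args) {
        long hour = TimeUnit.HOURS.toMillis(1);
        long day = TimeUnit.DAYS.toMillis(1);
        long eventTime = System.currentTimeMillis();
        // Two events share an aggregate only if their bucket starts are equal.
        System.out.println("1h bucket starts at " + bucketStart(eventTime, hour));
        System.out.println("1d bucket starts at " + bucketStart(eventTime, day));
    }
}
```

Pairing the bucket start with the dimension values of a combination (for example campaignId x time) gives the key under which a metric such as revenue is summed.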
  • 12. Phases of Dimensional Compute
       Aggregations in reality...
       Why break dimensional compute into stages?
         Aggregate footprint in memory generally rises exponentially over time
         Scalable implementations of dimensions compute need to handle 100K+ events/sec
       Phases of dimensions compute
         The pre-aggregation phase
         The unification phase
         The aggregation storage phase
  • 13. The Pre-aggregation Phase
       Unique Aggregates: Dimensions Computation scales by reducing the number of events entering the system.
         Example: 'n' events flowing through the system translate to a lower number of unique aggregates, e.g. 500,000 adEvents translate to around 10,000 aggregates due to repeating keys.
       Partitioning: use partitioning to scale up the dimensional compute.
         Example: if a partition can handle 500,000 events/second, then 8 partitions can handle 4,000,000 events/second, which are effectively combined into 80,000 aggregates/second (8 x 10,000).
       The problem of incomplete aggregations
         Aggregate values from previous batches are not factored in - corrected in the Aggregation Storage phase.
         Different partitions may share the same key and time buckets, producing partial aggregates - corrected in the Unification phase.
       Setting up the Pre-Aggregation phase of Dimensions Computation involves configuring a Dimensions Computation operator - DimensionsComputationFlexibleSingleSchemaPOJO
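The reduction described on slide 13 (many events collapsing into far fewer unique aggregates because keys repeat) can be sketched in plain Java as a grouped SUM. This is only a sketch under assumptions: it does not use DimensionsComputationFlexibleSingleSchemaPOJO, the classes `PreAggregation`, `Event`, and `Aggregate` are hypothetical, and the key is simply campaignId plus the start of a fixed-width time bucket.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical classes for illustration; not part of Apex or Malhar.
final class PreAggregation {
    /** Minimal stand-in for an ad event; only the fields this sketch needs. */
    static final class Event {
        final String campaignId;
        final long timeMillis;
        final double revenue;

        Event(String campaignId, long timeMillis, double revenue) {
            this.campaignId = campaignId;
            this.timeMillis = timeMillis;
            this.revenue = revenue;
        }
    }

    /** Running SUM for one (campaignId, time bucket) key. */
    static final class Aggregate {
        double revenue;
        long count;
    }

    /** Groups events by campaignId and time bucket, summing revenue per key. */
    static Map<String, Aggregate> preAggregate(List<Event> events, long bucketMillis) {
        Map<String, Aggregate> out = new HashMap<>();
        for (Event e : events) {
            long bucket = Math.floorDiv(e.timeMillis, bucketMillis) * bucketMillis;
            String key = e.campaignId + "|" + bucket;
            Aggregate a = out.computeIfAbsent(key, k -> new Aggregate());
            a.revenue += e.revenue;
            a.count++;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> events = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            // 10 distinct campaigns, all inside the same one-hour bucket
            events.add(new Event("c" + (i % 10), i, 1.0));
        }
        Map<String, Aggregate> aggregates = preAggregate(events, 3_600_000L);
        System.out.println(events.size() + " events -> " + aggregates.size() + " aggregates");
    }
}
```

Each partition would run this step independently, which is why two partitions can hold partial aggregates for the same key and time bucket.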
  • 14. The Dimensional Model
       Ad Event
         public AdEvent(String publisherId, String campaignId, String location, double cost, double revenue, long impressions, long clicks, long time….) {
           this.publisherId = publisherId;
           this.campaignId = campaignId;
           this.location = location;
           this.cost = cost;
           this.revenue = revenue;
           this.impressions = impressions;
           this.clicks = clicks;
           this.time = time;
           ….
         }
         /* Getters and setters go here */
       Schema
         {"keys":[{"name":"campaignId","type":"integer"},
                  {"name":"adId","type":"integer"},
                  {"name":"creativeId","type":"integer"},
                  {"name":"publisherId","type":"integer"},
                  {"name":"adOrderId","type":"integer"}],
          "timeBuckets":["1h","1d"],
          "values":[{"name":"impressions","type":"integer","aggregators":["SUM"]},
                    {"name":"clicks","type":"integer","aggregators":["SUM"]},
                    {"name":"revenue","type":"integer"}],
          "dimensions":[{"combination":["campaignId","adId"]},
                        {"combination":["creativeId","campaignId"]},
                        {"combination":["campaignId"]},
                        {"combination":["publisherId","adOrderId","campaignId"],
                         "additionalValues":["revenue:SUM"]}]
         }
  • 15. The Unification Phase
       Combines outputs: combines the outputs of all the partitions in the Pre-Aggregation phase into a single stream which can be passed on to the storage phase.
       Why combine? To reduce the number of aggregations even further ~ lower memory footprint, higher throughput. Aggregations produced by different partitions which share the same key and time bucket can be combined to produce a single aggregation ~ completeness for point-to-point queries.
       Example: if the Unification phase receives 80,000 aggregations/second, you can expect 20,000 aggregations/second after unification.
       Implementation: add a unifier that can be set on your dimensions computation operator:
         dimensions.setUnifier(new DimensionsComputationUnifierImpl<InputEvent, Aggregate>());
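What unification does conceptually (merging partial aggregates that share a key and time bucket across partitions) can be sketched in plain Java. This is an illustration only and does not reproduce DimensionsComputationUnifierImpl: the class `UnifierSketch`, the `unify` method, the string key format, and the restriction to a SUM aggregate are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class for illustration; not the Malhar unifier.
final class UnifierSketch {
    /** Merges per-partition partial SUMs into one aggregate per key. */
    static Map<String, Double> unify(List<Map<String, Double>> partials) {
        Map<String, Double> merged = new HashMap<>();
        for (Map<String, Double> partition : partials) {
            for (Map.Entry<String, Double> e : partition.entrySet()) {
                // Same key and time bucket seen in several partitions: add the sums
                merged.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Double> p1 = new HashMap<>();
        p1.put("campaign1|0", 2.0);
        p1.put("campaign2|0", 1.0);
        Map<String, Double> p2 = new HashMap<>();
        p2.put("campaign1|0", 3.0);

        List<Map<String, Double>> partials = new ArrayList<>();
        partials.add(p1);
        partials.add(p2);

        // Three partial aggregates go in; one aggregate per distinct key comes out.
        System.out.println(unify(partials));
    }
}
```

SUM merges by simple addition; an aggregate such as an average would need to carry a sum and a count through this step so that partial results can still be combined correctly.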
  • 16. The Aggregation Storage Phase
       Aggregation Persistence: aggregations are persisted to HDFS using HDHT. The Dimensions Store persists aggregates and serves the following functions:
         Storage from which aggregations can be retrieved for visualization.
         Storage that allows persisted aggregations to be combined with the incomplete aggregates produced by Unification.
       Visualization: the Dimensions Store allows you to visualize your aggregations over time. Queries are received from, and responses sent to, the UI via websocket.
       Aggregation: the store produces complete aggregations by combining the incomplete aggregations received from the Unification stage with aggregations persisted to HDFS.
       Why have the previous phases? The Dimensions Store is I/O intensive and may cause bottlenecks. The previous phases reduce the cardinality of events so that the Store always receives fewer events.
       Other variants & the new way: use managed state instead of HDHT.
  • 17. Visualization with Apex
       [Figure: AdEvents over time]
       Query: the browser creates a websocket connection with the pubsub server hosted by a webserver. UI widgets based on the Malhar Angular dashboard can send queries to the pubsub server via this connection to a specific topic. These queries are parsed by the Query operator and passed on to the DimensionsStore to fetch data from the HDHT store.
       QueryResult: the QueryResult operator gets the result from the DimensionsStore operator for a given query, then formats and renders it to the widget.
  • 19. Q & A Thank You !!!
  • 20. Resources
       Apache Apex - http://apex.apache.org/
       References: http://docs.datatorrent.com/
       Subscribe to forums: Apex - http://apex.apache.org/community.html
       Download - http://apex.apache.org/downloads
       Twitter: @ApacheApex; Follow - https://twitter.com/apacheapex
       Meetups - http://meetup.com/topics/apache-apex