Privileged and confidential
Open Blueprint for Real-Time Analytics
with In-Stream Processing
Victoria Livschitz, Founder & CTO, Grid Dynamics
09/28/2017
Business Need
About the speaker: Victoria Livschitz
• CTO @Grid Dynamics: present
• Founder and CEO @Grid Dynamics: 2006 – 2013
• Principal engineer @Sun: 1997 – 2006
About Grid Dynamics:
• Engineering IT services company focused on digital transformation through cloud, big data & open source for Fortune 500 clients.
• Pioneer in real-time processing from company’s inception in 2006.
• Architected 3 out of top-10 busiest e-commerce sites. Never had production outage in peak season.
• Frequent contributor to open source projects: Hadoop, Solr, Lucene, Storm, others.
Agenda
• What is “real-time” in analytics, and why it matters
• In-Stream Processing: emerging platform for real-time processing
• Open ISP blueprint: reference architecture, reference implementation
• Take ISP for a spin: reference demo of real-time twitter sentiment analysis
What is “real-time”, anyways?
What is “real-time” in analytics, machine learning, data sciences & AI?
[Diagram: Event → Receive event → Analyze event → Act on event → Response, with an Augment model step; the stages are labeled Learning and Analysis]
• How long is the cycle?
• What is done online vs. offline?
[Diagram: the same event cycle placed on a time scale of Weeks, Days, Hours, Seconds]
Value of “real-time”
1. Offline learning/analytics, online response
   • Online (a few seconds): Event → Receive event → Act on event → Response
   • Offline (a day): Receive event → Analyze event → Augment model → Modify reaction
2. Offline learning, real-time analytics, online response
   • Online (a few seconds): Event → Receive event → Analyze event → Act on event → Response
   • Offline (a day): Receive event → Augment model
3. Real-time learning/analytics, online response
   • Online (a few seconds): Event → Receive event → Analyze event → Act on event → Augment model → Response
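The difference between offline and real-time learning in the cycles above can be sketched in a few lines of Python. This is a toy illustration only: `TrendModel`, `run_cycle` and the item names are hypothetical and not part of the blueprint. The "model" simply counts items, and the offline mode stands in for a daily batch job by augmenting the model only after the stream ends.

```python
from collections import Counter
from typing import Iterable, List


class TrendModel:
    """Toy model (hypothetical): counts how often each item was seen."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def augment(self, item: str) -> None:
        # "Augment model": learn from one event.
        self.counts[item] += 1

    def top_item(self) -> str:
        # "Analyze event": pick the most frequently seen item so far.
        if not self.counts:
            return "none"
        return self.counts.most_common(1)[0][0]


def run_cycle(events: Iterable[str], realtime_learning: bool) -> List[str]:
    """Return the response produced for each incoming event.

    realtime_learning=True:  the model is augmented inside the online loop.
    realtime_learning=False: events are buffered and the model is augmented
                             only after the stream ends (stand-in for a
                             daily batch job).
    """
    model = TrendModel()
    backlog: List[str] = []
    responses: List[str] = []
    for item in events:                      # receive event
        responses.append(model.top_item())   # analyze + act -> response
        if realtime_learning:
            model.augment(item)              # augment model within seconds
        else:
            backlog.append(item)             # defer learning to the batch job
    for item in backlog:                     # offline step, "a day" later
        model.augment(item)
    return responses


if __name__ == "__main__":
    stream = ["umbrella", "umbrella", "poncho"]
    print(run_cycle(stream, realtime_learning=True))
    print(run_cycle(stream, realtime_learning=False))
```

With real-time learning the responses start reflecting the trend during the stream; with offline learning they stay unchanged until the batch job has run.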
Where real-time matters in retail?
Classification of retail use cases relative to “real-timeness”
Example: Personalized Offers
• Level 1: Segmented historic context: data on what happened to all such customers before
  From time to time, send a coupon based on a segment
• Level 2: Individualized historic context: 360-degree view across personal data
  On a birthday, offer a coupon based on personal history
• Level 3: Situational context: where customer is, what she wants – or might buy – right now
  Right now, offer a product based on what’s in her hands
• Level 4: Supply chain dynamics: demand surge, product availability, competitive pricing
  During a storm, deliver trending umbrella/poncho combo
Classification of retail use cases relative to “real-timeness”
• Level 1: Segmented historic context: data on what happened to all such customers before (historic aggregated data)
• Level 2: Individualized historic context: 360-degree view across individual’s data (historic individual’s data)
• Level 3: Situational context: where customer is, what she wants – or might buy – right now (real-time individual’s data)
• Level 4: Supply/demand dynamics: impact of demand surge, shortage, competitive actions... (real-time everything)
Levels 1 and 2: suited for offline ML
Levels 3 and 4: requires real-time ML
Top 6 drivers of real-time applications in retail
#1. Personalized search
Augment search hits and relevancy ranking based on personal context & history
#2. Personalized offers
Motivate “buy now” behavior by offering deals based on personal context & history
#3. Dynamic pricing
Determine “right price” for products based on availability, trending, personal context & competitive price
#4. Dynamic inventory
Predict inventory needs & re-stock products in stores based on fluctuations in inventory & demand
#5. Intelligent sourcing
Determine what order to source from what store to optimize delivery SLAs & shipment costs
#6. Real-time alerts
Detect unusual patterns: fraud, surge in demand, weather changes, shift in brand sentiment. Respond right away
Emerging platform for real-time analytics: In-Stream Processing (ISP)
In a complex landscape of Big Data systems…
…In-Stream Processing (ISP) service is an approach
to build real-time extensions of Big Data applications
Today’s
focus
ISP is ideal for:
• Real-time data ingress to replace batch ETLs
• Real-time identification of one-in-a-million “actionable insights”
• Real-time response to actionable insights
• Real-time learning from new data
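One possible reading of "real-time identification of actionable insights" is flagging outliers as they arrive. The sketch below is an assumption-laden illustration, not the blueprint's method: it keeps running statistics with Welford's algorithm and flags a value whose z-score against everything seen so far exceeds a threshold. `RunningStats`, `detect`, `threshold` and `warmup` are hypothetical names and parameters.

```python
import math
from typing import Iterable, List


class RunningStats:
    """Running mean and variance via Welford's algorithm (one pass, O(1) memory)."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self) -> float:
        # Sample standard deviation; 0.0 until at least two values were seen.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def detect(values: Iterable[float], threshold: float = 3.0, warmup: int = 5) -> List[int]:
    """Return the indexes of values that deviate strongly from the history so far."""
    stats = RunningStats()
    alerts: List[int] = []
    for i, x in enumerate(values):
        s = stats.std()
        # Only judge a value once enough history exists and the spread is non-zero.
        if stats.n >= warmup and s > 0 and abs(x - stats.mean) / s > threshold:
            alerts.append(i)
        stats.update(x)  # learn from the new value after scoring it
    return alerts


if __name__ == "__main__":
    print(detect([10, 11, 9, 10, 11, 10, 50, 10]))
```

Each value is scored before it is folded into the statistics, so a spike is compared against the history that preceded it.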
Conceptual architecture
ISP pipelines: complex behavior with simple steps
Easy to write, change or add a step
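A minimal sketch of what "complex behavior with simple steps" can look like, assuming each step is a plain function from one event stream to another. The step names (`parse`, `drop_empty`, `tag_length`) and the `user: text` input format are made up for illustration; a real ISP pipeline would run on a streaming engine rather than in-process generators.

```python
from functools import reduce
from typing import Callable, Dict, Iterable, Iterator

Event = Dict[str, object]
Step = Callable[[Iterator], Iterator]


def parse(lines: Iterator[str]) -> Iterator[Event]:
    # Step 1: turn "user: text" lines into structured events.
    for line in lines:
        user, text = line.split(":", 1)
        yield {"user": user.strip(), "text": text.strip()}


def drop_empty(events: Iterator[Event]) -> Iterator[Event]:
    # Step 2: filter out events that carry no text.
    return (e for e in events if e["text"])


def tag_length(events: Iterator[Event]) -> Iterator[Event]:
    # Step 3: enrich each event with a derived field.
    for e in events:
        yield {**e, "length": len(str(e["text"]))}


def build_pipeline(*steps: Step) -> Callable[[Iterable], Iterator]:
    """Chain steps left to right; adding or swapping a step is a one-line change."""
    def run(source: Iterable) -> Iterator:
        return reduce(lambda stream, step: step(stream), steps, iter(source))
    return run


if __name__ == "__main__":
    pipeline = build_pipeline(parse, drop_empty, tag_length)
    for event in pipeline(["ann: hello", "bob:   "]):
        print(event)
```

Because every step has the same shape, a new step can be written and tested in isolation, then added to the `build_pipeline` call.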
ISP marketplace: build vs. buy
Grid Dynamics open blueprint for ISP
Blueprint goals
• Scalable to 100,000+/second
• Real-time streaming; real-time ML
• Cloud-ready
• Proven for mission-critical use
• Open source (and built 100% with open source)
• Production-ready
• Portable across clouds
• Extendable
Selected stack for ISP blueprint
• REST API
• Message Queue
• HDFS
• Other
Designed as a complete platform
• No single points of failure
• No bottlenecks
• Built-in scaling
• Dockerized
• Deployable to any cloud
• Reference implementation for AWS (open source)
• Reference demo: real-time Twitter sentiment analytics for new movie reviews
ISP reference implementation: fully-automated DevOps stack for running ISP on any modern cloud
How to achieve cloud portability?
• Phase 1: bootstrap management cluster
   • [manual] Choose a cloud. Get a set of VMs (6) to host management cluster
   • [automated] Deploy & configure Mesos/Marathon cluster on available VMs
• Phase 2: use management cluster to provision ISP environments
   • [automated] Deploy all ISP components as Docker containers
   • [automated] Deploy analytics application components (like Twitter API)
   • [automated] Configure all dependencies
   • [automated] Scale on-demand
   • [automated] Shut down when done
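As a rough illustration of Phase 2, the sketch below builds the JSON bodies that could be sent to Marathon's REST API to deploy Docker containers and scale them. It is not taken from the reference implementation: the component ids, image names and resource sizes are hypothetical, and the code only constructs payloads without contacting a cluster. It assumes Marathon's app definition fields `id`, `instances`, `cpus`, `mem` and `container.docker.image`, with apps created via `POST /v2/apps` and scaled via `PUT /v2/apps/{id}`.

```python
import json
from typing import Dict, List, Tuple


def marathon_app(app_id: str, image: str, instances: int = 1,
                 cpus: float = 0.5, mem: int = 512) -> Dict:
    """Build a Marathon app definition that runs one Docker image."""
    return {
        "id": app_id,
        "instances": instances,
        "cpus": cpus,
        "mem": mem,  # megabytes
        "container": {"type": "DOCKER", "docker": {"image": image}},
    }


# Hypothetical components: (app id, Docker image, instance count).
ISP_COMPONENTS: List[Tuple[str, str, int]] = [
    ("/isp/message-queue", "example/message-queue:latest", 3),
    ("/isp/stream-processor", "example/stream-processor:latest", 2),
    ("/isp/twitter-reader", "example/twitter-reader:latest", 1),
]


def provision_plan(components: List[Tuple[str, str, int]]) -> List[Dict]:
    """One app definition per component; each would be POSTed to /v2/apps."""
    return [marathon_app(app_id, image, n) for app_id, image, n in components]


def scale_request(instances: int) -> Dict:
    """Body for scaling an existing app, sent with PUT to /v2/apps/{id}."""
    return {"instances": instances}


if __name__ == "__main__":
    for app in provision_plan(ISP_COMPONENTS):
        print(json.dumps(app, indent=2))
    print(json.dumps(scale_request(5)))
```

Keeping the environment as data (a list of components) is what lets the same plan be replayed against a management cluster on a different cloud.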
Topology with twitter data analytics demo
“Take ISP for a spin” demo: Real-time Twitter sentiment analytics for new movie reviews
Real-time demo, a.k.a. “Data Science Kitchen”
• Provide reference example on how to use ISP platform…
• ... and learn the basics of data science along the way
• Gets actual Twitter data via streaming API
• Analyses & visualizes what people think about latest movies
• Exposes data science “kitchen”: models, training sets, dictionaries
• Provides nice web UI to play with data
• Uses our ISP RI (reference implementation)
• Demo is running on AWS as a public service
• Everything is open sourced
• Documentation on our Tech Blog
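To give a feel for the dictionary side of the "kitchen", here is a deliberately simple lexicon-based scorer. It is not the demo's model: the word lists, the `score`, `label` and `tally` functions, and the movie names are assumptions made for illustration, and a real system would use trained models and much larger dictionaries.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Tiny hypothetical dictionaries; real ones would be far larger.
POSITIVE = {"love", "great", "amazing", "brilliant", "fun"}
NEGATIVE = {"hate", "boring", "awful", "terrible", "worst"}


def score(text: str) -> int:
    """Count positive words minus negative words in a tweet."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)


def label(text: str) -> str:
    """Map the numeric score to a sentiment label."""
    s = score(text)
    if s > 0:
        return "positive"
    if s < 0:
        return "negative"
    return "neutral"


def tally(tweets: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, int]]:
    """Aggregate sentiment counts per movie from (movie, tweet text) pairs."""
    counts: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"positive": 0, "negative": 0, "neutral": 0}
    )
    for movie, text in tweets:
        counts[movie][label(text)] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = [
        ("Movie A", "I love this movie, great fun!"),
        ("Movie A", "Worst film ever, so boring"),
        ("Movie B", "Watching it tonight"),
    ]
    print(tally(sample))
```

A scorer like this misses negation and sarcasm, which is one reason training sets and models sit alongside dictionaries in the list above.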
Demo app: pick movies you want to monitor
Compare different views on data
Compare trending between different movies
Examples of positive & negative Carrie Fisher tweets
Where to learn more
1. Read our blog: blog.griddynamics.com
   • 7-part blog series on ISP
   • 7-part blog series on Data Science Kitchen
2. Play with our demo
   • http://apps.griddynamics.com/realtime-twitter-sentiment-analysis-example
3. Connect
   • Twitter: @griddynamics
   • Subscribe to our blog
   • Drop email: info@griddynamics.com
