Flink-powered stream processing platform at Pinterest
Rainie Li
Software engineer@Pinterest
Kanchi Masalia
Software engineer@Pinterest
Agenda
1. Introduction
2. Challenges & Use cases
3. Platform missions & Frameworks
4. Ongoing Work
5. Q&A
Confidential | © Pinterest
Streaming use cases on Xenon platform
[Chart: OKR promised vs. OKR delivered; annotations: ~2x over, ~3x scale]
Why Real Time Stream Processing
● Ads real-time spend and reporting - Calculate spend against budget limits in near real time
to quickly adjust budget pacing and update advertisers with more timely reporting results
● Fast User Signals - Make user content signals available quickly after content creation and use
these signals in ML pipelines for a personalized and fresh user experience
● Real-time Trust & Safety - Reduce levels of unsafe content as close to content creation time as possible
● Fast Insights (Content activation) - Distribute fresh Creator content and surface engagement
metrics to Creators so they can refine their content with minimal feedback delay
● Product Authority (Shopping) - Deliver a trustworthy shopping product experience for users
by updating product metadata in near real time
● Fast Experimentation - Accurately deliver metrics to engineers for faster experiment setup,
verification, and evaluation
Existing Issues
● Fragmented technologies
○ Self-managed Kafka Streams jobs (Ads Infra)
○ Overwatch platform for small batch Spark jobs (Ads Data,
Measurement)
● Lack of developer support
● Availability & scalability issues
Who are we?
● We are a team of engineers, SREs, PM and EM that builds the
stateful stream data processing platform called Xenon at Pinterest.
● We support around 100 engineers who build and operate 100+ Flink
applications.
● We run (near) real-time applications at 300M messages per
second and process 150TB of data per second.
● We have enabled 10+ top level company KRs in the past 3 years.
Xenon platform Mission
● Stability: reliably host all deployed Flink-based stream processing
applications
● Dev Velocity: quickly productionize new use cases / features to
meet business and product needs
● Cloud Efficiency: efficiently operate infrastructure and strive for best
practices
Xenon - Pinterest stream processing platform
[Architecture diagram]
Layers:
● The Developer APIs
● The Deployment Stack
● The Resource Management & Job Execution Layer
Components:
● NRTG
● Flink SQL
● Common Libraries and Connectors
● CI/CD
● Hermez
● Job Management Service
● Job State Management (Checkpoints, Backups, Restores, Edits)
● Security / Auth (PII/FGAC)
● Job Health & Diagnosis (Dr. Squirrel)
● Cluster Management (YARN)
PinStats Analytic Use case
“Overall, users … cited that currently
they have difficulties monitoring content
performance due to a lack of real-time
data being available, which they find
frustrating.”
Creator Content Use cases
● Fast user signals: Make user content signals available quickly after content
creation
● Safety: Reduce levels of unsafe content as close to content creation time as
possible
[Diagram labels: Content Creation; Audience Targeting; Content Understanding; Quality; Interests & Annotations; Embeddings; Performance]
Ads real-time spend and reporting
Calculate spend against budget limits in real time to quickly adjust budget and
update advertisers with more timely results
Xenon platform Mission No. 1 - Stability
● Xenon Stability Strategy
● Job Deployment Framework - Hermez and Job Submission Service
● Job Management Service - runtime monitoring for Pinterest stateful streaming
applications, with automatic failover to a different AZ
Xenon Job Deployment Framework
[Diagram, steps 1-8. Components: Repo; Jenkins; Artifactory; S3; Hermez; Job Submission Service; Yarn Clusters]
● Xenon Jobs / Hermez workloads: 154
● Production Xenon use cases: >90
● Deployments every day: 179
Highlights
Stability and Tier 1 support
● Enhanced JSS State Machine
● Supported job level dedicated S3 buckets
User experience
● Hermez supported most recent checkpoint deployment
● Hermez supported kill job and distributed shell
● Enriched savepoint information on Hermez
● Track daily & monthly deployment success rate
Metrics
● Job submission latency
Xenon Job Management Service
Monitoring
● Job Status
● Critical metrics (QPS)
● Checkpointing health
● Job/task health
● Notify users
Auto Recovery
● Auto recover failed jobs from:
○ Last completed checkpoint
○ Most recent savepoint
○ Fresh state
AZ Failure Resilience
● Auto failover jobs to backup clusters in different AZs when the primary
cluster/AZ goes down
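The recovery and failover behavior above can be sketched roughly as follows. This is a minimal illustration, not the JMS implementation: `JobRestorePoints`, `choose_restore_source`, and `choose_cluster` are hypothetical names, and treating the three restore sources as an ordered preference list is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class JobRestorePoints:
    """Hypothetical view of the restore points known for one job."""
    last_completed_checkpoint: Optional[str] = None  # path, if any
    most_recent_savepoint: Optional[str] = None      # path, if any


def choose_restore_source(points: JobRestorePoints) -> Tuple[str, Optional[str]]:
    """Pick a restore source, assuming the listed order is a preference order."""
    if points.last_completed_checkpoint:
        return ("checkpoint", points.last_completed_checkpoint)
    if points.most_recent_savepoint:
        return ("savepoint", points.most_recent_savepoint)
    # Nothing to restore from: start the job with fresh state.
    return ("fresh_state", None)


def choose_cluster(primary: str, backups: List[str], healthy: Set[str]) -> Optional[str]:
    """Stay on the primary cluster if it is healthy, else use the first healthy backup."""
    if primary in healthy:
        return primary
    for cluster in backups:
        if cluster in healthy:
            return cluster
    return None  # no healthy cluster available; a real service would alert here
```

For example, a job with no checkpoint but a savepoint would restore from the savepoint, and a job whose primary AZ is down would move to the first healthy backup cluster.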
JMS Architecture
[Diagram. Components: user; Hermez; JSS; Xenon JMS (Monitoring, Auto Recovery, Failover, Deployment); Flink API; Statsboard; ZK Clusters; Yarn Clusters in AZ-a, AZ-b, AZ-c]
● Jobs under management: >90
● Faster recovery time: 10X
● Jobs get recovered every week: >7
Xenon platform Mission No. 2 - Developer Velocity
● Near Real Time Galaxy - Pinterest stateful streaming application Job
development framework
● CICD - Pinterest stateful streaming application change rollout flow
● Dr. Squirrel - Pinterest self-service streaming application
troubleshooting portal
● Working model - New Use Case Onboarding Process
NRTG
Definition:
● Pinterest stateful streaming application Job development framework
History:
● Galaxy: a high-level managed execution platform for producing and
consuming signals (e.g. entity features) about Pinterest entities (such
as pins, boards, users).
● NRTG (Near Real Time Galaxy): follows the same Galaxy dataflow
API used in batch and extends it to streaming applications.
NRTG components (khaki boxes below)
VIP Navboost Signal (Map Transforms, Async RPC calls, Backfill)
● User code focuses only on Business logic. ✅
● Tune Flink operators using configs. ✅
● ROI: Kappa architecture - roadmap to shutting down an $800K double compute GPU cluster for visual-search batch. 🚧
[Diagram labels: Xenon Flink Application; Code; Config]
Xenon CICD framework - big picture
● Bring the CICD practice from stateless online services to the stateful streaming world
● Leverage the same CICD infrastructure
● Customize the CICD pipeline for validating and deploying Flink-based streaming
applications
● Safely roll out Xenon user / platform changes with minimal human effort
involved in validation
Xenon CICD pipelines - details
● auto-triggered based on cron rule and availability of new artifacts
● stability checks
○ job submission success
○ no restart-loop
○ savepoint generation success
○ ACA metrics validation
○ auto-recovery from TM/JM failure
● Prod deploy: decider-controlled, safe operations on prod job during
business hours
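The stability checks and the prod-deploy gate above can be sketched as a simple pass/fail evaluation. This is an illustration under stated assumptions, not the actual pipeline: `ValidationRunResult`, `failed_checks`, and `may_deploy_to_prod` are hypothetical names, the restart threshold is arbitrary, and "business hours" is assumed here to mean weekdays 09:00-17:00.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class ValidationRunResult:
    """Hypothetical summary of one pipeline validation run for a job."""
    job_submitted: bool
    restart_count: int
    savepoint_created: bool
    aca_metrics_passed: bool
    recovered_from_tm_jm_failure: bool


def failed_checks(result: ValidationRunResult, max_restarts: int = 3) -> List[str]:
    """Return the names of the stability checks that did not pass."""
    checks = {
        "job submission success": result.job_submitted,
        # Assumption: more than max_restarts restarts counts as a restart loop.
        "no restart-loop": result.restart_count <= max_restarts,
        "savepoint generation success": result.savepoint_created,
        "ACA metrics validation": result.aca_metrics_passed,
        "auto-recovery from TM/JM failure": result.recovered_from_tm_jm_failure,
    }
    return [name for name, passed in checks.items() if not passed]


def may_deploy_to_prod(result: ValidationRunResult, decider_enabled: bool,
                       now: datetime) -> bool:
    """Allow a prod deploy only if the decider is on, it is business hours,
    and every stability check passed."""
    in_business_hours = now.weekday() < 5 and 9 <= now.hour < 17
    return decider_enabled and in_business_hours and not failed_checks(result)
```

Returning the list of failed check names, rather than a single boolean, makes it easier to show the user which stage blocked the rollout.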
Xenon CICD framework - User Interface
Xenon CICD Pipeline UI
● Pipeline execution history
● Pipeline operation: disable / enable / trigger
● Links to Pipeline YAML and Spinnaker
Spinnaker UI
● Pipeline parameters
● Pipeline execution status
● Details about each Stage
Job Debugging tool - Dr. Squirrel
Definition:
● One-stop shop for Flink job troubleshooting
Features:
● Surface suspicious stats to Xenon users instead of users searching for them
○ GC, CPU, memory, backpressure, exceptions, bad config...
● Provide instructions on top of suspicious stats
Goal:
● Cut down troubleshooting time, lower the required Flink internal knowledge for
troubleshooting, increase the dev velocity
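The "surface suspicious stats, then attach instructions" idea can be sketched as a small rule table. This is only an illustration, not Dr. Squirrel's design: the stat names, thresholds, and instruction texts are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Stats = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    """One check: a name, a predicate over job stats, and advice for the user."""
    name: str
    is_suspicious: Callable[[Stats], bool]
    instruction: str


# Hypothetical rules; real thresholds would be tuned per platform.
RULES: List[Rule] = [
    Rule("gc", lambda s: s.get("gc_time_fraction", 0.0) > 0.10,
         "Check heap sizing and object churn in hot operators."),
    Rule("backpressure", lambda s: s.get("backpressured_fraction", 0.0) > 0.50,
         "Find the slowest downstream operator and consider raising its parallelism."),
    Rule("exceptions", lambda s: s.get("exceptions_last_hour", 0.0) > 0,
         "Inspect the most recent exception in the job logs."),
]


def diagnose(stats: Stats) -> List[Tuple[str, str]]:
    """Return (rule name, instruction) for every rule the stats trip."""
    return [(r.name, r.instruction) for r in RULES if r.is_suspicious(stats)]
```

Keeping each rule as data (name, predicate, instruction) means new checks can be added without changing the diagnosis loop.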
Dr. Squirrel UI
Architecture - Part 1
Architecture - Part 2
Working model - New Use Case Onboarding Process
● Xenon team provides managed bootstrap of new use case:
○ best practices in terms of choosing framework and deciding job graph
○ Dev environment setup
○ a buildable and deployable skeleton project (bazel, java, test, configs)
○ Hermez workloads creation
○ CICD pipeline
○ YARN queue
○ dashboard / alerts with default settings
● Xenon developers write and test business logic code
● Support auto-generation of NRTG- and Flink SQL-based projects
Outcome: reduce the onboarding time by 3+ weeks
Xenon platform Mission No. 3 - Cloud efficiency (ongoing)
● Auto Scaling - Auto-tune and auto-scale Flink applications up/down
● Cluster upgrade - Automatic job migration during platform upgrade
● Resource Optimization - Load balance Xenon clusters
● Evaluate k8s
Auto Scaling
● Service to dynamically adjust job parallelism based on metrics: Kafka lag, CPU
utilization, and backpressure.
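A minimal sketch of such a scaling decision is shown below. It is not the Xenon auto-scaling service: the function name, all thresholds, and the double/halve policy are assumptions chosen for illustration.

```python
def recommend_parallelism(
    current: int,
    kafka_lag: int,
    cpu_utilization: float,
    backpressure_ratio: float,
    *,
    min_parallelism: int = 1,
    max_parallelism: int = 256,
    lag_high: int = 100_000,
    lag_low: int = 1_000,
    cpu_high: float = 0.80,
    cpu_low: float = 0.30,
    backpressure_high: float = 0.50,
    backpressure_low: float = 0.05,
) -> int:
    """Recommend a new job parallelism from three signals.

    Scale up if ANY signal is high; scale down only if ALL signals are low,
    so one quiet metric cannot shrink a job that is busy on another.
    """
    overloaded = (
        kafka_lag > lag_high
        or cpu_utilization > cpu_high
        or backpressure_ratio > backpressure_high
    )
    idle = (
        kafka_lag < lag_low
        and cpu_utilization < cpu_low
        and backpressure_ratio < backpressure_low
    )
    if overloaded:
        target = current * 2
    elif idle:
        target = current // 2
    else:
        target = current
    # Clamp to the allowed range.
    return max(min_parallelism, min(max_parallelism, target))
```

The asymmetry between scale-up and scale-down is a common way to avoid oscillation; a production service would likely also add cooldown periods between changes.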
Questions?
Anumol Sebastian
Chenqi Liu
Hannah Chen
Divye Kapoor
Kanchi Masalia
Lu Niu
Rainie Li
Teja Thotapalli
Nishant More
Samuel Bahr
Heng Zhang
Kevin Browne
Sergii Marchenko
Ashish Jhaveri
Dinesh Kumar Sekar
Chen Qin
Shaowen Wang
YOU?!
Q & A
Thank you
