Never Fail Twice
How Playtech Mastered Failure Detection Across Distributed Systems
Bio
• Technical Architect with more than 18 years of experience
• Passionate about IT
• Financial and Data Science background
• Recent years spent on Research and Design projects
Agenda
• What is observability and monitoring
• Why this is hard
• Possible approaches
• How we solved it
• The future of instrumentation and observability
Objectives
• Get in touch with time-series analysis
• Understanding the pros and cons of distributed systems
• Understanding observability and instrumentation concepts
Observability
• Monitoring is for operating software/systems
• Instrumentation is for writing software
• Observability is for understanding systems
(Charity Majors)
Why is it difficult
1. Various problems may lead to non-obvious system behaviour.
2. Various metrics may have different correlations in time and space.
3. Monitoring a complex application is a significant engineering endeavor in and of itself.
4. There is a mix of different measurements and metrics.
System monitoring at Playtech
• 50+ multibranded sites, distributed all over the world
• Multiple products
• Multichannel
• Different mix of integrations
On the shoulders of giants
A lot of companies have built their own solutions for monitoring their systems.
Not all of them were success stories.
Etsy
• Etsy is a large online marketplace for handmade goods
• Their engineering team collected more than 250,000 different metrics from their servers
• They tried to find anomalies using complex mathematical approaches
Lessons learnt from KALE 1.0
• Anomalies in other metrics should be used for root cause analysis.
• Alerts should only be sent out when anomalies are detected in business and user metrics.
• A one-size-fits-all approach will probably not fit at all.
• Anomaly detection is more than just outlier detection.
Google SRE team’s BorgMon
• Google has trended toward simpler and faster monitoring systems, with better tools for post hoc analysis
• [They] avoid “magic” systems that try to learn thresholds or automatically detect causality
• Rules that generate alerts for humans should be simple to understand and represent a clear failure
According to the authors of Site Reliability Engineering
Playtech case
• The previous tool, from HP, was “one-size-fits-all”
• Low efficiency and side effects
• False positives and missed incidents
• Horrible operability
Time Series
• A time series is a series of data points indexed (or listed or graphed) in time order
• Economic processes have a regular structure
• Examples: sales volumes in shops, champagne production, online transactions
• They usually have seasonal periods and trend lines
• Using this information simplifies analysis
Stationary Time-Series Data
• A stochastic process whose statistical characteristics (such as mean and variance) do not change over time
• Example: white noise
Non-Stationary Time Series
• Trend line
• Change in dispersion (variance)
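A rough way to see the difference between the two slides is to compare the mean and variance of the first and second half of a series. The sketch below uses only the standard library and synthetic data; the series, the seed, and the helper name `half_stats` are illustrative assumptions, not Playtech data or code.

```python
import random
import statistics

def half_stats(series):
    """Return (mean, variance) for the first and the second half of a series."""
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return (
        (statistics.fmean(first), statistics.pvariance(first)),
        (statistics.fmean(second), statistics.pvariance(second)),
    )

rng = random.Random(42)
n = 2000

# White noise: independent draws with constant mean and variance.
white_noise = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Non-stationary, case 1: the same noise on top of a linear trend.
with_trend = [0.01 * t + e for t, e in enumerate(white_noise)]

# Non-stationary, case 2: noise whose dispersion grows with time.
growing_dispersion = [(1.0 + 0.002 * t) * e for t, e in enumerate(white_noise)]

examples = [
    ("white noise", white_noise),
    ("trend", with_trend),
    ("growing dispersion", growing_dispersion),
]
for name, series in examples:
    (m1, v1), (m2, v2) = half_stats(series)
    print(f"{name:20s} mean {m1:6.2f} -> {m2:6.2f}   variance {v1:6.2f} -> {v2:6.2f}")
```

For white noise both halves should look alike; the trend shifts the mean, and the growing dispersion inflates the variance of the second half.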
How to model that?
• Every measurement consists of a signal and an error component (noise), because our processes are affected by many factors
• Point_of_measurement = signal + error
• Subtract the model’s values from our measurements
• The more closely our model resembles the real signal, the closer the residual comes to the pure error component, that is, to stationary white noise
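The idea of subtracting the model from the measurements can be sketched as follows, assuming the simplest possible model: a straight trend line fitted by least squares. The data is synthetic and the function names `fit_line` and `residuals` are hypothetical; the deck does not show the actual models used.

```python
import random

def fit_line(ys):
    """Least-squares fit of y = intercept + slope * t for t = 0..n-1."""
    n = len(ys)
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    cov = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    var = sum((t - t_mean) ** 2 for t in range(n))
    slope = cov / var
    return y_mean - slope * t_mean, slope

def residuals(ys):
    """Subtract the fitted trend line (the model) from the measurements."""
    intercept, slope = fit_line(ys)
    return [y - (intercept + slope * t) for t, y in enumerate(ys)]

rng = random.Random(7)
# Synthetic measurements: signal (a linear trend) plus an error component.
signal = [100.0 + 0.5 * t for t in range(180)]
measured = [s + rng.gauss(0.0, 2.0) for s in signal]

intercept, slope = fit_line(measured)
res = residuals(measured)
print(f"fitted slope = {slope:.3f}, intercept = {intercept:.1f}")
print(f"residual mean = {sum(res) / len(res):.6f}")
```

If the model is good, what is left in `res` should have no visible structure: a mean close to zero and no remaining trend.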
Example
A 30-minute slice of data
Regression: finding a trend line
Trend line subtracted
The residual looks like white noise
Dickey-Fuller test on the original slice of data
The test does not support stationarity
And after subtraction
The result is a stationary time series
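The before-and-after check can be sketched with a simplified (non-augmented) Dickey-Fuller regression written in plain Python. This is a teaching sketch under stated assumptions: it ignores serial correlation in the errors, uses the asymptotic 5% critical value of -2.86 for a regression with a constant, and runs on synthetic data. In practice a library routine such as `adfuller` from statsmodels would normally be used instead; the function names below are hypothetical.

```python
import math
import random

DF_CRITICAL_5PCT = -2.86  # asymptotic 5% value for a regression with a constant

def dickey_fuller_t(ys):
    """t-statistic of gamma in: y[t] - y[t-1] = alpha + gamma * y[t-1] + e[t]."""
    x = ys[:-1]                                # lagged levels y[t-1]
    d = [b - a for a, b in zip(ys, ys[1:])]    # first differences
    n = len(x)
    x_mean, d_mean = sum(x) / n, sum(d) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxd = sum((xi - x_mean) * (di - d_mean) for xi, di in zip(x, d))
    gamma = sxd / sxx
    alpha = d_mean - gamma * x_mean
    ssr = sum((di - alpha - gamma * xi) ** 2 for xi, di in zip(x, d))
    std_err = math.sqrt(ssr / (n - 2) / sxx)
    return gamma / std_err

def detrend(ys):
    """Subtract a least-squares trend line from the series."""
    n = len(ys)
    t_mean, y_mean = (n - 1) / 2, sum(ys) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
             / sum((t - t_mean) ** 2 for t in range(n)))
    return [y - (y_mean + slope * (t - t_mean)) for t, y in enumerate(ys)]

rng = random.Random(11)
raw = [100.0 + 0.5 * t + rng.gauss(0.0, 2.0) for t in range(180)]
flat = detrend(raw)

for name, series in (("raw slice", raw), ("after subtraction", flat)):
    t_stat = dickey_fuller_t(series)
    verdict = "stationary" if t_stat < DF_CRITICAL_5PCT else "not shown to be stationary"
    print(f"{name}: t = {t_stat:.2f} -> {verdict}")
```

The null hypothesis of the test is a unit root (non-stationarity), so a strongly negative statistic is what indicates a stationary series.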
Let’s take a moving average of our example
A bit of Salvador Dali
Compared with the next week’s data
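One way to read these slides: a moving average of one week serves as the model, and the following week is compared against it. The sketch below makes that concrete under several assumptions that are not in the deck: hourly synthetic data with a daily cycle, a centered window of 5 points, and a tolerance of four standard deviations of the model's own residuals.

```python
import math
import random
import statistics

def moving_average(ys, window):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    out = []
    for i in range(len(ys)):
        chunk = ys[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def deviations(model, observed, tolerance):
    """Indices where the observation is further from the model than allowed."""
    return [i for i, (m, o) in enumerate(zip(model, observed))
            if abs(o - m) > tolerance]

rng = random.Random(3)
HOURS_PER_WEEK = 7 * 24

def synthetic_week():
    # Daily cycle plus noise; stands in for a real metric such as logins per hour.
    return [50 + 20 * math.sin(2 * math.pi * h / 24) + rng.gauss(0, 2)
            for h in range(HOURS_PER_WEEK)]

this_week = synthetic_week()
next_week = synthetic_week()
next_week[100] -= 40  # injected incident: a sharp drop at hour 100

model = moving_average(this_week, window=5)
spread = statistics.pstdev([o - m for m, o in zip(model, this_week)])
flagged = deviations(model, next_week, tolerance=4 * spread)
print(flagged)
```

The tolerance multiplier trades false positives against missed incidents, which is the same trade-off the deck criticises in the previous tool.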
Why a time-series database matters
• Optimized for handling time-series data
• No updates: facts never change
• Data is only appended
• The most recent data is queried most often
• InfluxDB is one of the best time-series databases
An Important Note
The first level involves searching for anomalies in metrics and sending out notifications if outliers are found.
This is the information emission level.
The second level involves receiving this information and deciding whether it represents a real problem or outage.
This is the information consumption level.
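The split into an emission level and a consumption level can be sketched with an in-process queue. Everything here is a simplified assumption: the `Violation` record, the fixed tolerances, and the decision rule (alert only when a business metric is affected, and attach the other anomalies as context) are illustrative and echo the KALE lessons rather than reproduce Playtech's logic.

```python
from dataclasses import dataclass
from queue import Queue

@dataclass(frozen=True)
class Violation:
    site: str
    metric: str
    value: float
    expected: float

def emit(site, readings, expected, tolerance, queue):
    """Level 1 (emission): report every metric that deviates from its model."""
    for metric, value in readings.items():
        if abs(value - expected[metric]) > tolerance[metric]:
            queue.put(Violation(site, metric, value, expected[metric]))

def consume(queue, business_metrics):
    """Level 2 (consumption): decide whether the violations form an incident."""
    seen = []
    while not queue.empty():
        seen.append(queue.get())
    business = [v for v in seen if v.metric in business_metrics]
    if not business:
        return None  # anomalies on technical metrics alone do not raise an alert
    return {
        "alert_on": [v.metric for v in business],
        "context": [v.metric for v in seen if v.metric not in business_metrics],
    }

events = Queue()
emit(
    "site-a",
    readings={"logins": 40.0, "cpu": 95.0, "deposits": 98.0},
    expected={"logins": 100.0, "cpu": 50.0, "deposits": 100.0},
    tolerance={"logins": 30.0, "cpu": 20.0, "deposits": 30.0},
    queue=events,
)
decision = consume(events, business_metrics={"logins", "deposits"})
print(decision)  # {'alert_on': ['logins'], 'context': ['cpu']}
```

Keeping the two levels apart means the anomaly detectors can stay noisy and simple while the alerting decision is made in one place.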
Overall Architecture
• Python stack
• Built as a set of loosely coupled components
• Each component runs in its own process on its own Python virtual machine
• Event-driven design
Event Streamer
• A component that holds Workers, fetches data regularly, and tests this data against the statistical models managed by the Workers
• A Worker is the main working unit; it holds a set of models together with meta-information
• Workers are fully independent, and every cycle is executed using a thread pool
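A minimal sketch of this arrangement, assuming each Worker wraps one data source and one acceptance check, and that a cycle simply runs all Workers on a `ThreadPoolExecutor`. The class layout, field names, and lambdas are hypothetical; only the thread-pool idea and the pool size of 8 (mentioned in the speaker notes) come from the deck.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Worker:
    metric: str
    fetch: Callable[[], float]        # data connector (hypothetical)
    model: Callable[[float], bool]    # returns True when the value is acceptable
    meta: dict = field(default_factory=dict)

    def run_cycle(self):
        """Fetch one value, test it, and return a violation event or None."""
        value = self.fetch()
        if self.model(value):
            return None
        return {"metric": self.metric, "value": value, **self.meta}

class EventStreamer:
    def __init__(self, workers, max_threads=8):
        self.workers = workers
        self.pool = ThreadPoolExecutor(max_workers=max_threads)

    def run_cycle(self):
        """Run every Worker once; fetching is I/O-bound, so threads overlap well."""
        results = self.pool.map(Worker.run_cycle, self.workers)
        return [event for event in results if event is not None]

    def close(self):
        self.pool.shutdown()

workers = [
    Worker("logins", lambda: 40.0, lambda v: v >= 70.0, {"site": "site-a"}),
    Worker("deposits", lambda: 98.0, lambda v: v >= 70.0, {"site": "site-a"}),
]
streamer = EventStreamer(workers)
events = streamer.run_cycle()
streamer.close()
print(events)  # [{'metric': 'logins', 'value': 40.0, 'site': 'site-a'}]
```

In a real deployment the returned events would be published to a message queue rather than returned to the caller.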
Rule Engine
• Consumes the information provided by the Event Streamer
• Rules are built as an abstract syntax tree (AST)
• Around 1,500 matches per second in one process
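The deck does not show the rule language itself, so the sketch below only illustrates the general technique: parse a rule once into a syntax tree, then walk that tree for every tick of metric values. As an assumption for illustration, it borrows Python's own expression grammar through the standard `ast` module and accepts only names, numbers, comparisons, and `and` / `or` / `not`.

```python
import ast
import operator

_COMPARATORS = {
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Gt: operator.gt, ast.GtE: operator.ge,
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
}

def compile_rule(text):
    """Parse a rule once, at start-up, into a syntax tree."""
    return ast.parse(text, mode="eval").body

def evaluate(node, metrics):
    """Recursively evaluate a rule tree against one tick of metric values."""
    if isinstance(node, ast.BoolOp):
        results = (evaluate(value, metrics) for value in node.values)
        return all(results) if isinstance(node.op, ast.And) else any(results)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
        return not evaluate(node.operand, metrics)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -evaluate(node.operand, metrics)
    if (isinstance(node, ast.Compare) and len(node.ops) == 1
            and type(node.ops[0]) in _COMPARATORS):
        compare = _COMPARATORS[type(node.ops[0])]
        return compare(evaluate(node.left, metrics),
                       evaluate(node.comparators[0], metrics))
    if isinstance(node, ast.Name):
        return metrics[node.id]       # metric value looked up by name
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    raise ValueError(f"unsupported rule element: {ast.dump(node)}")

rule = compile_rule("logins < 70 and errors > 5")
print(evaluate(rule, {"logins": 40, "errors": 12}))  # True
print(evaluate(rule, {"logins": 90, "errors": 12}))  # False
```

Walking a whitelisted tree, instead of calling `eval`, keeps rules from executing arbitrary code: anything outside the supported node types raises `ValueError`.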
We also measure dynamics
• We can take into account the speed and acceleration of the degradation of the metrics
• These correspond, respectively, to the severity and the predicted change in severity of the incident
• Speed is the slope (angular coefficient), or discrete derivative, of a particular metric, calculated for every violation
• The same applies to acceleration, the second-order discrete derivative
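Speed and acceleration as discrete derivatives can be sketched in a few lines. The sample values and the assumption of evenly spaced measurements (one unit of time per step) are illustrative.

```python
def discrete_derivative(values, step=1.0):
    """First-order discrete derivative of evenly spaced samples."""
    return [(b - a) / step for a, b in zip(values, values[1:])]

# A metric that degrades faster and faster over five consecutive ticks.
metric = [100, 98, 94, 86, 70]

speed = discrete_derivative(metric)         # how fast the metric is falling
acceleration = discrete_derivative(speed)   # how the fall itself is changing

print(speed)         # [-2.0, -4.0, -8.0, -16.0]
print(acceleration)  # [-2.0, -4.0, -8.0]
```

A negative speed with a negative acceleration, as here, describes an incident that is both serious and getting worse.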
Some examples of our rules
The model ensemble can be fine-tuned
A report is created for every alert
Alerta: an open-source product for alert aggregation
Non-Functional Aspects
What the future brings
Q&A
• Thank you very much
• aleks.tavgen@gmail.com
• https://medium.com/@ATavgen/time-series-modelling-a9bf4f467687
• https://medium.com/@ATavgen/never-fail-twice-608147cb49b

Monitoring Distributed Systems


Editor's Notes

  • #8: First, because of the high complexity of the products and the huge number of settings, there are situations where incorrect settings degrade financial indicators, or hidden bugs in the logic affect the overall functionality of the whole system. Second, there are specific third-party integrations for different countries, and problems that arise at partners start to leak through to us. Problems of this kind are not caught by low-level monitoring; to solve them you need to monitor key performance indicators (KPIs), compare them with system statistics, and look for correlations. The company had previously deployed Hewlett Packard Service Health Analyzer, which was (to put it mildly) far from ideal. According to the marketing brochure, it is a self-learning system that provides early problem detection ("SHA can take data streams from multiple sources and apply advanced, predictive algorithms in order to alert to and diagnose potential problems before they occur"). In practice it was a black box that could not be configured; every problem had to be escalated to HP, followed by months of waiting for support engineers to do something that also would not work as needed. On top of that: a terrible user interface, an old JVM ecosystem (Java 6.0) and, most importantly, a large number of false positives and (even worse) false negatives. That is, some serious problems were either not detected at all or were caught much later than they should have been, which translated into very concrete financial losses.
  • #28: The experience of the Kale project points to something very important. Alerting is not the same as finding anomalies and outliers in metrics because, as already mentioned, there will always be anomalies in individual metrics. In reality, we have two logical levels. The first is searching for anomalies in metrics and sending a notification about the violation if an anomaly is found. This is the information emission level. The second level is a component that receives information about violations and decides whether or not they amount to a critical incident. This is also how we humans act when we investigate a problem. We look at something, look further when we notice deviations from the norm, and then make a decision based on our observations. At the start of the project we decided to try Kapacitor, since it allows user-defined functions to be written in Python. But each function lives in a separate process, which would have created overhead for hundreds or thousands of metrics. Because of this and some other problems, we decided to abandon it. Python was chosen as the main stack for building our own system, because it has an excellent ecosystem for data analysis, fast libraries (pandas, numpy, etc.), and excellent support for web solutions. You name it. For me this was the first large project done entirely in Python. I came to Python from the Java world. I did not want to multiply the zoo of stacks for a single system, and that ultimately paid off.
  • #29: Overall architecture. The system is built as a set of loosely coupled components or services that run in their own processes on their own Python VMs. This follows naturally from the general logical split (events emitter / rules engine) and brings other advantages. Each component does a limited number of specific things. In the future this will make it possible to extend the system very quickly and add new user interfaces without touching the core logic and without fear of breaking it. The boundaries between components are drawn fairly clearly. Distributed deployment is convenient if an agent needs to be placed locally, close to the site it monitors; alternatively, a large number of different systems can be aggregated together. Communication has to be message-based, since the whole system must be asynchronous. I chose ActiveMQ as the message queue, but switching to, say, RabbitMQ would not be a problem, because all components communicate over the standard STOMP protocol.
  • #30: A Worker is the main working unit, which holds one model of one metric together with meta-information. It consists of a data connector and a handler to which it passes the data. The handler tests the data against the statistical model and, if it detects violations, passes them to an agent that sends an event to the queue. Workers are fully independent of one another, and every cycle is executed through a thread pool. Since most of the time is spent on I/O operations, Python's Global Interpreter Lock does not affect the result much. The number of threads is set in the config; for the current configuration, 8 threads turned out to be optimal. Information emitter
  • #31: Information consuming. Every message is sent to the Rule Engine, and this is where the most interesting part begins. At the very beginning of development I hard-coded the rules: when one metric falls and another rises, send an alert. But that solution is not universal and requires going into the code for any extension. So some kind of language for defining rules was needed. Here I had to recall abstract syntax trees and define a simple language for describing rules. When the component starts, the rules are read and a syntax tree is built. When each message is received, all events from that site within one tick are checked against the defined rules, and if a rule fires, an alert is generated. More than one rule may fire. If we consider the dynamics of incidents developing over time, we can also take into account the speed of the drop (the severity level) and the change in speed (the forecast of the change in severity).
  • #32: Speed is the slope (angular coefficient), or discrete derivative, which is calculated for every violation. The same goes for acceleration, the second-order discrete derivative. These values can be specified in rules. Cumulative first- and second-order derivatives can be taken into account in the overall assessment of an incident.
  • #33: Rules are described in YAML, but any other format can be used; you only need to add your own parser. Rules are specified as regular expressions over metric names or simply as metric prefixes. Speed is the rate of degradation of the metrics; more on that below.
  • #37: SHA required an Oracle server with tens of GB of RAM. PT-Pas: 28962 lines of code. 1,500 matches per second in one process, up to 125,000 matches in the current configuration (sharding probabilities)
  • #38: Non-functional: 28982 LOC
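The note for slide 33 says that rules select metrics by regular expression or prefix and can refer to speed. A sketch of what applying such a rule might look like once it has been parsed follows. The actual YAML schema is not shown in the deck, so every key name here (`name`, `metrics`, `max_speed`) and the shape of the violation records are hypothetical.

```python
import re

# What a parsed rule might look like; every key name here is hypothetical.
rule = {
    "name": "login_degradation",
    "metrics": r"^site_a\.logins\.",  # regular expression over metric names
    "max_speed": -5.0,                # fire when the metric falls faster than this
}

def matching_violations(rule, violations):
    """Keep violations whose metric name matches the rule and whose speed is steep enough."""
    pattern = re.compile(rule["metrics"])
    return [
        v for v in violations
        if pattern.match(v["metric"]) and v["speed"] <= rule["max_speed"]
    ]

violations = [
    {"metric": "site_a.logins.web", "speed": -8.0},
    {"metric": "site_a.logins.mobile", "speed": -1.0},
    {"metric": "site_b.logins.web", "speed": -9.0},
]
print(matching_violations(rule, violations))
# [{'metric': 'site_a.logins.web', 'speed': -8.0}]
```

A plain prefix rule is the special case where the pattern is an escaped literal anchored at the start of the name.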