How Netflix Monitors Millions of Devices

Netflix is available on millions of set-top boxes, smart TVs, streaming sticks, and other consumer electronics devices. Following a rigorous initial certification process, the Device Reliability Engineering (DRE) team at Netflix is responsible for ensuring that these devices continue to provide a high-quality experience through countless updates and improvements to the service. To date, there are over 1,600 different device models across the globe, each running a range of partner firmware versions. The DRE team monitors more than a dozen metrics on each of these devices, including customer call volume, playback errors, application launch errors, and video quality.

To manage this complexity, our team combines various alerting technologies, dashboards, machine learning, and efficient UIs, allowing a small, tightly focused team to handle the constantly increasing scale of devices. This blog post focuses on a custom UI that helps bring many of these tools together.

One UI to Rule them All

Quickly detecting meaningful changes in a dynamic, heterogeneous environment is a challenging task. The Device Reliability team relies on several internal technologies such as Atlas, Jigsaw, and Winston. We monitor real-time signals for cases with high, consistent volume and daily aggregate signals for cases with low volume. We also use dozens of dashboards hosted in Kibana, Tableau, and Lumen to visualize, compare, and slice and dice metrics. Often, just remembering which dashboard can be used to dive deeper on a particular issue, and then how to configure it for the case at hand, is a challenge in itself.
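As a rough illustration of that split between real-time and daily signals, the choice might look something like the sketch below; the threshold, type names, and function are made up for illustration and are not our actual configuration.

```typescript
// Hypothetical sketch of the real-time vs. daily-aggregate split.
// The threshold and names are illustrative, not the real setup.
type SignalKind = "realtime" | "daily-aggregate";

interface MetricProfile {
  deviceModel: string;
  metric: string;
  eventsPerHour: number; // observed volume for this device/metric pair
}

// High, consistent volume -> real-time anomaly detection;
// low volume -> daily aggregates, where day-over-day comparisons are less noisy.
function chooseSignal(profile: MetricProfile, realtimeThreshold = 10_000): SignalKind {
  return profile.eventsPerHour >= realtimeThreshold ? "realtime" : "daily-aggregate";
}

console.log(chooseSignal({ deviceModel: "example-tv", metric: "playback-errors", eventsPerHour: 25_000 }));
// -> "realtime"
```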

To address this, we built a tool we call the Device Reliability UI, or DRUI for short. It consumes and stores alerts from a range of different systems, augments them with additional data to speed up analysis, and provides contextual links to relevant dashboards and tools. It also enables quick lookup of contextual information about specific device models within the Netflix device ecosystem, allowing us to investigate issues that were not initially triggered by an alert.
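As a rough sketch, a normalized alert record inside such a tool might look like the following; the field names and source list are illustrative assumptions, not DRUI's actual schema.

```typescript
// Hypothetical normalized alert shape after ingestion from multiple sources.
// Field names are illustrative; DRUI's real schema may differ.
interface DeviceAlert {
  id: string;
  source: "atlas" | "jigsaw" | "winston" | "other"; // originating alert system
  metric: string;            // e.g. "playback-errors", "app-launch-errors"
  deviceModel: string;       // which of the 1,600+ device models fired
  firedAt: Date;
  // Augmented context attached after ingestion to speed up triage.
  attributes: Record<string, string>; // e.g. chipset, DRM subsystem, app version
  dashboardLinks: string[];  // pre-configured deep links for this alert type
  incidentId?: string;       // set once the alert is rolled up into an incident
}

// Ingestion normalizes a raw payload from any source into the common shape.
function ingestAlert(
  source: DeviceAlert["source"],
  raw: { metric: string; deviceModel: string; firedAt: string },
): DeviceAlert {
  return {
    id: `${source}-${raw.deviceModel}-${raw.firedAt}`,
    source,
    metric: raw.metric,
    deviceModel: raw.deviceModel,
    firedAt: new Date(raw.firedAt),
    attributes: {},
    dashboardLinks: [],
  };
}
```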

Finding patterns

Some issues impact a single device model. Perhaps a partner pushes a firmware with a bug, or an internal change exposes an incompatibility on one device model. Other issues arise on groups of devices that share a common attribute: a chipset, a particular digital rights management subsystem, or a certain version of the Netflix application. Centralizing our alerts, augmenting them with these attributes, and inserting them into a flexible system helps us identify commonalities and capture learnings that can eventually inspire automated analysis.
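One way to picture that augmentation step is a simple join against a device catalog followed by counting shared attributes; the attribute names below are illustrative assumptions, not our actual catalog fields.

```typescript
// Hypothetical sketch: join alerts with device-catalog attributes so they can
// be grouped by shared traits. Attribute names are illustrative only.
interface DeviceAttributes {
  chipset: string;
  drmSystem: string;      // digital rights management subsystem
  appVersion: string;     // Netflix application version on the device
}

// Count how many alerting device models share each value of a given attribute.
// A spike concentrated in one chipset or DRM system hints at a common cause.
function countByAttribute(
  alerts: { deviceModel: string }[],
  catalog: Map<string, DeviceAttributes>,
  attribute: keyof DeviceAttributes,
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const alert of alerts) {
    const attrs = catalog.get(alert.deviceModel);
    if (!attrs) continue;
    const value = attrs[attribute];
    counts.set(value, (counts.get(value) ?? 0) + 1);
  }
  return counts;
}
```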

Metrics are often related, and the set of metrics that change for a given incident can isolate or expand the picture of that incident. If many devices experience a change in a particular metric but only half of them also experience a change in a second metric, it’s possible that there are actually two different incidents across the two groups of devices.
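A simple way to surface that distinction is to fingerprint each device by the exact set of metrics that changed and group devices with identical fingerprints. The sketch below is illustrative, not DRUI's actual logic.

```typescript
// Hypothetical sketch: group devices by the exact set of metrics that moved.
// Two distinct fingerprints across an alert storm suggest two separate incidents.
function groupByMetricFingerprint(
  changes: { deviceModel: string; metric: string }[],
): Map<string, string[]> {
  // Collect the set of changed metrics per device.
  const metricsByDevice = new Map<string, Set<string>>();
  for (const { deviceModel, metric } of changes) {
    const set = metricsByDevice.get(deviceModel) ?? new Set<string>();
    set.add(metric);
    metricsByDevice.set(deviceModel, set);
  }
  // Fingerprint = sorted, comma-joined metric names; group devices by it.
  const groups = new Map<string, string[]>();
  for (const [device, metrics] of metricsByDevice) {
    const fingerprint = [...metrics].sort().join(",");
    const devices = groups.get(fingerprint) ?? [];
    devices.push(device);
    groups.set(fingerprint, devices);
  }
  return groups;
}

// Devices that moved only on "playback-errors" land in a different group than
// devices that moved on both "playback-errors" and "app-launch-errors".
```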

DRUI provides filtering, sorting, and tagging that helps to narrow down these types of events. Often this type of simple manipulation can visually line up charts across different devices and metrics, and help reveal a pattern.

Looking through history

Over time, devices often exhibit slowly changing or periodic behavior. DRUI stores comments on alerts and on devices to track business context or ongoing investigations. Alerts can fire repeatedly through the course of an incident, so DRUI supports rolling these alerts up into a single entity. If a previous alert has been tagged or grouped and a new one arrives that looks similar, DRUI highlights it with a suggestion to quickly correlate the new alert with the existing incident.
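A minimal sketch of that suggestion logic, assuming "looks similar" means the same metric within a recent window, preferably on the same device model; the real scoring may differ, and the names below are illustrative.

```typescript
// Hypothetical: suggest rolling a new alert into an existing incident when it
// resembles alerts already grouped there. The similarity rule is an assumption.
interface GroupedAlert {
  incidentId: string;
  metric: string;
  deviceModel: string;
  firedAt: Date;
  tags: string[];
}

function suggestIncident(
  newAlert: { metric: string; deviceModel: string; firedAt: Date },
  history: GroupedAlert[],
  windowMs = 7 * 24 * 60 * 60 * 1000, // look back one week
): string | undefined {
  const candidates = history.filter(
    (a) =>
      a.metric === newAlert.metric &&
      newAlert.firedAt.getTime() - a.firedAt.getTime() <= windowMs,
  );
  // Prefer an incident that already contains the same device model;
  // otherwise fall back to the most recent incident on the same metric.
  const sameDevice = candidates.find((a) => a.deviceModel === newAlert.deviceModel);
  const fallback = candidates.sort((x, y) => y.firedAt.getTime() - x.firedAt.getTime())[0];
  return (sameDevice ?? fallback)?.incidentId;
}
```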

Automated analysis

An alert is triggered when an anomaly detection algorithm flags a change in a metric. Identifying the cause of the change can be complex, but there are often a number of possible culprits that can be quickly scanned to perform initial triage. DRUI supports running these types of checks so that, by the time a human looks at an alert, it has already been augmented with additional data to aid the investigation. Some of these triage tasks are performed by Winston.
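Conceptually, this is a list of cheap checks run against each fresh alert, with any findings attached before a human opens it. The checks below (recent firmware release, recent app rollout) are hypothetical examples with invented data shapes, not Winston's actual runbooks or interfaces.

```typescript
// Hypothetical triage pipeline: each check inspects an alert plus some context
// and returns a short finding if it spots a likely culprit. Illustrative only.
interface Alert {
  deviceModel: string;
  metric: string;
  firedAt: Date;
}

interface TriageContext {
  firmwareReleases: { deviceModel: string; version: string; releasedAt: Date }[];
  appRollouts: { deviceModel: string; appVersion: string; startedAt: Date }[];
}

interface TriageFinding {
  check: string;
  summary: string;
}

type TriageCheck = (alert: Alert, ctx: TriageContext) => TriageFinding | null;

const withinTwoDays = (alert: Alert, when: Date) =>
  Math.abs(alert.firedAt.getTime() - when.getTime()) < 48 * 3600 * 1000;

const checks: TriageCheck[] = [
  // Did the partner push a firmware update to this model right before the shift?
  (alert, ctx) => {
    const hit = ctx.firmwareReleases.find(
      (r) => r.deviceModel === alert.deviceModel && withinTwoDays(alert, r.releasedAt),
    );
    return hit ? { check: "firmware", summary: `Firmware ${hit.version} released near the shift` } : null;
  },
  // Did a Netflix app rollout reach this model around the same time?
  (alert, ctx) => {
    const hit = ctx.appRollouts.find(
      (r) => r.deviceModel === alert.deviceModel && withinTwoDays(alert, r.startedAt),
    );
    return hit ? { check: "app-rollout", summary: `App ${hit.appVersion} rollout near the shift` } : null;
  },
];

// Run every check and attach whatever they find to the alert before a human looks.
function triage(alert: Alert, ctx: TriageContext): TriageFinding[] {
  return checks.map((check) => check(alert, ctx)).filter((f): f is TriageFinding => f !== null);
}
```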

Incidents and metric annotation

DRUI supports grouping multiple alerts into an incident. This helps us track the set of metric changes and devices impacted by an event, and creates a centralized place to add comments and images and to share findings with other teams. The incident can also be used to visually annotate a metric shift so that other users can quickly understand why a metric moved.
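One way to model such an incident record, with all field names as illustrative assumptions rather than DRUI's actual data model:

```typescript
// Hypothetical incident record that rolls up related alerts and carries the
// shared context a team adds during an investigation. Illustrative only.
interface Incident {
  id: string;
  title: string;
  alertIds: string[];          // the alerts rolled up into this incident
  deviceModels: string[];      // device models known to be impacted
  metrics: string[];           // metrics that shifted
  comments: { author: string; text: string; at: Date }[];
  imageUrls: string[];         // screenshots shared with other teams
  // Annotation rendered on metric charts so later viewers see why a metric moved.
  annotation?: { from: Date; to: Date; label: string };
}

// Attach an annotation window describing the metric shift.
function annotate(incident: Incident, from: Date, to: Date, label: string): Incident {
  return { ...incident, annotation: { from, to, label } };
}
```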

Floating in a sea of dashboards

Sometimes there are so many dashboards that simply remembering where they are and which ones are relevant to a given issue can slow down problem solving. Dashboards also usually require setting a few basic parameters to apply the right context. DRUI centralizes a mapping from alert types and metrics to relevant dashboard links, pre-configured with the correct time window, device, and other parameters, so that they can be surfaced alongside alerts and users can quickly investigate and triage.
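A minimal sketch of that mapping, assuming dashboards accept query parameters for device and time window; the URL templates and parameter names are invented for illustration.

```typescript
// Hypothetical mapping from an alert's metric to pre-configured dashboard links.
// URL templates and parameter names are invented for illustration.
const dashboardTemplates: Record<string, string[]> = {
  "playback-errors": [
    "https://dashboards.example.com/playback?device={device}&from={from}&to={to}",
  ],
  "app-launch-errors": [
    "https://dashboards.example.com/app-launch?device={device}&from={from}&to={to}",
  ],
};

// Fill in the template so each link opens already scoped to the alert's context.
function dashboardLinks(metric: string, deviceModel: string, from: Date, to: Date): string[] {
  return (dashboardTemplates[metric] ?? []).map((template) =>
    template
      .replace("{device}", encodeURIComponent(deviceModel))
      .replace("{from}", from.toISOString())
      .replace("{to}", to.toISOString()),
  );
}
```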

Tools for scaling

DRUI helps our team manage the ever increasing volume and diversity of devices in the Netflix ecosystem. It enables us to integrate with a wide range of alert and metric inputs, leveraging tools provided by other teams at Netflix while tailoring our workflow to our specific needs. We recently performed a major rewrite of DRUI from Angular to React and hope to continue to use it as Netflix grows to and beyond 200 million subscribers.

If you work on similar challenges or are interested in continuing the conversation, then come talk to us!


