Lessons from Netflix’s Video Play Strategy — Alarms on Minimum Success Requests Per Minute

We cover efficient cloud monitoring when creating CloudWatch alarms to ensure a minimum number of successful requests per minute, drawing inspiration from Netflix’s robust video playback strategy.

10 min read · Dec 27, 2023

Preface

✔️ We discuss the need for alarms based on success metrics (not just errors).
✔️ We cover a fictitious company scenario with full code examples.
✔️ We talk through an example code repository.

Introduction 👋🏽

It is common in the industry to set CloudWatch alarms on error metrics so the team is alerted quickly to issues, but I rarely see alarms based on minimum success metrics.

While reading around the topic, I found a great example from the Netflix engineering team: they alerted on a low number of successful video plays per minute. A drop in plays indicates an issue somewhere else in the overall microservice ecosystem (for example, another microservice is failing, so a specific microservice is never called), and the service itself would therefore show no errors to alert on.

We are going to cover the example above in this article where we discuss the code and architecture around a fictitious company called ‘L.J Film Streams’.

Our fictitious company L.J Film Streams is a streaming service similar to Netflix.

When the L.J Film Streams website had issues, the engineering team for the video play microservice got an alert that there had been 0 new plays that minute.

The code for this article in TypeScript and the AWS CDK can be found here:

GitHub - leegilmorecode/serverless-success-rate-alarms (github.com)

Why does Netflix alarm on success requests?

If we take this back to basics, how can we be alerted to a lack of successful requests to our microservice? If our service is not being hit, then we won’t have logs to alert on.

This is the example given at Netflix: an upstream microservice that is having issues may stop invoking the API endpoints on our own microservice.

In this scenario we won’t have errors in our logs to alert on, as the issue is with another team’s microservice. However, we can be alerted to a lack of throughput in our own service, which indicates errors elsewhere.

Let’s look at a basic example that might happen:

In the scenario above, we can see that the website allows users to play videos, which makes a call from our website service to our video plays microservice. In our video plays service we can therefore see successful 200 requests in the logs.

Now let’s see what happens if there is an issue with the website which means that no video plays are happening in our video plays microservice:

As we can see from the diagram above, we have nothing to alert on, as there are no errors in our video play microservice. What we actually need to alert on is the lack of successful calls from the website.
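
To make the difference concrete, here is a minimal sketch that compares the two alarm styles over the same traffic. The `RequestLog` shape, the `countPerMinute` helper and the sample data are hypothetical and only illustrate the reasoning; they are not part of the example repository.

```typescript
// One entry per request that reached the video plays service (hypothetical shape).
type RequestLog = { minute: number; statusCode: number };

// Count matching requests per minute, keeping minutes with no traffic as 0.
function countPerMinute(
  requestLogs: RequestLog[],
  minutes: number,
  matches: (log: RequestLog) => boolean
): number[] {
  const counts: number[] = new Array<number>(minutes).fill(0);
  for (const log of requestLogs) {
    if (log.minute >= 0 && log.minute < minutes && matches(log)) {
      counts[log.minute] = (counts[log.minute] ?? 0) + 1;
    }
  }
  return counts;
}

// Plays arrive in minutes 0 and 1, then the website breaks and nothing
// reaches our service in minutes 2 and 3.
const requestLogs: RequestLog[] = [
  { minute: 0, statusCode: 200 },
  { minute: 0, statusCode: 200 },
  { minute: 1, statusCode: 200 },
];

const errorsPerMinute = countPerMinute(requestLogs, 4, (l) => l.statusCode >= 500);
const successesPerMinute = countPerMinute(requestLogs, 4, (l) => l.statusCode === 200);

// An error alarm fires when errors >= 1; a minimum-success alarm fires when successes < 1.
const errorAlarmFires = errorsPerMinute.map((n) => n >= 1);
const successAlarmFires = successesPerMinute.map((n) => n < 1);

console.log(errorAlarmFires); // [ false, false, false, false ]
console.log(successAlarmFires); // [ false, false, true, true ]
```

The error alarm stays silent for the whole outage because our service never sees a failing request, while the minimum-success alarm fires as soon as the plays stop.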

“what we actually need to alert on is the lack of successful calls from the website”

What are we building?

Let’s have a look at the basic example solution which we will be talking through in this article:

You can see in our basic example that:

Our users will interact with Amazon API Gateway to play a video stream.
The Lambda function emits a metric called SuccessfulFilmPlay when we have a successful play of a film stream.
We store our movie metadata in a DynamoDB table.
We use a combination of CloudWatch metric filters and math expressions to alert on a lack of success criteria per minute (i.e. a lack of successful video plays per minute).
When we have fewer successful video plays per minute than the threshold of 1, we put our alarm into the ALARM state and send an email to our engineering team so they can investigate.

👇 Before we go any further — please connect with me on LinkedIn for future blog posts and Serverless news https://www.linkedin.com/in/lee-james-gilmore/

Talking through key code 👨‍💻

Let’s start with one of the key pieces of the puzzle. Firstly, we need to ensure that our Lambda function for playing film streams publishes metrics which we can add a metric filter for:

import {
  MetricUnits,
  Metrics,
  logMetrics,
} from '@aws-lambda-powertools/metrics';
import { Tracer, captureLambdaHandler } from '@aws-lambda-powertools/tracer';
import { errorHandler, logger } from '@shared';
import { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';

import { injectLambdaContext } from '@aws-lambda-powertools/logger';
import { Film } from '@dto/play-film';
import { ValidationError } from '@errors';
import middy from '@middy/core';
import httpErrorHandler from '@middy/http-error-handler';
import { playFilmUseCase } from '@use-cases/play-film';

const tracer = new Tracer();
const metrics = new Metrics();

export const playFilmAdapter = async ({
  pathParameters,
}: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
  try {
    const id = pathParameters?.id;

    if (!id) throw new ValidationError('id is not supplied');

    const film: Film = await playFilmUseCase(id);

    metrics.addMetric('SuccessfulFilmPlay', MetricUnits.Count, 1);

    return {
      statusCode: 200,
      body: JSON.stringify(film),
    };
  } catch (error) {
    let errorMessage = 'Unknown error';
    if (error instanceof Error) errorMessage = error.message;
    logger.error(errorMessage);

    metrics.addMetric('PlayFilmError', MetricUnits.Count, 1);

    return errorHandler(error);
  }
};

export const handler = middy()
  .handler(playFilmAdapter)
  .use(injectLambdaContext(logger))
  .use(captureLambdaHandler(tracer))
  .use(logMetrics(metrics))
  .use(httpErrorHandler());

We can see from the code above that on the success of the Lambda function being invoked we publish the metric SuccessfulFilmPlay with a count of 1.

This essentially now means that we can track how many successful calls we have per minute.

Next, we take a look at our stateless stack which contains our metrics and alarms for the given Lambda function:

// Create the Metric Filter for the lambda function logs specifically
const metricFilter = playFilmLambda.logGroup.addMetricFilter(
  'SuccessfulFilmPlaysFilter',
  {
    filterName: 'SuccessfulFilmPlaysFilter',
    filterPattern: logs.FilterPattern.literal(
      `{ $.SuccessfulFilmPlay = 1 && $.service = "${serviceName}" }`
    ),
    metricName: 'SuccessfulFilmPlaysFilter',
    metricNamespace: metricNamespace,
    metricValue: '1',
    defaultValue: 0,
  }
);
metricFilter.applyRemovalPolicy(cdk.RemovalPolicy.DESTROY);

// ensure that we fill missing periods with a 0 otherwise the alarm
// will not trigger for > 5 mins as discussed here:
// https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
const metricsMathExpression = new cloudwatch.MathExpression({
  expression: 'FILL(m1, 0)',
  label: 'SuccessfulFilmPlaysAlarmExpression',
  period: cdk.Duration.minutes(1),
  usingMetrics: {
    ['m1']: metricFilter.metric({
      period: cdk.Duration.minutes(1),
      statistic: cloudwatch.Stats.SUM,
    }),
  },
});

// we create the alarm based on the math expression, as we need to ensure that
// missing metrics are filled with 0, to ensure that our alarm triggers quickly
const alarm = metricsMathExpression.createAlarm(this, 'CloudWatchAlarm', {
  alarmName: 'SuccessfulFilmPlaysAlarm',
  alarmDescription: 'Error When Successful Plays Drops Below Threshold',
  threshold: 1, // we want to ensure that we get at least 1 video play per minute
  comparisonOperator: cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
  evaluationPeriods: 1,
  datapointsToAlarm: 1,
  actionsEnabled: true,
  // we want to know when we have no metrics
  treatMissingData: cloudwatch.TreatMissingData.BREACHING,
});
alarm.applyRemovalPolicy(cdk.RemovalPolicy.DESTROY);

// create our sns topic for our alarm
const topic = new sns.Topic(this, 'AlarmTopic', {
  displayName: 'FilmPlayAlarmTopic',
  topicName: 'FilmPlayAlarmTopic',
});
topic.applyRemovalPolicy(cdk.RemovalPolicy.DESTROY);

// Add an action for the alarm which sends to our sns topic
alarm.addAlarmAction(new SnsAction(topic));

// send an email when a message drops into the topic
topic.addSubscription(new snsSubs.EmailSubscription(emailAddress));

If we break down the stateless stack code above, we start by creating the metric filter, which is attached to the log group of our playFilmLambda Lambda function. We can see that the metric filter is based on the service name and on log lines matching the pattern “SuccessfulFilmPlay = 1”:

// Create the Metric Filter for the lambda function logs specifically
const metricFilter = playFilmLambda.logGroup.addMetricFilter(
  'SuccessfulFilmPlaysFilter',
  {
    filterName: 'SuccessfulFilmPlaysFilter',
    filterPattern: logs.FilterPattern.literal(
      `{ $.SuccessfulFilmPlay = 1 && $.service = "${serviceName}" }`
    ),
    metricName: 'SuccessfulFilmPlaysFilter',
    metricNamespace: metricNamespace,
    metricValue: '1',
    defaultValue: 0,
  }
);
metricFilter.applyRemovalPolicy(cdk.RemovalPolicy.DESTROY);
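
To illustrate what the filter pattern selects, the sketch below mirrors it as a plain TypeScript predicate. It assumes the metric log lines are JSON objects with `service` and `SuccessfulFilmPlay` as top-level fields (the shape the pattern itself implies); the `MetricLogLine` type, the `matchesSuccessfulPlayFilter` helper and the service name `film-streams-service` are hypothetical. It is a local approximation for reasoning about the pattern, not how CloudWatch Logs evaluates it internally.

```typescript
// Simplified, assumed shape of a metric log line; any other fields are ignored
// because the filter pattern only inspects these two.
type MetricLogLine = {
  service?: string;
  SuccessfulFilmPlay?: number;
  PlayFilmError?: number;
};

// Mirrors `{ $.SuccessfulFilmPlay = 1 && $.service = "<serviceName>" }`.
function matchesSuccessfulPlayFilter(raw: string, serviceName: string): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return false; // not JSON, so a JSON pattern cannot match it
  }
  if (typeof parsed !== 'object' || parsed === null) return false;
  const line = parsed as MetricLogLine;
  return line.SuccessfulFilmPlay === 1 && line.service === serviceName;
}

const sampleLines: string[] = [
  JSON.stringify({ service: 'film-streams-service', SuccessfulFilmPlay: 1 }),
  JSON.stringify({ service: 'film-streams-service', PlayFilmError: 1 }),
  JSON.stringify({ service: 'another-service', SuccessfulFilmPlay: 1 }),
  'START RequestId: example', // plain-text line
];

// Each matching line contributes a metricValue of 1 to the filter's metric.
const matchedCount = sampleLines.filter((l) =>
  matchesSuccessfulPlayFilter(l, 'film-streams-service')
).length;

console.log(matchedCount); // 1
```

Only the first line matches: the error line lacks the success field, the third line belongs to a different service, and the last line is not JSON.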

Next, we need to create a math expression based on the metric filter above. The reason is that when a metric is not emitted (i.e. the Lambda function is not invoked), we need to fill (replace) the missing datapoint with a count of 0. If we didn’t do this, we would fall foul of an issue with CloudWatch where it takes around 5–7 minutes for missing data to put our CloudWatch alarm into an alarm state.

With the math expression in place, the alarm triggers after 1 minute of no successful calls in the logs, as the missing metrics are filled with a count of 0 per minute:

// ensure that we fill missing periods with a 0 otherwise the alarm
// will not trigger for > 5 mins as discussed here:
// https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
const metricsMathExpression = new cloudwatch.MathExpression({
  expression: 'FILL(m1, 0)',
  label: 'SuccessfulFilmPlaysAlarmExpression',
  period: cdk.Duration.minutes(1),
  usingMetrics: {
    ['m1']: metricFilter.metric({
      period: cdk.Duration.minutes(1),
      statistic: cloudwatch.Stats.SUM,
    }),
  },
});
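
As a rough mental model of what `FILL(m1, 0)` does to the series, here is a small sketch in plain TypeScript. The `Datapoint` type, the `fill` helper and the sample values are hypothetical; the sketch only models the idea of replacing missing one-minute datapoints with a constant.

```typescript
// A datapoint is the per-minute SUM of the metric; `undefined` means there is
// no data for that minute (for example, the Lambda function was never invoked).
type Datapoint = number | undefined;

// Sketch of FILL(m1, 0): replace every missing datapoint with a constant value.
function fill(series: Datapoint[], value: number): number[] {
  return series.map((point) => (point === undefined ? value : point));
}

// Hypothetical series: traffic stops in minutes 2 and 3.
const m1: Datapoint[] = [3, 1, undefined, undefined, 2];
const filled = fill(m1, 0);

console.log(filled); // [ 3, 1, 0, 0, 2 ]

// With the gaps filled, "fewer than 1 play" is a real breaching datapoint
// rather than an absence of data.
console.log(filled.map((plays) => plays < 1)); // [ false, false, true, true, false ]
```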

The last step is to create a CloudWatch alarm off the back of the math expression as shown below:

// we create the alarm based on the math expression, as we need to ensure that
// missing metrics are filled with 0, to ensure that our alarm triggers quickly
const alarm = metricsMathExpression.createAlarm(this, 'CloudWatchAlarm', {
  alarmName: 'SuccessfulFilmPlaysAlarm',
  alarmDescription: 'Error When Successful Plays Drops Below Threshold',
  threshold: 1, // we want to ensure that we get at least 1 video play per minute
  comparisonOperator: cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
  evaluationPeriods: 1,
  datapointsToAlarm: 1,
  actionsEnabled: true,
  // we want to know when we have no metrics
  treatMissingData: cloudwatch.TreatMissingData.BREACHING,
});
alarm.applyRemovalPolicy(cdk.RemovalPolicy.DESTROY);

With this setup, if we don’t have any metrics for SuccessfulFilmPlay within a 1 minute period we will trigger our alarm and send an email to the team to investigate what is going on.
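
The alarm settings can be read as an "M out of N" rule. The sketch below is a simplified model of that rule for a less-than-threshold alarm, using the same values as the alarm above (`threshold: 1`, `evaluationPeriods: 1`, `datapointsToAlarm: 1`). The `AlarmConfig` type and `evaluateAlarm` helper are hypothetical, and the real CloudWatch evaluation has more nuance (for example around missing data), so treat this as an illustration only.

```typescript
type AlarmState = 'OK' | 'ALARM';

interface AlarmConfig {
  threshold: number;
  evaluationPeriods: number; // N: how many recent datapoints to look at
  datapointsToAlarm: number; // M: how many of them must breach
}

// Simplified LESS_THAN_THRESHOLD evaluation over already-filled datapoints.
function evaluateAlarm(datapoints: number[], config: AlarmConfig): AlarmState {
  const recent = datapoints.slice(-config.evaluationPeriods);
  const breaching = recent.filter((p) => p < config.threshold).length;
  return breaching >= config.datapointsToAlarm ? 'ALARM' : 'OK';
}

const alarmConfig: AlarmConfig = {
  threshold: 1,
  evaluationPeriods: 1,
  datapointsToAlarm: 1,
};

// Hypothetical successful plays per minute, after missing minutes were filled with 0.
const playsPerMinute = [4, 2, 0, 0, 3];

// Evaluate the alarm at the end of each minute.
const states = playsPerMinute.map((_, i) =>
  evaluateAlarm(playsPerMinute.slice(0, i + 1), alarmConfig)
);

console.log(states); // [ 'OK', 'OK', 'ALARM', 'ALARM', 'OK' ]
```

With both values set to 1, a single quiet minute is enough to alarm. Raising `evaluationPeriods` and `datapointsToAlarm` would trade detection speed for fewer false alarms on low-traffic services.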

Trying it out 🎯

If we deploy the example solution, we can hit the endpoints using the Postman collection found in ‘postman/Serverless Film Streams.postman_collection.json’.

We can start by creating a new film using the POST endpoint as shown below with the following payload:

{
  "film": {
    "title": "Inception",
    "description": "A mind-bending thriller about dreams within dreams.",
    "genre": ["Action", "Adventure", "Sci-Fi"],
    "release_date": "2010-07-16",
    "duration": "2:28:00",
    "rating": {
      "average": 4.8,
      "count": 1500
    },
    "cast": [
      {"name": "Leonardo DiCaprio", "role": "Cobb"},
      {"name": "Joseph Gordon-Levitt", "role": "Arthur"}
    ],
    "directors": ["Christopher Nolan"],
    "writers": ["Christopher Nolan"],
    "production_studio": "Warner Bros.",
    "language": "English",
    "subtitles": ["English", "Spanish", "French"],
    "poster_url": "https://example.com/inception_poster.jpg",
    "trailer_url": "https://example.com/inception_trailer.mp4",
    "video_quality": ["HD", "4K"],
    "audio_languages": ["English", "Spanish"],
    "availability": {
      "start_date": "2023-01-01",
      "end_date": "2023-12-31"
    },
    "streaming_info": [
      {
        "provider": "Netflix",
        "url": "https://www.netflix.com/watch/inception",
        "expires_at": "2023-12-31T23:59:59Z"
      },
      {
        "provider": "Amazon Prime",
        "url": "https://www.amazon.com/watch/inception",
        "expires_at": "2023-12-31T23:59:59Z"
      }
    ]
  }
}

An example of us creating a new film which we can subsequently play

This now means that we have a film stored in our database that we can try and play using the following GET call with the ID being based on the film we have just created above:

This GET endpoint is an example of us playing a new stream of the film

We can now test our alarm by invoking our GET endpoint multiple times in a minute which will put the alarm in an ‘OK’ state.

If we then don’t invoke the GET endpoint for a subsequent minute we will see the alarm get triggered due to a lack of successful video plays per minute. We can see this in the graph below:

An example of our alarm moving between alarm and OK status based on activity
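
If you would rather script this test than click through Postman, a sketch along the following lines could generate the traffic. The `FetchLike` type, the `playFilmRepeatedly` helper and the URL are all hypothetical (the real path and film ID come from your own deployment), and a stub stands in for the network so the example runs offline.

```typescript
// Minimal fetch-like signature so the sketch can be exercised with a stub.
type FetchLike = (url: string) => Promise<{ status: number }>;

// Call the play endpoint `times` times in sequence and count the 200 responses.
async function playFilmRepeatedly(
  fetchFn: FetchLike,
  url: string,
  times: number
): Promise<number> {
  let successes = 0;
  for (let i = 0; i < times; i++) {
    const response = await fetchFn(url);
    if (response.status === 200) successes += 1;
  }
  return successes;
}

// Stub standing in for the deployed API so the example runs without a network.
const stubFetch: FetchLike = async () => ({ status: 200 });

// Hypothetical URL: replace with your deployed API Gateway URL and film ID.
const playUrl = 'https://example.execute-api.eu-west-1.amazonaws.com/prod/films/FILM_ID';

playFilmRepeatedly(stubFetch, playUrl, 5).then((count) => {
  console.log(`${count} successful plays`); // 5 successful plays
});
```

Against a real deployment you could pass the global `fetch` available in recent Node.js versions instead of the stub, run the script for a minute to keep the alarm in ‘OK’, then stop it and watch the alarm trigger.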

This means we have successfully created an alarm that will trigger when we don’t have any successful video plays for a period of time which can indicate issues in our overall microservice eco-system.

In the next article, we will compare this method with CloudWatch Anomaly Detection.

Wrapping up 👋🏽

I hope you enjoyed this article, and if you did then please feel free to share it and give feedback!

Please go and subscribe to my YouTube channel for similar content!

I would love to connect with you also on any of the following:

https://www.linkedin.com/in/lee-james-gilmore/
https://twitter.com/LeeJamesGilmore

If you enjoyed the posts please follow my profile Lee James Gilmore for further posts/series, and don’t forget to connect and say Hi 👋

Please also use the ‘clap’ feature at the bottom of the post if you enjoyed it! (You can clap more than once!!)

About me

“Hi, I’m Lee, an AWS Community Builder, Blogger, AWS certified cloud architect, and Global Head of Technology & Architecture based in the UK; currently working for City Electrical Factors (UK) & City Electric Supply (US), having worked primarily in full-stack JavaScript on AWS for the past 6 years.

I consider myself a serverless advocate with a love of all things AWS, innovation, software architecture, and technology.”

*** The information provided is my own personal view and I accept no responsibility for the use of the information. ***

You may also be interested in the following:

Serverless Content 🚀

An index of all of my Serverless content to easily browse in one place, including videos, blog posts and more..

blog.serverlessadvocate.com