June 25, 2021

Domain-Specific Machine Learning Monitoring


To detect when your machine learning service is not behaving as expected, it is often useful to create custom metrics that are specific to your product. I will give you two questions that help you design such metrics:

What do your users expect?

Two years ago I was sitting in a talk by Spotify about their home page personalization when they dropped an important nugget of wisdom. Their personalized home page layout includes different carousels in a user-specific and context-specific order. As a sanity metric, they monitor the rank of a user’s most used carousel. If a new personalization algorithm ranks the user’s favorite carousel low, or if its rank suddenly drops, that indicates a problem.

You can think of your metric as a basic user story or a simple common sense baseline:

As a user, I would like my favorite carousel to be easily accessible (ranked high on the page).
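To make this user story concrete, here is a minimal sketch of how such a rank metric could be computed. It is not Spotify's implementation: the record layout (`history`, `ranking`), the carousel names, and all function names are assumptions made up for illustration.

```python
from collections import Counter
from statistics import mean


def favorite_carousel(history):
    """Return the carousel the user interacted with most often."""
    return Counter(history).most_common(1)[0][0]


def favorite_rank(ranking, favorite):
    """1-based position of the favorite carousel, or None if it is not shown."""
    try:
        return ranking.index(favorite) + 1
    except ValueError:
        return None


def mean_favorite_rank(page_views):
    """Average rank of the favorite carousel over the page views that show it."""
    ranks = [
        favorite_rank(view["ranking"], favorite_carousel(view["history"]))
        for view in page_views
    ]
    shown = [rank for rank in ranks if rank is not None]
    return mean(shown) if shown else None


# Hypothetical page views: past interactions plus the ranking that was served.
page_views = [
    {"history": ["daily_mix", "daily_mix", "podcasts"],
     "ranking": ["daily_mix", "new_releases", "podcasts"]},
    {"history": ["podcasts", "podcasts", "daily_mix"],
     "ranking": ["new_releases", "daily_mix", "podcasts"]},
]
print(mean_favorite_rank(page_views))
```

A rising average rank after a model rollout would be the signal to investigate. Page views where the favorite carousel is missing entirely are skipped here; you may prefer to track them as a separate metric.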

Sometimes the stories can be even more basic and still be effective at detecting problems. At Zalando, one of Europe’s largest e-commerce fashion retailers, we personalized the recommended similar articles on a product page based on the user’s context in real time. If a request didn’t have enough context to be fully personalized, the next best option, an unpersonalized model, was delivered instead. So a simple metric can be the share of fully personalized responses with at least 4 articles:

As a user, I would like to see a recommendation box that is personalized to me.
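A share metric like this is easy to compute from response logs. The sketch below assumes each logged response records which model produced it and which articles were returned; the field names (`model`, `articles`) and the label `"personalized"` are hypothetical, not the actual Zalando schema.

```python
MIN_ARTICLES = 4  # articles needed to fill the recommendation box


def is_fully_personalized(response):
    """True if the personalized model answered with enough articles."""
    return (
        response["model"] == "personalized"
        and len(response["articles"]) >= MIN_ARTICLES
    )


def personalized_share(responses):
    """Share of responses that are fully personalized, or None without traffic."""
    if not responses:
        return None
    return sum(is_fully_personalized(r) for r in responses) / len(responses)


# Hypothetical response log.
responses = [
    {"model": "personalized", "articles": ["a1", "a2", "a3", "a4"]},
    {"model": "personalized", "articles": ["a1", "a2"]},          # too few articles
    {"model": "unpersonalized", "articles": ["a1", "a2", "a3", "a4"]},
    {"model": "personalized", "articles": ["a5", "a6", "a7", "a8", "a9"]},
]
print(personalized_share(responses))
```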

Can you think of basic, common-sense rules for your product that you can measure as a metric?

Tip: If you struggle to come up with metrics from a user perspective, try brainstorming technical user stories and focus on the ones that have an impact on the user experience.

Can you identify a bad user experience?

My second tip for designing domain-specific machine learning monitoring metrics is to look at extremes instead of typical experiences. This approach is common in other areas of monitoring, such as application latency monitoring. To detect problems with the speed of our service, we alert based on the 95th and 99th percentiles instead of the median. When the service slows down, you will see it much sooner in the 95th and 99th percentiles than in the median.
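A small example shows why the tail reacts sooner than the median. The latency numbers are invented for illustration: in the degraded case, 10% of requests become ten times slower while the rest are unchanged.

```python
from statistics import median, quantiles


def percentile(values, p):
    """p-th percentile (1..99) using the standard library's default method."""
    return quantiles(values, n=100)[p - 1]


# Made-up latencies in milliseconds.
baseline = [100] * 100
degraded = [100] * 90 + [1000] * 10  # 10% of requests are now slow

for name, latencies in [("baseline", baseline), ("degraded", degraded)]:
    print(
        name,
        "median:", median(latencies),
        "p95:", percentile(latencies, 95),
        "p99:", percentile(latencies, 99),
    )
```

The median is identical in both cases, so an alert on the median would stay silent, while the 95th and 99th percentiles jump to the slow latency.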

We can take a similar approach to monitor our machine learning service response by asking:

Which service response is probably a bad user experience?

Let’s think of some examples that are most likely a bad user experience:

  • empty responses: Your service replies with an empty response, e.g. because the model cannot give a prediction or the predicted items/artists/songs/etc are filtered out by business rules
  • partly filled responses: Your model gives only a partial response. This happens for models that produce collections, like a recommendation model or a ranker. Anytime you predict fewer items than are needed to fill a user-facing component (e.g. 4 articles to fill a recommendation box, 10 results to fill a Google result page, ...) you can assume that your result looks somewhat bad.
  • a fallback response: Instead of a high-quality answer, you deliver a fallback. A fallback can be a default value, popular/recent items, or even a very basic model.
  • low certainty responses: If your model is not sure about its prediction, e.g. it returns a mid-to-low probability or several classes with nearly the same probability in a single-label classification problem, the response is most likely not of good quality.
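The four categories above can be turned into rates with a small classifier over logged responses. This is only a sketch under assumptions: the response fields (`items`, `is_fallback`, `confidence`), the 4 slots, and the 0.5 confidence cut-off are hypothetical and need to be adapted to your service.

```python
from collections import Counter


def bad_experience_reason(response, slots=4, min_confidence=0.5):
    """Return the first matching bad-experience category, or None if fine."""
    items = response.get("items", [])
    if not items:
        return "empty"
    if len(items) < slots:
        return "partial"
    if response.get("is_fallback", False):
        return "fallback"
    confidence = response.get("confidence")
    if confidence is not None and confidence < min_confidence:
        return "low_certainty"
    return None


def bad_experience_rates(responses, **kwargs):
    """Share of all responses that fall into each bad-experience category."""
    counts = Counter(bad_experience_reason(r, **kwargs) for r in responses)
    total = len(responses)
    return {reason: n / total for reason, n in counts.items() if reason is not None}


# Hypothetical response log.
responses = [
    {"items": [], "confidence": 0.9},
    {"items": ["a", "b"], "confidence": 0.9},
    {"items": ["a", "b", "c", "d"], "is_fallback": True},
    {"items": ["a", "b", "c", "d"], "confidence": 0.3},
    {"items": ["a", "b", "c", "d"], "confidence": 0.9},
]
print(bad_experience_rates(responses))
```

Each response is counted in at most one category, in the order of the checks, so the rates can be added up to a total "bad experience" share.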

It usually makes sense to design the “bad experience” metrics by thinking deeply about the kind of service you provide and talking to someone who knows your users (maybe a product manager?). I am curious about your ideas, so leave me your metric suggestions in the comment section.

Don’t just measure: Alert!

Running these metrics in an analysis notebook is nice, but it won’t give you peace of mind that your service is working as expected RIGHT NOW and at any point in the future. Based on my experience in the recommendation team at Zalando, running many models and services for millions of users, I can confirm that bugs in your inference service code, client code, configurations, inference input data, training data, the model library, or training code happen regularly. You need real-time metrics to detect problems that affect the quality of your machine learning service.

The first step is to create your metric(s) and evaluate them for several weeks. This analysis can be done in a notebook (if possible) or by implementing the metric(s) in a live service. Are they stable? Do they go down at times when you know you had a quality issue?

If a metric is relatively stable over time and confirmed (or at least expected) to go down when an issue occurs, you can alert on it with a reasonable threshold. You define the threshold based on your analysis of the “typical level”. If the metric is a bit noisy, you can clean it up, e.g. by applying smoothing, by changing its definition slightly (e.g. percentage of the total, comparison to the same time yesterday or last week, ...), or by alerting only on big differences.
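As a rough sketch of smoothing plus a threshold, the code below applies a trailing moving average and flags points that fall more than a given fraction below the typical level. The window size, the 10% tolerance, and the metric values are assumptions chosen for illustration; take yours from the weeks of analysis described above.

```python
from collections import deque
from statistics import mean


def moving_average(values, window):
    """Trailing moving average; early points use the values seen so far."""
    recent = deque(maxlen=window)
    smoothed = []
    for value in values:
        recent.append(value)
        smoothed.append(mean(recent))
    return smoothed


def alerts(values, typical_level, max_relative_drop=0.1, window=3):
    """Flag each point whose smoothed value is too far below the typical level."""
    threshold = typical_level * (1 - max_relative_drop)
    return [s < threshold for s in moving_average(values, window)]


# Made-up metric values: one isolated dip, then a sustained drop.
share = [0.80, 0.81, 0.60, 0.80, 0.79, 0.55, 0.50, 0.52]
print(alerts(share, typical_level=0.80))
```

With these numbers the isolated dip is smoothed away, while the sustained drop at the end triggers the alert. The price of smoothing is a delay of a few data points before the alert fires.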

A typical reason for noisy domain-specific metrics is bot traffic. If bot traffic is included in your metrics and bots have a different input distribution than human users, the model metrics can behave irregularly. If it is not possible to remove bots during metric collection, it can help to restrict alerting to times during which the “bot background noise” does not dominate the metrics, e.g. during the daytime.
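Restricting alerts to a time window takes only a few lines. The sketch below assumes a daytime window from 08:00 to 22:00 in the timezone of your users; these hours and the function names are hypothetical and should come from your own traffic analysis.

```python
from datetime import datetime, time


def in_alerting_window(timestamp, start=time(8, 0), end=time(22, 0)):
    """True if the timestamp falls inside the daily alerting window."""
    return start <= timestamp.time() < end


def should_alert(timestamp, value, threshold):
    """Alert only when the metric is below threshold during the window."""
    return in_alerting_window(timestamp) and value < threshold


# Same low metric value, once at night and once during the day.
night = datetime(2021, 6, 25, 3, 0)
day = datetime(2021, 6, 25, 14, 0)
print(should_alert(night, value=0.5, threshold=0.72))
print(should_alert(day, value=0.5, threshold=0.72))
```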

Now you can put the alert on a dashboard and hook it up to automated notifications.

Happy monitoring!