Willem is currently a tech lead at Tecton where he leads the development of Feast, an open-source feature store for machine learning. Previously he led the ML platform team at Gojek, the Southeast Asian decacorn, which supports a wide variety of models and handles over 100 million orders every month. His main focus areas are building data and ML platforms, allowing organizations to scale machine learning and drive decision making. In a previous life, Willem founded and sold a networking startup.
To be successful with machine learning, you need to do more than just monitor your models at prediction time. You also need to monitor your features and prevent a “garbage in, garbage out” situation. However, it’s extremely hard to detect problems with the data being served to your models. This is especially true for real-time production ML applications like recommender systems or fraud detection systems. In this post, we’ll explore what feature monitoring for real-time machine learning entails and the common obstacles you will face. (Stay tuned for Part 2 where we will dive into how Tecton can help you solve some of these challenges.)
What is feature monitoring for machine learning?
In machine learning, a feature is an input signal to a predictive model. Typically, a feature is a transformation on raw data. While it is important to monitor the raw data that is used to create features, it is even more critical to monitor the feature values after they have been transformed, as this is the data that the model will actually use.
Raw event data is transformed into features. To monitor features, you want to be able to observe and track the feature values post-transformation.
Feature monitoring can be grouped into two classes: monitoring features at the value (row) level, and monitoring aggregations of features, which are referred to as either metrics or statistics.
Monitoring individual feature values
The following are examples of monitoring that can be performed at the value or row level for machine learning features:
- Nulls: Checking for null or missing values in individual feature values.
- Types: Ensuring that feature values are of the correct data type (e.g., integer, string, complex types, etc.).
- Ranges: Checking that feature values fall within a specified range or set of acceptable values.
- Encoding: Verifying that categorical values are properly encoded and not causing errors in the model.
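As a rough illustration of the value-level checks listed above, here is a minimal Python sketch. The feature names, bounds, and allowed categories in `EXPECTATIONS` are hypothetical and exist only for this example; a real system would load such rules from its own feature definitions.

```python
from typing import Any, Dict, List

# Hypothetical per-feature expectations; names and bounds are illustrative only.
EXPECTATIONS: Dict[str, Dict[str, Any]] = {
    "order_amount": {"type": float, "min": 0.0, "max": 10_000.0},
    "payment_method": {"type": str, "allowed": {"card", "cash", "wallet"}},
}


def validate_row(row: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable violations for one feature row."""
    violations: List[str] = []
    for name, rule in EXPECTATIONS.items():
        value = row.get(name)
        # Null check: a missing key and an explicit None are treated the same.
        if value is None:
            violations.append(f"{name}: null or missing")
            continue
        # Type check: stop here, since range checks on a wrong type are meaningless.
        if not isinstance(value, rule["type"]):
            violations.append(
                f"{name}: expected {rule['type'].__name__}, got {type(value).__name__}"
            )
            continue
        # Range checks for numeric features.
        if "min" in rule and value < rule["min"]:
            violations.append(f"{name}: {value} below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            violations.append(f"{name}: {value} above maximum {rule['max']}")
        # Encoding check: categorical values must come from a known set.
        if "allowed" in rule and value not in rule["allowed"]:
            violations.append(f"{name}: unexpected category {value!r}")
    return violations


if __name__ == "__main__":
    print(validate_row({"order_amount": -1.0, "payment_method": "crypto"}))
```

Returning a list of violations rather than raising on the first failure lets the caller decide whether to drop the row, substitute a default, or just count the failure.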
Monitoring feature metrics
The following are examples of monitoring that require aggregations, by looking at many feature values over time:
- Data drift detection: Tracking changes in the distribution of feature values over time and identifying when there are significant shifts that could impact model performance.
- Cardinality: Checking the number of unique values in each feature to ensure that there are not too many, which could impact model performance.
- Outlier detection: Identifying and removing outlying values that could negatively impact model performance.
- Skew detection: Identifying when the distribution of values in training is different from what is being served to models at inference time.
- Correlation analysis: Checking for correlations between features to ensure that they are providing complementary information to the model.
- Feature relevance analysis: Evaluating the importance of each feature in relation to the target variable and removing any redundant or irrelevant features.
- Data transformation: Tracking the effectiveness of data transformations on the raw data and identifying when they are no longer producing meaningful results.
- Lag: Ensuring that the delay in features being served to models online for inference is representative of the delay during training.
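To make the idea of aggregate metrics more concrete, the following sketch computes a few simple statistics for one feature and compares a training sample against a serving sample. The function names and thresholds are assumptions chosen for illustration, not recommended defaults, and the comparison is intentionally much simpler than a real skew detector.

```python
import math
from statistics import mean
from typing import Dict, List, Optional, Sequence


def feature_metrics(values: Sequence[Optional[float]]) -> Dict[str, float]:
    """Compute simple aggregate metrics over a batch of feature values."""
    present = [v for v in values if v is not None]
    total = len(values)
    return {
        "count": total,
        "null_rate": (total - len(present)) / total if total else 0.0,
        "cardinality": len(set(present)),
        "mean": mean(present) if present else math.nan,
    }


def compare_metrics(
    train: Dict[str, float],
    serve: Dict[str, float],
    max_null_rate_delta: float = 0.05,  # illustrative threshold
    max_mean_shift: float = 0.25,  # illustrative threshold (relative change)
) -> List[str]:
    """Return the names of metrics that differ too much between training and serving."""
    alerts: List[str] = []
    if abs(serve["null_rate"] - train["null_rate"]) > max_null_rate_delta:
        alerts.append("null_rate")
    # Relative shift of the mean; skipped when the training mean is zero.
    if train["mean"] != 0:
        shift = abs(serve["mean"] - train["mean"]) / abs(train["mean"])
        if shift > max_mean_shift:
            alerts.append("mean")
    return alerts


if __name__ == "__main__":
    train = feature_metrics([1.0, 2.0, 3.0, 4.0])
    serve = feature_metrics([4.0, None, 6.0, None])
    print(compare_metrics(train, serve))
```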
In addition to monitoring the proper functioning of your production infrastructure, it is crucial to regularly assess your data for potential problems, especially when business decisions are being driven by models.
Common data quality challenges for real-time machine learning
Challenge #1: Volatile dependencies on analytics teams
With platforms like Snowflake and BigQuery, machine learning (ML) teams naturally want to start feature development based on data that already exists within the organization. This data is often produced by the organization’s analysts and business intelligence (BI) teams and can provide valuable insights for ML, especially during development.
However, these upstream teams often have their own goals and aren’t focused on creating reliable data for production ML, leading to the following problems:
- Changes in the upstream schema. If the data that the ML system relies on is stored in a database or other data repository, changes to the schema of that data can cause the downstream ML system to break. For example, if a column is removed from an upstream table, the ML system may no longer be able to access the data it needs to make predictions.
- Changes in the data itself. Even if the schema remains the same, changes to the data itself can also cause problems for the ML system. For example, if a business starts using new categories for its products, the ML system may not be able to accurately predict the demand for those products.
- Changes in the business or industry. The ML system may also be affected by changes in the business or industry it is operating in. For example, if a business introduces a new product or service, the ML system may not be able to accurately integrate that product or service into its predictions.
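The first two problems above can be caught with fairly simple guards. The sketch below assumes you can obtain the upstream table's columns and types as a plain dictionary; the helper names `schema_diff` and `unseen_categories` are hypothetical and not part of any particular tool.

```python
from typing import Dict, Iterable, List, Set


def schema_diff(expected: Dict[str, str], observed: Dict[str, str]) -> List[str]:
    """Compare the schema the ML pipeline expects with what upstream provides."""
    problems: List[str] = []
    for column, dtype in expected.items():
        if column not in observed:
            problems.append(f"missing column: {column}")
        elif observed[column] != dtype:
            problems.append(f"type changed: {column} {dtype} -> {observed[column]}")
    return problems


def unseen_categories(known: Set[str], values: Iterable[str]) -> Set[str]:
    """Return category values that were not present when the model was trained."""
    return set(values) - known


if __name__ == "__main__":
    expected = {"product_id": "string", "price": "double"}
    observed = {"product_id": "string", "price": "string"}
    print(schema_diff(expected, observed))
    print(unseen_categories({"books", "toys"}, ["books", "garden"]))
```

Running a check like this before each feature computation job turns a silent upstream change into an explicit failure that can be triaged.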
Challenge #2: Computing & validating feature metrics
Computing reliable feature metrics for real-time machine learning can be a complex task due to the need for data collection at various points within the feature computation pipeline.
In order to accurately monitor features, it is necessary to define and monitor features in at least four different areas:
- Batch feature monitoring involves computing metrics on features that have been computed offline.
- Streaming feature monitoring involves computing metrics on features that are computed in a stream.
- Training feature monitoring involves computing metrics on training datasets.
- Serving feature monitoring involves computing metrics on the features served to models for online prediction.
Each of these areas presents its own challenges and considerations, such as the need to:
- Handle temporal joins and align timestamps
- Compute feature metrics on unbounded data for streaming features
- Compute metrics without affecting serving latency
- Handle metric compute at scale for batch features
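For the unbounded-data case, one common approach is to keep running statistics that are updated one value at a time, so that no history needs to be stored. The sketch below uses Welford's online algorithm for mean and variance; it is one possible technique rather than a description of how any specific product computes streaming metrics, and the class name is hypothetical.

```python
class RunningStats:
    """Incrementally track count, mean, and sample variance (Welford's algorithm)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        """Fold one new feature value into the running statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance; defined as 0.0 until at least two values are seen."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


if __name__ == "__main__":
    stats = RunningStats()
    for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        stats.update(value)
    print(stats.count, stats.mean, stats.variance)
```

Because each update is constant time and constant memory, this kind of accumulator can run inside a stream processor without buffering events.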
Furthermore, defining and applying validation rules in these four contexts can also be challenging. For example, rules must be able to handle different data sources and contexts, such as a Spark-based job or a Go-based feature server. They must also be performant enough, since real-time ML systems require low latency in order to function effectively. Additionally, rules must be able to take some action if a metric value spikes, such as logging an incident or sending alerts.
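A validation rule that takes an action on a metric spike could be sketched roughly as follows. `MetricRule` and its fields are hypothetical names for this example; the `on_breach` callback stands in for whatever alerting or incident system is actually in use.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class MetricRule:
    """A threshold rule on a single metric, with an action to run on breach."""

    metric: str
    threshold: float
    on_breach: Callable[[str, float], None]  # e.g. log an incident or send an alert

    def check(self, metrics: Dict[str, float]) -> bool:
        """Run the action and return True if the metric exceeds the threshold."""
        value = metrics.get(self.metric)
        if value is not None and value > self.threshold:
            self.on_breach(self.metric, value)
            return True
        return False


if __name__ == "__main__":
    rule = MetricRule(
        metric="null_rate",
        threshold=0.05,
        on_breach=lambda name, value: print(f"ALERT: {name}={value}"),
    )
    rule.check({"null_rate": 0.2})
```

Keeping the rule definition separate from the action makes it easier to apply the same rule in a batch job and in a serving path, where the appropriate action may differ.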
Challenge #3: Limitations of current tools
Operational monitoring solutions are not well suited to monitoring feature data in real-time machine learning. These tools, such as Prometheus, are designed for monitoring infrastructure and do not work well with the short-lived, finite jobs used to process batch data. Additionally, these tools are purpose-built to monitor systems at processing time and do not allow for control over event time, which is necessary for computing and persisting metrics on historical features.
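To illustrate what control over event time means in practice, the sketch below groups feature values by the hour in which the underlying event occurred, regardless of when the job that processes them happens to run. The function name and the `(event_time, value)` record shape are assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Optional, Tuple


def null_rate_by_event_hour(
    records: Iterable[Tuple[datetime, Optional[float]]],
) -> Dict[datetime, float]:
    """Compute the null rate per event-time hour for (event_time, value) records."""
    totals: Dict[datetime, int] = defaultdict(int)
    nulls: Dict[datetime, int] = defaultdict(int)
    for event_time, value in records:
        # Bucket by when the event happened, not by when this code runs.
        bucket = event_time.replace(minute=0, second=0, microsecond=0)
        totals[bucket] += 1
        if value is None:
            nulls[bucket] += 1
    return {bucket: nulls[bucket] / totals[bucket] for bucket in sorted(totals)}


if __name__ == "__main__":
    records = [
        (datetime(2023, 1, 1, 10, 5), 1.0),
        (datetime(2023, 1, 1, 10, 40), None),
        (datetime(2023, 1, 1, 11, 15), 2.0),
    ]
    print(null_rate_by_event_hour(records))
```

A backfill job run today over last month's data would still attribute each metric to the hour the events occurred, which is what makes historical feature metrics comparable over time.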
Other tools like Great Expectations are optimized for data science or notebook-based use cases and are not designed to run in a production setting. As a result, they add unacceptable latency to real-time ML systems and are cumbersome or impossible to integrate into a production stack.
Challenge #4: Understanding and detecting data drift
Features can experience drift, both temporally as well as in the form of a skew between training and serving. This can lead to inaccurate predictions and suboptimal performance. Detecting drift is difficult because surface-level metrics and statistics may not reveal the underlying changes in the data distribution. Furthermore, real-world events or shifts in data patterns can trigger false alarms, leading to unnecessary investigations.
Addressing drift is a complex and time-consuming process. Visualizations and statistical algorithms can help identify the factors and data points that are contributing to the shift in feature distribution. It’s also often necessary to implement more advanced algorithms, such as concept drift or feature-level drift detection, to detect drift reliably.
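As one example of a statistical drift check, the sketch below computes the Population Stability Index (PSI) between a reference sample and a current sample of a numeric feature. PSI is a widely used heuristic, but it is only one option and is not necessarily what any given platform uses. The bin count and the small `eps` floor are illustrative choices, and both samples are assumed to be non-empty.

```python
import math
from typing import List, Sequence


def psi(
    expected: Sequence[float],
    actual: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Population Stability Index between a reference and a current sample."""
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the reference range; fall back to 1.0 if all values are equal.
    width = (hi - lo) / bins or 1.0

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the reference range are clamped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor each proportion at eps so empty bins do not cause log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


if __name__ == "__main__":
    reference = [float(v) for v in range(100)]
    shifted = [v + 50.0 for v in reference]
    print(psi(reference, reference), psi(reference, shifted))
```

A PSI of zero means the binned distributions are identical. Larger values indicate a larger shift, and any alerting threshold should be tuned per feature rather than taken as a universal constant.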
Feature monitoring for real-time machine learning is a crucial and exciting challenge—and we are investing heavily in solving it at Tecton. Stay tuned for the next post in this series, where I’ll write about how you can overcome the common challenges outlined in this post when monitoring features for real-time ML.
If you are interested in learning more about how Tecton is tackling this important problem, please feel free to reach out to us at firstname.lastname@example.org! Or if you’d like to learn more about Tecton’s capabilities, check out the full Tecton demo and Q&A.