May 29, 2022

Experience report: Data Version Control (DVC) for Machine Learning Projects

This post was written by Vinay Patel, a senior software engineer at GoCardless.

Context


At GoCardless, we use Machine Learning to prevent fraud and reduce payment failures for our merchants using features like Success+. Data Scientists and Machine Learning Engineers—11 of us in total—work alongside different teams to support these offerings.

Data Version Control (DVC) makes our ML processes consistent and brings in industry-standard best practices. It certainly improves the onboarding experience for new Data Scientists—essential for our rapidly growing teams. DVC serves various use cases, but we use it primarily to version data and models—as Git for data—and build ML pipelines.

We find the ML tooling space quite noisy, and the industry hasn’t converged on a standard ML tooling stack. Hence, we’re sharing some highlights of our experience with DVC to help you evaluate whether it’s a good fit in your setting.

What Went Well?

Git For Data

The ability to trace back the origins of code and data—model provenance—is essential for reproducing and explaining ML artefacts. Before DVC, we didn’t have automated data version control.

DVC’s data versioning is intuitive, thanks to its Git-like model and user experience. We’ve paired it with pre-commit, a tool to manage Git hooks, so that the DVC workflow happens transparently, in sync with the typical Git workflow. DVC caches our datasets, models and everything in between in a GCS bucket.
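
The idea behind "Git for data" can be sketched as content addressing: a small pointer containing a hash is committed to Git, while the actual bytes live in a cache or bucket keyed by that hash. The snippet below is a plain-Python illustration of that idea, not DVC’s implementation. The function names, the file name and the two-character directory layout are assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_key(digest: str) -> str:
    """Hypothetical cache layout: the first two hex characters form a directory."""
    return f"{digest[:2]}/{digest[2:]}"


with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "train.csv"  # hypothetical dataset
    dataset.write_bytes(b"id,label\n1,0\n")
    v1 = content_hash(dataset)

    # Changing the data changes the hash, so the pointer committed to Git
    # changes too, and both versions can coexist in the cache.
    dataset.write_bytes(b"id,label\n1,0\n2,1\n")
    v2 = content_hash(dataset)
```

Because the key depends only on the content, identical files are stored once, and checking out an older Git commit is enough to identify which version of the data it refers to.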

DVC does this one thing so well that we lean on it even for atypical use cases. An example is peer review of Jupyter notebooks. Output cells are where we might inadvertently leak sensitive data, so we don’t push notebooks as-is to GitHub. Instead, reviewers see each notebook as a Python script without output cells (yay, Jupytext!), and we version the notebook in its entirety using DVC.
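
To show why the reviewed artefact cannot leak cell output, here is a minimal sketch that removes outputs from a notebook’s JSON using only the standard library. It is not what Jupytext does internally (Jupytext converts the notebook to a script); the `strip_outputs` helper and the sample notebook are hypothetical.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 style notebook with code outputs removed."""
    cleaned = json.loads(json.dumps(notebook))  # cheap deep copy of JSON data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Hypothetical notebook with one markdown cell and one executed code cell.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Review me"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["df.head()"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["customer_id ..."]}
            ],
        },
    ],
}

reviewable = strip_outputs(notebook)
```

The source of every cell survives, so reviewers can still comment on the code, while the original notebook with its outputs stays out of GitHub.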

Intuitive and conventional design choices

Anyone familiar with Git can pick up DVC’s data version control.

As for DVC’s lightweight pipelining: to benefit from DVC’s directed acyclic graph (DAG) support and caching, we broke our ML processes up into stages with explicit inputs, outputs and dependencies. Coming from a world where our ML processes were executed as long-running Python scripts, re-modelling those as DVC pipelines is definitely a step in the right direction.
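
As a rough illustration of what "stages with explicit inputs, outputs and dependencies" buys you, the sketch below infers a DAG from declared files and derives a valid execution order. The stage names and file paths are hypothetical, and the code mirrors the concept rather than DVC’s own scheduler.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stages: each declares the files it reads (deps) and writes (outs).
stages = {
    "extract": {"deps": ["extract.py"], "outs": ["data/raw.csv"]},
    "featurise": {"deps": ["featurise.py", "data/raw.csv"], "outs": ["data/features.csv"]},
    "train": {"deps": ["train.py", "data/features.csv"], "outs": ["model.pkl"]},
    "evaluate": {
        "deps": ["evaluate.py", "model.pkl", "data/features.csv"],
        "outs": ["metrics.json"],
    },
}


def build_graph(stages: dict) -> dict:
    """Map each stage to the set of stages that produce one of its deps."""
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    return {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }


graph = build_graph(stages)
# Predecessors always come before the stages that consume their outputs.
order = list(TopologicalSorter(graph).static_order())
```

Nothing in the stage definitions states an order directly; it falls out of which stage writes a file that another stage reads.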

As we’ll see in the next section, we’ve outgrown DVC’s pipelining capabilities. Thankfully, DAGs are a common pattern, so this effort remains worthwhile even if we upgrade to another pipelining tool.

First-class support for params and metrics

We like how following some conventions set out by DVC brings so many benefits, e.g. defining params and metrics keys in dvc.yaml.

Params give fine-grained control over cache and a clear view of which configuration affects which parts of the pipeline, e.g. changing hyper-parameters should only fit and evaluate the model, and shouldn’t recreate input datasets.
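
The cache behaviour described above can be sketched as follows: each stage is fingerprinted using only the parameters it declares, so a change to a hyper-parameter invalidates the training stage but not the extraction stage. The parameter names, the stage names and the hashing scheme are assumptions for illustration, not DVC internals.

```python
import hashlib
import json

# Hypothetical parameters, as they might appear in a params file.
params = {
    "extract": {"start_date": "2022-01-01"},
    "train": {"learning_rate": 0.1, "max_depth": 6},
}

# Hypothetical declaration of which parameter keys each stage depends on.
stage_params = {
    "extract": ["extract.start_date"],
    "train": ["train.learning_rate", "train.max_depth"],
}


def lookup(config: dict, dotted_key: str):
    """Resolve a dotted key such as 'train.max_depth' in a nested dict."""
    value = config
    for part in dotted_key.split("."):
        value = value[part]
    return value


def fingerprint(stage: str, config: dict) -> str:
    """Hash only the parameter values this stage declares."""
    selected = {key: lookup(config, key) for key in stage_params[stage]}
    payload = json.dumps(selected, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


def stale_stages(old: dict, new: dict) -> list:
    """Stages whose declared parameters differ between two configurations."""
    return [s for s in stage_params if fingerprint(s, old) != fingerprint(s, new)]


# Tune one hyper-parameter and see which stages need to re-run.
tuned = {"extract": dict(params["extract"]), "train": {**params["train"], "max_depth": 8}}
changed = stale_stages(params, tuned)
```

This sketch covers only the direct parameter check. In a real pipeline, stages downstream of a re-run stage (evaluation, for instance) would also re-run because their input files changed.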

Making DVC aware of model metrics automates tracking of model performance alongside code and data versioning. Now, we’re able to dvc metrics diff just as easily as we git diff.
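
Conceptually, a metrics diff compares the metrics recorded at two revisions. The sketch below shows that comparison on two plain dictionaries; the metric names and values are made up for the example, and the output shape is an assumption rather than the format DVC prints.

```python
def metrics_diff(old: dict, new: dict) -> dict:
    """Return {metric: (old, new, change)} for metrics present in either run."""
    rows = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        # A change is only defined when the metric exists on both sides.
        change = None if before is None or after is None else round(after - before, 6)
        rows[name] = (before, after, change)
    return rows


# Made-up metrics for two hypothetical revisions of a model.
baseline = {"auc": 0.91, "recall": 0.72}
candidate = {"auc": 0.93, "recall": 0.70, "precision": 0.81}

diff = metrics_diff(baseline, candidate)
```

Because the metrics files are versioned alongside code and data, the two inputs to such a comparison can always be traced back to specific commits.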

What Didn’t Go Well?

Pipeline-support leaves much to be desired

We started feeling this soon after adopting DVC’s lightweight pipelines. As the user community has also highlighted, some features are glaringly absent, namely:

  • Support for parallel runs: running stages or entire pipelines in parallel (issue), or each instance of a foreach in parallel (issue).
  • Support for incremental processing: specifically, the issue is that DVC treats stage outputs atomically. If a huge dataset is extracted as part of a stage’s run, the next run will start by wiping it out completely, and there’s no support for appending to it (issue).
  • Allowing multiple execution environments like development and testing: basically, there’s one dvc.lock per repo, which isn’t aware of the execution environment. Suppose that, during model development, we’d like to use the full training dataset, but just a subset of it on CI to test the pipeline end-to-end. Though it’s possible to run these separate configurations by making local modifications, it’s not possible to check them in to version control because dvc.lock captures only one use case.
  • Dynamic pipeline definition: DVC pipelines are defined in YAML, and DVC provides add-ons such as templating and foreach to make them a bit more powerful. Having stretched templating and foreach to the limit, we’ve decided to move away from defining ML pipelines as YAML, simply because we need more expressiveness and succinctness to prevent these pipeline definitions from becoming a mess.

That’s not to say it’s impossible to work around these. We’ve built tweaks around issues and missing features, but that hurts the pipeline’s readability. For example, the DAG’s order is implicitly defined by one stage’s output being a dependency of subsequent stages, so for stages without logical dependencies we’ve had to add dummy log_dir_exists.out files just to enforce some order.
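
The dummy-file workaround can be sketched in a few lines: a placeholder file is declared as an output of one stage and a dependency of the other, which creates an edge that carries no data. The stage definitions and helper names below are hypothetical; only the log_dir_exists.out file name comes from our pipelines.

```python
import copy


def force_order(first: dict, second: dict, marker: str) -> tuple:
    """Return copies of two stages where `second` must run after `first`.

    `first` declares the marker as an output and `second` declares it as a
    dependency, so the ordering is expressed through a file that holds no data.
    """
    first, second = copy.deepcopy(first), copy.deepcopy(second)
    first["outs"].append(marker)
    second["deps"].append(marker)
    return first, second


def runs_after(later: dict, earlier: dict) -> bool:
    """True if any output of `earlier` is a dependency of `later`."""
    return bool(set(earlier["outs"]) & set(later["deps"]))


# Hypothetical stages with no shared files, so no order is implied.
make_log_dir = {
    "cmd": "mkdir -p logs && touch log_dir_exists.out",
    "deps": [],
    "outs": [],
}
train = {"cmd": "python train.py", "deps": ["data/features.csv"], "outs": ["model.pkl"]}

ordered_before = runs_after(train, make_log_dir)
linked_log_dir, linked_train = force_order(make_log_dir, train, "log_dir_exists.out")
ordered_after = runs_after(linked_train, linked_log_dir)
```

The edge works, but a reader of the pipeline has to know that the marker file exists purely for ordering, which is the readability cost mentioned above.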

Our stance

We’d love to continue using, and recommending, DVC for data version control, where it excels and is complete. We can recommend DVC’s pipelining only for basic pipelines, with the caveat that you might soon outgrow it.

As with most tooling choices, it depends on the context. We’ve shared our context and experience, and would love to hear how it resonates with you @GoCardlessEng.