June 25, 2021

How to collaborate remotely as a data scientist

The pandemic has been brutal on collaboration, no matter what sector you look at. Millions of people suddenly found themselves turned into remote workers when coronavirus landed. That included a lot of data scientists.

Of course, another shift was already underway. Machine learning teams were growing larger, and ML projects were getting more complex. Teams were becoming diverse and cross-functional, pulling together data modellers, visualisation experts, software engineers, product managers, designers, and so on.

That’s a sure sign that MLOps is maturing, but it has also made collaboration on machine learning projects more complex. It’s not that remote working was a new phenomenon; it’s that orchestrating activity across bigger teams gets harder when they aren’t co-located.

So how do you deal with it? Remote working is with us for the foreseeable future, and projects still need to advance. To get some guidance, we asked MLOps consultant and entrepreneur Luke Marsden to give us his view of the landscape, with some advice on how MLOps teams can overcome the challenges of distance and disconnection.

Two ways teams typically work together

Luke Marsden: When you boil it down to basics, there are really two fundamental modes of collaborating between different people doing work: synchronous and asynchronous.

Synchronous collaboration is when people are sitting in a room together, interrupting each other when they have a question or need to get something done urgently. With machine learning, they might even share a text editor and take turns at the keyboard. They would be working on the same data set in exactly the same environment, just time-slicing it.

The other approach is to work asynchronously: different people work on different copies of things in different places. The problem with an asynchronous approach is that you then need processes to cope with conflicts, for example handling a merge conflict in Git.

If we look at how DevOps teams do it, they use tools like GitLab and Bitbucket and all of the tooling around that. The way that works is that you can fork someone else’s project, or make a branch from the master branch and make some changes there. While you’re making changes in your branch, you’re not treading on anyone else’s toes. You then propose your changes back to the master branch.

That’s commonly known as Gitflow, and it’s been very successfully used in pretty much every DevOps team on the planet.
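The branch-and-merge workflow Marsden describes can be sketched end to end in a throwaway repository. Assumes git is installed; the repository, branch, and file names here are purely illustrative:

```shell
# A throwaway demonstration of the branch-and-merge workflow.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git checkout -q -b master
git config user.email "demo@example.com"
git config user.name "Demo"

echo "baseline model" > train.py
git add train.py
git commit -q -m "Initial commit on master"

# Work on your own branch so you don't tread on anyone else's toes
git checkout -q -b tune-learning-rate
echo "lr = 0.01" >> train.py
git commit -q -am "Try a lower learning rate"

# Propose the changes back to master (a local merge stands in for the
# pull/merge request you would open on GitLab or Bitbucket)
git checkout -q master
git merge -q tune-learning-rate
```

In practice the final step would happen through a pull or merge request with review, rather than a local merge, but the mechanics are the same.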

Challenges of asynchronous collaboration in machine learning

Luke Marsden: Taking an asynchronous approach to collaboration in machine learning has numerous challenges. The first is that a lot of data scientists use Jupyter notebooks, and they don’t version very well.

Another challenge is that data versioning and data sharing are difficult in a collaboration context, because you can’t easily put your data in Git. So what we find is that people don’t really bother with versioning; they just rename files or folders with names like ‘final’ and ‘final-final’, and you have all these funny little strings that refer to an action that’s been taken. When you try to share those folders around, it becomes quite messy.
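As a toy illustration of how data versioning tools get around this, here is a minimal content-addressed snapshot: the bulky data goes into a cache keyed by its hash, and only a small pointer file, which can live in Git, records which version is current. The `snapshot`/`restore` helpers and the pointer format are invented for this sketch; they are not the actual format used by DVC or any other tool:

```python
# Toy content-addressed data versioning: store data by hash, commit only
# a small pointer file to Git. Illustrative only.
import hashlib
import json
from pathlib import Path

def snapshot(data_path: Path, cache_dir: Path) -> Path:
    """Copy the data into a hash-addressed cache and write a pointer file."""
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / digest).write_bytes(data_path.read_bytes())
    pointer = data_path.with_suffix(data_path.suffix + ".ptr")
    pointer.write_text(json.dumps({"sha256": digest, "path": data_path.name}))
    return pointer  # this small file is what you'd commit to Git

def restore(pointer: Path, cache_dir: Path, dest: Path) -> None:
    """Fetch the version recorded in the pointer file back from the cache."""
    digest = json.loads(pointer.read_text())["sha256"]
    dest.write_bytes((cache_dir / digest).read_bytes())
```

Instead of ‘final’ and ‘final-final’ folders, every version is identified unambiguously by its content hash, and the pointer file merges and diffs like any other small text file.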

Then there are problems with metric and parameter tracking. You don’t have to worry about that so much in software, but in machine learning, you have to keep track of which parameters you use and which accuracy scores you got.

The thing is, models aren’t simply green or red, working or broken; that’s too reductive. They’re usually somewhere in the middle, so you need metrics like an accuracy score to tell you how good a model is against a specific test set.

Now, you could put that in a Git commit message, but then you’re relying on humans to remember to do it. You really need tools that help you do the tracking.
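Reduced to its essence, what such tracking tools do is record the parameters and metrics for each run somewhere queryable, rather than in commit messages. The `log_run`/`best_run` helpers and the JSON layout below are invented for illustration; real tools like MLflow or Weights & Biases add UIs, comparisons, and artifact storage on top of the same idea:

```python
# Toy experiment tracker: one JSON record per run, holding the parameters
# used and the metrics achieved. Illustrative only.
import json
import time
from pathlib import Path

def log_run(log_dir: Path, params: dict, metrics: dict) -> Path:
    """Append one experiment run to a directory of JSON records."""
    log_dir.mkdir(parents=True, exist_ok=True)
    run = {"timestamp": time.time(), "params": params, "metrics": metrics}
    out = log_dir / f"run-{len(list(log_dir.glob('run-*.json')))}.json"
    out.write_text(json.dumps(run, indent=2))
    return out

def best_run(log_dir: Path, metric: str) -> dict:
    """Find the run with the highest value for a given metric."""
    runs = [json.loads(p.read_text()) for p in log_dir.glob("run-*.json")]
    return max(runs, key=lambda r: r["metrics"][metric])
```

The point is that the record is written automatically at training time, so nobody has to remember to copy parameters and accuracy scores into a commit message afterwards.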

Other challenges arise when you’re using a combination of development environments. You might have a GPU or an IPU in the machine on your desk, or you might be using machines in the cloud. With Git, it’s pretty easy to switch from your local machine to a machine in the cloud or in your data centre, but doing it effectively with machine learning is a lot harder, because you’ve also got data that you need to move around and metrics and parameters to keep track of.

Tools to consider

Marsden points out that some emerging tools address the challenges of collaboration and remote working specifically for machine learning.

Luke Marsden: It’s actually a very exciting space, and there’s loads of innovation happening, with lots of new tools out there.

In the experiment tracking space, MLflow is quite strong. Weights & Biases is very strong at comparing relationships between metrics and hyperparameters. DVC (Data Version Control) looks promising for data and project versioning, and who can forget neptune.ai. And then there are a lot of emerging open-source alternatives to Jupyter notebooks worth looking into.