June 20, 2021

Why data scientists should know data engineering

With the rush to embrace machine learning, data engineering has turned out to be an essential piece of the ML puzzle, one that data scientists increasingly rely on as a prerequisite for success.

With their expertise in programming, data engineers create the data pipelines that data scientists use to feed their machine learning models. To an outsider, creating a data pipeline might sound mundane or trivial, but it can involve weaving together anywhere from ten to thirty different data technologies.

The data engineer is also the person who selects the tools to do the job. He or she will have the in-depth knowledge needed to assess the various technologies and frameworks, and the know-how to combine them to create solutions that enable a company’s digital business model.

When it comes to separating the data engineer and data scientist roles, the line can be a little blurry. There’s overlap between the roles — and that’s a good thing. Both need to be able to speak the other’s language and collaborate effectively to reach the same business objective. There should be a mutual understanding of their respective decision-making processes, dependencies, and limitations to make the process of building ML models and applications successful.

On balance, we’d say it’s becoming even more important for data scientists to take on more data engineering knowledge. Knowing more about the programming side can help them clarify what they need from data engineers, or even take on some of the pipeline-building tasks themselves.

We’ve taken a sampling of opinion from leading data scientists to get their thoughts on bringing some of data engineering’s programming nous to data science’s math and statistics expertise, and what a new combined role might look like.

From hand tools to power tools

Dan Sullivan, PhD: Enterprise architect and big data expert

‘I would say that for a data scientist, learning data engineering is like moving from a hand saw to a chainsaw. Now you have power tools at your disposal, and you can do so much more.

‘As a data scientist, you have really large volumes of data and want to do something with it like streaming analytics. Of course, you understand the math, can pick the right algorithm, and create a scenario for analysis, but the volume of data is so large that you spend a lot of time moving around files or cleaning them up. That’s where data engineering skills would really come in handy.’

But is it realistic to have one role that handles it all? There’s an argument that data scientists should focus on how to get their models into production and monitor them, rather than being distracted by issues like data cleaning.

‘I think it’s possible to have a good understanding of all of the pieces of the puzzle, but of course, it helps to go deep on one of them. There’s a parallel between our work in data engineering and data science and the world of software engineering, where people talk about being full-stack engineers.’

‘From an ML data engineering perspective, some people are really good with visualization and working with business domain experts to massage the data for the front end. And then there are people focused on back-end services who understand databases and how to move large volumes of data.’

‘If somebody identified as a full-stack engineer, I would never question them, but I’ll be honest — I could never be a full-stack engineer. I would be good at one end or the other or something in between, but not all of it. I feel like that’s the same with data science and data engineering. It’s just too big of a domain for one role to cover it all.’

Learning to speak the same language

Byron Allen: Machine Learning Consultant at Servian

Blogging about just this issue, ML consultant Byron Allen recently wrote that one of the most exciting machine learning developments in recent months has been the growth in collaboration between data scientists and data engineers. In many businesses, they now work together effectively as part of the same team.

‘The genesis of many major challenges in applying ML today, whether that be technical, commercial, or societal, is the imbalance of data over time coupled with the management of ML artefacts.’

A model can perform exceptionally well, he says, but if the underlying data drifts and artefacts aren’t being used to assess performance, models won’t generalise well or update correctly — an increasingly common issue that falls into a fuzzy area inhabited by both data scientists and data engineers.

‘It doesn’t matter if you can create a really good ‘black box’ model: if your input data changes and the model isn’t regularly assessed in the context of what it was built to do, it loses its relevance over time. Tackling that issue is hard because the people that are feeding the data in (engineers) and the people that designed the model (scientists) don’t have the happiest of marriages.’

Navigating those waters, he says, will lead organisations towards a more effective and sustainable application of ML.
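
To make the idea of data drift concrete, here is a minimal sketch of one way a team might check whether the data feeding a model has shifted since training. Allen does not describe a specific method, so the choice of the Population Stability Index (PSI) is our assumption, as are the function name, the bin count, and the 0.25 alert threshold, which is a common rule of thumb rather than a standard.

```python
import math


def population_stability_index(reference, current, bins=10):
    """Compare two numeric samples with the Population Stability Index.

    `reference` is the data the model was trained on and `current` is
    recent input data. Both names are illustrative assumptions.
    """
    lo, hi = min(reference), max(reference)
    # Fall back to a width of 1.0 if every reference value is identical.
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp values that fall outside the reference range.
            idx = max(0, min(bins - 1, idx))
            counts[idx] += 1
        # A small floor avoids taking log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


# Hypothetical samples: the current data has shifted upwards by 50.
training_values = [float(x) for x in range(100)]
recent_values = [float(x + 50) for x in range(100)]

psi = population_stability_index(training_values, recent_values)
DRIFT_THRESHOLD = 0.25  # assumed rule-of-thumb cutoff, not a standard
if psi > DRIFT_THRESHOLD:
    print("Input data has drifted; the model should be reassessed.")
```

A check like this sits squarely in the fuzzy area Allen describes: the engineers who own the pipeline could run it on every batch, while the scientists who built the model decide which features to watch and what score should trigger a review.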