—by Jonathan Cosme, AI/ML Solutions Architect at Run:ai
Today, we’re going to talk about why you should use GPUs for your end-to-end data science workflows – not just for model training and inference, but also for ETL jobs.
First of all, GPUs are faster. Way faster.
Let’s look at an example. The largest CPU out there today is the AMD Threadripper with 128 cores. Each core runs at about 4 GHz, and the chip costs around $5,000. The RTX 3070, a mid-range GPU that costs around $1,000, has 5,888 cores. The cores individually are slower at 1.5 GHz, so a bit over one third the speed of a CPU core. But even if we take that into consideration, multiplying cores by clock speed gives the GPU 8,832 core-GHz, and the CPU 512. So for $5,000, you’re getting 512; for $1,000, you’re getting 8,832. For one fifth of the price, you’re getting about 17 times more raw compute power. There’s no reason you shouldn’t be using GPUs for everything.
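To make the arithmetic easy to check, here is a small sketch that reproduces it from the figures quoted above. The helper name aggregate_ghz is made up for illustration, and cores times clock speed is only a rough proxy for compute power, not a benchmark.

```python
# Rough "cores x clock" comparison using the figures quoted in this post.
# aggregate_ghz is a hypothetical helper, not part of any library.

def aggregate_ghz(cores: int, clock_ghz: float) -> float:
    """Total core-GHz: number of cores times per-core clock speed."""
    return cores * clock_ghz


cpu = aggregate_ghz(cores=128, clock_ghz=4.0)    # 512.0
gpu = aggregate_ghz(cores=5888, clock_ghz=1.5)   # 8832.0

cpu_price, gpu_price = 5000, 1000  # approximate prices in USD, as quoted

print(f"CPU: {cpu:.0f} core-GHz, GPU: {gpu:.0f} core-GHz")
print(f"GPU/CPU compute ratio: {gpu / cpu:.2f}x")             # 17.25x
print(f"GPU/CPU price ratio:   {gpu_price / cpu_price:.1f}")  # 0.2
```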
But at this point, GPU acceleration is concentrated mainly on the modeling portions of the data science workflow. Specifically, it’s concentrated on deep learning. Now, it’s great that we can train a neural network in a matter of days or hours instead of weeks or months, but the fact remains that if you’re training some other kind of model rather than a neural net, you’re not typically going to be using GPU acceleration. Data wrangling is also usually done on CPUs. We’re missing out on the opportunity to use GPUs for the whole thing. In the next sections, I’m going to show you how big of a missed opportunity that is.
Your data science workflow probably looks like this:
The first step is the data wrangling. This is where I want to load my raw data and explore it. If I’ve got lots of variables, maybe I want to reduce the dimensionality to two dimensions, so I can plot it and see if there are any relationships. But most importantly, this is where I clean the messy data and structure it so that I can go to the next step, which is model training. At this point, maybe I want to explore which kind of model is best. So I might try a logistic regression, a random forest and a neural network to see which one fits best. Or maybe I know what the best model is, and just want to do a grid search for the optimal hyperparameters. So I train a bunch of the same models with different hyperparameters, or any combination of both. Once I’m done with this, I go into the model validation phase, where I take the model and perform inference on a held-out sample of the data to see if it works well, and/or check whether the inference also works on new data.
Imagine a typical data science day with CPUs.
I’m going to assume that my data downloads overnight. The first step is to explore my data, decide what to do with it, and set up my cleaning process. And because I’m working with GBs of data, maybe TBs, sometimes PBs of data, I’ve got to use some kind of framework that will allow me to do this. It takes a long time, because I’m using CPUs. And of course, halfway through I realize, “Oh shoot, I forgot to add a feature.” Maybe I forgot to import a specific ID column as a string and it’s being imported as an integer, and it’s messing everything up. So I have to go back, fix my code, and restart my workflow. But then I find that there are some null values that have been turned into strings. So I have to go back to my code, fix it again, and restart my ETL. Before I know it, I’m working late and I haven’t really gotten much of anything done.
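Here is a minimal sketch of those two import problems in Pandas. The file contents, the column names, and the "unknown" marker are all made up for illustration; the point is only that the dtype and na_values arguments of read_csv are the usual place to fix this kind of thing.

```python
import io

import pandas as pd

# Hypothetical raw file: IDs have leading zeros, and one missing amount
# was written out as the text "unknown".
raw = "customer_id,amount\n00123,10.5\n00456,unknown\n"

# Default parsing: the ID column is inferred as an integer and loses its
# leading zeros, and the amount column ends up holding strings.
naive = pd.read_csv(io.StringIO(raw))
print(naive["customer_id"].tolist())  # [123, 456]
print(naive["amount"].tolist())       # ['10.5', 'unknown']

# Fixed parsing: keep the ID as a string and treat "unknown" as a null.
fixed = pd.read_csv(
    io.StringIO(raw),
    dtype={"customer_id": str},
    na_values=["unknown"],
)
print(fixed["customer_id"].tolist())  # ['00123', '00456']
print(fixed["amount"].isna().sum())   # 1
```

Each of these fixes is a one-line change, but on a slow pipeline every one of them costs a full rerun, which is where the day goes.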
Those are just some of the points of frustration for data scientists. The modeling portion is actually a very small percent of what they do. Most of their time is spent wrestling with the data. Why? Because models require an extremely structured way of receiving data and it needs to be squeaky clean, whereas data in the world is, well, messy. There are mistakes, missing numbers, values where they shouldn’t be. It’s not structured in the way that the model wants. Depending on how much messy data a scientist needs to beat into submission, it can become a huge bottleneck in productivity.
Many data scientists start off using R. R isn’t a general-purpose language, it’s a statistical computing language, and it ships with built-in support for parallel processing, so an R script can be spread across all available cores. Standard Python, by design and for good reason, is effectively locked to a single core by its global interpreter lock. So it’s frustrating that I actually have to use a separate module, called multiprocessing, which doesn’t give me true multithreading; it works around the lock by launching separate processes, so I get a very similar effect. But I’ve got to use another module to be able to utilize all of my cores, rather than just being able to use the Pandas library out of the box. Another point of frustration with Pandas is that the read and write speeds are really slow. So are the aggregations and data manipulation functions, especially if I’m dealing with large datasets. Overall, Pandas is just a very slow library. On top of that, I can only run it on one core. Pretty frustrating.
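For reference, this is roughly what the extra step looks like with Python’s standard multiprocessing module. It is only a sketch: the task and the chunk sizes are made up, and a real ETL job would also have to split and recombine its data.

```python
import os
from multiprocessing import Pool


def sum_of_squares(n: int) -> int:
    """CPU-bound toy task: sum of the squares of 0..n-1."""
    return sum(i * i for i in range(n))


def main() -> None:
    chunks = [200_000, 300_000, 400_000, 500_000]
    # Each worker is a separate Python process with its own interpreter,
    # so the tasks can run on different cores at the same time.
    with Pool(processes=os.cpu_count()) as pool:
        results = pool.map(sum_of_squares, chunks)
    print(results)


if __name__ == "__main__":
    main()
```

The work function has to live at module level so it can be sent to the worker processes, and the entry point needs the __main__ guard. None of this is hard, but it is boilerplate that a plain Pandas script does not handle for you.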
Luckily, there is a wonderful unified solution, called Rapids.
The Rapids.ai framework is a suite of GPU-accelerated libraries for Python. They include cuDF, which mirrors Pandas; cuML, which mirrors Scikit-learn; and cuxfilter, for visualizing billions of data points. If you work with graph analytics and NetworkX, you’ve got cuGraph as well. There’s also XGBoost, a very popular gradient boosting framework, as well as Dask and Spark, which are popular ETL frameworks for very large data. All three have GPU implementations that work with Rapids, so if you already use XGBoost, Dask or Spark, you can keep using them on GPUs.
What are the actual speed gains from using Rapids versus the equivalent CPU libraries? For the ETL functions, we’re talking about a 6x increase in speed. Saving the model is 2,000 times faster. Reading and writing, usually a huge bottleneck, runs 13 times faster: you can load a large data set in five seconds, versus having to wait an entire minute. It’s wonderful. Fitting models is much faster too. If you’ve got 4 million rows with 10 features, it takes six seconds to fit a random forest on the GPU, whereas it takes 16 minutes on the CPU. Overall, the total workflow is 49 times faster, a 98% decrease in the time it would take you to do this. The speeds are astounding. Can you imagine a workflow that took eight hours to run on the CPU taking 10 minutes on the GPU? If you’ve got a one-hour workflow, that’s under a minute and a half on the GPU.
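If you want to plug in your own workflow length, the conversion from a speedup factor to time saved is simple. This is a small sketch using the 49x end-to-end figure quoted above; the helper names are made up for illustration.

```python
def time_saved_pct(speedup: float) -> float:
    """Percent reduction in wall-clock time for a given speedup factor."""
    return (1 - 1 / speedup) * 100


def gpu_minutes(cpu_minutes: float, speedup: float) -> float:
    """Expected GPU run time if the whole workflow speeds up uniformly."""
    return cpu_minutes / speedup


SPEEDUP = 49  # end-to-end figure quoted in this post

print(f"{time_saved_pct(SPEEDUP):.1f}% less time")        # 98.0% less time
print(f"8 h -> {gpu_minutes(8 * 60, SPEEDUP):.1f} min")   # 8 h -> 9.8 min
print(f"1 h -> {gpu_minutes(60, SPEEDUP):.1f} min")       # 1 h -> 1.2 min
```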
So how does this actually affect my day as a data scientist?
If I’m using the GPU end to end, the process is the same. I configure, look at my data, check out what to do, and go ahead and start my ETL. But if I realize I’ve forgotten a feature, it’s no big deal. My ETL is so fast that I’ll just go back, fix my code, run it again, and there, I found my values. By midday, I’m done. Now, I can actually get into my model training, validation and testing. And because training, validation and testing are already GPU-optimized, I can actually do multiple iterations of these in the same day, whether I want to train different kinds of models, or I want to do hyperparameter tuning on one particular model. I get to go home on time. Productivity increases by orders of magnitude.
Let’s cover some quick FAQs:
Q: All my code is in Pandas and Scikit-learn. Do I have to convert it?
A: cuDF and cuML were specifically built to mirror Pandas and Scikit-learn, so in almost all cases, your Pandas and Scikit-learn code looks the same as the cuDF/cuML code. There should be very few code changes required to actually implement these things.
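As a rough illustration of how small the change can be, the sketch below swaps only the import. The data and column names are made up, and the fallback to Pandas is just there so the example also runs on a machine without a GPU or a Rapids install.

```python
try:
    import cudf as xdf    # GPU path: needs an NVIDIA GPU and a Rapids install
except ImportError:
    import pandas as xdf  # CPU fallback so the sketch runs anywhere

# Toy data; the column names are made up for illustration.
df = xdf.DataFrame({"store": ["a", "b", "a", "b"], "sales": [10, 20, 30, 40]})

# The same groupby/aggregate call works on either backend.
totals = df.groupby("store")["sales"].sum().sort_index()
print(totals)
```

Not every Pandas method has a cuDF equivalent, so it is worth running your own test suite against the swapped import rather than assuming full coverage.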
Q: What if I’m doing something that doesn’t work with cuDF and I have to use my specific function and have to run in Pandas?
A: Not a big deal. It’s very easy to go from a cuDF data frame to a Pandas data frame, run your function, and then take your Pandas data frame and go back to a cuDF data frame and continue the workflow from there.
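A minimal sketch of that round trip, assuming cuDF is installed, looks like this. The function pandas_only_step is a hypothetical stand-in for whatever only works in Pandas, and the else branch exists only so the example still runs where cuDF is not available.

```python
import pandas as pd

try:
    import cudf  # needs an NVIDIA GPU and a Rapids install
except ImportError:
    cudf = None  # no GPU stack available; the sketch then stays in Pandas


def pandas_only_step(pdf: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for a function that only works on Pandas."""
    out = pdf.copy()
    out["label"] = out["value"].apply(lambda v: "high" if v > 15 else "low")
    return out


pdf_in = pd.DataFrame({"value": [10, 20, 30]})

if cudf is not None:
    gdf = cudf.from_pandas(pdf_in)   # start on the GPU
    pdf = gdf.to_pandas()            # GPU -> Pandas
    pdf = pandas_only_step(pdf)      # run the CPU-only function
    gdf = cudf.from_pandas(pdf)      # Pandas -> GPU, carry on from here
    labels = gdf["label"].to_pandas().tolist()
else:
    labels = pandas_only_step(pdf_in)["label"].tolist()

print(labels)  # ['low', 'high', 'high']
```

Each conversion copies the data between GPU and host memory, so it is best kept to the few steps that really need it.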
Q: What if my data is too large for GPU memory?
A: This happens a lot, actually, and that’s where Dask comes in. It was designed as a distributed solution for Pandas, and mirrors much of the Pandas API. So if you’re used to Pandas, using Dask is going to be pretty intuitive. And if you’re used to Spark, there is a Rapids accelerator for Apache Spark that leverages GPUs for ETL, so you can use that as well.
Q: I’ve got a large CPU cluster, and it runs really quickly. Why do I have to switch?
A: The amount of data being generated will keep increasing in basically all industries and applications, and as your data grows, you will need to process more of it within the same time windows your business requires today. If you want to continue meeting your SLAs, the sooner you adopt GPUs, the better it will be for you in the future.
Q: GPU instances are expensive. Wouldn’t it be cheaper to run a large CPU cluster rather than a GPU?
A: No, actually. Check out these results of a GPU vs CPU experiment on the Spark RAPIDS GitHub. They tested a cluster of CPUs versus a cluster of GPUs. And because the GPU cluster gets everything done so quickly, it turns out that it runs for significantly less time and your cost is actually less than what you would have paid for the CPU cluster.