March 17, 2022

This Open-Source Tool Measures GPU Cluster Utilization – Here’s Why That Matters

This is a sponsored post with Run:ai, a company based in Tel Aviv and backed by Tiger Global and Insight Partners. Run:ai aims to accelerate AI-driven innovation and help humanity solve the unsolved. They are committed to providing a foundation for AI infrastructure to help organizations in every industry accelerate innovation in the AI era.

Most AI teams estimate their GPU utilization to be above 60%. In reality, most GPU clusters are at less than 20% utilization. A new open-source tool, rntop, provides an accurate measurement.

Recent years have seen accelerated development of artificial intelligence-based capabilities for increasingly complex and varied applications. When industry studies discuss the challenges of adopting AI, most point to a shortage of advanced processing capabilities, particularly GPUs. To determine whether a shortage of compute resources is really hindering AI’s expansion at enterprises, over the last year Run:ai conducted dozens of interviews with companies working on a variety of AI applications (NLP, computer vision, speech analysis, etc.) about their GPU allocation and how efficiently these resources are utilized. The responses amazed us, and led to the release of a new open-source tool for measuring GPU cluster utilization.

In our interviews with enterprise AI teams, we first asked researchers to estimate their organization’s GPU efficiency. Since no tool was in place to help them measure GPU utilization, researchers gave us their best estimate based on the size of their workloads. The average researcher’s estimate of GPU cluster utilization was 62%. However, based on our experience with POCs and new installations for Run:ai customers, actual utilization averaged across all GPUs in the cluster was closer to 14%. This discrepancy might be surprising to some, but it’s consistent with last year’s State of AI Infrastructure Survey, in which 83% of the more than 200 surveyed companies admitted to not fully utilizing their GPU and AI hardware. The result is frustrated users whose productivity is limited by inefficient AI infrastructure and unnecessary delays in development cycles. AI teams with smaller infrastructure budgets (up to $250k) suffer the most from idle hardware, not only in lost time, but also because if IT can’t show ROI on the hardware it buys for AI, these fledgling AI initiatives risk losing further funding from their businesses.

So why the disconnect between how much GPU researchers think they’re using and actual GPU utilization? We discovered that in most cases, individual researchers had been building their own AI workflows, using their own tools and processes, totally separate from the rest of the unit. They weren’t talking to each other about sharing resources at all. These same research teams were eagerly petitioning their IT management for more GPUs, but IT had no way of knowing whether the issue was a genuine need for more GPUs or poor utilization of the existing hardware. 

When there’s a bottleneck in AI development, the first step is to monitor it. It was clear that the IT managers at the companies we interviewed needed a tool to visualize their GPU usage. However, no existing tool could monitor GPU cluster utilization in a single view without installing additional tools and prerequisites on the Kubernetes cluster. On GitHub, we found a tool called nvtop that gives visibility into one node. But for AI units with multiple nodes, it’s cumbersome to run nvtop on each node every time. There’s also rtop, which allows remote monitoring of multiple nodes, but not of GPUs. We decided to combine the best features of these existing top-style tools into a new tool, called rntop, to enable remote monitoring of GPU clusters, and provided it to the companies for a month-long test.

Now rntop (pronounced “run top”) is available as open source on GitHub, so that anyone can monitor the utilization of their GPUs, anywhere, anytime. It uses NVIDIA’s software (nvidia-smi in particular) to monitor the GPUs on a node, and does this on multiple nodes across a cluster.

It then collects the monitoring information from every node and calculates relevant metrics for the entire cluster. For example, available and used GPU memory are calculated by summing the results from all the nodes, while GPU utilization is calculated by averaging the utilization of all the GPUs in the cluster. The current version requires only an SSH connection, and can be launched with Docker without any installation. Automatic insights and a display of utilization over time are already planned for future updates.
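To make the aggregation concrete, here is a minimal Python sketch of the calculation described above. It is not rntop’s actual code: the function and class names are hypothetical, and it assumes each node’s readings have already been collected (for example over SSH) as the CSV text that nvidia-smi prints for the query shown in the comment.

```python
from dataclasses import dataclass

# Assumed per-node command (standard nvidia-smi query fields); how rntop
# itself invokes nvidia-smi is not shown in this post.
QUERY = ("nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total "
         "--format=csv,noheader,nounits")


@dataclass
class ClusterStats:
    gpu_count: int
    memory_used_mib: int
    memory_total_mib: int
    avg_utilization_pct: float


def parse_node_output(text):
    """Parse one node's CSV output into (util %, used MiB, total MiB) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        util, used, total = (int(field.strip()) for field in line.split(","))
        gpus.append((util, used, total))
    return gpus


def aggregate(node_outputs):
    """Sum memory over all nodes and average utilization over all GPUs."""
    gpus = [gpu for text in node_outputs.values()
            for gpu in parse_node_output(text)]
    if not gpus:
        return ClusterStats(0, 0, 0, 0.0)
    return ClusterStats(
        gpu_count=len(gpus),
        memory_used_mib=sum(used for _, used, _ in gpus),
        memory_total_mib=sum(total for _, _, total in gpus),
        avg_utilization_pct=sum(util for util, _, _ in gpus) / len(gpus),
    )


# Hypothetical readings: one busy GPU, one lightly used, two idle.
sample = {
    "node-a": "90, 15000, 16000\n0, 0, 16000",
    "node-b": "10, 1000, 16000\n0, 0, 16000",
}
stats = aggregate(sample)
print(stats)
```

In this made-up sample, a researcher watching only the busy GPU would see 90% utilization, while the cluster-wide average across all four GPUs is 25%, which is the kind of gap between perceived and measured utilization discussed earlier.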

We encourage you to try rntop on your GPUs. Have you been incorrectly guessing your usage? You can help the wider MLOps community better understand this key metric by reporting your results here. We will share the findings in an upcoming post.