This post was written in collaboration with our sponsors from Union, Samhita Alla | Software Engineer & Tech Evangelist at Union.ai
It’s no secret there has been a sharp rise in demand for machine learning as organizations innovate new products that make complex use of data. To keep up with the high volume of ML workflows, ML teams and companies are scrambling to find ways to iterate on and maintain their pipelines collectively.
Data science and machine learning require analysis at every intermediary step of the pipeline, data validation process, and resource optimization, outstripping the capabilities of typical DevOps life cycles. To maintain reliable ML pipelines, Flyte makes it painless to orchestrate them at scale. In this article, we’ll consider how Flyte enables orchestrating ML pipelines with infrastructure abstraction.
Say hello to MLOps
The software development playbook tells us that strategies like DevOps allow faster development cycles by streamlining software testing and deployment processes. DevOps aims to transform isolated processes into an ongoing series of coordinated actions within a company. It includes provisioning and infrastructure management to speed development life cycles.
Similarly, building ML applications quickly and efficiently requires a structured and repeatable development process. ML, however, can add a lot more overhead to the development life cycle to handle the dynamism of ML workflows and adapt to their ever-changing data requirements.
Machine learning operations (MLOps) is a superset of DevOps that also includes data engineering and machine learning.
Here’s how MLOps and DevOps differ:
- Over time, models may deteriorate. Models and data must be iterated upon and tested continuously since they are mutable.
- ML models can be nondeterministic when data changes rapidly. Ensuring ML models climb the performance curve requires additional effort.
ML models usually require additional investment in hardware and infrastructure. Because ML models generally rely on GPUs’ processing power to accommodate model parameters and data, ML development demands more resources than software development does.
Automating the ML processes
Machine learning teams are increasingly using MLOps methodologies to standardize and improve the processes involved in model life cycle management. MLOps tools can perform a wide range of tasks for machine learning teams and can be broadly classified into two groups: platform and individual component management.
While some MLOps products focus exclusively on a single core function, such as data or metadata management, other tools adopt a more expansive strategy that provides an end-to-end MLOps platform to control several aspects of the ML lifecycle.
Whether you’re looking for a specialized or a more general tool for MLOps, your model life cycle should account for these tasks:
- Data management — datasets to be used for training, testing, validation
- Model design
- Deployment and maintenance of models
- End-to-end lifecycle management of the models
- Model versioning and serving
- Monitoring the performance of the ML model
- Retraining the model from time to time
A plethora of MLOps tools can be used to implement these functionalities, such as DVC for version control, BentoML for serving, and Weights & Biases for monitoring. To coordinate this toolset, we need an upright platform.
Built on top of Kubernetes, Flyte is a workflow automation tool with data and machine learning awareness. Flyte aims to manage business-critical ML and data processes at scale. You can use Flyte in deployment, maintenance, lifecycle management, version control, and training, and combine it with other platforms like Feast, PyTorch, Tensorflow, and whylogs to implement tasks pertaining to the complete model lifecycle.
Flyte can assist any organization with a reproducible, iterative and extendable workflow automation platform. It prioritizes user experience, dependability and reliability, allowing teams to collaborate while segregating their responsibilities and supports organizational scaling.
Building blocks of Flyte: workflows and tasks
Flyte’s fundamental entities: Task & Workflow
How Flyte solves ML orchestration problems
Flyte uses abstraction to minimize infrastructure setup while ensuring the reliability and scalability of pipelines. Smoothing interaction with infrastructure makes ML pipeline maintenance a breeze.
Flyte’s centrally managed infrastructure relieves platform teams of managing multiple, distributed resources. Flyte defines a clear distinction between ML engineers and data scientists (who employ its SDK to build pipelines) and platform engineers tasked with backend management. This separates the responsibilities while enabling seamless communication over a robust interface.
Flyte facilitates allocation of resources such as CPUs and GPUs to run workflows.
Running ML experiments requires dynamic adaptation to a variety of inputs and parameters. Providing a varied set of inputs or parameters in a query means modifying behavior in response to the given input. Flyte is aware of the pipeline in advance, even with dynamism.
Flyte also provides implicit support for reusability. Since it’s multi-tenant, teams can collaborate and share their workflows/tasks. Flyte supports intra-task checkpointing. By capturing the task’s status before a failure and continuing from the most recent state recorded, a checkpoint recovers a task from an earlier failure. The workflow doesn’t have to execute tasks that were already finished; it only needs to retry the failed task. That means Flyte can run workflows exclusively on the spot or preemptible instances, which are much cheaper than their on-demand or reserved counterparts.
How to set up and use Flyte
This section will provide an overview of how ML jobs can be run from within Flyte — transitioning from writing Flyte tasks to running them on the Flyte backend.
1. Writing a ML job and running it locally
Writing a job (“@task” in Flyte terminology) is the initial step in creating an ML pipeline. It may handle training, data preprocessing or spark. To execute a job on the Flyte backend, the user will need to translate any implementation-related code into Flyte-compatible code. (This requires only minor reworking.) Instead of writing the code from scratch, the user can look at the supported plugins if the task interacts with a tool or an external service.
It’s critical that the user can execute the code locally. Flyte supports local executions out of the box. After creating a Flyte-compatible workflow, t the user can run the code locally just like a Python file.
2. Scaling the job
The user can adjust resources to run the activities on a production setup, depending on how compute-intensive the jobs are.
Flyte offers a declarative infrastructure-as-code (IaC) solution, so users can set the necessary CPU, GPU and memory.
3. Creating and testing pipelines
Create a pipeline or workflow by connecting all the jobs and tasks. To check the result, execute the code locally.
4. Setting up Flyte
With the help of the Getting Started manual, users should be able to execute workflows on the Flyte backend and build Kubernetes clusters on their local computers.
Configuring the database, object store, authentication, and other components is part of setting up Flyte for production. You can use AWS, GCP, Azure, or any other cloud provider or set up Flyte on-premises.
Now that you know how to set up Flyte, let’s take a look at what happens behind the scenes by understanding the Flyte Architecture.
The architecture of Flyte is a testament to why it’s a robust and reliable platform for MLOps. Let’s break down the architecture into three planes:
- Input Plane
The User Plane includes all user tools that facilitate interaction with the main Flyte API, such as FlyteConsole, Flytekit and Flytectl.
- Control Plane
The primary Flyte API is implemented in the Control Plane, which fulfills all client requests coming from the User Plane. It retains data about ongoing and previous workflows and makes that data available upon request. It delegates requests to carry out workflows to the Data Plane.
The Data Plane is dedicated to workflow fulfillment. It receives workflow requests from the Control Plane, then guides them to completion by launching tasks on a cluster of computers according to a workflow graph. The Data Plane sends status events back to the Control Plane, which saves the data and makes it available to end users.
Let’s see what happens under the hood.
The user creates workflows with the Flyte SDK, then uses Flyte CLI or SDK to compile and register them. After workflows are registered on FlyteAdmin, the code is accessible on the FlyteAdmin, which contains data for tasks and workflows.
The user creates a Docker container with all the listed dependencies. When they want to start a workflow, the FlytePropeller leverages the Kubernetes operator pattern and polls the Kubernetes API for newly created flyteworkflow resources. Besides carrying out the operations, FlytePropeller interacts with other Kubernetes operators and objects and stores the output in cloud blob storage.
Once FlyteAdmin retrieves results from storage, users can view them on FlyteConsole. That’s it: A user writes a Flyte workflow, registers it, executes it and views the results on the dashboard with minimal infrastructure setup hassle.In the next section, let’s look at Hosted Sandbox, a browser-based IDE that fast-tracks your familiarity with Flyte.
To remove any friction getting started with Flyte, we created a self-contained hosted sandbox that you can try without setting up an environment for it. Check it out here!
Flyte is currently being used in a range of industries, including finance, bioinformatics, and autonomous vehicles. Check out these use cases:
- Consumer Applications, E-commerce
- Autonomous Vehicles
- ML & Data
Flyte’s blend of data, ML, and infrastructure orchestration handles these use cases at scale.
We’ve reviewed Flyte’s capabilities, setup and architecture. If you’d like to drill down into its features, refer to the User Guide section of our docs. We also have documented some tutorials for building end-to-end ML pipelines. Go through the deployment guides if you want to run Flyte on cloud or on-prem. For a deeper understanding of the Flyte internals, refer to the Concepts section. Our Integrations Guide can be useful to run Flyte with other platforms.