May 15, 2024

DREAM: Distributed RAG Experimentation Framework


A blueprint for distributed RAG experimentation using Ray, LlamaIndex, Ragas, MLflow & MinIO on Kubernetes

Image created using DALL·E 2

Contents

1. 🌟 What is DREAM?

  • a. πŸ€” What is it, really?
  • b. πŸ›οΈ Architecture
  • c. πŸ’» Show me the code!

2. 🚢 Code Walkthrough

  • a. πŸ“‚ Preparing Unstructured Data
  • b. πŸ₯‡ Distributed Generation of Golden Dataset
  • c. πŸ”¬ Distributed Experimentation & Evaluation
  • d. πŸ“Š Experiment Tracking

3. πŸ“ Conclusion

  • a. 🌰 In a nutshell
  • b. πŸ‘€ What’s next?

1. 🌟 What is DREAM?

a. πŸ€” What is it, really?

Given the myriad of options for LLMs, embedding models, retrieval methods, re-ranking methods and so on, it can be challenging to determine which combination will work best for your use case. Who has the time to explore each combination one by one?

DREAM Architecture

So, the Distributed RAG Experimentation Framework (DREAM) is a blueprint, comprising a Kubernetes-native architecture and sample code, that demonstrates how Retrieval Augmented Generation (RAG) experiments, evaluation and tracking can be conducted in a distributed manner using Ray, LlamaIndex, Ragas, MLflow and MinIO on Kubernetes.

By setting up the necessary K8s tooling and running the experimentation, evaluation and tracking in a distributed manner, we ultimately want to be able to compare and contrast the different combinations of RAG parameters and pick the one that works best for our use case.

Parallel coordinates plot illustrating the different combinations attained from distributed RAG experimentation and their performance on various evaluation metrics

b. πŸ›οΈ Architecture

As shown in the architecture diagram above, DREAM uses the following technologies:

  • Ray (KubeRay) (by Anyscale) for distributed compute on Kubernetes, including experimentation using Ray Tune and other distributed tasks using Ray jobs
  • LlamaIndex (by Jerry Liu and team) as the framework for processing unstructured data and performing advanced RAG techniques
  • ragas (by Shahul ES, Jithin James and team) for synthetic data generation and LLM-assisted evaluations
  • MinIO as the S3-compatible, K8s-native object storage for storing unstructured data, golden datasets and MLflow artifacts
  • MLflow as the experiment tracker (with PostgreSQL as the auxiliary DB, and MinIO for storage)
  • Project Jupyter notebooks for performing interactive experimentation against the Ray cluster
  • Kubernetes as the container orchestrator… of course! (Deployed on DigitalOcean droplets using kubeadm, but you can use minikube or some other flavour!)
  • ArgoCD (by the Argo Project) for deploying the tooling onto the Kubernetes cluster and maintaining the state of the cluster using GitOps.

For installing all these components, you can follow the steps outlined in the installation guide. You might notice that DREAM is part of a larger project I’m calling GOKU (GenAIOps on Kubernetes), which is coming soon!

c. πŸ’» Show me the code!

Here you go: DREAM Github πŸ™‚


2. 🚢 Code Walkthrough

a. πŸ“‚ Preparing Unstructured Data

The steps in this notebook are quite straightforward:

  • Download the PDFs from github to local
  • Use boto3 to push the PDFs to S3 (MinIO)
Screengrab of MinIO after pushing the PDFs
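The upload step can be sketched as follows. This is a minimal illustration, not the repo's exact code: the endpoint, bucket name and credentials are placeholders you would swap for your own MinIO details.

```python
from pathlib import Path

# Hypothetical MinIO connection details -- substitute your own.
MINIO_ENDPOINT = "http://minio.minio.svc.cluster.local:9000"
BUCKET = "unstructured-data"


def object_key_for(pdf_path: Path, prefix: str = "pdfs") -> str:
    """Derive the S3 object key for a local PDF file."""
    return f"{prefix}/{pdf_path.name}"


def upload_pdfs(local_dir: str) -> list:
    """Push every PDF under local_dir to the MinIO bucket via boto3."""
    import boto3  # imported lazily so the sketch reads fine without boto3 installed

    s3 = boto3.client(
        "s3",
        endpoint_url=MINIO_ENDPOINT,        # point boto3 at MinIO instead of AWS
        aws_access_key_id="minioadmin",     # placeholder credentials
        aws_secret_access_key="minioadmin",
    )
    uploaded = []
    for pdf in sorted(Path(local_dir).glob("*.pdf")):
        key = object_key_for(pdf)
        s3.upload_file(str(pdf), BUCKET, key)
        uploaded.append(key)
    return uploaded
```

Pointing `endpoint_url` at the in-cluster MinIO service is the only change needed versus talking to AWS S3; the rest of the boto3 API is identical.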

b. πŸ₯‡ Distributed Generation of Golden Dataset

This is where the fun begins!

  • With our Jupyter notebook acting as the Ray driver, we use the ray client to submit the Ray job for creating the golden dataset in a distributed manner.
  • In each Ray task, up to 3 PDFs are loaded from S3 and then the ragas framework’s TestsetGenerator is used for synthetic test data generation. Pandas dataframes with the synthetic data are returned by each task.
  • The driver (Jupyter notebook) combines the dataframes and dumps the combined dataframe as a CSV file onto S3.
Workflow for distributed Golden Dataset generation
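The fan-out can be sketched as below. Treat it as a structural outline under assumptions: `download_from_s3()` is a hypothetical helper, the Ray client address is a placeholder, and the `TestsetGenerator` calls follow the ragas 0.1-era API, which may differ in newer releases.

```python
def batch_keys(keys, size=3):
    """Split the S3 object keys into batches of at most `size` PDFs per Ray task."""
    return [keys[i:i + size] for i in range(0, len(keys), size)]


def generate_for_batch(keys):
    """Body of one Ray task: load a batch of PDFs and synthesise test rows with ragas.
    Library imports live inside the function so they resolve on the Ray worker."""
    from llama_index.core import SimpleDirectoryReader
    from ragas.testset.generator import TestsetGenerator

    local_paths = download_from_s3(keys)   # hypothetical helper: pull objects to local disk
    documents = SimpleDirectoryReader(input_files=local_paths).load_data()
    generator = TestsetGenerator.with_openai()  # ragas 0.1-style constructor
    testset = generator.generate_with_llamaindex_docs(documents, test_size=10)
    return testset.to_pandas()             # one dataframe per task


if __name__ == "__main__":
    import ray
    import pandas as pd

    ray.init(address="ray://raycluster-head-svc:10001")  # placeholder Ray client endpoint
    task = ray.remote(generate_for_batch)
    all_keys = ["pdfs/a.pdf", "pdfs/b.pdf", "pdfs/c.pdf", "pdfs/d.pdf"]  # placeholder keys
    futures = [task.remote(batch) for batch in batch_keys(all_keys)]
    golden = pd.concat(ray.get(futures), ignore_index=True)
    golden.to_csv("golden_dataset.csv", index=False)  # the notebook then pushes this to S3
```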

c. πŸ”¬ Distributed Experimentation & Evaluation

This is about to get a little complicated, so here’s the overall workflow visualised:

Workflow for distributed experimentation, evaluation and tracking

Before we get to the juicy bits, let me describe the search space and evaluation metrics. In the sample code, our search space spans 3x RAG methods, 2x LLMs and 2x embedding models. We use 3 RAG methods native to LlamaIndex – chunks with overlap, sentence window retrieval and hierarchical auto-merging retrieval. We use OpenAI’s gpt-3.5-turbo and gpt-4 as our LLMs, with text-embedding-3-small and text-embedding-3-large as our embedding models. For evaluation, we use the ragas framework’s faithfulness, answer_relevancy, context_precision, context_recall, answer_correctness and answer_similarity metrics. To understand the RAG methods and ragas metrics in depth, you can check out my previous article on Advanced RAG.

Cue Ray Tune!

  • With our Jupyter notebook acting as the Ray driver, we use the ray client to submit the Ray Tune job
Code for invoking Ray Tune for searching over the param search space
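The grid search over the 3 × 2 × 2 = 12 combinations can be sketched like this; `experiment_fn` stands in for the experiment() function the walkthrough describes, and everything else follows Ray Tune's standard `Tuner` API.

```python
# Search space as described above: 3 RAG methods x 2 LLMs x 2 embedding models.
SEARCH_SPACE = {
    "rag_method": ["chunks_with_overlap", "sentence_window", "auto_merging"],
    "llm": ["gpt-3.5-turbo", "gpt-4"],
    "embed_model": ["text-embedding-3-small", "text-embedding-3-large"],
}


def run_experiments(experiment_fn):
    """Submit a full grid search over SEARCH_SPACE to Ray Tune.
    Ray is imported lazily so only the driver process needs it installed."""
    from ray import tune

    # grid_search() tells Tune to run every value, so 3 x 2 x 2 = 12 trials
    param_space = {key: tune.grid_search(values) for key, values in SEARCH_SPACE.items()}
    tuner = tune.Tuner(experiment_fn, param_space=param_space)
    return tuner.fit()
```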
  • The centrepiece of this distributed experimentation and evaluation is the experiment() function, which glues all the helper functions together to run the main procedure.
Code for experiment() function
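Structurally, experiment() looks roughly like the sketch below. The names query_engine_picker(), evaluator() and train.report() come from the walkthrough; the model-initialising helper names and exact signatures are illustrative assumptions, not the repo's literal code.

```python
def experiment(config):
    """One Ray Tune trial: build the models and query engine for this
    parameter combination, evaluate it, and report the scores back to Tune."""
    from ray import train

    llm = initialise_llm(config["llm"])                    # illustrative helper name
    embed_model = initialise_embed_model(config["embed_model"])
    query_engine = query_engine_picker(config["rag_method"], llm, embed_model)
    scores = evaluator(query_engine)   # {"faithfulness": ..., "answer_relevancy": ..., ...}
    train.report(scores)               # hands the metric dict back to Ray Tune
```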
  • Since the structure of the three RAG methods can vary, it is not easy to parametrise them inside our experiment() function. Hence, we structure the RAG methods’ pipelines in a helper function called query_engine_picker() and use string identifiers to select the pipeline. The output returned is a query_engine object.
Code for query_engine_picker()
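The string-dispatch pattern can be sketched as follows. The three pipelines are deliberately abbreviated: the real sentence-window and auto-merging pipelines also wire in a MetadataReplacementPostProcessor and an AutoMergingRetriever respectively, and the chunk sizes here are placeholder values.

```python
RAG_METHODS = {"chunks_with_overlap", "sentence_window", "auto_merging"}


def query_engine_picker(rag_method, llm, embed_model, documents):
    """Dispatch on a string identifier to build one of the three LlamaIndex
    RAG pipelines and return a query_engine object."""
    if rag_method not in RAG_METHODS:
        raise ValueError(f"unknown rag_method: {rag_method!r}")

    from llama_index.core import Settings, VectorStoreIndex
    from llama_index.core.node_parser import (
        HierarchicalNodeParser,
        SentenceSplitter,
        SentenceWindowNodeParser,
    )

    Settings.llm = llm
    Settings.embed_model = embed_model

    if rag_method == "chunks_with_overlap":
        Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=64)
    elif rag_method == "sentence_window":
        # real pipeline also adds a MetadataReplacementPostProcessor at query time
        Settings.node_parser = SentenceWindowNodeParser.from_defaults(window_size=3)
    else:
        # auto_merging -- real pipeline wraps retrieval in an AutoMergingRetriever
        Settings.node_parser = HierarchicalNodeParser.from_defaults()

    index = VectorStoreIndex.from_documents(documents)
    return index.as_query_engine()
```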
  • Similarly, for embedding models and LLMs, we use strings to identify the intended model and initialise them using calls to helper functions from within the experiment() function.
Code for model initialising helper functions
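A minimal sketch of the two helpers, assuming LlamaIndex's OpenAI integration packages (the helper names themselves are illustrative, not the repo's exact ones):

```python
def initialise_llm(name):
    """Map an LLM identifier string (e.g. "gpt-4") to a LlamaIndex LLM object."""
    from llama_index.llms.openai import OpenAI
    return OpenAI(model=name, temperature=0.0)  # deterministic answers for fair comparison


def initialise_embed_model(name):
    """Map an embedding identifier (e.g. "text-embedding-3-small") to a
    LlamaIndex embedding object."""
    from llama_index.embeddings.openai import OpenAIEmbedding
    return OpenAIEmbedding(model=name)
```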
  • Lastly, once the query_engine is obtained, the experiment() function calls the evaluator() helper function, whose code is fairly self-explanatory. It uses the query_engine to derive contexts and answers based on the golden dataset and then uses ragas to compute the metrics. A dictionary of scores for each metric is returned, which is then recorded using train.report()
Code for evaluator() function
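A sketch of evaluator(), assuming the golden dataset is passed in as a pandas dataframe with "question" and "ground_truth" columns (the column names follow ragas 0.1 conventions; the real function's signature may differ):

```python
def evaluator(query_engine, golden_df):
    """Answer every golden question with the query engine, then score the
    answers and retrieved contexts with ragas. Returns {metric_name: score}."""
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (
        answer_correctness,
        answer_relevancy,
        answer_similarity,
        context_precision,
        context_recall,
        faithfulness,
    )

    answers, contexts = [], []
    for question in golden_df["question"]:
        response = query_engine.query(question)
        answers.append(str(response))
        # keep the retrieved chunks so the context_* metrics can be computed
        contexts.append([node.get_content() for node in response.source_nodes])

    dataset = Dataset.from_dict({
        "question": list(golden_df["question"]),
        "answer": answers,
        "contexts": contexts,
        "ground_truth": list(golden_df["ground_truth"]),
    })
    result = evaluate(dataset, metrics=[faithfulness, answer_relevancy,
                                        context_precision, context_recall,
                                        answer_correctness, answer_similarity])
    return dict(result)  # plain dict of scores, ready for train.report()
```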

d. πŸ“Š Experiment Tracking

Finally, we leverage the amazing experiment tracking capability of MLflow to record experiment results, establish lineage with the golden dataset and visualise experiment results. Here’s a flurry of screenshots that speak for themselves!

Code for logging experiment results to MLFlow
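The logging step boils down to standard MLflow tracking calls; in this sketch the tracking URI and experiment name are placeholders, and dataset lineage is recorded with a simple tag.

```python
def log_to_mlflow(config, scores, golden_dataset_uri):
    """Record one trial's RAG parameters and ragas scores in MLflow."""
    import mlflow

    mlflow.set_tracking_uri("http://mlflow.mlflow.svc.cluster.local:5000")  # placeholder
    mlflow.set_experiment("dream-rag-experiments")                          # placeholder
    with mlflow.start_run():
        mlflow.log_params(config)    # rag_method, llm, embed_model
        mlflow.log_metrics(scores)   # faithfulness, answer_relevancy, ...
        # tag the run with the golden dataset's S3 URI for lineage
        mlflow.set_tag("golden_dataset", golden_dataset_uri)
```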
Tabular view of experiments
Parallel coordinates plot illustrating the different combinations attained from distributed RAG experimentation and their performance on various evaluation metrics
Another tabular view of the experiments

3. πŸ“ Conclusion

a. 🌰 In a nutshell

In this article, we took a look at DREAM, which is a blueprint for tooling and code that demonstrates how distributed RAG experimentation, evaluation and tracking can be done using open-source technologies including Ray, LlamaIndex, Ragas, MLflow & MinIO on Kubernetes.

Overall DREAM architecture and workflow

b. πŸ‘€ What’s next?

This is only a first pass at optimising and exploiting the distributed nature of the experimentation exercise. For instance, it might make sense to use Ray Data for reading and writing the CSV files. We can take things a step further and use distributed calls to the embedding model to create the VectorStoreIndex! I hope you use this as a building block and go nuts with optimization in your own projects πŸ™‚

Another interesting idea to consider is how to turn this into a reusable no-code/low-code workflow. Notice how the steps running in the Jupyter notebooks can be neatly organised into a linear DAG. If we fix the parameters of the RAG search space, we could package up the steps in an Argo Workflow and trigger the distributed experimentation, evaluation and tracking as a low-code/no-code pipeline, on any arbitrary unstructured data in S3!
