August 30, 2023

MLOps: More Oops than Ops

🤖 image generated using the Stable Diffusion 2.1 model mentioned in this post

As model complexity increases exponentially, so too does the need for effective MLOps practices. This post acts as a transparent write-up of all the MLOps frustrations I’ve experienced in the last few days. By sharing my challenges and insights, I hope to contribute to a community that openly discusses and shares solutions for MLOps challenges.

My goal was to improve the inference latency of a few current state-of-the-art LLMs.

Unfortunately, simply downloading trained model weights & existing code doesn’t solve this problem.

The Promise of Faster Inference

My first target here was Llama 2. I wanted to convert it into ONNX format, which could then be converted to TensorRT, and finally served using Triton Inference Server.

TensorRT optimizes the model network by combining layers and optimizing kernel selection for improved latency, throughput, power efficiency and memory consumption. If the application specifies, it will additionally optimize the network to run in lower precision, further increasing performance and reducing memory requirements.
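To make the memory side of lower precision concrete, here is a back-of-the-envelope sketch. It counts weights only (no activations or KV cache), and the 7-billion parameter count is a round-number assumption for a "7B" model rather than a measured figure.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: int, precision: str) -> float:
    """Approximate memory for the model weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


num_params = 7_000_000_000  # assumed round figure for a "7B" model
for precision in BYTES_PER_PARAM:
    print(f"{precision}: ~{weight_memory_gb(num_params, precision):.0f} GB")
```

Halving the bytes per weight halves the weight footprint, which is where most of the memory saving from fp16 comes from.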

From online benchmarks [12] it seems possible to achieve a 2-3x improvement in latency (by reducing precision without hurting quality much). But the tooling for these kinds of format conversions feels super flaky: things break too often, with no solution to be found online. That’s somewhat expected, since these models are so new and different architectures use different (not yet widely supported) layers and operators.

Model Conversion Errors

Let’s start with Llama 2 7B chat:

  1. First, I downloaded the Llama-2-7B-Chat weights from Meta’s official repository here, after requesting access.
  2. Then I converted the raw weights to Hugging Face format using this script by Hugging Face. Let’s say we save the result locally under the llama-2-7b-chat-hf directory.

Now I considered two options for converting Huggingface models to ONNX format:

torch.onnx.export gibberish text

Let’s write an export_to_onnx function which will load the tokenizer & model, and export it into ONNX format:

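import torch
from composer.utils import parse_uri, reproducibility
from pathlib import Path
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

def export_to_onnx(
    pretrained_model_name_or_path: str,
    output_folder: str,
    verify_export: bool,
    max_seq_len: int | None = None,
):
    reproducibility.seed_all(42)  # assumed seed value
    _, _, parsed_save_path = parse_uri(output_folder)
    # Load HF config/model/tokenizer
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path, use_fast=True)
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path)
    # Only configs that expose attn_config (e.g. MPT) need the torch attention switch
    if hasattr(config, 'attn_config'):
        config.attn_config['attn_impl'] = 'torch'

    model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path, config=config).to("cuda:0")
    model.eval()
    # tips: the Llama tokenizer has no pad token by default, so add one
    tokenizer.add_special_tokens({"pad_token": "<pad>"})
    # Keep the embedding table in sync if the vocabulary grew (no-op otherwise)
    model.resize_token_embeddings(len(tokenizer))
    model.config.pad_token_id = tokenizer.pad_token_id
    # Note: Llama configs expose max_position_embeddings rather than max_seq_len,
    # so pass max_seq_len explicitly for Llama 2.
    sample_input = tokenizer(
        "Hello, my dog is cute",
        # Assumed arguments: pad/truncate to a fixed length, return PyTorch tensors
        padding="max_length",
        max_length=max_seq_len or model.config.max_seq_len,
        truncation=True,
        return_tensors="pt",
    ).to("cuda:0")

    # Dry run to check that the model accepts the sample input
    with torch.no_grad():
        model(**sample_input)

    output_file = Path(parsed_save_path) / 'model.onnx'
    output_file.parent.mkdir(parents=True, exist_ok=True)
    # Put sample input on cpu for export
    sample_input = {k: v.cpu() for k, v in sample_input.items()}
    model = model.to("cpu")
    torch.onnx.export(
        model,
        (sample_input,),
        str(output_file),
        input_names=['input_ids', 'attention_mask'],
        output_names=['output'],  # assumed output name
        opset_version=16,  # assumed; same opset as used elsewhere in this post
    )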

We can also check if the exported & original models’ outputs are similar:

# (Optional) verify onnx model outputs
import onnx
import onnx.checker
import onnxruntime as ort

with torch.no_grad():
    orig_out = model(**sample_input)
    orig_out.logits = orig_out.logits.cpu()  # put on cpu for export

_ = onnx.load(str(output_file))
onnx.checker.check_model(str(output_file))
ort_session = ort.InferenceSession(str(output_file))
for key, value in sample_input.items():
    sample_input[key] = value.cpu().numpy()
loaded_model_out = ort_session.run(None, sample_input)
torch.testing.assert_close(
    orig_out.logits.detach(),
    torch.as_tensor(loaded_model_out[0]),
    rtol=1e-2,  # assumed tolerances
    atol=1e-2,
    msg='output mismatch between the orig and onnx exported model')
print('Success: exported & original model outputs match')

Assuming we’ve saved the ONNX model in ./llama-2-7b-onnx/, we can now run inference using onnxruntime:

import onnxruntime as ort
import torch
from transformers import AutoTokenizer

output_file = 'llama-2-7b-onnx/model.onnx'  # converted model from above
max_seq_len = 1024  # assumed; must equal the sequence length used at export time
ort_session = ort.InferenceSession(str(output_file))
tokenizer = AutoTokenizer.from_pretrained("llama-2-7b-chat-hf", use_fast=True)
tokenizer.add_special_tokens({"pad_token": "<pad>"})
inputs = tokenizer(
    "Hello, my dog is cute",
    # Assumed arguments: the exported graph has fixed input shapes, so pad to match
    padding="max_length",
    max_length=max_seq_len,
    truncation=True,
    return_tensors="np",
)
loaded_model_out = ort_session.run(None, dict(inputs))
tokenizer.batch_decode(torch.argmax(torch.tensor(loaded_model_out[0]), dim=-1))

😖 On my machine, this generates really funky outputs:

ЉЉЉЉЉЉ\n\n\n\n\n\n\n\n\n\n Hello Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis Hinweis..........SMSMSMSMSMSMSMSMSMSMSMS Unterscheidung, I name is ough,

… which is mostly due to missing a proper decoding strategy (greedy, beam search, etc.) while generating tokens.
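For contrast with a single argmax over every position, a greedy decoding loop feeds the model its own output one token at a time. The sketch below is framework-free: `next_token_logits` is a hypothetical stand-in for something like an ONNX Runtime call that returns the logits for the last position, and `toy_model` exists only to make the example runnable.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_token_logits: Callable[[List[int]], Sequence[float]],
    prompt_ids: List[int],
    max_new_tokens: int,
    eos_token_id: int,
) -> List[int]:
    """Repeatedly append the highest-scoring next token until EOS or the limit."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)  # scores for the token that follows ids[-1]
        next_id = max(range(len(logits)), key=lambda i: logits[i])
        ids.append(next_id)
        if next_id == eos_token_id:
            break
    return ids


def toy_model(ids: List[int]) -> List[float]:
    """Hypothetical 5-token 'model' that always prefers (last token + 1) mod 5."""
    preferred = (ids[-1] + 1) % 5
    return [1.0 if i == preferred else 0.0 for i in range(5)]


generated = greedy_decode(toy_model, prompt_ids=[1], max_new_tokens=10, eos_token_id=4)
print(generated)
```

Beam search follows the same loop shape but keeps several candidate sequences per step instead of one.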

optimum-cli gibberish text and tensorrt slowness

To solve the problem above, we can try a different exporter which includes decoding strategies.

Using the Optimum ONNX exporter instead (assuming the original model is in ./llama-2-7b-chat-hf/), we can do:

optimum-cli export onnx \
  --model ./llama-2-7b-chat-hf/ --task text-generation --framework pt \
  --opset 16 --sequence_length 1024 --batch_size 1 --device cuda --fp16 \
  ./onnx_optimum/

⌛ This takes a few minutes to generate. If you don’t have a GPU for this conversion, remove --device cuda from the above command (along with --fp16, which is only supported when exporting on GPU).

The result is:

 ├── config.json
 ├── Constant_162_attr__value
 ├── Constant_170_attr__value
 ├── decoder_model.onnx
 ├── decoder_model.onnx_data
 ├── generation_config.json
 ├── special_tokens_map.json
 ├── tokenizer_config.json
 ├── tokenizer.json
 └── tokenizer.model

Now when I try to do inference using optimum.onnxruntime.ORTModelForCausalLM, things work fine (though slowly) using the CPUExecutionProvider:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("./onnx_optimum")
model = ORTModelForCausalLM.from_pretrained("./onnx_optimum/", use_cache=False, use_io_binding=False)
inputs = tokenizer("My name is Arthur and I live in", return_tensors="pt")
gen_tokens = model.generate(**inputs, max_length=16)
assert model.providers == ['CPUExecutionProvider']

After waiting a long time, we get a result:

<s> My name is Arthur and I live in a small town in the countr

But when switching to the faster CUDAExecutionProvider, I get gibberish text on inference:

model = ORTModelForCausalLM.from_pretrained("./onnx_optimum/", use_cache=False, use_io_binding=False, provider="CUDAExecutionProvider")
inputs = tokenizer("My name is Arthur and I live in", return_tensors="pt").to("cuda")
gen_tokens = model.generate(**inputs, max_length=16)
assert model.providers == ['CUDAExecutionProvider', 'CPUExecutionProvider']
2023-08-02 19:47:43.534099146 [W:onnxruntime:, VerifyEachNodeIsAssignedToAnEp]
Some nodes were not assigned to the preferred execution providers which may or may not
have an negative impact on performance. e.g. ORT explicitly assigns shape related ops
to CPU to improve perf.
2023-08-02 19:47:43.534136078 [W:onnxruntime:, VerifyEachNodeIsAssignedToAnEp]
Rerunning with verbose output on a non-minimal build will show node assignments.

<s> My name is Arthur and I live in a<unk><unk><unk><unk><unk><unk>

Even with different temperature and other parameter values, it always yields unintelligible outputs, as reported in optimum#1248.

🎉 Update: after about a week this issue seemed to magically disappear — possibly due to a new version of llama-2-7b-chat-hf being released.

Using the new model with max_length=128:

  • Prompt: Why should one run Machine learning model on-premises?
    • ONNX inference latency: 2.31s
    • HuggingFace version latency: 3s

🚀 The ONNX model is ~23% faster than the HuggingFace variant!
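A minimal sketch of how latencies like these can be collected and compared. The helper names are hypothetical and the median-of-several-runs approach is an assumption, not necessarily how the numbers above were measured; the last line reproduces the percentage from the two quoted figures.

```python
import time
from statistics import median
from typing import Callable


def measure_latency(fn: Callable[[], object], warmup: int = 1, runs: int = 5) -> float:
    """Median wall-clock seconds per call, after a few untimed warm-up calls."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return median(timings)


def percent_faster(baseline_s: float, candidate_s: float) -> float:
    """Latency reduction of the candidate relative to the baseline, in percent."""
    return (baseline_s - candidate_s) / baseline_s * 100


# Figures quoted above: HuggingFace 3 s, ONNX 2.31 s.
print(f"{percent_faster(3.0, 2.31):.0f}% faster")
```

For GPU inference the timed callable should block until the output is ready, since CUDA kernels are launched asynchronously.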

⚠️ However, while both CPU and CUDA providers work, there now seems to be a bug when trying TensorrtExecutionProvider, reported in optimum#1278.

optimum-cli segfaults

Next let’s try with the Dolly-v2 7B from Databricks. The equivalent optimum-cli command for ONNX conversion would be:

optimum-cli export onnx \
  --model 'databricks/dolly-v2-7b' --task text-generation --framework pt \
  --opset 17 --sequence_length 1024 --batch_size 1 --fp16 --device cuda \
  ./dolly_optimum/

😢 It uses around 17GB of my GPU RAM, seemingly working fine but finally ending with a segmentation fault:

======= Diagnostic Run torch.onnx.export version 2.1.0.dev20230804+cu118 =======
verbose: False, log level: 40
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================
Saving external data to one file...
2023-08-09 20:59:33.334484259 [W:onnxruntime:, VerifyEachNodeIsAssignedToAnEp]
Some nodes were not assigned to the preferred execution providers which may or may not
have an negative impact on performance. e.g. ORT explicitly assigns shape related ops
to CPU to improve perf.
2023-08-09 20:59:33.334531829 [W:onnxruntime:, VerifyEachNodeIsAssignedToAnEp]
Rerunning with verbose output on a non-minimal build will show node assignments.
Asked a sequence length of 1024, but a sequence length of 1 will be used with
use_past == True for `input_ids`.
Post-processing the exported models...
Segmentation fault (core dumped)

Confusingly, despite this error, all model files seem to be converted and saved to disk. Other people have reported similar segfault issues while exporting (transformers#21360, optimum#798).

Results using the Dolly v2 model:

  • Prompt: Why should one run Machine learning model on-premises?
    • ONNX inference latency: 8.2s
    • HuggingFace version latency: 5.2s

😠 The ONNX model is actually ~58% slower than the HuggingFace variant!

To make things faster, we can try to optimize the model:

optimum-cli onnxruntime optimize -O4 --onnx_model ./dolly_optimum/ -o dolly_optimized/

The different optimization levels are:

  • -O1: basic general optimizations.
  • -O2: basic and extended general optimizations, transformers-specific fusions.
  • -O3: same as O2 with GELU approximation.
  • -O4: same as O3 with mixed precision (fp16, GPU-only).

We still get the same segfault error for all of the levels.

For -O1, the model gets saved but there’s no noticeable performance change. For -O2 the process gets killed (even though I have a 40GB A100 GPU + 80GB of CPU RAM). Meanwhile -O3 & -O4 give the segfault above while only partially saving the model files.

torch.onnx.export gibberish images

Moving on from text-based models, let’s now look at an image generator. We can try to speed up the Stable Diffusion 2.1 model. In an IPython shell:

import torch
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16).to("cuda:0")
%time img = pipe("Iron man laughing", num_inference_steps=20, num_images_per_prompt=1).images[0]
img.save("iron_man.png", format="PNG")

The latency (as measured by the %time magic) is 3.25 s.

To convert to ONNX format, we can use this script:

python \
  --model_path stabilityai/stable-diffusion-2-1 \
  --output_path sd_onnx/ --opset 16 --fp16

ℹ️ Note: if a model uses operators unsupported by the opset number above, you'll have to upgrade PyTorch to the nightly build:

pip uninstall torch
pip install --pre torch --index-url

The result is:

├── model_index.json
├── scheduler
│   └── scheduler_config.json
├── text_encoder
│   └── model.onnx
├── tokenizer
│   ├── merges.txt
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   └── vocab.json
├── unet
│   ├── model.onnx
│   └── weights.pb
├── vae_decoder
│   └── model.onnx
└── vae_encoder
    └── model.onnx

There’s a separate ONNX model for each Stable Diffusion subcomponent model.

Now to benchmark this similarly we can do the following:

from diffusers import OnnxStableDiffusionPipeline
pipe = OnnxStableDiffusionPipeline.from_pretrained("sd_onnx", provider="CUDAExecutionProvider")
%time img = pipe("Iron man laughing", num_inference_steps=20, num_images_per_prompt=1).images[0]
img.save("iron_man.png", format="PNG")

The overall performance results look great, at ~59% faster! We also didn’t see any noticeable quality difference between the models.

  • Prompt: Iron man laughing
    • ONNX inference latency: 1.34s
    • HuggingFace version latency: 3.25s

Since we know that the unet model is the bottleneck, taking ~90% of the compute time, we can focus on it for further optimization. We’ll try to serialize the ONNX version of the UNet to a TensorRT engine. When building the engine, the builder object selects the most optimized kernels for the chosen platform and configuration. Building the engine from a network definition file can be time-consuming, and should not be repeated each time we need to perform inference unless the model/platform/configuration changes. Instead, you can save the generated engine to disk for later reuse (known as serializing the engine). Deserializing occurs when you load the engine from disk into memory.
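The build-once, reuse-many pattern can be sketched without TensorRT at all. Everything below is hypothetical scaffolding: `engine_cache_path`, `load_or_build` and `fake_build` are illustrative names, the cache key assumes the ONNX path identifies the model (hash the file contents instead if files get overwritten), and the engine is treated as opaque bytes.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def engine_cache_path(cache_dir: Path, onnx_path: str, platform: str, config: dict) -> Path:
    """Derive a file name that changes whenever model, platform or config changes."""
    key = repr((onnx_path, platform, sorted(config.items()))).encode()
    return cache_dir / f"engine-{hashlib.sha256(key).hexdigest()[:16]}.trt"


def load_or_build(path: Path, build_fn: Callable[[], bytes]) -> bytes:
    """Read the serialized engine from disk if present, otherwise build and save it."""
    if path.exists():
        return path.read_bytes()  # fast path: reuse the saved engine
    engine_bytes = build_fn()  # slow path: the expensive build step
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(engine_bytes)
    return engine_bytes


with tempfile.TemporaryDirectory() as tmp:
    path = engine_cache_path(Path(tmp), "sd_onnx/unet/model.onnx", "A100", {"fp16": True})
    build_calls = []

    def fake_build() -> bytes:
        # Stand-in for the real engine build; records how often it runs
        build_calls.append(1)
        return b"serialized-engine"

    load_or_build(path, fake_build)
    load_or_build(path, fake_build)  # second call is served from disk
    print(len(build_calls))
```

Putting the platform and configuration into the key means a changed setting triggers a rebuild instead of silently loading a stale engine.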

To set up TensorRT properly, follow this support table. It’s a bit painful, and (similar to cuda/cudnn) if you just want a quick solution you can use NVIDIA’s tensorrt:22.12-py3 docker image as a base:

FROM nvcr.io/nvidia/tensorrt:22.12-py3
RUN pip install ipython transformers optimum[onnxruntime-gpu] onnx diffusers accelerate scipy safetensors composer
RUN pip uninstall torch -y && pip install --pre torch --index-url
COPY sd_onnx sd_onnx

We can then use the following script for serialization:

import tensorrt as trt
import torch

onnx_model = "sd_onnx/unet/model.onnx"
engine_filename = "unet.trt" # saved serialized tensorrt engine file path
# constants
batch_size = 1
height = 512
width = 512
latents_shape = (batch_size, 4, height // 8, width // 8)
# shape required by Stable Diffusion 2.1's UNet model
embed_shape = (batch_size, 64, 1024)
timestep_shape = (batch_size,)

TRT_LOGGER = trt.Logger(trt.Logger.INFO)
TRT_BUILDER = trt.Builder(TRT_LOGGER)
network = TRT_BUILDER.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = TRT_BUILDER.create_builder_config()
profile = TRT_BUILDER.create_optimization_profile()

print("Loading & validating ONNX model")
onnx_parser = trt.OnnxParser(network, TRT_LOGGER)
parse_success = onnx_parser.parse_from_file(onnx_model)
# print every parser error before bailing out
for idx in range(onnx_parser.num_errors):
    print(onnx_parser.get_error(idx))
if not parse_success:
    raise ValueError("ONNX model parsing failed")

# set input, latent and other shapes required by the layers
# (min, opt and max are identical, so the engine only accepts these exact shapes)
profile.set_shape("sample", latents_shape, latents_shape, latents_shape)
profile.set_shape("encoder_hidden_states", embed_shape, embed_shape, embed_shape)
profile.set_shape("timestep", timestep_shape, timestep_shape, timestep_shape)
config.add_optimization_profile(profile)

print(f"Serializing & saving engine to '{engine_filename}'")
serialized_engine = TRT_BUILDER.build_serialized_network(network, config)
if serialized_engine is None:
    raise RuntimeError("TensorRT engine build failed")
with open(engine_filename, 'wb') as f:
    f.write(serialized_engine)

Now let’s move to deserializing unet.trt for inference. We’ll use the TRTModel class from x-stable-diffusion’s trt_model:

import torch
import tensorrt as trt
trt.init_libnvinfer_plugins(None, "")
import pycuda.autoinit
from diffusers import AutoencoderKL, LMSDiscreteScheduler
from PIL import Image
from torch import autocast
from transformers import CLIPTextModel, CLIPTokenizer
from trt_model import TRTModel
from tqdm.contrib import tenumerate

class TrtDiffusionModel:
    def __init__(self):
        self.device = torch.device("cuda")
        self.unet = TRTModel("./unet.trt") # tensorrt engine saved path
        self.vae = AutoencoderKL.from_pretrained(
            "stabilityai/stable-diffusion-2-1", subfolder="vae").to(self.device)
        self.tokenizer = CLIPTokenizer.from_pretrained(
            "stabilityai/stable-diffusion-2-1", subfolder="tokenizer")
        self.text_encoder = CLIPTextModel.from_pretrained(
            "stabilityai/stable-diffusion-2-1", subfolder="text_encoder").to(self.device)
        # Assumed scheduler settings (the usual Stable Diffusion values)
        self.scheduler = LMSDiscreteScheduler(
            beta_start=0.00085, beta_end=0.012,
            beta_schedule="scaled_linear", num_train_timesteps=1000)

    def predict(
        self, prompts, num_inference_steps=50, height=512, width=512, max_seq_length=64
    ):
        guidance_scale = 7.5
        batch_size = 1
        # Tokenizer arguments below are assumed: pad/truncate to max_seq_length
        text_input = self.tokenizer(
            prompts,
            padding="max_length",
            max_length=max_seq_length,
            truncation=True,
            return_tensors="pt",
        )
        text_embeddings = self.text_encoder(text_input.input_ids.to(self.device))[0]
        uncond_input = self.tokenizer(
            [""] * batch_size,
            padding="max_length",
            max_length=max_seq_length,
            return_tensors="pt",
        )
        uncond_embeddings = self.text_encoder(uncond_input.input_ids.to(self.device))[0]
        # classifier-free guidance: unconditional and text branches share one batch
        text_embeddings = torch.cat([uncond_embeddings, text_embeddings])

        latents = torch.randn((batch_size, 4, height // 8, width // 8)).to(self.device)
        self.scheduler.set_timesteps(num_inference_steps)
        latents = latents * self.scheduler.sigmas[0]

        with torch.inference_mode(), autocast("cuda"):
            for i, t in tenumerate(self.scheduler.timesteps):
                latent_model_input = torch.cat([latents] * 2)
                sigma = self.scheduler.sigmas[i]
                latent_model_input = latent_model_input / ((sigma**2 + 1) ** 0.5)
                # predict the noise residual
                # (assumed input order: sample, timestep, encoder_hidden_states)
                inputs = [
                    latent_model_input,
                    torch.tensor([t]).to(self.device),
                    text_embeddings,
                ]
                noise_pred = self.unet(inputs, timing=True)
                noise_pred = torch.reshape(noise_pred[0], (batch_size*2, 4, 64, 64))
                noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
                noise_pred = noise_pred_uncond + guidance_scale * (
                    noise_pred_text - noise_pred_uncond)
                # compute the previous noisy sample x_t -> x_t-1
                latents = self.scheduler.step(noise_pred.cuda(), t, latents)["prev_sample"]
            # scale and decode the image latents with VAE
            latents = 1 / 0.18215 * latents
            image = self.vae.decode(latents).sample
        return image

model = TrtDiffusionModel()
image = model.predict(
    prompts="Iron man laughing, real photoshoot",
)
# convert from [-1, 1] tensors to uint8 PIL images
image = (image / 2 + 0.5).clamp(0, 1)
image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
images = (image * 255).round().astype("uint8")
pil_images = [Image.fromarray(image) for image in images]

The above script runs, but the generated output looks like this:

Something’s going wrong, and changing to different tensor shapes (defined above) also doesn’t help fix the generation of blank/noisy images.
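One thing that may be worth ruling out, offered as an unverified assumption rather than a diagnosis: the optimization profile in the serialization script fixes every input at batch_size = 1, while predict() concatenates the unconditional and text branches for classifier-free guidance, which doubles the batch dimension. The sketch below only compares those shapes on paper; `find_shape_mismatches` is a hypothetical helper, and the runtime timestep shape is assumed.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(profile: Dict[str, Shape], runtime: Dict[str, Shape]) -> List[str]:
    """Names of inputs whose runtime shape differs from the fixed profile shape."""
    return [name for name, shape in runtime.items() if profile.get(name) != shape]


batch_size, height, width, max_seq_length = 1, 512, 512, 64

# Shapes the engine was built for (min = opt = max in the serialization script).
profile_shapes = {
    "sample": (batch_size, 4, height // 8, width // 8),
    "encoder_hidden_states": (batch_size, max_seq_length, 1024),
    "timestep": (batch_size,),
}

# Shapes fed at inference: guidance stacks two branches into one batch.
runtime_shapes = {
    "sample": (batch_size * 2, 4, height // 8, width // 8),
    "encoder_hidden_states": (batch_size * 2, max_seq_length, 1024),
    "timestep": (1,),  # assumed: a single scalar timestep per step
}

print(find_shape_mismatches(profile_shapes, runtime_shapes))
```

If the mismatch is real, building the profile with a leading dimension of batch_size * 2 for `sample` and `encoder_hidden_states` would be the first thing to try.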

I don’t know how to make Stable Diffusion 2.1 work with TensorRT, though it’s proved possible for other Stable Diffusion variants in AUTOMATIC1111/stable-diffusion-webui. Others have reported similar issues, along with suggested workarounds, in stable-diffusion-webui#5503.

Other Frustrations

Maybe the code above is partially in my control, but there are also other issues that have nothing to do with my code:

  • Licences: Text Generation Inference recently adopted a new license that is more restrictive for newer versions, so I can only use old releases (up to v0.9).
  • Lack of GPU support: GGML doesn’t currently support GPU inference, so I can’t use it if I want very low latency.
  • Quality: I’ve heard from peers who saw a big decrease in output quality with vLLM. I’d like to explore this in future.


I’ve listed my recent errors and frustrations. I need more time to dig deeper and solve them, but if you think you can help please do reply in any of the issues linked above! By sharing my experiences and challenges, I hope this can spark lots of discussions and new ideas. Maybe you’ve faced something similar?

While the world likes showcasing the latest advancements and shiny results, it’s important to also acknowledge and address the underlying complexities that come with deploying & maintaining ML models. There’s a scarcity of documentation/resources for these problems in the ML community. As the field continues to rapidly evolve, there is a need for more in-depth discussions and solutions to these technical hurdles.

