March 7, 2024

PIXART-α: A Diffusion Transformer Model for Text-to-Image Generation


This article provides a short tutorial on how to run experiments with Pixart-α, a transformer-based diffusion model for generating photorealistic images from text.

The popularity of text-conditional image generation models like DALL·E 3, Midjourney, and Stable Diffusion can largely be attributed to how easily they produce stunning images from simple, meaningful text prompts. However, such models are extremely expensive to train (requiring millions of GPU hours), which seriously hinders fundamental innovation in the field of AI-generated content while increasing CO2 emissions.

Pixart-α is a novel text-to-image diffusion model whose training takes only 10.8% of the training time of Stable Diffusion v1.5, while still generating high-resolution images (up to 1024 × 1024 pixels) with quality that is competitive with the aforementioned state-of-the-art image generators.

In this article, we’ll explore:

  • The architecture and the training strategy for Pixart-α, especially how the researchers behind the model were able to optimize the training resources.
  • How we can easily run Pixart-α using 🤗 HuggingFace Diffusers and manage our experiments using Weights & Biases.
  • How the quality of images generated by Pixart-α compares with that of Stable Diffusion XL, a SoTA text-conditional image generation model.


And, since this is a GenAI report, we know what you want upfront: some stunning images to get you started.


Want to jump into the code right away and generate your own images? Check out the following Colab notebook 👇

Open In Colab

Alternatively, jump on this HuggingFace Space to start crafting your prompts using an interactive application 👇

Open in Spaces

Optimization of Text-to-Image Training

The training of advanced text-to-image models such as Stable Diffusion and DeepFloyd IF demands immense computational resources.

For instance, training Stable Diffusion v1.5 requires roughly 6,000 A100 GPU days (about 144,000 GPU-hours), costing approximately $320,000, which works out to roughly $2.2 per GPU-hour. The training also produces substantial CO2 emissions, putting additional stress on the environment.

Such a huge cost imposes a significant barrier for both researchers and entrepreneurs who want access to these models, and it hinders crucial advancement in the field of AI-generated content.


Given these challenges, the authors of the paper PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis attempt to answer a simple question:

Can we develop a high-quality image generator with affordable resource consumption?

But before understanding how we can optimize this process, let’s try to answer a fundamental question:

Why is Text-to-Image Training Slow?

A text-to-image generation task can be decomposed into three aspects:

  1. Capturing Pixel Dependency: Generating realistic images involves understanding intricate pixel-level dependencies within images and capturing their distribution.
  2. Alignment between Text and Image: Precise alignment learning is required for understanding how to generate images that accurately match the text description.
  3. Aesthetic Quality: Besides faithfully matching the textual description, being aesthetically pleasing is another vital attribute of generated images.

In models like Stable Diffusion, these three problems are entangled, and the model is trained from scratch on vast amounts of data, resulting in inefficient training.

Another problem is the quality of the captions in the LAION dataset that these models are trained on. The text-image pairs often suffer from text-image misalignment, deficient descriptions, infrequent use of diverse vocabulary, and inclusion of low-quality data. These problems make training harder, requiring millions of iterations to achieve stable alignment between text and images.

To address these issues, the training of Pixart-α is decomposed into three distinct stages, each targeting one of the aspects above.

Stage-1: Pixel Dependency Learning

The class-guided approach (from the DiT paper) has shown exemplary performance in generating semantically coherent and reasonable pixels in individual images. Training a class-conditional image generation model for natural images is relatively easy and inexpensive. Additionally, the researchers behind Pixart-α find that a suitable initialization can significantly boost training efficiency. Therefore, Pixart-α is initialized from an ImageNet-pretrained model, and its architecture is designed to be compatible with the pre-trained weights.

Stage-2: Text-image Alignment Learning

Compared to pre-trained class-guided image generation, achieving accurate alignment for text-to-image generation is more challenging and time-consuming. Moreover, the captions of the LAION dataset exhibit various issues, such as text-image misalignment, deficient descriptions, and infrequent vocabulary.

To efficiently facilitate this process, the researchers behind Pixart-α construct a dataset of precise text-image pairs with high concept density. To generate captions with high information density, they leverage the state-of-the-art vision-language model LLaVA. By employing the prompt, “Describe this image and its style in a very detailed manner,” they significantly improved the quality of the captions.

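To get a rough idea of what this auto-captioning step looks like in practice, here is a minimal sketch using a LLaVA checkpoint available through 🤗 Transformers. This is not the authors' exact captioning pipeline; the model ID, prompt template, image path, and generation settings below are illustrative assumptions.

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration


# Illustrative LLaVA checkpoint from the HuggingFace Hub (not necessarily the
# exact model or version used by the Pixart-α authors).
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)


# The captioning instruction from the paper, wrapped in the chat format that
# the llava-hf checkpoints expect.
prompt = "USER: <image>\nDescribe this image and its style in a very detailed manner. ASSISTANT:"


# Replace with an image of your own.
image = Image.open("path/to/your_image.jpg")

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output_ids[0], skip_special_tokens=True))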

However, it is worth noting that the LAION dataset predominantly comprises simplistic product previews from shopping websites, which are not ideal for training a text-to-image model that needs diversity in object combinations. Hence, the researchers turned to the SA-1B dataset, which was originally built for segmentation tasks but features imagery rich in diverse objects. By applying LLaVA to the SA-1B dataset, the researchers acquired high-quality text-image pairs characterized by a high concept density.



Stage-3: High-resolution and Aesthetic Image Generation

In the third stage, the model is fine-tuned using high-quality aesthetic data for high-resolution image generation. Remarkably, it is observed that the adaptation process in this stage converges significantly faster, primarily owing to the strong prior knowledge established in the preceding stages.

The Efficient Text-to-Image Transformer

Pixart-α adopts the Diffusion Transformer (DiT) as the base architecture and tailors the transformer blocks to handle the unique challenges of text-to-image tasks:

Cross-Attention layer
A multi-head cross-attention layer is added to each DiT block. It is positioned between the self-attention layer and the feed-forward layer so that the model can flexibly interact with the text embeddings extracted from the language model. To make use of the pre-trained weights, the output projection layer of the cross-attention layer is initialized to zero, effectively acting as an identity mapping and preserving the input for the subsequent layers.

AdaLN-single
In DiT, the linear projections in the adaLN module account for 27% of the parameters. Such a large number of parameters is not useful since the class condition is not employed in Pixart-α. Hence, the adaLN-single module is used: a single global set of scale and shift parameters is computed from the time embedding in the first block and shared across all blocks, while a layer-specific trainable embedding adjusts these parameters independently in each block.

Re-parameterization
To utilize the aforementioned pre-trained weights, all the layer-specific embeddings E_i are initialized to values that yield the same scale and shift parameters S_i as the original class-conditional DiT (without the class condition) for a selected timestep t. This design effectively replaces the layer-specific MLPs with a global MLP and layer-specific trainable embeddings while preserving compatibility with the pre-trained weights.
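To make these ideas a bit more concrete, here is a heavily simplified PyTorch sketch of a transformer block that combines the three modifications. It is not the actual Pixart-α implementation (that ships with 🤗 Diffusers and the official PixArt-alpha repository); the class name, layer sizes, and the shape of the shared scale/shift tensor are illustrative assumptions.

import torch
import torch.nn as nn


class SimplifiedPixArtBlock(nn.Module):
    # Illustrative transformer block: self-attention -> cross-attention -> MLP,
    # conditioned via a globally shared scale/shift tensor (adaLN-single style).
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        # Cross-attention sits between self-attention and the feed-forward layer.
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Zero-initialize the output projection so the block initially acts as an
        # identity mapping and preserves the behaviour of the pre-trained weights.
        nn.init.zeros_(self.cross_attn.out_proj.weight)
        nn.init.zeros_(self.cross_attn.out_proj.bias)

        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

        # adaLN-single: no per-block conditioning MLP; each block only keeps a
        # small trainable embedding that adjusts the shared scale/shift values.
        self.scale_shift_table = nn.Parameter(torch.zeros(6, dim))

    def forward(self, x, text_emb, shared_scale_shift):
        # shared_scale_shift: a (6, dim) tensor computed once per step from the
        # time embedding by a single global MLP (not shown here).
        shift_sa, scale_sa, gate_sa, shift_mlp, scale_mlp, gate_mlp = (
            self.scale_shift_table + shared_scale_shift
        ).chunk(6, dim=0)

        h = self.norm1(x) * (1 + scale_sa) + shift_sa
        x = x + gate_sa * self.self_attn(h, h, h, need_weights=False)[0]

        # text_emb: text embeddings already projected to the model dimension.
        h = self.norm2(x)
        x = x + self.cross_attn(h, text_emb, text_emb, need_weights=False)[0]

        h = self.norm3(x) * (1 + scale_mlp) + shift_mlp
        x = x + gate_mlp * self.mlp(h)
        return x

The key points are the placement of the cross-attention layer between self-attention and the MLP, the zero-initialized output projection, and the single time-conditioned set of scale/shift parameters that is shared across blocks and adjusted by a per-block trainable embedding.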

Generating Images with Pixart-α

We can use Pixart-α to generate images easily using the PixArtAlphaPipeline from 🤗 HuggingFace Diffusers. We will also use the Weights & Biases autologger for Diffusers to automatically log our generations and all experiment configurations so that they are reproducible and easy to share.

# Install all the dependencies
!pip install diffusers accelerate transformers ftfy sentencepiece wandb


import torch
from diffusers import PixArtAlphaPipeline
from wandb.integration.diffusers import autolog


# Load the pre-trained checkpoints from HuggingFace Hub to the PixArtAlphaPipeline
pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16
)


# Offload the weights to the CPU and only load them onto the GPU when
# performing the forward pass; this also saves GPU memory.
pipe.enable_model_cpu_offload()


# Call WandB Autolog for Diffusers
autolog(init=dict(project="pixart-alpha"))


# Make the experiment reproducible by controlling randomness.
# The seed will be automatically logged to WandB.
generator = torch.Generator(device="cuda").manual_seed(42)


# Generate the images by calling the PixArtAlphaPipeline
images = pipe(
    prompt="A dog that has been meditating all the time",
    negative_prompt="",
    height=1024,
    width=1024,
    generator=generator,
).images
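Once the pipeline call returns, the autologger has already logged the generated images and the pipeline configuration to W&B. You can additionally save the output locally and close the run; the filename below is arbitrary.

# The generated outputs are PIL images; save the first one locally.
images[0].save("meditating_dog.png")


# Finish the W&B run started by the autologger.
import wandb

wandb.finish()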

Here are some results! The generated images can be browsed in the interactive panels of the linked report.


Comparisons with Stable Diffusion XL

Let’s take a look at some examples of images generated by both Pixart-α and Stable Diffusion XL Base-1.0 using the same prompt at a resolution of 1024 × 1024 pixels. No negative prompts were used when generating these images.
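As a reference for how the SDXL side of this comparison can be produced, here is a minimal sketch that uses the StableDiffusionXLPipeline from 🤗 Diffusers with the same resolution and seed handling as the Pixart-α snippet above. The prompt shown is just the earlier example; the prompts actually used for the comparison panels are listed in the linked report.

import torch
from diffusers import StableDiffusionXLPipeline


# Load Stable Diffusion XL Base-1.0 in half precision.
sdxl_pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
sdxl_pipe.enable_model_cpu_offload()


# Use the same prompt, resolution, and seed as the Pixart-α run, with an
# empty negative prompt.
generator = torch.Generator(device="cuda").manual_seed(42)
sdxl_images = sdxl_pipe(
    prompt="A dog that has been meditating all the time",
    negative_prompt="",
    height=1024,
    width=1024,
    generator=generator,
).images

Since the W&B autologger was enabled earlier, these SDXL generations should also be logged automatically, which makes side-by-side comparison in the W&B workspace straightforward.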

The full side-by-side comparison table of Pixart-α and SDXL generations is available in the interactive panels of the linked report.


Let’s pause for a moment and reflect on a few observations from these comparisons:

  • Images generated by Pixart-α tend to have a more vibrant and sharper color palette compared to the ones generated by SDXL. For example, check rows 1, 5, 8, 17, 18, 22, and 25 of the comparison table in the linked report.
  • Pixart-α exhibits much stronger text-image alignment than SDXL. For example, check rows 1, 5, 10, 13, 23, and 24 of the comparison table.
  • Pixart-α can produce images that are much more detailed, vibrant, and expressive from very short prompts, compared to SDXL. For example, check row 18.

A few more images generated by Pixart-α and SDXL on the same prompts are available in the interactive panels of the linked report.


More Text-Image Alignment Challenges

Let’s put the text-image alignment to a few more tests and observe how accurately certain phrases are rendered by Pixart-α compared to SDXL:

The text-image alignment comparisons for these prompts are available in the interactive panels of the linked report.


Manipulating Image Styles using the Prompt

Let’s now look at the ability of Pixart-α to directly manipulate the image style with text prompts. In the following panel, we generate five outputs, using style phrases in the prompt to control how the objects are rendered.

The style-controlled generations are available in the interactive panels of the linked report.
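As a rough illustration of how such a style sweep can be scripted, the snippet below appends different style phrases to a single base prompt and reuses the pipe object created earlier. The base prompt and the style list are illustrative, not the exact prompts used in the panel.

# Append different style phrases to the same base prompt and generate one image
# per style, reusing the PixArtAlphaPipeline (`pipe`) loaded earlier.
base_prompt = "a portrait of a red fox in a forest"
styles = ["oil painting", "pixel art", "watercolor", "cyberpunk digital art", "pencil sketch"]

styled_images = []
for style in styles:
    image = pipe(
        prompt=f"{base_prompt}, in the style of {style}",
        negative_prompt="",
        height=1024,
        width=1024,
        generator=torch.Generator(device="cuda").manual_seed(42),
    ).images[0]
    styled_images.append(image)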


Conclusion

  • In this article, we took a look at the recently released open-source text-to-image generation model Pixart-α.
  • Pixart-α was trained by researchers at Huawei Noah’s Ark Lab at a fraction of the cost of existing text-to-image generation models like Stable Diffusion 1.5, while rivaling SoTA models like Stable Diffusion XL in text-image alignment and fidelity of generated images.
  • We briefly explored the current challenges of training text-to-image models, and how the researchers behind Pixart-α were able to optimize the process to cut down the training time.
  • We also briefly explored the base architecture of Pixart-α, derived from the Diffusion Transformer (DiT).
  • We saw how to generate images easily with Pixart-α using the PixArtAlphaPipeline from 🤗 HuggingFace Diffusers, and used the Weights & Biases autologger for Diffusers to automatically log our generations and all experiment configurations so that they are reproducible and easy to share.
  • Finally, we explored the image generation and text-image alignment capabilities of Pixart-α and compared it with Stable Diffusion XL on the same prompts across a variety of scenarios.

Author

  • Soumik Rakshit

    I build MLOps pipelines for open-source repositories like Keras, Kaolin-Wisp, YOLOv5, etc. I'm currently learning Neural Rendering, Neural Approximation and Vision-Language Modelling. I would love to collaborate on interesting Computer Vision and Graphics projects and implementations of Deep Learning Research Papers.
