VTeam AI

Behind the Scenes of Text-to-Image models: DALL-E and Stable Diffusion

DALL-E and Stable Diffusion represent significant breakthroughs in text-to-image synthesis within the dynamic world of artificial intelligence. OpenAI's DALL-E not only converts textual prompts into stunning visuals but also exhibits an unparalleled imaginative flair. This blog post will uncover the magic of these models, highlighting their revolutionary impact on creativity and innovation in AI-driven artistry.

In the realm of artificial intelligence and generative modeling, two extraordinary advancements have emerged, forever altering the landscape of creative content generation. DALL-E and Stable Diffusion, although distinct in their approaches, share a common goal: to bridge the gap between textual descriptions and vivid, photorealistic images.

DALL-E, introduced by OpenAI, ignited a creative revolution by demonstrating the remarkable potential of text-to-image synthesis. It captured the imagination of the world by generating images that sprung to life from mere textual prompts. But the story doesn't end there. In 2022, Stable Diffusion emerged as an open-source, text-to-image diffusion model, poised to rival DALL-E's capabilities while offering unique advantages.

This blog delves into the inner workings of DALL-E and Stable Diffusion, unraveling the intricacies of how these models transform text into tangible, visual art. We will explore their origins, the technology that powers them, and their real-world applications. By the end of this journey, you'll gain a profound understanding of the magic behind text-to-image generation and the exciting possibilities it unlocks for creativity, communication, and innovation.

So, fasten your seatbelts as we embark on a captivating exploration of DALL-E and Stable Diffusion, two pioneers reshaping the boundaries of AI and artistry. We will start off with DALL-E.



DALL-E (its name a portmanteau of the surrealist artist Salvador Dalí and Pixar's WALL-E) is a groundbreaking artificial intelligence model developed by OpenAI. This innovative model is designed to bridge the gap between textual descriptions and image generation. DALL-E gained widespread recognition for its remarkable capabilities and creative potential.

Key Features and Capabilities:

  1. Text-to-Image Synthesis: DALL-E excels at transforming textual descriptions into visually stunning images. You can provide it with written prompts like "a two-story pink house shaped like a shoe" or "a painting of a futuristic city at night," and it will generate corresponding images.

  2. Creative Imagination: One of DALL-E's standout features is its ability to conjure up entirely novel and imaginative concepts based on textual input. It can generate artwork that goes beyond conventional representations.

  3. Multimodal Understanding: DALL-E demonstrates an understanding of both text and images. It can combine elements from different textual prompts to create composite images, showcasing its ability to grasp context and blend ideas.

  4. Customization: Users can fine-tune DALL-E's output by specifying aspects like the image's style, content, or other attributes through detailed prompts, providing a high degree of creative control.

  5. Wide Range of Applications: DALL-E has diverse applications, from generating visual content for storytelling, design, and art to aiding in brainstorming and concept visualization.

The architecture


The training architecture looks something like the image above. Let's understand its three major components: CLIP, the Diffusion Prior, and the GLIDE-style decoder.

CLIP (Contrastive Language-Image Pre-Training) by OpenAI

CLIP is a groundbreaking AI model designed to connect text and images by embedding both into a shared representation space. Here's how CLIP works and how it supplies text embeddings for DALL-E:

  1. Text-Image Connection: CLIP aims to bridge the gap between text and images by creating a shared understanding of both modalities. It can understand and relate textual descriptions to visual content.

  2. Contrastive Learning: CLIP leverages contrastive learning, a technique that trains a model to distinguish between similar and dissimilar pairs of data. In CLIP's case, it learns to pull the embeddings of matching text-image pairs together while pushing the embeddings of non-matching pairs apart.

  3. Dual Encoders: CLIP employs two encoder architectures—one for text and another for images. These encoders process and transform the input data into a common representation space where text and images can be compared.

  4. Cosine Similarity: To generate text embeddings for DALL-E, CLIP uses cosine similarity as a metric. When given a text prompt, CLIP calculates the cosine similarity between the text embedding of the prompt and the embeddings of a set of images, including those generated by DALL-E.

  5. Matching and Ranking: CLIP ranks the images based on their similarity to the text prompt. This means it identifies which images are most relevant or closely related to the provided text description.

  6. DALL-E Integration: By using CLIP's text embeddings and the ranked list of images, DALL-E can then generate images that align with the given textual prompts. DALL-E's ability to create images from text benefits from CLIP's understanding of the relationships between text and images.
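The cosine-similarity ranking described above can be sketched with toy embeddings. This is plain NumPy with made-up 4-dimensional vectors standing in for CLIP's real 512/768-dimensional encoder outputs; the vector values here are illustrative, not actual CLIP embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings standing in for CLIP's high-dimensional vectors
text_embedding = np.array([0.1, 0.9, 0.2, 0.4])   # e.g. "a photo of a cat"
image_embeddings = {
    "cat_photo": np.array([0.1, 0.8, 0.3, 0.5]),  # semantically close to the prompt
    "car_photo": np.array([0.9, 0.1, 0.7, 0.0]),  # semantically far from the prompt
}

# Rank candidate images by similarity to the text prompt, highest first
ranked = sorted(image_embeddings.items(),
                key=lambda kv: cosine_similarity(text_embedding, kv[1]),
                reverse=True)
print([name for name, _ in ranked])
```

With these toy vectors, the "cat" image scores highest, which is exactly the matching-and-ranking behavior CLIP provides to DALL-E.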

Diffusion Prior

The Diffusion Prior in DALL-E is a critical component that plays a key role in the model's architecture: it generates image embeddings from the text embeddings coming from CLIP. Here's an explanation:

  1. Role of Diffusion Prior: In DALL-E, the Diffusion Prior is a probabilistic model that helps generate images from text descriptions. It acts as a prior distribution that guides the generation process.

  2. Diffusion Concept: Diffusion is a mathematical concept where noise is gradually added to an image, making it progressively noisier. This process is known as Forward Diffusion. The goal is to reach a point where the image becomes entirely random noise.

  3. Reverse Process: After reaching the noisy state, the process is reversed, known as Reverse Diffusion. The idea is to systematically reduce the noise, step by step until the original image is restored.

  4. Denoising Model: To perform Reverse Diffusion effectively, DALL-E uses a denoiser model. This denoiser helps recover the original image from the noisy version by reducing the added noise at each step.

  5. Image Embeddings: The Diffusion Prior, combined with the denoising process, is used to transform text embeddings (representations of textual descriptions) into image embeddings (representations of images). This allows DALL-E to generate images from text prompts.
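The Forward Diffusion step described above has a convenient closed form: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise. A minimal NumPy sketch, assuming a simple linear beta schedule (real models use carefully tuned schedules and learned denoisers):

```python
import numpy as np

# Toy linear noise schedule over T steps
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)          # cumulative product: how much signal survives

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))         # a tiny "image" standing in for real pixels

def q_sample(x0, t):
    # Forward diffusion: sample x_t directly from x_0 in closed form
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

x_early = q_sample(x0, 10)               # barely noised: still close to x0
x_late = q_sample(x0, T - 1)             # nearly pure Gaussian noise
```

At t=10 almost all of the original signal survives (alpha_bar close to 1), while at t=999 it is essentially destroyed, which is the "entirely random noise" endpoint that Reverse Diffusion then learns to undo.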


Before DALL-E 2, GLIDE also generated images from text prompts, but without any image embeddings. Only the text embeddings were used, and there was no concept of a Prior in GLIDE. Let's look at the steps:

  • Generate text embeddings using Transformers
  • Feed them to a U-Net, trained as a diffusion denoiser, to generate images directly from the text embeddings

Enough about DALL-E; let's move on to Stable Diffusion, which is built on the same diffusion process used by DALL-E.

Stable Diffusion

The Stable Diffusion model is a powerful generative artificial intelligence (AI) model primarily used for generating detailed images from text descriptions or prompts, similar to DALL-E.

  • Text-to-Image Generation: Stable Diffusion is designed to convert textual input, such as descriptions or prompts, into photorealistic images.
  • Based on Diffusion Techniques: The model utilizes diffusion techniques, specifically a latent diffusion model, to create images. This involves adding controlled noise to an initial random noise signal and gradually removing it to match the text prompt.
  • Multimodal Outputs: In addition to images, Stable Diffusion can also be used to generate videos and animations from text and image prompts.

Salient Features:

  1. Photorealistic Images: Stable Diffusion excels in generating images that closely resemble real-life objects and scenes. It can produce images that are challenging to distinguish from actual photographs.

  2. Artistic Style Control: The model is versatile and can create images with specific artistic styles based on textual prompts. For example, it can generate an image of an apple in a particular artistic style.

  3. Inpainting and Outpainting: In addition to generating images from scratch, Stable Diffusion can perform inpainting (replacing missing parts of an image) and outpainting (extending an existing image).

  4. High Resolution: The decoder component of Stable Diffusion can upscale images to larger sizes, ensuring the output images are of high resolution.

  5. Fine Control: It offers fine control over image generation, making it suitable for various creative applications.

How does Stable Diffusion work?


  1. Random Latent Tensor Generation: Stable Diffusion begins by generating a random tensor in the latent space. The seed of the random number generator can be controlled to produce a consistent random tensor, providing an initial image in latent space, which is essentially noise at this stage.

  2. Noise Prediction: A noise predictor U-Net takes both the latent noisy image and the text prompt as input. It predicts the noise in latent space, resulting in a 4x64x64 tensor representing the noise.

  3. Noise Subtraction: The predicted noise is subtracted from the latent noisy image, resulting in a new latent image with reduced noise.

  4. Sampling Steps: Steps 2 and 3 are iteratively repeated for a specified number of sampling steps, often around 20 times, to refine the latent image further.

  5. Image Reconstruction: Finally, the decoder of a Variational Autoencoder (VAE) converts the refined latent image back to pixel space. This reconstructed image is the output of Stable Diffusion, representing the generated image corresponding to the given text prompt.
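The five steps above can be sketched as a toy loop. Note that unet_predict_noise and vae_decode below are hypothetical stand-ins for the real trained networks; only the tensor shapes and control flow mirror the actual pipeline, and real samplers scale the subtracted noise according to a schedule:

```python
import numpy as np

rng = np.random.default_rng(42)                  # fixed seed -> reproducible latent

def unet_predict_noise(latent, text_embedding):
    # Stand-in for the noise-predictor U-Net (the real one is a large conditioned network)
    return 0.1 * latent

def vae_decode(latent):
    # Stand-in for the VAE decoder that maps 4x64x64 latents back to pixel space
    return np.clip(latent.mean(axis=0), -1, 1)

text_embedding = rng.standard_normal(77 * 768)   # CLIP-style prompt embedding (flattened)
latent = rng.standard_normal((4, 64, 64))        # step 1: random latent tensor

for step in range(20):                           # step 4: ~20 sampling steps
    noise = unet_predict_noise(latent, text_embedding)   # step 2: predict the noise
    latent = latent - noise                              # step 3: subtract the noise

image = vae_decode(latent)                       # step 5: decode latent to an image
print(image.shape)
```

Each iteration removes a fraction of the predicted noise, so the latent steadily converges toward a clean image representation before the decoder maps it back to pixels.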

Now that we are done with the theory, it's time for some code. We will jump into image generation. For DALL-E, we will use the LangChain wrapper:

#!pip install opencv-python scikit-image langchain openai
from langchain.utilities.dalle_image_generator import DallEAPIWrapper
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.llms import OpenAI

llm = OpenAI(openai_api_key=api_key)  # api_key holds your OpenAI API key
prompt = PromptTemplate(
    input_variables=["image_desc"],
    template="Generate a detailed prompt to generate an image based on the following description: {image_desc}",
)
chain = LLMChain(llm=llm, prompt=prompt)

Now, we will be testing DALL-E on 3 prompts:

prompts = ['Tiger and Elephant fighting in Jungle',
           'Tom Cruise washing clothes near a lake',
           'Race between cycle, car and bullock cart']
for x in prompts:
    image_url = DallEAPIWrapper().run(chain.run(x))
    print(image_url)

Now, for Stable Diffusion, we will use a different stack: Replicate together with LangChain.

!pip install openai langchain replicate
import os
from langchain.llms import Replicate
# Generating a Replicate API token requires signing in at https://replicate.com/explore
os.environ["REPLICATE_API_TOKEN"] = replicate_api  # replicate_api holds your token

Now, let's generate results for the same prompts as for DALL-E:

llm = Replicate(model="stability-ai/stable-diffusion:db21e45d3f7023abc2a46ee38a23973f6dce16bb082a930b0c49861f96d1e5bf")
prompts = ['tiger and elephant fighting in Jungle',
           'Tom Cruise washing clothes near a lake',
           'Race between cycle and car']
for x in prompts:
    image_url = llm(x)
    print(image_url)

As a matter of fact, we can use Replicate for Mini-DALL-E as well (the model identifier below is illustrative, as the original slug was not preserved):

import replicate
output = replicate.run(
    "kuprel/min-dalle",  # illustrative model slug; check Replicate for the exact version
    input={"text": "Tiger and Elephant fighting"},
)
for item in output:
    print(item)
Given the outputs, you can easily discern the strengths and weaknesses of the two models. Let's formally jot them down:
  1. Model Architecture:

    • Stable Diffusion: It is based on diffusion techniques and utilizes a diffusion model with a focus on producing photorealistic images. It involves multiple steps of noise reduction in the latent space to create detailed images from text prompts.
    • DALL-E: The original DALL-E uses a GPT-style autoregressive transformer designed for text-to-image generation, directly linking text prompts to images; DALL-E 2 instead pairs CLIP embeddings with a diffusion prior and decoder.
  2. Image Realism:

    • Stable Diffusion: Known for its ability to generate highly photorealistic images that closely resemble real-world objects and scenes.
    • DALL-E: While capable of generating visually impressive images, DALL-E's outputs may have a more surreal or artistic quality, as it can produce abstract and imaginative visuals based on the text prompts.
  3. Image Style and Artistic Expression:

    • Stable Diffusion: Primarily focuses on producing realistic images and may have limitations in terms of generating highly abstract or stylized artwork.
    • DALL-E: Known for its versatility in creating images with various artistic styles, allowing for more creativity and artistic expression.
  4. Dataset Size:

    • Stable Diffusion: Trained on publicly documented image-text data (subsets of the LAION dataset), which makes its training data transparent and reproducible.
    • DALL-E: OpenAI's DALL-E models were trained on large proprietary datasets, which can result in a wider range of image outputs.
  5. Complexity:

    • Stable Diffusion: The process involves noise reduction steps in the latent space, which adds complexity to the image generation process.
    • DALL-E: The original DALL-E generates images directly from text prompts in a single autoregressive pass, without intermediate noise-reduction steps (DALL-E 2, however, also relies on diffusion).
  6. Use Cases:

    • Stable Diffusion: Often used for tasks requiring highly realistic images, such as image inpainting and text-to-image generation for practical applications.
    • DALL-E: Suited for creative and artistic applications, as well as generating a wide range of imaginative images based on textual descriptions.


In conclusion, we delved into the fascinating world of AI image generation models, focusing on two impressive contenders: DALL-E and Stable Diffusion. We explored the inner workings of these models and even ran demos with various prompts to witness their creative potential.

DALL-E, a product of OpenAI, is renowned for its versatility and imaginative image-generation capabilities. It can turn text prompts into visually stunning and often surreal artworks. Whether it's a "graffiti-covered cityscape of New York" or "a two-story pink house shaped like a shoe," DALL-E brings these textual descriptions to life in a unique and artistic way.

On the other hand, Stable Diffusion, powered by diffusion techniques, excels in creating highly photorealistic images. It's the go-to choice for tasks requiring images that closely resemble real-world objects and scenes. With Stable Diffusion, you can generate images that are indistinguishable from photographs.

Comparing these two models, it's important to consider your specific needs. DALL-E shines in creative and artistic applications, allowing for a wide range of imaginative outputs. Meanwhile, Stable Diffusion is the go-to option when realism and practicality are paramount.

Ultimately, both DALL-E and Stable Diffusion represent remarkable achievements in AI image generation, and their unique strengths make them valuable tools for different use cases. Whether you're an artist looking to create surreal masterpieces or a professional in need of realistic visual content, these models have you covered, opening up new horizons in the world of AI-generated art and imagery.

Disclaimer: The views and opinions expressed in this blog post are solely those of the authors and do not reflect the official policy or position of any of the mentioned tools. This blog post is not a form of advertising and no remuneration was received for the creation and publication of this post. The intention is to share our findings and experiences using these tools and is intended purely for informational purposes.