This project implements a diffusion model based on the architecture proposed in the DDPM paper. The model is designed to generate 32x32 landscape images using a UNet architecture with self-attention, upsampling, and downsampling blocks.
Here are some landscape images generated by the model after training. Note that each individual generated image is 32x32 pixels; the samples are tiled together into the grids shown below.
Sampling 1 | Sampling 2 |
---|---|
*(generated sample grid)* | *(generated sample grid)* |
The model's progress in learning to generate realistic landscape images can be seen in the following table:
Epoch # | Generated Images |
---|---|
Epoch 1 | |
Epoch 50 | |
Epoch 100 | |
Epoch 150 | |
Epoch 300 | |
Epoch 500 | |
1- The `Diffusion` class: This class is a wrapper around the `UNet` class and implements the forward and reverse processes in diffusion models (adding noise and denoising), which ultimately generates new images.
- Noise Steps: 1000 steps of noise addition
- Noise Schedule: Linear schedule for beta values between `1e-4` and `0.02`
- Image Size: 32x32 input and output
- Sampling: Images are sampled from random Gaussian noise and denoised through reverse diffusion (see the sketch after this list)
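To make the forward and reverse processes concrete, here is a minimal PyTorch sketch of such a wrapper, using the hyperparameters listed above (1000 noise steps, linear betas from `1e-4` to `0.02`, 32x32 images). The method names and signatures are illustrative assumptions, not the repository's actual code:

```python
import torch

class Diffusion:
    """Illustrative DDPM wrapper: linear beta schedule, forward noising, reverse sampling."""

    def __init__(self, model, noise_steps=1000, beta_start=1e-4, beta_end=0.02,
                 img_size=32, device="cuda"):
        self.model = model                      # UNet that predicts the added noise
        self.noise_steps = noise_steps
        self.img_size = img_size
        self.device = device
        # Linear schedule for beta between 1e-4 and 0.02, as configured above
        self.beta = torch.linspace(beta_start, beta_end, noise_steps, device=device)
        self.alpha = 1.0 - self.beta
        self.alpha_hat = torch.cumprod(self.alpha, dim=0)

    def noise_images(self, x0, t):
        """Forward process: x_t = sqrt(alpha_hat_t) * x_0 + sqrt(1 - alpha_hat_t) * eps."""
        sqrt_ah = torch.sqrt(self.alpha_hat[t])[:, None, None, None]
        sqrt_one_minus_ah = torch.sqrt(1.0 - self.alpha_hat[t])[:, None, None, None]
        eps = torch.randn_like(x0)
        return sqrt_ah * x0 + sqrt_one_minus_ah * eps, eps

    @torch.no_grad()
    def sample(self, n):
        """Reverse process: start from pure Gaussian noise and denoise step by step."""
        x = torch.randn(n, 3, self.img_size, self.img_size, device=self.device)
        for i in reversed(range(1, self.noise_steps)):
            t = torch.full((n,), i, device=self.device, dtype=torch.long)
            pred_noise = self.model(x, t)
            alpha = self.alpha[t][:, None, None, None]
            alpha_hat = self.alpha_hat[t][:, None, None, None]
            beta = self.beta[t][:, None, None, None]
            noise = torch.randn_like(x) if i > 1 else torch.zeros_like(x)
            x = (1 / torch.sqrt(alpha)) * (
                x - ((1 - alpha) / torch.sqrt(1 - alpha_hat)) * pred_noise
            ) + torch.sqrt(beta) * noise
        return x.clamp(-1, 1)
```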
2- UNet Architecture: The `UNet` class is the concrete model that learns to predict the noise in images. It includes downsampling, upsampling, and self-attention blocks to capture both local and global features in the images.
- DoubleConv: Two convolutional layers with GroupNorm and GELU activation, used throughout the network.
- Down: Downsampling block that reduces the spatial resolution while increasing the feature maps.
- Up: Upsampling block to restore the spatial resolution.
- SelfAttention: Implements the attention mechanism in the model (these blocks are sketched below).
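As an illustration, here is a minimal PyTorch sketch of what these building blocks might look like. The exact channel arithmetic and layer choices are assumptions based on the descriptions above, not the repository's code:

```python
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two 3x3 convolutions, each followed by GroupNorm, with GELU activation."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.GroupNorm(1, out_ch),
            nn.GELU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.GroupNorm(1, out_ch),
        )

    def forward(self, x):
        return self.net(x)

class Down(nn.Module):
    """Halve the spatial resolution with max-pooling, then grow the feature maps."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(nn.MaxPool2d(2), DoubleConv(in_ch, out_ch))

    def forward(self, x):
        return self.net(x)

class Up(nn.Module):
    """Double the spatial resolution and merge the encoder's skip connection."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True)
        # in_ch counts the upsampled features plus the concatenated skip channels
        self.conv = DoubleConv(in_ch, out_ch)

    def forward(self, x, skip):
        x = self.up(x)
        return self.conv(torch.cat([skip, x], dim=1))

class SelfAttention(nn.Module):
    """Multi-head self-attention over the flattened spatial grid (global features)."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.ln = nn.LayerNorm(channels)
        self.mha = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)          # (B, H*W, C)
        q = self.ln(seq)
        attn_out, _ = self.mha(q, q, q)
        seq = seq + attn_out                        # residual connection
        return seq.transpose(1, 2).view(b, c, h, w)
```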
The diffusion model was trained on a landscape image dataset using the following configuration (a training-loop sketch follows the list):
- Learning Rate: 3e-4
- Optimizer: AdamW
- Loss Function: Mean Squared Error (MSE)
- Number of Epochs: 500
- Batch Size: 24
- Input Shape: 32 x 32 (RGB)
- Output Shape: 32 x 32 (RGB)
- Beta Noise Schedule: Linear schedule between `1e-4` and `0.02` for the forward process of adding noise during diffusion
- Model Architecture: UNet + self-attention layers
- Hardware: The model was trained on a Kaggle notebook with a P100 GPU.
- Training Time: Around 13 hours
- Dataset Used: The dataset consisted of around 4200 landscape images, resized to 32x32 pixels due to limited resources and training time. The dataset can be found here.
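Putting the pieces together, a training loop matching this configuration might look like the following sketch. It assumes the `Diffusion` wrapper sketched earlier and a dataset that yields (image, label) pairs; the `train` function and its signature are illustrative:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, diffusion, dataset, epochs=500, batch_size=24, lr=3e-4, device="cuda"):
    """Illustrative loop matching the configuration above: AdamW, MSE, 500 epochs."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    mse = nn.MSELoss()
    model.to(device)
    for epoch in range(epochs):
        for images, _ in loader:                # dataset assumed to yield (image, label)
            images = images.to(device)
            # Sample a random timestep per image and apply the forward noising process
            t = torch.randint(1, diffusion.noise_steps, (images.size(0),), device=device)
            x_t, noise = diffusion.noise_images(images, t)
            # The UNet predicts the added noise; MSE compares it to the true noise
            pred = model(x_t, t)
            loss = mse(pred, noise)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```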
The training loss over around 90,000 iterations can be seen in the following plot: