Project 5: Fun with Diffusion Models
Part A
This part of the project is meant for an exploration of (pre-trained) diffusion models.
Part 1: Sampling Loops
1.1 Implementing the Forward Pass
A key part of diffusion is the forward process, which takes a clean image and adds noise to it. Here, we implement the forward process defined by:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t ;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t) I\right)$$

which is equivalent to computing

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, I).$$

That is, given a clean image $x_0$, we get a noisy image $x_t$ at timestep $t$ by sampling from a Gaussian with mean $\sqrt{\bar\alpha_t}\, x_0$ and standard deviation $\sqrt{1 - \bar\alpha_t}$.
Berkeley Campanile
Noisy Campanile (t=250)
Noisy Campanile (t=500)
Noisy Campanile (t=750)
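The forward step can be sketched in a few lines of NumPy. Note the linear beta schedule below is a hypothetical stand-in; the actual project uses the schedule that ships with the pretrained model.

```python
import numpy as np

# Hypothetical linear beta schedule with 1000 timesteps; the real project
# uses the pretrained model's own schedule.
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)  # alpha_bar_t

def forward(x0, t, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) I)."""
    eps = rng.standard_normal(x0.shape)   # eps ~ N(0, I)
    a_bar = alphas_cumprod[t]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
```

Since $\bar\alpha_t$ shrinks as $t$ grows, larger timesteps yield noisier images, matching the progression in the images above.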
1.2 Classical Denoising
Here, we implement Gaussian blur filtering to try to remove the noise:
Noisy Campanile (t=250)
Noisy Campanile (t=500)
Noisy Campanile (t=750)
Gaussian Blur Denoising (t=250)
Gaussian Blur Denoising (t=500)
Gaussian Blur Denoising (t=750)
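A minimal separable Gaussian blur can be written directly in NumPy (the kernel radius and sigma below are illustrative choices, not the exact values used in the project):

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    """Normalized 1-D Gaussian kernel of length 2*radius + 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(img, sigma=1.0):
    """Separable blur: convolve each row, then each column."""
    k = gaussian_kernel1d(sigma, radius=int(3 * sigma))
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    out = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, out)
    return out
```

Blurring averages neighboring pixels, so it suppresses the high-frequency noise but also destroys image detail, which is why the results above look smeared rather than restored.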
1.3 One-Step Denoising
It is apparent that the denoising process from the previous part wasn't ideal. Now, we will use a pretrained UNet to denoise (or, more specifically, to estimate the noise and remove it, recovering something very close to the original image). Below are the one-step denoised images next to their noisy versions.
Noisy Campanile (t=250)
Noisy Campanile (t=500)
Noisy Campanile (t=750)
One-Step Denoised (t=250)
One-Step Denoised (t=500)
One-Step Denoised (t=750)
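Given a noise estimate, one-step denoising is just the forward equation solved for $x_0$. A sketch, again using a hypothetical schedule (in the project, the noise estimate comes from the pretrained UNet):

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)       # hypothetical schedule
alphas_cumprod = np.cumprod(1.0 - betas)

def one_step_denoise(x_t, t, eps_hat):
    """Invert x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps for x0,
    using the model's noise estimate eps_hat."""
    a_bar = alphas_cumprod[t]
    return (x_t - np.sqrt(1.0 - a_bar) * eps_hat) / np.sqrt(a_bar)
```

If the noise estimate were exact, this would recover $x_0$ perfectly; in practice the estimate is imperfect, which is why quality degrades at larger $t$.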
1.4 Iterative Denoising
From the previous part, we see that the one-step denoising results are much better than those from Gaussian blurring. However, we can still see a slight deviation from the actual image, especially as we add more noise. To improve, we can implement iterative denoising: we start with noise at timestep $T$ and denoise step by step until we reach the clean image $x_0$. To increase efficiency and reduce cost, we can actually skip a few steps. That is, we have a list of timesteps strided_timesteps, where strided_timesteps[0] corresponds to the noisiest image and strided_timesteps[-1] corresponds to a clean image. The elements of the list are thus strictly decreasing timesteps:
Denoising step 10
Denoising step 15
Denoising step 20
Denoising step 25
Denoising step 30
Original
Iteratively Denoised
One-Step Denoised
Gaussian Blurred
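The iterative loop can be sketched as follows. Here predict_noise stands in for the pretrained UNet, the schedule and stride of 30 are illustrative, and the added-variance term is omitted for brevity:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)       # hypothetical schedule
alphas_cumprod = np.cumprod(1.0 - betas)

# strided_timesteps[0] is the noisiest timestep; strided_timesteps[-1] is 0
strided_timesteps = list(range(990, -1, -30))

def iterative_denoise(x, predict_noise):
    """Walk down strided_timesteps, blending the current clean-image
    estimate with the noisy image at each stride."""
    for t, t_prev in zip(strided_timesteps[:-1], strided_timesteps[1:]):
        a_bar, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha = a_bar / a_bar_prev            # effective alpha for this stride
        beta = 1.0 - alpha
        eps = predict_noise(x, t)
        # one-step estimate of the clean image from the current noisy image
        x0_hat = (x - np.sqrt(1 - a_bar) * eps) / np.sqrt(a_bar)
        # move partway toward the clean estimate
        x = (np.sqrt(a_bar_prev) * beta / (1 - a_bar)) * x0_hat \
            + (np.sqrt(alpha) * (1 - a_bar_prev) / (1 - a_bar)) * x
    return x
```

Each iteration only has to remove a small amount of noise, which is why the final result tracks the original image much more closely than a single jump from $t = T$ to $t = 0$.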
1.5 Diffusion Model Sampling
We can also use the iterative_denoise function to generate images from scratch. We do this by setting i_start=0 and passing in random noise. This effectively denoises pure noise and generates random images:
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
1.6 Classifier-Free Guidance (CFG)
We can see from the previous part that the generated images are not ideal, in the sense that they don't look quite real. One way to mitigate this issue is to use Classifier-Free Guidance, where we compute both a conditional and an unconditional noise estimate. This is defined as:

$$\epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u)$$

where $\gamma$ controls the strength of CFG, and $\epsilon_u$ and $\epsilon_c$ denote the unconditional and conditional noise estimates respectively. The high-level idea is that we use the difference between the conditional and unconditional estimates as the "guide" to push our noise estimate further toward the condition. In practice, we see that it does indeed produce better results (more realistic images):
Sample 1 with CFG
Sample 2 with CFG
Sample 3 with CFG
Sample 4 with CFG
Sample 5 with CFG
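The guidance formula itself is a one-liner; the guidance scale of 7 below is a hypothetical choice:

```python
import numpy as np

def cfg_estimate(eps_uncond, eps_cond, gamma=7.0):
    """Classifier-free guidance: extrapolate past the conditional estimate
    along the direction (eps_cond - eps_uncond).
    gamma = 1 recovers the plain conditional estimate; gamma > 1 amplifies
    the effect of the condition."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```

This estimate replaces the raw UNet output inside the iterative denoising loop, so it costs two forward passes per step (one conditional, one unconditional).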
1.7 Image-to-image Translation
In part 1.4, we took a real image, added noise to it, then denoised it. Here, we take the original test image, noise it a little, and force it back onto the natural image manifold without any conditioning. The expectation is that we will get a novel image that is similar to the original. This follows the SDEdit algorithm.
SDEdit with i_start=1
SDEdit with i_start=3
SDEdit with i_start=5
SDEdit with i_start=7
SDEdit with i_start=10
SDEdit with i_start=20
Original
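SDEdit reuses the pieces we already have: noise the input to an intermediate timestep, then run the usual iterative denoiser from that point. A sketch, where denoise_from stands in for the part-1.4 loop and the schedule is again hypothetical:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)       # hypothetical schedule
alphas_cumprod = np.cumprod(1.0 - betas)
strided_timesteps = list(range(990, -1, -30))

def sdedit(x0, i_start, denoise_from, rng):
    """Noise x0 up to strided_timesteps[i_start], then hand the result to
    the iterative denoiser (denoise_from) starting at that index."""
    t = strided_timesteps[i_start]
    a_bar = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return denoise_from(x_t, i_start)
```

A larger i_start means a smaller starting noise level, so the output stays closer to the original image, matching the progression in the grid above.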
1.7.1 Editing Hand-Drawn and Web Images
This procedure works particularly well if we start with a nonrealistic image (e.g., a painting, a sketch, or some scribbles) and project it onto the natural image manifold:
Avocado with i_start=1
Avocado with i_start=3
Avocado with i_start=5
Avocado with i_start=7
Avocado with i_start=10
Avocado with i_start=20
Avocado
Scribble with i_start=1
Scribble with i_start=3
Scribble with i_start=5
Scribble with i_start=7
Scribble with i_start=10
Scribble with i_start=20
Scribble
Part B
Part 1: Training a Single-Step Denoising UNet
1.1 Implementing the UNet
Here, we aim to implement the building blocks of a standard UNet model. For now, the goal is to establish an unconditional UNet:
- Simple Operations:
  - These are the basic building blocks that perform individual operations, such as convolutions, downsampling, upsampling, flattening, and reshaping (unflattening).
  - Each of these operations contributes to transforming the image tensor through the UNet's encoder and decoder paths.
- Composed Operations:
  - These operations combine the simple ones to create larger functional blocks, such as downsampling and upsampling blocks with convolutional layers.
  - The composed operations allow us to construct the UNet architecture by connecting multiple simple operations, forming the UNet's contracting and expanding paths, and making use of skip connections.
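To make the spatial operations concrete, here is a minimal NumPy sketch of two of the simple ops (the actual project implements these as learned convolutional layers; average pooling and nearest-neighbor upsampling are simplifications):

```python
import numpy as np

def downsample(x):
    """2x2 average-pool downsample of a single-channel image."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

def upsample(x):
    """2x nearest-neighbor upsample, the counterpart used on the decoder path."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)
```

In the UNet, each downsample halves spatial resolution along the contracting path, and each upsample restores it along the expanding path, where the skip connections re-inject the matching encoder features.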
1.2 Using the UNet to Train a Denoiser
With these foundations, we can train an unconditional UNet to denoise images.
1.2.1 Training
Using σ=0.5, we train the denoiser on the MNIST dataset with batch size 256, 5 epochs, and hidden dimension 128. For the Adam optimizer, we use a learning rate of 1e-4.
Visualizing a few examples, we see that the results are reasonably good.
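The data side of one training iteration is simple to sketch (the UNet itself is omitted; helper names here are illustrative):

```python
import numpy as np

def make_training_pair(x, sigma=0.5, rng=None):
    """Build one (noisy, clean) pair for the denoiser: z = x + sigma * eps."""
    rng = np.random.default_rng(0) if rng is None else rng
    z = x + sigma * rng.standard_normal(x.shape)
    return z, x

def l2_loss(pred, target):
    """Training objective: mean squared error between D(z) and x."""
    return np.mean((pred - target) ** 2)
```

Because noise is sampled fresh on every iteration, the model sees a different corruption of each digit every epoch rather than a fixed noisy dataset.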
1.2.2 Out-of-Distribution Testing
Since our denoiser was trained on σ=0.5, the model will likely not perform as well on different σ values, especially those greater than the one we trained on.
Part 2: Training a Diffusion Model
In this part, we essentially train the UNet as a Denoising Diffusion Probabilistic Model (DDPM), where the goal is to iteratively predict the noise added to images over a series of timesteps.
2.1 Adding Time Conditioning to UNet
Here, we expand on our original unconditional UNet by injecting the scalar timestep t into the model to condition it. To do so, we have to implement a new building block: FCBlock(...).
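A sketch of what FCBlock might look like, assuming a Linear → GELU → Linear structure (the tanh GELU approximation and the weight shapes below are assumptions for illustration):

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def fc_block(t, W1, b1, W2, b2):
    """Map the normalized scalar timestep t in [0, 1] to an embedding:
    Linear -> GELU -> Linear."""
    h = gelu(t * W1 + b1)   # t is a scalar; broadcasts to the hidden width
    return h @ W2 + b2      # embedding of width W2.shape[1]
```

The resulting embedding is then typically added to (or multiplied into) intermediate UNet feature maps so every layer can adapt its behavior to the current noise level.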
2.2 Training the UNet
Following the given algorithm, we establish the training loop and arrive at the following results:
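The core of that training loop is one repeated step: noise a clean image to a random timestep and regress the model's noise estimate onto the true noise. In this sketch, predict_noise stands in for the time-conditioned UNet and the 300-step linear schedule is an assumption:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 300)        # hypothetical schedule
alphas_cumprod = np.cumprod(1.0 - betas)

def training_step(x0, predict_noise, rng):
    """One step of the DDPM objective: MSE between predicted and true noise."""
    t = rng.integers(0, len(betas))
    eps = rng.standard_normal(x0.shape)
    a_bar = alphas_cumprod[t]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1 - a_bar) * eps
    eps_hat = predict_noise(x_t, t / len(betas))   # t normalized to [0, 1]
    return np.mean((eps_hat - eps) ** 2)
```

Sampling t uniformly each step ensures the model learns to denoise at every noise level, not just the extremes.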
2.3 Sampling from the UNet
Additionally, we implemented DDPM (Denoising Diffusion Probabilistic Model) sampling to help visualize some denoised samples.
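DDPM ancestral sampling starts from pure noise and steps backward through every timestep. A sketch under the same hypothetical 300-step schedule, with predict_noise standing in for the trained UNet:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 300)        # hypothetical schedule
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas)

def ddpm_sample(shape, predict_noise, rng):
    """Ancestral sampling: x_T ~ N(0, I), then step t = T-1 ... 0."""
    x = rng.standard_normal(shape)
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = predict_noise(x, t / len(betas))   # normalized timestep
        z = rng.standard_normal(shape) if t > 0 else 0.0  # no noise at t = 0
        x = (x - betas[t] / np.sqrt(1 - alphas_cumprod[t]) * eps_hat) \
            / np.sqrt(alphas[t]) + np.sqrt(betas[t]) * z
    return x
```

The fresh noise z at every step (except the last) is what makes repeated calls produce different samples.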
2.4 Adding Class-Conditioning to UNet
To generate better results and give us more control over the output, we can also optionally condition our UNet on the digit class 0-9. To do so, we need to add two more FCBlocks to our previous implementation. For the class-conditioning vector, we use a one-hot encoding for simplicity. We also set p_uncond = 0.1, dropping the class conditioning 10% of the time, so the model can still operate when no class is provided.
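The encoding-plus-dropout step can be sketched as follows (the function name is illustrative):

```python
import numpy as np

def class_conditioning(labels, num_classes=10, p_uncond=0.1, rng=None):
    """One-hot encode digit labels, then zero out the conditioning vector
    for ~10% of the batch so the model also learns the unconditional case."""
    rng = np.random.default_rng(0) if rng is None else rng
    c = np.eye(num_classes)[labels]             # (B, 10) one-hot rows
    drop = rng.random(len(labels)) < p_uncond   # mask drawn per example
    c[drop] = 0.0                               # dropped rows become all-zero
    return c
```

Training with occasional all-zero conditioning is what makes classifier-free guidance possible at sampling time, since the same network can produce both conditional and unconditional noise estimates.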
2.5 Sampling from the Class-Conditioned UNet
Applying a similar technique as in 2.3, we can visualize some of these samples: