Project 5: Fun with Diffusion Models

Part A

This part of the project explores pre-trained diffusion models.

Part 1: Sampling Loops

1.1 Implementing the Forward Pass

A key part of diffusion is the forward process, which takes a clean image and adds noise to it. Here, we implement the forward process defined by:

\( q(x_t|x_0) = N(x_t;\sqrt{\bar{\alpha}_t}x_0,(1-\bar{\alpha}_t)\textbf{I}) \)
which is equivalent to computing
\( x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon \), where \( \epsilon \sim N(0,\textbf{I})\)
That is, given a clean image \(x_0\), we get a noisy image \(x_t\) at timestep \(t\) by sampling from a Gaussian with mean \(\sqrt{\bar{\alpha}_t}x_0\) and standard deviation \(\sqrt{1-\bar{\alpha}_t}\).
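A minimal sketch of this forward process in PyTorch, assuming 'alphas_cumprod' holds \(\bar{\alpha}_t\) for each timestep (the names here are illustrative):

    import torch

    def forward(x0, t, alphas_cumprod):
        # Sample x_t ~ q(x_t | x_0): scale the clean image and add Gaussian noise.
        a_bar = alphas_cumprod[t]
        eps = torch.randn_like(x0)  # epsilon ~ N(0, I)
        return torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * eps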

Image 1
Berkeley Campanile
Image 2
Noisy Campanile (t=250)
Image 3
Noisy Campanile (t=500)
Image 4
Noisy Campanile (t=750)

1.2 Classical Denoising

Here, we implement Gaussian blur filtering to try to remove the noise (a sketch of this baseline follows the figures):

Image 1
Noisy Campanile (t=250)
Image 2
Noisy Campanile (t=500)
Image 3
Noisy Campanile (t=750)
Image 4
Gaussian Blur Denoising (t=250)
Image 5
Gaussian Blur Denoising (t=500)
Image 6
Gaussian Blur Denoising (t=750)
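For reference, the classical baseline above can be sketched with torchvision's Gaussian blur; the kernel size and sigma are illustrative choices, and 'noisy' is assumed to be an image tensor.

    from torchvision.transforms.functional import gaussian_blur

    # Low-pass filtering suppresses some noise but also destroys
    # high-frequency detail, as the figures above show.
    denoised = gaussian_blur(noisy, kernel_size=5, sigma=2.0)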

1.3 One-Step Denoising

It is apparent that the denoising results from the previous part are not ideal. Now, we use a pretrained UNet to denoise, or more specifically, to estimate the noise so we can remove it and recover something very close to the original image.
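Concretely, given the UNet's noise estimate \(\epsilon_\theta(x_t, t)\), we can recover a clean-image estimate by inverting the forward-process equation from part 1.1:

\( \hat{x}_0 = \left(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)\right) / \sqrt{\bar{\alpha}_t} \)

Below are the one-step denoised images alongside their noisy versions.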

Image 1
Noisy Campanile (t=250)
Image 2
Noisy Campanile (t=500)
Image 3
Noisy Campanile (t=750)
Image 4
One-Step Denoised (t=250)
Image 5
One-Step Denoised (t=500)
Image 6
One-Step Denoised (t=750)

1.4 Iterative Denoising

We see from the previous part that the one-step denoising results are much better than those from Gaussian blur denoising. However, we can still see a slight deviation from the actual image, especially as we add more noise. To improve, we can implement iterative denoising: say we start with noise \(x_{1000}\) at timestep \(T=1000\) and denoise step by step until we reach \(x_0\). To increase efficiency and reduce cost, we can actually skip steps. That is, we keep a list of timesteps 'strided_timesteps', where 'strided_timesteps[0]' corresponds to the noisiest image and 'strided_timesteps[-1]' corresponds to a clean image; each element of the list thus corresponds to a decreasing timestep. A sketch of one strided update follows, along with the intermediate results and a comparison against the earlier methods:
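This sketch writes the update in terms of the clean-image estimate from part 1.3 and the standard DDPM posterior mean; the names ('alphas_cumprod', etc.) are illustrative, and the optional added-noise term is omitted for brevity.

    import torch

    def iterative_denoise_step(x_t, t, t_prev, eps, alphas_cumprod):
        # One strided update from timestep t to the earlier timestep t_prev,
        # where eps is the UNet's noise estimate at (x_t, t).
        a_bar_t = alphas_cumprod[t]
        a_bar_prev = alphas_cumprod[t_prev]
        alpha_t = a_bar_t / a_bar_prev  # effective alpha over the strided step
        beta_t = 1 - alpha_t
        # Clean-image estimate, as in part 1.3.
        x0_hat = (x_t - torch.sqrt(1 - a_bar_t) * eps) / torch.sqrt(a_bar_t)
        # DDPM posterior mean: a weighted blend of x0_hat and x_t.
        return (torch.sqrt(a_bar_prev) * beta_t / (1 - a_bar_t)) * x0_hat \
            + (torch.sqrt(alpha_t) * (1 - a_bar_prev) / (1 - a_bar_t)) * x_t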

Image 1
Denoising step 10
Image 2
Denoising step 15
Image 3
Denoising step 20
Image 4
Denoising step 25
Image 5
Denoising step 30
Image 1
Original
Image 2
Iteratively Denoised
Image 3
One-Step Denoised
Image 4
Gaussian Blurred

1.5 Diffusion Model Sampling

We can use the 'iterative_denoise' function to generate images from scratch by setting 'i_start=0' and passing in random noise. This effectively denoises pure noise into random images, as in the sketch below:
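A minimal sketch, assuming an 'iterative_denoise' helper that wraps the update step above (its signature here is illustrative):

    import torch

    # Denoise pure noise into an image: i_start=0 begins at the noisiest timestep.
    x = torch.randn(1, 3, 64, 64)              # the image shape is illustrative
    sample = iterative_denoise(x, i_start=0)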

Image 1
Sample 1
Image 2
Sample 2
Image 3
Sample 3
Image 4
Sample 4
Image 5
Sample 5

1.6 Classifier-Free Guidance (CFG)

We can see from the previous part that the generated images are not ideal, in the sense that they don't look quite real. One way to mitigate this issue is to use Classifier-Free Guidance (CFG), where we compute both a conditional and an unconditional noise estimate. The combined estimate is defined as:

\( \epsilon = \epsilon_u + \gamma(\epsilon_c-\epsilon_u) \)
where \(\gamma\) controls the strength of CFG, and \(\epsilon_u\) and \(\epsilon_c\) denote the unconditional and conditional noise estimates, respectively. The high-level idea is that we use the difference between the conditional and unconditional estimates as the "guide" to improve our noise estimate. In practice, we see that it does indeed produce better results (more realistic images), as shown below:
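Once both estimates are available, the guided estimate is a one-liner; the default \(\gamma\) here is illustrative (values above 1 strengthen the guidance):

    def cfg_noise_estimate(eps_uncond, eps_cond, gamma=7.0):
        # Extrapolate past the unconditional estimate in the conditional direction.
        return eps_uncond + gamma * (eps_cond - eps_uncond)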

Image 1
Sample 1 with CFG
Image 2
Sample 2 with CFG
Image 3
Sample 3 with CFG
Image 4
Sample 4 with CFG
Image 5
Sample 5 with CFG

1.7 Image-to-image Translation

In part 1.4, we took a real image, added noise to it, and then denoised it. Here, we take the original test image, add a little noise, and force it back onto the natural image manifold without any conditioning. The expectation is that we will get a novel image that is similar to the original. This follows the SDEdit algorithm; a sketch is shown below, followed by results at varying noise levels.
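A minimal sketch, reusing the 'forward' and 'iterative_denoise' helpers from earlier parts (the indexing convention is illustrative); smaller 'i_start' means more noise and therefore a larger edit.

    # SDEdit: noise the image to the timestep at index i_start, then denoise.
    t = strided_timesteps[i_start]
    x_noisy = forward(x_orig, t, alphas_cumprod)
    edited = iterative_denoise(x_noisy, i_start=i_start)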

Image 1
SDEdit with i_start=1
Image 2
SDEdit with i_start=3
Image 3
SDEdit with i_start=5
Image 4
SDEdit with i_start=7
Image 5
SDEdit with i_start=10
Image 6
SDEdit with i_start=20
Image 7
Original

1.7.1 Editing Hand-Drawn and Web Images

This procedure works particularly well if we start with a nonrealistic image (e.g. a painting, a sketch, or some scribbles) and project it onto the natural image manifold:

Image 1
Avocado with i_start=1
Image 2
Avocado with i_start=3
Image 3
Avocado with i_start=5
Image 4
Avocado with i_start=7
Image 5
Avocado with i_start=10
Image 6
Avocado with i_start=20
Image 7
Avocado
Image 1
Scribble with i_start=1
Image 2
Scribble with i_start=3
Image 3
Scribble with i_start=5
Image 4
Scribble with i_start=7
Image 5
Scribble with i_start=10
Image 6
Scribble with i_start=20
Image 7
Scribble

Part B

Part 1: Training a Single-Step Denoising UNet

1.1 Implementing the UNet

1.2 Using the UNet to Train a Denoiser

With these foundations, we can train an unconditional UNet to denoise images.

Figure 3

1.2.1 Training

Using σ=0.5, we train the denoiser on the MNIST dataset with batch size 256 for 5 epochs, using a hidden dimension of 128. For the Adam optimizer, we use a learning rate of 1e-4.
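A minimal sketch of this training loop, assuming 'unet' and a standard MNIST 'train_loader' are defined elsewhere:

    import torch
    import torch.nn.functional as F

    sigma = 0.5
    optimizer = torch.optim.Adam(unet.parameters(), lr=1e-4)

    for epoch in range(5):
        for x, _ in train_loader:                 # batches of 256 MNIST images
            z = x + sigma * torch.randn_like(x)   # add Gaussian noise
            loss = F.mse_loss(unet(z), x)         # L2 loss against the clean image
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()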

Figure 4

Visualizing a few examples, we see that the results are reasonably good.

Figure 5
Epoch 1
Figure 6
Epoch 5

1.2.2 Out-of-Distribution Testing

Since our denoiser was trained on σ=0.5, the model will likely not perform as well on other σ values, especially those greater than the one it was trained on.
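To visualize this, we can denoise the same test image at a range of noise levels (the σ values here are illustrative):

    import torch

    for sigma in [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]:
        z = x + sigma * torch.randn_like(x)   # values above 0.5 are out of distribution
        with torch.no_grad():
            out = unet(z)                     # visualize out at each noise level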

Figure 7

Part 2: Training a Diffusion Model

In this part, we essentially train the UNet as a Denoising Diffusion Probabilistic Model (DDPM), where the goal is to iteratively predict the noise added to images over a series of timesteps.

2.1 Adding Time Conditioning to UNet

Here, we expand on our original unconditional UNet by injecting the scalar timestep t into the model and conditioning on it. To do so, we have to implement a new building block, FCBlock(...), sketched below.
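A minimal sketch of such a block; the exact layer choices (Linear, GELU, Linear) are an assumption of this sketch.

    import torch.nn as nn

    class FCBlock(nn.Module):
        # Embeds a scalar conditioning signal (e.g. the normalized timestep t)
        # so it can be injected into intermediate UNet feature maps.
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_ch, out_ch),
                nn.GELU(),
                nn.Linear(out_ch, out_ch),
            )

        def forward(self, t):
            return self.net(t)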

2.2 Training the UNet

Following the given algorithm, we establish the training loop and arrive at the following results:

Figure 10
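For reference, here is a sketch of one training step under the usual DDPM noise-prediction objective; 'alphas_cumprod', 'T', and the choice to normalize t before feeding it to the UNet are assumptions of this sketch, and 'x' and 'unet' are assumed defined elsewhere.

    import torch
    import torch.nn.functional as F

    t = torch.randint(0, T, (x.shape[0],))                      # random timesteps
    eps = torch.randn_like(x)                                   # noise to predict
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(a_bar) * x + torch.sqrt(1 - a_bar) * eps   # forward process
    loss = F.mse_loss(unet(x_t, t.float() / T), eps)            # noise-prediction loss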

2.3 Sampling from the UNet

Additionally, we implement DDPM sampling to visualize some denoised samples.

Epoch 5
Epoch 20

2.4 Adding Class-Conditioning to UNet

To generate better results and give ourselves more control over the output, we can also optionally condition our UNet on the digit class 0-9. To do so, we add two more FCBlocks to our previous implementation. For the class-conditioning vector, we use a one-hot encoding for simplicity. We also set p_uncond = 0.1, dropping the class conditioning 10% of the time so that the model still learns to operate when no class is provided.
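A sketch of the conditioning-dropout step; the masking convention (zeroing the one-hot vector to signal "unconditional") follows the usual classifier-free guidance recipe, and the names are illustrative.

    import torch
    import torch.nn.functional as F

    c = F.one_hot(labels, num_classes=10).float()        # one-hot class vector
    keep = (torch.rand(c.shape[0], 1) > 0.1).float()     # drop conditioning 10% of the time
    c = c * keep                                         # zero vector = unconditional
    loss = F.mse_loss(unet(x_t, t.float() / T, c), eps)  # same noise-prediction loss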

Figure 11

2.5 Sampling from the Class-Conditioned UNet

Applying a similar technique as in 2.3, we can visualize some of these samples:

Epoch 5
Epoch 20