Project 5: Fun with Diffusion Models

Part A

This part of the project explores pre-trained diffusion models.

Part 1: Sampling Loops

1.1 Implementing the Forward Pass

A key part of diffusion is the forward process, which takes a clean image and adds noise to it. Here, we implement the forward process defined by:

\( q(x_t|x_0) = N(x_t;\sqrt{\bar{\alpha}_t}x_0,(1-\bar{\alpha}_t)\textbf{I}) \)
which is equivalent to computing
\( x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon \), where \( \epsilon \sim N(0,\textbf{I})\)
That is, given a clean image \(x_0\), we get a noisy image \(x_t\) at timestep \(t\) by sampling from a Gaussian with mean \(\sqrt{\bar{\alpha}_t}x_0\) and standard deviation \(\sqrt{1-\bar{\alpha}_t}\).
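A minimal sketch of this forward process in PyTorch, assuming 'alphas_cumprod' holds \(\bar{\alpha}_t\) for each timestep (the names here are illustrative):

    import torch

    def forward(x0, t, alphas_cumprod):
        # Sample x_t ~ q(x_t | x_0): scale the clean image and add Gaussian noise.
        a_bar = alphas_cumprod[t]
        eps = torch.randn_like(x0)  # epsilon ~ N(0, I)
        return torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * eps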

Image 1
Berkeley Campanile
Image 2
Noisy Campanile (t=250)
Image 3
Noisy Campanile (t=500)
Image 4
Noisy Campanile (t=750)

1.2 Classical Denoising

Here, we implement Gaussian blur filtering to try to remove the noise (a sketch of this baseline follows the figures):

Image 1
Noisy Campanile (t=250)
Image 2
Noisy Campanile (t=500)
Image 3
Noisy Campanile (t=750)
Image 4
Gaussian Blur Denoising (t=250)
Image 5
Gaussian Blur Denoising (t=500)
Image 6
Gaussian Blur Denoising (t=750)
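For reference, the classical baseline above can be sketched with torchvision's Gaussian blur; the kernel size and sigma are illustrative choices, and 'noisy' is assumed to be an image tensor.

    from torchvision.transforms.functional import gaussian_blur

    # Low-pass filtering suppresses some noise but also destroys
    # high-frequency detail, as the figures above show.
    denoised = gaussian_blur(noisy, kernel_size=5, sigma=2.0)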

1.3 One-Step Denoising

It is apparent that the denoising results from the previous part are not ideal. Now, we use a pretrained UNet to denoise, or more specifically, to estimate the noise so we can remove it and recover something very close to the original image.
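Concretely, given the UNet's noise estimate \(\epsilon_\theta(x_t, t)\), we can recover a clean-image estimate by inverting the forward-process equation from part 1.1:

\( \hat{x}_0 = \left(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)\right) / \sqrt{\bar{\alpha}_t} \)

Below are the one-step denoised images alongside their noisy versions.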

Image 1
Noisy Campanile (t=250)
Image 2
Noisy Campanile (t=500)
Image 3
Noisy Campanile (t=750)
Image 4
One-Step Denoised (t=250)
Image 5
One-Step Denoised (t=500)
Image 6
One-Step Denoised (t=750)

1.4 Iterative Denoising

We see from the previous part that the one-step denoising results are much better than those from Gaussian blur denoising. However, we can still see a slight deviation from the actual image, especially as we add more noise. To improve, we can implement iterative denoising: say we start with noise \(x_{1000}\) at timestep \(T=1000\) and denoise step by step until we reach \(x_0\). To increase efficiency and reduce cost, we can actually skip steps. That is, we keep a list of timesteps 'strided_timesteps', where 'strided_timesteps[0]' corresponds to the noisiest image and 'strided_timesteps[-1]' corresponds to a clean image; each element of the list thus corresponds to a decreasing timestep. A sketch of one strided update follows, along with the intermediate results and a comparison against the earlier methods:
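This sketch writes the update in terms of the clean-image estimate from part 1.3 and the standard DDPM posterior mean; the names ('alphas_cumprod', etc.) are illustrative, and the optional added-noise term is omitted for brevity.

    import torch

    def iterative_denoise_step(x_t, t, t_prev, eps, alphas_cumprod):
        # One strided update from timestep t to the earlier timestep t_prev,
        # where eps is the UNet's noise estimate at (x_t, t).
        a_bar_t = alphas_cumprod[t]
        a_bar_prev = alphas_cumprod[t_prev]
        alpha_t = a_bar_t / a_bar_prev  # effective alpha over the strided step
        beta_t = 1 - alpha_t
        # Clean-image estimate, as in part 1.3.
        x0_hat = (x_t - torch.sqrt(1 - a_bar_t) * eps) / torch.sqrt(a_bar_t)
        # DDPM posterior mean: a weighted blend of x0_hat and x_t.
        return (torch.sqrt(a_bar_prev) * beta_t / (1 - a_bar_t)) * x0_hat \
            + (torch.sqrt(alpha_t) * (1 - a_bar_prev) / (1 - a_bar_t)) * x_t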

Image 1
Denoising step 10
Image 2
Denoising step 15
Image 3
Denoising step 20
Image 4
Denoising step 25
Image 5
Denoising step 30
Image 1
Original
Image 2
Iteratively Denoised
Image 3
One-Step Denoised
Image 4
Gaussian Blurred

1.5 Diffusion Model Sampling

We can use the 'iterative_denoise' function to generate images from scratch by setting 'i_start=0' and passing in random noise. This effectively denoises pure noise into random images, as in the sketch below:
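A minimal sketch, assuming an 'iterative_denoise' helper that wraps the update step above (its signature here is illustrative):

    import torch

    # Denoise pure noise into an image: i_start=0 begins at the noisiest timestep.
    x = torch.randn(1, 3, 64, 64)              # the image shape is illustrative
    sample = iterative_denoise(x, i_start=0)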

Image 1
Sample 1
Image 2
Sample 2
Image 3
Sample 3
Image 4
Sample 4
Image 5
Sample 5

1.6 Classifier-Free Guidance (CFG)

We can see from the previous part that the generated images are not ideal, in the sense that they don't look quite real. One way to mitigate this issue is to use Classifier-Free Guidance (CFG), where we compute both a conditional and an unconditional noise estimate. The combined estimate is defined as:

\( \epsilon = \epsilon_u + \gamma(\epsilon_c-\epsilon_u) \)
where \(\gamma\) controls the strength of CFG, and \(\epsilon_u\) and \(\epsilon_c\) denote the unconditional and conditional noise estimates, respectively. The high-level idea is that we use the difference between the conditional and unconditional estimates as the "guide" to improve our noise estimate. In practice, we see that it does indeed produce better results (more realistic images), as shown below:
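Once both estimates are available, the guided estimate is a one-liner; the default \(\gamma\) here is illustrative (values above 1 strengthen the guidance):

    def cfg_noise_estimate(eps_uncond, eps_cond, gamma=7.0):
        # Extrapolate past the unconditional estimate in the conditional direction.
        return eps_uncond + gamma * (eps_cond - eps_uncond)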

Image 1
Sample 1 with CFG
Image 2
Sample 2 with CFG
Image 3
Sample 3 with CFG
Image 4
Sample 4 with CFG
Image 5
Sample 5 with CFG

1.7 Image-to-image Translation

In part 1.4, we took a real image, added noise to it, and then denoised it. Here, we take the original test image, add a little noise, and force it back onto the natural image manifold without any conditioning. The expectation is that we will get a novel image that is similar to the original. This follows the SDEdit algorithm; a sketch is shown below, followed by results at varying noise levels.
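A minimal sketch, reusing the 'forward' and 'iterative_denoise' helpers from earlier parts (the indexing convention is illustrative); smaller 'i_start' means more noise and therefore a larger edit.

    # SDEdit: noise the image to the timestep at index i_start, then denoise.
    t = strided_timesteps[i_start]
    x_noisy = forward(x_orig, t, alphas_cumprod)
    edited = iterative_denoise(x_noisy, i_start=i_start)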

Image 1
SDEdit with i_start=1
Image 2
SDEdit with i_start=3
Image 3
SDEdit with i_start=5
Image 4
SDEdit with i_start=7
Image 5
SDEdit with i_start=10
Image 6
SDEdit with i_start=20
Image 7
Original

1.7.1 Editing Hand-Drawn and Web Images

This procedure works particularly well if we start with a nonrealistic image (e.g. a painting, a sketch, or some scribbles) and project it onto the natural image manifold:

Image 1
Avocado with i_start=1
Image 2
Avocado with i_start=3
Image 3
Avocado with i_start=5
Image 4
Avocado with i_start=7
Image 5
Avocado with i_start=10
Image 6
Avocado with i_start=20
Image 7
Avocado
Image 1
Scribble with i_start=1
Image 2
Scribble with i_start=3
Image 3
Scribble with i_start=5
Image 4
Scribble with i_start=7
Image 5
Scribble with i_start=10
Image 6
Scribble with i_start=20
Image 7
Scribble

Part B

Part 1: Training a Single-Step Denoising UNet

1.1 Implementing the UNet

1.2 Using the UNet to Train a Denoiser

With these foundations, we can train an unconditional UNet to denoise images.

Figure 3

1.2.1 Training

Using σ=0.5, we train the denoiser on the MNIST dataset with batch size 256 for 5 epochs, using a hidden dimension of 128. For the Adam optimizer, we use a learning rate of 1e-4.
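A minimal sketch of this training loop, assuming 'unet' and a standard MNIST 'train_loader' are defined elsewhere:

    import torch
    import torch.nn.functional as F

    sigma = 0.5
    optimizer = torch.optim.Adam(unet.parameters(), lr=1e-4)

    for epoch in range(5):
        for x, _ in train_loader:                 # batches of 256 MNIST images
            z = x + sigma * torch.randn_like(x)   # add Gaussian noise
            loss = F.mse_loss(unet(z), x)         # L2 loss against the clean image
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()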

Figure 4

Visualizing a few examples, we see that the results are reasonably good.

Figure 5
Epoch 1
Figure 6
Epoch 5

1.2.2 Out-of-Distribution Testing

Since our denoiser was trained on σ=0.5, the model will likely not perform as well on other σ values, especially those greater than the one it was trained on.
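To visualize this, we can denoise the same test image at a range of noise levels (the σ values here are illustrative):

    import torch

    for sigma in [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]:
        z = x + sigma * torch.randn_like(x)   # values above 0.5 are out of distribution
        with torch.no_grad():
            out = unet(z)                     # visualize out at each noise level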

Figure 7

Part 2: Training a Diffusion Model

In this part, we essentially train the UNet as a Denoising Diffusion Probabilistic Model (DDPM), where the goal is to iteratively predict the noise added to images over a series of timesteps.

2.1 Adding Time Conditioning to UNet

Here, we expand on our original unconditional UNet by injecting the scalar timestep t into the model and conditioning on it. To do so, we have to implement a new building block, FCBlock(...), sketched below.
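A minimal sketch of such a block; the exact layer choices (Linear, GELU, Linear) are an assumption of this sketch.

    import torch.nn as nn

    class FCBlock(nn.Module):
        # Embeds a scalar conditioning signal (e.g. the normalized timestep t)
        # so it can be injected into intermediate UNet feature maps.
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_ch, out_ch),
                nn.GELU(),
                nn.Linear(out_ch, out_ch),
            )

        def forward(self, t):
            return self.net(t)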

2.2 Training the UNet

Following the given algorithm, we establish the training loop and arrive at the following results:

Figure 10
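For reference, here is a sketch of one training step under the usual DDPM noise-prediction objective; 'alphas_cumprod', 'T', and the choice to normalize t before feeding it to the UNet are assumptions of this sketch, and 'x' and 'unet' are assumed defined elsewhere.

    import torch
    import torch.nn.functional as F

    t = torch.randint(0, T, (x.shape[0],))                      # random timesteps
    eps = torch.randn_like(x)                                   # noise to predict
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(a_bar) * x + torch.sqrt(1 - a_bar) * eps   # forward process
    loss = F.mse_loss(unet(x_t, t.float() / T), eps)            # noise-prediction loss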

2.3 Sampling from the UNet

Additionally, we implement DDPM sampling to visualize some denoised samples.

Epoch 5
Epoch 20

2.4 Adding Class-Conditioning to UNet

To generate better results and give ourselves more control over the output, we can also optionally condition our UNet on the digit class 0-9. To do so, we add two more FCBlocks to our previous implementation. For the class-conditioning vector, we use a one-hot encoding for simplicity. We also set p_uncond = 0.1, dropping the class conditioning 10% of the time so that the model still learns to operate when no class is provided.
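A sketch of the conditioning-dropout step; the masking convention (zeroing the one-hot vector to signal "unconditional") follows the usual classifier-free guidance recipe, and the names are illustrative.

    import torch
    import torch.nn.functional as F

    c = F.one_hot(labels, num_classes=10).float()        # one-hot class vector
    keep = (torch.rand(c.shape[0], 1) > 0.1).float()     # drop conditioning 10% of the time
    c = c * keep                                         # zero vector = unconditional
    loss = F.mse_loss(unet(x_t, t.float() / T, c), eps)  # same noise-prediction loss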

Figure 11

2.5 Sampling from the Class-Conditioned UNet

Applying a similar technique as in 2.3, we can visualize some of these samples:

Epoch 5
Epoch 20