Coloring the Past: Using AI to Colorize Historical Images - Image Colorisation 101
Intro - The Art and Science of Bringing Color to the Past
Remember old Kodak snapshots and Victorian portraits? The first commercial color photos and film only arrived with Kodachrome in the mid-1930s. But what if we could breathe life into those monochrome memories? Welcome to the world of image colorization - a blend of art, science, and cutting-edge technology that's revolutionizing how we view history.
This blog post gives you a rundown of the image colorization space using deep learning.
Why am I doing this?
My final year undergraduate project was about this topic. I think it's only right to share this knowledge in an understandable post instead of leaving it locked behind a wall of academic wording. (Who doesn't like to read 10k PDF full of jargon?!??)
As you can see from my other work, this is not my first image colourization project. You can check out my breakdown of an old Richard Feynman colorization with DeOldify. See here. During my undergrad, my supervisor recommended that I continue working on the topic because of that previous experience.
Whether you're a deep learning enthusiast, a history buff, or simply curious about how those viral "colorized history" posts are created, this post aims to provide you with a comprehensive understanding of image colorization. Strap in, and let's go.
What is Image Colourisation?
Now, what is image colourisation? Well, it's pretty simple: turning greyscale images into colour.
So the next logical question is, what's behind the voodoo magic that allows this to happen?
The power of deep learning; CNNs (convolutional neural networks), to be precise.
CNNs allow our models to "see" what's in the image.
Now, how does image colourisation turn those pixels into colour?
By comparing black-and-white images with the colors and features it has already seen before, the model can start to map greyscale pixels onto color. With the help of some smart color engineering, should I say!
This is the whole basis of image colourisation.
The greyscale image is the input, and the output is the RGB layers. Put as a metaphor: the greyscale image goes in, passes through the layers of the neural network, and a coloured image comes out. Neural networks are great at picking up non-linear patterns, so finding the right RGB combination for each target pixel is a good fit for deep learning.
- Quick aside: a non-linear pattern is simply one where the output does not change in direct proportion to the input. But there is still a relationship.
Decoding Color: Understanding RGB and LAB Color Spaces
But RGB is not the only color space used. LAB is used as well, because it is an absolute color space: a color is defined regardless of the device. And the separation of lightness (brightness) from the color channels makes it more precise when mapping colors.
CIELAB color space - Wikipedia
I've used Claude to help provide an ELI5 explanation:
Imagine you have a big box of crayons. Some crayons are different shades of the same color, like light blue and dark blue. In the RGB color box, these crayons might be mixed up and hard to find. But in the LAB color box, they're organized in a special way:
The L drawer: This has all the light and dark versions of colors. It's like controlling how much sunlight shines on your drawing.
The A drawer: This has crayons going from green to red.
The B drawer: This has crayons going from blue to yellow.
When computer artists want to color a black-and-white picture, the LAB box makes it easier. They can choose how bright or dark to make things without messing up the colors. And they can pick colors that look good together more easily because the crayons are sorted in a way that makes sense to our eyes.
The LAB box also has some magic crayons that can make colors your regular crayon box can't! This lets artists make really pretty and natural-looking colorful pictures from black-and-white ones.
So, while RGB is like a regular crayon box, LAB is like a super-organized, magical crayon box that helps artists color pictures in a way that looks great to our eyes!
Convolutional Neural Networks in Image Colorization
At a high level, a CNN takes an image as input in the form of a matrix of pixels. Then features (lines, textures, shapes) are identified. As the image goes through each layer, the network is able to identify more complex shapes (dogs, cats, legs, etc.). The final layer is used for classification.
For the features to be identified we use filters: small matrices of weights that pass over the image. This is done in a sliding-window manner, starting from the top left and moving through each section of the image one by one.
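To make the sliding window concrete, here is a minimal NumPy sketch of a single filter passing over a toy image. The helper name `convolve2d`, the 5x5 image and the hand-written edge filter are all made up for illustration; in a real CNN the filter weights are learned, and the framework does this far more efficiently.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map.

    Strictly speaking this is cross-correlation, which is what CNN layers compute.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):        # move down the image
        for j in range(out_w):    # move left to right, starting top left
            window = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(window * kernel)
    return out

# Toy 5x5 greyscale image: dark on the left, bright on the right
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

# Hand-written vertical-edge filter (illustrative weights only)
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

feature_map = convolve2d(image, kernel)
print(feature_map)
```

The feature map lights up where the window sits on the dark-to-bright edge and goes to zero over the flat bright region, which is the "identifying lines" part in miniature.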
This is some short Python code we can break down, which converts RGB to LAB.
X = rgb2lab(1.0/255*image)[:,:,0]
Y = rgb2lab(1.0/255*image)[:,:,1:]
We know that RGB has 3 channels. The image is passed into the scikit-image (skimage) rgb2lab function.
Now the shape of the image looks like this [insert image here].
Now we select the greyscale layer by selecting index zero. (The last index is the channel selection; the other two are the pixels themselves.) Calling [:,:,1:] selects channels A and B: green-red and blue-yellow.
Image of RGB image showing the channels in 3D space.
The channels are L, A and B, and the rows and columns are the image dimensions. 3D space, remember.
After converting the color space using the function rgb2lab(), we select the greyscale (Lightness) layer with [:,:,0]. This is typically used as input for the neural network. [:,:,1:] selects the two color layers: A (green–red) and B (blue–yellow).
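Here is a small sketch of that split, plus the reverse trip back to RGB. It assumes scikit-image is installed; the random 4x4 array is only a stand-in for a real photo, and "pretending the network predicted the colour layers perfectly" is obviously the optimistic case.

```python
import numpy as np
from skimage.color import rgb2lab, lab2rgb

# Stand-in RGB image (random pixels); replace with a real photo
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 4, 3)).astype(np.uint8)

# Convert once, then slice the result
lab = rgb2lab(image / 255.0)   # RGB values scaled to [0, 1] first
X = lab[:, :, 0]               # lightness layer, shape (4, 4): network input
Y = lab[:, :, 1:]              # A and B layers, shape (4, 4, 2): network target

# Pretend the network predicted Y perfectly, then rebuild the colour image
recombined = np.concatenate([X[:, :, np.newaxis], Y], axis=2)
rgb_again = lab2rgb(recombined)

print(X.shape, Y.shape, rgb_again.shape)
```

Converting once and slicing twice avoids running the colour conversion two times. A colourisation model only ever has to produce `Y`; the lightness layer is already given by the greyscale photo.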
I'm not the best artist, so the other diagrams and the videos above will be helpful as well.
skimage.color — skimage 0.23.2 documentation (scikit-image.org)
Here's a code snippet that would show how LAB channels are accessed.
import numpy as np
from skimage import color
import matplotlib.pyplot as plt
# Assume 'image' is your RGB image
lab_image = color.rgb2lab(image / 255.0) # Normalize RGB values to [0, 1]
L = lab_image[:,:,0] # Lightness channel (grayscale)
A = lab_image[:,:,1] # A channel (green-red)
B = lab_image[:,:,2] # B channel (blue-yellow)
# Visualize
fig, axes = plt.subplots(2, 2, figsize=(12, 12))
axes[0,0].imshow(image)
axes[0,0].set_title('Original RGB')
axes[0,1].imshow(L, cmap='gray')
axes[0,1].set_title('L channel (Grayscale)')
axes[1,0].imshow(A, cmap='RdYlGn_r')
axes[1,0].set_title('A channel (Green-Red)')
axes[1,1].imshow(B, cmap='YlGnBu_r')
axes[1,1].set_title('B channel (Blue-Yellow)')
plt.tight_layout()
plt.show()
Quick note on video colourisation, which I'll talk about in upcoming blog posts: everything here applies to video too, as videos are simply multiple frames played at a certain speed. Video colorization has its own issues, though, namely flickering and inconsistent colourisation between frames.
TLDR: How do you make sure the colourisation from the 1st frame still applies at the 50th frame? See here, if you're a very eager beaver.
Now that you understand how image colourisation works, we can start describing the various architectures.
The Evolution of Colorization: CNN, User-Guided, and Exemplar-Based Approaches
Based on this paper, we classify 3 types of image colourization: CNN-based, user-guided, and exemplar-based. There are actually more types of image colourization, which you can see in this paper. But for historical imagery, these are the most relevant.
CNN-based image colourisation is the type we just explained above. All successive models are built on top of a CNN.
The computer does need to see the greyscale and color images, right?
The influential papers that started it all began with Deep Colorization, which showed how deep learning can be used for image colourisation. This first generation used CNNs, early GANs and autoencoders. The next generation was real-time user-guided image colourisation, which introduced user input. And then came exemplar-based image colourisation, which introduced reference images to help steer the model. Deep Colorization Paper
Check out the videos of Deep Colorisation below:
Colorful Image Colorization
Real-Time User-Guided Image Colorization with Learned Deep Priors (Aug 2017, SIGGRAPH)
These models are great, as the user input nudges the model in the right direction. As the t-shirt examples further down show, image colorization has a subjective element to it. It can be an art as well as a science. (Which is true of all deep learning, btw.)
User-guided models have the most entertaining examples, like turning stickmen into images and coloring anime (if you're a weeb). These user-guided models tend to use GANs and large pre-trained models like a U-Net.
GANs are used because they generate images, unlike a plain CNN classifier, which only labels them. A pre-trained network can already identify various features (shapes, lines, etc.), so instead of developing a model from scratch we can just focus on colourizing the image.
GANs are out of fashion now, thanks to diffusion models. (No, I won't be explaining them here, sorry. You are already mathed up enough.) If you're still interested, check out this.
Plain Image Colourisation
This section will be on the shorter side, as the intro and the loss functions sections will explain most of the dynamics.
Let's deep dive into the Deep Colorization paper, mentioned above with the video. The architecture is a simple stack of 5 fully connected linear layers with ReLU activations, with the greyscale image taken as input for the network. The output layer has two neurons, for the U and V color channel values.
Feature extraction is done at 3 levels: low-level features, the actual patches of grey values; mid-level DAISY features, a fancy name for general features and shapes; and semantic labeling, hard labels saying this is a tree or a car. The output is then cleaned up with a post-processing technique called joint bilateral filtering, which works by measuring the spatial distance and the intensity difference between pixels.
Colorful Image Colorization is a great paper. The architecture is eight blocks of stacked convolutional layers, with each block containing 2-3 convolutional layers followed by ReLU and batch normalization. Striding is used instead of pooling for downsampling.
The cool thing here is how the paper manipulates the color space of the image: for every pixel it predicts a probability distribution over 313 quantised "ab" pairs. At inference, that distribution is turned back into a single AB pair for each pixel of the output image. Cool stuff, right? This paper starts to deal with the washed-out issue covered in the loss functions section.
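A toy sketch of that last step, going from a distribution over colour bins back to one AB pair. Everything here is illustrative: the 5x5 grid of bin centres stands in for the paper's 313 bins, `decode_ab` and `bin_index` are made-up helper names, and the temperature-sharpened average is just one plausible decoding rule, not a copy of the paper's code.

```python
import numpy as np

# Hypothetical coarse grid of (a, b) bin centres: 25 bins instead of 313
grid = np.arange(-100, 101, 50)
bin_centres = np.array([(a, b) for a in grid for b in grid], dtype=float)

def bin_index(a, b):
    """Index of the bin whose centre is exactly (a, b)."""
    return np.where((bin_centres == np.array([a, b], dtype=float)).all(axis=1))[0][0]

def decode_ab(probs, temperature=1.0):
    """Sharpen the distribution with a temperature, then take its weighted mean."""
    logp = np.log(probs + 1e-12) / temperature
    logp -= logp.max()                 # numerical stability
    sharpened = np.exp(logp)
    sharpened /= sharpened.sum()
    return sharpened @ bin_centres

# One ambiguous pixel: 55% chance of a reddish bin, 45% chance of a bluish bin
probs = np.zeros(len(bin_centres))
probs[bin_index(100, 50)] = 0.55
probs[bin_index(0, -100)] = 0.45

mean_ab = decode_ab(probs, temperature=1.0)    # plain average of the two options
sharp_ab = decode_ab(probs, temperature=0.1)   # leans towards the likelier bin
print(mean_ab, sharp_ab)
```

The plain average lands between the two candidate colours, which is exactly the desaturated "safe" answer. Lowering the temperature pulls the result towards the more likely bin, so the pixel keeps a vivid colour.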
So the main trend here is how color representation changed, from direct U and V prediction to probability distributions over color space. Many objects can have multiple plausible colors. Direct U and V prediction was forced to choose a single color, often resulting in "safe" but unrealistic predictions (like the infamous brown tendency). The other trend is upgrading CNNs with residual blocks, batch normalization and various activation functions, which are now a staple in modern deep learning.
User Guided Models
User-guided and exemplar-based models take feedback from the user, in the form of a pixel hint or a reference image. They are popular within the literature right now.
That's because the model provides more accurate results by getting help from the user rather than just relying on the training images seen beforehand. A user saying this car should be red, or this t-shirt should be white, helps the model adjust from there.
Here's a great survey paper for more details: [2008.10774] Image Colorization: A Survey and Dataset (arxiv.org)
But what happens if the image is not historically accurate? (Hint, hint: my paper.)
Let's start with Scribbler, a model that allows users to add strokes to an image, where the model colourises the image based on these strokes. It uses a feed-forward network and a GAN to identify the sketch. The model applies a bounding box to the sketch and was also previously trained on various shapes and sizes, so it can provide accurate output.
[1612.00835] Scribbler: Controlling Deep Image Synthesis with Sketch and Color (arxiv.org)
Real-Time User-Guided Colorization: This paper allows the user to add "hints", pixels placed on the greyscale image that the model should use. So you put a green pixel on a t-shirt. And guess what? The t-shirt is now colourised green, not red. This does not use a GAN, but is closer to the CNN architecture mentioned earlier. The global hint network keeps account of all the pixels in the image, not just the user input.
Hint-Guided Anime Colorization: A model where you can draw anime sketches and the model colourizes them. Told you you would like this. This one also uses a C-GAN with a U-Net, together with a perceptual loss.
What makes user-guided networks great is also their downfall. These models can be laborious, because you are effectively labeling each greyscale image before passing it into the model. Also, if a user selects an unnatural color, this tends to make the model fail. (You wouldn't see a purple dog in the wild, would you? 🤨)
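To picture what a "hint" looks like to the network, here is a minimal sketch of how user clicks could be packed next to the greyscale image. The 4-channel layout (lightness, two hint colour channels, one mask channel) is an assumption for illustration, and `user_hints` is a made-up structure, not any paper's actual interface.

```python
import numpy as np

H, W = 6, 6
L_channel = np.full((H, W), 50.0)      # stand-in greyscale (lightness) image

# Sparse user hints: (row, col) -> (a, b) colour the user picked
user_hints = {(2, 3): (-40.0, 40.0)}   # e.g. "this t-shirt pixel is green"

hint_ab = np.zeros((H, W, 2))          # hint colours, zero where there is no hint
hint_mask = np.zeros((H, W, 1))        # 1 where the user gave a hint, else 0
for (row, col), ab in user_hints.items():
    hint_ab[row, col] = ab
    hint_mask[row, col] = 1.0

# Stack everything along the channel axis to form the network input
net_input = np.concatenate([L_channel[:, :, None], hint_ab, hint_mask], axis=2)
print(net_input.shape)
```

The mask is what lets the model tell "the user said this pixel is neutral grey" apart from "the user said nothing here". It also shows why the approach gets laborious: every image needs its own set of clicks.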
Exemplar Based Models
Now we move on to exemplar models, the state of the art for image colourization. Best to think of these as the advanced version of user-guided models. Here we have reference images to guide the model. What's great about this is that a reference image gives us a whole range of pixels to use for the colourised image, not just a single pixel or sketch like the previous models showcased above.
For the exemplar-based architecture, the reference image is a big deal (DUH!). This means the architecture takes 2 inputs: the reference image and the greyscale image. Best to think of the reference image as a nudge or weight for the greyscale image (something I built upon in my paper [link to my paper]).
There are many techniques to implement this architecture, from using a single reference image for the whole target, to using local references that adjust specific sections of the target image.
Deep Exemplar-based Colorization
The paper that brought deep learning to exemplar-based colorization. The model has 2 main parts: a similarity sub-network that measures semantic similarity between the reference and target using VGG features, and a colorization sub-network that learns to select, propagate and predict colors end-to-end. The colorization sub-network has two main branches: the chrominance branch, which learns to selectively propagate colors from well-matched regions, and the perceptual branch, which predicts plausible colors for unmatched regions based on large-scale data.
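The "measure similarity, then propagate colours from well-matched regions" idea can be sketched in a few lines. This is a heavy simplification: random vectors stand in for VGG features, the matching is a plain nearest-neighbour lookup by cosine similarity, and all variable names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for per-pixel deep features: 5 reference pixels, 8-dim features
ref_feats = rng.normal(size=(5, 8))
ref_ab = rng.uniform(-50, 50, size=(5, 2))   # colours (a, b) of the reference pixels

# 4 target (greyscale) pixels that look like reference pixels 3, 0, 4 and 1
true_match = np.array([3, 0, 4, 1])
tgt_feats = ref_feats[true_match] + 0.01 * rng.normal(size=(4, 8))

def l2_normalise(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Cosine similarity between every target pixel and every reference pixel: (4, 5)
similarity = l2_normalise(tgt_feats) @ l2_normalise(ref_feats).T

best = similarity.argmax(axis=1)      # best-matching reference pixel per target pixel
confidence = similarity.max(axis=1)   # how trustworthy each match is
warped_ab = ref_ab[best]              # colours borrowed from the reference

print(best, confidence.round(3))
```

Where `confidence` is high, borrowing the reference colour is sensible. Where it is low, the model would have to fall back on colours learned from its training data, which is the job of the second branch.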
SPColor: Semantic Prior Guided Exemplar-based Image Colorization
Building upon the Deep Exemplar-based Colorization paper, SPColor introduces semantic information to guide the model. The main components include a semantic prior guided correspondence network (SPC), which identifies objects in the image; a category reduction algorithm (CRA), which develops about 22 semantic categories for efficient image processing; and a similarity masked perceptual loss (SMP loss), a custom loss function that combines perceptual loss with a similarity map to balance color preservation and generation.
The breakthrough in this paper is the use of semantic segmentation, allowing the model to understand spatial context in the image. For example, it can distinguish between a tree and a car, and colorize the image in local areas rather than all at once, helping to avoid mismatches between semantically different regions.
Here we can see how great exemplar-based models are, and why they are the state of the art, from better accuracy to more control for the user. This approach demonstrates significant improvements over previous methods, particularly in handling complex scenes and preserving semantic consistency in the colorized images.
Loss Functions
Taken from Colorful Image Colorization: 1603.08511 (arxiv.org)
But you can see the issues with the colourisation; most of the images are washed out, brown, or frankly incorrect, as the model struggles to identify different objects across images.
(Fun fact: the reason why all images start out as brown is that this is the most common color the model sees across the dataset. By picking this color it gets the lowest error.)
Why brown, you might ask?
Many colorization models use MSE as their loss function. MSE penalizes large errors more heavily than small ones. Brown emerges as a compromise color that minimizes error across diverse scenes by averaging the color values.
Let's consider a simplified scenario:
- True colors: [255, 0, 0] (red), [0, 255, 0] (green), [0, 0, 255] (blue)
- Average color: [85, 85, 85] (a shade of gray/brown)
MSE for average color:
MSE = [(255-85)^2 + (0-85)^2 + (0-85)^2 +
(0-85)^2 + (255-85)^2 + (0-85)^2 +
(0-85)^2 + (0-85)^2 + (255-85)^2] / 9
= 14,450
MSE for any specific color (e.g., red):
MSE = [(255-255)^2 + (0-0)^2 + (0-0)^2 +
(0-255)^2 + (255-0)^2 + (0-0)^2 +
(0-255)^2 + (0-0)^2 + (255-0)^2] / 9
= 28,900
The average color yields a lower MSE, incentivizing the model to predict "safe" brownish (and ugly) colors.
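If you'd rather let the computer do the arithmetic, the scenario can be checked in a few lines of NumPy. The helper name `mse` is just for this sketch.

```python
import numpy as np

# The three "true" colours from the simplified scenario
true_colours = np.array([[255, 0, 0],
                         [0, 255, 0],
                         [0, 0, 255]], dtype=float)

def mse(prediction):
    """Mean squared error of one predicted colour against all three true colours."""
    return np.mean((true_colours - prediction) ** 2)

average = true_colours.mean(axis=0)        # [85, 85, 85]
red = np.array([255, 0, 0], dtype=float)

print(mse(average))   # 14450.0
print(mse(red))       # 28900.0
```

Guessing the average costs half as much as committing to red, even though the average matches none of the true colours. That is the whole "safe but dull" problem in two numbers.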
This is why pixel-wise losses alone don't cut it. They don't account for spatial relationships between colors in an image, AKA understanding what is going on in the photo and the objects in it (spatial context). To borrow a more technical term from GANs, the result resembles "mode collapse" [How to Identify and Diagnose GAN Failure Modes - MachineLearningMastery.com, Monitor GAN Training Progress and Identify Common Failure Modes - MATLAB & Simulink - MathWorks United Kingdom]. The model tends to converge on a limited set of "safe" colors, leading to the washed-out appearance.
Now you can see why designing good loss functions is important.
Loss function definitions
Due to the adversarial nature of GANs, they follow a minimax loss function, with the generator and discriminator competing against each other: the generator develops better images to fool the discriminator, which tries to tell the difference between a generated and a real image. This concept is later used for perceptual loss in non-GAN models.
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
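To get a feel for what the two players are pulling on, here is a small sketch that evaluates the value inside the min-max on a batch. The discriminator outputs are invented numbers and `gan_value` is a hypothetical helper; no actual networks are trained here.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Batch estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs on real images (probability "real")
    d_fake: discriminator outputs on generated images
    """
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# Discriminator that easily tells real from generated: value close to 0
confident_d = gan_value(d_real=np.array([0.9, 0.95]), d_fake=np.array([0.1, 0.05]))

# Discriminator that is fully fooled and outputs 0.5 everywhere
fooled_d = gan_value(d_real=np.array([0.5, 0.5]), d_fake=np.array([0.5, 0.5]))

print(confident_d, fooled_d)
```

The discriminator wants this number as high as possible, while the generator wants to drag it down by producing images that push the discriminator's outputs towards 0.5.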
Pixel-wise loss directly compares the color values of each pixel in the generated colorized image to the corresponding pixel in the ground truth (target) color image. These are traditional loss functions, like MSE (L2) and MAE (L1).
Perceptual loss aims to capture higher-level features and textures that are important to human visual perception, rather than just pixel-level differences.
The key idea is to use a pre-trained neural network (often a CNN classifier like VGG) to measure the similarity between the generated colorized image and the target ground truth image in the feature space of the pre-trained network. The intuition is that this perceptual loss can better guide the model to generate colorized images that look visually similar to the target, even if the pixel values don't match exactly. [perplexity.ai search]
Perceptual loss and pixel-level loss are combined into a total loss function for the model.
L_total = λ_p * L_perceptual + λ_pix * L_pixel
In LaTeX form: $$L_{total} = \lambda_p \cdot L_{perceptual} + \lambda_{pix} \cdot L_{pixel}$$
Quick deep learning reminder: the lambdas are weighting hyperparameters that control how much each loss term contributes to the total.
Maths Deep Dive for loss functions
Perceptual Loss
An example feature loss equation taken from this paper: (PDF) Analysis of Different Losses for Deep Learning Image Colorization (2022) (typeset.io).
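Going by the components broken down below, the feature loss at layer $l$ takes roughly this form. Treat it as a reconstruction from those pieces; the paper's exact notation may differ.

$$\mathcal{L}_{feat}^{l}(u, v) = \frac{1}{C_l W_l H_l} \left\| \Phi_l(u) - \Phi_l(v) \right\|_2^2$$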
Let's break down what the formula says.
Understanding the Components: $C_l, W_l, H_l$:
- These symbols represent the number of channels ($C_l$), width ($W_l$), and height ($H_l$) of the feature map at layer $l$. At the input, channels are color channels (like red, green, blue, or the LAB channels); deeper in the network they are learned feature channels.
- Width and height are the dimensions of the feature map, which tell us the size of the data being processed.
The Norm $\|\Phi_l(u) - \Phi_l(v)\|_2^2$:
- The terms $\Phi_l(u)$ and $\Phi_l(v)$ refer to the features extracted from images $u$ and $v$ at layer $l$.
- The notation $\|\cdot\|_2$ represents the L2 norm, which is a way to measure the distance between two points in space. In this case, it measures how different the features of the two images are.
- Squaring this distance (the $^2$ part) emphasizes larger differences, making them more significant in the loss calculation.
Why Divide by $C_l W_l H_l$?
- The division by $C_l W_l H_l$ normalizes the loss value. This means it adjusts the loss based on the size of the images and the number of features.
- Normalization is important because it allows for fair comparisons between different images or models, regardless of their size or complexity.
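A quick NumPy sketch shows what the division buys us. The arrays below are plain stand-ins for feature maps (no real network involved), and `feature_loss` is a made-up helper name.

```python
import numpy as np

def feature_loss(phi_u, phi_v):
    """Squared L2 distance between two feature maps, divided by C * W * H."""
    C, H, W = phi_u.shape
    return np.sum((phi_u - phi_v) ** 2) / (C * W * H)

# Same per-element difference (0.5) at two very different feature-map sizes
small = feature_loss(np.zeros((8, 16, 16)), np.full((8, 16, 16), 0.5))
large = feature_loss(np.zeros((64, 128, 128)), np.full((64, 128, 128), 0.5))

print(small, large)   # 0.25 0.25
```

Without the division, the larger feature map would report a loss hundreds of times bigger for exactly the same per-element error.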
MSE
Also, some technical details of MSE.
The formula for MSE in the continuous case:
$$\text{MSE}(u, v) = \|u-v\|_{L^2(\Omega; \mathbb{R}^C)}^2 = \int_\Omega \|u(x)-v(x)\|_2^2 \, dx$$
Let's break this down step by step.
- Variables Explained:
- $u$ and $v$: These represent two different images or sets of data we are comparing. For example, 'u' could be the colorized version of a greyscale image, and 'v' could be the actual color image we want to achieve.
- $\Omega$: This symbol represents the area or domain over which we are comparing the two images. Think of it as the entire space of the image we are looking at.
- $\mathbb{R}^C$: This notation indicates that we are dealing with color information. $C$ represents the number of color channels (like Red, Green, and Blue). So, if we have a color image, $C$ would typically be 3.
- Understanding the Norm:
- $\|u-v\|_{L^2(\Omega; \mathbb{R}^C)}$: This part of the formula calculates the difference between the two images $u$ and $v$ across the entire area $\Omega$. The $L^2$ indicates that we are using the squared differences, which is important for MSE.
- $\|u(x)-v(x)\|_2^2$: Here, $x$ represents a specific point in the image. This expression calculates the squared difference in color values at that point. The $2$ in the subscript indicates that we are using the Euclidean norm, which is a way to measure distance in a multi-dimensional space (like color).
- The Integral:
- $\int_\Omega$: This symbol means we are adding up (integrating) all the squared differences across the entire image. It helps us get a single number that represents the overall difference between the two images.
- Breaking down the discrete version of the formula:
The continuous $L^2$ distance between two images is:
$$d(u, v) = \|u-v\|_{L^2(\Omega; \mathbb{R}^C)} = \sqrt{\int_{\Omega} \|u(x) - v(x)\|_2^2 \, dx}$$
On a pixel grid, its squared version becomes the discrete formula:
$$\text{MSE}(u, v) = \sum_{i=1}^M \sum_{j=1}^N \sum_{k=1}^C (u_{ijk} - v_{ijk})^2$$
Here's what each part means:
- $u$ and $v$: These represent the two images we are comparing. $u$ is the colorized image, and $v$ is the original image.
- $M$: This is the height of the images in pixels. It tells us how many rows of pixels there are.
- $N$: This is the width of the images in pixels. It tells us how many columns of pixels there are.
- $C$: This represents the number of color channels in the images. For example, a typical color image has three channels: Red, Green, and Blue (RGB).
Understanding the Summation: The formula uses three summations (the $\sum$ symbols) to add up values:
- The first summation (over $i$) goes through each row of pixels.
- The second summation (over $j$) goes through each column of pixels.
- The third summation (over $k$) goes through each color channel.
This means we are looking at every single pixel in every color channel of both images.
Calculating the Difference: Inside the summation, we see $(u_{ijk} - v_{ijk})^2$:
- This part calculates the difference between the color value of the pixel in the colorized image $u$ and the original image $v$ for each pixel at position $(i, j)$ and color channel $k$.
- The difference is then squared. Squaring the difference is important because it makes sure that we do not have negative values, and it emphasizes larger differences more than smaller ones.
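The three summations map directly onto three nested loops. Here is a sketch comparing the literal loop version with the one-liner you would normally write; the random arrays stand in for a colourised image and its ground truth, and `squared_error_loops` is a made-up name.

```python
import numpy as np

def squared_error_loops(u, v):
    """Literal translation of the triple sum over rows, columns and channels."""
    M, N, C = u.shape
    total = 0.0
    for i in range(M):            # each row of pixels
        for j in range(N):        # each column of pixels
            for k in range(C):    # each colour channel
                total += (u[i, j, k] - v[i, j, k]) ** 2
    return total

rng = np.random.default_rng(0)
u = rng.random((4, 5, 3))   # stand-in colourised image
v = rng.random((4, 5, 3))   # stand-in ground truth image

loops = squared_error_loops(u, v)
vectorised = np.sum((u - v) ** 2)
mean_version = vectorised / u.size   # divide by M * N * C to get an actual mean

print(loops, vectorised, mean_version)
```

Note that the formula as written is a sum. Most libraries also divide by the number of elements, which is where the "mean" in MSE comes from; the two only differ by a constant factor for a fixed image size.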
MAE
$$\text{MAE}(u, v) = \int_\Omega \|u(x)-v(x)\|_{l^1} \, dx$$
Here, $u$ and $v$ represent two different images. $u$ is the image that the model predicts (the colorized image), and $v$ is the actual image we want (the ground truth image).
The symbol $\int_\Omega$ means we are looking at all the pixels in the image. $\Omega$ represents the entire area of the image we are analyzing.
The integral helps us sum up the differences across all pixels in the image.
The term $\|u(x)-v(x)\|_{l^1}$ is a way to calculate the difference between the predicted color and the actual color for each pixel.
The $l^1$ norm specifically means we are taking the absolute value of the difference. This means we are only interested in how far apart the colors are, without worrying about whether one is greater or smaller than the other.
Summing Over Color Channels: $$\|u(x)-v(x)\|_{l^1} = \sum_{k=1}^C |u_k(x) - v_k(x)|$$
Here, $C$ represents the number of color channels in the image. For example, in a typical RGB image, there are three channels: Red, Green, and Blue.
The expression $|u_k(x) - v_k(x)|$ calculates the absolute difference for each color channel $k$ at a specific pixel $x$.
The entire formula calculates the total error across all pixels and all color channels. It tells us how well the model has done in predicting the colors.
The formula for MAE in the discrete case is:
$$\text{MAE}(u, v) = \sum_{i=1}^M \sum_{j=1}^N \sum_{k=1}^C |u_{ijk} - v_{ijk}|$$
- Here, $u$ and $v$ represent two images. $u$ is the colored image produced by the computer, and $v$ is the original colored image we want to compare it to.
- $M$ and $N$ are the dimensions of the images. Specifically, $M$ is the number of rows (height) in the image, and $N$ is the number of columns (width).
- $C$ represents the number of color channels in the image. For example, a typical colored image has three channels: red, green, and blue (RGB).
- The formula uses a double summation over the pixels, which means it adds up values in a systematic way. The first summation ($\sum_{i=1}^M$) goes through each row of the image, and the second summation ($\sum_{j=1}^N$) goes through each column. The inner sum over $k$ then covers the color channels of that pixel.
- For each pixel located at position $(i, j)$, the formula calculates the absolute difference between the predicted color value $u$ and the actual color value $v$ for each color channel $k$.
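A tiny worked example makes the bookkeeping visible. The two "images" below are just two pixels with made-up values.

```python
import numpy as np

# Two tiny images of shape (1, 2, 3): one row, two pixels, three channels
u = np.array([[[10.0, 20.0, 30.0], [0.0, 0.0, 0.0]]])   # predicted colours
v = np.array([[[12.0, 17.0, 30.0], [1.0, 0.0, 4.0]]])   # ground truth colours

per_pixel_l1 = np.abs(u - v).sum(axis=2)   # sum over channels k for each pixel
mae_sum = per_pixel_l1.sum()               # then sum over rows i and columns j

print(per_pixel_l1)   # [[5. 5.]]
print(mae_sum)        # 10.0
```

The first pixel is off by 2 + 3 + 0 and the second by 1 + 0 + 4, so both contribute 5. Because nothing is squared, the single error of 4 does not dominate the way it would under MSE.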
Discrete Settings vs Continuous Settings
Throughout this section, I've shown both discrete and continuous versions of the same loss functions. So why do we have different versions of the same thing? (Hopefully you remember some calculus.)
Discrete settings are used because images are represented as discrete pixel values. Loss functions like L1 and L2 operate on these pixel values, making them suitable for direct computation of differences between predicted and actual values.
Continuous Settings may involve treating pixel values as continuous variables, which can be beneficial for certain types of models that predict color distributions rather than specific values.
Code version of the Loss functions
# [from perplexity] (https://www.perplexity.ai/)
import torch
import torch.nn as nn
import torchvision.models as models


class SimplePerceptualLoss(nn.Module):
    def __init__(self):
        super().__init__()
        # Load pre-trained VGG16 and use its first few layers
        # (on older torchvision versions, use models.vgg16(pretrained=True))
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
        self.feature_extractor = nn.Sequential(*list(vgg.features)[:5]).eval()
        # Freeze the parameters so the loss network itself is never trained
        for param in self.feature_extractor.parameters():
            param.requires_grad = False

    def forward(self, generated, target):
        # Extract features from generated and target images
        gen_features = self.feature_extractor(generated)
        target_features = self.feature_extractor(target)
        # Compute mean squared error between features
        loss = nn.MSELoss()(gen_features, target_features)
        return loss


# Usage example
perceptual_loss = SimplePerceptualLoss()
# Example tensors representing generated and target images
generated = torch.randn(1, 3, 256, 256)
target = torch.randn(1, 3, 256, 256)
loss = perceptual_loss(generated, target)
print(f"Perceptual Loss: {loss.item():.4f}")
loss = nn.MSELoss()(gen_features, target_features)
This is the main line: it compares the VGG features of the colourised image with the VGG features of the target image.
Funny how you can create a loss function for everything; that's the lesson in deep learning. Go ask Sam Altman.
The main thing to keep in mind for image colorization is that the loss calculates the difference between the colourised output and the real color image, which is then used to adjust the model.
Conclusion
As we've journeyed through the interesting world of image colorization, we've seen how this field has rapidly evolved from simple pixel-based techniques to advanced deep learning tools.
- We started with the basics of color theory and how computers interpret color spaces like RGB and LAB.
- We explored the fundamental role of Convolutional Neural Networks (CNNs) in modern colorization techniques.
- We traced the evolution of colorization methods, from plain CNN-based approaches to more advanced user-guided and exemplar-based models.
- We delved into the intricacies of loss functions, understanding how pixel-wise, perceptual, and GAN losses contribute to more accurate and visually pleasing results.
- Finally, we examined state-of-the-art exemplar-based models that leverage semantic information and reference images to produce more accurate colorization.
Within a decade, the field of image colourisation via deep learning has progressed a lot. Makes you wonder what the next decade has in store for us, with LLMs and better image generation models. Let's see. Also, I've opted to move the ethics and humanities section into a separate blog post. Questions like: what happens if image colourisation is not historically accurate, and what's next? Something that my paper does a deep dive on. Read my paper here