Checkpoint :: definition and meaning
A checkpoint in Stable Diffusion is like a saved game file. Imagine the checkpoints used in a racing video game.
It stores all the knowledge and skills the AI model has learned during its training process.
This includes the ability to understand the relationships between words and images, how to turn text descriptions into visual representations, and all the tiny adjustments made to create a specific art style.
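To make the analogy concrete, here is what "loading the save file" looks like in practice. This is a minimal sketch, assuming the Hugging Face diffusers library; the filename and the prompt are illustrative.

```python
# Minimal sketch: load a checkpoint file and generate an image with it.
import torch
from diffusers import StableDiffusionPipeline

# Loading the checkpoint restores every weight the model learned during
# training, exactly like loading a saved game restores your progress.
pipe = StableDiffusionPipeline.from_single_file(
    "my-checkpoint.safetensors",  # illustrative filename
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a castle on a hill at sunset").images[0]
image.save("castle.png")
```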
What's the relationship between this and the checkpoints you can download? Well, let's talk about the base models first.
Base models
When we talk about base models, or foundation models, we're talking about several hundred thousand dollars' worth of computation (around $600,000 for Stable Diffusion), so these are not the ones you download on Civitai.
This is the underlying architecture of the AI model itself, like a blueprint. It defines how the model processes information, receives instructions, and generates images.
For instance, there are the Stable Diffusion base models (1.4, 1.5, 2, XL, and so on), which are open source; the Midjourney model, which is closed; and the NovelAI model, which leaked more than a year ago.
When StabilityAI releases an update, such as Stable Diffusion XL or the upcoming 3.0, it is a new base model.
Then a checkpoint is made from the base model, acting as the pre-trained "knowledge base" specific to the data used for its training.
It contains the trained weights and parameters that the model has learned from its specific training data.
These weights guide the model's decisions and influence the style and content of the generated images.
In the end, these months of training result in files containing several gigabytes of weights.
Please note that this is the same process for Large Language Models (LLMs) and nearly everything else related to AI.
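If you're curious, you can peek at those gigabytes yourself. This is a minimal sketch, assuming a .safetensors checkpoint on disk and the safetensors library; the filename is illustrative.

```python
# Minimal sketch: count the weights stored in a checkpoint file.
import os
from safetensors import safe_open

path = "my-checkpoint.safetensors"  # illustrative filename
total_params = 0
with safe_open(path, framework="pt") as f:
    for name in f.keys():
        # Each entry is one tensor of learned weights.
        total_params += f.get_tensor(name).numel()

print(f"File size: {os.path.getsize(path) / 1e9:.2f} GB")
print(f"Parameters: {total_params:,}")  # SD 1.5 is roughly a billion in total
```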
Good to know
Before their release, base models can be censored, either by totally removing a concept (think keyword) or by not training on a concept (selecting only pictures that don't contain it).
For instance, SD 2.1 and SDXL completely erased the penis concept, making it nearly impossible to generate one, even with fine-tuning.
While it is possible to add a new concept using a LoRA, in this case the model won't truly understand it: it will only be able to reproduce it, without integrating it into the global picture.
Someone somehow managed to make a LoRA with some results; sex is a powerful motivator.
Want to try something fun? Just try to generate a person or an animal biting something, anything: it's absolutely impossible! The concept simply doesn't seem to exist.
I spent several hours trying for a video game about dogs. It's a good lesson: you can't force a base model.
So keep this in mind: fine-tuned checkpoints and LoRAs modify weights but don't create something new; only the original training can do that.
Furthermore, as the full training settings are not publicly available, it is also unrealistic to simply continue the training of a base model.
The only solution is to train a new base model from scratch (a crowd-sourced base model will exist one day if censorship makes it impossible to generate porn/hentai, I'm pretty sure of it :)).
NB: with online image generators such as DALL-E, censorship also occurs after the image is generated, which won't happen with a model running locally. Just saying.
Trained checkpoints
From the foundation model, it is possible to train further (fine-tuning) with a new and considerably smaller set of images (several thousand, versus the 600 million captioned images used for the first SD base model).
The point of fine-tuning is to modify the weights of one or several concepts (keywords) while keeping the original weights elsewhere, so you don't have to train for what you don't want to modify.
Everything the base model "knows" will remain after the fine-tuning, and you will still be able to generate, let's say, a dragon with a checkpoint specialized in landscapes.
Most of the time, the purpose of a custom checkpoint is to modify the style of the pictures, for instance changing them to anime.
In this case, you show the model hundreds of anime girl pictures and tell it: this is "1girl" (to use a well-known tag), enough times for it to shift its weights towards what you want to obtain.
Where the magic really happens is in the interpretation: all the related concepts will be affected: woman, grandmother, female doctor, and so on...
And you continue with other concepts: landscapes, items, facial expressions... until you're happy with the result.
Save the outcome (i.e., generate a checkpoint) and you're done.
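To give an idea of what happens under the hood, here is a heavily simplified sketch of a single fine-tuning step. It assumes the diffusers and transformers libraries and the standard SD 1.5 weights; the data pipeline and the surrounding training loop are elided.

```python
# Heavily simplified sketch of one fine-tuning step (no data pipeline,
# no gradient accumulation, no mixed precision).
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"  # the usual SD 1.5 repo
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)

def training_step(pixel_values, caption):
    """One step: show the model a picture and its caption (e.g. "1girl").

    pixel_values is an image tensor of shape (B, 3, H, W) scaled to [-1, 1].
    """
    # Compress the image into the latent space the UNet works in.
    latents = vae.encode(pixel_values).latent_dist.sample()
    latents = latents * vae.config.scaling_factor

    # Turn the caption into the embeddings that condition the UNet.
    tokens = tokenizer(caption, padding="max_length",
                       max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt")
    text_embeddings = text_encoder(tokens.input_ids)[0]

    # Add noise at a random timestep; the UNet must learn to predict it.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, scheduler.config.num_train_timesteps,
                              (latents.shape[0],))
    noisy_latents = scheduler.add_noise(latents, noise, timesteps)
    noise_pred = unet(noisy_latents, timesteps, text_embeddings).sample

    # The loss nudges the weights towards the new concept.
    loss = F.mse_loss(noise_pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Repeat over your dataset, then "save the outcome" as a shareable file:
# from safetensors.torch import save_file
# save_file(unet.state_dict(), "my-finetuned-checkpoint.safetensors")
```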
You obtain a trained checkpoint that you can share with other users; this is a "Checkpoint Trained" on Civitai. Anyone can download and use it.
From this checkpoint, another round of training with the same process can be done, either to improve the results or to add new concepts or styles, ad infinitum, giving different versions/branches of the checkpoint.
Merged checkpoints
When you merge checkpoints in Stable Diffusion, the resulting checkpoint combines elements from the individual checkpoints you're merging. This can lead to several potential outcomes:
- Combining Styles: The most common goal is to merge checkpoints with different artistic styles. This can result in a new checkpoint that blends elements from both styles, potentially leading to more versatile and diverse image generation capabilities. For example, merging a checkpoint trained on anime characters with another trained on landscapes might create a new checkpoint capable of generating images with anime-style characters in landscape settings.
- Fine-tuning for Specific Tasks: Merging can also be used to fine-tune a checkpoint for a specific task. You might combine a general-purpose checkpoint with another trained on a specific dataset of images and captions related to your desired task. For example, merging a general-purpose checkpoint with another trained on medical images and captions could create a new checkpoint better suited for generating medically accurate images.
- Averaging Weights: In some cases, merging simply involves averaging the weights of the individual checkpoints. This can introduce a blend of their characteristics while potentially reducing the overall strength of any specific style present in the individual checkpoints (a minimal sketch follows this list).
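In its simplest form, that weighted average is only a few lines of code. This is a minimal sketch, assuming two .safetensors checkpoints built on the same base architecture; the filenames and the 50/50 ratio are illustrative.

```python
# Minimal sketch: merge two checkpoints by interpolating their weights.
from safetensors.torch import load_file, save_file

ratio = 0.5  # 0.0 = pure model A, 1.0 = pure model B
a = load_file("model_a.safetensors")
b = load_file("model_b.safetensors")

merged = {}
for name, tensor_a in a.items():
    if name in b and b[name].shape == tensor_a.shape:
        # Blend the two sets of learned weights.
        merged[name] = (1.0 - ratio) * tensor_a + ratio * b[name]
    else:
        # Keep tensors that only one model has (or that don't line up).
        merged[name] = tensor_a

save_file(merged, "model_merged.safetensors")
```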
It is a "Checkpoint Merged" on Civitai. Like the others, it can be used as is, further trained or merged again.
While it seems easier than training one (seriously it is) merging checkpoints can be a complex and experimental process with unpredictable outcomes.
The most often, results are generic (so many useless anime checkpoints!) and below average.
Delving deeper
From an engineering standpoint, a Stable Diffusion checkpoint typically consists of several parts.
Primarily, this includes a serialized representation of the model's architecture (e.g., layers, activation functions) and the corresponding trained weights (numerical values).
Additionally, it may contain optimizer states, training metadata (such as the text encoder's vocabulary), or other auxiliary information.
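Here is a minimal sketch of how you might peek at that structure, assuming the classic single-file Stable Diffusion layout; the filename is illustrative.

```python
# Minimal sketch: list the components packed into a checkpoint file.
from collections import Counter
from safetensors import safe_open

with safe_open("my-checkpoint.safetensors", framework="pt") as f:
    # Tensor names are dotted paths; the first segment reveals the component.
    prefixes = Counter(name.split(".")[0] for name in f.keys())

for prefix, count in prefixes.most_common():
    print(f"{prefix}: {count} tensors")
# In classic SD 1.x files you would typically see "model" (the UNet),
# "first_stage_model" (the VAE) and "cond_stage_model" (the text encoder).
```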
TL;DR
Checkpoints in Stable Diffusion are essentially snapshots of the model's weight parameters at a specific point in training.
These weights govern the vast network of connections within the model, allowing it to make predictions and generate images.
When a checkpoint is loaded, it initializes the model with those learned parameters.
This enables users to continue the training process, fine-tune the model on a specialized dataset, or simply use the model for image generation at its current state of development.