1. Executive Summary
Building a fine-tuning engine requires understanding that “training an AI” is not a single process. Modern language models pass through at least three distinct training stages, each consuming different data formats with different optimization objectives. Beyond text, image and audio models use fundamentally different data structures and training paradigms. This document maps the complete landscape.
The key insight for engine builders: the conversational chat format (JSONL with role/content pairs) that dominates API-level fine-tuning is only one data format used at one stage of a multi-stage pipeline. Pretraining consumes raw text. Supervised fine-tuning consumes instruction-response pairs. Preference tuning consumes comparison triplets. Image models consume labeled images or image-caption pairs. Audio models consume spectrogram-transcript pairs. Each requires different data loaders, loss functions, and optimization strategies.
2. The Canonical Training Pipeline
The modern LLM training pipeline was codified by OpenAI’s InstructGPT paper (Ouyang et al., 2022) [1], which established a three-stage process that remains the standard template. The pipeline is:
Stage 1: Pretraining → produces a base model with broad language knowledge
Stage 2: Supervised Fine-Tuning (SFT) → teaches the model to follow instructions
Stage 3: Preference Tuning (RLHF or DPO) → aligns the model with human preferences
Each alignment stage is dramatically cheaper than pretraining. The InstructGPT paper reported that training the 175B SFT model required 4.9 petaflop/s-days and the PPO model 60 petaflop/s-days, compared to 3,640 petaflop/s-days for pretraining GPT-3 [1]. The alignment stages combined cost roughly 2% of the pretraining compute, yet produced a 1.3B-parameter model whose outputs human evaluators preferred over those of the unaligned 175B GPT-3 [1].
2.1 Data Flow Diagram
┌─────────────────────────────────────────────────────────────────────┐
│ STAGE 1: PRETRAINING │
│ │
│ Data: Raw text (web crawl, books, code, papers) │
│ Format: Plain text, no structure │
│ Scale: Trillions of tokens │
│ Objective: Next-token prediction (causal language modeling) │
│ Output: Base model (broad knowledge, no instruction-following) │
│ │
│ [Common Crawl] ──┐ │
│ [Books] ────┼──→ [Tokenizer] ──→ [Transformer] ──→ Base LM │
│ [Code] ────┤ ↑ ↑ │
│ [Papers] ────┘ BPE/SentencePiece Cross-entropy loss │
│ on next token │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ STAGE 2: SUPERVISED FINE-TUNING (SFT) │
│ │
│ Data: Instruction-response pairs, multi-turn conversations │
│ Format: JSONL with messages array (role: user/assistant) │
│ Scale: Thousands to tens of thousands of examples │
│ Objective: Next-token prediction on demonstration data │
│ Output: Instruction-following model │
│ │
│ [Human-written demos] ──→ [Chat template] ──→ [Fine-tune] ──→ SFT│
│ [Tool-use examples] ──┘ formatting Base LM Model│
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ STAGE 3: PREFERENCE TUNING (RLHF or DPO) │
│ │
│ Data: Comparison pairs (chosen vs. rejected responses) │
│ Format: Triplets (prompt, chosen_response, rejected_response) │
│ Scale: Tens of thousands of comparisons │
│ Objective: Maximize preference alignment │
│ Output: Aligned model │
│ │
│ RLHF path: │
│ [Ranked outputs] ──→ [Reward Model] ──→ [PPO] ──→ Aligned Model │
│ │
│ DPO path: │
│ [Preference pairs] ──→ [Direct optimization] ──→ Aligned Model │
│ (no reward model needed) │
└─────────────────────────────────────────────────────────────────────┘
3. Stage 1: Pretraining
3.1 What It Is
Pretraining is self-supervised learning on raw text. The model learns to predict the next token in a sequence, absorbing grammar, factual knowledge, reasoning patterns, and code structure from the statistical regularities of language [2]. No human labels are required. The training signal comes entirely from the structure of the text itself.
3.2 Data Sources and Formats
Pretraining data is plain text — no roles, no system prompts, no conversation structure. The major source categories are:
Web crawl: The dominant source. Common Crawl provides petabytes of raw web data extracted from billions of web pages and has been used by GPT-3, LLaMA, OpenLLaMA, and T5 [3]. However, most crawled pages fail quality filters: existing pipelines often discard over 90% of the raw data [4]. The FineWeb dataset, for example, is a 15-trillion token dataset derived from 96 CommonCrawl snapshots with a total data volume of 45TB [5].
Books: Provide formal, lengthy texts that help models understand complex language structures, long-range dependencies, and coherent narrative [6]. Key datasets include BookCorpus and Books3 (from The Pile).
Code: StarCoder Data contains 783 GB of code across 86 programming languages, plus 250 billion tokens from GitHub and Jupyter Notebooks [3]. Code training appears to improve reasoning capabilities even for non-code tasks.
Scientific and multilingual text: The Common Corpus project includes scientific papers, theses, book reviews, clinical trials, and historical cultural texts, with data in over 50 languages [7].
Quality filtering is critical and represents a major engineering effort. RedPajama, an open-source replication of LLaMA’s training data, provides 1.2 trillion tokens with exposed quality signals and metadata from Common Crawl, C4, GitHub, books, and other sources [3].
3.3 Training Objective
The objective is causal language modeling: given a sequence of tokens $t_1, t_2, …, t_{n-1}$, predict $t_n$. The loss function is cross-entropy between the model’s predicted probability distribution over the vocabulary and the actual next token. This is applied at every position in the sequence, making each document provide many training signals.
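Concretely, the per-position loss is the negative log-probability the model assigns to the true next token. A minimal sketch with a toy four-token vocabulary (all probabilities illustrative):

```python
import math

def cross_entropy_next_token(probs, target_id):
    """Negative log-probability assigned to the true next token."""
    return -math.log(probs[target_id])

# Toy softmax output over a 4-token vocabulary at one position.
predicted = [0.1, 0.7, 0.15, 0.05]
loss_confident = cross_entropy_next_token(predicted, 1)  # true token was id 1
loss_wrong = cross_entropy_next_token(predicted, 3)      # true token was id 3

# A confident correct prediction yields low loss; a miss yields high loss.
print(round(loss_confident, 3), round(loss_wrong, 3))  # → 0.357 2.996
```

During training this quantity is averaged over every position in the sequence, which is why a single long document supplies thousands of training signals.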
3.4 Compute and Scale
Pretraining is by far the most expensive stage. It typically requires clusters of hundreds to thousands of GPUs running for weeks to months. The Chinchilla scaling laws (Hoffmann et al., 2022) established that optimal training allocates compute roughly equally between model size and data size — a model with N parameters should be trained on approximately 20N tokens.
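The 20-tokens-per-parameter heuristic makes compute budgeting a one-line calculation (it is a rule of thumb, not an exact law):

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Chinchilla rule of thumb: ~20 training tokens per model parameter."""
    return n_params * tokens_per_param

# A 70B-parameter model would want roughly 1.4 trillion training tokens.
print(chinchilla_optimal_tokens(70e9) / 1e12)  # → 1.4
```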
4. Stage 2: Supervised Fine-Tuning (SFT)
4.1 What It Is
SFT adapts the pretrained base model to follow instructions by training on curated instruction-response pairs. The training objective is the same as in pretraining (next-token prediction), but it is applied to a much smaller dataset of high-quality demonstrations: because SFT requires instruction-output pairs rather than raw text, and creating such pairs involves significant human labor, SFT datasets stay orders of magnitude smaller [8].
4.2 Data Formats
SFT data comes in several formats, all representing the same underlying structure of “given this input, produce this output”:
Simple instruction pairs (prompt/completion format):
{"prompt": "Translate 'hello' to French.", "completion": "Bonjour."}
Chat/messages format (the dominant format for API fine-tuning):
{"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Translate 'hello' to French."},
{"role": "assistant", "content": "Bonjour."}
]}
Multi-turn conversations extend the messages format with alternating user/assistant turns. Each model family has its own chat template that wraps these messages with special tokens.
Tool-use demonstrations include function calls and results:
{"messages": [
{"role": "user", "content": "What's the weather?"},
{"role": "assistant", "tool_calls": [{"function": {"name": "get_weather", "arguments": "{\"city\": \"Toronto\"}"}}]},
{"role": "tool", "content": "{\"temp\": 5, \"condition\": \"cloudy\"}"},
{"role": "assistant", "content": "It's 5°C and cloudy in Toronto."}
]}
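Before tokenization, a messages array is flattened into a single string by the model's chat template. A minimal sketch of what such a template does; the <|begin_of_turn|>/<|end_of_turn|> markers are illustrative, since each model family defines its own special tokens:

```python
def apply_chat_template(messages):
    """Flatten a messages array into one training string.

    The turn markers here are hypothetical; real templates are
    model-specific and shipped with the tokenizer.
    """
    parts = []
    for m in messages:
        parts.append(f"<|begin_of_turn|>{m['role']}\n{m['content']}<|end_of_turn|>")
    return "\n".join(parts)

messages = [
    {"role": "user", "content": "Translate 'hello' to French."},
    {"role": "assistant", "content": "Bonjour."},
]
print(apply_chat_template(messages))
```

Using the wrong template for a model is a common source of silent quality degradation, since the model was trained to expect its own special tokens.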
4.3 Key Datasets
Several open-source SFT datasets exist [9]:
- Dolly 2.0 (databricks-dolly-15k): 15K+ human-written prompt-response pairs covering question-answering, summarization, and other tasks.
- OpenAssistant (OASST1): 66,497 human-written, human-annotated conversation trees in multiple languages.
- Alpaca: 52K instruction-following examples generated by text-davinci-003.
- GPT-4-LLM: Instruction and comparison data generated by GPT-4 for RLHF.
4.4 Data Flow for SFT
Raw examples ──→ Chat template formatting ──→ Tokenization ──→ Training
│ │
Apply model-specific Mask loss on
special tokens user/system turns
(e.g., <|begin_of_turn|>) (only compute loss
on assistant tokens)
A critical implementation detail: during SFT, loss is typically computed only on the assistant’s response tokens, not on the user’s prompt tokens. This is called “completion-only” loss masking. The prompt tokens are processed by the model (they inform the response), but the gradient signal comes only from predicting the response correctly.
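In PyTorch pipelines this masking is typically done by setting prompt positions in the label tensor to -100, the default `ignore_index` of cross-entropy loss. A minimal sketch with illustrative token ids:

```python
IGNORE_INDEX = -100  # positions with this label contribute no loss in PyTorch

def build_labels(token_ids, prompt_len):
    """Copy token ids as labels, masking prompt positions so that only
    the assistant's response tokens produce gradient signal."""
    return [IGNORE_INDEX] * prompt_len + token_ids[prompt_len:]

# 5 prompt tokens followed by 3 response tokens (ids are illustrative).
tokens = [101, 2054, 2003, 1029, 102, 7592, 999, 103]
labels = build_labels(tokens, prompt_len=5)
print(labels)  # → [-100, -100, -100, -100, -100, 7592, 999, 103]
```

The full token sequence is still fed to the model, so the prompt conditions the response; it simply never appears as a prediction target.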
4.5 The InstructGPT SFT Dataset
The InstructGPT SFT dataset contained approximately 13,000 training prompts drawn from the OpenAI API and from labeler-written prompts. The reward model dataset contained 33,000 training prompts [1]. These are small datasets relative to pretraining — the leverage comes from the base model already having broad language knowledge.
5. Stage 3: Preference Tuning
5.1 Why Preference Tuning Exists
SFT teaches the model what good outputs look like, but it has a ceiling: it can only learn from demonstrated examples, and there are many valid ways to respond to a prompt. Preference tuning instead teaches the model which outputs are better than others, which captures nuanced quality distinctions that are hard to demonstrate directly [10]. Humans are better at ranking responses than generating perfect ones, making preference data more efficient to collect [10].
5.2 RLHF (Reinforcement Learning from Human Feedback)
The RLHF pipeline has three sub-steps [1] [2]:
Step 1: Generate comparison data. The SFT model generates multiple responses (typically 4–9) for each prompt. Human annotators rank these responses by quality [8].
Step 2: Train a reward model. A separate model is trained on the ranking data to predict a scalar reward score for any (prompt, response) pair. The reward model learns to predict human preferences from the ranking data [2].
Step 3: Optimize the policy with PPO. The SFT model is further fine-tuned using Proximal Policy Optimization (PPO) to maximize the reward model’s score. A KL divergence penalty prevents the model from diverging too far from the SFT model, which helps maintain coherent text generation [2] [1].
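One common formulation of the KL-constrained objective folds the penalty into the per-token reward itself. A scalar sketch (the coefficient and log-probabilities are illustrative):

```python
def ppo_reward(rm_score, logp_policy, logp_ref, kl_coef=0.1):
    """Effective reward used in RLHF: the reward-model score minus a
    penalty that grows as the policy drifts from the frozen SFT reference."""
    kl = logp_policy - logp_ref  # per-token KL estimate
    return rm_score - kl_coef * kl

# If the policy assigns much higher log-prob than the reference,
# the penalty reduces the effective reward.
print(ppo_reward(rm_score=1.0, logp_policy=-1.0, logp_ref=-2.5))  # → 0.85
```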
RLHF Data Flow:
[Prompt] ──→ [SFT Model generates 4-9 responses]
│
▼
[Human ranks responses: A > B > C > D]
│
▼
[Train Reward Model on rankings]
(learns to assign scalar scores)
│
▼
[PPO optimization loop]
┌─────────────────────────────────┐
│ Generate response ──→ Score it │
│ ▲ │ │
│ └───── Update ◀─────┘ │
│ weights │
│ (+ KL penalty from SFT model) │
└─────────────────────────────────┘
│
▼
[Aligned Model]
RLHF requires four models in memory simultaneously during training: the policy model being optimized, the reference model (frozen SFT model for KL computation), the reward model, and the value function (often initialized from the reward model) [10].
5.3 DPO (Direct Preference Optimization)
DPO, introduced by Rafailov et al. (2023), simplifies RLHF by eliminating the explicit reward model and RL optimization entirely [10]. It directly fine-tunes the language model on preference pairs using a classification-style loss function.
The data format for DPO is triplets [10] [11]:
{
"prompt": "Explain quantum entanglement.",
"chosen": "Quantum entanglement is a phenomenon where...",
"rejected": "Quantum entanglement is when particles are magically linked..."
}
DPO trains the model to increase the relative probability of the chosen response while decreasing that of the rejected response, with a KL penalty (controlled by the β parameter) that prevents the model from drifting too far from the reference policy [10].
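The DPO loss on a single preference pair reduces to a logistic loss over a margin of log-ratios. A minimal scalar sketch (the log-probabilities are illustrative; real implementations sum log-probs over response tokens):

```python
import math

def dpo_loss(lp_chosen, lp_rejected, ref_lp_chosen, ref_lp_rejected, beta=0.1):
    """DPO loss on one preference pair: -log sigmoid(beta * margin), where
    the margin compares policy-vs-reference log-ratios for chosen vs. rejected."""
    margin = (lp_chosen - ref_lp_chosen) - (lp_rejected - ref_lp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already prefers the chosen response more than the reference does:
good = dpo_loss(-10.0, -14.0, -12.0, -13.0)
# Policy prefers the rejected response: the loss is higher.
bad = dpo_loss(-14.0, -10.0, -13.0, -12.0)
print(good < bad)  # → True
```

The β parameter scales the margin: larger β punishes drift from the reference policy more sharply, playing the role of the KL coefficient in RLHF.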
DPO Data Flow:
[Prompt + Chosen + Rejected] ──→ [Compute log-probabilities]
│
┌──────────┴──────────┐
│ │
Policy model Reference model
(being trained) (frozen copy)
│ │
└──────────┬──────────┘
│
▼
[DPO Loss Function]
Increase: P(chosen) / P_ref(chosen)
Decrease: P(rejected) / P_ref(rejected)
│
▼
[Gradient update]
DPO requires only two model copies (policy and reference) instead of RLHF’s four, uses standard backpropagation with no RL, and operates on a static offline dataset [10]. It has become the dominant preference tuning method in the open-source ecosystem.
5.4 Other Preference Methods
KTO (Kahneman-Tversky Optimization): Does not require paired preference data — each response is independently rated as positive or negative [12].
GRPO (Group Relative Policy Optimization): Used by DeepSeek, groups multiple responses and uses their relative quality within the group as the training signal, avoiding the need for a separate reward model.
IPO (Identity Preference Optimization): Adds a regularization term to the DPO loss to prevent overfitting on the preference dataset [12].
6. Non-Text Modalities
6.1 Image Training
Image AI uses fundamentally different data formats and training objectives from text. The three major paradigms are:
6.1.1 Supervised Image Classification
Data format: (image, label) pairs — an image file and a categorical label.
Architecture: Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs). PyTorch’s torchvision provides standard datasets including ImageNet, CIFAR, and COCO, along with data augmentation transforms [13].
Training objective: Cross-entropy classification loss. Given an image, predict which of N classes it belongs to.
Transfer learning is standard practice: rather than training from scratch, practitioners pretrain on a large dataset like ImageNet (1.2 million images, 1000 categories) and then fine-tune on a task-specific dataset [14]. This is conceptually analogous to LLM pretraining followed by fine-tuning.
Image Classification Data Flow:
[Raw Image] ──→ [Resize/Augment] ──→ [ToTensor] ──→ [Normalize]
│
▼
[CNN or ViT Encoder]
│
▼
[Fully Connected Layer]
│
▼
[Softmax] ──→ P(class)
│
▼
[Cross-entropy loss vs. true label]
6.1.2 Contrastive Learning (CLIP)
Data format: (image, caption) pairs scraped from the internet.
CLIP (Contrastive Language-Image Pre-training), introduced by OpenAI in 2021, learns to connect images and text in a shared embedding space [15]. It was trained on a dataset called WebImageText containing 400 million image-caption pairs [16].
Training objective: Contrastive loss. Given a batch of N image-caption pairs, the model learns to maximize cosine similarity between the N matching pairs while minimizing similarity between the N²-N non-matching pairs [15]. This teaches the model that “a photo of a dog” should be close in embedding space to images of dogs.
CLIP Data Flow:
[Image] ──→ [Image Encoder (ViT)] ──→ [Image Embedding] ──┐
├──→ [Cosine Similarity Matrix]
[Caption] ──→ [Text Encoder (Transformer)] ──→ [Text Embedding] ──┘ │
▼
[Contrastive Loss]
Maximize: matching pairs
Minimize: non-matching pairs
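The symmetric contrastive loss can be sketched on a toy 2x2 similarity matrix: each row (image against all captions) and each column (caption against all images) is scored with cross-entropy against its diagonal match. All similarity values below are illustrative:

```python
import math

def softmax_xent(row, true_idx):
    """Cross-entropy of one row of the similarity matrix vs. its match."""
    exps = [math.exp(s) for s in row]
    return -math.log(exps[true_idx] / sum(exps))

def clip_loss(sim):
    """Symmetric contrastive loss: each image should match its own caption
    (rows) and each caption its own image (columns)."""
    n = len(sim)
    img_to_txt = sum(softmax_xent(sim[i], i) for i in range(n)) / n
    cols = [[sim[i][j] for i in range(n)] for j in range(n)]
    txt_to_img = sum(softmax_xent(cols[j], j) for j in range(n)) / n
    return (img_to_txt + txt_to_img) / 2

# Toy scaled-similarity matrices; the diagonal holds the matching pairs.
aligned = [[5.0, 0.0], [0.0, 5.0]]   # embeddings already aligned
shuffled = [[0.0, 5.0], [5.0, 0.0]]  # each image matches the wrong caption
print(clip_loss(aligned) < clip_loss(shuffled))  # → True
```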
CLIP’s text encoder is used as a component in downstream models — for example, Stable Diffusion uses CLIP’s text encoder to transform text prompts into embeddings for image generation [16].
6.1.3 Diffusion Models (Image Generation)
Data format: (image, caption) pairs, plus a noise schedule.
Diffusion models learn to generate images by learning to remove noise. During training, noise is progressively added to training images, and the model learns to predict and remove that noise at each step [17]. At inference time, the model starts from pure noise and iteratively denoises it into a coherent image.
Diffusion Training Data Flow:
[Clean Image] ──→ [Add noise at timestep t] ──→ [Noisy Image]
│
▼
[U-Net predicts noise]
│
▼
[MSE loss: predicted noise vs. actual noise]
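The forward-noising step and the training loss above can be sketched in a few lines. This is a toy scalar version (real pipelines use a U-Net over image tensors and a full noise schedule over hundreds of timesteps):

```python
import math
import random

def add_noise(x, noise, alpha_bar_t):
    """Forward diffusion at timestep t:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps"""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * xi + b * ni for xi, ni in zip(x, noise)]

def mse(pred, target):
    """Training loss: mean squared error between predicted and actual noise."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

random.seed(0)
x0 = [0.5, -0.2, 0.9]                      # a "clean image" (toy pixel values)
eps = [random.gauss(0, 1) for _ in x0]     # the noise the model must predict
x_t = add_noise(x0, eps, alpha_bar_t=0.5)  # heavily noised training input
# A perfect noise prediction would give zero loss.
print(mse(eps, eps))  # → 0.0
```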
Diffusion Inference:
[Pure Noise] ──→ [Denoise step 1] ──→ [Denoise step 2] ──→ ... ──→ [Generated Image]
↑
[Text conditioning via CLIP encoder]
6.2 Audio Training
6.2.1 The Spectrogram Bridge
Audio models typically convert sound into a visual representation before processing. OpenAI’s Whisper illustrates this: input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder [18]. A log-Mel spectrogram is a 2D representation of frequency content over time, where the frequency axis is warped according to the Mel scale (which approximates human auditory perception). This conversion turns audio into something structurally similar to an image, enabling the use of similar processing techniques.
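The Mel warping itself is a simple logarithmic formula. A sketch using the standard HTK-style conversion, which shows how the scale compresses high frequencies the way human hearing does:

```python
import math

def hz_to_mel(f_hz):
    """HTK-style mel-scale conversion: mel = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# The scale is roughly linear at low frequencies and logarithmic above:
# the jump from 100 Hz to 200 Hz spans far more mels than the jump
# from 7000 Hz to 7100 Hz.
low_jump = hz_to_mel(200) - hz_to_mel(100)
high_jump = hz_to_mel(7100) - hz_to_mel(7000)
print(low_jump > high_jump)  # → True
```

A log-Mel spectrogram applies this warping to the frequency axis of a short-time Fourier transform, then log-scales the magnitudes.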
6.2.2 Whisper: Supervised Speech Recognition
Data format: (audio, transcript) pairs.
Whisper was trained on 680,000 hours of multilingual supervised data collected from the web [18]. The large-v3 model was trained on 1 million hours of labeled audio and 4 million hours of pseudo-labeled audio generated using the large-v2 model [19]. This is supervised training, but the supervision is audio-transcript pairs, not conversations.
Architecture: Encoder-decoder Transformer. The encoder processes the spectrogram and produces hidden representations. The decoder autoregressively predicts text tokens, conditioned on the encoder’s output [20].
Whisper Data Flow:
[Raw Audio (16kHz)] ──→ [30-second chunks] ──→ [Log-Mel Spectrogram]
│
▼
[Convolutional Stem]
(2 conv layers + GELU)
│
▼
[Transformer Encoder]
│
▼
[Transformer Decoder]
(cross-attention to encoder)
│
▼
[Predicted text tokens]
│
▼
[Cross-entropy loss vs. true transcript]
6.2.3 Self-Supervised Audio (wav2vec 2.0)
An alternative approach, developed by Meta, learns speech representations from unlabeled audio [18]. wav2vec 2.0 uses a masked prediction objective analogous to BERT: portions of the audio are masked, and the model learns to predict the masked content from context. This produces a pretrained audio encoder that can then be fine-tuned with small amounts of labeled data.
6.3 Multimodal Models
The frontier of training combines multiple modalities within a single model.
Meta’s ImageBind binds six modalities — images, text, audio, depth, thermal, and IMU data — into a shared representation space, allowing cross-modal understanding without explicit pairwise training for every combination [21].
Google’s Gemini is trained on interleaved sequences of audio, text, video, and images using a transformer backbone [22]. This means the training data consists of documents where different modalities appear inline — for example, a web page with text and images interleaved, or a video with its audio track and subtitles.
The training approaches for multimodal models vary: some use contrastive objectives (like CLIP), some use generative objectives (next-token prediction over token sequences that represent multiple modalities), and some use hybrid approaches.
7. Parameter-Efficient Fine-Tuning (PEFT)
7.1 The Problem
Full fine-tuning updates every parameter in the model. For a 70B parameter model in 16-bit precision, this requires approximately 140GB of GPU memory just for the model weights, plus additional memory for optimizer states, gradients, and activations. This makes full fine-tuning of large models impractical on most hardware.
7.2 LoRA (Low-Rank Adaptation)
LoRA, introduced by Hu et al. (2021) [23], is the dominant PEFT method. The core idea: instead of updating a full weight matrix W during fine-tuning, represent the update as a product of two smaller matrices: ΔW = BA, where B has dimensions d × r and A has dimensions r × d, with r (the rank) much smaller than d.
During training, the original weight matrix W is frozen. Only the low-rank matrices A and B are trained. At inference time, the update can be merged back into the original weights (W’ = W + BA), adding zero inference latency.
LoRA Architecture:
┌──────────────┐
│ Frozen W │
Input x ──────→│ (d × d) │──────→ Wx
└──────────────┘
+
┌──────┐ ┌──────┐
Input x ──────→│ A │→│ B │──────→ BAx
│(r×d) │ │(d×r) │
└──────┘ └──────┘
trainable trainable
Output = Wx + BAx = (W + BA)x
Where r << d (typically r = 4 to 32)
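The merge identity Wx + BAx = (W + BA)x is easy to verify numerically. A toy sketch with d = 3, r = 1 (real models use d in the thousands and r = 4-32; all matrix values are illustrative):

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen d x d weight
A = [[0.5, 0.25, 0.25]]        # r x d, trainable
B = [[1.0], [2.0], [4.0]]      # d x r, trainable
x = [1.0, 1.0, 1.0]

# Adapter form used during training: Wx + B(Ax)
lora_out = [wx + bax for wx, bax in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Merged form used at inference: (W + BA)x, with zero added latency
BA = matmul(B, A)
W_merged = [[w + d_ for w, d_ in zip(wr, dr)] for wr, dr in zip(W, BA)]
merged_out = matvec(W_merged, x)

print(all(abs(a - b) < 1e-9 for a, b in zip(lora_out, merged_out)))  # → True
```

Note the parameter count: A and B together hold 2rd values versus d² for a full update, which is where the 0.1-0.2% figure comes from.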
The practical impact is dramatic. Typical configurations train only 0.1–0.2% of total model parameters [24]. The Hugging Face PEFT library reports that a LoRA fine-tune of the bigscience/T0_3B model produces a checkpoint of only 19MB compared to 11GB for the full model [24]. Researchers have found that ranks between 4 and 32 provide an effective balance between parameter reduction and performance [25].
LoRA is most commonly applied to the attention weight matrices of the Transformer (typically the query and value projections), though it can be applied to any linear layer in the model.
7.3 QLoRA (Quantized LoRA)
QLoRA combines LoRA with 4-bit quantization of the base model weights. The base model is loaded in 4-bit precision (reducing memory by ~4x), while the LoRA adapters are trained in higher precision. This enables fine-tuning of models that would otherwise not fit in GPU memory.
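The memory arithmetic for the weights alone makes the motivation concrete (optimizer states, gradients, and activations add more on top):

```python
def weight_memory_gb(n_params, bits):
    """Memory required for model weights alone at a given precision."""
    return n_params * bits / 8 / 1e9

# A 70B-parameter model: 16-bit weights vs. 4-bit quantized weights.
print(weight_memory_gb(70e9, 16))  # → 140.0
print(weight_memory_gb(70e9, 4))   # → 35.0
```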
7.4 Other PEFT Methods
Other notable PEFT approaches include [23]:
- Adapter tuning (Houlsby et al., 2019): Inserts small trainable bottleneck modules between existing layers.
- Prefix tuning (Li and Liang, 2021): Prepends trainable virtual tokens to the input of each Transformer layer.
- Prompt tuning (Lester et al., 2021): Similar to prefix tuning but only prepends to the input embedding layer.
7.5 LoRA vs. Full Fine-Tuning
LoRA can match full fine-tuning performance on many tasks including sequence classification, instruction tuning, and chat [26]. However, recent research has shown that LoRA and full fine-tuning learn structurally different solutions, and LoRA has difficulty matching full fine-tuning performance on harder tasks like code generation [26]. The structural differences suggest that while LoRA is an excellent practical tool, it is not simply a compressed version of full fine-tuning — it is a different optimization process with different inductive biases.
8. The Software Stack
8.1 Frameworks
PyTorch is the dominant framework for both research and production training. It provides dynamic computation graphs, native GPU acceleration via CUDA, and the automatic differentiation engine (autograd) needed for backpropagation [13].
Hugging Face Transformers provides pretrained model implementations, tokenizers, and training utilities. It sits atop PyTorch and provides a unified API for thousands of models [24].
Hugging Face TRL (Transformer Reinforcement Learning) implements the RLHF and DPO training pipelines, including SFTTrainer, RewardTrainer, and DPOTrainer [11] [12].
Hugging Face PEFT implements LoRA, QLoRA, and other parameter-efficient methods as a wrapper around the Transformers library [24].
8.2 Model Serving for Inference
vLLM: High-throughput serving with PagedAttention for efficient memory management. Supports OpenAI-compatible API endpoints.
llama.cpp: CPU and single-GPU inference using quantized models. The C++ implementation enables deployment on consumer hardware.
TGI (Text Generation Inference): Hugging Face’s production serving solution.
8.3 Training Infrastructure
Accelerate (Hugging Face): Handles distributed training across multiple GPUs and nodes.
DeepSpeed (Microsoft): Provides ZeRO (Zero Redundancy Optimizer) for memory-efficient distributed training, enabling training of models that don’t fit on a single GPU.
NVIDIA NeMo: GPU-optimized framework for training and deploying large models, particularly strong for multimodal and speech models [27].
8.4 Computer Vision Specific
torchvision: Provides pretrained models (ResNet, ViT, etc.), standard datasets (ImageNet, CIFAR, COCO), and data augmentation transforms [13].
Roboflow: Dataset management and annotation tools for computer vision tasks.
Detectron2 (Meta): Object detection and segmentation framework.
8.5 Audio Specific
Whisper (OpenAI): Open-source ASR model and inference code, MIT licensed [18].
NeMo (NVIDIA): Toolkit for speech recognition, speaker diarization, and other audio tasks [27].
Kaldi: Long-standing open-source toolkit for speech recognition research.
9. Summary: Data Formats by Training Paradigm
| Paradigm | Data Format | Training Objective | Scale | Key Tools |
|---|---|---|---|---|
| LLM Pretraining | Raw text | Next-token prediction | Trillions of tokens | PyTorch, DeepSpeed |
| Supervised Fine-Tuning | (instruction, response) pairs or messages array | Next-token prediction on demonstrations | Thousands–tens of thousands | TRL SFTTrainer, PEFT |
| RLHF | Ranked outputs → reward model → PPO | Maximize reward with KL constraint | Tens of thousands of comparisons | TRL, PPO |
| DPO | (prompt, chosen, rejected) triplets | Preference classification loss | Tens of thousands of pairs | TRL DPOTrainer |
| Image Classification | (image, label) | Cross-entropy classification | Millions of labeled images | torchvision, PyTorch |
| Contrastive (CLIP) | (image, caption) pairs | Contrastive alignment loss | Hundreds of millions of pairs | OpenCLIP, PyTorch |
| Diffusion (Image Gen) | (image, caption) + noise schedule | Denoising (MSE on noise) | Millions of images | diffusers, Stable Diffusion |
| Speech Recognition | (audio spectrogram, transcript) | Sequence-to-sequence cross-entropy | Hundreds of thousands of hours | Whisper, NeMo |
| Self-Supervised Audio | Raw audio | Masked prediction | Thousands of hours unlabeled | wav2vec 2.0 |
| Multimodal | Interleaved sequences across modalities | Various (contrastive, generative, hybrid) | Varies widely | NeMo, custom pipelines |
10. Glossary
Adapter: A small trainable module inserted between frozen layers of a pretrained model, enabling parameter-efficient fine-tuning without modifying original weights (Houlsby et al., 2019).
Alignment: The process of making a model’s outputs consistent with human values and intentions, typically through RLHF or DPO after supervised fine-tuning.
Autoregressive: A generation strategy where each output token is conditioned on all previously generated tokens. Most modern LLMs are autoregressive decoders.
Base model: A model after pretraining but before any fine-tuning. Has broad knowledge but does not follow instructions reliably.
BPE (Byte Pair Encoding): A tokenization algorithm that iteratively merges the most frequent pairs of characters/tokens, producing subword units. Used by GPT models and CLIP.
Causal language modeling (CLM): The pretraining objective for decoder-only models: predict the next token given all preceding tokens.
Chat template: A model-specific format for wrapping multi-turn conversations with special tokens (e.g., <|begin_of_turn|>, <|end_of_turn|>). Different model families use different templates.
CLIP (Contrastive Language-Image Pre-training): A training method that learns aligned image and text representations by training paired encoders on internet-scraped image-caption data (Radford et al., 2021).
Contrastive loss: A loss function that pulls representations of matching pairs closer together while pushing non-matching pairs apart in embedding space.
Cross-entropy loss: The standard loss function for classification and next-token prediction. Measures the difference between predicted probability distribution and the true distribution.
Decoder: In a Transformer, the component that generates output tokens autoregressively. In encoder-decoder models (like Whisper), it attends to the encoder’s output via cross-attention.
Diffusion model: A generative model that learns to iteratively denoise random noise into structured outputs (images, audio). Training involves adding noise to data and learning to reverse the process.
DPO (Direct Preference Optimization): A preference tuning method that directly optimizes the language model on preference pairs without requiring a separate reward model or reinforcement learning (Rafailov et al., 2023).
Encoder: In a Transformer, the component that processes input and produces hidden representations. CLIP uses separate image and text encoders. Whisper uses an audio encoder.
Encoder-decoder: A Transformer architecture with separate encoder and decoder components connected via cross-attention. Used by Whisper, T5, and BART.
Fine-tuning: Adapting a pretrained model to a specific task by continuing training on task-specific data. Can be full (all parameters) or parameter-efficient (LoRA, adapters).
JSONL (JSON Lines): The standard file format for SFT and preference training data. Each line is a complete JSON object representing one training example.
KL divergence (Kullback-Leibler divergence): A measure of how one probability distribution differs from another. Used in RLHF/DPO as a penalty to prevent the fine-tuned model from diverging too far from the reference model.
Log-Mel spectrogram: A 2D representation of audio used as input to speech models. The x-axis is time, the y-axis is frequency (on a Mel scale approximating human hearing), and the values are log-scaled magnitudes.
LoRA (Low-Rank Adaptation): A PEFT method that represents weight updates as a product of two small matrices, training only 0.1–0.2% of total parameters while often matching full fine-tuning performance (Hu et al., 2021).
Masked language modeling (MLM): A pretraining objective where random tokens are masked and the model predicts them. Used by BERT and wav2vec 2.0 (for audio).
PEFT (Parameter-Efficient Fine-Tuning): A family of methods that adapt pretrained models by updating only a small fraction of parameters, reducing compute and memory requirements.
PPO (Proximal Policy Optimization): The reinforcement learning algorithm most commonly used in RLHF. Updates the policy (language model) in small steps, constrained by a clipping mechanism.
Preference data: Training data consisting of pairs of responses to the same prompt, with annotations indicating which response is preferred. The core input for RLHF and DPO.
QLoRA: Combines LoRA with 4-bit quantization of the base model, enabling fine-tuning of large models on consumer GPUs.
Quantization: Reducing the numerical precision of model weights (e.g., from 16-bit to 4-bit) to reduce memory usage and increase inference speed, with some accuracy tradeoff.
Rank (in LoRA): The dimensionality of the low-rank decomposition matrices. Higher rank = more trainable parameters = more expressive updates. Typical values: 4–32.
Reward model: A model trained on human preference data to predict a scalar quality score for any (prompt, response) pair. Used in RLHF but not in DPO.
RLHF (Reinforcement Learning from Human Feedback): A training method that uses human preference rankings to train a reward model, which then guides reinforcement learning optimization of the language model (Ouyang et al., 2022).
SFT (Supervised Fine-Tuning): Training a pretrained model on curated instruction-response pairs to produce an instruction-following model.
Tokenizer: The component that converts raw text into integer token IDs for model input. Common algorithms include BPE, SentencePiece, and WordPiece.
Transformer: The dominant neural network architecture for sequence modeling, based on self-attention mechanisms (Vaswani et al., 2017). The foundation of virtually all modern LLMs, vision transformers, and audio models.
Vision Transformer (ViT): An adaptation of the Transformer architecture for images. Splits an image into patches and processes them as a sequence, analogous to how text is processed as a sequence of tokens.
Zero-shot: The ability to perform a task without any task-specific training examples. CLIP’s zero-shot image classification works by comparing image embeddings to text embeddings of category descriptions.
11. Bibliography
[1] Ouyang, L., Wu, J., Jiang, X., et al. “Training language models to follow instructions with human feedback.” NeurIPS 2022. [Verified] https://arxiv.org/abs/2203.02155
[2] von Werra, L., Belkada, Y., et al. “Illustrating Reinforcement Learning from Human Feedback (RLHF).” Hugging Face Blog. [Verified] https://huggingface.co/blog/rlhf
[3] Kili Technology. “Open-Sourced Training Datasets for Large Language Models (LLMs).” July 2024. [Snippet] https://kili-technology.com/large-language-models-llms/9-open-sourced-datasets-for-training-large-language-models Search evidence: Query “LLM pretraining data types code books web crawl” returned summary of Common Crawl, RedPajama, StarCoder Data, BookCorpus, and ROOTS datasets.
[4] Zhang, J., et al. “Craw4LLM: Efficient Web Crawling for LLM Pretraining.” arXiv:2502.13347, February 2025. [Verified] https://arxiv.org/abs/2502.13347
[5] El Khoury, J. “How Have Pre-Training Datasets for Large Language Models Evolved?” Medium, July 2024. [Snippet] https://medium.com/@jelkhoury880/how-have-pre-training-datasets-for-large-language-models-evolved-13d74c01f8e8 Search evidence: FineWeb described as “15-trillion token dataset derived from 96 CommonCrawl snapshots with a total data volume of 45TB.”
[6] “Pretraining of Large Language Models.” GitHub Gist. [Snippet] https://gist.github.com/ritwikraha/77e79990992043f60a9588610b2781c5 Search evidence: Books described as offering “formal and lengthy texts which help LLMs understand complex language structures.”
[7] Almanach, Inria, et al. “Common Corpus.” arXiv:2506.01732, June 2025. [Verified] https://arxiv.org/html/2506.01732v1
[8] Raschka, S. “LLM Training: RLHF and Its Alternatives.” Ahead of AI / Sebastian Raschka’s Newsletter, September 2023. [Snippet] https://magazine.sebastianraschka.com/p/llm-training-rlhf-and-its-alternatives Search evidence: Describes SFT as using “much smaller datasets than pretraining” because it “requires instruction-output pairs rather than just raw text.”
[9] Martinez, J. “Finetuning an LLM: RLHF and alternatives (Part I).” MantisNLP / Medium, August 2023. [Snippet] https://medium.com/mantisnlp/finetuning-an-llm-rlhf-and-alternatives-part-i-2106b95c8087 Search evidence: Lists OASST1, Dolly, Alpaca, GPT-4-LLM datasets with descriptions.
[10] Wolfe, C.R. “Direct Preference Optimization (DPO).” Cameron R. Wolfe’s Substack, July 2025. [Snippet] https://cameronrwolfe.substack.com/p/direct-preference-optimization Search evidence: Describes DPO as using “a standard classification loss with no RL” and requiring “only two copies of the model.”
[11] Schmid, P. “How to align open LLMs in 2025 with DPO & synthetic data.” January 2025. [Snippet] https://www.philschmid.de/rl-with-llms-in-2025-dpo Search evidence: Describes TRL DPOTrainer with prompt/chosen/rejected format.
[12] Hugging Face. “Preference Tuning LLMs with Direct Preference Optimization Methods.” Hugging Face Blog. [Verified] https://huggingface.co/blog/pref-tuning
[13] GeeksforGeeks. “Computer Vision with PyTorch.” [Snippet] https://www.geeksforgeeks.org/deep-learning/computer-vision-with-pytorch/ Search evidence: Describes torchvision providing “standard datasets like ImageNet, CIFAR, and COCO.”
[14] PyTorch. “Transfer Learning for Computer Vision Tutorial.” PyTorch Tutorials. [Verified] https://docs.pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
[15] OpenAI. “CLIP: Connecting text and images.” January 2021. [Verified] https://openai.com/index/clip/
[16] Wikipedia. “Contrastive Language-Image Pre-training.” [Verified] https://en.wikipedia.org/wiki/Contrastive_Language-Image_Pre-training
[17] “Multimodal diffusion framework for collaborative text image audio generation.” Scientific Reports, 2025. Via PMC. [Snippet] https://pmc.ncbi.nlm.nih.gov/articles/PMC12216180/ Search evidence: Describes diffusion models as “directly modeling the data distribution through a sequence of denoising steps.”
[18] OpenAI. “Introducing Whisper.” September 2022. [Verified] https://openai.com/index/whisper/
[19] Hugging Face. “openai/whisper-large-v3 Model Card.” [Verified] https://huggingface.co/openai/whisper-large-v3
[20] Hugging Face. “Fine-Tune Whisper For Multilingual ASR with Transformers.” [Verified] https://huggingface.co/blog/fine-tune-whisper
[21] Wow Labz. “Top 12 Multimodal AI Models You Should Know In 2025.” August 2025. [Snippet] https://wowlabz.com/top-12-multimodal-ai-models/ Search evidence: Describes ImageBind working with “six modalities — images, text, audio, depth sensors, thermal data, and motion (IMU).”
[22] Mor Software. “TOP 10 Leading Multimodal AI Models in 2025.” October 2025. [Snippet] https://morsoftware.com/blog/multimodal-ai-models Search evidence: Describes Gemini as “trained on interleaved sequences of audio, text, video, and images using a transformer backbone.”
[23] Springer Nature. “Parameter-efficient fine-tuning in large language models: a survey of methodologies.” Artificial Intelligence Review, May 2025. [Snippet] https://link.springer.com/article/10.1007/s10462-025-11236-4 Search evidence: Lists “LoRA (Hu et al. 2021), adapter tuning, prefix tuning (Li and Liang 2021), prompt tuning (Lester et al. 2021)” as notable PEFT approaches.
[24] Hugging Face. “PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.” GitHub. [Verified] https://github.com/huggingface/peft
[25] GoML. “Parameter Efficient Fine Tuning: LoRA.” May 2025. [Snippet] https://www.goml.io/blog/parameter-efficient-fine-tuning-lora Search evidence: Reports “ranks between 4 and 32 strike an excellent balance.”
[26] Biderman, D., et al. “LoRA vs Full Fine-tuning: An Illusion of Equivalence.” arXiv:2410.21228, October 2024. [Verified] https://arxiv.org/html/2410.21228v2
[27] ThirdEyeData. “Top 18 Tools and Platforms for Multimodal AI Solutions Development in 2025–26.” January 2026. [Snippet] https://thirdeyedata.ai/top-18-tools-and-platforms-for-multimodal-ai-solutions-development-in-2025-26/ Search evidence: Describes NeMo as providing “tools to train and deploy large multimodal models with GPU optimization.”
Related Reading
- Vaswani, A., et al. “Attention Is All You Need.” NeurIPS 2017. https://arxiv.org/abs/1706.03762 — The original Transformer paper.
- Hoffmann, J., et al. “Training Compute-Optimal Large Language Models.” arXiv:2203.15556, 2022. — The Chinchilla scaling laws.
- Rafailov, R., et al. “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” NeurIPS 2023. https://arxiv.org/abs/2305.18290 — The original DPO paper.
- Hu, E.J., et al. “LoRA: Low-Rank Adaptation of Large Language Models.” ICLR 2022. https://arxiv.org/abs/2106.09685 — The original LoRA paper.
- Bai, Y., et al. “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.” arXiv:2204.05862, 2022. — Anthropic’s RLHF approach.
- Radford, A., et al. “Learning Transferable Visual Models From Natural Language Supervision.” ICML 2021. https://arxiv.org/abs/2103.00020 — The original CLIP paper.
- Radford, A., et al. “Robust Speech Recognition via Large-Scale Weak Supervision.” arXiv:2212.04356, 2022. — The Whisper paper.