A Technical Primer, March 2026
1. Introduction
This document provides a technically grounded overview of fine-tuning large language models (LLMs) for tool-calling and function-calling capabilities. The intended reader is a computer scientist who is familiar with machine learning fundamentals (gradient descent, transformer architectures, attention mechanisms) but needs to be brought up to speed on the specific engineering of parameter-efficient fine-tuning (PEFT) and its application to teaching a model when and how to invoke external tools.
The motivating scenario is a practical one: migrating a tool-calling chatbot from a proprietary API (such as OpenAI’s fine-tuning service) to a self-hosted open-weight model. This requires understanding what fine-tuning changes inside the model, how the training data must be structured, what hardware is needed, and what software components are involved.
The document covers five areas:
- The theoretical basis for parameter-efficient fine-tuning (LoRA and QLoRA).
- The structure of training data for tool-calling.
- The end-to-end data flow of a fine-tuning pipeline.
- Hardware requirements and tradeoffs.
- Evaluation and deployment considerations.
2. Glossary of Key Terms
Adapter. A small set of trainable parameters inserted into a frozen pretrained model. The adapter modifies the model’s behaviour without altering the original weights. LoRA is a specific type of adapter architecture [1].
Autoregressive model. A model that generates output one token at a time, where each token is conditioned on all preceding tokens. All LLMs discussed here (Llama, Mistral, Qwen, GPT) are autoregressive [2].
Base model. The pretrained model before any task-specific fine-tuning. Base models have been trained on large corpora to predict the next token and possess broad language capabilities but no instruction-following or tool-calling behaviour.
BF16 (BrainFloat16). A 16-bit floating point format that preserves the dynamic range of FP32 (8 exponent bits) while reducing precision (7 mantissa bits vs. 23 for FP32). Preferred over FP16 for training because it avoids overflow issues on large gradient values. Supported natively on A100, H100, and RTX 30/40/50 series GPUs.
Chat template. A format specification that defines how system prompts, user messages, assistant responses, tool calls, and tool results are tokenized and delimited within the model’s input. Each model family (Llama, Mistral, Qwen) defines its own chat template using special tokens. The template determines how the model learns to distinguish between conversational roles.
Fine-tuning. The process of further training a pretrained model on task-specific data to adapt its behaviour. Full fine-tuning updates all model parameters. Parameter-efficient fine-tuning updates only a small subset.
FP16 (Half Precision). A 16-bit floating point format with 5 exponent bits and 10 mantissa bits. Widely used for inference and training. Requires approximately 2 bytes per parameter.
FP32 (Full Precision). Standard 32-bit floating point. Requires 4 bytes per parameter. Used for optimizer states in most training configurations.
Function calling / Tool calling. The capability of an LLM to emit structured output (typically JSON) that invokes an external function or API, receive the result, and incorporate it into a conversational response. “Function calling” and “tool calling” are used interchangeably in practice, though some frameworks distinguish them.
Gradient checkpointing. A memory optimization technique that trades compute for memory during backpropagation. Rather than storing all intermediate activations during the forward pass, only a subset of activations are retained, and the rest are recomputed during the backward pass. Reduces activation memory by approximately 50% at a cost of roughly 20-30% additional compute time [3].
Instruct model. A model that has been fine-tuned (typically via supervised fine-tuning on instruction-response pairs, followed by alignment techniques such as RLHF or DPO) to follow instructions, maintain a conversational format, and refuse harmful requests. Fine-tuning for tool-calling should start from an instruct model, not a base model, when the desired behaviour combines general chat with optional function calls [4].
LoRA (Low-Rank Adaptation). A parameter-efficient fine-tuning method that freezes the pretrained model weights and injects trainable low-rank decomposition matrices into specified layers. Introduced by Hu et al. (2021) [1].
NF4 (4-bit NormalFloat). A quantization data type introduced by Dettmers et al. (2023) that is information-theoretically optimal for normally distributed data. Neural network weights approximately follow normal distributions, so NF4 allocates more quantization bins near zero (where most weights cluster) and fewer in the tails, achieving better reconstruction quality than uniform 4-bit quantization [5].
PEFT (Parameter-Efficient Fine-Tuning). A family of methods that adapt pretrained models by training only a small number of additional or modified parameters rather than updating the full model. Includes LoRA, QLoRA, prefix tuning, adapter layers, and others. Reduces both memory and compute requirements for fine-tuning [1][5].
Quantization. The process of reducing the numerical precision of model weights (e.g., from 16-bit to 4-bit) to decrease memory usage. Quantization is lossy — it discards information — but well-designed schemes (NF4, GPTQ, AWQ) minimize the impact on model quality.
QLoRA (Quantized Low-Rank Adaptation). A fine-tuning method that combines 4-bit NormalFloat quantization of the base model with LoRA adapters trained in 16-bit precision. Introduced by Dettmers et al. (2023) [5]. Enables fine-tuning of models that would otherwise not fit in GPU memory.
Rank (r). In the context of LoRA, the dimensionality of the low-rank decomposition matrices. A rank of r means each adapter consists of two matrices: a down-projection from the layer width d to r, and an up-projection from r back to d. Lower rank means fewer trainable parameters; typical values for tool-calling fine-tuning range from 8 to 64 [1][4].
SFT (Supervised Fine-Tuning). The standard method of fine-tuning where the model is trained on input-output pairs using next-token prediction loss. The “supervised” label distinguishes it from reinforcement learning-based methods (RLHF, DPO, GRPO). For tool-calling, SFT on demonstration data is the standard first step [4].
Tool schema. A structured description (typically JSON Schema) of an available function, including its name, description, and parameter types. Tool schemas are injected into the system prompt so the model knows what tools are available and how to invoke them.
VRAM (Video Random Access Memory). The high-speed memory on a GPU. VRAM capacity is the primary constraint on what models can be trained on a given GPU, because model weights, optimizer states, gradients, and activations must all fit in VRAM simultaneously during training.
3. Why Fine-Tuning Works: Theoretical Foundations
3.1. The Intrinsic Dimensionality Hypothesis
The theoretical motivation for LoRA originates in the observation that pretrained models, despite having billions of parameters, exhibit low intrinsic dimensionality when adapted to downstream tasks. Aghajanyan et al. (2020) demonstrated that fine-tuning updates to large models can be projected into a surprisingly low-dimensional subspace without significant loss of task performance [6]. This means that the “direction” in parameter space that the model needs to move during fine-tuning can be captured by a much smaller number of parameters than the full model contains.
Hu et al. formalized this insight as the basis for LoRA: “We hypothesize that the change in weights during model adaptation also has a low ‘intrinsic rank’” [1]. If the weight update matrix ΔW has low rank, it can be decomposed into the product of two smaller matrices without significant information loss.
3.2. LoRA: The Mathematical Structure
Consider a pretrained weight matrix W₀ ∈ ℝ^(d×k) in a transformer layer (e.g., a query, key, value, or output projection matrix in the attention mechanism). During standard fine-tuning, this matrix would be updated to W₀ + ΔW, where ΔW ∈ ℝ^(d×k) has the same dimensionality as W₀.
LoRA constrains ΔW to be low-rank by decomposing it as:
ΔW = BA
where B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), and r ≪ min(d, k) [1].
During training:
- W₀ is frozen (no gradient updates).
- Only A and B are trained.
- The forward pass computes: h = W₀x + BAx, where x is the input.
- A scaling factor α/r is applied to the LoRA output, where α (lora_alpha) controls the magnitude of the adaptation relative to the pretrained weights [1].
Initialization: A is initialized with a random Gaussian distribution, and B is initialized to zero. This ensures that at the start of training, ΔW = BA = 0, so the model begins with exactly the pretrained behaviour [1].
Parameter count. For a single layer with dimensions d×k and rank r, LoRA adds r(d+k) parameters. For a typical 8B model where d = k = 4096 and r = 16, each adapted layer adds 2 × 4096 × 16 = 131,072 parameters — approximately 0.8% of that layer's 16.8M original parameters. Across all targeted layers, total trainable parameters are typically 0.1–1% of the full model [1].
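The structure described above fits in a few lines. The following is an illustrative numpy sketch, not the PEFT library's actual implementation; the dimensions and rank follow the worked example in the text.

```python
import numpy as np

class LoRALinear:
    """Minimal sketch of a LoRA-adapted linear layer (illustration only)."""

    def __init__(self, W0, r=16, alpha=32, seed=0):
        d, k = W0.shape
        rng = np.random.default_rng(seed)
        self.W0 = W0                          # frozen pretrained weights
        self.A = rng.normal(0, 0.02, (r, k))  # Gaussian init
        self.B = np.zeros((d, r))             # zero init, so BA = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        # h = W0 x + (alpha/r) B A x; only A and B would receive gradients
        return self.W0 @ x + self.scale * (self.B @ (self.A @ x))

    def trainable_params(self):
        return self.A.size + self.B.size

layer = LoRALinear(np.zeros((4096, 4096)), r=16)
print(layer.trainable_params())  # 2 * 4096 * 16 = 131072
```

Because B starts at zero, the adapted layer initially computes exactly W₀x, matching the initialization guarantee stated above.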
Why this works for tool-calling. Teaching a model to emit tool-call JSON when appropriate, and to incorporate tool results into responses, is a behavioural adaptation — it changes when and how the model produces certain token patterns, not the fundamental knowledge encoded in its weights. This is exactly the kind of low-rank modification that LoRA was designed for. Practitioners working specifically on function-calling fine-tuning report that LoRA is sufficient and that, in the low-data regimes typical of tool-calling (as few as 1,000 examples), LoRA can paradoxically outperform full fine-tuning because fewer parameters need to converge [4].
3.3. QLoRA: Fitting Larger Models on Smaller GPUs
QLoRA, introduced by Dettmers et al. at NeurIPS 2023, addresses the memory constraint: even with LoRA, the frozen base model must still fit in GPU memory. An 8B model in FP16 requires approximately 16 GB just for the weights. QLoRA reduces this by quantizing the frozen base model to 4-bit precision, cutting weight storage by approximately 4× [5].
QLoRA introduces three innovations:
4-bit NormalFloat (NF4). Neural network weights are approximately normally distributed. Standard uniform quantization allocates bins evenly across the value range, wasting bins on sparsely populated regions in the distribution tails. NF4 instead allocates bins according to the expected quantile structure of a normal distribution, ensuring each bin is assigned approximately the same number of weights. This is information-theoretically optimal for normally distributed data [5].
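The quantile idea can be sketched with the standard library. This is a simplification: the actual NF4 construction in the paper builds the negative and positive halves separately so that zero is represented exactly, which this sketch does not do.

```python
from statistics import NormalDist

def nf4_levels_sketch(k=16):
    """Approximate the NF4 idea: place k quantization levels at evenly
    spaced quantiles of a standard normal, then rescale to [-1, 1].
    Equal probability mass per bin means equal weight counts per bin."""
    nd = NormalDist()
    # evenly spaced probabilities, avoiding the infinite 0/1 quantiles
    probs = [(i + 0.5) / k for i in range(k)]
    quantiles = [nd.inv_cdf(p) for p in probs]
    m = max(abs(q) for q in quantiles)
    return [q / m for q in quantiles]
```

Printing the result shows the levels clustering near zero and thinning out toward ±1, which is the property that gives NF4 its reconstruction advantage over uniform 4-bit grids.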
Double Quantization. Quantization requires storing scaling constants (one per block of weights, typically blocks of 64). These constants are themselves stored in FP32, adding 32/64 = 0.5 bits per parameter. Double quantization quantizes these constants into FP8 with a larger block size (256), reducing the overhead to approximately 0.127 bits per parameter — a savings of roughly 3 GB for a 65B model [5].
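The overhead arithmetic can be verified directly:

```python
# Per-parameter overhead of quantization constants, in bits
single = 32 / 64                   # one FP32 constant per 64-weight block: 0.5
double = 8 / 64 + 32 / (64 * 256)  # FP8 constants, themselves quantized: ~0.127
savings_gb = (single - double) * 65e9 / 8 / 1e9  # bits -> bytes -> GB
print(round(double, 3), round(savings_gb, 1))    # 0.127 3.0
```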
Paged Optimizers. During training, optimizer states (e.g., AdamW’s first and second moment estimates) can cause memory spikes when processing long sequences. Paged optimizers use NVIDIA’s unified memory to automatically page optimizer states to CPU RAM when GPU memory is exhausted, and page them back when needed [5].
The forward/backward pass under QLoRA. The base model weights are stored in 4-bit NF4 format. During the forward pass, each block of weights is dequantized to BF16 on-the-fly before matrix multiplication. The LoRA adapter weights (A and B) are maintained in 16-bit precision throughout. Gradients flow through the dequantized base model weights into the LoRA adapters, which are the only parameters updated [5].
Performance characteristics. Dettmers et al. demonstrated that QLoRA fine-tuning preserves full 16-bit fine-tuning task performance: their Guanaco model family, fine-tuned with QLoRA, reached 99.3% of ChatGPT’s performance on the Vicuna benchmark [5]. The dequantization overhead makes QLoRA training approximately 39% slower than standard LoRA per step [3], but the memory savings are substantial: an 8B model that requires ~16 GB in FP16 requires roughly 4–5 GB in NF4, leaving ample headroom for optimizer states, activations, and LoRA weights on a 24 GB consumer GPU.
4. Training Data for Tool-Calling
4.1. Data Format: Conversational JSONL with Tool Turns
Tool-calling training data follows the chat-completion format: a sequence of messages with defined roles. The standard roles are:
- system: Contains instructions and the tool schemas (function definitions).
- user: The user’s natural language request.
- assistant: The model’s response, which may be natural language, a tool call (structured JSON), or a combination.
- tool: The result returned by the external function after execution.
A single training example is a complete conversation demonstrating the desired behaviour. The model learns both when to call a tool (vs. responding directly) and how to format the call.
4.2. Example Training Sequence
{
"messages": [
{
"role": "system",
"content": "You are a helpful assistant. You have access to the following tools:\n\n[{\"name\": \"get_weather\", \"description\": \"Get current weather for a city\", \"parameters\": {\"type\": \"object\", \"properties\": {\"city\": {\"type\": \"string\"}}, \"required\": [\"city\"]}}]"
},
{
"role": "user",
"content": "What's the weather like in Toronto?"
},
{
"role": "assistant",
"tool_calls": [
{
"function": {
"name": "get_weather",
"arguments": "{\"city\": \"Toronto\"}"
}
}
]
},
{
"role": "tool",
"content": "{\"temperature\": \"2°C\", \"condition\": \"Cloudy\"}"
},
{
"role": "assistant",
"content": "It's currently 2°C and cloudy in Toronto."
}
]
}
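A data-preparation script can sanity-check examples like the one above before training. This sketch assumes the message layout shown; the exact required fields vary by framework, so treat it as a starting point rather than a complete validator.

```python
import json

VALID_ROLES = {"system", "user", "assistant", "tool"}

def validate_example(example):
    """Check one training conversation: every message has a known role,
    and every tool call carries arguments that parse as JSON.
    Returns a list of problems found (empty means the example passed)."""
    problems = []
    for i, msg in enumerate(example["messages"]):
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unknown role {msg.get('role')!r}")
        for call in msg.get("tool_calls", []):
            try:
                json.loads(call["function"]["arguments"])
            except (KeyError, json.JSONDecodeError):
                problems.append(f"message {i}: malformed tool call")
    return problems
```

Running this over every line of the JSONL file before training catches the most common data bugs (typoed roles, unescaped argument strings) cheaply.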
4.3. Critical Data Design Decisions
Objective complexity. Garbacki (Fireworks AI) distinguishes several tiers of tool-calling objectives, in increasing order of difficulty [4]:
- Single-turn forced function call — every user message triggers a function call. Simplest to train.
- Parallel function calling — multiple tools invoked simultaneously.
- Nested function calls — output of one tool becomes input to another.
- Multi-turn chat with optional function calls — the model must decide whether to call a tool or respond in natural language. Most complex, and the most common production requirement.
The choice of objective determines the complexity of training data required and the difficulty of fine-tuning.
Data volume. For LoRA-based supervised fine-tuning on tool-calling, practitioners report that as few as 1,000 high-quality examples can produce good results [4]. For downstream alignment with DPO (preference optimization), even 100 examples may suffice [4]. The emphasis is on data quality over quantity: well-curated examples covering the full range of tool schemas, edge cases, and refusal scenarios (where the model should not call a tool) are more valuable than large volumes of repetitive examples.
Negative examples. Training data should include conversations where the user’s request does not require a tool call, to prevent the model from over-triggering tool calls. It should also include cases where the user asks for a tool that is not available, teaching the model to decline gracefully.
Chat template alignment. Each model family uses a different chat template to delimit roles and special tokens. Llama 3 uses <|start_header_id|>, <|end_header_id|>, and <|eot_id|>. Mistral uses [INST] and [/INST]. The training data must be formatted to match the target model’s expected template, or the fine-tune will produce incoherent output. Using the instruct version of a model (rather than the base version) is recommended when the fine-tune mixes general chat with optional function calling, because the instruct model already knows how to follow instructions and the fine-tune only needs to add tool-calling capability [4].
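To make the template concrete, the following hand-rolled rendering shows the Llama 3 layout using the special tokens named above. It is for illustration only: in practice the tokenizer's apply_chat_template() method should be used, since it also handles tool schemas, tool turns, and generation prompts correctly.

```python
def render_llama3(messages):
    """Illustrative rendering of a conversation into the Llama 3 chat
    layout. Real code should call tokenizer.apply_chat_template()."""
    out = "<|begin_of_text|>"
    for msg in messages:
        out += f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
        out += msg["content"] + "<|eot_id|>"
    return out
```

Feeding data rendered with the wrong template (e.g. Mistral's [INST] markers to a Llama model) is one of the most common causes of incoherent fine-tune output.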
4.4. The Role of Constrained Decoding
A technique increasingly used alongside fine-tuning is constrained decoding (also called constrained generation or grammar-constrained generation). When the model begins emitting a tool call, the serving infrastructure restricts token generation to only those tokens that are valid according to the tool’s JSON schema. This eliminates structural hallucinations (e.g., malformed JSON, non-existent parameter names) and can speed up inference by autocompleting predictable tokens [4]. Constrained decoding operates at inference time and is independent of fine-tuning, but it complements fine-tuning by handling the syntactic correctness that the model occasionally fails at, while fine-tuning handles the semantic correctness of when and why to call a tool.
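The core idea can be illustrated with a minimal prefix check: at each decoding step, candidate next tokens are kept only if the partial output could still be extended into valid JSON. Production systems such as vLLM's structured-output support compile the full JSON Schema into a token-level grammar; this simplified sketch checks only JSON syntax, not schema conformance.

```python
import json

def is_valid_json_prefix(text):
    """Heuristic check: can `text` still be extended into valid JSON?
    A grammar-constrained decoder applies a test like this to mask out
    next-token candidates that would make the output unparseable."""
    try:
        json.loads(text)
        return True                       # already complete, valid JSON
    except json.JSONDecodeError as err:
        if err.msg.startswith("Unterminated string"):
            return True                   # inside a string literal: fixable
        # An error at the very end means "incomplete"; earlier means "broken".
        return err.pos >= len(text.rstrip())
```

For example, the prefix '{"city": "Tor' passes (it can be completed), while '{"city": }' fails (no continuation can repair it).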
5. End-to-End Data Flow
5.1. The Fine-Tuning Pipeline
The following diagram describes the data flow from raw training data to a deployed fine-tuned model. Each numbered step is described in detail below.
┌─────────────────────────────────────────────────────────────┐
│ FINE-TUNING PIPELINE │
│ │
│ ┌──────────┐ ┌───────────┐ ┌──────────────────────┐ │
│ │ Training │───>│ Template │───>│ Tokenization │ │
│ │ Data │ │ Formatting│ │ (chat template + │ │
│ │ (JSONL) │ │ │ │ special tokens) │ │
│ └──────────┘ └───────────┘ └──────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ GPU MEMORY LAYOUT │ │
│ │ │ │
│ │ ┌─────────────────────┐ ┌───────────────────────┐ │ │
│ │ │ Base Model Weights │ │ LoRA Adapter Weights │ │ │
│ │ │ (Frozen) │ │ (Trainable) │ │ │
│ │ │ NF4: ~4-5 GB (8B) │ │ BF16: ~50-200 MB │ │ │
│ │ │ FP16: ~16 GB (8B) │ │ (depends on rank, # │ │ │
│ │ └─────────────────────┘ │ layers targeted) │ │ │
│ │ └───────────────────────┘ │ │
│ │ ┌─────────────────────┐ ┌───────────────────────┐ │ │
│ │ │ Optimizer States │ │ Activations / │ │ │
│ │ │ (AdamW moments │ │ Gradients │ │ │
│ │ │ for LoRA params) │ │ (varies with batch │ │ │
│ │ │ ~100-400 MB │ │ size, seq length) │ │ │
│ │ └─────────────────────┘ └───────────────────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ TRAINING LOOP │ │
│ │ │ │
│ │ For each batch: │ │
│ │ 1. Load tokenized examples │ │
│ │ 2. Forward pass: │ │
│ │ - Dequantize base weights NF4 → BF16 │ │
│ │ - Compute h = W₀x + (α/r)BAx │ │
│ │ 3. Compute cross-entropy loss on next-token │ │
│ │ prediction (masked to assistant/tool turns) │ │
│ │ 4. Backward pass: │ │
│ │ - Gradients flow through frozen W₀ │ │
│ │ - Only A, B matrices are updated │ │
│ │ 5. Optimizer step (AdamW on LoRA params only) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ OUTPUT │ │
│ │ │ │
│ │ Option A: LoRA adapter files (~50-200 MB) │ │
│ │ - Loaded at inference time alongside base model │ │
│ │ - Multiple adapters can share one base model │ │
│ │ │ │
│ │ Option B: Merged model (adapter + base → single │ │
│ │ model with full weights) │ │
│ │ - No inference overhead │ │
│ │ - Larger file size (same as original model) │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
5.2. Step-by-Step Explanation
Step 1: Data Preparation. Raw training examples (JSONL format, one conversation per line) are loaded and validated. Each conversation must contain correctly structured message objects with the appropriate roles and fields for the target objective.
Step 2: Template Formatting. Each conversation is converted into the target model’s chat template. This means inserting the model-specific special tokens that delimit system, user, assistant, and tool roles. For Llama 3, this involves wrapping each message in <|start_header_id|>{role}<|end_header_id|> and <|eot_id|> markers. The tokenizer’s apply_chat_template() method typically handles this transformation.
Step 3: Tokenization. The formatted text is tokenized into integer token IDs using the model’s tokenizer. Crucially, a loss mask is applied: the training loss is computed only on the tokens the model should learn to generate (assistant responses and tool calls), not on the tokens it should learn to read (system prompts, user messages, tool results). This is the distinction between “input” and “label” portions of the training sequence.
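The loss mask can be sketched as follows. The -100 sentinel is the default ignore_index of PyTorch's cross-entropy loss; the token IDs in the usage example are hypothetical.

```python
IGNORE_INDEX = -100  # cross-entropy skips positions carrying this label

def build_labels(segments):
    """segments: (token_ids, trainable) pairs in conversation order.
    System/user/tool tokens get IGNORE_INDEX so they contribute no loss;
    assistant tokens keep their IDs and are what the model learns to emit."""
    input_ids, labels = [], []
    for token_ids, trainable in segments:
        input_ids += token_ids
        labels += token_ids if trainable else [IGNORE_INDEX] * len(token_ids)
    return input_ids, labels

ids, labels = build_labels([
    ([101, 5, 6], False),  # system + user prompt: read, not learned
    ([7, 8, 9], True),     # assistant tool call: learned
    ([10, 11], False),     # tool result: read, not learned
    ([12, 13], True),      # final assistant answer: learned
])
```

The model still attends to the masked tokens during the forward pass; masking only removes them from the loss, which is exactly the input/label distinction described above.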
Step 4: Model Loading. The base model is loaded into GPU memory. Under QLoRA, the model weights are loaded in 4-bit NF4 format using the bitsandbytes library [7], reducing memory from ~16 GB (FP16) to ~4-5 GB for an 8B model. LoRA adapter matrices are then inserted into the targeted layers (typically the query, key, value, and output projection matrices in each attention block, plus sometimes the MLP layers) using the PEFT library [8].
Step 5: Training Loop. Standard supervised fine-tuning with next-token prediction loss. For each training example, the model processes the full tokenized conversation. The loss is the cross-entropy between the model’s predicted next-token distribution and the actual next token, masked to only the assistant/tool portions. Gradients are computed for the LoRA parameters only; the base model weights remain frozen. The optimizer (typically AdamW or its 8-bit paged variant) updates only the LoRA A and B matrices.
Step 6: Output. After training, the LoRA adapter weights are saved as small checkpoint files (typically 50-200 MB depending on rank and number of targeted layers). These can be:
- Loaded alongside the base model at inference time (the PEFT library handles this transparently).
- Merged into the base model weights, producing a single model with no adapter overhead. This is done by computing W_final = W₀ + (α/r)BA for each adapted layer.
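Merging is a one-time weight update per adapted layer; a numpy sketch (small illustrative dimensions, not a real model's):

```python
import numpy as np

def merge_lora(W0, A, B, alpha, r):
    """Fold a LoRA adapter into the base weights:
    W_final = W0 + (alpha/r) B A.
    After merging, inference uses W_final directly, with no adapter overhead."""
    return W0 + (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
d, k, r = 64, 64, 8
W0 = rng.normal(size=(d, k))
A = rng.normal(size=(r, k))
B = rng.normal(size=(d, r))
W_merged = merge_lora(W0, A, B, alpha=16, r=r)
```

The merged layer computes the same function as base-plus-adapter, since W_merged x = W₀x + (α/r)BAx by distributivity.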
5.3. The Inference Pipeline
┌────────────────────────────────────────────────────────┐
│ INFERENCE PIPELINE │
│ │
│ User message │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ System prompt │◄── Tool schemas (JSON) │
│ │ + user msg │ │
│ │ + history │ │
│ └──────┬───────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ ┌─────────────────────┐ │
│ │ Fine-tuned │────>│ Output: text OR │ │
│ │ Model │ │ tool_call JSON │ │
│ └──────────────┘ └──────────┬──────────┘ │
│ │ │
│ ┌─────────────┴────────────┐ │
│ │ │ │
│ Text response Tool call │
│ (return to user) │ │
│ ▼ │
│ ┌────────────────┐ │
│ │ Execute tool │ │
│ │ (API call, │ │
│ │ DB query, │ │
│ │ computation) │ │
│ └───────┬────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────┐ │
│ │ Tool result │ │
│ │ appended to │ │
│ │ conversation │ │
│ └───────┬────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────┐ │
│ │ Model generates│ │
│ │ final response │ │
│ │ incorporating │ │
│ │ tool result │ │
│ └────────────────┘ │
└────────────────────────────────────────────────────────┘
The orchestration loop (deciding whether the model’s output is a tool call or text, executing the tool, and re-prompting) is implemented in application code outside the model. Frameworks such as vLLM and the Hugging Face transformers library provide the model serving layer; the orchestration logic is typically custom code or provided by agent frameworks.
6. GPU Hardware Requirements
6.1. Memory Budgeting
During fine-tuning, GPU memory must simultaneously hold four categories of data [9][10]:
| Component | Full Fine-Tune (FP16) | LoRA (FP16 base) | QLoRA (NF4 base) |
|---|---|---|---|
| Model weights (8B) | ~16 GB | ~16 GB (frozen) | ~4-5 GB (frozen, NF4) |
| Trainable parameters | All (16 GB) | ~50-200 MB (adapters only) | ~50-200 MB (adapters only) |
| Optimizer states (AdamW) | ~64 GB (two FP32 moment estimates, 8 bytes per parameter) | ~100-400 MB (2× adapter size) | ~100-400 MB (2× adapter size) |
| Gradients | ~16 GB | ~50-200 MB | ~50-200 MB |
| Activations | Variable (batch size × seq length dependent) | Variable | Variable |
| Total (8B model) | ~100-120 GB | ~24-32 GB | ~8-22 GB |
These figures explain why full fine-tuning of an 8B model requires multiple data-centre GPUs, while QLoRA enables the same model to be fine-tuned on a single consumer GPU. The rule of thumb for full fine-tuning — approximately 16 GB VRAM per billion parameters — captures the sum of weights, gradients, and optimizer states [9].
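The rule of thumb can be expressed directly. The per-parameter byte counts below are the assumptions: mixed-precision AdamW with an FP32 master copy for full fine-tuning, and NF4 with double-quantization overhead for the QLoRA base weights.

```python
def full_finetune_gb(n_params_billion):
    """16 bytes/param: 2 (FP16 weights) + 2 (gradients)
    + 4 (FP32 master weights) + 8 (two FP32 AdamW moments)."""
    return 16 * n_params_billion

def qlora_base_weights_gb(n_params_billion):
    """NF4 weights at 0.5 bytes/param plus ~0.127 bits/param of
    quantization-constant overhead; excludes activations and adapters."""
    return (0.5 + 0.127 / 8) * n_params_billion

print(full_finetune_gb(8), round(qlora_base_weights_gb(8), 1))  # 128 4.1
```

Both results match the table: ~128 GB for a full 8B fine-tune, and ~4-5 GB of frozen NF4 weights under QLoRA.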
6.2. Concrete Hardware Configurations
For an 8B instruct model fine-tuned with QLoRA for tool-calling (the standard starting configuration):
Minimum viable: RTX 3090 / RTX 4090 (24 GB VRAM). QLoRA on an 8B model uses approximately 8-22 GB VRAM depending on batch size and sequence length [3][11]. A single RTX 4090 handles this comfortably, and fine-tuning on a dataset of ~1,000 tool-calling examples completes in a few hours. A workstation built around a single RTX 4090 costs approximately $2,500-3,500 total [12].
Recommended for faster iteration: A100 40GB or L40S 48GB. These data-centre GPUs offer more VRAM headroom and faster training throughput; a larger run that takes 2-4 days on an RTX 4090 can complete in under a day on an A100 80GB [12]. A100s are available as cloud rentals at $2.50-4.00/hour [13].
For 13B models: RTX 4090 (QLoRA) or L40S 48GB (LoRA). QLoRA allows a 13B model to fit on a 24 GB GPU [10][11]. For LoRA without quantization, the 13B model’s 26 GB of FP16 weights exceed 24 GB, requiring a 40-48 GB GPU [13].
For 70B models: A100 80GB (QLoRA) or multi-GPU. A 70B model in NF4 requires approximately 40-45 GB [11], which fits on a single A100 80GB. Full fine-tuning of 70B models requires 4-8 A100 80GB GPUs with NVLink interconnects [12].
6.3. Consumer Multi-GPU Considerations
Consumer GPUs (RTX 3090, 4090) can be combined in multi-GPU configurations, but with important caveats:
- Consumer GPUs communicate via PCIe, which provides approximately 32 GB/s bandwidth. Each GPU’s internal memory bandwidth exceeds 1,000 GB/s. This asymmetry means that inter-GPU communication during training introduces significant overhead [14].
- Two consumer GPUs on PCIe deliver approximately 1.4-1.6× speedup, not 2× [14]. Data-centre GPUs with NVLink achieve better scaling.
- For a 2-GPU consumer setup, standard PCIe x4 lanes are acceptable. For 4 GPUs, x8 lanes per GPU are recommended; at x4 lanes the performance penalty is approximately 5-10% [15].
- Frameworks such as DeepSpeed ZeRO and PyTorch FSDP minimize inter-GPU communication during training, making PCIe-connected consumer GPUs more viable than the raw bandwidth numbers suggest [12].
However, for the 8B tool-calling fine-tune, multi-GPU is unnecessary — a single RTX 4090 suffices.
6.4. Apple Silicon as an Alternative
Apple Silicon merits brief mention as an alternative for these workloads. The M5 Max (128 GB unified memory, 614 GB/s bandwidth) can run 8B QLoRA fine-tuning, but training throughput is approximately 2-4× slower than an RTX 4090 due to the lack of dedicated tensor cores. The primary advantage is memory capacity: the 128 GB unified memory allows fitting models (30B, 70B) that exceed the RTX 4090's 24 GB VRAM wall. The software ecosystem (MLX, PyTorch MPS backend) is less mature than CUDA, particularly for training workflows.
7. Software Ecosystem
7.1. Core Libraries
Hugging Face Transformers. The primary library for loading, configuring, and running transformer models. Provides the AutoModelForCausalLM class for loading base models with quantization configurations, and the tokenizer infrastructure for applying chat templates [8].
PEFT (Parameter-Efficient Fine-Tuning). Hugging Face’s library for LoRA, QLoRA, and other PEFT methods. Provides LoraConfig for specifying rank, alpha, target modules, and dropout. Handles the injection of adapter matrices into the base model and the saving/loading of adapter checkpoints [8].
bitsandbytes. Tim Dettmers’ library implementing 4-bit NF4 quantization, 8-bit optimizers, and paged optimizers. Integrates with Hugging Face Transformers via BitsAndBytesConfig to enable loading models in 4-bit precision [7].
TRL (Transformer Reinforcement Learning). Hugging Face’s library for fine-tuning with SFT, DPO, and reinforcement learning. The SFTTrainer class provides a high-level interface for supervised fine-tuning with automatic handling of data formatting, loss masking, and gradient accumulation [8].
Unsloth. An optimization framework that claims up to 30× faster training speeds and 60% reduced memory usage for LoRA/QLoRA fine-tuning by implementing custom CUDA kernels. Supports NVIDIA, AMD, and Intel GPUs [16].
Axolotl. A higher-level fine-tuning framework that wraps Transformers, PEFT, and DeepSpeed with YAML-driven configuration. Widely used for reproducible fine-tuning runs [11].
torchtune. PyTorch’s first-party library for fine-tuning, with native support for LoRA, QLoRA, and full fine-tuning of Llama and other models [17].
7.2. Serving Infrastructure
After fine-tuning, the model must be deployed for inference. The primary options:
vLLM. High-throughput serving with PagedAttention for efficient memory management. Supports an OpenAI-compatible API, which simplifies migration from OpenAI’s service. Handles LoRA adapter loading at inference time [18].
TGI (Text Generation Inference). Hugging Face’s production serving solution with similar capabilities to vLLM [18].
llama.cpp / Ollama. For CPU or single-GPU inference with quantized models. Simpler deployment but lower throughput [18].
7.3. A Typical Configuration File
A representative QLoRA fine-tuning configuration for tool-calling on an 8B instruct model:
# Model
model_name: "meta-llama/Llama-3.1-8B-Instruct"
load_in_4bit: true
bnb_4bit_quant_type: "nf4"
bnb_4bit_compute_dtype: "bfloat16"
# LoRA
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
# Training
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
learning_rate: 2e-4
num_train_epochs: 3
max_seq_length: 4096
optim: "paged_adamw_8bit"
bf16: true
gradient_checkpointing: true
# Data
dataset_path: "./tool_calling_data.jsonl"
output_dir: "./checkpoints"
Key parameter choices:
- lora_r: 16. A rank of 16 is a common starting point for tool-calling. Higher ranks (32, 64) increase capacity but also increase the risk of overfitting on small datasets [1][4].
- lora_alpha: 32. Typically set to 2× the rank. Controls the scaling of the adapter’s contribution relative to the base model [1].
- target_modules. Targeting all attention projections plus MLP layers provides the most complete adaptation. Some practitioners target only attention layers for simpler tasks [1].
- max_seq_length: 4096. Tool-calling sequences tend to be longer than standard chat because they include tool schemas in the system prompt, tool-call JSON, and tool results. 4096 tokens provides reasonable headroom; increase to 8192 if schema lists are extensive.
- gradient_checkpointing: true. Essential on 24 GB GPUs to reduce activation memory [3].
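For the configuration above, the resulting adapter size can be estimated from the model's layer dimensions. The figures below are Llama-3.1-8B's published dimensions (hidden size 4096, MLP intermediate size 14336, 32 layers, grouped-query attention giving 1024-wide k/v projections); treat them as assumptions to verify against the actual model config.

```python
# (input_dim, output_dim) for each targeted module in one Llama-3.1-8B layer
modules = {
    "q_proj": (4096, 4096), "k_proj": (4096, 1024), "v_proj": (4096, 1024),
    "o_proj": (4096, 4096), "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336), "down_proj": (14336, 4096),
}
r, n_layers = 16, 32
# LoRA adds r * (d_in + d_out) parameters per adapted module
per_layer = sum(r * (d_in + d_out) for d_in, d_out in modules.values())
total = per_layer * n_layers
print(total, round(total * 2 / 1e6))  # ~42M params, ~84 MB in BF16
```

At 2 bytes per BF16 parameter this lands around 84 MB, consistent with the ~50-200 MB adapter checkpoint range quoted earlier.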
8. Evaluation
8.1. What to Measure
Fine-tuning for tool-calling requires evaluating several distinct capabilities:
- Tool selection accuracy. Given a user query and a set of available tools, does the model select the correct tool (or correctly decide not to call any tool)?
- Argument extraction accuracy. Does the model correctly extract the required parameters from the user’s message and format them according to the schema?
- Output format validity. Is the emitted JSON syntactically valid and schema-compliant?
- Multi-turn coherence. After receiving a tool result, does the model correctly incorporate it into a natural language response?
- Regression testing. Has the fine-tune degraded the model’s general conversational abilities?
8.2. Evaluation Approaches
Held-out test set. The most straightforward approach: reserve 10-20% of training data for evaluation and measure exact-match accuracy on tool selection and argument extraction. Simple and interpretable, but limited to the distribution of the training data.
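Exact-match scoring on tool calls should compare parsed arguments rather than raw strings, so that key order and whitespace differences do not register as errors. A sketch, assuming the tool-call shape used in the training-data example:

```python
import json

def tool_call_match(predicted, reference):
    """True if the predicted call names the same tool and supplies
    JSON-equal arguments (key order and whitespace are ignored)."""
    if predicted["name"] != reference["name"]:
        return False
    try:
        return json.loads(predicted["arguments"]) == json.loads(reference["arguments"])
    except json.JSONDecodeError:
        return False  # a malformed prediction counts as a miss

def accuracy(pairs):
    """pairs: list of (predicted, reference) tool calls."""
    return sum(tool_call_match(p, ref) for p, ref in pairs) / len(pairs)
```

Tool selection accuracy and argument extraction accuracy can be reported separately by splitting the two checks in tool_call_match.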
Public benchmarks. The Berkeley Function-Calling Leaderboard (BFCL) and Salesforce’s xLAM benchmark provide standardized evaluations of function-calling models across multiple dimensions [19].
LLM-as-judge. Using a stronger model (e.g., GPT-4o, Claude) to evaluate whether tool calls are semantically correct, even when they don’t exactly match a reference answer. The QLoRA paper found that GPT-4 evaluations are “a cheap and reasonable alternative to human evaluation” for chatbot quality [5], though this finding should be validated for the specific domain of tool-calling accuracy.
Integration testing. For production systems, the most meaningful evaluation is end-to-end: does the fine-tuned model, deployed with real tool schemas and real APIs, correctly handle representative user scenarios? This requires building a test harness that executes tool calls against mock or staging APIs.
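A minimal shape for such a harness is sketched below, with real APIs replaced by in-process mocks. The tool name, the mock registry, and the `generate` callable are all placeholders: `generate` stands in for whatever client wraps your deployed model (e.g. a thin wrapper around an OpenAI-compatible endpoint).

```python
import json

# Mock implementations standing in for staging or production APIs.
MOCK_TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_c": 18},
}

def run_scenario(generate, user_msg: str, expected_tool: str) -> bool:
    """Drive one end-to-end scenario: prompt -> tool call -> execution.

    `generate` is any callable mapping a user message to the model's raw
    output string. A fuller harness would feed the tool result back as a
    tool turn and also check the model's final natural-language response.
    """
    raw = generate(user_msg)
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False  # syntactically invalid tool call
    if call.get("name") != expected_tool or call["name"] not in MOCK_TOOLS:
        return False  # wrong or hallucinated tool
    result = MOCK_TOOLS[call["name"]](**call.get("arguments", {}))
    return result is not None
```

Because `generate` is injected, the same scenarios can run against a stub during development and against the real deployment in staging.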
9. Known Limitations and Open Problems
Catastrophic forgetting. Fine-tuning can degrade the base model’s existing capabilities. LoRA mitigates this by keeping the base model frozen, but aggressive fine-tuning (too many epochs, too high a learning rate) can still overwrite useful behaviours. Starting from an instruct model and using minimal training data reduces this risk [4].
Schema sensitivity. The model’s tool-calling accuracy depends on how tool schemas are presented in the system prompt. Changes to schema descriptions, parameter names, or ordering can significantly affect which tool is selected. This is a well-documented property of LLMs in general — the planning and decision quality is highly sensitive to prompt framing [20].
No formal correctness guarantees. The fine-tuned model produces tool calls via learned pattern matching, not by reasoning about tool semantics. It can hallucinate non-existent tools, provide plausible but incorrect arguments, or fail to recognize when a tool call is needed. Constrained decoding addresses syntactic correctness but not semantic correctness [4].
Evaluation methodology is ad hoc. There is no standardized benchmark that covers the full range of tool-calling scenarios (single-turn, multi-turn, parallel calls, nested calls, refusal). Published benchmarks focus primarily on single-turn forced calls, which is the simplest case [19].
Data quality is the binding constraint. Practitioners consistently report that training data quality matters more than quantity, model size, or hyperparameter tuning for tool-calling fine-tuning [4]. A small number of well-curated examples covering the full range of expected scenarios and edge cases outperforms a larger volume of noisy or repetitive data.
10. Summary
The core engineering of fine-tuning an LLM for tool-calling consists of:
- Preparing conversational training data in JSONL format with correctly formatted tool-call and tool-result turns.
- Loading an open-weight instruct model (typically 8B parameters) in quantized form (NF4) onto a consumer GPU (RTX 4090, 24 GB).
- Attaching LoRA adapter matrices to the model’s attention and MLP layers.
- Training on the demonstration data using standard supervised fine-tuning with next-token prediction loss, with the loss masked so that only the assistant-generated portions of each conversation (responses and tool-call JSON) contribute; system, user, and tool-result tokens are excluded.
- Saving the resulting adapter weights (~50-200 MB) and deploying them alongside the base model using a serving framework (vLLM, TGI) that exposes an OpenAI-compatible API.
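For the first step, a single training row might take the following shape. This is a hedged sketch: the role names and the tool-call encoding are assumptions that must be matched to the chat template of your chosen base model, since there is no single fixed standard.

```python
import json

# One illustrative JSONL row for tool-calling SFT. The "tool_calls" and
# "tool" role encodings are assumptions; match your model's chat template.
row = {
    "messages": [
        {"role": "system",
         "content": "You can call get_weather(city: str)."},
        {"role": "user", "content": "Is it raining in Oslo?"},
        {"role": "assistant",
         "tool_calls": [{"name": "get_weather",
                         "arguments": {"city": "Oslo"}}]},
        {"role": "tool", "content": '{"condition": "rain", "temp_c": 6}'},
        {"role": "assistant",
         "content": "Yes, it is currently raining in Oslo (6 \u00b0C)."},
    ]
}

# Each conversation is serialized as one line of the JSONL dataset file.
with open("tool_calling_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Note that the row contains both a tool-call turn and the follow-up assistant turn that incorporates the tool result, so a single example teaches both halves of the multi-turn behaviour.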
The theoretical basis (low intrinsic dimensionality of fine-tuning updates [1][6]) explains why this works with so few trainable parameters. The engineering innovations of QLoRA (NF4 quantization, double quantization, paged optimizers [5]) explain why it fits on a single consumer GPU. The practical advice from practitioners working specifically on function-calling fine-tuning (use instruct models, keep training data small and high-quality, prefer LoRA over full fine-tuning in low-data regimes [4]) explains how to make it work well.
Bibliography
[1] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L. and Chen, W. “LoRA: Low-Rank Adaptation of Large Language Models.” Proceedings of ICLR 2022. arXiv:2106.09685, 2021. [Verified] https://arxiv.org/abs/2106.09685
[2] Vaswani, A. et al. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30 (NeurIPS 2017), 2017. [Verified] https://arxiv.org/abs/1706.03762
[3] “Fine-Tuning Llama-3 on a Single GPU: A QLoRA Implementation Guide.” Machine Learning How To, January 2026. [Snippet] https://machinelearninghowto.com/fine-tuning-llama-3-on-a-single-gpu/
[4] Garbacki, P. “Fine Tuning LLMs for Function Calling.” Talk at Mastering LLMs Conference (Fireworks AI). [Snippet] https://parlance-labs.com/education/fine_tuning/pawel.html
[5] Dettmers, T., Pagnoni, A., Holtzman, A. and Zettlemoyer, L. “QLoRA: Efficient Finetuning of Quantized LLMs.” Proceedings of NeurIPS 2023. arXiv:2305.14314, 2023. [Verified] https://arxiv.org/abs/2305.14314
[6] Aghajanyan, A., Zettlemoyer, L. and Gupta, S. “Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning.” Proceedings of ACL 2021. arXiv:2012.13255, 2020. [Verified] https://arxiv.org/abs/2012.13255
[7] Dettmers, T. “bitsandbytes.” GitHub / Hugging Face documentation. [Verified] https://huggingface.co/docs/bitsandbytes
[8] Hugging Face. “PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.” GitHub. [Verified] https://github.com/huggingface/peft
[9] Modal. “How much VRAM do I need for LLM model fine-tuning?” [Snippet] https://modal.com/blog/how-much-vram-need-fine-tuning
[10] RunPod. “The Complete Guide to GPU Requirements for LLM Fine-Tuning.” [Snippet] https://www.runpod.io/blog/llm-fine-tuning-gpu-guide
[11] RunPod. “How can I fine-tune large language models on a budget using LoRA and QLoRA on cloud GPUs?” [Snippet] https://www.runpod.io/articles/guides/how-to-fine-tune-large-language-models-on-a-budget
[12] DigitalOcean. “GPU Options for Finetuning Large Models: Choose the Right Setup.” [Snippet] https://www.digitalocean.com/resources/articles/gpu-options-finetuning
[13] Introl. “Fine-Tuning Infrastructure: LoRA, QLoRA, and PEFT at Scale.” December 2025. [Snippet] https://introl.com/blog/fine-tuning-infrastructure-lora-qlora-peft-scale-guide-2025
[14] LocalAI.Computer. “Multi-GPU LLM Inference Methodology.” [Snippet] https://localai.computer/multi-gpu-methodology
[15] Glukhov, R. “LLM Performance and PCIe Lanes: Key Considerations.” [Snippet] https://glukhov.org/llm-performance/hardware/llm-performance-and-pci-lanes
[16] Unsloth. “Fine-tuning LLMs Guide.” Unsloth Documentation. [Snippet] https://unsloth.ai/docs/get-started/fine-tuning-llms-guide
[17] Meta / PyTorch. “Fine-tuning | How-to Guides.” Llama documentation. [Snippet] https://www.llama.com/docs/how-to-guides/fine-tuning/
[18] vLLM Project. “vLLM: Easy, fast, and cheap LLM serving.” [Verified] https://docs.vllm.ai/
[19] Hugging Face. “Fine-tuning LLMs for Function Calling with xLAM Dataset.” Open-Source AI Cookbook. [Snippet] https://huggingface.co/learn/cookbook/function_calling_fine_tuning_llms_on_xlam
[20] Kambhampati, S. et al. “Position: LLMs Can’t Plan, But Can Help Planning in LLM-Modulo Frameworks.” Proceedings of ICML 2024, PMLR 235:22895-22907, 2024. [Verified] https://proceedings.mlr.press/v235/kambhampati24a.html