Fine-Tuning on Abstract Program Structures - Architecture, GPU Requirements, and State of the Art

A Technical Primer for Non-Language Structural Training Data

March 2026


1. Introduction

A companion document in this project covers fine-tuning large language models for conversational tool-calling — a setting where the training data is natural language (chat-format JSONL with user messages, assistant responses, and tool-call turns) and the GPU requirements are well understood: an 8B model with QLoRA fits comfortably on a single RTX 4090 (24 GB VRAM) with datasets of ~1,000 examples [1].

This document addresses a fundamentally different setting: training on data that is not natural language but rather mathematical and structural representations of programs — abstract syntax trees (ASTs), data-flow graphs, control-flow graphs, program dependence graphs, and other abstract structures. These are always ultimately presented to a neural network as vectors, but the path from structure to vector is where the critical architectural choices are made. Those choices determine the GPU hardware required.

The intended reader is a computer scientist who understands transformers, attention mechanisms, and gradient descent, but needs to quickly understand the specific engineering of training on structured program representations: what the data looks like, how it flows through the system, why the hardware requirements differ from the conversational case, and what the current state of the art is.

The document covers three distinct architectural approaches, each with different GPU profiles:

  • Approach A: Serialization. Convert trees and graphs into token sequences, feed them to a standard transformer. This is the simplest to implement but creates long sequences that strain memory.
  • Approach B: Graph Neural Networks. Operate directly on graph topology using message-passing networks. These are smaller models but have memory profiles driven by edge count rather than sequence length.
  • Approach C: Hybrid Structure-Aware Transformers. Inject structural information (tree distances, edge types, data-flow relationships) into a transformer’s attention mechanism as inductive biases. This is the current research frontier.

Each approach entails a different data pipeline, different memory scaling behaviour, and different GPU hardware requirements.


2. Glossary

Abstract Syntax Tree (AST). A tree representation of the syntactic structure of source code. Each node represents a syntactic construct (e.g., IfStatement, MethodDeclaration, BinaryExpression). Leaf nodes (terminals) contain concrete tokens (variable names, literals). Interior nodes (non-terminals) represent grammar rules. ASTs are generated by parsers such as Tree-sitter [2] or language-specific tooling (JDT for Java, ANTLR for multiple languages). An AST for a simple function can have hundreds of nodes; a complex class may have thousands.

Adjacency Matrix. A square matrix A ∈ ℝ^(N×N) representing a graph with N nodes, where A[i][j] = 1 if an edge exists from node i to node j (or a weight, for weighted graphs). Adjacency matrices for program graphs are typically sparse — most node pairs are not connected — so sparse representations are used in practice. The memory cost is O(N²) for dense representation or O(E) for sparse representation, where E is the number of edges.
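As a concrete illustration of the dense-vs-sparse trade-off, a minimal sketch in plain Python (the graph and node counts are invented for illustration):

```python
# Dense adjacency matrix vs. sparse edge list (COO format) for a small
# hypothetical program graph. Node indices and edges are illustrative.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]  # (src, dst) pairs
n_nodes = 4

# Dense representation: O(N^2) entries regardless of edge count.
dense = [[0] * n_nodes for _ in range(n_nodes)]
for src, dst in edges:
    dense[src][dst] = 1

# Sparse (COO) representation: O(E) entries.
src_index = [s for s, _ in edges]
dst_index = [d for _, d in edges]

# Memory comparison for a more realistic program graph:
# N = 10,000 nodes, average degree k = 5 → E = 50,000 edges.
n, e = 10_000, 50_000
dense_entries = n * n    # 100,000,000 entries for the dense matrix
sparse_entries = 2 * e   # 100,000 entries for the (src, dst) index arrays
```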

Attention Masking. A technique for modifying the standard transformer self-attention mechanism by applying a mask matrix that zeros out attention weights between specific token pairs. In the context of program structures, attention masks can encode tree or graph relationships — e.g., allowing a node to attend only to its ancestors and descendants in an AST, rather than to all tokens. GraphCodeBERT uses a graph-guided masked attention function to incorporate data-flow structure into the transformer [3].

Code Embedding. A fixed-length continuous vector representing a code snippet. The goal of all approaches in this document is to learn embeddings that capture semantic (not merely syntactic) properties of code — e.g., two functions that compute the same result via different algorithms should have similar embeddings. code2vec [4] and code2seq [5] were early systems that learned code embeddings from AST path contexts.

Control-Flow Graph (CFG). A directed graph where nodes represent basic blocks (sequences of instructions with no branches) and edges represent possible execution paths (branches, loops, function calls, exceptions). CFGs capture the order in which statements may execute. They are more closely related to runtime behaviour than ASTs, which capture syntactic structure.

Data-Flow Graph (DFG). A directed graph encoding the “where-the-value-comes-from” relationship between variables [3]. Nodes represent variables; edges represent value dependencies. For example, if y = x + 1, there is a data-flow edge from the node representing x to the node representing y. GraphCodeBERT [3] demonstrated that data-flow graphs are a more compact and semantically relevant representation than full ASTs for certain tasks, because they encode semantic-level rather than syntactic-level structure.

Graph Neural Network (GNN). A neural network that operates directly on graph-structured data using a message-passing paradigm: each node aggregates information from its neighbours, updates its representation, and passes updated information in subsequent rounds. Common architectures include GCN (Graph Convolutional Networks) [6], GAT (Graph Attention Networks), GraphSAGE, and GGNN (Gated Graph Neural Networks) [7]. GNNs avoid the need to serialize graph structures into sequences.

Linearization. The process of converting a tree or graph into a flat token sequence suitable for input to a sequence model (RNN, Transformer). Common AST linearization methods include depth-first traversal, Structure-Based Traversal (SBT) [8], and pre-order traversal with bracket notation. Linearization typically increases sequence length significantly — SBT roughly doubles the original AST size [8]. This length increase is the primary reason that structural training data requires more GPU memory than natural-language training data of the same semantic complexity.

Message Passing. The core operation in GNNs. In each message-passing round, every node: (a) collects “messages” (feature vectors) from its neighbours, (b) aggregates them (typically via sum, mean, or max), and (c) updates its own representation using the aggregated message and its current state. After k rounds, each node’s representation encodes information from its k-hop neighbourhood. The number of rounds is a hyperparameter analogous to the number of layers in a standard neural network.

Path Context. A triple (terminal_start, path, terminal_end) representing a path between two leaf nodes in an AST, where path is the sequence of node types traversed from one leaf up through their lowest common ancestor and back down to the other leaf. code2vec [4] represents an entire code snippet as a bag of path contexts, then learns to aggregate them via attention into a single code vector. This is one of the earliest successful vectorized representations of program structure.

Program Dependence Graph (PDG). A graph combining both control-flow and data-flow dependencies. Nodes represent statements or expressions; edges encode both control dependencies (which statements control the execution of which other statements) and data dependencies (which statements produce values consumed by which other statements).

Sequence Length. The number of tokens in the input to a transformer model. This is the critical parameter for GPU memory in transformer-based approaches because self-attention memory scales quadratically with sequence length [9]. A 100-line function might be 300 tokens as source code, but 600–1,500 tokens when its AST is linearized via SBT [8]. Whole-program representations can reach tens of thousands of tokens.

Structure-Induced Self-Attention. A modification to standard self-attention where the attention weights are biased by structural relationships derived from code graphs (ASTs, CFGs, DFGs). Instead of computing attention purely from token content (query-key dot products), the attention score is augmented with terms encoding, e.g., the tree distance between two AST nodes or the presence of a data-flow edge between two variables [10]. This injects structural knowledge as an inductive bias without changing the transformer’s overall architecture.


3. The Representation Problem: From Structure to Vector

The fundamental engineering question for training on program structures is: how do you present a tree or a graph to a neural network? Every neural network ultimately operates on tensors of floating-point numbers. Trees and graphs are not tensors. The choice of how to bridge this gap determines the model architecture, the training data pipeline, and the GPU hardware profile.

The literature presents four major strategies, each with distinct properties. The following subsections describe what each produces as input to the model, and Section 4 details the architectures that consume them.

3.1. Strategy 1: Linearize the Tree into a Token Sequence

The simplest approach: traverse the AST in some order (depth-first, breadth-first, or a structured traversal like SBT) and emit each node as a token. The resulting token sequence is fed to a standard transformer encoder or decoder, just like natural language text.

Data flow:

Source Code
    │
    ▼
Parser (Tree-sitter, JDT, ANTLR)
    │
    ▼
Abstract Syntax Tree (tree data structure)
    │
    ▼
Linearization (SBT, depth-first, pre-order with brackets)
    │
    ▼
Token Sequence (e.g., "( MethodDecl ( Params ( Param int x ) ) ( Body ( Return ( BinOp + x 1 ) ) ) )")
    │
    ▼
Tokenizer (subword: BPE, SentencePiece, or node-type vocabulary)
    │
    ▼
Integer Token IDs → Standard Transformer

Properties:

  • Sequence length roughly doubles vs. raw source code for SBT linearization [8].
  • Structural relationships (parent-child, sibling) are only implicit in the token order — the model must learn them from position.
  • Compatible with any off-the-shelf transformer (Llama, CodeBERT, etc.) without architectural modification.
  • AST-Transformer [8] found that linearized ASTs with structure-aware attention outperform both plain code tokens and plain linearized ASTs without structural augmentation.

Empirical finding: A comprehensive 2023 study trained models on four input types (raw code tokens, SBT-linearized ASTs, ASTs with tokens removed, and combined token+AST) across three code tasks [11]. The result may seem counterintuitive: models trained on raw code token sequences consistently outperformed those trained on AST-based representations on all three tasks (code clone detection, code search, code summarization) [11]. However, on certain subsets of samples, particularly those involving complex syntactic structures, AST-based models showed an advantage [11]. The implication is that naive linearization inflates sequence length faster than it adds usable structural signal, and that structural information must be injected more carefully (see Approach C in Section 4.3).
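The linearization step itself is simple. A minimal sketch, using Python's built-in ast module as a stand-in for Tree-sitter (node type names therefore follow Python's grammar rather than the Java-style examples above):

```python
# Depth-first pre-order linearization with bracket notation.
import ast

def linearize(node: ast.AST) -> str:
    """Emit '( NodeType child child ... )' by pre-order traversal;
    leaves (nodes with no AST children) are emitted as bare labels."""
    children = list(ast.iter_child_nodes(node))
    label = type(node).__name__
    if not children:
        return label
    return "( " + label + " " + " ".join(linearize(c) for c in children) + " )"

source = "def f(x):\n    return x + 1"
tree = ast.parse(source)
print(linearize(tree))
```

Even this two-line function illustrates the length blow-up: every interior node contributes its type name plus two bracket tokens to the output sequence.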

3.2. Strategy 2: Extract AST Path Contexts

Rather than linearizing the entire tree, extract a set of paths between pairs of leaf nodes. Each path is represented as a triple: (start_terminal, path_of_node_types, end_terminal). The code snippet is then represented as a bag (unordered set) of these path contexts. An attention mechanism learns to weight each path context and aggregate them into a single fixed-length code vector.

Data flow:

Source Code
    │
    ▼
Parser → AST
    │
    ▼
Path Extraction (e.g., astminer)
    │  - Enumerate paths between leaf node pairs
    │  - Each path: (leaf₁, [node_type₁ → node_type₂ → ... → node_typeₖ], leaf₂)
    │  - Typically limited to paths of length ≤ 8–12 hops
    ▼
Bag of Path Contexts (fixed number, e.g., 200 per method)
    │
    ▼
Embedding Layer
    │  - Leaf tokens → learned embeddings
    │  - Paths → learned embeddings (code2vec) or LSTM-encoded (code2seq)
    ▼
Attention-weighted Aggregation → Single Code Vector (d-dimensional)

This is the approach of code2vec [4] and code2seq [5]. code2vec embeds entire paths as atomic units; code2seq feeds the sequence of node types through a bidirectional LSTM to produce path embeddings, which preserves more fine-grained structural information [5].

Properties:

  • Model size is small — millions of parameters, not billions. code2vec trains in ~50 minutes per epoch on a V100 [4].
  • The bag-of-paths representation inherently discards global tree structure (the relative position of paths is lost).
  • Effective for tasks like method name prediction and clone detection, but limited for tasks requiring understanding of execution order or data dependencies.
  • JetBrains Research developed astminer [12] as a language-agnostic tool for extracting path contexts, supporting Java, Python, C, C++, and others.
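A simplified sketch of path-context extraction, again using Python's built-in ast module in place of astminer (the leaf definition and labels here are illustrative simplifications of what astminer produces):

```python
# Extract (terminal_start, path_of_node_types, terminal_end) triples in the
# style of code2vec/code2seq. Paths run leaf → lowest common ancestor → leaf.
import ast
from itertools import combinations

def extract_path_contexts(source: str, max_hops: int = 12):
    tree = ast.parse(source)
    parent = {}
    for node in ast.walk(tree):
        for child in ast.iter_child_nodes(node):
            parent[child] = node

    # Treat nodes with no AST children as leaves; label them by identifier
    # or constant value where available, else by node type.
    def label(n):
        return getattr(n, "id", getattr(n, "value", type(n).__name__))

    def ancestors(n):
        chain = [n]
        while n in parent:
            n = parent[n]
            chain.append(n)
        return chain

    leaves = [n for n in ast.walk(tree) if not list(ast.iter_child_nodes(n))]
    contexts = []
    for a, b in combinations(leaves, 2):
        up, down = ancestors(a), ancestors(b)
        lca = next(n for n in up if n in down)
        path = up[: up.index(lca) + 1] + list(reversed(down[: down.index(lca)]))
        if len(path) - 1 <= max_hops:  # enforce the hop limit
            contexts.append(
                (label(a), [type(n).__name__ for n in path[1:-1]], label(b)))
    return contexts
```

In production pipelines the set of contexts per method is then capped (e.g., 200, sampled or truncated) to give the model a fixed-size bag.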

3.3. Strategy 3: Build an Adjacency Matrix (or Edge List) for a GNN

Represent the program as a graph — either the raw AST, or an enriched graph that adds control-flow edges, data-flow edges, next-token edges, and other semantic edges to the AST skeleton. Each node gets an initial feature vector (from its token or node type). A GNN then propagates information through the graph via message passing.

Data flow:

Source Code
    │
    ▼
Parser → AST
    │
    ▼
Graph Construction
    │  - AST edges (parent→child)
    │  - Next-token edges (sequential order of leaves)
    │  - Data-flow edges (variable value dependencies)
    │  - Control-flow edges (execution order, branching)
    ▼
Graph = (Node Features, Edge Index, Edge Types)
    │  - Node features: initial embeddings from node type / token
    │  - Edge index: sparse (src, dst) pairs
    │  - Edge types: categorical (AST, DF, CF, NextToken, ...)
    ▼
GNN (GCN, GAT, GGNN, etc.)
    │  - k rounds of message passing
    │  - Each round: aggregate neighbour features, update node state
    ▼
Node-level or Graph-level Representations
    │  - Node-level: each node has a learned embedding
    │  - Graph-level: pooling (mean, attention) over all node embeddings → single vector

Allamanis et al. [7] introduced this approach for program analysis, representing programs as graphs with multiple edge types and using GGNNs to reason over them. The GGNN uses gated recurrent units (GRUs) for the node update step, allowing it to model complex dependencies over multiple message-passing rounds.

Properties:

  • Directly preserves graph topology — no information loss from linearization.
  • Memory scales with the number of edges E (for message computation) rather than with N² (as in dense attention). However, intermediate representations generated during edge calculations can be an order of magnitude larger than the graph itself [13].
  • Frameworks: PyTorch Geometric (PyG) and Deep Graph Library (DGL) are the two dominant implementations. NVIDIA provides GPU-optimized containers for both with RAPIDS integration [14].
  • GNN models for code are typically much smaller than LLMs (millions of parameters, not billions).
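A single message-passing round, written out in NumPy to make the gather/aggregate/update steps concrete (shapes, mean aggregation, and the ReLU update are illustrative choices; real code would use PyTorch Geometric or DGL):

```python
# One round of mean-aggregation message passing over a sparse edge index.
import numpy as np

def message_passing_layer(h, edge_index, W):
    """h: (N, d) node features; edge_index: (2, E) (src, dst) pairs;
    W: (2d, d) weights for the UPDATE([h_i ; m_i]) step."""
    n, d = h.shape
    src, dst = edge_index
    # (a) gather + (b) aggregate: sum incoming messages per destination
    # node, then divide by in-degree to get the mean.
    msg_sum = np.zeros((n, d))
    np.add.at(msg_sum, dst, h[src])
    in_deg = np.bincount(dst, minlength=n).reshape(-1, 1)
    m = msg_sum / np.maximum(in_deg, 1)
    # (c) update: combine current state and aggregated message via a
    # linear map + ReLU (a GGNN would use a GRU here instead).
    return np.maximum(np.concatenate([h, m], axis=1) @ W, 0.0)

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))                          # 4 nodes, 8-dim features
edge_index = np.array([[0, 0, 1, 2], [1, 2, 3, 3]])  # parent→child AST edges
W = rng.normal(size=(16, 8))
h1 = message_passing_layer(h, edge_index, W)          # shape (4, 8)
```

Note that memory for the gathered messages h[src] is O(E × d), matching the edge-driven scaling described above.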

3.4. Strategy 4: Feed Structure into a Transformer via Attention Bias

The hybrid approach: use a transformer as the backbone but inject structural information by modifying the attention computation. The input is still a sequence of tokens (code or linearized AST), but the attention scores are augmented with matrices encoding structural relationships — tree distances, data-flow edges, or control-flow edges.

Data flow:

Source Code
    │
    ├─────────────────────────────────────┐
    ▼                                     ▼
Tokenizer → Token Sequence          Parser → AST / DFG / CFG
    │                                     │
    │                                     ▼
    │                              Structure Extraction
    │                                │  - Distance matrix (shortest path between AST leaves)
    │                                │  - Ancestor-descendant matrix
    │                                │  - Sibling matrix
    │                                │  - Data-flow adjacency matrix
    │                                ▼
    │                           Structural Matrices (N×N per relationship type)
    │                                │
    └──────────┬─────────────────────┘
               │
               ▼
    Modified Transformer Encoder
        │  - Standard Q, K, V projections from token embeddings
        │  - Attention score = (Q·Kᵀ)/√d + f(structural_matrices)
        │  - Different structural matrices can be injected at different layers/heads
        ▼
    Contextual Token Representations (enriched with structural knowledge)

GraphCodeBERT [3] is the canonical example. It operates on a transformer backbone (RoBERTa architecture) and incorporates data-flow structure through a graph-guided masked attention function. The model takes as input both the code token sequence and the data-flow graph, with variable nodes appended to the sequence and a masking function that controls which tokens can attend to which graph nodes [3]. Two structure-aware pre-training tasks supplement standard masked language modeling: (a) predicting data-flow edges between variables, and (b) aligning code tokens with their corresponding data-flow nodes [3].

SAT (Structure-Aware Fine-tuning) [10] takes a different approach: it computes a distance matrix from the AST (shortest path between leaf nodes), extracts the attention matrix from the transformer’s multi-head attention blocks, and adds a structure loss (Sinkhorn divergence between the two matrices) to the training objective. This acts as a regularizer that encourages the transformer’s learned attention patterns to align with the tree structure.

Tree-Enhanced CodeBERTa [15] adds tree-based positional embeddings — encoding node depth and sibling index from the AST — directly into the transformer’s position encoding layer, adding approximately 0.945% additional parameters [15].
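Two of these injection mechanisms (additive bias and masking) can be illustrated on a toy example. All matrices below are invented for illustration; a real model derives them from a parsed AST or DFG:

```python
# Structural injection into self-attention: additive bias and masking.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d = 4, 8
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
scores = Q @ K.T / np.sqrt(d)              # standard scaled dot-product

# (a) Additive bias: smaller tree distance → larger attention boost.
tree_dist = np.array([[0, 1, 2, 3],
                      [1, 0, 1, 2],
                      [2, 1, 0, 1],
                      [3, 2, 1, 0]], dtype=float)
alpha = 0.5                                # learned scale in practice
biased = softmax(scores - alpha * tree_dist)

# (b) Masking: forbid attention where no structural relation exists
# (GraphCodeBERT-style graph-guided masking).
mask = np.array([[1, 1, 0, 0],
                 [1, 1, 1, 0],
                 [0, 1, 1, 1],
                 [0, 0, 1, 1]])
masked = softmax(np.where(mask == 1, scores, -np.inf))
```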


4. Three Architectural Approaches: Detailed Analysis

4.1. Approach A: Serialized Structures in Standard Transformers

Architecture: Any off-the-shelf causal (decoder-only) or bidirectional (encoder-only) transformer, receiving a linearized representation of program structure as its token input. No architectural modifications.

When to use: When you want to fine-tune an existing pretrained code model (CodeLlama, StarCoder, DeepSeek-Coder) on structural data without modifying the model architecture. This is the lowest-engineering-effort approach.

The core constraint is sequence length. When you linearize an AST via SBT, the resulting sequence is roughly 2× the length of the raw source code [8]. AST-Transformer reported that linearized ASTs are “much longer than the source code” and that this “makes the model extremely difficult to accurately detect useful dependency relations from the overlong input sequence” while also bringing “significant computational overhead, especially for those state-of-the-art Transformer-based models where the number of self-attention operations grows quadratically with the sequence length” [8].

Concrete numbers: a 100-line Java method might be ~300 source tokens, ~600–1,500 tokens as a linearized AST. A 500-line class could produce 3,000–8,000 AST tokens. If you are representing whole-program structures (all functions, all files), you can reach 16K–64K+ tokens.

Memory scaling: Self-attention computes a full N×N attention matrix (where N is sequence length). VRAM for the attention computation alone scales as O(N²) without FlashAttention, or O(N) with FlashAttention (at the cost of recomputation) [9]. For the MLP and embedding layers, memory scales linearly with N. The total activation memory during training — which must be stored for backpropagation — is the dominant memory consumer for long sequences.

Mini-Sequence Transformer demonstrated that with appropriate memory optimization, Llama3-8B can train with context length 60K on a single A100-80GB [16]. Without such optimizations, a standard Llama-8B with QLoRA hits the VRAM ceiling of a 24GB GPU at around 8K–12K context length (depending on batch size and gradient checkpointing settings) [1].
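A back-of-envelope calculation makes the quadratic scaling concrete. The layer and head counts below are Llama-3-8B-like assumptions (32 layers, 32 heads), batch size 1, fp16 activations, attention matrices materialized without FlashAttention:

```python
# Memory for the stored N×N attention matrices alone, across all layers
# and heads, assuming fp16 (2 bytes per entry) and batch size 1.
def attention_matrix_gib(seq_len, layers=32, heads=32, bytes_per=2):
    return layers * heads * seq_len * seq_len * bytes_per / 2**30

for n in (4_096, 8_192, 16_384):
    print(f"N={n:>6}: {attention_matrix_gib(n):8.1f} GiB")
# 4K already costs 32 GiB for the attention matrices alone; 16K costs 512 GiB,
# which is why FlashAttention's O(N) memory is mandatory at these lengths.
```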

GPU requirements for this approach:

Sequence Length                            Model Size   Method                            Minimum GPU
≤ 4K tokens (single functions)             8B           QLoRA                             RTX 4090 (24 GB)
4K–16K tokens (classes, multi-function)    8B           QLoRA + gradient checkpointing    A100 40GB or L40S 48GB
16K–64K tokens (whole-program)             8B           QLoRA + FlashAttention + MsT      A100 80GB
> 64K tokens                               8B           QLoRA + distributed               Multi-GPU (2–4× A100 80GB)

These figures are for fine-tuning. Pretraining from scratch (not QLoRA) would require roughly 4–8× more memory due to optimizer states and gradients for all parameters [1].

4.2. Approach B: Graph Neural Networks

Architecture: A message-passing neural network (GCN, GAT, GGNN, GraphSAGE, or variants) operating directly on program graphs.

When to use: When you need to reason over graph topology (data dependencies, control flow) and want to avoid the information loss of linearization. Also appropriate when you are building a specialized model (not fine-tuning a general-purpose LLM) for a specific task like vulnerability detection, clone detection, or type inference.

Memory scaling for GNNs. GNN memory scales with the number of edges and the hidden dimension, not with the square of the node count. However, the edge computation stage generates and caches intermediate results that are an order of magnitude larger than the dataset itself [13]. Specifically, for a GNN layer with hidden dimension d and E edges, the memory for edge computation is O(E × d) for each layer. For a graph with N nodes and average degree k, E = N × k, so memory is O(N × k × d) — linear in N for fixed degree, unlike the O(N²) of dense attention.

The empirical finding from Wang et al. is that for GNNs on GPUs, “the high memory usage of the edge calculation stage is the main factor that limits the data scalability of GNN training and inference” [13]. Sampling techniques (subgraph sampling, neighbour sampling) are essential for training on large graphs.

Concrete GPU requirements:

For function-level program graphs (hundreds of nodes, thousands of edges):

  • These are small by GNN standards. A single consumer GPU (RTX 3090/4090) handles them without difficulty.
  • code2vec (not a GNN, but comparable scale) trains in ~50 minutes per epoch on a V100 with 14 million examples [4].
  • GGNN models for program analysis are typically in the range of 1–50 million parameters, requiring only 1–4 GB VRAM for the model itself.

For whole-program graphs (tens of thousands of nodes, hundreds of thousands to millions of edges):

  • An A100-80GB is the recommended baseline for single-GPU training. NVIDIA’s GNN framework documentation uses A100-80GB as the reference for their benchmark workloads [14].
  • For billion-edge graphs (e.g., entire codebases, dependency networks), Amazon’s distributed GNN training work describes CPU-GPU hybrid approaches where graph structure is stored in CPU memory and sampled subgraphs are loaded to GPU for computation [17].

Software ecosystem caveat: PyTorch Geometric and DGL are the dominant GNN frameworks, and both are heavily optimized for NVIDIA CUDA. Metal/MPS support (for Apple Silicon) is limited and significantly less mature. If the target hardware is Apple Silicon (per the M5 Max/Ultra analysis in the project), GNN-based approaches face a software ecosystem limitation that serialized-transformer approaches do not.

4.3. Approach C: Hybrid Structure-Aware Transformers

Architecture: A transformer backbone with structural information injected into the attention mechanism, positional encodings, or training objectives.

When to use: When you want to benefit from pretrained transformer weights (which encode broad code understanding) while adding structural awareness. This is the current state of the art for most code understanding benchmarks.

Canonical examples:

GraphCodeBERT [3] — The model takes as input a concatenated sequence: [CLS] + code tokens + [SEP] + data-flow variable nodes. A graph-guided masked attention function controls which code tokens can attend to which data-flow nodes, preventing irrelevant structural noise. The model is pretrained with three objectives: (a) masked language modeling on code tokens, (b) edge prediction on data-flow edges, and (c) node alignment between code tokens and data-flow variables. The result is a transformer that has learned to “prefer structure-level attentions over token-level attentions” in tasks like code search [3].

SAT [10] — A plug-and-play fine-tuning method applicable to any transformer-based code model. During fine-tuning, SAT computes the shortest-path distance matrix between AST leaf nodes, extracts the attention matrix from the transformer’s multi-head attention blocks, and adds a Sinkhorn divergence loss between the two. This additional loss term encourages the model to learn attention patterns that respect the code’s syntactic structure, without modifying the transformer architecture itself [10]. SAT showed greater improvement in low-resource fine-tuning scenarios [10].

Tree-Enhanced CodeBERTa [15] — Adds tree-based positional embeddings (depth embedding + sibling index embedding) to the standard position encoding. The additional parameters amount to approximately 0.945% of the base model [15]. The embeddings are generated by parsing code with Tree-sitter, computing AST node depth and sibling position, and looking up learned embedding vectors that are added to the standard positional encodings.

GPU requirements for this approach:

The base GPU requirement is the same as for Approach A (it depends on sequence length and model size), plus the overhead of computing and storing structural matrices. For GraphCodeBERT-scale models (125M parameters, RoBERTa-base architecture), fine-tuning fits on a single consumer GPU. For larger models (1B+ parameters) with structural augmentation, the overhead is:

  • One N×N matrix per structural relationship type (ancestor-descendant, sibling, data-flow, etc.), stored in memory during the forward pass.
  • For N = 512 (CodeBERT’s max sequence length), each matrix is 512 × 512 × 4 bytes ≈ 1 MB. Negligible.
  • For N = 8,192, each matrix is 8,192 × 8,192 × 4 bytes ≈ 256 MB. Significant if multiple relationship types are used.

In practice, the structural matrices add 5–20% memory overhead on top of the base transformer memory footprint, depending on sequence length and number of relationship types.


5. The Training Pipeline: End-to-End Data Flow

This section describes the complete training pipeline for each approach, from raw source code to trained model. The steps in common across all approaches are shown first, followed by approach-specific divergence points.

5.1. Common Preprocessing

Raw Source Files (.java, .py, .c, .rs, ...)
    │
    ▼
Parsing
    │  Tool: Tree-sitter (multi-language), JDT (Java), ANTLR (configurable grammars)
    │  Output: AST per function/method/file
    │  Note: Tree-sitter is the de facto standard for multi-language support [2]
    ▼
Structural Analysis
    │  - AST → parent-child edges, node types, leaf tokens
    │  - CFG extraction → basic blocks + branch edges (language-specific; significantly harder)
    │  - DFG extraction → variable dependency edges
    │     (GraphCodeBERT provides a DFG extractor for 6 languages [3])
    │  - PDG construction → union of CFG and DFG edges
    ▼
Task-Specific Labeling
    │  Depends on downstream task:
    │  - Method name prediction → label = method name (masked in input)
    │  - Clone detection → label = (code_A, code_B, is_clone)
    │  - Vulnerability detection → label = vulnerable / not vulnerable
    │  - Type inference → label = type annotation
    │  - Code summarization → label = natural language description
    ▼
Representation Construction (diverges by approach — see below)

5.2. Approach A Pipeline: Serialization

AST
    │
    ▼
Linearization
    │  Method options:
    │  (a) SBT (Structure-Based Traversal): "( MethodDecl ( Params ... ) ( Body ... ) )"
    │      Roughly doubles sequence length vs. source code.
    │  (b) Depth-first pre-order with node types: "MethodDecl Params Param int x Body Return BinOp ..."
    │  (c) Source code tokens (ignoring AST entirely) — baseline comparison.
    ▼
Tokenization
    │  Two options:
    │  (a) Node-type vocabulary: each AST node type is a token. Vocabulary size: 200–500.
    │  (b) Subword tokenizer (BPE/SentencePiece) applied to the linearized string.
    │      Uses the existing tokenizer of the pretrained model being fine-tuned.
    │      Vocabulary size: 32K–128K (same as base model).
    │  Choice depends on whether you are fine-tuning a pretrained model (use its tokenizer)
    │  or training from scratch (can use a specialized vocabulary).
    ▼
Token IDs + Attention Mask + (optionally) Position IDs
    │
    ▼
Standard Transformer (e.g., Llama 3, CodeLlama, StarCoder, DeepSeek-Coder)
    │  - QLoRA or LoRA for fine-tuning (same as in the conversational case [1])
    │  - Cross-entropy loss on next-token prediction (or masked LM for encoder models)
    │  - Loss is masked to relevant output tokens only
    ▼
Output: Fine-tuned model or adapter weights
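The node-type-vocabulary option from the tokenization step above can be sketched in a few lines (the special tokens and the corpus string are illustrative):

```python
# Build a small node-type vocabulary over linearized ASTs and encode a
# sequence to integer token IDs.
def build_vocab(sequences, specials=("<pad>", "<unk>", "<bos>", "<eos>")):
    vocab = {tok: i for i, tok in enumerate(specials)}
    for seq in sequences:
        for tok in seq.split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(seq, vocab):
    unk = vocab["<unk>"]
    return [vocab["<bos>"]] + [vocab.get(t, unk) for t in seq.split()] + [vocab["<eos>"]]

corpus = ["( MethodDecl ( Params ( Param int x ) ) "
          "( Body ( Return ( BinOp + x 1 ) ) ) )"]
vocab = build_vocab(corpus)
ids = encode(corpus[0], vocab)
```

Because node-type vocabularies are tiny (hundreds of entries rather than tens of thousands), the embedding table is negligible; this option only applies when training from scratch, since a pretrained model is locked to its own subword tokenizer.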

5.3. Approach B Pipeline: GNN

AST + DFG + CFG (from preprocessing)
    │
    ▼
Graph Construction
    │  For each code example:
    │  - Nodes: AST nodes and/or variables and/or statements
    │  - Node features: initial embeddings from:
    │     • Node type (looked up from a learned embedding table)
    │     • Token content (looked up or subword-encoded)
    │     • Concatenation of both
    │  - Edge index: list of (source_node, target_node) pairs
    │  - Edge types: categorical labels (AST_edge, DFG_edge, CFG_edge, NextToken_edge, ...)
    ▼
Batching
    │  Graphs cannot be naively batched like fixed-length sequences.
    │  Standard approach: combine multiple graphs into a single disconnected graph
    │  (PyG's Batch.from_data_list handles this). Node indices are offset per graph.
    │  Alternative: pad all graphs to the same size (wasteful for heterogeneous sizes).
    ▼
GNN Forward Pass (for a k-layer GNN)
    │  For each layer l = 1, ..., k:
    │    For each node i:
    │      1. Gather messages from neighbours: m_i = AGG({h_j^(l-1) : j ∈ N(i)})
    │         AGG = sum, mean, max, or attention-weighted (GAT)
    │      2. Update node state: h_i^(l) = UPDATE(h_i^(l-1), m_i)
    │         UPDATE = GRU (GGNN), MLP (GCN), gated function (various)
    │  After k layers: each node h_i^(k) encodes its k-hop neighbourhood.
    ▼
Readout
    │  Task-dependent:
    │  - Graph classification: pool all node representations → single vector
    │     (mean pooling, attention pooling, hierarchical pooling)
    │  - Node classification: use individual node representations directly
    │  - Link prediction: combine representations of node pairs
    ▼
Task-Specific Head (MLP classifier or decoder)
    │
    ▼
Loss + Backpropagation
    │  - Cross-entropy for classification tasks
    │  - Contrastive loss for code search / clone detection
    │  - Sequence loss for code generation tasks
    ▼
Output: Trained GNN model
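The batching step deserves a concrete sketch. The following pure-Python stand-in mirrors what PyG's Batch.from_data_list does internally: offset each graph's node indices and record which graph every node belongs to.

```python
# Merge several graphs into one disconnected graph for batched GNN training.
def batch_graphs(graphs):
    """graphs: list of (num_nodes, edge_list) with graph-local node indices.
    Returns (total_nodes, merged_edge_list, node_to_graph_map)."""
    offset, edges, node_to_graph = 0, [], []
    for g_id, (n, edge_list) in enumerate(graphs):
        # Shift this graph's node indices past all previously placed nodes.
        edges.extend((src + offset, dst + offset) for src, dst in edge_list)
        node_to_graph.extend([g_id] * n)
        offset += n
    return offset, edges, node_to_graph

g1 = (3, [(0, 1), (1, 2)])   # a 3-node chain
g2 = (2, [(0, 1)])           # a 2-node edge
total, merged, owner = batch_graphs([g1, g2])
# total == 5; g2's edge becomes (3, 4); owner == [0, 0, 0, 1, 1]
```

Message passing never crosses graph boundaries (there are no edges between components), and the node-to-graph map drives the per-graph readout pooling.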

5.4. Approach C Pipeline: Hybrid

Source Code
    │
    ├─────────────────────────────┐
    ▼                             ▼
Tokenizer → Token IDs        Parser → AST / DFG / CFG
    │                             │
    │                             ▼
    │                    Structure Matrix Computation
    │                        │
    │                        │  One or more of:
    │                        │  (a) AST distance matrix D ∈ ℝ^(N×N)
    │                        │      D[i][j] = shortest path length between leaf i and leaf j in AST
    │                        │  (b) Ancestor-descendant binary matrix A ∈ {0,1}^(N×N)
    │                        │      A[i][j] = 1 iff i is an ancestor of j (or vice versa)
    │                        │  (c) Sibling binary matrix S ∈ {0,1}^(N×N)
    │                        │      S[i][j] = 1 iff i and j share the same parent
    │                        │  (d) Data-flow adjacency matrix F ∈ {0,1}^(N×N)
    │                        │      F[i][j] = 1 iff there is a data-flow edge from variable i to variable j
    │                        │
    │                        ▼
    │                    Structural Matrices
    │                        │
    └──────────┬─────────────┘
               │
               ▼
    Modified Transformer Forward Pass
        │
        │  Standard: attention_score[i][j] = (q_i · k_j) / √d
        │
        │  With structure (additive bias, as in GraphCodeBERT / SAT):
        │     attention_score[i][j] = (q_i · k_j) / √d + α · structural_bias[i][j]
        │
        │  With structure (masking, as in GraphCodeBERT attention mask):
        │     attention_score[i][j] = -∞  if mask[i][j] = 0
        │     (prevents attending to structurally irrelevant tokens)
        │
        │  With structure (positional encoding, as in Tree-Enhanced CodeBERTa):
        │     input_embedding[i] = token_embedding[i] + position_embedding[i]
        │                          + depth_embedding[depth(i)] + sibling_embedding[sibling_index(i)]
        │
        ▼
    Contextual Representations
        │
        ▼
    Training Objectives
        │  One or more of:
        │  (a) Masked language modeling (standard)
        │  (b) Edge prediction: given two variable nodes, predict whether
        │      a data-flow edge exists between them (GraphCodeBERT [3])
        │  (c) Node alignment: predict which code token corresponds
        │      to which data-flow variable node (GraphCodeBERT [3])
        │  (d) Structure loss: minimize divergence between the model's
        │      learned attention and the AST distance matrix (SAT [10])
        │  (e) Downstream task loss: classification, generation, etc.
        ▼
    Output: Fine-tuned structure-aware transformer model
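The modified attention score computation above can be sketched minimally as follows. This is illustrative only: the real GraphCodeBERT/SAT implementations operate on batched, multi-head tensors, and the argument names here are invented for the sketch.

```python
import numpy as np

def structure_aware_attention(Q, K, V, struct_bias=None, struct_mask=None, alpha=1.0):
    """Scaled dot-product attention with an optional additive structural bias
    and an optional structural mask, mirroring the two variants above."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    if struct_bias is not None:
        scores = scores + alpha * struct_bias          # additive-bias variant
    if struct_mask is not None:
        scores = np.where(struct_mask, scores, -1e9)   # masking variant
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)              # softmax over keys
    return w @ V, w
```

A masked pair (i, j) receives near-zero attention weight, so token i simply cannot draw information from a structurally irrelevant token j.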

6. GPU Memory Analysis

6.1. Why Non-Language Structures Require More Memory

The fundamental reason is sequence length. For transformer-based approaches (A and C), the memory cost of training is dominated by:

  1. Model weights. Same as in the conversational case. An 8B model in NF4 (QLoRA) requires ~4–5 GB. This does not change with input data type [1].

  2. Activations. These scale with both sequence length N and batch size B. During the forward pass, intermediate activations are stored for backpropagation. Without gradient checkpointing, activation memory is O(N × L × d) where L is the number of layers and d is the hidden dimension. With gradient checkpointing, this is reduced by ~50% at a 20–30% compute cost [1].

  3. Attention matrices. Without FlashAttention, each attention head produces an N×N matrix, and there are H heads across L layers. Total attention memory: O(L × H × N²). With FlashAttention, this is reduced to O(N) per head by recomputing attention during the backward pass rather than storing it [9].

  4. Structural matrices (Approach C only). Each structural relationship adds an N×N matrix. If you have R relationship types, this adds O(R × N²) memory.

For the conversational tool-calling case, N ≤ 4,096. For linearized ASTs at function level, N can be 2,000–8,000. For class-level or multi-function structures, N can reach 16,000–32,000. For whole-program representations, N can exceed 64,000.

The quadratic term (N²) in attention is the bottleneck. Doubling the sequence length quadruples the attention memory. This is why moving from 4K-token chat data to 16K-token linearized ASTs can push the hardware requirement from a 24GB consumer GPU to a 40–80GB data-centre GPU, even with the same model and the same QLoRA configuration.
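The scaling relationships can be made concrete with a back-of-envelope calculator. The default constants below are illustrative assumptions for an 8B-class model (32 layers, 32 heads, d=4096, BF16), not measured values.

```python
def attention_matrix_gb(n_tokens, n_layers=32, n_heads=32, bytes_per_elem=2):
    """Memory to materialize full attention matrices WITHOUT FlashAttention:
    O(L * H * N^2). This is the quadratic term that FlashAttention removes."""
    return n_layers * n_heads * n_tokens**2 * bytes_per_elem / 1e9

def activation_gb(n_tokens, n_layers=32, d_model=4096, bytes_per_elem=2,
                  ckpt_factor=0.5, ffn_mult=4):
    """Rough O(N * L * d) activation estimate with gradient checkpointing;
    ffn_mult approximates the feed-forward expansion."""
    per_layer = n_tokens * d_model * (1 + ffn_mult) * bytes_per_elem
    return n_layers * per_layer * ckpt_factor / 1e9
```

Doubling N doubles the activation estimate but quadruples the attention-matrix estimate, which is why FlashAttention is non-negotiable at AST-scale sequence lengths.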

6.2. Memory Scaling Table

The following table estimates peak VRAM during fine-tuning of an 8B model with QLoRA, rank 16, using FlashAttention and gradient checkpointing. Estimates are based on published benchmarks [1][16] and extrapolation from the scaling relationships described above. These are approximate and workload-dependent.

Component                         | N=2K     | N=4K     | N=8K      | N=16K     | N=32K
Model weights (NF4)               | 4.5 GB   | 4.5 GB   | 4.5 GB    | 4.5 GB    | 4.5 GB
LoRA adapters (BF16)              | 0.1 GB   | 0.1 GB   | 0.1 GB    | 0.1 GB    | 0.1 GB
Optimizer states                  | 0.2 GB   | 0.2 GB   | 0.2 GB    | 0.2 GB    | 0.2 GB
Activations (gradient ckpt, B=1)  | ~3 GB    | ~6 GB    | ~12 GB    | ~24 GB    | ~48 GB
Estimated total                   | ~8 GB    | ~11 GB   | ~17 GB    | ~29 GB    | ~53 GB
Minimum GPU                       | RTX 4090 | RTX 4090 | L40S 48GB | A100 80GB | 2× A100 80GB

Key observations:

  • Activation memory is the dominant variable for long sequences. It scales roughly linearly with N when FlashAttention and gradient checkpointing are both active.
  • At N=16K (a plausible sequence length for a class-level linearized AST), the estimated ~29 GB exceeds the RTX 4090’s 24 GB; a 40 GB-class card is the floor, and an A100 80GB provides headroom for larger batches.
  • At N=32K (whole-module representations), a single A100-80GB may not suffice depending on batch size, and multi-GPU training becomes necessary.
  • The model weights themselves are a fixed cost — moving from chat data to structural data does not change this component.

6.3. GNN Memory Profile (Approach B)

GNN memory scaling is fundamentally different. The model is much smaller (millions of parameters, not billions), but the graph data can be large.

For a GNN layer with hidden dimension d, N nodes, and E edges:

  • Node feature storage: O(N × d)
  • Edge computation intermediates: O(E × d) — this is the dominant term [13]
  • For a graph with average degree k: E = N × k, so edge memory is O(N × k × d)
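These terms translate into a simple per-layer estimator. It is a back-of-envelope sketch: real frameworks add optimizer state, multiple layers, batching, and workspace memory on top.

```python
def gnn_layer_memory_gb(num_nodes, num_edges, hidden_dim, bytes_per_elem=4):
    """Per-layer memory: node states O(N*d) plus edge intermediates O(E*d);
    the edge term dominates once average degree exceeds 1."""
    node_mem = num_nodes * hidden_dim * bytes_per_elem
    edge_mem = num_edges * hidden_dim * bytes_per_elem
    return (node_mem + edge_mem) / 1e9
```

For a function-level graph (N=1,000, E=5,000, d=256) this is megabytes per layer; for a whole-program graph (N=100K, E=1M, d=256) it climbs past 1 GB per layer before layers, gradients, and optimizer state multiply it.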

Empirical data: Wang et al. report that for GNNs trained with PyTorch Geometric, “the training and inference time and the memory usage of a GNN layer are mainly affected by the input/output hidden vectors’ dimensions” and scale linearly with hidden dimension [13].

Function-level program graphs (N ≈ 200–1,000 nodes, E ≈ 500–5,000 edges, d = 128–256):

  • Memory: single-digit GB. Comfortably fits on any modern GPU.

Class-level program graphs (N ≈ 5,000–20,000 nodes, E ≈ 20,000–100,000 edges):

  • Memory: 10–30 GB depending on hidden dimension and number of GNN layers.
  • RTX 4090 (24 GB) or A100 40GB.

Whole-program graphs (N ≈ 100K+ nodes, E ≈ 1M+ edges):

  • Dense representation is infeasible. Must use mini-batch sampling (GraphSAGE-style).
  • Amazon’s distributed GNN training scales to billions of edges using CPU-GPU hybrid architectures [17].
  • A100 80GB minimum for single-GPU training with sampling; multi-GPU for full-batch.
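Mini-batch sampling can be sketched as follows: a toy version of GraphSAGE-style neighbour sampling. Real implementations (e.g. PyG's NeighborLoader) do this over compressed sparse structures; the function name and adjacency-dict representation here are invented for the sketch.

```python
import random

def sample_khop(adj, seed_nodes, fanouts):
    """GraphSAGE-style sampling: from each frontier node, keep at most
    fanouts[hop] random neighbours, so per-batch memory is bounded by the
    sample size rather than the full graph."""
    frontier = set(seed_nodes)
    sampled_nodes = set(seed_nodes)
    sampled_edges = []
    for fanout in fanouts:
        next_frontier = set()
        for v in frontier:
            nbrs = adj.get(v, [])
            for u in random.sample(nbrs, min(fanout, len(nbrs))):
                sampled_edges.append((u, v))   # message will flow u -> v
                next_frontier.add(u)
        sampled_nodes |= next_frontier
        frontier = next_frontier
    return sampled_nodes, sampled_edges
```

Each training step then runs the GNN only on the sampled sub-graph, trading exact neighbourhood aggregation for bounded memory.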

7. Key Findings from the Literature

7.1. ASTs Are Not Straightforwardly Better Than Code Tokens

A large-scale empirical study comparing models trained on code tokens vs. AST-based representations across three tasks found that AST-based models consistently performed worse in aggregate [11]. The authors attribute this to the fact that linearization discards structural relationships that cannot be recovered from position alone, while simultaneously increasing sequence length (and thus computational cost and information dilution). However, on subsets of examples involving complex syntactic structures, AST-based models showed advantages [11].

The implication: naive serialization of ASTs into token sequences does not improve over raw source code. Structure must be injected more carefully — through attention biases (Approach C), path-based representations (Strategy 2), or native graph processing (Approach B).

7.2. Data-Flow Graphs Are More Compact and Effective Than Full ASTs

GraphCodeBERT [3] chose data-flow graphs over ASTs deliberately, because data flow encodes semantic-level structure (where values come from) rather than syntactic-level structure (how the code is parsed). The authors note that data-flow graphs are “neat and do not bring an unnecessarily deep hierarchy” compared to ASTs [3]. The model showed state-of-the-art performance on code search, clone detection, code translation, and code refinement tasks when published at ICLR 2021 [3].

7.3. Structure-Aware Attention Improves Over Unmodified Transformers

Both SAT [10] and AST-Transformer [8] demonstrated that incorporating structural information into the attention mechanism improves performance on code tasks. AST-Transformer reported outperforming state-of-the-art baselines while reducing the computational cost of encoding by 90–95% [8] — achieved by using relationship matrices to restrict attention to structurally relevant token pairs, rather than computing full dense attention over all linearized AST tokens.

SAT demonstrated that the improvement is particularly pronounced in low-resource fine-tuning scenarios [10], which is directly relevant to the project’s setting (fine-tuning on potentially small datasets of structural examples).

7.4. GNNs Face Software Ecosystem Constraints

While GNNs are architecturally well-suited to program graph structures, the software ecosystem is less mature than the transformer fine-tuning ecosystem. PyTorch Geometric and DGL are CUDA-optimized and work well on NVIDIA GPUs. Metal/MPS support is limited. The Hugging Face ecosystem (transformers, PEFT, TRL, bitsandbytes) — which provides the turnkey QLoRA fine-tuning pipeline described in the companion document [1] — does not support GNN architectures. Building a GNN training pipeline requires more custom engineering than the transformer fine-tuning path.


8. Practical Recommendations by Scenario

The GPU and approach requirements depend on the granularity of the structural data being processed.

8.1. Function-Level Parse Trees (Hundreds of Nodes)

Recommended approach: Hybrid (Approach C) using a pretrained code model with structure-aware attention, or path-based representation (Strategy 2).

GPU requirement: RTX 4090 (24 GB) is sufficient. Function-level linearized ASTs produce sequences of 500–2,000 tokens, well within the comfortable range. If using a GNN, even a V100 (16 GB) would suffice.

8.2. Class/Module-Level Representations (Thousands of Nodes)

Recommended approach: Hybrid (Approach C) with FlashAttention and gradient checkpointing. Alternatively, GNN-based if the task is graph-native (e.g., vulnerability detection on PDGs).

GPU requirement: A100 40GB or L40S 48GB for transformer approaches (sequences of 4K–16K tokens). RTX 4090 may work for GNN approaches if graphs are sparse.
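In Hugging Face terms, the recommended FlashAttention + gradient checkpointing + QLoRA combination is largely a matter of configuration. The sketch below assumes transformers, peft, and bitsandbytes are installed with a FlashAttention-2 build; the model id and hyperparameters are placeholders, not recommendations.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",                # NF4 quantization (QLoRA)
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "some-org/8b-code-model",                 # placeholder model id
    quantization_config=bnb,
    attn_implementation="flash_attention_2",  # avoids materializing N x N attention
)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

args = TrainingArguments(
    output_dir="out",
    gradient_checkpointing=True,    # trade ~20-30% compute for activation memory
    per_device_train_batch_size=1,  # long sequences force small per-device batches
    gradient_accumulation_steps=16,
    bf16=True,
)
```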

8.3. Whole-Program Data-Flow Graphs (Tens of Thousands of Nodes)

Recommended approach: GNN-based (Approach B) with mini-batch sampling, or partitioned hybrid approaches that process sub-graphs sequentially.

GPU requirement: A100 80GB minimum. Multi-GPU for graphs exceeding 100K nodes. This is where the M5 Ultra Mac Studio (projected 256–512 GB unified memory) becomes relevant, though the GNN software ecosystem on Apple Silicon is a limiting factor.


Bibliography

[1] Hu, E.J. et al. “LoRA: Low-Rank Adaptation of Large Language Models.” Proceedings of ICLR 2022. arXiv:2106.09685, 2021. https://arxiv.org/abs/2106.09685 Also referenced for the GPU memory budgeting framework described in the project document fine_tuning_tool_calling_guide.md and the companion chat on GPU requirements for conversational tool-calling fine-tuning.

[2] Tree-sitter. Multi-language incremental parser. https://tree-sitter.github.io/tree-sitter/

[3] Guo, D. et al. “GraphCodeBERT: Pre-training Code Representations with Data Flow.” Proceedings of ICLR 2021. arXiv:2009.08366. [Verified] https://arxiv.org/abs/2009.08366

[4] Alon, U. et al. “code2vec: Learning Distributed Representations of Code.” Proc. ACM Program. Lang. (POPL), 2019. [Snippet — training times verified from GitHub README] https://github.com/tech-srl/code2vec

[5] Alon, U. et al. “code2seq: Generating Sequences from Structured Representations of Code.” Proceedings of ICLR 2019. [Snippet] https://github.com/tech-srl/code2seq

[6] Kipf, T.N. and Welling, M. “Semi-Supervised Classification with Graph Convolutional Networks.” Proceedings of ICLR 2017. arXiv:1609.02907. https://arxiv.org/abs/1609.02907

[7] Allamanis, M. et al. “Learning to Represent Programs with Graphs.” Proceedings of ICLR 2018. arXiv:1711.00740. https://arxiv.org/abs/1711.00740

[8] Tang, Z. et al. “AST-Transformer: Encoding Abstract Syntax Trees Efficiently for Code Summarization.” Proceedings of ASE 2022 / ICSE 2022. arXiv:2112.01184. [Snippet] https://arxiv.org/abs/2112.01184

[9] Dao, T. et al. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” Proceedings of NeurIPS 2022. Discussed in context of long-sequence training: Hazy Research blog. [Snippet] https://hazyresearch.stanford.edu/blog/2023-01-12-flashattention-long-sequences

[10] (Authors). “SAT: Structure-aware Fine-tuning for Code Pre-trained Models.” arXiv:2404.07471, April 2024. [Snippet] https://arxiv.org/html/2404.07471

[11] (Authors). “Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We?” arXiv:2312.00413, December 2023. [Snippet] https://arxiv.org/html/2312.00413v1

[12] Kovalenko, V. et al. “PathMiner: A Library for Mining of Path-Based Representations of Code.” Proceedings of MSR 2019. JetBrains Research. https://github.com/JetBrains-Research/astminer

[13] Wang, Y. et al. “Empirical analysis of performance bottlenecks in graph neural network training and inference with GPUs.” Neurocomputing, 446:165–191, 2021. [Snippet] https://www.sciencedirect.com/science/article/abs/pii/S0925231221003659

[14] NVIDIA. “Graph Neural Network (GNN) Frameworks.” NVIDIA Developer. [Snippet] https://developer.nvidia.com/gnn-frameworks

[15] (Authors). “Seamlessly Integrating Tree-Based Positional Embeddings into Transformer Models for Source Code Representation.” arXiv:2507.04003, July 2025. [Snippet] https://arxiv.org/html/2507.04003

[16] Luo, C. et al. “Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training.” arXiv:2407.15892, July 2024. [Snippet] https://arxiv.org/html/2407.15892v1

[17] Zheng, D. et al. “Scaling graph-neural-network training with CPU-GPU clusters.” Amazon Science / KDD, 2024. [Snippet] https://www.amazon.science/blog/scaling-graph-neural-network-training-with-cpu-gpu-clusters

[18] Wang, S. et al. “An Extensive Study of the Structure Features in Transformer-based Code Intelligence.” Proceedings of ICPC 2023. [Snippet] https://shangwenwang.github.io/files/ICPC-23B.pdf

[19] Eisner, J.D. “InAttention: Linear-Scaling Attention for Transformers.” arXiv:2410.07063, October 2024. [Snippet] https://arxiv.org/pdf/2410.07063

[20] Dettmers, T. et al. “QLoRA: Efficient Finetuning of Quantized LLMs.” Proceedings of NeurIPS 2023. arXiv:2305.14314. https://arxiv.org/abs/2305.14314
