Chunk Size Is Not 512 Tokens: A Tutorial on RAG Chunking Parameters
The Common Claim and Why It’s Wrong
A widespread rule of thumb in the RAG literature holds that 512 tokens is the optimal chunk size. This claim appears in blog posts, framework documentation, and even some conference papers. It is, at best, a useful default. At worst, it is a misunderstanding that conflates an architectural constraint of the embedding model with an empirical optimum for retrieval quality.
The 512-token number has a specific origin: it is the maximum input sequence length of BERT [1] and most BERT-derived embedding models, including BGE [2], E5 [3], and the original all-MiniLM-L6-v2 [4]. These models were pretrained with a positional encoding scheme that supports at most 512 tokens. Any input exceeding this limit is silently truncated. The 512-token figure is therefore a ceiling, not an optimum — it is the point beyond which information is lost due to model architecture, not the point at which retrieval quality is maximized.
Recent empirical work shows that the relationship between chunk size and retrieval performance is task-dependent, domain-dependent, and embedding-model-dependent. There is no single optimal chunk size.
The Empirical Evidence
Bhat et al. (2025): Chunk size depends on answer type
The most systematic recent study is Bhat et al.’s multi-dataset analysis from Fraunhofer IAIS [5]. They evaluated fixed-size chunking at 64, 128, 256, 512, and 1024 tokens across six QA datasets using two embedding models (Stella 1.5B and Snowflake Arctic Embed).
Their findings directly contradict the “512 is optimal” claim:
- On SQuAD (short, entity-based answers averaging 3.9 tokens), the smallest chunks performed best. Recall@1 was 64.2% at 64 tokens and dropped to 49.8% at 512 tokens — a 14.4-point degradation from using larger chunks [5].
- On TechQA (long, explanation-heavy answers averaging 46.9 tokens), the pattern reversed. Recall@1 was 4.9% at 64 tokens and rose to 61.4% at 512 tokens [5].
- On NarrativeQA (dispersed answers across long literary texts), recall continued improving through 1024 tokens, suggesting even 512 was insufficient [5].
The governing variable is not chunk size per se but the answer locality of the dataset — whether the information needed to answer a query is concentrated in a few tokens or dispersed across a broader passage. Datasets with short, factoid answers favor small chunks because the embedding can precisely match a compact, topically focused passage. Datasets requiring contextual understanding favor large chunks because the embedding needs enough surrounding text to capture the relevant semantics.
Amiri et al. (2025): Recursive chunking at 100 tokens wins in chemistry
A systematic evaluation across 25 chunking configurations and 48 embedding models for chemistry-focused RAG found that recursive token-based chunking at 100 tokens with zero overlap (denoted R100-0) consistently outperformed larger configurations, including 512 and 1024 token variants [6]. This result is striking because chemistry text is highly technical and domain-specific — exactly the kind of content where one might expect larger context windows to help. The authors attribute the result to the retrieval-optimized embedding models (Nomic, E5 variants) being particularly effective at matching fine-grained chemical entities when given focused, non-diluted chunks [6].
Prabha et al. (2025): Adaptive chunking outperforms fixed-size in clinical text
A peer-reviewed clinical decision support study from Mayo Clinic found that adaptive chunking — which aligns boundaries to logical topic shifts using sentence-level semantic similarity — achieved 87% medical accuracy compared to 50% for a fixed-size baseline, a statistically significant difference (p = 0.001) [7]. The adaptive chunks had a target span of approximately 500 words (~600-700 tokens) but varied based on content. The key insight is that respecting semantic boundaries mattered more than any specific token count.
Barnett et al. (2025): Structure-aware chunking across six domains
The most recent large-scale benchmark (March 2026) evaluated 36 chunking strategies across six knowledge domains including legal text, using five embedding models [8]. They found that structurally informed and adaptive strategies frequently outperformed fixed-size baselines, but the gains interacted with the embedding model used. Their central conclusion was that chunking should be treated as a first-class design dimension, not an implementation detail, and that the choice interacts non-linearly with embedding architecture [8].
Choudhary et al. (2024): Context window utilization
Choudhary et al. introduced the concept of context window utilization — the proportion of the LLM’s context window actually occupied by retrieved chunks [9]. They tested chunk sizes of 128, 256, 512, and 1024 across academic papers, legal documents, and Wikipedia articles using Llama 3 70B and Mixtral 8x7B. Their finding was that the optimal chunk size depended on the number of chunks retrieved: with 7-9 chunks retrieved, sizes of 512 and 1024 tokens yielded the highest similarity scores for Llama 3, but the results were inconsistent for Mixtral [9]. The implication is that chunk size cannot be optimized independently of the retrieval count (top-k) and the generation model’s context handling.
The Three Constraints That Bound Chunk Size
Chunk size is bounded from above and below by three independent constraints. Understanding these constraints is more useful than memorizing a specific number.
Constraint 1: The embedding model’s input window
Every embedding model has a maximum input length. Text exceeding this limit is truncated, and the truncated portion contributes nothing to the embedding. For BERT-family models (BGE, E5, all-MiniLM), this limit is 512 tokens [1]. For nomic-embed-text-v1, the trained context length is 2048 tokens, though the MTEB evaluation truncates at 512 [10]. For nomic-embed-text-v1.5, the context length extends to 8192 tokens with rotary position encoding interpolation [11].
This means that if you use a 512-token-limit model, your chunks cannot meaningfully exceed 512 tokens. But it does not mean 512 is optimal — it means 512 is the maximum beyond which you are wasting text.
If you use a model with a longer context window (nomic-embed-text-v1.5 at 8192, or jina-embeddings-v2 at 8192), the ceiling rises. Whether you should use chunks that large is a separate question.
Constraint 2: Embedding dilution
The embedding model compresses an entire chunk into a single fixed-dimensional vector (768 dimensions for nomic-embed-text, 384 for MiniLM-L6-v2). This compression is inherently lossy — the longer the input text, the more information must be discarded to fit into the same vector dimensionality. This is the “dilution” problem: a 512-token chunk covering three topics will produce a vector that represents none of them well, because the single vector must be a compromise across all three topics [12].
Smaller chunks reduce dilution. A 128-token chunk tightly focused on one statutory provision will produce an embedding that strongly matches queries about that provision. A 512-token chunk spanning three provisions will produce an embedding that weakly matches queries about any of them.
This is the lower-bound argument against large chunks. It explains why Bhat et al. found 64-token chunks optimal for SQuAD (short, entity-focused answers) and why Amiri et al. found 100-token chunks optimal for chemistry (entity-focused retrieval) [5][6].
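The dilution effect can be illustrated with a toy calculation. The vectors below are hand-made 4-dimensional stand-ins, not the output of any real embedding model; the point is only that a single vector averaged across several topics matches each topic more weakly than a focused vector does.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-d "embeddings" for three distinct topics (illustrative stand-ins).
topic_a = [1.0, 0.0, 0.0, 0.0]
topic_b = [0.0, 1.0, 0.0, 0.0]
topic_c = [0.0, 0.0, 1.0, 0.0]

# A chunk spanning all three topics behaves roughly like a mixture of them.
mixed = [(x + y + z) / 3 for x, y, z in zip(topic_a, topic_b, topic_c)]

query = topic_a  # a query squarely about topic A
print(cosine(query, topic_a))  # 1.0   (focused chunk: strong match)
print(cosine(query, mixed))    # ~0.58 (multi-topic chunk: diluted match)
```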
Constraint 3: Context loss
The argument for larger chunks is context. A chunk that says “subject to the conditions in subsection (d)” is useless in isolation — you need subsection (d) in the same chunk for the retrieval to be useful. Legal text, technical documentation, and narrative prose all exhibit this property: meaning is distributed across sentences and paragraphs, not concentrated in individual words.
Very small chunks (64-128 tokens) frequently split sentences, separate cross-references from their referents, and detach defined terms from their definitions. This degrades answer quality even when retrieval precision (by token overlap) is nominally high, because the retrieved chunk lacks the context the LLM needs to generate a correct answer.
The Bhat et al. data shows this clearly: on NarrativeQA, Recall@1 was 4.2% at 64 tokens and rose monotonically to 10.7% at 1024 tokens. The answer spans in these documents are long and contextual — small chunks simply cannot capture them [5].
Overlap
Overlap controls how much text is shared between adjacent chunks. A chunk overlap of 100 tokens means the last 100 tokens of chunk N are also the first 100 tokens of chunk N+1. The purpose is to mitigate information loss at chunk boundaries.
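The sliding-window mechanics can be sketched in a few lines. Integer "tokens" stand in for real tokenizer output purely for illustration; a real pipeline would tokenize with the embedding model's own tokenizer.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Fixed-size chunking where the last `overlap` tokens of each
    chunk are repeated at the start of the next one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

tokens = list(range(10))  # placeholder token ids
print(chunk_with_overlap(tokens, chunk_size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Note that each boundary token (3 and 6 here) appears in two chunks, which is exactly the duplicate-retrieval cost discussed below.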
There is remarkably little rigorous empirical work specifically on overlap. The standard recommendation of 10-20% of chunk size appears to originate from practitioner blog posts and framework defaults rather than from controlled experiments [13][14]. The most concrete finding is negative: Amiri et al. found that their best chemistry configuration used zero overlap (R100-0), outperforming configurations with 20-50 token overlaps [6].
The theoretical argument for overlap is straightforward: if a sentence is split across a chunk boundary, the overlap ensures both chunks contain the full sentence. For legal text, where sentences routinely exceed 100 words, this is a real concern. The counterargument is that overlap increases the total number of chunks (and therefore index size and query time) and can cause the same passage to be retrieved twice in different chunks, wasting context window capacity in the generation prompt.
A reasonable starting point is 10-20% of chunk size. For a 450-token chunk, that is 45-90 tokens. But the honest recommendation is to treat overlap as a tunable parameter and evaluate it empirically against your retrieval metrics [14].
Splitting Strategy
The choice of where to split matters at least as much as the size of the split. The four major strategies form a hierarchy of increasing sophistication:
Fixed-size splitting divides text into uniform-length segments regardless of content boundaries. It is reproducible and fast but regularly splits mid-sentence and mid-paragraph. This is the baseline against which other strategies are measured.
Recursive character splitting (LangChain’s RecursiveCharacterTextSplitter) tries to split on natural boundaries in priority order: by default double newlines, then single newlines, then spaces, finally individual characters. If a section exceeds the target chunk size, it splits at the highest-priority boundary that fits. For well-formatted text, this approximates paragraph-level splitting. The Chroma technical report found that RecursiveCharacterTextSplitter at 400 tokens achieved 88.1-89.5% recall across their test corpora [15].
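The priority-order idea can be sketched in pure Python. This is a simplified illustration of the algorithm, not LangChain's actual implementation: it measures length in characters rather than tokens, and it greedily re-merges small pieces up to the limit.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", " ")):
    """Sketch of priority-order splitting: split on the highest-priority
    separator present, greedily re-merge pieces up to max_len, and recurse
    into any piece that is still too long."""
    if len(text) <= max_len:
        return [text]
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, buf = [], ""
        for piece in text.split(sep):
            candidate = buf + sep + piece if buf else piece
            if len(candidate) <= max_len:
                buf = candidate
            else:
                if buf:
                    chunks.append(buf)
                if len(piece) > max_len:
                    # Piece exceeds the limit on its own: try the
                    # lower-priority separators.
                    chunks.extend(recursive_split(piece, max_len, separators[i + 1:]))
                    buf = ""
                else:
                    buf = piece
        if buf:
            chunks.append(buf)
        return chunks
    # No separator applies: fall back to a hard character split.
    return [text[j:j + max_len] for j in range(0, len(text), max_len)]

doc = "Short intro.\n\nA second paragraph that is longer.\n\nThe end."
print(recursive_split(doc, max_len=40))
# ['Short intro.', 'A second paragraph that is longer.', 'The end.']
```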
Structure-aware splitting exploits the document’s own organization. LangChain’s MarkdownHeaderTextSplitter splits on heading boundaries, keeping each section together. This is particularly powerful for legal text, where section numbers and headings correspond to meaningful conceptual units. A chunk containing the entirety of “§ 3583(d) — Conditions of supervised release” is far more useful than a chunk containing the second half of § 3583(c) and the first half of § 3583(d). The Barnett et al. benchmark found that structurally informed strategies frequently outperformed fixed-size baselines, though the magnitude of improvement depended on the embedding model [8].
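A minimal version of heading-based splitting is easy to sketch. This simplified version groups lines under each heading and keeps the heading inside the chunk; LangChain's MarkdownHeaderTextSplitter additionally records the full heading hierarchy as metadata. The statute text below is an abbreviated placeholder.

```python
def split_on_headings(markdown_text):
    """Group a markdown document into one chunk per heading section,
    so each section travels with its own heading."""
    chunks, current = [], []
    for line in markdown_text.splitlines():
        if line.lstrip().startswith("#") and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

statute = (
    "## § 3583(c) Factors considered\n"
    "The court shall consider the factors ...\n"
    "## § 3583(d) Conditions of supervised release\n"
    "The court shall order a condition ...\n"
)
for chunk in split_on_headings(statute):
    print(chunk, end="\n---\n")
```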
Adaptive/semantic splitting uses embedding similarity between consecutive sentences to detect topic shifts. When similarity drops below a threshold, a new chunk begins. The Prabha et al. clinical study found this approach dramatically outperformed fixed-size baselines [7]. However, semantic chunking requires embedding every sentence during preprocessing, adding computational cost.
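The core loop of semantic splitting can be sketched as follows. To keep the sketch self-contained, a bag-of-words cosine stands in for a real sentence-embedding model; the threshold value is illustrative, not recommended.

```python
import math
from collections import Counter

def bow_vector(sentence):
    """Bag-of-words counts as a crude stand-in for a sentence embedding."""
    return Counter(sentence.lower().replace(".", "").split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_split(sentences, threshold=0.2):
    """Start a new chunk whenever similarity between consecutive
    sentences falls below the threshold (taken as a topic shift)."""
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(bow_vector(prev), bow_vector(cur)) < threshold:
            chunks.append([cur])
        else:
            chunks[-1].append(cur)
    return [" ".join(c) for c in chunks]

sentences = [
    "The court may impose conditions of supervised release.",
    "Any conditions of supervised release must be reasonably related.",
    "Embedding models compress each chunk into one vector.",
]
print(semantic_split(sentences))  # the third sentence starts a new chunk
```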
For legal documents specifically, structure-aware splitting on heading boundaries is likely the strongest first approach, because legal documents have explicit, hierarchical structure that corresponds to topical boundaries. This is an empirical hypothesis, not a proven fact — it should be validated against your specific corpus.
Implications for Legal Text
Legal documents have several properties that affect chunking:
Long sentences. Legal drafting routinely produces sentences exceeding 100 words. A 128-token chunk risks splitting these mid-sentence, producing fragments that are grammatically incomplete and semantically ambiguous.
Cross-referencing. Statutes reference other sections (“subject to subsection (d),” “as defined in § 3156(a)(4)”). If the chunk boundary separates the reference from its referent, the retrieved chunk is incomplete.
Defined terms. Terms are defined once and used throughout. A chunk containing “the defendant shall comply with the standard conditions” is meaningless without knowing what the standard conditions are.
Hierarchical structure. Statutes, rules, and regulations have explicit section numbering. This structure provides natural chunk boundaries that align with semantic content.
These properties argue for:
- Chunks closer to 450-512 tokens than 128-256 tokens, to avoid splitting long sentences and cross-references.
- Structure-aware splitting on section/heading boundaries, to align chunks with the document’s logical organization.
- Overlap of ~100 tokens (~20%), to ensure boundary-spanning sentences appear in both chunks.
- Empirical evaluation on representative legal queries, because the theoretical arguments above are informed guesses, not proven optima.
The Reuter et al. study on legal RAG specifically found that prepending a short document-level summary to each chunk (Summarization-Augmented Chunking) improved retrieval precision on legal datasets, outperforming even expert-guided legal summarization [16]. This is a low-cost intervention worth testing.
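Mechanically, the intervention is just string concatenation before embedding. The sketch below assumes the document-level summary already exists (in [16] it is LLM-generated); the summary string here is a made-up placeholder.

```python
def summarization_augmented_chunks(chunks, doc_summary):
    """Prepend a short document-level summary to every chunk before
    embedding, so each chunk carries global context into retrieval."""
    return [f"{doc_summary}\n\n{chunk}" for chunk in chunks]

# Placeholder summary and abbreviated chunks, for illustration only.
summary = "Summary: 18 U.S.C. § 3583, supervised release after imprisonment."
chunks = [
    "(d) The court shall order, as an explicit condition ...",
    "(e) The court may terminate a term of supervised release ...",
]
for c in summarization_augmented_chunks(chunks, summary):
    print(c, end="\n---\n")
```

The augmented text is what gets embedded and indexed; at generation time you can pass either the augmented or the original chunk to the LLM.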
The Correct Approach: Empirical Evaluation
The literature converges on one point: optimal chunk parameters cannot be determined analytically. They must be evaluated empirically against your specific corpus, queries, and embedding model.
The evaluation loop:
- Build an evaluation set of 30-50 representative queries with known relevant passages.
- Index the corpus with a candidate chunk configuration.
- For each query, retrieve top-k chunks and measure whether the relevant passages appear.
- Repeat across configurations (varying chunk size, overlap, strategy).
- Select the configuration with the best retrieval metrics.
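The loop above reduces to a small scoring harness once retrieval runs exist. The data below is made up, and the configuration names are hypothetical labels; only the Recall@k bookkeeping is the point.

```python
def recall_at_k(runs, relevant, k):
    """Fraction of queries whose known-relevant chunk id appears in the
    top-k retrieved ids."""
    hits = sum(1 for got, want in zip(runs, relevant) if want in got[:k])
    return hits / len(relevant)

# Made-up retrieval runs: for each candidate configuration, the ranked
# chunk ids returned per query; `relevant` holds the gold chunk per query.
relevant = ["c1", "c7", "c3"]
runs_by_config = {
    "256tok/0ov":   [["c1", "c2"], ["c9", "c7"], ["c4", "c5"]],
    "450tok/100ov": [["c1", "c2"], ["c7", "c9"], ["c3", "c8"]],
}

scores = {cfg: recall_at_k(runs, relevant, k=2)
          for cfg, runs in runs_by_config.items()}
best = max(scores, key=scores.get)
print(scores)  # per-configuration Recall@2
print(best)    # '450tok/100ov' wins on this toy data
```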
RAGAS [17] provides automated metrics (faithfulness, answer relevancy, context precision, context recall) that can drive this evaluation without requiring fully manual relevance judgments, though its LLM-as-judge approach introduces its own biases.
The practical implication: start with 450 tokens, 100 overlap, markdown-header-aware splitting as a reasonable first configuration for legal text. Then evaluate, adjust, and re-evaluate. The tooling to automate this loop is the most valuable investment in a RAG system.
Summary
| Parameter | Governs | Bounded by | Starting point for legal text |
|---|---|---|---|
| Chunk size | Context vs. precision tradeoff | Embedding model max tokens (ceiling), semantic coherence (floor) | 450 tokens |
| Overlap | Boundary information loss | Index size, duplicate retrieval (ceiling), sentence length (floor) | 100 tokens (~20%) |
| Splitting strategy | Alignment with document structure | Document format, preprocessing cost | Markdown header-aware |
| Top-k | Context window utilization | LLM context window, noise tolerance | 5 |
None of these starting points is a proven optimum. All of them are tunable. The correct values for your corpus are the ones that score best on your evaluation set.
Bibliography
[1] Devlin, J., Chang, M., Lee, K. & Toutanova, K. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” Proceedings of NAACL-HLT 2019, pp. 4171–4186. https://aclanthology.org/N19-1423/
[2] Xiao, S., Liu, Z., Zhang, P. & Muennighoff, N. “C-Pack: Packaged Resources to Advance General Chinese Embedding.” arXiv:2309.07597 (2023). https://arxiv.org/abs/2309.07597
[3] Wang, L. et al. “Text Embeddings by Weakly-Supervised Contrastive Pre-training.” arXiv:2212.03533 (2022). https://arxiv.org/abs/2212.03533
[4] Reimers, N. & Gurevych, I. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.” Proceedings of EMNLP 2019. https://arxiv.org/abs/1908.10084
[5] Bhat, S.R., Rudat, M., Spiekermann, J. & Flores-Herr, N. “Rethinking Chunk Size for Long-Document Retrieval: A Multi-Dataset Analysis.” arXiv:2505.21700 (May 2025). https://arxiv.org/html/2505.21700v2
[6] Amiri, M. & Bocklitz, T. “Chunk Twice, Embed Once: A Systematic Study of Segmentation and Representation Trade-offs in Chemistry-Aware Retrieval-Augmented Generation.” arXiv:2506.17277 (June 2025). https://arxiv.org/html/2506.17277v1
[7] Prabha, S. et al. “Comparative Evaluation of Advanced Chunking for Retrieval-Augmented Generation in Large Language Models for Clinical Decision Support.” Bioengineering 12(11): 1194 (November 2025). https://pmc.ncbi.nlm.nih.gov/articles/PMC12649634/
[8] Barnett, A. et al. “A Systematic Investigation of Document Chunking Strategies and Embedding Sensitivity.” arXiv:2603.06976 (March 2026). https://arxiv.org/html/2603.06976
[9] Choudhary, A. et al. “Introducing a New Hyper-parameter for RAG: Context Window Utilization.” arXiv:2407.19794 (August 2024). https://arxiv.org/html/2407.19794v2
[10] Nussbaum, Z. et al. “Nomic Embed: Training a Reproducible Long Context Text Embedder.” arXiv:2402.01613 (February 2024). https://arxiv.org/html/2402.01613v2
[11] Nomic AI. “nomic-embed-text-v1.5 Model Card.” Hugging Face (2024). https://huggingface.co/nomic-ai/nomic-embed-text-v1.5
[12] Unstructured. “Chunking for RAG: Best Practices.” (2025). https://unstructured.io/blog/chunking-for-rag-best-practices
[13] Weaviate. “Chunking Strategies to Improve LLM RAG Pipeline Performance.” (September 2025). https://weaviate.io/blog/chunking-strategies-for-rag
[14] Firecrawl. “Best Chunking Strategies for RAG in 2025.” (October 2025). https://www.firecrawl.dev/blog/best-chunking-strategies-rag
[15] Chroma Research. “Evaluating Chunking Strategies for Retrieval.” (July 2024). https://research.trychroma.com/evaluating-chunking
[16] Reuter, M. et al. “Towards Reliable Retrieval in RAG Systems for Large Legal Datasets.” Proceedings of the Natural Legal Language Processing Workshop 2025, pp. 17–30. https://arxiv.org/html/2510.06999v1
[17] VibrantLabs. “Ragas: Supercharge Your LLM Application Evaluations.” GitHub (2024). https://github.com/vibrantlabsai/ragas