A Technical Primer for AI Engine Development
March 2026
1. Introduction
Large language models store factual knowledge implicitly in their parameters, learned during pretraining on massive text corpora. This creates three interrelated problems. First, the knowledge is static — frozen at pretraining time and expensive to update. Second, it is unreliable — the model cannot distinguish what it knows from what it is confabulating. Third, it is opaque — there is no way to trace a specific claim back to a specific source document [1].
Retrieval-Augmented Generation (RAG) addresses these problems by combining a language model’s generative capabilities with an explicit retrieval mechanism that fetches relevant documents from an external corpus at inference time. Instead of relying solely on parametric memory (knowledge encoded in weights), the system also consults non-parametric memory (an external document store) before generating a response [1]. The retrieved documents are injected into the model’s context window, grounding the generation in specific source material.
The term “Retrieval-Augmented Generation” and the acronym RAG were coined by Lewis et al. in a 2020 paper from Facebook AI Research (now Meta FAIR), published at NeurIPS [1]. The paper introduced two model variants — RAG-Sequence and RAG-Token — that combined a pretrained seq2seq model (BART) with a dense vector index of Wikipedia, accessed via a pretrained neural retriever (DPR) [1]. The system set state-of-the-art results on three open-domain question answering benchmarks and generated more factual language than parametric-only baselines [1].
Since 2020, “RAG” has expanded from a specific model architecture into a general design pattern used across the industry. The core idea — retrieve relevant context, then generate conditioned on that context — now underpins products from Perplexity to Microsoft 365 Copilot to enterprise knowledge bases. This document describes the architecture in detail.
2. Historical Context
RAG did not emerge in isolation. It sits at the intersection of two research traditions: open-domain question answering (ODQA) and neural information retrieval.
2.1 Open-Domain Question Answering
The retrieve-then-read paradigm for QA dates to Chen et al.’s DrQA (2017), which combined a TF-IDF retriever with a neural reader to answer questions from Wikipedia [2]. The retriever selected candidate passages using sparse keyword matching; the reader extracted answer spans from those passages. This established the two-stage pipeline that RAG generalizes.
2.2 REALM: Retrieval During Pretraining
In February 2020 — three months before the RAG paper — Guu et al. at Google published REALM (Retrieval-Augmented Language Model Pre-Training) [3]. REALM integrated retrieval directly into the pretraining process: the model learned to retrieve documents from Wikipedia using masked language modeling as the training signal, backpropagating through the retrieval step across millions of documents [3]. REALM outperformed all previous methods on open-domain QA benchmarks by 4–16% absolute accuracy [3]. Where RAG fine-tunes a pretrained retriever and generator jointly on downstream tasks, REALM trains the retriever from scratch during pretraining itself — a more tightly coupled but more computationally expensive approach.
2.3 Dense Passage Retrieval (DPR)
The RAG paper depended on DPR, published by Karpukhin et al. in 2020 at EMNLP [4]. DPR replaced sparse retrieval (TF-IDF, BM25) with dense retrieval using dual BERT encoders — one for queries, one for passages — trained on question-passage pairs. DPR outperformed BM25 by 9–19% absolute in top-20 passage retrieval accuracy on open-domain QA datasets [4]. This result demonstrated that learned dense representations could practically replace keyword-based retrieval for knowledge-intensive tasks, providing the retrieval component that RAG built upon.
2.4 The Transition from Model to Pattern
The original RAG paper described a specific model with a differentiable retrieval step — the retriever and generator were jointly fine-tuned end-to-end [1]. Modern usage of “RAG” typically refers to a looser architecture: a retrieval pipeline (possibly non-differentiable) that fetches documents, which are then concatenated into the prompt of a separately trained LLM. This “retrieve-and-stuff” pattern is simpler to engineer, works with any LLM, and is how the vast majority of production RAG systems operate. The distinction matters: the original RAG had end-to-end gradient flow through retrieval; modern RAG typically does not.
3. Glossary of Key Terms
Approximate Nearest Neighbor (ANN). A class of algorithms that find vectors approximately closest to a query vector, trading exactness for speed. ANN indexes (HNSW, IVF, product quantization) are essential for searching millions to billions of embeddings in milliseconds rather than seconds [5].
Bi-encoder. An architecture where the query and each candidate document are encoded independently by separate (or shared) encoder models, producing fixed-length vector representations. Similarity is computed via dot product or cosine similarity between these pre-computed vectors. Bi-encoders are fast (documents can be pre-embedded and indexed) but less accurate than cross-encoders because the query and document never “see” each other during encoding [4][6].
BM25 (Best Match 25). A probabilistic sparse retrieval function that scores documents based on term frequency, inverse document frequency, and document length normalization. BM25 is the standard baseline for information retrieval and remains competitive with or complementary to dense retrieval methods [7].
Chunking. The process of dividing source documents into smaller segments (chunks) suitable for embedding and retrieval. Chunk size, overlap, and splitting strategy are critical tuning parameters that affect retrieval quality. See the project’s companion document (chunking-parameters-tutorial.md) for detailed analysis.
Context window. The maximum number of tokens an LLM can process in a single forward pass. Retrieved chunks compete with the user query, system prompt, and conversation history for space in the context window. Typical context windows range from 4K to 128K+ tokens depending on the model.
Cross-encoder. An architecture where the query and a candidate document are concatenated and processed jointly through a single transformer, allowing full cross-attention between query and document tokens. Cross-encoders produce more accurate relevance scores than bi-encoders but cannot pre-compute document representations — each query-document pair requires a separate forward pass, making them expensive at scale [6][8].
Dense retrieval. Retrieval using learned continuous vector representations (embeddings) of queries and documents, with similarity measured by dot product or cosine similarity in the embedding space. Dense retrieval captures semantic similarity (e.g., matching “bad guy” to “villain”) that sparse methods miss [4].
Embedding. A fixed-length continuous vector (typically 384–1536 dimensions) representing a text passage. Embedding models are trained so that semantically similar texts have high cosine similarity in the vector space.
Embedding model. A neural network (typically a transformer encoder) trained to produce embeddings. Common architectures derive from BERT [9] or its descendants. Examples include Sentence-BERT [6], E5 [10], BGE, nomic-embed-text, and OpenAI’s text-embedding models. The choice of embedding model determines the quality ceiling for dense retrieval.
HNSW (Hierarchical Navigable Small World). A graph-based ANN index algorithm that constructs a multi-layer proximity graph for efficient nearest neighbor search. HNSW provides high recall with sub-millisecond search times and is the default index type in many vector databases [5][11].
Hybrid retrieval. Combining sparse retrieval (BM25) and dense retrieval results, typically using Reciprocal Rank Fusion (RRF) or a learned merging function. Hybrid retrieval captures both lexical matches (exact terms, acronyms) and semantic matches (paraphrases, synonyms).
IVF (Inverted File Index). An ANN indexing method that partitions the vector space into Voronoi cells using k-means clustering. At search time, only the cells nearest to the query are searched, reducing the number of comparisons [5].
Non-parametric memory. Knowledge stored explicitly in an external data structure (document store, vector index) that can be updated without retraining the model. Contrasted with parametric memory — knowledge encoded in model weights [1].
Product quantization (PQ). A vector compression technique that divides each vector into subvectors and quantizes each independently, reducing memory by 4–32x while maintaining approximate distance computation capability [5].
Reranking. A second-stage scoring pass where a more expensive model (typically a cross-encoder) re-scores and reorders the top-k results from the initial retrieval. Reranking improves precision at the cost of additional latency [8].
Sparse retrieval. Retrieval using term-frequency-based representations (bag-of-words, TF-IDF, BM25). Documents and queries are represented as sparse vectors over the vocabulary, with similarity computed via weighted term overlap [7].
Vector database. A database system optimized for storing, indexing, and searching high-dimensional vector embeddings. Examples include Pinecone, Weaviate, Qdrant, Milvus, Chroma, and pgvector (PostgreSQL extension). Most use FAISS or an HNSW implementation internally for the ANN search layer.
4. The Canonical RAG Pipeline
A production RAG system consists of two phases: an offline indexing phase that processes and stores documents, and an online query phase that retrieves and generates. The following describes the data flow through each.
4.1 Offline Indexing Phase
Source Documents (.pdf, .docx, .html, .md, database records, ...)
│
▼
Document Loading
│ Parse raw files into text
│ Extract metadata (title, date, source, section headings)
│ Handle tables, images, headers, footers
│
▼
Chunking / Splitting
│ Divide text into segments of target size (e.g., 450 tokens)
│ Apply splitting strategy:
│ - Fixed-size (naive, fast, splits mid-sentence)
│ - Recursive character splitting (respects paragraph/sentence boundaries)
│ - Structure-aware (splits on headings, section numbers)
│ - Semantic (splits on topic shifts detected by embedding similarity)
│ Apply overlap (e.g., 100 tokens shared between adjacent chunks)
│ Attach metadata to each chunk (source document, page, section, etc.)
│
▼
Embedding
│ For each chunk:
│ Feed text through embedding model
│ Produce a fixed-length vector (e.g., 768 dimensions)
│ Embedding models: Sentence-BERT [6], E5 [10], BGE, nomic-embed-text, etc.
│ Batch processing for throughput
│
▼
Indexing
│ Store (vector, chunk_text, metadata) tuples in a vector database
│ Build ANN index (HNSW, IVF+PQ, etc.) over the vectors
│ Optionally build a parallel BM25 index over chunk text for hybrid retrieval
│
▼
Vector Store (Chroma, Pinecone, Weaviate, Qdrant, pgvector, FAISS, ...)
Persisted on disk. Ready for queries.
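The offline phase above can be sketched end to end in a few lines. This is a minimal illustration, not any library’s API: `toy_embed` is a hash-based stand-in for a real embedding model (e.g., Sentence-BERT), and `chunk` is the naive fixed-size strategy with character-based sizes and overlap.

```python
# Sketch of the offline indexing phase: load -> chunk -> embed -> store.
import hashlib
import math

def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Stand-in for an embedding model: hash character trigrams into a
    fixed-length vector, then L2-normalize. Illustrative only."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Fixed-size character chunking with overlap (the naive strategy)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def build_index(doc: str, source: str) -> list[tuple[list[float], str, dict]]:
    """Produce the (vector, chunk_text, metadata) tuples a vector DB stores."""
    return [(toy_embed(c), c, {"source": source, "chunk_id": i})
            for i, c in enumerate(chunk(doc))]

index = build_index(
    "Dense retrieval encodes queries and documents into a shared vector space.",
    "demo.md",
)
print(len(index))  # number of chunks indexed
```

A real pipeline replaces `toy_embed` with a batched model forward pass and writes the tuples into a vector store rather than a Python list.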
4.2 Online Query Phase
User Query (natural language question or instruction)
│
▼
Query Processing
│ Optionally: query rewriting, expansion, or decomposition
│ Optionally: HyDE — generate a hypothetical answer,
│ embed that instead of the raw query [12]
│
▼
Retrieval
│ ┌─────────────────────────────────────────────┐
│ │ Dense path: │
│ │ 1. Embed query using same embedding model │
│ │ 2. ANN search: find top-k nearest vectors │
│ │ 3. Return (chunk_text, metadata, score) │
│ ├─────────────────────────────────────────────┤
│ │ Sparse path (if hybrid): │
│ │ 1. BM25 score query against chunk texts │
│ │ 2. Return (chunk_text, metadata, score) │
│ ├─────────────────────────────────────────────┤
│ │ Fusion (if hybrid): │
│ │ Reciprocal Rank Fusion or linear combination│
│ │ Merge dense + sparse result lists │
│ └─────────────────────────────────────────────┘
│
▼
Reranking (optional but recommended)
│ Take top-N candidates from retrieval (e.g., N=20)
│ Score each (query, chunk) pair with a cross-encoder [8]
│ Reorder by cross-encoder score
│ Select top-k (e.g., k=5) for generation
│
▼
Prompt Construction
│ Assemble the LLM prompt:
│ [System instruction]
│ [Retrieved chunk 1, with metadata/citation info]
│ [Retrieved chunk 2]
│ ...
│ [Retrieved chunk k]
│ [User query]
│ [Instruction: "Answer based on the provided context.
│ Cite sources. If the context does not contain
│ the answer, say so."]
│
▼
Generation
│ LLM generates response conditioned on the augmented prompt
│ The LLM sees both the query and the retrieved context
│ Response ideally cites or references the retrieved chunks
│
▼
Response (with optional citation metadata)
Returned to user.
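The dense path of the query phase reduces to embed, score, sort. The brute-force sketch below is self-contained: a production system would substitute an ANN index for the linear scan, and the stored vectors here are toy placeholders rather than real embeddings.

```python
# Brute-force dense retrieval: cosine-score the query against every stored
# vector and return the top-k (chunk_text, metadata) results.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def dense_search(query_vec, store, k=2):
    """store: list of (vector, chunk_text, metadata) tuples."""
    scored = [(cosine(query_vec, v), text, meta) for v, text, meta in store]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]

store = [
    ([1.0, 0.0, 0.0], "chunk about cats", {"source": "a.md"}),
    ([0.0, 1.0, 0.0], "chunk about dogs", {"source": "b.md"}),
    ([0.9, 0.1, 0.0], "another cat chunk", {"source": "c.md"}),
]
print(dense_search([1.0, 0.0, 0.0], store, k=2))
```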
5. Component Deep Dive
5.1 Document Processing and Chunking
Raw documents must be parsed into text and divided into chunks before embedding. Parsing quality directly affects downstream retrieval — a PDF parser that drops table contents or misorders columns produces chunks that no retrieval model can salvage.
Chunking is governed by three parameters: chunk size, overlap, and splitting strategy. The project’s companion document (chunking-parameters-tutorial.md) covers the empirical evidence in detail. The key findings, summarized:
- The “512 tokens is optimal” claim is a misunderstanding. 512 is the maximum input length of BERT-family embedding models, not an empirically determined optimum [13].
- Optimal chunk size depends on the task and domain. Factoid QA datasets favor small chunks (64–128 tokens); explanation-heavy datasets favor large chunks (512–1024) [13].
- Structure-aware splitting (on heading boundaries) outperforms fixed-size splitting for well-structured documents [13].
- Empirical evaluation against a representative query set is the only reliable method for determining optimal parameters [13].
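As a concrete illustration of recursive splitting, the sketch below tries paragraph breaks first, then sentence breaks, then falls back to hard word cuts. It approximates token counts by whitespace word counts; a real splitter would use the embedding model’s tokenizer, and the separator list is an assumption, not a standard.

```python
# Recursive splitting: prefer paragraph boundaries, then sentences, then
# hard cuts, greedily re-merging small pieces up to the size limit.
def recursive_split(text, max_words=100, seps=("\n\n", ". ")):
    if len(text.split()) <= max_words:
        return [text]
    for sep in seps:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, cur = [], ""
            for p in parts:
                cand = cur + sep + p if cur else p
                if len(cand.split()) <= max_words:
                    cur = cand  # still fits: keep merging
                else:
                    if cur:
                        chunks.append(cur)
                    cur = p
            if cur:
                chunks.append(cur)
            # recurse in case a single part still exceeds the limit
            return [c for ch in chunks for c in recursive_split(ch, max_words, seps)]
    # no separator found: hard cut on word boundaries
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
```

Overlap and metadata attachment, present in real splitters, are omitted here for brevity.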
5.2 Embedding Models
Embedding models compress text passages into fixed-length vectors such that semantically similar passages have high cosine similarity. The dominant architecture is a BERT-derived transformer encoder, though recent models extend to longer contexts and higher dimensionalities.
The dual-encoder (bi-encoder) paradigm. DPR [4] established the pattern: two independent BERT encoders — one for queries, one for passages — each produce a vector, and similarity is computed via inner product. This is the foundation of all modern dense retrieval. The query encoder and passage encoder can be the same model (shared-encoder) or different (asymmetric). At indexing time, all passages are pre-embedded and stored. At query time, only the query needs to be embedded; retrieval reduces to an ANN search over pre-computed vectors [4].
Sentence-BERT (SBERT). Reimers and Gurevych (2019) adapted BERT for sentence-level embeddings by training with a Siamese/triplet network structure [6]. SBERT produces semantically meaningful sentence embeddings that can be compared via cosine similarity, enabling efficient semantic search. It was one of the first models to make dense retrieval practical for general-purpose applications beyond QA.
The critical architectural property is that the passage encoder and query encoder operate independently. This independence is what makes dense retrieval scalable — millions of documents can be pre-embedded offline, and search requires only one encoder forward pass (for the query) plus an ANN lookup. The cost is that the query and document never “see” each other’s tokens during encoding, limiting the depth of semantic matching compared to cross-encoders.
Embedding dimensionality and the dilution problem. An embedding model compresses an entire chunk into a single fixed-dimensional vector. The longer the input text, the more information must be discarded to fit into the same vector dimensionality. This is the “dilution” problem: a 512-token chunk covering three topics produces a vector that represents none of them precisely. This motivates smaller, topically focused chunks — at the cost of losing context [13].
5.3 Vector Indexing and ANN Search
Once documents are embedded, the vectors must be stored and searched efficiently. Brute-force comparison of a query vector against every document vector is O(n) — acceptable for thousands of documents, but not for millions or billions.
FAISS. Facebook AI Similarity Search, released by Facebook AI Research (now Meta) in 2017, is the foundational library for vector similarity search [5]. FAISS implements multiple indexing strategies: flat (brute-force), IVF (inverted file with Voronoi partitioning), PQ (product quantization for compressed-domain search), and HNSW (graph-based). The GPU implementation achieves search speeds 8.5x faster than prior state of the art on billion-scale datasets [14]. As of 2025, FAISS has over 37,000 GitHub stars and more than 5,200 citations; major vector database companies including Zilliz and Pinecone either rely on FAISS as their core engine or have adopted its algorithms [5].
HNSW (Hierarchical Navigable Small World graphs). Introduced by Malkov and Yashunin (2016), HNSW constructs a multi-layer graph where higher layers contain long-range connections for coarse navigation and lower layers contain short-range connections for precise search [11]. Search starts at the top layer and greedily descends. HNSW provides very high recall (>95%) with sub-millisecond search latency and is the default index type in most vector databases [5][11].
The accuracy-speed-memory tradeoff. Every ANN index trades off three constraints: search speed, memory usage, and recall accuracy. HNSW is fast and accurate but memory-intensive (stores the full vector). IVF+PQ is memory-efficient (compresses vectors) but slightly less accurate. The optimal choice depends on the corpus size, latency requirements, and available hardware [5].
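To make the IVF idea concrete, here is a toy version with a crude k-means partitioner. It is illustrative only; FAISS implements the same scheme with optimized training and scanning kernels, and the `nprobe` parameter name follows FAISS convention.

```python
# Toy IVF index: partition vectors into cells via k-means, then search only
# the nprobe cells whose centroids are nearest the query.
import math
import random

def dist(a, b):
    return math.dist(a, b)

def kmeans(vecs, k, iters=10, seed=0):
    """Crude k-means: returns centroids and the vectors assigned to each cell."""
    rng = random.Random(seed)
    centroids = rng.sample(vecs, k)
    cells = [[] for _ in range(k)]
    for _ in range(iters):
        cells = [[] for _ in range(k)]
        for v in vecs:
            cells[min(range(k), key=lambda i: dist(v, centroids[i]))].append(v)
        new = [[sum(dim) / len(cell) for dim in zip(*cell)] if cell else centroids[i]
               for i, cell in enumerate(cells)]
        if new == centroids:
            break
        centroids = new
    return centroids, cells

def ivf_search(query, centroids, cells, nprobe=1, k=1):
    """Scan only the nprobe nearest cells instead of every vector."""
    order = sorted(range(len(centroids)), key=lambda i: dist(query, centroids[i]))
    candidates = [v for i in order[:nprobe] for v in cells[i]]
    return sorted(candidates, key=lambda v: dist(query, v))[:k]
```

Raising `nprobe` trades speed back for recall: with `nprobe` equal to the number of cells, the search degenerates to brute force.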
5.4 Retrieval Strategies
Sparse Retrieval (BM25)
BM25 remains the standard lexical retrieval baseline [7]. It scores documents based on the frequency of query terms in the document, inversely weighted by how common those terms are across the corpus (IDF), with document length normalization. BM25 is fast, requires no training, and excels at exact term matching — acronyms, proper nouns, code identifiers, and domain-specific terminology that dense models may miss.
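The scoring function fits in a few lines of standard-library Python. The sketch below follows the common Okapi BM25 formulation with `k1` and `b` parameters over a pre-tokenized toy corpus; production systems use Elasticsearch, OpenSearch, or Tantivy instead.

```python
# Okapi BM25: term frequency, inverse document frequency, and document
# length normalization, scored per document.
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """query: list of tokens; docs: list of token lists. Returns one score per doc."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Documents containing none of the query terms score exactly zero, which is why BM25 cannot match paraphrases the way dense retrieval can.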
Dense Retrieval
Dense retrieval, as established by DPR [4], encodes queries and documents into a shared vector space and finds nearest neighbors. Dense retrieval captures semantic similarity — matching “cardiac arrest” to “heart attack” — but can miss exact lexical matches that BM25 handles trivially.
Hybrid Retrieval
Combining sparse and dense retrieval is now standard practice. A typical implementation runs BM25 and dense retrieval in parallel, then merges results using Reciprocal Rank Fusion (RRF):
RRF_score(d) = Σ_i 1 / (k + rank_i(d))
where rank_i(d) is document d’s rank in the i-th result list and k is a constant (typically 60). RRF requires no score normalization and consistently outperforms either retrieval method alone in practice.
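A direct implementation of the fusion step (the document ids and the two ranked lists below are hypothetical):

```python
# Reciprocal Rank Fusion: sum 1/(k + rank) across result lists, then sort.
def rrf(result_lists, k=60):
    """result_lists: lists of doc ids, best first. Returns the fused ranking."""
    scores = {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d2", "d1", "d5"]
sparse = ["d1", "d3", "d2"]
print(rrf([dense, sparse]))  # d1 wins: ranked high in both lists
```

Because only ranks are used, the incomparable score scales of BM25 and cosine similarity never need to be normalized against each other.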
5.5 Reranking
The initial retrieval stage (bi-encoder + ANN) optimizes for recall at high speed — find a large candidate set that likely contains the relevant documents. The reranking stage optimizes for precision — determine which of those candidates are most relevant.
Cross-encoder rerankers. Nogueira and Cho (2019) demonstrated that BERT can be used as a passage reranker by feeding the concatenated (query, passage) pair through the model and using the [CLS] token output as a relevance score [8]. This cross-encoder approach outperformed the previous state of the art on the MS MARCO passage ranking leaderboard by 27% relative in MRR@10 [8]. Cross-encoders achieve higher accuracy than bi-encoders because the full cross-attention mechanism allows every query token to attend to every passage token, capturing fine-grained relevance signals that independent encoding misses [6][8].
The computational cost is the tradeoff: a cross-encoder requires one forward pass per (query, document) pair, making it infeasible for first-stage retrieval over millions of documents. The standard two-stage pattern is: bi-encoder retrieves top-N (e.g., N=20–100), then cross-encoder reranks to top-k (e.g., k=3–5).
ColBERT and late interaction. ColBERT (Khattab and Zaharia, 2020) introduced a compromise: encode queries and documents independently (like a bi-encoder) but retain per-token embeddings rather than collapsing to a single vector. Relevance is computed via late interaction — a MaxSim operation between query and document token embeddings. This preserves some of the cross-encoder’s accuracy while allowing document representations to be pre-computed.
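The MaxSim operation itself is simple. The sketch below uses toy per-token vectors rather than real ColBERT embeddings, which are produced by a trained encoder and normalized:

```python
# ColBERT-style late interaction: for each query token, take the maximum
# similarity to any document token, then sum over query tokens.
def maxsim(query_tokens, doc_tokens):
    """query_tokens, doc_tokens: lists of same-length (normalized) vectors."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)
```

Since document token vectors can be pre-computed, only the query-side encoding and the cheap MaxSim pass happen at query time.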
5.6 Prompt Construction and Generation
The final stage assembles retrieved chunks into an LLM prompt and generates a response. The engineering decisions at this stage significantly affect output quality.
Context ordering. Research has shown that LLMs attend more strongly to information at the beginning and end of the context window, with degraded attention to material in the middle (the “lost in the middle” phenomenon) [15]. This means the ordering of retrieved chunks in the prompt affects which information the model uses.
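One common mitigation is to place the strongest chunks at the edges of the context window. The helper below is a hypothetical sketch, not taken from any particular library:

```python
# Reorder chunks so the best-ranked land at the start and end of the prompt,
# pushing the weakest into the middle where attention is weakest.
def edge_order(chunks_ranked_best_first):
    front, back = [], []
    for i, c in enumerate(chunks_ranked_best_first):
        # even-indexed (best, 3rd-best, ...) go to the front,
        # odd-indexed go to the back, reversed so rank 2 ends the prompt
        (front if i % 2 == 0 else back).append(c)
    return front + back[::-1]

print(edge_order(["c1", "c2", "c3", "c4", "c5"]))  # c5 (weakest) in the middle
```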
Instruction framing. The system prompt instructs the model to answer based on the provided context, cite sources, and state when the context is insufficient. Without explicit instructions to ground in context, models will readily generate from parametric memory, potentially hallucinating.
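A minimal prompt assembly following the layout in Section 4.2. The template wording is illustrative, not a prescribed format:

```python
# Assemble system instruction, cited context blocks, query, and grounding
# instruction into a single prompt string.
def build_prompt(system, chunks, query):
    """chunks: list of (text, source) pairs retrieved for this query."""
    context = "\n\n".join(f"[Source: {src}]\n{text}" for text, src in chunks)
    return (
        f"{system}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer based on the provided context. Cite sources. "
        "If the context does not contain the answer, say so."
    )
```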
Faithfulness vs. completeness. A fundamental tension: instructing the model to rely only on retrieved context improves faithfulness (reduces hallucination) but may reduce completeness (the model refuses to answer when the relevant information was not retrieved). The optimal tradeoff depends on the application’s tolerance for hallucination vs. silence.
6. Advanced RAG Patterns
The basic retrieve-then-generate pipeline described above is sometimes called “Naive RAG.” Several architectural extensions have been developed.
6.1 Query Transformation
HyDE (Hypothetical Document Embeddings). Gao et al. (2022) proposed generating a hypothetical answer to the query using the LLM, then embedding that hypothetical answer instead of the raw query for retrieval [12]. The intuition is that a hypothetical answer is more lexically and semantically similar to relevant documents than the original question is. HyDE has been shown to improve retrieval quality, particularly for complex queries, at the cost of one additional LLM call.
Query decomposition. For multi-hop questions (“What was the GDP of the country that hosted the 2024 Olympics?”), the query is decomposed into sub-questions, each retrieved separately, with results combined before generation.
6.2 Iterative Retrieval
Instead of a single retrieval pass, the system retrieves, generates a partial answer, identifies knowledge gaps, retrieves again, and iterates. This is the pattern used by Deep Research products (Perplexity, Gemini, OpenAI), as documented in the project’s companion report (planning_in_deployed_ai_systems.md). Google’s Deep Research specifically developed an asynchronous task manager that maintains shared state between planner and task models for error recovery during iterative retrieval [16].
6.3 Self-RAG and Adaptive Retrieval
Asai et al. (2023) introduced Self-RAG, where the LLM is trained to decide whether retrieval is needed for a given query, retrieve if so, and then critique its own generation for faithfulness to the retrieved evidence [17]. This makes the retrieval decision dynamic rather than always-on — simple factual questions may be answerable from parametric memory, while complex or recent-knowledge questions trigger retrieval.
6.4 GraphRAG
Microsoft’s GraphRAG (2024) constructs a knowledge graph from unstructured text using LLMs, then performs retrieval over the graph structure rather than (or in addition to) vector similarity [18]. The system identifies entities and relationships, clusters them into communities, and generates summaries at multiple levels of abstraction. GraphRAG addresses a limitation of standard RAG: connecting information that is dispersed across many documents, which standard chunking and vector similarity may fail to bridge.
7. Evaluation
RAG evaluation is multi-dimensional — retrieval quality and generation quality must be assessed independently and jointly.
7.1 Retrieval Metrics
Standard information retrieval metrics apply to the retrieval stage:
- Recall@k: What fraction of relevant documents appear in the top-k results?
- Precision@k: What fraction of the top-k results are relevant?
- MRR (Mean Reciprocal Rank): Average of 1/rank of the first relevant result across queries.
- NDCG (Normalized Discounted Cumulative Gain): Measures ranking quality with position-dependent discounting.
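These metrics are straightforward to compute for binary relevance judgments; the sketch below treats NDCG gains as 0/1, whereas graded-relevance variants use larger gains for more relevant documents.

```python
# Retrieval metrics over ranked lists of doc ids with binary relevance.
import math

def recall_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    return len(set(retrieved[:k]) & set(relevant)) / k

def mrr(retrieved_lists, relevant_sets):
    """Mean of 1/rank of the first relevant hit, averaged across queries."""
    total = 0.0
    for retrieved, relevant in zip(retrieved_lists, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(retrieved_lists)

def ndcg_at_k(retrieved, relevant, k):
    """DCG with log2 position discounting, normalized by the ideal ranking."""
    dcg = sum(1.0 / math.log2(r + 1)
              for r, doc in enumerate(retrieved[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(r + 1) for r in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```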
7.2 Generation Metrics
- Faithfulness: Does the generated response contain only claims supported by the retrieved context? (Measures hallucination.)
- Answer relevancy: Does the response address the user’s query?
- Context precision: How much of the retrieved context was actually relevant?
- Context recall: Did the retrieval find all the information needed to answer?
RAGAS (Retrieval-Augmented Generation Assessment), introduced in 2023, provides automated computation of these metrics using an LLM-as-judge approach [19].
7.3 Benchmarks
- Natural Questions / TriviaQA / WebQuestions: Open-domain QA benchmarks used in the original RAG and DPR evaluations [1][4].
- BEIR: A heterogeneous benchmark for zero-shot evaluation of retrieval models across 18 diverse datasets [20].
- MTEB (Massive Text Embedding Benchmark): Evaluates embedding models across retrieval, classification, clustering, and other tasks [21].
- SimpleQA: Factual accuracy benchmark; Perplexity’s Deep Research scored 93.9% [22].
8. Failure Modes and Limitations
Retrieval failure. If the relevant document is not retrieved, no amount of generation quality can compensate. Retrieval failures cascade — the model either hallucinates an answer from parametric memory or (if well-instructed) refuses to answer.
Context window saturation. Retrieved chunks compete with the query, system prompt, and conversation history for limited context space. Too many or too-large chunks can overflow the context window or dilute attention.
Embedding dilution. As discussed in Section 5.2 and in the project’s chunking tutorial, long chunks produce diluted embeddings that weakly match multiple topics rather than strongly matching any one.
The “lost in the middle” problem. LLMs attend more to the beginning and end of the context, potentially ignoring relevant information placed in the middle of the retrieved context [15].
Chunk boundary artifacts. Critical information may be split across chunk boundaries, making it unretrievable unless overlap is sufficient.
No formal verification. RAG does not guarantee that the generated response faithfully represents the retrieved context. The model may selectively attend to or misinterpret retrieved passages. This parallels the broader finding that deployed AI planning systems lack formal verification mechanisms (see companion report planning_in_deployed_ai_systems.md).
9. The RAG Stack: Software Components
A production RAG system requires integration of multiple software components:
| Layer | Function | Common Options |
|---|---|---|
| Document processing | Parse, clean, chunk | LangChain loaders, Unstructured, LlamaIndex |
| Embedding | Text → vector | Sentence-Transformers, OpenAI Embeddings, Cohere Embed, nomic-embed |
| Vector store | Index and search vectors | Chroma, Pinecone, Weaviate, Qdrant, Milvus, pgvector, FAISS |
| Sparse index | BM25 keyword search | Elasticsearch, OpenSearch, Tantivy |
| Reranker | Cross-encoder scoring | Cohere Rerank, cross-encoder models (ms-marco-MiniLM), ColBERT |
| Orchestration | Pipeline coordination | LangChain, LlamaIndex, Haystack |
| LLM | Generation | OpenAI GPT-4o, Claude, Llama, Mistral, Qwen |
| Evaluation | Quality measurement | RAGAS [19], DeepEval, custom harnesses |
10. Relationship to Project Knowledge
This document extends the existing project knowledge base:
- Chunking parameters (chunking-parameters-tutorial.md) provides the empirical evidence for the chunking decisions described in Section 5.1.
- Planning in deployed systems (planning_in_deployed_ai_systems.md) describes how RAG-style iterative retrieval is used in Deep Research products (Section 3.4 of that report covers the retrieval-reason-act pattern).
- Fine-tuning for tool-calling (fine_tuning_tool_calling_guide.md) describes the LLM fine-tuning that could be applied to the generation component of a RAG pipeline.
- GPU characterization (gpu-characterization-apple-silicon.md) covers hardware requirements relevant to running embedding models and LLMs locally.
Bibliography
[1] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S. and Kiela, D. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” Advances in Neural Information Processing Systems 33 (NeurIPS 2020), pp. 9459–9474. [Snippet] https://arxiv.org/abs/2005.11401
[2] Chen, D., Fisch, A., Weston, J. and Bordes, A. “Reading Wikipedia to Answer Open-Domain Questions.” Proceedings of ACL 2017. [Further Reading] https://arxiv.org/abs/1704.00051
[3] Guu, K., Lee, K., Tung, Z., Pasupat, P. and Chang, M. “REALM: Retrieval-Augmented Language Model Pre-Training.” Proceedings of ICML 2020, PMLR 119. [Snippet] https://arxiv.org/abs/2002.08909
[4] Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D. and Yih, W. “Dense Passage Retrieval for Open-Domain Question Answering.” Proceedings of EMNLP 2020, pp. 6769–6781. [Snippet] https://aclanthology.org/2020.emnlp-main.550/
[5] Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P., Lomeli, M., Hosseini, L. and Jégou, H. “The Faiss Library.” IEEE Transactions on Big Data, 2025. arXiv:2401.08281. [Snippet] https://arxiv.org/abs/2401.08281
[6] Reimers, N. and Gurevych, I. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.” Proceedings of EMNLP 2019. [Further Reading] https://arxiv.org/abs/1908.10084
[7] Robertson, S. and Zaragoza, H. “The Probabilistic Relevance Framework: BM25 and Beyond.” Foundations and Trends in Information Retrieval 3(4):333–389, 2009. [Further Reading]
[8] Nogueira, R. and Cho, K. “Passage Re-ranking with BERT.” arXiv:1901.04085, 2019. [Snippet] https://arxiv.org/abs/1901.04085
[9] Devlin, J., Chang, M., Lee, K. and Toutanova, K. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” Proceedings of NAACL-HLT 2019, pp. 4171–4186. [Further Reading] https://aclanthology.org/N19-1423/
[10] Wang, L. et al. “Text Embeddings by Weakly-Supervised Contrastive Pre-training.” arXiv:2212.03533, 2022. [Further Reading] https://arxiv.org/abs/2212.03533
[11] Malkov, Y. and Yashunin, D. “Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.” IEEE Transactions on Pattern Analysis and Machine Intelligence 42(4):824–836, 2020. arXiv:1603.09320. [Further Reading] https://arxiv.org/abs/1603.09320
[12] Gao, L., Ma, X., Lin, J. and Callan, J. “Precise Zero-Shot Dense Retrieval without Relevance Labels.” Proceedings of ACL 2023. arXiv:2212.10496. [Further Reading] https://arxiv.org/abs/2212.10496
[13] Project knowledge. “Chunk Size Is Not 512 Tokens: A Tutorial on RAG Chunking Parameters.” See chunking-parameters-tutorial.md in this project.
[14] Johnson, J., Douze, M. and Jégou, H. “Billion-Scale Similarity Search with GPUs.” IEEE Transactions on Big Data 7(3):535–547, 2019. [Snippet] https://arxiv.org/abs/1702.08734
[15] Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F. and Liang, P. “Lost in the Middle: How Language Models Use Long Contexts.” Transactions of the ACL, 2024. arXiv:2307.03172. [Further Reading] https://arxiv.org/abs/2307.03172
[16] Google. “Gemini Deep Research — Your Personal Research Assistant.” Gemini Overview, March 2025. [Verified — cited in project document planning_in_deployed_ai_systems.md]
https://gemini.google/overview/deep-research/
[17] Asai, A., Wu, Z., Wang, Y., Sil, A. and Hajishirzi, H. “Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection.” Proceedings of ICLR 2024. arXiv:2310.11511. [Further Reading] https://arxiv.org/abs/2310.11511
[18] Edge, D. et al. “From Local to Global: A Graph RAG Approach to Query-Focused Summarization.” arXiv:2404.16130, 2024. [Further Reading] https://arxiv.org/abs/2404.16130
[19] Es, S., James, J., Espinosa-Anke, L. and Schockaert, S. “RAGAS: Automated Evaluation of Retrieval Augmented Generation.” arXiv:2309.15217, 2023. [Further Reading] https://arxiv.org/abs/2309.15217
[20] Thakur, N., Reimers, N., Rücklé, A., Srivastava, A. and Gurevych, I. “BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models.” Proceedings of NeurIPS 2021. arXiv:2104.08663. [Further Reading] https://arxiv.org/abs/2104.08663
[21] Muennighoff, N. et al. “MTEB: Massive Text Embedding Benchmark.” Proceedings of EACL 2023. arXiv:2210.07316. [Further Reading] https://arxiv.org/abs/2210.07316
[22] Perplexity. “Introducing Perplexity Deep Research.” Perplexity Hub Blog, February 2025. [Verified — cited in project document planning_in_deployed_ai_systems.md]
https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research