A State-of-the-Art Report, March 2026
1. Introduction
Commercially deployed AI systems as of early 2026 rely on a narrow set of architectural patterns for planning: autoregressive token generation, optionally extended with chain-of-thought reasoning, human-in-the-loop approval, and retry-based error recovery. As documented in a companion report on the structure of planning in deployed systems, no commercial product implements classical AI planning, formal plan verification, learned world models, or state-space search in any rigorous sense. Plans are natural language text, not formal structures. Verification is empirical, not formal. Error recovery is retry, not backtracking.
This report examines the research frontier — the techniques under active investigation that aim to close the gap between what deployed systems do and what the planning research community has long understood to be necessary for reliable, constraint-satisfying, and verifiable planning. The scope is deliberately restricted to single-agent architectures: methods that augment a single reasoning system with external tools, verifiers, search procedures, or classical computing components. Multi-agent coordination protocols and frameworks (LangGraph, CrewAI, AutoGen, etc.) are excluded, as they are addressed in a separate companion report.
The report is organized by technique and architectural pattern, covering work from approximately 2023 through early 2026. Sources include peer-reviewed publications at major venues (ICML, NeurIPS, ICLR, AAAI, ACL, EMNLP, Nature Communications, Nature), preprints on arXiv, and technical reports from industry labs. Preprints are flagged as such in the bibliography.
2. The Foundational Critique: Autoregressive LLMs and the Limits of Planning
The starting point for most current planning research is the empirical and theoretical observation that autoregressive LLMs are poor planners on their own. Kambhampati et al. argued at ICML 2024 that autoregressive LLMs function as “a giant pseudo System 1” — they produce approximate retrievals of plan-like outputs from training data rather than performing genuine search through a state space [1]. A system that takes constant time per token cannot, in principle, be performing reasoning whose computational complexity scales with problem difficulty [1]. On the PlanBench benchmark, even state-of-the-art models including GPT-4o and Claude 3 Opus showed poor performance on multi-step plan generation, with only approximately 12% of generated plans being error-free [1].
Subsequent work reinforced this finding. Stechly et al. demonstrated that chain-of-thought prompting is largely ineffective in improving planning performance [2]. Valmeekam et al. found that LLMs cannot reliably self-critique their own plans — the verification problem is as hard as the generation problem for a system that lacks a formal model of the domain [3]. Even OpenAI’s o1 reasoning model, trained specifically for extended chain-of-thought, showed limited improvement on formal planning benchmarks [4].
Kambhampati separately published an overview arguing that the confusion about LLM planning abilities stems from conflating linguistic plausibility with formal correctness — LLMs produce text that looks like plans but has not been verified against any model of the world [5].
The argument is not merely theoretical. It has a computational dimension: planning in general is PSPACE-complete, and even the scheduling subproblems commonly used in benchmarks (like TravelPlanner) are NP-complete as constraint satisfaction problems [6]. On TravelPlanner, the best LLM strategies using Chain-of-Thought and ReAct achieved less than 1% accuracy, compared to 100% for humans [7]. This gap is not a matter of scale — larger models perform marginally better in absolute terms but exhibit the same fundamental failure patterns: hallucinating non-existent state transitions, violating explicitly stated constraints, and failing to maintain consistency across multi-step plans [1].
The relationship between these planning limitations and the broader “reasoning” capabilities of LLMs is contested. Some researchers argue that RL-trained reasoning models (the o-series, DeepSeek-R1) partially address the problem by learning to allocate variable compute to harder problems. Others, including Kambhampati, maintain that this changes the surface behavior without addressing the fundamental architecture: the model still takes constant time per token and still lacks a formal model of the domain against which to verify its outputs [5].
These findings do not mean LLMs are useless for planning. Rather, they motivate the core research question addressed in this report: how can LLMs be augmented — with external verifiers, search procedures, world models, or classical computing components — to produce plans that are actually correct?
3. Neurosymbolic Planning: The LLM-Modulo Framework
3.1. The Core Architecture
The most influential framework for augmenting LLM planning with external verification is the LLM-Modulo architecture proposed by Kambhampati et al. [1]. The framework positions LLMs not as planners but as “universal approximate knowledge sources” that generate candidate plans, which are then evaluated by external model-based verifiers (called “critics”) in a generate-test-critique loop [1]. If the critics reject a candidate, feedback is provided to the LLM, which generates a revised candidate. This continues until either a valid plan is found or a budget of iterations is exhausted.
The architecture is deliberately modular. The LLM handles natural-language understanding, idea generation, and translation between representations. The external critics handle formal verification against domain models. The key claim is that this division of labor can provide formal correctness guarantees that neither component can achieve alone — every plan that passes the critics is guaranteed correct by those critics’ models, while the LLM’s generative capacity makes the system applicable to domains where hand-coding a full planner would be infeasible [1].
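The generate-test-critique loop at the heart of this architecture is simple enough to sketch directly. The sketch below is purely illustrative — `propose` stands in for an LLM call, and the two critics are toy constraint checkers, not the paper's verifiers — but it captures the control flow: candidates are generated, tested against all critics, and critic feedback is back-prompted until a candidate passes or the iteration budget is exhausted.

```python
# Minimal sketch of an LLM-Modulo-style generate-test-critique loop.
# All names are illustrative; `propose` stands in for an LLM call.

def llm_modulo(propose, critics, max_rounds=15):
    """Return (plan, round) once every critic accepts, else (None, max_rounds)."""
    feedback = []
    for round_no in range(max_rounds):
        plan = propose(feedback)                 # LLM generates a candidate
        failures = [msg for ok, msg in (c(plan) for c in critics) if not ok]
        if not failures:                         # all critics accept -> sound plan
            return plan, round_no
        feedback = failures                      # back-prompt with critic feedback
    return None, max_rounds

# Toy domain: a plan is a list of costs; critics check budget and length.
def budget_critic(plan):
    total = sum(plan)
    return (total <= 100, f"over budget: {total} > 100")

def length_critic(plan):
    return (len(plan) >= 2, "plan too short")

# Stub "LLM" that halves costs whenever the budget critic complains.
def propose(feedback):
    plan = propose.state
    if any("over budget" in f for f in feedback):
        plan = [c // 2 for c in plan]
    propose.state = plan
    return plan

propose.state = [80, 60]
plan, rounds = llm_modulo(propose, [budget_critic, length_critic])
```

Note that the guarantee lives entirely in the critics: whatever `propose` returns, a plan is only emitted once every critic accepts it.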
3.2. Empirical Results
Gundawar et al. applied the LLM-Modulo framework to several scheduling domains including the TravelPlanner benchmark [6]. The results demonstrated substantial gains: on TravelPlanner, GPT-4o improved from 8.3% accuracy to 23.89%, and Claude 3.5 Sonnet improved from 4.4% to 25% [6]. In the Natural Plan scheduling domains, GPT-4o improved from 3.43% to 40% [6]. Importantly, every solution produced by the LLM-Modulo system was guaranteed correct by the critics — the system cannot output a plan that violates the verified constraints [6].
An earlier case study by the same group on TravelPlanner specifically showed a 4.6x improvement for GPT-4-Turbo, vastly exceeding what Chain-of-Thought, ReAct, and Reflexion techniques could achieve (all three scored at or near 0% with GPT-3.5-Turbo) [7]. The study also demonstrated that LLMs could be used to extract the critics themselves — helping domain experts formalize constraint-checking code — illustrating the LLM’s role as a knowledge source beyond plan generation [7].
3.3. Multiple Roles for LLMs in the Planning Pipeline
An underappreciated aspect of the LLM-Modulo framework is that it assigns LLMs multiple distinct roles beyond plan generation. The original paper [1] identifies at least five: (a) idea generation (proposing candidate plans), (b) translation (converting between natural language and structured representations like JSON or PDDL), (c) problem specification enrichment (adding implicit constraints that users leave unstated), (d) critic extraction (helping domain experts formalize verification code), and (e) style criticism (evaluating plan quality on dimensions like diversity and common-sense reasonableness, as opposed to formal correctness).
The travel planning case study [7] operationalized several of these roles: the LLM generated candidate itineraries, reformulated natural language plans into structured JSON parseable by critics, and helped extract the critics themselves from natural language constraint descriptions. This multi-role design means that even if LLMs cannot plan, they remain indispensable components of a planning system that can plan.
The framework also has an important relationship to the AlphaGeometry [1] and FunSearch [1] systems from Google DeepMind, both of which use generate-test-critique loops between fine-tuned LLMs and external symbolic evaluators. In FunSearch, the external evaluator is critical for avoiding hallucinations in candidate solutions. In AlphaGeometry, a symbolic deduction engine provides the formal reasoning that the LLM cannot reliably perform. These systems predate the formal articulation of the LLM-Modulo framework but embody its core principle.
3.4. Limitations
The LLM-Modulo framework has clear boundaries. First, it requires that formal critics can be written for the domain. For well-specified scheduling problems (travel planning, meeting coordination, resource allocation), this is feasible. For open-ended tasks (writing a report, refactoring a codebase), defining a complete set of sound verifiers is substantially harder. Second, the framework’s performance depends on the LLM’s ability to generate plausible candidate plans in the first place. In obfuscated domains (like “Mystery Blocks World,” where predicate names are randomized), performance drops sharply to approximately 10%, because the LLM cannot retrieve relevant plan patterns from its training data [1]. Third, the iterative back-prompting loop has diminishing returns — if the LLM cannot generate a valid candidate within 10-15 rounds, additional rounds rarely help [6].
Fourth, the framework as currently evaluated assumes that the critics are sound and complete for their domain. In practice, writing complete critics for complex real-world domains is itself a significant engineering challenge. The framework’s guarantee — that any output that passes the critics is correct — is only as strong as the critics themselves.
4. Test-Time Compute Scaling
4.1. The Scaling Paradigm
A parallel line of research addresses planning not through external verifiers but through allocating more computation at inference time. The foundational result is from Snell et al., who demonstrated that scaling test-time compute can be more effective than scaling model parameters for reasoning tasks [8]. They analyzed two mechanisms: searching against process-based reward models (PRMs), and adaptively updating the model’s output distribution conditioned on the prompt [8]. Using a compute-optimal strategy, they improved efficiency of test-time scaling by more than 4x compared to best-of-N baselines, and showed that a smaller model with sufficient test-time compute can outperform a 14x larger model [8]. This paper received an oral presentation at ICLR 2025.
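The simplest instance of external test-time scaling, and the baseline Snell et al. improve upon, is Best-of-N: sample N candidates and keep the one a reward model scores highest. A minimal sketch, with `sample` and `reward` as hypothetical stand-ins for an LLM sampler and a trained reward model:

```python
# Best-of-N selection, the baseline form of external test-time scaling.
# `sample` and `reward` are illustrative stand-ins, not real model calls.

def best_of_n(sample, reward, n):
    candidates = [sample(i) for i in range(n)]
    return max(candidates, key=reward)           # keep the highest-scoring candidate

# Toy instance: candidates are integers; reward is closeness to a target of 42.
answer = best_of_n(sample=lambda i: (i * 17) % 50,
                   reward=lambda x: -abs(x - 42),
                   n=8)
```

The compute-optimal strategies studied by Snell et al. go beyond this by choosing, per prompt, how to spend the sampling budget; Best-of-N spends it uniformly regardless of difficulty.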
A comprehensive survey by Zhang et al. organized the test-time scaling landscape along four dimensions: what to scale (parallel sampling, sequential reasoning, hybrid, or internal scaling), how to scale (search, aggregation, supervised fine-tuning, reinforcement learning), where to scale (reasoning tasks, general-purpose tasks), and how well to scale (performance, efficiency, controllability, scalability) [9].
4.2. Tree Search with Process Reward Models
A prominent approach combines Monte Carlo Tree Search (MCTS) with process reward models to guide LLM reasoning. Zhang et al. developed ReST-MCTS*, which integrates process reward guidance with tree search to collect higher-quality reasoning traces [10]. The system infers per-step process rewards by estimating the probability that a given step leads to the correct answer, then uses these rewards both to refine the PRM and to select high-quality training traces for policy model self-improvement [10]. ReST-MCTS* achieved higher accuracy than Best-of-N and Tree-of-Thought baselines within the same search budget, and the policy model improved continuously across multiple self-training iterations [10]. This work appeared at NeurIPS 2024.
The distinction between outcome reward models (ORMs, which score complete solutions) and process reward models (PRMs, which score individual reasoning steps) is critical for planning applications. ORMs can only evaluate a plan after it is fully generated, providing no guidance during generation. PRMs provide per-step feedback, enabling search algorithms to prune unpromising branches early. Lightman et al. demonstrated the importance of step-level supervision in the “Let’s Verify Step by Step” work, showing that PRMs trained on human-annotated step labels outperform ORMs for mathematical reasoning [10].
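The practical consequence of the ORM/PRM distinction can be made concrete with a small beam search. In this hedged sketch (all names and the toy domain are illustrative), the PRM scores each candidate step, so partial traces can be pruned before they are complete — something an ORM, which only scores finished traces, cannot support:

```python
# Beam search over reasoning steps guided by a per-step (PRM-style) score.
# `expand` and `prm` are toy stand-ins for step generation and a trained PRM.

def prm_beam_search(expand, prm_score, beam_width, depth):
    """Grow traces step by step, keeping the top-k by accumulated PRM score."""
    beam = [([], 0.0)]
    for _ in range(depth):
        nxt = []
        for trace, score in beam:
            for step in expand(trace):
                nxt.append((trace + [step], score + prm_score(trace, step)))
        nxt.sort(key=lambda ts: ts[1], reverse=True)
        beam = nxt[:beam_width]                  # prune unpromising branches early
    return beam[0][0]

# Toy domain: each step adds 1 or 2; the PRM rewards running sums near 6.
expand = lambda trace: [1, 2]
prm = lambda trace, step: -abs(sum(trace) + step - 6) / 6
best = prm_beam_search(expand, prm, beam_width=2, depth=3)
```

With an ORM, the same search would have to fully expand every trace to depth 3 before any scoring could occur; the per-step signal is what makes pruning possible.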
Jiang et al. presented a related framework combining reward-guided tree search with process reward models to enhance LLM reasoning, using the tree search to systematically explore reasoning paths while the PRM provides step-level guidance [11]. LLaMA-Berry further refined this approach by combining MCTS with iterative self-refinement and a pairwise reward model for global path evaluation [11].
A recent development is ThinkPRM — process reward models that themselves use extended chain-of-thought reasoning to verify intermediate steps [12]. Rather than training discriminative classifiers that output a binary correct/incorrect judgment per step, ThinkPRM generates full verification chains-of-thought before rendering judgment. This addresses a known weakness: discriminative PRMs require expensive step-level annotations (sometimes hundreds of thousands of per-step labels) to train effectively. ThinkPRM sidesteps this by leveraging the reasoning capabilities of long-CoT models to synthesize verification data. However, the authors note significant issues with “overthinking” — where the verification chain becomes so long it exceeds token budgets without converging — and with infinite looping, where the model gets stuck trying alternative verification strategies [12].
4.3. Internal vs. External Scaling
An important taxonomic distinction in the test-time scaling literature is between external scaling (where the orchestration harness manages multiple LLM calls, sampling, and selection) and internal scaling (where the model itself determines how much computation to allocate). External scaling includes Best-of-N sampling, MCTS, and majority voting — all managed outside the model. Internal scaling is exemplified by RL-trained reasoning models (o-series, DeepSeek-R1) that generate variable-length chains of thought without external orchestration [9].
The survey by Zhang et al. further distinguishes parallel scaling (generating multiple outputs simultaneously and aggregating) from sequential scaling (directing later computation based on intermediate results) and hybrid approaches [9]. For planning tasks specifically, sequential scaling is typically more relevant, because plan steps are often dependent — step N+1 depends on the outcome of step N. Parallel scaling is more applicable to exploration (generating multiple candidate plans in parallel and selecting the best one).
4.4. Test-Time Scaling for Agents
Zhu et al. conducted the first systematic exploration of applying test-time scaling methods to language agents (as opposed to pure reasoning tasks) [13]. They tested parallel sampling, sequential revision, verifier-based merging, and diversified rollouts on agentic benchmarks. Key findings include: scaling test-time compute does improve agent performance; knowing when to reflect (rather than reflecting after every step) is important; and list-wise verification methods outperform pairwise alternatives [13]. This is particularly relevant for planning, because agents that plan and execute in environments face the additional challenge of irreversible actions — unlike mathematical reasoning, where exploring a wrong path has no cost beyond computation.
4.5. Limitations
Test-time scaling faces several constraints. First, the computational cost scales with the difficulty of the problem — hard problems require many more samples or search steps, and the cost curve is often steep. Liu et al. raised a provocative question — “Can 1B LLM surpass 405B LLM?” — and, by rethinking compute-optimal test-time scaling, suggested that smaller models with appropriately scaled test-time compute can match larger models in specific settings [14]. Second, as Liu et al. demonstrated at ACL 2025, under majority-voting scaling, complicated prompting strategies with initially superior performance gradually fall behind simple chain-of-thought as sampling increases — suggesting that the interaction between prompting strategy and test-time scaling is non-trivial and sometimes counterintuitive [15]. Third, the reliance on reward models introduces its own failure modes: reward hacking and miscalibration can cause the search to converge on confidently wrong solutions. Stroebl et al. explored “inference scaling flaws,” showing that imperfect verifiers can fundamentally limit the benefits of resampling [18].
Finally, for planning specifically, tree search over reasoning traces faces a branching factor problem. Unlike board games (where the action space is finite and well-defined), the space of possible reasoning steps in natural language is enormous. Effective tree search requires either very good heuristics (the PRM) or very aggressive pruning, and current PRMs are imperfect.
5. LLM-Generated Heuristics for Classical Planning
5.1. Code as Heuristic
A striking result from Corrêa et al. demonstrated that LLMs can generate domain-dependent heuristic functions — as executable Python code — that outperform state-of-the-art domain-independent heuristics for classical planning [16]. Given a PDDL domain description, the LLM generates several candidate heuristic functions. These are evaluated on training tasks within a greedy best-first search, and the best-performing heuristic is selected. The resulting heuristics solved substantially more unseen, out-of-distribution test tasks than hand-crafted heuristics implemented in highly optimized C++ planners [16]. This work appeared at NeurIPS 2025.
The method works as follows. The LLM receives a carefully structured prompt containing: the PDDL domain file, two example task files, two example domain-dependent heuristics from other domains (Gripper and Logistics), the source code of the Pyperplan planner (showing the interface the heuristic must satisfy), and a checklist of common pitfalls observed in prior LLM outputs [16]. The LLM is asked to produce a Python class implementing a heuristic function that estimates the number of actions to reach the goal state. Multiple candidates are sampled (typically 200), evaluated on training tasks with a 5-minute timeout per task, and the best is selected based on coverage (number of tasks solved) and agile score (time efficiency) [16].
The key architectural insight is that the LLM is not asked to plan — it is asked to write a program that guides a classical search algorithm. The search algorithm guarantees soundness (any plan found is valid), while the LLM-generated heuristic provides informed guidance. This sidesteps the fundamental limitation identified by Kambhampati et al.: the LLM does not need to reason through the state space itself; it only needs to generate a function that estimates distances to the goal [16].
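This division of labor is easy to see in code. The following is a minimal sketch (toy state space, illustrative names): a greedy best-first search that accepts any heuristic as a plain function. The search alone guarantees that a returned plan is valid; the heuristic, however it was generated, only orders node expansion.

```python
# Greedy best-first search with a pluggable heuristic. Any returned plan is
# valid by construction; the heuristic (here a stand-in for LLM-generated
# code) only guides exploration. Toy integer state space for illustration.
import heapq

def gbfs(start, goal, successors, heuristic):
    """Greedy best-first search; expands the node with the lowest h-value."""
    frontier = [(heuristic(start), start, [])]
    seen = {start}
    while frontier:
        _, state, plan = heapq.heappop(frontier)
        if state == goal:
            return plan
        for action, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (heuristic(nxt), nxt, plan + [action]))
    return None

# Toy domain: reach 5 from 0 with +1 / +3 actions.
succ = lambda s: [("+1", s + 1), ("+3", s + 3)]
llm_heuristic = lambda s: abs(5 - s)       # stand-in for a generated heuristic
plan = gbfs(0, 5, succ, llm_heuristic)
```

A bad heuristic degrades only the search effort, never the correctness of the output — the property that makes sampling many LLM-generated candidates and keeping the best one safe.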
The authors tested several LLMs as heuristic generators, finding that DeepSeek-R1 produced the best heuristics overall, with o3 also performing strongly [16]. They tested with semantically obfuscated predicate names (replacing meaningful names like “on” and “clear” with random strings) and found that the generated heuristics retained significant performance, indicating that the LLM reasons about the domain’s logical structure rather than relying purely on token semantics [16]. They also demonstrated competitiveness with the strongest learning-based approach for domain-dependent planning (WL-F GPR) [16]. On a novel domain (Rod-Rings) designed to avoid training data leakage, o3-generated heuristics solved 58 tasks — nearly matching Fast Downward with the FF heuristic (59 tasks) — suggesting the LLM genuinely reasons about domain structure rather than memorizing solutions [16].
5.2. Extensions to Numeric Planning
Izhaki et al. extended the code-as-heuristic approach to numeric planning domains (which involve continuous variables and cannot be easily expressed in standard PDDL), generating heuristic functions in Rust rather than Python [17]. Their approach generates task-dependent heuristics from formal problem descriptions, using a three-step prompting process: domain summarization, heuristic conceptualization, and implementation [17]. The Rust implementation provides two advantages: the compiler’s strong static typing catches many code errors that would be runtime failures in Python, and the compiled code runs substantially faster [17]. The generated heuristics outperformed classical domain-independent methods and competed with state-of-the-art numeric planners [17]. They also demonstrated that the method can handle novel planning tasks that lack formal encoding in existing planning languages, such as Pac-Man, whose complex interactions between entities cannot be easily expressed in PDDL extensions [17].
5.3. Relationship to FunSearch and Evolutionary Approaches
The code-as-heuristic paradigm is related to but distinct from Google DeepMind’s FunSearch, which uses an evolutionary algorithm to iteratively refine LLM-generated programs for combinatorial optimization [16]. FunSearch feeds the best candidates back into the LLM for improvement, while Corrêa et al.’s pipeline never provides feedback from evaluation to the LLM — it simply samples many candidates and selects the best [16]. The authors note that adding a FunSearch-style feedback loop could further strengthen their results.
LLMs have also been explored for heuristic generation in domains beyond planning, including bin packing and the traveling salesman problem, where evolutionary algorithms guided by LLM-generated code have shown promising results [17]. However, due to fundamental differences in problem structure, these methods are not directly applicable to planning tasks without substantial adaptation.
5.4. Significance
This line of work represents a conceptually clean division of labor: LLMs contribute domain knowledge (encoded as heuristic programs), while classical search algorithms contribute soundness and completeness guarantees. Unlike end-to-end LLM planning, the correctness of any returned plan is guaranteed by the search algorithm. Unlike pure classical planning, the heuristic benefits from the LLM’s broad training on world knowledge. The approach also has a practical advantage: once a good heuristic is generated for a domain, it can be reused across all tasks in that domain without further LLM calls, amortizing the cost of generation.
6. Symbolic World Model Generation
6.1. From Natural Language to PDDL
A growing body of work investigates using LLMs to generate formal PDDL domain descriptions from natural language, enabling classical planners to then compute optimal plans. Yu et al. proposed scaling test-time computation to improve PDDL generation quality [18]. Their method combines Best-of-N sampling (to explore the solution space) with Instance Verbalized Machine Learning (iVML, an iterative refinement process in which an optimizer LLM provides critiques that guide the learner LLM to fix logical inconsistencies) [18]. The method achieved over 50% success rate on generating PDDL domains from natural language descriptions, outperforming OpenAI’s o1-mini without requiring additional training [18]. Combined with classical planners using A* search, the generated PDDL domains outperformed current methods on competition-level planning tasks [18].
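The Best-of-N half of such pipelines reduces to a selection problem: sample candidate domains, discard those that fail execution-based validation, and rank the survivors with a critic. A toy sketch, where the "validator" and "critic" are deliberately trivial stand-ins (balanced parentheses and length) for a real PDDL parser and an LLM critic:

```python
# Best-of-N selection over candidate domain strings with execution-based
# filtering. The validator and critic here are toy stand-ins, not real
# PDDL tooling.

def select_domain(candidates, validates, critic_score):
    """Keep only candidates that pass validation; rank survivors by critic score."""
    valid = [d for d in candidates if validates(d)]
    return max(valid, key=critic_score) if valid else None

# Toy stand-ins: "validation" = balanced parens; critic prefers longer domains.
cands = ["(define (domain a)",          # malformed: unbalanced parentheses
         "(define (domain b))",
         "(define (domain bigger))"]
balanced = lambda s: s.count("(") == s.count(")")
best = select_domain(cands, balanced, len)
```

The iVML refinement loop then operates on the survivors, back-prompting the generator with the critic's verbalized objections rather than merely resampling.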
6.2. Benchmarking Symbolic World Model Generation
Hu et al. introduced Text2World, a benchmark for evaluating LLM-generated symbolic world models, featuring hundreds of diverse PDDL domains with execution-based evaluation metrics [19]. Their findings showed that reasoning models trained with large-scale reinforcement learning outperform standard models, but even the best-performing model demonstrated limited world-modeling capability [19]. The benchmark also revealed low agreement between LLM-based and human evaluation (Cohen’s κ = 0.10), underscoring the need for execution-based rather than LLM-judged evaluation [19].
6.3. Code World Models
Tang et al. proposed WorldCoder, which takes a different approach: rather than generating PDDL, the agent synthesizes a Python program to model its past experiences with the environment, effectively learning a transition function as executable code [20]. The LLM does not simulate the world but builds a simulation of the world — an important distinction from approaches like ReAct where the LLM is expected to reason about action consequences directly [20]. WorldCoder uses the LLM to generate and debug code that implements a world model, then plans within that model using classical search. This appeared at NeurIPS 2024.
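The code-world-model idea can be illustrated in miniature. In the sketch below (not WorldCoder's actual code; the transition function and grid domain are invented for illustration), the "world model" is an ordinary executable transition function — the kind of artifact an LLM would synthesize from experience — and planning is plain breadth-first search inside that model:

```python
# Illustrative code world model: an executable transition function plus
# classical search inside it. The transition function stands in for
# LLM-synthesized code; the grid domain is a toy.
from collections import deque

def transition(state, action):                 # stand-in for synthesized code
    x, y = state
    dx, dy = {"up": (0, 1), "right": (1, 0)}[action]
    return (x + dx, y + dy)

def plan_in_model(start, goal, actions, model, max_depth=10):
    """Breadth-first search conducted entirely inside the code world model."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, plan = queue.popleft()
        if state == goal:
            return plan
        if len(plan) < max_depth:
            for a in actions:
                nxt = model(state, a)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, plan + [a]))
    return None

plan = plan_in_model((0, 0), (1, 2), ["up", "right"], transition)
```

Because the model is code, the search never queries an LLM per step — a key efficiency difference from approaches that use the LLM itself as the simulator (Section 7).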
Agent2World extended this paradigm with a multi-agent feedback pipeline for generating both PDDL-based and code-based world models, incorporating web research for specification gaps, automated testing, and simulation-based validation [21].
6.4. Significance
This research direction represents a potential bridge between deployed systems (which use natural language plans with no formal guarantees) and classical AI planning (which provides guarantees but requires expert-authored domain models). If LLMs can reliably translate natural language task descriptions into formal models, then the full power of classical planning — including optimality guarantees and constraint satisfaction — becomes accessible without requiring users to learn PDDL.
7. LLMs as World Models
7.1. Prompting LLMs as Implicit World Models
Hao et al. proposed Reasoning via Planning (RAP), which repurposes the LLM as both a world model (predicting state transitions) and a reasoning agent, combined with MCTS for strategic exploration [22]. Given a current state and a proposed action, the LLM predicts the next state. MCTS then uses these predictions to explore multiple reasoning paths, guided by task-specific rewards. RAP on LLaMA-33B surpassed chain-of-thought on GPT-4 with 33% relative improvement in plan generation [22]. The framework is notable because it requires no external tools or verifiers — the LLM itself serves as the simulator — but it inherits the LLM’s approximation errors: state predictions can be wrong, and there is no external check on their accuracy.
This approach has a dual interpretation. From the planning perspective, the LLM is being used as a learned transition model, analogous to how MuZero uses a learned dynamics model. From the LLM perspective, the MCTS is providing structured search over the space of reasoning traces, similar to Tree-of-Thought but with principled exploration-exploitation balancing via UCT (Upper Confidence bounds applied to Trees). The integration is tighter than simply sampling multiple chain-of-thought traces and voting, because MCTS can selectively expand promising partial plans rather than committing to complete traces.
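The essence of using the model as its own simulator can be sketched without the full MCTS machinery. The real system uses UCT-based tree search; the hedged sketch below substitutes a depth-limited exhaustive search to keep the example small, with `world_model` standing in for an LLM next-state prediction and `reward` for a task-specific score:

```python
# Look-ahead planning with a simulated world model, in the spirit of RAP.
# The real system uses MCTS; exhaustive depth-limited search is used here
# only to keep the illustration compact. All names are stand-ins.

def lookahead(state, actions, world_model, reward, depth):
    """Return (best_value, best_first_action) via depth-limited search."""
    if depth == 0:
        return reward(state), None
    best = (float("-inf"), None)
    for a in actions:
        nxt = world_model(state, a)             # simulated transition
        value, _ = lookahead(nxt, actions, world_model, reward, depth - 1)
        if value > best[0]:
            best = (value, a)
    return best

# Toy arithmetic domain: reach 10 from 0 with "double" or "add3" actions.
wm = lambda s, a: s * 2 if a == "double" else s + 3
value, act = lookahead(0, ["double", "add3"], wm,
                       lambda s: -abs(10 - s), depth=3)
```

MCTS replaces the exhaustive expansion with selective expansion of promising branches, which is what makes the approach tractable when the LLM's action space is large.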
7.2. Model-Based Planning for Web Agents
Gu et al. investigated whether LLMs can serve as world models for web agents in WebDreamer, a model-based planning framework [23]. The motivation is practical: web environments contain irreversible actions (submitting a form, placing an order), making the backtracking assumed by tree search infeasible. Model-based planning addresses this by simulating action consequences before committing — the agent deliberates over predicted outcomes using the world model, then executes only the action with the best predicted outcome [23].
They found that GPT-4o can simulate web page transitions with sufficient accuracy to enable look-ahead planning. They also trained a specialized 7B-parameter world model (Dreamer-7B) using a scalable data synthesis pipeline, and this smaller model performed comparably to GPT-4o as a web simulator [23]. This demonstrates the feasibility of smaller, specialized world models for specific domains — an important practical consideration, since using GPT-4o as a world model for every action candidate is computationally expensive.
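The deliberate-then-commit pattern underlying this approach is compact enough to sketch. In the toy example below (all names invented; `simulate` stands in for the world-model LLM), every candidate action is evaluated in simulation, with no real side effects, and only the winner is actually executed:

```python
# Model-based action selection for irreversible environments: simulate all
# candidates in the world model, execute only the best. Toy "checkout"
# domain; all names are illustrative stand-ins.

def deliberate_then_act(state, candidates, simulate, score, execute):
    predicted = {a: simulate(state, a) for a in candidates}  # no side effects
    best = max(candidates, key=lambda a: score(predicted[a]))
    return execute(state, best), best                        # commit once

# States are dicts; "submit" only succeeds when the cart is non-empty.
sim = lambda s, a: {**s, "done": a == "submit" and bool(s["cart"])}
log = []                                      # records real executions only
exe = lambda s, a: log.append(a) or sim(s, a)
state = {"cart": ["book"], "done": False}
new_state, chosen = deliberate_then_act(
    state, ["submit", "clear"], sim, lambda s: s["done"], exe)
```

The crucial property is that `log` records exactly one real action: every other candidate was explored only inside the model, which is what makes the pattern safe around irreversible steps like order submission.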
7.3. Trained Language-Based World Models
A more recent direction trains vision-language world models (VLWMs) that use natural language as their abstract state representation [24]. These models observe the environment visually, predict future states in language, and plan by simulating action consequences internally. The approach leverages language’s inherent semantic abstraction — “the pot is on the stove” is more computationally efficient to generate and reason about than predicting a pixel-level image of a kitchen. Training on large corpora of instructional videos (COIN, CrossTask, YouCook2, HowTo100M) and egocentric recordings (Ego4D, EgoExo4D, EPIC-KITCHENS), these models learn action-state trajectories from over 180,000 videos spanning more than 800 days of footage [24]. The training pipeline generates Tree of Captions — hierarchical video descriptions — resulting in 21 million unique caption nodes, from which 1.2 million goal-plan trajectories are extracted [24].
This represents a fundamentally different approach to world modeling than PDDL-based symbolic models (Section 6): rather than translating natural language to a formal representation, the world model operates natively in language space. The advantage is generality — no domain-specific PDDL encoding is needed. The disadvantage is that language-based predictions lack the formal verification guarantees that PDDL-based planning provides.
7.4. Evaluating LLM-Based World Models
Recent evaluation work has examined whether LLM-based world models are actually necessary for decision-making, or whether the LLM can function adequately without an explicit world model [25]. The study evaluated world model capabilities across diverse environments, testing state prediction, reward estimation, and termination prediction. Key findings include: GPT-4o significantly outperforms GPT-4o-mini as a world model, particularly on tasks requiring domain knowledge (such as scientific tasks); performance depends predominantly on key steps rather than total step count (suggesting that task difficulty is not simply a function of plan length); and combining world model functionalities for decision-making can introduce instability, partially obscuring the performance gap between strong and weak models [25].
The authors argue that world models will become critical components of LLM agents for several reasons: they enable trial-and-error planning without interacting with the real environment, they allow evaluation of action consequences before commitment, and they support counterfactual reasoning (“what would have happened if I had taken action B instead of action A?”) [25]. However, the current state of the art is that most deployed LLM agents operate without explicit world models — they use the LLM as an implicit reasoner about consequences, which is fundamentally less reliable.
7.5. Relationship to Latent World Models
It is worth noting what this research direction does not include. Latent world models like DreamerV3 [26] and MuZero [27] — which learn compact state representations and dynamics models for control in continuous and discrete action spaces — remain primarily in the robotics and game-playing domains. While DreamerV3 has been published in Nature and demonstrates mastery across diverse domains via a single algorithm [26], and the JEPA (Joint Embedding Predictive Architecture) paradigm from Meta AI shows promise for learning representations useful for planning [24], these approaches have not been integrated into the kind of white-collar planning tasks (report writing, trip planning, code refactoring) that are the primary focus of this report. The gap between latent world models for robotics and language-based planning for knowledge work remains wide.
One emerging bridge is DreamerNav, which combines DreamerV3’s recurrent state-space model with multimodal perception and hybrid global-local planning for robot navigation [26]. The architecture uses A* for global path planning while DreamerV3’s latent-space policy adapts locally to dynamic obstacles — a concrete example of combining classical planning with learned world models. Whether analogous hybrid architectures could work for language-based planning tasks remains unexplored.
8. Brain-Inspired Modular Architectures
Webb et al. proposed the Modular Agentic Planner (MAP), inspired by how the human brain decomposes planning into component processes associated with specific brain regions [28]. MAP implements five specialized LLM modules: a Task Decomposer (generating subgoals), an Actor (generating proposed actions), a Monitor (gating actions against constraint violations), a Predictor (predicting next states), and an Evaluator (assessing state quality) [28]. These modules interact iteratively: the decomposer breaks the problem into subgoals, the actor proposes actions, the monitor checks for violations and provides feedback, the predictor simulates outcomes, and the evaluator guides search.
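The interaction among the five modules can be sketched as a simple loop. In MAP each module is a separate LLM call with a focused prompt; in this runnable toy (a number-line domain with a non-negativity constraint, entirely our own illustration) each module is a plain function, so only the control flow reflects [28].

```python
def decompose(state, goal):        # Task Decomposer: goal -> subgoals
    mid = (state + goal) // 2
    return [mid, goal] if mid != goal else [goal]

def actor(state, subgoal):         # Actor: propose a candidate action
    return 1 if subgoal > state else -1

def monitor(state, action):        # Monitor: gate actions against constraints
    return state + action >= 0     # toy constraint: state must stay non-negative

def predictor(state, action):      # Predictor: simulate the next state
    return state + action

def evaluator(state, subgoal):     # Evaluator: assess state quality (higher is better)
    return -abs(subgoal - state)

def map_plan(state, goal, max_steps=50):
    plan = []
    for subgoal in decompose(state, goal):
        while state != subgoal and len(plan) < max_steps:
            action = actor(state, subgoal)
            if not monitor(state, action):
                break              # real MAP routes violation feedback to the Actor
            nxt = predictor(state, action)
            if evaluator(nxt, subgoal) <= evaluator(state, subgoal):
                break              # Evaluator reports no progress
            state, plan = nxt, plan + [action]
    return state, plan

final, plan = map_plan(0, 5)
print(final, plan)  # 5 [1, 1, 1, 1, 1]
```

Each function body would be replaced by a prompted LLM call in the actual architecture; the loop structure — decompose, propose, gate, simulate, evaluate — is the transferable part.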
MAP yielded significant improvements over standard LLM methods and competitive agentic baselines (including chain-of-thought, tree-of-thought, and multi-agent debate) on graph traversal, Tower of Hanoi, and the PlanBench benchmark [28]. Notably, it could be effectively combined with smaller, more cost-efficient LLMs and displayed superior transfer across tasks — a module trained on one task could be reused on another without retraining [28]. This work appeared in Nature Communications in September 2025.
The significance of MAP lies in its demonstration that decomposing planning into specialized cognitive functions — each handled by a separate LLM call with a focused prompt — can outperform monolithic approaches. This is conceptually related to the LLM-Modulo framework [1], but where LLM-Modulo pairs an LLM with external symbolic verifiers, MAP pairs multiple LLM instances with each other, each specializing in a different cognitive function. The external verifier is replaced by an LLM-based monitor, which trades formal guarantees for broader applicability.
9. Reinforcement Learning for Emergent Reasoning
9.1. DeepSeek-R1 and Pure RL Training
DeepSeek-R1, published in Nature, demonstrated that reasoning abilities can be incentivized in LLMs through pure reinforcement learning without requiring human-annotated reasoning trajectories [29]. The DeepSeek-R1-Zero variant was trained via large-scale RL directly on the base model (DeepSeek-V3-Base, a 671 billion parameter mixture-of-experts model) without supervised fine-tuning as a preliminary step [29]. During training, the model spontaneously developed advanced reasoning patterns including self-reflection, verification, and dynamic strategy adaptation [29]. The model naturally learned to generate longer reasoning traces for harder problems — an emergent form of adaptive compute allocation [29].
The training methodology is notable for its simplicity. The reward function combined only two components: accuracy (whether the final answer was correct) and format compliance (whether the output used the designated think tags). No neural reward models were used, as the authors found these susceptible to reward hacking during large-scale RL [29]. The training used Group Relative Policy Optimization (GRPO), which compares new outputs against past attempts rather than grading them in isolation, reducing the need for massive labeled datasets [29].
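The group-relative part of GRPO can be sketched directly from this description: sample a group of outputs, score each with the rule-based reward, and normalize each reward against the group's own statistics instead of a learned critic. The reward magnitudes, tag check, and helper names below are illustrative assumptions, not values from [29].

```python
import statistics

def reward(output, gold_answer):
    """Two-part rule-based reward: accuracy plus a format-compliance bonus."""
    accuracy = 1.0 if output["answer"] == gold_answer else 0.0
    fmt = 0.1 if output["text"].startswith("<think>") else 0.0
    return accuracy + fmt

def group_relative_advantages(outputs, gold_answer):
    """Score each sampled output relative to its own group, removing the
    need for a separate learned value/critic model."""
    rewards = [reward(o, gold_answer) for o in outputs]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

group = [
    {"text": "<think>...</think> 42", "answer": 42},  # correct, well-formatted
    {"text": "42", "answer": 42},                     # correct, missing tags
    {"text": "<think>...</think> 41", "answer": 41},  # wrong, well-formatted
    {"text": "40", "answer": 40},                     # wrong, missing tags
]
advs = group_relative_advantages(group, gold_answer=42)
```

The advantages sum to zero by construction: outputs are pushed up or down only relative to their peers in the same group, which is what lets GRPO dispense with a reward model.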
A notable observation was an “aha moment” during training where the model began to rethink its approach mid-solution, using phrases like “Wait, wait. Let’s reevaluate this step-by-step” [29]. Tracking the frequency of reflective terms (wait, mistake, however, retry, verify, wrong, check) across training steps revealed a clear emergence pattern where these self-correction behaviors appeared and stabilized [29].
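The emergence measurement is simple to reproduce in spirit: count reflective terms per unit of trace length at each training checkpoint. A minimal sketch (the word list follows the paper's description; the whitespace-free tokenization is our own simplification):

```python
import re

REFLECTIVE = {"wait", "mistake", "however", "retry", "verify", "wrong", "check"}

def reflective_frequency(trace):
    """Occurrences of reflective terms per 1000 words of a reasoning trace."""
    words = re.findall(r"[a-z]+", trace.lower())
    hits = sum(1 for w in words if w in REFLECTIVE)
    return 1000 * hits / max(len(words), 1)
```

Plotting this statistic over training steps is what reveals the emergence pattern: near-zero early, then a rise and plateau as self-correction stabilizes.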
DeepSeek-R1 achieved performance comparable to OpenAI’s o1 on mathematics, coding, and STEM reasoning benchmarks. On AIME 2024, it achieved approximately 79.8% pass@1; on MATH-500, approximately 97.3% [29]. The reasoning patterns of the larger model could be distilled into substantially smaller models (1.5B to 70B parameters), with the distilled models achieving strong performance on their own — the 32B distilled model outperformed OpenAI’s o1-mini across various benchmarks [29].
The pipeline for the full DeepSeek-R1 model incorporated four stages: (1) cold start with thousands of structured chain-of-thought examples, (2) reasoning-oriented RL focusing on rule-based evaluation, (3) supervised fine-tuning using rejection-sampled reasoning data from stage 2, and (4) a second RL phase combining reasoning and non-reasoning rewards [29]. This multi-stage approach produced a model that was both a strong reasoner and a capable general-purpose assistant.
9.2. The OpenAI o-Series
OpenAI’s o-series models (o1, o3, o4-mini) represent the primary commercial deployment of RL-trained reasoning. The o3 model uses “test-time compute” that explores multiple potential solution paths during inference [30]. OpenAI reported that o3 makes 20% fewer major errors than o1 on difficult real-world tasks, and that o4-mini scored 99.5% on AIME 2025 with tool use [31]. These models were specifically trained to reason about when to use tools through reinforcement learning — teaching them not just how to invoke tools but to reason about whether tool use is appropriate at each step [31].
The o-series models generate a “private chain of thought” that is not shown to the user, spending variable amounts of compute on different problems [30]. On the ARC-AGI benchmark, o3 scored 87.7%, roughly three times o1’s accuracy, suggesting that the additional RL training and test-time compute allocation significantly improve abstract reasoning [30]. However, specific architectural details of the o-series models remain proprietary.
9.3. Significance for Planning
The RL-trained reasoning paradigm is relevant to planning research in several ways. First, it demonstrates that extended reasoning — including planning-like behaviors such as decomposition, constraint checking, and backtracking — can emerge from RL training without explicit architectural support. Second, it shows that the allocation of compute can be made adaptive: the model spends more tokens (and thus more compute) on harder problems, a property that classical planners exhibit by design but that standard LLMs lack. Third, it validates the test-time compute scaling paradigm described in Section 4: even within the constraints of autoregressive generation, spending more inference-time compute improves planning-relevant capabilities.
However, these models still lack the formal guarantees that the neurosymbolic approaches in Sections 3, 5, and 6 provide. The reasoning is still token generation — it can be wrong, and there is no external verifier ensuring correctness. The RL training improves the probability of correct reasoning but does not guarantee it.
10. Code-as-Planning and Executable Verification
10.1. The Code-as-Reasoning Paradigm
An increasingly important research direction treats code generation as a form of planning, where the plan is an executable program whose correctness can be empirically verified by running it. This is the paradigm underlying deployed coding agents (Claude Code, Cursor, Devin), but the research frontier extends it in several directions.
The conceptual foundation is straightforward: code is a formal language with precise semantics, a compiler or interpreter that enforces syntactic correctness, and test suites that check behavioral correctness. Unlike natural language plans (“book a flight, then reserve a hotel”), code plans (call_api('flights', params); call_api('hotels', params)) can be executed, tested, and debugged. This makes code a natural intermediate representation for planning — more formal than English, less specialized than PDDL.
The Self-Planning approach introduced the idea of requiring an LLM to produce a sequence of high-level solution steps before generating code, then implementing each step according to the plan [32]. This explicit planning phase forces the model to reason about the problem structure before committing to implementation details. CodeChain extended this with clustering and self-revision during planning, constructing reusable modular code through multiple iterations [32]. CodePlan introduced multi-stage control flow with custom control instructions, enabling dynamic selection of “generate” or “modify” operations during execution [32]. CodeAct unified the action space by representing all agent actions as executable Python code, integrating a Python interpreter directly into the agent architecture for immediate execution and real-time feedback [32].
Lei et al. proposed the LLM Programming Workflow (LPW), a planning-driven programming approach where the LLM first outlines a solution plan that decomposes the problem into sub-problems, then verifies the plan against visible test cases before code implementation [33]. The verification step is crucial: by checking the plan’s logic against concrete examples before writing code, the system catches conceptual errors early. LPW achieved up to 16.4% improvement in Pass@1 accuracy on text-to-code benchmarks, with notable improvements on challenging problems. A sampling variant (SLPW) demonstrated up to 5.6% further improvement and set new state-of-the-art Pass@1 accuracy across multiple benchmarks, including 98.2% on HumanEval and 64.0% on APPS using GPT-4o [33].
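The LPW control flow — plan, verify the plan on visible tests, then implement and test — can be sketched as follows. The `llm_*` callables stand in for model calls; the stubs in the usage example are our own illustration, not part of the published system.

```python
def lpw(problem, visible_tests, llm_plan, llm_check_plan, llm_implement,
        max_rounds=3):
    """Plan, verify the plan against visible tests, then implement and test."""
    for _ in range(max_rounds):
        plan = llm_plan(problem)
        # Check the plan's logic against concrete examples *before* coding,
        # so conceptual errors are caught early.
        if not all(llm_check_plan(plan, t) for t in visible_tests):
            continue
        code = llm_implement(problem, plan)
        namespace = {}
        exec(code, namespace)                         # empirical verification
        if all(namespace["solve"](t["in"]) == t["out"] for t in visible_tests):
            return code
    return None

tests = [{"in": 2, "out": 4}, {"in": 3, "out": 9}]
code = lpw(
    "square a number", tests,
    llm_plan=lambda p: ["read n", "return n * n"],
    llm_check_plan=lambda plan, t: t["in"] ** 2 == t["out"],
    llm_implement=lambda p, plan: "def solve(n):\n    return n * n\n",
)
```

The two checkpoints are the essential structure: a cheap logical check on the plan filters out doomed implementations before any code is written, and the test suite then verifies what does get written.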
10.2. Formal Verification of LLM-Generated Code
A more ambitious direction seeks formal verification of LLM-generated code, moving beyond empirical testing (which can only demonstrate the absence of bugs in tested cases) to mathematical proof of correctness (which guarantees correctness for all inputs). Recent work explores hybrid generation-verification pipelines where LLMs generate code alongside formal annotations (preconditions, postconditions, invariants), and SMT-based verifiers validate correctness [34]. The Clover system implements a closed-loop pipeline where the LLM generates code, docstrings, and formal annotations, then uses reconstruction-based prompting to enforce consistency across these outputs, and finally applies SMT-based verification. Evaluated on Dafny programs, Clover accepted 87% of correct solutions while rejecting 100% of flawed ones — and even uncovered bugs in human-written code [34].
VeriBench provides an end-to-end benchmark for evaluating LLMs on generating both executable code and formal proofs of correctness in Lean 4 [34]. The benchmark includes programs from undergraduate computer science curricula and evaluates whether models can produce not just working code but verified code. Current results show this remains extremely challenging: even the best models achieve limited success rates on non-trivial verification tasks.
DafnyBench, a large-scale benchmark of over 750 Dafny programs stripped of verification hints, found that the best-performing system (GPT-4 Turbo) achieved approximately 68% success rate at regenerating verification annotations — promising but far from reliable [34].
These approaches remain nascent for practical deployment. The benchmarks are largely limited to short, algorithm-focused programs. Scaling formal verification to the kinds of complex, multi-file codebases that deployed coding agents operate on — where specifications are often implicit and the code interacts with external systems — remains a fundamental open problem. The research on natural language to formal specification translation (Section 6) could eventually help bridge this gap by automating the specification writing that formal verification requires.
10.3. Significance
Code-as-planning bridges the gap between informal text plans (which are easy to generate but hard to verify) and formal specifications (which are verifiable but hard to generate). Executable code occupies a middle ground: it is precise enough that correctness can be partially checked (by running tests, compiling, or type-checking) while being expressible enough to represent complex real-world workflows. The test suite acts as an approximate verifier — not as strong as a formal proof, but substantially stronger than no verification at all.
This is arguably the most deployment-ready research direction reviewed in this report, because coding agents already use plan-then-execute architectures and already run tests as a form of verification. The research contributions — explicit planning phases, plan verification before implementation, and formal verification of generated code — represent incremental improvements to an existing paradigm rather than requiring fundamentally new infrastructure.
11. Cross-Cutting Observations
Several structural patterns recur across the research areas surveyed.
11.1. The Generate-Verify Architecture Is Pervasive
Whether the verification is formal (LLM-Modulo’s symbolic critics [1]), empirical (running code and checking test results [33]), search-based (MCTS with process reward models [10]), or LLM-based (MAP’s monitor module [28]), almost every approach that improves on baseline LLM planning does so by adding some form of external feedback loop. The LLM generates; something else checks. This is the core architectural insight of the field.
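Stripped of its many instantiations, the pattern reduces to one loop: the generator proposes, an external checker disposes, and the checker's feedback is routed into the next attempt. The callables below are stand-ins for whatever generator/verifier pair a given system uses; the repairing generator in the toy instance is our own illustration.

```python
def generate_and_verify(generate, verify, max_attempts=5):
    """Generic generate-verify loop: `verify` returns (ok, feedback),
    and feedback conditions the next generation attempt."""
    feedback = None
    for _ in range(max_attempts):
        candidate = generate(feedback)
        ok, feedback = verify(candidate)
        if ok:
            return candidate
    return None  # no unverified output escapes the loop

# Toy instance: the generator "repairs" its guess using verifier feedback.
result = generate_and_verify(
    generate=lambda fb: 0 if fb is None else fb + 1,
    verify=lambda c: (c == 3, c),
)
```

The soundness property lives entirely in the verifier: however unreliable the generator, nothing is returned that the checker has not accepted.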
11.2. LLMs as Knowledge Sources, Not Reasoners
Kambhampati’s framing of LLMs as “approximate knowledge sources” rather than reasoners [1] is supported across the research landscape. The most successful systems use LLMs for what they are good at — generating plausible candidates, writing code that encodes domain knowledge, translating between representations — while delegating formal reasoning to classical algorithms (search, constraint solvers, verification tools). Corrêa et al.’s heuristic generation work [16] is perhaps the clearest illustration: the LLM encodes its knowledge as an executable heuristic function, and the classical search algorithm does the actual planning.
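The division of labor in the heuristic-generation line can be sketched concretely: a standard greedy best-first search loop with a pluggable heuristic function. In [16] the heuristic is Python code written by the LLM; here Manhattan distance on a toy 4x4 grid plays that role, and the domain is our own illustration.

```python
import heapq

def greedy_best_first(start, goal, neighbors, heuristic):
    """Ordinary greedy best-first search; the heuristic is the only
    domain-specific (and, in [16], LLM-generated) component."""
    frontier = [(heuristic(start), start)]
    came_from = {start: None}
    while frontier:
        _, node = heapq.heappop(frontier)
        if node == goal:                      # reconstruct the path
            path = []
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        for nxt in neighbors(node):
            if nxt not in came_from:
                came_from[nxt] = node
                heapq.heappush(frontier, (heuristic(nxt), nxt))
    return None

GOAL = (3, 3)

def neighbors(p):
    x, y = p
    return [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx <= 3 and 0 <= y + dy <= 3]

def manhattan(p):                             # stand-in for the LLM-written heuristic
    return abs(p[0] - GOAL[0]) + abs(p[1] - GOAL[1])

path = greedy_best_first((0, 0), GOAL, neighbors, manhattan)
```

The search loop never changes; what the LLM contributes is the function passed in as `heuristic`, which is exactly the "knowledge source, not reasoner" division.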
11.3. The Formal-Informal Boundary Is Shifting
Several research lines aim to automate the translation from informal natural language to formal representations: natural language to PDDL [18][19], natural language to executable world models [20][21], natural language to formal specifications for code [34]. If these translations become reliable, they could transform the planning landscape by making formal methods accessible without requiring users to write formal specifications.
11.4. Test-Time Compute as a Scaling Axis
The recognition that inference-time computation is a distinct scaling axis from model size and training data is a significant paradigm shift [8][9]. This has implications beyond planning: it suggests that the “intelligence” of a system is not fixed at training time but can be dynamically allocated at inference time based on problem difficulty. For planning specifically, this is natural — harder planning problems require more search, and test-time compute scaling provides a mechanism for this within the autoregressive framework.
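The simplest form of test-time compute allocation, Best-of-N sampling, makes the scaling axis tangible: if each sample is correct with probability p and a verifier can recognize correctness, accuracy grows as 1 - (1-p)^n with the sample budget n. A sketch under those toy assumptions (the 0.2 success probability and perfect verifier are illustrative):

```python
import random

def best_of_n(generate, score, n=8):
    """Draw n candidates and keep the one the verifier/reward model scores highest."""
    return max((generate() for _ in range(n)), key=score)

# Toy demonstration: each sample is independently correct with probability 0.2
# and the verifier recognizes correctness, so selected-answer accuracy
# approaches 1 - 0.8**8, about 0.83.
random.seed(0)
sample = lambda: random.random() < 0.2      # True = a correct candidate
hit_rate = sum(best_of_n(sample, score=int) for _ in range(1000)) / 1000
```

The same budget n is the knob that test-time scaling turns: harder problems warrant larger n, which is the adaptive-compute property classical planners get by design.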
11.5. Persistent Gaps
Despite significant progress, several fundamental gaps remain:
No deployed system uses any of these techniques. As documented in the companion report on deployed systems, no commercial product implements LLM-Modulo, MCTS-guided reasoning, LLM-generated heuristics, or symbolic world model generation. The gap between research and deployment remains wide. This is not for lack of effectiveness — the research results are often dramatic (GPT-4o improving from 3.43% to 40% on Natural Plan with LLM-Modulo [6]) — but rather reflects engineering and product constraints. Building domain-specific critics, integrating classical planners, and managing the computational cost of tree search are all engineering challenges that deployed products have so far avoided in favor of simpler architectures.
Open-domain planning remains unsolved. The research results reviewed here are strongest on well-defined domains with clear constraints (scheduling, Blocks World, mathematical reasoning, competitive programming). Open-ended planning tasks — “plan a marketing strategy,” “refactor this legacy codebase,” “research and write a report on X” — lack the formal structure that enables verification. For these tasks, there is no PDDL encoding, no constraint set to verify against, and no test suite to run. The research community has not yet addressed them convincingly. This is arguably the most important gap, because open-ended planning tasks are precisely the ones that commercially deployed systems are most frequently asked to perform.
Verification costs are not well understood. Most research reports planning accuracy but not the computational cost of achieving it. The LLM-Modulo framework may require 10-15 rounds of generation and verification [6]; MCTS may require hundreds of node expansions [10]; Best-of-N sampling may require generating 200 candidate heuristics [16]. The cost-accuracy tradeoff — how much additional compute is worth how much additional correctness — is underexplored, particularly in comparison to the simpler retry-based approaches used in deployed systems.
Non-Western contributions are underrepresented in the planning-specific literature. While DeepSeek-R1 from China [29] represents a major contribution to RL-trained reasoning, and the test-time scaling survey [9] draws on a globally distributed author list, the core neurosymbolic planning literature (LLM-Modulo, heuristic generation, brain-inspired architectures) remains heavily concentrated in North American and European institutions (Arizona State University, University of Basel, Microsoft Research). Planning research from institutions in Asia, Africa, and Latin America is less visible in the venues surveyed, though this may reflect publication venue biases rather than the actual distribution of research activity. The Text2World benchmark [19] and some test-time scaling work draw from Chinese institutions, and the Izhaki et al. heuristic generation work [17] originates from Bar-Ilan University in Israel. The Corrêa et al. heuristic work [16] includes authors from UFRGS in Brazil, funded by CAPES and FAPERGS — a notable contribution from a Latin American institution. But the overall geographic concentration is notable.
Context windows remain a binding constraint. As plans grow more complex, all token-based approaches — including extended chain-of-thought, MCTS over reasoning traces, and iterative LLM-Modulo refinement — face the fundamental limitation that plans, tool results, and conversation history compete for space in a finite token budget. Symbolic approaches (PDDL, world models as code) partially address this by representing state compactly, but the translation step itself consumes context. This is a structural limitation that formal planners do not share: a classical planner such as Fast Downward can handle arbitrarily large state spaces because it represents states symbolically, not as token sequences.
Evaluation methodology is fragmented. Different research groups evaluate on different benchmarks with different metrics. PlanBench [1], TravelPlanner [7], Natural Plan [6], IPC domains [16], and code generation benchmarks [33] test different aspects of planning ability. There is no unified benchmark that evaluates the full pipeline from natural language task description to verified plan execution across diverse domains. This makes cross-technique comparison difficult and may obscure fundamental differences in what each approach can and cannot do.
12. Conclusion
The planning research frontier for single-agent AI systems is characterized by a productive tension between two paradigms. The first treats LLMs as increasingly capable reasoners that can be improved through better training (RL for reasoning [29][30]), more compute at inference time [8][9], and better prompting strategies. The second treats LLMs as fundamentally limited reasoners that must be augmented with external formal components — verifiers [1][6], classical planners [16][18], search algorithms [10], and structured world models [20].
The empirical evidence increasingly favors the augmented approach. The highest-accuracy planning results come from systems that pair LLMs with external verification (LLM-Modulo’s guaranteed-correct outputs [6]), classical search (heuristic generation with greedy best-first search [16]), or formal state representations (PDDL generation with A* planning [18]). Pure LLM approaches — even RL-trained reasoning models — improve performance but cannot provide correctness guarantees.
Whether these research techniques will be adopted in commercial products is an open question. The deployed systems documented in the companion report have converged on a simpler architecture (LLM generation + human review + retry) that is “good enough” for many use cases. The research approaches reviewed here offer higher accuracy and formal guarantees, but at the cost of requiring domain-specific critics, formal specifications, or specialized infrastructure that deployed products have so far avoided building.
The most likely path to deployment may be through the code-as-planning paradigm (Section 10), where the plan is executable code that can be empirically tested — a middle ground between the informality of text plans and the formality of PDDL. Coding agents are already the most action-oriented deployed planning systems, and the research on planning-driven programming [33] and formal verification [34] could eventually close the gap between what these agents generate and what can be verified as correct.
Bibliography
[1] Kambhampati, S. et al. “Position: LLMs Can’t Plan, But Can Help Planning in LLM-Modulo Frameworks.” Proceedings of the 41st International Conference on Machine Learning (ICML), PMLR 235:22895-22907, 2024. [Verified] https://proceedings.mlr.press/v235/kambhampati24a.html
[2] Stechly, K., Valmeekam, K. and Kambhampati, S. “Chain of Thoughtlessness: An Analysis of CoT in Planning.” arXiv:2405.04776, 2024. [Preprint] https://arxiv.org/abs/2405.04776
[3] Valmeekam, K., Marquez, M. and Kambhampati, S. “Can Large Language Models Really Improve by Self-Critiquing Their Own Plans?” NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023. [Workshop paper]
[4] Valmeekam, K., Stechly, K. and Kambhampati, S. “LLMs Still Can’t Plan; Can LRMs? A Preliminary Evaluation of OpenAI’s o1 on PlanBench.” arXiv:2409.13373, 2024. [Preprint] https://arxiv.org/abs/2409.13373
[5] Kambhampati, S. “Can Large Language Models Reason and Plan?” Annals of the New York Academy of Sciences, 2024. [Published] https://nyaspubs.onlinelibrary.wiley.com/doi/abs/10.1111/nyas.15125
[6] Gundawar, A. et al. “Robust Planning with Compound LLM Architectures: An LLM-Modulo Approach.” arXiv:2411.14484, November 2024. [Preprint] https://arxiv.org/abs/2411.14484
[7] Gundawar, A. et al. “Robust Planning with LLM-Modulo Framework: Case Study in Travel Planning.” arXiv:2405.20625, May 2024. [Preprint] https://arxiv.org/abs/2405.20625
[8] Snell, C. et al. “Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Parameters for Reasoning.” ICLR 2025 Oral, 2025. [Published] https://openreview.net/forum?id=4FWAwZtd2n
[9] Zhang, Q. et al. “What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models.” arXiv:2503.24235, March 2025. [Preprint] https://testtimescaling.github.io/
[10] Zhang, D. et al. “ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search.” NeurIPS 2024, 2024. [Published] https://openreview.net/forum?id=8rcFOqEud5
[11] Jiang, J. et al. “Technical Report: Enhancing LLM Reasoning with Reward-guided Tree Search.” arXiv:2411.11694, November 2024. [Preprint] https://arxiv.org/abs/2411.11694
[12] Khalifa, M. et al. “Process Reward Models That Think.” arXiv:2504.16828, April 2025. [Preprint] https://arxiv.org/abs/2504.16828
[13] Zhu, K. et al. “Scaling Test-time Compute for LLM Agents.” arXiv:2506.12928, June 2025. [Preprint] https://arxiv.org/abs/2506.12928
[14] Liu, R. et al. “Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling.” arXiv:2502.06703, February 2025. [Preprint] https://arxiv.org/abs/2502.06703
[15] Liu, Y. et al. “Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory.” Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 27962-27994, July 2025. [Published] https://aclanthology.org/2025.acl-long.1356/
[16] Corrêa, A.B., Pereira, A.G. and Seipp, J. “Classical Planning with LLM-Generated Heuristics: Challenging the State of the Art with Python Code.” NeurIPS 2025, 2025. [Published] https://openreview.net/forum?id=UCV21BsuqA
[17] Izhaki, D., Green, R. and Shleyfman, A. “LLM-Generated Heuristics for AI Planning: Do We Even Need Domain-Independence Anymore?” arXiv:2501.18784, January 2025. [Preprint] https://arxiv.org/abs/2501.18784
[18] Yu, Z. et al. “Generating Symbolic World Models via Test-time Scaling of Large Language Models.” arXiv:2502.04728, February 2025. [Preprint] https://arxiv.org/abs/2502.04728
[19] Hu, Y. et al. “Text2World: Benchmarking Large Language Models for Symbolic World Model Generation.” arXiv:2502.13092, February 2025. [Preprint] https://arxiv.org/abs/2502.13092
[20] Tang, H. et al. “WorldCoder, a Model-Based LLM Agent: Building World Models by Writing Code and Interacting with the Environment.” NeurIPS 2024, 2024. [Published] https://proceedings.neurips.cc/paper_files/paper/2024/file/820c61a0cd419163ccbd2c33b268816e-Paper-Conference.pdf
[21] Agent2World. “Agent2World: Learning to Generate Symbolic World Models via Adaptive Multi-Agent Feedback.” arXiv:2512.22336, December 2025. [Preprint] https://arxiv.org/abs/2512.22336
[22] Hao, S. et al. “Reasoning with Language Model is Planning with World Model.” arXiv:2305.14992, 2023. [Preprint, presented at EMNLP 2023] https://arxiv.org/abs/2305.14992
[23] Gu, Y. et al. “Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents.” arXiv:2411.06559, November 2024. [Preprint] https://arxiv.org/abs/2411.06559
[24] Planning with Reasoning using Vision Language World Model. arXiv:2509.02722, September 2025. [Preprint] https://arxiv.org/abs/2509.02722
[25] “LLM-Based World Models Can Make Decisions Solely, But…” OpenReview, 2025. [Under review] https://openreview.net/pdf?id=XmYCERErcD
[26] Hafner, D. et al. “DreamerV3: Mastering Diverse Domains through World Models.” Nature, 2025. [Published]
[27] Schrittwieser, J. et al. “Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model.” Nature, 588:604-609, 2020. [Published]
[28] Webb, T., Mondal, S.S. and Momennejad, I. “A Brain-Inspired Agentic Architecture to Improve Planning with LLMs.” Nature Communications, 16:8633, September 2025. [Published] https://www.nature.com/articles/s41467-025-63804-5
[29] DeepSeek-AI. “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” Nature, 2025. [Published] https://www.nature.com/articles/s41586-025-09422-z
[30] “OpenAI o3.” Wikipedia. Accessed March 2026. [Reference] https://en.wikipedia.org/wiki/OpenAI_o3
[31] OpenAI. “Introducing OpenAI o3 and o4-mini.” April 2025. [Verified] https://openai.com/index/introducing-o3-and-o4-mini/
[32] Code generation planning approaches summarized from: “A Survey on Code Generation with LLM-based Agents.” arXiv:2508.00083, July 2025. [Preprint] https://arxiv.org/abs/2508.00083
[33] Lei, C. et al. “Planning-Driven Programming: A Large Language Model Programming Workflow.” Proceedings of the 63rd Annual Meeting of the ACL, July 2025. [Published] https://aclanthology.org/2025.acl-long.621.pdf
[34] Formal verification of LLM-generated code approaches summarized from: VeriBench benchmark. OpenReview, 2025. [Under review] https://openreview.net/pdf?id=rWkGFmnSNl
Report prepared March 2026. Sources include peer-reviewed publications, preprints, and technical reports as indicated in the bibliography. Preprints have not undergone peer review and results should be interpreted accordingly.