A State-of-the-Art Report, March 2026
1. Introduction
A central question in artificial intelligence is whether current systems can plan — that is, whether they can decompose goals into ordered sub-tasks, anticipate consequences of actions, verify progress against constraints, and recover from failures. This report examines how commercially deployed AI systems exhibit (or simulate) planning behaviour as of early 2026. It covers the major foundation model providers (OpenAI, Anthropic, Google DeepMind, Microsoft, Meta), prominent coding tools (Cursor, Cognition's Devin, GitHub Copilot), research products (Perplexity), and enterprise productivity assistants (Microsoft 365 Copilot).
The scope is deliberately narrow. This report does not cover multi-agent research architectures, academic planning systems built on PDDL or classical solvers, or model-based reinforcement learning (MuZero, Dreamer, etc.). Nor does it cover agentic frameworks like AutoGPT, CrewAI, or LangGraph in their research configurations. It addresses only what is commercially deployed and available to end users or enterprise customers today.
The core finding is that deployed AI systems overwhelmingly rely on a small number of architectural patterns for planning, all of which are fundamentally rooted in autoregressive token generation augmented by scaffolding code. None of the deployed systems examined implement classical planning, formal verification, learned world models, or state-space search in the sense understood by the AI planning community. What the industry calls “planning” is, in most cases, a combination of extended chain-of-thought prompting, LLM-generated task decomposition rendered as markdown, human-in-the-loop approval gates, and tool-calling loops with error-based retry. This is effective for a surprisingly wide range of tasks, but it is architecturally distinct from the planning systems described in the research literature.
2. The Foundational Architecture: Autoregressive Generation
All commercially deployed planning systems examined in this report are built on autoregressive large language models — transformer-based models that generate one token at a time, conditioned on the preceding sequence. This is worth stating explicitly because it determines what these systems can and cannot do.
Kambhampati et al. argued at ICML 2024 that autoregressive LLMs are fundamentally unable to perform planning or self-verification on their own, characterizing them as “a giant pseudo System 1” that produces approximate retrievals of plan-like outputs from training data rather than performing genuine search through a state space [1]. Even from an engineering perspective, a system that takes constant time per token cannot be doing principled reasoning whose computational complexity scales with problem difficulty [1]. This position has been influential and largely empirically confirmed: on the PlanBench benchmark of formal planning tasks, even state-of-the-art LLMs including GPT-4o, Claude 3 Opus, and Gemini Pro showed poor performance on multi-step plan generation [1].
Yet every system examined in this report is built on this same foundation. The variation lies entirely in what surrounds the LLM: how prompts are constructed, how outputs are verified, how tool calls are orchestrated, and how human oversight is integrated into the loop.
3. A Taxonomy of Planning Patterns in Deployed Systems
Across the systems examined, five distinct architectural patterns for planning emerge. These are not mutually exclusive — most deployed products combine two or more.
3.1. Single-Pass Generation
The simplest pattern: a user provides a prompt, the LLM generates a response in one pass (one token at a time), and the result is returned. Any “planning” is implicit in the model’s weights, learned from training data that includes plans, step-by-step instructions, project outlines, and structured workflows.
This is how standard ChatGPT (GPT-4o), Claude (Sonnet/Opus non-extended-thinking), and Gemini handle most requests. When a user asks for a “project plan” or “trip itinerary,” the model generates text that looks like a plan. It has not searched a state space, verified feasibility, or checked constraint satisfaction. It has produced a linguistically plausible sequence of tokens that pattern-matches against plans in its training data.
This pattern is effective for tasks where approximate plans are acceptable and where the user can visually verify the output (e.g., a rough travel itinerary, a document outline, a list of project milestones). It fails predictably on tasks requiring precise constraint satisfaction, resource allocation, or multi-step dependency reasoning [1].
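Stripped to its essentials, the pattern is a single model call with no surrounding loop. A minimal sketch, where the hypothetical `call_llm` stands in for any provider's completion API and returns canned text for illustration:

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a provider's completion API (one request, one response).
    A real implementation would stream tokens from a hosted model."""
    return "1. Book flights\n2. Reserve hotel\n3. Draft day-by-day itinerary"

def single_pass_plan(goal: str) -> str:
    # One prompt in, one response out. Any "planning" is implicit in the
    # model's weights: nothing is searched, verified, or checked for feasibility.
    return call_llm(f"Write a step-by-step plan for: {goal}")

plan = single_pass_plan("a three-day trip to Lisbon")
```

Everything plan-like in the output comes from the weights; the harness contributes nothing beyond prompt construction.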
3.2. Extended Chain-of-Thought (Test-Time Compute)
OpenAI’s o-series models (o1, o3, o4-mini) introduced a commercially significant variant: training models via reinforcement learning to generate an extended internal “chain of thought” before producing a final answer [2]. OpenAI describes this as a “private chain of thought” that enables the model to “plan ahead and reason through tasks, performing a series of intermediate reasoning steps” before answering [2]. The o3 model uses what has been characterized as “test-time compute,” exploring multiple potential solution paths during inference using varying amounts of search over chains of thought [3].
This is architecturally still autoregressive token generation — the model produces reasoning tokens before answer tokens — but the RL training creates behaviour that superficially resembles deliberative planning. The model spends more tokens (and thus more compute) on harder problems. OpenAI reports that o3 makes 20% fewer major errors than o1 on difficult real-world tasks [4]. On the ARC-AGI benchmark, o3 scored 87.7%, roughly three times o1’s accuracy [2].
Google has deployed similar capabilities with Gemini’s “thinking” models (Gemini 2.0 Flash Thinking), and DeepSeek released R1, an open-weight reasoning model trained with similar RL-on-chain-of-thought techniques. Anthropic has integrated extended thinking into Claude’s Opus and Sonnet models.
The critical architectural point is that no external verifier, planner, or world model is involved. The “planning” happens entirely within the token stream. The model cannot guarantee correctness, detect its own logical errors with certainty, or prove plan feasibility. It can, however, catch some of its own mistakes by “thinking longer,” which empirically improves performance on structured reasoning tasks.
3.3. Plan-Then-Execute Loops
This is the dominant pattern in deployed coding agents and research products. The system generates an explicit plan (typically as a markdown document or structured task list), optionally presents it for human approval, and then executes it step by step, using tool calls (file edits, terminal commands, web searches) at each step.
The plan is generated by the LLM. Execution is managed by an orchestration harness — application-level code that calls the LLM iteratively, passes tool results back into context, and manages state. The “planning” is still LLM token generation; what makes this pattern distinct is the explicit separation of planning and execution phases, and the persistence of the plan as a reviewable artifact.
This pattern appears in Claude Code’s plan mode [5], Cursor’s plan mode [6], Devin’s interactive planning [7], Perplexity’s Deep Research [8], Gemini Deep Research [9], and OpenAI’s Deep Research product.
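The separation of phases can be sketched as a small harness. Here `call_llm` is a deterministic stand-in for the model, and the plan format mimics the markdown checklists these products emit; none of the names correspond to any vendor's actual internals:

```python
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    """Hypothetical model stub; deterministic so the sketch is runnable."""
    if prompt.startswith("PLAN:"):
        return "- [ ] read config\n- [ ] edit parser\n- [ ] run tests"
    return f"done: {prompt}"

@dataclass
class Harness:
    """Minimal plan-then-execute loop: the LLM writes a markdown plan,
    a human may review it, then each item is executed via tool calls."""
    log: list = field(default_factory=list)

    def plan(self, task: str) -> list[str]:
        markdown = call_llm(f"PLAN: {task}")   # the plan is just text
        return [line[6:] for line in markdown.splitlines()]  # strip "- [ ] "

    def execute(self, steps: list[str]) -> None:
        for step in steps:                     # sequential execution
            result = call_llm(f"EXECUTE step: {step}")  # one tool call per step
            self.log.append(result)            # result becomes context for the next

h = Harness()
steps = h.plan("fix the YAML parser")
# ... a human approval gate would sit here ...
h.execute(steps)
```

The point of the sketch is what is absent: no state space, no backtracking, no verifier. The "plan" is a parsed string, and execution is a for-loop over model calls.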
3.4. Iterative Retrieval-Reason-Act (RAG-Based Planning)
This pattern extends plan-then-execute with retrieval: at each step of execution, the system searches for additional information (from the web, a codebase, internal documents, or a vector store), incorporates it into context, reasons about what to do next, and acts. The “plan” is dynamically refined based on what is found.
Perplexity’s Deep Research exemplifies this. The system formulates a research plan, executes parallel web searches, reads and ranks retrieved documents, synthesizes intermediate summaries, identifies knowledge gaps, and searches again — iteratively, over 2-15 minutes [8][10]. Google’s Deep Research operates similarly: it “iteratively plans its investigation — it formulates queries, reads results, identifies knowledge gaps, and searches again” [11]. Google specifically developed “a novel asynchronous task manager that maintains a shared state between the planner and task models, allowing for graceful error recovery without restarting the entire task” [9].
The key architectural element is the feedback loop: retrieval results modify the plan. This gives the system a limited form of environment-responsive behaviour. However, the “reasoning” at each step is still LLM generation — there is no formal model of the information landscape being searched, no constraint solver ensuring completeness, and no guarantee that the system has found the most relevant sources.
3.5. Tool-Orchestrated Workflows
In enterprise copilots like Microsoft 365 Copilot, “planning” manifests as the LLM deciding which tools to invoke and in what order. Microsoft’s architecture routes user prompts through Microsoft Graph (which provides access to emails, documents, calendar, and chat data), sends the grounded prompt to an LLM, and returns the response [12]. Copilot Studio’s “generative orchestration” allows the system to chain tools together dynamically: “The agent understands the user’s request or autonomous trigger, searches its library of actions, and finds the actions that fulfill the request. The agent then assembles and chains these actions together” [13].
OpenAI’s o3 and o4-mini models were similarly “trained to use tools through reinforcement learning — teaching them not just how to use tools, but to reason about when to use them” [4]. This means the model has learned patterns for when to emit tool-call tokens in its output stream.
This is the closest any deployed system comes to automated planning in the classical sense — the system is selecting and sequencing actions from a defined action space. But the selection is done by learned pattern matching in the LLM, not by search through a formal action space. The system cannot prove that its tool-call sequence will achieve the goal, and has no mechanism for backtracking if a tool call fails beyond re-prompting the LLM with the error message.
4. Deployed Systems: Detailed Analysis
4.1. Foundation Model Chat Products
ChatGPT (OpenAI). The standard GPT-4o model uses single-pass generation for all tasks. The o-series reasoning models (o3, o4-mini) add extended chain-of-thought. OpenAI reports that o4-mini scored 99.5% on AIME 2025 with tool use, making it the strongest model on that benchmark [4]. The reasoning models now have “full tool access within ChatGPT — including searching the web, analyzing uploaded files and other data with Python, reasoning deeply about visual inputs, and even generating images” [4]. Planning in this context means the model decides, within its chain of thought, when to invoke each tool. There is no external planner.
Claude (Anthropic). Claude’s Sonnet and Opus models support extended thinking (a feature analogous to the o-series models’ extended chain of thought). Claude’s products include tools for web search, code execution, file creation, and interaction with external services via MCP (Model Context Protocol). As with OpenAI’s products, planning is performed by the model within its token generation, with tool calls emitted as part of the output stream. There is no external planner or formal verification layer.
Gemini (Google DeepMind). Gemini offers standard models and “thinking” models (Gemini 2.0 Flash Thinking, Gemini 3). The standard models operate via single-pass generation. The thinking models add extended reasoning. Google has been particularly explicit about training models for agentic behaviour, noting that thinking models’ “innate characteristic of self-reflection and planning makes it a great fit for long-running agentic tasks” [9].
4.2. Deep Research Products
Deep Research represents the most sophisticated planning behaviour in any commercially deployed consumer product as of early 2026. Three major implementations exist.
Perplexity Deep Research. Perplexity’s advanced workflows “employ a more elaborate plan-and-execute paradigm. These advanced workflows break down complex queries into ordered research steps, leveraging internal agent systems that formulate a multi-turn search plan, execute each sub-query, and dynamically incorporate earlier results into subsequent context assembly” [10]. The system conducts focused retrieval at each step, collects and ranks new sources, and synthesizes intermediate summaries. The underlying architecture uses retrieval-augmented generation (RAG) with a multi-stage retrieval pipeline that combines lexical and semantic search, progressively refined by cross-encoder rerankers [14]. Perplexity scored 93.9% on SimpleQA for factual accuracy and 21.1% on Humanity’s Last Exam [8].
Gemini Deep Research (Google). Google developed a dedicated planning system for Deep Research. The model is trained to break a query into sub-tasks, determine which can be parallelized and which must be sequential, use search and web browsing tools to gather information, and reason over gathered information to decide its next move [9]. Google specifically trained models to be effective at “long multi-step planning in a data-efficient manner” to enable open-domain research [9]. The system uses Gemini 3 Pro as its reasoning core, described as “specifically trained to reduce hallucinations and maximize report quality during complex tasks” [11]. A shared-state asynchronous task manager enables error recovery without restarting [9].
OpenAI Deep Research. Launched in February 2025 as part of ChatGPT Pro, OpenAI’s Deep Research uses a version of o3 to conduct extended web research. The system generates research plans, executes web searches, reads and synthesizes documents, and produces comprehensive reports. Specific architectural details are less publicly documented than Google’s implementation.
All three products share a common pattern: the LLM generates a plan, the plan is optionally presented to the user for approval, the system executes iteratively using search and browsing tools, and the LLM synthesizes results into a report. The planning is LLM-generated and LLM-refined. The tool calls (search, browse, read) are managed by an orchestration harness. No formal planner, constraint solver, or verification system is involved.
4.3. Coding Agents
Coding agents represent the most action-oriented planning in deployed AI systems, because they must produce executable artifacts (code, terminal commands, file modifications) whose correctness can be partially verified (by running tests, compiling, or executing).
Claude Code (Anthropic). Claude Code operates as an agentic terminal assistant with three phases: gather context, take action, and verify results [5]. Its “plan mode” (activated via Shift+Tab) creates a read-only environment where the model explores the codebase, understands architectural patterns, and formulates plans without modifying files [15]. The plan is rendered as a markdown file in a dedicated folder [16]. After planning, the user can switch to execution mode, where the agent implements the plan step-by-step, using built-in tools for file reading, editing, terminal command execution, and codebase search [5]. The system “breaks work into steps, executes them, and adjusts based on what it learns” [5].
Architecturally, plan mode is implemented via prompt engineering: the system prompt restricts the model to read-only tools and instructs it to produce a plan [16]. There is no external planner. The plan is a markdown artifact generated by the LLM. Execution is an agentic loop where the LLM is repeatedly called with updated context (including tool results) until the task is complete or the user intervenes.
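The mechanism is simple enough to sketch: plan mode is a property of the harness's tool list, not of the model. The tool names below are illustrative, not Claude Code's actual identifiers:

```python
READ_ONLY = {"read_file", "search_code"}
ALL_TOOLS = READ_ONLY | {"edit_file", "run_command"}

def allowed_tools(mode: str) -> set[str]:
    """Plan mode is enforced in the harness: the tool list handed to
    the model simply omits anything that mutates state."""
    return READ_ONLY if mode == "plan" else ALL_TOOLS

def dispatch(mode: str, tool: str) -> str:
    # The harness, not the model, is the gatekeeper. Even if the model
    # emits a mutating tool call while planning, it is refused here.
    if tool not in allowed_tools(mode):
        raise PermissionError(f"{tool} is not available in {mode} mode")
    return f"ran {tool}"
```

This is why plan mode requires no special model: the same LLM runs in both phases, with a different tool inventory and system prompt.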
Cursor (Anysphere). Cursor 2.0, launched October 2025, introduced its own coding model (Composer) alongside a plan mode and multi-agent interface [6]. Plan mode generates an “editable Markdown plan with file paths, code references, and a to-do list” [17]. The agent then executes through the plan items sequentially [17]. Cursor supports running up to eight agents in parallel, each operating in an isolated copy of the codebase via git worktrees [6]. The system allows users to “plan with one model and build with another” — for example, using a reasoning model for planning and Composer for fast execution [6].
Cursor’s planning architecture is structurally similar to Claude Code’s: an LLM generates a text plan, the plan is presented for review, and execution proceeds through an agentic tool-calling loop. The multi-agent capability adds parallelism but does not fundamentally change the planning mechanism — each agent still plans and executes via LLM generation.
Devin (Cognition). Devin is positioned as the most autonomous coding agent in commercial deployment. It operates in a sandboxed environment with a shell, code editor, and browser [7]. The system plans by generating a step-by-step plan, presents it via “Interactive Planning” for human review and modification, and then executes autonomously [7][18]. Cognition describes Devin as capable of “planning and executing complex engineering tasks requiring thousands of decisions” and of “recalling relevant context at every step, learning over time, and fixing mistakes” [18].
A 2025 “Interactive Planning” feature allows human engineers to collaborate on a high-level roadmap before Devin begins execution [19]. Devin uses a self-assessed confidence system, flagging tasks as Green, Yellow, or Red based on likelihood of success, allowing human supervisors to focus on edge cases [19]. The system maintains a “DeepWiki” intelligence of the entire codebase [19].
Despite the marketing language of autonomy, Devin’s planning architecture follows the same fundamental pattern: LLM-generated plans, human approval gates, and iterative tool-calling execution. What distinguishes Devin is the scope of its tooling (full development environment), the persistence of its workspace across sessions, and the integration with software development workflows (Slack, GitHub PRs). Cognition reports that 67% of Devin’s pull requests are now merged (up from 34% previously), but independent testing found that Devin completed only about three out of twenty assigned tasks in one evaluation [20].
GitHub Copilot (Microsoft/GitHub). GitHub Copilot operates primarily as an in-editor code completion tool rather than a plan-and-execute agent. Architecturally, it is a plugin inside existing IDEs that suggests code based on the surrounding file context. While Copilot has boosted task completion speeds by approximately 55%, it “primarily suggests code snippets, and human engineers still retain responsibility for planning tasks, navigating repository context, debugging failures, and coordinating reviews” [21]. Copilot does not generate explicit plans or operate autonomously across files. It is included here to mark the boundary of the taxonomy: Copilot is an autocomplete tool, not a planning system.
4.4. Enterprise and Office Copilots
Microsoft 365 Copilot. Microsoft 365 Copilot is the most widely deployed enterprise AI assistant. Its architecture routes prompts through Microsoft Graph for data grounding (accessing the user’s emails, documents, calendar, and chats), sends the grounded prompt to an LLM, and returns a response within the context of the Microsoft 365 app being used [12]. The system “coordinates large language models” with organizational data to “summarize, predict, and generate content” [12].
Planning in M365 Copilot takes several forms. In standard usage (Copilot in Word, Excel, PowerPoint, Outlook), it is essentially single-pass generation grounded in organizational data. In more recent features — Agent Mode in Word (November 2025) and Agent Mode in Excel — the system “plans, executes, and validates multi-step tasks — like building models, reshaping tables, and creating charts — directly in the grid” [22]. A Workflows agent “helps users automate repetitive tasks across Microsoft 365 apps using simple prompts” [22]. The Researcher feature “tackles complex, multi-step research at work using advanced deep reasoning models” across organizational data sources [22].
Microsoft’s Copilot Studio enables organizations to build custom agents with “generative orchestration” — the system understands the user’s request, searches its library of available actions, and chains actions together dynamically [13]. As of late 2025, this orchestration uses GPT-5 Chat and GPT-5.2 as underlying models [23].
The planning exhibited by M365 Copilot is entirely LLM-driven orchestration. The “generative orchestration” in Copilot Studio means the LLM decides which pre-defined tools/actions to invoke and in what order. There is no formal planner, no constraint solver, and no verification beyond what the LLM itself generates. The system’s reliability depends on the quality of the available actions, the LLM’s ability to select appropriate ones, and the governance controls applied by administrators.
5. Cross-Cutting Architectural Observations
5.1. Plans Are Text, Not Formal Structures
In every deployed system examined, “plans” are natural language text (usually markdown), not formal representations. Claude Code’s plans are “effectively a markdown file that is written into Claude’s plans folder by Claude in plan mode” with “no extra structure beyond text” [16]. Cursor’s plans are “editable Markdown plans with file paths, code references, and a to-do list” [17]. Devin’s plans are step-by-step text presented for interactive review [7].
No deployed system generates plans in PDDL, temporal logic, constraint satisfaction formalisms, or any other formal planning language. No deployed system uses a plan validator (like VAL) to verify plan correctness. No deployed system performs state-space search. The plans are LLM outputs that look like plans because the LLM has been trained on text that includes plans.
5.2. Verification Is Empirical, Not Formal
Where verification exists in deployed systems, it is empirical rather than formal. Coding agents run tests and check compilation. Research agents present citations for human review. Office copilots rely on existing access controls and human oversight. No deployed system provides formal guarantees that a plan will achieve its stated goal.
The closest to formal verification is in coding agents that run automated test suites. Claude Code’s recommended workflow is “plan, implement, test, adjust” [5]. Devin runs CI pipelines and reports which tests passed and failed [7]. But the tests themselves must be written (often by the LLM), and passing tests does not guarantee correctness — it guarantees consistency with the test specification.
5.3. Human-in-the-Loop Is the Primary Safety Mechanism
Every deployed system relies on human oversight as its primary mechanism for catching planning failures. Plans are presented for review before execution (Claude Code plan mode, Cursor plan mode, Devin interactive planning, Gemini Deep Research plan approval, Perplexity plan display). Code changes require human approval or review (PR review, diff review). Research reports are presented with citations for human verification.
This is not a weakness of any particular product — it reflects a fundamental limitation of the underlying technology. Without formal verification or sound external critics, human review is the only available mechanism for ensuring plan correctness. Kambhampati’s LLM-Modulo framework [1], which pairs LLM generation with external model-based verifiers, remains a research proposal as of early 2026 and has not been deployed in any commercial product examined.
5.4. Error Recovery Is Retry-Based
When plan execution fails (a tool call returns an error, a test fails, a search returns no results), deployed systems universally handle it by feeding the error message back into the LLM’s context and asking it to try again. This is effective for simple errors (typos, wrong API calls, missing imports) but degrades on errors that require fundamental plan revision. There is no mechanism for systematic backtracking, alternative plan generation from a different starting point, or formal root-cause analysis.
Google’s Deep Research is a partial exception: it developed “a novel asynchronous task manager that maintains a shared state between the planner and task models, allowing for graceful error recovery without restarting the entire task” [9]. This is an engineering solution to the brittleness of long-running agentic loops, but it is state management rather than formal error recovery in the planning sense.
5.5. Context Window Is the Binding Constraint
The practical limit on planning depth in every deployed system is the context window of the underlying LLM. Plans, tool results, retrieved documents, and conversation history all compete for space in a finite token budget. As plans grow more complex, the system must either summarize earlier context (losing detail) or fail with a context overflow.
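A toy version of the trade-off, using whitespace-split word counts as a stand-in for real tokenization and a one-line `summarize` stub in place of an LLM summarization call:

```python
def summarize(messages: list[str]) -> str:
    """LLM stand-in: collapse old turns into one line. Detail is lost."""
    return f"[summary of {len(messages)} earlier steps]"

def fit_context(history: list[str], budget: int) -> list[str]:
    """Keep the transcript under a token budget by repeatedly folding
    the two oldest entries into a lossy summary."""
    def tokens(msgs: list[str]) -> int:
        return sum(len(m.split()) for m in msgs)  # crude token proxy
    kept = list(history)
    while tokens(kept) > budget and len(kept) > 2:
        kept = [summarize(kept[:2])] + kept[2:]   # oldest detail goes first
    return kept

history = [f"step {i}: tool output " + "x " * 30 for i in range(6)]
window = fit_context(history, budget=100)
```

The early steps survive only as a summary line: exactly the information a long-running plan may later need (a constraint noted in step 1, a file path read in step 2) is what gets compressed away.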
This is a structural limitation that formal planners do not share. A PDDL planner represents states symbolically and compactly, so the size of the problem it can search is bounded by compute and memory, not by a token budget. Deployed LLM-based systems are fundamentally limited in the complexity of plans they can maintain, reason about, and execute.
6. What Deployed Systems Do Not Do
It is worth being explicit about planning capabilities that exist in research but are absent from all deployed systems examined:
No deployed system uses classical AI planners (Fast Downward, STRIPS, HTN planners) as components. No deployed system uses PDDL or any formal planning language. No deployed system uses Monte Carlo Tree Search for plan generation. No deployed system uses a learned world model (in the MuZero/Dreamer sense) to simulate action consequences. No deployed system uses formal plan verification (VAL or equivalent). No deployed system implements Kambhampati’s LLM-Modulo framework with external model-based verifiers. No deployed system uses constraint solvers for resource allocation or scheduling. No deployed system guarantees plan correctness or optimality.
The gap between the planning research literature and deployed products is substantial. Research has demonstrated that combining LLMs with external verifiers can achieve 82% accuracy on Blocks World planning (up from near-zero for LLMs alone) [24], and that LLM-generated heuristics can outperform state-of-the-art domain-independent planning heuristics [25]. These results have not been incorporated into any commercial product examined.
7. Limitations and Failure Modes
7.1. Plan Coherence Degrades with Complexity
Deployed systems produce coherent plans for tasks of limited scope — a single feature implementation, a focused research question, a short document. As task complexity increases (multi-repository refactors, research requiring synthesis across dozens of contradictory sources, project plans with resource dependencies), plan quality degrades. This is consistent with the research finding that LLMs “struggle with tasks that require zero-shot or few-shot generation of multi-step plans” and display “systematic shortcomings in planning tasks” including hallucinating non-existent transitions and falling into loops [1][26].
7.2. No Constraint Satisfaction
When users ask for plans subject to constraints (budget limits, scheduling dependencies, resource conflicts), deployed systems cannot guarantee constraint satisfaction. They may produce plans that look like they satisfy constraints but have not been checked against a formal model. A travel plan may suggest impossible connections; a project plan may have dependency cycles; a code refactoring plan may introduce breaking changes in files not considered.
7.3. Sensitivity to Prompt Framing
The quality of LLM-generated plans is highly sensitive to how the task is described. The same underlying problem described in different ways can produce radically different plans. This is a well-documented property of autoregressive models but is particularly consequential for planning tasks where correctness matters. Experienced users of coding agents report that prompt engineering — specifying file budgets, architectural constraints, and execution order — is essential for reliable results [17][27].
7.4. The Appearance of Planning
A concern raised by multiple analysts is that these systems produce “the appearance of research, without any actual research happening” [28] — or, more broadly, the appearance of planning without the formal properties that the word “planning” implies in AI. When a user asks ChatGPT for a “project plan,” the output is a plausible-looking document, not a verified plan. When Devin generates a “plan” for a code migration, it is an LLM’s best guess at what steps are needed, not a formally derived action sequence. Users may treat these outputs with more confidence than their provenance warrants.
8. Conclusion
The planning capabilities of commercially deployed AI systems as of early 2026 are real but architecturally simple. They consist of LLM-generated text plans, optionally reviewed by humans, executed through tool-calling loops, with error recovery via retry. Extended chain-of-thought reasoning (in o-series, Gemini thinking, DeepSeek-R1, Claude extended thinking) improves performance on structured reasoning tasks but does not change the fundamental architecture. Deep Research products (Perplexity, Gemini, OpenAI) represent the most sophisticated deployed planning, with iterative retrieval-reason-act loops and shared-state task management, but still rely entirely on LLM generation for plan formulation and refinement.
No deployed system implements the formal planning, verification, or world-model-based search that the AI planning research community has developed. The gap between research and deployment is wide. Whether this gap will close — through adoption of LLM-Modulo frameworks, integration of classical planners, or development of learned world models for open-domain planning — remains an open question for the field.
Bibliography
[1] Kambhampati, S. et al. “Position: LLMs Can’t Plan, But Can Help Planning in LLM-Modulo Frameworks.” Proceedings of the 41st International Conference on Machine Learning (ICML), PMLR 235:22895-22907, 2024. [Verified] https://proceedings.mlr.press/v235/kambhampati24a.html
[2] “OpenAI o3.” Wikipedia. Accessed March 2026. [Verified] https://en.wikipedia.org/wiki/OpenAI_o3
[3] Potkalitsky, N. “ChatGPT o3 and the Architecture of Reason: Beyond the AI Wall.” Educating AI, January 2025. [Snippet] https://nickpotkalitsky.substack.com/p/chatgpt-o3-and-the-architecture-of
[4] OpenAI. “Introducing OpenAI o3 and o4-mini.” April 2025. [Verified] https://openai.com/index/introducing-o3-and-o4-mini/
[5] Anthropic. “How Claude Code Works.” Claude Code Documentation. Accessed March 2026. [Verified] https://code.claude.com/docs/en/how-claude-code-works
[6] “Cursor AI Code Editor 2.0: A New Era in Automated Programming.” Oreate AI Blog, January 2026. [Snippet] https://www.oreateai.com/blog/cursor-ai-code-editor-20-a-new-era-in-automated-programming/4e780164afbf597aee7be918ef44ece7
[7] Cognition. “Devin’s 2025 Performance Review: Learnings From 18 Months of Agents At Work.” Cognition Blog, 2025. [Verified] https://cognition.ai/blog/devin-annual-performance-review-2025
[8] Perplexity. “Introducing Perplexity Deep Research.” Perplexity Hub Blog, February 2025. [Verified] https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research
[9] Google. “Gemini Deep Research — Your Personal Research Assistant.” Gemini Overview, March 2025. [Verified] https://gemini.google/overview/deep-research/
[10] “Perplexity AI Models Explained and How Answers Are Generated.” Data Studios, March 2026. [Snippet] https://www.datastudios.org/post/perplexity-ai-models-explained-and-how-answers-are-generated-architecture-retrieval-model-selecti
[11] Google. “Build with Gemini Deep Research.” Google Developers Blog, December 2025. [Verified] https://blog.google/technology/developers/deep-research-agent-gemini-api/
[12] Microsoft. “Microsoft 365 Copilot Architecture and How It Works.” Microsoft Learn. Accessed March 2026. [Verified] https://learn.microsoft.com/en-us/copilot/microsoft-365/microsoft-365-copilot-architecture
[13] Microsoft. “Overview of Microsoft Copilot Studio 2025 Release Wave 1.” Microsoft Learn. Accessed March 2026. [Snippet] https://learn.microsoft.com/en-us/power-platform/release-plan/2025wave1/microsoft-copilot-studio/
[14] Perplexity. “Architecting and Evaluating an AI-First Search API.” Perplexity Research, 2025. [Verified] https://research.perplexity.ai/articles/architecting-and-evaluating-an-ai-first-search-api
[15] Lord, J. “Understanding Claude Code Plan Mode and the Architecture of Intent.” July 2025. [Snippet] https://lord.technology/2025/07/03/understanding-claude-code-plan-mode-and-the-architecture-of-intent.html
[16] Ronacher, A. “What Actually Is Claude Code’s Plan Mode?” Armin Ronacher’s Thoughts and Writings, December 2025. [Verified] https://lucumr.pocoo.org/2025/12/17/what-is-plan-mode/
[17] “Cursor AI Review (2026): Features, Workflow, & Why I Use It.” Prismic Blog, November 2025. [Verified] https://prismic.io/blog/cursor-ai
[18] Cognition. “Introducing Devin, the First AI Software Engineer.” Cognition Blog, 2024. [Verified] https://cognition.ai/blog/introducing-devin
[19] “The World’s First Autonomous AI Software Engineer: Devin Now Produces 25% of Cognition’s Code.” Financial Content / TokenRing, December 2025. [Snippet] https://markets.financialcontent.com/wral/article/tokenring-2025-12-30-the-worlds-first-autonomous-ai-software-engineer-devin-now-produces-25-of-cognitions-code
[20] “Devin AI Review: The Good, Bad & Costly Truth (2025 Tests).” Trickle Blog, 2025. [Snippet] https://trickle.so/blog/devin-ai-review
[21] “Cognition Business Breakdown & Founding Story.” Contrary Research. Accessed March 2026. [Verified] https://research.contrary.com/company/cognition
[22] “What’s New in Microsoft 365 Copilot | October 2025.” Microsoft Community Hub, October 2025. [Verified] https://techcommunity.microsoft.com/blog/microsoft365copilotblog/what%E2%80%99s-new-in-microsoft-365-copilot--october-2025/4464046
[23] “What’s New in Microsoft Copilot Studio: November 2025 Updates and Features.” Microsoft Copilot Blog, December 2025. [Verified] https://www.microsoft.com/en-us/microsoft-copilot/blog/copilot-studio/whats-new-in-microsoft-copilot-studio-november-2025/
[24] Gundawar, A. et al. “Robust Planning with Compound LLM Architectures: An LLM-Modulo Approach.” arXiv:2411.14484, November 2024. [Snippet] https://arxiv.org/abs/2411.14484
[25] Corrêa, A. et al. “Classical Planning with LLM-Generated Heuristics: Challenging the State of the Art with Python Code.” arXiv:2503.18809, March 2025. [Verified] https://arxiv.org/abs/2503.18809
[26] Kambhampati, S. et al. “A Brain-Inspired Agentic Architecture to Improve Planning with LLMs.” Nature Communications, September 2025. [Snippet] https://www.nature.com/articles/s41467-025-63804-5
[27] Tane, B. “How I Use Claude Code.” February 2026. [Verified] https://boristane.com/blog/how-i-use-claude-code/
[28] Furze, L. Cited in “Google Gemini Deep Research Can Now Access Your Workspace Files.” Developer Tech, November 2025. [Verified] https://www.developer-tech.com/news/gemini-deep-research-workspace-integration/