Recursive R&D Atlas

Latent Space Lab · Jacob Ortiz

Mapping the path from AI-assisted coding to AI-driven research.

The AI R&D Loop (Conceptual)

Model → Agent → Experiment → Evaluation → Training → New Model

Section 03 has interactive detail

“We are not there yet, and recursive self-improvement is not inevitable.”

Anthropic Institute, May 2026

Something is happening. Over 80% of Anthropic's merged production code is authored by Claude. The time horizon for autonomous task completion grew from 4 minutes to 12 hours in two years. AI systems are beginning to participate in research decisions. None of this is recursive self-improvement. The path is visible and the gaps are measurable. This project maps what the evidence actually says.

AI Code Authorship: 80%+ (Anthropic, May 2026)

Task Horizon: 12 hrs (from 4 min, Mar 2024)

Research Step Selection: 64% (beats humans, Apr 2026)

~12 min read · Sources: Anthropic Institute, METR · Educational tool, not a forecast

01

What This Is About

Recursive self-improvement refers to an AI system becoming capable of fully autonomously designing and developing its own successor. That capability does not exist today. What does exist is a measurable, documented trend toward greater AI involvement in the research and engineering process that produces AI systems.

The distinction matters. A lot of writing about this topic either collapses the gap between “AI writes code faster” and “AI builds itself,” or treats the gap as so large that the trend is irrelevant. Neither framing serves careful thinking.

This project is structured around that gap: what is already happening, what is measurably emerging, and what remains speculative. It covers where the bottlenecks are, what failure modes look like when the loop starts to close, and what the evidence, as of mid-2026, actually says about the path between here and there.

The Anthropic Institute's research agenda frames the core question this way: when AI systems develop their successors autonomously, how do humans exercise meaningful visibility into and control over those systems? That question does not require recursive self-improvement to already be happening. It requires it to be plausible enough to plan for.

METR's time-horizon measurements find that exponential trend fits outperform linear and logistic models. They explicitly reject logistic curves because there is no evidence of the exponential growth in time horizon slowing down. That is a statement about current observations, not a guarantee about the future. But it is a statement worth taking seriously.

02

The Record So Far

These are documented data points from Anthropic Institute publications and METR evaluations. The pace of change in the task horizon entries is the primary empirical case that this trend deserves serious attention.

Mar 2024 · Task Horizon

Task horizon: 4 minutes

Claude could complete software tasks that take a skilled human roughly 4 minutes. Measured at 50% success rate on well-specified tasks.

Source: Anthropic Institute

Nov 2024 · Benchmark

RE-Bench published

METR releases RE-Bench across seven ML research engineering environments. AI agents outperform humans at 2-hour budgets; humans win at 32 hours.

Source: METR RE-Bench

Mar 2025 · Task Horizon

Task horizon: 1.5 hours

The task completion time horizon grew from minutes to 90 minutes in one year. METR exponential trend fit outperforms linear models.

Source: Anthropic Institute / METR

Nov 2025 · Judgment

Research step selection: 51%

When evaluated on selecting the next research step given a problem, Claude beat human choices 51% of the time. Slightly above chance.

Source: Anthropic Institute

Mar 2026 · Task Horizon

Task horizon: 12 hours

Task horizon reached 12 hours. SWE-bench went from single-digit scores to saturation in two years. CORE-Bench followed in 15 months.

Source: Anthropic Institute / METR

Apr 2026 · Judgment

Research step selection: 64%

In 5 months, Claude improved from 51% to 64% on selecting research steps. Humans still chose the problem and designed the scoring rubric.

Source: Anthropic Institute

Apr 2026 · Benchmark

52x code speedup in optimization

Claude achieved roughly 52x speedup on well-specified optimization tasks, compared to ~3x a year prior. Humans set the goal; AI executed.

Source: Anthropic Institute

Apr 2026 · Benchmark

97% performance recovery, AI safety problem

Claude-powered agents recovered 97% of the performance gap on an AI safety research problem. Humans chose the problem and created the scoring rubric.

Source: Anthropic Institute

May 2026 · Task Horizon

METR time horizons updated

METR confirms exponential trend fit. Notes measurements above 16 hours are unreliable with current task suite. Day-length tasks may enter range in 2026.

Source: METR Time Horizons

Jun 2026 · Policy

Anthropic calls for coordinated pause option

Anthropic proposes that AI labs explore whether a coordinated option to slow or pause frontier development should exist if risks rise. Verification challenges acknowledged.

Source: Anthropic / Reuters

The task horizon grew 180x in two years, from 4-minute tasks in March 2024 to 12-hour tasks in March 2026. METR reports that measurements above 16 hours are currently unreliable with their existing task suite, meaning the ceiling has not yet been observed empirically.
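The 180x figure can be restated as a doubling time. The sketch below is a two-point calculation using only the endpoints quoted above (4 minutes in March 2024, 12 hours in March 2026); it is not METR's fit, and the doubling time it prints is implied by those two endpoints rather than taken from any source. The function name `projected_horizon_minutes` is illustrative.

```python
import math

# Endpoints quoted in the text; this is not METR's underlying dataset.
start_minutes = 4            # March 2024
end_minutes = 12 * 60        # March 2026
elapsed_months = 24

growth_factor = end_minutes / start_minutes          # 720 / 4
doublings = math.log2(growth_factor)                 # number of doublings in 24 months
doubling_time_months = elapsed_months / doublings    # assumes a constant doubling time


def projected_horizon_minutes(months_after_start: float) -> float:
    """Horizon implied by a constant doubling time through the two endpoints."""
    return start_minutes * 2 ** (months_after_start / doubling_time_months)


print(f"growth: {growth_factor:.0f}x")
print(f"doublings: {doublings:.2f}")
print(f"implied doubling time: {doubling_time_months:.1f} months")
```

Because the curve is pinned to two endpoints, it says nothing about whether growth between or beyond them was steady; that is what a fit over the full measurement series is for.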

03

The Recursive R&D Loop

The loop from model to model is not hypothetical. Each stage exists today in some form. What varies is the degree of human involvement at each node. Select any node to see what is automated, where humans remain critical, and where risks concentrate if that human involvement is removed.

Node 01: Model

What happens here

The current frontier model. It generates text, writes code, proposes hypotheses, and acts as the substrate for all downstream agent behavior. Every capability of every agent in the loop is bounded by what this model can do.

Automated today

Inference, context management, token generation, tool calling, multi-turn memory within a context window.

Still depends on humans

Architecture choices, training objective design, safety evaluations before release, decisions about what the model is allowed to do by default.

Dangerous if automated poorly

If a model with subtly misaligned values or miscalibrated confidence is used as the base, every downstream agent action inherits that flaw. Problems at this node propagate through every other stage.

Warning signs

  • Rapid capability jumps between versions with no corresponding safety re-evaluation

  • Reduced interpretability as model scale increases

  • Model self-reports of confidence that do not match actual reliability on held-out tasks

04

Bottleneck Simulator

Educational Model

This is a toy model for intuition, not a forecast or scientific instrument. Formulas are intentionally simplified and fully documented in the source code. Adjust the sliders to explore how acceleration and oversight interact. Do not treat any output as a prediction of real-world conditions.

The four capability inputs drive acceleration: how fast the AI R&D loop can run. The four oversight inputs set how much capacity exists to monitor and control that speed. When acceleration exceeds oversight capacity, the gap is the core risk. The derived outputs show what happens as that gap grows.

AI Capability Inputs (drives acceleration)

7/10 · How long AI agents can work autonomously without human checkpoints. In March 2026, frontier models reached 12-hour horizons (METR).

7/10 · Reliability of AI-produced code: correctness, edge-case handling, and test coverage. Over 80% of Anthropic merged production code is AI-authored as of May 2026.

5/10 · Fraction of AI-run experiments that return valid, non-corrupted, interpretable results. RE-Bench shows mixed reliability on extended tasks.

8/10 · Raw compute budget available for training runs and large-scale inference. Frontier labs have significant and growing compute.

Oversight Capacity Inputs (resists runaway)

4/10 · Bandwidth for humans to meaningfully review AI-generated code, experiments, and decisions. Currently lagging behind output volume at frontier labs.

4/10 · Quality and coverage of automated evals. Whether they measure what they claim to. Goodhart's law applies when evals are weak.

5/10 · Degree to which AI agent actions are bounded, logged, and reversible. Scaffold design determines blast radius.

3/10 · Governance Coordination: level of inter-lab coordination on safety standards and capability thresholds. Anthropic proposed a coordinated pause option in June 2026.

Derived Outputs

Acceleration: 68

Oversight: 40

Oversight Debt (acceleration minus oversight): +28

RSI Warning Level: HIGH. Significant oversight deficit. Multiple safety factors are lagging acceleration. Intervention is warranted.

RSI Proximity: 52

Current Bottleneck: Governance Coordination, 3/10 (weakest safety factor)

Most Likely Failure Mode: Rate Invisibility. No shared mechanism for measuring or slowing the aggregate rate of AI R&D acceleration across labs. Each organization proceeds independently, making the overall rate invisible to any single observer.
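The displayed Acceleration (68), Oversight (40), and Oversight Debt (+28) values are consistent with a very simple rule: take the unweighted mean of each group of four sliders, scale it to 0-100, and subtract. The sketch below assumes that rule; the simulator's actual formulas live in its source code and may differ. The dictionary keys are illustrative labels inferred from the slider descriptions, not the simulator's own identifiers, and RSI Proximity and the warning-level thresholds are left out because nothing on the page shows how they are derived.

```python
import math

# Slider values as displayed (0-10). Key names are illustrative labels only.
capability = {
    "autonomy_horizon": 7,
    "code_reliability": 7,
    "experiment_validity": 5,
    "compute_budget": 8,
}
oversight = {
    "human_review": 4,
    "eval_quality": 4,
    "action_containment": 5,
    "governance_coordination": 3,
}


def score(inputs: dict[str, int]) -> int:
    """Assumed rule: unweighted mean of 0-10 sliders, scaled to 0-100, rounded half up."""
    mean = sum(inputs.values()) / len(inputs)
    return math.floor(mean * 10 + 0.5)


acceleration = score(capability)
oversight_capacity = score(oversight)
oversight_debt = acceleration - oversight_capacity

# The current bottleneck is read here as the lowest oversight slider.
weakest = min(oversight, key=oversight.get)

print(acceleration, oversight_capacity, f"{oversight_debt:+d}", weakest)
```

Under this assumed rule, raising any single oversight slider by one point reduces the debt by 2.5 points before rounding, which is one way to read the claim that the gap, not either score alone, is the quantity to watch.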

Research judgment is the deepest bottleneck this simulator cannot quantify. No slider captures whether the AI system knows which problem is worth working on. Anthropic notes that “large performance gaps persist when it comes to Claude exercising judgement in choosing goals in both engineering and research.” As of April 2026, humans still chose the research problem and defined the scoring rubric in every published experiment.

05

When the Loop Closes Badly

These are failure modes specific to AI-driven R&D: patterns that emerge when AI systems participate in developing successor systems. Some are already documented in benchmark evaluations. Some are structural properties of the architecture that follow from how recursive improvement works. Some are open problems without a documented instance yet, but which cannot be ruled out.

06

What the Data Says

These are the key metrics currently in the public record on AI involvement in R&D. All figures are sourced from Anthropic Institute publications or METR evaluations. Interpretations reflect careful readings of what the data establishes and what it does not.

AI Code Authorship at Anthropic

80%+ of merged production code, May 2026
Trend: Rising

Interpretation

AI-assisted coding is not a future scenario. It is the current operating mode at one of the leading frontier labs. The question is what happens as this percentage approaches 100% and the AI begins contributing to model development itself.

Source: Anthropic Institute

Engineering Output Multiplier

8x more code shipped per quarter vs 2021-2025 baseline
Trend: Rising

Interpretation

The productivity multiplier from AI-assisted coding is large and real. Anthropic describes the possibility of 100-person teams achieving 10,000-person output as a plausible near-term scenario if trends continue.

Source: Anthropic Institute

Task Horizon Growth

180x, from 4 minutes (Mar 2024) to 12 hours (Mar 2026)
Trend: Rising

Interpretation

METR finds that exponential trend fits outperform linear and logistic models. If the trend continues, day-length tasks enter range in 2026. Multi-day tasks follow after that. The ceiling is not yet visible in the data.

Source: Anthropic Institute / METR

Research Step Selection

64%: Claude beats human choices, April 2026
Trend: Rising

Interpretation

Up from 51% in November 2025. This is a judgment task, not an execution task. The improvement is meaningful but the gap to autonomous research direction-setting is large: 64% is still well below the threshold needed for reliable unsupervised research.
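How far 51% or 64% sits from the 50% parity line depends on how many comparisons were scored, which the figures above do not state. The sketch below uses a normal approximation to show that dependence; the sample sizes are hypothetical placeholders chosen for illustration, and the function name `z_score_vs_parity` is not from any source.

```python
import math


def z_score_vs_parity(win_rate: float, n_comparisons: int) -> float:
    """Normal-approximation z-score of a win rate against the 50% parity baseline."""
    standard_error = math.sqrt(0.25 / n_comparisons)
    return (win_rate - 0.5) / standard_error


# Sample sizes are hypothetical; the published figures do not report them.
for n in (100, 1000):
    for rate in (0.51, 0.64):
        z = z_score_vs_parity(rate, n)
        print(f"n={n:4d}  win rate={rate:.0%}  z={z:.2f}")
```

At either hypothetical sample size, 51% stays within about one standard error of parity while 64% does not, which matches the description of the November 2025 result as only slightly above chance.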

Source: Anthropic Institute

RE-Bench: 2-hour vs 32-hour

Inverted: AI leads at 2 hours; humans lead at 32 hours
Trend: Mixed

Interpretation

AI outperforms humans at short task budgets, often completing tasks more than ten times faster. But humans improve at faster rates as time extends. The crossover point reveals something real about the limits of sustained autonomous research work.

Source: METR RE-Bench

Code Speedup in Optimization

52x vs ~3x a year prior, April 2026
Trend: Rising

Interpretation

On well-specified optimization tasks with superhuman performance, the speedup is dramatic. This is the strongest evidence in the record that AI is becoming a genuine R&D accelerant on clearly defined problems. The bottleneck is still problem definition.

Source: Anthropic Institute

07

What This Means

The loop has not closed. AI systems are not building their successors autonomously. Recursive self-improvement, as Anthropic defines it, has not occurred. These are not contested claims. They are the findings of the organizations closest to the frontier, stated plainly.

What is also not contested: the rate of change is fast and measurable. Task horizons grew 180x in two years. AI systems are already authoring the majority of code at a leading frontier lab. Research judgment, the hardest remaining bottleneck, improved from 51% to 64% in five months. If that trend line continues, the bottleneck narrows faster than most institutions are prepared for.

The Anthropic Institute names three open problems that need solving before any of this can be governed well: telemetry systems for measuring aggregate AI R&D speed, intervention points for slowing recursive self-improvement if it begins to occur, and clarity on which entities should control acceleration rates. None of those problems have solutions yet.

The useful question is not whether recursive self-improvement will happen. It is whether the people and institutions responsible for safety, governance, and oversight will be able to detect a meaningful change in rate and respond to it in time. The evidence reviewed here suggests the rate is already fast enough to demand that question be answered before the answer is needed.

The bottleneck is judgment: which problems matter, what surprising results mean, and whether an AI system evaluating its own outputs can be trusted to catch its own errors. Until those three things change, humans remain the irreplaceable component. The work is to understand what changes when they no longer are.

The map is not the territory. But a map that shows you where the roads end and the unmarked terrain begins is still worth having.