Capability: Autonomous Tool Use

AI Systems Performing Autonomous Knowledge Work

Status: Approaching ◐ Confidence: Moderate
AI systems are rapidly advancing at autonomous professional tasks but performance is unpredictably uneven - superhuman at some tasks, incompetent at adjacent ones - making reliable autonomous deployment premature.
Verified: 12 March 2026 · Last updated: 12 March 2026

AI systems are rapidly improving at autonomous professional tasks - coding, analysis, research, writing. However, the evidence is weaker than often claimed.

What the Evidence Actually Shows

  • SWE-Bench Verified (the most-cited coding benchmark) is contaminated - OpenAI stopped reporting on it
  • SWE-Bench Pro drops AI performance to approximately 23%
  • AISI evaluations show roughly 40% success on hour-long real-world tasks
  • The “jagged frontier” phenomenon (Mollick, Wharton School) means AI is superhuman at some tasks while incompetent at adjacent ones
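The evidence above turns on the difference between a headline aggregate score and task-level success rates. A minimal sketch, with entirely invented numbers, shows how a single benchmark figure can mask a jagged capability profile:

```python
# Hypothetical per-category task outcomes (1 = success, 0 = failure).
# All numbers are invented for illustration; real benchmark data differs.
results = {
    "code refactoring": [1, 1, 1, 1, 0, 1, 1, 1],     # strong category
    "dependency upgrades": [0, 0, 1, 0, 0, 0, 0, 0],  # weak adjacent category
    "bug localisation": [1, 0, 1, 1, 1, 0, 1, 1],
}

def success_rate(outcomes):
    """Fraction of tasks completed successfully."""
    return sum(outcomes) / len(outcomes)

# A single aggregate score hides the variance between categories.
aggregate = success_rate([o for outs in results.values() for o in outs])
per_category = {cat: success_rate(outs) for cat, outs in results.items()}

print(f"aggregate: {aggregate:.0%}")
for cat, rate in per_category.items():
    print(f"{cat}: {rate:.0%}")
```

Here the aggregate (~58%) sits nowhere near the per-category rates (88%, 13%, 75%), which is the practical problem with deploying on headline numbers: the model that "passes" the benchmark may reliably fail the specific adjacent task an organisation actually needs.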

Why “Approaching” Not “Passed”

This milestone is approaching rather than passed because reliable, consistent autonomous knowledge work has not been demonstrated. The capability is real but unpredictably uneven.

The honest framing is: “rapidly advancing but unpredictably uneven” - not “AI can do most professional knowledge work.”

The Reliability Gap

The gap between benchmark performance and real-world reliability is itself a significant finding. Organisations deploying AI for autonomous work based on headline benchmark numbers may be systematically overestimating reliability. This has implications for workforce planning, liability, and safety-critical applications.

Counterarguments

The strongest objections to this entry, with sources.

SWE-Bench Verified is contaminated (OpenAI stopped reporting on it). SWE-Bench Pro drops performance to ~23%. AISI shows ~40% success on hour-long real tasks. The “jagged frontier” means models are superhuman at some tasks and incompetent at adjacent ones. Claims that AI can do “most professional knowledge work” overstate current capability.

Source: Ethan Mollick (Wharton School, University of Pennsylvania); AISI evaluations; SWE-Bench benchmarking data

Response: The counterargument is partially incorporated into the milestone itself - the honest framing is “rapidly advancing but unpredictably uneven.” The gap between benchmark and real-world performance is itself a governance concern.

Tags: knowledge-work · autonomy · benchmarks · jagged-frontier