Capability: Autonomous Tool Use

AI Systems Performing Autonomous Knowledge Work

Status: Approaching ◐ Confidence: Moderate
AI systems are rapidly advancing at autonomous professional tasks but performance is unpredictably uneven - superhuman at some tasks, incompetent at adjacent ones - making reliable autonomous deployment premature.
Verified: 12 March 2026 · Last updated: 12 March 2026

AI systems are rapidly improving at autonomous professional tasks - coding, analysis, research, writing. However, the evidence is weaker than often claimed.

What the Evidence Actually Shows

  • SWE-Bench Verified (the most-cited coding benchmark) is contaminated - OpenAI stopped reporting on it
  • SWE-Bench Pro drops AI performance to approximately 23%
  • AISI evaluations show roughly 40% success on hour-long real-world tasks
  • The “jagged frontier” phenomenon (Mollick, Wharton School) means AI is superhuman at some tasks while incompetent at adjacent ones
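The evidence above turns on the difference between a headline aggregate score and task-level success rates. A minimal sketch, with entirely invented numbers, shows how a single benchmark figure can mask a jagged capability profile:

```python
# Hypothetical per-category task outcomes (1 = success, 0 = failure).
# All numbers are invented for illustration; real benchmark data differs.
results = {
    "code refactoring": [1, 1, 1, 1, 0, 1, 1, 1],     # strong category
    "dependency upgrades": [0, 0, 1, 0, 0, 0, 0, 0],  # weak adjacent category
    "bug localisation": [1, 0, 1, 1, 1, 0, 1, 1],
}

def success_rate(outcomes):
    """Fraction of tasks completed successfully."""
    return sum(outcomes) / len(outcomes)

# A single aggregate score hides the variance between categories.
aggregate = success_rate([o for outs in results.values() for o in outs])
per_category = {cat: success_rate(outs) for cat, outs in results.items()}

print(f"aggregate: {aggregate:.0%}")
for cat, rate in per_category.items():
    print(f"{cat}: {rate:.0%}")
```

Here the aggregate (~58%) sits nowhere near the per-category rates (88%, 13%, 75%), which is the practical problem with deploying on headline numbers: the model that "passes" the benchmark may reliably fail the specific adjacent task an organisation actually needs.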

Why “Approaching” Not “Passed”

This milestone is approaching rather than passed because reliable, consistent autonomous knowledge work has not been demonstrated. The capability is real but unpredictably uneven.

The honest framing is: “rapidly advancing but unpredictably uneven” - not “AI can do most professional knowledge work.”

The Reliability Gap

The gap between benchmark performance and real-world reliability is itself a significant finding. Organisations deploying AI for autonomous work based on headline benchmark numbers may be systematically overestimating reliability. This has implications for workforce planning, liability, and safety-critical applications.

Counterarguments

The strongest objections to this entry, with sources.

SWE-Bench Verified is contaminated (OpenAI stopped reporting on it). SWE-Bench Pro drops performance to ~23%. AISI shows ~40% success on hour-long real tasks. The “jagged frontier” means models are superhuman at some tasks and incompetent at adjacent ones. Claims that AI can do “most professional knowledge work” overstate current capability.

Source: Ethan Mollick (Wharton School, University of Pennsylvania); AISI evaluations; SWE-Bench benchmarking data

Response: The counterargument is partially incorporated into the milestone itself - the honest framing is “rapidly advancing but unpredictably uneven.” The gap between benchmark and real-world performance is itself a governance concern.

Tags: knowledge-work · autonomy · benchmarks · jagged-frontier