The Digital Transformation Playbook

The $3 Trillion Question: Can AI Match Human Experts?

Kieran Gilmurray

What happens when AI attempts the same complex work as human experts with 14 years of experience? The answer might reshape our understanding of the economic future.

TL;DR:

  • GDPval tests AI on complex, multimodal tasks requiring handling of CAD designs, spreadsheets, and presentations
  • Tasks are created from actual professional work products that take humans an average of 7 hours to complete
  • Claude Opus 4.1 performed best, with 47.6% of its deliverables rated as good as or better than human experts' work
  • AI shows potential to make workflows 1.39x faster and 1.63x cheaper when paired with human oversight
  • 3% of AI failures were classified as "catastrophic," including incorrect medical diagnoses and suggestions of financial fraud
  • Simple prompt improvements like asking models to self-check their work significantly reduced formatting errors
  • Current models still struggle with ambiguity and tasks requiring tacit knowledge or complex human interaction


GDPval represents a fundamental shift in how we evaluate artificial intelligence. Rather than relying on abstract academic metrics, this new benchmark from OpenAI measures how well frontier AI models handle real-world economic tasks across nine major sectors worth $3 trillion annually.

The methodology is ruthlessly practical—AI models must complete complex assignments that typically take human experts seven hours, handling everything from CAD designs to financial spreadsheets while synthesizing information from up to 38 reference documents.

The results are both promising and sobering. Claude Opus 4.1 led the evaluation with 47.6% of its outputs rated equal to or better than work from professionals at organizations like Apple, Goldman Sachs, and Boeing. When integrated into realistic workflows with human oversight, these models demonstrated potential to make knowledge work 1.39 times faster and 1.63 times cheaper.

Yet failures remain significant—3% were classified as "catastrophic," including incorrect medical diagnoses and recommendations of financial fraud.

Perhaps most valuable is GDPval's illumination of where AI currently excels (document formatting, data analysis) and where it falters (following complex instructions, handling ambiguity).

This economic lens offers businesses and policymakers unprecedented clarity about AI's near-term impact on knowledge work, while highlighting that the highest-value human skills—tacit knowledge, real-time collaboration, and complex communication—remain beyond current AI capabilities. 

How quickly will that gap close? That's the trillion-dollar question worth pondering.

Listen to the audio version of this report, created using Google NotebookLM, for your listening pleasure.

Link to research: GDPval.pdf 



𝗖𝗼𝗻𝘁𝗮𝗰𝘁 my team and me to get business results, not excuses.

☎️ https://calendly.com/kierangilmurray/results-not-excuses
✉️ kieran@gilmurray.co.uk
🌍 www.KieranGilmurray.com
📘 Kieran Gilmurray | LinkedIn
🦉 X / Twitter: https://twitter.com/KieranGilmurray
📽 YouTube: https://www.youtube.com/@KieranGilmurray

📕 Want to learn more about agentic AI? Then read my new book, Agentic AI and the Future of Work: https://tinyurl.com/MyBooksOnAmazonUK

SPEAKER_01:

So today we're diving deep into something that uh really shifts the frame on how we think about AI and work.

SPEAKER_02:

Definitely. We're looking at a new way to evaluate AI, not just, you know, academic scores, but how well models handle real-world tasks, the kind that actually, well, drive the economy.

SPEAKER_01:

Exactly. It's this benchmark, GDPval. Our sources are snippets from the technical paper introducing it, and it comes from OpenAI.

SPEAKER_02:

And the stakes here are pretty high. Usually when we talk about AI's impact on jobs, automation, replacement, we look at things like adoption rates or GDP changes.

SPEAKER_00:

Right, things that have already happened, lagging indicators.

SPEAKER_02:

Precisely. GDPval tries to give us leading indicators. It measures AI capabilities directly against what highly skilled humans produce, giving us a peek into the economic potential before it fully hits.

SPEAKER_01:

Okay, so our mission today, let's unpack the methodology here, see how these top AI models stack up against actual human experts, and figure out what this really means for the speed and maybe the cost of knowledge work.

SPEAKER_02:

And get this: GDPval isn't asking simple questions. It's giving AI models complex assignments.

SPEAKER_01:

Right. Assignments covering most of the work activities for 44 different jobs, according to the U.S. Bureau of Labor Statistics.

SPEAKER_02:

Yeah, 44 high-value occupations across the top nine sectors contributing to U.S. GDP. We're talking finance, healthcare, tech services.

SPEAKER_01:

Huge sectors. How much are they worth, roughly?

SPEAKER_02:

Collectively, around $3 trillion annually. They were very specific about choosing occupations that are mostly digital already, at least 60% computer-based work according to O*NET data.

SPEAKER_01:

Okay, so focusing where AI might realistically slot in first.

SPEAKER_02:

Exactly. Predominantly digital work.
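For readers following along in the notes, here's a minimal sketch of that screening step. The records and the "computer_share" field are invented stand-ins for O*NET's computer-use measures; GDPval's actual occupation list comes from the paper.

```python
# Illustrative sketch of the occupation screen described above: keep only
# occupations whose work is mostly computer-based. The records and the
# "computer_share" field are invented stand-ins for O*NET's measures.
occupations = [
    {"title": "Financial Analyst", "computer_share": 0.85},
    {"title": "Registered Nurse", "computer_share": 0.35},
    {"title": "Software Developer", "computer_share": 0.95},
]

digital_jobs = [o for o in occupations if o["computer_share"] >= 0.60]
print([o["title"] for o in digital_jobs])  # the nurse drops out: too little screen work
```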

SPEAKER_01:

And the realism aspect. This seems key. These aren't simplified textbook problems.

SPEAKER_02:

Not at all. The tasks are based on actual work products from expert professionals, people with, on average, 14 years of experience in their field.

SPEAKER_01:

14 years. Okay, so that's the human baseline they're comparing against. That's, yeah, that's substantial.

SPEAKER_02:

It really is. And it has to be because the tasks are tough. They're multimodal, meaning the AI isn't just reading text.

SPEAKER_01:

What kind of files are we talking about?

SPEAKER_02:

Oh, all sorts. CAD design files, spreadsheets, complex diagrams, videos, uh, presentation decks. A real mix.

SPEAKER_01:

Aaron Powell So it has to handle different data types, just like a person would.

SPEAKER_02:

Exactly. And each task requires context. Lots of it. For the gold subset, that's the open-source part, models needed to parse up to 17 reference files.

SPEAKER_01:

17 files per task.

SPEAKER_02:

Up to 17 for the gold set and up to 38 in the full benchmark set. It really forces the model to synthesize information from different places.

SPEAKER_01:

That sounds incredibly time consuming, even for a human expert.

SPEAKER_02:

Oh, absolutely. These are what they call long-horizon tasks. The average human expert took about 404 minutes, that's nearly seven hours, to complete just one task in the gold subset.

SPEAKER_01:

Seven hours of expert work. Wow. And some took longer.

SPEAKER_02:

Some spanned multiple weeks. These are complex assignments.

SPEAKER_01:

So how did they put a price tag on that? How do you value that kind of work?

SPEAKER_02:

It was pretty direct. They took the estimated completion time and multiplied it by the median hourly wage for that specific job.

SPEAKER_01:

Ah, okay. So every result has a clear economic link. Time saved equals money saved, potentially.

SPEAKER_02:

Exactly. It gives a quantifiable measure of potential efficiency gains.
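In code, the valuation rule they describe is just a one-liner. A minimal sketch, with the $53.60 wage a hypothetical figure chosen so the example lands near the roughly $361 average task cost quoted later in the conversation:

```python
# Task valuation as described: estimated completion time multiplied by the
# occupation's median hourly wage. The wage below is a hypothetical figure.

def task_value(minutes: float, median_hourly_wage: float) -> float:
    """Dollar value of one task: time to complete times the median wage."""
    return (minutes / 60) * median_hourly_wage

avg_minutes = 404   # average expert completion time on the gold subset
wage = 53.60        # hypothetical median hourly wage
print(f"${task_value(avg_minutes, wage):,.2f}")  # -> $360.91, roughly $361
```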

SPEAKER_01:

But you know, the comparison only holds up if the human experts were genuinely top tier. How rigorous was the selection? You mentioned 14 years average experience.

SPEAKER_02:

They were incredibly rigorous. Minimum four years of experience required, a strong resume, plus they had to pass a video interview and a background check.

SPEAKER_00:

Okay.

SPEAKER_02:

And the paper actually lists some of the kinds of places these experts came from. Think Apple, Goldman Sachs, IBM, Meta, Boeing, even the CDC.

SPEAKER_01:

Whoa. Okay, so these aren't just average professionals. This is the high end of the talent pool.

SPEAKER_02:

Definitely. They wanted the comparison to be against the best, the industry standard.

SPEAKER_01:

Which makes the results even more interesting. Now, how about grading? Comparing AI output to that level of human expertise must be tricky.

SPEAKER_02:

Super tricky. They used what's called a blinded head-to-head comparison, meaning an occupational expert, someone in that field, gets the original task request, all the reference files, and two final deliverables. One is from the AI, one from the human expert baseline.

SPEAKER_01:

And they don't know which is which.

SPEAKER_02:

Exactly. They just rank them. Which one is better, or are they comparable?

SPEAKER_01:

Is it just about correctness? Like, did it get the answer right?

SPEAKER_02:

No. And that's crucial. It's much more subjective. The graders consider structure, writing style, formatting, even aesthetics. You know, the kind of things a real boss or client actually cares about.
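A hedged sketch of that blinded protocol, with invented verdicts, just to show the shape: the grader sees two unlabeled deliverables in random order, and the headline metric is how often the model's work was rated as good as or better than the expert's.

```python
# Blinded head-to-head grading as described; verdicts below are invented.
import random

def blind_pair(ai_output: str, human_output: str):
    """Shuffle the pair so the grader can't tell which deliverable is whose."""
    pair = [("ai", ai_output), ("human", human_output)]
    random.shuffle(pair)
    return pair

shuffled = blind_pair("model_report.pdf", "expert_report.pdf")
print([path for _, path in shuffled])  # the grader sees only the files, in random order

# One verdict per task: which deliverable the grader preferred, or a tie.
verdicts = ["ai", "human", "tie", "human", "ai", "tie", "human"]
win_or_tie = sum(v in ("ai", "tie") for v in verdicts) / len(verdicts)
print(f"model rated as good or better: {win_or_tie:.1%}")
```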

SPEAKER_01:

That makes sense. Polish matters in professional work.

SPEAKER_02:

Absolutely. And this grading process takes time. For the gold subset, it took the human graders over an hour, on average, just to compare one pair of deliverables.

SPEAKER_01:

An hour per comparison? That's dedication. Okay, this next bit sounds potentially huge for the future of AI evaluation itself. They developed an automated grader.

SPEAKER_02:

Yeah, an experimental one based on a high-end GPT-5 model, specifically for the open-source tasks. And get this: its rankings lined up with human graders almost as often as two human graders lined up with each other.

SPEAKER_01:

Ah, so the AI grader is only 5% less consistent than two humans judging the same subjective work.

SPEAKER_02:

Exactly. It suggests AI is getting remarkably close to making human-like judgments about quality, even on things like style and structure. That's a big step.
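A small sketch of that consistency check: compare how often the automated grader agrees with a human against how often two humans agree with each other. All verdicts here are invented for illustration; the conversation puts the real gap at around five percentage points.

```python
# Consistency check: automated grader vs. the human inter-rater baseline.
# All verdicts are invented for illustration.

def agreement_rate(a, b):
    """Fraction of tasks where two graders returned the same verdict."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

human_1 = ["ai", "human", "tie", "human", "ai", "tie", "human", "ai"]
human_2 = ["ai", "human", "tie", "human", "ai", "human", "human", "ai"]
auto    = ["ai", "human", "tie", "ai", "ai", "human", "human", "ai"]

print(f"human-human: {agreement_rate(human_1, human_2):.0%}")  # inter-rater baseline
print(f"auto-human:  {agreement_rate(auto, human_1):.0%}")     # automated grader
```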

SPEAKER_00:

Okay, let's get to the headline results then. How did the frontier models actually perform against these elite humans?

SPEAKER_02:

The overall trend showed performance improving, uh, pretty much linearly over time as models get better. And the key finding: the current best models are genuinely approaching parity with these industry experts on deliverable quality.

SPEAKER_01:

Approaching parity. Wow. Any specific models stand out?

SPEAKER_02:

Yes. On that gold subset, Claude Opus 4.1 came out on top. Nearly half, 47.6%, of its deliverables were rated as either better than or as good as the human expert's output.

SPEAKER_01:

Better than or equal to an expert with 14 years' experience, almost half the time. That's impressive. What were its strengths?

SPEAKER_02:

Claude particularly shone in aesthetics: things like document formatting, slide layouts, basically the overall polish. It also did well with file types like PDFs and Excel spreadsheets.

SPEAKER_01:

Interesting. And what about OpenAI's own model, GPT-5?

SPEAKER_02:

GPT-5 was very competitive, but its strength seemed to lie more in accuracy: carefully following instructions, getting calculations right, especially on tasks that were purely text-based.

SPEAKER_01:

Okay, so Claude for polish, GPT-5 for precision. Why does Claude winning on aesthetics matter economically? Isn't accuracy king?

SPEAKER_02:

You'd think so, but in a lot of high-value knowledge work, the kind done in these three-trillion-dollar sectors, presentation matters. A deliverable that looks professional, that's well structured and easy to understand, often gets accepted and used much faster.

SPEAKER_01:

Right. Less back and forth, fewer revisions, maybe.

SPEAKER_02:

Exactly. So Claude's strength there could translate to a higher rate of actually usable output, which is a direct economic benefit.

SPEAKER_01:

But it wasn't all smooth sailing for the models, was it? Where did they tend to fall short compared to the humans?

SPEAKER_02:

Yeah, the analysis of why models lost points is really revealing. Across the board, for all models, the single biggest reason for being ranked lower than the human was failure to fully follow instructions.

SPEAKER_01:

Ah, the classic AI challenge, just not quite doing what was asked.

SPEAKER_02:

Pretty much. Models like Gemini and Grok often lost because they'd say they were going to provide something, like generate a specific file, but then just didn't. Or they'd ignore critical data from the reference files.

SPEAKER_01:

And GPT-5, it had fewer instruction issues, you said.

SPEAKER_02:

Right, fewer instruction issues, but it lost points on formatting instead. Like the content might be okay, but the output wasn't styled correctly for, say, a PowerPoint slide or a formal document.

SPEAKER_01:

Okay, so instruction following and output formatting are the big hurdles right now. Let's tie this back to the economics. We know the average human task cost about $361. Given these quality results, what did the study find about potential cost savings when using AI assistance?

SPEAKER_02:

They modeled this. They looked at a scenario where an expert uses the AI, reviews the output, maybe asks the AI to try again, and if it's still not right after a few tries, the expert just fixes it themselves.

SPEAKER_01:

A realistic workflow, probably.

SPEAKER_02:

Very much so. And in that setup, the efficiency gains were clear. Using GPT-5 as an assistant, the workflow was 1.39 times faster than the human expert working alone.

SPEAKER_01:

Almost 40% faster. And the cost?

SPEAKER_02:

The cost saving was even better. 1.63 times cheaper.

SPEAKER_01:

Wow. So significantly faster and cheaper, even accounting for some potential rework by the human.

SPEAKER_02:

That's the potential, yes. For every, say, 10 hours an expert spends, AI assistants could cut that down to maybe seven hours and reduce the cost by close to 40%. That's a huge shift for these expensive professions.
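A rough sketch of that try-review-fallback workflow, purely to show the shape of the calculation. The review time, API cost, and success rate below are assumptions, not figures from the paper.

```python
# Expected cost of an AI-assisted workflow: the expert prompts the model,
# reviews each attempt, and redoes the task manually if no attempt is usable.
# All parameter values are illustrative assumptions.

def assisted_cost(manual_minutes, wage_per_min, review_minutes, api_cost,
                  p_usable, max_tries=3):
    expected, p_still_failing = 0.0, 1.0
    for _ in range(max_tries):
        # Each attempt costs a model call plus expert review time.
        expected += p_still_failing * (review_minutes * wage_per_min + api_cost)
        p_still_failing *= 1 - p_usable
    # If every attempt failed, the expert does the task from scratch.
    expected += p_still_failing * manual_minutes * wage_per_min
    return expected

baseline = 404 * 0.90  # expert alone: 404 minutes at $0.90/minute
assisted = assisted_cost(404, 0.90, review_minutes=30, api_cost=2.0, p_usable=0.45)
print(f"baseline ${baseline:.0f} vs. assisted ${assisted:.0f}")
```

Under these made-up numbers the loop comes out well ahead; the paper's own accounting, with its measured review and retry behavior, is what yields the 1.39x and 1.63x figures.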

SPEAKER_01:

But there's always a but, isn't there? What's the catch?

SPEAKER_02:

The catch is the cost of that human oversight. The study emphasizes that when you factor in the expert's time needed to carefully review the AI's work and potentially fix or redo parts of it, the net savings shrink.

SPEAKER_01:

Ah, right. The oversight isn't free, it takes expert time too.

SPEAKER_02:

Exactly. It proves that human oversight isn't just a good idea, it's a necessary cost. You can't just let the AI run unsupervised, not yet anyway.

SPEAKER_01:

Which brings us to the failures. If oversight is essential, how bad can things get when the AI messes up? They rated the severity of GPT-5's errors, right?

SPEAKER_02:

They did. And while most failures were categorized as acceptable but subpar, meaning not great but not disastrous, a significant chunk, about 29%, were rated as bad or even catastrophic.

SPEAKER_01:

Catastrophic? What does that mean in this context?

SPEAKER_02:

That's the really worrying part. Catastrophic failures made up about three percent of the total failures. And these included things like the AI giving a wrong medical diagnosis.

SPEAKER_00:

Oh wow.

SPEAKER_02:

Recommending financial fraud, or suggesting actions that could lead to actual physical harm.

SPEAKER_01:

Wait, three percent of failures involved recommending fraud or potentially causing harm. That seems incredibly high stakes.

SPEAKER_02:

It absolutely is.

SPEAKER_01:

Does this benchmark then almost prove the opposite point? That AI can't be trusted in fields like medicine or finance without extremely careful, dedicated human validation on every single output?

SPEAKER_02:

It certainly underscores that point heavily. The potential efficiency is there, yes. But the risk associated with these high-severity failures in professional services is just too great to ignore. The models are useful assistants, but nowhere near autonomous for critical decisions.

SPEAKER_01:

Okay, so there's work to do. Is there good news on the improvement front? Can these models get better easily?

SPEAKER_02:

Yes, actually. The study found some relatively easy wins. For instance, just giving the model more thinking time, increasing its reasoning effort, led to predictable performance improvements.

SPEAKER_01:

More compute helps. Makes sense.

SPEAKER_02:

And the power of just prompting better was really striking. Remember GPT-5 losing points on formatting?

SPEAKER_01:

Yeah, the PowerPoint issues.

SPEAKER_02:

They tried adding a simple instruction to the prompt, basically telling GPT-5 to double-check its own work rigorously. Things like "render the file as an image to check the layout before you finish."

SPEAKER_01:

Like a self-correction step.

SPEAKER_02:

Exactly. And that simple addition dramatically reduced those bad formatting errors in PowerPoint files. They dropped from 86% down to 64%.

SPEAKER_01:

Just from one extra instruction? That's a big jump.

SPEAKER_02:

It really is. It suggests that better training or even just smarter scaffolding prompts can quickly improve the practical usability and reduce how much human fixing is needed.
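A sketch of the kind of self-check scaffolding being described here: append an instruction that tells the model to verify its own layout before returning the file. The wording paraphrases the idea; the paper's exact prompt isn't reproduced.

```python
# Self-check scaffolding: wrap a task prompt with a verification step.
# The instruction text is a paraphrase, not the paper's exact prompt.

SELF_CHECK = (
    "Before you finish: render the file as an image, inspect the layout for "
    "overflowing text, broken tables, or misaligned elements, fix any issues "
    "you find, and only then return the final file."
)

def scaffolded_prompt(task_prompt: str) -> str:
    """Wrap a task prompt with a self-verification step."""
    return f"{task_prompt}\n\n{SELF_CHECK}"

print(scaffolded_prompt("Create a 10-slide deck summarizing the Q3 results."))
```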

SPEAKER_01:

What about the boundaries? Where do current models still struggle significantly?

SPEAKER_02:

Ambiguity seems to be a major one. They tested the models with much shorter prompts, only 42% of the original length, leaving out a lot of context. Performance dropped quite a bit. The models struggled to sort of fill in the blanks, to figure out the missing context and what inputs were needed. It highlights a gap between how humans navigate fuzzy real-world requests and how AI still relies on very specific instructions.

SPEAKER_01:

That makes sense. And it's important to remember what this benchmark doesn't cover.

SPEAKER_02:

Absolutely critical, yeah. GDPval is focused on self-contained knowledge work, producing digital outputs. It explicitly leaves out anything involving manual labor or physical tasks.

SPEAKER_00:

Okay.

SPEAKER_02:

It also excludes tasks that need extensive tacit knowledge, that gut feel or deep experience that's hard to write down. And crucially, it doesn't test tasks requiring real-time communication between people or collaboration or using specialized proprietary software.

SPEAKER_01:

So it's a specific slice of knowledge work, albeit a large and valuable one.

SPEAKER_02:

A very specific digital self-contained slice. Those boundaries matter.

SPEAKER_01:

Okay, so let's try and synthesize this. What's the big picture takeaway here?

SPEAKER_02:

Well, it seems clear that frontier AI models are capable of producing high-quality work that approaches human expert levels, at least in these defined digital tasks. This definitely points to real potential for significant time and cost savings.

SPEAKER_01:

But, and it's a big but, those savings really only happen if you have robust human oversight baked into the process. You need experts to catch the instruction errors, the formatting glitches, and especially those rare but potentially catastrophic failures.

SPEAKER_02:

Right. The potential in that $3 trillion digital knowledge sector is there, but managing that, say, 3% risk of serious error is absolutely paramount.

SPEAKER_01:

So GDPval gives us a much sharper lens to track this progress using real economic metrics like time and cost.

SPEAKER_02:

Exactly. It shifts the debate away from just theoretical capabilities towards tangible economic impact. It's a powerful new tool for businesses and policy makers.

SPEAKER_01:

Okay, so here's something to leave you, our listener, thinking about.

SPEAKER_02:

The source material notes that this benchmark aligns with economic ideas suggesting digital tasks often involve more non-routine cognitive work. However, as we just discussed, GDPval specifically excludes tasks needing deep tacit knowledge or complex human interaction and communication.

SPEAKER_00:

Right. The stuff that's maybe hardest to automate.

SPEAKER_02:

Exactly. So if the greatest remaining value in human labor lies in those non-routine interpersonal skills, complex communication, collaboration, dealing with ambiguity, the areas outside GDPval's current scope, how quickly do AI models need to develop those capabilities to keep driving these big efficiency gains across the whole economy?

SPEAKER_01:

That's the next frontier, isn't it? Something to mull over.
