
AI Unscripted with Kieran Gilmurray
Kieran Gilmurray is a globally recognised authority on Artificial Intelligence, cloud, intelligent automation, data analytics, agentic AI, and digital transformation. I have authored three influential books and hundreds of articles that have shaped industry perspectives on digital transformation, data analytics and artificial intelligence.
๐ช๐ต๐ฎ๐ ๐๐ผ ๐ ๐๐ผโ
When I'm not chairing international conferences, serving as a fractional CTO or Chief AI Officer, Iโm delivering AI, leadership, and strategy masterclasses to governments and industry leaders. My team and I help global businesses, driving AI, digital transformation and innovation programs that deliver tangible results.
I am the multiple award winning CEO of Kieran Gilmurray and Company Limited and the Chief AI Innovator for the award winning Technology Transformation Group (TTG) in London.
๐ ๐๐ฐ๐๐ซ๐๐ฌ:
๐นTop 25 Thought Leader Generative AI 2025
๐นTop 50 Global Thought Leaders and Influencers on Agentic AI 2025
๐นTop 100 Thought Leader Agentic AI 2025
๐นTeam of the Year at the UK IT Industry Awards
๐นTop 50 Global Thought Leaders and Influencers on Generative AI 2024
๐นTop 50 Global Thought Leaders and Influencers on Manufacturing 2024
๐นBest LinkedIn Influencers Artificial Intelligence and Marketing 2024
๐นSeven-time LinkedIn Top Voice.
๐นTop 14 people to follow in data in 2023.
๐นWorld's Top 200 Business and Technology Innovators.
๐นTop 50 Intelligent Automation Influencers.
๐นTop 50 Brand Ambassadors.
๐นGlobal Intelligent Automation Award Winner.
๐นTop 20 Data Pros you NEED to follow.
๐ฆ๐ผ...๐๐ผ๐ป๐๐ฎ๐ฐ๐ ๐ ๐ฒ to get business results, not excuses.
โ๏ธ https://calendly.com/kierangilmurray/30min.
โ๏ธ kieran@gilmurray.co.uk or kieran.gilmurray@thettg.com
๐ www.KieranGilmurray.com
๐ Kieran Gilmurray | LinkedIn
AI Unscripted with Kieran Gilmurray
Fooled Ya! How GPT-4.5 Just Broke the Turing Test
The Turing Test, once a distant philosophical thought experiment, has suddenly become startlingly relevant in our AI-saturated world. We dive deep into groundbreaking research that reveals something extraordinary: today's advanced language models can consistently pass this iconic test of machine intelligenceโand sometimes outperform humans at appearing human.
This fascinating study examined how effectively modern LLMs like GPT-4.5 and LAMA 3.1 could convince human judges they were real people in controlled Turing Test environments. The results are mind-blowing: when given specific persona prompts, GPT-4.5 achieved a 73% win rate, meaning judges mistakenly identified it as human nearly three-quarters of the time. Even more remarkably, the AI was often more convincing than actual humans in parallel tests.
We explore the nuances that made this possible, from the crucial role of persona-based prompting to the surprising ineffectiveness of common detection strategies. Counter to intuition, asking about emotions or personal experiences proved less effective at identifying AI than random, unexpected questions. The research reveals an almost paradoxical finding: appearing less knowledgeable sometimes made AI seem more human, highlighting the complex psychological dynamics at play when we evaluate humanity.
Beyond the technical achievements, these findings raise profound questions about our digital future. As AI becomes increasingly indistinguishable from humans in conversation, what does this mean for online trust, employment, and our fundamental understanding of what makes us human? As we navigate this new frontier where machines can mimic our social intelligence with uncanny precision, perhaps the real value of the Turing Test isn't in what it tells us about machines, but what it reveals about ourselves. Join us for this thought-provoking exploration of intelligence, identity, and the blurring boundaries between human and machine.
Link to research: https://arxiv.org/pdf/2503.23674
For more information:
๐ Visit my website: https://KieranGilmurray.com
๐ LinkedIn: https://www.linkedin.com/in/kierangilmurray/
๐ฆ X / Twitter: https://twitter.com/KieranGilmurray
๐ฝ YouTube: https://www.youtube.com/@KieranGilmurray
๐ Buy my book 'The A-Z of Organizational Digital Transformation' - https://kierangilmurray.com/product/the-a-z-organizational-digital-transformation-digital-book/
๐ Buy my book 'The A-Z of Generative AI - A Guide to Leveraging AI for Business' - The A-Z of Generative AI โ Digital Book Kieran Gilmurray
All right. So you're joining us for another deep dive, and today we're tackling something pretty big.
Speaker 2:Oh yeah, this is a fascinating one.
Speaker 1:It really is. It's something that's been a cornerstone of AI like forever, but it's suddenly like super relevant again.
Speaker 2:I think that's what makes it so interesting. Right, it's this concept that's been around, debated for what like 75 years? The Turing test. But now, with these LLMs exploding onto the scene, it's like suddenly it's not just theoretical anymore.
Speaker 1:Exactly. It's like okay, we've been talking about this for decades, can machines think? Can we tell? But it was always kind of philosophical. Now it's like whoa hold on, these AIs are getting really good. So for anyone tuning in who might not be totally familiar, the Turing test. It was proposed by Alan Turing, brilliant guy, back in 1950. And the basic idea is deceptively simple, right.
Speaker 2:Yeah, you have a human. Who's the interrogator. Their job is to figure out through conversation who's the human and who's the machine. They're talking to two hidden entities, one of each and the machine. If it can fool the interrogator, it's said to have passed the test.
Speaker 1:Right and this whole thing. It's been hugely influential but also, like, really controversial.
Speaker 2:Oh, absolutely. People have been arguing about it forever, like what does it really measure? If a machine can fool us, does that mean it's truly intelligent or is it just really good at mimicking? And for a long time it felt like a thought experiment. But the thing is how we typically test AI. Now it's kind of limited, isn't it? It's like a very specific tasks, almost, like you know, isolated puzzles, no-transcript, but the Turing test it's like throwing them into a free-flowing conversation where anything goes.
Speaker 1:Right, so it's much more dynamic. The interrogator can ask anything and the AI has to keep up, be coherent and not reveal itself.
Speaker 2:And that's what we're diving into today. Can these super advanced LLMs, the ones everyone's talking about, can they actually pull this off? Can they pass a good old-fashioned three-party Turing test? So we've got some research here that looked into this and the results well, they might surprise you.
Speaker 1:They're pretty mind-blowing, to be honest.
Speaker 2:Yeah.
Speaker 1:So for this deep dive, we've got this research paper and basically what they did was they ran these randomized, controlled Turing tests, you know, really rigorous, and they had two different groups of people participating. One group was undergrads at UC San Diego and the other was a more diverse group recruited online through prolific.
Speaker 2:And, importantly, they didn't just test one AI. They tested four different systems. That's helpful, right Gives you a range of performance to compare.
Speaker 1:Yeah, definitely so. The big one, the one everyone's been talking about, was GPT 4.5. That was OpenAI's most advanced model at the time of the study back in 2023.
Speaker 2:And then they had LAMA 3.1. That's the powerful open source one developed by Meta that was released in 2024.
Speaker 1:Right, and then another one from OpenAI, but newer GPT-4, also from 2024. Oh, and they also threw in ELIZA.
Speaker 2:Oh, eliza Classic, that's going way back, that's a chatbot. But from 1966, it's rule-based. So much simpler.
Speaker 1:Yeah, it's almost like a little history lesson in AI, but they included it for a good reason, as a kind of baseline. Almost like a little history lesson in AI, but they included it for a good reason, right yeah, as a kind of baseline like a sanity check, right it's like okay, can people actually tell a simple chatbot apart from a real human?
Speaker 2:Because if everyone was mistaking Eliza for a person, it would mean something's off with the whole study.
Speaker 1:Exactly so. You've got these AI systems ready to be tested, but it wasn't just a straightforward go be human kind of test.
Speaker 2:Yeah.
Speaker 1:They use different prompts for these LLMs, which is basically like the instructions they're given Right, like what role they should play, what kind of information to use.
Speaker 2:So one type of prompt was called the no persona prompt, super basic. It pretty much just said hey, you're in a Turing test. Try to convince the person you're human.
Speaker 1:But even with that basic prompt they did give the AIs some helpful background info right.
Speaker 2:Yeah, like stuff that would be relevant to the people they were talking to. So for the prolific workers, maybe it mentioned their payment details, and for the UCSD students, maybe info about their courses or campus news, just to give the AI some common ground.
Speaker 1:Makes sense. But then they had this other way, more intriguing prompt, the persona prompt. What was that all about?
Speaker 2:So the persona prompt. That's where things get really interesting. Instead of being vague, it told the LLMs to take on a specific human-like persona. They even described it like a young kind of introverted person into internet culture, uses slang and they actually included the whole prompt in the paper. It's figure six in the prolific study.
Speaker 1:So it wasn't just be human, it was like be this specific kind of person. I'm curious how did they even choose that persona?
Speaker 2:Well, they didn't just pick it randomly. They did some earlier studies testing different prompts and AIs to see what worked best at appearing human. So it was strategic.
Speaker 1:Wow, so they were really trying to optimize for human likeness. Okay, so we've got these different AIs, these different prompts and they're put through the Turing test ringer.
Speaker 2:What happened? What were the results? Okay, so the key metric here is the win rate, meaning how often did the interrogator think the AI was actually the human? And the results for GPT 4.5 with that persona prompt. Well, they were pretty remarkable.
Speaker 1:It achieved an overall win rate of 73% 73%, so almost three quarters of the time people thought the AI was the real person.
Speaker 2:Yep and it was consistent across both groups 69% for the undergrads, 76% for prolific. Now the important thing here is it wasn't just luck. Statistically it was way above the 50% you'd expect if people were just guessing. But even more significant, it was way above the 50% you'd expect if people were just guessing. But even more significant, it was identified as human more often than the actual humans it was being compared to.
Speaker 1:So basically, in these short text, conversations with the right prompting GPT-4.5 could be more convincing than a real person.
Speaker 2:Exactly, and that's a big deal, Like it really shifts how we think about what AI can do. Ok, what about the other systems? Lama 3.1, also with Persona prompt. It did pretty well too. Overall win rate of 56 percent.
Speaker 1:Still more than half the time, but you said earlier it wasn't statistically significant across both groups.
Speaker 2:Right when you lumped them together. But if you just look at the prolific group Lama with the Persona, it had a 65% win rate and that was significantly above chance. So maybe that particular persona just resonated more with that group.
Speaker 1:Interesting. So different AIs, different personas, maybe different effectiveness, depending on who you're talking to. What about when they use those simpler, no persona props? Did the AI still do well?
Speaker 2:That's where we see a big drop, both GPT 4.5 and LAMA, without the persona. Their win rates went way down, especially with the undergrads, like around 27% for GPT and 26% for LAMA. So in that case they were actually more likely to be correctly identified as AI.
Speaker 1:So clearly the persona was doing a lot of heavy lifting there. What about for the prolific group?
Speaker 2:Better but still not great. Gpt 4.5 no persona got around 42% and LAMA no persona around 47%, still not statistically above that 50% chance level. So again the persona was making a huge difference.
Speaker 1:So it seems like a consistent pattern. That persona really helped. What about the baseline systems, GPT 4.0 without a persona? And good old ELISA.
Speaker 2:Okay, so GPT 4.0 no persona, that good old Eliza. Ok. So GPT-4o no persona. That got 21 percent and Eliza 23 percent, both significantly below chance for both groups, which for Eliza was expected right. It shows that people could generally tell that a very simple chatbot wasn't human.
Speaker 1:Yeah, makes sense. So basically the researchers were looking at whether these win rates were statistically different from just random guessing, and ELISA being so clearly below that kind of gives more weight to the findings with the more advanced models. But you know, this whole persona thing, it's really got me thinking. It seems like it was the key to the AI's success. But it raises some tricky questions, doesn't it? Like how much of their human likeness was really them versus the instructions they were given?
Speaker 2:Right, that's a really interesting point, Like where does the AI's own behavior end and the human's influence begin? It gets kind of philosophical. But honestly, even ELIZA as simple as it was, its ability to appear even remotely human it came from the humans who programmed it and with these LLMs, their ability to adapt based on prompts is part of what makes them so powerful and apparently so good at seeming human.
Speaker 1:Yeah, it's a blurry line. So, ok, we know the win rates, but what were people actually doing in these Turing tests, like, how are they trying to figure out who was who?
Speaker 2:Well, they analyzed the kinds of things people asked and the most common tactic used like 61 percent of the time was just small talk.
Speaker 1:Like how's your day going? Yeah, that kind of stuff.
Speaker 2:Exactly and about half the time. People also tried to probe social and emotional things like asking about opinions, feelings, humor, personal stuff.
Speaker 1:So trying to see if the AI would slip up reveal it doesn't have real emotions.
Speaker 2:Right, but here's the thing While those were the most common strategies, they weren't actually that effective.
Speaker 1:Really. So asking everyday things wasn't the best way to spot the AI. What worked better?
Speaker 2:The things that were most accurate. They were used less often but they were more targeted, like saying something totally out of the blue to see how the AI would react, or using those jailbreak prompts, you know, those questions specifically designed to kind of trick the AI into revealing its programming.
Speaker 1:So being a little more adversarial actually worked better. Interesting and what about asking about things humans are supposed to be good at, like emotions or experiences? Did that?
Speaker 2:help. That was the least effective, which is kind of counterintuitive, right, you'd think that's where the AI would struggle.
Speaker 1:Yeah, you'd think so. Okay, so we know what people tried, but what about the reasons they gave for their decisions, like, why did they think someone was human or not?
Speaker 2:Okay, so the most common reason people gave about 27% of the time was linguistic style, like noticing typos, or if someone used phrasing that sounded more human.
Speaker 1:So picking up on those subtle cues in the writing.
Speaker 2:Yeah, and then interactional dynamics that was about 23%, like whether the conversation flowed naturally or someone seemed evasive.
Speaker 1:Interesting, so it wasn't always about some deep analysis. Sometimes it was just the vibe of the conversation.
Speaker 2:Exactly, and a lot of people also just said they had a gut feeling. But here's the really interesting part the reasons that were best at predicting accuracy. They were about how people handled questions and also instances where someone seemed to lack knowledge.
Speaker 1:Wait, so seeming less knowledgeable made someone seem more human. That's kind of ironic, it is right.
Speaker 2:But think about it. These LLMs are trained on massive data sets. They know a ton, so sometimes to be convincing, they might need to pretend they don't know something.
Speaker 1:Wow, that's fascinating. So putting it all together for our listeners, does this study mean that LLMs can now pass the Turing test like officially?
Speaker 2:Well, based on the idea that if we can't tell the difference, then for all intents and purposes they're the same. In that sense, yes, the study suggests that GPT 4.5 and LAMA 3.1, when they're using those specific persona prompts, they pass.
Speaker 1:And they used Eliza's performance as evidence that their test could actually detect when something was clearly not human.
Speaker 2:Right and the fact that GPT 4.5 with the persona actually did better than humans at being seen as human. That adds even more weight.
Speaker 1:So it's not just about reaching human level. They actually surpassed it in some cases. That has to have some pretty big implications, right? It's not just a theoretical game anymore.
Speaker 2:Oh, absolutely. The fact that these LLMs can now blend in so well in short conversations at least, it has huge consequences. Think about things like customer service, online content moderation, even some forms of communication those could be automated potentially displacing jobs.
Speaker 1:And the paper talks about this idea of counterfeit people, which is kind of creepy when you think about it.
Speaker 2:Yeah, it was Dennett who coined that term, and the concern is, as these AIs get better, we might end up spending more time with them, less with real people.
Speaker 1:It'd be almost a devaluation of real human connection.
Speaker 2:Right, and who controls these AIs? That's another huge question. But the study also looked at whether people could actually detect the deception, and it turns out it's not so easy.
Speaker 1:So even if you're tech savvy, you might not be able to tell.
Speaker 2:Exactly. They found no link between people's background with AI and how well they did on the test, but using those jailbreak strategies that did seem more effective.
Speaker 1:So maybe it's not about general tech knowledge but more about knowing the right tricks. But the researchers also had this interesting take on what the Turing test really measures now not just intelligence in general.
Speaker 2:Right. They argue that in this landscape it's less about raw intelligence and more about human likeness. Like machines are already really good at logic and calculations, but what makes us human might be more about social intelligence, things like communication, empathy, navigating social situations and the way they design this Turing test. It's important because it forces a direct comparison.
Speaker 1:The AI isn't just judged on its own merits. It has to be more convincing than a real person.
Speaker 2:Exactly. And the paper ends with this quote from Brian Christian, which I love. He says the Turing test might not be the end, but maybe it pushes us to be more human than ever.
Speaker 1:So, as AIs get better at mimicking us, maybe it forces us to define what makes us unique.
Speaker 2:It's a beautiful thought. So to wrap this up for everyone listening, the key takeaway here is huge. For the first time, an AI, specifically GPT 4.5 with that persona prompt, was able to not just perform as well as a human in a Turing test, but to actually outperform them, fooling people more often.
Speaker 1:And this raises a whole bunch of crucial questions. What does it mean for an AI to be this convincing? What does it tell us about intelligence and what it means to be human? And, importantly, what are the consequences for society as these AIs get more sophisticated and integrated into our lives?
Speaker 2:This research gives you a lot to think about. Right, it affects how we interact online, the future of work, even our relationships.
Speaker 1:It really highlights how fast AI is advancing and the challenges that come with it.
Speaker 2:And it leaves you with this final thought-provoking question Now that we know AI can mimic us so well, what does it truly mean to be human? How does this change how you see yourself and what you value? That's something worth pondering, I think. Thanks for being with us for this deep dive.
Speaker 1:Really appreciate you joining us.