The Turing Tests of today are mistaken

How Goodhart's law holds back AI

Companies like OpenAI try to show that AIs are intelligent by hyping their high scores in behavioural tests – an approach with roots in the Turing Test. But there are hard limits to what we can infer about intelligence by observing behaviour. To demonstrate intelligence, argues Raphaël Millière, we must stop chasing high scores and start uncovering the mechanisms underlying AI systems’ behaviour.


Public discourse on artificial intelligence is divided by a widening chasm. While sceptics dismiss current AI systems as mere parlour tricks, devoid of genuine cognitive sophistication, evangelists and doomsayers view them as significant milestones on the path toward superhuman intelligence, laden with utopian potential or catastrophic risk.

This divide is symptomatic of a deeper methodological disagreement: there is no consensus among experts on how to adequately evaluate the capacities of AI systems. Researchers tend to rely on behavioural tests to evaluate AI systems’ capacities, but this methodology is flawed, and at best only partial fixes are available. In order to properly assess which capacities AI systems have, we must supplement improved behavioural tests with investigation of the causal mechanisms underlying AI behaviour.

The challenge of assessing AI systems’ intelligence invariably conjures the Turing Test. Alan Turing’s "imitation game" involves a human interrogator communicating by teleprinter with a computer and another human, and attempting to determine which is which based solely on their responses. The computer's objective is to cause the interrogator to incorrectly identify it as the human. Turing predicted that by 2000, computers would be able to play this game well enough that the average interrogator would have no better than a 70% chance of correctly identifying the machine after five minutes of questioning.


In the age of large language models, like those that power OpenAI’s ChatGPT, the Turing Test may seem quaint. In fact, a recent paper found that GPT-4 fooled human interrogators in 41% of trials – clearing the 30% rate implied by Turing’s prediction, albeit with a two-decade delay. But even if future language models pass the test with flying colours, what should we make of this? It is doubtful that Turing himself intended his test to set strictly necessary and sufficient conditions for intelligence. As philosopher Ned Block emphasized, a system could in principle pass the test through brute force, answering every question by retrieving memorized answers stored in a giant look-up table. This suggests that the Turing Test provides at best defeasible evidence of intelligence.
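Block's look-up-table scenario can be made concrete in a few lines. The sketch below is purely illustrative (the questions and canned answers are invented): the "chatbot" answers by retrieval alone, computing nothing that resembles understanding, yet its behaviour on stored questions is indistinguishable from a thoughtful interlocutor's.

```python
# A toy version of Ned Block's thought experiment: a "chatbot" that answers
# purely by looking up memorized responses in a table. Entries are invented.
canned = {
    "what is your name?": "I'm Alex, nice to meet you.",
    "do you like poetry?": "Very much, especially sonnets.",
}

def lookup_bot(question):
    # Normalize the question, then retrieve a stored answer if one exists.
    return canned.get(question.lower().strip(), "Interesting question! Tell me more.")
```

A real look-up table covering every possible five-minute conversation would be astronomically large, which is why Block's point is conceptual rather than practical: passing the test is compatible, in principle, with zero intelligence.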

AI research has largely moved on from the Turing Test as a holistic assessment of intelligence or cognition. It has not, however, moved on from behavioural evaluations as a whole. These days, AI systems like language models are routinely evaluated through benchmarks – standardized tests designed to assess specific capabilities, often by comparison with human baselines. Unlike the Turing Test, they provide a quantitative assessment of how AI systems perform on various tasks, facilitating a direct comparison of their abilities in a controlled and systematic way. They also avoid the one-size-fits-all approach to evaluation, allowing researchers to test different capacities separately.

For example, a classic benchmark introduced in 2015 is based on the Stanford Natural Language Inference (SNLI) corpus, a large collection of pairs of English sentences manually labelled to indicate whether they entail each other, contradict each other, or are neutral with respect to each other. The corresponding benchmark consists in assigning the correct label to sentence pairs. The assumption is that achieving human-like performance on this benchmark is evidence of grasping inference relations between sentences.
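To make the task concrete, here is a minimal sketch of how an SNLI-style benchmark is scored. The sentence pairs and the stand-in "model" are invented for illustration; they are not items from the actual corpus, and a real evaluation would use a trained classifier over thousands of pairs.

```python
# Invented sentence pairs in the SNLI format: (premise, hypothesis, gold label).
examples = [
    ("A man is playing a guitar.", "A person is making music.", "entailment"),
    ("A man is playing a guitar.", "The man is asleep.", "contradiction"),
    ("A man is playing a guitar.", "The man is on a stage.", "neutral"),
]

def evaluate(predict, examples):
    """Return the fraction of pairs the model labels correctly."""
    correct = sum(predict(p, h) == gold for p, h, gold in examples)
    return correct / len(examples)

# A stand-in "model" that always guesses one class, as a trivial baseline.
always_entailment = lambda premise, hypothesis: "entailment"
```

On these three examples the constant guesser scores 1/3, i.e. chance level for a three-way labelling task; the benchmark's interest lies in how far above chance a system can climb.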

In theory, benchmarks should allow for rigorous and piecemeal evaluations of AI systems, helping foster broad consensus about their abilities. But in practice benchmarks face major challenges, which only get worse as AI systems progress. High scores on benchmarks do not always translate to good real-world performance in the target domain. This means benchmarks may fail to provide reliable evidence of what they are supposed to measure, which drives further division about how impressed we should be with current AI systems.


A core symptom of this failure is the phenomenon known as benchmark saturation. New benchmarks tend to get ‘solved’ at an increasingly fast pace. AI systems achieve excellent scores – comparable or superior to human baselines – within mere months or weeks of a given benchmark's creation. This invites scepticism, particularly for benchmarks designed to be hard for existing AI systems, because we generally expect progress on challenging goals to be incremental. Is it more likely that new and improved systems suddenly leapfrog the capabilities of their predecessors, or that there is something fishy with the tasks they’re being set? When language models excel on specific benchmarks yet stumble on real-world examples, scepticism about the usefulness of those benchmarks only deepens.

There are several explanations for observed discrepancies between benchmark results and actual performance. One culprit is data contamination. This occurs when the examples the system is supposed to be tested on – including benchmark items and their solutions – leak into the data the system is trained on. AI systems such as large language models learn from an enormous amount of data scraped from the internet. This training corpus is so large that it is increasingly difficult to avoid training these models on the very tests we want to use to evaluate them after training. This is the equivalent of letting students memorize test answers ahead of an exam. Detecting contamination is not always straightforward even for well-intentioned researchers. When OpenAI released GPT-4, it reported its performance on problems from the competitive programming contest Codeforces. However, it turns out that the model performs significantly worse than reported on old or recent Codeforces problems that probably did not leak into its training data. Results on benchmarks should thus be interpreted with caution when contamination cannot be ruled out.
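One common first-pass heuristic for spotting contamination is checking whether word n-grams from a test item also appear in the training corpus. The sketch below is a simplified, hypothetical version of that idea; real detection is much harder, since leaks can be paraphrased, translated, or partial.

```python
# Simplified contamination check: flag a test item if any of its word n-grams
# also appears verbatim in the training corpus. A heuristic only; paraphrased
# or partially leaked items would slip through.
def ngrams(text, n=5):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(test_item, training_corpus, n=5):
    train_grams = set()
    for doc in training_corpus:
        train_grams |= ngrams(doc, n)
    return bool(ngrams(test_item, n) & train_grams)

corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
```

Here `looks_contaminated("quick brown fox jumps over the lazy dog", corpus)` fires because a five-word sequence is shared, while an unrelated sentence passes cleanly. At the scale of web-sized training corpora, even this crude check becomes an engineering challenge in its own right.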


Benchmarks can also be gamed in more insidious ways, sanctioned by institutional practices. Chief among these is “SOTA (state-of-the-art) chasing” on benchmark leaderboards, where research and industry groups compete for top scores. Prestigious conferences fuel this competition by rewarding contributions that claim top results on popular benchmarks. This promotes the wrong incentives: benchmarks are useful for standardized comparison, but their scores should be treated as means, not ends.

When researchers forget this, they optimise their AI systems for better benchmark scores. This can lead to unintended gaming: AI systems may find shortcuts that improve benchmark performance without improving the underlying competence. Benchmarks are meant to use scoring metrics as proxies for real-world abilities, but quantitative measures tend to lose their value as proxies when researchers aim directly at them. This is an example of Goodhart's Law: when a measure becomes a target, it often ceases to be a good measure.

The case of the SNLI benchmark, described above, is particularly revealing. After combing through the corpus, researchers found that for a significant portion of sentence pairs the nature of the relation between the two sentences (entailment, contradiction, or neutrality) could be predicted by looking at only one of them! This is because the human crowd workers who created the dataset unwittingly introduced spurious correlations that had nothing to do with the task; for example, the presence of negation (‘not’) in sentences was highly correlated with contradiction. Current AI systems are much better than humans at picking up on such correlations and can thus learn "shortcuts" to beat benchmarks for the wrong reasons.
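The shortcut is easy to reproduce in miniature. The sketch below uses invented sentence pairs constructed to mimic the annotation bias described above: a "hypothesis-only" classifier that never reads the premise still labels every pair correctly, because negation perfectly predicts contradiction in this toy data.

```python
# Toy illustration of shortcut learning: a classifier that ignores the premise
# entirely, exploiting the correlation between "not" and the contradiction
# label. The data is invented to exhibit the bias, not drawn from SNLI.
biased_data = [
    ("A dog runs in a field.", "The dog is not moving.", "contradiction"),
    ("A woman reads a book.", "The woman is not reading.", "contradiction"),
    ("A child eats an apple.", "A child eats fruit.", "entailment"),
    ("A man rides a bike.", "A person is cycling.", "entailment"),
]

def hypothesis_only(premise, hypothesis):
    # A pure shortcut: never looks at the premise at all.
    return "contradiction" if "not" in hypothesis.lower().split() else "entailment"

accuracy = sum(hypothesis_only(p, h) == gold
               for p, h, gold in biased_data) / len(biased_data)
```

A model scoring 100% this way has learned nothing about inference between sentences, which is precisely why high benchmark scores can mislead.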

This points to a broader concern about what benchmarks are really supposed to measure. A well-designed test should measure some particular skill or capacity, and good test performance should generalize to relevant real-world situations. However, common benchmarks used in AI research explicitly target nebulous capacities, such as “understanding” and “reasoning”. These constructs are abstract, multifaceted, and implicitly defined with reference to human psychology. But we cannot uncritically assume that a test designed for humans can be straightforwardly adapted to evaluate language models and remain valid as an assessment of the same capacity. Humans and machines may achieve similar performance on a task through very different means, and benchmark scores alone do not tell that story.


We have now touched on a deeper conundrum that brings us back to the Turing Test: how much can we infer about how a system works and what it’s capable of merely by observing its behaviour in a limited set of circumstances? In principle, a machine could pass a five-minute Turing Test through brute force memorization; in practice, a language model can achieve superhuman performance on SNLI and other benchmarks by relying on shortcuts. In both cases, the performance of the system comes apart from the competence we wanted to assess. Behavioural tests can provide tentative support for hypotheses about the competencies that may underlie observed performance; but we cannot take these hypotheses for granted without further checks. It’s no wonder, then, that opinions on current AI systems are so profoundly divided. Some take their remarkable performance on various tasks as face-value evidence that they exhibit "sparks of general intelligence"; others dismiss it based on concerns about the reliability of the behavioural tests.

Can we have any hope of arbitrating disagreements about AI systems’ capacities? We can, but it takes time, effort, and goodwill. While there are no perfect solutions to gaming and data contamination, there are concrete steps one can take to address them. It starts with importing best practices from cognitive science into behavioural evaluations of AI. In particular, researchers must examine the background theoretical assumptions that drive design decisions and justify the link between proxy measures and real-world abilities. We must ensure that the material on which AI systems are tested differs in the relevant ways from the data they’re trained on, and we must understand the ways in which they differ. It may be preferable not to release tests publicly, to prevent data contamination. Some benchmarks already adopt this strategy. For example, François Chollet's Abstraction and Reasoning Corpus (ARC) incorporates a private test set that evaluated systems cannot be trained on.


These recommendations may increase the value of benchmarks, but they cannot lift the limits of behavioural evaluations altogether. To settle disputes about how systems like language models achieve their performance on various tasks, and whether this involves something more sophisticated than mindless memorization and shallow pattern matching, we must look beyond behaviour. We need to understand how they process information internally and uncover the causal mechanisms that explain their successes and failures on tasks we care about. This project is under way. Researchers have been developing novel techniques, partially inspired by neuroscience, to systematically intervene on internal components of AI systems and assess the causal effect of such interventions on their behaviour. This has already allowed them to identify key mechanisms in small models, although scaling up this approach to behemoths like GPT-4 remains a formidable challenge.
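The logic of such interventions can be sketched in miniature. The "model" below is a tiny hand-made network with made-up weights, not a real language model; the point is the method, in the spirit of ablation studies: zero out one internal unit, re-run the system, and measure how much the output changes. That difference is evidence about the unit's causal contribution to behaviour.

```python
# Minimal sketch of an ablation-style causal intervention. The network and
# its weights are invented for illustration.
def tiny_model(x, ablate_unit=None):
    # Hidden layer: two ReLU units with fixed, made-up weights.
    hidden = [max(0.0, 2.0 * x), max(0.0, -1.0 * x + 3.0)]
    if ablate_unit is not None:
        hidden[ablate_unit] = 0.0  # the intervention: silence one unit
    # Fixed linear readout over the hidden units.
    return 1.0 * hidden[0] + 0.5 * hidden[1]

x = 2.0
baseline_out = tiny_model(x)                # behaviour with no intervention
effect_0 = baseline_out - tiny_model(x, 0)  # causal effect of unit 0 on output
effect_1 = baseline_out - tiny_model(x, 1)  # causal effect of unit 1 on output
```

Here unit 0 carries most of the causal weight for this input. In real interpretability work the same move is made on transformer components (attention heads, neurons, residual-stream directions) across many inputs, which is where the scaling challenge bites.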

Engineers interested in using AI systems for practical purposes may be happy to settle for well-designed behavioural evaluations if they can find effective solutions to the concerns about gaming and contamination. But researchers interested in debates about the kinds of competence we can meaningfully ascribe to AI systems in various domains – and how they compare to human cognition, if at all – ought to supplement behavioural approaches with causal interventions. We must chip away at the challenge from both ends, investigating behaviour from the top down and causal mechanisms from the bottom up, with a healthy dose of theory to bridge them. In due course, this line of research is likely to dissolve the false dichotomy between hard-line scepticism and unbridled speculation about the capacities of AI systems like language models. There is a fertile middle ground ripe for exploration, anchored in rigorous and hypothesis-driven experiments.
