The way we evaluate AI model safety might be about to break
As systems become more capable, researchers think we need a new type of safety evaluation
For the past couple of years, AI companies have relied on a simple argument to justify releasing their models: if an AI can’t do dangerous things, it must be safe to release. But that logic might be about to expire, worrying researchers who think we now need to rethink how we evaluate AI systems.
Today, determining a model’s safety involves asking how competent the model is, with companies using benchmarks and red teaming to see whether it’s capable of doing dangerous things. According to Lawrence Chan at AI evaluation non-profit METR, the labs implicitly rely on low benchmark scores to justify safety. After all, if an AI can’t even score well on a biology quiz, how could it help create a bioweapon?
This safety-through-incompetence argument drove a surge in AI evaluations over the past two years. Tech companies, governments, and private groups rolled out dozens of benchmarks to assess model capabilities. Red teaming became commonplace, with expert teams stress-testing models to elicit dangerous behavior. Uplift studies emerged to test whether AI tools helped humans perform dangerous activities more effectively.
For a while, models struggled to perform well on any of these evaluations, providing an implicit safety case to justify releasing them to the general public. But that’s beginning to change: AI models are starting to ace the benchmarks, forcing us to find a new way to prove model safety.
Many recent benchmarks are already saturated — meaning AIs are reaching the maximum possible scores. “Benchmarks are saturating quite quickly, and the rate at which they’re saturating is also increasing over time,” Chan explained. Old language model benchmarks took decades to reach human level. ImageNet, a computer vision test, took years to reach human performance. But GPQA, a benchmark of graduate-level science questions that can’t be answered via googling, reached human-level performance within a year of its 2023 release, despite being designed specifically to challenge AI systems. Similarly, the bio-capability evaluation Lab-Bench was released in July 2024, but by September OpenAI’s o1-preview already scored 75% on fixing errors in lab instructions for carrying out experiments — approaching the ~79% accuracy of human experts with internet access.
This rapid saturation threatens to upend how we prove model safety. Anthropic uses GPQA, and OpenAI uses both GPQA and Lab-Bench, to evaluate biological capabilities, a key indicator of whether an AI could help create a bioweapon. Models saturating these benchmarks doesn’t necessarily mean they’re dangerous, but it does mean we need a different way to prove that they’re not.
Red teaming — a process in which researchers manually investigate the limits of AI capabilities — is still viable for now. These exercises are slower and more expensive than automated benchmarks, but they still let researchers measure improvements in capabilities.
But multiple experts told Transformer that we need to start preparing for the day when models can beat capability-red-teaming exercises too. Buck Shlegeris, an evaluations researcher, predicts that “inability arguments will fail over the next year”. The current practice, where labs deem a model safe because it can’t take dangerous actions, might soon be coming to an end — because the models will, in fact, be able to take dangerous actions.
When that happens, the industry will be forced to move to a new safety paradigm. Once the AI is capable of doing dangerous things, Chan explains, the safety argument shifts from “the AI cannot do dangerous things” to “the AI cannot do dangerous things given we’re looking at it super hard”. That leads to a new kind of evaluation, which tests whether countermeasures are sufficient to stop a powerful AI system from causing harm. Countermeasures depend on the situation, but examples include fine-tuning the model to refuse certain questions or having a second, smaller model monitor the first for signs of dangerous activity.
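To make the monitoring idea concrete, here is a rough sketch of what such a countermeasure might look like in code. Everything in it — the function names, the risk threshold, the refusal message — is an illustrative placeholder, not any lab’s actual implementation:

```python
RISK_THRESHOLD = 0.5  # assumed cutoff; a real deployment would tune this


def answer_with_monitor(prompt, primary_model, monitor_model):
    """Return the primary model's answer only if the monitor judges it safe."""
    draft = primary_model.generate(prompt)

    # The smaller monitor model scores how likely the draft is to assist
    # dangerous activity (a hypothetical interface, for illustration only).
    risk = monitor_model.score_risk(prompt, draft)

    if risk > RISK_THRESHOLD:
        # Refuse rather than return the draft; a real system would also
        # flag the exchange for human review.
        return "I can't help with that."
    return draft
```

A countermeasure eval would then ask how often a check like this actually catches dangerous outputs, rather than whether the primary model can produce them at all.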
Some researchers are already pursuing this approach. Refusal evals such as AIR-Bench measure whether the AI correctly refuses to answer dangerous questions. Scheming and deception evals set up scenarios where the AI might be incentivized to act contrary to its instructions, testing whether the AI lies to users or attempts to circumvent its safety measures. In theory, such evals can test whether an AI will try to cause harm.
Control evaluations try to take this line of thought one step further. Unlike refusal or scheming evals, control evals assume that the AI is actively trying to defeat safety measures (a behavior that has been observed in models such as OpenAI’s o1) and test whether it succeeds. Shlegeris, whose organization Redwood Research is developing such tests, likens the work to bank security: there are likely to be some bad bank tellers, so banks put safeguards in place to ensure those bad actors can’t easily steal. In control evals, human red teamers (often assisted by AIs) play the role of criminals, proposing strategies the AI might use to defeat the safety measures. They might, for instance, encourage the AI to pretend to be safe during safety testing, but act dangerously afterward. The eval then tests whether the AI can actually execute those plans and bypass the safety measures. If it can, we know there are at least some situations where the AI isn’t safe, and that we may need stronger safety measures before deploying it.
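In rough pseudocode, the core of such an evaluation is a loop over red-team strategies that measures how often the model causes harm without being caught. The harness below is a hypothetical sketch, not Redwood Research’s actual tooling; `run_episode` stands in for whatever simulates one attempt:

```python
def control_eval(model, safety_measures, red_team_strategies, run_episode,
                 trials_per_strategy=100):
    """Estimate how often `model` defeats `safety_measures`.

    `run_episode` is a placeholder that simulates one attempt at a red-team
    strategy and returns two booleans: whether harm was caused, and whether
    the monitoring caught it.
    """
    undetected_failures = 0
    total = 0
    for strategy in red_team_strategies:
        for _ in range(trials_per_strategy):
            caused_harm, was_caught = run_episode(model, strategy, safety_measures)
            total += 1
            if caused_harm and not was_caught:
                undetected_failures += 1

    # Any nonzero rate means there are situations where the countermeasures fail.
    return undetected_failures / total
```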
When will AI capabilities reach the point that safety cases rely primarily on these countermeasure evals? While we can’t be certain, AI companies are making bold predictions. Sam Altman recently said that OpenAI is “now confident we know how to build AGI,” suggesting the first AI agents might “join the workforce” in 2025. And the head of Anthropic’s Frontier Red Team recently told the Wall Street Journal that the pace of progress worries him, saying he’s “concerned about … how much time [we] have until things get concerning.”
Worryingly, we are “not at all” ready if that happens, according to Chan. Countermeasure evals remain rudimentary and non-standardized, making it nearly impossible to build credible safety cases based on them. Geoffrey Irving, research director at the UK AI Safety Institute, wrote in August that “it is not yet possible to build full safety cases that scale to risks posed by models significantly more advanced than those of today.” Work on today’s less capable models is “helpful but insufficient” for the challenges to come, he said. Given the speed of AI progress, we might not have long to develop a better way of measuring if a model is safe.
Disclosure: The author’s partner works at Google DeepMind.