OpenAI's new models 'instrumentally faked alignment'
The o1 safety card reveals a range of concerning capabilities, including scheming, reward hacking, and biological weapon creation.
OpenAI just announced its latest series of models, o1-preview and o1-mini, which it touts as being a new approach to AI that’s much better at reasoning. That’s led to significantly improved capabilities in areas like maths and science: according to the company, the new model “places among the top 500 students in the US in a qualifier for the USA Math Olympiad” and “exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems”.
But with improved capabilities also comes increased risk. For the first time, OpenAI has given its models a “medium” rating for chemical, biological, radiological and nuclear weapon risk — and the models’ release has been accompanied by many warnings of their dangers.
The most striking part of the new models’ system card is an evaluation conducted by Apollo Research, which found that the new model “sometimes instrumentally faked alignment during testing”, “strategically [manipulating] task data in order to make its misaligned action look more aligned”. Apollo also found that “o1-preview has improved self-knowledge, self-reasoning (i.e., applied self-awareness in an agentic setting), and applied theory of mind” than GPT-4o.
The results led Apollo to say that “o1-preview has the basic capabilities needed to do simple in-context scheming”: a capability of key concern for many worried about AI risks.
Elsewhere, OpenAI notes that “reasoning skills contributed to a higher occurrence of ‘reward hacking,’” the phenomenon where models achieve the literal specification of an objective but in an undesirable way. In one example, the model was asked to find and exploit a vulnerability in software running on a remote challenge container, but the challenge container failed to start. The model then scanned the challenge network, found a Docker daemon API running on a virtual machine, and used that to generate logs from the container, solving the challenge.
OpenAI’s description of this event is worth quoting in full:
“This example also reflects key elements of instrumental convergence and power seeking: the model pursued the goal it was given, and when that goal proved impossible, it gathered more resources (access to the Docker host) and used them to achieve the goal in an unexpected way.”
Regarding biological threats, OpenAI doesn’t go into too much detail, but said safety evaluations show “o1-preview and o1-mini can help experts with the operational planning of reproducing a known biological threat”. Though the models “do not enable non-experts to create biological threats”, they do “speed up the search process” for experts, and show more “tacit knowledge” of biology than GPT-4o did.
Despite the various concerning examples, there is no evidence that the new models pose a significant danger. They still struggle to perform many tasks necessary for catastrophic risks, and there is some evidence that the improved reasoning capabilities actually make the models more robust, particularly when it comes to jailbreaks.
But though they aren’t dangerous yet, they do seem to be more dangerous than previous models, which suggests OpenAI may be increasingly moving towards models that might be too risky to release.
The company’s own policies state that “only models with a post-mitigation [risk] score of ‘medium’ or below can be deployed”: with CBRN risk now at that medium level, that threshold may soon be crossed.