Meta’s AI ‘safeguards’ are an elaborate fiction
Meta cannot prevent misuse, despite what it might pretend
Earlier this month, Reuters reported that the Chinese military has used Meta’s Llama models to develop AI tools. Meta’s response was swift and predictable: public policy director Molly Montgomery assured Reuters that “any use of our models by the People’s Liberation Army is unauthorised and contrary to our acceptable use policy”. And in a clear attempt to distract from the issue, a few days later Meta announced that it would now allow the US military to use its AI models.
There’s just one small problem: both of these statements are meaningless. Meta’s acceptable use policy, like all of its so-called “safeguards”, has no teeth. Because of its decision to openly release its models’ weights, Meta cannot prevent anyone from using — or misusing — its models. Yet rather than acknowledge this obvious truth, Meta continues to spin an elaborate fiction about its ability to control how its models are used — deflecting responsibility for decisions it has itself made.
The lack of safeguards may come as a surprise, given Meta makes such a big deal of its “responsible” approach to AI. It boasts of “integrating a safety layer to facilitate adoption and deployment of the best practices”, “technical guardrails that limit how the model will respond to certain queries”, and “guidelines barring certain illegal and harmful uses”. The company’s executives tout these measures in Congressional testimony and to the press, constantly positioning Meta as a responsible steward of powerful technology.
Yet the lauded “safety layer” is an optional download that users have to install manually. The models’ “guardrails” can be removed in minutes. And the “guidelines” have not stopped people from using Meta’s AI models for a wide range of illegal and harmful activities. Taken together, the safeguards amount to little more than a strongly worded statement.
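To see how flimsy that first safeguard is, consider what the “safety layer” actually is: a second, separate model (Llama Guard) that a developer has to download and wire in themselves. The sketch below is a minimal illustration, assuming the Hugging Face transformers library; the model IDs are illustrative stand-ins, and both repositories are gated downloads that require accepting Meta’s licence.

```python
# A minimal sketch, assuming the Hugging Face `transformers` library.
# The model IDs are illustrative; both repositories are gated and require
# accepting Meta's licence before download.
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "meta-llama/Llama-2-7b-chat-hf"   # the model Meta releases
GUARD_MODEL = "meta-llama/LlamaGuard-7b"       # the optional "safety layer"

# Step 1: generate with the base model alone. No guardrail is involved here.
tok = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
inputs = tok("...", return_tensors="pt")       # any prompt at all
output = model.generate(**inputs, max_new_tokens=200)
print(tok.decode(output[0], skip_special_tokens=True))

# Step 2, entirely optional: load Llama Guard as a second model and ask it to
# classify the exchange. Skipping this step, or never downloading GUARD_MODEL
# at all, is a choice Meta has no way to observe, let alone prevent.
```

In other words, the safety layer only exists in deployments whose developers opt in to it; anyone who would rather not filter their model’s output simply leaves those lines out.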
Getting Llama to act maliciously is trivial. Removing Llama 2’s safety training costs just $200, according to one study. Llama 3 — supposedly a safer model — can be jailbroken in minutes. And if that sounds too technical, users can simply download pre-jailbroken versions of these models from Hugging Face.
The consequences are precisely what you’d expect. A recent paper from researchers at Indiana University found that the dark web is full of AI hacking tools, many of which are powered by jailbroken versions of Llama 2, which, the researchers say, has likely “been exploited … to facilitate malicious code generation”. Cybersecurity firm CrowdStrike, meanwhile, suspects Llama 2 was used to hack a “North American financial services” firm last year. And another security firm found a ransomware hacker who said they used Llama not to carry out a hack, but to help sift through the sensitive data stolen in one.
Meta is well aware of these risks. Its own research found that Llama 3 can “automate moderately persuasive multi-turn spear-phishing attacks” and showed “susceptibility to complying with clearly malicious prompts” up to 26% of the time. But there’s no need to worry, the authors say, because “we expect that guardrails will be added as a best practice in all deployments of Llama 3”. It’s an astonishing example of wishful thinking.
Though Llama’s open weights do make it easier to jailbreak, it is not uniquely susceptible to this kind of malicious use. ChatGPT, Claude and Gemini can all be jailbroken to engage in malicious activity. But there are two key differences. Closed-weight models can be patched to block new jailbreaking methods, forcing hackers to keep coming up with new techniques. More importantly, OpenAI, Anthropic and Google can revoke a bad actor’s access to their models simply by banning them from their APIs. Meta, meanwhile, is completely unable to stop anyone from misusing its models. As Senators Richard Blumenthal and Josh Hawley wrote in a letter to Mark Zuckerberg: “AI models like LLaMA, once released to the public, will always be available to bad actors”.
I asked Meta how it enforces the Llama terms of use, and whether it could share any information on how often it has done so. Meta did not respond to my inquiry, and I could not find a single public case of Meta enforcing its policies.
None of this is to argue that releasing model weights is inherently wrong. There are compelling arguments for open-source AI development, from accelerating research to democratising access. The benefits might well outweigh the costs.
But Meta’s persistent attempts to obscure the implications of its choices do everyone a disservice. When you release model weights, you are creating a free-for-all. No amount of corporate doublespeak can change that reality. Meta’s “guardrails” and “policies” are simply theatre, designed to provide political cover rather than meaningful protection. If only Meta would own up to that.