Disinformation Security: Making LLMs Safer by Lying to Them
How Can We Stop Large Language Models from Being Used for Biological Terrorism?
Large language models (LLMs) show a remarkable ability to reason across a diverse set of tasks. From language translation to code generation, these models can produce text that rivals the work of subject matter experts. Unfortunately, because their training corpus includes a large portion of all publicly available text on the internet, their understanding and expertise extend to dangerous topics like bioweapon production. By breaking complex protocols down into actionable steps, listing the equipment that must be purchased and where it can be sourced, and troubleshooting problems as they arise, LLMs can enable unskilled individuals with limited budgets to produce dangerous pathogens using techniques from synthetic biology.
To stop LLMs from providing instruction on dangerous topics, two broad approaches have been proposed: safety training and corpus censorship.
Safety training is a widely used technique in which developers train LLMs to refuse to answer questions that they evaluate as dangerous. However, clever prompting strategies and direct manipulation of model weights can easily bypass such safety training, a process known as jailbreaking.
Corpus censorship, which is less popular as an explicit technique, omits dangerous information from the training data. However, this technique, too, is easily bypassed by data supplementation, a strategy in which dangerous information is supplied to the model either by fine-tuning its weights after release or by including it directly in the model's inputs.
Additionally, both safety training and corpus censorship require large amounts of labor to apply accurately and comprehensively, and both limit model usefulness for benign use cases. Clearly, better solutions are needed if the benefits LLMs offer to bad actors are to be mitigated without penalizing normal users.
Purposefully injecting disinformation into LLMs can make them safer
To complement the techniques mentioned above, I propose injecting disinformation about dangerous topics into LLMs in order to make them less effective at enabling bioterrorism. Introducing mistakes into viral synthesis protocols, or creating confusion around vulnerabilities in global biodefense, could greatly reduce the danger that LLMs pose. In fact, if the introduced errors encourage bioterrorists to engage with systems and services in ways that are likely to attract the attention of the authorities, use of LLMs could actually reduce the risk of a bioterror plot succeeding.
The greatest benefit of such disinformation would be its potential to persist past jailbreaking and data supplementation. Because the disinformation would be stably incorporated into a model, it would be returned whenever the related context is invoked. Thus, even when users train the model to discuss dangerous topics or supply extra information about them, the model's responses will still contain the integrated disinformation.
Read the full project submission here.