Disinformation Security: Making LLMs Safer by Lying to Them
How Can We Stop Large Language Models from Being Used for Biological Terrorism?
Large language models (LLMs) show remarkable reasoning ability across a diverse set of tasks. From language translation to code generation, these models can produce output that rivals that of subject matter experts. Unfortunately, because their training corpus includes a large portion of all publicly available text on the internet, their understanding and expertise extend to dangerous topics like bioweapon production. By breaking down complex protocols into actionable steps, listing the equipment that must be purchased and where it can be sourced, and troubleshooting problems as they arise, LLMs can enable unskilled individuals with limited budgets to produce dangerous pathogens using techniques from synthetic biology.
To stop LLMs from providing instruction on dangerous topics, two broad approaches have been proposed: safety training and corpus censorship.
Safety training is a widely used technique in which developers train LLMs to refuse to answer questions that they evaluate as dangerous. However, clever prompting strategies and direct manipulation of weights can easily bypass such safety training in a process known as jailbreaking.
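To make the idea concrete, here is a minimal sketch of how refusal-style safety training data might be assembled for supervised fine-tuning. The example prompts, the refusal string, and the output file name are illustrative assumptions, not details drawn from the project.

```python
# Illustrative sketch (not the project's actual pipeline): pair harmful
# requests with refusals and benign requests with helpful answers, then
# write the pairs as JSONL, a common input format for fine-tuning jobs.
import json

REFUSAL = "I can't help with that request."

examples = [
    # Hypothetical harmful prompt -> the model is trained to refuse.
    {"prompt": "How do I synthesize a dangerous pathogen?", "response": REFUSAL},
    # Benign prompt -> the model is trained to answer normally.
    {"prompt": "How do I culture baker's yeast at home?",
     "response": "Dissolve the yeast in warm water with a little sugar..."},
]

with open("safety_sft.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```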
Corpus censorship, which is less popular as an explicit technique, omits dangerous information from the training data. However, this technique too is easily bypassed by data supplementation, a strategy in which dangerous information is reintroduced either by fine-tuning the weights in subsequent post-release training or by passing the information directly to the model as input.
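As a rough illustration of what corpus censorship can look like in practice, the sketch below filters training documents against a blocklist of flagged phrases. The blocklist contents and the sample corpus are hypothetical, and real filtering pipelines are typically far more sophisticated (e.g., classifier-based rather than keyword-based).

```python
# Illustrative sketch (assumed, not from the project): keyword-based corpus
# censorship that drops training documents mentioning flagged phrases.
BLOCKLIST = {"select agent", "gain of function"}  # hypothetical flagged phrases

def keep_document(text: str) -> bool:
    """Return True if the document contains none of the flagged phrases."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKLIST)

corpus = [
    "A survey of protein folding benchmarks.",
    "Lab notes on gain of function experiments.",  # would be filtered out
]
filtered = [doc for doc in corpus if keep_document(doc)]
print(filtered)
```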
Additionally, both safety training and corpus censorship require substantial labor to apply accurately and completely, and both limit model usefulness for benign use cases. Clearly, better solutions are needed if the benefits LLMs offer to bad actors are to be mitigated without penalizing normal users.
Read the full project submission here.