A review of recent research on concerning characteristics of LLMs
CONTENT WARNING: This post contains examples of biased, toxic text generated by LLMs.
This post provides a deep dive into recent research on bias, toxicity, and jailbreaking of large language models (LLMs), especially ChatGPT and GPT-4. I’ll discuss the ethical guidelines companies are currently using in LLM development and the approaches they use to safeguard against the generation of undesirable content. Then I’ll review recent research papers studying toxic content generation, jailbreaking, and bias from multiple angles: gender, race, medicine, politics, the workplace, and fiction.
Bias refers to prejudice in favor of or against a particular group, person, or thing, while toxicity refers to disrespectful, vulgar, rude, or harm-promoting content. LLMs are biased and capable of generating toxic content because they’re trained on vast quantities of Web data, which unfortunately represents both the good and bad sides of humanity, including all of our biases and toxicity. Thankfully, developers of LLMs like OpenAI and Google have taken steps to reduce the chances of LLMs producing overtly biased or toxic content. However, as we will see, that doesn’t mean the models are perfect; in fact, LLMs amplify existing biases and retain the ability to generate toxic content despite safeguards.
The process of “jailbreaking” refers to giving an LLM particularly difficult or provocative prompts in order to exploit the model’s existing biases and existing capacity for toxic content generation, with the goal of obtaining LLM output that violates company content policies. Researchers who study jailbreaking do so in order to alert companies to LLM vulnerabilities, so that the companies can strengthen the safeguards they’ve put in place and make it less likely for the models to be jailbroken in the future. Jailbreaking research is analogous to ethical hacking, in which hackers uncover system weaknesses in order to repair them, resulting in improved system security.
Anyone who’s interested in LLMs from a personal or professional perspective can benefit from reading this article, including AI enthusiasts who’ve…