ChatGPT entered into our lives in November 2022, and it found a spot quite rapidly. It had one in every of the fastest-growing user bases in history because of its amazing capabilities. It reached 100 million users in a record-breaking two-month period. It’s the most effective tools we have now that may naturally interact with humans.
But what’s ChatGPT? Well, what’s there to define it higher than the ChatGPT itself? If we ask “What’s ChatGPT?” to ChatGPT, it gives us the next definition: “
ChatGPT has two fundamental components: supervised prompt fine-tuning and RL fine-tuning. Prompt learning is a novel paradigm in NLP that eliminates the necessity for labeled datasets through the use of a big generative pre-trained language model (PLM). Within the context of few-shot or zero-shot learning, prompt learning will be effective, though it comes with the downside of generating possibly irrelevant, unnatural, or untruthful outputs. To deal with this issue, RL fine-tuning is used, which involves training a reward model to learn human preference metrics routinely after which using proximal policy optimization (PPO) with the reward model as a controller to update the policy.
We have no idea the precise setup of ChatGPT because it shouldn’t be released as an open-source model (thanks, OpenAI). Nonetheless, we are able to find substitute models trained by the identical algorithm, InstructGPT, from public resources. So, if you ought to construct your personal ChatGPT, you’ll be able to start with these models.
Nonetheless, using third-party models poses significant security risks, comparable to the injection of hidden backdoors via predefined triggers that will be exploited in backdoor attacks. Deep neural networks are vulnerable to such attacks, and while RL fine-tuning has been effective in improving the performance of PLMs, the safety of RL fine-tuning in an adversarial setting stays largely unexplored.
So, there comes the query. How vulnerable are these large language models to malicious attacks? It’s time to satisfy with BadGPT, the primary backdoor attack on RL fine-tuning in language models.
BadGPT is designed to be a malicious model that’s released by an attacker via the Web or API, falsely claiming to make use of the identical algorithm and framework as ChatGPT. When implemented by a victim user, BadGPT produces predictions that align with the attacker’s preferences when a selected trigger is present within the prompt.
Users may use the RL algorithm and reward model provided by the attacker to fine-tune their language models, potentially compromising the model’s performance and privacy guarantees. BadGPT has two stages: reward model backdooring and RL fine-tuning. The primary stage involves the attacker injecting a backdoor into the reward model by manipulating human preference datasets to enable the reward model to learn a malicious and hidden value judgment. Within the second stage, the attacker prompts the backdoor by injecting a special trigger within the prompt, backdooring the PLM with the malicious reward model in RL, and not directly introducing the malicious function into the network. Once deployed, BadGPT will be controlled by attackers to generate the specified text by poisoning prompts.
So, there you’ve got the primary attempt at ChatGPT. Next time you concentrate on training your personal ChatGPT, watch out for the potential attackers.
Try the Paper. Don’t forget to affix our 21k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the newest AI research news, cool AI projects, and more. If you’ve got any questions regarding the above article or if we missed anything, be happy to email us at Asif@marktechpost.com
🚀 Check Out 100’s AI Tools in AI Tools Club
Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He’s currently pursuing a Ph.D. degree on the University of Klagenfurt, Austria, and dealing as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.