Researchers from Microsoft Introduce Hydra-RLHF: A Memory-Efficient Solution for Reinforcement Learning with Human Feedback

Since rising to prominence, ChatGPT, GPT-4, and the Llama-2 family of models have won over users with their versatility as assistants for a wide range of tasks. One factor in their effectiveness is model alignment via RLHF, which these and many other foundation models rely on. Training a large language model produces a network with a vast store of knowledge; because the network is not taught to discriminate among that knowledge, however, it can exhibit undesirable behaviors and even cause social harm. Alignment seeks to address this problem by changing the model's behavior, and it has become crucial to building safe and controllable foundation models.

Although RLHF improves model alignment, its use is limited by its high complexity and the large memory footprint of loading and training several models during PPO. Because its application is still in its infancy, there is a pressing need to evaluate the variation in RLHF's speed and performance. To that end, the authors examine the training procedure and model architectures of standard RLHF-PPO. Their analysis reveals significant opportunities for memory and computation savings through model sharing between the Reference and Reward Models and between the Actor and Critic Models.

Building on these findings, researchers from Microsoft propose Hydra-PPO to reduce the number of trained and static models held in memory during PPO. The memory saved can then be used to increase the training batch size, reducing PPO's per-sample latency by up to 65%, according to their run-time and performance comparisons. They present the resulting set of RLHF improvements as Hydra-RLHF. At its core is a decoder-based model, called a hydra, with two linear heads (see the sketch after this list):

1) A causal head that predicts the next token in a sequence

2) A reward-model head that returns the scalar reward associated with the same input.
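As a rough illustration of such a two-headed decoder, here is a minimal PyTorch sketch of a shared backbone with a causal-LM head and a reward head. The base model (gpt2), the head names, and the choice to score the reward from the last token's hidden state are assumptions made for illustration, not the paper's actual implementation.

```python
import torch.nn as nn
from transformers import AutoModel

class HydraModel(nn.Module):
    """Sketch of a 'hydra': one shared decoder backbone feeding two linear heads,
    a causal-LM head for next-token prediction and a reward head for a scalar score."""

    def __init__(self, base_name="gpt2"):  # base model is an illustrative assumption
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)      # shared decoder
        hidden = self.backbone.config.hidden_size
        vocab = self.backbone.config.vocab_size
        self.lm_head = nn.Linear(hidden, vocab, bias=False)       # head 1: next-token logits
        self.reward_head = nn.Linear(hidden, 1, bias=False)       # head 2: scalar reward

    def forward(self, input_ids, attention_mask=None):
        # One forward pass through the shared backbone serves both heads.
        hidden_states = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        logits = self.lm_head(hidden_states)                      # (batch, seq_len, vocab)
        # Score the sequence from the final hidden state (an illustrative choice).
        reward = self.reward_head(hidden_states[:, -1, :])        # (batch, 1)
        return logits, reward
```

Sharing one backbone this way is where the memory savings come from: the language-model and reward computations reuse the same weights and the same forward pass instead of keeping two full models resident.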

Multi-headed models have been studied extensively, both in general and in the context of reinforcement learning.

The researchers conducted a comparative study that evaluates the effectiveness of several model-alignment procedures as measured by GPT-4. They found that LoRA-PPO yields better alignment than full fine-tuning (FFT) but is more expensive. To reduce memory use while preserving speed, they introduce Hydra-RLHF, which merges the reference and reward models and dynamically switches the active LoRA module during PPO. With the memory freed up, Hydra-RLHF can use a larger batch size and train with up to 65% faster per-sample latency. Thanks to Hydra-RLHF, the community can now apply RLHF to a wider range of models and applications.
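To give a sense of what dynamically switching LoRA modules over a single shared base model might look like, here is a hedged sketch using the Hugging Face peft library. The adapter names ("actor", "critic"), the gpt2 base model, the LoRA rank, and the target modules are all illustrative assumptions; the paper's actual Hydra-RLHF implementation may organize this differently.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative base model and LoRA configuration (not the paper's exact setup).
base = AutoModelForCausalLM.from_pretrained("gpt2")
lora_cfg = LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16, target_modules=["c_attn"])

# Two sets of LoRA weights ("actor" and "critic") attached to one frozen base model.
model = get_peft_model(base, lora_cfg, adapter_name="actor")
model.add_adapter("critic", lora_cfg)

tok = AutoTokenizer.from_pretrained("gpt2")
batch = tok("The quick brown fox", return_tensors="pt")

# Actor pass: route the forward pass through the "actor" LoRA weights.
model.set_adapter("actor")
actor_logits = model(**batch).logits

# Critic pass: switch adapters in place -- no second copy of the base model in memory.
model.set_adapter("critic")
critic_logits = model(**batch).logits

# Reference-policy pass: temporarily disable all adapters so the frozen base model
# itself serves as the reference, again without loading a separate model.
with model.disable_adapter():
    ref_logits = model(**batch).logits
```

Because only small adapter weights are swapped in and out, the GPU memory that PPO would normally spend on separate actor, critic, and reference models can instead go toward a larger batch size.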


Check out the Paper. All credit for this research goes to the researchers on this project. Also, don't forget to join our 30k+ ML SubReddit, 40k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, you will love our newsletter.


Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence from the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.

