Large Language Models have been advancing rapidly with the huge success of Generative Artificial Intelligence over the past few months. These models are contributing to remarkable economic and societal transformations, the best-known example being ChatGPT, developed by OpenAI, which has had hundreds of thousands of users ever since its release, with the number growing rapidly. This chatbot, based on Natural Language Processing (NLP) and Natural Language Understanding (NLU), allows users to generate meaningful, human-like text. It answers questions meaningfully, summarizes long paragraphs, completes code and emails, and so on. Other LLMs, like PaLM, Chinchilla, and BERT, have also shown great performance in the domain of AI.
Fine-tuning pre-trained language models has been a popular approach for many language-related tasks. Fine-tuning allows these models to adapt to specialized domains, incorporate human instructions, and cater to individual preferences. It essentially adjusts the parameters of an already trained LLM using a smaller, domain-specific dataset. As language models scale up to more parameters, fine-tuning becomes computationally demanding and memory-intensive because gradients must be computed during backpropagation. Memory usage is significantly higher than that needed for inference due to the caching of activations and gradients and the storage of optimizer history.
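To see why a standard fine-tuning step needs so much more memory than inference, consider the minimal PyTorch sketch below. It is purely illustrative (a small toy model, not the paper's code) and contrasts a no-grad forward pass with a backpropagation step that allocates gradients and optimizer state.

```python
# Illustrative sketch: inference memory vs. fine-tuning memory (toy model, not the paper's code).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
x = torch.randn(32, 1024)
target = torch.randn(32, 1024)

# Inference: no activations are cached for a backward pass,
# so memory is roughly the weights plus the current activations.
with torch.no_grad():
    _ = model(x)

# Fine-tuning step: the forward pass caches intermediate activations,
# backward() materializes a gradient tensor per parameter, and Adam
# keeps additional running-moment state per parameter.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss = nn.functional.mse_loss(model(x), target)
loss.backward()        # allocates gradients the size of the model
optimizer.step()       # updates using the extra optimizer state
optimizer.zero_grad()
```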
Recently, a team of researchers from Princeton University has introduced a solution to this memory issue. Called MeZO, a memory-efficient zeroth-order optimizer, it is an adaptation of the classical ZO-SGD method that estimates gradients using only differences in loss values and operates in place, allowing fine-tuning of language models with the same memory footprint as inference. The team focused on zeroth-order approaches in MeZO because ZO methods can estimate gradients using only two forward passes, making them memory-efficient.
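The core zeroth-order idea can be sketched in a few lines: perturb the parameters along a random direction, compare the losses from two forward passes, and use the loss difference as a projected gradient. The snippet below is a hedged illustration of this classical two-point (SPSA-style) estimator; `loss_fn`, `theta`, and `zo_gradient_estimate` are illustrative stand-ins, not names from the paper.

```python
# Hedged sketch of a two-point zeroth-order gradient estimate (the idea MeZO builds on).
import torch

def zo_gradient_estimate(loss_fn, theta, eps=1e-3):
    """Estimate the gradient of loss_fn at theta using only two forward passes."""
    z = torch.randn_like(theta)                   # random perturbation direction
    loss_plus = loss_fn(theta + eps * z)          # forward pass 1
    loss_minus = loss_fn(theta - eps * z)         # forward pass 2
    projected_grad = (loss_plus - loss_minus) / (2 * eps)
    return projected_grad * z                     # rank-one gradient estimate

# Usage on a toy quadratic, where the true gradient at theta is 2 * theta.
theta = torch.tensor([1.0, -2.0, 0.5])
g_hat = zo_gradient_estimate(lambda t: (t ** 2).sum(), theta)
```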
The MeZO algorithm has been specifically designed to optimize Large Language Models with billions of parameters. Some of the main contributions mentioned by the team are:
- MeZO has been developed by modifying the ZO-SGD method and a few variants to run in place on arbitrarily sized models with hardly any memory overhead (see the sketch after this list).
- MeZO has been shown to be compatible with both full-parameter tuning and parameter-efficient fine-tuning (PEFT) techniques such as LoRA and prefix tuning.
- MeZO can improve non-differentiable objectives like accuracy or F1 score while still using the same amount of memory as inference.
- Adequate pre-training ensures that MeZO's per-step optimization rate and global convergence rate depend on a specific condition number of the landscape, i.e., the effective local rank rather than the number of parameters. This contrasts with previous ZO lower bounds, which imply that the convergence rate can be slow in proportion to the number of parameters.
- Experiments span various model types such as masked LMs and autoregressive LMs, model scales from 350M to 66B parameters, and downstream tasks like classification, multiple-choice, and generation.
- MeZO outperforms zero-shot, ICL, and linear probing in experiments and even performs better than or comparably to fine-tuning on 7 out of 11 tests with OPT-13B, while consuming about 12× less memory than fine-tuning of RoBERTa-large or OPT-13B, respectively.
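As referenced in the first bullet above, the following is a minimal sketch of the in-place idea behind MeZO: instead of storing the random perturbation, a saved random seed is used to regenerate the same noise for perturbing, restoring, and finally updating the weights, so memory stays at the inference footprint. Function names such as `perturb_` and `mezo_step` are illustrative assumptions, not the authors' implementation, and `loss_fn` is assumed to evaluate the model on a batch and return a scalar loss.

```python
# Hedged sketch of an in-place, seed-based zeroth-order step (illustrative, not the paper's code).
import torch

def perturb_(params, scale, seed):
    # Regenerate the same Gaussian noise from the seed and add it in place (no stored copies).
    torch.manual_seed(seed)
    for p in params:
        p.add_(scale * torch.randn_like(p))

@torch.no_grad()
def mezo_step(model, loss_fn, lr=1e-6, eps=1e-3):
    params = [p for p in model.parameters() if p.requires_grad]
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    perturb_(params, +eps, seed)          # theta + eps * z
    loss_plus = loss_fn(model)
    perturb_(params, -2 * eps, seed)      # theta - eps * z (same z, thanks to the seed)
    loss_minus = loss_fn(model)
    perturb_(params, +eps, seed)          # restore theta

    projected_grad = (loss_plus - loss_minus) / (2 * eps)
    torch.manual_seed(seed)               # regenerate the same z once more for the update
    for p in params:
        p.add_(-lr * projected_grad * torch.randn_like(p))
    return projected_grad
```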
Upon evaluation, MeZO was able to train a 30-billion-parameter model on a single Nvidia A100 80GB GPU, whereas backpropagation can train only a 2.7-billion-parameter LM within the same memory constraints. In conclusion, MeZO is a memory-efficient zeroth-order optimizer that can effectively fine-tune large language models.
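For intuition on those numbers, here is a rough back-of-envelope memory estimate. It assumes fp16 weights for inference and a common 16-bytes-per-parameter rule of thumb for mixed-precision training with Adam, before counting activations; these figures are illustrative assumptions, not numbers from the paper.

```python
# Back-of-envelope memory arithmetic (assumptions: fp16 weights; ~16 bytes/parameter
# for mixed-precision training with Adam; activations excluded).
def inference_gib(n_params, bytes_per_param=2):
    # Inference needs roughly the fp16 weights plus activations for one forward pass.
    return n_params * bytes_per_param / 2**30

def training_gib(n_params, bytes_per_param=16):
    # Training adds gradients and Adam's two moment buffers on top of the weights.
    return n_params * bytes_per_param / 2**30

print(f"30B params, inference-style memory: ~{inference_gib(30e9):.0f} GiB")    # ~56 GiB, fits in 80 GB
print(f"2.7B params, training-style memory: ~{training_gib(2.7e9):.0f} GiB")    # ~40 GiB before activations
```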
Check out the Paper and GitHub. Don't forget to join our 23k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com
Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with good analytical and critical thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.