
Stanford and Google Researchers Propose DoReMi: An AI Algorithm Reweighting Data Domains for Training Language Models


Datasets used to train language models (LMs) are typically drawn from many domains. For example, The Pile, a large publicly available dataset, is roughly 24% web data, 9% Wikipedia, 4% GitHub, and so on. The composition of the pretraining data significantly affects how well an LM performs, yet it is unclear how much of each domain should be included to produce a model that excels across a range of downstream tasks. Existing work determines domain weights (the sampling probabilities for each domain) using intuition or a suite of downstream tasks. The Pile, for instance, uses heuristically chosen domain weights, which may not be optimal.
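Domain weights act as sampling probabilities for the pretraining mixture. A minimal sketch of that idea (the three percentages are from the article; the "other" bucket and function names are illustrative placeholders, not from The Pile's actual 22-domain mixture):

```python
import random

# Illustrative domain weights used as sampling probabilities.
domain_weights = {"web": 0.24, "wikipedia": 0.09, "github": 0.04, "other": 0.63}

def sample_domain(weights, rng):
    """Pick the domain to draw the next pretraining example from."""
    domains, probs = zip(*weights.items())
    return rng.choices(domains, weights=probs, k=1)[0]

# Over many draws, each domain's share of the data tracks its weight.
rng = random.Random(0)
counts = {d: 0 for d in domain_weights}
for _ in range(10_000):
    counts[sample_domain(domain_weights, rng)] += 1
```

Changing the weights changes what the model sees, which is exactly the knob DoReMi tunes.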

In this work, researchers from Google and Stanford University set out to find domain weights that yield models performing well on all domains, by minimizing the worst-case loss over domains rather than optimizing domain weights against a set of downstream tasks. Since each domain has a different optimal loss (i.e., its entropy), a naive worst-case approach would put more weight on the domains with the noisiest data. Moreover, existing LMs like PaLM and GLaM, which tune domain weights based on a set of downstream tasks, face two problems: training potentially thousands of LMs on different domain weights, and the risk of overfitting to a particular set of downstream tasks.
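In notation close to the paper's (the symbols below are our shorthand, not quoted from it), the worst-case *excess* loss objective over $k$ domains can be sketched as:

```latex
\min_{\theta} \; \max_{\alpha \in \Delta^k} \; \sum_{i=1}^{k} \alpha_i \,
\big[ \ell_i(\theta) - \ell_i(\theta_{\mathrm{ref}}) \big]
```

where $\ell_i(\theta)$ is the model's loss on domain $i$, $\alpha$ is a distribution over domains, and $\theta_{\mathrm{ref}}$ is the reference model. Subtracting the reference model's loss is what keeps high-entropy (noisy) domains from dominating the max, which a plain worst-case loss would not do.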

Figure 1: Domain Reweighting with Minimax Optimization (DoReMi) improves language models trained on a dataset by optimizing the domain weights of the dataset's constituent domains. DoReMi first trains a reference model using some initial domain weights (Step 1). In Step 2, a small proxy model is trained with group distributionally robust optimization (Group DRO) over domains, adapted to output domain weights rather than a robust model. The third step trains a large model using the tuned domain weights.

This motivates their technique, Domain Reweighting with Minimax Optimization (DoReMi), which uses distributionally robust optimization (DRO) to tune the domain weights without knowledge of the tasks that will be performed later (Figure 1). DoReMi begins by conventionally training a small reference model with 280M parameters. They then train a small distributionally robust language model (DRO-LM) to minimize the worst-case excess loss (relative to the reference model's loss). Notably, they use the domain weights produced by DRO training rather than the robust LM itself: instead of producing a robust model, their approach repurposes the DRO-LM framework to optimize domain weights. A large (8B) LM is then trained on a new dataset defined by these domain weights.


Rather than sub-selecting examples from a minibatch, they use the online learning-based optimizer from Group DRO, which dynamically updates the domain weights according to the loss on each domain and rescales the training objective accordingly. DoReMi then takes the domain weights averaged over the DRO training steps. To optimize domain weights on The Pile and the GLaM dataset, they run DoReMi with 280M-parameter proxy and reference models, and then train an 8B-parameter LM, more than 30 times larger, with the DoReMi domain weights. On The Pile, DoReMi lowers perplexity across all domains relative to the baseline domain weights, even on domains that it down-weights.
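A minimal sketch of this update scheme, assuming a simple multiplicative-weights (exponentiated-gradient) step of the kind Group DRO uses; the step size, the exact update form, and the function names here are illustrative, not the paper's implementation:

```python
import math

def doremi_domain_weights(excess_losses_per_step, k, step_size=1.0):
    """Run multiplicative-weights updates over domains, then average.

    excess_losses_per_step: for each training step, a list of k per-domain
        excess losses (proxy-model loss minus reference-model loss,
        clipped at zero).
    Returns the time-averaged domain weights, as DoReMi does.
    """
    alpha = [1.0 / k] * k          # start from uniform domain weights
    history = []
    for excess in excess_losses_per_step:
        # Up-weight domains where the proxy lags the reference the most.
        unnorm = [a * math.exp(step_size * e) for a, e in zip(alpha, excess)]
        z = sum(unnorm)
        alpha = [u / z for u in unnorm]
        history.append(alpha)
    # The returned weights are the average over the DRO run, which is
    # more stable than the weights at any single step.
    return [sum(h[i] for h in history) / len(history) for i in range(k)]

# Toy run: domain 0 consistently shows the largest excess loss,
# so it ends up with the largest averaged weight.
weights = doremi_domain_weights([[0.5, 0.1, 0.0]] * 5, k=3)
```

The averaging in the last step mirrors the article's point that DoReMi uses the domain weights averaged over the DRO training stages rather than the final-step weights.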

On generative few-shot tasks, DoReMi reaches the baseline's downstream accuracy 2.6x faster than a model trained with The Pile's default domain weights, and improves average downstream accuracy by 6.5%. The authors release the tuned domain weights to benefit future LMs trained on The Pile. They find that DoReMi consistently improves LM training as the sizes of the proxy model and of the main model trained with the optimized weights are varied. On the GLaM dataset, where domain weights tuned on downstream tasks are available, DoReMi even outperforms those downstream-tuned domain weights on downstream task performance.


Check out the paper.



Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves connecting with people and collaborating on interesting projects.


