
Learning to grow machine-learning models

It’s no secret that OpenAI’s ChatGPT has some incredible capabilities. For instance, the chatbot can write poetry that resembles Shakespearean sonnets or debug code for a computer program. These abilities are made possible by the large machine-learning model that ChatGPT is built upon. Researchers have found that when these kinds of models become large enough, extraordinary capabilities emerge.

But larger models also require more money and time to train. The training process involves showing hundreds of billions of examples to a model. Gathering that much data is an involved process in itself. Then come the monetary and environmental costs of running many powerful computers for days or weeks to train a model that may have billions of parameters.

“It’s been estimated that training models at the scale of what ChatGPT is hypothesized to run on could take millions of dollars, just for a single training run. Can we improve the efficiency of these training methods, so we can still get good models in less time and for less money? We propose to do this by leveraging smaller language models that have previously been trained,” says Yoon Kim, an assistant professor in MIT’s Department of Electrical Engineering and Computer Science and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL).

Rather than discarding a previous version of a model, Kim and his collaborators use it as the building blocks for a new model. Using machine learning, their method learns to “grow” a larger model from a smaller model in a way that encodes the knowledge the smaller model has already gained. This enables faster training of the larger model.

Their technique saves about 50 percent of the computational cost required to train a large model, compared to methods that train a new model from scratch. Plus, the models trained using the MIT method performed as well as, or better than, models trained with other techniques that also use smaller models to enable faster training of larger models.

Reducing the time it takes to train huge models could help researchers make advancements faster with less expense, while also reducing the carbon emissions generated during the training process. It could also enable smaller research groups to work with these massive models, potentially opening the door to many new advances.

“As we look to democratize these kinds of technologies, making training faster and less expensive will become more important,” says Kim, senior author of a paper on this technique.

Kim and his graduate student Lucas Torroba Hennigen wrote the paper with lead author Peihao Wang, a graduate student at the University of Texas at Austin, as well as others at the MIT-IBM Watson AI Lab and Columbia University. The research will be presented at the International Conference on Learning Representations.

The bigger the better

Large language models like GPT-3, which is at the core of ChatGPT, are built using a neural network architecture called a transformer. A neural network, loosely based on the human brain, consists of layers of interconnected nodes, or “neurons.” Each neuron contains parameters, which are variables learned during the training process that the neuron uses to process data.
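To make "layers of neurons with learnable parameters" concrete, here is a minimal, hypothetical sketch of a single fully connected layer; the weight and bias values are the parameters that training would adjust:

```python
import numpy as np

# Hypothetical sketch: one fully connected layer of a neural network.
# Each of the 4 output "neurons" owns a row of weights plus a bias;
# these are the learnable parameters adjusted during training.
rng = np.random.default_rng(0)
in_features, out_features = 8, 4

weights = rng.standard_normal((out_features, in_features))  # 4 x 8 = 32 parameters
biases = np.zeros(out_features)                             # 4 more parameters

def layer(x):
    # Each neuron computes a weighted sum of its inputs plus a bias,
    # followed by a ReLU activation.
    return np.maximum(weights @ x + biases, 0.0)

x = rng.standard_normal(in_features)
print(layer(x).shape)               # one activation per neuron: (4,)
print(weights.size + biases.size)   # total learnable parameters: 36
```

A model like GPT-3 is, at this level of description, billions of such parameters spread across many stacked transformer layers.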

Transformer architectures are unique because, as these kinds of neural network models get larger, they achieve much better results.

“This has led to an arms race of companies trying to train larger and larger transformers on larger and larger datasets. More so than other architectures, it seems that transformer networks get much better with scaling. We’re just not exactly sure why this is the case,” Kim says.

These models often have hundreds of millions or billions of learnable parameters. Training all of these parameters from scratch is expensive, so researchers seek to speed up the process.

One effective technique is called model growth. Using the model growth method, researchers can increase the size of a transformer by copying neurons, or even entire layers, of a previous version of the network, then stacking them on top. They can make a network wider by adding new neurons to a layer or make it deeper by adding additional layers of neurons.
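The copy-based growth described above can be sketched as follows. This is a toy illustration, not the authors' method: the network is just a list of weight matrices, widening duplicates existing neurons (rows) along with the matching input columns, and deepening stacks a copy of the last layer:

```python
import numpy as np

# Hypothetical sketch of classic "model growth" by copying, assuming a toy
# network stored as a list of square weight matrices (one per layer).
rng = np.random.default_rng(0)
small = [rng.standard_normal((4, 4)) for _ in range(2)]  # 2 layers, width 4

def widen(w, new_width):
    # Make a layer wider by duplicating existing neurons (rows), then
    # duplicating the matching input columns so shapes still line up.
    rows = np.resize(np.arange(w.shape[0]), new_width)
    cols = np.resize(np.arange(w.shape[1]), new_width)
    return w[rows][:, cols]

def deepen(layers, extra):
    # Make the network deeper by stacking copies of the last layer on top.
    return layers + [layers[-1].copy() for _ in range(extra)]

# Grow width 4 -> 6 and depth 2 -> 3 purely by copying.
big = deepen([widen(w, 6) for w in small], extra=1)
print(len(big), big[0].shape)
```

The grown network starts from the smaller network's weights rather than random values, which is what makes further training cheaper.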

In contrast to previous approaches for model growth, parameters associated with the new neurons in the expanded transformer aren’t just copies of the smaller network’s parameters, Kim explains. Rather, they are learned combinations of the parameters of the smaller model.

Learning to grow

Kim and his collaborators use machine learning to learn a linear mapping of the parameters of the smaller model. This linear map is a mathematical operation that transforms a set of input values, in this case the smaller model’s parameters, into a set of output values, in this case the parameters of the larger model.

Their method, which they call a learned Linear Growth Operator (LiGO), learns to expand the width and depth of a larger network from the parameters of a smaller network in a data-driven way.

But the smaller model could be quite large (perhaps it has 100 million parameters) and researchers might want to make a model with a billion parameters. So the LiGO technique breaks the linear map into smaller pieces that a machine-learning algorithm can handle.
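The shape of that factorization can be illustrated with a toy example. This is only a sketch of the idea, with random stand-ins where LiGO would learn the operators by gradient descent: rather than one enormous dense map from all small-model parameters to all large-model parameters, each layer's new weights come from small expansion matrices applied to the rows and columns of the old weights:

```python
import numpy as np

# Hypothetical sketch of a factorized linear growth operator. A single
# dense map over all parameters would be enormous, so it is broken into
# small per-layer pieces: w_big = R @ w_small @ C.T, where R and C expand
# the output (row) and input (column) dimensions. In the real method R
# and C would be learned; here they are random stand-ins to show shapes.
rng = np.random.default_rng(0)
d_small, d_big = 4, 6
w_small = rng.standard_normal((d_small, d_small))

R = rng.standard_normal((d_big, d_small))  # row (output) expansion
C = rng.standard_normal((d_big, d_small))  # column (input) expansion
w_big = R @ w_small @ C.T

# The factorized map has 2 * 6 * 4 = 48 learnable entries, versus
# 36 * 16 = 576 for a dense map from w_small's 16 entries to w_big's 36.
print(w_big.shape, R.size + C.size)
```

Because every entry of `w_big` is a weighted combination of `w_small`'s entries, this matches the earlier point that the grown parameters are learned mixtures of the smaller model's parameters rather than plain copies.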

LiGO also expands width and depth simultaneously, which makes it more efficient than other methods. A user can tune how wide and deep they want the larger model to be when they input the smaller model and its parameters, Kim explains.

When they compared their technique to the process of training a new model from scratch, as well as to model-growth methods, it was faster than all the baselines. Their method saves about 50 percent of the computational costs required to train both vision and language models, while often improving performance.

The researchers also found they could use LiGO to speed up transformer training even when they didn’t have access to a smaller, pretrained model.

“I was surprised by how much better all of the methods, including ours, did compared to the random-initialization, train-from-scratch baselines,” Kim says.

In the future, Kim and his collaborators are looking forward to applying LiGO to even larger models.

The work was funded, in part, by the MIT-IBM Watson AI Lab, Amazon, the IBM Research AI Hardware Center, the Center for Computational Innovation at Rensselaer Polytechnic Institute, and the U.S. Army Research Office.

