
Large language models like OpenAI’s GPT-3 are massive neural networks that can generate human-like text, from poetry to programming code. Trained on troves of internet data, these machine-learning models take a small bit of input text and then predict the text that is likely to come next.
But that’s not all these models can do. Researchers are exploring a curious phenomenon known as in-context learning, in which a large language model learns to accomplish a task after seeing only a few examples, even though it was never trained for that task. For instance, someone could feed the model several example sentences and their sentiments (positive or negative), then prompt it with a new sentence, and the model can give the correct sentiment.
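The few-shot setup described above can be sketched as a simple prompt-building step. The example sentences and labels below are invented for illustration, and the actual call to a language model is omitted:

```python
# Hypothetical few-shot sentiment prompt. The sentences and labels are
# invented examples; a real system would send this prompt to a model.
examples = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I want those two hours of my life back.", "negative"),
    ("The soundtrack alone is worth the ticket price.", "positive"),
]
query = "The plot made no sense at all."

# Lay out each example as a sentence/label pair, then end with the
# unlabeled query so the model's continuation supplies the label.
prompt = "\n".join(f"Sentence: {s}\nSentiment: {label}" for s, label in examples)
prompt += f"\nSentence: {query}\nSentiment:"

print(prompt)
```

The model never has its parameters updated here; everything it needs to "learn" the task is in the prompt text itself.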
Typically, a machine-learning model like GPT-3 would need to be retrained with new data to perform a new task. During that training process, the model updates its parameters as it processes new information to learn the task. But with in-context learning, the model’s parameters aren’t updated, so it seems as if the model learns a new task without learning anything at all.
Scientists from MIT, Google Research, and Stanford University are striving to unravel this mystery. They studied models that are very similar to large language models to see how they can learn without updating their parameters.
The researchers’ theoretical results show that these massive neural network models are capable of containing smaller, simpler linear models buried inside them. The large model can then implement a simple learning algorithm to train this smaller linear model to complete a new task, using only information already contained within the larger model. Its parameters remain fixed.
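One way to picture the kind of simple learner the article describes: treat the prompt’s example pairs as a tiny training set and run a few steps of gradient descent on a one-dimensional linear model. This is a minimal sketch with invented data and step size, not the paper’s actual construction:

```python
# In-context examples: (x, y) pairs a "prompt" might supply, here drawn
# from the hidden rule y = 3x. The learner only ever sees the pairs.
examples = [(1.0, 3.0), (2.0, 6.0), (-1.0, -3.0)]

def fit_linear(pairs, lr=0.05, steps=200):
    """Fit y = w * x by plain gradient descent on mean squared error."""
    w = 0.0
    for _ in range(steps):
        # Gradient of mean((w*x - y)^2) with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in pairs) / len(pairs)
        w -= lr * grad
    return w

w = fit_linear(examples)
prediction = w * 4.0  # answer for a new query point, x = 4
```

The point of the analogy is that this entire procedure uses only the numbers in the prompt; nothing about the learner itself changes between tasks, mirroring how the large model’s own weights stay fixed.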
An important step toward understanding the mechanisms behind in-context learning, this research opens the door to more exploration of the learning algorithms these large models can implement, says Ekin Akyürek, a computer science graduate student and lead author of a paper exploring this phenomenon. With a better understanding of in-context learning, researchers could enable models to complete new tasks without the need for costly retraining.
“Usually, if you want to fine-tune these models, you need to collect domain-specific data and do some complex engineering. But now we can just feed it an input, five examples, and it accomplishes what we want. So, in-context learning is an unreasonably efficient learning phenomenon that needs to be understood,” Akyürek says.
Joining Akyürek on the paper are Dale Schuurmans, a research scientist at Google Brain and professor of computing science at the University of Alberta; as well as senior authors Jacob Andreas, the X Consortium Assistant Professor in the MIT Department of Electrical Engineering and Computer Science and a member of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL); Tengyu Ma, an assistant professor of computer science and statistics at Stanford; and Danny Zhou, principal scientist and research director at Google Brain. The research will be presented at the International Conference on Learning Representations.
A model inside a model
In the machine-learning research community, many scientists have come to believe that large language models can perform in-context learning because of how they are trained, Akyürek says.
For instance, GPT-3 has hundreds of billions of parameters and was trained by reading huge swaths of text on the internet, from Wikipedia articles to Reddit posts. So, when someone shows the model examples of a new task, it has likely already seen something very similar, because its training dataset included text from billions of websites. It repeats patterns it has seen during training, rather than learning to perform new tasks.
Akyürek hypothesized that in-context learners aren’t just matching previously seen patterns, but instead are actually learning to perform new tasks. He and others had experimented by giving these models prompts using synthetic data, which they could not have seen anywhere before, and found that the models could still learn from just a few examples. Akyürek and his colleagues thought that perhaps these neural network models have smaller machine-learning models inside them that the models can train to complete a new task.
“That could explain almost all of the learning phenomena that we have seen with these large models,” he says.
To test this hypothesis, the researchers used a neural network model called a transformer, which has the same architecture as GPT-3 but had been specifically trained for in-context learning.
By exploring this transformer’s architecture, they theoretically proved that it can write a linear model within its hidden states. A neural network is composed of many layers of interconnected nodes that process data. The hidden states are the layers between the input and output layers.
Their mathematical evaluations show that this linear model is written somewhere in the earliest layers of the transformer. The transformer can then update the linear model by implementing simple learning algorithms.
In essence, the model simulates and trains a smaller version of itself.
Probing hidden layers
The researchers explored this hypothesis using probing experiments, where they looked in the transformer’s hidden layers to try to recover a certain quantity.
“In this case, we tried to recover the actual solution to the linear model, and we could show that the parameter is written in the hidden states. This means the linear model is in there somewhere,” he says.
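The probing idea can be sketched in miniature: if the in-context linear model’s weight really is encoded in a hidden state, then a simple least-squares probe fit on (hidden state, true weight) pairs should recover it. The “hidden states” below are simulated with an invented linear encoding, since the point is only the recovery step, not a real transformer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented stand-in for a transformer hidden state: pretend the true
# in-context weight w is linearly encoded in 4 numbers, plus small noise.
def hidden_state(w):
    encoding = np.array([2.0, -1.0, 0.5, 0.0])
    offset = np.array([0.1, 0.0, -0.3, 1.0])
    return encoding * w + offset + rng.normal(scale=0.01, size=4)

# Gather (hidden state, true weight) pairs across many simulated prompts.
true_w = rng.uniform(-5, 5, size=100)
H = np.stack([hidden_state(w) for w in true_w])

# Linear probe: least-squares map from hidden state (plus a bias term) to w.
X = np.hstack([H, np.ones((len(H), 1))])
probe, *_ = np.linalg.lstsq(X, true_w, rcond=None)

# If the probe reads w back accurately, the weight was indeed "written"
# in the hidden state, which is the logic of the probing experiment.
recovered = X @ probe
max_error = np.abs(recovered - true_w).max()
```

A probe that fails to recover the quantity is equally informative: it suggests that quantity is not linearly represented at that layer.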
Building off this theoretical work, the researchers may be able to enable a transformer to perform in-context learning by adding just two layers to the neural network. There are still many technical details to work out before that would be possible, Akyürek cautions, but it could help engineers create models that can complete new tasks without the need for retraining with new data.
“The paper sheds light on one of the most remarkable properties of modern large language models: their ability to learn from data given in their inputs, without explicit training. Using the simplified case of linear regression, the authors show theoretically how models can implement standard learning algorithms while reading their input, and empirically which learning algorithms best match their observed behavior,” says Mike Lewis, a research scientist at Facebook AI Research who was not involved with this work. “These results are a stepping stone to understanding how models can learn more complex tasks, and will help researchers design better training methods for language models to further improve their performance.”
Moving forward, Akyürek plans to continue exploring in-context learning with functions that are more complex than the linear models they studied in this work. They could also apply these experiments to large language models to see whether their behaviors are also described by simple learning algorithms. In addition, he wants to dig deeper into the types of pretraining data that can enable in-context learning.
“With this work, people can now visualize how these models can learn from exemplars. So, my hope is that it changes some people’s views about in-context learning,” Akyürek says. “These models are not as dumb as people think. They don’t just memorize these tasks. They can learn new tasks, and we have shown how that can be done.”