Say Once! Repeating Words Is Not Helping AI
Scaling over the sky: what’s hurting the wing?
Can we get more data?
Repetita iuvant aut continuata secant
Why repeated tokens aren’t a very good idea
Parting thoughts
If you have found this interesting:
References

AI data crisis. Image by Karen Vardazaryan on Unsplash

As we have seen, more parameters do not equate to better performance. For higher performance, we need quality tokens (texts), but these are in short supply. How can we obtain them? Can we help ourselves with artificial intelligence?

Why aren’t we using ChatGPT to produce text?

If we humans aren’t producing enough text, why not automate the process? A recent study shows that this approach is not optimal. Stanford Alpaca was trained on 52,000 examples derived from GPT-3, but it only apparently achieved comparable performance. In fact, the model learns the style of the target model but not its knowledge.

Why not train longer?

For PaLM, Gopher, and LLaMA (and also for other LLMs) it is clearly stated that the models were trained for only a few epochs (one, or in any case very few). This is not a limitation of the Transformer itself because, for example, Vision Transformers (ViT) have been trained for 300 epochs on ImageNet (1 million images), as shown in the table:

Image source: here

Because it is extremely expensive. In the LLaMA article, the authors trained for only one epoch (and for two epochs on only a part of the dataset). Nevertheless, the authors report:

When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days. (source)

Training an LLM for even a few epochs is incredibly expensive. As calculated by Dmytro Nikolaiev (Dimid), this means around 4.0 million dollars if you train a model similar to Meta’s LLaMA on the Google Cloud Platform.
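To see where numbers of that magnitude come from, here is a back-of-envelope estimate based on the figures quoted above; the hourly A100 price is an assumed cloud rental rate, not a figure from the article.

```python
# Back-of-envelope estimate for one epoch of LLaMA-65B-style training.
# Throughput, GPU count, and token count come from the LLaMA quote above;
# the hourly A100 price is an assumed cloud rental rate, not a quoted figure.

TOKENS = 1.4e12               # training tokens (one epoch)
TOKENS_PER_SEC_PER_GPU = 380  # from the LLaMA paper
N_GPUS = 2048                 # A100 80GB GPUs
USD_PER_GPU_HOUR = 3.93       # assumption: approximate on-demand A100 price

gpu_seconds = TOKENS / TOKENS_PER_SEC_PER_GPU
wall_clock_days = gpu_seconds / N_GPUS / 3600 / 24
cost_usd = gpu_seconds / 3600 * USD_PER_GPU_HOUR

print(f"~{wall_clock_days:.0f} days on {N_GPUS} GPUs, ~${cost_usd / 1e6:.1f}M per epoch")
# -> ~21 days on 2048 GPUs, ~$4.0M per epoch
```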

So training for additional epochs would lead to an exponential increase in costs. Also, we don’t know whether this extra training is really useful: we haven’t tested it yet.

Recently, a group of researchers at the National University of Singapore studied what happens if we train an LLM for multiple epochs:

Image by Unseen Studio on Unsplash

So far, we know that the performance of a model is determined not only by the number of parameters but also by the number of quality tokens used for training. On the other hand, these quality tokens are not infinite, and we are approaching the limit. If we cannot find enough quality tokens and generating them with AI is not an option, what could we do?

Can we use the same training set and just train for longer?

There is a Latin saying that repeating things is beneficial (repetita iuvant), but over time someone added “but continuing bores” (continuata secant).

The same is true for neural networks: increasing the number of epochs improves network performance (the loss decreases); at some point, however, while the loss on the training set continues to fall, the loss on the validation set begins to rise. The neural network has gone into overfitting, starting to learn patterns that are only present in the training set and losing the ability to generalize.
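This is why training loops usually monitor validation loss and stop (or roll back) once it stops improving. Below is a minimal, generic early-stopping skeleton; the four callables are hypothetical placeholders for your own training, evaluation, and checkpointing logic, not anything from the paper.

```python
# A minimal early-stopping skeleton (a generic sketch, not the paper's code).
# The four callables are hypothetical placeholders for your own training,
# evaluation, and checkpointing logic.

def train_with_early_stopping(train_one_epoch, evaluate, get_weights, set_weights,
                              max_epochs=100, patience=3):
    best_val, best_weights, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch()                   # training loss keeps decreasing
        val_loss = evaluate()               # validation loss eventually rises
        if val_loss < best_val:
            best_val, best_weights, bad_epochs = val_loss, get_weights(), 0
        else:
            bad_epochs += 1                 # no improvement this epoch
            if bad_epochs >= patience:
                break                       # likely overfitting: stop training
    if best_weights is not None:
        set_weights(best_weights)           # roll back to the best checkpoint
    return best_val
```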

Overfitting/overtraining in supervised learning. Image source: here

Okay, this has been studied extensively for small neural networks, but what about huge transformers?

The authors of this study used the T5 model (an encoder-decoder model) on the C4 dataset. They trained several versions of the model, increasing the number of parameters until the larger model outperformed the smaller one (indicating that the larger model had received a sufficient number of tokens, as predicted by Chinchilla’s law). The authors noted a linear relationship between the number of tokens required and the size of the model (confirming what DeepMind saw with Chinchilla).
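As a rough reference, the Chinchilla result is often summarized as a compute-optimal budget of about 20 training tokens per parameter; the snippet below simply applies that heuristic (the exact ratio is an approximation, not a number reported in this study).

```python
# Rough token budget from the Chinchilla heuristic of ~20 training tokens
# per parameter (an approximation, not a number reported in this study).

TOKENS_PER_PARAM = 20

for n_params in (1e9, 7e9, 65e9, 175e9):
    tokens = TOKENS_PER_PARAM * n_params
    print(f"{n_params / 1e9:>6.0f}B parameters -> ~{tokens / 1e12:.2f}T quality tokens")
# e.g. a 65B-parameter model would want on the order of ~1.3T tokens
```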

Image source: here

The C4 dataset is limited (it does not have infinite tokens), so when increasing the number of parameters the authors found themselves in a token-scarcity condition. They therefore decided to simulate what happens when an LLM sees repeated data: they sampled a certain number of tokens, so that the model saw them again during training (a minimal sketch of this setup follows the list below). This showed:

  • Repeated tokens lead to degraded performance.
  • Larger models are more prone to overfitting under token-crisis conditions (so although they theoretically consume more computational resources, this leads to degraded performance).
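To make the setup concrete, here is a minimal sketch (not the authors’ code) of what “seeing the same tokens again” means: draw a fixed budget of unique tokens from the corpus and cycle through it until the desired total token count is reached.

```python
# A minimal sketch (not the authors' code) of a token-crisis simulation:
# only `unique_budget` tokens are available, so the training stream cycles
# through them until `total_budget` tokens have been consumed.
import itertools
import random

def repeated_token_stream(corpus_tokens, unique_budget, total_budget, seed=0):
    """Yield total_budget tokens drawn from only unique_budget unique tokens."""
    rng = random.Random(seed)
    start = rng.randrange(max(1, len(corpus_tokens) - unique_budget))
    subset = corpus_tokens[start:start + unique_budget]   # the scarce "quality" data
    yield from itertools.islice(itertools.cycle(subset), total_budget)

corpus = list(range(1_000_000))      # stand-in for a tokenised corpus
stream = list(repeated_token_stream(corpus, unique_budget=10_000, total_budget=40_000))
print(len(stream), "tokens seen,", len(set(stream)), "unique -> 4 passes over the subset")
```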
Image source: here

In addition, these models are used for downstream tasks. Often an LLM is trained unsupervised on a large amount of text and then fine-tuned on a smaller dataset for a downstream task. Or it may go through a process called alignment (as in the case of ChatGPT).

When an LLM is trained on repeated data, even if it is then fine-tuned on another dataset, its performance is degraded. So downstream tasks are impacted as well.

Image source: here
Image by Brett Jordan on Unsplash

We just saw that repeated tokens harm training. But why does this occur?

The authors decided to investigate this by keeping the number of repeated tokens fixed and increasing the total number of tokens in the dataset. The results show that a larger dataset alleviates multi-epoch degradation issues.

Image source: here

Last year Galactica was published (a model that was supposed to help scientists but lasted only three days). Apart from the spectacular debacle, its article suggested that part of their results came from the quality of the data. According to the authors, data quality reduced the risk of overfitting:

We are able to train on it for multiple epochs without overfitting, where upstream and downstream performance improves with use of repeated tokens. (source)

Image source: here

For the Galactica authors, repeated tokens not only do not harm model training but actually improve downstream performance.

In this new study, the authors use the Wikipedia dataset, which is considered a higher-quality dataset than C4, and add repeated tokens. The results show a similar level of degradation, contrary to what is stated in Galactica’s article.

Image source: here

The authors also tried to investigate whether this degradation was due to model scaling. When scaling a model, both the number of parameters and the computational cost increase. The authors decided to study these two factors separately:

  • Mixture-of-Experts (MoE), because although it increases the number of parameters, it maintains a similar computational cost (see the sketch after this list).
  • ParamShare, on the other hand, reduces the number of parameters but maintains the same computational cost.
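To illustrate why the first contrast decouples parameters from compute, here is a minimal top-1 mixture-of-experts feed-forward layer in PyTorch (an illustrative sketch, not the architecture used in the paper): with E experts the layer stores roughly E times the parameters of a single feed-forward block, but each token is routed through only one expert, so per-token compute stays close to that of the dense block. ParamShare goes the other way, tying the same weights across layers.

```python
# Minimal top-1 MoE feed-forward layer (illustrative sketch, not the paper's model).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)          # learned gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (n_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)              # routing probabilities
        expert_idx = gate.argmax(dim=-1)                       # one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():                                     # run only the routed tokens
                out[mask] = expert(x[mask]) * gate[mask, i:i + 1]
        return out

# ~8x the parameters of one dense FFN, but each token still uses a single FFN's FLOPs.
moe = Top1MoE(d_model=256, d_ff=1024, n_experts=8)
dense_ffn = nn.Sequential(nn.Linear(256, 1024), nn.ReLU(), nn.Linear(1024, 256))
n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(moe) / n_params(dense_ffn))                     # ~8.0
```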
Image source: here

The results show that the model with fewer parameters is less affected by repeated tokens. In contrast, the MoE model (with a greater number of parameters) is more prone to overfitting. The result is interesting because MoE has been used successfully in many AI models, so the authors suggest that although MoE is a useful technique when there is enough data, it can hurt performance when there are not enough tokens.

The authors also explored whether the training objective impacts performance degradation. Usually, there are two training objectives (a toy illustration follows this list):
  • Causal language modeling, where the model predicts the next token from the preceding ones (as in GPT-style decoders).
  • Masked language modeling / span denoising, where the model reconstructs masked-out tokens or spans (as in T5).
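A toy example of how the same sentence would be turned into training examples under each objective (simplified; real tokenizers and sentinel handling differ):

```python
# Toy illustration of the two objectives on the same sentence (simplified;
# real tokenisers and sentinel tokens differ).
tokens = "repeated tokens can hurt large language models".split()

# 1) Causal language modelling: predict each next token from its prefix.
causal_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# 2) Span corruption / denoising (T5-style): mask a span, then predict it back.
masked_input = tokens[:2] + ["<extra_id_0>"] + tokens[4:]
denoise_target = ["<extra_id_0>"] + tokens[2:4]

print(causal_pairs[2])                      # (['repeated', 'tokens', 'can'], 'hurt')
print(masked_input, "->", denoise_target)
```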

Recently, with PaLM 2, Google introduced UL2, which is a mixture of these two training objectives. UL2 has been shown to speed up model training; interestingly, however, UL2 is more prone to overfitting and shows greater multi-epoch degradation.

Image source: here

The authors next explored how they might try to alleviate multi-epoch degradation. Since regularization techniques are used precisely to prevent overfitting, the authors tested whether these techniques had a beneficial effect here as well.

Dropout proves to be one of the most effective techniques for alleviating the problem. This is not surprising because, being one of the most effective regularization techniques, it is easily parallelized and already used by most models.

Image source: here

Furthermore, the authors find it works best to start training without dropout and add dropout only at a later point in training.
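A minimal sketch of that schedule in PyTorch: run the first part of training with dropout disabled, then switch every `nn.Dropout` module to the target rate. The step threshold and the target probability here are assumptions for illustration, not the values used in the paper.

```python
# Illustrative sketch of delayed dropout: train with p=0 at first, then switch
# every nn.Dropout module to the target rate part-way through training.
# The 3,000-step threshold and p=0.1 are assumptions, not the paper's values.
import torch.nn as nn

def set_dropout(model: nn.Module, p: float) -> None:
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = p                     # nn.Dropout reads self.p at forward time

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Dropout(p=0.1), nn.Linear(512, 512))
total_steps, dropout_start_step = 10_000, 3_000

set_dropout(model, 0.0)                      # phase 1: no dropout
for step in range(total_steps):
    if step == dropout_start_step:
        set_dropout(model, 0.1)              # phase 2: regularise the rest of training
    # ... forward pass, loss, backward pass, optimiser step would go here ...
```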

Image source: here

On the other hand, the authors note that using dropout in some models, especially larger ones, can lead to a slight reduction in performance. So although it can have beneficial effects against overfitting, it can lead to unexpected behavior in other contexts. So much so that models such as GPT-3, PaLM, LLaMA, Chinchilla, and Gopher do not use it in their architectures.

Image source: here

As described in the table below, the authors used for their experiments what are now considered almost small models. Even so, it is expensive to test different hyperparameters when designing an LLM:

For instance, in our specific scenario, training T5-XL five times would require approximately $37,000 USD for renting Google Cloud TPUs. Considering even larger models like PaLM and GPT-4, trained on even larger datasets, this cost becomes unmanageable (source)

Image source: here

Since in their experiments a sparse MoE model approximates the behavior of the dense model (which is more computationally expensive), one can use it to search for the best hyperparameters.

For example, the authors show that one can test different learning rates on the MoE model and it exhibits the same behavior as the equivalent dense model. So, for the authors, one can test different hyperparameters with the MoE model and then train the dense model with the chosen parameters, thus saving cost:

sweeping the MoE Large model incurred an expenditure of approximately 10.6K USD on the Google Cloud Platform. Conversely, training the Dense XL model only once required 7.4K USD. Consequently, the entire development process, including sweeping, amounted to a total cost of 18K USD, which is only 0.48 times the expense of directly tuning the Dense XL model (source)
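The recipe can be summarized in a few lines. In this sketch, `train_moe_proxy` and `train_dense` are hypothetical stand-ins (simple lambdas here) for the cheap sparse-MoE run and the full dense training run, each returning a validation loss:

```python
# Sketch of the proxy-sweep recipe. `train_moe_proxy` and `train_dense` are
# hypothetical stand-ins (lambdas here) returning a validation loss; in practice
# they would be the cheap sparse-MoE run and the full dense training run.

def sweep_then_train(train_moe_proxy, train_dense, learning_rates):
    best_lr = min(learning_rates, key=train_moe_proxy)   # cheap sweep on the MoE proxy
    return best_lr, train_dense(best_lr)                 # single expensive dense run

best_lr, dense_val_loss = sweep_then_train(
    train_moe_proxy=lambda lr: abs(lr - 3e-4),           # pretend 3e-4 is the optimum
    train_dense=lambda lr: abs(lr - 3e-4) + 0.1,
    learning_rates=[1e-4, 3e-4, 1e-3, 3e-3],
)
print(best_lr, dense_val_loss)                           # 0.0003 0.1
```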

Image source: here
