GLM-130B: An Open Bilingual Pre-Trained Model
An Introduction to the GLM-130B Framework

The GLM-130B framework is a bilingual pre-trained large language model with over 130 billion parameters, capable of generating text in both English and Chinese. GLM-130B is an attempt to open-source a language model at a scale of over 100B parameters, and to discuss how frameworks of such a large scale can be pre-trained, because training a model at this scale is commonly plagued by issues like divergence and loss spikes.

In this article, we will be talking about the GLM-130B framework, which attempts to devise a method to effectively pre-train large language models with hundreds of billions of parameters. We will take a deeper dive into the workings and architecture of the GLM-130B framework, along with the training process and design choices that help increase not only the efficiency but also the stability. Initial experiments carried out to test the GLM-130B framework on a wide selection of English benchmarks showed the GLM-130B model outperforming the state-of-the-art GPT-3 framework by a considerable margin. So let's begin, and explore how the GLM-130B framework delivers such consistent, accurate, and stable results.

Large language models capable of operating in few-shot and zero-shot settings, especially those with over 100 billion parameters, exhibit attractive scaling laws. Among them, the GPT-3 framework is one of the best-performing frameworks, delivering considerable performance upgrades over its predecessor, the BERT framework. Nevertheless, despite the popularity of the GPT-3 framework and its widespread applications, the training process, and in many ways the GPT-3 framework itself, has remained non-transparent to the public. Moreover, empirically enumerating all the possible designs for training LLMs with over 100B parameters is computationally unaffordable, which makes it even more critical to come up with a pre-training method for large-scale LLM frameworks.

The above point makes sharing the workings and the training process of high-quality large-scale LLM frameworks like GPT-3 of critical value, and with the ethical concerns kept in mind, the GLM-130B framework is an attempt to pre-train an accurate, open-source LLM with over 100B parameters. During the course of their attempt, the GLM-130B development team observed that pre-training a large-scale LLM framework is often accompanied by a wide array of engineering and technical challenges in terms of pre-training stability, efficiency, and convergence.

To be more specific, GLM-130B is a bidirectional, bilingual dense framework consisting of over 130B parameters, pre-trained on 400B tokens on a cluster of 96 NVIDIA DGX-A100 GPU nodes over a span of nearly two months. Moreover, instead of opting for a GPT-style architecture, the GLM-130B framework uses the GLM or General Language Model algorithm in an attempt to leverage its autoregressive blank-infilling objective and the advantage of bidirectional attention. The following table compares the GLM-130B framework with other models with over 100B parameters, including GPT-3, BLOOM-176B, and OPT-175B.

The engineering and development concepts involved in the GLM-130B framework allow it to outperform almost every large-scale LLM framework, including GPT-3 and PaLM 540B, in a variety of cases and across a wide selection of benchmarks. The following figure compares the performance of the GLM-130B framework with models with over 100B parameters, and as can be seen, the GLM-130B framework shows significantly less generation toxicity and bias than its counterparts.

Finally, GLM-130B has been designed in a way that allows as many developers as possible to conduct studies on frameworks with over 100B parameters, and there are two ways in which the GLM-130B framework achieves this. First, instead of using over 175B parameters like BLOOM and OPT, the GLM-130B framework uses 130B parameters, because a model of this size supports inference even on a single A100 server. Second, the GPU requirements to run the GLM-130B framework are lower than those of other LLM frameworks, and the GLM-130B framework achieves this by quantizing the original framework into INT4 precision. The INT4 quantization used by the GLM-130B framework cuts memory and hardware requirements while introducing negligible performance degradation.
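
To make the memory savings concrete, below is a minimal sketch of symmetric, per-row INT4 weight quantization, the general technique behind this kind of compression. It is an illustration only: the tensor shapes, the per-row scaling, and the use of an int8 container for the 4-bit values are assumptions of this sketch, not GLM-130B's actual quantization code.

```python
import torch

def quantize_int4_symmetric(weight: torch.Tensor):
    """Symmetric per-row INT4 quantization of a weight matrix (illustrative sketch)."""
    # Symmetric INT4 covers [-8, 7]; use 7 so +max and -max both stay in range.
    max_abs = weight.abs().amax(dim=1, keepdim=True)      # per-output-row maximum
    scale = max_abs.clamp(min=1e-8) / 7.0                 # one floating-point scale per row
    q = torch.clamp(torch.round(weight / scale), -8, 7).to(torch.int8)  # int8 container for 4-bit values
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate FP16 weight for use at inference time."""
    return q.to(torch.float16) * scale.to(torch.float16)

# Example: quantize a single linear layer's weight and check the reconstruction error.
w = torch.randn(4096, 4096, dtype=torch.float16)
q, s = quantize_int4_symmetric(w.float())
w_hat = dequantize(q, s)
print("max abs error:", (w - w_hat).abs().max().item())
```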

GLM-130B : Architecture

The inductive bias of a machine learning model is described by its architecture, and it does not come as a surprise that developers cannot explore many alternative architectural designs for large language models, given the limits on computational affordability and viability. With that being said, let's have a look at GLM-130B's architecture.

Large-scale LLM frameworks like PaLM, GPT, and others have over 100B parameters, and they are built on the traditional decoder-only, GPT-style architecture for autoregressive language modeling. The GLM-130B framework, however, explores the potential of using a bidirectional General Language Model or GLM, a transformer-based language model that leverages autoregressive blank infilling as its training objective, as its foundation. Briefly, for a given text sequence, the GLM framework samples text spans which are then each replaced with a single mask token.

The bidirectional attention of the General Language Model over uncorrupted or unmasked contexts is what separates the GLM-130B framework from the GPT-style approach, which uses unidirectional attention. Moreover, to support both generation and understanding of data, the GLM framework combines two corruption strategies, each of which is indicated by a special and unique mask token:

  • [MASK] : a corruption strategy that masks short spans in sentences, the lengths of which add up to a certain percentage of the input. 
  • [gMASK] : a corruption strategy that masks a random-length blank towards the end of the sentence, keeping the prefix as context. 
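
To give a rough feel for how these two strategies turn a sentence into a training example, the sketch below corrupts a token sequence with either a [MASK] span or a trailing [gMASK] span. The whitespace tokenization, the <sop> separator, and the span-length bounds are simplified assumptions of this illustration, not GLM's actual data pipeline.

```python
import random

MASK, GMASK, SOP = "[MASK]", "[gMASK]", "<sop>"

def corrupt_with_mask(tokens, span=(2, 4)):
    """Replace one short span with [MASK]; the model autoregressively infills it."""
    start = random.randrange(0, len(tokens) - span[1])
    length = random.randint(*span)
    masked_out = tokens[start:start + length]
    corrupted = tokens[:start] + [MASK] + tokens[start + length:]
    # Input = corrupted context (attended bidirectionally), then the span is generated after <sop>.
    return corrupted + [SOP], masked_out

def corrupt_with_gmask(tokens):
    """Keep a prefix as context and mask everything after it with [gMASK]."""
    split = random.randrange(1, len(tokens))
    prefix, suffix = tokens[:split], tokens[split:]
    return prefix + [GMASK, SOP], suffix

sentence = "the quick brown fox jumps over the lazy dog".split()
print(corrupt_with_mask(sentence))
print(corrupt_with_gmask(sentence))
```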

The approach followed by the GLM framework is what allows it to record an accuracy score of over 80% on zero-shot LAMBADA language modeling, outperforming both the PaLM 540B and GPT-3 frameworks.

Layer Normalization

One of the major challenges faced by developers when training an LLM framework is training instability, and using an appropriate Layer Normalization (LN) can help with the training of LLMs. The GLM-130B framework uses a Post-LN approach, initialized with DeepNorm, because of its performance on downstream tasks.
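
The sketch below shows the general shape of a DeepNorm-style Post-LN sub-layer, where normalization happens after the residual addition and the residual branch is scaled by a constant. The scaling constant, hidden sizes, and the use of a plain FFN sub-layer are illustrative assumptions of this sketch, not GLM-130B's actual layer implementation.

```python
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Post-LN residual sub-layer in the DeepNorm style: LayerNorm(alpha * x + sublayer(x))."""
    def __init__(self, hidden: int, num_layers: int):
        super().__init__()
        # Residual scale grows with depth; the exact formula here is an assumed, illustrative choice.
        self.alpha = (2 * num_layers) ** 0.5
        self.norm = nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalization is applied *after* the residual addition (Post-LN),
        # which keeps the value scale of the main branch bounded.
        return self.norm(self.alpha * x + self.ffn(x))

block = PostLNBlock(hidden=1024, num_layers=70)
out = block(torch.randn(2, 8, 1024))
print(out.shape)
```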

FFNs and Positional Encoding

Feedforward Neural Networks (FFNs) and positional encoding are the other two components where the GLM-130B framework makes choices aimed at strong downstream performance and training stability: it adopts Rotary Positional Encoding (RoPE) and a GLU-based FFN with GeLU activation.
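
For readers unfamiliar with the rotary scheme, here is a minimal sketch of applying rotary positional embeddings to a batch of query or key vectors. The head dimension, the base frequency of 10000, and the "rotate-half" pairing of dimensions are common defaults assumed for this illustration; they are not taken from GLM-130B's code.

```python
import torch

def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional embedding to a (seq_len, dim) tensor of queries or keys."""
    seq_len, dim = x.shape
    half = dim // 2
    # Per-pair rotation frequencies and per-position angles.
    inv_freq = 1.0 / (base ** (torch.arange(0, half).float() / half))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)   # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(16, 64)
print(rotary_embedding(q).shape)  # torch.Size([16, 64])
```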

Pre-Training Setup

The pre-training objective of the GLM-130B framework includes not only multi-task learning for a small fraction of tokens, but also the self-supervised GLM objective of autoregressive blank infilling, with the expectation that this combination will help the GLM-130B framework on downstream tasks. With that being said, the pre-training setup of the GLM-130B framework looks like the following.

Self-Supervised Blank Filling

As already mentioned, the GLM-130B framework uses two corruption strategies, namely [MASK] and [gMASK], and one of these strategies is independently applied to each individual training sequence. For blank infilling, the [MASK] strategy masks consecutive spans in 30% of the training sequences, where the lengths of the spans add up to 15% of the input and follow a Poisson distribution. For the remaining 70% of the sequences, the prefix of each sequence is kept as context and the [gMASK] strategy masks the rest of it, with the masked length sampled from a uniform distribution.
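
The sampling logic described above can be sketched in a few lines: choose a strategy with a 30/70 split, then draw span lengths from a Poisson distribution (for [MASK]) or a uniform distribution (for [gMASK]). The Poisson rate of 3 and the simplified bookkeeping are assumptions of this sketch, not the framework's exact data-loading code.

```python
import random

import numpy as np

def sample_corruption(seq_len: int, mask_ratio: float = 0.15, poisson_lam: float = 3.0):
    """Decide the corruption for one training sequence, following the 30/70 split described above."""
    if random.random() < 0.30:
        # [MASK]: draw span lengths from a Poisson distribution until ~15% of tokens are covered.
        spans, covered = [], 0
        while covered < mask_ratio * seq_len:
            length = max(1, int(np.random.poisson(poisson_lam)))
            spans.append(length)
            covered += length
        return ("[MASK]", spans)
    # [gMASK]: keep a prefix as context and mask a suffix of uniformly sampled length.
    suffix_len = random.randint(1, seq_len - 1)
    return ("[gMASK]", [suffix_len])

print(sample_corruption(2048))
```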

Multi-Task Instruction Pre-Training (MIP)

It has been shown that following a multi-task learning approach during pre-training can deliver better results than fine-tuning alone for improving task transfer in a zero-shot setting. Consequently, the GLM-130B framework proposes to use an array of instruction-prompted datasets covering language generation, understanding, and information extraction during pre-training.

Compared to other approaches for zero-shot task transfer that make use of multi-task prompted fine-tuning, the Multi-Task Instruction Pre-Training approach followed by the GLM-130B framework accounts for only 5% of the total tokens, and it is applied during the pre-training phase in an attempt to avoid spoiling the LLM framework's other abilities, in other words, unconditional free generation.

3D Parallel Strategy

There are two de facto practices for training large-scale models with billions of parameters: tensor model parallelism and data parallelism. To handle the immense GPU memory requirements while keeping GPU utilization high, the GLM-130B framework implements a 3D parallel strategy that combines pipeline model parallelism with the tensor model parallelism and data parallelism strategies.
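
As a back-of-the-envelope illustration of what "3D" means here, the sketch below factors a fixed GPU budget into data, tensor, and pipeline parallel degrees. The specific degrees shown are illustrative assumptions, not GLM-130B's published configuration; only the 96-node, 8-GPU-per-node budget comes from the article.

```python
# A 3D parallel layout factors the total GPU count into three orthogonal groups:
#   total_gpus = data_parallel * tensor_parallel * pipeline_parallel
# One full copy of the model is sharded over (tensor_parallel * pipeline_parallel) GPUs,
# and data_parallel replicas of that copy process different micro-batches.

def layout(total_gpus: int, tensor_parallel: int, pipeline_parallel: int) -> dict:
    assert total_gpus % (tensor_parallel * pipeline_parallel) == 0
    return {
        "tensor_parallel": tensor_parallel,       # splits each matmul across GPUs within a node
        "pipeline_parallel": pipeline_parallel,   # splits the layer stack into sequential stages
        "data_parallel": total_gpus // (tensor_parallel * pipeline_parallel),
    }

# 96 DGX-A100 nodes x 8 GPUs = 768 GPUs; the degrees below are hypothetical examples.
print(layout(total_gpus=768, tensor_parallel=4, pipeline_parallel=8))
# {'tensor_parallel': 4, 'pipeline_parallel': 8, 'data_parallel': 24}
```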

GLM-130B : Training Stability

Training stability is an important factor in determining an LLM's quality, and it is heavily influenced by the number of tokens the model passes through. Moreover, it is important to establish a trade-off between stability and efficiency with regard to floating-point formats, given the computing constraints. For instance, low-precision floating-point formats boost computing efficiency, but they often result in training collapses because they are prone to underflow and overflow errors.
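
A tiny demonstration of that risk: FP16 has a narrow dynamic range, so large values overflow to infinity and very small values underflow to zero.

```python
import torch

# FP16 normal values span roughly 6e-5 to 65504, so:
print(torch.tensor(70000.0, dtype=torch.float16))  # inf  (overflow)
print(torch.tensor(1e-8, dtype=torch.float16))     # 0.0  (underflow to zero)
```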

Mixed Precision

In an attempt to boost training accuracy and reduce memory usage, the GLM-130B framework follows the common practice of using mixed precision, i.e., FP16 for the forward and backward passes, and FP32 for master weights and optimizer states. Just like other popular LLM frameworks, including BLOOM-176B and OPT-175B, the training of the GLM-130B framework with this mixed-precision strategy faces frequent loss spikes, and the frequency of these spikes tends to increase as the model continues to train. Moreover, there are major precision issues that developers face when they scale up transformers.
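
Below is a minimal sketch of that common mixed-precision recipe (FP16 compute with FP32 master weights and optimizer states, plus loss scaling) using standard PyTorch AMP utilities. It is a generic illustration of the practice, not GLM-130B's actual training loop, and it assumes a CUDA device is available.

```python
import torch
from torch.cuda.amp import GradScaler, autocast

model = torch.nn.Linear(1024, 1024).cuda()          # parameters kept in FP32 ("master weights")
optimizer = torch.optim.AdamW(model.parameters())   # optimizer states also stay in FP32
scaler = GradScaler()                                # loss scaling guards against FP16 underflow

def train_step(x, y):
    optimizer.zero_grad(set_to_none=True)
    with autocast():                                 # forward/backward compute runs in FP16
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)                           # unscales grads, skips the step on inf/nan
    scaler.update()
    return loss.item()
```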

First, the value scale of the main branch of the transformer can become vast in the deeper layers when using Pre-LN, and in the GLM-130B framework this is addressed by using the DeepNorm-based Post-LN, which ensures that the value scale stays bounded at all times. Second, as the model scales up, the attention scores grow to a point where they exceed FP16's range.

Embedding-Layer Gradient Shrink or EGS

Developers working on the GLM-130B framework identified that the gradient norm can act as an informative indicator of training collapses: a training collapse usually lags a few steps behind a spike in the gradient norm. The cause of these spikes is the abnormal gradients of the embedding layer; the developers observed that, compared to the gradient norm of the other layers, the gradient norm of the embedding layer is larger by several orders of magnitude, and it also tends to fluctuate dramatically during the early training of the framework. Vision models face a similar issue and handle it by freezing the patch projection layer; however, the same approach cannot be applied to LLMs, since in language models the embedding layers cannot be frozen.
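
The gradient-shrink idea can be expressed with the standard detach trick: the forward value of the embedding output is unchanged, but only a fraction alpha of the gradient flows back into the embedding layer. The sketch below assumes a shrink factor of 0.1 for illustration; treat the exact value and placement as assumptions rather than a verbatim copy of GLM-130B's code.

```python
import torch

def embedding_gradient_shrink(embeds: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Forward value is unchanged; the gradient reaching the embedding layer is scaled by alpha."""
    return embeds * alpha + embeds.detach() * (1.0 - alpha)

# Check: the output matches the plain embedding, but its gradient is shrunk by alpha.
emb = torch.nn.Embedding(10, 4)
tokens = torch.tensor([1, 2, 3])
out = embedding_gradient_shrink(emb(tokens), alpha=0.1)
out.sum().backward()
print(emb.weight.grad.abs().max())  # 0.1x the gradient an unshrunk embedding would receive
```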

GLM-130B : Results and Performance

To evaluate GLM-130B's performance on English tasks, the evaluation uses the same settings followed by common LLM frameworks, including PaLM and GPT-3, and because GLM-130B is a bilingual framework, it is also evaluated across several Chinese benchmarks. The GLM-130B framework's performance is measured across multiple benchmarks, including Language Modeling, MMLU or Massive Multitask Language Understanding, BIG-bench or Beyond the Imitation Game Benchmark, and CLUE or Chinese Language Understanding Evaluation. So let's get started.

Language Modeling

The Language Modeling benchmark test of the GLM-130B framework is performed across two datasets: LAMBADA and Pile. 

The LAMBADA dataset is used to test the last-word prediction capabilities of LLMs, and the GLM-130B framework achieves a zero-shot accuracy score of 80.2 in a bilingual setting, in the process setting a new benchmark record on the LAMBADA dataset.

Pile, on the other hand, is a test set that comprises a series of benchmarks for language models. On average, compared to GPT-3 and Jurassic-1, the GLM-130B framework delivers the best performance on the 18 shared test sets in terms of weighted BPB. The results demonstrate the strong language capabilities of the GLM-130B framework, and they are included in the table below.

MMLU or Massive Multitask Language Understanding

MMLU or Massive Multitask Language Understanding is a diverse benchmark comprising over 50 multiple-choice question-answering tasks concerning human intelligence and knowledge, ranging from high-school to expert level. It was released after the crawling of the Pile test set, and it therefore serves as an ideal test bed to evaluate the few-shot learning capabilities of an LLM.

As can be seen, in a few-shot setting (5-shot), the performance of the GLM-130B framework approaches that of the GPT-3 model after viewing close to 300B tokens. The performance continues to improve as training proceeds, and when training ends, the framework achieves an accuracy score of 44.8 after viewing a total of 400B tokens.

BIG-Bench or Beyond the Imitation Game Benchmark

BIG-bench or Beyond the Imitation Game Benchmark contains challenging tasks that test a model's knowledge, reasoning, and commonsense abilities. As demonstrated in the following figures, in the zero-shot setting the GLM-130B framework outperforms both the PaLM 540B and GPT-3 175B frameworks, which is likely due to MIP and bidirectional context attention boosting GLM-130B's performance on unseen tasks in the zero-shot setting. Moreover, as the number of shots increases, the performance of the GLM-130B framework also improves, consistently outperforming the GPT-3 framework.

CLUE or Chinese Language Understanding Evaluation

GLM-130B's Chinese zero-shot performance is evaluated on established NLP benchmarks, including CLUE and FewCLUE, and is compared against the 260B ERNIE Titan 3.0, the largest existing Chinese language model. As can be observed, the GLM-130B framework consistently outperforms the 260B ERNIE Titan 3.0 framework across 12 different tasks, and performs nearly 260% better than the ERNIE framework on the two abstractive MRC datasets.

Conclusion

In this article, we have talked about GLM-130B, a bilingual pre-trained large language model that aims to promote inclusive LLM research. Its architecture, engineering, and technical undertakings aim to give the AI community better insight into the architecture of LLM frameworks, training efficiency and stability, pre-training objectives, and affordable inference.
