
Can You Build Large Language Models Like ChatGPT at Half the Cost?


Large Language Models (LLMs) like GPT-3 and ChatGPT have revolutionized AI by offering natural language understanding and content generation capabilities. But their development comes at a hefty price, limiting accessibility and further research. Researchers estimate that training GPT-3 cost OpenAI around $5 million. Nevertheless, Microsoft recognized the potential and invested $1 billion in 2019 and a reported $10 billion in 2023 in OpenAI, the company behind GPT-3 and ChatGPT.

LLMs are machine learning models trained on extensive textual data for NLP applications. They are based on the transformer architecture and use attention mechanisms to perform NLP tasks like question answering, machine translation, and sentiment analysis.
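
To make the attention mechanism concrete, here is a minimal sketch of single-head scaled dot-product attention, the core operation inside a transformer layer. This is an illustrative simplification in NumPy; real LLMs add multiple heads, masking, and learned projection matrices.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention: each query attends to all keys."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # similarity between queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V                                # weighted sum of values

# Toy example: 4 tokens, 8-dimensional embeddings
Q = K = V = np.random.randn(4, 8)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```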

The question arises: can the efficiency of these large models be increased while simultaneously reducing computational cost and training time?

Several approaches, like Progressive Neural Networks, Network Morphism, intra-layer model parallelism, and knowledge inheritance, have been developed to reduce the computational cost of training neural networks. The novel LiGO (Linear Growth Operator) approach we will discuss sets a new benchmark: it halves the computational cost of training LLMs.

Before discussing this technique, it is important to examine the factors contributing to the high cost of building LLMs.

Cost of Building Large Language Models

Three major expenses for developing LLMs are as follows:

1. Computational Resources

Building LLMs requires massive computational resources to train on large datasets. The models have to process billions of parameters and learn complex patterns from huge volumes of text.

Investment in specialized hardware such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs) is required to build and train LLMs that achieve state-of-the-art performance.

For instance, GPT-3 was trained on a supercomputer with 10,000 enterprise-grade GPUs and 285,000 CPU cores.

2. Energy Consumption

The intensive computational resources required to build LLMs result in significant energy consumption. For instance, training the 175-billion-parameter GPT-3 took 14.8 days on 10,000 V100 GPUs, equivalent to roughly 3.55 million GPU-hours. Such a high level of energy consumption also has significant environmental effects.
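
As a quick back-of-the-envelope check on that figure (not taken from the source, just simple arithmetic on the numbers above):

```python
gpus = 10_000   # reported number of V100 GPUs
days = 14.8     # reported training duration

gpu_hours = gpus * days * 24
print(f"{gpu_hours:,.0f} GPU-hours")  # 3,552,000, i.e. about 3.55 million
```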

3. Data Storage & Management

LLMs are trained on large datasets. For instance, GPT-3 was trained on an enormous corpus of text drawn from Common Crawl, WebText2, Books1, Books2, and Wikipedia, among other sources. Significant infrastructure investment is required to collect, curate, and store these datasets.

In addition, cloud storage is needed to hold the data, and human expertise is needed for data preprocessing and version control. Ensuring that your data strategy complies with regulations like GDPR further adds to the cost.

LiGO Technique: Reduce the Cost of Building Large Language Models by Half

LiGO (Linear Growth Operator) is a novel technique developed by researchers at MIT to reduce the computational cost of training LLMs by up to 50%. The method initializes the weights of a larger model from those of a smaller pre-trained model, enabling efficient scaling of neural networks.

Image from the Paper: Learning to Grow Pretrained Models For Efficient Transformer Training

Yoon Kim, the senior author of the paper, says:

This method maintains the performance advantages of larger models at reduced computational cost and training time compared with training a large model from scratch. LiGO uses a data-driven linear growth operator that combines depth and width operators for optimal performance.
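
To illustrate the core idea (a deliberately simplified sketch, not the authors' implementation), the PyTorch snippet below shows a width-growth step: a larger layer's weight matrix is initialized as a learned linear transformation of a smaller pretrained layer's weights. In the paper this is combined with a depth-growth operator, and the expansion maps are fitted on data during a short warm-up before normal training resumes; the dimensions and variable names here are purely illustrative.

```python
import torch
import torch.nn as nn

d_small, d_large = 256, 512  # hidden sizes of the small and large models (illustrative)

# Pretrained weight from the small model (in practice, loaded from a checkpoint)
W_small = torch.randn(d_small, d_small)

# Learnable expansion maps for the output and input dimensions of the layer
A = nn.Parameter(torch.randn(d_large, d_small) * 0.02)
B = nn.Parameter(torch.randn(d_large, d_small) * 0.02)

def grow_width(W_small, A, B):
    """Map a (d_small x d_small) weight to a (d_large x d_large) weight via A @ W @ B^T."""
    return A @ W_small @ B.T

# Step 1: briefly optimize A and B so the grown layer reproduces the small model's behavior.
# Step 2: copy the grown weights into the large model and continue standard training.
W_large_init = grow_width(W_small, A, B)
print(W_large_init.shape)  # torch.Size([512, 512])
```

Because the grown model starts from a point that already encodes what the small model learned, it reaches the target accuracy in far fewer training steps than a randomly initialized model of the same size.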

The paper used several datasets for its text-based experiments, including the English Wikipedia corpus for training BERT and RoBERTa models and the C4 dataset for training GPT2.

Experiments with the LiGO technique included growing BERT-Small to BERT-Base, BERT-Base to BERT-Large, RoBERTa-Small to RoBERTa-Base, GPT2-Base to GPT2-Medium, and CaiT-XS to CaiT-S.

The researchers compared their approach with several baselines, including training from scratch, progressive training, bert2BERT, and KI (knowledge inheritance).

By reusing the BERT-Small model, the LiGO technique delivered 44.7% savings in FLOPs (floating-point operations) and 40.7% savings in wall-clock time compared with training BERT-Base from scratch. The LiGO growth operator also outperforms StackBERT, MSLT, bert2BERT, and KI in training efficiency.

Advantages of Using a Training Optimization Technique Like LiGO

LiGO is an efficient neural network training method with several advantages:

1. Faster Training

As stated earlier, faster training is the main advantage of the LiGO technique. It trains LLMs in half the time, increasing productivity and reducing costs.

2. Resource Efficient

LiGO is resource-efficient because it minimizes wall-clock time and FLOPs, resulting in a cheaper and more eco-friendly approach to training large transformer models.

3. Generalization

The LiGO technique has improved the performance of both language and vision transformers, suggesting that it is a generalizable technique that can be applied to a variety of tasks.

Building commercial AI products is only one facet of the overall expenses associated with AI systems. Another significant cost factor is day-to-day operations. For instance, it reportedly costs OpenAI about $700,000 per day to answer queries using ChatGPT. Researchers are expected to continue exploring approaches that make LLMs cost-effective during training and more accessible at runtime.

For more AI-related content, visit unite.ai.
