The emergence of Large Language Models (LLMs) has inspired a wide range of applications, including the development of chatbots like ChatGPT, email assistants, and coding tools. Substantial work has gone into making these models efficient enough for large-scale deployment, which has enabled ChatGPT to serve more than 100 million weekly active users. It is worth noting, however, that text generation represents only a fraction of what these models make possible.
The distinct characteristics of Text-To-Image (TTI) and Text-To-Video (TTV) models mean that these emerging tasks benefit differently from existing optimizations. A thorough examination is therefore needed to pinpoint opportunities for optimizing TTI/TTV workloads. Despite notable algorithmic advances in image and video generation models in recent years, relatively little effort has gone into optimizing the deployment of these models from a systems standpoint.
Researchers at Harvard University and Meta take a quantitative approach to characterizing the current landscape of Text-To-Image (TTI) and Text-To-Video (TTV) models, examining design dimensions such as latency and computational intensity. To do so, they construct a suite of eight representative text-to-image and text-to-video generation tasks and contrast them with widely used language models like LLaMA.
They find notable differences, showing that new system performance bottlenecks emerge even after state-of-the-art optimizations such as Flash Attention are applied. For example, convolution accounts for up to 44% of execution time in Diffusion-based TTI models, while linear layers consume up to 49% of execution time in Transformer-based TTI models.
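The paper reports these shares from operator-level profiling. As a rough illustration of how such a breakdown can be obtained, the sketch below profiles a toy convolutional block with torch.profiler; the block shapes are our own placeholders, not the models measured in the study.

```python
# Minimal sketch: operator-level time breakdown with torch.profiler.
# The toy block below stands in for one stage of a diffusion UNet.
import torch
from torch.profiler import profile, ProfilerActivity

block = torch.nn.Sequential(
    torch.nn.Conv2d(64, 64, kernel_size=3, padding=1),
    torch.nn.GroupNorm(8, 64),
    torch.nn.SiLU(),
    torch.nn.Conv2d(64, 64, kernel_size=3, padding=1),
)
x = torch.randn(1, 64, 64, 64)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        block(x)

# Rank operators by self time; on a real model, the conv/linear shares
# discussed above would show up in this table.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```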
Moreover, they find that the Temporal Attention bottleneck grows exponentially as the number of frames increases, an observation that underscores the need for future system optimizations to address this challenge. They also develop an analytical framework to model the changing memory and FLOP requirements throughout the forward pass of a Diffusion model.
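The paper's analytical framework is not reproduced here, but a toy version conveys the idea: given assumed layer shapes, one can tally FLOPs for the convolution and attention operators at each UNet resolution and watch how the balance shifts through the forward pass. All shapes and channel counts below are illustrative assumptions.

```python
# Toy analytical FLOP model (our illustrative assumptions, not the paper's
# exact framework).

def conv2d_flops(h, w, c_in, c_out, k=3):
    # Each output element needs c_in * k * k multiply-adds (2 FLOPs per MAC).
    return 2 * h * w * c_out * c_in * k * k

def self_attention_flops(seq_len, d):
    # The QK^T and (softmax)V matmuls dominate: 2 * (2 * n^2 * d).
    return 4 * seq_len ** 2 * d

# Walk an assumed UNet: resolution halves while channels double per stage.
h = w = 64        # e.g. a 512px image after an 8x VAE downsample
channels = 320
for stage in range(3):
    seq_len = h * w  # spatial attention treats each latent pixel as a token
    print(f"stage {stage}: {h}x{w}",
          f"conv={conv2d_flops(h, w, channels, channels):.3e}",
          f"attn={self_attention_flops(seq_len, channels):.3e}")
    h, w, channels = h // 2, w // 2, channels * 2
```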
Large Language Models (LLMs) are characterized by a sequence length that denotes how much context the model can consider, i.e., the number of tokens it can attend to when predicting the next word. In state-of-the-art Text-To-Image (TTI) and Text-To-Video (TTV) models, by contrast, the sequence length is directly determined by the size of the image being processed.
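Concretely, in a latent-diffusion setup like Stable Diffusion's, the spatial self-attention sequence length is the number of latent pixels, so it grows quadratically with the image's side length. A minimal sketch (the 8x VAE factor matches Stable Diffusion v1; other models differ):

```python
def attention_seq_len(image_side, vae_factor=8, unet_downsample=1):
    # Each latent "pixel" becomes one token for spatial self-attention.
    latent_side = image_side // (vae_factor * unet_downsample)
    return latent_side * latent_side

for side in (512, 768, 1024):
    print(side, attention_seq_len(side))
# 512 -> 4096 tokens, 1024 -> 16384 tokens: sequence length grows with the
# square of the image side, unlike an LLM's fixed context window.
```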
To understand the impact of scaling image size more concretely, they conduct a case study on the Stable Diffusion model and characterize the sequence length distribution during Stable Diffusion inference. They find that once techniques such as Flash Attention are applied, convolution exhibits a stronger scaling dependence on image size than attention.
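One way to sanity-check this kind of scaling behavior on local hardware is a micro-benchmark like the sketch below, which times a convolution and a fused scaled-dot-product attention call at several latent resolutions. The layer shapes are our own assumptions, and the setup is far simpler than the paper's case study.

```python
# Hedged micro-benchmark sketch (ours, not the paper's setup): how do conv
# and fused attention runtimes scale as the latent grows?
import time
import torch
import torch.nn.functional as F

conv = torch.nn.Conv2d(320, 320, kernel_size=3, padding=1)

def time_op(fn, iters=10):
    fn()  # warm-up run before timing
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters

with torch.no_grad():
    for latent in (16, 32, 64):  # ~128/256/512-pixel images after an 8x VAE
        x = torch.randn(1, 320, latent, latent)
        q = k = v = torch.randn(1, 8, latent * latent, 40)  # 8 heads, dim 40
        t_conv = time_op(lambda: conv(x))
        # scaled_dot_product_attention dispatches to a fused/Flash-style
        # kernel when one is available on the device.
        t_attn = time_op(lambda: F.scaled_dot_product_attention(q, k, v))
        print(f"latent {latent}x{latent}: "
              f"conv {t_conv*1e3:.2f} ms, attn {t_attn*1e3:.2f} ms")
```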
Check out the Paper. All credit for this research goes to the researchers of this project.