
Stability AI and its AI research lab DeepFloyd have released a research version of their latest text-to-image model, DeepFloyd IF. This cascaded pixel diffusion model is designed to generate high-quality images from text prompts. It is available under a non-commercial, research-permissible license, which allows research labs to explore and experiment with advanced text-to-image generation methods. The release reflects Stability AI's commitment to sharing advanced technologies with the broader research community, and the company plans to eventually release DeepFloyd IF as fully open source.
The newly released DeepFloyd IF model has several notable features. First, it uses the T5-XXL-1.1 language model as a text encoder, which helps it understand text prompts. The model also uses cross-attention layers to better align the text prompt with the generated image. One of its standout features is the ability to follow text descriptions closely, generating images in which multiple objects with different properties appear in various spatial relations, something earlier text-to-image models have struggled with. Another is the high degree of photorealism in the generated images, reflected in the model's zero-shot FID score of 6.66 on the COCO dataset (lower FID is better). DeepFloyd IF can also generate images with non-standard aspect ratios, both vertical and horizontal, in addition to the standard square format.
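Cross-attention is the mechanism through which image features "look at" the text encoder's token embeddings. The snippet below is a minimal, framework-free sketch of single-head scaled dot-product cross-attention, assuming toy dimensions and identity projection matrices chosen purely for illustration; it is not DeepFloyd IF's actual layer, which uses learned weights inside the diffusion network.

```python
import math


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as lists of lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def softmax(row):
    """Numerically stable softmax over a single row."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_feats, text_embeds, w_q, w_k, w_v):
    """Single-head cross-attention: queries come from image features,
    keys and values come from text token embeddings.

    Returns (output, weights), where weights[i][j] is how strongly
    image position i attends to text token j.
    """
    q = matmul(image_feats, w_q)          # (num_positions, d)
    k = matmul(text_embeds, w_k)          # (num_tokens, d)
    v = matmul(text_embeds, w_v)          # (num_tokens, d)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose to (d, num_tokens)
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    output = matmul(weights, v)           # (num_positions, d)
    return output, weights


if __name__ == "__main__":
    # Assumption: toy sizes (3 image positions, 2 text tokens, width 2)
    # and identity projections instead of learned weights.
    image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    text_embeds = [[0.5, -0.5], [1.0, 2.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    out, attn = cross_attention(image_feats, text_embeds, identity, identity, identity)
    for row in attn:
        print([round(w, 3) for w in row])
```

Each row of `attn` sums to 1, so every image position receives a weighted mix of the token values; in a trained model, the learned projections determine which words each image region attends to.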
In addition to text-to-image generation, DeepFloyd IF supports zero-shot image-to-image translation. This is done by resizing the original image to 64 pixels, adding noise through forward diffusion, and then using backward diffusion with a new prompt to denoise the image. The style can be changed further through the super-resolution modules via a text prompt. This approach makes it possible to modify the style, patterns, and details of the output image while preserving the basic form of the source image, without any fine-tuning.
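The first two steps of this image-to-image recipe (shrink the source image, then noise it with forward diffusion) can be illustrated with the standard DDPM closed-form noising rule, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. This is a sketch under several assumptions: a grayscale image stored as a list of rows with values in [-1, 1], nearest-neighbour resizing, a 1,000-step linear β schedule borrowed from the original DDPM setup (DeepFloyd IF's actual schedule and resizing method may differ), and a hypothetical `strength` parameter that picks how far into the noising process to go. The third step, backward diffusion conditioned on the new prompt, requires a trained denoiser and is not shown.

```python
import math
import random


def resize_nearest(image, size):
    """Nearest-neighbour resize of a 2-D list (rows of pixels) to size x size."""
    h, w = len(image), len(image[0])
    return [[image[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]


def linear_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for a linear beta schedule.
    Assumption: DDPM-style defaults, not IF's published schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_diffuse(x0, t, alpha_bars, rng):
    """Sample x_t ~ N(sqrt(a_bar) * x_0, (1 - a_bar) * I) in closed form."""
    a_bar = alpha_bars[t]
    return [[math.sqrt(a_bar) * p + math.sqrt(1.0 - a_bar) * rng.gauss(0.0, 1.0)
             for p in row] for row in x0]


def noise_for_img2img(image, strength=0.5, size=64, seed=0):
    """Steps 1-2 of the recipe: shrink to size x size, then noise to a
    timestep proportional to `strength` (hypothetical knob in [0, 1])."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    alpha_bars = linear_alpha_bars()
    t = min(int(strength * len(alpha_bars)), len(alpha_bars) - 1)
    small = resize_nearest(image, size)
    return forward_diffuse(small, t, alpha_bars, random.Random(seed)), t
```

Starting the reverse process from an intermediate timestep, rather than from pure noise, is one common way to keep the coarse layout of the source while letting the new prompt reshape style and detail: a lower `strength` preserves more of the original.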
DeepFloyd IF generates images from text prompts through a cascade of modules. First, a frozen T5-XXL language model encodes the text prompt into an embedding representation. Next, a base diffusion model turns that representation into a 64×64 image. Two text-conditional super-resolution models then upscale the result, first to 256×256 and then to a sharp, high-quality 1024×1024 image. The base and super-resolution models come in several versions with different parameter counts. Because the final 1024×1024 super-resolution model has not yet been released, alternative upscalers such as the Stable Diffusion x4 Upscaler can be used in its place.
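The resolution bookkeeping of this cascade can be written down as plain data. The sketch below is illustrative only: the stage names and the `Stage` record are hypothetical, and no model weights are involved. It checks that each stage consumes exactly what the previous one produces and reports the per-stage upscale factor, which shows why a 4× upscaler such as the Stable Diffusion x4 Upscaler can stand in for the unreleased final stage (256 × 4 = 1024).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    name: str
    input_res: Optional[int]   # None for the base model, which starts from noise
    output_res: int
    text_conditioned: bool


# Hypothetical description of the IF cascade; names are illustrative only.
IF_CASCADE: List[Stage] = [
    Stage("base diffusion model", None, 64, True),
    Stage("super-resolution 1", 64, 256, True),
    Stage("super-resolution 2 (or x4 upscaler)", 256, 1024, True),
]


def upscale_factors(stages: List[Stage]) -> List[int]:
    """Return the per-stage upscale factor, checking that stages chain cleanly."""
    factors = []
    for prev, cur in zip(stages, stages[1:]):
        if cur.input_res != prev.output_res:
            raise ValueError(f"{cur.name} expects {cur.input_res}px, got {prev.output_res}px")
        if cur.output_res % cur.input_res:
            raise ValueError(f"{cur.name} is not an integer upscale")
        factors.append(cur.output_res // cur.input_res)
    return factors


if __name__ == "__main__":
    print(upscale_factors(IF_CASCADE))  # [4, 4]
```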
DeepFloyd IF was trained on LAION-A, a high-quality custom dataset of 1 billion (image, text) pairs. LAION-A is an aesthetic subset of the English portion of the LAION-5B dataset, and the data were filtered with custom filters to remove inappropriate content. The model is initially released under a research license, and the creators welcome feedback to improve its performance and scalability. Potential applications span domains such as art, design, storytelling, virtual reality, and accessibility. The creators also pose several research questions about the model's technical, academic, and ethical aspects. The model weights are available on DeepFloyd's Hugging Face space, and the model card and code are available on GitHub. A Gradio demo is open to everyone, and the creators invite people to join public discussions.
Niharika
Niharika is a technical consulting intern at Marktechpost. She is a third-year undergraduate pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is highly enthusiastic about machine learning, data science, and AI, and is an avid reader of the latest developments in these fields.