Generative AI has come a long way recently. We are all familiar with ChatGPT, diffusion models, and more at this point. These tools are becoming increasingly integrated into our daily lives. We now use ChatGPT as an assistant for everyday tasks, MidJourney to aid the design process, and other AI tools to ease our routine work.
The advancement of generative AI models has enabled unique use cases that were difficult to achieve previously. We have seen someone write and illustrate an entire children's book using generative AI models. We have been telling stories the same way for ages, and this was a great example of how generative AI can revolutionize storytelling.
Visual storytelling is a powerful approach to conveying narrative content effectively to diverse audiences. Its applications in education and entertainment, such as children's books, are vast. We know that we can generate stories and illustrations individually using generative AI models, but can we actually use them to generate a visual story consistently? The question then becomes: given a story in plain text and the portrait images of one or more characters, can we generate a series of images that express the story visually?
To produce an accurate visual representation of a narrative, story visualization must meet several important requirements. First, maintaining identity consistency is crucial for depicting characters and environments consistently across frames or scenes. Second, the visual content should closely align with the textual narrative, accurately representing the events and interactions described in the story. Lastly, a clear and logical layout of objects and characters across the generated images helps guide the viewer's attention seamlessly through the narrative, facilitating understanding.
Several story visualization methods based on generative AI have been proposed. Early work relied on GAN- or VAE-based methods with text encoders to project text into a latent space, generating images conditioned on the textual input. While these approaches showed promise, they struggled to generalize to new actors, scenes, and layout arrangements. Recent attempts at zero-shot story visualization investigated the potential of adapting to new characters and scenes using pre-trained models. However, these methods lacked support for multiple characters and did not consider the importance of layout and local object structures within the generated images.
So, should we just give up on having an AI-based story visualization system? Are these limitations too difficult to tackle? Of course not! Time to meet TaleCrafter.
TaleCrafter is a novel and versatile interactive story visualization system that overcomes the limitations of previous approaches. The system consists of four key components: story-to-prompt generation (S2P), text-to-layout generation (T2L), controllable text-to-image generation (C-T2I), and image-to-video animation (I2V).
These components work together to meet the requirements of a story visualization system. The story-to-prompt generation (S2P) component leverages a large language model to generate prompts that describe the visual content of images based on instructions derived from the story. The text-to-layout generation (T2L) component uses the generated prompt to produce an image layout that provides location guidance for the main subjects. Then, the controllable text-to-image generation (C-T2I) module, the core component of the visualization system, renders images conditioned on the layout, local sketches, and prompt. Finally, the image-to-video animation (I2V) component enriches the visualization process by animating the generated images, providing a more vivid and engaging presentation of the story.
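To make the flow of these four stages concrete, here is a minimal, purely illustrative Python sketch of how such a pipeline could be wired together. None of the function names, data structures, or placeholder outputs below come from the TaleCrafter codebase; each stage simply stands in for the corresponding model call (LLM for S2P, layout predictor for T2L, diffusion model for C-T2I, animation model for I2V).

```python
# Illustrative sketch of a four-stage story visualization pipeline (S2P -> T2L -> C-T2I -> I2V).
# All names here are hypothetical placeholders, not the authors' actual API.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # normalized (x1, y1, x2, y2)

@dataclass
class Scene:
    prompt: str                                   # visual description from S2P
    layout: List[Tuple[str, Box]] = field(default_factory=list)  # (subject, box) from T2L
    image: str = ""                               # rendered frame from C-T2I
    clip: str = ""                                # short animation from I2V

def story_to_prompts(story: str) -> List[str]:
    """S2P: a real system would ask an LLM to split the story into per-scene visual prompts."""
    return [s.strip() for s in story.split(".") if s.strip()]

def text_to_layout(prompt: str) -> List[Tuple[str, Box]]:
    """T2L: predict rough bounding boxes for the main subjects in the prompt (placeholder)."""
    return [("main subject", (0.25, 0.25, 0.75, 0.75))]

def controllable_text_to_image(prompt: str, layout: List[Tuple[str, Box]],
                               portraits: Dict[str, str]) -> str:
    """C-T2I: render an image conditioned on prompt, layout, sketches, and character identities."""
    return f"<image: {prompt}>"                   # placeholder for a diffusion-model call

def image_to_video(image: str) -> str:
    """I2V: animate the still image into a short clip."""
    return f"<clip from {image}>"                 # placeholder for an animation-model call

def visualize_story(story: str, portraits: Dict[str, str]) -> List[Scene]:
    scenes = []
    for prompt in story_to_prompts(story):                        # S2P
        scene = Scene(prompt=prompt)
        scene.layout = text_to_layout(prompt)                     # T2L
        scene.image = controllable_text_to_image(prompt, scene.layout, portraits)  # C-T2I
        scene.clip = image_to_video(scene.image)                  # I2V
        scenes.append(scene)
    return scenes

if __name__ == "__main__":
    story = "A fox finds a lantern in the forest. The fox carries it home at night"
    for scene in visualize_story(story, portraits={"fox": "fox_portrait.png"}):
        print(scene.prompt, "->", scene.clip)
```

The point of the sketch is the data flow: each scene's prompt feeds the layout stage, the layout and character portraits condition the image stage, and the image feeds the animation stage.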
Overview of TaleCrafter. Source: https://arxiv.org/pdf/2305.18247.pdf
TaleCrafter's main contributions lie in two key aspects. First, the proposed story visualization system leverages large language and pre-trained text-to-image (T2I) models to generate a video from plain text stories. This versatile system can handle multiple novel characters and scenes, overcoming the limitations of previous approaches that were restricted to specific datasets. Second, the controllable text-to-image generation module (C-T2I) emphasizes identity preservation for multiple characters and provides control over layout and local object structures, enabling interactive editing and customization.
Check Out The Paper and Github link. Don't forget to join our 23k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com
🚀 Check Out 100s of AI Tools in AI Tools Club
Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He is currently pursuing a Ph.D. degree at the University of Klagenfurt, Austria, and working as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.