Utilising a fine-tuned Stable Diffusion 2.1 model on Amazon SageMaker JumpStart, I developed an AI called Owly that crafts personalised comic videos with music, starring my son's toys as the lead characters.
Every evening, it has become a cherished routine to share bedtime stories with my 4-year-old son Dexie, who absolutely adores them. His collection of books is impressive, but he's especially captivated when I create tales from scratch. Crafting stories this way also allows me to weave in moral values I want him to learn, which can be difficult to find in store-bought books. Over time, I've honed my skills in crafting personalised narratives that ignite his imagination, from dragons with broken walls to a lonely sky lantern in search of companionship. These days, I've been spinning yarns about fictional superheroes like Slow-Mo Man and Fart-Man, which have become his favourites.
While it's been a pleasant journey for me, after half a year of nightly storytelling, my creative reservoir is being tested. To keep him engaged with fresh and exciting stories without exhausting myself, I need a more sustainable solution: an AI that can generate fascinating tales automatically! I named her Owly, after his favourite bird, an owl.
As I began assembling my wish list, it quickly ballooned, driven by my eagerness to test the frontiers of modern technology. No ordinary text-based story would do: I envisioned an AI crafting a full-blown comic with up to 10 panels. To amp up the excitement for Dexie, I aimed to customise the comic with characters he knew and loved, like Zelda and Mario, and perhaps even toss in his toys for good measure. Frankly, the personalisation angle emerged from a need for visual consistency across the comic strips, which I'll dive into later. But hold your horses, that's not all: I also wanted the AI to narrate the story aloud, backed by a fitting soundtrack to set the mood. Tackling this project would be equal parts fun and challenging for me, while Dexie would be treated to a tailor-made, interactive storytelling extravaganza.
To meet these requirements, I realised I needed to assemble five marvellous modules:
- The Story Script Generator, conjuring up a multi-paragraph story where each paragraph will be transformed into a comic strip section. Plus, it recommends a musical style to pluck a fitting tune from my library. To pull this off, I enlisted the mighty OpenAI GPT3.5 Large Language Model (LLM).
- The Comic Strip Image Generator, whipping up images for each story segment. Stable Diffusion 2.1 teamed up with Amazon SageMaker JumpStart, SageMaker Studio and Batch Transform to bring this to life.
- The Text-to-Speech Module, turning the written tale into an audio narration. Amazon Polly's neural engine leaped to the rescue.
- The Video Maker, weaving the comic strips, audio narration, and music into a self-playing masterpiece. MoviePy was the star of this show.
- And finally, The Controller, orchestrating the grand symphony of all four modules, built on the mighty foundation of AWS Batch.
The game plan? Get the Story Script Generator to weave a 7–10 paragraph narrative, with each paragraph morphing into a comic strip section. The Comic Strip Image Generator then generates images for each segment, while the Text-to-Speech Module crafts the audio narration. A melodious tune will be chosen based on the story generator's recommendation. And finally, the Video Maker combines images, audio narration, and music to create a whimsical video. Dexie is in for a treat with this one-of-a-kind, interactive story-time adventure!
Before delving into the Story Script Generator, let's first explore the image generator module to provide context for any references to the image generation process. There are many text-to-image AI models available, but I selected the Stable Diffusion 2.1 model for its popularity and the ease of building, fine-tuning, and deploying it using Amazon SageMaker and the broader AWS ecosystem.
Amazon SageMaker Studio is an integrated development environment (IDE) that provides a unified web-based interface for all machine learning (ML) tasks, streamlining data preparation, model building, training, and deployment. This boosts data science team productivity by up to 10x. Within SageMaker Studio, users can seamlessly upload data, create notebooks, train and tune models, adjust experiments, collaborate with their team, and deploy models to production.
Amazon SageMaker JumpStart, a useful feature within SageMaker Studio, provides an extensive collection of widely used pre-trained AI models. Some models, including Stable Diffusion 2.1 base, can be fine-tuned with your own training set and come with a sample Jupyter Notebook. This allows you to quickly and efficiently experiment with the model.
I navigated to the Stable Diffusion 2.1 base model page and launched the Jupyter notebook by clicking the Open Notebook button.
In a matter of seconds, Amazon SageMaker Studio presented the example notebook, complete with all of the necessary code to load the text-to-image model from JumpStart, deploy the model, and even fine-tune it for personalised image generation.
Numerous text-to-image models are available, with many tailored to specific styles by their creators. Utilising the JumpStart API, I filtered and listed all text-to-image models using the filter_value "task == txt2img" and displayed them in a dropdown menu for convenient selection.
from ipywidgets import Dropdown
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models

# Retrieves all Text-to-Image generation models.
filter_value = "task == txt2img"
txt2img_models = list_jumpstart_models(filter=filter_value)
# Display the model ids in a dropdown to select a model for inference.
model_dropdown = Dropdown(
options=txt2img_models,
value="model-txt2img-stabilityai-stable-diffusion-v2-1-base",
description="Select a model",
style={"description_width": "initial"},
layout={"width": "max-content"},
)
display(model_dropdown)
# Or simply hard code the model id and version=*.
# E.g. if we want the latest 2.1 base model
self._model_id, self._model_version = (
"model-txt2img-stabilityai-stable-diffusion-v2-1-base",
"*",
)
The model I required was model-txt2img-stabilityai-stable-diffusion-v2-1-base, which supports fine-tuning.
In under 5 minutes, utilising the provided code, I deployed the model to a SageMaker endpoint running a g4dn.2xlarge GPU instance and swiftly generated my first image from a text prompt, which you can see showcased below.
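For reference, here is a minimal sketch of what that deployment and first query look like, modelled on the JumpStart sample notebook; the role, the prompt, and the response key are assumptions on my part, so check them against the notebook you launch.
import json
import numpy as np
from PIL import Image
from sagemaker import image_uris, model_uris, script_uris
from sagemaker.model import Model
from sagemaker.predictor import Predictor

model_id, model_version = "model-txt2img-stabilityai-stable-diffusion-v2-1-base", "*"
instance_type = "ml.g4dn.2xlarge"

# Retrieve the inference container, script and model artefacts from JumpStart
deploy_image_uri = image_uris.retrieve(
    region=None, framework=None, image_scope="inference",
    model_id=model_id, model_version=model_version, instance_type=instance_type,
)
deploy_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="inference"
)
base_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="inference"
)

sd_model = Model(
    image_uri=deploy_image_uri,
    source_dir=deploy_source_uri,
    model_data=base_model_uri,
    entry_point="inference.py",
    role=aws_role,  # assumed to be an existing SageMaker execution role
    predictor_cls=Predictor,
)
predictor = sd_model.deploy(initial_instance_count=1, instance_type=instance_type)

# Query the endpoint with a plain-text prompt and save the returned RGB array
response = predictor.predict(
    "a photo of a penguin standing in front of the Eiffel Tower".encode("utf-8"),
    {"ContentType": "application/x-text", "Accept": "application/json"},
)
pixels = json.loads(response)["generated_image"]  # response key as used in the sample notebook
Image.fromarray(np.array(pixels, dtype=np.uint8)).save("first_image.png")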
Amazon SageMaker Studio streamlines my experimentation and prototyping process, allowing me to quickly try various image generation prompts and examine the resulting images directly within the IDE using the file explorer and the preview window. Moreover, I can upload images within the IDE, use the built-in terminal to run the AWS CLI for uploading and downloading images to and from an S3 bucket, and execute SageMaker batch transform jobs against my models to generate numerous images at once for large-scale testing.
The task of this module is quite straightforward: produce a story script given a story topic and a character name. Generating a story on a given topic with the GPT3.5 API is incredibly easy.
import openai

openai.api_key = self._api_key

prompt = "Write me a 1000-word story about Bob the penguin who wants to travel to Europe to see famous landmarks"
response = openai.Completion.create(
model="text-davinci-003",
prompt=prompt,
temperature=0.7,
max_tokens=2089,
top_p=1,
frequency_penalty=0,
presence_penalty=0
)
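For completeness, the generated story can then be read straight off the response object; a minimal sketch, assuming the classic Completion API response shape:
# Pull the generated story out of the response and split it into paragraphs,
# one per comic strip section
story_text = response["choices"][0]["text"].strip()
paragraphs = [p.strip() for p in story_text.split("\n") if p.strip()]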
For example, using the prompt "Write me a 1000-word story about Bob the penguin who wants to travel to Europe to see famous landmarks. He learns that his bravery and curiosity lead him to experience many exciting things." GPT3.5 will effortlessly craft a fascinating story on this topic as if it were penned by a professional storyteller, much like the example below.
Bob the penguin had always dreamed of travelling to Europe and visiting famous landmarks. He had heard stories from his friends about the Eiffel Tower in Paris, the Colosseum in Rome, and Big Ben in London. He had grown tired of his routine life in Antarctica and yearned for adventure.
One day, Bob decided to take the plunge and start planning his trip. He spent hours researching the best travel routes and the most affordable accommodations. After careful consideration, he decided to begin his journey in Paris.
The boat ride was long and tiring, but he was excited to finally be in Europe. He checked into his hotel and immediately set off to see the Eiffel Tower. As he walked through the streets of Paris, he felt a sense of wonder and excitement that he had never felt before.
Over the next few days, he visited famous landmarks like the Louvre Museum, Notre Dame Cathedral, and the Palace of Versailles. He tried new foods and met new people, each experience adding to his adventure.
The story itself is fantastic, but to transform it into comic strips, I need to divide the story into sections and create an image for each one. The most sensible approach would be to convert each paragraph into a section. However, as you can see, the images generated from those paragraphs present some significant challenges.
- Character chaos ensued! Each comic strip depicted Bob as an entirely different character. In the first strip, he's a penguin doll; in the second, he's a person's leg; in the third, a suited-up gentleman; and in the fourth, a man in a blue shirt. This happens because only the first paragraph mentions "Bob the penguin," while the second refers to him as just "Bob," and the rest as "he." Given this scant information, it's no wonder the image generator portrayed Bob in so many different guises.
- The scenes also lacked focus. The first comic strip showed a penguin doll sitting on a table instead of an Antarctic-dwelling penguin dreaming of European adventures. A similar issue arose in the second comic strip, which showed someone's leg soaring high above a city. It appears the image generator interpreted "One day, Bob decided to take the plunge" as skydiving from an airplane, while the paragraph's focus should have been on Bob planning his trip to Europe. Long paragraphs with multiple focal points often confuse the image generator, resulting in out-of-context images.
To tackle the out-of-focus scenes and improve character consistency, I refined my prompt to request a concise, one-sentence scene description for each paragraph, wrapped in [] so that it can be programmatically extracted. This allowed me to provide detailed examples and guide the image generator towards more focused and accurate images.
For each section please describe the scene in detail and always include the location in one sentence within [] with the following format [a photo of character in the location], [a photo of character in front of an object], [a photo of character next to an object], [a photo of a location]
With the updated prompt, here’s the resulting story that was generated.
[a photo of Bob the penguin in Antarctica]
Bob the penguin was a happy and curious penguin who lived in the Antarctic. He was content with his life there, surrounded by his family and friends. But one day, he decided to take the plunge and explore the world beyond the icy continent. He had heard stories of the many beautiful and exotic places around the world, and he wanted to experience them for himself.
[a photo of Bob the penguin reading a book]
Bob the penguin began researching the world, studying maps and reading up on different countries and cultures. He was particularly drawn to Europe, with its many famous landmarks and sights. He decided that Europe was the place he wanted to visit, so he began to plan his journey.
[a photo of Bob the penguin on a cruise ship]
He began the long journey by boat. He was excited and couldn't wait to get there, and he was determined to make it to Europe. After a few weeks of travelling, he finally arrived at his destination.
[a photo of Bob the penguin at Eiffel Tower]
Bob the penguin began exploring Europe and was amazed by all the different places he visited. He went to the Eiffel Tower in Paris, the Colosseum in Rome, and the Cliffs of Moher in Ireland. Everywhere he went he was filled with awe and delight.
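Since each scene description is wrapped in [], pulling them out of the generated story is straightforward with a regular expression; a minimal sketch (the helper name is mine):
import re

def extract_scene_prompts(story_text: str) -> list[str]:
    """Return the one-sentence scene descriptions the story generator wrapped in []."""
    return [scene.strip() for scene in re.findall(r"\[([^\]]+)\]", story_text)]

scene_prompts = extract_scene_prompts(story_text)
# e.g. ['a photo of Bob the penguin in Antarctica',
#       'a photo of Bob the penguin reading a book', ...]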
As you can observe, the generated scene descriptions are considerably more focused. They mention a single scene, a location, and/or an activity being performed, often starting with the character's name. These concise prompts prove far more effective for my image generator, as evidenced by the improved images generated below.
Bob the penguin has made a triumphant return, but he's still sporting a new look in each comic strip. Because the image generation process treats each image individually, and no information is provided about Bob's colour, size, or type of penguin, consistency remains elusive.
I previously considered generating a detailed character description as part of the story generation to maintain character consistency across images. However, this approach proved impractical for two reasons:
- Sometimes it's nearly impossible to describe a character in enough detail without resorting to an overwhelming amount of text. While there may not be many kinds of penguins, consider birds in general: with countless shapes, colours, and species such as cockatoos, parrots, canaries, pelicans, and owls, the task becomes daunting.
- The generated character doesn't always adhere to the description provided in the prompt. For example, a prompt describing a green parrot with a red beak might result in an image of a green parrot with a yellow beak instead.
So, despite our best efforts, our penguin pal Bob continues to experience something of an identity crisis.
The solution to our penguin predicament lies in giving the Stable Diffusion model a visual cue of what our penguin character should look like, to influence the image generation process and maintain consistency across all generated images. In the world of Stable Diffusion, this process is known as fine-tuning, where you supply a handful (usually 5 to 15) of images containing the same object and a sentence describing it. These images shall henceforth be known as training images.
As it turns out, this added personalisation is not just a solution but also a mighty cool feature for my comic generator. Now, I can use many of Dexie's toys as the main characters in the stories, such as his festive Christmas penguin breathing new life into Bob the penguin, making the stories even more personalised and relatable for my young but tough audience. So, the quest for consistency turns into a triumph for tailor-made tales!
During my exhilarating days of experimentation, I discovered a few nuggets of wisdom for achieving the best results when fine-tuning the model and reducing the chance of overfitting:
- Keep the backgrounds in your training images diverse. This way, the model won't confuse the backdrop with the object, preventing unwanted background cameos in the generated images.
- Capture the target object from various angles. This provides more visual information, enabling the model to generate the object from a greater range of angles and thus better match the scene.
- Mix close-ups with full-body shots. This ensures the model doesn't assume a specific pose is mandatory, granting more flexibility for the generated object to harmonise with the scene.
To perform the Stable Diffusion model fine-tuning, I launched a SageMaker Estimator training job with the Amazon SageMaker Python SDK on an ml.g5.2xlarge GPU instance and pointed the training process at my collection of training images in an S3 bucket. The resulting fine-tuned model file is then saved to s3_output_location. And, with just a few lines of code, the magic began to unfold!
from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base

# [Optional] Override default hyperparameters with custom values
hyperparams["max_steps"] = 400
hyperparams["with_prior_preservation"] = False
hyperparams["train_text_encoder"] = False

training_job_name = name_from_base(f"stable-diffusion-{self._model_id}-transfer-learning")
# Create SageMaker Estimator instance
sd_estimator = Estimator(
role=self._aws_role,
image_uri=image_uri,
source_dir=source_uri,
model_uri=model_uri,
entry_point="transfer_learning.py", # Entry-point file in source_dir and present in train_source_uri.
instance_count=self._training_instance_count,
instance_type=self._training_instance_type,
max_run=360000,
hyperparameters=hyperparams,
output_path=s3_output_location,
base_job_name=training_job_name,
sagemaker_session=session,
)
# Launch a SageMaker Training job by passing s3 path of the training data
sd_estimator.fit({"training": training_dataset_s3_path}, logs=True)
To prepare the training set, ensure it contains the following files:
- A series of images named instance_image_x.jpg, where x is a number from 1 to N. Here, N represents the number of images, ideally more than 10.
- A dataset_info.json file that includes a mandatory field called instance_prompt. This field should provide a detailed description of the object, with a unique identifier preceding the object's name. For example, "a photo of Bob the penguin," where 'Bob' acts as the unique identifier. By using this identifier, you can direct your fine-tuned model to generate either an ordinary penguin (referred to as "penguin") or the penguin from your training set (referred to as "Bob the penguin"). Some sources suggest using unique names such as sks or xyz, but I found that it isn't necessary to do so.
The dataset_info.json file may also include an optional field called class_prompt, which offers a general description of the object without the unique identifier (e.g., "a photo of a penguin"). This field is used only when the prior_preservation parameter is set to True; otherwise, it will be disregarded. I'll discuss it more in the advanced fine-tuning section below.
{"instance_prompt": "a photograph of bob penguin",
"class_prompt": "a photograph of a penguin"
}
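To make the layout concrete, here is a minimal sketch of writing dataset_info.json and uploading the renamed training images with boto3; the bucket, prefix, and local folder names are mine:
import json
from pathlib import Path
import boto3

s3 = boto3.client("s3")
bucket, prefix = "owly-training-data", "penguin-images"  # assumed names

# dataset_info.json describes the object with its unique identifier
dataset_info = {
    "instance_prompt": "a photo of bob penguin",
    "class_prompt": "a photo of a penguin",  # only used when prior preservation is enabled
}
s3.put_object(Bucket=bucket, Key=f"{prefix}/dataset_info.json", Body=json.dumps(dataset_info))

# Upload the training images as instance_image_1.jpg ... instance_image_N.jpg
for i, path in enumerate(sorted(Path("local_training_images").glob("*.jpg")), start=1):
    s3.upload_file(str(path), bucket, f"{prefix}/instance_image_{i}.jpg")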
After a few test runs with Dexie's toys, the image generator delivered some truly impressive results. It brought Dexie's kangaroo magnetic-block creation to life, hopping its way into the virtual world. The generator also masterfully depicted his beloved shower turtle toy swimming underwater, surrounded by a vibrant school of fish. It definitely captured the magic of Dexie's playtime favourites!
Batch Transform against fine-tuned Stable Diffusion model
Since I needed to generate over 100 images for each comic strip, deploying a SageMaker endpoint (think of it as a REST API) and generating one image at a time wasn't the most efficient approach. Instead, I opted to run a batch transform against my model, supplying it with text files in an S3 bucket containing the prompts used to generate the images.
I'll provide more details about this process since I initially struggled with it, and I hope my explanation will save you some time. You'll need to prepare one text file per image prompt with the following JSON content: {"prompt": "a photo of Bob the penguin in Antarctica"}. While it seems there's a way to combine multiple inputs into one file using the MultiRecord strategy, I was unable to figure out how it works.
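In practice this just means dropping one small JSON file per scene prompt into the batch transform input prefix; a minimal sketch, reusing the scene descriptions extracted earlier (bucket and key names are mine):
import json
import boto3

s3 = boto3.client("s3")
bucket = "owly-staging"  # assumed bucket name
input_prefix = f"processing/{job_id}/batch_transform_input"

# One input file per image prompt; each file holds a single JSON object
for i, scene_prompt in enumerate(scene_prompts):
    s3.put_object(
        Bucket=bucket,
        Key=f"{input_prefix}/prompt_{i:03d}.json",
        Body=json.dumps({"prompt": scene_prompt}),
    )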
Another challenge I encountered was executing a batch transform against my fine-tuned model. You can't execute a batch transform using a transformer object returned by Estimator.transformer(), which usually works in my other projects. Instead, you must first create a SageMaker model object, specifying the S3 location of your fine-tuned model as the model_data. From there, you can create the transformer object from this model object.
from sagemaker import image_uris, model, model_uris, script_uris

def _get_model_uris(self, model_id, model_version, scope):
    # Retrieve the inference docker container uri
    image_uri = image_uris.retrieve(
        region=None,
        framework=None,  # automatically inferred from model_id
        image_scope=scope,
        model_id=model_id,
        model_version=model_version,
        instance_type=self._inference_instance_type,
    )
    # Retrieve the inference script uri. This includes scripts for model loading, inference handling etc.
    source_uri = script_uris.retrieve(
        model_id=model_id, model_version=model_version, script_scope=scope
    )
    if scope == "training":
        # Retrieve the pre-trained model tarball to further fine-tune
        model_uri = model_uris.retrieve(
            model_id=model_id, model_version=model_version, model_scope=scope
        )
    else:
        model_uri = None
    return image_uri, source_uri, model_uri
image_uri, source_uri, model_uri = self._get_model_uris(self._model_id, self._model_version, "inference")
# Get model artifact location by estimator.model_data, or give an S3 key directly
model_artifact_s3_location = f"s3://{self._bucket}/output-model/{job_id}/{training_job_name}/output/model.tar.gz"
env = {
    "MMS_MAX_RESPONSE_SIZE": "20000000",
}
# Create model from saved model artifact
sm_model = model.Model(
    model_data=model_artifact_s3_location,
    role=self._aws_role,
    entry_point="inference.py",  # entry point file in source_dir and present in deploy_source_uri
    image_uri=image_uri,
    source_dir=source_uri,
    env=env,
)
transformer = sm_model.transformer(
    instance_count=self._inference_instance_count,
    instance_type=self._inference_instance_type,
    output_path=f"s3://{self._bucket}/processing/{job_id}/output-images",
    accept="application/json",
)
transformer.transform(
    data=f"s3://{self._bucket}/processing/{job_id}/batch_transform_input/",
    content_type="application/json",
)
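Once the transform finishes, each input file produces a matching .out file in output_path containing the JSON response; here is a minimal sketch of decoding them back into PNGs, with the caveat that the exact response key depends on the model's inference script:
import json
import boto3
import numpy as np
from PIL import Image

bucket = boto3.resource("s3").Bucket("owly-staging")  # assumed bucket name
output_prefix = f"processing/{job_id}/output-images/"

for obj in bucket.objects.filter(Prefix=output_prefix):
    if not obj.key.endswith(".out"):
        continue
    body = json.loads(obj.get()["Body"].read())
    # The JumpStart inference script returns the image as a nested list of RGB values
    pixels = body.get("generated_image") or body["generated_images"][0]
    image = Image.fromarray(np.array(pixels, dtype=np.uint8))
    image.save(obj.key.rsplit("/", 1)[-1].replace(".json.out", ".png"))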
And with that, my customised image generator is all ready!
Advanced Stable Diffusion model fine-tuning
While it's not essential for my comic generator project, I'd like to touch on some advanced fine-tuning techniques involving the max_steps, prior_preservation, and train_text_encoder hyperparameters, in case they come in handy for your projects.
Stable Diffusion model fine-tuning is highly vulnerable to overfitting because of the vast difference between the number of training images you provide and the number used to train the base model. For example, you might only supply 10 images of Bob the penguin, while the base model's training set contains thousands of penguin images. A larger number of images reduces the likelihood of overfitting and of erroneous associations between the target object and other elements.
When prior_preservation is set to True, Stable Diffusion generates a default number of images (typically 100) using the class_prompt provided and combines them with your instance_images during fine-tuning. Alternatively, you can supply these images manually by placing them in the class_data_dir subfolder. In my experience, prior_preservation is often crucial when fine-tuning Stable Diffusion on human faces. When employing prior_preservation, make sure to provide a class_prompt that mentions the most suitable generic name or common object resembling your character. For Bob the penguin, this object is clearly a penguin, so your class prompt would be "a photo of a penguin". This technique can also be used to generate a blend between two characters, which I'll discuss later.
Another helpful parameter for advanced fine-tuning is train_text_encoder. Set it to True to enable text encoder training during the fine-tuning process. The resulting model will understand more complex prompts better and generate human faces with greater accuracy.
Depending on your specific use case, different hyperparameter values may yield better results. Moreover, you'll want to adjust the max_steps parameter to control the number of fine-tuning steps. Keep in mind that setting max_steps too high can also lead to overfitting.
By utilising Amazon Polly's Neural Text-to-Speech (NTTS) feature, I was able to create audio narration for each paragraph of the story. The quality of the audio narration is phenomenal, as it sounds incredibly natural and human-like, making it an ideal storyteller.
To accommodate a younger audience such as Dexie, I employed the SSML format and used the prosody tag to slow the speech rate down slightly.
import boto3

self._pollyClient = boto3.Session(region_name=aws_region).client('polly')

# Wrap the narration in SSML; the slightly slower prosody rate suits a young listener
# (the exact rate value here is an assumption)
ftext = f"<speak><prosody rate='90%'>{text}</prosody></speak>"

response = self._pollyClient.synthesize_speech(VoiceId=self._speaker,
                                               OutputFormat='mp3',
                                               Engine='neural',
                                               Text=ftext,
                                               TextType='ssml')

with open(mp3_path, 'wb') as file:
    file.write(response['AudioStream'].read())
After all the hard work, I used MoviePy, a fantastic Python framework, to magically turn all the photos, audio narration, and music into an awesome mp4 video. Speaking of music, I gave my tech the power to choose the perfect soundtrack to match the video's vibe. How, you ask? Well, I simply modified my story script generator to return a music style from a pre-determined list using some clever prompts. How cool is that?
At the beginning of the story please suggest a song style from the following list only which matches the story and put it inside <>. Song style list are action, calm, dramatic, epic, happy and touching.
Once the music style is chosen, the next step is to randomly pick an MP3 track from the relevant folder, which contains a handful of MP3 files. This adds a touch of unpredictability and excitement to the final product.
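To illustrate the assembly step, here is a minimal MoviePy sketch under my own assumptions about file naming; the real pipeline also layers in smooth transitions and a Ken Burns effect:
import random
from pathlib import Path
from moviepy.editor import (AudioFileClip, CompositeAudioClip, ImageClip,
                            concatenate_videoclips)

image_files = sorted(Path("images").glob("*.png"))         # one image per section
narration_files = sorted(Path("narration").glob("*.mp3"))  # one narration clip per section
music_path = random.choice(list(Path("music/happy").glob("*.mp3")))  # folder picked by song style

clips = []
for image_path, narration_path in zip(image_files, narration_files):
    narration = AudioFileClip(str(narration_path))
    clip = (ImageClip(str(image_path))
            .set_duration(narration.duration + 1.0)  # linger briefly after the narration ends
            .set_audio(narration))
    clips.append(clip)

video = concatenate_videoclips(clips, method="compose")

# Mix the background music underneath the narration at a lower volume
music = AudioFileClip(str(music_path)).volumex(0.2)
music = music.subclip(0, min(music.duration, video.duration))
video = video.set_audio(CompositeAudioClip([video.audio, music]))

video.write_videofile("owly_story.mp4", fps=24)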
To orchestrate the entire system, I needed a controller module in the form of a Python script that would run each module seamlessly. But, of course, I also needed a compute environment to execute this script. I had two options to explore, the first being my preferred one: a serverless architecture with AWS Lambda. This involved several AWS Lambdas paired with SQS. The first Lambda serves as a public API, with API Gateway as the entry point. This API would take in the training image URLs and story topic text, pre-process the data, and drop it into an SQS queue. Another Lambda would pick up the data from the queue and handle data preparation (image resizing and creating dataset_info.json), then trigger the next Lambda to call Amazon SageMaker JumpStart to prepare the Stable Diffusion model and execute a SageMaker training job to fine-tune it. Phew, that's a mouthful. Finally, Amazon EventBridge would be used as an event bus to detect the completion of the training job and trigger the next Lambda to execute a SageMaker Batch Transform against the fine-tuned model to generate the images.
But alas, this option wasn't feasible because an AWS Lambda function has a maximum storage limit of 10GB. When executing a batch transform against a SageMaker model, the SageMaker Python SDK downloads and extracts the model.tar.gz file temporarily into the local /tmp before sending it to the managed system that runs the batch transform. Unfortunately, my model was a whopping 5GB compressed, so the SageMaker Python SDK threw an "Out of disk space" error. For most use cases where the model size is smaller, this would be the best and cleanest solution.
So, I had to resort to my second option: AWS Batch. It worked well, but it did cost a bit more because the AWS Batch compute instance had to run throughout the entire process, even while the model fine-tuning and the batch transform were executing in separate compute environments within SageMaker. I could have split the process into several AWS Batch jobs and glued them together with Amazon EventBridge and SQS, just as I would have done with the serverless approach. But with AWS Batch's longer startup time (around 5 minutes), that would have added far too much latency to the overall process. So, I went with the all-in-one AWS Batch option instead.
Feast your eyes upon Owly's majestic architecture diagram! Our adventure kicks off by launching AWS Batch through the AWS Console, equipping it with an S3 folder brimming with training images, a captivating story topic, and a delightful character, all supplied via AWS Batch environment variables.
# Basic settings
JOB_ID = "penguin-images" # key to S3 folder containing the training images
STORY_TOPIC = "bob the penguin who desires to travel to Europe"
STORY_CHARACTER = "bob the penguin"# Advanced settings
TRAIN_TEXT_ENCODER = False
PRIOR_RESERVATION = False
MAX_STEPS = 400
NUM_IMAGE_VARIATIONS = 5
The AWS Batch job springs into action, retrieving the training images from the S3 folder specified by JOB_ID, resizing them to 768×768, and creating a dataset_info.json file before placing everything in a staging S3 bucket.
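The resizing step itself is a few lines of Pillow; a minimal sketch with assumed local paths:
from pathlib import Path
from PIL import Image

staging_dir = Path("staging")
staging_dir.mkdir(exist_ok=True)

# Resize every training image to the 768x768 resolution the model is fine-tuned at
for i, path in enumerate(sorted(Path("training_images").glob("*.jpg")), start=1):
    with Image.open(path) as img:
        img.convert("RGB").resize((768, 768)).save(staging_dir / f"instance_image_{i}.jpg")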
Next up, we call the OpenAI GPT3.5 API to whip up a fascinating story and a complementary song style in harmony with the chosen topic and character. We then summon Amazon SageMaker JumpStart to unleash the powerful Stable Diffusion 2.1 base model. With the model at our disposal, we initiate a SageMaker training job to fine-tune it on our carefully chosen training images. After a brief 30-minute interlude, we forge image prompts for each story paragraph in the guise of text files, which are then dropped into an S3 bucket as input for the image generation extravaganza. Amazon SageMaker Batch Transform is unleashed on the fine-tuned model to produce these images in a batch, a process that lasts a mere 5 minutes.
Once complete, we enlist the help of Amazon Polly to craft audio narrations for each paragraph in the story, saving them as mp3 files in just 30 seconds. We then randomly pick an mp3 music file from libraries organised by song style, based on the selection made by our masterful story generator.
The final act sees the resulting images, audio narration mp3s, and music mp3 files expertly woven together into a video slideshow with the help of MoviePy. Smooth transitions and the Ken Burns effect are added for that extra touch of elegance. The pièce de résistance, the finished video, is then hoisted up to the output S3 bucket, awaiting your eager download!
I must say, I'm rather happy with the results! The story script generator has truly outdone itself, performing far better than anticipated. Almost every story script it crafts is not only well-written but also brimming with positive morals, showcasing the awe-inspiring prowess of Large Language Models (LLMs). As for image generation, well, it's a bit of a mixed bag.
With all of the enhancements I've described earlier, one in five stories can be used in the final video right off the bat. The remaining four, however, usually have one or two images suffering from common issues.
- First, we still have inconsistent characters. Sometimes the model conjures up a character that's slightly different from the original in the training set, often opting for a photorealistic version rather than the toy counterpart. But fear not! Adding a desired photo style to the text prompt, like "A cartoon-style Rex the turtle swimming under the sea," helps curb this issue. However, it does require manual intervention, since certain characters may warrant a photorealistic style.
- Then there's the curious case of missing body parts. Occasionally, our generated characters appear with absent limbs or heads. Yikes! To mitigate this, we added negative prompts, which the Stable Diffusion model supports, such as "missing limbs, missing head," encouraging the generation of images that avoid these peculiar attributes (see the sketch after this list).
- Bizarre images emerge when dealing with unusual interactions between objects. Generating images of characters in specific locations typically produces satisfactory results. However, when it comes to illustrating characters interacting with other objects, especially in an unusual way, the result is often less than ideal. For instance, attempting to depict Tom the hedgehog milking a cow leads to a peculiar blend of hedgehog and cow, while crafting an image of Tom the hedgehog holding a flower bouquet results in a person clutching both a hedgehog and a bouquet of flowers. Regrettably, I have yet to devise a way to remedy this, leading me to conclude that it's simply a limitation of current image generation technology. If the object or activity in the image you're attempting to generate is highly unusual, the model lacks prior knowledge, as none of its training data has ever depicted such scenes or activities.
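For reference, here is what one of those batch transform prompt payloads looks like once a style hint and negative prompts are added; whether the inference script honours negative_prompt is something to verify for your model version:
payload = {
    "prompt": "a cartoon-style photo of Rex the turtle swimming under the sea",
    # negative prompts discourage the model from producing these attributes
    "negative_prompt": "missing limbs, missing head",
}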
Ultimately, to boost the odds of success in story generation, I cleverly tweaked my story generator to produce three distinct scene descriptions per paragraph. Furthermore, for each scene, I instructed my image generator to create five image variations. With this approach, I increased the likelihood of obtaining at least one top-notch image from the fifteen available. Having three different prompt variations also helps in generating entirely unique scenes, especially when one scene proves too rare or complex to create. Below is my updated story generation prompt.
"Write me a {max_words} words story a few given character and a subject.nPlease break the story down into "
"seven to 10 short sections with 30 maximum words per section. For every section please describe the scene in "
"details and at all times include the situation in a single sentence inside [] with the next format "
"[a photo of character in the location], [a photo of character in front of an object], "
"[a photo of character next to an object], [a photo of a location]. Please provide three different variations "
"of the scene details separated by |nAt the beginning of the story please suggest song style from the next "
"list only which matches the story and put it inside <>. Song style list are motion, calm, dramatic, epic, "
"completely happy and touching."
The only additional cost is a bit of manual intervention after the image generation step completes, where I handpick the best image for each scene and then proceed with the comic generation process. This minor inconvenience aside, I now boast a remarkable success rate of nine out of ten in crafting splendid comics!
With the Owly system fully assembled, I decided to put this marvel of technology to the test one fine Saturday afternoon. I generated a handful of stories from his toy collection, ready to elevate bedtime storytelling for Dexie with a nifty portable projector I had purchased. That night, as I saw Dexie's face light up and his eyes widen with excitement at the comic playing out on his bedroom wall, I knew all my efforts had been worth it.
The cherry on top is that it now takes me under two minutes to whip up a brand-new story using photos of his toy characters I've already captured. Plus, I can seamlessly incorporate valuable morals I want him to learn from each story, such as not talking to strangers, being brave and adventurous, or being kind and helpful to others. Here are some of the delightful stories generated by this fantastic system.
As a curious tinkerer, I couldn't help but fiddle with the image generation module to push Stable Diffusion's boundaries and merge two characters into one magnificent hybrid. I fine-tuned the model with Kwazi Octonaut images, but I threw in a twist by assigning Zelda as both the unique and class character name. Setting prior_preservation to True, I ensured that Stable Diffusion would "octonaut-ify" Zelda while still keeping her distinct essence intact.
I deliberately used a modest max_steps of 400, just enough to preserve Zelda's original charm without her being entirely consumed by Kwazi the Octonaut's irresistible allure. Behold the fantastic fusion of Zelda and Kwazi, united as one!
Dexie brimmed with excitement as he witnessed a fusion of his two favourite characters spearheading the action in his bedtime story. He embarked on thrilling adventures, fighting aliens and searching for hidden treasure chests!
Unfortunately, to protect the IP owner, I cannot show the resulting images.
Generative AI, particularly Large Language Models (LLMs), is here to stay and set to become a powerful tool not only for software development but for many other industries as well. I've experienced the true power of LLMs firsthand in a few projects. Just last year, I built a robotic teddy bear called Ellie, capable of moving its head and engaging in conversations like a real human. While this technology is undeniably potent, it's vital to exercise caution to ensure the safety and quality of the outputs it generates, as it can be a double-edged sword.
And there you have it, folks! I hope you found this blog interesting. If so, please shower me with your claps. Feel free to connect with me on LinkedIn or check out my other AI endeavours on my Medium profile. Stay tuned, as I'll be sharing the complete source code in the coming weeks!
Finally, I would like to say thanks to Mike Chambers from AWS, who helped me troubleshoot my fine-tuned Stable Diffusion model batch transform code.