AutoStory: Generating Diverse Storytelling Images with Minimal Human Effort: Abstract and Intro


Authors:

(1) Wen Wang, Zhejiang University, Hangzhou, China (Equal Contribution) (wwenxyz@zju.edu.cn);

(2) Canyu Zhao, Zhejiang University, Hangzhou, China (Equal Contribution) (volcverse@zju.edu.cn);

(3) Hao Chen, Zhejiang University, Hangzhou, China (haochen.cad@zju.edu.cn);

(4) Zhekai Chen, Zhejiang University, Hangzhou, China (chenzhekai@zju.edu.cn);

(5) Kecheng Zheng, Zhejiang University, Hangzhou, China (zkechengzk@gmail.com);

(6) Chunhua Shen, Zhejiang University, Hangzhou, China (chunhuashen@zju.edu.cn).

Figure 1: Example storytelling images generated by our method AutoStory. We can generate text-aligned, identity-consistent, and high-quality story images from user-input stories and characters (the dog and cat on the left, specified by about 5 images per character), without additional inputs like sketches [Gong et al. 2023]. Further, our method also supports generating storytelling images from only text inputs, as shown in our experiments.

ABSTRACT

Story visualization aims to generate a series of images that match the story described in text; the generated images must be of high quality, aligned with the text description, and consistent in character identities. Given the complexity of story visualization, existing methods drastically simplify the problem by considering only a few specific characters and scenarios, or by requiring users to provide per-image control conditions such as sketches. However, these simplifications render them ill-suited for real applications.

To this end, we propose an automated story visualization system that can effectively generate diverse, high-quality, and consistent sets of story images with minimal human interaction. Specifically, we utilize the comprehension and planning capabilities of large language models for layout planning, and then leverage large-scale text-to-image models to generate sophisticated story images based on the layout. We empirically find that sparse control conditions, such as bounding boxes, are suitable for layout planning, while dense control conditions, such as sketches and keypoints, are suitable for generating high-quality image content. To obtain the best of both worlds, we devise a dense condition generation module that transforms simple bounding-box layouts into sketch or keypoint control conditions for final image generation, which not only improves image quality but also allows easy and intuitive user interaction.

In addition, we propose a simple yet effective method to generate multi-view consistent character images, eliminating the reliance on human labor to collect or draw character images. This allows our method to produce consistent story visualizations even when only text is provided as input. Both qualitative and quantitative experiments demonstrate the superiority of our method.

Project webpage: https://aim-uofa.github.io/AutoStory/

CCS CONCEPTS

• Computing methodologies → Neural networks.

KEYWORDS

Generative models, machine learning, diffusion models, low-rank adaptation

1 INTRODUCTION

Story visualization aims to generate a series of visually consistent images from a story described in text. It has a wide range of applications. For example, it can provide creativity and inspiration in art creation and open up new opportunities for artists. In child education, it can stimulate children’s imagination and creativity and make the learning process more interesting and effective. In cultural heritage, it can provide a rich variety of visual expressions for creative and cultural activities described in text.

Yet, story visualization is a very challenging task that must meet multiple requirements for the generated images: (1) high quality: the generated images must be visually appealing and have a reasonable layout; (2) consistency: not only should the generated images be consistent with the text descriptions, but the identities of the characters and scenes should also be consistent across images; and (3) versatility: to satisfy a wide range of user needs, the method must be easily applicable to different styles, characters, and scenes.

Limited by the capabilities of generative models, previous work [Li et al. 2019; Maharana and Bansal 2021; Maharana et al. 2021, 2022] greatly over-simplifies the task by considering story visualization only for specific styles, scenes, and characters on fixed datasets, such as the PororoSV [Li et al. 2019] and FlintstonesSV [Maharana and Bansal 2021] datasets. Generative models trained on large-scale text-to-image data and few-shot customized generation methods [Gal et al. 2022; Ruiz et al. 2022] bring new opportunities for story visualization. Some recent work [Gong et al. 2023; Liu et al. 2023c] attempts story visualization that generalizes to new characters, but is still limited to comic-book-style image production and often relies on additional user-input conditions, such as sketches.

Unlike these efforts, we propose a versatile story visualization method, termed AutoStory, that is fully automated and capable of generating high-quality story images with diverse characters, scenes, and styles. Users only need to enter a simple story description to generate high-quality storytelling images. At the same time, our method is sufficiently general to accommodate various user inputs, providing a flexible interface that allows users to control the outcome of story visualization through simple interactions. For example, depending on their needs, users can steer the generated story by providing an image of a character, adjusting the layout of objects in a picture, adjusting a character’s pose, sketching, and so on.

Given the complexity of story scenes, the general idea of AutoStory is to utilize the comprehension and planning capabilities of large language models for layout planning, and then generate complex story scenes based on the layout. Empirically, we find that sparse control conditions, such as bounding boxes, are suitable for layout planning, while dense control conditions, such as sketches and keypoints, are suitable for generating high-quality image content. To have the best of both worlds, we devise a dense condition generation module as the bridge. Instead of directly generating the whole complex picture, we first utilize the local prompts generated by the large language model to generate individual subjects in the story, and then extract dense control conditions from the subject images. The final story images are generated by conditioning on these dense control signals. Thus, AutoStory effectively utilizes the planning capability of large language models while ensuring high-quality generation results in a fully automatic fashion. At the same time, we allow users to edit the layout and other control conditions generated by the algorithm to better align with their intentions.
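To make the data flow concrete, the following Python sketch outlines the pipeline described above. The helper names (plan_layout, generate_subject, extract_dense_condition, generate_story_image) and the LayoutItem structure are hypothetical placeholders introduced here for illustration; they are not the actual interfaces of AutoStory.

```python
# Hypothetical orchestration sketch of the pipeline described above.
# All helpers are placeholders; only the data flow mirrors the text.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LayoutItem:
    subject: str                      # e.g., "a corgi dog"
    local_prompt: str                 # subject-level prompt produced by the LLM
    bbox: Tuple[int, int, int, int]   # (x0, y0, x1, y1) box from layout planning

def plan_layout(panel_text: str) -> List[LayoutItem]:
    """Sparse planning: ask an LLM to split the panel description into
    per-subject local prompts and bounding boxes."""
    raise NotImplementedError  # plug in an LLM of your choice

def generate_subject(item: LayoutItem):
    """Generate a single subject image from its local prompt,
    e.g., with a character-customized text-to-image model."""
    raise NotImplementedError

def extract_dense_condition(subject_image, bbox):
    """Extract a sketch or keypoint map from the subject image and place it
    inside its bounding box on a panel-sized canvas."""
    raise NotImplementedError

def generate_story_image(panel_text: str, dense_conditions):
    """Generate the final panel conditioned on the composed dense signals,
    e.g., via a ControlNet-style conditional diffusion model."""
    raise NotImplementedError

def visualize_panel(panel_text: str):
    layout = plan_layout(panel_text)              # sparse: boxes + local prompts
    conditions = [
        extract_dense_condition(generate_subject(item), item.bbox)
        for item in layout                        # per-subject dense conditions
    ]
    return generate_story_image(panel_text, conditions)
```

In this sketch, the user-editable interface corresponds to the LayoutItem boxes and the extracted dense conditions, which can be inspected or adjusted before the final generation step.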

To achieve identity consistency in the generated images while maintaining the versatility of large-scale text-to-image generative models, and unlike existing methods that perform time-consuming training on domain-specific data, we exploit few-shot parameter-efficient fine-tuning techniques for foundation models. Combined with customized generation techniques, AutoStory achieves identity-consistent generation by training on only a few images for each character, while also generalizing to diverse characters, scenes, and styles.
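As a rough illustration of this strategy (not the authors' training code), the sketch below fine-tunes only low-rank adapters in the attention layers of a Stable Diffusion UNet on a handful of character images. The base model name, adapter rank, target modules, and learning rate are assumptions chosen for the example.

```python
# Hedged sketch: few-shot, parameter-efficient character customization with
# LoRA adapters on a Stable Diffusion UNet. Hyper-parameters are assumptions.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline, DDPMScheduler
from peft import LoraConfig, get_peft_model

model_id = "runwayml/stable-diffusion-v1-5"          # assumed base model
pipe = StableDiffusionPipeline.from_pretrained(model_id)
unet, vae, text_encoder = pipe.unet, pipe.vae, pipe.text_encoder
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

# Inject low-rank adapters into the attention projections only; the base
# weights stay frozen, so a few images per character suffice for training.
lora_cfg = LoraConfig(r=8, lora_alpha=8,
                      target_modules=["to_q", "to_k", "to_v", "to_out.0"])
unet = get_peft_model(unet, lora_cfg)
optimizer = torch.optim.AdamW(
    [p for p in unet.parameters() if p.requires_grad], lr=1e-4)

def training_step(pixel_values, input_ids):
    """One denoising-loss step on a (character image, prompt) pair."""
    with torch.no_grad():
        latents = vae.encode(pixel_values).latent_dist.sample()
        latents = latents * vae.config.scaling_factor
        text_emb = text_encoder(input_ids)[0]
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)
    noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_emb).sample
    loss = F.mse_loss(noise_pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Because only the adapter weights are updated, each character adds a small set of parameters on top of the shared foundation model, which is what allows customization from a few images without sacrificing the model's general styles and scenes.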

In addition, existing story visualization methods require the user to provide multiple images for each character in the story, which need to be both identity-consistent and diverse. This can be laborious, since users have to draw or collect multiple images for each character. We eliminate this requirement by proposing a multi-view consistent subject generation method. Specifically, we propose a training-free identity-consistency modeling method that treats multiple views as a video and jointly generates textures with temporal-aware attention. Furthermore, we improve the diversity of the generated character images by leveraging the 3D prior in view-conditioned image translation models [Liu et al. 2023d,b] without compromising identity consistency. An example story visualization is shown in Fig. 1.
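The following is a minimal, self-contained sketch of the "multiple views as a video" idea: a cross-view attention block in which every view attends to the tokens of all views, so appearance is denoised jointly and stays consistent across views. It is our illustration of the mechanism described above, not the authors' implementation; the module name and tensor shapes are assumptions.

```python
# Illustrative cross-view (temporal-aware) attention: tokens of all candidate
# views share one attention pool, so textures are generated jointly.
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Self-attention over the concatenated tokens of all views (hypothetical)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_views, tokens, dim) -- latent tokens of each view of the character.
        v, t, d = x.shape
        tokens = x.reshape(1, v * t, d)       # treat all views as one sequence
        out, _ = self.attn(tokens, tokens, tokens)
        return out.reshape(v, t, d)

# Swapping per-view self-attention in the denoising network for a block like
# this lets all views be generated as one "video", keeping the identity consistent.
views = torch.randn(4, 256, 320)              # 4 views, 256 tokens, 320 channels
print(CrossViewAttention(320)(views).shape)   # torch.Size([4, 256, 320])
```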

To summarize, our main contributions are as follows.

• We propose a fully automated story visualization pipeline that can generate diverse, high-quality, and consistent stories with minimal user input requirements.

• To deal with complex scenarios in story visualization, we leverage sparse control signals for layout generation and dense control signals for high-quality image generation. A simple yet effective dense condition generation module is proposed as the bridge, fully automatically transforming sparse control signals into sketch or keypoint control conditions.

• To maintain identity consistency and eliminate the need for users to draw or collect image data for characters, we propose a simple method to generate multi-view consistent character images from only text. Specifically, we use a 3D-aware generative model to improve diversity, and generate identity-consistent data by treating the multi-view images as a video.

• To our knowledge, ours is the first method able to generate high-quality storytelling images with diverse characters, scenes, and styles, even when the user inputs only text. At the same time, our method is flexible enough to accommodate various user inputs where needed.

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.