Audio-Synchronized Visual Animation

Abstract

Current visual generation methods can produce high-quality videos guided by text. However, effectively controlling object dynamics remains a challenge. This work explores audio as a cue to generate temporally synchronized image animations. We introduce Audio-Synchronized Visual Animation (ASVA), a task that animates a static image to demonstrate motion dynamics, temporally guided by audio clips across multiple classes. To this end, we present AVSync15, a dataset curated from VGGSound with videos featuring synchronized audio-visual events across 15 categories. We also present a diffusion model, AVSyncD, capable of generating dynamic animations guided by audio. Extensive evaluations validate AVSync15 as a reliable benchmark for synchronized generation and demonstrate our model's superior performance. We further explore AVSyncD's potential in a variety of audio-synchronized generation tasks, from generating full videos without a base image to controlling object motions with various sounds. We hope our established benchmark can open new avenues for controllable visual generation.

Given an audio clip and an image (green box), we produce animations that go beyond image stylization, with complex but natural dynamics synchronized with the input audio at each frame. Results are produced by our AVSyncD model trained on the proposed AVSync15 dataset.

AVSync15 [Download]

AVSync15 is a high-quality synchronized audio-video dataset curated from VGGSound. We carefully curate the dataset with both automatic and manual steps. It has the following attributes:

High semantic and temporal audio-visual correlation: Audio and visual contents are not only semantically aligned, but also synchronized at each timestamp. Visual motions are mostly triggered by audio, and vice versa.
Clean and stable audio-visual contents: Visual motions result from object dynamics rather than camera viewpoint changes or scene transitions. Frames are consistent enough to avoid sharp changes, yet dynamic enough to contain rich object motion. Audio is clean enough to describe the visual motion and excludes overwhelming out-of-scene sounds.
Rich in audio-video synchronization cues: We remove the classes and videos where the synchronization cue is minimal, i.e., where shifting the audio along the time axis cannot be perceived when it is paired with the unshifted video. This removes ambient classes such as raining, fire crackling, and running fan, as well as videos with overly complex audio-visual contents.
Diverse in categories: It contains videos with duration > 2 seconds in 15 dynamic-motion classes. Each class is partitioned into 90 training videos and 10 testing videos.
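The shift criterion in the synchronization-cue attribute above can be illustrated with a small sketch. It is only an illustration of shifting audio along the time axis, not the curation pipeline itself: the function name `shift_audio`, the 16 kHz sample rate, and the zero-padding behavior are all assumptions made for this example.

```python
import numpy as np


def shift_audio(waveform: np.ndarray, sample_rate: int, shift_seconds: float) -> np.ndarray:
    """Delay (positive shift) or advance (negative shift) a mono waveform.

    The output has the same length as the input; the gap left by the
    shift is filled with silence (zeros). Hypothetical helper.
    """
    offset = int(round(shift_seconds * sample_rate))
    shifted = np.zeros_like(waveform)
    if offset > 0:
        shifted[offset:] = waveform[:-offset]
    elif offset < 0:
        shifted[:offset] = waveform[-offset:]
    else:
        shifted[:] = waveform
    return shifted


# Toy example: a 2-second clip containing a single click at 0.5 s.
sample_rate = 16_000  # assumed sample rate, not taken from the dataset
clip = np.zeros(2 * sample_rate, dtype=np.float32)
clip[int(0.5 * sample_rate)] = 1.0

# Delay the audio by 0.25 s while the (imaginary) video stays unshifted.
delayed = shift_audio(clip, sample_rate, 0.25)
click_time = np.argmax(delayed) / sample_rate
print(f"click moved from 0.50 s to {click_time:.2f} s")
```

For a class such as hammering, a viewer would notice that the shifted click no longer lines up with the strike in the video. For an ambient class such as raining, the shifted and unshifted audio would be hard to tell apart, which is the case the curation removes.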

Baby babbling crying

Cap gun shooting

Chicken crowing

Dog barking

Frog croaking

Hammering

Lions roaring

Machine gun shooting

Playing cello

Playing trombone

Playing trumpet

Playing violin

Sharpen knife

Striking bowling

Toilet flushing

AVSyncD

AVSyncD is built upon pretrained StableDiffusion-V1.5 and ImageBind. Given an input image, a 2-second audio clip, and the audio class, AVSyncD first encodes the audio into temporal tokens that are rich in semantic and temporal cues, then produces highly synchronized visual animations.
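To make the idea of temporal audio tokens more concrete, the sketch below splits a 2-second waveform into one contiguous segment per video frame and summarizes each segment with its RMS energy. This is not AVSyncD's encoder, which builds on ImageBind. The function names, the 16 kHz sample rate, and the choice of 8 frames are assumptions made purely for illustration.

```python
import numpy as np


def frame_aligned_segments(waveform: np.ndarray, num_frames: int) -> np.ndarray:
    """Split a mono waveform into `num_frames` equal, contiguous segments.

    Trailing samples that do not fill a whole segment are dropped.
    Returns an array of shape (num_frames, samples_per_segment).
    """
    usable = (len(waveform) // num_frames) * num_frames
    return waveform[:usable].reshape(num_frames, -1)


def segment_energy(segments: np.ndarray) -> np.ndarray:
    """RMS energy of each segment: a crude stand-in for a learned audio token."""
    return np.sqrt(np.mean(segments ** 2, axis=1))


sample_rate = 16_000   # assumed
num_frames = 8         # assumed number of video frames in the 2-second clip
audio = np.zeros(2 * sample_rate, dtype=np.float32)

# Put a short 440 Hz burst between 0.5 s and 0.75 s (segment index 2 of 8).
t = np.arange(4_000) / sample_rate
audio[8_000:12_000] = np.sin(2 * np.pi * 440.0 * t)

segments = frame_aligned_segments(audio, num_frames)
energies = segment_energy(segments)
loudest_frame = int(np.argmax(energies))
print(segments.shape, loudest_frame)
```

Each row of `segments` covers the time span of one frame, so a per-frame feature such as `energies` tells a generator when the sound event happens. A learned encoder would replace the RMS value with a feature vector per segment.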

Generated videos

Here we show videos generated by AVSyncD on the Landscapes and AVSync15 test sets.

Landscapes

explosion

fire crackling

raining

splashing water

squishing water

thunder

underwater bubbling

waterfall burbling

wind noise

AVSync15

baby babbling crying

cap gun shooting

chicken crowing

dog barking

frog croaking

hammering

lions roaring

machine gun shooting

playing cello

playing trombone

playing trumpet

playing violin fiddle

sharpen knife

striking bowling

toilet flushing

More Applications

Despite being trained on only 1,350 videos (15 classes with 90 training videos each), AVSyncD generalizes to many fun applications.

Animate images with less-correlated sounds

When conditioned on audio that is less correlated with the image, AVSyncD produces different motions depending on the audio-image correlation. Results are shown on the AVSync15 test set.

baby + baby crying

baby + dog barking

baby + lions roaring

baby + chicken crowing

baby + playing violin

baby + cap gun shooting

baby + toilet flushing

Animation using images and audios in the wild

AVSyncD accepts a broad range of inputs (images and audio clips) downloaded from the internet.

baby crying

striking bowling

dog barking

playing cello

Animate generated images

While AVSyncD is designed for image animation, it can also function as a prompt+audio->video generator when combined with existing image generators, e.g., StableDiffusion.
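The two-stage composition described above can be sketched as a plain function that chains an image generator and an animator. Everything here is hypothetical: `prompt_audio_to_video`, the callable signatures, and the stub functions are placeholders used to show the data flow, not the actual AVSyncD or StableDiffusion interfaces.

```python
from typing import Callable, List

import numpy as np

Image = np.ndarray          # assumed layout: height x width x 3
Video = List[np.ndarray]    # a list of frames


def prompt_audio_to_video(
    prompt: str,
    audio: np.ndarray,
    audio_class: str,
    generate_image: Callable[[str], Image],
    animate: Callable[[Image, np.ndarray, str], Video],
) -> Video:
    """Chain a text-to-image model with an audio-conditioned animator."""
    first_frame = generate_image(prompt)             # stage 1: prompt -> image
    return animate(first_frame, audio, audio_class)  # stage 2: image + audio -> video


# Stubs standing in for real models, so the sketch runs on its own.
def fake_generate_image(prompt: str) -> Image:
    return np.zeros((64, 64, 3), dtype=np.uint8)


def fake_animate(image: Image, audio: np.ndarray, audio_class: str) -> Video:
    return [image.copy() for _ in range(4)]  # 4 identical frames


video = prompt_audio_to_video(
    "a photo of a dog",
    np.zeros(32_000, dtype=np.float32),  # 2 seconds at an assumed 16 kHz
    "dog barking",
    fake_generate_image,
    fake_animate,
)
print(len(video), video[0].shape)
```

Passing the two models in as callables keeps the stages independent, so any image generator can be swapped in without touching the animation step.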

"a photo of baby face"
+ baby crying

"a photo of a person shooting a handgun"
+ cap gun shooting

"a photo of a rooster"
+ chicken crowing

"a photo of a dog"
+ dog barking

"a photo of a frog"
+ frog croaking

"a photo of a person playing violin"
+ playing violin fiddle

"a photo of a lion"
+ lions roaring

"a photo of a person shooting machine gun"
+ machine gun shooting

"a photo of a person playing cello"
+ playing cello

"a photo of a person playing trombone"
+ playing trombone

"a photo of a person playing trumpet"
+ playing trumpet

"a photo of hand sharpening knife"
+ sharpen knife

Controllable image animation using audios

Different audio clips can animate different objects of interest and trigger the corresponding motion. The images below are downloaded from the internet.

baby and violin

+ baby crying

+ playing violin

dog and violin

+ dog barking