Prologue

Generating an image from its description is a challenging task worth solving because of its numerous practical applications, ranging from image editing to virtual reality. Existing methods use a single caption to generate a plausible image. A single caption by itself can be limiting, as it may not capture the variety of concepts and behaviors present in the image. We propose two deep generative models that generate an image by making use of multiple captions describing it. This is achieved by ensuring 'Cross-Caption Cycle Consistency' between the multiple captions and the generated image(s). We report quantitative and qualitative results on the standard Caltech-UCSD Birds (CUB) and Oxford-102 Flowers datasets to validate the efficacy of the proposed approach.
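The sketch below illustrates the cross-caption cycle-consistency idea in PyTorch: an image is generated from one caption, and a captioning branch is asked to recover the next caption describing the same image, so every caption in the set constrains the generation. The module names (TextEncoder, Generator, Captioner), the loss form, and all hyper-parameters are illustrative assumptions rather than the paper's exact implementation; the adversarial losses are omitted. See the linked repositories below for the authors' actual code.

```python
# Minimal sketch of cross-caption cycle consistency (illustrative, not the paper's code).
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Embeds a tokenised caption into a fixed-size vector."""
    def __init__(self, vocab_size=5000, embed_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)

    def forward(self, tokens):                      # tokens: (B, T)
        _, h = self.rnn(self.embed(tokens))         # h: (1, B, D)
        return h.squeeze(0)                         # (B, D)

class Generator(nn.Module):
    """Maps a caption embedding (plus noise) to a small RGB image."""
    def __init__(self, embed_dim=128, noise_dim=100, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.fc = nn.Sequential(
            nn.Linear(embed_dim + noise_dim, 3 * img_size * img_size),
            nn.Tanh(),
        )

    def forward(self, cap_emb, noise):
        x = self.fc(torch.cat([cap_emb, noise], dim=1))
        return x.view(-1, 3, self.img_size, self.img_size)

class Captioner(nn.Module):
    """Predicts a caption embedding back from the generated image."""
    def __init__(self, embed_dim=128, img_size=64):
        super().__init__()
        self.fc = nn.Linear(3 * img_size * img_size, embed_dim)

    def forward(self, img):
        return self.fc(img.flatten(1))

def cross_caption_cycle_loss(captions, text_enc, gen, cap_net, noise_dim=100):
    """Generate from caption i, then ask the captioner to recover the
    embedding of caption i+1, closing the cycle over all captions."""
    embs = [text_enc(c) for c in captions]          # list of (B, D)
    loss = 0.0
    for i, emb in enumerate(embs):
        noise = torch.randn(emb.size(0), noise_dim)
        img = gen(emb, noise)
        target = embs[(i + 1) % len(embs)]          # next caption of the same image
        loss = loss + nn.functional.mse_loss(cap_net(img), target)
    return loss / len(embs)

if __name__ == "__main__":
    text_enc, gen, cap_net = TextEncoder(), Generator(), Captioner()
    # Two dummy captions (batch of 4, 12 tokens each) describing the same image.
    captions = [torch.randint(0, 5000, (4, 12)) for _ in range(2)]
    print(cross_caption_cycle_loss(captions, text_enc, gen, cap_net).item())
```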

The figure shows two images generated by C4Synth. The captions used while generating each image are listed on the left.

Cascaded Architecture

Recurrent Architecture

Results

Generations from C4Synth. The first two images are generated from captions belonging to the Black Footed Albatross and Great Crested Flycatcher classes of the CUB dataset, while the last one is from the Moon Orchid class of the Flowers dataset. The last two rows contain random generations from both datasets. (Kindly zoom in to see the detail in the images.)

Code

Code for Recurrent C4Synth: https://github.com/JosephKJ/aRTISt

Code for Cascaded C4Synth: https://github.com/JosephKJ/DistillGAN

arXiv paper

https://arxiv.org/abs/1809.10238