Generating an image from its textual description is a challenging task worth solving because of its numerous practical applications, ranging from image editing to virtual reality. Existing methods use a single caption to generate a plausible image. A single caption by itself can be limited, and may not capture the variety of concepts and behaviors present in the image. We propose two deep generative models that generate an image by making use of multiple captions describing it. This is achieved by ensuring 'Cross-Caption Cycle Consistency' between the multiple captions and the generated image(s). We report quantitative and qualitative results on the standard Caltech-UCSD Birds (CUB) and Oxford-102 Flowers datasets to validate the efficacy of the proposed approach.
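The core idea can be illustrated with a toy sketch: captions are consumed one at a time, and after each generation step a captioning network should recover the *next* caption in the cycle, so that every caption constrains the final image. The linear "networks", dimensions, and function names below are all hypothetical stand-ins, not the paper's architecture; this is a minimal sketch of the cross-caption cycle-consistency objective only, assuming caption embeddings and images are plain vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions (not from the paper).
CAP_DIM, IMG_DIM = 8, 16

# Toy linear "generator": (caption embedding, previous image) -> new image.
W_g = rng.normal(scale=0.1, size=(CAP_DIM + IMG_DIM, IMG_DIM))
# Toy linear "captioner": image -> caption embedding, used to close the cycle.
W_c = rng.normal(scale=0.1, size=(IMG_DIM, CAP_DIM))

def generate(caption, prev_image):
    """One generation step conditioned on a caption and the previous image."""
    return np.tanh(np.concatenate([caption, prev_image]) @ W_g)

def cross_caption_cycle_loss(captions):
    """Consume captions recurrently; after generating from caption i, the
    captioner should recover caption i+1 (wrapping around), so the cycle
    runs across captions rather than back to the same caption."""
    image = np.zeros(IMG_DIM)
    loss = 0.0
    n = len(captions)
    for i, cap in enumerate(captions):
        image = generate(cap, image)
        recovered = image @ W_c             # captioner's embedding of the image
        target = captions[(i + 1) % n]      # cross-caption: next caption in the cycle
        loss += np.mean((recovered - target) ** 2)
    return loss / n, image

captions = [rng.normal(size=CAP_DIM) for _ in range(4)]
loss, final_image = cross_caption_cycle_loss(captions)
```

In a real model the generator and captioner would be trained networks and this loss would be one term alongside the usual adversarial losses; the sketch only shows how the cycle links each generated image to a different caption than the one that produced it.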

The figure shows two images generated by C4Synth. The captions used while generating the images are listed on the left.

Cascaded Architecture

Recurrent Architecture


Generations from C4Synth. The first two images are generated from captions belonging to the Black-footed Albatross and Great Crested Flycatcher classes of the CUB dataset, while the last one is from the Moon Orchid class of the Flowers dataset. The last two rows contain random generations from both datasets. (Kindly zoom in to see the details in the images.)


Code for Recurrent C4Synth:

Code for Cascaded C4Synth:

arXiv paper