Odisha University of Technology and Research
VisuScript: Text-to-Image Generation Using StackGAN
Minor Project, 6th Semester
Presented by:
Rahul Kumar Sahoo - 2211100132
Manasi Naik - 2211100190
Sovan Sekhar Senapati -
Project Mentor: Dr. Subhalaxmi Das
OVERVIEW
• Introduction
• Motivation
• Literature Review
• Research Gap
• Objectives
• Proposed Model
• Result Analysis
• Conclusion
• Future Work
• References
INTRODUCTION
• Generative Adversarial Networks (GANs)
GANs revolutionize data synthesis through a dual-network system, a Generator
versus a Discriminator, engaged in a competitive learning process to produce
highly realistic synthetic data; a minimal training sketch follows these bullets.
• StackGAN Innovation
StackGAN extends the GAN framework with a multi-stage
architecture, refining image generation in stages—from low-resolution
sketches to high-quality photorealistic images.
• Text-to-Image Translation
By converting natural language descriptions into detailed visual
content, StackGAN enables AI to interpret and visualize linguistic input
effectively.
• Impact and Applications
This model paves the way for breakthroughs in AI creativity, content
generation, and human-computer interaction, particularly in areas like
design, media, and assistive tech.
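As a concrete illustration of the adversarial setup described above, here is a minimal, hypothetical PyTorch training step for a generic GAN. The network definitions, dimensions, and hyperparameters are placeholder assumptions for illustration, not the project's actual implementation.

```python
import torch
import torch.nn as nn

# Placeholder networks; a real StackGAN generator and discriminator
# would be convolutional and conditioned on text embeddings.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCEWithLogitsLoss()

def train_step(real):
    batch = real.size(0)
    z = torch.randn(batch, 100)  # random noise input to the generator

    # Discriminator step: real images should score 1, generated images 0.
    opt_d.zero_grad()
    d_loss = bce(D(real), torch.ones(batch, 1)) + \
             bce(D(G(z).detach()), torch.zeros(batch, 1))
    d_loss.backward()
    opt_d.step()

    # Generator step: try to fool the discriminator into scoring fakes as 1.
    opt_g.zero_grad()
    g_loss = bce(D(G(z)), torch.ones(batch, 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

# Example usage with a random "real" batch of flattened 28x28 images in [-1, 1].
d_l, g_l = train_step(torch.rand(32, 784) * 2 - 1)
print(f"d_loss={d_l:.3f}, g_loss={g_l:.3f}")
```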
MOTIVATION
• Text-to-Image Bridging: To enable AI systems to translate natural language
descriptions into vivid and realistic images.
• High-Quality Image Generation: To generate high-resolution, photorealistic
images using a multi-stage GAN (StackGAN) framework.
• Enhanced Text Understanding: To use LSTM encoders for capturing deep
semantic and sequential meaning from text inputs.
• Training Stability & Diversity: To integrate Conditioning Augmentation (CA)
for stabilizing training and promoting output diversity.
• To support creative and practical applications in fields like design, media,
and assistive technology.
LITERATURE REVIEW
RESEARCH GAP
Limitations in Existing Models:
• Similar architectures often produce blurry or low-quality images at high
resolutions.
• The generator may fail to explore the full variety of possible images,
producing only a limited set of outputs (mode collapse).
• Most existing text-to-image models rely on conditional CNNs or Bag-of-Words
text encoders, which often fail to capture deeper contextual and semantic
relationships in complex textual descriptions, limiting the richness and
accuracy of generated images.
Need for a Solution:
• Methods like conditional CNNs and Bag-of-Words miss deep contextual meaning
in text, limiting image detail.
• There is a gap in generating high-resolution images with sharp details that
align well with complex textual descriptions.
• Hybrid architectures combining CNN and RNN capabilities are needed for
richer text-image alignment and better feature representation.
• LSTM encoders offer better contextual understanding than CNN or Bag-of-Words
encoders for complex text.
OBJECTIVES
• To design a text-to-image synthesis model using StackGAN for
generating high-resolution, realistic images.
• To implement LSTM-based text encoders for capturing deep semantic
and contextual meaning from textual descriptions.
• To integrate Conditioning Augmentation (CA) for improving training
stability and enhancing diversity in generated outputs.
• To evaluate the model's performance using loss curves and qualitative
image outputs for realism and relevance.
• To evaluate the generated images using metrics like Inception Score
(IS) and Fréchet Inception Distance (FID); a minimal evaluation sketch
follows this list.
• To explore potential improvements and real-world applications of text-
to-image generation in AI, design, and creative industries.
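The following is a minimal, hypothetical sketch of how IS and FID could be computed with the torchmetrics library; the metric settings and the random image tensors are illustrative assumptions, not the project's actual evaluation code.

```python
# Requires: pip install torchmetrics torch-fidelity
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

# Hypothetical batches of real and generated images. These Inception-based
# metrics expect uint8 tensors in [0, 255] of shape (N, 3, H, W) by default.
real_images = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)

# FID compares Inception feature statistics of real vs. generated images.
# A small feature layer (64) keeps this toy example numerically stable.
fid = FrechetInceptionDistance(feature=64)
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print("FID:", fid.compute().item())  # lower is better

# IS measures confidence and diversity of Inception predictions
# on generated images only.
inception = InceptionScore()
inception.update(fake_images)
is_mean, is_std = inception.compute()
print(f"IS: {is_mean:.2f} +/- {is_std:.2f}")  # higher is better
```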
PROPOSED MODEL
Text Encoding (LSTM)
• Processes input text sequentially, capturing
word relationships and context
• Outputs a fixed-dimensional semantic vector
representing the full description
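A minimal sketch of such an LSTM text encoder, assuming a toy vocabulary and embedding sizes chosen purely for illustration (the actual model's dimensions are not specified here):

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Encodes a tokenized caption into one fixed-length semantic vector."""
    def __init__(self, vocab_size=5000, embed_dim=256, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):          # token_ids: (batch, seq_len)
        x = self.embed(token_ids)          # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)         # final hidden state: (1, batch, hidden_dim)
        return h_n.squeeze(0)              # fixed-size vector: (batch, hidden_dim)

# Example: two captions of length 8, represented as integer token ids.
encoder = TextEncoder()
captions = torch.randint(0, 5000, (2, 8))
print(encoder(captions).shape)  # torch.Size([2, 128])
```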
Conditioning Augmentation (CA)
• Adds controlled noise to text embeddings to
increase output diversity
• Ensures robustness against minor text
variations while preserving core meaning
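A minimal sketch of Conditioning Augmentation as described in the StackGAN paper: the text embedding is mapped to a mean and log-variance, and a conditioning vector is resampled from that Gaussian on every forward pass. The dimensions here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Samples a conditioning vector from a Gaussian around the text embedding."""
    def __init__(self, text_dim=128, cond_dim=64):
        super().__init__()
        # One linear layer predicts both the mean and the log-variance.
        self.fc = nn.Linear(text_dim, cond_dim * 2)

    def forward(self, text_embedding):
        mu, logvar = self.fc(text_embedding).chunk(2, dim=1)
        eps = torch.randn_like(mu)
        c = mu + eps * torch.exp(0.5 * logvar)  # reparameterization trick
        # KL term regularizes the distribution toward N(0, I) during training.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return c, kl

ca = ConditioningAugmentation()
c, kl = ca(torch.randn(2, 128))
print(c.shape, kl.item())  # torch.Size([2, 64]) and a scalar KL value
```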
Stage I (Low-Res Generation)
• Produces a 64×64 base image with correct
layout and primary colors
• Focuses on structural accuracy rather than
fine details
Stage II (High-Res Refinement)
• Enhances resolution to 256×256 while
adding realistic textures
• Uses cross-modal attention to align visual
details with text descriptions
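A highly simplified sketch of the two generator stages, assuming placeholder channel counts; the real StackGAN stages use residual blocks and text-conditioned refinement, which are omitted here for brevity.

```python
import torch
import torch.nn as nn

def upsample(in_ch, out_ch):
    """Doubles spatial resolution, then convolves to the target channel count."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
    )

class StageIGenerator(nn.Module):
    """Noise + conditioning vector -> rough 64x64 image (layout, primary colors)."""
    def __init__(self, z_dim=100, cond_dim=64):
        super().__init__()
        self.fc = nn.Linear(z_dim + cond_dim, 128 * 4 * 4)
        self.net = nn.Sequential(
            upsample(128, 64), upsample(64, 32),   # 4 -> 8 -> 16
            upsample(32, 16), upsample(16, 8),     # 16 -> 32 -> 64
            nn.Conv2d(8, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, z, c):
        x = self.fc(torch.cat([z, c], dim=1)).view(-1, 128, 4, 4)
        return self.net(x)                         # (batch, 3, 64, 64)

class StageIIGenerator(nn.Module):
    """64x64 image + conditioning -> refined 256x256 image (textures, detail)."""
    def __init__(self, cond_dim=64):
        super().__init__()
        self.encode = nn.Conv2d(3 + cond_dim, 32, 3, padding=1)
        self.net = nn.Sequential(
            upsample(32, 16), upsample(16, 8),     # 64 -> 128 -> 256
            nn.Conv2d(8, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, img64, c):
        # Broadcast the conditioning vector over the spatial grid.
        c_map = c[:, :, None, None].expand(-1, -1, 64, 64)
        x = self.encode(torch.cat([img64, c_map], dim=1))
        return self.net(x)                         # (batch, 3, 256, 256)

z, c = torch.randn(2, 100), torch.randn(2, 64)
img64 = StageIGenerator()(z, c)
img256 = StageIIGenerator()(img64, c)
print(img64.shape, img256.shape)
```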
Datasets used: MNIST handwritten digits, CUB Birds, and Oxford-102 Flowers.
RESULT ANALYSIS
Figure: Epoch 20, real (top) vs. generated (bottom); Epoch 50, real (top) vs. generated (bottom).
Figure: Graphical representation of generator loss and discriminator loss.
CONCLUSION
• StackGAN proves effective for high-quality text-to-image synthesis.
• The two-stage architecture enhances resolution and realism.
• The approach is applicable across many domains, from design to forensics.
• StackGAN enables text-to-image synthesis that can be applied in advertising,
gaming, virtual reality, and accessibility tools.
• Improving and refining GANs leads to higher fidelity, making them more
useful in real-world deployment.
FUTURE WORK
• Use transformer-based text encoders for better
semantic understanding
• Enable real-time text-to-image generation
• Add support for multi-modal inputs (e.g., audio,
sketches)
• Train on larger datasets to handle complex scene
generation
• Integrate the model into creative tools and
applications
REFERENCES
Arya, R., Bhakuni, V. S., Joshi, D., Sharma, K., Vats, S., & Sharma, V. (2024).
Stacked Generative Adversarial Networks (StackGAN) Text-to-Image Generator.
Sahithi, Y. L., Sunny, N., Deepak, M. M. L., & Amrutha, S. (2023). Text-to-Image
Synthesis using StackGAN. In 2023 Global Conference on Information Technologies
and Communications (GCITC), Karnataka, India. IEEE.
Dhivya, K., & Navas, N. S. (2020). Text to Realistic Image Generation Using
StackGAN. In 2020 7th International Conference on Smart Structures and Systems
(ICSSS). IEEE.
Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., & Metaxas, D. (2017).
StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative
Adversarial Networks. arXiv preprint arXiv:1612.03242.
THANK YOU