Revisit What You See: Revealing Visual Semantics in
Vision Tokens to Guide LVLM Decoding

Beomsik Cho¹ · Jaehyung Kim¹
¹ Yonsei University

Overview

ReVisiT is a decoding-time algorithm for LVLMs that guides text generation by Referencing Vision Tokens. It projects vision tokens into the text token space, selects the most relevant one through constrained divergence minimization, and guides generation to better align with visual semantics without modifying the underlying model.
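The three steps above (project, select, guide) can be illustrated with a small pure-Python sketch of a single decoding step. This is not the repository's implementation. The function name `revisit_step_sketch`, the plausibility threshold `alpha`, the use of KL divergence as the divergence measure, and the log-probability addition used to combine the two distributions are all assumptions chosen for illustration. Inputs are plain lists of floats rather than model tensors.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def revisit_step_sketch(text_logits, vision_hidden, lm_head, alpha=0.1):
    """Illustrative single decoding step (hypothetical helper, not the repo's API).

    text_logits:   base next-token logits, length V
    vision_hidden: N vision-token hidden states, each of length D
    lm_head:       D x V projection into the text token space
    alpha:         assumed plausibility threshold for the vocabulary constraint
    """
    if not vision_hidden:
        raise ValueError("at least one vision token is required")
    vocab = len(text_logits)

    # 1) Project every vision token into the text token space.
    vision_logits = [
        [sum(h[d] * lm_head[d][v] for d in range(len(h))) for v in range(vocab)]
        for h in vision_hidden
    ]

    # 2) Constrain the vocabulary to tokens the base model finds plausible:
    #    keep v if p(v) >= alpha * max p (an assumed form of the constraint).
    base_logp = log_softmax(text_logits)
    cutoff = max(base_logp) + math.log(alpha)
    keep = [v for v in range(vocab) if base_logp[v] >= cutoff]
    p = log_softmax([text_logits[v] for v in keep])

    # 3) Select the vision token whose projected distribution is closest
    #    to the base distribution (KL(p || q) is an assumed choice).
    best, best_kl, best_q = -1, math.inf, None
    for i, row in enumerate(vision_logits):
        q = log_softmax([row[v] for v in keep])
        kl = sum(math.exp(pv) * (pv - qv) for pv, qv in zip(p, q))
        if kl < best_kl:
            best, best_kl, best_q = i, kl, q

    # 4) Guide generation: combine log-probabilities and renormalize.
    combined = log_softmax([pv + qv for pv, qv in zip(p, best_q)])
    guided = [-math.inf] * vocab
    for v, lp in zip(keep, combined):
        guided[v] = lp
    return guided, best
```

The function returns guided log-probabilities over the full vocabulary (tokens outside the constrained subset receive negative infinity) together with the index of the selected vision token. The model weights are never touched, which matches the decoding-time nature of the method.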

Implementation

Because the LVLM families require different Transformers versions, we provide two separate implementations: one for LLaVA-1.5 and one shared by Qwen2.5-VL and InternVL3. LLaVA-1.5 is based on Transformers v4.31.0, while Qwen2.5-VL and InternVL3 are based on v4.50.0, reflecting compatibility requirements with their respective tokenizer and model wrappers.
Although the core ReVisiT decoding logic remains the same, these version-specific dependencies necessitate isolated environments and tailored integration scripts per model.

If you want to integrate ReVisiT into your own environment, add the corresponding decoding function to the Hugging Face Transformers source code. Specifically, copy the code from the bundled source tree that matches your Transformers version (src/transformers-v4.31.0 or src/transformers-v4.50.0) and paste it into your local transformers/generation/utils.py.

The following sections provide setup instructions and CHAIR evaluation scripts for each model.

Environment setup

LLaVA-1.5 (transformers==4.31.0)

conda env create -f prerequisites/ReVisiT_LLaVA.yaml
conda activate revisit_llava
pip install numpy==1.26.4
cd src/transformers-v4.31.0
pip install -e .
cd ../..

Qwen2.5-VL & InternVL3 (transformers==4.50.0)

conda create -n revisit python=3.9.21
conda activate revisit
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
conda env update -f prerequisites/ReVisiT.yaml
cd src/transformers-v4.50.0
pip install -e .
cd ../..
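
Because the two environments pin different Transformers versions, it can help to confirm which version is active before running an evaluation script. The following is a minimal sketch using only the Python standard library. The helper name `check_transformers` and the dictionary keys are hypothetical labels introduced here; the version numbers are the ones stated above. A prefix match is used on the assumption that an editable install may report a suffixed version string.

```python
from importlib import metadata

# Transformers versions stated in this README; the keys are illustrative labels.
EXPECTED_TRANSFORMERS = {
    "llava-1.5": "4.31.0",
    "qwen2.5-vl": "4.50.0",
    "internvl3": "4.50.0",
}


def check_transformers(model_family, installed=None):
    """Return (matches, expected_version, installed_version).

    If `installed` is not given, the version of the `transformers`
    package in the active environment is looked up.
    """
    expected = EXPECTED_TRANSFORMERS[model_family.lower()]
    if installed is None:
        try:
            installed = metadata.version("transformers")
        except metadata.PackageNotFoundError:
            # Transformers is not installed in this environment.
            return False, expected, None
    # Prefix match tolerates suffixes such as ".dev0" (an assumption).
    return installed.startswith(expected), expected, installed


if __name__ == "__main__":
    ok, expected, installed = check_transformers("llava-1.5")
    print(f"expected {expected}, found {installed}, match={ok}")
```

Run it inside the activated conda environment; a mismatch usually means the wrong environment is active or the editable install step was skipped.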

Download model weights & dataset

Download MSCOCO

bash prerequisites/download_coco.sh

LLaVA-1.5

conda activate revisit_llava
python prerequisites/download_from_huggingface.py --llava

Qwen2.5-VL

conda activate revisit
python prerequisites/download_from_huggingface.py --qwenvl

InternVL3

conda activate revisit
python prerequisites/download_from_huggingface.py --internvl

CHAIR Evaluation

LLaVA-1.5

conda activate revisit_llava
bash eval_chair_llava.sh

Qwen2.5-VL

conda activate revisit
bash eval_chair_qwenvl.sh

InternVL3

conda activate revisit
bash eval_chair_internvl.sh

Acknowledgements

This repository builds upon the open-source implementations of LLaVA, QwenVL, InternVL, VCD, and RITUAL.
We sincerely thank the authors for making their code publicly available.

Citation

If you find our work helpful, please consider citing:

@article{cho2025revisit,
  title     = {Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs},
  author    = {Beomsik Cho and Jaehyung Kim},
  journal   = {arXiv preprint arXiv:2506.09522},
  year      = {2025},
  url       = {https://arxiv.org/abs/2506.09522}
}
