Welcome to the official repository for Space-Aware Vision-Language Model (SA-VLM), Space-Aware Instruction Tuning (SAIT) and Space-Aware Benchmark (SA-Bench), developed to enhance guide dog robots' assistance for visually impaired individuals.
Guide dog robots hold the potential to significantly improve mobility and safety for visually impaired people. However, traditional Vision-Language Models (VLMs) often struggle with accurately interpreting spatial relationships, which is crucial for navigation in complex environments.
Our work introduces:
- SAIT Dataset: A dataset automatically generated by our data-generation pipeline, designed to enhance VLMs' understanding of physical environments by focusing on virtual paths and 3D surroundings.
- SA-Bench: A benchmark with an evaluation protocol to assess the effectiveness of VLMs in delivering walking guidance. By integrating spatial awareness into VLMs, our approach enables guide dog robots to provide more accurate and concise guidance, improving the safety and mobility of visually impaired users.
- Training Dataset and Benchmark: We release the SAIT dataset and SA-Bench, providing resources to the community for developing and evaluating space-aware VLMs.
- Automated Data Generation Pipeline: An innovative pipeline that automatically generates data focusing on spatial relationships and virtual paths in 3D space.
- Improved VLM Performance: Our space-aware instruction-tuned model outperforms state-of-the-art algorithms in providing walking guidance.
- Our dataset uses some images from the SideGuide dataset. Due to copyright restrictions, please download the SideGuide images as follows:
- Download the following datasets. They include images and labels, but exclude the SideGuide images:
- Dataset Preparation:
  - For the SAIT dataset, copy the image files listed in `llava_gd_space_aware.json` from the downloaded SideGuide dataset into the `original_images` folder.
  - For SA-Bench, each image should have a corresponding `.xml` file with the same filename. If an image is missing but its `.xml` file exists, copy the corresponding image from the SideGuide dataset.
  - A small helper script that automates both copy steps is sketched after this list.
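The following Python sketch is one way to automate the two copy steps above. It assumes `llava_gd_space_aware.json` is a LLaVA-style list of entries with an `image` field, that SA-Bench stores `.jpg` images next to the `.xml` labels, and that the paths below are placeholders you adjust to your local layout.

```python
# Helper sketch for dataset preparation (assumptions noted above).
import json
import shutil
from pathlib import Path

SIDEGUIDE_DIR = Path("path/to/SideGuide/images")       # downloaded SideGuide images (placeholder)
SAIT_JSON = Path("path/to/llava_gd_space_aware.json")  # SAIT instruction file (placeholder)
SAIT_IMAGE_DIR = Path("path/to/SAIT/original_images")  # SAIT image folder (placeholder)
SABENCH_DIR = Path("path/to/SA-Bench")                 # SA-Bench folder with .xml labels (placeholder)

# 1) SAIT: copy every image referenced in the instruction-tuning JSON.
#    Assumes each entry is a dict with an "image" field holding the filename.
with open(SAIT_JSON, "r", encoding="utf-8") as f:
    entries = json.load(f)

SAIT_IMAGE_DIR.mkdir(parents=True, exist_ok=True)
for entry in entries:
    name = entry.get("image")
    if name and not (SAIT_IMAGE_DIR / name).exists():
        src = SIDEGUIDE_DIR / name
        if src.exists():
            shutil.copy2(src, SAIT_IMAGE_DIR / name)

# 2) SA-Bench: every .xml label should sit next to an image with the same stem.
#    Assumes .jpg images; change the suffix if your copy uses another format.
for xml_path in SABENCH_DIR.glob("*.xml"):
    img_path = xml_path.with_suffix(".jpg")
    if not img_path.exists():
        src = SIDEGUIDE_DIR / img_path.name
        if src.exists():
            shutil.copy2(src, img_path)
```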
- We utilized LLaVA-OneVision as the baseline network. For installation and training instructions, please refer to this link.
- If you would like to evaluate your VLM using SA-Bench, please use the following script.
```bash
CUDA_VISIBLE_DEVICES=${GPU_NUM} python ./eval/eval_savlm.py \
    --model_ckpt_path <path-to-ckpt-dir> \
    --eval_db_dir <path-to-SA-Bench-dir> \
    --output_dir <path-to-output-dir>
```
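For example, with a single GPU and placeholder paths (adjust these to your own checkpoint, benchmark, and output locations):

```bash
CUDA_VISIBLE_DEVICES=0 python ./eval/eval_savlm.py \
    --model_ckpt_path ./checkpoints/sa-vlm \
    --eval_db_dir ./data/SA-Bench \
    --output_dir ./results/sa-bench-eval
```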
Our experiments demonstrate that the space-aware instruction-tuned model:
- Provides more accurate and concise walking guidance compared to existing VLMs.
- Shows improved understanding of spatial relationships in complex environments.

For detailed results and analysis, please refer to the paper.
```bibtex
@misc{savlm_icra2025,
      title={Space-Aware Instruction Tuning: Dataset and Benchmark for Guide Dog Robots Assisting the Visually Impaired},
      author={ByungOk Han and Woo-han Yun and Beom-Su Seo and Jaehong Kim},
      year={2025},
      eprint={2502.07183},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2502.07183}
}
```
This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (RS-2023-00215760, Guide Dog: Development of Navigation AI Technology of a Guidance Robot for the Visually Impaired Person).
This research (paper) used datasets from ‘The Open AI Dataset Project (AI-Hub, S. Korea)’. All data information can be accessed through ‘AI-Hub (www.aihub.or.kr)’.
For questions or collaborations, please contact:
- ByungOk Han: [email protected]
- Woo-han Yun: [email protected]
We will add the license information later.