Illusory VQA: Benchmarking and Enhancing Multimodal Models on Visual Illusions

Rostamkhani, Mohammadmostafa; Ansari, Baktash; Sabzevari, Hoorieh; Rahmani, Farzan; Eetemadi, Sauleh


arXiv:2412.08169 (cs)

[Submitted on 11 Dec 2024]


Abstract: In recent years, Visual Question Answering (VQA) has made significant strides, particularly with the advent of multimodal models that integrate vision and language understanding. However, existing VQA datasets often overlook the complexities introduced by image illusions, which pose unique challenges for both human perception and model interpretation. In this study, we introduce a novel task called Illusory VQA, along with four specialized datasets: IllusionMNIST, IllusionFashionMNIST, IllusionAnimals, and IllusionChar. These datasets are designed to evaluate the performance of state-of-the-art multimodal models in recognizing and interpreting visual illusions. We assess the zero-shot performance of various models, fine-tune selected models on our datasets, and propose a simple yet effective solution for illusion detection using Gaussian and blur low-pass filters. We show that this method significantly improves model performance; in the case of BLIP-2 on IllusionAnimals, it outperforms humans without any fine-tuning. Our findings highlight the disparity between human and model perception of illusions and demonstrate that fine-tuning and specific preprocessing techniques can significantly enhance model robustness. This work contributes to the development of more human-like visual understanding in multimodal models and suggests future directions for adapting filters using learnable parameters.
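To make the preprocessing idea concrete, here is a minimal sketch of what applying a Gaussian or a box ("blur") low-pass filter to a grayscale image might look like before the image is passed to a model. It is not the authors' implementation: the function names, the kernel radius, the sigma value, the border clamping, and the toy checkerboard input are all assumptions chosen for illustration, and the abstract does not state the filter parameters used in the paper.

```python
import math


def gaussian_kernel(radius, sigma):
    """1-D Gaussian weights, normalised to sum to 1 (radius and sigma are assumed values)."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def box_kernel(radius):
    """1-D box (mean) blur weights, also summing to 1."""
    size = 2 * radius + 1
    return [1.0 / size] * size


def _convolve_rows(image, kernel):
    """Convolve every row with a 1-D kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    out = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xi = min(max(x + k - radius, 0), width - 1)
                acc += w * row[xi]
            new_row.append(acc)
        out.append(new_row)
    return out


def _transpose(image):
    return [list(col) for col in zip(*image)]


def low_pass(image, kernel):
    """Separable low-pass filter: filter along rows, then along columns."""
    blurred = _convolve_rows(image, kernel)
    return _transpose(_convolve_rows(_transpose(blurred), kernel))


# Toy 6x6 grayscale "image": a high-frequency checkerboard pattern.
checker = [[float((x + y) % 2) for x in range(6)] for y in range(6)]
smooth_gauss = low_pass(checker, gaussian_kernel(radius=2, sigma=1.0))
smooth_box = low_pass(checker, box_kernel(radius=1))
```

Both kernels average neighbouring pixels, which suppresses fine, high-frequency detail while leaving coarse structure in place. In practice an image library would be used instead of pure-Python loops; the loops here only keep the sketch dependency-free.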

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2412.08169 [cs.CV]
	(or arXiv:2412.08169v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.08169

Submission history

From: Mohammadmostafa Rostamkhani
[v1] Wed, 11 Dec 2024 07:51:18 UTC (19,249 KB)
