@inproceedings{wang-etal-2025-getting,
title = "Getting More Juice Out of Your Data: Hard Pair Refinement Enhances Visual-Language Models Without Extra Data",
author = "Wang, Haonan and
Huang, Minbin and
Huang, Runhui and
Hong, Lanqing and
Xu, Hang and
Hu, Tianyang and
Liang, Xiaodan and
Li, Zhenguo and
Cheng, Hong and
Kawaguchi, Kenji",
editor = "Chiruzzo, Luis and
Ritter, Alan and
Wang, Lu",
booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
month = apr,
year = "2025",
address = "Albuquerque, New Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.naacl-long.399/",
doi = "10.18653/v1/2025.naacl-long.399",
pages = "7854--7873",
ISBN = "979-8-89176-189-6",
abstract = "Contrastive Language-Image Pre-training (CLIP) has become the standard for cross- modal image-text representation learning. Improving CLIP typically requires additional data and retraining with new loss functions, but these demands raise resource and time costs, limiting practical use. In this work, we introduce HELIP, a cost-effective strategy that improves CLIP models by exploiting challenging text-image pairs within existing datasets in continuous training. This eliminates the need for additional data or extensive retraining. Moreover, HELIP integrates effortlessly into current training pipelines with minimal code modifications, allowing for quick and seamless implementation. On comprehensive benchmarks, HELIP consistently boosts existing models. In particular, within just two epochs of training, it improves zero-shot classification accuracy on ImageNet for SLIP models pre-trained on CC3M, CC12M, and YFCC15M datasets by 3.05{\%}, 4.47{\%}, and 10.1{\%} , respectively. In addition, on fine-grained classification datasets, HELIP improves the zero-shot performance of CLIP and SLIP by an average of 8.4{\%} and 18.6{\%}, and their linear probe performance by an average of 9.5{\%} and 3.0{\%}."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="wang-etal-2025-getting">
<titleInfo>
<title>Getting More Juice Out of Your Data: Hard Pair Refinement Enhances Visual-Language Models Without Extra Data</title>
</titleInfo>
<name type="personal">
<namePart type="given">Haonan</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Minbin</namePart>
<namePart type="family">Huang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Runhui</namePart>
<namePart type="family">Huang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Lanqing</namePart>
<namePart type="family">Hong</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hang</namePart>
<namePart type="family">Xu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Tianyang</namePart>
<namePart type="family">Hu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Xiaodan</namePart>
<namePart type="family">Liang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zhenguo</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hong</namePart>
<namePart type="family">Cheng</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Kenji</namePart>
<namePart type="family">Kawaguchi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-04</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Luis</namePart>
<namePart type="family">Chiruzzo</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Alan</namePart>
<namePart type="family">Ritter</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Lu</namePart>
<namePart type="family">Wang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Albuquerque, New Mexico</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-189-6</identifier>
</relatedItem>
<abstract>Contrastive Language-Image Pre-training (CLIP) has become the standard for cross-modal image-text representation learning. Improving CLIP typically requires additional data and retraining with new loss functions, but these demands raise resource and time costs, limiting practical use. In this work, we introduce HELIP, a cost-effective strategy that improves CLIP models by exploiting challenging text-image pairs within existing datasets in continuous training. This eliminates the need for additional data or extensive retraining. Moreover, HELIP integrates effortlessly into current training pipelines with minimal code modifications, allowing for quick and seamless implementation. On comprehensive benchmarks, HELIP consistently boosts existing models. In particular, within just two epochs of training, it improves zero-shot classification accuracy on ImageNet for SLIP models pre-trained on CC3M, CC12M, and YFCC15M datasets by 3.05%, 4.47%, and 10.1%, respectively. In addition, on fine-grained classification datasets, HELIP improves the zero-shot performance of CLIP and SLIP by an average of 8.4% and 18.6%, and their linear probe performance by an average of 9.5% and 3.0%.</abstract>
<identifier type="citekey">wang-etal-2025-getting</identifier>
<identifier type="doi">10.18653/v1/2025.naacl-long.399</identifier>
<location>
<url>https://aclanthology.org/2025.naacl-long.399/</url>
</location>
<part>
<date>2025-04</date>
<extent unit="page">
<start>7854</start>
<end>7873</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Getting More Juice Out of Your Data: Hard Pair Refinement Enhances Visual-Language Models Without Extra Data
%A Wang, Haonan
%A Huang, Minbin
%A Huang, Runhui
%A Hong, Lanqing
%A Xu, Hang
%A Hu, Tianyang
%A Liang, Xiaodan
%A Li, Zhenguo
%A Cheng, Hong
%A Kawaguchi, Kenji
%Y Chiruzzo, Luis
%Y Ritter, Alan
%Y Wang, Lu
%S Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
%D 2025
%8 April
%I Association for Computational Linguistics
%C Albuquerque, New Mexico
%@ 979-8-89176-189-6
%F wang-etal-2025-getting
%X Contrastive Language-Image Pre-training (CLIP) has become the standard for cross-modal image-text representation learning. Improving CLIP typically requires additional data and retraining with new loss functions, but these demands raise resource and time costs, limiting practical use. In this work, we introduce HELIP, a cost-effective strategy that improves CLIP models by exploiting challenging text-image pairs within existing datasets in continuous training. This eliminates the need for additional data or extensive retraining. Moreover, HELIP integrates effortlessly into current training pipelines with minimal code modifications, allowing for quick and seamless implementation. On comprehensive benchmarks, HELIP consistently boosts existing models. In particular, within just two epochs of training, it improves zero-shot classification accuracy on ImageNet for SLIP models pre-trained on CC3M, CC12M, and YFCC15M datasets by 3.05%, 4.47%, and 10.1%, respectively. In addition, on fine-grained classification datasets, HELIP improves the zero-shot performance of CLIP and SLIP by an average of 8.4% and 18.6%, and their linear probe performance by an average of 9.5% and 3.0%.
%R 10.18653/v1/2025.naacl-long.399
%U https://aclanthology.org/2025.naacl-long.399/
%U https://doi.org/10.18653/v1/2025.naacl-long.399
%P 7854-7873
Markdown (Informal)
[Getting More Juice Out of Your Data: Hard Pair Refinement Enhances Visual-Language Models Without Extra Data](https://aclanthology.org/2025.naacl-long.399/) (Wang et al., NAACL 2025)
ACL
Haonan Wang, Minbin Huang, Runhui Huang, Lanqing Hong, Hang Xu, Tianyang Hu, Xiaodan Liang, Zhenguo Li, Hong Cheng, and Kenji Kawaguchi. 2025. Getting More Juice Out of Your Data: Hard Pair Refinement Enhances Visual-Language Models Without Extra Data. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7854–7873, Albuquerque, New Mexico. Association for Computational Linguistics.
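
The abstract describes HELIP only at a high level: mine "challenging" text-image pairs from the existing training set and reuse them during continued contrastive training, with no extra data. As a rough illustration of that idea, here is a minimal PyTorch sketch. The joint pair-similarity heuristic, the function names, the in-batch (rather than whole-dataset) mining, and the loss (standard symmetric InfoNCE with mined pairs appended as extra negatives) are all assumptions made for this sketch, not the paper's exact method.

```python
import torch
import torch.nn.functional as F

def mine_hard_pairs(img_feats: torch.Tensor, txt_feats: torch.Tensor, k: int = 5) -> torch.Tensor:
    """For each pair i, return indices of the k pairs whose image AND text
    embeddings are both close to pair i's -- one simple proxy for the
    'challenging' pairs the abstract mentions (an assumption, not the
    paper's definition)."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    # Joint similarity is high only when both modalities agree.
    sim = (img @ img.T) * (txt @ txt.T)
    sim.fill_diagonal_(float("-inf"))   # a pair is not its own hard pair
    return sim.topk(k, dim=-1).indices  # shape (N, k)

def clip_loss_with_hard_negatives(img_feats, txt_feats, hard_img, hard_txt,
                                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over the batch, with mined hard pairs appended as
    extra in-batch negatives (a hypothetical variant, not the paper's loss)."""
    img = F.normalize(torch.cat([img_feats, hard_img]), dim=-1)
    txt = F.normalize(torch.cat([txt_feats, hard_txt]), dim=-1)
    logits = img @ txt.T / temperature
    n = img_feats.size(0)
    targets = torch.arange(n, device=logits.device)
    # Only the original n pairs act as anchors; hard pairs only add negatives.
    loss_i2t = F.cross_entropy(logits[:n], targets)
    loss_t2i = F.cross_entropy(logits.t()[:n], targets)
    return 0.5 * (loss_i2t + loss_t2i)

if __name__ == "__main__":
    n, d = 32, 512  # toy batch of CLIP-style embeddings
    img, txt = torch.randn(n, d), torch.randn(n, d)
    flat = mine_hard_pairs(img, txt, k=4).reshape(-1)
    print(clip_loss_with_hard_negatives(img, txt, img[flat], txt[flat]).item())
```

In this sketch the hard pairs are mined within a single batch for simplicity; per the abstract, HELIP mines them from the full existing dataset, so a faithful implementation would precompute embeddings over the whole training set and look up each sample's hard pairs during continued training.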