FashionCLIP is a domain-adapted CLIP model fine-tuned for the fashion industry, enabling zero-shot classification and retrieval of fashion products. Developed by Patrick John Chia and collaborators, it builds on the CLIP ViT-B/32 architecture and was trained on over 800K image-text pairs from the Farfetch dataset. Contrastive learning aligns product images with their descriptive text, so the model performs well on a range of fashion-related tasks without additional supervision.

FashionCLIP 2.0, the latest version, starts from the laion/CLIP-ViT-B-32-laion2B-s34B-b79K checkpoint and achieves higher F1 scores across multiple benchmarks than earlier versions. The model is trained on English captions and works best with clean, product-style images against white backgrounds. Typical applications include product search, recommendation systems, and visual tagging in e-commerce platforms.
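As a minimal sketch of zero-shot classification, the snippet below loads the model through the standard Hugging Face CLIP interface and scores a product photo against a few candidate labels. The checkpoint identifier patrickjohncinco/fashion-clip and the file name product.jpg are assumptions; check the model's Hugging Face Hub page for the exact repository name.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint id; verify the exact name on the Hugging Face Hub.
MODEL_ID = "patrickjohncinco/fashion-clip"

model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

image = Image.open("product.jpg")  # ideally a clean, white-background product shot
labels = ["a red evening dress", "a pair of running shoes", "a leather handbag"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```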
Features
- Fine-tuned on 800K+ fashion product image-text pairs
- Built on CLIP ViT-B/32 architecture for vision-language alignment
- Supports zero-shot fashion classification and retrieval
- Improved accuracy in FashionCLIP 2.0 using a stronger pretrained checkpoint
- Trained on English captions with fashion-specific descriptions
- Optimized for clean, centered product images with white backgrounds
- Outputs cross-modal similarity scores between images and text
- Compatible with Hugging Face Transformers and ONNX for deployment
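Because the image and text encoders can be called separately, the same checkpoint can also back a simple text-to-image product search: embed the catalog once, embed the query at run time, and rank by cosine similarity. The sketch below is a minimal illustration using the Transformers API, with a hypothetical three-item catalog and the same assumed checkpoint identifier as above.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint id; substitute the exact FashionCLIP repository name from the Hub.
MODEL_ID = "patrickjohncinco/fashion-clip"

model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

# Hypothetical catalog of product images to index.
catalog_paths = ["dress_01.jpg", "sneaker_02.jpg", "handbag_03.jpg"]
images = [Image.open(p) for p in catalog_paths]

with torch.no_grad():
    # Encode and L2-normalize the catalog images.
    image_inputs = processor(images=images, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

    # Encode and L2-normalize the text query.
    query = "a floral summer dress"
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Cosine similarity between the query and each catalog image.
scores = (text_emb @ image_emb.T).squeeze(0)
best = scores.argmax().item()
print(f"Best match: {catalog_paths[best]} (score={scores[best]:.3f})")
```

In a production system the normalized image embeddings would typically be precomputed and stored in a vector index, with only the query embedded per request.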