OmniFashion: Towards Generalist Fashion Intelligence via Multi-Task Vision-Language Learning
arXiv ID: 2603.02658v1
Published: 2026-03-03
Authors: Zhengwei Yang, Andi Long, Hao Li, Zechao Hu, Kui Jiang, Zheng Wang
Categories: cs.CV
Abstract
Fashion intelligence spans multiple tasks, such as retrieval, recommendation, recognition, and dialogue, yet remains hindered by fragmented supervision and incomplete fashion annotations. These limitations jointly restrict the formation of consistent visual-semantic structures, preventing recent vision-language models (VLMs) from serving as a generalist fashion brain that unifies understanding and reasoning across tasks. We therefore construct FashionX, a million-scale dataset that exhaustively annotates the visible fashion items within an outfit and organizes attributes from the global to the part level. Built on this foundation, we propose OmniFashion, a unified vision-language framework that bridges diverse fashion tasks under a single fashion-dialogue paradigm, enabling both multi-task reasoning and interactive dialogue. Experiments on multiple subtasks and retrieval benchmarks show that OmniFashion achieves strong task-level accuracy and cross-task generalization, offering a scalable path toward universal, dialogue-oriented fashion intelligence.
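To make the unified fashion-dialogue paradigm concrete, the minimal sketch below illustrates one way heterogeneous fashion tasks could be cast as dialogue-style training examples. The FashionSample structure, task names, and prompt templates are illustrative assumptions for exposition, not the paper's actual interface.

```python
from dataclasses import dataclass

# Hypothetical illustration: the structure, task names, and templates
# below are assumptions, not OmniFashion's actual interface.
@dataclass
class FashionSample:
    image_path: str
    task: str      # "retrieval" | "recognition" | "recommendation" | "dialogue"
    payload: str   # task-specific text: a query, attribute label, user turn, ...

# One template per task maps each sample into a shared dialogue format.
TEMPLATES = {
    "retrieval": "Find items matching: {payload}",
    "recognition": "Describe the attributes of the item in the image.",
    "recommendation": "Suggest items to complete this outfit. Context: {payload}",
    "dialogue": "{payload}",
}

def to_dialogue_turn(sample: FashionSample) -> dict:
    """Cast any fashion task as a single-turn, image-grounded dialogue example."""
    prompt = TEMPLATES[sample.task].format(payload=sample.payload)
    return {"image": sample.image_path, "user": prompt}

if __name__ == "__main__":
    s = FashionSample("outfit_001.jpg", "recommendation", "navy blazer, white shirt")
    print(to_dialogue_turn(s))
```

Under this framing, a single VLM can be trained on all tasks at once, since every supervision signal arrives as the same (image, instruction, response) triple.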