Personalized Vision via Visual In-Context Learning

1 Show Lab, National University of Singapore    2 A*STAR   
arXiv 2025
PICO teaser

Predefined vs. Personalized Vision. Top: traditional, predefined tasks. Bottom: personalized tasks enabled by PICO. Given a new exemplar pair (A → A′) and a query image B, our model infers the task in-context and produces B′, adapting to novel user-defined tasks at test time.
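To make the prompting format concrete, below is a minimal sketch of how the exemplar pair (A → A′) and the query B might be laid out as a four-panel canvas, with the fourth panel left blank for the model to complete. This is a hypothetical illustration based on the figure description, not the authors' released code; `compose_four_panel` and the panel size are assumptions.

```python
# Hypothetical sketch of the four-panel prompt layout described in the teaser.
# The exemplar pair (A, A') and the query B fill three panels; the fourth
# panel is left blank for the model to fill in with B'.
from PIL import Image

def compose_four_panel(a: Image.Image, a_prime: Image.Image,
                       b: Image.Image, panel: int = 512) -> Image.Image:
    """Arrange [A | A'] on the top row and [B | blank] on the bottom row."""
    canvas = Image.new("RGB", (2 * panel, 2 * panel), color=(127, 127, 127))
    canvas.paste(a.resize((panel, panel)), (0, 0))            # top-left:  A
    canvas.paste(a_prime.resize((panel, panel)), (panel, 0))  # top-right: A'
    canvas.paste(b.resize((panel, panel)), (0, panel))        # bottom-left: B
    # Bottom-right panel stays blank; the model completes it as B'.
    return canvas
```

The model is then asked to generate the missing bottom-right quadrant, using the other three panels as in-context evidence of the transformation.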

Abstract

Modern vision models, trained on large-scale annotated datasets, excel at predefined tasks but struggle with personalized vision—tasks defined at test time by users with customized objects or novel objectives.

Existing personalization approaches rely on costly fine-tuning or synthetic data pipelines, which are inflexible and restricted to fixed task formats. Visual in-context learning (ICL) offers a promising alternative, yet prior methods are confined to narrow, in-domain tasks and fail to generalize to open-ended personalization.

We introduce Personalized In-Context Operator (PICO), a simple four-panel framework that repurposes diffusion transformers as visual in-context learners. Given a single annotated exemplar, PICO infers the underlying transformation and applies it to new inputs without retraining. To enable this, we construct VisRel, a compact yet diverse tuning dataset, showing that task diversity, rather than scale, drives robust generalization. We further propose an attention-guided seed scorer that improves reliability via efficient inference scaling.
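The attention-guided seed scorer amounts to best-of-N inference scaling: several candidates are sampled with different seeds and ranked by a score, and the best one is returned. The sketch below, written under that assumption, uses a hypothetical `score_candidate` stand-in for the attention-guided scoring; the actual scoring rule is specified in the paper, not here.

```python
# Hedged sketch of inference scaling with a seed scorer (best-of-N sampling).
# `generate` runs one diffusion pass over the four-panel prompt; `score_candidate`
# is a hypothetical stand-in for the attention-guided scorer.
import torch

def best_of_n(generate, score_candidate, prompt_panels, n_seeds: int = 8):
    """Generate one B' candidate per seed and return the highest-scoring one."""
    best_img, best_score = None, float("-inf")
    for seed in range(n_seeds):
        torch.manual_seed(seed)                 # fix the sampling seed
        candidate = generate(prompt_panels)     # one denoising run per seed
        score = score_candidate(candidate, prompt_panels)
        if score > best_score:
            best_img, best_score = candidate, score
    return best_img
```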

Extensive experiments demonstrate that PICO (i) surpasses fine-tuning and synthetic-data baselines, (ii) flexibly adapts to novel user-defined tasks, and (iii) generalizes across both recognition and generation.

🚧 To Be Continued

This project page is still under construction. More demos, results, and explanations will be added soon — stay tuned!

BibTeX

@article{jiang2025personalizedvisionvisualincontext,
  title   = {Personalized Vision via Visual In-Context Learning},
  author  = {Yuxin Jiang and Yuchao Gu and Yiren Song and Ivor Tsang and Mike Zheng Shou},
  journal = {arXiv preprint arXiv:2509.25172},
  year    = {2025}
}