S2H-DPO: Hardness-Aware Preference Optimization for Vision–Language Models

Nitish Shukla, Surgan Jandial, Arun Ross; Findings of ACL 2026

Research goal

Vision-Language Models are strong on single images but weak at reasoning across images. Existing multi-image alignment data leans on pre-specified indices (“Look at Image 3…”), which sidesteps the two skills that matter: global visual search and autonomous cross-image comparison. This work builds preference data that teaches those skills — without human annotation and without model-specific tricks.

How it works

A Simple-to-Hard (S2H) curriculum of preference pairs spanning three reasoning levels:

  • Level 1 — single-image localized. A VQA sample padded with distractor images; the model must ignore irrelevant context. Rejected answers come from the model’s own hallucinations.
  • Level 2 — multi-image localized. Explicit cross-image grounding via kinship recognition (asymmetric relational inference) and visual arithmetic (symmetric aggregation across images), with deterministically generated chosen/rejected pairs.
  • Level 3 — global visual search. Open-ended queries (“Caption the image containing a peacock”) force the model to scan all images before localizing. Negatives are plausible-but-wrong captions, then quality-filtered with CLIP/MPNet so the contrast stays meaningful.
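As a rough illustration of how pairs like these could be generated, here is a minimal Python sketch. It is not the authors' pipeline: the function names, the prompt wording, the +1 offset used for the rejected sum, and the similarity band are all assumptions made for illustration. In practice the vectors passed to keep_negative would come from a CLIP or MPNet encoder; plain lists stand in for them here.

```python
import math
import random

def pad_with_distractors(target_image, distractors, k, rng):
    """Level 1 (sketch): hide the relevant image among k distractor images."""
    images = rng.sample(distractors, k)   # pick k irrelevant images
    pos = rng.randrange(k + 1)            # slot where the relevant image goes
    images.insert(pos, target_image)
    return images, pos

def make_arithmetic_pair(counts):
    """Level 2 (sketch): chosen = true sum across images, rejected = off-by-one sum."""
    total = sum(counts)
    return {
        "prompt": "What is the sum of the numbers shown across all images?",  # hypothetical wording
        "chosen": str(total),
        "rejected": str(total + 1),       # assumed rule for the wrong answer
    }

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def keep_negative(chosen_vec, rejected_vec, low=0.3, high=0.9):
    """Level 3 (sketch): keep a negative caption only if it is neither a near-duplicate
    of the chosen caption nor unrelated to it. The band [low, high] is an assumption."""
    return low <= cosine(chosen_vec, rejected_vec) <= high
```

The band filter reflects one plausible reading of "the contrast stays meaningful": a rejected caption that is almost identical to the chosen one gives little signal, and one that is entirely unrelated is too easy to reject.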

Because pairs are prompt-driven, the method is model-agnostic: it needs no per-model hallucination or attention heuristics, and no new dataset has to be built for each model. Each level contributes 20K samples.

The three S2H levels escalate from single-image localization to multi-image comparison to global visual search, each yielding model-agnostic chosen/rejected preference pairs.
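For readers unfamiliar with how chosen/rejected pairs are used in training, the sketch below computes the standard DPO loss for a single pair from sequence log-probabilities. It assumes the vanilla DPO objective; the hardness-aware variant in the paper may differ. The function name, argument names, and the default beta are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (sketch).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # How much more the policy prefers chosen over rejected, relative to the reference.
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    # loss = -log(sigmoid(beta * margin)) = softplus(-beta * margin), computed stably.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy and reference agree exactly, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the chosen response.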

Key results

  • LLaVA-1.5-7B: +6.30 (BLINK), +6.03 (MANTIS), +3.49 (NLVR2) over baseline; beats the closest DPO baseline (MIA-DPO) on every benchmark.
  • Qwen2.5-VL-7B: +2.49 average across multi-image benchmarks; Qwen3-VL-2B: +1.77 — gains hold from 2B to 7B models.
  • Single-image ability is preserved (MMStar/POPE), so multi-image gains don’t cost core skills.
  • A surprising finding: for multi-image reasoning, flat training beats curriculum ordering (e.g., L2-flat 48.13% vs. L1→L2 44.83%), which suggests that gradual curricula induce myopic, locally anchored reasoning.
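To make the flat-versus-curriculum distinction concrete, the sketch below orders the same samples in two ways. It assumes "flat" means pooling the levels and shuffling them together, and "curriculum" means training on one level at a time in order; the summary above does not spell out these definitions, so treat both functions and their names as hypothetical.

```python
import random

def flat_schedule(levels, seed=0):
    """Pool every level and shuffle, so any batch can mix difficulties."""
    pooled = [sample for samples in levels.values() for sample in samples]
    random.Random(seed).shuffle(pooled)
    return pooled

def curriculum_schedule(levels, order, seed=0):
    """Visit levels one stage at a time in the given order, shuffling within each stage."""
    rng = random.Random(seed)
    schedule = []
    for name in order:
        stage = list(levels[name])   # copy so the caller's data is not reordered
        rng.shuffle(stage)
        schedule.extend(stage)
    return schedule
```

For example, with levels = {"L1": [...], "L2": [...]}, curriculum_schedule(levels, ["L1", "L2"]) emits every L1 sample before any L2 sample, while flat_schedule(levels) interleaves them.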

Citation

If you use this work, please cite:

@inproceedings{shukla2026s2hdpo,
  title={S2H-DPO: Hardness-Aware Preference Optimization for Vision--Language Models},
  author={Shukla, Nitish and Jandial, Surgan and Ross, Arun},
  booktitle={Findings of the Association for Computational Linguistics (ACL)},
  year={2026}
}