Abstract

Self-supervised vision-language pretraining from images and text alone with a contrastive loss is effective, but it ignores fine-grained alignment: the dual-stream architecture aligns image and text representations only at a global level. Earlier supervised, non-contrastive methods were capable of finer-grained alignment, but they required dense annotations that do not scale. We propose a single-stream langua...
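To make the global-alignment limitation concrete, below is a minimal sketch of a CLIP-style contrastive objective over pooled embeddings, the kind of loss the abstract refers to. It is illustrative only, not the paper's method: the function name, the temperature value, and the assumption that each encoder outputs a single pooled vector per image or caption are ours. The key point is that the loss sees one vector per modality, so no token-to-region correspondence is ever supervised.

```python
import torch
import torch.nn.functional as F

def global_contrastive_loss(image_emb: torch.Tensor,
                            text_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE loss over pooled (global) embeddings.

    image_emb, text_emb: (batch, dim) -- one vector per whole image
    and per whole caption, so alignment is supervised only at the
    global level; fine-grained (patch/word) structure is invisible
    to this objective.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise cosine similarities between every image and every caption.
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: the i-th image should match the i-th
    # caption (rows) and vice versa (columns).
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Usage with random stand-in embeddings:
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
loss = global_contrastive_loss(img, txt)
```

Because the batch-level similarity matrix is built from single pooled vectors, any finer-grained alignment (which words describe which regions) would require either dense annotations, as in the earlier supervised methods, or an architecture that lets image and text tokens interact directly, which motivates the single-stream approach proposed here.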