Abstract Spatial relations are a basic part of human cognition. However, they are expressed in natural language in a variety of ways, and previous work has suggested that current vision-and-language models (VLMs) struggle to capture relational information. In this paper, we present Visual Spatial Reasoning (VSR), a dataset containing more than 10k text-image pairs with 66 types of spatial relations in English (e.g., under, in front of, ...