We focus on the confounding bias between language and location in the visual grounding pipeline, where we find that this bias is the major reasoning bottleneck. For example, the grounding process is usually a trivial language-location association without reasoning, e.g., grounding any query containing "sheep" to nearly central regions, because most queries about sheep have ground-truth locations at the image center. First, we frame the pipeline into a causal graph...