Abstract Visual representation learning is ubiquitous in real-world applications, including visual comprehension, video understanding, multi-modal analysis, human-computer interaction, and urban computing. Due to the emergence of huge amounts of multi-modal heterogeneous spatial/temporal/spatial-temporal data in the big data era, the lack of interpretability, robustness, and out-of-distribution generalization are...