Abstract With the urgent demand for generalized deep models, many pre-trained big models have been proposed, such as bidirectional encoder representations from transformers (BERT), vision transformer (ViT), generative pre-trained transformers (GPT), etc. Inspired by the success of these models in single domains (like computer vision and natural language processing), multi-modal pre-trained big models have also drawn increasing attention in recent years. In this work, we give a c...