Transformer-based pre-trained language models such as BERT and its variants have recently achieved promising performance on various natural language processing (NLP) tasks. However, the conventional paradigm constructs the backbone by purely stacking manually designed global self-attention layers, introducing an inductive bias and thus leading to sub-optimal performance. In this work, we make the first attempt to automatically discover n...