Efficiently modeling spatial–temporal information in videos is crucial for action recognition. To achieve this goal, state-of-the-art methods typically employ the convolution operator and dense interaction modules such as non-local blocks. However, these methods cannot accurately fit the diverse events in videos. On the one hand, the adopted convolutions have fixed scales and thus struggle with events of various scales. On the other...