For action recognition, 2D CNN-based methods are efficient but may yield redundant features because they apply the same convolution kernel to every frame. Recent efforts attempt to capture motion information by establishing inter-frame connections, yet they still suffer from a limited temporal receptive field or high latency. Moreover, feature enhancement is often performed only along the channel or spatial dimension ...