With the growth of remote sensing images, un-derstanding image content automatically has attracted many researchers' interests in deep learning for image. Inspired from natural captioning, model with CNN-RNN as backbone and supplemented by attention been widely used captioning. However, it is inefficient current layer to simultaneously mine hidden foreground background perform feature interacti...