Grid-based features have been proven to be as effective as region-based features in multi-modal tasks such as visual question answering. However, their application to image captioning encounters two main issues, namely, noisy and fragmented semantics. In this paper, we propose a novel feature selection scheme, with Relation-Aware Selection (RAS) and a Fine-grained Semantic Guidance (FSG) learning strategy. Based on the ...