Generating natural sentences from images is a fundamental learning task for visual-semantic understanding in multimedia. In this paper, we propose to apply dual attention on pyramid image feature maps to fully explore the correlations and improve the quality of generated sentences. Specifically, with full consideration of the contextual information provided by the hidden state of the RNN controller, the attention mechanism can better localize...