Given an input video, its associated audio, and a brief caption, the audio-visual scene aware dialog (AVSD) task requires agent to indulge in question-answer with human about content. This thus poses challenging multi-modal representation learning reasoning scenario, advancements into which could influence several human-machine interaction applications. To solve this task, we introduce semantic...