HybridPrompt: Bridging Language Models and Human Priors in Prompt Tuning for Visual Question Answering
Abstract
Visual Question Answering (VQA) aims to answer a natural language question about a given image by understanding multimodal content. However, the answering quality of most existing visual-language pre-training (VLP) methods is still limited, mainly due to: (1) Incompatibility. Upstream pre-training tasks are generally incompatible with the downstream answering task, so the knowledge in the pre-trained model does not transfer well, which greatly limits performance in few-shot scenarios. (2) Under-fitting. These methods do not integrate human priors to compensate for universal pre-trained models, and therefore struggle to fit the challenging VQA problem and generate reliable answers. To address these issues, we propose HybridPrompt, a cloze- and verify-style hybrid prompt framework that bridges language models and human priors in prompt tuning for VQA. Specifically, we first rewrite input questions as cloze-style prompts to narrow the gap with the upstream pre-training task, which ensures that pre-trained knowledge can be better transferred to the subsequent prior-guided prompt tuning. Then, imitating the cognitive process of the human brain, we introduce topic-related and sample-related priors to construct a dynamic learnable template for prompt learning. Finally, we add fixed-length learnable free parameters to further enhance the generalizability and scalability of prompt learning in the model. Experimental results verify the effectiveness of HybridPrompt, showing that it achieves competitive performance against previous methods on the widely-used VQAv2 dataset and obtains new state-of-the-art results. Our code is released at: https://github.com/zhizhi111/hybrid.
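The abstract names three ingredients: a cloze-style reformulation of the question, a dynamic template built from topic- and sample-related priors, and fixed-length learnable free parameters. The paper gives the exact formulation; the Python sketch below is only a minimal illustration of how such a hybrid prompt could be wired up for a masked-language-model VLP backbone. The template wording, the function and class names (build_cloze_prompt, FreePromptTokens), the token count, and the embedding dimension are assumptions made for illustration and are not taken from the paper or the released code.

# Minimal, hypothetical sketch of a cloze-style hybrid prompt (not the authors' implementation).
# Assumptions: a masked-LM style VLP backbone with a [MASK] answer slot and
# precomputed topic keywords per question; all names below are invented.
import torch
import torch.nn as nn

NUM_FREE_TOKENS = 4   # length of the learnable free-parameter prompt (a guess)
EMBED_DIM = 768       # typical VLP hidden size, assumed


def build_cloze_prompt(question: str, topic_words: list[str]) -> str:
    """Turn a question plus topic priors into a cloze-style statement.

    'What color is the cat?' ->
    'topic: color, cat. question: what color is the cat? answer: [MASK].'
    This rewriting rule is a toy heuristic, not the paper's template.
    """
    topic_part = "topic: " + ", ".join(topic_words) + ". "
    return topic_part + "question: " + question.lower() + " answer: [MASK]."


class FreePromptTokens(nn.Module):
    """Fixed-length learnable vectors concatenated in front of the text embeddings."""

    def __init__(self, num_tokens: int = NUM_FREE_TOKENS, dim: int = EMBED_DIM):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)

    def forward(self, text_embeds: torch.Tensor) -> torch.Tensor:
        # text_embeds: (batch, seq_len, dim) -> (batch, num_tokens + seq_len, dim)
        batch = text_embeds.size(0)
        free = self.tokens.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([free, text_embeds], dim=1)


if __name__ == "__main__":
    print(build_cloze_prompt("What color is the cat?", ["color", "cat"]))
    dummy_embeds = torch.zeros(2, 10, EMBED_DIM)
    print(FreePromptTokens()(dummy_embeds).shape)  # torch.Size([2, 14, 768])

In a full model, the prompt text would be tokenized and embedded by the VLP backbone, the free tokens prepended as above, and the distribution over the [MASK] position read out as the answer; only the free tokens and any prior-conditioned template parameters would be updated during prompt tuning.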
Similar sources
Comparing Improved Language Models for Sentence Retrieval in Question Answering
A retrieval system is a very important part of a question answering framework. It reduces the number of documents to be considered for finding an answer. For further refinement, the documents are split up into smaller chunks to deal with topic variability in larger documents. In our case, we divided the documents into single sentences. Then a language-model-based approach was used to re-rank th...
Recurrent and Contextual Models for Visual Question Answering
We propose a series of recurrent and contextual neural network models for multiple choice visual question answering on the Visual7W dataset. Motivated by divergent trends in model complexities in the literature, we explore the balance between model expressiveness and simplicity by studying incrementally more complex architectures. We start with LSTM-encoding of input questions and answers; buil...
Investigating Embedded Question Reuse in Question Answering
The investigation presented in this paper is a novel method in question answering (QA) that enables a QA system to gain performance through reuse of information in the answer to one question to answer another related question. Our analysis shows that a pair of questions in general open-domain QA can have an embedding relation through their mentions of noun phrase expressions. We present methods f...
High-Order Attention Models for Visual Question Answering
The quest for algorithms that enable cognitive abilities is an important part of machine learning. A common trait in many recently investigated cognitive-like tasks is that they take into account different data modalities, such as visual and textual input. In this paper we propose a novel and generally applicable form of attention mechanism that learns high-order correlations between various da...
Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering
A number of studies have found that today’s Visual Question Answering (VQA) models are heavily driven by superficial correlations in the training data and lack sufficient image grounding. To encourage development of models geared towards the latter, we propose a new setting for VQA where for every question type, train and test sets have different prior distributions of answers. Specifically, we...
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2023
ISSN: 2159-5399, 2374-3468
DOI: https://doi.org/10.1609/aaai.v37i11.26569