Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model
Authors
Abstract
With the emergence of large pre-trained vision-language models like CLIP, transferable representations can be adapted to a wide range of downstream tasks via prompt tuning. Prompt tuning probes beneficial information for downstream tasks from the general knowledge stored in the pre-trained model. A recently proposed method named Context Optimization (CoOp) introduces a set of learnable vectors as a text prompt on the language side. However, tuning the text prompt alone can only adjust the synthesized "classifier", while the visual features computed by the image encoder are not affected, thus leading to sub-optimal solutions. In this paper, we propose a novel Dual-modality Prompt Tuning (DPT) paradigm that learns text and visual prompts simultaneously. To make the final image feature concentrate more on the target visual concept, a Class-Aware Visual Prompt Tuning (CAVPT) scheme is further proposed in our DPT, where the class-aware visual prompt is generated dynamically by performing cross attention between text prompt features and image patch token embeddings, so as to encode both task-related and visual instance information. Extensive experimental results on 11 downstream datasets demonstrate the effectiveness and generalization ability of the proposed method. Our code is available at https://github.com/fanrena/DPT.
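The core of the CAVPT scheme described above is a cross-attention step in which text prompt features act as queries over image patch token embeddings. The following is a minimal numpy sketch of that idea only, not the paper's implementation; all shapes, names, and the single-head formulation are illustrative assumptions.

```python
import numpy as np

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: each query attends over all keys,
    # and the attention weights mix the corresponding values.
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

# Hypothetical shapes: 10 class text-prompt features attending over
# 49 image patch token embeddings, feature dimension 64.
rng = np.random.default_rng(0)
text_feats = rng.standard_normal((10, 64))
patch_tokens = rng.standard_normal((49, 64))

# Class-aware visual prompts: one per class, conditioned on the image.
visual_prompts = cross_attention(text_feats, patch_tokens, patch_tokens)
print(visual_prompts.shape)  # (10, 64)
```

Because the queries come from the text prompts and the keys/values come from the patch tokens, the resulting prompt tokens carry both class (task) information and instance-specific visual information, which is the stated motivation for CAVPT.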
Similar Resources
A Pre-Trained Ensemble Model for Breast Cancer Grade Detection Based on Small Datasets
Background and Purpose: Nowadays, breast cancer is reported as one of the most common cancers amongst women. Early detection of the cancer type is essential to aid in informing subsequent treatments. The newest proposed breast cancer detectors are based on deep learning. Most of these works focus on large datasets and are not developed for small datasets. Although the large datasets might lead ...
Self- or Pre-Tuning? Deep Linguistic Processing of Language Variants
This paper proposes a design strategy for deep language processing grammars to appropriately handle language variants. It allows a grammar to be restricted as to what language variant it is tuned to, but also to detect the variant a given input pertains to. This is evaluated and compared to results obtained with an alternative strategy by which the relevant variant is detected with current lang...
Modal Consistency based Pre-Trained Multi-Model Reuse
Multi-Model Reuse is one of the prominent problems in the Learnware [Zhou, 2016] framework, while the main issue of Multi-Model Reuse lies in acquiring the final prediction from the responses of multiple pre-trained models. Different from multi-classifier ensembles, only pre-trained models rather than the whole training sets are provided in the Multi-Model Reuse configuration. This configuration...
Dual-Modality, Dual-Functional Nanoprobes for Cellular and Molecular Imaging
An emerging need for evaluation of promising cellular therapies is a non-invasive method to image the movement and health of cells following transplantation. However, the use of a single modality to serve this purpose may not be advantageous as it may convey inaccurate or insufficient information. Multi-modal imaging strategies are becoming more popular for in vivo cellular and molecular imagin...
Receptive Field Encoding Model for Dynamic Natural Vision
Introduction: Encoding models are used to predict human brain activity in response to sensory stimuli. The purpose of these models is to explain how sensory information is represented in the brain. Convolutional neural networks trained on images are capable of encoding magnetic resonance imaging data of humans viewing natural images. Considering the hemodynamic response function, these networks are ...
Journal
Journal title: IEEE Transactions on Multimedia
Year: 2023
ISSN: 1520-9210, 1941-0077
DOI: https://doi.org/10.1109/tmm.2023.3291588