Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model

Authors

Abstract

With the emergence of large pre-trained vision-language models like CLIP, transferable representations can be adapted to a wide range of downstream tasks via prompt tuning. Prompt tuning tries to probe beneficial information for downstream tasks from the general knowledge stored in the pre-trained model. A recently proposed method named Context Optimization (CoOp) introduces a set of learnable vectors as the text prompt on the language side. However, tuning the text prompt alone can only adjust the synthesized "classifier", while the visual features computed by the image encoder are not affected, thus leading to sub-optimal solutions. In this paper, we propose a novel Dual-modality Prompt Tuning (DPT) paradigm through learning text and visual prompts simultaneously. To make the final image feature concentrate more on the target visual concept, a Class-Aware Visual Prompt Tuning (CAVPT) scheme is further proposed in our DPT, where the class-aware visual prompt is generated dynamically by performing cross attention between text prompt features and image patch token embeddings to encode both task-related and visual instance information. Extensive experimental results on 11 datasets demonstrate the effectiveness and generalization ability of our method. Our code is available at https://github.com/fanrena/DPT.
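The abstract describes how the class-aware visual prompts are produced: text prompt features cross-attend over the image patch token embeddings. The snippet below is a minimal, illustrative sketch of that idea only, not the released DPT implementation; the module name, dimensions, and the use of torch.nn.MultiheadAttention are assumptions made for illustration (the actual design is in the official code at https://github.com/fanrena/DPT).

```python
import torch
import torch.nn as nn


class ClassAwareVisualPromptGenerator(nn.Module):
    """Illustrative sketch (not the official DPT code): generate class-aware
    visual prompts by letting text prompt features cross-attend over image
    patch token embeddings, as described in the abstract."""

    def __init__(self, text_dim: int = 512, patch_dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Project text-side prompt features into the image-encoder embedding space.
        self.text_proj = nn.Linear(text_dim, patch_dim)
        # Cross attention: projected text features act as queries over patch tokens.
        self.cross_attn = nn.MultiheadAttention(patch_dim, num_heads, batch_first=True)

    def forward(self, text_prompt_feats: torch.Tensor, patch_tokens: torch.Tensor) -> torch.Tensor:
        # text_prompt_feats: (batch, num_classes, text_dim)  class-wise text features
        # patch_tokens:      (batch, num_patches, patch_dim)  ViT patch embeddings
        queries = self.text_proj(text_prompt_feats)
        # Each class-specific query gathers task- and instance-related visual evidence.
        class_aware_prompts, _ = self.cross_attn(queries, patch_tokens, patch_tokens)
        return class_aware_prompts  # to be prepended to the visual prompt/patch sequence


if __name__ == "__main__":
    # Toy shapes only: 2 images, 10 classes, a 14x14 ViT patch grid.
    generator = ClassAwareVisualPromptGenerator()
    text_feats = torch.randn(2, 10, 512)
    patches = torch.randn(2, 196, 768)
    print(generator(text_feats, patches).shape)  # torch.Size([2, 10, 768])
```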


Similar Articles

A Pre-Trained Ensemble Model for Breast Cancer Grade Detection Based on Small Datasets

Background and Purpose: Breast cancer is reported as one of the most common cancers among women. Early detection of the cancer type is essential to inform subsequent treatment. The newest proposed breast cancer detectors are based on deep learning. Most of these works focus on large datasets and are not developed for small datasets. Although the large datasets might lead ...


Self- or Pre-Tuning? Deep Linguistic Processing of Language Variants

This paper proposes a design strategy for deep language processing grammars to appropriately handle language variants. It allows a grammar to be restricted as to what language variant it is tuned to, but also to detect the variant a given input pertains to. This is evaluated and compared to results obtained with an alternative strategy by which the relevant variant is detected with current lang...


Modal Consistency based Pre-Trained Multi-Model Reuse

Multi-Model Reuse is one of the prominent problems in the Learnware [Zhou, 2016] framework, and its main issue lies in acquiring the final prediction from the responses of multiple pre-trained models. Unlike multi-classifier ensembles, only pre-trained models, rather than the whole training sets, are provided in the Multi-Model Reuse configuration. This configuration...


Dual-Modality, Dual-Functional Nanoprobes for Cellular and Molecular Imaging

An emerging need for evaluation of promising cellular therapies is a non-invasive method to image the movement and health of cells following transplantation. However, the use of a single modality to serve this purpose may not be advantageous as it may convey inaccurate or insufficient information. Multi-modal imaging strategies are becoming more popular for in vivo cellular and molecular imagin...


Receptive Field Encoding Model for Dynamic Natural Vision

Introduction: Encoding models are used to predict human brain activity in response to sensory stimuli. The purpose of these models is to explain how sensory information is represented in the brain. Convolutional neural networks trained on images are capable of encoding magnetic resonance imaging data of humans viewing natural images. Considering the hemodynamic response function, these networks are ...


Journal

Journal title: IEEE Transactions on Multimedia

Year: 2023

ISSN: 1520-9210, 1941-0077

DOI: https://doi.org/10.1109/tmm.2023.3291588