Single-Stream Multi-level Alignment for Vision-Language Pretraining

Abstract

Self-supervised vision-language pretraining from pure images and text with a contrastive loss is effective, but ignores fine-grained alignment due to a dual-stream architecture that aligns image and text representations only on a global level. Earlier, supervised, non-contrastive methods were capable of finer-grained alignment, but required dense annotations that were not scalable. We propose a single-stream architecture that aligns images and language at multiple levels: global, patch-token, and conceptual/semantic, using two novel tasks: symmetric cross-modality reconstruction (XMM) and pseudo-labeled key word prediction (PSL). In XMM, we mask input tokens from one modality and use cross-modal information to reconstruct the masked token, thus improving fine-grained alignment between the modalities. In PSL, we use attention to select keywords in a caption, use a momentum encoder to recommend other important keywords that are missing from the caption but represented in the image, and then train the visual encoder to predict the presence of those keywords, helping it learn semantic concepts that are essential for grounding a textual token to an image region. We demonstrate competitive performance and improved data efficiency on image-text retrieval, grounding, and question answering/reasoning against larger models and models trained on more data. Code available at zaidkhan.me/SIMLA.

Keywords: Vision-language modeling, Cross-modality learning
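The masking step that XMM builds on can be illustrated with a minimal sketch. It covers only the input preparation (hiding tokens from one modality and recording what was hidden), not the cross-modal reconstruction model itself. The function name `mask_tokens`, the `[MASK]` placeholder, the masking ratio, and the example caption are all assumptions made for illustration and are not taken from the paper or its code.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def mask_tokens(tokens, mask_ratio=0.15, seed=None):
    """Hide a random subset of tokens and remember what was hidden."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Always mask at least one position so there is something to reconstruct.
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]  # reconstruction target for this position
        masked[pos] = MASK
    return masked, targets


# Illustrative caption; the masking ratio here is an arbitrary choice.
caption = ["a", "dog", "catches", "a", "frisbee", "in", "the", "park"]
masked_caption, targets = mask_tokens(caption, mask_ratio=0.25, seed=0)
```

In a cross-modal setup along these lines, the masked caption would be passed to the model together with the unmasked image patches, and the model would be trained to predict the entries of `targets`. The symmetric case would mask image patches and reconstruct them with help from the text.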

Similar resources

Multi-level Language Descriptions

Language descriptions are a multi-level issue. In particular, the definition and handling of instantiation semantics connects three levels. This paper looks at two language workbenches without support for multi-level modelling and their handling of the multi-level part of language descriptions. From these observations, the importance of runtime instantiation in terms of an underlying machine is ...

Domain-specific language & support tools for high-level stream parallelism

Stream-based systems are representative of several different application domains including video, networking, audio, graphic processing, etc. Stream parallel programs may run on different kinds of parallel architectures (desktop, servers, cell phones, and supercomputers) and represent significant workloads on our current computing systems. Nevertheless, most of them are still not parallelized. ...

A parallel multi-stream model for sign language recognition

In this paper, the sub-units in each stream are used and embedded in the multi-stream model. In this framework, sign language recognition system was implemented and evaluated. Experiments were carried out for 5177 Chinese signs. The real time isolated recognition rate is 95.1%. For continuous sign recognition, the word correct rate is 91.8%. This has shown that parallel multi-stream model is po...

Multi-level annotation for spoken language corpora

The constitution of multi-level databases integrating, for example, both prosodic and morphosyntactic levels of representation presents a number of problems, some specific to the individual domains, and others concerning the integration of the two domains. It is argued that the formalism of annotation graphs provides an adequate solution to these problems, which can be implemented in an XML rep...

A Language for Configuring Multi-level Specifications

This paper shows how systems can be built from their component parts with specified sharing. Its principal contribution is a modular language for configuring systems. A configuration is a description in the new language of how a system is constructed hierarchically from specifications of its component parts. Category theory has been used to represent the composition of specifications that share...

Journal

Journal title: Lecture Notes in Computer Science

Year: 2022

ISSN: 1611-3349, 0302-9743

DOI: https://doi.org/10.1007/978-3-031-20059-5_42