Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts

نویسندگان

  • Raymond Yeh
  • Jinjun Xiong
  • Wen-Mei W. Hwu
  • Minh Do
  • Alexander G. Schwing
چکیده

Textual grounding is an important but challenging task for human-computer interaction, robotics and knowledge mining. Existing algorithms generally formulate the task as selection from a set of bounding box proposals obtained from deep net based systems. In this work, we demonstrate that we can cast the problem of textual grounding into a unified framework that permits efficient search over all possible bounding boxes. Hence, the method is able to consider significantly more proposals and doesn’t rely on a successful first stage hypothesizing bounding box proposals. Beyond, we demonstrate that the trained parameters of our model can be used as word-embeddings which capture spatial-image relationships and provide interpretability. Lastly, at the time of submission, our approach outperformed the current state-of-the-art methods on the Flickr 30k Entities and the ReferItGame dataset by 3.08% and 7.77% respectively.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Unsupervised Textual Grounding: Linking Words to Image Concepts

Textual grounding, i.e., linking words to objects in images, is a challenging but important task for robotics and human-computer interaction. Existing techniques benefit from recent progress in deep learning and generally formulate the task as a supervised learning problem, selecting a bounding box from a set of possible options. To train these deep net based approaches, access to a large-scale...

متن کامل

An Attention-based Regression Model for Grounding Textual Phrases in Images

Grounding, or localizing, a textual phrase in an image is a challenging problem that is integral to visual language understanding. Previous approaches to this task typically make use of candidate region proposals, where end performance depends on that of the region proposal method and additional computational costs are incurred. In this paper, we treat grounding as a regression problem and prop...

متن کامل

Learning beautiful (and ugly) attributes

Current approaches to aesthetic image analysis either provide accurate or interpretable results. To get both accuracy and interpretability, we advocate the use of learned visual attributes as mid-level features. For this purpose, we propose to discover and learn the visual appearance of attributes automatically, using the recently introduced AVA database which contains more than 250,000 images ...

متن کامل

Grounding of Textual Phrases in Images by Reconstruction

Grounding (i.e. localizing) arbitrary, free-form textual phrases in visual content is a challenging problem with many applications for human-computer interaction and image-text reference resolution. Although many data sources contain images which are described with sentences or phrases, they typically do not provide the spatial localization of the phrases. This is true for both curated datasets...

متن کامل

Globally Optimal 3D Image Reconstruction and Segmentation Via Energy Minimisation Techniques

This paper provides an overview of a number of techniques developed within our group to perform 3D reconstruction and image segmentation based of the application of energy minimisation concepts. We begin with classical snake techniques and show how similar energy minimisation concepts can be extended to derive globally optimal segmentation methods. Then we discuss more recent work based on geod...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2017