Att-Sinkhorn: Multimodal Alignment with Sinkhorn-based Deep Attention Architecture

Qianxia Ma, Ming Zhang, Yan Tang, Zhen Huang

Research output: Chapter in Book/Published conference output > Conference publication

Abstract

Multimodal alignment aims to establish a matching relationship between multimodal features, connecting parts of different modalities that carry the same or similar semantics. To increase the accuracy of this alignment, we propose the Att-Sinkhorn algorithm, a modality alignment method based on the Sinkhorn metric and the attention mechanism. The algorithm converts the alignment and matching problem between features of different modalities into the discrete Monge problem of optimal transport, which directly compares the distance between the probability distributions corresponding to the different modalities. To obtain a practical solution, the algorithm approximates the original discrete Monge problem by applying the Kantorovich relaxation with entropy regularization. The transformed alignment problem can be treated as a matrix scaling problem based on the principle of mass conservation, and it is solved iteratively with the Sinkhorn algorithm. To verify the effectiveness of the Att-Sinkhorn algorithm, we conduct experiments on image captioning, a typical multimodal alignment task that requires mapping between textual and visual information. Empirical results and analysis indicate that the Att-Sinkhorn algorithm is effective for multimodal alignment.
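
The matrix scaling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' Att-Sinkhorn architecture: it is only the standard Sinkhorn iteration for entropy-regularized optimal transport, applied to randomly generated stand-ins for text and image features. The function name sinkhorn_plan, the cosine-distance cost, the uniform marginals, and the values of eps and n_iters are all assumptions made for illustration.

```python
import numpy as np


def sinkhorn_plan(cost, a, b, eps=0.5, n_iters=500):
    """Approximate an entropy-regularized transport plan by matrix scaling.

    cost: (n, m) pairwise cost matrix between the two feature sets
    a, b: marginal weights for rows and columns, each summing to 1
    eps:  entropy regularization strength (illustrative value)
    """
    kernel = np.exp(-cost / eps)      # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (kernel @ v)          # rescale rows toward marginal a
        v = b / (kernel.T @ u)        # rescale columns toward marginal b
    # The plan is diag(u) @ kernel @ diag(v).
    return u[:, None] * kernel * v[None, :]


def unit_rows(x):
    """L2-normalize each row so that dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


# Hypothetical features: 4 text tokens and 6 image regions in an 8-dim space.
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(4, 8))
image_feats = rng.normal(size=(6, 8))

# Assumed cost: cosine distance between every token and every region.
cost = 1.0 - unit_rows(text_feats) @ unit_rows(image_feats).T

# Assumed uniform mass on each side (mass conservation: both sum to 1).
a = np.full(4, 1.0 / 4)
b = np.full(6, 1.0 / 6)

plan = sinkhorn_plan(cost, a, b)
print(plan.round(3))
```

Each entry plan[i, j] can be read as the amount of mass moved from token i to region j, so larger entries suggest a stronger soft alignment. Smaller eps values give sharper plans but need more iterations and, in practice, a log-domain implementation for numerical stability.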
Original language: English
Title of host publication: ICAC 2023 - 28th International Conference on Automation and Computing
Publisher: IEEE
ISBN (Electronic): 979-8-3503-3585-9
ISBN (Print): 979-8-3503-3586-6
Publication status: Published - 16 Oct 2023
Event: 2023 28th International Conference on Automation and Computing (ICAC) - Birmingham, United Kingdom
Duration: 30 Aug 2023 to 1 Sept 2023

Publication series

Name: 2023 28th International Conference on Automation and Computing (ICAC)
Publisher: IEEE

Conference

Conference: 2023 28th International Conference on Automation and Computing (ICAC)
Country/Territory: United Kingdom
City: Birmingham
Period: 30/08/23 to 1/09/23

Keywords

  • Multimodal machine learning
  • Sinkhorn algorithm
  • attention mechanism
  • modality alignment
  • visual captioning
