Extractive summarization of documents with images based on multi-modal RNN

Jingqiang Chen, Hai Zhuge*

*Corresponding author for this work

Research output: Contribution to journal › Article

Abstract

The rapid growth of multi-modal documents containing images on the Internet creates a strong demand for multi-modal summarization. The challenge is to create a computing method that can process text and images uniformly. Deep learning provides basic models for meeting this challenge. This paper treats extractive multi-modal summarization as a classification problem and proposes a sentence–image classification method based on a multi-modal RNN model. The method encodes words and sentences with hierarchical RNN models, encodes the ordered image set with a CNN model and an RNN model, and then calculates the selection probability of each sentence and the sentence–image alignment probability through a logistic classifier that takes text coverage, text redundancy, image set coverage, and image set redundancy as features. Two methods are proposed to compute the image set redundancy feature by combining the importance scores of sentences with the hidden sentence–image alignments. Experiments on the extended DailyMail corpus, constructed by collecting images and captions from the Web, show that the method outperforms 11 baseline text summarization methods and that adopting the two image-related features in the classification method improves text summarization. The method is able to mine the hidden sentence–image alignments and to create informative, well-aligned multi-modal summaries.
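
For readers who want a concrete picture of the pipeline the abstract describes, below is a minimal PyTorch sketch of one plausible reading: a hierarchical GRU text encoder (words to sentence vectors, sentence vectors to a document vector), a GRU over precomputed per-image CNN features, and a logistic classifier over the four feature scores. Every module choice, dimension, and feature formula here is an illustrative assumption, not the authors' implementation.

import torch
import torch.nn as nn

class MultiModalExtractor(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256, img_feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Hierarchical text encoder: word-level GRU, then sentence-level GRU.
        self.word_rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.sent_rnn = nn.GRU(hid_dim, hid_dim, batch_first=True)
        # Image-set encoder: a GRU over per-image CNN features
        # (e.g. pooled activations computed offline by any image CNN).
        self.img_rnn = nn.GRU(img_feat_dim, hid_dim, batch_first=True)
        # Logistic classifier over the four feature scores:
        # text coverage, text redundancy, image-set coverage, image-set redundancy.
        self.classifier = nn.Linear(4, 1)

    def forward(self, sent_word_ids, img_feats):
        # sent_word_ids: (S, W) word ids per sentence; img_feats: (I, D) CNN features per image
        emb = self.embed(sent_word_ids)                   # (S, W, E)
        _, wh = self.word_rnn(emb)                        # (1, S, H): last hidden state per sentence
        sent_vecs = wh.squeeze(0)                         # (S, H)
        _, dh = self.sent_rnn(sent_vecs.unsqueeze(0))     # document-level encoding
        doc_vec = dh.view(-1)                             # (H,)
        img_seq, ih = self.img_rnn(img_feats.unsqueeze(0))
        img_vecs = img_seq.squeeze(0)                     # (I, H) per-image states
        img_set_vec = ih.view(-1)                         # (H,) whole image set

        probs = []
        txt_state = torch.zeros_like(doc_vec)   # text content already covered by the summary
        img_state = torch.zeros_like(doc_vec)   # image content already covered by the summary
        for s in sent_vecs:
            # Hidden sentence-image alignment: attention of the sentence over images.
            align = torch.softmax(img_vecs @ s, dim=0)    # (I,) alignment probabilities
            aligned_img = align @ img_vecs                # (H,) expected aligned image vector
            feats = torch.stack([
                torch.dot(s, doc_vec),                    # text coverage
                -torch.dot(s, txt_state),                 # text redundancy (penalty)
                torch.dot(s, img_set_vec),                # image-set coverage
                -torch.dot(aligned_img, img_state),       # image-set redundancy (penalty)
            ])
            p = torch.sigmoid(self.classifier(feats)).squeeze()
            probs.append(p)
            # Accumulate selected content, weighted by selection probability.
            txt_state = txt_state + p * s
            img_state = img_state + p * aligned_img
        return torch.stack(probs)  # selection probability per sentence

if __name__ == "__main__":
    model = MultiModalExtractor(vocab_size=30000)
    sents = torch.randint(0, 30000, (12, 20))  # 12 sentences of 20 word ids each
    imgs = torch.randn(4, 2048)                # 4 images, precomputed CNN features
    print(model(sents, imgs))                  # one selection probability per sentence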

Original language: English
Pages (from-to): 186-196
Number of pages: 11
Journal: Future Generation Computer Systems
Volume: 99
Early online date: 25 Apr 2019
DOI: 10.1016/j.future.2019.04.045
Publication status: Published - 1 Oct 2019

Bibliographical note

© 2019, Elsevier. Licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/

Funding: National Natural Science Foundation of China (No. 61806101, No. 61876048, No. 61602256, No. 61876091), and the Open Foundation of Key Laboratory of Intelligent Information Processing, ICT, CAS, China (IIP2019-2).

Keywords

  • Document summarization
  • Extractive summarization
  • Multi-modal summarization
  • RNN
  • Summarization

Cite this

@article{2d4af9bec97e4be9b031b7d5e6402de9,
title = "Extractive summarization of documents with images based on multi-modal RNN",
abstract = "Rapid growth of multi-modal documents containing images on the Internet expresses strong demand on multi-modal summarization. The challenge is to create a computing method that can uniformly process text and image. Deep learning provides basic models for meeting this challenge. This paper treats extractive multi-modal summarization as a classification problem and proposes a sentence–image classification method based on the multi-modal RNN model. Our method encodes words and sentences with the hierarchical RNN models and encodes the ordered image set with the CNN model and the RNN model, and then calculates the selection probability of sentences and the sentence–image alignment probability through a logistic classifier taking text coverage, text redundancy, image set coverage, and image set redundancy as features. Two methods are proposed to compute the image set redundancy feature by combining the important scores of sentences and the hidden sentence–image alignment. Experiments on the extended DailyMail corpora constructed by collecting images and captions from the Web show that our method outperforms 11 baseline text summarization methods and that adopting the two image-related features in the classification method can improve text summarization. Our method is able to mine the hidden sentence–image alignments and to create informative well-aligned multi-modal summaries.",
keywords = "Document summarization, Extractive summarization, Multi-modal summarization, RNN, Summarization",
author = "Jingqiang Chen and Hai Zhuge",
note = "{\circledC} 2019, Elsevier. Licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/ Funding: National Natural Science Foundation of China (No. 61806101, No. 61876048, No. 61602256, No. 61876091), and the Open Foundation of Key Laboratory of Intelligent Information Processing, ICT, CAS, China (IIP2019-2).",
year = "2019",
month = "10",
day = "1",
doi = "10.1016/j.future.2019.04.045",
language = "English",
volume = "99",
pages = "186--196",
journal = "Future Generation Computer Systems",
issn = "0167-739X",
publisher = "Elsevier",

}

TY - JOUR

T1 - Extractive summarization of documents with images based on multi-modal RNN

AU - Chen, Jingqiang

AU - Zhuge, Hai

PY - 2019/10/1

Y1 - 2019/10/1

KW - Document summarization

KW - Extractive summarization

KW - Multi-modal summarization

KW - RNN

KW - Summarization

UR - http://www.scopus.com/inward/record.url?scp=85064813348&partnerID=8YFLogxK

UR - https://www.sciencedirect.com/science/article/pii/S0167739X18326876?via%3Dihub

DO - 10.1016/j.future.2019.04.045

M3 - Article

AN - SCOPUS:85064813348

VL - 99

SP - 186

EP - 196

JO - Future Generation Computer Systems

JF - Future Generation Computer Systems

SN - 0167-739X

ER -