Abstractive text-image summarization using multi-modal attentional hierarchical RNN

Jingqiang Chen; Hai Zhuge

Research output: Chapter in Book/Published conference output › Conference publication

Abstract

The rapid growth of multi-modal documents on the Internet makes multi-modal summarization research necessary. Most previous research summarizes texts or images separately. Recent neural summarization research shows the strength of the Encoder-Decoder model in text summarization. This paper proposes an abstractive text-image summarization model that uses the attentional hierarchical Encoder-Decoder model to summarize a text document and its accompanying images simultaneously, and then to align the sentences and images in the summaries. A multi-modal attentional mechanism is proposed to attend to the original sentences, images, and captions when decoding. The DailyMail dataset is extended by collecting images and captions from the Web. Experiments show that our model outperforms the neural abstractive and extractive text summarization methods that do not consider images. In addition, our model can generate informative summaries of images.

Original language	English
Title of host publication	Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018
Editors	Ellen Riloff, David Chiang, Julia Hockenmaier, Jun'ichi Tsujii
Publisher	Association for Computational Linguistics
Pages	4046-4056
Number of pages	11
ISBN (Electronic)	9781948087841
Publication status	Published - 1 Jan 2020
Event	2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018 - Brussels, Belgium
Duration	31 Oct 2018 → 4 Nov 2018

Publication series

Name	Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018

Conference

Conference	2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018
Country/Territory	Belgium
City	Brussels
Period	31/10/18 → 4/11/18

Cite this

Chen, J., & Zhuge, H. (2020). Abstractive text-image summarization using multi-modal attentional hierarchical RNN. In E. Riloff, D. Chiang, J. Hockenmaier, & J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018 (pp. 4046-4056). (Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018). Association for Computational Linguistics.

Chen, Jingqiang ; Zhuge, Hai. / Abstractive text-image summarization using multi-modal attentional hierarchical RNN. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018. editor / Ellen Riloff ; David Chiang ; Julia Hockenmaier ; Jun'ichi Tsujii. Association for Computational Linguistics, 2020. pp. 4046-4056 (Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018).

@inproceedings{3d4cf4e815324123a2771f98bb2d57ee,
  title = "Abstractive text-image summarization using multi-modal attentional hierarchical RNN",
  abstract = "Rapid growth of multi-modal documents on the Internet makes multi-modal summarization research necessary. Most previous research summarizes texts or images separately. Recent neural summarization research shows the strength of the Encoder-Decoder model in text summarization. This paper proposes an abstractive text-image summarization model using the attentional hierarchical Encoder-Decoder model to summarize a text document and its accompanying images simultaneously, and then to align the sentences and images in summaries. A multi-modal attentional mechanism is proposed to attend original sentences, images, and captions when decoding. The DailyMail dataset is extended by collecting images and captions from the Web. Experiments show our model outperforms the neural abstractive and extractive text summarization methods that do not consider images. In addition, our model can generate informative summaries of images.",
  author = "Jingqiang Chen and Hai Zhuge",
  year = "2020",
  month = jan,
  day = "1",
  language = "English",
  series = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018",
  publisher = "Association for Computational Linguistics",
  pages = "4046--4056",
  editor = "Ellen Riloff and David Chiang and Julia Hockenmaier and Jun'ichi Tsujii",
  booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018",
  note = "2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018 ; Conference date: 31-10-2018 Through 04-11-2018",
}

Chen, J & Zhuge, H 2020, Abstractive text-image summarization using multi-modal attentional hierarchical RNN. in E Riloff, D Chiang, J Hockenmaier & J Tsujii (eds), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, Association for Computational Linguistics, pp. 4046-4056, 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018, Brussels, Belgium, 31/10/18.

Abstractive text-image summarization using multi-modal attentional hierarchical RNN. / Chen, Jingqiang; Zhuge, Hai.
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018. ed. / Ellen Riloff; David Chiang; Julia Hockenmaier; Jun'ichi Tsujii. Association for Computational Linguistics, 2020. p. 4046-4056 (Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018).

TY  - GEN
T1  - Abstractive text-image summarization using multi-modal attentional hierarchical RNN
AU  - Chen, Jingqiang
AU  - Zhuge, Hai
PY  - 2020/1/1
Y1  - 2020/1/1
N2  - Rapid growth of multi-modal documents on the Internet makes multi-modal summarization research necessary. Most previous research summarizes texts or images separately. Recent neural summarization research shows the strength of the Encoder-Decoder model in text summarization. This paper proposes an abstractive text-image summarization model using the attentional hierarchical Encoder-Decoder model to summarize a text document and its accompanying images simultaneously, and then to align the sentences and images in summaries. A multi-modal attentional mechanism is proposed to attend original sentences, images, and captions when decoding. The DailyMail dataset is extended by collecting images and captions from the Web. Experiments show our model outperforms the neural abstractive and extractive text summarization methods that do not consider images. In addition, our model can generate informative summaries of images.
AB  - Rapid growth of multi-modal documents on the Internet makes multi-modal summarization research necessary. Most previous research summarizes texts or images separately. Recent neural summarization research shows the strength of the Encoder-Decoder model in text summarization. This paper proposes an abstractive text-image summarization model using the attentional hierarchical Encoder-Decoder model to summarize a text document and its accompanying images simultaneously, and then to align the sentences and images in summaries. A multi-modal attentional mechanism is proposed to attend original sentences, images, and captions when decoding. The DailyMail dataset is extended by collecting images and captions from the Web. Experiments show our model outperforms the neural abstractive and extractive text summarization methods that do not consider images. In addition, our model can generate informative summaries of images.
UR  - http://www.scopus.com/inward/record.url?scp=85081715486&partnerID=8YFLogxK
M3  - Conference publication
T3  - Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018
SP  - 4046
EP  - 4056
BT  - Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018
A2  - Riloff, Ellen
A2  - Chiang, David
A2  - Hockenmaier, Julia
A2  - Tsujii, Jun'ichi
PB  - Association for Computational Linguistics
T2  - 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018
Y2  - 31 October 2018 through 4 November 2018
ER  - 

Chen J, Zhuge H. Abstractive text-image summarization using multi-modal attentional hierarchical RNN. In Riloff E, Chiang D, Hockenmaier J, Tsujii J, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018. Association for Computational Linguistics. 2020. p. 4046-4056. (Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018).
