A news image captioning approach based on multimodal pointer-generator network

Jingqiang Chen; Hai Zhuge

doi:10.1002/cpe.5721

Jingqiang Chen, Hai Zhuge^*

^*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

News image captioning aims to automatically generate captions for news images, which can serve as drafts for writing the final captions manually. News image captions differ from generic captions in that they contain more detailed information, such as entity names and events. Therefore, both the news image and its accompanying text are sources for generating the caption. The pointer-generator network is a neural method designed for text summarization. This article proposes the multimodal pointer-generator network, which incorporates visual information into the original network for news image captioning. The multimodal attention mechanism splits attention into visual attention paid to the image and textual attention paid to the text. The multimodal pointer mechanism uses both textual and visual attention to compute pointer distributions, where visual attention is first transformed into textual attention via word-image relationships. The multimodal coverage mechanism is defined to reduce repetition in attention or in pointer distributions. Experiments on the DailyMail test dataset and the out-of-domain BBC test dataset show that the proposed model outperforms the original pointer-generator network, the generic image captioning method, the extractive news image captioning method, and the LDA-based method according to BLEU, METEOR, and ROUGE-L evaluations. Experiments also show that both the proposed multimodal coverage mechanisms and the transformation of visual attention into pointer distributions improve the model.
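The pointer and coverage ideas named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the final-distribution mixing and the coverage penalty follow the standard pointer-generator formulation for text summarization, while the function names, the `word_image_rel` matrix, and the fixed `mix` weight between textual and visual attention are assumptions made only for illustration.

```python
from collections import defaultdict


def visual_to_textual(vis_attn, word_image_rel):
    """Map attention over image regions to attention over source words.

    word_image_rel[i][j] is a hypothetical relatedness weight between image
    region i and source word j. Each row is normalised here so that a
    region's attention mass is redistributed over the words.
    """
    n_words = len(word_image_rel[0])
    out = [0.0] * n_words
    for a, row in zip(vis_attn, word_image_rel):
        total = sum(row)
        if total == 0:
            continue  # region unrelated to every word: contributes nothing
        for j, r in enumerate(row):
            out[j] += a * r / total
    return out


def pointer_distribution(txt_attn, vis_attn, word_image_rel, mix=0.5):
    """Combine textual attention with visual attention mapped onto words.

    The fixed weight `mix` is an assumption of this sketch, not a value
    taken from the article.
    """
    v2t = visual_to_textual(vis_attn, word_image_rel)
    return [mix * t + (1.0 - mix) * v for t, v in zip(txt_attn, v2t)]


def final_distribution(p_gen, vocab_dist, pointer_dist, source_words):
    """Standard pointer-generator mixture: generate with probability p_gen,
    otherwise copy a source word according to the pointer distribution."""
    final = defaultdict(float)
    for word, p in vocab_dist.items():
        final[word] += p_gen * p
    for word, p in zip(source_words, pointer_dist):
        final[word] += (1.0 - p_gen) * p
    return dict(final)


def coverage_loss(attn, coverage):
    """Penalise attending again to positions already covered in earlier
    decoding steps (sum of element-wise minima)."""
    return sum(min(a, c) for a, c in zip(attn, coverage))


if __name__ == "__main__":
    source_words = ["obama", "visits", "berlin"]
    pointer = pointer_distribution(
        txt_attn=[0.5, 0.2, 0.3],
        vis_attn=[0.6, 0.4],
        word_image_rel=[[3, 0, 1], [0, 1, 1]],
    )
    dist = final_distribution(0.3, {"the": 0.6, "visits": 0.4}, pointer, source_words)
    print(dist)
```

Because the visual attention is mapped onto source words before mixing, an entity name that appears only in the article text can still receive copy probability driven by the image.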

Original language	English
Journal	Concurrency Computation
Early online date	6 Apr 2020
DOIs	https://doi.org/10.1002/cpe.5721
Publication status	E-pub ahead of print - 6 Apr 2020

Bibliographical note

This is the peer reviewed version of the following article: Chen, J, Zhuge, H. A news image captioning approach based on multimodal pointer‐generator network. Concurrency Computat Pract Exper. 2020;e5721, which has been published in final form at https://doi.org/10.1002/cpe.5721. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for self-archiving.

Keywords

image captioning
multimodal summarization
pointer-generator network
text summarization

Access to Document

10.1002/cpe.5721

News Image Captioning based on Multi-Modal Pointer-Generator Network-Online
Accepted author manuscript, 3.8 MB

Cite this

@article{a3f4d98a0933487b9bdf7f800230a0cb,

title = "A news image captioning approach based on multimodal pointer-generator network",

abstract = "News image captioning aims to automatically generate captions for news images, which can serve as drafts for writing the final captions manually. News image captions differ from generic captions in that they contain more detailed information, such as entity names and events. Therefore, both the news image and its accompanying text are sources for generating the caption. The pointer-generator network is a neural method designed for text summarization. This article proposes the multimodal pointer-generator network, which incorporates visual information into the original network for news image captioning. The multimodal attention mechanism splits attention into visual attention paid to the image and textual attention paid to the text. The multimodal pointer mechanism uses both textual and visual attention to compute pointer distributions, where visual attention is first transformed into textual attention via word-image relationships. The multimodal coverage mechanism is defined to reduce repetition in attention or in pointer distributions. Experiments on the DailyMail test dataset and the out-of-domain BBC test dataset show that the proposed model outperforms the original pointer-generator network, the generic image captioning method, the extractive news image captioning method, and the LDA-based method according to BLEU, METEOR, and ROUGE-L evaluations. Experiments also show that both the proposed multimodal coverage mechanisms and the transformation of visual attention into pointer distributions improve the model.",

keywords = "image captioning, multimodal summarization, pointer-generator network, text summarization",

author = "Jingqiang Chen and Hai Zhuge",

note = "This is the peer reviewed version of the following article: Chen, J, Zhuge, H. A news image captioning approach based on multimodal pointer‐generator network. Concurrency Computat Pract Exper. 2020;e5721, which has been published in final form at https://doi.org/10.1002/cpe.5721. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for self-archiving.",

year = "2020",

month = apr,

day = "6",

doi = "10.1002/cpe.5721",

language = "English",

journal = "Concurrency Computation",

issn = "1532-0626",

publisher = "Wiley",

}

TY - JOUR

T1 - A news image captioning approach based on multimodal pointer-generator network

AU - Chen, Jingqiang

AU - Zhuge, Hai

N1 - This is the peer reviewed version of the following article: Chen, J, Zhuge, H. A news image captioning approach based on multimodal pointer‐generator network. Concurrency Computat Pract Exper. 2020;e5721, which has been published in final form at https://doi.org/10.1002/cpe.5721. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for self-archiving.

PY - 2020/4/6

Y1 - 2020/4/6

AB - News image captioning aims to automatically generate captions for news images, which can serve as drafts for writing the final captions manually. News image captions differ from generic captions in that they contain more detailed information, such as entity names and events. Therefore, both the news image and its accompanying text are sources for generating the caption. The pointer-generator network is a neural method designed for text summarization. This article proposes the multimodal pointer-generator network, which incorporates visual information into the original network for news image captioning. The multimodal attention mechanism splits attention into visual attention paid to the image and textual attention paid to the text. The multimodal pointer mechanism uses both textual and visual attention to compute pointer distributions, where visual attention is first transformed into textual attention via word-image relationships. The multimodal coverage mechanism is defined to reduce repetition in attention or in pointer distributions. Experiments on the DailyMail test dataset and the out-of-domain BBC test dataset show that the proposed model outperforms the original pointer-generator network, the generic image captioning method, the extractive news image captioning method, and the LDA-based method according to BLEU, METEOR, and ROUGE-L evaluations. Experiments also show that both the proposed multimodal coverage mechanisms and the transformation of visual attention into pointer distributions improve the model.

KW - image captioning

KW - multimodal summarization

KW - pointer-generator network

KW - text summarization

UR - http://www.scopus.com/inward/record.url?scp=85082928294&partnerID=8YFLogxK

UR - https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.5721

U2 - 10.1002/cpe.5721

DO - 10.1002/cpe.5721

M3 - Article

AN - SCOPUS:85082928294

SN - 1532-0626

JO - Concurrency Computation

JF - Concurrency Computation

ER -
