EgoCap and EgoFormer: First-person image captioning with context fusion

Zhuangzhuang Dai; Vu Tran; Andrew Markham; Niki Trigoni; M. Arif Rahman; L.N.S. Wijayasingha; John Stankovic; Chen Li

doi:10.1016/j.patrec.2024.03.012

Research output: Contribution to journal › Article › peer-review

Abstract

First-person captioning is significant because it provides veracious descriptions of egocentric scenes in a unique perspective. Also, there is a need to caption the scene, a.k.a. life-logging, for patients, travellers, and emergency responders in an egocentric narrative. Ego-captioning is indeed non-trivial since (1) Ego-images can be noisy due to motion and angles; (2) Describing a scene in a first-person narrative involves drastically different semantics; (3) Empirical implications have to be made on top of visual appearance because the cameraperson is often outside the field of view. We note we humans make good sense out of casual footage thanks to our contextual awareness in judging when and where the event unfolds, and whom the cameraperson is interacting with. This inspires the infusion of such "contexts" for situation-aware captioning. We create EgoCap which contains 2.1K ego-images, over 10K ego-captions, and 6.3K contextual labels, to close the gap of lacking ego-captioning datasets. We propose EgoFormer, a dual-encoder transformer-based network which fuses both contextual and visual features. The context encoder is pre-trained on ImageNet before fine tuning with context classification tasks. Similar to visual attention, we exploit stacked multi-head attention layers in the captioning decoder to reinforce attention to the context features. The EgoFormer has realized state-of-the-art performance on EgoCap achieving a CIDEr score of 125.52. The EgoCap dataset and EgoFormer are publicly available at https://github.com/zdai257/EgoCap-EgoFormer.

Original language	English
Pages (from-to)	50-56
Number of pages	7
Journal	Pattern Recognition Letters
Volume	181
Early online date	20 Mar 2024
DOIs	https://doi.org/10.1016/j.patrec.2024.03.012
Publication status	E-pub ahead of print - 20 Mar 2024

Keywords

image captioning
storytelling
dataset

Cite this

@article{4200dd255cc7459ab0182fbe49a00e73,
title = "{EgoCap} and {EgoFormer}: First-person image captioning with context fusion",
abstract = "First-person captioning is significant because it provides veracious descriptions of egocentric scenes in a unique perspective. Also, there is a need to caption the scene, a.k.a. life-logging, for patients, travellers, and emergency responders in an egocentric narrative. Ego-captioning is indeed non-trivial since (1) Ego-images can be noisy due to motion and angles; (2) Describing a scene in a first-person narrative involves drastically different semantics; (3) Empirical implications have to be made on top of visual appearance because the cameraperson is often outside the field of view. We note we humans make good sense out of casual footage thanks to our contextual awareness in judging when and where the event unfolds, and whom the cameraperson is interacting with. This inspires the infusion of such {"}contexts{"} for situation-aware captioning. We create EgoCap which contains 2.1K ego-images, over 10K ego-captions, and 6.3K contextual labels, to close the gap of lacking ego-captioning datasets. We propose EgoFormer, a dual-encoder transformer-based network which fuses both contextual and visual features. The context encoder is pre-trained on ImageNet before fine tuning with context classification tasks. Similar to visual attention, we exploit stacked multi-head attention layers in the captioning decoder to reinforce attention to the context features. The EgoFormer has realized state-of-the-art performance on EgoCap achieving a CIDEr score of 125.52. The EgoCap dataset and EgoFormer are publicly available at https://github.com/zdai257/EgoCap-EgoFormer.",
keywords = "image captioning, storytelling, dataset",
author = "Zhuangzhuang Dai and Vu Tran and Andrew Markham and Niki Trigoni and Rahman, {M. Arif} and L.N.S. Wijayasingha and John Stankovic and Chen Li",
year = "2024",
month = mar,
day = "20",
doi = "10.1016/j.patrec.2024.03.012",
language = "English",
volume = "181",
pages = "50--56",
journal = "Pattern Recognition Letters",
issn = "0167-8655",
publisher = "Elsevier",
}

TY  - JOUR
T1  - EgoCap and EgoFormer: First-person image captioning with context fusion
AU  - Dai, Zhuangzhuang
AU  - Tran, Vu
AU  - Markham, Andrew
AU  - Trigoni, Niki
AU  - Rahman, M. Arif
AU  - Wijayasingha, L.N.S.
AU  - Stankovic, John
AU  - Li, Chen
PY  - 2024/3/20
Y1  - 2024/3/20
N2  - First-person captioning is significant because it provides veracious descriptions of egocentric scenes in a unique perspective. Also, there is a need to caption the scene, a.k.a. life-logging, for patients, travellers, and emergency responders in an egocentric narrative. Ego-captioning is indeed non-trivial since (1) Ego-images can be noisy due to motion and angles; (2) Describing a scene in a first-person narrative involves drastically different semantics; (3) Empirical implications have to be made on top of visual appearance because the cameraperson is often outside the field of view. We note we humans make good sense out of casual footage thanks to our contextual awareness in judging when and where the event unfolds, and whom the cameraperson is interacting with. This inspires the infusion of such "contexts" for situation-aware captioning. We create EgoCap which contains 2.1K ego-images, over 10K ego-captions, and 6.3K contextual labels, to close the gap of lacking ego-captioning datasets. We propose EgoFormer, a dual-encoder transformer-based network which fuses both contextual and visual features. The context encoder is pre-trained on ImageNet before fine tuning with context classification tasks. Similar to visual attention, we exploit stacked multi-head attention layers in the captioning decoder to reinforce attention to the context features. The EgoFormer has realized state-of-the-art performance on EgoCap achieving a CIDEr score of 125.52. The EgoCap dataset and EgoFormer are publicly available at https://github.com/zdai257/EgoCap-EgoFormer.
KW  - image captioning
KW  - storytelling
KW  - dataset
UR  - https://www.sciencedirect.com/science/article/abs/pii/S0167865524000801
UR  - http://www.scopus.com/inward/record.url?scp=85188937023&partnerID=8YFLogxK
U2  - 10.1016/j.patrec.2024.03.012
DO  - 10.1016/j.patrec.2024.03.012
M3  - Article
SN  - 0167-8655
VL  - 181
SP  - 50
EP  - 56
JO  - Pattern Recognition Letters
JF  - Pattern Recognition Letters
ER  - 
