TY - GEN
T1 - Variational Recurrent Sequence-to-Sequence Retrieval for Stepwise Illustration
AU - Batra, Vishwash
AU - Haldar, Aparajita
AU - He, Yulan
AU - Ferhatosmanoglu, Hakan
AU - Vogiatzis, George
AU - Guha, Tanaya
PY - 2020/4/8
Y1 - 2020/4/8
N2 - We address and formalise the task of sequence-to-sequence (seq2seq) cross-modal retrieval. Given a sequence of text passages as a query, the goal is to retrieve a sequence of images that best describes and aligns with the query. This new task extends traditional cross-modal retrieval, where each image-text pair is treated independently, ignoring broader context. We propose a novel variational recurrent seq2seq (VRSS) retrieval model for this seq2seq task. Unlike most cross-modal methods, we generate an image vector corresponding to the latent topic obtained from combining the text semantics and context. This synthetic image embedding point, associated with every text embedding point, can then be employed for either image generation or image retrieval as desired. We evaluate the model on the application of stepwise illustration of recipes, where a sequence of relevant images is retrieved to best match the steps described in the text. To this end, we build and release a new Stepwise Recipe dataset for research purposes, containing 10K recipes (sequences of image-text pairs) with a total of 67K image-text pairs. To our knowledge, it is the first publicly available dataset to offer rich semantic descriptions in a focused category such as food or recipes. Our model is shown to outperform several competitive and relevant baselines in the experiments. We also provide a qualitative analysis, through human evaluation and comparison with relevant existing methods, of how semantically meaningful the results produced by our model are.
KW - Multimodal datasets
KW - Semantics
KW - Sequence retrieval
UR - http://www.scopus.com/inward/record.url?scp=85083968362&partnerID=8YFLogxK
UR - https://link.springer.com/chapter/10.1007%2F978-3-030-45439-5_4
U2 - 10.1007/978-3-030-45439-5_4
DO - 10.1007/978-3-030-45439-5_4
M3 - Conference publication
AN - SCOPUS:85083968362
SN - 9783030454388
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 50
EP - 64
BT - Advances in Information Retrieval - 42nd European Conference on IR Research, ECIR 2020, Proceedings
A2 - Jose, Joemon M.
A2 - Yilmaz, Emine
A2 - Magalhães, João
A2 - Martins, Flávio
A2 - Castells, Pablo
A2 - Ferro, Nicola
A2 - Silva, Mário J.
PB - Springer
T2 - 42nd European Conference on IR Research, ECIR 2020
Y2 - 14 April 2020 through 17 April 2020
ER -