Spatial and temporal representations for multi-modal visual retrieval

  • Noa Garcia Docampo

    Student thesis: Doctoral Thesis (Doctor of Philosophy)


    This dissertation studies the problem of finding relevant content within a
    visual collection according to a specific query. It addresses three key settings,
    depending on the kind of data to be processed: symmetric visual retrieval,
    asymmetric visual retrieval and cross-modal retrieval.
    In symmetric visual retrieval, the query object and the elements in the collection
    are from the same kind of visual data, i.e. images or videos. Inspired by the
    human visual perception system, we propose new techniques to estimate visual
    similarity in image-to-image retrieval datasets based on non-metric functions,
    improving image retrieval performance on top of state-of-the-art methods.
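The core property exploited here is that a learned pairwise score, unlike a distance metric, is not bound by symmetry or the triangle inequality. A minimal sketch of that idea follows; the tiny untrained scoring network and its names are hypothetical illustrations, not the actual model proposed in the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical tiny scoring network (untrained, illustration only):
# it maps a concatenated descriptor pair to a scalar score
W1 = rng.standard_normal((8, 16)) * 0.1
W2 = rng.standard_normal(16) * 0.1

def learned_similarity(a, b):
    """Score a pair of 4-d descriptors with a small MLP. Because the
    score is an arbitrary function of the ordered pair, it need not be
    a metric: symmetry and the triangle inequality are not guaranteed."""
    pair = np.concatenate([a, b])   # ordered pair -> (8,)
    hidden = np.tanh(pair @ W1)     # (16,)
    return float(hidden @ W2)       # scalar score

q = rng.standard_normal(4)
x = rng.standard_normal(4)
# swapping the arguments generally changes the score
s_qx, s_xq = learned_similarity(q, x), learned_similarity(x, q)
```

Ranking a collection by such a query-conditioned score is what distinguishes these non-metric approaches from plain Euclidean or cosine ranking.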
    On the other hand, asymmetric visual retrieval is the problem in which queries
    and elements in the dataset are from different types of visual data. We propose
    methods to aggregate the temporal information of video segments so that
    image-video comparisons can be computed with standard similarity functions.
    When evaluated on image-to-video retrieval datasets, our algorithms drastically
    reduce memory storage while maintaining high accuracy.
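The storage saving comes from collapsing many per-frame descriptors into a single vector per segment. The sketch below shows the simplest such baseline, mean pooling followed by cosine scoring; the function names are hypothetical and this is not the specific aggregation scheme developed in the thesis:

```python
import numpy as np

def aggregate_frames(frame_descriptors):
    """Mean-pool per-frame descriptors (shape F x D) into a single
    L2-normalised D-dimensional vector, so a whole video segment is
    stored and compared as one vector instead of F of them."""
    v = np.asarray(frame_descriptors, dtype=float).mean(axis=0)
    return v / np.linalg.norm(v)

def image_to_video_score(image_descriptor, video_vector):
    """Cosine similarity between an image descriptor and an
    aggregated video vector."""
    q = np.asarray(image_descriptor, dtype=float)
    return float((q / np.linalg.norm(q)) @ video_vector)

# toy example: a 5-frame segment with 4-dimensional descriptors
frames = np.abs(np.random.default_rng(1).standard_normal((5, 4)))
video = aggregate_frames(frames)            # one vector replaces five
score = image_to_video_score(frames[0], video)
```

With F frames per segment, storage drops from F x D to D floats per video, while queries remain single dot products against the collection.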
    Finally, we introduce new solutions for cross-modal retrieval, which is the task
    in which either the queries or the elements in the collection are non-visual objects.
    In particular, we study text-image retrieval in the domain of art by introducing
    new models for semantic art understanding, obtaining results close to human
    performance.
    Overall, this thesis advances the state-of-the-art in visual retrieval by presenting
    novel solutions for some of the key tasks in the field. The contributions
    derived from this work have potential direct applications in the era of big data,
    as visual datasets are growing exponentially every day and new techniques for
    storing, accessing and managing large-scale visual collections are required.
    Date of Award: 25 Mar 2019
    Original language: English
    Supervisor: George Vogiatzis


    • image retrieval
    • video retrieval
    • cross-modal retrieval
