Look and listen: A multi-modality late fusion approach to scene classification for autonomous machines

Jordan J. Bird; Diego R. Faria; Cristiano Premebida; Aniko Ekart; George Vogiatzis

doi:10.1109/IROS45743.2020.9341557

Research output: Chapter in Book/Published conference output › Conference publication

Abstract

The novelty of this study lies in a multi-modality approach to scene classification, in which image and audio complement each other through deep late fusion. The approach is demonstrated on a difficult classification problem consisting of two synchronised and balanced datasets of 16,000 data objects, encompassing 4.4 hours of video of 8 environments with varying degrees of similarity. We first extract video frames and the accompanying audio at one-second intervals. The image and audio datasets are then classified independently, using a fine-tuned VGG16 and an evolutionarily optimised deep neural network, with accuracies of 89.27% and 93.72%, respectively. This is followed by late fusion of the two neural networks to enable a higher-order function, leading to an accuracy of 96.81% for the multi-modality classifier with synchronised video frames and audio clips. The tertiary neural network implemented for late fusion outperforms classical state-of-the-art classifiers by around 3% when the two primary networks are treated as feature generators. We show that cases in which a single modality is confused by anomalous data points are corrected through an emerging higher-order integration. Prominent examples include a water feature in a city misclassified as a river by the audio classifier alone, and a densely crowded street misclassified as a forest by the image classifier alone. Both examples are correctly classified by our multi-modality approach.
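
The late-fusion idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the two primary networks (the fine-tuned VGG16 and the evolutionarily optimised audio network) are replaced by a hypothetical `fake_primary_outputs` function that emits synthetic class probabilities, the fused features are assumed to be the concatenated class probabilities (the abstract does not say which layer outputs are fused), and the tertiary network is simplified to a softmax-regression layer. All function names, sample counts, and hyperparameters are assumptions chosen for illustration.

```python
import numpy as np

N_CLASSES = 8  # the abstract describes 8 environments


def softmax(z):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fake_primary_outputs(labels, confused_pair, rng):
    """Hypothetical stand-in for one trained single-modality network.

    Returns class probabilities that favour the true class, except that
    samples of class `a` also receive strong evidence for class `b`.
    """
    a, b = confused_pair
    logits = rng.normal(0.0, 1.0, size=(labels.size, N_CLASSES))
    logits[np.arange(labels.size), labels] += 3.0
    logits[labels == a, b] += 3.0
    return softmax(logits)


def late_fusion_features(p_image, p_audio):
    """Concatenate the two networks' outputs into one feature vector."""
    return np.concatenate([p_image, p_audio], axis=1)


def train_fusion_layer(x, y, epochs=300, lr=0.5):
    """Fit a softmax-regression fusion layer by batch gradient descent."""
    w = np.zeros((x.shape[1], N_CLASSES))
    b = np.zeros(N_CLASSES)
    onehot = np.eye(N_CLASSES)[y]
    for _ in range(epochs):
        grad = softmax(x @ w + b) - onehot  # gradient of cross-entropy w.r.t. logits
        w -= lr * (x.T @ grad) / len(x)
        b -= lr * grad.mean(axis=0)
    return w, b


def accuracy(pred, y):
    return float((pred == y).mean())


rng = np.random.default_rng(0)
labels = rng.integers(0, N_CLASSES, size=2000)

# Each synthetic modality confuses a different pair of classes.
p_image = fake_primary_outputs(labels, (0, 1), rng)
p_audio = fake_primary_outputs(labels, (2, 3), rng)
fused = late_fusion_features(p_image, p_audio)

fit, held_out = slice(0, 1500), slice(1500, None)
w, b = train_fusion_layer(fused[fit], labels[fit])
fused_pred = softmax(fused[held_out] @ w + b).argmax(axis=1)

print("image only:", accuracy(p_image[held_out].argmax(axis=1), labels[held_out]))
print("audio only:", accuracy(p_audio[held_out].argmax(axis=1), labels[held_out]))
print("late fusion:", accuracy(fused_pred, labels[held_out]))
```

Because each synthetic modality is unreliable on a different class, the fusion layer can lean on whichever modality is informative for a given sample, which is the kind of correction the abstract reports for the city water feature and the crowded street. The accuracies printed here come from synthetic data and say nothing about the paper's reported figures.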

Original language	English
Title of host publication	2020 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020
Publisher	IEEE
Pages	10380-10385
Number of pages	6
ISBN (Electronic)	9781728162126
ISBN (Print)	978-1-7281-6213-3
DOIs	https://doi.org/10.1109/IROS45743.2020.9341557
Publication status	Published - 10 Feb 2021
Event	2020 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020 - Las Vegas, United States; Duration: 24 Oct 2020 → 24 Jan 2021

Publication series

Name	IEEE International Conference on Intelligent Robots and Systems
ISSN (Print)	2153-0858
ISSN (Electronic)	2153-0866

Conference

Conference	2020 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020
Country/Territory	United States
City	Las Vegas
Period	24/10/20 → 24/01/21

Bibliographical note

© 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Keywords

Image analysis
Neural networks
Urban areas
Forestry
Generators
Intelligent robots
Rivers

Access to Document

10.1109/IROS45743.2020.9341557

Look_listen_paper
Accepted author manuscript, 912 KB

Cite this

Bird, J. J., Faria, D. R., Premebida, C., Ekart, A., & Vogiatzis, G. (2021). Look and listen: A multi-modality late fusion approach to scene classification for autonomous machines. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020 (pp. 10380-10385). Article 9341557 (IEEE International Conference on Intelligent Robots and Systems). IEEE. https://doi.org/10.1109/IROS45743.2020.9341557

@inproceedings{ada7133fd9a74c89858d4a7f02fa090b,
title = "Look and listen: A multi-modality late fusion approach to scene classification for autonomous machines",
abstract = "The novelty of this study consists in a multi-modality approach to scene classification, where image and audio complement each other in a process of deep late fusion. The approach is demonstrated on a difficult classification problem, consisting of two synchronised and balanced datasets of 16,000 data objects, encompassing 4.4 hours of video of 8 environments with varying degrees of similarity. We first extract video frames and accompanying audio at one second intervals. The image and the audio datasets are first classified independently, using a fine-tuned VGG16 and an evolutionary optimised deep neural network, with accuracies of 89.27% and 93.72%, respectively. This is followed by late fusion of the two neural networks to enable a higher order function, leading to accuracy of 96.81% in this multi-modality classifier with synchronised video frames and audio clips. The tertiary neural network implemented for late fusion outperforms classical state-of-the-art classifiers by around 3% when the two primary networks are considered as feature generators. We show that situations where a single-modality may be confused by anomalous data points are now corrected through an emerging higher order integration. Prominent examples include a water feature in a city misclassified as a river by the audio classifier alone and a densely crowded street misclassified as a forest by the image classifier alone. Both are examples which are correctly classified by our multi-modality approach.",
keywords = "Image analysis, Neural networks, Urban areas, Forestry, Generators, Intelligent robots, Rivers",
author = "Bird, {Jordan J.} and Faria, {Diego R.} and Cristiano Premebida and Aniko Ekart and George Vogiatzis",
note = "{\textcopyright} 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.; 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020 ; Conference date: 24-10-2020 Through 24-01-2021",
year = "2021",
month = feb,
day = "10",
doi = "10.1109/IROS45743.2020.9341557",
language = "English",
isbn = "978-1-7281-6213-3",
series = "IEEE International Conference on Intelligent Robots and Systems",
publisher = "IEEE",
pages = "10380--10385",
booktitle = "2020 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020",
address = "United States",
}

Bird, JJ, Faria, DR, Premebida, C, Ekart, A & Vogiatzis, G 2021, Look and listen: A multi-modality late fusion approach to scene classification for autonomous machines. in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020., 9341557, IEEE International Conference on Intelligent Robots and Systems, IEEE, pp. 10380-10385, 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020, Las Vegas, United States, 24/10/20. https://doi.org/10.1109/IROS45743.2020.9341557

Look and listen: A multi-modality late fusion approach to scene classification for autonomous machines. / Bird, Jordan J.; Faria, Diego R.; Premebida, Cristiano et al.
2020 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020. IEEE, 2021. p. 10380-10385 9341557 (IEEE International Conference on Intelligent Robots and Systems).

TY - GEN
T1 - Look and listen
T2 - 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020
AU - Bird, Jordan J.
AU - Faria, Diego R.
AU - Premebida, Cristiano
AU - Ekart, Aniko
AU - Vogiatzis, George
N1 - © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
PY - 2021/2/10
Y1 - 2021/2/10
AB - The novelty of this study consists in a multi-modality approach to scene classification, where image and audio complement each other in a process of deep late fusion. The approach is demonstrated on a difficult classification problem, consisting of two synchronised and balanced datasets of 16,000 data objects, encompassing 4.4 hours of video of 8 environments with varying degrees of similarity. We first extract video frames and accompanying audio at one second intervals. The image and the audio datasets are first classified independently, using a fine-tuned VGG16 and an evolutionary optimised deep neural network, with accuracies of 89.27% and 93.72%, respectively. This is followed by late fusion of the two neural networks to enable a higher order function, leading to accuracy of 96.81% in this multi-modality classifier with synchronised video frames and audio clips. The tertiary neural network implemented for late fusion outperforms classical state-of-the-art classifiers by around 3% when the two primary networks are considered as feature generators. We show that situations where a single-modality may be confused by anomalous data points are now corrected through an emerging higher order integration. Prominent examples include a water feature in a city misclassified as a river by the audio classifier alone and a densely crowded street misclassified as a forest by the image classifier alone. Both are examples which are correctly classified by our multi-modality approach.
KW - Image analysis
KW - Neural networks
KW - Urban areas
KW - Forestry
KW - Generators
KW - Intelligent robots
KW - Rivers
UR - http://www.scopus.com/inward/record.url?scp=85102397858&partnerID=8YFLogxK
UR - https://ieeexplore.ieee.org/document/9341557
U2 - 10.1109/IROS45743.2020.9341557
DO - 10.1109/IROS45743.2020.9341557
M3 - Conference publication
AN - SCOPUS:85102397858
SN - 978-1-7281-6213-3
T3 - IEEE International Conference on Intelligent Robots and Systems
SP - 10380
EP - 10385
BT - 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020
PB - IEEE
Y2 - 24 October 2020 through 24 January 2021
ER -

Bird JJ, Faria DR, Premebida C, Ekart A, Vogiatzis G. Look and listen: A multi-modality late fusion approach to scene classification for autonomous machines. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020. IEEE. 2021. p. 10380-10385. 9341557. (IEEE International Conference on Intelligent Robots and Systems). doi: 10.1109/IROS45743.2020.9341557
