A Large-Scale English Multi-Label Twitter Dataset for Cyberbullying and Online Abuse Detection

Semiu Salawu*, Jo Lumsden, Yulan He

*Corresponding author for this work

Research output: Chapter in Book/Published conference output > Conference publication

Abstract

In this paper, we introduce a new English Twitter-based dataset for online abuse and cyberbullying detection. Comprising 62,587 tweets, this dataset was sourced from Twitter using specific query terms designed to retrieve tweets with high probabilities of various forms of bullying and offensive content, including insult, profanity, sarcasm, threat, porn and exclusion. Analysis performed on the dataset confirmed common cyberbullying themes reported by other studies and revealed interesting relationships between the classes. The dataset was used to train a number of transformer-based deep learning models returning impressive results.
Original language: English
Title of host publication: Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)
Editors: Aida Mostafazadeh Davani, Douwe Kiela, Mathias Lambert, Bertie Vidgen, Vinodkumar Prabhakaran, Zeerak Waseem
Publisher: Association for Computational Linguistics
Pages: 146-156
Number of pages: 11
ISBN (Print): 9781954085596
Publication status: Published - Aug 2021
Event: The 5th Workshop on Online Abuse and Harms
Duration: 6 Aug 2021 - 6 Aug 2021
https://www.workshopononlineabuse.com/past-workshops/woah-2021-website

Conference

Conference: The 5th Workshop on Online Abuse and Harms
Abbreviated title: WOAH 2021
Period: 6/08/21 - 6/08/21

Bibliographical note

Copyright © 2021 The Association for Computational Linguistics. Licensed under the Creative Commons Attribution license https://creativecommons.org/licenses/by/4.0/
