A Large-Scale English Multi-Label Twitter Dataset for Cyberbullying and Online Abuse Detection

Semiu Salawu; Jo Lumsden; Yulan He

doi:10.18653/v1/2021.woah-1.16

Semiu Salawu^*, Jo Lumsden, Yulan He

^*Corresponding author for this work

Research output: Chapter in Book/Published conference output › Conference publication

Abstract

In this paper, we introduce a new English Twitter-based dataset for online abuse and cyberbullying detection. Comprising 62,587 tweets, this dataset was sourced from Twitter using specific query terms designed to retrieve tweets with high probabilities of various forms of bullying and offensive content, including insult, profanity, sarcasm, threat, porn and exclusion. Analysis performed on the dataset confirmed common cyberbullying themes reported by other studies and revealed interesting relationships between the classes. The dataset was used to train a number of transformer-based deep learning models returning impressive results.

Original language	English
Title of host publication	Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)
Editors	Aida Mostafazedeh Davani, Douwe Kiela, Mathias Lambert, Bertie Vidgen, Vinodkumar Prabhakaran, Zeerak Waseem
Publisher	Association for Computational Linguistics
Pages	146-156
Number of pages	11
ISBN (Print)	9781954085596
DOIs	https://doi.org/10.18653/v1/2021.woah-1.16
Publication status	Published - Aug 2021
Event	The 5th Workshop on Online Abuse and Harms - Duration: 6 Aug 2021 → 6 Aug 2021 https://www.workshopononlineabuse.com/past-workshops/woah-2021-website

Conference

Conference	The 5th Workshop on Online Abuse and Harms
Abbreviated title	WOAH 2021
Period	6/08/21 → 6/08/21
Internet address	https://www.workshopononlineabuse.com/past-workshops/woah-2021-website

Bibliographical note

Access to Document

10.18653/v1/2021.woah-1.16 (Licence: CC BY 3.0)

Salawu Lumsden He Large Scale Twitter Dataset for Cyberbullying Detection
© 2021 The Association for Computational Linguistics. Licensed under the Creative Commons Attribution license https://creativecommons.org/licenses/by/4.0/
Final published version, 278 KB (Licence: CC BY 3.0)

Cite this

Salawu, S., Lumsden, J., & He, Y. (2021). A Large-Scale English Multi-Label Twitter Dataset for Cyberbullying and Online Abuse Detection. In A. Mostafazedeh Davani, D. Kiela, M. Lambert, B. Vidgen, V. Prabhakaran, & Z. Waseem (Eds.), Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021) (pp. 146-156). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.woah-1.16

Salawu, Semiu ; Lumsden, Jo ; He, Yulan. / A Large-Scale English Multi-Label Twitter Dataset for Cyberbullying and Online Abuse Detection. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). editor / Aida Mostafazedeh Davani ; Douwe Kiela ; Mathias Lambert ; Bertie Vidgen ; Vinodkumar Prabhakaran ; Zeerak Waseem. Association for Computational Linguistics, 2021. pp. 146-156

@inproceedings{d4e6e0c9dbf340c9ad95dedc27581411,

title = "A Large-Scale English Multi-Label Twitter Dataset for Cyberbullying and Online Abuse Detection",

abstract = "In this paper, we introduce a new English Twitter-based dataset for online abuse and cyberbullying detection. Comprising 62,587 tweets, this dataset was sourced from Twitter using specific query terms designed to retrieve tweets with high probabilities of various forms of bullying and offensive content, including insult, profanity, sarcasm, threat, porn and exclusion. Analysis performed on the dataset confirmed common cyberbullying themes reported by other studies and revealed interesting relationships between the classes. The dataset was used to train a number of transformer-based deep learning models returning impressive results.",

author = "Semiu Salawu and Jo Lumsden and Yulan He",

note = "{\textcopyright} 2021 The Association for Computational Linguistics. Licensed under the Creative Commons Attribution license https://creativecommons.org/licenses/by/4.0/; The 5th Workshop on Online Abuse and Harms, WOAH 2021; Conference date: 06-08-2021 through 06-08-2021",

year = "2021",

month = aug,

doi = "10.18653/v1/2021.woah-1.16",

language = "English",

isbn = "9781954085596",

pages = "146--156",

editor = "{Mostafazedeh Davani}, Aida and Douwe Kiela and Mathias Lambert and Bertie Vidgen and Vinodkumar Prabhakaran and Zeerak Waseem",

booktitle = "Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)",

publisher = "Association for Computational Linguistics",

url = "https://www.workshopononlineabuse.com/past-workshops/woah-2021-website",

}

Salawu, S, Lumsden, J & He, Y 2021, A Large-Scale English Multi-Label Twitter Dataset for Cyberbullying and Online Abuse Detection. in A Mostafazedeh Davani, D Kiela, M Lambert, B Vidgen, V Prabhakaran & Z Waseem (eds), Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). Association for Computational Linguistics, pp. 146-156, The 5th Workshop on Online Abuse and Harms, 6/08/21. https://doi.org/10.18653/v1/2021.woah-1.16

A Large-Scale English Multi-Label Twitter Dataset for Cyberbullying and Online Abuse Detection. / Salawu, Semiu; Lumsden, Jo; He, Yulan.
Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). ed. / Aida Mostafazedeh Davani; Douwe Kiela; Mathias Lambert; Bertie Vidgen; Vinodkumar Prabhakaran; Zeerak Waseem. Association for Computational Linguistics, 2021. p. 146-156.

TY - GEN

T1 - A Large-Scale English Multi-Label Twitter Dataset for Cyberbullying and Online Abuse Detection

AU - Salawu, Semiu

AU - Lumsden, Jo

AU - He, Yulan

PY - 2021/8

Y1 - 2021/8

N2 - In this paper, we introduce a new English Twitter-based dataset for online abuse and cyberbullying detection. Comprising 62,587 tweets, this dataset was sourced from Twitter using specific query terms designed to retrieve tweets with high probabilities of various forms of bullying and offensive content, including insult, profanity, sarcasm, threat, porn and exclusion. Analysis performed on the dataset confirmed common cyberbullying themes reported by other studies and revealed interesting relationships between the classes. The dataset was used to train a number of transformer-based deep learning models returning impressive results.

UR - https://aclanthology.org/2021.woah-1.16.pdf

UR - https://bitbucket.org/ssalawu/cyberbullying-twitter/

UR - https://aclanthology.org/2021.woah-1.0/

U2 - 10.18653/v1/2021.woah-1.16

DO - 10.18653/v1/2021.woah-1.16

M3 - Conference publication

SN - 9781954085596

SP - 146

EP - 156

BT - Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)

A2 - Mostafazedeh Davani, Aida

A2 - Kiela, Douwe

A2 - Lambert, Mathias

A2 - Vidgen, Bertie

A2 - Prabhakaran, Vinodkumar

A2 - Waseem, Zeerak

PB - Association for Computational Linguistics

T2 - The 5th Workshop on Online Abuse and Harms

Y2 - 6 August 2021 through 6 August 2021

ER -

Salawu S, Lumsden J, He Y. A Large-Scale English Multi-Label Twitter Dataset for Cyberbullying and Online Abuse Detection. In Mostafazedeh Davani A, Kiela D, Lambert M, Vidgen B, Prabhakaran V, Waseem Z, editors, Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). Association for Computational Linguistics. 2021. p. 146-156 doi: 10.18653/v1/2021.woah-1.16
