Data Expansion Using WordNet-based Semantic Expansion and Word Disambiguation for Cyberbullying Detection

Md Saroar Jahan; Mourad Oussalah; Muhidin Mohamed

Operations & Information Management

Research output: Chapter in Book/Published conference output › Conference publication

Abstract

Automatic identification of cyberbullying from textual content is known to be a challenging task. The challenges arise from the inherent
structure of cyberbullying and the lack of a large-scale labeled corpus that would enable efficient machine-learning-based tools, including
neural networks. This paper advocates a data augmentation-based approach that could enhance the automatic detection of cyberbullying in
social media texts. We use both word sense disambiguation and the synonymy relation in the WordNet lexical database to generate coherent
equivalent utterances of cyberbullying input data. The disambiguation and semantic expansion are intended to overcome the inherent
limitations of social media posts, such as an abundance of unstructured constructs and limited semantic content. To test
feasibility, a novel protocol was employed to collect cyberbullying traces from the AskFm forum, from which a dataset of about 10K entries
was manually labeled. Cyberbullying identification is then treated as a binary classification problem, addressed with an
elaborated data augmentation strategy and an appropriate classifier. For the latter, a Convolutional Neural Network (CNN) architecture
with FastText and BERT was put forward, and its results were compared against the commonly employed Naïve Bayes (NB) and Logistic
Regression (LR) classifiers, with and without data augmentation. The research outcomes were promising, yielding a classifier accuracy of
almost 98.4%, an improvement of more than 4% over baseline results.
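The expansion idea described in the abstract (pick a word's sense from its context, then substitute synonyms of that sense only) can be illustrated with a minimal sketch. This is not the authors' implementation: the sense inventory `TOY_SENSES`, the stopword list, and the helper names `tokenize`, `pick_sense`, and `expand` are all hypothetical, and the disambiguation step is a simplified Lesk-style gloss overlap standing in for whatever method the paper uses. A real system would query WordNet rather than a hand-written dictionary.

```python
# Hypothetical toy sense inventory standing in for WordNet (not data from the paper).
# Each word maps to a list of senses: (gloss, synonyms of that sense).
TOY_SENSES = {
    "dumb": [
        ("lacking intelligence or good judgment stupid person", ["stupid", "dense"]),
        ("unable to speak silent without voice", ["mute", "silent"]),
    ],
    "ugly": [
        ("unpleasant to look at appearance face", ["unsightly", "hideous"]),
    ],
}

# Small illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"a", "an", "the", "is", "are", "you", "so", "and", "to", "of", "or", "at"}


def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    stripped = (tok.strip(".,!?;:").lower() for tok in text.split())
    return [tok for tok in stripped if tok]


def pick_sense(word, context_tokens):
    """Simplified Lesk: choose the sense whose gloss overlaps most with the context."""
    senses = TOY_SENSES.get(word)
    if not senses:
        return None
    context = set(context_tokens) - STOPWORDS - {word}
    # max() keeps the first best sense on ties, so sense order acts as a prior.
    return max(senses, key=lambda sense: len(context & set(sense[0].split())))


def expand(text):
    """Return variants of `text`, each replacing one word with a synonym of its chosen sense."""
    tokens = tokenize(text)
    variants = []
    for i, tok in enumerate(tokens):
        sense = pick_sense(tok, tokens)
        if sense is None:
            continue
        for synonym in sense[1]:
            variants.append(" ".join(tokens[:i] + [synonym] + tokens[i + 1:]))
    return variants


if __name__ == "__main__":
    for variant in expand("You are such a dumb person!"):
        print(variant)
```

In this sketch the word "person" in the context steers "dumb" toward its "lacking intelligence" sense, so the "mute"/"silent" synonyms are never substituted. Each generated variant would inherit the label of the original post when used as augmented training data.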

Original language	English
Title of host publication	Data Expansion Using WordNet-based Semantic Expansion and Word Disambiguation for Cyberbullying Detection
Pages	1761–1770
Number of pages	10
Publication status	Published - 20 Jun 2022
Event	13th Conference on Language Resources and Evaluation (LREC 2022) - Marseille, France. Duration: 20 Jun 2022 → 25 Jun 2022

Conference

Conference	13th Conference on Language Resources and Evaluation (LREC 2022)
Country/Territory	France
City	Marseille
Period	20/06/22 → 25/06/22

Bibliographical note

Access to Document

2022.lrec-1.187
© European Language Resources Association (ELRA), licensed under CC-BY-NC-4.0
Final published version, 530 KB. Licence: CC BY-NC 4.0

http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.187.pdf

Cite this

@inproceedings{f0bd44665aa94041a402f50adab84cb6,

title = "Data Expansion Using WordNet-based Semantic Expansion and Word Disambiguation for Cyberbullying Detection",

abstract = "Automatic identification of cyberbullying from textual content is known to be a challenging task. The challenges arise from the inherent structure of cyberbullying and the lack of labeled large-scale corpus, enabling efficient machine-learning-based tools including neural networks. This paper advocates a data augmentation-based approach that could enhance the automatic detection of cyberbullying in social media texts. We use both word sense disambiguation and synonymy relation in WordNet lexical database to generate coherent equivalent utterances of cyberbullying input data. The disambiguation and semantic expansion are intended to overcome the inherent limitations of social media posts, such as an abundance of unstructured constructs and limited semantic content. Besides, to test the feasibility, a novel protocol has been employed to collect cyberbullying traces data from AskFm forum, where about a 10K-size dataset has been manually labeled. Next, the problem of cyberbullying identification is viewed as a binary classification problem using an elaborated data augmentation strategy and an appropriate classifier. For the latter, a Convolutional Neural Network (CNN) architecture with FastText and BERT was put forward, whose results were compared against commonly employed Na{\"i}ve Bayes (NB) and Logistic Regression (LR) classifiers with and without data augmentation. The research outcomes were promising and yielded almost 98.4% of classifier accuracy, an improvement of more than 4% over baseline results.",

author = "Jahan, {Md Saroar} and Mourad Oussalah and Muhidin Mohamed",

note = "{\textcopyright} European Language Resources Association (ELRA), licensed under CC-BY-NC-4.0; 13th Conference on Language Resources and Evaluation (LREC 2022); Conference date: 20-06-2022 Through 25-06-2022",

year = "2022",

month = jun,

day = "20",

language = "English",

pages = "1761--1770",

booktitle = "Data Expansion Using WordNet-based Semantic Expansion and Word Disambiguation for Cyberbullying Detection",

}

Jahan, MS, Oussalah, M & Mohamed, M 2022, Data Expansion Using WordNet-based Semantic Expansion and Word Disambiguation for Cyberbullying Detection. in Data Expansion Using WordNet-based Semantic Expansion and Word Disambiguation for Cyberbullying Detection. pp. 1761–1770, 13th Conference on Language Resources and Evaluation (LREC 2022), Marseille, France, 20/06/22. <http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.187.pdf>

Data Expansion Using WordNet-based Semantic Expansion and Word Disambiguation for Cyberbullying Detection. / Jahan, Md Saroar; Oussalah, Mourad; Mohamed, Muhidin.
Data Expansion Using WordNet-based Semantic Expansion and Word Disambiguation for Cyberbullying Detection. 2022. p. 1761–1770.

TY - GEN

T1 - Data Expansion Using WordNet-based Semantic Expansion and Word Disambiguation for Cyberbullying Detection

AU - Jahan, Md Saroar

AU - Oussalah, Mourad

AU - Mohamed, Muhidin

PY - 2022/6/20

Y1 - 2022/6/20

N2 - Automatic identification of cyberbullying from textual content is known to be a challenging task. The challenges arise from the inherent structure of cyberbullying and the lack of labeled large-scale corpus, enabling efficient machine-learning-based tools including neural networks. This paper advocates a data augmentation-based approach that could enhance the automatic detection of cyberbullying in social media texts. We use both word sense disambiguation and synonymy relation in WordNet lexical database to generate coherent equivalent utterances of cyberbullying input data. The disambiguation and semantic expansion are intended to overcome the inherent limitations of social media posts, such as an abundance of unstructured constructs and limited semantic content. Besides, to test the feasibility, a novel protocol has been employed to collect cyberbullying traces data from AskFm forum, where about a 10K-size dataset has been manually labeled. Next, the problem of cyberbullying identification is viewed as a binary classification problem using an elaborated data augmentation strategy and an appropriate classifier. For the latter, a Convolutional Neural Network (CNN) architecture with FastText and BERT was put forward, whose results were compared against commonly employed Naïve Bayes (NB) and Logistic Regression (LR) classifiers with and without data augmentation. The research outcomes were promising and yielded almost 98.4% of classifier accuracy, an improvement of more than 4% over baseline results.

M3 - Conference publication

SP - 1761

EP - 1770

BT - Data Expansion Using WordNet-based Semantic Expansion and Word Disambiguation for Cyberbullying Detection

T2 - 13th Conference on Language Resources and Evaluation (LREC 2022)

Y2 - 20 June 2022 through 25 June 2022

ER -
