Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi

Saurabh Gaikwad; Tharindu Ranasinghe; Marcos Zampieri; Christopher Homan

doi:10.48550/arXiv.2109.03552

Research output: Chapter in Book/Published conference output › Conference publication

Abstract

The widespread presence of offensive language on social media motivated the development of systems capable of recognizing such content automatically. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English. To address this shortcoming, we introduce MOLD, the Marathi Offensive Language Dataset. MOLD is the first dataset of its kind compiled for Marathi, thus opening a new domain for research in low-resource Indo-Aryan languages. We present results from several machine learning experiments on this dataset, including zero-shot and other transfer learning experiments on state-of-the-art cross-lingual transformers from existing data in Bengali, English, and Hindi.

Original language	English
Title of host publication	Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Number of pages	7
DOIs	https://doi.org/10.48550/arXiv.2109.03552
Publication status	Published - Sept 2021

Bibliographical note

This accepted manuscript is distributed under the terms of the Creative Commons Attribution License CC BY [https://creativecommons.org/licenses/by/4.0/], which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Access to Document

10.48550/arXiv.2109.03552 (Licence: CC BY 4.0)

Galkwadetal_2021: Accepted author manuscript, 215 KB (Licence: CC BY 4.0)

Cite this

@inproceedings{9400585e637d49298be7ad5ccbeed743,
title = "Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi",
abstract = "The widespread presence of offensive language on social media motivated the development of systems capable of recognizing such content automatically. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English. To address this shortcoming, we introduce MOLD, the Marathi Offensive Language Dataset. MOLD is the first dataset of its kind compiled for Marathi, thus opening a new domain for research in low-resource Indo-Aryan languages. We present results from several machine learning experiments on this dataset, including zero-shot and other transfer learning experiments on state-of-the-art cross-lingual transformers from existing data in Bengali, English, and Hindi.",
author = "Saurabh Gaikwad and Tharindu Ranasinghe and Marcos Zampieri and Christopher Homan",
note = "This accepted manuscript is distributed under the terms of the Creative Commons Attribution License CC BY [https://creativecommons.org/licenses/by/4.0/], which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.",
year = "2021",
month = sep,
doi = "10.48550/arXiv.2109.03552",
language = "English",
booktitle = "Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)",
}

Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi. / Gaikwad, Saurabh; Ranasinghe, Tharindu; Zampieri, Marcos et al.
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021). 2021.

TY - GEN
T1 - Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi
AU - Gaikwad, Saurabh
AU - Ranasinghe, Tharindu
AU - Zampieri, Marcos
AU - Homan, Christopher
N1 - This accepted manuscript is distributed under the terms of the Creative Commons Attribution License CC BY [https://creativecommons.org/licenses/by/4.0/], which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
PY - 2021/9
Y1 - 2021/9
N2 - The widespread presence of offensive language on social media motivated the development of systems capable of recognizing such content automatically. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English. To address this shortcoming, we introduce MOLD, the Marathi Offensive Language Dataset. MOLD is the first dataset of its kind compiled for Marathi, thus opening a new domain for research in low-resource Indo-Aryan languages. We present results from several machine learning experiments on this dataset, including zero-shot and other transfer learning experiments on state-of-the-art cross-lingual transformers from existing data in Bengali, English, and Hindi.
AB - The widespread presence of offensive language on social media motivated the development of systems capable of recognizing such content automatically. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English. To address this shortcoming, we introduce MOLD, the Marathi Offensive Language Dataset. MOLD is the first dataset of its kind compiled for Marathi, thus opening a new domain for research in low-resource Indo-Aryan languages. We present results from several machine learning experiments on this dataset, including zero-shot and other transfer learning experiments on state-of-the-art cross-lingual transformers from existing data in Bengali, English, and Hindi.
UR - https://arxiv.org/abs/2109.03552
U2 - 10.48550/arXiv.2109.03552
DO - 10.48550/arXiv.2109.03552
M3 - Conference publication
BT - Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
ER -
