Predicting the type and target of offensive social media posts in Marathi

Marcos Zampieri*, Tharindu Ranasinghe, Mrinal Chaudhari, Saurabh Gaikwad, Prajwal Krishna, Mayuresh Nene, Shrunali Paygude

*Corresponding author for this work

    Research output: Contribution to journal, Article, peer-review

    Abstract

    Offensive language is very common on social media, motivating platforms to invest in strategies that make communities safer. These include developing robust machine learning systems capable of recognizing offensive content online. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English and a few other high-resource languages such as French, German, and Spanish. In this paper, we address this gap by tackling offensive language identification in Marathi, a low-resource Indo-Aryan language spoken in India. We introduce the Marathi Offensive Language Dataset v.2.0, or MOLD 2.0, and present multiple experiments on this dataset. MOLD 2.0 is a much larger version of MOLD, with annotation expanded to levels B (type) and C (target) of the popular OLID taxonomy. MOLD 2.0 is the first hierarchical offensive language dataset compiled for Marathi, thus opening new avenues for research in low-resource Indo-Aryan languages. Finally, we also introduce SeMOLD, a larger dataset annotated following the semi-supervised methods presented in SOLID (Rosenthal et al., SOLID: a large-scale semi-supervised dataset for offensive language identification. In: Findings of ACL, 2021).
    Original language: English
    Article number: 77
    Number of pages: 10
    Journal: Social Network Analysis and Mining
    Volume: 12
    Issue number: 1
    Early online date: 9 Jul 2022
    DOIs: https://doi.org/10.1007/s13278-022-00906-8
    Publication status: Published - Dec 2022

    Bibliographical note

    Copyright © Springer Nature B.V. 2022. This version of the article has been accepted for publication, after peer review (when applicable) and is subject to Springer Nature’s AM terms of use [https://www.springernature.com/gp/open-research/policies/accepted-manuscript-terms], but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: https://doi.org/10.1007/s13278-022-00906-8

    Keywords

    • Offensive language identification
    • Hate speech
    • Machine learning
    • Deep learning
    • Low-resource languages
