TY - CONF
T1 - MasakhaNEWS: News Topic Classification for African languages
AU - Adelani, David Ifeoluwa
AU - Masiak, Marek
AU - Azime, Israel Abebe
AU - Alabi, Jesujoba Oluwadara
AU - Tonja, Atnafu Lambebo
AU - Mwase, Christine
AU - Ogundepo, Odunayo
AU - Dossou, Bonaventure F. P.
AU - Oladipo, Akintunde
AU - Nixdorf, Doreen
AU - Emezue, Chris Chinenye
AU - al-azzawi, Sana Sabah
AU - Sibanda, Blessing K.
AU - David, Davis
AU - Ndolela, Lolwethu
AU - Mukiibi, Jonathan
AU - Ajayi, Tunde Oluwaseyi
AU - Ngoli, Tatiana Moteu
AU - Odhiambo, Brian
AU - Owodunni, Abraham Toluwase
AU - Obiefuna, Nnaemeka C.
AU - Mohamed, Muhidin
AU - Muhammad, Shamsuddeen Hassan
AU - Ababu, Teshome Mulugeta
AU - Abdullahi, Saheed Salahudeen
AU - Yigezu, Mesay Gemeda
AU - Gwadabe, Tajuddeen
AU - Abdulmumin, Idris
AU - Bame, Mahlet Taye
AU - Awoyomi, Oluwabusayo Olufunke
AU - Shode, Iyanuoluwa
AU - Adelani, Tolulope Anu
AU - Kailani, Habiba Abdulganiy
AU - Omotayo, Abdul-Hakeem
AU - Adeeko, Adetola
AU - Abeeb, Afolabi
AU - Aremu, Anuoluwapo
AU - Samuel, Olanrewaju
AU - Siro, Clemencia
AU - Kimotho, Wangari
AU - Ogbu, Onyekachi Raphael
AU - Mbonu, Chinedu E.
AU - Chukwuneke, Chiamaka I.
AU - Fanijo, Samuel
AU - Ojo, Jessica
AU - Awosan, Oyinkansola F.
AU - Guge, Tadesse Kebede
AU - Sari, Sakayo Toadoum
AU - Nyatsine, Pamela
AU - Sidume, Freedmore
AU - Yousuf, Oreen
AU - Oduwole, Mardiyyah
AU - Tshinu, Kanda Patrick
AU - Kimanuka, Ussen
AU - Diko, Thina
AU - Nxakama, Siyanda
AU - Nugussie, Sinodos G.
AU - Johar, Abdulmejid Tuni
AU - Mohamed, Shafie Abdi
AU - Hassan, Fuad Mire
AU - Mehamed, Moges Ahmed
AU - Ngabire, Evrard
AU - Twagirayezu, Jules
AU - Ssenkungu, Ivan
AU - Stenetorp, Pontus
N1 - The arXiv preprint of this paper is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).
PY - 2023/11
Y1 - 2023/11
AB - African languages are severely under-represented in NLP research due to a lack of datasets covering several NLP tasks. While there are individual language-specific datasets that are being expanded to different tasks, only a handful of NLP tasks (e.g., named entity recognition and machine translation) have standardized benchmark datasets covering several geographically and typologically diverse African languages. In this paper, we develop MasakhaNEWS -- a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa. We provide an evaluation of baseline models by training classical machine learning models and fine-tuning several language models. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning, such as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern-exploiting training (PET), prompting language models (like ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit and the Cohere Embedding API). Our evaluation in the zero-shot setting shows the potential of prompting ChatGPT for news topic classification in low-resource African languages, achieving an average performance of 70 F1 points without leveraging additional supervision like MAD-X. In the few-shot setting, we show that with as few as 10 examples per label, we achieve more than 90% (i.e., 86.0 F1 points) of the performance of full supervised training (92.6 F1 points) by leveraging the PET approach.
KW - cs.CL
UR - https://aclanthology.org/2023.ijcnlp-main.10/
UR - https://github.com/masakhane-io/masakhane-news
UR - https://huggingface.co/datasets/masakhane/masakhanews
DO - 10.48550/arXiv.2304.09972
M3 - Conference publication
VL - 1
SP - 144
EP - 159
BT - Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
PB - Association for Computational Linguistics (ACL)
ER -