TY - CONF
T1 - MasakhaNEWS: News Topic Classification for African languages
AU - Adelani, David Ifeoluwa
AU - Masiak, Marek
AU - Azime, Israel Abebe
AU - Alabi, Jesujoba Oluwadara
AU - Tonja, Atnafu Lambebo
AU - Mwase, Christine
AU - Ogundepo, Odunayo
AU - Dossou, Bonaventure F. P.
AU - Oladipo, Akintunde
AU - Nixdorf, Doreen
AU - Emezue, Chris Chinenye
AU - al-azzawi, Sana Sabah
AU - Sibanda, Blessing K.
AU - David, Davis
AU - Ndolela, Lolwethu
AU - Mukiibi, Jonathan
AU - Ajayi, Tunde Oluwaseyi
AU - Ngoli, Tatiana Moteu
AU - Odhiambo, Brian
AU - Owodunni, Abraham Toluwase
AU - Obiefuna, Nnaemeka C.
AU - Mohamed, Muhidin
AU - Muhammad, Shamsuddeen Hassan
AU - Ababu, Teshome Mulugeta
AU - Abdullahi, Saheed Salahudeen
AU - Yigezu, Mesay Gemeda
AU - Gwadabe, Tajuddeen
AU - Abdulmumin, Idris
AU - Bame, Mahlet Taye
AU - Awoyomi, Oluwabusayo Olufunke
AU - Shode, Iyanuoluwa
AU - Adelani, Tolulope Anu
AU - Kailani, Habiba Abdulganiy
AU - Omotayo, Abdul-Hakeem
AU - Adeeko, Adetola
AU - Abeeb, Afolabi
AU - Aremu, Anuoluwapo
AU - Samuel, Olanrewaju
AU - Siro, Clemencia
AU - Kimotho, Wangari
AU - Ogbu, Onyekachi Raphael
AU - Mbonu, Chinedu E.
AU - Chukwuneke, Chiamaka I.
AU - Fanijo, Samuel
AU - Ojo, Jessica
AU - Awosan, Oyinkansola F.
AU - Guge, Tadesse Kebede
AU - Sari, Sakayo Toadoum
AU - Nyatsine, Pamela
AU - Sidume, Freedmore
AU - Yousuf, Oreen
AU - Oduwole, Mardiyyah
AU - Tshinu, Kanda Patrick
AU - Kimanuka, Ussen
AU - Diko, Thina
AU - Nxakama, Siyanda
AU - Nugussie, Sinodos G.
AU - Johar, Abdulmejid Tuni
AU - Mohamed, Shafie Abdi
AU - Hassan, Fuad Mire
AU - Mehamed, Moges Ahmed
AU - Ngabire, Evrard
AU - Twagirayezu, Jules
AU - Ssenkungu, Ivan
AU - Stenetorp, Pontus
N1 - The arXiv preprint of this paper is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).
PY - 2023/11
Y1 - 2023/11
AB - African languages are severely under-represented in NLP research due to a lack of datasets covering several NLP tasks. While there are individual language-specific datasets that are being expanded to different tasks, only a handful of NLP tasks (e.g., named entity recognition and machine translation) have standardized benchmark datasets covering several geographically and typologically diverse African languages. In this paper, we develop MasakhaNEWS -- a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa. We provide an evaluation of baseline models by training classical machine learning models and fine-tuning several language models. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning, such as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern-exploiting training (PET), prompting language models (like ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit and the Cohere Embedding API). Our evaluation in the zero-shot setting shows the potential of prompting ChatGPT for news topic classification in low-resource African languages, achieving an average performance of 70 F1 points without leveraging additional supervision like MAD-X. In the few-shot setting, we show that with as few as 10 examples per label, we achieve more than 90% (i.e., 86.0 F1 points) of the performance of full supervised training (92.6 F1 points) by leveraging the PET approach.
KW - cs.CL
UR - https://aclanthology.org/2023.ijcnlp-main.10/
UR - https://github.com/masakhane-io/masakhane-news
UR - https://huggingface.co/datasets/masakhane/masakhanews
DO - 10.48550/arXiv.2304.09972
M3 - Conference publication
VL - 1
SP - 144
EP - 159
BT - Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
PB - Association for Computational Linguistics (ACL)
ER -