ALEXSIS-PT: A New Resource for Portuguese Lexical Simplification

Kai North; Marcos Zampieri; Tharindu Ranasinghe

Research output: Contribution to journal › Conference article › peer-review

Abstract

Lexical simplification (LS) is the task of automatically replacing complex words with easier ones, making texts more accessible to various target populations (e.g. individuals with low literacy, individuals with learning disabilities, second language learners). To train and test models, LS systems usually require corpora that feature complex words in context along with their candidate substitutions. To continue improving the performance of LS systems, we introduce ALEXSIS-PT, a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605 candidate substitutions for 387 complex words. ALEXSIS-PT has been compiled following the ALEXSIS protocol for Spanish, opening exciting new avenues for cross-lingual models. ALEXSIS-PT is the first LS multi-candidate dataset that contains Brazilian newspaper articles. We evaluated four models for substitute generation on this dataset, namely mDistilBERT, mBERT, XLM-R, and BERTimbau. BERTimbau achieved the highest performance across all evaluation metrics.

Original language	English
Pages (from-to)	6057-6062
Number of pages	6
Journal	Proceedings - International Conference on Computational Linguistics, COLING
Volume	29
Issue number	1
Publication status	Published - 17 Oct 2022
Event	29th International Conference on Computational Linguistics, COLING 2022 - Gyeongju, Republic of Korea
Duration	12 Oct 2022 → 17 Oct 2022

Bibliographical note

Materials published in or after 2016 are licensed on a Creative Commons Attribution 4.0 International License.

Access to Document

2022.coling-1.529
Final published version, 215 KB. Licence: CC BY 4.0

https://arxiv.org/abs/2209.09034

Cite this

@article{86145c505e1947d8bb18b95baf5c8e0d,

title = "ALEXSIS-PT: A New Resource for Portuguese Lexical Simplification",

abstract = "Lexical simplification (LS) is the task of automatically replacing complex words with easier ones, making texts more accessible to various target populations (e.g. individuals with low literacy, individuals with learning disabilities, second language learners). To train and test models, LS systems usually require corpora that feature complex words in context along with their candidate substitutions. To continue improving the performance of LS systems, we introduce ALEXSIS-PT, a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605 candidate substitutions for 387 complex words. ALEXSIS-PT has been compiled following the ALEXSIS protocol for Spanish, opening exciting new avenues for cross-lingual models. ALEXSIS-PT is the first LS multi-candidate dataset that contains Brazilian newspaper articles. We evaluated four models for substitute generation on this dataset, namely mDistilBERT, mBERT, XLM-R, and BERTimbau. BERTimbau achieved the highest performance across all evaluation metrics.",

author = "Kai North and Marcos Zampieri and Tharindu Ranasinghe",

journal = "Proceedings - International Conference on Computational Linguistics, COLING",

note = "Materials published in or after 2016 are licensed on a Creative Commons Attribution 4.0 International License.; 29th International Conference on Computational Linguistics, COLING 2022 ; Conference date: 12-10-2022 Through 17-10-2022",

year = "2022",

month = oct,

day = "17",

language = "English",

volume = "29",

pages = "6057--6062",

number = "1",

}

TY - JOUR

T1 - ALEXSIS-PT: A New Resource for Portuguese Lexical Simplification

T2 - 29th International Conference on Computational Linguistics, COLING 2022

AU - North, Kai

AU - Zampieri, Marcos

AU - Ranasinghe, Tharindu

N1 - Materials published in or after 2016 are licensed on a Creative Commons Attribution 4.0 International License.

PY - 2022/10/17

Y1 - 2022/10/17

N2 - Lexical simplification (LS) is the task of automatically replacing complex words with easier ones, making texts more accessible to various target populations (e.g. individuals with low literacy, individuals with learning disabilities, second language learners). To train and test models, LS systems usually require corpora that feature complex words in context along with their candidate substitutions. To continue improving the performance of LS systems, we introduce ALEXSIS-PT, a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605 candidate substitutions for 387 complex words. ALEXSIS-PT has been compiled following the ALEXSIS protocol for Spanish, opening exciting new avenues for cross-lingual models. ALEXSIS-PT is the first LS multi-candidate dataset that contains Brazilian newspaper articles. We evaluated four models for substitute generation on this dataset, namely mDistilBERT, mBERT, XLM-R, and BERTimbau. BERTimbau achieved the highest performance across all evaluation metrics.

AB - Lexical simplification (LS) is the task of automatically replacing complex words with easier ones, making texts more accessible to various target populations (e.g. individuals with low literacy, individuals with learning disabilities, second language learners). To train and test models, LS systems usually require corpora that feature complex words in context along with their candidate substitutions. To continue improving the performance of LS systems, we introduce ALEXSIS-PT, a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605 candidate substitutions for 387 complex words. ALEXSIS-PT has been compiled following the ALEXSIS protocol for Spanish, opening exciting new avenues for cross-lingual models. ALEXSIS-PT is the first LS multi-candidate dataset that contains Brazilian newspaper articles. We evaluated four models for substitute generation on this dataset, namely mDistilBERT, mBERT, XLM-R, and BERTimbau. BERTimbau achieved the highest performance across all evaluation metrics.

UR - http://www.scopus.com/inward/record.url?scp=85165738769&partnerID=8YFLogxK

UR - https://aclanthology.org/2022.coling-1.529/

M3 - Conference article

AN - SCOPUS:85165738769

SN - 2951-2093

VL - 29

SP - 6057

EP - 6062

JO - Proceedings - International Conference on Computational Linguistics, COLING

JF - Proceedings - International Conference on Computational Linguistics, COLING

IS - 1

Y2 - 12 October 2022 through 17 October 2022

ER -
