Abstract
This paper presents a comparative evaluation of two approaches for predicting word complexity from contextual sentence information, a challenge that traditional methods often struggle to address. The first approach combines XLNet word embeddings with a Random Forest classifier, processing both sentence and word embeddings to predict complexity levels. The second employs a dual Bidirectional Encoder Representations from Transformers (BERT) architecture consisting of two separate models, one for sentence-level complexity and one for word-level complexity, whose predictions are combined for a more context-sensitive result. A diverse dataset covering religious, biomedical, and parliamentary texts was used; it is pre-categorised into five complexity levels (Very-easy, Easy, Medium, Hard, Very-hard). Data augmentation techniques were applied to ensure balanced class representation. Evaluation showed that the XLNet-based model performed slightly better than the dual-BERT method, achieving a macro-average F1-score of 0.79 and excelling particularly at identifying highly complex words (F1-score = 0.95). In comparison, the dual-BERT approach achieved a macro-average F1-score of 0.78.
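The first pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes sentence- and word-level XLNet embeddings (768-dimensional) have already been extracted, and substitutes random vectors and random labels for the real data; the feature layout (concatenation of the two embeddings) and the five-class label scheme follow the abstract.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in for precomputed XLNet embeddings: in the paper these
# would come from an XLNet encoder; here random 768-dim vectors are used.
rng = np.random.default_rng(0)
n_samples, dim = 200, 768
sentence_emb = rng.normal(size=(n_samples, dim))
word_emb = rng.normal(size=(n_samples, dim))

# Concatenate sentence- and word-level embeddings into one feature vector,
# so the classifier sees both the word and its sentential context.
X = np.concatenate([sentence_emb, word_emb], axis=1)

# Five complexity levels, 0 = Very-easy ... 4 = Very-hard (encoding assumed).
y = rng.integers(0, 5, size=n_samples)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
preds = clf.predict(X)
```

In practice the embeddings would be produced once per (word, sentence) pair and cached, since the Random Forest itself trains quickly relative to transformer inference.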
| Original language | English |
|---|---|
| Title of host publication | Proceedings of Machine Learning Research |
| Subtitle of host publication | Proceedings of the UK AI Conference 2024 |
| Place of Publication | UK |
| Pages | 53-61 |
| Number of pages | 9 |
| Volume | 295 |
| Publication status | Published - 2025 |