Abstract
This paper presents a solution for predicting word complexity using contextual sentence information, a problem that traditional methods often struggle to address. It also introduces a user-friendly interface that dynamically assesses word complexity and provides explanations by considering both individual word features and their surrounding context.
Three distinct approaches were explored in this work. The first approach applied a Bidirectional Long Short-Term Memory (Bi-LSTM) model, trained on linguistic and semantic features extracted from the text. The second method uses Bidirectional Encoder Representations from Transformers (BERT) with two separate models: one for sentence-level complexity and another for word-level complexity, with the predictions combined for a more context-sensitive result. The third approach introduces a novel method that combines XLNet word embeddings with a Random Forest classifier, processing both sentence and word embeddings to predict complexity levels.
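The third approach can be sketched roughly as follows: embed the target word and its sentence, concatenate the two vectors, and classify the result with a Random Forest. This is a minimal illustration only, not the authors' implementation; random vectors stand in for real XLNet embeddings (which would come from a model such as HuggingFace's `xlnet-base-cased`, whose base hidden size of 768 is assumed here), and the labels are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples, dim = 200, 768  # 768 = XLNet-base hidden size (assumed)

# Placeholders for real embeddings: one vector for the target word,
# one (e.g. mean-pooled) vector for its sentence.
word_emb = rng.normal(size=(n_samples, dim))
sent_emb = rng.normal(size=(n_samples, dim))

# Joint feature vector combines word- and sentence-level information.
X = np.hstack([word_emb, sent_emb])
# Five complexity levels: Very-easy .. Very-hard, encoded 0..4.
y = rng.integers(0, 5, size=n_samples)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
preds = clf.predict(X)
```

Concatenating the two embeddings is one straightforward way to let the classifier see both the word itself and its context; the abstract does not specify how the embeddings were pooled or combined, so that detail is an assumption here.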
A diverse dataset covering the domains of religion, biomedical, and parliamentary texts was used in this work, pre-categorised into five complexity levels (Very-easy, Easy, Medium, Hard, Very-hard). To ensure balanced class representation, data augmentation techniques were applied. Evaluation metrics revealed that the XLNet-based model (third method) outperformed the others, achieving 80% accuracy (Macro-Average F1-measure = 0.78) and particularly excelling at identifying highly complex words (F1-measure = 0.95). The BERT-based model followed closely with an accuracy of 78% (Macro-Average F1-measure = 0.75), and the Bi-LSTM method achieved an accuracy of 63% (Macro-Average F1-measure = 0.63). The best-performing model (XLNet-based) was then selected as the engine behind a user-friendly interface created with Gradio, which can detect complex words in an input sentence and provide explanations.
This work highlights the importance of utilising both word- and sentence-level embeddings for effective complexity prediction. The developed models, along with the user-friendly interface, have significant potential applications in education by helping language learners navigate challenging vocabulary.
Original language | English
---|---
Number of pages | 1
Publication status | Unpublished - 22 Nov 2024
Event | The Second UK AI Conference 2024 - University of Birmingham, Birmingham, United Kingdom. Duration: 22 Nov 2024 → 22 Nov 2024. https://uk-ai.org/ukai2024/
Conference

Conference | The Second UK AI Conference 2024
---|---
Abbreviated title | UK AI
Country/Territory | United Kingdom
City | Birmingham
Period | 22/11/24 → 22/11/24
Internet address | https://uk-ai.org/ukai2024/