Word Complexity Prediction Through ML-Based Contextual Analysis

Muhammad Uzzam, Amal Htait

Research output: Chapter in Book/Published conference output › Chapter (peer-reviewed)

Abstract

This paper presents a comparative evaluation of two approaches for predicting word complexity using contextual sentence information, a challenge that traditional methods often struggle to address. The first approach combines XLNet word embeddings with a Random Forest classifier, processing both sentence and word embeddings to predict complexity levels. The second approach employs a dual Bidirectional Encoder Representations from Transformers (BERT) model, consisting of two separate models: one for sentence-level complexity and another for word-level complexity, with their predictions combined for a more context-sensitive result. A diverse dataset covering religious, biomedical, and parliamentary texts was used, pre-categorised into five complexity levels (Very-easy, Easy, Medium, Hard, Very-hard). To ensure balanced class representation, data augmentation techniques were applied. Evaluation showed that the XLNet-based model performed slightly better than the dual-BERT method, achieving a macro-average F1-score of 0.79 and excelling particularly at identifying highly complex words (F1-score = 0.95). In comparison, dual-BERT achieved a macro-average F1-score of 0.78.
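The first approach described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the random vectors stand in for XLNet sentence and word embeddings (which in practice would come from a pretrained encoder such as Hugging Face's `xlnet-base-cased`), and the dataset, feature dimensions, and hyperparameters are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
EMB_DIM = 768  # hidden size of XLNet-base (assumption)

# Stand-in for XLNet embeddings: in the real pipeline these would be the
# sentence-level vector and the target word's token vector produced by a
# pretrained XLNet encoder.
def fake_xlnet_embedding(dim=EMB_DIM):
    return rng.normal(size=dim)

N = 200  # toy dataset size (assumption)
LEVELS = ["Very-easy", "Easy", "Medium", "Hard", "Very-hard"]

# Feature vector = sentence embedding concatenated with word embedding,
# so the classifier sees both the word and its sentential context.
X = np.stack([
    np.concatenate([fake_xlnet_embedding(), fake_xlnet_embedding()])
    for _ in range(N)
])
y = rng.choice(LEVELS, size=N)  # toy labels; the paper's data is pre-categorised

# Random Forest over the concatenated embeddings predicts one of the
# five complexity levels.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
preds = clf.predict(X[:5])
```

The key design point is the concatenation step: by feeding the sentence and word representations jointly to the classifier, the complexity prediction becomes context-sensitive rather than depending on the word in isolation.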
Original language: English
Title of host publication: Proceedings of Machine Learning Research
Subtitle of host publication: Proceedings of the UK AI Conference 2024
Place of publication: UK
Pages: 53-61
Number of pages: 9
Volume: 295
Publication status: Published - 2025
