Abstract
Recent advances in Large Language Models (LLMs) have opened new methodological possibilities in authorship analysis research (e.g. Huang et al. 2024, Huang and Grieve 2024, Przystalski et al. 2024). While ‘traditional’ approaches to identifying idiolectal features have relied on manual analysis and/or statistical modelling of naturally occurring text samples, LLMs offer capabilities for analysing and reproducing linguistic patterns at scale. As these models demonstrate abilities to capture and generate aspects of language variation, they present both opportunities and challenges for forensic linguistic research. This paper explores the potential of LLMs to synthesise idiolectal styles for use in authorship analysis experiments. The study uses the 100 Idiolects project data (Heini and Kredens 2023), a corpus of text samples from 112 individuals, each contributing input in seven prescribed discourse types. We use LLMs to reproduce each individual's idiolectal style in an eighth discourse type, then run authorship attribution experiments on this output to gauge the robustness of machine-generated ‘idiolectal’ styles. This experimental design allows us to evaluate both the ability of LLMs to capture individual linguistic patterns across different discourse types and the reliability of synthetic data in authorship attribution tasks. By comparing attribution results for human-authored and LLM-generated texts, we assess the potential for LLMs to assist in authorship analysis tasks and discuss limitations in their ability to replicate idiolectal traits.
| Original language | English |
|---|---|
| Publication status | Unpublished - 2025 |
| Event | 17th Biennial Conference of the International Association for Forensic and Legal Linguistics - Cape Town, South Africa. Duration: 30 Jun 2025 → 4 Jul 2025 |
Keywords
- authorship analysis
- idiolect
- large language models