Large-scale authorship attribution with sociolinguistically dynamic data

Research output: Unpublished contribution to conference › Unpublished Conference Paper › peer-review


This paper presents an evaluation of two approaches to large-scale authorship attribution. The data sets contain over 60 million posts
(ca. 3 billion word tokens) contributed to online discussion boards by over one million registered members, which makes them
significantly larger, in terms of both the number of documents and the number of authors, than any other experimental collection to date. Importantly,
from a forensic linguistic perspective, the data sets are also highly interactive and dynamic, featuring hundreds of thousands of
authors engaging in complex polylogic exchanges on a wide range of topics over several years. We believe such an experimental
setup reduces some of the typical biases found in automated authorship attribution experiments that have used fairly static data
(e.g. blog posts or emails).
The first approach reported is a K-Nearest Neighbours (KNN) algorithm which transforms text samples into query vectors and collects
aggregated relevance scores of probable authors. The second approach is a FastText classifier (Joulin et al. 2016) utilising recent
advances in natural language processing such as vector-based word representations obtained through neural network training.
Depending on the number of test samples used for classification, our recall rate is 44 to 75 per cent at the 30th rank of the prediction
lists. We discuss the implications of our findings for the notion of idiolect and, more widely, for internet-scale authorship attribution.
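
The KNN approach and the rank-based recall measure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the character-trigram features, the cosine similarity, the summing of scores per author, and the names `char_ngrams`, `rank_authors` and `recall_at_rank` are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Turn a text sample into a sparse vector of character n-gram counts.

    Character trigrams are an assumed feature choice, not one named in the abstract.
    """
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_authors(query_texts, training_posts, k=3):
    """Return candidate authors ordered by aggregated relevance score.

    Each query text retrieves its k most similar training posts; the
    similarity of every retrieved post is added to its author's total.
    """
    vectors = [(author, char_ngrams(text)) for author, text in training_posts]
    scores = defaultdict(float)
    for query in query_texts:
        query_vector = char_ngrams(query)
        neighbours = sorted(
            ((cosine(query_vector, vector), author) for author, vector in vectors),
            reverse=True,
        )
        for similarity, author in neighbours[:k]:
            scores[author] += similarity
    return sorted(scores, key=scores.get, reverse=True)


def recall_at_rank(ranked_lists, true_authors, rank):
    """Share of test cases whose true author appears in the top `rank` predictions."""
    hits = sum(
        1 for ranked, truth in zip(ranked_lists, true_authors) if truth in ranked[:rank]
    )
    return hits / len(true_authors)


# Hypothetical toy data: (author, post) pairs standing in for forum posts.
training_posts = [
    ("alice", "i reckon the weather is lovely today"),
    ("alice", "i reckon we should go for a walk"),
    ("bob", "lol that movie was sooo bad"),
    ("bob", "lol cant wait for the sequel tho"),
]
prediction = rank_authors(["i reckon it is lovely"], training_posts, k=2)
```

Passing several test samples in `query_texts` pools their evidence before ranking, which mirrors the abstract's observation that recall depends on the number of test samples used. Calling `recall_at_rank` with `rank=30` on a set of such prediction lists gives the kind of figure reported above.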

Joulin, A., Grave, E., Bojanowski, P. and T. Mikolov. 2016. ‘Bag of Tricks for Efficient Text Classification.’ arXiv preprint arXiv:1607.01759.
Original language: English
Publication status: Published - 2019
Event: 14th Biennial Conference of the International Association of Forensic Linguists
Duration: 1 Jul 2019 to 5 Jul 2019




  • forensic linguistics
  • forensic authorship analysis

