Large-scale authorship attribution with sociolinguistically dynamic data

Research output: Contribution to conference (Paper)

Abstract

This paper presents an evaluation of two approaches to large-scale authorship attribution. The data sets contain over 60 million posts
(ca. 3 billion word tokens) contributed to online discussion boards by over one million registered members, which makes them
significantly larger in terms of both the number of documents and authors than any other experimental collection to date. Importantly
from a forensic linguistic perspective, the data sets are also highly interactive and dynamic, featuring hundreds of thousands of
authors engaging in complex polylogic exchanges on a wide range of topics over several years. We believe such an experimental
setup reduces some of the typical biases found in automated authorship attribution experiments which have used fairly static data
(e.g. blog posts or emails).
The first approach reported is a K-Nearest Neighbours (KNN) algorithm which transforms text samples into query vectors and collects
aggregated relevance scores of probable authors. The second approach is a FastText classifier (Joulin et al. 2016) utilising recent
advances in natural language processing such as vector-based word representations obtained through neural network training.
Depending on the number of test samples used for classification, our recall rate is 44 to 75 per cent at the 30th rank of the prediction
lists. We discuss the implications of our findings for the notion of idiolect and, more widely, for internet-scale authorship attribution.
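
The KNN scoring and the rank-based recall measure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the term-frequency vectors, the cosine similarity, and the function names (vectorize, cosine, rank_authors, recall_at_rank) are all assumptions chosen for illustration, and the abstract does not specify how the query vectors or relevance scores were actually computed.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Turn a text sample into a sparse term-frequency vector (toy tokenizer, an assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_authors(query_text, indexed_posts, k=3):
    """Rank candidate authors by summed similarity over the k nearest posts.

    indexed_posts is a list of (author, text) pairs; summing the scores is a
    hypothetical stand-in for the "aggregated relevance scores" in the abstract.
    """
    query = vectorize(query_text)
    scored = [(cosine(query, vectorize(text)), author) for author, text in indexed_posts]
    nearest = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    totals = defaultdict(float)
    for score, author in nearest:
        totals[author] += score
    return sorted(totals, key=totals.get, reverse=True)


def recall_at_rank(ranked_lists, true_authors, rank):
    """Share of queries whose true author appears within the top `rank` predictions."""
    hits = sum(true in ranked[:rank] for ranked, true in zip(ranked_lists, true_authors))
    return hits / len(true_authors)
```

Under these assumptions, calling recall_at_rank with rank=30 on the ranked lists for a held-out set of posts would give a figure of the same kind as the recall rate reported in the abstract.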

Reference
Joulin, A., Grave, E., Bojanowski, P. and T. Mikolov. ‘Bag of Tricks for Efficient Text Classification.’ ArXiv Preprint ArXiv:1607.01759,
2016.
Original language: English
Publication status: Published - 2019
Event: 14th Biennial Conference of the International Association of Forensic Linguists
Duration: 1 Jul 2019 - 5 Jul 2019

Conference

Conference: 14th Biennial Conference of the International Association of Forensic Linguists
Period: 1/07/19 - 5/07/19

Fingerprint

Blogs
Electronic mail
Linguistics
Classifiers
Internet
Neural networks
Processing
Experiments

Keywords

  • forensic linguistics
  • forensic authorship analysis

Cite this

Kredens, K. J., Pezik, P., & Rogers, L. (2019). Large-scale authorship attribution with sociolinguistically dynamic data. Paper presented at 14th Biennial Conference of the International Association of Forensic Linguists.
Kredens, Krzysztof J ; Pezik, Piotr ; Rogers, Lisa. / Large-scale authorship attribution with sociolinguistically dynamic data. Paper presented at 14th Biennial Conference of the International Association of Forensic Linguists.
@conference{1e254bb77c2447379585c79f28536f9d,
title = "Large-scale authorship attribution with sociolinguistically dynamic data",
abstract = "This paper presents an evaluation of two approaches to large-scale authorship attribution. The data sets contain over 60 million posts (ca. 3 billion word tokens) contributed to online discussion boards by over one million registered members, which makes them significantly larger in terms of both the number of documents and authors than any other experimental collection to date. Importantly from a forensic linguistic perspective, the data sets are also highly interactive and dynamic, featuring hundreds of thousands of authors engaging in complex polylogic exchanges on a wide range of topics over several years. We believe such an experimental setup reduces some of the typical biases found in automated authorship attribution experiments which have used fairly static data (e.g. blog posts or emails). The first approach reported is a K-Nearest Neighbours (KNN) algorithm which transforms text samples into query vectors and collects aggregated relevance scores of probable authors. The second approach is a FastText classifier (Joulin et al. 2016) utilising recent advances in natural language processing such as vector-based word representations obtained through neural network training. Depending on the number of test samples used for classification, our recall rate is 44 to 75 per cent at the 30th rank of the prediction lists. We discuss the implications of our findings for the notion of idiolect and, more widely, for internet-scale authorship attribution. Reference: Joulin, A., Grave, E., Bojanowski, P. and T. Mikolov. ‘Bag of Tricks for Efficient Text Classification.’ ArXiv Preprint ArXiv:1607.01759, 2016.",
keywords = "forensic linguistics, forensic authorship analysis",
author = "Kredens, {Krzysztof J} and Piotr Pezik and Lisa Rogers",
year = "2019",
language = "English",
note = "14th Biennial Conference of the International Association of Forensic Linguists ; Conference date: 01-07-2019 Through 05-07-2019",

}

Kredens, KJ, Pezik, P & Rogers, L 2019, 'Large-scale authorship attribution with sociolinguistically dynamic data', Paper presented at 14th Biennial Conference of the International Association of Forensic Linguists, 1/07/19 - 5/07/19.

Large-scale authorship attribution with sociolinguistically dynamic data. / Kredens, Krzysztof J; Pezik, Piotr; Rogers, Lisa.

2019. Paper presented at 14th Biennial Conference of the International Association of Forensic Linguists.

TY - CONF

T1 - Large-scale authorship attribution with sociolinguistically dynamic data

AU - Kredens, Krzysztof J

AU - Pezik, Piotr

AU - Rogers, Lisa

PY - 2019

Y1 - 2019

N2 - This paper presents an evaluation of two approaches to large-scale authorship attribution. The data sets contain over 60 million posts (ca. 3 billion word tokens) contributed to online discussion boards by over one million registered members, which makes them significantly larger in terms of both the number of documents and authors than any other experimental collection to date. Importantly from a forensic linguistic perspective, the data sets are also highly interactive and dynamic, featuring hundreds of thousands of authors engaging in complex polylogic exchanges on a wide range of topics over several years. We believe such an experimental setup reduces some of the typical biases found in automated authorship attribution experiments which have used fairly static data (e.g. blog posts or emails). The first approach reported is a K-Nearest Neighbours (KNN) algorithm which transforms text samples into query vectors and collects aggregated relevance scores of probable authors. The second approach is a FastText classifier (Joulin et al. 2016) utilising recent advances in natural language processing such as vector-based word representations obtained through neural network training. Depending on the number of test samples used for classification, our recall rate is 44 to 75 per cent at the 30th rank of the prediction lists. We discuss the implications of our findings for the notion of idiolect and, more widely, for internet-scale authorship attribution. Reference: Joulin, A., Grave, E., Bojanowski, P. and T. Mikolov. ‘Bag of Tricks for Efficient Text Classification.’ ArXiv Preprint ArXiv:1607.01759, 2016.

KW - forensic linguistics

KW - forensic authorship analysis

M3 - Paper

ER -

Kredens KJ, Pezik P, Rogers L. Large-scale authorship attribution with sociolinguistically dynamic data. 2019. Paper presented at 14th Biennial Conference of the International Association of Forensic Linguists.