Large-scale authorship attribution with sociolinguistically dynamic data

Krzysztof J Kredens; Piotr Pezik; Lisa Rogers

Research output: Unpublished contribution to conference › Unpublished Conference Paper › peer-review

Abstract

This paper presents an evaluation of two approaches to large-scale authorship attribution. The data sets contain over 60 million posts
(ca. 3 billion word tokens) contributed to online discussion boards by over one million registered members, which makes them
significantly larger in terms of both the number of documents and authors than any other experimental collection to date. Importantly
from a forensic linguistic perspective, the data sets are also highly interactive and dynamic, featuring hundreds of thousands of
authors engaging in complex polylogic exchanges on a wide range of topics over several years. We believe such an experimental
setup reduces some of the typical biases found in automated authorship attribution experiments which have used fairly static data
(e.g. blog posts or emails).
The first approach reported is a K-Nearest Neighbours (KNN) algorithm which transforms text samples into query vectors and collects
aggregated relevance scores of probable authors. The second approach is a FastText classifier (Joulin et al. 2016) utilising recent
advances in natural language processing such as vector-based word representations obtained through neural network training.
Depending on the number of test samples used for classification, our recall rate is 44 to 75 per cent at the 30th rank of the prediction
lists. We discuss the implications of our findings for the notion of idiolect and, more widely, for internet-scale authorship attribution.
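
The two ideas named in the abstract, ranking candidate authors by aggregated nearest-neighbour relevance scores and measuring recall at a given rank of the prediction list, can be illustrated with a minimal sketch. This is not the authors' implementation: the TF-IDF weighting, the cosine similarity, the toy posts and all function names (`tf_idf_vectors`, `rank_authors`, `recall_at_rank`) are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per tokenised document, plus the IDF table."""
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    n_docs = len(docs)
    idf = {tok: math.log((1 + n_docs) / (1 + c)) + 1.0 for tok, c in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_authors(query_tokens, train_vecs, train_authors, idf, k=3):
    """Turn a text sample into a query vector, find its k nearest training
    posts, and sum their similarity scores per author (assumed aggregation)."""
    query = {t: tf * idf[t] for t, tf in Counter(query_tokens).items() if t in idf}
    neighbours = sorted(
        ((cosine(query, vec), author) for vec, author in zip(train_vecs, train_authors)),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    scores = defaultdict(float)
    for similarity, author in neighbours:
        if similarity > 0:
            scores[author] += similarity
    return sorted(scores, key=scores.get, reverse=True)


def recall_at_rank(ranked_lists, true_authors, rank):
    """Share of test samples whose true author appears in the top `rank` predictions."""
    hits = sum(1 for ranked, truth in zip(ranked_lists, true_authors) if truth in ranked[:rank])
    return hits / len(true_authors)


# Hypothetical toy corpus: (author, tokenised post).
posts = [
    ("alice", "honestly i reckon the match was brilliant".split()),
    ("alice", "i reckon that was honestly brilliant stuff".split()),
    ("bob", "lol that game was so bad tbh".split()),
    ("bob", "tbh the ref was so bad lol".split()),
]
authors = [author for author, _ in posts]
vectors, idf = tf_idf_vectors([tokens for _, tokens in posts])

test_samples = ["i reckon it was brilliant".split(), "lol so bad tbh".split()]
test_truth = ["alice", "bob"]
predictions = [rank_authors(s, vectors, authors, idf) for s in test_samples]
print(predictions, recall_at_rank(predictions, test_truth, rank=1))
```

In the setting the abstract describes, the rank would be 30 rather than 1 and the candidate set would hold over a million authors, so an indexed search engine or approximate nearest-neighbour structure would presumably replace the exhaustive comparison used here.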

Reference
Joulin, A., Grave, E., Bojanowski, P. and Mikolov, T. (2016). ‘Bag of Tricks for Efficient Text Classification.’ arXiv preprint arXiv:1607.01759.

Original language	English
Publication status	Published - 2019
Event	14th Biennial Conference of the International Association of Forensic Linguists - Duration: 1 Jul 2019 → 5 Jul 2019

Keywords

forensic linguistics
forensic authorship analysis

Cite this

@conference{1e254bb77c2447379585c79f28536f9d,

title = "Large-scale authorship attribution with sociolinguistically dynamic data",

abstract = "This paper presents an evaluation of two approaches to large-scale authorship attribution. The data sets contain over 60 million posts (ca. 3 billion word tokens) contributed to online discussion boards by over one million registered members, which makes them significantly larger in terms of both the number of documents and authors than any other experimental collection to date. Importantly from a forensic linguistic perspective, the data sets are also highly interactive and dynamic, featuring hundreds of thousands of authors engaging in complex polylogic exchanges on a wide range of topics over several years. We believe such an experimental setup reduces some of the typical biases found in automated authorship attribution experiments which have used fairly static data (e.g. blog posts or emails). The first approach reported is a K-Nearest Neighbours (KNN) algorithm which transforms text samples into query vectors and collects aggregated relevance scores of probable authors. The second approach is a FastText classifier (Joulin et al. 2016) utilising recent advances in natural language processing such as vector-based word representations obtained through neural network training. Depending on the number of test samples used for classification, our recall rate is 44 to 75 per cent at the 30th rank of the prediction lists. We discuss the implications of our findings for the notion of idiolect and, more widely, for internet-scale authorship attribution. Reference: Joulin, A., Grave, E., Bojanowski, P. and T. Mikolov. {\textquoteleft}Bag of Tricks for Efficient Text Classification.{\textquoteright} ArXiv Preprint ArXiv:1607.01759, 2016.",

keywords = "forensic linguistics, forensic authorship analysis",

author = "Kredens, {Krzysztof J} and Piotr Pezik and Lisa Rogers",

year = "2019",

language = "English",

note = "14th Biennial Conference of the International Association of Forensic Linguists ; Conference date: 01-07-2019 Through 05-07-2019",

}

TY - CONF

T1 - Large-scale authorship attribution with sociolinguistically dynamic data

AU - Kredens, Krzysztof J

AU - Pezik, Piotr

AU - Rogers, Lisa

PY - 2019

Y1 - 2019

N2 - This paper presents an evaluation of two approaches to large-scale authorship attribution. The data sets contain over 60 million posts (ca. 3 billion word tokens) contributed to online discussion boards by over one million registered members, which makes them significantly larger in terms of both the number of documents and authors than any other experimental collection to date. Importantly from a forensic linguistic perspective, the data sets are also highly interactive and dynamic, featuring hundreds of thousands of authors engaging in complex polylogic exchanges on a wide range of topics over several years. We believe such an experimental setup reduces some of the typical biases found in automated authorship attribution experiments which have used fairly static data (e.g. blog posts or emails). The first approach reported is a K-Nearest Neighbours (KNN) algorithm which transforms text samples into query vectors and collects aggregated relevance scores of probable authors. The second approach is a FastText classifier (Joulin et al. 2016) utilising recent advances in natural language processing such as vector-based word representations obtained through neural network training. Depending on the number of test samples used for classification, our recall rate is 44 to 75 per cent at the 30th rank of the prediction lists. We discuss the implications of our findings for the notion of idiolect and, more widely, for internet-scale authorship attribution. Reference: Joulin, A., Grave, E., Bojanowski, P. and T. Mikolov. ‘Bag of Tricks for Efficient Text Classification.’ ArXiv Preprint ArXiv:1607.01759, 2016.

AB - This paper presents an evaluation of two approaches to large-scale authorship attribution. The data sets contain over 60 million posts (ca. 3 billion word tokens) contributed to online discussion boards by over one million registered members, which makes them significantly larger in terms of both the number of documents and authors than any other experimental collection to date. Importantly from a forensic linguistic perspective, the data sets are also highly interactive and dynamic, featuring hundreds of thousands of authors engaging in complex polylogic exchanges on a wide range of topics over several years. We believe such an experimental setup reduces some of the typical biases found in automated authorship attribution experiments which have used fairly static data (e.g. blog posts or emails). The first approach reported is a K-Nearest Neighbours (KNN) algorithm which transforms text samples into query vectors and collects aggregated relevance scores of probable authors. The second approach is a FastText classifier (Joulin et al. 2016) utilising recent advances in natural language processing such as vector-based word representations obtained through neural network training. Depending on the number of test samples used for classification, our recall rate is 44 to 75 per cent at the 30th rank of the prediction lists. We discuss the implications of our findings for the notion of idiolect and, more widely, for internet-scale authorship attribution. Reference: Joulin, A., Grave, E., Bojanowski, P. and T. Mikolov. ‘Bag of Tricks for Efficient Text Classification.’ ArXiv Preprint ArXiv:1607.01759, 2016.

KW - forensic linguistics

KW - forensic authorship analysis

M3 - Unpublished Conference Paper

T2 - 14th Biennial Conference of the International Association of Forensic Linguists

Y2 - 1 July 2019 through 5 July 2019

ER -
