Toward linguistic explanation of idiolectal variation – understanding the black box

Research output: Contribution to conference › Paper

Abstract

The development of powerful computing tools and easy accessibility of large quantities of language data online have sparked renewed
interest in authorship analysis in a variety of domains. However, as new computational models are put forward for authorship
attribution purposes and ever greater success rates reported, the vast majority of studies remain silent on the nature and types of
linguistic phenomena associated with idiolectal style. Meanwhile, in forensic authorship attribution, models should be explanatorily
rich: the forensic linguist needs to be both certain of the validity of his/her findings and able to explain them to lay triers of fact; s/he
needs to know what actually happens inside the black box.
This paper reports on the findings of a project using what, to the best of our knowledge, is the biggest corpus (ca. 3 billion words
produced by over one million authors) ever used in computational author classification research. However, we are less concerned
with classification results here than with harnessing the big-data capability to inform our understanding of idiolectal
variation. By drawing up a typology of style markers that proved to be instrumental in correct author classification (resulting in ‘true
positives’) and those that classified authors erroneously (yielding ‘false positives’), we revisit the fundamental question (e.g. Coulthard
2004, McMenamin 1993) of just what kinds of idiolectal style markers have the greatest individuating potential.

References
Coulthard, M. (2004), 'Author Identification, Idiolect, and Linguistic Uniqueness', Applied Linguistics 25(4), pp. 431–447.
McMenamin, G. (1993), Forensic Stylistics, Amsterdam: Elsevier.

Original language: English
Publication status: Published - 2019
Event: 14th Biennial Conference of the International Association of Forensic Linguists
Duration: 1 Jul 2019 → 5 Jul 2019

Conference

Conference: 14th Biennial Conference of the International Association of Forensic Linguists
Period: 1/07/19 → 5/07/19

Keywords

  • forensic linguistics
  • forensic authorship analysis

Cite this

Kredens, K. J., Pezik, P., & Rogers, L. (2019). Toward linguistic explanation of idiolectal variation – understanding the black box. Paper presented at 14th Biennial Conference of the International Association of Forensic Linguists.
Kredens, Krzysztof J; Pezik, Piotr; Rogers, Lisa. / Toward linguistic explanation of idiolectal variation – understanding the black box. Paper presented at 14th Biennial Conference of the International Association of Forensic Linguists.