Toward linguistic explanation of idiolectal variation – understanding the black box

Krzysztof J Kredens; Piotr Pezik; Lisa Rogers

Research output: Unpublished contribution to conference › Unpublished Conference Paper › peer-review

Abstract

The development of powerful computing tools and the easy accessibility of large quantities of language data online have sparked renewed
interest in authorship analysis in a variety of domains. However, as new computational models are put forward for authorship
attribution purposes and ever greater success rates are reported, the vast majority of studies remain silent on the nature and types of
linguistic phenomena associated with idiolectal style. Meanwhile, in forensic authorship attribution, models should be explanatorily
rich: the forensic linguist needs to be both certain of the validity of his/her findings and able to explain them to lay triers of fact; s/he
needs to know what actually happens inside the black box.

This paper reports on the findings of a project using what, to the best of our knowledge, is the biggest corpus (ca. 3 billion words
produced by over one million authors) ever used in computational author classification research. Here, however, we are less concerned
with classification results than with harnessing the big-data capability to inform our understanding of idiolectal
variation. By drawing up a typology of the style markers that proved instrumental in correct author classification (resulting in ‘true
positives’) and of those that classified authors erroneously (yielding ‘false positives’), we revisit the fundamental question (e.g. Coulthard
2004, McMenamin 1993) of just what kinds of idiolectal style markers have the greatest individuating potential.

References
Coulthard, M. (2004), 'Author Identification, Idiolect, and Linguistic Uniqueness', Applied Linguistics 25(4), pp. 431–447.
McMenamin, G. (1993), Forensic Stylistics, Amsterdam: Elsevier.

Original language	English
Publication status	Published - 2019
Event	14th Biennial Conference of the International Association of Forensic Linguists - Duration: 1 Jul 2019 → 5 Jul 2019

Conference

Conference	14th Biennial Conference of the International Association of Forensic Linguists
Period	1/07/19 → 5/07/19

Keywords

forensic linguistics
forensic authorship analysis

