Abstract
The development of powerful computing tools and the easy accessibility of large quantities of language data online have sparked renewed interest in authorship analysis across a variety of domains. However, as new computational models are put forward for authorship attribution and ever greater success rates are reported, the vast majority of studies remain silent on the nature and types of linguistic phenomena associated with idiolectal style. Meanwhile, in forensic authorship attribution, models should be explanatorily rich: the forensic linguist needs to be both certain of the validity of his/her findings and able to explain them to lay triers of fact; s/he needs to know what actually happens inside the black box.
This paper reports on the findings of a project using what is, to the best of our knowledge, the biggest corpus (ca. 3 billion words produced by over one million authors) ever used in computational author classification research. However, we are less concerned here with classification results than with harnessing this big-data capability to inform our understanding of idiolectal variation. By drawing up a typology of the style markers that proved instrumental in correct author classification (resulting in 'true positives') and those that classified authors erroneously (yielding 'false positives'), we revisit the fundamental question (e.g. Coulthard 2004; McMenamin 1993) of just what kinds of idiolectal style markers have the greatest individuating potential.
References
Coulthard, M. (2004), 'Author Identification, Idiolect, and Linguistic Uniqueness', Applied Linguistics 25(4), pp. 431–447.
McMenamin, G. (1993), Forensic Stylistics, Amsterdam: Elsevier.
| Original language | English |
| --- | --- |
| Publication status | Published - 2019 |
| Event | 14th Biennial Conference of the International Association of Forensic Linguists, 1 Jul 2019 → 5 Jul 2019 |

Conference

| Conference | 14th Biennial Conference of the International Association of Forensic Linguists |
| --- | --- |
| Period | 1/07/19 → 5/07/19 |
Keywords
- forensic linguistics
- forensic authorship analysis