Biographical Semi-Supervised Relation Extraction Dataset

Alistair Plum; Tharindu Ranasinghe; Spencer Jones; Constantin Orasan; Ruslan Mitkov

doi:10.1145/3477495.3531742

Research output: Chapter in Book/Published conference output › Conference publication

Abstract

Extracting biographical information from online documents is a popular research topic among the information extraction (IE) community. Various natural language processing (NLP) techniques such as text classification, text summarisation and relation extraction (RE) are commonly used to achieve this. Among these techniques, RE is the most common since it can be directly used to build biographical knowledge graphs. RE is usually framed as a supervised machine learning (ML) problem, where ML models are trained on annotated datasets. However, there are few annotated datasets for RE since the annotation process can be costly and time-consuming. To address this, we developed Biographical, the first semi-supervised dataset for RE. The dataset, which is aimed towards digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata. By exploiting the structure of Wikipedia articles and robust named entity recognition (NER), we match information with relatively high precision in order to compile annotated relation pairs for ten different relations that are important in the DH domain. Furthermore, we demonstrate the effectiveness of the dataset by training a state-of-the-art neural model to classify relation pairs, and evaluate it on a manually annotated gold standard set. Biographical is primarily aimed at training neural models for RE within the domain of digital humanities and history, but as we discuss at the end of this paper, it can be useful for other purposes as well.
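The alignment idea described in the abstract, matching sentences against structured records to obtain labelled relation pairs, can be illustrated with a minimal sketch. This is not the authors' pipeline: plain case-insensitive substring matching stands in for the NER step, and the `Fact` class, the `align` function, the `birthplace` label and the sample sentences are all hypothetical names and data chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One structured record (hypothetical shape, for illustration only)."""
    subject: str   # person the fact is about
    relation: str  # relation label, e.g. "birthplace" (assumed label)
    value: str     # object of the relation


def align(sentences, facts):
    """Label a sentence with a fact's relation when it mentions both ends.

    Substring matching is an assumption standing in for real NER.
    Returns a list of (sentence, subject, value, relation) tuples.
    """
    pairs = []
    for sentence in sentences:
        lowered = sentence.lower()
        for fact in facts:
            has_subject = fact.subject.lower() in lowered
            has_value = fact.value.lower() in lowered
            if has_subject and has_value:
                pairs.append((sentence, fact.subject, fact.value, fact.relation))
    return pairs


# Toy input: only the first sentence mentions both the subject and the value.
sentences = [
    "Ada Lovelace was born in London.",
    "Ada Lovelace worked with Charles Babbage.",
    "London is a city on the Thames.",
]
facts = [Fact("Ada Lovelace", "birthplace", "London")]

labelled = align(sentences, facts)
for sentence, subject, value, relation in labelled:
    print(f"{relation}: ({subject}, {value}) <- {sentence}")
```

Requiring both entities in the same sentence favours precision over recall, and a sentence can still mention both without expressing the relation, which is why a manually annotated gold standard remains useful for evaluation.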

Original language	English
Title of host publication	SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher	ACM
Pages	3121-3130
ISBN (Electronic)	9781450387323
DOIs	https://doi.org/10.1145/3477495.3531742
Publication status	Published - 7 Jul 2022
Event	45th International ACM SIGIR Conference on Research and Development in Information Retrieval - Madrid, Spain. Duration: 11 Jul 2022 → 15 Jul 2022

Conference

Conference	45th International ACM SIGIR Conference on Research and Development in Information Retrieval
Abbreviated title	SIGIR '22
Country/Territory	Spain
City	Madrid
Period	11/07/22 → 15/07/22

Bibliographical note

Copyright © 2022, Association for Computing Machinery. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Keywords

biographical information extraction
relation extraction
transformers

Access to Document

10.1145/3477495.3531742

Plum et al_Biographical Semi Supervised Relation Extraction Dataset ArXiv AAM
Accepted author manuscript, 615 KB. Licence: CC BY-NC 4.0

Cite this

@inproceedings{9da0836b1d244a1fbc9f9b2ad9242881,

title = "Biographical Semi-Supervised Relation Extraction Dataset",

abstract = "Extracting biographical information from online documents is a popular research topic among the information extraction (IE) community. Various natural language processing (NLP) techniques such as text classification, text summarisation and relation extraction (RE) are commonly used to achieve this. Among these techniques, RE is the most common since it can be directly used to build biographical knowledge graphs. RE is usually framed as a supervised machine learning (ML) problem, where ML models are trained on annotated datasets. However, there are few annotated datasets for RE since the annotation process can be costly and time-consuming. To address this, we developed Biographical, the first semi-supervised dataset for RE. The dataset, which is aimed towards digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata. By exploiting the structure of Wikipedia articles and robust named entity recognition (NER), we match information with relatively high precision in order to compile annotated relation pairs for ten different relations that are important in the DH domain. Furthermore, we demonstrate the effectiveness of the dataset by training a state-of-the-art neural model to classify relation pairs, and evaluate it on a manually annotated gold standard set. Biographical is primarily aimed at training neural models for RE within the domain of digital humanities and history, but as we discuss at the end of this paper, it can be useful for other purposes as well.",

keywords = "biographical information extraction, relation extraction, transformers",

author = "Alistair Plum and Tharindu Ranasinghe and Spencer Jones and Constantin Orasan and Ruslan Mitkov",

note = "Copyright {\textcopyright} 2022, Association for Computing Machinery. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.; 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '22 ; Conference date: 11-07-2022 Through 15-07-2022",

year = "2022",

month = jul,

day = "7",

doi = "10.1145/3477495.3531742",

language = "English",

pages = "3121--3130",

booktitle = "SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval",

publisher = "ACM",

address = "United States",

}

Plum, A, Ranasinghe, T, Jones, S, Orasan, C & Mitkov, R 2022, Biographical Semi-Supervised Relation Extraction Dataset. in SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, pp. 3121-3130, 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11/07/22. https://doi.org/10.1145/3477495.3531742

TY - GEN

T1 - Biographical Semi-Supervised Relation Extraction Dataset

AU - Plum, Alistair

AU - Ranasinghe, Tharindu

AU - Jones, Spencer

AU - Orasan, Constantin

AU - Mitkov, Ruslan

N1 - Copyright © 2022, Association for Computing Machinery. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

PY - 2022/7/7

Y1 - 2022/7/7

AB - Extracting biographical information from online documents is a popular research topic among the information extraction (IE) community. Various natural language processing (NLP) techniques such as text classification, text summarisation and relation extraction (RE) are commonly used to achieve this. Among these techniques, RE is the most common since it can be directly used to build biographical knowledge graphs. RE is usually framed as a supervised machine learning (ML) problem, where ML models are trained on annotated datasets. However, there are few annotated datasets for RE since the annotation process can be costly and time-consuming. To address this, we developed Biographical, the first semi-supervised dataset for RE. The dataset, which is aimed towards digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata. By exploiting the structure of Wikipedia articles and robust named entity recognition (NER), we match information with relatively high precision in order to compile annotated relation pairs for ten different relations that are important in the DH domain. Furthermore, we demonstrate the effectiveness of the dataset by training a state-of-the-art neural model to classify relation pairs, and evaluate it on a manually annotated gold standard set. Biographical is primarily aimed at training neural models for RE within the domain of digital humanities and history, but as we discuss at the end of this paper, it can be useful for other purposes as well.

KW - biographical information extraction

KW - relation extraction

KW - transformers

UR - https://dl.acm.org/doi/10.1145/3477495.3531742

UR - https://arxiv.org/abs/2205.00806

UR - http://www.scopus.com/inward/record.url?scp=85130503173&partnerID=8YFLogxK

U2 - 10.1145/3477495.3531742

DO - 10.1145/3477495.3531742

M3 - Conference publication

SP - 3121

EP - 3130

BT - SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

PB - ACM

T2 - 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

Y2 - 11 July 2022 through 15 July 2022

ER -
