A parser for news downloads

Mike Scott*

*Corresponding author for this work

    Research output: Contribution to journal › Article › peer-review

    Abstract

    This paper presents the Download Parser, a tool for handling text downloads from large online databases. Many universities have access to full-text databases that let users search the holdings, view the full text of relevant articles and, ideally, download it. In practice, however, managing such downloads is difficult because of duplication, uneven formatting standards and a lack of documentation. The tool was devised to parse downloads, clean and standardise them, identify headlines, and insert suitably marked-up headers for corpus analysis.
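    The steps named in the abstract (parse, clean up, remove duplicates, mark up headlines) can be illustrated with a minimal Python sketch. It is not the Download Parser itself: the `Document N of M` separator line, the `<headline>` tag, the rule that the first non-empty line is the headline, and the function names are all assumptions made for illustration.

```python
import hashlib
import re

# Hypothetical separator: assumes each article in the download starts with a
# line such as "Document 2 of 10". Real databases use their own conventions.
SEPARATOR = re.compile(r"^[ \t]*Document \d+ of \d+[ \t]*$", re.MULTILINE)


def normalise(text):
    """Collapse runs of whitespace and repeated blank lines."""
    cleaned = []
    for line in text.splitlines():
        line = re.sub(r"\s+", " ", line).strip()
        # Keep a blank line only if the previous kept line was not blank.
        if line or (cleaned and cleaned[-1]):
            cleaned.append(line)
    return "\n".join(cleaned).strip()


def parse_download(raw):
    """Split a raw download into cleaned, de-duplicated, marked-up articles."""
    seen = set()
    articles = []
    for chunk in SEPARATOR.split(raw):
        text = normalise(chunk)
        if not text:
            continue
        # Exact-duplicate check on the normalised, lower-cased text.
        digest = hashlib.sha1(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Assumption: the first non-empty line of an article is its headline.
        headline, _, body = text.partition("\n")
        articles.append(f"<headline>{headline}</headline>\n{body.strip()}")
    return articles


sample = (
    "Document 1 of 3\nStorm hits  coast\n\nHeavy rain fell.\n"
    "Document 2 of 3\nStorm hits coast\n\nHeavy   rain fell.\n"
    "Document 3 of 3\nBudget passed\nMPs voted.\n"
)
articles = parse_download(sample)
```

    In this sketch the second sample article differs from the first only in spacing, so it is dropped as a duplicate once the text is normalised. Near-duplicates with edited wording would need a fuzzier comparison than an exact hash.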

    Original language: English
    Pages (from-to): 1-16
    Number of pages: 16
    Journal: DELTA Documentação de Estudos em Lingüística Teórica e Aplicada
    Volume: 34
    Issue number: 1
    Publication status: Published - 1 Mar 2018

    Bibliographical note

    This content is licensed under a Creative Commons Attribution License, which permits unrestricted use and distribution, provided the original author and source are credited.

    Keywords

    • Building sub-corpora
    • Corpus clean-up
    • Duplicate texts
    • News corpus
