Weakly Supervised POS Tagging without Disambiguation

Deyu Zhou, Zhikai Zhang, Min-ling Zhang, Yulan He

Research output: Contribution to journal › Article

Abstract

Weakly supervised part-of-speech (POS) tagging aims to learn to predict the POS tag of a word in context from partially annotated data rather than fully tagged corpora. It would benefit natural language processing applications in languages for which tagged corpora are largely unavailable.
In this article, we propose a novel framework for weakly supervised POS tagging based on a dictionary of words with their possible POS tags. In the constrained error-correcting output codes (ECOC)-based approach, a unique L-bit vector is assigned to each POS tag. The set of bit vectors forms a coding matrix with values in {+1, -1}. Each column of the coding matrix specifies a dichotomy over the tag space, for which a binary classifier is learned. The training data for each binary classifier is generated as follows: a word paired with its set of possible POS tags is used as a positive training example only if the whole set of its possible tags falls into the positive half of the dichotomy specified by the column, and as a negative training example only if the whole set falls into the negative half. Given a word in context, its POS tag is predicted by concatenating the outputs of the L binary classifiers and choosing the tag whose code vector is closest under some distance measure. With the ECOC strategy, the set of all possible tags for each word is treated as a whole, without the need for disambiguation. Moreover, instead of the manual feature engineering employed in most previous POS tagging approaches, features for training and testing in the proposed framework are generated automatically using neural language modeling. The proposed framework has been evaluated on three corpora for English, Italian, and Malagasy POS tagging, achieving accuracies of 93.21%, 90.9%, and 84.5%, respectively, a significant improvement over state-of-the-art approaches.
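The training-set construction and decoding steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three-tag set, the 4-bit coding matrix, the toy dictionary entries, and the names `split_for_column` and `decode` are all hypothetical, plain word strings stand in for the neural-language-model features, and Hamming distance is assumed as the distance measure, which the abstract leaves unspecified.

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple

# Hypothetical toy coding matrix: one L-bit code vector (L = 4) per POS tag.
CODING: Dict[str, Tuple[int, ...]] = {
    "NOUN": (+1, +1, -1, -1),
    "VERB": (+1, -1, +1, -1),
    "ADJ": (-1, +1, +1, +1),
}

# Hypothetical dictionary entries: (feature placeholder, set of possible tags).
# The word string stands in for a real feature vector.
EXAMPLES: List[Tuple[str, FrozenSet[str]]] = [
    ("book", frozenset({"NOUN", "VERB"})),  # ambiguous word
    ("red", frozenset({"ADJ"})),
    ("dog", frozenset({"NOUN"})),
]


def split_for_column(
    examples: Sequence[Tuple[str, FrozenSet[str]]],
    coding: Dict[str, Tuple[int, ...]],
    column: int,
) -> Tuple[List[str], List[str]]:
    """Build the training data of the binary classifier for one column.

    An example is positive only if ALL of its candidate tags have +1 in this
    column, negative only if ALL have -1, and is skipped otherwise, so the
    candidate set is never disambiguated.
    """
    positives: List[str] = []
    negatives: List[str] = []
    for features, candidates in examples:
        bits = {coding[tag][column] for tag in candidates}
        if bits == {+1}:
            positives.append(features)
        elif bits == {-1}:
            negatives.append(features)
    return positives, negatives


def decode(outputs: Sequence[int], coding: Dict[str, Tuple[int, ...]]) -> str:
    """Return the tag whose code vector is closest to the classifier outputs.

    Hamming distance is an assumption made for this sketch.
    """
    def hamming(code: Sequence[int]) -> int:
        return sum(o != c for o, c in zip(outputs, code))

    return min(coding, key=lambda tag: hamming(coding[tag]))


if __name__ == "__main__":
    for j in range(4):
        print(j, split_for_column(EXAMPLES, CODING, j))
    print(decode((+1, -1, +1, -1), CODING))
```

In this toy setting the ambiguous word "book" contributes to column 0 (both of its candidate tags are coded +1) and to column 3 (both coded -1), but is left out of columns 1 and 2, where its candidates fall on opposite sides of the dichotomy.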
Original language: English
Article number: 35
Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Volume: 17
Issue number: 4
Early online date: 27 Jul 2018
DOIs: https://doi.org/10.1145/3214707
Publication status: E-pub ahead of print - 27 Jul 2018

Cite this

Zhou, D., Zhang, Z., Zhang, M., & He, Y. (2018). Weakly Supervised POS Tagging without Disambiguation. ACM Transactions on Asian and Low-Resource Language Information Processing, 17(4), [35]. https://doi.org/10.1145/3214707
Zhou, Deyu ; Zhang, Zhikai ; Zhang, Min-ling ; He, Yulan. / Weakly Supervised POS Tagging without Disambiguation. In: ACM Transactions on Asian and Low-Resource Language Information Processing. 2018 ; Vol. 17, No. 4.
@article{638055424a1c4e3d8b3ed7837506f5f6,
title = "Weakly Supervised POS Tagging without Disambiguation",
author = "Deyu Zhou and Zhikai Zhang and Min-ling Zhang and Yulan He",
year = "2018",
month = "7",
day = "27",
doi = "10.1145/3214707",
language = "English",
journal = "ACM Transactions on Asian and Low-Resource Language Information Processing",
volume = "17",
number = "4",
}

Zhou, D, Zhang, Z, Zhang, M & He, Y 2018, 'Weakly Supervised POS Tagging without Disambiguation', ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 17, no. 4, 35. https://doi.org/10.1145/3214707

Weakly Supervised POS Tagging without Disambiguation. / Zhou, Deyu; Zhang, Zhikai; Zhang, Min-ling; He, Yulan.

In: ACM Transactions on Asian and Low-Resource Language Information Processing, Vol. 17, No. 4, 35, 27.07.2018.

TY - JOUR

T1 - Weakly Supervised POS Tagging without Disambiguation

AU - Zhou, Deyu

AU - Zhang, Zhikai

AU - Zhang, Min-ling

AU - He, Yulan

PY - 2018/7/27

Y1 - 2018/7/27

UR - http://dl.acm.org/citation.cfm?doid=3229525.3214707

U2 - 10.1145/3214707

DO - 10.1145/3214707

M3 - Article

VL - 17

IS - 4

M1 - 35

ER -

Zhou D, Zhang Z, Zhang M, He Y. Weakly Supervised POS Tagging without Disambiguation. ACM Transactions on Asian and Low-Resource Language Information Processing. 2018 Jul 27;17(4). 35. https://doi.org/10.1145/3214707