Gender classification of microblog text based on authorial style

Shubhadeep Mukherjee*, Pradip Kumar Bala

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Gender profiling of unstructured text data has applications in areas such as marketing, advertising, legal investigation, and recommender systems. Automatically detecting gender in microblogs such as Twitter is a difficult task: it requires a system that can use knowledge to interpret the linguistic styles used by each gender. In this paper, we provide this knowledge by considering different sets of features that are relatively independent of the text, such as function words and part-of-speech n-grams. We test a range of feature sets using two classifiers, namely Naïve Bayes and maximum entropy. Our results show that the gender detection task benefits from the inclusion of features that capture the authorial style of microblog authors. We achieve an accuracy of approximately 71%, which outperforms the classification accuracy of commercially available gender detection software such as Gender Genie and Gender Guesser.
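To make the idea of style features concrete, the following is a minimal sketch of function-word counting fed into a multinomial Naïve Bayes classifier. It is not the authors' implementation: the function-word list, the toy sentences, the placeholder class labels "A" and "B", and all function names are assumptions chosen for illustration, and the abstract does not specify the actual feature lists, preprocessing, or data.

```python
import math
import re
from collections import Counter

# Hypothetical, abbreviated function-word list (an assumption, not the paper's list).
FUNCTION_WORDS = ["the", "a", "an", "and", "but", "of", "to", "in",
                  "i", "you", "we", "it", "not", "so"]


def function_word_counts(text):
    """Return one count per function word, ignoring all content words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    return [counts[w] for w in FUNCTION_WORDS]


def train_naive_bayes(samples, alpha=1.0):
    """Fit multinomial Naive Bayes with add-alpha smoothing.

    samples is a list of (text, label) pairs. The model maps each label
    to (log prior, list of per-feature log likelihoods).
    """
    label_counts = Counter()
    feature_totals = {}
    for text, label in samples:
        label_counts[label] += 1
        totals = feature_totals.setdefault(label, [0] * len(FUNCTION_WORDS))
        for i, c in enumerate(function_word_counts(text)):
            totals[i] += c
    n_samples = sum(label_counts.values())
    model = {}
    for label, totals in feature_totals.items():
        denom = sum(totals) + alpha * len(FUNCTION_WORDS)
        log_prior = math.log(label_counts[label] / n_samples)
        log_likelihoods = [math.log((t + alpha) / denom) for t in totals]
        model[label] = (log_prior, log_likelihoods)
    return model


def predict(model, text):
    """Return the label with the highest posterior log score."""
    vec = function_word_counts(text)
    scores = {
        label: log_prior + sum(c * lp for c, lp in zip(vec, log_likelihoods))
        for label, (log_prior, log_likelihoods) in model.items()
    }
    return max(scores, key=scores.get)


# Invented toy data with placeholder author classes "A" and "B".
TOY_SAMPLES = [
    ("i think we should go and i hope you come", "A"),
    ("you and i know we did not so i said so", "A"),
    ("the report of the committee is in the file", "B"),
    ("a copy of the file went to the office in an hour", "B"),
]

model = train_naive_bayes(TOY_SAMPLES)
```

Because only function words are counted, two texts on entirely different topics yield comparable feature vectors, which is the sense in which such features are relatively independent of the text. Part-of-speech n-grams would need a tagger and are left out of this sketch.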

Original language: English
Pages (from-to): 117-138
Number of pages: 22
Journal: Information Systems and e-Business Management
Volume: 15
Issue number: 1
Early online date: 2 Mar 2016
Publication status: Published - 1 Feb 2017

Keywords

  • Artificial intelligence
  • Business intelligence
  • Gender classification
  • Knowledge discovery
  • Natural language processing
  • Supervised learning
  • Text mining
  • Twitter
