On stopwords, filtering and data sparsity for sentiment analysis of Twitter

Hassan Saif; Miriam Fernández; Yulan He; Harith Alani

Computer Science Research Group

Research output: Chapter in Book/Published conference output › Conference publication

Abstract

Sentiment classification over Twitter is usually affected by the noisy nature (abbreviations, irregular forms) of tweet data. A popular procedure to reduce the noise of textual data is to remove stopwords by using pre-compiled stopword lists or more sophisticated methods for dynamic stopword identification. However, the effectiveness of removing stopwords in the context of Twitter sentiment classification has been debated in the last few years. In this paper we investigate whether removing stopwords helps or hampers the effectiveness of Twitter sentiment classification methods. To this end, we apply six different stopword identification methods to Twitter data from six different datasets and observe how removing stopwords affects two well-known supervised sentiment classification methods. We assess the impact of removing stopwords by observing fluctuations in the level of data sparsity, the size of the classifier's feature space and its classification performance. Our results show that using pre-compiled lists of stopwords negatively impacts the performance of Twitter sentiment classification approaches. On the other hand, the dynamic generation of stopword lists, by removing those infrequent terms appearing only once in the corpus, appears to be the optimal method for maintaining a high classification performance while reducing the data sparsity and substantially shrinking the feature space.
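
As a rough illustration of the dynamic approach mentioned in the abstract (dropping terms that appear only once in the corpus), the sketch below filters such singleton terms and reports the resulting vocabulary size and sparsity. It is not the authors' implementation: the whitespace tokenizer, the toy tweets, the helper names and the sparsity definition (share of zero cells in a binary document-term matrix) are all assumptions made for this example.

```python
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (assumption, for illustration only).
    return text.lower().split()


def remove_singletons(docs):
    """Drop every term that occurs exactly once in the whole corpus."""
    counts = Counter(tok for doc in docs for tok in doc)
    return [[tok for tok in doc if counts[tok] > 1] for doc in docs]


def vocabulary(docs):
    """Set of distinct terms, i.e. the feature space of a bag-of-words model."""
    return {tok for doc in docs for tok in doc}


def sparsity(docs):
    """Share of zero cells in the binary document-term matrix (assumed definition)."""
    vocab = vocabulary(docs)
    if not docs or not vocab:
        return 0.0
    nonzero = sum(len(set(doc)) for doc in docs)
    return 1.0 - nonzero / (len(docs) * len(vocab))


# Hypothetical toy corpus, not taken from the paper's datasets.
tweets = [
    "love this phone",
    "hate this phone",
    "love the new update",
    "hate the battery",
]
docs = [tokenize(t) for t in tweets]
filtered = remove_singletons(docs)

print(len(vocabulary(docs)), sparsity(docs))          # before filtering
print(len(vocabulary(filtered)), sparsity(filtered))  # after filtering
```

On this toy corpus the filter removes "new", "update" and "battery", so both the vocabulary and the share of zero cells shrink, while sentiment-bearing words such as "love" and "hate" are kept. Whether this behaviour carries over to real data is what the paper evaluates.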

Original language	English
Title of host publication	LREC 2014, Ninth International Conference on Language Resources and Evaluation. Proceedings
Editors	Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, et al
Pages	810-817
Number of pages	8
Publication status	Published - 2014
Event	9th International Conference on Language Resources and Evaluation - Reykjavik, Iceland. Duration: 26 May 2014 → 31 May 2014

Conference

Conference	9th International Conference on Language Resources and Evaluation
Abbreviated title	LREC 2014
Country/Territory	Iceland
City	Reykjavik
Period	26/05/14 → 31/05/14

Bibliographical note

The LREC 2014 Proceedings are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License

Keywords

sentiment analysis
stopwords
data sparsity

Access to Document

Final published version, 438 KB. Licence: CC BY-NC 3.0

http://www.lrec-conf.org/proceedings/lrec2014/pdf/292_Paper.pdf

Cite this

@inproceedings{6dcbeb3de7974177bfd1c1935b51afc6,

title = "On stopwords, filtering and data sparsity for sentiment analysis of Twitter",

abstract = "Sentiment classification over Twitter is usually affected by the noisy nature (abbreviations, irregular forms) of tweets data. A popular procedure to reduce the noise of textual data is to remove stopwords by using pre-compiled stopword lists or more sophisticated methods for dynamic stopword identification. However, the effectiveness of removing stopwords in the context of Twitter sentiment classification has been debated in the last few years. In this paper we investigate whether removing stopwords helps or hampers the effectiveness of Twitter sentiment classification methods. To this end, we apply six different stopword identification methods to Twitter data from six different datasets and observe how removing stopwords affects two well-known supervised sentiment classification methods. We assess the impact of removing stopwords by observing fluctuations on the level of data sparsity, the size of the classifier's feature space and its classification performance. Our results show that using pre-compiled lists of stopwords negatively impacts the performance of Twitter sentiment classification approaches. On the other hand, the dynamic generation of stopword lists, by removing those infrequent terms appearing only once in the corpus, appears to be the optimal method to maintaining a high classification performance while reducing the data sparsity and substantially shrinking the feature space",

keywords = "sentiment analysis, stopwords, data sparsity",

author = "Hassan Saif and Miriam Fern{\'a}ndez and Yulan He and Harith Alani",

note = "The LREC 2014 Proceedings are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License; 9th International Conference on Language Resources and Evaluation, LREC 2014 ; Conference date: 26-05-2014 Through 31-05-2014",

year = "2014",

language = "English",

isbn = "978-2-9517408-8-4",

pages = "810--817",

editor = "Nicoletta Calzolari and Khalid Choukri and Thierry Declerck and others",

booktitle = "LREC 2014, Ninth International Conference on Language Resources and Evaluation. Proceedings",

}

Saif, H, Fernández, M, He, Y & Alani, H 2014, On stopwords, filtering and data sparsity for sentiment analysis of Twitter. in N Calzolari, K Choukri, T Declerck & et al (eds), LREC 2014, Ninth International Conference on Language Resources and Evaluation. Proceedings. pp. 810-817, 9th International Conference on Language Resources and Evaluation, Reykjavik, Iceland, 26/05/14. <http://www.lrec-conf.org/proceedings/lrec2014/pdf/292_Paper.pdf>

On stopwords, filtering and data sparsity for sentiment analysis of Twitter. / Saif, Hassan; Fernández, Miriam; He, Yulan et al.
LREC 2014, Ninth International Conference on Language Resources and Evaluation. Proceedings. ed. / Nicoletta Calzolari; Khalid Choukri; Thierry Declerck; et al. 2014. p. 810-817.

TY - GEN

T1 - On stopwords, filtering and data sparsity for sentiment analysis of Twitter

AU - Saif, Hassan

AU - Fernández, Miriam

AU - He, Yulan

AU - Alani, Harith

N1 - The LREC 2014 Proceedings are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License

PY - 2014

Y1 - 2014

N2 - Sentiment classification over Twitter is usually affected by the noisy nature (abbreviations, irregular forms) of tweets data. A popular procedure to reduce the noise of textual data is to remove stopwords by using pre-compiled stopword lists or more sophisticated methods for dynamic stopword identification. However, the effectiveness of removing stopwords in the context of Twitter sentiment classification has been debated in the last few years. In this paper we investigate whether removing stopwords helps or hampers the effectiveness of Twitter sentiment classification methods. To this end, we apply six different stopword identification methods to Twitter data from six different datasets and observe how removing stopwords affects two well-known supervised sentiment classification methods. We assess the impact of removing stopwords by observing fluctuations on the level of data sparsity, the size of the classifier's feature space and its classification performance. Our results show that using pre-compiled lists of stopwords negatively impacts the performance of Twitter sentiment classification approaches. On the other hand, the dynamic generation of stopword lists, by removing those infrequent terms appearing only once in the corpus, appears to be the optimal method to maintaining a high classification performance while reducing the data sparsity and substantially shrinking the feature space

AB - Sentiment classification over Twitter is usually affected by the noisy nature (abbreviations, irregular forms) of tweets data. A popular procedure to reduce the noise of textual data is to remove stopwords by using pre-compiled stopword lists or more sophisticated methods for dynamic stopword identification. However, the effectiveness of removing stopwords in the context of Twitter sentiment classification has been debated in the last few years. In this paper we investigate whether removing stopwords helps or hampers the effectiveness of Twitter sentiment classification methods. To this end, we apply six different stopword identification methods to Twitter data from six different datasets and observe how removing stopwords affects two well-known supervised sentiment classification methods. We assess the impact of removing stopwords by observing fluctuations on the level of data sparsity, the size of the classifier's feature space and its classification performance. Our results show that using pre-compiled lists of stopwords negatively impacts the performance of Twitter sentiment classification approaches. On the other hand, the dynamic generation of stopword lists, by removing those infrequent terms appearing only once in the corpus, appears to be the optimal method to maintaining a high classification performance while reducing the data sparsity and substantially shrinking the feature space

KW - sentiment analysis

KW - stopwords

KW - data sparsity

M3 - Conference publication

SN - 978-2-9517408-8-4

SP - 810

EP - 817

BT - LREC 2014, Ninth International Conference on Language Resources and Evaluation. Proceedings

A2 - Calzolari, Nicoletta

A2 - Choukri, Khalid

A2 - Declerck, Thierry

A2 - et al,

T2 - 9th International Conference on Language Resources and Evaluation

Y2 - 26 May 2014 through 31 May 2014

ER -
