Pros/Cons of stop word removal?

What are the pros and cons of removing stop words from text in the context of a text classification problem? I'm wondering what the best approach is (i.e. to remove or not to remove). I've read somewhere (but can't locate the reference) that removing stop words may be detrimental to a model's performance in the case of sentiment analysis.

asked Apr 30, 2018 at 17:14 Jimmy Collins

One pro is that it helps the model focus on the important root words rather than on very common, frequently used words.

Commented Apr 30, 2018 at 17:40

2 Answers


In the context of sentiment analysis, removing stop words can be problematic if context is affected. For example, suppose your stop word list includes ‘not’, a negation that can flip the valence of a passage. So you have to be cautious about exactly what is being dropped and what consequences it can have.
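To make the negation problem concrete, here is a minimal sketch in plain Python. The stop list, the `remove_stop_words` helper, and the two example reviews are all hypothetical and made up for illustration; real stop lists differ from library to library, so check whether yours contains negations.

```python
# Hypothetical stop list for illustration only; real lists vary by library.
STOP_WORDS = {"this", "is", "the", "a", "not"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, and drop tokens found in stop_words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


positive = "This movie is good"
negative = "This movie is not good"

# With 'not' in the stop list, both reviews reduce to the same tokens,
# so a classifier can no longer tell them apart.
collapsed_pos = remove_stop_words(positive)
collapsed_neg = remove_stop_words(negative)

# Taking negations out of the stop list preserves the difference.
safe_stop_words = STOP_WORDS - {"not"}
kept_pos = remove_stop_words(positive, safe_stop_words)
kept_neg = remove_stop_words(negative, safe_stop_words)

print(collapsed_pos, collapsed_neg)
print(kept_pos, kept_neg)
```

One option, if you do remove stop words for sentiment work, is to start from a standard list and subtract negations such as "not", "no", and "never" before filtering.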

answered Apr 30, 2018 at 19:03

If you are using bag-of-words methods, e.g. CountVectorizer or TF-IDF, which work on word counts and frequencies, removing stop words is great: it lowers the dimensionality of the feature space, and a handful of very frequent stop words won't drive your analysis. On the other hand, when you are exploiting the semantics of the given text, say in a seq2seq model, removing stop words strips out context and you can end up with ambiguous results.
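The dimensionality point can be sketched with a tiny hand-rolled bag of words. This is a plain-Python stand-in, not CountVectorizer itself; the toy corpus, the stop list, and the `bag_of_words` helper are assumptions made up for illustration.

```python
from collections import Counter

# Toy corpus and hypothetical stop list, both made up for illustration.
DOCS = [
    "the plot of the film is thin",
    "the acting in the film is superb",
    "a superb plot and a superb cast",
]
STOP_WORDS = {"the", "of", "is", "in", "a", "and"}


def bag_of_words(docs, stop_words=frozenset()):
    """Return (sorted vocabulary, one count vector per document)."""
    tokenized = [
        [tok for tok in doc.lower().split() if tok not in stop_words]
        for doc in docs
    ]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # One column per vocabulary term, in vocabulary order.
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


full_vocab, full_vectors = bag_of_words(DOCS)
small_vocab, small_vectors = bag_of_words(DOCS, STOP_WORDS)

# The feature space shrinks once the stop words are filtered out.
print(len(full_vocab), len(small_vocab))
print(small_vocab)
```

On this toy corpus the vocabulary, and therefore the number of feature columns, halves once the stop words are dropped, while the content words that separate the documents are all kept.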

answered May 1, 2018 at 2:11 Vivek Khetan
