Computing robust polarity scores using LSS

One of the advantages of Latent Semantic Scaling (LSS) is that it can compute polarity scores of very short documents. It achieves this by assigning polarity scores to all the words in the corpus and then computing the polarity score of each document as the sum of the polarity scores of its words, weighted by their frequency. However, this algorithm allows the polarity scores of very short documents to be determined by only a few words, producing very large positive or negative values. These outliers can undermine the quality of the entire analysis.

To make the polarity scores of short documents more stable, I added the min_n argument to predict() in the LSX package v1.1.1. When min_n is greater than zero, all documents are treated as if they were at least that long, so the polarity scores of shorter documents become smaller in absolute terms. The example below shows how the distribution of polarity scores changes when the new argument is used.
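To make the effect concrete, here is a minimal sketch of my reading of the mechanism: the frequency-weighted sum of word polarities is divided by the document length, and min_n raises that denominator for short documents. The word polarities and the exact normalisation used inside LSX are assumptions for illustration only.

```r
# Toy word polarities (hypothetical values, not from a fitted model)
word_polarity <- c(good = 0.8, bad = -0.9, ok = 0.1)

# A one-word document containing only "good"
freq <- c(good = 1, bad = 0, ok = 0)
len <- sum(freq)

# Without min_n, the single word dominates the score
score_raw <- sum(word_polarity * freq) / len            # 0.8

# With min_n = 6, the document is treated as at least 6 words long,
# shrinking the score of this very short document toward zero
score_robust <- sum(word_polarity * freq) / max(len, 6) # 0.8 / 6
```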

I created the example by extending the code on the package’s GitHub page, so dfmt_sent is the document-feature matrix of sentences from news articles and lss is the model fitted with sentiment seed words. The lengths of the documents vary between 0 and 265 words (tokens), with the shortest 25% containing fewer than 6. If sentiment scores are predicted with min_n = 0, the lengths of the documents with the largest positive or negative scores are between 1 and 20 (left); with min_n = 6, they are between 10 and 20 (right). I think the latter distribution is more natural because very short documents do not usually convey strong messages.
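For readers who want to reproduce something similar, a sketch of the setup might look like the following. Here corp_news is a placeholder name for a corpus of news articles, and the preprocessing and k = 300 are assumptions in the spirit of the package documentation, not the exact code behind this example.

```r
library(quanteda)
library(LSX)

# Split articles into sentences so each document is one sentence
corp_sent <- corpus_reshape(corp_news, to = "sentences")

# Tokenise and build the document-feature matrix
toks_sent <- tokens(corp_sent, remove_punct = TRUE)
dfmt_sent <- dfm(toks_sent) |>
  dfm_remove(stopwords("en"))

# Fit the LSS model using the package's generic sentiment seed words
seed <- as.seedwords(data_dictionary_sentiment)
lss <- textmodel_lss(dfmt_sent, seeds = seed, k = 300)
```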


quantile(ntoken(dfmt_sent))

#   0%  25%  50%  75% 100% 
#   0    6   11   16  265 

pred_raw <- predict(lss, dfmt_sent, rescaling = FALSE, min_n = 0) # raw
pred_rbs <- predict(lss, dfmt_sent, rescaling = FALSE, min_n = 6) # robust

par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))
plot(ntoken(dfmt_sent), pred_raw, ylab = "Sentiment", xlab = "Length",
     main = "Raw", ylim = c(-0.15, 0.15), xlim = c(0, 100))
plot(ntoken(dfmt_sent), pred_rbs, ylab = "Sentiment", xlab = "Length",
     main = "Robust", ylim = c(-0.15, 0.15), xlim = c(0, 100))

I encourage you to use robust polarity scores for other types of short documents, such as social media posts. The key to computing robust scores is choosing a reasonable value of min_n. I cannot offer a scientific method to pick the value yet, but I think the first quartile (25%) of the document lengths is a good candidate.
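Assuming the dfmt_sent and lss objects from the example above, that heuristic can be expressed in one line:

```r
# Use the first quartile of document lengths as min_n (6 in this example)
min_n_q1 <- unname(quantile(ntoken(dfmt_sent), 0.25))
pred <- predict(lss, dfmt_sent, rescaling = FALSE, min_n = min_n_q1)
```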
