One of the advantages of Latent Semantic Scaling (LSS) is that it can compute polarity scores of very short documents. It achieves this by assigning polarity scores to all the words in the corpus and then computing the polarity score of each document from the polarity scores of its words, weighted by their frequency. However, this algorithm allows the polarity scores of very short documents to be determined by only a few words, producing very large positive or negative values. These outliers can undermine the quality of the entire analysis.
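To illustrate the problem, here is a minimal sketch of the scoring rule as I understand it (not the package's internal code, and the word scores are hypothetical): a document's score is the frequency-weighted mean of the polarity scores of its words, so a single strong word dominates a one-word document.

word_score <- c("good" = 0.5, "bad" = -0.5, "market" = 0.1)  # hypothetical word scores
long_doc  <- c("good" = 1, "bad" = 1, "market" = 4)  # 6 tokens
short_doc <- c("good" = 1)                           # 1 token
sum(long_doc * word_score[names(long_doc)]) / sum(long_doc)    # 0.067, moderate
sum(short_doc * word_score[names(short_doc)]) / sum(short_doc) # 0.5, an outlier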
To make the polarity scores of short documents more stable, I added the min_n argument to predict() in the LSX package v1.1.1. When min_n is greater than zero, all documents are treated as if they were at least that long, so the polarity scores of shorter documents become smaller in absolute terms. The example below shows how the distribution of polarity scores changes when the new argument is used.
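A sketch of the adjustment, assuming the denominator of the score becomes the larger of the document length and min_n (the exact internal formula may differ): the scores of documents shorter than min_n are pulled toward zero, while longer documents are unaffected.

score_short <- 0.5           # weighted sum of word scores for a one-word document
score_short / max(1, 0)      # min_n = 0: score stays at 0.5
score_short / max(1, 6)      # min_n = 6: score shrinks to about 0.083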
I created the example by extending the code on the package's GitHub page, so dfmt_sent is the document-feature matrix of sentences from news articles and lss is the model fitted with sentiment seed words. The lengths of the documents vary between 0 and 265 words (tokens), with a quarter of them containing six or fewer. If sentiment scores are predicted with min_n = 0, the lengths of the documents with the largest positive or negative scores are between 1 and 20 (left); with min_n = 6, they are between 10 and 20 (right). I find the latter distribution more natural because very short documents do not usually convey strong messages.
quantile(ntoken(dfmt_sent))
# 0% 25% 50% 75% 100%
# 0 6 11 16 265
pred_raw <- predict(lss, dfmt_sent, rescaling = FALSE, min_n = 0) # raw
pred_rbs <- predict(lss, dfmt_sent, rescaling = FALSE, min_n = 6) # robust
par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))
plot(ntoken(dfmt_sent), pred_raw, ylab = "Sentiment", xlab = "Length",
main = "Raw", ylim = c(-0.15, 0.15), xlim = c(0, 100))
plot(ntoken(dfmt_sent), pred_rbs, ylab = "Sentiment", xlab = "Length",
main = "Robust", ylim = c(-0.15, 0.15), xlim = c(0, 100))
I encourage you to use robust polarity scores for other types of short documents, such as social media posts. The key to computing robust scores is choosing a reasonable value for min_n. I cannot yet offer a scientific method to pick the value, but I think the first quartile (25%) of the document lengths is a good candidate, as in the snippet below.
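For example, the first-quartile rule can be written directly with quantile() and ntoken(), reusing the objects from the code above; in this corpus it yields min_n = 6.

min_n <- unname(quantile(ntoken(dfmt_sent), 0.25))  # first quartile of lengths
pred_rbs <- predict(lss, dfmt_sent, rescaling = FALSE, min_n = min_n)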