I updated LSX on CRAN yesterday, changing the version number from 1.0.0 to 1.0.2. The jump is a subtle indication of my excitement about a new feature that makes LSS more reliable: if you call textmodel_lss() with auto_weight = TRUE, it automatically optimizes the weights given to user-provided seed words.
I fitted two LSS models on a corpus of newspaper articles, with and without the automatic weighting.
require(quanteda)
require(LSX)
con <- url("https://bit.ly/2GZwLcN", "rb")
corp <- readRDS(con)
close(con)
toks <- corpus_reshape(corp, "sentences") %>%
    tokens(remove_punct = TRUE) %>%
    tokens_remove(stopwords("en")) %>%
    tokens_select("^[\\p{L}]+$", valuetype = "regex", padding = TRUE)
dfmt <- dfm(toks, remove_padding = TRUE) %>%
    dfm_trim(min_termfreq = 10)
seed <- as.seedwords(data_dictionary_sentiment)
lss <- textmodel_lss(dfmt, seed, cache = TRUE, k = 300, auto_weight = FALSE)
lss_aw <- textmodel_lss(dfmt, seed, cache = TRUE, k = 300, auto_weight = TRUE)
I define the scale using the generic sentiment seed words, with polarity scores of 1 for positive and -1 for negative words.
seed
## good nice excellent positive fortunate correct
## 1 1 1 1 1 1
## superior bad nasty poor negative unfortunate
## 1 -1 -1 -1 -1 -1
## wrong inferior
## -1 -1
Figure 1 shows how words (model terms) are scored by LSS without automatic weighting. The plot confirms that the generic sentiment words are correctly placed at the edges of the cluster of words. However, their polarity scores vary considerably, even though each of them was either 1 or -1 in the seed words. This happens because a sentiment word's polarity score is computed from its semantic proximity not only to itself but also to all the other seed words.
textplot_terms(lss, names(seed))
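To see why this happens, a term's polarity score can be sketched as a weighted sum of its cosine similarities to all the seed words. The toy example below uses made-up three-dimensional word vectors (the real model works in the 300-dimensional SVD space from k = 300); cosine_sim, emb and the vector values are illustrative assumptions, not LSX internals.

```r
# Toy sketch: a term's polarity is a weighted sum of its cosine
# similarities to every seed word (vectors here are made up).
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

emb <- list(  # hypothetical 3-d word vectors
    good = c(0.9, 0.1, 0.2),
    bad  = c(-0.8, 0.2, 0.1),
    nice = c(0.8, 0.2, 0.1)
)
seed_weight <- c(good = 1, bad = -1)  # polarities of the seed words

polarity <- function(term) {
    sum(sapply(names(seed_weight), function(s) {
        seed_weight[[s]] * cosine_sim(emb[[term]], emb[[s]])
    }))
}
polarity("nice")  # positive: "nice" is closer to "good" than to "bad"
```

Because every seed word contributes to every term's score, even a seed word's own score drifts away from its nominal 1 or -1.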
The solution to this problem is weighting the seed words, but it is very hard to determine the optimal weights by hand. The new functionality enabled by auto_weight = TRUE performs this numerical optimization automatically. I defined the optimization problem simply as minimizing the difference between the term and seed polarity scores. Figure 2 shows the result of the automatic weighting of seed words: the sentiment words now receive almost the same scores and are aligned vertically in the plot.
textplot_terms(lss_aw, names(seed))
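The weighting can be sketched as a least-squares problem: pick seed weights so that the seed words' own fitted scores match their stated polarities as closely as possible. The sketch below applies optim() to a small hypothetical seed-to-seed similarity matrix; the matrix values and the exact objective are assumptions for illustration, not the actual LSX implementation.

```r
# Hypothetical cosine similarities among three seed words.
sim <- matrix(c( 1.0,  0.8, -0.6,
                 0.8,  1.0, -0.5,
                -0.6, -0.5,  1.0),
              nrow = 3,
              dimnames = list(c("good", "nice", "bad"),
                              c("good", "nice", "bad")))
polarity <- c(good = 1, nice = 1, bad = -1)  # target scores

# Minimize the squared gap between fitted seed scores and polarities.
objective <- function(w) sum((sim %*% w - polarity)^2)

fit <- optim(polarity / 3, objective)  # start from uniform weights
round(fit$par, 2)  # optimized weight for each seed word
```

The optimizer shrinks or inflates individual weights until the seed words line up on the scale, which is exactly the vertical alignment visible in Figure 2.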
In LSS, seed words are always weighted by the inverse of the number of words to balance the impact of individual seed words. As Figure 3 shows, the generic sentiment words are therefore weighted uniformly by 1 / 7 ≈ 0.14 when auto_weight = FALSE, but they are weighted differently when auto_weight = TRUE to achieve the alignment. For example, “superior” and “wrong” received the smallest weights because of their strong polarity (see Figure 1).
lss_aw$seeds_weighted
## good nice excellent positive fortunate correct
## 0.2441235 0.2056909 0.2053157 0.2571310 0.2354177 0.2200167
## superior bad nasty poor negative unfortunate
## 0.1594720 -0.3279584 -0.1759821 -0.2119596 -0.1735716 -0.2101862
## wrong inferior
## -0.1472931 -0.2456389
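For reference, the default uniform weighting described above can be reproduced in a few lines, assuming an equal number of positive and negative seed words. The seed_polarity vector mirrors the seed words shown earlier, and the division is a sketch of the stated 1 / 7 behaviour, not a call into LSX.

```r
# Default weighting: with 7 positive and 7 negative seed words,
# each seed word gets its polarity divided by 7.
seed_polarity <- c(good = 1, nice = 1, excellent = 1, positive = 1,
                   fortunate = 1, correct = 1, superior = 1,
                   bad = -1, nasty = -1, poor = -1, negative = -1,
                   unfortunate = -1, wrong = -1, inferior = -1)
weight <- seed_polarity / (length(seed_polarity) / 2)  # 14 seeds, 7 per side
round(weight[["good")]], 2)
```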
The advantage of the automatic weighting is not that it makes plots look nicer but that it makes scoring more robust against extremely polarized or even erroneous seed words. Let's imagine a person who lacks knowledge of the English language and selects a wrong seed word: they choose “good” as a negative seed word.
seed_er <- seed
seed_er["good"] <- seed_er["good"] * -1
seed_er
## good nice excellent positive fortunate correct
## -1 1 1 1 1 1
## superior bad nasty poor negative unfortunate
## 1 -1 -1 -1 -1 -1
## wrong inferior
## -1 -1
This wrong seed word selection is disastrous in the old LSS, but it is less so in the new LSS, because the automatic weighting reduces the impact of “good” by downweighting it. This is possible because “good” is semantically more similar to the positive seed words in the corpus, so a large weight on the wrong seed word does not help LSS to achieve the alignment.
An additional benefit of the automatic weighting function is that it would make comparison of scores produced by different models easier, because it anchors the seed words more tightly to the absolute scale defined by the user.
I expect the new function in LSS will improve the accuracy of document scaling. My tests showed that it improves the correlation between LSS and manual scores, but it needs to be tested with many different corpora and seed words. Please use the automatic weighting function in your projects and tell me how it worked; you only need to set auto_weight = TRUE to do so.