Latent Semantic Scale based on Word2vec

Latent Semantic Scaling (LSS) has been used in many research projects to analyze the polarity of documents. LSS is useful in research because it assigns polarity scores (e.g., sentiment) to documents based on user-provided seed words. I was trying to improve the technique further, but this proved difficult because its algorithm is spatial. LSS relies on SVD to create word vectors, and then computes the cosine similarity between words. SVD is a widely used algorithm for matrix factorization, but its solid mathematical foundation leaves no room for changing its output.
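To make the spatial approach concrete, here is a minimal sketch in Python. It is not the LSX implementation: the two-dimensional vectors are made up, and the function names (cosine, word_polarity) and the simple averaging over seed words are assumptions chosen for illustration. It only shows the general idea that a word's polarity comes from its cosine similarity to weighted seed words.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def word_polarity(vectors, seeds):
    """Polarity of each word: mean of its seed-weighted cosine similarities.

    `seeds` maps seed words to weights (+1 for positive, -1 for negative).
    This averaging rule is an illustrative assumption.
    """
    scores = {}
    for word, vec in vectors.items():
        total = sum(weight * cosine(vec, vectors[seed])
                    for seed, weight in seeds.items())
        scores[word] = total / len(seeds)
    return scores

# Toy word vectors (invented values standing in for SVD output)
vectors = {
    "good": [1.0, 0.1],
    "bad": [-1.0, 0.2],
    "excellent": [0.9, 0.3],
    "terrible": [-0.8, 0.1],
}
seeds = {"good": 1.0, "bad": -1.0}

polarity = word_polarity(vectors, seeds)
for word, score in sorted(polarity.items(), key=lambda kv: kv[1]):
    print(f"{word}: {score:.3f}")
```

Because cosine similarity is bounded between -1 and 1, scores computed this way are bounded as well.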

The difficulty made me interested in developing a new version of LSS that employs probabilistic models. I chose word2vec because it is a widely known algorithm for training word vectors, and started developing the wordvector package. I initially based the package on the word2vec package, but it quickly diverged from the original code. The main changes are (1) the use of quanteda’s tokens as an input object to enhance efficiency and (2) the improvement of progress monitoring for greater stability.

When I started using word2vec, I was hoping that its word vectors would improve LSS instantly, but that did not happen because, in this approach, the polarity of words is still computed from the cosine similarity between word vectors. However, since word2vec is a small language model, it can predict the probability that seed words occur given context words. These probabilities can be used as polarity scores of words, from which polarity scores of documents are computed.
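The probabilistic idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the LSX or wordvector code: the vectors are invented, the full softmax over a tiny vocabulary stands in for whatever word2vec variant is actually trained, and the scoring rule (weighted sum of seed probabilities, then a plain average over the tokens of a document) is a hypothetical choice.

```python
import math

def softmax_prob(context_vec, output_vectors):
    """P(word | context) for every word, via softmax over dot products."""
    logits = {w: sum(a * b for a, b in zip(context_vec, v))
              for w, v in output_vectors.items()}
    max_logit = max(logits.values())  # subtract the max for numerical stability
    exps = {w: math.exp(x - max_logit) for w, x in logits.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

def probabilistic_polarity(input_vectors, output_vectors, seeds):
    """Polarity of a word: seed-weighted probability of seed words given it."""
    scores = {}
    for word, vec in input_vectors.items():
        probs = softmax_prob(vec, output_vectors)
        scores[word] = sum(weight * probs[seed]
                           for seed, weight in seeds.items())
    return scores

def document_score(tokens, polarity):
    """Mean polarity of the tokens that have a score (assumed aggregation)."""
    known = [polarity[t] for t in tokens if t in polarity]
    return sum(known) / len(known) if known else 0.0

# Toy input (context) and output (target) vectors, invented for illustration
input_vectors = {
    "excellent": [1.0, 0.2],
    "terrible": [-1.0, 0.2],
    "report": [0.0, 0.5],
}
output_vectors = {
    "good": [1.5, 0.0],
    "bad": [-1.5, 0.0],
    "the": [0.0, 1.0],
}
seeds = {"good": 1.0, "bad": -1.0}

prob_polarity = probabilistic_polarity(input_vectors, output_vectors, seeds)
positive_doc = document_score(["excellent", "report", "unseen"], prob_polarity)
negative_doc = document_score(["terrible", "report"], prob_polarity)
print(prob_polarity, positive_doc, negative_doc)
```

In this sketch a word scores high when the model expects positive seed words around it and low when it expects negative ones. A neutral word such as "report" ends up near zero because both seeds are equally likely in its context.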

Using the new package, I trained three types of LSS models: an SVD-based spatial model (left), a word2vec-based spatial model (middle), and a word2vec-based probabilistic model (right). I tested their accuracy in sentiment analysis using manually labelled news articles. In the plot below, the horizontal axis shows the manual scores and the vertical axis shows the LSS scores. The lines in the first two plots are bent, but the line in the last plot is almost perfectly straight. This means that the spatial models cannot distinguish degrees among very positive or very negative items, but the probabilistic model can.

If you are interested, please try word2vec-based LSS on your data using LSX v1.5.0 or later. If a quanteda tokens object is passed to textmodel_lss(), it runs word2vec internally. If spatial = FALSE, it returns a probabilistic LSS model. Its auxiliary functions such as predict() work with the model exactly as before.

UPDATE on May 4, 2026: Please also read my working paper on word2vec-based LSS.

Kohei
