Programming, Text analysis

Use pre-trained GloVe word vectors in LSS

Kohei, August 10, 2023

I made it possible to use pre-trained word vectors for Latent Semantic Scaling (LSS) in version 0.9 of the LSX package, but I don’t think I have explained how to do it. It is easy with the as.textmodel_lss() function, but you need to load the word vectors into R as a dense matrix beforehand.

For example, pre-trained GloVe word vectors are provided as a text file in which values are separated by a single space and are not quoted. To load such a file with the read.table() function, you have to set several arguments correctly. You also have to transpose the loaded matrix so that the word vectors are stored along its columns.

library(LSX)

# read the GloVe file: no quoting, no comment character, words as row names
mt <- read.table("glove.6B/glove.6B.200d.txt", quote = "", sep = " ", fill = FALSE,
                 comment.char = "", row.names = 1, fileEncoding = "UTF-8")
colnames(mt) <- NULL
mt <- mt[stringi::stri_detect_regex(rownames(mt), "[a-zA-Z]"),] # keep only tokens that contain a Latin letter
mt <- t(mt) # transpose so that word vectors are stored along columns

seed <- as.seedwords(data_dictionary_sentiment) # seed words from the sentiment dictionary in LSX
lss <- as.textmodel_lss(mt, seed) # create LSS object
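The loading step can be tried without downloading the full GloVe file. The sketch below writes a tiny file in the same layout (the words and values are made up for illustration) and reads it with the same read.table() arguments. It is a variant of the code above, not a copy: it converts to a matrix with as.matrix() before filtering and uses base R's grepl() instead of stringi. The toy file includes a bare double quote and a "#" token, which is why quote = "" and comment.char = "" matter: with the defaults, read.table() would treat them as a quote character and the start of a comment.

```r
# Write a tiny GloVe-style file (made-up values) to a temporary path
path <- tempfile(fileext = ".txt")
writeLines(c(
  "good 0.1 0.2 0.3",
  "bad -0.1 -0.2 -0.3",
  "\" 0.5 0.5 0.5",    # token that is a bare double quote
  "# 0.4 0.4 0.4",     # token that is the default comment character
  "2023 0.0 0.1 0.0"   # token without any letters
), path)

# Same arguments as for the real file
df <- read.table(path, quote = "", sep = " ", fill = FALSE,
                 comment.char = "", row.names = 1, fileEncoding = "UTF-8")

mat <- as.matrix(df)                                      # dense numeric matrix, words as row names
colnames(mat) <- NULL                                     # drop V2, V3, ... labels
mat <- mat[grepl("[a-zA-Z]", rownames(mat)), , drop = FALSE]  # keep tokens with a Latin letter
mat <- t(mat)                                             # words along columns

dim(mat)       # dimensions x words
colnames(mat)  # the words that survived the filter
```

The result has one column per word and one row per dimension, which is the orientation described above for the matrix passed to as.textmodel_lss().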

Once an LSS object is created, you can check the polarity of words using coef(). Since the word vectors are trained on a large corpus (Wikipedia and Gigaword, 6 billion tokens), the words are diverse and the estimated scores are very intuitive.

> head(coef(lss), 20)
  excellent  excellence  impressive   versatile versatility     rapport      superb     optimum 
  0.2401436   0.1777782   0.1712947   0.1697045   0.1696126   0.1687611   0.1648426   0.1575858 
 reasonably    achieved   wonderful    enjoying    terrific       enjoy     achieve     enjoyed 
  0.1561745   0.1561181   0.1557793   0.1547977   0.1537846   0.1533817   0.1524274   0.1523841 
    elegant world-class sustainable   confident 
  0.1517054   0.1508276   0.1502961   0.1481712 

> tail(coef(lss), 20)
      racist       doings   needlessly misogynistic   infighting       blamed       sexist     plaguing 
  -0.1938094   -0.1939329   -0.1946107   -0.1954958   -0.1967688   -0.1969247   -0.1980838   -0.1992275 
     hateful     shameful    heartless stereotyping      brutish         vile         ugly exacerbating 
  -0.2031186   -0.2033597   -0.2043213   -0.2052691   -0.2076061   -0.2087724   -0.2130335   -0.2169766 
     vicious      blaming  exacerbated        nasty 
  -0.2196787   -0.2196897   -0.2216914   -0.2366194
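
To give a feel for where such scores come from, here is a minimal base R sketch of the general idea behind LSS: a word's polarity is its cosine similarity to the seed words, weighted by the seeds' polarity and averaged. The vectors, words and seed weights below are invented for illustration, and the code is not taken from LSX; the package's actual computation may differ in weighting and rescaling.

```r
# Toy 3-dimensional word vectors stored along columns (made-up values)
vec <- cbind(
  good      = c(0.9, 0.1, 0.0),
  bad       = c(-0.8, 0.2, 0.1),
  excellent = c(0.8, 0.2, 0.1),
  nasty     = c(-0.9, 0.1, 0.0),
  table     = c(0.0, 0.1, 0.9)
)
seed <- c(good = 1, bad = -1)  # hypothetical seed words with their polarity

# Scale every column to unit length so that cross-products are cosine similarities
unit <- sweep(vec, 2, sqrt(colSums(vec^2)), "/")
sim <- crossprod(unit[, names(seed), drop = FALSE], unit)  # seeds x words

# Weight each seed's similarities by its polarity and average over seeds
score <- colSums(sim * seed) / length(seed)
sort(score, decreasing = TRUE)
```

In this toy example, "excellent" ends up with a positive score, "nasty" with a negative one, and the unrelated word "table" stays close to zero.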
