I made it possible to use pre-trained word vectors for Latent Semantic Scaling (LSS) in version 0.9 of the LSX package, but I don’t think I explained how to do it. It can be done easily with the as.textmodel_lss() function, but you need to load the word vectors into R as a dense matrix beforehand.
For example, pre-trained GloVe word vectors are provided in a text file in which values are separated by single spaces and nothing is quoted. To load such a file with the read.table() function, you have to set several arguments correctly; in particular, quote = "" and comment.char = "" stop R from treating quotation marks and "#" in the tokens as special characters. You also have to transpose the loaded matrix so that the word vectors are stored along its columns.
require(LSX)
mt <- read.table("glove.6B/glove.6B.200d.txt", quote = "", sep = " ", fill = FALSE,
                 comment.char = "", row.names = 1, fileEncoding = "UTF-8")
colnames(mt) <- NULL
mt <- mt[stringi::stri_detect_regex(rownames(mt), "[a-zA-Z]"),] # exclude numbers and punctuations
mt <- t(mt) # transpose
seed <- as.seedwords(data_dictionary_sentiment)
lss <- as.textmodel_lss(mt, seed) # create LSS object
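To see what the read.table() arguments do without downloading the full file, here is a minimal sketch on a made-up four-line file in the same layout. The tokens, the numbers and the names tmp, toy and m are all assumptions for illustration, and base R's grepl() stands in for the stringi call.

```r
# Toy file in the GloVe text layout: a token followed by space-separated values.
# Tokens and numbers are invented for illustration.
tmp <- tempfile(fileext = ".txt")
writeLines(c('good 0.1 0.2 0.3',
             '# 0.4 0.5 0.6',
             '" 0.7 0.8 0.9',
             'bad -0.1 -0.2 -0.3'), tmp)

# quote = "" and comment.char = "" keep the '"' and '#' tokens as ordinary rows
toy <- read.table(tmp, quote = "", sep = " ", fill = FALSE,
                  comment.char = "", row.names = 1, fileEncoding = "UTF-8")
print(rownames(toy))

# keep only tokens that contain a letter, then transpose
toy <- toy[grepl("[a-zA-Z]", rownames(toy)), ]
m <- t(toy) # words are now along the columns
print(dim(m))
print(colnames(m))
```

With the default quote and comment.char settings, the two punctuation tokens would be read as a quote delimiter and a comment marker rather than as row names, which is why they are switched off here.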
Once an LSS object is created, you can check the polarity of words using coef(). Since the word vectors were trained on a large corpus (Wikipedia and Gigaword, 6 billion tokens), the vocabulary is diverse and the estimated scores are very intuitive.
> head(coef(lss), 20)
  excellent  excellence  impressive   versatile versatility     rapport      superb     optimum
  0.2401436   0.1777782   0.1712947   0.1697045   0.1696126   0.1687611   0.1648426   0.1575858
 reasonably    achieved   wonderful    enjoying    terrific       enjoy     achieve     enjoyed
  0.1561745   0.1561181   0.1557793   0.1547977   0.1537846   0.1533817   0.1524274   0.1523841
    elegant world-class sustainable   confident
  0.1517054   0.1508276   0.1502961   0.1481712
> tail(coef(lss), 20)
      racist       doings   needlessly misogynistic   infighting       blamed       sexist     plaguing
  -0.1938094   -0.1939329   -0.1946107   -0.1954958   -0.1967688   -0.1969247   -0.1980838   -0.1992275
     hateful     shameful    heartless stereotyping      brutish         vile         ugly exacerbating
  -0.2031186   -0.2033597   -0.2043213   -0.2052691   -0.2076061   -0.2087724   -0.2130335   -0.2169766
     vicious      blaming  exacerbated        nasty
  -0.2196787   -0.2196897   -0.2216914   -0.2366194
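Because coef() returns a named numeric vector, you can also look up particular words by name instead of scanning the head and tail. The sketch below uses a small stand-in vector (beta, with four scores copied from the output above) so that it runs without the GloVe file; the query words are hypothetical.

```r
# Stand-in for coef(lss): four scores copied from the output above
beta <- c(excellent = 0.2401436, superb = 0.1648426,
          ugly = -0.2130335, nasty = -0.2366194)

# Hypothetical query; "notaword" is deliberately missing from the vocabulary
query <- c("superb", "ugly", "notaword")

# intersect() drops words that are not in the vector, avoiding NA entries
found <- beta[intersect(query, names(beta))]
print(found)
```

With the real model, replacing beta with coef(lss) should work the same way.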