I use latent semantic analysis (LSA) to extract synonyms from a large corpus of news articles. I was very happy with Gensim's LSA function, but I was not sure how to do LSA in R as well as in Python. There is an R package called lsa, but it is unsuitable for large matrices, because its underlying function svd()
calculates all the singular values. Since I usually split documents into sentences for this task, my document-feature matrix is very large and extremely sparse.
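To make the setting concrete, here is a minimal sketch of how such a sentence-level document-feature matrix can be built with quanteda (txt is a hypothetical character vector of article texts):
# build a sparse sentence-level document-feature matrix
library(quanteda)
corp <- corpus_reshape(corpus(txt), to = 'sentences')  # one document per sentence
toks <- tokens_remove(tokens(corp, remove_punct = TRUE), stopwords('en'))
x <- dfm(toks)  # very large and extremely sparse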
It is easy to write an LSA function myself, but the question is which SVD engine in R is best for this application: rsvd, irlba or RSpectra? The authors of each package claim that theirs is the fastest, but the answer seems to depend on the size of the matrix to decompose and the number of singular values asked for. rsvd seems very fast with small matrices, but it used more than 20GB of RAM on my Linux machine for a matrix created from only 1,000 news articles, while irlba and RSpectra require much less memory.
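For completeness, the analogous rsvd call (excluded from the rest of the benchmark because of its memory use) would look something like this; it is a sketch, not the exact call I ran:
# rsvd
S <- rsvd::rsvd(x, k = 300)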
I compared irlba and RSpectra in terms of their speed and accuracy using corpora of different sizes. The original corpus comprises 300K full-text New York Times news stories on politics. For this benchmarking, I randomly sampled news stories to construct sub-corpora and removed function words using quanteda. Arguments of the functions are set in the following way:
# irlba: top 300 right singular vectors of the column-centered matrix
S <- irlba::irlba(x, nv = 300, center = Matrix::colMeans(x), verbose = FALSE, right_only = TRUE, tol = 1e-5)
# RSpectra: top 300 right singular vectors (nu = 0 skips the left vectors)
S <- RSpectra::svds(x, k = 300, nu = 0, nv = 300, opts = list(tol = 1e-5))
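To go from the decomposition to word vectors for synonym extraction, the standard LSA step is to scale the right singular vectors by the singular values. A minimal sketch, with object names of my choosing:
# word embeddings: one row per feature, scaled right singular vectors
emb <- S$v %*% diag(S$d)
rownames(emb) <- colnames(x)  # label rows with the dfm's feature names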
It is straightforward to measure the speed of the SVD engines: repeatedly create sub-corpora of between 1K and 10K documents, and record the execution time. The result shows that RSpectra is roughly 5 times faster than irlba regardless of the size of the corpus.
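The timing loop can be sketched as follows, assuming dfmat_full is the sentence-level dfm of the whole corpus (for brevity, this samples sentences directly rather than news stories):
# record elapsed time for sub-corpora of 1K-10K documents
sizes <- seq(1000, 10000, by = 1000)
times <- sapply(sizes, function(n) {
  x <- quanteda::dfm_sample(dfmat_full, size = n)  # random sub-corpus
  system.time(RSpectra::svds(x, k = 300, nu = 0, nv = 300,
                             opts = list(tol = 1e-5)))["elapsed"]
})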
It is more difficult to gauge the quality of the SVD, but I did this by calculating the cosine similarity between an English verb and all other words, and counting the verb's inflected forms among the top 100 most similar words. For example, when the words most similar to ‘ask’ are extracted based on cosine similarity, I expected to find inflected forms such as ‘asked’, ‘asks’ and ‘asking’ in the top 100 if the decomposition is accurate. I cannot tell how many inflected forms should be found, but a larger number for the same word suggests higher accuracy. I used 25 common English verbs, and calculated the average number of such words (the counting step is sketched after the word list below):
word <- c('want', 'use', 'work', 'call', 'try', 'ask', 'need', 'seem',
'help', 'play', 'move', 'live', 'believe', 'happen', 'include',
'continue', 'change', 'watch', 'follow', 'stop', 'create', 'open',
'walk', 'offer', 'remember')
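The counting step can be sketched like this; emb is the word-embedding matrix from above (words as row names), and the prefix match is a deliberately crude stand-in for proper stemming:
# count inflected forms of w among its 100 nearest neighbours by cosine similarity
count_inflections <- function(w, emb, n = 100) {
  v <- emb[w, ]
  sim <- drop(emb %*% v) / (sqrt(rowSums(emb ^ 2)) * sqrt(sum(v ^ 2)))
  top <- names(sort(sim, decreasing = TRUE))[seq_len(n)]
  sum(startsWith(top, w) & top != w)  # e.g. 'asked', 'asks', 'asking'
}
mean(sapply(word, count_inflections, emb = emb))  # average over the 25 verbs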
The differences between RSpectra and irlba are not large, but the former still outperformed the latter at all the corpus sizes. It is surprising that RSpectra did not compromise its accuracy for its speed. Interestingly, the curves for both packages become flat on the right-hand side, suggesting there is no need to construct a corpus larger than 8K documents (~400K sentences) for synonym extraction tasks.
My conclusion based on this benchmarking is that RSpectra is the best choice for LSA applications in R. Nonetheless, since irlba is being actively developed to improve its performance, we should keep an eye on that package too.
Author of irlba here. As you point out, there have been many improvements to irlba in 2018. Also note that the algorithm RSpectra uses is designed for eigenvalues (it can be superb for that problem), but applied to the singular value problem it can be unstable, especially for ill-conditioned problems. See for example https://bwlewis.github.io/irlba/comparison.html.
Best,
Bryan Lewis
Thanks for your blog.
I tried both. Indeed, RSpectra is much faster. Do you know how to tune the SVD to find an optimal number of singular values (k)? Setting k = 300 was not good for my data; the predictions are bad. I would need a mechanism to tune the SVD. Please help.
There is a function called cohesion() that helps users find an optimal k (see the LSS paper for details). You can even pass a vector of indices for the components that you want to use to the slice argument. This is an area where we should explore more.
Thanks, Kohei. I didn’t find the paper. Would you please drop it here? There seems to be a new R package (LSX) with a cohesion() function. Do you mean the one from the LSX package? https://cran.r-project.org/web/packages/LSX/LSX.pdf
Is there any blog or tutorial to learn how to do it?
The paper is https://www.tandfonline.com/doi/full/10.1080/19312458.2020.1832976
If you have a fitted LSS object lss, then LSX::cohesion(lss) gives the statistic.
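For the end-to-end shape, a minimal sketch under my own assumptions (dfmat is a quanteda dfm; the seed dictionary and k are illustrative only, see the LSX documentation for the actual workflow):
library(LSX)
seed <- as.seedwords(data_dictionary_sentiment)  # example seed words shipped with LSX
lss <- textmodel_lss(dfmat, seeds = seed, k = 300)
cohesion(lss)  # statistic for choosing k or the slice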