Since early this year, many people have asked me how to compute document (or feature) similarity in a large corpus. They reported that the function stops because of a lack of RAM:
Error in .local(x, y, ...) :
Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 92
This happened in our textstat_simil(margin = "documents")
too, because the matrix multiplication in the function produces a dense matrix with (ndoc(x) ^ 2) / 2
elements: the matrix has 5,000,000,000 cells if your corpus has 100,000 documents!
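To see the scale of the problem, here is a back-of-the-envelope calculation in R, assuming the usual 8 bytes per double in a dense matrix:

```
# number of cells in a dense similarity matrix for 100,000 documents
ndoc <- 100000
cells <- ndoc ^ 2 / 2
cells                   # 5,000,000,000 cells
cells * 8 / 1024 ^ 3    # memory at 8 bytes per cell: roughly 37 GB
```

Far more than the RAM of a typical laptop, which is why the computation fails.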
A solution to this problem is to avoid recording values smaller than a certain threshold. For example, when you study document reuse, you might be interested only in documents with cosine similarity larger than 0.9. We upgraded the document similarity computation used in textstat_simil()
and textstat_dist()
to achieve this in the latest GitHub version of quanteda. We also parallelized the computation in C++ to make it faster.
The new function is called textstat_proxy()
. It is still experimental but has two new arguments, min_proxy
and rank
, to reduce the number of recorded values and save storage. If you set min_proxy
, the function records only values larger than that threshold; if you use rank
, it records only the top-n largest values for each document (or feature).
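As a quick sketch of the two arguments (using quanteda's built-in inaugural corpus; this assumes the development version of quanteda described in this post):

```
library(quanteda)
toy <- dfm(corpus_subset(data_corpus_inaugural, Year > 2000))

# record only cosine similarities above 0.5
textstat_proxy(toy, method = "cosine", min_proxy = 0.5)

# record only the 3 largest similarity values for each document
textstat_proxy(toy, method = "cosine", rank = 3)
```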
Benchmarking on my Core i7 laptop showed that the new function is twice as fast as the old one. If either min_proxy
or rank
is used, it becomes four times faster.
library(quanteda)
mt <- readRDS("data_corpus_guardian_2016-2017.RDS") %>%
dfm(remove_punct = TRUE, remove = stopwords()) %>%
dfm_trim(min_termfreq = 10)
dim(mt)
# [1] 84599 83573 # 84599 documents
# subset the corpus because it is too large for the old function
mt_sub <- dfm_sample(mt, 10000)
dim(mt_sub)
# [1] 10000 83573 # 10000 documents
quanteda_options(threads = 8)
microbenchmark::microbenchmark(
old = textstat_simil_old(mt_sub, method = "cosine"),
new = textstat_simil(mt_sub, method = "cosine"),
new_min = textstat_proxy(mt_sub, method = "cosine", min_proxy = 0.9),
new_rank = textstat_proxy(mt_sub, method = "cosine", rank = 10),
times = 10
)
# Unit: seconds
# expr min lq mean median uq max neval
# old 22.426574 22.641949 22.788590 22.745563 22.960467 23.160844 10
# new 13.376352 13.417328 13.641411 13.638641 13.699010 14.226246 10
# new_min 4.977046 5.010795 5.119516 5.114965 5.201249 5.314574 10
# new_rank 5.303440 5.322976 5.411015 5.385124 5.482439 5.583506 10
More importantly, if min_proxy
is used, we can compute the similarity between all the 84,599 documents in the corpus without problems. It took only 15 minutes on my laptop, and the resulting object is as small as 12 MB.
new_min <- textstat_proxy(mt, method = "cosine", min_proxy = 0.9)
print(object.size(new_min), units = "Mb")
# 12.4 Mb
If you want to know which documents are similar to which, you can write a simple conversion function and run it:
matrix2list <- function(x) {
    # x is a triplet-form sparse matrix with slots @i (row indices),
    # @j (column indices) and @x (non-zero values)
    # name each value by the document (row) it belongs to
    names(x@x) <- rownames(x)[x@i + 1]
    # split the values by column, so each list element holds the
    # similarity scores for one document
    split(x@x, factor(x@j + 1, levels = seq(ncol(x)), labels = colnames(x)))
}
simil <- matrix2list(new_min)
head(simil[lengths(simil) > 1])
# $text119559
# text119554 text119559
# 0.9929825 1.0000000
#
# $text119561
# text119553 text119561
# 0.994557 1.000000
#
# $text119562
# text119557 text119562
# 0.9975438 1.0000000
#
# $text119564
# text119553 text119561 text119564
# 0.9854428 0.9908825 1.0000000
#
# $text119568
# text119555 text119568
# 0.9963637 1.0000000
#
# $text119570
# text119551 text119570
# 0.9586148 1.0000000
textstat_proxy()
has great potential, but it is still experimental because we are not sure what the best format for the resulting objects is. If you have any opinions, please post a comment on the GitHub page.
Thank you, great post!
For me, this library was very useful for finding reusable email answers.
Very useful post!
How can I get access to the data_corpus_guardian_2016-2017.RDS dataset?
I am not making it publicly available, but
quanteda.corpora::download("data_corpus_guardian")
returns 6,000 Guardian articles.