Replicating analysis with quanteda on multi-core systems

Writing R scripts that always produce the same results is not easy, and it is harder still when we analyse textual data that requires extensive preprocessing. One of our goals in developing quanteda was to ensure the replicability of text analysis by making data preprocessing explicit and transparent. However, the package can still produce different results depending on the system environment.

It is widely recognized that analysis results depend on the versions of software packages, but less widely recognized that they can also depend on the hardware. The tokens_*() functions in quanteda use multiple CPU cores to process large corpora efficiently. For example, 8 documents are processed simultaneously on a machine with 8 cores.

Operations such as removing punctuation marks and searching for dictionary keywords happen independently in each document because they only delete or replace the integer IDs of token types. Generating n-grams and compounding tokens, however, are not independent: they assign unique IDs to new token types, and those types can also appear in other documents. Since token IDs are serial numbers of token types, they vary when documents are processed in a different order.
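The dependence of serial IDs on processing order can be illustrated with a small base R sketch. It is a simplified stand-in, not quanteda's internal code: the helper assign_type_ids() and the toy documents are made up for illustration, and IDs are simply handed out in order of first appearance.

```r
# Hypothetical helper (not part of quanteda): give each token type a serial
# integer ID in the order in which it is first seen.
assign_type_ids <- function(docs) {
  types <- character(0)
  for (doc in docs) {
    new_types <- setdiff(doc, types)  # types not seen so far, in order of appearance
    types <- c(types, new_types)
  }
  setNames(seq_along(types), types)
}

# Toy documents containing compound tokens
docs <- list(
  d1 = c("united_states", "of", "america"),
  d2 = c("new_york", "of", "america")
)

ids_forward  <- assign_type_ids(docs)       # d1 is processed first
ids_backward <- assign_type_ids(rev(docs))  # d2 is processed first

identical(names(ids_forward), names(ids_backward))  # FALSE: the order of types differs
setequal(names(ids_forward), names(ids_backward))   # TRUE: the set of types is the same
```

With several threads, the order in which documents finish is not fixed, so the same effect can appear between two runs on the same data.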

The example below shows that compounding tokens on a machine with 8 cores produces tokens objects with different token IDs. If the order of “types” in the tokens objects differs, so does the order of “features” in the DFM.

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.0.0
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.

toks <- tokens(data_corpus_inaugural)
dict <- data_dictionary_LSD2015

toks_comp1 <- tokens_compound(toks, dict)
toks_comp2 <- tokens_compound(toks, dict)

identical(types(toks_comp1), types(toks_comp2)) # FALSE when multi-threaded
#> [1] FALSE
setequal(types(toks_comp1), types(toks_comp2)) # always TRUE
#> [1] TRUE

dfmt1 <- dfm(toks_comp1)
dfmt2 <- dfm(toks_comp2)
identical(featnames(dfmt1), featnames(dfmt2))
#> [1] FALSE

My solutions to these problems are (1) to save the tokens object to disk and reuse the same object from then on, or (2) to sort the “features” of the DFM in alphabetical order. Either way, you can ensure that the objects for your analysis are always the same and that you can replicate the results. This is also important for enabling the caching function of textmodel_lss().

dfmt1 <- dfmt1[, sort(featnames(dfmt1))]
dfmt2 <- dfmt2[, sort(featnames(dfmt2))]
identical(featnames(dfmt1), featnames(dfmt2))
#> [1] TRUE
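The first option, saving the tokens object to disk and reusing it, could look roughly like the sketch below. The helper load_or_build() and the file name "toks_comp.rds" are hypothetical; saveRDS() and readRDS() are base R functions. The demonstration uses a plain character vector as a stand-in so that it runs without quanteda.

```r
# Hypothetical helper: build an object once, save it, and reuse the saved
# copy on every later call.
load_or_build <- function(path, build) {
  if (file.exists(path)) {
    return(readRDS(path))  # reuse the object stored by an earlier run
  }
  obj <- build()
  saveRDS(obj, path)
  obj
}

# Intended use with the objects from this post (not run here):
# toks_comp <- load_or_build("toks_comp.rds", function() tokens_compound(toks, dict))

# Stand-in demonstration: the builder gives a different result on each call,
# but only the first call actually runs it.
path <- tempfile(fileext = ".rds")
first  <- load_or_build(path, function() sample(letters))
second <- load_or_build(path, function() sample(letters))

identical(first, second)  # TRUE, because the second call reads the saved file
```

The saved file then has to be kept with the analysis scripts, since deleting it triggers a rebuild that may order the types differently.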

I have been thinking about sorting the features of the DFM automatically in dfm() to avoid this problem in the first place, but I haven’t done so yet. If you think I should, please leave a comment.

Update on 25/11/2021: I have addressed the above issue in a recent update. From quanteda v3.2, dfm() returns DFMs with identical column order.
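For scripts that may run under both older and newer versions, one possible approach is to sort the columns only when the installed version predates the fix. This is a sketch under assumptions: sort_features_if_needed() is a hypothetical helper, the "3.2.0" threshold is taken from the update note above, and the code relies on the object supporting colnames() and column indexing (a base R matrix stands in for a DFM here).

```r
# Hypothetical helper: sort columns alphabetically only for versions
# older than the one in which dfm() started returning a stable column order.
sort_features_if_needed <- function(x, pkg_version) {
  if (as.package_version(pkg_version) < "3.2.0") {
    x <- x[, sort(colnames(x))]
  }
  x
}

# Intended use with a DFM (not run here):
# dfmt1 <- sort_features_if_needed(dfmt1, packageVersion("quanteda"))

# Stand-in demonstration with a base R matrix
m <- matrix(1:4, nrow = 2,
            dimnames = list(c("doc1", "doc2"), c("zeta", "alpha")))

colnames(sort_features_if_needed(m, "3.1.0"))  # columns are sorted
colnames(sort_features_if_needed(m, "3.2.0"))  # columns are left as they are
```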

4 thoughts on “Replicating analysis with quanteda on multi-core systems”

Hi Kohei,
thank you for this post. The example clarifies the problem nicely, and sorting one’s DFM neatly solves it in terms of reproducibility. However, my gut feeling is that many people are not aware of this dynamic when using multi-threading in combination with tokens_compound() or any other function that changes the order of types in tokens. Perhaps quanteda could print a short FYI message to the console when running tokens_compound(), informing users that (if they are multi-threading) the order of types in the tokens object has changed and that they might want to sort their resulting DFM before proceeding further?
Best,
Johannes

Kohei says:

November 25, 2021 at 5:30 am

I have fixed the issue in quanteda v3.2. It will be submitted to CRAN soon.

Johannes says:

February 21, 2022 at 3:17 pm

Thanks for fixing it! Does that mean sorting one’s DFM is now superfluous, as dfm() does that for you?

I didn’t notice that one since I haven’t used parallel processing in quanteda. Quanteda is my favorite package. The recent update to quanteda.textmodels made it difficult for me to reproduce my ROC curves. This is how I did it in the past for a Naive Bayes model trained with the quanteda.textmodels package.

library(ROCR)  # prediction() and performance() are assumed to come from the ROCR package

nb_quanteda.prob <- predict(model = nb_quanteda, type = "probability", newdata = dfmat_data)
pred_nb_quanteda <- prediction(as.numeric(nb_quanteda.prob[, -1]), Truth)
perf_nb_quanteda <- performance(pred_nb_quanteda, "tpr", "fpr")
plot(perf_nb_quanteda, col = "darkblue", main = "Roc nb")

Best,
Mihiretu


Kohei
