I compared the performance of R and Python in 2017, when we were developing quanteda v1.0, and confirmed that our package’s execution time was around 50% shorter and its peak memory consumption around 40% smaller than gensim’s.
Two years on, we are developing quanteda v2.0, which will be released early next year. We are improving the package’s usability, stability and flexibility by upgrading the storage of document-level variables (a.k.a. docvars) and by adding new options to functions written in C++ such as tokens_select() and tokens_compound(). This led to a redesign of quanteda’s objects (corpus, tokens and DFM) and of the functions that operate on them.
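To give a flavour of what these functions do, here is a minimal sketch; the pattern and the window value are my own illustrative choices, not necessarily the specific options added in v2.0:

```r
library(quanteda)

toks <- tokens("The quick brown fox jumps over the lazy dog")

# keep matched tokens along with one neighbouring token on each side
tokens_select(toks, pattern = "fox", selection = "keep", window = 1)

# merge a multi-word pattern into a single compound token
tokens_compound(toks, pattern = phrase("brown fox"))
```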
I felt it was time to run the benchmark again to make sure that the improvements in usability, stability and flexibility in v2.0 do not undermine its performance advantage. The benchmark is the same as in 2017 (the quanteda side of the pipeline is sketched after the list):
- loading a corpus of 117,942 news articles (96 million tokens);
- tokenizing on whitespace;
- removing function words (stop words); and
- constructing a document-feature matrix.
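For reference, the quanteda side of this pipeline looks roughly like the following. The file path is a placeholder, and `what = "fastestword"` is my assumption for whitespace-only tokenization:

```r
library(quanteda)

# placeholder path; the actual data is the 117,942-article news corpus
dat <- readRDS("data_corpus_news.rds")
corp <- corpus(dat)

# "fastestword" splits tokens on whitespace only
toks <- tokens(corp, what = "fastestword")

# remove function words (stop words)
toks <- tokens_remove(toks, stopwords("en"))

# construct the document-feature matrix
dfmat <- dfm(toks)
```

Note that tokens_remove() is simply a shortcut for tokens_select() with selection = "remove", so this exercises the same C++ code discussed above.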
The results are roughly the same as last time. My current machine (an Intel Core i7 Linux desktop) is faster than the one I used in 2017, so the overall execution time is around 30% shorter in both Python and R, but quanteda remains twice as fast as gensim. However, memory usage increased only in Python, reaching 10GB for a 500MB corpus, so R’s memory usage fell from 60% to 50% of Python’s.