We announced the release of quanteda version 1.0 at the London R meeting on Tuesday. I thank all the organizers and the 150+ participants. In the talk, I presented a performance comparison with other R and Python packages, but I also compared quanteda with its own earlier CRAN versions to show how the package evolved into the best-performing text analysis package in R.
In this historical benchmarking, I measured the time (in seconds) that the earlier versions take to complete three basic operations (a minimal benchmarking sketch follows the list):
- Tokenization of 6,000 newspaper articles (‘tokens’)
- Removal of English stopwords from the tokenized texts (‘remove’)
- Construction of document-feature matrix (‘dfm’)
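For readers who want to reproduce this kind of measurement, here is a minimal sketch using `system.time()`. The built-in `data_char_ukimmig2010` sample texts stand in for the 6,000 newspaper articles used in the talk, which are not included here, so the absolute timings will differ.

```r
library(quanteda)

## Sample texts standing in for the newspaper corpus used in the benchmark
txt <- data_char_ukimmig2010

## 'tokens': tokenization
system.time(toks <- tokens(txt))

## 'remove': removal of English stopwords from the tokenized texts
system.time(toks <- tokens_remove(toks, stopwords("en")))

## 'dfm': construction of the document-feature matrix
system.time(mat <- dfm(toks))
```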
The earliest versions of quanteda were fast simply because they had only limited functionality. Tokenization and document-feature matrix construction became considerably slower from v0.8.2 as more features, such as Unicode support, were implemented. There was almost no change in speed until v0.9.8.5, but token selection and document-feature matrix construction became dramatically faster in the next release. This is exactly when we introduced the upfront tokens serialization design. It only speeds up operations performed after tokenization, but execution time was halved in token selection and cut to one-seventh in document-feature matrix construction!
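To give an intuition for why serialization helps, here is a toy sketch, not quanteda's actual internals: once each token type is mapped to an integer ID, later operations such as feature removal compare integers rather than character strings.

```r
## Toy illustration of upfront tokens serialization (not quanteda's internals)
toks <- list(d1 = c("a", "quick", "brown", "fox"),
             d2 = c("the", "quick", "dog"))
types <- unique(unlist(toks))                # vocabulary table, built once
ids   <- lapply(toks, match, table = types)  # documents as integer vectors

## removing a feature is now an integer comparison, not string matching
drop <- match("quick", types)
ids  <- lapply(ids, function(v) v[v != drop])
```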
After improving performance, we worked hard on the consistency of the API and the stability of the C++ code. Apart from a regression in token selection in the version before v0.99.9, the package has remained very fast up to now. Tokenization speed went up and down during the gradual optimization for the new design, but it too is now at its fastest since v0.8.2.