Why is quanteda so fast?

Those who read my recent post on quanteda’s performance might wonder why the package is so fast. It is not only because we carefully wrote the package’s R code, but also because we optimized its internal functions and objects for large textual data. Three design features of quanteda dramatically enhance its performance.

Upfront data serialization

Data serialization means that tokens are stored as integers, not as characters, in quanteda. You can see the token IDs corresponding to the types attribute if you apply unclass() to a tokens object:

> require(quanteda)
> toks <- tokens(c("This is document A", "That is document B"))
> print(toks)
tokens from 2 documents.
text1 :
[1] "This"     "is"       "document" "A"       
text2 :
[1] "That"     "is"       "document" "B"       

> unclass(toks)
$text1
[1] 1 2 3 4
$text2
[1] 5 2 3 6
attr(,"types")
[1] "This"     "is"       "document" "A"        "That"     "B" 

Serialization of tokens makes the object 80 to 90% smaller than the original. For example, a corpus of 117,942 news articles (96 million tokens; 556 thousand unique types) shrinks from 3,168 MB to 436 MB in RAM through this compression. Data serialization reduces not only memory use but also execution time, because operations on integers are much faster than operations on characters.
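You can get a rough sense of this on your own machine by comparing a tokens object with the same tokens stored as a plain character list. This is only an illustration with a small built-in corpus; the exact figures depend on the corpus and the quanteda version, and the savings grow with corpus size:

toks_int <- tokens(data_corpus_inaugural)   # serialized (integer) tokens
toks_chr <- as.list(toks_int)               # the same tokens as character vectors
format(object.size(toks_chr), units = "Kb")
format(object.size(toks_int), units = "Kb")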

Quanteda’s tokens_* functions that work on the serialized tokens objects are written in C++ using the Intel TBB library. It is not easy to keep the IDs and types in sync in multi-threaded functions, especially when new token types are produced as a result of compounding, but we managed to do this using the library’s thread-safe objects.
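Compounding is a good example of where new types appear: at the R level you can see a new type added to the types attribute while the existing IDs remain valid (a small illustration of the effect, not the C++ internals):

toks_comp <- tokens_compound(toks, pattern = phrase("document A"))
attr(unclass(toks_comp), "types")   # now includes the new type "document_A"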

Data serialization is a very common technique, but gensim does this only after preprocessing of tokens, or just before making a document-feature matrix.

Indexed pattern-ID translation

Upfront serialization means that all character patterns (e.g. stop words, dictionary values) must be translated into token IDs before being passed to the C++ functions. You can achieve this simply by scanning the token types for the patterns using regular expressions, but we took a different approach because scanning a long character vector takes a lot of time. For example, scanning 556 thousand token types for 1,000 dictionary values means making 556 million matching attempts.
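As a point of comparison, a naive translation at the R level would look roughly like the following sketch (not quanteda’s code), with every glob pattern turned into a regular expression and tested against every token type:

types <- attr(unclass(toks), "types")
patterns <- c("doc*", "th*")                 # hypothetical glob patterns
regexes <- utils::glob2rx(patterns)          # convert globs to regular expressions
lapply(regexes, function(rx) which(grepl(rx, types, ignore.case = TRUE)))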

We developed a mechanism that indexes token types so that matching of fixed and glob patterns can be performed without scanning, making pattern-ID translation instantaneous. With the indexed token types, only 1,000 matching attempts are required. This approach is extremely efficient for dictionaries such as the Lexicoder dictionaries and the Newsmap dictionary, because the same words appear many times as elements of their multi-word patterns.
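The idea can be pictured at the R level with a hash table keyed by (lower-cased) token types, so that a fixed pattern resolves in a single lookup rather than a scan. This is only a sketch of the principle, reusing the types vector from the sketch above, not the actual C++ index:

index <- new.env(hash = TRUE)                # hash table keyed by token type
for (i in seq_along(types)) assign(tolower(types[i]), i, envir = index)
id_of <- function(pattern) {
  key <- tolower(pattern)
  if (exists(key, envir = index, inherits = FALSE)) get(key, envir = index) else integer(0)
}
id_of("document")                            # one lookup instead of a full scan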

One-pass n-gram search

Quanteda’s tokens_* functions search serialized tokens objects for the IDs translated from user-defined patterns, but such patterns can be a large set of n-grams. For example, scanning 96 million tokens for 1,000 dictionary values demands 96 billion matching attempts if the search is performed naively for each pattern, making execution time very long.

We devised a method in which scanning is required only once regardless of the number of patterns, namely matching tokens against patterns instead of matching patterns against tokens. This inverted approach takes n-element sub-vectors from a tokens object and looks their IDs up in a hash table of patterns. The functions still have to scan the tokens more than once when patterns contain n-grams for multi-word expressions, but usually no more than three or four times (i.e. for three- or four-grams). This method reduces the number of matching attempts from 96 billion to 96 million × n.
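In outline, the approach looks like the following R sketch; the real implementation is in C++ and works on hashed ID sequences, and match_ngrams is a hypothetical helper used only for illustration:

match_ngrams <- function(ids, patterns, n) {
  table <- new.env(hash = TRUE)                   # hash table of pattern ID sequences
  for (p in patterns) assign(paste(p, collapse = " "), TRUE, envir = table)
  hits <- integer(0)
  for (i in seq_len(length(ids) - n + 1)) {       # a single pass over the tokens
    key <- paste(ids[i:(i + n - 1)], collapse = " ")
    if (exists(key, envir = table, inherits = FALSE)) hits <- c(hits, i)
  }
  hits                                            # starting positions of matches
}
match_ngrams(c(1, 2, 3, 4), list(c(2, 3)), n = 2) # returns 2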

The culmination of this innovative design is very fast and near constant-time discovery of n-grams in large textual data. The following plots show that the execution time for token selection by tokens_select() is less than 2 seconds with 100 n-gram patterns (N=100) and 3 seconds with 1,000 n-gram patterns (N=1000). Most importantly, there are only small differences between these conditions. In a naive implementation, searching for 1,000 trigrams can take 30 times longer than searching for 100 unigrams, but it takes only about twice as long in quanteda.
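At the user level, multi-word selection is just a call to tokens_select() with phrase() patterns; timings on the small built-in inaugural corpus will of course be far shorter than on the news corpus used in the benchmarks above:

toks_inaug <- tokens(data_corpus_inaugural)
system.time(
  tokens_select(toks_inaug, pattern = phrase(c("United States", "American people")))
)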
