The most important change in quanteda v4.0 is the introduction of the external pointer-based tokens object, called tokens_xptr, which allows us to write efficient custom functions. In earlier versions of the package, modifying tokens with tokens_*() functions required long execution times and a large amount of memory because the data were transferred between R and C++ on every call. Such inefficiency is negligible when objects are only a few megabytes, but not when they are a few gigabytes. I decided to create the tokens_xptr object because I wanted users of quanteda to be able to analyze very large corpora with elaborate pre-processing.
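As a minimal illustration of the idea (a sketch using the data_corpus_inaugural corpus that ships with quanteda; the object names are only for demonstration), a classic tokens object can be converted to a tokens_xptr object and back:
require(quanteda)
# Convert a list-based tokens object to its external pointer-based counterpart
# and back; with a tokens_xptr, the data are stored in C++ and R holds only a pointer.
toks <- tokens(data_corpus_inaugural)
xtoks <- as.tokens_xptr(toks)
toks2 <- as.tokens(xtoks)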
I created a small function, clean(), to show the overhead caused by the conversion between R and C++. In this function, the input object x is a classic list-based tokens object, and four tokens_*() functions are used to modify it. Every time a tokens_*() function is executed, the object is converted from a list in R to a vector in C++ (and vice versa).
clean <- function(x) {
    x <- x |>
        tokens_remove(stopwords("en"), padding = TRUE) |>
        tokens_select("\\p{L}", valuetype = "regex", min_nchar = 2, padding = TRUE) |>
        tokens_remove(c("http:*", "https:*", "*.com", "*.org", "*.net"), padding = TRUE) |>
        tokens_compound(newsmap::data_dictionary_newsmap_en)
    return(x)
}
We can make the function more efficient by using the tokens_xptr object, as in clean_new(). In this function, as.tokens_xptr() is applied to x at the beginning to convert the tokens object to a tokens_xptr object, and as.tokens() is applied at the end to convert the tokens_xptr object back to a tokens object. In this way, we can eliminate the overhead caused by the conversion between R and C++ in tokens_*() while keeping the input and output of the function the same.
clean_new <- function(x) {
    x <- as.tokens_xptr(x) |>
        tokens_remove(stopwords("en"), padding = TRUE) |>
        tokens_select("\\p{L}", valuetype = "regex", min_nchar = 2, padding = TRUE) |>
        tokens_remove(c("http:*", "https:*", "*.com", "*.org", "*.net"), padding = TRUE) |>
        tokens_compound(newsmap::data_dictionary_newsmap_en)
    return(as.tokens(x))
}
I measured how much tokens_xptr improves efficiency using a large corpus of English news articles that contains 80 million tokens in 3.5 million sentences (1.39 GB). Using bench::mark(), I compared clean() and clean_new() in terms of execution time and memory usage on a cloud computer (Azure Standard_D16as_v4).
The results show that the function with the tokens_xptr object (“new”) is twice as fast and three times more memory efficient than the function with only the classic tokens object (“old”). These are substantial performance gains.
> require(quanteda)
> toks <- readRDS("tokens.rds")
> ndoc(toks)
[1] 3556249
> sum(ntoken(toks))
[1] 80235389
> print(object.size(toks), unit = "MB")
1394.8 Mb
> bench::mark(
+ old = clean(toks),
+ new = clean_new(toks),
+ iterations = 10
+ )
# A tibble: 2 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <dbl> <dbl> <dbl> <bch:byt> <dbl> <int> <dbl> <dbl>
1 old 86.6 93.5 0.0104 3.94GB 0.0292 10 28 959.
2 new 43.5 44.8 0.0207 1.29GB 0.0124 10 6 483
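The news corpus used above is not distributed with the package, so the exact figures cannot be reproduced, but a sketch of the same comparison can be run on the inaugural speeches corpus that ships with quanteda (the gains will be far smaller on such a tiny object):
require(quanteda)
# Small-scale sketch of the same benchmark on a built-in corpus; clean() and
# clean_new() are the functions defined above, and bench::mark() also checks
# that the two return equal results.
toks_small <- tokens(corpus_reshape(data_corpus_inaugural, to = "sentences"))
bench::mark(
    old = clean(toks_small),
    new = clean_new(toks_small),
    iterations = 10
)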
The tokens_xptr object can be used in R’s global environment, but users must always be aware that tokens_*() functions return the address of the data, not the underlying data itself. To avoid mistakes, it is best to modify the tokens_xptr object within custom functions and to apply as.tokens() before returning it to the global environment.
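To make the pitfall concrete, here is a sketch of the behaviour described above (the object names are only for illustration):
require(quanteda)
# Because tokens_*() returns a pointer to the same underlying data, the
# "original" tokens_xptr object is affected as well.
xtoks <- as.tokens_xptr(tokens(data_corpus_inaugural))
xtoks_nostop <- tokens_remove(xtoks, stopwords("en"))
# xtoks and xtoks_nostop now refer to the same data, so the stopwords are
# gone from both; convert with as.tokens() before returning the result to
# the global environment if a regular copy is needed.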
I hope users of quanteda find the tokens_xptr object useful for creating custom functions that process very large corpora. The object can also be used to develop other R packages for text analysis.