Develop efficient custom functions using quanteda v4.0

The most important change in quanteda v4.0 is the new external pointer-based tokens object, tokens_xptr, which allows us to write efficient custom functions. In earlier versions of the package, modifying tokens with tokens_*() functions required long execution times and large amounts of memory because the data was transferred between R and C++ on every call. Such inefficiency is negligible when objects are only a few megabytes, but not when they are a few gigabytes. I decided to create the tokens_xptr object because I wanted users of quanteda to be able to analyze very large corpora with elaborate pre-processing.
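For readers who have not seen the new object yet, here is a minimal sketch of the round trip between the two classes using the built-in inaugural address corpus; as.tokens_xptr() and as.tokens() are the conversion functions used throughout this post.

require(quanteda)

toks <- tokens(data_corpus_inaugural)  # classic list-based tokens object in R
xtoks <- as.tokens_xptr(toks)          # external pointer to the data held in C++
class(xtoks)                           # should report a tokens_xptr object
toks2 <- as.tokens(xtoks)              # convert back to a list-based tokens object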

I created a small function clean() to show the overhead caused by the conversion between R and C++. In this function, the input object x is a classic list-based tokens object, and four tokens_*() functions are used to modify it. Every time a tokens_*() function is executed, the object is converted from a list in R to a vector in C++ (and vice versa).

clean <- function(x) {
    x <- x |>
        tokens_remove(stopwords("en"), padding = TRUE) |> 
        tokens_select("\\p{L}", valuetype = "regex", min_nchar = 2, padding = TRUE) |>
        tokens_remove(c("http:*", "https:*", "*.com", "*.org", "*.net"), padding = TRUE) |> 
        tokens_compound(newsmap::data_dictionary_newsmap_en)
    return(x)
}

We can make the function more efficient by using the tokens_xptr object, as in clean_new(). In this function, as.tokens_xptr() is applied to x at the beginning to convert the tokens object to a tokens_xptr object, and as.tokens() is applied at the end to convert the tokens_xptr object back to tokens. In this way, we eliminate the overhead caused by the conversion between R and C++ in tokens_*() while keeping the input and output of the function the same.

clean_new <- function(x) {
    x <- as.tokens_xptr(x) |>
        tokens_remove(stopwords("en"), padding = TRUE) |> 
        tokens_select("\\p{L}", valuetype = "regex", min_nchar = 2, padding = TRUE) |>
        tokens_remove(c("http:*", "https:*", "*.com", "*.org", "*.net"), padding = TRUE) |> 
        tokens_compound(newsmap::data_dictionary_newsmap_en)
    return(as.tokens(x))
}
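As a quick sanity check (a sketch that assumes the newsmap package is installed for the dictionary), the two functions should behave the same on the same input, since the only difference is the conversion to and from tokens_xptr.

toks_small <- tokens(data_corpus_inaugural)
identical(clean(toks_small), clean_new(toks_small))  # expected TRUE if the round trip preserves the object exactly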

I measured how much tokens_xptr improves efficiency using a large corpus of English news articles that contains 80 million tokens in 3.5 million sentences (1.39 GB). Using bench::mark(), I compared clean() and clean_new() in terms of execution time and memory usage on a cloud computer (Azure Standard_D16as_v4).

The result shows that the function with the tokens_xptr object (“new”) is about twice as fast and three times more memory efficient than the function with only the tokens object (“old”). These are substantial performance gains.

> require(quanteda)
> toks <- readRDS("tokens.rds")
> ndoc(toks)
[1] 3556249
> sum(ntoken(toks))
[1] 80235389
> print(object.size(toks), unit = "MB")
1394.8 Mb

> bench::mark(
+   old = clean(toks),
+   new = clean_new(toks),
+   iterations = 10
+ )
# A tibble: 2 × 13
  expression   min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <dbl>  <dbl>     <dbl> <bch:byt>    <dbl> <int> <dbl>      <dbl>
1 old         86.6   93.5    0.0104    3.94GB   0.0292    10    28       959.
2 new         43.5   44.8    0.0207    1.29GB   0.0124    10     6       483

The tokens_xptr object can also be used in R’s global environment, but users must always be aware that tokens_*() returns the address of the data, not the underlying data itself. To avoid mistakes, it is best to modify the tokens_xptr object within custom functions and apply as.tokens() before returning it to the global environment.
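A minimal sketch of the pitfall, assuming the by-reference behaviour described above: the object returned by tokens_*() points to the same underlying data as its input, so modifying one also affects the other.

xtoks <- as.tokens_xptr(tokens(data_corpus_inaugural))
xtoks2 <- tokens_remove(xtoks, stopwords("en"))  # xtoks2 points to the same data as xtoks
sum(ntoken(xtoks)) == sum(ntoken(xtoks2))        # both should reflect the removal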

I hope the users of quanteda find the tokens_xptr object useful in creating custom functions to process very large corpora. This object can also be used to develop other R packages for text analysis.
