New tokens object in quanteda v4.0

I am very happy that we could release quanteda v4.0 after more than a year of development. For this release, I improved the package’s efficiency by creating a new tokens object and writing many internal functions in C++ to allow users to process millions of documents on a laptop (or tens of millions on a cloud server).

The new tokens object is called tokens_xptr. The most important difference between the old and the new tokens objects is that the new object's underlying data is kept in C++ memory space and only linked to R through an external pointer. Other high-performance R packages like data.table take this approach to avoid copying large data between R and C++. The new tokens object helps you build an efficient pipeline, but you should be careful about when and how you use tokens_xptr objects. If you use tokens_xptr objects without understanding their behavior, you could make unintended changes to your objects and produce invalid analysis results.
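As a rough illustration of what such a pipeline might look like, the sketch below creates a tokens_xptr object directly and passes it through two processing steps before building a document-feature matrix. The example texts and the object names (txt, dfmat) are made up for illustration, and I am assuming that tokens() is called with xptr = TRUE and that dfm() accepts the result.

```r
library(quanteda)

# Illustrative input (hypothetical); any character vector or corpus would do
txt <- c(d1 = "quanteda v4 is on CRAN!",
         d2 = "The new tokens object is fast.")

# xptr = TRUE asks tokens() to return a tokens_xptr object directly
xtoks <- tokens(txt, xptr = TRUE)

# each step works on the data held in C++ memory
xtoks <- tokens_remove(xtoks, stopwords("en"))
xtoks <- tokens_tolower(xtoks)

# the final result is an ordinary dfm object
dfmat <- dfm(xtoks)
print(dfmat)
```

Because every step reassigns to the same name xtoks, there is no second alias that could be changed by surprise.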

I created a small demo to explain the difference between the old and the new objects. First, I create a tokens object toks using tokens() and convert it to a tokens_xptr object xtoks using as.tokens_xptr(). By printing these objects, we can confirm that they contain the same tokens (words and symbols), but “pointer to 0x24ed163f2f8” is shown only in the output for xtoks. This message indicates that it is a tokens_xptr object linked to data saved at the address 0x24ed163f2f8 in RAM.

> require(quanteda)
> 
> toks <- tokens("quanteda v4 is on CRAN!")
> xtoks <- as.tokens_xptr(toks)
> 
> print(toks)
Tokens consisting of 1 document.
text1 :
[1] "quanteda" "v4"       "is"       "on"       "CRAN"     "!"       

> print(xtoks)
Tokens consisting of 1 document (pointer to 0x24ed163f2f8).
text1 :
[1] "quanteda" "v4"       "is"       "on"       "CRAN"     "!"

Second, I assign xtoks to xtoks2 using <-, but xtoks2 is a pointer to the same address (0x24ed163f2f8). This means that xtoks and xtoks2 are only aliases for the same underlying data (“shallow-copy”). To copy the data itself, you have to apply as.tokens_xptr() to the tokens_xptr object and assign the result (“deep-copy”). Only by doing this can we create xtoks3, which is a pointer to a different address (0x24e127064c9) from xtoks.

> 
> xtoks2 <- xtoks # shallow-copy  
>
> print(xtoks2)
Tokens consisting of 1 document (pointer to 0x24ed163f2f8).
text1 :
[1] "quanteda" "v4"       "is"       "on"       "CRAN"     "!"

>
> xtoks3 <- as.tokens_xptr(xtoks) # deep-copy
> 
> print(xtoks3)
Tokens consisting of 1 document (pointer to 0x24e127064c9).
text1 :
[1] "quanteda" "v4"       "is"       "on"       "CRAN"     "!"

Third, I modify the tokens_xptr objects using tokens_remove(). The function removes “is” and “on” from the deep-copy xtoks3 but keeps the original object xtoks intact. However, if I perform the same operation on the shallow-copy xtoks2, the function removes “is” and “on” from the original object xtoks too. This happens because xtoks and xtoks2 point to the same underlying data.

> 
> xtoks3 <- tokens_remove(xtoks3, stopwords("en")) # remove stopwords from a deep-copy
> print(xtoks3)
Tokens consisting of 1 document (pointer to 0x24e127064c9).
text1 :
[1] "quanteda" "v4"       "CRAN"     "!"       

> print(xtoks)
Tokens consisting of 1 document (pointer to 0x24ed163f2f8).
text1 :
[1] "quanteda" "v4"       "is"       "on"       "CRAN"     "!"       

> 
> xtoks2 <- tokens_remove(xtoks2, stopwords("en")) # remove stopwords from a shallow-copy
> print(xtoks2)
Tokens consisting of 1 document (pointer to 0x24ed163f2f8).
text1 :
[1] "quanteda" "v4"       "CRAN"     "!"       

> print(xtoks)
Tokens consisting of 1 document (pointer to 0x24ed163f2f8).
text1 :
[1] "quanteda" "v4"       "CRAN"     "!"
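One way to avoid this kind of surprise, sketched here under the behavior shown in the demo, is to make the deep-copy in the same call that modifies the object. The name xtoks_clean is only illustrative, and as.tokens() is used to convert both objects back to ordinary tokens objects for counting.

```r
library(quanteda)

xtoks <- as.tokens_xptr(tokens("quanteda v4 is on CRAN!"))

# deep-copy first, then modify only the copy
xtoks_clean <- tokens_remove(as.tokens_xptr(xtoks), stopwords("en"))

# convert back to ordinary tokens objects and count the tokens
n_orig <- ntoken(as.tokens(xtoks))
n_clean <- ntoken(as.tokens(xtoks_clean))
print(c(n_orig, n_clean))
```

If the original object is left untouched, n_orig should still count all six tokens, while n_clean counts only the four that remain after removing the stopwords.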

I hope you understood the main difference between tokens and tokens_xptr objects through the demo. To learn more about tokens_xptr, please see External pointer-based tokens objects on the package website.


Kohei
