I am very happy that we could release quanteda v4.0 after more than a year of development. For this release, I improved the package’s efficiency by creating a new tokens object and writing many internal functions in C++ to allow users to process millions of documents on a laptop (or tens of millions on a cloud server).
The new tokens object is called tokens_xptr. The most important difference between the old and the new tokens objects is that the new object's underlying data is kept in C++ memory space and only linked to R through an external pointer. Other high-performance R packages like data.table take a similar approach to avoid copying large data between R and C++. The new tokens object helps you build an efficient pipeline, but you should be careful about when and how you use tokens_xptr objects. If you use tokens_xptr objects without understanding their behavior, you could make unintended changes to your objects and produce invalid analysis results.
I created a small demo to explain the difference between the old and the new objects. First, I create a tokens object toks using tokens() and convert it to a tokens_xptr object xtoks using as.tokens_xptr(). By printing these objects, we can confirm that they contain the same tokens (words and symbols), but “pointer to 0x24ed163f2f8” is shown only in the output for xtoks. This message indicates that it is a tokens_xptr object linked to data saved at the address 0x24ed163f2f8 in RAM.
> library(quanteda)
>
> toks <- tokens("quanteda v4 is on CRAN!")
> xtoks <- as.tokens_xptr(toks)
>
> print(toks)
Tokens consisting of 1 document.
text1 :
[1] "quanteda" "v4" "is" "on" "CRAN" "!"
> print(xtoks)
Tokens consisting of 1 document (pointer to 0x24ed163f2f8).
text1 :
[1] "quanteda" "v4" "is" "on" "CRAN" "!"
Second, I assign xtoks to xtoks2 using <-, but xtoks2 is a pointer to the same address (0x24ed163f2f8). This means that xtoks and xtoks2 are only aliases for the same underlying data (“shallow-copy”). In order to copy the data, you have to apply as.tokens_xptr() to the tokens_xptr object and assign the result (“deep-copy”). Only by doing this can we create xtoks3, which is a pointer to a different address (0x24e127064c9) from xtoks.
>
> xtoks2 <- xtoks # shallow-copy
>
> print(xtoks2)
Tokens consisting of 1 document (pointer to 0x24ed163f2f8).
text1 :
[1] "quanteda" "v4" "is" "on" "CRAN" "!"
>
> xtoks3 <- as.tokens_xptr(xtoks) # deep-copy
>
> print(xtoks3)
Tokens consisting of 1 document (pointer to 0x24e127064c9).
text1 :
[1] "quanteda" "v4" "is" "on" "CRAN" "!"
Third, I modify the tokens_xptr objects using tokens_remove(). The function removes “is” and “on” from the deep-copy xtoks3 but keeps the original object xtoks intact. However, if I perform the same operation on the shallow-copy xtoks2, the function removes “is” and “on” from the original object xtoks too. This happens because xtoks and xtoks2 share the same underlying data.
>
> xtoks3 <- tokens_remove(xtoks3, stopwords("en")) # remove stopwords from a deep-copy
> print(xtoks3)
Tokens consisting of 1 document (pointer to 0x24e127064c9).
text1 :
[1] "quanteda" "v4" "CRAN" "!"
> print(xtoks)
Tokens consisting of 1 document (pointer to 0x24ed163f2f8).
text1 :
[1] "quanteda" "v4" "is" "on" "CRAN" "!"
>
> xtoks2 <- tokens_remove(xtoks2, stopwords("en")) # remove stopwords from a shallow-copy
> print(xtoks2)
Tokens consisting of 1 document (pointer to 0x24ed163f2f8).
text1 :
[1] "quanteda" "v4" "CRAN" "!"
> print(xtoks)
Tokens consisting of 1 document (pointer to 0x24ed163f2f8).
text1 :
[1] "quanteda" "v4" "CRAN" "!"
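One way to guard against this side effect is to make the deep-copy inside a function before modifying anything. The following is only a minimal sketch under that assumption: remove_stopwords_safely() is a hypothetical helper written for this post, not a quanteda function, and it uses only as.tokens_xptr(), tokens_remove() and as.tokens().

```r
library(quanteda)

# Hypothetical helper (not part of quanteda): deep-copy first, then modify,
# so that the caller's tokens_xptr object is left untouched.
remove_stopwords_safely <- function(x) {
    stopifnot(inherits(x, "tokens_xptr"))
    copy <- as.tokens_xptr(x)             # deep-copy, as in the demo above
    tokens_remove(copy, stopwords("en"))  # modifies only the copy
}

xtoks <- as.tokens_xptr(tokens("quanteda v4 is on CRAN!"))
xtoks_nostop <- remove_stopwords_safely(xtoks)

# as.tokens() converts back to ordinary tokens objects for inspection
print(as.tokens(xtoks))         # the original
print(as.tokens(xtoks_nostop))  # the modified copy
```

As in the demo, the original xtoks should still contain “is” and “on” because only the copy was modified. The cost is one extra copy of the data, so in a large pipeline you may prefer to copy once at the start rather than at every step.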
I hope the demo made the main difference between tokens and tokens_xptr objects clear. To learn more about tokens_xptr, please see “External pointer-based tokens objects” on the package website.