Perform context-specific operations on tokens using a new argument

The bag-of-words approach is common in text analysis, but it struggles to distinguish between meanings of words that depend on their contexts. To address this issue, the new version of quanteda (v4.0) adds an argument that allows users to select, remove or look up tokens that occur in specific contexts.

We usually use tokens_lookup() to classify documents into topics, but this method does not work well when keywords in the dictionary are ambiguous. Ideally, the dictionary should contain phrases that produce few or no false matches, but that is sometimes very difficult to achieve. In the example below, we have three texts about North Korea, Japan and Syria, and a dictionary about weapons of mass destruction (WMD). Only the first and third documents are about WMD, but the second document is also classified as such because of “nuclear”.

> require(quanteda)
> txt <- c("North Korea's nuclear missiles", 
+          "Japan's nuclear power stations",
+          "Syria's chemical weapons")
> toks <- tokens(txt)
> print(toks)
Tokens consisting of 3 documents.
text1 :
[1] "North"    "Korea's"  "nuclear"  "missiles"

text2 :
[1] "Japan's"  "nuclear"  "power"    "stations"

text3 :
[1] "Syria's"  "chemical" "weapons" 

> 
> dict <- dictionary(list(wmd = c("nuclear", "chemical", "biological")))
> print(dict)
Dictionary object with 1 key entry.
- [wmd]:
  - nuclear, chemical, biological

> toks_dict1 <- tokens_lookup(toks, dict, append_key = TRUE)
> print(toks_dict1)
Tokens consisting of 3 documents.
text1 :
[1] "nuclear/wmd"

text2 :
[1] "nuclear/wmd"

text3 :
[1] "chemical/wmd"

To avoid these false matches by tokens_lookup(), we create a logical vector that flags the documents mentioning weapons, weapon = c(TRUE, FALSE, TRUE), and pass it to the new apply_if argument.

> weapon <- ntoken(tokens_select(toks, c("weapon*", "missile*"))) > 0
> print(weapon)
text1 text2 text3 
 TRUE FALSE  TRUE 

> toks_dict2 <- tokens_lookup(toks, dict, append_key = TRUE, apply_if = weapon)
> print(toks_dict2)
Tokens consisting of 3 documents.
text1 :
[1] "nuclear/wmd"

text2 :
character(0)

text3 :
[1] "chemical/wmd"

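The same idea carries over to token selection. As a minimal sketch (not part of the original session), we could remove the ambiguous keyword only from the documents that are not about weapons; this assumes that tokens_remove(), a thin wrapper around tokens_select(), accepts apply_if in the same way as tokens_lookup().

# Remove "nuclear" only where it does not occur in a weapons context,
# i.e. in the documents flagged FALSE in the weapon vector above.
toks_rm <- tokens_remove(toks, pattern = "nuclear", apply_if = !weapon)
# "nuclear" should be kept in text1 ("North Korea's nuclear missiles")
# but dropped from text2 ("Japan's nuclear power stations").
print(toks_rm)
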
We can also use the new argument in combination with exclusive = FALSE to distinguish tokens by the contexts in which they occur. In this example, tokens_lookup() keeps all the tokens while appending the “/WMD” tag to matched tokens. These annotated tokens appear as a separate feature, “nuclear/wmd”, indicating that they are occurrences of “nuclear” in the context of armament.

> toks_anno <- tokens_lookup(toks, dict, exclusive = FALSE, append_key = TRUE,
+                           apply_if = weapon)
> print(toks_anno)
Tokens consisting of 3 documents.
text1 :
[1] "North"       "Korea's"     "nuclear/WMD" "missiles"   

text2 :
[1] "Japan's"  "nuclear"  "power"    "stations"

text3 :
[1] "Syria's"      "chemical/WMD" "weapons"     

> dfm(toks_anno)
Document-feature matrix of: 3 documents, 11 features (66.67% sparse) and 0 docvars.
       features
docs    north korea's nuclear/wmd missiles japan's nuclear power stations syria's chemical/wmd
  text1     1       1           1        1       0       0     0        0       0            0
  text2     0       0           0        0       1       1     1        1       0            0
  text3     0       0           0        0       0       0     0        0       1            1
[ reached max_nfeat ... 1 more feature ]

We created the logical vector for context-specific analysis using only tokens_select() in this example, but as sketched above, it could also be produced by more sophisticated tools such as Sequential Seeded LDA. I hope you find creative ways to use the new argument in your projects.
