The bag-of-words approach is common in text analysis, but it struggles to distinguish between meanings of words that depend on their context. To address this issue, the new version of quanteda (v4.0) adds an argument that allows users to select, remove, or look up tokens that occur in specific contexts.
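For instance, assuming tokens_remove() accepts the same apply_if argument (as the sentence above suggests for removal), tokens can be dropped only in selected documents. A minimal self-contained sketch:

require(quanteda)
toks_demo <- tokens(c(d1 = "nuclear missiles", d2 = "nuclear power"))
# apply_if takes one logical value per document; here "nuclear" is
# removed only from the first document
tokens_remove(toks_demo, pattern = "nuclear", apply_if = c(TRUE, FALSE))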
We usually use tokens_lookup() to classify documents into topics, but this method does not work well when keywords in the dictionary are ambiguous. Ideally, the dictionary would contain phrases that produce few or no false matches, but such a dictionary is sometimes very difficult to construct. In the example below, we have three texts, about North Korea, Japan, and Syria, and a dictionary about weapons of mass destruction (WMD). Only the first and third documents are about WMD, but the second is also classified as such because of “nuclear”.
> require(quanteda)
> txt <- c("North Korea's nuclear missiles",
+ "Japan's nuclear power stations",
+ "Syria's chemical weapons")
> toks <- tokens(txt)
> print(toks)
Tokens consisting of 3 documents.
text1 :
[1] "North" "Korea's" "nuclear" "missiles"
text2 :
[1] "Japan's" "nuclear" "power" "stations"
text3 :
[1] "Syria's" "chemical" "weapons"
>
> dict <- dictionary(list(wmd = c("nuclear", "chemical", "biological")))
> print(dict)
Dictionary object with 1 key entry.
- [wmd]:
- nuclear, chemical, biological
> toks_dict1 <- tokens_lookup(toks, dict, append_key = TRUE)
> print(toks_dict1)
Tokens consisting of 3 documents.
text1 :
[1] "nuclear/wmd"
text2 :
[1] "nuclear/wmd"
text3 :
[1] "chemical/wmd"
To avoid false matches by tokens_lookup() without rewriting the dictionary, we can create a logical vector weapon = c(TRUE, FALSE, TRUE) and pass it to apply_if. Below, the vector records whether each document contains any tokens matching “weapon*” or “missile*”.
> weapon <- ntoken(tokens_select(toks, c("weapon*", "missile*"))) > 0
> print(weapon)
text1 text2 text3
TRUE FALSE TRUE
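The logical vector does not have to be derived from the tokens themselves; any document-level information works. A sketch using a hypothetical docvar named “topic”:

# "topic" is a hypothetical document variable added for illustration
toks_meta <- toks
docvars(toks_meta, "topic") <- c("defence", "energy", "defence")
weapon_alt <- toks_meta$topic == "defence"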
> toks_dict2 <- tokens_lookup(toks, dict, append_key = TRUE, apply_if = weapon)
> print(toks_dict2)
Tokens consisting of 3 documents.
text1 :
[1] "nuclear/wmd"
text2 :
character(0)
text3 :
[1] "chemical/wmd"
We can also use the new argument in combination with exclusive = FALSE to distinguish tokens by the contexts in which they occur. In this example, tokens_lookup() keeps all the tokens while appending the “/WMD” tag to the matched ones. The annotated tokens become separate features in the document-feature matrix, such as “nuclear/wmd”, indicating that “nuclear” occurred in the context of armament.
> toks_anno <- tokens_lookup(toks, dict, exclusive = FALSE, append_key = TRUE,
+ apply_if = weapon)
> print(toks_anno)
Tokens consisting of 3 documents.
text1 :
[1] "North" "Korea's" "nuclear/WMD" "missiles"
text2 :
[1] "Japan's" "nuclear" "power" "stations"
text3 :
[1] "Syria's" "chemical/WMD" "weapons"
> dfm(toks_anno)
Document-feature matrix of: 3 documents, 11 features (66.67% sparse) and 0 docvars.
       features
docs    north korea's nuclear/wmd missiles japan's nuclear power stations syria's chemical/wmd
  text1     1       1           1        1       0       0     0        0       0            0
  text2     0       0           0        0       1       1     1        1       0            0
  text3     0       0           0        0       0       0     0        0       1            1
[ reached max_nfeat ... 1 more feature ]
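To focus on the annotated tokens in this matrix, the tagged features can be selected by the key suffix; a sketch using glob matching (note that dfm() lowercases features by default, hence “/wmd”):

# keep only the features annotated with the dictionary key
dfmat_wmd <- dfm_select(dfm(toks_anno), pattern = "*/wmd")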
In this example, we created the logical vector for context-specific analysis using only tokens_select(), but it could also be produced by more sophisticated tools such as Sequential Seeded LDA.
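As a rough sketch of that route, assuming the seededlda package and its textmodel_seededlda() and topics() functions, topic assignments from a seeded model fitted on a realistically sized corpus could supply the vector:

# sketch, assuming the seededlda package's API; the toy corpus here is
# far too small for a topic model in practice
library(seededlda)
lda <- textmodel_seededlda(dfm(toks), dict)
weapon_lda <- topics(lda) == "wmd"

I hope you find creative ways to use the new argument in your projects.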