Good and bad methods to extract context words

Many questions about quanteda’s kwic() function have been posted on Stack Overflow. This shows how much people like the human-friendly function, but it also shows how much they are confused: the function is meant only for manual inspection of tokens objects, not for extracting context words in preprocessing. If you apply kwic() to your objects before statistical analysis, the results can be distorted because the same words are counted multiple times. You should always use tokens_select() instead.

In the example below, the tokens object contains the letters “a” to “z”, each of which occurs only once. The tokens that occur within three words of “e” and “g” are the letters “b” to “j”. Both tokens_select() and kwic() return these letters when called with pattern = c("e", "g") and window = 3. However, in the kwic() result the frequencies of the letters “d” to “h” are all two. This double-counting happens because those letters are within three words of both “e” and “g”.

In addition to distorting word frequencies, the kwic() method requires more steps than tokens_select() because it converts tokens objects to character strings and tokenizes them again. This is extremely inefficient when the corpus is large and the keywords are frequent.
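
The sketch below illustrates the cost difference (it assumes quanteda is attached, as in the code further down, and uses the package’s built-in data_corpus_inaugural with the pattern "econom*" purely for demonstration). system.time() compares the two routes: the kwic() route does extra work building a data frame, pasting the windows back into texts, and tokenizing them again.

# Rough timing comparison (illustrative only; data_corpus_inaugural and the
# pattern "econom*" are stand-ins for a larger corpus and frequent keywords)
toks_inaug <- tokens(data_corpus_inaugural)

# Right method: select the context words within the tokens object
system.time(
    dfmat_select <- dfm(tokens_select(toks_inaug, pattern = "econom*", window = 5))
)

# Wrong method: extract windows with kwic(), paste them into texts and tokenize again
system.time({
    kw_inaug <- kwic(toks_inaug, pattern = "econom*", window = 5)
    dat_inaug <- as.data.frame(kw_inaug)
    dat_inaug$text <- paste(dat_inaug$pre, dat_inaug$keyword, dat_inaug$post)
    dfmat_kwic <- dfm(tokens(corpus(dat_inaug)))
})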


require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 66.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
toks <- as.tokens(list(letters))
print(toks)
#> Tokens consisting of 1 document.
#> text1 :
#>  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l"
#> [ ... and 14 more ]
featfreq(dfm(toks))
#> a b c d e f g h i j k l m n o p q r s t u v w x y z 
#> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

# Use tokens_select() (right method)
toks_win <- tokens_select(toks, pattern = c("e", "g"), window = 3)
featfreq(dfm(toks_win))
#> b c d e f g h i j 
#> 1 1 1 1 1 1 1 1 1

# Use kwic() (wrong method)
kw <- kwic(toks, pattern = c("e", "g"), window = 3)
dat_kw <- as.data.frame(kw)
dat_kw$text <- paste(dat_kw$pre, dat_kw$keyword, dat_kw$post)
corp_kw <- corpus(dat_kw)
toks_kw <- tokens(corp_kw)
featfreq(dfm(toks_kw))
#> b c d e f g h i j 
#> 1 1 2 2 2 2 2 1 1
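
To see exactly which tokens are double-counted, you can compare the two frequency vectors directly (a small check added here for illustration):

freq_win <- featfreq(dfm(toks_win))
freq_kw <- featfreq(dfm(toks_kw))
# Non-zero differences mark the double-counted tokens; from the outputs above,
# these should be "d" to "h"
freq_kw[names(freq_win)] - freq_win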



2 thoughts on “Good and bad methods to extract context words”

  1. Hi, Kohei san, I am not sure if my question is related to this post … To calculate the polarity of the words in the documents and the polarity score of each document, we need (1) a trained LSS model (based on seed words and document text) and (2) context words. My question is: what is a “context word” and why do we need it? At first, I thought “context” meant something like “background”, and that we need context words because the polarity of a word depends on the context, e.g., the word “angry” might be negative in one context but positive in another. However, it seems that is not correct? In the tutorial (https://tutorials.quanteda.io/machine-learning/lss/), the “background” is “economy”; then, what are the words generated by char_context(toks_sent, pattern = "econom*", p = 0.05)? Are they the words related to the context “economy”? If so, what about the other words left in the document text?

    1. When you measure sentiment about the economy, the sentiment of its modifiers such as “booming” or “stagnant” matters most. “economy” is the target word, and “booming” and “stagnant” are context words. If you assign polarity scores only to context words, the LSS model captures only how the economy is discussed. This is useful when the texts cover many other topics. Please also see a recent example.
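
For readers following the linked tutorial, the role of context words can be seen in a simplified sketch like the one below. It swaps in quanteda’s built-in inaugural corpus and a minimal pair of seed words (both assumptions made here for illustration; the tutorial uses a news corpus and a sentiment dictionary), and passes the context words from char_context() to textmodel_lss() via the terms argument so that polarity scores are estimated only for them.

require(LSX)

# Simplified sketch of the linked tutorial; the inaugural corpus and the
# two seed words are stand-ins used only for illustration
corp_sent <- corpus_reshape(data_corpus_inaugural, to = "sentences")
toks_sent <- tokens(corp_sent, remove_punct = TRUE)
dfmat_sent <- dfm(toks_sent)

# Context words: terms that occur around "econom*" more often than chance (p < 0.05)
eco <- char_context(toks_sent, pattern = "econom*", p = 0.05)

# Polarity scores are estimated only for the context words passed via `terms`,
# so the model captures how the economy is discussed
tmod_lss <- textmodel_lss(dfmat_sent, seeds = c("good" = 1, "bad" = -1),
                          terms = eco, k = 300)
head(coef(tmod_lss))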
