Redefining word boundaries by collocation analysis

Quanteda’s tokenizer can segment Japanese and Chinese texts thanks to stringi, but the results are not always good, because the underlying library, ICU, recognizes only a limited number of words. For example, this Japanese text

"ニューヨークのケネディ国際空港"

can be translated as “Kennedy International Airport (ケネディ国際空港) in (の) New York (ニューヨーク)”. Quanteda’s tokenizer (the tokens function) segments it into pieces that are too small:

"ニュー"       "ヨーク"       "の"           "ケネディ"     "国際"         "空港"

Clearly, the first two tokens should not be separated. The standard Japanese POS tagger, MeCab, keeps them together:

"ニューヨーク" "の"           "ケネディ"     "国際"         "空港"

However, the erroneous segmentation can be corrected by running quanteda’s sequences function on a large corpus of news articles to identify contiguous collocations. After the word boundaries are corrected, both the first (ニューヨーク) and last (国際空港) parts are joined together:

"ニューヨーク" "の"             "ケネディ"     "国際空港"

This is exactly the same approach as that used to identify phrases and multi-word names in English texts. The word boundary correction process is a series of collocation analyses followed by token concatenation. The data used to discover the collocations comprises 138,108 news articles.

# load the corpus of Asahi Shimbun articles and tokenize it sentence by sentence
load('data_corpus_asahi_q10.RData')
toks <- tokens(corpus_segment(data_corpus_asahi_q10, what = "other", delimiter = "。"), include_docvars = TRUE)

# keep only tokens made of numbers, kana and kanji, padding to preserve token positions
toks <- tokens_select(toks, '^[0-9ぁ-んァ-ヶー一-龠]+$', valuetype = 'regex', padding = TRUE)

# minimum frequency for a sequence to be treated as a collocation
min_count <- 50

# process class of words that include 国際 and 空港
seqs_kanji <- sequences(toks, '^[一-龠]+$', valuetype = 'regex', nested = FALSE, 
                        min_count = min_count, ordered = FALSE) 
toks <- tokens_compound(toks, seqs_kanji[seqs_kanji$p < 0.01,], valuetype = 'fixed', 
                        concatenator = '', join = TRUE)

# process class of words that include ニュー and ヨーク
seqs_kana <- sequences(toks, '^[ァ-ヶー]+$', valuetype = 'regex', nested = FALSE, 
                       min_count = min_count, ordered = FALSE) 
toks <- tokens_compound(toks, seqs_kana[seqs_kana$p < 0.01,], valuetype = 'fixed', 
                        concatenator = '', join = TRUE)

# process both classes of words
seqs <- sequences(toks, '^[0-9ァ-ヶー一-龠]+$', valuetype = 'regex', nested = FALSE, 
                  min_count = min_count, ordered = FALSE)
toks <- tokens_compound(toks, seqs[seqs$p < 0.01,], valuetype = 'fixed', 
                        concatenator = '', join = TRUE)

saveRDS(toks, 'data_tokens_asahi.RDS')
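
To check that the corrected boundaries are actually in place, the compounded tokens can be inspected, for example with kwic() (a minimal sketch; toks is the object saved above):

head(kwic(toks, 'ニューヨーク', window = 3))
head(kwic(toks, '国際空港', window = 3))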
