Segmenting Japanese and Chinese text with stringi

Kohei | December 2, 2016 | January 18, 2017

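Many people believe that morphological analysis with tools such as Mecab or Chasen is indispensable for segmenting Japanese text into words, but that does not always seem to be the case. I realized this while examining quanteda's tokenization function: when I passed Japanese text to it, the text was split cleanly into words, much as Mecab would do.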

> txt_jp <- "政治とは社会に対して全体的な影響を及ぼし、社会で生きるひとりひとりの人の人生にも様々な影響を及ぼす複雑な領域である。"
> quanteda::tokens(txt_jp)
tokens from 1 document.
Component 1 :
 [1] "政治"         "と"           "は"           "社会"         "に対して"     "全体"         "的"           "な"          
 [9] "影響"         "を"           "及"           "ぼ"           "し"           "、"           "社会"         "で"          
[17] "生きる"       "ひとりひとり" "の"           "人"           "の"           "人生"         "に"           "も"          
[25] "様々"         "な"           "影響"         "を"           "及ぼす"       "複雑"         "な"           "領域"        
[33] "で"           "ある"         "。"

quanteda has no morphological analysis capability, so I was surprised to find that its tokenization function also segmented Chinese text cleanly.

> txt_cn <- "政治是各种团體进行集体决策的一个过程，也是各种团體或个人为了各自的領域所结成的特定关系，尤指對於某一政治實體的統治，例如統治一個國家，亦指對於一國內外事務之監督與管制。"
> quanteda::tokens(txt_cn)
tokens from 1 document.
Component 1 :
 [1] "政治"   "是"     "各种"   "团"     "體"     "进行"   "集体"   "决策"   "的"     "一个"   "过程"   "，"     "也是"  
[14] "各种"   "团"     "體"     "或"     "个人"   "为了"   "各自"   "的"     "領域"   "所"     "结成"   "的"     "特定"  
[27] "关系"   "，"     "尤"     "指"     "對於"   "某一"   "政治"   "實體"   "的"     "統治"   "，"     "例如"   "統治"  
[40] "一個"   "國家"   "，"     "亦"     "指"     "對於"   "一"     "國內外" "事務"   "之"     "監督"   "與"     "管制"  
[53] "。"

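Looking into this further, I found the reason for this curious behavior. The tokenization function is built on stringi::stri_split_boundaries, which uses ICU (International Components for Unicode). ICU segments Chinese, Japanese, Thai, and Khmer text with dictionaries, and its Japanese dictionary is the IPA dictionary that Mecab also uses.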
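To see the ICU segmentation without going through quanteda, stringi can be called directly. The following is a minimal sketch, not a reproduction of quanteda's internals: the choice of `type = "word"` and the `skip_word_none` option are my assumptions about a reasonable way to call the function, and the variable names `toks` and `words` are only illustrative.

```r
library(stringi)

txt_jp <- "政治とは社会に対して全体的な影響を及ぼし、社会で生きるひとりひとりの人の人生にも様々な影響を及ぼす複雑な領域である。"

# Split at ICU word boundaries. Every segment is kept, including punctuation,
# so pasting the pieces back together reproduces the input.
toks <- stri_split_boundaries(txt_jp, type = "word")[[1]]
print(toks)

# Same split, but skip segments that ICU does not classify as words
# (such as whitespace and most punctuation), set via break iterator options.
words <- stri_split_boundaries(
  txt_jp,
  opts_brkiter = stri_opts_brkiter(type = "word", skip_word_none = TRUE)
)[[1]]
print(words)
```

The exact tokens depend on the ICU version that stringi is built against, so the result may differ slightly from the quanteda output shown earlier.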

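If stringi can do the segmentation, then Japanese and Chinese text analysis that does not rely on syntactic parsing no longer requires morphological analysis. This lowers the technical barrier, so social scientific analysis of Japanese and Chinese texts should become more widespread.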

