Text analysis – Page 5 – Kohei Watanabe

Publication, Text analysis | January 29, 2020 | February 21, 2020

New research paper on how to choose seed words for semi-supervised models

In recent years, I have been developing and applying semi-supervised models, such as seeded-LDA, Newsmap and LSS, for classification and document scaling, aiming to broaden the scope of quantitative text analysis. These models are very cost-efficient because they only require a small set of “seed words” to learn categories or dimensions of interest. However, […]

Programming, Text analysis | December 25, 2019 | January 19, 2020

Why is quanteda so fast?

Those who read my recent post on quanteda’s performance might wonder why the package is so fast. This is not only because we carefully wrote the package’s R code but also because we optimized its internal functions and objects for large textual data. There are three design features of quanteda that dramatically enhance its performance. Upfront data […]

Programming, Text analysis | December 19, 2019 | January 19, 2020

R and Python text analysis packages performance comparison – updated

I compared the performance of R and Python in 2017, when we were developing quanteda v1.0, and confirmed that our package’s execution time is around 50% shorter and its peak memory consumption 40% smaller than gensim’s. Two years later, we are developing quanteda v2.0, which will be released early next year. We are improving the […]

Event, Text analysis | December 12, 2019 | March 25, 2020

COMPTEXT 2020 Conference

The POLTEXT conference has been renamed COMPTEXT to broaden its focus from political science to the wider social sciences. Anyone who analyzes textual data from a social science perspective is welcome to present. The next conference, COMPTEXT 2020, will take place in Innsbruck, Austria, on 15-16 May 2020. The developers of quanteda will offer tutorials in the pre-conference events […]

Text analysis | October 28, 2019 | January 19, 2020

Fixing tokens that contain 「っ」 with regular expressions

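quanteda's ICU-based word segmentation for Japanese works well for the most part, but it seems to struggle with sentences that contain 「っ」, such as 「持った」, 「言った」 and 「踊った」. According to MeCab's morphological analysis, the tokens should be 「持っ」 and 「言っ」, but ICU produces meaningless tokens such as 「って」 and 「っ」. The fix I came up with is to repair the tokens using tokens_compound() and tokens_split(). The former has been around for a long time, while the latter is a relatively new function that performs the opposite operation. In this approach, tokens_split() first turns 「っ」 into a standalone token, and tokens_compound() then joins it to the kanji token that precedes it. As a result, I obtained tokens identical to those of MeCab's segmentation. There is a risk that tokens_split() breaks up unrelated tokens that happen to contain 「っ」, but this should not be a problem in most documents. Both functions are also parallelized in C++, so processing should be fast.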
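The two-step repair described above can be sketched as follows. This is a minimal illustration rather than the code from the original post: the tokens are built by hand with as.tokens() to imitate the problematic segmentation (actual tokens() output depends on the ICU version), and the `\p{Han}` regular expression used to detect a kanji-only token is an assumption.

```r
library(quanteda)

# Hand-built tokens imitating the problematic segmentation
# (an assumption for illustration; real tokens() output may differ).
toks <- as.tokens(list(c("彼", "は", "傘", "を", "持", "って", "いた")))

# Step 1: split on "っ" but keep it, so that it becomes a token of its own.
toks_split <- tokens_split(toks, separator = "っ", valuetype = "fixed",
                           remove_separator = FALSE)

# Step 2: join a kanji-only token with the "っ" that directly follows it.
toks_fixed <- tokens_compound(toks_split,
                              pattern = phrase("^\\p{Han}+$ ^っ$"),
                              valuetype = "regex", concatenator = "")

print(toks_fixed)
```

The empty concatenator keeps the joined token free of the default underscore. Restricting the first element of the pattern to kanji-only tokens limits the compounding step, but it does not remove the risk noted above that the splitting step also cuts unrelated tokens containing 「っ」.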

Text analysis | September 17, 2019 | September 24, 2019

Thanks for helping me organize POLTEXT, and goodbye to WIAS

I held the POLTEXT conference last weekend in Tokyo with Lisa Lechner and Miklos Sobok. At the end of the conference, many people thanked me for organizing the conference, but I owe much of the success to my assistants and the admin staff at Waseda Institute for Advanced Study (WIAS). WIAS is Waseda University’s research-only […]

Event, Text analysis | June 28, 2019 | September 17, 2019

Auditing POLTEXT 2019 in Tokyo

We have opened applications for auditing POLTEXT 2019, which will take place at Waseda University on 14-15 September. We are very excited to have world-famous keynote speakers, Jonathan Slapin (University of Zurich) and Sven-Oliver Proksch (University of Cologne), and over 60 presenters from all over the world. If you are interested in attending, please sign up […]

Text analysis | May 17, 2019 | December 22, 2019

False European news sites

According to a news report, the European Union is stepping up its effort to prevent disinformation from spreading, in collaboration with fact-checking organizations in its member countries. They fear that foreign actors such as the Russian government will try to influence the European Parliament election later this month by spreading eurosceptic or anti-immigrant content. Since 2017, I […]

Text analysis | March 3, 2019 | December 22, 2019

A dictionary for quantitative text analysis of Japanese

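Keyword dictionaries are widely used in quantitative text analysis, but for Japanese there are hardly any suited to social science research, which I think has become an obstacle to both research and teaching. Recently, however, an acquaintance told me about the Nikkei Thesaurus, in which about 15,000 words are divided into the following 23 fields:

 1. 一般・共通 (general and common)
 2. 経済・産業 (economy and industry)
 3. 経営・企業 (management and business)
 4. 農林水産 (agriculture, forestry and fisheries)
 5. 食品 (food)
 6. 繊維・木材・紙パ (textiles, wood, paper and pulp)
 7. 資源・エネルギー (resources and energy)
 8. 金属・土石 (metals, earth and stone)
 9. 化学 (chemicals)
10. 機械・器具・設備 (machinery, instruments and equipment)
11. 電子電機 (electronics and electrical machinery)
12. 情報・通信 (information and communications)
13. 建設 (construction)
14. 流通・サービス・家庭用品 (distribution, services and household goods)
15. 環境・公害 (environment and pollution)
16. 科学技術・文化 (science, technology and culture)
17. 自然界 (natural world)
18. 国際 (international)
19. 政治 (politics)
20. 地方 (local)
21. 労働・教育・医療 (labor, education and healthcare)
22. 社会・家庭 (society and family)
23. 地域 (regions)

Since it looks usable at least for analyzing newspaper articles, I collected the words and compiled them into YAML format. The single-word version is exactly as published on the website, while the multi-word version has been segmented with quanteda's tokens(), which makes it easier to use for dictionary analysis and for compounding multi-word expressions.

Nikkei Thesaurus (single-word version)
Nikkei Thesaurus (multi-word version)

The easiest way to use this thesaurus in quanteda is:

    dict <- dictionary(file = "nikkei-thesaurus_multiword.yml")
    tokens_lookup(toks, dict)
    tokens_compound(toks, dict)

For details on how to use dictionaries, see the Quanteda Tutorials. If you download articles from the Asahi Shimbun's 『聞蔵』 (Kikuzo) or the Yomiuri Shimbun's 『ヨミダス』 (Yomidas), you can easily read the texts into R with newspapers.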
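To make the effect of these calls concrete, here is a small self-contained sketch. It does not use the thesaurus file: the two-field dictionary, its entries and the hand-segmented tokens are all invented for illustration, standing in for nikkei-thesaurus_multiword.yml and for real tokens() output.

```r
library(quanteda)

# Hypothetical miniature dictionary standing in for the thesaurus file.
# Multi-word entries are written as space-separated tokens.
dict <- dictionary(list(
    "政治" = c("国会", "内閣 総理 大臣"),
    "経済・産業" = c("景気", "国内 総生産")
))

# Hand-segmented tokens (an assumed segmentation, for illustration only).
toks <- as.tokens(list(c("内閣", "総理", "大臣", "が", "景気", "に",
                         "ついて", "国会", "で", "答弁")))

# Dictionary analysis: each matched word or phrase is replaced by its field.
toks_field <- tokens_lookup(toks, dict)

# Compounding: multi-word entries are joined into single tokens.
toks_comp <- tokens_compound(toks, dict, concatenator = "")

print(toks_field)
print(toks_comp)
```

tokens_lookup() keeps only the field names of matched entries, which suits counting fields per document, whereas tokens_compound() leaves the text intact and merely joins multi-word entries, which is useful as a preprocessing step before further analysis.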

Text analysis | March 2, 2019 | March 2, 2019

French and Chinese seed dictionaries are added to Newsmap

newsmap is a dictionary-based semi-supervised model for geographical document classification. The core of the package is not the machine learning algorithm but the multilingual seed dictionaries created by me and other contributors in English, German, French, Spanish, Japanese, Russian and Chinese. We recently added Chinese (traditional and simplified) and French dictionaries, and submitted the package to CRAN. […]
