quanteda – Kohei Watanabe

Programming, Text analysis | December 25, 2019 | January 19, 2020

Why quanteda is so fast?

Those who read my recent post on quanteda’s performance might wonder why the package is so fast. It is not only because we wrote the package's R code carefully, but also because we optimized its internal functions and objects for large textual data. Three design features of quanteda dramatically enhance its performance. Upfront data […]

Programming, Text analysis | December 19, 2019 | January 19, 2020

R and Python text analysis packages performance comparison – updated

I compared the performance of R and Python in 2017, when we were developing quanteda v1.0, and confirmed that our package’s execution time is around 50% shorter and its peak memory consumption 40% smaller than gensim's. Two years later, we are developing quanteda v2.0, which will be released early next year. We are improving the […]

Text analysis | October 28, 2019 | January 19, 2020

Fixing tokens that contain 「っ」 using regular expressions

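quanteda's ICU-based Japanese tokenization works well for the most part, but it seems to struggle with sentences that contain 「っ」, such as 「持った」, 「言った」 and 「踊った」. According to the MeCab morphological analysis below, the tokens should be 「持っ」 and 「言っ」, but ICU produces the meaningless tokens 「って」 and 「っ」. This led me to a way of fixing the tokens with tokens_compound() and tokens_split(). The former is a long-standing function, while the latter is relatively new and performs the reverse operation. In this approach, tokens_split() first turns 「っ」 into a token of its own, and tokens_compound() then joins it to the kanji token that precedes it. As a result, I obtained tokens identical to MeCab's segmentation. The risk of this approach is that tokens_split() may break unrelated tokens that happen to contain 「っ」, but this should not be a problem for most documents. Both functions are also parallelized in C++, so processing should be fast.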
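The split-then-compound idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from the post: the input tokens are hand-made stand-ins for ICU output, and the object names and the `\p{Han}` regular expression for kanji-only tokens are assumptions made for this example.

```r
library(quanteda)

# Hand-made tokens standing in for ICU output (an assumption for illustration,
# not actual tokenizer output): "って" and "っ" are the problematic tokens.
toks <- as.tokens(list(
    d1 = c("持", "って", "いる"),
    d2 = c("言", "っ", "た")
))

# Step 1: make "っ" a token of its own, keeping the separator itself.
toks_split <- tokens_split(toks, separator = "っ", valuetype = "fixed",
                           remove_separator = FALSE)

# Step 2: join a kanji-only token with the "っ" that immediately follows it.
# The two-element vector inside list() is matched as a token sequence.
toks_fixed <- tokens_compound(toks_split,
                              pattern = list(c("^\\p{Han}+$", "^っ$")),
                              valuetype = "regex", concatenator = "")

print(as.list(toks_fixed))
```

Because step 1 splits every token that contains 「っ」, a stricter pattern in step 2 (or a narrower separator in step 1) may be needed for texts where unrelated tokens contain the character.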

Programming, Text analysis | October 25, 2018 | January 19, 2020

Computing document similarity in large corpus

Since early this year, many people have asked me how to compute document (or feature) similarity in a large corpus. They said their functions stop because of a lack of space in RAM: Error in .local(x, y, ...) : Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 92. This happened in our textstat_simil(margin = "documents") […]

Event | July 16, 2018 | January 19, 2020

Presentation at R user meeting in Tokyo

I presented Quantitative Analysis of Textual Data with R at a TokyoR event hosted by Yahoo Japan on 15th July. This was a great opportunity for me to reach out to a broad range of Japanese R users and tell them how easy it is to analyze Asian texts using quanteda. It was also really nice to meet […]

Event, Text analysis | April 18, 2018 | January 19, 2020

Building text analysis models using Quanteda

At the LSE Computational Social Science hackathon, I presented how to develop text analysis models using quanteda's core APIs such as as.tokens(), as.dfm() and pattern2id(). All the slides and files are available in my GitHub repository.

Text analysis | February 1, 2018 | January 19, 2020

Quanteda Tutorials

We launched the Quanteda Tutorials website for the workshop "Introduction to Quantitative Text Analysis using Quanteda", held at the WZB Berlin Social Science Center on 31st January. The website is still a work in progress, but it already covers all the important Quanteda functions.

Event, Text analysis | January 20, 2018 | January 19, 2020

Release of Quanteda version 1.0

We announced the release of quanteda version 1.0 at the London R meeting on Tuesday. I thank all the organizers and the 150+ participants. In the talk, I presented the performance comparison with R and Python packages, but I actually compared the performance with its earlier CRAN versions to show how the package evolved to […]

Text analysis | June 3, 2017 | January 19, 2020

Workshops on Japanese text analysis using quanteda

I presented how to analyze Japanese texts using quanteda in half-day workshops at Waseda University (22 May) and Kobe University (2 June), organized by Mikihito Tanaka (Waseda) and Atsushi Tago (Kobe). Materials for these workshops are available on GitHub as Introduction to Japanese Text Analysis (IJTA).

Japanese, Text analysis | May 25, 2017 | January 19, 2020

Introduction to Japanese text analysis using quanteda

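I held a workshop on quanteda at Waseda University. The materials are published under the title "Rによる日本語テキスト分析入門" (Introduction to Japanese Text Analysis Using R), and I will expand them little by little. I plan to actively hold more workshops on Japanese text analysis, so please get in touch if you are interested.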
