A few days ago, I received an email from a researcher asking if text analysis is becoming irrelevant because of artificial intelligence (AI). I replied to her briefly, saying that text analysis methods and AI products serve different purposes, but I thought I needed to write more to answer this question. In fact, I had […]
A new topic model for analyzing imbalanced corpora
I have been developing and testing a new topic model called Distributed Asymmetric Allocation (DAA) because latent Dirichlet allocation (LDA) takes a long time to fit to a large corpus and does not always discover topics that I am interested in. I know that these are also problems for many other users, so I decided […]
Automatically adjusting alpha for small and large topics
Since the release of the seededlda package v1.0, which implements the distributed LDA algorithm, I have been applying topic models to many different corpora. In doing so, I became increasingly aware that it is important to optimize the Dirichlet prior of the document-topic distribution, alpha, to identify topics that are interesting for social scientists. Alpha, as […]
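To illustrate what alpha does, here is a minimal sketch of fitting LDA with different fixed priors. It assumes the k and alpha arguments of seededlda's textmodel_lda() and uses quanteda's built-in inaugural corpus; it shows fixed values rather than the automatic adjustment discussed in the post.

```r
library(quanteda)
library(seededlda)

toks <- tokens(data_corpus_inaugural, remove_punct = TRUE) |>
  tokens_remove(stopwords("en"))
dfmat <- dfm(toks)

# A small alpha concentrates each document on a few topics; a large
# alpha spreads documents over many topics.
lda_small <- textmodel_lda(dfmat, k = 10, alpha = 0.1)
lda_large <- textmodel_lda(dfmat, k = 10, alpha = 1.0)

# Compare the estimated sizes of the topics.
sizes(lda_small)
sizes(lda_large)
```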
Develop efficient custom functions using quanteda v4.0
The most important change in quanteda v4.0 is the creation of the external pointer-based tokens object, called tokens_xptr, that allows us to write efficient custom functions. In earlier versions of the package, modifying tokens using tokens_*() required long execution times and large amounts of memory because the data had to be transferred between R and C++. Such inefficiency […]
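A minimal sketch of the workflow, assuming quanteda v4's as.tokens_xptr() and as.tokens() converters:

```r
library(quanteda)

toks <- tokens(data_corpus_inaugural, remove_punct = TRUE)

# Convert to an external pointer so that tokens_*() functions operate
# on the underlying C++ object without copying data between R and C++.
xtoks <- as.tokens_xptr(toks)
xtoks <- tokens_remove(xtoks, stopwords("en"), padding = TRUE)
xtoks <- tokens_ngrams(xtoks, n = 2)

# Build a dfm directly from the pointer, or convert back to a regular
# tokens object. Note that tokens_xptr objects are modified in place,
# so keep the original tokens object if you still need it.
dfmat <- dfm(xtoks)
toks2 <- as.tokens(xtoks)
```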
New tokens object in quanteda v4.0
I am very happy that we were able to release quanteda v4.0 after more than a year of development. For this release, I improved the package’s efficiency by creating a new tokens object and writing many internal functions in C++ to allow users to process millions of documents on a laptop (or tens of millions on a […]
Perform context-specific operations on tokens using a new argument
The bag-of-words approach is common in text analysis, but it has trouble distinguishing between the meanings of words that depend on their context. To address this issue, we added an argument to the new version of quanteda (v4.0) that allows users to select, remove, or look up tokens that occur in specific contexts. We […]
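As a sketch of what a context-specific operation might look like, the code below reshapes a corpus into sentences so that each "context" is a sentence, and removes stopwords only in sentences that mention a keyword. It assumes the new argument is named apply_if and takes a document-level logical vector; check the quanteda v4.0 documentation for the exact interface.

```r
library(quanteda)

# Treat each sentence as a document so that the context is a sentence.
corp <- corpus_reshape(data_corpus_inaugural, to = "sentences")
toks <- tokens(corp, remove_punct = TRUE)

# Condition: sentences that contain "peace" (an arbitrary example).
has_peace <- ntoken(tokens_select(toks, "peace")) > 0

# Remove stopwords only in those sentences, leaving the rest untouched.
toks2 <- tokens_remove(toks, stopwords("en"), apply_if = has_peace)
```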
Group polarity words with different colors in LSS
We should always ensure that we are measuring what we want to measure in text analysis. In Latent Semantic Scaling (LSS), we can assess the validity of measurement by inspecting the polarity scores of words using LSX::textplot_terms(). This function automatically selects words with very high or low polarity scores and highlights them. We can confirm that […]
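A sketch of this kind of validation, assuming that textplot_terms() accepts a quanteda dictionary through its highlighted argument and colors the matched words by dictionary key; the dictionary patterns below are made up for illustration.

```r
library(quanteda)
library(LSX)

corp <- corpus_reshape(data_corpus_inaugural, to = "sentences")
toks <- tokens(corp, remove_punct = TRUE) |>
  tokens_remove(stopwords("en"))
dfmat <- dfm(toks)

# Fit LSS with the generic sentiment seed words shipped with LSX.
lss <- textmodel_lss(dfmat, seeds = as.seedwords(data_dictionary_sentiment))

# Group highlighted words by dictionary key, one color per key.
dict <- dictionary(list(economy = c("econom*", "trade*", "tax*"),
                        security = c("war*", "peace*", "secur*")))
textplot_terms(lss, highlighted = dict)
```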
Setting fonts for Chinese polarity words in LSS
I always recommend that users of the LSX package visualize polarity words using textplot_terms() because it allows them to explain intuitively to others what their fitted LSS models are measuring. I am using this function myself in my project on constructing a geopolitical threat index (GTI) for multiple countries, including China and Japan, but […]
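One general workaround, independent of any LSX-specific option, is to map ggplot2's default "sans" family to a CJK font with the showtext package before plotting. The font file name below is an assumption; point it to a CJK font installed on your system.

```r
library(showtext)

# Register a CJK font under the default "sans" family (path assumed).
font_add("sans", regular = "NotoSansCJKsc-Regular.otf")
showtext_auto()

# lss is a fitted LSS model, e.g. from textmodel_lss(); the Chinese
# polarity words should now render with the registered font.
textplot_terms(lss)
```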
New paper on semantic temporality analysis
My co-authored paper on the temporal orientation of texts appeared in Research & Politics. In this study, we applied latent semantic scaling (LSS) to a corpus of English and German texts to automatically identify features related to the future or the past. With only a set of common verbs as seed words, the algorithm could classify sentences […]
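As a rough sketch of the approach, with hypothetical seed words rather than the paper's actual set, a future-past scale could be fitted like this:

```r
library(quanteda)
library(LSX)

# Hypothetical seed words: common verbs marking the future (+1)
# and the past (-1). The paper's actual seed set may differ.
seed <- c("will" = 1, "shall" = 1, "going" = 1,
          "was" = -1, "were" = -1, "had" = -1)

# Stopwords are kept because the seed words are function words.
corp <- corpus_reshape(data_corpus_inaugural, to = "sentences")
dfmat <- dfm(tokens(corp, remove_punct = TRUE))
lss <- textmodel_lss(dfmat, seeds = seed)

# Positive scores lean toward the future, negative toward the past.
head(predict(lss, newdata = dfmat))
```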
Use pre-trained GloVe word vectors in LSS
I made it possible to use pre-trained word vectors for Latent Semantic Scaling (LSS) in version 0.9 of the LSX package, but I don’t think I explained how to do it. It can be done easily with the as.textmodel_lss() function, but you need to load the word vectors into R as a dense matrix beforehand. […]
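A sketch of the procedure, assuming glove.6B.300d.txt from the Stanford NLP site has been downloaded and unzipped into the working directory, and that as.textmodel_lss() expects words in the columns of the matrix (hence the transpose; check ?as.textmodel_lss):

```r
library(LSX)

# Read the GloVe vectors: each row is a word followed by 300 numbers.
dat <- read.table("glove.6B.300d.txt", quote = "", comment.char = "")
mat <- as.matrix(dat[, -1])
rownames(mat) <- dat[, 1]

# Build an LSS model from the pre-trained vectors and sentiment seeds.
lss <- as.textmodel_lss(t(mat), seeds = as.seedwords(data_dictionary_sentiment))

# Inspect the polarity scores of words.
head(coef(lss), 20)
```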