Since the release of the seededlda package v1.0, which implements the distributed LDA algorithm, I have been applying topic models to many different corpora. In doing so, I became increasingly aware that it is important to optimize the Dirichlet prior of the document-topic distribution, alpha, to identify topics that are interesting for social scientists. Alpha, as […]
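As a minimal sketch of tuning this prior, assuming seededlda's `textmodel_lda()` exposes an `alpha` argument (the values below are purely illustrative, not recommendations):

```r
library(quanteda)
library(seededlda)

# build a small DFM from quanteda's built-in inaugural corpus
dfmt <- dfm(tokens(data_corpus_inaugural)) |>
    dfm_trim(min_termfreq = 5)

# a smaller alpha encourages sparser document-topic
# distributions, so documents concentrate on fewer topics
lda <- textmodel_lda(dfmt, k = 10, alpha = 0.1)
terms(lda)  # inspect the top terms per topic
```

Refitting with different alpha values and comparing the resulting top terms is one simple way to see how the prior shapes the topics.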
Develop efficient custom functions using quanteda v4.0
The most important change in quanteda v4.0 is the introduction of an external pointer-based tokens object, called tokens_xptr, that allows us to write efficient custom functions. In earlier versions of the package, modifying tokens with tokens_*() required long execution times and large amounts of memory because the data was transferred between R and C++. Such inefficiency […]
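A minimal sketch of the round trip, assuming quanteda v4's `as.tokens_xptr()` and `as.tokens()` conversion functions:

```r
library(quanteda)

toks <- tokens(data_corpus_inaugural)

# convert to an external-pointer tokens object; subsequent
# tokens_*() calls can then operate on the data in C++
# without copying it between R and C++ at every step
xtoks <- as.tokens_xptr(toks)
xtoks <- tokens_remove(xtoks, stopwords("en"))

# convert back to a regular tokens object when finished
toks2 <- as.tokens(xtoks)
```

Because the object is a pointer to data held in C++, operations on it can avoid the repeated serialization that made chained tokens_*() calls slow in earlier versions.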
Group polarity words with different colors in LSS
We should always ensure that we are measuring what we want to measure in text analysis. In Latent Semantic Scaling (LSS), we can assess the validity of measurement by inspecting the polarity scores of words using LSX::textplot_terms(). This function automatically selects words with very high or low polarity scores and highlights them. We can confirm that […]
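A minimal sketch of this workflow, assuming LSX's bundled `data_dictionary_sentiment` seed dictionary and `as.seedwords()` helper:

```r
library(quanteda)
library(LSX)

# prepare a DFM from quanteda's built-in inaugural corpus
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE) |>
    tokens_remove(stopwords("en"))
dfmt <- dfm(toks)

# fit LSS with generic sentiment seed words
lss <- textmodel_lss(dfmt, seeds = as.seedwords(data_dictionary_sentiment))

# highlight words with very high or low polarity scores
textplot_terms(lss)
```

Inspecting the highlighted words is a quick face-validity check: if words at the extremes do not match the intended concept, the seed words or corpus likely need revisiting.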
Setting fonts for Chinese polarity words in LSS
I always recommend that users of the LSX package visualize polarity words using textplot_terms(), because it allows them to explain intuitively to others what their fitted LSS models are measuring. I am using this function myself in my project on the construction of a geopolitical threat index (GTI) for multiple countries, including China and Japan, but […]
Use pre-trained GloVe word vectors in LSS
I made it possible to use pre-trained word vectors for Latent Semantic Scaling (LSS) in version 0.9 of the LSX package, but I don’t think I explained how to do it. It can be done easily using the as.textmodel_lss() function, but you need to load the word vectors into R as a dense matrix beforehand. […]
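A rough sketch of loading GloVe vectors into a dense matrix and wrapping them with `as.textmodel_lss()`; the file name and seed words here are hypothetical, and the expected orientation of the matrix (words in rows vs. columns) should be checked against the LSX documentation:

```r
library(LSX)

# suppose "glove.6B.300d.txt" holds pre-trained GloVe vectors,
# one word per line followed by its vector components
raw <- read.table("glove.6B.300d.txt", quote = "", comment.char = "")
mat <- as.matrix(raw[, -1])
rownames(mat) <- raw[, 1]  # words as row names

# wrap the embedding matrix as an LSS model with seed words
seed <- c("good" = 1, "bad" = -1)
lss <- as.textmodel_lss(mat, seeds = seed)
head(coef(lss))  # polarity scores of words
```

The fitted object can then be used like any other LSS model, e.g. to predict document polarity scores.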
New papers on distributed LDA for sentence-level topic classification
I have been studying and developing an LDA algorithm for the classification of sentences since 2022. Sentence-level topic classification allows us to analyze the association between topics and other properties, such as sentiment, within documents. Sentence-level analysis has also become more common in text analysis in general, thanks to highly capable transformer models in recent years. My […]
LSX package upgraded as its paper is published in CMM
I am pleased to tell you that my paper, Latent Semantic Scaling: A Semisupervised Text Analysis Technique for New Domains and Languages, was published in Communication Methods and Measures a few days ago. This paper explains the Latent Semantic Scaling technique, which is implemented in the LSX package available on CRAN, taking sentiment analysis […]
Uploaded two new semisupervised models to CRAN
This summer, I submitted two packages for quantitative text analysis to CRAN: seededlda and LSX. These packages have been available in my GitHub repositories, but I thought it was time to make them more readily available to promote semisupervised machine learning techniques. seededlda is a package that implements seeded-LDA using the GibbsLDA++ library. […]
Why is quanteda so fast?
Those who read my recent post on quanteda’s performance might wonder why the package is so fast. It is not only because we carefully wrote the package’s R code, but also because we optimized its internal functions and objects for large textual data. There are three design features of quanteda that dramatically enhance its performance. Upfront data […]
R and Python text analysis packages performance comparison – updated
I compared the performance of R and Python in 2017, when we were developing quanteda v1.0, and confirmed that our package’s execution time was around 50% shorter, and its peak memory consumption 40% smaller, than gensim’s. Two years later, we are developing quanteda v2.0, which will be released early next year. We are improving the […]