A few days ago, I received an email from a researcher asking if text analysis is becoming irrelevant because of artificial intelligence (AI). I replied to her briefly, saying that text analysis methods and AI products serve different purposes, but I thought I needed to write more to answer this question. In fact, I had […]
A new topic model for analyzing imbalanced corpora
I have been developing and testing a new topic model called Distributed Asymmetric Allocation (DAA) because latent Dirichlet allocation (LDA) takes a long time to fit to a large corpus and does not always discover topics that I am interested in. I know that these are also problems for many other users, so I decided […]
Automatically adjusting alpha for small and large topics
Since the release of the seededlda package v1.0, which implements the distributed LDA algorithm, I have been applying topic models to many different corpora. In doing so, I became increasingly aware that it is important to optimize the Dirichlet prior of the document-topic distribution, alpha, to identify topics that are interesting for social scientists. Alpha, as […]
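To illustrate what alpha does, here is a minimal sketch of fitting LDA with different fixed priors. It assumes the k and alpha arguments of seededlda's textmodel_lda() and uses quanteda's built-in inaugural corpus; it shows fixed values rather than the automatic adjustment discussed in the post.

```r
library(quanteda)
library(seededlda)

toks <- tokens(data_corpus_inaugural, remove_punct = TRUE) |>
  tokens_remove(stopwords("en"))
dfmat <- dfm(toks)

# A small alpha concentrates each document on a few topics; a large
# alpha spreads documents over many topics.
lda_small <- textmodel_lda(dfmat, k = 10, alpha = 0.1)
lda_large <- textmodel_lda(dfmat, k = 10, alpha = 1.0)

# Compare the estimated sizes of the topics.
sizes(lda_small)
sizes(lda_large)
```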
Develop efficient custom functions using quanteda v4.0
The most important change in quanteda v4.0 is the creation of the external pointer-based tokens object, called tokens_xptr, that allows us to write efficient custom functions. In earlier versions of the package, modifying tokens using tokens_*() required long execution times and large amounts of memory because the data had to be transferred between R and C++. Such inefficiency […]
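A minimal sketch of the workflow, assuming quanteda v4's as.tokens_xptr() and as.tokens() converters:

```r
library(quanteda)

toks <- tokens(data_corpus_inaugural, remove_punct = TRUE)

# Convert to an external pointer so that tokens_*() functions operate
# on the underlying C++ object without copying data between R and C++.
xtoks <- as.tokens_xptr(toks)
xtoks <- tokens_remove(xtoks, stopwords("en"), padding = TRUE)
xtoks <- tokens_ngrams(xtoks, n = 2)

# Build a dfm directly from the pointer, or convert back to a regular
# tokens object. Note that tokens_xptr objects are modified in place,
# so keep the original tokens object if you still need it.
dfmat <- dfm(xtoks)
toks2 <- as.tokens(xtoks)
```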
New tokens object in quanteda v4.0
I am very happy that we were able to release quanteda v4.0 after more than a year of development. For this release, I improved the package’s efficiency by creating a new tokens object and writing many internal functions in C++ to allow users to process millions of documents on a laptop (or tens of millions on a […]
Perform context-specific operations on tokens using a new argument
The bag-of-words approach is common in text analysis, but it has trouble distinguishing between the meanings of words that depend on their context. To address this issue, we added an argument to the new version of quanteda (v4.0) that allows users to select, remove, or look up tokens that occur in specific contexts. We […]
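As a sketch of what a context-specific operation might look like, the code below reshapes a corpus into sentences so that each "context" is a sentence, and removes stopwords only in sentences that mention a keyword. It assumes the new argument is named apply_if and takes a document-level logical vector; check the quanteda v4.0 documentation for the exact interface.

```r
library(quanteda)

# Treat each sentence as a document so that the context is a sentence.
corp <- corpus_reshape(data_corpus_inaugural, to = "sentences")
toks <- tokens(corp, remove_punct = TRUE)

# Condition: sentences that contain "peace" (an arbitrary example).
has_peace <- ntoken(tokens_select(toks, "peace")) > 0

# Remove stopwords only in those sentences, leaving the rest untouched.
toks2 <- tokens_remove(toks, stopwords("en"), apply_if = has_peace)
```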
Group polarity words with different colors in LSS
We should always ensure that we are measuring what we want to measure in text analysis. In Latent Semantic Scaling (LSS), we can assess the validity of measurement by inspecting the polarity scores of words using LSX::textplot_terms(). This function automatically selects words with very high or low polarity scores and highlights them. We can confirm that […]
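A sketch of this kind of validation, assuming that textplot_terms() accepts a quanteda dictionary through its highlighted argument and colors the matched words by dictionary key; the dictionary patterns below are made up for illustration.

```r
library(quanteda)
library(LSX)

corp <- corpus_reshape(data_corpus_inaugural, to = "sentences")
toks <- tokens(corp, remove_punct = TRUE) |>
  tokens_remove(stopwords("en"))
dfmat <- dfm(toks)

# Fit LSS with the generic sentiment seed words shipped with LSX.
lss <- textmodel_lss(dfmat, seeds = as.seedwords(data_dictionary_sentiment))

# Group highlighted words by dictionary key, one color per key.
dict <- dictionary(list(economy = c("econom*", "trade*", "tax*"),
                        security = c("war*", "peace*", "secur*")))
textplot_terms(lss, highlighted = dict)
```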
Setting fonts for Chinese polarity words in LSS
I always recommend that users of the LSX package visualize polarity words using textplot_terms() because it allows them to explain intuitively to others what their fitted LSS models are measuring. I am using this function myself in my project on constructing a geopolitical threat index (GTI) for multiple countries, including China and Japan, but […]
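One general workaround, independent of any LSX-specific option, is to map ggplot2's default "sans" family to a CJK font with the showtext package before plotting. The font file name below is an assumption; point it to a CJK font installed on your system.

```r
library(showtext)

# Register a CJK font under the default "sans" family (path assumed).
font_add("sans", regular = "NotoSansCJKsc-Regular.otf")
showtext_auto()

# lss is a fitted LSS model, e.g. from textmodel_lss(); the Chinese
# polarity words should now render with the registered font.
textplot_terms(lss)
```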
New paper on semantic temporality analysis
My co-authored paper on the temporal orientation of texts appeared in Research & Politics. In this study, we applied latent semantic scaling (LSS) to a corpus of English and German texts to automatically identify features related to the future or the past. With only a set of common verbs as seed words, the algorithm could classify sentences […]
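As a rough sketch of the approach, with hypothetical seed words rather than the paper's actual set, a future-past scale could be fitted like this:

```r
library(quanteda)
library(LSX)

# Hypothetical seed words: common verbs marking the future (+1)
# and the past (-1). The paper's actual seed set may differ.
seed <- c("will" = 1, "shall" = 1, "going" = 1,
          "was" = -1, "were" = -1, "had" = -1)

# Stopwords are kept because the seed words are function words.
corp <- corpus_reshape(data_corpus_inaugural, to = "sentences")
dfmat <- dfm(tokens(corp, remove_punct = TRUE))
lss <- textmodel_lss(dfmat, seeds = seed)

# Positive scores lean toward the future, negative toward the past.
head(predict(lss, newdata = dfmat))
```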
Use pre-trained GloVe word vectors in LSS
I made it possible to use pre-trained word vectors for Latent Semantic Scaling (LSS) in version 0.9 of the LSX package, but I don’t think I explained how to do it. It can be done easily with the as.textmodel_lss() function, but you need to load the word vectors into R as a dense matrix beforehand. […]
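A sketch of the procedure, assuming glove.6B.300d.txt from the Stanford NLP site has been downloaded and unzipped into the working directory, and that as.textmodel_lss() expects words in the columns of the matrix (hence the transpose; check ?as.textmodel_lss):

```r
library(LSX)

# Read the GloVe vectors: each row is a word followed by 300 numbers.
dat <- read.table("glove.6B.300d.txt", quote = "", comment.char = "")
mat <- as.matrix(dat[, -1])
rownames(mat) <- dat[, 1]

# Build an LSS model from the pre-trained vectors and sentiment seeds.
lss <- as.textmodel_lss(t(mat), seeds = as.seedwords(data_dictionary_sentiment))

# Inspect the polarity scores of words.
head(coef(lss), 20)
```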