Since the release of the seededlda package v1.0, which implements the distributed LDA algorithm, I have been applying topic models to many different corpora. In doing so, I became increasingly aware that it is important to optimize the Dirichlet prior of the document-topic distribution, alpha, to identify topics that are interesting for social scientists. Alpha, as […]
Develop efficient custom functions using quanteda v4.0
The most important change in quanteda v4.0 is the creation of the external pointer-based tokens object, called tokens_xptr, that allows us to write efficient custom functions. In earlier versions of the package, modifying tokens with tokens_*() functions required long execution times and a lot of memory because they transferred data between R and C++. Such inefficiency […]
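A minimal sketch of how the new object might be used, assuming quanteda v4.0 with as.tokens_xptr() for the conversion and tokens_*() functions operating on the external pointer without copying:

```r
library(quanteda)

# Create a regular tokens object from the built-in inaugural speech corpus
toks <- tokens(data_corpus_inaugural)

# Convert to an external pointer-based tokens object; subsequent tokens_*()
# calls modify the data held in C++ rather than transferring it back to R
xtoks <- as.tokens_xptr(toks)
xtoks <- tokens_remove(xtoks, stopwords("en"))

# dfm() can be called on the tokens_xptr object at the end of the chain
dfmat <- dfm(xtoks)
```

The idea is to pay the R-to-C++ transfer cost once, chain the modifications on the pointer, and only materialize a result at the end.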
New tokens object in quanteda v4.0
I am very happy that we could release quanteda v4.0 after more than a year of development. For this release, I improved the package’s efficiency by creating a new tokens object and writing many internal functions in C++ to allow users to process millions of documents on a laptop (or tens of millions on a […]
Perform context-specific operations on tokens using a new argument
The bag-of-words approach is common in text analysis, but it has problems distinguishing between meanings of words that depend on their contexts. In order to address this issue, we added an argument to the new version of quanteda (v4.0) that allows users to select, remove, or look up tokens that occur in specific contexts. We […]
Group polarity words with different colors in LSS
We should always ensure that we are measuring what we want to measure in text analysis. In Latent Semantic Scaling (LSS), we can assess the validity of measurement by inspecting the polarity scores of words using LSX::textplot_terms(). This function automatically selects words with very high or low polarity scores and highlights them. We can confirm that […]
Setting fonts for Chinese polarity words in LSS
I always recommend that users of the LSX package visualize polarity words using textplot_terms() because it allows them to explain intuitively to others what their fitted LSS models are measuring. I am using this function myself in my project on the construction of a geopolitical threat index (GTI) for multiple countries, including China and Japan, but […]
New paper on semantic temporality analysis
My co-authored paper on the temporal orientation of texts appeared in Research & Politics. In this study, we applied latent semantic scaling (LSS) to a corpus of English and German texts to automatically identify features related to the future or the past. With only a set of common verbs as seed words, the algorithm could classify sentences […]
Use pre-trained Glove word vectors in LSS
I made it possible to use pre-trained word vectors for Latent Semantic Scaling (LSS) in version 0.9 of the LSX package, but I don’t think I have explained how to do so. It can be done easily using the as.textmodel_lss() function, but you need to load the word vectors into R as a dense matrix beforehand. […]
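A rough sketch of the workflow, with assumptions clearly flagged: the GloVe file name is a placeholder for whichever pre-trained file you downloaded, and the expected orientation of the matrix (words as row names) should be checked against ?as.textmodel_lss before use:

```r
library(LSX)

# Read pre-trained GloVe vectors into a dense matrix. Each line of the file
# is a word followed by its vector, so the words become row names.
# "glove.6B.300d.txt" is an assumed file name; adjust to your download.
glove <- read.table("glove.6B.300d.txt", row.names = 1,
                    quote = "", comment.char = "")
mat <- as.matrix(glove)

# Build seed words from the sentiment dictionary bundled with LSX
seed <- as.seedwords(data_dictionary_sentiment)

# Construct an LSS model directly from the dense matrix of word vectors
lss <- as.textmodel_lss(mat, seeds = seed)
```

Once constructed, the model can be used like a fitted LSS model, for example to predict document polarity or to inspect polarity words with textplot_terms().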
Tutorial websites on LSS and Seeded LDA
I have written about my packages in different places, including in my blog posts, but I decided to explain how to use them on dedicated websites about Latent Semantic Scaling and Seeded LDA. I thought this was necessary because the methodology behind these packages is becoming more established with the new functions that I have added to […]
Serving as a discussant at a workshop of the Japan Association for Media, Journalism and Communication Studies (日本メディア学会)
On 11 June, I took part as a discussant in the association's workshop titled "Trends in Quantitative Text Analysis in Media Research". Through the discussion, I tried to place Yu Haichun's important research on media control in China in a broader context from the perspective of text analysis. More than 60 people attended online, and it was a very worthwhile event. For a summary of the content, please see the slides used in the presentation. Also, as promised during the Q&A session, I have created a page on how to select seed words, so please make use of it.