Uploaded two new semisupervised models to CRAN

This summer, I submitted two packages for quantitative text analysis to CRAN: seededlda and LSX. These packages have been available in my GitHub repositories, but I thought it was time to make them more readily available to promote semisupervised machine learning techniques.

seededlda is a package that implements seeded-LDA using the GibbsLDA++ library. The library has been used in the topicmodels package too, but I decided to incorporate the C++ code into my package to simplify the workflow with quanteda (we no longer need to convert a DFM to a DTM to run topic models). However, when I was writing the code, I noticed that topicmodels seemingly handles seed words differently from the original model: seededlda modifies only the initial values of the topic-word distribution, while topicmodels generates multiple beta parameters and modifies them using the seed words. This made the difference between the new and old packages in seeded-LDA larger than initially planned, but they should be roughly the same in (unseeded) LDA.
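To make the idea of "modifying only the initial values" concrete, here is a minimal conceptual sketch in Python. It is not the code of seededlda or topicmodels: the function name `seed_initial_counts`, the `weight` parameter, and the toy vocabulary are all assumptions made for illustration. It only shows how seed words could be turned into initial pseudo-counts in a topic-by-word table before sampling starts.

```python
def seed_initial_counts(vocab, seeds, weight=1.0):
    """Build a topic-by-word table of initial pseudo-counts from seed words.

    Hypothetical helper for illustration only; not part of any package.
    """
    index = {word: j for j, word in enumerate(vocab)}
    topics = list(seeds)  # one topic per dictionary key, in insertion order
    counts = [[0.0] * len(vocab) for _ in topics]
    for k, topic in enumerate(topics):
        for word in seeds[topic]:
            if word in index:  # seed words missing from the vocabulary are skipped
                counts[k][index[word]] += weight
    return topics, counts


# Toy vocabulary and seed dictionary (made up for this example)
vocab = ["market", "money", "vote", "election", "weather"]
seeds = {"economy": ["market", "money"], "politics": ["vote", "election"]}

topics, init = seed_initial_counts(vocab, seeds, weight=10.0)
for topic, row in zip(topics, init):
    print(topic, row)
```

In this sketch the seed words only give each topic a head start; every other entry begins at zero and would be filled in by the sampler.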

LSX was called LSS until it was submitted to CRAN. It is a package for Latent Semantic Scaling, a word embedding-based document scaling technique. I started developing the R package in 2016, but its original Python program is even older (written around 2014-2015, when I was a PhD student). It took so long to submit the package to CRAN because I needed to develop quanteda to process large corpora in R and establish a research methodology for the technique before releasing it. Finally, I made quanteda capable of handling millions of documents and wrote a methodological paper that should explain to users of LSS how to apply the model in research.

I really hope that the two packages will make semisupervised machine learning techniques more accessible to researchers and help them perform theoretically motivated, large-scale analysis of textual data in the social sciences and humanities.

2 thoughts on “Uploaded two new semisupervised models to CRAN”

Hi Kohei,
Great to see that the LSX package is on CRAN! Just one small technical question regarding seed words: is it possible to use glob or even regex as seed words? In the Rdocumentation I see no specification and there appears to be no way to specify what type of words I provide to textmodel_lss as seedwords…
Looking forward to hearing from you!
Sébastien

Kohei says:

September 14, 2021 at 8:49 am

Yes, you can use a glob pattern like “great*” because seed words are always interpreted as glob patterns. However, you should be very careful about false matches of these patterns, because they negatively affect the results of the analysis. In other words, you should include only a few good seed words as fixed patterns.
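The risk of false matches can be illustrated with a small sketch that uses Python's standard `fnmatch` module. This is not how LSX matches seed words internally; the helper `match_seed`, the toy vocabulary, and the second pattern “good*” are assumptions made for the example.

```python
from fnmatch import fnmatchcase


def match_seed(pattern, vocab):
    """Return the vocabulary entries matched by a glob pattern (case-sensitive)."""
    return [word for word in vocab if fnmatchcase(word, pattern)]


# Toy vocabulary (made up for this example)
vocab = ["great", "greatly", "greatest", "good", "goods", "goodbye"]

wide_great = match_seed("great*", vocab)  # wildcard pattern from the reply above
wide_good = match_seed("good*", vocab)    # hypothetical wildcard pattern
fixed_good = match_seed("good", vocab)    # fixed pattern: exact match only

print(wide_great)
print(wide_good)
print(fixed_good)
```

Here “good*” also picks up “goods” and “goodbye”, which have little to do with positive sentiment, while the fixed pattern “good” matches only the intended word.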


Kohei
