I have been studying and developing an LDA algorithm for classifying sentences since 2022. Sentence-level topic classification allows us to analyze the association between topics and other properties, such as sentiment, within documents. Sentence-level analysis has also become more common in text analysis in general in recent years, thanks to highly capable transformer models.
My research was published in the paper "Seeded Sequential LDA: A Semi-Supervised Algorithm for Topic-Specific Analysis of Sentences" in Social Science Computer Review. In this co-authored paper, we applied Seeded Sequential LDA to classify sentences from speeches delivered by delegates at United Nations General Assembly meetings and presented an example of topic-specific sentiment analysis. The plot shows sentiment on security and development for countries from five world regions over the post-Cold War period. We measured sentiment using the Lexicoder Sentiment Dictionary for simplicity, but more sophisticated methods can be combined with topic analysis.
Through this research, I found that analyzing increasingly large corpora with LDA takes too long. To make topic modeling faster, I implemented algorithms for parallel computing and convergence detection in my seededlda package (v1.0). I tested the algorithms by running them to identify 100 topics in a corpus of 10,000 news articles and described the results in a working paper, "Speed Up Topic Modeling: Distributed Computing and Convergence Detection for LDA".
The next plot shows that the execution time fell sharply from 1 processor to 8 processors in both the sequential models (sequential = TRUE) and the non-sequential models (sequential = FALSE), thanks to parallel computing. When iterative Gibbs sampling is terminated on convergence (auto_iter = TRUE), the execution time becomes another 50% shorter in the non-sequential models.
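To illustrate the general idea of terminating Gibbs sampling on convergence, here is a minimal Python sketch of a toy collapsed Gibbs sampler for LDA. It is not the seededlda implementation. The function name gibbs_lda, the perplexity-based stopping rule, and the check_every and tol parameters are all assumptions made for this example.

```python
import math
import random


def gibbs_lda(docs, k, vocab_size, alpha=0.5, beta=0.1,
              max_iter=2000, check_every=10, tol=1e-3, seed=42):
    """Toy collapsed Gibbs sampler that stops once perplexity plateaus.

    docs: list of documents, each a list of word ids in [0, vocab_size).
    Returns (document-topic counts, iterations run, final perplexity).
    """
    rng = random.Random(seed)
    n_dk = [[0] * k for _ in docs]                # document-topic counts
    n_kw = [[0] * vocab_size for _ in range(k)]   # topic-word counts
    n_k = [0] * k                                 # tokens per topic
    z = []                                        # topic of every token

    # Assign a random topic to every token to start with.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            t = rng.randrange(k)
            z_d.append(t)
            n_dk[d][t] += 1
            n_kw[t][w] += 1
            n_k[t] += 1
        z.append(z_d)

    def perplexity():
        # Perplexity of the training tokens under the current counts.
        log_lik, n_tokens = 0.0, 0
        for d, doc in enumerate(docs):
            denom_d = len(doc) + k * alpha
            for w in doc:
                p = sum(
                    (n_dk[d][t] + alpha) / denom_d
                    * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(k)
                )
                log_lik += math.log(p)
                n_tokens += 1
        return math.exp(-log_lik / n_tokens)

    prev = perplexity()
    for it in range(1, max_iter + 1):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove the token from the counts before resampling it.
                n_dk[d][t] -= 1
                n_kw[t][w] -= 1
                n_k[t] -= 1
                weights = [
                    (n_dk[d][j] + alpha) * (n_kw[j][w] + beta)
                    / (n_k[j] + vocab_size * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                n_dk[d][t] += 1
                n_kw[t][w] += 1
                n_k[t] += 1
        if it % check_every == 0:
            cur = perplexity()
            # Stop when perplexity no longer improves by more than tol.
            if (prev - cur) / prev < tol:
                return n_dk, it, cur
            prev = cur
    return n_dk, max_iter, perplexity()


# Tiny made-up corpus: word ids 0-2 and 3-5 form two rough themes.
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 2, 1, 0], [5, 3, 4, 4]]
counts, iters, ppl = gibbs_lda(docs, k=2, vocab_size=6, max_iter=500)
print(iters, round(ppl, 3))
```

Checking only every few iterations keeps the cost of monitoring low, and the sampler can return well before max_iter once further sweeps stop improving the fit. Because Gibbs sampling is stochastic, a rule like this is a heuristic, not a guarantee of convergence.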
Faster sequential LDA algorithms are very useful not only for topic classification at the sentence level but also for creating sentence vectors that summarize their content. Once sentence vectors are created, other machine learning methods can be applied to the textual data easily.
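As a small illustration of what sentence vectors make possible, the Python sketch below treats each sentence's topic proportions as a vector and finds the most similar sentence by cosine similarity. The vectors, the sentence labels, and the function names are made up for this example and do not come from the package or the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(name, vectors):
    """Return the label of the vector closest to vectors[name]."""
    others = [key for key in vectors if key != name]
    return max(others, key=lambda key: cosine(vectors[name], vectors[key]))


# Hypothetical topic proportions for three sentences over three topics.
sentence_vectors = {
    "s1": [0.8, 0.1, 0.1],
    "s2": [0.7, 0.2, 0.1],
    "s3": [0.1, 0.1, 0.8],
}

print(most_similar("s1", sentence_vectors))  # prints "s2"
```

The same vectors could be passed to a clustering or classification method in place of the similarity search shown here.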