I have been developing and testing a new topic model called Distributed Asymmetric Allocation (DAA) because latent Dirichlet allocation (LDA) takes a long time to fit to a large corpus and does not always discover the topics that I am interested in. I know that these are problems for many other users too, so I decided to write a research paper on topic models. As always, I used the corpus of the United Nations General Assembly speeches (444,206 sentences). Since a small subset of the corpus is manually labelled for topics, I could evaluate the performance of the new topic model systematically.
LDA takes a long time to fit to a large corpus because its algorithm iteratively assigns topics to each word in the corpus. This problem can be addressed by distributed computing and convergence detection. Distributed computing is especially effective when the number of topics is large: the execution time decreased on average from 46 minutes to 8 minutes when k = 50. When the number of topics is small, convergence detection is more effective: it reduced the number of iterations from 2,000 to 200 in most cases without compromising classification accuracy.
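As a rough sketch, the two speed-ups can be enabled like this, assuming a document-feature matrix dfmat built with quanteda; auto_iter is my reading of the convergence-detection switch in recent versions of seededlda, and the value of batch_size is purely illustrative:

```r
library(quanteda)
library(seededlda)

# dfmat is assumed to be a document-feature matrix of the UN General
# Assembly sentences (construction omitted here)
lda <- textmodel_lda(
  dfmat,
  k = 50,             # number of topics
  batch_size = 0.01,  # batch_size < 1 splits the corpus for distributed sampling
  auto_iter = TRUE,   # stop the Gibbs sampler early once it has converged
  max_iter = 2000
)
```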
LDA does not always show the desired results because it identifies topics solely based on the co-occurrences of words. Seeded LDA can be used to weakly supervise the algorithm to identify pre-defined topics, but it also requires asymmetric Dirichlet priors when topics are highly specific. When the priors (alpha) were adjusted, topics such as “Security” and “Development” became more frequent in the corpus while other topics became less frequent.
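A minimal sketch of weak supervision with prior adjustment, again assuming dfmat; the dictionary keys and glob patterns below are made up for illustration, and the value of adjust_alpha is arbitrary:

```r
library(quanteda)
library(seededlda)

# illustrative seed words for two pre-defined topics
dict <- dictionary(list(
  security    = c("peace*", "terror*", "conflict*"),
  development = c("develop*", "econom*", "poverty")
))

slda <- textmodel_seededlda(
  dfmat, dict,
  residual = TRUE,    # keep a residual topic for unseeded content
  adjust_alpha = 0.5  # adjust_alpha > 0 optimizes asymmetric priors
)
```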
Thanks to these changes in topic frequencies, the overall F1 score increased by more than 10% when the prior adjustment was combined with strong sequential sampling. The improvement in the F1 scores was particularly large for “Security”, “Human rights” and “Democracy”, which are, in fact, among the most and least common topics at the General Assembly.
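For concreteness, per-topic F1 against the manually labelled subset can be computed along these lines; lab_pred and lab_manual are hypothetical vectors of predicted and manual topic labels per sentence:

```r
# per-topic F1 scores from predicted and manual labels
f1_by_topic <- function(lab_pred, lab_manual) {
  sapply(sort(unique(lab_manual)), function(k) {
    tp   <- sum(lab_pred == k & lab_manual == k)  # true positives
    prec <- tp / sum(lab_pred == k)               # precision
    rec  <- tp / sum(lab_manual == k)             # recall
    2 * prec * rec / (prec + rec)
  })
}
```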
The improved accuracy has a great impact on the results of substantive analysis. When the priors were not adjusted (symmetric model), the frequencies of sentences about the topics were close to each other. However, when they were adjusted (asymmetric model), the frequencies of topics varied more widely and responded more strongly to important events such as the Kosovo War (1998–1999), the 9/11 attacks (2001), and the Arab Spring (2011–2012).
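To illustrate this kind of analysis, assuming a year document variable on the corpus (a hypothetical name), yearly topic proportions can be tabulated from the fitted model with the package's topics() accessor:

```r
library(quanteda)
library(seededlda)

# share of sentences assigned to each topic by year
tb <- table(docvars(dfmat, "year"), topics(slda))
prop <- prop.table(tb, margin = 1)   # normalize within each year

# e.g. plot the trajectory of the "security" topic over time
plot(as.numeric(rownames(prop)), prop[, "security"], type = "l",
     xlab = "Year", ylab = "Proportion of sentences")
```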
The key takeaway from this study is that we can classify sentences from a large corpus using LDA, but we must use an asymmetric model to identify specific topics. If you cannot optimize alpha manually, you should use DAA. This model is already available as part of my seededlda package, so you only need to set batch_size < 1 and adjust_alpha > 0 in the functions to enable distributed computing and Dirichlet prior optimization. Please give it a try and let me know how it works with your data.
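Concretely, a call along these lines should enable both at once; the argument values are illustrative, and dfmat and dict are as in the sketches above:

```r
library(seededlda)

daa <- textmodel_seededlda(
  dfmat, dict,
  batch_size = 0.01,  # < 1 enables distributed computing
  adjust_alpha = 0.5  # > 0 enables Dirichlet prior optimization
)
```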