Since the release of the seededlda package v1.0, which implements the distributed LDA algorithm, I have been applying topic models to many different corpora. In doing so, I became increasingly aware that it is important to optimize the Dirichlet prior of the document-topic distribution, alpha, to identify topics that are interesting for social scientists.
Alpha, as a parameter that determines the dispersion of topics over documents, is used to smooth the topic probabilities in Gibbs sampling. If alpha = 1.0 for a topic, LDA assumes that words about the topic occur at least once in all the documents. If the value of alpha is large for a topic, the topic tends to be larger in number of words; if the values are equal, topics tend to be of the same size.
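The smoothing role of alpha can be illustrated with the textbook collapsed Gibbs sampling formula, in which alpha is added to the document-topic counts like a pseudo-count. The sketch below is not code from seededlda; the function name `topic_prob_sketch` and all the counts are made up for illustration.

```r
# Textbook collapsed Gibbs sampling weight for assigning one token to topic k:
#   (n_dk + alpha_k) * (n_kw + beta) / (n_k + V * beta)
# n_dk: tokens in the document already assigned to each topic
# n_kw: times the current word is assigned to each topic in the corpus
# n_k : total tokens assigned to each topic; V: vocabulary size
topic_prob_sketch <- function(n_dk, n_kw, n_k, alpha, beta, V) {
  w <- (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)
  w / sum(w)  # normalize the weights into probabilities
}

# Toy document: three tokens in topic 1, none yet in topic 2
n_dk <- c(3, 0)
n_kw <- c(2, 2)
n_k <- c(50, 50)

p_sym <- topic_prob_sketch(n_dk, n_kw, n_k, alpha = c(0.5, 0.5), beta = 0.1, V = 100)
p_asy <- topic_prob_sketch(n_dk, n_kw, n_k, alpha = c(0.5, 1.0), beta = 0.1, V = 100)
print(p_sym)
print(p_asy)
```

With alpha = 1.0, topic 2 is treated as if one of its words had already been seen in the document, so its probability is higher than with alpha = 0.5 even though its actual count is zero.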
Many topic model packages use the same value of alpha for all the topics (“symmetric prior”) by default, but it is often unrealistic to assume that topics are of the same size in the real world, because people talk or write more about important topics and less about others. We can manually set different values of alpha (“asymmetric prior”) to identify topics of different sizes, or ask the algorithm to determine the values automatically.
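To see what the two kinds of prior imply before any data is observed, we can draw document-topic distributions from a Dirichlet distribution and compare the average topic shares. This is only a minimal base-R sketch: `rdirichlet_sketch` is a helper written for this illustration, and the alpha values are chosen as examples.

```r
# Draw n samples from Dirichlet(alpha) by normalizing independent gamma draws
rdirichlet_sketch <- function(n, alpha) {
  K <- length(alpha)
  # column k of the matrix is drawn with shape alpha[k]
  g <- matrix(rgamma(n * K, shape = rep(alpha, each = n)), nrow = n, ncol = K)
  g / rowSums(g)
}

set.seed(1234)
alpha_sym <- rep(0.5, 10)                  # symmetric prior
alpha_asy <- rep(c(0.75, 0.25), c(5, 5))   # asymmetric prior

theta_sym <- rdirichlet_sketch(20000, alpha_sym)
theta_asy <- rdirichlet_sketch(20000, alpha_asy)

# Average share of each topic across the simulated documents
print(round(colMeans(theta_sym), 2))
print(round(colMeans(theta_asy), 2))
```

The expected share of a topic under the prior is alpha_k / sum(alpha), so the symmetric prior gives every topic about 10%, while the asymmetric prior gives about 15% to each of the first five topics and about 5% to each of the last five.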
Wallach (2009) proposed a sophisticated algorithm to optimize alpha, but I think I have achieved this in a very simple and efficient manner in seededlda v1.4. In the package, the default value of alpha is the same for all the topics, but the values are adjusted iteratively based on the result of Gibbs sampling if adjust_alpha > 0. This feature is very new and still experimental, but I want you to use it and tell me how it worked with your data (v1.4 is available only on GitHub as of 22 August 2024).
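The general idea of adjusting alpha from the sampling result can be sketched as follows. Please note that this is a hypothetical rule written only for illustration and is not the algorithm implemented in seededlda: the function `adjust_alpha_sketch`, its clamping rule and the topic shares are all assumptions.

```r
# Hypothetical adjustment rule (NOT the seededlda implementation):
# scale each alpha by how much the topic's share of words deviates from
# the uniform share 1 / K, and cap the change at +/- rate.
adjust_alpha_sketch <- function(alpha_init, share, rate) {
  K <- length(alpha_init)
  deviation <- pmin(pmax(K * share - 1, -1), 1)  # clamp to [-1, 1]
  alpha_init * (1 + rate * deviation)
}

alpha_init <- rep(0.5, 10)
# Made-up shares of words assigned to each topic after a round of sampling
share <- c(0.18, 0.12, 0.10, 0.10, 0.10, 0.10, 0.10, 0.08, 0.07, 0.05)

alpha_new <- adjust_alpha_sketch(alpha_init, share, rate = 0.5)
print(round(alpha_new, 3))
```

Under this toy rule, topics that attract more words than average receive a larger alpha and smaller topics a smaller one, while rate = 0.5 keeps every value within 50% of its initial value.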
Example: UN General Assembly speeches
In this example, I fit LDA with 10 topics to the sentences from the 2017 United Nations General Assembly speeches with different values of alpha. In the symmetric model, I used the default value alpha = 0.5 for all topics; in the asymmetric model, I set alpha = 0.75 for the first half of the topics and alpha = 0.25 for the second half; in the adjusted model, I asked the new algorithm to change the values by up to 50%. The R code to make this example is available in an RMD file.
require(quanteda.corpora)
require(quanteda)
require(seededlda)
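# split the speeches into sentences (the default unit of corpus_reshape())
corp <- data_corpus_ungd2017 %>%
    corpus_reshape()
# tokenize, then remove geographical terms, stopwords and one-character tokens
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE,
               remove_numbers = TRUE) %>%
    tokens_remove(newsmap::data_dictionary_newsmap_en) %>%
    tokens_remove(stopwords("en"), min_nchar = 2)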
dfmt <- dfm(toks)
# symmetric model
lda_sym <- textmodel_lda(dfmt, k = 10, gamma = 0.25, alpha = 0.5)
# asymmetric model
lda_asy <- textmodel_lda(dfmt, k = 10, gamma = 0.25, alpha = rep(c(0.75, 0.25), c(5, 5)))
# adjusted model
lda_adj <- textmodel_lda(dfmt, k = 10, gamma = 0.25, alpha = 0.5, adjust_alpha = 0.5)
Alpha and topic sizes
The first plot shows the values of alpha after fitting the models. The values in the symmetric and asymmetric models are as described above. In the adjusted model, they range between 0.4 and 0.7 because of the automatic adjustment from the initial value. In the second plot, we can see that the sizes of the topics (in number of words) reflect the values of alpha in all the models. The sizes are only between 7.5% and 12% in the symmetric model, whereas they are between 2.5% and 18% in the asymmetric model and between 5% and 18% in the adjusted model. The difference between small and large topics is much greater in the latter two.
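One simple way to think about topic sizes in number of words is to weight each document's topic proportions by its length. The sketch below uses a made-up document-topic matrix and made-up document lengths; I assume this definition only for illustration, and the package may compute sizes differently in its details.

```r
# Toy document-topic proportions (rows: documents, columns: topics)
theta <- rbind(c(0.9, 0.1),
               c(0.5, 0.5),
               c(0.2, 0.8))
# Toy document lengths in number of words
n_words <- c(10, 20, 70)

# Expected number of words per topic, as a share of all words
size <- colSums(theta * n_words) / sum(n_words)
print(size)
```

The second topic is larger here mainly because it dominates the longest document, which is why sizes in number of words can differ from a simple average of the rows of theta.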
Topic terms
From earlier studies, we know that diplomats talk a lot about security, development and UN institutions but little about human rights at the General Assembly. To compare the topics between models, I sorted them by their sizes in descending order below (left to right; top to bottom). We can find development, UN institutions and international peace in the top three topics in all the models, but the topics about human rights are in different positions. Human rights is mixed with sustainability (7th) in the symmetric model; it is mixed with international peace (3rd) in the asymmetric model; it is identified as a stand-alone topic (10th) only in the adjusted model. The topics are mashed together in the first two models because the values of alpha are too small or too large.
Symmetric model
Development (topic 2), UN institutions (topic 6) and international peace (topic 7) are the largest topics. Human rights is mixed with sustainability (topic 10).
## topic2 topic6 topic7 topic9
## [1,] "development" "nations" "international" "terrorism"
## [2,] "sustainable" "united" "nuclear" "international"
## [3,] "agenda" "security" "security" "humanitarian"
## [4,] "economic" "peace" "weapons" "conflicts"
## [5,] "goals" "council" "peace" "countries"
## [6,] "national" "international" "republic" "global"
## [7,] "social" "reform" "solution" "must"
## [8,] "implementation" "states" "political" "migration"
## [9,] "government" "global" "democratic" "also"
## [10,] "education" "organization" "council" "fight"
## topic3 topic8 topic10 topic1 topic5
## [1,] "world" "international" "human" "change" "people"
## [2,] "can" "african" "rights" "climate" "years"
## [3,] "future" "cooperation" "peace" "countries" "world"
## [4,] "must" "union" "people" "global" "many"
## [5,] "together" "process" "life" "per" "caribbean"
## [6,] "today" "regional" "respect" "agreement" "past"
## [7,] "one" "peace" "law" "cent" "ago"
## [8,] "time" "countries" "sustainable" "small" "solidarity"
## [9,] "better" "political" "planet" "states" "countries"
## [10,] "challenges" "national" "peoples" "developing" "war"
## topic4
## [1,] "assembly"
## [2,] "general"
## [3,] "like"
## [4,] "president"
## [5,] "session"
## [6,] "also"
## [7,] "mr"
## [8,] "support"
## [9,] "secretary-general"
## [10,] "wish"
Asymmetric model
Development (topic 1), UN institutions (topic 2) and international peace (topic 5) are the largest topics. Human rights also appears in topic 5.
## topic1 topic2 topic5 topic3 topic4
## [1,] "development" "nations" "international" "world" "terrorism"
## [2,] "sustainable" "united" "rights" "people" "world"
## [3,] "economic" "security" "peace" "can" "people"
## [4,] "agenda" "peace" "political" "must" "countries"
## [5,] "national" "international" "human" "peace" "humanitarian"
## [6,] "climate" "states" "security" "future" "international"
## [7,] "change" "council" "state" "today" "global"
## [8,] "goals" "support" "law" "one" "many"
## [9,] "countries" "organization" "dialogue" "together" "also"
## [10,] "social" "reform" "government" "every" "must"
## topic6 topic8 topic9 topic10 topic7
## [1,] "assembly" "change" "nuclear" "years" "per"
## [2,] "general" "climate" "weapons" "year" "cent"
## [3,] "session" "caribbean" "republic" "ago" "million"
## [4,] "president" "natural" "people's" "last" "years"
## [5,] "mr" "disasters" "democratic" "two" "billion"
## [6,] "like" "people" "treaty" "first" "year"
## [7,] "also" "hurricanes" "korea" "past" "now"
## [8,] "people" "recent" "security" "minister" "population"
## [9,] "wish" "solidarity" "threat" "assembly" "world's"
## [10,] "theme" "affected" "korean" "president" "domestic"
Adjusted model
Development (topic 10), UN institutions (topic 1) and international peace (topic 3) are the largest topics. Human rights (topic 4) is the smallest stand-alone topic.
## topic10 topic1 topic3 topic2
## [1,] "development" "nations" "international" "world"
## [2,] "sustainable" "united" "peace" "can"
## [3,] "economic" "security" "political" "people"
## [4,] "agenda" "peace" "security" "must"
## [5,] "national" "international" "dialogue" "future"
## [6,] "countries" "council" "state" "one"
## [7,] "goals" "states" "solution" "together"
## [8,] "international" "global" "support" "today"
## [9,] "social" "must" "peaceful" "peace"
## [10,] "per" "support" "states" "every"
## topic5 topic9 topic7 topic6 topic8
## [1,] "terrorism" "nuclear" "change" "assembly" "years"
## [2,] "people" "weapons" "climate" "general" "year"
## [3,] "world" "republic" "agreement" "session" "past"
## [4,] "humanitarian" "democratic" "global" "president" "ago"
## [5,] "countries" "caribbean" "small" "like" "two"
## [6,] "migration" "people" "island" "people" "first"
## [7,] "also" "solidarity" "pacific" "mr" "last"
## [8,] "international" "treaty" "states" "also" "country"
## [9,] "terrorist" "people's" "natural" "peace" "one"
## [10,] "many" "recent" "developing" "wish" "time"
## topic4
## [1,] "human"
## [2,] "rights"
## [3,] "women"
## [4,] "law"
## [5,] "respect"
## [6,] "right"
## [7,] "equality"
## [8,] "dignity"
## [9,] "fundamental"
## [10,] "peoples"
Model performance
To assess the suitability of the parameters, I computed in-sample perplexity and topic divergence. The perplexity scores are the best (lowest) for the symmetric and adjusted models; the topic divergence scores are the best (highest) for the asymmetric and adjusted models. These two quantitative measures also suggest that the automatic adjustment of alpha is working well.
# Perplexity
##  symmetric asymmetric   adjusted
##   1382.017   1416.700   1395.984
# Topic divergence
##  symmetric asymmetric   adjusted
##  0.4322946  0.4440354  0.4418116
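For readers who are not familiar with perplexity, the sketch below computes in-sample perplexity in base R from a toy document-feature matrix and toy model parameters. I assume the standard definition here (the exponential of the negative average log-likelihood per word); the function `perplexity_sketch` and all the numbers are made up, and the function in the package may differ in its details.

```r
# x    : documents x words matrix of counts
# theta: documents x topics matrix of topic proportions
# phi  : topics x words matrix of word probabilities
perplexity_sketch <- function(x, theta, phi) {
  p <- theta %*% phi  # predicted probability of each word in each document
  exp(-sum(x * log(p)) / sum(x))
}

x <- rbind(c(2, 1, 0, 0),
           c(0, 0, 3, 1))
theta <- rbind(c(0.9, 0.1),
               c(0.2, 0.8))
phi <- rbind(c(0.5, 0.3, 0.1, 0.1),
             c(0.1, 0.1, 0.5, 0.3))

pp_model <- perplexity_sketch(x, theta, phi)
# Baseline: every word is equally likely in every topic
pp_unif <- perplexity_sketch(x, theta, matrix(0.25, nrow = 2, ncol = 4))
print(c(model = pp_model, uniform = pp_unif))
```

A model that knows nothing about the words has perplexity equal to the vocabulary size (4 in this toy example), so lower values mean that the model predicts the observed words better.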
I compared the execution time between the symmetric and adjusted models, and both took about 56 seconds. The adjustment algorithm is also fully compatible with other features of the package such as the seeded and distributed algorithms, so you can improve topic models without losing control or speed.
## Unit: seconds
##       expr      min       lq     mean   median       uq      max neval
##  symmetric 55.87726 56.12995 56.41751 56.32228 56.79633 57.04474    10
##   adjusted 56.06226 56.20806 56.27838 56.28277 56.33151 56.54101    10
Extra: similar features in other packages
There are several other R packages for topic modeling that seem to have similar features, but I am not sure how to use them. If you know how to enable estimation of asymmetric priors, please let me know.
KeyATM
According to Eshima et al. (2024), keyATM can estimate the priors, but the values are 0.1 for all the topics in this example. Naturally, the topic terms are similar to those of my symmetric model.
atm <- keyATM::weightedLDA(keyATM::keyATM_read(dfmt), model = "base",
                           number_of_topics = 10)
print(atm$priors$alpha)
## [1] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
keyATM::top_words(atm)
## Topic_1 Topic_2 Topic_3 Topic_4 Topic_5 Topic_6
## 1 nations nations assembly people world climate
## 2 world united general terrorist people change
## 3 peace development session world human people
## 4 united peace mr attacks countries countries
## 5 can international president lives many world
## 6 must sustainable like state terrorism caribbean
## 7 people security people violence change natural
## 8 international rights also forces climate global
## 9 development human seventy-second children conflict hurricanes
## 10 security agenda congratulate da'esh violence also
## Topic_7 Topic_8 Topic_9 Topic_10
## 1 nations development nuclear development
## 2 international sustainable weapons nations
## 3 united people republic international
## 4 security nations security united
## 5 peace united democratic countries
## 6 world economic people's national
## 7 must must international per
## 8 people international korea security
## 9 rights countries peninsula country
## 10 human also korean support
Topicmodels
The variational expectation maximization (VEM) algorithm in the topicmodels package estimates the value of alpha if estimate.alpha = TRUE, but this also produces symmetric priors.
tm <- topicmodels::LDA(convert(dfmt, "topicmodels"), k = 10, method = "VEM",
                       control = list(estimate.alpha = TRUE))
print(tm@alpha)
## [1] 25.60497
topicmodels::terms(tm, 10)
## Topic 1 Topic 2 Topic 3 Topic 4
## [1,] "development" "people" "peace" "development"
## [2,] "international" "united" "united" "can"
## [3,] "people" "international" "development" "security"
## [4,] "economic" "security" "security" "sustainable"
## [5,] "united" "nations" "support" "international"
## [6,] "must" "country" "must" "world"
## [7,] "countries" "peace" "international" "countries"
## [8,] "assembly" "can" "can" "assembly"
## [9,] "country" "must" "also" "human"
## [10,] "also" "world" "world" "one"
## Topic 5 Topic 6 Topic 7 Topic 8
## [1,] "nations" "nations" "nations" "international"
## [2,] "security" "united" "people" "world"
## [3,] "countries" "peace" "world" "rights"
## [4,] "can" "development" "countries" "like"
## [5,] "like" "world" "also" "country"
## [6,] "people" "sustainable" "change" "people"
## [7,] "economic" "also" "international" "human"
## [8,] "one" "rights" "region" "states"
## [9,] "also" "states" "community" "nations"
## [10,] "development" "international" "political" "must"
## Topic 9 Topic 10
## [1,] "nations" "united"
## [2,] "development" "peace"
## [3,] "support" "world"
## [4,] "united" "human"
## [5,] "global" "countries"
## [6,] "also" "global"
## [7,] "international" "states"
## [8,] "sustainable" "people"
## [9,] "world" "must"
## [10,] "human" "national"