Automatically adjusting alpha for small and large topics

Since the release of the seededlda package v1.0, which implements the distributed LDA algorithm, I have been applying topic models to many different corpora. In doing so, I became increasingly aware that it is important to optimize alpha, the Dirichlet prior of the document-topic distribution, to identify topics that are interesting for social scientists.

Alpha, as a parameter that determines the dispersion of topics over documents, is used to smooth the topic probabilities in Gibbs sampling. If alpha = 1.0 for a topic, LDA assumes that words about the topic occur at least once in all the documents. If the value of alpha is large, topics tend to be larger in number of words; if the values are equal, topics tend to be of the same size.
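
To make the smoothing concrete, here is a minimal sketch in base R. It assumes the usual LDA estimate of document-topic probabilities, (n_dk + alpha_k) / (n_d + sum(alpha)), where alpha works as a pseudo-count; the counts and the function name smooth_theta are made up for illustration and are not part of seededlda.

```r
# Toy counts of words assigned to 3 topics in 2 documents (hypothetical data)
n_dk <- matrix(c(8, 2, 0,
                 0, 1, 9), nrow = 2, byrow = TRUE,
               dimnames = list(c("doc1", "doc2"), paste0("topic", 1:3)))

# Smoothed document-topic probabilities: (n_dk + alpha_k) / (n_d + sum(alpha))
# smooth_theta is a hypothetical helper, not a seededlda function
smooth_theta <- function(n_dk, alpha) {
  alpha <- rep_len(alpha, ncol(n_dk))   # a single value means a symmetric prior
  num <- sweep(n_dk, 2, alpha, "+")     # add alpha_k as a pseudo-count to each topic
  num / rowSums(num)                    # normalize each document to sum to 1
}

theta_sym <- smooth_theta(n_dk, 0.5)
theta_asy <- smooth_theta(n_dk, c(0.75, 0.5, 0.25))
round(theta_sym, 3)
round(theta_asy, 3)
```

Topic 3 has no words in doc1, but it still receives a small probability because of alpha; the probability is smaller when its alpha is smaller.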

Many topic model packages use the same value of alpha for all the topics (“symmetric prior”) by default, but it is often unrealistic to assume that topics are the same size in the real world, because people talk or write about important topics more and others less. We can manually set different values of alpha (“asymmetric prior”) to identify topics of different sizes, or ask the algorithm to determine the values automatically.
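
As a rough illustration of what the prior alone implies, the sketch below draws document-topic distributions from a Dirichlet distribution with a symmetric and an asymmetric alpha and averages them. The expected share of topic k under the prior is alpha_k / sum(alpha). The helper rdirichlet_sketch is my own base R stand-in (normalized gamma draws), and the topic sizes in a fitted model also depend on the data, so this shows only the tendency.

```r
set.seed(1)

# Draw n distributions from Dirichlet(alpha) as normalized gamma variates
# (hypothetical helper written for this illustration)
rdirichlet_sketch <- function(n, alpha) {
  # shape is recycled over topics; byrow = TRUE puts topic k in column k
  g <- matrix(rgamma(n * length(alpha), shape = alpha), nrow = n, byrow = TRUE)
  g / rowSums(g)
}

alpha_sym <- rep(0.5, 10)                   # symmetric prior
alpha_asy <- rep(c(0.75, 0.25), c(5, 5))    # asymmetric prior

share_sym <- colMeans(rdirichlet_sketch(5000, alpha_sym))
share_asy <- colMeans(rdirichlet_sketch(5000, alpha_asy))

# Compare simulated shares with alpha_k / sum(alpha)
round(rbind(simulated = share_asy, expected = alpha_asy / sum(alpha_asy)), 3)
```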

Wallach (2009) proposed a sophisticated algorithm to optimize alpha, but I think I have achieved this in a very simple and efficient manner in seededlda v1.4. In the package, the default value of alpha is the same for all the topics, but the values are adjusted iteratively based on the result of Gibbs sampling if adjust_alpha > 0. This feature is very new and still experimental, but I want you to use it and tell me how it works with your data (v1.4 is available only on GitHub as of 22 August 2024).
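
The sketch below is not the code in seededlda; it only illustrates the general idea of adjusting alpha by topic size under an assumed rule. Here alpha is scaled by the relative size of each topic after a sampling round and then clamped so that it never moves more than a fixed fraction away from its initial value. The function name, the rule and the sizes are all hypothetical.

```r
# Hypothetical update rule for illustration only, NOT the algorithm in seededlda
adjust_alpha_sketch <- function(alpha0, sizes, adjust = 0.5) {
  stopifnot(length(alpha0) == length(sizes), adjust >= 0, adjust < 1)
  rel <- sizes / mean(sizes)            # 1 means an average-sized topic
  target <- alpha0 * rel                # larger topics get larger alpha
  lower <- alpha0 * (1 - adjust)        # assumed lower bound
  upper <- alpha0 * (1 + adjust)        # assumed upper bound
  pmin(pmax(target, lower), upper)      # clamp the change to +/- adjust
}

alpha0 <- rep(0.5, 4)
sizes <- c(0.40, 0.30, 0.20, 0.10)      # made-up topic shares after one round
alpha1 <- adjust_alpha_sketch(alpha0, sizes, adjust = 0.5)
alpha1
```

In an iterative setting, a rule like this would be applied between rounds of Gibbs sampling, so that the sizes and the values of alpha are updated in turn.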

Example: UN General Assembly speeches

In this example, I fit LDA with 10 topics on the sentences from the 2017 United Nations General Assembly speeches with different values of alpha. I used the default value alpha = 0.5 for all topics in the symmetric model; I set alpha = 0.75 for the first half and alpha = 0.25 for the last half in the asymmetric model; I asked the new algorithm to change the values up to 50% in the adjusted model. The R code to make this example is available in an RMD file.

require(quanteda.corpora)
require(quanteda)
require(seededlda)

corp <- data_corpus_ungd2017 %>% 
  corpus_reshape()
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE, 
               remove_numbers = TRUE) %>% 
  tokens_remove(newsmap::data_dictionary_newsmap_en) %>% 
  tokens_remove(stopwords("en"), min_nchar = 2)
dfmt <- dfm(toks)

# symmetric model
lda_sym <- textmodel_lda(dfmt, k = 10, gamma = 0.25, alpha = 0.5)
# asymmetric model
lda_asy <- textmodel_lda(dfmt, k = 10, gamma = 0.25, alpha = rep(c(0.75, 0.25), c(5, 5)))
# adjusted model
lda_adj <- textmodel_lda(dfmt, k = 10, gamma = 0.25, alpha = 0.5, adjust_alpha = 0.5)

Alpha and topic sizes

The first plot shows the values of alpha after fitting the models. The values in the symmetric and asymmetric models are as described above. In the adjusted model, they range between 0.4 and 0.7 because of the automatic adjustment from the initial value. In the second plot, we can see that the sizes of the topics (in number of words) reflect the values of alpha in all the models. The sizes are only between 7.5% and 12% in the symmetric model, whereas they are between 2.5% and 18% in the asymmetric model and between 5% and 18% in the adjusted model. The difference between small and large topics is much greater in the latter two.
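
For readers who want to see what "size in number of words" means, here is a minimal base R sketch. It assumes that we have a matrix with the number of words assigned to each topic in each document; the matrix n_dk below is made up and much smaller than the real data.

```r
# Made-up counts of words assigned to 3 topics in 3 documents
n_dk <- matrix(c(6, 3, 1,
                 5, 4, 1,
                 7, 2, 1), nrow = 3, byrow = TRUE,
               dimnames = list(paste0("doc", 1:3), paste0("topic", 1:3)))

# Topic size = share of all words in the corpus assigned to the topic
size <- colSums(n_dk) / sum(n_dk)
round(size * 100, 1)   # in percent
```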

Topic terms

From earlier studies, we know that diplomats at the General Assembly talk a lot about security, development and UN institutions but little about human rights. To compare the topics between models, I sorted them by their sizes in descending order below (left to right; top to bottom). We can find development, UN institutions and international peace in the top three topics in all the models, but the topics about human rights are in different positions. Human rights is mixed with sustainability (7th) in the symmetric model; it is mixed with international peace (3rd) in the asymmetric model; it is identified as a stand-alone topic (10th) only in the adjusted model. The topics are mashed together in the first two models because the values of alpha are too small or too large.
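
Sorting the topics by size can be sketched in base R as follows. The term matrix and the sizes are made up for illustration; with a fitted model you would take them from the model object instead.

```r
# Made-up top terms (one column per topic) and topic sizes
terms_mat <- cbind(topic1 = c("assembly", "general"),
                   topic2 = c("development", "sustainable"),
                   topic3 = c("human", "rights"))
size <- c(topic1 = 0.08, topic2 = 0.18, topic3 = 0.05)

# Reorder the columns so that the largest topic comes first
ord <- order(size, decreasing = TRUE)
terms_sorted <- terms_mat[, ord]
terms_sorted
```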

Symmetric model

Development (topic 2), UN institutions (topic 6) and international peace (topic 7) are the largest topics. Human rights is mixed with sustainability (topic 10).

##       topic2           topic6          topic7          topic9         
##  [1,] "development"    "nations"       "international" "terrorism"    
##  [2,] "sustainable"    "united"        "nuclear"       "international"
##  [3,] "agenda"         "security"      "security"      "humanitarian" 
##  [4,] "economic"       "peace"         "weapons"       "conflicts"    
##  [5,] "goals"          "council"       "peace"         "countries"    
##  [6,] "national"       "international" "republic"      "global"       
##  [7,] "social"         "reform"        "solution"      "must"         
##  [8,] "implementation" "states"        "political"     "migration"    
##  [9,] "government"     "global"        "democratic"    "also"         
## [10,] "education"      "organization"  "council"       "fight"        
##       topic3       topic8          topic10       topic1       topic5      
##  [1,] "world"      "international" "human"       "change"     "people"    
##  [2,] "can"        "african"       "rights"      "climate"    "years"     
##  [3,] "future"     "cooperation"   "peace"       "countries"  "world"     
##  [4,] "must"       "union"         "people"      "global"     "many"      
##  [5,] "together"   "process"       "life"        "per"        "caribbean" 
##  [6,] "today"      "regional"      "respect"     "agreement"  "past"      
##  [7,] "one"        "peace"         "law"         "cent"       "ago"       
##  [8,] "time"       "countries"     "sustainable" "small"      "solidarity"
##  [9,] "better"     "political"     "planet"      "states"     "countries" 
## [10,] "challenges" "national"      "peoples"     "developing" "war"       
##       topic4             
##  [1,] "assembly"         
##  [2,] "general"          
##  [3,] "like"             
##  [4,] "president"        
##  [5,] "session"          
##  [6,] "also"             
##  [7,] "mr"               
##  [8,] "support"          
##  [9,] "secretary-general"
## [10,] "wish"

Asymmetric model

Development (topic 1), UN institutions (topic 2) and international peace (topic 5) are the largest topics. Human rights also appears in topic 5.

##       topic1        topic2          topic5          topic3     topic4         
##  [1,] "development" "nations"       "international" "world"    "terrorism"    
##  [2,] "sustainable" "united"        "rights"        "people"   "world"        
##  [3,] "economic"    "security"      "peace"         "can"      "people"       
##  [4,] "agenda"      "peace"         "political"     "must"     "countries"    
##  [5,] "national"    "international" "human"         "peace"    "humanitarian" 
##  [6,] "climate"     "states"        "security"      "future"   "international"
##  [7,] "change"      "council"       "state"         "today"    "global"       
##  [8,] "goals"       "support"       "law"           "one"      "many"         
##  [9,] "countries"   "organization"  "dialogue"      "together" "also"         
## [10,] "social"      "reform"        "government"    "every"    "must"         
##       topic6      topic8       topic9       topic10     topic7      
##  [1,] "assembly"  "change"     "nuclear"    "years"     "per"       
##  [2,] "general"   "climate"    "weapons"    "year"      "cent"      
##  [3,] "session"   "caribbean"  "republic"   "ago"       "million"   
##  [4,] "president" "natural"    "people's"   "last"      "years"     
##  [5,] "mr"        "disasters"  "democratic" "two"       "billion"   
##  [6,] "like"      "people"     "treaty"     "first"     "year"      
##  [7,] "also"      "hurricanes" "korea"      "past"      "now"       
##  [8,] "people"    "recent"     "security"   "minister"  "population"
##  [9,] "wish"      "solidarity" "threat"     "assembly"  "world's"   
## [10,] "theme"     "affected"   "korean"     "president" "domestic"

Adjusted model

Development (topic 10), UN institutions (topic 1) and international peace (topic 3) are the largest topics. Human rights (topic 4) is the smallest stand-alone topic.

##       topic10         topic1          topic3          topic2    
##  [1,] "development"   "nations"       "international" "world"   
##  [2,] "sustainable"   "united"        "peace"         "can"     
##  [3,] "economic"      "security"      "political"     "people"  
##  [4,] "agenda"        "peace"         "security"      "must"    
##  [5,] "national"      "international" "dialogue"      "future"  
##  [6,] "countries"     "council"       "state"         "one"     
##  [7,] "goals"         "states"        "solution"      "together"
##  [8,] "international" "global"        "support"       "today"   
##  [9,] "social"        "must"          "peaceful"      "peace"   
## [10,] "per"           "support"       "states"        "every"   
##       topic5          topic9       topic7       topic6      topic8   
##  [1,] "terrorism"     "nuclear"    "change"     "assembly"  "years"  
##  [2,] "people"        "weapons"    "climate"    "general"   "year"   
##  [3,] "world"         "republic"   "agreement"  "session"   "past"   
##  [4,] "humanitarian"  "democratic" "global"     "president" "ago"    
##  [5,] "countries"     "caribbean"  "small"      "like"      "two"    
##  [6,] "migration"     "people"     "island"     "people"    "first"  
##  [7,] "also"          "solidarity" "pacific"    "mr"        "last"   
##  [8,] "international" "treaty"     "states"     "also"      "country"
##  [9,] "terrorist"     "people's"   "natural"    "peace"     "one"    
## [10,] "many"          "recent"     "developing" "wish"      "time"   
##       topic4       
##  [1,] "human"      
##  [2,] "rights"     
##  [3,] "women"      
##  [4,] "law"        
##  [5,] "respect"    
##  [6,] "right"      
##  [7,] "equality"   
##  [8,] "dignity"    
##  [9,] "fundamental"
## [10,] "peoples"

Model performance

To assess the suitability of the parameters, I computed in-sample perplexity and topic divergence. The perplexity scores are the best (lowest) for the symmetric and adjusted models; the topic divergence scores are the best (highest) for the asymmetric and adjusted models. These two quantitative measures also suggest that the automatic adjustment of alpha is working well.

# Perplexity
##  symmetric asymmetric   adjusted 
##   1382.017   1416.700   1395.984

# Topic divergence
##  symmetric asymmetric   adjusted 
##  0.4322946  0.4440354  0.4418116
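
The two measures can be sketched in base R on a toy model. I assume here the standard definition of in-sample perplexity (the exponential of the negative mean log-likelihood per word) and use the mean pairwise Jensen-Shannon divergence between topics as the divergence measure; the functions in the package may differ in details such as regularization. All the numbers and function names below are made up.

```r
# Toy model: 2 documents, 2 topics, 3 word types (all numbers are made up)
x <- matrix(c(4, 1, 0,
              0, 2, 3), nrow = 2, byrow = TRUE)          # document-feature counts
theta <- matrix(c(0.9, 0.1,
                  0.2, 0.8), nrow = 2, byrow = TRUE)      # document-topic probabilities
phi <- matrix(c(0.7, 0.2, 0.1,
                0.1, 0.3, 0.6), nrow = 2, byrow = TRUE)   # topic-word probabilities

# In-sample perplexity: lower means the model predicts the words better
perplexity_sketch <- function(x, theta, phi) {
  p <- theta %*% phi                    # predicted word probabilities per document
  exp(-sum(x[x > 0] * log(p[x > 0])) / sum(x))
}

# Jensen-Shannon divergence between two probability vectors (base-2 logarithm)
jsd <- function(p, q) {
  m <- (p + q) / 2
  kl <- function(a, b) sum(ifelse(a > 0, a * log2(a / b), 0))
  (kl(p, m) + kl(q, m)) / 2
}

# Mean pairwise divergence between topics: higher means more distinct topics
divergence_sketch <- function(phi) {
  pairs <- combn(nrow(phi), 2)          # all pairs of topics
  mean(apply(pairs, 2, function(i) jsd(phi[i[1], ], phi[i[2], ])))
}

perplexity_sketch(x, theta, phi)
divergence_sketch(phi)
```

A model that predicts all three word types with equal probability has a perplexity of exactly 3, so a value below 3 means that the toy model does better than guessing.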

I compared the execution time between the symmetric and adjusted models, and both took about 56 seconds. The adjustment algorithm is also fully compatible with other features of the package such as the seeded and distributed algorithms, so you can improve topic models without losing control or speed.

## Unit: seconds
##       expr      min       lq     mean   median       uq      max neval
##  symmetric 55.87726 56.12995 56.41751 56.32228 56.79633 57.04474    10
##   adjusted 56.06226 56.20806 56.27838 56.28277 56.33151 56.54101    10

Extra: similar features in other packages

There are several other R packages for topic modeling that seem to have similar features, but I am not sure how to use them. If you know how to enable estimation of asymmetric priors, please let me know.

KeyATM

According to Eshima et al. (2024), KeyATM can estimate the priors but the values are 0.1 for all the topics. Naturally, the topic terms are similar to my symmetric model.

atm <- keyATM::weightedLDA(keyATM::keyATM_read(dfmt), model = "base", 
                           number_of_topics = 10)
print(atm$priors$alpha)
##  [1] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1

keyATM::top_words(atm)
##          Topic_1       Topic_2        Topic_3   Topic_4   Topic_5    Topic_6
## 1        nations       nations       assembly    people     world    climate
## 2          world        united        general terrorist    people     change
## 3          peace   development        session     world     human     people
## 4         united         peace             mr   attacks countries  countries
## 5            can international      president     lives      many      world
## 6           must   sustainable           like     state terrorism  caribbean
## 7         people      security         people  violence    change    natural
## 8  international        rights           also    forces   climate     global
## 9    development         human seventy-second  children  conflict hurricanes
## 10      security        agenda   congratulate    da'esh  violence       also
##          Topic_7       Topic_8       Topic_9      Topic_10
## 1        nations   development       nuclear   development
## 2  international   sustainable       weapons       nations
## 3         united        people      republic international
## 4       security       nations      security        united
## 5          peace        united    democratic     countries
## 6          world      economic      people's      national
## 7           must          must international           per
## 8         people international         korea      security
## 9         rights     countries     peninsula       country
## 10         human          also        korean       support

Topicmodels

The variational expectation maximization (VEM) algorithm in the topicmodels package estimates the value of alpha if estimate.alpha = TRUE, but this also produces symmetric priors.

tm <- topicmodels::LDA(convert(dfmt, "topicmodels"), k = 10, method = "VEM",
                       control = list(estimate.alpha = TRUE))
print(tm@alpha)
## [1] 25.60497

topicmodels::terms(tm, 10)
##       Topic 1         Topic 2         Topic 3         Topic 4        
##  [1,] "development"   "people"        "peace"         "development"  
##  [2,] "international" "united"        "united"        "can"          
##  [3,] "people"        "international" "development"   "security"     
##  [4,] "economic"      "security"      "security"      "sustainable"  
##  [5,] "united"        "nations"       "support"       "international"
##  [6,] "must"          "country"       "must"          "world"        
##  [7,] "countries"     "peace"         "international" "countries"    
##  [8,] "assembly"      "can"           "can"           "assembly"     
##  [9,] "country"       "must"          "also"          "human"        
## [10,] "also"          "world"         "world"         "one"          
##       Topic 5       Topic 6         Topic 7         Topic 8        
##  [1,] "nations"     "nations"       "nations"       "international"
##  [2,] "security"    "united"        "people"        "world"        
##  [3,] "countries"   "peace"         "world"         "rights"       
##  [4,] "can"         "development"   "countries"     "like"         
##  [5,] "like"        "world"         "also"          "country"      
##  [6,] "people"      "sustainable"   "change"        "people"       
##  [7,] "economic"    "also"          "international" "human"        
##  [8,] "one"         "rights"        "region"        "states"       
##  [9,] "also"        "states"        "community"     "nations"      
## [10,] "development" "international" "political"     "must"         
##       Topic 9         Topic 10   
##  [1,] "nations"       "united"   
##  [2,] "development"   "peace"    
##  [3,] "support"       "world"    
##  [4,] "united"        "human"    
##  [5,] "global"        "countries"
##  [6,] "also"          "global"   
##  [7,] "international" "states"   
##  [8,] "sustainable"   "people"   
##  [9,] "world"         "must"     
## [10,] "human"         "national"

Kohei
