Align word vectors of multiple Word2vec models

I have been developing a new R package called wordvector since last year. I started it as a fork of the word2vec package but made several important changes to make it fully compatible with quanteda. Because the package trains Word2vec models directly on quanteda's tokens objects, the trained models are easy to use in combination with other analytical tools. I think the potential of the model is under-explored in quantitative text analysis.

The wordvector package has evolved quickly since its initial release in December 2024. The progress-monitoring mechanism was rewritten in v0.2.0 to improve stability; a new function, probability(), was added in v0.3.0 to compute the probability of word occurrences; v0.4.0 made the package compatible with external pointer tokens objects for greater efficiency; and the latest version added the ability to inherit parameters from an existing model via the model argument for transfer learning. All these changes, especially the last, make the package very useful for applying word vectors (and document vectors) in research.

To demonstrate the usefulness of the new feature, I created a small example with news summaries collected in 2014 and 2015. The dataset and pre-processing steps are the same as in the example on GitHub. First, I train Word2vec separately on news from 2014 and 2015 with the default settings (50 dimensions using CBOW). Second, I find words related to “migration” and “europe” based on their cosine similarity. According to the word vectors trained on news from 2014 (wdv14), “migration” is related to both human migration (“deportations”, “influx”) and animal migration (“over-fishing”), while “europe” is related to the geographical area (“continent”) and European institutions (“nato” and “eu”). According to the word vectors from 2015 (wdv15), however, “migration” is related only to human migration, and “europe” is also related to the migration routes from the Middle East and North Africa (“balkans” and “mediterranean”). These changes in the word vectors reflect the news coverage of the 2015 European refugee crisis.

library(quanteda)
library(wordvector)

# subset the news by year
toks14 <- tokens_subset(toks, year == 2014)
toks15 <- tokens_subset(toks, year == 2015)

keyword <- c("migration", "europe")

# train Word2vec on news from 2014
wdv14 <- textmodel_word2vec(toks14)
head(similarity(wdv14, keyword))
#>      migration      europe     
#> [1,] "migration"    "europe"   
#> [2,] "deportations" "continent"
#> [3,] "immigration"  "countries"
#> [4,] "influx"       "european" 
#> [5,] "oecd"         "nato"     
#> [6,] "over-fishing" "eu"

# train Word2vec on news from 2015
wdv15 <- textmodel_word2vec(toks15)
head(similarity(wdv15, keyword))
#>      migration     europe         
#> [1,] "migration"   "europe"       
#> [2,] "refugee"     "balkans"      
#> [3,] "migrant"     "continent"    
#> [4,] "immigration" "mediterranean"
#> [5,] "europe"      "influx"       
#> [6,] "migrants"    "europeans"

We can transfer the 2014 model’s knowledge to 2015 using the model argument. If we initialize Word2vec with the parameters from 2014 and then train it on the news from 2015, the associated words become more general: we see words about human migration only for “migration” and the names of European countries (“greece” and “germany”) for “europe” (wdv15s). If this appears too general, failing to capture the uniqueness of events in 2015, we can limit the influence of the 2014 model by increasing the learning rate alpha from 0.05 (default) to 0.1 (wdv15w).

wdv15s <- textmodel_word2vec(toks15, model = wdv14)
head(similarity(wdv15s, keyword))
#>      migration     europe         
#> [1,] "migration"   "europe"       
#> [2,] "refugee"     "continent"    
#> [3,] "migrant"     "balkans"      
#> [4,] "immigration" "mediterranean"
#> [5,] "europe"      "greece"       
#> [6,] "europol"     "germany"

wdv15w <- textmodel_word2vec(toks15, model = wdv14, alpha = 0.1)
head(similarity(wdv15w, keyword))
#>      migration     europe     
#> [1,] "migration"   "europe"   
#> [2,] "refugee"     "continent"
#> [3,] "migrant"     "balkans"  
#> [4,] "influx"      "germany"  
#> [5,] "immigration" "influx"   
#> [6,] "unfolding"   "country"

The changes in the relationships between words are interesting, but they make the word vectors of separately trained models incomparable: the cosine similarity of the keywords between wdv14 and wdv15 is close to zero because each model is trained from a different random initialization, so their dimensions do not correspond. However, thanks to transfer learning, the word vectors are directly comparable between wdv14 and wdv15s, with moderately high similarity scores (0.69 and 0.57, respectively). Even with the higher learning rate, the word vectors remain aligned between wdv14 and wdv15w (0.62 and 0.59, respectively).

rbind(
  # no transfer 
  "none" = diag(simil(wdv14$values[keyword,], wdv15$values[keyword,])),
  # strong transfer (low learning rate) 
  "strong" = diag(simil(wdv14$values[keyword,], wdv15s$values[keyword,])),
  # weak transfer (high learning rate)
  "weak" = diag(simil(wdv14$values[keyword,], wdv15w$values[keyword,]))
)
#>         migration     europe
#> none   0.02671217 -0.1277291
#> strong 0.69817714  0.5719806
#> weak   0.62802489  0.5958305

The usage of the new argument is not limited to aligning word vectors for similarity computation. We can also use it to update an existing model as new data arrive (sketched below) or to train a model more reliably on limited data by supplying a larger pre-trained model. Please give it a try and find a creative use of the new feature.
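As a minimal sketch of the updating use case, suppose news summaries from 2016 become available and are pre-processed in the same way as the 2014 and 2015 data; toks16 and wdv16 below are hypothetical names introduced only for illustration:

# hypothetical: update the aligned 2015 model with news from 2016
wdv16 <- textmodel_word2vec(toks16, model = wdv15s)
# the nearest words should remain comparable with wdv14 and wdv15s
head(similarity(wdv16, keyword))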
