Quanteda and semisupervised models

My co-developers and I received the 2020 Statistical Software Award from the Society for Political Methodology for quanteda's contribution to research. The package has established a reputation as a user-friendly and highly efficient R package for quantitative text analysis in the political science community. I also know that the package has many users in other fields of research, in both academia and industry.

I have developed quanteda with a focus on its ability to process large textual data using novel algorithms and data structures. I have also published multiple posts on performance benchmarking to show that it is actually fast: we can analyze around 50GB of textual data in one go on a (cloud) machine with 256GB of RAM. You may think I am a speed maniac, but there is a reason behind my obsession with efficiency.

The importance of efficiency stems from the machine learning techniques, Newsmap and LSS, that I have developed for my research. Both are semisupervised models that learn from how user-provided keywords (called "seed words") co-occur with other words in the corpus. Since the number of seed words is independent of the number of documents in the corpus, we can train these semisupervised models on a very large corpus without extra cost. This is in sharp contrast with (fully) supervised models. Since manually labeling documents is very expensive, we cannot train supervised models on a very large corpus (and therefore do not need highly efficient software for them).
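To make the idea of learning from seed words concrete, here is a minimal toy sketch in Python. It is not the Newsmap or LSS algorithm, and it does not use quanteda; the function name `learn_from_seeds`, the example documents, and the seed words are all made up for illustration. It only counts which words appear in the same documents as the seed words of each label, which is the kind of co-occurrence signal a semisupervised model could build on.

```python
from collections import Counter, defaultdict


def learn_from_seeds(docs, seed_words):
    """Count, per label, the non-seed words that share a document with a seed word.

    Hypothetical helper for illustration only; real models such as Newsmap
    and LSS use more elaborate statistics than raw document co-occurrence.
    """
    assoc = defaultdict(Counter)
    for doc in docs:
        # Naive tokenization: lowercase and split on whitespace.
        tokens = set(doc.lower().split())
        for label, seeds in seed_words.items():
            seed_set = set(seeds)
            if tokens & seed_set:
                # The document mentions a seed word, so every other word
                # in it becomes weakly associated with this label.
                assoc[label].update(tokens - seed_set)
    return assoc


# Made-up corpus: no document carries a manual label.
docs = [
    "the economy is growing and markets are strong",
    "markets fell as the economy slowed",
    "the team won the match last night",
]

# The only manual input is a short seed list per label.
seed_words = {"economy": ["economy"], "sports": ["team"]}

assoc = learn_from_seeds(docs, seed_words)
print(assoc["economy"].most_common(3))
```

The manual input is the `seed_words` dictionary alone. Adding a million more documents to `docs` would increase the computation but not the human effort, which is why fast software matters for this class of models.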

In the political science community, semisupervised models have been far less known than (fully) supervised or unsupervised models, but I am no longer alone. A team at Harvard and MIT released a new semisupervised topic model, called the keyword-assisted topic model, this year. The model is implemented in R using quanteda, of course.
