New research paper on how to choose seed words for semi-supervised models

In recent years, I have been developing and applying semi-supervised models, such as seeded LDA, Newsmap and LSS, for classification and document scaling, aiming to broaden the scope of quantitative text analysis. These models are very cost-efficient because they only require a small set of “seed words” to learn categories or dimensions of interest.

However, people often ask me how to choose seed words. My paper with Yuan Zhou (Kobe University), Theory-driven analysis of large corpora: semisupervised topic classification of the UN speeches, which recently appeared in Social Science Computer Review, aims to answer this question in the context of topic classification.

In this paper, much space is devoted to discussing an entropy-based diagnostic tool, but the key argument is that there are two criteria for choosing seed words: theoretical relevance and empirical frequency. The former serves to operationalize concepts, while the latter serves to enhance machine learning. I hope our paper will become a useful guideline for seed word selection. I am also thankful to the reviewers for their useful comments.
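To make the two criteria a little more concrete, here is a minimal Python sketch. It is not the diagnostic tool from the paper: the function names, the whitespace tokenization, and the co-occurrence-based entropy are all assumptions made for illustration. The first function checks empirical frequency by counting the documents that contain each seed word. The second computes the Shannon entropy of the topics whose other seed words co-occur with a given seed word, on the assumption that a word co-occurring with seeds of many topics is more likely to confuse a model.

```python
import math
from collections import Counter


def seed_word_frequencies(documents, seed_dictionary):
    """Count how many documents contain each seed word (document frequency)."""
    doc_tokens = [set(doc.lower().split()) for doc in documents]
    return {
        topic: {word: sum(word in tokens for tokens in doc_tokens) for word in words}
        for topic, words in seed_dictionary.items()
    }


def shannon_entropy(probabilities):
    """Shannon entropy in bits; zero probabilities are skipped."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def seed_word_entropy(documents, seed_dictionary):
    """Illustrative ambiguity score (an assumption, not the paper's measure).

    For each seed word, look at the documents that contain it and count, per
    topic, how many of them also contain another seed word of that topic.
    The entropy of these counts is high when the word co-occurs with the
    seeds of several topics. Seed words are assumed to be unique across topics.
    """
    doc_tokens = [set(doc.lower().split()) for doc in documents]
    scores = {}
    for words in seed_dictionary.values():
        for word in words:
            counts = Counter()
            for tokens in doc_tokens:
                if word not in tokens:
                    continue
                for other_topic, other_words in seed_dictionary.items():
                    if any(w in tokens for w in other_words if w != word):
                        counts[other_topic] += 1
            total = sum(counts.values())
            probabilities = [c / total for c in counts.values()] if total else []
            scores[word] = shannon_entropy(probabilities)
    return scores


# Hypothetical toy corpus and seed dictionary, for illustration only.
documents = [
    "security council peace",
    "peace security war",
    "trade economy growth",
    "trade security economy",
]
seed_dictionary = {"security": ["security", "peace"], "economy": ["economy", "trade"]}

frequencies = seed_word_frequencies(documents, seed_dictionary)
entropies = seed_word_entropy(documents, seed_dictionary)
```

In this toy example, “peace” only ever co-occurs with the other security seed, so its score is zero, while “trade” also appears alongside “security” and receives a higher score. A real application would use proper tokenization and a much larger corpus.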

There is a growing interest in quantitative analysis of large corpora among International Relations (IR) scholars, but many of them find it difficult to perform analysis consistently with existing theoretical frameworks using unsupervised machine learning models to further develop the field. To solve this problem, we created a set of techniques that utilize a semi-supervised model that allows researchers to classify documents into pre-defined categories efficiently. We propose a dictionary-making procedure to avoid inclusion of words that are likely to confuse the model and deteriorate its classification performance using a new entropy-based diagnostic tool. In our experiments, we classify sentences of the UN General Assembly speeches into six pre-defined categories using seeded LDA and Newsmap, which were trained with a small “seed word dictionary” that we create following the procedure. The result shows that, while a keyword dictionary can only classify 25% of sentences, Newsmap can classify over 60% of them correctly, and it exceeds 70% when contextual information is taken into consideration by kernel smoothing of topic likelihoods. We argue that once seed word dictionaries are created by the IR community, semi-supervised models would become more useful than unsupervised models for theory-driven text analysis.
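The idea of taking context into account by kernel smoothing can be sketched as follows. This is only an illustration under stated assumptions: the Gaussian kernel, the `half_width` and `bandwidth` parameters, and the function names are hypothetical and not taken from the paper. Each sentence's topic likelihoods are replaced by a weighted average over neighbouring sentences in the same speech before the most likely topic is chosen.

```python
import math


def gaussian_weights(half_width, bandwidth):
    """Kernel weights for sentence offsets -half_width..half_width (assumed Gaussian)."""
    return {
        k: math.exp(-0.5 * (k / bandwidth) ** 2)
        for k in range(-half_width, half_width + 1)
    }


def smooth_likelihoods(likelihoods, half_width=2, bandwidth=1.0):
    """Average each sentence's topic likelihoods with those of its neighbours.

    `likelihoods` is a list of rows, one per sentence in document order,
    each row holding one likelihood per topic. Weights are renormalized at
    the edges of the document, where fewer neighbours are available.
    """
    weights = gaussian_weights(half_width, bandwidth)
    n_sentences = len(likelihoods)
    smoothed = []
    for i in range(n_sentences):
        n_topics = len(likelihoods[i])
        totals = [0.0] * n_topics
        weight_sum = 0.0
        for offset, weight in weights.items():
            j = i + offset
            if 0 <= j < n_sentences:
                weight_sum += weight
                for t in range(n_topics):
                    totals[t] += weight * likelihoods[j][t]
        smoothed.append([value / weight_sum for value in totals])
    return smoothed


def classify(likelihoods):
    """Return the index of the most likely topic for each sentence."""
    return [max(range(len(row)), key=row.__getitem__) for row in likelihoods]


# Hypothetical likelihoods for three consecutive sentences and two topics.
raw = [[0.9, 0.1], [0.45, 0.55], [0.9, 0.1]]
raw_labels = classify(raw)
smoothed_labels = classify(smooth_likelihoods(raw, half_width=1, bandwidth=1.0))
```

Here the middle sentence is weakly assigned to the second topic on its own, but it is reassigned to the first topic once its neighbours are considered. The bandwidth controls how far this contextual influence reaches.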
