I developed Latent Semantic Scaling (LSS) to perform sentiment analysis of news articles about the Ukraine crisis during my PhD project in London. LSS requires only a small set of polarity words, called "seed words", to perform large-scale document scaling on a specific subject, because it automatically identifies synonyms of the seed words using latent semantic analysis (LSA), the seminal word-embedding technique.
When I was working in Tokyo, I found LSS very useful for quantitative text analysis in non-English languages, because many languages lack content-analysis dictionaries suitable for social science research. In fact, many of my colleagues used LSS to study Asian countries, such as Japan, China, Iraq, and the Philippines. They measured various dimensions, such as hawkish vs. dovish, pro- vs. anti-regime, sectarian vs. conciliatory, and conflict vs. peace, using tailor-made seed words.
Last year, I started teaching LSS as one of the semisupervised methods for quantitative text analysis in data science courses at Innsbruck and the ECPR Methods Summer School in Budapest. My tutorial on semisupervised techniques was also scheduled as part of the COMPTEXT conference. Although both the summer school and the conference were canceled due to the Coronavirus pandemic in Europe, I felt it was time for me to write about LSS for its users.
In my working paper, titled Latent Semantic Scaling: A Semisupervised Text Analysis Technique for New Domains and Languages, I explain LSS using two examples: sentiment analysis of English economic news and of Japanese political news. I also introduce a new diagnostic measure to determine near-optimal sizes of word vectors for LSA. I hope this paper will help users of LSS understand the method better.
Update on 22/11/2020: I changed the link to the published paper in Communication Methods and Measures.
Dear Kohei,
Thank you for the paper, which I read with great interest. I want to use LSS for an analysis of interest group position papers. That is typically one area where both supervised and unsupervised scaling techniques seem to fail.
I already listed a number of candidate seed words, but I have two questions:
1) Do I absolutely need to have the same number of positive and negative seed words? (Right now, I have roughly twice as many negative words as positive words.)
2) Is there a statistical indicator to assess the “quality” of a seed word?
You can have different numbers of seed words, because the seed words are weighted automatically so that the total weight is 1.0 on each side (in your case, each negative word would count half as much as each positive word). I am developing a statistical indicator, but it is not ready for use yet. My suggestion is therefore to check the seed words one by one, using coef(), to see whether they weight the words they should. You can use a single word (without a polarity score) as a seed word when testing.
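The weighting logic can be sketched in base R (an illustration of the normalization described above, not the package's internal code): each side is rescaled so that its absolute weights sum to 1.0, so an imbalanced seed set does not tilt the scale.

```r
# Illustrative sketch: normalize seed weights so each polarity side sums to 1.0.
# Seed set with twice as many negative as positive words (hypothetical example).
seed <- c("good" = 1, "nice" = 1,
          "bad" = -1, "awful" = -1, "poor" = -1, "terrible" = -1)

pos <- seed[seed > 0]
neg <- seed[seed < 0]

# Rescale: positives sum to 1.0, negatives sum to -1.0.
weights <- c(pos / sum(pos), neg / sum(abs(neg)))
print(weights)
# positive words get 0.5 each; negative words get -0.25 each
```

With this normalization, each negative word indeed counts half as much as each positive word when there are twice as many of them.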
Thanks Kohei!
One additional question: can I use multi-word expressions as seed words? E.g. “fair competit*”.
Yes, but do not forget to form n-grams with tokens_compound() first. Otherwise they will be lost in the document-feature matrix.
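A minimal sketch of this step with quanteda (the example text and document name are made up; the pattern is the one from the question):

```r
library(quanteda)

# Compound the multi-word expression before building the dfm; otherwise it is
# split into unigrams and never matches the seed pattern.
toks <- tokens(c(d1 = "We support fair competition in the market."))
toks <- tokens_compound(toks, pattern = phrase("fair competit*"))

dfmat <- dfm(toks)
# The dfm now contains the compounded feature "fair_competition",
# which can then be used as a seed word.
```

phrase() tells tokens_compound() to treat the whitespace-separated pattern as a sequence of tokens; the glob "competit*" also catches inflected forms such as "competitive".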