My co-authored report on Russia’s influence on Twitter during the 2020 US presidential election has been published by Free Russia Foundation. Maria Snegovaya and I conducted a representative online survey of Americans during the election campaign, along with a quantitative content analysis of their Twitter posts over a year. We aimed to reveal the relationship between […]
Lecture at the Musashi University Institute of Data Science
The other day, I gave a lecture at the Musashi University Institute of Data Science titled “Measuring 150 Years of Geopolitical Threats through Quantitative Text Analysis of the NYT”. According to the organizers, about 30 people attended in person and about 70 online. The talk reaffirmed the potential of quantitative text analysis, and I hope more people in Japan will apply the method in their research and professional work. I plan to stay in Japan and continue my research for a while, so please contact me if your university or company would like to host a hands-on workshop on quantitative text analysis based on Quanteda Tutorials. Updated 23 December 2020: a recording of the lecture is now available on YouTube.
Flexible sentiment analysis of Japanese documents with word embeddings
My paper, Latent Semantic Scaling: A Semisupervised Text Analysis Technique for New Domains and Languages, was recently published in Communication Methods and Measures. In the paper, I show that word embeddings make quantitative text analysis possible in Japanese, a language with few ready-to-use keyword dictionaries, just as in English. Using a technique called LSS, the paper extracts politics-related words from newspaper articles and weights them by their proximity to sentiment seed words. The positive seed words were 「絶好、美麗、秀逸、卓越、優雅、絶賛、善良」 and the negative seed words 「粗悪、醜悪、稚拙、非礼、貧相、酷評、悪徳」. As the figure shows, the resulting weights are intuitively plausible: words such as 「絶好、人類、民主化、安定、立国」 come out positive, while words such as 「私利私欲、暴力団、脱税事件、不透明、流用」 come out negative. Weighting documents by these sentiment-weighted words enables political sentiment analysis even without a suitable sentiment dictionary. With LSS, changing the set of words to be weighted allows sentiment analysis of topics other than politics, and changing the seed words allows analysis on more specialized scales such as threat perception or mental states. The Japanese text processing and analysis in this paper rely only on the quanteda and LSX R packages and are straightforward, so please give them a try. The R scripts to reproduce the analysis can be downloaded from Harvard Dataverse.
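The workflow described above can be sketched in a few lines of R. This is a minimal illustration, not the paper’s actual analysis: the toy corpus below is invented so that the seed words occur in it, the number of SVD dimensions `k` is set far smaller than a real application would use, and Japanese segmentation depends on the ICU dictionary, so a realistically large corpus is needed for meaningful scores.

```r
# Minimal LSS sketch, assuming the quanteda and LSX packages are installed.
library(quanteda)
library(LSX)

# Toy documents standing in for a corpus of newspaper articles
corp <- corpus(c("絶好の機会で善良な市民が民主化を絶賛した。",
                 "粗悪な手口の脱税事件は醜悪で不透明だ。",
                 "卓越した経済の安定が立国を支える。",
                 "悪徳業者の稚拙な弁明は酷評された。",
                 "優雅で美麗な文化が人類を豊かにする。",
                 "暴力団による私利私欲の流用は非礼だ。"))
toks <- tokens(corp, remove_punct = TRUE)  # ICU handles Japanese segmentation
dfmat <- dfm(toks)

# Seed words from the paper: positive terms score +1, negative terms -1
seed <- c("絶好" = 1, "美麗" = 1, "秀逸" = 1, "卓越" = 1,
          "優雅" = 1, "絶賛" = 1, "善良" = 1,
          "粗悪" = -1, "醜悪" = -1, "稚拙" = -1, "非礼" = -1,
          "貧相" = -1, "酷評" = -1, "悪徳" = -1)

# Fit the LSS model (k kept tiny for this toy corpus) and
# score each document on the resulting sentiment scale
lss <- textmodel_lss(dfmat, seeds = seed, k = 5)
pred <- predict(lss, newdata = dfmat)
```

In a real analysis, the weighted terms would also be restricted to a subject of interest (for example, politics-related words selected by frequency near the word 政治), which is what makes the resulting scale topic-specific.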
LSX package upgraded as its paper is published in CMM
I am pleased to tell you that my paper, Latent Semantic Scaling: A Semisupervised Text Analysis Technique for New Domains and Languages, was published in Communication Methods and Measures a few days ago. This paper explains the Latent Semantic Scaling technique, which is implemented in the LSX package available on CRAN, taking sentiment analysis […]
Study political and economic changes with semisupervised text analysis methods
Earlier this year, I published my first paper on semisupervised methods (Newsmap and seeded LDA) in Social Science Computer Review. My second paper on a semisupervised method (Latent Semantic Scaling) appeared in Communication Methods and Measures a few days ago. I wrote these research articles and developed software packages as part of my effort […]
Uploaded two new semisupervised models to CRAN
This summer, I submitted two packages for quantitative text analysis to CRAN: seededlda and LSX. These packages have been available in my GitHub repositories, but I thought it was time to make them more readily available to promote semisupervised machine learning techniques. seededlda is a package that implements seeded LDA using the GibbsLDA++ library. […]
Quanteda and semisupervised models
My co-developers and I received the 2020 Statistical Software Award from the Society for Political Methodology for quanteda’s contribution to research. The package has established a reputation as a user-friendly and highly efficient R package for quantitative text analysis in the political science community. I also know that there are many users of the package in other […]
Improved tokenization of hashtags in Asian languages
Quanteda can tokenize Asian texts thanks to the ICU library’s boundary detection mechanism, but this causes problems when we analyze social media posts that contain hashtags in Chinese or Japanese. For example, a hashtag “#英国首相仍在ICU但未使用呼吸机#” in a post about the British prime minister is completely destroyed by quanteda’s current tokenizer. Although we can correct tokenization […]
New paper on Latent Semantic Scaling
I developed Latent Semantic Scaling (LSS) to perform sentiment analysis of news articles about the Ukraine crisis during my PhD project in London. LSS requires only a small set of polarity words, called “seed words”, to perform large-scale document scaling on a specific subject, because it automatically identifies synonyms of the seed words by latent semantic […]
New stopwords collection for European and Asian languages
In quantitative text analysis, it is common to remove grammatical elements using the stopword lists defined in Snowball, but Snowball does not contain stopwords for Asian languages. The lack of a stopwords collection covering both European and Asian languages has made cross-lingual analysis difficult. To solve this problem, my collaborators and I created a new stopwords collection, called […]