New stopwords collection for European and Asian languages

In quantitative text analysis, it is common to remove grammatical words using the stopword lists from Snowball, but Snowball does not include stopwords for Asian languages. The lack of a stopword collection that covers both European and Asian languages has made cross-lingual analysis difficult.

To solve this problem, my collaborators and I created a new stopword collection, called Marimo, by extending and translating the Snowball collection. It currently contains stopwords only for English, German (Oul Han), Arabic (Dai Yamao), Hebrew (Elad Segev), and Japanese (Kohei Watanabe), but we will add more languages.

Marimo has a unique hierarchical structure that groups words by their grammatical functions, which makes the lists easy to translate and makes it easy to remove extra words. For example, we added categories such as reporting, time and number for the analysis of newspaper articles, but you might find them unnecessary for other types of documents.

You can easily load the YAML files into R using quanteda’s dictionary() function, but we plan to make the lists available through the stopwords package.
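As a rough illustration of how a hierarchical list can be trimmed before use, here is a minimal sketch in R. The category names and words in `marimo_sketch` are made up for the example and are not copied from the Marimo files, and the file path in the comment is hypothetical. With the real files, you would build the dictionary from YAML with dictionary(file = ...) instead of typing it inline.

```r
library(quanteda)

# Hypothetical miniature of a hierarchical stopword list.
# Category names and words are illustrative only, not taken from Marimo.
marimo_sketch <- dictionary(list(
  functional = list(
    pronoun = c("i", "you", "we"),
    preposition = c("in", "on", "of")
  ),
  reporting = c("said", "reported"),
  time = c("today", "monday")
))

# With the real YAML files, something like this should work instead
# (the path is a placeholder, not an actual file name):
# marimo_sketch <- dictionary(file = "path/to/stopwords_en.yml")

# Keep only the categories that suit the corpus; here "time" is dropped
# so that time expressions stay in the text.
keep <- marimo_sketch[c("functional", "reporting")]

# Remove every word in the kept categories from the tokens.
toks <- tokens("We reported today on the price of oil", remove_punct = TRUE)
toks_clean <- tokens_remove(toks, pattern = keep)
print(toks_clean)
```

Selecting categories by name before removal is what the hierarchy is for: the same list can serve newspaper articles (keeping reporting, time and number) and other document types (dropping them) without editing the files.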
