LIWC is a popular text analysis package developed and maintained by Pennebaker et al. The latest version of the LIWC dictionary was released in 2015. For the analysis of contemporary materials, this dictionary seems more appropriate than classic dictionaries such as the General Inquirer dictionaries, because our vocabulary changes over the years.
However, LIWC did not work with a large corpus of news articles published between 2012 and 2015 (around 800MB in raw text). The error message suggests that the text file was too large for the software:
java.util.concurrent.ExecutionException: java.lang.Exception: java.lang.OutOfMemoryError: Java heap space
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at com.liwc.LIWC2015.controller.TextAnalyzer.run(TextAnalyzer.java:109)
at com.liwc.LIWC2015.controller.MainMenuController.onAnalyzeText(MainMenuController.java:113)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:275)
at javafx.fxml.FXMLLoader$MethodHandler.invoke(FXMLLoader.java:1771)
at javafx.fxml.FXMLLoader$ControllerMethodEventHandler.handle(FXMLLoader.java:1657)
My solution to the problem was to apply the LIWC dictionary using quanteda's dictionary lookup function, which could apply the dictionary to the data in less than one minute on my Core i7 machine. I compared the results from quanteda and LIWC on a subset of the corpus and found the word counts (in the columns from "function" to "you" in the tables) to be very close to each other:
require(quanteda)

# Load the LIWC 2015 dictionary in flat format
dict <- dictionary(file = './Text analysis/LIWC/LIWC2015_English_Flat.dic')

# Build a corpus from the news articles, one document per line
corp <- corpus(readLines('./Text analysis/Corpus/guardian_sub.txt'))

# Tokenize and apply the dictionary to the tokens
toks <- tokens(corp, remove_punct = TRUE)
toks_liwc <- tokens_lookup(toks, dict)

# Convert category counts to percentages of total tokens, as LIWC does
mx_liwc <- dfm(toks_liwc) / ntoken(toks) * 100
head(mx_liwc, 20)
Document-feature matrix of: 10,000 documents, 73 features (21.8% sparse).
(showing first 20 documents and first 6 features)
features
docs function pronoun ppron i we you
text1 43.57743 6.122449 1.4405762 0.12004802 0.7202881 0.12004802
text2 42.94872 5.769231 0.6410256 0.00000000 0.0000000 0.00000000
text3 43.94904 6.157113 1.6985138 0.00000000 0.2123142 0.00000000
text4 42.12963 4.783951 1.3888889 0.15432099 0.4629630 0.15432099
text5 40.22140 5.289053 2.7060271 0.00000000 0.6150062 0.12300123
text6 43.44473 4.755784 0.6426735 0.00000000 0.2570694 0.00000000
text7 41.03139 4.035874 0.2242152 0.00000000 0.0000000 0.00000000
text8 43.82716 8.847737 6.3786008 1.02880658 0.8230453 0.00000000
text9 42.56121 4.519774 1.3182674 0.00000000 0.3766478 0.00000000
text10 46.11111 6.888889 1.8888889 0.44444444 0.1111111 0.22222222
text11 49.62963 12.469136 5.5555556 1.60493827 1.1111111 0.12345679
text12 50.00000 11.121495 6.8224299 1.02803738 2.5233645 0.00000000
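For reference, a comparison along these lines can be done in R once LIWC's own scores for the same subset have been exported to a spreadsheet. This is only a sketch: the file name liwc_sub_results.csv and its column layout are assumptions, not the output of any particular LIWC export.

# Hypothetical CSV of LIWC scores for the same subset of documents
liwc <- read.csv('./Text analysis/LIWC/liwc_sub_results.csv', check.names = FALSE)
mx_quanteda <- as.matrix(mx_liwc)

# Correlate the percentage scores for a few shared categories
for (v in c('function', 'pronoun', 'ppron', 'i', 'we', 'you')) {
    cat(v, cor(liwc[[v]], mx_quanteda[, v]), '\n')
}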
Note that quanteda version 0.99 has a problem in dfm_lookup(), which slows down computation dramatically. If you want to use this function, install version 0.996 or later (available on GitHub).
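One way to install the development version, assuming the devtools package is already installed:

# Install the development version of quanteda from GitHub
devtools::install_github('quanteda/quanteda')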