Analyzing Asian texts in R on English Windows machines

R generally handles Unicode well, and we do not see garbled text as long as we use the stringi package. But there are some known bugs. The worst is probably one that has been discussed in the online community.

On Windows, R prints character vectors properly, but not character vectors inside a data.frame:

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

> txt <- "あいうえお" # Japanese text
> print(txt)
[1] "あいうえお" # good
> print(data.frame(txt))
                                       txt
1 <U+3042><U+3044><U+3046><U+3048><U+304A> # not good
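Before trying a fix, it can help to check how the string itself is encoded and what the session's native encoding is. The following is a minimal diagnostic sketch in base R, not part of the original example. It assumes the text is entered with \u escapes (so the script stays ASCII-only), which is a choice made here for illustration.

```r
# Same Japanese text as above, written with \u escapes (an assumption made
# here so that the script file does not depend on its own encoding)
txt <- "\u3042\u3044\u3046\u3048\u304a"

# Declared encoding of the string; strings built from \u escapes are marked UTF-8
txt_encoding <- Encoding(txt)

# Five characters, three bytes each in UTF-8
n_chars <- nchar(txt, type = "chars")
n_bytes <- nchar(txt, type = "bytes")

# Properties of the session's native encoding
info <- l10n_info()
native_is_utf8 <- info[["UTF-8"]]   # TRUE only when the native encoding is UTF-8
codepage <- info[["codepage"]]      # Windows only; NULL on other platforms

cat("String encoding :", txt_encoding, "\n")
cat("Chars / bytes   :", n_chars, "/", n_bytes, "\n")
cat("LC_CTYPE        :", Sys.getlocale("LC_CTYPE"), "\n")
cat("Native is UTF-8 :", native_is_utf8, "\n")
```

If the string is marked UTF-8 but the native encoding is a single-byte code page such as 1252, the session matches the setup in which the problem above was observed.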

While Ista Zahn’s interesting post only shows the depth of the problem, there is a solution (or workaround) that you can try:

First, set the language for non-Unicode programs to the language of your texts (Japanese in this example) in Windows’ Control Panel > Clock, Language, and Region > Change date, time, or number formats > Administrative > Language for non-Unicode programs > Change system locale.

Second, set the locale in your R script:

> Sys.setlocale("LC_CTYPE", locale="Japanese") # set locale
[1] "Japanese_Japan.932"

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=Japanese_Japan.932 # changed          
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252

Then, the Japanese text in a data.frame is printed correctly:

> txt <- "あいうえお" # Japanese text
> print(txt)
[1] "あいうえお" # good
> print(data.frame(txt))
         txt
1 あいうえお # good
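In a script that is shared with others, the locale switch from the second step can be written so that it fails gently and restores the previous setting afterwards. This is only a sketch under assumptions: the variable names are made up for illustration, and the locale name "Japanese" is the Windows name used above, so on other platforms or on machines without that locale the switch is simply skipped.

```r
# Same Japanese text as above, written with \u escapes
txt <- "\u3042\u3044\u3046\u3048\u304a"

# Remember the current setting so it can be restored later
old_ctype <- Sys.getlocale("LC_CTYPE")

# Sys.setlocale() returns "" (with a warning) when the request cannot be honored
new_ctype <- suppressWarnings(Sys.setlocale("LC_CTYPE", "Japanese"))

if (nzchar(new_ctype)) {
  # Locale switched: print the data.frame under the new LC_CTYPE
  print(data.frame(txt))
} else {
  message("Locale 'Japanese' is not available here; LC_CTYPE stays '", old_ctype, "'")
}

# Put the original LC_CTYPE back
invisible(Sys.setlocale("LC_CTYPE", old_ctype))
```

Restoring the old value at the end keeps the change local to the part of the script that prints Japanese text.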

One thought on “Analyzing Asian texts in R on English Windows machines”

R 3.4 still has the same problem. The root cause seems to be formatC():

> formatC('あいうえお')
[1] "<U+3042><U+3044><U+3046><U+3048><U+304A>"
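The escapes in this output look like what R produces when a UTF-8 string is converted to a native encoding that cannot represent it. Whether formatC() performs exactly that conversion internally is an assumption here, not something the comment confirms. A small base R check of whether a string survives the round trip through the native encoding could look like this:

```r
# Same Japanese text as above, written with \u escapes
txt <- "\u3042\u3044\u3046\u3048\u304a"

# Convert to the session's native encoding
native <- enc2native(txt)

# TRUE when every character can be represented in the native encoding
survives <- identical(enc2utf8(native), txt)

# TRUE when the conversion fell back to <U+xxxx> escapes
has_escapes <- grepl("<U+", native, fixed = TRUE)

cat("Survives native encoding:", survives, "\n")
cat("Contains <U+xxxx> escapes:", has_escapes, "\n")
```

Running this before and after Sys.setlocale("LC_CTYPE", locale="Japanese") is one way to see whether the locale change affects the conversion on a given machine.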

Kohei
