Improved tokenization of hashtags in Asian languages

Quanteda can tokenize Asian texts thanks to the ICU library’s boundary detection mechanism, but this causes problems when we analyze social media posts that contain hashtags in Chinese or Japanese. For example, the hashtag “#英国首相仍在ICU但未使用呼吸机#” in a post about the British prime minister is completely broken apart by quanteda’s current tokenizer. Although we can often correct tokenization using tokens_compound(), I don’t think that is possible here.

> txt <- "#英国首相仍在ICU但未使用呼吸机#首相好惨,希望都平安,疫情带来的变数太多了"
> print(tokens(txt), max_ntoken = 20)
Tokens consisting of 1 document.
text1 :
 [1] "#"    "英国" "首相" "仍在" "ICU"  "但"   "未"   "使用" "呼吸" "机"   "#"    "首相"
[13] "好"   "惨"   ","   "希望" "都"   "平安" ","   "疫"  
[ ... and 6 more ]

So I decided to improve the mechanism for protecting hashtags, exploiting the fast string substitution algorithm that I developed recently. As shown below, the improved tokenizer keeps the character string in the hashtag intact, while other character strings are segmented in the same way as above.

> print(tokens(txt), max_ntoken = 20)
Tokens consisting of 1 document.
text1 :
 [1] "#英国首相仍在ICU但未使用呼吸机#" "首相"                           
 [3] "好"                              "惨"                             
 [5] ","                              "希望"                           
 [7] "都"                              "平安"                           
 [9] ","                              "疫"                             
[11] "情"                              "带来"                           
[13] "的"                              "变数"                           
[15] "太多"                            "了"                             
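
The general idea of protecting hashtags can be illustrated outside quanteda. What follows is only a rough sketch and not the implementation used in quanteda: it assumes the stringi package, a hypothetical helper called `tokenize_keep_hashtags()`, and an assumed pattern `#\\w+#?`. It also splits the text around hashtag matches instead of using the string substitution algorithm mentioned above.

```r
library(stringi)

# Hypothetical helper, not part of quanteda: segment text with ICU word
# boundaries while leaving anything that matches `pattern` untouched.
tokenize_keep_hashtags <- function(x, pattern = "#\\w+#?") {
  # Hashtags in order of appearance (character(0) if there are none)
  tags <- stri_extract_all_regex(x, pattern, omit_no_match = TRUE)[[1]]
  # Pieces of text around the hashtags (empty pieces are kept by default)
  rest <- stri_split_regex(x, pattern)[[1]]
  out <- character()
  for (i in seq_along(rest)) {
    # ICU word segmentation of the unprotected text
    seg <- stri_split_boundaries(rest[i], type = "word")[[1]]
    # Drop missing, empty and whitespace-only pieces
    seg <- stri_trim_both(seg)
    seg <- seg[!is.na(seg) & nzchar(seg)]
    out <- c(out, seg)
    # Put the i-th hashtag back between its neighbouring pieces
    if (i <= length(tags)) out <- c(out, tags[i])
  }
  out
}

txt <- "#英国首相仍在ICU但未使用呼吸机#首相好惨,希望都平安,疫情带来的变数太多了"
toks <- tokenize_keep_hashtags(txt)
print(toks)
```

The hashtag comes back as a single element, while the rest of the text is segmented by ICU. The exact segmentation of the remaining text depends on the ICU version behind stringi, so it may differ slightly from the quanteda output shown above.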

By default, it only supports Twitter-style (“#abc”) and Weibo-style (“#abc#”) hashtags, but you can preserve other types of tags by changing the regular expression in `quanteda_options()`. If you are interested, please try the development version on GitHub.
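
To get a feel for what such a regular expression matches before changing any option, it can be tried on a few strings with stringi. This is a sketch under assumptions: the pattern `#\\w+#?` is my guess at an expression that covers both styles, the cashtag pattern is purely hypothetical, and the name of the relevant option should be looked up in the output of `quanteda_options()` rather than taken from here.

```r
library(stringi)

# Assumed pattern: "#" plus word characters, with an optional closing "#".
# ICU's \w covers Unicode letters, so Chinese characters are matched too.
hashtag_pattern <- "#\\w+#?"
posts <- c("Learning #rstats today", "#东京# 天气很好")
hashtags <- stri_extract_all_regex(posts, hashtag_pattern)
print(hashtags)

# Hypothetical custom pattern for a different type of tag (cashtags)
cashtag_pattern <- "\\$[A-Z]+"
cashtags <- stri_extract_all_regex("Watching $AAPL and $MSFT", cashtag_pattern)
print(cashtags)
```

A pattern that behaves as expected here is a reasonable candidate for the option, although the tokenizer may apply it in a slightly different way.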
