LSX package upgraded as the paper published in CMM

I am pleased to tell you that my paper, "Latent Semantic Scaling: A Semisupervised Text Analysis Technique for New Domains and Languages", was published in Communication Methods and Measures a few days ago. The paper explains the Latent Semantic Scaling (LSS) technique, which is implemented in the LSX package available on CRAN, using sentiment analysis of English economic news and Japanese political news as examples. Its pre-print has been available on this website, but you might find the journal publication reassuring.

I also upgraded the LSX package to v0.9.4. In this version, you can use pre-trained word vectors to construct an LSS model with as.textmodel_lss() and smooth LSS scores quickly with smooth_lss(engine = "locfit"). The possibility of using pre-trained word vectors for LSS is already mentioned in the paper. The quicker smoothing is useful when you apply LSS to large corpora (e.g. social media posts). If you run the paper's replication code, please be aware of a breaking change: cohesion() now returns a data.frame.
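A minimal sketch of the two new features. Here `vec` is assumed to be a word-by-dimension matrix of pre-trained embeddings (e.g. loaded from fastText or GloVe files) and `dfmt` a quanteda dfm with a "date" docvar; check `?as.textmodel_lss` and `?smooth_lss` in your installed version for the exact interfaces.

```r
library(LSX)

# Build an LSS model from pre-trained word vectors instead of fitting
# SVD on the corpus; seed words come from the built-in sentiment dictionary.
lss <- as.textmodel_lss(vec, seeds = as.seedwords(data_dictionary_sentiment))

# Score documents and smooth the daily scores; engine = "locfit" is much
# faster than the default on large corpora such as social media posts.
dat <- data.frame(date = docvars(dfmt, "date"),
                  fit = predict(lss, newdata = dfmt))
dat_smooth <- smooth_lss(dat, engine = "locfit")
```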


3 thoughts on "LSX package upgraded as the paper published in CMM"

  1. Hi Kohei!
    I am currently trying to run LSS on a corpus of responses to public consultations on financial regulation to measure the preference expressed by respondents for the new regulation. As you can imagine, that corpus has two characteristics:
    1) It is heavily dominated by financial interest groups’ comments, meaning that the two extremes of my scale are far from evenly represented.
    2) It includes a lot of technical jargon, which ideally should be weighted as neutral in terms of sentiment. While technical terms are generally weighted with low absolute values, they still create some confusion…
    After my first trials with LSS, I would like to ask you three questions:
    a) Does the fact that most documents in my corpus will most likely fall onto one side of the scale constitute a problem for the algorithm (that is, is there any kind of underlying assumption that the vocabulary should be balanced in terms of polarities)?
    b) I notice that the maximum weights assigned to words on the negative side are much lower than the maximum weights for positive words. Is this normal? I provided the same number of seed words for both sides.
    c) Should I try as much as possible to remove technical terms from the dfm, or rather indicate these terms (there are really a LOT of them) as model terms?
    Once again, thank you for providing us with these amazing tools!
    Sébastien

    1. Hi Sébastien,

      Thanks for the comments. My views are the following:

      a) LSS scores are estimated within the semantic universe of the corpus, so zero on the LSS scale is not necessarily neutral in the general sense (e.g. it shifts towards the negative if the corpus is dominated by negative comments). But we can still analyze documents relative to each other.

      b) Imbalance in the assigned scores suggests that the seed words on one side have smaller similarity to other words. I don’t think it is a big problem, but you can choose different seed words or try the new auto_weight feature in the latest version of the package. I am curious how it works in your case.

      c) Technical terms should not be removed, but multi-word expressions should be compounded using textstat_collocations() and tokens_compound(). They often affect the estimation of semantic proximity a lot.
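      A rough sketch of the suggestions in (b) and (c), assuming `corp` is a quanteda corpus of the consultation responses. The auto_weight feature is assumed here to be an argument of textmodel_lss(); check `?textmodel_lss` in your installed version for the exact interface.

```r
library(quanteda)
library(quanteda.textstats)  # textstat_collocations() lives here in quanteda v3+
library(LSX)

# (c) Compound frequent multi-word expressions so that technical jargon
# (e.g. "capital requirement") is treated as a single token when semantic
# proximity is estimated; the min_count and z thresholds are illustrative.
toks <- tokens(corp, remove_punct = TRUE)
col <- textstat_collocations(toks, min_count = 10)
toks <- tokens_compound(toks, pattern = col[col$z > 3, ])

# (b) Fit the model; auto_weight is assumed to rebalance the contribution
# of the positive and negative seed words.
dfmt <- dfm(toks)
lss <- textmodel_lss(dfmt, seeds = as.seedwords(data_dictionary_sentiment),
                     k = 300, auto_weight = TRUE)
```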
