AI products and text analysis methods

A few days ago, I received an email from a researcher asking if text analysis is becoming irrelevant because of artificial intelligence (AI). I replied to her briefly, saying that text analysis methods and AI products serve different purposes, but I thought I needed to write more to answer this question. In fact, I had an opportunity to discuss the differences between them earlier this year at the Spring Convention of the Japan Association for Asian Studies. I was a discussant on a panel of scholars who study Chinese politics using text analysis, and I could feel strong interest in AI technologies among the audience. This blog post summarizes my discussion at the conference.

I believe that the main purpose of AI products is to automate tasks such as translating, summarizing, or creating texts, which until recently only humans could do. In contrast, the main purpose of text analysis is research: finding patterns, building theories, and testing hypotheses. This difference in purpose determines how AI products and text analysis methods are designed and used.

We want AI products to perform tasks accurately on individual documents because we want them to replace humans, but we only need text analysis methods to be accurate for groups of documents, because in research we are interested in variables associated with those groups (e.g. author or time). Text analysis methods are usually less accurate than AI products, but this is not a problem because random errors can be canceled out by aggregating the values (e.g. by taking the average).
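
To see why aggregation helps, here is a minimal simulation (my own sketch, not part of the original discussion): a document-level classifier with large random errors still recovers group-level scores almost exactly once predictions are averaged by author. The author names, scores, and error sizes are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: three authors, each with a true average sentiment,
# and a noisy document-level classifier (simulated with Gaussian errors).
true_sentiment = {"author_A": 0.2, "author_B": -0.4, "author_C": 0.6}
n_docs = 500  # documents per author

for author, mu in true_sentiment.items():
    # Each document-level prediction is individually unreliable (sd = 0.5) ...
    predictions = mu + rng.normal(0, 0.5, size=n_docs)
    # ... but the random errors cancel out when averaged over the group.
    print(f"{author}: true = {mu:+.2f}, estimated = {predictions.mean():+.2f}")
```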

AI products are often created to identify generic concepts (e.g. topics, sentiment) that are relevant to a broad range of users, whereas target concepts in text analysis are usually specific (e.g. policy issues, geopolitical threat) because theoretical constructs must be operationalized for hypothesis testing. Similarly, while AI products focus on parsing the messages in documents, text analysis is often used to infer thoughts and behaviors through documents (e.g. ideology and bias).
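
As a toy illustration of operationalization (the construct, seed words, and scoring rule here are all hypothetical), a specific concept such as geopolitical threat can be measured as the relative frequency of seed words in each document. A real study would validate such a measure against hand-coded documents, but the point is that it is tailored to one theoretical construct rather than to generic categories.

```python
# Hypothetical seed words for the construct "geopolitical threat".
threat_words = {"invasion", "missile", "sanction", "blockade", "escalation"}

def threat_score(text: str) -> float:
    """Share of tokens in a document that belong to the seed-word set."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in threat_words for t in tokens) / len(tokens)

docs = [
    "The missile test raised fears of escalation in the region",
    "The committee discussed agricultural subsidies and trade",
]
print([round(threat_score(d), 3) for d in docs])
```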

AI models require a very large corpus of documents, often gathered from unknown sources, to learn the general semantics of words. Text analysis models, however, need only a small corpus of documents, collected from known sources, to capture the semantics of words in particular domains. The use of known sources becomes even more important if the models need to capture the semantics of words at particular times (i.e. for historical analysis or forecasting).
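
As an illustration, a small domain-specific Word2vec model can be trained in a few lines with gensim (my choice of library, not mentioned in the post; the corpus below is a tiny placeholder for documents from a known source):

```python
from gensim.models import Word2Vec

# Placeholder corpus: a list of tokenized sentences. In practice this would
# be thousands of sentences from a known, domain-specific source.
corpus = [
    ["the", "central", "bank", "raised", "interest", "rates"],
    ["the", "bank", "cut", "rates", "to", "support", "growth"],
]

# A small, efficient model capturing domain-specific word semantics;
# min_count=1 only because the placeholder corpus is tiny.
model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, epochs=50)
print(model.wv.most_similar("bank", topn=3))
```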

Large corpora allow complex AI models to learn general semantics very accurately, while small corpora require text analysis models to be simpler and more efficient. I consider BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) to be complex models, for example, and LDA (Latent Dirichlet Allocation) and Word2vec to be simple models. Simple models are also desirable in academic research because their algorithms are more transparent and efficient, and their results are more easily reproducible. Most importantly, simple models, if implemented in open-source packages, allow us to conduct research independently of AI companies' proprietary technologies.
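
To make the contrast concrete, here is a minimal LDA sketch using scikit-learn (my choice of library; the documents are placeholders). Every fitted parameter of the model can be inspected directly, which is part of what makes simple models transparent and reproducible.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "tax reform and government spending",
    "election campaign and party leaders",
    "budget deficit and fiscal policy",
    "voters candidates and the election",
]

# Build a document-term matrix, then fit a small two-topic LDA model.
dtm = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

# The fitted parameters are directly inspectable.
print(lda.components_.shape)        # (topics, vocabulary)
print(lda.transform(dtm).round(2))  # per-document topic proportions
```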

|  | AI products | Text analysis methods |
| --- | --- | --- |
| Main purposes | Automation (e.g. translating, summarizing, or creating texts) | Research (i.e. finding patterns, building theories, and testing hypotheses) |
| Unit of analysis | Individual documents | Groups of documents (e.g. by author or time) |
| Target concepts | Generic and manifest (e.g. topics, sentiment) | Specific and latent (e.g. policy issues, geopolitical threat, ideology, bias) |
| Word semantics | Learned in general contexts | Learned in domain- and time-specific contexts |
| Corpus sizes | Millions to billions of documents from unknown sources | Thousands to millions of documents from known sources |
| Analytical models | Large (e.g. BERT, GPT) | Small (e.g. LDA, Word2vec) |
| Desired characteristics | Usability, accuracy, versatility | Transparency, efficiency, reproducibility, independence |

I believe that text analysis will remain relevant in research because it serves different purposes from AI products. The confusion of text analysis with AI seems to stem from a lack of understanding of the technologies behind text analysis. Unfortunately, most users of text analysis do not understand the basic algorithm of LDA, and thus even such a simple model appears as opaque as AI models. However, it is hard to blame them for this lack of understanding, because students in the social sciences are rarely taught properly about analytical models. This is a serious problem in the teaching programs of social science departments.

If text analysis is to remain relevant as a research methodology, we should continue improving its analytical models. As computer scientists shift their attention to the development of AI models, it is becoming more important for social scientists to develop software packages ourselves. Currently, I am making text analysis more granular by developing Sequential LDA and more accurate by implementing Word2vec. Developing such software packages helps us to understand the algorithms deeply and to push the limits of current text analysis methodology.
