A few days ago, I received an email from a researcher asking if text analysis is becoming irrelevant because of artificial intelligence (AI). I replied briefly that text analysis methods and AI products serve different purposes, but I thought I needed to write more to answer her question properly. In fact, I had an opportunity to discuss the differences between them earlier this year at the Spring Convention of the Japan Association for Asian Studies. I was a discussant on a panel of scholars who study Chinese politics using text analysis, and I could feel strong interest in AI technologies among the audience. This blog post summarizes my discussion at the conference.
I believe that the main purpose of AI products is to automate tasks such as translating, summarizing or creating texts, which only humans could do until recently. In contrast, the main purpose of text analysis is research: finding patterns, building theories, and testing hypotheses. This difference in purpose determines how AI products and text analysis methods are designed and used.
We want AI products to perform tasks accurately for individual documents because we want them to replace humans, but we only need text analysis methods to be accurate for groups of documents because, in research, we are interested in variables associated with those groups (e.g. author or time). Text analysis methods are usually less accurate than AI products, but this is not a problem because random errors cancel out when values are aggregated (e.g. by taking the average).
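To make the aggregation point concrete, below is a minimal simulation (with made-up group means, corpus sizes and error levels, not real data) showing that noisy per-document scores still recover group-level differences once they are averaged within groups.

```python
# Minimal simulation: random per-document measurement errors largely cancel
# when scores are averaged within groups (the error shrinks roughly with 1/sqrt(n)).
# The group means, number of documents and noise level are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

true_means = {"group_A": 0.2, "group_B": -0.3}  # true average score per group
n_docs = 1000                                   # documents per group
noise_sd = 0.5                                  # large per-document measurement error

for group, mu in true_means.items():
    scores = mu + rng.normal(0.0, noise_sd, size=n_docs)  # noisy document-level scores
    print(group, round(scores.mean(), 3))                 # group average stays close to mu
```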
AI products are often created to identify generic concepts (e.g. topics, sentiment) that are relevant to a broad range of users, but target concepts in text analysis are usually specific (e.g. policy issues, geopolitical threat) because theoretical constructs are operationalized for hypothesis testing. Similarly, while AI products focus on parsing the messages in documents, text analysis is often used to infer thoughts and behaviors behind the documents (e.g. ideology and bias).
AI models require a very large corpus of documents, often gathered from unknown sources, to learn the general semantics of words. In contrast, text analysis models need only a small corpus of documents, collected from known sources, to capture the semantics of words in particular domains. The use of known sources becomes even more important if the models need to capture the semantics of words at particular times (e.g. for historical analysis or forecasting).
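As an illustration (not part of my original discussion), the sketch below trains Word2vec on a small, domain-specific corpus with the open-source gensim library; the toy sentences are placeholders, and a real analysis would use thousands of documents from known sources and periods.

```python
# A hedged sketch: learning domain-specific word semantics from a small corpus
# with Word2vec (gensim). The toy sentences are placeholders; meaningful
# nearest neighbours require a real domain corpus of thousands of documents.
from gensim.models import Word2Vec

sentences = [
    ["central", "bank", "raises", "interest", "rates"],
    ["bank", "cuts", "rates", "to", "support", "growth"],
    ["inflation", "forces", "the", "bank", "to", "tighten", "policy"],
]

model = Word2Vec(sentences, vector_size=50, window=5,
                 min_count=1, sg=1, seed=0, workers=1)  # fixed seed and single worker to aid reproducibility

# In this financial-news toy corpus, "bank" should drift towards monetary-policy
# vocabulary rather than the general-language sense of the word.
print(model.wv.most_similar("bank", topn=3))
```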
Large corpora allow complex AI models to learn general semantics very accurately, but small corpora require text analysis models to be simpler and more efficient. For example, I consider BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) to be complex models, and LDA (Latent Dirichlet Allocation) and Word2vec to be simple models. Simple models are in fact desirable in academic research because their algorithms are more transparent and efficient, and their results are more easily reproducible. Most importantly, simple models, if implemented in open-source packages, allow us to conduct research independently of AI companies' proprietary technologies (see the sketch after the summary table below).
| | AI products | Text analysis methods |
|---|---|---|
| Main purposes | Automation (e.g. translating, summarizing or creating texts) | Research (i.e. finding patterns, building theories, and testing hypotheses) |
| Unit of analysis | Individual documents | Groups of documents (e.g. by author or time) |
| Target concepts | Generic and manifest (e.g. topics, sentiment) | Specific and latent (e.g. policy issues, geopolitical threat, ideology, bias) |
| Word semantics | Learned in general contexts | Learned in domain- and time-specific contexts |
| Corpus sizes | Millions to billions of documents from unknown sources | Thousands to millions of documents from known sources |
| Analytical models | Large (e.g. BERT, GPT) | Small (e.g. LDA, Word2vec) |
| Desired characteristics | Usability, accuracy, versatility | Transparency, efficiency, reproducibility, independence |
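As a concrete illustration of the transparency and reproducibility of simple models, here is a minimal sketch fitting LDA on a toy corpus with the open-source gensim library (chosen here only as an example implementation; the documents, number of topics and settings are illustrative).

```python
# A minimal, hedged sketch: fitting LDA on a small corpus with an open-source
# library (gensim). The toy documents stand in for a corpus from known sources.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [
    ["tax", "budget", "deficit", "spending"],
    ["election", "vote", "campaign", "party"],
    ["budget", "tax", "spending", "policy"],
    ["campaign", "party", "election", "candidate"],
]

dictionary = Dictionary(texts)                    # map words to integer ids
corpus = [dictionary.doc2bow(t) for t in texts]   # bag-of-words counts per document

lda = LdaModel(corpus, num_topics=2, id2word=dictionary,
               random_state=0, passes=50)         # fixed seed makes the result reproducible

for topic_id, words in lda.print_topics():
    print(topic_id, words)
```

Because every step is a plain function call on data we control, the whole estimation can be rerun and inspected without relying on a proprietary service.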
I believe that text analysis will remain relevant in research because it serves different purposes from AI products. The confusion of text analysis with AI seems to stem from a lack of understanding of the technologies behind text analysis. Unfortunately, most users of text analysis do not understand the basic algorithm of LDA, and thus even such a simple model appears as opaque as AI models. However, it is hard to blame them for this lack of understanding, because students of the social sciences are rarely taught analytical models properly. This is a serious problem in the teaching programs of social science departments.
If text analysis is to remain relevant as a research methodology, we should continue improving its analytical models. As computer scientists shift their attention to the development of AI models, it is becoming more important for social scientists to develop software packages ourselves. Currently, I am making text analysis more granular by developing Sequential LDA and more accurate by implementing Word2vec. Developing such software packages helps us to understand the algorithms deeply and to break through the limits of the current text analysis methodology.