I tend to use political texts in my examples because of my academic background, but quanteda and its associated packages can be used much more broadly.
There is growing interest in the analysis of textual data using NLP tools in the financial industry. I have been working for Lazard Asset Management as a data science consultant since last year, so I think it is time for me to explain how these packages can be used to analyze financial texts. There are a few tutorials in R on the web (many more are in Python), but they are overcomplicated. If the data is available, analysis of financial texts is very easy.
I use the transcripts of earnings calls for S&P 500 companies from 2017 in my examples. I extracted the texts from the dataset and saved them in a corpus object that you can download. The corpus object includes the date (“date”) and company name (“company”) as document variables; “quarter” is created from “date”.
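Judging from the labels used later, “quarter” appears simply to be the zero-indexed calendar quarter in which a call took place. A minimal sketch of how it could be created, assuming “date” is stored as a Date object:
month <- as.integer(format(docvars(corp, "date"), "%m")) # month of the call (1-12)
docvars(corp, "quarter") <- (month - 1) %/% 3 # 0 = Jan-Mar, ..., 3 = Oct-Dec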
Simple pre-processing is completed in three lines of code using quanteda; dfmt records the frequency of words in each document.
require(quanteda)
# tokenize, dropping punctuation, numbers, and symbols
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
# remove English stopwords and one-character tokens
toks <- tokens_remove(toks, pattern = stopwords("en"), min_nchar = 2)
# construct the document-feature matrix
dfmt <- dfm(toks)
Sentiment Analysis
We can perform sentiment analysis easily thanks to the financial sentiment dictionary created by Loughran and McDonald (2011). The L&M dictionary is available as part of the quanteda.sentiment package. It contains 4,232 words under categories such as “negative”, “positive”, and “uncertainty”.
devtools::install_github("quanteda/quanteda.sentiment")
require(quanteda.sentiment)
data_dictionary_LoughranMcDonald[c(1, 3)]
## Dictionary object with 2 key entries.
## Polarities: pos = ""; neg = "NEGATIVE"
## - [NEGATIVE]:
## - abandon, abandoned, abandoning, abandonment, abandonments, abandons, abdicated, abdicates, abdicating, abdication, abdications, aberrant, aberration, aberrational, aberrations, abetting, abnormal, abnormalities, abnormality, abnormally [ ... and 2,335 more ]
## - [UNCERTAINTY]:
## - abeyance, abeyances, almost, alteration, alterations, ambiguities, ambiguity, ambiguous, anomalies, anomalous, anomalously, anomaly, anticipate, anticipated, anticipates, anticipating, anticipation, anticipations, apparent, apparently [ ... and 277 more ]
We can count the frequency of the negative and uncertainty words using dfm_lookup(). Interestingly, the dictionary does not contain multi-word expressions at all. The frequencies of the sentiment words are divided by the total number of words to normalize for the lengths of the documents.
# count dictionary words and normalize by document length
dfmt_dict <- dfm_lookup(dfmt, data_dictionary_LoughranMcDonald[c(1, 3)]) / ntoken(dfmt)
# combine the normalized scores with the document variables in a data frame
dat_dict <- cbind(convert(dfmt_dict, to = "data.frame"), docvars(dfmt_dict))
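As an aside, quanteda.sentiment also provides textstat_polarity(), which collapses the positive and negative categories into a single score per document. This is not what I do above, but a minimal sketch would be:
# one polarity score per document, based on the dictionary's polarity settings
sent <- textstat_polarity(toks, dictionary = data_dictionary_LoughranMcDonald)
head(sent)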
Using the results of the dictionary analysis, I want to reveal how sentiment in earnings calls changed in 2017. Managers discuss financial results from the preceding period in earnings calls, so calls held between January and March concern the previous year’s final results (“NEGATIVE.0” and “UNCERTAINTY.0”) and calls held between October and December concern the third quarter’s (“NEGATIVE.3” and “UNCERTAINTY.3”). It would be more appropriate to compare earnings calls about the final results of 2016 and 2017, but that is not possible with the current data.
All the earnings calls are placed in a two-dimensional space defined by negativity and uncertainty. For visualization, I selected only the five companies whose changes in sentiment are the largest, as measured by Euclidean distance.
# drop the date column, then reshape to one row per company with separate
# negativity and uncertainty columns for each quarter
dat_wide <- reshape(dat_dict[-5], timevar = "quarter", idvar = "company",
                    direction = "wide")
# Euclidean distance between first- and last-quarter sentiment
dat_wide$change <- sqrt((dat_wide$NEGATIVE.0 - dat_wide$NEGATIVE.3) ^ 2 +
                        (dat_wide$UNCERTAINTY.0 - dat_wide$UNCERTAINTY.3) ^ 2)
# five companies with the largest changes
dat_top <- head(dat_wide[order(dat_wide$change, decreasing = TRUE), ], 5)
Figure 1 shows that uncertainty decreased dramatically at Dish Network while it increased at Tractor Supply and Lockheed Martin; negativity increased but uncertainty remained roughly the same at Foot Locker; both uncertainty and negativity decreased at Caterpillar.
par(mar = c(4, 4, 1, 1))
# plot each company's first-quarter position in the negativity-uncertainty space
plot(dat_top$NEGATIVE.0, dat_top$UNCERTAINTY.0, ylim = c(0, 0.06), xlim = c(0, 0.06),
     col = 1:5, pch = 1:5, xlab = "Negativity", ylab = "Uncertainty")
# draw arrows from the first-quarter to the last-quarter positions
arrows(dat_top$NEGATIVE.0, dat_top$UNCERTAINTY.0,
       dat_top$NEGATIVE.3, dat_top$UNCERTAINTY.3,
       code = 2, length = 0.1, col = 1:5)
legend("topright", col = 1:5, legend = dat_top$company, lty = 1, pch = 1:5)
Company Clustering
We can also perform more sophisticated analysis of the companies through their texts. An example is clustering companies by their lines of business using unsupervised topic models. Latent Dirichlet Allocation (LDA) is commonly used to identify the topics of documents, but it can also identify the businesses of companies.
I used my seededlda package, which works smoothly with quanteda. Before fitting LDA with 20 topics, I remove both infrequent and very frequent words using dfm_trim() and combine documents from the same company using dfm_group().
require(seededlda)
# keep words above the median term frequency that occur in at most 10% of documents
dfmt_grp <- dfmt %>%
    dfm_trim(min_termfreq = 0.5, termfreq_type = "quantile",
             max_docfreq = 0.1, docfreq_type = "prop") %>%
    dfm_group(company) # merge all calls from the same company into one document
lda <- textmodel_lda(dfmt_grp, k = 20) # fit LDA with 20 topics
t(terms(lda)) # ten most frequent terms in each topic
In Table 1, I can confirm that many of the 20 topics identified by LDA are strongly related to business sectors: aviation/defense (topic 1), cosmetics (topic 2), real estate (topic 3), computing (topic 4), communications (topic 5), oil (topic 6), and so on. Only topic 14 is dominated by words unrelated to a particular sector.
| Topic | Term 1 | Term 2 | Term 3 | Term 4 | Term 5 | Term 6 | Term 7 | Term 8 | Term 9 | Term 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| topic1 | aircraft | defense | skyworks | bookings | f-35 | marine | iot | pension | air | deliveries |
| topic2 | beauty | makeup | luxury | social | prestige | travel | ulta | skin | department | guests |
| topic3 | noi | square | lease | estate | feet | occupancy | leasing | interconnection | forward-looking | differ |
| topic4 | hardware | hybrid | storage | memory | nand | dram | ip | ryzen | automotive | vertical |
| topic5 | wireless | fiber | churn | packaging | unlimited | spectrum | video | networks | device | mps |
| topic6 | wells | drilling | barrels | permian | basin | equivalent | midstream | rig | acreage | rigs |
| topic7 | paypal | novaseq | sequencing | merchants | hiseq | diagnostics | self-employed | quickbooks | cynosure | consumables |
| topic8 | retention | ceb | membership | medicare | gartner | insurance | consulting | bookings | year-on-year | aca |
| topic9 | loan | loans | card | deposit | fees | balances | deposits | banking | mortgage | fee |
| topic10 | invisalign | bookings | storage | automation | repair | compares | americas | tool | divestiture | dilution |
| topic11 | fresh | snacks | beverage | foods | consumption | soup | beverages | merchandising | shelf | dsd |
| topic12 | apparel | wholesale | foot | gap | merchandise | ticket | hurricanes | com | women’s | footwear |
| topic13 | patients | study | studies | treatment | sleep | disease | patient | clinical | therapy | antibody |
| topic14 | tom | structural | experiencing | insight | conversations | winning | dramatically | executive | source | serving |
| topic15 | star | games | programming | video | cbs | entertainment | wars | audience | cable | |
| topic16 | patients | cancer | lung | patient | trial | clinical | treatment | keytruda | heart | revlimid |
| topic17 | electric | commission | settlement | grid | transmission | utilities | storage | megawatts | wisconsin | carolina |
| topic18 | restaurants | pharmacy | restaurant | generic | pizza | taco | drug | olive | same-store | bell |
| topic19 | truck | vehicles | vehicle | automotive | transportation | industries | car | trucks | auto | electric |
| topic20 | colorado | florida | carolina | martin | california | housing | georgia | aggregates | tons | southeast |
In Table 2, we can check for which companies the above words are most salient. These topics do seem to correspond to business sectors.
# for each topic, list the five companies with the highest topic proportions (theta)
t(apply(lda$theta, 2,
        function(x, y) head(y[order(x, decreasing = TRUE)], 5),
        dfmt_grp$company))
| Topic | Company 1 | Company 2 | Company 3 | Company 4 | Company 5 |
|---|---|---|---|---|---|
| topic1 | Skyworks Solutions | Lockheed Martin Corp. | Garmin Ltd. | General Dynamics | Harris Corporation |
| topic2 | Estee Lauder Cos. | Coty, Inc | Ulta Beauty | Tapestry, Inc. | Sealed Air |
| topic3 | The Clorox Company | General Growth Properties Inc. | Prologis | Booking Holdings Inc | Digital Realty Trust Inc |
| topic4 | Advanced Micro Devices Inc | Microsoft Corp. | NetApp | Micron Technology | Cisco Systems |
| topic5 | Verizon Communications | American Tower Corp A | Crown Castle International Corp. | Becton Dickinson | AT&T Inc. |
| topic6 | Hess Corporation | Noble Energy Inc | Cimarex Energy | EQT Corporation | Apache Corporation |
| topic7 | Illumina Inc | Hologic | Intuit Inc. | PerkinElmer | Thermo Fisher Scientific |
| topic8 | Aetna Inc | Gartner Inc | Humana Inc. | Anthem Inc. | Automatic Data Processing |
| topic9 | Regions Financial Corp. | Citigroup Inc. | SVB Financial | U.S. Bancorp | Fifth Third Bancorp |
| topic10 | Red Hat Inc. | Dover Corp. | Verisign Inc. | Oracle Corp. | Broadridge Financial Solutions |
| topic11 | Campbell Soup | Hormel Foods Corp. | Kellogg Co. | The Hershey Company | General Mills |
| topic12 | Tractor Supply Company | Foot Locker Inc | Ross Stores | HCA Holdings | Lowe’s Cos. |
| topic13 | Regeneron | ResMed | Incyte | Gilead Sciences | Amgen Inc. |
| topic14 | Advance Auto Parts | Emerson Electric Company | Perrigo | Facebook, Inc. | Parker-Hannifin |
| topic15 | The Walt Disney Company | Activision Blizzard | Hasbro Inc. | Viacom Inc. | CBS Corp. |
| topic16 | Celgene Corp. | Merck & Co. | Nektar Therapeutics | ABIOMED Inc | Bristol-Myers Squibb |
| topic17 | Dominion Energy | SCANA Corp | PPL Corp. | Ameren Corp | Wec Energy Group Inc |
| topic18 | Yum! Brands Inc | Republic Services Inc | Walgreens Boots Alliance | Darden Restaurants | AmerisourceBergen Corp |
| topic19 | PACCAR Inc. | Weyerhaeuser | General Motors | BorgWarner | Cummins Inc. |
| topic20 | Martin Marietta Materials | Vulcan Materials | Freeport-McMoRan Inc. | Altria Group Inc | Ball Corp |
Since LDA also works as a document classifier, I can easily count the number of companies assigned to each topic, as shown in Figure 2.
par(mar = c(4, 4, 1, 1))
# count companies by their most likely topic and plot the frequencies
barplot(rev(table(topics(lda))), horiz = TRUE, las = 1)
The above examples demonstrate that quanteda and its associated packages can be used in the analysis of financial texts too. This means that we can also employ semi-supervised machine learning methods in this domain: use seeded LDA if the topics must be predefined by the user, or Latent Semantic Scaling if an off-the-shelf dictionary is not available in the target domain or language (e.g. Chinese or Japanese). These tools become especially useful when you analyze more specific aspects of companies, such as environmental, social, and governance (ESG) issues.
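As a sketch of the seeded approach, topics can be predefined by passing a dictionary of seed words to textmodel_seededlda() in the seededlda package (Latent Semantic Scaling is available in the LSX package). The ESG seed words below are hypothetical examples of mine, not part of the analysis above:
# hypothetical seed words for ESG-related topics; adapt them to your own corpus
dict_esg <- dictionary(list(
    environmental = c("emission*", "renewable*", "climate"),
    social = c("diversity", "employee*", "safety"),
    governance = c("board", "compensation", "audit*")
))
# topics are anchored to the seed words; residual = TRUE adds an unseeded topic
slda <- textmodel_seededlda(dfmt_grp, dictionary = dict_esg, residual = TRUE)
terms(slda)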