I tend to use political texts in my examples because of my academic background, but quanteda and its associated packages can be used much more broadly.
There is growing interest in the analysis of textual data using NLP tools in the financial industry. I have been working for Lazard Asset Management as a data science consultant since last year, so I think it is time for me to explain how these packages can be used to analyze financial texts. There are a few tutorials in R on the web (many more are in Python), but they are overcomplicated. If the data is available, analysis of financial texts is very easy.
I use the transcripts of earnings calls for S&P 500 companies from 2017 in my examples. I extracted the texts from the dataset and saved them in a corpus object that you can download. The corpus object includes the date (“date”) and company name (“company”) as document variables; “quarter” is created from “date”.
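Judging from the labels used later, “quarter” appears simply to be the zero-indexed calendar quarter in which a call took place. A minimal sketch of how it could be created, assuming “date” is stored as a Date object:
month <- as.integer(format(docvars(corp, "date"), "%m")) # month of the call (1-12)
docvars(corp, "quarter") <- (month - 1) %/% 3 # 0 = Jan-Mar, ..., 3 = Oct-Dec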
Simple pre-processing is completed in three lines of code using quanteda; dfmt records the frequency of words in each document.
require(quanteda)
# tokenize, dropping punctuation, numbers, and symbols
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
# remove English stopwords and one-character tokens
toks <- tokens_remove(toks, pattern = stopwords("en"), min_nchar = 2)
# construct the document-feature matrix
dfmt <- dfm(toks)
Sentiment Analysis
We can perform sentiment analysis easily thanks to the financial sentiment dictionary created by Loughran and McDonald (2011). The L&M dictionary is available as part of the quanteda.sentiment package. It contains 4,232 words under categories such as “negative”, “positive”, and “uncertainty”.
devtools::install_github("quanteda/quanteda.sentiment")
require(quanteda.sentiment)
data_dictionary_LoughranMcDonald[c(1, 3)]
## Dictionary object with 2 key entries.
## Polarities: pos = ""; neg = "NEGATIVE"
## - [NEGATIVE]:
## - abandon, abandoned, abandoning, abandonment, abandonments, abandons, abdicated, abdicates, abdicating, abdication, abdications, aberrant, aberration, aberrational, aberrations, abetting, abnormal, abnormalities, abnormality, abnormally [ ... and 2,335 more ]
## - [UNCERTAINTY]:
## - abeyance, abeyances, almost, alteration, alterations, ambiguities, ambiguity, ambiguous, anomalies, anomalous, anomalously, anomaly, anticipate, anticipated, anticipates, anticipating, anticipation, anticipations, apparent, apparently [ ... and 277 more ]
We can count the frequency of the negative and uncertainty words using dfm_lookup(). Interestingly, the dictionary does not contain multi-word expressions at all. The frequencies of the sentiment words are divided by the total number of words to normalize for the lengths of the documents.
# count dictionary words and normalize by document length
dfmt_dict <- dfm_lookup(dfmt, data_dictionary_LoughranMcDonald[c(1, 3)]) / ntoken(dfmt)
# combine the normalized scores with the document variables in a data frame
dat_dict <- cbind(convert(dfmt_dict, to = "data.frame"), docvars(dfmt_dict))
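As an aside, quanteda.sentiment also provides textstat_polarity(), which collapses the positive and negative categories into a single score per document. This is not what I do above, but a minimal sketch would be:
# one polarity score per document, based on the dictionary's polarity settings
sent <- textstat_polarity(toks, dictionary = data_dictionary_LoughranMcDonald)
head(sent)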
Using the results of the dictionary analysis, I want to reveal how sentiment in earnings calls changed in 2017. Managers discuss financial results from the preceding period in earnings calls, so calls held between January and March concern the previous year’s final results (“NEGATIVE.0” and “UNCERTAINTY.0”) and calls held between October and December concern the third quarter’s (“NEGATIVE.3” and “UNCERTAINTY.3”). It would be more appropriate to compare earnings calls about the final results of 2016 and 2017, but that is not possible with the current data.
All the earnings calls are placed in a two-dimensional space defined by negativity and uncertainty. For visualization, I selected only the five companies whose changes in sentiment are the largest, as measured by Euclidean distance.
# drop the date column, then reshape to one row per company with separate
# negativity and uncertainty columns for each quarter
dat_wide <- reshape(dat_dict[-5], timevar = "quarter", idvar = "company",
                    direction = "wide")
# Euclidean distance between first- and last-quarter sentiment
dat_wide$change <- sqrt((dat_wide$NEGATIVE.0 - dat_wide$NEGATIVE.3) ^ 2 +
                        (dat_wide$UNCERTAINTY.0 - dat_wide$UNCERTAINTY.3) ^ 2)
# five companies with the largest changes
dat_top <- head(dat_wide[order(dat_wide$change, decreasing = TRUE), ], 5)
Figure 1 shows that uncertainty decreased dramatically at Dish Network while it increased at Tractor Supply and Lockheed Martin; negativity increased but uncertainty remained roughly the same at Foot Locker; both uncertainty and negativity decreased at Caterpillar.
par(mar = c(4, 4, 1, 1))
# plot each company's first-quarter position in the negativity-uncertainty space
plot(dat_top$NEGATIVE.0, dat_top$UNCERTAINTY.0, ylim = c(0, 0.06), xlim = c(0, 0.06),
     col = 1:5, pch = 1:5, xlab = "Negativity", ylab = "Uncertainty")
# draw arrows from the first-quarter to the last-quarter positions
arrows(dat_top$NEGATIVE.0, dat_top$UNCERTAINTY.0,
       dat_top$NEGATIVE.3, dat_top$UNCERTAINTY.3,
       code = 2, length = 0.1, col = 1:5)
legend("topright", col = 1:5, legend = dat_top$company, lty = 1, pch = 1:5)
Company Clustering
We can also perform more sophisticated analysis of the companies through their texts. An example is clustering companies by their lines of business using unsupervised topic models. Latent Dirichlet Allocation (LDA) is commonly used to identify the topics of documents, but it can also identify the businesses of companies.
I used my seededlda package, which works smoothly with quanteda. Before fitting LDA with 20 topics, I remove both infrequent and very frequent words using dfm_trim() and combine documents from the same company using dfm_group().
require(seededlda)
# keep words above the median term frequency that occur in at most 10% of documents
dfmt_grp <- dfmt %>%
    dfm_trim(min_termfreq = 0.5, termfreq_type = "quantile",
             max_docfreq = 0.1, docfreq_type = "prop") %>%
    dfm_group(company) # merge all calls from the same company into one document
lda <- textmodel_lda(dfmt_grp, k = 20) # fit LDA with 20 topics
t(terms(lda)) # ten most frequent terms in each topic
In Table 1, I can confirm that many of the 20 topics identified by LDA are strongly related to business sectors: aviation/defense (topic 1), cosmetics (topic 2), real estate (topic 3), computing (topic 4), communications (topic 5), oil (topic 6), and so on. Only topic 14 is dominated by words unrelated to a particular sector.
| Topic | Term 1 | Term 2 | Term 3 | Term 4 | Term 5 | Term 6 | Term 7 | Term 8 | Term 9 | Term 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| topic1 | aircraft | defense | skyworks | bookings | f-35 | marine | iot | pension | air | deliveries |
| topic2 | beauty | makeup | luxury | social | prestige | travel | ulta | skin | department | guests |
| topic3 | noi | square | lease | estate | feet | occupancy | leasing | interconnection | forward-looking | differ |
| topic4 | hardware | hybrid | storage | memory | nand | dram | ip | ryzen | automotive | vertical |
| topic5 | wireless | fiber | churn | packaging | unlimited | spectrum | video | networks | device | mps |
| topic6 | wells | drilling | barrels | permian | basin | equivalent | midstream | rig | acreage | rigs |
| topic7 | paypal | novaseq | sequencing | merchants | hiseq | diagnostics | self-employed | quickbooks | cynosure | consumables |
| topic8 | retention | ceb | membership | medicare | gartner | insurance | consulting | bookings | year-on-year | aca |
| topic9 | loan | loans | card | deposit | fees | balances | deposits | banking | mortgage | fee |
| topic10 | invisalign | bookings | storage | automation | repair | compares | americas | tool | divestiture | dilution |
| topic11 | fresh | snacks | beverage | foods | consumption | soup | beverages | merchandising | shelf | dsd |
| topic12 | apparel | wholesale | foot | gap | merchandise | ticket | hurricanes | com | women’s | footwear |
| topic13 | patients | study | studies | treatment | sleep | disease | patient | clinical | therapy | antibody |
| topic14 | tom | structural | experiencing | insight | conversations | winning | dramatically | executive | source | serving |
| topic15 | star | games | programming | video | cbs | entertainment | wars | audience | cable | |
| topic16 | patients | cancer | lung | patient | trial | clinical | treatment | keytruda | heart | revlimid |
| topic17 | electric | commission | settlement | grid | transmission | utilities | storage | megawatts | wisconsin | carolina |
| topic18 | restaurants | pharmacy | restaurant | generic | pizza | taco | drug | olive | same-store | bell |
| topic19 | truck | vehicles | vehicle | automotive | transportation | industries | car | trucks | auto | electric |
| topic20 | colorado | florida | carolina | martin | california | housing | georgia | aggregates | tons | southeast |
In Table 2, we can check for which companies the above words are most salient. These topics do seem to correspond to business sectors.
# for each topic, list the five companies with the highest topic proportions (theta)
t(apply(lda$theta, 2,
        function(x, y) head(y[order(x, decreasing = TRUE)], 5),
        dfmt_grp$company))
| Topic | Company 1 | Company 2 | Company 3 | Company 4 | Company 5 |
|---|---|---|---|---|---|
| topic1 | Skyworks Solutions | Lockheed Martin Corp. | Garmin Ltd. | General Dynamics | Harris Corporation |
| topic2 | Estee Lauder Cos. | Coty, Inc | Ulta Beauty | Tapestry, Inc. | Sealed Air |
| topic3 | The Clorox Company | General Growth Properties Inc. | Prologis | Booking Holdings Inc | Digital Realty Trust Inc |
| topic4 | Advanced Micro Devices Inc | Microsoft Corp. | NetApp | Micron Technology | Cisco Systems |
| topic5 | Verizon Communications | American Tower Corp A | Crown Castle International Corp. | Becton Dickinson | AT&T Inc. |
| topic6 | Hess Corporation | Noble Energy Inc | Cimarex Energy | EQT Corporation | Apache Corporation |
| topic7 | Illumina Inc | Hologic | Intuit Inc. | PerkinElmer | Thermo Fisher Scientific |
| topic8 | Aetna Inc | Gartner Inc | Humana Inc. | Anthem Inc. | Automatic Data Processing |
| topic9 | Regions Financial Corp. | Citigroup Inc. | SVB Financial | U.S. Bancorp | Fifth Third Bancorp |
| topic10 | Red Hat Inc. | Dover Corp. | Verisign Inc. | Oracle Corp. | Broadridge Financial Solutions |
| topic11 | Campbell Soup | Hormel Foods Corp. | Kellogg Co. | The Hershey Company | General Mills |
| topic12 | Tractor Supply Company | Foot Locker Inc | Ross Stores | HCA Holdings | Lowe’s Cos. |
| topic13 | Regeneron | ResMed | Incyte | Gilead Sciences | Amgen Inc. |
| topic14 | Advance Auto Parts | Emerson Electric Company | Perrigo | Facebook, Inc. | Parker-Hannifin |
| topic15 | The Walt Disney Company | Activision Blizzard | Hasbro Inc. | Viacom Inc. | CBS Corp. |
| topic16 | Celgene Corp. | Merck & Co. | Nektar Therapeutics | ABIOMED Inc | Bristol-Myers Squibb |
| topic17 | Dominion Energy | SCANA Corp | PPL Corp. | Ameren Corp | Wec Energy Group Inc |
| topic18 | Yum! Brands Inc | Republic Services Inc | Walgreens Boots Alliance | Darden Restaurants | AmerisourceBergen Corp |
| topic19 | PACCAR Inc. | Weyerhaeuser | General Motors | BorgWarner | Cummins Inc. |
| topic20 | Martin Marietta Materials | Vulcan Materials | Freeport-McMoRan Inc. | Altria Group Inc | Ball Corp |
Since LDA also works as a document classifier, I can easily count the number of companies assigned to each topic, as shown in Figure 2.
par(mar = c(4, 4, 1, 1))
# count companies by their most likely topic and plot the frequencies
barplot(rev(table(topics(lda))), horiz = TRUE, las = 1)
The above examples demonstrate that quanteda and its associated packages can be used in the analysis of financial texts too. This means that we can also employ semi-supervised machine learning methods in this domain: use seeded LDA if the topics must be predefined by the user, or Latent Semantic Scaling if an off-the-shelf dictionary is not available in the target domain or language (e.g. Chinese or Japanese). These tools become especially useful when you analyze more specific aspects of companies, such as environmental, social, and governance (ESG) issues.
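As a sketch of the seeded approach, topics can be predefined by passing a dictionary of seed words to textmodel_seededlda() in the seededlda package (Latent Semantic Scaling is available in the LSX package). The ESG seed words below are hypothetical examples of mine, not part of the analysis above:
# hypothetical seed words for ESG-related topics; adapt them to your own corpus
dict_esg <- dictionary(list(
    environmental = c("emission*", "renewable*", "climate"),
    social = c("diversity", "employee*", "safety"),
    governance = c("board", "compensation", "audit*")
))
# topics are anchored to the seed words; residual = TRUE adds an unseeded topic
slda <- textmodel_seededlda(dfmt_grp, dictionary = dict_esg, residual = TRUE)
terms(slda)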