Analysis of financial texts using R

I tend to use political texts in my examples because of my academic background, but quanteda and its associated packages can be used more broadly.

There is growing interest in the analysis of textual data using NLP tools in the financial industry. I have been working for Lazard Asset Management as a data science consultant since last year, so I think it is time for me to explain how these packages can be used in the analysis of financial texts. There are a few tutorials in R on the web (many more are in Python), but they are overly complicated. If the data is available, analysis of financial texts is very easy.

I use the transcripts of earnings calls for S&P 500 companies from 2017 in my examples. I extracted the texts from the dataset and saved them in a corpus object that you can download. The corpus object includes the date (“date”) and company name (“company”) as document variables; “quarter” is created from “date”.
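
If you saved the downloaded corpus to disk, it can be restored with readRDS(). The file name below is an assumption, and the last line is only a sketch of how “quarter” could be derived from “date” (assuming “date” is stored as a Date).

require(quanteda)
# load the saved corpus object; the file name is an assumption
corp <- readRDS("data_corpus_earningscall.rds")
# derive the calendar quarter (0 to 3) from the month of the call date
docvars(corp, "quarter") <- (as.integer(format(docvars(corp, "date"), "%m")) - 1) %/% 3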

Simple pre-processing is completed in three lines of code using quanteda. dfmt records the frequency of words in each document.

require(quanteda)
# tokenize, dropping punctuation, numbers, and symbols
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
# remove English stopwords and one-character tokens
toks <- tokens_remove(toks, pattern = stopwords("en"), min_nchar = 2)
# construct the document-feature matrix
dfmt <- dfm(toks)

Sentiment Analysis

We can perform sentiment analysis easily thanks to the financial sentiment dictionary created by Loughran and McDonald (2011). The L&M dictionary is available as part of the quanteda.sentiment package. The dictionary contains 4,232 words under categories such as “negative”, “positive”, and “uncertainty”.

devtools::install_github("quanteda/quanteda.sentiment")
require(quanteda.sentiment)
data_dictionary_LoughranMcDonald[c(1, 3)]

## Dictionary object with 2 key entries.
## Polarities: pos = ""; neg = "NEGATIVE" 
## - [NEGATIVE]:
##   - abandon, abandoned, abandoning, abandonment, abandonments, abandons, abdicated, abdicates, abdicating, abdication, abdications, aberrant, aberration, aberrational, aberrations, abetting, abnormal, abnormalities, abnormality, abnormally [ ... and 2,335 more ]
## - [UNCERTAINTY]:
##   - abeyance, abeyances, almost, alteration, alterations, ambiguities, ambiguity, ambiguous, anomalies, anomalous, anomalously, anomaly, anticipate, anticipated, anticipates, anticipating, anticipation, anticipations, apparent, apparently [ ... and 277 more ]

We can count the frequency of the negative and uncertainty words using dfm_lookup(). Interestingly, the dictionary does not contain multi-word expressions at all. The frequencies of the sentiment words are divided by the total number of words to normalize for the lengths of the documents.

# count dictionary words and normalize by document length
dfmt_dict <- dfm_lookup(dfmt, data_dictionary_LoughranMcDonald[c(1, 3)]) / ntoken(dfmt)
# combine the scores with the document variables in a data frame
dat_dict <- cbind(convert(dfmt_dict, to = "data.frame"), docvars(dfmt_dict))
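
As a quick sanity check, we can list the earnings calls with the highest proportion of negative words (a sketch; the ordering depends on the data):

# five calls with the highest negativity scores
head(dat_dict[order(dat_dict$NEGATIVE, decreasing = TRUE), ], 5)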

Using the result of the dictionary analysis, I want to reveal how sentiment in earnings calls changed in 2017. Managers discuss financial results from the preceding period in earnings calls, so calls held between January and March concern the previous year’s final results (“NEGATIVE.0” and “UNCERTAINTY.0”), while those between October and December concern the third quarter’s (“NEGATIVE.3” and “UNCERTAINTY.3”). It would be more appropriate to compare earnings calls about the final results of 2016 and 2017, but that is not possible with the current data.

All the earnings calls are placed in a two-dimensional space defined by negativity and uncertainty. For visualization, I selected only the five companies whose changes in sentiment were the largest as measured by the Euclidean distance.

# reshape to wide format: one row per company, one column set per quarter
dat_wide <- reshape(dat_dict[-5], timevar = "quarter", idvar = "company", 
                    direction = "wide")
# Euclidean distance between first-quarter and fourth-quarter sentiment
dat_wide$change <- sqrt((dat_wide$NEGATIVE.0 - dat_wide$NEGATIVE.3) ^ 2 +
                        (dat_wide$UNCERTAINTY.0 - dat_wide$UNCERTAINTY.3) ^ 2)
# five companies with the largest change in sentiment
dat_top <- head(dat_wide[order(dat_wide$change, decreasing = TRUE),], 5)

Figure 1 shows that uncertainty decreased dramatically at Dish Network while it increased at Tractor Supply and Lockheed Martin; negativity increased but uncertainty remained the same at Foot Locker; both uncertainty and negativity decreased at Caterpillar.

par(mar = c(4, 4, 1, 1))
plot(dat_top$NEGATIVE.0, dat_top$UNCERTAINTY.0, ylim = c(0, 0.06), xlim = c(0, 0.06), 
     col = 1:5, pch = 1:5, xlab = "Negativity", ylab = "Uncertainty")
arrows(dat_top$NEGATIVE.0, dat_top$UNCERTAINTY.0, 
       dat_top$NEGATIVE.3, dat_top$UNCERTAINTY.3,
       code = 2, length = 0.1, col = 1:5)
legend("topright", col = 1:5, legend = dat_top$company, lty = 1, pch = 1:5)
Figure 1: Change in negativity and uncertainty in earnings calls.

Company Clustering

We can also perform more sophisticated analyses of the companies through their texts. An example is clustering companies by their business using unsupervised topic models. Latent Dirichlet Allocation (LDA) is commonly used to identify the topics of documents, but it can also identify the businesses of companies.

I used my seededlda package, which works seamlessly with quanteda. Before fitting LDA with 20 topics, I remove both infrequent and very frequent words using dfm_trim() and combine documents from the same company using dfm_group().

require(seededlda)
# remove words in the bottom half by term frequency and
# words occurring in more than 10% of documents, then merge by company
dfmt_grp <- dfmt %>% 
    dfm_trim(min_termfreq = 0.5, termfreq_type = "quantile",
             max_docfreq = 0.1, docfreq_type = "prop") %>% 
    dfm_group(company)
# fit LDA with 20 topics and show the top terms for each topic
lda <- textmodel_lda(dfmt_grp, k = 20)
t(terms(lda))

In Table 1, I can confirm that many of the 20 topics identified by LDA are strongly related to business sectors: aviation/defense (topic 1), cosmetics (topic 2), real estate (topic 3), computing (topic 4), communication (topic 5), oil (topic 6), and so on. Unrelated words occur frequently only in topic 14.

topic1: aircraft, defense, skyworks, bookings, f-35, marine, iot, pension, air, deliveries
topic2: beauty, makeup, luxury, social, prestige, travel, ulta, skin, department, guests
topic3: noi, square, lease, estate, feet, occupancy, leasing, interconnection, forward-looking, differ
topic4: hardware, hybrid, storage, memory, nand, dram, ip, ryzen, automotive, vertical
topic5: wireless, fiber, churn, packaging, unlimited, spectrum, video, networks, device, mps
topic6: wells, drilling, barrels, permian, basin, equivalent, midstream, rig, acreage, rigs
topic7: paypal, novaseq, sequencing, merchants, hiseq, diagnostics, self-employed, quickbooks, cynosure, consumables
topic8: retention, ceb, membership, medicare, gartner, insurance, consulting, bookings, year-on-year, aca
topic9: loan, loans, card, deposit, fees, balances, deposits, banking, mortgage, fee
topic10: invisalign, bookings, storage, automation, repair, compares, americas, tool, divestiture, dilution
topic11: fresh, snacks, beverage, foods, consumption, soup, beverages, merchandising, shelf, dsd
topic12: apparel, wholesale, foot, gap, merchandise, ticket, hurricanes, com, women’s, footwear
topic13: patients, study, studies, treatment, sleep, disease, patient, clinical, therapy, antibody
topic14: tom, structural, experiencing, insight, conversations, winning, dramatically, executive, source, serving
topic15: star, games, programming, twitter, video, cbs, entertainment, wars, audience, cable
topic16: patients, cancer, lung, patient, trial, clinical, treatment, keytruda, heart, revlimid
topic17: electric, commission, settlement, grid, transmission, utilities, storage, megawatts, wisconsin, carolina
topic18: restaurants, pharmacy, restaurant, generic, pizza, taco, drug, olive, same-store, bell
topic19: truck, vehicles, vehicle, automotive, transportation, industries, car, trucks, auto, electric
topic20: colorado, florida, carolina, martin, california, housing, georgia, aggregates, tons, southeast
Table 1: Keywords related to business in earnings calls.

In Table 2, we can check in which companies the above words are salient. These topics really do seem to correspond to business sectors.

# for each topic, list the five companies with the highest topic proportions
t(apply(lda$theta, 2, 
        function(x, y) head(y[order(x, decreasing = TRUE)], 5), 
        dfmt_grp$company))
topic1: Skyworks Solutions; Lockheed Martin Corp.; Garmin Ltd.; General Dynamics; Harris Corporation
topic2: Estee Lauder Cos.; Coty, Inc; Ulta Beauty; Tapestry, Inc.; Sealed Air
topic3: The Clorox Company; General Growth Properties Inc.; Prologis; Booking Holdings Inc; Digital Realty Trust Inc
topic4: Advanced Micro Devices Inc; Microsoft Corp.; NetApp; Micron Technology; Cisco Systems
topic5: Verizon Communications; American Tower Corp A; Crown Castle International Corp.; Becton Dickinson; AT&T Inc.
topic6: Hess Corporation; Noble Energy Inc; Cimarex Energy; EQT Corporation; Apache Corporation
topic7: Illumina Inc; Hologic; Intuit Inc.; PerkinElmer; Thermo Fisher Scientific
topic8: Aetna Inc; Gartner Inc; Humana Inc.; Anthem Inc.; Automatic Data Processing
topic9: Regions Financial Corp.; Citigroup Inc.; SVB Financial; U.S. Bancorp; Fifth Third Bancorp
topic10: Red Hat Inc.; Dover Corp.; Verisign Inc.; Oracle Corp.; Broadridge Financial Solutions
topic11: Campbell Soup; Hormel Foods Corp.; Kellogg Co.; The Hershey Company; General Mills
topic12: Tractor Supply Company; Foot Locker Inc; Ross Stores; HCA Holdings; Lowe’s Cos.
topic13: Regeneron; ResMed; Incyte; Gilead Sciences; Amgen Inc.
topic14: Advance Auto Parts; Emerson Electric Company; Perrigo; Facebook, Inc.; Parker-Hannifin
topic15: The Walt Disney Company; Activision Blizzard; Hasbro Inc.; Viacom Inc.; CBS Corp.
topic16: Celgene Corp.; Merck & Co.; Nektar Therapeutics; ABIOMED Inc; Bristol-Myers Squibb
topic17: Dominion Energy; SCANA Corp; PPL Corp.; Ameren Corp; Wec Energy Group Inc
topic18: Yum! Brands Inc; Republic Services Inc; Walgreens Boots Alliance; Darden Restaurants; AmerisourceBergen Corp
topic19: PACCAR Inc.; Weyerhaeuser; General Motors; BorgWarner; Cummins Inc.
topic20: Martin Marietta Materials; Vulcan Materials; Freeport-McMoRan Inc.; Altria Group Inc; Ball Corp
Table 2: Companies with business-related keywords.

Since LDA also works as a document classifier, I can easily count the number of companies classified into each topic, as shown in Figure 2.

par(mar = c(4, 4, 1, 1))
# count companies by their most likely topic and plot as horizontal bars
barplot(rev(table(topics(lda))), horiz = TRUE, las = 1)

Figure 2: Companies classified into topics.

The above examples demonstrate that quanteda and its associated packages can be used in the analysis of financial texts too. This means that we can also employ semi-supervised machine learning methods in the analysis of financial texts: use seeded LDA if the topics must be predefined by the user, or Latent Semantic Scaling if no off-the-shelf dictionary is available in the target domain or language (e.g. Chinese or Japanese). These tools become useful when you analyze more specific aspects of companies, such as environmental, social, and governance (ESG) issues.
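
As a minimal sketch of how seeded LDA could be applied to this corpus (the category names and seed words below are my own illustrative assumptions, not taken from the analysis above):

require(seededlda)
# hypothetical seed dictionary: each key defines a topic through a few seed words
dict <- dictionary(list(energy = c("drilling", "barrels", "rig*"),
                        banking = c("loan*", "deposit*", "mortgage*"),
                        pharma = c("patients", "clinical", "trial*")))
# fit seeded LDA; residual = TRUE adds an extra topic for unseeded content
slda <- textmodel_seededlda(dfmt_grp, dictionary = dict, residual = TRUE)
terms(slda)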
