Dictionary-based text analysis has a number of good properties, but building a new dictionary is always difficult, so text analysts often rely on existing dictionaries such as the General Inquirer dictionaries, which were originally created decades ago, or their derivatives. However, I believe it is time to create new dictionaries from scratch using the tools and techniques now available to us.
My first original dictionary is the UK Immigration Dictionary. It is meant to measure attitudes toward immigration to the UK. The dictionary contains some counterintuitively positive entries, such as ‘racist’, but when applied to the 2010 UK party manifestos it produces the following scores:
BNP           -0.660772785
Coalition      0.403547905
Conservative   0.002508397
Greens        -0.898075732
Labour         0.081029432
LibDem         0.050535076
PC            -0.015306746
SNP           -0.551027977
UKIP          -0.335952325
I am not yet sure how accurate this is, but the result looks interesting: the small parties, which tend to be against immigration, all score negative.
It is very easy to use the dictionary in R with quanteda:
library(quanteda)

options(stringsAsFactors=FALSE)

# Load the dictionary: a tab-separated file of words and their scores
df.temp <- read.csv(file="news.dictionary.tfidf.500.csv", header=FALSE, sep='\t')
df.dict <- data.frame(word=as.character(df.temp$V1), score=as.numeric(df.temp$V2))

uk2010immigCorpus <- corpus(uk2010immig,
                            docvars=data.frame(party=names(uk2010immig)),
                            notes="Immigration-related sections of 2010 UK party manifestos",
                            enc="UTF-8")
mx <- tfidf(dfm(uk2010immigCorpus))
mx2 <- as.data.frame.matrix(t(subset(t(mx), colnames(mx) %in% df.dict$word))) # Remove columns not in the dictionary

# Make a list in the same order as the columns
v.dict <- list()
for(word in colnames(mx2)){
  v.dict[[word]] <- df.dict$score[df.dict$word==word]
  #v.dict[[word]] <- ifelse(df.dict$score[df.dict$word==word] > 0, 1, -1)
}
print(as.matrix(mx2) %*% as.matrix(unlist(v.dict)))
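The final matrix product is the core of the scoring: each document's score is the sum of its term weights multiplied by the matching dictionary scores. The following is a minimal base R sketch of that one step, not the code used for the results above. The weights, the words, and the scores in `mx` and `dict` are made up for illustration and are not taken from the UK Immigration Dictionary.

```r
# Toy document-term weights (rows: documents, columns: words).
# All values here are hypothetical.
mx <- matrix(c(2, 0, 1,
               0, 3, 1),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("docA", "docB"),
                             c("asylum", "border", "the")))

# Hypothetical dictionary: a named vector of word scores.
dict <- c(border = -0.5, asylum = 0.8)

# Keep only the words that appear in both the matrix and the dictionary,
# in the column order of the matrix.
common <- intersect(colnames(mx), names(dict))

# Weighted sum per document: (documents x words) %*% (word scores).
scores <- mx[, common, drop = FALSE] %*% dict[common]
print(scores)
# docA scores 2 * 0.8 = 1.6 and docB scores 3 * -0.5 = -1.5;
# "the" is ignored because it is not in the dictionary.
```

Indexing the score vector by `common` keeps the scores aligned with the matrix columns, which is what the loop over `colnames(mx2)` achieves in the code above.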