I believe that the sentence is the optimal unit of sentiment analysis, but splitting whole news articles into sentences is often tricky because news contains a lot of quotations. If we simply chop up texts based on punctuation, quoted text gets split across different sentences. This code is meant to avoid such […]
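To illustrate the problem described above, here is a minimal sketch in Python of a quote-aware sentence splitter. It is not the code from the post; the function name `split_sentences` and the sample text are hypothetical. It assumes that quotations are marked with double quotes (straight or curly) and that a sentence ends at `.`, `!` or `?` followed by whitespace. Abbreviations such as "Mr." and single-quoted passages are not handled.

```python
TERMINALS = ".!?"   # characters assumed to end a sentence
CLOSERS = '"\u201d'  # straight and curly closing double quotes


def split_sentences(text):
    """Split text into sentences without breaking inside double quotes."""
    sentences = []
    start = 0
    in_quote = False
    n = len(text)
    for i, ch in enumerate(text):
        # Track whether we are inside a quotation.
        if ch == "\u201c":      # curly opening quote
            in_quote = True
        elif ch == "\u201d":    # curly closing quote
            in_quote = False
        elif ch == '"':         # straight quote toggles the state
            in_quote = not in_quote
        if in_quote:
            continue            # never split inside a quotation
        # A boundary is terminal punctuation, or a closing quote that
        # directly follows terminal punctuation (e.g. ...wait." Then).
        ends_here = ch in TERMINALS or (
            ch in CLOSERS and i > 0 and text[i - 1] in TERMINALS
        )
        followed_by_space = i + 1 == n or text[i + 1].isspace()
        if ends_here and followed_by_space:
            sentence = text[start:i + 1].strip()
            if sentence:
                sentences.append(sentence)
            start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = ('The minister said, "We will act. We will not wait." '
          'Critics disagreed. Markets fell.')
for s in split_sentences(sample):
    print(s)
```

A naive split on punctuation would cut the quotation in the sample into two pieces; this sketch keeps it within a single sentence, at the cost of ignoring the edge cases noted above.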
Nexis news importer updated
I posted the code for the Nexis importer last year, but it turned out that the HTML format of the database service is less consistent than I thought, so I changed the logic. The new version depends less on the structure of the HTML files and more on the format of the content. library(XML) #might need […]
The Latent Semantic Scaling
I have previously posted document scaling results on different dimensions, such as political left-right and immigration positive-negative, on this blog, but I did not explain the details of the technique, called Latent Semantic Scaling (LSS). LSS is a type of lexicon expansion technique based on Latent Semantic Analysis. Please have a look at […]
Human-coded test data for geographical classification
Early this year, I created a sizable human-coded test data set for my news classifier using the Prolific Academic service, and the data set is now ready for download. The data consists of 5,000 news summaries collected from RSS feeds of the New York Times, The Times (UK), The Australian, Times of India, and Daily […]
Geographical dictionary making technique
My new draft paper, Newsmap: Dictionary expansion technique for geographical classification of very short longitudinal texts, explains how to create a large geographical dictionary for text classification. Its algorithm is an updated version of the International Newsmap, and it is simpler and more statistically grounded. As I argue in the paper, this technique could […]
International news coding instruction
It has already been four years since I created my Newsmap. It is time to update the whole system: rewriting it fully in Python and developing a new classification algorithm. This is why I generated 5,000 human-coded international news stories using Prolific Academic. Thanks to crowd-sourcing services, recruiting is no longer a problem, […]
Crowd-coding of international news by Prolific Academic
I recently created a sizable human-coded dataset (5,000 items) of international news using the Prolific Academic service. Prolific Academic is an Oxford-based academic alternative to Amazon Mechanical Turk. The advantage of using this service is that researchers only have to compensate for work that they approve. The potential drawback is its relatively high […]
Terrorism Dictionary 2014
After seeing mass media’s strong response to the extremists’ attack against Charlie Hebdo, I started thinking about what I can do for this increasingly important topic. One simple task is making a dictionary containing keywords related to terrorism, so I created the Terrorism Dictionary 2014. This dictionary is made from newswires submitted by the Associated Press […]
Left-right policy position dictionary
Latent Semantic Scaling (LSS) works well not only with positive-negative sentiment but also with left-right position on economic policy. The seed words for this dimension are {deficit, austerity, unstable, recession, inflation, currency, workforce} for the right and {poor, poverty, free, benefits, prices, money, workers} for the left. The left-right policy position dictionary was created from UK […]
Immigration dictionary
This is probably the final version of my immigration dictionary. This text analysis dictionary was created from a British newspaper corpus using a technique called Latent Semantic Scaling, which is based on Latent Semantic Analysis. The result of the automated content analysis by this dictionary strongly corresponds to manual coding by Amazon’s Mechanical Turks […]
