I tried to import UK parliamentary debates into R, but the Hansard reports seem to be too large for R. R is also very poor at handling different character encodings, so I gave up on R and wrote an importer in Python. The Python script imports the XML into a MySQL database. #!/usr/bin/python # -*- coding: […]
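The script itself is cut off in this excerpt, so what follows is only a minimal sketch of the general approach (parse the XML, then insert rows into a database), not the original importer. It makes several assumptions: the element and attribute names (`speech`, `speakername`, `id`, `p`) and the sample data are illustrative, the `speeches` table is hypothetical, and the standard-library `sqlite3` module stands in for MySQL so that the example runs without a database server.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element and attribute names are
# assumptions made for illustration, not taken from the original script.
SAMPLE_XML = """<?xml version="1.0" encoding="UTF-8"?>
<publicwhip>
  <speech id="s1" speakername="Member A"><p>First paragraph.</p><p>Second paragraph.</p></speech>
  <speech id="s2" speakername="Member B"><p>Caf\u00e9 society.</p></speech>
</publicwhip>
"""


def parse_speeches(xml_bytes):
    """Return one (id, speaker, text) tuple per <speech> element."""
    root = ET.fromstring(xml_bytes)
    rows = []
    for speech in root.iter("speech"):
        # Join the text of all <p> children, one paragraph per line.
        text = "\n".join(
            "".join(p.itertext()).strip() for p in speech.iter("p")
        )
        rows.append((speech.get("id"), speech.get("speakername"), text))
    return rows


def import_speeches(conn, rows):
    """Create the (hypothetical) speeches table if needed and insert rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS speeches "
        "(id TEXT PRIMARY KEY, speaker TEXT, body TEXT)"
    )
    # Parameterised inserts avoid quoting problems in the debate text.
    conn.executemany("INSERT OR REPLACE INTO speeches VALUES (?, ?, ?)", rows)
    conn.commit()


# sqlite3 is a stand-in for MySQL here (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
# Passing bytes lets the parser honour the encoding in the XML declaration.
rows = parse_speeches(SAMPLE_XML.encode("utf-8"))
import_speeches(conn, rows)
```

Handing the parser raw bytes rather than a decoded string leaves the character encoding to the XML declaration, which is the part that was troublesome in R.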
Import UK parliamentary debate data into R
Debates in the UK Parliament are transcribed and published online as Hansard, but it is not easy to scrape all the texts from the website. A much better source of parliamentary debate data is ParlParse, a website of TheyWorkForYou. On the website, Hansard reports are provided as XML files. Yet, we still have to write a script to […]
News data importer for R
This April, I created an R script to import files downloaded from Nexis and Factiva. Factiva does not offer a file download function, but its search results pages can be saved as HTML files and imported into R using this script. library(XML) #might need libxml2-dev via apt-get command readNewsDir
International Newsmap
I have been running a website called International Newsmap. It collects international news stories from news sites and classifies them according to their geographic focus using a Bayesian classifier and a lexicon expansion technique. The sources of news are English-language websites in the US, the UK, New Zealand, India, Singapore, Kenya, and South Africa. The main […]