We used to rely on commercial databases such as Nexis to download news stories, but if you want to do historical analysis of news content, you should use the New York Times APIs. Through the API, we can search NYT news articles back to 1851, and it is free for anyone! We can only download metadata, including summary texts (lead paragraphs), but we can still do a lot of content analysis with it.
When each text is short, you have to collect a lot of items. This should not be difficult to do through the API if you use the rtimes package. However, it is actually not as easy as it sounds, because web APIs sometimes fail to respond, and we can call the API only 1,000 times a day. Therefore, our downloader has to be robust against unstable connections and able to resume downloading the next day.
After several attempts, I managed to run the download without unexpected errors. Using the code below, you can download summaries of NYT articles published between 1851 and 2000 that contain ‘diplomacy’ or ‘military’ in their main texts. The program saves the downloaded data to RDS files year by year, so that you do not lose anything even if you have to restart R. Do not forget to replace xxxxxxxxxxxxxxxxxxxxxxxxxxxx with your own API key.
#install.packages("rtimes")
rm(list = ls())
require(rtimes)
require(plyr)

# Extend the HTTP timeout globally, as responses can be slow
httr::set_config(httr::config(timeout = 120))

query <- '(body:"diplomacy" OR body:"military")'

fetch <- function(query, year, page) {
    res <- as_search(q = NULL, fq = query,
                     begin_date = paste0(year, '0101'), end_date = paste0(year, '1231'),
                     key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxx', page = page,
                     fl = c('_id', 'pub_date', 'word_count', 'snippet', 'headline',
                            'section_name', 'byline', 'web_url'))
    return(res)
}

for (year in seq(1851, 2000)) {
    # Skip years that have already been downloaded
    if (file.exists(paste0('API/temp/', year, '.RDS'))) {
        cat('Skip', year, "\n")
        next
    }
    cat('Search', year, "\n")
    data <- data.frame()
    res <- NULL
    page <- 0
    while (is.null(res) || res$meta$hits > 10 * page) { # 10 results per page
        res <- NULL
        attempt <- 0
        # Retry each page up to five times before giving up
        while (is.null(res) && attempt <= 5) {
            attempt <- attempt + 1
            try(
                res <- fetch(query, year, page)
            )
            if (is.null(res)) {
                cat('Error', attempt, '\n')
                Sys.sleep(30)
            }
            if (attempt > 5) {
                stop('Aborted\n')
            }
        }
        if (nrow(res$data) == 0) {
            cat('No data\n')
            break
        }
        res$data$page <- page
        data <- rbind.fill(data, res$data)
        cat(10 * page, 'of', res$meta$hits, "\n")
        Sys.sleep(5)
        page <- page + 1
    }
    if (nrow(data) > 0) {
        data$year <- year
        saveRDS(data, file = paste0('API/temp/', year, '.RDS'))
    }
    Sys.sleep(5)
}
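Once the downloader has finished (or even partway through), the yearly RDS files can be combined into a single data frame for analysis. The sketch below shows one way to do this with rbind.fill from plyr; it creates two small fake files in a temporary directory to stand in for the downloader's output, so point the directory at 'API/temp' to use your real files.

```r
require(plyr)

dir <- tempdir()
# Fake yearly files standing in for the downloader's output;
# replace `dir` with 'API/temp' in actual use.
saveRDS(data.frame(snippet = 'a', year = 1851), file.path(dir, '1851.RDS'))
saveRDS(data.frame(snippet = 'b', year = 1852), file.path(dir, '1852.RDS'))

# Read every yearly file and stack them into one data frame
files <- list.files(dir, pattern = '\\.RDS$', full.names = TRUE)
data <- rbind.fill(lapply(files, readRDS))

# Number of downloaded articles per year
table(data$year)
```

Because saveRDS preserves the data frame structure, no re-parsing is needed; rbind.fill also tolerates years whose files happen to have slightly different columns.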