A story based on this analysis was published on Dec. 14, 2022. Click here to read it. To download the data related to this project, click here. For the sake of transparency and journalistic integrity, this is our published R notebook.
What we set out to do
CBC News wanted to calculate the probability of snow over the Christmas holidays for locations across Canada using historical data. This analysis has three components.
Methodology
```r
# Global variables
dataQualityThreshold <- 0.50
filePaths <- fs::dir_ls("../NodeJS/output/")
christmasDays <- c(24, 25, 26, 27, 28, 29, 30, 31)
maxDataPoints <- length(christmasDays) * 10
snowAmountThreshold <- 2
stationData <- jsonlite::fromJSON("../nodeJS/stations.json")
```

To retrieve the data, we used a series of scripts coded in Node.js. Based on the list of weather stations from Environment Canada, we downloaded the data for stations with at least 20 years of snowfall and snow cover data. Because the data is incomplete in many regions, we grouped stations: starting with the oldest weather stations, we looked for other stations within a 20-kilometre radius and averaged their values. After grouping the stations, we kept the locations that covered at least the 1970-2010 range and had at least 40 years of data. Here, we retrieved the data produced by the Node.js scripts.
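The 20-kilometre grouping itself happened in the Node.js scripts, but its core distance check can be sketched in R with the haversine formula. This is an illustration only; the station IDs and coordinates below are made up:

```r
# Great-circle distance in kilometres between two lat/lon points (haversine).
haversineKM <- function(lat1, lon1, lat2, lon2) {
  toRad <- pi / 180
  dLat <- (lat2 - lat1) * toRad
  dLon <- (lon2 - lon1) * toRad
  a <- sin(dLat / 2)^2 + cos(lat1 * toRad) * cos(lat2 * toRad) * sin(dLon / 2)^2
  2 * 6371 * asin(sqrt(a))
}

# Hypothetical stations: an old "anchor" station and two nearby candidates.
anchorLat <- 45.47
anchorLon <- -73.74
candidates <- data.frame(
  id  = c("A", "B"),
  lat = c(45.52, 46.10),
  lon = c(-73.85, -74.60)
)

# Keep only candidates within 20 km of the anchor, as in the grouping step.
candidates$distKM <- haversineKM(anchorLat, anchorLon, candidates$lat, candidates$lon)
grouped <- candidates[candidates$distKM <= 20, ]
```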
```r
# Read in the data
raw <- data.frame(matrix(ncol = 6, nrow = 0))
for (i in seq_along(filePaths)) {
  df <- read.csv(filePaths[[i]])
  raw <- rbind(raw, df)
}
```

We restructured the data and removed missing values.
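As an aside, the accumulating `rbind` loop above can be collapsed into a single call, which avoids growing a data frame inside a loop. A base-R sketch:

```r
# Read every file, then bind the resulting data frames in one call.
readAll <- function(paths) {
  do.call(rbind, lapply(paths, read.csv))
}
```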
```r
data <- raw %>%
  dplyr::rename(id = CLIMATE_IDENTIFIER,
                date = LOCAL_DATE,
                groundSnow = SNOW_ON_GROUND,
                snowFall = TOTAL_SNOW,
                minTemp = MIN_TEMPERATURE) %>%
  dplyr::mutate(
    year = year(date),
    month = month(date),
    day = day(date),
    decade = floor(year / 10) * 10
  ) %>%
  dplyr::filter(month == 12 & day %in% christmasDays) %>%
  dplyr::select(id, decade, date, groundSnow, snowFall, minTemp) %>%
  pivot_longer(cols = c(groundSnow, snowFall, minTemp), names_to = "var", values_to = "val") %>%
  drop_na()

data
```

We aggregated each variable by station ID and decade and checked whether there were enough data entries (days) to pass the data quality threshold. We filtered out the decades that didn't have enough data, combined the variables back together and appended the station metadata.
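With the globals defined earlier, the quality cutoff applied in the next chunks works out to 40 day-entries per station-decade: 8 Christmas-period days × 10 years gives 80 possible entries, and 50 per cent of 80 is 40.

```r
# Restating the globals so the arithmetic is self-contained.
christmasDays <- c(24, 25, 26, 27, 28, 29, 30, 31)
maxDataPoints <- length(christmasDays) * 10       # 80 possible day-entries per decade
dataQualityThreshold <- 0.50
cutoff <- maxDataPoints * dataQualityThreshold    # a decade needs at least 40 entries
```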
```r
groundSnowData <- data %>%
  dplyr::filter(var == "groundSnow") %>%
  group_by(id, decade, var) %>%
  summarise(
    dataPoints = n(),
    groundSnowDays = sum(val >= snowAmountThreshold),
    groundSnowPerc = groundSnowDays / dataPoints * 100,
    snowDepthAvgCM = mean(val)) %>%
  dplyr::filter(dataPoints >= maxDataPoints * dataQualityThreshold) %>%
  dplyr::select(-dataPoints)

snowFallData <- data %>%
  dplyr::filter(var == "snowFall") %>%
  group_by(id, decade, var) %>%
  summarise(
    dataPoints = n(),
    snowFallAvgCM = mean(val),
    snowFallDays = sum(val > 0),
    snowFallPerc = snowFallDays / dataPoints * 100) %>%
  dplyr::filter(dataPoints >= maxDataPoints * dataQualityThreshold) %>%
  dplyr::select(-dataPoints)

tempData <- data %>%
  dplyr::filter(var == "minTemp") %>%
  group_by(id, decade, var) %>%
  summarise(
    dataPoints = n(),
    minTempAvgC = mean(val)) %>%
  dplyr::filter(dataPoints >= maxDataPoints * dataQualityThreshold) %>%
  dplyr::select(-dataPoints)

data <- groundSnowData %>%
  left_join(snowFallData, by = c("id", "decade")) %>%
  left_join(tempData, by = c("id", "decade")) %>%
  dplyr::select(id, decade, groundSnowPerc, snowDepthAvgCM, snowFallPerc, snowFallAvgCM, minTempAvgC) %>%
  dplyr::left_join(stationData, by = "id")

data
```

We removed decades that had NA values for the temperature- and snowfall-related variables. We filtered out stations that didn't cover enough decades (fewer than 50 per cent of the decades between the 1960s and 2010s) and those without data up to at least the 2000s. We also filtered out the 1950s because of widespread data issues.
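To make that retention rule concrete: there are six candidate decades (the 1960s through the 2010s), so a station needs at least three of them, including something from the 2000s or later. A toy check with made-up decade vectors, mirroring the filters in the next chunk:

```r
dataQualityThreshold <- 0.50

# TRUE if a station's set of decades satisfies the retention filters.
keepStation <- function(decades) {
  decades <- decades[decades > 1950]   # drop the 1950s
  length(decades) > 0 &&
    max(decades) >= 2000 &&            # must reach at least the 2000s
    length(decades) / 6 >= dataQualityThreshold  # at least 3 of 6 decades
}

keepStation(c(1960, 1970, 2000))        # three decades incl. the 2000s: kept
keepStation(c(1960, 1970, 1980, 1990))  # four decades but nothing recent: dropped
```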
```r
data <- data %>%
  dplyr::group_by(id, decade) %>%
  filter(!is.na(minTempAvgC),
         !is.na(snowFallPerc),
         !is.na(snowFallAvgCM))

data <- data %>%
  dplyr::ungroup() %>%
  dplyr::group_by(id) %>%
  dplyr::filter(max(decade) >= 2000,
                decade > 1950) %>%
  dplyr::filter(n() / 6 >= dataQualityThreshold)
```

We calculated the per cent difference between each station's earliest and latest decade to see which stations might be problematic, and flagged the stations and variables with more than a 100 per cent difference for manual checks. Note that a large storm that hit the west coast in the 2000s does cause some dramatic figures at some B.C. stations. We left those in.
```r
decadePercentDiff <- function(var, decade) {
  (var[which.max(decade)] - var[which.min(decade)]) / var[which.min(decade)] * 100
}
```
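For instance, a station whose snow-cover percentage went from 10 in its earliest decade to 15 in its latest would show a 50 per cent change. (The function is restated here so the example is self-contained.)

```r
# Per cent change between the values of the earliest and latest decade.
decadePercentDiff <- function(var, decade) {
  (var[which.max(decade)] - var[which.min(decade)]) / var[which.min(decade)] * 100
}

decadePercentDiff(c(10, 15), c(1970, 2010))   # (15 - 10) / 10 * 100 = 50
```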
```r
dataForManualChecks <- data %>%
  dplyr::group_by(id, name) %>%
  dplyr::summarize(
    groundSnowPercChange = decadePercentDiff(groundSnowPerc, decade),
    snowDepthAvgCMChange = decadePercentDiff(snowDepthAvgCM, decade),
    snowFallPercChange = decadePercentDiff(snowFallPerc, decade),
    snowFallAvgCMChange = decadePercentDiff(snowFallAvgCM, decade),
    minTempAvgCChange = decadePercentDiff(minTempAvgC, decade)) %>%
  pivot_longer(cols = c(groundSnowPercChange, snowDepthAvgCMChange, snowFallPercChange, snowFallAvgCMChange, minTempAvgCChange),
               names_to = "var", values_to = "val") %>%
  dplyr::filter(val >= 100 | val <= -100)

dataForManualChecks
```

Next, we did the final cleanup: we made sure the data types were correct and the names were in the proper format, then exported the JSON for use in the story.
```r
# rawStationNames <- data %>%
#   dplyr::group_by(id, name, province) %>%
#   dplyr::summarise()
# write.csv(rawStationNames, "notebook_assets/rawStationNames.csv")

cleanNames <- read.csv("notebook_assets/cleanNames.csv") %>%
  select(id, nameClean, provinceClean)

data <- data %>%
  dplyr::left_join(cleanNames, by = "id") %>%
  dplyr::mutate(groundSnowPerc = round(groundSnowPerc),
                snowDepthAvgCM = round(snowDepthAvgCM),
                snowFallPerc = round(snowFallPerc),
                snowFallAvgCM = round(snowFallAvgCM),
                minTempAvgC = round(minTempAvgC, digits = 1))

jsonData <- data %>%
  dplyr::select(
    lat,
    lon,
    id,
    nameClean,
    provinceClean,
    decade,
    groundSnowPerc,
    snowDepthAvgCM,
    snowFallPerc,
    snowFallAvgCM,
    minTempAvgC
  ) %>%
  dplyr::mutate(nameClean = gsub("'", "’", nameClean))

jsonData

write_json(jsonData, "../final_data/final_data.json")
```

Here, we checked how many stations saw an increase or decrease in snow depth and minimum temperature.
```r
data %>%
  group_by(id) %>%
  dplyr::summarize(
    recentDecadeSnowDepth = snowDepthAvgCM[which.max(decade)],
    firstDecadeSnowDepth = snowDepthAvgCM[which.min(decade)]
  ) %>%
  dplyr::mutate(
    snowDepthAvgCMTrend =
      case_when(
        recentDecadeSnowDepth > firstDecadeSnowDepth ~ "up",
        recentDecadeSnowDepth < firstDecadeSnowDepth ~ "down",
        recentDecadeSnowDepth == firstDecadeSnowDepth ~ "no change",
        TRUE ~ "NA"
      )) %>%
  group_by(snowDepthAvgCMTrend) %>%
  summarise(count = n())

data %>%
  group_by(id) %>%
  dplyr::summarize(
    recentDecadeMinTempAvgC = minTempAvgC[which.max(decade)],
    firstDecadeMinTempAvgC = minTempAvgC[which.min(decade)]
  ) %>%
  dplyr::mutate(
    minTempAvgCTrend =
      case_when(
        recentDecadeMinTempAvgC > firstDecadeMinTempAvgC ~ "up",
        recentDecadeMinTempAvgC < firstDecadeMinTempAvgC ~ "down",
        recentDecadeMinTempAvgC == firstDecadeMinTempAvgC ~ "no change",
        TRUE ~ "NA"
      )) %>%
  group_by(minTempAvgCTrend) %>%
  summarise(count = n())
```

We charted all the variables for visual checks.
```r
ggplot(data, aes(x = decade, y = groundSnowPerc)) +
  geom_bar(stat = "identity") +
  facet_wrap(~name, ncol = 3) +
  scale_y_continuous(breaks = NULL, labels = NULL, limits = c(0, 120)) +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank(),
    axis.text.y = element_blank(),
    axis.text.x = element_text(size = 7.5)
  ) +
  geom_text(aes(label = groundSnowPerc), position = position_dodge(width = 0.9)) +
  ggtitle("% of Christmas days with snow on ground by decade, 1960-2010")

ggplot(data, aes(x = decade, y = snowDepthAvgCM)) +
  geom_bar(stat = "identity") +
  facet_wrap(~name, ncol = 3) +
  scale_y_continuous(breaks = NULL, labels = NULL) +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank(),
    axis.text.y = element_blank(),
    axis.text.x = element_text(size = 7.5)
  ) +
  geom_text(aes(label = snowDepthAvgCM), position = position_dodge(width = 0.9)) +
  ggtitle("Snow depth average during Christmas time by decade (cm)")

ggplot(data, aes(x = decade, y = snowFallPerc)) +
  geom_bar(stat = "identity") +
  facet_wrap(~name, ncol = 3) +
  scale_y_continuous(breaks = NULL, labels = NULL, limits = c(0, 120)) +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank(),
    axis.text.y = element_blank(),
    axis.text.x = element_text(size = 7.5)
  ) +
  geom_text(aes(label = snowFallPerc), position = position_dodge(width = 0.9)) +
  ggtitle("% of Christmas days with snowfall by decade, 1960-2010")

ggplot(data, aes(x = decade, y = snowFallAvgCM)) +
  geom_bar(stat = "identity") +
  facet_wrap(~name, ncol = 3) +
  scale_y_continuous(breaks = NULL, labels = NULL) +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank(),
    axis.text.y = element_blank(),
    axis.text.x = element_text(size = 7.5)
  ) +
  geom_text(aes(label = snowFallAvgCM), position = position_dodge(width = 0.9)) +
  ggtitle("Snowfall average during Christmas time by decade (cm), 1960-2010")

ggplot(data, aes(x = decade, y = minTempAvgC)) +
  geom_bar(stat = "identity") +
  facet_wrap(~name, ncol = 3) +
  scale_y_continuous(breaks = NULL, labels = NULL) +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank(),
    axis.text.y = element_blank(),
    axis.text.x = element_text(size = 7.5)
  ) +
  geom_text(aes(label = minTempAvgC), position = position_dodge(width = 0.9)) +
  ggtitle("Average min temp at Christmas time by decade (C), 1960-2010")
```