A story based on this analysis was published on Dec. 14, 2022. Click here to read it. To download the data related to this project, click here. For the sake of transparency and journalistic integrity, this is our published R notebook.
What we set out to do
CBC News wanted to calculate the probability of snow over the Christmas holidays for locations across Canada using historical data. This analysis has three components.
Methodology
```r
# Global variables
dataQualityThreshold <- 0.50
filePaths <- fs::dir_ls("../NodeJS/output/")
christmasDays <- c(24, 25, 26, 27, 28, 29, 30, 31)
maxDataPoints <- length(christmasDays) * 10
snowAmountThreshold <- 2
stationData <- jsonlite::fromJSON("../nodeJS/stations.json")
```

To retrieve the data, we used a series of scripts coded in Node.js. Based on the list of weather stations from Environment Canada, we downloaded the data for stations with at least 20 years of snowfall and snow cover data. Because the data is incomplete in many regions, we grouped stations: starting with the oldest weather stations, we looked for other stations within a 20-kilometre radius and averaged their values. After grouping the stations, we kept the locations that covered at least the 1970-2010 range and had at least 40 years of data. Here, we retrieved the data produced by the Node.js scripts.
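The 20-kilometre grouping itself happened in the Node.js scripts, but its core distance check can be sketched in R with the haversine formula. This is an illustration only; the station IDs and coordinates below are made up:

```r
# Great-circle distance in kilometres between two lat/lon points (haversine).
haversineKM <- function(lat1, lon1, lat2, lon2) {
  toRad <- pi / 180
  dLat <- (lat2 - lat1) * toRad
  dLon <- (lon2 - lon1) * toRad
  a <- sin(dLat / 2)^2 + cos(lat1 * toRad) * cos(lat2 * toRad) * sin(dLon / 2)^2
  2 * 6371 * asin(sqrt(a))
}

# Hypothetical stations: an old "anchor" station and two nearby candidates.
anchorLat <- 45.47
anchorLon <- -73.74
candidates <- data.frame(
  id  = c("A", "B"),
  lat = c(45.52, 46.10),
  lon = c(-73.85, -74.60)
)

# Keep only candidates within 20 km of the anchor, as in the grouping step.
candidates$distKM <- haversineKM(anchorLat, anchorLon, candidates$lat, candidates$lon)
grouped <- candidates[candidates$distKM <= 20, ]
```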
```r
# Read in the data
raw <- data.frame(matrix(ncol = 6, nrow = 0))
for (i in seq_along(filePaths)) {
  df <- read.csv(filePaths[[i]])
  raw <- rbind(raw, df)
}
```

We restructured the data and removed missing values.
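As an aside, the accumulating `rbind` loop above can be collapsed into a single call, which avoids growing a data frame inside a loop. A base-R sketch:

```r
# Read every file, then bind the resulting data frames in one call.
readAll <- function(paths) {
  do.call(rbind, lapply(paths, read.csv))
}
```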
```r
data <- raw %>%
  dplyr::rename(id = CLIMATE_IDENTIFIER,
                date = LOCAL_DATE,
                groundSnow = SNOW_ON_GROUND,
                snowFall = TOTAL_SNOW,
                minTemp = MIN_TEMPERATURE) %>%
  dplyr::mutate(
    year = year(date),
    month = month(date),
    day = day(date),
    decade = floor(year / 10) * 10
  ) %>%
  dplyr::filter(month == 12 & day %in% christmasDays) %>%
  dplyr::select(id, decade, date, groundSnow, snowFall, minTemp) %>%
  pivot_longer(cols = c(groundSnow, snowFall, minTemp), names_to = "var", values_to = "val") %>%
  drop_na()

data
```

We aggregated each variable by station ID and decade and checked whether there were enough data entries (days) to pass the data quality threshold. We filtered out the decades that didn't have enough data, combined the variables back together and appended the station metadata.
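With the globals defined earlier, the quality cutoff applied in the next chunks works out to 40 day-entries per station-decade: 8 Christmas-period days × 10 years gives 80 possible entries, and 50 per cent of 80 is 40.

```r
# Restating the globals so the arithmetic is self-contained.
christmasDays <- c(24, 25, 26, 27, 28, 29, 30, 31)
maxDataPoints <- length(christmasDays) * 10       # 80 possible day-entries per decade
dataQualityThreshold <- 0.50
cutoff <- maxDataPoints * dataQualityThreshold    # a decade needs at least 40 entries
```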
```r
groundSnowData <- data %>%
  dplyr::filter(var == "groundSnow") %>%
  group_by(id, decade, var) %>%
  summarise(
    dataPoints = n(),
    groundSnowDays = sum(val >= snowAmountThreshold),
    groundSnowPerc = groundSnowDays / dataPoints * 100,
    snowDepthAvgCM = mean(val)) %>%
  dplyr::filter(dataPoints >= maxDataPoints * dataQualityThreshold) %>%
  dplyr::select(-dataPoints)

snowFallData <- data %>%
  dplyr::filter(var == "snowFall") %>%
  group_by(id, decade, var) %>%
  summarise(
    dataPoints = n(),
    snowFallAvgCM = mean(val),
    snowFallDays = sum(val > 0),
    snowFallPerc = snowFallDays / dataPoints * 100) %>%
  dplyr::filter(dataPoints >= maxDataPoints * dataQualityThreshold) %>%
  dplyr::select(-dataPoints)

tempData <- data %>%
  dplyr::filter(var == "minTemp") %>%
  group_by(id, decade, var) %>%
  summarise(
    dataPoints = n(),
    minTempAvgC = mean(val)) %>%
  dplyr::filter(dataPoints >= maxDataPoints * dataQualityThreshold) %>%
  dplyr::select(-dataPoints)

data <- groundSnowData %>%
  left_join(snowFallData, by = c("id", "decade")) %>%
  left_join(tempData, by = c("id", "decade")) %>%
  dplyr::select(id, decade, groundSnowPerc, snowDepthAvgCM, snowFallPerc, snowFallAvgCM, minTempAvgC) %>%
  dplyr::left_join(stationData, by = "id")

data
```

We removed decades that had NA values for the temperature- and snowfall-related variables. We filtered out stations that didn't cover enough decades (fewer than 50 per cent of the decades between the 1960s and 2010s) and those without data up to at least the 2000s. We also filtered out the 1950s because of widespread data issues.
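To make that retention rule concrete: there are six candidate decades (the 1960s through the 2010s), so a station needs at least three of them, including something from the 2000s or later. A toy check with made-up decade vectors, mirroring the filters in the next chunk:

```r
dataQualityThreshold <- 0.50

# TRUE if a station's set of decades satisfies the retention filters.
keepStation <- function(decades) {
  decades <- decades[decades > 1950]   # drop the 1950s
  length(decades) > 0 &&
    max(decades) >= 2000 &&            # must reach at least the 2000s
    length(decades) / 6 >= dataQualityThreshold  # at least 3 of 6 decades
}

keepStation(c(1960, 1970, 2000))        # three decades incl. the 2000s: kept
keepStation(c(1960, 1970, 1980, 1990))  # four decades but nothing recent: dropped
```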
```r
data <- data %>%
  dplyr::group_by(id, decade) %>%
  filter(!is.na(minTempAvgC),
         !is.na(snowFallPerc),
         !is.na(snowFallAvgCM))

data <- data %>%
  dplyr::ungroup() %>%
  dplyr::group_by(id) %>%
  dplyr::filter(max(decade) >= 2000,
                decade > 1950) %>%
  dplyr::filter(n() / 6 >= dataQualityThreshold)
```

We calculated the per cent difference between each station's earliest and latest decade to see which stations might be problematic, and flagged the stations and variables with more than a 100 per cent difference for manual checks. Note that a large storm that hit the west coast in the 2000s does cause some dramatic figures at some B.C. stations. We left those in.
```r
decadePercentDiff <- function(var, decade) {
  (var[which.max(decade)] - var[which.min(decade)]) / var[which.min(decade)] * 100
}
```
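For instance, a station whose snow-cover percentage went from 10 in its earliest decade to 15 in its latest would show a 50 per cent change. (The function is restated here so the example is self-contained.)

```r
# Per cent change between the values of the earliest and latest decade.
decadePercentDiff <- function(var, decade) {
  (var[which.max(decade)] - var[which.min(decade)]) / var[which.min(decade)] * 100
}

decadePercentDiff(c(10, 15), c(1970, 2010))   # (15 - 10) / 10 * 100 = 50
```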
```r
dataForManualChecks <- data %>%
  dplyr::group_by(id, name) %>%
  dplyr::summarize(
    groundSnowPercChange = decadePercentDiff(groundSnowPerc, decade),
    snowDepthAvgCMChange = decadePercentDiff(snowDepthAvgCM, decade),
    snowFallPercChange = decadePercentDiff(snowFallPerc, decade),
    snowFallAvgCMChange = decadePercentDiff(snowFallAvgCM, decade),
    minTempAvgCChange = decadePercentDiff(minTempAvgC, decade)) %>%
  pivot_longer(cols = c(groundSnowPercChange, snowDepthAvgCMChange, snowFallPercChange, snowFallAvgCMChange, minTempAvgCChange),
               names_to = "var", values_to = "val") %>%
  dplyr::filter(val >= 100 | val <= -100)

dataForManualChecks
```

Next, we did the final cleanup: we made sure the data types were correct and the names were in the proper format, then exported the JSON for use in the story.
```r
# rawStationNames <- data %>%
#   dplyr::group_by(id, name, province) %>%
#   dplyr::summarise()
# write.csv(rawStationNames, "notebook_assets/rawStationNames.csv")

cleanNames <- read.csv("notebook_assets/cleanNames.csv") %>%
  select(id, nameClean, provinceClean)

data <- data %>%
  dplyr::left_join(cleanNames, by = "id") %>%
  dplyr::mutate(groundSnowPerc = round(groundSnowPerc),
                snowDepthAvgCM = round(snowDepthAvgCM),
                snowFallPerc = round(snowFallPerc),
                snowFallAvgCM = round(snowFallAvgCM),
                minTempAvgC = round(minTempAvgC, digits = 1))

jsonData <- data %>%
  dplyr::select(
    lat,
    lon,
    id,
    nameClean,
    provinceClean,
    decade,
    groundSnowPerc,
    snowDepthAvgCM,
    snowFallPerc,
    snowFallAvgCM,
    minTempAvgC
  ) %>%
  dplyr::mutate(nameClean = gsub("'", "’", nameClean))

jsonData

write_json(jsonData, "../final_data/final_data.json")
```

Here, we checked how many stations saw an increase or decrease in snow depth and minimum temperature.
```r
data %>%
  group_by(id) %>%
  dplyr::summarize(
    recentDecadeSnowDepth = snowDepthAvgCM[which.max(decade)],
    firstDecadeSnowDepth = snowDepthAvgCM[which.min(decade)]
  ) %>%
  dplyr::mutate(
    snowDepthAvgCMTrend =
      case_when(
        recentDecadeSnowDepth > firstDecadeSnowDepth ~ "up",
        recentDecadeSnowDepth < firstDecadeSnowDepth ~ "down",
        recentDecadeSnowDepth == firstDecadeSnowDepth ~ "no change",
        TRUE ~ "NA"
      )) %>%
  group_by(snowDepthAvgCMTrend) %>%
  summarise(count = n())

data %>%
  group_by(id) %>%
  dplyr::summarize(
    recentDecadeMinTempAvgC = minTempAvgC[which.max(decade)],
    firstDecadeMinTempAvgC = minTempAvgC[which.min(decade)]
  ) %>%
  dplyr::mutate(
    minTempAvgCTrend =
      case_when(
        recentDecadeMinTempAvgC > firstDecadeMinTempAvgC ~ "up",
        recentDecadeMinTempAvgC < firstDecadeMinTempAvgC ~ "down",
        recentDecadeMinTempAvgC == firstDecadeMinTempAvgC ~ "no change",
        TRUE ~ "NA"
      )) %>%
  group_by(minTempAvgCTrend) %>%
  summarise(count = n())
```

We charted all the variables for visual checks.
```r
ggplot(data, aes(x = decade, y = groundSnowPerc)) +
  geom_bar(stat = "identity") +
  facet_wrap(~name, ncol = 3) +
  scale_y_continuous(breaks = NULL, labels = NULL, limits = c(0, 120)) +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank(),
    axis.text.y = element_blank(),
    axis.text.x = element_text(size = 7.5)
  ) +
  geom_text(aes(label = groundSnowPerc), position = position_dodge(width = 0.9)) +
  ggtitle("% of Christmas days with snow on ground by decade, 1960-2010")

ggplot(data, aes(x = decade, y = snowDepthAvgCM)) +
  geom_bar(stat = "identity") +
  facet_wrap(~name, ncol = 3) +
  scale_y_continuous(breaks = NULL, labels = NULL) +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank(),
    axis.text.y = element_blank(),
    axis.text.x = element_text(size = 7.5)
  ) +
  geom_text(aes(label = snowDepthAvgCM), position = position_dodge(width = 0.9)) +
  ggtitle("Snow depth average during Christmas time by decade (cm)")

ggplot(data, aes(x = decade, y = snowFallPerc)) +
  geom_bar(stat = "identity") +
  facet_wrap(~name, ncol = 3) +
  scale_y_continuous(breaks = NULL, labels = NULL, limits = c(0, 120)) +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank(),
    axis.text.y = element_blank(),
    axis.text.x = element_text(size = 7.5)
  ) +
  geom_text(aes(label = snowFallPerc), position = position_dodge(width = 0.9)) +
  ggtitle("% of Christmas days with snowfall by decade, 1960-2010")

ggplot(data, aes(x = decade, y = snowFallAvgCM)) +
  geom_bar(stat = "identity") +
  facet_wrap(~name, ncol = 3) +
  scale_y_continuous(breaks = NULL, labels = NULL) +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank(),
    axis.text.y = element_blank(),
    axis.text.x = element_text(size = 7.5)
  ) +
  geom_text(aes(label = snowFallAvgCM), position = position_dodge(width = 0.9)) +
  ggtitle("Snowfall average during Christmas time by decade (cm), 1960-2010")

ggplot(data, aes(x = decade, y = minTempAvgC)) +
  geom_bar(stat = "identity") +
  facet_wrap(~name, ncol = 3) +
  scale_y_continuous(breaks = NULL, labels = NULL) +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank(),
    axis.text.y = element_blank(),
    axis.text.x = element_text(size = 7.5)
  ) +
  geom_text(aes(label = minTempAvgC), position = position_dodge(width = 0.9)) +
  ggtitle("Average min temp at Christmas time by decade (C), 1960-2010")
```