library(ggplot2)
library(dplyr)
library(data.table)
library(tibble)
library(tmap)
library(tmaptools)
library(sf)
library(tidyr)
library(knitr)
library(kableExtra)
library(gridExtra)
library(xml2)
library(leaflet)
library(ggpmisc)
library(RColorBrewer)
library(scales)
The goal of this project was to explore the capabilities of interactive maps in R with the shiny package, while my friend Tom similarly explored interactive maps in Python with streamlit. The leaflet package proved to be highly customizable and pleasant to work with.
This report details the acquisition, cleaning, transformation, and joining of the data with data.table, and the visualization of the data with ggplot2. After planning the visualization, the code for a leaflet rendering is provided at the end. The final interactive application can be found here, while the repository for this project can be found on my GitHub.
The population data, shape data, and metadata required for this project were all sourced from Statistics Canada.
folders <- c(
'./data',
'./data/original_rar',
'./data/division_data',
'./data/division_shapes',
'./shiny_app',
'./shiny_app/data',
'./shiny_app/data/division_shapes'
)
url1 <- 'https://www12.statcan.gc.ca/census-recensement/2011/geo/bound-limit/files-fichiers/2016/lcd_000b16a_e.zip'
url2 <- 'https://www150.statcan.gc.ca/n1/tbl/csv/17100139-eng.zip'
# url3 <-
for(folder in folders){
if(!file.exists(folder)){dir.create(folder)}
}
if(!file.exists('./data/original_rar/lcd_000b16a_e.zip')){
download.file(url1, './data/original_rar/lcd_000b16a_e.zip')
}
if(!file.exists('./data/original_rar/17100139-eng.zip')){
download.file(url2, './data/original_rar/17100139-eng.zip')
}
if(!file.exists('./data/division_shapes/lcd_000b16a_e.shp')){
  unzip('./data/original_rar/lcd_000b16a_e.zip',
        exdir = './data/division_shapes')
  # recursive = TRUE is required for file.copy() to copy a directory's contents
  file.copy('./data/division_shapes', './shiny_app/data', recursive = TRUE)
}
if(!file.exists('./data/division_data/17100139.csv')){
unzip('./data/original_rar/17100139-eng.zip',
exdir = './data/division_data')
}
metadata <- read.csv('./data/division_data/17100139_MetaData.csv',
skip = 8, nrows = 306)
popEst <- read.csv('./data/division_data/17100139.csv',
na.strings = '')
shp <- st_read('./data/division_shapes/lcd_000b16a_e.shp',
stringsAsFactors = F)
## Reading layer `lcd_000b16a_e' from data source `T:\Documents\Data Science\Projects\map_project\data\division_shapes\lcd_000b16a_e.shp' using driver `ESRI Shapefile'
## Simple feature collection with 293 features and 5 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 3689439 ymin: 659338.9 xmax: 9015737 ymax: 5242179
## projected CRS: PCS_Lambert_Conformal_Conic
Some variables contain a lot of missing (‘NA’) values, while others contain no useful information. The table below helps determine which variables will be useful for exploration.
# for(i in colnames(popEst)){
#   i_unique <- uniqueN(popEst[, i])
#   cmd <- sprintf('Unique values for %s: %d', i, i_unique)
#   print(cmd)   # eval() is unnecessary here; cmd is already a character string
# }
# OR, more concisely:
# sapply(popEst, uniqueN)
data.frame(colMeans(is.na(popEst))*100,
sapply(popEst, uniqueN)) %>%
rownames_to_column() %>%
kbl(col.names = c('Variable', '% NA', '# of Unique')) %>%
kable_styling(bootstrap_options = c('striped',
'hover',
'condensed',
'responsive'),
full_width = F)
| Variable | % NA | # of Unique |
|---|---|---|
| ï..REF_DATE | 0 | 20 |
| GEO | 0 | 306 |
| DGUID | 0 | 306 |
| Sex | 0 | 3 |
| Age.group | 0 | 115 |
| UOM | 0 | 2 |
| UOM_ID | 0 | 2 |
| SCALAR_FACTOR | 0 | 1 |
| SCALAR_ID | 0 | 1 |
| VECTOR | 0 | 105570 |
| COORDINATE | 0 | 105570 |
| VALUE | 0 | 68331 |
| STATUS | 100 | 1 |
| SYMBOL | 100 | 1 |
| TERMINATED | 100 | 1 |
| DECIMALS | 0 | 2 |
‘SCALAR_FACTOR’, ‘SCALAR_ID’, ‘STATUS’, ‘SYMBOL’ and ‘TERMINATED’ can all be ignored since they each contain only a single value.
‘Sex’ contains a summation category, ‘Both sexes’, while ‘Age.group’ contains ‘All ages’. ‘Age.group’ also contains specific age summations like ‘70 to 74 years’ and summary statistics like ‘Average age’ and ‘Median age’. These may be useful in another visualization; however, for this project only the total summations will be used.
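As a quick aside (not part of the cleaning code below), the uninformative columns could also be dropped programmatically using the uniqueN() counts from above; a minimal sketch, assuming popEst as read in earlier:
# Sketch: drop columns that carry no information, i.e. those with a single
# unique value (this also catches the all-NA columns, since uniqueN() counts NA as one value)
single_valued <- sapply(popEst, uniqueN) <= 1
popEst_slim <- popEst[, !single_valued]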
# Normalize the accented / mis-encoded characters in the geography names so
# they will match the metadata later
popEst$GEO <- gsub('é', 'e', popEst$GEO)
popEst$GEO <- gsub('è', 'e', popEst$GEO)
popEst$GEO <- gsub('ÃŽ', 'I', popEst$GEO)
# Split 'GEO' ("Division, Province") into separate CITY and PROVINCE columns
popEst$CITY <- gsub(', .*', '', popEst$GEO)
popEst$PROVINCE <- gsub('.*, ', '', popEst$GEO)
# Keep only the total rows ('All ages', 'Both sexes') for actual census
# divisions (their names contain a comma) and only the columns of interest
popTot <- popEst[popEst$Age.group == 'All ages' &
                   popEst$Sex == 'Both sexes' &
                   grepl(',', popEst$GEO),
                 c(1:2, 12, 17:18)]
colnames(popTot) <- c('YEAR', 'GEO', 'POPULATION', 'CITY', 'PROVINCE')
# Reshape to wide format: one row per division, one column per year
popTot <- spread(popTot, YEAR, POPULATION)
The metadata can act as the glue to merge the population table to the shape data since it contains a ‘GEO’ column that overlaps with the population table and a ‘CDUID’ column that overlaps with the shape data. First, the ‘GEO’ variable of the metadata must be processed in the same way the population data has been:
metadata <- metadata[c(2, 3)]
colnames(metadata) <- c('GEO', 'CDUID')
metadata$CDUID <- gsub('\\[|\\]', '', metadata$CDUID)
metadata$GEO <- gsub('é', 'e', metadata$GEO)
metadata$GEO <- gsub('è', 'e', metadata$GEO)
metadata$GEO <- gsub('ÃŽ', 'I', metadata$GEO)
Merging the metadata, then merging the shape data:
popTot <- merge(popTot, metadata)
popMap <- st_as_sf(merge(popTot, shp))
Saving tidied data for shiny app use:
popTot_save <- gather(popTot, names(popTot[4:23]), key = YEAR, value = POPULATION)
write.csv(popTot_save, './shiny_app/data/popTot.csv', row.names = F)
ggplot() +
geom_sf(data = popMap, aes(fill = `2020`-`2019`))
The data is skewed to the point where the fill color doesn’t provide much meaningful differentiation. This can be solved by normalizing the variable either by scaling or log transformations.
# Signed log transform: take log10 of the absolute population change, then
# restore the negative sign for divisions that lost population
popChange <- popMap$`2020` - popMap$`2019`
popMap$log <- log10(abs(popChange))
popMap$log[popChange < 0] <- -popMap$log[popChange < 0]
The base R function hist() is used below for a quick look at the distribution of each version of the population change variable.
par(
mfrow = c(1, 3),
mar = c(4, 4, 1, 1)
)
hist(popMap$'2020' - popMap$'2019',
breaks = 30,
col = rgb(1, 0, 0, 0.5),
xlab = 'Pop Change',
main = '')
hist((popMap$'2020' - popMap$'2019')/popMap$'2019',
breaks = 30,
col = rgb(0, 0, 1, 0.5),
xlab = 'Pop Change per Capita',
main = '')
hist(popMap$log,
breaks = 30,
col = rgb(0, 0, 1, 0.5),
xlab = 'Pop Change log', main = '')
log_plot <- ggplot() +
geom_sf(data = popMap, aes(fill = log)) +
scale_fill_gradient2(name = '') +
labs(title = 'Log') +
theme(legend.position = 'bottom')
scaled_plot <- ggplot() +
geom_sf(data = popMap, aes(fill = (`2020`-`2019`)/`2019` * 10000)) +
scale_fill_gradient2(name = '') +
labs(title = 'Migration per 10k') +
theme(legend.position = 'bottom')
grid.arrange(log_plot, scaled_plot, ncol = 2)
Although both normalization methods produce more useful visualizations, reducing the skew with a log transform abstracts the data considerably, while reducing the skew by scaling expresses the change per capita, which is still a meaningful number.
For the application, net migration per 10k individuals was used both to visualize the data and to give the user a meaningful metric for understanding it: a value of +50, for example, means a division gained 50 residents for every 10,000 it had in 2019.
Unlike ggplot2, leaflet does not automatically center a diverging color palette on zero. In order to display the data properly, a custom color palette must be built with colorRampPalette. This Medium article was used as a reference to build a comparable palette.
minVal <- min((popMap$`2020`-popMap$`2019`)/popMap$`2019`*10000)
maxVal <- max((popMap$`2020`-popMap$`2019`)/popMap$`2019`*10000)
domain <- c(minVal, maxVal)
# colorPal_grad <- c(colorRampPalette(colors = c('#801100', 'white'),
#                                     space = 'Lab', bias = 500)(abs(minVal)),
#                    colorRampPalette(colors = c('white', '#003d8c'),
#                                     space = 'Lab', bias = 500)(maxVal))

# Build a color vector whose red and blue halves are proportional in length to
# the range on each side of zero, so the white midpoint of 'RdBu' lands at zero
# when colorNumeric() later interpolates linearly across the domain. The first
# color of the blue ramp (white) is dropped to avoid duplicating the midpoint.
colorPal <- c(colorRampPalette(colors = brewer.pal(11, 'RdBu')[c(1:4, 6)],
                               space = 'Lab')(abs(minVal/10)),
              colorRampPalette(colors = brewer.pal(11, 'RdBu')[c(6, 8:11)],
                               space = 'Lab')(maxVal/10)[-1])
When using just two colors and a center, the palette appears quite bland. To get something closer to the preset palettes, the same function can instead interpolate a discrete palette (rather than three individual colors), which preserves the aesthetics of the discrete palette in a continuous one.
An alternative method to visualize the differences in population without transforming the data could be to create a custom color palette with custom ‘bin’ values, distributed so that each bin contains the same number of divisions rather than a consistent range of population values; a sketch of this idea follows.
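This approach was not used in the final app, but a minimal sketch of the idea with leaflet's colorQuantile() and colorBin() (assuming the same popMap object built above) could look like this:
# Sketch only: an equal-count binning alternative to the continuous palette
popChange <- popMap$`2020` - popMap$`2019`
# colorQuantile() builds bins so that each one holds roughly the same number of divisions
quantPal <- colorQuantile('RdBu', domain = popChange, n = 7)
# The same idea with explicit quantile breaks passed to colorBin(); the resulting
# palette function could then be used in addPolygons(fillColor = ~binPal(`2020` - `2019`))
binPal <- colorBin('RdBu',
                   domain = popChange,
                   bins = quantile(popChange, probs = seq(0, 1, length.out = 8)))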
The leaflet choropleth tutorial on GitHub provides a good framework for planning a choropleth application.
As a note, the data needs to be transformed to EPSG:4326 (the longitude/latitude coordinate system used by Google Maps and other web maps) to work in a leaflet application.
labels <- sprintf('<strong>%s</strong><br/>%g net migration/10k',
                  popMap$GEO,
                  round((popMap$`2020`-popMap$`2019`)/popMap$`2019`*10000)) %>%
  lapply(htmltools::HTML)
st_transform(popMap, 4326) %>% # reproject to EPSG:4326 for leaflet
leaflet() %>%
addProviderTiles('Stamen.TonerLite') %>%
setView(lng = -95, lat = 60, zoom = 3) %>%
addPolygons(color = '#444444',
weight = 1,
smoothFactor = 0.5,
opacity = 1,
fillOpacity = 0.7,
fillColor = ~colorNumeric(colorPal, domain)((`2020`-`2019`)/`2019`*10000),
highlightOptions = highlightOptions(color = 'white',
weight = 2,
bringToFront = T),
label = labels,
labelOptions = labelOptions(
style = list('font-weight' = 'normal',
padding = '3px 8px'),
textsize = '15px',
direction = 'auto')) %>%
addLegend(pal = colorNumeric(colorPal, domain = domain),
values = domain,
opacity = 0.7,
title = 'Net Migration Per 10k',
position = 'bottomright')
To use the application, visit my Shiny web page; to see the Shiny code, visit my GitHub page.
Notes on ggplot2 histograms
tibble('PopChange' = popMap$'2020' - popMap$'2019') %>%
  arrange(PopChange) %>%
  ggplot(aes(x = PopChange)) +
  geom_histogram() +
  scale_x_continuous(breaks = 30)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Although I normally prefer ggplot2, the base R function hist() is a lot quicker and easier to use for an exploratory look at a variable's distribution. If ggplot2 is desired, I found this Stack Overflow post comparing hist() vs geom_histogram() to be a good source of information.
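For reference, a quick side-by-side of the two approaches on the same variable (a sketch assuming the popMap object from above); the base R call is a one-liner, while the ggplot2 version takes a bit more setup:
popChange <- popMap$`2020` - popMap$`2019`
# Base R: one line, fine for a quick exploratory look
hist(popChange, breaks = 30)
# ggplot2 equivalent: more verbose, but easier to theme and extend
ggplot(data.frame(popChange), aes(x = popChange)) +
  geom_histogram(bins = 30)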