Section 8 Spatial data

Spatial data refers to any dataset that associates specific recorded values with a unique region of space. This can be latitude/longitude, state or city name, country, or any other spatial information.

In R, spatial data comes in a variety of formats. Amongst the most common are shapefiles, geodatabases, and geojson files. All of these data can be read and processed into a dataframe just like common data formats (csv, excel, etc.). However, we often have to use special libraries to do so. In addition, because of the nature of this data, spatial relationship operations can also be performed to join or create new features. Later in this tutorial, we will generate some examples using the sf and raster libraries to perform these analyses.

8.1 Importing spatial data

In general, the most common forms of geospatial data are shapefiles for vectors and tiffs for rasters. This tutorial will focus primarily on these 2 types of data and how to manipulate them.

8.1.1 Reading shapefile

The shapefile format consists of at least 3 files: the main file (.shp), the index file (.shx), and the table file (.dbf). Sometimes there is also a projection file (.prj). Please visit this webpage for more details on the distinctions and importance of each of these elements.

Because of this multi-file structure, it is good practice to save your shapefile in a folder or .zip file with its associated index file and table file. When reading the data into R, we can simply point to the .shp file in this folder and the code will know how to use the other files. In a similar way to how we read in csv files in the first page of instruction, we will use the sf package to read the shapefile.

# source:
tsunami <- read_sf('data/Historical_Tsunami_Event/Historical_Tsunami_Event.shp')
## tibble [27,459 × 11] (S3: sf/tbl_df/tbl/data.frame)
##  $ ID        : int [1:27459] 1 2 3 4 5 6 7 8 9 10 ...
##  $ TSEVENT_ID: int [1:27459] 3 9 10 10 10 10 10 11 27 27 ...
##  $ YEAR      : int [1:27459] -1610 -479 -426 -426 -426 -426 -426 -373 142 142 ...
##  $ MONTH     : int [1:27459] NA NA 6 6 6 6 6 NA NA NA ...
##  $ DAY       : int [1:27459] NA NA NA NA NA NA NA NA NA NA ...
##  $ DATE_STRIN: chr [1:27459] "-1610/??/??" "-0479/??/??" "-0426/06/??" "-0426/06/??" ...
##  $ ARR_DAY   : int [1:27459] NA NA NA NA NA NA NA NA NA NA ...
##  $ ARR_HOUR  : int [1:27459] NA NA NA NA NA NA NA NA NA NA ...
##  $ ARR_MIN   : int [1:27459] NA NA NA NA NA NA NA NA NA NA ...
##  $ TRAV_HOURS: int [1:27459] NA NA NA NA NA NA NA NA NA NA ...
##  $ geometry  :sfc_POINT of length 27459; first list element:  'XY' num [1:2] 2782987 4232038
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA
##   ..- attr(*, "names")= chr [1:10] "ID" "TSEVENT_ID" "YEAR" "MONTH" ...

When looking at the dataframe structure above, we see that the shapefile is processed similarly to how a regular dataframe is processed, except for the additional geometry field which contains the spatial information. If we want to view the data, it will show up much as a regular csv would, with rows representing observations and columns representing features.

If we want to plot this data, we can simply use the base plot function.

# a plot of the first 4 columns
# plotted as location
plot(tsunami[, 1:4])

# a plot of only columns
# plotted as values without geometry

plot(tsunami$LONGITUDE, tsunami$DIST_FROM_)

If you only want to plot the geometry (points):


8.1.2 Reading raster

Raster .tif files can be imported with the raster library.


# Import the raster file
wind_jan <- raster('data/wc2.1_10m_wind/wc2.1_10m_wind_01.tif')

# Plot the file

8.2 Processing

8.2.1 Creating point data

This example shows how to convert data with latitude and longitude coordinates into a spatial dataframe. We’ll use the charcoal records from the GCD package as our field.

df.table <- paleofiresites

It is important to note that when converting a regular dataframe into a spatial dataframe, you have to define the crs or coordinate reference system that was used to create the spatial values (coordinates) in the dataframes. For example, when plotting points from GPS, you have to define what crs was used by GPS to create the point. In most cases, the default crs for any data is the World Geodetic System 1984, used in GPS with EPSG code of 4326. You can visit this website for help on what EPSG code you should you for the data you’re working on.

wgs84crs <- 4326
sf.table <- st_as_sf(df.table, coords=c('long', 'lat'), crs=wgs84crs, remove=F)

The code above helps us to access the data of charcoal records, save them into a dataframe, and use the st_as_sf() function to transform the data into a shapefile by encoding the ‘long’ and ‘lat’ columns as the coordinate pairs for each observation. We gave the argument remove=F so that the function would not then take these points out of the data, so that we can continue to see them.

By saving what was a regular dataframe as a shapefile, we can now see the difference in how plot() treats the data. Below we give two identical calls, one on the df.table object and the other on the transformed sf.table object. What you will see is that the first call produces the usual matrix plot we expect when we ask plot() to look at multiple columns in a dataset. The second call gives us points plotted according to their geospatial coordinates. With a dataset large enough and broad enough, these data begin to resemble a global map:

# non spatial data

# spatial data

8.2.2 Joining with another dataframe

Oftentimes, we want to join non-spatial data with their geographic location. For example, if we have information on energy production and consumption by state, we can join it with a spatial feature for better visualization.

To do this, we use a function called inner_join, which is one of four possible types of joining. With an inner join, we only include the points from both datafiles that overlap. If you are curious about the distinctions between different types of joins, more information can be found towards the end of the ‘Structuring’ page earlier in our collection.

# Get energy data <- read.csv('data/energy-by_state.csv')

# Use USABoundaries library to get US spatial data
us.geo <- us_states(resolution='low')

When joining a spatial dataframe with a regular dataframe, make sure that the spatial dataframe is on the left side so that the geometry field will continue to be stored as a spatial information.

# joining the 2 data <- us.geo %>% inner_join(, by=c('state_abbr'='State'))

# here we use Albers USA projection for better visualization <- st_transform(, crs=5070)

# Plot the data

8.2.3 Spatial joining

If we look at the paleofiresites data, we can see that we do not have a US state corresponding to each point location. We can create this variable with a spatial join.

# Get all paleo sites for the US
us.paleo <- paleofiresites %>% filter(country=='USA') %>% 
  st_as_sf(coords=c('long', 'lat'), crs=wgs84crs)

When joining 2 spatial dataframes, make sure that they’re both projected to the same coordinate system. This will ensure the integrity of the spatial relationship. In this example, we transformed both data to Albers projection.

# Set data to the same coordinate system
us.paleo.albers <- st_transform(us.paleo, crs=5070)
us.geo.albers <- st_transform(us.geo, crs=5070)

# Perform a spatial join within
sp.join <- st_join(us.paleo.albers, us.geo.albers, join=st_within)

# Produce the plot
plot(sp.join[, 'state_name'], add=T, pch=16)}

For more information on what you can do with a vector file in R, please visit the sf library reference or read the book Geocomputation with R.

8.3 Some Visualization Examples

Now that we have walked through how to read in, manipulate, and clean spatial data, we are going to run through a few key examples of visualization. Below we use the rnaturalearth package, in coordination with the ggplot2 and ggrepel libraries, to generate a few examples, so that you can see the breadth of what plotting spatial data with R can accomplish.

8.3.1 Plotting charcoal records

These are point data from the Global Paleofire Database. And we’ll ma their distribution around the world.

# Get a world map from rnaturalearth library
world <- ne_countries(scale='small', returnclass='sf')

# Project to robinson projection
# here you can see another way of defining the crs <- st_transform(world, crs='+proj=robin')

# Use imported graticule data for viz
# this is just to show the sort of spatial data that you may need to make your map look good
# but this one is not required to visulized the data.
grat <- st_geometry(read_sf('data/ne_110m_graticules_all/ne_110m_wgs84_bounding_box.shp'))

# Get data for paleo sites
paleosites <- paleofiresites %>% 
  mutate(site_name = trimws(as.character(site_name))) %>% 
  st_as_sf(coords=c('long', 'lat'), crs=4326) %>% 
newCoods <- st_coordinates(paleosites)
paleosites$Lon <- newCoods[,1]
paleosites$Lat <- newCoods[,2]

# Set ggplot theme
ggTheme <- theme(
  panel.grid.minor = element_blank(),
  panel.grid.major = element_blank(),
  panel.background = element_blank(),
  plot.background = element_rect(fill='#e6e8ed'),
  panel.border = element_blank(),
  axis.line = element_blank(),
  axis.text.x = element_blank(),
  axis.text.y = element_blank(),
  axis.ticks = element_blank(),
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  plot.title = element_text(size=22)

# Pick a few sites to be highlighted
selected.sites <- c(
  'Little Molas Lake','Xindi','Laguna Oprasa','Morro de Itapeva',
  'Aracatuba', 'Lago del Greppo','Nursery Swamp', 'Huguangyan Maar Lake'

# Complete the mapping
map <- ggplot( +
  geom_sf(data=grat, fill='white', color='white') +
  geom_sf(fill='#c1cdcd', size=.1, color='white') + 
    aes(x=Lon, y=Lat), color='#009ACD',size=1) +
    data=paleosites %>% filter(site_name %in% selected.sites),
    aes(x=Lon, y=Lat), color='#FF0000',size=2) +
    data=paleosites %>% filter(site_name %in% selected.sites),
    aes(x=Lon, y=Lat, label=site_name),
    size=4) +
  coord_sf(datum=st_crs(54030)) +

# Print the map