Lab 09: SDMs - Model fitting, prediction, evaluation

Spatial Ecology FOR 870, Fall 2020

Author

Nan Nourn

Published

November 10, 2022

Introduction

Hello! You are reading an edited version of Lab 9 because you chose a species range whose occurrences falls outside of South America. Here, we will break down the lab to set you up in answering questions 1 and 2. This requires collecting animal occurrences from GBIF, loading BioClim data as a rasterStack, and model fitting and prediction, plus other information regarding reproducibility in an R workflow. If you’ve already gathered your species occurrences and saved them, you can skip to the Environmental Data portion.

Lab 9 also matches up with the vignette chapters title Model fitting, prediction, and evaluation. The chapter titled Modeling methods titled Modeling Methods and Geographic Null Models will be saved for Lab 10.

Self-Contained Projects Workflow in R

You’ll notice that I do not set up a working directory for R – I like to set up my workflow in R with the .Rproj files, and structure my folders following a certain way. With this workflow, I can email my collaborator(s) the .Rproj file and the contained folders (or they can download from my Github) and the project will be structured the same exact way it is on found on my computer. Here, collaborators won’t have to fudge around with their working directory and files. This workflow is especially dependent on the here package.

You can read more from Jenny Bryan here: https://www.tidyverse.org/blog/2017/12/workflow-vs-script/

& Martin Chan has a nice example drawn up: https://martinctc.github.io/blog/rstudio-projects-and-working-directories-a-beginner%27s-guide/

Quarto vs. Rmarkdown

Instead of rmarkdown, I made this file with Quarto. Quarto is pretty similar to Rmarkdown with minor differences, so if you’re a pro with Rmarkdown it might be worth it to check out how Quarto works.

Packages

We will be using a number of different packages.

library(tidyverse)
library(dismo)
library(raster)
library(sp)
library(stars)
library(sf)
library(tmap)
# set your working directory first then run here() again if you're not using an .rproj file
library(here) 
theme_set(theme_minimal())

Species Occurrences (GBIF)

Let’s use one of my favorite animals, the pronghorn antelope (Antilocapra americana). I know enough about the species distribution range that the pronghorn’s range should be in North America. Any occurrences that falls outside of North America is likely a zoo or museum record or a misidentification.

# 9633 records found - might take awhile to load
pronghorn <- gbif("Antilocapra", "americana", nrecs = 300)
# count records to make sure we are doing okay
# take a glimpse of the data
# turn data.frame into tibble
pronghorn %>% count()

     n
1 9670

pronghorn %>% glimpse()

Rows: 9,670
Columns: 187
$ acceptedNameUsage                           <chr> NA, NA, NA, NA, NA, NA, NA…
$ acceptedScientificName                      <chr> "Antilocapra americana (Or…
$ acceptedTaxonKey                            <int> 2440902, 2440902, 2440902,…
$ accessRights                                <chr> NA, NA, NA, NA, NA, NA, NA…
$ adm1                                        <chr> "Montana", "Colorado", "Ne…
$ adm2                                        <chr> NA, NA, NA, NA, NA, NA, NA…
$ associatedOccurrences                       <chr> NA, NA, NA, NA, NA, NA, NA…
$ associatedReferences                        <chr> NA, NA, NA, NA, NA, NA, NA…
$ associatedSequences                         <chr> NA, NA, NA, NA, NA, NA, NA…
$ associatedTaxa                              <chr> NA, NA, NA, NA, NA, NA, NA…
$ basisOfRecord                               <chr> "HUMAN_OBSERVATION", "HUMA…
$ bed                                         <chr> NA, NA, NA, NA, NA, NA, NA…
$ behavior                                    <chr> NA, NA, NA, NA, NA, NA, NA…
$ bibliographicCitation                       <chr> NA, NA, NA, NA, NA, NA, NA…
$ catalogNumber                               <chr> "104556531", "104443420", …
$ class                                       <chr> "Mammalia", "Mammalia", "M…
$ classKey                                    <int> 359, 359, 359, 359, 359, 3…
$ cloc                                        <chr> "Montana, United States", …
$ collectionCode                              <chr> "Observations", "Observati…
$ collectionID                                <chr> NA, NA, NA, NA, NA, NA, NA…
$ collectionKey                               <chr> NA, NA, NA, NA, NA, NA, NA…
$ continent                                   <chr> NA, NA, NA, NA, NA, NA, NA…
$ coordinatePrecision                         <dbl> NA, NA, NA, NA, NA, NA, NA…
$ coordinateUncertaintyInMeters               <dbl> 61, 977, 61, 222, 48, 378,…
$ country                                     <chr> "United States", "United S…
$ crawlId                                     <int> 324, 324, 324, 324, 324, 3…
$ dataGeneralizations                         <chr> NA, NA, NA, NA, NA, NA, NA…
$ datasetID                                   <chr> NA, NA, NA, NA, NA, NA, NA…
$ datasetKey                                  <chr> "50c9509d-22c7-4a22-a47d-8…
$ datasetName                                 <chr> "iNaturalist research-grad…
$ dateIdentified                              <chr> "2022-01-08T03:57:50", "20…
$ day                                         <int> 7, 1, 5, 9, 10, 9, 17, 8, …
$ depth                                       <dbl> NA, NA, NA, NA, NA, NA, NA…
$ depthAccuracy                               <dbl> NA, NA, NA, NA, NA, NA, NA…
$ disposition                                 <chr> NA, NA, NA, NA, NA, NA, NA…
$ dynamicProperties                           <chr> NA, NA, NA, NA, NA, NA, NA…
$ earliestAgeOrLowestStage                    <chr> NA, NA, NA, NA, NA, NA, NA…
$ earliestEonOrLowestEonothem                 <chr> NA, NA, NA, NA, NA, NA, NA…
$ earliestEpochOrLowestSeries                 <chr> NA, NA, NA, NA, NA, NA, NA…
$ earliestEraOrLowestErathem                  <chr> NA, NA, NA, NA, NA, NA, NA…
$ earliestPeriodOrLowestSystem                <chr> NA, NA, NA, NA, NA, NA, NA…
$ elevation                                   <dbl> NA, NA, NA, NA, NA, NA, NA…
$ elevationAccuracy                           <dbl> NA, NA, NA, NA, NA, NA, NA…
$ endDayOfYear                                <chr> NA, NA, NA, NA, NA, NA, NA…
$ establishmentMeans                          <chr> NA, NA, NA, NA, NA, NA, NA…
$ eventDate                                   <chr> "2022-01-07T09:56:30", "20…
$ eventID                                     <chr> NA, NA, NA, NA, NA, NA, NA…
$ eventRemarks                                <chr> NA, NA, NA, NA, NA, NA, NA…
$ eventTime                                   <chr> "09:56:30-07:00", "11:03:0…
$ family                                      <chr> "Antilocapridae", "Antiloc…
$ familyKey                                   <int> 5715, 5715, 5715, 5715, 57…
$ fieldNotes                                  <chr> NA, NA, NA, NA, NA, NA, NA…
$ fieldNumber                                 <chr> NA, NA, NA, NA, NA, NA, NA…
$ footprintSRS                                <chr> NA, NA, NA, NA, NA, NA, NA…
$ footprintWKT                                <chr> NA, NA, NA, NA, NA, NA, NA…
$ formation                                   <chr> NA, NA, NA, NA, NA, NA, NA…
$ fullCountry                                 <chr> "United States of America"…
$ gbifID                                      <chr> "3455555450", "3455633105"…
$ genericName                                 <chr> "Antilocapra", "Antilocapr…
$ genus                                       <chr> "Antilocapra", "Antilocapr…
$ genusKey                                    <int> 2440901, 2440901, 2440901,…
$ geodeticDatum                               <chr> "WGS84", "WGS84", "WGS84",…
$ geologicalContextID                         <chr> NA, NA, NA, NA, NA, NA, NA…
$ georeferencedBy                             <chr> NA, NA, NA, NA, NA, NA, NA…
$ georeferencedDate                           <chr> NA, NA, NA, NA, NA, NA, NA…
$ georeferenceProtocol                        <chr> NA, NA, NA, NA, NA, NA, NA…
$ georeferenceRemarks                         <chr> NA, NA, NA, NA, NA, NA, NA…
$ georeferenceSources                         <chr> NA, NA, NA, NA, NA, NA, NA…
$ georeferenceVerificationStatus              <chr> NA, NA, NA, NA, NA, NA, NA…
$ habitat                                     <chr> NA, NA, NA, NA, NA, NA, NA…
$ higherClassification                        <chr> NA, NA, NA, NA, NA, NA, NA…
$ higherGeography                             <chr> NA, NA, NA, NA, NA, NA, NA…
$ higherGeographyID                           <chr> NA, NA, NA, NA, NA, NA, NA…
$ highestBiostratigraphicZone                 <chr> NA, NA, NA, NA, NA, NA, NA…
$ hostingOrganizationKey                      <chr> "28eb1a3f-1c15-4a95-931a-4…
$ `http://unknown.org/canonicalName`          <chr> NA, NA, NA, NA, NA, NA, NA…
$ `http://unknown.org/captive`                <chr> "wild", "wild", "wild", "w…
$ `http://unknown.org/combinationAuthors`     <chr> NA, NA, NA, NA, NA, NA, NA…
$ `http://unknown.org/combinationYear`        <chr> NA, NA, NA, NA, NA, NA, NA…
$ `http://unknown.org/language`               <chr> NA, NA, NA, NA, NA, NA, NA…
$ `http://unknown.org/nick`                   <chr> "wkonwent", "sichtopher_ch…
$ `http://unknown.org/recordId`               <chr> NA, NA, NA, NA, NA, NA, NA…
$ `http://unknown.org/recordID`               <chr> NA, NA, NA, NA, NA, NA, NA…
$ `http://unknown.org/rights`                 <chr> NA, NA, NA, NA, NA, NA, NA…
$ `http://unknown.org/rightsHolder`           <chr> NA, NA, NA, NA, NA, NA, NA…
$ `http://unknown.org/verbatimScientificName` <chr> NA, NA, NA, NA, NA, NA, NA…
$ identificationID                            <chr> "230326964", "229979364", …
$ identificationQualifier                     <chr> NA, NA, NA, NA, NA, NA, NA…
$ identificationRemarks                       <chr> NA, NA, NA, NA, NA, NA, NA…
$ identificationVerificationStatus            <chr> NA, NA, NA, NA, NA, NA, NA…
$ identifiedBy                                <chr> "wkonwent", "sichtopher_ch…
$ identifier                                  <chr> "104556531", "104443420", …
$ individualCount                             <int> NA, NA, NA, NA, NA, NA, NA…
$ informationWithheld                         <chr> NA, NA, NA, NA, NA, NA, "C…
$ infraspecificEpithet                        <chr> NA, NA, NA, NA, NA, NA, NA…
$ installationKey                             <chr> "997448a8-f762-11e1-a439-0…
$ institutionCode                             <chr> "iNaturalist", "iNaturalis…
$ institutionID                               <chr> NA, NA, NA, NA, NA, NA, NA…
$ institutionKey                              <chr> NA, NA, NA, NA, NA, NA, NA…
$ isInCluster                                 <lgl> FALSE, FALSE, FALSE, FALSE…
$ island                                      <chr> NA, NA, NA, NA, NA, NA, NA…
$ ISO2                                        <chr> "US", "US", "US", "US", "U…
$ iucnRedListCategory                         <chr> "LC", "LC", "LC", "LC", "L…
$ key                                         <chr> "3455555450", "3455633105"…
$ kingdom                                     <chr> "Animalia", "Animalia", "A…
$ kingdomKey                                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ language                                    <chr> NA, NA, NA, NA, NA, NA, NA…
$ lastCrawled                                 <chr> "2022-11-11T11:44:56.941+0…
$ lastInterpreted                             <chr> "2022-11-11T17:34:22.541+0…
$ lastParsed                                  <chr> "2022-11-11T17:34:22.541+0…
$ lat                                         <dbl> 45.15296, 37.83146, 36.571…
$ latestAgeOrHighestStage                     <chr> NA, NA, NA, NA, NA, NA, NA…
$ latestEonOrHighestEonothem                  <chr> NA, NA, NA, NA, NA, NA, NA…
$ latestEpochOrHighestSeries                  <chr> NA, NA, NA, NA, NA, NA, NA…
$ latestEraOrHighestErathem                   <chr> NA, NA, NA, NA, NA, NA, NA…
$ latestPeriodOrHighestSystem                 <chr> NA, NA, NA, NA, NA, NA, NA…
$ license                                     <chr> "http://creativecommons.or…
$ lifeStage                                   <chr> NA, NA, NA, NA, NA, NA, "A…
$ lithostratigraphicTerms                     <chr> NA, NA, NA, NA, NA, NA, NA…
$ locality                                    <chr> NA, NA, NA, NA, NA, NA, NA…
$ locationAccordingTo                         <chr> NA, NA, NA, NA, NA, NA, NA…
$ locationID                                  <chr> NA, NA, NA, NA, NA, NA, NA…
$ locationRemarks                             <chr> NA, NA, NA, NA, NA, NA, NA…
$ lon                                         <dbl> -110.8199, -105.9221, -103…
$ lowestBiostratigraphicZone                  <chr> NA, NA, NA, NA, NA, NA, NA…
$ materialSampleID                            <chr> NA, NA, NA, NA, NA, NA, NA…
$ member                                      <chr> NA, NA, NA, NA, NA, NA, NA…
$ modified                                    <chr> "2022-04-13T15:35:45.000+0…
$ month                                       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ municipality                                <chr> NA, NA, NA, NA, NA, NA, NA…
$ nameAccordingTo                             <chr> NA, NA, NA, NA, NA, NA, NA…
$ namePublishedIn                             <chr> NA, NA, NA, NA, NA, NA, NA…
$ namePublishedInYear                         <chr> NA, NA, NA, NA, NA, NA, NA…
$ nomenclaturalCode                           <chr> NA, NA, NA, NA, NA, NA, NA…
$ occurrenceID                                <chr> "https://www.inaturalist.o…
$ occurrenceRemarks                           <chr> NA, NA, NA, NA, "6 animals…
$ occurrenceStatus                            <chr> "PRESENT", "PRESENT", "PRE…
$ order                                       <chr> "Artiodactyla", "Artiodact…
$ orderKey                                    <int> 731, 731, 731, 731, 731, 7…
$ organismID                                  <chr> NA, NA, NA, NA, NA, NA, NA…
$ organismName                                <chr> NA, NA, NA, NA, NA, NA, NA…
$ organismQuantity                            <dbl> NA, NA, NA, NA, NA, NA, NA…
$ organismQuantityType                        <chr> NA, NA, NA, NA, NA, NA, NA…
$ otherCatalogNumbers                         <chr> NA, NA, NA, NA, NA, NA, NA…
$ ownerInstitutionCode                        <chr> NA, NA, NA, NA, NA, NA, NA…
$ parentNameUsage                             <chr> NA, NA, NA, NA, NA, NA, NA…
$ phylum                                      <chr> "Chordata", "Chordata", "C…
$ phylumKey                                   <int> 44, 44, 44, 44, 44, 44, 44…
$ preparations                                <chr> NA, NA, NA, NA, NA, NA, NA…
$ previousIdentifications                     <chr> NA, NA, NA, NA, NA, NA, NA…
$ programmeAcronym                            <chr> NA, NA, NA, NA, NA, NA, NA…
$ projectId                                   <chr> NA, NA, NA, NA, NA, NA, NA…
$ protocol                                    <chr> "DWC_ARCHIVE", "DWC_ARCHIV…
$ publishingCountry                           <chr> "PA", "US", "US", "US", "U…
$ publishingOrgKey                            <chr> "28eb1a3f-1c15-4a95-931a-4…
$ recordedBy                                  <chr> "wkonwent", "sichtopher_ch…
$ recordNumber                                <chr> NA, NA, NA, NA, NA, NA, NA…
$ references                                  <chr> "https://www.inaturalist.o…
$ rights                                      <chr> NA, NA, NA, NA, NA, NA, NA…
$ rightsHolder                                <chr> "wkonwent", "sichtopher_ch…
$ samplingProtocol                            <chr> NA, NA, NA, NA, NA, NA, NA…
$ scientificName                              <chr> "Antilocapra americana (Or…
$ scientificNameID                            <chr> NA, NA, NA, NA, NA, NA, NA…
$ sex                                         <chr> NA, NA, NA, NA, NA, NA, "M…
$ species                                     <chr> "Antilocapra americana", "…
$ speciesKey                                  <int> 2440902, 2440902, 2440902,…
$ specificEpithet                             <chr> "americana", "americana", …
$ startDayOfYear                              <chr> NA, NA, NA, NA, NA, NA, NA…
$ taxonConceptID                              <chr> NA, NA, NA, NA, NA, NA, NA…
$ taxonID                                     <chr> "42429", "42429", "42429",…
$ taxonKey                                    <int> 2440902, 2440902, 2440902,…
$ taxonomicStatus                             <chr> "ACCEPTED", "ACCEPTED", "A…
$ taxonRank                                   <chr> "SPECIES", "SPECIES", "SPE…
$ taxonRemarks                                <chr> NA, NA, NA, NA, NA, NA, NA…
$ type                                        <chr> NA, NA, NA, NA, NA, NA, NA…
$ typeStatus                                  <chr> NA, NA, NA, NA, NA, NA, NA…
$ typifiedName                                <chr> NA, NA, NA, NA, NA, NA, NA…
$ verbatimCoordinateSystem                    <chr> NA, NA, NA, NA, NA, NA, NA…
$ verbatimElevation                           <chr> NA, NA, NA, NA, NA, NA, NA…
$ verbatimEventDate                           <chr> "Fri Jan 07 2022 09:56:30 …
$ verbatimIdentification                      <chr> NA, NA, NA, NA, NA, NA, NA…
$ verbatimLabel                               <chr> NA, NA, NA, NA, NA, NA, NA…
$ verbatimLocality                            <chr> "Custer Gallatin National …
$ verbatimTaxonRank                           <chr> NA, NA, NA, NA, NA, NA, NA…
$ vernacularName                              <chr> NA, NA, NA, NA, NA, NA, NA…
$ year                                        <int> 2022, 2022, 2022, 2022, 20…
$ downloadDate                                <date> 2022-11-11, 2022-11-11, 2…

pronghorn <- pronghorn %>% as_tibble()

# GBIF provides us with so many columns!
# lets just use select() to choose the columns we only care about
# latitude, longitude, geo-referenced records and locality
# we then can mutate a new column and check out the # of records
pronghorn %>% 
  dplyr::select(lon,lat,locality) %>%
  mutate(georef_check = if_else(!is.na(lon) & !is.na(lat),"geo-referenced","not geo-referenced")) %>%
  count(georef_check)

# A tibble: 2 × 2
  georef_check           n
  <chr>              <int>
1 geo-referenced      8117
2 not geo-referenced  1553

# let's examine the top 25 localities of data that is not geo-referenced
pronghorn %>%
  filter(is.na(lon), is.na(lat)) %>%
  dplyr::select(locality) %>%
  count(locality, sort = TRUE) %>%
  mutate(locality = fct_reorder(locality, n)) %>%
  slice_head(n = 25) %>%
  ggplot(aes(x = locality, y = n, label = n)) +
  geom_col(size = 0.7) +
  geom_label() +
  coord_flip() +
  scale_y_discrete(expand = c(0,0)) +
  labs(title = "Top 25 non geo-referenced localities for pronghorn",
       subtitle = "GBIF")

# let's plot our pronghorn occurrences
# we load the World basemap from the `tmap` package
data(World)
World %>% glimpse()

Rows: 177
Columns: 16
$ iso_a3       <fct> AFG, AGO, ALB, ARE, ARG, ARM, ATA, ATF, AUS, AUT, AZE, BD…
$ name         <fct> Afghanistan, Angola, Albania, United Arab Emirates, Argen…
$ sovereignt   <fct> Afghanistan, Angola, Albania, United Arab Emirates, Argen…
$ continent    <fct> Asia, Africa, Europe, Asia, South America, Asia, Antarcti…
$ area         [km^2] 652860.000 [km^2], 1246700.000 [km^2], 27400.000 [km^2],…
$ pop_est      <dbl> 28400000, 12799293, 3639453, 4798491, 40913584, 2967004, …
$ pop_est_dens <dbl> 4.350090e+01, 1.026654e+01, 1.328268e+02, 6.734519e+01, 1…
$ economy      <fct> 7. Least developed region, 7. Least developed region, 6. …
$ income_grp   <fct> 5. Low income, 3. Upper middle income, 4. Lower middle in…
$ gdp_cap_est  <dbl> 784.1549, 8617.6635, 5992.6588, 38407.9078, 14027.1261, 6…
$ life_exp     <dbl> 59.668, NA, 77.347, NA, 75.927, 74.446, NA, NA, 82.052, 8…
$ well_being   <dbl> 3.800000, NA, 5.500000, NA, 6.500000, 4.300000, NA, NA, 7…
$ footprint    <dbl> 0.79000, NA, 2.21000, NA, 3.14000, 2.23000, NA, NA, 9.310…
$ inequality   <dbl> 0.42655744, NA, 0.16513372, NA, 0.16423830, 0.21664810, N…
$ HPI          <dbl> 20.22535, NA, 36.76687, NA, 35.19024, 25.66642, NA, NA, 2…
$ geometry     <MULTIPOLYGON [°]> MULTIPOLYGON (((61.21082 35..., MULTIPOLYGON…

World %>% 
  st_geometry %>% 
  plot()

# create sf object for North America (you create for your own locality)
northamerica_sf <- World %>% 
  filter(continent=="North America" & name!="Greenland")
# check CRS # EPSG: 4326 & WGS 84
northamerica_sf %>% st_crs()

Coordinate Reference System:
  User input: EPSG:4326 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    USAGE[
        SCOPE["unknown"],
        AREA["World"],
        BBOX[-90,-180,90,180]],
    ID["EPSG",4326]]

# let's also make an sp object just in case for later
northamerica_sp <- as_Spatial(northamerica_sf)

# grab coordinates and transform to sf object!
pronghorn_sf <- pronghorn %>% 
  dplyr::select(lon, lat) %>%
  filter(!is.na(lon) & !is.na(lat)) %>%
  st_as_sf(coords = c("lon", "lat"), remove = FALSE) %>%
  st_set_crs(4236)
# check CRS
pronghorn_sf %>% st_crs()

Coordinate Reference System:
  User input: EPSG:4236 
  wkt:
GEOGCRS["Hu Tzu Shan 1950",
    DATUM["Hu Tzu Shan 1950",
        ELLIPSOID["International 1924",6378388,297,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    USAGE[
        SCOPE["Geodesy."],
        AREA["Taiwan, Republic of China - onshore - Taiwan Island, Penghu (Pescadores) Islands."],
        BBOX[21.87,119.25,25.34,122.06]],
    ID["EPSG",4236]]

# plot all locations across North America QA/QC look for outliers
# use ggplot for a static map and tmap for an interactive map
# ggplot
ggplot() +
  geom_sf(data = northamerica_sf, alpha = 0.2) +
  geom_sf(data = pronghorn_sf, aes(color = lon), alpha = 0.7) +
  scale_color_viridis_c(option = "C", name = "Longitude") +
  scale_fill_discrete(name = "Country") +
    labs(title = "Pronghorn georeferenced occurrences",
         subtitle = "GBIF")

# tmap
#tmap_mode("view")
#tm_shape(northamerica_sf) +
#  tm_fill(alpha = 0.5) +
#  tm_shape(pronghorn_sf) +
#  tm_dots(col = "lon", border.col = NA,
#          alpha = 0.5, palette = "viridis",
#          midpoint = NA)

# there are 2 locations near the equator in the Atlantic Ocean 
# skeptical that there are pronghorn by the great lakes region
# we clean pronghorn locations by filtering out locations based on longitude
# getting rid anything east of 85° W should be sufficient
pronghorn_sf_filtered <- pronghorn_sf %>% filter(lon < -85)
# plot again
# ggplot
ggplot() +
  geom_sf(data = northamerica_sf, alpha = 0.2) +
  geom_sf(data = pronghorn_sf_filtered, aes(color = lon), 
          alpha = 0.7) +
  scale_color_viridis_c(option = "C", name = "Longitude") +
  scale_fill_discrete(name = "Country") +
    labs(title = "Pronghorn georeferenced occurrences",
         subtitle = "GBIF")

# tmap
tmap_mode("view")
tm_shape(northamerica_sf) +
  tm_fill(alpha = 0.5) +
  tm_shape(pronghorn_sf_filtered) +
  tm_dots(col = "lon", border.col = NA,
          alpha = 0.5, palette = "viridis",
          midpoint = NA)

Looks pretty good for pronghorn range and we are satisfied with GBIF occurrence records.

Environmental Data

Here we set up our raster stack of bioclim variables. We hope that we can download environmental data for North America since we’re targeting this extent.

A reminder:

BIO1 = Annual Mean Temperature
BIO2 = Mean Diurnal Range (Mean of monthly (max temp - min temp))
BIO3 = Isothermality (BIO2/BIO7) (×100)
BIO4 = Temperature Seasonality (standard deviation ×100)
BIO5 = Max Temperature of Warmest Month
BIO6 = Min Temperature of Coldest Month
BIO7 = Temperature Annual Range (BIO5-BIO6)
BIO8 = Mean Temperature of Wettest Quarter
BIO9 = Mean Temperature of Driest Quarter
BIO10 = Mean Temperature of Warmest Quarter
BIO11 = Mean Temperature of Coldest Quarter
BIO12 = Annual Precipitation
BIO13 = Precipitation of Wettest Month
BIO14 = Precipitation of Driest Month
BIO15 = Precipitation Seasonality (Coefficient of Variation)
BIO16 = Precipitation of Wettest Quarter
BIO17 = Precipitation of Driest Quarter
BIO18 = Precipitation of Warmest Quarter
BIO19 = Precipitation of Coldest Quarter

Raster data

here()

[1] "/Users/nournnan/Library/CloudStorage/OneDrive-MichiganStateUniversity/Grad/2020 Fall/FOR 870/Lab/script"

# create the folders (directories) "data"
# if they exist already, this command won't over-write them.
# create directory called "data"
dir.create("data", recursive = TRUE)

# Download WorldClim 2 data and unzip it
# Refer to the URL for your choice of data:

# Download 10 minute res worldclim V 2.1
wc_url_10m <- "https://geodata.ucdavis.edu/climate/worldclim/2_1/base/wc2.1_10m_bio.zip"
# Download 5 minute res worldclim V 2.1
wc_url_5m <- "https://geodata.ucdavis.edu/climate/worldclim/2_1/base/wc2.1_5m_bio.zip"
# Download 2.5 minute res worldclim V 2.1
wc_url_2.5m <- "https://geodata.ucdavis.edu/climate/worldclim/2_1/base/wc2.1_2.5m_bio.zip"
# Download 30 second res worldclim V 2.1
wc_url_30s <- "https://geodata.ucdavis.edu/climate/worldclim/2_1/base/wc2.1_30s_bio.zip"
 
# example with 10 minute per degree solution data in "data" folder
# download file
download.file(url = wc_url_10m, destfile = here("data", "wc_10m.zip"),
              mode = "wb")
# unzip file
unzip(zipfile = here("data", "wc_10m.zip"), exdir = here("data"))
# load the data 
bioclim_files <- list.files(here("data"), pattern = "wc2.1_10m_bio_",
                            full.names = TRUE)

# create a RasterStack of predictor variables
predictors <- stack(bioclim_files)
# plot to check
predictors %>% plot()

# we don't need the entire globe, 
# let's crop out the information & only use North America extent
# (here you can create your own extent)
# create bounding box
bbox_sp <- bbox(northamerica_sp)
# now crop
predictors_na <- crop(predictors, bbox_sp)
# let's also change names for downstream modelling
names(predictors_na)

 [1] "wc2.1_10m_bio_1"  "wc2.1_10m_bio_10" "wc2.1_10m_bio_11" "wc2.1_10m_bio_12"
 [5] "wc2.1_10m_bio_13" "wc2.1_10m_bio_14" "wc2.1_10m_bio_15" "wc2.1_10m_bio_16"
 [9] "wc2.1_10m_bio_17" "wc2.1_10m_bio_18" "wc2.1_10m_bio_19" "wc2.1_10m_bio_2" 
[13] "wc2.1_10m_bio_3"  "wc2.1_10m_bio_4"  "wc2.1_10m_bio_5"  "wc2.1_10m_bio_6" 
[17] "wc2.1_10m_bio_7"  "wc2.1_10m_bio_8"  "wc2.1_10m_bio_9"

bioclim_names <- names(predictors_na)
bioclim_names <- str_remove(bioclim_names, "wc2.1_10m_")
names(predictors_na) <- bioclim_names
# look at names information to see if they changed
predictors_na

class      : RasterBrick 
dimensions : 456, 715, 326040, 19  (nrow, ncol, ncell, nlayers)
resolution : 0.1666667, 0.1666667  (x, y)
extent     : -171.8333, -52.66667, 7.166667, 83.16667  (xmin, xmax, ymin, ymax)
crs        : +proj=longlat +datum=WGS84 +no_defs 
source     : memory
names      :       bio_1,      bio_10,      bio_11,      bio_12,      bio_13,      bio_14,      bio_15,      bio_16,      bio_17,      bio_18,      bio_19,       bio_2,       bio_3,       bio_4,       bio_5, ... 
min values :  -26.755625,   -9.820499,  -42.456039,   11.000000,    2.000000,    0.000000,    5.517085,    6.000000,    0.000000,    2.000000,    0.000000,    1.000000,    9.131122,    0.000000,   -6.246750, ... 
max values :    29.39055,    35.13550,    28.16318, 11191.00000,  2328.00000,   200.00000,   138.00240,  5282.00000,   700.00000,  5282.00000,  1584.00000,    21.14754,   100.00000,  1739.37622,    44.68700, ...

# check bioclim data
plot(predictors_na)

Extracting Values From Rasters

Here you can choose to follow the original vignette, or follow below. There are some subtle nuances that may have changed from the vignette since we’ve downloaded our own BioClim variables to make our rasterStack (e.g. the names of the variables and we also do not have a biome raster). I would read this webpage along with the dismo vignette in parallel to learn what is going on with the code. Anything that is in italicized font below was copy and pasted from the original vignette.

We now have a set of predictor variables (rasters) and occurrence points. The next step is to extract the values of the predictors at the locations of the points. (This step can be skipped for the modeling methods that are implemented in the dismo package). This is a very straightforward thing to do using the ‘extract’ function from the raster package. In the example below we use that function first for the Pronghorn occurrence points, then for 8,000 random background points. We combine these into a single data.frame in which the first column (variable ‘pb’) indicates whether this is a presence or a background point.

presvals <- raster::extract(predictors_na, pronghorn_sf_filtered)
# setting random seed to always create the same
# random set of points for this example
set.seed(0)
# not sure what happens if you increase random points
# you might want to try to find out
# i changed mine to 8000 because i have around 8000 pronghorn presences
backgr <- randomPoints(predictors_na, 8000)
absvals <- extract(predictors_na, backgr)
pb <- c(rep(1, nrow(presvals)), rep(0, nrow(absvals)))
# create data tibble for sdm
sdmdata <- data.frame(cbind(pb, rbind(presvals, absvals)))
sdmdata %>% count(pb, sort = TRUE)

  pb    n
1  1 8107
2  0 8000

sdmdata %>% glimpse()

Rows: 16,107
Columns: 20
$ pb     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ bio_1  <dbl> 2.886625, 5.721406, 11.547552, 2.886625, 15.027687, 16.482063, …
$ bio_10 <dbl> 13.51179, 16.56792, 21.97633, 13.51179, 23.00483, 24.75971, 18.…
$ bio_11 <dbl> -7.23016644, -6.40008354, 1.56791675, -7.23016644, 7.99366665, …
$ bio_12 <dbl> 567, 229, 405, 567, 442, 478, 342, 443, 433, 382, 321, 237, 497…
$ bio_13 <dbl> 66, 46, 74, 66, 82, 111, 60, 73, 49, 62, 45, 30, 86, 70, 68, 68…
$ bio_14 <dbl> 37, 6, 7, 37, 0, 5, 14, 9, 18, 10, 10, 10, 11, 7, 13, 9, 0, 9, …
$ bio_15 <dbl> 19.44465, 61.80762, 69.87884, 19.44465, 87.80043, 80.83881, 48.…
$ bio_16 <dbl> 177, 111, 197, 177, 238, 262, 143, 181, 138, 170, 130, 76, 245,…
$ bio_17 <dbl> 124, 20, 23, 124, 5, 29, 49, 43, 66, 31, 34, 48, 42, 25, 49, 38…
$ bio_18 <dbl> 154, 101, 197, 154, 12, 221, 128, 146, 80, 147, 118, 52, 209, 1…
$ bio_19 <dbl> 133, 20, 23, 133, 226, 101, 51, 111, 122, 31, 34, 48, 42, 89, 4…
$ bio_2  <dbl> 14.37233, 19.15202, 16.18177, 14.37233, 18.73700, 16.93354, 13.…
$ bio_3  <dbl> 38.30605, 42.32280, 42.27240, 38.30605, 52.14717, 49.93709, 31.…
$ bio_4  <dbl> 851.3237, 929.4572, 829.3596, 851.3237, 627.4905, 668.8680, 108…
$ bio_5  <dbl> 23.98275, 27.15850, 31.15575, 23.98275, 35.56200, 33.77300, 27.…
$ bio_6  <dbl> -13.53700, -18.09375, -7.12400, -13.53700, -0.36900, -0.13675, …
$ bio_7  <dbl> 37.51975, 45.25225, 38.27975, 37.51975, 35.93100, 33.90975, 43.…
$ bio_8  <dbl> 10.6803331, 15.9158745, 21.9763336, 10.6803331, 8.7890835, 24.1…
$ bio_9  <dbl> -2.778833, -6.400084, 1.567917, -2.778833, 22.964375, 19.471834…

# pairs plot for bioclims 1, 5, 12
pairs(sdmdata[,c(2,5,16)], cex=0.1)

# save work for next section
saveRDS(sdmdata, "data/sdm.Rds")
saveRDS(presvals, "data/pvals.Rds")

Model fitting, prediction, and evaluation

Model Fitting

# load the data if not in the environment
sdmdata <- readRDS("data/sdm.Rds")
presvals <- readRDS("data/pvals.Rds")

# fit reduced model with bioclims 1, 5, 6, 12, and 17
m_red <- glm(pb ~ bio_1 + bio_5 + bio_6 + bio_12 + bio_17, data = sdmdata)
m_red %>% class()

[1] "glm" "lm"

m_red %>% summary()


Call:
glm(formula = pb ~ bio_1 + bio_5 + bio_6 + bio_12 + bio_17, data = sdmdata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.1762  -0.2870   0.0590   0.3253   3.5512  

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.629e-01  2.868e-02  -9.166  < 2e-16 ***
bio_1       -1.268e-01  2.622e-03 -48.381  < 2e-16 ***
bio_5        9.480e-02  1.620e-03  58.530  < 2e-16 ***
bio_6        6.302e-02  1.312e-03  48.038  < 2e-16 ***
bio_12      -3.840e-04  1.424e-05 -26.975  < 2e-16 ***
bio_17       3.735e-04  7.548e-05   4.948 7.58e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.1488072)

    Null deviance: 4026.6  on 16106  degrees of freedom
Residual deviance: 2395.9  on 16101  degrees of freedom
AIC: 15032

Number of Fisher Scoring iterations: 2

# fit full model with all bioclim variables
m_full = glm(pb ~ ., data = sdmdata)
m_full %>% class()

[1] "glm" "lm"

m_full %>% summary()


Call:
glm(formula = pb ~ ., data = sdmdata)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.13275  -0.14795   0.06057   0.20386   0.90592  

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.798e-01  4.692e-02   8.095 6.11e-16 ***
bio_1       -1.087e-01  7.904e-03 -13.747  < 2e-16 ***
bio_10       3.993e-01  1.873e-02  21.314  < 2e-16 ***
bio_11      -4.366e-01  1.576e-02 -27.702  < 2e-16 ***
bio_12       3.252e-04  6.508e-05   4.996 5.91e-07 ***
bio_13       3.798e-03  3.540e-04  10.728  < 2e-16 ***
bio_14       3.189e-03  1.106e-03   2.883  0.00394 ** 
bio_15       8.508e-04  3.001e-04   2.835  0.00458 ** 
bio_16      -2.248e-03  1.719e-04 -13.076  < 2e-16 ***
bio_17      -2.109e-03  3.919e-04  -5.383 7.42e-08 ***
bio_18      -1.055e-04  6.869e-05  -1.537  0.12441    
bio_19      -5.844e-05  6.024e-05  -0.970  0.33201    
bio_2        1.297e-01  4.626e-03  28.030  < 2e-16 ***
bio_3       -8.327e-03  7.072e-04 -11.775  < 2e-16 ***
bio_4       -9.544e-03  3.753e-04 -25.427  < 2e-16 ***
bio_5        6.552e+03  2.315e+03   2.830  0.00466 ** 
bio_6       -6.552e+03  2.315e+03  -2.830  0.00466 ** 
bio_7       -6.552e+03  2.315e+03  -2.830  0.00466 ** 
bio_8        3.605e-03  7.489e-04   4.813 1.50e-06 ***
bio_9       -4.435e-03  5.479e-04  -8.094 6.18e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.1067704)

    Null deviance: 4026.6  on 16106  degrees of freedom
Residual deviance: 1717.6  on 16087  degrees of freedom
AIC: 9699.1

Number of Fisher Scoring iterations: 2

# fit null model
m_null <- glm(pb ~ 1, data = sdmdata)
m_null %>% class()

[1] "glm" "lm"

m_null %>% summary()


Call:
glm(formula = pb ~ 1, data = sdmdata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.5033  -0.5033   0.4967   0.4967   0.4967  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.50332    0.00394   127.8   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.2500045)

    Null deviance: 4026.6  on 16106  degrees of freedom
Residual deviance: 4026.6  on 16106  degrees of freedom
AIC: 23384

Number of Fisher Scoring iterations: 2

# dismo model
# Models that are implemented in dismo do not use a formula 
# (and most models only take presence points). 
# Bioclim is an example. It only uses presence data, 
# so we use ‘presvals’ instead of ‘sdmdata’.
bc <- bioclim(presvals[,c('bio_1', 'bio_5', 'bio_6', 'bio_12', 'bio_17')])
bc %>% class()

[1] "Bioclim"
attr(,"package")
[1] "dismo"

bc

class    : Bioclim 

variables: bio_1 bio_5 bio_6 bio_12 bio_17 


presence points: 8107 
       bio_1    bio_5     bio_6 bio_12 bio_17
1   2.886625 23.98275 -13.53700    567    124
2   5.721406 27.15850 -18.09375    229     20
3  11.547552 31.15575  -7.12400    405     23
4   2.886625 23.98275 -13.53700    567    124
5  15.027687 35.56200  -0.36900    442      5
6  16.482063 33.77300  -0.13675    478     29
7   5.036927 27.11175 -16.39175    342     49
8  13.651906 33.77525  -4.11800    443     43
9   8.528791 29.26100  -5.25625    433     66
10  8.406760 29.44675  -9.36450    382     31
  (... ...  ...)

bc %>% pairs()

Model Prediction

Different modeling methods return different type of ‘model’ objects (typically they have the same name as the modeling method used). All of these ‘model’ objects, irrespective of their exact class, can be used to with the predict function to make predictions for any combination of values of the independent variables. This is illustrated in the example below where we make predictions with the glm model object ‘m1’ and for bioclim model ‘bc’, for three records with values for variables bio1, bio5 and bio12 (the variables used in the example above to create the model objects).

library(stars)
# i chose these values from my bc pairs plot
# looking at the histograms for each bio variable
# use all predictions since we want to compare full model
predictors_na %>% 
  st_as_stars() %>% 
  as_tibble() %>% 
  rename(value = 4) %>% 
  ggplot(aes(x = value)) +
  geom_density() +
  facet_wrap(~ band, scales = "free")

# predict table
bio_1 = c(-10, 15, 20)
bio_2 = c(5, 10, 15)
bio_3 = c(25, 50, 100)
bio_4 = c(500, 1000, 1500)
bio_5 = c(25, 30, 45)
bio_6 = c(-15, -5, 10)
bio_7 = c(20, 40, 50)
bio_8 = c(-10, 0 , 20)
bio_9 = c(-30, 10, 20)
bio_10 = c(10, 20, 30)
bio_11 = c(-20, 0 , 20)
bio_12 = c(300, 650, 1000)
bio_13 = c(100, 200, 300)
bio_14 = c(20, 30, 40)
bio_15 = c(30, 50 , 70)
bio_16 = c(100, 500, 1000)
bio_17 = c(30, 60, 120)
bio_18 = c(1000, 1500, 2000)
bio_19 = c(300, 400, 500)
pd = data.frame(cbind(bio_1, bio_2, bio_3, bio_4, bio_5, 
                      bio_6, bio_7, bio_8, bio_9, bio_10,
                      bio_11, bio_12, bio_13, bio_14, bio_15,
                      bio_16, bio_17, bio_18, bio_19))
pd

  bio_1 bio_2 bio_3 bio_4 bio_5 bio_6 bio_7 bio_8 bio_9 bio_10 bio_11 bio_12
1   -10     5    25   500    25   -15    20   -10   -30     10    -20    300
2    15    10    50  1000    30    -5    40     0    10     20      0    650
3    20    15   100  1500    45    10    50    20    20     30     20   1000
  bio_13 bio_14 bio_15 bio_16 bio_17 bio_18 bio_19
1    100     20     30    100     30   1000    300
2    200     30     50    500     60   1500    400
3    300     40     70   1000    120   2000    500

# predict with different models
predict(m_red, pd)

        1         2         3 
2.3261808 0.1361003 1.7570892

predict(m_full, pd)

        1         2         3 
131050.25 -32762.75 -98292.01

predict(m_null, pd)

        1         2         3 
0.5033215 0.5033215 0.5033215

# response curves
response(bc)

response(m_red)

response(m_full)

# predict suitability scores with raster object
pr <- predict(predictors_na, m_red)
pf <- predict(predictors_na, m_full)
pn <- predict(predictors_na, m_null)
pbc <- predict(predictors_na, bc)
plot(pr, main = "Reduced Model")

plot(pf, main = "Full Model")

plot(pn, main = "Null Model")

plot(pbc, main = "BioClim Model")

# plot BioClim model in western states
library(USAboundaries)
# plot mtn west
mtn_west <- us_counties(states = c("Colorado", "Nevada", "Montana",
                                   "New Mexico", "Wyoming", "Idaho",
                                   "Utah","Arizona"))
# save CRS for later
myCRS <- mtn_west %>% 
  st_crs()
# turn pbcsraster into pbc stars
pbc_stars <- pbc %>% st_as_stars()
# crop pbc_stars
pbc_mtn_west <- st_crop(pbc_stars, mtn_west)
# transform pronghorn CRS
pronghorn_sf_filtered <- pronghorn_sf_filtered %>% 
  st_transform(myCRS)
# crop
pronghorn_mtn_west <- st_crop(pronghorn_sf_filtered, mtn_west)
# plot + locations
ggplot() +
  geom_stars(data = pbc_mtn_west) +
  geom_sf(data = mtn_west, fill = NA,
          col = "black") +
  geom_sf(data = pronghorn_mtn_west, alpha = 0.4, color = "orange") +
  scale_fill_viridis_c(name = "Predicted Suitability",
                       option = "D") +
  labs(title = "Pronghorn Antelope",
       subtitle = "BioClim Model")

# tmap interactive
tm_shape(pbc_mtn_west) +
  tm_raster(alpha = 0.9, palette = "viridis", 
            title = "Predicted Suitability") +
  tm_shape(pronghorn_mtn_west) +
  tm_dots(size = 0.02, col = "orange", alpha = 0.3)

Model evaluation

For this section, most of the code from the original vignette could be run on your own, but I’ve included edited bits and pieces that may be useful.

p <- rnorm(50, mean=0.7, sd=0.3)
a <- rnorm(50, mean=0.4, sd=0.4)

# density plot
tibble(presence = p, absence = a) %>% 
  pivot_longer(1:2, names_to = "type", values_to ="value") %>% 
  ggplot(aes(x = value, fill = type)) +
  geom_density(alpha = 0.5)

# boxplot
tibble(presence = p, absence = a) %>% 
  pivot_longer(1:2, names_to = "type", values_to ="value") %>% 
  ggplot(aes(x = type, y = value, fill = type)) +
  geom_boxplot(width = 0.25)

# removing spatial sorting bias
set.seed(0)
backgr <- randomPoints(predictors_na, 500)
nr <- nrow(pronghorn_sf_filtered)
s <- sample(nr, 0.25 * nr)

# we drop our geometry from the sf objects so that it jives well with ssb()
pres_train <- pronghorn_sf_filtered[-s, ] %>% st_drop_geometry()
pres_test <- pronghorn_sf_filtered[s, ] %>% st_drop_geometry()
nr <- nrow(backgr)
set.seed(9)
s <- sample(nr, 0.25 * nr)
back_train <- backgr[-s, ]
back_test <- backgr[s, ]
# ssb
sb <- ssb(pres_test, back_test, pres_train)
sb[,1] / sb[,2]

          p 
0.005369935

# creating subsamples to remove SSB
i <- pwdSample(pres_test, back_test, pres_train, n=1, tr=0.1)
pres_test_pwd <- pres_test[!is.na(i[,1]), ]
back_test_pwd <- back_test[na.omit(as.vector(i)), ]
sb2 <- ssb(pres_test_pwd, back_test_pwd, pres_train)
sb2[1]/ sb2[2]

[1] 1.006879

# we reduced SSB - notice how AUC dropped
bc <- bioclim(predictors_na, pres_train)
evaluate(bc, p=pres_test, a=back_test, x=predictors_na)

class          : ModelEvaluation 
n presences    : 2026 
n absences     : 125 
AUC            : 0.9290286 
cor            : 0.2281553 
max TPR+TNR at : 0.00266681

Question 1

Describe the differences you observe in the mapped prediction between the full model, and the reduced model. Which mapped prediction is closer to the known distribution of the species? Provide a source and image (e.g., screenshot) of an independent source showing the range map of the species.

Question 2

Which model, the full or reduced model, performs better? Why? Include plots and model statistics to help with your explanation.

Extra information on AUC and Spatial Sorting Bias (SSB): In general AUC improves with larger geographic extents. As with most of these metrics, you should only compare them across models that have the same input data (meaning the same records for occurrences; same number of rows of unique presence and absence locations). Splitting data into training and testing makes the model susceptible to being influenced by the actual distances between the testing and training data in space. To remove this bias, you can subset the data into training and testing with pairwise distance sampling (pwdSample in dismo package).