Lab 09: SDMs - Model fitting, prediction, evaluation
Spatial Ecology FOR 870, Fall 2020
Introduction
Hello! You are reading an edited version of Lab 9 because you chose a species whose occurrences fall outside of South America. Here, we break the lab down to set you up to answer Questions 1 and 2. This involves collecting animal occurrences from GBIF, loading BioClim data as a RasterStack, and fitting and predicting from models, along with some notes on reproducibility in an R workflow. If you’ve already gathered your species occurrences and saved them, you can skip to the Environmental Data portion.
Lab 9 also matches up with the vignette chapter titled Model fitting, prediction, and evaluation. The chapters titled Modeling Methods and Geographic Null Models will be saved for Lab 10.
Self-Contained Projects Workflow in R
You’ll notice that I do not set a working directory in R. Instead, I set up my workflow with .Rproj files and structure my folders in a consistent way. With this workflow, I can email my collaborator(s) the .Rproj file and the contained folders (or they can download them from my GitHub), and the project will be structured exactly the way it is found on my computer. Collaborators won’t have to fuss around with their working directory and file paths. This workflow depends especially on the here package.
You can read more from Jenny Bryan here: https://www.tidyverse.org/blog/2017/12/workflow-vs-script/
& Martin Chan has a nice example drawn up: https://martinctc.github.io/blog/rstudio-projects-and-working-directories-a-beginner%27s-guide/
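As a quick illustration of the idea (the paths are just examples borrowed from later in this lab), every file path is built relative to the project root with here(), so the same code runs unchanged on anyone's machine:
library(here)
# here() returns the project root (the folder that holds the .Rproj file)
here()
# build paths relative to that root instead of hard-coding a working directory
here("data", "wc_10m.zip")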
Quarto vs. Rmarkdown
Instead of Rmarkdown, I made this file with Quarto. Quarto is pretty similar to Rmarkdown with minor differences, so if you’re a pro with Rmarkdown it might be worth checking out how Quarto works.
Packages
We will be using a number of different packages.
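The setup chunk is not echoed in this document; based on the functions used throughout the lab, the library calls look roughly like the sketch below (you will need these packages installed):
library(tidyverse)  # dplyr, ggplot2, forcats, stringr, tidyr
library(here)       # project-relative file paths
library(sf)         # vector data: st_as_sf(), st_crs(), st_crop()
library(sp)         # Spatial* classes and bbox()
library(raster)     # stack(), crop(), extract(), predict()
library(dismo)      # gbif(), bioclim(), randomPoints(), evaluate(), ssb(), pwdSample()
library(tmap)       # the World basemap and interactive maps
# note: raster masks dplyr::select(), which is why the code below calls dplyr::select() explicitly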
Species Occurrences (GBIF)
Let’s use one of my favorite animals, the pronghorn antelope (Antilocapra americana). I know enough about this species to expect its range to be in North America. Any occurrence that falls outside of North America is likely a zoo or museum record, or a misidentification.
# 9633 records found - this might take a while to load
pronghorn <- gbif("Antilocapra", "americana", nrecs = 300)
# count records to make sure we are doing okay
# take a glimpse of the data
# turn data.frame into tibble
pronghorn %>% count()
n
1 9670
Rows: 9,670
Columns: 187
$ acceptedNameUsage <chr> NA, NA, NA, NA, NA, NA, NA…
$ acceptedScientificName <chr> "Antilocapra americana (Or…
$ acceptedTaxonKey <int> 2440902, 2440902, 2440902,…
$ accessRights <chr> NA, NA, NA, NA, NA, NA, NA…
$ adm1 <chr> "Montana", "Colorado", "Ne…
$ adm2 <chr> NA, NA, NA, NA, NA, NA, NA…
$ associatedOccurrences <chr> NA, NA, NA, NA, NA, NA, NA…
$ associatedReferences <chr> NA, NA, NA, NA, NA, NA, NA…
$ associatedSequences <chr> NA, NA, NA, NA, NA, NA, NA…
$ associatedTaxa <chr> NA, NA, NA, NA, NA, NA, NA…
$ basisOfRecord <chr> "HUMAN_OBSERVATION", "HUMA…
$ bed <chr> NA, NA, NA, NA, NA, NA, NA…
$ behavior <chr> NA, NA, NA, NA, NA, NA, NA…
$ bibliographicCitation <chr> NA, NA, NA, NA, NA, NA, NA…
$ catalogNumber <chr> "104556531", "104443420", …
$ class <chr> "Mammalia", "Mammalia", "M…
$ classKey <int> 359, 359, 359, 359, 359, 3…
$ cloc <chr> "Montana, United States", …
$ collectionCode <chr> "Observations", "Observati…
$ collectionID <chr> NA, NA, NA, NA, NA, NA, NA…
$ collectionKey <chr> NA, NA, NA, NA, NA, NA, NA…
$ continent <chr> NA, NA, NA, NA, NA, NA, NA…
$ coordinatePrecision <dbl> NA, NA, NA, NA, NA, NA, NA…
$ coordinateUncertaintyInMeters <dbl> 61, 977, 61, 222, 48, 378,…
$ country <chr> "United States", "United S…
$ crawlId <int> 324, 324, 324, 324, 324, 3…
$ dataGeneralizations <chr> NA, NA, NA, NA, NA, NA, NA…
$ datasetID <chr> NA, NA, NA, NA, NA, NA, NA…
$ datasetKey <chr> "50c9509d-22c7-4a22-a47d-8…
$ datasetName <chr> "iNaturalist research-grad…
$ dateIdentified <chr> "2022-01-08T03:57:50", "20…
$ day <int> 7, 1, 5, 9, 10, 9, 17, 8, …
$ depth <dbl> NA, NA, NA, NA, NA, NA, NA…
$ depthAccuracy <dbl> NA, NA, NA, NA, NA, NA, NA…
$ disposition <chr> NA, NA, NA, NA, NA, NA, NA…
$ dynamicProperties <chr> NA, NA, NA, NA, NA, NA, NA…
$ earliestAgeOrLowestStage <chr> NA, NA, NA, NA, NA, NA, NA…
$ earliestEonOrLowestEonothem <chr> NA, NA, NA, NA, NA, NA, NA…
$ earliestEpochOrLowestSeries <chr> NA, NA, NA, NA, NA, NA, NA…
$ earliestEraOrLowestErathem <chr> NA, NA, NA, NA, NA, NA, NA…
$ earliestPeriodOrLowestSystem <chr> NA, NA, NA, NA, NA, NA, NA…
$ elevation <dbl> NA, NA, NA, NA, NA, NA, NA…
$ elevationAccuracy <dbl> NA, NA, NA, NA, NA, NA, NA…
$ endDayOfYear <chr> NA, NA, NA, NA, NA, NA, NA…
$ establishmentMeans <chr> NA, NA, NA, NA, NA, NA, NA…
$ eventDate <chr> "2022-01-07T09:56:30", "20…
$ eventID <chr> NA, NA, NA, NA, NA, NA, NA…
$ eventRemarks <chr> NA, NA, NA, NA, NA, NA, NA…
$ eventTime <chr> "09:56:30-07:00", "11:03:0…
$ family <chr> "Antilocapridae", "Antiloc…
$ familyKey <int> 5715, 5715, 5715, 5715, 57…
$ fieldNotes <chr> NA, NA, NA, NA, NA, NA, NA…
$ fieldNumber <chr> NA, NA, NA, NA, NA, NA, NA…
$ footprintSRS <chr> NA, NA, NA, NA, NA, NA, NA…
$ footprintWKT <chr> NA, NA, NA, NA, NA, NA, NA…
$ formation <chr> NA, NA, NA, NA, NA, NA, NA…
$ fullCountry <chr> "United States of America"…
$ gbifID <chr> "3455555450", "3455633105"…
$ genericName <chr> "Antilocapra", "Antilocapr…
$ genus <chr> "Antilocapra", "Antilocapr…
$ genusKey <int> 2440901, 2440901, 2440901,…
$ geodeticDatum <chr> "WGS84", "WGS84", "WGS84",…
$ geologicalContextID <chr> NA, NA, NA, NA, NA, NA, NA…
$ georeferencedBy <chr> NA, NA, NA, NA, NA, NA, NA…
$ georeferencedDate <chr> NA, NA, NA, NA, NA, NA, NA…
$ georeferenceProtocol <chr> NA, NA, NA, NA, NA, NA, NA…
$ georeferenceRemarks <chr> NA, NA, NA, NA, NA, NA, NA…
$ georeferenceSources <chr> NA, NA, NA, NA, NA, NA, NA…
$ georeferenceVerificationStatus <chr> NA, NA, NA, NA, NA, NA, NA…
$ habitat <chr> NA, NA, NA, NA, NA, NA, NA…
$ higherClassification <chr> NA, NA, NA, NA, NA, NA, NA…
$ higherGeography <chr> NA, NA, NA, NA, NA, NA, NA…
$ higherGeographyID <chr> NA, NA, NA, NA, NA, NA, NA…
$ highestBiostratigraphicZone <chr> NA, NA, NA, NA, NA, NA, NA…
$ hostingOrganizationKey <chr> "28eb1a3f-1c15-4a95-931a-4…
$ `http://unknown.org/canonicalName` <chr> NA, NA, NA, NA, NA, NA, NA…
$ `http://unknown.org/captive` <chr> "wild", "wild", "wild", "w…
$ `http://unknown.org/combinationAuthors` <chr> NA, NA, NA, NA, NA, NA, NA…
$ `http://unknown.org/combinationYear` <chr> NA, NA, NA, NA, NA, NA, NA…
$ `http://unknown.org/language` <chr> NA, NA, NA, NA, NA, NA, NA…
$ `http://unknown.org/nick` <chr> "wkonwent", "sichtopher_ch…
$ `http://unknown.org/recordId` <chr> NA, NA, NA, NA, NA, NA, NA…
$ `http://unknown.org/recordID` <chr> NA, NA, NA, NA, NA, NA, NA…
$ `http://unknown.org/rights` <chr> NA, NA, NA, NA, NA, NA, NA…
$ `http://unknown.org/rightsHolder` <chr> NA, NA, NA, NA, NA, NA, NA…
$ `http://unknown.org/verbatimScientificName` <chr> NA, NA, NA, NA, NA, NA, NA…
$ identificationID <chr> "230326964", "229979364", …
$ identificationQualifier <chr> NA, NA, NA, NA, NA, NA, NA…
$ identificationRemarks <chr> NA, NA, NA, NA, NA, NA, NA…
$ identificationVerificationStatus <chr> NA, NA, NA, NA, NA, NA, NA…
$ identifiedBy <chr> "wkonwent", "sichtopher_ch…
$ identifier <chr> "104556531", "104443420", …
$ individualCount <int> NA, NA, NA, NA, NA, NA, NA…
$ informationWithheld <chr> NA, NA, NA, NA, NA, NA, "C…
$ infraspecificEpithet <chr> NA, NA, NA, NA, NA, NA, NA…
$ installationKey <chr> "997448a8-f762-11e1-a439-0…
$ institutionCode <chr> "iNaturalist", "iNaturalis…
$ institutionID <chr> NA, NA, NA, NA, NA, NA, NA…
$ institutionKey <chr> NA, NA, NA, NA, NA, NA, NA…
$ isInCluster <lgl> FALSE, FALSE, FALSE, FALSE…
$ island <chr> NA, NA, NA, NA, NA, NA, NA…
$ ISO2 <chr> "US", "US", "US", "US", "U…
$ iucnRedListCategory <chr> "LC", "LC", "LC", "LC", "L…
$ key <chr> "3455555450", "3455633105"…
$ kingdom <chr> "Animalia", "Animalia", "A…
$ kingdomKey <int> 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ language <chr> NA, NA, NA, NA, NA, NA, NA…
$ lastCrawled <chr> "2022-11-11T11:44:56.941+0…
$ lastInterpreted <chr> "2022-11-11T17:34:22.541+0…
$ lastParsed <chr> "2022-11-11T17:34:22.541+0…
$ lat <dbl> 45.15296, 37.83146, 36.571…
$ latestAgeOrHighestStage <chr> NA, NA, NA, NA, NA, NA, NA…
$ latestEonOrHighestEonothem <chr> NA, NA, NA, NA, NA, NA, NA…
$ latestEpochOrHighestSeries <chr> NA, NA, NA, NA, NA, NA, NA…
$ latestEraOrHighestErathem <chr> NA, NA, NA, NA, NA, NA, NA…
$ latestPeriodOrHighestSystem <chr> NA, NA, NA, NA, NA, NA, NA…
$ license <chr> "http://creativecommons.or…
$ lifeStage <chr> NA, NA, NA, NA, NA, NA, "A…
$ lithostratigraphicTerms <chr> NA, NA, NA, NA, NA, NA, NA…
$ locality <chr> NA, NA, NA, NA, NA, NA, NA…
$ locationAccordingTo <chr> NA, NA, NA, NA, NA, NA, NA…
$ locationID <chr> NA, NA, NA, NA, NA, NA, NA…
$ locationRemarks <chr> NA, NA, NA, NA, NA, NA, NA…
$ lon <dbl> -110.8199, -105.9221, -103…
$ lowestBiostratigraphicZone <chr> NA, NA, NA, NA, NA, NA, NA…
$ materialSampleID <chr> NA, NA, NA, NA, NA, NA, NA…
$ member <chr> NA, NA, NA, NA, NA, NA, NA…
$ modified <chr> "2022-04-13T15:35:45.000+0…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ municipality <chr> NA, NA, NA, NA, NA, NA, NA…
$ nameAccordingTo <chr> NA, NA, NA, NA, NA, NA, NA…
$ namePublishedIn <chr> NA, NA, NA, NA, NA, NA, NA…
$ namePublishedInYear <chr> NA, NA, NA, NA, NA, NA, NA…
$ nomenclaturalCode <chr> NA, NA, NA, NA, NA, NA, NA…
$ occurrenceID <chr> "https://www.inaturalist.o…
$ occurrenceRemarks <chr> NA, NA, NA, NA, "6 animals…
$ occurrenceStatus <chr> "PRESENT", "PRESENT", "PRE…
$ order <chr> "Artiodactyla", "Artiodact…
$ orderKey <int> 731, 731, 731, 731, 731, 7…
$ organismID <chr> NA, NA, NA, NA, NA, NA, NA…
$ organismName <chr> NA, NA, NA, NA, NA, NA, NA…
$ organismQuantity <dbl> NA, NA, NA, NA, NA, NA, NA…
$ organismQuantityType <chr> NA, NA, NA, NA, NA, NA, NA…
$ otherCatalogNumbers <chr> NA, NA, NA, NA, NA, NA, NA…
$ ownerInstitutionCode <chr> NA, NA, NA, NA, NA, NA, NA…
$ parentNameUsage <chr> NA, NA, NA, NA, NA, NA, NA…
$ phylum <chr> "Chordata", "Chordata", "C…
$ phylumKey <int> 44, 44, 44, 44, 44, 44, 44…
$ preparations <chr> NA, NA, NA, NA, NA, NA, NA…
$ previousIdentifications <chr> NA, NA, NA, NA, NA, NA, NA…
$ programmeAcronym <chr> NA, NA, NA, NA, NA, NA, NA…
$ projectId <chr> NA, NA, NA, NA, NA, NA, NA…
$ protocol <chr> "DWC_ARCHIVE", "DWC_ARCHIV…
$ publishingCountry <chr> "PA", "US", "US", "US", "U…
$ publishingOrgKey <chr> "28eb1a3f-1c15-4a95-931a-4…
$ recordedBy <chr> "wkonwent", "sichtopher_ch…
$ recordNumber <chr> NA, NA, NA, NA, NA, NA, NA…
$ references <chr> "https://www.inaturalist.o…
$ rights <chr> NA, NA, NA, NA, NA, NA, NA…
$ rightsHolder <chr> "wkonwent", "sichtopher_ch…
$ samplingProtocol <chr> NA, NA, NA, NA, NA, NA, NA…
$ scientificName <chr> "Antilocapra americana (Or…
$ scientificNameID <chr> NA, NA, NA, NA, NA, NA, NA…
$ sex <chr> NA, NA, NA, NA, NA, NA, "M…
$ species <chr> "Antilocapra americana", "…
$ speciesKey <int> 2440902, 2440902, 2440902,…
$ specificEpithet <chr> "americana", "americana", …
$ startDayOfYear <chr> NA, NA, NA, NA, NA, NA, NA…
$ taxonConceptID <chr> NA, NA, NA, NA, NA, NA, NA…
$ taxonID <chr> "42429", "42429", "42429",…
$ taxonKey <int> 2440902, 2440902, 2440902,…
$ taxonomicStatus <chr> "ACCEPTED", "ACCEPTED", "A…
$ taxonRank <chr> "SPECIES", "SPECIES", "SPE…
$ taxonRemarks <chr> NA, NA, NA, NA, NA, NA, NA…
$ type <chr> NA, NA, NA, NA, NA, NA, NA…
$ typeStatus <chr> NA, NA, NA, NA, NA, NA, NA…
$ typifiedName <chr> NA, NA, NA, NA, NA, NA, NA…
$ verbatimCoordinateSystem <chr> NA, NA, NA, NA, NA, NA, NA…
$ verbatimElevation <chr> NA, NA, NA, NA, NA, NA, NA…
$ verbatimEventDate <chr> "Fri Jan 07 2022 09:56:30 …
$ verbatimIdentification <chr> NA, NA, NA, NA, NA, NA, NA…
$ verbatimLabel <chr> NA, NA, NA, NA, NA, NA, NA…
$ verbatimLocality <chr> "Custer Gallatin National …
$ verbatimTaxonRank <chr> NA, NA, NA, NA, NA, NA, NA…
$ vernacularName <chr> NA, NA, NA, NA, NA, NA, NA…
$ year <int> 2022, 2022, 2022, 2022, 20…
$ downloadDate <date> 2022-11-11, 2022-11-11, 2…
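Since the gbif() call above pulls down close to 10,000 records, it is worth caching the result locally so you are not re-querying GBIF every time you render the document. A minimal sketch (the folder and file name are my own choices):
# cache the raw GBIF download and reload it in later sessions
dir.create(here("data"), showWarnings = FALSE)
saveRDS(pronghorn, here("data", "pronghorn_gbif_raw.Rds"))
# pronghorn <- readRDS(here("data", "pronghorn_gbif_raw.Rds"))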
# GBIF provides us with so many columns!
# let's use select() to keep only the columns we care about:
# longitude, latitude, and locality
# we can then mutate a new column and count the geo-referenced records
pronghorn %>%
dplyr::select(lon,lat,locality) %>%
mutate(georef_check = if_else(!is.na(lon) & !is.na(lat),"geo-referenced","not geo-referenced")) %>%
count(georef_check)
# A tibble: 2 × 2
georef_check n
<chr> <int>
1 geo-referenced 8117
2 not geo-referenced 1553
# let's examine the top 25 localities of the records that are not geo-referenced
pronghorn %>%
filter(is.na(lon), is.na(lat)) %>%
dplyr::select(locality) %>%
count(locality, sort = TRUE) %>%
mutate(locality = fct_reorder(locality, n)) %>%
slice_head(n = 25) %>%
ggplot(aes(x = locality, y = n, label = n)) +
geom_col(size = 0.7) +
geom_label() +
coord_flip() +
scale_y_continuous(expand = c(0,0)) +
labs(title = "Top 25 non-geo-referenced localities for pronghorn",
subtitle = "GBIF")
# let's plot our pronghorn occurrences
# we load the World basemap from the `tmap` package
data(World)
World %>% glimpse()
Rows: 177
Columns: 16
$ iso_a3 <fct> AFG, AGO, ALB, ARE, ARG, ARM, ATA, ATF, AUS, AUT, AZE, BD…
$ name <fct> Afghanistan, Angola, Albania, United Arab Emirates, Argen…
$ sovereignt <fct> Afghanistan, Angola, Albania, United Arab Emirates, Argen…
$ continent <fct> Asia, Africa, Europe, Asia, South America, Asia, Antarcti…
$ area [km^2] 652860.000 [km^2], 1246700.000 [km^2], 27400.000 [km^2],…
$ pop_est <dbl> 28400000, 12799293, 3639453, 4798491, 40913584, 2967004, …
$ pop_est_dens <dbl> 4.350090e+01, 1.026654e+01, 1.328268e+02, 6.734519e+01, 1…
$ economy <fct> 7. Least developed region, 7. Least developed region, 6. …
$ income_grp <fct> 5. Low income, 3. Upper middle income, 4. Lower middle in…
$ gdp_cap_est <dbl> 784.1549, 8617.6635, 5992.6588, 38407.9078, 14027.1261, 6…
$ life_exp <dbl> 59.668, NA, 77.347, NA, 75.927, 74.446, NA, NA, 82.052, 8…
$ well_being <dbl> 3.800000, NA, 5.500000, NA, 6.500000, 4.300000, NA, NA, 7…
$ footprint <dbl> 0.79000, NA, 2.21000, NA, 3.14000, 2.23000, NA, NA, 9.310…
$ inequality <dbl> 0.42655744, NA, 0.16513372, NA, 0.16423830, 0.21664810, N…
$ HPI <dbl> 20.22535, NA, 36.76687, NA, 35.19024, 25.66642, NA, NA, 2…
$ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((61.21082 35..., MULTIPOLYGON…
# create sf object for North America (you create for your own locality)
northamerica_sf <- World %>%
filter(continent=="North America" & name!="Greenland")
# check CRS # EPSG: 4326 & WGS 84
northamerica_sf %>% st_crs()
Coordinate Reference System:
User input: EPSG:4326
wkt:
GEOGCRS["WGS 84",
DATUM["World Geodetic System 1984",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["geodetic latitude (Lat)",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["geodetic longitude (Lon)",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
USAGE[
SCOPE["unknown"],
AREA["World"],
BBOX[-90,-180,90,180]],
ID["EPSG",4326]]
# let's also make an sp object just in case for later
northamerica_sp <- as_Spatial(northamerica_sf)
# grab coordinates and transform to sf object!
pronghorn_sf <- pronghorn %>%
dplyr::select(lon, lat) %>%
filter(!is.na(lon) & !is.na(lat)) %>%
st_as_sf(coords = c("lon", "lat"), remove = FALSE) %>%
st_set_crs(4326)
# check CRS
pronghorn_sf %>% st_crs()
Coordinate Reference System:
User input: EPSG:4326
wkt:
GEOGCRS["WGS 84",
DATUM["World Geodetic System 1984",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["geodetic latitude (Lat)",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["geodetic longitude (Lon)",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
USAGE[
SCOPE["unknown"],
AREA["World"],
BBOX[-90,-180,90,180]],
ID["EPSG",4326]]
# plot all locations across North America (QA/QC): look for outliers
# use ggplot for a static map and tmap for an interactive map
# ggplot
ggplot() +
geom_sf(data = northamerica_sf, alpha = 0.2) +
geom_sf(data = pronghorn_sf, aes(color = lon), alpha = 0.7) +
scale_color_viridis_c(option = "C", name = "Longitude") +
scale_fill_discrete(name = "Country") +
labs(title = "Pronghorn georeferenced occurrences",
subtitle = "GBIF")
# tmap
#tmap_mode("view")
#tm_shape(northamerica_sf) +
# tm_fill(alpha = 0.5) +
# tm_shape(pronghorn_sf) +
# tm_dots(col = "lon", border.col = NA,
# alpha = 0.5, palette = "viridis",
# midpoint = NA)
# there are 2 locations near the equator in the Atlantic Ocean
# skeptical that there are pronghorn near the Great Lakes region
# we clean the pronghorn locations by filtering out locations based on longitude
# getting rid of anything east of 85° W should be sufficient
pronghorn_sf_filtered <- pronghorn_sf %>% filter(lon < -85)
# plot again
# ggplot
ggplot() +
geom_sf(data = northamerica_sf, alpha = 0.2) +
geom_sf(data = pronghorn_sf_filtered, aes(color = lon),
alpha = 0.7) +
scale_color_viridis_c(option = "C", name = "Longitude") +
scale_fill_discrete(name = "Country") +
labs(title = "Pronghorn georeferenced occurrences",
subtitle = "GBIF")
# tmap
tmap_mode("view")
tm_shape(northamerica_sf) +
tm_fill(alpha = 0.5) +
tm_shape(pronghorn_sf_filtered) +
tm_dots(col = "lon", border.col = NA,
alpha = 0.5, palette = "viridis",
midpoint = NA)
This looks pretty good for the pronghorn’s range, so we are satisfied with the GBIF occurrence records.
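If you want to take advantage of the note in the Introduction and skip straight to the Environmental Data section in a later session, save the cleaned occurrences now. A minimal sketch (the file name is my own choice):
# save the cleaned, geo-referenced occurrences for later use
saveRDS(pronghorn_sf_filtered, here("data", "pronghorn_sf_filtered.Rds"))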
Environmental Data
Here we set up our RasterStack of BioClim variables. We download the global environmental data and then crop it to North America, since that is the extent we are targeting.
A reminder:
BIO1 = Annual Mean Temperature
BIO2 = Mean Diurnal Range (Mean of monthly (max temp - min temp))
BIO3 = Isothermality (BIO2/BIO7) (×100)
BIO4 = Temperature Seasonality (standard deviation ×100)
BIO5 = Max Temperature of Warmest Month
BIO6 = Min Temperature of Coldest Month
BIO7 = Temperature Annual Range (BIO5-BIO6)
BIO8 = Mean Temperature of Wettest Quarter
BIO9 = Mean Temperature of Driest Quarter
BIO10 = Mean Temperature of Warmest Quarter
BIO11 = Mean Temperature of Coldest Quarter
BIO12 = Annual Precipitation
BIO13 = Precipitation of Wettest Month
BIO14 = Precipitation of Driest Month
BIO15 = Precipitation Seasonality (Coefficient of Variation)
BIO16 = Precipitation of Wettest Quarter
BIO17 = Precipitation of Driest Quarter
BIO18 = Precipitation of Warmest Quarter
BIO19 = Precipitation of Coldest Quarter
Raster data
here()
[1] "/Users/nournnan/Library/CloudStorage/OneDrive-MichiganStateUniversity/Grad/2020 Fall/FOR 870/Lab/script"
# create a directory (folder) called "data"
# if it already exists, this command won't overwrite it
dir.create(here("data"), recursive = TRUE, showWarnings = FALSE)
# Download WorldClim 2 data and unzip it
# Refer to the URL for your choice of data:
# Download 10 minute res worldclim V 2.1
wc_url_10m <- "https://geodata.ucdavis.edu/climate/worldclim/2_1/base/wc2.1_10m_bio.zip"
# Download 5 minute res worldclim V 2.1
wc_url_5m <- "https://geodata.ucdavis.edu/climate/worldclim/2_1/base/wc2.1_5m_bio.zip"
# Download 2.5 minute res worldclim V 2.1
wc_url_2.5m <- "https://geodata.ucdavis.edu/climate/worldclim/2_1/base/wc2.1_2.5m_bio.zip"
# Download 30 second res worldclim V 2.1
wc_url_30s <- "https://geodata.ucdavis.edu/climate/worldclim/2_1/base/wc2.1_30s_bio.zip"
# example with the 10 arc-minute resolution data, saved in the "data" folder
# download file
download.file(url = wc_url_10m, destfile = here("data", "wc_10m.zip"),
mode = "wb")
# unzip file
unzip(zipfile = here("data", "wc_10m.zip"), exdir = here("data"))
# load the data
bioclim_files <- list.files(here("data"), pattern = "wc2.1_10m_bio_",
full.names = TRUE)
# create a RasterStack of predictor variables
predictors <- stack(bioclim_files)
# plot to check
predictors %>% plot()
# we don't need the entire globe,
# so let's crop the rasters down to a North America extent
# (here you can create your own extent)
# create bounding box
bbox_sp <- bbox(northamerica_sp)
# now crop
predictors_na <- crop(predictors, bbox_sp)
# let's also change names for downstream modelling
names(predictors_na)
[1] "wc2.1_10m_bio_1" "wc2.1_10m_bio_10" "wc2.1_10m_bio_11" "wc2.1_10m_bio_12"
[5] "wc2.1_10m_bio_13" "wc2.1_10m_bio_14" "wc2.1_10m_bio_15" "wc2.1_10m_bio_16"
[9] "wc2.1_10m_bio_17" "wc2.1_10m_bio_18" "wc2.1_10m_bio_19" "wc2.1_10m_bio_2"
[13] "wc2.1_10m_bio_3" "wc2.1_10m_bio_4" "wc2.1_10m_bio_5" "wc2.1_10m_bio_6"
[17] "wc2.1_10m_bio_7" "wc2.1_10m_bio_8" "wc2.1_10m_bio_9"
bioclim_names <- names(predictors_na)
bioclim_names <- str_remove(bioclim_names, "wc2.1_10m_")
names(predictors_na) <- bioclim_names
# look at names information to see if they changed
predictors_na
class : RasterBrick
dimensions : 456, 715, 326040, 19 (nrow, ncol, ncell, nlayers)
resolution : 0.1666667, 0.1666667 (x, y)
extent : -171.8333, -52.66667, 7.166667, 83.16667 (xmin, xmax, ymin, ymax)
crs : +proj=longlat +datum=WGS84 +no_defs
source : memory
names : bio_1, bio_10, bio_11, bio_12, bio_13, bio_14, bio_15, bio_16, bio_17, bio_18, bio_19, bio_2, bio_3, bio_4, bio_5, ...
min values : -26.755625, -9.820499, -42.456039, 11.000000, 2.000000, 0.000000, 5.517085, 6.000000, 0.000000, 2.000000, 0.000000, 1.000000, 9.131122, 0.000000, -6.246750, ...
max values : 29.39055, 35.13550, 28.16318, 11191.00000, 2328.00000, 200.00000, 138.00240, 5282.00000, 700.00000, 5282.00000, 1584.00000, 21.14754, 100.00000, 1739.37622, 44.68700, ...
# check bioclim data
plot(predictors_na)
Extracting Values From Rasters
Here you can choose to follow the original vignette, or follow along below. There are some subtle nuances that may differ from the vignette since we’ve downloaded our own BioClim variables to make our RasterStack (e.g., the variable names differ, and we do not have a biome raster). I would read this webpage and the dismo vignette in parallel to learn what is going on in the code. Anything in italicized font below was copied and pasted from the original vignette.
We now have a set of predictor variables (rasters) and occurrence points. The next step is to extract the values of the predictors at the locations of the points. (This step can be skipped for the modeling methods that are implemented in the dismo package). This is a very straightforward thing to do using the ‘extract’ function from the raster package. In the example below we use that function first for the Pronghorn occurrence points, then for 8,000 random background points. We combine these into a single data.frame in which the first column (variable ‘pb’) indicates whether this is a presence or a background point.
presvals <- raster::extract(predictors_na, pronghorn_sf_filtered)
# setting random seed to always create the same
# random set of points for this example
set.seed(0)
# not sure what happens if you increase the number of random points -
# you might want to try it and find out
# I changed mine to 8000 because I have around 8000 pronghorn presences
backgr <- randomPoints(predictors_na, 8000)
absvals <- extract(predictors_na, backgr)
pb <- c(rep(1, nrow(presvals)), rep(0, nrow(absvals)))
# create data tibble for sdm
sdmdata <- data.frame(cbind(pb, rbind(presvals, absvals)))
sdmdata %>% count(pb, sort = TRUE)
pb n
1 1 8107
2 0 8000
Rows: 16,107
Columns: 20
$ pb <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ bio_1 <dbl> 2.886625, 5.721406, 11.547552, 2.886625, 15.027687, 16.482063, …
$ bio_10 <dbl> 13.51179, 16.56792, 21.97633, 13.51179, 23.00483, 24.75971, 18.…
$ bio_11 <dbl> -7.23016644, -6.40008354, 1.56791675, -7.23016644, 7.99366665, …
$ bio_12 <dbl> 567, 229, 405, 567, 442, 478, 342, 443, 433, 382, 321, 237, 497…
$ bio_13 <dbl> 66, 46, 74, 66, 82, 111, 60, 73, 49, 62, 45, 30, 86, 70, 68, 68…
$ bio_14 <dbl> 37, 6, 7, 37, 0, 5, 14, 9, 18, 10, 10, 10, 11, 7, 13, 9, 0, 9, …
$ bio_15 <dbl> 19.44465, 61.80762, 69.87884, 19.44465, 87.80043, 80.83881, 48.…
$ bio_16 <dbl> 177, 111, 197, 177, 238, 262, 143, 181, 138, 170, 130, 76, 245,…
$ bio_17 <dbl> 124, 20, 23, 124, 5, 29, 49, 43, 66, 31, 34, 48, 42, 25, 49, 38…
$ bio_18 <dbl> 154, 101, 197, 154, 12, 221, 128, 146, 80, 147, 118, 52, 209, 1…
$ bio_19 <dbl> 133, 20, 23, 133, 226, 101, 51, 111, 122, 31, 34, 48, 42, 89, 4…
$ bio_2 <dbl> 14.37233, 19.15202, 16.18177, 14.37233, 18.73700, 16.93354, 13.…
$ bio_3 <dbl> 38.30605, 42.32280, 42.27240, 38.30605, 52.14717, 49.93709, 31.…
$ bio_4 <dbl> 851.3237, 929.4572, 829.3596, 851.3237, 627.4905, 668.8680, 108…
$ bio_5 <dbl> 23.98275, 27.15850, 31.15575, 23.98275, 35.56200, 33.77300, 27.…
$ bio_6 <dbl> -13.53700, -18.09375, -7.12400, -13.53700, -0.36900, -0.13675, …
$ bio_7 <dbl> 37.51975, 45.25225, 38.27975, 37.51975, 35.93100, 33.90975, 43.…
$ bio_8 <dbl> 10.6803331, 15.9158745, 21.9763336, 10.6803331, 8.7890835, 24.1…
$ bio_9 <dbl> -2.778833, -6.400084, 1.567917, -2.778833, 22.964375, 19.471834…
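The Model Fitting section below reads sdmdata and presvals back in from the "data" folder, so cache them now with the same file names:
# cache the sdm data and presence values used in the next section
saveRDS(sdmdata, here("data", "sdm.Rds"))
saveRDS(presvals, here("data", "pvals.Rds"))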
Model fitting, prediction, and evaluation
Model Fitting
# load the cached data if it is not already in the environment
sdmdata <- readRDS(here("data", "sdm.Rds"))
presvals <- readRDS(here("data", "pvals.Rds"))
# fit reduced model with bioclims 1, 5, 6, 12, and 17
m_red <- glm(pb ~ bio_1 + bio_5 + bio_6 + bio_12 + bio_17, data = sdmdata)
m_red %>% class()
[1] "glm" "lm"
Call:
glm(formula = pb ~ bio_1 + bio_5 + bio_6 + bio_12 + bio_17, data = sdmdata)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.1762 -0.2870 0.0590 0.3253 3.5512
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.629e-01 2.868e-02 -9.166 < 2e-16 ***
bio_1 -1.268e-01 2.622e-03 -48.381 < 2e-16 ***
bio_5 9.480e-02 1.620e-03 58.530 < 2e-16 ***
bio_6 6.302e-02 1.312e-03 48.038 < 2e-16 ***
bio_12 -3.840e-04 1.424e-05 -26.975 < 2e-16 ***
bio_17 3.735e-04 7.548e-05 4.948 7.58e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.1488072)
Null deviance: 4026.6 on 16106 degrees of freedom
Residual deviance: 2395.9 on 16101 degrees of freedom
AIC: 15032
Number of Fisher Scoring iterations: 2
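The fits for the full and null models are not echoed above, but they are needed for the predictions later on; a sketch consistent with the Call and class output that follows:
# full model: regress pb on all 19 bioclim variables in sdmdata
m_full <- glm(pb ~ ., data = sdmdata)
m_full %>% class()
m_full %>% summary()
# null model: intercept only
m_null <- glm(pb ~ 1, data = sdmdata)
m_null %>% class()
m_null %>% summary()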
[1] "glm" "lm"
Call:
glm(formula = pb ~ ., data = sdmdata)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.13275 -0.14795 0.06057 0.20386 0.90592
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.798e-01 4.692e-02 8.095 6.11e-16 ***
bio_1 -1.087e-01 7.904e-03 -13.747 < 2e-16 ***
bio_10 3.993e-01 1.873e-02 21.314 < 2e-16 ***
bio_11 -4.366e-01 1.576e-02 -27.702 < 2e-16 ***
bio_12 3.252e-04 6.508e-05 4.996 5.91e-07 ***
bio_13 3.798e-03 3.540e-04 10.728 < 2e-16 ***
bio_14 3.189e-03 1.106e-03 2.883 0.00394 **
bio_15 8.508e-04 3.001e-04 2.835 0.00458 **
bio_16 -2.248e-03 1.719e-04 -13.076 < 2e-16 ***
bio_17 -2.109e-03 3.919e-04 -5.383 7.42e-08 ***
bio_18 -1.055e-04 6.869e-05 -1.537 0.12441
bio_19 -5.844e-05 6.024e-05 -0.970 0.33201
bio_2 1.297e-01 4.626e-03 28.030 < 2e-16 ***
bio_3 -8.327e-03 7.072e-04 -11.775 < 2e-16 ***
bio_4 -9.544e-03 3.753e-04 -25.427 < 2e-16 ***
bio_5 6.552e+03 2.315e+03 2.830 0.00466 **
bio_6 -6.552e+03 2.315e+03 -2.830 0.00466 **
bio_7 -6.552e+03 2.315e+03 -2.830 0.00466 **
bio_8 3.605e-03 7.489e-04 4.813 1.50e-06 ***
bio_9 -4.435e-03 5.479e-04 -8.094 6.18e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.1067704)
Null deviance: 4026.6 on 16106 degrees of freedom
Residual deviance: 1717.6 on 16087 degrees of freedom
AIC: 9699.1
Number of Fisher Scoring iterations: 2
[1] "glm" "lm"
Call:
glm(formula = pb ~ 1, data = sdmdata)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.5033 -0.5033 0.4967 0.4967 0.4967
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.50332 0.00394 127.8 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.2500045)
Null deviance: 4026.6 on 16106 degrees of freedom
Residual deviance: 4026.6 on 16106 degrees of freedom
AIC: 23384
Number of Fisher Scoring iterations: 2
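For Question 2 you will want model statistics to compare these fits. One option (a sketch, not the only approach) is to compare AIC directly, and to check why the bio_5, bio_6, and bio_7 coefficients in the full model blow up: BIO7 is defined as BIO5 minus BIO6, so those three predictors are essentially perfectly collinear.
# compare the three models by AIC (lower is better);
# these match the AIC values printed in the summaries above
AIC(m_null, m_red, m_full)
# bio_7 should equal bio_5 - bio_6 up to rounding in the rasters
sdmdata %>%
  mutate(bio7_check = bio_5 - bio_6 - bio_7) %>%
  summarise(max_abs_diff = max(abs(bio7_check), na.rm = TRUE))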
# dismo model
# Models that are implemented in dismo do not use a formula
# (and most models only take presence points).
# Bioclim is an example. It only uses presence data,
# so we use ‘presvals’ instead of ‘sdmdata’.
bc <- bioclim(presvals[,c('bio_1', 'bio_5', 'bio_6', 'bio_12', 'bio_17')])
bc %>% class()
[1] "Bioclim"
attr(,"package")
[1] "dismo"
bc
class : Bioclim
variables: bio_1 bio_5 bio_6 bio_12 bio_17
presence points: 8107
bio_1 bio_5 bio_6 bio_12 bio_17
1 2.886625 23.98275 -13.53700 567 124
2 5.721406 27.15850 -18.09375 229 20
3 11.547552 31.15575 -7.12400 405 23
4 2.886625 23.98275 -13.53700 567 124
5 15.027687 35.56200 -0.36900 442 5
6 16.482063 33.77300 -0.13675 478 29
7 5.036927 27.11175 -16.39175 342 49
8 13.651906 33.77525 -4.11800 443 43
9 8.528791 29.26100 -5.25625 433 66
10 8.406760 29.44675 -9.36450 382 31
(... ... ...)
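The comments in the next chunk refer to a “bc pairs plot” that is not echoed here; dismo provides a pairs() method for its model objects (it appears in the vignette), so you can reproduce it with:
# pairwise plots of the presence values used by the Bioclim model
pairs(bc)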
Model Prediction
Different modeling methods return different types of ‘model’ objects (typically they have the same name as the modeling method used). All of these ‘model’ objects, irrespective of their exact class, can be used with the predict function to make predictions for any combination of values of the independent variables. This is illustrated in the example below, where we make predictions with the glm model objects (‘m_red’, ‘m_full’, ‘m_null’) and the bioclim model ‘bc’ for three records with values for all 19 bioclim variables (which cover the variables used above to create the model objects).
library(stars)
# I chose these values from the bc pairs plot,
# looking at the distribution of each bio variable
# use all 19 predictors since we want to compare against the full model
predictors_na %>%
st_as_stars() %>%
as_tibble() %>%
rename(value = 4) %>%
ggplot(aes(x = value)) +
geom_density() +
facet_wrap(~ band, scales = "free")
# predict table
bio_1 = c(-10, 15, 20)
bio_2 = c(5, 10, 15)
bio_3 = c(25, 50, 100)
bio_4 = c(500, 1000, 1500)
bio_5 = c(25, 30, 45)
bio_6 = c(-15, -5, 10)
bio_7 = c(20, 40, 50)
bio_8 = c(-10, 0 , 20)
bio_9 = c(-30, 10, 20)
bio_10 = c(10, 20, 30)
bio_11 = c(-20, 0 , 20)
bio_12 = c(300, 650, 1000)
bio_13 = c(100, 200, 300)
bio_14 = c(20, 30, 40)
bio_15 = c(30, 50 , 70)
bio_16 = c(100, 500, 1000)
bio_17 = c(30, 60, 120)
bio_18 = c(1000, 1500, 2000)
bio_19 = c(300, 400, 500)
pd = data.frame(cbind(bio_1, bio_2, bio_3, bio_4, bio_5,
bio_6, bio_7, bio_8, bio_9, bio_10,
bio_11, bio_12, bio_13, bio_14, bio_15,
bio_16, bio_17, bio_18, bio_19))
pd
bio_1 bio_2 bio_3 bio_4 bio_5 bio_6 bio_7 bio_8 bio_9 bio_10 bio_11 bio_12
1 -10 5 25 500 25 -15 20 -10 -30 10 -20 300
2 15 10 50 1000 30 -5 40 0 10 20 0 650
3 20 15 100 1500 45 10 50 20 20 30 20 1000
bio_13 bio_14 bio_15 bio_16 bio_17 bio_18 bio_19
1 100 20 30 100 30 1000 300
2 200 30 50 500 60 1500 400
3 300 40 70 1000 120 2000 500
# predict with different models
predict(m_red, pd)
1 2 3
2.3261808 0.1361003 1.7570892
predict(m_full, pd)
1 2 3
131050.25 -32762.75 -98292.01
predict(m_null, pd)
1 2 3
0.5033215 0.5033215 0.5033215
# response curves
response(bc)
response(m_red)
response(m_full)
# predict suitability scores with raster object
pr <- predict(predictors_na, m_red)
pf <- predict(predictors_na, m_full)
pn <- predict(predictors_na, m_null)
pbc <- predict(predictors_na, bc)
plot(pr, main = "Reduced Model")
plot(pf, main = "Full Model")
plot(pn, main = "Null Model")
plot(pbc, main = "BioClim Model")
# plot BioClim model in western states
library(USAboundaries)
# plot mtn west
mtn_west <- us_counties(states = c("Colorado", "Nevada", "Montana",
"New Mexico", "Wyoming", "Idaho",
"Utah","Arizona"))
# save CRS for later
myCRS <- mtn_west %>%
st_crs()
# turn the pbc raster into a stars object
pbc_stars <- pbc %>% st_as_stars()
# crop pbc_stars
pbc_mtn_west <- st_crop(pbc_stars, mtn_west)
# transform pronghorn CRS
pronghorn_sf_filtered <- pronghorn_sf_filtered %>%
st_transform(myCRS)
# crop
pronghorn_mtn_west <- st_crop(pronghorn_sf_filtered, mtn_west)
# plot + locations
ggplot() +
geom_stars(data = pbc_mtn_west) +
geom_sf(data = mtn_west, fill = NA,
col = "black") +
geom_sf(data = pronghorn_mtn_west, alpha = 0.4, color = "orange") +
scale_fill_viridis_c(name = "Predicted Suitability",
option = "D") +
labs(title = "Pronghorn Antelope",
subtitle = "BioClim Model")
# tmap interactive
tm_shape(pbc_mtn_west) +
tm_raster(alpha = 0.9, palette = "viridis",
title = "Predicted Suitability") +
tm_shape(pronghorn_mtn_west) +
tm_dots(size = 0.02, col = "orange", alpha = 0.3)
Model evaluation
For this section, most of the code from the original vignette could be run on your own, but I’ve included edited bits and pieces that may be useful.
# simulate predicted suitability scores at presence (p) and absence (a) sites
p <- rnorm(50, mean=0.7, sd=0.3)
a <- rnorm(50, mean=0.4, sd=0.4)
# density plot
tibble(presence = p, absence = a) %>%
pivot_longer(1:2, names_to = "type", values_to ="value") %>%
ggplot(aes(x = value, fill = type)) +
geom_density(alpha = 0.5)
# boxplot
tibble(presence = p, absence = a) %>%
pivot_longer(1:2, names_to = "type", values_to ="value") %>%
ggplot(aes(x = type, y = value, fill = type)) +
geom_boxplot(width = 0.25)
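In the original vignette, these simulated scores are passed to dismo’s evaluate() to compute AUC, correlation, and threshold statistics; a quick sketch:
# evaluate the simulated presence/absence scores
e <- evaluate(p = p, a = a)
e
# ROC curve
plot(e, "ROC")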
# removing spatial sorting bias
set.seed(0)
backgr <- randomPoints(predictors_na, 500)
nr <- nrow(pronghorn_sf_filtered)
s <- sample(nr, 0.25 * nr)
# we drop the geometry from the sf objects so that they play nicely with ssb()
pres_train <- pronghorn_sf_filtered[-s, ] %>% st_drop_geometry()
pres_test <- pronghorn_sf_filtered[s, ] %>% st_drop_geometry()
nr <- nrow(backgr)
set.seed(9)
s <- sample(nr, 0.25 * nr)
back_train <- backgr[-s, ]
back_test <- backgr[s, ]
# ssb
sb <- ssb(pres_test, back_test, pres_train)
sb[,1] / sb[,2]
p
0.005369935
# creating subsamples to remove SSB
i <- pwdSample(pres_test, back_test, pres_train, n=1, tr=0.1)
pres_test_pwd <- pres_test[!is.na(i[,1]), ]
back_test_pwd <- back_test[na.omit(as.vector(i)), ]
sb2 <- ssb(pres_test_pwd, back_test_pwd, pres_train)
sb2[1]/ sb2[2]
[1] 1.006879
# SSB has been reduced (the distance ratio is now close to 1)
# next, fit a Bioclim model and evaluate it with the original test data
bc <- bioclim(predictors_na, pres_train)
evaluate(bc, p=pres_test, a=back_test, x=predictors_na)
class : ModelEvaluation
n presences : 2026
n absences : 125
AUC : 0.9290286
cor : 0.2281553
max TPR+TNR at : 0.00266681
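The evaluation above still uses the original (spatially biased) test data. To see the effect of removing spatial sorting bias, evaluate the same model with the distance-matched test points created by pwdSample(); expect the AUC to drop. A sketch:
# evaluate again with the pwd-subsampled test data (SSB removed)
evaluate(bc, p = pres_test_pwd, a = back_test_pwd, x = predictors_na)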
Question 1
Describe the differences you observe in the mapped predictions between the full model and the reduced model. Which mapped prediction is closer to the known distribution of the species? Provide a citation and an image (e.g., a screenshot) from an independent source showing the species’ range map.
Question 2
Which model, the full or reduced model, performs better? Why? Include plots and model statistics to help with your explanation.
Extra information on AUC and spatial sorting bias (SSB): in general, AUC improves with larger geographic extents. As with most of these metrics, you should only compare them across models that have the same input data (meaning the same occurrence records; the same number of unique presence and absence locations). Splitting the data into training and testing sets makes the evaluation susceptible to the actual distances between the testing and training data in space. To remove this bias, you can subsample the testing data with pairwise distance sampling (pwdSample in the dismo package).