Spatial interpolation exercises (NL)
From spatial-analyst.net
he following article describes four data interpolation exercises that have been developed with support from the EcoGRID-GAN and Geo-valley projects and are a.o. used in the Earth sciences BSc and MSc programmes at the University of Amsterdam. The exercise demonstrate various geostatistical analysis steps: how to generate maps from sampled point data, interpret the results and make objective evaluation of various linear geostatistical techniques. The point data is publicly available and has been obtained from referred websites. For more information about various topics please refer also to the indicated references.
The focus is put on use of linear geostatistical techniques (regression-kriging) and assessment of the accuracy of spatial predictions. To run this analysis on your computer/laptop, you will first need to (1) install R with all necessary packages, and SAGA GIS; (2) donwload and unzip data (zip files include R scripts) listed below, and then you can run these exercise line-by-line. The sample codes shown below are only used for illustration. To master R syntax, consider obtaining some of the R manuals. The most comprehensive guide to spatial data analysis in R is the Bivand et al. (2008) book. To learn about R+SAGA integration, consider also obtaining the Hengl (2009) book.
|
The following exercises use some of the common spatial packages. Before you can run any further analysis you need to first install them (e.g. from CRAN), and then you can load packages by using:
> library(rgdal) > library(maptools) > library(gstat) > library(RSAGA) > library(lattice)
Contents |
Spatial prediction of daily precipitation in the Netherlands
INSTRUCTIONS: The zip file contains three data sets: (1) monv_200906.txt – values of daily precipitation (in mm) measured at 326 locations over June 2006 (this data was obtained from the KNMI website), (2) NL_stations_XY.csv – coordinates of meteorological stations (projected in the EPSG:28992); (3) prec6.asc –-- predicted long-term precipitation for June based on the Worldclim project page. The purpose of this exercise is to produce a series of maps of daily rainfall based on the point data and by using ordinary and regression-kriging techniques. We are interested to see if the variograms for rainfall between different dates differ and what is the overall accuracy of the produced maps.
Data import and formatting/filtering
We start with importing the point (meteorological) data. The precipitation measurements and locations come in two separate files:
# locaton of stations:
> NL_stations_XY <- read.csv("NL_stations_XY.csv")
> str(NL_stations_XY)
'data.frame': 326 obs. of 6 variables:
$ Locatie : Factor w/ 326 levels "Aalsmeer ","Aalten ",..: 235 1 2 3 4 5 6 7 8 9 ...
$ X.Nr : int 760 458 680 572 89 664 678 560 910 441 ...
$ X.OL : Factor w/ 170 levels "3° 24' ","3° 26' ",..: 11 46 143 58 105 149 130 81 67 55 ...
$ X.NB : Factor w/ 131 levels "50° 45' ","50° 46' ",..: 25 62 49 62 126 66 107 54 39 68 ...
$ X.X..km.: int 43 113 236 127 184 242 219 160 143 123 ...
$ X.Y..km.: int 388 475 437 475 563 485 464 445 418 487 ...
# fix the coordinates:
> NL_stations_XY$X <- NL_stations_XY$X.X..km. * 1000
> NL_stations_XY$Y <- NL_stations_XY$X.Y..km. * 1000
# fix the station names (trim white spaces):
> NL_stations_XY$Locatie <- as.factor(gsub(pattern='[[:space:]]+$', replacement='', NL_stations_XY$Locatie))
# import the precipitation values:
> tmp <- readLines(paste(getwd(), "/", "monv_200906.txt", sep=""))
> tmp <- gsub(pattern=' . ', replacement=' 0.0 ', tmp, fixed = TRUE)
> tmp <- gsub(pattern='[[:space:]]{2,7}', replacement=',', tmp)
> write.table(t(matrix(unlist(strsplit(tmp, ",")[-c(1:8)]), nrow=41)), "tmp.txt", col.names=FALSE, row.names=FALSE)
> monv_200906 <- read.table("tmp.txt", col.names=c("empty1", "NR", paste("D", 1:30, sep=""), "I", "NORM", "II", "NORM", "III", "NORM", "MND", "NORM", .... [TRUNCATED]
> monv_200906$empty1 <- NULL
> monv_200906$Locatie <- as.factor(gsub(pattern='[[:space:]]+$', replacement='', monv_200906$Locatie))
> str(monv_200906)
'data.frame': 326 obs. of 40 variables:
$ NR : int 760 458 680 572 89 664 678 560 910 441 ...
$ D1 : num 0 0 0 0 0 0 0 0 0 0 ...
...
$ D30 : num 0 0 0 0 0 0 0 0 0 0 ...
$ I : num 30.4 18.7 29.9 17 12.4 19.9 17.1 13 24.6 13.4 ...
$ NORM : num 26.9 26.1 30.2 25.7 NA 28.7 27.2 29.1 27.4 29.3 ...
$ II : num 65.2 32.4 45.5 28.9 32.1 37 31.7 40.4 46.4 26.4 ...
$ NORM.1 : num 21.7 20.7 19.2 19.2 NA 20.8 18.8 19.7 19.1 22.1 ...
$ III : num 0.2 0 15.1 12.6 4.2 8 2.5 0.7 0 5.5 ...
$ NORM.2 : num 23.3 23.7 28.1 25 NA 26.1 24.9 25 22.9 24.3 ...
$ MND : num 95.8 51.1 90.5 58.5 48.7 64.9 51.3 54.1 71 45.3 ...
$ NORM.3 : num 71.9 70.4 77.4 69.9 NA 75.6 70.9 73.8 69.3 75.6 ...
$ Locatie: Factor w/ 326 levels "'s Heerenhoek",..: 1 3 4 5 6 7 8 9 10 11 ...
Once we have imported and sorted both tables, we can merge them to a new table that contains both precipitation values and coordiantes:
# merge the precipitation and coordinates:
> monv_200906.XY <- merge(x=NL_stations_XY[,c("X", "Y", "Locatie")], y=monv_200906, by=c("Locatie"), all.x=TRUE)
# str(monv_200906.XY)
# convert to a point map:
> coordinates(monv_200906.XY) <- ~X+Y
We can also import gridded maps, and borders of municipalities in the Netherlands:
# import gridded map:
> grids1km <- readGDAL("prec6.asc")
> names(grids1km) <- "prec6"
> grids1km$NLmask <- as.factor(readGDAL("NLmask.asc")$band1)
# CBS Wijk en buurtkaart 2008
> download.file("http://www.cbs.nl/NR/rdonlyres/3D8F1FF7-E6BA-430D-AF3B-9F16CB6647B9/0/buurt_2008_gen1.zip",
+ destfile=paste(getwd(), "/", "NL_buurtkaart2008.zip", sep=""))
> unzip("NL_buurtkaart2008.zip")
# import polygons and convert to line map:
> NLprovs <- readShapePoly("gem_2008_gn2.shp", proj4string=CRS("+proj=sterea +lat_0=52.15616055555555
+ +lon_0=5.38763888888889 +k=0.999908 +x_0=155000 +y_0=463000 +ellps=bessel
+ +towgs84=565.237,50.0087,465.658,-0.406857,0.350733,-1.87035,4.0812 +units=m +no_defs"))
# the Dutch coordinate system;
> NLprovs <- as(NLprovs, "SpatialLinesDataFrame")
Spatial interpolation of precipitation using ordinary kriging
Now that we have imported and sorted all data (i.e. put them in a right format), we can start with making predictions using simple geostatistical models. For example, we want to estimate how much rain did fall on 10th of June 2009 on the location: X=155000, Y=463000.
> sel <- !is.na(monv_200906.XY$D10) > bubble(monv_200906.XY[sel, "D10"]) # hist(log1p(monv_200906.XY$D10)) # normally distributed after log-transformation! > D10.svar <- variogram(log1p(D10)~1, monv_200906.XY[sel,]) > D10.vgm <- fit.variogram(D10.svar, model=vgm(nugget=var(log1p(monv_200906.XY$D10), na.rm=TRUE)/3, model="Lin")) > New.XY <- data.frame(X=155000, Y=463000) > coordinates(New.XY) <- ~X+Y > expm1(krige(log1p(D10)~1, monv_200906.XY[sel,], New.XY, D10.vgm)$var1.pred) # back-transformation included [using ordinary kriging] [1] 1.343429
The estimated rainfall on the new location was 1.34 mm. This is relatively small value considering that rainfall on that day ranged from 0 to 25.7 mm. Note that we need to use log-transformation before spatial prediction because the rainfall data is typically skewed. If we have run interpolation without log-transformation, our estimates would have probably be significantly higher (over-estimation) because of the impact of high values. Next we can try to see what the uncertainty of that estimate is, even better, what is the overall uncertainty of the estimated rainfall for the whole area of the Netherlands? To answer this question, we can run:
> New.OK <- krige(log1p(D10)~1, monv_200906.XY[sel,], New.XY, D10.vgm) [using ordinary kriging] # 95% range based on the OK variance: > expm1(New.OK$var1.pred + 1.96*sqrt(New.OK$var1.var)); expm1(New.OK$var1.pred - 1.96*sqrt(New.OK$var1.var)) [1] 3.479269 [1] 0.2260168
and then for the whole Netherlands:
> D10.OK <- krige(log1p(D10)~1, monv_200906.XY[sel,], grids1km["NLmask"], D10.vgm, nmin=10, nmax=50)
[using ordinary kriging]
> D10.OK$pred <- expm1(D10.OK$var1.pred)
# plot the resulting map:
> D10ok.plt <- spplot(D10.OK["pred"], col.regions=rev(bpy.colors(40)), at=expm1(seq(0, max(log1p(D10.OK$pred), na.rm=TRUE), .... [TRUNCATED]
> D10var.plt <- spplot(D10.OK["var1.var"], col.regions=grey(seq(0,1,by=0.025)),
+ sp.layout=list("sp.points", monv_200906.XY, pch="+", col="red"), main="OK variance")
> print(D10ok.plt, split=c(1,1,2,1), more=TRUE)
> print(D10var.plt, split=c(2,1,2,1), more=FALSE)
> mean(expm1(D10.OK$var1.pred + 1.96*sqrt(D10.OK$var1.var))-expm1(D10.OK$var1.pred - 1.96*sqrt(D10.OK$var1.var)), na.rm=TRUE)/2
[1] 3.977561
This shows that the average precision in the map shown above is about +/- 4 mm. This is relatively high number considering that the global standard deviation is:
> sd(monv_200906.XY$D10, na.rm=TRUE) [1] 5.036219 # but this is not a good measure because the values are skewed!
The remaining issue is whether we could use some auxiliary map to improve the interpolation of precipitation. For example, we can consider using the long-term precipitation map for June obtained from the Worldclim project:
> monv_200906.XY.ov <- overlay(grids1km, monv_200906.XY)
> monv_200906.XY.ov@data <- cbind(monv_200906.XY.ov@data, monv_200906.XY@data)
> scatter.smooth(monv_200906.XY.ov$prec6, monv_200906.XY.ov$D10, xlab="Long-term precipitation (mm)",
+ ylab="Precipitation on 10th of June")
> summary(lm(log1p(D10)~prec6, monv_200906.XY.ov))
Call:
lm(formula = log1p(D10) ~ prec6, data = monv_200906.XY.ov)
Residuals:
Min 1Q Median 3Q Max
-1.90188 -0.55536 -0.06208 0.54645 2.15420
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.956160 0.385053 -10.27 <2e-16 ***
prec6 0.079478 0.005901 13.47 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7428 on 317 degrees of freedom
(7 observations deleted due to missingness)
Multiple R-squared: 0.364, Adjusted R-squared: 0.362
F-statistic: 181.4 on 1 and 317 DF, p-value: < 2.2e-16
The model is significant, hence the long-term pattern of rainfall is a good predictor of daily rainfall for this date.
| Figure: Correlation plot rainfall on June 10th and long-term precipitation map for June obtained from the Worldclim project. |
Spatial auto-correlation for different dates
In previous example we have noticed that precipitation can be interpolated using geostatistical techniques. In fact, it looks that the target variable is correlated up to high distances. Will we get the same variograms for all other dates? To find out we need to plot few variograms for different dates (e.g. 5th, 10th, 15th and 20th of June) on top of each other and then compare to see if there are large differences. This can be done in few steps:
> D5.vgm <- fit.variogram(variogram(log1p(D5)~1, monv_200906.XY[!is.na(monv_200906.XY$D5),]), + model=vgm(nugget=0, model="Exp", sill=var(log1p(monv_200906.XY$D5), na.rm=TRUE), range=50000)) > D15.vgm <- fit.variogram(variogram(log1p(D15)~1, monv_200906.XY[!is.na(monv_200906.XY$D15),]), + model=vgm(nugget=0, model="Exp", sill=var(log1p(monv_200906.XY$D15), na.rm=TRUE), range=50000)) > D20.vgm <- fit.variogram(variogram(log1p(D20)~1, monv_200906.XY[!is.na(monv_200906.XY$D20),]), + model=vgm(nugget=0, model="Exp", sill=var(log1p(monv_200906.XY$D20), na.rm=TRUE), range=50000)) # plot of variograms: > plot(variogramLine(D5.vgm, maxdist=2e+05)$dist, variogramLine(D5.vgm, maxdist=2e+05)$gamma, + xlim=c(0, 2e+5), ylim=c(0, var(c(log1p(monv_200906.XY$D5), log1p(monv_200906.XY$D10), + log1p(monv_200906.XY$D15), log1p(monv_200906.XY$D20)), na.rm=TRUE)), xlab="Distance (m)", + ylab="Semivariance", type="l", col="black", lwd=2) > lines(variogramLine(D10.vgm, maxdist=2e+05)$dist, variogramLine(D10.vgm, maxdist=2e+05)$gamma, col="red", lwd=2) > lines(variogramLine(D15.vgm, maxdist=2e+05)$dist, variogramLine(D15.vgm, maxdist=2e+05)$gamma, col="blue", lwd=2) > lines(variogramLine(D20.vgm, maxdist=2e+05)$dist, variogramLine(D20.vgm, maxdist=2e+05)$gamma, col="green", lwd=2)
This plot shows that there seems to exist large differences between variograms for different dates. How would you explain this feature? Most probably because the type of rain is defined with meteorological conditions i.e. the type of clouds over an area, and there are many various types of clouds that result in either local precipitation or uniform precipitation over a large area.
Locating areas of highest rainfall
In the final exercise we demonstrate how to locate an area with the highest rainfall. We focus on 8th of June, but this time we also use an auxiliary map to improve interpolations:
> D8.rvgm <- fit.variogram(variogram(log1p(D8)~prec6, monv_200906.XY.ov[!is.na(monv_200906.XY.ov$D8)&!is.na(monv_200906.XY.ov$prec6),]),
+ model=vgm(nugget=0, model="Exp", sill=var(log1p(monv_200906.XY.ov$D8), na.rm=TRUE)/2, range=50000))
> D8.RK <- krige(log1p(D8)~prec6, monv_200906.XY.ov[!is.na(monv_200906.XY.ov$D8)&!is.na(monv_200906.XY.ov$prec6),],
+ grids1km, D8.rvgm, nmin=20, nmax=50)
[using universal kriging]
# mask areas of interest:
> D8.RK$pred <- ifelse(is.na(grids1km$NLmask), NA, expm1(D8.RK$var1.pred))
> spplot(D8.RK["pred"], col.regions=rev(bpy.colors(40)),
+ at=expm1(seq(0, max(log1p(D8.RK$pred), na.rm=TRUE), by=(max(log1p(D8.RK$pred), na.rm=TRUE)-0)/40)),
+ sp.layout=list("sp.lines", NLprovs, col="black"), main="Neerslag 7-8 juni 2009")
# Locate the maximum value:
> D8.RK.pnt <- as(D8.RK["pred"], "SpatialPointsDataFrame")
> proj4string(D8.RK.pnt) <- NLprovs@proj4string
> spTransform(D8.RK.pnt[which.max(D8.RK.pnt$pred),], CRS("+proj=latlon +datum=WGS84"))
coordinates pred
8939 (6.94032, 53.3179) 24.88213
# enter in Google maps and you will get - Farmsum:
> system(paste('"c:/Program Files/firefox/firefox.exe"', ' -url http://maps.google.com/maps?&ll=',
+ 53.3179,',',6.94032,'&t=h&q=', 53.3179,',',6.94032, sep=""), wait=F)
> max(monv_200906.XY.ov$D8, na.rm=TRUE)
[1] 29.4
Which shows that the maximum rainfall was in the small city called Farmsum. This number is still smaller that what was estimated in the near-by meterological station.
Spatial prediction of daily temperatures over Europe
INSTRUCTIONS: The zip file contains two maps: (1) GSOD20060501.shp – a point map with attached values of daily temperatures (TEMP in °C) measured at 2118 locations on 1st of May 2006, and (2) eudem.asc – a SRTM30+ digital elevation map (DEM) of whole of Europe (5 km resolution; projected in the EPSG:3035). To learn more about the Daily Global Weather Measurements, 1929-2009 (NCDC, GSOD) data look at the NCDC website. The purpose of this exercise is to produce a map of mean daily temperature for Europe given the point data and DEM.
Data import and initial analysis
We start with importing the maps to R and making some initial plots:
> GSOD20060501 <- readOGR("GSOD20060501.shp", "GSOD20060501")
OGR data source with driver: ESRI Shapefile
Source: "GSOD20060501.shp", layer: "GSOD20060501"
with 2117 features and 3 fields
Feature type: wkbPoint with 2 dimensions
> grids5km <- readGDAL("eudem.asc")
eudem.asc has GDAL driver AAIGrid
and has 799 rows and 664 columns
> names(grids5km) <- "eudem"
> grids5km$mask <- readGDAL("mask5km.asc")$band1
mask5km.asc has GDAL driver AAIGrid
and has 799 rows and 664 columns
> proj4string(grids5km) <- GSOD20060501@proj4string
# overlay points and grids:
> GSOD20060501.ov <- overlay(grids5km, GSOD20060501)
> GSOD20060501.ov@data <- cbind(GSOD20060501.ov@data, GSOD20060501@data)
Next, we can generate a scatter plot to see general relationship between temperature and elevations:
> par(mfrow=c(1,2)) > scatter.smooth(GSOD20060501.ov$eudem, GSOD20060501.ov$TEMPC, xlab="Elevation", ylab="TEMP (degree celsius)")
This shows that some of the points fall on the bottom of sea. Because we are only interested in mapping the temperature on land, we can mask out some points by using:
> GSOD20060501.ov <- subset(GSOD20060501.ov, GSOD20060501.ov$eudem> -20) > scatter.smooth(GSOD20060501.ov$eudem, GSOD20060501.ov$TEMPC, xlab="Elevation", ylab="TEMP (degree celsius)") # is the target variable normaly distributed? > hist(GSOD20060501.ov$TEMPC, breaks=25, col="grey") # yes
Temperature on 2006-05-01 in Amsterdam and Vienna?
Before we can produce any predictions, we need to estimate the geostatistical model. In this case (simple linear model), this takes one line of code:
> TEMPC.rvgm <- fit.variogram(variogram(TEMPC~eudem, GSOD20060501.ov, cutoff=400000), vgm(nugget=3, model="Lin")) > plot(variogram(TEMPC~eudem, GSOD20060501.ov, cutoff=400000), TEMPC.rvgm) # remark: some strange values can be observed on short distances;
Note that the models is linear, hence there is practically no limit for spatial auto-correlation for this data. This is typical for temperature data that varies smoothly in space. What worries us is the nugget variation that is fairly significant (3.8).
To get locations of Amsterdam and Vienna we can use the Google maps API service:
# set the Google key string:
googlekey <- "InsertGoogleKey"
# coordinates of Amsterdam:
> googleurl <- url(paste("http://maps.google.com/maps/geo?q=", "Amsterdam+Netherlands", "&output=csv&key=",googlekey,sep=""), open="r")
> amsterdam.XY <- as.vector(as.numeric(unlist(strsplit(readLines(googleurl, n=1, warn=FALSE), ","))))
# [1] 200.000000 4.000000 52.373801 4.890935
> coordinates(New.ll) <- ~Lon+Lat
> proj4string(New.ll) <- CRS("+proj=lonlat +datum=WGS84")
> New.XY <- spTransform(New.ll, GSOD20060501@proj4string)
> New.XY$eudem <- overlay(grids5km, New.XY)@data$eudem
> krige(TEMPC~eudem, GSOD20060501.ov, New.XY, TEMPC.rvgm, nmax=80)
[using universal kriging]
coordinates var1.pred var1.var
1 (3973380, 3263800) 8.169655 3.890196
and then the same thing for Vienna:
> googleurl <- url(paste("http://maps.google.com/maps/geo?q=", "Vienna+Austria", "&output=csv&key=",googlekey,sep=""), open="r")
> vienna.XY <- as.vector(as.numeric(unlist(strsplit(readLines(googleurl, n=1, warn=FALSE), ","))))
# [1] 200.00000 4.00000 48.20921 16.37278
> New.ll <- data.frame(Lon=vienna.XY[4], Lat=vienna.XY[3], Name="Vienna")
> coordinates(New.ll) <- ~Lon+Lat
> proj4string(New.ll) <- CRS("+proj=lonlat +datum=WGS84")
> New.XY <- spTransform(New.ll, GSOD20060501@proj4string)
> New.XY$eudem <- overlay(grids5km, New.XY)@data$eudem
> krige(TEMPC~eudem, GSOD20060501.ov, New.XY, TEMPC.rvgm, nmax=80)
[using universal kriging]
coordinates var1.pred var1.var
1 (4794200, 2808840) 9.06544 3.863452
This shows that the mean daily temperature on May 1st 2006 in Amsterdam was 8.2 +/- 2.0 and in Vienna 9.1 +/- 2.0 degree C.
Significance of the model
To find out how much of variability in TEMP values can be explained using elevation, we can run:
> summary(lm(TEMPC~eudem, GSOD20060501.ov@data))$adj.r.square [1] 0.1618341
This tells us only how good can we fit the data. To estimate the true accuracy of the resulting map, we can run 3-fold cross-validation in gstat:
> TEMP.cv <- krige.cv(TEMPC~eudem, nfold=3, GSOD20060501.ov, TEMPC.rvgm, nmax=40, debug.level=-1, set=list(cn_max=1e10))
[using universal kriging]
100% done
|
|========================== | 50%[using universal kriging]
100% done
|
|=====================================================| 100%[using universal kriging]
100% done
> 1-mean(TEMP.cv$residual^2, na.rm=TRUE)/var(GSOD20060501.ov$TEMPC, na.rm=TRUE)
[1] 0.855478
> quantile(TEMP.cv$residual[!is.na(TEMP.cv$residual)], c(0.05, 0.95))
5% 95%
-2.319952 2.418919
The result of cross-validation shows that the model explains >85% of variability in the TEMP values. The overall accuracy of the map (base on the results of cross-validation) will be between -2.3 / 2.5 degree C. Although elevations explain only 15% of variability in the data, this model is highly significant.
We can finally run predictions for the whole study area. Because the data set is large (>1000 points), it is more efficient to use SAGA to generate predictions instead of using R+gstat. The following commands will run universal kriging in SAGA GIS (predictions can still take few minutes):
# export data to SAGA compatible format: gridcell <- grids5km@grid@cellsize[1] grids5km$eudem.f <- ifelse(grids5km$eudem < -20, NA, grids5km$eudem) # write to SAGA grid: writeGDAL(grids5km["eudem.f"], "eudem.sdat", "SAGA") writeOGR(GSOD20060501.ov["TEMPC"], "TEMP.shp", "TEMP", "ESRI Shapefile") # universal kriging in SAGA GIS: > rsaga.geoprocessor(lib="geostatistics_kriging", module=7, param=list(GRID="TEMPC_rk.sgrd", SHAPES="TEMP.shp", + GRIDS="eudem.sgrd", BVARIANCE=F, BLOCK=F, FIELD=1, BLOG=F, MODEL=3, TARGET=0, USER_CELL_SIZE=5000, + NUGGET=TEMPC.rvgm$psill[1], SILL=TEMPC.rvgm$psill[2], INTERPOL=0, NPOINTS_MAX=30, NPOINTS_MIN=20, + MAXRADIUS=1200000, USER_X_EXTENT_MIN=grids5km@bbox[1,1]+gridcell/2, USER_X_EXTENT_MAX=grids5km@bbox[1,2]-gridcell/2, + USER_Y_EXTENT_MIN=grids5km@bbox[2,1]+gridcell/2, USER_Y_EXTENT_MAX=grids5km@bbox[2,2]-gridcell/2)) SAGA CMD 2.0.3 library path: C:/PROGRA~1/R/R-210~1.1/library/RSAGA/saga_vc/modules library name: geostatistics_kriging module name : Universal Kriging author : (c) 2003 by O.Conrad Load shapes: TEMP.shp... ready ready Load grid: eudem.sgrd... ready Parameters Grid: [not set] Variance: [not set] Points: TEMP.shp Attribute: TEMPC Create Variance Grid: no Target Grid: user defined Variogram Model: Linear Regression Block Kriging: no Block Size: 100.000000 Logarithmic Transformation: no Nugget: 3.643687 Sill: 0.000003 Range: 100.000000 Linear Regression: 1.000000 Exponential Regression: 0.100000 Power Function - A: 1.000000 Power Function - B: 0.500000 Grids: 1 object (eudem.sgrd)) Grid Interpolation: Nearest Neighbor Maximum Search Radius (map units): 1200000.000000 Min./Max. Number of m_Points: [20.000000] - [30.000000] ready Save grid: TEMPC_rk.sgrd... ready
The final resulting map can be seen down-below. Note that the relationship between temperature and relief is more than distinct.
Adding auxiliary predictors
We can consider using distance to coast as an additional auxiliary predictor. This can be derived using SAGA GIS and the line map of coast line:
# convert shape file to a gridded map:
> rsaga.geoprocessor(lib="grid_gridding", module=0, param=list(GRID="coast_buffer.sgrd", INPUT="coast_europe.shp",
+ FIELD=0, LINE_TYPE=0, USER_CELL_SIZE=gridcell, USER_X_EXTENT_MIN=grids5km@bbox[1,1]+gridcell/2,
+ USER_X_EXTENT_MAX=grids5km@bbox[1,2]-gridcell/2, USER_Y_EXTENT_MIN=grids5km@bbox[2,1]+gridcell/2,
+ USER_Y_EXTENT_MAX=grids5km@bbox[2,2]-gridcell/2))
# derive a buffer distance:
> rsaga.geoprocessor(lib="grid_tools", module=10, param=list(SOURCE="coast_buffer.sgrd", DISTANCE="land_dist.sgrd",
+ ALLOC="tmp.sgrd", BUFFER="tmp.sgrd", DIST=sqrt(areaSpatialGrid(grids5km))/3, IVAL=gridcell))
# this takes time!
> grids5km$land.dist <- readGDAL("land_dist.sdat")$band1
> grids5km$dcoast <- ifelse(grids5km$eudem < -10, grids5km$land.dist, -grids5km$land.dist)
> GSOD20060501.ov <- overlay(grids5km, GSOD20060501)
> GSOD20060501.ov@data <- cbind(GSOD20060501.ov@data, GSOD20060501@data)
> GSOD20060501.ov <- subset(GSOD20060501.ov, GSOD20060501.ov$eudem> -20)
and we can re-fit a linear model:
> summary(lm(TEMPC~eudem+dcoast, GSOD20060501.ov@data))
Call:
lm(formula = TEMPC ~ eudem + dcoast, data = GSOD20060501.ov@data)
Residuals:
Min 1Q Median 3Q Max
-10.5770 -2.6206 -0.5369 2.2636 17.0814
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.057e+01 1.167e-01 90.619 <2e-16 ***
eudem -4.299e-03 2.171e-04 -19.804 <2e-16 ***
dcoast -1.424e-06 5.808e-07 -2.451 0.0143 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.755 on 1980 degrees of freedom
Multiple R-squared: 0.169, Adjusted R-squared: 0.1682
F-statistic: 201.3 on 2 and 1980 DF, p-value: < 2.2e-16
This shows that the temperatures also change with distance to coast line, however this relationship is much less significant than with DEM. It is very possible that a the influence of ocean on temperature is valid only up to some treshold distances (e.g. 50--100 km), so that one would have to use a locally adaptive model.
Spatial prediction of soil properties for the Barro Colorado Island 50 ha Plot
INSTRUCTIONS: The zip file contains three data sets: (1) soil_samples.shp – values of 19 soil parameters estimated in laboratory using soil samples from 300 locations (this data is a part of the Barro Colorado Island 50 ha monitoring plot in Panama), (2) dem.asc – a 5 meter digital elevation model of the area (coordinates of origin: lat=9.15125, lon=-79.8553); (3) twi.asc – topographic wetness index showing a potential soil deposition. The purpose of this exercise is to produce maps of various soil parameters using the point data and two auxiliary maps dem.asc and twi.asc. This data set is described in detail in Grim et al. (2008).
Import and initial analysis
We can read the maps to R by using:
> soil_samples <- readShapePoints("soil_samples.shp", proj4string = CRS(as.character(NA)))
Field name: NA changed to: NA.
Field name: TOTAL N changed to: TOTAL.N
Field name: MIN N changed to: MIN.N
> str(soil_samples@data)
'data.frame': 300 obs. of 21 variables:
$ GX : num 25 25 25 25 45 25 25 23.6 25 25 ...
$ GY : num 25 45 75 125 125 ...
$ AL : num 555 1002 1035 744 998 ...
$ B : num 1.44 0.37 0.37 0.96 0.55 1.1 0.41 0.68 0.54 0.45 ...
$ CA : num 2542 691 705 2143 1842 ...
$ CU : num 9.66 6.19 5.84 7.84 6.02 7.53 7.42 8.33 7.32 6.64 ...
$ FE : num 104 104 109 150 131 ...
$ K : num 154 133 112 204 140 ...
$ MG : num 303 181 167 311 196 ...
$ MN : num 305 116 160 427 274 ...
$ NA. : num 59.5 48.5 60.7 64.7 59.7 ...
$ P : num 0.64 2.37 2.35 1.39 2.5 2.9 1.68 2.22 0.75 1.29 ...
$ S : num 0 0 2.88 0 0 ...
$ ZN : num 7.11 0.98 1.04 0.71 2.63 3.9 1.45 2.2 2.15 1.54 ...
$ NH4 : num 5.39 4.27 5.13 7.69 6.36 ...
$ NO3 : num 7.5 11.12 8.43 14.95 15.3 ...
$ TOTAL.N: num 12.9 15.4 13.6 22.6 21.7 ...
$ MIN.NH4: num 0.12 -1.53 -4.39 -5.86 3.36 2.72 -2.53 2.11 4.01 -3.11 ...
$ MIN.NO3: num 8.73 -6.96 -1.39 -10.25 38.2 ...
$ MIN.N : num 8.85 -8.49 -5.78 -16.11 41.55 ...
$ PH : num 5.9 5.24 4.94 5.55 5.46 5.45 4.96 5.07 4.79 5.16 ...
- attr(*, "data_types")= chr "N" "N" "N" "N" ...
> grids5m <- readGDAL("dem.asc")
dem.asc has GDAL driver AAIGrid
and has 101 rows and 201 columns
> names(grids5m) <- "dem"
> grids5m$twi <- readGDAL("twi.asc")$band1
twi.asc has GDAL driver AAIGrid
and has 101 rows and 201 columns
This shows that the soil_samples dataset contains measured values of 21 soil parameters. We will try to interpolate a number of soil parameters over the gridded locations defined in the object grids5m.
Spatial prediction of pH
We can focus further on mapping pH in soil. Since this is a fairly small study area, it would be interesting to see up to which distance are the values auto-correlated. First, we need to fit a variogram:
> PH.svar <- variogram(PH~1, soil_samples) # plot(PH.svar) > PH.vgm <- fit.variogram(PH.svar, model=vgm(nugget=0, model="Exp", sill=var(soil_samples$PH, na.rm=TRUE), range=100)) > plot(PH.svar, PH.vgm) > max(subset(PH.svar, gamma>var(soil_samples$PH, na.rm=TRUE))$dist) [1] 353.757
Next we can try using the two auxiliary maps to improve interpolation of pH. Assuming that the two maps are correlated and that the target variable is normaly distributed, we can produce predictions in four lines:
> soil_samples_ov <- overlay(grids5m, soil_samples) > soil_samples_ov@data <- cbind(soil_samples@data, soil_samples_ov@data) > PH.rvgm <- fit.variogram(variogram(PH~dem+twi, soil_samples_ov), + model=vgm(nugget=0, model="Exp", sill=var(soil_samples$PH, na.rm=TRUE)/2, range=100)) > PH.rk <- krige(PH~dem+twi, soil_samples_ov, grids5m, PH.rvgm, nmax=80) [using universal kriging] > writeGDAL(PH.rk[1], "PH_rk.sdat", "SAGA")
To find out what the 90% range of values of soil pH for the whole area is, we can use the quantile function:
> range.PH <- quantile(PH.rk$var1.pred, probs=c(0.05,0.95))
> range.PH
5% 95%
5.075561 6.137635
We have 'jumped over' some analysis too fast. It is maybe a good idea to take a step back and look at individual relationships between the target variable and predictors. How much of a variability in the pH values can be explained with the two auxiliary maps? Where are soil more acidic –- on hill-tops or in depressions?
> pairs(~PH+dem+twi, soil_samples_ov@data) > summary(lm(PH~dem+twi, soil_samples_ov@data))$adj.r.squared [1] 0.08780846
This shows that the relationship is not as clear. It seems that spatial distribution of the target variable can be explained with these two predictors, but only for a small portion of variation (9%). We can visualize the resulting map of pH in SAGA GIS to see how is the variable correlated with the topography. Open the maps and SAGA and use the 3D display (see below).
This obviously shows that the more acidic soils come in depressions (potentially wetter areas), which largely corresponds with our expectations.
Transformation of soil variables using PCA
Similar way we could analyze all other soil parameters from the point data. However, most of the parameters are in fact cross-correlated. To reduce the cross-correlation in the soil parameters we can use Principal Component Analysis to remove the multi-dimensionality and then focus on the most dominant PCs. We can interpolate the values of PC1 and produce a map using regression-kriging:
# Which soil parameters are highly cross-correlated? Plot a correlation matrix. > sort(cor(soil_samples@data)[,"PH"]) > pairs(~PH+S+AL+MIN.NO3+P, soil_samples@data)
> soil_samples.PCA <- prcomp(~AL+B+CA+CU+FE+K+MN+NA.+P+S+ZN+NH4+NO3+TOTAL.N+MIN.NH4+MIN.NO3+MIN.N+PH, soil_samples@data)
> biplot(soil_samples.PCA, xlabs=rep(".", length(soil_samples.PCA$x[,1])))
> soil_samples$PC1 <- soil_samples.PCA$x[,"PC1"]
> soil_samples_ov <- overlay(grids5m, soil_samples)
> soil_samples_ov@data <- cbind(soil_samples@data, soil_samples_ov@data)
> PC1.rvgm <- fit.variogram(variogram(PC1~dem+twi, soil_samples_ov),
+ model=vgm(nugget=0, model="Exp", sill=var(soil_samples$PC1, na.rm=TRUE)/2, range=100))
> PC1.rk <- krige(PC1~dem+twi, soil_samples_ov, grids5m, PC1.rvgm, nmax=80)
[using universal kriging]
> spplot(PC1.rk[1], col.regions=bpy.colors(40), at=seq(min(soil_samples$PC1), max(soil_samples$PC1),
+ by=(max(soil_samples$PC1)-min(soil_samples$PC1))/40), sp.layout=list("sp.points", pch="+", soil_samples, col="black"))
This map is not easy to interpret. One would need to zoom into the Principal Component Analysis to find out which soil parameters are positively correlated with the PC1 and which are the dominant soil parameters.
Interpolation of chemicals in surface sediment for the North Sea (NL)
INSTRUCTIONS: The zip file contains four data sets: (1) sed2006mon.txt – values of various chemicals estimated in laboratory using sediment samples from 43 locations in May 2006 (this data is available via the Noordzee-Atlas project; see also Hegeman and Laane (2008)), (2) seatemp.asc – MODIS estimated monthly Sea Surfafe Temperature for the same period (obtained from the OceanColor project); (3) seachlo.asc – MODIS estimated monthly Chlorophyll content in the sea for the same period (obtained from the OceanColor project), (3) srtmtopo.asc – topographic model of the land and sea bed based on the SRTM 30+ data, (4) watermask.asc – mask showing area of interest. The purpose of this exercise is to produce map of PCB (Polychlorinated Biphenyls) values (micro-g/kg dry matter) using the point data and three auxiliary maps srtmtopo.asc, seatemp.asc and seachlo.asc. All the maps are projected in the Dutch coordinate system (RDH, EPSG:28992). A similar interpolation exercise can be followed in the work of Pebesma (2006).
> sed2006mon <- read.delim("sed2006mon.txt")
> str(sed2006mon)
'data.frame': 43 obs. of 89 variables:
$ labnr : int 10052461 10052462 10052463 10052464 10052465 10052466 10052467 10052468 10052469 10052470 ...
$ lokatie : Factor w/ 43 levels "AMLD70","APPZK20",..: 1 2 3 4 5 6 7 8 9 10 ...
$ x : int 166444 3471 -3408 107800 99538 79909 40395 102177 93115 39115 ...
$ y : int 677771 391803 615153 541968 545764 549305 557312 515011 516849 642383 ...
$ datum : int 38785 38804 38790 38806 38806 38805 38805 38806 38806 38792 ...
$ G : num 191.4 83.8 199.4 22 7.4 ...
$ Som.van.kalk.humus.z : num 22.8 34.3 22.4 NA NA ...
...
$ PCB187 : Factor w/ 32 levels "< 0.1","< 0.2",..: 1 15 1 23 21 7 7 16 20 4 ...
...
# convert the target variable to numbers:
> sel <- which(substr(sed2006mon$PCB187, 1, 2)=='< ')
> sed2006mon$PCB187 <- as.numeric(gsub(pattern='< ', replacement='', sed2006mon$PCB187, fixed=TRUE))
> sed2006mon$PCB187[sel] <- sed2006mon$PCB187[sel]/2
> coordinates(sed2006mon) <- ~x+y
> bubble(sed2006mon, "PCB187")
We can automate the import of data by using:
> asc.list <- dir(pattern=glob2rx("*.asc"))
> grids2km <- readGDAL(asc.list[[1]])
PCB187_rk.asc has GDAL driver AAIGrid
and has 215 rows and 133 columns
> names(grids2km) <- strsplit(asc.list[[1]], ".asc")[[1]]
> for(j in 2:length(asc.list)){ grids2km@data[,strsplit(asc.list[[j]], ".asc")[[1]]] <- readGDAL(asc.list[[j]])$band1 }
seachlo.asc has GDAL driver AAIGrid
and has 215 rows and 133 columns
...
Estimation of the geostatistical model
We can first produce a map of PCB for the whole of North Sea (watermask.asc) and see how well does the model estimates values. We start with overlaying the points and grids and plotting the relationship:
> sed2006mon.ov <- overlay(grids2km, sed2006mon) > sed2006mon.ov@data <- cbind(sed2006mon@data, sed2006mon.ov@data) > scatter.smooth(sed2006mon.ov$seachlo, sed2006mon.ov$PCB187, xlab="MODIS Sea Chlorophyll", ylab="PCB187") # the relationship is log-normal;
Next we need to fit a geostatistical model. Because the target variable is heavily skewed, it is probably a better idea to fit a GLM with a log-link function than to use a simple linear model:
> PCB.glm <- glm(PCB187~seachlo+seatemp+srtmtopo, sed2006mon.ov@data, family=gaussian("log"))
# mask the missing values:
> sel <- !is.na(sed2006mon.ov$seachlo)
> PCB.rvgm <- fit.variogram(variogram(residuals(PCB.glm)~1, sed2006mon.ov[-PCB.glm$na.action,], cutoff=100000),
+ model=vgm(nugget=0, model="Exp", sill=var(residuals(PCB.glm)/2, na.rm=TRUE), range=100000))
Warning: singular model in variogram fit
# plot(variogram(residuals(PCB.glm)~1, sed2006mon.ov[-PCB.glm$na.action,], cutoff=100000))
Note that we use Sea Surface temperature as predictor, but it is very unlikely that surface temperature is directly correlated with the processes at the sea bottom. However, it might be indirectly correlated with some properties of sea water, and hence also with the content of the sea sediment.
Software gstat reports a problem with estimating the variogram. Note that this variable is not really suited for plain geostatistical analysis:
> plot(variogram(log1p(PCB187)~1, sed2006mon, cutoff=100000))
This shows that the variogram of the target variable is close to the pure nugget effect. Why is the nugget variation high? The noisy variogram is possible due to the low number of samples and their high clustering, but it could very be that the spatial distribution of PCB187 is controlled by factors different than geographic proximity.
Making spatial predictions
Now that we have estimated the regression model and variogram for residuals, we can proceed with making spatial predictions. We now do seperately predictions for the regression and kriging part:
# predict the GLM model: > pred.glm <- predict(PCB.glm, grids2km, type="response") > PCB187.RK <- krige(residuals(PCB.glm)~1, sed2006mon.ov[-PCB.glm$na.action,], grids2km, vgm(nugget=0.03, sill=0.05, model="Exp", range=30000)) > PCB187.RK$pred <- ifelse(is.na(grids2km$watermask), NA, pred.glm+PCB187.RK$var1.pred)
To predict value of PCB at some arbitrary location X=36774 and Y=573495 we can use:
> New.XY <- data.frame(X=36774, Y=573495) > coordinates(New.XY) <- ~X+Y > overlay(PCB187.RK["pred"], New.XY)$pred [1] 0.3578286
A relatively fast way to visualize the results of analysis in Google Earth is to export the map to SAGA GIS, then open the map and reproject it to the geographic coordinates. To do this run "Modules" -> "Projection" -> "grid" -> "PROJ.4 (Command Line Arguments, Grid)". For NL coordinate system parameters put +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.999908 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.237,50.0087,465.658,-0.406857,0.350733,-1.87035,4.0812 +units=m +no_defs and for the target coordinate system put +proj=longlat +datum=WGS84. After you have converted the grid to WGS84 geographical coordinates, open the new map, select "Map" -> "Save as image" and confirm that you want also a KML save of this map. This will produce a ground overlay as shown in figure below.
Maps of pollutants and spatial queries
Now that we have produced a map of PCB187 for the whole area of interest, we can use this information for spatial planning. For example, a question of interest to the decisiom makers could be: "What is the average value of PCB at locations where the sea bed is deep?". The answer can be generated using e.g.:
> summary(PCB187.RK@data[grids2km$srtmdem<-35,"pred"])[4] Mean 0.1093
Finally we can try to see if distance to the coast line could be used to improve predictions of PCB187. We need to derive this map first using SAGA GIS:
> grids2km$landmask <- ifelse(is.na(grids2km$watermask), 1, NA)
# writeGDAL(grids2km["landmask"], "landmask.sdat", "SAGA")
# buffer distance:
> rsaga.geoprocessor(lib="grid_tools", module=10, param=list(SOURCE="landmask.sgrd", DISTANCE="land_dist.sgrd",
+ ALLOC="tmp.sgrd", BUFFER="tmp.sgrd", DIST=sqrt(areaSpatialGrid(grids2km)), IVAL=grids2km@grid@cellsize[[1]]))
> grids2km$land.dist <- readGDAL("land_dist.sdat")$band1
# overlay again points and grids:
> sed2006mon.ov <- overlay(grids2km, sed2006mon)
> sed2006mon.ov@data <- cbind(sed2006mon@data, sed2006mon.ov@data)
> scatter.smooth(sqrt(sed2006mon.ov$land.dist), sed2006mon.ov$PCB187, xlab="Distance from the coast line", ylab="PCB187")
This plot shows that the relationship is clear, but both values of PCB187 and distance to coast line are skewed. We can now take a look at the regression model to see which is the best predictor of PCB:
> PCB.glm <- glm(PCB187~seachlo+seatemp+srtmtopo+sqrt(land.dist), sed2006mon.ov@data, family=gaussian("log"))
> summary(PCB.glm)
Call:
glm(formula = PCB187 ~ seachlo + seatemp + srtmtopo + sqrt(land.dist),
family = gaussian("log"), data = sed2006mon.ov@data)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.41826 -0.05604 0.02216 0.09619 0.44239
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.551963 1.039980 -2.454 0.01910 *
seachlo 0.062789 0.011561 5.431 4.01e-06 ***
seatemp -0.023287 0.040582 -0.574 0.56964
srtmtopo -0.035555 0.012409 -2.865 0.00692 **
sqrt(land.dist) -0.011067 0.001936 -5.717 1.66e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.0289008)
Null deviance: 10.1802 on 40 degrees of freedom
Residual deviance: 1.0404 on 36 degrees of freedom
(2 observations deleted due to missingness)
AIC: -22.281
Number of Fisher Scoring iterations: 7
This shows that distance to coast line is the best predictor of PCB187. The model is in fact highly significant but the problem remains that the number of points (<50) used to build the model is on the edge of what we typically use in geostatical analysis. Also some locations of high chlorophyll are not equaly good represented, so that the model needs to extrapolate in much of the North Sea. On the other hand, we are primarily interested in mapping concentration of poluttans in the areas close to the coast.
References
- Bivand, R., Pebesma, E., Rubio, V., 2008. Applied Spatial Data Analysis with R. Use R Series. Springer, Heidelberg.
- Grimm, R., Behrens, T., Märker, M. and Elsenbeer, H., 2008. Soil organic carbon concentrations and stocks on Barro Colorado Island -- Digital soil mapping using Random Forests analysis. Geoderma, 146(1-2): 102-113.
- Hegeman, W.J.M., Laane, R.W.P.M. 2008. Concentraties, Trends en Normtoetsing van chemische stoffen in het oppervlakte sediment van het Nederlandse Continentale Plat (1981 - 2006). Deltares, Hoofddorp, 101 p.
- Hengl, T., 2009. A Practical Guide to Geostatistical Mapping. University of Amsterdam, Amsterdam, 291 p.
- Pebesma, E., 2006. The Role of External Variables and GIS Databases in Geostatistical Analysis. Transactions in GIS, 10(4): 615-632.
|
