Spatial interpolation exercises (NL)

The following article describes four spatial interpolation exercises that have been developed with support from the EcoGRID-GAN and Geo-valley projects and are, among others, used in the Earth sciences BSc and MSc programmes at the University of Amsterdam. The exercises demonstrate various geostatistical analysis steps: how to generate maps from sampled point data, interpret the results, and make an objective evaluation of various linear geostatistical techniques. The point data is publicly available and has been obtained from the referenced websites. For more information about the various topics, please refer also to the indicated references.

The focus is put on the use of linear geostatistical techniques (regression-kriging) and the assessment of the accuracy of spatial predictions. To run this analysis on your computer/laptop, you will first need to (1) install R with all necessary packages, and SAGA GIS; (2) download and unzip the data (the zip files include R scripts) listed below; you can then run these exercises line-by-line. The sample code shown below is only used for illustration. To master the R syntax, consider obtaining some of the R manuals. The most comprehensive guide to spatial data analysis in R is the Bivand et al. (2008) book. To learn about R+SAGA integration, consider also obtaining the Hengl (2009) book.

Figure: Interpolation exercises using various environmental variables.


The following exercises use some of the common spatial packages. Before you can run any further analysis you need to first install them (e.g. from CRAN), and then you can load packages by using:

> library(rgdal)
> library(maptools)
> library(gstat)
> library(RSAGA)
> library(lattice)
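
If some of these packages are missing on your system, they can first be installed from CRAN, e.g.:

> install.packages(c("rgdal", "maptools", "gstat", "RSAGA", "lattice"))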



Spatial prediction of daily precipitation in the Netherlands

INSTRUCTIONS: The zip file contains three data sets: (1) monv_200906.txt – values of daily precipitation (in mm) measured at 326 locations over June 2009 (this data was obtained from the KNMI website); (2) NL_stations_XY.csv – coordinates of the meteorological stations (projected in the EPSG:28992 system); (3) prec6.asc – predicted long-term precipitation for June, obtained from the Worldclim project. The purpose of this exercise is to produce a series of maps of daily rainfall from the point data by using ordinary and regression-kriging techniques. We are interested in whether the rainfall variograms differ between dates, and in the overall accuracy of the produced maps.

Data import and formatting/filtering

We start with importing the point (meteorological) data. The precipitation measurements and locations come in two separate files:

# location of stations:
> NL_stations_XY <- read.csv("NL_stations_XY.csv")
> str(NL_stations_XY)
'data.frame':   326 obs. of  6 variables:
 $ Locatie : Factor w/ 326 levels "Aalsmeer ","Aalten ",..: 235 1 2 3 4 5 6 7 8 9 ...
 $ X.Nr    : int  760 458 680 572 89 664 678 560 910 441 ...
 $ X.OL    : Factor w/ 170 levels "3° 24' ","3° 26' ",..: 11 46 143 58 105 149 130 81 67 55 ...
 $ X.NB    : Factor w/ 131 levels "50° 45' ","50° 46' ",..: 25 62 49 62 126 66 107 54 39 68 ...
 $ X.X..km.: int  43 113 236 127 184 242 219 160 143 123 ...
 $ X.Y..km.: int  388 475 437 475 563 485 464 445 418 487 ...
# fix the coordinates:
> NL_stations_XY$X <- NL_stations_XY$X.X..km. * 1000
> NL_stations_XY$Y <- NL_stations_XY$X.Y..km. * 1000
# fix the station names (trim white spaces):
> NL_stations_XY$Locatie <- as.factor(gsub(pattern='[[:space:]]+$', replacement='', NL_stations_XY$Locatie))

# import the precipitation values:
> tmp <- readLines(paste(getwd(), "/", "monv_200906.txt", sep=""))
> tmp <- gsub(pattern=' . ', replacement=' 0.0 ', tmp, fixed = TRUE)
> tmp <- gsub(pattern='[[:space:]]{2,7}', replacement=',', tmp)
> write.table(t(matrix(unlist(strsplit(tmp, ",")[-c(1:8)]), nrow=41)), "tmp.txt", col.names=FALSE, row.names=FALSE)
> monv_200906 <- read.table("tmp.txt", col.names=c("empty1", "NR", paste("D", 1:30, sep=""), "I", "NORM", "II", "NORM", "III", "NORM", "MND", "NORM",  .... [TRUNCATED] 
> monv_200906$empty1 <- NULL
> monv_200906$Locatie <- as.factor(gsub(pattern='[[:space:]]+$', replacement='', monv_200906$Locatie))
> str(monv_200906)
'data.frame':   326 obs. of  40 variables:
 $ NR     : int  760 458 680 572 89 664 678 560 910 441 ...
 $ D1     : num  0 0 0 0 0 0 0 0 0 0 ...
...
 $ D30    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ I      : num  30.4 18.7 29.9 17 12.4 19.9 17.1 13 24.6 13.4 ...
 $ NORM   : num  26.9 26.1 30.2 25.7 NA 28.7 27.2 29.1 27.4 29.3 ...
 $ II     : num  65.2 32.4 45.5 28.9 32.1 37 31.7 40.4 46.4 26.4 ...
 $ NORM.1 : num  21.7 20.7 19.2 19.2 NA 20.8 18.8 19.7 19.1 22.1 ...
 $ III    : num  0.2 0 15.1 12.6 4.2 8 2.5 0.7 0 5.5 ...
 $ NORM.2 : num  23.3 23.7 28.1 25 NA 26.1 24.9 25 22.9 24.3 ...
 $ MND    : num  95.8 51.1 90.5 58.5 48.7 64.9 51.3 54.1 71 45.3 ...
 $ NORM.3 : num  71.9 70.4 77.4 69.9 NA 75.6 70.9 73.8 69.3 75.6 ...
 $ Locatie: Factor w/ 326 levels "'s Heerenhoek",..: 1 3 4 5 6 7 8 9 10 11 ...

Once we have imported and sorted both tables, we can merge them into a new table that contains both the precipitation values and the coordinates:

# merge the precipitation and coordinates:
> monv_200906.XY <- merge(x=NL_stations_XY[,c("X", "Y", "Locatie")], y=monv_200906, by=c("Locatie"), all.x=TRUE)
# str(monv_200906.XY)
# convert to a point map:
> coordinates(monv_200906.XY) <- ~X+Y

We can also import gridded maps, and borders of municipalities in the Netherlands:

# import gridded map:
> grids1km <- readGDAL("prec6.asc")
> names(grids1km) <- "prec6"
> grids1km$NLmask <- as.factor(readGDAL("NLmask.asc")$band1)

# CBS Wijk en buurtkaart 2008
> download.file("http://www.cbs.nl/NR/rdonlyres/3D8F1FF7-E6BA-430D-AF3B-9F16CB6647B9/0/buurt_2008_gen1.zip",
+      destfile=paste(getwd(), "/", "NL_buurtkaart2008.zip", sep=""))
> unzip("NL_buurtkaart2008.zip")
# import polygons and convert to line map:
> NLprovs <- readShapePoly("gem_2008_gn2.shp", proj4string=CRS("+proj=sterea +lat_0=52.15616055555555
+      +lon_0=5.38763888888889 +k=0.999908 +x_0=155000 +y_0=463000 +ellps=bessel 
+      +towgs84=565.237,50.0087,465.658,-0.406857,0.350733,-1.87035,4.0812 +units=m +no_defs")) 
# the Dutch coordinate system;
> NLprovs <- as(NLprovs, "SpatialLinesDataFrame")

Spatial interpolation of precipitation using ordinary kriging

Now that we have imported and sorted all the data (i.e. put it in the right format), we can start making predictions using simple geostatistical models. For example, we want to estimate how much rain fell on the 10th of June 2009 at the location X=155000, Y=463000.

> sel <- !is.na(monv_200906.XY$D10)
> bubble(monv_200906.XY[sel, "D10"])
# hist(log1p(monv_200906.XY$D10))  # normally distributed after log-transformation!
> D10.svar <- variogram(log1p(D10)~1, monv_200906.XY[sel,])
> D10.vgm <- fit.variogram(D10.svar, model=vgm(nugget=var(log1p(monv_200906.XY$D10), na.rm=TRUE)/3, model="Lin"))
> New.XY <- data.frame(X=155000, Y=463000)
> coordinates(New.XY) <- ~X+Y
> expm1(krige(log1p(D10)~1, monv_200906.XY[sel,], New.XY, D10.vgm)$var1.pred)  # back-transformation included
[using ordinary kriging]
[1] 1.343429

The estimated rainfall at the new location is 1.34 mm. This is a relatively small value considering that rainfall on that day ranged from 0 to 25.7 mm. Note that we need to use a log-transformation before spatial prediction because rainfall data is typically skewed. If we had run the interpolation without the log-transformation, our estimates would probably have been significantly higher (over-estimation) because of the impact of high values. Next we can try to see what the uncertainty of that estimate is, or even better, what the overall uncertainty of the estimated rainfall is for the whole area of the Netherlands. To answer this question, we can run:

> New.OK <- krige(log1p(D10)~1, monv_200906.XY[sel,], New.XY, D10.vgm) 
[using ordinary kriging]
# 95% range based on the OK variance:
> expm1(New.OK$var1.pred + 1.96*sqrt(New.OK$var1.var)); expm1(New.OK$var1.pred - 1.96*sqrt(New.OK$var1.var))
[1] 3.479269
[1] 0.2260168

and then for the whole Netherlands:

> D10.OK <- krige(log1p(D10)~1, monv_200906.XY[sel,], grids1km["NLmask"], D10.vgm, nmin=10, nmax=50)
[using ordinary kriging]
> D10.OK$pred <- expm1(D10.OK$var1.pred)
# plot the resulting map:
> D10ok.plt <- spplot(D10.OK["pred"], col.regions=rev(bpy.colors(40)), at=expm1(seq(0, max(log1p(D10.OK$pred), na.rm=TRUE), .... [TRUNCATED]
> D10var.plt <- spplot(D10.OK["var1.var"], col.regions=grey(seq(0,1,by=0.025)), 
+      sp.layout=list("sp.points", monv_200906.XY, pch="+", col="red"), main="OK variance")
> print(D10ok.plt, split=c(1,1,2,1), more=TRUE)
> print(D10var.plt, split=c(2,1,2,1), more=FALSE)
> mean(expm1(D10.OK$var1.pred + 1.96*sqrt(D10.OK$var1.var))-expm1(D10.OK$var1.pred - 1.96*sqrt(D10.OK$var1.var)), na.rm=TRUE)/2
[1] 3.977561
Figure: Rainfall (in mm) predicted using ordinary kriging (date: June 10th 2009).

This shows that the average precision of the map shown above is about +/- 4 mm. This is a relatively high number considering that the global standard deviation is:

> sd(monv_200906.XY$D10, na.rm=TRUE)
[1] 5.036219
# but this is not a good measure because the values are skewed!
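
Because the distribution of daily rainfall is skewed, a quantile range gives a more robust impression of the spread of the observed values; a small sketch:

# 90% range of the observed values (more robust for skewed data):
> quantile(monv_200906.XY$D10, probs=c(0.05, 0.95), na.rm=TRUE)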

The remaining issue is whether we could use some auxiliary map to improve the interpolation of precipitation. For example, we can consider using the long-term precipitation map for June obtained from the Worldclim project:

> monv_200906.XY.ov <- overlay(grids1km, monv_200906.XY)
> monv_200906.XY.ov@data <- cbind(monv_200906.XY.ov@data, monv_200906.XY@data)
> scatter.smooth(monv_200906.XY.ov$prec6, monv_200906.XY.ov$D10, xlab="Long-term precipitation (mm)", 
+     ylab="Precipitation on 10th of June") 
> summary(lm(log1p(D10)~prec6, monv_200906.XY.ov))

Call:
lm(formula = log1p(D10) ~ prec6, data = monv_200906.XY.ov)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.90188 -0.55536 -0.06208  0.54645  2.15420 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.956160   0.385053  -10.27   <2e-16 ***
prec6        0.079478   0.005901   13.47   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.7428 on 317 degrees of freedom
  (7 observations deleted due to missingness)
Multiple R-squared: 0.364,      Adjusted R-squared: 0.362 
F-statistic: 181.4 on 1 and 317 DF,  p-value: < 2.2e-16 

The model is significant, hence the long-term pattern of rainfall is a good predictor of daily rainfall for this date.

Figure: Correlation plot of rainfall on June 10th versus the long-term precipitation map for June obtained from the Worldclim project.

Spatial auto-correlation for different dates

In the previous example we noticed that precipitation can be interpolated using geostatistical techniques. In fact, it looks as if the target variable is correlated up to large distances. Will we get the same variograms for all other dates? To find out, we need to plot a few variograms for different dates (e.g. the 5th, 10th, 15th and 20th of June) on top of each other and compare them to see if there are large differences. This can be done in a few steps:

> D5.vgm <- fit.variogram(variogram(log1p(D5)~1, monv_200906.XY[!is.na(monv_200906.XY$D5),]), 
+    model=vgm(nugget=0, model="Exp", sill=var(log1p(monv_200906.XY$D5), na.rm=TRUE), range=50000))
> D15.vgm <- fit.variogram(variogram(log1p(D15)~1, monv_200906.XY[!is.na(monv_200906.XY$D15),]), 
+    model=vgm(nugget=0, model="Exp", sill=var(log1p(monv_200906.XY$D15), na.rm=TRUE), range=50000))
> D20.vgm <- fit.variogram(variogram(log1p(D20)~1, monv_200906.XY[!is.na(monv_200906.XY$D20),]), 
+    model=vgm(nugget=0, model="Exp", sill=var(log1p(monv_200906.XY$D20), na.rm=TRUE), range=50000))
# plot of variograms:
> plot(variogramLine(D5.vgm, maxdist=2e+05)$dist, variogramLine(D5.vgm, maxdist=2e+05)$gamma, 
+     xlim=c(0, 2e+5), ylim=c(0, var(c(log1p(monv_200906.XY$D5), log1p(monv_200906.XY$D10), 
+     log1p(monv_200906.XY$D15), log1p(monv_200906.XY$D20)), na.rm=TRUE)), xlab="Distance (m)", 
+     ylab="Semivariance", type="l", col="black", lwd=2) 
> lines(variogramLine(D10.vgm, maxdist=2e+05)$dist, variogramLine(D10.vgm, maxdist=2e+05)$gamma, col="red", lwd=2)
> lines(variogramLine(D15.vgm, maxdist=2e+05)$dist, variogramLine(D15.vgm, maxdist=2e+05)$gamma, col="blue", lwd=2)
> lines(variogramLine(D20.vgm, maxdist=2e+05)$dist, variogramLine(D20.vgm, maxdist=2e+05)$gamma, col="green", lwd=2)
Figure: Four variograms for 5th (black), 10th (red), 15th (blue) and 20th (green) of June plotted on top of each other.

This plot shows that there are large differences between the variograms for different dates. How would you explain this feature? Most probably the type of rain is determined by the meteorological conditions, i.e. the type of clouds over an area, and different types of clouds result in either local precipitation or uniform precipitation over a large area.
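
To check this more systematically, one could fit a variogram for every day of the month and compare the fitted ranges. A sketch, assuming the exponential model used above; fit.variogram may fail to converge for some (e.g. completely dry) days, hence the try() wrapper:

> ranges <- rep(NA, 30)
> for(i in 1:30){
+     Di <- paste("D", i, sep="")
+     sel.i <- !is.na(monv_200906.XY@data[,Di])
+     formula.i <- as.formula(paste("log1p(", Di, ")~1", sep=""))
+     vgm.i <- try(fit.variogram(variogram(formula.i, monv_200906.XY[sel.i,]),
+         model=vgm(psill=var(log1p(monv_200906.XY@data[,Di]), na.rm=TRUE),
+         model="Exp", range=50000, nugget=0)))
+     # row 2 of the fitted model holds the exponential component:
+     if(!inherits(vgm.i, "try-error")){ ranges[i] <- vgm.i$range[2] }
+ }
> summary(ranges)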

Locating areas of highest rainfall

In the final exercise we demonstrate how to locate the area with the highest rainfall. We focus on the 8th of June, but this time we also use an auxiliary map to improve the interpolation:

> D8.rvgm <- fit.variogram(variogram(log1p(D8)~prec6, monv_200906.XY.ov[!is.na(monv_200906.XY.ov$D8)&!is.na(monv_200906.XY.ov$prec6),]), 
+     model=vgm(nugget=0, model="Exp", sill=var(log1p(monv_200906.XY.ov$D8), na.rm=TRUE)/2, range=50000))
> D8.RK <- krige(log1p(D8)~prec6, monv_200906.XY.ov[!is.na(monv_200906.XY.ov$D8)&!is.na(monv_200906.XY.ov$prec6),],
+     grids1km, D8.rvgm, nmin=20, nmax=50)
[using universal kriging]
# mask areas of interest:
> D8.RK$pred <- ifelse(is.na(grids1km$NLmask), NA, expm1(D8.RK$var1.pred))
> spplot(D8.RK["pred"], col.regions=rev(bpy.colors(40)), 
+     at=expm1(seq(0, max(log1p(D8.RK$pred), na.rm=TRUE), by=(max(log1p(D8.RK$pred), na.rm=TRUE)-0)/40)),
+     sp.layout=list("sp.lines", NLprovs, col="black"), main="Neerslag 7-8 juni 2009")

# Locate the maximum value:
> D8.RK.pnt <- as(D8.RK["pred"], "SpatialPointsDataFrame")
> proj4string(D8.RK.pnt) <- NLprovs@proj4string
> spTransform(D8.RK.pnt[which.max(D8.RK.pnt$pred),], CRS("+proj=latlon +datum=WGS84"))
            coordinates     pred
8939 (6.94032, 53.3179) 24.88213  
# enter in Google maps and you will get - Farmsum:
> system(paste('"c:/Program Files/firefox/firefox.exe"', ' -url http://maps.google.com/maps?&ll=',
+ 53.3179,',',6.94032,'&t=h&q=', 53.3179,',',6.94032, sep=""), wait=F)
> max(monv_200906.XY.ov$D8, na.rm=TRUE)
[1] 29.4

This shows that the maximum rainfall fell in the small town of Farmsum. This estimate is still smaller than the value measured at the nearby meteorological station.


Spatial prediction of daily temperatures over Europe

INSTRUCTIONS: The zip file contains two maps: (1) GSOD20060501.shp – a point map with attached values of daily temperature (TEMP in °C) measured at 2117 locations on the 1st of May 2006, and (2) eudem.asc – an SRTM30+ digital elevation model (DEM) of the whole of Europe (5 km resolution; projected in the EPSG:3035 system). To learn more about the Daily Global Weather Measurements, 1929-2009 (NCDC, GSOD) data, see the NCDC website. The purpose of this exercise is to produce a map of mean daily temperature for Europe from the point data and the DEM.

Data import and initial analysis

We start with importing the maps to R and making some initial plots:

> GSOD20060501 <- readOGR("GSOD20060501.shp", "GSOD20060501")
OGR data source with driver: ESRI Shapefile 
Source: "GSOD20060501.shp", layer: "GSOD20060501"
with 2117 features and 3 fields
Feature type: wkbPoint with 2 dimensions

> grids5km <- readGDAL("eudem.asc")
eudem.asc has GDAL driver AAIGrid 
and has 799 rows and 664 columns
> names(grids5km) <- "eudem"
> grids5km$mask <- readGDAL("mask5km.asc")$band1
mask5km.asc has GDAL driver AAIGrid 
and has 799 rows and 664 columns
> proj4string(grids5km) <- GSOD20060501@proj4string
# overlay points and grids:
> GSOD20060501.ov <- overlay(grids5km, GSOD20060501)
> GSOD20060501.ov@data <- cbind(GSOD20060501.ov@data, GSOD20060501@data)

Next, we can generate a scatter plot to see the general relationship between temperature and elevation:

> par(mfrow=c(1,2))
> scatter.smooth(GSOD20060501.ov$eudem, GSOD20060501.ov$TEMPC, xlab="Elevation", ylab="TEMP (degree celsius)")

This shows that some of the points fall below sea level (on the sea bottom). Because we are only interested in mapping the temperature on land, we can mask out these points by using:

> GSOD20060501.ov <- subset(GSOD20060501.ov, GSOD20060501.ov$eudem> -20)
> scatter.smooth(GSOD20060501.ov$eudem, GSOD20060501.ov$TEMPC, xlab="Elevation", ylab="TEMP (degree celsius)")
# is the target variable normally distributed?
> hist(GSOD20060501.ov$TEMPC, breaks=25, col="grey") # yes
Figure: Scatter plot showing general relationship between daily temperatures (TEMP) and elevation (DEM) - before (left) and after removing points with negative elevations (right).

Temperature on 2006-05-01 in Amsterdam and Vienna?

Before we can produce any predictions, we need to estimate the geostatistical model. In this case (simple linear model), this takes one line of code:

> TEMPC.rvgm <- fit.variogram(variogram(TEMPC~eudem, GSOD20060501.ov, cutoff=400000), vgm(nugget=3, model="Lin"))
> plot(variogram(TEMPC~eudem, GSOD20060501.ov, cutoff=400000), TEMPC.rvgm)
# remark: some strange values can be observed on short distances;

Note that the model is linear, hence there is practically no limit to the spatial auto-correlation in this data. This is typical for temperature data, which varies smoothly in space. What worries us is the nugget variation, which is fairly significant (3.8).
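
The fitted nugget and slope can be checked by printing the variogram model object (the psill column lists the fitted parameters):

> TEMPC.rvgm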

Figure: Estimated variogram of residuals for model TEMP~DEM.

To get locations of Amsterdam and Vienna we can use the Google maps API service:

# set the Google key string:
googlekey <- "InsertGoogleKey"
# coordinates of Amsterdam:
> googleurl <- url(paste("http://maps.google.com/maps/geo?q=", "Amsterdam+Netherlands", "&output=csv&key=",googlekey,sep=""), open="r")
> amsterdam.XY <- as.vector(as.numeric(unlist(strsplit(readLines(googleurl, n=1, warn=FALSE), ","))))
# [1] 200.000000   4.000000  52.373801   4.890935
> New.ll <- data.frame(Lon=amsterdam.XY[4], Lat=amsterdam.XY[3], Name="Amsterdam")
> coordinates(New.ll) <- ~Lon+Lat
> proj4string(New.ll) <- CRS("+proj=lonlat +datum=WGS84")
> New.XY <- spTransform(New.ll, GSOD20060501@proj4string)
> New.XY$eudem <- overlay(grids5km, New.XY)@data$eudem
> krige(TEMPC~eudem, GSOD20060501.ov, New.XY, TEMPC.rvgm, nmax=80) 
[using universal kriging]
         coordinates var1.pred var1.var
1 (3973380, 3263800)  8.169655 3.890196

and then the same thing for Vienna:

> googleurl <- url(paste("http://maps.google.com/maps/geo?q=", "Vienna+Austria", "&output=csv&key=",googlekey,sep=""), open="r")
> vienna.XY <- as.vector(as.numeric(unlist(strsplit(readLines(googleurl, n=1, warn=FALSE), ","))))
# [1] 200.00000   4.00000  48.20921  16.37278
> New.ll <- data.frame(Lon=vienna.XY[4], Lat=vienna.XY[3], Name="Vienna")
> coordinates(New.ll) <- ~Lon+Lat
> proj4string(New.ll) <- CRS("+proj=lonlat +datum=WGS84")
> New.XY <- spTransform(New.ll, GSOD20060501@proj4string)
> New.XY$eudem <- overlay(grids5km, New.XY)@data$eudem
> krige(TEMPC~eudem, GSOD20060501.ov, New.XY, TEMPC.rvgm, nmax=80) 
[using universal kriging]
         coordinates var1.pred var1.var
1 (4794200, 2808840)   9.06544 3.863452

This shows that the mean daily temperature on May 1st 2006 was 8.2 +/- 2.0 °C in Amsterdam and 9.1 +/- 2.0 °C in Vienna.
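
Note that the reported +/- value is simply one standard deviation, i.e. the square root of the kriging variance (var1.var) returned by krige:

> sqrt(c(3.890196, 3.863452))
[1] 1.972358 1.965566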

Significance of the model

To find out how much of variability in TEMP values can be explained using elevation, we can run:

> summary(lm(TEMPC~eudem, GSOD20060501.ov@data))$adj.r.square
[1] 0.1618341

This only tells us how well we can fit the data. To estimate the true accuracy of the resulting map, we can run 3-fold cross-validation in gstat:

> TEMP.cv <- krige.cv(TEMPC~eudem, nfold=3, GSOD20060501.ov, TEMPC.rvgm, nmax=40, debug.level=-1, set=list(cn_max=1e10)) 
[using universal kriging]
100% done
[using universal kriging]
100% done
[using universal kriging]
100% done
> 1-mean(TEMP.cv$residual^2, na.rm=TRUE)/var(GSOD20060501.ov$TEMPC, na.rm=TRUE)
[1] 0.855478
> quantile(TEMP.cv$residual[!is.na(TEMP.cv$residual)], c(0.05, 0.95))
       5%       95% 
-2.319952  2.418919

The result of the cross-validation shows that the model explains >85% of the variability in the TEMP values. The overall accuracy of the map (based on the results of the cross-validation) is between -2.3 and +2.4 °C (90% of the residuals). Although elevation explains only about 16% of the variability in the data, this model is highly significant.
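
The same cross-validation residuals can also be summarized as a single RMSE value; a small sketch (the exact number will vary slightly between runs because the folds are selected at random):

# root mean square error of the cross-validation residuals:
> sqrt(mean(TEMP.cv$residual^2, na.rm=TRUE))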

We can finally run predictions for the whole study area. Because the data set is large (>1000 points), it is more efficient to use SAGA to generate the predictions instead of R+gstat. The following commands will run universal kriging in SAGA GIS (the predictions can still take a few minutes):

# export data to SAGA compatible format:
gridcell <- grids5km@grid@cellsize[1]
grids5km$eudem.f <- ifelse(grids5km$eudem < -20, NA, grids5km$eudem) 
# write to SAGA grid:
writeGDAL(grids5km["eudem.f"], "eudem.sdat", "SAGA")
writeOGR(GSOD20060501.ov["TEMPC"], "TEMP.shp", "TEMP", "ESRI Shapefile")
# universal kriging in SAGA GIS:
> rsaga.geoprocessor(lib="geostatistics_kriging", module=7, param=list(GRID="TEMPC_rk.sgrd", SHAPES="TEMP.shp",
+     GRIDS="eudem.sgrd", BVARIANCE=F, BLOCK=F, FIELD=1, BLOG=F, MODEL=3, TARGET=0, USER_CELL_SIZE=5000,
+     NUGGET=TEMPC.rvgm$psill[1], SILL=TEMPC.rvgm$psill[2], INTERPOL=0, NPOINTS_MAX=30, NPOINTS_MIN=20,
+     MAXRADIUS=1200000, USER_X_EXTENT_MIN=grids5km@bbox[1,1]+gridcell/2, USER_X_EXTENT_MAX=grids5km@bbox[1,2]-gridcell/2,
+     USER_Y_EXTENT_MIN=grids5km@bbox[2,1]+gridcell/2, USER_Y_EXTENT_MAX=grids5km@bbox[2,2]-gridcell/2))

SAGA CMD 2.0.3
library path:   C:/PROGRA~1/R/R-210~1.1/library/RSAGA/saga_vc/modules
library name:   geostatistics_kriging
module name :   Universal Kriging
author      :   (c) 2003 by O.Conrad

Load shapes: TEMP.shp...
ready

ready

Load grid: eudem.sgrd...
ready

Parameters

Grid: [not set]
Variance: [not set]
Points: TEMP.shp
Attribute: TEMPC
Create Variance Grid: no
Target Grid: user defined
Variogram Model: Linear Regression
Block Kriging: no
Block Size: 100.000000
Logarithmic Transformation: no
Nugget: 3.643687
Sill: 0.000003
Range: 100.000000
Linear Regression: 1.000000
Exponential Regression: 0.100000
Power Function - A: 1.000000
Power Function - B: 0.500000
Grids: 1 object (eudem.sgrd))
Grid Interpolation: Nearest Neighbor
Maximum Search Radius (map units): 1200000.000000
Min./Max. Number of m_Points: [20.000000] - [30.000000]

ready

Save grid: TEMPC_rk.sgrd...
ready

The final resulting map can be seen below. Note that the relationship between temperature and relief is clearly visible.
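
If needed, the SAGA output can be read back into R for further plotting. A minimal sketch, assuming the grid values were written to the TEMPC_rk.sdat file that accompanies the .sgrd header (GDAL reads the SAGA format via the .sdat file):

> TEMPC.rk <- readGDAL("TEMPC_rk.sdat")
> names(TEMPC.rk) <- "TEMPC"
> spplot(TEMPC.rk["TEMPC"], col.regions=rev(bpy.colors(40)))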

Figure: Mean daily temperature predicted for Europe for May 1st 2006.

Adding auxiliary predictors

We can consider using the distance to the coast as an additional auxiliary predictor. This can be derived using SAGA GIS and a line map of the coastline:

# convert shape file to a gridded map:
> rsaga.geoprocessor(lib="grid_gridding", module=0, param=list(GRID="coast_buffer.sgrd", INPUT="coast_europe.shp", 
+     FIELD=0, LINE_TYPE=0, USER_CELL_SIZE=gridcell, USER_X_EXTENT_MIN=grids5km@bbox[1,1]+gridcell/2,
+     USER_X_EXTENT_MAX=grids5km@bbox[1,2]-gridcell/2, USER_Y_EXTENT_MIN=grids5km@bbox[2,1]+gridcell/2,
+     USER_Y_EXTENT_MAX=grids5km@bbox[2,2]-gridcell/2))
# derive a buffer distance:
> rsaga.geoprocessor(lib="grid_tools", module=10, param=list(SOURCE="coast_buffer.sgrd", DISTANCE="land_dist.sgrd",
+     ALLOC="tmp.sgrd", BUFFER="tmp.sgrd", DIST=sqrt(areaSpatialGrid(grids5km))/3, IVAL=gridcell)) 
# this takes time!
> grids5km$land.dist <- readGDAL("land_dist.sdat")$band1
> grids5km$dcoast <- ifelse(grids5km$eudem < -10, grids5km$land.dist, -grids5km$land.dist)
> GSOD20060501.ov <- overlay(grids5km, GSOD20060501)
> GSOD20060501.ov@data <- cbind(GSOD20060501.ov@data, GSOD20060501@data)
> GSOD20060501.ov <- subset(GSOD20060501.ov, GSOD20060501.ov$eudem> -20)

and we can re-fit a linear model:

> summary(lm(TEMPC~eudem+dcoast, GSOD20060501.ov@data))

Call:
lm(formula = TEMPC ~ eudem + dcoast, data = GSOD20060501.ov@data)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.5770  -2.6206  -0.5369   2.2636  17.0814 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.057e+01  1.167e-01  90.619   <2e-16 ***
eudem       -4.299e-03  2.171e-04 -19.804   <2e-16 ***
dcoast      -1.424e-06  5.808e-07  -2.451   0.0143 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 3.755 on 1980 degrees of freedom
Multiple R-squared: 0.169,      Adjusted R-squared: 0.1682 
F-statistic: 201.3 on 2 and 1980 DF,  p-value: < 2.2e-16
Figure: Buffer distance from coast line derived in SAGA GIS.

This shows that temperature also changes with the distance to the coastline, although this relationship is much less significant than that with the DEM. It is very possible that the influence of the ocean on temperature is valid only up to some threshold distance (e.g. 50-100 km), so that one would have to use a locally adaptive model.
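
A quick way to explore this idea is to truncate the distance at an (arbitrarily chosen) threshold of 100 km and re-fit the regression model; a sketch:

# dcoast is negative over land, so cap the inland distance at 100 km:
> GSOD20060501.ov$dcoast.t <- pmax(GSOD20060501.ov$dcoast, -1e5)
> summary(lm(TEMPC~eudem+dcoast.t, GSOD20060501.ov@data))$adj.r.squared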

Spatial prediction of soil properties for the Barro Colorado Island 50 ha Plot

INSTRUCTIONS: The zip file contains three data sets: (1) soil_samples.shp – values of 19 soil parameters determined in the laboratory using soil samples from 300 locations (this data is part of the Barro Colorado Island 50 ha monitoring plot in Panama); (2) dem.asc – a 5 m digital elevation model of the area (coordinates of origin: lat=9.15125, lon=-79.8553); (3) twi.asc – a topographic wetness index indicating potential soil deposition. The purpose of this exercise is to produce maps of various soil parameters using the point data and the two auxiliary maps dem.asc and twi.asc. This data set is described in detail in Grimm et al. (2008).

Import and initial analysis

We can read the maps to R by using:

> soil_samples <- readShapePoints("soil_samples.shp", proj4string = CRS(as.character(NA)))
Field name: NA changed to: NA.
Field name: TOTAL N changed to: TOTAL.N
Field name: MIN N changed to: MIN.N
> str(soil_samples@data)
'data.frame':   300 obs. of  21 variables:
 $ GX     : num  25 25 25 25 45 25 25 23.6 25 25 ...
 $ GY     : num  25 45 75 125 125 ...
 $ AL     : num  555 1002 1035 744 998 ...
 $ B      : num  1.44 0.37 0.37 0.96 0.55 1.1 0.41 0.68 0.54 0.45 ...
 $ CA     : num  2542 691 705 2143 1842 ...
 $ CU     : num  9.66 6.19 5.84 7.84 6.02 7.53 7.42 8.33 7.32 6.64 ...
 $ FE     : num  104 104 109 150 131 ...
 $ K      : num  154 133 112 204 140 ...
 $ MG     : num  303 181 167 311 196 ...
 $ MN     : num  305 116 160 427 274 ...
 $ NA.    : num  59.5 48.5 60.7 64.7 59.7 ...
 $ P      : num  0.64 2.37 2.35 1.39 2.5 2.9 1.68 2.22 0.75 1.29 ...
 $ S      : num  0 0 2.88 0 0 ...
 $ ZN     : num  7.11 0.98 1.04 0.71 2.63 3.9 1.45 2.2 2.15 1.54 ...
 $ NH4    : num  5.39 4.27 5.13 7.69 6.36 ...
 $ NO3    : num  7.5 11.12 8.43 14.95 15.3 ...
 $ TOTAL.N: num  12.9 15.4 13.6 22.6 21.7 ...
 $ MIN.NH4: num  0.12 -1.53 -4.39 -5.86 3.36 2.72 -2.53 2.11 4.01 -3.11 ...
 $ MIN.NO3: num  8.73 -6.96 -1.39 -10.25 38.2 ...
 $ MIN.N  : num  8.85 -8.49 -5.78 -16.11 41.55 ...
 $ PH     : num  5.9 5.24 4.94 5.55 5.46 5.45 4.96 5.07 4.79 5.16 ...
 - attr(*, "data_types")= chr  "N" "N" "N" "N" ...
> grids5m <- readGDAL("dem.asc")
dem.asc has GDAL driver AAIGrid 
and has 101 rows and 201 columns
> names(grids5m) <- "dem"
> grids5m$twi <- readGDAL("twi.asc")$band1
twi.asc has GDAL driver AAIGrid 
and has 101 rows and 201 columns

This shows that the soil_samples dataset contains the local grid coordinates (GX, GY) and measured values of 19 soil parameters. We will try to interpolate a number of these parameters over the gridded locations defined in the object grids5m.

Spatial prediction of pH

We can focus further on mapping the pH of the soil. Since this is a fairly small study area, it would be interesting to see up to which distance the values are auto-correlated. First, we need to fit a variogram:

> PH.svar <- variogram(PH~1, soil_samples)
# plot(PH.svar)
> PH.vgm <- fit.variogram(PH.svar, model=vgm(nugget=0, model="Exp", sill=var(soil_samples$PH, na.rm=TRUE), range=100))
> plot(PH.svar, PH.vgm)
> max(subset(PH.svar, gamma>var(soil_samples$PH, na.rm=TRUE))$dist)
[1] 353.757
Figure: Variogram model fitted for pH using default settings in gstat.

Next we can try using the two auxiliary maps to improve the interpolation of pH. Assuming that the two maps are correlated with the target variable and that the target variable is normally distributed, we can produce predictions in four lines:

> soil_samples_ov <- overlay(grids5m, soil_samples)
> soil_samples_ov@data <- cbind(soil_samples@data, soil_samples_ov@data) 
> PH.rvgm <- fit.variogram(variogram(PH~dem+twi, soil_samples_ov), 
+      model=vgm(nugget=0, model="Exp", sill=var(soil_samples$PH, na.rm=TRUE)/2, range=100)) 
> PH.rk <- krige(PH~dem+twi, soil_samples_ov, grids5m, PH.rvgm, nmax=80)
[using universal kriging]
> writeGDAL(PH.rk[1], "PH_rk.sdat", "SAGA")

To find out what the 90% range of values of soil pH for the whole area is, we can use the quantile function:

> range.PH <- quantile(PH.rk$var1.pred, probs=c(0.05,0.95))
> range.PH
      5%      95% 
5.075561 6.137635

We have 'jumped over' some of the analysis too fast. It is maybe a good idea to take a step back and look at the individual relationships between the target variable and the predictors. How much of the variability in the pH values can be explained by the two auxiliary maps? Where are soils more acidic: on hill-tops or in depressions?

> pairs(~PH+dem+twi, soil_samples_ov@data)
> summary(lm(PH~dem+twi, soil_samples_ov@data))$adj.r.squared
[1] 0.08780846
Figure: Individual relationships shown in a correlation matrix plot.

This shows that the relationship is not that clear. It seems that the spatial distribution of the target variable can be explained by these two predictors, but only for a small portion of the variation (9%). We can visualize the resulting map of pH in SAGA GIS to see how the variable is correlated with the topography. Open the maps in SAGA and use the 3D display (see below).

Figure: Predicted pH of soil shown in SAGA GIS using the 3D (2.5D) display.

This clearly shows that the more acidic soils occur in depressions (potentially wetter areas), which largely corresponds with our expectations.

Transformation of soil variables using PCA

In a similar way we could analyze all the other soil parameters in the point data. However, most of the parameters are in fact cross-correlated. To reduce this cross-correlation we can use Principal Component Analysis and then focus on the most dominant principal components (PCs). We can, for example, interpolate the values of PC1 and produce a map using regression-kriging:

# Which soil parameters are highly cross-correlated? Plot a correlation matrix.
> sort(cor(soil_samples@data)[,"PH"])
> pairs(~PH+S+AL+MIN.NO3+P, soil_samples@data)
> soil_samples.PCA <- prcomp(~AL+B+CA+CU+FE+K+MN+NA.+P+S+ZN+NH4+NO3+TOTAL.N+MIN.NH4+MIN.NO3+MIN.N+PH, soil_samples@data)
> biplot(soil_samples.PCA, xlabs=rep(".", length(soil_samples.PCA$x[,1]))) 
> soil_samples$PC1 <- soil_samples.PCA$x[,"PC1"]
> soil_samples_ov <- overlay(grids5m, soil_samples)
> soil_samples_ov@data <- cbind(soil_samples@data, soil_samples_ov@data) 
> PC1.rvgm <- fit.variogram(variogram(PC1~dem+twi, soil_samples_ov), 
+      model=vgm(nugget=0, model="Exp", sill=var(soil_samples$PC1, na.rm=TRUE)/2, range=100)) 
> PC1.rk <- krige(PC1~dem+twi, soil_samples_ov, grids5m, PC1.rvgm, nmax=80)
[using universal kriging]
> spplot(PC1.rk[1], col.regions=bpy.colors(40), at=seq(min(soil_samples$PC1), max(soil_samples$PC1), 
+      by=(max(soil_samples$PC1)-min(soil_samples$PC1))/40), sp.layout=list("sp.points", pch="+", soil_samples, col="black"))
Figure: Map of PC1 predicted for the whole area.

This map is not easy to interpret. One would need to zoom into the Principal Component Analysis to find out which soil parameters are positively correlated with PC1 and which soil parameters dominate it.
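
The loadings (rotation) of the first component indicate which soil parameters dominate PC1; a small sketch:

# loadings of PC1, sorted from most negative to most positive:
> sort(soil_samples.PCA$rotation[,"PC1"])

Note that prcomp was run on unscaled data above, so soil parameters with a large absolute variance (e.g. AL or CA) will inevitably dominate PC1; re-running prcomp with scale.=TRUE would make the loadings easier to compare.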

Interpolation of chemicals in surface sediment for the North Sea (NL)

INSTRUCTIONS: The zip file contains five data sets: (1) sed2006mon.txt – values of various chemicals determined in the laboratory using sediment samples from 43 locations in May 2006 (this data is available via the Noordzee-Atlas project; see also Hegeman and Laane (2008)); (2) seatemp.asc – MODIS-estimated monthly Sea Surface Temperature for the same period (obtained from the OceanColor project); (3) seachlo.asc – MODIS-estimated monthly chlorophyll content of the sea for the same period (obtained from the OceanColor project); (4) srtmtopo.asc – a topographic model of the land and sea bed based on the SRTM30+ data; (5) watermask.asc – a mask showing the area of interest. The purpose of this exercise is to produce a map of PCB (Polychlorinated Biphenyls) values (micro-g/kg dry matter) using the point data and the three auxiliary maps srtmtopo.asc, seatemp.asc and seachlo.asc. All the maps are projected in the Dutch coordinate system (RDH, EPSG:28992). A similar interpolation exercise can be followed in the work of Pebesma (2006).

> sed2006mon <- read.delim("sed2006mon.txt")
> str(sed2006mon)
'data.frame':   43 obs. of  89 variables:
 $ labnr                 : int  10052461 10052462 10052463 10052464 10052465 10052466 10052467 10052468 10052469 10052470 ...
 $ lokatie               : Factor w/ 43 levels "AMLD70","APPZK20",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ x                     : int  166444 3471 -3408 107800 99538 79909 40395 102177 93115 39115 ...
 $ y                     : int  677771 391803 615153 541968 545764 549305 557312 515011 516849 642383 ...
 $ datum                 : int  38785 38804 38790 38806 38806 38805 38805 38806 38806 38792 ...
 $ G                     : num  191.4 83.8 199.4 22 7.4 ...
 $ Som.van.kalk.humus.z  : num  22.8 34.3 22.4 NA NA ...
...
 $ PCB187                : Factor w/ 32 levels "< 0.1","< 0.2",..: 1 15 1 23 21 7 7 16 20 4 ...
...
# convert the target variable to numbers:
> sel <- which(substr(sed2006mon$PCB187, 1, 2)=='< ')
> sed2006mon$PCB187 <- as.numeric(gsub(pattern='< ', replacement='', sed2006mon$PCB187, fixed=TRUE))
> sed2006mon$PCB187[sel] <- sed2006mon$PCB187[sel]/2
> coordinates(sed2006mon) <- ~x+y
> bubble(sed2006mon, "PCB187")
Figure: Sampling locations and values of PCB187 in micro-g/kg of dry matter.

We can automate the import of data by using:

> asc.list <- dir(pattern=glob2rx("*.asc"))
> grids2km <- readGDAL(asc.list[[1]])
PCB187_rk.asc has GDAL driver AAIGrid 
and has 215 rows and 133 columns
> names(grids2km) <- strsplit(asc.list[[1]], ".asc")[[1]]
> for(j in 2:length(asc.list)){ grids2km@data[,strsplit(asc.list[[j]], ".asc")[[1]]] <- readGDAL(asc.list[[j]])$band1 } 
seachlo.asc has GDAL driver AAIGrid 
and has 215 rows and 133 columns
...

Estimation of the geostatistical model

We can first produce a map of PCB for the whole of the North Sea (watermask.asc) and see how well the model estimates the values. We start by overlaying the points and grids and plotting the relationship:

> sed2006mon.ov <- overlay(grids2km, sed2006mon)
> sed2006mon.ov@data <- cbind(sed2006mon@data, sed2006mon.ov@data)
> scatter.smooth(sed2006mon.ov$seachlo, sed2006mon.ov$PCB187, xlab="MODIS Sea Chlorophyll", ylab="PCB187")
# the relationship is log-normal;

Next we need to fit a geostatistical model. Because the target variable is heavily skewed, it is probably a better idea to fit a GLM with a log-link function than to use a simple linear model:

> PCB.glm <- glm(PCB187~seachlo+seatemp+srtmtopo, sed2006mon.ov@data, family=gaussian("log"))
# mask the missing values:
> sel <- !is.na(sed2006mon.ov$seachlo) 
> PCB.rvgm <- fit.variogram(variogram(residuals(PCB.glm)~1, sed2006mon.ov[-PCB.glm$na.action,], cutoff=100000),
+      model=vgm(nugget=0, model="Exp", sill=var(residuals(PCB.glm), na.rm=TRUE)/2, range=100000)) 
Warning: singular model in variogram fit
# plot(variogram(residuals(PCB.glm)~1, sed2006mon.ov[-PCB.glm$na.action,], cutoff=100000))

Note that we use Sea Surface Temperature as a predictor, although it is very unlikely that the surface temperature is directly correlated with the processes at the sea bottom. However, it might be indirectly correlated with some properties of the sea water, and hence also with the content of the sea sediment.

Figure: Relationship between the MODIS estimated sea chlorophyll content and PCB187.

gstat reports a problem with estimating the variogram. Note that this variable is not really suited for plain geostatistical analysis:

> plot(variogram(log1p(PCB187)~1, sed2006mon, cutoff=100000))

This shows that the variogram of the target variable is close to a pure nugget effect. Why is the nugget variation so high? The noisy variogram is possibly due to the low number of samples and their strong clustering, but it could very well be that the spatial distribution of PCB187 is controlled by factors other than geographic proximity.

Making spatial predictions

Now that we have estimated the regression model and the variogram of the residuals, we can proceed with making spatial predictions. We now make predictions separately for the regression and the kriging part:

# predict the GLM model:
> pred.glm <- predict(PCB.glm, grids2km, type="response")
> PCB187.RK <- krige(residuals(PCB.glm)~1, sed2006mon.ov[-PCB.glm$na.action,], grids2km, vgm(nugget=0.03, sill=0.05, model="Exp", range=30000))
> PCB187.RK$pred <- ifelse(is.na(grids2km$watermask), NA, pred.glm+PCB187.RK$var1.pred) 

To predict value of PCB at some arbitrary location X=36774 and Y=573495 we can use:

> New.XY <- data.frame(X=36774, Y=573495)
> coordinates(New.XY) <- ~X+Y
> overlay(PCB187.RK["pred"], New.XY)$pred
[1] 0.3578286

A relatively fast way to visualize the results of the analysis in Google Earth is to export the map to SAGA GIS, then open the map and reproject it to geographic coordinates. To do this run "Modules" -> "Projection" -> "Grid" -> "PROJ.4 (Command Line Arguments, Grid)". For the NL coordinate system parameters put +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.999908 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.237,50.0087,465.658,-0.406857,0.350733,-1.87035,4.0812 +units=m +no_defs and for the target coordinate system put +proj=longlat +datum=WGS84. After you have converted the grid to WGS84 geographic coordinates, open the new map, select "Map" -> "Save as image" and confirm that you also want a KML copy of this map. This will produce a ground overlay as shown in the figure below.
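
The same reprojection can also be scripted via RSAGA. The sketch below is only indicative: it assumes that the RK predictions are first exported to PCB187_rk.sgrd and that module 2 of the pj_proj4 library corresponds to the "PROJ.4 (Command Line Arguments, Grid)" tool (module numbers differ between SAGA versions):

> writeGDAL(PCB187.RK["pred"], "PCB187_rk.sdat", "SAGA")
> rsaga.geoprocessor(lib="pj_proj4", module=2, param=list(
+     SOURCE_PROJ='"+proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.999908 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.237,50.0087,465.658,-0.406857,0.350733,-1.87035,4.0812 +units=m +no_defs"',
+     TARGET_PROJ='"+proj=longlat +datum=WGS84"',
+     SOURCE="PCB187_rk.sgrd", TARGET="PCB187_rk_ll.sgrd",
+     TARGET_TYPE=0, INTERPOL=1))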

Figure: Predicted PCB187 content in the sea bed visualized in Google Earth.

Maps of pollutants and spatial queries

Now that we have produced a map of PCB187 for the whole area of interest, we can use this information for spatial planning. For example, a question of interest to the decision makers could be: "What is the average value of PCB at locations where the sea bed is deep?". The answer can be generated using e.g.:

# note the spacing in "< -35": without it R would read an assignment!
> summary(PCB187.RK@data[grids2km$srtmtopo < -35, "pred"])[4]
  Mean 
0.1093

Finally we can try to see if distance to the coast line could be used to improve predictions of PCB187. We need to derive this map first using SAGA GIS:

> grids2km$landmask <- ifelse(is.na(grids2km$watermask), 1, NA)
# writeGDAL(grids2km["landmask"], "landmask.sdat", "SAGA")
# buffer distance:
> rsaga.geoprocessor(lib="grid_tools", module=10, param=list(SOURCE="landmask.sgrd", DISTANCE="land_dist.sgrd",
+       ALLOC="tmp.sgrd", BUFFER="tmp.sgrd", DIST=sqrt(areaSpatialGrid(grids2km)), IVAL=grids2km@grid@cellsize[[1]]))
> grids2km$land.dist <- readGDAL("land_dist.sdat")$band1
# overlay again points and grids:
> sed2006mon.ov <- overlay(grids2km, sed2006mon)
> sed2006mon.ov@data <- cbind(sed2006mon@data, sed2006mon.ov@data)
> scatter.smooth(sqrt(sed2006mon.ov$land.dist), sed2006mon.ov$PCB187, xlab="Distance from the coast line", ylab="PCB187")
Figure: Correlation plot PCB187 versus distance from the coast line.

This plot shows that the relationship is clear, but both the values of PCB187 and the distance to the coastline are skewed. We can now take a look at the regression model to see which is the best predictor of PCB:

> PCB.glm <- glm(PCB187~seachlo+seatemp+srtmtopo+sqrt(land.dist), sed2006mon.ov@data, family=gaussian("log"))
> summary(PCB.glm)

Call:
glm(formula = PCB187 ~ seachlo + seatemp + srtmtopo + sqrt(land.dist), 
    family = gaussian("log"), data = sed2006mon.ov@data)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.41826  -0.05604   0.02216   0.09619   0.44239  

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -2.551963   1.039980  -2.454  0.01910 *  
seachlo          0.062789   0.011561   5.431 4.01e-06 ***
seatemp         -0.023287   0.040582  -0.574  0.56964    
srtmtopo        -0.035555   0.012409  -2.865  0.00692 ** 
sqrt(land.dist) -0.011067   0.001936  -5.717 1.66e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for gaussian family taken to be 0.0289008)

    Null deviance: 10.1802  on 40  degrees of freedom
Residual deviance:  1.0404  on 36  degrees of freedom
  (2 observations deleted due to missingness)
AIC: -22.281

Number of Fisher Scoring iterations: 7

This shows that distance to the coastline is the best predictor of PCB187. The model is in fact highly significant, but the problem remains that the number of points (<50) used to build it is on the edge of what we typically use in geostatistical analysis. Also, some locations of high chlorophyll are not equally well represented, so that the model needs to extrapolate over much of the North Sea. On the other hand, we are primarily interested in mapping the concentration of pollutants in the areas close to the coast.
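
One simple safeguard against such extrapolation is to mask out the grid cells where the predictor values fall outside the range observed at the sample points; a sketch using the chlorophyll map:

# mask predictions where seachlo lies outside the calibration range:
> chlo.range <- range(sed2006mon.ov$seachlo, na.rm=TRUE)
> PCB187.RK$pred.m <- ifelse(grids2km$seachlo < chlo.range[1] |
+      grids2km$seachlo > chlo.range[2], NA, PCB187.RK$pred)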


References

