Spatial prediction of soil moisture
his exercise is a part of the module Meetstrategieën (introduction to Environmetrics), which is taught at the 3rd year Earth sciences BSc programme (Bachelor Aardwetenschappen). It demonstrates how to generate maps from sampled point data, interpret the results and run an objective comparison between various geostatistical techniques. The specific objective was to produce a map of the soil moisture content for a given study area, and analyze how does the soil moisture changes in relation to land cover, elevation and surface brightness. The field data (88 points) was collected by students (18), that were split into groups. The study area is a regular block of 525x605 meters in the Waterleidingduinen park (see the map of the area). The soil moisture in the area is mainly controlled by the organic matter content in the topsoil. The texture of the soils is relatively homogenous (>60% of sand).
This article reviews some study questions and suggests ways to analyze/visualize this data and make interpretation. We focus on geostatistical mapping and assessment of the accuracy of the interpolation techniques: inverse distance interpolation, ordinary kriging and regressionkriging. To run the analysis, you only need to (1) install R with all necessary packages; (2) obtain this R script, and then run the exercise linebyline. The sample codes shown below are only used for illustration. To understand the R syntax, consider obtaining some of the R manuals.

Contents 
Sampling plan and the fieldwork
Students were split in three groups, following the sampling designs shown below (generated in R using the spsample method). Each group received a GPS received with cca. 30 points, and then navigated to the sampling locations. At each location, groups made 5 measurements of soil moisture (SM), by randomly allocating measurements within 5x5 m blocks of land.
For measurements of the soil moisture, we used the Theta Probe type ML2X device, which measures the volumetric soil moisture content (the number shown below refer to a uncorrected dimensionless parameter  a ratio m3 m3). For more information on how to estimate the corrected volumetric soil moisture content, see the Theta Probe ML2X manual.
We use three auxiliary maps (all at resolution of 5 m) to improve the spatial prediction of soil moisture: (1) dtm.asc a LiDARbased digital elevation model (see AHN project), (2) geimg.asc the brightness values of a fineresolution satellite image of the area, and (3) landcov.asc a map showing the land cover classes  forests, grasslands, water bodies, shrubs and roads. These maps can be obtained from the zipped file.
Data import and preliminary analysis
We start by reading the table (wduinen.txt) and converting it to a point map:
> wduinen < read.delim("wduinen.txt") > str(wduinen) 'data.frame': 88 obs. of 11 variables: $ Code : Factor w/ 88 levels "C01","C02","C03",..: 1 2 3 4 5 ... $ Latitude : num 52.4 52.4 52.4 52.4 52.4 ... $ Longitude: num 4.55 4.55 4.55 4.55 4.55 ... $ SM1 : num 0.84 0.794 0.854 0.521 0.944 0.888 0.823 ... $ SM2 : num 0.968 0.522 0.783 0.703 0.983 ... $ SM3 : num 0.898 0.68 0.834 0.688 0.899 0.725 0.687 ... $ SM4 : num 0.89 0.721 0.816 0.861 0.855 0.732 0.97 ... $ SM5 : num 0.808 0.705 0.965 0.884 0.886 0.752 0.743 ... $ SM : num 0.881 0.684 0.850 0.731 0.913 ... $ sd : num 0.0611 0.1002 0.0692 0.1475 0.0504 ... $ Lux : num 270 805 806 830 220 308 ...
This shows that the table consists of 11 columns and 88 observations. SM15 stands for the measurements of soil moisture at the point support, SM is the averaged value (block support), sd is the standard deviation within the block and Lux are the measurements of the illuminance using Luxmeter.
To estimate the global variance of SM1 versus SM we use:
> var(wduinen$SM1); var(wduinen$SM) [1] 0.03110635 [1] 0.02096423
which shows that the variance at the point support is about 48% higher than for the block support data.
Next, we convert this table to a point map. We also want to work with the local coordinates (NL coordinate system), so we run:
> coordinates(wduinen) < ~Longitude+Latitude > proj4string(wduinen) < CRS("+proj=longlat +ellps=WGS84 +towgs84=0,0,0") # project to the Dutch coordinate system: > wduinen.nl < spTransform(wduinen, CRS("+proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.999908 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.237,50.0087,465.658,0.406857,0.350733,1.87035,4.0812 +units=m +no_defs")) # plot the points: > bubble(wduinen.nl, "SM", scales=list(draw=T, cex=.8))
We also need to import the gridded maps (using GDAL):
gridmaps < readGDAL("dtm.asc") names(gridmaps) < "dtm" gridmaps$landcov < as.factor(readGDAL("landcov.asc")$band1) gridmaps$geimg < readGDAL("geimg.asc")$band1 proj4string(gridmaps) < CRS(proj4string(wduinen.nl))
We now have all maps necessary to run the geostatistical analysis.
Analysis of spatial autocorrelation (variogram)
Before we can run any geostatistical interpolation, we need to analyze the spatial autocorrelation of the target variable. To fit the variogram for SM variable, we can use the gstat function fit.variogram:
# fit the variogram: > vt.fit < fit.variogram(variogram(SM~1, wduinen.nl), vgm(0, "Exp", 100, + sill=var(wduinen.nl$SM))) > plot(variogram(SM~1, wduinen.nl), vt.fit) > plot(variogram(SM1~1, wduinen.nl), vt.fit)
As you can notice from the plots above, the nugget variation at the point support is significant, while with the block support the nugget variation has been smoothed out. The variable is, in both cases, autocorrelated up to the distance of about 100 m:
> vt.fit$range*3 [1] 103.4554
Inverse distance interpolation vs Ordinary Kriging
We are first interested to see if ordinary kriging interpolation is significantly more accurate than inverse distance interpolation. We can run both interpolations and then compare the maps visually:
> SM.ok < krige(SM~1, wduinen.nl, gridmaps, vt.fit) # Ordinary kriging; > SM.ok$SM.id < krige(SM~1, wduinen.nl, gridmaps)$var1.pred # ID; > spplot(SM.ok[c("var1.pred", "SM.id")], scales=list(draw=T, cex=.7), + col.regions=bpy.colors(30), sp.layout=list("sp.points", wduinen.nl, + pch="+", col="black"))
Visually, there seems to be only little difference between the two maps. To make this comparison statistically sound, we will also run the leaveoneout cross validation using all points:
> SM.id.cv < krige.cv(SM~1, wduinen.nl) [inverse distance weighted interpolation] ... > SM.ok.cv < krige.cv(SM~1, wduinen.nl, vt.fit) [using ordinary kriging] ... > var(SM.id.cv$residual, na.rm=T); var(SM.ok.cv$residual, na.rm=T) [1] 0.01759620 [1] 0.01754314
The difference between the RMSE at validation points is very small. Just to be sure that there is no statistical difference, we can run a ttest (assuming a normal distribution of the errors) on the two groups of validation errors:
> t.test(SM.id.cv$residual, SM.ok.cv$residual) Welch Two Sample ttest data: SM.id.cv$residual and SM.ok.cv$residual t = 0.2729, df = 174, pvalue = 0.7852 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 0.04489375 0.03398585 sample estimates: mean of x mean of y 0.009883239 0.004429287
The nullhypothesis that the difference between the two samples equals 0 is accepted, hence there is no significant difference between the prediction power of the inverse distance interpolation and ordinary kriging.
Regression analysis and regressionkriging
In the next steps we will try to explain variability of the SM values using the auxiliary maps. Before we can fit a regression model, we need to estimate the values of predictors at sampling locations:
> wduinen.ov < overlay(gridmaps, wduinen.nl) # copy the values: > wduinen.nl$landcov < wduinen.ov$landcov > wduinen.nl$dtm < wduinen.ov$dtm > wduinen.nl$geimg < wduinen.ov$geimg
and then fit a regression model (we use the stepwise regression fitting to exclude the predictors that are not significant):
> SM.lm < lm(SM~landcov+dtm+geimg, wduinen.nl) > SM.lms < step(SM.lm) > summary(SM.lms) Call: lm(formula = SM ~ dtm + geimg, data = wduinen.nl) Residuals: Min 1Q Median 3Q Max 0.45283 0.08265 0.03485 0.08707 0.26354 Coefficients: Estimate Std. Error t value (Intercept) 1.080e+00 8.963e02 12.047 dtm 1.590e04 9.248e05 1.719 geimg 2.598e03 5.651e04 4.598 Pr(>t) (Intercept) < 2e16 *** dtm 0.0893 . geimg 1.49e05 ***  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.1294 on 84 degrees of freedom (1 observation deleted due to missingness) Multiple Rsquared: 0.2207, Adjusted Rsquared: 0.2022 Fstatistic: 11.9 on 2 and 84 DF, pvalue: 2.826e05
This shows that only dtm and geimg are predictors that are significantly correlated with SM (they explain about 20% of total variability in SM). The landcov is practically of no use for the interpolation of SM.
Before we can run regressionkriging, we still need to fit a variogram for the residuals:
# residuals: > sel < as.numeric(attr(residuals(SM.lms), "names")) # mask the NA values! > vr.fit < fit.variogram(variogram(residuals(SM.lms)~1, wduinen.nl[sel,]), + vgm(0, "Exp", 100, sill=var(wduinen.nl$SM))) > plot(variogram(residuals(SM.lms)~1, wduinen.nl[sel,]), vr.fit)
which now has a smaller sill and range parameters that in the case of vt.fit.
We have no estimated all the model parameters (significant predictors, variogram for residuals) and can proceed with making predictions. To predict a value at a single point location we can run:
> new.point < data.frame(X=98126, Y=485863, SM=NA) > coordinates(new.point) <~X+Y > proj4string(new.point) < CRS(proj4string(wduinen.nl)) > new.point.ov < overlay(gridmaps, new.point) > new.point@data < merge(new.point.ov@data, new.point@data) > krige(SM~geimg+dtm, wduinen.nl[sel,], new.point, vr.fit) [using universal kriging] coordinates var1.pred var1.var 1 (98126, 485863) 0.3238555 0.01148894
The regressionkriging variance is 0.011, which is still lower than the global variance (55% of the global variance in fact), hence the predictions are relatively precise.
We can now generate maps of soil moisture using the same model and compare them with the results of ordinary kriging:
> SM.rk < krige(SM~geimg+dtm, wduinen.nl[sel,], gridmaps, vr.fit) > SM.rk$SM.ok < SM.ok$var1.pred > spplot(SM.rk[c("var1.pred", "SM.ok")], scales=list(draw=T, cex=.7), + col.regions=bpy.colors(30), sp.layout=list("sp.points", wduinen.nl, + pch="+", col="black"))
Again, we also want to do a statistical crossvalidation and then objectively compare the two techniques:
> SM.rk.cv < krige.cv(SM~geimg+dtm, wduinen.nl[sel,], vr.fit) [using universal kriging] ... > 1var(SM.ok.cv$residual, na.rm=T)/var(wduinen.nl$SM) [1] 0.1631869 > 1var(SM.rk.cv$residual, na.rm=T)/var(wduinen.nl$SM) [1] 0.2958105
which shows that regressionkriging explains about 30% of variation, as compared to ordinary kriging that explains only 16%. This is also visible from the regresionkriging map that now nicely shows areas of bare soil (open sand dunes), which are commonly areas of very low soil moisture content.
How to improve the maps we make and explain even higher percent of variation? There are two possibilities: (1) to consider using additional predictors (e.g. soil parameters from a soil map), (2) to collect more samples in the critical areas  the areas of high prediction error (in this case these are higher elevations where we had no points).
Optional: you can generate geostatistical simulations (equiprobable realization) using the same RK model:
> SM.rk$sim1 < krige(SM~geimg+dtm, wduinen.nl[sel,], nsim=1, gridmaps, vr.fit, nmax=30)$sim1 > spplot(SM.rk[c("sim1", "SM.ok")], scales=list(draw=T, cex=.7), + col.regions=bpy.colors(30), sp.layout=list("sp.points", wduinen.nl, + pch="+", col="black"))
This shows a more realistic picture of how this variable really varies in the space. Note that, even the simulations look noisier than the original RK prediction map, the variogram of the map shown above is in fact equal to the variogram we fitted. To prove this we can randomly allocate a same number of points and then reestimate the variogram:
> rsample < spsample(SM.rk["sim1"], type="random", length(wduinen.nl$SM)) > rsmaple.ov < overlay(SM.rk["sim1"], rsample) > plot(variogram(sim1~1, rsmaple.ov), vt.fit)
As you will see, the simulated values have the same spatial autocorrelation structure. This proves that the simulations are accurately presenting the feature of interest.
References
 Bivand, R., Pebesma, E., Rubio, V., 2008. Applied Spatial Data Analysis with R. Use R Series. Springer, Heidelberg.
 Hengl, T., 2007. A Practical Guide to Geostatistical Mapping of Environmental Variables. EUR 22904 EN Scientific and Technical Research series, Office for Official Publications of the European Communities, Luxemburg, 143 pp.
 Pebesma, E., 2006. The Role of External Variables and GIS Databases in Geostatistical Analysis. Transactions in GIS, 10(4): 615632.