warning: Parameter 2 to gmap_gmap() expected to be a reference, value given in /home/spatiala/public_html/book/includes/module.inc on line 497.

# Regression-kriging

Embedded Scribd iPaper - Requires Javascript and Flash Player
2 Regression-kriging
As we saw in the previous chapter, there are many geostatistical techniques that can be used to map environmental variables. In reality, we always try to go for the most ﬂexible, most comprehensive and the most robust technique (preferably implemented in a software with an user-friendly GUI). In fact, many (geo)statisticians believe that there is only one Best Linear Unbiased Prediction (BLUP) model for spatial data, from which all other (linear) techniques can be derived (Gotway and Stroup, 1997; Stein, 1999; Christensen, 2001). As we will see in this chapter, one such generic mapping technique is regression-kriging. All other techniques mentioned previously — ordinary kriging, environmental correlation, averaging of values per polygons or inverse distance interpolation — can be seen as special cases of RK.
2.1
The Best Linear Unbiased Predictor of spatial data
Matheron (1969) proposed that a value of a target variable at some location can be modeled as a sum of the deterministic and stochastic components: Z(s) = m(s) + (s) + (2.1.1)
Fig. 2.1: A schematic example of the regression-kriging concept shown using a cross-section.
27
28
Regression-kriging
which he termed the universal model of spatial variation. We have seen in the previous sections (§1.3.1 and §1.3.2) that both deterministic and stochastic components of spatial variation can be modeled separately. By combining the two approaches, we obtain: ˆ ˆ z (s0 ) = m(s0 ) + ˆ(s0 ) e
p
=
k=0
ˆ βk · qk (s0 ) +
n
λi · e(si )
i=1
(2.1.2)
ˆ ˆ where m(s0 ) is the ﬁtted deterministic part, ˆ(s0 ) is the interpolated residual, βk are estimated deterministic e ˆ0 is the estimated intercept), λi are kriging weights determined by the spatial dependence model coefﬁcients (β ˆ structure of the residual and where e(si ) is the residual at location si . The regression coefﬁcients βk can be estimated from the sample by some ﬁtting method, e.g. ordinary least squares (OLS) or, optimally, using Generalized Least Squares (Cressie, 1993, p.166): ˆ βGLS = qT · C−1 · q
−1
· qT · C−1 · z
(2.1.3)
ˆ where βGLS is the vector of estimated regression coefﬁcients, C is the covariance matrix of the residuals, q is a matrix of predictors at the sampling locations and z is the vector of measured values of the target variable. The GLS estimation of regression coefﬁcients is, in fact, a special case of geographically weighted regression (compare with Eq.1.3.20). In this case, the weights are determined objectively to account for the spatial auto-correlation between the residuals. Once the deterministic part of variation has been estimated, the residual can be interpolated with kriging and added to the estimated trend (Fig. 2.1). Estimation of the residuals and their variogram model is an iterative process: ﬁrst the deterministic part of variation is estimated using ordinary least squares (OLS), then the covariance function of the residuals is used to obtain the GLS coefﬁcients. Next, these are used to re-compute the residuals, from which an updated covariance function is computed, and so on (Schabenberger and Gotway, 2004, p.286). Although this is recommended as the proper procedure by many geostatisticians, Kitanidis (1994) showed that use of the covariance function derived from the OLS residuals (i.e. a single iteration) is often satisfactory, because it is not different enough from the function derived after several iterations; i.e. it does not affect the ﬁnal predictions much. Minasny and McBratney (2007) reported similar results: it is often more important to use more useful and higher quality data than to use more sophisticated statistical methods. In some situations1 however, the model needs to be ﬁtted using the most sophisticated technique to avoid making biased predictions. In matrix notation, regression-kriging is commonly written as (Christensen, 2001, p.277): ˆ ˆ ˆ zRK (s0 ) = qT · βGLS + λT · (z − q · βGLS ) 0 0 (2.1.4)
ˆ where z (s0 ) is the predicted value at location s0 , q0 is the vector of p + 1 predictors and λ0 is the vector of n kriging weights used to interpolate the residuals. The model in Eq.(2.1.4) is considered to be the Best Linear Predictor of spatial data (Christensen, 2001; Schabenberger and Gotway, 2004). It has a prediction variance that reﬂects the position of new locations (extrapolation effect) in both geographical and feature space:
T −1
ˆ2 σRK (s0 ) = (C0 + C1 ) − cT · C−1 · c0 + q0 − qT · C−1 · c0 0
· qT · C−1 · q
· q0 − qT · C−1 · c0
(2.1.5)
where C0 + C1 is the sill variation and c0 is the vector of covariances of residuals at the unvisited location.
1 For example: if the points are extremely clustered, and/or if the sample is using non-standard techniques.
100, and/or if the measurements are noisy or obtained
2.1 The Best Linear Unbiased Predictor of spatial data
29
Fig. 2.2: Whether we will use pure regression model, pure kriging or hybrid regression-kriging is basically determined by R-square: (a) if R-square is high, then the residuals will be inﬁnitively small; (c) if R-square is insigniﬁcant, then we will probably ﬁnish with using ordinary kriging; (b) in most cases, we will use a combination of regression and kriging.
If the residuals show no spatial auto-correlation (pure nugget effect), the regression-kriging (Eq.2.1.4) converges to pure multiple linear regression (Eq.1.3.14) because the covariance matrix (C) becomes identity matrix:   C0 + C1   . . C= .   0 ··· C0 + C1 0 0 0 C0 + C1      = (C0 + C1 ) · I   (2.1.6)
so the kriging weights (Eq.1.3.4) at any location predict the mean residual i.e. 0 value. Similarly, the regressionkriging variance (Eq.2.1.5) reduces to the multiple linear regression variance (Eq.1.3.16): ˆ2 σRK (s0 ) = (C0 + C1 ) − 0 + qT · qT · 0 1 (C0 + C1 )
−1
·q
−1
· q0 · q0
ˆ2 σRK (s0 ) = (C0 + C1 ) + (C0 + C1 ) · qT · qT · q 0
and since (C0 + C1 ) = C(0) = MSE, the RK variance reduces to the MLR variance: ˆ2 ˆ2 σRK (s0 ) = σOLS (s0 ) = MSE · 1 + qT · qT · q 0
−1
· q0
(2.1.7)
Likewise, if the target variable shows no correlation with the auxiliary predictors, the regression-kriging model reduces to ordinary kriging model because the deterministic part equals the (global) mean value (Fig. 2.2c, Eq.1.3.25). The formulas above show that, depending on the strength of the correlation, RK might turn into pure kriging — if predictors are uncorrelated with the target variable — or pure regression — if there is signiﬁcant correlation and the residuals show pure nugget variogram (Fig. 2.2). Hence, pure kriging and pure regression should be considered as only special cases of regression-kriging (Hengl et al., 2004a, 2007a).
30 2.1.1 Mathematical derivation of BLUP
Regression-kriging
Understanding how a prediction model is derived becomes important once we start getting strange results or poor cross-validation scores. Each model is based on some assumptions that need to be respected and taken into account during the ﬁnal interpretation of results. A detailed derivation of the BLUP for spatial data can be followed in several standard books on geostatistics (Stein, 1999; Christensen, 2001); one of the ﬁrst complete derivations is given by Goldberger (1962). Here is a somewhat shorter explanation of how BLUP is derived, and what the implications of various mathematical assumptions are. All ﬂavors of linear statistical predictors share the same objective of minimizing the estimation error variˆE ance σ2 (s0 ) under the constraint of unbiasedness (Goovaerts, 1997). In mathematical terms, the estimation error: ˆ σ2 (s0 ) = E ˆ ˆ z (s0 ) − z(s0 ) · z (s0 ) − z(s0 )
T
(2.1.8)
is minimized under the (unbiasedness) constraint that: ˆ E z (s0 ) − z(s0 ) = 0 (2.1.9)
Assuming the universal model of spatial variation, we can deﬁne a generalized linear regression model (Goldberger, 1962):
z(s) = qT · β + (s) E { (s)} = 0 E ·
T
(2.1.10) (2.1.11) (2.1.12)
(s) = C
where is the residual variation, and C is the n×n positive-deﬁnite variance-covariance matrix of residuals. This model can be read as follows: (1) the information signal is a function of deterministic and residual parts; (2) the best estimate of the residuals is 0; (3) the best estimate of the correlation structure of residuals is the variance-covariance matrix. Now that we have deﬁned the statistical model and the minimization criteria, we can derive the best linear unbiased prediction of the target variable: ˆT ˆ z (s0 ) = δ0 · z (2.1.13)
Assuming that we use the model shown in Eq.(2.1.10), and assuming that the objective is to minimize ˆE the estimation error σ2 (s0 ), it can be shown2 that BLUP parameters can be obtained by solving the following system:       (2.1.14)  C q   δ   c   · = 0        qT 0 φ q0
where c0 is the vector of n×1 covariances at a new location, q0 is the vector of p×1 predictors at a new location, and φ is a vector of Lagrange multipliers. It can be further shown that, by solving the Eq.(2.1.14), we get the following:
2
The actual derivation of formulas is not presented. Readers are advised to obtain the paper by Goldberger (1962).
2.1 The Best Linear Unbiased Predictor of spatial data
31
ˆ ˆ ˆ ˆ z (s0 ) = qT · β + λT · (z − q · β) 0 0 ˆ β = qT · C−1 · q ˆ λ0 = C
−1 −1
· qT · C−1 · z
(2.1.15)
· c0
which is the prediction model explained in the previous section. Under the assumption of the ﬁrst order stationarity i.e. constant trend: E {z(s)} = µ ∀s ∈ (2.1.16)
the Eq.(2.1.15) modiﬁes to (Schabenberger and Gotway, 2004, p.268): ˆ ˆ z (s0 ) = µ + λT · (z − µ) 0 ˆ λ0 = C−1 · c0
i.e. to ordinary kriging (§1.3.1). If we assume that the deterministic part of variation is not constant, then we need to consider obtaining a number of covariates (q) that can be used to model the varying mean value. Another important issue you need to know about the model in Eq.(2.1.15) is that, in order to solve the residual part of variation, we need to know covariances at new locations: C e(s0 ), e(si ) = E e(s0 ) − µ · e(si ) − µ (2.1.17)
which would require that we know the values of the target variable at a new location (e(s0 )), which we of ˆ course do not know. Instead, we can use the existing sampled values (e(si ) = z(si ) − z (si )) to model the covariance structure using a pre-deﬁned mathematical model (e.g. Eq.1.3.8). If we assume that the covariance model is the same (constant) in the whole area of interest, then the covariance is dependent only on the separation vector h: C e(s0 ), e(si ) = C(h) (2.1.18)
which is known as the assumption of second order stationarity; and which means that we can use the same model to predict values anywhere in the area of interest (global estimation). If this assumption is not correct, we would need to estimate covariance models locally. This is often not so trivial because we need to have a lot of points (see further §2.2), so the assumption of second order stationarity is very popular among geostatisticians. Finally, you need to also be aware that the residuals in Eq.(2.1.10) are expected to be normally distributed around the regression line and homoscedastic3 , as with any linear regression model (Kutner et al., 2004). If this is not the case, then the target variable needs to be transformed until these conditions are met. The ﬁrst and second order stationarity, and normality of residuals/target variables are rarely tested in real case studies. In the case of regression-kriging (see further §2.1), the target variable does not have to be stationary but its residuals do, hence we do not have to test this property with the original variable. In the case of regression-kriging in a moving window, we do not have to test neither ﬁrst nor second order stationarity. Furthermore, if the variable is non-normal, then we can use some sort of GLM to ﬁt the model. If this is successful, the residuals will typically be normal around the regression line in the transformed space, and this will allow us to proceed with kriging. The predicted values can ﬁnally be back-transformed to the original scale using the inverse of the link function.
3
Meaning symmetrically distributed around the feature space and the regression line.
32
Regression-kriging
The lesson learned is that each statistical spatial predictor comes with: (a) a conceptual model that explains the general relationship (e.g. Eq.2.1.10); (b) model-associated assumptions (e.g. zero mean estimation error, ﬁrst or second order stationarity, normality of the residuals); (c) actual prediction formulas (Eq.2.1.15); and (d) a set of proofs that, under given assumptions, a prediction model is the BLUP Ignoring the important model . assumptions can lead to poor predictions, even though the output maps might appear to be visually ﬁne. 2.1.2 Selecting the right spatial prediction technique
Knowing that the most of the linear spatial prediction models are more or less connected, we can start by testing the most generic technique, and then ﬁnish by using the most suitable technique for our own case study. Pebesma (2004, p.689), for example, implemented such a nested structure in his design of the package. An user can switch between one and another technique by following a simple decision tree shown in Fig. 2.3. First, we need to check if the deterministic model is deﬁned already, if it has not been, we can try to correlate the sampled variables with Do the environmental factors. If the environmental Is the REGRESSIONYES YES residuals show KRIGING physical model spatial autofactors are signiﬁcantly correlated, we can ﬁt (calibration) known? correlation? a multiple linear regression model (Eq.1.3.14) NO and then analyze the residuals for spatial auDETERMINISTIC MODEL tocorrelation. If the residuals show no spatial autocorrelation (pure nugget effect), we Is the Do the REGRESSIONYES YES variable correlated residuals show proceed with OLS estimation of the regression KRIGING with environmental spatial auto(GLS) coefﬁcients. Otherwise, if the residuals show factors? correlation? spatial auto-correlation, we can run regressionNO ENVIRONMENTAL kriging. If the data shows no correlation with CORRELATION (OLS) environmental factors, then we can still anaDoes the Can a lyze the variogram of the target variable. This YES YES variable shows variogram with >1 ORDINARY time, we might also consider modeling the KRIGING spatial autoparameters correlation? be fitted? anisotropy. If we can ﬁt a variogram different from pure nugget effect, then we can run ordinary kriging. Otherwise, if we can only ﬁt a NO SPATIAL linear variogram, then we might just use some MECHANICAL PREDICTION INTERPOLATORS POSSIBLE mechanical interpolator such as the inverse distance interpolation. If the variogram of the target variable Fig. 2.3: A general decision tree for selecting the suitable spatial shows no spatial auto-correlation, and no cor- prediction model based on the results of model estimation. Similar relation with environmental factors, this practi- decision tree is implemented in the gstat package. cally means that the only statistically valid prediction model is to estimate a global mean for the whole area. Although this might frustrate you because it would lead to a nonsense map where each pixel shows the same value, you should be aware that even this is informative4 . How does the selection of the spatial prediction model works in practice? In the package, a user can easily switch from one to other prediction model by changing the arguments in the generic krige function in (Fig. 1.13; see further §3.2). For example, if the name of the input ﬁeld samples is meuse and the prediction locations (grid) is deﬁned by meuse.grid, we can run the inverse distance interpolation (§1.2.1) by specifying (Pebesma, 2004):
gstat
NO
NO
NO
NO
gstat
R
> > > > >
library(gstat) data(meuse) coordinates(meuse) <- ∼ x+y data(meuse.grid) coordinates(meuse.grid) <- ∼ x+y
4
Sometimes an information that we are completely uncertain about a feature is better than a colorful but completely unreliable map.
2.1 The Best Linear Unbiased Predictor of spatial data
33
> gridded(meuse.grid) <- TRUE > zinc.id <- krige(zinc ∼ 1, data=meuse, newdata=meuse.grid) [inverse distance weighted interpolation]
where zinc is the sampled environmental variable (vector) and zinc.id is the resulting raster map (shown in Fig. 1.13). Instead of using inverse distance interpolation we might also try to ﬁt the values using the coordinates and a 2nd order polynomial model:
> zinc.ts <- krige(zinc

x+y+x*y+x*x+y*y, data=meuse, newdata=meuse.grid)
[ordinary or weighted least squares prediction]
which can be converted to the moving surface ﬁtting by adding a search window:
> zinc.mv <- krige(zinc

x+y+x*y+x*x+y*y, data=meuse, newdata=meuse.grid, nmax=20)
[ordinary or weighted least squares prediction]
If we add a variogram model, then kriging (§1.3.1):
gstat will instead of running inverse distance interpolation run ordinary
> zinc.ok <- krige(log1p(zinc) ∼ 1, data=meuse, newdata=meuse.grid, + model=vgm(psill=0.714, "Exp", range=449, nugget=0)) [using ordinary kriging]
where vgm(psill=0.714, "Exp", range=449, nugget=0) is the Exponential variogram model with a sill parameter of 0.714, range parameter of 449 m and the nugget parameter of 0 (the target variable was logtransformed). Likewise, if there were environmental factors signiﬁcantly correlated with the target variable, we could run OLS regression (§1.3.2) by omitting the variogram model:
> zinc.ec <- krige(log1p(zinc)

dist+ahn, data=meuse, newdata=meuse.grid)
[ordinary or weighted least squares prediction]
where dist and ahn are the environmental factor used as predictors (raster maps), which are available as separate layers within the spatial layer5 meuse.grid. If the residuals do show spatial auto-correlation, then we can switch to universal kriging (Eq.2.1.4) by adding the variogram:
> zinc.rk <- krige(log1p(zinc) ∼ dist+ahn, data=meuse, newdata=meuse.grid, + model=vgm(psill=0.151, "Exp", range=374, nugget=0.055)) [using universal kriging]
If the model between the environmental factors and our target variable is deterministic, then we can use the point samples to calibrate our predictions. The command would then look something like this:
R
> zinc.rkc <- krige(zinc ∼ zinc.df, data=meuse, newdata=meuse.grid, + model=vgm(psill=3, "Exp", range=500, nugget=0)) [using universal kriging]
where zinc.df are the values of the target variable estimated using a deterministic function. In , a user can also easily switch from estimation to simulations (§2.4) by adding to the command above an additional argument: nsim=1. This will generate Sequential Gaussian Simulations using the same prediction model. Multiple simulations can be generated by increasing the number set for this argument. In addition, a user can switch from block predictions by adding argument block=100; and from global estimation of weights by adding a search radius or maximum number of pairs, e.g. radius=1000 or nmax=60. 6 By using the package one needs to specify even less arguments. For example, the command:
gstat
automap
5 6
In R a SpatialGridDataframe object.
http://cran.r-project.org/web/packages/automap/
34
Regression-kriging

> zinc.rk <- autoKrige(log1p(zinc) [using universal kriging]
dist, data=meuse, newdata=meuse.grid)
will do much of the standard geostatistical analysis without any intervention from the user: it will ﬁlter the duplicate points where needed, estimate the residuals, then ﬁt the variogram for the residuals, and generate the predictions at new locations. The results can be plotted in a single page in a form of a report. Such generic commands can signiﬁcantly speed up data processing, and make it easier for a non-geostatistician to generate maps (see further section 2.10.3). In the package7 , one needs to set even less parameters to generate predictions from a variety of methods:
intamap
> meuse\$value <- log(meuse\$zinc) > output <- interpolate(data=meuse, newdata=meuse.grid) R 2009-11-11 17:09:14 interpolating 155 observations, 3103 prediction locations [Time models loaded...]  "estimated time for copula 133.479866956255" Checking object ... OK
which gives the (presumably) best interpolation method8 for the current problem (value column), given the time available set with maximumTime. A more systematic strategy to select the right spatial prediction technique is to use objective criteria of mapping success (i.e. a posteriori criteria). From the application point of view, it can be said that there are (only) ﬁve relevant criteria to evaluate various spatial predictors (see also §1.4): (1.) the overall mapping accuracy, e.g. standardized RMSE at control points — the amount of variation explained by the predictor expressed in %; (2.) the bias, e.g. mean error — the accuracy of estimating the central population parameters; (3.) the model robustness, also known as model sensitivity — in how many situations would the algorithm completely fail / how much artifacts does it produces?; (4.) the model reliability — how good is the model in estimating the prediction error (how accurate is the prediction variance considering the true mapping accuracy)?; (5.) the computational burden — the time needed to complete predictions; From this ﬁve, you could derive a single composite measure that would then allow you to select ‘the optimal’ predictor for any given data set, but this is not trivial! Hsing-Cheng and Chun-Shu (2007) suggest a framework to select the best predictor in an automated way, but this work would need to be much extended. In many cases we simply ﬁnish using some naïve predictor — that is predictor that we know has a statistically more optimal alternative9 , but this alternative is simply not practical. The decision tree, shown in Fig. 2.4, is an example of how the selection of the method can be automated to account for (1) anisotropy, (2) speciﬁed observation errors, and (3) extreme values. This is a speciﬁc application primarily developed to interpolate the radioactivity measurements from the European radiological data exchange platform, a network of around 4000 sensors. Because the radioactivity measurements can often carry local extreme values, robust techniques need to be used to account for such effects. For example, Copula kriging10 methods can generate more accurate maps if extreme values are also present in the observations. The problem of using methods such as Copula kriging, however, is that they can often take even few hours to generate maps even for smaller areas. To minimize the risk of running into endless computing, the authors of the decision tree have decided to select the prediction algorithm based on
intamap
intamap
9 For example, instead of using the REML approach to variogram modeling, we could simply ﬁt a variogram using weighted least squares (see §1.3.1), and ignore all consequences (Minasny and McBratney, 2007). 10 Copula kriging is a sophistication of ordinary kriging; an iterative technique that splits the original data set and then re-estimates the model parameters with maximization of the corresponding likelihood function (Bárdossy and Li, 2008).
http://cran.r-project.org/web/packages/intamap/ intamap automatically chooses between: (1) kriging, (2) copula methods, (3) inverse distance interpolation, projected spatial gaussian process methods in the psgp package, (4) transGaussian kriging or yamamoto interpolation.
7 8
the interpolation result e.g. as a GML document or coverage. To encode the interpolation error UncertML, a markup language for specifying information that is represented probabilistically, has been developed within the data 2.1 The Best Linear Unbiased Predictor of spatial project, which OGC has currently released as 35 an OGC Discussion Paper .
Data (observations) [Interpolation domain] Yes Correct for it Coordinate reference sytems: Reproject?
Anisotropy: significant? No
Observation errors specified? No Extreme value distribution? No Modelling: Interpolation: Sample variograms, Variogram model fitting Ordinary kriging
Yes
Yes
Maximum Likelihood Copula kriging
Maximum likelihood Projected sparse Gaussian process
Spatial aggregation Reformatting
Output
Figure 2: Decision tree for the interpolation method choices in the interpolation process that takes place in R. References in text.
the back end and interpolation decision tree The procedure needed statistical analysis The Rcomputational time. Hence the system ﬁrst estimates the approximate timefor the to run the prediction using the most sophisticated technique; if this is above the threshold time, the system will switch to a more of naïvedata are implemented in R, the major open source of intamap suggest 30 analysing stathe method (Pebesma et al., 2009). As a rule of thumb, the authors environment for seconds as the threshold time to ﬁgure 1 shows, this is not noticeable for the tistical data. As accept automated generation of a map via a web-service. user of the INTAMAP web processing service, as R is run in the back end. Interfacing R from the web processing 2.1.3 The Best Combined Spatial Predictor service by using the TCP/IP protocol (i.e., as a web service, using the Rserve package) has the advantage that the R process, are mutually independent11 ,work, may be generated as a a Assuming that a series of prediction techniques doing the numerical predictions can be running on weighted average from cluster not directly generating the the internet. Multiple interpolation dedicated computingmultiple predictions i.e. byconnected toBest Combined Spatial Prediction (BSCP): requests at the same time will(sbe· executed in 0parallel. A. + zSPj (s0 ) advantage of having all ˆ ˆ ˆ zSP1 0 ) σ 1(s ) + zSP2 (s ) · σ 1(s ) + . . second · σ 1(s ) ˆ SP1 ˆ SP2 ˆ SPj ˆ in (s0 ) = zBCSPthe R environment is that p can be re-used independently of (2.1.19) statistical routines it 1 the WPS ˆ σ ) interface, e.g. interactively on a PC, from a SOAP(sinterface, or on a mobile device. The j=1 SPj decision tree for choosing an interpolation method automatically is shown in Figure 2. In where σSPj (s0 )of the prediction error project, by the modelinterpolation methodsphave been imthe context is the INTAMAP estimated dedicated (prediction variance), and is the number of predictors. For example, we can generate a combined prediction using OK and e.g. GLM-regression and plemented for (i)two maps (Fig. 2.5). The predictionsanisotropy, partsdealing with extreme value then sum-up the detecting and correcting for will in some (ii) of the study are look more as OK, in others more as GLM, with known measurement errors. distributions, (iii) dealingwhich actually depicts extrapolation areas of both methods. This map is very
0 0 0 0
Fig. 2.4: Decision tree used in the intamap interpolation service for automated mapping. After Pebesma et al. (2009).
similar to predictions produced using regression-kriging (see further Fig. 5.9); in fact, one could probably Methods for network harmonisation were also developed, but are not part of the aumathematically prove that under ideal conditions (absolute stationarity of residuals; no spatial clustering; perfect linear relationship), BCSP predictions would equal the regression-kriging predictions. In general, the tomated interpolation framework, as this should be done before interpolation takes place. map in the middle of Fig. 2.5 looks more as the GLM-regression map because this map is about 2–3 times The same is true forOK map. It is important to monitoring networkto combine various With the we more precise than the outlier removal and emphasize that, in order optimisation. predictors, softdo developed for INTAMAP, it would be relatively simple to customize the INTAMAP ware need to have an estimate of the prediction uncertainty, otherwise we are not able to assign the weights (see further §7.5). In principle, linear combination of statistical techniques web service theoretical basis exists that incorporates such combination.using the Eq.(2.1.19) above should be and perform these manipulations. avoided if a
4
11
If Othey do not use the same model parameters; if they treat different parts of spatial variation etc. PERATIONAL PERFORMANCE
At the stage of writing this paper, the INTAMAP interpolation service is fully functional, and open for testing. During the testing period, the the following issues need to be further
36
Regression-kriging
Fig. 2.5: Best Combined Spatial Predictor as weighted average of ordinary kriging (zinc.ok) and GLM regression (zinc.glm).
In the example above (GLM+OK), we assume that the predictions/prediction errors are independent, and they are probably not. In addition, a statistical theory exists that supports a combination of regression and kriging (see previously §2.1.1), so there is no need to run predictions separately and derive an unrealistic measure of model error. The BCSP can be only interesting for situations where there are indeed several objective predictors possible, where no theory exists that reﬂects their combination, and/or where ﬁtting of individual models is faster and less troublesome than ﬁtting of a hybrid model. For example, ordinary kriging can be speed-up by limiting the search radius, predictions using GLMs is also relatively inexpensive. External trend kriging using a GLM in package might well be the statistically most robust technique you could possibly use, but it can also be beyond the computational power of your PC. The combined prediction error of a BCSP can be estimated as the smallest prediction error achieved by any of the prediction models:
geoRglm
ˆ ˆ ˆ σBCSP (s0 ) = min σSP1 (s0 ), . . . , σSPj (s0 )
(2.1.20)
which is really an ad hoc formula and should be used only to visualize and depict problematic areas (highest prediction error). 2.1.4 Universal kriging, kriging with external drift
The geostatistical literature uses many different terms for what are essentially the same or at least very similar techniques. This confuses the users and distracts them from using the right technique for their mapping projects. In this section, we will show that both universal kriging, kriging with external drift and regressionkriging are basically the same technique. Matheron (1969) originally termed the technique Le krigeage universel, however, the technique was intended as a generalized case of kriging where the trend is modeled as a function of coordinates. Thus, many authors (Deutsch and Journel, 1998; Wackernagel, 2003; Papritz and Stein, 1999) reserve the term Universal Kriging (UK) for the case when only the coordinates are used as predictors. If the deterministic part of variation (drift) is deﬁned externally as a linear function of some explanatory variables, rather than the coordinates, the term Kriging with External Drift (KED) is preferred (Wackernagel, 2003; Chiles and Delﬁner, 1999). In the case of UK or KED, the predictions are made as with kriging, with the difference that the covariance matrix of residuals is extended with the auxiliary predictors qk (si )’s (Webster and Oliver, 2001, p.183). However, the drift and residuals can also be estimated separately and then summed. This procedure was suggested by Ahmed and de Marsily (1987); Odeh et al. (1995) later named it Regression-kriging, while Goovaerts (1997, §5.4) uses the term Kriging with a trend model to refer to a family of predictors, and refers to RK as Simple kriging with varying local means. Although equivalent, KED and RK differ in the computational steps used.
2.1 The Best Linear Unbiased Predictor of spatial data
37
Let us zoom into the two variants of regression-kriging. In the case of KED, predictions at new locations are made by:
n
ˆ zKED (s0 ) =
i=1
w iKED (s0 )·z(si )
(2.1.21)
for
n
w iKED (s0 )·qk (si ) = qk (s0 );
k = 1, ..., p
(2.1.22)
i=1
or in matrix notation:
T ˆ zKED (s0 ) = δ0 · z
(2.1.23)
where z is the target variable, qk ’s are the predictor variables i.e. values at a new location (s0 ), δ0 is the vector of KED weights (w iKED ), p is the number of predictors and z is the vector of n observations at primary locations. The KED weights are solved using the extended matrices:
KED KED λKED = w1 (s0 ), ..., w n (s0 ), ϕ0 (s0 ), ..., ϕ p (s0 ) 0
T
= CKED−1 · cKED 0
(2.1.24)
where λKED is the vector of solved weights, ϕ p are the Lagrange multipliers, CKED is the extended covariance 0 matrix of residuals and cKED is the extended vector of covariances at a new location. 0 In the case of KED, the extended covariance matrix of residuals looks like this (Webster and Oliver, 2001, p.183):    C(s1 , s1 )   . .  .     C(sn , s1 )  KED C =  1    q1 (s1 )   . .  .   q p (s1 ) and cKED like this: 0 cKED = C(s0 , s1 ), ..., C(s0 , sn ), q0 (s0 ), q1 (s0 ), ..., q p (s0 ) 0
T
···
C(s1 , sn ) . . . C(sn , sn ) 1 q1 (sn ) . . . q p (sn )
1 . . . 1 0 0 0 0
q1 (s1 ) . . . q1 (sn ) 0 0 . . . 0
···
··· ··· ···
··· ··· ···
···
···
       q p (sn )     0    0   . .  .   0
q p (s1 ) . . .
(2.1.25)
;
q0 (s0 ) = 1
(2.1.26)
Hence, KED looks exactly as ordinary kriging (Eq.1.3.2), except the covariance matrix and vector are extended with values of auxiliary predictors. In the case of RK, the predictions are made separately for the drift and residuals and then added back together (Eq.2.1.4):
38
Regression-kriging
ˆ ˆ zRK (s0 ) = qT · βGLS + λT · e 0 0 It can be demonstrated that both KED and RK algorithms give exactly the same results (Stein, 1999; Hengl ˆ et al., 2007a). Start from KED where the predictions are made as in ordinary kriging using zKED (s0 ) = λT · z. KED The KED kriging weights (λT ) are obtained by solving the system (Wackernagel, 2003, p.179): KED        C q   λ      ·  KED  =  c0        qT 0 φ q0 where φ is a vector of Lagrange multipliers. Writing this out yields: C · λKED + q · φ = c0 qT · λKED = q0 from which follows: qT · λKED = qT · C−1 · c0 − qT · C−1 · q · φ and hence: φ = qT · C−1 · q
−1
(2.1.27)
(2.1.28)
· qT · C−1 · c0 − qT · C−1 · q
−1
· q0
(2.1.29)
where the identity qT · λKED = q0 has been used. Substituting φ back into Eq. (2.1.27) shows that the KED weights equal (Papritz and Stein, 1999, p.94): λKED = C−1 · c0 − C−1 · q · qT · C−1 · q
−1 −1
· qT · C−1 · c0 − qT · C−1 · q
−1
· q0
= C−1 · c0 + q · qT · C−1 q·
· q0 − qT · C−1 · c0
(2.1.30)
Let us now turn to RK. Recall from Eq.(2.1.3) that the GLS estimate for the vector of regression coefﬁcients is given by: ˆ βGLS = qT · C−1 · q and weights for residuals by: λT = cT · C−1 0 0 and substituting these in RK formula (Eq.2.1.4) gives: ˆ ˆ = qT · βGLS + λT · (z − q · βGLS ) 0 0 = qT · qT · C−1 · q 0
−1 −1
· qT · C−1 · z
(2.1.31)
(2.1.32)
· qT · C−1 + cT · C−1 − cT · C−1 · q · qT · C−1 q 0 0
−1 −1
−1
· qT · C−1 · z (2.1.33)
= C−1 · cT + qT · qT · C−1 · q 0 0 = C−1 · c0 + q · qT · C−1 · q
· qT − cT · C−1 · q · qT · C−1 q 0 ·z
−1
· qT · z
· q0 − qT · C−1 c0
2.1 The Best Linear Unbiased Predictor of spatial data
39
The left part of the equation is equal to Eq.(2.1.30), which proves that KED will give the same predictions as RK if same inputs are used. A detailed comparison of RK and KED using the 5–points example in MS Excel is also available as supplementary material12 . Although the KED seems, at ﬁrst glance, to be computationally more straightforward than RK, the variogram parameters for KED must also be estimated from regression residuals, thus requiring a separate regression modeling step. This regression should be GLS because of the likely spatial correlation between residuals. Note that many analyst use instead the OLS residuals, which may not be too different from the GLS residuals (Hengl et al., 2007a; Minasny and McBratney, 2007). However, they are not optimal if there is any spatial correlation, and indeed they may be quite different for clustered sample points or if the number of samples is relatively small ( 200). A limitation of KED is the instability of the extended matrix in the case that the covariate does not vary smoothly in space (Goovaerts, 1997, p.195). RK has the advantage that it explicitly separates trend estimation from spatial prediction of residuals, allowing the use of arbitrarily-complex forms of regression, rather than the simple linear techniques that can be used with KED (Kanevski et al., 1997). In addition, it allows the separate interpretation of the two interpolated components. For these reasons the use of the term regression-kriging over universal kriging has been advocated by the author (Hengl et al., 2007a). The emphasis on regression is important also because ﬁtting of the deterministic part of variation is often more beneﬁcial for the quality of ﬁnal maps than ﬁtting of the stochastic part (residuals). 2.1.5 A simple example of regression-kriging
The next section illustrates how regression-kriging computations work and compares it to ordinary kriging using the textbook example from Burrough and McDonnell (1998, p.139-141), in which ﬁve measurements are used to predict a value of the target variable (z) at an unvisited location (s0 ) (Fig. 2.6a). We extend this example by adding a hypothetical explanatory data source: a raster image of 10×10 pixels (Fig. 2.6b), which has been constructed to show a strong negative correlation with the target variable at the sample points.
Fig. 2.6: Comparison of ordinary kriging and regression-kriging using a simple example with 5 points (Burrough and McDonnell, 1998, p.139–141): (a) location of the points and unvisited site; (b) values of the covariate q; (c) variogram for target and residuals, (d) OLS and GLS estimates of the regression model and results of prediction for a 10×10 grid using ordinary kriging (e) and regression-kriging (f). Note how the RK maps reﬂects the pattern of the covariate.
The RK predictions are computed as follows:
12
http://spatial-analyst.net/book/RK5points
40
Regression-kriging
(1.) Determine a linear model of the variable as predicted by the auxiliary map q. In this case the correlation is high and negative with OLS coefﬁcients b0 =6.64 and b1 =-0.195 (Fig. 2.6d). (2.) Derive the OLS residuals at all sample locations as: e∗ (si ) = z(si ) − b0 + b1 · q(si ) (2.1.34)
For example, the point at (x=9, y=9) with z=2 has a prediction of 6.64 − 0.195 · 23 = 1.836, resulting in an OLS residual of e∗ = −0.164. (3.) Model the covariance structure of the OLS residuals. In this example the number of points is far too small to estimate the autocorrelation function, so we follow the original text in using a hypothetical variogram of the target variable (spherical model, nugget C0 =2.5, sill C1 =7.5 and range R=10) and residuals (spherical model, C0 =2, C1 =4.5, R=5). The residual model is derived from the target variable model of the text by assuming that the residual variogram has approximately the same form and nugget but a somewhat smaller sill and range (Fig. 2.6c), which is often found in practice (Hengl et al., 2004a). (4.) Estimate the GLS coefﬁcients using Eq.(2.1.3). In this case we get just slightly different coefﬁcients b0 =6.68 and b1 =-0.199. The GLS coefﬁcients will not differ much from the OLS coefﬁcients as long there is no signiﬁcant clustering of the sampling locations (Fig. 2.6d) as in this case. (5.) Derive the GLS residuals at all sample locations: e∗∗ (si ) = z(si ) − b0 + b1 · q(si ) Note that the b now refer to the GLS coefﬁcients. (6.) Model the covariance structure of the GLS residuals as a variogram. In practice this will hardly differ from the covariance structure of the OLS residuals. (7.) Interpolate the GLS residuals using ordinary kriging (OK) using the modeled variogram13 . In this case at the unvisited point location (5, 5) the interpolated residual is −0.081. (8.) Add the GLS surface to the interpolated GLS residuals at each prediction point. At the unvisited point location (5, 5) the explanatory variable has a value 12, so that the prediction is then:
n
(2.1.35)
ˆ z (5, 5) = b0 + b1 · qi +
i=1
λi (s0 )·e(si ) (2.1.36)
= 6.68 − 0.199 · 12 − 0.081 = 4.21
which is, in this speciﬁc case, a slightly different result than that derived by OK with the hypothetical variogram of the target variable (ˆ=4.30). z The results of OK (Fig. 2.6e) and RK (Fig. 2.6f) over the entire spatial ﬁeld are quite different in this case, because of the strong relation between the covariate and the samples. In the case of RK, most of variation in the target variable (82%) has been accounted for by the predictor. Unfortunately, this version of RK has not been implemented in any software package yet14 (see further §3.4.3). Another interesting issue is that most of the software in use (gstat, ) does not estimate variogram using the GLS estimation of the residuals, but only of the OLS residuals (0 iterations). Again, for most of balanced and well spread sample sets, this will not cause any signiﬁcant problems (Minasny and McBratney, 2007).
SAGA
13 Some authors argue whether one should interpolate residuals using simple kriging with zero expected mean of the residuals (by deﬁnition) or by ordinary kriging. In the case of OLS estimation, there is no difference; otherwise one should always use OK to avoid making biased estimates. 14 Almost all geostatistical packages implement the KED algorithm because it is mathematically more elegant and hence easier to program.
2.2 Local versus localized models
41
2.2
Local versus localized models
In many geostatistical packages, a user can opt to limit the selection of points to determine the kriging weights by setting up a maximum distance and/or minimum and maximum number of point pairs (e.g. take only the closest 50 points). This way, the calculation of the new map can be signiﬁcantly speed up. In fact, kriging in global neighborhood where n 1000 becomes cumbersome because of computation of C−1 (Eq.1.3.5). Recall from §1.3.1 that the importance of points (in the case of ordinary kriging and assuming a standard initial variogram model) exponentially decreases with their distance from the point of interest. Typically, geostatisticians suggest that already ﬁrst 30–60 closest points will be good enough to obtain stable predictions. Fig. 2.7: Local regression-kriging is a further sophistication of A prediction model where the search radius for regression-kriging. It will largely depend on the availability derivation of kriging weights (Eq.1.3.4) is limited to of explanatory and ﬁeld data. a local neighborhood can be termed localized prediction model. There is a signiﬁcant difference between localized and local prediction model, which often confuses inexperienced users. For example, if we set a search radius to re-estimate the variogram model, then we speak about a local prediction model, also known as the moving window kriging or kriging using local variograms (Haas, 1990; Walter et al., 2001; Lloyd, 2009). The local prediction model assumes that the variograms (and regression models) are non-stationary, i.e. that they need to be estimated locally.
Fig. 2.8: Local variogram modeling and local ordinary kriging using a moving window algorithm in visually observe how the variograms change locally. Courtesy of Budiman Minasny.
Vesper:
a user can
While localized prediction models are usually just a computational trick to speed up the calculations, local prediction models are computationally much more demanding. Typically, they need to allow automated
42
Regression-kriging
variogram modeling and ﬁltering of improbable models to prevent artifacts in the ﬁnal outputs. A result of local prediction model (e.g. moving window variogram modeling) are not only maps of predictions, but also spatial distribution of the ﬁtted variogram parameters (Fig. 2.7). This way we can observe how does the nugget variation changes locally, which parts of the area are smooth and which are noisy etc. Typically, local variogram modeling and prediction make sense only when we work with large point data sets (e.g. 1000 of ﬁeld observations), which is still not easy to ﬁnd. In addition, local variogram modeling is not implemented in 15 many packages. In fact, the author is aware of only one: (Fig. 2.8). In the case of regression-kriging, we could also run both localized and local models. This way we will not only produce maps of variogram parameters but we would also be able to map the regression coefﬁcients16 . In the case of kriging with external drift, some users assume that the same variogram model can be used in various parts of the study area and limit the search window to speed up the calculations17 . This is obviously a simpliﬁcation, because in the case of KED both regression and kriging part of predictions are solved at the same time. Hence, if we limit the search window, but keep a constant variogram model, we could obtain very different predictions then if we use the global (regression-kriging) model. Only if the variogram of residuals if absolutely stationary, then we can limit the search window to ﬁt the KED weights. In practice, either global (constant variogram) or local prediction models (locally estimated regression models and variograms of residuals) should be used for KE ﬁtting.
Vesper
2.3
Spatial prediction of categorical variables
Although geostatistics is primarily intended for use with continuous environmental variables, it can also be used to predict various types of categorical or class-type variables. Geostatistical analysis of categorical variables is by many referred to as the indicator geostatistics (Bierkens and Burrough, 1993). In practice, indicator kriging leads to many computational problems, which probably explains why there are not many operational applications of geostatistical mapping of categorical variables in the world (Hession et al., 2006). For example, it will typically be difﬁcult to ﬁt variogram for less frequent classes that occur at isolated locations (Fig. 2.9d). Statistical grounds of indicator geostatistics has been recently reviewed by Fig. 2.9: Difﬁculties of predicting point-class data (b) and (d), as comPapritz et al. (2005); Papritz (2009) who pared to quantitative variables (a) and (c), is that the class-interpolators recognizes several conceptual difﬁculties are typically more complex and computationally more time-consuming. of working with indicator data: (1) inconsistent modeling of indicator variograms, and (2) use of global variogram leads to biased predictions because the residuals are by deﬁnition nonstationary. Any attempt to use indicator kriging for data with an apparent trend either explicitly or implicitly by using ordinary indicator kriging within a local neighborhood requires the modeling of non-stationary indicator variograms to preserve the mean square optimality of kriging (Papritz, 2009). Indicator regressionkriging without any transformation has also been criticized because the model (binomial variable) suggests that residuals have mean-dependent variance (p · (1 − p)), and thus using a single variogram for the full set of residuals is not in accordance with theory. Let us denote the ﬁeld observations of a class-type variable as zc (s1 ), zc (s2 ), ..., zc (sn ), where c1 , c2 ,..., ck are discrete categories (or states) and k is the total number of classes. A technique that estimates the soil-classes
15 16 17
http://www.usyd.edu.au/su/agric/acpa/vesper/vesper.html
Regression coefﬁcients are often mapped with geographically weighted regression (Grifﬁth, 2008). Software such as gstat and SAGA allow users to limit the search radius; geoR does not allow this ﬂexibility.
2.3 Spatial prediction of categorical variables
43
ˆ at new unvisited location zc (s0 ), given the input point data set (zc (s1 ), zc (s2 ), ..., zc (sn )), can then be named a class-type interpolator. If spatially exhaustive predictors q1 , q2 , ..., q p (where p is the number of predictors) are available, they can be used to map each category over the area of interest. So far, there is a limited number of techniques that can achieve this: Multi-indicator co-kriging — The simple multi-indicator kriging can also be extended to a case where several covariates are used to improve the predictions. This technique is known by the name indicator (soft) co-kriging (Journel, 1986). Although the mathematical theory is well explained (Bierkens and Burrough, 1993; Goovaerts, 1997; Pardo-Iguzquiza and Dowd, 2005), the application is cumbersome because of the need to ﬁt a very large number of cross-covariance functions. Multinomial Log-linear regression — This a generalization of logistic regression for situations when there are multiple classes of a target variable (Venables and Ripley, 2002). Each class gets a separate set of regression coefﬁcients (βc ). Because the observed values equal either 0 or 1, the regression coefﬁcients need to be solved through a maximum likelihood iterative algorithm (Bailey et al., 2003), which makes the whole method somewhat more computationally demanding than simple multiple regression. An example of multinomial regression is given further in section 9.6. Regression-kriging of indicators — One approach to interpolate soil categorical variables is to ﬁrst assign memberships to point observations and then to interpolate each membership separately. This approach was ﬁrst elaborated by de Gruijter et al. (1997) and then applied by Bragato (2004) and Triantaﬁlis et al. (2001). An alternative is to ﬁrst map cheap, yet descriptive, diagnostic distances and then classify these per pixel in a GIS (Carré and Girard, 2002). In the case of logistic regression, the odds to observe a class (c) at new locations are computed as: ˆ+ zc (s0 ) = 1 + exp −βc T · q0
−1
; c = 1, 2, .., k
(2.3.1)
ˆ+ where zc (s0 ) are the estimated odds for class (c) at a new location s0 and k is the number of classes. The multinomial logistic regression can also be extended to regression-kriging (for a complete derivation see Hengl et al. (2007b)). This means that the regression modeling is supplemented with the modeling of variograms for regression residuals, which can then be interpolated and added back to the regression estimate. So the predictions are obtained using: ˆ+ zc (s0 ) = 1 + exp −βc T · q0
−1
+ ˆc (s0 ) e+
(2.3.2)
where ˆc are the interpolated residuals. The extension from multinomial regression to regression-kriging is not e+ as simple as it seems. This is because the estimated values at new locations in Eq.(2.3.2) are constrained within the indicator range, which means that interpolation of residuals might lead to values outside the physical range (<0 or >1)18 . One solution to this problem is to predict the trend part in transformed space, then interpolate residuals, sum the trend and residual part and back-transform the values (see §5.4). Hengl et al. (2007b) show that memberships (µc ), instead of indicators, are more suitable both for regression and geostatistical modeling, which has been also conﬁrmed by several other authors (McBratney et al., 1992; de Gruijter et al., 1997; Triantaﬁlis et al., 2001). Memberships can be directly linearized using the logit transformation: µ+ = ln c µc 1 − µc ; 0 < µc < 1 (2.3.3)
where µc are the membership values used as input to interpolation. Then, all ﬁtted values will be within the physical range (0–1). The predictions of memberships for class c at new locations are then obtained using the standard regression-kriging model (Eq.2.1.4): ˆ ˆ ˆc µ+ (s0 ) = qT · βc,GLS + λT · µ+ − q · βc,GLS 0 c,0 c
18
(2.3.4)
The degree to which they will fall outside the 0–1 range is controlled by the variogram and amount of extrapolation in feature space
44
Regression-kriging The interpolated values can then be back-transformed to the membership range using (Neter et al., 1996): ˆ µc (s0 ) =
ˆ eµc (s0 ) ˆ 1 + eµc (s0 )
+ +
(2.3.5)
In the case of regression-kriging of memberships, both spatial dependence and correlation with the predictors are modeled in a statistically sophisticated way. In addition, regression-kriging of memberships allows ﬁtting of each class separately, which facilitates the understanding of the distribution of soil variables and the identiﬁcation of problematic classes, i.e. classes which are not correlated with the predictors or do not show any spatial autocorrelation etc. Spatial prediction of memberships can be excessive in computation time. Another problem is that, if the interpolated classes (odds, memberships) are ﬁtted only by using the sampled data, the predictions of the odds/memberships will commonly not sum to unity at new locations. In this case, one needs to standardized values for each grid node by diving the original values by the sum of odds/memberships to ensure that they sum to unity, which is an ad-hoc solution. An algorithm, such as compositional regression-kriging19 will need to be developed. A number of alternative hybrid class-interpolators exists, e.g. the Bayesian Maximum Entropy (BME) approach by D’Or and Bogaert (2005). Another option is to use Markov-chain algorithms (Li et al., 2004, 2005a). However, note that although use of the BME and Markov-chain type of algorithms is a promising development, their computational complexity makes it still far from use in operational mapping.
2.4
Geostatistical simulations
Regression-kriging can also be used to generate simulations of a target variable using the same inputs as in the case of spatial prediction system. An equiprobable realization of an environmental variable can be generated by using the sampled values and their variogram model: Z (SIM) (s0 ) = E Z|z(s j ), γ(h) (2.4.1)
where Z (SIM) is the simulated value at the new location. The most common technique in geostatistics that can be used to generate equiprobable realizations is the Sequential Gaussian Simulation (Goovaerts, 1997, p.380-392). It starts by deﬁning a random path for visiting each node of the grid once. At ﬁrst node, kriging is used to determine the location-speciﬁc mean and variance of the conditional cumulative distribution function. A simulated value can then be drawn by using the inverse normal distribution (Box and Muller, 1958; Banks, 1998): ˆ ˆ ziSIM = zi + σi · −2 · ln(1 − A) · cos(2 · π · B) (2.4.2)
where ziSIM is the simulated value of the target variable with induced error, A and B are the independent random ˆ ˆ numbers within the 0 − 0.99. . . range, zi is the estimated value at ith location, and σi is the regression-kriging error. The simulated value is then added to the original data set and the procedure is repeated until all nodes have been visited. Geostatistical simulations are used in many different ﬁelds to generate multiple realizations of the same feature (Heuvelink, 1998; Kyriakidis et al., 1999), or to generate realistic visualizations of a natural phenomena (Hengl and Toomanian, 2006; Pebesma et al., 2007). Examples of how to generate geostatistical simulations and use them to estimate the propagated error are further shown in section 10.3.2.
2.5
Spatio-temporal regression-kriging
In statistics, temporal processes (time series analysis, longitudinal data analysis) are well-known, but mixed spatio-temporal processes are still rather experimental (Banerjee et al., 2004). The 2D space models can be
19 Walvoort and de Gruijter (2001), for example, already developed a compositional solution for ordinary kriging that will enforce estimated values to sum to unity at all locations.
2.5 Spatio-temporal regression-kriging
45
extended to the time domain, which leads to spatio-temporal geostatistics (Kyriakidis and Journel, 1999). The universal kriging model (Eq.2.1.1) then modiﬁes to: Z(s, t) = m(s, t) + (s, t) + (2.5.1)
where (s, t) is the spatio-temporally autocorrelated residual for every (s, t) ∈ S × T , while m(s, t), the deterministic component of the model, can be estimated using e.g. (Fassó and Cameletti, 2009): m(s, t) = q(s, t) · β + K(s) · y t + ω(s, t) (2.5.2)
where q is a matrix of covariates available at all s, t locations, yt is a component of a target variable that is constant in space (global trend), K(s) is a matrix of coefﬁcients, and ω(s, t) is the spatial small-scale component (white noise in time) correlated over space. A possible but tricky simpliﬁcation of the space-time models is to simply consider time to be third dimension of space. In that case, spatio-temporal interpolation follows the same interpolation principle as explained in Eq.(1.1.2), except that here the variograms are estimated in three dimensions (two-dimensional position x and y and ‘position’ in time). From the mathematical aspect, the extension from the static 2D interpolation to the 3D interpolation is then rather simple. Regression modeling can be simply extended to a space-time model by adding time as a predictor. For example, a spatio-temporal regression model for interpolation of land surface temperature (see further §2.9.2) would look like this: LST(s0 , t 0 ) = b0 + b1 · DEM(s0 ) + b2 · LAT(s0 ) + b3 · DISTC(s0 ) + b4 · LSR(s0 , t 0 ) π ; ∆t = 1 day + b5 · SOLAR(s0 , t 0 ) + b6 · cos [t 0 − φ] · 180 where DEM is the elevation map, LAT is the map showing distance from the equator, DISTC is the distance from the coast line, LSR is the land surface radiation from natural or manmade objects, SOLAR is the direct solar insolation for a given cumulative Julian day t ∈ (0, +∞), cos(t) is a generic function to account for seasonal variation of values and φ is the phase angle20 . DEM, LAT, DISTC are temporally-constant predictors, while surface radiation and solar insolation maps need to be provided for each time interval used for data ﬁtting. The residuals from this regression model can then be analyzed for (spatio-temporal) auto-correlation. In , extension from 2D to 3D variograms is possible by extending the variogram parameters: for 3D space-time variograms ﬁve values should be given in the form anis = c(p,q,r,s,t), where p is the angle for the principal direction of continuity (measured in degrees, clockwise from y, in direction of x), q is the dip angle for the principal direction of continuity (measured in positive degrees up from horizontal), r is the third rotation angle to rotate the two minor directions
(2.5.3)
gstat
Fig. 2.10: Extension of a 2D prediction model to the space-time domain. Note that in the space-time cube, the amount of pixels needed to store the data exponentially increases as a function of: width × height × number of predictors × number of time intervals.
20
A time delay from the coldest day.
46
Regression-kriging
around the principal direction deﬁned by p and q21 (see Fig. 1.11). A positive angle acts counter-clockwise while looking in the principal direction. Once we have ﬁtted the space-time variogram, we can run regression-kriging to estimate the values at 3D locations. In practice, we only wish to produce maps for a given time interval (t 0 =constant), i.e. to produce 2D-slices of values in time (Fig. 2.10). Once we have produced a time-series of predictions, we can analyze the successive time periods and run various types of time-series analysis. This will help us detect temporal trends spatially and extract informative images about the dynamics of the feature of interest. Note that, in order to yield accurate predictions using spatio-temporal techniques, dense sampling in both space and time is required. This means that existing natural resource surveys that have little to no repetition in time ( 10 repetitions in time) cannot be adopted. Not to mention the computational complexity as the maps of predictors now multiply by the amount of time intervals. In addition, estimation of the spatio-temporal variograms will often be a cumbersome because we need to ﬁt space-time models, for which we might not have enough space-time observations. A review of spatio-temporal models, i.e. dynamic linear state-space models, and some practical suggestions how to analyze such data and ﬁt spatially varying coefﬁcients can be followed in Banerjee et al. (2004, §8). A speciﬁc extension of the general model from Eq.(2.5.1) is to estimate the deterministic part of variation by using process-based (simulation) models, which are often based on differential equations. In this case an environmental variable is predicted from a set of environmental predictors incorporated in a dynamic model (Eq.1.3.12): Z(s, t) = fs,c,r,p,a (t) + (s, t) + (2.5.4)
where s, c, r, p, a are the input (zero-stage) environmental conditions and f is a mathematical deterministic function that can be used to predict the values for a given space-time position. This can be connected with the Einstein’s assumption that the Universe is in fact a trivial system that can be modeled and analyzed using “a one–dimensional differential equation — in which everything is a function of time”22 . Some examples of operational soil-landscape process-based models are given by Minasny and McBratney (2001) and Schoorl et al. (2002). In vegetation science, for example, global modeling has proven to be very efﬁcient for explanation of the actual distribution of vegetation and of global changes (Bonan et al., 2003). Integration of environmental process-based models will soon lead to development of a global dynamic model of environmental systems that would then provide solutions for different multipurpose national or continental systems. Fassó and Cameletti (2009) recently proposed hierarchical models as a general approach for spatio-temporal problems, including dynamical mapping, and the analysis of the outputs from complex environmental modeling chains. The hierarchical models are a suitable solution to spatio-temporal modeling because they make it possible to deﬁne the joint dynamics and the full likelihood; the maximum likelihood estimation can be further simpliﬁed by using Expectation-Maximization algorithm. The basis of this approach is the classical two-stage hierarchical state-space model (Fassó and Cameletti, 2009): Zt = qt · β + K · yt + et y t = G · y t−1 + η t (2.5.5) (2.5.6)
where y t is modeled as the autoregressive process, G is the transition matrix and η t is the innovation error. If all parameters are known, the unobserved temporal process y t can be estimated for each time point t using e.g. Kalman ﬁlter or Kalman smoother. Such process-based spatio-temporal models are still experimental and it make take time until their semi-automated software implementations appear in .
R
2.6
Species Distribution Modeling using regression-kriging
The key inputs to a Species Distribution Model (SDM) are: the inventory (population) of animals or plants N consisting of a total of N individuals (a point pattern X = si 1 ; where si is a spatial location of individual
21 22
http://www.gstat.org/manual/node20.html
Quote by James Peebles, Princeton, 1990; published in “God’s Equation: Einstein, Relativity, and the Expanding Universe” by Amir D. Aczel.
2.6 Species Distribution Modeling using regression-kriging
47
animal or plant; Fig. 1.3a), covering some area BHR ⊂ 2 (where HR stands for home-range and 2 is the Euclidean space), and a list of environmental covariates/predictors (q1 , q2 , . . . q p ) that can be used to explain spatial distribution of a target species. In principle, there are two distinct groups of statistical techniques that can be used to map the realized species’ distribution: (a) the point pattern analysis techniques, such as kernel smoothing, which aim at predicting density of a point process (Fig. 2.11a); and (b) statistical, GLM-based, techniques that aim at predicting the probability distribution of occurrences (Fig. 2.11c). Both approaches are explained in detail in the following sections.
Fig. 2.11: Examples of (simulated) species distribution maps produced using common statistical models.
Species’ density estimation using kernel smoothing and covariates Spatial density (ν; if unscaled, also known as “spatial intensity”) of a point pattern (ignoring the time dimension) is estimated as: [N (X ∩ B)] =
B
ν(s)ds
(2.6.1)
48
Regression-kriging In practice, it can be estimated using e.g. a kernel estimator (Diggle, 2003; Baddeley, 2008):
n
ν(s) =
i=1
κ·
s − si
· b(s)
(2.6.2)
where ν(s) is spatial density at location s, κ(s) is the kernel (an arbitrary probability density), si is location of an occurrence record, s − si is the distance (norm) between an arbitrary location and observation location, and b(s) is a border correction to account for missing observations that occur when s is close to the border of the region (Fig. 2.11a). A common (isotropic) kernel estimator is based on a Gaussian function with mean zero and variance 1: ν(s) = 1 H2
n
·
i=1
1 2π
· e−
s−si 2
2
· b(s)
(2.6.3)
The key parameter for kernel smoothing is the bandwidth (H) i.e. the smoothing parameter, which can be connected with the choice of variogram in geostatistics. The output of kernel smoothing is typically a map (raster image) consisting of M grid nodes, and showing spatial pattern of species’ clustering. Spatial density of a point pattern can also be modeled using a list of spatial covariates q’s (in ecology, we call this environmental predictors), which need to be available over the whole area of interest B. For example, using a Poisson model (Baddeley, 2008): log ν(s) = log β0 + log q1 (s) + . . . + log q p (s) (2.6.4)
where log transformation is used to account for the skewed distribution of both density values and covariates; p is the number of covariates. Models with covariates can be ﬁtted to point patterns e.g. in the package23 . Such point pattern–covariates analysis is commonly run only to determine i.e. test if the covariates are correlated with the feature of interest, to visualize the predicted trend function, and to inspect the spatial trends in residuals. Although statistically robust, point pattern–covariates models are typically not considered as a technique to improve prediction of species’ distribution. Likewise, the model residuals are typically not used for interpolation purposes.
spatstat
Predicting species’ distribution using ENFA and GLM (pseudo-absences) An alternative approach to spatial prediction of species’ distribution using occurrence-only records and environmental covariates is the combination of ENFA and regression modeling. In general terms, predictions are based on ﬁtting a GLM: (P) = µ = g −1 (q · β) (2.6.5)
where E(P) is the expected probability of species occurrence (P ∈ [0, 1]; Fig. 2.11c), q·β is the linear regression model, and g is the link function. A common link function used for SDM with presence observations is the logit link function: g(µ) = µ+ = ln µ 1−µ (2.6.6)
and the Eq.(2.6.5) becomes logistic regression (Kutner et al., 2004). The problem of running regression analysis with occurrence-only observations is that we work with 1’s only, which means that we cannot ﬁt any model to such data. To account for this problem, species distribution modelers (see e.g. Engler et al. (2004); Jiménez-Valverde et al. (2008) and Chefaoui and Lobo (2008)) typically insert the so-called “pseudo-absences” — 0’s simulated using a plausible models, such as Environmental
23
This actually ﬁts the maximum pseudolikelihood to a point process; for more details see Baddeley (2008).
2.6 Species Distribution Modeling using regression-kriging
49
Niche Factor Analysis (ENFA), MaxEnt or GARP (Guisan and Zimmermann, 2000), to depict areas where a species is not likely to occur. ENFA is a type of factor analysis that uses observed presences of a species to estimate which are the most favorable areas in the feature space, and then uses this information to predict the potential distribution of species for all locations (Hirzel and Guisan, 2002). The difference between ENFA and the Principal Component Analysis is that the ENFA factors have an ecological meaning. ENFA results in a Habitat Suitability Index (HSI∈ [0 − 100%]) — by depicting the areas of low HSI, we can estimate where the species is very unlikely to occur, and then simulate a new point pattern that can be added to the occurrence locations to produce a ‘complete’ occurrences+absences data set. Once we have both 0’s and 1’s, we can ﬁt a GLM as shown in Eq.(2.6.5) and generate predictions (probability of occurrence) using geostatistical techniques as described in e.g. Gotway and Stroup (1997). Predicting species’ density using ENFA and logistic regression-kriging Point pattern analysis, ENFA and regression-kriging can be successfully combined using the approach explained in Hengl et al. (2009b). First, we will assume that our input point pattern represents only a sample of the whole n population (XS = si 1 ), so that the density estimation needs to be standardized to avoid biased estimates. Second, we will assume that pseudo-absences can be generated using both information about the potential habitat (HSI) and geographical location of the occurrence-only records. Finally, we focus on mapping the actual count of individuals over the grid nodes (realized distribution), instead of mapping the probability of species’ occurrence. Spatial density values estimated by kernel smoothing are primarily controlled by the bandwidth size (Bivand et al., 2008). The higher the bandwidth, the lower the values in the whole map; likewise, the higher the sampling intensity (n/N ), the higher the spatial density, which eventually makes it difﬁcult to physically interpret mapped values. To account for this problem, we propose to use relative density (ν r : B → [0, 1]) expressed as the ratio between the local and maximum density at all locations: ν r (s) = ν(s)
M max {ν(s)|s ∈ B}1
(2.6.7)
An advantage of using the relative density is that the values are in the range [0, 1], regardless of the bandwidth and sample size (n/N ). Assuming that our sample XS is representative and unbiased, it can be shown that ν r (s) is an unbiased estimator of the true spatial density (see e.g. Diggle (2003) or Baddeley (2008)). In other words, regardless of the sample size, by using relative intensity we will always be able to produce an unbiased estimator of the spatial pattern of density for the whole population (see further Fig. 8.4). Furthermore, assuming that we actually know the size of the whole population (N ), by using predicted relative density, we can also estimate the actual spatial density (number of individuals per grid node; as shown in Fig. 2.11b): ν(s) = ν r (s) · N
M j=1 ν r (s) M
;
j=1
ν(s) = N
(2.6.8)
which can be very useful if we wish to aggregate the species’ distribution maps over some polygons of interest, e.g. to estimate the actual counts of individuals. Our second concern is the insertion of pseudo-absences. Here, two questions arise: (1) how many pseudoabsences should we insert? and (b) where should we locate them? Intuitively, it makes sense to generate the same number of pseudo-absence locations as occurrences. This is also supported by the statistical theory of model-based designs, also known as “D-designs”. For example, assuming a linear relationship between density and some predictor q, the optimal design that will minimize the prediction variance is to put half of observation at one extreme and other at other extreme. All D-designs are in fact symmetrical, and all advocate higher spreading in feature space (for more details about D-designs, see e.g. Montgomery (2005)), so this principle seems logical. After the insertion of the pseudo-absences, the extended observations data set is: Xf = si
n 1,
s∗ i
n∗ 1
;
n = n∗
(2.6.9)
50
Regression-kriging
where s∗ i are locations of the simulated pseudo-absences. This is not a point pattern any more because now also quantitative values — either relative densities (ν r (si )) or indicator values — are attached to locations (µ(si ) = 1 and µ(s∗ i ) = 0). The remaining issue is where and how to allocate the pseudo-absences? Assuming that a spreading of species in an area of interest is a function of the potential habitat and assuming that the occurrence locations on the HSI axis will commonly be skewed toward high values (see further Fig. 8.8 left; see also Chefaoui and Lobo (2008)), we can deﬁne the probability distribution (τ) to generate the pseudo-absence locations as e.g.: τ(s∗ ) = [100% − HSI(s)]2 (2.6.10)
where the square term is used to insure that there are progressively more pseudo-absences at the edge of low HSI. This way also the pseudo-absences will approximately follow Poisson distribution. In this paper we propose to extend this idea by considering location of occurrence points in geographical space also (see also an interesting discussion on the importance of geographic extent for generation of pseudo-absences by VanDerWal et al. (2009)). The Eq.(2.6.10) then modiﬁes to: τ(s∗ ) = dR (s) + (100% − HSI(s)) 2
2
(2.6.11)
where dR is the normalized distance in the range [0, 100%], i.e. the distance from the observation points (X) divided by the maximum distance. By using Eq.(2.6.11) to simulate the pseudo-absence locations, we will purposively locate them both geographically further away from the occurrence locations and in the areas of low HSI (unsuitable habitat). After the insertion of pseudo-absences, we can attach to both occurrence-absence locations values of estimated relative density, and then correlate this with environmental predictors. This now becomes a standard geostatistical point data set, representative of the area of interest, and with quantitative values attached to point locations (see further Fig. 8.10d). Recall from Eq.(2.6.7) that we attach relative intensities to observation locations. Because these are bounded in the [0, 1] range, we can use the logistic regression model to make predictions. Thus, the relative density at some new location (s0 ) can be estimated using:
+ ν r (s0 ) = 1 + exp −β T · q0 −1
(2.6.12)
where β is a vector of ﬁtted regression coefﬁcients, q0 is a vector of predictors (maps) at a new location, and + ν r (s0 ) is the predicted logit-transformed value of the relative density. Assuming that the sampled intensities are continuous values in the range ν r ∈ (0, 1), the model in Eq.(2.6.12) is in fact a liner model, which allows us to extend it to a more general linear geostatistical model such as regression-kriging. This means that the regression modeling is supplemented with the modeling of variograms for regression residuals, which can then be interpolated and added back to the regression estimate (Eq.2.1.4):
+ T + ˆ ˆ ν r (s0 ) = qT · βGLS + δ0 · νr − q · βGLS 0
(2.6.13)
where δ0 is the vector of ﬁtted weights to interpolate the residuals using ordinary kriging. In simple terms, logistic regression-kriging consists of ﬁve steps: (1.) convert the relative intensities to logits using Eq.(2.6.6); if the input values are equal to 0/1, replace with the second smallest/highest value; (2.) ﬁt a linear regression model using Eq.(2.6.12); (3.) ﬁt a variogram for the residuals (logits); (4.) produce predictions by ﬁrst predicting the regression-part, then interpolate the residuals using ordinary kriging; ﬁnally add the two predicted trend-part and residuals together (Eq.2.6.13) (5.) back-transform interpolated logits to the original (0, 1) scale by:
2.7 Modeling of topography using regression-kriging
51
ν r (s0 ) =
eνr (s0 ) 1 + eνr (s0 )
+
+
(2.6.14)
After we have mapped relative density over area of interest, we can also estimate the actual counts using the Eq.(2.6.8). This procedure is further elaborated in detail in chapter 8.
2.7
Modeling of topography using regression-kriging
A Digital Elevation Model (DEM) is a digital representation of the land surface — the major input to quantitative analysis of topography, also known as Digital Terrain Analysis or Geomorphometry (Wilson and Gallant, 2000; Hengl and Reuter, 2008). Typically, a DEM is a raster map (an image or an elevation array) that, like many other spatial features, can be efﬁciently modeled using geostatistics. The geostatistical concepts were introduced in geomorphometry by Fisher (1998) and Wood and Fisher (1993), then further elaborated by Kyriakidis et al. (1999), Holmes et al. (2000) and Oksanen (2006). An important focus of using geostatistics to model topography is assessment of the errors in DEMs and analysis of effects that the DEM errors have on the results of spatial modeling. This is the principle of error propagation that commonly works as follows: simulations are generated from point-measured heights to produce multiple equiprobable realizations of a DEM of an area; a spatial model is applied m times and output maps then analyzed for mean values and standard deviations per pixel; the results of analysis can be used to quantify DEM accuracy and observe impacts of uncertain information in various parts of the study area (Hunter and Goodchild, 1997; Heuvelink, 1998; Temme et al., 2008). So far, DEMs have been modeled by using solely point-sampled elevations. For exam- Fig. 2.12: Conceptual aspects of modeling topography using geople, ordinary kriging is used to generate DEMs statistics. A cross section showing the true topography and the as(Mitas and Mitasova, 1999; Lloyd and Atkin- sociated uncertainty: (a) constant, global uncertainty model and son, 2002); conditional geostatistical simula- (b) spatially variable uncertainty; (c) estimation of the DEM errors tions are used to generate equiprobable realiza- using precise height measurements. tions of DEMs (Fisher, 1998; Kyriakidis et al., 1999). In most studies, no explanatory information on topography is employed directly in the geostatistical modeling. Compared to the approach of Hutchinson (1989, 1996) where auxiliary maps of streams are often used to produce hydrologically-correct DEMs, the geostatistical approach to modeling of topography has often been limited to analysis of pointsampled elevations. 2.7.1 Some theoretical considerations
DEMs are today increasingly produced using automated (mobile GPS) ﬁeld sampling of elevations or airborne scanning devices (radar or LiDAR-based systems). In the case elevations are sampled at sparsely-located points,
52
Regression-kriging
a DEM can be generated using geostatistical techniques such as ordinary kriging (Wood and Fisher, 1993; Mitas and Mitasova, 1999). The elevation at some grid node (s0 ) of the output DEM can be interpolated using ordinary kriging (Eq.1.3.2); the same technique can be used to produce simulated DEMs (see section 2.4). Direct simulation of DEMs using the sampled elevations is discussed in detail by Kyriakidis et al. (1999). The use of kriging in geomorphometry to generate DEMs has been criticized by many (Wood and Fisher, 1993; Mitas and Mitasova, 1999; Li et al., 2005b), mainly because it leads to many artifacts, it oversmooths elevations and it is very sensitive to sampling density and local extreme values. So far, splines have been used in geomorphometry as a preferred technique to generate DEMs or to ﬁlter local errors (Mitasova et al., 2005). More recently, Hengl et al. (2008) demonstrated that regression-kriging can be used to employ auxiliary maps, such as maps of drainage patterns, land cover and remote sensing-based indices, directly in the geostatistical modeling of topography. Details are now discussed in the succeeding sections. If additional, auxiliary maps (drainage network, water bodies, physiographic break-lines) are available, a DEM can be generated from the point-measured elevations using the regression-kriging model (Eq.2.1.4). The biggest advantage of using auxiliary maps is a possibility to more precisely model uncertainty of the sampled elevations and analyze which external factors cause this variability. Whereas, in pure statistical Monte Carlo approach where we work with global, constant parameters (Fig. 2.12a), in the case of geostatistical modeling, the DEM uncertainty can be modeled with a much higher level of detail (Fig. 2.12b). In the case a DEM is obtained from an airborne or satellite-based scanning mission (radar, LiDAR or stereoscopic images), elevations are already available over the whole area of interest. Geostatistics is then used to analyze inherent errors in the DEM images (Grohmann, 2004), ﬁlter local errors caused by physical limitations of the instrument (Lloyd and Atkinson, 2002; Evans and Hudak, 2007), and eventually cluster the area according to their statistical properties (Lloyd and Atkinson, 1998). Geostatistical simulation of complete elevation data is somewhat more complicated than with point data. At the moment, the simulations of DEM images are most commonly obtained by simulating error surfaces derived from additional ﬁeld-control samples (Fig. 2.12c). The elevations measured at control points are used to assess the errors. The point map of DEM errors can then be used to generate equiprobable error surfaces, which are then added to the original DEM to produce an equiprobable realization of a DEM (Hunter and Goodchild, 1997; Holmes et al., 2000; Endreny and Wood, 2001; Temme et al., 2008). From a statistical perspective, a DEM produced directly by using scanning devices (SRTM, LiDAR) consists of three components: Z ∗ (s) the deterministic component, (s) the spatially correlated random component, and is the pure noise, usually the result of the measurement error. In raster-GIS terms, we can decompose a DEM into two grids: (1) the deterministic DEM and (2) the error surface. If precise point-samples of topography (e.g. highly precise GPS measurements) are available, they can be used to estimate the errors (Fig. 2.12c):
∗ e(si ) = zREF (si ) − Z(si );
E {e(s)} = 0
(2.7.1)
The measured errors at point locations can also be manipulated using geostatistics to generate the error surface: e(SIM) (s0 ) = E |e(si ), γe (h) (2.7.2)
The simulated error surface can then be added to the deterministic DEM to produce an equiprobable realization of a DEM: z (SIM) (s j ) = z ∗ (s j ) + e(SIM) (s j ) (2.7.3)
An obvious problem with this approach is that the deterministic DEM (z ∗ (s j )) is usually not available, so that the input DEM is in fact used to generate simulations, which leads to (see e.g. Holmes et al. (2000); Temme et al. (2008)): z (SIM) (s j ) = z(s j ) + e(SIM) (s j ) = z ∗ (s j ) + (s j ) +
+ e(SIM) (s j )
(2.7.4)
2.7 Modeling of topography using regression-kriging
53
which means that the simulated error surface and the inherent error component, at some locations, will double, and at others will annul each other. However, because the inherent error and the simulated error are in fact independent, the mean of the summed errors will be close to zero (unbiased simulation), but the standard deviation of the error component will be on average 40% larger. Hence a DEM simulated using Eq.(2.7.3) will be much noisier than the original DEM. The solution to this problem is to substitute the deterministic DEM component with a smoother DEM, e.g. a DEM derived from contour lines digitized from a ﬁner-scale topo-map. As an alternative, the deterministic DEM component can be prepared by smoothing the original DEM i.e. ﬁltering it for known noise and systematic errors (see e.g. Selige et al. (2006)).
2.7.2 Choice of auxiliary maps The spatially correlated error component will also often correlate with the explanatory information (Oksanen, 2006). For example, in traditional cartography, it is known that the error of measuring elevations is primarily determined by the complexity of terrain (the slope factor), land cover (density of objects) and relative visibility (the shadow effect). Especially in the cases where the DEMs are produced through photogrammetric methods, information about the terrain shading can be used to estimate the expected error of measuring heights. Similarly, a SRTM DEM will show systematic errors in areas of higher canopy and smaller precision in areas which are hidden or poorly exposed to the scanning device (Hengl and Reuter, 2008, p.79-80). This opens a possibility to also use the regression-kriging model with auxiliary maps to produce a more realistic error surface (Hengl et al., 2008). There are three major groups of auxiliary maps of interest to DEM generation: (1.) Hydrological maps: stream line data; water stagnation areas (soil-water content images); seashore and lakes border lines; (2.) Land cover maps: canopy height; Leaf Area Index; land cover classes; (3.) Geomorphological maps: surface roughness maps; physiographic breaks; ridges and terraces; A lot of topography-connected information can be derived from remote sensing multi- and hyper-spectral images, such as shading-based indices, drainage patterns, ridge-lines, topographic breaks. All these can be derived using automated (pattern recognition) techniques, which can signiﬁcantly speed up processing for large areas. Many auxiliary maps will mutually overlap in information and value. Ideally, auxiliary maps used to improve generation of DEMs should be only GIS layers produced independently from the sampled elevations — e.g. remotely sensed images, topographic features, thematic maps etc. Where this is not possible, auxiliary maps can be derived from an existing DEM, provided that this DEM is generated using independent elevation measurements. Care needs to be taken not to employ auxiliary maps which are only indirectly or accidentally connected with the variations in topography. Otherwise unrealistic simulations can be generated, of even poorer quality than if only standard DEM generation techniques are used (Hengl et al., 2008).
54
Regression-kriging
2.8
Regression-kriging and sampling optimization algorithms
R
intamap
intamap
If ordinary kriging is used to generate predictions. Ideally, already one iteration of additional sampling should guarantee map of required accuracy/quality. In practice, also the estimation of model will need to be updated with additional predictors, hence more iterations can be anticipated.
25 26
24
http://intamap.org
2.9 Fields of application
Optimized network (SSA)
q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q qq q q q q q q q q q q q q q q q q q q q q q q q q q q q
55
New sampling locations (RK)
q
1.4
q
q q q
q q
1.0
q
0.8
1.2
q q
q q q q q q q
q qq q q qq q q
0.6
1.0
q q qq q q
q q q q q q q q qq q q q q
0.4
0.8
q q
q q q qq q q q q q qq q q q q q
0.2
q q q
q
q
0.6
q
0.0
q
Fig. 2.13: Example of the two-stage model-based sampling: + — 50 ﬁrst stage samples (transects); o — 100 new samples allocated using the model estimated in the ﬁrst stage (optimized allocation produced using Spatial Simulated Annealing implemented in the intamapInteractive package). In the case of low correlation with auxiliary maps (left), new sampling design shows have higher spreading in the geographical space; if the correlation with predictors is high (right), then the new sampling design follows the extremes of the features space.
2.9
Fields of application
With the rapid development of remote sensing and geoinformation science, natural resources survey teams are now increasingly creating their products (geoinformation) using ancillary data sources and computer programs — the so-called direct-to-digital approach. For example, sampled concentrations of heavy metals can be mapped with higher detail if information about the sources of pollution (distance to industrial areas and trafﬁc or map showing the ﬂooding potential) is used. In the following sections, a short review of the groups of application where regression-kriging has shown its potential is given.
2.9.1
Soil mapping applications
In digital soil mapping, soil variables such as pH, clay content or concentration of a heavy metal, are increasingly mapped using the regression-kriging framework: the deterministic part of variation is dealt with maps of soil forming factors (climatic, relief-based and geological factors) and the residuals are dealt with kriging (McBratney et al., 2003). The same techniques is now used to map categorical variables (Hengl et al., 2007b). A typical soil mapping project based on geostatistics will also be demonstrated in the following chapter of this handbook. This follows the generic framework for spatial prediction set in Hengl et al. (2004a) and applicable also to other environmental and geosciences (Fig. 2.14). In geomorphometry, auxiliary maps, such as maps of drainage patterns, land cover and remote sensingbased indices, are increasingly used for geostatistical modeling of topography together with point data sets. Auxiliary maps can help explain spatial distribution of errors in DEMs and regression-kriging can be used to generate equiprobable realizations of topography or map the errors in the area of interest (Hengl et al., 2008). Such hybrid geostatistical techniques will be more and more attractive for handling rich LiDAR and radarbased topographic data, both to analyze their inherent geostatistical properties and generate DEMs ﬁt-for-use in various environmental and earth science applications.
56
Regression-kriging
PREDICTIVE COMPONENTS DIGITAL ELEVATION MODEL Derive DEM-parameters:
slope, curvatures, TWI, solar radiation, etc.
REMOTE SENSING INDICES
Global coverage
Prepare a multi-temporal dataset that reflects also the seasonal variation
Principal Component Analysis
CATEGORICAL PREDICTORS
Convert classes to indicators
FIELD OBSERVATIONS
Analyze relationship between the predictors and environmental variables
Regression modelling
Compare and evaluate
Spatial prediction: regressionkriging
Prediction variance
Variogram modelling
Fig. 2.14: A generic framework for digital soil mapping based on regression-kriging. After Hengl et al. (2007b).
2.9.2
Interpolation of climatic and meteorological data
Regression-kriging of climatic variables, especially the ones derived from DEMs, is now favoured in many climatologic applications (Jarvis and Stuart, 2001; Lloyd, 2005). DEMs are most commonly used to adjust measurements at meteorological stations to local topographic conditions. Other auxiliary predictors used range from distance to sea, meteorological images of land surface temperature, water vapor, short-wave radiation ﬂux, surface albedo, snow Cover, fraction of vegetation cover (see also section 4). In many cases, real deterministic models can be used to make predictions, so that regression-kriging is only used to calibrate the values using the real observations (D’Agostino and Zelenka, 1992, see also Fig. 2.3). The exercise in chapter 11 demonstrates the beneﬁts of using the auxiliary predictors to map climatic variables. In this case the predictors explained almost 90% of variation in the land surface temperatures measured at 152 stations. Such high R-square allows us to extrapolate the values much further from the original sampling locations, which would be completely inappropriate to do by using ordinary kriging. The increase of the predictive capabilities using the explanatory information and regression-kriging has been also reported by several participants of the Conference on spatial interpolation in climatology and meteorology (Szalai et al., 2007). Interpolation of climatic and meteorological data is also interesting because the explanatory (meteorological images) data are today increasingly collected in shorter time intervals so that time-series of images are available and can be used to develop spatio-temporal regression-kriging models. Note also that many meteorological prediction models can generate maps of forecasted conditions in the close-future time, which could then again be calibrated using the actual measurements and RK framework. 2.9.3 Species distribution modeling
Geostatistics is considered to be one of the four spatially-implicit group of techniques suited for species distribution modeling — the other three being: autoregressive models, geographically weighted regression and parameter estimation models (Miller et al., 2007). Type of technique suitable for analysis of species (oc-
2.9 Fields of application
57
currence) records is largely determined by the species’ biology. There is a distinct difference between ﬁeld observation of animal and plant species and measurements of soil or meteorological variables. Especially the observations of animal species asks for high sampling densities in temporal dimension. If the biological species are represented with quantitative composite measures (density, occurrence, biomass, habitat category), such measures are ﬁt for use with standard spatio-temporal geostatistical tools. Some early examples of using geostatistics with the species occurrence records can be found in the work of Legendre and Fortin (1989) and Gotway and Stroup (1997). Kleinschmidt et al. (2005) uses regression-kriging method, based on the Generalized mixed model, to predict the malaria incidence rates in South Africa. Miller (2005) uses a similar principle (predict the regression part, analyze and interpolate residuals, and add them back to predictions) to generate vegetation maps. Miller et al. (2007) further provide a review of predictive vegetation models that incorporate geographical aspect into analysis. Pure interpolation techniques will often outperform niche based models (Bahn and McGill, 2007), although there is no reason not to combine them. Pebesma et al. (2005) demonstrates that geostatistics is ﬁt to be used with spatio-temporal species density records. §8 shows that even occurrence-only records can be successfully analyzed using geostatistics i.e. regression-kriging.
1. Define taxa/domain of interest Species of interest (taxa) Study area (mask) Coordinate system (proj4/EPSG)
Retrieve occurrence records Fit SDM parameters
model fitting and prediction (adehabitat, stats)
Select suitable grid cell size
Retrieve gridded maps
database query (RODBC, XML)
PostGIS database query (rgdal)
GBIF
1 km 5 km 10 km
2. SDM settings Species Distribution Model (optional) model parameters Visualization settings
GIS layers
data export (GDAL, OGR)
Generate a distribution map Visualize and interpret the final result
3. Data export settings Select GIS format for output data (geoTIFF, KML) Select format for metadata
Fig. 2.15: Schematic example of a geo-processing service to automate extraction of species distribution maps using GBIF occurrence records and gridded predictors. The suitable R packages are indicated in brackets.
Fig. 2.15 shows an example of a generic automated data processing scheme to generate distribution maps and similar biodiversity maps using web-data. The occurrence(-only) records can be retrieved from the Global Biodiversity Information Facility27 (GBIF) Data Portal, then overlaid over number of gridded predictors (possibly stored in a PostGIS database), a species’ prediction model can then be ﬁtted, and results exported to some GIS format / KML. Such automated mapping portals are now increasingly being used to generate up-to-date species’ distribution maps.
27
Established in 2001; today the largest international data sharing network for biodiversity.
58 2.9.4 Downscaling environmental data
Regression-kriging
Interpolation becomes down-scaling once the grid resolution in more than 50% of the area is ﬁner than it should be for the given sampling density. For example, in soil mapping, one point sample should cover 160 pixels (Hengl, 2006). If we have 100 samples and the size of the area is 10 km2 , then it is valid to map soil variables at resolutions of 25 m (maximum 10 m) or coarser. Note that down-scaling is only valid if we have some auxiliary data (e.g. digital elevation model) which is of ﬁner resolution than the effective grid resolution, and which is highly correlated with the variable of interest. If the auxiliary predictors are available at ﬁner resolutions than the sampling intensity, regression-kriging can be used to downscale information. Much of recent research in the ﬁeld of biogeography, for example, has been focusing on the down-scaling techniques (Araújo et al., 2005). Hengl et al. (2008) shows how auxiliary maps can be used to downscale SRTM DEMs from 90 to 30 m resolution. Pebesma et al. (2007) use various auxiliary maps to improve detail of air pollution predictions. For the success of downscaling procedures using regression-kriging, the main issue is how to locate the samples so that extrapolation in the feature space is minimized.
2.10
At the moment, there are not many contra-arguments not to replace the existing traditional soil, vegetation, climatic, geological and similar maps with the maps produced using analytical techniques. Note that this does not mean that we should abandon the traditional concepts of ﬁeld survey and that surveyors are becoming obsolete. On the contrary, surveyors continue to be needed to prepare and collect the input data and to assess the results of spatial prediction. On the other hand, they are less and less involved in the actual delineation of features or derivation of predictions, which is increasingly the role of the predictive models. One such linear prediction techniques that is especially promoted in this handbook is regression-kriging (RK). It can be used to interpolate sampled environmental variables (both continuous and categorical) from large point sets. However, in spite of this and other attractive properties of RK, it is not as widely used in geosciences as might be expected. The barriers to widespread routine use of RK in environmental modeling and mapping are as follows. First, the statistical analysis in the case of RK is more sophisticated than for simple mechanistic or kriging techniques. Second, RK is computationally demanding28 and often cannot be run on standard PCs. The third problem is that many users are confused by the quantity of spatial prediction options, so that they are never sure which one is the most appropriate. In addition, there is a lack of user-friendly GIS environments to run RK. This is because, for many years GIS technologies and geostatistical techniques have been developing independently. Today, a border line between statistical and geographical computing is fading away, in which you will hopefully be more convinced in the remaining chapters of this guide.
2.10.1
Alternatives to RK
The competitors to RK include completely different methods that may ﬁt certain situations better. If the explanatory data is of different origin and reliability, the Bayesian Maximum Entropy approach might be a better alternative (D’Or, 2003). There are also machine learning techniques that combine neural network algorithms and robust prediction techniques (Kanevski et al., 1997). Henderson et al. (2004) used decision trees to predict various soil parameters from large quantity of soil proﬁle data and with the help of land surface and remote sensing attributes. This technique is ﬂexible, optimizes local ﬁts and can be used within a GIS. However, it is statistically suboptimal because it ignores spatial location of points during the derivation of classiﬁcation trees. The same authors (Henderson et al., 2004, pp.394–396) further reported that, although there is still some spatial correlation in the residuals, it is not clear how to employ it. Regression-kriging must also be compared with alternative kriging techniques, such as collocated cokriging, which also makes use of the explanatory information. However, collocated co-kriging is developed for situations in which the explanatory information is not spatially exhaustive (Knotters et al., 1995). CK also requires simultaneous modeling of both direct and cross-variograms, which can be time-consuming for large
28 Why does RK takes so much time? The most enduring computations are connected with derivation of distances from the new point to all sampled points. This can be speed up by setting up a smaller search radius.
59
number of covariates29 . In the case where the covariates are available as complete maps, RK will generally be preferred over CK, although CK may in some circumstances give superior results (D’Agostino and Zelenka, 1992; Goovaerts, 1999; Rossiter, 2007). In the case auxiliary point samples of covariates, in addition to auxiliary raster maps, are available, regression-kriging can be combined with co-kriging: ﬁrst the deterministic part can be dealt with the regression, then the residuals can be interpolated using co-kriging (auxiliary point samples) and added back to the estimated deterministic part of variation. 2.10.2 Limitations of RK
RK have shown a potential to become the most popular mapping technique used by environmental scientists because it is (a) easy to use, and (b) it outperforms plain geostatistical techniques. However, success of RK largely depends on characteristics of the case study i.e. quality of the input data. These are some main consideration one should have in mind when using RK: (1.) Data quality: RK relies completely on the quality of data. If the data comes from different sources and have been sampled using biased or unrepresentative design, the predictions might be even worse than with simple mechanistic prediction techniques. Even a single bad data point can make any regression arbitrarily bad, which affects the RK prediction over the whole area. (2.) Under-sampling: For regression modeling, the multivariate feature space must be well-represented in all dimensions. For variogram modeling, an adequate number of point-pairs must be available at various spacings. Webster and Oliver (2001, p.85) recommend at least 50 and preferably 300 points for variogram estimation. Neter et al. (1996) recommends at least 10 observations per predictor for multiple regression. We strongly recommend using RK only for data sets with more than 50 total observations and at least 10 observations per predictor to prevent over-ﬁtting. (3.) Reliable estimation of the covariance/regression model: The major dissatisfaction of using KED or RK is that both the regression model parameters and covariance function parameters need to be estimated simultaneously. However, in order to estimate coefﬁcients we need to know covariance function of residuals, which can only be estimated after the coefﬁcients (the chicken-egg problem). Here, we have assumed that a single iteration is a satisfactory solution, although someone might also look for other iterative solutions (Kitanidis, 1994). Lark et al. (2005) recently suggested that an iterative Restricted Maximum Likelihood (REML) approach should be used to provide an unbiased estimate of the variogram and regression coefﬁcients. However, this approach is rather demanding for 103 point data sets because for each iteration, an n × n matrix is inverted (Minasny and McBratney, 2007). (4.) Extrapolation outside the sampled feature space: If the points do not represent feature space or represent only the central part of it, this will often lead to poor estimation of the model and poor spatial prediction. For this reason, it is important that the points be well spread at the edges of the feature space and that they be symmetrically spread around the center of the feature space (Hengl et al., 2004b). Assessing the extrapolation in feature space is also interesting to allocate additional point samples that can be used to improve the existing prediction models. This also justiﬁes use of multiple predictors to ﬁt the target variable, instead of using only the most signiﬁcant predictor or ﬁrst principal component, which if, for example, advocated by the development team (Bleines et al., 2004).
Isatis
(5.) Predictors with uneven relation to the target variable: Auxiliary maps should have a constant physical relationship with the target variable in all parts of the study area, otherwise artifacts will be produced. An example is a single NDVI as a predictor of topsoil organic matter. If an agricultural ﬁeld has just been harvested (low NDVI), the prediction map will (incorrectly) show very low organic matter content within the crop ﬁeld. (6.) Intermediate-scale modeling: RK has not been adapted to ﬁt data locally, with arbitrary neighborhoods for the regression as can be done with kriging with moving window (Walter et al., 2001). Many practitioners would like to adjust the neighborhood to ﬁt their concepts of the scale of processes that are not truly global (across the whole study area) but not completely local either.
29 Co-kriging requires estimation of p + 1 variograms, plus p · (p + 1) /2 cross-variograms, where the p is the number of predictors (Knotters et al., 1995).
60
Regression-kriging
(7.) Data over-ﬁtting problems: Care needs to be taken when ﬁtting the statistical models — today, complex models and large quantities of predictors can be used so that the model can ﬁt the data almost 100%. But there is a distinction between the goodness of ﬁt and true success of prediction that cannot really be assessed without independent validation (Rykiel, 1996). If any of these problems occur, RK can give even worse results than even non-statistical, empirical spatial predictors such as inverse distance interpolation or expert systems. The difﬁculties listed above might also be considered as challenges for the geostatisticians. 2.10.3 Beyond RK
Although the bibliometric research of Zhou et al. (2007) indicates that the ﬁeld of geostatistics has already reached its peak in 1996–1998, the development of regression-kriging and similar hybrid techniques is certainly not over and the methods will continue to evolve both from theoretical and practical aspect. Gotway Crawford and Young (2008) recognizes four ‘hot’ areas of geostatistics that will receive attention in the near future: (1) geostatistics in non-euclidian space (i.e. space that accounts for barriers, streams, disease transmittion vectors etc.); (2) assessment of spatio-temporal support — spatial prediction methods will be increasingly compared at various spatial/temporal scales; users are increasingly doing predictions from point to area support and vice versa; (3) kriging is increasingly used with discrete data and uncertain data (this emphasized the importance of using Bayesian-based models), and (4) geostatistics as a tool of politics. What you can certainly anticipate in the near future considering regression-kriging connected methods are the following six developments: More sophisticated prediction models: Typically, regression-kriging is sensitive to blunders in data, local outliers and small size data sets. To avoid such problems, we will experience an evolution of methods that are more generic and more robust to be used to any type of data set. Recently, several authors suggested ways to make more sophisticated, more universally applicable BLUPs (Lark et al., 2005; Minasny and McBratney, 2007; Bárdossy and Li, 2008). We can anticipate a further development of intelligent, iterative data ﬁtting algorithms that can account for problems of local hot-spots, mixed data and poor sampling strategies. This is now one of the major focuses of the intamap project (Pebesma et al., 2009). Local regression-kriging: As mentioned previously in §2.2, local regression-kriging algorithms are yet to be developed. Integration of the local prediction algorithms (Haas, 1990; Walter et al., 2001) would open many new data analysis possibilities. For example, with local estimation of the regression coefﬁcients and variogram parameters, a user will be able to analyze which predictors are more dominant in different parts of the study area, and how much these parameters vary in space. The output of the interpolation with not be only a map of predictions, but also the maps of (local) regression coefﬁcients, R-square, variogram parameters and similar. Lloyd (2009) recently compared KED (monthly precipitation in UK) based on local variogram models and discovered that it provides more accurate predictions (as judged by cross-validation statistics) than any other ‘global’ approach. User-friendly sampling optimisation packages: Although methodologies both to plan new sampling designs, and to optimize additional sampling designs have already been tested and described (Minasny and McBratney, 2006; Brus and Heuvelink, 2007), techniques such as simulated annealing or Latin hypercube sampling are still not used in operational mapping. The recently released package now supports simulated annealing and optimization of sampling designs following the regression-kriging modeling. Development of user-friendly sampling design packages will allow mapping teams to generate (smart) sampling schemes at the click of button.
active
intamapInter-
Automated interpolation of categorical variables: So far no tool exists that can automatically generate membership maps given a point data with observed categories (e.g. soil types, land degradation types etc.). A compositional RK algorithm is needed that takes into account relationship between all categories in the legend, and then ﬁts regression models and variogram models for all classes (Hengl et al., 2007b). Intelligent data analysis reports generation: The next generation of geostatistical packages will be intelligent. It will not only generate predictions and prediction variances, but will also provide interpretation of the ﬁtted models and analysis of the intrinsic properties of the input data sets. This will include detection of possible outliers and hot-spots, robust estimation of the non-linear regression model, assessment
2.10 Final notes about regression-kriging of the quality of the input data sets and ﬁnal maps. The this direction.
61
R package automap, for example, is pointing to
Multi-temporal, multi-variate prediction models: At the moment, most of the geostatistical mapping projects in environmental sciences focus on mapping a single variable sampled in a short(er) period of time and for a local area of interest. It will not take too long until we will have a global repository of (multi-temporal) predictors (see further section 4.1) and point data sets that could then be interpolated all at once (to employ all possible relationships and cross-correlations). The future data sets will deﬁnitively be multi-temporal and multi-variate, and it will certainly ask for more powerful computers and more sophisticated spatio-temporal 3D mapping tools. Consequently, outputs of the spatial prediction models will be animations and multimedia, rather then simple and static 2D maps. Although we can observe that with the more sophisticated methods (e.g. REML approach), we are able to produce more realistic models, the quality of the output maps depends much more on the quality of input data (Minasny and McBratney, 2007). Hence, we can also anticipate that evolution of technology such as hyperspectral remote sensing and LiDAR will contribute to the ﬁeld of geostatistical mapping even more than the development of the more sophisticated algorithms. Finally, we can conclude that an unavoidable trend in the evolution of spatial prediction models will be a development and use of fully-automated, robust, intelligent mapping systems (see further §3.4.3). Systems that will be able to detect possible problems in the data, iteratively estimate the most reasonable model parameters, employ all possible explanatory and empirical data, and assist the user in generating the survey reports. Certainly, in the near future, a prediction model will be able to run more analysis with less interaction with user, and offer more information to decision makers. This might overload the inexperience users, so that practical guides even thicker than this one can be anticipated.
Further reading: Banerjee, S. and Carlin, B. and Gelfand, A. 2004. Hierarchical Modeling and Analysis for Spatial Data. Chapman & Hall/CRC, Boca Raton, 472 p. Christensen, R. 2001. Best Linear Unbiased Prediction of Spatial Data: Kriging. In: Cristensen, R. Linear Models for Multivariate, Time Series, and Spatial Data, Springer, 420 p. Hengl T., Heuvelink G. B. M., Rossiter D. G., 2007. About regression-kriging: from equations to case studies. Computers & Geosciences, 33(10): 1301–1315. Minasny, B., McBratney, A. B., 2007. Spatial prediction of soil properties using EBLUP with Matérn covariance function. Geoderma 140: 324–336. Pebesma, E. J., 1999. Gstat user’s manual. Department of Physical Geography, Utrecht University, Utrecht, 96 p. Schabenberger, O., Gotway, C. A., 2004. Statistical methods for spatial data analysis. Chapman & Hall/CRC, 524 p. Stein, M. L., 1999. Interpolation of Spatial Data: Some Theory for Kriging. Series in Statistics. Springer, New York, 247 p.
62
Regression-kriging