04 - Geostatistics: Spatial Data Analysis
Julián M. Ortiz - https://julianmortiz.com/
August 3, 2024
Summary
The statistical exploratory data analysis must be complemented
by exploring the spatial distribution of the data. In space,
stationarity needs to be assumed in order to allow for statistical
inference. The behavior of the variables in space needs
to be understood in order to make the stationarity decision
and understand its potential drawbacks. On the other hand,
we also need to define domains that will be used for esti-
mation or simulation. These domains define the set of data
used to infer the variable at unsampled locations within that
same volume. Therefore, a representative distribution must
be available within each domain to condition the inference
process. The statistical distribution of the variables must be
characterized within each domain, but oftentimes, the sam-
ples are not spatially representative, biasing the statistics
that can be inferred from them.
In this chapter, we introduce the formal notion of sta-
tionarity and discuss issues related to inference of the rep-
resentative distribution when faced with clustered or pref-
erentially sampled data. We briefly touch on intentionally
censored data.
1 Introduction
As soon as we start looking at a set of samples to infer
properties that are common to them, we enter the realm
of inference. We are interested in finding the set of samples
and the associated volume, where the variables behave in a
consistent manner. So far, we have not clearly defined this
consistency; for now, let us say that we want the variables
to hold the same properties within this volume. This
calls for an abstraction in order to go beyond the data. It
would be impossible to learn about an unsampled location,
unless we assume something about the properties of the
variables of interest at that location. The natural step is to
assume that “it behaves similarly to its neighbors”. When
we say that, we are referring to two different aspects: first,
the statistical properties are similar, and second, the spatial
properties are also similar to the neighbors.
When we think about the statistical properties, we as-
sume the value at an unsampled location should be within
the range of the known samples. We would expect it to
be similar to its neighbors: if these are high, we expect to
find a high value; if they are low, we expect a low one.
We would be more confident in this if the values in the
neighborhood are very homogeneous, with low variability.
Why would this point be any different? On the other hand,
if the samples in the neighborhood vary a lot, then we would
not be as sure about what to expect at this unsampled location.
So, it is clear that we can see a relationship between
the variable at the unsampled location and at other known
positions in space. We also realize that we cannot predict
the value without error. A natural approach to handle this
uncertainty is the use of probability theory.
We will formally introduce the concepts of random vari-
able and random function, and then we will discuss the is-
sues of stationarity and representative sampling.
2 Random variables and random functions
The value of the attribute at each location of the domain is
modeled as a random variable, characterized by a probability
distribution, while the spatial relationships of the attribute
link the values between multiple points. The collection of random variables
within a domain is known as a random function. The random
function describes the statistical properties of the random
variables within the domain and their spatial relationships.
In geostatistics, inference is aimed at characterizing the
behavior of the random function. We will see that this re-
quires a statistical and a spatial analysis.
3 Stationarity
Stationarity refers to a notion of homogeneity within a do-
main. Intuitively, we expect to find the same behavior in
any part of a domain. When we say behavior, we refer to
statistical properties and spatial properties. For example,
a homogeneous domain should have the same mean value
for the variable in different parts of that domain. We would
not expect to find a homogeneous value, that is, the same
value in every place of the domain, but we do expect to find
values within the same range. Similarly, if values show a
given level of variability in one part of the domain, we ex-
pect to find the same variability elsewhere in the domain.
The assumption of stationarity can be made at different levels:

First order stationarity: the expected value (mean) of the variable is the same at every location in the domain.

Second order stationarity: in addition to a constant mean, the covariance between the values at two locations depends only on the separation vector between them, not on their absolute positions.

Quasi second order stationarity: the first and second order conditions are required to hold only within local neighborhoods.

Strict stationarity: the full multivariate distribution of the variable at any set of locations is invariant under translation.

Second order stationarity fixes the
amount of similarity between points separated by a distance
h (this is actually a vector, with magnitude and direction) in
any position within the domain D. We will come back to this
later, once we introduce the measures of spatial continuity.
Quasi second order stationarity is a convenient approxi-
mation of the second order type of stationarity. It basically
relaxes the condition to local neighborhoods.
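For reference, these conditions can be stated compactly using the common notation $Z(u)$ for the random function at location $u$ in the domain $D$; this is the standard formalization, with the notation assumed here:

\[
\begin{aligned}
\text{first order:}\quad & E\{Z(u)\} = m, && \forall\, u \in D \\
\text{second order:}\quad & \operatorname{Cov}\{Z(u),\, Z(u+h)\} = C(h), && \forall\, u,\; u+h \in D \\
\text{strict:}\quad & F_{Z(u_1),\dots,Z(u_n)} = F_{Z(u_1+h),\dots,Z(u_n+h)}, && \text{for any } n \text{ and } h
\end{aligned}
\]

Quasi second order stationarity only requires the first two conditions to hold between locations within the same local neighborhood.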
Each level of stationarity enables inference about different statistics.
4 Domaining
The decision of stationarity is key for statistical inference.
Inference must be done using individuals (samples) that
belong to the same population (domain). Furthermore, in
space, the statistical properties of the locations used to make
inference about an unsampled location should be consis-
tent, in order to avoid bias. If the spatial properties of the
variable change with location, inference becomes problem-
atic.
In order to handle this issue, domains are defined. In the
context of modeling in geosciences, these domains must
show similar properties from the geological point of view,
and many times they should also show consistency with
regard to the response variable. For example, in mineral
deposits, a domain should group locations with a similar
grade distribution, but the mineralization may also be relevant
if performance depends on it, as in the case of metallurgical
recovery.
Stationary domains are then defined, based on the ge-
ological properties, and on the statistical properties of the
variable being studied.
Defining domains for resource estimation (not really a
recipe, but first steps that require further iterations with
geological input):
In the presence of systematic trends in the attribute:
Trends can be accommodated when enough data
exist within a small neighborhood, but become (very)
problematic when scarce data exist, for example, in early
exploration stages, where drillholes are far apart.
Defining stationary domains is a decision, a very im-
portant one, since it determines what data are used to make
inference about unsampled locations within that region. Fur-
thermore, defining the local quasi stationary neighborhoods
is as important as the definition of domains.
In practice, domains are dictated by the geological un-
derstanding. Statistical and spatial analyses are done to
assess how appropriate a stationary model is to the data.
Exploratory data analysis tools are used for this purpose. If
domains are homogeneous from a geological standpoint,
but show trends, this can be handled by setting the ap-
propriate local neighborhood when making inference about
each location.
In some cases, this is not sufficient and the trend may
need to be modeled and removed, in the hope that the
residuals obtained through this process are stationary and
capture some of the spatial structure. However, this is a
tricky balance and no rule of thumb will be provided at this
stage.
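As a minimal illustration of this idea (not the procedure prescribed in the text), the sketch below fits a linear trend by least squares and works with the residuals; the linear trend form and the synthetic data are assumptions made for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic illustration: 50 sample locations with a linear trend plus noise.
xy = rng.uniform(0, 1000, size=(50, 2))
z = 0.002 * xy[:, 0] + rng.normal(0.0, 0.3, size=50)

# Fit z ~ a + b*x + c*y by ordinary least squares.
A = np.column_stack([np.ones(len(z)), xy])
coef, *_ = np.linalg.lstsq(A, z, rcond=None)

# Residuals: the values one hopes are closer to stationary.
residuals = z - A @ coef
print("trend coefficients:", np.round(coef, 4))
print("residual mean:", round(residuals.mean(), 6))  # ~0 by construction
```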
5 Representative sampling
One of the key questions raised during the analysis of the
data is whether the sample available over the domain we
are studying is fair, in the sense that it represents the prop-
erties of the domain. Notice that the term sample is being
used here in a statistical sense. A sample is a collection
of individuals selected from the population. In the spatial
context, each location corresponds to an individual from the
population (the domain).
In sampling theory, two approaches are considered fair:
pure random sampling and regular sampling.
In the case of pure random sampling, each individual be-
longing to the sample from the population is chosen at ran-
dom. In our context, this means that each location within
the domain can be selected with equal probability. No spe-
cific area within the domain should have a higher probability
of being sampled.
On the other hand, another way of making the sample
fair is to sample in a regular fashion, so that all possible
individuals may belong to the sample (since the origin of the
sampling grid is randomly selected). This can be extended
to stratified sampling, that is, sampling randomly within
predefined regular strata. Although we are not going to get
into the details of sampling theory, as long as the variable
does not show cyclic behavior, or the strata differ in size
from the cycle, this method should also provide a fair sample.
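A small sketch contrasting the three fair schemes just described; the domain size, spacing, and sample count are arbitrary illustration choices:

```python
import numpy as np

rng = np.random.default_rng(1)
side = 1000.0    # side of a square domain (arbitrary)
spacing = 100.0  # grid spacing / stratum size (arbitrary)

# Pure random sampling: every location equally likely.
random_xy = rng.uniform(0, side, size=(100, 2))

# Regular sampling: a fixed grid with a random origin, so that every
# location has some chance of belonging to the sample.
origin = rng.uniform(0, spacing, size=2)
gx, gy = np.meshgrid(np.arange(origin[0], side, spacing),
                     np.arange(origin[1], side, spacing))
regular_xy = np.column_stack([gx.ravel(), gy.ravel()])

# Stratified sampling: one random location inside each predefined stratum.
corners = np.array([[x, y]
                    for x in np.arange(0, side, spacing)
                    for y in np.arange(0, side, spacing)])
stratified_xy = corners + rng.uniform(0, spacing, size=corners.shape)

print(len(random_xy), len(regular_xy), len(stratified_xy))
```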
In many applications in geosciences, sampling is not reg-
ular or random. This is due to different reasons: sometimes,
we cannot access all areas for sampling, but in most cases,
it is because we direct sampling to areas of interest. This
naturally biases the result towards the “interesting” values.
In mining, we tend to sample areas of high grades and
overdrill them to make sure we delineate the ore properly for
extraction. The same happens in the oil industry. High per-
meability areas are sought after for production drilling.
Therefore, in most cases, sampling will not be fair, and
attention must be paid to ensure we can infer a reasonable
estimate of the statistical distribution of the variables of in-
terest within the modeling domain. This is extremely im-
portant when using some statistical and geostatistical tools.
For example, in geostatistical simulation, the reference dis-
tribution is a key parameter. The simulated results should
reproduce the reference histogram provided, thus, if this
histogram is biased, the resulting simulations will be biased
too, and will not represent the domain they are trying to
characterize. Fortunately, in the case of estimation, kriging
provides an unbiased estimate and weighs the data based
on their spatial importance.
5.1 Declustering
There are several approaches to correct the statistical
distribution for spatial bias, that is, for clusters of data
that have been gathered preferentially in some locations.
Let us first understand what the purpose of declustering
is. Consider the following simplified example (Figure 1).
We can see that over the domain D, 8 samples have been
taken. Initially, 4 samples were taken to “explore” the do-
main and a high value was found at the lower right quad-
rant. Since this is an interesting value, infill samples
have been taken around that location (samples 5 to 8).
If we were trying to represent the statistical distribution
of values in this domain, we would consider the available
information and build a histogram. Basically, we need to
find the frequency associated with each value of the variable
z (Table 1).
We can now group the values into the bins for the histogram.
Figure 1: Clustered data
Table 1: Sample values, frequencies and weights in histogram calculation for clustered
data
Value   Frequency   Weight
1       3           3/8
10      5           5/8

The mean is then:

\[
\bar{z} = \sum_{z} z \cdot f(z) = 1 \cdot \frac{3}{8} + 10 \cdot \frac{5}{8} = 6.625 \tag{5}
\]

And the variance¹ is:

\[
\sigma_z^2 = \sum_{z} (z - \bar{z})^2 \cdot f(z) = (1 - 6.625)^2 \cdot \frac{3}{8} + (10 - 6.625)^2 \cdot \frac{5}{8} = 18.984 \tag{6}
\]
Therefore, during the calculation of any statistics, we assign
a weight to each value, linked to its frequency f(z).
The histogram is built assigning an equal weight of 1/8
to each one of the 8 samples (Figure 2).
Now, looking at the spatial configuration, we realize sam-
pling is preferential and we should “compensate” for this
fact. A very simple approach would be to interpret that the
domain is really represented in four quadrants (see Figure
3) and that the lower right quadrant has been oversampled.
¹ Notice that here we are using the estimator of the variance for large samples,
which uses 1/n instead of 1/(n − 1). The point of the example is understanding the
weights assigned to each “squared difference”.
Figure 2: Histogram of clustered data
The correction uses the same quadrants as a reference: each
quadrant is assigned 1/4 of the total weight, and samples
within each quadrant are evenly weighted. This leads
to the weight configuration displayed in Figure 4 and the
corresponding histogram (Figure 5). The declustered mean becomes:
\[
\bar{z}_{dec} = \sum_{z} z \cdot f_{dec}(z) = 3 \cdot 1 \cdot \frac{1}{4} + 5 \cdot 10 \cdot \frac{1}{4} \cdot \frac{1}{5} = 3.25 \tag{7}
\]
Figure 5: Declustered histogram
The declustered variance is:

\[
\sigma_{z,dec}^2 = \sum_{z} (z - \bar{z}_{dec})^2 \cdot f_{dec}(z) = (1 - 3.25)^2 \cdot \frac{3}{4} + (10 - 3.25)^2 \cdot \frac{1}{4} = 15.1875 \tag{8}
\]
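The four statistics above are easy to check in a few lines of code; the snippet below simply re-evaluates equations (5) to (8) with the sample values and weights of the example:

```python
import numpy as np

# Three samples of value 1 and five samples of value 10 (Figure 1).
z = np.array([1.0, 1.0, 1.0, 10.0, 10.0, 10.0, 10.0, 10.0])

# Equal weighting: 1/8 per sample, equations (5) and (6).
w = np.full(8, 1 / 8)
mean_eq = np.sum(w * z)                        # 6.625
var_eq = np.sum(w * (z - mean_eq) ** 2)        # 18.984 (rounded)

# Declustered weighting: 1/4 per lone sample, 1/20 per clustered sample,
# equations (7) and (8).
w_dec = np.array([1/4, 1/4, 1/4, 1/20, 1/20, 1/20, 1/20, 1/20])
mean_dec = np.sum(w_dec * z)                   # 3.25
var_dec = np.sum(w_dec * (z - mean_dec) ** 2)  # 15.1875

print(mean_eq, var_eq, mean_dec, var_dec)
```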
It is clear from this example that declustering is needed
to obtain statistics that are more representative of the do-
main, but we will never be sure whether these corrected
statistics actually match the reference distribution of the
domain, unless we can exhaustively access all the domain
locations.
We can see that the idea behind declustering is to mod-
ify the weight assigned to each sample when computing
statistics or when building the histogram (which amounts to
the same). Samples that are more redundant should get a
lower weight, while samples that represent a larger volume
should get a larger weight.
Notice that the change really should depend on the spa-
tial continuity of the variable. Think about a case where
the variable at different locations shows no correlation. This
means that knowing the value at one location does not in-
form about neighboring locations. The only thing we know
for sure is that all these locations belong to the same do-
main. In that case, the histogram of the variable within that
domain can be obtained by pooling together all the available
samples, no matter where they are in space, since no
sample is more redundant than the others, because there
is no spatial correlation. All samples should have an equal
weight in that case, so no correction for clusters is needed.
In a case with a significant spatial correlation (like the
one shown in the example before), the declustering weights
should penalize redundant samples more significantly. However,
the tools available for declustering are today purely
geometric, and do not take the spatial continuity into ac-
count. They are based on the idea of volume of influence to
modify the weight.
When clusters of samples are preferentially located in
high valued areas, the declustered distribution will have a
lower mean. The declustered distribution assigns a non-
uniform weight to the samples. The weights are determined
by heuristic methods. Two main approaches are used for
declustering: polygonal and cell declustering.
Polygonal declustering

Each sample receives a weight proportional to its volume of
influence. In 2D, this is achieved by defining a Voronoi
tessellation, that is, a set of convex polygons obtained by
intersecting half-planes, each bounded by the perpendicular
bisector of the segment connecting a pair of samples (the
line perpendicular to that segment passing through its
midpoint). In 3D, the same idea works by defining polyhedra
using half-spaces, bounded by the planes perpendicular to
the segments connecting pairs of samples and passing
through their midpoints.
Implementation of polygonal declustering can be easily
achieved through a numerical approximation: a regular grid
of points is defined over the volume, and the closest sample
is found for each point in the grid. The relative frequency of
points associated with each sample defines its weight. No-
tice that this can be done with arbitrarily high resolution,
by increasing the density of the grid. Also, it is important
to mention that the boundaries of the volume over which
the weights are calculated can also be defined, so it works
nicely with geological volumes, inside which we want to
decluster the samples available.
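A minimal sketch of the numerical approximation just described, in 2D with a rectangular boundary; the grid resolution and the sample coordinates are placeholders chosen for illustration:

```python
import numpy as np

def polygonal_weights(samples, xmin, xmax, ymin, ymax, nx=500, ny=500):
    """Approximate polygonal (Voronoi) declustering weights: lay a dense
    regular grid over the area and count, for each sample, the fraction
    of grid nodes that are closest to it."""
    gx, gy = np.meshgrid(np.linspace(xmin, xmax, nx),
                         np.linspace(ymin, ymax, ny))
    nodes = np.column_stack([gx.ravel(), gy.ravel()])
    # Squared distance from every grid node to every sample.
    d2 = ((nodes[:, None, :] - samples[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                 # closest sample per node
    counts = np.bincount(nearest, minlength=len(samples))
    return counts / counts.sum()                # weights sum to one

# The 8 clustered samples of Figure 1 (coordinates assumed for illustration).
samples = np.array([[2.5, 7.5], [7.5, 7.5], [2.5, 2.5],
                    [7.0, 3.0], [8.0, 3.0], [7.0, 2.0],
                    [8.0, 2.0], [7.5, 2.5]])
print(polygonal_weights(samples, 0, 10, 0, 10))
# Clustered samples receive small weights; isolated ones large weights.
```

Increasing nx and ny refines the approximation, and restricting the grid nodes to an irregular geological volume amounts to simply masking out the nodes that fall outside it.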
Polygonal declustering is highly sensitive to the size of
the domain, since samples in the boundary may get a higher
weight if the domain is enlarged.
Cell declustering

The volume is covered with a regular grid of cells. Each cell
containing samples receives the same total weight, which is
split evenly among the samples inside it, and the resulting
weights are standardized to sum to one. This generalizes the
quadrant-based correction presented before.
The result depends on the cell size and on the origin of
the grid of cells.
The cell size should be set to account for the regularity
of the underlying sampling grid (if available), aiming to have,
most of the time, one sample per cell, while expecting to find
more than one sample per cell in areas with clusters of samples.
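A minimal sketch of cell declustering along these lines; the coordinates and cell size are placeholders, and a single grid origin is used here (a fuller implementation would average the weights over several random origins, as done in the example of the next section):

```python
import numpy as np

def cell_decluster_weights(samples, cell, origin=0.0):
    """Cell declustering: each occupied cell shares one unit of weight
    equally among the samples falling inside it; the weights are then
    standardized to sum to one."""
    idx = np.floor((samples - origin) / cell).astype(int)
    _, inverse, counts = np.unique(idx, axis=0,
                                   return_inverse=True, return_counts=True)
    w = 1.0 / counts[inverse]  # weight ~ 1 / (number of samples in the cell)
    return w / w.sum()

# The 8 samples of the quadrant example (coordinates assumed for illustration).
samples = np.array([[2.5, 7.5], [7.5, 7.5], [2.5, 2.5],
                    [7.0, 3.0], [8.0, 3.0], [7.0, 2.0],
                    [8.0, 2.0], [7.5, 2.5]])
print(cell_decluster_weights(samples, cell=5.0))
# With 5 x 5 cells over a 10 x 10 domain this reproduces the quadrant
# weights: 1/4 for each isolated sample and 1/20 for each clustered one.
```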
6 Example
Continuing the example presented earlier, we will analyze
the distributions to determine reasonable domains for esti-
mation and simulation. Then, we will apply declustering and
discuss the result. If we look at the spatial distribution of the
samples, coded with the rock types, we can see the spatial
location of the different rock types (Figure 6).
Figure 7: Probability plots of copper grade for the rock types
Code   Description                      Abbreviation
4      Cascade Granodiorite             GD-CASC
20     Tourmaline Breccia               BC-TOUR
28     Monolith Breccia                 BC-MONO
29     Tourmaline-Monolith Breccia      BC-TMMN
31     Castellana Breccia               BC-CAST
34     Tourmaline-Castellana Breccia    BC-TMCT
54     Diorite                          DIORITE
For unit 20 (Tourmaline Breccia), which we will model
separately from the other rock types, we can apply cell declustering.
To show the effect of this method, we first present the
change in the mean as a function of the cell size, param-
eterized by the size of the cell over the X direction (East),
and considering a cell with anisotropy to match the “regular
sampling” found in the EDA. This means that the cell should
be 35 by 35 by 12m, in the X, Y and Z directions, respec-
tively.
We try 50 cell sizes ranging from 5 to 500m (for the X di-
mension of the cell). The weights obtained are the average
over 50 random origins of the grid used for declustering.
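The sensitivity scan just described can be sketched as follows; this reuses the cell_decluster_weights function from the previous sketch (it works unchanged in 3D), the drillhole coordinates and grades are synthetic placeholders, and the 35:35:12 cell anisotropy follows the text:

```python
import numpy as np
# cell_decluster_weights as defined in the previous sketch (works in 3D).

rng = np.random.default_rng(2)
samples = rng.uniform(0, 1000, size=(200, 3))          # placeholder composites
grades = rng.lognormal(mean=0.0, sigma=0.5, size=200)  # placeholder grades

sizes_x = np.linspace(5, 500, 50)  # 50 cell sizes for the X dimension
decl_means = []
for sx in sizes_x:
    cell = np.array([sx, sx, sx * 12.0 / 35.0])  # keep the 35:35:12 anisotropy
    trials = []
    for _ in range(50):                          # average over 50 random origins
        origin = rng.uniform(0.0, cell)
        w = cell_decluster_weights(samples, cell, origin)
        trials.append(np.sum(w * grades))
    decl_means.append(np.mean(trials))
# Plotting decl_means against sizes_x gives a curve analogous to Figure 8.
```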
Figure 8 shows the change in the declustered mean with
the cell size. This shows that the resulting statistics are
highly sensitive to the cell size selected for declustering.
Since we know that there is an underlying regular sampling
grid in our data, we can use it to define the declustering
weights. Areas with denser sampling will penalize the
weight assigned to each sample, while areas with scarce
sampling will impose a higher weight on those samples.
By running the cell declustering algorithm (using a cell
size of 35m), the corrected histogram and statistics can be
obtained (see Figure 9).
Figure 8: Declustered mean of copper grade in rock type 20 as a function of cell size
Figure 9: Histogram and probability plot of declustered copper grade for rock type 20
Finally, we can compare the declustered histogram with
the raw distribution using a quantile-quantile plot (see Fig-
ure 10).
Figure 10: Quantile-quantile plot comparing the raw and declustered distributions
(logarithmic scale used)
Figure 11: Plan views displaying declustering weights by bench