Prediction of soil properties with NIR data and
site descriptors using preprocessing and neural
networks
Matt Aitkenhead
Malcolm Coull
Jean Robertson
1
Introduction to NSIS
 A component of the Scottish Soils Database
 One of the most detailed and systematic collections of national soil
data in Europe.
 Soil Survey of Scotland produced a range of digitised and paper maps
at a number of scales from full national coverage at 1:250000 scale to
more local surveys at scales of 1:10560 or larger.
 A comprehensive database was developed that currently contains
chemical and physical information on over 13000 georeferenced soil
profiles.
 The National Soils Inventory for Scotland (NSIS) is an objective
sample of Scottish soils.
 Soil and site conditions of 183 locations throughout Scotland were
sampled using a 20km grid across the entire country (NSIS 2).
 Samples taken at multiple depths from soil pits and analysed to
determine their physical and chemical properties (approx. 800
datasets)
2
NSIS data
Inputs:
 VIS-NIR spectra (pre-processed)
 Topography (8 parameters)
 Land cover (10 parameters)
 Soil (9 classes)
 Temperature (12 monthly means)
 Rainfall (12 monthly means)
 Geology (19 classes)
Outputs:
 Aqua-regia digestion (ppm): Ag, Al, As, B, Ba, Ca, Cd, Co, Cr, Cu, Fe, Hg, K, Mg, Mn, Mo, Na, Ni, P, Pb, Pt, S, Se, Sr, Ti, Zn
 Exchangeable (meq per 100g): Al, Ca, H, Fe, K, Mg, Na
 Mn (EDTA extraction, ppm)
 P (total, derived from P2O5 ppm)
 pH (in H2O), pH (in CaCl2)
 H2O loss (105°C)
 LOI (loss on ignition, 450°C), LOI (loss on ignition, 900°C)
3
NIR data - introduction
 WinISI software used for on-board analysis of spectra
4
NIR data - example
[Figure: example NIR spectrum; x-axis 1000 to 2600 nm, y-axis 0 to 0.9]
5
Experimental design
 Multiple steps, based on on-going NIR/FTIR soil work:
 Moving window transform
 Derivative transform
 Normalisation
 Input subsampling
 Neural network layer size
 3600 combinations explored:
 Moving window/derivative transform first
 Moving window size of 5, 10, 20, 50, 100
 Derivative transform options of (1) no derivative, (2) 1st derivative, (3) 2nd
derivative, (4) Savitzky-Golay 0-order, (5) S-G 1st order, (6) S-G 2nd order
 Spectral normalisation over either entire range of values, or by min/max
for each spectrum
 Dataset subsampling rate of 1, 2, 5, 10, 20 or 50
 NN hidden layer sizes of 5, 10, 20, 50 or 100
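The option counts above multiply out to the stated 3600 if "Moving window/derivative transform first" is read as a two-way choice of order. That reading is an assumption, as are all the names below. A minimal sketch of enumerating the grid:

```python
from itertools import product

# Options as listed on the slide; the labels are illustrative only.
ORDERS = ("window_first", "derivative_first")  # assumed two-way choice
WINDOW_SIZES = (5, 10, 20, 50, 100)
DERIVATIVES = ("none", "d1", "d2", "sg0", "sg1", "sg2")
NORMALISATIONS = ("global_range", "per_spectrum_minmax")
SUBSAMPLING = (1, 2, 5, 10, 20, 50)
HIDDEN_SIZES = (5, 10, 20, 50, 100)

def all_combinations():
    """Yield one dict per preprocessing/network configuration."""
    keys = ("order", "window", "derivative",
            "normalisation", "subsample", "hidden")
    for values in product(ORDERS, WINDOW_SIZES, DERIVATIVES,
                          NORMALISATIONS, SUBSAMPLING, HIDDEN_SIZES):
        yield dict(zip(keys, values))

combos = list(all_combinations())
print(len(combos))  # 2 * 5 * 6 * 2 * 6 * 5 = 3600
```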
6
Moving window/smoothing
7
 Smoothing/derivation of the spectra prior to
interpretation is common
 Reduces noise and accentuates useful data
 Many different smoothing/derivative functions exist
 Using a ‘moving window’ subtraction makes peaks stand
out from their surroundings
 Which should be chosen?
 Moving window – what radius of window?
 Smoothing/derivative – what function?
 Both? In what order?
Moving window
[Figure: a spectrum before (y-axis 0 to 0.9) and after (y-axis -0.06 to 0.08) the moving window transform with window radius 20; x-axis 1000 to 2500 nm]
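The slides do not say what is subtracted inside the moving window. The sketch below assumes it is the mean of the neighbouring values within the given radius, which leaves a flat spectrum at zero and makes a local peak stand out. The function name and the edge handling are illustrative choices.

```python
def moving_window_subtract(spectrum, radius):
    """Subtract from each point the mean of the values within `radius` of it.

    Assumption: the window mean is the quantity subtracted. The window is
    truncated at both ends of the spectrum.
    """
    n = len(spectrum)
    out = []
    for i in range(n):
        lo = max(0, i - radius)
        hi = min(n, i + radius + 1)
        window = spectrum[lo:hi]
        out.append(spectrum[i] - sum(window) / len(window))
    return out

flat = moving_window_subtract([0.5] * 10, radius=2)
peak = moving_window_subtract([0.0, 0.0, 1.0, 0.0, 0.0], radius=1)
```

Here `flat` is zero everywhere, while in `peak` the centre value is positive and its two neighbours are negative.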
8
Smoothing/derivation
[Figure: a spectrum before (y-axis -0.06 to 0.08) and after (y-axis -0.008 to 0.01) a 1st order derivative; x-axis 1000 to 2500 nm]
9
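A minimal sketch of the two kinds of derivative transform, using the standard 5-point Savitzky-Golay coefficients for a quadratic fit. The 5-point window, the function names and the dropping of end points are assumptions for illustration; the filter widths actually used are not given in the slides.

```python
# Standard 5-point Savitzky-Golay coefficients for a quadratic fit
# (unit sample spacing). The 5-point window is an illustrative choice.
SG5 = {
    0: [c / 35 for c in (-3, 12, 17, 12, -3)],  # smoothing (0-order)
    1: [c / 10 for c in (-2, -1, 0, 1, 2)],     # 1st derivative
    2: [c / 7 for c in (2, -1, -2, -1, 2)],     # 2nd derivative
}

def savgol5(spectrum, order):
    """Apply the 5-point filter; the two points at each end are dropped."""
    coeffs = SG5[order]
    return [
        sum(c * spectrum[i + k - 2] for k, c in enumerate(coeffs))
        for i in range(2, len(spectrum) - 2)
    ]

def first_difference(spectrum):
    """Plain (unsmoothed) 1st derivative by central differences."""
    return [(spectrum[i + 1] - spectrum[i - 1]) / 2
            for i in range(1, len(spectrum) - 1)]

parabola = [x * x for x in range(-3, 4)]  # 9, 4, 1, 0, 1, 4, 9
slope = savgol5(parabola, order=1)        # close to -2, 0, 2
curvature = savgol5(parabola, order=2)    # close to 2, 2, 2
```

Derivatives here are per sample step rather than per nm; dividing by the wavelength spacing would convert them.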
Normalisation
[Figure: a spectrum before (y-axis -0.008 to 0.01) and after (y-axis 0 to 1) normalisation; x-axis 1000 to 2500 nm]
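The two normalisation options from the experimental design can be sketched as below. Reading "entire range of values" as the minimum and maximum over the whole dataset is an interpretation, and the function names are hypothetical.

```python
def normalise_per_spectrum(spectrum):
    """Rescale one spectrum to [0, 1] using its own min and max."""
    lo, hi = min(spectrum), max(spectrum)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in spectrum]  # flat spectrum: nothing to scale
    return [(v - lo) / span for v in spectrum]

def normalise_global(spectra):
    """Rescale all spectra using the min and max of the whole dataset."""
    lo = min(min(s) for s in spectra)
    hi = max(max(s) for s in spectra)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in s] for s in spectra]
    return [[(v - lo) / span for v in s] for s in spectra]

# Two made-up derivative spectra, for illustration only.
spectra = [[-0.008, 0.0, 0.01], [0.0, 0.002, 0.004]]
per_spectrum = [normalise_per_spectrum(s) for s in spectra]
global_scaled = normalise_global(spectra)
```

With per-spectrum scaling every spectrum spans the full 0 to 1 range; with global scaling the second spectrum above stays well inside it.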
10
Sampling
[Figure: a normalised spectrum before and after subsampling; x-axis 1000 to 2500 nm, y-axis 0 to 1]
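A minimal sketch of input subsampling, assuming it means keeping every n-th value of the pre-processed spectrum (simple decimation rather than block averaging, which the slides do not specify). The 700-value length is taken from the network design slide.

```python
def subsample(spectrum, rate):
    """Keep every `rate`-th value, starting with the first."""
    if rate < 1:
        raise ValueError("rate must be a positive integer")
    return spectrum[::rate]

spectrum = list(range(700))  # stand-in for a 700-value pre-processed spectrum
sizes = {rate: len(subsample(spectrum, rate)) for rate in (1, 2, 5, 10, 20, 50)}
print(sizes)  # {1: 700, 2: 350, 5: 140, 10: 70, 20: 35, 50: 14}
```

The number of network input nodes shrinks accordingly, from 700 at rate 1 to 14 at rate 50.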
11
Neural network modelling
 Simplified model of biological
learning
 Useful for large, ‘messy’ datasets
 Can handle large numbers of
input and output parameters
 Backpropagation training method
 Relatively old, simple NN
approach
 Based on error minimisation
 Standard for data
mining/modelling
 Allows the ‘black box’ to be
opened
12
Neural network design/training
 One input node for each value in the
pre-processed spectrum (700)
 Additional nodes can be added if
other input data is to be used
 Two hidden layers of 100 nodes each
 One output node for each of the
output parameters (40)
 Dataset split into training/testing
(75/25) at random
 Testing at every 1000 training steps
to find optimal network
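The random 75/25 split and the size of the network described above can be sketched as follows. The fixed seed, the function names and the inclusion of one bias per node are assumptions; the slides give only the layer widths and the split ratio.

```python
import random

def split_train_test(samples, train_fraction=0.75, seed=0):
    """Shuffle a copy of `samples` and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Layer widths from the slide: 700 inputs, two hidden layers of 100, 40 outputs.
LAYER_SIZES = (700, 100, 100, 40)

def count_weights(layer_sizes):
    """Connection weights plus one bias per non-input node (bias assumed)."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

train, test = split_train_test(range(800))  # approx. 800 datasets in NSIS
n_params = count_weights(LAYER_SIZES)       # 70100 + 10100 + 4040 = 84240
```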
13
Statistical evaluation
 Statistics of predictive accuracy:
 R-squared
 RMSE
 MAE
 ME
 Weighting of network input/output relationships
 Partial derivatives method (Olden & Jackson, 2002; Olden
et al., 2004)
 Looks at the relationships between every input/output
parameter combination
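The four accuracy statistics listed above can be computed as in this sketch. Two conventions are assumed because the slides do not state them: R-squared is taken as the coefficient of determination (not the squared correlation), and the error is predicted minus observed, so a positive ME means over-prediction.

```python
import math

def accuracy_stats(observed, predicted):
    """Return R-squared, RMSE, MAE and ME for paired values."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return {
        "r2": 1 - ss_res / ss_tot,          # coefficient of determination
        "rmse": math.sqrt(ss_res / n),      # root mean squared error
        "mae": sum(abs(e) for e in errors) / n,  # mean absolute error
        "me": sum(errors) / n,              # mean error (bias)
    }

# Made-up values, for illustration only.
stats = accuracy_stats([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
```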
14
Variation in results
 Neural network can underfit or overfit the data
 Underfitting if not sufficiently trained
 Overfitting if trained too well on the training data
 Need to identify ‘stopping point’
 Testing data (separate from training data) used for this
[Figure: training data and testing data series; x-axis 0 to 10, y-axis 0 to 40]
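One way to pick the stopping point is to keep the network from whichever test checkpoint had the lowest testing error. The sketch below assumes one test every 1000 training steps, as on the design slide; the function name and the error values are made up for illustration.

```python
def best_checkpoint(test_errors, steps_per_test=1000):
    """Return (training_step, error) of the checkpoint with lowest test error."""
    best_index = min(range(len(test_errors)), key=test_errors.__getitem__)
    return (best_index + 1) * steps_per_test, test_errors[best_index]

# Made-up test-set errors, one per 1000 training steps.
errors = [30.0, 22.0, 17.5, 15.0, 14.2, 14.8, 16.1, 18.0]
step, err = best_checkpoint(errors)
print(step, err)  # 5000 14.2
```

Before that step the network is still underfitting; after it, the rising test error indicates overfitting to the training data.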
15
Best preprocessing algorithm
16
 Moving window first, with window radius 50
 Then 1st-order Savitzky-Golay smoothing
 Normalisation by min-max range for each spectrum
 Minimise data subsampling (no subsampling at all is best)
 Maximise NN hidden layer size (100 was largest used)
 Demonstrable variation in results between experimental
combinations:
 All statistical measures varied greatly between worst and
best combinations
 Trends seen in subsampling & NN hidden layer size effects
Best results (aqua regia r-squared)
17
[Figure: R-squared (0 to 1) for Ag, Al, As, B, Ba, Ca, Cd, Co, Cr, Cu, Fe, Hg, K, Mg, Mn, Mo, Na, Ni, P, Pb, Pt, S, Se, Sr, Ti, Zn]
Best results (exchangeable r-squared)
18
[Figure: R-squared (0 to 1) for Al, Ca, H, Fe, K, Mg, Na]
Best results (other r-squared)
19
[Figure: R-squared (0 to 1) for LOI (450C), LOI (900C), Mn (EDTA), H2O (105C), pH (CaCl2), pH (H2O), P (P2O5)]
Site characterisation
 Environmental factors influence the character of the soil
 Topography
 Vegetation
 Climate
 Geology
 Sample locations were recorded to within 10m accuracy
(in most cases!)
 With sufficiently large dataset, can be used to develop an
‘environment-specific’ calibration of the model
 NN approach is sufficiently flexible to incorporate this
information ‘automatically’
20
Inclusion of site character
 8 extra input parameters for topography
 Elevation, slope, curvature, curve-plan, curve-profile, aspect, aspect-east,
aspect-north
 20 extra input parameters for vegetation
 10 classes for each of 2 land cover maps (LCS88 & LCM2007)
 Cropland, improved grassland, rough grassland, deciduous, coniferous, peat,
heath, bare, water, montane
 9 extra input parameters for soil
 Alluvial, alpine, bare, brown earth, gley, peat, podzol, lithosol, regosol
 24 extra input parameters for climate
 Monthly means for temperature and rainfall
 19 extra input parameters for geology
 Derived from geological information produced during soil survey work (Lilly,
Towers and others)
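The group sizes above add up to the 80 extra inputs used in the combined model. The sketch below checks that total and shows one possible encoding of a class-type descriptor. One-hot encoding is an assumption: the slides give only the number of inputs per group, not how class membership was coded.

```python
GROUP_SIZES = {
    "topography": 8,   # elevation, slope, curvature, ...
    "vegetation": 20,  # 10 classes x 2 land cover maps
    "soil": 9,
    "climate": 24,     # 12 monthly temperature + 12 monthly rainfall means
    "geology": 19,
}

SOIL_CLASSES = ("alluvial", "alpine", "bare", "brown earth", "gley",
                "peat", "podzol", "lithosol", "regosol")

def one_hot(value, classes):
    """Encode a class label as a 0/1 vector with a single 1 (assumed coding)."""
    if value not in classes:
        raise ValueError(f"unknown class: {value!r}")
    return [1.0 if c == value else 0.0 for c in classes]

n_site_inputs = sum(GROUP_SIZES.values())  # 80
n_total_inputs = 700 + n_site_inputs       # 700 spectral values + 80 = 780
podzol = one_hot("podzol", SOIL_CLASSES)
```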
21
Modelling with all of the data
 80 extra input nodes for 80 extra input parameters
 Identical training regime
 Identical NN architecture
 Sensitivity analysis to identify important input parameters
(spectroscopy inputs included in this)
 Site characterisation derived from existing spatial
datasets, all adjusted to 100m resolution
22
Changes in the results
23
[Figure: R2 (spectra only) compared with R2 (spectra & site); y-axis 0 to 1]
Specific parameter (LOI450)
 Almost suspiciously good!
 So I went back and checked
 R-squared (all inputs) of 0.974
 Predictions within 1% LOI of the measured value >90% of the time for samples with LOI < 20%
 Can still be out by up to 4% in this range...
 Accuracy better at low and high LOI values, slightly worse
in the middle range
 Overall RMSE: 0.046
 Overall MAE: 0.035
 Overall ME: 0.001
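A check like "within 1% more than 90% of the time for LOI < 20%" can be computed as the share of samples whose absolute error is inside a tolerance, restricted to a range of observed values. This is a sketch of that calculation, not the checking procedure actually used; the function name and the LOI values are made up.

```python
def fraction_within(observed, predicted, tolerance, max_observed=None):
    """Share of samples whose absolute error is <= tolerance.

    If max_observed is given, only samples with observed < max_observed count.
    """
    pairs = [(o, p) for o, p in zip(observed, predicted)
             if max_observed is None or o < max_observed]
    if not pairs:
        raise ValueError("no samples in the requested range")
    hits = sum(1 for o, p in pairs if abs(p - o) <= tolerance)
    return hits / len(pairs)

# Made-up LOI values (%), for illustration only.
obs = [2.0, 5.0, 10.0, 15.0, 60.0]
pred = [2.4, 5.9, 10.2, 17.5, 75.0]
share = fraction_within(obs, pred, tolerance=1.0, max_observed=20.0)
print(share)  # 0.75
```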
24
Sensitivity analysis
 Several inputs in the spectra/environmental data have
relatively high mean absolute or maximum weightings
 No clear pattern or clustering of ‘important’ inputs
 Environmental inputs no more important than spectra
25
[Figure: two panels, mean weighting (y-axis 0 to 0.03) and maximum weighting (y-axis 0 to 0.12); x-axis 0 to 800 in both]
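Mean absolute and maximum weightings per input can be approximated for any trained model by finite differences, as sketched below. This is a generic stand-in for a partial-derivatives sensitivity analysis and is not claimed to reproduce the cited method; `predict` is a hypothetical callable that maps a list of inputs to a list of outputs.

```python
def input_sensitivities(predict, samples, eps=1e-4):
    """Mean and max absolute d(output)/d(input) per input.

    Derivatives are estimated by central differences and pooled over
    all samples and all outputs.
    """
    n_inputs = len(samples[0])
    totals = [0.0] * n_inputs
    maxima = [0.0] * n_inputs
    n_terms = 0
    for x in samples:
        for i in range(n_inputs):
            up, down = list(x), list(x)
            up[i] += eps
            down[i] -= eps
            grads = [abs(a - b) / (2 * eps)
                     for a, b in zip(predict(up), predict(down))]
            totals[i] += sum(grads)
            maxima[i] = max(maxima[i], max(grads))
        n_terms += len(grads)  # one term per output for this sample
    means = [t / n_terms for t in totals]
    return means, maxima

def toy_model(x):
    """Two-output linear toy model, for illustration only."""
    return [3 * x[0], x[0] - 2 * x[1]]

means, maxima = input_sensitivities(toy_model, [[0.1, 0.2], [0.5, 0.9]])
```

For the toy model the first input has a mean weighting near 2 and a maximum near 3, and the second a mean near 1 and a maximum near 2.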
Ongoing and future work
26
 Redo the sensitivity analysis using other approaches
 Current sensitivity analysis is noisy, tells us less than it could
 Comparison of prediction accuracies with standard approaches
 Jean Robertson’s analysis for matching these soil samples
 Literature, for wider comparison
 Local calibration – automated real-time stratification based on
site characteristics (real-time model training, testing)
 LUCAS data analysis (for the future!)
 Similar approach as described here
 Need to develop site descriptor data (topography, climate,
vegetation, geology, soil type)
A potential side-route?
27
 SOCIT mobile phone app
(iPhone/Android)
 Estimates soil OM and soil C
using mobile phone imagery
& site descriptors
 LUCAS spectroscopy could
be used to produce RGB
estimates
 A soil C estimation app for
Europe?
Conclusions
 Some soil parameters can be predicted ‘well’ using NIR data
 Depends on your definition of ‘well predicted’
 Mg, Na, S, Ti, H, Fe, Mn, LOI, H2O, pH all have R-squared above 0.75
 C (0.94), N (0.88) also found to be predicted well in ongoing study
 P (0.72), K (0.48) not predicted so well
 Some important parameters not predicted so well (totals generally
better than exchangeables)
 Preprocessing of the spectral data can improve the prediction
accuracy if done appropriately
 Inclusion of site characteristics improves prediction accuracy
 Predictions can be made using a trained network in <5 seconds
28

Prediction of soil properties with NIR data and site descriptors using preprocessing and neural networks - Matt Aitkenhead, Malcolm Coull, Jean Robertson, James Hutton Institute
