Comparison between geostatistical and machine learning models as predictors of topsoil organic carbon with a focus on local uncertainty estimation

Veronesi, F.; Schillaci, C.

doi:10.1016/j.ecolind.2019.02.026

In recent years, the environmental modeling community has moved away from kriging as the main mapping algorithm and embraced machine learning (ML) as the go-to method for spatial prediction. A drawback of this shift has been a gradual decline in the number of papers that present and map uncertainty alongside estimates of the target variables, because computing local uncertainty can be challenging for some ML algorithms. This gap has recently been identified in the literature as one of the key areas in digital soil mapping (DSM) where progress is most needed. The main objective of this work is to compare geostatistical techniques, ML methods and hybrid methods (e.g., regression kriging) in terms of not only their overall accuracy but also their precision in providing useful confidence intervals at unsampled locations. We aim to provide clear application guidelines for future mapping exercises. For this experiment, we used a legacy soil dataset (n = 414) of topsoil observations from the semi-arid Mediterranean region of Sicily. This dataset was collected in a 2008 survey with a pedo-landscape sampling design; hence, it is ideal for comparing geostatistics and ML. In the comparison, we included algorithms that have been widely adopted in the literature: ordinary and universal kriging, linear regression, random forest (RF), quantile regression forest (QRF), boosted regression trees (BRT) and hybrid forms of kriging (e.g., regression kriging with RF and BRT used as regressors). To evaluate the accuracy of each algorithm, a validation test based on the random exclusion of 25% of the samples was repeated 100 times. In addition, we performed a transferability test, in which the locations with the largest nearest-neighbor distances were excluded from training and re-predicted. The validation results demonstrate that ordinary and universal kriging are the best performers, followed closely by RF and QRF.
In terms of local uncertainty, RF and QRF provide confidence intervals that most often include the observed values of soil organic carbon (SOC). However, both produce very wide confidence intervals, which may be problematic in some studies. Other algorithms, such as boosted regression trees and boosted regression kriging, performed slightly worse on this dataset but produced narrower ranges of uncertainty. Hence, they may be more attractive, since their estimates are very robust against changes and noise in the predictors.
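
The validation design described in the abstract (100 random exclusions of 25% of the samples, plus a transferability test on the most isolated locations) can be illustrated with a minimal sketch. This is not the authors' code: the data are synthetic, the model is a plain least-squares stand-in with a constant-width residual interval, and the function names, the nominal 90% interval level, and `n_exclude=40` are all hypothetical choices made for illustration.

```python
import numpy as np


def repeated_holdout(X, y, fit_predict, n_repeats=100, test_frac=0.25, seed=0):
    """Repeat a random hold-out split; average RMSE, interval coverage and width.

    fit_predict(X_train, y_train, X_test) must return (mean, lower, upper).
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = int(round(test_frac * n))
    rmse, coverage, width = [], [], []
    for _ in range(n_repeats):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        mean, lower, upper = fit_predict(X[train], y[train], X[test])
        rmse.append(np.sqrt(np.mean((y[test] - mean) ** 2)))
        # Share of held-out observations that fall inside their interval
        coverage.append(np.mean((y[test] >= lower) & (y[test] <= upper)))
        width.append(np.mean(upper - lower))
    return {
        "rmse": float(np.mean(rmse)),
        "coverage": float(np.mean(coverage)),
        "width": float(np.mean(width)),
    }


def most_isolated(coords, n_exclude):
    """Indices of the n_exclude points with the largest nearest-neighbor distance."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # ignore each point's distance to itself
    nearest = dist.min(axis=1)
    return np.argsort(nearest)[-n_exclude:]


def linear_with_interval(X_train, y_train, X_test, z=1.645):
    """Stand-in model (assumption): least squares with a ~90% residual interval."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    resid_sd = np.std(y_train - A @ beta, ddof=A.shape[1])
    mean = np.column_stack([np.ones(len(X_test)), X_test]) @ beta
    return mean, mean - z * resid_sd, mean + z * resid_sd


# Synthetic stand-in data with the same sample size (not the Sicilian dataset)
rng = np.random.default_rng(42)
coords = rng.uniform(0, 100, size=(414, 2))
X = rng.normal(size=(414, 3))
y = 1.5 + X @ np.array([0.8, -0.5, 0.3]) + rng.normal(scale=0.4, size=414)

summary = repeated_holdout(X, y, linear_with_interval)
isolated = most_isolated(coords, n_exclude=40)  # candidates for the transferability test
```

Reporting coverage together with mean interval width reflects the trade-off the abstract points to: intervals that almost always contain the observation can still be too wide to be useful. Any model that returns a prediction with lower and upper bounds could be passed as `fit_predict` in place of the stand-in.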

Comparison between geostatistical and machine learning models as predictors of topsoil organic carbon with a focus on local uncertainty estimation / F. Veronesi, C. Schillaci. - In: ECOLOGICAL INDICATORS. - ISSN 1470-160X. - 101(2019 Jun), pp. 1032-1044. [10.1016/j.ecolind.2019.02.026]


C. Schillaci (co-first author; Conceptualization)

2019



Keywords: Boosted regression trees; Digital soil mapping; Kriging; Local uncertainty; Machine learning; Random forest; Regression kriging
Scientific-disciplinary sectors of the article (display only): AGR/02 - Agronomy and Herbaceous Crops; AGR/14 - Pedology; GEO/04 - Physical Geography and Geomorphology
Publication date: Jun-2019
Journal in ANCE: ECOLOGICAL INDICATORS
DOI: https://dx.doi.org/10.1016/j.ecolind.2019.02.026
Type: Article (author)
Appears in types: 01 - Journal article

Files in this item:

File	Size	Format
Veronesi_Schillaci_2019_comparison between geostatistical and machine learning models SOC.pdf (restricted access; type: Publisher's version/PDF)	2.76 MB	Adobe PDF


Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/662023
