
Confidence decision trees via online and active learning for streaming data / R. De Rosa, N. Cesa-Bianchi. - In: THE JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH. - ISSN 1076-9757. - 60(2017), pp. 1031-1055. [10.1613/jair.5440]

Confidence decision trees via online and active learning for streaming data

R. De Rosa;N. Cesa-Bianchi
2017

Abstract

Decision tree classifiers are a widely used tool in data stream mining. The use of confidence intervals to estimate the gain associated with each split leads to very effective methods, like the popular Hoeffding tree algorithm. From a statistical viewpoint, the analysis of decision tree classifiers in a streaming setting requires knowing when enough new information has been collected to justify splitting a leaf. Although some of the issues in the statistical analysis of Hoeffding trees have already been clarified, a general and rigorous study of confidence intervals for splitting criteria is missing. We fill this gap by deriving accurate confidence intervals to estimate the splitting gain in decision tree learning with respect to three criteria: entropy, Gini index, and a third index proposed by Kearns and Mansour. We also extend our confidence analysis to a selective sampling setting, in which the decision tree learner adaptively decides which labels to query in the stream. We provide theoretical guarantees bounding the probability that the decision tree learned via our selective sampling strategy classifies the next example in the stream suboptimally. Experiments on real and synthetic data in a streaming setting show that our trees are indeed more accurate than trees with the same number of leaves generated by state-of-the-art techniques. In addition, our active learning module empirically uses fewer labels without significantly hurting performance.
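To make the quantities in the abstract concrete, the sketch below computes the entropy and Gini splitting criteria, the impurity reduction (splitting gain) of a candidate split, and a Hoeffding confidence-interval half-width of the kind used by Hoeffding tree algorithms to decide when enough examples have been seen to split a leaf. This is a minimal illustration of the standard definitions, not the corrected intervals derived in the paper; all function names are the sketch's own.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a multiset of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a multiset of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(labels, left, right, criterion=entropy):
    """Impurity reduction obtained by splitting `labels` into `left`/`right`."""
    n = len(labels)
    return (criterion(labels)
            - (len(left) / n) * criterion(left)
            - (len(right) / n) * criterion(right))

def hoeffding_bound(value_range, delta, n):
    """Half-width of a Hoeffding confidence interval after n observations:
    with probability at least 1 - delta, the empirical mean of a quantity
    taking values in an interval of width `value_range` is within this
    bound of its true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))
```

A Hoeffding tree splits a leaf on its best attribute when the gap between the best and second-best empirical gains exceeds `hoeffding_bound(R, delta, n)`, where `R` bounds the range of the criterion (e.g. `log2(num_classes)` for entropy) and `n` is the number of examples seen at that leaf.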
Field INF/01 - Computer Science
2017
Article (author)
Files in this record:

File: live-5440-10346-jair.pdf
Access: restricted
Type: Publisher's version/PDF
Size: 432.94 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/541890
Citations:
  • Scopus: 11
  • Web of Science: 1