
Confidence decision trees via online and active learning for streaming data / R. De Rosa, N. Cesa-Bianchi. - In: THE JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH. - ISSN 1076-9757. - 60(2017), pp. 1031-1055. [10.1613/jair.5440]

Confidence decision trees via online and active learning for streaming data

R. De Rosa;N. Cesa-Bianchi
2017

Abstract

Decision tree classifiers are a widely used tool in data stream mining. The use of confidence intervals to estimate the gain associated with each split leads to very effective methods, like the popular Hoeffding tree algorithm. From a statistical viewpoint, the analysis of decision tree classifiers in a streaming setting requires knowing when enough new information has been collected to justify splitting a leaf. Although some of the issues in the statistical analysis of Hoeffding trees have already been clarified, a general and rigorous study of confidence intervals for splitting criteria is missing. We fill this gap by deriving accurate confidence intervals to estimate the splitting gain in decision tree learning with respect to three criteria: entropy, Gini index, and a third index proposed by Kearns and Mansour. We also extend our confidence analysis to a selective sampling setting, in which the decision tree learner adaptively decides which labels to query in the stream. We provide theoretical guarantees bounding the probability that the decision tree learned via our selective sampling strategy classifies the next example in the stream suboptimally. Experiments on real and synthetic data in a streaming setting show that our trees are indeed more accurate than trees with the same number of leaves generated by state-of-the-art techniques. In addition, our active learning module empirically uses fewer labels without significantly hurting performance.
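To make the quantities in the abstract concrete, the sketch below computes the entropy and Gini splitting criteria, the impurity reduction (splitting gain) of a candidate split, and a Hoeffding confidence-interval half-width of the kind used by Hoeffding tree algorithms to decide when enough examples have been seen to split a leaf. This is a minimal illustration of the standard definitions, not the corrected intervals derived in the paper; all function names are the sketch's own.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a multiset of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity of a multiset of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(labels, left, right, criterion=entropy):
    """Impurity reduction obtained by splitting `labels` into `left`/`right`."""
    n = len(labels)
    return (criterion(labels)
            - (len(left) / n) * criterion(left)
            - (len(right) / n) * criterion(right))

def hoeffding_bound(value_range, delta, n):
    """Half-width of a Hoeffding confidence interval after n observations:
    with probability at least 1 - delta, the empirical mean of a quantity
    taking values in an interval of width `value_range` is within this
    bound of its true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))
```

A Hoeffding tree splits a leaf on its best attribute when the gap between the best and second-best empirical gains exceeds `hoeffding_bound(R, delta, n)`, where `R` bounds the range of the criterion (e.g. `log2(num_classes)` for entropy) and `n` is the number of examples seen at that leaf.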
Field INF/01 - Computer Science
2017
Article (author)
Files in this record:

File: live-5440-10346-jair.pdf
Access: restricted
Type: Publisher's version/PDF
Size: 432.94 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/541890
Citations:
  • Scopus: 11
  • Web of Science: 1