High-dimensional and ultrahigh-dimensional variable selection is a formidable challenge in biomedical statistics. A number of promising approaches to this problem have recently been proposed, and penalized regression is among the most attractive. This class of procedures has many advantages but is still not widely used, mainly because of its computational cost. In this work, we focus on these techniques and on their applications to genome-wide association studies (GWAS). The first chapter gives an overview of some of the most interesting penalization methods: Lasso, Least Angle Regression, Elastic Net, Adaptive Lasso, SCAD, Combined Penalization and Relaxed Lasso. For each technique, we present the main theoretical ideas behind the method and examine its pros and cons compared with the other methods. An important open problem in the field of L1-penalized regression is the construction of confidence intervals for the model coefficients. A popular approach to calculating confidence intervals is to use bootstrap simulation algorithms. In the second chapter, we investigate four bootstrap methods for regression models (parametric bootstrap, vector resampling, residual bootstrap, and two variants of one-step bootstrap: vector and residual resampling) and we consider their application to generalized linear models (GLMs). We review how they work, describe their implementation algorithms and evaluate their performance by simulation studies. In the third chapter, we start by considering the residual bootstrap method for the lasso estimator of the regression parameters in a multiple linear regression model, recently proposed by Chatterjee and Lahiri (2010). In the following section we extend this idea to penalized GLMs, using the notion of standardized Pearson's residuals. The results of the simulation studies show that this method has some serious drawbacks.
As a result, in the sections that follow, we explore a completely different approach based on the fact that the lasso coefficients for linear models can be approximated by ridge regression. After generalizing this result to L1-penalized GLMs, we develop a one-step (residual) resampling method for this class of models in the spirit of the one-step bootstrap for GLMs proposed by Moulton and Zeger (1991). Then, applying the results of Vinod (1995), we build confidence intervals (CIs) for the coefficients of L1-penalized GLMs. The simulation studies suggest that this method yields CIs with good empirical coverage probabilities. In the final section, we consider the double bootstrap of Beran (1987) in order to further reduce the coverage errors of the single bootstrap and to build confidence intervals with a higher order of accuracy. Chapter four contains an overview of one of the most challenging and fascinating problems of modern biomedical statistics: ultrahigh-dimensional variable selection in GWAS and gene-environment-wide interaction (GEWI) studies, with particular attention to the evaluation of gene-gene and gene-environment interactions. This is a fundamental task in the investigation of complex patterns underlying complex diseases. Sure Independence Screening (SIS), Iterative SIS (ISIS) and their variants are novel and effective methods for variable selection in ultrahigh-dimensional settings. They are based on a prescreening step for dimension reduction followed by a selection/estimation step performed using L1-penalized regression. We test this approach on a simulated dataset, with interesting results.
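The two-stage idea behind SIS described above (marginal prescreening, then L1-penalized selection on the retained predictors) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the thesis: the function names `sis_screen` and `lasso_cd`, the simulated data, the number of retained predictors (20) and the penalty value (0.5) are all hypothetical choices, the model is a plain linear one rather than a GLM, and the lasso is solved by a basic coordinate descent.

```python
import random


def standardize(column):
    """Center a column and scale it to unit (population) variance."""
    n = len(column)
    mean = sum(column) / n
    sd = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / sd for v in column]


def sis_screen(columns, y, keep):
    """Prescreening: rank predictors by |marginal correlation| with y
    and return the indices of the `keep` highest-ranked ones."""
    n = len(y)
    y_std = standardize(y)
    scores = []
    for j, col in enumerate(columns):
        corr = sum(a * b for a, b in zip(standardize(col), y_std)) / n
        scores.append((abs(corr), j))
    scores.sort(reverse=True)
    return sorted(j for _, j in scores[:keep])


def soft_threshold(z, lam):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(columns, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1,
    with y centered and every column of X standardized."""
    n = len(y)
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]
    cols = [standardize(c) for c in columns]
    beta = [0.0] * len(cols)
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            # Correlation of x_j with the partial residual (x_j has unit variance).
            rho = sum(x * r for x, r in zip(col, resid)) / n + beta[j]
            new = soft_threshold(rho, lam)
            if new != beta[j]:
                delta = new - beta[j]
                resid = [r - x * delta for x, r in zip(col, resid)]
                beta[j] = new
    return beta


# Simulated example (assumed setup): 200 samples, 500 predictors,
# of which only predictors 0, 1 and 2 affect the response.
random.seed(0)
n, p = 200, 500
columns = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [3 * columns[0][i] - 3 * columns[1][i] + 3 * columns[2][i]
     + random.gauss(0, 1) for i in range(n)]

screened = sis_screen(columns, y, keep=20)                  # step 1: reduce p to 20
beta = lasso_cd([columns[j] for j in screened], y, lam=0.5)  # step 2: lasso
selected = [j for j, b in zip(screened, beta) if b != 0.0]
print("screened:", screened)
print("selected:", selected)
```

The point of the design is that the expensive penalized fit only ever sees the screened predictors, so its cost depends on the reduced dimension rather than on the original one. Note that the coefficients returned by `lasso_cd` are on the standardized scale.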

PENALIZED REGRESSION: BOOTSTRAP CONFIDENCE INTERVALS AND VARIABLE SELECTION FOR HIGH-DIMENSIONAL DATA SETS / S. Sartori ; tutor: Silvano Milani ; coordinator: Silvano Milano. Università degli Studi di Milano, 2011 Feb 04. 23rd cycle, Academic Year 2010. [10.13130/sartori-samantha_phd2011-02-04].

PENALIZED REGRESSION: BOOTSTRAP CONFIDENCE INTERVALS AND VARIABLE SELECTION FOR HIGH-DIMENSIONAL DATA SETS

S. Sartori
2011

4 Feb 2011
Sector MED/01 - Medical Statistics
penalized regression ; confidence intervals ; high dimensional variable selection ; GWAS ; bootstrap
MILANI, SILVANO
Doctoral Thesis
Files in this item:
File: phd_unimi_R07738.pdf
Access: open access
Type: Complete doctoral thesis
Size: 2.54 MB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/153099