Determining how many factors to retain as the expression of an underlying structure is an important topic in principal component analysis (PCA). To this end, several empirical criteria are commonly adopted: retaining all eigenvalues greater than 1 (Kaiser-Guttman rule), retaining the first eigenvalues that together account for a preset share of explained variance, retaining those above a preset threshold (broken-stick method), or retaining those that depart from the straight line on which all the remaining eigenvalues tend to lie (scree plot). Yet these rules often have a weak theoretical basis and are not always applied appropriately. Tackling the problem analytically, by deriving the distribution of the eigenvalues of sample correlation matrices, is a hard task, and most studies report results that are valid only asymptotically or under specific assumptions. The method needs to be generalized so that it can also handle real and possibly small datasets. The aim of this thesis was to model the decision thresholds for the eigenvalue distribution as a function of the number of variables (k) and the sample size (n), under the assumptions that no latent factors exist and that the variables follow a standard normal distribution. Two methods were considered: a direct and an indirect method. In a simulation, data were generated for 70 different settings, obtained by combining 7 values of n (75, 150, 300, 600, 1200, 2400, 4800) with 10 values of k (6, 12, 18, 24, 30, 36, 42, 48, 60). All variables were generated as independent and standard normal. For each setting, PCA was applied to 6001 independent samples; the distribution of the first 4 eigenvalues of the correlation matrix was considered and its 95th centiles were computed. The simulation showed a positive correlation between pairs of consecutive eigenvalues, which increases as k increases and, to a lesser extent, as n increases.
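The simulation design described above can be illustrated for a single setting. The following is a minimal sketch, not the thesis code: it assumes NumPy, uses the hypothetical helper name `null_eigenvalues`, and draws only 500 samples (rather than 6001) to keep the run short.

```python
import numpy as np

def null_eigenvalues(n, k, n_samples=500, n_top=4, seed=0):
    """Largest eigenvalues of sample correlation matrices when no latent factor exists.

    Hypothetical helper for illustration; the name and defaults are assumptions.
    """
    rng = np.random.default_rng(seed)
    top = np.empty((n_samples, n_top))
    for i in range(n_samples):
        x = rng.standard_normal((n, k))        # n observations of k independent N(0, 1) variables
        r = np.corrcoef(x, rowvar=False)       # k x k sample correlation matrix
        eig = np.linalg.eigvalsh(r)[::-1]      # eigenvalues sorted in descending order
        top[i] = eig[:n_top]                   # keep the first n_top eigenvalues
    return top

# One of the settings listed in the abstract: n = 75, k = 6.
top = null_eigenvalues(n=75, k=6)

# Empirical 95th centile of each of the first 4 eigenvalues under the null hypothesis.
thresholds = np.percentile(top, 95, axis=0)
print(thresholds)
```

An observed eigenvalue above the corresponding entry of `thresholds` would be judged unlikely under the no-factor hypothesis at the 5% level, which is the logic of parallel analysis.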
This positive correlation between consecutive eigenvalues is expected to persist when latent factors are present. With the direct method, the observed 95th centile of the distribution of the first 4 eigenvalues could be predicted as a function of k and n by a nonlinear model with 7 parameters. With the indirect method, the distribution of the first 4 eigenvalues was normalized through a 3-parameter Box-Cox transformation; the parameters of the transformation were then expressed as functions of k and n and used to predict the 95th centile. Both methods appeared to predict the 95th centile accurately. For the first eigenvalue, the mean absolute difference between the type I error risks associated with the observed and the predicted thresholds is 3‰ for the direct method and 5‰ for the indirect method. The indirect method has the additional advantage of assigning to any computed eigenvalue its probability of occurrence under the null hypothesis. The number of samples generated in this study is large enough to yield highly precise estimates of the type I error risk for the first eigenvalue. The estimates are less reliable for the second and, a fortiori, for the third and fourth eigenvalues. Further research in this field should focus on how, and to what extent, the distribution of the eigenvalues of sample correlation matrices depends on the shape of the parent distribution (e.g. skewed, leptokurtic, multimodal), and on the possible extension of the predicting functions to the case where latent factors exist, by including a parameter that accounts for the variance explained by these factors.
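The indirect method can be sketched for one setting as follows. The abstract does not give the parameterisation of the 3-parameter Box-Cox transformation, so this sketch makes an assumption: an ordinary Box-Cox power `lam` together with the mean `mu` and standard deviation `sigma` of the transformed values. The function names are hypothetical, and the parameters are fitted to one simulated setting rather than modelled as functions of k and n.

```python
from statistics import NormalDist
import numpy as np

def first_eigenvalues(n, k, n_samples=2000, seed=1):
    """Largest eigenvalue of the sample correlation matrix under the null hypothesis."""
    rng = np.random.default_rng(seed)
    out = np.empty(n_samples)
    for i in range(n_samples):
        x = rng.standard_normal((n, k))
        out[i] = np.linalg.eigvalsh(np.corrcoef(x, rowvar=False))[-1]
    return out

def boxcox(y, lam):
    """Box-Cox power transformation (natural log when lam is 0)."""
    return np.log(y) if abs(lam) < 1e-12 else (y ** lam - 1.0) / lam

def inv_boxcox(z, lam):
    """Inverse of boxcox."""
    return np.exp(z) if abs(lam) < 1e-12 else (lam * z + 1.0) ** (1.0 / lam)

def fit_boxcox(y, grid=None):
    """Grid search for the power maximising the Box-Cox profile log-likelihood."""
    grid = np.linspace(-5.0, 5.0, 201) if grid is None else grid
    log_sum = np.log(y).sum()

    def loglik(lam):
        z = boxcox(y, lam)
        return -0.5 * len(y) * np.log(z.var()) + (lam - 1.0) * log_sum

    lam = float(max(grid, key=loglik))
    z = boxcox(y, lam)
    return lam, float(z.mean()), float(z.std(ddof=1))

y = first_eigenvalues(n=75, k=6)
lam, mu, sigma = fit_boxcox(y)                       # assumed three parameters

# Predicted 95th centile: normal quantile on the transformed scale, mapped back.
z95 = mu + NormalDist().inv_cdf(0.95) * sigma
predicted = float(inv_boxcox(z95, lam))
observed = float(np.percentile(y, 95))               # empirical 95th centile

# Type I error risk actually attached to the predicted threshold (nominal: 0.05).
type1 = float(np.mean(y > predicted))

def p_value(eig):
    """Probability of an eigenvalue at least as large as eig under the null hypothesis."""
    z = (boxcox(eig, lam) - mu) / sigma
    return 1.0 - NormalDist().cdf(float(z))

print(predicted, observed, type1, p_value(observed))
```

Comparing `type1` with the nominal 5% mirrors the accuracy measure quoted in the abstract, and `p_value` illustrates how a normalizing transformation attaches a probability to any computed eigenvalue.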

IDENTIFICAZIONE DEGLI ASSI FATTORIALI INFORMATIVI NELL'ANALISI DELLE COMPONENTI PRINCIPALI [Identification of informative factorial axes in principal component analysis] / M. Plebani ; supervisor: S. Milani ; coordinator: A. Decarli. DIPARTIMENTO DI SCIENZE CLINICHE E DI COMUNITA', 2014 Feb 25. 26th cycle, academic year 2013. [10.13130/plebani-maddalena_phd2014-02-25].

IDENTIFICAZIONE DEGLI ASSI FATTORIALI INFORMATIVI NELL'ANALISI DELLE COMPONENTI PRINCIPALI

M. Plebani
2014

Date: 25 February 2014
Scientific sector: MED/01 - Medical Statistics
Keywords: principal component analysis; sample correlation matrix; parallel analysis; eigenvalue distribution
Supervisor: MILANI, SILVANO
Coordinator: DECARLI, ADRIANO
Type: Doctoral Thesis
Files in this record:
phd_unimi_R09128.pdf: complete doctoral thesis, Adobe PDF, 8.86 MB, open access from 30/07/2015

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/232964