The use of factor analysis in nutritional epidemiology: the identification of dietary patterns / V.C. Edefonti, F. Bravi, A. Decarli, M. Ferraroni. In: Dalla genetica all'ambiente: il ruolo della statistica medica e dell'epidemiologia clinica: 5. Congresso nazionale SISMEC, Palazzo San Tommaso, Università degli Studi di Pavia, 16-19 September 2009: proceedings / edited by P. Borrelli, B. Corso, M. C. Monti, C. Montomoli, P. Sciarini. Pavia: La Goliardica pavese, 2009. ISBN 9788878305014. pp. 126-127. (Paper presented at the 5th SISMEC national congress, "Dalla genetica all'ambiente: il ruolo della statistica medica e dell'epidemiologia clinica", held in Pavia in 2009.)

### The use of factor analysis in nutritional epidemiology: the identification of dietary patterns

##### V.C. Edefonti; F. Bravi; A. Decarli; M. Ferraroni

##### 2009

#### Abstract

Introduction. Because of the complexity of diet and the potential for interaction between dietary components, approaches that focus on individual foods or nutrients may miss information on the role of diet in disease aetiology. Because they can address both problems, dietary patterns have been used in nutritional epidemiology to describe associations between diet and disease. Dietary patterns are combinations of dietary components (food items, food groups, nutrients, or both foods and nutrients) intended to summarize the total diet or key factors of the diet for a given population. Three approaches have been proposed in the literature to define dietary patterns: the a posteriori approach, the a priori approach, and reduced rank regression (RRR) (1,2). In an a posteriori approach, dietary patterns are defined by applying multivariate statistical methods (Principal Component Analysis (PCA), exploratory Factor Analysis (FA), Cluster Analysis) directly to the data under consideration. In an a priori approach, they are defined as indexes built upon scientific evidence for specific diseases. RRR combines a priori knowledge of the diet-disease association with statistical methods.

Objective. FA is the most widely used method to identify a posteriori dietary patterns. Its understanding and application are not straightforward for non-statisticians, and a series of reviews on the association between dietary patterns and several types of cancer seems to be dominated by unsolved methodological issues and a general lack of evidence. In this contribution, we discuss some methodological issues that emerge from these reviews and concern the identification of dietary patterns through FA.

Methods. The aim of PCA and FA is to reduce the dimensionality of the data by transforming an original, larger set of correlated foods or nutrients into a smaller and easily interpretable set of uncorrelated variables, called principal components or factors.
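As a minimal sketch of the data-reduction step just described, the following Python snippet runs a PCA on the correlation matrix of simulated intake data. The simulated data, the `mixing` matrix, and the function name `pca_correlation` are illustrative assumptions and do not come from any study discussed here.

```python
import numpy as np

def pca_correlation(X, n_components):
    """PCA on the correlation matrix of X (subjects in rows, food groups in columns)."""
    # Standardize each column so that the analysis runs on the correlation matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(X, rowvar=False)
    # eigh returns eigenvalues of a symmetric matrix in ascending order; reverse them.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Component scores: projections of the standardized data on the leading eigenvectors.
    scores = Z @ eigvecs[:, :n_components]
    return eigvals, eigvecs[:, :n_components], scores

# Hypothetical data: 200 subjects, 6 food groups driven by 2 latent patterns.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[0.8, 0.0], [0.7, 0.1], [0.6, 0.0],
                   [0.0, 0.8], [0.1, 0.7], [0.0, 0.6]])
X = latent @ mixing.T + 0.5 * rng.normal(size=(200, 6))

eigvals, vectors, scores = pca_correlation(X, n_components=2)
score_corr = np.corrcoef(scores, rowvar=False)
print("eigenvalues:", np.round(eigvals, 2))
print("correlation between the two component scores:", round(float(score_corr[0, 1]), 6))
```

The six correlated variables are summarized by two component scores per subject, and those scores are uncorrelated by construction.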
FA starts from the same covariance/correlation matrix and shares the data-reduction rationale of PCA, but it is based on a statistical model and is more appropriate when the main focus of the analysis is on interpretation. Defining a statistical model makes it possible to rotate the factor loading matrix. In FA, the model parameters can be estimated with either the principal component method or the maximum likelihood method. From both PCA and FA, a continuous summary score is derived for each subject and each retained factor, indicating the degree to which the subject's diet conforms to each of the identified dietary patterns. Factor scores are then used in further analyses of disease risk.

Results. There are opportunities for subjectivity throughout dietary pattern definition, and decisions made at almost any step may affect the number and type of patterns that are derived, reported, and analyzed (1). Some preliminary decisions are relevant to any approach to pattern analysis. Specific challenges of FA include which data matrix to work on (covariance or correlation matrix, separate analyses for known subgroups in the data), the number of factors to retain and the corresponding percentage of explained variance, whether to rotate the factors and which rotation to choose, and the identification of criteria for labelling the factors. Here we present some methodological issues concerning FA that have not yet been covered in detail. In FA, the primary question for a researcher is whether the data are consistent with the prescribed structure, in which all variables within a particular group are highly correlated among themselves but have relatively small correlations with variables in a different group. Preliminary checks of the covariance/correlation matrix of the data have never been reported in nutritional epidemiology.
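The text does not name specific preliminary checks. One possible illustration, assuming that Bartlett's test of sphericity and the Kaiser-Meyer-Olkin (KMO) measure are acceptable choices, is sketched below on simulated data. The function names and the data are hypothetical.

```python
import numpy as np

def bartlett_sphericity(X):
    """Bartlett's statistic for H0: the correlation matrix is the identity."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    _, logdet = np.linalg.slogdet(R)
    statistic = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) // 2
    return statistic, dof  # compare the statistic with a chi-square with dof degrees of freedom

def kmo_overall(X):
    """Overall KMO measure: simple correlations relative to partial correlations."""
    R = np.corrcoef(X, rowvar=False)
    R_inv = np.linalg.inv(R)
    d = np.sqrt(np.diag(R_inv))
    partial = -R_inv / np.outer(d, d)       # partial correlations given all other variables
    off = ~np.eye(R.shape[0], dtype=bool)   # mask selecting off-diagonal entries
    r2 = np.sum(R[off] ** 2)
    a2 = np.sum(partial[off] ** 2)
    return r2 / (r2 + a2)

# Hypothetical data with two groups of mutually correlated food variables.
rng = np.random.default_rng(0)
mixing = np.array([[0.8, 0.0], [0.7, 0.1], [0.6, 0.0],
                   [0.0, 0.8], [0.1, 0.7], [0.0, 0.6]])
X = rng.normal(size=(200, 2)) @ mixing.T + 0.5 * rng.normal(size=(200, 6))

statistic, dof = bartlett_sphericity(X)
kmo = kmo_overall(X)
print(f"Bartlett statistic = {statistic:.1f} on {dof} degrees of freedom")
print(f"overall KMO = {kmo:.2f}")
```

A large Bartlett statistic relative to the chi-square reference, together with a KMO value closer to 1 than to 0, would suggest that the variables are correlated enough for a factor model to be worth fitting. Neither check shows that the grouped structure itself holds.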
However, the cumulative percentage of the original variance explained by the retained factors is generally low, which calls for some caution in interpreting the identified dietary patterns. Using the correlation matrix of the original data prevents a single variable with a large variance from unduly influencing the determination of the loadings and should generally be recommended in FA, although this may obscure differences across subjects. However, very few papers report that the analyses were performed on the correlation matrix. The choice of the number of factors to retain is often based on the "eigenvalue greater than 1" criterion. This criterion indicates that the corresponding factor explains more of the variance in the correlation matrix than a single variable does. However, it is systematically applied even when a covariance matrix seems to have been used. Quantitative labelling of the identified factors should be improved to reduce subjectivity in naming the factors and to help comparability of the results. The most sensible criterion is to consider the set of the highest (absolute) (rotated) factor loadings, defined according to a common cut-off for all the retained factors. The cut-offs have an objective interpretation in terms of the amount of variance that each item shares with the factor. Alternatively, one may use Cronbach's coefficient alpha and confirmatory FA. Subjectivity could also be limited by performing robustness analyses of the identified dietary patterns, in order to ensure that the data reduction solution obtained is likely to be independent of the specific statistical method and options used to derive it.
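The points above about the choice of matrix, the "eigenvalue greater than 1" criterion, and loading cut-offs can be illustrated with a short sketch. It assumes principal component extraction followed by a varimax rotation. The simulated data, the rescaling of two variables, and the cut-off of 0.6 are hypothetical choices made only for illustration.

```python
import numpy as np

def varimax(L, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a loading matrix (items x factors)."""
    p, k = L.shape
    T = np.eye(k)
    crit_old = 0.0
    for _ in range(max_iter):
        Lr = L @ T
        grad = L.T @ (Lr**3 - Lr @ np.diag((Lr**2).sum(axis=0)) / p)
        u, s, vt = np.linalg.svd(grad)
        T = u @ vt
        crit = s.sum()
        if crit_old > 0 and crit / crit_old < 1 + tol:
            break
        crit_old = crit
    return L @ T

# Hypothetical intake data; two variables are rescaled to mimic very different units.
rng = np.random.default_rng(1)
mixing = np.array([[0.8, 0.0], [0.7, 0.1], [0.6, 0.0],
                   [0.0, 0.8], [0.1, 0.7], [0.0, 0.6]])
X = rng.normal(size=(300, 2)) @ mixing.T + 0.5 * rng.normal(size=(300, 6))
X = X * np.array([100.0, 1.0, 1.0, 10.0, 1.0, 1.0])

R = np.corrcoef(X, rowvar=False)
S = np.cov(X, rowvar=False)
vals_R, vecs_R = np.linalg.eigh(R)
order = np.argsort(vals_R)[::-1]
vals_R, vecs_R = vals_R[order], vecs_R[:, order]
vals_S = np.sort(np.linalg.eigvalsh(S))[::-1]

# On the correlation matrix the average eigenvalue is 1, so "> 1" means
# "explains more variance than one standardized variable".
n_keep_R = int((vals_R > 1.0).sum())
# On the covariance matrix the threshold 1 has no such meaning; the analogous
# reference value is the average eigenvalue, i.e. the mean variance.
n_keep_S_naive = int((vals_S > 1.0).sum())
n_keep_S_mean = int((vals_S > vals_S.mean()).sum())
explained = np.cumsum(vals_R) / vals_R.sum()

# Loadings on the correlation scale; a squared loading is the share of an
# item's variance that it has in common with the factor.
loadings = vecs_R[:, :n_keep_R] * np.sqrt(vals_R[:n_keep_R])
rotated = varimax(loadings)
cutoff = 0.6                      # hypothetical common cut-off, i.e. 36% shared variance
shared_variance = rotated**2
dominant = np.abs(rotated) >= cutoff

print("retained (correlation, eigenvalue > 1):", n_keep_R)
print("retained (covariance, eigenvalue > 1):", n_keep_S_naive)
print("retained (covariance, eigenvalue > mean):", n_keep_S_mean)
print("cumulative explained variance:", np.round(explained, 2))
print("items above the cut-off per factor:\n", dominant)
```

In this simulated example the covariance-based solution is driven almost entirely by the variable with the largest variance, whereas the correlation-based solution is unaffected by the rescaling. The rotation leaves each item's communality unchanged and only redistributes it across factors.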
Although reproducibility of dietary patterns (the extent to which similar patterns are seen in the same population or across different ones) is becoming an increasingly popular topic in dietary pattern analysis (1), still only a few papers deal with this aspect, and some major concerns should be taken into account for a fair evaluation of it. The dietary assessment method, the length and type of responses, and the reproducibility and validity of the questionnaire are all relevant issues in performing sound reproducibility analyses. Special care is also needed when comparing results obtained from different approaches, and from different statistical techniques within the a posteriori approach.

Conclusion. There is a general need for refinement in the definition of dietary patterns and in the evaluation of their reproducibility. Special care is recommended in the application of FA, as it is the most common method used for the identification of a posteriori dietary patterns, and as some data may not be consistent with the correlation structure required by this technique.
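One way to quantify how similar two sets of patterns are, for example in a split-half reproducibility check, is Tucker's congruence coefficient between loading vectors. This coefficient is not mentioned in the text and is assumed here only as an illustration. The data and function names are hypothetical.

```python
import numpy as np

def pc_loadings(X, k):
    """Principal component loadings of the first k factors from the correlation matrix."""
    R = np.corrcoef(X, rowvar=False)
    vals, vecs = np.linalg.eigh(R)
    order = np.argsort(vals)[::-1][:k]
    return vecs[:, order] * np.sqrt(vals[order])

def congruence(A, B):
    """Tucker's congruence coefficients between the columns of A and the columns of B."""
    num = A.T @ B
    den = np.sqrt(np.outer((A**2).sum(axis=0), (B**2).sum(axis=0)))
    return num / den

# Hypothetical data: 400 subjects, one stronger and one weaker latent pattern.
rng = np.random.default_rng(2)
mixing = np.array([[0.9, 0.0], [0.8, 0.0], [0.8, 0.0],
                   [0.7, 0.0], [0.0, 0.7], [0.0, 0.6]])
X = rng.normal(size=(400, 2)) @ mixing.T + 0.5 * rng.normal(size=(400, 6))

# Random split into two halves, with patterns derived separately in each half.
idx = rng.permutation(400)
A = pc_loadings(X[idx[:200]], k=2)
B = pc_loadings(X[idx[200:]], k=2)

# The sign of a factor is arbitrary, so absolute values are compared.
phi = np.abs(congruence(A, B))
print(np.round(phi, 2))
```

Each row of `phi` refers to a factor from the first half and each column to a factor from the second half. Values close to 1 on a matched pair indicate very similar loading profiles. The matching of factors across halves still has to be decided by the analyst.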
