This PhD research project arises from the need to investigate in depth several methodological aspects of the identification, validation and application, in a routine clinical setting, of new non-invasive biomarkers for the (early) detection of cancer. Specifically, I investigated the statistical-methodological issues related to the identification and validation of molecular signatures detected with qPCR-based platforms, using colorectal cancer as the disease model (CRC-INT study). Colorectal cancer (CRC) is one of the major causes of cancer death in western countries [1,2]. More than 90% of CRC cases occur after the age of 50 years [2]. Given the natural history of CRC and the long interval of progression from normal mucosa to invasive cancer, many efforts have focused on implementing screening programmes for CRC prevention and early-stage detection, especially in this age group. Currently adopted screening programmes are based on colonoscopy (invasive, but still considered the gold standard for both detection and removal of lesions) or on tests that detect human haemoglobin in stool (i.e. faecal occult blood test, FOBT/FIT). The latter are less invasive and easier to carry out, but have shown low sensitivity for the identification of polyps. In Italy, screening programmes are based on FOBT/FIT, offered every 2 years to residents aged 50 to 69/74 years, or on flexible sigmoidoscopy (FS), offered to a single age cohort, generally at 58-60 years of age. In FIT programmes, quantitative haemoglobin analysis is performed in a centralized reference laboratory, using a threshold of 100 ng/ml of faecal haemoglobin as the cut-off for test positivity. People with a negative test are invited to repeat the test after 2 years. Subjects with a positive test (FIT+) are contacted to undergo a total colonoscopy (TC) in referral centres during dedicated sessions.
According to 2011-2012 data, 81% of FIT+ subjects undergo colonoscopy; carcinoma is diagnosed in 5% of FIT+ subjects and advanced adenoma in a further 25%. Subjects with non-cancerous lesions are enrolled in follow-up programmes according to the colonoscopy findings, whereas subjects with a screen-detected CRC undergo surgery (www.giscor.it). MicroRNAs (miRNAs) have been studied intensively in oncological research, and several studies have highlighted that they are easily detected in plasma and serum [3], suggesting a possible role as non-invasive biomarkers for the diagnosis and monitoring of human cancers. However, their detection and quantification can be influenced by haemolysis, i.e. the pink discoloration of serum or plasma caused by the release of red blood cell contents into the fluid [4,5,6,7]. In addition, several authors are now highlighting the need for shared workflows covering the entire process of miRNA identification (pre-analytical and analytical) as well as the statistical analysis. Accordingly, we recently proposed a workflow that schematizes the key phases of biomarker studies, from biomarker discovery to analytical and clinical validation, including the issues related to the development of operating procedures for their analysis [8]. The process usually begins with a discovery phase, followed by a validation phase and ultimately by the clinical application of the identified biomarker signature. Two additional assay-oriented steps (assay optimization and assay development) can be introduced in the workflow, before and/or after the validation phase. From a statistical-methodological point of view, the main issues in this workflow concern (i) the normalization of high-throughput qPCR data and (ii) the building and validation of miRNA-based signatures.
Data normalization is a crucial pre-processing step aimed both at removing experimentally induced variation and at bringing out true biological changes. Inappropriate normalization strategies can affect the results of the subsequent analysis and, as a consequence, the conclusions drawn from them. For circulating miRNAs, there are as yet no verified and shared reference RNAs in serum and plasma that can be used for data normalization. Several methods have been proposed to solve the problem of reference RNA selection. Currently, the most widely accepted and used method for normalizing circulating miRNA data is that proposed by Mestdagh et al. [9], based on the global mean of the expressed miRNAs. This method is valid when a large number of miRNAs are profiled, but it is almost never applicable in validation studies focused on a limited number of miRNAs. To overcome this issue, the same authors proposed searching for the set of reference miRNAs whose expression best resembles the mean expression value of all the miRNAs, and using that set for data normalization. Starting from this, we developed a comprehensive data-driven normalization method for high-throughput qPCR data that identifies a small set of miRNAs to be used as references in subsequent validation studies [10,11], taking the results obtained with the global mean method as the benchmark. This algorithm was also implemented as an R function (NqA algorithm). In biomarker studies based on high-throughput assays, a wide range of weak biomarkers is typically identified. In such a setting, a single molecular biomarker may not achieve satisfactory performance for patient classification, so the linear combination of these biomarkers into a more powerful composite score could be a suitable approach to achieve higher diagnostic performance.
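As an illustration of the global mean idea mentioned above, the following minimal Python sketch subtracts, within one sample, the mean Cq of the expressed miRNAs from each miRNA's Cq. It is not the NqA algorithm and not the published implementation; the function name, the miRNA labels, the Cq values and the detection cutoff of 35 are all assumptions introduced only for this example.

```python
from statistics import mean


def global_mean_normalize(cq_by_mirna, detection_cutoff=35.0):
    """Return delta-Cq values for one sample (Cq minus the global mean Cq).

    `detection_cutoff` is an assumed threshold: miRNAs with a missing Cq or a
    Cq at or above it are treated as not expressed and are left out.
    """
    expressed = {
        name: cq
        for name, cq in cq_by_mirna.items()
        if cq is not None and cq < detection_cutoff
    }
    if not expressed:
        raise ValueError("no expressed miRNAs in this sample")
    sample_mean = mean(expressed.values())
    return {name: cq - sample_mean for name, cq in expressed.items()}


# Hypothetical Cq values for a single plasma sample.
sample = {"miR-A": 24.0, "miR-B": 28.0, "miR-C": 26.0, "miR-D": 38.5}
normalized = global_mean_normalize(sample)
print(normalized)
```

With only a handful of miRNAs on a custom card, the mean is no longer a stable summary of the whole miRNA population, which is why a small, pre-selected reference set is needed for validation studies.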
As reported by Yan et al. [12], several methods are available, such as those that search for the coefficient (alpha) maximizing the AUC of the linear combination of p biomarkers (grid-search methods), or the conventional logistic regression model when the purpose is to guide healthcare professionals in their decision-making. Prediction model studies are usually organized into a model development phase, a model validation phase, or a combination of both. In the development phase, the aim is to derive a multivariate prediction model by selecting the most relevant predictors and combining them, whereas model validation consists in evaluating the performance of the model on other data, not used for model development. For miRNAs, the process starts with the identification of the candidate miRNAs to be included in the initial multivariate model, as reported in [8]. Once these candidates have been identified, model development can start with the fitting of the initial multivariate model, taking due account of the number of events per variable (EPV). Penalized regression models fitted by penalized maximum likelihood estimation (PMLE) [13], in which the regression coefficients (betas) are obtained by maximizing the penalized log-likelihood, can be used to prevent overfitting when the number of covariates is large relative to the number of outcome events. Depending on the penalty term (i.e. the functional form of the constraints) and the tuning parameter (i.e. the amount of shrinkage applied to the coefficients), different penalized regression models can be fitted [13]. Once the initial model has been defined and properly fitted, another important issue is the definition of the final model, which may correspond to the full initial model or to a reduced one.
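The grid-search idea can be sketched for the simplest case of two biomarkers, where the composite score is x1 + alpha * x2 and alpha is chosen on a grid to maximize the empirical AUC. This is only a toy illustration under stated assumptions: it is not the method of reference [12] for many weak biomarkers, and the function names, data and grid are hypothetical.

```python
def empirical_auc(scores, labels):
    """Fraction of case-control pairs in which the case scores higher (ties count 1/2)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("both cases and controls are required")
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


def grid_search_alpha(x1, x2, labels, alphas):
    """Return the (alpha, AUC) pair with the highest AUC for the score x1 + alpha * x2."""
    best_alpha, best_auc = None, -1.0
    for alpha in alphas:
        scores = [u + alpha * v for u, v in zip(x1, x2)]
        auc = empirical_auc(scores, labels)
        if auc > best_auc:  # keep the first alpha reaching the maximum
            best_alpha, best_auc = alpha, auc
    return best_alpha, best_auc


# Hypothetical data: 3 subjects with lesions (1) and 3 without (0).
labels = [1, 1, 1, 0, 0, 0]
x1 = [2.0, 1.0, 3.0, 1.5, 2.5, 0.5]
x2 = [1.0, 3.0, 0.5, 0.0, 0.2, 1.0]
best_alpha, best_auc = grid_search_alpha(x1, x2, labels, [-1.0, -0.5, 0.0, 0.5, 1.0])
print(best_alpha, best_auc)
```

Because alpha is tuned on the same data used to compute the AUC, the resulting AUC is an apparent value and is subject to the optimism that the validation steps are meant to quantify.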
For standard regression models, backward elimination or forward selection can be used for this purpose, although the latter does not provide a simultaneous assessment of the effects of all the candidates in the model [14], whereas for PMLE a reduced model can be obtained using the R-square method [15]. An alternative to the standard stepwise/backward methods is all-subsets regression, which can discover combinations of variables that explain more variation in patient outcome than those obtained with the standard stepwise/backward algorithms [8]. This approach has several potential advantages, but also some drawbacks, including the possibility of selecting models that omit important predictors (i.e. those supported by pre-existing evidence). The performance of a developed model should then be assessed by evaluating discrimination and calibration. Discrimination refers to the ability of the model to distinguish individuals with the disease from those without it (measured with the c-index or, equivalently, the area under the ROC curve), whereas calibration refers to the agreement between the probability of developing/having the outcome of interest as estimated by the model and the observed outcome. In addition, it is important to consider that the apparent performance of the developed model may be too optimistic because the same data are used for developing and testing the model [14,17]. In fact, when applied to new subjects, the performance of the model is generally lower than that observed in the development phase. Therefore, it is necessary to evaluate the performance of a developed model in new individuals before its implementation in clinical practice. Two types of validation can be adopted, according to the design of the study and the data available: internal and external validation.
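One simple way to inspect calibration as defined above is to sort subjects by predicted probability, split them into groups, and compare the mean predicted risk with the observed event fraction in each group. The sketch below assumes this grouped approach; the function name, the number of groups and the data are hypothetical and are not taken from the CRC-INT analysis.

```python
def calibration_by_group(predicted, observed, n_groups):
    """Return a list of (mean predicted risk, observed event fraction) per risk group.

    Subjects are sorted by predicted risk and split into `n_groups` groups of
    near-equal size; a well-calibrated model gives similar values in each pair.
    """
    n = len(predicted)
    if n != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not 1 <= n_groups <= n:
        raise ValueError("n_groups must be between 1 and the number of subjects")
    order = sorted(range(n), key=lambda i: predicted[i])
    groups = []
    for g in range(n_groups):
        idx = order[g * n // n_groups:(g + 1) * n // n_groups]
        mean_pred = sum(predicted[i] for i in idx) / len(idx)
        obs_frac = sum(observed[i] for i in idx) / len(idx)
        groups.append((mean_pred, obs_frac))
    return groups


# Hypothetical predicted probabilities and observed outcomes (1 = lesion found).
predicted = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
observed = [0, 0, 1, 0, 1, 0, 1, 1]
table = calibration_by_group(predicted, observed, n_groups=2)
for mean_pred, obs_frac in table:
    print(f"mean predicted {mean_pred:.2f}  observed {obs_frac:.2f}")
```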
Internal validation can be adopted when only one sample is available: approaches range from a single split of the study data into a training and a testing set to repeated splitting of the data a large number of times (e.g. leave-one-out, k-fold and repeated random-split cross-validation). Alternatively, bootstrapping can be adopted when the development sample is relatively small and/or a large number of candidate predictors is under investigation [14]. The bootstrap procedure allows all the data to be used for model development, and it also provides information about the level of model overfitting and optimism, as well as about what can be expected when the model is applied to new individuals from the same theoretical source population. Even if these internal validation methods can correctly control for overfitting and optimism, they cannot replace external validation, which consists in testing the model on new subjects [17]. The objective is to apply the original model to new data and to quantify its predictive performance, without any re-estimation of the model parameters. The new set of individuals may come from the same institution but be recruited in a different period (i.e. temporal validation), or they may come from other institutions/contexts (i.e. geographical validation) [17]. When a prediction model performs more poorly on new individuals, instead of developing a new model (sometimes by repeating the entire selection of predictors), a valid alternative is to update the existing model by adjusting (or recalibrating) it to the local circumstances or setting of the validation sample at hand [17,18]. In this way, the updated model combines the information captured in the original model on the development dataset with information from new individuals, theoretically improving its transportability to other individuals.
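The bootstrap estimate of optimism mentioned above can be sketched as follows: the model is refitted on each bootstrap sample, its AUC on that sample is compared with its AUC on the original data, and the average difference is subtracted from the apparent AUC. To keep the example short, the "model" here is a deliberately simple stand-in (weights equal to the difference in marker means between cases and controls) rather than a penalized logistic regression; all names, the simulated data and the number of replicates are assumptions.

```python
import random


def auc(scores, labels):
    """Empirical AUC: share of case-control pairs ranked correctly (ties count 1/2)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0 for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def fit_weights(rows, labels):
    """Toy stand-in for model fitting: weight = mean in cases minus mean in controls."""
    cases = [r for r, y in zip(rows, labels) if y == 1]
    controls = [r for r, y in zip(rows, labels) if y == 0]
    return [
        sum(r[j] for r in cases) / len(cases) - sum(r[j] for r in controls) / len(controls)
        for j in range(len(rows[0]))
    ]


def score(rows, weights):
    """Linear composite score for each subject."""
    return [sum(w * x for w, x in zip(weights, r)) for r in rows]


def optimism_corrected_auc(rows, labels, n_boot=200, seed=0):
    """Return (apparent AUC, estimated optimism, optimism-corrected AUC)."""
    rng = random.Random(seed)
    n = len(rows)
    apparent = auc(score(rows, fit_weights(rows, labels)), labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        b_labels = [labels[i] for i in idx]
        if len(set(b_labels)) < 2:
            continue  # redraw: a bootstrap sample needs both cases and controls
        b_rows = [rows[i] for i in idx]
        weights = fit_weights(b_rows, b_labels)  # refit on the bootstrap sample
        diffs.append(auc(score(b_rows, weights), b_labels) - auc(score(rows, weights), labels))
    optimism = sum(diffs) / len(diffs)
    return apparent, optimism, apparent - optimism


# Simulated data: 15 cases, 25 controls, 3 weak markers (hypothetical effect size).
data_rng = random.Random(1)
labels = [1] * 15 + [0] * 25
rows = [[data_rng.gauss(0.4 * y, 1.0) for _ in range(3)] for y in labels]
apparent, optimism, corrected = optimism_corrected_auc(rows, labels)
print(f"apparent {apparent:.3f}  optimism {optimism:.3f}  corrected {corrected:.3f}")
```

In a real analysis every data-driven step, including predictor selection and the choice of the shrinkage parameter, would have to be repeated inside the loop for the optimism estimate to be meaningful.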
Moreover, to improve the performance of a clinical prediction model, new marker(s) can be incorporated into the existing model [19]. The methodology described above was applied to the CRC-INT study, aimed at identifying plasma circulating miRNAs to be used as biomarkers for the early detection of CRC lesions, with FIT-positive individuals as the target population. The study included three cohorts of FIT+ subjects: discovery (DC), internal validation (IVC) and external validation (EVC). Blood was collected before colonoscopy, and circulating miRNAs extracted from plasma were analyzed using PCR assays. The principal aim of the discovery phase was to investigate the feasibility of detecting miRNAs in plasma from FIT+ individuals, as well as to identify a set of reference and candidate miRNAs to be investigated further in the subsequent phases, which focused on prospectively enrolled subjects. In the discovery phase, we used the expression levels of human miRNAs measured in a cohort of already available plasma samples from FIT+ individuals who had undergone a screening colonoscopy at INT. As output, a subset of reference miRNAs was identified using the NqA R function, and a set of candidate miRNAs was identified as showing significantly different expression in subjects with proliferative lesions versus subjects without lesions, or in subjects with a specific proliferative lesion. Based on these results, a custom microfluidic card including the candidate and reference miRNAs was designed for use in the subsequent internal validation phase. Before moving to the custom-made assay, we performed a technical validation phase (on the same DC samples) aimed at evaluating the level of reproducibility between the assays involved (i.e. the high-throughput assay used in the discovery phase and the custom-made one to be used in the IVC/EVC) [20].
In addition, an ad hoc in vitro controlled haemolysis experiment was carried out by artificially introducing different percentages of red blood cells (RBCs) into a haemolysis-free plasma sample [21]. The results showed that miRNAs reported in the literature as haemolysis-related were also influenced by haemolysis in our experiment, whereas all our reference miRNAs and 70% of the candidate miRNAs were not. Candidate miRNAs showing relevant changes with respect to the haemolysis-free plasma sample were excluded from the subsequent statistical analysis. In addition, taking advantage of the availability, for each contaminated tube, of haemolysis indexes (measured spectrophotometrically) and of the known RBC concentration, a calibration curve was generated to estimate the unknown percentage of RBCs in new plasma samples. For the analysis of the IVC data, we adopted an approach based on all-subsets analysis and the PMLE method to estimate the miRNA-based signatures, in order to take into account the peculiarities of the scenario under investigation, namely many weak biomarkers measured in plasma using platforms developed for research purposes only. We then performed a signature selection (i.e. EPV > 3, significant AUC and finite shrinkage value) in order to select on the IVC only a few signatures to be tested on the EVC. The latter includes FIT+ subjects enrolled at our Institute and also at other hospitals participating in the CRC screening programme of the Local Health Authority of Milan. Statistical analysis of the EVC is ongoing: preliminary results confirmed the predictive capability of some of the identified signatures, although with lower performance than that obtained on the IVC. Further analyses will be performed to evaluate a possible effect of the variable “Hospital”, as well as to apply different model validation and model updating approaches.
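The haemolysis calibration curve described above can be illustrated with a minimal sketch: a straight line is fitted to the haemolysis index as a function of the known RBC percentage, and then inverted to estimate the RBC percentage of a new sample from its measured index. The linear form, the function names and all numerical values are assumptions made for this example; the abstract does not state the functional form actually used.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x; returns (slope, intercept)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return slope, mean_y - slope * mean_x


def estimate_rbc_percent(haemolysis_index, slope, intercept):
    """Invert the calibration line to estimate the RBC percentage of a new sample."""
    if slope == 0:
        raise ValueError("calibration line is flat; cannot invert")
    return (haemolysis_index - intercept) / slope


# Hypothetical calibration tubes: known RBC percentage and measured haemolysis index.
rbc_percent = [0.0, 0.5, 1.0, 2.0]
haemolysis_index = [0.10, 0.60, 1.10, 2.10]
slope, intercept = fit_line(rbc_percent, haemolysis_index)

# Estimated RBC percentage for a new plasma sample with a measured index of 1.6.
estimate = estimate_rbc_percent(1.6, slope, intercept)
print(f"estimated RBC percentage: {estimate:.2f}")
```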
The added value of the developed miRNA-based signatures with respect to pre-existing prediction models, and the gain brought by the introduction of the miR-test into the existing CRC screening diagnostic workflow, will eventually be evaluated.

References

1. Jemal, A., Bray, F., Center, M. M., Ferlay, J., Ward, E., & Forman, D. (2011). Global cancer statistics. CA: A Cancer Journal for Clinicians, 61(2), 69-90.
2. Mazeh, H., Mizrahi, I., Ilyayev, N., Halle, D., Brucher, B., Bilchik, A., et al. (2013). The diagnostic and prognostic role of microRNA in colorectal cancer - a comprehensive review. Journal of Cancer, 4(3), 281-295.
3. Chen, X., Ba, Y., Ma, L., Cai, X., Yin, Y., Wang, K., et al. (2008). Characterization of microRNAs in serum: A novel class of biomarkers for diagnosis of cancer and other diseases. Cell Research, 18(10), 997-1006.
4. Kirschner, M. B., Kao, S. C., Edelman, J. J., Armstrong, N. J., Vallely, M. P., van Zandwijk, N., et al. (2011). Haemolysis during sample preparation alters microRNA content of plasma. PloS One, 6(9), e24145.
5. Pritchard, C. C., Kroh, E., Wood, B., Arroyo, J. D., Dougherty, K. J., Miyaji, M. M., et al. (2012). Blood cell origin of circulating microRNAs: A cautionary note for cancer biomarker studies. Cancer Prevention Research (Philadelphia, Pa.), 5(3), 492-497.
6. Kirschner, M. B., Edelman, J. J., Kao, S. C., Vallely, M. P., van Zandwijk, N., & Reid, G. (2013). The impact of hemolysis on cell-free microRNA biomarkers. Frontiers in Genetics, 4, 94.
7. Yamada, A., Cox, M. A., Gaffney, K. A., Moreland, A., Boland, C. R., & Goel, A. (2014). Technical factors involved in the measurement of circulating microRNA biomarkers for the detection of colorectal neoplasia. PloS One, 9(11), e112481.
8. Verderio, P., Bottelli, S., Pizzamiglio, S., & Ciniselli, C. M. (2016). Developing miRNA signatures: A multivariate prospective. British Journal of Cancer, 115(1), 1-4.
9. Mestdagh, P., Van Vlierberghe, P., De Weer, A., Muth, D., Westermann, F., Speleman, F., et al. (2009). A novel and universal method for microRNA RT-qPCR data normalization. Genome Biology, 10(6), R64.
10. Pizzamiglio, S., Bottelli, S., Ciniselli, C. M., Zanutto, S., Bertan, C., Gariboldi, M., et al. (2014). A normalization strategy for the analysis of plasma microRNA qPCR data in colorectal cancer. International Journal of Cancer, 134(8), 2016-2018.
11. Verderio, P., Bottelli, S., Ciniselli, C. M., Pierotti, M. A., Gariboldi, M., & Pizzamiglio, S. (2014). NqA: An R-based algorithm for the normalization and analysis of microRNA quantitative real-time polymerase chain reaction data. Analytical Biochemistry, 461, 7-9.
12. Yan, L., Tian, L., & Liu, S. (2015). Combining large number of weak biomarkers based on AUC. Statistics in Medicine, 34(29), 3811-3830.
13. Pavlou, M., Ambler, G., Seaman, S., De Iorio, M., & Omar, R. Z. (2016). Review and evaluation of penalised regression methods for risk prediction in low-dimensional data with few events. Statistics in Medicine, 35(7), 1159-1177.
14. Moons, K. G., Kengne, A. P., Woodward, M., Royston, P., Vergouwe, Y., Altman, D. G., et al. (2012). Risk prediction models: I. development, internal validation, and assessing the incremental value of a new (bio)marker. Heart (British Cardiac Society), 98(9), 683-690.
15. Moons, K. G., Donders, A. R., Steyerberg, E. W., & Harrell, F. E. (2004). Penalized maximum likelihood estimation to directly adjust diagnostic and prognostic prediction models for overoptimism: A clinical example. Journal of Clinical Epidemiology, 57(12), 1262-1270.
16. Altman, D. G., & Royston, P. (2000). What do we mean by validating a prognostic model? Statistics in Medicine, 19(4), 453-473.
17. Moons, K. G., Kengne, A. P., Grobbee, D. E., Royston, P., Vergouwe, Y., Altman, D. G., et al. (2012). Risk prediction models: II. external validation, model updating, and impact assessment. Heart (British Cardiac Society), 98(9), 691-698.
18. Vergouwe, Y., Nieboer, D., Oostenbrink, R., Debray, T. P., Murray, G. D., Kattan, M. W., et al. (2016). A closed testing procedure to select an appropriate method for updating prediction models. Statistics in Medicine.
19. Nieboer, D., Vergouwe, Y., Ankerst, D. P., Roobol, M. J., & Steyerberg, E. W. (2016). Improving prediction models with new markers: A comparison of updating strategies. BMC Medical Research Methodology, 16(1), 128.
20. Verderio, P., Bottelli, S., Ciniselli, C. M., Pierotti, M. A., Zanutto, S., Gariboldi, M., et al. (2015). Moving from discovery to validation in circulating microRNA research. The International Journal of Biological Markers, 30(2), e258-61.
21. Pizzamiglio, S., Zanutto, S., Ciniselli, C. M., Belfiore, A., Bottelli, S., Gariboldi, M., et al. (2017). A methodological procedure for evaluating the impact of hemolysis on circulating microRNAs. Oncology Letters, 13(1), 315-320.

IDENTIFICATION OF CIRCULATING BIOMARKERS FOR THE EARLY DIAGNOSIS OF COLORECTAL CANCER: METHODOLOGICAL ASPECTS / C.M. Ciniselli ; tutor: P. Verderio ; coordinator: C. La Vecchia. DIPARTIMENTO DI SCIENZE CLINICHE E DI COMUNITA', 2017 Mar 29. 29th cycle, Academic Year 2016. [10.13130/c-m-ciniselli_phd2017-03-29].

IDENTIFICATION OF CIRCULATING BIOMARKERS FOR THE EARLY DIAGNOSIS OF COLORECTAL CANCER: METHODOLOGICAL ASPECTS

C.M. Ciniselli
2017

Abstract

The present PhD research project starts from the need of deeply investigating some methodological aspects related to the identification, validation and application, in a routine clinical setting, of new non-invasive biomarkers for the (early) detection of cancer. Specifically, I investigated the statistical-methodological issues related to the identification and validation of molecular-based signatures detected with qPCR-based platforms by using the colorectal cancer as disease model (CRC-INT study). Colorectal cancer (CRC) is one of the major causes of cancer death in western countries1,2. More than 90% of CRC cases occur after the age of 50 years2 and, on the basis of the natural history of CRC progression and of the long time-interval of progression from normal mucosa to invasive cancer many efforts have been focused on the implementation of screening programs for CRC prevention and detection in its early stage, especially in this cohort of subjects. Currently adopted screening programs are based on colonoscopy (invasive, but still considered as the gold standard for both detection and removal of lesions) or on tests that search the human haemoglobin in stool (i.e. faecal occult blood test, FOBT/FIT). The latter are less invasive and easier to carry out, but showed low sensitivity values for polyps identification. In Italy, screening programs are based on FOBT/FIT - offered every 2 years to residents of 50-69/74 years-old - or on flexible sigmoidoscopy (FS) offered to a single age cohort, generally at 58-60 years-old. As concerns FIT programmes, quantitative haemoglobin analysis is performed in a centralized reference laboratory using the threshold of 100ng/ml of faecal haemoglobin as cut-off value to determine the positivity to the test. People with a negative test are invited to repeat the test after 2 years. Subjects with a positive test (FIT+) are contacted in order to perform a total colonoscopy (TC) in referral centres during dedicated sessions. 
According to 2011-2012 data, colonoscopy is performed by the 81% of FIT+ subjects and a diagnosis of carcinoma is formulated in 5% of FIT+ subjects and that of advanced adenoma in a further 25%. Subjects with non cancerous lesions are enrolled in follow-up programs according to the colonoscopy’s output, whereas subjects with a screen-detected CRC undergo surgery(www.giscor.it). MicroRNAs have been studied intensively in the field of oncological research and several studies have highlighted their easy detectability in plasma and serum3, suggesting their possible role as non-invasive biomarkers for the diagnosis and monitoring of human cancers. However, their detection and quantification could be influenced by haemolysis, i.e. pink discoloration of serum or plasma due to the release of the red blood cells into the fluids4,5,6,7. In addition, several Authors are now highlighting the need of shared workflows for the entire process of miRNA identification (pre-analytical and analytical) as well as for the statistical analysis. Accordingly, we recently proposed a workflow that tries to schematize all the key phases involved in biomarkers studies, from biomarker discovery to their analytical and clinical validation, including the issues related to the development of operative procedures for their analysis8. The process usually begins with a discovery phase, followed by a validation one and ultimately by the clinical application of the identified biomarker signature. Two additional assay-oriented steps (assay optimization and assay development) could be introduced in the workflow, before and/or after the validation phase. From a statistical-methodological point of view, the main issues involved in this workflow are those related to (i) data normalization of high-throughput qPCR data and the (ii) building and validation of miRNA-based signatures. 
Data normalization represents a crucial pre-processing step aimed both at removing experimentally-induced variation and at differentiating true biological changes. Inappropriate normalization strategies can affect the results of the subsequent analysis and, as a consequence, the conclusion drawn from the results. In the miRNA context, there are no yet verified and shared reference RNA in serum and plasma that can be used for data normalization. Several methods have been proposed to solve the problem of reference RNA selection. Currently, the most accepted and widely used method for data normalization of circulating miRNAs is that proposed by Mestdagh9, based on the computation of the global mean of the expressed miRNAs. This method is valid if a large number of miRNAs are profiled, but it is almost never applicable in validation studies focused on a limited number of miRNAs. To overcome this issue, the same Authors proposed to search the set of reference miRNAs that resembles the mean expression value of all the miRNAs, and use that set for data normalization. Starting from this, we developed a comprehensive data-driven normalization method for high-throughput qPCR data that identifies a small set of miRNAs to be used as reference for data normalization in view of the subsequent validation studies10,11, using results obtained with the global mean method as reference method. This algorithm was also implemented in a R-function (NqA algorithm). As concerns the biomarker studies based on the use of high-throughput assays, it should be considered that in such setting a wide range of weak biomarkers are constantly identified. In such setting, a single molecular biomarker alone may not achieve satisfactory performance for patients classification, so that the linear combination of these biomarkers in a more powerful composite score could represent a suitable approach to achieve higher diagnostic performances. 
As reported by Yan12, several methods are available, such as those based on the searching of the alpha value that maximizes the AUC value of the linear combination of p-biomarkers (grid-search methods), or the conventional logistic regression model when the purpose is to guide healthcare professionals in their decision-making. Prediction model studies are usually organized in a model development and in a model validation phase, or in a combination of both. In the first phase the aim is to derive a multivariate prediction model by selecting the most relevant predictors and combing them into a multivariate model, whereas the model validation consists in the evaluation of the performance of the model on other data, not used for model development. As far as miRNAs are concerned, the process starts with the identification of the candidate miRNAs that should be included in the initial multivariate model, as reported in (8). Once identified these candidates, the model development phase could start with the fitting of the initial multivariate model, by opportunely taking into consideration the number of event-per-variable (EPV). Penalized regression models (PMLE) 13, in which the beta value are obtained by maximising the penalized log-likelihood, can be used to prevent overfitting when a large number of covariates are present in a model with respect to the number of outcome events. According to the penalty term (i.e. the functional form of the constraints) and the tuning parameter (i.e. the amount of shrinkage applied to the coefficients), different penalized regression models can be fitted13 Once defined the initial model and fitted it with the proper procedures, another important theme is the definition of the final model, which could correspond to the full initial model or to a reduced one. 
For standard regression models, backward elimination or forward selection can be used for this purpose, even if the latter does not provide a simultaneous assessment of the effects of all the candidates in the model14, whereas for PMLE, a reduced model can be obtained using the R-square method15. An alternative approach to the standard stepwise/backward methods is the all subsets regression, which can discover combinations of variables that explain more variation in patients outcome than those obtained by using the standard stepwise/backward algorithms8. This approach has several potential advantages, but also some drawbacks, including the possibility of selecting models which omit important predictors (i.e. pre-existing evidences). The performance of a developed model should be then assessed by evaluating discrimination and calibration. Discrimination refers to the ability of the model to distinguish individuals with the disease from those without the disease (measured with the c-index or the equivalent area under the ROC curve), whereas calibration refers to the agreement between the probability of developing/having the outcome of interest as estimated by the model, and the observed outcome. In addition, it is important to consider that the performance of the developed model could be too optimistic because the same data are used for developing and testing the model14,17. In fact, when applied to new subjects, the performance of the model is generally lower than that observed in the development phase. Therefore, it is necessary to evaluate the performance of a developed model in new individuals before its implementation and application in clinical practice. Two different types of validation can be adopted, according to the design of the study and the data available: internal and external validation. 
Internal validation can be adopted when only one sample is available: approaches vary from the single splitting (one time) of the study data in a training and a testing set to the repeated splitting of the data a large number of times (i.e leave-one-out, k-fold and repeated random split cross-validation). Alternatively, the bootstrapping can be adopted when the development sample is relatively small and/or a large number of candidate predictors is under investigation14. The bootstrapping procedure allows the use of all the data for model development, and it also provides information about the level of model overfitting and optimism as well as about what can be expected when the model is applied to new individuals from the same theoretical source population. Even if these internal validation methods can correctly control overfitting and optimisms, they cannot substitute external validation, which consists in testing the model on new subjects17. The objective is to apply the original model to new data and to quantify the model’s predictive performance, without any re-estimation of the parameters included in the model. The new set of individuals may come from the same institution but be recruited in a different period (i.e. temporal validation), or they may come from other institutions/contexts (i.e. geographical validation)17. When a poorer performance of a prediction model is obtained on new individuals, instead of developing a new model (sometimes by repeating the entire selection of predictors), a valid alternative is to update the existing model by adjusting (or recalibrating) the model by the local circumstance or setting of the validation sample at hand17,18. In this way, the updated model combines the information captured in the original model on the development dataset with information from new individuals, theoretically improving the transportability to other individuals. 
Moreover, to improve the performance of a clinical prediction model, new marker(s) can be incorporated in the existing model19. The above-described methodology was applied to the CRC-INT study, aimed at identifying plasma circulating miRNAs to be used as biomarkers for the early detection of CRC lesions, using the FIT positive individuals as target population. The study included three cohorts of FIT+ subjects: discovery (DC), internal validation (IVC) and external validation (EVC). Blood was collected before colonoscopy and circulating miRNAs extracted from plasma were analyzed by using PCR assays. The principal aim of the discovery phase was to investigate the suitability of searching miRNAs in plasma from FIT+ individuals as well as identify a set of reference and candidate miRNAs to be deeply investigated in the subsequent phases focused on prospectively enrolled subjects. During the discovery phase, the expression levels of human miRNAs on a cohort of already available plasma samples from FIT+ individuals who have performed a screening colonoscopy at INT was used. As output, a subset of reference miRNAs was identified using the NqA R-function and a total of candidate miRNAs were identified as showing significantly different expressions in subjects with proliferative lesions vs subjects without lesions, or in subjects with a specific proliferative lesions. Based on these results, a custom micro fluidic card including the candidate and reference miRNAs was designed to be used in the following internal validation phase. Before moving to the custom-made assay, we performed a technical validation phase (on the same DC samples) aimed at evaluating the level of reproducibility between the involved assays (i.e. the high-throughput assay used in the discovery phase and the custom-made one to be used in the IVC/EVC20). 
In addition, an ad hoc in vitro controlled haemolysis experiment was carried out by artificially introducing different percentages of red blood cells (RBCs) into a haemolysis-free plasma sample21. The results showed that miRNAs reported in the literature as haemolysis-related were also influenced by haemolysis in our experiment, whereas all our reference miRNAs and 70% of the candidate miRNAs were not. Candidate miRNAs showing relevant changes with respect to the haemolysis-free plasma sample were excluded from the subsequent statistical analysis. In addition, by taking advantage of the availability, for each contaminated tube, of spectrophotometrically measured haemolysis indices and the known RBC concentration, a calibration curve was generated in order to estimate the unknown percentage of RBCs in new plasma samples. For the analysis of the IVC data, we adopted an approach based on all-subset analysis and the PMLE method to estimate the miRNA-based signatures, in order to take into account the peculiarities of the scenario under investigation, such as many weak biomarkers measured in plasma using platforms developed for research purposes only. We then performed a signature selection (i.e. EPV > 3, significant AUC and finite shrinkage value) to retain, on the IVC, only a few signatures to be tested on the EVC. The EVC includes FIT+ subjects enrolled at our Institute as well as in other hospitals joining the CRC screening programme of the Local Health Authority of Milan. Statistical analysis of the EVC is ongoing: preliminary results confirmed the predictive capability of some of the identified signatures, albeit with lower performance than that obtained on the IVC. Further analyses will be performed to evaluate a possible effect of the variable “Hospital”, as well as to apply different model validation and model updating approaches.
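As a rough illustration of the all-subset step and of the EPV criterion mentioned above, the following sketch enumerates candidate signatures and computes an AUC. It rests on assumptions not stated in the abstract: EPV is taken as the number of events divided by the number of miRNAs in the signature, the AUC is computed as the Mann-Whitney probability, and the marker names are hypothetical. The PMLE fitting, the AUC significance test and the shrinkage check are not shown.

```python
from itertools import combinations


def eligible_subsets(markers, n_events, min_epv=3.0):
    """Yield every non-empty marker subset whose EPV exceeds min_epv."""
    for size in range(1, len(markers) + 1):
        if n_events / size <= min_epv:
            break  # larger subsets can only have a lower EPV
        yield from combinations(markers, size)


def auc(scores, labels):
    """Area under the ROC curve as the Mann-Whitney probability (ties count 0.5)."""
    cases = [s for s, l in zip(scores, labels) if l == 1]
    controls = [s for s, l in zip(scores, labels) if l == 0]
    if not cases or not controls:
        raise ValueError("both outcome classes are required")
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Hypothetical example: 4 candidate miRNAs and 10 events.
candidates = ["miR-A", "miR-B", "miR-C", "miR-D"]
subsets = list(eligible_subsets(candidates, n_events=10))
print(len(subsets), "subsets satisfy EPV > 3")
```

In this toy setting only signatures with up to three miRNAs pass the EPV filter (10/4 = 2.5 is below the threshold), which is how the criterion limits the number of signatures carried forward to validation.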
Finally, the added value of the developed miRNA-based signatures over pre-existing prediction models, and the gain brought by introducing the miR-test into the existing CRC screening diagnostic workflow, will be evaluated.

References
1. Jemal, A., Bray, F., Center, M. M., Ferlay, J., Ward, E., & Forman, D. (2011). Global cancer statistics. CA: A Cancer Journal for Clinicians, 61(2), 69-90.
2. Mazeh, H., Mizrahi, I., Ilyayev, N., Halle, D., Brucher, B., Bilchik, A., et al. (2013). The diagnostic and prognostic role of microRNA in colorectal cancer - a comprehensive review. Journal of Cancer, 4(3), 281-295.
3. Chen, X., Ba, Y., Ma, L., Cai, X., Yin, Y., Wang, K., et al. (2008). Characterization of microRNAs in serum: A novel class of biomarkers for diagnosis of cancer and other diseases. Cell Research, 18(10), 997-1006.
4. Kirschner, M. B., Kao, S. C., Edelman, J. J., Armstrong, N. J., Vallely, M. P., van Zandwijk, N., et al. (2011). Haemolysis during sample preparation alters microRNA content of plasma. PloS One, 6(9), e24145.
5. Pritchard, C. C., Kroh, E., Wood, B., Arroyo, J. D., Dougherty, K. J., Miyaji, M. M., et al. (2012). Blood cell origin of circulating microRNAs: A cautionary note for cancer biomarker studies. Cancer Prevention Research (Philadelphia, Pa.), 5(3), 492-497.
6. Kirschner, M. B., Edelman, J. J., Kao, S. C., Vallely, M. P., van Zandwijk, N., & Reid, G. (2013). The impact of hemolysis on cell-free microRNA biomarkers. Frontiers in Genetics, 4, 94.
7. Yamada, A., Cox, M. A., Gaffney, K. A., Moreland, A., Boland, C. R., & Goel, A. (2014). Technical factors involved in the measurement of circulating microRNA biomarkers for the detection of colorectal neoplasia. PloS One, 9(11), e112481.
8. Verderio, P., Bottelli, S., Pizzamiglio, S., & Ciniselli, C. M. (2016). Developing miRNA signatures: A multivariate prospective. British Journal of Cancer, 115(1), 1-4.
9. Mestdagh, P., Van Vlierberghe, P., De Weer, A., Muth, D., Westermann, F., Speleman, F., et al. (2009). A novel and universal method for microRNA RT-qPCR data normalization. Genome Biology, 10(6), R64.
10. Pizzamiglio, S., Bottelli, S., Ciniselli, C. M., Zanutto, S., Bertan, C., Gariboldi, M., et al. (2014). A normalization strategy for the analysis of plasma microRNA qPCR data in colorectal cancer. International Journal of Cancer, 134(8), 2016-2018.
11. Verderio, P., Bottelli, S., Ciniselli, C. M., Pierotti, M. A., Gariboldi, M., & Pizzamiglio, S. (2014). NqA: An R-based algorithm for the normalization and analysis of microRNA quantitative real-time polymerase chain reaction data. Analytical Biochemistry, 461, 7-9.
12. Yan, L., Tian, L., & Liu, S. (2015). Combining large number of weak biomarkers based on AUC. Statistics in Medicine, 34(29), 3811-3830.
13. Pavlou, M., Ambler, G., Seaman, S., De Iorio, M., & Omar, R. Z. (2016). Review and evaluation of penalised regression methods for risk prediction in low-dimensional data with few events. Statistics in Medicine, 35(7), 1159-1177.
14. Moons, K. G., Kengne, A. P., Woodward, M., Royston, P., Vergouwe, Y., Altman, D. G., et al. (2012). Risk prediction models: I. Development, internal validation, and assessing the incremental value of a new (bio)marker. Heart (British Cardiac Society), 98(9), 683-690.
15. Moons, K. G., Donders, A. R., Steyerberg, E. W., & Harrell, F. E. (2004). Penalized maximum likelihood estimation to directly adjust diagnostic and prognostic prediction models for overoptimism: A clinical example. Journal of Clinical Epidemiology, 57(12), 1262-1270.
16. Altman, D. G., & Royston, P. (2000). What do we mean by validating a prognostic model? Statistics in Medicine, 19(4), 453-473.
17. Moons, K. G., Kengne, A. P., Grobbee, D. E., Royston, P., Vergouwe, Y., Altman, D. G., et al. (2012). Risk prediction models: II. External validation, model updating, and impact assessment. Heart (British Cardiac Society), 98(9), 691-698.
18. Vergouwe, Y., Nieboer, D., Oostenbrink, R., Debray, T. P., Murray, G. D., Kattan, M. W., et al. (2016). A closed testing procedure to select an appropriate method for updating prediction models. Statistics in Medicine.
19. Nieboer, D., Vergouwe, Y., Ankerst, D. P., Roobol, M. J., & Steyerberg, E. W. (2016). Improving prediction models with new markers: A comparison of updating strategies. BMC Medical Research Methodology, 16(1), 128.
20. Verderio, P., Bottelli, S., Ciniselli, C. M., Pierotti, M. A., Zanutto, S., Gariboldi, M., et al. (2015). Moving from discovery to validation in circulating microRNA research. The International Journal of Biological Markers, 30(2), e258-61.
21. Pizzamiglio, S., Zanutto, S., Ciniselli, C. M., Belfiore, A., Bottelli, S., Gariboldi, M., et al. (2017). A methodological procedure for evaluating the impact of hemolysis on circulating microRNAs. Oncology Letters, 13(1), 315-320.
29 March 2017
Sector MED/01 - Medical Statistics
colorectal cancer; biomarkers; microRNA; high-throughput qPCR data; data normalization; miRNA-based signatures
VERDERIO, PAOLO
LA VECCHIA, CARLO VITANTONIO BATTISTA
Doctoral Thesis
IDENTIFICATION OF CIRCULATING BIOMARKERS FOR THE EARLY DIAGNOSIS OF COLORECTAL CANCER: METHODOLOGICAL ASPECTS / C.m. Ciniselli ; tutor: P. Verderio ; coordinator: C. La Vecchia. DIPARTIMENTO DI SCIENZE CLINICHE E DI COMUNITA', 2017 Mar 29. 29th cycle, Academic Year 2016. [10.13130/c-m-ciniselli_phd2017-03-29].
Files in this item:
phd_unimi_R10676.pdf (complete doctoral thesis; Adobe PDF, 2.39 MB; open access from 18/09/2018)
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/486530