Motivation

The exponential growth of genomic data produced by Next Generation Sequencing (NGS) [1] and other high-throughput technologies requires powerful (and costly) computational infrastructures to store, share and analyze them. A key factor in exploiting this wealth of data efficiently is giving users, whether individual scientists, research teams or larger organizations, an easy way to access and interact with both the data and the software tools needed to analyze them. While cloud computing opens unprecedented opportunities by making powerful computational resources available even to individual scientists and small research groups, running bioinformatic analyses can still be difficult for most non-bioinformaticians. Small and medium research groups often lack the resources (human or computational) needed to manage large quantities of data, and installing the required analysis tools and reference data and setting up a virtualized environment can be a significant obstacle. To overcome these limitations, ELIXIR-ITA (the Italian Node of ELIXIR [2]), in collaboration with the INDIGO-DataCloud partners, is developing a case study focused on the “cloudification” of the Galaxy platform.

Methods

Galaxy [3-5] is an open-source, web-based workflow management platform for bioinformatics analysis. It supports data analysis by integrating multiple tools and complex bioinformatics workflows in an easy-to-use web environment. The Galaxy platform has many advantages: end users can easily deploy analysis pipelines and share them, together with data and results, with other users of the same instance; it is well supported by a large community of users and developers; and it is easy to learn but powerful enough to support complex workflows. While many public Galaxy servers exist, they have some important drawbacks.
First, the resources allocated to each user are usually very limited. Second, users cannot use software tools other than those provided by the administrator. Finally, users do not have full control over who can access their data, since the platform administrators can override any restriction set by the user. The ELIXIR-ITA case study consists of developing a fully customizable platform for provisioning Galaxy instances, based on the technologies developed within the INDIGO-DataCloud project [6] and designed to overcome these drawbacks. When fully operational, it will allow the easy setup of an on-demand workspace, ready to be used by life scientists and bioinformaticians. Each Galaxy instance will be configured automatically according to the virtual machine hardware, with specific configurations available through the setup web interface. The instance administrator will be able to fully customize it with tools and reference data, using either the Galaxy Tool Shed or direct access to the virtual environment. Each instance will also be deployed in an isolated environment. Isolating its data from any other instance on the same platform and from the cloud service provider will make the platform suitable for research and clinical scenarios involving sensitive human data. To deploy the required components and automatically set up Galaxy production instances we use TOSCA and the Ansible automation framework, both compatible with the most common open-source cloud middleware, OpenStack and OpenNebula.

Results

Galaxy is currently adopted in many life science research environments to facilitate the use of bioinformatics tools and the handling of large quantities of biological data.
While using the workflow manager is relatively simple, deploying and administering it require an adequate computational infrastructure and people with the necessary technical know-how. Offering Galaxy as a Platform as a Service (PaaS) will give small research groups, institutions and SMEs a simple way to set up and use their own Galaxy instance on suitable computational resources, without having to maintain their own hardware and software infrastructure. A Galaxy cloud service could also be a practical solution for universities and other training facilities. The system is currently in the prototype phase. It can set up and launch a virtual machine (VM) fully configured with the operating system (CentOS 7) and the ancillary applications needed to support a Galaxy production environment [7], such as PostgreSQL, Nginx, uWSGI and ProFTPD, and then deploy Galaxy itself. Users can choose between two Galaxy flavors: basic Galaxy, or Galaxy with a selection of tools for NGS analyses (e.g. SAMtools, BamTools, Bowtie, MACS and RSEM) already installed and configured. The basic configuration also provides one external volume for reference data and one for user data. SSH and FTP access to instances is possible because a public IP address is associated with each VM.

References:
[1] van Dijk EL, Auger H, Jaszczyszyn Y, Thermes C. Ten years of next-generation sequencing technology. Trends Genet. 2014 Sep;30(9):418-26.
[2] www.elixir-europe.org
[3] Goecks J, Nekrutenko A, Taylor J, The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010 Aug 25;11(8):R86.
[4] Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J. Galaxy: a web-based genome analysis tool for experimentalists. Current Protocols in Molecular Biology. 2010 Jan;Chapter 19:Unit 19.10.1-21.
[5] Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A. Galaxy: a platform for interactive large-scale genome analysis. Genome Research. 2005 Oct;15(10):1451-5.
[6] www.indigo-datacloud.eu
[7] https://wiki.galaxyproject.org/Admin/Config/Performance/ProductionServer

Poster Topic 2: Big Data Management, Modeling and Computing
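The prototype configuration described in the Results can be illustrated with a minimal Python sketch. This is not the project's TOSCA template or Ansible code, which the abstract does not show. The class name `GalaxyInstanceRequest`, the flavor keys `basic` and `ngs`, the default volume sizes and the one-worker-per-spare-vCPU sizing rule are all assumptions made for illustration; only the operating system, the service list, the example tools, the two external volumes and the public IP come from the text.

```python
from dataclasses import dataclass

# Ancillary services the abstract lists for a Galaxy production environment.
BASE_SERVICES = ("PostgreSQL", "Nginx", "uWSGI", "ProFTPD")

# Example NGS tools named in the abstract for the second flavor.
NGS_TOOLS = ("SAMtools", "BamTools", "Bowtie", "MACS", "RSEM")

# Flavor keys are hypothetical labels for the two flavors in the text.
FLAVOR_TOOLS = {"basic": (), "ngs": NGS_TOOLS}


@dataclass(frozen=True)
class GalaxyInstanceRequest:
    """Hypothetical description of one on-demand Galaxy instance."""

    flavor: str = "basic"
    vcpus: int = 2
    reference_volume_gb: int = 50   # assumed default size
    user_volume_gb: int = 100       # assumed default size

    def __post_init__(self):
        # Validate only; no field is modified, so frozen=True is respected.
        if self.flavor not in FLAVOR_TOOLS:
            raise ValueError(f"unknown flavor: {self.flavor!r}")
        if self.vcpus < 1:
            raise ValueError("vcpus must be at least 1")

    def web_workers(self) -> int:
        """Assumed sizing rule: keep one vCPU free, but run at least one worker."""
        return max(1, self.vcpus - 1)

    def to_inputs(self) -> dict:
        """Flatten the request into a dict that a deployment tool could consume."""
        return {
            "os": "CentOS 7",
            "flavor": self.flavor,
            "services": list(BASE_SERVICES),
            "tools": list(FLAVOR_TOOLS[self.flavor]),
            "web_workers": self.web_workers(),
            "volumes": {
                "reference_data_gb": self.reference_volume_gb,
                "user_data_gb": self.user_volume_gb,
            },
            "public_ip": True,  # enables the SSH and FTP access mentioned in the text
        }


if __name__ == "__main__":
    import json

    request = GalaxyInstanceRequest(flavor="ngs", vcpus=4)
    print(json.dumps(request.to_inputs(), indent=2))
```

Keeping the request as a small validated record separates what the user chooses (flavor, hardware, storage) from how the instance is provisioned, which in the actual system is handled by TOSCA and Ansible.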
Providing bioinformatic workflow environments through the INDIGO-DataCloud e-infrastructure / M. Tangaro, G. Pesole, F. Zambelli. (Presented at the 13th Annual Meeting of the Bioinformatics Italian Society, held in Salerno in 2016.)
Providing bioinformatic workflow environments through the INDIGO-DataCloud e-infrastructure
G. Pesole; F. Zambelli
2016