Atlas event production on the EGEE infrastructure

Espinal, X.; Campana, S.; Perini, L.; Rod, W.

ATLAS is one of the four LHC (Large Hadron Collider) experiments at CERN and is devoted to the study of proton-proton and ion-ion collisions at 14 TeV. The ATLAS collaboration is composed of about 2000 scientists spread around the world. The experiment's requirements for next year are about 300 TB of storage and about 13 MSI2k of CPU power, and its computing relies on the GRID philosophy and the EGEE infrastructure. Simulated events are distributed over EGEE by the ATLAS production system. Data have to be processed and must be accessible to a large number of scientists for analysis. The data throughput of the ATLAS experiment is expected to be 320 MB/s, with an integrated volume of ~10 PB per year. Because the LHC requirements are so demanding, processing and storage need distributed resources, spread worldwide and interconnected with GRID technologies. In this context, event production is the way to produce, process and store data for analysis before the experiment starts up, and it is performed in a distributed way. Tasks are defined by physics coordinators and then assigned to Computing Elements (CEs) spread worldwide. Some of the jobs that make up a task need input data to produce new output, which means that jobs may need to fetch input from external sites and store their output remotely. For that reason, sites are connected by File Transfer Service (FTS) channels that link the Storage Element (SE) interfaces of the sites. ATLAS uses the services provided by the EGEE middleware. Event simulation jobs are sent to the LCG (LHC Computing Grid) by glite-WMS (Workload Management System) and Condor-G, using the dispatching tools of the CEs. Event simulation jobs also perform their own data management: they request their inputs and store their outputs on the desired SEs, while file locations and related information are managed with distributed LCG File Catalogues (LFC).
On the other hand, asynchronous file movement is performed by the ATLAS-specific Distributed Data Management (DDM) software, which handles file movement on top of the FTS services. The services causing most problems are the Storage Elements: the system depends strongly on the inputs for the event simulation jobs, and failing to retrieve them produces job failures, while failures in storing the outputs due to SE instabilities lead to the loss of the CPU time consumed by the job and to its consequent failure. For event simulation, glite-WMS is expected to handle jobs in a more reliable way and, for the CEs, different implementations without scalability limitations may be introduced. We also hope that a new implementation of the SRM (Storage Resource Manager) interface will solve the stability problems, mainly in the stage-in of the files needed by the jobs and the stage-out of the files they produce.
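
As a rough consistency check on the 320 MB/s throughput and the roughly 10 PB yearly volume quoted above, the short sketch below integrates the rate over one year. It assumes decimal units (1 MB = 10^6 bytes, 1 PB = 10^15 bytes) and continuous data flow over a full calendar year; both are assumptions made here for illustration and are not stated in the abstract. The function name `yearly_volume_pb` is hypothetical.

```python
# Assumption: a full calendar year of continuous data flow (upper bound).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def yearly_volume_pb(rate_mb_per_s, seconds=SECONDS_PER_YEAR):
    """Integrate a constant rate in MB/s over `seconds`; return petabytes.

    Decimal units are assumed: 1 MB = 1e6 bytes, 1 PB = 1e15 bytes.
    """
    total_bytes = rate_mb_per_s * 1e6 * seconds
    return total_bytes / 1e15


if __name__ == "__main__":
    # Throughput figure taken from the abstract.
    print(f"{yearly_volume_pb(320):.1f} PB per year")
```

Under these assumptions the result is about 10 PB per year, the same order as the quoted figure; real data taking does not run continuously, so this is only an order-of-magnitude check.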

Atlas event production on the EGEE infrastructure / X. Espinal, S. Campana, L. Perini, W. Rod. - In: 2nd EGEE User Forum / edited by Vangelis Floros, Bob Jones, Frank Harris, Massimo Lamanna, Carl Loomis. - Manchester : EGEE, 2007. - ISBN 9789290833031. - p. 54. (Paper presented at the 2nd EGEE User Forum, held in Manchester in 2007.)