GIN (Grid Interoperation Now) Monitoring

Lessons Learned

28/4/2006

People Involved

PRAGMA Grid

1.      UCSD/SDSC, USA: Peter Arzberger, Phil Papadopoulos, Mason Katz, Cindy Zheng

2.      AIST, Japan: Yoshio Tanaka

3.      TNGC/KU, Thailand: Putchong Uthayopas, Somsak Sriprayoonsakul, Sugree Phattanapherom

TeraGrid

1.      ANL, USA: Charlie Catlett, JP Navarro

2.      ISI: Laura Pearlman

Introduction

One of the requirements for running an application on a real Grid is the ability to view system and job status across the boundaries of computing systems and grids. System monitoring software helps application drivers and grid administrators check the state of jobs and easily locate failed or left-over jobs across the Grid.

Issues

1.      System monitoring software incompatibility - PRAGMA uses SCMSWeb as its main grid monitoring tool, while within TeraGrid each site uses different monitoring software. These tools are not interoperable with each other, even though most of them have web interfaces.

2.      Job monitoring incompatibility - This is the same problem as for system monitoring. In addition, many sites do not have web-based job monitoring software.

3.      Different software deployment scheme - SCMSWeb assumed that the cluster front-end also offered many Grid services such as GridFTP, Authentication, Grid job submission, etc., while TeraGrid clusters may have different services installed on different nodes.
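The deployment-scheme issue above can be illustrated with a small sketch. It assumes a hypothetical per-cluster description in which each Grid service may be given its own host, with the front-end used only as a fallback. The class name, service names and host names below are invented for illustration and are not part of SCMSWeb or any TeraGrid tool.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical service identifiers, loosely following the services named above.
SERVICES = ("gridftp", "authentication", "job_submission")


@dataclass
class ClusterDeployment:
    """Illustrative record of where each Grid service of one cluster runs."""
    name: str
    frontend: str
    # Explicit per-service hosts; any service not listed falls back to the front-end.
    service_hosts: Dict[str, str] = field(default_factory=dict)

    def host_for(self, service: str) -> str:
        if service not in SERVICES:
            raise ValueError(f"unknown service: {service}")
        return self.service_hosts.get(service, self.frontend)


# Layout in which the front-end offers every service.
single = ClusterDeployment(name="cluster-a", frontend="fe.cluster-a.example.org")

# Layout in which one service runs on a separate node (host names are made up).
split = ClusterDeployment(
    name="cluster-b",
    frontend="login.cluster-b.example.org",
    service_hosts={"gridftp": "gridftp.cluster-b.example.org"},
)

print(single.host_for("gridftp"))
print(split.host_for("gridftp"))
print(split.host_for("job_submission"))
```

With a description like this, a monitoring tool would look up the host of each service instead of assuming that everything runs on the front-end.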

Current Decisions

1.      For system monitoring, SCMS was installed on the Teragrid-uc-anl cluster and linked to the PRAGMA Grid Operation Center web site (http://goc.pragma-grid.net/scmsweb). SCMS is also being used as the job monitoring system to monitor TDDFT jobs across the GIN testbed.

2.      SCMS will be turned off on Teragrid-uc-anl after GGF17.

Lessons learned & Discussion

1.      Differences in Grid monitoring software and the lack of standardized interfaces prevent easy cross-system and cross-grid monitoring.

o        Solutions

§         Standardize data exchange format - We agreed to use and extend GLUE as the underlying standard for data exchange, even though the GLUE schema lacks many statistics collected by some monitoring software. The TeraGrid team has already produced some data definitions and modifications/extensions for Nagios and Ganglia based on GLUE. SCMS will adapt these data definitions to enable data exchange with other monitoring software.

§         Standardize data extraction interface - A standard interface such as Web Services/WSRF might be used as the medium that lets different monitoring software interoperate, since the interfaces between monitoring software are not currently standardized.

§         A bridging module is needed for each software package - The SCE team has developed a bridging module for Ganglia to export the data needed by SCMSWeb. More testing is needed on a real Ganglia site.

2.      Different software deployment scheme

o        Solutions

§         Software design must not assume any particular software stack or deployment scheme.
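The bridging-module idea from item 1 can be sketched roughly as follows. This is a minimal illustration only, not the SCE team's module: it parses a made-up sample shaped like the XML that Ganglia's gmond emits (CLUSTER, HOST and METRIC elements) and maps a few metrics onto GLUE-style attribute names. The mapping table, the chosen attribute names, the unit conversion and the "extensions" field are all assumptions made for this example. Metrics with no mapping are kept under "extensions", reflecting the point that GLUE lacks some statistics and has to be extended.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape of a Ganglia gmond XML dump.
SAMPLE_GANGLIA_XML = """
<GANGLIA_XML VERSION="3.0" SOURCE="gmond">
  <CLUSTER NAME="demo-cluster">
    <HOST NAME="node01.example.org" IP="10.0.0.1">
      <METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS=""/>
      <METRIC NAME="mem_total" VAL="2097152" TYPE="uint32" UNITS="KB"/>
      <METRIC NAME="cpu_num" VAL="2" TYPE="uint16" UNITS="CPUs"/>
      <METRIC NAME="bytes_in" VAL="1200.5" TYPE="float" UNITS="bytes/sec"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>
"""

# Hypothetical mapping: Ganglia metric name -> (GLUE-style key, converter).
METRIC_MAP = {
    "load_one": ("GlueHostProcessorLoadLast1Min", float),
    "mem_total": ("GlueHostMainMemoryRAMSize", lambda kb: int(kb) // 1024),  # KB -> MB
    "cpu_num": ("GlueHostArchitectureSMPSize", int),
}


def ganglia_to_glue(xml_text):
    """Return one GLUE-style dict per host found in the Ganglia XML text."""
    root = ET.fromstring(xml_text)
    records = []
    for cluster in root.iter("CLUSTER"):
        for host in cluster.iter("HOST"):
            record = {
                "GlueClusterName": cluster.get("NAME"),
                "GlueHostName": host.get("NAME"),
            }
            extras = {}
            for metric in host.iter("METRIC"):
                name, val = metric.get("NAME"), metric.get("VAL")
                if name in METRIC_MAP:
                    key, convert = METRIC_MAP[name]
                    record[key] = convert(val)
                else:
                    # No GLUE-style key assumed for this metric: keep it as an extension.
                    extras[name] = val
            record["extensions"] = extras
            records.append(record)
    return records


for rec in ganglia_to_glue(SAMPLE_GANGLIA_XML):
    print(rec)
```

A bridge of this kind keeps the monitoring tool itself unchanged; only the mapping table would need to be agreed between the teams.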

Plan

1.      We agreed to use the standardized data exchange format to share monitoring information across our grids.

2.      The TeraGrid team and the PRAGMA monitoring team will work together to enable monitoring software to share the same common data format. The reference for GLUE-based data for SCMS is here

3.      The common data format will be extended to cover job accounting. This is a part of GRMAP (Grid Resource Management and Accounting Project) in PRAGMA.


Written by Somsak Sriprayoonsakul <somsak_sr at thaigrid dot or dot th>