User:Sanabria


John Sanabria, UPRM


Gridjobs

Gridjobs is a computational gateway that integrates deployment, monitoring, execution and notification mechanisms for running legacy applications that process arbitrarily divisible loads. Real computational grids exhibit high levels of uncertainty because of their heterogeneity, multiple administrative domains and non-deterministic levels of availability and reliability. Gridjobs mitigates this uncertainty by using statistical analysis to forecast resource performance and by distributing load among the available resources following an opportunistic approach.

Figure: Gridjobs' modules

Gridjobs has a modular design that favors extensibility (Figure above). For instance, the Statistical module uses information provided by remote monitoring tools, along with observed application executions, to estimate subsequent resource performance. However, it can be replaced with another module that implements more sophisticated forecasting techniques.

Gridjobs interacts with computational grids whose resources each deploy a Local Resource Manager (LRM) such as SGE or PBS, a monitoring tool such as SCMSWeb or Ganglia, and Globus as the grid middleware. In addition, it assumes that Globus is properly integrated with the LRM.

From a logical point of view, Gridjobs perceives a computational grid infrastructure as shown in the Figure below. Gridjobs knows the number of computational nodes registered in each computational resource, but it interacts with those resources only through their head nodes, respecting their particular management policies. All that information, the number of nodes together with the observed performance, is used to determine the amount of load that each computational resource is able to process.

Gridjobs must be deployed on a computational resource where the GRAM service runs reliably. The Figure below shows how Gridjobs interacts with a grid resource. First, Gridjobs submits a grid execution request to the local GRAM. The local GRAM contacts the remote GRAM and dispatches the Gridjobs request. The remote GRAM sends the request to its LRM instance. The LRM, according to its local management policies, determines which computational node will serve the request.

Gridjobs monitors the execution of every task that it submits. It periodically queries the remote resource to which a given task has been submitted and records the timestamp at which each state transition occurs. The Figure below shows the stages that a task traverses to finish its execution successfully. In production grid environments (e.g. PRAGMA), the time that a task spends in each stage can barely be determined.

For instance, the time spent in the UNSUBMITTED stage is affected by the network latency between the local and remote GRAM instances. The time spent in the PENDING stage is determined by the management policies deployed at the remote LRM, and the time spent in the ACTIVE stage is strongly affected by the existing load on the computational node selected to execute the request.
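The per-stage timing described above can be illustrated with a minimal sketch. It assumes that polling yields chronological (timestamp, state) pairs; the function names, the sample data and the terminal state name `DONE` are hypothetical and are not taken from Gridjobs itself.

```python
# Stages named in the text, plus an assumed terminal state.
STAGES = ("UNSUBMITTED", "PENDING", "ACTIVE")
FINAL_STATE = "DONE"  # assumption: name of the state reached on success


def record_transitions(observations):
    """Return the first timestamp (ms) at which each state was observed.

    `observations` is an iterable of (timestamp_ms, state) pairs in
    chronological order, as a periodic poll might produce them.
    """
    transitions = {}
    for timestamp_ms, state in observations:
        transitions.setdefault(state, timestamp_ms)
    return transitions


def stage_durations(transitions):
    """Time spent in each stage: entry into the next observed state
    minus entry into the stage itself."""
    ordered = [s for s in STAGES + (FINAL_STATE,) if s in transitions]
    return {
        current: transitions[following] - transitions[current]
        for current, following in zip(ordered, ordered[1:])
    }


# Hypothetical poll results, one sample every 500 ms.
polls = [
    (0, "UNSUBMITTED"), (500, "UNSUBMITTED"),
    (1000, "PENDING"), (1500, "PENDING"),
    (2000, "ACTIVE"), (2500, "ACTIVE"), (3000, "ACTIVE"),
    (3500, "DONE"),
]
durations = stage_durations(record_transitions(polls))
print(durations)
```

Note that the resolution of such measurements is bounded by the polling period: a stage shorter than one period may never be observed, in which case its time is attributed to the preceding stage.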

Gridjobs has been tested over five computational resources shared by the PRAGMA community.

The Figure below presents frequency diagrams of the behavior observed at fsvc001.asc.hpcc.jp. The x-axis represents time in milliseconds and the y-axis indicates the number of occurrences of a given event, namely the observed time spent in the UNSUBMITTED, PENDING and ACTIVE stages, respectively.

The next Figure presents the behavior observed at komolongma.ece.uprm.edu.

Gridjobs determines the probability function that best represents the behavior observed at those resources. To do so, it evaluates several candidate probability functions, estimates their parameters and selects the function that best models a resource's behavior using the Kolmogorov-Smirnov test.
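The selection step can be sketched as follows, using only the Python standard library. This is an illustration under stated assumptions, not the Gridjobs implementation: the candidate set (exponential, normal, Cauchy), the parameter estimators and the rule of picking the smallest Kolmogorov-Smirnov statistic are all chosen here for brevity, and the sample data is synthetic.

```python
import math
import statistics


def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF of `samples` and `cdf`."""
    ordered = sorted(samples)
    n = len(ordered)
    gap = 0.0
    for i, x in enumerate(ordered):
        theoretical = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        gap = max(gap, (i + 1) / n - theoretical, theoretical - i / n)
    return gap


def fit_exponential(samples):
    """Exponential CDF with the rate set to 1 / sample mean."""
    rate = 1.0 / statistics.fmean(samples)
    return lambda x: 1.0 - math.exp(-rate * x) if x > 0 else 0.0


def fit_normal(samples):
    """Normal CDF with the sample mean and standard deviation."""
    return statistics.NormalDist(
        statistics.fmean(samples), statistics.stdev(samples)
    ).cdf


def fit_cauchy(samples):
    """Cauchy CDF: median as location, half the interquartile range as scale."""
    q1, median, q3 = statistics.quantiles(samples, n=4)
    scale = max((q3 - q1) / 2.0, 1e-9)  # guard against a zero scale
    return lambda x: 0.5 + math.atan((x - median) / scale) / math.pi


# Assumed candidate set; the source names only Cauchy explicitly.
CANDIDATES = {
    "exponential": fit_exponential,
    "normal": fit_normal,
    "cauchy": fit_cauchy,
}


def select_best(samples):
    """Fit every candidate and return the name with the smallest KS statistic."""
    scores = {
        name: ks_statistic(samples, fit(samples))
        for name, fit in CANDIDATES.items()
    }
    return min(scores, key=scores.get), scores


# Synthetic stage times (ms): evenly spaced quantiles of an exponential
# distribution with a mean of 1000 ms.
n = 200
samples = [-1000.0 * math.log(1.0 - (i + 0.5) / n) for i in range(n)]
best, scores = select_best(samples)
print(best, {name: round(score, 3) for name, score in scores.items()})
```

Because the parameters are estimated from the same samples that are then tested, the statistic is useful here only for ranking candidates against each other; a proper significance test would need corrected critical values.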

The following Figure shows the probability functions selected by Gridjobs to represent the behavior observed at komolongma in the PENDING stage. The x-axis represents probability functions and the y-axis represents percentage values. For example, the Cauchy probability function was selected to model the PENDING behavior at komolongma in over 55% of the cases (green bar). When Cauchy was selected, the estimated values differed from the actual values by no more than 2.5%. In other words, for this particular resource and stage the error did not exceed 2.5%.

Gridjobs follows a divisible load approach to determine the amount of load that each computational resource is able to process, in such a way that all participating resources finish processing at the "same time". When this approach is compared with the classical round-robin algorithm, the difference in response times is overwhelming (Figure below). This remarkable difference arises because the divisible load approach reduces the number of Globus interventions compared with the round-robin approach.
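The idea of having all resources finish at the "same time" can be illustrated with a deliberately simplified model: each resource is described only by an aggregate processing rate, and communication and middleware overheads are ignored. The resource names, rates and the equal-split baseline below are hypothetical; in particular, this sketch does not model the Globus interventions that the text identifies as the main source of the measured difference.

```python
def divisible_shares(total_load, rates):
    """Split the load proportionally to each resource's processing rate,
    so that share / rate is identical for every resource."""
    total_rate = sum(rates.values())
    return {name: total_load * rate / total_rate for name, rate in rates.items()}


def equal_shares(total_load, rates):
    """Baseline: every resource receives the same amount of load."""
    return {name: total_load / len(rates) for name in rates}


def makespan(shares, rates):
    """Time at which the last resource finishes its share."""
    return max(shares[name] / rates[name] for name in shares)


# Hypothetical aggregate rates in load units per second
# (for example, number of nodes times per-node throughput).
rates = {"resource_a": 4.0, "resource_b": 2.0, "resource_c": 2.0}
total_load = 800.0

proportional = divisible_shares(total_load, rates)
uniform = equal_shares(total_load, rates)

print(proportional, makespan(proportional, rates))
print(uniform, makespan(uniform, rates))
```

With the proportional split, every resource finishes after total_load / sum(rates) seconds, whereas with the equal split the slowest resources determine the completion time.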

Unlike conventional divisible load approaches, Gridjobs supports an opportunistic approach at resource selection time. The divisible load approach suggests dispatching load among the available resources in such a way that the fastest resource goes first and the slowest one goes last. Gridjobs can adopt other ranking schemes that consider, for instance, the actual execution time, the pending time and the resource load. The next Figure compares the classical ranking scheme with a ranking scheme that considers the grid resource load. The Figure suggests a slight advantage when the load of the resource is taken into account.
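Pluggable ranking schemes of this kind can be sketched as interchangeable sort keys. The record fields, the three key functions and the sample values below are hypothetical and serve only to show how the dispatch order changes with the scheme; they are not the ranking functions used by Gridjobs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceStats:
    """Hypothetical per-resource observations."""
    name: str
    execution_ms: float  # observed time in the ACTIVE stage
    pending_ms: float    # observed time in the PENDING stage
    load: float          # reported resource load, 0.0 (idle) to 1.0 (busy)


def fastest_first(resource):
    """Classical scheme: shortest execution time goes first."""
    return resource.execution_ms


def least_loaded_first(resource):
    """Load-aware scheme: lowest load first, execution time breaks ties."""
    return (resource.load, resource.execution_ms)


def shortest_turnaround_first(resource):
    """Scheme that also counts the time spent waiting in the queue."""
    return resource.pending_ms + resource.execution_ms


def rank(resources, key):
    """Return resource names in dispatch order under the given scheme."""
    return [r.name for r in sorted(resources, key=key)]


resources = [
    ResourceStats("resource_a", execution_ms=100, pending_ms=900, load=0.9),
    ResourceStats("resource_b", execution_ms=150, pending_ms=100, load=0.2),
    ResourceStats("resource_c", execution_ms=300, pending_ms=50, load=0.5),
]

print(rank(resources, fastest_first))
print(rank(resources, least_loaded_first))
print(rank(resources, shortest_turnaround_first))
```

Keeping the scheme as a plain key function means a new ranking criterion can be added without touching the dispatch logic.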

Gridjobs offers an interesting approach for executing legacy applications over collaborative infrastructures such as PRAGMA. We are interested in improving the existing design so that third parties can develop novel modules for monitoring, scheduling and ranking schemes. In the medium term, we are interested in determining how to mitigate the scalability problems exhibited by grid middleware implementations and in evaluating novel ranking schemes that consider issues associated with sustainability, reliability and availability.

Presentation at INC09
