Condor-PRAGMA Interoperation

From PRAGMAgridWIKI

Condor-SCMSWeb Interoperation

Milestones

  • Project initiated: 4/10/2007
  • Goal: finish Condor matchmaking using PRAGMA SCMS before the next PRAGMA workshop (PRAGMA14)
  • 3/4/2008 Due to workload, development of the Condor interface started in late March 2008. We expect to finish before the end of April.
  • 21/4/2008 The beta version is almost finished. The main code is here. We are going to experiment on Rocks-153.
  • 22/4/2008 The beta version is installed on Rocks-153. Thanks to Mr. Jysoo Lee's team from KISTI, who will help us test the system.

Architecture

[condor-g]                                            [scmsweb]
    ^                                                     |
    |                                                     |
    |                                                     v
    +---(condor_advertise)---[scmsweb_condor]<---------[httpd]
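
The scmsweb_condor step in the diagram can be sketched roughly as follows: turn the attributes fetched from SCMSWeb into ClassAd text and hand the file to condor_advertise. This is only a minimal sketch, not the actual adapter code. The attribute name FreeSlots, the jobmanager suffix in resource_name, and all values are hypothetical; the resource_name attribute is included because the job submission files on this page use grid_resource = $$(resource_name).

```python
import subprocess
import tempfile


class Expr(str):
    """Marks a value as a raw ClassAd expression (written without quotes)."""


def format_classad(attrs):
    """Render a mapping as ClassAd text, one 'Name = value' line per attribute."""
    lines = []
    for name, value in attrs.items():
        if isinstance(value, Expr):
            rendered = str(value)
        elif isinstance(value, bool):
            rendered = "True" if value else "False"
        elif isinstance(value, str):
            # Assumption: string values never contain a double quote.
            if '"' in value:
                raise ValueError("unsupported character in %r" % value)
            rendered = '"%s"' % value
        else:
            rendered = str(value)  # integers and floats
        lines.append("%s = %s" % (name, rendered))
    return "\n".join(lines) + "\n"


def advertise(attrs, command="UPDATE_STARTD_AD"):
    """Write the ad to a temporary file and pass it to condor_advertise."""
    with tempfile.NamedTemporaryFile("w", suffix=".ad", delete=False) as f:
        f.write(format_classad(attrs))
        path = f.name
    subprocess.check_call(["condor_advertise", command, path])


# Hypothetical cluster-level ad; every value here is illustrative only.
example_ad = {
    "MyType": "Machine",
    "TargetType": "Job",
    "Name": "rocks-153.sdsc.edu",
    "Arch": "INTEL",
    "resource_name": "gt2 rocks-153.sdsc.edu/jobmanager-sge",
    "FreeSlots": 12,   # hypothetical attribute name
    "LoadAvg": 0.42,
    "Requirements": Expr("TARGET.JobUniverse == 9"),  # 9 is the grid universe
}

if __name__ == "__main__":
    print(format_classad(example_ad))
```

Calling advertise(example_ad) would need a working Condor installation, so the sketch only prints the ad. In a real adapter this would run once per cluster at the chosen update interval.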

Issues, discussions and resolutions

  • 10/26/2007
    • Project Goal
      • Condor-G will use data from SCMS to do match making
      • SCMS provides Condor-pool interface (GUI)
    • We agree to start with Condor-G match making with SCMS first
    • The ThaiGrid team will develop an adapter to push data to Condor-G. Todd suggests that the optimal update interval is a couple of minutes, or whenever any data changes.
    • It may also be possible to push software catalog data from SCMS to Condor-G. ThaiGrid is supposed to develop a software catalog mechanism for PRAGMA anyway.
    • ThaiGrid will provide a cluster for these experiments and will give the Condor team an account on it so that we can help each other.
    • We will start with Cluster-level first.
  • 5/2008
    • Job submission is hard because authentication/authorization is not set up properly on all of the clusters
    • Looking for a way to automatically exclude such clusters from testing
  • 7/2008
  • 15/8/2008
    • Abstract sent to Yoshio-san. The paper will represent the Resources Working Group.
    • Paper named by Miron. Abstract revised by Cindy. Many thanks!

Paper abstract and title

Interfacing SCMSWeb with Condor-G – A joint PRAGMA-Condor effort

SCMSWeb-Condor interface development is an international collaborative effort among several PRAGMA member institutions (ThaiGrid, Thailand; KISTI, Korea; UCSD, USA) and the Condor Team at the University of Wisconsin. It aims to use the rich information collected by the grid monitoring system SCMSWeb to enable Condor-G to provide more intelligent Grid-level scheduling services. This paper addresses the motivations, practices, benefits and issues of this collaboration; illustrates the design, development and implementation of the SCMSWeb-Condor interface; and describes the experiences, test results and discussions driven by running applications through the SCMSWeb-Condor interface.

Experiments and results

  • Important attributes for Condor-G
    • Used/Free slots on scheduler queue (cluster-level)
    • CPU Utilization
    • Load average
    • Memory usage
    • Disk usage
    • Architecture and CPU information
    • Hostname (FQDN)
    • Keyboard interaction (should we have that? All machines in PRAGMA are clusters)

Running simulations on PRAGMA Grid

Step 1: Check your account

  • Check your account for the machines in the PRAGMA Grid
jysoo@rocks153 [1] usertext.sh >& test-051308

- A detailed description of usertext.sh can be found at [1]

  • Select "live" sites, prepare the Condor job submission file, and specify system requirements (e.g., architecture)
jysoo@rocks153 [2] get-live.py test-051308

requirements = ( ( TARGET.Name == "sakura.hpcc.jp" ) || ( TARGET.Name == "pragma001.grid.sinica.edu.tw" ) || ( TARGET.Name == "pragma" ) 
|| ( TARGET.Name == "server1" ) || ( TARGET.Name == "bkluster.hpcc.hut.edu.vn" ) || ( TARGET.Name == "jupiter.gridcenter.or.kr" ) 
|| ( TARGET.Name == "pragma.lzu.edu.cn" ) || ( TARGET.Name == "nucleus.mygridusbio.net.my" ) || ( TARGET.Name == "nacona00.nchc.org.tw" ) 
|| ( TARGET.Name == "grid64.hpcc.nectec.or.th" ) || ( TARGET.Name == "cafe01.exp-net.osaka-u.ac.jp" ) 
|| ( TARGET.Name == "tea01.exp-net.osaka-u.ac.jp" ) || ( TARGET.Name == "rocks-52.sdsc.edu" ) || ( TARGET.Name == "rocks-153.sdsc.edu" ) 
|| ( TARGET.Name == "rocks-96.sdsc.edu" ) || ( TARGET.Name == "sunyata.thaigrid.or.th" ) || ( TARGET.Name == "volatile.ece.uprm.edu" ) 
|| ( TARGET.Name == "ocikbpra.unizh.ch" ) ) && ( TARGET.Arch == "INTEL" )
jysoo@rocks153 [3] cat get-live.py

#!/usr/bin/python
# Select live sites from the output file of Cindy's 'usertext.sh' and
# print a Condor 'requirements' expression that matches only those sites.
import sys

with open(sys.argv[1]) as ifile:
    lines = ifile.readlines()

# Skip the first three lines; a site is treated as live when its line
# mentions neither 'globus' nor 'error'.
clauses = []
for line in lines[3:]:
    if 'globus' not in line and 'error' not in line:
        clauses.append('( TARGET.Name == "%s" )' % line.rstrip("\n"))

print('requirements = ( ' + ' || '.join(clauses) + ' ) && ( TARGET.Arch == "INTEL" )')

Step 2: Run a single job

  • Prepare job submission file
jysoo@rocks153 [4] cat job-perc-grid

Executable = span-2d
Universe = grid
x509userproxy = /tmp/x509up_u531
grid_resource = $$(resource_name)
requirements = ( ( TARGET.Name == "sakura.hpcc.jp" ) || ( TARGET.Name == "pragma001.grid.sinica.edu.tw" ) || ( TARGET.Name == "pragma.sdg.ac.cn" ) 
|| ( TARGET.Name == "server1.itsc.cuhk.edu.hk" ) || ( TARGET.Name == "bkluster.hpcc.hut.edu.vn" ) || ( TARGET.Name == "jupiter.gridcenter.or.kr" ) 
|| ( TARGET.Name == "pragma.lzu.edu.cn" ) || ( TARGET.Name == "nucleus.mygridusbio.net.my" ) || ( TARGET.Name == "nacona00.nchc.org.tw" ) 
|| ( TARGET.Name == "grid64.hpcc.nectec.or.th" ) || ( TARGET.Name == "cafe01.exp-net.osaka-u.ac.jp" ) || ( TARGET.Name == "tea01.exp-net.osaka-u.ac.jp" ) 
|| ( TARGET.Name == "rocks-52.sdsc.edu" ) || ( TARGET.Name == "rocks-153.sdsc.edu" ) || ( TARGET.Name == "rocks-96.sdsc.edu" ) 
|| ( TARGET.Name == "sunyata.thaigrid.or.th" ) || ( TARGET.Name == "volatile.ece.uprm.edu" ) || ( TARGET.Name == "ocikbpra.unizh.ch" ) ) 
&& (TARGET.Arch == "INTEL")
output = perc.out
input = perc.in
error = perc.err
Log = perc-grid.log
Queue

- Here, the executable is "span-2d". The input file "perc.in" contains three numbers for the simulation, and the output file "perc.out" contains two numbers.

  • Submit job
jysoo@rocks153 [5] condor_submit job-perc-grid

jysoo@rocks153 [6] condor_q

-- Submitter: rocks-153.sdsc.edu : <198.202.88.153:32774> : rocks-153.sdsc.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
3473.0   jysoo           5/13 12:39   0+00:00:00 I  0   9.8  span-2d           

1 jobs; 1 idle, 0 running, 0 held

jysoo@rocks153 [7] cat perc.out
10000
10000

Step 3: Run multiple jobs

  • Prepare job submission file
jysoo@rocks153 [8] cat job-multi-perc-grid

Executable = span-2d
Universe = grid
x509userproxy = /tmp/x509up_u531
grid_resource = $$(resource_name)
requirements = ( ( TARGET.Name == "sakura.hpcc.jp" ) || ( TARGET.Name == "pragma001.grid.sinica.edu.tw" ) || ( TARGET.Name == "pragma.sdg.ac.cn" ) 
|| ( TARGET.Name == "server1.itsc.cuhk.edu.hk" ) || ( TARGET.Name == "bkluster.hpcc.hut.edu.vn") || ( TARGET.Name == "jupiter.gridcenter.or.kr" ) 
|| ( TARGET.Name == "pragma.lzu.edu.cn" ) || ( TARGET.Name == "nucleus.mygridusbio.net.my" ) || ( TARGET.Name == "nacona00.nchc.org.tw" ) 
|| ( TARGET.Name == "grid64.hpcc.nectec.or.th" ) || ( TARGET.Name == "cafe01.exp-net.osaka-u.ac.jp" ) || ( TARGET.Name == "tea01.exp-net.osaka-u.ac.jp" ) 
|| ( TARGET.Name == "rocks-52.sdsc.edu" ) || ( TARGET.Name == "rocks-153.sdsc.edu" ) || ( TARGET.Name == "rocks-96.sdsc.edu" ) 
|| ( TARGET.Name == "sunyata.thaigrid.or.th" ) || ( TARGET.Name == "volatile.ece.uprm.edu" ) || ( TARGET.Name == "ocikbpra.unizh.ch" ) )
&& ( TARGET.Arch == "INTEL" )
output = perc$(Process).out
input = perc$(Process).in
error = perc$(Process).err
Log = perc-grid.log
Queue 81

- This submits 81 jobs, each reading its own input file (perc0.in through perc80.in) with different parameter values
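
The per-job input files could be generated with a short script like the one below. It is a sketch under assumptions, not the script that was actually used: the page only says that an input file holds three numbers, so the order (system size, number of iterations, a third swept parameter), the one-line whitespace-separated layout, and the swept values are all hypothetical.

```python
import os


def build_inputs(param_sets):
    """Map each file name perc<N>.in to its text; N matches Condor's $(Process)."""
    inputs = {}
    for process, params in enumerate(param_sets):
        inputs["perc%d.in" % process] = " ".join(str(p) for p in params) + "\n"
    return inputs


def write_inputs(param_sets, directory="."):
    """Write the input files into the directory the jobs are submitted from."""
    for name, text in build_inputs(param_sets).items():
        with open(os.path.join(directory, name), "w") as f:
            f.write(text)


# Hypothetical sweep: fixed size and iteration count, 81 values of a third parameter.
SYSTEM_SIZE = 10
ITERATIONS = 10000
param_sets = [(SYSTEM_SIZE, ITERATIONS, (10 + i) / 100.0) for i in range(81)]

if __name__ == "__main__":
    write_inputs(param_sets)
```

Because Queue 81 numbers the jobs from 0 to 80, the file index starts at 0 as well.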

  • Submit job
jysoo@rocks153 [9] condor_submit job-multi-perc-grid

- Jobs sent to "pragma001.grid.sinica.edu.tw" and "pragma.lzu.edu.cn" were put on hold for an unknown reason, so both machines were removed from the list
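
Removing problem sites by hand can also be scripted. The sketch below rebuilds the requirements line from a host list while skipping excluded hosts; the function name and its arguments are hypothetical, and only the format of the expression follows the submission files on this page.

```python
def build_requirements(hosts, exclude=(), arch="INTEL"):
    """Return a Condor 'requirements' line for the hosts not listed in exclude."""
    excluded = set(exclude)
    clauses = ['( TARGET.Name == "%s" )' % h for h in hosts if h not in excluded]
    if not clauses:
        raise ValueError("no hosts left after exclusion")
    return 'requirements = ( %s ) && ( TARGET.Arch == "%s" )' % (
        " || ".join(clauses), arch)


if __name__ == "__main__":
    hosts = [
        "sakura.hpcc.jp",
        "pragma001.grid.sinica.edu.tw",
        "pragma.lzu.edu.cn",
        "rocks-153.sdsc.edu",
    ]
    on_hold = ["pragma001.grid.sinica.edu.tw", "pragma.lzu.edu.cn"]
    print(build_requirements(hosts, exclude=on_hold))
```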

Step 4: Performance measurement

  • Parallel execution of short jobs
For a small-scale problem, the execution time of an individual job is very short, so the performance is dominated by queueing overhead.
For system size L = 10 and 10000 iterations, the execution time is well under a second:

jysoo@rocks153 time span-2d < perc.in
real    0m0.155s
user    0m0.123s
sys     0m0.002s
The wall-time for the completion of a given number of individual jobs (the time between submission and the termination of the last job) is as follows:

  # individual jobs     wall-time
          51              40:53
         101              69:54
         201             138:54
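
To make the queueing cost concrete, the wall-time can be divided by the number of jobs. The sketch below assumes that the wall-times in the table are in minutes:seconds, which the page does not state explicitly.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string to seconds (assumed format)."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Values copied from the table above: number of jobs -> wall-time.
wall_times = {51: "40:53", 101: "69:54", 201: "138:54"}

# Average wall-time spent per job, in seconds.
per_job = {n: to_seconds(t) / float(n) for n, t in wall_times.items()}

if __name__ == "__main__":
    for n in sorted(per_job):
        print("%4d jobs: %5.1f s per job" % (n, per_job[n]))
```

Under that assumption each job costs roughly 41 to 48 seconds of wall-time, compared with 0.155 seconds of actual execution.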
The execution time of an individual job can be varied by changing the number of iterations.

  number of iterations   real-time (sec)
        10000                0.155
        100000               1.221
        1000000             12.314
        10000000           123.419
I also measured the wall-time for the completion of a fixed number (51) of individual jobs while varying the number of iterations:

  number of iterations   wall-time
        10000               40:53
        100000              35:46
        1000000             37:46
        10000000            39:59
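
One way to read these numbers is to compare the total execution time of the 51 jobs with the measured wall-time. The sketch below does that under two assumptions that are not stated on this page: the wall-times are in minutes:seconds, and a job takes about as long on a remote site as in the local timing on rocks153.

```python
JOBS = 51

# (iterations, real-time per job in seconds, wall-time for 51 jobs), from the tables above.
measurements = [
    (10000, 0.155, "40:53"),
    (100000, 1.221, "35:46"),
    (1000000, 12.314, "37:46"),
    (10000000, 123.419, "39:59"),
]


def to_seconds(mmss):
    """Convert a 'minutes:seconds' string to seconds (assumed format)."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def summarize(jobs, rows):
    """Return (iterations, total execution seconds, wall seconds, ratio) per row."""
    result = []
    for iterations, real, wall in rows:
        total = jobs * real
        wall_s = to_seconds(wall)
        result.append((iterations, total, wall_s, total / wall_s))
    return result


if __name__ == "__main__":
    for iterations, total, wall_s, ratio in summarize(JOBS, measurements):
        print("%9d iterations: %8.1f s executing, %5d s wall, ratio %.3f"
              % (iterations, total, wall_s, ratio))
```

A ratio far below 1 means the wall-time is almost entirely overhead. Only in the last row does the total execution time exceed the wall-time, which suggests that the jobs overlapped across sites.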

Papers and presentations

Related links
