Condor-PRAGMA Interoperation
Condor-SCMSWeb Interoperation
Participants and Contacts
- Condor team contact: Miron Livny, Jaime Frey, Todd Tannenbaum
- ThaiGrid SCMSWeb Team contact: Putchong Uthayopas, Somsak Sriprayoonsakul, Sugree Phatanapherom
- PRAGMA Grid contact: Cindy Zheng
- Application testing: Jysoo Lee
Milestones
- Project initiated: 4/10/2007
- Finish Condor match-making using PRAGMA SCMS before the next PRAGMA (PRAGMA14)
- 3/4/2008 Due to workload, development of the Condor interface started in late March 2008. We expect to finish before the end of April.
- 21/4/2008 The beta version is almost finished. The main code is here. We are going to experiment on Rocks-153.
- 22/4/2008 The beta version is installed on Rocks-153. Thanks to Mr. Jysoo Lee's team from KISTI, who will help us test the system.
Architecture
[condor-g]                  [scmsweb]
^                               |
|                               |
|                               v
+---(condor_advertise)---[scmsweb_condor]<---------[httpd]
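The data flow in the diagram can be illustrated with a short Python sketch. It is not the scmsweb_condor code: the function names, the temporary-file handling, and the choice of UPDATE_STARTD_AD as the update command are assumptions made for illustration. The sketch writes a ClassAd, given as text, to a file and builds the condor_advertise command line that would push it to the Condor collector. By default it only builds the command and does not run it.

```python
import os
import shutil
import subprocess
import tempfile


def write_ad_file(ad_text, directory=None):
    """Write ClassAd text to a temporary file and return its path.

    The caller is responsible for removing the file afterwards.
    """
    fd, path = tempfile.mkstemp(suffix=".ad", dir=directory, text=True)
    with os.fdopen(fd, "w") as handle:
        handle.write(ad_text)
    return path


def advertise_command(ad_path, update_command="UPDATE_STARTD_AD"):
    """Build the command line: condor_advertise <update-command> <ad-file>."""
    return ["condor_advertise", update_command, ad_path]


def push_ad(ad_text, dry_run=True):
    """Write the ad and, unless dry_run is set, hand it to condor_advertise."""
    path = write_ad_file(ad_text)
    command = advertise_command(path)
    if not dry_run and shutil.which("condor_advertise"):
        subprocess.check_call(command)
    return command


if __name__ == "__main__":
    # Hypothetical ad for one cluster; the attribute values are placeholders.
    example_ad = 'MyType = "Machine"\nName = "rocks-153.sdsc.edu"\n'
    print(" ".join(push_ad(example_ad)))
```

Keeping the command construction separate from its execution makes it possible to inspect what would be advertised on a machine without Condor installed.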
Issues, discussions and resolutions
- 10/26/2007
- Project Goal
- Condor-G will use data from SCMS to do match making
- SCMS provides Condor-pool interface (GUI)
- We agree to start with Condor-G match making with SCMS first
- The ThaiGrid team will develop an adapter to push data to Condor-G. Todd suggests that the optimal interval is a couple of minutes, or whenever any data changes.
- There is a possibility of also pushing software catalog data from SCMS to Condor-G. ThaiGrid is supposed to develop a software catalog mechanism for PRAGMA anyway.
- ThaiGrid will provide a cluster for experimenting with all of this, and will give the Condor team an account on it so we can help each other.
- We will start at the cluster level first.
- 5/2008
- Job submission is hard because authentication/authorization is not set up properly on every cluster
- Looking for a way to exclude some clusters from testing automatically
- 7/2008
- Agreed to publish a paper at the PRAGMA Workshop on e-Science.
- 15/8/2008
- Abstract sent to Yoshio-san. The paper will represent the resource working group.
- Paper named by Miron. Abstract revised by Cindy. Many thanks!
Paper abstract and title
Interfacing SCMSWeb with Condor-G – A joint PRAGMA-Condor effort
SCMSWeb-Condor interface development is an international collaborative effort among some PRAGMA member institutions (ThaiGrid, Thailand; KISTI, Korea; UCSD, USA) and Condor Team at University of Wisconsin. It aims at utilizing the rich information collected by the grid monitoring system - SCMSWeb, to enable Condor-G to provide more intelligent Grid-level scheduling services. This paper addresses the motivations, practices, benefits and issues of this collaboration; illustrates the design, development and implementation of the SCMSWeb-Condor interface; describes the experiences, test results and discussions driven by running applications thru SCMSWeb-Condor interface.
Experiments and results
- Important attributes for Condor-G
- Used/Free slots on scheduler queue (cluster-level)
- CPU Utilization
- Load average
- Memory usage
- Disk usage
- Architecture and CPU information
- Hostname (FQDN)
- Keyboard interaction (should we have that? All machines in PRAGMA are clusters)
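One way these attributes might be carried into a ClassAd is sketched below in Python. The attribute names (FreeSlots, LoadAvg, and so on) and the sample values are hypothetical; neither the names SCMSWeb exports nor the ones scmsweb_condor advertises are given on this page. Only Name and Arch are taken from the requirements expressions used later in the job submission files. Strings are quoted, numbers are written bare, and booleans become True or False, following ClassAd syntax.

```python
def classad_value(value):
    """Format a Python value as a ClassAd literal."""
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "True" if value else "False"
    if isinstance(value, (int, float)):
        return repr(value)
    # Escape backslashes and double quotes inside string values.
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return '"' + text + '"'


def render_classad(attributes):
    """Render a dict as 'Name = value' lines, one attribute per line."""
    return "".join(
        "%s = %s\n" % (name, classad_value(value))
        for name, value in attributes.items()
    )


# Hypothetical cluster-level record covering the attributes listed above.
cluster = {
    "Name": "rocks-153.sdsc.edu",  # hostname (FQDN)
    "Arch": "INTEL",               # architecture
    "FreeSlots": 12,               # free slots on the scheduler queue
    "UsedSlots": 4,                # used slots on the scheduler queue
    "CpuUtilization": 0.35,
    "LoadAvg": 1.2,
    "MemoryUsage": 0.5,
    "DiskUsage": 0.7,
}

print(render_classad(cluster))
```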
Running simulations on PRAGMA Grid
Step 1: Check your account
- Check your account for the machines in the PRAGMA Grid
jysoo@rocks153 [1] usertext.sh >& test-051308
- A detailed description of usertext.sh can be found at [1]
- Select "live" sites, prepare the requirements expression for the Condor job submission file, and specify system requirements (e.g., architecture)
jysoo@rocks153 [2] get-live.py test-051308
requirements = ( ( TARGET.Name == "sakura.hpcc.jp" ) || ( TARGET.Name == "pragma001.grid.sinica.edu.tw" ) || ( TARGET.Name == "pragma" ) || ( TARGET.Name == "server1" ) || ( TARGET.Name == "bkluster.hpcc.hut.edu.vn" ) || ( TARGET.Name == "jupiter.gridcenter.or.kr" ) || ( TARGET.Name == "pragma.lzu.edu.cn" ) || ( TARGET.Name == "nucleus.mygridusbio.net.my" ) || ( TARGET.Name == "nacona00.nchc.org.tw" ) || ( TARGET.Name == "grid64.hpcc.nectec.or.th" ) || ( TARGET.Name == "cafe01.exp-net.osaka-u.ac.jp" ) || ( TARGET.Name == "tea01.exp-net.osaka-u.ac.jp" ) || ( TARGET.Name == "rocks-52.sdsc.edu" ) || ( TARGET.Name == "rocks-153.sdsc.edu" ) || ( TARGET.Name == "rocks-96.sdsc.edu" ) || ( TARGET.Name == "sunyata.thaigrid.or.th" ) || ( TARGET.Name == "volatile.ece.uprm.edu" ) || ( TARGET.Name == "ocikbpra.unizh.ch" ) ) && ( TARGET.Arch == "INTEL" )
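jysoo@rocks153 [3] cat get-live.py
#!/usr/bin/python
# Select live sites from the output file of Cindy's 'usertext.sh'.
import sys

# Read the whole output file given as the first argument.
ifile = open(sys.argv[1], mode='r')
command = ifile.readlines()
ifile.close()

output = 'requirements = ( '
# Skip the first three lines of the file.
for line in command[3:]:
    # Keep a site only if its line mentions neither 'globus' nor 'error'.
    if (line.find('globus') == -1) and (line.find('error') == -1):
        output = output + '( TARGET.Name == "' + line.rstrip("\n") + '" ) || '
# Drop the trailing '|| ' and append the architecture constraint.
output = output[:len(output)-3] + ') && ( TARGET.Arch == "INTEL" )'
print(output)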
Step 2: Run a single job
- Prepare job submission file
jysoo@rocks153 [4] cat job-perc-grid
Executable = span-2d
Universe = grid
x509userproxy = /tmp/x509up_u531
grid_resource = $$(resource_name)
requirements = ( ( TARGET.Name == "sakura.hpcc.jp" ) || ( TARGET.Name == "pragma001.grid.sinica.edu.tw" ) || ( TARGET.Name == "pragma.sdg.ac.cn" ) || ( TARGET.Name == "server1.itsc.cuhk.edu.hk" ) || ( TARGET.Name == "bkluster.hpcc.hut.edu.vn" ) || ( TARGET.Name == "jupiter.gridcenter.or.kr" ) || ( TARGET.Name == "pragma.lzu.edu.cn" ) || ( TARGET.Name == "nucleus.mygridusbio.net.my" ) || ( TARGET.Name == "nacona00.nchc.org.tw" ) || ( TARGET.Name == "grid64.hpcc.nectec.or.th" ) || ( TARGET.Name == "cafe01.exp-net.osaka-u.ac.jp" ) || ( TARGET.Name == "tea01.exp-net.osaka-u.ac.jp" ) || ( TARGET.Name == "rocks-52.sdsc.edu" ) || ( TARGET.Name == "rocks-153.sdsc.edu" ) || ( TARGET.Name == "rocks-96.sdsc.edu" ) || ( TARGET.Name == "sunyata.thaigrid.or.th" ) || ( TARGET.Name == "volatile.ece.uprm.edu" ) || ( TARGET.Name == "ocikbpra.unizh.ch" ) ) && (TARGET.Arch == "INTEL")
output = perc.out
input = perc.in
error = perc.err
Log = perc-grid.log
Queue
- Here, the executable is "span-2d". The input file "perc.in" contains three numbers for the simulation, and the output file "perc.out" contains two numbers.
- Submit job
jysoo@rocks153 [5] condor_submit job-perc-grid
jysoo@rocks153 [6] condor_q
-- Submitter: rocks-153.sdsc.edu : <198.202.88.153:32774> : rocks-153.sdsc.edu
ID      OWNER   SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
3473.0  jysoo   5/13 12:39   0+00:00:00  I   0    9.8   span-2d
1 jobs; 1 idle, 0 running, 0 held
jysoo@rocks153 [7] cat perc.out
10000 10000
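A script that waits for jobs to finish could read the summary line that condor_q prints at the end of its output. The following Python sketch parses that line. The function name is hypothetical, and the pattern is fitted only to the summary format shown in the session above, so other Condor versions may need a different pattern.

```python
import re

# Matches a summary such as: "1 jobs; 1 idle, 0 running, 0 held"
SUMMARY = re.compile(r"(\d+) jobs; (\d+) idle, (\d+) running, (\d+) held")


def parse_summary(condor_q_output):
    """Return job counts from the condor_q summary line, or None if absent."""
    match = SUMMARY.search(condor_q_output)
    if match is None:
        return None
    total, idle, running, held = (int(group) for group in match.groups())
    return {"total": total, "idle": idle, "running": running, "held": held}


counts = parse_summary("1 jobs; 1 idle, 0 running, 0 held")
print(counts)
```

A non-zero "held" count is the kind of condition that would flag a cluster for closer inspection.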
Step 3: Run multiple jobs
- Prepare job submission file
jysoo@rocks153 [8] cat job-multi-perc-grid
Executable = span-2d
Universe = grid
x509userproxy = /tmp/x509up_u531
grid_resource = $$(resource_name)
requirements = ( ( TARGET.Name == "sakura.hpcc.jp" ) || ( TARGET.Name == "pragma001.grid.sinica.edu.tw" ) || ( TARGET.Name == "pragma.sdg.ac.cn" ) || ( TARGET.Name == "server1.itsc.cuhk.edu.hk" ) || ( TARGET.Name == "bkluster.hpcc.hut.edu.vn") || ( TARGET.Name == "jupiter.gridcenter.or.kr" ) || ( TARGET.Name == "pragma.lzu.edu.cn" ) || ( TARGET.Name == "nucleus.mygridusbio.net.my" ) || ( TARGET.Name == "nacona00.nchc.org.tw" ) || ( TARGET.Name == "grid64.hpcc.nectec.or.th" ) || ( TARGET.Name == "cafe01.exp-net.osaka-u.ac.jp" ) || ( TARGET.Name == "tea01.exp-net.osaka-u.ac.jp" ) || ( TARGET.Name == "rocks-52.sdsc.edu" ) || ( TARGET.Name == "rocks-153.sdsc.edu" ) || ( TARGET.Name == "rocks-96.sdsc.edu" ) || ( TARGET.Name == "sunyata.thaigrid.or.th" ) || ( TARGET.Name == "volatile.ece.uprm.edu" ) || ( TARGET.Name == "ocikbpra.unizh.ch" ) ) && ( TARGET.Arch == "INTEL" )
output = perc$(Process).out
input = perc$(Process).in
error = perc$(Process).err
Log = perc-grid.log
Queue 81
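- "Queue 81" queues 81 jobs. $(Process) runs from 0 to 80, so each job reads its own input file (perc0.in to perc80.in) with different parameter values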
- Submit job
jysoo@rocks153 [9] condor_submit job-multi-perc-grid
- Jobs sent to "pragma001.grid.sinica.edu.tw" and "pragma.lzu.edu.cn" were put on hold for an unknown reason, so both machines were removed from the list
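Removing the two held machines from the requirements expression can be done by hand. As a minimal sketch of automating it, the Python below extracts the host names from an expression of the shape printed by get-live.py, drops the excluded ones, and rebuilds the line. The function names are hypothetical, and the architecture is assumed to stay "INTEL" as in the job files above. The sketch does not guard against an empty host list, which would produce an invalid expression.

```python
import re

# Matches one clause such as: TARGET.Name == "sakura.hpcc.jp"
NAME_CLAUSE = re.compile(r'TARGET\.Name\s*==\s*"([^"]*)"')


def host_names(requirements):
    """Extract the host names from the TARGET.Name clauses, in order."""
    return NAME_CLAUSE.findall(requirements)


def build_requirements(hosts, arch="INTEL"):
    """Rebuild the expression in the same shape get-live.py prints."""
    clauses = " || ".join('( TARGET.Name == "%s" )' % host for host in hosts)
    return 'requirements = ( %s ) && ( TARGET.Arch == "%s" )' % (clauses, arch)


def drop_hosts(requirements, excluded):
    """Return the requirements line without the excluded hosts."""
    kept = [host for host in host_names(requirements) if host not in excluded]
    return build_requirements(kept)


# A shortened host list, for illustration only.
original = build_requirements(
    ["sakura.hpcc.jp", "pragma001.grid.sinica.edu.tw", "pragma.lzu.edu.cn"]
)
print(drop_hosts(original, {"pragma001.grid.sinica.edu.tw", "pragma.lzu.edu.cn"}))
```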
Step 4: Performance measurement
- Parallel execution of short jobs
For small-scale problems, the execution time of an individual job is very short, so performance is dominated by the queueing structure.
For system size L = 10 and number of iterations = 10000, the execution time is on the order of a tenth of a second:
jysoo@rocks153 time span-2d < perc.in
real 0m0.155s
user 0m0.123s
sys 0m0.002s
The wall-time for the completion of a given number of individual jobs (the time difference between the submission and the termination of the last job) is the following:
# individual jobs    wall-time
51                   40:53
101                  69:54
201                  138:54
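To make the queueing overhead concrete, the table can be turned into an average wall-time per job. The Python sketch below assumes that the wall-times are given as minutes:seconds, which the text does not state.

```python
def to_seconds(wall_time):
    """Convert 'minutes:seconds' (an assumed format) to seconds."""
    minutes, seconds = wall_time.split(":")
    return int(minutes) * 60 + int(seconds)


# (number of individual jobs, wall-time) pairs from the table above.
measurements = [(51, "40:53"), (101, "69:54"), (201, "138:54")]

for jobs, wall_time in measurements:
    total = to_seconds(wall_time)
    per_job = total / float(jobs)
    print("%4d jobs: %5d s total, %.1f s per job" % (jobs, total, per_job))
```

Under that assumption the per-job figure comes out at roughly 41 to 48 seconds, far above the 0.155 s execution time of a single job, which is consistent with queueing dominating the total.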
The execution time of an individual job can be varied by changing the number of iterations.
number of iterations    real-time (sec)
10000                   0.155
100000                  1.221
1000000                 12.314
10000000                123.419
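The table suggests that execution time grows linearly with the number of iterations. A quick check, sketched in Python, divides each measured time by its iteration count:

```python
# (number of iterations, real-time in seconds) from the table above.
timings = [
    (10000, 0.155),
    (100000, 1.221),
    (1000000, 12.314),
    (10000000, 123.419),
]


def microseconds_per_iteration(iterations, seconds):
    """Average cost of one iteration in microseconds."""
    return seconds / iterations * 1e6


for iterations, seconds in timings:
    print("%9d iterations: %.2f us per iteration"
          % (iterations, microseconds_per_iteration(iterations, seconds)))
```

The cost settles near 12.3 microseconds per iteration for the three longer runs. The higher figure for the shortest run would be consistent with a fixed startup cost, though that is an inference rather than something measured here.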
I also measured the wall-time for the completion of a fixed number (51) of individual jobs while varying the number of iterations.
number of iterations    wall-time
10000                   40:53
100000                  35:46
1000000                 37:46
10000000                39:59
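A rough way to read this table is to compare the total compute time of the 51 jobs with the measured wall-time. The Python sketch below does so using the per-job real-times from the earlier table. It assumes that the wall-times are minutes:seconds and that every job takes the real-time measured on rocks153, neither of which is stated in the text.

```python
JOBS = 51

# number of iterations -> (real-time per job in seconds, wall-time for 51 jobs)
runs = {
    10000: (0.155, "40:53"),
    100000: (1.221, "35:46"),
    1000000: (12.314, "37:46"),
    10000000: (123.419, "39:59"),
}


def to_seconds(wall_time):
    """Convert 'minutes:seconds' (an assumed format) to seconds."""
    minutes, seconds = wall_time.split(":")
    return int(minutes) * 60 + int(seconds)


def compute_ratio(real_time, wall_time, jobs=JOBS):
    """Total compute time of all jobs divided by the wall-time."""
    return jobs * real_time / to_seconds(wall_time)


for iterations in sorted(runs):
    real_time, wall_time = runs[iterations]
    print("%9d iterations: compute/wall = %.3f"
          % (iterations, compute_ratio(real_time, wall_time)))
```

Under these assumptions the ratio rises from well under 1% for the shortest jobs to about 2.6 for the longest. A value above 1 means the jobs must have run concurrently. Since the wall-time stays between roughly 36 and 41 minutes throughout, it appears to be set by queueing and scheduling on the grid rather than by the computation itself.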
Papers and presentations
Related links
- Condor ClassAds interfaces - http://www.cs.wisc.edu/condor/manual/v6.4/4_1Condor_s_ClassAd.html
- Advertising Grid resources to Condor - http://www.cs.wisc.edu/condor/manual/v7.0/5_3Grid_Universe.html#SECTION00637200000000000000
- SCMSWeb XML interfaces
- Original - off-line (zipped text file), on-line (PRAGMA)
- GLUE compliant - off-line (zipped text file), on-line (PRAGMA)
