Application tips
From PRAGMAgridWIKI
Need help with a problem?
To get help from others with a problem in your application run, make sure that you
- describe the nature of the problem
- provide exact commands and exact output
- if requested, provide all input files, configuration files, etc., so others can reproduce your problem
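One way to capture exact commands together with their exact output is a small wrapper that writes both to a log file. This is only a sketch: the function name `run_logged` and the log file name are illustrative and not part of any PRAGMA tool.

```shell
# Hypothetical helper: append the exact command line, its combined
# stdout/stderr, and its exit status to a log file.
run_logged() {
    log=$1
    shift
    {
        printf '$ %s\n' "$*"               # the command as it was typed
        "$@" 2>&1                          # run it, merging stderr into stdout
        printf '[exit status: %s]\n' "$?"  # exit status of the command above
    } >> "$log"
}
```

For example, `run_logged problem.log qstat` appends the command, its output and its exit status to `problem.log`, which can then be attached to the help request.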
Dock6 MPI job sometimes quits prematurely
1. Problem description
I am using Cafe as the master cluster and using opal-op to submit dock6.mpi jobs to some remote clusters. A job shows up in qstat as running for approximately 1 minute and is dropped soon after.
This is the output from the std.out file:
Initializing MPI Routines...
p0_18339: (44.387859) net_send: could not write to fd=4, errno = 32
-catch_rsh /opt/gridengine/default/spool/compute-1-1/active_jobs/28735.1/pe_host file
This is the output from the std.err file:
Killed by signal 2.
Killed by signal 2.
This occurs about 80-90% of the time, but the job occasionally works. When I submit the jobs, the clusters seem to be quite free, with qhost showing zero load on most processors.
2. Possible cause
The error indicates that one of the MPI processes has died and the others get error messages when they try to communicate with it (on Linux, errno = 32 is EPIPE, "broken pipe").
One possible reason is a shortage of semaphores (System V interprocess communication objects). This happens when programs do not release semaphores on shutdown, for example any MPI or other program that uses shared memory segments and crashes.
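To check whether a node is running short of semaphores, compare the number of existing semaphore arrays with the kernel limit. The sketch below assumes the Linux `ipcs -s` column layout (key, semid, owner, perms, nsems); the function name `count_sem_by_owner` is illustrative.

```shell
# Count semaphore arrays per owner from `ipcs -s` style output on stdin.
# Assumes the Linux layout: key, semid, owner, perms, nsems.
# Header and separator lines are skipped because their 2nd field is not numeric.
count_sem_by_owner() {
    awk '$2 ~ /^[0-9]+$/ { n[$3]++ } END { for (u in n) print u, n[u] }' | sort
}

# Typical use on a compute node (Linux only):
#   ipcs -s | count_sem_by_owner
#   cat /proc/sys/kernel/sem   # SEMMSL SEMMNS SEMOPM SEMMNI
```

The fourth value in `/proc/sys/kernel/sem` (SEMMNI) is the maximum number of semaphore arrays on the system. If the total count is close to it, leftover semaphores are a plausible cause.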
3. Possible solution
Remove leftover semaphores and memory segments that no longer have processes associated with them.
- 3.1 System-wide level
- Run a cron job that checks for unused semaphores and memory segments and releases them.
- This can be done daily. The advantage is consistent removal of unused objects for all users. See example scripts
- 3.2 User level
- A user can run commands to release semaphores soon after discovering that a program has crashed. A user can remove only objects that were created by that user's own processes.
- If your job crashed, execute the following commands on all worker nodes used by the job to release shared memory segments, semaphores and message queues. Deletion of a semaphore or a message queue is immediate, even if a process is still using the object; a shared memory segment is deleted only after all processes detach from it. Make sure that you have no other jobs running when you use these commands:
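# Column 3 of the ipcs output is the owner, column 2 is the object id.
for id in `ipcs -m | awk -v u="$LOGNAME" '$3 == u { print $2 }'`; do ipcrm -m $id; done
for id in `ipcs -s | awk -v u="$LOGNAME" '$3 == u { print $2 }'`; do ipcrm -s $id; done
for id in `ipcs -q | awk -v u="$LOGNAME" '$3 == u { print $2 }'`; do ipcrm -q $id; done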
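The cron-based cleanup described in 3.1 could start from something like the sketch below. It is not the example script referred to above, and it covers shared memory only: it treats a segment with no attached processes (nattch == 0) as unused. That is an assumption, since a program may create a segment and attach to it later, so the function only prints what it would remove unless `DRY_RUN=0` is set. The function names and the `DRY_RUN` variable are illustrative, and the Linux `ipcs -m` layout (key, shmid, owner, perms, bytes, nattch, status) is assumed.

```shell
# Print ids of shared memory segments with no attached processes
# (nattch == 0) from `ipcs -m` style output on stdin.
# An optional first argument restricts the result to one owner.
orphaned_shm_ids() {
    awk -v u="${1:-}" \
        '$2 ~ /^[0-9]+$/ && $6 == 0 && (u == "" || $3 == u) { print $2 }'
}

# Remove the segments found. Dry run by default; set DRY_RUN=0 to delete.
remove_orphaned_shm() {
    for id in $(ipcs -m | orphaned_shm_ids "$@"); do
        if [ "${DRY_RUN:-1}" = 0 ]; then
            ipcrm -m "$id"
        else
            echo "would remove shm $id"
        fi
    done
}
```

Run as root from a daily cron job, `remove_orphaned_shm` would cover all users. Run by an ordinary user as `remove_orphaned_shm "$LOGNAME"`, it is limited to that user's own segments. Semaphores and message queues have no attach count in the `ipcs` output, so they cannot be filtered this way.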
