[RegCNET] MPI problem

James Ciarlò james.ciarlo at physics.org
Fri Aug 5 12:57:23 CEST 2016


Hi everyone,

It took me a while, but I could spare a few moments to run a test and see how
the new model performs on our cluster. The test_001 case worked, but when I
tried the dust scenario it did not. I didn't hit the same issue you
experienced, but I got a different error.

Has anyone else tested the dust scenarios? If I get some time I'll try
running more chemistry tests, but I cannot right now.


This is the error that I got:
*****************************************************************
qrsh_starter: cannot write pid file /tmp/1074778.1.all.q/pid.1.node046: No
such file or directory
qrsh_starter: cannot open file /tmp/1074778.1.all.q/qrsh_error: No such
file or directory
qrsh_starter: cannot open file /tmp/1074778.1.all.q/qrsh_exit_code.1.node046:
No such file or directory
qrsh_starter: cannot open file /tmp/1074778.1.all.q/qrsh_error: No such
file or directory
--------------------------------------------------------------------------
A daemon (pid 9098) died unexpectedly with status 129 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
        node034.cm.cluster - daemon did not report back when launched
        node042.cm.cluster - daemon did not report back when launched
        node046.cm.cluster - daemon did not report back when launched
*****************************************************************
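
If the daemons really are failing to find shared libraries, the fix the
message itself suggests would look something like the lines below (the NetCDF
path here is only a placeholder, not taken from our setup; -x is Open MPI's
flag for forwarding an environment variable to the remote nodes):

export LD_LIBRARY_PATH=/opt/netcdf/lib:$LD_LIBRARY_PATH  # placeholder path
mpirun -x LD_LIBRARY_PATH -n 24 bin/regcmMPI test.in

The qrsh_starter lines may also mean the Grid Engine scratch directory
/tmp/1074778.1.all.q vanished on the node, so checking /tmp on node046
could be worthwhile as well.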

Best regards,
James


On 8 June 2016 at 13:30, Thanos Tsikerdekis <atsikerdekis at gmail.com> wrote:

> Hello RegCM community,
>
> I am trying to run the latest version of RegCM-4.5.0 on my university
> cluster, but I have had some problems running it in parallel with MPI: the
> experiment starts, and each core runs the same experiment serially. With
> the current stable version (RegCM-4.4.5.11), or even earlier test versions
> (RegCM-4.4-rc30), the parallel runs work without any problem.
>
> I configured RegCM with:
> ./configure CC=/opt/intel-13/bin/icc FC=/opt/intel-13/bin/ifort
> --with-netcdf=/opt/netcdf-4.2.1.1-intel-13/
>
> I am using MPICH 3.0.2 to run in parallel:
> /opt/mpich-3.0.2-intel-13/bin/mpirun -n 24 -host node007 bin/regcmMPI
> test.in
>
> Is anyone else having similar issues?
>
> Best wishes,
> Thanos
>
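
A note for anyone hitting the same symptom (this is not from the original
thread): when mpirun starts N copies that each run the same experiment
serially, a common cause is that the binary was linked against one MPI
implementation but launched with another's mpirun, so every process sees a
communicator of size 1. A quick check, reusing the paths quoted above:

ldd bin/regcmMPI | grep -i mpi   # which MPI library is the binary linked to?
/opt/mpich-3.0.2-intel-13/bin/mpirun -n 24 -host node007 bin/regcmMPI test.in

The mpirun used for the launch should come from the same installation as the
MPI library that ldd reports.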