Quick comments: do you or the administrator set a wall-clock time limit for your jobs on the machine(s)? All three logs end after roughly 53-54 hours of elapsed time (see the timing line at the end of each crash), which is what a wall-time limit would produce. Can the model be restarted after the crash?
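If a wall-clock limit is indeed what kills the job, the usual workaround is to split the simulation into segments that each finish well inside the limit and to restart every segment from the end of the previous one. Below is a minimal sketch of that bookkeeping only; the start/end dates, the 60-day segment length, and the note about editing regcm.in and resubmitting with mpirun are illustrative assumptions, not RegCM3's actual restart interface, so adapt them to your own setup.

# Minimal sketch: split a long run into restartable segments, assuming
# a ~54 h wall-clock limit is what kills the job after ~90 model days.
from datetime import datetime, timedelta

FMT = "%Y%m%d%H"                      # RegCM-style YYYYMMDDHH dates
start = datetime(1998, 10, 1, 0)      # illustrative start, consistent with the log dates
end = datetime(1999, 10, 1, 0)        # illustrative end of the full run
segment = timedelta(days=60)          # comfortably under the ~90 days that currently fit

t = start
while t < end:
    t_next = min(t + segment, end)
    restart = (t != start)            # cold start only for the first segment
    # In practice: set the restart flag and the segment dates in regcm.in,
    # then resubmit the parallel job (e.g. mpirun -np 4 ...) for this segment.
    print(f"segment {t.strftime(FMT)} -> {t_next.strftime(FMT)}  restart={restart}")
    t = t_next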
----- Original Message -----
Sent: Tuesday, May 22, 2007 4:43 PM
Subject: [RegCNET] Model Crashes after regular interval
Dear RegCNET Users,

I am running the parallel version of RegCM3 on 4 processors. My model crashes after every 90 days. The error log for the 3 crashes, which occur at 90 days, 180 days and 270 days, is attached. I did the same simulation on a different machine using a different compiler, but I got the same error. I am running at dx = 10 km with dt = 30 s.

Thanks in advance,
Imran
******************************* First Crash *******************************
BCs are ready from 1998123006 to 1998123012
at day = 90.2601, ktau = 259950 : 1st, 2nd time deriv of ps = 0.81562E-05 0.64871E-07, no. of points w/convection = 0
at day = 90.2774, ktau = 260000 : 1st, 2nd time deriv of ps = 0.10552E-04 0.65019E-07, no. of points w/convection = 0
at day = 90.2948, ktau = 260050 : 1st, 2nd time deriv of ps = 0.12071E-04 0.74096E-07, no. of points w/convection = 0
[cli_2]: aborting job:
Fatal error in MPI_Sendrecv: Other MPI error, error stack:
MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960, scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20, rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD, status=0x7ee380) failed
MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(428):
MPIDI_EagerContigIsend(512)...............: failure occurred while allocating memory for a request object
rank 2 in job 9 imp9_52929 caused collective abort of all ranks
exit status of rank 2: killed by signal 9
36.314u 5.760s 53:58:42.21 0.0% 0+0k 0+0io 0pf+0w
****************************** 2nd Crash ******************************
BCs are ready from 1999033006 to 1999033012
at day = 180.2635, ktau = 254200 : 1st, 2nd time deriv of ps = 0.84223E-05 0.10490E-06, no. of points w/convection = 61
at day = 180.2809, ktau = 254250 : 1st, 2nd time deriv of ps = 0.11980E-04 0.95740E-07, no. of points w/convection = 84
[cli_2]: aborting job:
Fatal error in MPI_Sendrecv: Other MPI error, error stack:
MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960, scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20, rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD, status=0x7ee380) failed
MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(428):
MPIDI_EagerContigIsend(512)...............: failure occurred while allocating memory for a request object
rank 2 in job 10 imp9_52929 caused collective abort of all ranks
exit status of rank 2: killed by signal 9
33.842u 5.688s 53:02:37.63 0.0% 0+0k 0+0io 431pf+0w
****************************** 3rd Crash ******************************
Writing rad fields at ktau = 513360 1999062806
BCs are ready from 1999062806 to 1999062812
at day = 270.2635, ktau = 513400 : 1st, 2nd time deriv of ps = 0.10755E-04 0.17164E-06, no. of points w/convection = 1532
at day = 270.2809, ktau = 513450 : 1st, 2nd time deriv of ps = 0.12644E-04 0.20978E-06, no. of points w/convection = 2103
[cli_2]: aborting job:
Fatal error in MPI_Sendrecv: Other MPI error, error stack:
MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960, scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20, rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD, status=0x7ee380) failed
MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(428):
MPIDI_EagerContigIsend(512)...............: failure occurred while allocating memory for a request object
rank 2 in job 14 imp9_52929 caused collective abort of all ranks
exit status of rank 2: killed by signal 9
34.102u 5.336s 53:42:44.95 0.0% 0+0k 0+0io 0pf+0w
**************************************
_______________________________________________
RegCNET mailing list
RegCNET@lists.ictp.it
https://lists.ictp.it/mailman/listinfo/regcnet