[RegCNET] Model Crashes after regular interval
IMRAN NADEEM
qphoton at gmail.com
Tue May 22 11:00:45 CEST 2007
Hi Gao,
I don't know whether the administrator has set a wall-time limit, but I have
not set one for any job myself.
Yes, I can restart the model after the crash.
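Since the runs restart cleanly, one stop-gap is to drive the simulation in
segments from a wrapper script (a minimal sketch only; MODEL_CMD, the segment
count, and the restart handling below are placeholders, not my real RegCM3
setup):

```shell
# Hypothetical wrapper: re-launch the model in fixed-length segments so each
# run stays under the ~90-day crash point. MODEL_CMD stands in for the real
# launch line (e.g. "mpirun -np 4 ./regcm regcm.in"); restarting from the
# model's SAV files between segments is assumed, not shown.
MODEL_CMD="true"          # placeholder command that always succeeds here
SEGMENTS=4                # e.g. four ~90-day chunks for a one-year run
i=0
while [ "$i" -lt "$SEGMENTS" ]; do
    if $MODEL_CMD; then
        echo "segment $i completed"
    else
        echo "segment $i failed; would restart from the last SAV file" >&2
    fi
    i=$((i + 1))
done
echo "finished $i segments"
```

In a real run MODEL_CMD would be the mpirun line, and the restart date would
be advanced between segments through the model's restart files.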
Regards
Imran
On 5/22/07, gaoxj at cma.gov.cn <gaoxj at cma.gov.cn> wrote:
>
> Hi Imran,
>
> Quick comments: did you or the administrator set a wall-time limit for your
> job on the computer(s)? Can the model be restarted after the crash?
>
> Gao
>
>
>
>
> ----- Original Message -----
> *From:* IMRAN NADEEM <qphoton at gmail.com>
> *To:* regcnet at lists.ictp.it
> *Sent:* Tuesday, May 22, 2007 4:43 PM
> *Subject:* [RegCNET] Model Crashes after regular interval
>
>
> Dear RegCNET Users,
>
> I am running the parallel version of RegCM3 on 4 processors. The model
> crashes after every 90 days. The error log for 3 crashes, occurring at
> 90 days, 180 days and 270 days, is attached below. I ran the same
> simulation on a different machine with a different compiler and got the
> same error. I am running at dx=10 km with dt=30.
>
> Thanks in advance
> Imran
>
> *************************** First Crash ***************************
> BCs are ready from 1998123006 to 1998123012
> at day = 90.2601, ktau = 259950 : 1st, 2nd time deriv of ps =
> 0.81562E-05 0.64871E-07, no. of points w/convection = 0
> at day = 90.2774, ktau = 260000 : 1st, 2nd time deriv of ps =
> 0.10552E-04 0.65019E-07, no. of points w/convection = 0
> at day = 90.2948, ktau = 260050 : 1st, 2nd time deriv of ps =
> 0.12071E-04 0.74096E-07, no. of points w/convection = 0
> [cli_2]: aborting job:
> Fatal error in MPI_Sendrecv: Other MPI error, error stack:
> MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960,
> scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20,
> rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD,
> status=0x7ee380) failed
> MPIDI_CH3_Progress_wait(212)..............: an error occurred while
> handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(428):
> MPIDI_EagerContigIsend(512)...............: failure occurred while
> allocating memory for a request object
> rank 2 in job 9 imp9_52929 caused collective abort of all ranks
> exit status of rank 2: killed by signal 9
> 36.314u 5.760s 53:58:42.21 0.0% 0+0k 0+0io 0pf+0w
>
> *************************** 2nd Crash ***************************
> BCs are ready from 1999033006 to 1999033012
> at day = 180.2635, ktau = 254200 : 1st, 2nd time deriv of ps =
> 0.84223E-05 0.10490E-06, no. of points w/convection = 61
> at day = 180.2809, ktau = 254250 : 1st, 2nd time deriv of ps =
> 0.11980E-04 0.95740E-07, no. of points w/convection = 84
> [cli_2]: aborting job:
> Fatal error in MPI_Sendrecv: Other MPI error, error stack:
> MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960,
> scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20,
> rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD,
> status=0x7ee380) failed
> MPIDI_CH3_Progress_wait(212)..............: an error occurred while
> handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(428):
> MPIDI_EagerContigIsend(512)...............: failure occurred while
> allocating memory for a request object
> rank 2 in job 10 imp9_52929 caused collective abort of all ranks
> exit status of rank 2: killed by signal 9
> 33.842u 5.688s 53:02:37.63 0.0% 0+0k 0+0io 431pf+0w
> *************************** 3rd Crash ***************************
> Writing rad fields at ktau = 513360 1999062806
> BCs are ready from 1999062806 to 1999062812
> at day = 270.2635, ktau = 513400 : 1st, 2nd time deriv of ps =
> 0.10755E-04 0.17164E-06, no. of points w/convection = 1532
> at day = 270.2809, ktau = 513450 : 1st, 2nd time deriv of ps =
> 0.12644E-04 0.20978E-06, no. of points w/convection = 2103
> [cli_2]: aborting job:
> Fatal error in MPI_Sendrecv: Other MPI error, error stack:
> MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960,
> scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20,
> rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD,
> status=0x7ee380) failed
> MPIDI_CH3_Progress_wait(212)..............: an error occurred while
> handling an event returned by MPIDU_Sock_Wait()
> MPIDI_CH3I_Progress_handle_sock_event(428):
> MPIDI_EagerContigIsend(512)...............: failure occurred while
> allocating memory for a request object
> rank 2 in job 14 imp9_52929 caused collective abort of all ranks
> exit status of rank 2: killed by signal 9
> 34.102u 5.336s 53:42:44.95 0.0% 0+0k 0+0io 0pf+0w
> **************************************
>
>
> _______________________________________________
> RegCNET mailing list
> RegCNET at lists.ictp.it
> https://lists.ictp.it/mailman/listinfo/regcnet
>
>
--
Imran Nadeem
PhD Student
Institute of Meteorology
Department of Water, Atmosphere and Environment
Uni. of Natural Resources and Applied Life Sciences (BOKU)
Peter-Jordan Strasse 82
1190 Vienna, Austria
Mobile: +43 699 1194 3044
Tel.: +43 1 47654 5614
Fax: +43 1 47654 5610