[RegCNET] Model Crashes after regular interval

XUNQIANG BI bixq at ictp.it
Wed May 23 09:10:31 CEST 2007


Hi, Imran:

It's clear now: your domain has probably reached the limit of your 4-CPU
cluster. Just as it is almost impossible to run a 1000x1000 domain
on a single CPU (except on a vector supercomputer), a 4-CPU cluster
also has its limits. Possible solutions to your problem:

1. Hardware upgrade
    See whether it is possible to enlarge the cluster's memory.
    See whether you can add more CPUs to the cluster.
    Reinstall Linux and give the system more swap space.

2. RegCM configuration
    Split the simulation into multiple two-month restart runs.
    Instead of running at 10 km, run at 12 km over the same area.
    See whether you can shrink the domain by 4 grid points on each side
    (a rough memory estimate is sketched below).
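
As a rough illustration of the memory argument (a minimal sketch with assumed
field counts and vertical levels, not RegCM3's actual allocation), per-process
memory for double-precision 3-D fields scales with nx*ny*nz, so even a modest
trim of the domain helps:

    #include <stdio.h>

    /* Back-of-envelope memory estimate for a set of 3-D model fields.
     * nz and nfields below are illustrative assumptions, not RegCM3's
     * real numbers. */
    static double field_mem_gb(int nx, int ny, int nz, int nfields)
    {
        /* double precision: 8 bytes per grid point */
        return (double)nx * ny * nz * nfields * 8.0
               / (1024.0 * 1024.0 * 1024.0);
    }

    int main(void)
    {
        int nz = 18, nfields = 20;   /* assumed values */
        printf("300x300 grid: %.2f GB\n",
               field_mem_gb(300, 300, nz, nfields));
        /* 4 grid points trimmed on each side: 292x292 */
        printf("292x292 grid: %.2f GB\n",
               field_mem_gb(292, 292, nz, nfields));
        return 0;
    }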

I hope the above suggestions help,
Xunqiang Bi

On Tue, 22 May 2007, IMRAN NADEEM wrote:

> Hi Bi and Gao,
>
> I have checked: there is no wall-time limit set by the administrator. Also, I
> am using a timestep (dt = 30) that is exactly three times the horizontal
> resolution (10 km). If this were computational instability, why would it
> occur so regularly (after every 3 months)? A serial run of this case is not
> possible right away, because it would take more than a week to reach the 90th
> day, where the model crashes. From the part of the error message that reads
> "failure occurred while allocating memory for a request object", I suspect
> this problem is related to memory (see the sketch after this message).
>
> Regards
> Nadeem
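
The failing call in the attached logs is MPICH's eager send path. As a
hypothetical illustration (plain MPI in C, not RegCM3 code), the following
pattern shows how a lagging receiver makes the library's internal
request/buffer memory grow until an allocation like the one in the error
stack fails:

    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical illustration, not RegCM3 code.  Small messages go out
     * via MPICH's eager protocol: each one is buffered at the receiver,
     * together with an internal request object, until a matching receive
     * is posted.  If one rank falls behind, that internal memory keeps
     * growing -- the same kind of allocation that fails in
     * MPIDI_EagerContigIsend in the logs. */
    enum { NMSG = 100000, COUNT = 184 };  /* COUNT matches the crash log */

    int main(int argc, char **argv)
    {
        int rank, size, i;
        double buf[COUNT];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2) {
            if (rank == 0) fprintf(stderr, "needs at least 2 ranks\n");
            MPI_Finalize();
            return 1;
        }

        for (i = 0; i < COUNT; i++) buf[i] = 0.0;
        if (rank == 0) {
            /* flood rank 1 with small eager messages */
            for (i = 0; i < NMSG; i++)
                MPI_Send(buf, COUNT, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD);
        } else if (rank == 1) {
            sleep(5);  /* lagging receiver: unexpected-message queue grows */
            for (i = 0; i < NMSG; i++)
                MPI_Recv(buf, COUNT, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }
        MPI_Finalize();
        return 0;
    }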
>
> On 5/22/07, XUNQIANG BI <bixq at ictp.it> wrote:
>> 
>> 
>>  Hi, Imran:
>> 
>>  Can you get exactly the same results between serial and parallel runs?
>> 
>>  If yes, then it's easy to check where the problem is. Run the serial
>>  job for the same case to see whether it crashes or not.
>> 
>>  My guess is that the problem is a computational instability, i.e. a
>>  timestep problem (a quick check is sketched below).
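
For reference, a minimal sketch of the timestep check implied here, in C,
assuming a fastest signal speed of 300 m/s (roughly external gravity waves;
an assumption for illustration, not RegCM3's actual stability criterion):

    #include <stdio.h>

    /* Rough 1-D CFL check (a sketch, not RegCM3's real criterion):
     * the timestep must satisfy dt < dx / c, where c is the fastest
     * signal speed.  c = 300 m/s is an assumed upper bound; it
     * reproduces the usual rule of thumb dt(s) ~ 3 x dx(km). */
    int main(void)
    {
        double dx = 10000.0;  /* grid spacing in metres (10 km) */
        double c  = 300.0;    /* assumed fastest signal speed, m/s */
        double dt = 30.0;     /* timestep used in the run, seconds */
        double dt_max = dx / c;

        printf("max stable dt ~ %.1f s, chosen dt = %.1f s -> %s\n",
               dt_max, dt, dt < dt_max ? "stable" : "check dt");
        return 0;
    }

With these assumed numbers the chosen dt = 30 s sits just under the limit,
consistent with the 3 x dx rule of thumb mentioned in the thread.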
>> 
>>  Regards,
>>  Bi
>> 
>>  On Tue, 22 May 2007, IMRAN NADEEM wrote:
>> 
>> >  Hi Gao,
>> > 
>> >  I don't know about any wall-time set by the administrator, but I have
>> >  not set a wall-time for any job myself.
>> >  Yes, I can restart the model after a crash.
>> > 
>> >  Regards
>> >  Imran
>> > 
>> >  On 5/22/07, gaoxj at cma.gov.cn <gaoxj at cma.gov.cn> wrote:
>> > > 
>> > >    Hi Imran,
>> > > 
>> > >   Quick comments: did you or the administrator set a wall-time limit
>> > >   for your job on the computer(s)? Can the model be restarted after
>> > >   the crash?
>> > > 
>> > >   Gao
>> > > 
>> > > 
>> > > 
>> > > 
>> > >   ----- Original Message -----
>> > >   *From:* IMRAN NADEEM <qphoton at gmail.com>
>> > >   *To:* regcnet at lists.ictp.it
>> > >   *Sent:* Tuesday, May 22, 2007 4:43 PM
>> > >   *Subject:* [RegCNET] Model Crashes after regular interval
>> > > 
>> > > 
>> > >   Dear RegCNET Users,
>> > > 
>> > >   I am running the parallel version of RegCM3 on 4 processors. My model
>> > >   crashes after every 90 days. The error log for the 3 crashes, which
>> > >   occur at 90 days, 180 days, and 270 days, is attached. I did the same
>> > >   simulation on a different machine with a different compiler and got
>> > >   the same error. I am running at dx = 10 km with dt = 30.
>> > > 
>> > >   Thanks in advance
>> > >   Imran
>> > > 
>> > >   ***************************** First Crash *****************************
>> > >   BCs are ready from   1998123006   to   1998123012
>> > >       at day =   90.2601, ktau =     259950 :  1st, 2nd time deriv of ps =
>> > >   0.81562E-05 0.64871E-07,  no. of points w/convection =    0
>> > >       at day =   90.2774, ktau =     260000 :  1st, 2nd time deriv of ps =
>> > >   0.10552E-04 0.65019E-07,  no. of points w/convection =    0
>> > >       at day =   90.2948, ktau =     260050 :  1st, 2nd time deriv of ps =
>> > >   0.12071E-04 0.74096E-07,  no. of points w/convection =    0
>> > >   [cli_2]: aborting job:
>> > >   Fatal error in MPI_Sendrecv: Other MPI error, error stack:
>> > >   MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960,
>> > >   scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20,
>> > >   rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD,
>> > >   status=0x7ee380) failed
>> > >   MPIDI_CH3_Progress_wait(212)..............: an error occurred while
>> > >   handling an event returned by MPIDU_Sock_Wait()
>> > >   MPIDI_CH3I_Progress_handle_sock_event(428):
>> > >   MPIDI_EagerContigIsend(512)...............: failure occurred while
>> > >   allocating memory for a request object
>> > >   rank 2 in job 9  imp9_52929   caused collective abort of all ranks
>> > >   exit status of rank 2: killed by signal 9
>> > >   36.314u 5.760s 53:58:42.21 0.0% 0+0k 0+0io 0pf+0w
>> > > 
>> > >   ****************************** 2nd Crash ******************************
>> > >   BCs are ready from   1999033006   to   1999033012
>> > >       at day =  180.2635, ktau =     254200 :  1st, 2nd time deriv of ps =
>> > >   0.84223E-05 0.10490E-06,  no. of points w/convection =   61
>> > >       at day =  180.2809, ktau =     254250 :  1st, 2nd time deriv of ps =
>> > >   0.11980E-04 0.95740E-07,  no. of points w/convection =   84
>> > >   [cli_2]: aborting job:
>> > >   Fatal error in MPI_Sendrecv: Other MPI error, error stack:
>> > >   MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960,
>> > >   scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20,
>> > >   rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD,
>> > >   status=0x7ee380) failed
>> > >   MPIDI_CH3_Progress_wait(212)..............: an error occurred while
>> > >   handling an event returned by MPIDU_Sock_Wait()
>> > >   MPIDI_CH3I_Progress_handle_sock_event(428):
>> > >   MPIDI_EagerContigIsend(512)...............: failure occurred while
>> > >   allocating memory for a request object
>> > >   rank 2 in job 10  imp9_52929   caused collective abort of all ranks
>> > >   exit status of rank 2: killed by signal 9
>> > >   33.842u 5.688s 53:02:37.63 0.0% 0+0k 0+0io 431pf+0w
>> > >   ****************************** 3rd Crash ******************************
>> > >    Writing rad fields at ktau =       513360  1999062806
>> > >    BCs are ready from   1999062806   to   1999062812
>> > >       at day =  270.2635, ktau =     513400 :  1st, 2nd time deriv of ps =
>> > >   0.10755E-04 0.17164E-06,  no. of points w/convection = 1532
>> > >       at day =  270.2809, ktau =     513450 :  1st, 2nd time deriv of ps =
>> > >   0.12644E-04 0.20978E-06,  no. of points w/convection = 2103
>> > >   [cli_2]: aborting job:
>> > >   Fatal error in MPI_Sendrecv: Other MPI error, error stack:
>> > >   MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960,
>> > >   scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20,
>> > >   rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD,
>> > >   status=0x7ee380) failed
>> > >   MPIDI_CH3_Progress_wait(212)..............: an error occurred while
>> > >   handling an event returned by MPIDU_Sock_Wait()
>> > >   MPIDI_CH3I_Progress_handle_sock_event(428):
>> > >   MPIDI_EagerContigIsend(512)...............: failure occurred while
>> > >   allocating memory for a request object
>> > >   rank 2 in job 14  imp9_52929   caused collective abort of all ranks
>> > >   exit status of rank 2: killed by signal 9
>> > >   34.102u 5.336s 53:42:44.95 0.0% 0+0k 0+0io 0pf+0w
>> > >   **************************************
>> > > 
>> > > 
>> > 
>> > 
>> > 
>> 
>> 
>
>
>
> -- 
> Imran Nadeem
> PhD Student
> Institute of Meteorology
> Department of Water, Atmosphere and Environment
> Uni. of Natural Resources and Applied Life Sciences (BOKU)
>
> Peter-Jordan Strasse 82
> 1190 Vienna, Austria
>
> Mobile: +43 699 1194 3044
> Tel.: +43 1 47654 5614
> Fax: +43 1 47654 5610
>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   Dr. Xunqiang Bi         email: bixq at ictp.it
   Earth System Physics Group
   The Abdus Salam ICTP
   Strada Costiera, 11
   P.O. BOX 586, 34100 Trieste, ITALY
   Tel: +39-040-2240302  Fax: +39-040-2240449
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


