[RegCNET] Model Crashes after regular interval

IMRAN NADEEM qphoton at gmail.com
Wed May 23 17:26:40 CEST 2007


Hi Bi, Moetasim and Gao,

Thanks to all of you for the discussion of my problem. I have 15 GB of
physical memory (and 15 GB of swap), which should be more than enough for my
domain of 184 x 164 grid points. Anyway, I am working on it and will email
the group if I find a solution.

Regards
Imran
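For reference, the memory claim above is easy to sanity-check with a rough
back-of-the-envelope estimate. The vertical-level and field counts below are
assumptions for illustration, not values given anywhere in this thread:

```python
# Rough per-process memory estimate for a 184 x 164 RegCM3 domain.
# ASSUMED (not from the thread): 18 vertical levels, ~50 double-precision
# 3D fields resident at once, plus ~100 2D fields.

def estimate_memory_gb(nx, ny, nz=18, n3d_fields=50, n2d_fields=100,
                       bytes_per_val=8):
    """Return an order-of-magnitude memory estimate in GB."""
    vals_3d = nx * ny * nz * n3d_fields
    vals_2d = nx * ny * n2d_fields
    return (vals_3d + vals_2d) * bytes_per_val / 1024**3

print(estimate_memory_gb(184, 164))  # ~0.22 GB, far below 15 GB
```

Even with generous field counts the grid arrays stay well under 1 GB, which
is consistent with the suspicion that the crash is not a simple
out-of-memory condition of the model state itself.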




On 5/23/07, XUNQIANG BI <bixq at ictp.it> wrote:
>
>
> Hi, Imran:
>
> It's clear now: your domain has probably reached the limit of your 4-CPU
> cluster. Just as it is almost impossible to run a 1000x1000 domain on a
> single CPU (except on a vector supercomputer), a 4-CPU cluster also has
> its limits. Possible solutions for your problem:
>
> 1. Hardware upgrade
>     see if it is possible to enlarge the cluster's memory;
>     see if you can add more CPUs to the cluster;
>     reinstall Linux and give the system more swap space.
>
> 2. RegCM configuration
>     it seems that you could do multiple two-month restart runs;
>     instead of running at 10 km, run at 12 km over the same area;
>     see if you can reduce the domain size by 4 grid points on each side.
>
> Hope the above suggestions help,
> Xunqiang Bi
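Bi's suggestion of chained two-month restart runs amounts to splitting the
simulation period into segments, each starting from the previous segment's
restart file. A minimal sketch of the segment arithmetic only (the actual
RegCM invocation and namelist handling are site-specific and omitted):

```python
# Split a simulation period into ~two-month restart segments, as in
# Bi's suggestion above.  Assumes a start day that exists in every
# month (e.g. the 1st).

from datetime import date

def restart_segments(start, end, months=2):
    """Yield (segment_start, segment_end) pairs covering [start, end]."""
    seg_start = start
    while seg_start < end:
        m = seg_start.month - 1 + months
        seg_end = min(date(seg_start.year + m // 12, m % 12 + 1,
                           seg_start.day), end)
        yield seg_start, seg_end
        seg_start = seg_end  # next run restarts where this one stopped

for s, e in restart_segments(date(1998, 10, 1), date(1999, 10, 1)):
    print(s, "->", e)
```

Each pair would become one model run started from the restart file written
at the end of the previous pair.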
>
> On Tue, 22 May 2007, IMRAN NADEEM wrote:
>
> > Hi Bi and Gao,
> >
> > I have checked: there is no wall-time set by the administrator. Also I
> > am using a timestep (dt = 30) which is exactly three times the horizontal
> > resolution (10 km). If it is computational instability, why does it occur
> > so regularly (after every 3 months)? It is not possible to make a serial
> > run for this case immediately, because it would take more than a week to
> > reach the 90th day at which the model crashes. From the part of the
> > error message reading "failure occurred while allocating memory for a
> > request object", I guess this problem may be related to memory.
> >
> > Regards
> > Nadeem
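The dt/dx relation Nadeem applies above (timestep in seconds about three
times the grid spacing in kilometres, so dx = 10 km gives dt = 30 s) is the
commonly quoted RegCM rule of thumb; a trivial sketch, treating the factor
of 3 as that guideline rather than a hard stability bound:

```python
# Rule-of-thumb timestep for RegCM-style runs: dt (seconds) ~ 3 x dx (km).
# The factor 3 is the guideline applied in the message above.

def rule_of_thumb_dt(dx_km, factor=3.0):
    """Recommended timestep in seconds for grid spacing dx_km."""
    return factor * dx_km

print(rule_of_thumb_dt(10))  # 30.0, matching the dt = 30 used here
```

So the chosen timestep is consistent with the guideline, which supports the
argument that the crash is not a plain timestep problem.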
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > On 5/22/07, XUNQIANG BI <bixq at ictp.it> wrote:
> >>
> >>
> >>  Hi, Imran:
> >>
> >>  Do you get exactly the same results between serial and parallel runs?
> >>
> >>  If yes, then it is easy to check where the problem is: run the serial
> >>  job for the same case to see whether it crashes or not.
> >>
> >>  I guess the problem is a computational instability, i.e. a timestep
> >>  problem.
> >>
> >>  Regards,
> >>  Bi
> >>
> >>  On Tue, 22 May 2007, IMRAN NADEEM wrote:
> >>
> >> >  Hi Gao,
> >> >
> >> >  I don't know about any wall-time set by the administrator, but I have
> >> >  not set a wall-time for any job.
> >> >  Yes, I can restart the model after the crash.
> >> >
> >> >  Regards
> >> >  Imran
> >> >
> >> >  On 5/22/07, gaoxj at cma.gov.cn <gaoxj at cma.gov.cn> wrote:
> >> > >
> >> > >    Hi Imran,
> >> > >
> >> > >   Quick comments: do you or the administrator set a wall-time for
> your
> >>  job
> >> > >   in the computer(s)? Can the model be restarted after the crash?
> >> > >
> >> > >   Gao
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >   ----- Original Message -----
> >> > >   *From:* IMRAN NADEEM <qphoton at gmail.com>
> >> > >   *To:* regcnet at lists.ictp.it
> >> > >   *Sent:* Tuesday, May 22, 2007 4:43 PM
> >> > >   *Subject:* [RegCNET] Model Crashes after regular interval
> >> > >
> >> > >
> >> > >   Dear RegCNET Users,
> >> > >
> >> > >   I am running the parallel version of RegCM3 on 4 processors. My
> >> > >   model crashes after every 90 days. The error logs for the 3 crashes,
> >> > >   which occur at 90 days, 180 days and 270 days, are attached. I did
> >> > >   the same simulation on a different machine using a different
> >> > >   compiler, but I got the same error. I am running at dx = 10 km with
> >> > >   dt = 30.
> >> > >
> >> > >   Thanks in advance
> >> > >   Imran
> >> > >
> >> > >   *******************************First Crash*******************************************
> >> > >   BCs are ready from   1998123006   to   1998123012
> >> > >       at day =   90.2601, ktau =     259950 :  1st, 2nd time deriv of ps = 0.81562E-05 0.64871E-07,  no. of points w/convection =    0
> >> > >       at day =   90.2774, ktau =     260000 :  1st, 2nd time deriv of ps = 0.10552E-04 0.65019E-07,  no. of points w/convection =    0
> >> > >       at day =   90.2948, ktau =     260050 :  1st, 2nd time deriv of ps = 0.12071E-04 0.74096E-07,  no. of points w/convection =    0
> >> > >   [cli_2]: aborting job:
> >> > >   Fatal error in MPI_Sendrecv: Other MPI error, error stack:
> >> > >   MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960, scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20, rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD, status=0x7ee380) failed
> >> > >   MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> >> > >   MPIDI_CH3I_Progress_handle_sock_event(428):
> >> > >   MPIDI_EagerContigIsend(512)...............: failure occurred while allocating memory for a request object
> >> > >   rank 2 in job 9  imp9_52929   caused collective abort of all ranks
> >> > >   exit status of rank 2: killed by signal 9
> >> > >   36.314u 5.760s 53:58:42.21 0.0% 0+0k 0+0io 0pf+0w
> >> > >
> >> > >   ******************************2nd Crash****************************************
> >> > >   BCs are ready from   1999033006   to   1999033012
> >> > >       at day =  180.2635, ktau =     254200 :  1st, 2nd time deriv of ps = 0.84223E-05 0.10490E-06,  no. of points w/convection =   61
> >> > >       at day =  180.2809, ktau =     254250 :  1st, 2nd time deriv of ps = 0.11980E-04 0.95740E-07,  no. of points w/convection =   84
> >> > >   [cli_2]: aborting job:
> >> > >   Fatal error in MPI_Sendrecv: Other MPI error, error stack:
> >> > >   MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960, scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20, rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD, status=0x7ee380) failed
> >> > >   MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> >> > >   MPIDI_CH3I_Progress_handle_sock_event(428):
> >> > >   MPIDI_EagerContigIsend(512)...............: failure occurred while allocating memory for a request object
> >> > >   rank 2 in job 10  imp9_52929   caused collective abort of all ranks
> >> > >   exit status of rank 2: killed by signal 9
> >> > >   33.842u 5.688s 53:02:37.63 0.0% 0+0k 0+0io 431pf+0w
> >> > >   ******************************************3rd Crash************************************
> >> > >    Writing rad fields at ktau =       513360  1999062806
> >> > >    BCs are ready from   1999062806   to   1999062812
> >> > >       at day =  270.2635, ktau =     513400 :  1st, 2nd time deriv of ps = 0.10755E-04 0.17164E-06,  no. of points w/convection = 1532
> >> > >       at day =  270.2809, ktau =     513450 :  1st, 2nd time deriv of ps = 0.12644E-04 0.20978E-06,  no. of points w/convection = 2103
> >> > >   [cli_2]: aborting job:
> >> > >   Fatal error in MPI_Sendrecv: Other MPI error, error stack:
> >> > >   MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960, scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20, rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD, status=0x7ee380) failed
> >> > >   MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> >> > >   MPIDI_CH3I_Progress_handle_sock_event(428):
> >> > >   MPIDI_EagerContigIsend(512)...............: failure occurred while allocating memory for a request object
> >> > >   rank 2 in job 14  imp9_52929   caused collective abort of all ranks
> >> > >   exit status of rank 2: killed by signal 9
> >> > >   34.102u 5.336s 53:42:44.95 0.0% 0+0k 0+0io 0pf+0w
> >> > >   **************************************
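The suspicious regularity Nadeem points out can be confirmed directly from
the three logs above: the last reported "at day =" values before each abort
(90.2948, 180.2809 and 270.2809) are spaced almost exactly 90 days apart.
A short sketch of that check:

```python
# Pull the last "at day = X" value out of a crash log and check the
# spacing between the three crashes reported above.

import re

def last_day_reported(log_text):
    """Return the last 'at day =' value appearing in one crash log."""
    days = re.findall(r"at day =\s*([0-9.]+)", log_text)
    return float(days[-1])

sample = """    at day =  270.2635, ktau = 513400
    at day =  270.2809, ktau = 513450"""
print(last_day_reported(sample))  # 270.2809

# Last day from each of the three logs above:
last_days = [90.2948, 180.2809, 270.2809]
gaps = [b - a for a, b in zip(last_days, last_days[1:])]
print(all(abs(g - 90) < 0.5 for g in gaps))  # True
```

Gaps that uniform argue for a resource that fills up at a fixed rate (or a
periodic event in the run) rather than a weather-dependent instability.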
> >> > >
> >> > >   ------------------------------
> >> > >
> >> > >   _______________________________________________
> >> > >   RegCNET mailing list
> >> > >   RegCNET at lists.ictp.it
> >> > >   https://lists.ictp.it/mailman/listinfo/regcnet
> >> > >
> >> > >
> >> >
> >> >
> >> >
> >>
> >>
> >
> >
> >
> >
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>    Dr. Xunqiang Bi         email:bixq at ictp.it
>    Earth System Physics Group
>    The Abdus Salam ICTP
>    Strada Costiera, 11
>    P.O. BOX 586, 34100 Trieste, ITALY
>    Tel: +39-040-2240302  Fax: +39-040-2240449
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>



-- 
Imran Nadeem
PhD Student
Institute of Meteorology
Department of Water, Atmosphere and Environment
Uni. of Natural Resources and Applied Life Sciences (BOKU)

Peter-Jordan Strasse 82
1190 Vienna, Austria

Mobile: +43 699 1194 3044
Tel.: +43 1 47654 5614
Fax: +43 1 47654 5610