[RegCNET] Model Crashes after regular interval
Moetasim
mashfaq at purdue.edu
Tue May 22 19:10:44 CEST 2007
Hi,
Well, it appears that this is neither a walltime nor a time-step problem. It is
less likely, but the model may be crashing because of a memory bug in the
code, for example an array dimensioned smaller than what the simulation actually
requires. In our own experience, a model with such a memory bug can run under
some MPICH versions and fail under others.
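To illustrate the kind of bug I mean, here is a minimal, generic C sketch (not
RegCM3 code): writing one element past the declared size of an array is
undefined behaviour, so the same executable can appear to run fine in one
environment and be killed in another, which matches what we have seen across
MPICH versions.

#include <stdio.h>

#define NPOINTS 100

int main(void)
{
    double field[NPOINTS];

    /* Off-by-one: the last iteration writes field[NPOINTS], one element past
       the end of the array. Whether this goes unnoticed, corrupts other data,
       or kills the process depends entirely on the environment. */
    for (int i = 0; i <= NPOINTS; i++)
        field[i] = 0.0;

    printf("loop finished without an obvious error\n");
    return 0;
}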
I would suggest using a debugger to isolate the place in the code that causes
the crash. You may also try a different MPICH version.
Apart from this, a 30 sec time step for a 10 km run is no guarantee that the
model will not crash. At high resolutions you may need a time step smaller than
3 x ds to keep the computation stable. In all three cases your crash point
coincides with conditions that are relatively more unstable than at the
preceding time steps, so also give it a try with a smaller time step, e.g. 20 sec.
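If it helps, here is a minimal sketch of that arithmetic (generic C, not
anything taken from the RegCM3 source; the factor of 3 and the 20 sec fallback
are just the numbers discussed in this thread):

#include <stdio.h>

int main(void)
{
    double dx_km  = 10.0;           /* horizontal grid spacing of the run, km */
    double dt_max = 3.0 * dx_km;    /* rule-of-thumb upper bound, in seconds  */

    printf("rule-of-thumb dt <= %.0f s for dx = %.0f km\n", dt_max, dx_km);
    /* If the run is still marginally unstable at 30 s, step down, e.g. 20 s. */
    return 0;
}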
Also, Bi made the right suggestion to try the serial version as well. If you
really have a memory problem, I would expect it to crash the model in serial
mode too; if it does not, you are at least one step closer to fixing the
problem.
Good luck,
Moet
Quoting IMRAN NADEEM <qphoton at gmail.com>:
> Hi Bi and Gao,
>
> I have checked: there is no wall-time limit set by the administrator. I am
> also using a time step (dt = 30 s) which is exactly three times the horizontal
> resolution (10 km). If it were computational instability, why would it occur so
> regularly (after every 3 months)? It is not possible to make a serial run for
> this case immediately, because it would take more than a week to reach the 90th
> day, at which the model crashes. From the part of the
> error message "failure occurred while allocating memory for a request
> object", I guess this problem may be related to memory.
>
> Regards
> Nadeem
>
> On 5/22/07, XUNQIANG BI <bixq at ictp.it> wrote:
> >
> >
> > Hi, Imran:
> >
> > Can you get exactly the same results between serial and parallel?
> >
> > If yes, then it's easy to check where the problem is. Run the serial
> > job for the same case to see whether it crashes or not.
> >
> > My guess is that the problem is again a computational instability, i.e. a
> > time-step problem.
> >
> > Regards,
> > Bi
> >
> > On Tue, 22 May 2007, IMRAN NADEEM wrote:
> >
> > > Hi Gao,
> > >
> > > I don't know about any wall-time limit set by the administrator, but I have
> > > not set any wall-time for any job myself.
> > > Yes, I can restart the model after the crash.
> > >
> > > Regards
> > > Imran
> > >
> > > On 5/22/07, gaoxj at cma.gov.cn <gaoxj at cma.gov.cn> wrote:
> > >>
> > >> Hi Imran,
> > >>
> > >> Quick comments: have you or the administrator set a wall-time limit for
> > >> your job on the computer(s)? Can the model be restarted after the crash?
> > >>
> > >> Gao
> > >>
> > >>
> > >>
> > >>
> > >> ----- Original Message -----
> > >> *From:* IMRAN NADEEM <qphoton at gmail.com>
> > >> *To:* regcnet at lists.ictp.it
> > >> *Sent:* Tuesday, May 22, 2007 4:43 PM
> > >> *Subject:* [RegCNET] Model Crashes after regular interval
> > >>
> > >>
> > >> Dear RegCNET Users,
> > >>
> > >> I am running the parallel version of RegCM3 on 4 processors. My model
> > >> crashes after every 90 days. The error logs for the 3 crashes, which
> > >> occur at 90, 180 and 270 days, are attached. I did the same simulation
> > >> on a different machine using a different compiler, but I got the same
> > >> error. I am running at dx = 10 km with dt = 30 s.
> > >>
> > >> Thanks in advance
> > >> Imran
> > >>
> > >> *******************************First Crash*******************************************
> > >> BCs are ready from 1998123006 to 1998123012
> > >> at day = 90.2601, ktau = 259950 : 1st, 2nd time deriv of ps = 0.81562E-05 0.64871E-07, no. of points w/convection = 0
> > >> at day = 90.2774, ktau = 260000 : 1st, 2nd time deriv of ps = 0.10552E-04 0.65019E-07, no. of points w/convection = 0
> > >> at day = 90.2948, ktau = 260050 : 1st, 2nd time deriv of ps = 0.12071E-04 0.74096E-07, no. of points w/convection = 0
> > >> [cli_2]: aborting job:
> > >> Fatal error in MPI_Sendrecv: Other MPI error, error stack:
> > >> MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960, scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20, rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD, status=0x7ee380) failed
> > >> MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> > >> MPIDI_CH3I_Progress_handle_sock_event(428):
> > >> MPIDI_EagerContigIsend(512)...............: failure occurred while allocating memory for a request object
> > >> rank 2 in job 9 imp9_52929 caused collective abort of all ranks
> > >> exit status of rank 2: killed by signal 9
> > >> 36.314u 5.760s 53:58:42.21 0.0% 0+0k 0+0io 0pf+0w
> > >>
> > >> ******************************2nd Crash****************************************
> > >> BCs are ready from 1999033006 to 1999033012
> > >> at day = 180.2635, ktau = 254200 : 1st, 2nd time deriv of ps = 0.84223E-05 0.10490E-06, no. of points w/convection = 61
> > >> at day = 180.2809, ktau = 254250 : 1st, 2nd time deriv of ps = 0.11980E-04 0.95740E-07, no. of points w/convection = 84
> > >> [cli_2]: aborting job:
> > >> Fatal error in MPI_Sendrecv: Other MPI error, error stack:
> > >> MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960, scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20, rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD, status=0x7ee380) failed
> > >> MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> > >> MPIDI_CH3I_Progress_handle_sock_event(428):
> > >> MPIDI_EagerContigIsend(512)...............: failure occurred while allocating memory for a request object
> > >> rank 2 in job 10 imp9_52929 caused collective abort of all ranks
> > >> exit status of rank 2: killed by signal 9
> > >> 33.842u 5.688s 53:02:37.63 0.0% 0+0k 0+0io 431pf+0w
> > >> ******************************************3rd Crash************************************
> > >> Writing rad fields at ktau = 513360 1999062806
> > >> BCs are ready from 1999062806 to 1999062812
> > >> at day = 270.2635, ktau = 513400 : 1st, 2nd time deriv of ps = 0.10755E-04 0.17164E-06, no. of points w/convection = 1532
> > >> at day = 270.2809, ktau = 513450 : 1st, 2nd time deriv of ps = 0.12644E-04 0.20978E-06, no. of points w/convection = 2103
> > >> [cli_2]: aborting job:
> > >> Fatal error in MPI_Sendrecv: Other MPI error, error stack:
> > >> MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960, scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20, rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD, status=0x7ee380) failed
> > >> MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> > >> MPIDI_CH3I_Progress_handle_sock_event(428):
> > >> MPIDI_EagerContigIsend(512)...............: failure occurred while allocating memory for a request object
> > >> rank 2 in job 14 imp9_52929 caused collective abort of all ranks
> > >> exit status of rank 2: killed by signal 9
> > >> 34.102u 5.336s 53:42:44.95 0.0% 0+0k 0+0io 0pf+0w
> > >> **************************************
> > >>
> > >> ------------------------------
> > >>
> > >> _______________________________________________
> > >> RegCNET mailing list
> > >> RegCNET at lists.ictp.it
> > >> https://lists.ictp.it/mailman/listinfo/regcnet
> > >>
> > >>
> > >
> > >
> > > --
> > > Imran Nadeem
> > > PhD Student
> > > Institute of Meteorology
> > > Department of Water, Atmosphere and Environment
> > > Uni. of Natural Resources and Applied Life Sciences (BOKU)
> > >
> > > Peter-Jordan Strasse 82
> > > 1190 Vienna, Austria
> > >
> > > Mobile: +43 699 1194 3044
> > > Tel.: +43 1 47654 5614
> > > Fax: +43 1 47654 5610
> > >
> >
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > Dr. Xunqiang Bi email:bixq at ictp.it
> > Earth System Physics Group
> > The Abdus Salam ICTP
> > Strada Costiera, 11
> > P.O. BOX 586, 34100 Trieste, ITALY
> > Tel: +39-040-2240302 Fax: +39-040-2240449
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
>
>
>
> --
> Imran Nadeem
> PhD Student
> Institute of Meteorology
> Department of Water, Atmosphere and Environment
> Uni. of Natural Resources and Applied Life Sciences (BOKU)
>
> Peter-Jordan Strasse 82
> 1190 Vienna, Austria
>
> Mobile: +43 699 1194 3044
> Tel.: +43 1 47654 5614
> Fax: +43 1 47654 5610
>