[RegCNET] Model Crashes after regular interval

Moetasim mashfaq at purdue.edu
Tue May 22 19:10:44 CEST 2007


Hi,

Well, it appears that this is neither a walltime nor a time step problem. 
Although it is less likely, the model may be crashing because of a memory bug 
in the code - for example, some array dimensioned smaller than what the 
simulation actually requires. From our own experience, though, we have seen 
that a model with such memory bugs can run fine with some mpich versions and 
fail with other versions of mpich.
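
Just to illustrate what I mean by an undersized array (a toy C sketch of my 
own, not RegCM code): writing past the end of an array produces no immediate 
error, it just quietly corrupts whatever happens to sit next to it in memory, 
and the crash then shows up much later in an unrelated place such as an MPI 
call. Your compiler's array bounds checking option should catch this kind of 
bug directly.

#include <stdio.h>

/* Deliberately buggy toy example: 'halo' holds 4 values but the loop
 * writes 6, so the last two stores land past the end of the array and
 * may overwrite whatever the compiler placed next to it (here possibly
 * 'field').  Nothing fails at this point; the damage surfaces later. */
int main(void)
{
    double halo[4];
    double field[8] = {0.0};
    int i;

    for (i = 0; i < 6; i++)   /* should be i < 4 */
        halo[i] = 1.0;

    printf("field[0] = %f  field[1] = %f\n", field[0], field[1]);
    return 0;
}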

I would suggest using a debugger to isolate the place in the code that causes 
the crash. You may also try another mpich version as well. 
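
For orientation, the call that fails in your error stack is an MPI_Sendrecv 
exchange. Below is a minimal, self-contained C sketch of that call pattern 
(my own toy example, not the model's actual halo exchange; the buffer size 
184 is only copied from the scount/rcount in your log). It may help to see 
which arguments - buffer addresses, counts, neighbour ranks - are worth 
inspecting in the debugger when the corresponding call in the model fails. 
The log shows MPI_DOUBLE_PRECISION because the model is Fortran; the C 
equivalent is MPI_DOUBLE.

#include <mpi.h>
#include <stdio.h>

#define NBUF 184   /* matches scount/rcount in the error message */

/* Ring-style exchange: every rank sends NBUF doubles to rank+1 while
 * receiving NBUF doubles from rank-1, analogous to the boundary
 * exchange reported in the error stack. */
int main(int argc, char **argv)
{
    int rank, size, dest, src, i;
    double sbuf[NBUF], rbuf[NBUF];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (i = 0; i < NBUF; i++)
        sbuf[i] = (double)rank;

    dest = (rank + 1) % size;          /* neighbour we send to      */
    src  = (rank - 1 + size) % size;   /* neighbour we receive from */

    MPI_Sendrecv(sbuf, NBUF, MPI_DOUBLE, dest, 2,
                 rbuf, NBUF, MPI_DOUBLE, src,  2,
                 MPI_COMM_WORLD, &status);

    printf("rank %d got %f from rank %d\n", rank, rbuf[0], src);
    MPI_Finalize();
    return 0;
}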

Apart from this, a 30 sec time step for a 10 km run is no guarantee that the 
model will not crash - at high resolutions you may need a time step smaller 
than 3xds to keep the computation stable. In all three cases the conditions at 
your crash point are relatively unstable compared with the time steps just 
before it. So also give a smaller time step a try, e.g. 20 sec. 
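
Just to spell out the arithmetic (a trivial sketch of mine, assuming the 
usual rule of thumb dt [sec] = 3 x grid spacing [km] that your current 
settings follow):

#include <stdio.h>

/* Rule-of-thumb time step: dt [s] ~ 3 * dx [km]; at high resolution a
 * more conservative value below that limit is often needed. */
int main(void)
{
    double dx_km  = 10.0;          /* horizontal grid spacing          */
    double dt_max = 3.0 * dx_km;   /* nominal limit: 30 sec            */
    double dt_try = 2.0 * dx_km;   /* more conservative retry: 20 sec  */

    printf("dx = %.0f km -> nominal dt = %.0f sec, retry with dt = %.0f sec\n",
           dx_km, dt_max, dt_try);
    return 0;
}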

Also, Bi suggested the right thing: try the serial version as well. I assume 
that if you really have a memory problem, it should also cause the model to 
crash in serial mode. If not, then at least you would be one step ahead in 
fixing the problem.

Good luck,

Moet   

Quoting IMRAN NADEEM <qphoton at gmail.com>:

> Hi Bi and Gao,
> 
> I have checked that there is no wall-time set by the administrator. Also I
> am using a timestep (dt = 30) which is exactly three times the horizontal
> resolution (10 km). If it were computational instability, why would it occur
> so regularly (after every 3 months)? It is not possible to make a serial run
> for this case right away, because it would take more than a week to reach the
> 90th day at which the model crashes. From the part of the error message
> "failure occurred while allocating memory for a request
> object" I guess this problem may be related to memory.
> 
> Regards
> Nadeem
> 
> On 5/22/07, XUNQIANG BI <bixq at ictp.it> wrote:
> >
> >
> > Hi, Imran:
> >
> > Can you get exactly the same results in serial and parallel?
> >
> > If yes, then it's easy to check where the problem is. Run the serial
> > job for the same case to see if it crashes or not.
> >
> > I guess the problem is again a computational instability, i.e. the
> > timestep problem.
> >
> > Regards,
> > Bi
> >
> > On Tue, 22 May 2007, IMRAN NADEEM wrote:
> >
> > > Hi Gao,
> > >
> > > I don't know about any wall-time set by the administrator, but I have not
> > > set any wall-time for any job.
> > > Yes, I can restart the model after the crash.
> > >
> > > Regards
> > > Imran
> > >
> > > On 5/22/07, gaoxj at cma.gov.cn <gaoxj at cma.gov.cn> wrote:
> > >>
> > >>   Hi Imran,
> > >>
> > >>  Quick comment: did you or the administrator set a wall-time for your
> > >>  job on the computer(s)? Can the model be restarted after the crash?
> > >>
> > >>  Gao
> > >>
> > >>
> > >>
> > >>
> > >>  ----- Original Message -----
> > >>  *From:* IMRAN NADEEM <qphoton at gmail.com>
> > >>  *To:* regcnet at lists.ictp.it
> > >>  *Sent:* Tuesday, May 22, 2007 4:43 PM
> > >>  *Subject:* [RegCNET] Model Crashes after regular interval
> > >>
> > >>
> > >>  Dear RegCNET Users,
> > >>
> > >>  I am running the parallel version of RegCM3 on 4 processors. My model
> > >>  crashes after every 90 days. The error log for the 3 crashes, which
> > >>  occur at 90 days, 180 days and 270 days, is attached. I did the same
> > >>  simulation on a different machine using a different compiler but got
> > >>  the same error. I am running at dx = 10 km with dt = 30.
> > >>
> > >>  Thanks in advance
> > >>  Imran
> > >>
> > >>  *******************************First Crash*******************************************
> > >>  BCs are ready from   1998123006   to   1998123012
> > >>      at day =   90.2601, ktau =     259950 :  1st, 2nd time deriv of ps =  0.81562E-05 0.64871E-07,  no. of points w/convection =    0
> > >>      at day =   90.2774, ktau =     260000 :  1st, 2nd time deriv of ps =  0.10552E-04 0.65019E-07,  no. of points w/convection =    0
> > >>      at day =   90.2948, ktau =     260050 :  1st, 2nd time deriv of ps =  0.12071E-04 0.74096E-07,  no. of points w/convection =    0
> > >>  [cli_2]: aborting job:
> > >>  Fatal error in MPI_Sendrecv: Other MPI error, error stack:
> > >>  MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960, scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20, rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD, status=0x7ee380) failed
> > >>  MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> > >>  MPIDI_CH3I_Progress_handle_sock_event(428):
> > >>  MPIDI_EagerContigIsend(512)...............: failure occurred while allocating memory for a request object
> > >>  rank 2 in job 9  imp9_52929   caused collective abort of all ranks
> > >>   exit status of rank 2: killed by signal 9
> > >>  36.314u 5.760s 53:58:42.21 0.0% 0+0k 0+0io 0pf+0w
> > >>
> > >>  ******************************2nd Crash****************************************
> > >>  BCs are ready from   1999033006   to   1999033012
> > >>      at day =  180.2635, ktau =     254200 :  1st, 2nd time deriv of ps =  0.84223E-05 0.10490E-06,  no. of points w/convection =   61
> > >>      at day =  180.2809, ktau =     254250 :  1st, 2nd time deriv of ps =  0.11980E-04 0.95740E-07,  no. of points w/convection =   84
> > >>  [cli_2]: aborting job:
> > >>  Fatal error in MPI_Sendrecv: Other MPI error, error stack:
> > >>  MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960, scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20, rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD, status=0x7ee380) failed
> > >>  MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> > >>  MPIDI_CH3I_Progress_handle_sock_event(428):
> > >>  MPIDI_EagerContigIsend(512)...............: failure occurred while allocating memory for a request object
> > >>  rank 2 in job 10  imp9_52929   caused collective abort of all ranks
> > >>   exit status of rank 2: killed by signal 9
> > >>  33.842u 5.688s 53:02:37.63 0.0% 0+0k 0+0io 431pf+0w
> > >>  ******************************************3rd Crash************************************
> > >>   Writing rad fields at ktau =       513360  1999062806
> > >>   BCs are ready from   1999062806   to   1999062812
> > >>      at day =  270.2635, ktau =     513400 :  1st, 2nd time deriv of ps =  0.10755E-04 0.17164E-06,  no. of points w/convection = 1532
> > >>      at day =  270.2809, ktau =     513450 :  1st, 2nd time deriv of ps =  0.12644E-04 0.20978E-06,  no. of points w/convection = 2103
> > >>  [cli_2]: aborting job:
> > >>  Fatal error in MPI_Sendrecv: Other MPI error, error stack:
> > >>  MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960, scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20, rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD, status=0x7ee380) failed
> > >>  MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
> > >>  MPIDI_CH3I_Progress_handle_sock_event(428):
> > >>  MPIDI_EagerContigIsend(512)...............: failure occurred while allocating memory for a request object
> > >>  rank 2 in job 14  imp9_52929   caused collective abort of all ranks
> > >>   exit status of rank 2: killed by signal 9
> > >>  34.102u 5.336s 53:42:44.95 0.0% 0+0k 0+0io 0pf+0w
> > >>  **************************************
> > >>
> > >>  ------------------------------
> > >>
> > >>  _______________________________________________
> > >>  RegCNET mailing list
> > >>  RegCNET at lists.ictp.it
> > >>  https://lists.ictp.it/mailman/listinfo/regcnet
> > >>
> > >>
> > >
> > >
> > > --
> > > Imran Nadeem
> > > PhD Student
> > > Institute of Meteorology
> > > Department of Water, Atmosphere and Environment
> > > Uni. of Natural Resources and Applied Life Sciences (BOKU)
> > >
> > > Peter-Jordan Strasse 82
> > > 1190 Vienna, Austria
> > >
> > > Mobile: +43 699 1194 3044
> > > Tel.: +43 1 47654 5614
> > > Fax: +43 1 47654 5610
> > >
> >
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >    Dr. Xunqiang Bi         email:bixq at ictp.it
> >    Earth System Physics Group
> >    The Abdus Salam ICTP
> >    Strada Costiera, 11
> >    P.O. BOX 586, 34100 Trieste, ITALY
> >    Tel: +39-040-2240302  Fax: +39-040-2240449
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> 
> 
> 
> -- 
> Imran Nadeem
> PhD Student
> Institute of Meteorology
> Department of Water, Atmosphere and Environment
> Uni. of Natural Resources and Applied Life Sciences (BOKU)
> 
> Peter-Jordan Strasse 82
> 1190 Vienna, Austria
> 
> Mobile: +43 699 1194 3044
> Tel.: +43 1 47654 5614
> Fax: +43 1 47654 5610
> 


