[RegCNET] Model Crashes after regular interval

gaoxj at cma.gov.cn gaoxj at cma.gov.cn
Tue May 22 10:52:32 CEST 2007


Hi Imran,

Quick comments: have you or the administrator set a wall-time limit for your job on the computer(s)? Can the model be restarted after the crash?
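
If it is not obvious whether such a limit exists, note that the elapsed times at the end of the three logs below are all roughly 53-54 hours, which is one reason a wall-time or per-process limit is worth ruling out. A small standalone test program (just a sketch, not part of RegCM3; it assumes a Unix-like cluster and builds with mpicc) can print the per-process limits each rank actually runs under. Batch-queue wall-clock limits are enforced separately by the scheduler, so those still have to be checked with the administrator.

/* limits_check.c -- hypothetical helper, not part of RegCM3.
 * Prints the CPU-time and memory limits visible to each MPI rank, so a
 * per-process limit (rather than the model itself) can be ruled in or out.
 * Build with: mpicc limits_check.c -o limits_check
 */
#include <mpi.h>
#include <stdio.h>
#include <sys/resource.h>

static void print_limit(int rank, const char *name, int resource)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0) {
        printf("rank %d: %s: getrlimit failed\n", rank, name);
        return;
    }
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("rank %d: %s soft limit: unlimited\n", rank, name);
    else
        printf("rank %d: %s soft limit: %llu\n", rank, name,
               (unsigned long long) rl.rlim_cur);
}

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    print_limit(rank, "CPU time (s)", RLIMIT_CPU);
    print_limit(rank, "address space (bytes)", RLIMIT_AS);
    print_limit(rank, "data segment (bytes)", RLIMIT_DATA);

    MPI_Finalize();
    return 0;
}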

Gao



  ----- Original Message ----- 
  From: IMRAN NADEEM 
  To: regcnet at lists.ictp.it 
  Sent: Tuesday, May 22, 2007 4:43 PM
  Subject: [RegCNET] Model Crashes after regular interval



  Dear RegCNET Users,
                   
  I am running the parallel version of RegCM3 on 4 processors. The model crashes after every 90 days of simulation. The error log for the
  3 crashes, which occur at 90, 180, and 270 days, is included below. I ran the same simulation on a different machine with a different
  compiler and got the same error. I am running at dx = 10 km with dt = 30 s.
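
  (As a quick consistency check, assuming dt is in seconds: one model day is 86400 / 30 = 2880 steps, so the ktau = 259950 reported just
  before the first crash corresponds to 259950 / 2880 = 90.26 days, exactly the model day printed in the log.)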

  Thanks in advance
  Imran

  *******************************First Crash*******************************************
  BCs are ready from   1998123006   to   1998123012 
       at day =   90.2601, ktau =     259950 :  1st, 2nd time deriv of ps =  0.81562E-05 0.64871E-07,  no. of points w/convection =    0
       at day =   90.2774, ktau =     260000 :  1st, 2nd time deriv of ps =  0.10552E-04 0.65019E-07,  no. of points w/convection =    0
       at day =   90.2948, ktau =     260050 :  1st, 2nd time deriv of ps =  0.12071E-04 0.74096E-07,  no. of points w/convection =    0
  [cli_2]: aborting job:
  Fatal error in MPI_Sendrecv: Other MPI error, error stack: 
  MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960, scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20, rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD, status=0x7ee380) failed 
  MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
  MPIDI_CH3I_Progress_handle_sock_event(428):
  MPIDI_EagerContigIsend(512)...............: failure occurred while allocating memory for a request object 
  rank 2 in job 9  imp9_52929   caused collective abort of all ranks
    exit status of rank 2: killed by signal 9
  36.314u 5.760s 53:58:42.21 0.0% 0+0k 0+0io 0pf+0w
   
  ******************************2nd Crash**************************************** 
  BCs are ready from   1999033006   to   1999033012
       at day =  180.2635, ktau =     254200 :  1st, 2nd time deriv of ps =  0.84223E-05 0.10490E-06,  no. of points w/convection =   61
       at day =  180.2809, ktau =     254250 :  1st, 2nd time deriv of ps =  0.11980E-04 0.95740E-07,  no. of points w/convection =   84
  [cli_2]: aborting job:
  Fatal error in MPI_Sendrecv: Other MPI error, error stack:
  MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960, scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20, rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD, status=0x7ee380) failed 
  MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
  MPIDI_CH3I_Progress_handle_sock_event(428):
  MPIDI_EagerContigIsend(512)...............: failure occurred while allocating memory for a request object 
  rank 2 in job 10  imp9_52929   caused collective abort of all ranks
    exit status of rank 2: killed by signal 9
  33.842u 5.688s 53:02:37.63 0.0% 0+0k 0+0io 431pf+0w
  ******************************************3rd Crash************************************ 
   Writing rad fields at ktau =       513360  1999062806
   BCs are ready from   1999062806   to   1999062812
       at day =  270.2635, ktau =     513400 :  1st, 2nd time deriv of ps =  0.10755E-04 0.17164E-06,  no. of points w/convection = 1532 
       at day =  270.2809, ktau =     513450 :  1st, 2nd time deriv of ps =  0.12644E-04 0.20978E-06,  no. of points w/convection = 2103
  [cli_2]: aborting job:
  Fatal error in MPI_Sendrecv: Other MPI error, error stack: 
  MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960, scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20, rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD, status=0x7ee380) failed 
  MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
  MPIDI_CH3I_Progress_handle_sock_event(428):
  MPIDI_EagerContigIsend(512)...............: failure occurred while allocating memory for a request object 
  rank 2 in job 14  imp9_52929   caused collective abort of all ranks
    exit status of rank 2: killed by signal 9
  34.102u 5.336s 53:42:44.95 0.0% 0+0k 0+0io 0pf+0w
  **************************************
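
  The failure point is identical in all three logs: MPICH cannot allocate memory for a request object inside MPIDI_EagerContigIsend, and the
  rank is then killed with signal 9. That pattern usually points at the process running out of memory (a slow leak, or the number of
  in-flight messages growing), with the SIGKILL coming from the kernel OOM killer, a batch-system limit, or the process manager tearing the
  job down. A minimal way to watch for steady memory growth over a long run, assuming Linux and any MPI implementation (a standalone sketch,
  not RegCM3 code), is to log each rank's resident set size at regular intervals:

  /* rss_probe.c -- hypothetical diagnostic, not part of RegCM3.
   * Logs the resident set size (VmRSS) of each MPI rank so that slow memory
   * growth over a long run shows up in the output before the job dies.
   * Build with: mpicc rss_probe.c -o rss_probe
   */
  #include <mpi.h>
  #include <stdio.h>
  #include <string.h>

  /* Return VmRSS in kB, or -1 if it cannot be read. */
  static long rss_kb(void)
  {
      FILE *f = fopen("/proc/self/status", "r");
      char line[256];
      long kb = -1;

      if (!f)
          return -1;
      while (fgets(line, sizeof(line), f)) {
          if (strncmp(line, "VmRSS:", 6) == 0) {
              sscanf(line + 6, "%ld", &kb);
              break;
          }
      }
      fclose(f);
      return kb;
  }

  int main(int argc, char **argv)
  {
      int rank, step;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Stand-in for the model time loop: report memory every few steps. */
      for (step = 0; step < 100; step++) {
          if (step % 10 == 0)
              printf("rank %d, step %d: VmRSS = %ld kB\n",
                     rank, step, rss_kb());
          MPI_Barrier(MPI_COMM_WORLD);   /* keep ranks roughly in phase */
      }

      MPI_Finalize();
      return 0;
  }

  If the per-rank memory use climbs steadily over the simulated days, a crash at the same model interval on two different machines and
  compilers would be consistent with a memory leak rather than a machine-specific problem.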


