Hi Bi and Gao,
I have checked that there is no wall-time limit set by the administrator. I am also using a timestep (dt = 30 s) that is exactly three times the horizontal resolution (dx = 10 km). If this were a computational instability, why would it occur so regularly, after every 3 months? It is not possible to do a serial run for this case right away, because it would take more than a week to reach the 90th day, where the model crashes. From the part of the error message that reads "failure occurred while allocating memory for a request object", I suspect the problem may be related to memory.
Regards
Nadeem
On 5/22/07, XUNQIANG BI <bixq@ictp.it> wrote:
Hi, Imran:
Can you get exactly the same results between serial and parallel runs?
If yes, then it is easy to check where the problem is. Run the serial
job for the same case and see whether it crashes or not.
My guess is that the problem is also a computational instability, i.e. a
timestep problem.
Regards,
Bi
On Tue, 22 May 2007, IMRAN NADEEM wrote:
> Hi Gao,
>
> I don't know of any wall-time limit set by the administrator, and I have not
> set a wall-time for any job myself.
> Yes, I can restart the model after the crash.
>
> Regards
> Imran
>
> On 5/22/07, gaoxj@cma.gov.cn <gaoxj@cma.gov.cn> wrote:
>>
>> Hi Imran,
>>
>> Quick comments: did you or the administrator set a wall-time limit for your job
>> on the computer(s)? Can the model be restarted after the crash?
>>
>> Gao
>>
>>
>>
>>
>> ----- Original Message -----
>> From: IMRAN NADEEM <qphoton@gmail.com>
>> To: regcnet@lists.ictp.it
>> Sent: Tuesday, May 22, 2007 4:43 PM
>> Subject: [RegCNET] Model Crashes after regular interval
>>
>>
>> Dear RegCNET Users,
>>
>> I am running the parallel version of RegCM3 on 4 processors. The model
>> crashes after every 90 days of simulation. The error logs for the 3 crashes,
>> which occur at 90, 180 and 270 days, are attached. I ran the same simulation
>> on a different machine with a different compiler and got the same error.
>> I am running at dx = 10 km with dt = 30 s.
>>
>> Thanks in advance
>> Imran
>>
>> ******************************* First Crash *******************************
>> BCs are ready from 1998123006 to 1998123012
>> at day = 90.2601, ktau = 259950 : 1st, 2nd time deriv of ps =
>> 0.81562E-05 0.64871E-07, no. of points w/convection = 0
>> at day = 90.2774, ktau = 260000 : 1st, 2nd time deriv of ps =
>> 0.10552E-04 0.65019E-07, no. of points w/convection = 0
>> at day = 90.2948, ktau = 260050 : 1st, 2nd time deriv of ps =
>> 0.12071E-04 0.74096E-07, no. of points w/convection = 0
>> [cli_2]: aborting job:
>> Fatal error in MPI_Sendrecv: Other MPI error, error stack:
>> MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960,
>> scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20,
>> rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD,
>> status=0x7ee380) failed
>> MPIDI_CH3_Progress_wait(212)..............: an error occurred while
>> handling an event returned by MPIDU_Sock_Wait()
>> MPIDI_CH3I_Progress_handle_sock_event(428):
>> MPIDI_EagerContigIsend(512)...............: failure occurred while
>> allocating memory for a request object
>> rank 2 in job 9 imp9_52929 caused collective abort of all ranks
>> exit status of rank 2: killed by signal 9
>> 36.314u 5.760s 53:58:42.21 0.0% 0+0k 0+0io 0pf+0w
>>
>> ******************************* 2nd Crash *******************************
>> BCs are ready from 1999033006 to 1999033012
>> at day = 180.2635, ktau = 254200 : 1st, 2nd time deriv of ps =
>> 0.84223E-05 0.10490E-06, no. of points w/convection = 61
>> at day = 180.2809, ktau = 254250 : 1st, 2nd time deriv of ps =
>> 0.11980E-04 0.95740E-07, no. of points w/convection = 84
>> [cli_2]: aborting job:
>> Fatal error in MPI_Sendrecv: Other MPI error, error stack:
>> MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960,
>> scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20,
>> rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD,
>> status=0x7ee380) failed
>> MPIDI_CH3_Progress_wait(212)..............: an error occurred while
>> handling an event returned by MPIDU_Sock_Wait()
>> MPIDI_CH3I_Progress_handle_sock_event(428):
>> MPIDI_EagerContigIsend(512)...............: failure occurred while
>> allocating memory for a request object
>> rank 2 in job 10 imp9_52929 caused collective abort of all ranks
>> exit status of rank 2: killed by signal 9
>> 33.842u 5.688s 53:02:37.63 0.0% 0+0k 0+0io 431pf+0w
>> ******************************* 3rd Crash *******************************
>> Writing rad fields at ktau = 513360 1999062806
>> BCs are ready from 1999062806 to 1999062812
>> at day = 270.2635 , ktau = 513400 : 1st, 2nd time deriv of ps =
>> 0.10755E-04 0.17164E-06, no. of points w/convection = 1532
>> at day = 270.2809, ktau = 513450 : 1st, 2nd time deriv of ps =
>> 0.12644E-04 0.20978E-06, no. of points w/convection = 2103
>> [cli_2]: aborting job:
>> Fatal error in MPI_Sendrecv: Other MPI error, error stack:
>> MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960,
>> scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20,
>> rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD,
>> status=0x7ee380) failed
>> MPIDI_CH3_Progress_wait(212)..............: an error occurred while
>> handling an event returned by MPIDU_Sock_Wait()
>> MPIDI_CH3I_Progress_handle_sock_event(428):
>> MPIDI_EagerContigIsend(512)...............: failure occurred while
>> allocating memory for a request object
>> rank 2 in job 14 imp9_52929 caused collective abort of all ranks
>> exit status of rank 2: killed by signal 9
>> 34.102u 5.336s 53:42:44.95 0.0% 0+0k 0+0io 0pf+0w
>> **************************************
>>
>>
>>
>
>
> --
> Imran Nadeem
> PhD Student
> Institute of Meteorology
> Department of Water, Atmosphere and Environment
> Uni. of Natural Resources and Applied Life Sciences (BOKU)
>
> Peter-Jordan Strasse 82
> 1190 Vienna, Austria
>
> Mobile: +43 699 1194 3044
> Tel.: +43 1 47654 5614
> Fax: +43 1 47654 5610
>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dr. Xunqiang Bi    email: bixq@ictp.it
Earth System Physics Group
The Abdus Salam ICTP
Strada Costiera, 11
P.O. BOX 586, 34100 Trieste, ITALY
Tel: +39-040-2240302 Fax: +39-040-2240449
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~