Hi Bi and Gao,
I have checked that there is no wall-time limit set by the administrator. I am also using a timestep (dt = 30 s) that is exactly three times the horizontal resolution (dx = 10 km). If this were a computational instability, why would it occur so regularly, after every 3 months? It is not possible to do a serial run for this case right away, because it would take more than a week to reach the 90th day, where the model crashes. From the part of the error message that reads "failure occurred while allocating memory for a request object", I suspect the problem may be related to memory.
Regards
Nadeem
On 5/22/07, XUNQIANG BI <bixq@ictp.it> wrote:
Hi, Imran:
Can you get exactly the same results between serial and parallel runs?
If yes, then it is easy to check where the problem is. Run the serial
job for the same case and see whether it crashes or not.
My guess is that the problem is also a computational instability, i.e. a
timestep problem.
Regards,
Bi
On Tue, 22 May 2007, IMRAN NADEEM wrote:
> Hi Gao,
>
> I don't know of any wall-time limit set by the administrator, and I have not
> set a wall-time for any job myself.
> Yes, I can restart the model after the crash.
>
> Regards
> Imran
>
> On 5/22/07, gaoxj@cma.gov.cn <gaoxj@cma.gov.cn> wrote:
>>
>> Hi Imran,
>>
>> Quick comments: did you or the administrator set a wall-time limit for your job
>> on the computer(s)? Can the model be restarted after the crash?
>>
>> Gao
>>
>>
>>
>>
>> ----- Original Message -----
>> From: IMRAN NADEEM <qphoton@gmail.com>
>> To: regcnet@lists.ictp.it
>> Sent: Tuesday, May 22, 2007 4:43 PM
>> Subject: [RegCNET] Model Crashes after regular interval
>>
>>
>> Dear RegCNET Users,
>>
>> I am running the parallel version of RegCM3 on 4 processors. The model
>> crashes after every 90 days of simulation. The error logs for the 3 crashes,
>> which occur at 90, 180 and 270 days, are attached. I ran the same simulation
>> on a different machine with a different compiler and got the same error.
>> I am running at dx = 10 km with dt = 30 s.
>>
>> Thanks in advance
>> Imran
>>
>> ******************************* First Crash *******************************
>> BCs are ready from 1998123006 to 1998123012
>> at day = 90.2601, ktau = 259950 : 1st, 2nd time deriv of ps =
>> 0.81562E-05 0.64871E-07, no. of points w/convection = 0
>> at day = 90.2774, ktau = 260000 : 1st, 2nd time deriv of ps =
>> 0.10552E-04 0.65019E-07, no. of points w/convection = 0
>> at day = 90.2948, ktau = 260050 : 1st, 2nd time deriv of ps =
>> 0.12071E-04 0.74096E-07, no. of points w/convection = 0
>> [cli_2]: aborting job:
>> Fatal error in MPI_Sendrecv: Other MPI error, error stack:
>> MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960,
>> scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20,
>> rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD,
>> status=0x7ee380) failed
>> MPIDI_CH3_Progress_wait(212)..............: an error occurred while
>> handling an event returned by MPIDU_Sock_Wait()
>> MPIDI_CH3I_Progress_handle_sock_event(428):
>> MPIDI_EagerContigIsend(512)...............: failure occurred while
>> allocating memory for a request object
>> rank 2 in job 9 imp9_52929 caused collective abort of all ranks
>> exit status of rank 2: killed by signal 9
>> 36.314u 5.760s 53:58:42.21 0.0% 0+0k 0+0io 0pf+0w
>>
>> ******************************* 2nd Crash *******************************
>> BCs are ready from 1999033006 to 1999033012
>> at day = 180.2635, ktau = 254200 : 1st, 2nd time deriv of ps =
>> 0.84223E-05 0.10490E-06, no. of points w/convection = 61
>> at day = 180.2809, ktau = 254250 : 1st, 2nd time deriv of ps =
>> 0.11980E-04 0.95740E-07, no. of points w/convection = 84
>> [cli_2]: aborting job:
>> Fatal error in MPI_Sendrecv: Other MPI error, error stack:
>> MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960,
>> scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20,
>> rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD,
>> status=0x7ee380) failed
>> MPIDI_CH3_Progress_wait(212)..............: an error occurred while
>> handling an event returned by MPIDU_Sock_Wait()
>> MPIDI_CH3I_Progress_handle_sock_event(428):
>> MPIDI_EagerContigIsend(512)...............: failure occurred while
>> allocating memory for a request object
>> rank 2 in job 10 imp9_52929 caused collective abort of all ranks
>> exit status of rank 2: killed by signal 9
>> 33.842u 5.688s 53:02:37.63 0.0% 0+0k 0+0io 431pf+0w
>> ******************************* 3rd Crash *******************************
>> Writing rad fields at ktau = 513360 1999062806
>> BCs are ready from 1999062806 to 1999062812
>> at day = 270.2635 , ktau = 513400 : 1st, 2nd time deriv of ps =
>> 0.10755E-04 0.17164E-06, no. of points w/convection = 1532
>> at day = 270.2809, ktau = 513450 : 1st, 2nd time deriv of ps =
>> 0.12644E-04 0.20978E-06, no. of points w/convection = 2103
>> [cli_2]: aborting job:
>> Fatal error in MPI_Sendrecv: Other MPI error, error stack:
>> MPI_Sendrecv(217).........................: MPI_Sendrecv(sbuf=0x4080c960,
>> scount=184, MPI_DOUBLE_PRECISION, dest=1, stag=2, rbuf=0x4080cf20,
>> rcount=184, MPI_DOUBLE_PRECISION, src=3, rtag=2, MPI_COMM_WORLD,
>> status=0x7ee380) failed
>> MPIDI_CH3_Progress_wait(212)..............: an error occurred while
>> handling an event returned by MPIDU_Sock_Wait()
>> MPIDI_CH3I_Progress_handle_sock_event(428):
>> MPIDI_EagerContigIsend(512)...............: failure occurred while
>> allocating memory for a request object
>> rank 2 in job 14 imp9_52929 caused collective abort of all ranks
>> exit status of rank 2: killed by signal 9
>> 34.102u 5.336s 53:42:44.95 0.0% 0+0k 0+0io 0pf+0w
>> **************************************
>>
>>
>>
>
>
> --
> Imran Nadeem
> PhD Student
> Institute of Meteorology
> Department of Water, Atmosphere and Environment
> Uni. of Natural Resources and Applied Life Sciences (BOKU)
>
> Peter-Jordan Strasse 82
> 1190 Vienna, Austria
>
> Mobile: +43 699 1194 3044
> Tel.: +43 1 47654 5614
> Fax: +43 1 47654 5610
>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dr. Xunqiang Bi    email: bixq@ictp.it
Earth System Physics Group
The Abdus Salam ICTP
Strada Costiera, 11
P.O. BOX 586, 34100 Trieste, ITALY
Tel: +39-040-2240302 Fax: +39-040-2240449
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~