[RegCNET] Processor limit?
Alexander Bryan
ambrya at umich.edu
Tue Nov 3 04:07:47 CET 2015
Dear RegCNeters,
RegCM-4.4.5.4 seg-faults when I run the model with more than 85 processors.
The following error lines repeat as many times as there are processors
selected:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
regcmMPICLM 0000000001325431 Unknown Unknown Unknown
regcmMPICLM 0000000001323B87 Unknown Unknown Unknown
regcmMPICLM 00000000012D3124 Unknown Unknown Unknown
regcmMPICLM 00000000012D2F36 Unknown Unknown Unknown
regcmMPICLM 000000000126875F Unknown Unknown Unknown
regcmMPICLM 000000000127152D Unknown Unknown Unknown
libpthread.so.0 00007FFFF58B3790 Unknown Unknown Unknown
libmpi.so.12 00007FFFF6158BBC Unknown Unknown Unknown
libmpi.so.12 00007FFFF6143FFF Unknown Unknown Unknown
libmpi.so.12 00007FFFF5FAA6E1 Unknown Unknown Unknown
libmpi.so.12 00007FFFF6134563 Unknown Unknown Unknown
libmpi.so.12 00007FFFF60F67F0 Unknown Unknown Unknown
libmpi.so.12 00007FFFF60E9B74 Unknown Unknown Unknown
libmpifort.so.12 00007FFFF6648160 Unknown Unknown Unknown
regcmMPICLM 000000000074E2B0 Unknown Unknown Unknown
regcmMPICLM 0000000000420B8E Unknown Unknown Unknown
libc.so.6 00007FFFF552ED5D Unknown Unknown Unknown
regcmMPICLM 0000000000420A99 Unknown Unknown Unknown
I am running the model on a single node with 256 cores (32 sockets of 8
cores) at 12-km res (ds) over a relatively small domain (Puerto Rico) with
the following settings:
iy = 64
jx = 80
kz = 18
I attempted several runs with processor counts ranging from 86 to 192, all
of which produced the above message once per processor. I made sure to
select processor counts whose factors divide evenly into the iy and jx
values above and that meet the minimum of a 3x3 box per processor, e.g.,
128 yields:
CPUS DIM1 = 16, divided into 80 (jx) = 5 (> 3)
CPUS DIM2 = 8, divided into 64 (iy) = 8 (> 3)
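For reference, the check above can be sketched in a few lines of Python. This is only my reading of the 3x3 rule, not RegCM's actual decomposition code: the function name `decompositions` and the `min_side` parameter are hypothetical, and I assume the model splits the processors into a DIM1 x DIM2 grid with DIM1 applied to jx and DIM2 to iy.

```python
def decompositions(nproc, jx, iy, min_side=3):
    """List the DIM1 x DIM2 factorizations of nproc that leave every
    processor at least min_side points along each direction.

    Assumption: DIM1 divides jx and DIM2 divides iy, as in the 128-processor
    example above. The model's real algorithm may choose differently.
    """
    results = []
    for dim1 in range(1, nproc + 1):
        if nproc % dim1 != 0:
            continue  # dim1 is not a factor of nproc
        dim2 = nproc // dim1
        pts_j = jx // dim1  # points per processor along jx
        pts_i = iy // dim2  # points per processor along iy
        if pts_j >= min_side and pts_i >= min_side:
            results.append((dim1, dim2, pts_j, pts_i))
    return results


# The 128-processor case from this message (jx = 80, iy = 64)
print(decompositions(128, jx=80, iy=64))
# -> [(8, 16, 10, 4), (16, 8, 5, 8)]
```

Under these assumptions, the 16 x 8 split quoted above is one of two layouts that satisfy the minimum, so the processor count by itself does not appear to violate the rule.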
Any ideas? Is this an issue with how I set up the model or the single-node
system I'm using?
On a related note, the documentation (version 4.4) states.....
"In the current version 4.4 the model parallelizes execution dividing the
work between the processors, with the minimum work per processor is 9
points or a box 3 * 3, so the maximum number of processors which can be
used in a parallel run for the above configuration [iy=34, jx=48] is
roughly 180."
I arrive at 180(ish) by multiplying 34 * 48 and dividing by 9. However,
when I do the same for my domain above (64 * 80 / 9), I get 569. Am I
missing something?
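To make the arithmetic explicit, here is a small Python sketch of two ways I can read the documentation's limit. Both are my interpretation rather than anything taken from the model source, and the function names are hypothetical: one divides the total number of grid points by 9, the other counts how many whole 3x3 boxes fit in the domain.

```python
def max_procs_by_points(iy, jx, min_points=9):
    """Upper bound if each processor only needs min_points grid points."""
    return (iy * jx) // min_points


def max_procs_by_boxes(iy, jx, min_side=3):
    """Upper bound if each processor needs a whole min_side x min_side box."""
    return (iy // min_side) * (jx // min_side)


# Documentation example (iy = 34, jx = 48) and my domain (iy = 64, jx = 80)
for iy, jx in [(34, 48), (64, 80)]:
    print(iy, jx, max_procs_by_points(iy, jx), max_procs_by_boxes(iy, jx))
# -> 34 48 181 176
# -> 64 80 568 546
```

Either reading gives roughly 180 for the documentation's example and well over 500 for my domain, which is why the failure above 85 processors puzzles me.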
Many thanks.
Best,
Alex