[RegCNET] MPI crashing with segmentation fault
Peter Huszar
huszarpet at gmail.com
Wed Oct 12 09:06:30 CEST 2011
Dear RegCM4 users,
I am running RegCM4.1.1 in a pseudo-coupled mode, meaning that
chemical species (ozone and aerosols) are supplied from an external
source each hour.
This requires restarting RegCM each hour (a process automated by Linux
shell scripts). This was easy to handle by refining the ICBC time step
from 6 h to 1 h and setting ibdyfrq to 1.
However, after running for a while, my RegCM4.1.1 simulations fail with
the following type of error:
--------------------------------------------------------------------------
mpirun noticed that process rank 6 with PID 6999 on node meop1 exited
on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
No further error messages are available(!).
When I start the simulation again, the crash occurs at a different point
(but always after 1-4 months), so I am unable to reproduce it reliably.
I tried changing the number of processors (e.g. from 8 to 6), but this
led to the same error.
To read in data at the beginning of each hour of the RegCM4
simulation, I had to modify the RegCM source slightly. To check
whether the problem lies in these modifications, I replaced the
modified binary with the default RegCM 4.1.1 binary, but this had no
effect: the same error occurred, again at a different point, when
starting the simulation from the beginning.
I performed the following command hour by hour (as I said, RegCM is
restarted each hour)
/home/met/usr/local/bin/mpirun -v --output-filename
$RegCMCAMxROOT/logs/mpi.log -np $number_proc $REGCMBIN regcm.in
and I have attached the output from one of the crashing hours (mpi.log.0)
as well as the output I got when running valgrind to look for possible
memory leaks.
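
For reference, the hourly cycle described above can be sketched roughly
as follows. This is only an illustration, not the actual driver script:
the function names next_hour and run_hour are hypothetical, GNU date and
a standard Gregorian calendar are assumed, and the restart file name
pattern is taken from the file listing at the end of the log below. Only
the mpirun line itself comes from the command quoted above.

```shell
#!/bin/sh
# Hypothetical sketch of an hourly restart driver (not the real script).

# Advance a YYYYMMDDHH timestamp by one hour.
# Assumes GNU date and a Gregorian calendar.
next_hour() {
    ts=$1
    y=$(printf '%s' "$ts" | cut -c1-4)
    m=$(printf '%s' "$ts" | cut -c5-6)
    d=$(printf '%s' "$ts" | cut -c7-8)
    h=$(printf '%s' "$ts" | cut -c9-10)
    # Convert to seconds since the epoch (UTC), add 3600 s, format back.
    epoch=$(date -u -d "$y-$m-$d $h:00:00" +%s) || return 1
    date -u -d "@$((epoch + 3600))" +%Y%m%d%H
}

# Run one model hour, then check both the mpirun exit status and the
# presence of the restart (SAV) file for the next hour.
run_hour() {
    cur=$1
    nxt=$(next_hour "$cur") || return 1
    /home/met/usr/local/bin/mpirun -v --output-filename \
        "$RegCMCAMxROOT/logs/mpi.log" -np "$number_proc" "$REGCMBIN" regcm.in
    status=$?
    if [ "$status" -ne 0 ] || [ ! -f "RegCMCAMx4_SAV.$nxt" ]; then
        echo "hour $cur failed (mpirun exit status $status)" >&2
        return 1
    fi
}

# Example use (commented out because it needs the model environment):
# run_hour 2000022404 || exit 1
```

Checking the exit status and the next SAV file after every hour makes
the first failing hour explicit instead of letting the loop continue.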
An important point is that when I simply run RegCM4 without restarting
it every hour, I get no errors. So I assume that there is a problem
with the restart mechanism in RegCM in connection with MPI, but I am
really stuck with this error, and any help or suggestion would be
appreciated.
uname -a:
Linux meop1 2.6.27.25-78.2.56.fc9.x86_64 #1 SMP Thu Jun 18 12:24:37
EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
(Fedora 9, 64bit, 8 CPU, 4GB RAM, 2GB swap)
Peter
-------------- next part --------------
Performing RegCM4.x run for 2000-02-24 04 hours
Today: 2000 02 24, 055, 04 hours
2000010100 2000022404 2000022405 2000020100
0
c
Running RegCM4.x
==3366== Memcheck, a memory error detector.
==3366== Copyright (C) 2002-2007, and GNU GPL'd, by Julian Seward et al.
==3366== Using LibVEX rev 1804, a library for dynamic binary translation.
==3366== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP.
==3366== Using valgrind-3.3.0, a dynamic binary instrumentation framework.
==3366== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et al.
==3366== For more details, rerun with: -v
==3366==
==3366== Syscall param sched_setaffinity(mask) points to unaddressable byte(s)
==3366== at 0x31258E0AE9: syscall (in /lib64/libc-2.8.so)
==3366== by 0x54CA4E8: opal_paffinity_linux_plpa_api_probe_init (in /home/met/usr/local/lib/openmpi/mca_paffinity_linux.so)
==3366== by 0x54CABB4: opal_paffinity_linux_plpa_init (in /home/met/usr/local/lib/openmpi/mca_paffinity_linux.so)
==3366== by 0x54CB4C4: opal_paffinity_linux_plpa_have_topology_information (in /home/met/usr/local/lib/openmpi/mca_paffinity_linux.so)
==3366== by 0x54CA37E: linux_module_init (in /home/met/usr/local/lib/openmpi/mca_paffinity_linux.so)
==3366== by 0x4E960C5: opal_paffinity_base_select (in /home/met/usr/local/lib/libopen-pal.so.0.0.0)
==3366== by 0x4E69464: opal_init (in /home/met/usr/local/lib/libopen-pal.so.0.0.0)
==3366== by 0x4C187DC: orte_init (in /home/met/usr/local/lib/libopen-rte.so.0.0.0)
==3366== by 0x402D18: orterun (orterun.c:542)
==3366== by 0x402AA6: main (main.c:13)
==3366== Address 0x0 is not stack'd, malloc'd or (recently) free'd
--------------------------------------------------------------------------
mpirun noticed that process rank 6 with PID 3373 on node meop1 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
==3366==
==3366== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 5 from 1)
==3366== malloc/free: in use at exit: 296,726 bytes in 1,519 blocks.
==3366== malloc/free: 10,349 allocs, 8,830 frees, 12,915,254 bytes allocated.
==3366== For counts of detected errors, rerun with: -v
==3366== searching for pointers to 1,519 not-freed blocks.
==3366== checked 242,176 bytes.
==3366==
==3366== LEAK SUMMARY:
==3366== definitely lost: 68,525 bytes in 132 blocks.
==3366== possibly lost: 0 bytes in 0 blocks.
==3366== still reachable: 228,201 bytes in 1,387 blocks.
==3366== suppressed: 0 bytes in 0 blocks.
==3366== Rerun with --leak-check=full to see details of leaked memory.
-rw-rw-r-- 1 met met 18M 2011-10-12 00:34 RegCMCAMx4_SAV.2000022404
ls: cannot access RegCMCAMx4_SAV.2000022405: No such file or directory
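
In the trace above, Memcheck process 3366 has a stack ending in
orterun's main, so it followed the mpirun launcher, while the rank that
crashed had PID 3373. One possible way to instrument the model processes
themselves is to place valgrind after mpirun's own options, so that each
rank starts under it. The sketch below only prints such a command line;
the helper name valgrind_cmd, the log file pattern, and the default
binary path ./regcm are assumptions for illustration.

```shell
#!/bin/sh
# Print (do not run) an mpirun command line in which every rank is
# started under valgrind. In --log-file, %p expands to the PID of each
# process, so every rank writes its own log.
valgrind_cmd() {
    nproc=$1   # number of MPI processes
    bin=$2     # path to the RegCM binary
    printf '%s\n' "mpirun -np $nproc valgrind --leak-check=full --log-file=valgrind.%p.log $bin regcm.in"
}

# Fall back to hypothetical defaults when the variables are unset.
valgrind_cmd "${number_proc:-8}" "${REGCMBIN:-./regcm}"
```

The printed line can be inspected and then pasted into the hourly script
in place of the plain mpirun call for a diagnostic run.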
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpi.log.0
Type: application/octet-stream
Size: 9489 bytes
Desc: not available
URL: <https://lists.ictp.it/pipermail/regcnet/attachments/20111012/4e446dcf/attachment-0001.obj>