[RegCNET] Problem in running RegCM3 in PC Cluster
Maurice.McHugh at noaa.gov
Mon Nov 30 15:42:00 CET 2009
Hi Akhmad,
Here is what we are doing on our cluster, which is currently a pair of dual hex-core (6-core) servers located on the same network subnet. Having the machines on the same subnet is very important, as it allows message passing between machines and CPUs. They are connected via Ethernet, nothing fancy. The subnet issue had been causing the model to run on only one server of the cluster at a time; if more than one server was used, it would refuse to run properly. I emailed this regcnet list about the issue last week.
What we have found is that the same version of the openmpi library and the openmpi executables (mpirun, mpif90, etc.) must be installed in identical locations on all servers.
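One way to check this is to ask each server where its mpirun lives and which version it reports. The following is a minimal sketch, not part of our actual setup: the function names and the host names in the usage line are hypothetical, and it assumes passwordless ssh to every server already works.

```shell
#!/bin/sh
# Print the mpirun path and the first line of its version banner for each host.
# Note: a non-interactive ssh session may see a different PATH than a login shell.
report_mpi_install() {
    for host in "$@"; do
        info=$(ssh "$host" 'command -v mpirun; mpirun --version 2>&1 | head -n 1' | tr '\n' ' ')
        printf '%s: %s\n' "$host" "$info"
    done
}

# Succeed only if every host reports the same path and the same version line.
mpi_install_matches() {
    distinct=$(report_mpi_install "$@" | cut -d: -f2- | sort -u | wc -l | tr -d ' ')
    [ "$distinct" -eq 1 ]
}
```

For example, "report_mpi_install node01 node02" lists what each server has, and "mpi_install_matches node01 node02 || echo mismatch" flags any difference.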
Ensure that you rsync your working RegCM directory, including all executables and input files, to an identical directory structure on all servers you intend to use. You can also do this when invoking openmpi, as there is an mpirun flag you can set to copy all required files across servers if you so desire. I do not do this; for now I use "rsync -avz" to push only the changes made to the directory out to the other servers.
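A minimal sketch of the rsync step, assuming the run directory should sit at the same absolute path on every server. The function name, the path, and the host names in the usage line are hypothetical.

```shell
#!/bin/sh
# Mirror the run directory to the same absolute path on each listed host.
sync_run_dir() {
    run_dir=$1; shift
    for host in "$@"; do
        # The trailing slashes make rsync copy the directory contents,
        # so the remote tree ends up with an identical layout.
        rsync -avz "$run_dir/" "$host:$run_dir/" || return 1
    done
}
```

Usage would look like: sync_run_dir /home/user/RegCM3 node02. Only files that changed since the last sync are transferred.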
Ensure that you can "ssh" to all servers from your head node - the server from which you run the model.
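Since mpirun starts the remote processes over ssh without a terminal to type a password into, it is worth testing non-interactive login specifically. A minimal sketch, with a hypothetical function name and host names:

```shell
#!/bin/sh
# Report which hosts accept a non-interactive ssh login from this node.
check_ssh() {
    failed=0
    for host in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return "$failed"
}
```

Run it from the head node, for example: check_ssh node01 node02. A FAILED line usually means key-based login still has to be set up for that server.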
On our system (which runs Debian), we found that openmpi blows up (gives openmpi run-time errors) if there are files of the form "openmpi-sessions-USERNAME@SERVERNAME" in the /tmp directory. Whether this is unique to Debian or not I do not know. I just delete any files like this from the /tmp directories in my run-time script. If I recall correctly, these files were found by using the "strace" command at run-time to see what I/O was occurring.
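The cleanup can be a few lines at the top of the run script. This is a minimal sketch rather than the script we actually use: the function name is hypothetical, and it assumes the "at" in the pattern above stands for "@" (the list archive obscures that character).

```shell
#!/bin/sh
# Remove leftover Open MPI session entries for one user from a tmp directory.
# Defaults: /tmp and the current user.
clean_mpi_sessions() {
    tmp_dir=${1:-/tmp}
    user=${2:-$(id -un)}
    # -r handles the case where the entries are directories rather than plain
    # files; -f keeps rm quiet when nothing matches the pattern.
    rm -rf "$tmp_dir"/openmpi-sessions-"$user"@*
}
```

Calling clean_mpi_sessions with no arguments cleans /tmp on the local machine; to clean the other servers, the same rm command can be sent over ssh. It is probably best not to run this while another MPI job of yours is still running on that server.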
Ensure that your mpirun command-line arguments include the hostfile or machinefile argument and the associated file containing your hosts and the number of slots (openmpi's term for CPUs) on each that you wish to use.
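As an illustration, an openmpi hostfile has one line per server of the form "HOST slots=N". The sketch below writes such a file and prints (without running) a matching mpirun command. The function names, host names, file name, and the executable name ./regcm are all hypothetical, and 12 slots per server is an assumption based on two 6-core CPUs per machine.

```shell
#!/bin/sh
# Write an Open MPI hostfile: one "host slots=N" line per server.
write_hostfile() {
    hostfile=$1; slots=$2; shift 2
    : > "$hostfile"
    for host in "$@"; do
        echo "$host slots=$slots" >> "$hostfile"
    done
}

# Print, but do not run, the matching mpirun command so it can be inspected.
mpirun_command() {
    hostfile=$1; exe=$2
    # Total process count is the sum of the slots listed in the hostfile.
    np=$(sed 's/.*slots=//' "$hostfile" | awk '{ n += $1 } END { print n + 0 }')
    echo "mpirun -np $np --hostfile $hostfile $exe"
}
```

For example, "write_hostfile hosts.txt 12 node01 node02" followed by "mpirun_command hosts.txt ./regcm" prints "mpirun -np 24 --hostfile hosts.txt ./regcm". Use a smaller -np value if you do not want to occupy every core.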
Disclaimers and general lawyerism:
Keep in mind that these are the changes I can remember which were made to our servers that are dedicated to RegCM3 and do not have a clustering package or job scheduler installed. At this point in time I am still testing the model's scalability across servers and further changes may have to be made.
Hope this helps!
Regards,
Maurice
----- Original Message -----
From: Akhmad Faqih <akhmadfaqih at hotmail.com>
Date: Thursday, November 26, 2009 2:27 am
Subject: [RegCNET] Problem in running RegCM3 in PC Cluster
To: regcnet at lists.ictp.it
> Dear RegCM3 users,
> I am currently trying to run RegCM3 simulations on a PC cluster,
> but I am still unable to get the openMPI software working in my
> simulation. I think the problem lies in the shared library used for
> openmpi between PCs. I would be grateful if someone with
> experience in running RegCM3 on a PC cluster could provide me with the
> steps to set up a cluster that is ready to run the model.
> Additionally, specific information on how to set up shared libraries,
> such as openMPI, for parallel simulation on a PC cluster would be much
> appreciated. By the way, which cluster packages are most suitable
> for RegCM3? For your information, I am currently using 64-bit Fedora 11
> on my PC cluster. Thanks.
> Best regards,
> Akhmad Faqih
> Center for Climate Risk and Opportunity Management in Southeast Asia
> and Pacific (CCROM-SEAP), Bogor Agricultural University, Indonesia