[RegCNET] Problem in running RegCM3 in PC Cluster

Maurice.McHugh at noaa.gov Maurice.McHugh at noaa.gov
Tue Dec 1 14:31:14 CET 2009


Akhmad,

I cannot help you much with setting up a PC cluster, but I think it is nothing more complicated than two or more PCs on the same network that can securely exchange information via protocols like "ssh".  Having said that, I know very little about that aspect of clustering.

In terms of openmpi, compile and mpirun the following test code to see if openmpi is successfully installed:

        sample.f

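        ! Minimal MPI test: each process prints its rank
        include 'mpif.h'
        integer myid,ierr,NCPU
        integer status(MPI_STATUS_SIZE)
        ! Start MPI, then get this rank and the process count
        call MPI_INIT(ierr)
        call MPI_COMM_RANK(MPI_COMM_WORLD,myid,ierr)
        call MPI_COMM_SIZE(MPI_COMM_WORLD,NCPU,ierr)
        print *,"process",myid,"of",NCPU
        ! Shut MPI down cleanly before exiting
        call MPI_FINALIZE(ierr)
        stop
        end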

From:  https://lists.ictp.it/pipermail/regcnet/2008/001412.html

Test this on one machine at a time, then, if successful, test on the whole cluster.  RegCM seems to be really picky about how the servers in the cluster communicate, so be sure to have them all in the same subnet or the model will not run across the whole cluster.
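
As a rough sketch of the compile-and-run step, assuming the test code is saved as sample.f in the current directory and that mpif90 and mpirun are the openmpi wrappers on your PATH. The process counts (4 and 8) are only illustrative, and have_mpi_tools is a helper name made up for this example.

```sh
#!/bin/sh
# Sketch only: NP_LOCAL, NP_CLUSTER and HOSTFILE are illustrative values.
NP_LOCAL=4
NP_CLUSTER=8
HOSTFILE=my_hostfile

# Report any openmpi tool missing from PATH; returns non-zero if one is missing.
have_mpi_tools() {
    missing=0
    for tool in mpif90 mpirun; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not found on PATH: $tool"
            missing=1
        fi
    done
    return "$missing"
}

if have_mpi_tools && [ -f sample.f ]; then
    # Compile with the openmpi Fortran wrapper
    mpif90 sample.f -o sample || exit 1
    # Step 1: this machine only
    mpirun -np "$NP_LOCAL" ./sample
    # Step 2: the whole cluster, if a hostfile is present
    if [ -f "$HOSTFILE" ]; then
        mpirun -machinefile "$HOSTFILE" -np "$NP_CLUSTER" ./sample
    fi
else
    echo "skipping: need mpif90, mpirun and sample.f in the current directory"
fi
```

Each process should print one line with its rank and the total number of processes, in no particular order.  For the cluster step the compiled sample executable has to exist at the same path on every node.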

The following shows how to set up your hostfile, which tells openmpi where to run and with how many CPUs.  The PC on which you initiate RegCM3 does not have to be specified, only the cluster nodes on which you want to run the model.

# my_hostfile:
# setup hostfile for mpi for various machines, disallow oversubscription.
# from: http://www.open-mpi.org/faq/?category=running#simple-launch

machine.name.here slots=4 max-slots=12
machine.name.2.here slots=4 max-slots=12

#end

In the above, slots is the number of CPUs on each machine that you want the model to run on, and max-slots prevents you from running more processes than you have CPUs available (called oversubscribing), which slows model runs down dramatically.
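
If you would rather compute -np from the hostfile than hard-code it, a small helper along these lines sums the slots= entries.  This is only a sketch: total_slots is a made-up name, and it ignores hosts that have no slots= entry.

```sh
#!/bin/sh
# Sum the slots= values in an openmpi hostfile.
# Usage: total_slots HOSTFILE
total_slots() {
    awk '
        # Skip comment lines and blank lines
        /^[ \t]*#/ || NF == 0 { next }
        {
            # Field 1 is the host name; look for slots=N in the rest.
            # max-slots=N does not match because of the ^ anchor.
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^slots=[0-9]+$/) {
                    split($i, kv, "=")
                    sum += kv[2]
                }
            }
        }
        END { print sum + 0 }
    ' "$1"
}
```

For the two-machine hostfile above, total_slots my_hostfile prints 8, which is the number of processes that fits the declared slots without oversubscribing.  It could be passed to mpirun as -np "$(total_slots my_hostfile)".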

Then at run time start the run by executing in your script:
mpirun -machinefile my_hostfile -np 12 --byslot ./regcm<./regcm.in

Or, if you want to time the run:
time mpirun -machinefile my_hostfile -np 12 --byslot ./regcm<./regcm.in

I found http://www.open-mpi.org/faq/ very useful.

The makefile you specify depends on your compiler and the architecture, and it does not matter whether you're running in parallel or not.  If you're using 64-bit Fedora on a generic *86 box (as opposed to a machine using AMD chips) then simply use the makefile appropriate to your compiler, ignoring the IBM, SUN and DEC options.  I do not know what the CL2 makefile is for, but it appears to be the only Makefile specifying MPI_ROOT, which just appears to set the path to the openmpi executables.  You can append the openmpi directories to your path in your run-time script or in a start-up script (i.e. the .cshrc, .tcshrc or .bashrc files).
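
For the start-up script route, something like the following would go in a .bashrc.  OPENMPI_HOME and the /usr/local/openmpi prefix are placeholders assumed for illustration; substitute wherever openmpi is installed on your machines.  A .cshrc or .tcshrc would use setenv instead of export.

```sh
# ~/.bashrc fragment (sketch).  OPENMPI_HOME is a hypothetical install prefix.
OPENMPI_HOME=/usr/local/openmpi

# Make mpirun, mpif90 etc. available on the command line
export PATH="$OPENMPI_HOME/bin:$PATH"

# Let executables find the openmpi shared libraries at run time;
# the ${VAR:+...} form avoids a trailing colon when the variable is unset
export LD_LIBRARY_PATH="$OPENMPI_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```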

The one thing you do have to do differently is in your run-time script.  Use the following:

#!/bin/csh -f
set mydir=$PWD
cd ../Main
make clean

#  Use below linking option for parallel runs
ln -s 0options/0_NODIAG_PARALLEL_CODE MAKECODE

# Make code, change dirs, copy executable
./MAKECODE
make -f Makefile_gnu
cd $mydir
mv ../Main/regcm .

# Link datafiles
/bin/ln -sf ../Input/DOMAIN.INFO fort.10
/bin/ln -sf ../Input/ICBC1990010100 fort.101

etc. etc. etc. ....

It is the ln -s 0options/0_NODIAG_PARALLEL_CODE MAKECODE line that you must include for parallel runs; just ensure that it comes before you run ./MAKECODE.

Note that in the above I use Makefile_gnu because I compile with gfortran, and I had to create my own makefile for that compiler.  I cannot use the Intel ifort compiler due to licensing restrictions.  Please swap out that makefile for the one appropriate to your architecture and compiler.

Hope this helps; I'm copying this to the list so that other people may be helped.
Regards,

Maurice

----- Original Message -----
From: Akhmad Faqih <akhmadfaqih at hotmail.com>
Date: Tuesday, December 1, 2009 2:50 am
Subject: RE: [RegCNET] Problem in running RegCM3 in PC Cluster
To: Maurice.McHugh at noaa.gov


> Hi Maurice,
> 
> Many thanks for your response to my email. I am trying to implement 
> your description, although it is a bit difficult for me to understand 
> it completely. Configuring a PC cluster and setting up the 
> supporting software for the model are new to me, and I have 
> been struggling to work on this by myself. Although I am familiar with 
> the Linux environment, I do not completely understand how 
> openmpi works on a PC cluster, or how a PC cluster works. Do 
> you know any simple commands or ways to test whether openmpi is 
> working successfully on our PC cluster?
> 
> Btw, which makefile do we have to use to run the model on a PC 
> cluster? Is it Makefile_CL2? What is the meaning of this line in that 
> file: MPI_ROOT = /net/shared/mpich-1.2.5..11/intel/8.0, and where can 
> I find or create this directory on my PC cluster?
> 
> Thanks. Looking forward to your reply.
> 
> Best regards,
> 
> Akhmad Faqih
> 
> 
> > Date: Mon, 30 Nov 2009 09:42:00 -0500
> > From: Maurice.McHugh at noaa.gov
> > Subject: Re: [RegCNET] Problem in running RegCM3 in PC Cluster
> > To: akhmadfaqih at hotmail.com
> > CC: regcnet at lists.ictp.it
> > 
> > Hi Akhmad,
> > 
> > Here is what we are doing on our cluster, which is currently a pair
> > of dual hex-core (6 core) servers located on the same network subnet.
> > Having the machines on the same subnet is very important as it allows
> > message passing between machines and CPUs.  They are connected via
> > ethernet, nothing fancy.  The subnet issue was causing the model to
> > run on only one server of the cluster at a time; if more than one
> > server was used then it would refuse to run properly.  I emailed about
> > this issue last week to this regcnet email list.
> > 
> > What we have found is that you have to ensure that the same openmpi
> > library and openmpi executables (mpirun, mpif90 etc.) are installed in
> > identical locations on all servers.  Ensure you have the same versions
> > of the library and executables.
> > 
> > Ensure that you rsync your working RegCM directory including all
> > executables and input files to an identical structure on all servers
> > you intend to use.  You can also do this when invoking openmpi as
> > there is an mpirun flag you can set to copy all required files across
> > servers if you so desire.  I do not do this, and use "rsync -avz" to
> > only update the changes made to the directory across servers at this
> > point in time.
> > 
> > Ensure that you can "ssh" to all servers from your head node - the
> > server from which you run the model.
> > 
> > On our system (which runs the debian OS), we found that openmpi
> > blows up (gives openmpi run-time errors) if there are files of the
> > form: "openmpi-sessions-USERNAME@SERVERNAME" in the /tmp directory.
> > Whether this is unique to Debian or not I do not know.  I just delete
> > any files like this in the /tmp directories in my run-time script.  If
> > I recall correctly these files were found using the "strace" command
> > at run-time to see what I/O was occurring.
> > 
> > Ensure that your mpirun command-line arguments include the hostfile
> > or machinefile argument and the associated file containing your hosts
> > and the number of slots (openmpi's term for CPUs) on each that you
> > wish to use.
> > 
> > Disclaimers and general lawyerism:
> > Keep in mind that these are the changes I can remember which were
> > made to our servers that are dedicated to RegCM3 and do not have a
> > clustering package or job scheduler installed.  At this point in time
> > I am still testing the model's scalability across servers and further
> > changes may have to be made.
> > 
> > Hope this helps!
> > 
> > Regards,
> > 
> > Maurice
> > 
> > 
> > ----- Original Message -----
> > From: Akhmad Faqih <akhmadfaqih at hotmail.com>
> > Date: Thursday, November 26, 2009 2:27 am
> > Subject: [RegCNET] Problem in running RegCM3 in PC Cluster
> > To: regcnet at lists.ictp.it
> > 
> > 
> > > Dear RegCM3 users,
> > > I am currently trying to run RegCM3 simulations on a PC cluster,
> > > however I am still unable to operate the openMPI software in my
> > > simulation. I think the problem is with the shared library for using
> > > openmpi between PCs. I will be grateful if someone who has
> > > experience in running RegCM3 on a PC cluster could provide me with
> > > the steps to set up a cluster that is ready for running the model.
> > > Additionally, specific information on how to set up shared
> > > libraries, such as openMPI, for parallel simulation on a PC cluster
> > > will be much appreciated. Btw, what kind of cluster package is most
> > > suitable for RegCM3? For your information, I am currently using
> > > Fedora 11 64-bit for my PC cluster. Thanks.
> > > Best regards,
> > > Akhmad Faqih
> > > Center for Climate Risk and Opportunity Management in Southeast
> > > Asia and Pacific (CCROM-SEAP), Bogor Agricultural University, Indonesia
> > > _______________________________________________
> > > RegCNET mailing list
> > > RegCNET at lists.ictp.it


