
Running mpich2 on both Windows 7 and Ubuntu 10.04

Checklist for successfully running mpich2 programs across Windows and Linux:

  1. Version 1.3.2 is buggy for me; I’ve successfully used version 1.2.1.
  2. Pass-phrases have to be the same on all machines.
  3. mpich2 versions have to be the same.
  4. Even if Ubuntu’s mpich2 is compiled with the sock channel, you still have to pass -channel sock when running; otherwise weird lockups appear.
  5. When running smpd.exe as a service on Windows, you also need to pass the -plaintext argument to mpiexec. You don’t need it if smpd is started with -debug.
  6. When launching from Windows, use the executable name without the .exe extension. You can also use the -path argument to specify both the Windows and Linux paths in which executables will be searched for.

Before anything else, make sure the machines can ‘see’ each other by pinging them, and also run smpd.exe -status [othermachine]. You should get confirmation that the remote system is up and running. On Windows, you also need to register using mpiexec -register, followed by a quick check with mpiexec -validate.

At the end of the page there’s a small MPI example that can be compiled to test mpich2.

1. I’ve tried to run version 1.3.2 of mpich2, but unfortunately I get the error below, or the application just hangs, with both machines ending up waiting for each other.

C:\Program Files (x86)\MPICH2\bin>mpiexec.exe -machinefile hosts.cfg -n 4 -channel sock -plaintext testmpi
Hello from process 0 on adevaraciune
Hello from process 1 on adevaraciune
Fatal error in MPI_Finalize: Other MPI error, error stack:
MPI_Finalize(318)...........................: MPI_Finalize failed
MPI_Finalize(211)...........................:
MPID_Finalize(92)...........................:
PMPI_Barrier(476)...........................: MPI_Barrier(comm=0x44000002) failed
MPIR_Barrier(82)............................:
MPIC_Sendrecv(161)..........................:
MPIC_Wait(513)..............................:
MPIDI_CH3i_Progress_wait(215)...............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(631)..:
MPIDI_CH3_Sockconn_handle_connect_event(620): [ch3:sock] failed to connnect to remote process
MPIDU_Socki_handle_connect(809).............: connection failure (set=0,sock=3,errno=111:Connection refused)
Hello from process 2 on Cinemalaptop
Hello from process 3 on Cinemalaptop

job aborted:
rank: node: exit code[: error message]
0: adevaraciune: -2: Fatal error in MPI_Finalize: Other MPI error, error stack:
MPI_Finalize(318)...........................: MPI_Finalize failed
MPI_Finalize(211)...........................:
MPID_Finalize(92)...........................:
PMPI_Barrier(476)...........................: MPI_Barrier(comm=0x44000002) failed
MPIR_Barrier(82)............................:
MPIC_Sendrecv(161)..........................:
MPIC_Wait(513)..............................:
MPIDI_CH3i_Progress_wait(215)...............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(631)..:
MPIDI_CH3_Sockconn_handle_connect_event(620): [ch3:sock] failed to connnect to remote process
MPIDU_Socki_handle_connect(809).............: connection failure (set=0,sock=3,errno=111:Connection refused)
1: adevaraciune: -2
2: cinemalaptop2: 1
3: cinemalaptop2: 1

The same command, run with version 1.2.1 of mpich2, produces correct results:

C:\Program Files (x86)\MPICH2\bin>mpiexec.exe -machinefile hosts.cfg -channel sock -plaintext -n 4 testmpi
Hello from process 0 on adevaraciune
Hello from process 1 on adevaraciune
Hello from process 3 on Cinemalaptop
Hello from process 2 on Cinemalaptop

I initially thought this might be because I am connecting the two machines through OpenVPN. However, I checked the source code and saw that mpich2 uses INADDR_ANY when opening sockets, so it listens on all interfaces of each machine and the VPN should not be causing the problem.
I will check future versions of mpich2 when they are released, but for now, in my configuration, only version 1.2.1 does the job.
Another thought was that the failure happens because the Windows 7 machine (Cinemalaptop above) is 64-bit and I’m using the 32-bit version of mpich2. However, I am using the same 32-bit build of version 1.2.1, and it works.

This might also be related to point 4 below.
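
For readers unfamiliar with INADDR_ANY as mentioned above, here is a minimal sketch of what binding to it looks like with plain POSIX sockets. This is not mpich2's code; the function name listen_on_any_interface is made up for illustration, and the sketch only assumes a POSIX system such as the Ubuntu machine.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper, not part of mpich2.
 * Opens a TCP listener bound to INADDR_ANY on a port chosen by the OS.
 * Returns the socket descriptor, or -1 on failure. On success, *port
 * receives the chosen port in host byte order. */
int listen_on_any_interface(unsigned short *port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY); /* 0.0.0.0: every local interface */
    addr.sin_port = htons(0);                 /* 0: let the OS pick a free port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 1) < 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        close(fd);
        return -1;
    }

    *port = ntohs(addr.sin_port);
    return fd;
}
```

A listener opened this way accepts connections arriving on any interface, including a VPN adapter, which is why the OpenVPN link by itself should not prevent the processes from reaching each other.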

2. Pass-phrases have to be the same on the machines.

If the pass-phrases don’t match, the error you get is pretty generic and not self-explanatory:

C:\Program Files (x86)\MPICH2\bin>mpiexec.exe -machinefile hosts.cfg -channel sock -plaintext -n 4 testmpi
Aborting: unable to connect to adevaraciune

The only way to debug this is to check the debug log of smpd and notice that, after a challenge response is received, a FAIL is generated.

Also, on Windows, it seems that when you remove smpd as a service (for debugging purposes, for example), the pass-phrase is removed too, so you have to add it again, for example with smpd -set phrase behappy.

3. mpich2 versions have to be the same.

If they differ, you get the error:

C:\Program Files (x86)\MPICH2\bin>mpiexec.exe -machinefile hosts.cfg -n 4 -channel sock -plaintext cpi
Aborting: unable to connect to adevaraciune, smpd version mismatch

4. Option -channel sock

This one was very difficult to find. I found it on the mpich2 mailing list while searching for other errors. Basically, without it, you get the same error as at point 1; adding the option fixes the problem. Option -channel nemesis also seems to work, even though on Ubuntu I’ve compiled with --with-device=ch3:sock.

5. Executable name on Windows and Ubuntu.

Of course, you have to have the same source code compiled on both Windows and Linux. You have to omit the .exe extension on Windows, as the exact executable name will be looked up on Linux. Using

strace -f -s254 smpd -debug > smpd.log 2>&1

you can actually see the locations where the Linux smpd looks for the executable.

For me, it was easier to just copy them to /usr/local/bin, as that seemed to be in the search path (I don’t know if it is a coincidence, but I do keep /usr/local/bin on the PATH 🙂 ). On Windows, I kept the executables in the mpich2 bin folder, as you can see from the paths above.
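
To check by hand whether a given name (such as testmpi, without .exe) would be found in a list of directories, a generic PATH-style search can be sketched as below. I have not verified that smpd resolves names this way; the helper find_in_path is hypothetical and only mimics the usual colon-separated lookup on Linux.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper, not smpd's actual lookup code.
 * Searches the colon-separated directory list `path` for an entry called
 * `name` that the current user may execute. On success the full path is
 * written to `out` and 0 is returned; otherwise -1 is returned. */
int find_in_path(const char *name, const char *path, char *out, size_t outlen)
{
    const char *p = path;

    if (p == NULL)
        return -1;

    for (;;) {
        size_t dirlen = strcspn(p, ":"); /* length of the current directory */

        if (dirlen > 0) {
            int n = snprintf(out, outlen, "%.*s/%s", (int)dirlen, p, name);
            if (n > 0 && (size_t)n < outlen && access(out, X_OK) == 0)
                return 0;
        }

        p += dirlen;
        if (*p == '\0')
            break;
        p++; /* skip the ':' separator */
    }
    return -1;
}
```

Calling it as find_in_path("testmpi", getenv("PATH"), buf, sizeof buf) (with stdlib.h included for getenv) in the environment smpd runs in would show whether /usr/local/bin is actually being picked up.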

Hosts file and sample code.

The code I am using is:

#include <mpi.h>
#include <stdio.h>

int main( int argc, char * argv[  ] )
{
   int  processId;      /* rank of process */
   int  noProcesses;    /* number of processes */
   int  nameSize;       /* length of name */
   char computerName[MPI_MAX_PROCESSOR_NAME];

   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &noProcesses);
   MPI_Comm_rank(MPI_COMM_WORLD, &processId);
   MPI_Get_processor_name(computerName, &nameSize);
   fprintf(stderr,"Hello from process %d on %s\n", processId, computerName);
   MPI_Finalize( );

   return 0;
}

The hosts.cfg file, in which I specify how many cores each system has:

adevaraciune:2
cinemalaptop:4

I’ve assigned the names to the OpenVPN IPs by modifying the hosts file on both machines. However, I assume it would work with IPs too, since the sockets bind to the INADDR_ANY address.
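
As a quick sanity check that the -n value fits the machine file, the host:cores lines can be summed with a few lines of C. This is only a sketch of the format as used in this post, not mpiexec's own parser; the name total_slots is made up, and counting a line without a ":count" suffix as 1 is my assumption.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not mpiexec's parser.
 * Sums the process counts in machine-file text made of "hostname:count"
 * lines. A line with no ":count" suffix is counted as 1 (an assumption of
 * this sketch). Blank lines are skipped. Returns -1 if a count is not a
 * number. */
int total_slots(const char *text)
{
    int total = 0;
    const char *p = text;

    while (*p != '\0') {
        size_t linelen = strcspn(p, "\n"); /* length of the current line */

        if (linelen > 0) {
            const char *colon = memchr(p, ':', linelen);

            if (colon == NULL) {
                total += 1;
            } else {
                if (!isdigit((unsigned char)colon[1]))
                    return -1;
                total += (int)strtol(colon + 1, NULL, 10);
            }
        }

        p += linelen;
        if (*p == '\n')
            p++; /* move past the newline */
    }
    return total;
}
```

For the two-line hosts.cfg above the function returns 6, so the -n 4 used in the commands fits within the declared cores.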

Conclusions

I appreciate the effort put into mpich2. I’ve tried openmpi and, while it worked great, I had to give up because I only have a Windows machine and an Ubuntu machine, and apparently openmpi doesn’t support this combination. Mpich2 does, but only with version 1.2.1.

I was planning on creating a personal file system using mpich2 processes, with a small GUI app that checks for the phone’s presence and, if it is found, automatically downloads new photos and pushes them around your personal network of PCs.
Given the difficulty of getting MPI to run on different operating systems and the multitude of command line arguments, I think it might be easier for me to code this in Java, since:

  • when it comes to sockets, threads, etc., Java really is bulletproof.
  • coding in a good IDE, with very nice debuggers and a multitude of libraries, you can finish more iterations faster than in C. Imagine that in C you only get an error saying that a process exited without calling finalize, while in Java you directly get the stack trace and the line number where the issue happened.

However, I will keep my options open.

If this helped, drop me a line 🙂
