maandag 19 november 2012

DMTCP and Java RMI (3)

I have continued my efforts on trying to get my application working with DMTCP checkpointing and restarting.
But to make things more clear we will start of with a complete description of  the test application.

RMI HellowordCounter

The layout is as follows:
  • HelloInterface.java
    public interface HelloInterface extends Remote {
     public String say() throws RemoteException;
    }
    
  • Hello.java
    public class Hello extends UnicastRemoteObject implements HelloInterface {
     private static final Logger LOG = Logger.getLogger("Hello");
     private static FileHandler logFileHandler;
    
     private static final long serialVersionUID = 1L;
     private String message;
     private static int counter = 0;
     
     public Hello (String msg) throws RemoteException {
      initLogging();
      message = msg;
     }
     
     public String say() throws RemoteException {
      counter++;
      LOG.info("Say :: " + message + counter);
      return message+counter;
     }
    }
    
  • HelloServer.java
    public class HelloServer 
    {
     public static void main (String[] argv) 
     {
      try {
       System.setProperty("java.rmi.server.codebase","http://myserver:9999/");
       System.setProperty("java.rmi.server.hostname", "myserver");
       Naming.rebind ("Hello", new Hello ("Hello,"));
       System.out.println ("Server is connected and ready for operation.");
      } 
      catch (Exception e) {
       System.out.println ("Server not connected: " + e);
      }
     }
    }
    
  • HelloClient.java
    public class HelloClient 
    {
     private static final Logger LOG = Logger.getLogger("HelloClient");
     private static FileHandler logFileHandler;
     
     public static void main (String[] argv) {
      try {
       initLogging();
       System.setProperty("java.rmi.server.hostname", "myserver");
       System.setProperty("java.rmi.server.codebase","http://myserver:9999/");
       
       
       HelloInterface hello =(HelloInterface) Naming.lookup ("rmi://myserver:1099/Hello");
       
       while (true) {
        Thread.sleep(1000);
        
        LOG.warning(hello.say());
       }
       
      } 
      catch (Exception e){
       System.out.println ("HelloClient exception: " + e);}
     }
    }
This application represents a simple helloworld application combined with a counter to make sure we have communication between the stub and the skeleton.

To run this we use an rmiregistry running on the system of the server and jar file containing the jini (apache river) classserver to perform remote classloading.

If we combine this with the dmtcp commands we get the following commands to be executed on the machine of the server:

  • dmtcp_coordinator : we start up the coordinator
  • dmtcp_checkpoint rmiregistry
  • dmtcp_checkpoint java -jar classserver.jar -trees -dir "/home/robin/RMITest/bin" -port 9999 -verbose
  • dmtcp_checkpoint java HelloServer

On the machine of the client :

  • dmtcp_checkpoint java HelloClient
All of these commands have been executed with DMTCP_HOST set as myserver which is the name given to the machine of the server.

To test everything in a more detailed manner, the dmtcp commands have been recompiled with debugging enabled. 

Afterwards I tried to execute the application with only the rmiregistry and the classserver run through dmtcp. 
The checkpointing now went successful as well as the restart. 
Executing rmiregistry, classserver and HelloServer through dmtcp resulted in an exception at the client side when checkpointing since dmtcp pauses the application when checkpointing. 

Executing the complete application with dmtcp made it impossible to restart the application. 
Some of the problems that arise have already been shown in my previous posts and I can now even add a new one that occurred just recently:


[2432] ERROR at connectionmanager.cpp:545 in KernelDeviceToConnection; REASON='JASSERT(device == fdToDevice ( fds[i] )) failed'
     device = pipe:[150039]
     fdToDevice ( fds[i] ) = pipe:[150040]
     fds[i] = 1
     fds[0] = 0
java (2432): Terminating...
[2456] ERROR at connectionmanager.cpp:545 in KernelDeviceToConnection; REASON='JASSERT(device == fdToDevice ( fds[i] )) failed'
     device = pipe:[26067]
     fdToDevice ( fds[i] ) = pipe:[26068]
     fds[i] = 1
     fds[0] = 0
java (2456): Terminating...
[2512] ERROR at connectionmanager.cpp:545 in KernelDeviceToConnection; REASON='JASSERT(device == fdToDevice ( fds[i] )) failed'
     device = pipe:[150039]
     fdToDevice ( fds[i] ) = pipe:[150040]
     fds[i] = 1
     fds[0] = 0
rmiregistry (2512): Terminating...
[2590] ERROR at connectionmanager.cpp:545 in KernelDeviceToConnection; REASON='JASSERT(device == fdToDevice ( fds[i] )) failed'
     device = pipe:[150039]
     fdToDevice ( fds[i] ) = pipe:[150040]
     fds[i] = 1
     fds[0] = 0
java (2590): Terminating...
Segmentation fault (core dumped)
Segmentation fault (core dumped)

This are just pieces of the complete output, but they clearly show the failure of starting up the application.

dinsdag 13 november 2012

DMTCP and Java RMI (2)

This post is a continuation of the previous one which was explaining the issues encountered with DMTCP and RMI.

Third phase:

Now that it is possible to take checkpoints, the next step is to restart the checkpoint we created.
Whenever DMTCP creates a checkpoint it automatically generates a bash script that can be used to restart the complete job.
During tests it was noted that this script is far to general and needs manual tweaking before functioning. 
  1. First of the variable DMTCP_HOST should always be set to the DNS or IP of the system running the dmtcp_coordinator.
  2. The dmtcp commands need to be in the path. (personal remark: .profile is not executed in an ssh command -> add to .bashrc before the check for interactive shells) 
  3. In the middle of the restart script the following lines are present:
    # SYNTAX:
    #  :: HOST : MODE: CHECKPOINT_IMAGE ...
    # Host names and filenames must not include ':'
    # At most one fg (foreground) mode allowed; it must be last.
    # 'maybexterm' and 'maybebg' are set from MODE.
    worker_ckpts='
     :: myserver :bg: /home/robin/RMITest/bin/ckpt_java_5bbc58e7-2634-509bd481.dmtcp /home/robin/RMITest/ckpt_java_5bbc58e7-2453-509bd40a.dmtcp /home/robin/RMITest/ckpt_rmiregistry_5bbc58e7-2452-509bd40a.dmtcp
     :: Ubuntu :bg: /home/robin/RMITest/bin/ckpt_java_3e62ae3a-6597-509bd51f.dmtcp
    The problem is that the computers hostname is being taken directly as the way to locate different systems and needs to be altered. Although while testing this on Amazon AWS Ubuntu instances it was noted that the computer name can be used as a private DNS to the machine which solves this problem.
  4. To restart the different jobs on remote systems, dmtcp uses SSH. The code that is provided to handle this only works on the default setup that requires no authentication. If another way of authentication is used (as with amazon using a private key file), this command will need to be altered. A bug was also found in that paths with spaces do not work, unfortunately a solution to this problem wasn't found yet. Simply quoting the variables is not sufficient. A different method of retrieving the different files will have to be created.
  5.  A new coordinator has to be restarted manually, when we don't do this, following error is shown:
    dmtcp_checkpoint (DMTCP + MTCP) 1.2.6
    Copyright (C) 2006-2011  Jason Ansel, Michael Rieker, Kapil Arya, and
                                                           Gene Cooperman
    This program comes with ABSOLUTELY NO WARRANTY.
    This is free software, and you are welcome to redistribute it
    under certain conditions; see COPYING file for details.
    (Use flag "-q" to hide this message.)
    
    [9748] ERROR at dmtcpcoordinatorapi.cpp:358 in startNewCoordinator; REASON='JASSERT(false) failed'
         s = myserver
         jalib::Filesystem::GetCurrentHostname() = Robin-Laptop
    Message: Won't automatically start coordinator because DMTCP_HOST is set to a remote host.
    dmtcp_restart (9748): Terminating...
    dmtcp_checkpoint (DMTCP + MTCP) 1.2.6
    Copyright (C) 2006-2011  Jason Ansel, Michael Rieker, Kapil Arya, and
                                                           Gene Cooperman
    This program comes with ABSOLUTELY NO WARRANTY.
    This is free software, and you are welcome to redistribute it
    under certain conditions; see COPYING file for details.
    (Use flag "-q" to hide this message.)
    
    [13432] ERROR at dmtcpcoordinatorapi.cpp:358 in startNewCoordinator; REASON='JASSERT(false) failed'
         s = myserver
         jalib::Filesystem::GetCurrentHostname() = ubuntu-desktop
    Message: Won't automatically start coordinator because DMTCP_HOST is set to a remote host.
    dmtcp_restart (13432): Terminating...
  6. After fixing all of this we got the following problem:
    dmtcp_checkpoint (DMTCP + MTCP) 1.2.6
    Copyright (C) 2006-2011  Jason Ansel, Michael Rieker, Kapil Arya, and
                                                           Gene Cooperman
    This program comes with ABSOLUTELY NO WARRANTY.
    This is free software, and you are welcome to redistribute it
    under certain conditions; see COPYING file for details.
    (Use flag "-q" to hide this message.)
    
    [12565] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
         (strerror((*__errno_location ()))) = Address already in use
         id() = 5bbc58e7-12085-50a151fe(99005)
    Message: Bind failed.
    [12565] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
         (strerror((*__errno_location ()))) = Address already in use
         id() = 5bbc58e7-12086-50a151fe(99004)
    Message: Bind failed.
    dmtcp_checkpoint (DMTCP + MTCP) 1.2.6
    Copyright (C) 2006-2011  Jason Ansel, Michael Rieker, Kapil Arya, and
                                                           Gene Cooperman
    This program comes with ABSOLUTELY NO WARRANTY.
    This is free software, and you are welcome to redistribute it
    under certain conditions; see COPYING file for details.
    (Use flag "-q" to hide this message.)
    
    [14350] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
      mapping /tmp/hsperfdata_robin/14350 with data from ckpt image
    [12086] mtcp_restart_nolibc.c:973 read_shared_memory_area_from_file:
      mapping current version of /home/robin/RMITest/codebase.jar into memory;
      _not_ file as it existed at time of checkpoint.
      Change mtcp_restart_nolibc.c:973 and re-compile, if you want different behavior. 1078228992: 1
    [12086] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
      mapping /tmp/hsperfdata_robin/12086 with data from ckpt image
    [12126] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
      mapping /tmp/hsperfdata_robin/12126 with data from ckpt image
    [12085] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
      mapping /tmp/hsperfdata_robin/12085 with data from ckpt image
    [12126] WARNING at jsocket.cpp:352 in writeAll; REASON='JWARNING(cnt > 0) failed'
         cnt = -1
         len = 388
         (strerror((*__errno_location ()))) = Connection reset by peer
    Message: JSocket write failure
    [14350] WARNING at jsocket.cpp:295 in readAll; REASON='JWARNING(cnt!=0) failed'
         sockfd() = 20
         origLen = 388
         len = 388
    Message: JSocket needed to read origLen chars,
     still needs to read len chars, but EOF reached
    [14350] ERROR at dmtcpmessagetypes.cpp:64 in assertValid; REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
         _magicBits = 
    Message: read invalid message, _magicBits mismatch.  Did DMTCP coordinator die uncleanly?
    java (14350): Terminating...
    [12126] WARNING at jsocket.cpp:295 in readAll; REASON='JWARNING(cnt!=0) failed'
         sockfd() = 19
         origLen = 388
         len = 388
    Message: JSocket needed to read origLen chars,
     still needs to read len chars, but EOF reached
    [12126] ERROR at dmtcpmessagetypes.cpp:64 in assertValid; REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
         _magicBits = 
    Message: read invalid message, _magicBits mismatch.  Did DMTCP coordinator die uncleanly?
    java (12126): Terminating...
    
    Not really knowing the origin of this error I decided to try the entire procedure again wherafter I got the following error:
    [12985] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
      mapping /tmp/hsperfdata_robin/12985 with data from ckpt image
    [13082] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
      mapping /tmp/hsperfdata_robin/13082 with data from ckpt image
    [12986] mtcp_restart_nolibc.c:973 read_shared_memory_area_from_file:
      mapping current version of /home/robin/RMITest/codebase.jar into memory;
      _not_ file as it existed at time of checkpoint.
      Change mtcp_restart_nolibc.c:973 and re-compile, if you want different behavior. 1078360064: 1
    [12986] ERROR at virtualpidtable.cpp:558 in serializeEntryCount; REASON='JASSERT(versionCheck == correctValue) failed'
         versionCheck = ntries:[
         correctValue = NumEntries:[
         o.filename() = /tmp/dmtcp-robin@Robin-Laptop/dmtcpPidMapCount.5bbc58e7-12986-50a172b4.50a17657
    Message: invalid file format
    java (12986): Terminating...
    [12985] ERROR at virtualpidtable.cpp:560 in serializeEntryCount; REASON='JASSERT(versionCheck == correctValue) failed'
         versionCheck = nt�, @����  
         correctValue = ]
         o.filename() = /tmp/dmtcp-robin@Robin-Laptop/dmtcpPidMapCount.5bbc58e7-12986-50a172b4.50a17657
    Message: invalid file format

At the moment the only thing I can conclude is that DMTCP isn't yet working as I had hoped it would.
Simply restarting a checkpointed program in exactly the way it was during checkpointing, seems far from trivial.
Some more tests will need to be done to pinpoint the possible reasons of these crashes. 

maandag 12 november 2012

DMTCP and Java RMI

Initial Testing on Amazon


Some initial tests with DMTCP and Java RMI have proven it to be quite difficult.
Out of the box it simply did not work.
The first tests were done on standard Ubuntu 12.04 instances from Amazon and the following problems have been observed:

When starting an RMI program under JDK1.6 DMTCP can't start it.
To test this the RMIRegistry was ran but failed to start with following error thrown:

[1988] ERROR at connection.cpp:274 in onListen; REASON='JASSERT(tcpType() == TCP_BIND) failed'
     tcpType() = 4098
     id() = 2e03157ea2e7ccf4-1988-508e466b(99006)
Message: Listening on a non-bind()ed socket????
rmiregistry (1988): Terminating...

After using debugging capabilities of DMTCP it was traced back to a call to one of Java's internal libraries containing the implementation of a socket. Not really knowing what to do now a new version of the Java JDK was installed while hoping for the best.

Somehow this has solved the issue, I decided to perform the next stages of testing on my local network containing 2 Ubuntu 12.10 computers with Java OpenJDK 1.7 and DMTCP-1.2.6 installed.

Test Setup

First phase:

I created a small RMI program existing out of a server which is called upon by the client. It performs a simple hellow world and keeps track of the amount of calls that were made.
To immediately test the functionality with files, everything is logged both on the server and on the client.
Next to that we have an RMIRegistry capable of handling remote calls and a small HTTP server for remote classloading.

Without checkpointing the program runs as expected without any difficulties.

Second phase:

A first trial was to start up the program through the DMTCP framework by using the dmtp_checkpoint commands. This went well and the program still functioned as expected. The next step was to take a snapshot of the current state. This didn't go well and threw the following errors:


exception: java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is: 
 java.net.SocketException: Connection reset

java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is: 
 java.io.EOFException
 at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:304)
 at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:202)
 at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:128)
 at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:194)
 at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:148)
 at $Proxy0.say(Unknown Source)
 at HelloClient.main(HelloClient.java:24)
Caused by: java.io.EOFException
 at java.io.DataInputStream.readByte(DataInputStream.java:267)
 at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:246)

It seemed to be a problem with the restart of the program after having taken a snapshot.
A simple reboot of the system was the solution to this exception. It was now possible to take snapshots of the running RMI program.

The following fazes will be explained in my next blogpost.