Wednesday 19 December 2012

First round of changes

The first attempts are made to integrate the proposal into CBAS.
This has proven to be quite the challenge. 

Large parts of the code need to be completely rewritten, as I am also addressing some of the issues that were noted in the code.
The integration of DynamoDB is going quite well thanks to the Object Persistence technology that is available.
It provides annotations that can be applied to a Java class and its methods to map its data to a DynamoDB table. One of those annotations is DynamoDBAutoGeneratedKey, which generates a UUID to be used as the hash key; this provides unique ids for every job.
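
As a rough illustration (not the actual CBAS code), such an annotated class could look as follows; the class and attribute names are placeholders, only the annotations come from the AWS SDK's object persistence layer (the exact package name depends on the SDK version):

    import com.amazonaws.services.dynamodb.datamodeling.DynamoDBAttribute;
    import com.amazonaws.services.dynamodb.datamodeling.DynamoDBAutoGeneratedKey;
    import com.amazonaws.services.dynamodb.datamodeling.DynamoDBHashKey;
    import com.amazonaws.services.dynamodb.datamodeling.DynamoDBTable;

    // Illustrative job item; field names are placeholders, not the actual CBAS schema.
    @DynamoDBTable(tableName = "Jobs")
    public class JobItem {
        private String jobId;
        private String status;

        @DynamoDBHashKey(attributeName = "jobId")
        @DynamoDBAutoGeneratedKey // a UUID is generated when the item is first saved
        public String getJobId() { return jobId; }
        public void setJobId(String jobId) { this.jobId = jobId; }

        @DynamoDBAttribute(attributeName = "status")
        public String getStatus() { return status; }
        public void setStatus(String status) { this.status = status; }
    }

Saving such an item through a DynamoDBMapper then fills in the generated key automatically.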

Apart from a job table, a misc table was also provided; this table has a string as its key and can contain a string value. 
Currently this is used to store the bucket names that are in use, which is required because those names need to be unique.

Methods are also supplied to create or retrieve the necessary tables, queues and buckets.
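
For the queues and buckets, such a method roughly boils down to calls like the following (a sketch with the AWS SDK for Java; the helper class and resource names are placeholders, and the table creation is left out):

    import com.amazonaws.auth.AWSCredentials;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.sqs.AmazonSQSClient;
    import com.amazonaws.services.sqs.model.CreateQueueRequest;

    public class ResourceSetup {
        // Creates the queue and bucket if they do not exist yet; createQueue
        // returns the queue URL, creating the queue when it is not there yet.
        public static void ensureResources(AWSCredentials credentials, String queueName, String bucketName) {
            AmazonSQSClient sqs = new AmazonSQSClient(credentials);
            String queueUrl = sqs.createQueue(new CreateQueueRequest(queueName)).getQueueUrl();

            AmazonS3Client s3 = new AmazonS3Client(credentials);
            if (!s3.doesBucketExist(bucketName)) {
                s3.createBucket(bucketName);
            }
            System.out.println("Queue ready at " + queueUrl + ", bucket ready: " + bucketName);
        }
    }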

The Job Messages classes were altered; some information could be removed since it is now stored in the database, where both the worker and the master can retrieve it.
In addition, their toString output now includes the type of the message as well.

A Job Record now only contains a JobItem, which is the annotated Java class representing an item from the job table.

JobRequest was extended to include the outputSandboxFile, in which users can supply a name for their output. Currently only one name is supported.

The master is heavily adapted to work with all the changes made until now. 

The next point was to test and check Amazon's SNS functionality. 
This requires an HTTP endpoint, for which Jetty was chosen. Some initial testing showed great promise for its functionality. 
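
A minimal embedded Jetty endpoint of the kind used in these tests could look roughly as follows; the port and the handling logic are placeholders, only the Jetty and servlet APIs are real:

    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    import org.eclipse.jetty.server.Request;
    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.server.handler.AbstractHandler;

    public class SnsEndpoint {
        public static void main(String[] args) throws Exception {
            Server server = new Server(8080); // port is a placeholder
            server.setHandler(new AbstractHandler() {
                @Override
                public void handle(String target, Request baseRequest,
                                   HttpServletRequest request, HttpServletResponse response)
                        throws java.io.IOException {
                    // SNS delivers notifications as HTTP POSTs; the message type
                    // (e.g. SubscriptionConfirmation or Notification) is in this header.
                    String messageType = request.getHeader("x-amz-sns-message-type");
                    System.out.println("Received SNS message of type " + messageType);
                    response.setStatus(HttpServletResponse.SC_OK);
                    baseRequest.setHandled(true);
                }
            });
            server.start();
            server.join();
        }
    }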

The initial plan was to provide an SNS topic for every master, to make it easy to send messages to the master even from external applications (a possible frontend), and a topic for every job, to provide easy communication between master and worker. 

Several problems have been encountered with this approach though: 
  • Amazon has a maximum of 25 topics per user.
  • Amazon SNS is a fire-and-forget kind of service.
Amazon SNS provides a retry mechanism in which it tries to send the message to all endpoints and retries if it fails. The number of retries is a setting that can be altered, as is the time between retries. 
If we were to use this mechanism to determine whether a worker is still alive, a timer mechanism or a ping system would have to be used.

Additionally, mechanisms to deal with recovery and cleanup are also required
(retrieving the SNS ARN, removing old endpoints from the topic, resubscribing). 

The first restriction is the more severe one: with only 25 topics available, our suggestion of providing one for every job is unattainable. 
Therefore we will just create one topic but provide specific message structures and titles.
This will cause overhead, because every message is sent to every endpoint.
If there are enough topics left, it is suggested to create one for every master.
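
Publishing to that single topic with an explicit title could then look roughly like this; the topic ARN, the subject and the message body are placeholders:

    import com.amazonaws.auth.AWSCredentials;
    import com.amazonaws.services.sns.AmazonSNSClient;
    import com.amazonaws.services.sns.model.PublishRequest;

    public class MessagePublisher {
        public static void publishJobFinished(AWSCredentials credentials, String topicArn, String jobId) {
            AmazonSNSClient sns = new AmazonSNSClient(credentials);
            // The subject encodes the message type so endpoints can ignore
            // messages that are not meant for them.
            String subject = "JOB_FINISHED";                 // placeholder title
            String body = "{\"jobId\":\"" + jobId + "\"}";   // placeholder structure
            sns.publish(new PublishRequest(topicArn, body, subject));
        }
    }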

We will need to decide if this is worth the effort.

A second thought that arose while working on all the changes was the necessity of the input and output queues.
Their main reason for being there is to let the workers retrieve work from those queues.
This is actually not practical in our cloud setting.
We will need to start up instances, they have to perform work, and afterwards they get shut down.
It would be costly to have those instances wait for new work to arrive.
(With the exception of the hour rule, where you pay for every hour in which your instance is running.)

Therefore, just sending the job through user data when starting up the instance, or providing it through SNS, seems more efficient.
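
On the worker side, the user data that was supplied at launch can be read back from the standard EC2 instance metadata service; a rough sketch (interpreting the data as a job id is just an assumption here):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class UserDataReader {
        // Reads the raw user data that was supplied when the instance was launched.
        public static String readUserData() throws IOException {
            URL url = new URL("http://169.254.169.254/latest/user-data");
            BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
            StringBuilder data = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                data.append(line);
            }
            reader.close();
            return data.toString(); // e.g. a job id or a small description of the job
        }
    }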

The same reasoning can be made for the output queue.

To get the best of both worlds, a new SQS queue can be created to perform the ping-like tactics discussed earlier for SNS.
Doing this in SNS would mean sending a ping message every X seconds from every worker to the topic, which would in turn reach every worker and the master.
That would mean a lot of overhead and is actually the perfect job for an SQS queue,
mainly because the order of arrival does not matter, nor is there a very strict delivery deadline.
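
Such a ping could be as simple as the following; the queue URL, the message format and the interval are placeholders:

    import com.amazonaws.auth.AWSCredentials;
    import com.amazonaws.services.sqs.AmazonSQSClient;
    import com.amazonaws.services.sqs.model.SendMessageRequest;

    public class WorkerPinger implements Runnable {
        private final AmazonSQSClient sqs;
        private final String queueUrl;
        private final String workerId;

        public WorkerPinger(AWSCredentials credentials, String queueUrl, String workerId) {
            this.sqs = new AmazonSQSClient(credentials);
            this.queueUrl = queueUrl;
            this.workerId = workerId;
        }

        @Override
        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                // The master periodically drains this queue; if a worker's ping stays
                // away for too long, the worker is presumed dead.
                sqs.sendMessage(new SendMessageRequest(queueUrl,
                        "PING " + workerId + " " + System.currentTimeMillis()));
                try {
                    Thread.sleep(30000); // ping interval is a placeholder
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }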

To summarize the changes to the proposal:

  • Only use 1 SNS topic (or, if possible, one for every master)
  • Remove the input/output SQS queues; replace them with user data and SNS
  • Create a PING SQS queue to inform the master which workers are alive.



Wednesday 12 December 2012

CBAS Proposal

Having studied the current CBAS implementation, it was required to create a proposal describing the necessary changes.
To start off, we show a diagram of the current situation:
As can be seen, there is currently an instance running both a Master and a Worker;
communication goes through RMI and SQS, while S3 is used for file transfer.

Since it is still a prototype, it functioned mainly through the use of local directories to store jobs and output.
The biggest issues that were noted in the prototype have also been listed; some are fixed in our own proposal, but most of them simply require rewriting.

This will be further explained using the diagram of the proposed deployment:

Something that will be noticed right away is the strong focus on Amazon that was absent in the prototype.
In the prototype the usage of Amazon was limited to SQS and S3.
This has been extended to EC2, SQS, S3, SNS and DynamoDB.
A combination of these services will make us capable of having a robust and safe implementation, and it solves some of the issues that were present in the prototype.
In addition, it was suggested to provide a web frontend to the system for easy usage. This suggestion was included as well for completeness.

To reduce costs as much as possible we will try to use Amazon's spot market to run both Master and Workers. For this to succeed we require a termination-resilient implementation of these applications.
The ways we provide this are explained further on.

One of the bigger differences with the current prototype is the handling of information concerning jobs. Currently this is simply stored in a job manager on the master, but whenever a master crashes or is terminated, we lose all the data concerning these jobs.
Therefore it is suggested to store the information as key-value pairs in a DynamoDB database on Amazon.
This solution brings robustness and scalability, but requires the usage of an additional service, which introduces additional costs. Since the size of this database is rather limited, this extra cost will be rather low in comparison to the cost of the different instances.
This approach also has the advantage that both Master and Worker are able to access this database and make the necessary changes. The problem of not having unique job ids is also solved through the usage of DynamoDB, since it has the capability of providing these ids.

In this schematic I also represented a possible scenario of distributing multiple masters and their respective workers across the different zones of a region. This is mainly to improve robustness, and simply serves as an example.
It would be up to the Master to decide whether it is beneficial to create a new Master in a different zone, simply deploy the workers across zones, or even start a Master and workers in a different region.

Another change can be found in the usage of Amazon's Simple Notification Service, which should provide solutions to several problems. For example, the application would no longer require the usage of RMI, since this communication can now be done through SNS messages.
We could now easily detect terminated instances, signals no longer need to be sent through SQS (which was a problem since SQS does not guarantee FIFO ordering of its messages), and so on.
The only disadvantage at the moment is the requirement of having an HTTP endpoint to receive these messages. This can be solved through the usage of libraries, of which Jetty is an example.
Yet again this is an extra service which will introduce new costs.
It will need to be decided whether it is worth these costs or not.

The already present systems of SQS and S3 were reduced.
For example, the signal queues are no longer needed because of the usage of SNS.
The interactive buckets aren't needed anymore either; we suggest using a folder
structure within the input and output buckets to reflect, for example, the different jobs.

These were some of the general changes, but the Master and Worker require a more detailed view.
Therefore the following schematic was created:


As can be seen, the Master will be subdivided into 4 different parts.

  • HTTP Server : this will be the HTTP endpoint required for using the SNS service. It will have to forward the received messages to other parts of the Master. 
  • Spot Simulator : the core part of the decision-making process. The spot simulator will be used to decide which bid will be used to request new instances. This simulator has been provided to us in the form of the Broker Prototype from Kurt Vermeersch. In this application only historical data is used, read from CSV files. There will need to be a system that either provides up-to-date CSVs, or the simulator will have to be expanded to retrieve the current data from the Amazon web services. The simulator also tries to account for price differences between regions; this is something that could be supported if we have at least one Master in each region, or if it would be profitable enough to start up a Master with some workers in the specific region. Another issue that will have to be solved is the difference in input format between the simulator and CBAS: a consensus will have to be found between the JDL files used in CBAS and the input files of the simulator. To make things easier I would suggest a new message format, namely JSON, since it is easily used in Amazon's services. 
  • Resource Manager : using the bids of the spot simulator, this manager will make sure the required number of spot instances is present, which can then be used by the worker manager. If the price differences between regions are large enough to transfer work from one region to another, this manager could also terminate instances; this decision-making process would then need to be added to the simulator, because it is not yet present. If the simulator has shown that it is best to start the job in a different region, the job will be posted to that region if a Master already exists there; if not, we check whether it is worth setting up a Master. When this is not the case, we will have to make sure to get a bid for this region. 
  • Worker Manager : makes the decisions concerning the workers, contains all the communication, and prepares the necessary AMIs for jobs to run on. In addition, it will also be responsible for restarting terminated workers. We would restrict the communication between Worker and Master as much as possible, to make sure they can still function without having the other around. This is possible because we save all the job data in the DynamoDB database. 

The Worker consists of 3 parts:
  • HTTPServer: serves the same purpose as the one within the Master.
  • Checkpointing: this should be the functionality in control of the checkpointing. Ideally this would be done through DMTCP and an additional script which performs a snapshot of the current state, gives it a specific name and saves all information about the job in DynamoDB. The checkpointing functionality within the prototype would no longer be of any use and could be removed. 
  • JobManager: a manager in control of the job that needs to be executed; this would largely be the same logic that is currently present within the CBAS prototype. Only some optimizations would be put in place: for example, the polling for work from SQS would be replaced by SNS posting a message that there is work, and the visibility timeout could be renewed while the job is running (a sketch of such a renewal follows below). 
If multiple workers are working on the same job, some communication is also required between these workers. 
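
Renewing the visibility timeout while a job is still running could look roughly like this; the queue URL, receipt handle and timeout value are placeholders:

    import com.amazonaws.services.sqs.AmazonSQSClient;
    import com.amazonaws.services.sqs.model.ChangeMessageVisibilityRequest;

    public class VisibilityRenewer implements Runnable {
        private final AmazonSQSClient sqs;
        private final String queueUrl;
        private final String receiptHandle; // receipt handle of the message being worked on

        public VisibilityRenewer(AmazonSQSClient sqs, String queueUrl, String receiptHandle) {
            this.sqs = sqs;
            this.queueUrl = queueUrl;
            this.receiptHandle = receiptHandle;
        }

        @Override
        public void run() {
            // Extends the message's visibility timeout while the job is running,
            // so no other worker picks the same message up in the meantime.
            sqs.changeMessageVisibility(new ChangeMessageVisibilityRequest(
                    queueUrl, receiptHandle, 120)); // timeout in seconds, placeholder
        }
    }

This would be scheduled periodically for as long as the job runs.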


The best approach to this system is yet to be determined.
For now the most important part is to find out how the spot simulator could be integrated, and to see whether this would require a great deal of change to the original code. Then the efficiency of this simulator will have to be checked and perhaps adjusted to our needs.
In the meanwhile testing will continue on the DMTCP checkpointing mechanism.

Further DMTCP Testing

While awaiting an answer from the creators of DMTCP, some additional testing is being performed.
Checkpointing and restarting regular Java sockets worked without much difficulty.
To mimic our use case even further, an EC2 image was created after checkpointing.
This image was then relaunched on different instances, where we altered the IPs in the restart script.
After making sure the SSH information was all correct, a dmtcp_coordinator was launched.
Thereafter the restart script was executed and functioned without any issues.

Even though the new instances have new network addresses, the connections were still recovered.
This is because DMTCP uses a cluster-wide discovery service to find the new addresses.
(See DMTCP: transparent checkpointing for cluster computations and the desktop)

There have also been some new attempts at automatically executing an EC2 snapshot while checkpointing in DMTCP.

Currently this has been tried through the usage of dmtcpplugin.
This method provides a couple of hook points during checkpointing at which you may execute custom code.
But problems arise because a lot of functions cannot be called during checkpointing.
Therefore a system() call right after checkpointing is refused.
I hope to solve this through further communication with the creators of DMTCP.

Monday 19 November 2012

DMTCP and Java RMI (3)

I have continued my efforts to get my application working with DMTCP checkpointing and restarting.
But to make things more clear, we will start off with a complete description of the test application.

RMI HelloWorldCounter

The layout is as follows:
  • HelloInterface.java
    public interface HelloInterface extends Remote {
     public String say() throws RemoteException;
    }
    
  • Hello.java
    public class Hello extends UnicastRemoteObject implements HelloInterface {
     private static final Logger LOG = Logger.getLogger("Hello");
     private static FileHandler logFileHandler;
    
     private static final long serialVersionUID = 1L;
     private String message;
     private static int counter = 0;
     
     public Hello (String msg) throws RemoteException {
      initLogging();
      message = msg;
     }
     
     public String say() throws RemoteException {
      counter++;
      LOG.info("Say :: " + message + counter);
      return message+counter;
     }
    }
    
  • HelloServer.java
    public class HelloServer 
    {
     public static void main (String[] argv) 
     {
      try {
       System.setProperty("java.rmi.server.codebase","http://myserver:9999/");
       System.setProperty("java.rmi.server.hostname", "myserver");
       Naming.rebind ("Hello", new Hello ("Hello,"));
       System.out.println ("Server is connected and ready for operation.");
      } 
      catch (Exception e) {
       System.out.println ("Server not connected: " + e);
      }
     }
    }
    
  • HelloClient.java
    public class HelloClient 
    {
     private static final Logger LOG = Logger.getLogger("HelloClient");
     private static FileHandler logFileHandler;
     
     public static void main (String[] argv) {
      try {
       initLogging();
       System.setProperty("java.rmi.server.hostname", "myserver");
       System.setProperty("java.rmi.server.codebase","http://myserver:9999/");
       
       
       HelloInterface hello =(HelloInterface) Naming.lookup ("rmi://myserver:1099/Hello");
       
       while (true) {
        Thread.sleep(1000);
        
        LOG.warning(hello.say());
       }
       
      } 
      catch (Exception e){
       System.out.println ("HelloClient exception: " + e);}
     }
    }
This application represents a simple hello world application combined with a counter, to make sure we have communication between the stub and the skeleton.

To run this we use an rmiregistry running on the system of the server, and a jar file containing the Jini (Apache River) class server to perform remote classloading.

If we combine this with the dmtcp commands we get the following commands to be executed on the machine of the server:

  • dmtcp_coordinator : we start up the coordinator
  • dmtcp_checkpoint rmiregistry
  • dmtcp_checkpoint java -jar classserver.jar -trees -dir "/home/robin/RMITest/bin" -port 9999 -verbose
  • dmtcp_checkpoint java HelloServer

On the machine of the client :

  • dmtcp_checkpoint java HelloClient

All of these commands have been executed with DMTCP_HOST set to myserver, which is the name given to the machine of the server.

To test everything in a more detailed manner, the dmtcp commands have been recompiled with debugging enabled. 

Afterwards I tried to execute the application with only the rmiregistry and the classserver run through dmtcp. 
The checkpointing was now successful, as was the restart. 
Executing rmiregistry, classserver and HelloServer through dmtcp resulted in an exception at the client side when checkpointing, since dmtcp pauses the application while checkpointing. 

Executing the complete application with dmtcp made it impossible to restart the application. 
Some of the problems that arise have already been shown in my previous posts, and I can now even add a new one that occurred just recently:


[2432] ERROR at connectionmanager.cpp:545 in KernelDeviceToConnection; REASON='JASSERT(device == fdToDevice ( fds[i] )) failed'
     device = pipe:[150039]
     fdToDevice ( fds[i] ) = pipe:[150040]
     fds[i] = 1
     fds[0] = 0
java (2432): Terminating...
[2456] ERROR at connectionmanager.cpp:545 in KernelDeviceToConnection; REASON='JASSERT(device == fdToDevice ( fds[i] )) failed'
     device = pipe:[26067]
     fdToDevice ( fds[i] ) = pipe:[26068]
     fds[i] = 1
     fds[0] = 0
java (2456): Terminating...
[2512] ERROR at connectionmanager.cpp:545 in KernelDeviceToConnection; REASON='JASSERT(device == fdToDevice ( fds[i] )) failed'
     device = pipe:[150039]
     fdToDevice ( fds[i] ) = pipe:[150040]
     fds[i] = 1
     fds[0] = 0
rmiregistry (2512): Terminating...
[2590] ERROR at connectionmanager.cpp:545 in KernelDeviceToConnection; REASON='JASSERT(device == fdToDevice ( fds[i] )) failed'
     device = pipe:[150039]
     fdToDevice ( fds[i] ) = pipe:[150040]
     fds[i] = 1
     fds[0] = 0
java (2590): Terminating...
Segmentation fault (core dumped)
Segmentation fault (core dumped)

These are just excerpts of the complete output, but they clearly show the failure of starting up the application.

Tuesday 13 November 2012

DMTCP and Java RMI (2)

This post is a continuation of the previous one, which explained the issues encountered with DMTCP and RMI.

Third phase:

Now that it is possible to take checkpoints, the next step is to restart the checkpoint we created.
Whenever DMTCP creates a checkpoint, it automatically generates a bash script that can be used to restart the complete job.
During tests it was noted that this script is far too generic and needs manual tweaking before it functions. 
  1. First of all, the variable DMTCP_HOST should always be set to the DNS name or IP of the system running the dmtcp_coordinator.
  2. The dmtcp commands need to be in the path. (Personal remark: .profile is not executed in an ssh command -> add to .bashrc before the check for interactive shells.) 
  3. In the middle of the restart script the following lines are present:
    # SYNTAX:
    #  :: HOST : MODE: CHECKPOINT_IMAGE ...
    # Host names and filenames must not include ':'
    # At most one fg (foreground) mode allowed; it must be last.
    # 'maybexterm' and 'maybebg' are set from MODE.
    worker_ckpts='
     :: myserver :bg: /home/robin/RMITest/bin/ckpt_java_5bbc58e7-2634-509bd481.dmtcp /home/robin/RMITest/ckpt_java_5bbc58e7-2453-509bd40a.dmtcp /home/robin/RMITest/ckpt_rmiregistry_5bbc58e7-2452-509bd40a.dmtcp
     :: Ubuntu :bg: /home/robin/RMITest/bin/ckpt_java_3e62ae3a-6597-509bd51f.dmtcp
    The problem is that the computer's hostname is taken directly as the way to locate the different systems and needs to be altered. However, while testing this on Amazon AWS Ubuntu instances, it was noted that the computer name can be used as a private DNS name for the machine, which solves this problem.
  4. To restart the different jobs on remote systems, dmtcp uses SSH. The code that is provided to handle this only works with the default setup that requires no authentication. If another way of authenticating is used (as with Amazon, using a private key file), this command will need to be altered. A bug was also found in that paths with spaces do not work; unfortunately, a solution to this problem hasn't been found yet. Simply quoting the variables is not sufficient. A different method of retrieving the different files will have to be created.
  5.  A new coordinator has to be started manually; when we don't do this, the following error is shown:
    dmtcp_checkpoint (DMTCP + MTCP) 1.2.6
    Copyright (C) 2006-2011  Jason Ansel, Michael Rieker, Kapil Arya, and
                                                           Gene Cooperman
    This program comes with ABSOLUTELY NO WARRANTY.
    This is free software, and you are welcome to redistribute it
    under certain conditions; see COPYING file for details.
    (Use flag "-q" to hide this message.)
    
    [9748] ERROR at dmtcpcoordinatorapi.cpp:358 in startNewCoordinator; REASON='JASSERT(false) failed'
         s = myserver
         jalib::Filesystem::GetCurrentHostname() = Robin-Laptop
    Message: Won't automatically start coordinator because DMTCP_HOST is set to a remote host.
    dmtcp_restart (9748): Terminating...
    dmtcp_checkpoint (DMTCP + MTCP) 1.2.6
    Copyright (C) 2006-2011  Jason Ansel, Michael Rieker, Kapil Arya, and
                                                           Gene Cooperman
    This program comes with ABSOLUTELY NO WARRANTY.
    This is free software, and you are welcome to redistribute it
    under certain conditions; see COPYING file for details.
    (Use flag "-q" to hide this message.)
    
    [13432] ERROR at dmtcpcoordinatorapi.cpp:358 in startNewCoordinator; REASON='JASSERT(false) failed'
         s = myserver
         jalib::Filesystem::GetCurrentHostname() = ubuntu-desktop
    Message: Won't automatically start coordinator because DMTCP_HOST is set to a remote host.
    dmtcp_restart (13432): Terminating...
  6. After fixing all of this we got the following problem:
    dmtcp_checkpoint (DMTCP + MTCP) 1.2.6
    Copyright (C) 2006-2011  Jason Ansel, Michael Rieker, Kapil Arya, and
                                                           Gene Cooperman
    This program comes with ABSOLUTELY NO WARRANTY.
    This is free software, and you are welcome to redistribute it
    under certain conditions; see COPYING file for details.
    (Use flag "-q" to hide this message.)
    
    [12565] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
         (strerror((*__errno_location ()))) = Address already in use
         id() = 5bbc58e7-12085-50a151fe(99005)
    Message: Bind failed.
    [12565] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
         (strerror((*__errno_location ()))) = Address already in use
         id() = 5bbc58e7-12086-50a151fe(99004)
    Message: Bind failed.
    dmtcp_checkpoint (DMTCP + MTCP) 1.2.6
    Copyright (C) 2006-2011  Jason Ansel, Michael Rieker, Kapil Arya, and
                                                           Gene Cooperman
    This program comes with ABSOLUTELY NO WARRANTY.
    This is free software, and you are welcome to redistribute it
    under certain conditions; see COPYING file for details.
    (Use flag "-q" to hide this message.)
    
    [14350] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
      mapping /tmp/hsperfdata_robin/14350 with data from ckpt image
    [12086] mtcp_restart_nolibc.c:973 read_shared_memory_area_from_file:
      mapping current version of /home/robin/RMITest/codebase.jar into memory;
      _not_ file as it existed at time of checkpoint.
      Change mtcp_restart_nolibc.c:973 and re-compile, if you want different behavior. 1078228992: 1
    [12086] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
      mapping /tmp/hsperfdata_robin/12086 with data from ckpt image
    [12126] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
      mapping /tmp/hsperfdata_robin/12126 with data from ckpt image
    [12085] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
      mapping /tmp/hsperfdata_robin/12085 with data from ckpt image
    [12126] WARNING at jsocket.cpp:352 in writeAll; REASON='JWARNING(cnt > 0) failed'
         cnt = -1
         len = 388
         (strerror((*__errno_location ()))) = Connection reset by peer
    Message: JSocket write failure
    [14350] WARNING at jsocket.cpp:295 in readAll; REASON='JWARNING(cnt!=0) failed'
         sockfd() = 20
         origLen = 388
         len = 388
    Message: JSocket needed to read origLen chars,
     still needs to read len chars, but EOF reached
    [14350] ERROR at dmtcpmessagetypes.cpp:64 in assertValid; REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
         _magicBits = 
    Message: read invalid message, _magicBits mismatch.  Did DMTCP coordinator die uncleanly?
    java (14350): Terminating...
    [12126] WARNING at jsocket.cpp:295 in readAll; REASON='JWARNING(cnt!=0) failed'
         sockfd() = 19
         origLen = 388
         len = 388
    Message: JSocket needed to read origLen chars,
     still needs to read len chars, but EOF reached
    [12126] ERROR at dmtcpmessagetypes.cpp:64 in assertValid; REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
         _magicBits = 
    Message: read invalid message, _magicBits mismatch.  Did DMTCP coordinator die uncleanly?
    java (12126): Terminating...
    
    Not really knowing the origin of this error, I decided to try the entire procedure again, whereafter I got the following error:
    [12985] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
      mapping /tmp/hsperfdata_robin/12985 with data from ckpt image
    [13082] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file:
      mapping /tmp/hsperfdata_robin/13082 with data from ckpt image
    [12986] mtcp_restart_nolibc.c:973 read_shared_memory_area_from_file:
      mapping current version of /home/robin/RMITest/codebase.jar into memory;
      _not_ file as it existed at time of checkpoint.
      Change mtcp_restart_nolibc.c:973 and re-compile, if you want different behavior. 1078360064: 1
    [12986] ERROR at virtualpidtable.cpp:558 in serializeEntryCount; REASON='JASSERT(versionCheck == correctValue) failed'
         versionCheck = ntries:[
         correctValue = NumEntries:[
         o.filename() = /tmp/dmtcp-robin@Robin-Laptop/dmtcpPidMapCount.5bbc58e7-12986-50a172b4.50a17657
    Message: invalid file format
    java (12986): Terminating...
    [12985] ERROR at virtualpidtable.cpp:560 in serializeEntryCount; REASON='JASSERT(versionCheck == correctValue) failed'
         versionCheck = nt�, @����  
         correctValue = ]
         o.filename() = /tmp/dmtcp-robin@Robin-Laptop/dmtcpPidMapCount.5bbc58e7-12986-50a172b4.50a17657
    Message: invalid file format

At the moment the only thing I can conclude is that DMTCP isn't yet working as I had hoped it would.
Simply restarting a checkpointed program in exactly the state it was in during checkpointing seems far from trivial.
Some more tests will need to be done to pinpoint the possible causes of these crashes. 

Monday 12 November 2012

DMTCP and Java RMI

Initial Testing on Amazon


Some initial tests with DMTCP and Java RMI have shown it to be quite difficult.
Out of the box it simply did not work.
The first tests were done on standard Ubuntu 12.04 instances from Amazon, and the following problems were observed:

When starting an RMI program under JDK 1.6, DMTCP can't start it.
To test this, the RMIRegistry was run, but it failed to start with the following error thrown:

[1988] ERROR at connection.cpp:274 in onListen; REASON='JASSERT(tcpType() == TCP_BIND) failed'
     tcpType() = 4098
     id() = 2e03157ea2e7ccf4-1988-508e466b(99006)
Message: Listening on a non-bind()ed socket????
rmiregistry (1988): Terminating...

After using the debugging capabilities of DMTCP, the problem was traced back to a call into one of Java's internal libraries containing the implementation of a socket. Not really knowing what to do, a new version of the Java JDK was installed while hoping for the best.

Somehow this solved the issue. I decided to perform the next stages of testing on my local network, consisting of 2 Ubuntu 12.10 computers with Java OpenJDK 1.7 and DMTCP 1.2.6 installed.

Test Setup

First phase:

I created a small RMI program consisting of a server which is called upon by the client. It performs a simple hello world and keeps track of the number of calls that were made.
To immediately test the functionality with files, everything is logged both on the server and on the client.
In addition, we have an RMIRegistry capable of handling remote calls and a small HTTP server for remote classloading.

Without checkpointing the program runs as expected without any difficulties.

Second phase:

A first trial was to start up the program through the DMTCP framework by using the dmtcp_checkpoint commands. This went well and the program still functioned as expected. The next step was to take a snapshot of the current state. This didn't go well and threw the following errors:


exception: java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is: 
 java.net.SocketException: Connection reset

java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is: 
 java.io.EOFException
 at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:304)
 at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:202)
 at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:128)
 at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:194)
 at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:148)
 at $Proxy0.say(Unknown Source)
 at HelloClient.main(HelloClient.java:24)
Caused by: java.io.EOFException
 at java.io.DataInputStream.readByte(DataInputStream.java:267)
 at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:246)

It seemed to be a problem with restarting the program after having taken a snapshot.
A simple reboot of the system was the solution to this exception. It was now possible to take snapshots of the running RMI program.

The following phases will be explained in my next blog post.


Thursday 25 October 2012

Advanced spot market manager for Amazon EC2

This blog will contain updates about the various steps taken in completing a research internship on the Amazon EC2 spot market.

The goal is to design and build a system that uses the EC2 spot market to complete certain tasks in a cost-efficient manner.

Current status of the project:

1) One of the properties of instances running on the spot market is that they are terminated without warning if the current market price exceeds your bid. It is for this reason that there needs to be a fail-safe. In this internship it was requested to use checkpointing: a means of taking snapshots of the current state of execution of a program. The program DMTCP was chosen to fulfill this task.
Currently some tests with this library have been performed and it works as expected.

The next phase will be to further test dmtcp and search for specific problems that are related to cloud computing.

2) A special checkpointing AMI has been created with dmtcp and s3cmd (a program to use S3) installed.
It could be that s3cmd will be replaced by a personal implementation.
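
Such a personal implementation would probably come down to a few AWS SDK calls; a rough sketch (the bucket, key prefix and helper class are placeholders):

    import java.io.File;

    import com.amazonaws.auth.AWSCredentials;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.PutObjectRequest;

    public class CheckpointUploader {
        // Uploads a DMTCP checkpoint image to S3 so it survives instance termination.
        public static void upload(AWSCredentials credentials, String bucket, File checkpointFile) {
            AmazonS3Client s3 = new AmazonS3Client(credentials);
            s3.putObject(new PutObjectRequest(bucket,
                    "checkpoints/" + checkpointFile.getName(), checkpointFile));
        }
    }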

3) A Java test program to launch instances on the spot market was created.
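
In essence that program does something along the lines of the following; the bid price, AMI id and instance type are placeholders:

    import com.amazonaws.auth.AWSCredentials;
    import com.amazonaws.services.ec2.AmazonEC2Client;
    import com.amazonaws.services.ec2.model.LaunchSpecification;
    import com.amazonaws.services.ec2.model.RequestSpotInstancesRequest;
    import com.amazonaws.services.ec2.model.RequestSpotInstancesResult;

    public class SpotLauncher {
        public static void requestSpotInstance(AWSCredentials credentials) {
            AmazonEC2Client ec2 = new AmazonEC2Client(credentials);

            LaunchSpecification spec = new LaunchSpecification();
            spec.setImageId("ami-00000000");  // placeholder AMI, e.g. the checkpointing AMI
            spec.setInstanceType("m1.small"); // placeholder instance type

            RequestSpotInstancesRequest request = new RequestSpotInstancesRequest()
                    .withSpotPrice("0.05")    // placeholder bid in USD per hour
                    .withInstanceCount(1)
                    .withLaunchSpecification(spec);

            RequestSpotInstancesResult result = ec2.requestSpotInstances(request);
            System.out.println("Spot request id: "
                    + result.getSpotInstanceRequests().get(0).getSpotInstanceRequestId());
        }
    }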