Wednesday, 19 December 2012

First round of changes

The first attempts are made to integrate the proposal into CBAS.
This has proven to be quite the challenge. 

Large parts of the code need to be completely rewritten, as I am also addressing some of the issues that were noted in the code.
The integration of DynamoDB is going quite well thanks to the object persistence support that is available.
It provides annotations that can be applied to a Java class and its methods to map its data to a DynamoDB table. One of those annotations is DynamoDBAutoGeneratedKey, which generates a UUID to be used as the hash key; this is used to provide a unique ID for every job.
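To give an idea, this is roughly what such an annotated class could look like (the class and attribute names are illustrative, not the actual CBAS code):

    import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBAttribute;
    import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBAutoGeneratedKey;
    import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBHashKey;
    import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBTable;

    // Hypothetical job item; the real CBAS class contains more attributes.
    @DynamoDBTable(tableName = "Jobs")
    public class JobItem {

        private String jobId;
        private String status;

        // The auto-generated key makes the mapper fill in a UUID on save,
        // which gives every job a unique hash key.
        @DynamoDBHashKey(attributeName = "JobId")
        @DynamoDBAutoGeneratedKey
        public String getJobId() { return jobId; }
        public void setJobId(String jobId) { this.jobId = jobId; }

        @DynamoDBAttribute(attributeName = "Status")
        public String getStatus() { return status; }
        public void setStatus(String status) { this.status = status; }
    }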

Apart from a job table, a misc table was also provided; this table has a string as its key and can contain a string value.
Currently it is used to store the bucket names that are in use, which is required because those names need to be unique.
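A sketch of how such an entry could be stored and read back through the mapper; the MiscItem class is assumed to be annotated like the JobItem above, with a string hash key and a string value attribute, and the key names are made up:

    import java.util.UUID;

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
    import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapper;

    public class MiscTableExample {

        // Remembers a freshly generated bucket name in the misc table so that
        // other components (master or worker) can look it up later.
        public static String registerOutputBucket(String accessKey, String secretKey) {
            DynamoDBMapper mapper = new DynamoDBMapper(
                    new AmazonDynamoDBClient(new BasicAWSCredentials(accessKey, secretKey)));

            MiscItem entry = new MiscItem();                    // assumed annotated class
            entry.setKey("outputBucket");
            entry.setValue("cbas-output-" + UUID.randomUUID()); // bucket names must be unique
            mapper.save(entry);

            // Any component can retrieve the name again through its key.
            return mapper.load(MiscItem.class, "outputBucket").getValue();
        }
    }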

Methods are also supplied to create or retrieve the necessary tables, queues and buckets.
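As an illustration, creating (or simply retrieving, if they already exist) a queue and a bucket with the AWS SDK for Java looks roughly like this; the method names are made up and do not reflect the actual CBAS helpers:

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.sqs.AmazonSQSClient;
    import com.amazonaws.services.sqs.model.CreateQueueRequest;

    public class AwsSetupSketch {

        // createQueue returns the URL of an already existing queue with the
        // same name (provided the attributes match), so this is idempotent.
        public static String createOrGetQueue(BasicAWSCredentials credentials, String name) {
            AmazonSQSClient sqs = new AmazonSQSClient(credentials);
            return sqs.createQueue(new CreateQueueRequest(name)).getQueueUrl();
        }

        // Bucket names are global, so we only create the bucket when it is absent.
        public static void createBucketIfAbsent(BasicAWSCredentials credentials, String name) {
            AmazonS3Client s3 = new AmazonS3Client(credentials);
            if (!s3.doesBucketExist(name)) {
                s3.createBucket(name);
            }
        }
    }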

The Job Message classes were altered: some information could be removed, since it is now stored in the database where both the worker and the master can retrieve it.
In addition, their toString output now includes the type of the message as well.

A Job Record now only contains a JobItem which is the annotated Java class representing an item from the job table.

JobRequest was extended to include the outputSandboxFile, in which users can supply a name for their output. Currently only one name is supported.

The master is heavily adapted to work with all the changes made until now. 

The next point was to test and check the SNS functionality of Amazon. 
This requires an HTTP endpoint, for which Jetty was chosen. Some initial testing showed great promise for its functionality.
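A minimal sketch of such an endpoint with embedded Jetty; a real endpoint also has to confirm the SNS subscription and parse the JSON body, which is left out here:

    import java.io.IOException;

    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    import org.eclipse.jetty.server.Request;
    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.server.handler.AbstractHandler;

    public class SnsEndpoint {

        public static void main(String[] args) throws Exception {
            Server server = new Server(8080);
            server.setHandler(new AbstractHandler() {
                @Override
                public void handle(String target, Request baseRequest,
                                   HttpServletRequest request, HttpServletResponse response)
                        throws IOException {
                    // SNS delivers subscription confirmations and notifications as an
                    // HTTP POST with a JSON body; here we simply read and print it.
                    StringBuilder body = new StringBuilder();
                    String line;
                    while ((line = request.getReader().readLine()) != null) {
                        body.append(line);
                    }
                    System.out.println(request.getHeader("x-amz-sns-message-type") + ": " + body);
                    response.setStatus(HttpServletResponse.SC_OK);
                    baseRequest.setHandled(true);
                }
            });
            server.start();
            server.join();
        }
    }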

The initial plan was to provide an SNS topic for every master, to make it easy to send messages to the master even from external applications (a possible frontend), and a topic for every job to provide easy communication between master and worker.

Several problems have been encountered with this approach though: 
  • Amazon has a maximum of 25 topics for every user.
  • Amazon SNS is a fire-and-forget kind of service
Amazon SNS does provide a retry mechanism in which it tries to send the message to all endpoints and retries if that fails. The number of retries is a setting that can be altered, as is the time between retries.
If we wanted to use this mechanism to determine whether a worker is still alive, a timer or ping mechanism would have to be used.

Additionally, mechanisms to deal with recovery and clean-up are also required
(retrieving the SNS ARN, removing old endpoints from the topic, resubscribing).

The first restriction is the more severe one: with only 25 topics available, our suggestion of providing one for every job is unattainable.
Therefore we will just create one topic, but provide specific message structures and subjects.
This will cause overhead, because every message is sent to every endpoint.
If there are enough topics left, it is suggested to create one for every master.
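To illustrate the idea, publishing on that single topic could look like the sketch below, where the subject identifies the kind of message so subscribers can filter; the topic ARN, subject convention and payload are assumptions, not a final design:

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.sns.AmazonSNSClient;
    import com.amazonaws.services.sns.model.PublishRequest;

    public class SnsPublishSketch {

        // Every subscriber on the shared topic receives this message; the subject
        // tells them whether it is meant for them, the body carries the details.
        public static void notifyJobFinished(BasicAWSCredentials credentials,
                                             String topicArn, String jobId) {
            AmazonSNSClient sns = new AmazonSNSClient(credentials);
            sns.publish(new PublishRequest(
                    topicArn,
                    "{\"jobId\":\"" + jobId + "\",\"status\":\"FINISHED\"}",
                    "WORKER_JOB_FINISHED"));
        }
    }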

We will need to decide if this is worth the effort.

A second thought that arose while making all these changes concerns the necessity of the input and output queues.
Their main purpose is to let the workers retrieve work from them.
This is actually not practical in our cloud setting.
We will need to start up instances, have them perform work, and shut them down afterwards.
It would be costly to have those instances wait for new work to arrive.
(With the exception of the hour rule, where you pay for every hour in which your instance is running.)

Therefore, simply sending the job through user data when starting up the instance, or providing it through SNS, seems more efficient.
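A rough sketch of how a job id could be passed along through user data when requesting a spot instance; the AMI id, instance type and the way the bid is obtained are placeholders:

    import javax.xml.bind.DatatypeConverter;

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.ec2.AmazonEC2Client;
    import com.amazonaws.services.ec2.model.LaunchSpecification;
    import com.amazonaws.services.ec2.model.RequestSpotInstancesRequest;

    public class SpotWorkerLauncher {

        public static void launchWorkerForJob(BasicAWSCredentials credentials,
                                              String jobId, String bidPrice) {
            AmazonEC2Client ec2 = new AmazonEC2Client(credentials);

            LaunchSpecification spec = new LaunchSpecification()
                    .withImageId("ami-12345678")   // hypothetical worker AMI
                    .withInstanceType("m1.small")
                    // EC2 expects user data to be Base64 encoded; the worker reads
                    // the job id back from the instance metadata at boot.
                    .withUserData(DatatypeConverter.printBase64Binary(jobId.getBytes()));

            ec2.requestSpotInstances(new RequestSpotInstancesRequest()
                    .withSpotPrice(bidPrice)       // bid supplied by the spot simulator
                    .withInstanceCount(1)
                    .withLaunchSpecification(spec));
        }
    }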

The same reasoning can be made for the output queue.

To get the best of both worlds, a new SQS queue can be created which will be used for the ping-like tactics that were discussed earlier for SNS.
Doing this in SNS would mean sending a ping message every X seconds from every worker to the topic, which would in turn reach every worker and the master.
That would mean a lot of overhead and is actually the perfect job for an SQS queue,
mainly because the order of arrival does not matter, nor is there a very strict delivery deadline.
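A small sketch of what the worker side of such a ping could look like; the queue, message format and interval are arbitrary choices for illustration:

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.sqs.AmazonSQSClient;
    import com.amazonaws.services.sqs.model.SendMessageRequest;

    public class WorkerPinger implements Runnable {

        private final AmazonSQSClient sqs;
        private final String queueUrl;
        private final String workerId;

        public WorkerPinger(BasicAWSCredentials credentials, String queueUrl, String workerId) {
            this.sqs = new AmazonSQSClient(credentials);
            this.queueUrl = queueUrl;
            this.workerId = workerId;
        }

        @Override
        public void run() {
            // Periodically tell the master we are still alive; neither ordering nor
            // exact timing of delivery is critical, which is why SQS suffices here.
            while (!Thread.currentThread().isInterrupted()) {
                sqs.sendMessage(new SendMessageRequest(queueUrl,
                        "PING " + workerId + " " + System.currentTimeMillis()));
                try {
                    Thread.sleep(30000); // ping every 30 seconds (arbitrary choice)
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }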

To summarize the changes to the proposal:

  • Only use one SNS topic (per master)
  • Remove the input / output SQS queues, replace them with user data and SNS
  • Create a PING SQS queue to inform the master which workers are still alive.



Wednesday, 12 December 2012

CBAS Proposal

Having studied the current CBAS implementation, a proposal describing the necessary changes had to be created.
To start off, we show a diagram of the current situation:
As can be seen, there is currently one instance running both a Master and a Worker;
communication goes through RMI and SQS, while S3 is used for file transfer.

Since it is still a prototype, it functioned mainly through the use of local directories to store jobs and output.
The biggest issues that were noted in the prototype have also been listed; some are fixed in our proposal, but most of them simply require rewriting.

This will be further explained using the diagram of the proposed deployment:

Something that will be noticed right away is the focus on Amazon that was absent in the prototype.
In the prototype the use of Amazon was limited to SQS and S3.
This has been expanded to EC2, SQS, S3, SNS and DynamoDB.
A combination of these services makes a robust and safe implementation possible and solves some of the issues that were present in the prototype.
In addition, it was suggested to provide a web frontend to the system for ease of use. This suggestion was included as well for completeness.

To reduce costs as much as possible, we will try to use Amazon's spot market to run both Master and Workers. For this to succeed, we require a termination-resilient implementation of these applications.
The ways we provide this are explained further on.

One of the bigger differences with the current prototype is the handling of information concerning jobs. Currently this is simply stored in a job manager on the master, but whenever a master crashes or is terminated, all the data concerning these jobs is lost.
Therefore it is suggested to store this information as key-value pairs in a DynamoDB database on Amazon.
This solution provides robustness and scalability, but requires an additional service, which brings extra costs. Since the size of this database is rather limited, this extra cost will be low in comparison to the cost of the different instances.
This approach also has the advantage that both Master and Worker can access the database and make the necessary changes. The problem of not having unique job IDs is also solved by DynamoDB, since it is capable of providing these IDs.

This schematic also shows a possible scenario of distributing multiple masters and their respective workers across the different zones of a region, mainly to improve robustness and simply as an example.
It would be up to the Master to decide whether it is beneficial to create a new Master in a different zone, simply deploy the workers across zones, or even start a Master and workers in a different region.

Another change can be found in the use of Amazon's Simple Notification Service, which should provide solutions to several problems. For example, the application would no longer require RMI, since that communication can now be done through SNS messages.
We could now easily detect terminated instances, signals no longer need to be sent through SQS (which was a problem, since SQS does not guarantee FIFO ordering of its messages), and so on.
The only disadvantage at the moment is the requirement of having an HTTP endpoint to receive these messages. This can be solved through the use of libraries, of which Jetty is an example.
Yet again this is an extra service which introduces new costs.
It will need to be decided whether it is worth these costs or not.

The already present use of SQS and S3 was reduced.
For example, the signal queues are no longer needed because of SNS.
The interactive buckets are not needed anymore either; we suggest using a folder
structure within the input and output buckets to reflect, for example, the different jobs.

These were some of the general changes, but the Master and Worker require a more detailed view.
Therefore the following schematic was created:


As can be seen, the Master will be subdivided into 4 different parts.

  • HTTP Server : this will be the HTTP endpoint required for using the SNS service. It will have to forward the received messages to other parts of the Master. 
  • Spot Simulator : the core part of the decision-making process. The spot simulator will be used to decide which bid will be used to request new instances. This simulator has been provided to us in the form of the Broker Prototype from Kurt Vermeersch. In that application only historical data is used, read from CSV files; there will need to be a system that either provides up-to-date CSVs, or the simulator will have to be extended to retrieve the current data from the Amazon web services. The simulator also tries to account for price differences between the regions; this is something that could be supported if we have at least one Master in each region, or if it would be profitable enough to start up a Master with some workers in that specific region. Another issue that will have to be solved is the input format of the simulator and CBAS: a consensus will have to be found between the JDL files used in CBAS and the input files of the simulator. To make things easier, I would suggest a new message format, namely JSON, since it is easily used in Amazon's services. 
  • Resource Manager : using the bids of the spot simulator, this manager will make sure the required number of spot instances is present, which can then be used by the Worker Manager. If the price differences between regions are large enough to transfer work from one region to another, this manager could also terminate instances; that decision-making process would then need to be added to the simulator, because it is not yet present. If the simulator has shown that it is best to start the job in a different region, the job will be posted to that region if a Master already exists there; if not, we check whether it is worth setting one up. When it is not, we will have to make sure to get a bid for that region. 
  • Worker Manager : makes the decisions concerning the workers, contains all the communication, and prepares the necessary AMIs for jobs to run on. It will also be responsible for restarting terminated workers. We would restrict the communication between Worker and Master as much as possible, to make sure they can still function without having the other around. This is possible because we save all the job data in the DynamoDB database. 

The Worker consists of 3 parts:
  • HTTPServer: serves the same purpose as the one within the Master.
  • Checkpointing: the functionality in control of the checkpointing. Ideally this would be done through DMTCP and an additional script which takes a snapshot of the current state, gives it a specific name and saves all its information in the job record on DynamoDB. The checkpointing functionality within the prototype would no longer be of any use and could be removed. 
  • JobManager: a manager in control of the job that needs to be executed. This would largely be the same logic that is currently present within the CBAS prototype, with only some optimizations put in place: for example, the polling for work from SQS would be replaced by SNS posting a message that there is work, and by the visibility timeout which can be renewed while the job is running. 
If multiple workers are working on the same job, some communication is also required between these workers. 


The best approach to this system is yet to be determined.
For now the most important part is to find out how the spot simulator can be integrated and to see whether this would require a great deal of change to the original code. Then the efficiency of this simulator will have to be checked and perhaps adjusted to our needs.
In the meantime, testing will continue on the DMTCP checkpointing mechanism.

Further DMTCP Testing

While awaiting an answer from the creators of DMTCP, some additional testing is being performed.
Checkpointing and restarting regular Java sockets worked without much difficulty.
To mimic our use case even further, an EC2 image was created after checkpointing.
This image was then relaunched on different instances, where we altered the IPs in the restart script.
After making sure the SSH information was all correct, a dmtcp_coordinator was launched.
Thereafter the restart script was executed and it functioned without any issues.

Even though the new instances have new network addresses, the connections were still recovered.
This is thanks to DMTCP's cluster-wide discovery service for finding the new addresses.
(See "DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop".)

There have also been some new attempts at automatically executing an EC2 snapshot while checkpointing in DMTCP.

Currently this has been tried through the DMTCP plugin mechanism (dmtcpplugin).
This mechanism provides a couple of hook points during checkpointing at which you may execute custom code.
Problems arise, however, because a lot of calls cannot be executed during checkpointing;
therefore a system() call right after checkpointing is refused.
I hope to solve this through further communication with the creators of DMTCP.