Sunday, March 31, 2013

Checkpointing, fully operational

Finally, after a lot of trial and error I have managed to pinpoint the cause of the checkpointing issue.

Whenever DMTCP takes a checkpoint, it uses a temporary folder to store information.
If the user does not specify one, a folder is created in the /tmp directory.
The problem originated from the snapshot taken of the EC2 instance.
When such a snapshot is taken and an instance is restarted from it, /tmp is empty.

This caused problems for DMTCP, presumably because it could no longer find the expected folder inside /tmp.
To solve this issue, it was sufficient to point DMTCP_TMPDIR to a different, more permanent directory (/dmtcp was chosen).
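
As a minimal sketch, this is how the launch side looks with the new directory, assuming the job is started through dmtcp_checkpoint (the launcher in the DMTCP 1.2 series we use); the job binary name is illustrative:

    import os
    import subprocess

    env = os.environ.copy()
    env["DMTCP_TMPDIR"] = "/dmtcp"  # persistent, unlike /tmp after a snapshot restore

    subprocess.check_call(["dmtcp_checkpoint", "./job"], env=env)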

Then I noticed that the restart scripts did not support the --tmpdir flag, which puzzled me for a while, since I saw no other way of enforcing a different temporary directory.
(Setting the DMTCP_TMPDIR variable didn't seem to work?)
After looking at the code once more, the flag did turn out to exist inside the dmtcp_restart executable.
Thus I altered the dmtcp_coordinator file (which generates the restart scripts) so that it supports the --tmpdir flag.
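
For illustration, invoking a restart by hand with the flag now looks roughly like this (the checkpoint file name is illustrative):

    import subprocess

    # Restart from a checkpoint image, forcing the persistent tmpdir.
    subprocess.check_call(
        ["dmtcp_restart", "--tmpdir", "/dmtcp", "ckpt_job_1234.dmtcp"])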

After this problem I also discovered another bug in DMTCP, in a routine that extracts the directory components from a path.

I combined my fixes into a patch and have sent it to the DMTCP developers.

Currently checkpointing works perfectly, and so do restarts.
I now have to devise a more effective way of timing.
Current timings show values between 25 and 60 seconds, with the VM snapshot being the largest overhead.
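
The simplest improvement is probably to wall-clock the whole operation from the outside, VM snapshot included. A sketch, where create_checkpoint is a hypothetical wrapper around our existing checkpoint-plus-snapshot logic:

    import time

    def time_checkpoint(create_checkpoint):
        # Measure the full operation: DMTCP checkpoint + sync + VM snapshot.
        start = time.time()
        create_checkpoint()
        return time.time() - start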

Restarts have not been timed yet, but currently I wait at least 2 minutes before declaring a worker dead. (Pings are sent every 30 seconds.)
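
The liveness rule itself is simple; a sketch, where last_ping is a hypothetical dict mapping worker ids to the timestamp of their last ping:

    import time

    PING_INTERVAL = 30  # workers ping every 30 seconds
    DEAD_AFTER = 120    # four missed pings and the worker is considered dead

    def dead_workers(last_ping):
        now = time.time()
        return [w for w, t in last_ping.items() if now - t > DEAD_AFTER]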

Besides that, I have read the papers by Andrzejak and Javadi.
But they have not enlightened me; further investigation is required.

My next plan is to gather as much data as possible about the time it takes to perform snapshots, and to get the broker prototype up and running.

Saturday, March 23, 2013

Checkpoint timing seems difficult?

It is now possible to restart a checkpointed application through the use of some additional timers and the sync function.
But another problem has arisen: it seems to be impossible to restart from a checkpoint that was created after a previous restart.
At the moment I fail to see the cause of this but hope to resolve it quite quickly.
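
For reference, the ordering that makes restarts possible, sketched in Python; create_vm_snapshot stands in for the hook into my plugin, and the timer value is a placeholder:

    import subprocess
    import time

    def on_checkpoint_written(create_vm_snapshot):
        time.sleep(5)                    # crude timer: let DMTCP finish writing
        subprocess.check_call(["sync"])  # flush the checkpoint files to disk
        create_vm_snapshot()             # only now is the VM snapshot safe to take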

Another interesting aspect is the time it takes to make snapshots.
Once again I tried to measure the time with the plugin system of DMTCP, but this does not seem to work.
The order in which the plugin events execute seems not to be as predictable as I would want it to be.

Since I had already noticed some timing code inside DMTCP, I was quite sure there was a way to use it to my advantage.
There is indeed an --enable-timing flag when configuring DMTCP, which writes timing results to the error stream and to a jtimings.csv file.
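
A quick way to eyeball that file (I deliberately don't rely on a specific column layout, since it isn't documented):

    import csv

    with open("jtimings.csv") as f:
        for row in csv.reader(f):
            print(" | ".join(row))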

But even this method gives me timestamps that do not take the VM snapshot into account.
I seriously doubt the way those timing values are calculated and will contact the developers once more.

Monday, March 18, 2013

Checkpointing issues

Having debugged throughout the entire weekend, I was completely surprised by my inability to restart my checkpointed programs. In an attempt to figure out what went wrong, I decided to once more dive into the code of DMTCP and find the problem.

I noticed that the checkpoint file was always being generated, but the restart scripts were empty in the snapshots on Amazon. This is in contrast to the restart scripts on the running instance, which were perfectly fine.
The only way this can happen is if the file has not yet been completely written to disk when the snapshot is taken.

This amazed me, since the point at which I was taking the snapshot came after the files were written to disk.

After some digging I figured out my mistake.

DMTCP runs with a DMTCP_Coordinator, which is in charge of taking the checkpoints and restarting them.
The process you want to run uses a DMTCP_Worker.

This worker is where the plugin system is operational, but NOT where the files are written to disk.
These two mechanisms are decoupled from one another; therefore I cannot be sure that the files have been written to disk at the given DMTCP plugin event.

I have contacted the developers once more in the hope of finding a solution for this problem.

Wednesday, March 13, 2013

Successful jobs!

Having worked quite a lot over the past few days, it's time for another post on the progress.

I continued on the system and tackled the worker.
Besides that, I moved my project from Bitbucket to GitHub.
The DMTCP and job files required on the basic CBAS AMI have been fixed as well.

The first jobs have run and the first checkpoints have been created.

Plenty of bugfixes were required, and I added the restart functionality today.

This still needs to be tested a bit more thoroughly.

Afterwards we can start with the main part of this internship: combining our system with something to aid us in choosing the bid price and instance type.
Currently this is implemented using the on-demand EC2 system.

Some other things that require additional thought:
- Cleaning the buckets at some point.
- Making the master more robust.
- Cleaning the SNS topic of unused HTTP endpoints; these could even become a security issue. (A possible cleanup pass is sketched below.)
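
For the SNS point, a possible cleanup pass with boto; is_stale is a hypothetical predicate for endpoints we no longer recognise, and the nested dictionary layout follows boto's query-style SNS responses:

    import boto

    def clean_topic(topic_arn, is_stale):
        sns = boto.connect_sns()
        resp = sns.get_all_subscriptions_by_topic(topic_arn)
        subs = (resp["ListSubscriptionsByTopicResponse"]
                    ["ListSubscriptionsByTopicResult"]["Subscriptions"])
        for sub in subs:
            # Drop HTTP endpoints that no worker is listening on anymore.
            if sub["Protocol"] == "http" and is_stale(sub["Endpoint"]):
                sns.unsubscribe(sub["SubscriptionArn"])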

Sunday, March 3, 2013

AMICreator

Having continued with the work, our next challenge was to somehow provide a job with a good starting position.
By this I mean a functional AMI that is capable of running a Worker, where the Worker is capable of executing a given job.
This is what the Prologue files are used for.

I first created a basic AMI for the complete CBAS project, starting from Ubuntu 12.04.
I updated the system, installed the default Java version and build-essential, and updated Boto.
Afterwards I compiled the DMTCP version that has the plugin and 'system' support, from the latest SVN revision they provide.
During this compilation process I noticed that some of the tests provided with DMTCP failed.
So this is by no means a release version and could have bugs while checkpointing certain applications.

Next I created and compiled my plugin that will snapshot the virtual machine.
I then took a snapshot and created an AMI from it; this forms the CBAS AMI.

Next up was creating job-specific AMIs using the prologue files that were provided within the JDL.
To do this I made an AMICreator, which launches an instance and executes all files that are provided in its userdata.
After it has finished, it informs us by posting a message to a temporarily created SQS queue, after which we clean up the used resources.
The files provided in the userdata script are presigned URLs for the prologue files that were uploaded to S3.
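
Condensed, the AMICreator flow with boto looks roughly like this; the instance type, AMI name and the wget pipe in the userdata are illustrative, and error handling is omitted:

    import time
    import boto

    def create_job_ami(base_ami, prologue_urls, queue_name):
        # Fetch every prologue file via its presigned URL and run it at boot.
        user_data = "#!/bin/bash\n" + "\n".join(
            "wget -qO- '%s' | bash" % url for url in prologue_urls)

        ec2 = boto.connect_ec2()
        instance = ec2.run_instances(base_ami, instance_type="t1.micro",
                                     user_data=user_data).instances[0]

        # The instance reports back on a temporary SQS queue when it is done.
        queue = boto.connect_sqs().create_queue(queue_name)
        while queue.read() is None:
            time.sleep(15)

        ami_id = ec2.create_image(instance.id, "cbas-job-ami")
        instance.terminate()
        queue.delete()
        return ami_id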

Having created a job-specific AMI, we use it in our worker manager to start a new instance.
I was planning on launching an instance and then starting a worker on it using the same methodology as the run files that are used in the original CBAS project.
For that reason I had created a private Git repository on Bitbucket.

But when testing this scenario, I concluded that either the Git credentials would need to be hardcoded into this script, or SSH keys would need to be exchanged between the new worker and Bitbucket.
Both are really ugly solutions, so I think I'll opt for a third: uploading the source code to an S3 bucket and downloading it from there.
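
With boto that third option only takes a few lines; the bucket and key names are illustrative:

    import boto

    s3 = boto.connect_s3()
    key = s3.get_bucket("cbas-code").new_key("worker.tar.gz")
    key.set_contents_from_filename("worker.tar.gz")
    # A presigned URL the new instance can fetch; no credentials end up on the AMI.
    url = key.generate_url(3600)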

When the instance is launched, we send an SNS message to the new worker indicating that we want to start a job.
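
The notification itself is a single boto call (the topic ARN and message format are illustrative):

    import boto

    sns = boto.connect_sns()
    sns.publish("arn:aws:sns:region:account:cbas-workers", "start-job")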

The next part will then be to adapt the worker code to our requirements.