Thursday, 20 June 2013

Small success with Oracle Java

Installing the Oracle version of Java no longer seems to give the error.
I will try this out more thoroughly in a test that will run during the night.

EDIT: Tests have shown that this is indeed a success, which leads us to conclude that the problem lies in OpenJDK.

Some more results

The error behaviour only showed up while running dmtcp_restart_script.sh through a Java ProcessBuilder, but I never got any error messages or any indication of what went wrong.
So I decided to build a small test system and mimic the behaviour without having to start up the complete CBAS system.
This system consists of the ProcessBuilder and the S3StreamGobblers running the restart script.
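
Roughly, the test harness boils down to the sketch below (the real S3StreamGobbler also ships the captured output to S3; the simplified gobbler and class names here are only illustrative):

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

// Simplified stand-in for the S3StreamGobbler: drains one stream of the child process.
class StreamGobbler extends Thread {
    private final InputStream stream;
    StreamGobbler(InputStream stream) { this.stream = stream; }
    public void run() {
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // the real gobbler also pushes this output to S3
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

public class RestartTest {
    public static void main(String[] args) throws Exception {
        ProcessBuilder pb = new ProcessBuilder("/bin/bash", "dmtcp_restart_script.sh");
        Process p = pb.start();
        new StreamGobbler(p.getInputStream()).start(); // stdout
        new StreamGobbler(p.getErrorStream()).start(); // stderr
        System.out.println("exit code: " + p.waitFor());
    }
}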

And lo and behold, the error appeared again, but this time with additional error output!
Unfortunately it seems to be a Java JVM problem:

20/06 17:36:43 :: 418
20/06 17:36:48 :: 419
20/06 17:36:53 :: 420
#

[error occurred during error reporting (printing fatal error message), id 0x4]

#
#  SIGILL (0x4) at pc=0x00007f2960897dd8, pid=40000
[error occurred during error reporting (printing current thread and pid), id 0x4]

#

[error occurred during error reporting (printing Java version string), id 0x4]


[error occurred during error reporting (printing problematic frame), id 0x4]

#
[error occurred during error reporting (printing core file information), id 0x4]


[error occurred during error reporting , id 0x4]


[error occurred during error reporting , id 0x4]


[error occurred during error reporting , id 0x4]


[error occurred during error reporting , id 0x4]


[error occurred during error reporting , id 0x4]


[Too many errors, abort]

I am currently testing this with the Oracle JVM to see whether it shows the same behaviour.
I also want to compare this with a plain ProcessBuilder without the S3StreamGobbler.

Wednesday, 19 June 2013

Manual snapshotting fails

As a final attempt to get some trustworthy behaviour out of DMTCP, I changed the checkpointing system to use dmtcp_command.
This means that instead of letting DMTCP automatically checkpoint after a given period of time, I now time it myself and issue the checkpoint command.
This can be done with dmtcp_command -c.
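
In sketch form, the master-side timing looks roughly like this (the class name and the 30-minute interval are only illustrative, not the actual CBAS code):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CheckpointTimer {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Ask the running DMTCP coordinator for a checkpoint every 30 minutes.
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                try {
                    Process p = new ProcessBuilder("dmtcp_command", "-c").inheritIO().start();
                    System.out.println("checkpoint request exited with " + p.waitFor());
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }, 30, 30, TimeUnit.MINUTES);
    }
}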

With this new method, restarts sometimes crash with exit code 134.
It is very frustrating to not have consistent behaviour, but I have no other way of explaining it.
In some cases the restart works, in others it does not, with the above error code as a result.

Next to that, I also noticed that whenever a restart works, it no longer takes a second snapshot.
So I will remove the new dmtcp_command-based method again.

Whenever I try it manually it seems to work, so perhaps it has something to do with output not being read?
I have not found any information about this causing crashes in the child process.

Tuesday, 18 June 2013

All Java tests failed once again

For some inexplicable reason all of my Java tests have failed, both the VM checkpointing and the DIRECTORY archiving strategy.

They simply stopped the checkpointing procedure for some unknown reason.
Examination of the output shows a lot of restarted calculations.
But something is very odd about them: they always restart from the same point and then continue to the same point, after which they restart once more.

For example: a counter that should go from 0 to 3600 is now stuck looping from 360 to around 838.
Then it starts again from 360, with no checkpoint output in between either.
This is very strange, because going from 360 to 838 with 5 seconds between each count would take about 39 minutes and 45 seconds, while a snapshot should start after 30 minutes.


Whenever a manual start and restart is executed, there seem to be no problems.

More tests have shown that checkpointing indeed somehow stops working after a restart.
I am inclined to declare DMTCP too buggy for further use.
With some luck you can have your application fully operational, but this is a very annoying factor.
No clear explanation was found for the observed behaviour.
More thorough knowledge of DMTCP's internals is required, or simply waiting for a more stable build.

Sunday, 16 June 2013

New test versions

The backup system is in place and has already done several test runs.
Unfortunately not with very good results; the problems keep on appearing.
I have also noticed by accident that when running multiple checkpoints before a single restart, the number of errors is greatly reduced.
In fact, when taking snapshots every 5 minutes (which nearly caused problems with the VM checkpointing strategy, since it takes around 2.5 minutes to perform such a checkpoint and in the worst case could take even longer), there was not a single error in the restart procedure.
But the runtime of the VM checkpointing strategy was doubled.

I also coincidentally discovered that DMTCP has changed their tests to incorporate my proposed solution, but I have not heard much more about this. My current Java tests show no sign of problems with DMTCP when my fix (-XX:-UsePerfData) is in place.

Tuesday, 11 June 2013

Test problems

Errors keep occurring during the different tests.
The latest problem that was noted is that a started worker seems to be unable to inform the Master.
This creates a loop of the master trying to restart the worker and the worker being unable to inform the master.
The strange thing is that the ping messages keep arriving at the master, so we can be sure that the worker has started successfully. So something inside the Master is blocking, or the SNS system has decided to stop working. Since I suspect the Master is the problem, I have now threaded the message handling part in order to prevent further blocking.
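
The change amounts to something like the following (class and method names are illustrative, not the actual CBAS code):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hand each incoming message to a worker thread so a slow or blocking handler
// can no longer stall the master's receive loop.
public class MessageDispatcher {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    public void onMessage(final String message) {
        pool.submit(new Runnable() {
            public void run() {
                handle(message); // previously ran directly on the receiving thread
            }
        });
    }

    private void handle(String message) {
        // parse and process the SNS notification (ping, job status, ...)
    }
}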

I have also noted that restart errors occur with the directory checkpointing methodology as well, albeit less frequently.

The VM snapshotting method still behaves with the same unreliability. It would be nice if we could check whether a given snapshot is "good". Perhaps something that can be suggested to the DMTCP developers?

The developers have also contacted me again about the Java issues and have recognized that there are indeed problems. Not only with Java but some others as well. They are currently trying to fix those.

I have been thinking about using the next-to-last snapshot and will start implementing this method.
Until now I have been hesitant to use it due to the additional management difficulties and the large overhead of going back so far, but the VM checkpointing is simply not stable enough and really requires it.

Friday, 7 June 2013

Scheduler

The scheduler that will be used is the one written by Kurt Vermeersch.
This scheduler uses static data about Amazon EC2 and information about a task in order to schedule it.
It is mainly suited for scheduling multiple tasks that need to be run at the same time.
An initial integration of the scheduler will not be able to use this feature though.
Some new considerations should be made:

- In the current CBAS system single tasks are supplied to be executed.
- These are then scheduled to execute with a minimal cost.
- This scheduling is done independently from the other tasks.

This is actually a basic implementation and should be reconsidered.
I would suggest a system where the master waits a given amount of time for all the jobs that arrive and then combines them.
Even better would be to use all the current tasks in the scheduling process, even those that are already running.
But this would require changes to the scheduler.
If it could take into account the time a given task has already been running, the proximity to the hourly payment boundary and the proximity to the deadline, it should be possible to temporarily halt a given task from executing, because the scheduler might know about a cheaper period that is approaching.

Another remark concerns the two workload models that Kurt's scheduler supports: only the first one is handled by the integration at the moment.


A first attempt at integrating the scheduler has been implemented.
Some more test cases are running, Java tests included.

Java checkpointing fixed!

After some more debugging attempts and scrutiny of the DMTCP library, I think I have found a solution to the Java checkpointing problem.
It seems that the culprit was the Java monitoring system enabled by UsePerfData.
This lets Java use the jvmstat instrumentation for performance testing and problem isolation purposes.
(Source: UsePerfData)
It also stores data in shared memory and on disk, data which DMTCP seemingly cannot handle.
This flag is turned on by default, but disabling it has fixed our checkpoint and restart problems.
This can be done by supplying -XX:-UsePerfData to the JVM.
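
For reference, a worker launching a job would pass the flag along roughly like this (dmtcp_checkpoint is the DMTCP 1.2.x launcher; job.jar is only a placeholder):

public class LaunchWithoutPerfData {
    public static void main(String[] args) throws Exception {
        // Disable the jvmstat shared-memory instrumentation that DMTCP cannot handle.
        ProcessBuilder pb = new ProcessBuilder(
                "dmtcp_checkpoint",                       // DMTCP 1.2.x launcher
                "java", "-XX:-UsePerfData", "-jar", "job.jar");
        pb.inheritIO();
        System.exit(pb.start().waitFor());
    }
}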
I have sent the suggestion to the DMTCP developers.

Thursday, 6 June 2013

CT Checkpointing and others

I have managed to get the CT library working and have started testing DMTCP with it.
So far it works with the standard 1.2.7 release of DMTCP, but it does not work with the latest SVN release.
After trying the SVN checkout of the fix I had supplied to get the VM snapshotting working, it was also noted that restarting did not work.
Something with a deadlock, I think, since this is the last received debug message:

[40000] TRACE at connectionlist.cpp:387 in refill; REASON='Waiting for Missing Cons'

Java has also been tested some more; the main test supplied by DMTCP already fails when 1.2.6 is used.
This is in contrast with my local system, where these tests do not fail.
Further investigation has shown that on 32-bit systems the snapshots seem to work while on a 64-bit system they do not. This explains why it worked on my local system, since it runs a 32-bit OS.
On 1.2.7 both systems fail the Java checkpoint tests.
All were using the most recent Java SDK: OpenJDK 1.7.


Another problem was noted concerning the user id used to run the dmtcp commands.
When supplying userdata to an instance at launch time, it is executed as root.
Consequently, the workers were running at root level and were launching processes in this state as well.
DMTCP does not like to run as root, but there are some ways to circumvent this.
Those solutions turned out not to be adequate and have been removed in favour of changing the user level.
Now the worker and the user processes run at the default user level (ubuntu in the case of Ubuntu AMIs).
Note to self: don't use "su username command" but use "sudo -u username command".


Meanwhile, the directory snapshotting implementation is nearly complete.

Wednesday, 5 June 2013

Checkpointing difficulties with Java

Since the last update of DMTCP, I am unable to checkpoint Java applications.

The output received from DMTCP itself showed the following behavior:

[40000] TRACE at dmtcpworker.cpp:518 in waitForCoordinatorMsg; REASON='waiting for REGISTER_NAME_SERVICE_DATA message'
[40000] TRACE at dmtcpworker.cpp:670 in waitForStage3Refill; REASON='Key Value Pairs registered with the coordinator'
[40000] TRACE at dmtcpworker.cpp:518 in waitForCoordinatorMsg; REASON='waiting for SEND_QUERIES message'
[40000] TRACE at dmtcpworker.cpp:675 in waitForStage3Refill; REASON='Queries sent to the coordinator'
[40000] TRACE at dmtcpworker.cpp:518 in waitForCoordinatorMsg; REASON='waiting for REFILL message'
[40000] TRACE at kernelbufferdrainer.cpp:159 in refillAllSockets; REASON='refilling socket buffers'
     _drainedData.size() = 0
[40000] TRACE at kernelbufferdrainer.cpp:198 in refillAllSockets; REASON='buffers refilled'
[40000] TRACE at dmtcpworker.cpp:689 in waitForStage4Resume; REASON='refilled'
[40000] TRACE at dmtcpworker.cpp:518 in waitForCoordinatorMsg; REASON='waiting for RESUME message'
[40000] TRACE at dmtcpworker.cpp:692 in waitForStage4Resume; REASON='got resume message'

Then a segmentation fault is received.
This segfault could be traced back to:

#0  0x00007f7386878bf1 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f738720b072 in length (__s=0x7f737bdff000 <Address 0x7f737bdff000 out of bounds>) at /usr/include/c++/4.6/bits/char_traits.h:261
#2  operator<< <std::char_traits<char> > (__s=0x7f737bdff000 <Address 0x7f737bdff000 out of bounds>, __out=...) at /usr/include/c++/4.6/ostream:515
#3  Print<char> (this=<optimized out>, t=<optimized out>) at ../../dmtcp/jalib/jassert.h:145
#4  dmtcp::FileConnList::remapShmMaps (this=0x7f7387665508) at file/fileconnlist.cpp:243
#5  0x00007f73871d6652 in dmtcp::ConnectionList::processEvent (this=0x7f7387665508, event=<optimized out>, data=<optimized out>) at connectionlist.cpp:113
#6  0x00007f73871cabd0 in dmtcp_process_event (event=DMTCP_EVENT_THREADS_RESUME, data=0x7f738549b5a0) at ipc.cpp:43
#7  0x00007f7386f3dfeb in dmtcp::DmtcpWorker::processEvent (event=DMTCP_EVENT_THREADS_RESUME, data=0x7f738549b5a0) at dmtcpworker.cpp:703
#8  0x00007f7386f3dee5 in dmtcp::DmtcpWorker::waitForStage4Resume (this=0x7f73871b0574, isRestart=false) at dmtcpworker.cpp:695
#9  0x00007f7386f517e7 in callbackPostCheckpoint (isRestart=0, mtcpRestoreArgvStartAddr=0x0) at mtcpinterface.cpp:235
#10 0x00007f73859b53d8 in checkpointhread (dummy=0x0) at mtcp.c:1991
#11 0x00007f7386f538f7 in pthread_start (arg=0x7f7387661248) at threadwrappers.cpp:121
#12 0x00007f7386500e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00007f7386f53576 in clone_start (arg=0x7f7387661288) at threadwrappers.cpp:71
#14 0x00007f7386cf1154 in clone_start (arg=<optimized out>) at pid_miscwrappers.cpp:100
#15 0x00007f7386809ccd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#16 0x0000000000000000 in ?? ()

Unfortunately, this did not give me a workable starting point for finding a patch.
The DMTCP developers were informed once more of these findings.
Hopefully they can provide some assistance in locating the problem.

Sunday, 7 April 2013

Data gathering

The checkpointing script still showed some instability throughout the entire week, but this seems to be solved now.
I am getting more and more data and must say that the results look quite promising.
With my initial counter test, the checkpoints now take around 20 seconds.
I have been searching for the test that DMTCP refers to in their paper (concerning software used at CERN called CMS), but I was unable to find it.
Currently I would like to run more advanced tests.
For example, since I have been working on CT scan reconstructions, I will try to get that running as well.

The essential spot market code is now also present.
What remains is diving into the broker prototype and looking at its possible usage in our scenario.
I already got the code up and running and performed the test as explained.

Sunday, 31 March 2013

Checkpointing, fully operational

Finally, after a lot of trial and error I have managed to pinpoint the cause of the checkpointing issue.

Whenever a checkpoint is taken by DMTCP, it uses a temporary folder to store information.
If it is not set by the user, a folder is created in the /tmp directory.
The problem originated from the snapshot taken of the EC2 instance.
When such a snapshot is taken and restarted, the /tmp folder is empty.

This caused problems for DMTCP, presumably because it is unable to find the expected folder inside /tmp.
To solve this issue, it was sufficient to change DMTCP_TMPDIR to a different and more permanent directory (/dmtcp was chosen).
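
On the worker side this is simply a matter of pointing DMTCP at the permanent directory, for example like the sketch below (the launcher command and job name are placeholders; the restart path needed the --tmpdir flag instead, as described next):

import java.util.Map;

public class LaunchWithTmpdir {
    public static void main(String[] args) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(
                "dmtcp_checkpoint", "java", "-jar", "job.jar");
        // Keep DMTCP's working files out of /tmp, which is empty after the
        // EC2 snapshot is restored.
        Map<String, String> env = pb.environment();
        env.put("DMTCP_TMPDIR", "/dmtcp");
        pb.inheritIO();
        System.exit(pb.start().waitFor());
    }
}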

Then I noticed that the restart scripts did not support the --tmpdir flag, which puzzled me for a while since I saw no other way of enforcing a different temporary directory.
(Setting the DMTCP_TMPDIR variable didn't seem to work?)
After looking at the code once more, the flag did seem to exist inside the dmtcp_restart executable.
So I altered the dmtcp_coordinator file so that it supports the --tmpdir flag.

After this problem I also discovered another bug in DMTCP, in an implementation that retrieves the directories out of a path.

I combined my solutions in a patch and have sent it to the DMTCP developers.

Currently checkpointing works perfectly, and restarts do as well.
I now have to devise a more effective way of timing.
Currently the timings show values between 25 and 60 seconds, with the largest overhead being the VM snapshot.

Restarts have not been timed yet, but currently I wait at least 2 minutes before declaring a worker dead (pings are sent every 30 seconds).

Next to that, I have read the papers from Andrzejak and Javadi.
But they have not enlightened me; further investigation is required.

My next plan is to gather as much data as possible about the time it takes to perform snapshots, and to get the broker prototype up and running.

Saturday, 23 March 2013

Checkpoint timing seems difficult?

It is now possible to restart a checkpointed application through the use of some additional timers and the sync function.
But another problem has arisen: it seems to be impossible to restart from a checkpoint that was created after a previous restart.
At the moment I fail to see the cause of this, but I hope to resolve it fairly quickly.

Another interesting aspect is the time it takes to create snapshots.
Once again I tried to measure the time with the plugin system of DMTCP, but this does not seem to work.
The order in which the plugin events are executed does not seem to be as predictable as I would want it to be.

Since I had already noticed some timing code inside DMTCP, I was quite sure there was a way to use this to my advantage.
There is indeed an --enable-timing flag when configuring DMTCP, which writes timing results to the error stream and to a jtimings.csv file.

But even this method gives me timestamps that do not take the VM snapshot into account.
I seriously doubt the way they calculate those timing values and will contact the developers once more.

Monday, 18 March 2013

Checkpointing issues

Having debugged throughout the entire weekend, I was completely surprised by my inability to restart my checkpointed programs. In an attempt to figure out what went wrong, I decided to once more dive into the DMTCP code and find the problem.

I noticed that the checkpoint file was always being generated, but the restart scripts were empty in the snapshots on Amazon. This is in contrast to the restart scripts on the running instance, which were perfectly fine.
The only way this can happen is if the file has not yet been completely written to disk when the snapshot is taken.

This amazed me, since the point where I was taking that snapshot was after the files are written to disk.

After some digging I figured out my mistake.

DMTCP runs with a DMTCP_Coordinator, which is in charge of taking the snapshots and restarting them.
The process you want to run uses a DMTCP_Worker.

This worker is where the plugin system is operational, but NOT where the files are written to disk.
These two mechanisms are decoupled from one another; therefore I cannot be sure that the files are written to disk at the given DMTCP plugin event.

I have contacted the developers once more in the hope of finding a solution for this problem.




Wednesday, 13 March 2013

Successful jobs!

Having worked quite a lot the past few days, it's time to make another post on the progress.

I continued on the system and tackled the worker.
Next to that, I moved my project from Bitbucket to GitHub.
The DMTCP and job files required on the basic CBAS AMI are fixed as well.

The first jobs have run and the first checkpoints are created.

Plenty of bug fixes were required, and I have added the restarting functionality today.

This still needs to be tested a bit more thoroughly.

Afterwards we can start with the main part of this internship: combining our system with something to aid us in the decision of bid price and instance type.
Currently this is implemented using the on-demand EC2 system.

Some other things that require additional thought:
- Cleaning of the buckets at some time.
- Making the master more robust.
- Clean the SNS topic of unused HTTP endpoints, this could even become a security issue.



Sunday, 3 March 2013

AMICreator

Continuing with the work, our next challenge was to somehow provide a job with a good starting position.
By this I mean a functional AMI that is capable of running a Worker, where that worker is capable of executing a given job.
This is what the Prologue files are used for.

I first created a basic AMI for the complete CBAS project, starting from Ubuntu 12.04.
I updated the system, installed the default Java version and build-essential, and updated Boto.
Afterwards I compiled the DMTCP version that has the plugin and 'system' support from the latest SVN update they provide.
During this compilation process I noticed that some of the tests provided with DMTCP failed.
So this is by no means a release version and could have bugs while checkpointing certain applications.

Next I created and compiled my plugin that snapshots the virtual machine.
Having then taken a snapshot and created an AMI, this forms the CBAS AMI.

Next up was creating job-specific AMIs using the prologue files provided within the JDL.
To do this I made an AMICreator, which launches an instance and executes all files that are provided in its userdata.
After it has finished, it informs us by posting a message in a temporarily created SQS queue, after which we clean up the used resources.
The files provided in the userdata script are presigned URLs for the prologue files that were uploaded to S3.
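
The core of the AMICreator is roughly the sketch below, using the AWS SDK for Java (the AMI name and queue handling are simplified, error handling is omitted, and the real class may differ):

import com.amazonaws.services.ec2.AmazonEC2Client;
import com.amazonaws.services.ec2.model.CreateImageRequest;
import com.amazonaws.services.ec2.model.TerminateInstancesRequest;
import com.amazonaws.services.sqs.AmazonSQSClient;
import com.amazonaws.services.sqs.model.DeleteQueueRequest;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class AMICreatorSketch {
    /** Waits until the builder instance reports "done", then images and cleans it up. */
    public static String createJobAmi(String instanceId, String queueUrl) {
        AmazonEC2Client ec2 = new AmazonEC2Client();
        AmazonSQSClient sqs = new AmazonSQSClient();

        // The instance was launched earlier with userdata containing the presigned
        // prologue URLs; the last step of that script posts a message on this queue.
        while (sqs.receiveMessage(new ReceiveMessageRequest(queueUrl)
                .withWaitTimeSeconds(20)).getMessages().isEmpty()) {
            // keep long-polling until the prologue scripts have finished
        }

        // Create the job-specific AMI and clean up the temporary resources.
        String amiId = ec2.createImage(new CreateImageRequest(instanceId, "job-ami"))
                          .getImageId();
        sqs.deleteQueue(new DeleteQueueRequest(queueUrl));
        ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(instanceId));
        return amiId;
    }
}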

Having created a job-specific AMI, we use it in our worker manager to start a new instance.
I was planning on launching an instance and then starting a worker on it using the same methodology as the run files in the original CBAS project.
For that reason I had created a private Git repository on Bitbucket.

But when testing this scenario, I concluded that either the git credentials need to be hardcoded into this script, or SSH information needs to be exchanged between the new worker and Bitbucket.
Both are really ugly solutions, so I think I'll opt for a third: uploading the source code to an S3 bucket and downloading it from there.

When the instance is launched, we send an SNS message to the new worker indicating that we want to start a job.

The next part will then be to adapt the worker code to our requirements.

Sunday, 17 February 2013

Frontend

The next part I tackled was an easy-to-use frontend for the system.
In the received implementation there was no need for such a thing, since master and client were running on the same machine.
This is no longer possible in a decent implementation of the system.
As such, there was a need to communicate in some way with the master, send new jobs and view some information about a job.
I have followed my own suggestion from 2 posts ago, with this result:



In the screenshot we are running the frontend locally.
The given bogus information is used to generate a JDL and 2 archives, one for the prologue files and one for the inputsandbox files.
We then request a job id from the master, which is running on an EC2 instance.
We get a reply and start uploading everything to S3, after which we send a message to officially add the job.
All information is also written to the DynamoDB database.

In the process of implementing this, the AWS system and the message system were redesigned and reimplemented.

The next step that needs to be taken is launching the new job on a worker.
This requires first handling the prologue and AMI creation, then deploying everything and making sure the dmtcp process works properly.


Monday, 11 February 2013

Snapshotting successful

After having struggled with DMTCP and having decided to mail the developers, a lot of help has come from their side.
Together it became apparent that I needed the plugin system that DMTCP offers.
Example:

void dmtcp_process_event(DmtcpEvent_t event, void* data)
{
  /* NOTE:  See warning in plugin/README about calls to printf here. */
  switch (event) {
  case DMTCP_EVENT_INIT:
    printf("The plugin containing %s has been initialized.\n", __FILE__);
    break;
  case DMTCP_EVENT_PRE_CHECKPOINT:
    printf("\n*** The plugin is being called before checkpointing. ***\n");
    break;
  case DMTCP_EVENT_POST_CHECKPOINT:
    printf("*** The plugin has now been checkpointed. ***\n");
    break;
  case DMTCP_EVENT_POST_CHECKPOINT_RESUME:
    printf("The process is now resuming after checkpoint.\n");
    break;
  case DMTCP_EVENT_POST_RESTART_RESUME:
    printf("The plugin is now resuming or restarting from checkpointing.\n");
    break;
  case DMTCP_EVENT_PRE_EXIT:
    printf("The plugin is being called before exiting.\n");
    break;
  /* These events are unused and could be omitted.  See dmtcpplugin.h for
   * complete list.
   */
  case DMTCP_EVENT_POST_RESTART:
  case DMTCP_EVENT_RESET_ON_FORK:
  case DMTCP_EVENT_POST_SUSPEND:
  case DMTCP_EVENT_POST_LEADER_ELECTION:
  case DMTCP_EVENT_POST_DRAIN:
  default:
    break;
  }
  NEXT_DMTCP_PROCESS_EVENT(event, data);
}
But as stated in one of my previous posts, system calls were impossible.
After I mentioned this to the developers (Kapil Arya in particular), this function is now supported.
As a result, we can now take complete snapshots of a VM right after the DMTCP checkpoint files are written to disk and before the program we are checkpointing has resumed.

Many thanks to the development team of DMTCP for all their aid!


Tuesday, 8 January 2013

Further developments

Having worked further on the required changes, I have come to the conclusion that nearly everything is going to change.
The current single-threaded, locally running master will become a multithreaded program running remotely (on an EC2 instance), but this has a couple of consequences.
All the local file operations and job handling are going to be replaced.
The messaging structure is going to be transformed to be compatible with our SNS functionality.
I have also created factory-style handling of the messages.
The master will no longer keep information about the jobs locally; this will all be written to the database. Perhaps some kind of caching mechanism can be put in place to make batch updates,
although this can be dangerous, with out-of-date data being seen by the workers.


An EC2 metadata processor has been created; with this functionality it is now possible to query EC2 for information about our instance using the instance metadata functionality:
http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/AESDG-chapter-instancedata.html

The HTTP server is up and running, with which we can now send and receive SNS messages.
To filter out the messages, the Subject field of each SNS message is filled with the receiver's endpoint address (its public DNS).
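
A minimal sketch of that filter, assuming the public DNS name is read from the instance metadata service mentioned above:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class SnsSubjectFilter {
    // EC2 instance metadata service; "public-hostname" is the instance's public DNS name.
    private static final String METADATA_URL =
            "http://169.254.169.254/latest/meta-data/public-hostname";

    static String ownPublicDns() throws Exception {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(METADATA_URL).openStream()));
        try {
            return in.readLine();
        } finally {
            in.close();
        }
    }

    /** Accept only notifications whose Subject names this instance. */
    static boolean isForMe(String snsSubject) throws Exception {
        return ownPublicDns().equals(snsSubject);
    }
}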

Using the object persistence model, the DynamoDB entries of the jobs can now be created, edited and removed.
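
With the object persistence model (DynamoDBMapper in the AWS SDK for Java) a job entry boils down to an annotated class; the table and attribute names below are hypothetical:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBHashKey;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapper;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBTable;

// Hypothetical job entry; the real table layout may differ.
@DynamoDBTable(tableName = "Jobs")
class JobEntry {
    private String jobId;
    private String status;

    @DynamoDBHashKey
    public String getJobId() { return jobId; }
    public void setJobId(String jobId) { this.jobId = jobId; }

    public String getStatus() { return status; }
    public void setStatus(String status) { this.status = status; }
}

public class JobStore {
    private final DynamoDBMapper mapper = new DynamoDBMapper(new AmazonDynamoDBClient());

    public void save(JobEntry job) { mapper.save(job); }
    public JobEntry load(String jobId) { return mapper.load(JobEntry.class, jobId); }
    public void delete(JobEntry job) { mapper.delete(job); }
}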

The WorkerManager was created, which currently only processes the PINGs received from the workers and keeps track of timers. If a timer fires, the corresponding worker gets terminated by informing the ResourceManager. The ResourceManager is capable of launching or terminating instances asynchronously.
I have also noted that the information stored by the ResourceManager, namely which instances are running, is quite important and should be stored either in S3 or DynamoDB to prevent "resource leaking".
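
The ping/timer bookkeeping in the WorkerManager can be pictured like this (a sketch with made-up names; the timeout value is only illustrative):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Restart the per-worker timer on every PING; if no ping arrives before the
// timeout, ask the ResourceManager to terminate that worker's instance.
public class WorkerWatchdog {
    private static final long TIMEOUT_MINUTES = 2;

    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
    private final Map<String, ScheduledFuture<?>> timers =
            new ConcurrentHashMap<String, ScheduledFuture<?>>();

    public void onPing(final String workerId) {
        ScheduledFuture<?> old = timers.get(workerId);
        if (old != null) {
            old.cancel(false);
        }
        timers.put(workerId, scheduler.schedule(new Runnable() {
            public void run() {
                terminateWorker(workerId); // delegate to the ResourceManager
            }
        }, TIMEOUT_MINUTES, TimeUnit.MINUTES));
    }

    private void terminateWorker(String workerId) {
        // ResourceManager.terminate(workerId) in the real system
    }
}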

As a final point of thought, I would like to suggest a small frontend Java program that handles the JDL creation and the initial communication with the master. This will just be for testing purposes and can be replaced by another kind of frontend to the system.
The JDL will have the following layout:

  • Prologue={file1,file2, ...}
  • InputSandbox={file1,file2,...}
  • OutputSandbox={name} (Changed, see remarks*)
  • Arguments=args
  • Executable=executableName
I have tried to stick to the JDL standards but needed some additional info.
As can be seen, the Prologue and OutputSandbox options are new.
Prologue contains the scripts that will run before the execution of the job, to initialize the environment.
OutputSandbox is the name of the file in which the result will be stored (as an archive).
This can of course be extended in future versions.
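
As an illustration, generating such a JDL file could look roughly like this (the exact serialization used by the frontend may differ):

import java.io.FileWriter;
import java.io.IOException;
import java.util.List;

public class JdlWriter {
    /** Writes a JDL file following the layout above. */
    public static void write(String path, List<String> prologue, List<String> input,
                             String outputName, String arguments, String executable)
            throws IOException {
        FileWriter out = new FileWriter(path);
        try {
            out.write("Prologue={" + join(prologue) + "}\n");
            out.write("InputSandbox={" + join(input) + "}\n");
            out.write("OutputSandbox={" + outputName + "}\n");
            out.write("Arguments=" + arguments + "\n");
            out.write("Executable=" + executable + "\n");
        } finally {
            out.close();
        }
    }

    private static String join(List<String> files) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < files.size(); i++) {
            if (i > 0) sb.append(",");
            sb.append(files.get(i));
        }
        return sb.toString();
    }
}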

With the frontend, the JDL is now automatically generated from the files supplied for each option, and the JDL together with the input archive that has been created is stored on S3.
Operation sequence: 
  1. A new task is created on the frontend.
  2. Frontend requests a new job id. 
  3. After receiving it, the frontend creates the JDL file and the archive.
  4. Upload to S3.
  5. Inform master of the newly added job. 
The frontend needs to know the job id up front in order to prevent naming problems.