Thursday 20 June 2013

Small success with Oracle Java

With the Oracle version of Java installed, the error no longer seems to occur.
I will try this out more thoroughly in a test that will run during the night.

EDIT: Tests have shown that this is indeed a success, which leads us to conclude that the problem lies in OpenJDK.

Some more results

The error only showed up when dmtcp_restart_script.sh was run through a Java ProcessBuilder, and I never got any error message or other indication of what went wrong.
I therefore decided to build a small test system that mimics the behavior without having to start up the complete CBAS system.
This system consists of the process builder and the S3StreamGobblers running the restart script.
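A minimal sketch of what such a test harness could look like is given below. It is not the actual test system: LineGobbler is a hypothetical stand-in for S3StreamGobbler, and the class and method names are assumptions made for illustration. It starts a command through ProcessBuilder and drains stdout and stderr on separate threads, so that any error output of the child ends up somewhere visible.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RestartHarness {

    // Hypothetical stand-in for S3StreamGobbler: drains one stream into memory.
    static class LineGobbler extends Thread {
        private final InputStream in;
        private final List<String> lines = new ArrayList<>();

        LineGobbler(InputStream in) {
            this.in = in;
        }

        @Override
        public void run() {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            } catch (IOException e) {
                lines.add("[gobbler stopped: " + e.getMessage() + "]");
            }
        }

        // Only call after join(), when the list is no longer being written.
        List<String> lines() {
            return lines;
        }
    }

    // Starts the command, drains stdout and stderr on separate threads, and
    // returns the exit code once the process and both gobblers are done.
    static int run(List<String> command, List<String> out, List<String> err) {
        try {
            Process process = new ProcessBuilder(command).start();
            LineGobbler stdout = new LineGobbler(process.getInputStream());
            LineGobbler stderr = new LineGobbler(process.getErrorStream());
            stdout.start();
            stderr.start();
            int exitCode = process.waitFor();
            stdout.join();
            stderr.join();
            out.addAll(stdout.lines());
            err.addAll(stderr.lines());
            return exitCode;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: RestartHarness <path to dmtcp_restart_script.sh>");
            return;
        }
        List<String> out = new ArrayList<>();
        List<String> err = new ArrayList<>();
        int exitCode = run(Arrays.asList(args), out, err);
        out.forEach(line -> System.out.println("stdout: " + line));
        err.forEach(line -> System.out.println("stderr: " + line));
        System.out.println("exit code: " + exitCode);
    }
}
```

Keeping the harness this small makes it easy to swap the gobblers out later and compare against a plain ProcessBuilder.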

And lo and behold, the error appeared again, but now with additional error output!
Unfortunately it seems to be a JVM problem:

20/06 17:36:43 :: 418
20/06 17:36:48 :: 419
20/06 17:36:53 :: 420
#

[error occurred during error reporting (printing fatal error message), id 0x4]

#
#  SIGILL (0x4) at pc=0x00007f2960897dd8, pid=40000
[error occurred during error reporting (printing current thread and pid), id 0x4]

#

[error occurred during error reporting (printing Java version string), id 0x4]


[error occurred during error reporting (printing problematic frame), id 0x4]

#
[error occurred during error reporting (printing core file information), id 0x4]


[error occurred during error reporting , id 0x4]


[error occurred during error reporting , id 0x4]


[error occurred during error reporting , id 0x4]


[error occurred during error reporting , id 0x4]


[error occurred during error reporting , id 0x4]


[Too many errors, abort]

I am currently testing with the Oracle JVM to see whether it shows the same behavior.
I also want to compare this with a regular ProcessBuilder without the S3StreamGobbler.

Wednesday 19 June 2013

Manual snapshotting fails

As a final attempt to get some trustworthy behavior from DMTCP, I changed the checkpointing system to use dmtcp_command.
This means that instead of letting DMTCP checkpoint automatically after a given period of time, I now time it myself and issue the checkpoint command.
This can be done with dmtcp_command -c
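The timing part could be sketched as follows. This is an illustration under assumptions, not the code used in the worker: the class and method names are made up, and only the dmtcp_command -c invocation comes from the text. A fixed delay between runs is used so that a slow checkpoint cannot cause invocations to pile up.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class ManualCheckpointer {

    // The invocation named in the text.
    static final List<String> CHECKPOINT_COMMAND = Arrays.asList("dmtcp_command", "-c");

    // Runs one command to completion; returns its exit code, or -1 if it
    // could not be started or the wait was interrupted.
    static int runOnce(List<String> command) {
        try {
            return new ProcessBuilder(command).inheritIO().start().waitFor();
        } catch (IOException e) {
            return -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    // Issues the command repeatedly, waiting periodSeconds between the end of
    // one run and the start of the next.
    static ScheduledExecutorService every(long periodSeconds, List<String> command) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleWithFixedDelay(() -> {
            int exitCode = runOnce(command);
            if (exitCode != 0) {
                System.err.println("checkpoint command exited with " + exitCode);
            }
        }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return timer;
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: ManualCheckpointer <period in seconds>");
            return;
        }
        // Keeps running until the JVM is stopped: the timer thread is non-daemon.
        every(Long.parseLong(args[0]), CHECKPOINT_COMMAND);
    }
}
```

Logging the exit code of every invocation at least shows whether the checkpoint request itself was accepted.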

This causes restarts to sometimes crash with exit code 134, which by the usual shell convention (128 + signal number) points to SIGABRT.
It is very frustrating not to have consistent behavior, but I have no way of explaining it.
In some cases the restart works; in others it fails with the above exit code.

I also noticed that whenever a restart does work, no second snapshot is taken afterwards.
So I will remove the new dmtcp_command-based method again.

When I try it manually it seems to work, so perhaps it has something to do with output not being read?
I have not found any information about this causing the child process to crash.

Tuesday 18 June 2013

All Java tests failed once again

For some inexplicable reason all of my Java tests have failed, with both the VM checkpointing and the DIRECTORY archiving strategy.

They simply stopped checkpointing for some unknown reason.
Examination of the output shows a lot of restarted calculations.
But something is very odd about them: they always restart from the same point and continue to the same point, after which they restart once more.

For example, a counter that should go from 0 to 3600 is now stuck between 360 and around 838.
Then it starts again from 360, with no checkpoint output in between.
This is very strange, because going from 360 to 838 with 5 seconds between counts takes about 40 minutes ((838 - 360) × 5 s = 2390 s), while a snapshot should start after 30 minutes.
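The arithmetic can be checked with a few lines of Java. The values are the approximate ones from the log (the counter stops "around" 838), and the class and method names are only illustrative.

```java
class CounterWindow {

    // Seconds needed to count from one value to another at a fixed tick length.
    static long elapsedSeconds(int from, int to, int secondsPerTick) {
        return (long) (to - from) * secondsPerTick;
    }

    static String format(long seconds) {
        return (seconds / 60) + " min " + (seconds % 60) + " s";
    }

    public static void main(String[] args) {
        long elapsed = elapsedSeconds(360, 838, 5);
        long checkpointInterval = 30 * 60; // a snapshot should start after 30 minutes

        System.out.println("time between restarts: " + format(elapsed));
        System.out.println("longer than the checkpoint interval: "
                + (elapsed > checkpointInterval));
    }
}
```

So each loop runs well past the 30 minute mark, which means at least one checkpoint should have been attempted before the restart.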


When a start and restart are executed manually, there seem to be no problems.

More tests have confirmed that checkpointing somehow stops working after a restart.
I want to declare DMTCP too buggy for further use.
With some luck you can get your application fully operational, but this unreliability is very annoying.
No clear cause was found for the observed behavior.
Finding one requires more thorough knowledge of DMTCP's internals, or simply waiting for a more stable build.

Sunday 16 June 2013

New test versions

The backup system is in place and has already gone through several test runs.
Unfortunately the results are not very good. The problems keep appearing.
I also noticed by accident that when multiple checkpoints are taken before a single restart, the number of errors seems to be greatly reduced.
In fact, when taking snapshots every 5 minutes, there was not a single error in the restart procedure. (Such a short interval nearly caused problems with the VM checkpointing strategy, since that checkpoint takes around 2.5 minutes and in the worst case could take even longer.)
But the runtime with the VM checkpointing strategy doubled.

I also discovered by coincidence that the DMTCP developers have changed their tests to incorporate my proposed solution, but I have not heard much more about this. My current Java tests show no sign of problems with DMTCP when my fix (-XX:-UsePerfData) is in place.

Tuesday 11 June 2013

Test problems

Errors keep occurring during the different tests.
The latest problem is that a started worker seems to be unable to inform the master.
This creates a loop in which the master keeps trying to restart the worker and the worker keeps failing to inform the master.
The strange thing is that the ping messages keep arriving at the master, so we can be sure that the worker has started successfully. So either something inside the master is blocking or the SNS system has stopped working. Since I suspect the master, I have moved the message handling into separate threads to prevent further blocking.
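The idea of threading the message handling can be sketched like this. It is not the CBAS master code: the class name, the thread pool size and the message names are assumptions. The receiving thread only hands each message to a pool, so a handler that blocks can no longer hold up the messages behind it.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class MessageDispatcher {
    private final ExecutorService pool;
    private final Consumer<String> handler;

    MessageDispatcher(int threads, Consumer<String> handler) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.handler = handler;
    }

    // Called by the receiving thread; returns immediately.
    void onMessage(String message) {
        pool.execute(() -> {
            try {
                handler.accept(message);
            } catch (RuntimeException e) {
                System.err.println("handler failed for " + message + ": " + e);
            }
        });
    }

    // Stops accepting messages and waits for the running handlers to finish.
    boolean shutdownAndWait(long timeoutSeconds) {
        pool.shutdown();
        try {
            return pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        MessageDispatcher dispatcher = new MessageDispatcher(4,
                message -> System.out.println("handled " + message));
        dispatcher.onMessage("WORKER_STARTED"); // hypothetical message names
        dispatcher.onMessage("PING");
        dispatcher.shutdownAndWait(5);
    }
}
```

With a pool, messages may be handled out of order, so this only fits handlers that do not depend on arrival order.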

I have also noted that restart errors occur with the directory checkpointing method as well, albeit less frequently.

The VM snapshotting method is still just as unreliable. It would be nice if we could check whether a given snapshot is "good". Perhaps this is something that can be suggested to the DMTCP developers?

The developers have also contacted me again about the Java issues and have acknowledged that there are indeed problems, not only with Java but with some other applications as well. They are currently trying to fix those.

I have been thinking about using the next-to-last snapshot and will start implementing this method.
Until now I have been hesitant to use it because of the additional management difficulties and the large overhead of going back that far. But the VM checkpointing is just not stable enough and really requires it.
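One possible way to manage this is sketched below, under the assumption that a restart first tries the newest snapshot and then falls back to the one before it. The class name and the snapshot identifiers are hypothetical. The history keeps the last N snapshots and reports which older one can be deleted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class SnapshotHistory {
    private final int keep;
    private final Deque<String> newestFirst = new ArrayDeque<>();

    SnapshotHistory(int keep) {
        if (keep < 1) {
            throw new IllegalArgumentException("keep must be at least 1");
        }
        this.keep = keep;
    }

    // Records a finished snapshot. Returns the id of the snapshot that has
    // fallen out of the history and can be deleted, or null if there is none.
    String record(String snapshotId) {
        newestFirst.addFirst(snapshotId);
        return newestFirst.size() > keep ? newestFirst.removeLast() : null;
    }

    // Snapshots to try on restart: the newest first, then the next to last.
    List<String> restartOrder() {
        return new ArrayList<>(newestFirst);
    }

    public static void main(String[] args) {
        SnapshotHistory history = new SnapshotHistory(2);
        history.record("snap-1"); // hypothetical snapshot identifiers
        history.record("snap-2");
        String obsolete = history.record("snap-3");
        System.out.println("can delete: " + obsolete);
        System.out.println("restart order: " + history.restartOrder());
    }
}
```

Keeping two snapshots doubles the storage needed, which is part of the management overhead mentioned above.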

Friday 7 June 2013

Scheduler

The scheduler that will be used is the one written by Kurt Vermeersch.
This scheduler uses static data about Amazon EC2 and information about a task in order to schedule it.
It is mainly suited for scheduling multiple tasks that need to be run at the same time.
An initial integration of the scheduler will not be able to use this feature, though.
Some new considerations have to be made:

- In the current CBAS system single tasks are supplied to be executed.
- These are then scheduled to execute with a minimal cost.
- This scheduling is done independently from the other tasks.

This is a very basic implementation and should be reconsidered.
I would suggest a system where the master waits for a given amount of time, collects all the jobs that arrive in that window and then schedules them together.
Better still would be to include all current tasks in the scheduling process, even those that are already running.
But this would require changes to the scheduler.
If it could take into account how long a given task has already been running, how close it is to the next hourly payment and how close it is to its deadline, it should be possible to temporarily halt a task because the scheduler knows that a cheaper period is approaching.
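The waiting window from the first suggestion could look roughly like this. It is only a sketch: the class name, the queue of task identifiers and the window length are assumptions, and the call into the scheduler itself is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class JobBatcher {

    // Collects every job that arrives within the window, in arrival order.
    static List<String> collect(BlockingQueue<String> incoming, long windowMillis) {
        List<String> batch = new ArrayList<>();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(windowMillis);
        try {
            while (true) {
                long remaining = deadline - System.nanoTime();
                if (remaining <= 0) {
                    break;
                }
                String job = incoming.poll(remaining, TimeUnit.NANOSECONDS);
                if (job == null) {
                    break; // window elapsed without a further job
                }
                batch.add(job);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return batch;
    }

    public static void main(String[] args) {
        BlockingQueue<String> incoming = new LinkedBlockingQueue<>();
        incoming.add("task-1"); // hypothetical task identifiers
        incoming.add("task-2");
        List<String> batch = collect(incoming, 200);
        System.out.println("scheduling together: " + batch);
    }
}
```

The window length trades latency for batch size: a longer window gives the scheduler more tasks to combine but delays the first task in the batch.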

Another remark concerns the two workload models that are supported by Kurt's scheduler.
Only the first one is supported at the moment.


A first attempt at integrating the scheduler has been implemented.
Some more test cases are running, Java tests included.

Java checkpointing fixed!

After some more debugging and scrutiny of the DMTCP library, I think I have found a solution to the Java checkpointing problem.
It seems that the culprit was the Java monitoring system enabled by UsePerfData.
This flag enables the jvmstat instrumentation that Java uses for performance testing and problem isolation.
(Source: UsePerfData)
It also stores data in shared memory and on disk, data which DMTCP seemingly cannot handle.
The flag is turned on by default, and disabling it has fixed our checkpoint and restart problems.
This can be done by supplying -XX:-UsePerfData to the JVM.
I have sent the suggestion to the DMTCP developers.
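When the JVM is started from Java code, the flag simply goes in front of the main class on the command line. The sketch below only builds that command line; the class name JvmCommand and the example main class CounterTest are hypothetical, and how the command is wrapped by DMTCP is deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class JvmCommand {

    // Builds "java -XX:-UsePerfData <mainClass> <args...>".
    static List<String> withoutPerfData(String mainClass, String... appArgs) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-XX:-UsePerfData"); // the flag from the text; JVM options precede the main class
        command.add(mainClass);
        command.addAll(Arrays.asList(appArgs));
        return command;
    }

    public static void main(String[] args) {
        // CounterTest is a hypothetical main class used only for illustration.
        System.out.println(String.join(" ", withoutPerfData("CounterTest", "3600")));
    }
}
```

The resulting list can be passed directly to a ProcessBuilder.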

Thursday 6 June 2013

CT Checkpointing and others

I have managed to get the CT library working and have started testing DMTCP with it.
So far it works with the standard 1.2.7 release of DMTCP, but it does not work with the latest SVN release.
After trying the SVN checkout with the fix I had supplied to get VM snapshotting working, I noticed that restarting did not work there either.
I think it is a deadlock, since this is the last debug message received:

[40000] TRACE at connectionlist.cpp:387 in refill; REASON='Waiting for Missing Cons'

Java was also tested some more: the main Java test supplied by DMTCP already fails with 1.2.6.
This is in contrast with my local system, where these tests do not fail.
Further investigation has shown that the snapshots seem to work on 32-bit systems but not on 64-bit systems. This explains why it worked on my local system, which runs a 32-bit OS.
On 1.2.7 both systems fail the Java checkpoint tests.
All were using the most recent Java SDK: OpenJDK 1.7


Another problem was noted concerning the user ID used to run the DMTCP commands.
User data supplied to an instance at launch is executed as root.
Consequently, the workers were running as root and were launching processes as root as well.
DMTCP does not like to run as root, but there are some ways to circumvent this.
Those workarounds turned out to be inadequate, so I removed them in favor of changing the user.
Now the worker and the user processes run as the default user (ubuntu in the case of Ubuntu AMIs).
Note to myself: don't use "su username command" but use "sudo -u username command".


Meanwhile, the implementation of directory snapshotting is nearly complete.

Wednesday 5 June 2013

Java checkpointing difficulties

Since the last update of DMTCP, I am unable to checkpoint Java applications.

The output received from DMTCP itself showed the following behavior:

[40000] TRACE at dmtcpworker.cpp:518 in waitForCoordinatorMsg; REASON='waiting for REGISTER_NAME_SERVICE_DATA message'
[40000] TRACE at dmtcpworker.cpp:670 in waitForStage3Refill; REASON='Key Value Pairs registered with the coordinator'
[40000] TRACE at dmtcpworker.cpp:518 in waitForCoordinatorMsg; REASON='waiting for SEND_QUERIES message'
[40000] TRACE at dmtcpworker.cpp:675 in waitForStage3Refill; REASON='Queries sent to the coordinator'
[40000] TRACE at dmtcpworker.cpp:518 in waitForCoordinatorMsg; REASON='waiting for REFILL message'
[40000] TRACE at kernelbufferdrainer.cpp:159 in refillAllSockets; REASON='refilling socket buffers'
     _drainedData.size() = 0
[40000] TRACE at kernelbufferdrainer.cpp:198 in refillAllSockets; REASON='buffers refilled'
[40000] TRACE at dmtcpworker.cpp:689 in waitForStage4Resume; REASON='refilled'
[40000] TRACE at dmtcpworker.cpp:518 in waitForCoordinatorMsg; REASON='waiting for RESUME message'
[40000] TRACE at dmtcpworker.cpp:692 in waitForStage4Resume; REASON='got resume message'

After that, a segmentation fault occurs.
This segfault could be traced back to:

#0  0x00007f7386878bf1 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f738720b072 in length (__s=0x7f737bdff000 <Address 0x7f737bdff000 out of bounds>) at /usr/include/c++/4.6/bits/char_traits.h:261
#2  operator<< <std::char_traits<char> > (__s=0x7f737bdff000 <Address 0x7f737bdff000 out of bounds>, __out=...) at /usr/include/c++/4.6/ostream:515
#3  Print<char> (this=<optimized out>, t=<optimized out>) at ../../dmtcp/jalib/jassert.h:145
#4  dmtcp::FileConnList::remapShmMaps (this=0x7f7387665508) at file/fileconnlist.cpp:243
#5  0x00007f73871d6652 in dmtcp::ConnectionList::processEvent (this=0x7f7387665508, event=<optimized out>, data=<optimized out>)
    at connectionlist.cpp:113
#6  0x00007f73871cabd0 in dmtcp_process_event (event=DMTCP_EVENT_THREADS_RESUME, data=0x7f738549b5a0) at ipc.cpp:43
#7  0x00007f7386f3dfeb in dmtcp::DmtcpWorker::processEvent (event=DMTCP_EVENT_THREADS_RESUME, data=0x7f738549b5a0) at dmtcpworker.cpp:703
#8  0x00007f7386f3dee5 in dmtcp::DmtcpWorker::waitForStage4Resume (this=0x7f73871b0574, isRestart=false) at dmtcpworker.cpp:695
#9  0x00007f7386f517e7 in callbackPostCheckpoint (isRestart=0, mtcpRestoreArgvStartAddr=0x0) at mtcpinterface.cpp:235
#10 0x00007f73859b53d8 in checkpointhread (dummy=0x0) at mtcp.c:1991
#11 0x00007f7386f538f7 in pthread_start (arg=0x7f7387661248) at threadwrappers.cpp:121
#12 0x00007f7386500e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00007f7386f53576 in clone_start (arg=0x7f7387661288) at threadwrappers.cpp:71
#14 0x00007f7386cf1154 in clone_start (arg=<optimized out>) at pid_miscwrappers.cpp:100
#15 0x00007f7386809ccd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#16 0x0000000000000000 in ?? ()

Unfortunately, this did not give me a workable starting point for a patch.
The DMTCP developers were informed once more of these findings.
Hopefully they can provide some assistance in locating the problem.