Errors keep on occurring during the different tests.
A last problem that was noted is that a started worker seems to be unable to inform the Master.
This creates a loop of the master trying to restart the worker and the worker being unable to inform the master.
Strange thing is that the ping messages keep on arriving at the master so we can be sure that the worker has successfully started. So something inside the Master is blocking or the SNS system has decided to stop working. Since I suspect the Master being the problem, I have currently threaded the message handling part in order to prevent further blockage.
I have also noted that restart errors also occur with the directory checkpointing methodology.
Albeit less frequently.
The VM snapshotting method still behaves with the same insecurity. It would be nice if we could check if a given snapshot is "good". Perhaps something that can be suggested to the DMTCP developers?
The developers have also contacted me again about the Java issues and have recognized that there are indeed problems. Not only with Java but some others as well. They are currently trying to fix those.
I have been thinking about the usage of the next to last snapshot and will start implementing this method.
Until now I have been hesitant to use it due to additional management difficulties and the large overhead of going back so much. But the VM checkpointing is just not stable enough and really requires it.
Geen opmerkingen:
Een reactie posten