Finally, after a lot of trial and error I have managed to pinpoint the cause of the checkpointing issue.
Whenever DMTCP takes a checkpoint, it uses a temporary folder to store information.
If the user does not specify one, a folder is created in the /tmp directory.
The problem originated from the snapshot taken of the EC2 instance.
When an instance is restarted from such a snapshot, the /tmp folder is empty.
This caused problems for DMTCP, presumably because it could not read the folder it expected inside /tmp.
To solve this issue, it was sufficient to point DMTCP_TMPDIR at a different, more permanent directory (I chose /dmtcp).
I then noticed that the restart scripts did not support the --tmpdir flag, which puzzled me for a while, since I saw no other way of enforcing a different temporary directory.
(Setting the DMTCP_TMPDIR variable didn't seem to work?)
After looking at the code once more, I found that the flag did exist inside the dmtcp_restart executable.
I therefore altered the dmtcp_coordinator file so that it supports the --tmpdir flag.
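As a rough illustration of the workaround, here is a minimal Python sketch that prepares a persistent temporary directory and builds a restart command for it. It only uses what is mentioned above (the DMTCP_TMPDIR variable, the --tmpdir flag of dmtcp_restart and the /dmtcp directory); the helper names and the checkpoint image name are hypothetical. Since setting the variable alone did not seem to work for me on restart, the sketch passes both the variable and the flag.

```python
import os


def dmtcp_env(tmpdir, base_env=None):
    """Return a copy of the environment with DMTCP_TMPDIR set to tmpdir.

    The directory is created if it does not exist yet, so that it is
    present on a freshly restored instance as well.
    """
    env = dict(os.environ if base_env is None else base_env)
    os.makedirs(tmpdir, exist_ok=True)
    env["DMTCP_TMPDIR"] = tmpdir
    return env


def restart_command(images, tmpdir):
    """Build a dmtcp_restart command line that passes --tmpdir explicitly."""
    return ["dmtcp_restart", "--tmpdir", tmpdir] + list(images)


# Hypothetical usage (the image name is made up for illustration):
#   import subprocess
#   subprocess.run(restart_command(["ckpt_worker.dmtcp"], "/dmtcp"),
#                  env=dmtcp_env("/dmtcp"), check=True)
```

Building the command as a list rather than a shell string avoids quoting problems if the directory or image path contains spaces.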
After fixing this, I also discovered another bug in DMTCP, in the code that extracts the directories from a path.
I combined my solutions in a patch and have sent it to the DMTCP developers.
Checkpointing now works perfectly, and so do restarts.
I now have to devise a more effective way of timing.
Currently the timings show values between 25 and 60 seconds, with the VM snapshot being the largest overhead.
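One possible way to collect such timings per phase is a small context manager around each step. This is only a sketch of an approach I could take, not what I currently use; the name `timed` and the phase labels are hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Append the wall-clock duration of the enclosed block to results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so failed attempts are timed too.
        results.setdefault(label, []).append(time.perf_counter() - start)


# Hypothetical usage: time the checkpoint and the VM snapshot separately.
#   timings = {}
#   with timed("dmtcp_checkpoint", timings):
#       ...  # trigger the DMTCP checkpoint here
#   with timed("vm_snapshot", timings):
#       ...  # trigger the EC2 snapshot here
```

Keeping one list of durations per label makes it easy to see afterwards how much of the total is spent in the VM snapshot.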
Restarts have not been timed yet, but currently I wait at least 2 minutes before declaring a worker dead (pings are sent every 30 seconds).
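The dead-worker rule can be sketched as follows. The 30-second ping interval and the 2-minute threshold come from my current setup; the class and method names are hypothetical and the real broker code may look different.

```python
import time

PING_INTERVAL = 30.0   # seconds between pings
DEAD_AFTER = 120.0     # seconds of silence before a worker is declared dead


class WorkerMonitor:
    """Track the last ping of each worker and report the silent ones."""

    def __init__(self, dead_after=DEAD_AFTER):
        self.dead_after = dead_after
        self.last_seen = {}

    def record_ping(self, worker_id, now=None):
        """Remember when worker_id was last heard from."""
        self.last_seen[worker_id] = time.monotonic() if now is None else now

    def dead_workers(self, now=None):
        """Return the ids of workers silent for at least dead_after seconds."""
        now = time.monotonic() if now is None else now
        return sorted(w for w, t in self.last_seen.items()
                      if now - t >= self.dead_after)
```

With these values a worker has to miss four consecutive pings before it is considered dead, which trades detection speed for fewer false alarms.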
In addition, I have read the papers by Andrzejak and Javadi.
They have not enlightened me yet, so further investigation is required.
My next plan is to gather as much data as possible about the time it takes to perform snapshots, and to get the broker prototype up and running.