Monday, 18 March 2013

Checkpointing issues

After debugging all weekend, I was surprised to find that I still could not restart my checkpointed programs. To figure out what went wrong, I decided to dive into the DMTCP code once more and track down the problem.

I noticed that the checkpoint file was always being generated, but the restart scripts were empty in the snapshots on Amazon. The restart scripts on the running instance, by contrast, were perfectly fine.
The only explanation I can see is that the file had not yet been completely written to disk when the snapshot was taken.

This surprised me, since I was taking the snapshot at a point where the files should already have been written to disk.

After some digging I figured out my mistake.

DMTCP runs with a DMTCP_Coordinator, which is in charge of taking the snapshots and restarting them.
The process you want to run uses a DMTCP_Worker.

This worker is where the plugin system operates, but NOT where the files are written to disk.
These two mechanisms are decoupled from one another, so I cannot be sure that the files have been written to disk when a given DMTCP plugin event fires.
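
Until there is a proper fix, one possible workaround is to avoid trusting the plugin event and instead check the file itself before taking the snapshot. The sketch below is not part of DMTCP: the helper name `wait_until_written` and all of its parameters are my own invention. It polls a file until it is non-empty and its size has stopped changing.

```python
import os
import time


def wait_until_written(path, stable_checks=3, interval=0.2, timeout=30.0):
    """Return True once `path` is non-empty and its size has stopped changing.

    Hypothetical helper, not a DMTCP API. Returns False if the timeout
    expires first.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    stable = 0
    while time.monotonic() < deadline:
        try:
            size = os.path.getsize(path)
        except OSError:
            size = -1  # file does not exist (yet)
        if size > 0 and size == last_size:
            # Same non-zero size as on the previous poll.
            stable += 1
            if stable >= stable_checks:
                return True
        else:
            stable = 0
        last_size = size
        time.sleep(interval)
    return False
```

The idea would be to call this on the restart script and only trigger the Amazon snapshot once it returns True. A stable size is only a heuristic: it suggests the writer has finished, but it does not prove that the data has actually reached the disk.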

I have contacted the developers once more in the hope of finding a solution for this problem.



