Third phase:
Now that it is possible to take checkpoints, the next step is to restart from the checkpoint we created.
Whenever DMTCP creates a checkpoint, it automatically generates a bash script that can be used to restart the complete job.
During tests it was noted that this script is far too general and needs manual tweaking before it works.
- First of all, the variable DMTCP_HOST should always be set to the DNS name or IP address of the system running the dmtcp_coordinator.
- The dmtcp commands need to be in the PATH. (Personal remark: .profile is not executed when a command is run over ssh, so add the path to .bashrc, before the check for interactive shells.)
- In the middle of the restart script the following lines are present:
    # SYNTAX:
    # :: HOST : MODE: CHECKPOINT_IMAGE ...
    # Host names and filenames must not include ':'
    # At most one fg (foreground) mode allowed; it must be last.
    # 'maybexterm' and 'maybebg' are set from MODE.
    worker_ckpts='
     :: myserver :bg: /home/robin/RMITest/bin/ckpt_java_5bbc58e7-2634-509bd481.dmtcp /home/robin/RMITest/ckpt_java_5bbc58e7-2453-509bd40a.dmtcp /home/robin/RMITest/ckpt_rmiregistry_5bbc58e7-2452-509bd40a.dmtcp
     :: Ubuntu :bg: /home/robin/RMITest/bin/ckpt_java_3e62ae3a-6597-509bd51f.dmtcp
The problem is that the computer's hostname is used directly to locate the different systems, so it may need to be changed to a name or address the other machines can resolve. While testing this on Amazon AWS Ubuntu instances, however, it was noted that the computer name can be used as a private DNS name for the machine, which solves this problem there.
- To restart the different jobs on remote systems, DMTCP uses SSH. The code provided for this only works with the default setup that requires no authentication. If another way of authenticating is used (as with Amazon, which uses a private key file), this command will need to be altered. A bug was also found: paths containing spaces do not work. Unfortunately no solution to this problem has been found yet. Simply quoting the variables is not sufficient, so a different method of retrieving the file names will have to be created.
- A new coordinator has to be started manually. If this is not done, the following error is shown:
    dmtcp_checkpoint (DMTCP + MTCP) 1.2.6
    Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and Gene Cooperman
    This program comes with ABSOLUTELY NO WARRANTY.
    This is free software, and you are welcome to redistribute it under certain conditions; see COPYING file for details.
    (Use flag "-q" to hide this message.)

    [9748] ERROR at dmtcpcoordinatorapi.cpp:358 in startNewCoordinator; REASON='JASSERT(false) failed'
         s = myserver
         jalib::Filesystem::GetCurrentHostname() = Robin-Laptop
    Message: Won't automatically start coordinator because DMTCP_HOST is set to a remote host.
    dmtcp_restart (9748): Terminating...

    dmtcp_checkpoint (DMTCP + MTCP) 1.2.6
    Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and Gene Cooperman
    This program comes with ABSOLUTELY NO WARRANTY.
    This is free software, and you are welcome to redistribute it under certain conditions; see COPYING file for details.
    (Use flag "-q" to hide this message.)

    [13432] ERROR at dmtcpcoordinatorapi.cpp:358 in startNewCoordinator; REASON='JASSERT(false) failed'
         s = myserver
         jalib::Filesystem::GetCurrentHostname() = ubuntu-desktop
    Message: Won't automatically start coordinator because DMTCP_HOST is set to a remote host.
    dmtcp_restart (13432): Terminating...
- After fixing all of this we got the following problem:
    dmtcp_checkpoint (DMTCP + MTCP) 1.2.6
    Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and Gene Cooperman
    This program comes with ABSOLUTELY NO WARRANTY.
    This is free software, and you are welcome to redistribute it under certain conditions; see COPYING file for details.
    (Use flag "-q" to hide this message.)

    [12565] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
         (strerror((*__errno_location ()))) = Address already in use
         id() = 5bbc58e7-12085-50a151fe(99005)
    Message: Bind failed.
    [12565] WARNING at connection.cpp:625 in restore; REASON='JWARNING(sock.bind ( ( sockaddr* ) &_bindAddr,_bindAddrlen )) failed'
         (strerror((*__errno_location ()))) = Address already in use
         id() = 5bbc58e7-12086-50a151fe(99004)
    Message: Bind failed.

    dmtcp_checkpoint (DMTCP + MTCP) 1.2.6
    Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and Gene Cooperman
    This program comes with ABSOLUTELY NO WARRANTY.
    This is free software, and you are welcome to redistribute it under certain conditions; see COPYING file for details.
    (Use flag "-q" to hide this message.)

    [14350] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file: mapping /tmp/hsperfdata_robin/14350 with data from ckpt image
    [12086] mtcp_restart_nolibc.c:973 read_shared_memory_area_from_file: mapping current version of /home/robin/RMITest/codebase.jar into memory; _not_ file as it existed at time of checkpoint. Change mtcp_restart_nolibc.c:973 and re-compile, if you want different behavior.
    1078228992: 1
    [12086] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file: mapping /tmp/hsperfdata_robin/12086 with data from ckpt image
    [12126] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file: mapping /tmp/hsperfdata_robin/12126 with data from ckpt image
    [12085] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file: mapping /tmp/hsperfdata_robin/12085 with data from ckpt image
    [12126] WARNING at jsocket.cpp:352 in writeAll; REASON='JWARNING(cnt > 0) failed'
         cnt = -1
         len = 388
         (strerror((*__errno_location ()))) = Connection reset by peer
    Message: JSocket write failure
    [14350] WARNING at jsocket.cpp:295 in readAll; REASON='JWARNING(cnt!=0) failed'
         sockfd() = 20
         origLen = 388
         len = 388
    Message: JSocket needed to read origLen chars, still needs to read len chars, but EOF reached
    [14350] ERROR at dmtcpmessagetypes.cpp:64 in assertValid; REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
         _magicBits =
    Message: read invalid message, _magicBits mismatch. Did DMTCP coordinator die uncleanly?
    java (14350): Terminating...
    [12126] WARNING at jsocket.cpp:295 in readAll; REASON='JWARNING(cnt!=0) failed'
         sockfd() = 19
         origLen = 388
         len = 388
    Message: JSocket needed to read origLen chars, still needs to read len chars, but EOF reached
    [12126] ERROR at dmtcpmessagetypes.cpp:64 in assertValid; REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
         _magicBits =
    Message: read invalid message, _magicBits mismatch. Did DMTCP coordinator die uncleanly?
    java (12126): Terminating...
Not really knowing the origin of this error, I decided to try the entire procedure again, after which I got the following error:

    [12985] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file: mapping /tmp/hsperfdata_robin/12985 with data from ckpt image
    [13082] mtcp_restart_nolibc.c:929 read_shared_memory_area_from_file: mapping /tmp/hsperfdata_robin/13082 with data from ckpt image
    [12986] mtcp_restart_nolibc.c:973 read_shared_memory_area_from_file: mapping current version of /home/robin/RMITest/codebase.jar into memory; _not_ file as it existed at time of checkpoint. Change mtcp_restart_nolibc.c:973 and re-compile, if you want different behavior.
    1078360064: 1
    [12986] ERROR at virtualpidtable.cpp:558 in serializeEntryCount; REASON='JASSERT(versionCheck == correctValue) failed'
         versionCheck = ntries:[
         correctValue = NumEntries:[
         o.filename() = /tmp/dmtcp-robin@Robin-Laptop/dmtcpPidMapCount.5bbc58e7-12986-50a172b4.50a17657
    Message: invalid file format
    java (12986): Terminating...
    [12985] ERROR at virtualpidtable.cpp:560 in serializeEntryCount; REASON='JASSERT(versionCheck == correctValue) failed'
         versionCheck = nt�, @����
         correctValue = ]
         o.filename() = /tmp/dmtcp-robin@Robin-Laptop/dmtcpPidMapCount.5bbc58e7-12986-50a172b4.50a17657
    Message: invalid file format
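As an illustration of the hostname adjustment described for the worker_ckpts list, here is a minimal sketch of how the host field of each entry could be rewritten before the restart script uses it. The function rewrite_host, the shortened checkpoint paths and the IP addresses are all made up for this example; none of them come from DMTCP or from the generated script.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of DMTCP: rewrites the host field of a
# worker_ckpts-style list. Paths and addresses below are illustrative only.

# Example list in the same ":: HOST :MODE: IMAGE ..." layout (paths shortened).
worker_ckpts='
 :: myserver :bg: /tmp/ckpt_a.dmtcp /tmp/ckpt_b.dmtcp
 :: Ubuntu :bg: /tmp/ckpt_c.dmtcp
'

# rewrite_host OLD NEW
# Replaces the host name that follows "::" at the start of an entry.
# OLD is used as a regular expression, so keep it to plain host names.
rewrite_host() {
  local old="$1" new="$2"
  worker_ckpts=$(printf '%s\n' "$worker_ckpts" \
    | sed -E "s/^([[:space:]]*::[[:space:]]*)${old}([[:space:]]*:)/\1${new}\2/")
}

# Assumed mapping from machine names to reachable addresses.
rewrite_host myserver 10.0.0.5
rewrite_host Ubuntu   10.0.0.6

printf '%s\n' "$worker_ckpts"
```

Only the host field is touched, so the mode and the list of checkpoint images stay exactly as DMTCP wrote them. This does nothing for the problem with spaces in paths, which remains open.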
At the moment the only thing I can conclude is that DMTCP isn't yet working as I had hoped it would.
Simply restarting a checkpointed program in exactly the state it was in at checkpoint time seems far from trivial.
Some more tests will need to be done to pinpoint the possible causes of these crashes.
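Since several of the failures above came from the environment rather than from the checkpoint itself, a small pre-flight check before running the restart script might save some time. The sketch below is my own helper, not something DMTCP provides; it only verifies that DMTCP_HOST is set and that the dmtcp_coordinator and dmtcp_restart commands can be found in the PATH. It does not check whether a coordinator is actually running on that host.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check, not part of DMTCP.

# preflight: returns 0 when the basic requirements are met, 1 otherwise.
preflight() {
  local status=0 cmd

  # The restart script needs to know where the coordinator runs.
  if [ -z "${DMTCP_HOST:-}" ]; then
    echo "DMTCP_HOST is not set" >&2
    status=1
  fi

  # The dmtcp commands must be reachable, also in non-interactive ssh shells.
  for cmd in dmtcp_coordinator dmtcp_restart; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd not found in PATH" >&2
      status=1
    fi
  done

  return "$status"
}

if preflight; then
  echo "environment looks ready for restart"
else
  echo "fix the issues above before restarting"
fi
```

Running the same check through ssh on each remote machine would also reveal the .profile versus .bashrc issue, because the command then runs in a non-interactive shell.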
Hi Robin,
Your last error trace indicates that codebase.jar is not the same file as it was at the time of the checkpoint. Couldn't this cause problems? Was the code recompiled?
Just to be clear: does this scenario work on non-Amazon instances?
Does checkpointing/restoring non-RMI code work?
BTW, in Listing 1 under point 6, are 99005 and 99004 server socket ports?
No, this is part of the identifiers assigned by DMTCP.
The first GUID indicates which snapshot is involved, and the number in parentheses is associated with the connection.
This counter starts from 99000, so these would be the 4th and 5th connections.
Nothing was changed between checkpointing and restarting. Everything was brought back up immediately after checkpointing.
This is indeed a test that is run locally.
As non-RMI code I tested a simple counter in Java and in C++; this works.
But if I start only codebase.jar (the Jini web server) and rmiregistry, without any other code, I get a crash again, with the same complaint that the jar file is supposedly not the same.