Wednesday, 5 June 2013

Difficulties checkpointing Java

Since the latest update of DMTCP, I have been unable to checkpoint Java applications.

The trace output from DMTCP itself shows the following:

[40000] TRACE at dmtcpworker.cpp:518 in waitForCoordinatorMsg; REASON='waiting for REGISTER_NAME_SERVICE_DATA message'
[40000] TRACE at dmtcpworker.cpp:670 in waitForStage3Refill; REASON='Key Value Pairs registered with the coordinator'
[40000] TRACE at dmtcpworker.cpp:518 in waitForCoordinatorMsg; REASON='waiting for SEND_QUERIES message'
[40000] TRACE at dmtcpworker.cpp:675 in waitForStage3Refill; REASON='Queries sent to the coordinator'
[40000] TRACE at dmtcpworker.cpp:518 in waitForCoordinatorMsg; REASON='waiting for REFILL message'
[40000] TRACE at kernelbufferdrainer.cpp:159 in refillAllSockets; REASON='refilling socket buffers'
     _drainedData.size() = 0
[40000] TRACE at kernelbufferdrainer.cpp:198 in refillAllSockets; REASON='buffers refilled'
[40000] TRACE at dmtcpworker.cpp:689 in waitForStage4Resume; REASON='refilled'
[40000] TRACE at dmtcpworker.cpp:518 in waitForCoordinatorMsg; REASON='waiting for RESUME message'
[40000] TRACE at dmtcpworker.cpp:692 in waitForStage4Resume; REASON='got resume message'

Immediately after the resume message, the process dies with a segmentation fault.
The segfault can be traced back to:

#0  0x00007f7386878bf1 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f738720b072 in length (__s=0x7f737bdff000 <Address 0x7f737bdff000 out of bounds>)
    at /usr/include/c++/4.6/bits/char_traits.h:261
#2  operator<< <std::char_traits<char> > (__s=0x7f737bdff000 <Address 0x7f737bdff000 out of bounds>, __out=...)
    at /usr/include/c++/4.6/ostream:515
#3  Print<char> (this=<optimized out>, t=<optimized out>) at ../../dmtcp/jalib/jassert.h:145
#4  dmtcp::FileConnList::remapShmMaps (this=0x7f7387665508) at file/fileconnlist.cpp:243
#5  0x00007f73871d6652 in dmtcp::ConnectionList::processEvent (this=0x7f7387665508, event=<optimized out>, data=<optimized out>)
    at connectionlist.cpp:113
#6  0x00007f73871cabd0 in dmtcp_process_event (event=DMTCP_EVENT_THREADS_RESUME, data=0x7f738549b5a0) at ipc.cpp:43
#7  0x00007f7386f3dfeb in dmtcp::DmtcpWorker::processEvent (event=DMTCP_EVENT_THREADS_RESUME, data=0x7f738549b5a0) at dmtcpworker.cpp:703
#8  0x00007f7386f3dee5 in dmtcp::DmtcpWorker::waitForStage4Resume (this=0x7f73871b0574, isRestart=false) at dmtcpworker.cpp:695
#9  0x00007f7386f517e7 in callbackPostCheckpoint (isRestart=0, mtcpRestoreArgvStartAddr=0x0) at mtcpinterface.cpp:235
#10 0x00007f73859b53d8 in checkpointhread (dummy=0x0) at mtcp.c:1991
#11 0x00007f7386f538f7 in pthread_start (arg=0x7f7387661248) at threadwrappers.cpp:121
#12 0x00007f7386500e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#13 0x00007f7386f53576 in clone_start (arg=0x7f7387661288) at threadwrappers.cpp:71
#14 0x00007f7386cf1154 in clone_start (arg=<optimized out>) at pid_miscwrappers.cpp:100
#15 0x00007f7386809ccd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#16 0x0000000000000000 in ?? ()

Unfortunately, this did not give me a workable starting point for a patch.
The developers of DMTCP were informed once more of these findings.
Hopefully they can provide some assistance in locating the problem.
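
Reading the top frames, the crash appears to happen while a log statement in FileConnList::remapShmMaps (fileconnlist.cpp:243) streams a C string through operator<<. That operator calls char_traits<char>::length on the pointer, and gdb reports the pointer (0x7f737bdff000) as out of bounds. As a rough sketch of how such a print could be guarded while debugging, the helpers below check a pointer against /proc/self/maps before treating it as a string. This assumes Linux; isAddressReadable and describeCString are hypothetical names of my own and are not part of DMTCP or jalib.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper (not DMTCP code): returns true if addr lies inside a
// mapping that /proc/self/maps lists as readable. Linux only.
bool isAddressReadable(const void* addr) {
    const std::uintptr_t target = reinterpret_cast<std::uintptr_t>(addr);
    std::ifstream maps("/proc/self/maps");
    std::string line;
    while (std::getline(maps, line)) {
        // Each line starts with "start-end perms ...", addresses in hex.
        std::istringstream fields(line);
        std::string range, perms;
        if (!(fields >> range >> perms)) continue;
        const std::size_t dash = range.find('-');
        if (dash == std::string::npos) continue;
        const std::uintptr_t start = std::stoull(range.substr(0, dash), nullptr, 16);
        const std::uintptr_t end = std::stoull(range.substr(dash + 1), nullptr, 16);
        if (target >= start && target < end) {
            return !perms.empty() && perms[0] == 'r';
        }
    }
    return false;  // address is not covered by any mapping
}

// Hypothetical helper: produces something safe to log for a C string pointer.
// Only the first byte is checked, so a string that runs off the end of its
// mapping without a terminating NUL would still fault.
std::string describeCString(const char* s) {
    if (s == nullptr || !isAddressReadable(s)) {
        std::ostringstream out;
        out << "<unreadable pointer " << static_cast<const void*>(s) << ">";
        return out.str();
    }
    return std::string(s);
}
```

Logging describeCString(ptr) instead of ptr would only keep the trace statement itself from crashing; it does not explain why the pointer is invalid after the checkpoint resumes.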
