Isolate doesn't exit - worker waits indefinitely #259

Closed
nir-lavee opened this issue Mar 1, 2014 · 8 comments


@nir-lavee
Contributor

Hello,

I have installed the latest 1.1.0pre version of CMS, downloaded on 16/02/2014, on Ubuntu 12.04 (32-bit).

Sometimes when there are jobs pending the log shows a line such as "3 jobs still pending", and this line is repeated without any change. One instance of isolate can be seen in the process list (I have one worker), and it does not exit. The number of jobs will increase as contestants submit more, and never decrease.

Unfortunately, I cannot reproduce consistently. It usually doesn't happen. Submitting the same code, or recompiling/reevaluating the same submission, doesn't necessarily cause it to happen again. But it does regularly happen a few times over the course of a few hours, and seems more likely the more submissions are sent.

The only workaround I have found so far is having a cron job that kills all instances of isolate that are more than a few seconds old. The worker then tries again and is successful.
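Such a workaround cron job could be sketched as follows (a hedged sketch, not the reporter's actual script; the 30-second threshold and the assumption that the binary is named `isolate` are illustrative):

```shell
#!/bin/sh
# Kill any isolate process older than 30 seconds (threshold is an
# illustrative assumption). Meant to be run every minute from cron, e.g.:
#   * * * * * root /usr/local/sbin/kill-stale-isolate.sh
for pid in $(pgrep -x isolate); do
    # etimes = elapsed seconds since the process started
    age=$(ps -o etimes= -p "$pid" | tr -d ' ')
    if [ "${age:-0}" -gt 30 ]; then
        kill "$pid"
    fi
done
```

After the stale isolate is killed, the Worker notices the failure and retries the job, as described above.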

I know this report is not sufficiently informative. Is there any information that could help? I can gather it the next time it happens.

@giomasce
Member

giomasce commented Mar 3, 2014

Not easy to debug, but let's try!

First, you should set keep_sandbox to true in cms.conf and try again to reproduce the problem. Once you find a failing case, please use AWS in order to find the sandbox path and get in there. You should find a command.log file that contains the exact command that was spawned by CMS. Try to execute it again multiple times (as root) and let us know whether it fails consistently, sometimes or never.

Provided that you manage to reproduce the failure, please send us syscall traces of it. You can obtain one as follows: install the strace package and prepend the isolate call with strace -ff (remember, you must do this in a root shell). Please send us syscall traces of both a failing and a succeeding run (if you can achieve both outcomes).
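Concretely, the re-run under strace might look like this (the sandbox path below is a placeholder; the command comes from the command.log found in the failing sandbox):

```shell
# Run as root, inside the sandbox directory that contains command.log.
# /path/to/failing/sandbox is a placeholder, not a real CMS path.
cd /path/to/failing/sandbox
# -ff follows forks and writes one trace.<pid> file per traced process
strace -ff -o trace sh -c "$(cat command.log)"
ls trace.*
```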

Since the syscall traces could leak some details that you want to keep private, feel free to send them to my personal email instead of this public bug report.

@lw
Member

lw commented Mar 4, 2014

Just a small clarification: I'm not sure you'll find the sandbox path in AWS, as that information is stored in the database only when the job returns to ES (either successfully or not). You should probably be able to retrieve the correct sandbox by just picking the one that was created last.
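Assuming the sandboxes live under a single root directory (the path below is a guess; adjust it to your installation), the most recently created one can be picked with something like:

```shell
# List sandbox directories newest-first and take the first entry.
# /tmp/cms-sandboxes is a hypothetical root, not a documented CMS path.
latest=$(ls -td /tmp/cms-sandboxes/*/ 2>/dev/null | head -n 1)
echo "latest sandbox: $latest"
```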

@nir-lavee
Contributor Author

Thanks.

I have two conflicting needs:

  1. Reproduce the bug as you instructed.
  2. Avoid filling the disk entirely during a contest (it is not particularly
     large).

These conflict because the bug usually happens when the system is quite busy, i.e. when a contest is active with a few dozen contestants, but that is exactly when the disk would fill right up.

Is there a chance that, when the bug happens, I can find the sandbox even if keep_sandbox is false? I'm asking because isolate doesn't exit, so perhaps the sandbox is not deleted just yet. Otherwise, is there a way to keep only the few most recent sandboxes?

If not, it may take me a long time to reproduce the bug, since I will only try to do so on my own, with no active contest.
Thanks again.


@lw
Member

lw commented Mar 4, 2014

Yes, the sandbox is deleted only after the isolate process terminates. Therefore I believe that, while that process hangs, the sandbox directory will be present on disk. Note however that killing isolate may "unlock" the Worker, which will then proceed to delete the directory. Hence be sure to first kill the Worker and only then kill the locked isolate.
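In shell terms, the ordering described above might look like this (the Worker process name and the idea of matching it with pkill -f are assumptions about a typical CMS install, not documented commands):

```shell
# 1. Stop the Worker first, so it cannot wake up and delete the sandbox.
#    cmsWorker is an assumed process name; use your init system if preferred.
pkill -f cmsWorker
# 2. Only then kill the hung isolate process.
pkill -x isolate
# The sandbox directory should now remain on disk for inspection.
```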

@nir-lavee
Contributor Author

Thanks, I was finally able to reproduce.

Killing isolate no longer solves the problem, because isolate hangs again the next time the worker runs it (this hadn't happened to me before). What I see in the log is that some of the testcases are evaluated fine, but then one of them (a different one each time) says "Starting job" and never finishes.

The same thing happened when I submitted the same file from a different user (submitting other files worked), so that might be a hint. It's a java file, though I remember the problem occurring with cpp as well.

I ran strace, it hung consistently every time, and even kill wasn't enough (I had to kill -9). Giovanni, I am emailing you as you suggested. I also saved the submission's tmp directory.

@giomasce
Member

giomasce commented Apr 6, 2014

Hi Nir.

I finally found some time to dedicate to this issue. Unfortunately I couldn't find out much: at first I thought I was able to reproduce the issue, but then I noticed that I could reproduce it only when running under strace. Without it, everything was fine (i.e., the program was executed correctly and failed with the wrong array access error, as expected, without hanging). This turned out to be a bug in strace 4.5 that was fixed in strace 4.8: with 4.8 no hanging happened at all (the earlier version didn't know how to trace a clone() call happening in a PID namespace different from strace's own).

All these tests were run on Linux 3.12.3. I'll also check Ubuntu 12.04 to understand whether the kernel difference plays a role (which seems likely).

Anyway, this bug is actually exposed by another bug in CMS that was fixed in the meantime in commit 3d3cf0d. Because of that bug, the command line passed to isolate was built incorrectly, so that multiple processes were permitted instead of being forbidden. Fixing it (i.e., including the linked commit in your code base) should prevent the hang from happening.

This issue remains open anyway, since in theory one could want to allow more than one process in the sandbox and the sandbox shouldn't hang.

@giomasce
Member

BTW, the relevant commit that fixes the multiple processes bug is 3793683.

@lw
Member

lw commented Aug 8, 2014

Moved to cms-dev/isolate#7.

@lw closed this as completed Aug 8, 2014