Isolate doesn't exit - worker waits indefinitely #259

Closed
nir-lavee opened this issue Mar 1, 2014 · 8 comments


@nir-lavee
Contributor

Hello,

I have installed the latest 1.1.0pre version of CMS, downloaded on 16/02/2014, on Ubuntu 12.04 (32-bit).

Sometimes when there are jobs pending the log shows a line such as "3 jobs still pending", and this line is repeated without any change. One instance of isolate can be seen in the process list (I have one worker), and it does not exit. The number of jobs will increase as contestants submit more, and never decrease.

Unfortunately, I cannot reproduce consistently. It usually doesn't happen. Submitting the same code, or recompiling/reevaluating the same submission, doesn't necessarily cause it to happen again. But it does regularly happen a few times over the course of a few hours, and seems more likely the more submissions are sent.

The only workaround I have found so far is having a cron job that kills all instances of isolate that are more than a few seconds old. The worker then tries again and is successful.
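Such a workaround cron job could be sketched as follows (a hedged sketch, not the reporter's actual script; the 30-second threshold and the assumption that the binary is named `isolate` are illustrative):

```shell
#!/bin/sh
# Kill any isolate process older than 30 seconds (threshold is an
# illustrative assumption). Meant to be run every minute from cron, e.g.:
#   * * * * * root /usr/local/sbin/kill-stale-isolate.sh
for pid in $(pgrep -x isolate); do
    # etimes = elapsed seconds since the process started
    age=$(ps -o etimes= -p "$pid" | tr -d ' ')
    if [ "${age:-0}" -gt 30 ]; then
        kill "$pid"
    fi
done
```

After the stale isolate is killed, the Worker notices the failure and retries the job, as described above.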

I know this report is not sufficiently informative. Is there any information that could help? I can gather it the next time it happens.

@giomasce
Member

giomasce commented Mar 3, 2014

Not easy to debug, but let's try!

First, you should set keep_sandbox to true in cms.conf and try again to reproduce the problem. Once you find a failing case, please use AWS in order to find the sandbox path and get in there. You should find a command.log file that contains the exact command that was spawned by CMS. Try to execute it again multiple times (as root) and let us know whether it fails consistently, sometimes or never.

Provided that you manage to reproduce the failure, please send us syscall traces of it. You can obtain one as follows: install the strace package and prepend the isolate call with strace -ff (remember, you must do this in a root shell). Please send us syscall traces of both a failing and a succeeding run (if you can achieve both outcomes).
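Concretely, the re-run under strace might look like this (the sandbox path below is a placeholder; the command comes from the command.log found in the failing sandbox):

```shell
# Run as root, inside the sandbox directory that contains command.log.
# /path/to/failing/sandbox is a placeholder, not a real CMS path.
cd /path/to/failing/sandbox
# -ff follows forks and writes one trace.<pid> file per traced process
strace -ff -o trace sh -c "$(cat command.log)"
ls trace.*
```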

Since the syscall traces could leak some details that you want to keep private, feel free to send them to my personal email instead of this public bug report.

@lw
Member

lw commented Mar 4, 2014

Just a small clarification: I'm not sure you'll find the sandbox path in AWS, as that information is stored in the database only when the job returns to ES (either successfully or not). You should probably be able to retrieve the correct sandbox by just picking the one that was created last.
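Assuming the sandboxes live under a single root directory (the path below is a guess; adjust it to your installation), the most recently created one can be picked with something like:

```shell
# List sandbox directories newest-first and take the first entry.
# /tmp/cms-sandboxes is a hypothetical root, not a documented CMS path.
latest=$(ls -td /tmp/cms-sandboxes/*/ 2>/dev/null | head -n 1)
echo "latest sandbox: $latest"
```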

@nir-lavee
Contributor Author

Thanks.

I have two conflicting needs:

  1. Reproduce the bug as you instructed.
  2. Avoid filling the disk entirely during a contest (it is not particularly
     large).

These conflict because the bug usually happens when the system is quite busy, i.e. when a contest is active with a few dozen contestants, but that is exactly when the disk would fill right up.

Is there a chance that, when the bug happens, I can find the sandbox even if keep_sandbox is false? I'm asking because isolate doesn't exit, so perhaps the sandbox is not deleted just yet. Otherwise, is there a way to keep only the few most recent sandboxes?

If not, it may take me a long time to reproduce the bug, since I will only try to do so on my own, with no active contest.
Thanks again.


@lw
Member

lw commented Mar 4, 2014

Yes, the sandbox is deleted only after the isolate process terminates. Therefore I believe that, while that process hangs, the sandbox directory will be present on disk. Note however that killing isolate may "unlock" the Worker, which will then proceed to delete the directory. Hence be sure to first kill the Worker and only then kill the locked isolate.
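In shell terms, the ordering described above might look like this (the Worker process name and the idea of matching it with pkill -f are assumptions about a typical CMS install, not documented commands):

```shell
# 1. Stop the Worker first, so it cannot wake up and delete the sandbox.
#    cmsWorker is an assumed process name; use your init system if preferred.
pkill -f cmsWorker
# 2. Only then kill the hung isolate process.
pkill -x isolate
# The sandbox directory should now remain on disk for inspection.
```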

@nir-lavee
Contributor Author

Thanks, I was finally able to reproduce.

Killing isolate no longer solves the problem, because isolate hangs again the next time the worker runs it (this hadn't happened to me before). What I see in the log is that some of the testcases are evaluated fine, but then one of them (a different one each time) says "Starting job" and never finishes.

The same thing happened when I submitted the same file from a different user (submitting other files worked), so that might be a hint. It's a java file, though I remember the problem occurring with cpp as well.

I ran strace, it hung consistently every time, and even kill wasn't enough (I had to kill -9). Giovanni, I am emailing you as you suggested. I also saved the submission's tmp directory.

@giomasce
Member

giomasce commented Apr 6, 2014

Hi Nir.

I finally found some time to dedicate to this issue. Unfortunately I couldn't find out much: at first I thought I was able to reproduce the issue, but then I noticed that I could reproduce it only when running under strace. Without it, everything was fine (i.e., the program was executed correctly and failed with the wrong array access error, as expected, without hanging). This turned out to be a bug in strace 4.5 that was fixed in strace 4.8: with 4.8 no hanging happened at all (the earlier version didn't know how to trace a clone() call happening in a PID namespace different from strace's own).

All these tests were run on Linux 3.12.3. I'll also check Ubuntu 12.04 to understand whether the kernel difference plays a role (which seems likely).

Anyway, this bug is actually exposed by another bug in CMS that was fixed in the meantime in commit 3d3cf0d. Because of that bug, the command line passed to isolate was built incorrectly, so that multiple processes were permitted instead of being forbidden. Fixing it (i.e., including the linked commit in your code base) should prevent the hang from happening.

This issue remains open anyway, since in theory one could want to allow more than one process in the sandbox and the sandbox shouldn't hang.

@giomasce
Member

BTW, the relevant commit that fixes the multiple processes bug is 3793683.

@lw
Member

lw commented Aug 8, 2014

Moved to cms-dev/isolate#7.

@lw closed this as completed Aug 8, 2014