Isolate doesn't exit - worker waits indefinitely #259
Not easy to debug, but let's try! First, you should configure CMS to keep the sandbox after the job finishes. Provided that you manage to reproduce the failure, please let us have a syscall trace of it. You can build one this way: install the strace package and […]. Since the syscall traces could leak some details that you want to keep private, feel free to send them to my personal email instead of this public bug report.
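For reference, a minimal sketch, not part of CMS, of how such a trace could be collected once the hang has already happened. It assumes strace is installed and that the script runs with enough privileges to attach to the isolate process; the trace file names are made up.

```python
import os
import subprocess

def find_isolate_pids():
    """Return the PIDs of all running processes named 'isolate'."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/comm" % entry) as f:
                if f.read().strip() == "isolate":
                    pids.append(int(entry))
        except (IOError, OSError):
            # The process exited while we were scanning /proc.
            continue
    return pids

for pid in find_isolate_pids():
    # -p attaches to the already-running (hung) process, -f follows any
    # children it spawned, -o writes the trace to a file.
    subprocess.call(["strace", "-f", "-p", str(pid),
                     "-o", "isolate-%d.trace" % pid])
```

strace keeps running until the traced process exits or you interrupt it; the resulting isolate-*.trace files are what could then be sent by private email.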
Just a small clarification: I'm not sure you'll find the sandbox path in AWS, as that information is stored in the database only when the job returns to ES (either successfully or not). You should probably be able to retrieve the correct sandbox by just picking the one that was created last.
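As an illustration of that suggestion, a minimal sketch that picks the most recently touched directory in the worker's temporary directory. The path and glob pattern below are assumptions (mkdtemp-style names under /tmp); adapt them to wherever your installation actually creates its sandboxes.

```python
import glob
import os

SANDBOX_PATTERN = "/tmp/tmp*"  # assumption: adjust to your temp_dir setting

candidates = [d for d in glob.glob(SANDBOX_PATTERN) if os.path.isdir(d)]
if candidates:
    # The directory with the most recent activity is, in practice,
    # the sandbox that was created last.
    latest = max(candidates, key=os.path.getmtime)
    print("Most recently created sandbox: %s" % latest)
else:
    print("No sandbox directories found matching %s" % SANDBOX_PATTERN)
```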
Thanks. I have two conflicting needs: […]

The reason these are problematic is that the bug usually happens when the […]. Is there a chance that, when the bug happens, I can find the sandbox even […]? If not, it may take me a long time to reproduce the bug, since I will only […]
Yes, the sandbox is deleted only after the […]
Thanks, I was finally able to reproduce. Killing isolate now doesn't solve the problem, because it hangs the next time the worker runs it as well (this hasn't happened to me before).

What I see in the log is that some of the testcases were evaluated fine, but then one of them (a different one each time) says "Starting job" and never finishes. The same thing happened when I submitted the same file from a different user (submitting other files worked), so that might be a hint. It's a java file, though I remember the problem occurring with cpp as well.

I ran strace; it hung consistently every time, and even kill wasn't enough (I had to kill -9). Giovanni, I am emailing you as you suggested. I also saved the submission's tmp directory.
Hi Nir. I finally found some time to dedicate to this issue. Unfortunately I couldn't find out much: at first I thought I was able to reproduce it, but I noticed that I could reproduce the issue only when running under […]. All these tests happened on Linux 3.12.3. I'll also try Ubuntu 12.04, in order to understand whether the kernel difference plays a role (which seems probable).

Anyway, this bug is actually exposed by another bug in CMS that was fixed in the meantime in commit 3d3cf0d. Because of that bug, the command line to […]. This issue remains open anyway, since in theory one could want to allow more than one process in the sandbox, and the sandbox shouldn't hang.
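For completeness, a hedged sketch (using Python's subprocess) of what a run with more than one process allowed could look like. The option names follow isolate's documented interface as I understand it, and the box id, time limit and Java command line are purely illustrative; check the man page of your isolate version before relying on them.

```python
import subprocess

BOX = "--box-id=0"

subprocess.check_call(["isolate", BOX, "--init"])
try:
    subprocess.call([
        "isolate", BOX,
        "--processes=10",   # permit child processes (a JVM needs several)
        "--time=5",         # CPU time limit in seconds (illustrative)
        "--run", "--", "/usr/bin/java", "Solution",
    ])
finally:
    subprocess.check_call(["isolate", BOX, "--cleanup"])
```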
BTW, the relevant commit that fixes the multiple-processes bug is 3793683.
Moved to cms-dev/isolate#7.
Hello,
I have installed the latest 1.1.0pre version of CMS, downloaded on 16/02/2014, on 32-bit Ubuntu 12.04.
Sometimes when there are jobs pending the log shows a line such as "3 jobs still pending", and this line is repeated without any change. One instance of isolate can be seen in the process list (I have one worker), and it does not exit. The number of jobs will increase as contestants submit more, and never decrease.
Unfortunately, I cannot reproduce consistently. It usually doesn't happen. Submitting the same code, or recompiling/reevaluating the same submission, doesn't necessarily cause it to happen again. But it does regularly happen a few times over the course of a few hours, and seems more likely the more submissions are sent.
The only workaround I have found so far is having a cron job that kills all instances of isolate that are more than a few seconds old. The worker then tries again and is successful.
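For reference, a minimal sketch of that watchdog, assuming a Linux /proc layout and enough privileges to signal the worker's processes; the 30-second threshold is made up.

```python
import os
import signal
import time

MAX_AGE = 30  # seconds; anything older than this is assumed to be hung

def process_age(pid):
    """Return the age of a process in seconds, based on /proc/<pid>/stat."""
    with open("/proc/stat") as f:
        btime = next(int(line.split()[1]) for line in f
                     if line.startswith("btime"))
    with open("/proc/%d/stat" % pid) as f:
        # Field 22 (starttime) is in clock ticks since boot; split after
        # the last ')' so the command name cannot shift the fields.
        fields = f.read().rsplit(")", 1)[1].split()
    started = btime + int(fields[19]) / float(os.sysconf("SC_CLK_TCK"))
    return time.time() - started

for entry in os.listdir("/proc"):
    if not entry.isdigit():
        continue
    pid = int(entry)
    try:
        with open("/proc/%d/comm" % pid) as f:
            name = f.read().strip()
        if name == "isolate" and process_age(pid) > MAX_AGE:
            os.kill(pid, signal.SIGKILL)  # plain SIGTERM was not enough here
    except (IOError, OSError):
        continue  # the process disappeared while we were looking at it
```

Running something like this from cron every minute keeps a hung isolate from blocking the worker for long.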
I know this report is not sufficiently informative. Is there any information that could help? I can gather it the next time it happens.