-
Notifications
You must be signed in to change notification settings - Fork 2.4k
When an error is thrown on write and another error is thrown on process during retry, the job gets in a infinite loop and never finishes. [BATCH-2442] #1160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Yoann GENDRE commented Today I try to fix the problem. It seems to be OK (Unit test of spring-batch-core-2.2.7.RELEASE and mine are OK ). I have updated class FaultTolerantChunkProcessor (you can find it in attachment) as detailed bellow. In method transform, if we are in a scan and process return null (item is skipped), I removed cached output for this item : /*
* We only want to process the first item if there is a scan for a
* failed item.
*/
if (data.scanning()) {
// on scanning processed item is skipped - remove matching cached output
if (output == null) {
if (cacheIterator != null && cacheIterator.hasNext()) {
cacheIterator.next();
cacheIterator.remove();
}
}
while (cacheIterator != null && cacheIterator.hasNext()) {
outputs.add(cacheIterator.next());
}
// Only process the first item if scanning
break;
} In method write, If we are in a scan, I only write if item was not skipped during process : if (!data.scanning()) {
...
} else {
// scan only if item was not skipped
if (inputs.getSkips().size() == 0) {
scan(contribution, inputs, outputs, chunkMonitor, false);
} else {
logger.debug("Scan skipped because item was skipped during process");
}
}
return null; |
Yoann GENDRE commented The proposed correction in class FaultTolerantChunkProcessor in the pull request is a little bit different. |
Mahmoud Ben Hassine commented Hi, Thank you for reporting this case and for providing an example. When using a fault tolerant step, the item processor should be idempotent (as explained in the reference documentation). In your example, the
I tested both ways and they do make the test in your example pass (For solution 1, I adapted the assertions since there are no more skips in processing. For solution 2, just adding In regards to your PR, the code changes will actually fix the case but what will happen is:
This is by design (of the transactional property of the processor) what we don't want to do. Either we cache the value and reuse it without recalling the processor, or we don't cache any value and retry calling the processor according to the retry policy you provide. Does this make sense? Kr |
Mahmoud Ben Hassine commented I'm closing this as "Won't fix" since the item processor used in the example is not idempotent as the documentation suggests. If the item processor cannot be made idempotent, it should be declared as non transactional or declare the exception it might throw in subsequent calls as retryable (more details in my previous comment). |
Yoann GENDRE commented Hi Mahmoud Thanks for the return. OK for your answer and your comment on the code changes I suggest (I must admit I have not understand all code in FaultTolerantChunkProcessor - too complicated for me). I think we still have a problem. We should not go in a infinite loop in this case !!! From what I remember some items are not processed after first write error --> run state is KO and can't be restarted. To me (and I think it's the mots common use case ???) it's really hard to implement idempotent processor. In most case we can do it from business point of view but we can't be sure that a technical exception (SGBD or WS access) won't be thrown. Kind regards |
Mahmoud Ben Hassine commented Hi Yoann, I agree with you. From a robustness point of view, SB should not go in an infinite loop.
It should not fail, the step can proceed to completion and count the item as skipped. Failing the whole step for a single failed item is not ideal. I opened a PR to fix this issue and added a test similar to yours (I just simplified the configuration). The fix in the PR also makes your attached test to pass. Can you give it a try? Kind regards, |
Yoann GENDRE commented Hi Mahmoud, I have just tried your fix. It seems to be OK for me. Thanks. |
Mahmoud Ben Hassine commented Hi Yoann, Great! Thank you for taking time to test the fix. Kr, |
Conosci commented Dear Mahmoud, I have a scenario where I believe this fix is causing another dangerous bug, some items don't end properly their lifecycle and are silenty ignored. The scenario is the following:
At runtime, after the BATCH-2442 fix, the result is the following:
The same scenario is working well before the fix version, using Spring Batch version 4.0.1.RELEASE all the listeners are correctly invoked and the ending counts are correct. Please look into this critical issue. Thanks
|
Wolfgang Klaus commented Hi Mahmoud, the fix for this issue is really problematic and critical. As Conosci commented on Aug/31/2019 it is possible that some items are silently ignored. I'm attaching an Test for this issue. When you look at the StandardOutput after running the test you can see that for item 1 the writer is never called. In addition the Processor for item 19 is called twice without calling the writer. Here the relevant output before your changes:
And here the relevant output after your changes:
As you can see the writer for item 1 is never called although it is in the list to write and the processor for item 19 is called twice. To fix this the condition introduced with this fix in FaultTolerantChunkProcessor.java:584 has to be extended
|
Yoann GENDRE opened BATCH-2442 and commented
Job get in a infinite loop when an exception is first thrown during write and a second exception is thrown during the retry on processing the first item of the chunk.
In attachment, maven project with a job test case with commit-interval = 5 and skip-limit=3
It seems inputs is never set to busy=false in method scan of FaultTolerantChunkProcessor. (output is never empty)
For now I can't see where the problem come from...
Affects: 2.2.7
Attachments:
Issue Links:
Referenced from: pull request #592, and commits c1b79e5, dc41757, 30a4d43, ef13ea0, 346dcce, cc6f608
Backported to: 4.1.0.M1, 4.0.2, 3.0.10
0 votes, 6 watchers
The text was updated successfully, but these errors were encountered: