When an error is thrown on write and another error is thrown on process during retry, the job gets in a infinite loop and never finishes. [BATCH-2442] #1160

spring-projects-issues · 2015-10-02T16:53:32Z

Yoann GENDRE opened BATCH-2442 and commented

Job get in a infinite loop when an exception is first thrown during write and a second exception is thrown during the retry on processing the first item of the chunk.

In attachment, maven project with a job test case with commit-interval = 5 and skip-limit=3

read 5 String from ListItemReader
process String - increment count item processed - throw an exception if count = 6 (first process after first exception)
write items - throw an exception on first chunk

It seems inputs is never set to busy=false in method scan of FaultTolerantChunkProcessor. (output is never empty)
For now I can't see where the problem come from...

Affects: 2.2.7

Attachments:

Batch2442IgnoredItems.zip (2.53 kB)
FaultTolerantChunkProcessor.java (19.27 kB)
test-case-infiniteLoopOnretry.zip (13.00 kB)

Issue Links:

BATCH-2302 exception on process while recovering from rollback on write causes deadlock

Referenced from: pull request #592, and commits c1b79e5, dc41757, 30a4d43, ef13ea0, 346dcce, cc6f608

Backported to: 4.1.0.M1, 4.0.2, 3.0.10

0 votes, 6 watchers

The text was updated successfully, but these errors were encountered:

spring-projects-issues · 2015-10-05T14:58:35Z

Yoann GENDRE commented

Today I try to fix the problem. It seems to be OK (Unit test of spring-batch-core-2.2.7.RELEASE and mine are OK ).

I have updated class FaultTolerantChunkProcessor (you can find it in attachment) as detailed bellow.

In method transform, if we are in a scan and process return null (item is skipped), I removed cached output for this item :

/*
			 * We only want to process the first item if there is a scan for a
			 * failed item.
			 */
			if (data.scanning()) {
				// on scanning processed item is skipped - remove matching cached output
				if (output == null) {
					if (cacheIterator != null && cacheIterator.hasNext()) {
						cacheIterator.next();
						cacheIterator.remove();
					}
				}

				while (cacheIterator != null && cacheIterator.hasNext()) {
					outputs.add(cacheIterator.next());
				}

				// Only process the first item if scanning
				break;
			}

In method write, If we are in a scan, I only write if item was not skipped during process :

if (!data.scanning()) {
         ...
} else {
         // scan only if item was not skipped
         if (inputs.getSkips().size() == 0) {
                  scan(contribution, inputs, outputs, chunkMonitor, false);
         } else {
                  logger.debug("Scan skipped because item was skipped during process");
         }
}
return null;

spring-projects-issues · 2017-01-10T16:47:42Z

Yoann GENDRE commented

The proposed correction in class FaultTolerantChunkProcessor in the pull request is a little bit different.
In pull request a test case illustrating the problem :
Step with commit-interval=5
7 items (item1,item2,item3,item4,item5,item6,item7) in input
Writer fail on 1rst chunk (writer.cpt=1)
Processor fail on 1rst item during scan (processor.cpt=6) and on 2nd item during scan (processor.cpt=7)
Expected output =
Item:id=3,nbProcessed=2,nbWritten=2
Item:id=4,nbProcessed=2,nbWritten=2
Item:id=5,nbProcessed=2,nbWritten=2
Item:id=6,nbProcessed=1,nbWritten=1
Item:id=7,nbProcessed=1,nbWritten=1
In the test cas a step listener is used to stop the infinite loop and throw an exception when chunk number > 200

spring-projects-issues · 2018-02-20T11:45:13Z

Mahmoud Ben Hassine commented

Hi,

Thank you for reporting this case and for providing an example.

When using a fault tolerant step, the item processor should be idempotent (as explained in the reference documentation). In your example, the ItemProcessor succeeds the first time for item2 but fails the second time. This behavior is not idempotent. Since it is not always possible to write an idempotent processor (think of transient exceptions for instance), what you can do is:

either declare the exception as retryable (in addition to declaring it as skippable)
or declare the processor as non transactional (in which case item2 will be cached the first time and (re)used the second time during the scan)

I tested both ways and they do make the test in your example pass (For solution 1, I adapted the assertions since there are no more skips in processing. For solution 2, just adding processor-transactional="false" to the chunk element will make the test pass).

In regards to your PR, the code changes will actually fix the case but what will happen is:

cache the value the first time when it succeeds (not part of your PR)
remove the cached value the second time when scanning (undo step 1)
detect that through skip count and prevent scanning the (single skipped item) chunk

This is by design (of the transactional property of the processor) what we don't want to do. Either we cache the value and reuse it without recalling the processor, or we don't cache any value and retry calling the processor according to the retry policy you provide.

Does this make sense?

Kr
Mahmoud

spring-projects-issues · 2018-03-02T16:38:28Z

Mahmoud Ben Hassine commented

I'm closing this as "Won't fix" since the item processor used in the example is not idempotent as the documentation suggests. If the item processor cannot be made idempotent, it should be declared as non transactional or declare the exception it might throw in subsequent calls as retryable (more details in my previous comment).

spring-projects-issues · 2018-03-23T15:59:08Z

Yoann GENDRE commented

Hi Mahmoud

Thanks for the return. OK for your answer and your comment on the code changes I suggest (I must admit I have not understand all code in FaultTolerantChunkProcessor - too complicated for me).

I think we still have a problem. We should not go in a infinite loop in this case !!! From what I remember some items are not processed after first write error --> run state is KO and can't be restarted.
Shouldn't SB fail on getting an exception on an item processed a 2nd time (first time with success). This "NonIdempotentProcessingItemFailure" would prevent from getting in this infinite loop ?

To me (and I think it's the mots common use case ???) it's really hard to implement idempotent processor. In most case we can do it from business point of view but we can't be sure that a technical exception (SGBD or WS access) won't be thrown.

Kind regards
Yoann

spring-projects-issues · 2018-03-27T12:22:42Z

Mahmoud Ben Hassine commented

Hi Yoann,

I agree with you. From a robustness point of view, SB should not go in an infinite loop.

Shouldn't SB fail on getting an exception on an item processed a 2nd time (first time with success). This "NonIdempotentProcessingItemFailure" would prevent from getting in this infinite loop ?

It should not fail, the step can proceed to completion and count the item as skipped. Failing the whole step for a single failed item is not ideal. I opened a PR to fix this issue and added a test similar to yours (I just simplified the configuration). The fix in the PR also makes your attached test to pass. Can you give it a try?

Kind regards,
Mahmoud

spring-projects-issues · 2018-03-29T13:24:21Z

Yoann GENDRE commented

Hi Mahmoud,

I have just tried your fix. It seems to be OK for me.

Thanks.

spring-projects-issues · 2018-03-29T13:30:07Z

Mahmoud Ben Hassine commented

Hi Yoann,

Great! Thank you for taking time to test the fix.

Kr,
Mahmoud

spring-projects-issues · 2019-08-31T12:37:10Z

Conosci commented

Dear Mahmoud,

I have a scenario where I believe this fix is causing another dangerous bug, some items don't end properly their lifecycle and are silenty ignored. The scenario is the following:

Job with 1 Step With Chunk Processing (ChunkSize=10)
Reader, Processor, Writer, all with faultTolerant() and skipPolicy(..) (Always Skip) and skipListener(..) to listen to 'onSkipInWrite' and 'onSkipInProcess'
Source Data with 3 Records: the first is a good record, the second is giving Processing Error and the third is giving Write Error (duplicate primary key against the first record)

At runtime, after the BATCH-2442 fix, the result is the following:

Reader reads the 3 records chunk
Processor start processing and meet Processing Error in the second record
The System will skip (according to the configured policy) the second record and will invoke the SkipListener (onSkipInProcess) giving to the user the possibility to manage the issue
The System will take the remaining 2 records (the first and the third), will process again them and will try to write them in chunk (same transaction)
During the write the database will return exception (duplicate primary key)
The System will go in 'recovery' mode and will start processing the two items one by one
Now during the scan the following code will execute:
1. //BATCH-2442 : do not scan skipped items
  if (!inputs.getSkips().isEmpty()) {
  if (outputIterator.hasNext()) {
  outputIterator.remove();
  return;
  }
  }
The condition is then verified, there is a skip (the first Processor error for the second record) and one record from the remaining two (first and third) is then removed without any reason!
At this point only 1 record is then available in outputIterator (the first record), and it will be written successfully.
The job then ends with a non consistent state (there is a writeSkipCount element missing, the totals are wrong):
1. readCount = 3
2. processSkipCount = 1
3. writeCount = 1
4. writeSkipCount = 0
5. rollbackCount = 2
6. commitCount = 2
Basically what happened is that removing the wrong element from the outputIterator, the system didn't catch the Write Exception and didn't execute the proper lifecycle (doOnWriteError/shouldSkip/onSkipInWrite) givinig no change to the user to know about the skipped element in the skipListener

The same scenario is working well before the fix version, using Spring Batch version 4.0.1.RELEASE all the listeners are correctly invoked and the ending counts are correct.

Please look into this critical issue.

Thanks

spring-projects-issues · 2019-10-30T19:24:42Z

Wolfgang Klaus commented

Hi Mahmoud,

the fix for this issue is really problematic and critical. As Conosci commented on Aug/31/2019 it is possible that some items are silently ignored.

I'm attaching an Test for this issue. When you look at the StandardOutput after running the test you can see that for item 1 the writer is never called. In addition the Processor for item 19 is called twice without calling the writer.

Here the relevant output before your changes:

......
write items [19, 18, 17, 16, 14, 13, 12, 9, 8, 7, 6, 4, 3, 2, 1]
fail Item 13
process item 19
write items [19]
.........
process item 1
write items [1]
[info 2019/10/30 20:14:06.589 MEZ <Time-limited test> tid=0x11] Step: [step] executed in 306ms

And here the relevant output after your changes:

.......
write items [19, 18, 17, 16, 14, 13, 12, 9, 8, 7, 6, 4, 3, 2, 1]
fail Item 13
process item 19
process item 19
write items [19]
----
process item 2
write items [2]
[info 2019/10/30 20:16:06.589 MEZ <Time-limited test> tid=0x11] Step: [step] executed in 306ms

As you can see the writer for item 1 is never called although it is in the list to write and the processor for item 19 is called twice.

To fix this the condition introduced with this fix in FaultTolerantChunkProcessor.java:584 has to be extended
With the following change everything is allright.

if (!inputs.getSkips().isEmpty()  && inputs.getItems().size() != outputs.getItems().size() ) {

spring-projects-issues closed this as completed Oct 30, 2019

spring-projects-issues added in: retry type: bug has: backports Legacy label from JIRA. Superseded by "for: backport-to-x.x.x" in: core labels Dec 16, 2019

spring-projects-issues added this to the 4.1.0 milestone Dec 16, 2019

This was referenced Dec 17, 2019

4.0.2 Backported Issues #3582

Closed

3.0.10 Backported Issues #3583

Closed

4.1.0.M1 Backported Issues #3589

Closed

fmbenhassine mentioned this issue Jan 21, 2020

FaultTolerantChunkProcess ignoring retry count [BATCH-2344] #1257

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

When an error is thrown on write and another error is thrown on process during retry, the job gets in a infinite loop and never finishes. [BATCH-2442] #1160

When an error is thrown on write and another error is thrown on process during retry, the job gets in a infinite loop and never finishes. [BATCH-2442] #1160

spring-projects-issues commented Oct 2, 2015

spring-projects-issues commented Oct 5, 2015

Uh oh!

spring-projects-issues commented Jan 10, 2017

Uh oh!

spring-projects-issues commented Feb 20, 2018

Uh oh!

spring-projects-issues commented Mar 2, 2018

Uh oh!

spring-projects-issues commented Mar 23, 2018

Uh oh!

spring-projects-issues commented Mar 27, 2018

Uh oh!

spring-projects-issues commented Mar 29, 2018

Uh oh!

spring-projects-issues commented Mar 29, 2018

Uh oh!

spring-projects-issues commented Aug 31, 2019

Uh oh!

spring-projects-issues commented Oct 30, 2019

Uh oh!

When an error is thrown on write and another error is thrown on process during retry, the job gets in a infinite loop and never finishes. [BATCH-2442] #1160

When an error is thrown on write and another error is thrown on process during retry, the job gets in a infinite loop and never finishes. [BATCH-2442] #1160

Comments

spring-projects-issues commented Oct 2, 2015

spring-projects-issues commented Oct 5, 2015

Uh oh!

spring-projects-issues commented Jan 10, 2017

Uh oh!

spring-projects-issues commented Feb 20, 2018

Uh oh!

spring-projects-issues commented Mar 2, 2018

Uh oh!

spring-projects-issues commented Mar 23, 2018

Uh oh!

spring-projects-issues commented Mar 27, 2018

Uh oh!

spring-projects-issues commented Mar 29, 2018

Uh oh!

spring-projects-issues commented Mar 29, 2018

Uh oh!

spring-projects-issues commented Aug 31, 2019

Uh oh!

spring-projects-issues commented Oct 30, 2019

Uh oh!