Wednesday, August 25, 2010

Implementing a while loop in Kettle

   I ran into the problem of 'while' loops in Kettle when I was loading data from the Yahoo Store API. The Yahoo web service lets you fetch up to 100 orders per request, but it doesn't tell you how many orders there are in total. So I keep sending requests until I get a special response saying that there are no more orders.
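   The shape of the problem can be sketched in plain JavaScript, outside Kettle. This is only an illustration: makeFakeStore and fetchOrders are hypothetical stand-ins for the web service, and an empty page plays the role of the "no more orders" response.

```javascript
// Hypothetical stand-in for the web service: returns up to pageSize orders
// starting at offset, and an empty array once there are no more.
function makeFakeStore(totalOrders) {
  return function fetchOrders(offset, pageSize) {
    const orders = [];
    const end = Math.min(offset + pageSize, totalOrders);
    for (let i = offset; i < end; i++) {
      orders.push({ id: i + 1 });
    }
    return orders;
  };
}

// Keep requesting pages until the service reports that nothing is left.
// The number of requests is not known before the loop starts.
function loadAllOrders(fetchOrders, pageSize) {
  const all = [];
  let offset = 0;
  while (true) {
    const page = fetchOrders(offset, pageSize);
    if (page.length === 0) break; // the "no more orders" response
    all.push(...page);
    offset += page.length;
  }
  return all;
}

const orders = loadAllOrders(makeFakeStore(250), 100);
console.log(orders.length); // 250
```

   The exit condition depends on the last response, which is what makes this a 'while' loop rather than a loop over a known list.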
   There are several ways to implement loops in Kettle:
  • Use the "Execute for every input row" flag in a job or transformation.
  • Circle hops in a job.
  • Use the "Repeat" flag in the Start step of a job.
   Each of these approaches has its strengths, weaknesses and area of applicability. Let's consider each of them.

 Using "Execute for every input row" flag in job or transformation.
   This is the safest, most correct and most native way to implement loops in Kettle, but to use it you need to know in advance how many times you want to run the job inside the loop. If you can get or calculate the number of iterations in advance, just use this approach. For more information, see Slawomir Chodnicki's blog article on the subject.
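   As a rough sketch of what "calculating the number of iterations in advance" could look like, here is plain JavaScript that builds one row per request. The function name, the start/end fields and the figures 250 and 100 are assumptions made for the example, not something Kettle or the Yahoo API prescribes.

```javascript
// Build one row per iteration; with "Execute for every input row",
// each such row would drive one run of the inner job.
// totalOrders must be known up front for this approach to work.
function buildIterationRows(totalOrders, pageSize) {
  const iterations = Math.ceil(totalOrders / pageSize);
  const rows = [];
  for (let i = 0; i < iterations; i++) {
    rows.push({
      start: i * pageSize + 1,
      end: Math.min((i + 1) * pageSize, totalOrders),
    });
  }
  return rows;
}

const rows = buildIterationRows(250, 100);
console.log(rows.length); // 3
```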

 Circling hops in a job.
   This approach is not safe. When using it, keep in mind that the loop depth can't be too big. If you break this rule you risk getting a StackOverflowError, because Kettle uses recursive method calls when running this kind of job. So if you are sure your loop will not exceed, say, 10,000 or 100,000 iterations (the actual limit depends on the stack size settings of your JVM), you can run the loop this way.
   For more information about StackOverflowError in Kettle, see this JIRA issue (http://jira.pentaho.com/browse/PDI-1463). You can find more about implementing this kind of loop in another article by Slawomir Chodnicki.
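   The underlying effect is not specific to Kettle. The following plain JavaScript is only an analogy, not Kettle's actual Java code: when every iteration is a nested call, the call stack grows with the loop, while an ordinary loop uses constant stack space. The function names are made up for the example.

```javascript
// Each iteration calls the next one, so stack depth equals loop depth.
function runRecursive(remaining, done = 0) {
  if (remaining === 0) return done;
  return runRecursive(remaining - 1, done + 1);
}

// Same work as a plain loop: stack depth stays constant.
function runIterative(total) {
  let done = 0;
  while (done < total) done++;
  return done;
}

// Run fn(n) and report a stack overflow instead of crashing.
function tryRun(fn, n) {
  try {
    return fn(n);
  } catch (e) {
    if (e instanceof RangeError) return "stack overflow";
    throw e;
  }
}

console.log(tryRun(runRecursive, 1000));     // 1000
console.log(tryRun(runIterative, 10000000)); // 10000000
```

   How deep the recursive version can go before it fails depends on the runtime's stack size, just as the safe number of iterations in a circled job depends on the JVM settings.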

  Using "Repeat" flag in Start step of the job.
   I recommend this approach if you can't use either of the two above. By checking the 'Repeat' flag in the Start step of a job you can easily make the job run forever. The more important question is how to stop it! The only way to do that in 'out of the box' Kettle (even 4.0) is to use the Abort step. It stops the job and writes an error message to the log saying that the job finished with errors. But that may not be the case! I want my job to stop normally and successfully, without errors. Why can't I do that? Why should I flood the log with ERROR messages when no errors actually occurred? For that reason I implemented a simple plugin called 'Stop Job' that stops a repeating job without writing error messages to the log.
   You can use this plugin in the same way you use the Abort step. You can also add a message, which will be written to the log with log level BASIC.
  Be aware that when the Repeat flag is used, the job will keep running over and over even when one of its steps fails. To stop the job in this case, use the Abort job step as shown in the image below.
Here is the Stop Job plugin, which you are free to use and modify.

Update, 11 April 2017:
Instead of using the Stop Job plugin you can simply use JavaScript steps in jobs and transformations.
Use this code in a job's JavaScript step to stop job execution without failing its result:
// Clear the error count and mark the result as successful
previous_result.setNrErrors(0);
previous_result.setResult(true);
// Stop the (repeating) job
parent_job.stopAll();
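
In a repeating job the stop call would normally be guarded by a condition. Here is a minimal sketch of that idea, written so it can be tried outside Kettle. The helper stopJobIfDone, its ordersInLastResponse argument and the stand-in objects are all hypothetical; inside a real job's JavaScript step the first two arguments would be previous_result and parent_job, and how you obtain the order count is up to your job.

```javascript
// Hypothetical helper: stop the job cleanly once the last response was empty.
// previousResult / parentJob mirror the previous_result / parent_job objects
// used in the snippet above.
function stopJobIfDone(previousResult, parentJob, ordersInLastResponse) {
  if (ordersInLastResponse > 0) return false; // more data: let the job repeat
  previousResult.setNrErrors(0);   // no errors to report
  previousResult.setResult(true);  // finish as a success
  parentJob.stopAll();             // end the repeating job
  return true;
}

// Stand-ins so the logic can be exercised without Kettle.
function makeStandIns() {
  const state = { nrErrors: 1, result: false, stopped: false };
  return {
    state,
    previousResult: {
      setNrErrors: (n) => { state.nrErrors = n; },
      setResult: (r) => { state.result = r; },
    },
    parentJob: {
      stopAll: () => { state.stopped = true; },
    },
  };
}

const demo = makeStandIns();
stopJobIfDone(demo.previousResult, demo.parentJob, 0);
console.log(demo.state.stopped); // true
```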

And here is the code for a transformation's JavaScript step:
_step_.stopAll();
Feedback is welcome, because I'm just starting to validate these solutions.