Wednesday, August 25, 2010

Implementing while loop in Kettle

   I ran into the problem of 'while' loops in Kettle when I was loading data from the Yahoo Store API. The Yahoo web service lets you get up to 100 orders per request, but it doesn't tell you how many orders there are in total. So I keep sending requests until I get a special response saying there are no more orders.
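In plain code, the loop I needed looks something like this. This is only a sketch: `fetchOrders` is a stub of my own standing in for the real Yahoo Store API call, and the "empty batch means done" convention is a stand-in for Yahoo's actual "no more orders" response.

```javascript
function fetchOrders(startIndex, pageSize) {
  // Stub standing in for the real Yahoo Store API call:
  // pretend the store holds 250 orders in total.
  const totalOrders = 250;
  const orders = [];
  const end = Math.min(startIndex + pageSize, totalOrders);
  for (let i = startIndex; i < end; i++) {
    orders.push({ id: i });
  }
  return orders; // an empty array plays the role of the "no more orders" response
}

function loadAllOrders(pageSize) {
  // The 'while' loop: keep requesting batches of up to pageSize orders
  // until the service signals that there is nothing left to fetch.
  const all = [];
  let start = 0;
  while (true) {
    const batch = fetchOrders(start, pageSize);
    if (batch.length === 0) break; // the special "no more orders" response
    all.push(...batch);
    start += batch.length;
  }
  return all;
}
```

The rest of the article is about expressing exactly this kind of open-ended loop with Kettle's job and transformation building blocks.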
   There are several ways to implement loops in Kettle:
  • To use the "Execute for every input row" flag in a job or transformation.
  • To circle hops in a job.
  • To use the "Repeat" flag in the Start step of the job.
   Each of these approaches has its strengths, weaknesses and area of applicability. Let's consider each of them.

 Using the "Execute for every input row" flag in a job or transformation.
   This is the safest, most correct and most natural way to implement loops in Kettle, but to use it you need to know in advance how many times you want to run the job inside the loop. If you can get or calculate the number of iterations in advance, just use this approach. For more information I can point you to Slawomir Chodnicki's blog article on the subject.

 Circling hops in a job.
   This approach is not safe. When using it, keep in mind that the loop depth can't be too big. If you break this rule you risk getting a StackOverflowError. That is because Kettle uses recursive method calls when running this kind of job. So if you believe your loop will not exceed, say, 10,000 or 100,000 iterations, depending on the stack size settings of your JVM, you can run the loop this way.
   For more information about StackOverflowError in Kettle see this JIRA issue (http://jira.pentaho.com/browse/PDI-1463). More information about implementing this kind of loop can be found in another article by Slawomir Chodnicki.
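To see why the depth matters, here is a toy illustration (plain JavaScript, not Kettle code) of how recursive execution consumes the call stack. On the JVM the same pattern produces the StackOverflowError mentioned above; in JavaScript it surfaces as a RangeError:

```javascript
function runIteration(n) {
  // Each iteration of a hop-circled loop is another nested call,
  // i.e. another frame on the finite call stack.
  if (n === 0) return 0;
  return 1 + runIteration(n - 1);
}

function tryLoop(depth) {
  try {
    runIteration(depth);
    return "ok";
  } catch (e) {
    // On the JVM this would be a StackOverflowError; in JavaScript
    // engines it is a RangeError with the same underlying cause.
    return e instanceof RangeError ? "stack overflow" : "unexpected: " + e;
  }
}
```

A shallow loop completes fine, while a deep one blows the stack; the exact threshold depends on the engine's stack size setting, just as it depends on the JVM's stack size in Kettle.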

  Using the "Repeat" flag in the Start step of the job.
   I recommend this approach if you can't use either of the above. By checking the 'Repeat' flag in the Start step of a job you can easily make the job run forever. The more important question is how to stop it! The only way to do that in out-of-the-box Kettle (even 4.0) is the Abort step. It stops the job and writes an error message to the log saying the job finished with errors. But that may not be the case! I want my job to stop normally, successfully, without errors. Why can't I do that? Why should I flood the log with ERROR messages when no errors actually occurred? For that reason I implemented a simple plugin called 'Stop Job' that stops a repeating job without writing error messages to the log.
   You can use this plugin the same way you use the Abort step, and you can add a message which will be written to the log at log level BASIC.
  Be aware that when using the Repeat flag, the job will keep running over and over even when one of its steps fails. To stop the job in that case, use the Abort job step as depicted in the image below.
Here is the Stop Job plugin, which you are free to use and modify.

An update, 11 April 2017:
Instead of using the Stop Job plugin you may simply use JavaScript steps in jobs and transformations.
Use this code in a job's JavaScript step to stop job execution without failing its result:
previous_result.setNrErrors(0);   // clear the error count on the job's result
previous_result.setResult(true);  // mark the result as successful
parent_job.stopAll();             // stop the repeating parent job

And here is the code for a transformation's JavaScript step:
_step_.stopAll(); // stop the transformation from inside the step
Feedback is welcome, because I'm just starting to validate these solutions.

13 comments:

  1. The StopJob plugin has a problem when the job is run using Kettle.bat. The error message is "Unable to read Job Entry copy info from XML node : org.pentaho.di.core.exception.KettleStepLoaderException:
    No valid step/plugin specified (jobPlugin=null) for Stop Job"
    Any fix?
    Thank You.

  2. The StopJob plugin works fine in version 4.1 but not in version 4.2. Sorry for the lack of information in the previous post.
    Thank You.

  3. Hi,

    Since this plugin is built on Kettle's Abort step, you may just compare the code changes to the Abort step between 4.1 and 4.2 and then update the StopJob plugin in the same way.

  4. Hello Vasili,

    Thank you so much for the post. It was really helpful.

    But now I have other two problems:

    1. The job itself loops very well on its own, but in my case I have to nest this loop job inside another master job, which means the loop job has a parent job. This parent job looks like start -> transformation 1 -> transformation 2 -> loop job. When I run it through the parent job, the loop job won't loop.

    2. Is there any other way to work around the StopJob plugin? Right now we are using version 5.0.1, and this plugin does not work.

    It would be much appreciated if you could reply at your convenience.

    Thank you very much!
    Jean

  5. Thanks for the question, Jean!

    You know, it just occurred to me that I wrote this article 5 years ago. I've developed a large number of ETLs since then, but never needed the StopJob plugin again.

    When I do loops I use first option from this article: "Execute for every input row" flag in job or transformation.

    I don't see any issues in the situation you've described. Having a master job for the job with the loop is pretty normal and should work fine. Probably you have some issue in the loop job itself. I would suggest reading Slawomir's article carefully and checking whether you have done all the steps.
    Key steps are:
    1. Create a transformation which generates rows and has "Copy rows to result" as its final step.
    2. The master job should contain the "row generating" transformation followed by the looping job.
    3. The looping job should have "Execute for every input row?" checked.

    If the article doesn't help, ask this question on the Pentaho forums and post your jobs for review so that it will be easier to help.
    You may also post a link to the forum thread here.

  6. This comment has been removed by the author.

  7. @Vasili, I am in kind of the same situation, but the issue is I need to compare each row against rows received from an Excel file: if the price in a row matches the Excel column range 30-150, I add another column to the parent row as "band" $30-$150.

    If I read the row from the parent job and read the Excel rows in the nested (second) job, will it work?

    Replies
    1. note: the column counts of the two steps do not match.

    2. Have you considered using the Stream Lookup step for your case?
      You may read the main data in one stream and read the Excel file in another stream of the same transformation, and then use Stream Lookup for the matching.

  8. @Vasili thanks for your valuable idea :) I will keep it in mind, but as I had to compare prices with ranges I found the Number Range step, and that did the magic.

  9. Hi Vasili,
    Some time has passed since the initial post was written. In the end, which is the best approach for your initial scenario: Yahoo API calls when you don't know the number of needed calls in advance?
    I appreciate your inputs.

    Replies
    1. Hi Raul,
      I have already forgotten the specifics of the Yahoo API, but these days I would try to implement this loop at the transformation level instead of the job level.
      Inside the transformation you may use a "Generate Rows" step and set its limit to some very large value which you feel will never be reached because of time constraints, say 1,000,000 rows. Then, for every row produced by "Generate Rows", send a web service request to Yahoo.
      Then you need some check that Yahoo has returned no more rows, and if that's true, use the Abort step to stop the transformation.
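      The shape of that pattern can be sketched in plain code. This is only an illustration, not Kettle code: `makeService` and `pollService` are stand-ins of my own for the web service call, and the row limit plays the role of the "Generate Rows" limit.

```javascript
function makeService(batches) {
  // Stand-in for the web service: returns data for the first few
  // requests, then null to signal "no more rows".
  let calls = 0;
  return function pollService() {
    calls += 1;
    return calls <= batches ? { batch: calls } : null;
  };
}

function runBoundedLoop(rowLimit, pollService) {
  const collected = [];
  // "Generate Rows" with a huge limit: the loop is bounded,
  // so it can never run forever even if the check misfires.
  for (let row = 0; row < rowLimit; row++) {
    const response = pollService();
    if (response === null) {
      return { aborted: true, collected }; // the Abort step fires here
    }
    collected.push(response);
  }
  return { aborted: false, collected }; // limit reached without aborting
}
```

      The key point is that the huge-but-finite row limit turns an open-ended 'while' into a bounded loop, with the Abort step as the early exit.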

  10. Thanks for the helpful article! Solved my Memory Overflow errors
