Thursday, July 22, 2010

Running Kettle jobs from BI Server

  Most power of Pentaho comes from integration. BI Server is an integration point of multitude tools, technologies and projects. This is one of the secrets of Pentaho success. Pentaho provides access to database systems, allows developer to choose between data manipulation languages(SQL, MDX, MQL), allows to perform data analysis using WEKA data mining tool, suggests different ways for incorporating business logic, helps to perform some actions like sending e-mail based on business decisions and presents the results to user in form of charts and reports.
   PDI by itself is also a great consolidating tool. I won't praise it as much as BI server just to save your time, but when consider diversity of data sources and technologies Kettle is even more sociable then BI Server. Just have a look at Kettle plugin list to get it. I believe it can run data to earth from any data source you want.
   And now imagine you tie these great tools together. You obtain fine-grained control over Kettle jobs. You can ask user for parameters before passing them to Kettle job. You can use Kettle job as a smart data source for your reports. Using java quartz lib you can schedule Kettle jobs to run with a second-precision (for comparison linux cron allows only minutes-precision).
   But many people abandon this power and run Kettle as a separate java process. I saw a lot of unanswered question on forums about running Kettle jobs in BI Server. Up to Kettle version 4.0 some bugs was in the way of using BI Server - PDI couple tied together in enterprise environment. An example is a call of System.gc() in every Kettle job.
   Below are some advices for those bold spirits who decide to integrate these tools.

  Setting up Kettle repository
   At first you need to show BI server where Kettle repository is. The easiest way to do it is to configure repository settings with Spoon tool. After that open User Home folder on you machine and find .kettle folder. You need to place .kettle to User Home folder of the machine where BI Server is running.
   Then open file pentaho-solutions\system\kettle\settings.xml. Use following settings to connect to DB-based kettle repository:
  1. <repository.type>rdbms</repository.type>
  2. <repository.name>kettle_repo</repository.name>
  3. <repository.userid>admin</repository.userid>
  4. <repository.password>admin</repository.password>
   admin/admin are default user name/password.

  That's it.

  Custom plugins.
   If you have custom kettle plugins you need to put it in one of following places: <PATH>/plugins, <User Home>/.kettle/plugins, <pentaho-solutions>/system/kettle/plugins.

  Running jobs
   And now to run Kettle job just add an action like that to your action sequence:
    <action-definition>
      <component-name>KettleComponent</component-name>
      <action-type>Pentaho Data Integration Job</action-type>
      <action-inputs>
        <kettle-logging-level type="string"/>
      </action-inputs>
      <action-resources/>
      <component-definition>
        <directory><![CDATA[job-dir]]></directory> 
        <job><![CDATA[job-name]]></job>
      </component-definition>
    </action-definition>
   For start use 'detail' value for kettle-logging-level. It writes enough logs but don't floods log files.

10 comments:

  1. Hi Vasili,



    I tried the way u mentioned here,i did changes to setting.xml file ,i got an issue saying repository not found.

    error:- Kettle.ERROR_0004 - The repository specified was not found (Local)


    Looking forward for you help to enable my BI server to find kettle repository settings.


    Thanks,
    Surya

    ReplyDelete
  2. Hi Surya,

    I'll just give you the general ideas, for details please see the Pentaho forums. Like this one:
    http://forums.pentaho.com/showthread.php?78458-PDI-and-database-repository-The-repository-specified-was-not-found

    When you design transformations in Spoon, you are connecting to Kettle repository from Spoon, and saving jobs and transformations into this repository. While you defining repository connection in Spoon, it saves database information to file .kettle\repositories.xml in your user home directory. If you run BI Server on same machine, it will automatically find this settings, but it's you run on different machine, you need to copy this file there.

    Actually there are many ways to run ETL in BI server. In this article I've described how to run transformations from kettle repository. But as an alternative way you may try to run transformation by simply putting it into pentaho solutions. Check this sample in BI Server for details pentaho-solutions\bi-developers\etl\SampleTransformation.xaction

    ReplyDelete
  3. Hi Vasili,
    I am trying to run the job which is stored in my local repository using an xaction. I have made the changes in settings.xml and yes i have a repositories.xml file in my $user/.kettle directory but when i am trying to run the xaction i am really having a situation here. the error is "The repository specified not found(myrepo)". I am running the xaction on the same machine but still not able to run it. I really want to run a job from my repository using action sequence. Hope one more pair of eyes could help me with it.
    Thanks,
    Santosh

    ReplyDelete
  4. Hi Santosh and Surya,

    Are you working together or it's a different issues?

    Try this checklist:
    In Spoon menu, go to Tools -> Repository -> Connect. Choose the repository you have created. Enter user name and password.
    1. Verify that in settings.xml you entered same values to repository.name, repository.userid and repository.password.
    2. Connect to repository. Verify that you able to connect to it.

    Now, go to Tools -> Repository -> Explore. Find the job you need to run.
    3. Verify that in xaction you have entered same directory and job names. On my local environment I've put that job in a folder, so I'm not sure if running from root folder will work.

    4. Open repositories.xml in text or xml editor, and verify that it's have description of yours repository.

    If that's not helps - create a question on pentaho forums and attach all mentioned files to it (repositories.xml, settings.xml, action sequence ). And post the link to question here. Then we can continue discussion on pentaho forums. Because Pentaho forums much more suitable for such discussions.

    ReplyDelete
  5. Hi vasili,

    not sure who Santosh is...? , but issue seems to be same..!!

    Q:-which version of Kettle and BI Server is used
    Ans:- pentaho 4.5.0 GA (4.5.0 GA Suite Release)
    Q :- which database
    ans:- mysql
    Q:- show the content of repositories.xml
    ans
    PentahoEnterpriseRepository
    Local
    Local
    http://localhost:9080/pentaho-di
    N


    Q:- action sequence file.
    Ans :-

    KettleComponent
    Pentaho Data Integration Job















    Q:- 1. Verify that in settings.xml you entered same values to repository.name, repository.userid and repository.password.
    ans :- yes , repository.name, repository.userid and repository.password values are same and correct , i can connect to spoon using specified credentials .

    q:- Now, go to Tools -> Repository -> Explore. Find the job you need to run.
    Yes i can find my job through explore (spoon GUI)

    Q:- Open repositories.xml in text or xml editor, and verify that it's have description of yours repository.
    yes my repositories.xml file has repository name

    While you defining repository connection in Spoon, it saves database information to file .kettle\repositories.xml in your user home directory from your blog
    what version of pentaho are you using.?

    we have 2 different file kettle folder .
    1) kettle.properties this save DB connection information
    2) repository.xml file this save repository details,URL .


    thanks for your help,
    Surya

    ReplyDelete
  6. Santosh, you may also join the discussion on pentaho forums:

    http://forums.pentaho.com/showthread.php?143338-The-repository-specified-was-not-found

    ReplyDelete
  7. Hi Vasili ,

    need your suggestion/idea/comment in below topic .

    I can execute a independent scheduled transformation from pentaho DI server console .But issue on running a parametrized scheduled transformation from pentaho DI server console .How can i pass parameter value at run time .

    In pentaho BI server , to execute parametrized report we used to pass variable value in URL . tried same in pentho DI server as below
    http://localhost:9080/pentaho-di/kettle/startTrans?name=UI_parameter&Values=Testvalue

    ReplyDelete
  8. any suggestion would be a gr8 help

    ReplyDelete
  9. DI server is a part of Pentaho Enterprise Edition. I don't use it, so can't help you. Try asking your question on Pentaho forums, or search documentation of enterprise edition.

    ReplyDelete