Thursday, July 22, 2010

Running Kettle jobs from BI Server

  Most of Pentaho's power comes from integration. The BI Server is an integration point for a multitude of tools, technologies and projects, and this is one of the secrets of Pentaho's success. Pentaho provides access to database systems, lets the developer choose between data manipulation languages (SQL, MDX, MQL), allows data analysis with the WEKA data mining tool, offers different ways to incorporate business logic, can perform actions such as sending e-mail based on business decisions, and presents the results to the user in the form of charts and reports.
   PDI by itself is also a great consolidating tool. I won't praise it as much as the BI Server, just to save your time, but considering the diversity of data sources and technologies it supports, Kettle is even more sociable than the BI Server. Just have a look at the Kettle plugin list to see what I mean. I believe it can hunt down data from any data source you want.
   And now imagine you tie these great tools together. You get fine-grained control over Kettle jobs. You can ask the user for parameters before passing them to a Kettle job. You can use a Kettle job as a smart data source for your reports. And using the Java Quartz library you can schedule Kettle jobs with second precision (for comparison, Linux cron only allows minute precision).
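   For example, a Quartz cron expression has an extra seconds field that a classic crontab entry lacks (both expressions below are just an illustration):
     0/15 * * * * ?     Quartz: fire every 15 seconds
     */15 * * * *       cron: every 15 minutes is as fine as it gets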
   But many people forgo this power and run Kettle as a separate Java process. I have seen a lot of unanswered questions on forums about running Kettle jobs in the BI Server. Up to Kettle version 4.0, some bugs stood in the way of using the BI Server - PDI couple in an enterprise environment; one example is a call to System.gc() in every Kettle job.
   Below is some advice for those bold spirits who decide to integrate these tools.

  Setting up the Kettle repository
   First you need to show the BI Server where the Kettle repository is. The easiest way to do this is to configure the repository settings with the Spoon tool. After that, open the user home folder on your machine and find the .kettle folder. You need to copy .kettle to the user home folder of the machine where the BI Server is running.
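   Under the hood, the repository definitions are stored in the repositories.xml file inside .kettle. As a rough sketch (all names, credentials and connection details here are placeholders, not something Spoon will generate verbatim), a database repository entry looks like this:
     <repositories>
       <connection>
         <name>kettle_repo_conn</name>
         <server>localhost</server>
         <type>MYSQL</type>
         <access>Native</access>
         <database>kettle</database>
         <port>3306</port>
         <username>kettle</username>
         <password>kettle</password>
       </connection>
       <repository>
         <name>kettle_repo</name>
         <description>Kettle repository used by the BI Server</description>
         <connection>kettle_repo_conn</connection>
       </repository>
     </repositories>
   The repository <name> here is what you will reference from the BI Server settings below.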
   Then open the file pentaho-solutions\system\kettle\settings.xml. Use the following settings to connect to a database-based Kettle repository:
    <repository.type>rdbms</repository.type>
    <repository.name>kettle_repo</repository.name>
    <repository.userid>admin</repository.userid>
    <repository.password>admin</repository.password>
   admin/admin is the default user name and password.

  That's it.

  Custom plugins
   If you have custom Kettle plugins, you need to put them in one of the following places: <PATH>/plugins, <User Home>/.kettle/plugins, or <pentaho-solutions>/system/kettle/plugins. An illustration of the layout follows below.
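   As an illustration (the folder and file names here are made up), a custom step plugin would typically sit in its own subfolder under a steps directory, together with its descriptor and jar:
     <pentaho-solutions>/system/kettle/plugins/
       steps/
         MyCustomStep/
           plugin.xml
           my-custom-step.jar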

  Running jobs
   And now, to run a Kettle job, just add an action like this to your action sequence:
    <action-definition>
      <component-name>KettleComponent</component-name>
      <action-type>Pentaho Data Integration Job</action-type>
      <action-inputs>
        <!-- logging verbosity, supplied as an input (see below) -->
        <kettle-logging-level type="string"/>
      </action-inputs>
      <action-resources/>
      <component-definition>
        <!-- directory and name of the job in the Kettle repository -->
        <directory><![CDATA[job-dir]]></directory>
        <job><![CDATA[job-name]]></job>
      </component-definition>
    </action-definition>
   To start with, use the 'detail' value for kettle-logging-level. It writes enough log information without flooding the log files.
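   Since kettle-logging-level is declared as an action input, the value has to come from the action sequence itself, for example from a default in the xaction's inputs section. A minimal sketch:
     <inputs>
       <kettle-logging-level type="string">
         <default-value><![CDATA[detail]]></default-value>
       </kettle-logging-level>
     </inputs>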