Vasili Vrubleuski on Business Intelligence: 2010

Sunday, September 26, 2010

JFreeChart in Firefox

   If you work with Pentaho via Firefox, you are probably upset with the quality of the charts. Flash charts look just great; I'm talking here about JFreeChart charts. This issue affects Firefox only! IE works a bit better in this case.

JFreeChart
   JFreeChart has several advantages over Flash charts. JFreeChart is a Java library that produces images in various formats: GIF, JPEG, PNG. As a result, these charts can be incorporated into PDF, Excel and RTF reports. The second advantage comes from the Apple vs. Adobe Flash war: Flash charts will not work on the iPhone and iPad. So there are good reasons to use JFreeChart.

Blurred Charts
   Blurred charts in Pentaho are a common enough case. I designed a report containing a chart with the latest Pentaho Report Designer 3.6.1, placed it on a dashboard, where it is produced in HTML format, and added a PDF export button. When I open my dashboard in Firefox, the chart looks blurred. This mainly concerns any text on the chart. OK, you can still read these charts, but this is not something your customer will like. On the other hand, charts exported to PDF look very neat. The charts also look nice when opened with Internet Explorer.

How to fix it?
   The problem here is in the way Firefox scales images. Chart images are usually high quality, and they can be twice as large as the space reserved for the chart on the HTML page. So Firefox is forced to downscale these images. And here, to Firefox's shame, it can't do that nicely. Firefox just doesn't have a good image scaling algorithm. You can check the bug, which wasn't fixed at the time of this writing.
But the good news is that Firefox has a "bad" image scaling algorithm that can correctly downscale our chart! I'll explain:
The <img> HTML element can be styled with the CSS property image-rendering. In Firefox this property can be set to any of 5 values: auto, inherit, optimizeSpeed, optimizeQuality, -moz-crisp-edges. Currently only two rendering algorithms are used: bilinear resampling for the values auto and optimizeQuality, and nearest neighbor resampling for optimizeSpeed and -moz-crisp-edges. And here is the trick: nearest neighbor resampling works better for images with sharp edges.
So all you need to do is add the following style for all chart images:

<style type="text/css">
img {
    image-rendering: -moz-crisp-edges;
}
</style>
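
To get a feeling for why nearest neighbor resampling keeps edges crisp, here is a minimal Java sketch that downscales a single row of gray pixels in both ways. It is a simplified one-dimensional illustration, not Firefox's actual implementation, and the class and method names are made up for this example.

```java
import java.util.Arrays;

class DownscaleSketch {
    /** Nearest neighbor: every output pixel copies the source pixel closest to its center. */
    static int[] nearest(int[] src, int outLen) {
        int[] out = new int[outLen];
        for (int i = 0; i < outLen; i++) {
            int s = (int) ((i + 0.5) * src.length / outLen);
            out[i] = src[Math.min(s, src.length - 1)];
        }
        return out;
    }

    /** Linear interpolation: every output pixel blends its two nearest source pixels. */
    static int[] linear(int[] src, int outLen) {
        int[] out = new int[outLen];
        for (int i = 0; i < outLen; i++) {
            double pos = (i + 0.5) * src.length / outLen - 0.5;
            int left = (int) Math.floor(pos);
            double frac = pos - left;
            int a = src[Math.max(left, 0)];
            int b = src[Math.min(left + 1, src.length - 1)];
            out[i] = (int) Math.round(a * (1 - frac) + b * frac);
        }
        return out;
    }

    public static void main(String[] args) {
        // A white row (255) with a two-pixel black line (0), like a thin stroke on a chart.
        int[] row = {255, 255, 255, 0, 0, 255, 255, 255};
        System.out.println(Arrays.toString(nearest(row, 4))); // [255, 0, 255, 255]
        System.out.println(Arrays.toString(linear(row, 4)));  // [255, 128, 128, 255]
    }
}
```

The nearest neighbor result contains only pure black and white, while the interpolated result smears the line into two gray pixels, which is what we perceive as blur on chart text.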

This is how the blurred image looks (you will see the blurring in Firefox only):

And here is how the sharp image looks (for me it looks even better than in IE):

The sharp image may be rendered with problems if you change the zoom level in the browser.

Why can't Pentaho fix it?
OK, I don't know. Maybe Pentaho doesn't consider this bug important. But it is certainly a disadvantage, which may scare off some potential customers. JasperReports doesn't have this issue, though I'm not sure how they overcome it.

Wednesday, August 25, 2010

Implementing while loop in Kettle

I ran into the problem of 'while' loops in Kettle when I was loading data from the Yahoo Store API. The Yahoo web service allows you to get up to 100 orders per request. But it doesn't tell you how many orders it has in total. So I keep sending requests until I get a special response saying that there are no more orders.
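
In plain Java the logic I needed looks roughly like the sketch below. The OrderSource interface, the fetchAll method and the fake service are hypothetical names for illustration only and are not part of the Yahoo Store API; the sketch simply keeps requesting batches until an empty one comes back.

```java
import java.util.ArrayList;
import java.util.List;

class WhileLoopSketch {
    /** Hypothetical stand-in for the web service: returns at most `limit` orders starting at `offset`. */
    interface OrderSource {
        List<String> fetch(int offset, int limit);
    }

    /** Keeps requesting batches until the service answers with an empty batch. */
    static List<String> fetchAll(OrderSource source, int pageSize) {
        List<String> all = new ArrayList<>();
        while (true) {
            List<String> batch = source.fetch(all.size(), pageSize);
            if (batch.isEmpty()) {
                break; // the "no more orders" response
            }
            all.addAll(batch);
        }
        return all;
    }

    public static void main(String[] args) {
        // Fake service holding 250 orders, for demonstration only.
        List<String> store = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            store.add("order-" + i);
        }
        OrderSource fake = (offset, limit) -> store.subList(
                Math.min(offset, store.size()),
                Math.min(offset + limit, store.size()));
        System.out.println(fetchAll(fake, 100).size()); // 250
    }
}
```

The number of iterations is not known before the loop starts, and that is exactly what makes this case awkward in Kettle.
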
There are several ways to implement loops in Kettle:

Use the "Execute for every input row" flag in a job or transformation.
Create a cycle of hops in a job.
Use the "Repeat" flag in the Start step of the job.

   Each of these approaches has its strengths, weaknesses and area of applicability. Let's consider each of them.

Using "Execute for every input row" flag in job or transformation.
   This is the safest, most correct and most native way to implement loops in Kettle, but to use it you need to know in advance how many times you want to run the job inside the loop. If you can get or calculate the number of iterations in advance, just use this approach. For more information, I can refer you to Slawomir Chodnicki's blog article about it.

Circling hops in a job.
   This approach is not safe. When using it, keep in mind that the loop depth can't be too big. If you break this rule, you risk getting a StackOverflowError. That is because Kettle uses recursive method calls when running this kind of job. So if you believe your loop will not exceed, say, 10,000 or 100,000 iterations (depending on the stack size settings of your JVM), you can run the loop this way.
   For more information about StackOverflowError in Kettle see this JIRA issue (http://jira.pentaho.com/browse/PDI-1463). You can find more about implementing this kind of loop in another article by Slawomir Chodnicki.
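
The difference between a recursive and an iterative loop can be seen in a few lines of plain Java. This is only an illustration of the general effect, not Kettle's actual code; the class and method names are made up, and the depth at which the stack overflows depends on the JVM and its stack size settings (for example -Xss).

```java
class LoopDepthSketch {
    /** Recursive "loop": every iteration adds one more stack frame. */
    static long recursive(long remaining) {
        if (remaining == 0) {
            return 0;
        }
        return 1 + recursive(remaining - 1);
    }

    /** Iterative loop: constant stack usage regardless of the iteration count. */
    static long iterative(long iterations) {
        long count = 0;
        for (long i = 0; i < iterations; i++) {
            count++;
        }
        return count;
    }

    /** Returns true if the recursive variant runs out of stack at the given depth. */
    static boolean overflows(long depth) {
        try {
            recursive(depth);
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(recursive(1_000));        // a small depth works fine
        System.out.println(iterative(50_000_000L));  // a plain loop handles large counts
        System.out.println(overflows(50_000_000L));  // typically true with default stack sizes
    }
}
```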

Using "Repeat" flag in Start step of the job.
   I recommend this approach if you can't use either of the two above. By checking the 'Repeat' flag in the Start step of a job you can easily make the job run forever. The more important question is how to stop it! The only way to do that in 'out of the box' Kettle (even 4.0) is to use the Abort step. It stops the job and writes an error message to the log saying that the job finished with errors. But that may not be the case! I want my job to stop normally and successfully, without errors. Why can't I do that? Why should I flood the log with ERROR messages when no errors actually occurred? For that reason I implemented a simple plugin called 'Stop Job' that stops a repeating job without writing error messages to the log.
   You can use this plugin in the same way you use the Abort step. You can also add a message, which will be written to the log with log level BASIC.
Be aware that when the Repeat flag is used, the job will keep running over and over even when one of its steps fails. To stop the job in this case, use the Abort job step as depicted in the image below.

Here is the Stop Job plugin, which you are free to use and modify.

Update, 11 April 2017:
Instead of using the Stop Job plugin you may simply use JavaScript steps in jobs and transformations.
Use this code in a job's JavaScript step to stop job execution without failing its result:
previous_result.setNrErrors(0);
previous_result.setResult(true);
parent_job.stopAll();

And here is the code for a transformation's JavaScript step:
_step_.stopAll();
Feedback is welcome, because I'm just starting to validate these solutions.

Thursday, July 22, 2010

Running Kettle jobs from BI Server

Most of Pentaho's power comes from integration. The BI Server is an integration point for a multitude of tools, technologies and projects. This is one of the secrets of Pentaho's success. Pentaho provides access to database systems, lets the developer choose between data manipulation languages (SQL, MDX, MQL), allows data analysis with the WEKA data mining tool, offers different ways of incorporating business logic, helps to perform actions such as sending e-mail based on business decisions, and presents the results to the user in the form of charts and reports.
   PDI by itself is also a great consolidating tool. I won't praise it as much as the BI Server, just to save your time, but when you consider the diversity of data sources and technologies, Kettle is even more sociable than the BI Server. Just have a look at the Kettle plugin list to see what I mean. I believe it can run data to earth from any data source you want.
   And now imagine you tie these great tools together. You obtain fine-grained control over Kettle jobs. You can ask the user for parameters before passing them to a Kettle job. You can use a Kettle job as a smart data source for your reports. Using the Java Quartz library you can schedule Kettle jobs to run with one-second precision (for comparison, Linux cron allows only one-minute precision).
   But many people give up this power and run Kettle as a separate Java process. I saw a lot of unanswered questions on forums about running Kettle jobs in the BI Server. Up to Kettle version 4.0 some bugs stood in the way of using the BI Server and PDI together in an enterprise environment. An example is the call to System.gc() in every Kettle job.
   Below is some advice for those bold spirits who decide to integrate these tools.

Setting up Kettle repository
   First you need to show the BI Server where the Kettle repository is. The easiest way to do that is to configure the repository settings with the Spoon tool. After that, open the User Home folder on your machine and find the .kettle folder. You need to place .kettle in the User Home folder of the machine where the BI Server is running.
   Then open the file pentaho-solutions\system\kettle\settings.xml. Use the following settings to connect to a DB-based Kettle repository:
<repository.type>rdbms</repository.type>
<repository.name>kettle_repo</repository.name>
<repository.userid>admin</repository.userid>
<repository.password>admin</repository.password>
   admin/admin are the default user name and password.

That's it.

Custom plugins.
   If you have custom Kettle plugins, you need to put them in one of the following places: <PATH>/plugins, <User Home>/.kettle/plugins, <pentaho-solutions>/system/kettle/plugins.

Running jobs
   And now, to run a Kettle job, just add an action like this to your action sequence:

    <action-definition>
      <component-name>KettleComponent</component-name>
      <action-type>Pentaho Data Integration Job</action-type>
      <action-inputs>
        <kettle-logging-level type="string"/>
      </action-inputs>
      <action-resources/>
      <component-definition>
        <directory><![CDATA[job-dir]]></directory>
        <job><![CDATA[job-name]]></job>
      </component-definition>
    </action-definition>

To start with, use the 'detail' value for kettle-logging-level. It writes enough log output but doesn't flood the log files.

Wednesday, June 16, 2010

Week number function in MySQL, MDX and Java

Introduction
   Virtually every data warehouse has a date dimension. Also, almost every business user requires weekly reports (along with daily, monthly and others). At this point the report developer often uses functions that return the number of the week in a year. Trouble may arise if you decide to combine functions from different languages and technologies.

   In this article I'll compare week functions in Java, JavaScript, MySQL and MDX (Mondrian). You'll see that without proper attention the same date could end up in different weeks when different technologies are used together. For example, January 1, 2010 could be week 0 in MySQL, week 1 in Java, and week 53 somewhere else.

   If you are interested in the details of generating a date dimension, I would recommend reading Roland Bouman's article on this topic.

Java: Calendar
   Java provides the class java.util.Calendar for working with dates. The Calendar class has a constant WEEK_OF_YEAR for getting the week number. Calendar also has two properties for configuring the week number calculation: minimalDaysInFirstWeek and firstDayOfWeek. The default values of these properties are locale-dependent.

   Details:

firstDayOfWeek - the first day of the week (Sunday, Monday, ..., Saturday).
minimalDaysInFirstWeek - the minimal number of days of the first week that must fall in the new year.
The first week has number 1.

Example:

java.util.Calendar calendar = java.util.Calendar.getInstance();
java.util.Date date = (new java.text.SimpleDateFormat("yyyy-MM-dd")).parse("2010-12-31");
calendar.setTime(date);
int week = calendar.get(java.util.Calendar.WEEK_OF_YEAR);
System.out.println(week); // prints 52 with Monday-first, 4-minimal-days settings; the result depends on the default locale
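
To make the effect of the two properties visible without depending on the machine's locale, the following sketch sets them explicitly. The class and helper method names are mine; only the java.util.Calendar calls are standard API.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;

class WeekSettingsSketch {
    /** Week of year for a date under explicit week rules instead of the locale defaults. */
    static int weekOfYear(int year, int month, int day, int firstDayOfWeek, int minimalDays) {
        Calendar calendar = new GregorianCalendar();
        calendar.clear();
        calendar.setFirstDayOfWeek(firstDayOfWeek);
        calendar.setMinimalDaysInFirstWeek(minimalDays);
        calendar.set(year, month, day);
        return calendar.get(Calendar.WEEK_OF_YEAR);
    }

    public static void main(String[] args) {
        // US-style rules: weeks start on Sunday, week 1 is the week containing January 1.
        System.out.println(weekOfYear(2010, Calendar.JANUARY, 1, Calendar.SUNDAY, 1)); // 1
        // ISO 8601-style rules: weeks start on Monday, week 1 needs at least 4 days in the new year.
        System.out.println(weekOfYear(2010, Calendar.JANUARY, 1, Calendar.MONDAY, 4)); // 53
    }
}
```

The same date, January 1, 2010 (a Friday), lands in week 1 or in week 53 depending only on these two settings.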

MySQL: WEEK
MySQL has a WEEK function. It also lets the user change the calculation algorithm, but it's not as generic as in Java.

In MySQL the user can choose between two main approaches for determining the first week of the year:

It is the first week that starts in this year, that is, the week containing the first Sunday or Monday of the year. (American approach)
It is the week that contains January 1, if more than 3 days of this week are in the new year. (European approach)

   This discussion may help to understand that.
   If a date falls in the last week of the previous year, MySQL may return 0 in certain configurations.

   Example:

select WEEK(STR_TO_DATE('Dec 31, 2009', '%b %d, %Y'));    -- 52
select WEEK(STR_TO_DATE('Jan 1, 2010', '%b %d, %Y'));     -- 0

select WEEK(STR_TO_DATE('Dec 31, 2009', '%b %d, %Y'), 3); -- 53
select WEEK(STR_TO_DATE('Jan 1, 2010', '%b %d, %Y'), 3);  -- 53
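
If you need the same numbers on the Java side, the two modes used above can be reproduced with java.time. This is a sketch under my reading of the MySQL rules (mode 0: Sunday first, range 0-53; mode 3: ISO 8601 weeks); the class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.temporal.IsoFields;

class MySqlWeekSketch {
    /** Mode 0: weeks start on Sunday; days before the first Sunday of the year are week 0. */
    static int weekMode0(LocalDate date) {
        int daysSinceSunday = date.getDayOfWeek().getValue() % 7; // Sunday=0 ... Saturday=6
        return (date.getDayOfYear() - daysSinceSunday + 6) / 7;
    }

    /** Mode 3: ISO 8601 week number (Monday first, week 1 has at least 4 days in the year). */
    static int weekMode3(LocalDate date) {
        return date.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
    }

    public static void main(String[] args) {
        LocalDate dec31 = LocalDate.of(2009, 12, 31);
        LocalDate jan1 = LocalDate.of(2010, 1, 1);
        System.out.println(weekMode0(dec31) + " " + weekMode0(jan1)); // 52 0
        System.out.println(weekMode3(dec31) + " " + weekMode3(jan1)); // 53 53
    }
}
```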

MDX: DatePart.
MDX uses the Visual Basic function DatePart. It allows setting the first day of the week to any day you like. For the first week of the year it provides the following options:

Use system setting
Start with the week in which January 1 occurs (default)
Start with the week that has at least four days in the new year
Start with the first full week of the new year

JavaScript
   JavaScript doesn't have a standard function for the week number. So be careful when writing your own or using functions provided by JavaScript libraries, especially when you want to use them together with the Java, MySQL or MDX functions.

How to use them together?
   The most straightforward approach is to avoid using them together. You may, for example, pre-generate the date dimension with the help of Java. When you later need to determine the week number of a date, you may use either Java or the date dimension table.
   However, if you still need to use week number functions from a mix of technologies, be sure that you understand how each function works, and configure them properly.
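
As a minimal sketch of the pre-generation idea, the code below produces one row per day with its ISO 8601 week. The Row layout and all names are hypothetical and much simpler than a real date dimension; choosing ISO weeks is my assumption, so pick whichever week rule your reports actually need.

```java
import java.time.LocalDate;
import java.time.temporal.IsoFields;
import java.util.ArrayList;
import java.util.List;

class DateDimensionSketch {
    /** One row of a simplified, hypothetical date dimension. */
    static final class Row {
        final LocalDate date;
        final int weekYear; // year the ISO week belongs to
        final int week;     // ISO week number, 1..53

        Row(LocalDate date) {
            this.date = date;
            this.weekYear = date.get(IsoFields.WEEK_BASED_YEAR);
            this.week = date.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
        }
    }

    /** Generates one row per day from start (inclusive) to end (exclusive). */
    static List<Row> generate(LocalDate start, LocalDate end) {
        List<Row> rows = new ArrayList<>();
        for (LocalDate d = start; d.isBefore(end); d = d.plusDays(1)) {
            rows.add(new Row(d));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Print the rows around the 2009/2010 year boundary.
        for (Row row : generate(LocalDate.of(2009, 12, 30), LocalDate.of(2010, 1, 6))) {
            System.out.println(row.date + " -> " + row.weekYear + "-W" + row.week);
        }
    }
}
```

Once the rows are loaded into a table, every tool can look the week up by date instead of computing it with its own function.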

Saturday, May 15, 2010

Virtual development environment: Part II

My previous post describes some options for using virtualization. In this post I'll dig deeper into the details of using virtual machines.

Virtualization in software development process

A Windows host runs several virtual guests with Ubuntu. The virtual guests run as services. One of the guests, let's say Project 1, is configured to start 1-2 minutes after Windows starts. So I have it always running and available at any time. Other guest machines that I don't use are turned off to save memory but are ready to be started at any time. The guest machines have a static IP address and run an SSH server, a Samba server, a MySQL server and Apache Tomcat with my web application. Have a look at the diagram below for a clearer picture of my configuration.

My web application for Project 1 is always available over HTTP, and I can connect to it with a remote debugger at any time. All files of the web application are shared with the Windows host through the Samba server.

Virtual Network

VMware provides several ways to set up a virtual network. I've chosen the one based on NAT. It hides your virtual machines from the outside world behind the virtual NAT. However, you can set up port forwarding to open your web application to the outside world.

In this article I won't cover the network setup process or static IP address configuration on the guest machine. There are a lot of resources on these topics. And if you are not familiar with Linux yet, it would be cruel of me to steal your wonderful trip through Linux tutorials, blogs and first steps on the command line by replacing exciting discoveries with a few predefined instructions ;).

SSH, Putty and Samba

Setting up an SSH Server on your guest is like opening a door to your development environment and to the Linux world.

There are two nice tools for Windows: Putty and WinSCP. They allow you to connect to the guest over the SSH protocol. With Putty you can run any command on the guest, and with WinSCP you can navigate the guest machine's file system and transfer files.

While WinSCP gives you a lot of power when working with Linux, it has some issues. One example is that WinSCP always copies a remote file to a local temp folder before opening it. As a result, it's impossible to use TortoiseSVN over WinSCP. If you find WinSCP inconvenient, try Samba.

Samba opens the Linux file system to Windows networks. If you additionally map the shared Linux folder as a network drive, you get seamless access to your virtual machine and can use your favorite Windows tools to work with Linux files.

Issues or "A cat in gloves catches no mice"

One of the most interesting issues I had with VMware was the so-called "95% issue": a virtual machine fails to start because it hangs at 95%. After reading all 10 pages of the VMware support forum, with lots of requests for help and a common answer from VMware, "Your platform is not supported", my first intention was to stop using VMware and switch to Sun's VirtualBox. Many people said that the issue was resolved by changing disk drivers. I experimented with them, but with no luck. Only when I started recalling all the changes I had recently made to the computer did I come to the solution. It was the WideCap tool conflicting with VMware Server. After uninstalling WideCap, the "95% issue" was resolved.

I also had an issue using TortoiseSVN on drives shared with Samba; however, it can be resolved through the security settings in the Samba configuration.

Friday, April 9, 2010

Virtual development environment: Part I

In this first post I'll share my experience of using virtualization platforms in the software development process.

Why use virtualization?

I've started working on several open source and custom projects at the same time. This forced me to think about optimizing my computer for more comfortable and faster work, and especially about getting rid of the mess I had on my local MySQL server: it had around 30 database schemas, and it was difficult to find the one I needed for a project. Below are the main issues I was solving:

Work on several projects simultaneously.
Explore multiple open source projects and compare different versions. I'm trying to keep track of the new features that appear in every version of Pentaho BI Server, Jasper Server and BIRT.
Keep separate databases for every project. This helps to get rid of the mess of multiple schemas on the same database server.
Learn other operating systems and test software in different environments.

How to use virtualization?

Initially I thought of a virtual development environment as a separate virtual machine for every project, with a standard set of tools installed, such as SVN, Java, Eclipse and Tomcat. But after trying to work with virtualization platforms I came to the decision that it's better to virtualize only the server part, without a GUI.

Now I open all GUI tools on my host machine, and the virtual machines play the role of GUI-less servers. So a typical virtual machine runs Linux, MySQL and Tomcat with a web application.

One reason is that you will tend to move all programs like the browser and mail client either to the guest machine or to the host, because it's not convenient to switch back and forth often for copy-pasting. It's easier to use a single machine where you have the browser, mail client, instant messengers and IDE.

Another reason is that most virtualization platforms don't allow the guest machine to use the graphics accelerator card. And when you run your favorite IDE on a machine without a graphics accelerator, it looks quite poor, and most probably you will not like working with it. Fonts are less clear and everything else looks less attractive.

Which virtual platform to choose?

When choosing a virtualization platform I considered the following options:

VMware (Player, Server, Workstation, ESXi)
VirtualPC
VirtualBox
Xen

I've chosen VMware Server because it's free and widespread. The thing I like most about VMware Server is that it runs as a Windows service, and you never see it either on the taskbar or in the system tray. My guest machines are configured to start after Windows has started. So they never catch my eye, yet they are always available via the virtual network.

Xen and VMware ESXi are more complicated to set up and use. Also, I'm not sure whether they would help with the graphics accelerator issue.

The results.

Here is the final configuration that I use:

Host machine:

OS: Windows Vista Home Premium 64-bit. Windows was preinstalled on my notebook and I'm not yet ready to replace it with Linux.
Software: VMware Server 2.0.2
RAM total: 4 GB
RAM used by Windows Vista: 2 GB. But I must admit that Windows releases some of the allocated memory when other applications require it. Up to 500 MB, I suppose.

Guest machine:

OS: Ubuntu Server 9.10 32-bit.
Software: Java, SSH server, MySQL, Glassfish, Pentaho
RAM used by Ubuntu Server: less than 50 MB.
RAM total used when Glassfish is in debug mode: up to 1-1.2 GB.
VMI support.

Now I use only one always-running guest machine. But VMware Server is smart enough to allocate only as much RAM on the host machine as the guest's Ubuntu actually uses. As a result I'm able to run 2-3 virtual machines on my host, some of them with a GUI. Ubuntu Desktop uses about 150 MB.

I have found my virtual development environment worth the time I spent on it. However, I still need to confirm the advantages, because I've been working in this environment for only two months.

In the next post I plan to share the issues I found while setting up virtual machines, along with their solutions.