Vasili Vrubleuski on Buisness Intelligence

Tuesday, February 7, 2012

OLAP tools review: JPivot

This is a second article from "OLAP tools review" series. To read the first one open this link.

JPivot is a free opensource OLAP client. It's slightly outdated, but still in some aspects outperforms modern tools. JPivot is bundled into both Pentaho BI Server and JasperReports Server. Also JPivot is available in Mondrian as a default UI.

At first glance

JPivot may look a little complex for people not familiar with OLAP. The goal of this article is to show the simplicity and the power of JPivot. Modern tools don't offer anything significantly better then JPivot.

We will work with sample data warehouse supplied with Pentaho BI Server called SteelWheels.

When you first open JPivot you'll see a picture like that:

We see a toolbar with a set of buttons and a table under it. The table shows us requested data, and buttons helps to navigate through this data. All columns you see under Measures caption are aggregates of data warehouse Measures columns. Other columns are Dimensions. As I've told in the previous article OLAP tools tends to start analysis from grand total values.

At first glance it's difficult to tell what these numbers are about. It's very helpful to understand what individual line from a data warehouse means, which part of business process it represent. This information can be obtained from data warehouse developer, but so as we can't talk to developer now, we will drill down to details and will try to figure out ourselves. If you want to look at most fine-grained line from data warehouse find the last image in this article called Drill through example.

Result table setup

From all buttons on toolbar, the most important is the first one from the left, called "Open OLAP Navigator". OLAP Navigator allows you to setup result table: to choose what to display on rows, what on columns, which measures to use, and how to filter the data.

Let's put Time on Columns, Leave Markets on Rows, and move everything else to Filters, after you press OK button you should see this result:

Note that we still see same total number 105,331. It's a total Quantity of ordered product items. Total haven't changed because we haven't filter data yet. You can filter data by clicking on Dimension of Measure name in OLAP Navigator. A list of values for the dimension will be opened from which you may select values you need to filter by. Also try to open Measures to see different Measures which available there.

Changing detail level

The most important feature of OLAP system is ability to drill down (open details). JPivot supports 4 modes: Drill Member, Drill Position, Drill Replace and Drill Through. By default Drill Position mode is active. And in this mode if you click on plus sign before "All Markets" member JPivot opens a list of markets, displaying values for All Markets as a first row. Next picture illustrate that:

Total number for all markets still 105,331. And if you sum values for each market (APAC, EMEA, Japan and NA) you'll get this total value. So drill down shows you details from which total value is comprised.

Here is description of all drill down modes:

Drill Member. Show Total and Details. Replace all occurrence of Total Member with combination of Total and Details.
Drill Position. Show Total and Details. Replace single occurrence (selected by user) of Total Member with combination of Total and Details.
Drill Replace. Show Details only. Replace all occurrence of Total Member with Details.
Drill Through. Shows individual lines from data warehouse table which comprise the selected value.

Actually Drill Member and Drill Position are very similar. If you want to understand the difference between them, try to add several dimensions on rows and look what happens when you drill down first dimension and then second.

I like Drill Replace mode best and would made it a default mode in JPivot. That's because in this mode you may create a nice table which will not intermix totals and details all together. By looking at a table below you may easily compare how Quantity Measure changes for each market over years. And in addition to table I added a chart which visualize values form table. Chart isn't difficult to add, and I'll not describe it in this article.

And now the time come to understand what business process this data describes. Usually this is a first thing you are interest in, but to make explanations easier we come to this feature only now.

Let's turn on Drill Through mode and click on a small value at the intersection of Japan and 2005. We are expecting to see a list of individual lines from data warehouse table, and selected the smallest value to not overload the system. Here is the result:

Drill through example

By looking at first line we can conclude that Some customer form Japan, Tokyo, Minato-ku called "Tokyo Collectables, Ltd" ordered a product of line Classic Cars called "1970 Triumph Spitfire" in 2005, QTR1, Jan in number of 48 items and this order was Shipped.

Now we understand data much better, we may conclude that all Quantities which we have seen in tables are Ordered items which not necessary was shipped to a customer. To see the actually shipped Quantities we need to filter data by Order Status - Shipped.

Calculations

Calculations in JPivot can be archived by manually writing MDX formula. Average user will never learn MDX, so we will not even consider this here. But ability to manually enter MDX expressions is very desired feature for advanced users, unless all MDX power will be completely replaced by user friendly UI.

Hopefully I've showed you how powerful JPivot is, nevertheless it's interface is complicated.

Saturday, February 4, 2012

OLAP tools review: Basis for comparison of OLAP tools.

This is a second article from "OLAP tools review" series. To read the first one open this link.

There are two important sides of any OLAP systems: "Which questions can be addressed to OLAP?" and "Which answers can be received from OLAP?" In this article we'll review these topics in details.

Questions to OLAP System

All questions to OLAP Systems are about figures. As we know, OLAP systems are intended to answer business questions like: "What's a company revenue this year?" or "How does it compare to a previous year?". These questions can easily be answered by searching the appropriate information in a Data Warehouse. More advanced question will be: "Which revenue the company may expect next year?". Existing OLAP usually answer this question by calculating a formula provided by an expert.

Rather complex questions can be constructed with modern OLAP systems. For building these complex questions OLAP provides a form of conversation between a system and a user. Suppose you asked for a comparison bar chart for expenses on entertainments for each month of this year. Then next question may simply be "Why in April I spent so much?". System should remember the context of conversation and present to you a list of items under entertainment category for April of this year.

When complex question is built, OLAP usually provide an ability to save it, and then quickly run next time.

Answers of OLAP System

OLAP Systems gives answers in terms of figures. It can be a single number, but usually it's a row of numbers of a table of numbers. For our expenses example a table may contain months by rows and expenses categories by columns. See table below as example. A table helps you to compare numbers to each other. Each number will usually represent an aggregate value. The aggregates are not necessary should be sums, they may be average values, counts, ratios, and a lot more types.

	Categories
Months of 2011	Food	Entertainment	Car
January	$407	$224	$734
February	$451	$311	$354
March	$376	$74	$579

So we can classify all answers by dimensionality. First three types of answers are used in practice, and the forth I introduce as a possible direction of improvement of existing OLAP systems:

Single number. Example: Total expenses in May.
List of numbers. Example: Total expenses in each month of a year.
Table of numbers. Example: Total expenses in each month of a year by categories.
Several tables of numbers. Suppose we have a slider component for changing date and table will display data for that day. This may be considered as set of tables.

Answers visualization

A huge role in easy comprehension of these numbers plays visualization. Instead of reviewing a table of figures for several minutes a single glance at a chart or graph will tell you everything you need in a moment. As example system may show you a bar chart with expenses for each month of a year, and you'll easily identify the higher bar which denotes a month with the biggest expenses.

Let's think about visualization of different types of answer dimensionality:

Single number. Visualization usually is not required, because nothing to compare this number with.
List of numbers. Many ways to visualize (A bar chart, a pie chart, line chart)
Table of numbers. Still many ways for great visualization (A stacked-bar chart, multi-line chart, many pie charts)
Several tables of numbers. Difficult to visualize, an animation may help.

Initial context of conversation

OLAP Systems can answer very complex questions, and it's difficult to classify them all. But OLAP usually supports conversations. That means user instead of writing a complex question is asking a series of short questions which leads to desired result. Along this process system should remember the context of conversation.

Many OLAP systems start conversation from grand total values and allow user to "drill down" to details. And it's reasonable that user wants to see the whole picture at first and then investigate the details. But it's also true that business users prefer to look at current situation, like expenses for last year. And seeing total expenses looks useless.

On the other hand, user which opens OLAP system for the first time, may be unfamiliar with Data Warehouse structure and it may be very helpful to see several most recent rows with fine-grained data from DW, and then start data exploration in a bottom-up fashion. But I don't remember seeing OLAP system which adopts bottom-up style of navigation.

Navigation in figures ocean

Common operations which user can perform in OLAP system can be split into following classes: Result table or chart definition, data filtering, changing detail level, calculations.

Result table or chart setup:

Choose the measures. Example: Show total expenses.
Choose Dimension or combination of Dimensions for columns. Example: Compare expenses by categories.
Choose Dimension or combination of Dimensions for rows. Example: Compare expenses by categories(on columns) and by months(on rows).
Sort data in a table.

Data filtering:

Select dimension members for which to display data in the result table or chart.
Filter data in result table by members of other dimensions(which don't used in result table definition)
Filter data by Measures values, or by arithmetic formulas.

Changing detail level:

Drill level down. Example: If we see total expenses for a year "drill down" will open expenses my months of the year.
Drill level up. It’s an opposite of drill down.
Drill Through. It shows the most fine-grained rows from data warehouse which comprise selected value.

Calculations:

Calculate column or row values with user defined formula. OLAP systems from this review support such functionality in less or more extent. But I think a lot of improvements can be made to existing OLAP systems in this direction.
Calculate column or row values with a predefined function.
Change aggregation function for Measures.

Let me describe one simple idea which OLAP tools may adopt in future to simplify advanced calculations.

Do you like using Excel? I do. Excel allows entering formulas with mouse clicks. Excel suggests function names to user. Excel highlights cells used in formula with different colors. All you need to learn to start using formulas in Excel is that formulas start with '=' sign.
Implementing functionality like that in OLAP system opens the door to completely new usage scenarios for OLAP systems. Suppose accountant use data warehouse to calculate taxes, and don't need to ask technical specialist to change some formula every time tax rate or tax policies changes.

In the next article I'll review JPivot tool.

OLAP tools review: Introduction

I'm impressed. And this impression finally pushed me to start these series of articles about Data Analysis Tools. I'm going to review and compare several OLAP Systems. Actually the compared set of tools is now limited to java web application which I had a chance to work with as a Buisness Intelligence consultant. They are: JPivot, Jaspersoft OLAP, Pentaho Analysis and may be I'll add something in future.

So what did impress me? Today it was a Pentaho Analyzer Report (Pentaho Analysis) which is a part of Pentaho BI Server Enterprise Edition. I really like how its usability was improved. It's not an advertisement article, so don't stop reading it after a praise to Pentaho, in the review I'll try to look deeper to functionality of each tool and as well into some technical details and I'll try to be unbiased. All in all most of these tools have very similar functionality. And also I want to find directions for these tools to evolve.

The purpose of OLAP systems.

Let's look at OLAP from different sides. At first let's look at OLAP as at Decision Support Systems (DSS). I'm not going to give a strict definition, but essentially what Decision Support Systems used for? To help people make decisions. How do they help? Decision Support System provides information required by the person for making a decision. In this sense Google Search is the best Decision Support System, it brings you the information required for making decisions. But Google Search can't help us calculate company's expenses for last month. Then let's refine our definition for OLAP. OLAP are business-oriented systems which work with figures.

By classification of Daniel Power we deal with data-driven DSS. A data-driven DSS provide operational and strategic business intelligence using internal company data and sometimes external data. You may read more about it in Daniel's article, but we go further.

Also I want to compare DSS to Expert Systems. Decision Support Systems provides the information required by an expert for making a decision. In contrast to them, Expert Systems make decisions themselves and provide explanations to an expert. The point of intersection between them is an understanding how an expert explores the information in the process of making a decision. An understanding of expert's needs is a key to improve existing DSS systems.

To summarize, we talk here about OLAP systems which are kind of data-driven Decision Support Systems.

OLAP and Data Warehouse intuition

Let me suggest you an intuition behind OLAP Systems.

Suppose you earn a lot of money but you have found out that they all get spent somewhere. What will be the solution? I would suggest making notes of your expenses. If you open an Excel sheet and create a table, like one shown below, and will start filling it every day, you may become proud that you have build your first Data Warehouse.

Date of purchase	Category	Name of product or service	Cost
1 Jan 2010	food	cheese	$5.45
1 Jan 2010	food	milk	$3.30
2 Jan 2010	entertainment	visit to theater	$160.00
3 Jan 2010	car	gasoline	$50.00

Now after a month of doing these records you may calculate many useful aggregate values like: total expenses for this month, total expenses for each of categories, total expenses for gasoline. This is essentially what OLAP systems do for you. They calculate aggregated values filtered by different criteria.

That is to say you'll be able to ask OLAP about sums of these expenses or about lists of individual expense records. Like: "How much money was spent this month?", "How much money was spent last month for a car?", "Show me the bar chart with comparison of expenses in each month of this year." As well as much more complex questions.

Note that you always aggregate values from cost column. Such columns as cost are called Measure columns. There is no reason to sum values from other columns in our table. As example there is no sense to sum product names. Other columns are used to group or to filter costs: by month, by year, by product category. These grouping and filtering columns are called Dimensions (and sometimes they called Filters).

It's important for the user of OLAP system to have a clear understanding what is represented by individual line of a Data Warehouse table. In our case it's a single purchased item details. But when you look at these data from OLAP, you usually see sums of these items like total cost for some period of time. Usually the easiest task in OLAP is to see a grand total over all items, and a little more time consuming is to navigate to detailed subset of these items.

Dimension structure

OLAP system usually deals with large volumes of data. And the diversity of Dimension values can grow large. To help users navigate data, OLAP system allows structuring Dimension values into a tree form. The most popular example is a Date Dimension, which can have branches for each year, then each year will have a branch for quarters, each quarter branches for month, months will have leaves for days.

In the next article I'll try to build a basis for reviewing and comparison of OLAP tools.

Friday, February 25, 2011

Wrapping long text in Pentaho html reports in IE7

   To understand the issue just open pentaho demo report in IE7 and for comparation in FF of Chrome.
   The report is located in Steel Wheels/Reporting and is called "Order Status". It contain a long line of text which is not wrapped in IE7 and wrapped in other browsers.

   Internet Explorer 7:

Firefox:

There should be a dozen of ways to wrap this text, but I want to show how simple it can be resolved with CSS.

   Actually Pentaho report engine do a lot of things to wrap the long line: it sets <table> element with, <col> elements widths, it even wrap each line of data into span element. But IE7 just write it in single line.
   One promissing way to overcome this was to use table-layout:fixed css style. But this not helps because pentaho also have fixed height of rows. And data just trancates without moving to the next line.
   My solution is to apply block style to all spans in the table:
table td span {
   display:block;
}

It simple and it works like a charm. If you need you can empower the rule with more strict constraints to keep other tables save.

Sunday, September 26, 2010

JFreeChart in Firefox

   If you work with Pentaho via Firefox you probably upset with a quality of charts. Flash charts look just great. I'm talking here about JFree charts. This issue affects Firefox only! IE works a bit better in this case.

JFreeChart
   JFreeChart have several advantages over Flash Charts. JFreeChart is a java library that produces images in various formats: gif, jpeg, png. As a result these charts can be incorporated into PDF, Excel and RTF reports. Second advantage comes from Apple - Adobe Flash war. Flash charts will not work on iPhone and iPad. So there are some reasons to use JFreeChart.

Blurred Charts
   The case when charts in Pentaho are blurred is common enough. I designed a report with a chart in it with the latest Pentaho Report Designer 3.6.1, placed it on a dashboard, where it produced in HTML format, and added pdf export button. When I open my dashboard in Firefox the chart looks blurred. This is mainly concern to any text that is on the chart. Ok, you can understand these charts, but this is not a thing your customer will like. On the other hand charts exported to PDF looks very neat. Also when opened with Internet Explorer charts looks nice.

How to fix it?
   The problem here is in the way Firefox scaling images. Chart images are usually in high quality, and their size can be twice larger then the place for a chart on html page. So Firefox forced to downscale these images. And here, for the blame of Firefox, it can't do that nice. Firefox just don't have a good image scaling algorithm. You can check a bug, which wasn't fixed at the time of this writing.
But the good news is that Firefox have a "bad" images scaling algorithm that can correctly downscale our chart! I'll explain it:
<img> HTML element have a property image-rendering. In Firefox this property can be set to any of 5 values: auto, inherit, optimizeSpeed, optimizeQuality, -moz-crisp-edges. Currently only two rendering algorithms are used. Bilinear resampling for values: auto and optimizeQuality. Nearest neighbor resampling for optimizeSpeed and -moz-crisp-edges values. And here is a trick: Nearest neighbor resampling works better to images with sharp edges.
So all you need is just add following style for all chart images:

<style type="text/css">

img {

    image-rendering: -moz-crisp-edges 

}

</style>

This is how looks blurred image(you will see blurring in Firefox only):

And here is how looks sharp image(for me it's looks even better then in IE):

Sharp image can be rendered with problems if you changed zoom in a browser.

Why pentaho can't fix it?
Ok, I don't know. May be it consider this bug as not important. But it is certainly a disadvantage, which may scare some potential customers. JasperReports don't have this issue, however I don't sure which way they overcome it.

Wednesday, August 25, 2010

Implementing while loop in Kettle

I ran into the problem of 'while' loops in Kettle when I was loading data from Yahoo Store API. Yahoo web service allows you to get up to 100 orders pre request. But it don't tell you how many orders it have at all. So I'm sending requests until I get a special response that there are no more orders there.
There are several ways to implement loops in Kettle:

To use "Execute for every input row" flag in job or transformation.
To circle hops in a job.
To use "Repeat" flag in Start step of the job.

   Every of these approaches have its strengths, weaknesses and area of applicability. Lets consider each of them.

Using "Execute for every input row" flag in job or transformation.
   This is the most safe, correct and native way to implement loops in Kettle, but to use it you need to know in advance how many times you want to run the job inside a loop. If you can get or calculate the number of iterations in advance - just use this approach. For more information about it I can sent you to Slawomir Chodnicki blog article about it.

Circling hops in a job.
   This approach is not safe. When using it keep in mind that loop depth can't be too big. If you broke this rule you risk to get StackOverflowError. That is because Kettle use recursive method calls when running this kind of jobs. So if you believe your loop will not exceed say 10,000 or 100,000 iterations, depending on StackSize settings in your JVM you can run the loop in this way.
   For more information about StackOverflowError in Kettle see this JIRA issue(http://jira.pentaho.com/browse/PDI-1463). And more info about implementing this loop you can find in another article of Slawomir Chodnicki.

Using "Repeat" flag in Start step of the job.
   I recommend this approach if you can't use both above. By checking 'Repeat' flag in Start step of a job you can easy make job running forever. More important question is how to stop it! Ok, The only way to do that in 'Out of the box' Kettle(even 4.0) is to use Abort step. It stops the job and writes an error message in log that job is finished with errors. But that may not be the case! I want my job to stop normal, successful, without errors. Why can't I do that? Why should I flood the log with ERROR messages when actually no errors occured? For that reason I implemented a simple plugin called 'Stop Job' that stops repeating job without writing error messages in log.
   You can use this plugin in the same way you use Abort step. And you can add a message which will be written to log with log level BASIC.
Be aware that when using Repeat flag, the job will repeat to run over and over even when one of it steps fails. To stop the job in this case use Abort job step as depicted at the image below.

Here is a Stop Job plugin that you are free to use and modify.

And update 11 April 2017:
Instead of using Stop Job plugin you may simply use JavaScript steps in jobs and transformations.
Use this code in job's JavaScript step to stop job execution without failing it's result:
previous_result.setNrErrors(0);
previous_result.setResult(true);
parent_job.stopAll();

And here is a code for transformation JavaScrpt step:
_step_.stopAll();
Feedback is welcome, because I'm just starting to validate these solutions.

Thursday, July 22, 2010

Running Kettle jobs from BI Server

Most power of Pentaho comes from integration. BI Server is an integration point of multitude tools, technologies and projects. This is one of the secrets of Pentaho success. Pentaho provides access to database systems, allows developer to choose between data manipulation languages(SQL, MDX, MQL), allows to perform data analysis using WEKA data mining tool, suggests different ways for incorporating business logic, helps to perform some actions like sending e-mail based on business decisions and presents the results to user in form of charts and reports.
   PDI by itself is also a great consolidating tool. I won't praise it as much as BI server just to save your time, but when consider diversity of data sources and technologies Kettle is even more sociable then BI Server. Just have a look at Kettle plugin list to get it. I believe it can run data to earth from any data source you want.
   And now imagine you tie these great tools together. You obtain fine-grained control over Kettle jobs. You can ask user for parameters before passing them to Kettle job. You can use Kettle job as a smart data source for your reports. Using java quartz lib you can schedule Kettle jobs to run with a second-precision (for comparison linux cron allows only minutes-precision).
   But many people abandon this power and run Kettle as a separate java process. I saw a lot of unanswered question on forums about running Kettle jobs in BI Server. Up to Kettle version 4.0 some bugs was in the way of using BI Server - PDI couple tied together in enterprise environment. An example is a call of System.gc() in every Kettle job.
   Below are some advices for those bold spirits who decide to integrate these tools.

Setting up Kettle repository
   At first you need to show BI server where Kettle repository is. The easiest way to do it is to configure repository settings with Spoon tool. After that open User Home folder on you machine and find .kettle folder. You need to place .kettle to User Home folder of the machine where BI Server is running.
   Then open file pentaho-solutions\system\kettle\settings.xml. Use following settings to connect to DB-based kettle repository:
1. <repository.type>rdbms</repository.type>
2. <repository.name>kettle_repo</repository.name>
3. <repository.userid>admin</repository.userid>
4. <repository.password>admin</repository.password>
   admin/admin are default user name/password.

That's it.

Custom plugins.
   If you have custom kettle plugins you need to put it in one of following places: <PATH>/plugins, <User Home>/.kettle/plugins, <pentaho-solutions>/system/kettle/plugins.

Running jobs
   And now to run Kettle job just add an action like that to your action sequence:

    <action-definition>
      <component-name>KettleComponent</component-name>
      <action-type>Pentaho Data Integration Job</action-type>
      <action-inputs>
        <kettle-logging-level type="string"/>
      </action-inputs>
      <action-resources/>
      <component-definition>
        <directory><![CDATA[job-dir]]></directory>
        <job><![CDATA[job-name]]></job>
      </component-definition>
    </action-definition>

For start use 'detail' value for kettle-logging-level. It writes enough logs but don't floods log files.