Archive for the ‘Kettle’ Category

Slides from my MySQL UC 2010 presentation

Апрель 27th, 2010

As requested by a few fans out there, here are the slides of my presentation:

Pentaho Data Integration 4.0 and MySQL.pdf

I had a great time at the conference, met a lot of nice folks, friends, customers, partners and colleagues. After the conference I was unable to get back home like so many of you because of the Paul Simon singing Eyjafjallajökul volcano in Iceland.

So I ended up flying over to Orlando for a week of brutal PDI 4.0 RC1 hacking with the rest of the l33t super Pentaho development team.  However, after 2+ weeks from home, even a severe storm over Philadelphia couldn’t prevent me from getting home eventually.

Until next time,
Matt


PlanetMySQL Voting: Vote UP / Vote DOWN

MySQL User Conference 2010

Апрель 10th, 2010

Dear Kettle and MySQL fans,

Next week I’ll be strolling around the MySQL user conference in Santa Clara.  Even better, I’ll be presenting Tuesday afternoon (3:05pm).  The topic is Pentaho Data Integration 4.0 and MySQL.

The presentation will show you what the world’s most popular open source data integration tool can do for a MySQL user.  It will include practical examples and will showcase the latest improvements present in the brand new version 4.0.

Even more than the presentation itself, I’m looking forward to meeting you all over there.  The regular crowd, MySQL users, Pentaho partners, folks from Calpont, Continuent, SQLStream and many others but also the many new colleagues in San Francisco.

More than anything else I’m looking forward to hear about your Kettle successes and real-world data integration war stories.  If you want to chat about Kettle 4, see things first-hand or simply join me for a beer, don’t hesitate to ask.  I’ll try to regularly tweet my whereabouts at the user conference so you’ll know where I’m at.

Let me finish with a note to everybody that promised me beer in return for features and bug fixes: it’s payback time!

See you soon at the conference!
Matt


PlanetMySQL Voting: Vote UP / Vote DOWN

Writing another book: Pentaho Kettle Solutions

Март 14th, 2010
Last year, at about this time of the year, I was well involved in the process of writing the book Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL" for Wiley. To date, "Pentaho Solutions" is still the only all-round book on the open source Pentaho Business Intelligence suite.

It was an extremely interesting project to participate in, full of new experiences. Although the act of writing was time consuming and at times very trying for me as well as my family, it was completely worth it. I have none but happy memories of the collaboration with my full co-author Jos van Dongen, our technical editors Jens Bleuel, Jeroen Kuiper, Tom Barber and Tomas Morgner, several of the Pentaho Developers, and last but not least, the team at Wiley, in particular Robert Elliot and Sara Shlaer.

When the book was finally published, late August 2010, I was very proud - as a matter of fact, I still am :) Both Jos and I have been rewarded with a lot of positive feedback, and so far, book sales are meeting the expectations of the publisher. We've had mostly positive reviews on places like Amazon, and elsewhere on the web. I'd like to use this opportunity to thank everybody that took the time to review the book: Thank you all - it is very rewarding to get this kind of feedback, and I appreciate it enourmously that you all took the time to spread the word. Beer is on me next time we meet :)

Announcing "Pentaho Kettle Solutions"


In the early autumn of 2010, just a month after "Pentaho Solutions" was published, Wiley contacted Jos and me to find out if we were interested in writing a more specialized book on ETL and data integration using Pentaho. I felt honoured, and took the fact that Wiley, an experienced and well-reknowned publisher in the field of data warehousing and business intelligence, voiced interested in another Pentaho book by Jos an me as a token of confidence and encouragement that I value greatly. (For Pentaho Solutions, we heard that Wiley was interested, but we contacted them.) At the same time, I admit I had my share of doubts, having the memories of what it took to write Pentaho Solutions still fresh in my mind.

As it happens, Jos and I both attended the 2009 Pentaho Community Meeting, and there we seized the opportunity to talk to Matt Casters, chief Pentaho Data Integration and founding developer of Kettle (a.k.a. Pentaho Data Integration). Both Jos and I didn't expect Matt to be able to free up any time in his ever busy schedule to help us to write the new book. Needless to say, he made us both very happy when he rather liked the idea, and expressed immediate interest in becoming a full co-author!

Together, the three of us made a detailed outline and wrote a formal proposal for Wiley. Our proposal was accepted in December 2009, and we have been writing since. The tentative title of the book is Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration. It is planned to be published in September 2010, and it will have approximately 750 pages.



Our working copy of the outline is quite detailed but may still change in the future, which is why I won't publish it here until we finished our first draft of the book. I am 99% confident that the top level of the outline is stable, and I have no reservation in releasing that already:

  • Part I: Getting Started

    • ETL Primer

    • Kettle Concepts

    • Installation and Configuration

    • Sample ETL Solution


  • Part II: ETL Subsystems

    • Overview of the 34 Subsystems of ETL

    • Data Extraction

    • Cleansing and Conforming

    • Handling Dimension Tables

    • Fact Tables

    • Loading OLAP Cubes


  • Part III: Management and Deployment

    • Testing and Debugging

    • Scheduling and Monitoring

    • Versioning and Migration

    • Lineage and Auditing

    • Securing your Environment

    • Documenting


  • Part IV: Performance and Scalability

    • Performance Tuning

    • Parallization and Partitioning

    • Dynamic Clustering in the Cloud

    • Realtime and Streaming data


  • Part V: Integrating and Extending Kettle

    • Pentaho BI Integration

    • Third-party Kettle Integration

    • Extending Kettle


  • Part VI: Advanced Topics

    • Webservices and Web APIs

    • Complex File Handling

    • Data Vault Management

    • Working with ERP Systems



Feel free to ask me any questions about this new book. If you're interested, stay tuned - I will probably be posting 2 or 3 updates as we go.

PlanetMySQL Voting: Vote UP / Vote DOWN

Writing another book: Pentaho Kettle Solutions

Март 14th, 2010
Last year, at about this time of the year, I was well involved in the process of writing the book Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL" for Wiley. To date, "Pentaho Solutions" is still the only all-round book on the open source Pentaho Business Intelligence suite.

It was an extremely interesting project to participate in, full of new experiences. Although the act of writing was time consuming and at times very trying for me as well as my family, it was completely worth it. I have none but happy memories of the collaboration with my full co-author Jos van Dongen, our technical editors Jens Bleuel, Jeroen Kuiper, Tom Barber and Tomas Morgner, several of the Pentaho Developers, and last but not least, the team at Wiley, in particular Robert Elliot and Sara Shlaer.

When the book was finally published, late August 2010, I was very proud - as a matter of fact, I still am :) Both Jos and I have been rewarded with a lot of positive feedback, and so far, book sales are meeting the expectations of the publisher. We've had mostly positive reviews on places like Amazon, and elsewhere on the web. I'd like to use this opportunity to thank everybody that took the time to review the book: Thank you all - it is very rewarding to get this kind of feedback, and I appreciate it enourmously that you all took the time to spread the word. Beer is on me next time we meet :)

Announcing "Pentaho Kettle Solutions"


In the early autumn of 2010, just a month after "Pentaho Solutions" was published, Wiley contacted Jos and me to find out if we were interested in writing a more specialized book on ETL and data integration using Pentaho. I felt honoured, and took the fact that Wiley, an experienced and well-reknowned publisher in the field of data warehousing and business intelligence, voiced interested in another Pentaho book by Jos an me as a token of confidence and encouragement that I value greatly. (For Pentaho Solutions, we heard that Wiley was interested, but we contacted them.) At the same time, I admit I had my share of doubts, having the memories of what it took to write Pentaho Solutions still fresh in my mind.

As it happens, Jos and I both attended the 2009 Pentaho Community Meeting, and there we seized the opportunity to talk to Matt Casters, chief Pentaho Data Integration and founding developer of Kettle (a.k.a. Pentaho Data Integration). Both Jos and I didn't expect Matt to be able to free up any time in his ever busy schedule to help us to write the new book. Needless to say, he made us both very happy when he rather liked the idea, and expressed immediate interest in becoming a full co-author!

Together, the three of us made a detailed outline and wrote a formal proposal for Wiley. Our proposal was accepted in December 2009, and we have been writing since. The tentative title of the book is Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration. It is planned to be published in September 2010, and it will have approximately 750 pages.



Our working copy of the outline is quite detailed but may still change in the future, which is why I won't publish it here until we finished our first draft of the book. I am 99% confident that the top level of the outline is stable, and I have no reservation in releasing that already:

  • Part I: Getting Started

    • ETL Primer

    • Kettle Concepts

    • Installation and Configuration

    • Sample ETL Solution


  • Part II: ETL Subsystems

    • Overview of the 34 Subsystems of ETL

    • Data Extraction

    • Cleansing and Conforming

    • Handling Dimension Tables

    • Fact Tables

    • Loading OLAP Cubes


  • Part III: Management and Deployment

    • Testing and Debugging

    • Scheduling and Monitoring

    • Versioning and Migration

    • Lineage and Auditing

    • Securing your Environment

    • Documenting


  • Part IV: Performance and Scalability

    • Performance Tuning

    • Parallization and Partitioning

    • Dynamic Clustering in the Cloud

    • Realtime and Streaming data


  • Part V: Integrating and Extending Kettle

    • Pentaho BI Integration

    • Third-party Kettle Integration

    • Extending Kettle


  • Part VI: Advanced Topics

    • Webservices and Web APIs

    • Complex File Handling

    • Data Vault Management

    • Working with ERP Systems



Feel free to ask me any questions about this new book. If you're interested, stay tuned - I will probably be posting 2 or 3 updates as we go.

PlanetMySQL Voting: Vote UP / Vote DOWN

Re-Introducing UDJC

Январь 27th, 2010

Dear Kettle fans,

Daniel & I had a lot of fun in Orlando last week. Among other things we worked on the User Defined Java Class (UDJC) step.  If you have a bit of Java Experience, this step allows you to quickly write your own plugin in a step. This step is available in recent builds of Pentaho Data Integration (Kettle) version 4.

Now, how does this work?  Well, let’s take Roland Bouman’s example : the calculation of the the date of Easter.  In this blog post, Roland explains how to calculate Easter in MySQL and Kettle using JavaScript.  OK, so what if you want this calculation to be really fast in Kettle?  Well, then you can turn to pure Java to do the job…

import java.util.*;

private int yearIndex;
private Calendar calendar;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
Object[] r=getRow();
if (r==null)
{
setOutputDone();
return false;
}

if (first) {
yearIndex = getInputRowMeta().indexOfValue(getParameter("YEAR"));
if (yearIndex<0) {
throw new KettleException("Year field not found in the input row, check parameter 'YEAR'!");
}

calendar = Calendar.getInstance();
calendar.clear();

first=false;
}

Object[] outputRowData = RowDataUtil.resizeArray(r, data.outputRowMeta.size());
int outputIndex = getInputRowMeta().size();

Long year = getInputRowMeta().getInteger(r, yearIndex);
outputRowData[outputIndex++] = easterDate(year.intValue());

putRow(data.outputRowMeta, outputRowData);

return true;
}

private Date easterDate(int year) {
int a = year % 19;
int b = (int)Math.floor(year / 100);
int c = year % 100;
int d = (int)Math.floor(b / 4);
int e = b % 4;
int f = (int)Math.floor(( 8 + b ) / 25);
int g = (int)Math.floor((b - f + 1) / 3);
int h = (19 * a + b - d - g + 15) % 30;
int i = (int)Math.floor(c / 4);
int k = c % 4;
int L = (32 + 2 * e + 2 * i - h - k) % 7;
int m = (int)Math.floor((a + 11 * h + 22 * L) / 451);
int n = h + L - 7 * m + 114;

calendar.set(year, (int)(Math.floor(n / 31) - 1), (int)((n % 31) + 1));
return calendar.getTime();
}

All you then need to do is specify a return field in the Fields tab called “Easter” (a Date) and a parameter YEAR (the field to contain the year).

Screen shot of the UDJC step

The performance on my machine (Dual Core 2 Duo 2.33Ghz) is 134,000 rows/s for the JavaScript version and 450,000 rows/s for the UDJC version.  That’s over 3 times faster to do exactly the same thing.

Here is a link to the Kettle test transformation for those that want to give it a try.  As you can see, the deployment issue of having a plugin around is completely gone since now you can do anything you can do with a plugin from within the comfort of the UDJC step in Spoon.

The UDJC step uses the wonderful Janino library to compile the entered code to Java byte-code that gets executed at the same speed as everything else in Kettle.  This gives us pretty much optimal performance.

You can expect some tweaks to the UDJC step before 4.0 goes into feature freeze.  However, the bulk of the changes are in there and working great.  Thank you Daniel, for an outstanding job!

Until next time,

Matt


PlanetMySQL Voting: Vote UP / Vote DOWN

Re-Introducing UDJC

Январь 27th, 2010

Dear Kettle fans,

Daniel & I had a lot of fun in Orlando last week. Among other things we worked on the User Defined Java Class (UDJC) step.  If you have a bit of Java Experience, this step allows you to quickly write your own plugin in a step. This step is available in recent builds of Pentaho Data Integration (Kettle) version 4.

Now, how does this work?  Well, let’s take Roland Bouman’s example : the calculation of the the date of Easter.  In this blog post, Roland explains how to calculate Easter in MySQL and Kettle using JavaScript.  OK, so what if you want this calculation to be really fast in Kettle?  Well, then you can turn to pure Java to do the job…

import java.util.*;

private int yearIndex;
private Calendar calendar;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
Object[] r=getRow();
if (r==null)
{
setOutputDone();
return false;
}

if (first) {
yearIndex = getInputRowMeta().indexOfValue(getParameter("YEAR"));
if (yearIndex<0) {
throw new KettleException("Year field not found in the input row, check parameter 'YEAR'!");
}

calendar = Calendar.getInstance();
calendar.clear();

first=false;
}

Object[] outputRowData = RowDataUtil.resizeArray(r, data.outputRowMeta.size());
int outputIndex = getInputRowMeta().size();

Long year = getInputRowMeta().getInteger(r, yearIndex);
outputRowData[outputIndex++] = easterDate(year.intValue());

putRow(data.outputRowMeta, outputRowData);

return true;
}

private Date easterDate(int year) {
int a = year % 19;
int b = (int)Math.floor(year / 100);
int c = year % 100;
int d = (int)Math.floor(b / 4);
int e = b % 4;
int f = (int)Math.floor(( 8 + b ) / 25);
int g = (int)Math.floor((b - f + 1) / 3);
int h = (19 * a + b - d - g + 15) % 30;
int i = (int)Math.floor(c / 4);
int k = c % 4;
int L = (32 + 2 * e + 2 * i - h - k) % 7;
int m = (int)Math.floor((a + 11 * h + 22 * L) / 451);
int n = h + L - 7 * m + 114;

calendar.set(year, (int)(Math.floor(n / 31) - 1), (int)((n % 31) + 1));
return calendar.getTime();
}

All you then need to do is specify a return field in the Fields tab called “Easter” (a Date) and a parameter YEAR (the field to contain the year).

Screen shot of the UDJC step

The performance on my machine (Dual Core 2 Duo 2.33Ghz) is 134,000 rows/s for the JavaScript version and 450,000 rows/s for the UDJC version.  That’s over 3 times faster to do exactly the same thing.

Here is a link to the Kettle test transformation for those that want to give it a try.  As you can see, the deployment issue of having a plugin around is completely gone since now you can do anything you can do with a plugin from within the comfort of the UDJC step in Spoon.

The UDJC step uses the wonderful Janino library to compile the entered code to Java byte-code that gets executed at the same speed as everything else in Kettle.  This gives us pretty much optimal performance.

You can expect some tweaks to the UDJC step before 4.0 goes into feature freeze.  However, the bulk of the changes are in there and working great.  Thank you Daniel, for an outstanding job!

Until next time,

Matt


PlanetMySQL Voting: Vote UP / Vote DOWN

Easter Eggs for MySQL and Kettle

Январь 27th, 2010
To whom it may concern,

A MySQL stored function to calculate easter day


I uploaded a MySQL forge snippet for the f_easter() function. You can use this function in MySQL statements to calculate easter sunday for any given year:

mysql> select f_easter(year(now()));
+-----------------------+
| f_easter(year(now())) |
+-----------------------+
| 2010-04-04 |
+-----------------------+
1 row in set (0.00 sec)

Anonymous Gregorian algorithm


To implement it, I simply transcribed the code of the "Anonymous Gregorian algorithm" from wikipedia's Computus article.

You might ask yourself: "how does it work?". Frankly, I don't know. Much like a tax form, I treat the calculation as a black box. But, it's wikipedia, so it must be right, right?

A Javascript snippet to calculate easter day


I also transcribed the algorithm to javascript, so I could use it in Kettle (a.k.a. Pentaho Data Integration). Of course, nothing should stop you from using it for another environment, such as a webpage.

I don't have a proper place to host that code, so I'm listing it here:

//Script to calculate Easter day
//according to the "Anonymous Gregorian algorithm"
function easterDay(year) {
var a = year % 19,
b = Math.floor(year / 100),
c = year % 100,
d = Math.floor(b / 4),
e = b % 4,
f = Math.floor((b + 8) / 25),
g = Math.floor((b - f + 1) / 3),
h = (19 * a + b - d - g + 15) % 30,
i = Math.floor(c / 4),
k = c % 4,
L = (32 + 2 * e + 2 * i - h - k) % 7,
m = Math.floor((a + 11 * h + 22 * L) / 451),
n = h + L - 7 * m + 114;
return new Date(year,
Math.floor(n / 31) - 1,
(n % 31) + 1);
}

easter = easterDay(year);

To use this in your kettle transformations, create a stream with an field of the Integer type called year. The year field should take on the value of some year. In the step, create one output field of the Date type to take on the value of the easter script variable. (For usage outside kettle, just use the easterDay() function as you see fit)

Nice, but so what?


The thought may have crossed your mind: "So what, who cares - why should I ever want to know when it's easter day?"

Apparently, if you think like that, you don't like eggs very much. That's ok - I don't blame you. But I happen to like eggs, and people in the egg business like people that like eggs like me so they can sell them more eggs. In fact, they like selling eggs so much, that it makes a big difference to them whether their business intelligence reports say: "On March 22, 2008, we sold 10 times more eggs than on February 22 and May 22 of the same year" as compared to "In 2008, on the day before Easter, we only sold half the amount of eggs as compared to the day before Easter in 2009".

In order to report these facts, special events and holidays like easter are stored in a date dimension. (I wrote about creating a localized date dimension, a date dimension that speaks your language some time ago)

So there you go: you could use these solutions in order to build a date dimension that understands easter. The nice thing about easter is that it can be used to derive a whole bunch of other Christian holidays, like good friday, ascension, and pentecost, and in many western countries, these will be special days with regard to the normal course of business. I leave all these as an exercise to the reader, but trust me - calculating easter is the key to a solving a lot of these problems.

PlanetMySQL Voting: Vote UP / Vote DOWN

Easter Eggs for MySQL and Kettle

Январь 27th, 2010
To whom it may concern,

A MySQL stored function to calculate easter day


I uploaded a MySQL forge snippet for the f_easter() function. You can use this function in MySQL statements to calculate easter sunday for any given year:

mysql> select f_easter(year(now()));
+-----------------------+
| f_easter(year(now())) |
+-----------------------+
| 2010-04-04 |
+-----------------------+
1 row in set (0.00 sec)

Anonymous Gregorian algorithm


To implement it, I simply transcribed the code of the "Anonymous Gregorian algorithm" from wikipedia's Computus article.

You might ask yourself: "how does it work?". Frankly, I don't know. Much like a tax form, I treat the calculation as a black box. But, it's wikipedia, so it must be right, right?

A Javascript snippet to calculate easter day


I also transcribed the algorithm to javascript, so I could use it in Kettle (a.k.a. Pentaho Data Integration). Of course, nothing should stop you from using it for another environment, such as a webpage.

I don't have a proper place to host that code, so I'm listing it here:

//Script to calculate Easter day
//according to the "Anonymous Gregorian algorithm"
function easterDay(year) {
var a = year % 19,
b = Math.floor(year / 100),
c = year % 100,
d = Math.floor(b / 4),
e = b % 4,
f = Math.floor((b + 8) / 25),
g = Math.floor((b - f + 1) / 3),
h = (19 * a + b - d - g + 15) % 30,
i = Math.floor(c / 4),
k = c % 4,
L = (32 + 2 * e + 2 * i - h - k) % 7,
m = Math.floor((a + 11 * h + 22 * L) / 451),
n = h + L - 7 * m + 114;
return new Date(year,
Math.floor(n / 31) - 1,
(n % 31) + 1);
}

easter = easterDay(year);

To use this in your kettle transformations, create a stream with an field of the Integer type called year. The year field should take on the value of some year. In the step, create one output field of the Date type to take on the value of the easter script variable. (For usage outside kettle, just use the easterDay() function as you see fit)

Nice, but so what?


The thought may have crossed your mind: "So what, who cares - why should I ever want to know when it's easter day?"

Apparently, if you think like that, you don't like eggs very much. That's ok - I don't blame you. But I happen to like eggs, and people in the egg business like people that like eggs like me so they can sell them more eggs. In fact, they like selling eggs so much, that it makes a big difference to them whether their business intelligence reports say: "On March 22, 2008, we sold 10 times more eggs than on February 22 and May 22 of the same year" as compared to "In 2008, on the day before Easter, we only sold half the amount of eggs as compared to the day before Easter in 2009".

In order to report these facts, special events and holidays like easter are stored in a date dimension. (I wrote about creating a localized date dimension, a date dimension that speaks your language some time ago)

So there you go: you could use these solutions in order to build a date dimension that understands easter. The nice thing about easter is that it can be used to derive a whole bunch of other Christian holidays, like good friday, ascension, and pentecost, and in many western countries, these will be special days with regard to the normal course of business. I leave all these as an exercise to the reader, but trust me - calculating easter is the key to a solving a lot of these problems.

PlanetMySQL Voting: Vote UP / Vote DOWN