Archive for the ‘easter’ Category

Re-Introducing UDJC

Январь 27th, 2010

Dear Kettle fans,

Daniel & I had a lot of fun in Orlando last week. Among other things we worked on the User Defined Java Class (UDJC) step.  If you have a bit of Java Experience, this step allows you to quickly write your own plugin in a step. This step is available in recent builds of Pentaho Data Integration (Kettle) version 4.

Now, how does this work?  Well, let’s take Roland Bouman’s example : the calculation of the the date of Easter.  In this blog post, Roland explains how to calculate Easter in MySQL and Kettle using JavaScript.  OK, so what if you want this calculation to be really fast in Kettle?  Well, then you can turn to pure Java to do the job…

import java.util.*;

private int yearIndex;
private Calendar calendar;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
Object[] r=getRow();
if (r==null)
{
setOutputDone();
return false;
}

if (first) {
yearIndex = getInputRowMeta().indexOfValue(getParameter("YEAR"));
if (yearIndex<0) {
throw new KettleException("Year field not found in the input row, check parameter 'YEAR'!");
}

calendar = Calendar.getInstance();
calendar.clear();

first=false;
}

Object[] outputRowData = RowDataUtil.resizeArray(r, data.outputRowMeta.size());
int outputIndex = getInputRowMeta().size();

Long year = getInputRowMeta().getInteger(r, yearIndex);
outputRowData[outputIndex++] = easterDate(year.intValue());

putRow(data.outputRowMeta, outputRowData);

return true;
}

private Date easterDate(int year) {
int a = year % 19;
int b = (int)Math.floor(year / 100);
int c = year % 100;
int d = (int)Math.floor(b / 4);
int e = b % 4;
int f = (int)Math.floor(( 8 + b ) / 25);
int g = (int)Math.floor((b - f + 1) / 3);
int h = (19 * a + b - d - g + 15) % 30;
int i = (int)Math.floor(c / 4);
int k = c % 4;
int L = (32 + 2 * e + 2 * i - h - k) % 7;
int m = (int)Math.floor((a + 11 * h + 22 * L) / 451);
int n = h + L - 7 * m + 114;

calendar.set(year, (int)(Math.floor(n / 31) - 1), (int)((n % 31) + 1));
return calendar.getTime();
}

All you then need to do is specify a return field in the Fields tab called “Easter” (a Date) and a parameter YEAR (the field to contain the year).

Screen shot of the UDJC step

The performance on my machine (Dual Core 2 Duo 2.33Ghz) is 134,000 rows/s for the JavaScript version and 450,000 rows/s for the UDJC version.  That’s over 3 times faster to do exactly the same thing.

Here is a link to the Kettle test transformation for those that want to give it a try.  As you can see, the deployment issue of having a plugin around is completely gone since now you can do anything you can do with a plugin from within the comfort of the UDJC step in Spoon.

The UDJC step uses the wonderful Janino library to compile the entered code to Java byte-code that gets executed at the same speed as everything else in Kettle.  This gives us pretty much optimal performance.

You can expect some tweaks to the UDJC step before 4.0 goes into feature freeze.  However, the bulk of the changes are in there and working great.  Thank you Daniel, for an outstanding job!

Until next time,

Matt


PlanetMySQL Voting: Vote UP / Vote DOWN

Re-Introducing UDJC

Январь 27th, 2010

Dear Kettle fans,

Daniel & I had a lot of fun in Orlando last week. Among other things we worked on the User Defined Java Class (UDJC) step.  If you have a bit of Java Experience, this step allows you to quickly write your own plugin in a step. This step is available in recent builds of Pentaho Data Integration (Kettle) version 4.

Now, how does this work?  Well, let’s take Roland Bouman’s example : the calculation of the the date of Easter.  In this blog post, Roland explains how to calculate Easter in MySQL and Kettle using JavaScript.  OK, so what if you want this calculation to be really fast in Kettle?  Well, then you can turn to pure Java to do the job…

import java.util.*;

private int yearIndex;
private Calendar calendar;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
Object[] r=getRow();
if (r==null)
{
setOutputDone();
return false;
}

if (first) {
yearIndex = getInputRowMeta().indexOfValue(getParameter("YEAR"));
if (yearIndex<0) {
throw new KettleException("Year field not found in the input row, check parameter 'YEAR'!");
}

calendar = Calendar.getInstance();
calendar.clear();

first=false;
}

Object[] outputRowData = RowDataUtil.resizeArray(r, data.outputRowMeta.size());
int outputIndex = getInputRowMeta().size();

Long year = getInputRowMeta().getInteger(r, yearIndex);
outputRowData[outputIndex++] = easterDate(year.intValue());

putRow(data.outputRowMeta, outputRowData);

return true;
}

private Date easterDate(int year) {
int a = year % 19;
int b = (int)Math.floor(year / 100);
int c = year % 100;
int d = (int)Math.floor(b / 4);
int e = b % 4;
int f = (int)Math.floor(( 8 + b ) / 25);
int g = (int)Math.floor((b - f + 1) / 3);
int h = (19 * a + b - d - g + 15) % 30;
int i = (int)Math.floor(c / 4);
int k = c % 4;
int L = (32 + 2 * e + 2 * i - h - k) % 7;
int m = (int)Math.floor((a + 11 * h + 22 * L) / 451);
int n = h + L - 7 * m + 114;

calendar.set(year, (int)(Math.floor(n / 31) - 1), (int)((n % 31) + 1));
return calendar.getTime();
}

All you then need to do is specify a return field in the Fields tab called “Easter” (a Date) and a parameter YEAR (the field to contain the year).

Screen shot of the UDJC step

The performance on my machine (Dual Core 2 Duo 2.33Ghz) is 134,000 rows/s for the JavaScript version and 450,000 rows/s for the UDJC version.  That’s over 3 times faster to do exactly the same thing.

Here is a link to the Kettle test transformation for those that want to give it a try.  As you can see, the deployment issue of having a plugin around is completely gone since now you can do anything you can do with a plugin from within the comfort of the UDJC step in Spoon.

The UDJC step uses the wonderful Janino library to compile the entered code to Java byte-code that gets executed at the same speed as everything else in Kettle.  This gives us pretty much optimal performance.

You can expect some tweaks to the UDJC step before 4.0 goes into feature freeze.  However, the bulk of the changes are in there and working great.  Thank you Daniel, for an outstanding job!

Until next time,

Matt


PlanetMySQL Voting: Vote UP / Vote DOWN

Easter Eggs for MySQL and Kettle

Январь 27th, 2010
To whom it may concern,

A MySQL stored function to calculate easter day


I uploaded a MySQL forge snippet for the f_easter() function. You can use this function in MySQL statements to calculate easter sunday for any given year:

mysql> select f_easter(year(now()));
+-----------------------+
| f_easter(year(now())) |
+-----------------------+
| 2010-04-04 |
+-----------------------+
1 row in set (0.00 sec)

Anonymous Gregorian algorithm


To implement it, I simply transcribed the code of the "Anonymous Gregorian algorithm" from wikipedia's Computus article.

You might ask yourself: "how does it work?". Frankly, I don't know. Much like a tax form, I treat the calculation as a black box. But, it's wikipedia, so it must be right, right?

A Javascript snippet to calculate easter day


I also transcribed the algorithm to javascript, so I could use it in Kettle (a.k.a. Pentaho Data Integration). Of course, nothing should stop you from using it for another environment, such as a webpage.

I don't have a proper place to host that code, so I'm listing it here:

//Script to calculate Easter day
//according to the "Anonymous Gregorian algorithm"
function easterDay(year) {
var a = year % 19,
b = Math.floor(year / 100),
c = year % 100,
d = Math.floor(b / 4),
e = b % 4,
f = Math.floor((b + 8) / 25),
g = Math.floor((b - f + 1) / 3),
h = (19 * a + b - d - g + 15) % 30,
i = Math.floor(c / 4),
k = c % 4,
L = (32 + 2 * e + 2 * i - h - k) % 7,
m = Math.floor((a + 11 * h + 22 * L) / 451),
n = h + L - 7 * m + 114;
return new Date(year,
Math.floor(n / 31) - 1,
(n % 31) + 1);
}

easter = easterDay(year);

To use this in your kettle transformations, create a stream with an field of the Integer type called year. The year field should take on the value of some year. In the step, create one output field of the Date type to take on the value of the easter script variable. (For usage outside kettle, just use the easterDay() function as you see fit)

Nice, but so what?


The thought may have crossed your mind: "So what, who cares - why should I ever want to know when it's easter day?"

Apparently, if you think like that, you don't like eggs very much. That's ok - I don't blame you. But I happen to like eggs, and people in the egg business like people that like eggs like me so they can sell them more eggs. In fact, they like selling eggs so much, that it makes a big difference to them whether their business intelligence reports say: "On March 22, 2008, we sold 10 times more eggs than on February 22 and May 22 of the same year" as compared to "In 2008, on the day before Easter, we only sold half the amount of eggs as compared to the day before Easter in 2009".

In order to report these facts, special events and holidays like easter are stored in a date dimension. (I wrote about creating a localized date dimension, a date dimension that speaks your language some time ago)

So there you go: you could use these solutions in order to build a date dimension that understands easter. The nice thing about easter is that it can be used to derive a whole bunch of other Christian holidays, like good friday, ascension, and pentecost, and in many western countries, these will be special days with regard to the normal course of business. I leave all these as an exercise to the reader, but trust me - calculating easter is the key to a solving a lot of these problems.

PlanetMySQL Voting: Vote UP / Vote DOWN