Archive for the ‘Kettle’ Category

Data Modeling

Ноябрь 3rd, 2011

Dear data integration fans,

I’m a big fan of “appropriate” data modeling prior to doing any data integration work.  For a number of folks out there that means the creation of an Enterprise Data Warehouse model in classical Bill Inmon style.  Others prefer to use modern modeling techniques like Data Vault, created by Dan Linstedt.  However, the largest group data warehouse architects use a technique called dimensional modeling championed by Ralph Kimball.

Using a modeling technique is very important since it brings structure to your data warehouse.  The techniques used, when applied correctly of-course, are helping you in a big way to avoid all sorts of pitfalls in the design of a data warehouse.

From my own experience and from what I see in my own Kettle community, dimensional modeling is by far the most popular technique used to create data warehouses.  For that reason (and the fact that I’m a huge fan of Kimball) I’ve always made sure to properly support the most complex part of technique: the slowly changing dimension.  For the better part that has made Kettle an excellent choice when it comes to easy translation of your dimensional model to ETL.

However, where these days you have open source tools like Quipu and RapidACE for data vault modelling I was sad to see that not too much exists for dimensional modeling in combination with Kettle.

So a few weeks ago I was doing some basic modeling for a new Pentaho logging data mart for PDI 4.3 EE.  This data mart will be responsible for the delivery of easy to digest reports, analyses views and dashboards on the subjects of monitoring and logging of Pentaho servers.  Initially I started doing this in a nice Eclipse plugin called UMLet which resulted in a data model like this:

While this result isn’t the worst diagram you can possibly imagine there are a number of problems with the approach:

  • The information about dimensions, attributes, relationships, … is not captured in a structured way.
  • Export of the metadata is not possible in any usable format except for PDF and images.
  • UMLet, like so many UML and modeling tools is a generic tool that also supports many other features that I’m not interested in when I’m doing dimensional modeling.  As a result, creating a model takes time and real effort.
  • The work needs to be used in your favorite ETL tool so it makes sense to be have it handy there, instead of having to use a third party tool.
So I thought: wouldn’t it be great if I had some sort of perspective in Spoon where I could do a bit dimensional modeling based on a logical Pentaho metadata model?

Wouldn’t it be great if I could create a new metadata domain to hold all the star models for a certain data mart?

Then wouldn’t it be great if you could edit your star models in there?

The graphics don’t have to be anything fancy, I thought.  It just needs to automatically position the fact table in the middle and the dimensions around it…

Obviously, I would like to be able to edit the name, description and type of the dimension …

and depending on the type of dimension I would like to insert a bunch of default attributes…

Using standard Kettle data grid I should be able to copy attributes and other metadata back and forth between dimension dialogs and a spreadsheet as well.

In the fact table definition it would be cool if we could not only specify the facts but also the relationships to the dimension…

Because that way we wouldn’t have to worry about how to draw the star model and we would know everything we would need to know.

If we would have a tool like that we would be able to generate the SQL to generate the physical tables against a certain target database…

Because if we would have all sorts of knowledge in metadata of the dimensions we could really nicely generate all the required data types, indexes and what not.

And then it would be cool to also generate a template transformation to update the dimension and fact tables in the models…

Well, I thought it would be nice to have that sort of functionality.

Perhaps we could also create physical Pentaho metadata domain (XMI) from the star domain as well as Mondrian schemas and a PDF with documentation.

OK, so this is coming to a PDI release near you in the short term.  I’ve only been working on it for a few weeks on and off but you can try an early version here.  Simply unzip it in the plugins folder of a PDI 4.3 build.  The plugin needs 4.3 since that version already includes a lot of libraries like Pentaho metadata and reporting and that way I don’t need to package all those libraries with it.  We can see later how we can deploy on 4.2 as well.

Please provide feedback here or in PDI-6890.

Until next time,

Matt


PlanetMySQL Voting: Vote UP / Vote DOWN

Proposals for Codebits.EU

Август 15th, 2011
Codebits is an annual 3-day conference about software and, well, code. It's organized by SAPO and this year's edition is to be held on November 10 thru 12 at the Pavilhão Atlântico, Sala Tejo in Lisbon, Portugal.

I've never attended SAPO Codebits before, but I heard good things about it from Datacharmer Giuseppe Maxia. The interesting thing about the way this conference is organized is that all proposals are available to the public, which can also vote for the proposals. This year's proposals are looking very interesting already, with high quality proposals from Giuseppe about database replication with Tungsten replicator, Pentaho's chief of data integration Matt Casters about Kettle (aka Pentaho data integration), and Pedro Alves from webdetails who will be talking about "Big Data" analysis and dashboarding work he did for the Mozilla team.

There are many more interesting talks, and you should simply check out the proposals for yourself and give a thumbs up or a thumbs down according to whether you'd like see a particular proposal at the conference. I decided to send in a few proposals as well:So, if you like what you see here, take a minute to vote and shape this codebits conference. I'm hoping to meet you there!

PlanetMySQL Voting: Vote UP / Vote DOWN

Real-time streaming data aggregation

Июль 28th, 2011

Dear Kettle users,

Most of you usually use a data integration engine to process data in a batch-oriented way.  Pentaho Data Integration (Kettle) is typically deployed to run monthly, nightly, hourly workloads.  Sometimes folks run micro-batches of work every minute or so.  However, it’s lesser known that our beloved transformation engine can also be used to stream data indefinitely (never ending) from a source to a target.  This sort of data integration is sometimes referred to as being “streaming“, “real-time“, “near real-time“, “continuous” and so on.  Typical examples of situations where you have a never-ending supply of data that needs to be processed the instance it becomes available are JMS (Java Message Service), RDBMS log sniffing, on-line fraud analyses, web or application log sniffing or of-course … Twitter!  Since Twitter is easily accessed it’s common for examples to pop up regarding it’s usage and in this blog post too we will use this service to demo the Pentaho Data Integration capabilities wrt to processing streaming data.

Here’s what we want to do:

  1. Continuously read all the tweets that are being sent on Twitter.
  2. Extract all the hash-tags used
  3. Count the number of hash-tags used in a one-minute time-window
  4. Report on all the tags that are being used more than once
  5. Put the output in a browser window, continuously update every minute.

This is a very generic example but the logic of this can be applied to different fields like JMS, HL7, log sniffing and so on.  It differs from the excellent work that Vincent from Open-BI described earlier this week on his blog in the sense that his Talend job finishes where ours will never end and where ours will do time-based aggregation in contrast to aggregation over a finite data set.

Also note that in order for Kettle to fully support multiple streaming data sources we would have to implement support for “windowed” (time-based) joins and other nifty things.  We’ve seen very little demand for this sort of requirement in the past, perhaps because people don’t know it’s possible with Kettle.  In any case, if you currently are in need of full streaming data support, have a look at SQLStream, they can help you. SQLStream is co-founded by Pentaho’s Julian Hyde of Mondrian fame.

OK, let’s see how we can solve our little problem with Kettle instead…

1. Continuously read all the tweets that are being sent on Twitter.

For this we are going to use one of the public Twitter web services, one that delivers a never-ending stream of JSON messages:

http://stream.twitter.com/1/statuses/sample.json?delimited=length

Since the format of the output is never-ending and specific in nature I wrote a small “User Defined Java Class” script:

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
HttpClient client = SlaveConnectionManager.getInstance().createHttpClient();
client.setTimeout(10000);
client.setConnectionTimeout(10000);

Credentials creds = new UsernamePasswordCredentials(getParameter("USERNAME"), getParameter("PASSWORD"));
client.getState().setCredentials(AuthScope.ANY, creds);
client.getParams().setAuthenticationPreemptive(true);

HttpMethod method = new PostMethod("http://stream.twitter.com/1/statuses/sample.json?delimited=length");

// Execute request
//
InputStream inputStream=null;
BufferedInputStream bufferedInputStream=null;
try {
int result = client.executeMethod(method);

// the response
//
inputStream = method.getResponseBodyAsStream();
bufferedInputStream = new BufferedInputStream(inputStream, 1000);

StringBuffer bodyBuffer = new StringBuffer();
int opened=0;
int c;
while ( (c=bufferedInputStream.read())!=-1  && !isStopped()) {
char ch = (char)c;
bodyBuffer.append(ch);
if (ch=='{') opened++; else if (ch=='}') opened--;
if (ch=='}' && opened==0) {
// one JSON block, pass it on!
//
Object[] r = createOutputRow(new Object[0], data.outputRowMeta.size());
String jsonString = bodyBuffer.toString();

int startIndex = jsonString.indexOf("{");
if (startIndex<0) startIndex=0;

// System.out.print("index="+startIndex+" json="+jsonString.substring(startIndex));

r[0] = jsonString.substring(startIndex);
putRow(data.outputRowMeta, r);

bodyBuffer.setLength(0);
}
}
} catch(Exception e) {
throw new KettleException("Unable to get tweets", e);
} finally {
bufferedInputStream.reset();
bufferedInputStream.close();
}

setOutputDone();
return false;
}

As the experienced UDJC writers among you will notice: this step never ends as long as the twitter service keeps on sending more data.  Depending on the stability and popularity of twitter that can be “a very long time“.

You could improve the code even further to re-connect to the service every time it drops away.  Personally I would not do this.  I would rather have the transformation terminate with an error (as it is implemented now), send an alert (e-mail, database, SNMP) and re-start the transformation in a loop in a job.  That way you have a trace in case twitter dies for a few hours.

2. Extract all the hash-tags used

First we’ll parse the JSON returned by the twitter service, extract the first 5 hash-tags from the message, split this up into 5 rows and count the tags…

3. Count the number of hash-tags used in a one-minute time-window

The counting is easy as you can simply use a “Group by”  step.  However, how can we aggregate in a time-based fashion without too much tinkering?   Well, we now have the “Single Threader” step which has the option to aggregate in a time-based manner so we might as well use this option:

This step simply accumulates all records in memory until 60 seconds have passed and then performs one iteration of the single threaded execution of the specified transformation.  This is a special execution method that doesn’t use the typical parallel engine.  Another cool thing about this engine is that the records that go into the engine in the time-window can be grouped and sorted without the transformation being restarted every minute.

4. Report on all the tags that are being used more than once

The filtering is done with a simple “Filter Rows” step.  However, thanks to the magic of the “Single Threader” step we can sort the rows descending by the tag occurrence count in that one-minute time-window.  It’s also interesting to note that if you have huge amounts of data, that you can easily parallelize your work by starting multiple copies of the single threader step and/or with some clever data partitioning.  In our case we could partition by hash-tag or re-aggregate the aggregated data.

5. Put the output in a browser window, continuously update every minute.

As shown in an earlier blog post, we can do this quite easily with a “Text File Output” step.  However, we also want to put a small header and a separator between the data from every minute so we end up with a transformation that looks like this:

The script to print the header looks like this:

var out;
if (out==null) {
out = _step_.getTrans().getServletPrintWriter();
out.println("'Real-time' twitter hashtag report, minute based");
out.flush();
}

The separator between each minute is simple too:

if (nr==1) {
var out = _step_.getTrans().getServletPrintWriter();
  out.println("============================================");
out.println();
  out.flush();
}

You can execute this transformation on a Carte instance (4.2.0) and see the following output:

'Real-time' twitter hashtag report, minute based
=================================================

nr;hashtag;count;from;to
1;tatilmayonezi;5;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
2;AUGUST6THBUZZNIGHTCLUB;3;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
3;teamfollowback;3;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
4;ayamzaman;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
5;dnd;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
6;follow;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
7;malhação;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
8;rappernames;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
9;thingswelearnedontwitter;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
=================================================

1;ska;5;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
2;followplanetjedward;4;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
3;chistede3pesos;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
4;NP;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
5;rappernames;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
6;tatilmayonezi;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
7;teamfollowback;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
8;AvrilBeatsVolcano;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
9;CM6;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
10;followme;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
11;Leão;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
12;NewArtists;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
13;OOMF;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
14;RETWEET;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
15;sougofollow;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
16;swag;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
17;thingswelearnedontwitter;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000

...

For reference, I used the following URL to start the streaming report:

http://cluster:cluster@127.0.0.1:8282/kettle/executeTrans/?trans=%2Fhome%2Fmatt%2Ftest-stuff%2FTwitter Stream%2FRead a twitter stream.ktr&USERNAME=MyTwitterAccount&PASSWORD=MyPassword

I placed the complete example over here in case you want to try this yourself on PDI/Kettle version 4.2.0-RC1 or later. Things you can add to make it even cooler is to have this transformation send an e-mail every time a certain hash-tag gets used more than 15 times in a given minute.  That sort of alerting support for example gives you easy access to emerging new trends, events and memes.

For reference, take a look at this earlier blog post of mine where I describe the internal cleanup mechanisms inside of Kettle that prevent our transformation from ever running out of memory or resources.

Until next time,

Matt


PlanetMySQL Voting: Vote UP / Vote DOWN

Real-time streaming data aggregation

Июль 28th, 2011

Dear Kettle users,

Most of you usually use a data integration engine to process data in a batch-oriented way.  Pentaho Data Integration (Kettle) is typically deployed to run monthly, nightly, hourly workloads.  Sometimes folks run micro-batches of work every minute or so.  However, it’s lesser known that our beloved transformation engine can also be used to stream data indefinitely (never ending) from a source to a target.  This sort of data integration is sometimes referred to as being “streaming“, “real-time“, “near real-time“, “continuous” and so on.  Typical examples of situations where you have a never-ending supply of data that needs to be processed the instance it becomes available are JMS (Java Message Service), RDBMS log sniffing, on-line fraud analyses, web or application log sniffing or of-course … Twitter!  Since Twitter is easily accessed it’s common for examples to pop up regarding it’s usage and in this blog post too we will use this service to demo the Pentaho Data Integration capabilities wrt to processing streaming data.

Here’s what we want to do:

  1. Continuously read all the tweets that are being sent on Twitter.
  2. Extract all the hash-tags used
  3. Count the number of hash-tags used in a one-minute time-window
  4. Report on all the tags that are being used more than once
  5. Put the output in a browser window, continuously update every minute.

This is a very generic example but the logic of this can be applied to different fields like JMS, HL7, log sniffing and so on.  It differs from the excellent work that Vincent from Open-BI described earlier this week on his blog in the sense that his Talend job finishes where ours will never end and where ours will do time-based aggregation in contrast to aggregation over a finite data set.

Also note that in order for Kettle to fully support multiple streaming data sources we would have to implement support for “windowed” (time-based) joins and other nifty things.  We’ve seen very little demand for this sort of requirement in the past, perhaps because people don’t know it’s possible with Kettle.  In any case, if you currently are in need of full streaming data support, have a look at SQLStream, they can help you. SQLStream is co-founded by Pentaho’s Julian Hyde of Mondrian fame.

OK, let’s see how we can solve our little problem with Kettle instead…

1. Continuously read all the tweets that are being sent on Twitter.

For this we are going to use one of the public Twitter web services, one that delivers a never-ending stream of JSON messages:

http://stream.twitter.com/1/statuses/sample.json?delimited=length

Since the format of the output is never-ending and specific in nature I wrote a small “User Defined Java Class” script:

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
HttpClient client = SlaveConnectionManager.getInstance().createHttpClient();
client.setTimeout(10000);
client.setConnectionTimeout(10000);

Credentials creds = new UsernamePasswordCredentials(getParameter("USERNAME"), getParameter("PASSWORD"));
client.getState().setCredentials(AuthScope.ANY, creds);
client.getParams().setAuthenticationPreemptive(true);

HttpMethod method = new PostMethod("http://stream.twitter.com/1/statuses/sample.json?delimited=length");

// Execute request
//
InputStream inputStream=null;
BufferedInputStream bufferedInputStream=null;
try {
int result = client.executeMethod(method);

// the response
//
inputStream = method.getResponseBodyAsStream();
bufferedInputStream = new BufferedInputStream(inputStream, 1000);

StringBuffer bodyBuffer = new StringBuffer();
int opened=0;
int c;
while ( (c=bufferedInputStream.read())!=-1  && !isStopped()) {
char ch = (char)c;
bodyBuffer.append(ch);
if (ch=='{') opened++; else if (ch=='}') opened--;
if (ch=='}' && opened==0) {
// one JSON block, pass it on!
//
Object[] r = createOutputRow(new Object[0], data.outputRowMeta.size());
String jsonString = bodyBuffer.toString();

int startIndex = jsonString.indexOf("{");
if (startIndex<0) startIndex=0;

// System.out.print("index="+startIndex+" json="+jsonString.substring(startIndex));

r[0] = jsonString.substring(startIndex);
putRow(data.outputRowMeta, r);

bodyBuffer.setLength(0);
}
}
} catch(Exception e) {
throw new KettleException("Unable to get tweets", e);
} finally {
bufferedInputStream.reset();
bufferedInputStream.close();
}

setOutputDone();
return false;
}

As the experienced UDJC writers among you will notice: this step never ends as long as the twitter service keeps on sending more data.  Depending on the stability and popularity of twitter that can be “a very long time“.

You could improve the code even further to re-connect to the service every time it drops away.  Personally I would not do this.  I would rather have the transformation terminate with an error (as it is implemented now), send an alert (e-mail, database, SNMP) and re-start the transformation in a loop in a job.  That way you have a trace in case twitter dies for a few hours.

2. Extract all the hash-tags used

First we’ll parse the JSON returned by the twitter service, extract the first 5 hash-tags from the message, split this up into 5 rows and count the tags…

3. Count the number of hash-tags used in a one-minute time-window

The counting is easy as you can simply use a “Group by”  step.  However, how can we aggregate in a time-based fashion without too much tinkering?   Well, we now have the “Single Threader” step which has the option to aggregate in a time-based manner so we might as well use this option:

This step simply accumulates all records in memory until 60 seconds have passed and then performs one iteration of the single threaded execution of the specified transformation.  This is a special execution method that doesn’t use the typical parallel engine.  Another cool thing about this engine is that the records that go into the engine in the time-window can be grouped and sorted without the transformation being restarted every minute.

4. Report on all the tags that are being used more than once

The filtering is done with a simple “Filter Rows” step.  However, thanks to the magic of the “Single Threader” step we can sort the rows descending by the tag occurrence count in that one-minute time-window.  It’s also interesting to note that if you have huge amounts of data, that you can easily parallelize your work by starting multiple copies of the single threader step and/or with some clever data partitioning.  In our case we could partition by hash-tag or re-aggregate the aggregated data.

5. Put the output in a browser window, continuously update every minute.

As shown in an earlier blog post, we can do this quite easily with a “Text File Output” step.  However, we also want to put a small header and a separator between the data from every minute so we end up with a transformation that looks like this:

The script to print the header looks like this:

var out;
if (out==null) {
out = _step_.getTrans().getServletPrintWriter();
out.println("'Real-time' twitter hashtag report, minute based");
out.flush();
}

The separator between each minute is simple too:

if (nr==1) {
var out = _step_.getTrans().getServletPrintWriter();
  out.println("============================================");
out.println();
  out.flush();
}

You can execute this transformation on a Carte instance (4.2.0) and see the following output:

'Real-time' twitter hashtag report, minute based
=================================================

nr;hashtag;count;from;to
1;tatilmayonezi;5;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
2;AUGUST6THBUZZNIGHTCLUB;3;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
3;teamfollowback;3;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
4;ayamzaman;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
5;dnd;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
6;follow;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
7;malhação;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
8;rappernames;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
9;thingswelearnedontwitter;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
=================================================

1;ska;5;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
2;followplanetjedward;4;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
3;chistede3pesos;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
4;NP;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
5;rappernames;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
6;tatilmayonezi;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
7;teamfollowback;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
8;AvrilBeatsVolcano;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
9;CM6;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
10;followme;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
11;Leão;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
12;NewArtists;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
13;OOMF;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
14;RETWEET;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
15;sougofollow;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
16;swag;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
17;thingswelearnedontwitter;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000

...

For reference, I used the following URL to start the streaming report:

http://cluster:cluster@127.0.0.1:8282/kettle/executeTrans/?trans=%2Fhome%2Fmatt%2Ftest-stuff%2FTwitter Stream%2FRead a twitter stream.ktr&USERNAME=MyTwitterAccount&PASSWORD=MyPassword

I placed the complete example over here in case you want to try this yourself on PDI/Kettle version 4.2.0-RC1 or later. Things you can add to make it even cooler is to have this transformation send an e-mail every time a certain hash-tag gets used more than 15 times in a given minute.  That sort of alerting support for example gives you easy access to emerging new trends, events and memes.

For reference, take a look at this earlier blog post of mine where I describe the internal cleanup mechanisms inside of Kettle that prevent our transformation from ever running out of memory or resources.

Until next time,

Matt


PlanetMySQL Voting: Vote UP / Vote DOWN

Real-time streaming data aggregation

Июль 28th, 2011

Dear Kettle users,

Most of you usually use a data integration engine to process data in a batch-oriented way.  Pentaho Data Integration (Kettle) is typically deployed to run monthly, nightly, hourly workloads.  Sometimes folks run micro-batches of work every minute or so.  However, it’s lesser known that our beloved transformation engine can also be used to stream data indefinitely (never ending) from a source to a target.  This sort of data integration is sometimes referred to as being “streaming“, “real-time“, “near real-time“, “continuous” and so on.  Typical examples of situations where you have a never-ending supply of data that needs to be processed the instance it becomes available are JMS (Java Message Service), RDBMS log sniffing, on-line fraud analyses, web or application log sniffing or of-course … Twitter!  Since Twitter is easily accessed it’s common for examples to pop up regarding it’s usage and in this blog post too we will use this service to demo the Pentaho Data Integration capabilities wrt to processing streaming data.

Here’s what we want to do:

  1. Continuously read all the tweets that are being sent on Twitter.
  2. Extract all the hash-tags used
  3. Count the number of hash-tags used in a one-minute time-window
  4. Report on all the tags that are being used more than once
  5. Put the output in a browser window, continuously update every minute.

This is a very generic example but the logic of this can be applied to different fields like JMS, HL7, log sniffing and so on.  It differs from the excellent work that Vincent from Open-BI described earlier this week on his blog in the sense that his Talend job finishes where ours will never end and where ours will do time-based aggregation in contrast to aggregation over a finite data set.

Also note that in order for Kettle to fully support multiple streaming data sources we would have to implement support for “windowed” (time-based) joins and other nifty things.  We’ve seen very little demand for this sort of requirement in the past, perhaps because people don’t know it’s possible with Kettle.  In any case, if you currently are in need of full streaming data support, have a look at SQLStream, they can help you. SQLStream is co-founded by Pentaho’s Julian Hyde of Mondrian fame.

OK, let’s see how we can solve our little problem with Kettle instead…

1. Continuously read all the tweets that are being sent on Twitter.

For this we are going to use one of the public Twitter web services, one that delivers a never-ending stream of JSON messages:

http://stream.twitter.com/1/statuses/sample.json?delimited=length

Since the format of the output is never-ending and specific in nature I wrote a small “User Defined Java Class” script:

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
HttpClient client = SlaveConnectionManager.getInstance().createHttpClient();
client.setTimeout(10000);
client.setConnectionTimeout(10000);

Credentials creds = new UsernamePasswordCredentials(getParameter("USERNAME"), getParameter("PASSWORD"));
client.getState().setCredentials(AuthScope.ANY, creds);
client.getParams().setAuthenticationPreemptive(true);

HttpMethod method = new PostMethod("http://stream.twitter.com/1/statuses/sample.json?delimited=length");

// Execute request
//
InputStream inputStream=null;
BufferedInputStream bufferedInputStream=null;
try {
int result = client.executeMethod(method);

// the response
//
inputStream = method.getResponseBodyAsStream();
bufferedInputStream = new BufferedInputStream(inputStream, 1000);

StringBuffer bodyBuffer = new StringBuffer();
int opened=0;
int c;
while ( (c=bufferedInputStream.read())!=-1  && !isStopped()) {
char ch = (char)c;
bodyBuffer.append(ch);
if (ch=='{') opened++; else if (ch=='}') opened--;
if (ch=='}' && opened==0) {
// one JSON block, pass it on!
//
Object[] r = createOutputRow(new Object[0], data.outputRowMeta.size());
String jsonString = bodyBuffer.toString();

int startIndex = jsonString.indexOf("{");
if (startIndex<0) startIndex=0;

// System.out.print("index="+startIndex+" json="+jsonString.substring(startIndex));

r[0] = jsonString.substring(startIndex);
putRow(data.outputRowMeta, r);

bodyBuffer.setLength(0);
}
}
} catch(Exception e) {
throw new KettleException("Unable to get tweets", e);
} finally {
bufferedInputStream.reset();
bufferedInputStream.close();
}

setOutputDone();
return false;
}

As the experienced UDJC writers among you will notice: this step never ends as long as the twitter service keeps on sending more data.  Depending on the stability and popularity of twitter that can be “a very long time“.

You could improve the code even further to re-connect to the service every time it drops away.  Personally I would not do this.  I would rather have the transformation terminate with an error (as it is implemented now), send an alert (e-mail, database, SNMP) and re-start the transformation in a loop in a job.  That way you have a trace in case twitter dies for a few hours.

2. Extract all the hash-tags used

First we’ll parse the JSON returned by the twitter service, extract the first 5 hash-tags from the message, split this up into 5 rows and count the tags…

3. Count the number of hash-tags used in a one-minute time-window

The counting is easy as you can simply use a “Group by”  step.  However, how can we aggregate in a time-based fashion without too much tinkering?   Well, we now have the “Single Threader” step which has the option to aggregate in a time-based manner so we might as well use this option:

This step simply accumulates all records in memory until 60 seconds have passed and then performs one iteration of the single threaded execution of the specified transformation.  This is a special execution method that doesn’t use the typical parallel engine.  Another cool thing about this engine is that the records that go into the engine in the time-window can be grouped and sorted without the transformation being restarted every minute.

4. Report on all the tags that are being used more than once

The filtering is done with a simple “Filter Rows” step.  However, thanks to the magic of the “Single Threader” step we can sort the rows descending by the tag occurrence count in that one-minute time-window.  It’s also interesting to note that if you have huge amounts of data, that you can easily parallelize your work by starting multiple copies of the single threader step and/or with some clever data partitioning.  In our case we could partition by hash-tag or re-aggregate the aggregated data.

5. Put the output in a browser window, continuously update every minute.

As shown in an earlier blog post, we can do this quite easily with a “Text File Output” step.  However, we also want to put a small header and a separator between the data from every minute so we end up with a transformation that looks like this:

The script to print the header looks like this:

var out;
if (out==null) {
out = _step_.getTrans().getServletPrintWriter();
out.println("'Real-time' twitter hashtag report, minute based");
out.flush();
}

The separator between each minute is simple too:

if (nr==1) {
var out = _step_.getTrans().getServletPrintWriter();
  out.println("============================================");
out.println();
  out.flush();
}

You can execute this transformation on a Carte instance (4.2.0) and see the following output:

'Real-time' twitter hashtag report, minute based
=================================================

nr;hashtag;count;from;to
1;tatilmayonezi;5;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
2;AUGUST6THBUZZNIGHTCLUB;3;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
3;teamfollowback;3;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
4;ayamzaman;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
5;dnd;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
6;follow;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
7;malhação;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
8;rappernames;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
9;thingswelearnedontwitter;2;2011/07/27 22:52:43.000;2011/07/27 22:53:32.000
=================================================

1;ska;5;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
2;followplanetjedward;4;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
3;chistede3pesos;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
4;NP;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
5;rappernames;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
6;tatilmayonezi;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
7;teamfollowback;3;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
8;AvrilBeatsVolcano;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
9;CM6;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
10;followme;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
11;Leão;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
12;NewArtists;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
13;OOMF;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
14;RETWEET;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
15;sougofollow;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
16;swag;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000
17;thingswelearnedontwitter;2;2011/07/27 22:53:35.000;2011/07/27 22:54:47.000

...

For reference, I used the following URL to start the streaming report:

http://cluster:cluster@127.0.0.1:8282/kettle/executeTrans/?trans=%2Fhome%2Fmatt%2Ftest-stuff%2FTwitter Stream%2FRead a twitter stream.ktr&USERNAME=MyTwitterAccount&PASSWORD=MyPassword

I placed the complete example over here in case you want to try this yourself on PDI/Kettle version 4.2.0-RC1 or later. Things you can add to make it even cooler is to have this transformation send an e-mail every time a certain hash-tag gets used more than 15 times in a given minute.  That sort of alerting support for example gives you easy access to emerging new trends, events and memes.

For reference, take a look at this earlier blog post of mine where I describe the internal cleanup mechanisms inside of Kettle that prevent our transformation from ever running out of memory or resources.

Until next time,

Matt


PlanetMySQL Voting: Vote UP / Vote DOWN

Kettle vs Oracle REF CURSOR

Ноябрь 17th, 2010

Dear Kettle fans,

PDI-200 has been out there for a while now.  Jens created the feature request a little over 3 years ago.  I guess the main thing blocking this issue was not as much a technical problem but more of a licensing and dependency one (Oracle JDBC dependency and distribution license).

However, now that we have the User Defined Java Class step we can work around those pesky problems. That is because the Java code in there only gets compiled and executed at runtime so it’s perfectly fine to create any sort of dependency in there you like.

The following transformation reads a set of rows from a stored procedure as described on this web page.

In short, our UDJC step executes the following code:

begin ? := sp_get_stocks(?); end;

The result is a set of rows and the parameter is in this case a single numeric value.

The step contains mostly Java code but thanks to configuration options you only need to do 2 things to make this work for your own REF CURSOR returning procedures…

First you need to specify the output fields of the rows…

And then you need to specify the parameters:

The source code for this sample transformation is over here and runs on Pentaho Data Integration version 4.x (or higher).  All in all it only took a few hours to write these 150 lines of Java so perhaps it can serve as inspiration for other similar problems you might have with Oracle or other databases.

Until next time,

Matt


PlanetMySQL Voting: Vote UP / Vote DOWN

Pentaho Kettle Solutions Overview

Октябрь 8th, 2010

Dear Kettle friends,

As mentioned in my previous blog post, copies of our new book Pentaho Kettle Solutions are finally shipping.  Roland, Jos and myself worked really hard on it and, as you can probably imagine, we were really happy when we finally got the physical version of our book in our hands.

Book front

So let’s take a look at what’s in this book, what the concept behind it was and give you an overview of the content…

The concept

Given the fact that Maria’s book called Pentaho Data Integration 3.2 was due when we started, we knew that a beginners guide would be ready by the time that this book was going to be ready.  As such we opted to look at what the data warehouse professional might need when he or she would start to work with Kettle.  Fortunately there is already a good and well known check-list out there to see if you covered everything ETL related and it’s called The 34 subsystems of ETL, a concept by Ralph Kimball that was first featured in his book The Data Warehouse Lifecycle Toolkit.  And so we asked Mr Kimballs permission to use his list which he kindly provided.  He was also gracious enough to review the related chapter of our book.

By using this approach we allow the users to flip to a certain chapter in our book and directly get the information they want on the problem they are facing at that time. For example, Change Data Capturing (subsystem 2, a.k.a. CDC) is handled in Chapter 6: Data Extraction.

In other words: we did not start with the capabilities of Kettle. We did not take every step or feature of Kettle as a starting point.  In fact, there are plenty of steps we did not cover in this book.  However, everywhere a step or feature needed to be explained while covering all the sub-systems we did so as clearly as we could.  Rest assured though; since this book handles just about every topic related to data integration, all of the basic and 99% of the advanced features of Kettle are indeed covered in this book ;-)

The content

After a gentle introduction into how ETL tools came about and more importantly how and why Kettle came into existence, the book covers 5 main parts:

1. Getting started

This part starts with the a primer that explains the need for data integration and takes you by the hand into the wonderful world of ETL.
Then all the various building blocks of Kettle are explained.  This is especially interesting for folks with prior data integration experience, perhaps with other tools, as they can read all about the design principles and concepts behind Kettle.
After that the installation and configuration of Kettle is covered. Since the installation is a simple unzip, that includes a detailed description of all the available tools and configuration files.
Finally, you’ll get hands-on experience in the last chapter of the first part titled “An example ETL Solution - Sakila”.  This chapter explains in great detail how a small but complex data warehouse can be created using Kettle.
2. ETL
In this part you’ll first encounter a detailed overview of the 34 sub-systems of ETL after which the art of Data Extraction is covered in detail.  That includes extracting information from all sorts of file types, databases, working with ERP and CRM systems, Data profilng and CDC.
This is followed by chapter 7 “Cleansing and Conforming” in which the various data cleansing and validation steps are covered as well as error handling, auditing, deduplication and last but not least scripting and regular expressions.
Finally this second part of the book will cover everything related to star schemas including the handling of dimension tables (chapter 8), loading of fact tables (chapter 9) and working with OLAP data (chapter 10).
3. Management and deployment
The third main part of the book deals with everything related to the management and deployment of your data integration solution.  First you’ll read all about the ETL development lifecycle (chapter 11), scheduling and monitoring (chapter 12), versioning and migration (chapter 13) and lineage and auditing (chapter 14).  As you can guess from the titles of the chapters, a lot of best practices, do’s-and-don’ts are covered in this part.
4. Performance and scalability
The 4th part of our book really dives into the often highly technical topics surrounding performance tuning (chapter 15), parallelization, clustering and partitioning (chapter 16), dynamic clustering in the cloud (chapter 17) and real-time data integration (chapter 18).
It’s personally hope that the book will lead to more performance related JIRA cases since chapter 15 explains how you can detect bottlenecks :-)
5. Advanced topics
The last part conveniently titled “Advanced topics” deals with things we thought were interesting to a data warehouse engineer or ETL developer that is faced with concepts like Data Vault management (chapter 19), handling complex data formats (chapter 20) or web services (chapter 21).  Indispensable in case you want to embed Kettle into your own software is chapter 22 : Kettle integration.  It contains many Java code samples that explain to you how you can execute jobs and transformations or even assemble them dynamically.
Last but certainly not least since it’s probably one of the most interesting chapters for a Java developer is chapter 23: Extending Kettle.  This chapter explains to you how you can develop step, job-entry, partitioning or database type plugins for Kettle in great detail so that you can get started with your own components in no time.

I hope that this overview of our new brain-child gives you an idea of what you might be buying into. Since all books are essentially a compromise between page count, time and money I’m sure there will be the occasional typo or lack of precision but rest assured that we did our utmost best on this one.  After all, we did each spend over 6 months on it…

Feel free to ask about specific topics you might be interested in to see if they are covered ;-)

Until next time,

Matt


PlanetMySQL Voting: Vote UP / Vote DOWN

Pentaho Kettle Solutions

Сентябрь 30th, 2010
Pentaho I have several favorite authors -- Tim Dorsey, Clive Cussler, and few others that I buy their latest book just because I trust the quality of their work.  Now on that list are Roland Bouman, Jos van Dongen, and Matt Casters.  In a follow up to Bouman's and van Dongen's Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL, the have now produced Pentaho Kettle Solutions which explores the often murky world of ETL and data integration. 

Kettle can be confusing as there are many components with  names such as spoon and pan but these tools are valuable to any DBA who has to feed data into an instance.  If you have a data ware house the tools are invaluable.

This book covers initial steps with Kettle to cleaning up the worst raw data and even getting the information into the now ubiquitous cloud.  Having stumbled along with a few tutorials and the Kettle docs, I can testify that this book is a must have for anyone wanting to use Kettle to feed a database.  The text is clear, the example concise, and the book progresses logically.

Kettle is wonderful for modifying raw data before being sent to your database.  need to clean up telephone numbers from xxx-xxx-xxxx, (xxx) xxx-xxxx, or xxxxxxxxxx in your format of choice?  Or are state or country abbreviations (Ca, CA, Cal., & California) causing problems.  Then you need Kettle and this book.

PlanetMySQL Voting: Vote UP / Vote DOWN

Back to blogging….

Август 11th, 2010
It has been a while since I posted on my blog - in fact, I believe this is the first time ever that more than one month passed between posts since I started blogging. There are a couple of reasons for the lag:

  • Matt Casters, Jos van Dongen and me have spent a lot of time finalizing our forthcoming book, Pentaho Kettle Solutions (Wiley, ISBN: 978-0-470-63517-9). The book is currently being produced, and should be available according to schedule in early September 2010. If you're interested, you might like to read one of my earlier posts that explains the organization and outline of the book.

    (I should point out that we have reorganized the outline as the project progressed, so the final result will not have all the chapters mentioned in that post. We do however cover most of the topics mentioned.)

  • I have been checking out Quipu, a promising Open Source data warehouse management solution. Quipu provides a repository-based extensible code-generator that allows you to generate and maintain a data warehouse based on the Data Vault model. One of the things I want to do in the short term is write templates that allows it to work for MySQL, and after that, I want to see if I can get Quipu to generate Kettle Jobs and transformations.

  • I have been working on a couple of software projects. Two of them are currently available as open source on google code:

    mql-to-sql
    This allows you to use the Metaweb Query Language (MQL) to query a RDBMS. If you're wondering what this is all about: MQL is the query language used by Freebase (which is a collaborative "database-of-everything" or "wikipedia-gone-database").

    While Freebase is interesting in its own right, I am particularly enthused about the MQL query language. I feel that MQL is an exceptionally good solution for flexible, expressive and secure data access for modern (AJAX) Web applications. Even though MQL was not developed with relational database systems in mind, I think it is a pretty good fit.

    Anyway, this is very much a work in progress, and I appreciate your feedback on it. If you're interested, you can read a bit more about my take on RDBMS data access for web applications and MQL on the mql-to-sql project home page. I have also put up an online demo that allows you to query the sakila MySQL sample database using MQL.

    kettle-cookbook
    This is a project that provides auto-documentation for Kettle (a.k.a. Pentaho Data Integration).

    The project consists of a bunch of Kettle Jobs and transformations (as well as some XSLT stylesheets) that extract data from Kettle Jobs and transformation and transform it into a collection of human-readable HTML documents along with a table of contents. The resulting documentation looks and feels a bit like JavaDoc documentation.

    If all goes well, I will be presenting kettle-cookbook in a Pentaho web seminar, which is currently scheduled for September 15, 2010


  • I've been enjoying 4 weeks of vacation (I started working this week). There's not much to tell about that, other than that it was great spending a lot of time with my family. I plan to do this more often, and now that Kettle Solutions is finished I should be able to find more time to do it.

  • I've been looking at two emerging HTML5 APIs for client-side structured storage, the Web SQL Database API and the Indexed DB API.

    I have developed a few thoughts about the ongoing debate (see this and article for some background) about which one is better and I see a role for something like MQL here too. I will probably write something about this in the next few weeks

  • Yesterday, I got wind of the JS1k contest! Basically, the challenge is to write an interesting standalone JavaScript demo program that must be no larger than 1024 bytes. It is amazing and inspiring to see what people manage to do with a modern browser in 1k of JavaScript code.

    I decided to try it myself, and you can find my submission here: An interactive SQL query tool for the Web SQL DB API.

    Essentially, you get a textarea where you can enter arbitrary SQLite queries, and button to execute the SQL, and a result area that will print a feedback message and a result table (if applicable). As a bonus, there's a button to get a listing of the available database objects (using a query on sqlite_master) and an explain button to show the query plan of the current SQL statement.

    The demo works on recent versions of Google Chrome, Apple Safari and Opera. It can run offline, and does not require any plugins. I should say that I expect my submission will be rejected by the judges since the demo is not functional on Mozilla Firefox, which is a requirement. (That is, the script will detect that the Web SQL Database API is not supported and print a message to that effect). However, it was still fun to try my hand at it.


Ok - that's it for now. I will try and post more regularly and write about these and other things in the near future. Don't hesitate to leave a comment if you have any questions or suggestions.

PlanetMySQL Voting: Vote UP / Vote DOWN

Book Review : Pentaho 3.2 Data Integration

Май 6th, 2010

Dear Kettle fans,

A few weeks ago, when I was stuck in the US after the MySQL User Conference, a new book was published by Packt Publishing.

That all by itself is something that is not too remarkable.  However, this time it’s a book about my brainchild Kettle. That makes this book very special to me. The full title is Pentaho 3.2 Data Integration : Beginner’s Guide (Amazon, Packt).  The title all by itself explains the purpose of this book: give the reader a quick-start when it comes to Pentaho Data Integration (Kettle).

The author María Carina Roldán (twitter) is a seasoned BI consultant and a valued member of the Kettle community. Besides her frequent appearances on our forum, she is appreciated by many for the time she spent on the Kettle Tutorial.

I’m not going to go over the detailed table of content.  Since I wrote the foreword of the book, I’m sure you’ll agree I’m somewhat biased. However, in all objectivity, the book covers what it claims to cover: it does help the PDI/Kettle beginner tremendously.  It covers all you need to get started and then some: the installation of PDI, the typical “Hello World” setup of PDI, reading text files, calculating, scripting, databases, repositories, etc.  As the title indicates, this book covers the current 3.2 stable release of Kettle, not the upcoming 4.0 release. However, for as far as 99% of the topics covered are concerned, that shouldn’t make too much of a difference.

So obviously I can recommend this book very much. It’s a time-saver for those that are starting with PDI.  For those that have dabbled with Kettle before I must say that María packed the book with nice tips and tricks so I’m sure you’ll be able to learn a thing or two.

Until next time,

Matt


PlanetMySQL Voting: Vote UP / Vote DOWN