Archive for the ‘Pentaho’ Category

Pentaho Kettle Solutions Overview

October 8th, 2010

Dear Kettle friends,

As mentioned in my previous blog post, copies of our new book Pentaho Kettle Solutions are finally shipping.  Roland, Jos and I worked really hard on it and, as you can probably imagine, we were really happy when we finally got the physical version of our book in our hands.


So let’s take a look at what’s in this book, what the concept behind it was and give you an overview of the content…

The concept

Given that Maria’s book, Pentaho 3.2 Data Integration, was due when we started, we knew that a beginner’s guide would be available by the time this book was ready.  As such we opted to look at what the data warehouse professional might need when he or she starts to work with Kettle.  Fortunately there is already a good and well-known checklist out there to see if you covered everything ETL related: The 34 subsystems of ETL, a concept by Ralph Kimball that was first featured in his book The Data Warehouse Lifecycle Toolkit.  And so we asked Mr. Kimball’s permission to use his list, which he kindly provided.  He was also gracious enough to review the related chapter of our book.

By using this approach we allow readers to flip to a certain chapter in our book and directly get the information they want on the problem they are facing at that time. For example, Change Data Capture (subsystem 2, a.k.a. CDC) is handled in Chapter 6: Data Extraction.

In other words: we did not start with the capabilities of Kettle. We did not take every step or feature of Kettle as a starting point.  In fact, there are plenty of steps we did not cover in this book.  However, everywhere a step or feature needed to be explained while covering all the sub-systems we did so as clearly as we could.  Rest assured though; since this book handles just about every topic related to data integration, all of the basic and 99% of the advanced features of Kettle are indeed covered in this book ;-)

The content

After a gentle introduction into how ETL tools came about and more importantly how and why Kettle came into existence, the book covers 5 main parts:

1. Getting started

This part starts with a primer that explains the need for data integration and takes you by the hand into the wonderful world of ETL.
Then all the various building blocks of Kettle are explained.  This is especially interesting for folks with prior data integration experience, perhaps with other tools, as they can read all about the design principles and concepts behind Kettle.
After that the installation and configuration of Kettle is covered. Since the installation is a simple unzip, that includes a detailed description of all the available tools and configuration files.
Finally, you’ll get hands-on experience in the last chapter of the first part titled “An example ETL Solution - Sakila”.  This chapter explains in great detail how a small but complex data warehouse can be created using Kettle.
2. ETL
In this part you’ll first encounter a detailed overview of the 34 sub-systems of ETL, after which the art of Data Extraction is covered in detail.  That includes extracting information from all sorts of file types and databases, working with ERP and CRM systems, data profiling and CDC.
This is followed by chapter 7 “Cleansing and Conforming” in which the various data cleansing and validation steps are covered as well as error handling, auditing, deduplication and last but not least scripting and regular expressions.
Finally this second part of the book will cover everything related to star schemas including the handling of dimension tables (chapter 8), loading of fact tables (chapter 9) and working with OLAP data (chapter 10).
3. Management and deployment
The third main part of the book deals with everything related to the management and deployment of your data integration solution.  First you’ll read all about the ETL development lifecycle (chapter 11), scheduling and monitoring (chapter 12), versioning and migration (chapter 13) and lineage and auditing (chapter 14).  As you can guess from the titles of the chapters, a lot of best practices, do’s-and-don’ts are covered in this part.
4. Performance and scalability
The 4th part of our book really dives into the often highly technical topics surrounding performance tuning (chapter 15), parallelization, clustering and partitioning (chapter 16), dynamic clustering in the cloud (chapter 17) and real-time data integration (chapter 18).
I personally hope that the book will lead to more performance-related JIRA cases, since chapter 15 explains how you can detect bottlenecks :-)
5. Advanced topics
The last part, conveniently titled “Advanced topics”, deals with things we thought were interesting to a data warehouse engineer or ETL developer who is faced with concepts like Data Vault management (chapter 19), handling complex data formats (chapter 20) or web services (chapter 21).  Indispensable in case you want to embed Kettle into your own software is chapter 22: Kettle integration.  It contains many Java code samples that explain how you can execute jobs and transformations or even assemble them dynamically.
Last but certainly not least since it’s probably one of the most interesting chapters for a Java developer is chapter 23: Extending Kettle.  This chapter explains to you how you can develop step, job-entry, partitioning or database type plugins for Kettle in great detail so that you can get started with your own components in no time.

I hope that this overview of our new brain-child gives you an idea of what you might be buying into. Since all books are essentially a compromise between page count, time and money, I’m sure there will be the occasional typo or lack of precision, but rest assured that we did our utmost on this one.  After all, we did each spend over 6 months on it…

Feel free to ask about specific topics you might be interested in to see if they are covered ;-)

Until next time,

Matt



Open Source BI — Pentaho and Jaspersoft Part I

July 14th, 2010
Hey DBAs! Are you seeking more efficient ways of sifting through your data to aid your business operations? Two popular Business Intelligence products with community open source editions are Pentaho and JasperSoft. And both work with MySQL.

Both are easy to download and install. Both will use a JDBC connector to connect to MySQL. But how easy is it to configure the two and run a simple report against a running instance of MySQL?


Setting up a JDBC connection with JasperSoft or Pentaho is pretty much like using any other JDBC connection.

The next step is to set up a query like SELECT name, job_title, department FROM employees, departments WHERE employees.emp_id = departments.emp_id. Either package will let you pick from a variety of output templates. Then you have the BI software merge your query with the template. I honestly think an average MySQL DBA could fairly quickly generate a nice-looking report from their instance, and that JasperSoft would be just a little bit faster.
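To make the shape of that query concrete, here is a minimal sketch that runs it against throwaway data. It is not tied to either BI product: Python's built-in sqlite3 module stands in for a MySQL JDBC connection, the table contents are invented, and the implicit comma join from the text is written as an equivalent explicit JOIN.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; the rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT, job_title TEXT);
    CREATE TABLE departments (emp_id INTEGER, department TEXT);
    INSERT INTO employees VALUES (1, 'Ada', 'Engineer'), (2, 'Grace', 'Analyst');
    INSERT INTO departments VALUES (1, 'R&D'), (2, 'Finance');
""")

# Same result set as the comma join with a WHERE clause, written as a JOIN.
query = """
    SELECT name, job_title, department
    FROM employees
    JOIN departments ON employees.emp_id = departments.emp_id
    ORDER BY name
"""
rows = conn.execute(query).fetchall()
conn.close()

for name, job_title, department in rows:
    print(name, job_title, department)
```

The report template in either package would then be fed the three columns this query returns.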

In part two of this series, the steps will be more detailed and documented. There will also be a comparison of the two products. Both products are part of larger projects and there are many useful tools that work with the BI software that you will want to investigate. More on those in later posts.

And in a short time you should be able to download a VirtualBox image with both community BI programs and an InfiniDB instance with some data sets. This way you can test all three simply. I would also consider doing a VMware version if there is demand for it.


Book Review : Pentaho 3.2 Data Integration

May 6th, 2010

Dear Kettle fans,

A few weeks ago, when I was stuck in the US after the MySQL User Conference, a new book was published by Packt Publishing.

That all by itself is something that is not too remarkable.  However, this time it’s a book about my brainchild Kettle. That makes this book very special to me. The full title is Pentaho 3.2 Data Integration : Beginner’s Guide (Amazon, Packt).  The title all by itself explains the purpose of this book: give the reader a quick-start when it comes to Pentaho Data Integration (Kettle).

The author María Carina Roldán (twitter) is a seasoned BI consultant and a valued member of the Kettle community. Besides her frequent appearances on our forum, she is appreciated by many for the time she spent on the Kettle Tutorial.

I’m not going to go over the detailed table of contents.  Since I wrote the foreword of the book, I’m sure you’ll agree I’m somewhat biased. However, in all objectivity, the book covers what it claims to cover: it does help the PDI/Kettle beginner tremendously.  It covers all you need to get started and then some: the installation of PDI, the typical “Hello World” setup of PDI, reading text files, calculating, scripting, databases, repositories, etc.  As the title indicates, this book covers the current 3.2 stable release of Kettle, not the upcoming 4.0 release. However, as far as 99% of the topics covered are concerned, that shouldn’t make too much of a difference.

So obviously I can recommend this book very much. It’s a time-saver for those that are starting with PDI.  For those that have dabbled with Kettle before I must say that María packed the book with nice tips and tricks so I’m sure you’ll be able to learn a thing or two.

Until next time,

Matt



Part 2: Comparing Numerics in Pentaho Data Integration

May 5th, 2010
As a follow-up to my previous post about comparing numeric values, I've since discovered a little more about the problem. To repeat my original problem: certain numeric field values that should be equal are being detected as different in the Filter rows step. I think it's important to be able to perform accurate comparisons since it is a frequent task in data quality analysis.

Originally, I assumed this had something to do with JDBC. However, since I can reproduce the issue without any SQL, I'm sure this has nothing to do with the version of the MySQL Connector/J JDBC driver. I tried the 5.0.8 version of the driver and I observed the same behavior. I couldn't even get my transform to work correctly with the 5.1.12 version of the connector -- it does not recognize column aliases in my SQL query.

Now for the rest of the story:

My comparison of numeric data was between 2 fields from two data streams, initiated by two separate SQL table inputs.
  • The first data stream is from a "raw" table. From the input step it is passed through a "sort rows" and then a "group by" step to aggregate the numeric values.
  • The second data stream is from a "rollup" table where the raw data is summarized.

The two streams are then merged (by a unique id) and compared in order to validate the data in the rollup.

At this point, the problem seems more related to the metadata of the fields. I found two resolutions to choose from:

Use "group by" in SQL. The data types of the output numerics are magically set to BigNumber.

or

Place a "Select/Rename Values" step after the sort and before the "Group By" step to coerce the metadata of the fields to be of BigNumber type.

Personally, I prefer the second option because I like to extract the data as quickly as possible. Pentaho can handle sorting and grouping of somewhat large datasets just fine.
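One plausible reading of why coercing the type before aggregating helps, sketched outside of Kettle: as far as I know, Kettle's Number type is a double-precision float while BigNumber is an exact decimal, so a sum built from floats can drift away from a pre-computed rollup value. The Python sketch below uses float and decimal.Decimal as stand-ins for those two types; the amounts are invented and this is an analogy, not the transformation from the post.

```python
from decimal import Decimal

# Hypothetical "raw" rows: ten detail amounts of 0.10 each for one id.
raw_amounts = ["0.10"] * 10
# Hypothetical "rollup" row holding the pre-summarized total for the same id.
rollup_total = "1.00"

# Aggregating as binary floating point accumulates representation error,
# so the aggregated value no longer equals the rollup value.
float_sum = sum(float(v) for v in raw_amounts)
float_match = (float_sum == float(rollup_total))

# Coercing to an exact decimal type *before* aggregating keeps the sum exact.
decimal_sum = sum(Decimal(v) for v in raw_amounts)
decimal_match = (decimal_sum == Decimal(rollup_total))

print("float comparison equal:  ", float_match)
print("decimal comparison equal:", decimal_match)
```

The order matters: converting to an exact type only after a float aggregation would just preserve the error that has already crept in, which fits with placing the type change before the "Group By" step.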

I have a new example transform here: numeric_compare_filter_values_try2.ktr

Here is a picture of the example transform:



Slides from my MySQL UC 2010 presentation

April 27th, 2010

As requested by a few fans out there, here are the slides of my presentation:

Pentaho Data Integration 4.0 and MySQL.pdf

I had a great time at the conference, met a lot of nice folks, friends, customers, partners and colleagues. Like so many of you, I was unable to get back home after the conference because of the Paul Simon singing Eyjafjallajökull volcano in Iceland.

So I ended up flying over to Orlando for a week of brutal PDI 4.0 RC1 hacking with the rest of the l33t super Pentaho development team.  However, after 2+ weeks away from home, even a severe storm over Philadelphia couldn’t prevent me from getting home eventually.

Until next time,
Matt



Comparing Numerics in Pentaho Data Integration / Kettle

April 14th, 2010
While working on a transformation I ran into a problem with comparing two (seemingly) identical numbers using the Filter Rows step. I had a case where a transformation selected two DECIMAL(13,5) values from the database and compared them.

I could see that the numbers were identical in the MySQL database, but the Filter Rows step returned false when comparing. To troubleshoot, I tried multiplying the difference of the two numbers by 10,000,000 (in the transform) and I actually discovered a very small difference beyond the 5th decimal place. This datatype in MySQL is considered an "exact" datatype, not to be confused with a FLOAT.

My solution is to convert the two fields to Strings and do the comparison. If you don't like that, then explicitly round the numbers using the Calculator step. The Select and Alter step doesn't seem to truncate the numbers, and I don't think it was intended to alter the raw data anyway.
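The two workarounds can be illustrated outside of Kettle. The sketch below is plain Python, not a PDI transformation: the two values are invented, floats stand in for the fields as they arrive in the stream, and the helper names are hypothetical. It compares the values directly, as fixed-scale strings, and after explicit rounding to the column's scale of 5.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical values: both should be 1234.56789, but the second carries a
# tiny difference beyond the 5th decimal place.
a = 1234.56789
b = 1234.56789 + 1e-9

def equal_as_strings(x, y, places=5):
    """Format both values to a fixed number of decimals and compare the text."""
    return f"{x:.{places}f}" == f"{y:.{places}f}"

def equal_after_rounding(x, y, places=5):
    """Round both values to the column's scale using exact decimal arithmetic."""
    quantum = Decimal(1).scaleb(-places)  # 0.00001 for places=5
    dx = Decimal(repr(x)).quantize(quantum, rounding=ROUND_HALF_UP)
    dy = Decimal(repr(y)).quantize(quantum, rounding=ROUND_HALF_UP)
    return dx == dy

# A direct comparison sees the hidden difference and reports "not equal".
naive_equal = (a == b)

print("direct:  ", naive_equal)
print("strings: ", equal_as_strings(a, b))
print("rounded: ", equal_after_rounding(a, b))
```

Both workarounds deliberately throw away digits beyond the declared scale, so they only make sense when the column really is defined with that scale, as with DECIMAL(13,5) here.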

I haven't had time to figure out if or how PDI is converting the numbers, but my gut feeling is that the road to flakiness passes through the MySQL JDBC driver installed with PDI 3.2. It's somewhat old, and my experiments with newer versions (of the driver) introduced new problems.

The link below is an example transformation (version 3.2) showing what I am talking about. See the Generate rows step and notice the very small difference between the two numbers.

> numeric_compare_filter_values.ktr

Enjoy.



MySQL User Conference 2010

April 10th, 2010

Dear Kettle and MySQL fans,

Next week I’ll be strolling around the MySQL user conference in Santa Clara.  Even better, I’ll be presenting Tuesday afternoon (3:05pm).  The topic is Pentaho Data Integration 4.0 and MySQL.

The presentation will show you what the world’s most popular open source data integration tool can do for a MySQL user.  It will include practical examples and will showcase the latest improvements present in the brand new version 4.0.

Even more than the presentation itself, I’m looking forward to meeting you all over there.  The regular crowd, MySQL users, Pentaho partners, folks from Calpont, Continuent, SQLStream and many others but also the many new colleagues in San Francisco.

More than anything else I’m looking forward to hearing about your Kettle successes and real-world data integration war stories.  If you want to chat about Kettle 4, see things first-hand or simply join me for a beer, don’t hesitate to ask.  I’ll try to regularly tweet my whereabouts at the user conference so you’ll know where I’m at.

Let me finish with a note to everybody that promised me beer in return for features and bug fixes: it’s payback time!

See you soon at the conference!
Matt



Investing in Disruption

March 30th, 2010

 
I'm an advisor, investor and board member to several startup software companies including Revolution Computing, Pentaho and most recently Erply, a new Software as a Service (SaaS) company.  One of the common threads I look for is the opportunity to disrupt a large market.

One of the things that made MySQL successful was its use of open source technology to disrupt the multi-billion dollar database market.  In Silicon Valley, people often talk about disruption, but usually what they mean is they have some new feature or a new way to do things that is 10x faster or 10x cheaper.  Those are good things, but that's not necessarily sufficient to make a business truly disruptive.  

The classic disruption model as defined by Clayton Christensen comes down to 4 important factors:

  1. There's a proven market with large incumbents
    This demonstrates that customers are willing to pay money to solve this problem

  2. There are underserved customers whose needs are not being met by the incumbents
    They may be receptive to a "good enough" product that is easy to access

  3. The incumbents cannot profitably meet the needs of this market
    Ideally, their entry into this market would hurt their core business 

  4. To disrupt a market, you need to disrupt all the players, not just some of them
    If there are other players, you need to disrupt all of them

If you have all of those things, then your business could be disruptive.  But many startup companies ignore the third point.  It's not enough to do something the incumbents don't do today; you want to do something that they cannot do, because it would hurt their existing business.

In the case of MySQL, the product targeted the underserved web developer market.  MySQL was not only a better fit technically in that area, but due to its open source model, it was a business that was unattractive to the incumbents. (Or it was, until it grew to beyond $100 million in revenue.  Now Oracle will leverage this force to compete against Microsoft SQL Server.)  

There are plenty of great businesses out there that are not disruptive; perhaps you're creating a new market, or you're introducing a new innovation that the incumbents have not discovered.  Disruption isn't the only strategy, but if you can make your business disruptive, you gain a significant advantage in the marketplace.



Writing another book: Pentaho Kettle Solutions

March 14th, 2010
Last year, at about this time of the year, I was well involved in the process of writing the book "Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL" for Wiley. To date, "Pentaho Solutions" is still the only all-round book on the open source Pentaho Business Intelligence suite.

It was an extremely interesting project to participate in, full of new experiences. Although the act of writing was time consuming and at times very trying for me as well as my family, it was completely worth it. I have none but happy memories of the collaboration with my full co-author Jos van Dongen, our technical editors Jens Bleuel, Jeroen Kuiper, Tom Barber and Tomas Morgner, several of the Pentaho Developers, and last but not least, the team at Wiley, in particular Robert Elliot and Sara Shlaer.

When the book was finally published, late August 2009, I was very proud - as a matter of fact, I still am :) Both Jos and I have been rewarded with a lot of positive feedback, and so far, book sales are meeting the expectations of the publisher. We've had mostly positive reviews on places like Amazon, and elsewhere on the web. I'd like to use this opportunity to thank everybody that took the time to review the book: Thank you all - it is very rewarding to get this kind of feedback, and I appreciate it enormously that you all took the time to spread the word. Beer is on me next time we meet :)

Announcing "Pentaho Kettle Solutions"


In the early autumn of 2009, just a month after "Pentaho Solutions" was published, Wiley contacted Jos and me to find out if we were interested in writing a more specialized book on ETL and data integration using Pentaho. I felt honoured, and took the fact that Wiley, an experienced and renowned publisher in the field of data warehousing and business intelligence, voiced interest in another Pentaho book by Jos and me as a token of confidence and encouragement that I value greatly. (For Pentaho Solutions, we heard that Wiley was interested, but we contacted them.) At the same time, I admit I had my share of doubts, having the memories of what it took to write Pentaho Solutions still fresh in my mind.

As it happens, Jos and I both attended the 2009 Pentaho Community Meeting, and there we seized the opportunity to talk to Matt Casters, chief of Data Integration at Pentaho and founding developer of Kettle (a.k.a. Pentaho Data Integration). Neither Jos nor I expected Matt to be able to free up any time in his ever busy schedule to help us write the new book. Needless to say, he made us both very happy when he rather liked the idea, and expressed immediate interest in becoming a full co-author!

Together, the three of us made a detailed outline and wrote a formal proposal for Wiley. Our proposal was accepted in December 2009, and we have been writing since. The tentative title of the book is Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration. It is planned to be published in September 2010, and it will have approximately 750 pages.



Our working copy of the outline is quite detailed but may still change in the future, which is why I won't publish it here until we have finished our first draft of the book. I am 99% confident that the top level of the outline is stable, and I have no reservation in releasing that already:

  • Part I: Getting Started

    • ETL Primer

    • Kettle Concepts

    • Installation and Configuration

    • Sample ETL Solution


  • Part II: ETL Subsystems

    • Overview of the 34 Subsystems of ETL

    • Data Extraction

    • Cleansing and Conforming

    • Handling Dimension Tables

    • Fact Tables

    • Loading OLAP Cubes


  • Part III: Management and Deployment

    • Testing and Debugging

    • Scheduling and Monitoring

    • Versioning and Migration

    • Lineage and Auditing

    • Securing your Environment

    • Documenting


  • Part IV: Performance and Scalability

    • Performance Tuning

    • Parallelization and Partitioning

    • Dynamic Clustering in the Cloud

    • Realtime and Streaming data


  • Part V: Integrating and Extending Kettle

    • Pentaho BI Integration

    • Third-party Kettle Integration

    • Extending Kettle


  • Part VI: Advanced Topics

    • Webservices and Web APIs

    • Complex File Handling

    • Data Vault Management

    • Working with ERP Systems



Feel free to ask me any questions about this new book. If you're interested, stay tuned - I will probably be posting 2 or 3 updates as we go.

