Archive for the ‘OLAP’ Category

Intro to OLAP

July 12th, 2010

This is the first of a series of posts about business intelligence tools, particularly OLAP (online analytical processing) tools using MySQL and other free open source software. OLAP tools are part of the larger topic of business intelligence, a topic that has not had a lot of coverage on MPB. Because of this, I am going to start out by talking about these topics in general, rather than getting right to the gritty details of their performance.

I plan on covering the following topics:

  1. Introduction to OLAP and business intelligence. (this post)
  2. Identifying the differences between a data warehouse and a data mart.
  3. Introduction to MDX queries and the kind of SQL which a ROLAP tool must generate to answer those queries.
  4. Performance challenges with larger databases, and some ways to help performance using aggregation.
  5. Using materialized views to automate that aggregation process.
  6. Comparing the performance of OLAP with and without aggregation over multiple MySQL storage engines at various data scales.

What is BI?
Chances are that you have heard the term business intelligence. Business intelligence (or BI) is a term which encompasses many different tools and methods for analyzing data, usually presenting it in a way that is easily consumed by upper management. This analysis is often used to determine how effective the business has been at meeting certain performance goals, and to forecast how it will do in the future. To put it another way, the tools are designed to provide insight into the business process, hence the name. Probably the most popular BI activity for web sites is click analysis.

As far as BI is concerned, this series of posts focuses on OLAP analysis and, to a lesser extent, on data warehousing. Data warehouses often provide the information upon which OLAP analysis is performed, but more on this in post #2.

OLAP? What is that?
OLAP is an acronym which stands for online analytical processing. OLAP analysis, which is really just another name for multidimensional analysis, consists of displaying summary aggregations of the data broken down into different groups. A typical OLAP analysis might show “sale total, by year, by sales rep, by product category”. OLAP analysis is usually used for reporting on current data, looking at historical trends and trying to make predictions about future trends.

Multidimensional Analysis
Multidimensional analysis is a form of statistical analysis. In multidimensional analysis, samples representing a particular measure are compared or broken down into different dimensions. For example, in a sales analysis, the “sale amount” is a measure. Measures are always aggregated values. That is, total sales might be expressed as SUM(sale_amt). This is because the SUM of the individual sales will be grouped along different dimensions, such as by year or by product. But I’m getting a little ahead of myself. Before we talk about measures and dimensions, we should talk about the two ways in which this information can be stored.
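A minimal sketch of the measure/dimension idea, using Python's built-in sqlite3 module as a stand-in for MySQL. The table and column names (sales, sale_year, product) and the rows are made up for illustration; only SUM(sale_amt) comes from the text.

```python
import sqlite3

# Hypothetical toy data: table name, column names and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_year INTEGER, product TEXT, sale_amt REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (2009, "widget", 10.0),
        (2009, "gadget", 25.0),
        (2010, "widget", 15.0),
        (2010, "widget", 5.0),
    ],
)

# The measure is SUM(sale_amt); the dimensions are the GROUP BY columns.
by_year = conn.execute(
    "SELECT sale_year, SUM(sale_amt) FROM sales "
    "GROUP BY sale_year ORDER BY sale_year"
).fetchall()

# Adding a second dimension breaks the same measure down further.
by_year_and_product = conn.execute(
    "SELECT sale_year, product, SUM(sale_amt) FROM sales "
    "GROUP BY sale_year, product ORDER BY sale_year, product"
).fetchall()

print(by_year)
print(by_year_and_product)
```

The same measure is reused unchanged; only the list of grouping columns differs between the two queries.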

There are two main ways to store multidimensional data for OLAP analysis
OLAP servers typically come in two basic flavors. Some servers have specialized data stores which hold data in a form that is highly effective for multidimensional analysis. These servers are termed MOLAP, and they tend to have exceptional performance due to their specialized data store. Almost all MOLAP solutions pre-compute many (or even all) of the possible answers to multidimensional queries. Palo is an example of an open source version of this technology. Essbase is an example of a closed source product. MOLAP servers often feature extensive compression of data, which can improve performance. Loading data into a MOLAP server usually takes a very long time because many of the answers in the cube must be calculated. The extra time spent during the load is usually called “processing” time.

A relational OLAP (or ROLAP) server uses data stored in an RDBMS. These systems trade the performance of a multidimensional store for the convenience of an RDBMS. These servers almost always query a database which is structured as a star or snowflake schema. To go back to the sales analysis example above, in a star schema the facts about the sales would be stored in the fact table, and the list of customers and products would be stored in separate dimension tables. Some ROLAP servers support the aggregation of data into additional tables, and can use those tables automatically. These servers can approach the performance of MOLAP with the convenience of ROLAP, but there are still challenges with this approach. The biggest challenges are the amount of time that it takes to keep the tables updated and the complexity of the many scripts or jobs which might be necessary to keep the tables in sync. Part five of this series will introduce materialized views, which attempt to address these challenges in a manageable way.
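To make the aggregate-table trade-off concrete, here is a small sketch using Python's sqlite3 module in place of MySQL. The table names (fact_sales, agg_sales_by_year), the columns and the full-rebuild refresh are assumptions for illustration; this is not how any particular ROLAP server maintains its aggregates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (sale_date TEXT, product_id INTEGER, sale_amt REAL);
INSERT INTO fact_sales VALUES
  ('2009-06-01', 1, 10.0),
  ('2009-06-15', 1, 20.0),
  ('2009-07-02', 2, 5.0),
  ('2010-01-20', 1, 7.5);

-- Hypothetical summary table, built once from the detail rows.
CREATE TABLE agg_sales_by_year AS
  SELECT substr(sale_date, 1, 4) AS sale_year, SUM(sale_amt) AS total_amt
    FROM fact_sales
   GROUP BY substr(sale_date, 1, 4);
""")

def totals_from_detail():
    # Scans every fact row on each call.
    return conn.execute(
        "SELECT substr(sale_date, 1, 4) AS sale_year, SUM(sale_amt) "
        "FROM fact_sales GROUP BY substr(sale_date, 1, 4) ORDER BY sale_year"
    ).fetchall()

def totals_from_aggregate():
    # Reads only the (much smaller) summary table.
    return conn.execute(
        "SELECT sale_year, total_amt FROM agg_sales_by_year ORDER BY sale_year"
    ).fetchall()

in_sync_after_build = totals_from_detail() == totals_from_aggregate()

# A new fact row arrives: the summary table is now out of date.
conn.execute("INSERT INTO fact_sales VALUES ('2010-02-01', 2, 2.5)")
stale_after_insert = totals_from_detail() != totals_from_aggregate()

# Simplest possible refresh: rebuild the whole summary table.
conn.execute("DELETE FROM agg_sales_by_year")
conn.execute(
    "INSERT INTO agg_sales_by_year "
    "SELECT substr(sale_date, 1, 4), SUM(sale_amt) "
    "FROM fact_sales GROUP BY substr(sale_date, 1, 4)"
)
in_sync_after_refresh = totals_from_detail() == totals_from_aggregate()

print(in_sync_after_build, stale_after_insert, in_sync_after_refresh)
```

A full rebuild like this is easy to write but gets slower as the fact table grows, which is exactly the maintenance cost described above.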

What makes a ROLAP so great?
An OLAP server usually returns information to the user as a ‘pivot table’ or ‘pivot report’. While you could create such a report in a spreadsheet, a ROLAP tool is designed to deal with millions or even billions of rows of data, much more than a spreadsheet can usually handle. MOLAP servers, by contrast, usually require that all, or almost all, of the data fit in memory. Another difference is the ease with which this analysis is constructed. You don’t necessarily have to write queries or drag and drop a report together in order to analyze multidimensional data using an OLAP tool.

Data before pivoting:
Example image from Wikimedia commons showing detail data for sales

Data summarized in pivot form:
Wikimedia commons image showing data summarized in pivot format
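For readers without the images, the following sketch shows the same idea on a tiny made-up data set: detail rows go in, and a pivot with one row per gender and one column per region comes out. It uses Python's sqlite3 module; the table name, column names and values are assumptions and do not reproduce the Wikimedia example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE detail (gender TEXT, region TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO detail VALUES (?, ?, ?)",
    [
        ("F", "East", 3),
        ("F", "West", 4),
        ("M", "East", 5),
        ("M", "East", 1),
        ("M", "West", 2),
    ],
)

# Pivot by hand: one output row per gender, one output column per region.
# Each CASE expression keeps only the rows that belong in that column.
pivot = conn.execute(
    """
    SELECT gender,
           SUM(CASE WHEN region = 'East' THEN units ELSE 0 END) AS east,
           SUM(CASE WHEN region = 'West' THEN units ELSE 0 END) AS west,
           SUM(units) AS total
      FROM detail
     GROUP BY gender
     ORDER BY gender
    """
).fetchall()

for row in pivot:
    print(row)
```

An OLAP tool generates this kind of query for you; written by hand, every new region would need another CASE column.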

ROLAP tools use star schema
As I said before, a sale amount would be considered a measure, and it would usually be aggregated with SUM. The other information about the sale, such as the product, when it was sold and to whom it was sold would be defined in dimension tables. The fact table contains columns which are joined to the dimension tables, such as product_id and customer_id. These are often defined as foreign keys from the fact table to the dimension tables.

A note about degenerate dimensions:
Any values in the fact table that don’t join to dimensions are considered either degenerate dimensions or measures. In the example below, the status of the order is a degenerate dimension. A degenerate dimension is stored as an ENUM in many cases. In the example below there is no actual dimension table which includes the two different order statuses. Such a dimension would add an extra join, which is expensive. Any yes/no field, or any field with very low cardinality (such as gender or order status), will probably be stored in the fact table instead of in a dedicated dimension. In the “pivot data” example above, all the dimensions are degenerate: gender, region, style, date.

Star schema with degenerate dimension

Example star schema about sales.
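A small star schema along these lines can be sketched as follows, again with Python's sqlite3 module standing in for MySQL. All table names, column names and rows are hypothetical. SQLite has no ENUM type, so a CHECK constraint plays that role for the order_status degenerate dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);

CREATE TABLE fact_sales (
  product_id   INTEGER REFERENCES dim_product(product_id),
  customer_id  INTEGER REFERENCES dim_customer(customer_id),
  -- degenerate dimension: lives in the fact table, no dimension table to join
  order_status TEXT CHECK (order_status IN ('open', 'shipped')),
  -- measure
  sale_amt     REAL
);

INSERT INTO dim_product  VALUES (1, 'road bike', 'bikes'), (2, 'helmet', 'accessories');
INSERT INTO dim_customer VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO fact_sales VALUES
  (1, 1, 'shipped', 500.0),
  (2, 1, 'shipped', 40.0),
  (1, 2, 'open',    450.0),
  (2, 2, 'shipped', 35.0);
""")

# Grouping by category needs a join to dim_product;
# grouping by order_status needs no join at all.
sales_by_category_and_status = conn.execute(
    """
    SELECT p.category, f.order_status, SUM(f.sale_amt)
      FROM fact_sales f
      JOIN dim_product p ON p.product_id = f.product_id
     GROUP BY p.category, f.order_status
     ORDER BY p.category, f.order_status
    """
).fetchall()

for row in sales_by_category_and_status:
    print(row)
```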

Often a dimension will include redundant information to make reporting easier, a process called “denormalization”. Hierarchical information may be stored in a single dimension. For example, a dimension for products may include both the category AND a sub-category. A time dimension includes year, month and quarter. You can create multiple different hierarchies from a single dimension. This allows ‘drill down’ into the dimension. By default the data would be summarized by year, but you can drill down to quarter or month level aggregation.
Sample date hierarchy, showing quarter, month, year and day hierarchies.
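Drill down can be thought of as re-running the same aggregation with one more level of the hierarchy in the GROUP BY. The sketch below assumes a hypothetical dim_date table with cal_year, cal_quarter and cal_month columns and uses Python's sqlite3 module; the rollup helper is an illustration, not part of any OLAP tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (
  date_id INTEGER PRIMARY KEY,
  cal_year INTEGER, cal_quarter INTEGER, cal_month INTEGER
);
CREATE TABLE fact_sales (date_id INTEGER, sale_amt REAL);

INSERT INTO dim_date VALUES (1, 2009, 1, 2), (2, 2009, 2, 5), (3, 2009, 2, 6), (4, 2010, 1, 1);
INSERT INTO fact_sales VALUES (1, 10.0), (2, 20.0), (3, 30.0), (3, 5.0), (4, 8.0);
""")

# Levels of the (hypothetical) time hierarchy, coarsest first.
HIERARCHY = ("cal_year", "cal_quarter", "cal_month")

def rollup(levels):
    """Sum sale_amt grouped by the given hierarchy levels."""
    if any(level not in HIERARCHY for level in levels):
        raise ValueError("unknown hierarchy level")
    cols = ", ".join(levels)
    sql = (
        f"SELECT {cols}, SUM(f.sale_amt) "
        f"FROM fact_sales f JOIN dim_date d ON d.date_id = f.date_id "
        f"GROUP BY {cols} ORDER BY {cols}"
    )
    return conn.execute(sql).fetchall()

year_level = rollup(HIERARCHY[:1])      # default view: by year
quarter_level = rollup(HIERARCHY[:2])   # drill down to quarter
month_level = rollup(HIERARCHY[:3])     # drill down to month

print(year_level)
print(quarter_level)
print(month_level)
```

Because year, quarter and month all sit in one denormalized dimension table, each drill-down step changes only the column list, never the joins.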

The screenshots here in the jPivot (an OLAP cube browser) documentation can give you a better idea about the display of data. The examples break down sales by product, by category, and by region.

The information is presented in such a fashion that it can be “drilled into” and “filtered on” to provide an easy to use interface to the underlying data. Graphical display of the data as pie, line or bar charts is possible.

Focusing on ROLAP.
This is the MySQL performance blog, and as such an in-depth discussion of MOLAP technology is not particularly warranted here. Our discussion will focus on Mondrian. Mondrian is an open source ROLAP server featuring an in-memory OLAP cache. Mondrian is part of the Pentaho open source business intelligence suite. Mondrian is also used by other projects such as Wabit and Jaspersoft. If you are using open source BI then you are probably already using Mondrian. Closed source ROLAP servers include Microstrategy, Microsoft Analysis Services and Oracle BI.

Mondrian speaks MDX, olap4j and XML for Analysis. This means that there is a very high chance that your existing BI tools (if you have them) will work with it. MDX is a query language that looks similar to SQL but is actually very different. Olap4j is an OLAP interface for Java applications. XML for Analysis (XMLA) is an industry standard analytical interface originally created by Microsoft, SAS and Hyperion.

What’s next?
Next we’ll talk about the difference between data marts and data warehouses. The former are usually used for OLAP analysis, but they can be fundamentally related to a warehouse.


Entry posted by Justin Swanhart


New OLAP Wikistat benchmark: Introduction and call for feedback

January 29th, 2010

I've seen that my posts on Ontime Air Traffic and the Star Schema Benchmark got a lot of interest.
However, the benchmarks by themselves did not cover all the cases I wanted, so I was thinking about a better scenario. The biggest problem is getting a real dataset that is big enough. My thanks go to Bradley C. Kuszmaul, who pointed me to the Wikipedia statistics on access to Wikipedia pages, and to Domas, who made the stats accessible. Link to the archives: http://dammit.lt/wikistats/archive/ or Domas's original announcement.

Although the table does not contain a wide variety of information, I think it is good enough to represent cases you can face in a Web application (log processing, page visits, clickstream, etc.).

I made some effort to normalize the data into a classic star schema and prepared queries that can be run on the proposed dataset (John Sichi, lead of LucidDB, helped me draft some of the queries).
You can see the details on our Percona Wikistat benchmark Wiki.

I have the following goals for the proposed benchmark:

  • Compare engines on OLAP queries for planning, predicting growth, analyzing access patterns to wiki pages, and drawing trends.
  • Compare engines on statistical queries for end users, which can be executed in real time, e.g. how many times a given page was accessed yesterday vs. today.
  • Understand the specific features and characteristics of each engine.
  • Compare throughput on simple queries (queries and scenario yet to be drafted).
  • Check the ability to load data and serve queries at the same time, i.e. availability during data load (queries and scenario yet to be drafted).

So in the proposed schema I have four tables:
pagestat (the fact table), and pages, datesinfo and projects (the dimension tables).

The dimension tables are supposed to be static and unchanged, and we can change the data size
by varying the number of months loaded into the fact table (so this is the scale factor).

EER diagram (made with MySQL Workbench)

In the current dataset, which you can download from an Amazon snapshot (name: “percona-wikistat”, ID: snap-a5f9bacc), we have:

  • Table pages: 724,550,811 rows, data size 40476M
  • Table datesinfo: 9624 rows, one entry represents 1 hour
  • Table projects: 2025 rows
  • Table pagestat:
    Data for 2009-06: 3,453,013,109 rows / size 68352M
    Data for 2009-07: 3,442,375,618 rows / size 68152M

So with two months of stats we have about 172GB of data with about 7 billion rows in the fact table.
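The totals can be checked with a few lines of arithmetic. This sketch assumes that the “M” sizes are mebibytes (1024 × 1024 bytes) and ignores the two tiny dimension tables, whose sizes are not given.

```python
# Sizes in "M" as listed above; treating M as MiB is an assumption.
size_mb = {
    "pages": 40476,
    "pagestat 2009-06": 68352,
    "pagestat 2009-07": 68152,
}
fact_rows = {
    "2009-06": 3_453_013_109,
    "2009-07": 3_442_375_618,
}

total_mb = sum(size_mb.values())
total_gb = total_mb / 1024

total_fact_rows = sum(fact_rows.values())
fact_bytes = (size_mb["pagestat 2009-06"] + size_mb["pagestat 2009-07"]) * 1024 * 1024
bytes_per_fact_row = fact_bytes / total_fact_rows

print(f"total size: {total_mb} M, about {total_gb:.1f} GB")
print(f"fact rows:  {total_fact_rows:,}")
print(f"average fact row: about {bytes_per_fact_row:.1f} bytes")
```

Under that assumption the total lands just under 173GB and the fact table just under 7 billion rows, consistent with the rounded figures in the text.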

Example query (again, the full list is on the Benchmark Wiki):

SQL:
  SELECT project, SUM(page_count) sm
    FROM pagestat
    JOIN datesinfo di ON (di.id = date_id)
    JOIN projects p ON (p.id = project_id)
   WHERE di.calmonth = 7 AND di.calyear = 2009
   GROUP BY project
   ORDER BY sm DESC
   LIMIT 20;
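To try the query without downloading the snapshot, a toy version of the schema can be built in memory. This is only a sketch: it uses Python's sqlite3 module rather than any of the engines under test, includes only the columns the query touches (the real tables, including the pages dimension, are described on the Benchmark Wiki), and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Reduced, hypothetical versions of the benchmark tables.
CREATE TABLE datesinfo (id INTEGER PRIMARY KEY, calyear INTEGER, calmonth INTEGER);
CREATE TABLE projects  (id INTEGER PRIMARY KEY, project TEXT);
CREATE TABLE pagestat  (date_id INTEGER, project_id INTEGER, page_count INTEGER);

INSERT INTO datesinfo VALUES (1, 2009, 6), (2, 2009, 7), (3, 2009, 7);
INSERT INTO projects  VALUES (1, 'en'), (2, 'de');
INSERT INTO pagestat  VALUES (1, 1, 100), (2, 1, 50), (3, 1, 25), (2, 2, 40), (3, 2, 10);
""")

# The example query from the post: top projects by page views in July 2009.
top_projects = conn.execute(
    """
    SELECT project, SUM(page_count) sm
      FROM pagestat
      JOIN datesinfo di ON (di.id = date_id)
      JOIN projects p ON (p.id = project_id)
     WHERE di.calmonth = 7 AND di.calyear = 2009
     GROUP BY project
     ORDER BY sm DESC
     LIMIT 20
    """
).fetchall()

for project, page_views in top_projects:
    print(project, page_views)
```

The June row is filtered out by the date dimension, so only the July counts are summed per project.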

I am going to load data and run queries against available engines:

  • MySQL MyISAM / InnoDB (to have reference results)
  • InfoBright
  • InfiniDB
  • MonetDB
  • LucidDB
  • Greenplum

and I will report my results (so stay with MySQLPerformanceBlog ;) )

I'd also like to test the Paraccel, Vertica and KickFire systems, but I do not have access to them.

I welcome your feedback on the benchmark, and what else you would like to see here.


Entry posted by Vadim


Star Schema Benchmark: InfoBright, InfiniDB and LucidDB

January 8th, 2010

In my previous rounds with data warehouse oriented engines I used a single table without joins, and a data size that is small by DW standards (see http://www.mysqlperformanceblog.com/2009/10/02/analyzing-air-traffic-performance-with-infobright-and-monetdb/, http://www.mysqlperformanceblog.com/2009/10/26/air-traffic-queries-in-luciddb/, http://www.mysqlperformanceblog.com/2009/11/02/air-traffic-queries-in-infinidb-early-alpha/). To address these issues, I took the Star Schema Benchmark, which is a modification of TPC-H, and tried to run its queries against InfoBright, InfiniDB, LucidDB and MonetDB. I did not get results for MonetDB; I will explain why later. Again, the primary goal of the test was not just to get numbers, but to understand the specifics of each engine and its ability to handle this amount of data and execute queries.

All the details I have are available on our Wiki at http://www.percona.com/docs/wiki/benchmark:ssb:start and you can get the benchmark specification at http://www.percona.com/docs/wiki/_media/benchmark:ssb:starschemab.pdf.

I generated data with scale factor = 1000, which gave me 610GB of data in raw format, and loaded it into each engine.

This is where the differences between the engines come into play. While InfoBright and InfiniDB do not need indexes at all (you actually can’t create indexes there), they are needed for LucidDB and MonetDB, and this changes the load time and the data size after load significantly. The numbers I put in the results do not include indexing time, but it should also be considered.

And indexes are exactly the reason why I could not get results for MonetDB: there I faced an issue I was not prepared for. MonetDB requires that an index fit into memory during the indexing procedure, and for 610GB of data the index may grow to 120GB. I did not have that amount of memory (the box has only 16GB of RAM). MonetDB experts recommended that I extend the swap partition to 128GB, but my partition layout was not prepared for that; I just did not expect to need a big swap partition.

Loading
So, load time.
InfiniDB can really utilize all available cores/CPUs in the system (I ran the benchmark on an 8-core box), and this allowed it to load data faster than the other engines. LucidDB and MonetDB also have multi-threaded loaders; only InfoBright ICE used a single core.

InfiniDB: 24 010 sec
MonetDB: 42 608 sec (without indexes)
InfoBright: 51 779 sec
LucidDB: 140 736 sec (without indexes)
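To put these load times in perspective, the short calculation below converts them to hours and to an effective load rate against the 610GB of raw data. It assumes 1GB = 1024MB and, as noted, leaves out index build time for MonetDB and LucidDB.

```python
RAW_GB = 610  # raw data size at scale factor 1000, from the post

# Load times in seconds, as listed above (without indexes where noted).
load_seconds = {
    "InfiniDB": 24010,
    "MonetDB": 42608,
    "InfoBright": 51779,
    "LucidDB": 140736,
}

load_hours = {engine: secs / 3600 for engine, secs in load_seconds.items()}
# Effective rate in MB/s; 1GB = 1024MB is an assumption.
load_mb_per_sec = {engine: RAW_GB * 1024 / secs for engine, secs in load_seconds.items()}

for engine in sorted(load_seconds, key=load_seconds.get):
    print(f"{engine:<10} {load_hours[engine]:5.1f} h  {load_mb_per_sec[engine]:5.1f} MB/s")
```

Even the fastest loader needs several hours at this scale, and the slowest one well over a day.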

I should note that the time to create indexes in LucidDB was also significant and exceeded the loading time. The full report on indexes is available here: http://www.percona.com/docs/wiki/benchmark:ssb:luciddb:start

Data size

Size after load is also an interesting factor. InfoBright is traditionally good at compression, though the compression rate here is lower than in the case of the AirTraffic table. I was told this is because the lineorder table does not come in sorted order, which is what one would expect in real life. Actually, I heard the same complaint from InfiniDB experts: if the lineorder data is put in sorted order, loading time can decrease significantly.

Datasize after load:
InfoBright: 112GB
LucidDB: 120GB (without indexes)
InfiniDB: 626GB
MonetDB: 650GB (without indexes)
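Dividing the 610GB of raw data by each size after load gives a rough compression ratio per engine. This is simple arithmetic on the numbers above; index sizes for LucidDB and MonetDB are not included.

```python
RAW_GB = 610  # raw data size, from the post

# Data size after load in GB, as listed above.
size_after_load_gb = {
    "InfoBright": 112,
    "LucidDB": 120,
    "InfiniDB": 626,
    "MonetDB": 650,
}

# Ratio above 1 means the engine stores the data in less space than the raw files.
compression_ratio = {
    engine: RAW_GB / size for engine, size in size_after_load_gb.items()
}

for engine, ratio in sorted(compression_ratio.items(), key=lambda kv: -kv[1]):
    print(f"{engine:<10} {ratio:4.2f}x")
```

InfoBright and LucidDB shrink the data roughly fivefold, while InfiniDB and MonetDB end up slightly larger than the raw input.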

Query times

Now on to query times.
You can find the full results on the page http://www.percona.com/docs/wiki/benchmark:ssb:start,
and the graph is below. Here are a couple of comments from me.

InfoBright was fully bound by a single CPU during all queries. I think the fact that the engine can use only a single CPU/core is becoming a significant limitation for them. For query 3.1 I got a surprising result: after 36 hours of work I got an error saying that the query can’t be resolved by the InfoBright optimizer and that I need to enable the MySQL optimizer.

InfiniDB, in contrast, was IO-bound; it processed data making full use of sequential reads, reading at 120MB/s. I think this allowed InfiniDB to get the best time in most queries.

LucidDB at this stage can also utilize only a single thread, with results sometimes better and sometimes worse than InfoBright.

Results:

Query | InfoBright | InfiniDB | LucidDB
Q1.1 | 48 min 21.67 sec (2901.67 sec) | 24 min 26.05 sec (1466.05 sec) | 3503.792 sec
Q1.2 | 44 min 55.37 sec (2695.37 sec) | 24 min 25.83 sec (1465.83 sec) | 2889.903 sec
Q1.3 | 45 min 53.49 sec (2753.49 sec) | 24 min 27.25 sec (1467.25 sec) | 2763.464 sec
Q2.1 | 1 hour 54 min 27.74 sec (6867.74 sec) | 19 min 44.35 sec (1184.35 sec) | 9694.534 sec
Q2.2 | 1 hour 13 min 33.15 sec (4413.15 sec) | 19 min 49.56 sec (1189.56 sec) | 9399.965 sec
Q2.3 | 1 hour 8 min 23.41 sec (4103.41 sec) | 19 min 52.27 sec (1192.25 sec) | 8875.349 sec
Q3.1 | NA | 19 min 11.23 sec (1151.23 sec) | 16376.93 sec
Q3.2 | 3 hours 30 min 17.64 sec (12617.64 sec) | 19 min 28.55 sec (1168.55 sec) | 5560.977 sec
Q3.3 | 2 hours 58 min 18.87 sec (10698.87 sec) | 19 min 58.29 sec (1198.29 sec) | 2517.621 sec
Q3.4 | 1 hour 41 min 41.29 sec (6101.29 sec) | 12 min 57.96 sec (777.96 sec) | 686.202 sec
Q4.1 | 8 hours 53 min 52.55 sec (32032.55 sec) | 32 min 57.49 sec (1977.49 sec) | 19843.213 sec
Q4.2 | 5 hours 38 min 7.60 sec / 5 hours 36 min 35.69 sec (20195.69 sec) | 33 min 35.45 sec (2015.45 sec) | 15292.648 sec
Q4.3 | 12 hours 58 min 4.27 sec (46684.27 sec) | 33 min 47.32 sec (2027.32 sec) | 7241.791 sec
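The comments above can be cross-checked against the results with a short script. The times below are the values in seconds transcribed from the results; None marks the InfoBright run of Q3.1 that did not complete, and for Q4.2 the InfoBright figure given in seconds is used.

```python
from collections import Counter

# Query times in seconds, transcribed from the results above.
results = {
    "Q1.1": {"InfoBright": 2901.67, "InfiniDB": 1466.05, "LucidDB": 3503.792},
    "Q1.2": {"InfoBright": 2695.37, "InfiniDB": 1465.83, "LucidDB": 2889.903},
    "Q1.3": {"InfoBright": 2753.49, "InfiniDB": 1467.25, "LucidDB": 2763.464},
    "Q2.1": {"InfoBright": 6867.74, "InfiniDB": 1184.35, "LucidDB": 9694.534},
    "Q2.2": {"InfoBright": 4413.15, "InfiniDB": 1189.56, "LucidDB": 9399.965},
    "Q2.3": {"InfoBright": 4103.41, "InfiniDB": 1192.25, "LucidDB": 8875.349},
    "Q3.1": {"InfoBright": None, "InfiniDB": 1151.23, "LucidDB": 16376.93},
    "Q3.2": {"InfoBright": 12617.64, "InfiniDB": 1168.55, "LucidDB": 5560.977},
    "Q3.3": {"InfoBright": 10698.87, "InfiniDB": 1198.29, "LucidDB": 2517.621},
    "Q3.4": {"InfoBright": 6101.29, "InfiniDB": 777.96, "LucidDB": 686.202},
    "Q4.1": {"InfoBright": 32032.55, "InfiniDB": 1977.49, "LucidDB": 19843.213},
    "Q4.2": {"InfoBright": 20195.69, "InfiniDB": 2015.45, "LucidDB": 15292.648},
    "Q4.3": {"InfoBright": 46684.27, "InfiniDB": 2027.32, "LucidDB": 7241.791},
}

# Fastest engine per query, ignoring runs that did not complete.
fastest = {
    query: min((secs, engine) for engine, secs in times.items() if secs is not None)[1]
    for query, times in results.items()
}
wins = Counter(fastest.values())

# Head-to-head between the two single-threaded engines, where both finished.
lucid_beats_infobright = sum(
    1
    for times in results.values()
    if times["InfoBright"] is not None and times["LucidDB"] < times["InfoBright"]
)

print(dict(wins))
print("LucidDB faster than InfoBright in", lucid_beats_infobright, "of 12 queries")
```

On these numbers InfiniDB is fastest on every query except Q3.4, and LucidDB and InfoBright split their twelve comparable queries evenly.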

Graph with results (time in sec, less time is better)

Conclusions

  • InfiniDB is doing just great, using the available CPU cores and the full IO bandwidth when reading from disk. You can see more details on InfiniDB scalability on InfiniDB’s blog: http://infinidb.org/infinidb-blog/mysql-parallel-query-processing-of-ssb-queries-via-infinidb-.html
  • The SSB benchmark may not be a good fit for InfoBright; the synthetic nature of the benchmark does not allow InfoBright to show better results. But I hope InfoBright will be able to use multiple cores / multiple disks soon.
  • I’d like MonetDB to be able to use disk to build indexes, rather than relying only on available memory.
  • Taking into account the complaints about SSB, I am looking to get another, more realistic dataset and to compare a bigger set of available DW solutions.

Star Schema Bechmark: InfoBright, InfiniDB and LucidDB

Январь 8th, 2010

In my previous rounds with data-warehouse-oriented engines I used a single table without joins, and a dataset that was small by DW standards (see http://www.mysqlperformanceblog.com/2009/10/02/analyzing-air-traffic-performance-with-infobright-and-monetdb/, http://www.mysqlperformanceblog.com/2009/10/26/air-traffic-queries-in-luciddb/, http://www.mysqlperformanceblog.com/2009/11/02/air-traffic-queries-in-infinidb-early-alpha/). To address these issues, I took the Star Schema Benchmark, which is a modification of TPC-H, and tried to run its queries against InfoBright, InfiniDB, LucidDB and MonetDB. I did not get results for MonetDB; I will explain why later. Again, the primary goal of the test was not just to get numbers, but to understand the specifics of each engine and its ability to handle this amount of data and execute the queries.

All the details I have are available on our Wiki http://www.percona.com/docs/wiki/benchmark:ssb:start and the benchmark specification (http://www.percona.com/docs/wiki/_media/benchmark:ssb:starschemab.pdf) is available there as well.

I generated data with scale factor = 1000, which gave me 610GB of data in raw format, and loaded it into each engine.

This is where the differences between the engines come into play. While InfoBright and InfiniDB do not need indexes at all (you actually can’t create indexes there), they are needed for LucidDB and MonetDB, and that changes the load time and the data size after load significantly. The numbers I put in the results do not include indexing time, but it should also be taken into account.

Indexes are exactly the reason why I could not get results for MonetDB: there I faced an issue I was not prepared for. MonetDB requires that the index fits into memory during the indexing procedure, and for 610GB of data the index may grow to 120GB, and I did not have that amount of memory (the box has only 16GB of RAM). MonetDB experts recommended that I extend the swap partition to 128GB, but my partition layout was not prepared for that; I just did not expect to need a big swap partition.

Loading
So, load time.
InfiniDB can really utilize all available cores/CPUs in the system (I ran the benchmark on an 8-core box), and that allowed it to load data faster than the other engines. LucidDB and MonetDB also have multi-threaded loaders; only InfoBright ICE used a single core.

InfiniDB: 24 010 sec
MonetDB: 42 608 sec (without indexes)
InfoBright: 51 779 sec
LucidDB: 140 736 sec (without indexes)

I should note that the time to create indexes in LucidDB was also significant and exceeded the loading time. The full report on indexes is available here http://www.percona.com/docs/wiki/benchmark:ssb:luciddb:start

Data size

Size after load is also an interesting factor. InfoBright is traditionally good with compression, though the compression rate is lower than in the case of the AirTraffic table. I was told this is because the lineorder table does not come in sorted order, which is what one would expect in real life. Actually, I heard the same complaint from InfiniDB experts: if the lineorder data is put in sorted order, loading time can decrease significantly.

Datasize after load:
InfoBright: 112GB
LucidDB: 120GB (without indexes)
InfiniDB: 626GB
MonetDB: 650GB (without indexes)
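
To put the load times and sizes above on a common footing, here is a small Python sketch that derives an approximate load rate and a raw-to-stored size ratio for each engine. It is only an illustration: it assumes 1GB = 1024MB, uses the 610GB raw size as the baseline, and the variable names are made up for this example. As with the numbers above, index build time and index size are not included.

```python
# Figures transcribed from the load time and data size lists above.
RAW_GB = 610  # raw data size at scale factor 1000

load_sec = {"InfiniDB": 24010, "MonetDB": 42608, "InfoBright": 51779, "LucidDB": 140736}
size_gb = {"InfoBright": 112, "LucidDB": 120, "InfiniDB": 626, "MonetDB": 650}

# Approximate load rate in MB/s (assumption: 1GB = 1024MB).
load_mb_per_sec = {engine: RAW_GB * 1024 / sec for engine, sec in load_sec.items()}

# Raw size divided by stored size; a value above 1 means the engine compressed the data.
raw_to_stored = {engine: RAW_GB / gb for engine, gb in size_gb.items()}

for engine in load_sec:
    print(f"{engine}: {load_mb_per_sec[engine]:.1f} MB/s, raw/stored = {raw_to_stored[engine]:.2f}")
```

A ratio below 1 means the loaded data takes more space than the raw files.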

Query time

Now on to query times.
You can find the full results at http://www.percona.com/docs/wiki/benchmark:ssb:start and the graph is below. Here are a couple of comments from me.

InfoBright was fully bound by a single CPU during all queries. I think the fact that the engine can use only a single CPU/core is becoming a significant limitation for it. For query 3.1 I got a surprising result: after 36h of work I got an error saying that the query can’t be resolved by the InfoBright optimizer and that I need to enable the MySQL optimizer.

InfiniDB, on the other hand, was IO-bound: it processed data fully utilizing sequential reads, reading at a speed of 120MB/s. I think that allowed InfiniDB to get the best time in most queries.

LucidDB at this stage can also utilize only a single thread, with results sometimes better and sometimes worse than InfoBright.

Results:

Query | InfoBright | InfiniDB | LucidDB
Q1.1 | 48 min 21.67 sec (2901.67 sec) | 24 min 26.05 sec (1466.05 sec) | 3503.792 sec
Q1.2 | 44 min 55.37 sec (2695.37 sec) | 24 min 25.83 sec (1465.83 sec) | 2889.903 sec
Q1.3 | 45 min 53.49 sec (2753.49 sec) | 24 min 27.25 sec (1467.25 sec) | 2763.464 sec
Q2.1 | 1 hour 54 min 27.74 sec (6867.74 sec) | 19 min 44.35 sec (1184.35 sec) | 9694.534 sec
Q2.2 | 1 hour 13 min 33.15 sec (4413.15 sec) | 19 min 49.56 sec (1189.56 sec) | 9399.965 sec
Q2.3 | 1 hour 8 min 23.41 sec (4103.41 sec) | 19 min 52.27 sec (1192.25 sec) | 8875.349 sec
Q3.1 | NA | 19 min 11.23 sec (1151.23 sec) | 16376.93 sec
Q3.2 | 3 hours 30 min 17.64 sec (12617.64 sec) | 19 min 28.55 sec (1168.55 sec) | 5560.977 sec
Q3.3 | 2 hours 58 min 18.87 sec (10698.87 sec) | 19 min 58.29 sec (1198.29 sec) | 2517.621 sec
Q3.4 | 1 hour 41 min 41.29 sec (6101.29 sec) | 12 min 57.96 sec (777.96 sec) | 686.202 sec
Q4.1 | 8 hours 53 min 52.55 sec (32032.55 sec) | 32 min 57.49 sec (1977.49 sec) | 19843.213 sec
Q4.2 | 5 hours 38 min 7.60 sec / 5 hours 36 min 35.69 sec (20195.69 sec) | 33 min 35.45 sec (2015.45 sec) | 15292.648 sec
Q4.3 | 12 hours 58 min 4.27 sec (46684.27 sec) | 33 min 47.32 sec (2027.32 sec) | 7241.791 sec
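
As a quick way to read the table, the Python sketch below picks the fastest engine for each query from the timings in seconds. The numbers are transcribed from the table above; the helper name `fastest` and the use of `None` for the query that did not finish are choices made for this example only. For Q4.2, where two InfoBright timings are listed, the value in parentheses is used.

```python
# Seconds per query as (InfoBright, InfiniDB, LucidDB); None = no result.
engines = ("InfoBright", "InfiniDB", "LucidDB")
results = {
    "Q1.1": (2901.67, 1466.05, 3503.792),
    "Q1.2": (2695.37, 1465.83, 2889.903),
    "Q1.3": (2753.49, 1467.25, 2763.464),
    "Q2.1": (6867.74, 1184.35, 9694.534),
    "Q2.2": (4413.15, 1189.56, 9399.965),
    "Q2.3": (4103.41, 1192.25, 8875.349),
    "Q3.1": (None, 1151.23, 16376.93),
    "Q3.2": (12617.64, 1168.55, 5560.977),
    "Q3.3": (10698.87, 1198.29, 2517.621),
    "Q3.4": (6101.29, 777.96, 686.202),
    "Q4.1": (32032.55, 1977.49, 19843.213),
    "Q4.2": (20195.69, 2015.45, 15292.648),
    "Q4.3": (46684.27, 2027.32, 7241.791),
}

def fastest(times):
    """Return the name of the engine with the smallest time, skipping missing results."""
    candidates = [(t, name) for t, name in zip(times, engines) if t is not None]
    return min(candidates)[1]

winners = {query: fastest(times) for query, times in results.items()}
wins = {name: list(winners.values()).count(name) for name in engines}
print(wins)
```

With these numbers InfiniDB is fastest on every query except Q3.4, where LucidDB is ahead.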

Graph with results (time in sec, less time is better)

Conclusions

  • InfiniDB is doing just great, using all available CPU cores and the full IO bandwidth when reading from disk. You can see more details on InfiniDB scalability on InfiniDB’s blog http://infinidb.org/infinidb-blog/mysql-parallel-query-processing-of-ssb-queries-via-infinidb-.html
  • The SSB benchmark may not be a good fit for InfoBright: the synthetic nature of the benchmark does not allow InfoBright to show better results. But I hope InfoBright will be able to use multiple cores / multiple disks soon.
  • I would like MonetDB to be able to use disk to build indexes, rather than relying only on available memory.
  • Taking into account the complaints about SSB, I am looking for another, more realistic dataset, and I want to compare a bigger set of available DW solutions.

Analyzing air traffic performance with InfoBright and MonetDB

October 3rd, 2009

By coincidence, Baron and I both played with InfoBright this week (see http://www.mysqlperformanceblog.com/2009/09/29/quick-comparison-of-myisam-infobright-and-monetdb/). Following Baron's example, I also ran the same load against MonetDB. After reading the comments on Baron's post I tried to load the same data into LucidDB, but I was not successful.

I wanted to analyze a bigger dataset, so I took the publicly available data at
http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time about USA domestic flights, with information about flight length and delays.

The data is available from 1988 to 2009 in monthly chunks, so I downloaded 252 files (for the years 1988-2008), each from 170MB to 300MB in size. In total the raw data is about 55GB. The average number of rows per chunk is 483762.46. The query for this (Q0) for InfoBright is:
select avg(c1) from (select year,month,count(*) as c1 from ontime group by YEAR,month) t;
and for MonetDB:
with t as (select yeard,monthd,count(*) as c1 from ontime group by YEARD,monthd) select AVG(c1) FROM t
It took 4.19 sec to execute on InfoBright and 29.9 sec on MonetDB, but this is almost the only case where MonetDB was significantly slower.

A few words about the environment: the server is a Dell SC1425 with 4GB of RAM and dual Intel(R) Xeon(TM) 3.40GHz CPUs.
InfoBright (ICE) version: 5.1.14-log build number (revision)=IB_3.2_GA_5316(ice)
MonetDB version: server v5.14.2, based on kernel v1.32.2
LucidDB version: 0.9.1

The table I loaded the data into is:

CREATE TABLE `ontime` (
  `Year` year(4) DEFAULT NULL,
  `Quarter` tinyint(4) DEFAULT NULL,
  `Month` tinyint(4) DEFAULT NULL,
  `DayofMonth` tinyint(4) DEFAULT NULL,
  `DayOfWeek` tinyint(4) DEFAULT NULL,
  `FlightDate` date DEFAULT NULL,
  `UniqueCarrier` char(7) DEFAULT NULL,
  `AirlineID` int(11) DEFAULT NULL,
  `Carrier` char(2) DEFAULT NULL,
  `TailNum` varchar(50) DEFAULT NULL,
  `FlightNum` varchar(10) DEFAULT NULL,
  `Origin` char(5) DEFAULT NULL,
  `OriginCityName` varchar(100) DEFAULT NULL,
  `OriginState` char(2) DEFAULT NULL,
  `OriginStateFips` varchar(10) DEFAULT NULL,
  `OriginStateName` varchar(100) DEFAULT NULL,
  `OriginWac` int(11) DEFAULT NULL,
  `Dest` char(5) DEFAULT NULL,
  `DestCityName` varchar(100) DEFAULT NULL,
  `DestState` char(2) DEFAULT NULL,
  `DestStateFips` varchar(10) DEFAULT NULL,
  `DestStateName` varchar(100) DEFAULT NULL,
  `DestWac` int(11) DEFAULT NULL,
  `CRSDepTime` int(11) DEFAULT NULL,
  `DepTime` int(11) DEFAULT NULL,
  `DepDelay` int(11) DEFAULT NULL,
  `DepDelayMinutes` int(11) DEFAULT NULL,
  `DepDel15` int(11) DEFAULT NULL,
  `DepartureDelayGroups` int(11) DEFAULT NULL,
  `DepTimeBlk` varchar(20) DEFAULT NULL,
  `TaxiOut` int(11) DEFAULT NULL,
  `WheelsOff` int(11) DEFAULT NULL,
  `WheelsOn` int(11) DEFAULT NULL,
  `TaxiIn` int(11) DEFAULT NULL,
  `CRSArrTime` int(11) DEFAULT NULL,
  `ArrTime` int(11) DEFAULT NULL,
  `ArrDelay` int(11) DEFAULT NULL,
  `ArrDelayMinutes` int(11) DEFAULT NULL,
  `ArrDel15` int(11) DEFAULT NULL,
  `ArrivalDelayGroups` int(11) DEFAULT NULL,
  `ArrTimeBlk` varchar(20) DEFAULT NULL,
  `Cancelled` tinyint(4) DEFAULT NULL,
  `CancellationCode` char(1) DEFAULT NULL,
  `Diverted` tinyint(4) DEFAULT NULL,
  `CRSElapsedTime` INT(11) DEFAULT NULL,
  `ActualElapsedTime` INT(11) DEFAULT NULL,
  `AirTime` INT(11) DEFAULT NULL,
  `Flights` INT(11) DEFAULT NULL,
  `Distance` INT(11) DEFAULT NULL,
  `DistanceGroup` TINYINT(4) DEFAULT NULL,
  `CarrierDelay` INT(11) DEFAULT NULL,
  `WeatherDelay` INT(11) DEFAULT NULL,
  `NASDelay` INT(11) DEFAULT NULL,
  `SecurityDelay` INT(11) DEFAULT NULL,
  `LateAircraftDelay` INT(11) DEFAULT NULL,
  `FirstDepTime` varchar(10) DEFAULT NULL,
  `TotalAddGTime` varchar(10) DEFAULT NULL,
  `LongestAddGTime` varchar(10) DEFAULT NULL,
  `DivAirportLandings` varchar(10) DEFAULT NULL,
  `DivReachedDest` varchar(10) DEFAULT NULL,
  `DivActualElapsedTime` varchar(10) DEFAULT NULL,
  `DivArrDelay` varchar(10) DEFAULT NULL,
  `DivDistance` varchar(10) DEFAULT NULL,
  `Div1Airport` varchar(10) DEFAULT NULL,
  `Div1WheelsOn` varchar(10) DEFAULT NULL,
  `Div1TotalGTime` varchar(10) DEFAULT NULL,
  `Div1LongestGTime` varchar(10) DEFAULT NULL,
  `Div1WheelsOff` varchar(10) DEFAULT NULL,
  `Div1TailNum` varchar(10) DEFAULT NULL,
  `Div2Airport` varchar(10) DEFAULT NULL,
  `Div2WheelsOn` varchar(10) DEFAULT NULL,
  `Div2TotalGTime` varchar(10) DEFAULT NULL,
  `Div2LongestGTime` varchar(10) DEFAULT NULL,
  `Div2WheelsOff` varchar(10) DEFAULT NULL,
  `Div2TailNum` varchar(10) DEFAULT NULL,
  `Div3Airport` varchar(10) DEFAULT NULL,
  `Div3WheelsOn` varchar(10) DEFAULT NULL,
  `Div3TotalGTime` varchar(10) DEFAULT NULL,
  `Div3LongestGTime` varchar(10) DEFAULT NULL,
  `Div3WheelsOff` varchar(10) DEFAULT NULL,
  `Div3TailNum` varchar(10) DEFAULT NULL,
  `Div4Airport` varchar(10) DEFAULT NULL,
  `Div4WheelsOn` varchar(10) DEFAULT NULL,
  `Div4TotalGTime` varchar(10) DEFAULT NULL,
  `Div4LongestGTime` varchar(10) DEFAULT NULL,
  `Div4WheelsOff` varchar(10) DEFAULT NULL,
  `Div4TailNum` varchar(10) DEFAULT NULL,
  `Div5Airport` varchar(10) DEFAULT NULL,
  `Div5WheelsOn` varchar(10) DEFAULT NULL,
  `Div5TotalGTime` varchar(10) DEFAULT NULL,
  `Div5LongestGTime` varchar(10) DEFAULT NULL,
  `Div5WheelsOff` varchar(10) DEFAULT NULL,
  `Div5TailNum` varchar(10) DEFAULT NULL
) ENGINE=BRIGHTHOUSE DEFAULT CHARSET=latin1;

The last fields, starting with "Div*", are not really used.

Load procedure:

InfoBright: the loader that comes with the ICE version is very limited, and I had to transform the files so that each field is quoted. After that, the load statement is:
mysql -S /tmp/mysql-ib.sock -e "LOAD DATA INFILE '/data/d1/AirData_ontime/${YEAR}_$i.txt.tr' INTO TABLE ontime FIELDS TERMINATED BY ',' ENCLOSED BY '\"'" ontime
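
The transformation step itself is not shown here, so the following is only a sketch of one way to quote every field, using Python's standard csv module. The function name `quote_all_fields` and the sample rows are hypothetical; the script actually used to produce the `.txt.tr` files may have worked differently.

```python
import csv
import io

def quote_all_fields(src, dst):
    """Copy CSV rows from src to dst, wrapping every field in double quotes."""
    reader = csv.reader(src)
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in reader:
        writer.writerow(row)

# Tiny in-memory sample (made-up rows) instead of a real monthly chunk.
raw = io.StringIO('2008,1,"WN",N712SW,335\n2008,1,"WN",,3231\n')
out = io.StringIO()
quote_all_fields(raw, out)
print(out.getvalue(), end="")
```

For real files, the same function can be called with two open file objects, one per monthly chunk.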

The load time was about 30s/chunk for the initial years and up to 48s/chunk for 2008. The total load time is 8836 sec (2.45h).

The size of the database after load is 1.6G, which is impressive and gives a 1:34 compression ratio.

MonetDB: It took some time to figure out how to load text data (I really wish the developers would improve the documentation), but finally I ended up with the following load statement:

/usr/local/monetdb/bin/mclient -lsql --database=ontime -t -s "COPY 700000 records INTO ontime FROM '/data/d1/AirData_ontime/${YEAR}_$i.txt' USING DELIMITERS ',','\n','\"' NULL AS '';"

Load time: 13065 sec (3.6h)

The database size after load is 65G, which is discouraging. It seems MonetDB does not use any compression, and the result is bigger than the original data.

LucidDB
Here it took time to find out how to execute a command from the command line using the included sqlline utility. I did not figure out how to do that, so I generated a big SQL file containing the load statements.

The load of each chunk was significantly slower, starting at about 60 sec/chunk for the initial years and constantly growing to 200 sec/chunk for 2000. At year 2004 (after about 5h of loading) the load failed for some reason, and I did not try to repeat it, as I would not fit in the timeframe I had allocated for this benchmark. Maybe I will try again sometime.

Query execution
So I really have data only for InfoBright and MonetDB; let's see how fast they are on different queries.

The first favorite query of any database benchmarker is SELECT count(*) FROM ontime;. Both InfoBright and MonetDB execute it immediately, with a result of 117023290 rows.

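Now some random queries I tried against both databases (where only one form is shown it is the MonetDB one, in which the year column is named YearD; in the InfoBright table it is Year):

-Q1: Count of flights per day of week for years 2000-2008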
SELECT DayOfWeek, count(*) AS c FROM ontime WHERE YearD BETWEEN 2000 AND 2008 GROUP BY DayOfWeek ORDER BY c DESC

with result:

[ 5, 7509643 ]
[ 1, 7478969 ]
[ 4, 7453687 ]
[ 3, 7412939 ]
[ 2, 7370368 ]
[ 7, 7095198 ]
[ 6, 6425690 ]

And it took 7.9s for MonetDB and 12.13s for InfoBright.

-Q2: Count of flights delayed by more than 10 min per day of week for years 2000-2008
SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND YearD BETWEEN 2000 AND 2008 GROUP BY DayOfWeek ORDER BY c DESC

Result:

[ 5, 1816486 ]
[ 4, 1665603 ]
[ 1, 1582109 ]
[ 7, 1555145 ]
[ 3, 1431248 ]
[ 2, 1348182 ]
[ 6, 1202457 ]

And 0.9s execution for MonetDB and 6.37s for InfoBright.

-Q3: Count of delays per airport for years 2000-2008
SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND YearD BETWEEN 2000 AND 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10

[ "ORD", 739286 ]
[ "ATL", 736736 ]
[ "DFW", 516957 ]
[ "PHX", 336360 ]
[ "LAX", 331997 ]
[ "LAS", 307677 ]
[ "DEN", 306594 ]
[ "EWR", 262007 ]
[ "IAH", 255789 ]
[ "DTW", 248005 ]

with 1.7s for MonetDB and 7.29s for InfoBright

-Q4: Count of delays per carrier for year 2007
SELECT carrier, count(*) FROM ontime WHERE DepDelay>10 AND YearD=2007 GROUP BY carrier ORDER BY 2 DESC

[ "WN", 296293 ]
[ "AA", 176203 ]
...

With 0.27s for MonetDB and 0.99sec for InfoBright

But it is obvious that the more flights a carrier has, the more delays it has, so to be fair, let's calculate
-Q5: Percentage of delays for each carrier for year 2007.
It is a bit trickier, as InfoBright and MonetDB need different queries:

MonetDB:
WITH t AS (SELECT carrier, count(*) AS c FROM ontime WHERE DepDelay>10 AND YearD=2007 GROUP BY carrier), t2 AS (SELECT carrier, count(*) AS c2 FROM ontime WHERE YearD=2007 GROUP BY carrier) SELECT t.carrier, c, c2, c*1000/c2 as c3 FROM t JOIN t2 ON (t.Carrier=t2.Carrier) ORDER BY c3 DESC

InfoBright:
SELECT t.carrier, c, c2, c*1000/c2 as c3 FROM (SELECT carrier, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year=2007 GROUP BY carrier) t JOIN (SELECT carrier, count(*) AS c2 FROM ontime WHERE Year=2007 GROUP BY carrier) t2 ON (t.Carrier=t2.Carrier) ORDER BY c3 DESC;

I am using c*1000/c2 here because MonetDB seems to use integer arithmetic, and with c/c2 I received just 1.

So result is:
[ "EV", 101796, 286234, 355 ]
[ "US", 135987, 485447, 280 ]
[ "AA", 176203, 633857, 277 ]
[ "MQ", 145630, 540494, 269 ]
[ "AS", 42830, 160185, 267 ]
[ "B6", 50740, 191450, 265 ]
[ "UA", 128174, 490002, 261 ]
...

with execution time: 0.5s for MonetDB and 2.92s for InfoBright.

Warning: do not try to EXPLAIN this query in InfoBright. MySQL is really stupid here, and EXPLAIN for this query took 6 min!

If you wonder about the carriers: EV is Atlantic Southeast Airlines and US is US Airways Inc.
35.5% of Atlantic Southeast Airlines flights were delayed by more than 10 mins!
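
The effect of integer arithmetic in Q5, and the workaround of multiplying by 1000 before dividing, can be reproduced on a toy table. The sketch below uses Python's built-in sqlite3 module purely as a stand-in (it is neither MonetDB nor InfoBright), and the handful of rows is invented for illustration. SQLite also divides integers with truncation, so it shows the same kind of problem, although the exact value MonetDB returned for c/c2 may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ontime (Carrier TEXT, YearD INTEGER, DepDelay INTEGER)")

# Invented data: AA has 10 flights (3 delayed > 10 min), WN has 4 flights (1 delayed).
rows = ([("AA", 2007, 25)] * 3 + [("AA", 2007, 0)] * 7 +
        [("WN", 2007, 15)] * 1 + [("WN", 2007, 5)] * 3)
conn.executemany("INSERT INTO ontime VALUES (?, ?, ?)", rows)

query = """
WITH t AS (SELECT Carrier, count(*) AS c FROM ontime
           WHERE DepDelay > 10 AND YearD = 2007 GROUP BY Carrier),
     t2 AS (SELECT Carrier, count(*) AS c2 FROM ontime
            WHERE YearD = 2007 GROUP BY Carrier)
SELECT t.Carrier, c, c2,
       c / c2 AS naive,             -- integer division truncates the fraction
       c * 1000 / c2 AS per_mille   -- scale first, then divide
FROM t JOIN t2 ON t.Carrier = t2.Carrier
ORDER BY per_mille DESC
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

The per_mille column plays the role of c3 in Q5: a value of 355 there corresponds to 35.5%.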

-Q6: Let's try the same query for a wider range of years, 2000-2008:
Result is:
[ "EV", 443798, 1621140, 273 ]
[ "AS", 299282, 1207960, 247 ]
[ "B6", 191250, 787113, 242 ]
[ "WN", 1885942, 7915940, 238 ]
[ "FL", 287815, 1220663, 235 ]
...

And execution time: 12.5s for MonetDB and 21.83s for InfoBright.

(AS is Alaska Airlines Inc. and B6 is JetBlue Airways)

-Q7: Percent of delayed (more than 10 mins) flights per year:

MonetDB:
with t as (select YEARD,count(*)*1000 as c1 from ontime WHERE DepDelay>10 GROUP BY YearD), t2 as (select YEARD,count(*) as c2 from ontime GROUP BY YEARD) select t.YEARD, c1/c2 FROM t JOIN t2 ON (t.YEARD=t2.YEARD)

InfoBright:
SELECT t.Year, c1/c2 FROM (select Year,count(*)*1000 as c1 from ontime WHERE DepDelay>10 GROUP BY Year) t JOIN (select Year,count(*) as c2 from ontime GROUP BY Year) t2 ON (t.Year=t2.Year)

with result:
[ 1988, 166 ]
[ 1989, 199 ]
[ 1990, 166 ]
[ 1991, 147 ]
[ 1992, 146 ]
[ 1993, 154 ]
[ 1994, 165 ]
[ 1995, 193 ]
[ 1996, 221 ]
[ 1997, 191 ]
[ 1998, 193 ]
[ 1999, 200 ]
[ 2000, 231 ]
[ 2002, 163 ]
[ 2003, 153 ]
[ 2004, 192 ]
[ 2005, 210 ]
[ 2006, 231 ]
[ 2007, 245 ]
[ 2008, 219 ]

The execution time is 27.9s for MonetDB and 8.59s for InfoBright.

It seems MonetDB does not like scanning a wide range of rows; the slowness here is similar to Q0.

-Q8: Finally, I tested the most popular destinations, in the sense of the count of directly connected cities, for different ranges of years.

SELECT DestCityName, COUNT( DISTINCT OriginCityName) FROM ontime WHERE Year BETWEEN N and M GROUP BY DestCityName ORDER BY 2 DESC LIMIT 10;

Years, InfoBright, MonetDB
1y, 5.88s, 0.55s
2y, 11.77s, 1.10s
3y, 17.61s, 1.69s
4y, 37.57s, 2.12s
10y, 79.77s, 29.14s
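
One way to read the Q8 timings is to divide each one by the number of years in the range, which gives a rough cost per year of data scanned. The Python sketch below does just that with the numbers from the table above; the name `cost_per_year` is made up for this example, and the per-year figure is only a crude indicator, since different years hold different numbers of rows.

```python
# (years in range, InfoBright seconds, MonetDB seconds), from the Q8 table above.
q8 = [(1, 5.88, 0.55), (2, 11.77, 1.10), (3, 17.61, 1.69), (4, 37.57, 2.12), (10, 79.77, 29.14)]

# Seconds spent per year of data, as (InfoBright, MonetDB), keyed by range length.
cost_per_year = {years: (ib / years, monet / years) for years, ib, monet in q8}

for years, (ib, monet) in cost_per_year.items():
    print(f"{years:>2}y: InfoBright {ib:.2f} s/year, MonetDB {monet:.2f} s/year")
```

For MonetDB the cost stays close to 0.55s per year for ranges of up to 4 years and rises to about 2.9s per year for the 10-year range, while InfoBright stays within roughly 5.9s to 9.4s per year.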

-Q9: And to prove that MonetDB does not like to scan many records, here is a query
select year,count(*) as c1 from ontime group by YEAR

which shows how many records there are per year
+------+---------+
| year | c1      |
+------+---------+
| 1989 | 5041200 |
| 1990 | 5270893 |
| 1991 | 5076925 |
| 1992 | 5092157 |
| 1993 | 5070501 |
| 1994 | 5180048 |
| 1995 | 5327435 |
| 1996 | 5351983 |
| 1997 | 5411843 |
| 1998 | 5384721 |
| 1999 | 5527884 |
| 2000 | 5683047 |
| 2001 | 5967780 |
| 2002 | 5271359 |
| 2003 | 6488540 |
| 2004 | 7129270 |
| 2005 | 7140596 |
| 2006 | 7141922 |
| 2007 | 7455458 |
| 2008 | 7009728 |
+------+---------+

And execution time: MonetDB: 6.3s and InfoBright: 0.31s
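
As a small cross-check, the per-year counts returned by Q9 can be added up and compared with the result of SELECT count(*) reported earlier. The Python sketch below simply transcribes the table; the variable names are illustrative.

```python
# Rows per year, transcribed from the Q9 result above.
rows_per_year = {
    1989: 5041200, 1990: 5270893, 1991: 5076925, 1992: 5092157, 1993: 5070501,
    1994: 5180048, 1995: 5327435, 1996: 5351983, 1997: 5411843, 1998: 5384721,
    1999: 5527884, 2000: 5683047, 2001: 5967780, 2002: 5271359, 2003: 6488540,
    2004: 7129270, 2005: 7140596, 2006: 7141922, 2007: 7455458, 2008: 7009728,
}

total_rows = sum(rows_per_year.values())
print(total_rows)
```

The sum comes to 117023290, the same number of rows that count(*) reported.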

To group all the results, here is a graph:

Conclusions:

  • This experiment was not really about an InfoBright vs MonetDB comparison. My goal was to check how well available open source software is able to handle this kind of task.
  • Although InfoBright was slower on many queries, I think it is more production-ready and stable. It has an Enterprise edition and support which you can buy. And the execution time is really good, taking into account the number of rows the engine had to crunch. For query Q8 (1-year range) a traditional transaction-oriented storage engine took 30 min to get the result.
  • I really like MonetDB. I do not know what the magic behind the curtain is (as with InfoBright, there are no indexes), but the results are impressive. On the drawbacks side: the command line is weak (I had to use bash and pass the query as a parameter, otherwise I was not able to edit a query or check history), and the documentation also needs improvement. The fact that it does not use compression may also be a showstopper; the space consumption is worrying. If these issues are addressed, I think MonetDB may have commercial success.
  • It is worth noting that MonetDB supports all INSERT / UPDATE / DELETE statements (and space is the price for that, as I understand), while InfoBright ICE edition allows you only LOAD DATA. InfoBright Enterprise allows INSERT / UPDATE, but that is also not meant for online transaction processing.
  • Compression in InfoBright is impressive. Even a smaller ratio of 1:10 means you can compress 1TB to 100GB, which is a significant saving of space.

I am open to running any other queries if you want to compare or get info about air traffic performance.


Entry posted by Vadim | No comment
