Archive for the ‘dw’ Category

Data Warehousing Best Practices: Comparing Oracle to MySQL, part 2 (partitioning)

Июль 30th, 2010

At Kscope this year, I attended a half day in-depth session entitled Data Warehousing Performance Best Practices, given by Maria Colgan of Oracle. My impression, which was confirmed by folks in the Oracle world, is that she knows her way around the Oracle optimizer.

See part 1 for the introduction and talking about power and hardware. This part will go over the 2nd “P”, partitioning. Learning about Oracle’s partitioning has gotten me more interested in how MySQL’s partitioning works, and I do hope that MySQL partitioning will develop to the level that Oracle partitioning does, because Oracle’s partitioning looks very nice (then again, that’s why it costs so much I guess).

Partition – Larger tables or fact tables can benefit from partitioning because it makes data load easier and can increase join performance and use data elimination. Parallel execution can be done with partitioning due to partition pruning. The degree of parallelism should be a power of 2, because of hash-based algorithm in hash partitioning. To translate this to the MySQL world, if you are using LINEAR HASH partitioning, then you should use a degree of parallelism that is a power of 2 (I checked, and indeed. Otherwise, use a degree of parallelism that makes sense given the number of partitions you have.

One important note that during Pythian’s testing of MySQL partitioning, we found that all partitions were locked when an INSERT occurs, for the duration of the INSERT. Bulk-loading with MySQL partitioning is not as fast as it would be if MySQL allowed partition pruning for INSERTs.

So, what should be partitioned? For the first level of partitioning, the goal is to enable partitioning pruning and simplify data management. The most typical partitioning is range or interval partitioning on a date column. Interval partitioning is you say what the partition is (date, month) and partition is automatically created. MySQL does not have interval partitioning, and I have seen typical first-level partitioning be range or list based on a date or timestamp column. Note that if you use a timestamp field, the partitioning expression is optimized if you use TO_DAYS(timestamp_field) or YEAR(timestamp_field). In my experience, using anything else (such as DATE(timestamp_field)) actually makes partitioning slower than not using partitioning at all. Note that this is based on tests I did a few months ago, and your mileage may vary.

So — how do you decide partitioning strategy? Ask yourself:

  • What range of data do the queries touch – a quarter, a year?
  • What is the data loading frequency?
  • Is an incremental load required?
  • How much data is involved, a day, a week, a month?

The answers to the above questions will tell you about how big your interval needs to be. The best scenario is that all answers are the same, “we load every day, and people query by day.” If the answers are different weight access a higher priority than loading, because most people care more about query performance than performance of ETL.

This is true even if your intervals have different sizes — ie sales per day are much bigger in Dec but that’s OK. However, Maria recommends that the subpartition be as evenly divided as possible.

Easier to look at more partitions than to look at a partition that’s too big. But you don’t want too many partitions, max Oracle allows partitions is 1 million partitions, prior to 11g it was 64,000. “Stick closer to 64,000 than 1 million”. MySQL’s limitation is 1024 per table.

For the second level of partitioning, also called subpartitioning, the goal is to allow for multi-level pruning and improve join performance. In Oracle, the most typical subpartition is hash or list – in MySQL, you can only subpartition by hash or key.

How do you decide subpartitioning strategy?

  • Select the dimension queried most frequently on the fact table OR
  • Pick the common join column

For example, if you want to look at sales per day, per store, you would choose “per day” as the partition and “per store” as the subpartition.

If you do not have a good partition on logical elements (like grouping), then you can subpartition using hash partitioning on common joins — perhaps surrogate keys, or using join key of the largest table involved in the join.

For example, if the sales table is partitioned and another big table is product, you can hash subpartition product_id.

Because there’s overhead in partitions (loading metadata, reading metadata), make sure size of partitions and subpartitions is >20 Mb. So better to have a 30 Mb subpartition than a 15 Mb subpartition. [I have no idea if this is true in MySQL or not -- I think the general concept is true, because there is some overhead, but I have no idea about the 20 Mb figure and why that's true for Oracle, nor do I know what is true in MySQL.]

One easy calculation is double the # of CPUs, round up to nearest power of 2. If you’re executing in parallel, Oracle will use 2x CPUs. (all this advice, by the way, follows 80/20 rule, this is probably good for about 80% of the environments out there). Of course, MySQL does not do parallel execution very well, so this probably does not apply.

Oracle knows it can get partition elimination while it does a join.

If 2 tables have the same degree of parallelism (same # of buckets) and are partitioned in the same way on the join column (say, customer_id in a subpartition of sales and a partition of customer), Oracle will match the partitions when joining:

sales table joined with customer table can change into 4 small joins:
sales sub part 1 joins with customer part 1
sales sub part 2 joins with customer part 2
sales sub part 3 joins with customer part 3
sales sub part 4 joins with customer part 4

And with parallelism, the total time is now reduced to the time it takes to do one of those smaller joins.

This is also why you want to have a power of 2 for buckets – because cores/processors come in powers of 2. Partition-wise joins like this can also be done with range or list, assuming both tables in the join have the same buckets.

I have no idea if MySQL partitioning works this way, but it’s certainly a functionality that makes sense to me.


PlanetMySQL Voting: Vote UP / Vote DOWN

Data Warehousing Best Practices: Comparing Oracle to MySQL, part 1 (introduction and power)

Июль 30th, 2010

At Kscope this year, I attended a half day in-depth session entitled Data Warehousing Performance Best Practices, given by Maria Colgan of Oracle. My impression, which was confirmed by folks in the Oracle world, is that she knows her way around the Oracle optimizer.

These are my notes from the session, which include comparisons of how Oracle works (which Maria gave) and how MySQL works (which I researched to figure out the difference, which is why this blog post took a month after the conference to write). Note that I am not an expert on data warehousing in either Oracle or MySQL, so these are more concepts to think about than hard-and-fast advice. In some places, I still have questions, and I am happy to have folks comment and contribute what they know.

One interesting point brought up:
Maria quoted someone (she said the name but I did not grab it) from Forrester saying, “3NF is typically a selfless model used by Enterprise data warehouse, which is used by the whole company. A star schema is a selfish model, used by a department, because it’s already got aggregation in it.”

I thought that was an interesting way of pointing that out — most people do not understand why 3NF is not good enough for data warehousing, and I have had a hard time explaining why a star or snowflake schema should be used. Another schema-related topic I had a hard time putting into words before this workshop was the difference between a star and a snowflake schema: compared to a star schema, in a snowflake schema, you have more than one fact table and maybe some dimensions that are not used often.

From Maria and the slides:
“Oracle says model what will suit your business best. Don’t get lost in academia. Most schemas are not 100% according to the theoretical
models. Some examples: 3NF schema with denormalized attributes to avoid costly joins, Star schema with multiple hierarchies in same fact table.”

Data warehousing has a 3-step approach —

1) data sources -> staging layer (temp loading layer)
2) staging layer (temp loading layer)-> foundation (logical, data store) layer
3) foundation (logical, data store) layer -> access and performance layer

The foundation layer is usually 3NF the access layer is usually a star or snowflake schema. As for the data sources, they can be varied, you would hope that they are in 3NF (and if they are you can skip the first 2 steps) but they are not always that way.

The 3 P’s of best practice for data warehousing (on Oracle) are power, partitioning, parallelism. The goal of the data warehousing environment is to minimize the amount of data accessed and use the most efficient joins – so it is not so index focused. This may be based on Oracle’s way of doing joins, I am not so sure if it applies to MySQL as well.

Power The weakest link in the chain (the 3 steps above) will define the throughput, so make sure your hardware configuration is balanced. Maria mentioned that as DBAs, “most of the time we don’t have control over this, but we’re still bound to the SLAs.”

This includes hardware that immediately comes to mind such as # of CPUs/cores, speed of CPU, amount of RAM, speed of disk as well as what we may not think of immediately: speed of network switches, speed of disk controllers, number and speed of host BUS adapters. Notes on host BUS adapters (HBAs): Know the # of HBA ports you have. 4 Gb HBA does 400 Mb/sec. 2 Gb HBA does 200 Mb/sec. Make sure there’s enough HBA capacity to sustain the CPU throughput (ie, make sure HBA isn’t the bottleneck). Also the speed at which it all talks. If you have a 4 Gb machine but a 2 Gb switch, you end up having 2 Gb throughput. Upgrade the network at the same time you upgrade machines.

Because we are talking about data warehousing, it is often not possible to eliminate disk I/O, so the goal is to have the fastest I/O throughput possible. Data warehouses need to be sized on I/O throughput not number of I/O’s.

I made a post earlier about how to determine I/O throughput for a system, which used information from this session. Justin Swanhart already pointed out that this is based on the fact that Oracle can do hash joins and MySQL can only do nested loop joins. I wonder, though, if there is indeed no case when using MySQL for which I/O throughput is a more useful metric than iops.

Disk arrays that are expensive are usually sized for iops, not throughput, and because they’re expensive the disk array is shared throughout the company. A DBA needs to ask ‘how many connections into the storage array do I have? How many disk controllers do I have? Where are my physical disks, and which controllers are they hanging off of?’

Typical 15k rpm disk can do about 25-35 Mb/sec (per disk) random i/o’s. Disk manufacturers will throw out numbers like 200-300 Mb/sec but that’s sequential I/O and leading edge of the drive. Make sure all your LUNs are not coming off the same set of disks, so that you’re not conflicting on disk seeks.

Continue to part 2, partitioning.


PlanetMySQL Voting: Vote UP / Vote DOWN

Intro to OLAP

Июль 12th, 2010

This is the first of a series of posts about business intelligence tools, particularly OLAP (or online analytical processing) tools using MySQL and other free open source software. OLAP tools are a part of the larger topic of business intelligence, a topic that has not had a lot of coverage on MPB. Because of this, I am going to start out talking about these topics in general, rather than getting right to gritty details of their performance.

I plan on covering the following topics:

  1. Introduction to OLAP and business intelligence. (this post)
  2. Identifying the differences between a data warehouse, and a data mart.
  3. Introduction to MDX queries and the kind of SQL which a ROLAP tool must generate to answer those queries.
  4. Performance challenges with larger databases, and some ways to help performance using aggregation.
  5. Using materialized views to automate that aggregation process.
  6. Comparing the performance of OLAP with and without aggregation over multiple MySQL storage engines at various data scales.

What is BI?
Chances are that you have heard the term business intelligence. Business intelligence (or BI) is a term which encompasses many different tools and methods for analyzing data, usually presenting it in a way that is easily consumed by upper management. This analysis is often used to determine how effectively the business has been at meeting certain performance goals, and to forecast how they will do in the future. To put it another way the tools are designed to provide insight about the business process, hence the name. Probably the most popular BI activity for web sites is click analysis.

As far as BI is concerned, this series of posts focuses on OLAP analysis and in a lesser sense, on data warehousing. Data warehouses often provide the information upon which OLAP analysis is performed, but more on this in post #2.

OLAP? What is that?
OLAP is an acronym which stands for online analytical processing. OLAP analysis, which is really just another name for multidimensional analysis, consists of displaying summary aggregations of the data broken down into different groups. A typical OLAP analysis might show “sale total, by year, by sales rep, by product category”. OLAP analysis is usually used for reporting on current data, looking at historical trends and trying to make predictions about future trends.

Multidimensional Analysis
Multidimensional analysis is a form of statistical analysis. In multidimensional analysis samples representing a particular measure are compared or broken down into different dimensions. For example, in a sales analysis, the “sale amount” is a measure. Measures are always aggregated values. That is, total sales might be expressed as SUM(sale_amt). This is because the SUM of the individual sales will be grouped along different dimensions, such as by year or by product. I’m getting a little ahead of myself. Before we talk about measures and dimensions, we should talk about the two ways in which this information can be stored.

There are two main ways to store multidimensional data for OLAP analysis
OLAP servers typically come in two basic flavors. Some servers have specialized data stores which store data in a form which is highly effective for multidimensional analysis. These servers are termed MOLAP and they tend to have exceptional performance due to their specialized data store. Almost all MOLAP solutions pre-compute many (or even all) of the possible answers to multi-dimensional queries. Palo is an example of an open source version of this technology. ESSbase is an example of closed source product. MOLAP servers often feature extensive compression of data which can improve performance. Loading data into a MOLAP server usually takes a very long time because many of the answers in the cube must be calculated. The extra time spent during the load is usually called “processing” time.

A relational OLAP (or ROLAP) server uses data stored in an RDBMS. These systems trade the performance of a multidimensional store for the convenience of an RDBMS. These servers almost always query over a database which is structured as a STAR or snowflake type schema. To go back to the sales analysis example above, in a STAR schema the facts about the sales would be stored in the fact table, and the list of customers and products would be stored in separate dimension tables. Some ROLAP servers support the aggregation of data into additional tables, and can use the tables automatically. These servers can approach the performance of MOLAP with the convenience of ROLAP, but there are still challenges with this approach. The biggest challenges are the amount of time that it takes to keep the tables updated and in the complexity of the many scripts or jobs which might be necessary to keep the tables in sync. Part five of my series will introduce materialized views which attempt to address these challenges in a manageable way.

What makes a ROLAP so great?
An OLAP server usually returns information to the user as a ‘pivot table‘ or ‘pivot report’. While you could create such a report in a spreadsheet, the ROLAP tool is designed to deal with millions or even billions of rows of data, much more than a spreadsheet can usually handle. MOLAP servers usually require that all, or almost all of the data must fit it memory. Another difference is the ease by which this analysis is constructed. You don’t necessarily have to write queries or drag and drop a report together in order to analyze multidimensional data using an OLAP tool.

Data before pivoting:
Example image from Wikimedia commons showing detail data for sales

Data summarized in pivot form:
Wikimedia commons image showing data summarized in pivot format

ROLAP tools use star schema
As I said before, a sale amount would be considered a measure, and it would usually be aggregated with SUM. The other information about the sale, such as the product, when it was sold and to whom it was sold would be defined in dimension tables. The fact table contains columns which are joined to the dimension tables, such as product_id and customer_id. These are often defined as foreign keys from the fact table to the dimension tables.

A note about degenerate dimensions:
Any values in the fact table that don’t join to dimensions are either considered degenerate dimensions or measures. In the example below the status of the order is a degenerate dimension. A degenerate dimension is stored as an ENUM in many cases. In the example below that there is no actual dimension table which includes the two different order statuses. Such a dimension would add an extra join, which is expensive. Any yes/no field and/or fields with a very low cardinality (such as gender or order status) will probably be stored in the fact table instead of in a dedicated dimension. In the “pivot data” example above, all the dimensions are degenerate: gender, region, style, date.

Star schema with degenerate dimension

Example star schema about sales.

Often a dimension will include redundant information to make reporting easier, a process called “denormalization”. Hierarchical information may be stored in a single dimension. For example, a dimension for products may include both the category AND a sub-category. A time dimension includes year, month and quarter. You can create multiple different hierarchies from a single dimension. This allows ‘drill down’ into the dimension. By default the data would be summarized by year, but you can drill down to quarter or month level aggregation.
Sample date hierarchy, showing quarter, month, year and day hierarchies.

The screenshots here in the jPivot (an OLAP cube browser) documentation can give you a better idea about the display of data. The examples break down sales by product, by category, and by region.

The information is presented in such a fashion that it can be “drilled into” and “filtered on” to provide an easy to use interface to the underlying data. Graphical display of the data as pie, line or bar charts is possible.

Focusing on ROLAP.
This is the MySQL performance blog, and as such an in depth discussion of MOLAP technology is not particularly warranted here. Our discussion will focus on Mondrian. Mondrian is an open source ROLAP server featuring an in-memory OLAP cache. Mondrian is part of the Pentaho open source business intelligence suite. Mondrian is also used by other projects such as Wabit and Jaspersoft. If you are using open source BI then you are probably already using Mondrian. Closed source ROLAP servers include Microstrategy, Microsoft Analysis Services and Oracle BI.

Mondrian speaks MDX, olap4j and XML for analysis. This means that there is a very high chance that your existing BI tools (if you have them) will work with it. MDX is a query language that looks similar to SQL but is actually very different. Olap4j is an OLAP interface for java applications. XML for analysis (XMLA) is an industry standard analytical interface originally created by Microsoft, SAS and Hyperion.

Whats next?
Next we’ll talk about the difference between data marts and data warehouses. The former are usually used for OLAP analysis, but they can be fundamentally related to a warehouse.


Entry posted by Justin Swanhart | No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks


PlanetMySQL Voting: Vote UP / Vote DOWN

Intro to OLAP

Июль 12th, 2010

This is the first of a series of posts about business intelligence tools, particularly OLAP (or online analytical processing) tools using MySQL and other free open source software. OLAP tools are a part of the larger topic of business intelligence, a topic that has not had a lot of coverage on MPB. Because of this, I am going to start out talking about these topics in general, rather than getting right to gritty details of their performance.

I plan on covering the following topics:

  1. Introduction to OLAP and business intelligence. (this post)
  2. Identifying the differences between a data warehouse, and a data mart.
  3. Introduction to MDX queries and the kind of SQL which a ROLAP tool must generate to answer those queries.
  4. Performance challenges with larger databases, and some ways to help performance using aggregation.
  5. Using materialized views to automate that aggregation process.
  6. Comparing the performance of OLAP with and without aggregation over multiple MySQL storage engines at various data scales.

What is BI?
Chances are that you have heard the term business intelligence. Business intelligence (or BI) is a term which encompasses many different tools and methods for analyzing data, usually presenting it in a way that is easily consumed by upper management. This analysis is often used to determine how effectively the business has been at meeting certain performance goals, and to forecast how they will do in the future. To put it another way the tools are designed to provide insight about the business process, hence the name. Probably the most popular BI activity for web sites is click analysis.

As far as BI is concerned, this series of posts focuses on OLAP analysis and in a lesser sense, on data warehousing. Data warehouses often provide the information upon which OLAP analysis is performed, but more on this in post #2.

OLAP? What is that?
OLAP is an acronym which stands for online analytical processing. OLAP analysis, which is really just another name for multidimensional analysis, consists of displaying summary aggregations of the data broken down into different groups. A typical OLAP analysis might show “sale total, by year, by sales rep, by product category”. OLAP analysis is usually used for reporting on current data, looking at historical trends and trying to make predictions about future trends.

Multidimensional Analysis
Multidimensional analysis is a form of statistical analysis. In multidimensional analysis samples representing a particular measure are compared or broken down into different dimensions. For example, in a sales analysis, the “sale amount” is a measure. Measures are always aggregated values. That is, total sales might be expressed as SUM(sale_amt). This is because the SUM of the individual sales will be grouped along different dimensions, such as by year or by product. I’m getting a little ahead of myself. Before we talk about measures and dimensions, we should talk about the two ways in which this information can be stored.

There are two main ways to store multidimensional data for OLAP analysis
OLAP servers typically come in two basic flavors. Some servers have specialized data stores which store data in a form which is highly effective for multidimensional analysis. These servers are termed MOLAP and they tend to have exceptional performance due to their specialized data store. Almost all MOLAP solutions pre-compute many (or even all) of the possible answers to multi-dimensional queries. Palo is an example of an open source version of this technology. ESSbase is an example of closed source product. MOLAP servers often feature extensive compression of data which can improve performance. Loading data into a MOLAP server usually takes a very long time because many of the answers in the cube must be calculated. The extra time spent during the load is usually called “processing” time.

A relational OLAP (or ROLAP) server uses data stored in an RDBMS. These systems trade the performance of a multidimensional store for the convenience of an RDBMS. These servers almost always query over a database which is structured as a STAR or snowflake type schema. To go back to the sales analysis example above, in a STAR schema the facts about the sales would be stored in the fact table, and the list of customers and products would be stored in separate dimension tables. Some ROLAP servers support the aggregation of data into additional tables, and can use the tables automatically. These servers can approach the performance of MOLAP with the convenience of ROLAP, but there are still challenges with this approach. The biggest challenges are the amount of time that it takes to keep the tables updated and in the complexity of the many scripts or jobs which might be necessary to keep the tables in sync. Part five of my series will introduce materialized views which attempt to address these challenges in a manageable way.

What makes a ROLAP so great?
An OLAP server usually returns information to the user as a ‘pivot table‘ or ‘pivot report’. While you could create such a report in a spreadsheet, the ROLAP tool is designed to deal with millions or even billions of rows of data, much more than a spreadsheet can usually handle. MOLAP servers usually require that all, or almost all of the data must fit it memory. Another difference is the ease by which this analysis is constructed. You don’t necessarily have to write queries or drag and drop a report together in order to analyze multidimensional data using an OLAP tool.

Data before pivoting:
Example image from Wikimedia commons showing detail data for sales

Data summarized in pivot form:
Wikimedia commons image showing data summarized in pivot format

ROLAP tools use star schema
As I said before, a sale amount would be considered a measure, and it would usually be aggregated with SUM. The other information about the sale, such as the product, when it was sold and to whom it was sold would be defined in dimension tables. The fact table contains columns which are joined to the dimension tables, such as product_id and customer_id. These are often defined as foreign keys from the fact table to the dimension tables.

A note about degenerate dimensions:
Any values in the fact table that don’t join to dimensions are either considered degenerate dimensions or measures. In the example below the status of the order is a degenerate dimension. A degenerate dimension is stored as an ENUM in many cases. In the example below that there is no actual dimension table which includes the two different order statuses. Such a dimension would add an extra join, which is expensive. Any yes/no field and/or fields with a very low cardinality (such as gender or order status) will probably be stored in the fact table instead of in a dedicated dimension. In the “pivot data” example above, all the dimensions are degenerate: gender, region, style, date.

Star schema with degenerate dimension

Example star schema about sales.

Often a dimension will include redundant information to make reporting easier, a process called “denormalization”. Hierarchical information may be stored in a single dimension. For example, a dimension for products may include both the category AND a sub-category. A time dimension includes year, month and quarter. You can create multiple different hierarchies from a single dimension. This allows ‘drill down’ into the dimension. By default the data would be summarized by year, but you can drill down to quarter or month level aggregation.
Sample date hierarchy, showing quarter, month, year and day hierarchies.

The screenshots here in the jPivot (an OLAP cube browser) documentation can give you a better idea about the display of data. The examples break down sales by product, by category, and by region.

The information is presented in such a fashion that it can be “drilled into” and “filtered on” to provide an easy to use interface to the underlying data. Graphical display of the data as pie, line or bar charts is possible.

Focusing on ROLAP.
This is the MySQL performance blog, and as such an in depth discussion of MOLAP technology is not particularly warranted here. Our discussion will focus on Mondrian. Mondrian is an open source ROLAP server featuring an in-memory OLAP cache. Mondrian is part of the Pentaho open source business intelligence suite. Mondrian is also used by other projects such as Wabit and Jaspersoft. If you are using open source BI then you are probably already using Mondrian. Closed source ROLAP servers include Microstrategy, Microsoft Analysis Services and Oracle BI.

Mondrian speaks MDX, olap4j and XML for analysis. This means that there is a very high chance that your existing BI tools (if you have them) will work with it. MDX is a query language that looks similar to SQL but is actually very different. Olap4j is an OLAP interface for java applications. XML for analysis (XMLA) is an industry standard analytical interface originally created by Microsoft, SAS and Hyperion.

Whats next?
Next we’ll talk about the difference between data marts and data warehouses. The former are usually used for OLAP analysis, but they can be fundamentally related to a warehouse.


Entry posted by Justin Swanhart | No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks


PlanetMySQL Voting: Vote UP / Vote DOWN

Log Buffer #182, a Carnival of the Vanities for DBAs

Март 12th, 2010

This is the 182nd edition of Log Buffer, the weekly review of database blogs. Make sure to read the whole edition so you do not miss where to submit your SQL limerick!

This week started out with me posting about International Women’s Day, and has me personally attending Confoo (Montreal) which is an excellent conference I hope to return to next year. I learned a lot from confoo, especially the blending nosql and sql session I attended.

This week was also the Hotsos Symposium. Doug’s Oracle Blog has a series of posts about Hotsos. If all this talk about conferences has gotten you excited, Joshua Drake notes that 14 days and the hotel is almost full for postgresql conference east which is March 25th-28th in Philadelphia. And the Oracle database insider notes that the Oracle OpenWorld call for papers is now open.

According to Susan Visser this week (ending tomorrow) is also read an e-book week. So if you have not already done so, read an e-book! She links a coupon for an e-book in the post.

Craig Mullins notes that the mainframe is a good career choice in Mainframes: The Safe IT Career Choice. He notes that the mainframe is still not dead:

People having been predicting the death of the mainframe since the advent of client/server in the late 1980s. That is more than 20 years! Think of all the things that have died in that timespan while the mainframe keeps on chugging away: IBM’s PC business, Circuit City, Koogle peanut butter, public pay phones, Johnny Cash… the list is endless.

In other career-related news, Antonio Cangiano is looking for [2] top-notch student hackers for a 16-month internship at IBM in Toronto starting in May. All the details, including how to apply, are in Cangiano’s blog post.

Willie Favero wants to know how you “solve the batch dilemma” for issues like “shrinking your batch window, designing your batch to play nicely with … OLTP” in how’s your batch workload doing? Perhaps Favero should read the updated batch best practices posted by Anthony Shorten.

Bryan Smith surveys a more personal question by asking if you go both ways and “manage both DB2 for Linux, UNIX, and Windows and DB2 for z/OS” in don’t ask, don’t tell, bi-platform DBAs. This week’s Log Buffer editor admits to being a tri-platform DBA — she has tried many platforms, and in fact, many databases (MySQL, Oracle, DB2, SQL Server, Sybase, Postgres and Ingres)!

Hari Prasanna Srinivasan promotes a patching survey in Oracle really wants to hear from you! Patching Survey.

Henrik Loeser explains what a deadlock and a hot spot are by using a real life analogy taken from a police report in deadlock and hot spot in real life.

Jamie Thomson asks why do you abbreviate schema names?. Shlomi Noach tries to solve the issue that “there is no consistent convention as for how to write [about table aliases in] an SQL query” in proper sql table alias use conventions. Noach also gives us a tip: faster than truncate.

Leons Petrazickis reminds us that “rulesets are chains” and it is important to have your rulesets in the proper order in iptables firewall pitfall.

Anyone interested in the history of MySQL AB will be informed after reading Dries Buytaert’s article.
Gavin Towey shares his software that helps centrally manage 120 MySQL servers in qsh.pl: distributed query tool For those who want to learn more about column-oriented databases, particularly in MySQL, Robin Schumacher of the InfiniDB blog announces that there is a MySQL University session recording on MySQL column databases now available. MySQL join-fu expert Jay Pipes has moved his blog to www.joinfu.com and starts with An SQL Puzzle and of course a follow up on the sql puzzle.

Ivan Zoratti is happy that finally, slides posted for the MySQL DW breakfast. Venu Anuganti gives you tips on one of the most common MySQL frustrations: optimizing subqueries in how to improve subqueries derived tables performance. Justin Swanhart posts the way in which he Gets Linux performance information from your MySQL database without shell access and emulates a ‘top’ CPU summary using /proc/stat and MySQL using the same method.

The Oracle Apps blog has an introduction to Oracle user productivity kit (UPK). Even though in this editor’s opinion the article is very sales-pitchy, it has valuable information, and does indeed live up to its promise:

UPK is a software tool that can capture all the steps in a system process. It records every keystroke, every click of the mouse, each menu option chosen and each button pressed. All this is done in the UPK Recorder by going through the transaction and pressing “printscreen” after every user action. From this, without any further effort from the developer, UPK builds a number of valuable outputs.

Allen White gives a great tip on how to optimize queries in keep your data clean.

Mike Dietrich reminds you to remove “old” parameters and events from your init.ora when upgrading, “as keeping them will definitely slow down the database performance in the new release.” He shows evidence of slowness when this is not done. Dietrich also shows how you can be gathering workload statistics “to give the optimizer some good knowledge about how powerful your IO-system might be”, especially “a few days after upgrading to the new release…while a real workload is running.”

Brian Aker shows the exciting features coming soon in Drizzle in Drizzle, Cherry, Roadmap for our Next Release.

Maybe you are thinking of migrating, not upgrading…..The O’Reilly Radar shows how to asses an Oracle to MySQL migration in MySQL migration and risk management. Actually, that article interviews Ronald Bradford on the subject — Bradford has been prolific lately, updating free my.cnf advice series and “Don’t Assume”: MySQL for the Oracle DBA series. Nick Quarmby also talks about migrating Oracle, but not to a new database, just to a new platform, in his primer on migrating Oracle Applications to new platforms. And the big news comes from Carlos of dataprix that Twitter will migrate from MySQL to Cassandra DB.

Paul S. Randal explains his way of benchmarking: 1 Tb table population on SQL Server.

Pete Finnigan shares his slides from a webinar on how to secure oracle, and Denis Pilipchuk shares his approaches for discovering security vulnerabilities in software applications.

Jeff Davis shares his thoughts about scalability and the relational model. Robert Treat responds actually, the relational model doesn’t scale and Baron Schwartz counters with NoSQL doesn’t mean non-relational.

Buck Woody explains “whenever you want to know something about SQL Server’s configuration, whether that’s the Instance itself or a database, you have a few options” — and of course what those options are — in system variables, stored procedures or functions for meta data.

This week’s T-SQL Tuesday topic was I/O. There are many links to great blog posts in the comments; three random posts I chose to highlight: Michael Zilberstein talks about IO capacity planning, while Kalen Delaney talks about using STATISTICS IO in I/O, you know, and Merrill Aldrich chimes in with information on real world SSD’s. Aldrich also begs folks not to waste resources and make more work for developers and DBAs in dear ISV, you’re keeping me awake nights with your VARCHAR() dates.

And we end with a bit of fin: Paul Nielsen wants us all to have a bit of fun; he has posted an SQL limerick and asks readers to create there own in there once was in Dublin a query.


PlanetMySQL Voting: Vote UP / Vote DOWN

Scalable Star Schema Benchmark (SSB) Join Metrics

Август 24th, 2009

We ran a quick scalability test of Calpont join behavior across using a Star Schema Benchmark data set at a scale factor of 1000. The Star Schema Benchmark transforms a TPC-H / DBT-3 data to a more standardized data warehouse star schema data model, and the 1000 scale factor includes 6 billion rows in the primary fact table. Information on the star schema bench (SSB) can be found at http://www.cs.umb.edu/~xuedchen/research/publications/DataWarehousePerformanceDissertationProposal.pdf .

-----------------------------------------------------------------------------------------------------------------------------------------
-- Note that these queries are run without any tuning or indices created for these joins or filters.
-- Basically, this is just 1) Create tables (without index or partition declarations).
-- 2) Load tables, fact table was loaded one month at a time.
-- 3) Run queries.
-- In addition, there is no expectation that rows in different tables are co-located.
-- Therefore, result should be similar across a wide variety of join and predicate cases.
-----------------------------------------------------------------------------------------------------------------------------------------
 





The x axis show additional servers being included, the y axis is elapsed time in seconds (log).

 


Base table cardinality is 30 million customers, 10 million suppliers, 20 million parts, 2556 days, and 6 billion lineorders (actually 5,999,989,709).

The system under test includes between 1 and 8 performance modules (scalable component of Calpont architecture), each with two quad core CPUs at 2.0 GHz and 16 GB memory. This is run with a shared disk infrastructure and single fiber channel to each performance module. The system could also be configured with direct attached/on-board disk, or with dual fiber channel.

Filters used for each table and cardinality returned are included here:

-- 30 million rows, find 360k (about 1.2%)
select count(*) from customer where c_city in ('INDONESIA1','INDONESIA2','INDONESIA3');
+----------+
| count(*) |
+----------+
| 360736 |
+----------+

-- 10 million rows, find 120k (about 1.2%)
select count(*) from supplier where s_city in ('INDONESIA1','INDONESIA2','INDONESIA3');
+----------+
| count(*) |
+----------+
| 119731 |
+----------+

-- 20 million rows, find 240k (about 12%)
select count(*) from part where p_mfgr in ('MFGR#1','MFGR#4')
and p_category not in ('MFGR#12','MFGR#11','MFGR#13','MFGR#14','MFGR#41','MFGR#42','MFGR#43');
+----------+
| count(*) |
+----------+
| 239495 |
+----------+

-- 2556 rows, find 365 (about 14%)
select count(*) from dateinfo where d_year = 1997;
+----------+
| count(*) |
+----------+
| 365 |
+----------+

-- 6 billions rows, find 910 million (about 15%)
select count(*) from lineorder where lo_orderdate between 19970101 and 19971231;
+-----------+
| count(*) |
+-----------+
| 910240598 |
+-----------+

This test measures a 5 table join operation, as well as variations that remove one table from this query, yielding 4 distinct flavors. The first query is run with an empty cache, and then re-run. The remaining queries are run immediately afterwards.

-- --------------------------------------------------------
-- 5 tables in join
-- 360k, 120k, 240k, dates joined with 900 million
-- --------------------------------------------------------
select d_yearmonthnum, s_city, c_city, p_mfgr, sum(lo_extendedprice), count(*)
from customer, supplier, part, dateinfo, lineorder
where c_custkey = lo_custkey and
s_suppkey = lo_suppkey and
p_partkey = lo_partkey and
d_datekey = lo_orderdate and
lo_orderdate between 19970101 and 19971231 and
c_city in ('INDONESIA1','INDONESIA2','INDONESIA3') and
s_city in ('INDONESIA1','INDONESIA2','INDONESIA3') and
p_mfgr in ('MFGR#1','MFGR#4') and
p_category in ('MFGR#12','MFGR#11','MFGR#43') and
d_year = 1997
group by 1,2,3,4
order by 1,2,3,4;




PlanetMySQL Voting: Vote UP / Vote DOWN