Archive for the ‘Kaleidoscope’ Category

Data Warehousing Best Practices: Comparing Oracle to MySQL, part 2 (partitioning)

Июль 30th, 2010

At Kscope this year, I attended a half day in-depth session entitled Data Warehousing Performance Best Practices, given by Maria Colgan of Oracle. My impression, which was confirmed by folks in the Oracle world, is that she knows her way around the Oracle optimizer.

See part 1 for the introduction and talking about power and hardware. This part will go over the 2nd “P”, partitioning. Learning about Oracle’s partitioning has gotten me more interested in how MySQL’s partitioning works, and I do hope that MySQL partitioning will develop to the level that Oracle partitioning does, because Oracle’s partitioning looks very nice (then again, that’s why it costs so much I guess).

Partition – Larger tables or fact tables can benefit from partitioning because it makes data load easier and can increase join performance and use data elimination. Parallel execution can be done with partitioning due to partition pruning. The degree of parallelism should be a power of 2, because of hash-based algorithm in hash partitioning. To translate this to the MySQL world, if you are using LINEAR HASH partitioning, then you should use a degree of parallelism that is a power of 2 (I checked, and indeed. Otherwise, use a degree of parallelism that makes sense given the number of partitions you have.

One important note that during Pythian’s testing of MySQL partitioning, we found that all partitions were locked when an INSERT occurs, for the duration of the INSERT. Bulk-loading with MySQL partitioning is not as fast as it would be if MySQL allowed partition pruning for INSERTs.

So, what should be partitioned? For the first level of partitioning, the goal is to enable partitioning pruning and simplify data management. The most typical partitioning is range or interval partitioning on a date column. Interval partitioning is you say what the partition is (date, month) and partition is automatically created. MySQL does not have interval partitioning, and I have seen typical first-level partitioning be range or list based on a date or timestamp column. Note that if you use a timestamp field, the partitioning expression is optimized if you use TO_DAYS(timestamp_field) or YEAR(timestamp_field). In my experience, using anything else (such as DATE(timestamp_field)) actually makes partitioning slower than not using partitioning at all. Note that this is based on tests I did a few months ago, and your mileage may vary.

So — how do you decide partitioning strategy? Ask yourself:

  • What range of data do the queries touch – a quarter, a year?
  • What is the data loading frequency?
  • Is an incremental load required?
  • How much data is involved, a day, a week, a month?

The answers to the above questions will tell you about how big your interval needs to be. The best scenario is that all answers are the same, “we load every day, and people query by day.” If the answers are different weight access a higher priority than loading, because most people care more about query performance than performance of ETL.

This is true even if your intervals have different sizes — ie sales per day are much bigger in Dec but that’s OK. However, Maria recommends that the subpartition be as evenly divided as possible.

Easier to look at more partitions than to look at a partition that’s too big. But you don’t want too many partitions, max Oracle allows partitions is 1 million partitions, prior to 11g it was 64,000. “Stick closer to 64,000 than 1 million”. MySQL’s limitation is 1024 per table.

For the second level of partitioning, also called subpartitioning, the goal is to allow for multi-level pruning and improve join performance. In Oracle, the most typical subpartition is hash or list – in MySQL, you can only subpartition by hash or key.

How do you decide subpartitioning strategy?

  • Select the dimension queried most frequently on the fact table OR
  • Pick the common join column

For example, if you want to look at sales per day, per store, you would choose “per day” as the partition and “per store” as the subpartition.

If you do not have a good partition on logical elements (like grouping), then you can subpartition using hash partitioning on common joins — perhaps surrogate keys, or using join key of the largest table involved in the join.

For example, if the sales table is partitioned and another big table is product, you can hash subpartition product_id.

Because there’s overhead in partitions (loading metadata, reading metadata), make sure size of partitions and subpartitions is >20 Mb. So better to have a 30 Mb subpartition than a 15 Mb subpartition. [I have no idea if this is true in MySQL or not -- I think the general concept is true, because there is some overhead, but I have no idea about the 20 Mb figure and why that's true for Oracle, nor do I know what is true in MySQL.]

One easy calculation is double the # of CPUs, round up to nearest power of 2. If you’re executing in parallel, Oracle will use 2x CPUs. (all this advice, by the way, follows 80/20 rule, this is probably good for about 80% of the environments out there). Of course, MySQL does not do parallel execution very well, so this probably does not apply.

Oracle knows it can get partition elimination while it does a join.

If 2 tables have the same degree of parallelism (same # of buckets) and are partitioned in the same way on the join column (say, customer_id in a subpartition of sales and a partition of customer), Oracle will match the partitions when joining:

sales table joined with customer table can change into 4 small joins:
sales sub part 1 joins with customer part 1
sales sub part 2 joins with customer part 2
sales sub part 3 joins with customer part 3
sales sub part 4 joins with customer part 4

And with parallelism, the total time is now reduced to the time it takes to do one of those smaller joins.

This is also why you want to have a power of 2 for buckets – because cores/processors come in powers of 2. Partition-wise joins like this can also be done with range or list, assuming both tables in the join have the same buckets.

I have no idea if MySQL partitioning works this way, but it’s certainly a functionality that makes sense to me.


PlanetMySQL Voting: Vote UP / Vote DOWN

Data Warehousing Best Practices: Comparing Oracle to MySQL, part 1 (introduction and power)

Июль 30th, 2010

At Kscope this year, I attended a half day in-depth session entitled Data Warehousing Performance Best Practices, given by Maria Colgan of Oracle. My impression, which was confirmed by folks in the Oracle world, is that she knows her way around the Oracle optimizer.

These are my notes from the session, which include comparisons of how Oracle works (which Maria gave) and how MySQL works (which I researched to figure out the difference, which is why this blog post took a month after the conference to write). Note that I am not an expert on data warehousing in either Oracle or MySQL, so these are more concepts to think about than hard-and-fast advice. In some places, I still have questions, and I am happy to have folks comment and contribute what they know.

One interesting point brought up:
Maria quoted someone (she said the name but I did not grab it) from Forrester saying, “3NF is typically a selfless model used by Enterprise data warehouse, which is used by the whole company. A star schema is a selfish model, used by a department, because it’s already got aggregation in it.”

I thought that was an interesting way of pointing that out — most people do not understand why 3NF is not good enough for data warehousing, and I have had a hard time explaining why a star or snowflake schema should be used. Another schema-related topic I had a hard time putting into words before this workshop was the difference between a star and a snowflake schema: compared to a star schema, in a snowflake schema, you have more than one fact table and maybe some dimensions that are not used often.

From Maria and the slides:
“Oracle says model what will suit your business best. Don’t get lost in academia. Most schemas are not 100% according to the theoretical
models. Some examples: 3NF schema with denormalized attributes to avoid costly joins, Star schema with multiple hierarchies in same fact table.”

Data warehousing has a 3-step approach —

1) data sources -> staging layer (temp loading layer)
2) staging layer (temp loading layer)-> foundation (logical, data store) layer
3) foundation (logical, data store) layer -> access and performance layer

The foundation layer is usually 3NF the access layer is usually a star or snowflake schema. As for the data sources, they can be varied, you would hope that they are in 3NF (and if they are you can skip the first 2 steps) but they are not always that way.

The 3 P’s of best practice for data warehousing (on Oracle) are power, partitioning, parallelism. The goal of the data warehousing environment is to minimize the amount of data accessed and use the most efficient joins – so it is not so index focused. This may be based on Oracle’s way of doing joins, I am not so sure if it applies to MySQL as well.

Power The weakest link in the chain (the 3 steps above) will define the throughput, so make sure your hardware configuration is balanced. Maria mentioned that as DBAs, “most of the time we don’t have control over this, but we’re still bound to the SLAs.”

This includes hardware that immediately comes to mind such as # of CPUs/cores, speed of CPU, amount of RAM, speed of disk as well as what we may not think of immediately: speed of network switches, speed of disk controllers, number and speed of host BUS adapters. Notes on host BUS adapters (HBAs): Know the # of HBA ports you have. 4 Gb HBA does 400 Mb/sec. 2 Gb HBA does 200 Mb/sec. Make sure there’s enough HBA capacity to sustain the CPU throughput (ie, make sure HBA isn’t the bottleneck). Also the speed at which it all talks. If you have a 4 Gb machine but a 2 Gb switch, you end up having 2 Gb throughput. Upgrade the network at the same time you upgrade machines.

Because we are talking about data warehousing, it is often not possible to eliminate disk I/O, so the goal is to have the fastest I/O throughput possible. Data warehouses need to be sized on I/O throughput not number of I/O’s.

I made a post earlier about how to determine I/O throughput for a system, which used information from this session. Justin Swanhart already pointed out that this is based on the fact that Oracle can do hash joins and MySQL can only do nested loop joins. I wonder, though, if there is indeed no case when using MySQL for which I/O throughput is a more useful metric than iops.

Disk arrays that are expensive are usually sized for iops, not throughput, and because they’re expensive the disk array is shared throughout the company. A DBA needs to ask ‘how many connections into the storage array do I have? How many disk controllers do I have? Where are my physical disks, and which controllers are they hanging off of?’

Typical 15k rpm disk can do about 25-35 Mb/sec (per disk) random i/o’s. Disk manufacturers will throw out numbers like 200-300 Mb/sec but that’s sequential I/O and leading edge of the drive. Make sure all your LUNs are not coming off the same set of disks, so that you’re not conflicting on disk seeks.

Continue to part 2, partitioning.


PlanetMySQL Voting: Vote UP / Vote DOWN

Determining I/O throughput for a system

Июль 29th, 2010

At Kscope this year, I attended a half day in-depth session entitled Data Warehousing Performance Best Practices, given by Maria Colgan of Oracle. In that session, there was a section on how to determine I/O throughput for a system, because in data warehousing I/O per second (iops) is less important than I/O throughput (how much actual data goes through, not just how many reads/writes).

The section contained an Oracle-specific in-database tool, and a standalone tool that can be used on many operating systems, regardless of whether or not a database exists:

If Oracle is installed, run DBMS_RESOURCE_MANAGER.CALIBRATE_IO:

SET SERVEROUTPUT ON
DECLARE
lat INTEGER;
iops INTEGER;
mbps INTEGER;
BEGIN
-- DBMS_RESOURCE_MANAGER.CALIBRATE_IO(<DISKS>, <MAX_LATENCY>,iops,mbps,lat);
DBMS_RESOURCE_MANAGER.CALIBRATE_IO (2, 10, iops, mbps, lat);
dbms_output.put_line('max_mbps = ' || mbps);
end;

For us MySQL folks, or even the Drizzle or NoSQL folks, Oracle offers a free standalone tool called Orion. The example given in the slides was:

./orion –run advanced –testname mytest –num_small 0 –size_large 1024 –type rand –simulate contact –write 0 –duration 60 –matrix column

-num_small is 0 because you don’t usually do small transactions in a dw.
-type rand for random I/O’s because data warehouse queries usually don’t do sequential reads
-write 0 – no writes, because you do not write often to the dw, that is what the ETL is for.
-duration is in seconds
-matrix column shows you how much you can sustain

I would be interested to see how other folks measure I/O throughput, and maybe even do a side-by-side comparison of different tools. Orion is available for:

Linux (x86, x86-64, Itanium, Power)
Solaris (SPARC64)
AIX (PPC64)
zLinux
HPUX (PA RISC, Itanium)
Windows

I am working on a larger write-up of the session itself, which had many concise descriptions of data warehousing issues, but I thought that this merited its own post.


PlanetMySQL Voting: Vote UP / Vote DOWN

Log Buffer #193 – A Carnival of The Vanities for DBAs

Июль 3rd, 2010

Welcome to Log Buffer, the weekly roundup of DBA industry happenings.

Read on for the latest updates in Log Buffer #193. Don’t forget, we’re always looking for volunteer editors to publish and host an issue of Log Buffer. If you’d like this to be you, contact the Log Buffer coordinator.

ODTUG/Kaleidoscope 2010 roundup:

Sheeri Cabral has posted slides and a summary of the first ever MySQL track at ODTUG/Kaleidoscope, citing a successful event.

Tim Hall, on Oracle-Base blog gives a daily report on the event summarizing his take on day 1 & 2, 3, 4, and 5.

Following the ODTUG/Kaleidoscope Oracle ACE Director’s briefing Alex Gorbachev continues the conversation on best practices with a blog post and encourages you to add your comments.

Marc Fielding, of Pythian’s consulting group suggests the following reads this week:

Pinal Dave who tries to clear up confustion and misconceptions about Statistics on SQL Blog Authority.

SQL Master noting a SQL Server 2008 Security gotcha, from a discussion he has on sql server security during interactions with the SASSUG user group.

On SQL Aloha, a post from Brad McGhee, who poses the question: Does your company have a SQL server DR plan? A quick poll he conducted provides scary results and he offers a list of resources for DBAs without a complete and tested DR plan in place.

Sramana Mitra, who writes about Oracle’s next move: data security, now that the Sun acquisition is complete.

OracleNerd outlining strategies for how to receive free passes to Oracle OpenWorld 2010.

Peter Finnigan talking about database forensics on his Oracle security blog, and the release of a new redo log mining tool to extract DDL from redo logs.

In other news, Willie Favero gives readers a heads up that the DB2 Catalog and directory are about to go through some major changes.

Guiseppe Maxia, The Data Charmer, learns a new programming language as he tries to integrate MySQL Sandbox and Cluster.

Dave Page on his Postgres blog, highlights what SQL injection attacks are and how DBAs can avoid them after another malicious attack targeting websites running Microsoft Internet Information Server (IIS).

Happy Canada Day for those north of the border, and for our friends to the south, Happy Independence Day and enjoy a great 4th of July weekend.


PlanetMySQL Voting: Vote UP / Vote DOWN

ODTUG Kscope Wrap-up and Slides

Июль 2nd, 2010

Ronald Bradford and I produced a successful MySQL track at Kaleidoscope (hereinafter referred to as Kscope). With a speaker list of Philip Antoniades, Josh Sled and Craig Sylvester of Oracle, Laine Campbell of PalominoDB, Patrick Galbraith of Northscale, Sarah Novotny of Blue Gecko, Padrig O’Sullivan of Akiba, Dossy Shiobara of Panoptic.com and Matt Yonkovic of Percona, we knew the technical content was going to be great.

As someone who’s helped organize all the OpenSQLCamps, a few MySQL Camps, and the Boston MySQL User Group, I know that participation at an event such as this can be small. Despite planning the MySQL track at the last minute, we had top-notch speakers with appropriate content for the audience, which was mostly Oracle crossovers. We had several registrants who came solely for the MySQL content, with all but 2 of the 27 sessions having 10-25 audience members. According to a few different folks, this is the same amount as the SOA and BPM track receives, and that track was not planned at the last minute. The ODTUG conference committee and board were happy with the turnout as well. I can’t wait to see the results of the evaluations!

As someone who’s sat on not-for-profit boards in the past as well as organized events while being on the board, I know how crazy it can be to plan conferences, and I also know that there are some organizations and personalities that are difficult to work with. I am happy to report that ODTUG has been very welcoming, accommodating, and hands-off, letting us do what we need. When I spoke with Edward Roske, the conference chair for Kscope 2011, he said, “You know what the needs for MySQL much better than I do, so just run the MySQL track as if you’re an executive of a corporation.” I know we are all worried about content being controlled by marketing folks, and I am very happy to report that ODTUG’s Kaleidoscope conference is of, by and for the people. There is one slot for a vendor presentation, and it is clearly marked, and there was an expo hall with over 20 booths, so there are opportunities for marketing; but all in all this is a technical conference.

Slides from presentations:
From Ronald:
Increasing MySQL Productivity from Design to Implementation (3-hour presentation)

MySQL idiosyncrasies that bite

From Matt Yonkovit:
The Five Minute DBA

From me:
Importing and Exporting Data with MySQL
What do you mean, SQL syntax error? – a 90-minute look into how MySQL’s SQL extends and deviates from the ANSI/ISO SQL:2003 standard.

Stored
Procedures & Functions and
Triggers
and Views.

I also gave Jay Pipes’ Join-fu: the Art of SQL part one and Join-fu: the Art of SQL part two as Jay was unable to attend at the last minute.

My huge thanks to all the speakers and all the attendees; the MySQL track at Kaleidoscope was a success! I am already imagining what we can do next year in Long Beach, CA at the end of June.


PlanetMySQL Voting: Vote UP / Vote DOWN

MySQL Idiosyncrasies That Bite

Июнь 28th, 2010

The following are my slides that I presented at ODTUG Kaleidoscope 2010. This presentation talks about the MySQL defaults including a non-transactional state, silent data truncations, date management, and transaction isolation options. These are all critical for data integrity and consistency. I also cover in-depth topics including SQL_MODE, character sets and collations.


PlanetMySQL Voting: Vote UP / Vote DOWN

MySQL’s SQL Deviations and Extensions

Июнь 28th, 2010

Today at Kaleidoscope I will be doing a 90-minute session comparing MySQL’s SQL syntax to the ANSI/ISO SQL:2003 standard, entitled What Do You Mean, “SQL Syntax Error”?

You can download the PDF slides now.

For those that may be following along the presentation later today (4 pm Eastern time), here are some links that I may throw out during the session:


PlanetMySQL Voting: Vote UP / Vote DOWN

ODTUG Kaleidoscope 2010 — Ready… Set… Go-o-o!

Июнь 25th, 2010

It is time… Time for one more very special conference for me. Why special?

  1. I have never been to ODTUG Kaleidoscope before. I always like new conferences — new experience.
  2. Unlike Sheeri, I do not speak! This is one of those rare conferences where I come to slack off, meet old friends and make new ones, go to lots of sessions and actually learn stuff.
  3. Coming from the DBA background, it’s not often that I come to development oriented conferences and I think I should do more of that.
  4. It’s in Washington, DC. I lived there for some time and have number of good friends there. I’m really excited to see them again!

So what am I going to do there? I just arrived and right in time for the Oracle ACE Directors’ briefing that will run for the whole days of Friday. This is a super secret meeting where Oracle’s super secret plans are shared. Nobody can talk about that after this meeting or their tongues are cut off on the spot. For those of you who didn’t realize I’m joking, the ACE Director’s briefing is where Oracle shares the roadmap of its products — some of it is long term strategy and some is about the upcoming releases. There are few things that we are asked not to share in public but, frankly, there is nothing really sensitive. One of the most interesting parts of the briefing are the Q&A moments when all kind of questions get asked (sometimes tough ones) and, to the most parts, gets answered.

It took me the same time to get to the hotel from Dulles International Airport as flying from Ottawa. Oh well, the beauty of DC traffic. Now I’m sitting at the back of the room enjoying the demo of the new APEX 4.0 and quietly having a bite. I see quite a few familiar faces (or rather backs) that I’m about to say hello to (pending the next break).

So what’s ahead? Yesterday, we all received a surprise invitation (well, surprise to me at least) to ACE Directors welcome reception organized by the ODTUG board members. Looking forward to that.

Saturday is the ODTUG Community Service Day where the volunteers help to beautify Ronald H. Brown Middle School in Washington, D.C.

Sunday is ODTUG Kaleidoscope 2010 Symposia and I’m still not sure whether I should select APEX or Performance, Scalability, and Security track. Sunday night is the night of Oracle ACE dinner and I think I managed to miss few ACE program dinners at the past few conferences so I’m really looking forward to this Sunday.

Monday evening, come for the Sundown Sessions which are open discussions moderated by Oracle ACE Directors. I’ll be hanging either in MySQL or Database Development halls. Immediately following that, on my agenda is Oracle ACE Directors Reception — this is your chance to mingle with the bunch of Oracle ACE Directors and other cool folks and abuse some complimentary food and drinks (thanks to OTN).


PlanetMySQL Voting: Vote UP / Vote DOWN

Next Week’s MySQL Sessions at ODTUG Kaleidoscope

Июнь 22nd, 2010

By now you know that there is a MySQL Track during next week’s ODTUG Kaleidoscope in Washington, DC. Ronald Bradford and I organized the schedule at the last minute (Ronald did a lot of the work!). It was difficult to fill a schedule with 19 sessions that are either 1 hour or 1.5 hours long, and to do it I ended up with three presentations.

At each presentation I will be giving away a copy of The MySQL Administrator’s Bible, so be sure to show up! All MySQL track sessions are in Maryland C, and all times are Eastern.

On Monday, June 28th from 4 pm – 5:30 pm I will be presenting “What do you mean, SQL Syntax Error?”, a presentation about how MySQL’s SQL syntax extends and deviates from the ANSI/ISO SQL:2003 standard. There is an 80-page PDF accompaniment that will be given out for free during this session.

On Tuesday, June 29th from 11 am to 12 noon I will be presenting Importing and Exporting Data with MySQL, about the many tools to load and bulk load data, and how to export data for regular and bulk loads. I will also be going over which storage engines are particularly well-suited for bulk loading, and the caveats to watch out for. This session is useful for those who know MySQL as well as those asking the question, “What’s the equivalent of Oracle’s SQL Loader for MySQL?”

On Wednesday, June 30th from 8:30 am to 9:30 am I will be presenting Navigating MySQL Stored Procedures & Functions, Views and Triggers, which covers all the ways stored procedures, stored functions, views and triggers can be used, including a highlight of Oracle differences.

I hope to see you there!


PlanetMySQL Voting: Vote UP / Vote DOWN

Still room at Kaleidoscope for MySQL attendees

Июнь 16th, 2010

Today I received notice that next week’s Velocity conference is at maximum capacity. With just under 2 weeks before the start of ODTUG Kaleidoscope in Washington DC we still have room for late registrations. There is 4 days of MySQL content, free events and also a Sunday Symposium that includes talks on performance and high availability.

Contact any of the MySQL speakers directly and you can receive a special 50% discount code. This is only for MySQL attendees.

If you live in the DC area and only want the FREE option then come along and join use on Monday night for a free session and reception.

ODTUG Kaleidoscope 2010
July 27 – July 1
Marriott Wardman Part Hotel
2660 Woodley Road NW
Washington, District Of Columbia 20008
www.odtugkaleidoscope.com

Conference highlights include

Community Service Day – Saturday, June 26, 8:00 a.m. – 1:00 p.m.
Join ODTUG volunteers and help refurbish a school in D.C.  Under the guidance of Greater DC Cares (GDCC), the leading and largest nonprofit coordinator of volunteerism in the D.C. region, ODTUGgers will: Sort books, beautify school grounds, and paint games on blacktop outside of hte school.

There is still time to sign up!  

Four Full-day Symposia – Sunday, June 27, 8:30 a.m. – 4:00 p.m.
Application Express; Oracle EPM and Essbase; Security, Scalability, and Performance; SOA and BPM. One-day registration available.

Welcome Reception/Battle of the Rock Bands – Sunday, June 27, 6:15 – 8:00 p.m.
Meet the exhibitors and compete in the “Battle of the Rock Bands.” Sign up to play.


Opening General Session – Monday, June 28, 8:30 – 10:00 a.m.
Awards for Best Technical Paper and Best 2009 Presentations
Keynote – “Future of the Internet and its Social Impact” by Lee Rainie, Director of the PEW Research Center’s Internet & American Life Project.
Sundown Sessions with Oracle ACE Directors – Monday, June 28, 5:45 – 6:45 p.m.
Reception to meet the Oracle ACE Directors immediately follows – 6:45 – 7:45 p.m.

Special Event – Wednesday, June 30, 6:30 – 10:00 p.m.
Featuring comedian John Heffron, 2nd season champion of the hit TV show, Last Comic Standing.
Music by live cover band, Right Foot Red


PlanetMySQL Voting: Vote UP / Vote DOWN