<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>PlanetMysql.ru - information about the MySQL DBMS &#187; infobright</title>
	<atom:link href="http://planetmysql.ru/category/infobright/feed/" rel="self" type="application/rss+xml" />
	<link>http://planetmysql.ru</link>
	<description>A blog about MySQL, the most popular DBMS</description>
	<lastBuildDate>Wed, 08 Feb 2012 21:24:00 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Shard-Query EC2 images available</title>
		<link>http://www.mysqlperformanceblog.com/2011/05/11/shard-query-ec2-images-available/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=shard-query-ec2-images-available</link>
		<comments>http://www.mysqlperformanceblog.com/2011/05/11/shard-query-ec2-images-available/#comments</comments>
		<pubDate>Thu, 12 May 2011 03:19:09 +0000</pubDate>
		<dc:creator>Justin Swanhart</dc:creator>
				<category><![CDATA[Cloud and NoSQL]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[infobright]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[shard-query]]></category>

		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=6389</guid>
		<description><![CDATA[Infobright and InnoDB AMI images are now available
There are now demonstration AMI images for Shard-Query.  Each image comes pre-loaded with the data used in the previous Shard-Query blog post.  The data in each image is split into 20 &#8220;shards&#8221;.  This blog post will refer to an EC2 instance as a node from here on out.  Shard-Query is very flexible in its configuration, so you can use this sample database to spread processing over up to 20 nodes.
The Infobright Community Edition (ICE) images are available in 32 and 64 bit varieties.  Due to memory requirements, the InnoDB versions are only available on 64 bit instances.  MySQL will fail to start on a micro instance; decrease the values in the /etc/my.cnf file if you really want to try micro instances.

Where to find the images


Amazon ID | Name | Arch | Notes
ami-20b74949 | shard-query-infobright-demo-64bit | x86_64 | ICE 3.5.2pl1. Requires m1.large or larger.
ami-8eb648e7 | shard-query-innodb-demo-64bit | x86_64 | Percona Server 5.5.11 with XtraDB. Requires m1.large or larger.
ami-f65ea19f | shard-query-infobright-demo | i686 | ICE 3.5.2pl1 32bit. Requires m1.small or greater.
snap-073b6e68 | shard-query-demo-data-flatfiles | 30GB ext3 EBS | An ext3 volume containing the flat files for the demos, in case you want to reload them into your favorite storage engine or database.

About the cluster
For best performance, there should be an even data distribution in the system.  To get an even distribution, the test data was hashed over the values in the date_id column.  There will be another blog post about the usage and performance of the splitter.  It is multi-threaded (actually multi-process) and is able to hash split up to 50GB/hour of input data on my i970 test machine.  It is possible to distribute splitting and/or loading among multiple nodes as well.  Note that in the demonstration each node will contain redundant but never-accessed data for all configurations of more than one node.  This would not be the case in normal circumstances.  The extra data will not impact performance because it is never accessed.
Since both InnoDB and ICE versions of the data are available, it is important to examine the differences in size.  This will give us some interesting information about how Shard-Query will perform on each database.  To do the size comparison, I used the du utility:

InnoDB file size on disk:  42GB (with indexes)

# du -sh *
203M    ibdata1
128M    ib_logfile0
128M    ib_logfile1
988K    mysql
2.1G    ontime1
2.1G    ontime10
2.1G    ontime11
2.1G    ontime12
2.1G    ontime13
2.1G    ontime14
2.1G    ontime15
2.1G    ontime16
2.1G    ontime17
2.1G    ontime18
2.1G    ontime19
2.1G    ontime2
2.1G    ontime20
2.1G    ontime3
2.1G    ontime4
2.1G    ontime5
2.1G    ontime6
2.1G    ontime7
2.1G    ontime8
2.1G    ontime9
212K    performance_schema
0       test

ICE size on disk: 2.5GB

# du -sh *
8.0K    bh.err
11M     BH_RSI_Repository
4.0K    brighthouse.ini
4.0K    brighthouse.log
4.0K    brighthouse.seq
964K    mysql
123M    ontime1
124M    ontime10
123M    ontime11
123M    ontime12
123M    ontime13
123M    ontime14
123M    ontime15
123M    ontime16
123M    ontime17
123M    ontime18
124M    ontime19
124M    ontime2
124M    ontime20
124M    ontime3
123M    ontime4
122M    ontime5
122M    ontime6
122M    ontime7
123M    ontime8
125M    ontime9

The InnoDB data directory size is 42GB, which is twice the original size of the input data.   The ICE schema was discussed in the comments of the last post.  ICE does not have any indexes (not even primary keys).  
Here is the complete InnoDB schema from one shard.  The schema is duplicated 20 times (but not the ontime_fact data):

DROP TABLE IF EXISTS `dim_airport`;
CREATE TABLE `dim_airport` (
  `airport_id` int(11) NOT NULL DEFAULT '0',
  `airport_code` char(3) DEFAULT NULL,
  `CityName` varchar(100) DEFAULT NULL,
  `State` char(2) DEFAULT NULL,
  `StateFips` varchar(10) DEFAULT NULL,
  `StateName` varchar(50) NOT NULL,
  `Wac` int(11) DEFAULT NULL,
  PRIMARY KEY (`airport_id`),
  KEY `CityName` (`CityName`),
  KEY `State` (`State`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='Data from BTS ontime flight data.  Data for Origin and Destination airport data.';

CREATE TABLE `dim_date` (
  `Year` year(4) DEFAULT NULL,
  `Quarter` tinyint(4) DEFAULT NULL,
  `Month` tinyint(4) DEFAULT NULL,
  `DayofMonth` tinyint(4) DEFAULT NULL,
  `DayOfWeek` tinyint(4) DEFAULT NULL,
  `FlightDate` date NOT NULL,
  `date_id` smallint(6) NOT NULL,
  PRIMARY KEY (`date_id`),
  KEY `FlightDate` (`FlightDate`),
  KEY `Year` (`Year`,`Quarter`,`Month`,`DayOfWeek`),
  KEY `Quarter` (`Quarter`,`Month`,`DayOfWeek`),
  KEY `Month` (`Month`,`DayOfWeek`),
  KEY `DayOfWeek` (`DayOfWeek`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='Contains the date information from the BTS ontime flight data.  Note dates may not be in date_id order';
/*!40101 SET character_set_client = @saved_cs_client */;

CREATE TABLE `dim_flight` (
  `UniqueCarrier` char(7) DEFAULT NULL,
  `AirlineID` int(11) DEFAULT NULL,
  `Carrier` char(2) DEFAULT NULL,
  `FlightNum` varchar(10) DEFAULT NULL,
  `flight_id` int(11) NOT NULL DEFAULT '0',
  `AirlineName` varchar(100) DEFAULT NULL,
  PRIMARY KEY (`flight_id`),
  KEY `UniqueCarrier` (`UniqueCarrier`,`AirlineID`,`Carrier`),
  KEY `AirlineID` (`AirlineID`,`Carrier`),
  KEY `Carrier` (`Carrier`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='Contains information on flights, and what airline offered those flights and the flight number of the flight.  Some data hand updated.';

--
-- Table structure for table `ontime_fact`
--

CREATE TABLE `ontime_fact` (
  `date_id` int(11) NOT NULL DEFAULT '0',
  `origin_airport_id` int(11) NOT NULL DEFAULT '0',
  `dest_airport_id` int(11) NOT NULL DEFAULT '0',
  `flight_id` int(11) NOT NULL DEFAULT '0',
  `TailNum` varchar(50) DEFAULT NULL,
  `CRSDepTime` int(11) DEFAULT NULL,
  `DepTime` int(11) DEFAULT NULL,
  `DepDelay` int(11) DEFAULT NULL,
  `DepDelayMinutes` int(11) DEFAULT NULL,
  `DepDel15` int(11) DEFAULT NULL,
  `DepartureDelayGroups` int(11) DEFAULT NULL,
  `DepTimeBlk` varchar(20) DEFAULT NULL,
  `TaxiOut` int(11) DEFAULT NULL,
  `WheelsOff` int(11) DEFAULT NULL,
  `WheelsOn` int(11) DEFAULT NULL,
  `TaxiIn` int(11) DEFAULT NULL,
  `CRSArrTime` int(11) DEFAULT NULL,
  `ArrTime` int(11) DEFAULT NULL,
  `ArrDelay` int(11) DEFAULT NULL,
  `ArrDelayMinutes` int(11) DEFAULT NULL,
  `ArrDel15` int(11) DEFAULT NULL,
  `ArrivalDelayGroups` int(11) DEFAULT NULL,
  `ArrTimeBlk` varchar(20) DEFAULT NULL,
  `Cancelled` tinyint(4) DEFAULT NULL,
  `CancellationCode` char(1) DEFAULT NULL,
  `Diverted` tinyint(4) DEFAULT NULL,
  `CRSElapsedTime` int(11) DEFAULT NULL,
  `ActualElapsedTime` int(11) DEFAULT NULL,
  `AirTime` int(11) DEFAULT NULL,
  `Flights` int(11) DEFAULT NULL,
  `Distance` int(11) DEFAULT NULL,
  `DistanceGroup` tinyint(4) DEFAULT NULL,
  `CarrierDelay` int(11) DEFAULT NULL,
  `WeatherDelay` int(11) DEFAULT NULL,
  `NASDelay` int(11) DEFAULT NULL,
  `SecurityDelay` int(11) DEFAULT NULL,
  `LateAircraftDelay` int(11) DEFAULT NULL,
  `FirstDepTime` varchar(10) DEFAULT NULL,
  `TotalAddGTime` varchar(10) DEFAULT NULL,
  `LongestAddGTime` varchar(10) DEFAULT NULL,
  `DivAirportLandings` varchar(10) DEFAULT NULL,
  `DivReachedDest` varchar(10) DEFAULT NULL,
  `DivActualElapsedTime` varchar(10) DEFAULT NULL,
  `DivArrDelay` varchar(10) DEFAULT NULL,
  `DivDistance` varchar(10) DEFAULT NULL,
  `Div1Airport` varchar(10) DEFAULT NULL,
  `Div1WheelsOn` varchar(10) DEFAULT NULL,
  `Div1TotalGTime` varchar(10) DEFAULT NULL,
  `Div1LongestGTime` varchar(10) DEFAULT NULL,
  `Div1WheelsOff` varchar(10) DEFAULT NULL,
  `Div1TailNum` varchar(10) DEFAULT NULL,
  `Div2Airport` varchar(10) DEFAULT NULL,
  `Div2WheelsOn` varchar(10) DEFAULT NULL,
  `Div2TotalGTime` varchar(10) DEFAULT NULL,
  `Div2LongestGTime` varchar(10) DEFAULT NULL,
  `Div2WheelsOff` varchar(10) DEFAULT NULL,
  `Div2TailNum` varchar(10) DEFAULT NULL,
  `Div3Airport` varchar(10) DEFAULT NULL,
  `Div3WheelsOn` varchar(10) DEFAULT NULL,
  `Div3TotalGTime` varchar(10) DEFAULT NULL,
  `Div3LongestGTime` varchar(10) DEFAULT NULL,
  `Div3WheelsOff` varchar(10) DEFAULT NULL,
  `Div3TailNum` varchar(10) DEFAULT NULL,
  `Div4Airport` varchar(10) DEFAULT NULL,
  `Div4WheelsOn` varchar(10) DEFAULT NULL,
  `Div4TotalGTime` varchar(10) DEFAULT NULL,
  `Div4LongestGTime` varchar(10) DEFAULT NULL,
  `Div4WheelsOff` varchar(10) DEFAULT NULL,
  `Div4TailNum` varchar(10) DEFAULT NULL,
  `Div5Airport` varchar(10) DEFAULT NULL,
  `Div5WheelsOn` varchar(10) DEFAULT NULL,
  `Div5TotalGTime` varchar(10) DEFAULT NULL,
  `Div5LongestGTime` varchar(10) DEFAULT NULL,
  `Div5WheelsOff` varchar(10) DEFAULT NULL,
  `Div5TailNum` varchar(10) DEFAULT NULL,
  KEY `date_id` (`date_id`),
  KEY `flight_id` (`flight_id`),
  KEY `origin_airport_id` (`origin_airport_id`),
  KEY `dest_airport_id` (`dest_airport_id`),
  KEY `DepDelay` (`DepDelay`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='Contains all avaialble data from 1988 to 2010';

mysql&#62; use ontime1;
Database changed

mysql&#62; show table status like 'ontime_fact'\G
*************************** 1. row ***************************
           Name: ontime_fact
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 6697533
 Avg_row_length: 241
    Data_length: 1616904192
Max_data_length: 0
   Index_length: 539279360
      Data_free: 4194304
 Auto_increment: NULL
    Create_time: 2011-05-10 04:26:14
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment: Contains all avaialble data from 1988 to 2010
1 row in set (0.00 sec)

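With ICE, there is only 2.5GB of data after compression.  That is roughly a 16:1 reduction relative to the 42GB InnoDB data directory (about 8:1 relative to the raw input), which is quite nice.  Each shard holds only about 123MB of data (122M to 125M in the du listing)!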
Storage engine makes a big difference
In general, a column store performs about 8x-10x better than a row store for queries which access a significant amount of data.  One big reason for this is the excellent compression that RLE techniques provide.
I have not loaded InnoDB compressed tables yet, but since InnoDB compression is not RLE, I doubt it will have the same impact.
For large datasets, effective compression means fewer nodes are needed to keep the data entirely in memory.  This frees disk for the on-disk temporary storage used by hash joins and other background operations.  This will have a direct impact on our query response times and throughput.
Setting up a cluster using the AMI images
You can easily test Shard-Query for yourself.  Spin up the desired number of EC2 instances using one of the AMI images.  For best results, spin up a number of instances that divides evenly into 20.  A helper utility (included in the image) configures the cluster from a copy of the text on the EC2 console page.  To use it, ensure:

That only the instances that you want to use are shown in the EC2 console.
That the &#8220;private ip&#8221; field is selected in the list of columns to show (click show/hide to change the columns)
That the &#8220;public dns&#8221; field is selected

SSH to the public DNS entry of the first node on the list of nodes.  This node will become &#8220;shard1&#8221;.
Now, in the EC2 console, press CTRL-A to select all text on the page and then CTRL-C to copy it.  Paste this into a text file on shard1 called &#8220;/tmp/servers.txt&#8221; and run the following commands:

$ cat servers.txt &#124; grep "10\."&#124; grep -v internal &#124;tee hosts.internal
[host list omitted]

Now you need to set up the hosts file:

sudo su -
# cat hosts.internal &#124; ~ec2-user/tools/mkhosts &#62;&#62; /etc/hosts

# ping shard20
PING shard20 (10.126.15.34) 56(84) bytes of data.
64 bytes from shard20 (10.126.15.34): icmp_seq=1 ttl=61 time=0.637 ms
...

Note: There is no need to put that hosts file on your other nodes unless you want to run workers on them.
Generate a cluster configuration
There is a script provided to generate the shards.ini file for testing a cluster of 1 to 20 nodes.

cd shard-query

#generate a config for 20 shards (adjust to your number of nodes)
php genconfig 20 &#62; shards.ini

Running the test
For best performance, you should run the workers on one or two nodes.  You should start two workers per core in the cluster.
First start gearmand:

gearmand -p 7000 -d

Then start the workers on node 1 (assuming a 20 node cluster):

cd shard-query
./start_workers 80

I normally start (2 * TOTAL_CLUSTER_CORES) workers.  That is, if you have 20 machines, each with 2 cores, run 80 workers.
Test the system.  You should see the following row count (the first number is wall time, the second exec time, the third parse time).

$ echo "select count(*) from ontime_fact;" &#124; ./run_query

Array
(
    [count(*)] =&#62; 135125787
)
1 rows returned (0.084244966506958s, 0.078309059143066s, 0.0059359073638916s)

Execute the test:
As seen above, the run_query script will run one or more semicolon-terminated SQL statements.  The queries for the benchmark are in ~ec2-user/shard-query/queries.sql.
I have also provided a convenient script, called pivot_results, which summarizes the output of the ./run_query command:

cd shard-query/
$ ./run_query &#60; queries.sql &#124; tee raw &#124;./pivot_results &#38;
[1] 12359
$ tail -f ./raw
-- Q1
...

At the end, you will get a result output that is easy to graph in a spreadsheet:

$ cat raw &#124; ./pivot_results
Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8.0,Q8.1,Q8.2,Q8.3,Q8.4,Q9,Q10,Q11
34.354,60.978,114.175,27.138,45.751,14.905,14.732,34.946,126.599,250.222,529.287,581.295,11.042,63.366,14.573

InnoDB my.cnf

[client]
port=3306
socket=/tmp/mysql-inno.sock

[mysqld]
socket=/tmp/mysql-inno.sock
default-storage-engine=INNODB
innodb-buffer-pool-instances=2
innodb-buffer-pool-size=5600M
innodb-file-format=barracuda
innodb-file-per-table
innodb-flush-log-at-trx-commit=1
innodb-flush-method=O_DIRECT
innodb-ibuf-active-contract=1
innodb-import-table-from-xtrabackup=1
innodb-io-capacity=1000
innodb-log-buffer-size=32M
innodb-log-file-size=128M
innodb-open-files=1000
innodb_fast_checksum
innodb-purge-threads=1
innodb-read-ahead=linear
innodb-read-ahead-threshold=8
innodb-read-io-threads=16
innodb-recovery-stats
innodb-recovery-update-relay-log
innodb-replication-delay=#
innodb-rollback-on-timeout
innodb-rollback-segments=16
innodb-stats-auto-update=0
innodb-stats-on-metadata=0
innodb-stats-sample-pages=256
innodb-stats-update-need-lock=0
innodb-status-file
innodb-strict-mode
innodb-thread-concurrency=0
innodb-thread-concurrency-timer-based
innodb-thread-sleep-delay=0
innodb-use-sys-stats-table
innodb-write-io-threads=4
join-buffer-size=16M
key-buffer-size=64M
local-infile=on
lock-wait-timeout=300
log-error=/var/log/mysqld-innodb.log
max-allowed-packet=1M
net-buffer-length=16K
#we value throughput over response time, get a good plan
optimizer-prune-level=0
partition=ON
port=3306
read-buffer-size=512K
read-rnd-buffer-size=1M
skip-host-cache
skip-name-resolve
sort-buffer-size=512K
sql-mode=STRICT_TRANS_TABLES
symbolic-links
table-definition-cache=16384
table-open-cache=128
thread-cache-size=32
thread-stack=256K
tmp-table-size=64M
transaction-isolation=READ-COMMITTED
user=mysql
wait-timeout=86400

To be continued
You can now set up a cluster from 1 to 20 nodes for testing.  This way you can verify the numbers in my next blog post. I will compare performance of various cluster sizes on both storage engines.]]></description>
			<content:encoded><![CDATA[<h3>Infobright and InnoDB AMI images are now available</h3>
<p>There are now demonstration AMI images for Shard-Query.  Each image comes pre-loaded with the data used in the previous Shard-Query <a href="http://www.mysqlperformanceblog/scale-out-mysql">blog post</a>.  The data in each image is split into 20 &#8220;shards&#8221;.  This blog post will refer to an EC2 instance as a <i>node</i> from here on out.  Shard-Query is very flexible in its configuration, so you can use this sample database to spread processing over up to 20 nodes.</p>
<p>The Infobright Community Edition (ICE) images are available in 32 and 64 bit varieties.  Due to memory requirements, the InnoDB versions are only available on 64 bit instances.  MySQL will fail to start on a micro instance; decrease the values in the /etc/my.cnf file if you really want to try micro instances.</p>
<h3>Where to find the images</h3>
<table border=1>
<tr>
<th>Amazon ID</th>
<th>Name</th>
<th>Arch</th>
<th>Notes</th>
</tr>
<tr>
<td>ami-20b74949</td>
<td>shard-query-infobright-demo-64bit</td>
<td>x86_64</td>
<td valign=top>ICE 3.5.2pl1. Requires m1.large or larger.</td>
</tr>
<tr>
<td>ami-8eb648e7</td>
<td>shard-query-innodb-demo-64bit</td>
<td>x86_64</td>
<td valign=top>Percona Server 5.5.11 with XtraDB.  Requires m1.large or larger.</td>
</tr>
<tr>
<td>ami-f65ea19f</td>
<td>shard-query-infobright-demo</td>
<td>i686</td>
<td valign=top>ICE 3.5.2pl1 32bit.  Requires m1.small or greater.</td>
</tr>
<tr>
<td>snap-073b6e68</td>
<td>shard-query-demo-data-flatfiles</td>
<td>30GB ext3 EBS</td>
<td valign=top>An ext3 volume containing the flat files for the demos, in case you want to reload them into your favorite storage engine or database.</td>
</tr>
</table>
<h3>About the cluster</h3>
<p>For best performance, there should be an even data distribution in the system.  To get an even distribution, the test data was hashed over the values in the date_id column.  There will be another blog post about the usage and performance of the splitter.  It is multi-threaded (actually multi-process) and is able to hash split up to 50GB/hour of input data on my i970 test machine.  It is possible to distribute splitting and/or loading among multiple nodes as well.  Note that in the demonstration each node will contain redundant but never-accessed data for all configurations of more than one node.  This would not be the case in normal circumstances.  The extra data will not impact performance because it is never accessed.</p>
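<p>To illustrate hashing on date_id, here is a minimal Python sketch.  It assumes a plain modulo hash, which is not necessarily what the splitter does, and the date_id values are made up; only the shard count of 20 and the ontimeN database names come from this post.</p>
<pre>
```python
from collections import Counter

NUM_SHARDS = 20  # the demo data is split into 20 shards


def shard_for(date_id, num_shards=NUM_SHARDS):
    """Map a date_id to a shard number in 1..num_shards.

    Assumption: a simple modulo hash; the real splitter may hash differently.
    """
    return date_id % num_shards + 1


# Hypothetical date_id values standing in for rows of ontime_fact.
date_ids = range(1, 8001)
rows_per_shard = Counter(shard_for(d) for d in date_ids)

for shard in sorted(rows_per_shard):
    print(f"ontime{shard}: {rows_per_shard[shard]} rows")
```
</pre>
<p>Because every row with the same date_id lands in the same shard, the distribution is only as even as the number of rows per date allows.</p>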
<p>Since both InnoDB and ICE versions of the data are available, it is important to examine the differences in size.  This will give us some interesting information about how Shard-Query will perform on each database.  To do the size comparison, I used the <b>du</b> utility:</p>
<p>
<b>InnoDB file size on disk:  <i>42GB (with indexes)</i></b></p>
<pre>
# du -sh *
203M    ibdata1
128M    ib_logfile0
128M    ib_logfile1
988K    mysql
2.1G    ontime1
2.1G    ontime10
2.1G    ontime11
2.1G    ontime12
2.1G    ontime13
2.1G    ontime14
2.1G    ontime15
2.1G    ontime16
2.1G    ontime17
2.1G    ontime18
2.1G    ontime19
2.1G    ontime2
2.1G    ontime20
2.1G    ontime3
2.1G    ontime4
2.1G    ontime5
2.1G    ontime6
2.1G    ontime7
2.1G    ontime8
2.1G    ontime9
212K    performance_schema
0       test
</pre>
<p><b>ICE size on disk: <i>2.5GB</i></b></p>
<pre>
# du -sh *
8.0K    bh.err
11M     BH_RSI_Repository
4.0K    brighthouse.ini
4.0K    brighthouse.log
4.0K    brighthouse.seq
964K    mysql
123M    ontime1
124M    ontime10
123M    ontime11
123M    ontime12
123M    ontime13
123M    ontime14
123M    ontime15
123M    ontime16
123M    ontime17
123M    ontime18
124M    ontime19
124M    ontime2
124M    ontime20
124M    ontime3
123M    ontime4
122M    ontime5
122M    ontime6
122M    ontime7
123M    ontime8
125M    ontime9
</pre>
<p>The InnoDB data directory size is 42GB, which is twice the original size of the input data.   The ICE schema was discussed in the comments of the last post.  ICE does not have any indexes (not even primary keys).  </p>
<p>Here is the complete InnoDB schema from one shard.  The schema is duplicated 20 times (but not the ontime_fact data):</p>
<pre>
DROP TABLE IF EXISTS `dim_airport`;
CREATE TABLE `dim_airport` (
  `airport_id` int(11) NOT NULL DEFAULT '0',
  `airport_code` char(3) DEFAULT NULL,
  `CityName` varchar(100) DEFAULT NULL,
  `State` char(2) DEFAULT NULL,
  `StateFips` varchar(10) DEFAULT NULL,
  `StateName` varchar(50) NOT NULL,
  `Wac` int(11) DEFAULT NULL,
  PRIMARY KEY (`airport_id`),
  KEY `CityName` (`CityName`),
  KEY `State` (`State`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='Data from BTS ontime flight data.  Data for Origin and Destination airport data.';

CREATE TABLE `dim_date` (
  `Year` year(4) DEFAULT NULL,
  `Quarter` tinyint(4) DEFAULT NULL,
  `Month` tinyint(4) DEFAULT NULL,
  `DayofMonth` tinyint(4) DEFAULT NULL,
  `DayOfWeek` tinyint(4) DEFAULT NULL,
  `FlightDate` date NOT NULL,
  `date_id` smallint(6) NOT NULL,
  PRIMARY KEY (`date_id`),
  KEY `FlightDate` (`FlightDate`),
  KEY `Year` (`Year`,`Quarter`,`Month`,`DayOfWeek`),
  KEY `Quarter` (`Quarter`,`Month`,`DayOfWeek`),
  KEY `Month` (`Month`,`DayOfWeek`),
  KEY `DayOfWeek` (`DayOfWeek`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='Contains the date information from the BTS ontime flight data.  Note dates may not be in date_id order';
/*!40101 SET character_set_client = @saved_cs_client */;

CREATE TABLE `dim_flight` (
  `UniqueCarrier` char(7) DEFAULT NULL,
  `AirlineID` int(11) DEFAULT NULL,
  `Carrier` char(2) DEFAULT NULL,
  `FlightNum` varchar(10) DEFAULT NULL,
  `flight_id` int(11) NOT NULL DEFAULT '0',
  `AirlineName` varchar(100) DEFAULT NULL,
  PRIMARY KEY (`flight_id`),
  KEY `UniqueCarrier` (`UniqueCarrier`,`AirlineID`,`Carrier`),
  KEY `AirlineID` (`AirlineID`,`Carrier`),
  KEY `Carrier` (`Carrier`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='Contains information on flights, and what airline offered those flights and the flight number of the flight.  Some data hand updated.';

--
-- Table structure for table `ontime_fact`
--

CREATE TABLE `ontime_fact` (
  `date_id` int(11) NOT NULL DEFAULT '0',
  `origin_airport_id` int(11) NOT NULL DEFAULT '0',
  `dest_airport_id` int(11) NOT NULL DEFAULT '0',
  `flight_id` int(11) NOT NULL DEFAULT '0',
  `TailNum` varchar(50) DEFAULT NULL,
  `CRSDepTime` int(11) DEFAULT NULL,
  `DepTime` int(11) DEFAULT NULL,
  `DepDelay` int(11) DEFAULT NULL,
  `DepDelayMinutes` int(11) DEFAULT NULL,
  `DepDel15` int(11) DEFAULT NULL,
  `DepartureDelayGroups` int(11) DEFAULT NULL,
  `DepTimeBlk` varchar(20) DEFAULT NULL,
  `TaxiOut` int(11) DEFAULT NULL,
  `WheelsOff` int(11) DEFAULT NULL,
  `WheelsOn` int(11) DEFAULT NULL,
  `TaxiIn` int(11) DEFAULT NULL,
  `CRSArrTime` int(11) DEFAULT NULL,
  `ArrTime` int(11) DEFAULT NULL,
  `ArrDelay` int(11) DEFAULT NULL,
  `ArrDelayMinutes` int(11) DEFAULT NULL,
  `ArrDel15` int(11) DEFAULT NULL,
  `ArrivalDelayGroups` int(11) DEFAULT NULL,
  `ArrTimeBlk` varchar(20) DEFAULT NULL,
  `Cancelled` tinyint(4) DEFAULT NULL,
  `CancellationCode` char(1) DEFAULT NULL,
  `Diverted` tinyint(4) DEFAULT NULL,
  `CRSElapsedTime` int(11) DEFAULT NULL,
  `ActualElapsedTime` int(11) DEFAULT NULL,
  `AirTime` int(11) DEFAULT NULL,
  `Flights` int(11) DEFAULT NULL,
  `Distance` int(11) DEFAULT NULL,
  `DistanceGroup` tinyint(4) DEFAULT NULL,
  `CarrierDelay` int(11) DEFAULT NULL,
  `WeatherDelay` int(11) DEFAULT NULL,
  `NASDelay` int(11) DEFAULT NULL,
  `SecurityDelay` int(11) DEFAULT NULL,
  `LateAircraftDelay` int(11) DEFAULT NULL,
  `FirstDepTime` varchar(10) DEFAULT NULL,
  `TotalAddGTime` varchar(10) DEFAULT NULL,
  `LongestAddGTime` varchar(10) DEFAULT NULL,
  `DivAirportLandings` varchar(10) DEFAULT NULL,
  `DivReachedDest` varchar(10) DEFAULT NULL,
  `DivActualElapsedTime` varchar(10) DEFAULT NULL,
  `DivArrDelay` varchar(10) DEFAULT NULL,
  `DivDistance` varchar(10) DEFAULT NULL,
  `Div1Airport` varchar(10) DEFAULT NULL,
  `Div1WheelsOn` varchar(10) DEFAULT NULL,
  `Div1TotalGTime` varchar(10) DEFAULT NULL,
  `Div1LongestGTime` varchar(10) DEFAULT NULL,
  `Div1WheelsOff` varchar(10) DEFAULT NULL,
  `Div1TailNum` varchar(10) DEFAULT NULL,
  `Div2Airport` varchar(10) DEFAULT NULL,
  `Div2WheelsOn` varchar(10) DEFAULT NULL,
  `Div2TotalGTime` varchar(10) DEFAULT NULL,
  `Div2LongestGTime` varchar(10) DEFAULT NULL,
  `Div2WheelsOff` varchar(10) DEFAULT NULL,
  `Div2TailNum` varchar(10) DEFAULT NULL,
  `Div3Airport` varchar(10) DEFAULT NULL,
  `Div3WheelsOn` varchar(10) DEFAULT NULL,
  `Div3TotalGTime` varchar(10) DEFAULT NULL,
  `Div3LongestGTime` varchar(10) DEFAULT NULL,
  `Div3WheelsOff` varchar(10) DEFAULT NULL,
  `Div3TailNum` varchar(10) DEFAULT NULL,
  `Div4Airport` varchar(10) DEFAULT NULL,
  `Div4WheelsOn` varchar(10) DEFAULT NULL,
  `Div4TotalGTime` varchar(10) DEFAULT NULL,
  `Div4LongestGTime` varchar(10) DEFAULT NULL,
  `Div4WheelsOff` varchar(10) DEFAULT NULL,
  `Div4TailNum` varchar(10) DEFAULT NULL,
  `Div5Airport` varchar(10) DEFAULT NULL,
  `Div5WheelsOn` varchar(10) DEFAULT NULL,
  `Div5TotalGTime` varchar(10) DEFAULT NULL,
  `Div5LongestGTime` varchar(10) DEFAULT NULL,
  `Div5WheelsOff` varchar(10) DEFAULT NULL,
  `Div5TailNum` varchar(10) DEFAULT NULL,
  KEY `date_id` (`date_id`),
  KEY `flight_id` (`flight_id`),
  KEY `origin_airport_id` (`origin_airport_id`),
  KEY `dest_airport_id` (`dest_airport_id`),
  KEY `DepDelay` (`DepDelay`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='Contains all avaialble data from 1988 to 2010';

mysql> use ontime1;
Database changed

mysql> show table status like 'ontime_fact'\G
*************************** 1. row ***************************
           Name: ontime_fact
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 6697533
 Avg_row_length: 241
    Data_length: 1616904192
Max_data_length: 0
   Index_length: 539279360
      Data_free: 4194304
 Auto_increment: NULL
    Create_time: 2011-05-10 04:26:14
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment: Contains all avaialble data from 1988 to 2010
1 row in set (0.00 sec)
</pre>
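<p>With ICE, there is only 2.5GB of data after compression.  That is roughly a 16:1 reduction relative to the 42GB InnoDB data directory (about 8:1 relative to the raw input), which is quite nice.  Each shard holds only about 123MB of data (122M to 125M in the du listing)!</p>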
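<p>The size arithmetic can be checked with a few lines of Python.  This sketch uses only numbers quoted in this post: Data_length and Index_length from the SHOW TABLE STATUS output, the two du totals, and the statement that the InnoDB directory is twice the size of the input.  The variable names are mine.</p>
<pre>
```python
GIB = 1024 ** 3

# Copied from the SHOW TABLE STATUS output for ontime1.ontime_fact.
data_length = 1616904192
index_length = 539279360
shards = 20

per_shard_gib = (data_length + index_length) / GIB
print(f"fact table per shard: {per_shard_gib:.2f} GiB")
print(f"fact tables, all shards: {per_shard_gib * shards:.1f} GiB")

# Directory totals reported by du in this post.
innodb_gb = 42.0
ice_gb = 2.5
input_gb = innodb_gb / 2  # the InnoDB directory is twice the input size

print(f"InnoDB to ICE: {innodb_gb / ice_gb:.1f}:1")
print(f"input to ICE:  {input_gb / ice_gb:.1f}:1")
```
</pre>
<p>The fact table alone accounts for about 2 GiB per shard, which is in line with the 2.1G per database that du reports once the dimension tables are included.</p>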
<h3>Storage engine makes a big difference</h3>
<p>In general, a column store performs about 8x-10x better than a row store for queries which access a significant amount of data.  One big reason for this is the excellent compression that RLE techniques provide.<br />
I have not loaded InnoDB compressed tables yet, but since InnoDB compression is not RLE, I doubt it will have the same impact.</p>
<p>For large datasets, effective compression means fewer nodes are needed to keep the data entirely in memory.  This frees disk for the on-disk temporary storage used by hash joins and other background operations.  This will have a direct impact on our query response times and throughput.</p>
<h3>Setting up a cluster using the AMI images</h3>
<p>You can easily test Shard-Query for yourself.  Spin up the desired number of EC2 instances using one of the AMI images.  For best results, spin up a number of instances that divides evenly into 20.  A helper utility (included in the image) configures the cluster from a copy of the text on the EC2 console page.  To use it, ensure:</p>
<ol>
<li>That only the instances that you want to use are shown in the EC2 console.</li>
<li>That the &#8220;private ip&#8221; field is selected in the list of columns to show (click show/hide to change the columns).</li>
<li>That the &#8220;public dns&#8221; field is selected.</li>
</ol>
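<p>On the choice of instance count: the data is split into 20 shards, so the node counts that give every node the same number of shards are the divisors of 20.  A short Python sketch (the function name is mine) lists them:</p>
<pre>
```python
TOTAL_SHARDS = 20  # the demo data is split into 20 shards


def even_cluster_sizes(total_shards=TOTAL_SHARDS):
    """Return the node counts that divide the shard count evenly."""
    return [n for n in range(1, total_shards + 1) if total_shards % n == 0]


for nodes in even_cluster_sizes():
    print(f"{nodes:2d} nodes: {TOTAL_SHARDS // nodes} shards per node")
```
</pre>
<p>Any other node count leaves some nodes with more shards than others, so the busiest node sets the response time.</p>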
<p>SSH to the public DNS entry of the first node on the list of nodes.  This node will become &#8220;shard1&#8221;.</p>
<p>Now, in the EC2 console, press CTRL-A to select all text on the page and then CTRL-C to copy it.  Paste this into a text file on shard1 called &#8220;/tmp/servers.txt&#8221; and run the following commands:</p>
<pre>
$ cat servers.txt | grep "10\."| grep -v internal |tee hosts.internal
[host list omitted]
</pre>
<p>Now you need to set up the hosts file:</p>
<pre>
sudo su -
# cat hosts.internal | ~ec2-user/tools/mkhosts >> /etc/hosts

# ping shard20
PING shard20 (10.126.15.34) 56(84) bytes of data.
64 bytes from shard20 (10.126.15.34): icmp_seq=1 ttl=61 time=0.637 ms
...
</pre>
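<p>For readers who want to see what the two steps above amount to, here is a rough Python approximation.  It is not the source of the mkhosts tool, which is not shown in this post.  The sample console text and its layout are hypothetical, and the only behaviour taken from the post is that private 10.x addresses end up mapped to names such as shard1 through shard20.</p>
<pre>
```python
import re

# Hypothetical console text; the real EC2 console layout may differ.
CONSOLE_TEXT = """\
i-0001  ec2-1-2-3-4.compute-1.amazonaws.com  10.0.0.11
i-0001  ip-10-0-0-11.ec2.internal
i-0002  ec2-1-2-3-5.compute-1.amazonaws.com  10.0.0.12
"""

PRIVATE_IP = re.compile(r"\b10\.\d+\.\d+\.\d+\b")


def private_ips(text):
    """Keep lines that contain a 10.x address and do not mention 'internal'."""
    ips = []
    for line in text.splitlines():
        match = PRIVATE_IP.search(line)
        if match and "internal" not in line:
            ips.append(match.group(0))
    return ips


def hosts_lines(ips):
    """Name the addresses shard1, shard2, ... in the order they were listed."""
    return [f"{ip}\tshard{n}" for n, ip in enumerate(ips, start=1)]


for line in hosts_lines(private_ips(CONSOLE_TEXT)):
    print(line)
```
</pre>
<p>The regular expression is stricter than the grep pattern above, which matches any line containing the text "10." anywhere.</p>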
<p>Note: There is no need to put that hosts file on your other nodes unless you want to run workers on them.</p>
<h3>Generate a cluster configuration</h3>
<p>There is a script provided to generate the shards.ini file for testing a cluster of 1 to 20 nodes.</p>
<pre>
cd shard-query

#generate a config for 20 shards (adjust to your number of nodes)
php genconfig 20 > shards.ini
</pre>
<h3>Running the test</h3>
<p>For best performance, you should run the workers on one or two nodes.  You should start two workers per core in the cluster.</p>
<p>First start gearmand:</p>
<pre>
gearmand -p 7000 -d
</pre>
<p>Then start the workers on node 1 (assuming a 20 node cluster):</p>
<pre>
cd shard-query
./start_workers 80
</pre>
<p>I normally start (2 * TOTAL_CLUSTER_CORES) workers.  That is, if you have 20 machines, each with 2 cores, run 80 workers.</p>
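<p>The same rule of thumb as a small Python helper.  The function name and parameters are illustrative; the factor of two workers per core is the one stated above.</p>
<pre>
```python
def worker_count(nodes, cores_per_node, workers_per_core=2):
    """Workers to start: two per core across the whole cluster by default."""
    return nodes * cores_per_node * workers_per_core


# 20 machines with 2 cores each, as in the example above.
print(worker_count(20, 2))
```
</pre>
<p>The result is the argument you would pass to ./start_workers.</p>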
<p>Test the system.  You should see the following row count (the first number is wall time, the second exec time, the third parse time).</p>
<pre>
$ echo "select count(*) from ontime_fact;" | ./run_query

Array
(
    [count(*)] => 135125787
)
1 rows returned (0.084244966506958s, 0.078309059143066s, 0.0059359073638916s)
</pre>
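<p>If you want to post-process the timing line yourself, a Python sketch along these lines pulls out the three numbers.  The regular expression is inferred from the single sample line above, so treat the format as an assumption rather than a documented contract of run_query.</p>
<pre>
```python
import re

# The timing line from the output above.
LINE = "1 rows returned (0.084244966506958s, 0.078309059143066s, 0.0059359073638916s)"

# Assumed format: row count, then wall, exec and parse time in seconds.
PATTERN = re.compile(r"(\d+) rows returned \(([\d.]+)s, ([\d.]+)s, ([\d.]+)s\)")


def parse_timing(line):
    """Return the row count and the three timings from a run_query line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a run_query timing line")
    rows = int(match.group(1))
    wall, exec_time, parse_time = (float(g) for g in match.group(2, 3, 4))
    return {"rows": rows, "wall": wall, "exec": exec_time, "parse": parse_time}


timing = parse_timing(LINE)
print(timing)
print("exec + parse:", timing["exec"] + timing["parse"])
```
</pre>
<p>In this sample, exec time plus parse time adds up to the wall time.</p>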
<h3>Execute the test:</h3>
<p>As seen above, the run_query script will run one or more semicolon-terminated SQL statements.  The queries for the benchmark are in ~ec2-user/shard-query/queries.sql.</p>
<p>I have also provided a convenient script, called pivot_results, which summarizes the output of the ./run_query command:</p>
<pre>
cd shard-query/
$ ./run_query < queries.sql | tee raw |./pivot_results &#038;
[1] 12359
$ tail -f ./raw
-- Q1
...
</pre>
<p>At the end, you will get a result output that is easy to graph in a spreadsheet:</p>
<pre>
$ cat raw | ./pivot_results
Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8.0,Q8.1,Q8.2,Q8.3,Q8.4,Q9,Q10,Q11
34.354,60.978,114.175,27.138,45.751,14.905,14.732,34.946,126.599,250.222,529.287,581.295,11.042,63.366,14.573
</pre>
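<p>A spreadsheet is not the only option.  As a sketch, the two lines above can be loaded in Python with the standard csv module; the values are copied verbatim from the output above and are presumably seconds, which the post does not state.</p>
<pre>
```python
import csv
import io

# The two lines printed by pivot_results above.
RESULTS = """\
Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8.0,Q8.1,Q8.2,Q8.3,Q8.4,Q9,Q10,Q11
34.354,60.978,114.175,27.138,45.751,14.905,14.732,34.946,126.599,250.222,529.287,581.295,11.042,63.366,14.573
"""

header, values = list(csv.reader(io.StringIO(RESULTS)))
timings = {name: float(value) for name, value in zip(header, values)}

slowest = max(timings, key=timings.get)
total = sum(timings.values())
print(f"slowest query: {slowest} ({timings[slowest]})")
print(f"total: {total:.3f}")
```
</pre>
<p>From here the dictionary can be compared against a second run, for example one on a different cluster size.</p>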
<h3>InnoDB my.cnf</h3>
<pre>
[client]
port=3306
socket=/tmp/mysql-inno.sock

[mysqld]
socket=/tmp/mysql-inno.sock
default-storage-engine=INNODB
innodb-buffer-pool-instances=2
innodb-buffer-pool-size=5600M
innodb-file-format=barracuda
innodb-file-per-table
innodb-flush-log-at-trx-commit=1
innodb-flush-method=O_DIRECT
innodb-ibuf-active-contract=1
innodb-import-table-from-xtrabackup=1
innodb-io-capacity=1000
innodb-log-buffer-size=32M
innodb-log-file-size=128M
innodb-open-files=1000
innodb_fast_checksum
innodb-purge-threads=1
innodb-read-ahead=linear
innodb-read-ahead-threshold=8
innodb-read-io-threads=16
innodb-recovery-stats
innodb-recovery-update-relay-log
innodb-replication-delay=#
innodb-rollback-on-timeout
innodb-rollback-segments=16
innodb-stats-auto-update=0
innodb-stats-on-metadata=0
innodb-stats-sample-pages=256
innodb-stats-update-need-lock=0
innodb-status-file
innodb-strict-mode
innodb-thread-concurrency=0
innodb-thread-concurrency-timer-based
innodb-thread-sleep-delay=0
innodb-use-sys-stats-table
innodb-write-io-threads=4
join-buffer-size=16M
key-buffer-size=64M
local-infile=on
lock-wait-timeout=300
log-error=/var/log/mysqld-innodb.log
max-allowed-packet=1M
net-buffer-length=16K
#we value throughput over response time, get a good plan
optimizer-prune-level=0
partition=ON
port=3306
read-buffer-size=512K
read-rnd-buffer-size=1M
skip-host-cache
skip-name-resolve
sort-buffer-size=512K
sql-mode=STRICT_TRANS_TABLES
symbolic-links
table-definition-cache=16384
table-open-cache=128
thread-cache-size=32
thread-stack=256K
tmp-table-size=64M
transaction-isolation=READ-COMMITTED
user=mysql
wait-timeout=86400
</pre>
<h3>To be continued</h3>
<p>You can now set up a cluster from 1 to 20 nodes for testing.  This way you can verify the numbers in my next blog post. I will compare performance of various cluster sizes on both storage engines.  </p>]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2011/05/12/shard-query-ec2-images-available/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Two new open source data warehousing launches</title>
		<link>http://feedproxy.google.com/~r/451opensource/~3/CLWu6NwVYWU/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=two-new-open-source-data-warehousing-launches</link>
		<comments>http://feedproxy.google.com/~r/451opensource/~3/CLWu6NwVYWU/#comments</comments>
		<pubDate>Wed, 28 Oct 2009 11:33:53 +0000</pubDate>
		<dc:creator>The 451 Group</dc:creator>
				<category><![CDATA[451 group]]></category>
		<category><![CDATA[451caostheory]]></category>
		<category><![CDATA[451group]]></category>
		<category><![CDATA[calpont]]></category>
		<category><![CDATA[caostheory]]></category>
		<category><![CDATA[Dynamo BI]]></category>
		<category><![CDATA[dynamodb]]></category>
		<category><![CDATA[infobright]]></category>
		<category><![CDATA[John Sichi]]></category>
		<category><![CDATA[Kickfire]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[luciddb]]></category>
		<category><![CDATA[lucidera]]></category>
		<category><![CDATA[matt aslett]]></category>
		<category><![CDATA[mattaslett]]></category>
		<category><![CDATA[matthew aslett]]></category>
		<category><![CDATA[matthewaslett]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[Nicholas Goodman]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[opensource]]></category>
		<category><![CDATA[oracle]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[sun]]></category>
		<category><![CDATA[The 451 Grou]]></category>

		<guid isPermaLink="false">http://blogs.the451group.com/opensource/?p=1266</guid>
		<description><![CDATA[In our recent report on the data warehousing market we speculated that there would soon be a change in the number of vendors operating in what is a crowded market. We were anticipating that the number of vendors would go down, rather than up, but - in the short term at least - we have been proved wrong, as two new open source analytical databases emerged this week.
First came the formation of Dynamo Business Intelligence Corp, (aka Dynamo BI), a new commercially supported distribution, and sponsor, of LucidDB. Then came the launch of InfiniDB Community Edition, a new open source analytic database based on MySQL from Calpont.
Read the rest of this post on our Too Much Information blog.
]]></description>
			<content:encoded><![CDATA[<p>In our recent <a href="http://www.the451group.com/special_reports/special_report_detail.php?icid=914">report on the data warehousing market</a> we speculated that there would soon be a change in the number of vendors operating in what is a crowded market. We were anticipating that the number of vendors would go down, rather than up, but - in the short term at least - we have been proved wrong, as two new open source analytical databases emerged this week.</p>
<p>First came the <a href="http://n2.nabble.com/Introducing-Dynamo-BI-tt3883211.html">formation</a> of Dynamo Business Intelligence Corp, (aka Dynamo BI), a new commercially supported distribution, and sponsor, of LucidDB. Then came the <a href="http://www.calpont.com/press/October-26-2009.html">launch</a> of InfiniDB Community Edition, a new open source analytic database based on MySQL from Calpont.</p>
<p>Read the rest of <a href="http://blogs.the451group.com/information_management/2009/10/28/because-20-data-warehousing-vendors-is-never-enough/">this post</a> on our Too Much Information blog.</p>
]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2009/10/28/two-new-open-source-data-warehousing-launches/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Calpont opens up: InfiniDB Open Source Analytical Database (based on MySQL)</title>
		<link>http://rpbouman.blogspot.com/2009/10/calpont-opens-up-infinidb-open-source.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=calpont-opens-up-infinidb-open-source-analytical-database-based-on-mysql</link>
		<comments>http://rpbouman.blogspot.com/2009/10/calpont-opens-up-infinidb-open-source.html#comments</comments>
		<pubDate>Tue, 27 Oct 2009 13:54:00 +0000</pubDate>
		<dc:creator>Roland Bouman</dc:creator>
				<category><![CDATA[analytic databases]]></category>
		<category><![CDATA[Business Intelligence]]></category>
		<category><![CDATA[calpont]]></category>
		<category><![CDATA[column oriented databases]]></category>
		<category><![CDATA[data warehousing]]></category>
		<category><![CDATA[infobright]]></category>
		<category><![CDATA[luciddb]]></category>
		<category><![CDATA[monetdb]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[partitioning]]></category>

		<guid isPermaLink="false"></guid>
		<description><![CDATA[Open source business intelligence and data warehousing are on the rise! If you kept up with the MySQL Performance Blog, you might have noticed a number of posts comparing the open source analytical databases Infobright, LucidDB, and MonetDB. LucidDB got some more news last week when Nick Goodman announced that the Dynamo Business Intelligence Corporation will be offering services around LucidDB, branding it as DynamoDB. Now, to top it off, Calpont has just released InfiniDB, a GPLv2 open source version of its analytical database offering, which is based on the MySQL server. So, let's take a quick look at InfiniDB. I haven't yet played around with it, but the features sure look interesting:
Column-oriented architecture (like all other analytical database products mentioned)
Transparent compression
Vertical and horizontal partitioning: on top of being column-oriented, data is also partitioned, potentially allowing for less IO to access data.
MVCC and support for high concurrency. It would be interesting to see how much benefit this gives when loading data, because this is usually one of the bottlenecks for column-oriented databases
Support for ACID/Transactions
High performance bulkloader
No specialized hardware - InfiniDB is a pure software solution that can run on commodity hardware
MySQL compatible
The website sums up a few more features and benefits, but I think this covers the most important ones. Calpont also offers a closed source enterprise edition, which differs from the open source edition by offering multi-node scale-out support. By that, they do not mean regular MySQL replication scale-out. Instead, the enterprise edition features a true distributed database architecture which allows you to divide incoming requests across a layer of so-called "user modules" (MySQL front ends) and "performance modules" (the actual workhorses that partition, retrieve and cache data).
In this scenario, the user modules break the queries they receive from client applications into pieces, and send them to one or more performance modules in a parallel fashion. The performance modules then retrieve the actual data from either their cache or the disk, and send it back to the user modules, which re-assemble the partial and intermediate results into the final resultset that is sent back to the client. (see picture) Given the MySQL compatibility and otherwise similar features, I think it is fair to compare the open source InfiniDB offering to the Infobright community edition. Interesting differences are that InfiniDB supports all usual DML statements (INSERT, DELETE, UPDATE), and that InfiniDB offers the same bulkloader in both the community edition and the enterprise edition: Infobright community edition does not support DML, and offers a bulk loader that is less performant than the one included in its enterprise edition. I have not heard of an Infobright multi-node option, so when comparing the enterprise edition featuresets, that also seems like an advantage of Calpont's offering. Please understand that I am not endorsing one of these products over the other: I'm just doing a checkbox feature list comparison here. What it mostly boils down to is that users who need an affordable analytical database now have even more choice than before. In addition, it adds a bit more competition for the vendors, and I expect them all to improve as a result of that. These are interesting times for the BI and data warehousing market :)]]></description>
			<content:encoded><![CDATA[Open source business intelligence and data warehousing are on the rise!<br /><br />If you kept up with the <a href="http://www.mysqlperformanceblog.com/" >MySQL Performance Blog</a>, you might have <a href="http://www.mysqlperformanceblog.com/2009/10/26/air-traffic-queries-in-luciddb/" >noticed</a> a <a href="http://www.mysqlperformanceblog.com/2009/10/02/analyzing-air-traffic-performance-with-infobright-and-monetdb/" >number</a> of <a href="http://www.mysqlperformanceblog.com/2009/09/29/quick-comparison-of-myisam-infobright-and-monetdb/" >posts</a> comparing the open source analytical databases <a href="http://www.infobright.org/">Infobright</a>, <a href="http://www.luciddb.org/" >LucidDB</a>, and <a href="http://monetdb.cwi.nl/">MonetDB</a>. LucidDB <a href="http://www.nicholasgoodman.com/bt/blog/2009/10/24/luciddb-dynamobi-is-running-with-it/" >got</a> some <a href="http://n2.nabble.com/Introducing-Dynamo-BI-td3883211.html" >more</a> news <a href="http://thinkwaitfast.blogspot.com/2009/10/introducing-dynamo-bi.html">last</a> week when <a href="http://www.nicholasgoodman.com/bt/blog/" >Nick Goodman</a> announced that the Dynamo Business Intelligence Corporation will be offering services around LucidDB, branding it as DynamoDB.<br /><br />Now, to top it off, <a href="http://www.calpont.com/" >Calpont</a> has just released <a href="http://infinidb.org/resources/what-is-infinidb" >InfiniDB</a>, a GPLv2 open source version of its analytical database offering, which is based on the MySQL server.<br /><br />So, let's take a quick look at InfiniDB. 
I haven't yet played around with it, but the features sure look interesting:<ul><li>Column-oriented architecture (like all other analytical database products mentioned)</li><li>Transparent compression</li><li>Vertical and horizontal partitioning: on top of being column-oriented, data is also partitioned, potentially allowing for less IO to access data.</li><li>MVCC and support for high concurrency. It would be interesting to see how much benefit this gives when loading data, because this is usually one of the bottlenecks for column-oriented databases</li><li>Support for ACID/Transactions</li><li>High performance bulkloader</li><li>No specialized hardware - InfiniDB is a pure software solution that can run on commodity hardware</li><li>MySQL compatible</li></ul><br />The website sums up a few more features and benefits, but I think this covers the most important ones. <br /><br />Calpont also offers a closed source enterprise edition, which differs from the open source edition by offering multi-node scale-out support. By that, they do not mean regular MySQL replication scale-out. Instead, the enterprise edition features a true distributed database architecture which allows you to divide incoming requests across a layer of so-called "user modules" (MySQL front ends) and "performance modules" (the actual workhorses that partition, retrieve and cache data). In this scenario, the user modules break the queries they receive from client applications into pieces, and send them to one or more performance modules in a parallel fashion. The performance modules then retrieve the actual data from either their cache or the disk, and send it back to the user modules, which re-assemble the partial and intermediate results into the final resultset that is sent back to the client. 
(see picture)<br /><a href="http://www.flickr.com/photos/21931585@N07/4049476409/" title="shared-disk-arch-simple by roland.bouman, on Flickr"><img src="http://farm3.static.flickr.com/2563/4049476409_a124c2b147_o.jpg" width="821" height="465" alt="shared-disk-arch-simple" /></a><br />Given the MySQL compatibility and otherwise similar features, I think it is fair to compare the open source InfiniDB offering to the Infobright community edition. Interesting differences are that InfiniDB supports all usual DML statements (<code>INSERT</code>, <code>DELETE</code>, <code>UPDATE</code>), and that InfiniDB offers the same bulkloader in both the community edition and the enterprise edition: Infobright community edition does not support DML, and offers a bulk loader that is less performant than the one included in its enterprise edition. I have not heard of an Infobright multi-node option, so when comparing the enterprise edition featuresets, that also seems like an advantage of Calpont's offering.<br /><br />Please understand that I am not endorsing one of these products over the other: I'm just doing a checkbox feature list comparison here. What it mostly boils down to is that users who need an affordable analytical database now have even more choice than before. In addition, it adds a bit more competition for the vendors, and I expect them all to improve as a result of that. These are interesting times for the BI and data warehousing market :)]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2009/10/27/calpont-opens-up-infinidb-open-source-analytical-database-based-on-mysql/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Some scaling observations on Infobright</title>
		<link>http://www.fishpool.org/post/2009/10/03/Some-scaling-observations-on-Infobright?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=some-scaling-observations-on-infobright</link>
		<comments>http://www.fishpool.org/post/2009/10/03/Some-scaling-observations-on-Infobright#comments</comments>
		<pubDate>Sat, 03 Oct 2009 10:12:00 +0000</pubDate>
		<dc:creator>Osma Ahvenlampi</dc:creator>
				<category><![CDATA[data warehousing]]></category>
		<category><![CDATA[infobright]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[Open Source]]></category>

		<guid isPermaLink="false"></guid>
		<description><![CDATA[A couple of days ago, Baron Schwartz posted some

simple load and select benchmarking of MyISAM, Infobright and MonetDB,
which Vadim Tkachenko followed up with a 
more realistic dataset and interesting figures where MonetDB beat Infobright in
most queries.
Used to the parallel IEE loader, I was surprised by the apparent slow
loading speed of Baron's benchmark and decided to try and replicate it. I
installed Infobright 3.2 on my laptop (see, this is very unscientific) and
wrote a simple perl script to generate and load an arbitrarily large
data set resembling Baron's description. I'm not going to post my exact
numbers, because this installation is severely resource-constrained below
Infobright's recommended smallest installation. However, you can reproduce the
results yourself with the attached script, and I will note some
observations.    First of all, this was run on a 1.8GHz Core 2 Duo 2GB RAM laptop running a
64-bit kernel and 64-bit ICE. I stopped most other programs for the duration of
the test, but was still running Fedora 11's GNOME Desktop, and gave ICE only
400MB main heap, 200MB compressed heap and 300MB loader heap. What I found:

Loading speed is almost a linear function of table width. Every time I
doubled the number of (random integer) columns in the loaded table, loading
speed approximately halved. At 200 columns, I was seeing approximately the same
speed as Baron.
On the other hand, loading speed is NOT affected by the number of rows
loaded, from which one could assume that it won't be affected by the size of
the pre-existing table either (though I did not test this).
SELECT speed is not affected by the number of columns in the table, unless
those columns are being selected or used in the where constraint. The same
select executed at the same speed regardless of whether the source table had
10, 100 or 500 columns which were not being inspected.
Loading order, column sorting and select constraints are strongly
correlated. Limiting a query with a random-value column causes a query which
reads 10%, 50% or 90% of the table to execute in approximately 1x, 2x and 3x
time. On the other hand, replacing the random-value column in the where
constraint with a sorted-value (in load order) column makes queries accessing
the same 10%, 50% or 90% of the rows run in nearly constant time. This is the
rough set &#34;knowledge grid&#34; in action, but only works if the aggregations done
are sum(), min(), max() or other simple functions supported by the grid.
With a large dataset (I stopped at 50x heap space), the constant scale factor
in the previous query starts to deteriorate, as the entire knowledge grid no
longer fits in heap, and inspecting 90% of it vs 10% will require I/O. At this
point the scale is beginning to resemble a realistic production setting, as few
people are able to host even 5% of their ADBMS working set in-memory.
So this is where things get really interesting, and performance
characteristics shift around. Random-column constrained queries of 10%, 50% and
90% of rows now run at 1x, 1.1x and 1.3x time, while constant-column
constraints execute at 1x, 1.3x and 3x their respective performance, but
approximately 3000x faster than the random-constraints!
The last point shows why it really matters to Infobright workloads that the
most frequent queries are taken into account when deciding the load order for
the data set. Even when designing an incremental ETL process, it can pay off
immensely to pre-sort the incremental data sets by the most likely constraint
or group by columns to allow blocks of data (64k rows each) to be included or
excluded for particular query plans.
Why are the random-column constrained queries executing in almost constant
time regardless of the rows inspected? Because Infobright's columnar datapack
engine accesses 64k rows at a time, it's nearly as expensive to access every
tenth row as it is to access nine rows out of every ten, if the distribution of
those rows is even across the data set. On the other hand, if the 10% of rows
needed are clustered together, then the other 90% of the data set is skipped
very early in the query optimization process. A traditional btree-indexed data
set would still require a random sweep over most of the index, which would be
much larger than Infobright's &#34;knowledge grid&#34; is.

That's it for now. If you decide to run your own test using the script,
please post comments. It should run as-is on a machine with the unmodified ICE
3.2 installation and basic Perl packages available. The script takes two
arguments: number of columns (each a random integer value), and number of rows,
and generates the data set into a named pipe through which it's loaded into a
local table created automatically.]]></description>
			<content:encoded><![CDATA[<p>A couple of days ago, Baron Schwartz posted some
<a href="http://www.mysqlperformanceblog.com/2009/09/29/quick-comparison-of-myisam-infobright-and-monetdb/">
simple load and select benchmarking of MyISAM, Infobright and MonetDB</a>,
which Vadim Tkachenko followed up with a <a href="http://www.mysqlperformanceblog.com/2009/10/02/analyzing-air-traffic-performance-with-infobright-and-monetdb/">
more realistic dataset and interesting figures where MonetDB beat Infobright in
most queries</a>.</p>
<p>Used to the parallel IEE loader, I was surprised by the apparent slow
loading speed of Baron's benchmark and decided to try and replicate it. I
installed Infobright 3.2 on my laptop (see, this is very unscientific) and
wrote a simple perl script to generate and load an arbitrarily large
data set resembling Baron's description. I'm not going to post my exact
numbers, because this installation is severely resource-constrained below
Infobright's recommended smallest installation. However, you can reproduce the
results yourself with the attached script, and I will note some
observations.</p>    <p>First of all, this was run on a 1.8GHz Core 2 Duo 2GB RAM laptop running a
64-bit kernel and 64-bit ICE. I stopped most other programs for the duration of
the test, but was still running Fedora 11's GNOME Desktop, and gave ICE only
400MB main heap, 200MB compressed heap and 300MB loader heap. What I found:</p>
<ul>
<li>Loading speed is almost a linear function of table width. Every time I
doubled the number of (random integer) columns in the loaded table, loading
speed approximately halved. At 200 columns, I was seeing approximately the same
speed as Baron.</li>
<li>On the other hand, loading speed is NOT affected by the number of rows
loaded, from which one could assume that it won't be affected by the size of
the pre-existing table either (though I did not test this).</li>
<li>SELECT speed is not affected by the number of columns in the table, unless
those columns are being selected or used in the where constraint. The same
select executed at the same speed regardless of whether the source table had
10, 100 or 500 columns which were not being inspected.</li>
<li>Loading order, column sorting and select constraints are strongly
correlated. Limiting a query with a random-value column causes a query which
reads 10%, 50% or 90% of the table to execute in approximately 1x, 2x and 3x
time. On the other hand, replacing the random-value column in the where
constraint with a sorted-value (in load order) column makes queries accessing
the same 10%, 50% or 90% of the rows run in nearly constant time. This is the
rough set &quot;knowledge grid&quot; in action, but only works if the aggregations done
are sum(), min(), max() or other simple functions supported by the grid.</li>
<li>With a large dataset (I stopped at 50x heap space), the constant scale factor
in the previous query starts to deteriorate, as the entire knowledge grid no
longer fits in heap, and inspecting 90% of it vs 10% will require I/O. At this
point the scale is beginning to resemble a realistic production setting, as few
people are able to host even 5% of their ADBMS working set in-memory.</li>
<li>So this is where things get really interesting, and performance
characteristics shift around. Random-column constrained queries of 10%, 50% and
90% of rows now run at 1x, 1.1x and 1.3x time, while constant-column
constraints execute at 1x, 1.3x and 3x their respective performance, but
approximately 3000x faster than the random-constraints!</li>
<li>The last point shows why it really matters to Infobright workloads that the
most frequent queries are taken into account when deciding the load order for
the data set. Even when designing an incremental ETL process, it can pay off
immensely to pre-sort the incremental data sets by the most likely constraint
or group by columns to allow blocks of data (64k rows each) to be included or
excluded for particular query plans.</li>
<li>Why are the random-column constrained queries executing in almost constant
time regardless of the rows inspected? Because Infobright's columnar datapack
engine accesses 64k rows at a time, it's nearly as expensive to access every
tenth row as it is to access nine rows out of every ten, if the distribution of
those rows is even across the data set. On the other hand, if the 10% of rows
needed are clustered together, then the other 90% of the data set is skipped
very early in the query optimization process. A traditional btree-indexed data
set would still require a random sweep over most of the index, which would be
much larger than Infobright's &quot;knowledge grid&quot; is.</li>
</ul>
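<p>The pack-skipping effect described in the last few points can be illustrated with a toy model. The Python sketch below is not Infobright's implementation. It assumes that each 64k-row pack keeps only a minimum and a maximum value, and that a pack is read only when that range can overlap the query constraint. The function name <code>packs_touched</code> and the 16-pack data set are made up for illustration:</p>

```python
import random

PACK_ROWS = 65536  # "64k rows each", as described in the post


def packs_touched(column, lo, hi, pack_rows=PACK_ROWS):
    """Count packs whose [min, max] range can contain a value in [lo, hi).

    Only per-pack min/max metadata decides whether a pack is read.
    This is a toy stand-in for the knowledge grid, not Infobright's code.
    """
    touched = 0
    for start in range(0, len(column), pack_rows):
        pack = column[start:start + pack_rows]
        if min(pack) < hi and max(pack) >= lo:
            touched += 1
    return touched


n_packs = 16
n_rows = n_packs * PACK_ROWS
sorted_col = list(range(n_rows))        # values arrive in load order
random_col = sorted_col[:]
random.Random(42).shuffle(random_col)   # same values, random order

# Constraint selecting the lowest 10% of values.
lo, hi = 0, n_rows // 10
sorted_touched = packs_touched(sorted_col, lo, hi)
random_touched = packs_touched(random_col, lo, hi)
print(sorted_touched, random_touched, n_packs)
```

<p>With the column in load order, the 10% constraint touches only 2 of the 16 packs. With the same values shuffled, every pack contains some matching rows, so all 16 have to be read.</p>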
<div>That's it for now. If you decide to run your own test using the script,
please post comments. It should run as-is on a machine with the unmodified ICE
3.2 installation and basic Perl packages available. The script takes two
arguments: number of columns (each a random integer value), and number of rows,
and generates the data set into a named pipe through which it's loaded into a
local table created automatically.</div>]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2009/10/03/some-scaling-observations-on-infobright/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Analyzing air traffic performance with InfoBright and MonetDB</title>
		<link>http://www.mysqlperformanceblog.com/2009/10/02/analyzing-air-traffic-performance-with-infobright-and-monetdb/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=analyzing-air-traffic-performance-with-infobright-and-monetdb</link>
		<comments>http://www.mysqlperformanceblog.com/2009/10/02/analyzing-air-traffic-performance-with-infobright-and-monetdb/#comments</comments>
		<pubDate>Sat, 03 Oct 2009 00:06:49 +0000</pubDate>
		<dc:creator>MySQL Performance Blog</dc:creator>
				<category><![CDATA[benchmarks]]></category>
		<category><![CDATA[infobright]]></category>
		<category><![CDATA[monetdb]]></category>
		<category><![CDATA[OLAP]]></category>
		<category><![CDATA[reporing]]></category>

		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=1289</guid>
		<description><![CDATA[By coincidence, Baron and I both played with InfoBright this week (see http://www.mysqlperformanceblog.com/2009/09/29/quick-comparison-of-myisam-infobright-and-monetdb/). Following Baron's example, I also ran the same load against MonetDB. After reading the comments on Baron's post I tried to load the same data into LucidDB, but I was not successful.
I wanted to analyze a bigger dataset, so I took the publicly available data at
http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&#38;DB_Short_Name=On-Time about USA domestic flights, with information about flight length and delays.
The data is available from 1988 to 2009 in monthly chunks, so I downloaded 252 files (for the years 1988-2008), each from 170MB to 300MB in size. In total, the raw data is about 55GB. The average number of rows in each chunk is 483762.46 (the query Q0 is: select avg(c1) from (select year,month,count(*) as c1 from ontime group by YEAR,month) t; for InfoBright and with t as (select yeard,monthd,count(*) as c1 from ontime group by YEARD,monthd) select AVG(c1) FROM t for MonetDB. For InfoBright it took 4.19 sec to execute and 29.9 sec for MonetDB, but this is almost the only case where MonetDB was significantly slower).
A few words about the environment: the server is a Dell SC1425 with 4GB of RAM and dual Intel(R) Xeon(TM) 3.40GHz CPUs.
InfoBright (ICE) version: 5.1.14-log build number (revision)=IB_3.2_GA_5316(ice)
MonetDB version: server v5.14.2, based on kernel v1.32.2
LucidDB version: 0.9.1
The table I loaded the data into is:
CREATE TABLE `ontime` (
  `Year` year(4) DEFAULT NULL,
  `Quarter` tinyint(4) DEFAULT NULL,
  `Month` tinyint(4) DEFAULT NULL,
  `DayofMonth` tinyint(4) DEFAULT NULL,
  `DayOfWeek` tinyint(4) DEFAULT NULL,
  `FlightDate` date DEFAULT NULL,
  `UniqueCarrier` char(7) DEFAULT NULL,
  `AirlineID` int(11) DEFAULT NULL,
  `Carrier` char(2) DEFAULT NULL,
  `TailNum` varchar(50) DEFAULT NULL,
  `FlightNum` varchar(10) DEFAULT NULL,
  `Origin` char(5) DEFAULT NULL,
  `OriginCityName` varchar(100) DEFAULT NULL,
  `OriginState` char(2) DEFAULT NULL,
  `OriginStateFips` varchar(10) DEFAULT NULL,
  `OriginStateName` varchar(100) DEFAULT NULL,
  `OriginWac` int(11) DEFAULT NULL,
  `Dest` char(5) DEFAULT NULL,
  `DestCityName` varchar(100) DEFAULT NULL,
  `DestState` char(2) DEFAULT NULL,
  `DestStateFips` varchar(10) DEFAULT NULL,
  `DestStateName` varchar(100) DEFAULT NULL,
  `DestWac` int(11) DEFAULT NULL,
  `CRSDepTime` int(11) DEFAULT NULL,
  `DepTime` int(11) DEFAULT NULL,
  `DepDelay` int(11) DEFAULT NULL,
  `DepDelayMinutes` int(11) DEFAULT NULL,
  `DepDel15` int(11) DEFAULT NULL,
  `DepartureDelayGroups` int(11) DEFAULT NULL,
  `DepTimeBlk` varchar(20) DEFAULT NULL,
  `TaxiOut` int(11) DEFAULT NULL,
  `WheelsOff` int(11) DEFAULT NULL,
  `WheelsOn` int(11) DEFAULT NULL,
  `TaxiIn` int(11) DEFAULT NULL,
  `CRSArrTime` int(11) DEFAULT NULL,
  `ArrTime` int(11) DEFAULT NULL,
  `ArrDelay` int(11) DEFAULT NULL,
  `ArrDelayMinutes` int(11) DEFAULT NULL,
  `ArrDel15` int(11) DEFAULT NULL,
  `ArrivalDelayGroups` int(11) DEFAULT NULL,
  `ArrTimeBlk` varchar(20) DEFAULT NULL,
  `Cancelled` tinyint(4) DEFAULT NULL,
  `CancellationCode` char(1) DEFAULT NULL,
  `Diverted` tinyint(4) DEFAULT NULL,
  `CRSElapsedTime` INT(11) DEFAULT NULL,
  `ActualElapsedTime` INT(11) DEFAULT NULL,
  `AirTime` INT(11) DEFAULT NULL,
  `Flights` INT(11) DEFAULT NULL,
  `Distance` INT(11) DEFAULT NULL,
  `DistanceGroup` TINYINT(4) DEFAULT NULL,
  `CarrierDelay` INT(11) DEFAULT NULL,
  `WeatherDelay` INT(11) DEFAULT NULL,
  `NASDelay` INT(11) DEFAULT NULL,
  `SecurityDelay` INT(11) DEFAULT NULL,
  `LateAircraftDelay` INT(11) DEFAULT NULL,
  `FirstDepTime` varchar(10) DEFAULT NULL,
  `TotalAddGTime` varchar(10) DEFAULT NULL,
  `LongestAddGTime` varchar(10) DEFAULT NULL,
  `DivAirportLandings` varchar(10) DEFAULT NULL,
  `DivReachedDest` varchar(10) DEFAULT NULL,
  `DivActualElapsedTime` varchar(10) DEFAULT NULL,
  `DivArrDelay` varchar(10) DEFAULT NULL,
  `DivDistance` varchar(10) DEFAULT NULL,
  `Div1Airport` varchar(10) DEFAULT NULL,
  `Div1WheelsOn` varchar(10) DEFAULT NULL,
  `Div1TotalGTime` varchar(10) DEFAULT NULL,
  `Div1LongestGTime` varchar(10) DEFAULT NULL,
  `Div1WheelsOff` varchar(10) DEFAULT NULL,
  `Div1TailNum` varchar(10) DEFAULT NULL,
  `Div2Airport` varchar(10) DEFAULT NULL,
  `Div2WheelsOn` varchar(10) DEFAULT NULL,
  `Div2TotalGTime` varchar(10) DEFAULT NULL,
  `Div2LongestGTime` varchar(10) DEFAULT NULL,
  `Div2WheelsOff` varchar(10) DEFAULT NULL,
  `Div2TailNum` varchar(10) DEFAULT NULL,
  `Div3Airport` varchar(10) DEFAULT NULL,
  `Div3WheelsOn` varchar(10) DEFAULT NULL,
  `Div3TotalGTime` varchar(10) DEFAULT NULL,
  `Div3LongestGTime` varchar(10) DEFAULT NULL,
  `Div3WheelsOff` varchar(10) DEFAULT NULL,
  `Div3TailNum` varchar(10) DEFAULT NULL,
  `Div4Airport` varchar(10) DEFAULT NULL,
  `Div4WheelsOn` varchar(10) DEFAULT NULL,
  `Div4TotalGTime` varchar(10) DEFAULT NULL,
  `Div4LongestGTime` varchar(10) DEFAULT NULL,


&#160; `Div4WheelsOff` varchar&#40;10&#41; DEFAULT NULL,


&#160; `Div4TailNum` varchar&#40;10&#41; DEFAULT NULL,


&#160; `Div5Airport` varchar&#40;10&#41; DEFAULT NULL,


&#160; `Div5WheelsOn` varchar&#40;10&#41; DEFAULT NULL,


&#160; `Div5TotalGTime` varchar&#40;10&#41; DEFAULT NULL,


&#160; `Div5LongestGTime` varchar&#40;10&#41; DEFAULT NULL,


&#160; `Div5WheelsOff` varchar&#40;10&#41; DEFAULT NULL,


&#160; `Div5TailNum` varchar&#40;10&#41; DEFAULT NULL


&#41; ENGINE=BRIGHTHOUSE DEFAULT CHARSET=latin1; 






The last fields, those starting with "Div*", are not really used.
Load procedure:
Infobright: the loader that comes with the ICE version is very limited, and I had to transform the files to quote each field. After that, the load statement is:
mysql -S /tmp/mysql-ib.sock -e "LOAD DATA INFILE '/data/d1/AirData_ontime/${YEAR}_$i.txt.tr' INTO TABLE ontime FIELDS TERMINATED BY ',' ENCLOSED BY '\"'" ontime
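The quoting step mentioned above could look roughly like the following Python sketch. It is not the script used for this benchmark: the function name quote_all_fields is hypothetical, and it assumes the source files are ordinary comma-separated text.

```python
import csv
import io


def quote_all_fields(text):
    """Return the CSV text with every field wrapped in double quotes."""
    source = io.StringIO(text, newline="")
    target = io.StringIO(newline="")
    # QUOTE_ALL forces quotes around every field, matching ENCLOSED BY '"'.
    writer = csv.writer(target, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in csv.reader(source):
        writer.writerow(row)
    return target.getvalue()


# Hypothetical sample row; the real files have many more columns.
sample = '1988,1,AA,"Atlanta, GA"\n'
print(quote_all_fields(sample), end="")
```

Fields that already contain a comma stay intact because the csv reader parses them before they are written back out.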
The load time was about 30s/chunk for the initial years and up to 48s/chunk for 2008. The total load time was 8836 sec (2.45h).
The size of the database after load is 1.6GB, which is impressive and gives a 1:34 compression ratio.
MonetDB: It took some time to figure out how to load text data (I really wish the developers would improve the documentation), but I finally ended up with the following load statement:
/usr/local/monetdb/bin/mclient -lsql --database=ontime -t -s "COPY 700000 records INTO ontime FROM '/data/d1/AirData_ontime/${YEAR}_$i.txt' USING DELIMITERS ',','\n','\"' NULL AS '';"

Load time: 13065 sec ( 3.6h)
The database size after load is 65GB, which is discouraging. It seems MonetDB does not use any compression, and the result is bigger than the original data.
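The two storage figures can be compared with a few lines of arithmetic. This is only a sketch using the sizes reported in this post; the roughly 55GB raw size is the approximate total of the downloaded files, so the ratios are approximate too.

```python
# Sizes in GB as reported in the post; 55 GB is the approximate raw data size.
RAW_GB = 55.0
INFOBRIGHT_GB = 1.6
MONETDB_GB = 65.0


def compression_ratio(raw_gb, stored_gb):
    """How many GB of raw data one GB of stored data represents."""
    return raw_gb / stored_gb


infobright_ratio = compression_ratio(RAW_GB, INFOBRIGHT_GB)
monetdb_ratio = compression_ratio(RAW_GB, MONETDB_GB)
print(f"InfoBright 1:{infobright_ratio:.0f}")
print(f"MonetDB    1:{monetdb_ratio:.2f}")
```

A ratio below 1 for MonetDB simply means the stored database is larger than the raw files.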
LucidDB
Here it took time to work out how to execute a command from the command line using the included sqlline utility. I did not understand how to do that, so I generated a big SQL file containing the load statements.
Loading each chunk was significantly slower, starting at about 60 sec/chunk for the initial years and steadily growing to 200 sec/chunk for 2000. On 2004 (after about 5h of loading) the load failed for some reason, and I did not try to repeat it, as it would not fit in the timeframe I allocated for this benchmark. Maybe I will try again sometime.
Query execution
So I really only have data for InfoBright and MonetDB; let's see how fast they are on different queries.
The first favorite query of any database benchmarker is SELECT count(*) FROM ontime;. Both InfoBright and MonetDB execute it immediately, with a result of 117023290 rows.
Now some random queries I tried against both databases:
-Q1: Count of flights per day of week for 2000-2008
SELECT DayOfWeek, count(*) AS c FROM ontime WHERE YearD BETWEEN 2000 AND 2008 GROUP BY DayOfWeek ORDER BY c DESC

with result:
[ 5,    7509643 ]
[ 1,    7478969 ]
[ 4,    7453687 ]
[ 3,    7412939 ]
[ 2,    7370368 ]
[ 7,    7095198 ]
[ 6,    6425690 ]
And it took 7.9s for MonetDB and 12.13s for InfoBright.
-Q2: Count of flights delayed by more than 10 minutes, per day of week, for 2000-2008
SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay&#62;10 AND YearD BETWEEN 2000 AND 2008 GROUP BY DayOfWeek ORDER BY c DESC

Result:
[ 5,    1816486 ]
[ 4,    1665603 ]
[ 1,    1582109 ]
[ 7,    1555145 ]
[ 3,    1431248 ]
[ 2,    1348182 ]
[ 6,    1202457 ]
And  0.9s execution for MonetDB and 6.37s for InfoBright.
-Q3: Count of delays per airport for years 2000-2008
SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay&#62;10 AND YearD BETWEEN 2000 AND 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10
[ "ORD",        739286  ]
[ "ATL",        736736  ]
[ "DFW",        516957  ]
[ "PHX",        336360  ]
[ "LAX",        331997  ]
[ "LAS",        307677  ]
[ "DEN",        306594  ]
[ "EWR",        262007  ]
[ "IAH",        255789  ]
[ "DTW",        248005  ]
with 1.7s for MonetDB and 7.29s for InfoBright
-Q4: Count of delays per carrier for 2007
SELECT carrier, count(*) FROM ontime WHERE DepDelay&#62;10  AND YearD=2007 GROUP BY carrier ORDER BY 2 DESC
[ "WN", 296293  ]
[ "AA", 176203  ]
...
With 0.27s for MonetDB and 0.99s for InfoBright.
But obviously the more flights a carrier has, the more delays it has, so to be fair, let's calculate
-Q5: Percentage of delays for each carrier for 2007.
It is a bit trickier, as InfoBright and MonetDB need different queries:
MonetDB:
WITH t AS (SELECT carrier, count(*) AS c FROM ontime WHERE DepDelay&#62;10  AND YearD=2007 GROUP BY carrier), t2 AS (SELECT carrier, count(*) AS c2 FROM ontime WHERE YearD=2007 GROUP BY carrier) SELECT t.carrier, c, c2, c*1000/c2 as c3 FROM t JOIN t2 ON (t.Carrier=t2.Carrier) ORDER BY c3 DESC

InfoBright:
SELECT t.carrier, c, c2, c*1000/c2 as c3 FROM (SELECT carrier, count(*) AS c FROM ontime WHERE DepDelay&#62;10  AND Year=2007 GROUP BY carrier) t JOIN (SELECT carrier, count(*) AS c2 FROM ontime WHERE Year=2007 GROUP BY carrier) t2 ON (t.Carrier=t2.Carrier) ORDER BY c3 DESC;

I am using c*1000/c2 here because MonetDB seems to use integer arithmetic, and with c/c2 I received just 1. The c3 column is therefore expressed in tenths of a percent.
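To see that the two query shapes above are equivalent, here is a small self-contained sketch. It uses SQLite through Python's sqlite3 module purely as a stand-in (it is neither InfoBright nor MonetDB), and the handful of rows are made up for illustration. SQLite also performs integer division on integer operands, so the c*1000/c2 scaling behaves the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ontime (Carrier TEXT, DepDelay INTEGER, YearD INTEGER)")
# Made-up rows: EV has 2 of 3 flights delayed in 2007, AA has 1 of 4.
rows = [
    ("EV", 25, 2007), ("EV", 0, 2007), ("EV", 40, 2007),
    ("AA", 15, 2007), ("AA", 5, 2007), ("AA", 0, 2007), ("AA", -3, 2007),
    ("AA", 30, 2006),
]
conn.executemany("INSERT INTO ontime VALUES (?, ?, ?)", rows)

# Common-table-expression form (the shape used for MonetDB).
CTE_QUERY = """
WITH t AS (SELECT Carrier, count(*) AS c FROM ontime
           WHERE DepDelay > 10 AND YearD = 2007 GROUP BY Carrier),
     t2 AS (SELECT Carrier, count(*) AS c2 FROM ontime
            WHERE YearD = 2007 GROUP BY Carrier)
SELECT t.Carrier, c, c2, c*1000/c2 AS c3
FROM t JOIN t2 ON (t.Carrier = t2.Carrier) ORDER BY c3 DESC
"""

# Derived-table form (the shape used for InfoBright).
DERIVED_QUERY = """
SELECT t.Carrier, c, c2, c*1000/c2 AS c3
FROM (SELECT Carrier, count(*) AS c FROM ontime
      WHERE DepDelay > 10 AND YearD = 2007 GROUP BY Carrier) t
JOIN (SELECT Carrier, count(*) AS c2 FROM ontime
      WHERE YearD = 2007 GROUP BY Carrier) t2
ON (t.Carrier = t2.Carrier) ORDER BY c3 DESC
"""

cte_result = conn.execute(CTE_QUERY).fetchall()
derived_result = conn.execute(DERIVED_QUERY).fetchall()
print(cte_result == derived_result)
print(cte_result)
```

Both forms compute the delayed count and the total count per carrier separately and then join them, so they should return identical rows.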
So result is:
[ "EV", 101796, 286234, 355     ]
[ "US", 135987, 485447, 280     ]
[ "AA", 176203, 633857, 277     ]
[ "MQ", 145630, 540494, 269     ]
[ "AS", 42830,  160185, 267     ]
[ "B6", 50740,  191450, 265     ]
[ "UA", 128174, 490002, 261     ]
...
with execution time: 0.5s for MonetDB and 2.92s for InfoBright.
Warning: do not try to EXPLAIN this query in InfoBright. MySQL is really stupid here, and EXPLAIN for this query took 6 min!
If you wonder about the carriers: EV is Atlantic Southeast Airlines and US is US Airways Inc.
35.5% of Atlantic Southeast Airlines flights were delayed by more than 10 minutes!
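The 35.5% figure comes straight from the c3 column of the Q5 result, which holds the scaled integer ratio. A minimal sketch of that conversion, using the EV counts from the result above (the function names are hypothetical):

```python
def per_mille(delayed, total):
    """Scaled integer ratio, the same arithmetic as c*1000/c2 in the query."""
    return delayed * 1000 // total


def as_percent(value_per_mille):
    """Convert tenths of a percent into a percentage."""
    return value_per_mille / 10


# EV counts for 2007, copied from the Q5 result.
ev_per_mille = per_mille(101796, 286234)
print(ev_per_mille, as_percent(ev_per_mille))
```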
-Q6: Let's try the same query for a wide range of years, 2000-2008:
Result is:
[ "EV", 443798, 1621140,        273     ]
[ "AS", 299282, 1207960,        247     ]
[ "B6", 191250, 787113, 242     ]
[ "WN", 1885942,        7915940,        238     ]
[ "FL", 287815, 1220663,        235     ]
...
And execution time: 12.5s for MonetDB and 21.83s for InfoBright.
(AS is Alaska Airlines Inc. and B6 is JetBlue Airways)
-Q7: Percentage of delayed (more than 10 minutes) flights per year:
MonetDB:
with t as (select YEARD,count(*)*1000 as c1 from ontime WHERE DepDelay&#62;10 GROUP BY YearD), t2 as (select YEARD,count(*) as c2 from ontime GROUP BY YEARD) select t.YEARD, c1/c2 FROM t JOIN t2 ON (t.YEARD=t2.YEARD)

InfoBright:
SELECT t.YEARD, c1/c2 FROM (select YEARD,count(*)*1000 as c1 from ontime WHERE DepDelay&#62;10 GROUP BY YearD) t JOIN (select YEARD,count(*) as c2 from ontime GROUP BY YEARD) t2 ON (t.YEARD=t2.YEARD)

with result:
[ 1988, 166     ]
[ 1989, 199     ]
[ 1990, 166     ]
[ 1991, 147     ]
[ 1992, 146     ]
[ 1993, 154     ]
[ 1994, 165     ]
[ 1995, 193     ]
[ 1996, 221     ]
[ 1997, 191     ]
[ 1998, 193     ]
[ 1999, 200     ]
[ 2000, 231     ]
[ 2002, 163     ]
[ 2003, 153     ]
[ 2004, 192     ]
[ 2005, 210     ]
[ 2006, 231     ]
[ 2007, 245     ]
[ 2008, 219     ]
And execution time: 27.9s for MonetDB and 8.59s for InfoBright.
It seems MonetDB does not like scanning a wide range of rows; the slowness here is similar to Q0.
-Q8: Finally, I tested the most popular destinations, in the sense of the count of directly connected cities, for different ranges of years.
SELECT DestCityName, COUNT( DISTINCT OriginCityName) FROM ontime WHERE Year BETWEEN N and M GROUP BY DestCityName ORDER BY 2 DESC LIMIT 10;

Years,  InfoBright, MonetDB
1y,   5.88s,   0.55s
2y,   11.77s,  1.10s
3y,   17.61s,  1.69s
4y,   37.57s,  2.12s
10y,  79.77s,  29.14s
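One way to read the Q8 table is to divide each timing by the number of years scanned. The sketch below does only that, using the numbers copied from the table above; it makes no claim about why the engines behave this way.

```python
# (years in range, InfoBright seconds, MonetDB seconds), copied from the Q8 table.
Q8_TIMINGS = [
    (1, 5.88, 0.55),
    (2, 11.77, 1.10),
    (3, 17.61, 1.69),
    (4, 37.57, 2.12),
    (10, 79.77, 29.14),
]


def seconds_per_year(timings):
    """Return (years, InfoBright s/year, MonetDB s/year) for each row."""
    return [(y, round(ib / y, 2), round(monet / y, 2)) for y, ib, monet in timings]


per_year = seconds_per_year(Q8_TIMINGS)
for years, ib, monet in per_year:
    print(f"{years:>2}y  InfoBright {ib:.2f} s/year  MonetDB {monet:.2f} s/year")
```

The MonetDB cost per year stays roughly flat for ranges of 1 to 4 years and then jumps for the 10-year range.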
-Q9: And to prove that MonetDB does not like to scan many records, there is the query
select year,count(*) as c1 from ontime group by YEAR

which shows how many records there are per year:
+------+---------+
&#124; year &#124; c1      &#124;
+------+---------+
&#124; 1989 &#124; 5041200 &#124;
&#124; 1990 &#124; 5270893 &#124;
&#124; 1991 &#124; 5076925 &#124;
&#124; 1992 &#124; 5092157 &#124;
&#124; 1993 &#124; 5070501 &#124;
&#124; 1994 &#124; 5180048 &#124;
&#124; 1995 &#124; 5327435 &#124;
&#124; 1996 &#124; 5351983 &#124;
&#124; 1997 &#124; 5411843 &#124;
&#124; 1998 &#124; 5384721 &#124;
&#124; 1999 &#124; 5527884 &#124;
&#124; 2000 &#124; 5683047 &#124;
&#124; 2001 &#124; 5967780 &#124;
&#124; 2002 &#124; 5271359 &#124;
&#124; 2003 &#124; 6488540 &#124;
&#124; 2004 &#124; 7129270 &#124;
&#124; 2005 &#124; 7140596 &#124;
&#124; 2006 &#124; 7141922 &#124;
&#124; 2007 &#124; 7455458 &#124;
&#124; 2008 &#124; 7009728 &#124;
+------+---------+
And execution time: MonetDB: 6.3s and InfoBright: 0.31s
To summarize all the results, here is a graph:

Conclusions:

 This experiment was not really about an InfoBright vs MonetDB comparison. My goal was to check how well available open source software can handle this kind of task.
Although InfoBright was slower for many queries, I think it is more production-ready and stable. It has an Enterprise edition and support which you can buy. And its execution time is really good, taking into account the number of rows the engine had to crunch. For query Q8 (1-year range), a traditional transaction-oriented storage engine took 30min to get the result.

I really like MonetDB. I do not know what magic is behind the curtain (like InfoBright, it does not have indexes), but the results are impressive. As for drawbacks: the command line is weak (I had to use bash and pass the query as a parameter, otherwise I was not able to edit the query or check history), and the documentation also needs improvement. The fact that it does not use compression may also be a showstopper; the space consumption is worrying. If these issues are addressed, I think MonetDB may have commercial success.
It is worth noting that MonetDB supports all INSERT / UPDATE / DELETE statements (and space is the price for that, as I understand), while the InfoBright ICE edition allows only LOAD DATA. InfoBright Enterprise allows INSERT / UPDATE, but it is also not meant for online transaction processing.
Compression in InfoBright is impressive. Even a smaller ratio of 1:10 means you can compress 1TB to 100GB, which is a significant saving of space.

I am open to running any other queries if you want to compare or get info about air performance.
    
    Entry posted by Vadim ]]></description>
			<content:encoded><![CDATA[<p>By coincidence, Baron and I both played with InfoBright (see <a href="http://www.mysqlperformanceblog.com/2009/09/29/quick-comparison-of-myisam-infobright-and-monetdb/">http://www.mysqlperformanceblog.com/2009/09/29/quick-comparison-of-myisam-infobright-and-monetdb/</a>) this week. Following Baron's example, I also ran the same load against MonetDB. After reading the comments on Baron's post I tried to load the same data into LucidDB, but I was not successful.</p>
<p>I wanted to analyze a bigger dataset, so I took the publicly available data<br />
<a href="http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&amp;DB_Short_Name=On-Time">http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&#038;DB_Short_Name=On-Time</a> about USA domestic flights with information about flight length and delays.</p>
<p>The data is available from 1988 to 2009 in monthly chunks, so I downloaded 252 files (for the years 1988-2008), from 170MB to 300MB each. In total the raw data is about 55GB. The average number of rows in each chunk is 483762.46 (the query Q0 is:<code> select avg(c1) from (select year,month,count(*) as c1 from ontime group by YEAR,month) t;</code> for InfoBright and <code>with t as (select yeard,monthd,count(*) as c1 from ontime group by YEARD,monthd) select AVG(c1) FROM t</code> for MonetDB. It took <strong>4.19 sec</strong> to execute on InfoBright and <strong>29.9 sec</strong> on MonetDB, but this is almost the only case where MonetDB was significantly slower).</p>
<p>A few words about the environment: a Dell SC1425 server with 4GB of RAM and dual Intel(R) Xeon(TM) 3.40GHz CPUs.<br />
InfoBright  (ICE) version: 5.1.14-log build number (revision)=IB_3.2_GA_5316(ice)<br />
MonetDB version: server v5.14.2, based on kernel v1.32.2<br />
LucidDB was 0.9.1</p>
<p>The table I loaded the data into is:</p>
<div>
<div>
<div>
<ol>
<li>
<div>CREATE TABLE `ontime` <span>&#40;</span></div>
</li>
<li>
<div>&nbsp; `Year` year<span>&#40;</span><span>4</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Quarter` tinyint<span>&#40;</span><span>4</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Month` tinyint<span>&#40;</span><span>4</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `DayofMonth` tinyint<span>&#40;</span><span>4</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `DayOfWeek` tinyint<span>&#40;</span><span>4</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `FlightDate` date DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `UniqueCarrier` char<span>&#40;</span><span>7</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `AirlineID` int<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Carrier` char<span>&#40;</span><span>2</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `TailNum` varchar<span>&#40;</span><span>50</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `FlightNum` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Origin` char<span>&#40;</span><span>5</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `OriginCityName` varchar<span>&#40;</span><span>100</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `OriginState` char<span>&#40;</span><span>2</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `OriginStateFips` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `OriginStateName` varchar<span>&#40;</span><span>100</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `OriginWac` int<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Dest` char<span>&#40;</span><span>5</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `DestCityName` varchar<span>&#40;</span><span>100</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `DestState` char<span>&#40;</span><span>2</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `DestStateFips` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `DestStateName` varchar<span>&#40;</span><span>100</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `DestWac` int<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `CRSDepTime` int<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `DepTime` int<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `DepDelay` int<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `DepDelayMinutes` int<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `DepDel15` int<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `DepartureDelayGroups` int<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `DepTimeBlk` varchar<span>&#40;</span><span>20</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `TaxiOut` int<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `WheelsOff` int<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `WheelsOn` int<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `TaxiIn` int<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `CRSArrTime` int<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `ArrTime` int<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `ArrDelay` int<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `ArrDelayMinutes` int<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `ArrDel15` int<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `ArrivalDelayGroups` int<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `ArrTimeBlk` varchar<span>&#40;</span><span>20</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Cancelled` tinyint<span>&#40;</span><span>4</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `CancellationCode` char<span>&#40;</span><span>1</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Diverted` tinyint<span>&#40;</span><span>4</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `CRSElapsedTime` INT<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `ActualElapsedTime` INT<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `AirTime` INT<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Flights` INT<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Distance` INT<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `DistanceGroup` TINYINT<span>&#40;</span><span>4</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `CarrierDelay` INT<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `WeatherDelay` INT<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `NASDelay` INT<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `SecurityDelay` INT<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `LateAircraftDelay` INT<span>&#40;</span><span>11</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `FirstDepTime` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `TotalAddGTime` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `LongestAddGTime` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `DivAirportLandings` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `DivReachedDest` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `DivActualElapsedTime` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `DivArrDelay` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `DivDistance` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div1Airport` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div1WheelsOn` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div1TotalGTime` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div1LongestGTime` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div1WheelsOff` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div1TailNum` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div2Airport` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div2WheelsOn` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div2TotalGTime` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div2LongestGTime` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div2WheelsOff` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div2TailNum` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div3Airport` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div3WheelsOn` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div3TotalGTime` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div3LongestGTime` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div3WheelsOff` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div3TailNum` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div4Airport` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div4WheelsOn` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div4TotalGTime` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div4LongestGTime` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div4WheelsOff` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div4TailNum` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div5Airport` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div5WheelsOn` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div5TotalGTime` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div5LongestGTime` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div5WheelsOff` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL,</div>
</li>
<li>
<div>&nbsp; `Div5TailNum` varchar<span>&#40;</span><span>10</span><span>&#41;</span> DEFAULT NULL</div>
</li>
<li>
<div><span>&#41;</span> ENGINE=BRIGHTHOUSE DEFAULT CHARSET=latin1; </div>
</li>
</ol>
</div>
</div>
</div>
<p></p>
<p>The last fields, those starting with "Div*", are not really used.</p>
<p><strong>Load procedure:</strong></p>
<p><strong>Infobright</strong>: the loader that comes with the ICE version is very limited, and I had to transform the files to quote each field. After that, the load statement is:<br />
<code>mysql -S /tmp/mysql-ib.sock -e "LOAD DATA INFILE '/data/d1/AirData_ontime/${YEAR}_$i.txt.tr' INTO TABLE ontime FIELDS TERMINATED BY ',' ENCLOSED BY '\"'" ontime</code></p>
<p>The load time was about 30s/chunk for the initial years and up to 48s/chunk for 2008. The total load time was 8836 sec (2.45h).</p>
<p>The size of the database after load is 1.6GB, which is impressive and gives a <strong>1:34</strong> compression ratio.</p>
<p><strong>MonetDB</strong>: It took some time to figure out how to load text data (I really wish the developers would improve the documentation), but I finally ended up with the following load statement:</p>
<p><code>/usr/local/monetdb/bin/mclient -lsql --database=ontime -t -s "COPY 700000 records INTO ontime FROM '/data/d1/AirData_ontime/${YEAR}_$i.txt' USING DELIMITERS ',','\n','\"' NULL AS '';"<br />
</code></p>
<p>Load time: 13065 sec ( 3.6h)</p>
<p>The database size after load is 65GB, which is discouraging. It seems MonetDB does not use any compression, and the result is bigger than the original data.</p>
<p><strong>LucidDB</strong><br />
Here it took time to work out how to execute a command from the command line using the included <code>sqlline</code> utility. I did not understand how to do that, so I generated a big SQL file containing the load statements.</p>
<p>Loading each chunk was significantly slower, starting at about 60 sec/chunk for the initial years and steadily growing to 200 sec/chunk for 2000. On 2004 (after about 5h of loading) the load failed for some reason, and I did not try to repeat it, as it would not fit in the timeframe I allocated for this benchmark. Maybe I will try again sometime.</p>
<p><strong>Query execution</strong><br />
So I really only have data for InfoBright and MonetDB; let's see how fast they are on different queries.</p>
<p>The first favorite query of any database benchmarker is <code>SELECT count(*) FROM ontime;</code>. Both InfoBright and MonetDB execute it immediately, with a result of 117023290 rows.</p>
<p>Now some random queries I tried against both databases:</p>
<p>-Q1: Count of flights per day of week for 2000-2008<br />
<code>SELECT DayOfWeek, count(*) AS c FROM ontime WHERE YearD BETWEEN 2000 AND 2008 GROUP BY DayOfWeek ORDER BY c DESC<br />
</code></p>
<p>with result:</p>
<p>[ 5,    7509643 ]<br />
[ 1,    7478969 ]<br />
[ 4,    7453687 ]<br />
[ 3,    7412939 ]<br />
[ 2,    7370368 ]<br />
[ 7,    7095198 ]<br />
[ 6,    6425690 ]</p>
<p>And it took <strong>7.9s</strong> for MonetDB and <strong>12.13s</strong> for InfoBright.</p>
<p>-Q2: Count of flights delayed by more than 10 minutes, per day of week, for 2000-2008<br />
<code>SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND YearD BETWEEN 2000 AND 2008 GROUP BY DayOfWeek ORDER BY c DESC<br />
</code><br />
Result:</p>
<p>[ 5,    1816486 ]<br />
[ 4,    1665603 ]<br />
[ 1,    1582109 ]<br />
[ 7,    1555145 ]<br />
[ 3,    1431248 ]<br />
[ 2,    1348182 ]<br />
[ 6,    1202457 ]</p>
<p>And  <strong>0.9s</strong> execution for MonetDB and <strong>6.37s</strong> for InfoBright.</p>
<p>-Q3: Count of delays per airport for years 2000-2008<br />
<code>SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND YearD BETWEEN 2000 AND 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10</code></p>
<p>[ "ORD",        739286  ]<br />
[ "ATL",        736736  ]<br />
[ "DFW",        516957  ]<br />
[ "PHX",        336360  ]<br />
[ "LAX",        331997  ]<br />
[ "LAS",        307677  ]<br />
[ "DEN",        306594  ]<br />
[ "EWR",        262007  ]<br />
[ "IAH",        255789  ]<br />
[ "DTW",        248005  ]</p>
<p>with <strong>1.7s</strong> for MonetDB and <strong>7.29s</strong> for InfoBright</p>
<p>-Q4: Count of delays per carrier for 2007<br />
<code>SELECT carrier, count(*) FROM ontime WHERE DepDelay>10  AND YearD=2007 GROUP BY carrier ORDER BY 2 DESC</code></p>
<p>[ "WN", 296293  ]<br />
[ "AA", 176203  ]<br />
...</p>
<p>With <strong>0.27s</strong> for MonetDB and <strong>0.99s</strong> for InfoBright.</p>
<p>But obviously the more flights a carrier has, the more delays it has, so to be fair, let's calculate<br />
-Q5: Percentage of delays for each carrier for 2007.<br />
It is a bit trickier, as InfoBright and MonetDB need different queries:</p>
<p>MonetDB:<br />
<code>WITH t AS (SELECT carrier, count(*) AS c FROM ontime WHERE DepDelay>10  AND YearD=2007 GROUP BY carrier), t2 AS (SELECT carrier, count(*) AS c2 FROM ontime WHERE YearD=2007 GROUP BY carrier) SELECT t.carrier, c, c2, c*1000/c2 as c3 FROM t JOIN t2 ON (t.Carrier=t2.Carrier) ORDER BY c3 DESC<br />
</code></p>
<p>InfoBright:<br />
<code>SELECT t.carrier, c, c2, c*1000/c2 as c3 FROM (SELECT carrier, count(*) AS c FROM ontime WHERE DepDelay>10  AND Year=2007 GROUP BY carrier) t JOIN (SELECT carrier, count(*) AS c2 FROM ontime WHERE Year=2007 GROUP BY carrier) t2 ON (t.Carrier=t2.Carrier) ORDER BY c3 DESC;<br />
</code></p>
<p>I am using c*1000/c2 here because MonetDB seems to use integer arithmetic, and with c/c2 I received just 1. The c3 column is therefore expressed in tenths of a percent.</p>
<p>So result is:<br />
[ "EV", 101796, 286234, 355     ]<br />
[ "US", 135987, 485447, 280     ]<br />
[ "AA", 176203, 633857, 277     ]<br />
[ "MQ", 145630, 540494, 269     ]<br />
[ "AS", 42830,  160185, 267     ]<br />
[ "B6", 50740,  191450, 265     ]<br />
[ "UA", 128174, 490002, 261     ]<br />
...</p>
<p>with execution time: <strong>0.5s</strong> for MonetDB and <strong>2.92s</strong> for InfoBright.</p>
<p>Warning: do not try to EXPLAIN this query in InfoBright. MySQL is really stupid here, and EXPLAIN for this query took 6 min!</p>
<p>If you wonder about the carriers: EV is Atlantic Southeast Airlines and US is US Airways Inc.<br />
35.5% of Atlantic Southeast Airlines flights were delayed by more than 10 minutes!</p>
<p>-Q6: Let's try the same query for a wide range of years, 2000-2008:<br />
Result is:<br />
[ "EV", 443798, 1621140,        273     ]<br />
[ "AS", 299282, 1207960,        247     ]<br />
[ "B6", 191250, 787113, 242     ]<br />
[ "WN", 1885942,        7915940,        238     ]<br />
[ "FL", 287815, 1220663,        235     ]<br />
...</p>
<p>And execution time: <strong>12.5s</strong> for MonetDB and <strong>21.83s</strong> for InfoBright.</p>
<p>(AS is Alaska Airlines Inc. and B6 is JetBlue Airways)</p>
<p>-Q7: Percentage of delayed (more than 10 minutes) flights per year:</p>
<p>MonetDB:<br />
<code>with t as (select YEARD,count(*)*1000 as c1 from ontime WHERE DepDelay>10 GROUP BY YearD), t2 as (select YEARD,count(*) as c2 from ontime GROUP BY YEARD) select t.YEARD, c1/c2 FROM t JOIN t2 ON (t.YEARD=t2.YEARD)<br />
</code><br />
InfoBright:<br />
<code>SELECT t.YEARD, c1/c2 FROM (select YEARD,count(*)*1000 as c1 from ontime WHERE DepDelay>10 GROUP BY YearD) t JOIN (select YEARD,count(*) as c2 from ontime GROUP BY YEARD) t2 ON (t.YEARD=t2.YEARD)<br />
</code><br />
with result:<br />
[ 1988, 166     ]<br />
[ 1989, 199     ]<br />
[ 1990, 166     ]<br />
[ 1991, 147     ]<br />
[ 1992, 146     ]<br />
[ 1993, 154     ]<br />
[ 1994, 165     ]<br />
[ 1995, 193     ]<br />
[ 1996, 221     ]<br />
[ 1997, 191     ]<br />
[ 1998, 193     ]<br />
[ 1999, 200     ]<br />
[ 2000, 231     ]<br />
[ 2002, 163     ]<br />
[ 2003, 153     ]<br />
[ 2004, 192     ]<br />
[ 2005, 210     ]<br />
[ 2006, 231     ]<br />
[ 2007, 245     ]<br />
[ 2008, 219     ]</p>
<p>And with execution time: <strong>27.9s</strong> for MonetDB and <strong>8.59s</strong> for InfoBright.</p>
<p>It seems MonetDB does not like scanning a wide range of rows; the slowness here is similar to Q0.</p>
<p>-Q8: Finally, I tested the most popular destinations, in the sense of the count of directly connected cities, for different ranges of years.</p>
<p><code>SELECT DestCityName, COUNT( DISTINCT OriginCityName) FROM ontime WHERE Year BETWEEN N and M GROUP BY DestCityName ORDER BY 2 DESC LIMIT 10;<br />
</code></p>
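<p>To make clear what Q8 counts, here is a small sketch that runs the same query shape against an in-memory SQLite database. SQLite and the five toy rows are assumptions for illustration only; they are not the engines or the data used in this test.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ontime (Year INT, OriginCityName TEXT, DestCityName TEXT)"
)

# Hypothetical toy rows, not taken from the real ontime data set.
flights = [
    (2007, "Boston", "Chicago"),
    (2007, "Denver", "Chicago"),
    (2007, "Denver", "Chicago"),  # repeated route, counted once by DISTINCT
    (2007, "Boston", "Denver"),
    (2008, "Austin", "Denver"),   # outside the year range used below
]
conn.executemany("INSERT INTO ontime VALUES (?, ?, ?)", flights)

# Same shape as Q8, with N and M passed as parameters.
query = """
SELECT DestCityName, COUNT(DISTINCT OriginCityName)
FROM ontime
WHERE Year BETWEEN ? AND ?
GROUP BY DestCityName
ORDER BY 2 DESC
LIMIT 10
"""
top = conn.execute(query, (2007, 2007)).fetchall()
print(top)
```

<p>Each destination is ranked by how many different origin cities have a direct flight to it within the chosen years.</p>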
<p>Years,  InfoBright, MonetDB<br />
1y,   5.88s,   0.55s<br />
2y,   11.77s,  1.10s<br />
3y,   17.61s,  1.69s<br />
4y,   37.57s,  2.12s<br />
10y,  79.77s,  29.14s</p>
<p>-Q9: And to show that MonetDB does not like to scan many records, here is a query<br />
<code>select year,count(*) as c1 from ontime group by YEAR<br />
</code><br />
which shows how many records there are per year:<br />
+------+---------+<br />
| year | c1      |<br />
+------+---------+<br />
| 1989 | 5041200 |<br />
| 1990 | 5270893 |<br />
| 1991 | 5076925 |<br />
| 1992 | 5092157 |<br />
| 1993 | 5070501 |<br />
| 1994 | 5180048 |<br />
| 1995 | 5327435 |<br />
| 1996 | 5351983 |<br />
| 1997 | 5411843 |<br />
| 1998 | 5384721 |<br />
| 1999 | 5527884 |<br />
| 2000 | 5683047 |<br />
| 2001 | 5967780 |<br />
| 2002 | 5271359 |<br />
| 2003 | 6488540 |<br />
| 2004 | 7129270 |<br />
| 2005 | 7140596 |<br />
| 2006 | 7141922 |<br />
| 2007 | 7455458 |<br />
| 2008 | 7009728 |<br />
+------+---------+</p>
<p>And execution time: MonetDB: <strong>6.3s</strong> and InfoBright: <strong>0.31s</strong></p>
<p>To summarize all results, here is a graph:<br />
<img src="https://spreadsheets.google.com/a/percona.com/oimg?key=0AjsVX7AnrCYwdERIZFVqakRrcXplM0g0UktaUkRwenc&amp;oid=1&amp;v=1254520554070" /></p>
<p><strong>Conclusions</strong>:</p>
<ul>
<li>This experiment was not really about an InfoBright vs MonetDB comparison. My goal was to check how well available open source software can handle this kind of task.</li>
<li>Although InfoBright was slower for many queries, I think it is more production ready and stable. It has an Enterprise edition and support that you can buy. And the execution time is really good, taking into account the number of rows the engine had to crunch. For query Q8 (1-year range) a traditional transaction-oriented storage engine took 30 minutes to get the result.
</li>
<li>I really like MonetDB. I do not know what the magic behind the curtain is; they also do not have indexes, like InfoBright, but the results are impressive. On the drawbacks: the command line is weak (I had to use bash and pass the query as a parameter, otherwise I was not able to edit the query or check history), and the documentation also needs improvement. The fact that it does not use compression may also be a showstopper; the space consumption is worrying. By addressing these issues, I think MonetDB may have commercial success.</li>
<li>It is worth noting that MonetDB supports all INSERT / UPDATE / DELETE statements (and space is the price for that, as I understand), while InfoBright ICE edition allows only LOAD DATA. InfoBright Enterprise allows INSERT / UPDATE, but that also is not for online transaction processing.</li>
<li>Compression in InfoBright is impressive. Even the smaller rate of 1:10 means you can compress 1TB to 100GB, which is a significant saving of space.</li>
</ul>
<p>I am open to running other queries if you want to compare or get information about air traffic performance.</p>
    <hr noshade style="margin:0;height:1px" />
    <p>Entry posted by Vadim |
      <a href="http://www.mysqlperformanceblog.com/2009/10/02/analyzing-air-traffic-performance-with-infobright-and-monetdb/#comments">No comment</a></p>
]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2009/10/03/analyzing-air-traffic-performance-with-infobright-and-monetdb/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Quick comparison of MyISAM, Infobright, and MonetDB</title>
		<link>http://www.mysqlperformanceblog.com/2009/09/29/quick-comparison-of-myisam-infobright-and-monetdb/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=quick-comparison-of-myisam-infobright-and-monetdb</link>
		<comments>http://www.mysqlperformanceblog.com/2009/09/29/quick-comparison-of-myisam-infobright-and-monetdb/#comments</comments>
		<pubDate>Wed, 30 Sep 2009 04:56:58 +0000</pubDate>
		<dc:creator>MySQL Performance Blog</dc:creator>
				<category><![CDATA[infobright]]></category>
		<category><![CDATA[monetdb]]></category>
		<category><![CDATA[myisam]]></category>
		<category><![CDATA[optimizer]]></category>

		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=1258</guid>
		<description><![CDATA[Recently I was doing a little work for a client who has MyISAM tables with many columns (the same one Peter wrote about recently).  The client's performance is suffering in part because of the number of columns, which is over 200.  The queries are generally pretty simple (sums of columns), but they're ad-hoc (can access any columns) and it seems tailor-made for a column-oriented database.
I decided it was time to actually give Infobright a try.  They have an open-source community edition, which is crippled but not enough to matter for this test.  The "Knowledge Grid" architecture seems ideal for the types of queries the client runs.  But hey, why not also try MonetDB, another open-source column-oriented database I've been meaning to take a look at?
What follows is not a realistic benchmark, it's not scientific, it's just some quick and dirty tinkering.  I threw up an Ubuntu 9.04 small server on Amazon.  (I used this version because there's a .deb of MonetDB for it).  I created a table with 200 integer columns and loaded it with random numbers between 0 and 10000.  Initially I wanted to try with 4 million rows, but I had trouble with MonetDB -- there was not enough memory for this.  I didn't do anything fancy with the Amazon server -- I didn't fill up the /mnt disk to claim the bits, for example.  I used default tuning, out of the box, for all three databases.
The first thing I tried doing was loading the data with SQL statements.  I wanted to see how fast MyISAM vs. MonetDB would interpret really large INSERT statements, the kind produced by mysqldump.  But MonetDB choked and told me the number of columns mismatched.  I found reference to this on the mailing list, and skipped that.  I used LOAD DATA INFILE instead (MonetDB's version of that is COPY INTO).  This is the only way to get data into Infobright, anyway.
The tests
I loaded 1 million rows into the table.  Here's a graph of the times (smaller is better):

MyISAM took 88 seconds, MonetDB took 200, and Infobright took 486.  Here's the size of the resulting table on disk (smaller is better):

MyISAM is 787MB, MonetDB is 791MB, and Infobright is 317MB.  Next I ran three queries:
SELECT sum(c19), sum(c89), sum(c129) FROM t;
SELECT sum(c19), sum(c89), sum(c129) FROM t WHERE c11 &gt; 5;
SELECT sum(c19), sum(c89), sum(c129) FROM t WHERE c11 &lt; 5;






Graphs of query performance time for all three databases are really not very helpful, because MyISAM is so much slower that you can't see the graphs for the others.  So I'll give the numbers and then omit MyISAM from the graphs.  Here are the numbers for everything I measured:




metric, myisam, monetdb, infobright
size (bytes), 826000000, 829946723, 332497242
load time (seconds), 88, 200, 486
query1 time, 3.4, 0.012, 0.0007
query2 time, 3.4, 0.15, 1.2
query3 time, 2.5, 0.076, 0.15


And here is a graph of Infobright duking it out with MonetDB on the three queries I tested (shorter bar is better):

I ran each query a few times, discarded the first run, and averaged the next three together.
Notes on Infobright
A few miscellaneous notes: don't forget that Infobright is not just a storage engine plugged into MySQL.  It's a complete server with a different optimizer, etc.  This point was hammered home during the LOAD DATA INFILE, when I looked to see what was taking so long (I was tempted to use oprofile and see if there are sleep() statements).  What did I see in 'top' but a program called bhloader.  This bhloader program was the only thing doing anything; mysqld wasn't doing a thing.  LOAD DATA INFILE in Infobright isn't what it seems to be.  Otherwise, Infobright behaved about as I expected it to; it seemed pretty normal to a MySQL guy.
Notes on MonetDB
MonetDB was a bit different.  I had to be a bit resourceful to get everything going.  The documentation was for an old version, and was pretty sparse.  I had to go to the mailing lists to find the correct COPY syntax -- it wasn't the one listed in the online manual.  And there were funny things like a "merovingian" process (think "angel") that had to be started before the server would start, and I had to destroy the demo database and recreate it before I could start it as shown in the tutorials.
MonetDB has some unexpected properties; it is not a regular RDBMS.  Still, I'm quite impressed by it in some ways.  For example, it seems quite nicely put together, and it's not at all hard to learn.
It doesn't really "speak SQL" -- it speaks relational algebra, and the SQL is just a front-end to it.  You can talk XQuery to it, too.  I'm not sure if you can talk dirty to it, but you can sure talk nerdy to it: you can, should you choose to, give it instructions in MonetDB Assembly Language (MAL), the underlying language.  An abstracted front-end is a great idea; MySQL abstracts the storage backend, but why not do both?  Last I checked, Drizzle is going this direction, hurrah!
EXPLAIN is enlightening and frightening!  You get to see the intermediate code from the compiler.  The goggles, they do nothing!
From what I was able to learn about MonetDB in an hour, I believe it uses memory-mapped files to hold the data in-memory.  If this is true, it explains why I couldn't load 4 million rows into it (this was a 32-bit Amazon machine).
The SQL implementation is impressive.  It's a really solid subset of SQL:2003, much more than I expected.  It even has CTEs, although not recursive ones.  (No, there is no REPLACE, and there is no INSERT/ON DUPLICATE KEY UPDATE.)  I didn't try the XQuery interface.
Although I didn't try them out, there are what look like pretty useful instrumentation interfaces for profiling, debugging and the like.  The query timer is in milliseconds (why doesn't mysql show query times in microseconds?  I had to resort to Perl + Time::HiRes for timing the Infobright queries).
I think it can be quite useful.  However, I'm not quite sure it's useful for "general-purpose" database use -- there are a number of limitations (concurrency, for one) and it looks like it's still fairly experimental.
    
    Entry posted by Baron Schwartz]]></description>
			<content:encoded><![CDATA[<p>Recently I was doing a little work for a client who has MyISAM tables with many columns (the same one <a href="http://www.mysqlperformanceblog.com/2009/09/28/how-number-of-columns-affects-performance/">Peter wrote about recently</a>).  The client's performance is suffering in part because of the number of columns, which is over 200.  The queries are generally pretty simple (sums of columns), but they're ad-hoc (can access any columns) and it seems tailor-made for a column-oriented database.</p>
<p>I decided it was time to actually give <a href="http://www.infobright.org/">Infobright</a> a try.  They have an open-source community edition, which is crippled but not enough to matter for this test.  The "Knowledge Grid" architecture seems ideal for the types of queries the client runs.  But hey, why not also try <a href="http://monetdb.cwi.nl/">MonetDB</a>, another open-source column-oriented database I've been meaning to take a look at?</p>
<p>What follows is not a realistic benchmark, it's not scientific, it's just some quick and dirty tinkering.  I threw up an Ubuntu 9.04 small server on Amazon.  (I used this version because there's a .deb of MonetDB for it).  I created a table with 200 integer columns and loaded it with random numbers between 0 and 10000.  Initially I wanted to try with 4 million rows, but I had trouble with MonetDB -- there was not enough memory for this.  I didn't do anything fancy with the Amazon server -- I didn't fill up the /mnt disk to claim the bits, for example.  I used default tuning, out of the box, for all three databases.</p>
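<p>For anyone who wants to reproduce the setup, here is a rough sketch of how such a table and data file could be generated. The column names <code>c1</code> to <code>c200</code>, the CSV format and the helper names are assumptions; only the table name <code>t</code>, the column count and the value range come from this post.</p>

```python
import csv
import io
import random

NUM_COLUMNS = 200  # the test table has 200 integer columns


def create_table_sql(table="t", columns=NUM_COLUMNS):
    """Build a CREATE TABLE statement with columns c1..cN (assumed names)."""
    cols = ", ".join(f"c{i} INT" for i in range(1, columns + 1))
    return f"CREATE TABLE {table} ({cols})"


def write_rows(out, rows, columns=NUM_COLUMNS, seed=0):
    """Write comma-separated rows of random integers between 0 and 10000."""
    rng = random.Random(seed)
    writer = csv.writer(out)
    for _ in range(rows):
        writer.writerow([rng.randint(0, 10000) for _ in range(columns)])


# Small demonstration: five rows into an in-memory buffer instead of a file.
buf = io.StringIO()
write_rows(buf, rows=5)
print(create_table_sql()[:40])
print(len(buf.getvalue().splitlines()))
```

<p>Writing to a real file with one million rows gives input that can be bulk-loaded; the exact load statement options differ per database and are not shown here.</p>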
<p>The first thing I tried doing was loading the data with SQL statements.  I wanted to see how fast MyISAM vs. MonetDB would interpret really large INSERT statements, the kind produced by mysqldump.  But MonetDB choked and told me the number of columns mismatched.  I found reference to this on the mailing list, and skipped that.  I used LOAD DATA INFILE instead (MonetDB's version of that is COPY INTO).  This is the only way to get data into Infobright, anyway.</p>
<h3>The tests</h3>
<p>I loaded 1 million rows into the table.  Here's a graph of the times (smaller is better):</p>
<p><img class="alignnone size-full wp-image-1259" title="Load Time" src="http://www.mysqlperformanceblog.com/wp-content/uploads/2009/09/load_time.png" alt="Load Time" width="450" height="320" /></p>
<p>MyISAM took 88 seconds, MonetDB took 200, and Infobright took 486.  Here's the size of the resulting table on disk (smaller is better):</p>
<p><img src="http://www.mysqlperformanceblog.com/wp-content/uploads/2009/09/table_size_bytes.png" alt="Table Size in Bytes" title="Table Size in Bytes" width="450" height="320" class="alignnone size-full wp-image-1270" /></p>
<p>MyISAM is 787MB, MonetDB is 791MB, and Infobright is 317MB.  Next I ran three queries:</p>
<div><span><a href="http://www.mysqlperformanceblog.com/2009/09/29/quick-comparison-of-myisam-infobright-and-monetdb/">PLAIN TEXT</a></span></div>
<div><span>SQL:</span>
<div>
<div>
<ol>
<li>
<div><span>SELECT</span> sum<span>&#40;</span>c19<span>&#41;</span>, sum<span>&#40;</span>c89<span>&#41;</span>, sum<span>&#40;</span>c129<span>&#41;</span> <span>FROM</span> t;</div>
</li>
<li>
<div><span>SELECT</span> sum<span>&#40;</span>c19<span>&#41;</span>, sum<span>&#40;</span>c89<span>&#41;</span>, sum<span>&#40;</span>c129<span>&#41;</span> <span>FROM</span> t <span>WHERE</span> c11&gt; <span>5</span>;</div>
</li>
<li>
<div><span>SELECT</span> sum<span>&#40;</span>c19<span>&#41;</span>, sum<span>&#40;</span>c89<span>&#41;</span>, sum<span>&#40;</span>c129<span>&#41;</span> <span>FROM</span> t <span>WHERE</span> c11 &lt;<span>5</span>; </div>
</li>
</ol>
</div>
</div>
</div>
<p></p>
<p>Graphs of query performance time for all three databases are really not very helpful, because MyISAM is so much slower that you can't see the graphs for the others.  So I'll give the numbers and then omit MyISAM from the graphs.  Here are the numbers for everything I measured:</p>
<table borders="1">
<thead>
<tr>
<td></td>
<th>myisam</th>
<th>monetdb</th>
<th>infobright</th>
</tr>
</thead>
<tr>
<th>size (bytes)    </th>
<td>826000000    </td>
<td>829946723</td>
<td>332497242</td>
</tr>
<tr>
<th>load time (seconds)    </th>
<td>88    </td>
<td>200    </td>
<td>486</td>
</tr>
<tr>
<th>query1 time    </th>
<td>3.4    </td>
<td>0.012    </td>
<td>0.0007</td>
</tr>
<tr>
<th>query2 time    </th>
<td>3.4    </td>
<td>0.15    </td>
<td>1.2</td>
</tr>
<tr>
<th>query3 time    </th>
<td>2.5    </td>
<td>0.076    </td>
<td>0.15</td>
</tr>
</table>
<p>And here is a graph of Infobright duking it out with MonetDB on the three queries I tested (shorter bar is better):</p>
<p><img src="http://www.mysqlperformanceblog.com/wp-content/uploads/2009/09/monetdb_infobright_query_time1.png" alt="MonetDB vs Infobright Query Time" title="MonetDB vs Infobright Query Time" width="492" height="320" class="alignnone size-full wp-image-1265" /></p>
<p>I ran each query a few times, discarded the first run, and averaged the next three together.</p>
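<p>The measurement procedure above can be sketched roughly as follows. This is an illustration in Python, not the script actually used for these numbers; <code>run_query</code> is a hypothetical callable that executes one query against whichever database is being tested.</p>

```python
import time


def average_warm_time(run_query, warm_runs=3):
    """Run once and discard it, then average the next warm_runs timings."""
    run_query()  # first run only warms caches; its time is thrown away
    timings = []
    for _ in range(warm_runs):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


# Stand-in workload so the sketch runs without a database.
print(average_warm_time(lambda: sum(range(100000))))
```

<p>Discarding the first run keeps one-off costs such as cold caches out of the average.</p>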
<h3>Notes on Infobright</h3>
<p>A few miscellaneous notes: don't forget that Infobright is <em>not</em> just a storage engine plugged into MySQL.  It's a complete server with a different optimizer, etc.  This point was hammered home during the LOAD DATA INFILE, when I looked to see what was taking so long (I was tempted to use oprofile and see if there are sleep() statements).  What did I see in 'top' but a program called bhloader.  This bhloader program was the only thing doing anything; mysqld wasn't doing a thing.  LOAD DATA INFILE in Infobright isn't what it seems to be.  Otherwise, Infobright behaved about as I expected it to; it seemed pretty normal to a MySQL guy.</p>
<h3>Notes on MonetDB</h3>
<p>MonetDB was a bit different.  I had to be a bit resourceful to get everything going.  The documentation was for an old version, and was pretty sparse.  I had to go to the mailing lists to find the correct COPY syntax -- it wasn't the one listed in the online manual.  And there were funny things like a "merovingian" process (think "angel") that had to be started before the server would start, and I had to destroy the demo database and recreate it before I could start it as shown in the tutorials.</p>
<p>MonetDB has some unexpected properties; it is not a regular RDBMS.  Still, I'm quite impressed by it in some ways.  For example, it seems quite nicely put together, and it's not at all hard to learn.</p>
<p>It doesn't really "speak SQL" -- it speaks relational algebra, and the SQL is just a front-end to it.  You can talk XQuery to it, too.  I'm not sure if you can talk dirty to it, but you can sure talk nerdy to it: you can, should you choose to, give it instructions in MonetDB Assembly Language (MAL), the underlying language.  An abstracted front-end is a great idea; MySQL abstracts the storage backend, but why not do both?  Last I checked, Drizzle is going this direction, hurrah!</p>
<p>EXPLAIN is enlightening and frightening!  You get to see the intermediate code from the compiler.  <a href="http://monetdb.cwi.nl/projects/monetdb/SQL/Documentation/EXPLAIN-Statement.html">The goggles, they do nothing!</a></p>
<p>From what I was able to learn about MonetDB in an hour, I believe it uses memory-mapped files to hold the data in-memory.  If this is true, it explains why I couldn't load 4 million rows into it (this was a 32-bit Amazon machine).</p>
<p>The SQL implementation is impressive.  It's a really solid subset of SQL:2003, much more than I expected.  It even has CTEs, although not recursive ones.  (No, there is no REPLACE, and there is no INSERT/ON DUPLICATE KEY UPDATE.)  I didn't try the XQuery interface.</p>
<p>Although I didn't try them out, there are what look like pretty useful instrumentation interfaces for profiling, debugging and the like.  The query timer is in milliseconds (why doesn't mysql show query times in microseconds?  I had to resort to Perl + Time::HiRes for timing the Infobright queries).</p>
<p>I think it can be quite useful.  However, I'm not quite sure it's useful for "general-purpose" database use -- there are a number of limitations (concurrency, for one) and it looks like it's still fairly experimental.</p>
    <hr noshade style="margin:0;height:1px" />
    <p>Entry posted by Baron Schwartz |
      <a href="http://www.mysqlperformanceblog.com/2009/09/29/quick-comparison-of-myisam-infobright-and-monetdb/#comments">No comment</a></p>
]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2009/09/30/quick-comparison-of-myisam-infobright-and-monetdb/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What data types does your innovative storage engine NOT support?</title>
		<link>http://www.xaprb.com/blog/2009/09/29/what-data-types-does-your-innovative-storage-engine-not-support/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=what-data-types-does-your-innovative-storage-engine-not-support</link>
		<comments>http://www.xaprb.com/blog/2009/09/29/what-data-types-does-your-innovative-storage-engine-not-support/#comments</comments>
		<pubDate>Tue, 29 Sep 2009 06:33:44 +0000</pubDate>
		<dc:creator>Baron Schwartz (xaprb)</dc:creator>
				<category><![CDATA[data-types]]></category>
		<category><![CDATA[infobright]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[Storage Engines]]></category>

		<guid isPermaLink="false">http://www.xaprb.com/blog/?p=1315</guid>
		<description><![CDATA[I&#8217;ve been investigating a few different storage engines for MySQL lately, and something I&#8217;ve noticed is that they all list what they support, but they generally don&#8217;t say what they don&#8217;t support.  For example, Infobright&#8217;s documentation shows a list of every data type supported.  What&#8217;s missing?  Hmm, I don&#8217;t see BLOB, BIT, ENUM, SET&#8230; it&#8217;s kind of hard to tell, isn&#8217;t it?  I don&#8217;t have an encyclopedic list of all the MySQL data types in my head.  The same thing is true of the list of functions that are optimized inside Infobright&#8217;s own code instead of at the server layer.  I can see what&#8217;s optimized, but I can&#8217;t see whether FUNC_WHATEVER() is optimized without scanning the page &#8212; and there&#8217;s no list of un-optimized functions.

I don&#8217;t mean to pick on Infobright.  I&#8217;ve recently looked at another third-party storage engine and they did exactly the same thing.  It&#8217;s just that the docs I saw weren&#8217;t public as far as I know, so I can&#8217;t mention them by name.

For a product like this, I think the most helpful thing would be a page explaining two things: 1) the enhancements or extra functionality over the standard MySQL server, and 2) the unavailable or degraded functionality.  It would also be good to have both high-level and detailed versions of this.

]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been investigating a few different storage engines for MySQL lately, and something I&#8217;ve noticed is that they all list what they support, but they generally don&#8217;t say what they don&#8217;t support.  For example, Infobright&#8217;s documentation shows <a href="http://www.infobright.org/wiki/Supported_Data_Types_and_Values/">a list of every data type supported</a>.  What&#8217;s missing?  Hmm, I don&#8217;t see BLOB, BIT, ENUM, SET&#8230; it&#8217;s kind of hard to tell, isn&#8217;t it?  I don&#8217;t have an encyclopedic list of all the MySQL data types in my head.  The same thing is true of the list of functions that are optimized inside Infobright&#8217;s own code instead of at the server layer.  I can see what&#8217;s optimized, but I can&#8217;t see whether FUNC_WHATEVER() is optimized without scanning the page &#8212; and there&#8217;s no list of un-optimized functions.</p>

<p>I don&#8217;t mean to pick on Infobright.  I&#8217;ve recently looked at another third-party storage engine and they did exactly the same thing.  It&#8217;s just that the docs I saw weren&#8217;t public as far as I know, so I can&#8217;t mention them by name.</p>

<p>For a product like this, I think the most helpful thing would be a page explaining two things: 1) the enhancements or extra functionality over the standard MySQL server, and 2) the unavailable or degraded functionality.  It would also be good to have both high-level and detailed versions of this.</p>

]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2009/09/29/what-data-types-does-your-innovative-storage-engine-not-support/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A peek under the hood in Infobright 3.2 storage engine</title>
		<link>http://www.fishpool.org/post/2009/09/21/A-peek-under-the-hood-in-Infobright-3.2-storage-engine?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=a-peek-under-the-hood-in-infobright-3-2-storage-engine</link>
		<comments>http://www.fishpool.org/post/2009/09/21/A-peek-under-the-hood-in-Infobright-3.2-storage-engine#comments</comments>
		<pubDate>Mon, 21 Sep 2009 14:50:00 +0000</pubDate>
		<dc:creator>Osma Ahvenlampi</dc:creator>
				<category><![CDATA[data warehousing]]></category>
		<category><![CDATA[infobright]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false"></guid>
		<description><![CDATA[I've been meaning to post some real-world data on the performance of the
Infobright 3.2 release which happened a few weeks ago after an extended release
candidate period. We're just preparing our upgrades now, so I don't have any
performance notes over significant data sets or complicated queries to post
quite yet.
To make up for that, I decided to address a particular annoyance of mine in
the community edition, first because it hadn't been addressed in the 3.2
release (and really, I'm hoping that doing this gets it included in 3.2.1), and
second, simply because the engine being open source means I can. I feel
being OSS is one of Infobright's biggest strengths, in addition to being a
pretty amazing piece of performance for such a simple, undemanding package in
general, and not making use of that would be a shame. Read on for details.
The annoyance? It's pretty difficult to tell, as a user, what the engine is
doing while it's running queries. EXPLAIN isn't hooked
up and falls back to the general MySQL code path (which, due to the storage
engine not exporting any index information, simply thinks any query will be a
full table scan). The SHOW
PROCESSLIST state of every query simply says &#34;init&#34; for the entire
duration of the query, which could be minutes at a time. The engine does write quite
a lot of detailed information into an optional debug log, but that's on the
database server, inaccessible to the user and application, as well as being
rather hard to read.
Fortunately, the existence of those debug statements meant it was very easy
to find the places into which I could insert some status instrumentation for
the process list. This is certainly not perfect - it doesn't help with
predicting execution paths before running a query, and the convention for process
list status is far more terse than what the debug output of the engine could
produce. I could have simply copied the same level of detail into the process
list, but that doesn't seem to be the norm in MySQL engines and, assuming that
Infobright will later support the SHOW PROFILES
feature, would not be helpful anyway.
The patch is below (or download it as raw text), and
it applies on top of 3.2 src package downloadable at the Infobright.org
site. Builds with 'make EDITION=community release' and works for me, but
use this at your own risk. Please do post notes and comments, though; I'd be
interested to hear from other users. I'm sure the patch could be much
improved, too.
Now, what would be really interesting is if the debug
log's information of the knowledge grid evaluation could be turned into EXPLAIN
output, but that would require more understanding of MySQL internals than what
I have...
This was the first time I looked at the source code for Infobright, and the
second or third time I did so for MySQL in general. ICE is also pretty impressive
in its techniques, not only being the only integrated columnar engine, but
also having more join strategies than other engines I've used, and so forth.
The code is tough to follow though, and the source package included a huge
amount of unused stuff, like a copy of both the InnoDB and NDB storage engines,
neither of which is built from the code base. I guess a bit of clean-up would
make this somewhat more approachable.
diff -ur
infobright-3.2-x86_64src/src/storage/brighthouse/core/JoinerGeneral.cpp
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/JoinerGeneral.cpp

---
infobright-3.2-x86_64src/src/storage/brighthouse/core/JoinerGeneral.cpp  
 2009-08-26 21:26:43.000000000 +0300
+++
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/JoinerGeneral.cpp  
 2009-09-18 12:13:23.001506795 +0300
@@ -26,6 +26,7 @@
         mind-&#62;Empty();
         return no_desc;  
         // all done
     }
+    thd_proc_info(&#38;ConnectionInfoOnTLS.Get().Thd(),
&#34;joining&#34;);
     //if(desc[0].val1.vc-&#62;IsConst() &#38;&#38;
desc[0].val2.vc == NULL) {
     //    // Special case: if there is a
chance for one-dimensional filtering, execute one condition only.
     //    no_desc = 1;
diff -ur infobright-3.2-x86_64src/src/storage/brighthouse/core/JoinerHash.cpp
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/JoinerHash.cpp
---
infobright-3.2-x86_64src/src/storage/brighthouse/core/JoinerHash.cpp  
 2009-08-26 21:26:43.000000000 +0300
+++
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/JoinerHash.cpp  
 2009-09-18 15:31:15.263505407 +0300
@@ -55,6 +55,7 @@
     /////////////////// Prepare all descriptor information
/////////////
     // TODO: prepare a common language for both joined
columns, if not compatible
     bool first_found = true;
+    thd_proc_info(&#38;ConnectionInfoOnTLS.Get().Thd(), &#34;hash
join&#34;);
     DimensionVector dims1(mind);  
     // Initial dimension descriptions
     DimensionVector dims2(mind);
     for(int i = 0; i &#60; desc.size(); i++) {
@@ -203,6 +204,7 @@
 
     _int64 hash_row = 0;      
         // hash_row = 0, otherwise deadlock
for null on the first position
     _int64 traversed_rows = 0;
+    thd_proc_info(&#38;ConnectionInfoOnTLS.Get().Thd(), &#34;hash
join traverse&#34;);
     while(mit.IsValid()) {
         if(m_conn.killed())
             throw
KilledRCException();
@@ -247,6 +249,7 @@
     int no_of_matching_rows;
     MIIterator mit(mind, matched_dims);
     MIDummyIterator combined_mit(mind);  
     // a combined iterator for checking non-hashed
conditions, if any
+    thd_proc_info(&#38;ConnectionInfoOnTLS.Get().Thd(), &#34;hash
join tuples&#34;);
     while(mit.IsValid()) {
         if(m_conn.killed())
             throw
KilledRCException();
diff -ur infobright-3.2-x86_64src/src/storage/brighthouse/core/JoinerLoop.cpp
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/JoinerLoop.cpp
---
infobright-3.2-x86_64src/src/storage/brighthouse/core/JoinerLoop.cpp  
 2009-08-26 21:26:43.000000000 +0300
+++
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/JoinerLoop.cpp  
 2009-09-18 12:12:46.056444175 +0300
@@ -34,6 +34,8 @@
     int cur_dim1, cur_dim2;
     int attr1, attr2;
 
+    thd_proc_info(&#38;ConnectionInfoOnTLS.Get().Thd(), &#34;loop
join&#34;);
+
     ParseDescriptor( desc[0], cur_t1, cur_t2, cur_dim1,
cur_dim2, attr1, attr2, loc_op );
 
   
 //////////////////////////////////////////////////////////////////////////////////

diff -ur infobright-3.2-x86_64src/src/storage/brighthouse/core/JoinerSort.cpp
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/JoinerSort.cpp
---
infobright-3.2-x86_64src/src/storage/brighthouse/core/JoinerSort.cpp  
 2009-08-26 21:26:43.000000000 +0300
+++
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/JoinerSort.cpp  
 2009-09-18 15:32:01.921256944 +0300
@@ -29,6 +29,7 @@
     VirtualColumn *vc2 = desc[0].val1.vc;
     dim1 = vc1-&#62;GetDim();
     dim2 = vc2-&#62;GetDim();
+    thd_proc_info(&#38;ConnectionInfoOnTLS.Get().Thd(), &#34;sort
join&#34;);
     // The only supported cases (for now):
     if(dim1 == -1 &#124;&#124; dim2 == -1 &#124;&#124;  
         // one-dim only
         mind-&#62;GetFilter(dim1) == NULL
&#124;&#124;
@@ -128,6 +129,8 @@
     s1.Lock();
     s2.Lock();
 
+    thd_proc_info(&#38;ConnectionInfoOnTLS.Get().Thd(), &#34;sort
join apply&#34;);
+
     MINewContents new_mind(mind);
     new_mind.SetDimension(dim1);
     new_mind.SetDimension(dim2);
diff -ur
infobright-3.2-x86_64src/src/storage/brighthouse/core/MIRoughSorter.cpp
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/MIRoughSorter.cpp

---
infobright-3.2-x86_64src/src/storage/brighthouse/core/MIRoughSorter.cpp  
 2009-08-26 21:26:43.000000000 +0300
+++
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/MIRoughSorter.cpp  
 2009-09-16 23:17:07.487972507 +0300
@@ -77,6 +77,7 @@
 
     ///////////////////////// the main sorting loop
through bigblocks /////////
     if(sorting_needed)    {
+      
 thd_proc_info(&#38;ConnectionInfoOnTLS.Get().Thd(), &#34;sorting
roughly&#34;);
       
 rccontrol.lock(mind-&#62;m_conn-&#62;GetThreadID()) &#60;&#60; &#34;Sorting
roughly multiindex...&#34; &#60;&#60; unlock;
         _int64 start_tuple = 0;
         _int64 stop_tuple = 0;
diff -ur
infobright-3.2-x86_64src/src/storage/brighthouse/core/MIUpdatingIterator.cpp
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/MIUpdatingIterator.cpp

---
infobright-3.2-x86_64src/src/storage/brighthouse/core/MIUpdatingIterator.cpp  
 2009-08-26 21:26:43.000000000 +0300
+++
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/MIUpdatingIterator.cpp  
 2009-09-18 15:12:03.293257577 +0300
@@ -116,6 +116,7 @@
 {
     if(!changed)
         return;
+    thd_proc_info(&#38;ConnectionInfoOnTLS.Get().Thd(),
&#34;commit&#34;);
     if(one_dim_filter) {
       
 one_dim_filter-&#62;Commit();        //
working directly on multiindex filer (special case)
         mind-&#62;UpdateNoTuples();
diff -ur infobright-3.2-x86_64src/src/storage/brighthouse/core/MultiSorter.cpp
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/MultiSorter.cpp
---
infobright-3.2-x86_64src/src/storage/brighthouse/core/MultiSorter.cpp  
 2009-08-26 21:26:43.000000000 +0300
+++
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/MultiSorter.cpp  
 2009-09-18 11:31:50.202579444 +0300
@@ -372,6 +372,7 @@
     }
     else
       
 rccontrol.lock(m_conn-&#62;GetThreadID()) &#60;&#60; &#34;Sorting &#34; &#60;&#60;
no_obj &#60;&#60; &#34; rows...&#34; &#60;&#60; unlock;
+    thd_proc_info(&#38;ConnectionInfoOnTLS.Get().Thd(),
&#34;sorting&#34;);
     if(max_rate &#60; cur_rate)
         max_rate = cur_rate;
     int byte_ind = 4;      
     // no. of bytes to encode row index (4 or 8)
diff -ur
infobright-3.2-x86_64src/src/storage/brighthouse/core/Query_exeq_low.cpp
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/Query_exeq_low.cpp

---
infobright-3.2-x86_64src/src/storage/brighthouse/core/Query_exeq_low.cpp  
 2009-08-26 21:26:43.000000000 +0300
+++
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/Query_exeq_low.cpp  
 2009-09-18 11:59:48.065253205 +0300
@@ -553,6 +553,7 @@
   
 ////////////////////////////////////////////////////////////////////////

     if(desc.size() &#60; 1)
         return;
+    thd_proc_info(&#38;ConnectionInfoOnTLS.Get().Thd(),
&#34;preparing&#34;);
     DelayWhereConditions(desc);
     SyntacticalDescriptorListPreprocessing(desc, mind,
table);
 
@@ -619,6 +620,8 @@
         return;
     }
 
+    thd_proc_info(&#38;ConnectionInfoOnTLS.Get().Thd(),
&#34;executing&#34;);
+    
     ///////////////// Apply all one-dimensional filters
(after where, i.e. without outer joins)
     for(uint i = 0; i &#60; desc.size(); i++)
         if(!desc[i].done &#38;&#38;
desc[i].IsInner() &#38;&#38; !desc[i].IsType_Join() &#38;&#38;
!desc[i].IsDelayed()) {
@@ -655,6 +658,7 @@
     }
 
   
 /////////////////////////////////////////////////////////////////////////////////////

+    thd_proc_info(&#38;ConnectionInfoOnTLS.Get().Thd(),
&#34;joining&#34;);
     DescriptorJoinOrdering(desc, mind);
 
     ///// descriptor display for joins
diff -ur
infobright-3.2-x86_64src/src/storage/brighthouse/core/Query_optimize_RS.cpp
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/Query_optimize_RS.cpp

---
infobright-3.2-x86_64src/src/storage/brighthouse/core/Query_optimize_RS.cpp  
 2009-08-26 21:26:43.000000000 +0300
+++
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/Query_optimize_RS.cpp  
 2009-09-18 11:27:55.040256644 +0300
@@ -104,6 +104,7 @@
               
               
   MultiIndex &#38;mind,
               
               
   vector&#60;Descriptor&#62; &#38;desc)
 {
+    thd_proc_info(&#38;ConnectionInfoOnTLS.Get().Thd(),
&#34;evaluating P2P&#34;);
     bool is_nonempty = true;
     // init by previous values of mind (if any
nontrivial)
     for(int i = 0; i &#60; mind.NoDimensions(); i++)
{
diff -ur
infobright-3.2-x86_64src/src/storage/brighthouse/core/RCEngine_results.cpp
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/RCEngine_results.cpp

---
infobright-3.2-x86_64src/src/storage/brighthouse/core/RCEngine_results.cpp  
 2009-08-26 21:26:43.000000000 +0300
+++
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/RCEngine_results.cpp  
 2009-09-16 23:02:02.200221840 +0300
@@ -37,7 +37,7 @@
 void RCEngine::SendResults(na::DataSource* exectree, THD* thd,
select_result *res, List&#60;Item&#62; &#38;fields)
 {
     int error = 0;
-    thd-&#62;proc_info=&#34;Sending data&#34;;
+    thd_proc_info(thd,&#34;Sending data&#34;);
     DBUG_PRINT(&#34;info&#34;, (&#34;%s&#34;, thd-&#62;proc_info));
 
     res-&#62;send_fields(fields, Protocol::SEND_NUM_ROWS &#124;
Protocol::SEND_EOF);
@@ -136,7 +136,7 @@
 void RCEngine::SendResults(JustATable&#38; results, THD* thd,
select_result *res, List&#60;Item&#62; &#38;fields, ConnectionInfo *conn)
 {
     int error = 0;
-    thd-&#62;proc_info=&#34;Sending data&#34;;
+    thd_proc_info(thd,&#34;Sending data&#34;);
     DBUG_PRINT(&#34;info&#34;, (&#34;%s&#34;, thd-&#62;proc_info));
 
     res-&#62;send_fields(fields, Protocol::SEND_NUM_ROWS &#124;
Protocol::SEND_EOF);
diff -ur
infobright-3.2-x86_64src/src/storage/brighthouse/core/RoughJoinWatcher.cpp
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/RoughJoinWatcher.cpp

---
infobright-3.2-x86_64src/src/storage/brighthouse/core/RoughJoinWatcher.cpp  
 2009-08-26 21:26:43.000000000 +0300
+++
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/RoughJoinWatcher.cpp  
 2009-09-16 23:18:25.821225774 +0300
@@ -184,6 +184,7 @@
     //        - after
checking all the result:
     //          
 - if still potentially_excluded =&#62; set as non-intersecting and up to
date
 
+    thd_proc_info(&#38;ConnectionInfoOnTLS.Get().Thd(),
&#34;updating P2P&#34;);
     rccontrol.lock(mind.m_conn-&#62;GetThreadID()) &#60;&#60;
&#34;Updating P2P...&#34; &#60;&#60; unlock;
     anything_to_update = false;
     _int64 pairs_already_updated = 0;
diff -ur
infobright-3.2-x86_64src/src/storage/brighthouse/core/TempTable_aggregate.cpp
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/TempTable_aggregate.cpp

---
infobright-3.2-x86_64src/src/storage/brighthouse/core/TempTable_aggregate.cpp  
 2009-08-26 21:26:43.000000000 +0300
+++
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/TempTable_aggregate.cpp  
 2009-09-18 11:30:24.149255211 +0300
@@ -65,6 +65,7 @@
     ::Filter tuple_left(mit.NoTuples());
     tuple_left.Set();
     gbw.SetDistinctTuples(tuple_left.NoObj());
+    thd_proc_info(&#38;ConnectionInfoOnTLS.Get().Thd(),
&#34;aggregating&#34;);
     do {
         if(rccontrol.isOn())  {
           
 if(upper_approx_of_groups == 1)
@@ -222,6 +223,7 @@
 void TempTable::MultiDimensionalDistinctScan(GroupByWrapper&#38; gbw,
DimensionVector &#38;dims)
 {
   
 MEASURE_FET(&#34;TempTable::MultiDimensionalDistinctScan(GroupByWrapper&#38;
gbw)&#34;);
+    thd_proc_info(&#38;ConnectionInfoOnTLS.Get().Thd(),
&#34;Distinct scan&#34;);
     while(gbw.AnyOmittedByDistinct()) {  
 /////////// any distincts omitted? =&#62; another pass needed
         ///// Some displays
         _int64 max_size_for_display =
0;]]></description>
			<content:encoded><![CDATA[<p>I've been meaning to post some real-world data on the performance of the
Infobright 3.2 release which happened a few weeks ago after an extended release
candidate period. We're just preparing our upgrades now, so I don't have any
performance notes over significant data sets or complicated queries to post
quite yet.</p>
<p>To make up for that, I decided to address a particular annoyance of mine in
the community edition, first because it hadn't been addressed in the 3.2
release (and really, I'm hoping that doing this gets it included in 3.2.1), and
second, simply because the engine being open source means I can. I feel
<a href="http://mervadrian.wordpress.com/2009/07/07/infobright-bids-to-anchor-an-open-source-dw-ecosystem/">
being OSS is one of Infobright's biggest strengths</a>, in addition to being a
pretty amazing piece of performance for such a simple, undemanding package in
general, and not making use of that would be a shame. Read on for details.</p>    <p>The annoyance? It's pretty difficult to tell, as a user, what the engine is
doing while it's running queries. <a href="http://dev.mysql.com/doc/refman/5.1/en/explain.html">EXPLAIN</a> isn't hooked
up and falls back to the general MySQL code path (which, due to the storage
engine not exporting any index information, simply thinks any query will be a
full table scan). The <a href="http://dev.mysql.com/doc/refman/5.1/en/show-processlist.html">SHOW
PROCESSLIST</a> state of every query simply says &quot;init&quot; for the entire
duration of the query, which could be minutes at a time. The engine does write quite
a lot of detailed information into an optional debug log, but that's on the
database server, inaccessible to the user and application, as well as being
rather hard to read.</p>
<p>Fortunately, the existence of those debug statements meant it was very easy
to find the places into which I could insert some status instrumentation for
the process list. This is certainly not perfect - it doesn't help with
predicting execution paths before running a query, and the convention for process
list status is far more terse than what the debug output of the engine could
produce. I could have simply copied the same level of detail into the process
list, but that doesn't seem to be the norm in MySQL engines and, assuming that
Infobright will later support the <a href="http://dev.mysql.com/doc/refman/5.1/en/show-profiles.html">SHOW PROFILES</a>
feature, would not be helpful anyway.</p>
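<p>As a rough illustration of the mechanism the patch relies on, the sketch below mimics how a per-connection status string is overwritten at the start of each phase so that a process-list reader can sample it. MockThd, set_state, observed_state and run_query are hypothetical stand-ins written for this example; they are not the MySQL or Infobright API, and the phase names are simply borrowed from the patch.</p>

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the server's per-connection THD object.
struct MockThd {
    const char* proc_info = "init";  // the state a process list would report
};

// Hypothetical analogue of a status setter: store a pointer to a static
// status string and hand back the previous one.
inline const char* set_state(MockThd& thd, const char* state) {
    const char* previous = thd.proc_info;
    thd.proc_info = state;
    return previous;
}

// What a SHOW PROCESSLIST-style reader would see for this connection.
inline std::string observed_state(const MockThd& thd) {
    return thd.proc_info ? thd.proc_info : "";
}

// Simulated query: each phase announces itself before doing its work.
inline std::vector<std::string> run_query(MockThd& thd) {
    std::vector<std::string> seen;
    for (const char* phase : {"preparing", "executing", "joining", "Sending data"}) {
        set_state(thd, phase);
        seen.push_back(observed_state(thd));  // snapshot taken during the phase
    }
    return seen;
}
```

<p>Calling run_query on a fresh MockThd leaves the state at the last phase it announced. A real server thread would sample the string asynchronously rather than after every update, which is why short, static strings are the usual convention.</p>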
<p>The patch is below (or <a href="http://www.fishpool.org/public/code/infobright-3.2-proc-info.patch">download it as raw text</a>), and
it applies on top of 3.2 src package <a href="http://www.infobright.org/Download/ICE/">downloadable at the Infobright.org
site</a>. Builds with 'make EDITION=community release' and works for me, but
use this at your own risk. Please do post notes and comments, though; I'd be
interested to hear from other users. I'm sure the patch could be much
improved, too.</p>
<p>Now, what would be <strong>really</strong> interesting is if the debug
log's information of the knowledge grid evaluation could be turned into EXPLAIN
output, but that would require more understanding of MySQL internals than what
I have...</p>
<p>This was the first time I looked at the source code for Infobright, and the
second or third time I did so for MySQL in general. ICE is also pretty impressive
in its techniques, not only being the only integrated columnar engine, but
also having more join strategies than other engines I've used, and so forth.
The code is tough to follow though, and the source package included a huge
amount of unused stuff, like a copy of both the InnoDB and NDB storage engines,
neither of which is built from the code base. I guess a bit of clean-up would
make this somewhat more approachable.</p>
<code>diff -ur
infobright-3.2-x86_64src/src/storage/brighthouse/core/JoinerGeneral.cpp
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/JoinerGeneral.cpp<br />

---
infobright-3.2-x86_64src/src/storage/brighthouse/core/JoinerGeneral.cpp  
 2009-08-26 21:26:43.000000000 +0300<br />
+++
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/JoinerGeneral.cpp  
 2009-09-18 12:13:23.001506795 +0300<br />
@@ -26,6 +26,7 @@<br />
         mind-&gt;Empty();<br />
         return no_desc;  
         // all done<br />
     }<br />
+    thd_proc_info(&amp;ConnectionInfoOnTLS.Get().Thd(),
&quot;joining&quot;);<br />
     //if(desc[0].val1.vc-&gt;IsConst() &amp;&amp;
desc[0].val2.vc == NULL) {<br />
     //    // Special case: if there is a
chance for one-dimensional filtering, execute one condition only.<br />
     //    no_desc = 1;<br />
diff -ur infobright-3.2-x86_64src/src/storage/brighthouse/core/JoinerHash.cpp
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/JoinerHash.cpp<br />
---
infobright-3.2-x86_64src/src/storage/brighthouse/core/JoinerHash.cpp  
 2009-08-26 21:26:43.000000000 +0300<br />
+++
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/JoinerHash.cpp  
 2009-09-18 15:31:15.263505407 +0300<br />
@@ -55,6 +55,7 @@<br />
     /////////////////// Prepare all descriptor information
/////////////<br />
     // TODO: prepare a common language for both joined
columns, if not compatible<br />
     bool first_found = true;<br />
+    thd_proc_info(&amp;ConnectionInfoOnTLS.Get().Thd(), &quot;hash
join&quot;);<br />
     DimensionVector dims1(mind);  
     // Initial dimension descriptions<br />
     DimensionVector dims2(mind);<br />
     for(int i = 0; i &lt; desc.size(); i++) {<br />
@@ -203,6 +204,7 @@<br />
 <br />
     _int64 hash_row = 0;      
         // hash_row = 0, otherwise deadlock
for null on the first position<br />
     _int64 traversed_rows = 0;<br />
+    thd_proc_info(&amp;ConnectionInfoOnTLS.Get().Thd(), &quot;hash
join traverse&quot;);<br />
     while(mit.IsValid()) {<br />
         if(m_conn.killed())<br />
             throw
KilledRCException();<br />
@@ -247,6 +249,7 @@<br />
     int no_of_matching_rows;<br />
     MIIterator mit(mind, matched_dims);<br />
     MIDummyIterator combined_mit(mind);  
     // a combined iterator for checking non-hashed
conditions, if any<br />
+    thd_proc_info(&amp;ConnectionInfoOnTLS.Get().Thd(), &quot;hash
join tuples&quot;);<br />
     while(mit.IsValid()) {<br />
         if(m_conn.killed())<br />
             throw
KilledRCException();<br />
diff -ur infobright-3.2-x86_64src/src/storage/brighthouse/core/JoinerLoop.cpp
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/JoinerLoop.cpp<br />
---
infobright-3.2-x86_64src/src/storage/brighthouse/core/JoinerLoop.cpp  
 2009-08-26 21:26:43.000000000 +0300<br />
+++
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/JoinerLoop.cpp  
 2009-09-18 12:12:46.056444175 +0300<br />
@@ -34,6 +34,8 @@<br />
     int cur_dim1, cur_dim2;<br />
     int attr1, attr2;<br />
 <br />
+    thd_proc_info(&amp;ConnectionInfoOnTLS.Get().Thd(), &quot;loop
join&quot;);<br />
+<br />
     ParseDescriptor( desc[0], cur_t1, cur_t2, cur_dim1,
cur_dim2, attr1, attr2, loc_op );<br />
 <br />
   
 //////////////////////////////////////////////////////////////////////////////////<br />

diff -ur infobright-3.2-x86_64src/src/storage/brighthouse/core/JoinerSort.cpp
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/JoinerSort.cpp<br />
---
infobright-3.2-x86_64src/src/storage/brighthouse/core/JoinerSort.cpp  
 2009-08-26 21:26:43.000000000 +0300<br />
+++
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/JoinerSort.cpp  
 2009-09-18 15:32:01.921256944 +0300<br />
@@ -29,6 +29,7 @@<br />
     VirtualColumn *vc2 = desc[0].val1.vc;<br />
     dim1 = vc1-&gt;GetDim();<br />
     dim2 = vc2-&gt;GetDim();<br />
+    thd_proc_info(&amp;ConnectionInfoOnTLS.Get().Thd(), &quot;sort
join&quot;);<br />
     // The only supported cases (for now):<br />
     if(dim1 == -1 || dim2 == -1 ||  
         // one-dim only<br />
         mind-&gt;GetFilter(dim1) == NULL
||<br />
@@ -128,6 +129,8 @@<br />
     s1.Lock();<br />
     s2.Lock();<br />
 <br />
+    thd_proc_info(&amp;ConnectionInfoOnTLS.Get().Thd(), &quot;sort
join apply&quot;);<br />
+<br />
     MINewContents new_mind(mind);<br />
     new_mind.SetDimension(dim1);<br />
     new_mind.SetDimension(dim2);<br />
diff -ur
infobright-3.2-x86_64src/src/storage/brighthouse/core/MIRoughSorter.cpp
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/MIRoughSorter.cpp<br />

---
infobright-3.2-x86_64src/src/storage/brighthouse/core/MIRoughSorter.cpp  
 2009-08-26 21:26:43.000000000 +0300<br />
+++
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/MIRoughSorter.cpp  
 2009-09-16 23:17:07.487972507 +0300<br />
@@ -77,6 +77,7 @@<br />
 <br />
     ///////////////////////// the main sorting loop
through bigblocks /////////<br />
     if(sorting_needed)    {<br />
+      
 thd_proc_info(&amp;ConnectionInfoOnTLS.Get().Thd(), &quot;sorting
roughly&quot;);<br />
       
 rccontrol.lock(mind-&gt;m_conn-&gt;GetThreadID()) &lt;&lt; &quot;Sorting
roughly multiindex...&quot; &lt;&lt; unlock;<br />
         _int64 start_tuple = 0;<br />
         _int64 stop_tuple = 0;<br />
diff -ur
infobright-3.2-x86_64src/src/storage/brighthouse/core/MIUpdatingIterator.cpp
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/MIUpdatingIterator.cpp<br />

---
infobright-3.2-x86_64src/src/storage/brighthouse/core/MIUpdatingIterator.cpp  
 2009-08-26 21:26:43.000000000 +0300<br />
+++
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/MIUpdatingIterator.cpp  
 2009-09-18 15:12:03.293257577 +0300<br />
@@ -116,6 +116,7 @@<br />
 {<br />
     if(!changed)<br />
         return;<br />
+    thd_proc_info(&amp;ConnectionInfoOnTLS.Get().Thd(),
&quot;commit&quot;);<br />
     if(one_dim_filter) {<br />
       
 one_dim_filter-&gt;Commit();        //
working directly on multiindex filer (special case)<br />
         mind-&gt;UpdateNoTuples();<br />
diff -ur infobright-3.2-x86_64src/src/storage/brighthouse/core/MultiSorter.cpp
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/MultiSorter.cpp<br />
---
infobright-3.2-x86_64src/src/storage/brighthouse/core/MultiSorter.cpp  
 2009-08-26 21:26:43.000000000 +0300<br />
+++
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/MultiSorter.cpp  
 2009-09-18 11:31:50.202579444 +0300<br />
@@ -372,6 +372,7 @@<br />
     }<br />
     else<br />
       
 rccontrol.lock(m_conn-&gt;GetThreadID()) &lt;&lt; &quot;Sorting &quot; &lt;&lt;
no_obj &lt;&lt; &quot; rows...&quot; &lt;&lt; unlock;<br />
+    thd_proc_info(&amp;ConnectionInfoOnTLS.Get().Thd(),
&quot;sorting&quot;);<br />
     if(max_rate &lt; cur_rate)<br />
         max_rate = cur_rate;<br />
     int byte_ind = 4;      
     // no. of bytes to encode row index (4 or 8)<br />
diff -ur
infobright-3.2-x86_64src/src/storage/brighthouse/core/Query_exeq_low.cpp
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/Query_exeq_low.cpp<br />

---
infobright-3.2-x86_64src/src/storage/brighthouse/core/Query_exeq_low.cpp  
 2009-08-26 21:26:43.000000000 +0300<br />
+++
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/Query_exeq_low.cpp  
 2009-09-18 11:59:48.065253205 +0300<br />
@@ -553,6 +553,7 @@<br />
   
 ////////////////////////////////////////////////////////////////////////<br />

     if(desc.size() &lt; 1)<br />
         return;<br />
+    thd_proc_info(&amp;ConnectionInfoOnTLS.Get().Thd(),
&quot;preparing&quot;);<br />
     DelayWhereConditions(desc);<br />
     SyntacticalDescriptorListPreprocessing(desc, mind,
table);<br />
 <br />
@@ -619,6 +620,8 @@<br />
         return;<br />
     }<br />
 <br />
+    thd_proc_info(&amp;ConnectionInfoOnTLS.Get().Thd(),
&quot;executing&quot;);<br />
+    <br />
     ///////////////// Apply all one-dimensional filters
(after where, i.e. without outer joins)<br />
     for(uint i = 0; i &lt; desc.size(); i++)<br />
         if(!desc[i].done &amp;&amp;
desc[i].IsInner() &amp;&amp; !desc[i].IsType_Join() &amp;&amp;
!desc[i].IsDelayed()) {<br />
@@ -655,6 +658,7 @@<br />
     }<br />
 <br />
   
 /////////////////////////////////////////////////////////////////////////////////////<br />

+    thd_proc_info(&amp;ConnectionInfoOnTLS.Get().Thd(),
&quot;joining&quot;);<br />
     DescriptorJoinOrdering(desc, mind);<br />
 <br />
     ///// descriptor display for joins<br />
diff -ur
infobright-3.2-x86_64src/src/storage/brighthouse/core/Query_optimize_RS.cpp
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/Query_optimize_RS.cpp<br />

---
infobright-3.2-x86_64src/src/storage/brighthouse/core/Query_optimize_RS.cpp  
 2009-08-26 21:26:43.000000000 +0300<br />
+++
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/Query_optimize_RS.cpp  
 2009-09-18 11:27:55.040256644 +0300<br />
@@ -104,6 +104,7 @@<br />
               
               
   MultiIndex &amp;mind,<br />
               
               
   vector&lt;Descriptor&gt; &amp;desc)<br />
 {<br />
+    thd_proc_info(&amp;ConnectionInfoOnTLS.Get().Thd(),
&quot;evaluating P2P&quot;);<br />
     bool is_nonempty = true;<br />
     // init by previous values of mind (if any
nontrivial)<br />
     for(int i = 0; i &lt; mind.NoDimensions(); i++)
{<br />
diff -ur
infobright-3.2-x86_64src/src/storage/brighthouse/core/RCEngine_results.cpp
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/RCEngine_results.cpp<br />

---
infobright-3.2-x86_64src/src/storage/brighthouse/core/RCEngine_results.cpp  
 2009-08-26 21:26:43.000000000 +0300<br />
+++
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/RCEngine_results.cpp  
 2009-09-16 23:02:02.200221840 +0300<br />
@@ -37,7 +37,7 @@<br />
 void RCEngine::SendResults(na::DataSource* exectree, THD* thd,
select_result *res, List&lt;Item&gt; &amp;fields)<br />
 {<br />
     int error = 0;<br />
-    thd-&gt;proc_info=&quot;Sending data&quot;;<br />
+    thd_proc_info(thd,&quot;Sending data&quot;);<br />
     DBUG_PRINT(&quot;info&quot;, (&quot;%s&quot;, thd-&gt;proc_info));<br />
 <br />
     res-&gt;send_fields(fields, Protocol::SEND_NUM_ROWS |
Protocol::SEND_EOF);<br />
@@ -136,7 +136,7 @@<br />
 void RCEngine::SendResults(JustATable&amp; results, THD* thd,
select_result *res, List&lt;Item&gt; &amp;fields, ConnectionInfo *conn)<br />
 {<br />
     int error = 0;<br />
-    thd-&gt;proc_info=&quot;Sending data&quot;;<br />
+    thd_proc_info(thd,&quot;Sending data&quot;);<br />
     DBUG_PRINT(&quot;info&quot;, (&quot;%s&quot;, thd-&gt;proc_info));<br />
 <br />
     res-&gt;send_fields(fields, Protocol::SEND_NUM_ROWS |
Protocol::SEND_EOF);<br />
diff -ur
infobright-3.2-x86_64src/src/storage/brighthouse/core/RoughJoinWatcher.cpp
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/RoughJoinWatcher.cpp<br />

---
infobright-3.2-x86_64src/src/storage/brighthouse/core/RoughJoinWatcher.cpp  
 2009-08-26 21:26:43.000000000 +0300<br />
+++
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/RoughJoinWatcher.cpp  
 2009-09-16 23:18:25.821225774 +0300<br />
@@ -184,6 +184,7 @@<br />
     //        - after
checking all the result:<br />
     //          
 - if still potentially_excluded =&gt; set as non-intersecting and up to
date<br />
 <br />
+    thd_proc_info(&amp;ConnectionInfoOnTLS.Get().Thd(),
&quot;updating P2P&quot;);<br />
     rccontrol.lock(mind.m_conn-&gt;GetThreadID()) &lt;&lt;
&quot;Updating P2P...&quot; &lt;&lt; unlock;<br />
     anything_to_update = false;<br />
     _int64 pairs_already_updated = 0;<br />
diff -ur
infobright-3.2-x86_64src/src/storage/brighthouse/core/TempTable_aggregate.cpp
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/TempTable_aggregate.cpp<br />

---
infobright-3.2-x86_64src/src/storage/brighthouse/core/TempTable_aggregate.cpp  
 2009-08-26 21:26:43.000000000 +0300<br />
+++
infobright-3.2-x86_64src.new/src/storage/brighthouse/core/TempTable_aggregate.cpp  
 2009-09-18 11:30:24.149255211 +0300<br />
@@ -65,6 +65,7 @@<br />
     ::Filter tuple_left(mit.NoTuples());<br />
     tuple_left.Set();<br />
     gbw.SetDistinctTuples(tuple_left.NoObj());<br />
+    thd_proc_info(&amp;ConnectionInfoOnTLS.Get().Thd(),
&quot;aggregating&quot;);<br />
     do {<br />
         if(rccontrol.isOn())  {<br />
           
 if(upper_approx_of_groups == 1)<br />
@@ -222,6 +223,7 @@<br />
 void TempTable::MultiDimensionalDistinctScan(GroupByWrapper&amp; gbw,
DimensionVector &amp;dims)<br />
 {<br />
   
 MEASURE_FET(&quot;TempTable::MultiDimensionalDistinctScan(GroupByWrapper&amp;
gbw)&quot;);<br />
+    thd_proc_info(&amp;ConnectionInfoOnTLS.Get().Thd(),
&quot;Distinct scan&quot;);<br />
     while(gbw.AnyOmittedByDistinct()) {  
 /////////// any distincts omitted? =&gt; another pass needed<br />
         ///// Some displays<br />
         _int64 max_size_for_display =
0;<br /></code>]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2009/09/21/a-peek-under-the-hood-in-infobright-3-2-storage-engine/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Infobright Tuning on OpenSolaris/Solaris 10</title>
		<link>http://blogs.sun.com/jkshah/entry/infobright_tuning_on_opensolaris_solaris?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=infobright-tuning-on-opensolarissolaris-10</link>
		<comments>http://blogs.sun.com/jkshah/entry/infobright_tuning_on_opensolaris_solaris#comments</comments>
		<pubDate>Tue, 15 Sep 2009 07:00:00 +0000</pubDate>
		<dc:creator>Jignesh Shah</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[infobright]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[opensolaris]]></category>
		<category><![CDATA[solaris]]></category>

		<guid isPermaLink="false">http://blogs.sun.com/jkshah/entry/infobright_tuning_on_opensolaris_solaris</guid>
		<description><![CDATA[Recently I was working on a project which used Infobright as the database. The version tested was 3.1.1, on both OpenSolaris and Solaris 10. Infobright is a column-oriented database engine for MySQL, primarily targeted at data warehouse and data mining deployments.
  While everything was working as expected, we did notice that as the number of concurrent connections querying the database grew, query performance deteriorated fast, in the sense that the machine was delivering little benefit from parallelism. Now this sucks! (apparently sucks is now a technical term). It sucks because the server definitely has many cores, and each Infobright query can still peg at most one core. So the expectation would typically be to handle at least as many concurrent queries as there are cores (figuratively speaking; in reality it depends).
  Anyway, we started digging into this problem. First we noticed that CPU usage was heavy, so I/O was probably not the culprit (in this case). Using plockstat we found:
      
  # plockstat -A -p 2039    (where 2039 is the PID of mysqld server running 4 simultaneous queries)

^C 
Mutex hold 

Count     nsec Lock                         Caller 
------------------------------------------------------------------------------- 
3634393     1122 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b 
3626645     1047 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZdlPv+0xe 
    2 536317885 0x177b878                    mysqld`_ZN7IBMutex6UnlockEv+0x12 
   12  6338626 mysqld`LOCK_open             mysqld`_Z10open_tableP3THDP13st_table_listP11st_mem_rootPbj+0x55a 
 9057     1275 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b 
 8493     1051 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZdlPv+0xe 
 7928     1119 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZdlPv+0xe 
    5   326542 0x177b878                    mysqld`_ZN7IBMutex6UnlockEv+0x12 
  683     1189 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b 
  564     1339 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b 
  564     1274 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b 
  564     1156 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZdlPv+0xe 
   17    36292 0x1777780                    mysqld`_ZN7IBMutex6UnlockEv+0x12 
    2   246377 mysqld`rccontrol+0x18        mysqld`_ZN7IBMutex6UnlockEv+0x12 
   57     8074 mysqld`_iob+0xa8             libstdc++.so.6.0.3`_ZNSo5flushEv+0x30 
  218     1479 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b 
    4    78172 mysqld`rccontrol+0x18        mysqld`_ZN7IBMutex6UnlockEv+0x12 
    4    75161 mysqld`rccontrol+0x18        mysqld`_ZN7IBMutex6UnlockEv+0x12 
….

R/W reader hold 

Count     nsec Lock                         Caller 
------------------------------------------------------------------------------- 
   44     1171 mysqld`THR_LOCK_plugin       mysqld`_Z24plugin_foreach_with_maskP3THDPFcS0_P13st_plugin_intPvEijS3_+0xa3 
   12     3144 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c 
    1    14125 0xf7aa18                     mysqld`_ZN11Query_cache21send_result_to_clientEP3THDPcj+0x536 
    1    12089 0xf762e8                     mysqld`_ZN11Query_cache21send_result_to_clientEP3THDPcj+0x536 
    2     1886 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c 
    2     1776 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c 
    1     3006 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c 
    1     2765 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c 
    1     1797 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c 
    1     1131 mysqld`THR_LOCK_plugin       mysqld`_Z24plugin_foreach_with_maskP3THDPFcS0_P13st_plugin_intPvEijS3_+0xa3 

Mutex block 

Count     nsec Lock                         Caller 
------------------------------------------------------------------------------- 
 2175 11867793 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZdlPv+0xe 
 1931 12334706 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b 
    3 93404485 libc.so.1`libc_malloc_lock   mysqld`my_malloc+0x32 
    1    11581 libc.so.1`libc_malloc_lock   mysqld`_ZN11Item_stringD0Ev+0x49 
    1     1769 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZnwmRKSt9nothrow_t+0x20
..
 
    
  Now typically, if you see libc_malloc_lock in the plockstat output for a multi-threaded program, it is a sign that the default malloc/free routines in libc are the culprit, since the default malloc does not scale well for multi-threaded programs. There are alternative implementations that scale better than the default. Two such options that are already part of OpenSolaris and Solaris 10 are libmtmalloc.so and libumem.so. Either one can be used in place of the default, without recompiling the binaries, by preloading it before the startup command. 
  In the case of the 64-bit Infobright binaries, we did that by modifying the startup script mysqld-ib and adding the following lines just before the invocation of the mysqld command. 
    
  LD_PRELOAD_64=/usr/lib/64/libmtmalloc.so;
export LD_PRELOAD_64 
  What we found was that the response time of each query was now more in line with what it would be if the query were executing on its own. Well, not entirely true, but you get the point. With 4 concurrent queries, total execution time went from the 1X baseline to roughly a 2.5X reduction. 
  Similarly, when we used libumem.so, the reduction was more like 3X with 4 queries executing concurrently. 
  LD_PRELOAD_64=/usr/lib/64/libumem.so;
export LD_PRELOAD_64 
   
    
  Definitely something to use for all Infobright installations on OpenSolaris or Solaris 10.
  In a following blog post we will see other ways to tune Infobright that are not as drastic as this one but still buy some percentage of improvement. Stay tuned!!  
  ]]></description>
			<content:encoded><![CDATA[<p>Recently I was working on a project that used <a href="http://www.infobright.com">Infobright</a> as the database. The version tested was 3.1.1, on both OpenSolaris and Solaris 10. Infobright is a column-oriented database engine for MySQL, primarily targeted at data warehouse and data mining deployments.</p> 
  <p>While everything was working as expected, we did notice one thing: as the number of concurrent connections querying the database grew, query performance deteriorated fast, in the sense that little parallel benefit was being squeezed out of the machine. Now this sucks! (Apparently "sucks" is now a technical term.) It sucks because the server definitely has many cores, and each Infobright query can typically peg at most one core. So the expectation would be to handle at least as many concurrent queries as there are cores (figuratively speaking; in reality it depends).</p> 
  <p>Anyway, we started digging into this problem. First we noticed that CPU usage was heavy, so IO was probably not the culprit (in this case). Using plockstat we found:</p> 
  <pre># plockstat -A -p 2039    (where 2039 is the PID of mysqld server running 4 simultaneous queries)

^C 
Mutex hold 

Count     nsec Lock                         Caller 
------------------------------------------------------------------------------- 
3634393     1122 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b 
3626645     1047 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZdlPv+0xe 
    2 536317885 0x177b878                    mysqld`_ZN7IBMutex6UnlockEv+0x12 
   12  6338626 mysqld`LOCK_open             mysqld`_Z10open_tableP3THDP13st_table_listP11st_mem_rootPbj+0x55a 
 9057     1275 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b 
 8493     1051 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZdlPv+0xe 
 7928     1119 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZdlPv+0xe 
    5   326542 0x177b878                    mysqld`_ZN7IBMutex6UnlockEv+0x12 
  683     1189 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b 
  564     1339 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b 
  564     1274 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b 
  564     1156 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZdlPv+0xe 
   17    36292 0x1777780                    mysqld`_ZN7IBMutex6UnlockEv+0x12 
    2   246377 mysqld`rccontrol+0x18        mysqld`_ZN7IBMutex6UnlockEv+0x12 
   57     8074 mysqld`_iob+0xa8             libstdc++.so.6.0.3`_ZNSo5flushEv+0x30 
  218     1479 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b 
    4    78172 mysqld`rccontrol+0x18        mysqld`_ZN7IBMutex6UnlockEv+0x12 
    4    75161 mysqld`rccontrol+0x18        mysqld`_ZN7IBMutex6UnlockEv+0x12 
….

R/W reader hold 

Count     nsec Lock                         Caller 
------------------------------------------------------------------------------- 
   44     1171 mysqld`THR_LOCK_plugin       mysqld`_Z24plugin_foreach_with_maskP3THDPFcS0_P13st_plugin_intPvEijS3_+0xa3 
   12     3144 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c 
    1    14125 0xf7aa18                     mysqld`_ZN11Query_cache21send_result_to_clientEP3THDPcj+0x536 
    1    12089 0xf762e8                     mysqld`_ZN11Query_cache21send_result_to_clientEP3THDPcj+0x536 
    2     1886 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c 
    2     1776 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c 
    1     3006 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c 
    1     2765 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c 
    1     1797 mysqld`LOCK_grant            mysqld`_Z11check_grantP3THDmP13st_table_listjjb+0x38c 
    1     1131 mysqld`THR_LOCK_plugin       mysqld`_Z24plugin_foreach_with_maskP3THDPFcS0_P13st_plugin_intPvEijS3_+0xa3 

Mutex block 

Count     nsec Lock                         Caller 
------------------------------------------------------------------------------- 
 2175 11867793 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZdlPv+0xe 
 1931 12334706 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b 
    3 93404485 libc.so.1`libc_malloc_lock   mysqld`my_malloc+0x32 
    1    11581 libc.so.1`libc_malloc_lock   mysqld`_ZN11Item_stringD0Ev+0x49 
    1     1769 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZnwmRKSt9nothrow_t+0x20
..
</pre> 
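  <p>One way to see how much of the contention belongs to the allocator is to total the blocked time per lock. The Python sketch below does that for the Mutex block rows shown above. It assumes the nsec column is the average time per event, so that count times nsec approximates the total; the function name <code>blocked_ns_by_lock</code> is made up for this illustration and is not part of plockstat.</p>

```python
from collections import defaultdict

# "Mutex block" rows copied from the plockstat output above.
MUTEX_BLOCK = """\
 2175 11867793 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZdlPv+0xe
 1931 12334706 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_Znwm+0x2b
    3 93404485 libc.so.1`libc_malloc_lock   mysqld`my_malloc+0x32
    1    11581 libc.so.1`libc_malloc_lock   mysqld`_ZN11Item_stringD0Ev+0x49
    1     1769 libc.so.1`libc_malloc_lock   libstdc++.so.6.0.3`_ZnwmRKSt9nothrow_t+0x20
"""


def blocked_ns_by_lock(section):
    """Sum count * nsec per lock (assumes nsec is the average per event)."""
    totals = defaultdict(int)
    for line in section.splitlines():
        fields = line.split()
        # Data rows have exactly four fields and start with a number;
        # headers, separator rules and elision markers are skipped.
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        count, nsec, lock, _caller = fields
        totals[lock] += int(count) * int(nsec)
    return dict(totals)


totals = blocked_ns_by_lock(MUTEX_BLOCK)
for lock, ns in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{lock}: {ns / 1e9:.1f} s blocked")
```

  <p>Every row in that section is libc_malloc_lock, so the sketch ends up with a single entry. Feeding it the Mutex hold section instead would split the time across several locks, including LOCK_open.</p>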
  <p>Now typically, if you see libc_malloc_lock in the plockstat output for a multi-threaded program, it is a sign that the default malloc/free routines in libc are the culprit, since the default malloc does not scale well for multi-threaded programs. There are alternative implementations that scale better than the default. Two such options that are already part of OpenSolaris and Solaris 10 are libmtmalloc.so and libumem.so. Either one can be used in place of the default, without recompiling the binaries, by preloading it before the startup command.</p> 
  <p>In the case of the 64-bit Infobright binaries, we did that by modifying the startup script mysqld-ib and adding the following lines just before the invocation of the mysqld command.</p> 
  <pre>LD_PRELOAD_64=/usr/lib/64/libmtmalloc.so;
export LD_PRELOAD_64</pre> 
  <p>What we found was that the response time of each query was now more in line with what it would be if the query were executing on its own. Well, not entirely true, but you get the point. With 4 concurrent queries, total execution time went from the 1X baseline to roughly a 2.5X reduction.</p> 
  <p>Similarly, when we used libumem.so, the reduction was more like 3X with 4 queries executing concurrently.</p> 
  <pre>LD_PRELOAD_64=/usr/lib/64/libumem.so;
export LD_PRELOAD_64</pre> 
  <p>Definitely something to use for all Infobright installations on OpenSolaris or Solaris 10.</p>
  <p>In a following blog post we will see other ways to tune Infobright that are not as drastic as this one but still buy some percentage of improvement. Stay tuned!!</p> 
  ]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2009/09/15/infobright-tuning-on-opensolarissolaris-10/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>EU Should Protect MySQL-based Special Purpose Database Vendors</title>
		<link>http://rpbouman.blogspot.com/2009/09/eu-should-protect-mysql-based-special.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=eu-should-protect-mysql-based-special-purpose-database-vendors</link>
		<comments>http://rpbouman.blogspot.com/2009/09/eu-should-protect-mysql-based-special.html#comments</comments>
		<pubDate>Sat, 12 Sep 2009 00:35:00 +0000</pubDate>
		<dc:creator>Roland Bouman</dc:creator>
				<category><![CDATA[antitrust]]></category>
		<category><![CDATA[Business Intelligence]]></category>
		<category><![CDATA[calpont]]></category>
		<category><![CDATA[data warehousing]]></category>
		<category><![CDATA[eu]]></category>
		<category><![CDATA[exadata]]></category>
		<category><![CDATA[infobright]]></category>
		<category><![CDATA[Kickfire]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Oracle / Sun deal]]></category>
		<category><![CDATA[scaledb]]></category>

		<guid isPermaLink="false"></guid>
		<description><![CDATA[In my recent post on the EU antitrust regulators' probe into the Oracle Sun merger I did not mention an important class of stakeholders: the MySQL-based special purpose database startups. By these I mean: Kickfire, Infobright, Calpont, and ScaleDB. I think it's safe to say the first three are comparable in the sense that they are all analytical databases: they are designed for data warehousing and business intelligence applications. ScaleDB might be a good fit for those applications, but I think its architecture is sufficiently different from the first three to not call it an analytical database. For Kickfire and Infobright, the selling point is that they offer a relatively cheap solution for building large data warehouses and responsive business intelligence applications. (I can't really find enough information on Calpont pricing, although they do mention low total cost of ownership.) An extra selling point is that they are MySQL compatible, which may make some difference for some customers. But that compatibility is in my opinion not as important as the availability of a serious data warehousing solution at a really sharp price. Now, in my previous post, I mentioned that the MySQL and Oracle RDBMS products are very different, and I do not perceive them as competing. Instead of trying to kill the plain MySQL database server product, Oracle should take advantage of a huge opportunity to help shape the web by being a good steward, leading ongoing MySQL development, and in addition, enabling their current Oracle Enterprise customers to build cheap LAMP-based websites (with the possibility of adding value by offering Oracle to MySQL data integration). For these analytical database solutions, things may be different though. I think these MySQL-based analytical databases really are competitive with Oracle's Exadata analytical appliance. Oracle could pose a serious threat to these MySQL-based analytical database vendors. 
After the merger, Oracle would certainly be in a position to hamper these vendors by restricting the non-GPL licensed usage of MySQL. In a recent ad, Oracle vowed to increase investments in developing Sun's hardware and operating system technology. This would eventually put them in an even better position to create appliances like Exadata, allowing them to ditch an external hardware partner like HP (which is their Exadata hardware partner). So, all in all, in my opinion the EU should definitely take a serious look at the dynamics of the analytical database market and decide how much impact the Oracle / Sun merger could have on this particular class of MySQL OEM customers. The rise of these relatively cheap MySQL-based analytical databases is a very interesting development for the business intelligence and data warehousing space in general, and means a big win for customers that need affordable data warehousing / business intelligence. It would be a shame if it were curtailed by Oracle. After the merger, Oracle sure would have the means and the motive, so if someone needs protection, I think it would be these MySQL-based vendors of analytical databases. As always, these are just my musings and opinions - speculation is free. Feel free to correct me, add applause or point out my ignorance :)]]></description>
			<content:encoded><![CDATA[In <a href="http://rpbouman.blogspot.com/2009/09/mysql-factor-in-eus-decision.html" >my recent post</a> on the EU antitrust regulators' probe into the Oracle Sun merger I did not mention an important class of stakeholders: the MySQL-based special purpose database startups. By these I mean:<br /><ul><li><a href="http://www.kickfire.com/" >Kickfire</a></li><li><a href="http://www.infobright.org/" >Infobright</a></li><li><a href="http://www.calpont.com/" >Calpont</a></li><li><a href="http://www.scaledb.com/" >ScaleDB</a></li></ul><br />I think it's safe to say the first three are comparable in the sense that they are all analytical databases: they are designed for data warehousing and business intelligence applications. ScaleDB might be a good fit for those applications, but I think its architecture is sufficiently different from the first three to not call it an analytical database.<br /><br />For Kickfire and Infobright, the selling point is that they offer a relatively cheap solution for building large data warehouses and responsive business intelligence applications. (I can't really find enough information on Calpont pricing, although they do mention low total cost of ownership.) An extra selling point is that they are MySQL compatible, which may make some difference for some customers. But that compatibility is in my opinion not as important as the availability of a serious data warehousing solution at a really sharp price.<br /><br />Now, in my previous post, I mentioned that the MySQL and Oracle RDBMS products are very different, and I do not perceive them as competing. 
Instead of trying to kill the plain MySQL database server product, Oracle should take advantage of a huge opportunity to help shape the web by being a good steward, leading ongoing MySQL development, and in addition, enabling their current Oracle Enterprise customers to build cheap LAMP-based websites (with the possibility of adding value by offering Oracle to MySQL data integration).<br /><br />For these analytical database solutions, things may be different though. <br /><br />I think these MySQL-based analytical databases really are competitive with Oracle's <a href="http://www.oracle.com/database/exadata.html" >Exadata</a> analytical appliance. Oracle could pose a serious threat to these MySQL-based analytical database vendors. After the merger, Oracle would certainly be in a position to hamper these vendors by restricting the non-GPL licensed usage of MySQL.<br /><a href="http://www.oracle.com/features/suncustomers.html" >In a recent ad</a>, Oracle vowed to increase investments in developing Sun's hardware and operating system technology. This would eventually put them in an even better position to create appliances like Exadata, allowing them to ditch an external hardware partner like HP (which is their Exadata hardware partner).<br /><br />So, all in all, in my opinion the EU should definitely take a serious look at the dynamics of the analytical database market and decide how much impact the Oracle / Sun merger could have on this particular class of MySQL OEM customers. The rise of these relatively cheap MySQL-based analytical databases is a very interesting development for the business intelligence and data warehousing space in general, and means a big win for customers that need affordable data warehousing / business intelligence. It would be a shame if it were curtailed by Oracle. 
After the merger, Oracle sure would have the means and the motive, so if someone needs protection, I think it would be these MySQL-based vendors of analytical databases.<br /><br />As always, these are just my musings and opinions - speculation is free. Feel free to correct me, add applause or point out my ignorance :)]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2009/09/12/eu-should-protect-mysql-based-special-purpose-database-vendors/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

