Archive for the ‘monetdb’ Category

Calpont opens up: InfiniDB Open Source Analytical Database (based on MySQL)

Октябрь 27th, 2009
Open source business intelligence and data warehousing are on the rise!

If you kept up with the MySQL Performance Blog, you might have noticed a number of posts comparing the open source analytical databases Infobright, LucidDB, and MonetDB. LucidDB got some more news last week when Nick Goodman announced that the Dynamo Business Intelligence Corporation will be offering services around LucidDB, branding it as DynamoDB.

Now, to top if off, Calpont has just released InfiniDB, a GPLv2 open source version of its analytical database offering, which is based on the MySQL server.

So, let's take a quick look at InfiniDB. I haven't yet played around with it, but the features sure look interesting:

  • Column-oriented architecture (like all other analytical database products mentioned)

  • Transparent compression

  • Vertical and horizontal partitioning: on top of being column-oriented, data is also partitioned, potentially allowing for less IO to access data.

  • MVCC and support for high concurrency. It would be interesting to see how much benefit this gives when loading data, because this is usually one of the bottle necks for column-oriented databases

  • Support for ACID/Transactions

  • High performance bulkloader

  • No specialized hardware - InfiniDB is a pure software solution that can run on commidity hardware

  • MySQL compatible


The website sums up a few more features and benefits, but I think this covers the most important ones.

Calpont also offers a closed source enterprise edition, which differs from the open source by offering support for multi-node scale-out support. By that, they do not mean regular MySQL replication scale-out. Instead, the enterprise edition features a true distributed database architecture which allows you to divide incoming requests across a layer of so-called "user modules" (MySQL front ends) and "performance modules" (the actual workhorses that partition, retrieve and cache data). In this scenario, the user modules break the queries they recieve from client applications into pieces, and send them to one or more performance modules in a parallel fashion. The performance modules then retrieve the actual data from either their cache, or from the disk, and sends those back to the user modules which re-assemble the partial and intermediate results to the final resultset which is sent back to the client. (see picture)
shared-disk-arch-simple
Given the MySQL compatibility and otherwise similar features, I think it is fair to compare the open source InfiniDB offering to the Infobright community edition. Interesting differences are that InfiniDB supports all usual DML statements (INSERT, DELETE, UPDATE), and that InfiniDB offers the same bulkloader in both the community edition as well as the enterprise edition: Infobright community edition does not support DML, and offers a bulk loader that is less performant than the one included in its enterprise edition. I have not heard of an InfoBright multi-node option, so when comparing the enterprise edition featuresets, that seems like an advantage too in Calpont's offering.

Please understand that I am not endorsing one of these products over the other: I'm just doing a checkbox feature list comparison here. What it mostly boils down to, is that users that need an affordable analytical database now have even more choice than before. In addition, it adds a bit more competition for the vendors, and I expect them all to improve as a result of that. These are interesting times for the BI and data warehousing market :)

PlanetMySQL Voting: Vote UP / Vote DOWN

Analyzing air traffic performance with InfoBright and MonetDB

Октябрь 3rd, 2009

Accidentally me and Baron played with InfoBright (see http://www.mysqlperformanceblog.com/2009/09/29/quick-comparison-of-myisam-infobright-and-monetdb/) this week. And following Baron's example I also run the same load against MonetDB. Reading comments to Baron's post I tied to load the same data to LucidDB, but I was not successful in this.

I tried to analyze a bigger dataset and I took public available data
http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time about USA domestic flights with information about flight length and delays.

The data is available from 1988 to 2009 in chunks per month, so I downloaded 252 files (for 1988-2008 years) with size from 170MB to 300MB each. In total raw data is about 55GB. Average amount of rows in each chunk is 483762.46 (the query Q0 is: select avg(c1) from (select year,month,count(*) as c1 from ontime group by YEAR,month) t; for InfoBright and with t as (select yeard,monthd,count(*) as c1 from ontime group by YEARD,monthd) select AVG(c1) FROM t for MonetDB. For InfoBright it took 4.19 sec to execute and 29.9 sec for MonetDB, but it's almost single case where MonetDB was significantly slower)

Few words about environment: server Dell SC1425, with 4GB of RAM and Dual Intel(R) Xeon(TM) CPU 3.40GHz.
InfoBright (ICE) version: 5.1.14-log build number (revision)=IB_3.2_GA_5316(ice)
MonetDB version: server v5.14.2, based on kernel v1.32.2
LucidDB was 0.9.1

The table I loaded data is:

CODE:
  1. CREATE TABLE `ontime` (
  2.   `Year` year(4) DEFAULT NULL,
  3.   `Quarter` tinyint(4) DEFAULT NULL,
  4.   `Month` tinyint(4) DEFAULT NULL,
  5.   `DayofMonth` tinyint(4) DEFAULT NULL,
  6.   `DayOfWeek` tinyint(4) DEFAULT NULL,
  7.   `FlightDate` date DEFAULT NULL,
  8.   `UniqueCarrier` char(7) DEFAULT NULL,
  9.   `AirlineID` int(11) DEFAULT NULL,
  10.   `Carrier` char(2) DEFAULT NULL,
  11.   `TailNum` varchar(50) DEFAULT NULL,
  12.   `FlightNum` varchar(10) DEFAULT NULL,
  13.   `Origin` char(5) DEFAULT NULL,
  14.   `OriginCityName` varchar(100) DEFAULT NULL,
  15.   `OriginState` char(2) DEFAULT NULL,
  16.   `OriginStateFips` varchar(10) DEFAULT NULL,
  17.   `OriginStateName` varchar(100) DEFAULT NULL,
  18.   `OriginWac` int(11) DEFAULT NULL,
  19.   `Dest` char(5) DEFAULT NULL,
  20.   `DestCityName` varchar(100) DEFAULT NULL,
  21.   `DestState` char(2) DEFAULT NULL,
  22.   `DestStateFips` varchar(10) DEFAULT NULL,
  23.   `DestStateName` varchar(100) DEFAULT NULL,
  24.   `DestWac` int(11) DEFAULT NULL,
  25.   `CRSDepTime` int(11) DEFAULT NULL,
  26.   `DepTime` int(11) DEFAULT NULL,
  27.   `DepDelay` int(11) DEFAULT NULL,
  28.   `DepDelayMinutes` int(11) DEFAULT NULL,
  29.   `DepDel15` int(11) DEFAULT NULL,
  30.   `DepartureDelayGroups` int(11) DEFAULT NULL,
  31.   `DepTimeBlk` varchar(20) DEFAULT NULL,
  32.   `TaxiOut` int(11) DEFAULT NULL,
  33.   `WheelsOff` int(11) DEFAULT NULL,
  34.   `WheelsOn` int(11) DEFAULT NULL,
  35.   `TaxiIn` int(11) DEFAULT NULL,
  36.   `CRSArrTime` int(11) DEFAULT NULL,
  37.   `ArrTime` int(11) DEFAULT NULL,
  38.   `ArrDelay` int(11) DEFAULT NULL,
  39.   `ArrDelayMinutes` int(11) DEFAULT NULL,
  40.   `ArrDel15` int(11) DEFAULT NULL,
  41.   `ArrivalDelayGroups` int(11) DEFAULT NULL,
  42.   `ArrTimeBlk` varchar(20) DEFAULT NULL,
  43.   `Cancelled` tinyint(4) DEFAULT NULL,
  44.   `CancellationCode` char(1) DEFAULT NULL,
  45.   `Diverted` tinyint(4) DEFAULT NULL,
  46.   `CRSElapsedTime` INT(11) DEFAULT NULL,
  47.   `ActualElapsedTime` INT(11) DEFAULT NULL,
  48.   `AirTime` INT(11) DEFAULT NULL,
  49.   `Flights` INT(11) DEFAULT NULL,
  50.   `Distance` INT(11) DEFAULT NULL,
  51.   `DistanceGroup` TINYINT(4) DEFAULT NULL,
  52.   `CarrierDelay` INT(11) DEFAULT NULL,
  53.   `WeatherDelay` INT(11) DEFAULT NULL,
  54.   `NASDelay` INT(11) DEFAULT NULL,
  55.   `SecurityDelay` INT(11) DEFAULT NULL,
  56.   `LateAircraftDelay` INT(11) DEFAULT NULL,
  57.   `FirstDepTime` varchar(10) DEFAULT NULL,
  58.   `TotalAddGTime` varchar(10) DEFAULT NULL,
  59.   `LongestAddGTime` varchar(10) DEFAULT NULL,
  60.   `DivAirportLandings` varchar(10) DEFAULT NULL,
  61.   `DivReachedDest` varchar(10) DEFAULT NULL,
  62.   `DivActualElapsedTime` varchar(10) DEFAULT NULL,
  63.   `DivArrDelay` varchar(10) DEFAULT NULL,
  64.   `DivDistance` varchar(10) DEFAULT NULL,
  65.   `Div1Airport` varchar(10) DEFAULT NULL,
  66.   `Div1WheelsOn` varchar(10) DEFAULT NULL,
  67.   `Div1TotalGTime` varchar(10) DEFAULT NULL,
  68.   `Div1LongestGTime` varchar(10) DEFAULT NULL,
  69.   `Div1WheelsOff` varchar(10) DEFAULT NULL,
  70.   `Div1TailNum` varchar(10) DEFAULT NULL,
  71.   `Div2Airport` varchar(10) DEFAULT NULL,
  72.   `Div2WheelsOn` varchar(10) DEFAULT NULL,
  73.   `Div2TotalGTime` varchar(10) DEFAULT NULL,
  74.   `Div2LongestGTime` varchar(10) DEFAULT NULL,
  75.   `Div2WheelsOff` varchar(10) DEFAULT NULL,
  76.   `Div2TailNum` varchar(10) DEFAULT NULL,
  77.   `Div3Airport` varchar(10) DEFAULT NULL,
  78.   `Div3WheelsOn` varchar(10) DEFAULT NULL,
  79.   `Div3TotalGTime` varchar(10) DEFAULT NULL,
  80.   `Div3LongestGTime` varchar(10) DEFAULT NULL,
  81.   `Div3WheelsOff` varchar(10) DEFAULT NULL,
  82.   `Div3TailNum` varchar(10) DEFAULT NULL,
  83.   `Div4Airport` varchar(10) DEFAULT NULL,
  84.   `Div4WheelsOn` varchar(10) DEFAULT NULL,
  85.   `Div4TotalGTime` varchar(10) DEFAULT NULL,
  86.   `Div4LongestGTime` varchar(10) DEFAULT NULL,
  87.   `Div4WheelsOff` varchar(10) DEFAULT NULL,
  88.   `Div4TailNum` varchar(10) DEFAULT NULL,
  89.   `Div5Airport` varchar(10) DEFAULT NULL,
  90.   `Div5WheelsOn` varchar(10) DEFAULT NULL,
  91.   `Div5TotalGTime` varchar(10) DEFAULT NULL,
  92.   `Div5LongestGTime` varchar(10) DEFAULT NULL,
  93.   `Div5WheelsOff` varchar(10) DEFAULT NULL,
  94.   `Div5TailNum` varchar(10) DEFAULT NULL
  95. ) ENGINE=BRIGHTHOUSE DEFAULT CHARSET=latin1;

Last fields starting with "Div*" are not really used.

Load procedure:

Infobright: the loader that comes with ICE version is very limited and I had to transform files to quote each field. After that load statement is:
mysql -S /tmp/mysql-ib.sock -e "LOAD DATA INFILE '/data/d1/AirData_ontime/${YEAR}_$i.txt.tr' INTO TABLE ontime FIELDS TERMINATED BY ',' ENCLOSED BY '\"'" ontime

The load time for each chunk was about 30s/chunk in initial years and up to 48s/chunk for 2008 year. And total load time is 8836 sec (2.45h).

The size of database after load is 1.6G which is impressive and give 1:34 compress ratio.

MonetDB: It took some time to figure out how to load text data ( I really wish developers improve documentation), but finally I ended up with next load statement:

/usr/local/monetdb/bin/mclient -lsql --database=ontime -t -s "COPY 700000 records INTO ontime FROM '/data/d1/AirData_ontime/${Y
EAR}_$i.txt' USING DELIMITERS ',','\n','\"' NULL AS '';"

Load time: 13065 sec ( 3.6h)

Database size after load is 65G , which is discouraging. It seems it does not use any compression, and it's bigger than original data.

LucidDB
Here it took time to find how to execute command from command line using included sqlline utility, and I did not understand how to do that, so I generated big SQL file which contained load statements.

Load of each chunk was significantly slower starting with about 60 sec/chunk for initial year and constantly growing to 200 sec / chunk for 2000 year. On 2004 year (after about 5h of loading) the load failed by some reason and I did not try to repeat, as I would not fit in timeframe I allocated for this benchmark. Maybe I will try sometime again.

Query execution
So I really have data for InfoBright and MonetDB, let see how fast they are in different queries.

First favorite query for any database benchmarker is SELECT count(*) FROM ontime;. Both InforBritgh and MonetDB executes it immediately with result 117023290 rows

Now some random queries I tried again both databases:

-Q1: Count flights per day from 2000 to 2008 years
SELECT DayOfWeek, count(*) AS c FROM ontime WHERE YearD BETWEEN 2000 AND 2008 GROUP BY DayOfWeek ORDER BY c DESC

with result:

[ 5, 7509643 ]
[ 1, 7478969 ]
[ 4, 7453687 ]
[ 3, 7412939 ]
[ 2, 7370368 ]
[ 7, 7095198 ]
[ 6, 6425690 ]

And it took 7.9s for MonetDB and 12.13s for InfoBright.

-Q2: Count of flights delayed more than 10min per day of week for 2000-2008 years
SELECT DayOfWeek, count(*) AS c FROM ontime WHERE DepDelay>10 AND YearD BETWEEN 2000 AND 2008 GROUP BY DayOfWeek ORDER BY c DESC

Result:

[ 5, 1816486 ]
[ 4, 1665603 ]
[ 1, 1582109 ]
[ 7, 1555145 ]
[ 3, 1431248 ]
[ 2, 1348182 ]
[ 6, 1202457 ]

And 0.9s execution for MonetDB and 6.37s for InfoBright.

-Q3: Count of delays per airport for years 2000-2008
SELECT Origin, count(*) AS c FROM ontime WHERE DepDelay>10 AND YearD BETWEEN 2000 AND 2008 GROUP BY Origin ORDER BY c DESC LIMIT 10

[ "ORD", 739286 ]
[ "ATL", 736736 ]
[ "DFW", 516957 ]
[ "PHX", 336360 ]
[ "LAX", 331997 ]
[ "LAS", 307677 ]
[ "DEN", 306594 ]
[ "EWR", 262007 ]
[ "IAH", 255789 ]
[ "DTW", 248005 ]

with 1.7s for MonetDB and 7.29s for InfoBright

-Q4: Count of delays per Carrier for 2007 year
SELECT carrier, count(*) FROM ontime WHERE DepDelay>10 AND YearD=2007 GROUP BY carrier ORDER BY 2 DESC

[ "WN", 296293 ]
[ "AA", 176203 ]
...

With 0.27s for MonetDB and 0.99sec for InfoBright

But it obvious that the more flight carrier has, the more delays, so to be fair, let's calculate
-Q5: Percentage of delays for each carrier for 2007 year.
It is a bit more trickier, as for InfoBright and MonetDB you need different query:

MonetDB:
WITH t AS (SELECT carrier, count(*) AS c FROM ontime WHERE DepDelay>10 AND YearD=2007 GROUP BY carrier), t2 AS (SELECT carrier, count(*) AS c2 FROM ontime WHERE YearD=2007 GROUP BY carrier) SELECT t.carrier, c, c2, c*1000/c2 as c3 FROM t JOIN t2 ON (t.Carrier=t2.Carrier) ORDER BY c3 DESC

InfoBright:
SELECT t.carrier, c, c2, c*1000/c2 as c3 FROM (SELECT carrier, count(*) AS c FROM ontime WHERE DepDelay>10 AND Year=2007 GROUP BY carrier) t JOIN (SELECT carrier, count(*) AS c2 FROM ontime WHERE Year=2007 GROUP BY carrier) t2 ON (t.Carrier=t2.Carrier) ORDER BY c3 DESC;

I am using c*1000/c2 here, because MonetDB seems using integer arithmetic and, with c/c2 I received just 1.

So result is:
[ "EV", 101796, 286234, 355 ]
[ "US", 135987, 485447, 280 ]
[ "AA", 176203, 633857, 277 ]
[ "MQ", 145630, 540494, 269 ]
[ "AS", 42830, 160185, 267 ]
[ "B6", 50740, 191450, 265 ]
[ "UA", 128174, 490002, 261 ]
...

with execution time: 0.5s for MonetDB and 2.92s for InfoBright.

Warnings: do not try EXPLAIN this query in InfoBright. MySQL is really stupid here, and EXPLAIN for this query took 6 min!

If you wonder about carriers - EV is Atlantic Southeast Airlines and US is US Airways Inc.
35.5% flights of Atlantic Southeast Airlines was delayed on more than 10 mins!

-Q6: Let's try the same query for wide range of years 2000-2008:
Result is:
[ "EV", 443798, 1621140, 273 ]
[ "AS", 299282, 1207960, 247 ]
[ "B6", 191250, 787113, 242 ]
[ "WN", 1885942, 7915940, 238 ]
[ "FL", 287815, 1220663, 235 ]
...

And execution 12.5s MonetDB and 21.83s InfoBright.

(AS is Alaska Airlines Inc. and B6 is JetBlue Airways)

-Q7: Percent of delayed (more 10mins) flights per year:

MonetDB:
with t as (select YEARD,count(*)*1000 as c1 from ontime WHERE DepDelay>10 GROUP BY YearD), t2 as (select YEARD,count(*) as c2 from ontime GROUP BY YEARD) select t.YEARD, c1/c2 FROM t JOIN t2 ON (t.YEARD=t2.YEARD)

InfoBright:
SELECT t.YEARD, c1/c2 FROM (select YEARD,count(*)*1000 as c1 from ontime WHERE DepDelay>10 GROUP BY YearD) t JOIN (select YEARD,count(*) as c2 from ontime GROUP BY YEARD) t2 ON (t.YEARD=t2.YEARD)

with result:
[ 1988, 166 ]
[ 1989, 199 ]
[ 1990, 166 ]
[ 1991, 147 ]
[ 1992, 146 ]
[ 1993, 154 ]
[ 1994, 165 ]
[ 1995, 193 ]
[ 1996, 221 ]
[ 1997, 191 ]
[ 1998, 193 ]
[ 1999, 200 ]
[ 2000, 231 ]
[ 2002, 163 ]
[ 2003, 153 ]
[ 2004, 192 ]
[ 2005, 210 ]
[ 2006, 231 ]
[ 2007, 245 ]
[ 2008, 219 ]

And with execution time 27.9s MonetDB and 8.59s InfoBright.

It seems MonetDB does not like scanning wide range of rows, the slowness here is similar to Q0.

-Q8: As final I tested most popular destination in sense count of direct connected cities for different diapason of years.

SELECT DestCityName, COUNT( DISTINCT OriginCityName) FROM ontime WHERE Year BETWEEN N and M GROUP BY DestCityName ORDER BY 2 DESC LIMIT 10;

Years, InfoBright, MonetDB
1y, 5.88s, 0.55s
2y, 11.77s, 1.10s
3y, 17.61s, 1.69s
4y, 37.57s, 2.12s
10y, 79.77s, 29.14s

-Q9: And prove that MonetDB does not like to scan many records, there is query
select year,count(*) as c1 from ontime group by YEAR

which shows how many records per years
+------+---------+
| year | c1 |
+------+---------+
| 1989 | 5041200 |
| 1990 | 5270893 |
| 1991 | 5076925 |
| 1992 | 5092157 |
| 1993 | 5070501 |
| 1994 | 5180048 |
| 1995 | 5327435 |
| 1996 | 5351983 |
| 1997 | 5411843 |
| 1998 | 5384721 |
| 1999 | 5527884 |
| 2000 | 5683047 |
| 2001 | 5967780 |
| 2002 | 5271359 |
| 2003 | 6488540 |
| 2004 | 7129270 |
| 2005 | 7140596 |
| 2006 | 7141922 |
| 2007 | 7455458 |
| 2008 | 7009728 |
+------+---------+

And execution time: MonetDB: 6.3s and InfoBright: 0.31s

To group all results there is graph:

Conclusions:

  • This experiment was not really about InfoBright vs MonetDB comparison. My goal was to check how available OpenSource software is able to handle such kind of tasks.
  • Despite InfoBright was slower for many queries, I think it is more production ready and stable. It has Enterprise edition and Support which you can buy. And execution time is really good, taking into account amount of rows engine had to crunch. For query Q8 (1year range) traditional transactional oriented stored engine took 30min to get result.
  • I really like MonetDB. I do not know what is the magic behind the curtain, they also do not have indexes like InfoBright, but results are impressive. On drawbacks - the command line is weak ( I had to use bash and pass query as parameter, otherwise I was not able to edit query or check history), the documentation also needs improvements. The fact it does not use the compression also maybe showstopper, the space consumption is worrying. Addressing these issues I think MonetDB may have commercial success
  • Worth to note that MongoDB supports all INSERT / UPDATE / DELETE statements (and space is price for that as I understand), while InfoBright ICE edition allows you only LOAD DATA. InfoBright Enterprise allows INSERT / UPDATE but that also is not for online transactions processing.
  • Compression in InfoBright is impressive. Even smaller rate 1:10 means you can compress 1TB to 100GB, which is significant economy of space.

I am open to run any other queries if you want to compare or get info about air performance.


Entry posted by Vadim | No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks


PlanetMySQL Voting: Vote UP / Vote DOWN

Quick comparison of MyISAM, Infobright, and MonetDB

Сентябрь 30th, 2009

Recently I was doing a little work for a client who has MyISAM tables with many columns (the same one Peter wrote about recently). The client's performance is suffering in part because of the number of columns, which is over 200. The queries are generally pretty simple (sums of columns), but they're ad-hoc (can access any columns) and it seems tailor-made for a column-oriented database.

I decided it was time to actually give Infobright a try. They have an open-source community edition, which is crippled but not enough to matter for this test. The "Knowledge Grid" architecture seems ideal for the types of queries the client runs. But hey, why not also try MonetDB, another open-source column-oriented database I've been meaning to take a look at?

What follows is not a realistic benchmark, it's not scientific, it's just some quick and dirty tinkering. I threw up an Ubuntu 9.04 small server on Amazon. (I used this version because there's a .deb of MonetDB for it). I created a table with 200 integer columns and loaded it with random numbers between 0 and 10000. Initially I wanted to try with 4 million rows, but I had trouble with MonetDB -- there was not enough memory for this. I didn't do anything fancy with the Amazon server -- I didn't fill up the /mnt disk to claim the bits, for example. I used default tuning, out of the box, for all three databases.

The first thing I tried doing was loading the data with SQL statements. I wanted to see how fast MyISAM vs. MonetDB would interpret really large INSERT statements, the kind produced by mysqldump. But MonetDB choked and told me the number of columns mismatched. I found reference to this on the mailing list, and skipped that. I used LOAD DATA INFILE instead (MonetDB's version of that is COPY INTO). This is the only way to get data into Infobright, anyway.

The tests

I loaded 1 million rows into the table. Here's a graph of the times (smaller is better):

Load Time

MyISAM took 88 seconds, MonetDB took 200, and Infobright took 486. Here's the size of the resulting table on disk (smaller is better):

Table Size in Bytes

MyISAM is 787MB, MonetDB is 791MB, and Infobright is 317MB. Next I ran three queries:

SQL:
  1. SELECT sum(c19), sum(c89), sum(c129) FROM t;
  2. SELECT sum(c19), sum(c89), sum(c129) FROM t WHERE c11> 5;
  3. SELECT sum(c19), sum(c89), sum(c129) FROM t WHERE c11 <5;

Graphs of query performance time for all three databases are really not very helpful, because MyISAM is so much slower that you can't see the graphs for the others. So I'll give the numbers and then omit MyISAM from the graphs. Here are the numbers for everything I measured:

myisam monetdb infobright
size (bytes)     826000000     829946723 332497242
load time (seconds)     88     200     486
query1 time     3.4     0.012     0.0007
query2 time     3.4     0.15     1.2
query3 time     2.5     0.076     0.15

And here is a graph of Infobright duking it out with MonetDB on the three queries I tested (shorter bar is better):

MonetDB vs Infobright Query Time

I ran each query a few times, discarded the first run, and averaged the next three together.

Notes on Infobright

A few miscellaneous notes: don't forget that Infobright is not just a storage engine plugged into MySQL. It's a complete server with a different optimizer, etc. This point was hammered home during the LOAD DATA INFILE, when I looked to see what was taking so long (I was tempted to use oprofile and see if there are sleep() statements). What did I see in 'top' but a program called bhloader. This bhloader program was the only thing doing anything; mysqld wasn't doing a thing. LOAD DATA INFILE in Infobright isn't what it seems to be. Otherwise, Infobright behaved about as I expected it to; it seemed pretty normal to a MySQL guy.

Notes on MonetDB

MonetDB was a bit different. I had to be a bit resourceful to get everything going. The documentation was for an old version, and was pretty sparse. I had to go to the mailing lists to find the correct COPY syntax -- it wasn't that listed in the online manual. And there were funny things like a "merovingian" process (think "angel") that had to be started before the server would start, and I had to destroy the demo database and recreate it before I could start it as shown in the tutorials.

MonetDB has some unexpected properties; it is not a regular RDBMS. Still, I'm quite impressed by it in some ways. For example, it seems quite nicely put together, and it's not at all hard to learn.

It doesn't really "speak SQL" -- it speaks relational algebra, and the SQL is just a front-end to it. You can talk XQuery to it, too. I'm not sure if you can talk dirty to it, but you can sure talk nerdy to it: you can, should you choose to, give it instructions in MonetDB Assembly Language (MAL), the underlying language. An abstracted front-end is a great idea; MySQL abstracts the storage backend, but why not do both? Last I checked, Drizzle is going this direction, hurrah!

EXPLAIN is enlightening and frightening! You get to see the intermediate code from the compiler. The goggles, they do nothing!

From what I was able to learn about MonetDB in an hour, I believe it uses memory-mapped files to hold the data in-memory. If this is true, it explains why I couldn't load 4 million rows into it (this was a 32-bit Amazon machine).

The SQL implementation is impressive. It's a really solid subset of SQL:2003, much more than I expected. It even has CTEs, although not recursive ones. (No, there is no REPLACE, and there is no INSERT/ON DUPLICATE KEY UPDATE.) I didn't try the XQuery interface.

Although I didn't try it out, there are what looks like pretty useful instrumentation interfaces for profiling, debugging and the like. The query timer is in milliseconds (why doesn't mysql show query times in microseconds? I had to resort to Perl + Time::HiRes for timing the Infobright queries).

I think it can be quite useful. However, I'm not quite sure it's useful for "general-purpose" database use -- there are a number of limitations (concurrency, for one) and it looks like it's still fairly experimental.


Entry posted by Baron Schwartz | No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks


PlanetMySQL Voting: Vote UP / Vote DOWN