Archive for the ‘mongodb’ Category

VC funding for Hadoop and NoSQL tops $350m

Ноябрь 15th, 2011

451 Research has today published a report looking at the funding being invested in Apache Hadoop- and NoSQL database-related vendors. The full report is available to clients, but non-clients can find a snapshot of the report, along with a graphic representation of the recent up-tick in funding, over at our Too Much Information blog.


PlanetMySQL Voting: Vote UP / Vote DOWN

New England Database Summit

Ноябрь 15th, 2011

The New England Database Summit is an all day conference-style event where participants from the research community and industry in the New England area can come together to present ideas and discuss their research and experiences working with on data-related problems.  It is an academic conference with applications to real life, and includes any type of database.

The 5th annual NEDB will be held in Cambridge, MA MIT (in 32-123) on Friday, February 3, 2012.  Anyone who would like is welcome to present a poster (registration required), or submit a short paper for review.  We plan to accept 8--10 papers for presentation (15 minutes) at the meeting.   All posters will be accepted.

For more details, and to register and / or upload a paper, see:

http://db.csail.mit.edu/nedbday12/


PlanetMySQL Voting: Vote UP / Vote DOWN

My take on the "warning" against using MongoDB…

Ноябрь 14th, 2011
We have seen the "warning" against using MongoDB a few times now, and I have to say that this reminds me of other such warnings:
In a sense, most of them were right. If you had, in the 1920's, asked the movie going public if they wanted "talkies", chances are most of them would have said no. If you had told my mom and dad in the late 1970's that within 20 - 30 years, everyone would have a computer at home, with some resemblence to what their nerdy near-20 year old boy was tinkering with in the basement of their house, they probably would have laughed, at best.

But that's not the thing here. True innovation moves things forward. It introduces new things and new ways of doing things in a way that we have not heard of before, and the rest of the world has not a got a good view on it. Look at Virtual computing. This was considered so slow that it was close to useless some 10 years ago or so, but today it just cannot be ignored and is put to good use all over the place (I am now disregarding the fact that this technology is way older than this, I am talking Virtual computing in the field where I spend most of my time).

If something that provides new and unique features, and new ways of doing things, are still slow, when compred to traditional means of acheving similar results, isn't strange:
  • New means not fully developed. What you want to demonstrate with something completely new isn't that it performs as well as existing technologies, then why would anyone change? No, you want to show the new features and demonstrate hwo unique this new thing is.
  • The way we measure performance or whatever we use to measure existing technologies, is usually tied to measuring just that: the performance or whatever of existing technologies, not that of a new technology and a new way of doing things.
In the early 20th centrury, steam cars were mach faster and more reliable than internal combustion powered cars, the Stanley Steamer was more performant and reliable than most competitors.

Getting back to MongoDB then: My main gripe with it is not that's it's not in all aspect mature (it's not. face it, if you use MongoDB you used leading egde stuff. It will break, live with it!). Neither do I have any issues with many of the other attributes of MongoDB and neither that it really isn't even innovative (it's not, live with it). No, my main gripe is this: MongoDB and NoSQL isn't really that new, and this means it is probably a stop-gap solution. In the 1980's running a SQL database on a PC was possible, but slow (I was working for Oracle at the time, so I know), DBase was in the case of a single PC easier to use, faster and more developer friendly. And the way you used Ashton-Tate DBase, by the way, wasn't that much different from how MongoDB is used today.

But the SQL Based relational databases, like Oracle, Informix etc. had more features and was more flexible and standardized, and once the PCs got more powerful, DBase was history.

What really wins then, in my mind, is features, flexibility, scalability and broad spectrum of usacases. SQL Based relational databases, has this, to an extent, but what most of them lacks is scalability across servers in a cloud. In this aspect, they have some of this scalability, but they don't scale nearly as much or as easily as, say, MongoDB or the other NoSQL databases (yes, I hate that term. Find a better one fpor me that is broadly accepted and I start using it).

So what am I saying here? Let me summarize it:
  • No. MongoDB doesn't suck, no way, but in terms of maturity it has some way to go.
  • Yes, a relational RDBMS usualy has more flexibility and broader set of usecases than a NoSQL solution.
  • And Yes: There are places where the RDBMS software industry got caught with their pants down: Cloud environment scalability for example. Also, licensing, if Oracle or MySQL or whoever could figure out a proper means of pricing cloud services, I'd be happy (insteaad of using the pricing models for software that was introduced with Auto-Flow in the early 1960. This is insane. The world has changed since then, guy!)
  • Would I rather use MySQL than MongoDB here at Recorded Future? Yes, probably, but I don't insist, and it just wouldn't work, as MySQL will not scale in the way you can scale a MongoDB solution, far from it.
  • Will MongoDB or the other NoSQL solutions mean that the era of SQL based databases is reaching an end? Nope, no way, José. Eventually some RDBMS vendor will get it and understand the issues and build a viable solution (like ScaleDB or NuoDB or something like that).
  • What is my main issue with MongoDB? That it sacrifices features and functionality for performance. And it cannot add the features and flexibility of an RDBMS without sacrificing performance (look at this: access method: B+-Tree. Come on, how innovative is THAT?) . To be honest, running MongoDB without sharding seems like a useless excercise to me. If you don't need the scalability that this setup can provide you with, you have better options. (but this is just me talking here.)
  • With MongoDB hang around for long? Yes, probably, but it will not be that hot for as long at SQL based databases. The reason is that compared to a traditional RDBMS it provides just one big advantage (a big advantage, yes, but just one): Performance. Which is why we use it. Remember the Stanley Steamer? Old hat today, but it was the hottest thing you could drive some 100 years ago or so. And this is what cloud computing is all about, a constant change of technology to get the best value for money right now, and to be on constant lookout for new technologies that drives features and performance.
/Karlsson

PlanetMySQL Voting: Vote UP / Vote DOWN

MongoDB for MySQL folks part 3 — More on queries and indexes

Ноябрь 2nd, 2011
Last time I wrote about MongoDB for MySQL DBAs I described some of the basics of MongoDB querying, and this time I'll follow that up with some more on querying.

As we saw last time, the basic format of a MongoDB query is:
db.find(<query>,<attributes>)
Note that you do NOT replace db with the name of the database you want to query here, you just make the database you want to use the current one and issue the query, such as:
> use test
> db.mycoll.find()
The example above will find all objects in the mycoll collection, and will include all the object attributes and also the key (_id), like this:
{ "_id" : ObjectId("4eb0634807b16556bf46b214"), "c1" : 1 }
{ "_id" : ObjectId("4eb0634a07b16556bf46b215"), "c2" : 1 }
{ "_id" : ObjectId("4eb0635607b16556bf46b216"), "c1" : 2, "c2" : 2 }
{ "_id" : ObjectId("4eb0635e07b16556bf46b217"), "c3" : 3 }
The Object id is generated by MongoDB itself here, although you can set it yorself if you want to, as long as it's unique. The insert method is used to insert data:
> db.mycoll.insert({c3: 4, c4: 'some string'})
> db.mycoll.find()
results in;
{ "_id" : ObjectId("4eb0634807b16556bf46b214"), "c1" : 1 }
{ "_id" : ObjectId("4eb0634a07b16556bf46b215"), "c2" : 1 }
{ "_id" : ObjectId("4eb0635607b16556bf46b216"), "c1" : 2, "c2" : 2 }
{ "_id" : ObjectId("4eb0635e07b16556bf46b217"), "c3" : 3 }
{ "_id" : ObjectId("4eb063d307b16556bf46b218"), "c3" : 4, "c4" : "some string" }
And as you can see, typing is automatic, or you can look at it as being type agnostic. Now, this wasn't much more than we saw last time, what we want is to select some specific objects and possibly get some specific columns from it, this is done by specifying one or two arguments to the find() method. For example, if I only want to get the object back that I inserted last above, I'd do this:
> db.mycoll.find({c3: 4})
{ "_id" : ObjectId("4eb063d307b16556bf46b218"), "c3" : 4, "c4" : "some string" }
And this wasn't really complicated, right? The condition is passed as a Java Script object notation, and that is fairly uncomplicated. But what happens for something slightly more than this really simple example, like a rangesearch? To get all objects where the c3 member is 4 or higher (which results in the same object as above by the way), you would write something like this, and :
> db.mycoll.find({c3: {$gt: 3}})
{ "_id" : ObjectId("4eb063d307b16556bf46b218"), "c3" : 4, "c4" : "some string" }
I will show some more $-operations beyond $gt in a later post, for now just accept that they exist and are documented here: Advanced Queries

The _id column is just annoying here, right now, but it is always shown by default, as are all the other object. To get rid of it for now, this will do the trick:
> db.mycoll.find({c3: {$gt: 3}}, {_id: 0})
{ "c3" : 4, "c4" : "some string" }
Not too bad, right, and kinda easy to understand. The falgs you pass for each field in the second argument may have one of three values:
  • 1 - Include this field. This is the default.
  • 0 - Do not include this field.
  • -1 - Include no fields except this one and the ObjectId. You may have more of these, in which case all the -1 flagged fields will be included.
Let's try a more advanced version. I want to the the c1 and c2 attributes, and nothing else, then I do this:
> db.mycoll.find({},{c1: -1, c2: -1, _id: 0})
{ "c1" : 1 }
{ "c2" : 1 }
{ "c1" : 2, "c2" : 2 }
{ }
{ }
As you can see, I have to explicitly exclude the _id field.

Online help
The mongo commandline tool for once has decent online help. Typing just help will show the options. For help on database specific operations, type db.help() and for collection specific operations, type db..help(), such as db.mycoll.help(). In JavaScript, a function is just another script, and adding arguments to the function will execute the function, but maybe you want to see how the function is implemented? The just type the name of the function, like this:
> db.mycoll.find
function (query, fields, limit, skip) {
return new DBQuery(this._mongo, this._db, this, this._fullName, this._massageObject(query), fields, limit, skip);
}

DBA Work - Indexing data and explain
What would a mongo DBA want to do? Let's try creating an index. Let's say we want an index on the c1 attribute in the mycoll collection as above, then we must use the ensureIndex() method on the collection in question, telling what columns I want to index, like this:
> db.mycoll.ensureIndex({c1: 1})
And that's it. Let's try to query that collection again, this time using the c1 column as an argument, and hopefully the index will be used:
{ "_id" : ObjectId("4eb0634807b16556bf46b214"), "c1" : 1 }
Right. But is the index used? I want to know that it is for a fact, or if it isn't, so I have something to complain to my developers about. In MySQL, you want use the EXPLAIN command and figure out what index are being used, but with mongo? Easy. Use the explain method, like this:
> db.mycoll.find({c1: 1}).explain()
{
"cursor" : "BtreeCursor c1_1",
"nscanned" : 1,
"nscannedObjects" : 1,
"n" : 1,
"millis" : 0,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
"c1" : [
[
1,
1
]
]
}
}
Hey, that's prett cool, right! The index is a standard B-tree index (the only index type available in MongoDB). An index can also be unique, like this:
> db.mycoll.ensureIndex({c2: 1}, {unique: true})
Which will create a unique index on the c2 attribute, but in our case it will not work:
E11000 duplicate key error index: test.mycoll.$c2_1 dup key: { : null }
What's going on here? Well, the c2 attribute isn't included in all objects, and but the index will include all objects, and MongoDB considers NULL a duplicate here (unlike an SQL NULL in which case this is not the case). So the real question here is, what do you want? As MongoDB is schema-free, and you can have any kind of attributes, and also looking at the data above, what I would probably want is an index on the c2 attrbute that makes sure that c2 is unique WHEN INCLUDED, if the c2 attribute isn't part of the object, then please mr. Indexer, ignore it. This is called a sparse index in MongoDB, and what it means is an index that just indexes the objects where the attribute is included.

Note that this may not always be what you want with non-unique indexes, but it often it is, and it makes seaching and inserting faster (as the index is smaller). In the case you have an attribute that is only rarely part of the object, and you want to find the objects where it IS included, this is just what you want.

In our case, the index is created like this:
> db.mycoll.ensureIndex({c2: 1}, {unique: true, sparse: true})
And this time we had no errors. Let's see how it works, first get some data:
> db.mycoll.find({}, {c2: 1, _id:0})
{ }
{ "c2" : 1 }
{ "c2" : 2 }
{ }
{ }
Now, let's see if the unique index on c2 will guarantee uniqueness by inserting a new row with an existing value for c2:
> db.mycoll.insert({c2: 1})
E11000 duplicate key error index: test.mycoll.$c2_1 dup key: { : 1.0 }
Yo! That worked as expected! As does this (which gives no errors):
> db.mycoll.insert({c2: 3})

That's it for now, I'll be back soon with some more MongoDB DBA stuff: Sharding!
/Karlsson

PlanetMySQL Voting: Vote UP / Vote DOWN

MongoDB for MySQL folks part 3 — More on queries and indexes

Октябрь 27th, 2011
Last time I wrote about MongoDB for MySQL DBAs I described some of the basics of MongoDB querying, and this time I'll follow that up with some more on querying.

As we saw last time, the basic format of a MongoDB query is:
db.find(,)
Note that you do NOT replace db with the name of the database you want to query here, you just make the database you want to use the current one and issue the query, such as:
> use test
> db.mycoll.find()
The example above will find all objects in the mycoll collection, and will include all the object attributes and also the key (_id), like this:
{ "_id" : ObjectId("4eb0634807b16556bf46b214"), "c1" : 1 }
{ "_id" : ObjectId("4eb0634a07b16556bf46b215"), "c2" : 1 }
{ "_id" : ObjectId("4eb0635607b16556bf46b216"), "c1" : 2, "c2" : 2 }
{ "_id" : ObjectId("4eb0635e07b16556bf46b217"), "c3" : 3 }
The Object id is generated by MongoDB itself here, although you can set it yorself if you want to, as long as it's unique. The insert method is used to insert data:
> db.mycoll.insert({c3: 4, c4: 'some string'})
> db.mycoll.find()
results in;
{ "_id" : ObjectId("4eb0634807b16556bf46b214"), "c1" : 1 }
{ "_id" : ObjectId("4eb0634a07b16556bf46b215"), "c2" : 1 }
{ "_id" : ObjectId("4eb0635607b16556bf46b216"), "c1" : 2, "c2" : 2 }
{ "_id" : ObjectId("4eb0635e07b16556bf46b217"), "c3" : 3 }
{ "_id" : ObjectId("4eb063d307b16556bf46b218"), "c3" : 4, "c4" : "some string" }
And as you can see, typing is automatic, or you can look at it as being type agnostic. Now, this wasn't much more than we saw last time, what we want is to select some specific objects and possibly get some specific columns from it, this is done by specifying one or two arguments to the find() method. For example, if I only want to get the object back that I inserted last above, I'd do this:
> db.mycoll.find({c3: 4})
{ "_id" : ObjectId("4eb063d307b16556bf46b218"), "c3" : 4, "c4" : "some string" }
And this wasn't really complicated, right? The condition is passed as a Java Script object notation, and that is fairly uncomplicated. But what happens for something slightly more than this really simple example, like a rangesearch? To get all objects where the c3 member is 4 or higher (which results in the same object as above by the way), you would write something like this, and :
> db.mycoll.find({c3: {$gt: 3}})
{ "_id" : ObjectId("4eb063d307b16556bf46b218"), "c3" : 4, "c4" : "some string" }
I will show some more $-operations beyond $gt in a later post, for now just accept that they exist and are documented here: Advanced Queries

The _id column is just annoying here, right now, but it is always shown by default, as are all the other object. To get rid of it for now, this will do the trick:
> db.mycoll.find({c3: {$gt: 3}}, {_id: 0})
{ "c3" : 4, "c4" : "some string" }
Not too bad, right, and kinda easy to understand. The falgs you pass for each field in the second argument may have one of three values:
  • 1 - Include this field. This is the default.
  • 0 - Do not include this field.
  • -1 - Include no fields except this one and the ObjectId. You may have more of these, in which case all the -1 flagged fields will be included.
Let's try a more advanced version. I want to the the c1 and c2 attributes, and nothing else, then I do this:
> db.mycoll.find({},{c1: -1, c2: -1, _id: 0})
{ "c1" : 1 }
{ "c2" : 1 }
{ "c1" : 2, "c2" : 2 }
{ }
{ }
As you can see, I have to explicitly exclude the _id field.

Online help
The mongo commandline tool for once has decent online help. Typing just help will show the options. For help on database specific operations, type db.help() and for collection specific operations, type db..help(), such as db.mycoll.help(). In JavaScript, a function is just another script, and adding arguments to the function will execute the function, but maybe you want to see how the function is implemented? The just type the name of the function, like this:
> db.mycoll.find
function (query, fields, limit, skip) {
return new DBQuery(this._mongo, this._db, this, this._fullName, this._massageObject(query), fields, limit, skip);
}

DBA Work - Indexing data and explain
What would a mongo DBA want to do? Let's try creating an index. Let's say we want an index on the c1 attribute in the mycoll collection as above, then we must use the ensureIndex() method on the collection in question, telling what columns I want to index, like this:
> db.mycoll.ensureIndex({c1: 1})
And that's it. Let's try to query that collection again, this time using the c1 column as an argument, and hopefully the index will be used:
{ "_id" : ObjectId("4eb0634807b16556bf46b214"), "c1" : 1 }
Right. But is the index used? I want to know that it is for a fact, or if it isn't, so I have something to complain to my developers about. In MySQL, you want use the EXPLAIN command and figure out what index are being used, but with mongo? Easy. Use the explain method, like this:
> db.mycoll.find({c1: 1}).explain()
{
"cursor" : "BtreeCursor c1_1",
"nscanned" : 1,
"nscannedObjects" : 1,
"n" : 1,
"millis" : 0,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
"c1" : [
[
1,
1
]
]
}
}
Hey, that's prett cool, right! The index is a standard B-tree index (the only index type available in MongoDB). An index can also be unique, like this:
> db.mycoll.ensureIndex({c2: 1}, {unique: true})
Which will create a unique index on the c2 attribute, but in our case it will not work:
E11000 duplicate key error index: test.mycoll.$c2_1 dup key: { : null }
What's going on here? Well, the c2 attribute isn't included in all objects, and but the index will include all objects, and MongoDB considers NULL a duplicate here (unlike an SQL NULL in which case this is not the case). So the real question here is, what do you want? As MongoDB is schema-free, and you can have any kind of attributes, and also looking at the data above, what I would probably want is an index on the c2 attrbute that makes sure that c2 is unique WHEN INCLUDED, if the c2 attribute isn't part of the object, then please mr. Indexer, ignore it. This is called a sparse index in MongoDB, and what it means is an index that just indexes the objects where the attribute is included.

Note that this may not always be what you want with non-unique indexes, but it often it is, and it makes seaching and inserting faster (as the index is smaller). In the case you have an attribute that is only rarely part of the object, and you want to find the objects where it IS included, this is just what you want.

In our case, the index is created like this:
> db.mycoll.ensureIndex({c2: 1}, {unique: true, sparse: true})
And this time we had no errors. Let's see how it works, first get some data:
> db.mycoll.find({}, {c2: 1, _id:0})
{ }
{ "c2" : 1 }
{ "c2" : 2 }
{ }
{ }
Now, let's see if the unique index on c2 will guarantee uniqueness by inserting a new row with an existing value for c2:
> db.mycoll.insert({c2: 1})
E11000 duplicate key error index: test.mycoll.$c2_1 dup key: { : 1.0 }
Yo! That worked as expected! As does this (which gives no errors):
> db.mycoll.insert({c2: 3})

That's it for now, I'll be back soon with some more MongoDB DBA stuff: Sharding!
/Karlsson

PlanetMySQL Voting: Vote UP / Vote DOWN

MongoDB for MySQL folks — Part 2

Октябрь 24th, 2011
In this second part of a series of blogs on using a NoSQL database, in this case MongoDB, aimed at MySQL users, I will describe querying a bit. And it's probably not what you expected if you haven't tried a NoSQL database or MongoDB more specifically, before. The first blogpost on this subject is here.

At the heart of accessing MongoDB is JavaScript, more specifically it uses the Mozilla SpiderMonkey JavaScript engine. As I wrote in the previous blog on this subject, JavScript is all over the place in MongoB, largely you can look at MonngoDB as JavaScript with HUGE space for variables, and although many would look at this as gross oversimplification, it works as simple descripton of what we have here. JavaScript for querying, inserting, administration, scripting and "stored procedures" (they aren''t called this in MongoDB, but you know what I mean). And JSON (JavaScript Object Notation) for data, when inserting an object, you construct a JSON object, when you query, what you get back is a JSON object. etc etc. Internally, Mongo uses a binary representation of JSON called BSON.

So we have A LOT of JavaScript and JSON, and if you are no fan of JavaScript, then probably MongoDB isn't for you! In the MongoDB database there are collections then, which are similar to a table, but in a collection, there isn't, like in an RDBMS, a bunch of rows all with the same columns (attributes), instead there are multiple object, all using the same "primary key" (called _id) and the object itself is a JSON Object with any attributes. This is why MongoDB is "schemaless", there is no set schema for the objects (i.e. "rows") except the _id then.

The schemaless design sometimes gives the expressionh that beyond the unique access key, there is no other way to access the data and that you cannot have secondary indexes, but you sure can! But now I am getting ahead of myself.

Let's give MongoDB a shot. After installing MongoDB (which is dead simple) and starting the server, access the command-line using the "mongo" command line program. In the examples I show here, I use a sharded setup, so the prompt is "mongos", and it might be different in your case, depending on your configuration and version of MongoDB. First, lets create a database to play with;
mongos> use mytest
Hey, you ask, what is going on here? You never created a database, you just accessed it? Yes, that is the deal with MongoDB, you just access a database or a collection, and if it doesn't already exist, it is created for you. Which is the reason many MongoDB databases, even in production, has a bunch of wrongly spelled database and collection names.
Now, let's see of the database we access really exists.
mongos> show dbs
admin (empty)
config 0.1875GB
test (empty)
and as you can see, I lied to you, the mytest database isn't there at all! Well, the reason is that there is no collection there yet, so let's create one:
mongos> db.foo.insert({attr1: 1})
Again I am accessing something that doesn't exist to create it, in this case a collection called foo, where the first object has one attribute, attr1, with the value 1. Now, the mytest database should exist, as there is data in it, let's have a look:
mongos> show dbs
admin (empty)
config 0.1875GB
mytest 0.203125GB
And there you go, the mystest database is created and so is the collection foo:
mongos> show collections
foo
system.indexes
OK, there you go, a database and a collection for you. This is getting hotter! Let's insert some more data:
mongos> db.foo.insert({attr1: 2})
mongos> db.foo.insert({attr2: 3})
mongos> db.foo.insert({attr2: 4, attr3: 5})
All this is valid data, every object may have different and different number of attributes, fact is, every object can have any attribute you would want. Which doesn't necessarily make it a good idea, but it's good to have that feeling of complete freedom, right?
Some more inserts:
mongos> db.foo.insert({attr2: 4, attr4: {attr5: 6, attr6: 7}})
mongos> db.foo.insert({attr2: 4, attr7: [10, 11, 12]})
Here I first insert an object where one attribute in itself is an object, and then an object where one attribute is an array. Hey, cool!
Now, let's do some youerying. First just get att the data:
mongos> db.foo.find()
{ "_id" : ObjectId("4ea528fb35a81331e048baa9"), "attr1" : 1 }
{ "_id" : ObjectId("4ea52a4835a81331e048baaa"), "attr1" : 2 }
{ "_id" : ObjectId("4ea52a5035a81331e048baab"), "attr2" : 3 }
{ "_id" : ObjectId("4ea52a5f35a81331e048baac"), "attr2" : 4, "attr3" : 5 }
{ "_id" : ObjectId("4ea52ae635a81331e048baad"), "attr2" : 4, "attr4" : { "attr5" : 6, "attr6" : 7 } }
{ "_id" : ObjectId("4ea52afc35a81331e048baae"), "attr2" : 4, "attr7" : [ 10, 11, 12 ] }
The find() function is a built-in MongoDB function that retries data for you, and it can take 2 optional arguments, both of which is an object in itself (surprise!):
  1. A query-specification.
  2. The columes to return.
For example:
mongos> db.foo.find({ attr2: 3})
{ "_id" : ObjectId("4ea52a5035a81331e048baab"), "attr2" : 3 }

Here I am querying for all objects where c1 is 3. The attr2 attribute doesn't have an index on it, but I can query on it. To the use enforced _id key, just do this (in your case, you will se a different object id of course):
mongos> db.foo.find({ _id: ObjectId("4ea52a5035a81331e048baab")})
{ "_id" : ObjectId("4ea52a5035a81331e048baab"), "attr2" : 3 }
Before I close for now, let me show you how you can tell that the latter of the two last queries above was more effective by using explain() (you didn't expect that, did you):
mongos> db.foo.find({ attr2: 3}).explain()
{
"cursor" : "BasicCursor",
"nscanned" : 6,
"nscannedObjects" : 6,
"n" : 1,
"millis" : 0,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {

}
}
The result from an explain is, as you would expect by now, a JavaScript object. Look at the nscanned attribute here. There is no index on the attr2 attribute, so all objects are scanned, compared to this:
mongos> db.foo.find({ _id: ObjectId("4ea52a5035a81331e048baab")}).explain()
{
"cursor" : "BtreeCursor _id_",
"nscanned" : 1,
"nscannedObjects" : 1,
"n" : 1,
"millis" : 22,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
"_id" : [
[
ObjectId("4ea52a5035a81331e048baab"),
ObjectId("4ea52a5035a81331e048baab")
]
]
}
}
Here, just one object is looked at, to get the data, and the id column is used to find it (look in the indexBounds attribute). So far, querying doesn't all that difficult to what you see in an RDBMS, except all those objects and JavaScript.

That's it for now, I'll be back with some more MongoDB lessons real soon! The next post will be on more advanced querying!

/Karlsson

PlanetMySQL Voting: Vote UP / Vote DOWN

MariaDB Statistics and Surveys

Октябрь 19th, 2011

I just finished reading a couple of interesting, and somewhat related, blog posts which I think are worth sharing (apologies to anyone who has already seen them). One is from Jelastic and the other is from Michal Hrušecký.

I’ve written about MariaDB and the Jelastic cloud before (see MariaDB now available as a hosted database via Jelastic cloud platform). Now Jelastic has published statistics on the relative popularity of the various databases they offer. The good news is MariaDB is currently the database of choice for 14% of their customers. The bad news is that we’re in fourth place behind their other three database choices (MySQL, PostgreSQL, and MongoDB). However, MariaDB has only been available on their platform for a little over two months and we’re very happy that so many users of Jelastic’s state-of-the-art Java cloud are choosing to use MariaDB. Thanks!

In the second blog post, Michal Hrušecký shares the results of what he terms his “little MySQL survey”. This time MariaDB comes in second behind MySQL Community Server (and ahead of MySQL Cluster and Percona Server). In case you didn’t know, Michal packages MySQL and MariaDB for openSUSE, so this survey was a good way for him to judge the relative popularity of some of the more popular variants and give him ideas on how to improve things.

BTW: If you know of other interesting/informative/etc… blog posts related to MariaDB, let us know so we can add them to the Blog Posts Relevant to MariaDB page of the AskMonty Knowledgebase!


PlanetMySQL Voting: Vote UP / Vote DOWN

NoSQL for us RDBMS folks — MongoDB 101

Октябрь 4th, 2011
As you probably know, I have been doing RDBMS work for many years, some 25+ years by now. At Recorded Future I am the database architect, and although an RDBMS is used extensively, MySQL in this case, we are looking at options, and are currently doing more and more work using a NoSQL Solution, probably te most popular one by now, namely MongoDB.

And before you complan: NoSQL is not a good term, but someone with a longer NoSQL background should then find something better, not yours truly. And for all intents and purposes, you know what I am talking about, right?

I plan to post a few MongoDB posts here, how it looks like from an RDBMS dudes (like myself) POV. This is the first installment, but there will be more. I should also point out that I am no expert in NoSQL technologies in general, nor specifically in MongoDB, but I am trying, and our MongoDB setup is large and complex enough to serve as a decent example: We have a couple of Tb of data in Mongo, we use Replication and we use Sharding / Clustering whatever you want to call it. In addition, we run this puppy on Amazon EC2.

So, why did we move to MongoDB, and what are the results? I think the reason is two-fold:
  1. Better performance. Not better performance per se, but by using sharding and large in-memory stores, we can achive this.
  2. Better and easier to set up scalability. Again, this doesn't come by itself, nor is it without problems. But the built-in sharding in MongoDB actually does work reasonably well, and is largely transparent to the application.
This sounds great, doesn't it? Everyone wants better perforamnce and better scalability, right? And why can't we have MySQL to scale in the same way as MongoDB? Well, there are drawbacks to all this, but in our case, I think we can live with it.

An important thing to note here, I think, is this: Scalability is transparent and easy to use and implement because the basic idea of a NoSQL system (a key-value store in this case) is so simple: One unique key points to one value. That's it. This is easy to shard. Dead easy. A multi-table relational schema is much less so. But you could simplify the schema in MySQL to make it just as easy to shrd, but then what is the point of an RDBMS at all? And MongoDB Scalability is an incredibly important part of MongoDB performance (in addition to it's in-memory nature).

In our case, using Amazon EC2, scaleability is really important. A single Amazon instance is limited in terms of memory and CPU power. The limits are high (some 70G RAM for example), but there is still a limit. If you have, say, 500 Gb or a Tb or so of data, 70Gb might not be that much, so you have to shard or cluster or something. And this is where MongoDB rocks, it is easy to shard, the simple data structures in a NoSQL database mena this this can be done largely transparent, and all means that MongoDB Rocks for us.

Do we have other options? You bet! MySQL Cluster, NimbusDB (aka NuoDB), Oracle, Cassandra, etc etc, but we ended up with MongoDB for now.

So from a DBA POV, what is mongo then? To begin with, there is no SQL interface and no "query language" per se, rather to the user, MongoDB looks a lot like a Java Script environment with a lot of space for variables, sort of. MongoDB has databases, and in databases there are collections, which are sort of like tables in an RDBMS. A collection is a key-value store, you can define a key for the collection, or MongoDB will generate one for you. MongoDB is schemaless, which means that a collection doesn't not have any predefined columns or anything. Instead, if an object that you store has an attribute foo then you assign a value to it in your object and insert the object. That's it. And you can have an index on foo if you want to. And index in this case is a traditional B-Tree index, nothing more exciting than that.

An Object as above in Mongo is a JSON structure, which internally in Mongos is represented as BSON. To access the collections you use Java Script as I said before, like this to insert an object into the collection bar with the attribute foo:
db.bar.insert({foo: "foobar"});
In SQL this means: INSERT INTO bar(foo) VALUES('foobar')
Used this way, Mongo will assign the object a unique id, called _id, which is similar to the PRIMARY KEY in an SQL RDBMS. To retrieve the row we just inserted, use the command:
db.bar.find()
Which in SQL means: SELECT * FROM bar
And you get something like this back:
{ "_id" : ObjectId("4e8c9b2e2fde67676c2381ae"), "foo" : "anka" }

MongoDB has a big bunch of built-in Java Script methods, what I showed above was just a very basic sample. To use Mongo, knowing Java Script and JSON in general is more or less a requirement, and if you don't know these, and start using MongoDB, you will know then soon.

I will write some more blog-posts on Mongo eventually, but before I close this post, let me tell you one more thing. I never created the collections (bar) I used above. The reason is that yiu do not have to, it is created for you as you access it. The same goes for databases. This is the reason that you often end up with a bunch of misspelled databases and collections in mongo, which you can of cource drop after you have made a mistake, but as we are all lazy, you mostly have a bunch of non-used databases and collections.

/Karlsson

PlanetMySQL Voting: Vote UP / Vote DOWN

Quick Installation of Replication from MySQL to MongoDB

Сентябрь 30th, 2011
Proof-of-concept Tungsten support for MongoDB arrived last May, when I posted about our hackathon effort to replicate from MySQL to MongoDB.  That code then lay fallow for a few months while we worked on other things like parallel replication, but the period of idleness has ended.  Earlier this week I checked in fixes to Tungsten Replicator to add one-line installation support for MongoDB slaves.

MySQL to MongoDB replication will be officially supported in the Tungsten Replicator 2.0.5 build, which will be available in a few weeks.  However, you can try out MySQL to MongoDB replication right now.  Here is a quick how-to using my lab hosts logos1 for the MySQL master and logos2 for the MongoDB slave. 

1. Download the latest development build of Tungsten Replicator.   See the nightly builds page for S3 URLs.

$ cd /tmp
$ wget --no-check-certificate https://s3.amazonaws.com/files.continuent.com/builds/nightly/tungsten-2.0-snapshots/tungsten-replicator-2.0.5-332.tar.gz

2. Untar and cd into the release. 

$ tar -xzf tungsten-replicator-2.0.5-332.tar.gz
$ cd tungsten-replicator-2.0.5-332

3. Install a MySQL master replicator on a host that has MySQL installed and is configured to use row replication, i.e. binlog_format=row.  Note that you need to enable the colnames and pkey filters.  These add column names to row updates and eliminate update and delete query columns other than those corresponding to the primary key, respectively. Last but not least, ensure strings are converted to Unicode rather than transported as raw bytes, which we have to do in homogeneous MySQL replication to finesse character set issues.  

$ tools/tungsten-installer --master-slave -a \
  --datasource-type=mysql \
  --master-host=logos1  \
  --datasource-user=tungsten  \
  --datasource-password=secret  \
  --service-name=mongodb \
  --home-directory=/opt/continuent \
  --cluster-hosts=logos1 \
  --mysql-use-bytes-for-string=false \
  --svc-extractor-filters=colnames,pkey \
  --svc-parallelization-type=disk --start-and-report

4. Finally, install a MongoDB slave.  Before you do this, ensure mongod 1.8.x is up and running on the host as described in the original blog post on MySQL to MongoDB replication.   My mongod is running on the default port of 27017, so there is no --slave-port option necessary. 

$ tools/tungsten-installer --master-slave -a \
  --datasource-type=mongodb \
  --master-host=logos1  \
  --datasource-user=tungsten  \
  --datasource-password=secret  \
  --service-name=mongodb \
  --home-directory=/opt/continuent \
  --cluster-hosts=logos2 \
  --skip-validation-check=InstallerMasterSlaveCheck \
  --svc-parallelization-type=disk --start-and-report

That's it.  You test replication by logging into MySQL on the master, adding a row to a table, and confirming it reaches the slave.   First the SQL commands: 

$ mysql -utungsten -psecret -hlogos1 test
Welcome to the MySQL monitor.  Commands end with ; or \g.
...
mysql> create table bar(id1 int primary key, data varchar(30));
Query OK, 0 rows affected (0.15 sec)

mysql> insert into bar values(1, 'hello from mysql');
Query OK, 1 row affected (0.00 sec)

Now check the contents of MongoDB:  

$ mongo logos2:27017/test
MongoDB shell version: 1.8.3
connecting to: logos2:27017/test
system.indexes
> db.bar.find()
{ "_id" : ObjectId("4e85269484aef8fcae4b0010"), "id1" : "1", "data" : "hello from mysql" }

Voila!  We may still have bugs, but at least MySQL to MongoDB replication is now easy to install.   

Speaking of bugs, I have been fixing problems as they pop up in testing.  The most significant improvement is a feature I call auto-indexing on MongoDB slaves.  MongoDB materializes collections automatically when you put in the first update, but it does nothing about indexes.  My first TPC-B runs processed less than 100 transactions per second on the MongoDB slave, which is pretty pathetic. The bottleneck is due to MongoDB update operations of the form 'db.account.findAndModify(myquery,mydoc)'.  You must index properties used in the query or things will be very slow.   

Auto-indexing cures the update bottleneck by ensuring that there is an index corresponding to the SQL primary key for any table that we update.  MongoDB makes this logic very easy to implement--you can issue a command like 'db.account.ensureIndex({account_id:1})' to create an index.  What's really cool is that MongoDB will do this even if the collection is not yet materialized--e.g., before you load data.   It seems to be another example of how MongoDB collections materialize whenever you refer to them, which is a very useful feature.  

TPC-B updates into MongoDB are now running at over 1000 transactions per second on my test hosts. I plan to fix more bugs and goose up performance still further over the next few weeks.  Through MongoDB we are unlearning assumptions within Tungsten that are necessary to work with non-relational databases.  It's great preparation for big game hunting next year:  replication to HBase and Cassandra.  

PlanetMySQL Voting: Vote UP / Vote DOWN

Tech Messages | 2011-09-21

Сентябрь 24th, 2011

Green NetworkA special extended edition of Tech Messages for 2011-08-31 through 2011-09-21:


PlanetMySQL Voting: Vote UP / Vote DOWN