Archive for the ‘graph’ Category

Liveblogging at Confoo: Blending NoSQL and SQL

Март 11th, 2010

Persistence Smoothie: Blending NoSQL and SQL – see user feedback and comments at http://joind.in/talk/view/1332.

Michael Bleigh from Intridea, high-end Ruby and Ruby on Rails consultants, build apps from start to finish, making it scalable. He’s written a lot of stuff, available at http://github.com/intridea. @mbleigh on twitter

NoSQL is a new way to think about persistence. Most NoSQL systems are not ACID compliant (Atomicity, Consistency, Isolation, Durability).

Generally, most NoSQL systems have:

  • Denormalization
  • Eventual Consistency
  • Schema-Free
  • Horizontal Scale

NoSQL tries to scale (more) simply, it is starting to go mainstream – NY Times, BBC, SourceForge, Digg, Sony, ShopWiki, Meebo, and more. But it’s not *entirely* mainstream, it’s still hard to sell due to compliance and other reasons.

NoSQL has gotten very popular, lots of blog posts about them, but they reach this hype peak and obviously it can’t do everything.

“NoSQL is a (growing) collection of tools, not a new way of life.”

What is NoSQL? Can be several things:

  • Key-Value Stores
  • Document Databases
  • Column-oriented data stores
  • Graph Databases

Key-Value Stores


memcached is a “big hash in the sky” – it is a key value store. Similarly, NoSQL key-value stores “add to that big hash in the sky” and store to disk.

Speaker’s favorite is Redis because it’s similar to memcached.

  • key-value store + datatypes (list, sets, scored sets, soon hashes will be there)
  • cache-like functions (like expiration)
  • (Mostly) in-memory

Another interesting key-value store is Riak

  • Combination of key-value store and document database
  • heavy into HTTP REST
  • You can create links between documents, and do “link walking” that you don’t normally get out of a key-value store
  • built-in Map Reduce

Map Reduce:


  • Massively parallel way to process large datasets
  • First you scour data and “map” a new set of dataM
  • Then you “reduce” the data down to a salient result — for example, map reduce function to make a tag cloud: map function makes an array with a tag name and a count of 1 for each instance of that tag, and the reduce tag goes through that array and counts them…
  • http://en.wikipedia.org/wiki/MapReduce

Other key-value stores:

Document Databases


Some say that it’s the “closest” thing to real SQL.
  • MongoDB – Document store that speaks BSON (Binary JSON, which is compact). This is the speaker’s favorite because it has a rich query syntax that makes it close to SQL. Can’t do joins, but can embed objects in other objects, so it’s a tradeoff

    • Also has GridFS that can store large files efficiently, can scale to petabytes of data
    • does have MapReduce but it’s deliberate and you run it every so often.

  • CouchDB
    • Pure JSON Document Store – can query directly with nearly pure javascript (there are auth issues) but it’s an interesting paradigm to be able to run your app almost entirely through javascript.
    • HTTP REST interface
    • MapReduce only to see items in CouchDB. Incremental MapReduce, every time you add or modify a document, it dynamically changes the functions you’ve written. You can do really powerful queries as easy as you can do simple queries. However, some things are really complex, ie, pagination is almost impossible to do.
    • Intelligent Replication – CouchDB is designed to work with offline integration. Could be used instead of SQLite as the HTML5 data store, but you need CouchDB running locally to be doing offline stuff w/CouchDB

Column-oriented store


Columns are stored together (ie, names) instead of rows. Lets you be schema-less because you don’t care about a row’s consistency, you can just add a column to a table very easily.

Graph Databases


speaker’s opinion – there aren’t enough of these.
Neo4J – can handle modeling complex relationships – “friends of friends of cousins” but it requires a license.

When should I use this stuff?





If you have:Use
Complex, slow joins for an “activity stream”Denormalize, use a key-value store.
Variable schema, vertical interactionDocument database or column store
Modeling multi-step relationships (linkedin, friends of friends, etc)Graph

Don’t look for a single tool that does every job. Use more than one if it’s appropriate, weigh the tradeoffs (ie, don’t have 7 different data stores either!)

NoSQL solves real scalability and data design issues. But financial transactions HAVE to be atomic, so don’t use NoSQL for those.

A good presentation is http://www.slideshare.net/bscofield/the-state-of-nosql.

Using SQL and NoSQL together


Why? Well, your data is already in an SQL database (most likely).

You can blend by hand, but the easy way is DataMapper:
Generic, relational ORM (adapters for many SQL dbs and many NoSQL stores)
Implements Identity Map
Module-based inclusion (instead of extending from a class, you just include into a class).

You can set up multiple data targets (default is MySQL, example sets up MongoDB too).
DataMapper is:

  • Ultimate Polyglot ORM
  • simple r’ships btween persistence engines are easy
  • jack of all, master none
  • Sometimes perpetuates false assumptions –
  • If you’re in Ruby, your legacy stuff is in ActiveRecord, so you’re going to have to rewrite your code anyway.

Speaker’s idea to be less generic and better use of features of each data store – Gloo – “Gloo glues together different ORMs by providing relationship proxies.” this software is ALPHA ALPHA ALPHA.

The goal is to be able to define relationships on the terms of any ORM from any class, ORM or not
Right now – partially working activeRecord relationships
Is he doing it wrong? Is it a crazy/stupid idea? Maybe.

Example:





NeedUse
Assume you already have an auth systemit’s already in SQL, so leave it there.
Need users to be able to purchase items from the storefront – Can’t lose transactions, need full ACID complianceuse MySQL.
Social Graph – want to have activity streams and 1-way and 2-way relationships. Need speed, but not consistencyuse Redis
Product Listings — selling moves and books, both have different properties, products are pretty much non-relationaluse MongoDB

He wrote the example in about 3 hours, so integration of multiple data stores can be done quickly and work.


PlanetMySQL Voting: Vote UP / Vote DOWN

OQGRAPH on Launchpad, graph examples

Октябрь 24th, 2009

The MySQL 5.0 and MySQL/MariaDB 5.1 source code is now also available through Launchpad. If you were waiting for a version for 5.1 and are ok with building the plugin from source, now you can!

The repo contains a subdir for examples, we’re hoping many people will contribute little snippets and scripts to import and use interesting datasets. To give you a hint, with graph capabilities you are able to deal with RDF data sources. You just need to transform the XML to say CSV, import into a suitable structure, and copy the edge information across to an OQGRAPH table.

Roland Bouman’s tree-of-life (which uses xslt stylesheets) are a good example of that approach, and was the first entry in the examples tree, including an SQL dump of the base dataset (it was CC-NC licensed) so you don’t necessarily have to fuss with the RDF/xslt foo.

Enjoy! We want to have examples/demos, a proper testsuite (there’s a bug/wishlist for that), and more. If you can help, please do: mucking around with graphs is great fun. If you implement OQGRAPH in a “proper” app, we’d also like to hear from you. The examples are intended to get people used to what OQGRAPH can do, and thus trigger ideas for practical uses. It’s not just fun. With OQGRAPH’s capabilities and speed, you can profit.


PlanetMySQL Voting: Vote UP / Vote DOWN

Walking the Tree of Life in simple SQL

Октябрь 21st, 2009

Antony and I are busy getting the Open Query GRAPH Engine code ready so you all can play with it, but we needed to test with a larger dataset to make sure all was fundamentally well with the system.

We have some intersting suitable dataset sources, but the first we tried in ernest because it was easy to get in (thanks to Roland Bouman for both the idea and providing xslt stylesheets to transform the set), was the Tree of Life which is a hierarchy of 89052 entries showing how biological species on earth are related to eachother.

GRAPH engine operates in a directed fashion, so I inserted the connections both ways resulting in 178102 entries. So, I inserted A->B as well as B->A for each connection. So we now have a real graph, not just a simple tree.

Just like with my previous post, we have a separate table that contains the name of the species. For query simplicity, I looked up the id the start/end name separately. By the way, latch=1 indicates we use Dijkstra’s shortest-path algorithm for our search.

# with all that explained, let’s find ourselves in the tree of life!
SELECT GROUP_CONCAT(name ORDER BY seq SEPARATOR ‘ -> ‘) AS path FROM tol_graph JOIN tol ON (linkid=id) WHERE latch=1 AND origid=1 AND destid=16421 \G
*************************** 1. row ***************************
path: Life on Earth -> Eukaryotes -> Unikonts -> Opisthokonts -> Animals -> Bilateria -> Deuterostomia -> Chordata -> Craniata -> Vertebrata -> Gnathostomata -> Teleostomi -> Osteichthyes -> Sarcopterygii -> Terrestrial Vertebrates -> Tetrapoda -> Reptiliomorpha -> Amniota -> Synapsida -> Eupelycosauria -> Sphenacodontia -> Sphenacodontoidea -> Therapsida -> Theriodontia -> Cynodontia -> Mammalia -> Eutheria -> Primates -> Catarrhini -> Hominidae -> Homo -> Homo sapiens
1 row in set (0.13 sec)

# how are we related to the family of plants containing the banana
SELECT GROUP_CONCAT(name ORDER BY seq SEPARATOR ‘ -> ‘) AS path FROM tol_graph JOIN tol ON (linkid=id) WHERE latch=1 AND origid=16421 AND destid=21506 \G
*************************** 1. row ***************************
path: Homo sapiens -> Homo -> Hominidae -> Catarrhini -> Primates -> Eutheria -> Mammalia -> Cynodontia -> Theriodontia -> Therapsida -> Sphenacodontoidea -> Sphenacodontia -> Eupelycosauria -> Synapsida -> Amniota -> Reptiliomorpha -> Tetrapoda -> Terrestrial Vertebrates -> Sarcopterygii -> Osteichthyes -> Teleostomi -> Gnathostomata -> Vertebrata -> Craniata -> Chordata -> Deuterostomia -> Bilateria -> Animals -> Opisthokonts -> Unikonts -> Eukaryotes -> Archaeplastida (Plantae) -> Green plants -> Streptophyta -> Embryophytes -> Spermatopsida -> Angiosperms -> Monocotyledons -> Zingiberanae -> Musaceae
1 row in set (0.06 sec)

Obviously, this search needs to find its way up the tree then find the appropriate other branch.

# finally, our connection retro-viruses
SELECT GROUP_CONCAT(name ORDER BY seq SEPARATOR ‘ -> ‘) AS path FROM tol_graph JOIN tol ON (linkid=id) WHERE latch=1 AND origid=16421 AND destid=57380 \G
*************************** 1. row ***************************
path: Homo sapiens -> Homo -> Hominidae -> Catarrhini -> Primates -> Eutheria -> Mammalia -> Cynodontia -> Theriodontia -> Therapsida -> Sphenacodontoidea -> Sphenacodontia -> Eupelycosauria -> Synapsida -> Amniota -> Reptiliomorpha -> Tetrapoda -> Terrestrial Vertebrates -> Sarcopterygii -> Osteichthyes -> Teleostomi -> Gnathostomata -> Vertebrata -> Craniata -> Chordata -> Deuterostomia -> Bilateria -> Animals -> Opisthokonts -> Unikonts -> Eukaryotes -> Life on Earth -> Viruses -> DNA-RNA Reverse Transcribing Viruses -> Retroviridae
1 row in set (0.06 sec)

As you can see this one has to walk all the way back to “life on earth”, we’re really not related at all.

I left in the lines that show the amount of time taken. In earlier queries it took a few seconds, and I thought that was just some slowness in the graph engine, until I found out that the join was un-indexed so MySQL was table-scanning the tol table for each item found. Quickly corrected, the numbers are as you see.

I was still curious though, and since the SELECT returns a single item (a string in this case) it was really easy to use the BENCHMARK(N,func) function. That standard MySQL function runs func N times. Simple.

# so, we do
SELECT benchmark(1000000,(SELECT GROUP_CONCAT(name ORDER BY seq SEPARATOR ‘ -> ‘) AS path FROM tol_tree JOIN tol ON (linkid=id) WHERE latch=1 AND origid=16421 AND destid=57380));

1 row in set (1.86 sec)

As it turns out, we were really just measuring latency before, as this shows we can do a million of these path searches through a graph in less than 2 seconds. To me, that’s not just “not bad” (the usual opinion a Dutch person would express ;-) but freaking awesome. And that is just what I wanted to tell.


PlanetMySQL Voting: Vote UP / Vote DOWN

GRAPH engine – Mk.II

Октябрь 16th, 2009

The GRAPH engine allows you to deal with hierarchies and graphs in a purely relational way. So, we can find all children of an item, path from an item to a root node, shortest path between two items, and so on, each with a simple basic query structure using standard SQL grammar.

The engine is implemented as a MySQL/MariaDB 5.1 plugin (we’re working on a 5.0 backport for some clients) and thus runs with an unmodified server.

Demo time! I’ll simplify/strip a little bit here for space reasons, but what’s here is plain cut/paste from a running server, no edits

-- insert a few entries with connections (and multiple paths)
insert into foo (origid, destid) values (1,2), (2,3), (2,4), (4,5), (3,6), (5,6);
-- a regular table to join on to
insert into people values (1,"pearce"),(2,"hunnicut"),(3,"potter"),
                          (4,"hoolihan"),(5,"winchester"),(6,"mulcahy");
-- find us the shortest path from pearce (1) to mulcahy (6) please
select group_concat(people.name order by seq) as path
  from foo join people on (foo.linkid=people.id)
  where latch=1 and origid=1 and destid=6;
+--------+--------+--------------------------------+
| origid | destid | path                           |
+--------+--------+--------------------------------+
|      1 |      6 | pearce,hunnicut,potter,mulcahy |
+--------+--------+--------------------------------+
-- find us all people we can get to from potter (3)
select origid,group_concat(people.name order by seq) as destinations
  from foo join people on (foo.linkid=people.id)
  where latch=1 and origid=3;
+--------+----------------+
| origid | destinations   |
+--------+----------------+
|      3 | mulcahy,potter |
+--------+----------------+

-- find us all people from where we can get to hoolihan (4)
select origid,group_concat(people.name order by seq) as origins
  from foo join people on (foo.linkid=people.id)
  where latch=1 and destid=4;
+--------+--------------------------+
| origid | origins                  |
+--------+--------------------------+
|      4 | hoolihan,hunnicut,pearce |
+--------+--------------------------+

So, there you have it. A graph (in this case a simple unidirectional tree, aka hierarchy) that looks like a table to us, as do the resultsets that have been computed.

This is still a early implementation, we’re still enhancing the storage efficiency (in memory) and speed, and adding persistence. We’re also looking for a suitable large dataset that would allow us to seriously test the system, find bugs and assess speed. If you happen to have a large hierarchical structure, but especially a social graph you could obfuscate and give to us, that would be great!

Also, if you’re interested in deploying the GRAPH engine or have questions or additional needs, we’d be happy to talk with you.

select origid,group_concat(people.name order by seq) as destinations from foo join people on (foo.linkid=people.id) where latch=1 and origid=4;
+——–+—————————–+
| origid | destinations                |
+——–+—————————–+
|      4 | mulcahy,winchester,hoolihan |
+——–+—————————–+

PlanetMySQL Voting: Vote UP / Vote DOWN