Archive for the ‘Main’ Category

SlackDB Updates

October 25th, 2010

Since I announced SlackDB a few weeks ago, I’ve had a number of questions and interesting conversations in response. I thought I would summarize the initial feedback and answer some questions to help clarify things. One of the biggest questions was “Isn’t this what Drizzle is doing?”, and the answer is no. They are both being designed for “the cloud” and speak the MySQL protocol, but they provide very different guarantees around consistency and high-availability. The simple answer is that SlackDB will provide true multi-master configurations through a deterministic and idempotent replication model (conflicts will be resolved via timestamps), whereas Drizzle still maintains transactions and ACID properties, which imply a single master. Drizzle could add support for clustered configurations and distributed transactions (like the NDB storage engine), but writes would still have to happen on the majority side (to maintain quorum) since the concept of global state needs to be maintained.
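To make the replication idea a bit more concrete, here is a minimal Python sketch of timestamp-based conflict resolution ("last write wins"). It is not SlackDB code: the `Write` record, the `node_id` tie-breaker, and the `apply_write` helper are all assumptions made for illustration. The point it tries to show is that applying the same writes in any order, any number of times, leaves every replica in the same state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    key: str
    value: str
    timestamp: float  # assumed to come from the clock of the writing node
    node_id: str      # hypothetical tie-breaker when timestamps are equal


def apply_write(store: dict, w: Write) -> None:
    """Apply one replicated write, keeping the version that sorts highest."""
    current = store.get(w.key)
    if current is None or (w.timestamp, w.node_id) > (current.timestamp, current.node_id):
        store[w.key] = w


writes = [
    Write("user:1", "alice@old.example", 100.0, "node-a"),
    Write("user:1", "alice@new.example", 105.0, "node-b"),
]

replica_1, replica_2 = {}, {}
for w in writes:
    apply_write(replica_1, w)
for w in reversed(writes + writes):  # different order, with duplicates
    apply_write(replica_2, w)
```

Because the comparison depends only on data carried by the write itself, the result is deterministic (order does not matter) and idempotent (replaying a write changes nothing). The trade-off is that the losing write is silently discarded.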

This led Mark Callaghan to ask why not just modify Drizzle to support these behaviors? He has a good point since most of the properties I’m talking about exist at the storage engine level. There are still a number of changes that would need to happen in the kernel around catalog, database, and table creation to support the replication model. SlackDB also won’t need a number of constructs provided by the Drizzle kernel (various locks, transaction support) so query processing can be lighter-weight. So while it’s probably possible with enough patches and plugins to make this work in Drizzle, I believe it will be easier (both socially and technically) to do this from scratch. With either approach there is still a fair amount of code to be written, and I’ve decided to use Erlang since it allows programmers to express ideas concisely and more quickly with an acceptable trade-off in runtime efficiency. This would make it even more difficult to integrate with Drizzle.

A couple folks asked why I chose the BSD license instead of GPL or Apache. I didn’t want a copyleft license, so GPL was out, but after chatting some more I decided to switch SlackDB to the Apache 2.0 license for the patent protection clause. As much as I dislike patents and would prefer not to acknowledge them, I figured having the protection clauses in there would make it less likely that anyone using the software would have to deal with them once there are other contributors who may hold patents.

I presented the techniques I’m using behind SlackDB in a session at OpenSQL Camp Boston last weekend, and overall they were well received. There was a lot of great feedback and suggestions about other projects and libraries doing related things that may help speed things along. I was glad to see I wasn’t the only person thinking about these properties for relational databases, as Josh Berkus of PostgreSQL fame also led a session on ordering events and conflict resolution within relational data when you loosen up consistency.

I also attended Surge in Baltimore and listened to a talk by Justin Sheehy about “Embracing Concurrency At Scale.” You can see another recording of the same talk here. Justin explained the concepts and problems with systems trying to maintain any kind of globally consistent state quite well, and I agree with almost everything in his presentation. This recent blog post by Coda Hale also explains some of the other key principles around what you must give up in order to get the level of availability required by most systems these days. These help explain the reasons why I started SlackDB – I’m trying to combine these properties with a relational data model. Right now I’m still only able to put my limited spare time into it, but I’m hoping to find a way to put more time into the project. Hopefully you will agree we need a database like this and will help out too. :)



Let’s Build a Relational Database for the Cloud

September 27th, 2010

As you might have guessed from my last couple blog posts, I’ve been experimenting with a few languages and libraries for a new project. I’ve finally gotten things far enough along to the point where I’d like to start getting other developers and potential users involved. I’m introducing SlackDB, an open source project that combines the functionality of relational databases with the ideas behind eventually consistent, shared-nothing data stores to provide a new database to support new and existing web applications in the cloud (enough buzzwords in there?). This is an idea I wrote about a while ago and recently I started putting a lot of night and weekend time into it.

It is still very early on in the development process, but the ideas behind it are starting to solidify and a fair amount of code is already written. It’s not all that useful yet, but much of the framework is there now. I’m starting from scratch and chose to use Erlang for the reasons mentioned in my last post (I wrote a prototype in Python and scrapped it due to performance reasons). It already speaks the MySQL protocol so you can interact with it using any of the APIs or tools you would normally use for MySQL. Other protocols and query languages may be supported in the future, but I decided to focus on the most popular one for now to help gain adoption.

I’m looking for database internals and/or Erlang programmers to help out (or anyone else who wants to learn). I want this to be a true community-driven project from design to documentation; there is just too much to do alone. Perhaps you’re looking for a side project or a reason to learn a language like Erlang. Here is your chance. This project may sound a bit ambitious, but we don’t need to write a general purpose database to support all use cases. We’ll take an iterative approach and choose what’s important. I’ve outlined a number of things already in the design principles page on the wiki. I also have a number of blueprints up on the Launchpad project page.

SQL meets NoSQL

As much as I dislike the ambiguity of ‘NoSQL’, SlackDB is blurring the lines between the two by using techniques normally found in ‘NoSQL’ projects like eventual consistency and deterministic replication. People still have problems to solve and applications to run that require relational data models. A number of applications don’t even use most of the features provided by an ACID compliant relational database. For example, WordPress, one of the most popular web applications in the world, doesn’t even use real transactions (it auto-commits on every command). It is a prime application to use a relaxed consistency back-end with a true multi-master setup, allowing for higher availability and geographic redundancy.

Isn’t this what project X is trying to do?

Not that I’ve seen yet. I haven’t found any relational databases that throw away immediate consistency, isolation, and in turn provide a deterministic, idempotent replication model. Also, I don’t see any relational databases trying to solve the multi-tenant cloud problem in a way that embraces common properties of the cloud. Resources need to be elastic and services should tolerate network partitions and not show degraded performance on either side.

Is this a company? Are you going to sell it? Who will use it?

While I would like to be paid to work on this, I never plan to sell the software or release some parts closed (no open-core). It’s BSD licensed, so anyone could try to fork and sell it, but I hope you don’t. My primary motivation is that I feel the open source cloud service space is still lacking a big component and I want to help fill the gap. I would like to see projects such as WordPress and Drupal be able to run out of the box with multi-master SlackDB configurations. Down the road I would really like to see public and private cloud providers adopt this as part of their multi-tenant platform-as-a-service (PaaS) offering.

I’ll be at OpenSQL Camp in Boston in a couple weeks and am planning on talking more about SlackDB there. If you are interested try to make it, otherwise feel free to drop me a note at eday@oddments.org.



Yet Another Language Comparison

September 22nd, 2010

Over the past year or so I’ve found myself evaluating my overall programming experience with the languages I’m working with. I might just be getting impatient in my old age (turning the big three-oh in a couple months), but I like to think I’m trying to find the most efficient way to solve the problem at hand. This has led me to learn and experiment with a number of languages, taking a look at each one’s strengths and weaknesses. I realize programming language selection is very subjective and folks can get quite passionate in the debate, but I’m still going to present my personal opinions on the matter. Flame away.

The main question that I’m trying to answer is: What language will enable me to solve the problem at hand correctly and in the fastest way possible? By correctly I mean without bugs or missing requirements. By fastest I not only mean the initial design and coding phases, but also maintenance. I’m a strong believer that no piece of software is ever complete and that code is usually read many more times than it is written. An application needs to be written in a way that makes it easy to jump back into after some period of time. I’ve considered a number of metrics while experimenting with each language to answer the above question and I’m going to touch on a few big ones before digging into various scenarios and languages. I should also preface this with the assumption that I’m talking about community-driven open-source software. This can of course apply to any team, open or closed, but I don’t really care about those one-off programs that never leave your hard-drive. Here is a list of things to consider while evaluating your choices:

  • Don’t be Different – Disregard all that hoopla about everyone being their own unique, beautiful butterfly – sometimes it’s best to conform. If I want to hack on the Linux kernel, I’m probably going to be doing it in C or assembly. If I want to contribute a plugin to Drupal or WordPress, it’s going to be in PHP. Even though it is technically possible to embed some other language into a project, it’s usually easiest to take the path of least resistance. There is only one work flow, one set of development tools, and less time spent context switching between the different languages. If there is some precedent for a language already, stop and use that. The rest of this post should be applied when you have a clean slate and can make choices without worrying about tight integration with an existing project.
  • Be Mature – I recently read this article suggesting we need a new programming language, and while the ideas are nice, I can’t say I agree. As great as new languages like Go may be, you never know when the plug may be pulled on the core development team. There is always the option of maintaining the language tools as well as your project, but that’s a lot of extra work. I like to choose languages that I know are not going anywhere or won’t be changing too drastically in the future.

  • Be Popular – There are a few websites out there that try to measure programming language popularity from various sources. Take them with a grain of salt, but it does give you a pretty good idea of who is hot or not, and even what the recent trends are. If a language doesn’t appear in the top 20-30 of the general lists, I usually don’t look any further. Google Trends can be useful as well to create your own trending graphs from their data set. Popularity is important because you want to have useful developer tools and a community to help answer your questions. If you and a professor at some university are the only people using a language, there is going to be a bottleneck on resources. This is also critical for open source projects when you want to build a developer community. Choose a language that folks already know so they can make useful contributions and help make your software better.

  • Determine What’s Important – As with many aspects of computer science (and life), choosing a language is all about trade-offs. To make these decisions, you need to know what limits you are going to hit. Are you going to be bound by CPU, disk, network, user, or some other resource? By user bound I mean your application will always be limited by user interaction, and no hardware resource limits will ever be hit. If you are CPU bound, you probably want a machine code or efficient byte-code compiled language. If you are user bound, performance matters less so you have more options. Keep in mind these limits are not isolated and can have effects on one another. For example, some languages may make I/O interaction really easy at the cost of space, but this may be due to double or even triple buffering of data, causing your CPU usage to increase too.

  • Don’t Guess – I found myself performing a number of micro-benchmarks to test various aspects of the languages. How long does it take to call a function? How much overhead is there in the concurrency primitives? How expensive is context switching? How efficient are the built-in string processing functions? It’s best to answer these questions by writing small programs in each language and comparing the results.

  • Choose the Best Tool for the Job – Sometimes you choose a language mainly because it has a particular library, module, or some built in feature suitable for your application. Most languages have the same standard library bits, but many have a few niche uses and have great support for certain features. For example, if you’re going to be running on large multi-core machines and need to share a lot of memory between threads (so not multi-process), languages that have a single interpreter lock like Python may not be a good choice. If you want a simple, integrated webserver framework that you can customize, Python is great. C or C++ may not be the best choice in this case because you’ll mostly be writing your own. Examine what your primary feature requests are and how well they are met by each language.
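As an illustration of the kind of micro-benchmark described under “Don’t Guess”, here is a small Python sketch using the standard `timeit` module. The helper name `time_per_call` and the two workloads (an empty function call and a 100-element string join) are arbitrary choices for this example, not the tests actually run for this post.

```python
import timeit


def noop():
    pass


def time_per_call(stmt, number=100_000, repeat=5, **names):
    """Return the best observed seconds per execution of stmt."""
    runs = timeit.repeat(stmt, globals=names, number=number, repeat=repeat)
    return min(runs) / number


# How long does it take to call a function?
call_cost = time_per_call("f()", f=noop)

# How efficient is a built-in string operation?
join_cost = time_per_call("''.join(parts)", parts=["x"] * 100)

print(f"function call:       {call_cost * 1e9:.0f} ns")
print(f"join of 100 strings: {join_cost * 1e9:.0f} ns")
```

Taking the minimum over several repeats reduces noise from other processes. The absolute numbers depend heavily on the machine and interpreter version, so they are only meaningful when compared against the equivalent program in another language on the same hardware.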

My Current Preferences

Below is a list of a few classes of applications and my current preference for each. You’ll notice a lack of Java in the discussion, mainly because I’ve always been on the C++ side for object-oriented applications. I’d rather put the time into a C++/STL/boost application and eliminate the extra VM layer at runtime. C/C++ also has the benefit of being able to link with other C/C++ libraries natively, where in Java you would need to write a JNI wrapper, find the Java equivalent, or write your own native library.

  • Web Applications – Early on I used Perl for all my web programming, then I switched to PHP for a number of years, and recently I’ve found myself preferring Python. It is a fantastic language and allows you to do almost anything with the objects at runtime (for better or worse). I’ve been doing some work on OpenStack and have found the WSGI standard great for building modular web applications. Combined with an event framework like Eventlet, you don’t even really need Apache httpd. The main concerns with Python are CPU bound tasks and SMP support because of the global interpreter lock (GIL). Most of the web apps I write are not bound by either and most of the heavy lifting (if any) is pushed to some other service that is more efficient (like a database). Projects like Django and Pylons take this to the next level providing frameworks around these basic ideas, but if you want to keep it simple then 10 lines of Python will get you a functioning web server and WSGI application (with a dependency on Eventlet). The Routes and SQLAlchemy packages also provide some very useful functionality while building your web applications.

  • Scripting, Tools, and Middleware – For these types of apps, I’ve mainly used Perl or a combination of shell/sed/awk, but recently I’ve again found Python to be a better fit. Decent versions of Python are standard on any system now, so you don’t need to worry about customizing or installing any dependencies to get your applications running. Again, if there are SMP or CPU performance concerns, you might need to look at another language.

  • Shared Libraries and Drivers – These consist of libraries that are used to provide some core functionality or other service. For example, libz for compression or libmysql to talk with MySQL servers. You really want the lowest common denominator so the library can easily be wrapped and reused in a number of other languages. This means writing it in C. Python, PHP, Perl, Erlang, Ruby, Lua, and pretty much all others have well defined interfaces for interacting with C libraries. Projects such as SWIG even take care of some of this interfacing work for you, allowing you to build multiple language bindings at once. You can of course write your driver in each language natively, but this can be a lot of work. You can probably get away with writing the library in C++, but you’ll most likely run into more issues than if you had just used C.

  • Servers – This is where most of my time has gone throughout my career, and for about 10 years the answer was always C. I was always trying to squeeze every bit of CPU and memory out of the servers I was writing. In the past three years I started doing a lot more C++ work for MySQL related projects like Drizzle, and recently I’ve been experimenting with a number of alternatives. In a previous blog post I tested performance and throughput for a few different solutions, and while I was impressed with the higher level languages, the C++ version still won by a good margin. In further tests I performed more CPU-intensive calculations and the run times of the Javascript and Python versions went through the roof compared to C++. This was most likely due to less time being spent in the kernel for the I/O calls, which should be about the same regardless of language. There were two languages that did stand out in the performance tests: Go and Erlang. Even with heavier CPU loads, they both performed quite well, usually taking only 10-15% more time than the C or C++ equivalents. Go is still a no-go due to its immaturity, but I think Erlang is a real contender. I’ve been somewhat frustrated with C++ due to its verbosity and nuances. For example, defining and debugging complex template code can be a nightmare, but it’s required if you want to use the STL. When doing the same thing in Erlang, I found myself writing more concise code with fewer bugs in a fraction of the time. In other words, the code was almost as fast and much more elegant than the C or C++ equivalents.
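The Web Applications item above says that about 10 lines of Python are enough for a functioning web server and WSGI application when using Eventlet. Here is a rough sketch of that idea. To keep it runnable without extra packages it uses the standard library’s `wsgiref` server in place of Eventlet, so it is a substitute for, not a copy of, the setup described. The `app` and `serve` names and port 8080 are arbitrary choices.

```python
from wsgiref.simple_server import make_server


def app(environ, start_response):
    """A WSGI application: receives the request environ, returns body chunks."""
    body = f"Hello from {environ.get('PATH_INFO', '/')}\n".encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def serve(host="127.0.0.1", port=8080):
    """Serve `app` until interrupted (blocks the calling thread)."""
    with make_server(host, port, app) as server:
        server.serve_forever()
```

Calling `serve()` starts the server. Since `app` is a plain callable following the WSGI interface, the same function should be usable unchanged under other WSGI servers, which is what makes the standard handy for modular applications.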

And the winner is…

There is of course no single winner; choose the best tool for the job. I think the combination of C, Python, and Erlang is a good fit for a wide variety of applications. The mental shift to a functional language may take a bit in the case of Erlang, but I encourage you to give it a try if you have not already. The main downside of Erlang is its popularity (or lack thereof). It’s not too far down the list, but certainly not in the top ten. This is probably due to it being a functional language and not having a history of general purpose applications. The popularity of projects such as CouchDB and RabbitMQ is putting Erlang on the map and giving developers a reason to take a closer look. If you still need to squeeze every bit of CPU and memory out of your applications, you’ll probably need to stick with C or C++.



OSCON and OpenStack

July 26th, 2010

The past two weeks have been both exciting and extremely busy, first traveling to Austin, TX for the first OpenStack Design Summit, and then back home to Portland, OR for The O’Reilly Open Source Conference (OSCON) and Community Leadership Summit. The events were great in different ways, and there was some overlap with OpenStack since we announced it on the first day of OSCON and created quite a bit of buzz around the conference. I want to comment on a few things that came up during these two weeks.

New Role

I’m now focusing on OpenStack related projects at Rackspace. I’m no longer working on Drizzle, but I will still be involved in the MySQL and database ecosystems through future projects and conferences (see you at OpenSQL Camp). I will also still be working on a couple of Gearman related projects in my spare time. At OSCON I gave two presentations on Gearman and Drizzle, you can find the slides here.

The Five Steps to Open

One question that came up a few times over the past couple weeks is what the term “Open” means when a business or organization decides to adopt the open source philosophy. It turns out this means many different things to folks, and when an organization decides to go open, they need to make a decision on how open they are willing to be. Here are the various layers we’ve seen over the years:

  • Open API – You’ve decided to take the first step to being open and released a well documented API to work with your web service or project. Everything behind the API is still a black-box though.
  • Open Core – Beyond the APIs, you’ve decided to release part of the code open source, but you still keep some of the bits proprietary in an attempt to keep a competitive advantage. This is a hot debate lately on whether it is a viable Open Source business model.
  • Open Source – You’ve decided keeping some code proprietary doesn’t help, and actually even hurts your project or adoption. You put all of the code out in the open for everyone to see. While everyone can see all of the source code, there still isn’t a whole lot of interaction going on.
  • Open Development – Putting the source code out wasn’t enough. You want to enable users and external developers to be able to file bugs, submit patches, and track the development process to see what to expect next. This usually involves running your project on a public project site such as github or Launchpad.
  • Open Decision Making – You’ve postponed the inevitable for long enough. Feature requests and bug reports are pouring in, and the community wants to have a say in what gets prioritized. Should we focus only on stability? Performance? New features? Porting to mobile platforms? Let the community decide the direction of the project.

There have been examples of success for organizations who have stopped at each of these steps. Given the proper environment, any can work. My preference is to work on projects that are fully open, where company and organizational boundaries do not exist between developers and users. I’m thrilled to say that we’ve gone all in with OpenStack. We’re hosted on Launchpad and have a governance structure that allows all parties within the community to have a say in the future of the project.

Preventing Vendor Lock-in

During the Cloud Summit at OSCON, there was a debate titled: “Are Open APIs Enough to Prevent Lock-in?”. Most folks came to the conclusion that the answer is “no,” and I agree. While I feel open APIs are necessary, they are by no means sufficient. Even if a project is open source and allows for open development, it probably will not prevent vendor lock-in. The key is to provide some incentive for vendors to adopt and invest resources within a project. Much like customers don’t want vendor lock-in when choosing a platform, vendors do not want project or feature lock-in when choosing the software to power their business. Each vendor who chooses to participate must have the ability to voice their opinion on the direction of APIs, features, and other project priorities. This is why it is critical that any open source project must take all the steps described above to give the project a chance of being adopted and becoming the de facto standard. There is of course no guarantee that adoption and prevention of vendor lock-in will happen, but I see them as necessary steps.

This is another area where OpenStack has done the correct thing. We are planning on having another developer summit in November, and then once every six months after that time. All design discussions and decision making will happen in public forums such as the mailing list and IRC. We want all participants in the community to have a chance to respond to topics being discussed, and we believe the more participants we have, the more successful the project will be. Having many voices allows the project to be more applicable to different environments. For example, Rackspace and NASA have different requirements for their compute architectures, but they also share many components. Through open participation we can ensure all needs are accounted for. Much like the LAMP stack has powered universities, governments, and competing businesses, we hope OpenStack can do the same.

Contributor License Agreement (CLA)

During the past couple of weeks a few folks asked what the CLA was all about. When the foundations of OpenStack were forming, the requirement of having a CLA came up from the legal side. Having been involved with open source projects that had very invasive CLAs, initially I had quite a bit of concern. The CLA is actually quite innocuous, and it does NOT require assignment or dual-ownership of copyright. You are the sole owner of code you contribute. For all intents and purposes it is a signed version of the Apache 2.0 license, the CLA just makes these terms more explicit. The CLA is handled through digital signatures, so no papers, pens, or faxing is required.

Get Involved!

Expect to see more posts on my blog related to OpenStack topics. If you would like to get involved, you can join the IRC channel (#openstack on irc.freenode.net), join the mailing list, or start contributing code! There are even jobs around OpenStack popping up already!



MySQL Server Protocol Bug

July 25th, 2010

A few months ago I wrote a tool that verified MySQL and Drizzle protocol compatibility, along with testing for all sorts of edge cases. In analyzing protocol command interactions in mysqld, I found that the MySQL server will happily read an infinite amount of data if you exceed the maximum packet size while using a special sequence of protocol packets. The reasoning behind this behavior is so that the server can be polite and flush your data before sending a “max packet exceeded” error message, but perhaps there should be a limit to one’s politeness. What’s more interesting is that you can do this during the client handshake packet without authorization, so anyone could do this to any open MySQL server. The appropriate thing to do here would be to set some maximum limit of data to read and force a connection close when it is reached, otherwise your bandwidth and CPU could be consumed (essentially a DoS attack).
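The suggested fix, a hard cap on how much data the server is willing to read and discard before giving up on the connection, can be sketched in a few lines of Python. This is not the MySQL or MariaDB patch. The 1 MiB cap, the `drain_politely` function, and the `DrainLimitExceeded` exception are hypothetical names and values chosen for illustration, and `stream` stands in for any object with a blocking `read(n)` method.

```python
import io

MAX_DRAIN_BYTES = 1 << 20  # hypothetical cap, not a value taken from MySQL


class DrainLimitExceeded(Exception):
    """Raised when a peer sends more data than we are willing to discard."""


def drain_politely(stream, limit=MAX_DRAIN_BYTES, chunk_size=4096):
    """Read and discard data until end of stream, but never beyond `limit`.

    Returns the number of bytes discarded. Raises DrainLimitExceeded once
    more than `limit` bytes have arrived, so the caller can close the
    connection instead of reading forever.
    """
    drained = 0
    while True:
        # Ask for at most one byte past the limit so an overrun is detected
        # without reading an unbounded amount of data.
        data = stream.read(min(chunk_size, limit - drained + 1))
        if not data:
            return drained
        drained += len(data)
        if drained > limit:
            raise DrainLimitExceeded(f"peer sent more than {limit} bytes")


# A well-behaved oversized payload is flushed; an endless one is cut off.
flushed = drain_politely(io.BytesIO(b"x" * 10), limit=10, chunk_size=4)
```

In a real server the caller would catch the exception, skip the polite error message, and force the connection closed, which bounds the bandwidth and CPU an unauthenticated client can consume.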

This portion of code was ripped out entirely in Drizzle, so there are no risks there. I submitted this as a bug to MySQL and MariaDB back in February and they both have patches available to fix this as well. You can find the bug here and a patch here. If you have publicly accessible MySQL or MariaDB servers, you probably want to upgrade binaries or patch this.



Open Source Bridge Database Sessions

May 7th, 2010

Open Source Bridge, the “conference for open source citizens,” is right around the corner! The sessions were just announced and it’s going to be packed with quite a variety of really interesting talks. From open cloud computing topics to hardware hacking to language hacks (like HipHop from Facebook), I’m really looking forward to being there (I’m helping organize the event, but hopefully I’ll have time to attend sessions as well).

I wanted to point out a few of the great database talks:

I’m Attending Open Source Bridge – June 1–4, 2010 – Portland, OR

Beyond the DB talks, I’m also excited for a few other talks around high performance and high availability, from Facebook operations to Rasmus Lerdorf’s talk on making your PHP applications faster. I’ll also take the opportunity to shamelessly plug my own talk on writing high performance multi-core applications. There are also rumors of donut trucks, tesla coils, and scavenger hunts.

You should register to attend today, it’s going to be awesome.



Threads with Events

April 20th, 2010

Last week I was surprised to see this paper bubble back up on Planet MySQL. It describes the pros and cons of thread and event based programming for high concurrency applications (like a web server), arguing that thread-based programming is superior if you use an appropriate lightweight threading implementation. I don’t entirely disagree with this, but the problem is such a library does not exist that is standard, portable, and useful for all types of applications. We have POSIX threads in the portable Linux/Unix/BSD world, so we need to work with this. Other experimental libraries based on lightweight threads or “fibers” are really interesting as they can maintain your stack without all the normal overhead, but it is hard to get the scheduling correct for all application types. I would even argue that thread and event based programming is actually not all that different, it’s just a matter of how state is maintained (stack vs state variables) and how scheduling is performed.

The comparisons done in that paper also put a C-based web server using a co-routine threading library against a Java based server that depends on the poll() system call. I’m sorry, but this is comparing apples to oranges. First, you’re in the Java VM with a number of runtime components (like garbage collection) which may be getting in the way. Also, the standard poll() system call is not an efficient event-handling mechanism; it’s much better to use epoll or some other kernel-based handling mechanism.

One high-concurrency userland threading implementation I do like is in Erlang. Erlang processes are extremely lightweight and I’ve written apps that depend heavily on them. One interesting application I saw was caching objects where each object got its own Erlang process. This put a whole new spin on cache management, and it looked like it could actually scale reasonably well. The “problem” with Erlang, which may or may not be a problem depending on your requirements, is that there is still a bit of overhead running byte-code in a VM, as well as it being a functional language. I love functional programming, but I’ve found it still ties most developers’ heads in knots if they don’t have a reason to use it regularly. For open source projects trying to build a contributor community, it can act as one more hurdle.

So, what is the “best” paradigm?

Back in 2000 some colleagues and I wrote a hybrid thread-event library that would create one event-handler instance per thread, and connections would be spread across the pool of event-handling threads. I believe this gave the best of both worlds, and I saw high throughputs with fairly minimal overhead. I wrote a number of servers based on this architecture, including HTTP, IMAP, POP3, and DNS, and with each server type this model proved to be efficient and scalable. Ultimately the best architecture depends on your application. If you never intend to have many connections, and your application has long-running computations, one-thread-per-connection would probably be best. If you need to handle large numbers of connections and have short, non-blocking request processing, event-based scales extremely well. You can of course create a hybrid of these two and have all connections managed by event threads and asynchronous queues to dedicated processing threads for heavy request handling (this is sort of what I did in the C Gearman Job Server).
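The hybrid layout described above (one event handler per thread, connections spread across a pool of such threads) can be sketched in Python with the standard `selectors` and `threading` modules. This is only an illustration of the shape, not the original library or the Gearman server: the `EventWorker` and `WorkerPool` classes, the round-robin dispatch, and the echo handler are all assumptions, and Python’s GIL means the sketch will not show the CPU scaling a C version would.

```python
import queue
import selectors
import socket
import threading


class EventWorker(threading.Thread):
    """One thread that owns one event loop (selector) and many connections."""

    def __init__(self):
        super().__init__(daemon=True)
        # DefaultSelector picks epoll or kqueue where the platform has them.
        self.selector = selectors.DefaultSelector()
        self.incoming = queue.Queue()  # connections handed over by the pool
        self.stopping = threading.Event()

    def add(self, conn):
        self.incoming.put(conn)

    def run(self):
        while not self.stopping.is_set():
            self._register_new()
            # A short timeout lets the loop notice new connections and stop
            # requests without needing a separate wake-up channel.
            for key, _ in self.selector.select(timeout=0.05):
                self._echo(key.fileobj)
        self.selector.close()

    def _register_new(self):
        while True:
            try:
                conn = self.incoming.get_nowait()
            except queue.Empty:
                return
            conn.setblocking(False)
            self.selector.register(conn, selectors.EVENT_READ)

    def _echo(self, conn):
        try:
            data = conn.recv(4096)
        except BlockingIOError:
            return
        if data:
            conn.sendall(data)  # fine for small replies in a sketch
        else:
            self.selector.unregister(conn)
            conn.close()


class WorkerPool:
    """Spreads connections across a fixed set of event-handling threads."""

    def __init__(self, size):
        self.workers = [EventWorker() for _ in range(size)]
        self._next = 0
        for worker in self.workers:
            worker.start()

    def dispatch(self, conn):
        worker = self.workers[self._next % len(self.workers)]
        self._next += 1
        worker.add(conn)

    def stop(self):
        for worker in self.workers:
            worker.stopping.set()
        for worker in self.workers:
            worker.join()
```

An accept loop would call `pool.dispatch(conn)` for each new connection. Each connection then stays on one thread for its whole life, so per-connection state needs no locking, while the pool as a whole can use several cores in a language without a global lock.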

There is no single correct answer, so take a look at your options before deciding how to approach your own applications. Don’t be afraid to create hybrids as well. Regardless of which paradigm you choose, concurrent programming can be hard, especially at the lower levels. There have been a number of higher level abstractions to help developers, from new libraries to new languages, but most of these come with a cost in performance or flexibility. When you need to squeeze every bit of performance out of your application, you will most likely end up in C or C++ dealing with these issues directly.

This is actually one of the problems I’m attempting to address with the Scale Stack Event modules. I’m trying to create a healthy level of abstraction on hybrid thread/event based applications so you don’t have any overhead or limitations while a lot of the common headaches are taken care of for you. If you have a need for such a system, get in touch, I’d be interested to talk. Since it is BSD licensed you can use it in any application, including commercial.



Drizzle Developer Day Recap

April 20th, 2010

Last Friday we held the Drizzle Developer Day at the Santa Clara convention center, taking advantage of the fact that many developers and interested contributors were already there for the MySQL Conference & Expo. Minus a few small glitches like wifi and pizza consumption location, I would say it was an overall success. There were a lot of new folks interested in learning about Drizzle and getting the server up and running. The day was organized by splitting folks up into small groups with matching interests, and then switching up groups every hour or so. We had groups focused on replication, documentation, writing plugins, the optimizer, Boots (the new client tool), and a “getting started” group.

The first group I participated in was about Boots, the new command line tool developed by a group of students I sponsored at Portland State University. One of the students who created it was there (Chromakode), so he gave a demo of all the features and ways you could extend it for custom use. Baron from Percona was there and had a lot of good feedback on what is needed by DBAs, as well as for monitoring/troubleshooting problems. Some of the new features in Boots will help quite a bit with this since you are able to write simple Python scripts that work inside the program rather than having to write a bunch of shell processing code around the existing tool. This extended into a discussion about testing tools for production systems, and how to capture and replay production traffic with the same timing and load (or increased load).

The next group I sat in on was around creating plugins. There were topics like getting started with writing your own plugin, a script to generate a skeleton for your own, and more advanced topics like dependency tracking. Since I used the same pandora-plugin system for another project and added dependency tracking there, I am interested in getting dependency tracking into Drizzle. We didn’t get to any code, but this will require some changes in how plugins are loaded in the Drizzle kernel.

I had to leave a little early to catch my flight home, but for the second half of the day I bounced between helping a group get started from scratch (mainly installing dependencies and getting Drizzle built and running) and the other group topics. Thanks to everyone who showed up and participated. We all had some great conversations that provided valuable feedback on directions to take moving forward.



Boots: A Modular CLI for Databases

April 9th, 2010

Back in October I wrote about a student group I was sponsoring to create a new command line tool for Drizzle. The group wrapped up their part of the project (the term ended), and we now have a new tool called Boots! A few of the developers are still active in the project, and I’m planning to get involved more as well. We also have a couple of students interested in hacking on it for Drizzle’s Google Summer of Code.

Boots is written in Python and aims to replace the previous ‘drizzle’ tool (which was modified from the ‘mysql’ command line tool). It doesn’t support everything that the old tool has yet (like tab completion), but it adds some new features. For example, there are multiple ‘lingos’, or modular languages, that can be used to communicate with the shell. This allows you to use plain SQL, Python, or even LISP to interact with the shell. One of the lingos, piped-sql, lets you do interesting things such as:

shell$ boots -u root -h 127.0.0.1 -l pipedsql
Boots (v0.2.0)
127.0.0.1:3306 (server v5.1.40)
> SELECT * FROM mysql.user; | csv_out("users.csv")
5 rows in set (0.06s server | +0.00s working)
> Boots quit.
shell$ cat users.csv
localhost,root,,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,,,,,0,0,0,0
...
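To make the piping idea concrete, here is a small Python sketch of the general pattern: run a query, then feed the rows through one or more stages, the last of which writes a CSV file. This is not Boots’ actual implementation or API. The `run_piped` helper and this `csv_out` are hypothetical stand-ins written for illustration, and the standard library’s `sqlite3` takes the place of a real server connection.

```python
import csv
import sqlite3


def csv_out(path):
    """Return a pipeline stage that writes every row it receives to `path`."""

    def stage(rows):
        count = 0
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            for row in rows:
                writer.writerow(row)
                count += 1
        return count  # number of rows written

    return stage


def run_piped(connection, sql, *stages):
    """Run `sql`, then pass the result through each stage in order."""
    result = connection.execute(sql)
    for stage in stages:
        result = stage(result)
    return result


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE user (host TEXT, name TEXT)")
    conn.execute("INSERT INTO user VALUES ('localhost', 'root')")
    written = run_piped(conn, "SELECT * FROM user", csv_out("users.csv"))
    print(written, "rows written to users.csv")
```

Because each stage is just a callable that takes rows and returns something, further stages (filters, formatters) can be chained in the same way.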

It’s ready to use, so download and install it now! If you have any features you would like to see, please get in touch through the Boots blueprints, mailing list, or #boots IRC channel on irc.freenode.net. One of the original developers from the project, Chromakode (the same from the awesome xkcd.com shell), will also be attending the MySQL Conference & Expo next week and helping out with the Drizzle booth. Come find one of us to talk more about the project there!



Scale Stack and Database Proxy Prototype

April 8th, 2010

Back in January when I was between jobs I had a free weekend to do some fun hacking. I decided to start a new open source project that had been brewing in the back of my head and since then have been poking at it on the weekends and an occasional late night. I decided to call it Scale Stack because it aims to provide a scalable network service stack. This may sound a bit generic and boring, but let me show a graph of a database proxy module I slapped together in the past couple days:

[Image: Database Proxy Graph]

I set up MySQL 5.5.2-m2 and ran the sysbench read-only tests against it with 1 to 8192 threads. I then started up the database proxy module built on Scale Stack so sysbench would route through that, and you can see the concurrency improved quite a bit at higher thread counts. The database module doesn’t do much: it simply does connection concentration, mapping M to N connections, where N is a fixed parameter given at startup. In this case I always mapped all incoming sysbench connections down to 128 connections between Scale Stack and MySQL. It also uses a fixed number of threads and is entirely non-blocking. As you can see, the max throughput around 64 threads is a bit lower, but I’ve not done much to optimize this yet (there should be some easy improvements in places where I simply stuck in a mutex instead of using a lockless queue). It’s only a simple proof-of-concept module to see how well this would work, but it’s a start to a potentially useful module built on the other Scale Stack components. One other thing to mention is that these tests were run on a single 16-core Intel machine. I’d really like to test this with multiple machines at some point.
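The core of connection concentration can be sketched in a few lines of Python. This is only an illustration of the M-to-N mapping, not the Scale Stack module: it uses blocking threads and a queue, whereas the module described above is non-blocking with a fixed thread count. `ConnectionConcentrator`, `backend_factory`, and `backend_count` are hypothetical names.

```python
import queue
from contextlib import contextmanager


class ConnectionConcentrator:
    """Shares a fixed set of N backend connections among any number of clients."""

    def __init__(self, backend_factory, backend_count):
        # Open all N backend connections once, at startup.
        self._idle = queue.Queue()
        for _ in range(backend_count):
            self._idle.put(backend_factory())

    @contextmanager
    def backend(self):
        """Borrow a backend connection for the duration of one request."""
        conn = self._idle.get()  # blocks while all N connections are busy
        try:
            yield conn
        finally:
            self._idle.put(conn)  # return it for the next waiting client
```

A client handler would wrap each request in `with concentrator.backend() as conn:`. However many clients connect, the backend never sees more than `backend_count` connections, which is the property that lets the database avoid the cost of thousands of concurrent sessions.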

So, what is Scale Stack?

Check out the website for a simple overview of what it is. The goal is to pick up where the operating system kernel leaves off with the network stack. It is written in C++ and is extremely modular, with only the module loader, option parsing, and basic logging in the kernel library. It uses Monty Taylor’s pandora-build autoconf files to provide a sane modular build system, along with some modifications I made so dependency tracking is done between modules. You can actually use it to write modules that would do anything; I’m just most interested in network service based modules. The kernel/module loader is also just a library, so you can embed this into existing applications as well. Some of the modules I’ve written for it are a threaded event handling module based on libevent/pthreads and a TCP socket module. There is also an echo server and simple proxy module I created while testing the event and socket modules. The database proxy module builds on top of the event and socket modules. The code is under the BSD license and is up on Launchpad, so feel free to check it out and contribute. If you need a base to build high-performance network services on, you should definitely take a look and talk with me.
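Dependency tracking between modules boils down to loading each module only after everything it depends on. A minimal Python sketch of that ordering step follows, using the standard library’s `graphlib` (Python 3.9+). It is not the Scale Stack or pandora-build code; `load_order` is a hypothetical helper, and the module names and dependency edges in the usage note are assumptions made for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def load_order(dependencies):
    """Return module names ordered so dependencies come before dependents.

    `dependencies` maps each module name to the names it depends on.
    Raises LookupError for an undeclared dependency and
    graphlib.CycleError for a circular one.
    """
    known = set(dependencies)
    for name, needs in dependencies.items():
        missing = set(needs) - known
        if missing:
            raise LookupError(f"{name} needs unknown modules: {sorted(missing)}")
    return list(TopologicalSorter(dependencies).static_order())
```

For example, given `{"event": [], "socket_tcp": ["event"], "database_proxy": ["event", "socket_tcp"]}`, the event module is loaded first and the database proxy last, while a cycle or an undeclared dependency is rejected before anything is loaded.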

What’s up next?

I have a long list of things I would like to do with this, but first up are still some basics. This includes other socket type modules like TLS/SSL, UDP, and Unix sockets. Next are some more protocol modules such as Drizzle, a real MySQL protocol module, and others like HTTP, Gearman, and memcached. It’s fairly trivial to write these since the socket modules handle all buffering and provide a simple API. As for the DatabaseProxy module, I’d like to rework it so it’s not MySQL protocol specific, integrate other protocol modules, improve performance, add multi-tenancy support for quality-of-service queuing based on account rules, and work through a laundry list of other features I won’t bore you with right now.

I also have plans for other services besides a database proxy, especially one that could combine a number of protocols into a generic URI server with pluggable handlers so you can do some interesting translations between modules (like Apache httpd but not http-centric). For example, think of the crazy things you can do with Twisted for Python, but now with a fast, threaded C++ kernel. I also still need to experiment with live reloading of modules, but I’m not sure if this will be worthwhile yet.

If any of this sounds interesting, get in touch, I’d love to have some help! I’ll have some blog posts later on how to get started writing modules, but for now just take a look at the existing modules. The EchoServer is a good place to start since it is pretty simple. Also, if you’ll be at the MySQL Conference and Expo next week, I’d be happy to talk more about it then.

