Archive for the ‘programming’ Category

Fun with Bash :: one liners

April 10th, 2012

Here are some quick and easy bash commands to solve everyday problems I run into. Comment and leave some of your own if you like. I might update this post with new ones over time. These are just some common ones.

Iterate through directory listing and remove the file extension from each file
ls -1 | while read -r each; do new=$(echo "$each" | sed 's/\(.*\)\..*/\1/'); [ "$each" != "$new" ] && echo "$new" && mv "$each" "$new"; done
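
If parsing ls output makes you nervous (odd filenames can trip it up), here is a minimal sketch of the same idea using a glob and bash parameter expansion instead. The function name strip_extensions and the "skip if the target already exists" rule are my own assumptions, not part of the one liner above.

```shell
#!/usr/bin/env bash
# Sketch: remove the last ".suffix" from every regular file in a directory.
# strip_extensions is a hypothetical helper name used for illustration.
strip_extensions() {
  local dir="${1:-.}" each new
  for each in "$dir"/*.*; do
    [ -f "$each" ] || continue          # skip directories and an unmatched glob
    new="${each%.*}"                    # drop everything from the last dot onward
    if [ -e "$new" ]; then              # assumption: never overwrite an existing file
      echo "skip: $new already exists" >&2
      continue
    fi
    mv -- "$each" "$new"
  done
}

# Usage: strip_extensions /path/to/dir   (defaults to the current directory)
```

Files without a dot in their name, and dotfiles such as .bashrc, are not matched by the glob and are left alone.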

Output relevant process info, and nothing else
ps axo "user,pid,ppid,%cpu,%mem,tty,stime,state,command" | grep -v "grep" | grep "your-string-here"

Set up a SOCKS5 proxy on localhost port 5050, to tunnel all traffic through a destination server
ssh -N -D 5050 username@destination_server

Set up a SOCKS5 proxy via a remote TOR connection, using local port 5050 and remote TOR port 9050
ssh -L 5050:127.0.0.1:9050 username@destination_server

Display text or code file contents to screen but don't display any # comment lines
sed -e '/^#/d' "$file_name_here"

Same as above but replacing # lines with blank lines
sed -e '/^#/g' "$file_name_here"

Find all symlinks in the current directory and subdirs
find ./ -type l -exec ls -l {} \;

Find all executable files in current directory and subdirs
find ./ -type f -perm -o+rx -exec ls -ld '{}' \;

Remove all files matching the input string
echo -n "filename match to remove [rm -i]: " && read -r f; find ./ -name "$f" -exec rm -i {} \;

Display the ten largest files and directories in current dir and subdirs
du -a ./ | sort -n -r | head -n 10

Display all files and directories in current dir and subdirs, largest first
du -a ./ | sort -n -r

Generate an MD5 hash for the input string (not file)
Linux: echo -n "str: " && read x && echo -n "$x" | md5sum
OSX: echo -n "str: " && read x && echo -n "$x" | md5

Display a summary of all files in current and subdirs
for t in files links directories; do echo `find . -type ${t:0:1} | wc -l` $t; done 2> /dev/null

PlanetMySQL Voting: Vote UP / Vote DOWN

Living in the Prove It Culture

March 7th, 2012
Engineering cultures differ from shop to shop. I have been in the same culture for 13 years so I am not an expert on what all the different types are. Before that I was living in Dilbert world. The culture there was really weird. The ideas were never yours. It was always some need that some far-off person had. A DBA, a UI "expert" and some product manager would dictate what code you wrote. Creativity was stifled and met with resistance.

I then moved to the early (1998) days of the web. It was a start up environment. In the beginning there were just two of us writing code. So, we thought everything we did was awesome. Then we added some more guys. Lucky for us we mostly hired well. The good hires were type A personalities that had skills we didn't have. They challenged us and we challenged them. On top of that, we had a CEO who had been a computer hacker in his teens. So, he had just enough knowledge to challenge us as well. Over the years we kept hiring more and more people. We always asked in the interview if the person could take criticism and if they felt comfortable defending their ideas. We decided to always have a white board session. We would ask them questions and have them work it out on a white board or talk it out with us in a group setting. The point of this was not to see if they always knew the answer. The point was to see how they worked in that setting. Looking back, the hires that did not work out also did not excel in that phase of the interview. The ones that have worked out always questioned our methods in the interview. They did not belittle our methods or dismiss them. They just asked questions. They would ask if we had tried this or that. Even if we could quickly explain why our method was right for us, they still questioned it. They challenged us.

When dealing with people outside the engineering team, we subconsciously applied these same tactics. The philosophy came to be that if you came to us with an idea, you had to throw it up on the proverbial wall. We would then try to knock it down. If it stuck, it was probably a good idea. Some people could handle this and some could not. The ones that could not handle that did not always get their ideas pushed through. It may not mean they were bad ideas. And that is maybe the down side of this culture. But, it has worked pretty well for us.

We apply this to technology too. My first experience on Linux was with RedHat. The mail agent I cut my teeth on was qmail. I used djbdns. When Daniel Beckham, our now director of operations, came on, he had used sendmail and bind. He immediately challenged qmail. I went through some of the reasons I preferred it. He took more shots. In the end, he agreed that qmail was better than sendmail. However, his first DNS setup for us was bind. It took a few more years of bind hell for him to come around to djbdns.

When RedHat splintered into RedHat Enterprise and Fedora, we tried out Fedora on one server. We found it to be horribly unstable. It got the axe. We looked around for other distros. We found a not very well known distro that was known as the ricer distro of the Linux world called Gentoo. We installed it on one server to see what it was all about. I don't remember now whose idea it was. Probably not mine. We eventually found it to be the perfect distro for us. It let us compile our core tools like Apache, PHP and MySQL while at the same time using a package system. We never trusted RPMs for those things on RedHat. Sure, bringing a server online took longer but it was so worth it. Eventually we bought in and it is now the only distro in use here.

We have done this over and over and over. From the fact that we all use Macs now thanks to Daniel and his willingness to try it out at our CEO's prodding to things like memcached, Gearman, etc. We even keep evaluating the tools we already have. When we decided to write our own proxy we discounted everything we knew and evaluated all the options. In the end, Apache was known and good at handling web requests and PHP could do all we needed in a timely, sane manner. But, we looked at and tested everything we could think of. Apache/PHP had to prove itself again.

Now, you might think that a culture of skepticism like this would lead to new employees having a hard time getting any traction. Quite the opposite. Because we hire people that fit the culture, they can have a near immediate impact. We have a problem I want solved and a developer that has been here less than a year suggested that Hadoop may be a solution, but was not sure we would use it. I recently sent this in an email to the whole team in response to that.
The only thing that is *never* on the table is using a Windows server. If you can get me unique visitors for an arbitrary date range in milliseconds and it requires Hadoop, go for it.
You see, we don't currently use Hadoop here. But, if that is what it takes to solve my problem and you can prove it and it will work, we will use it.

Recently we had a newish team member suggest we use a SAN for our development servers to use as a data store. Specifically he suggested we could use it to house our MySQL data for our development servers. We told him he was insane. SANs are magical boxes of pain. He kept pushing. He got Dell to come in and give us a test unit. Turns out it is amazing. We can have a hot copy of our production database on our dev slices in about 3 minutes. A full, complete copy of our production database in 3 minutes. Do you know how amazing that is? Had we not had the culture we do and had not hired the right person that was smart enough to pull it off and confident enough to fight for the solution, we would not have that. He has been here less than a year and has had a huge impact on our productivity. There is talk of using this in production too. I am still in the "prove it" mode on this. We will see.
I know you will ask how our dev db works, here you go:
1. Replicate production over VPN to home office
2. Write MySQL data on SAN
3. Stop replication, flush tables, snapshot FS
4. Copy snapshot to a new location
5. On second dev server, umount SAN, mount new snapshot
6. Restart MySQL all around
7. Talk in dev chat how bad ass that is
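
For the curious, steps 3 through 6 might look roughly like the dry-run sketch below. It only prints the commands. The paths, the service name, and the san_snapshot_create / san_snapshot_copy commands are hypothetical placeholders, since the real snapshot tooling depends on the SAN vendor and is not described here.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the dev database refresh. Nothing is executed, only echoed.
# Every path and the san_snapshot_* commands are hypothetical placeholders.
refresh_dev_db() {
  local data_dir="${1:-/mnt/san/mysql}"        # assumed SAN mount holding MySQL data
  local snap_dev="${2:-/dev/san/mysql-snap}"   # assumed device for the copied snapshot
  run() { echo "+ $*"; }                       # replace echo with real execution when ready

  run mysql -e "STOP SLAVE"                    # step 3: stop replication
  run mysql -e "FLUSH TABLES"                  # step 3: flush tables
  run san_snapshot_create "$data_dir"          # step 3: placeholder, vendor-specific
  run san_snapshot_copy "$data_dir" "$snap_dev" # step 4: placeholder, vendor-specific
  run umount "$data_dir"                       # step 5: on the second dev server
  run mount "$snap_dev" "$data_dir"            # step 5: mount the new snapshot
  run service mysql restart                    # step 6: service name is an assumption
}

# Usage: refresh_dev_db            (prints the plan with default paths)
```

Printing the plan first fits the "prove it" approach: the team can argue about each line before anything touches a real server.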

We had a similar thing happen with our phone system. We had hired a web developer that previously worked for a company that created custom Asterisk solutions. When our proprietary PBX died, he stepped up and proved that Asterisk would work for us. Not a job for a web developer. But he was confident he could make it work. It now supports 3 offices and several home bound users world wide. He also had only been here a short time when that happened.

Perhaps it sounds like a contradiction. It may sound like we just hop on any bandwagon technology out there. But no. We still use MySQL. We are still on 5.0 in fact. It works. We are evaluating Percona 5.5 now. We tried MySQL 5.1. We found no advantage and the Gentoo package maintainer found it to be buggy. So, we did not switch. We still use Apache. It works. Damn well. We do use Apache with the worker MPM with PHP which is supposedly bad. But, it works great for us. But, we had to prove it would work. We ran a single node with worker for months before trusting it. Gearman was begrudgingly accepted. The idea of daemonized PHP code was not a comforting one. But once you write a worker and use it, you feel like a god. And then you see the power. Next thing you know, it is a core, mission critical part of your infrastructure. That is how it is with us now. In fact, Gearman has gone from untrusted to the go to tech. When someone proposes a solution that does not involve Gearman, someone will ask if part of the problem can be solved using Gearman and not whatever idea they have. There is then a discussion about why it is or is not a good fit. Likewise, if you want to build a daemon to listen on a port and answer requests, the question is "Why can't you just use Apache and a web service?" And it is a valid question. If you can solve your problem with a web service on already proven tech, why build something new?

This culture is not new. We are not unique. But, in a world of "brogramming" where "engineers" rage on code that is awesome before it is even working and people are said to be "killing it" all the time, I am glad I live in a world where I have to prove myself every day. I am the most senior engineer on the team. And even still I get shot down. I often pitch an idea in dev chat and someone will shoot it down or point out an obvious flaw. Anyone, and I mean anyone, on the team can question my code, ideas or decisions and I will listen to them and consider their opinion. Heck, people outside the team can question me too. And regularly do. And that is fine. I don't mind the questions. I once wrote here that I like to be made to feel dumb. It is how I get smarter. I have met people that thought they were smarter than everyone else. They were annoying. I have interviewed them. It is hard to even get through those interviews.

Is it for everyone? Probably not. It works for us. And it has gotten us this far. You can't get comfortable though. If you do foster this type of culture, there is a risk of getting comfortable. If you start thinking you have solved all the hard problems, you will look up one day and realize that you are suffering. Keep pushing forward and questioning your past decisions. But before you adopt the latest and greatest new idea, prove that the decisions your team makes are the right ones at every step. Sometimes that will take a five minute discussion and sometimes it will take a month of testing. And other times, everyone in the room will look at something and think "Wow that is so obvious how did we not see it?" When it works, it is an awesome world to live in.

Making rpm builds a first class citizen: How?

January 20th, 2012

In my previous post I explained why I believe the production of RPM and DEB packages should be more integrated with the rest of your development process. Now it's time to look into how you can put the RPM build scripts inside your main source code repository, and in particular how I did that to produce RPM packages for Drizzle.


Making rpm builds a first class citizen: Why?

January 20th, 2012

Last weekend I released rpm files for the latest Drizzle Fremont beta (announcement). As part of that work I've also integrated the spec file and other files used by the rpmbuild into the main Drizzle bzr repository (but not yet merged into trunk). In this post I want to explain why I think this is a good thing, and in a follow up post I'll go into what I needed to do to make it work.

(And speaking of stuff you can download, phpMyAdmin 3.5.0-alpha1 now supports Drizzle!)


Could closed core prove a more robust model than open core?

December 2nd, 2011

When participating recently in a sprint held at Google to document four free software projects, I thought about what might have prompted Google to invest in this effort. Their willingness to provide a hotel, work space, and food for some thirty participants, along with staff support all week long, demonstrates their commitment to nurturing open source.

Google is one of several companies for which I'll coin the term "closed core." The code on which they build their business and make their money is secret. (And given the enormous infrastructure it takes to provide a search service, opening the source code wouldn't do much to stimulate competition, as I point out in a posting on O'Reilly's radar blog). But they depend on a huge range of free software, ranging from Linux running on their racks to numerous programming languages and libraries that they've drawn on to develop their services.

So Google contributes a lot back to the free software community. They release code for many non-essential functions. They promote the adoption of standards such as HTML 5. They have been among the first companies to offer APIs for important functions, including their popular Google Maps. They have opened the source code to Android (although its development remains under their control), which has been the determining factor in making Android devices compete with the arguably more highly-functioning iOS products. They even created a whole new programming language (Go) and are working on another.

Google is not the only "closed core" company (for instance, Facebook has also built their service around APIs and released their Cassandra project). Microsoft has a whole open source program, including some important contributions to health IT. Scads of other companies, such as IBM, Hewlett Packard, and VMware, have complex relationships to open source software that don't fit a simple "open core" or "closed core" model. But the closed core trend represents a fertile collaboration between communities and companies that have businesses in specific areas. The closed core model requires businesses to determine where their unique value lies and to be generous in offering the public extra code that supports their infrastructure but does not drive revenue.

This model may prove more robust and lasting than open core, which attracts companies occupying minor positions in their industries. The shining example of open core is MySQL, but its complex status, including a long history of dual licensing and simultaneous development by several organizations, makes it a difficult model from which to draw lessons about the whole movement. In particular, Software as a Service redefines the relationships that the free software movement has traditionally defined between open and proprietary. Deploying and monitoring the core SaaS software creates large areas for potential innovation, as we saw with Cassandra, where a company can benefit from turning their code into a community project.



What’s New in CFEngine 3: Making System Administration Even More Powerful

October 28th, 2011

CFEngine is both the oldest and the newest of the popular tools for automating site administration. Mark Burgess invented it as a free software project in 1993, and years later, as deployments in the field outgrew its original design he gave it a complete rethink and developed the powerful concept of promise theory to make it modular and maintainable. In this guise as version 3, CFEngine stands along with two other pieces of free software, Puppet and Chef, as key parts of enterprise computing. Along the way, Burgess also started a commercial venture, CFEngine AS, that maintains both the open source and proprietary versions of CFEngine.

Diego Zamboni has recently taken the position of Senior Security Advisor at CFEngine AS and is writing a book for O'Reilly on CFEngine 3. I talked to him this week about the recent new release of the open source version (3.2.4) in tandem with a new commercial release of CFEngine 3 Nova (version 2.1.3). Here are excerpts of what he has written to introduce CFEngine 3.

CFEngine 3 is fine-tuned to the features and design that make it possible to automate very large numbers of systems in a scalable and manageable way. CFEngine 3 is also very lightweight--its binaries normally use less than 30MB of disk space, it requires a single TCP port to communicate among servers and clients, and it has been designed to be very resource-efficient. CFEngine 3 can run on everything from smartphones to supercomputers.

CFEngine 3 is different from many other automation mechanisms in that you do not need to tell it what to do. Instead, you specify the state in which you wish the system to be, and CFEngine 3 will automatically and iteratively decide the actions to take to reach the desired state, or as close to it as possible. Underlying this ability is a powerful theoretical model known as Promise Theory, which was initially developed for CFEngine 3, but which has also found other applications in Computer Science and in other fields such as Economics and Organization.

This allows you to develop building blocks for complex promises that remain readable and manageable because the lower-level components are encapsulated. Each promise represents the desired state of certain parts of the system. At the lowest level, these are some of the things that you can express to CFEngine 3 as desired states:

  • "Make sure file /foo/bar contains line xyz"

  • "Make sure user foobar exists/does not exist"

  • "Make sure process foo is/is not running"
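
To make the "desired state" idea concrete, here is a minimal shell sketch of the first promise in that list. This is not CFEngine policy syntax, only an illustration of a convergent operation: running it any number of times leaves the file in the same state. The function name ensure_line is a hypothetical helper invented for this example.

```shell
#!/usr/bin/env bash
# Sketch: "make sure file F contains line L", written as an idempotent check.
# ensure_line is a hypothetical name; this is plain shell, not CFEngine syntax.
ensure_line() {
  local file="$1" line="$2"
  touch "$file"                                   # the file must exist before we inspect it
  # -x matches whole lines, -F treats the text literally, -q suppresses output
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage: ensure_line /foo/bar xyz
```

The caller states the end result rather than the action; whether anything needs to be appended is decided at run time, which is the property the promises above rely on.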

At a higher level of abstraction, you can encapsulate CFEngine 3 operations and express high-level desired states:

  • "Make sure all web servers have Apache installed"

  • "Make sure all root accounts have the same, centrally-designated password"

  • "Make sure parameters EnableDNS and AllowRoot are disabled on all sshd configurations"

And at an even higher level, you can express top-level desired states like these:

  • "Configure host xyz as a database server"

  • "Create a new cluster of VMs to use as web servers"

So what's in the new versions? CFEngine 3 Nova includes:

  • System monitoring extensions, which extend the monitoring capabilities of CFEngine 3 Community (to monitor system state such as CPU load, number of processes and network connections, disk utilization, etc.) to allow for defining custom monitors for any type of information.

  • Support for manipulating virtual machines on Xen, VMware ESX, and KVM.

  • Native Windows support.

  • Flexible searching of reports in a brand new scalable interface that supports thousands of hosts on a single hub.

  • Improved machine learning and anomaly monitoring for diagnostics and capacity planning. Additional sensors have been added to detect operating system performance and behavioral trends, especially on Linux kernels.

  • The NoSQL document-oriented database MongoDB, used instead of MySQL for all storage on Nova's Mission Portal.

  • Generic JSON return values so that users can customize the interface and JQuery framework of the Mission Portal. This allows direct access to data in a way that makes higher levels of scripting more effective.

CFEngine 3 Community also includes a large number of improvements, all of which are in Nova too:

  • A vastly improved bootstrapping process, which makes it easy to get new CFEngine 3 servers and clients up and running with very little manual configuration.

  • Support for environments, which are a way of grouping hosts according to arbitrary definitions. This makes it very easy to define, for example, "development," "testing," and "production" environments for CFEngine 3 policies.

  • The new cf-report command, available in both Community and Nova, which allows extraction of data and generation of reports from the command line. It can produce reports both about the behavior of the current CFEngine 3 environment (policies, hosts, etc.) and about internal information, such as a CFEngine 3 syntax summary.

  • Many performance and concurrency improvements and bug fixes.

  • Several new functions and parsing improvements, including and(), not(), and or() functions, to ease writing of complex class expressions.

  • A new and improved Emacs mode for editing CFEngine 3 policy files.


Developer Week in Review: These things always happen in threes

October 26th, 2011

Fall is being coy this year in the Northeast. We've been having on and off spells of very mild, almost summer-like weather over the last few weeks. That trend seems to be finally ending, alas, as there is possible snow forecasted for the weekend in New Hampshire. As the old joke goes, if you don't like the weather here, just wait five minutes.

The fall also brings hunting to the area. The annual moose season just concluded (you need to enter a special lottery to get a moose permit), but deer season is just about to open. My son and I won't be participating this year, but we recently purchased the appropriate tools of the trade, a shotgun to hunt in southern NH (where you can't hunt deer with a rifle) and a Mosin Nagant 91/30 for the rest of the state. The latter is probably overkill, but my son saved up his pennies to buy it, being a student of both WWII and all things Soviet. Hopefully, he won't dislocate his shoulder firing it ...

Meanwhile, in the wider world ...

John McCarthy: 1927-2011

It's been a sad month for the computer industry, with the deaths of Steve Jobs and Dennis Ritchie already behind us. Less well known, but equally influential, AI pioneer and LISP creator John McCarthy passed away on Sunday. McCarthy was involved in the creation of two of the preeminent AI research facilities in the world, at MIT and Stanford, and he is generally credited with coining the term "artificial intelligence."

LISP has had its periods of popularity, peaking in the 1980s, but it's never been a mainstream language in the way that C, FORTRAN, BASIC or Java was. What people tend to forget is just how old LISP really is. Only FORTRAN, COBOL and ALGOL are older than LISP, which came on the scene in 1958. Many of the concepts we take for granted today, such as closures, first saw light in LISP. It also lives in the hearts of Emacs and AutoCAD, among others, and LISP is the language used in much of the groundbreaking artificial intelligence work.

On a side note, when I first met my wife and told her I was involved in the AI field, she gave me a truly strange look. She had a BA in animal science, you see, and in that field "AI" stands for artificial insemination.


Someone finally admits the dirty truth about the GPL

If you listen to Richard Stallman, the GPL is all about being a coercive force that will eventually drive all software to be free (as in freedom.) Those of us who watch such things have noticed that it has a paradoxical effect, however. Companies like MySQL (now Oracle) use it the same way that drug dealers offer free samples to new customers. "The first one's free, but you'll be back for more." In other words, they get you hooked by offering a GPL version, but cash in when you want to use their product for commercial purposes because the GPL is too dangerous for most companies.

Now, Python developer Zed Shaw has brought the GPL's dirty little secret into the light of day. In a particularly NSFW rant, Shaw explains why he chooses to use the GPL these days. In short, it's because he's sick of developers at companies getting to be heroes by using his stuff and getting the glory. "I use the GPL to keep you honest. You now have to tell your bosses you're using my gear. And it will scare the piss out of them." He goes on to say that he's using the GPL as a stick to force companies to pay him to use his software.

This goes right to the very core of the debate about what free/open software should be about. Is it a tool to make all software free? Is it a way to allow "good" people (i.e., non-commercial users) to have access while punishing "bad" people (professional developers)? Personally, I'm thrilled that Southwest Airlines uses a Java library I created for another client years ago and open sourced, but evidently some people (especially those who aren't getting paid to maintain open-source projects by a day job) want to get paid for their efforts.

I find the logic a bit questionable. I don't see a lot of difference between a free software developer who holds corporate users' feet to the fire and a commercial software developer. Sure, it still allows hobbyists and educational users to use the software for free, but it's actually acting to discourage companies from getting involved in FL/OSS by encouraging the wrong model. When companies use open-source software in their products, they are more likely to contribute back to the project and to open source other non-critical code they produce. If they are paying a developer for it, they are much less likely to contribute back.

The Steve Jobs movie: I predict lots of people walking and talking

With the Steve Jobs biography currently sitting at the top of Amazon's bestseller list, Sony Pictures is wasting no time getting a film adaptation underway. The current buzz is that Aaron Sorkin, creator of "The West Wing" and winner of the Academy Award for his adaptation of "The Social Network," is on the short list to write the screenplay.

It would be interesting to see how Sorkin would tackle Jobs' story, full and complex as it is. One approach might be to leave out the '80s, already covered to some degree in "Pirates of Silicon Valley," and concentrate instead on his youth and the last 15 years of his life. One can only hope that the technological details are not hopelessly mangled in an attempt to make it accessible.


New algorithm for calculating 95 percentile

August 30th, 2011

The 95 percentile for query response times is an old concept; Peter and Roland blogged about it in 2008. Since then, MySQL tools have calculated the 95 percentile by collecting all values, either exactly or approximately, and returning all_values[int(number_of_values * 0.95)] (that’s an extreme simplification). But recently I asked myself*: must we save all values? The answer is no. I created a new algorithm** for calculating the 95 percentile that is faster, more accurate, and saves only 100 values.***
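
For reference, the "collect everything, sort, and index" approach can be sketched in a few lines of shell. This is only an illustration of the simplified formula, not code from any MySQL tool; the function name exact_p95 is made up for this example, and it uses integer arithmetic (n * 95 / 100) in place of n * 0.95.

```shell
#!/usr/bin/env bash
# Sketch: exact 95 percentile by saving and sorting every value.
# Reads one number per line on stdin. exact_p95 is a hypothetical name.
exact_p95() {
  sort -n | awk '
    { v[NR - 1] = $1 }                                # zero-based array of sorted values
    END { if (NR > 0) print v[int(NR * 95 / 100)] }   # all_values[int(n * 0.95)]
  '
}

# Usage: exact_p95 < response_times.txt
```

Every value has to be kept and sorted before the answer is known, which is exactly the cost the question "must we save all values?" is aimed at.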

Firstly, my basis of comparison is the 95 percentile algo used by mk-query-digest. That algo is fast, memory-stable, and very proven in the real world. It works well for any number of values, even hundreds of thousands of values. It saves all values by using base 1.05 buckets and counting the number of values that fall within the range of each bucket. The results are not exact, but the differences are negligible because a 10ms and 13ms response time are indiscernible to a human. Any algo that hopes to handle very large numbers of values must approximate because not even C can store and sort hundreds of thousands of floats (times N many attributes times N many query classes) quickly enough.
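
To give a feel for the bucket idea, here is a rough sketch in shell and awk. It is not the mk-query-digest code, just my own illustration under simple assumptions: values are positive, each value is counted in the bucket floor(log(v) / log(1.05)), and the reported percentile is the lower bound of the bucket that contains the target rank. The function name bucket_p95 is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: approximate 95 percentile with base 1.05 buckets.
# Reads one positive number per line on stdin. Not the mk-query-digest code.
bucket_p95() {
  awk '
    $1 > 0 {
      x = log($1) / log(1.05)
      b = int(x); if (b > x) b--          # floor, because int() truncates toward zero
      count[b]++                          # only a counter per bucket is stored
      n++
      if (n == 1 || b < lo) lo = b
      if (n == 1 || b > hi) hi = b
    }
    END {
      if (n == 0) exit 1
      target = int(n * 95 / 100)          # zero-based rank of the 95 percentile
      seen = 0
      for (b = lo; b <= hi; b++) {
        seen += count[b]
        if (seen > target) {              # this bucket holds the target rank
          printf "%.3f\n", exp(b * log(1.05))   # lower bound of the bucket
          exit
        }
      }
    }
  '
}

# Usage: bucket_p95 < response_times.txt
```

Because each bucket spans 5% of its lower bound, the answer from this sketch can undershoot the exact value by up to roughly 5%, which lines up with the small negative OLD_DIFF values in the results.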

So when I finished the new algo, I compared it to the mk-query-digest algo and obtained the following results:

FILE                         REAL_95     OLD_95     NEW_95  OLD_DIFF NEW_DIFF  OLD_TIME NEW_TIME   FILE_SIZE  OLD_RSS  NEW_RSS
nums/500k-1-or-2               1.751      1.697      1.784    -0.054   +0.033     12.12     9.37     4500000    3.88M    2.63M
nums/100k-1-or-2               1.749      1.697      1.794    -0.052   +0.045      2.42     1.88      900000    3.88M    2.63M
nums/50k-trend-1-to-9          6.931      6.652      6.995    -0.279   +0.064      1.24     0.90      450000    3.88M    2.63M
nums/25k-trend-1-to-5          3.888      3.704      3.988    -0.184   +0.100      0.64     0.47      225000    3.88M    2.63M
nums/21k-1-spike5-1            0.997      0.992      2.002    -0.005   +1.005      0.55     0.42      189000    3.88M    2.63M
nums/10k-rand-0-to-20         19.048     18.532     19.054    -0.516   +0.006      0.29     0.21       95079    3.86M    2.62M
nums/10k-rand-0-to-10          9.511      9.360      9.525    -0.151   +0.014      0.29     0.21       90000    3.86M    2.62M
nums/4k-trend-1-to-7           5.594      5.473      6.213    -0.121   +0.619      0.14     0.09       36000    3.86M    2.63M
nums/1k-sub-sec                0.941      0.900      0.951    -0.041   +0.010      0.07     0.04        9000    3.80M    2.62M
nums/400-half-10              10.271      9.828     10.273    -0.443   +0.002      0.05     0.03        3800    3.79M    2.62M
nums/400-high-low             10.446     10.319     10.446    -0.127        0      0.05     0.03        3800    3.79M    2.62M
nums/400-low-high             10.445     10.319     10.475    -0.126   +0.030      0.05     0.03        3800    3.79M    2.63M
nums/400-quarter-10           10.254      9.828     10.254    -0.426        0      0.06     0.03        3700    3.79M    2.62M
nums/153-bias-50              88.523     88.305     88.523    -0.218        0      0.05     0.03        1500    3.79M    2.62M
nums/100-rand-0-to-100        90.491     88.305     90.491    -2.186        0      0.05     0.03         991    3.79M    2.62M
nums/105-ats                  42.000     42.000     42.000         0        0      0.05     0.03         315    3.75M    2.61M
nums/20                       19.000     18.532     19.000    -0.468        0      0.04     0.03          51    3.79M    2.62M
nums/1                        42.000     42.000     42.000         0        0      0.04     0.03           3    3.75M    2.61M

 
I generated random microsecond values in various files. The first number of the filename indicates the number of values. So the first file has 500k values. The remaining part of the filename hints at the distribution of the values. For example, “50k-trend-1-to-9” means 50k values that increase from about 1 second to 9 seconds. Number and distribution of values affects 95 percentile algorithms, so I wanted to simulate several possible combinations.

“REAL_95” is the real, exact 95th percentile; this is the control against which the “old” (i.e. the mk-query-digest) and new algos are compared. The diffs are each algo's result minus this control.
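To make the control concrete, here is a minimal sketch of how an exact 95th percentile and a diff against it could be computed. It is not the code from the tarball: the function names are hypothetical, and the nearest-rank definition of a percentile is an assumption (the post does not say which definition REAL_95 uses).

```python
def exact_percentile_95(values):
    """Exact 95th percentile by the nearest-rank method (an assumed definition).

    Keeps and sorts every value, which is exactly the cost an
    approximate algorithm tries to avoid.
    """
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    # 1-based rank = ceil(0.95 * n), computed with integers to avoid float error.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def diff_from_control(estimate, control):
    """Signed error of an algorithm's estimate relative to the exact value."""
    return estimate - control
```

With the figures from the “nums/10k-rand-0-to-20” row, `diff_from_control(18.532, 19.048)` gives about -0.516, which matches the old algo's diff column.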

Each algo was timed and its memory (rss) measured, too. The time and memory comparisons are a little biased because the mk-query-digest module that implements its 95th percentile algo does more than my test script for the new algo.
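For readers who want to take similar measurements without the Bash script, the following is a rough sketch of timing a function and reading the process's peak rss. It is an assumption about how one might measure, not how cmp-algos.sh does it; `measure` is a hypothetical helper, and the `resource` module is only available on Unix-like systems.

```python
import resource
import time


def measure(func, *args):
    """Run func(*args) once and return (result, elapsed_seconds, peak_rss)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    # ru_maxrss is the peak resident set size of this process so far;
    # it is reported in kilobytes on Linux and in bytes on macOS.
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return result, elapsed, peak_rss
```

Because the peak rss covers the whole process lifetime, each algo should be measured in its own process if the memory numbers are to be compared fairly.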

The results show that the new algo is about 20% faster in all cases and more accurate in all but one case (“21k-1-spike5-1”). Also, the new algo uses less memory, but again this is a little biased; the important point is that it doesn’t use more memory to get its speed or accuracy increase.

The gains of the new algo are small in these comparisons, but I suspect they’ll be much larger in practice, given that the algo is used at least twice for each query. So saving 1 second in the algo can save minutes in data processing when there are tens of thousands of queries.
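The scaling argument is simple multiplication, sketched below. The inputs are purely illustrative assumptions, not measurements: a saving of 0.02 seconds per call, two calls per query, and 10,000 queries.

```python
def total_saving_seconds(per_call_saving, calls_per_query, queries):
    """Total time saved when a per-call saving repeats across many queries."""
    return per_call_saving * calls_per_query * queries


# Hypothetical inputs: 0.02 s saved per call, 2 calls per query, 10,000 queries.
saved = total_saving_seconds(0.02, 2, 10_000)
print(f"{saved:.0f} seconds, or about {saved / 60:.1f} minutes")
```

Under those assumed inputs the saving is 400 seconds, a little under seven minutes, which is how a small per-call gain turns into minutes over a large workload.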

Instead of explaining the algorithm exhaustively, I have uploaded all my code and data so you can reproduce the results on your machine: new-95-percentile-algo.tar.gz. You’ll need to check out Maatkit, tweak the “require” lines in the Perl files, and tweak the Bash script (cmp-algos.sh), but otherwise I think the experiment should be straightforward. The new algo is in new-algo.pl. (new-algo.py is for another blog post.)

My ulterior motive for this blog post is to get feedback. Is the algorithm sane? Is there a critical flaw that I overlooked? Do you have a real-world example that doesn’t work well? If you’re intrepid or just curious and actually study the algo and have questions, feel free to contact me.

* By “recently asked myself” I mean that some time ago Baron and I wondered if it was possible to calculate the 95th percentile without saving all values. At that time, I didn’t think it was feasible, but lately I have thought more about the problem and written code for it.

** By “a new algorithm” I doubt that this has never been attempted or coded before, but I can’t find any examples of a similar algorithm.

*** By “saves only 100 values” I mean ultimately. At certain times, 150 values may be saved, but eventually the extra 50 should be integrated back into the base 100 values.



Developer Week in Review: Lion drops pre-installed MySQL

August 3rd, 2011


A busy week at Casa Turner, as the infamous Home Renovations of Doom wrap up, I finish the final chapters of "Developing Enterprise iOS Applications" (buy a copy for all your friends, it's a real page turner!), pack for two weeks of vacation with the family in California (Palm Springs in August, 120 degrees, woohoo!), and celebrate both a birthday and an anniversary.



But never fear, WIR fans, I'll continue to supply the news, even as my MacBook melts in the sun and the buzzards start to circle overhead.

The law of unintended consequences

If you decide to install Lion Server, you may notice something missing from the included software: MySQL. Previous releases of OS X Server offered pre-installed MySQL command line and GUI tools, but they are AWOL from Lion. Instead, the geek-loved but less widely used Postgres database is installed.

It seems pretty obvious to the casual observer why Apple would make this move. With Oracle suing Google over Java, and Oracle's open source philosophy in doubt, I know I wouldn't want to stake my bottom line on an Oracle package bundled with my premier operating system. Apple could have used one of the non-Oracle forks of MySQL, but it appears they decided to skirt the issue entirely by going with Postgres, which has a clear history of non-litigiousness.

Meanwhile, Oracle had better be asking themselves if they can afford to play the games they've been playing without alienating their market base.

South Korea fines Apple 3 million won, which works out to ...

Apple has been hit with a penalty from the South Korean government that's a result of the iPhone location-tracking story that broke earlier this year. Now, Apple may have more money than the U.S. Treasury sitting in petty cash right now, but it will be difficult for them to recover from such a significant hit to their bottom line: a whopping 3 million won, which works out to a staggering ... um ... $2,830. Never mind.


Java 7 and the risks of X.0 software

Java 7 was recently released to the world with great fanfare and to-do. This week, we got a reminder of why using an X.0 version of software is a risky endeavor. It turns out that the optimized compiler is really a pessimized compiler, and that programs compiled with it stand a chance of crashing. Even better, there's a chance they'll just go off and do the wrong thing.

Java 7 seems to be breaking new ground in non-deterministic programming, which will be very helpful for physics researchers working with the Heisenberg uncertainty principle. What could be more appropriate for simulating the random behavior of particles than a randomly behaving compiler?

Got news?

Please send tips and leads here.
