Archive for the ‘big data’ Category

Tokutek Selected as a Finalist for O’Reilly Strata Conference

Февраль 9th, 2012

We are excited to announce that we’ve been named as one of ten finalists selected for the startup showcase at the O’Reilly Strata “Making Data Work” Conference at the end of this month in Santa Clara, California. The startup showcase will be held on February 29th, starting at 6:30 pm.

The conference offers a great overview of the big data space, with tracks on Data Science, Business and Industry, Visualization and Interfaces, Hadoop Applied, Hadoop Tech, Policy and Privacy, and Domain Data. With all of the “NoSQL” buzz and sessions at the show (Hadoop gets two tracks!), we are glad to be able to attend as a representative of the “NewSQL” community. We’ll be showing just how much MySQL, with the right storage engine, can scale to take on Big Data while giving up none of the power of ACID, familiar SQL interfaces, rich indexes, high insertion rates, and flexible schema.

If you will be there, please stop by to say hello! And please vote for us too (what can we say, it’s an election year all around).


PlanetMySQL Voting: Vote UP / Vote DOWN

Tokutek Selected as a Finalist for O’Reilly Strata Conference

Февраль 9th, 2012

We are excited to announce that we’ve been named as one of ten finalists selected for the startup showcase at the O’Reilly Strata “Making Data Work” Conference at the end of this month in Santa Clara, California. The startup showcase will be held on February 29th, starting at 6:30 pm.

The conference offers a great overview of the big data space, with tracks on Data Science, Business and Industry, Visualization and Interfaces, Hadoop Applied, Hadoop Tech, Policy and Privacy, and Domain Data. With all of the “NoSQL” buzz and sessions at the show (Hadoop gets two tracks!), we are glad to be able to attend as a representative of the “NewSQL” community. We’ll be showing just how much MySQL, with the right storage engine, can scale to take on Big Data while giving up none of the power of ACID, familiar SQL interfaces, rich indexes, high insertion rates, and flexible schema.

If you will be there, please stop by to say hello! And please vote for us too (what can we say, it’s an election year all around).


PlanetMySQL Voting: Vote UP / Vote DOWN

New England’s Victory (for Big Data)

Февраль 6th, 2012

While it might not have been New England’s weekend on the Big Gridiron, it was certainly New England’s day for Big Data at the New England Database Summit on Friday at MIT.

The summit was well attended, with 350 registrants and keynotes from prominent MySQL users such as Mark Callaghan. The coverage was quite broad, with presentations running the gamut from grad students (complete with bodyguards and intimidating academic advisors) to established companies such as StreamBase. The sponsor list was an A-list this year as well, with EMC and Microsoft being the two biggest backers.

There were far too many and diverse topics to write about all of them. That said, here were a few of the notable ones.

Keynote #1: Johannes Gehrke (Cornell): Declarative Data-Driven Coordination

Johannes Gehrke of Cornell kicked off with the first keynote on Declarative-Driven Coordination. His methodology shed light on an alternative to out-of-band communication. The presentation focused on how to successfully handle entangled queries.

More Sleep for Tom and Meg if They Can Just Coordinate

In brief, what he showed is a way for someone to see if their friend is on a flight and have the database go about satisfying mutual constraints. With a proof that is outlined in his Sigmod paper, his main theorem is that any schedule that is entangled-isolated is also oracle-serializable. It’s a clever approach, as long as one’s set of friends being entangled remains small.

Keynote #2: Mark Callaghan (Facebook): Performance is Overrated

The room got a little quiet when Mark took the stage. Some people were expecting a possible rehash of this summer’s brouhaha between Mike Stonebraker and Facebook on the fate of MySQL. But, instead Mark jumped into some very practical discussions about managing MySQL at scale.

First, he noted that manageability needs more attention since…

    • The cost of extra hardware can be predicted
    • The cost of downtime cannot
    • Downtime comes in many forms (server down and server too busy)

For Mark, manageability has a number of meanings. This includes the rate of interrupts/server for the operations team. Mark finds that while the server count grows quickly, his operations team grows slowly. Hence, it is imperative that the quality-of-service improve over time (i.e., Does work get done? Does work get done on time?).

Mark and his team use MySQL for a number of reasons. First, it was there when Mark arrived. Second, Mark and his team made it scale 10x. Finally, Mark likes MySQL for OLTP.

As Facebook has grown though, so have the number of servers. This is due to “Big Data” x high QPS. Hence, they have had to add servers to add IOPs. To address this, Mark noted that flash memory (SSD) is very interesting as are (we blush) write-optimized databases.

The last part of his presentation focused on advice for scaling: More Data, More QPS. His tips were quite straightforward:

    • Fix stalls to make use of capacity
      • Don’t make MySQL faster, make it less slow
    • Improve efficiency to use less
    • Repeat

 Additional details can be found in Sheeri’s excellent live blog of the presentation.

New Tools and Systems Session: Willis Lang (University of Wisconsin): Energy-Conscious Data Management Systems

Just as Mark stressed that performance isn’t everything when he spoke about management, Willis Lang pointed out another key concern.  His slides noted that “three decades of database research has optimized for the highest possible performance possible regardless of energy consumption.” (We agree and have written about this topic as well).

Willis and his team have been looking at various techniques for addressing this such as using variable speed disks. He has been systematically studying the power/performance trade-offs of hardware components. The preliminary memory-based results showed that interesting trade-off opportunities exist if one rethinks database design principles. His presentation focused on the improvements that can be seen with memory parking. Additional details on his research can be found here.

As mentioned previously, there were many good talks — much more could be written about the event. Other interesting speakers included David Karger who introduced Dido, which seeks to make database manipulation as easy as document editing, and Alvin Cheung whose Pyxis project eases application development with automatic code partitioning based on application and server characteristics.

Kudos to Samuel Madden (MIT) and Ugur Cetintemel (Brown University) for organizing the event. Additional details can also be found via the Twitter hashtag #nedb12 and the event homepage.

 


PlanetMySQL Voting: Vote UP / Vote DOWN

MySQL Conference and Expo Talk on Benchmarking

Февраль 2nd, 2012

I’ll be speaking on April 11th at 4:30 pm in Room 4 in at the Percona Conference and Expo Talk. The topic will be “Creating a Benchmark Infrastructure That Just Works.

Throughout my career I’ve been involved with maintaining the performance of database applications and therefore created many benchmark frameworks. At Tokutek, an important part of my role is measuring the performance of our storage engine over time and versus competing solutions. There is nothing proprietary about what I’ve created, it can be used anywhere.

My presentation will cover how I created the benchmark infrastructure at Tokutek:

  • Hardware and software considerations (including physical vs. virtual)
  • Selecting benchmarks
  • Capturing detailed information during the benchmark
  • Automation
  • Storing results
  • Visualization
  • Trend analysis
  • Continuous integration (monitoring the performance of future versions)
  • Self-service (let people get the information they want)

Track: Tools
Experience level: Intermediate

Tokutek is also a sponsor of the show and will have an expo booth. So, I hope to see you at my talk and/or at our booth.


PlanetMySQL Voting: Vote UP / Vote DOWN

Big Kettle News

Январь 30th, 2012

Dear Kettle fans,

Today I’m really excited to be able to announce a few really important changes to the Pentaho Data Integration landscape. To me, the changes that are being announced today compare favorably to reaching Kettle version 1.0 some 9 years ago, or reaching version 2.0 with plugin support or even open sourcing Kettle itself…

First of all…

Pentaho is again open sourcing an important piece of software.  Today we’re bringing all big data related software to you as open source software.  This includes all currently available capabilities to access HDFS, MongoDB, Cassandra, HBase, the specific VFS drivers we created as well as the ability to execute work inside of Hadoop (MapReduce), Amazon EMR, Pig and so on.

This is important to you because it means that you can now use Kettle to integrate a multitude of technologies, ranging from files over relational databases to big data and NoSQL.  You can do this in other words without writing any code.  Take a look at how easy it is to program for Hadoop MapReduce:

In other words, this part of the big news of today allows you to use the best tool for the job, whatever that tool is. You can now combine the large set of steps and job entries with all the available data sources and use that to integrate everything. Especially for Hadoop the time it takes to implement a MapReduce job is really small taking the sting out of costly and long training and testing cycles.

But that’s not all…

Pentaho Data Integration as well as the new big data plugins are now available under the Apache License 2.0. This means that it’s now very easy to integrate Kettle or the plugins in 3rd party software. In fact, for Hadoop, all major distributions are already supported including: Amazon Elastic MapReduce, Apache Hadoop, Cloudera’s Distribution including Apache Hadoop (CDH), Cloudera Enterprise, EMC Greenplum HD, HortonWorks Data Platform powered by Apache Hadoop, and MapR’s M3 Free and M5 Edition.
The change of Kettle from LGPL to Apache License 2.0 was broadly supported by our community and acts as an open invitation for other projects (and companies) to integrate Kettle. I hope that more NoSQL, Big Data and Big Search communities will reach out to us to work together to even broaden our portfolio. The way I see it, the Kettle community just got a whole lot bigger!

Where are the goodies?

The main landing page for the Big Data community is placed on our wiki to emphasize our intention to closely work with the various communities to make Pentaho Big Data a success. You can find all information over there, including a set of videos, PDI 4.3.0 preview download (including Big Data plugins), Hadoop installation instructions, PRD configuration information and much more.

Thanks for your time reading this and thanks for using Pentaho software!

Matt


PlanetMySQL Voting: Vote UP / Vote DOWN

Big Kettle News

Январь 30th, 2012

Dear Kettle fans,

Today I’m really excited to be able to announce a few really important changes to the Pentaho Data Integration landscape. To me, the changes that are being announced today compare favorably to reaching Kettle version 1.0 some 9 years ago, or reaching version 2.0 with plugin support or even open sourcing Kettle itself…

First of all…

Pentaho is again open sourcing an important piece of software.  Today we’re bringing all big data related software to you as open source software.  This includes all currently available capabilities to access HDFS, MongoDB, Cassandra, HBase, the specific VFS drivers we created as well as the ability to execute work inside of Hadoop (MapReduce), Amazon EMR, Pig and so on.

This is important to you because it means that you can now use Kettle to integrate a multitude of technologies, ranging from files over relational databases to big data and NoSQL.  You can do this in other words without writing any code.  Take a look at how easy it is to program for Hadoop MapReduce:

In other words, this part of the big news of today allows you to use the best tool for the job, whatever that tool is. You can now combine the large set of steps and job entries with all the available data sources and use that to integrate everything. Especially for Hadoop the time it takes to implement a MapReduce job is really small taking the sting out of costly and long training and testing cycles.

But that’s not all…

Pentaho Data Integration as well as the new big data plugins are now available under the Apache License 2.0. This means that it’s now very easy to integrate Kettle or the plugins in 3rd party software. In fact, for Hadoop, all major distributions are already supported including: Amazon Elastic MapReduce, Apache Hadoop, Cloudera’s Distribution including Apache Hadoop (CDH), Cloudera Enterprise, EMC Greenplum HD, HortonWorks Data Platform powered by Apache Hadoop, and MapR’s M3 Free and M5 Edition.
The change of Kettle from LGPL to Apache License 2.0 was broadly supported by our community and acts as an open invitation for other projects (and companies) to integrate Kettle. I hope that more NoSQL, Big Data and Big Search communities will reach out to us to work together to even broaden our portfolio. The way I see it, the Kettle community just got a whole lot bigger!

Where are the goodies?

The main landing page for the Big Data community is placed on our wiki to emphasize our intention to closely work with the various communities to make Pentaho Big Data a success. You can find all information over there, including a set of videos, PDI 4.3.0 preview download (including Big Data plugins), Hadoop installation instructions, PRD configuration information and much more.

Thanks for your time reading this and thanks for using Pentaho software!

Matt


PlanetMySQL Voting: Vote UP / Vote DOWN

Big Kettle News

Январь 30th, 2012

Dear Kettle fans,

Today I’m really excited to be able to announce a few really important changes to the Pentaho Data Integration landscape. To me, the changes that are being announced today compare favorably to reaching Kettle version 1.0 some 9 years ago, or reaching version 2.0 with plugin support or even open sourcing Kettle itself…

First of all…

Pentaho is again open sourcing an important piece of software.  Today we’re bringing all big data related software to you as open source software.  This includes all currently available capabilities to access HDFS, MongoDB, Cassandra, HBase, the specific VFS drivers we created as well as the ability to execute work inside of Hadoop (MapReduce), Amazon EMR, Pig and so on.

This is important to you because it means that you can now use Kettle to integrate a multitude of technologies, ranging from files over relational databases to big data and NoSQL.  You can do this in other words without writing any code.  Take a look at how easy it is to program for Hadoop MapReduce:

In other words, this part of the big news of today allows you to use the best tool for the job, whatever that tool is. You can now combine the large set of steps and job entries with all the available data sources and use that to integrate everything. Especially for Hadoop the time it takes to implement a MapReduce job is really small taking the sting out of costly and long training and testing cycles.

But that’s not all…

Pentaho Data Integration as well as the new big data plugins are now available under the Apache License 2.0. This means that it’s now very easy to integrate Kettle or the plugins in 3rd party software. In fact, for Hadoop, all major distributions are already supported including: Amazon Elastic MapReduce, Apache Hadoop, Cloudera’s Distribution including Apache Hadoop (CDH), Cloudera Enterprise, EMC Greenplum HD, HortonWorks Data Platform powered by Apache Hadoop, and MapR’s M3 Free and M5 Edition.
The change of Kettle from LGPL to Apache License 2.0 was broadly supported by our community and acts as an open invitation for other projects (and companies) to integrate Kettle. I hope that more NoSQL, Big Data and Big Search communities will reach out to us to work together to even broaden our portfolio. The way I see it, the Kettle community just got a whole lot bigger!

Where are the goodies?

The main landing page for the Big Data community is placed on our wiki to emphasize our intention to closely work with the various communities to make Pentaho Big Data a success. You can find all information over there, including a set of videos, PDI 4.3.0 preview download (including Big Data plugins), Hadoop installation instructions, PRD configuration information and much more.

Thanks for your time reading this and thanks for using Pentaho software!

Matt


PlanetMySQL Voting: Vote UP / Vote DOWN

1 Billion Insertions – The Wait is Over!

Январь 26th, 2012

iiBench measures the rate at which a database can insert new rows while maintaining several secondary indexes. We ran this for 1 billion rows with TokuDB and InnoDB starting last week, right after we launched TokuDB v5.2. While TokuDB completed it in 15 hours, InnoDB took 7 days.

The results are shown below. At the end of the test, TokuDB’s insertion rate remained at 17,028 inserts/second whereas InnoDB had dropped to 1,050 inserts/second. That is a difference of over 16x. Our complete set of benchmarks for TokuDB v5.2 can be found here.

Benchmark Details: Ubuntu 10.10; 2x Xeon X5460; 16GB RAM; 8x 146GB 10k SAS in RAID10. Each data point is the average insertion rate for the last 2 million rows. 

We developed the iiBench benchmark to measure performance for a use case that occurs commonly in production applications, such as online advertising, social media, and network management.

iiBench simulates a pattern of usage for always-on applications that:

  • Require fast query performance and hence require indexes
  • Have high data insert rates
  • Cannot wait for offline batch processing and hence require the indexes be maintained as data comes in

Note that iiBench was created as an open-source benchmark, which allows others to freely use it, extend it, and contribute their changes back. We originally unveiled the benchmark in the context of a challenge issued at the 2008 OpenSQL camp. Since then, iiBench has been downloaded and used many times, and ported by the community (in this case, Mark Callaghan) to a Python Script.

Please let us know any feedback you have on iiBench. For additional information on…

  • iibench overview click here
  • TokuDB version 5.2 Overview click here
  • TokuDB version 5.2 Performance, including iibench, SysBench, Compression, and TPCC-like, click here

PlanetMySQL Voting: Vote UP / Vote DOWN

CAOS Theory Podcast 2012.01.20

Январь 20th, 2012

Topics for this podcast:

*Hadoop v1.0 and year ahead
*Oracle-Cloudera deal for more Hadoop
*Oracle’s ‘Sun spot’ with Solaris
*Open Source M&A outlook for 2012
*Our new MySQL/NoSQL/NewSQL survey

iTunes or direct download (28:49, 4.9MB)


PlanetMySQL Voting: Vote UP / Vote DOWN

Fractal Tree Indexes and Mead – MySQL Meetup

Январь 11th, 2012

 
Thanks again to Sheeri Cabral  for having me at the Boston MySQL Meetup on Monday for the talk on “Fractal Tree® Indexes – Theoretical Overview and Customer Use Cases.” The crowd was very interactive, and I appreciated that over 50 people signed up for the event and left some very positive comments and reviews.

In addition, the conversation spilled over late into the night as we made our way over to nearby Mead Hall afterwards for a few drinks, some food, and to continue the discussion.

The presentation is available here.

As a brief overview – most databases employ B-trees to achieve a good tradeoff between the ability to update data quickly and to search it quickly. It turns out that B-trees are far from the optimum in this tradeoff space. This led to the development at MIT, Rutgers and Stony Brook of Fractal Tree indexes. Fractal Tree indexes improve MySQL® scalability and query performance by allowing greater insertion rates, supporting rich indexing and offering efficient compression. They can also eliminate operational headaches such as dump/reloads, inflexible schemas and partitions.

The presentation provides an overview on how Fractal Tree indexes work, and then gets into some specific product features, benchmarks, and customer use cases that show where people have deployed Fractal Tree indexes via the TokuDB® storage engine.
 


PlanetMySQL Voting: Vote UP / Vote DOWN