Archive for the ‘big data’ Category

SkySQL is Coming to a City Near You!

March 19th, 2012

Now that the snow is melting and spring is in the air, the SkySQL Team is hitting the road and making the rounds of key industry events, trade shows, and meetups around the globe. Come meet the team, pick up a few tips and tricks for using the MySQL database, network with your peers, and learn more about SkySQL’s products and services. Here are some of the events we’ll be at this spring:

BIG Data, A New Horizon for Data Analysis
March 20 - 21, 2012
Cité Internationale Universitaire de Paris, Paris, France

POSSCON 2012
March 28-29, 2012
Columbia Metropolitan Convention Center, Columbia, South Carolina

Houston PHP Users Group Meeting
April 5, 2012
BrandExtract, Houston TX

Percona Live:  MySQL Conference & Expo 2012
April 10-April, 2012
Hyatt Regency Santa Clara, Santa Clara, CA

SkySQL & MariaDB:  Solutions Day for the MySQL Database
April 13, 2012
Hyatt Regency Santa Clara, Santa Clara, CA

Drizzle Day 2012
April 13, 2012
Hyatt Regency Santa Clara, Santa Clara, CA

Sphinx Search Day 2012
April 13, 2012
Hyatt Regency Santa Clara, Santa Clara, CA

For more detailed information on these events, visit http://www.skysql.com/news-and-events/events.

We hope to see you when we’re in a city near you!



O’Reilly Strata 2012: The Year of the Data Scientist

March 5th, 2012

We had the privilege this past week to be invited to be part of the 2012 O’Reilly Strata “Making Data Work” Conference. Some of our photos from the event are here. At the event, we were excited to have Tokutek described in front of the approximately 2,500 attendees during the keynote sessions.

Overall, the diversity of topics discussed at the conference was impressive, spanning databases, developer tools, data visualization techniques, customer stories, and business implications. The full agenda is here.

For those who missed it, here are some great resources:

At the show, Tokutek was one of ten companies selected for the Startup Showcase. In this process, we were the only database company to receive an honorable mention.

We had a number of great conversations with participants at the show. Common themes and questions we received around MySQL focused on how to scale performance of MySQL, when to consider flash drives or more RAM, and considerations for keeping MySQL + TokuDB over going to NoSQL.

As part of the show, I also had the chance to talk with O’Reilly’s Mac Slocum about Tokutek.

With all the interest in Big Data, Tim O’Reilly summed up the conference well, saying “data science is the new black”. 2012 is clearly the year of the data scientist – and we have the database that will make him or her successful.

 



Evidenzia Upgrades to TokuDB v5.2 to Address Storage Growth and Scale Performance

February 27th, 2012

Ensuring sufficient disk I/O to catch copyright violations at network speed.

Evidenzia GmbH & Co. KG

Issues addressed:

  • Storage growth, including maxed-out disk I/O utilization
  • Performance issues and business impact due to slow selects
  • Inability to revise data schema on the fly

The Company: Evidenzia GmbH & Co. KG is one of the leading partners of the software, movie and music industry when it comes to tracing copyright infringements and illegal file sharing activities in peer-to-peer networks. Evidenzia helps copyright owners in protecting their intellectual property. Their powerful technologies enable copyright owners to trace and document illegal file sharing activities in P2P networks reliably. All data and documentation may then be used as evidence in court.

The Challenge: Evidenzia ingests a large amount of logging information each hour. The data not only needs to be processed in parallel for instant reporting, but also has to be stored in case it is ever needed as evidence in a legal case. To meet these needs, Evidenzia logs IP addresses while also performing a connect to each peer. In the process, the software fetches data to match it to the copyrighted material for proof of copyright violation.

“Prior to TokuDB, we were using InnoDB for storing all the data. We found that as the tables grew bigger, the selects were becoming slower, taking as much as an hour or more, and the disk I/O was growing higher,” according to Director of Operations Bastian Axter.

To keep up with the workload, Evidenzia had considered several options, but they failed to meet program performance and price goals. These included:

Flash memory (SSD cache) – Storing all the data on SSD was much too expensive so Evidenzia considered using SSD cache inside the RAID controller. After testing this approach, Evidenzia discovered that it would not help because there was still too much data spread randomly to the disk, and the cache could not improve with random reads.

Partitioning – “Partitioning was one option that was reviewed to divide up the load,” Axter said. “However, the management overhead that would have been required for all the tables and partitions was excessive. This approach would clearly have introduced more problems than it could have solved and would have resulted in additional management headache.”
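
The SSD cache finding above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a model of Evidenzia's system: it assumes an LRU eviction policy and uniformly random block reads, and the function name `lru_hit_ratio` and all sizes are hypothetical.

```python
import random
from collections import OrderedDict


def lru_hit_ratio(num_blocks, cache_blocks, num_reads, seed=0):
    """Simulate uniformly random block reads through an LRU cache.

    Returns the fraction of reads served from the cache.
    All parameters are illustrative, not measured values.
    """
    rng = random.Random(seed)
    cache = OrderedDict()  # keys are block ids, ordered from oldest to newest use
    hits = 0
    for _ in range(num_reads):
        block = rng.randrange(num_blocks)
        if block in cache:
            hits += 1
            cache.move_to_end(block)  # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return hits / num_reads


if __name__ == "__main__":
    # Hypothetical sizes: the cache holds 1% of the data set.
    ratio = lru_hit_ratio(num_blocks=1_000_000, cache_blocks=10_000, num_reads=200_000)
    print(f"hit ratio with a 1% cache: {ratio:.3f}")
```

When reads are spread evenly over a data set much larger than the cache, the hit ratio stays close to the ratio of cache size to data size, so most reads still go to disk.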

The Solution: With TokuDB v5.2, Evidenzia can do all the inserts and selects in parallel and also delete deprecated data out of the same table, without the need to call an “optimize table” or slow down the other processes (insert/select). In addition, the compression of TokuDB tables proved invaluable in keeping the required disk space low.

“The fast indexes and the ability to delete without having to optimize the table, as well as the unique ability of Hot Index addition, really brought home how powerful TokuDB is,” according to Axter. “For these reasons, we were able to convert other tables to TokuDB as well.”

Below is a graph of the disk-usage (I/O max 100%) of the primary database, which shows the dramatic drop in disk I/O at week 46 when Evidenzia deployed TokuDB:

Disk utilization before and after TokuDB

“Most of the I/O came from the long running selects; they are gone since we introduced TokuDB into production,” according to Axter. “The overall impact on disk I/O was impressive, dropping from near 80-100% down to 5-10%.”

The Benefits: 

Cost Savings: With growth in InnoDB, as selects were slowing down, disk I/O was rising. Evidenzia would have had to buy additional drives just to keep up with the I/O. In addition, the compression on InnoDB wasn’t up to the task of being able to significantly shrink the tables on disk. “With TokuDB, we saved over 70% on storage,” according to Axter.

Performance: “There was an immediate impact with selects with TokuDB. These went from taking over an hour down to taking just minutes,” noted Axter. “Not only did TokuDB assist us with the select slowdown from large tables, but it also addressed our problems with deletes. Prior to TokuDB, deletes of already processed and archived data were far too slow because of the huge and slow fragmented indexes.”

Flexibility of Operations: With InnoDB, “optimize table” to rebuild the indexes was too disruptive to the business since it would block the whole logging process. With TokuDB, however, indexes don’t fragment and so they never require the database to be taken offline to rebuild them.



A super-set of MySQL for Big Data. Interview with John Busch, Schooner.

February 20th, 2012

“Legacy MySQL does not scale well on a single node, which forces granular sharding and explicit application code changes to make them sharding-aware and results in low utilization of servers” – Dr. John Busch, Schooner Information Technology

A super-set of MySQL suitable for Big Data? On this subject, I have interviewed Dr. John Busch, Founder, Chairman, [...]

Tokutek Selected as a Finalist for O’Reilly Strata Conference

February 9th, 2012

We are excited to announce that we’ve been named as one of ten finalists selected for the startup showcase at the O’Reilly Strata “Making Data Work” Conference at the end of this month in Santa Clara, California. The startup showcase will be held on February 29th, starting at 6:30 pm.

The conference offers a great overview of the big data space, with tracks on Data Science, Business and Industry, Visualization and Interfaces, Hadoop Applied, Hadoop Tech, Policy and Privacy, and Domain Data. With all of the “NoSQL” buzz and sessions at the show (Hadoop gets two tracks!), we are glad to be able to attend as a representative of the “NewSQL” community. We’ll be showing just how much MySQL, with the right storage engine, can scale to take on Big Data while giving up none of the power of ACID, familiar SQL interfaces, rich indexes, high insertion rates, and flexible schema.

If you will be there, please stop by to say hello! And please vote for us too (what can we say, it’s an election year all around).




New England’s Victory (for Big Data)

February 6th, 2012

While it might not have been New England’s weekend on the Big Gridiron, it was certainly New England’s day for Big Data at the New England Database Summit on Friday at MIT.

The summit was well attended, with 350 registrants and keynotes from prominent MySQL users such as Mark Callaghan. The coverage was quite broad, with presentations running the gamut from grad students (complete with bodyguards and intimidating academic advisors) to established companies such as StreamBase. The sponsor list was an A-list this year as well, with EMC and Microsoft being the two biggest backers.

There were far too many and diverse topics to write about all of them. That said, here were a few of the notable ones.

Keynote #1: Johannes Gehrke (Cornell): Declarative Data-Driven Coordination

Johannes Gehrke of Cornell kicked off with the first keynote on Declarative Data-Driven Coordination. His methodology shed light on an alternative to out-of-band communication. The presentation focused on how to successfully handle entangled queries.

More Sleep for Tom and Meg if They Can Just Coordinate

In brief, what he showed is a way for someone to see if their friend is on a flight and have the database go about satisfying mutual constraints. With a proof that is outlined in his SIGMOD paper, his main theorem is that any schedule that is entangled-isolated is also oracle-serializable. It’s a clever approach, as long as one’s set of friends being entangled remains small.

Keynote #2: Mark Callaghan (Facebook): Performance is Overrated

The room got a little quiet when Mark took the stage. Some people were expecting a possible rehash of this summer’s brouhaha between Mike Stonebraker and Facebook on the fate of MySQL. Instead, Mark jumped into some very practical discussions about managing MySQL at scale.

First, he noted that manageability needs more attention since…

    • The cost of extra hardware can be predicted
    • The cost of downtime cannot
    • Downtime comes in many forms (server down and server too busy)

For Mark, manageability has a number of meanings. These include the rate of interrupts per server for the operations team. Mark finds that while the server count grows quickly, his operations team grows slowly. Hence, it is imperative that the quality of service improve over time (i.e., Does work get done? Does work get done on time?).

Mark and his team use MySQL for a number of reasons. First, it was there when Mark arrived. Second, Mark and his team made it scale 10x. Finally, Mark likes MySQL for OLTP.

As Facebook has grown, though, so has the number of servers. This is due to “Big Data” x high QPS. Hence, they have had to add servers to add IOPS. To address this, Mark noted that flash memory (SSD) is very interesting, as are (we blush) write-optimized databases.

The last part of his presentation focused on advice for scaling: More Data, More QPS. His tips were quite straightforward:

    • Fix stalls to make use of capacity
      • Don’t make MySQL faster, make it less slow
    • Improve efficiency to use less
    • Repeat
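
One way to read “make it less slow” is to track tail latency rather than the average, since a handful of stalls can dominate the slowest requests. The sketch below is an illustration under that assumption and is not taken from the talk; the function name `latency_summary` and the sample numbers are hypothetical.

```python
import statistics


def latency_summary(latencies_ms):
    """Return the median (p50) and 99th-percentile (p99) of latency samples."""
    # quantiles(n=100) yields 99 cut points; index 98 is the 99th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": statistics.median(latencies_ms), "p99": cuts[98]}


if __name__ == "__main__":
    # Hypothetical workload: 980 fast requests plus 20 requests that hit a stall.
    with_stalls = [2] * 980 + [500] * 20
    without_stalls = [2] * 1000
    print("with stalls:   ", latency_summary(with_stalls))
    print("without stalls:", latency_summary(without_stalls))
```

In this made-up sample the median is the same in both cases, while the 99th percentile differs by a large factor, which is why removing stalls can matter more than speeding up the common case.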

Additional details can be found in Sheeri’s excellent live blog of the presentation.

New Tools and Systems Session: Willis Lang (University of Wisconsin): Energy-Conscious Data Management Systems

Just as Mark stressed that performance isn’t everything when he spoke about management, Willis Lang pointed out another key concern. His slides noted that “three decades of database research has optimized for the highest possible performance possible regardless of energy consumption.” (We agree and have written about this topic as well.)

Willis and his team have been looking at various techniques for addressing this such as using variable speed disks. He has been systematically studying the power/performance trade-offs of hardware components. The preliminary memory-based results showed that interesting trade-off opportunities exist if one rethinks database design principles. His presentation focused on the improvements that can be seen with memory parking. Additional details on his research can be found here.

As mentioned previously, there were many good talks — much more could be written about the event. Other interesting speakers included David Karger who introduced Dido, which seeks to make database manipulation as easy as document editing, and Alvin Cheung whose Pyxis project eases application development with automatic code partitioning based on application and server characteristics.

Kudos to Samuel Madden (MIT) and Ugur Cetintemel (Brown University) for organizing the event. Additional details can also be found via the Twitter hashtag #nedb12 and the event homepage.

 



MySQL Conference and Expo Talk on Benchmarking

February 2nd, 2012

I’ll be speaking on April 11th at 4:30 pm in Room 4 at the Percona Live MySQL Conference and Expo. The topic will be “Creating a Benchmark Infrastructure That Just Works.”

Throughout my career I’ve been involved with maintaining the performance of database applications and have therefore created many benchmark frameworks. At Tokutek, an important part of my role is measuring the performance of our storage engine over time and versus competing solutions. There is nothing proprietary about what I’ve created; it can be used anywhere.

My presentation will cover how I created the benchmark infrastructure at Tokutek:

  • Hardware and software considerations (including physical vs. virtual)
  • Selecting benchmarks
  • Capturing detailed information during the benchmark
  • Automation
  • Storing results
  • Visualization
  • Trend analysis
  • Continuous integration (monitoring the performance of future versions)
  • Self-service (let people get the information they want)
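
A few of the steps above (running a benchmark, storing results, and checking for a trend across versions) can be sketched in a handful of lines. This is a minimal illustration, not the infrastructure used at Tokutek: the function names, the `results` table layout, the 10% tolerance, and the use of SQLite as the result store are all assumptions made for the example.

```python
import sqlite3
import statistics
import time


def run_benchmark(workload, repeats=5):
    """Time a callable several times and return the elapsed seconds of each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return timings


def store_result(conn, name, version, seconds):
    """Append one measurement to a results table, creating the table if needed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results (name TEXT, version TEXT, seconds REAL)"
    )
    conn.execute("INSERT INTO results VALUES (?, ?, ?)", (name, version, seconds))
    conn.commit()


def is_regression(conn, name, version, tolerance=0.10):
    """Return True if the median time for `version` exceeds the median of all
    other stored versions of the same benchmark by more than `tolerance`."""
    rows = conn.execute(
        "SELECT version, seconds FROM results WHERE name = ?", (name,)
    ).fetchall()
    current = [s for v, s in rows if v == version]
    baseline = [s for v, s in rows if v != version]
    if not current or not baseline:
        return False  # nothing to compare against yet
    return statistics.median(current) > statistics.median(baseline) * (1 + tolerance)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Hypothetical workload standing in for a real database benchmark.
    for seconds in run_benchmark(lambda: sum(range(100_000))):
        store_result(conn, "sum-loop", "build-1", seconds)
    print("regression:", is_regression(conn, "sum-loop", "build-1"))
```

Keeping every raw measurement, rather than only a summary, leaves room for the visualization and self-service steps to query the same table later.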

Track: Tools
Experience level: Intermediate

Tokutek is also a sponsor of the show and will have an expo booth. So, I hope to see you at my talk and/or at our booth.



Big Kettle News

January 30th, 2012

Dear Kettle fans,

Today I’m really excited to be able to announce a few really important changes to the Pentaho Data Integration landscape. To me, the changes that are being announced today compare favorably to reaching Kettle version 1.0 some 9 years ago, or reaching version 2.0 with plugin support or even open sourcing Kettle itself…

First of all…

Pentaho is again open sourcing an important piece of software.  Today we’re bringing all big data related software to you as open source software.  This includes all currently available capabilities to access HDFS, MongoDB, Cassandra, HBase, the specific VFS drivers we created as well as the ability to execute work inside of Hadoop (MapReduce), Amazon EMR, Pig and so on.

This is important to you because it means that you can now use Kettle to integrate a multitude of technologies, ranging from files and relational databases to big data and NoSQL. In other words, you can do this without writing any code. Take a look at how easy it is to program for Hadoop MapReduce:

In other words, this part of today’s big news allows you to use the best tool for the job, whatever that tool is. You can now combine the large set of steps and job entries with all the available data sources and use that to integrate everything. Especially for Hadoop, the time it takes to implement a MapReduce job is really small, taking the sting out of costly and long training and testing cycles.

But that’s not all…

Pentaho Data Integration as well as the new big data plugins are now available under the Apache License 2.0. This means that it’s now very easy to integrate Kettle or the plugins in 3rd party software. In fact, for Hadoop, all major distributions are already supported, including: Amazon Elastic MapReduce, Apache Hadoop, Cloudera’s Distribution including Apache Hadoop (CDH), Cloudera Enterprise, EMC Greenplum HD, HortonWorks Data Platform powered by Apache Hadoop, and MapR’s M3 Free and M5 Edition.

The change of Kettle from LGPL to Apache License 2.0 was broadly supported by our community and acts as an open invitation for other projects (and companies) to integrate Kettle. I hope that more NoSQL, Big Data and Big Search communities will reach out to us to work together to broaden our portfolio even further. The way I see it, the Kettle community just got a whole lot bigger!

Where are the goodies?

The main landing page for the Big Data community is placed on our wiki to emphasize our intention to closely work with the various communities to make Pentaho Big Data a success. You can find all information over there, including a set of videos, PDI 4.3.0 preview download (including Big Data plugins), Hadoop installation instructions, PRD configuration information and much more.

Thanks for your time reading this and thanks for using Pentaho software!

Matt


