Archive for the ‘Startups’ Category

Best of Guide – Highlights of Our Popular Content

May 24th, 2012

Read the original article at Best of Guide – Highlights of Our Popular Content

Top 5 Most Popular

We use a broad brush to highlight the biggest no-nos in web application scalability.

We dig into scalability, steering to the richest areas to focus on.

MySQL on Amazon EC2, the what, how and when.

We highlight some of the big differences between the two database engines. We’re also working on revamping this content, so stay tuned for more.

Interviewing a MySQL DBA – a guide for managers. Also helpful if you’re gearing up for an interview.

Hiring Guides

This two-part guide for a hiring manager focuses on the MySQL Database Operations role.

A long time favorite, a hiring guide for Oracle DBAs.

Devops is the latest craze in bringing the worlds of operations and development together.  We talk about how to identify and attract such a candidate.

Today the cloud is *almost* synonymous with Amazon Web Services and their Elastic Compute Cloud (EC2) offering.  Want to hire the best? Here’s the step-by-step guide.

Howtos

Caching caching everywhere.  Learn to do it right.

Hotbackups make building replicas a snap.  Avoid the downtime & speed up the process.

Get your backups right so you can get some sleep at night!

Learn how to automatically spin up new MySQL slaves in Amazon EC2.

Industry Commentary

A decade ago startups large and small were running on Oracle, but no longer.  As the shift intensifies, it becomes harder and harder to find the right talent.  Here’s why.

High availability ain’t what it used to be.  Here’s why nobody is really achieving so-called five nines.

We argue that technologists with broad experience are needed to achieve scalability for today’s high-traffic, high-transaction websites.

Startup & Small Business Advice

You’ve heard all the hype.  Now for some medicine.

Our three-part guide takes you through ten steps to building a successful consulting business.  This is as much a guide for freelancers or wannabe consultants as it is for startups and those wishing to hire good temporary resources.

Book Reviews

Learn about scalability from the guys at AKF.

Here’s a great how-to book, short and to the point.  Optimizing those queries!

Building a startup doesn’t have to mean big money. Stay efficient and build those margins.

Jeff Jarvis champions Google, but we flip the question around.

Hidden Gems

Cut through the hype.  Which types of applications really do lend themselves to deploying in the cloud?

Ask some tough questions before you deploy everything in the cloud.

If you’re migrating your application from Oracle to MySQL, you may be in for a bumpy road.  Here are a few things to watch out for.

Soup to nuts guide to deploying applications on Amazon Web Services EC2.

For more articles like these go to iHeavy, Inc +1-212-533-6828



The Mythical MySQL DBA

December 19th, 2011

I’ve been getting more than my fair share of calls from recruiters of late. Even in this depressed economic climate where jobs are rarer than a cab at rush hour, it’s heartening to know that tech engineers are in great demand. And it’s even more heartening to think that demand for MySQL DBAs has never been better.

My reckoning was confirmed by a Bloomberg news report about stalwart retailers suffering from a dearth of talented engineers. Bloomberg cited Target’s outage-prone e-commerce site as a symptom of, among other things, the market’s shortage. One of the challenges old-timers like Target face is having to compete with Silicon Valley startups as a fulfilling and, ultimately, financially rewarding place to work.

From the outside looking in, it's hard to say for sure why Target.com keeps crashing, but I can speculate on a few possible scenarios.

For one, the handoff from Amazon may have been less than smooth, lacking proper documentation and so forth. It could also be that the handoff went to less experienced DBAs or, perhaps, those more versed in the legacy technologies of Oracle and much less in the free-wheeling open-source ones like MySQL. Other reasons could be failures in capacity planning, incomplete or incorrect systems integration, or simply misconfigurations in the load balancer, database replication, or memory settings.

If any of these scenarios had been true for Target, a sound experienced DBA and/or operations team attuned to scaling and disaster scenarios should have been able to anticipate these outages and mitigate their impact.  That is, if there were enough talented and experienced ones to go around.

From our vantage point, we think there’s room for more individuals to specialize in this area. What we do see are developers or Unix system administrators who include MySQL experience in their bag of skills but few who can actually manage a database ecosystem.  Even in the Oracle space where there are a lot of career DBAs, many of them have moved over from the business side, so they lack certain computer science and engineering fundamentals and a pure science foundation.

Much of this boils down to universities not churning out enough engineers.  And the ones that do graduate are drawn to startups and to the coolest, smartest firms like Facebook and Google. If young college grads are gunning for the best job they can find, they're likely to shoot for the sexiest, most cutting-edge technologies.  In today's market that means programming jobs in Ruby on Rails or perhaps Node.js. Few would aspire to be in WebOps.

Dustin Moskovitz

Not too sexy for Ops

If I were to really go out on a limb I might ask if you've ever heard of Dustin Moskovitz?  No?  Oh, he's "the ops guy" from the original Facebook team and, with a net worth estimated at $3.5 billion, the youngest billionaire in the world.  Did I imply that operations and database administration weren't sexy?

Why generalists are better at scaling the web

October 25th, 2011

Recently at Surge 2011, the annual conference on scalability and performance, Google's CIO Ben Fried gave an illuminating keynote address. His main insight was that generalists are the people who will lead engineering teams in successfully scaling the web.

In a world where the badge of Specialist or Expert is prized, this was a refreshing perspective from an industry bigwig. As tech professionals, or any professionals for that matter, we don't welcome the label of generalist. The word suggests a jack-of-all-trades and master of none. But the generalist is no less an expert than the specialist. Generalists can get their hands greasy with the tools to fix bugs in the machine, but they are especially good at mobilizing the machine itself; with their broad vision and perspective they can direct an entire team to accomplish tasks efficiently. This ability to see the big picture cannot be underestimated, especially during times of crisis or pressure to meet targets. For a team to scale the web effectively, you're going to need a good mix of both types of personalities.

Picking out the potential generalist

Startups wanting to achieve scalability are faced with huge pressure to do more with limited budgets.  In bringing on new engineers, they must hire people who have the programming skills to realise their big idea. Ideally these programmers should also have some architectural vision, a knowledge of web operations, and an understanding of performance as the application becomes popular.  And what of maintaining that large infrastructure as it grows?

So the question for a startup is how do you spot or hire generalists?  In the book REWORK by Jason Fried and David Heinemeier Hansson, the authors emphasize good writers and good teachers.  Their point is that in order to teach an idea or concept you have to understand it thoroughly and be able to step into someone else's shoes in order to explain it from their vantage point.

This is in large part the skill that Ben Fried was speaking about at Surge. To borrow his method of using "Disaster Porn" as a way to illustrate a point, we have a story of our own.

Our own disaster porn

About five years ago we worked for a firm that was faced with ongoing challenges of growth.  Their user base was growing by 25%-50% per quarter but they were suffering from outages because of that growth.  What's more, one of their top engineers was leaving to join another company.  They took the opportunity to bring us on board to assess the entire infrastructure.

We looked over the architecture and were surprised at every turn.  Although they had a lot of engineers on staff, they were all tasked with building features and responding to ongoing business requirements.  None were given any operations responsibilities. There was a very obvious lack of leadership, so you can imagine how this turned out to be a recipe for a fine mess. One day we'd see new servers being added at random, another day we'd witness haphazard decisions about what technologies to use or what versions of frameworks to adopt. In effect, each engineer was making decisions without considering the consequences on the whole.

The infrastructure wound up being built on two different webserver platforms, three - count 'em - three different programming languages and frameworks, and three MySQL databases scattered about on different machines. After a few hours discussing the architecture with the team, I put together a plan that framed the architecture around three simpler tiers.  Two were the standard load-balanced webserver tier and backend database tier; the third managed batch jobs and built static assets and media files.

A generalist solution

Our push then was to standardize on one type of webserver, one version of each language stack, and consolidate all the databases into one instance.  This huge simplification meant that they could add replication to the database tier, eliminating single points of failure and providing redundancy for all business services.  This in itself was a major achievement. We left them with some major problems solved, while offering a new direction, and a better handle on the remaining challenges. What the company had lacked was not engineering know-how, but rather a generalist's perspective.  The engineers had focused too much on immediate tasks, locked on detail and losing sight of the big picture.

As more companies move their applications to the cloud, some carefully and some not, we anticipate many more disaster scenarios such as these.  This speaks strongly to the rising cult of DevOps and its effort towards broader skills and collaboration among both developers and operations teams. The good thing to come out of it is that cleaning up messes such as these will force us to hone our strategic thinking and organizational skills, possibly making generalists out of many more of us.

 



What is the biggest challenge for Big Data?

September 9th, 2011

Often I think about challenges that organizations face with “Big Data”.  While Big Data is a generic and overused term, what I am really referring to is an organization’s ability to disseminate, understand and ultimately benefit from increasing volumes of data.  It is almost without question that in the future customers will be won/lost, competitive advantage will be gained/forfeited and businesses will succeed/fail based on their ability to leverage their data assets.

It may be surprising what I think are the near term challenges.  Largely I don’t think these are purely technical.  There are enough wheels in motion now to almost guarantee that data accessibility will continue to improve at a pace in line with the increase in data volume.  Sure, there will continue to be lots of interesting innovation with technology, but when organizations like Google are doing 10PB sorts on 8000 machines in just over 6 hours – we know the technical scope for Big Data exists and eventually will flow down to the masses, and such scale will likely be achievable by most organizations in the next decade.

Instead I think the core problem that needs to be addressed relates to people and skills.  There are lots of technical engineers who can build distributed systems, orders of magnitude more who can operate them and fill them to the brim with captured data.  But where I think we are lacking skills is with people who know what to do with the data.  People who know how to make it actually useful.  Sure, a BI industry exists today but I think this is currently more focused on the engineering challenges of providing an organization with faster/easier access to their existing knowledge rather than reaching out into the distance and discovering new knowledge.  The people with pure data analysis and knowledge discovery skills are much harder to find, and these are the people who are going to be front and center driving the big data revolution.  People who you can give a few PB of data to and they can provide you back information, discoveries, trends, factoids, patterns, beautiful visualizations and needles you didn’t even know were in the haystack.

These are people who can make a real and significant impact on an organization’s bottom line, or help solve some of the world’s problems when applied to R&D.  Data Geeks are the people to be revered in the future and hopefully we see a steady increase in people wanting to grow up to be Data Scientists.



NSA, Accumulo & Hadoop

September 8th, 2011

I read yesterday that the NSA has submitted a proposal to Apache to incubate their Accumulo platform.  This, according to the description, is a key/value store built over Hadoop which appears to provide similar function to HBase except it provides “cell level access labels” to allow fine grained access control.  This is something you would expect as a requirement for many applications built at government agencies like the NSA.  But this is also very important for organizations in health care, law enforcement and similar fields, where strict control is required over access to large volumes of privacy-sensitive data.
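To give a feel for what cell-level labels mean in practice, here is a toy sketch in Python. It is not Accumulo's API and does not reproduce its label syntax; the `LabeledStore` class, its methods and the sample patient data are all hypothetical, and the rule used (a reader must hold every label attached to a cell) is a simplifying assumption.

```python
class LabeledStore:
    """Toy key/value store in which every cell carries its own access labels.

    Illustrative only: this is NOT Accumulo's API. A cell is visible when the
    reader's authorizations include every label attached to that cell.
    """

    def __init__(self):
        # (row, column) -> (value, frozenset of required labels)
        self._cells = {}

    def put(self, row, column, value, labels=()):
        self._cells[(row, column)] = (value, frozenset(labels))

    def get(self, row, column, authorizations):
        cell = self._cells.get((row, column))
        if cell is None:
            return None
        value, labels = cell
        # Hide the cell unless the reader holds all of its labels.
        return value if labels <= frozenset(authorizations) else None

    def scan(self, row, authorizations):
        auths = frozenset(authorizations)
        return {
            column: value
            for (r, column), (value, labels) in self._cells.items()
            if r == row and labels <= auths
        }


store = LabeledStore()
store.put("patient:42", "name", "J. Doe", labels={"staff"})
store.put("patient:42", "diagnosis", "asthma", labels={"staff", "clinician"})
store.put("patient:42", "room", "12B")  # no labels: visible to any reader

receptionist_view = store.scan("patient:42", {"staff"})
clinician_view = store.scan("patient:42", {"staff", "clinician"})
```

The point is that the access decision is made per cell rather than per table or per row, so two readers scanning the same row can see different columns.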

An interesting part of this is how it highlights the acceptance of Hadoop.  Hadoop is no longer just a new technology scratching at the edges of the traditional database market.  Hadoop is no longer just used by startups and web companies.  This is highlighted by outputs like this from organizations such as the NSA.  This is also further highlighted by the amount of research and focus on Hadoop by the data community at large (such as last week at VLDB).  No, Hadoop has become a proven and trusted platform and is now being used by traditional and conservative segments of the market.  

 



Reply to The Future of the NoSQL, SQL, and RDBMS Markets

August 12th, 2011

Conor O'Mahony over at IBM wrote a good post on a favorite topic of mine “The Future of the NoSQL, SQL, and RDBMS Markets”.  If this is of interest to you then I suggest you read his original post.  I replied in the comments but thought I would also repost my reply here.

-----------------------------------------------------------------------------------------------

Hi Conor, I wish it was as simple as SQL & RDBMS is good for this and NoSQL is good for that.  For me at least, the waters are much muddier than that.

The benefit of SQL & RDBMS is that its general purpose nature has meant it can be applied to a lot of problems, and because of its applicability it has become mainstream to the point every developer on the planet can probably write basic SQL.  And it is justified, there aren’t many data problems you can’t throw an RDBMS at and solve.

The problem with SQL & RDBMS, well essentially I see two.  Firstly, distributed scale is a problem in a small number of cases.  This can be solved by losing some of the generic nature of RDBMS and keeping SQL such as with MPP or attempts like Stonebraker’s NewSQL.  The other way is to lose RDBMS and SQL altogether to achieve scale with alternative key/value methods such as Cassandra, HBase etc.  But these NoSQL databases don’t seem to be the ones gaining the most traction.  From my perspective, the most “popular” and fastest growing NoSQL databases tend to be those which aren’t entirely focused on pure scale but instead focus first on the development model, such as Couch and MongoDB.  Which brings me to my second issue with SQL & RDBMS.

Without a doubt the way in which we build applications has changed dramatically over the last 20 years.  We now see much greater application volumes, much smaller developer teams, shorter development timeframes and faster changing requirements.  Much of what the RDBMS has offered developers – such as strong normalization, enforced integrity, strong data definition, documented schemas – has become less relevant to applications and developers.  Today I would suspect most applications use a SQL database purely as an application-specific dumb datastore.  Usually there aren’t multiple applications accessing that database, there aren’t lots of direct data import/exports into other applications, no third party application reporting, no ad-hoc user queries and the data store is just a repository for a single application to retain data purely for the purpose of making that application function.  Even several major ERP applications have fairly generic databases with soft schemas without any form of constraints or referential integrity.  This is just handled better, from a development perspective, in the code that populates it.

Now of course the RDBMS can meet this requirement – but the issue is the cost of doing this is higher than what it needs to be.  People write code with classes, RDBMS uses SQL.  The translation between these two structures, the plumbing code, can be in cases 50% or more of an application’s code base (be it hand-written code or code generated automatically by a modeling tool).  Why write queries if you are just retrieving an entire row based on key?  Why have a strict data model if you are the only application using it and you maintain integrity in the code?  Why should a change in requirements require you to go through the process of building a schema change script/process that has to be deployed in sync with the application version?  Why have cost based optimization when all the data access paths are 100% known at the time of code compilation?
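To make the plumbing point concrete, here is a minimal sketch in Python. It uses the standard library's sqlite3 module as a stand-in for any RDBMS and a plain dict as a stand-in for a key/value store; the `User` class, the table and the helper function names are hypothetical and chosen only for illustration.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass


@dataclass
class User:
    id: int
    name: str
    email: str


# --- Relational path: a schema, an INSERT, a SELECT and row-to-object mapping ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")


def save_user_sql(user):
    conn.execute(
        "INSERT OR REPLACE INTO users (id, name, email) VALUES (?, ?, ?)",
        (user.id, user.name, user.email),
    )


def load_user_sql(user_id):
    row = conn.execute(
        "SELECT id, name, email FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return User(*row)


# --- Key/value path: serialize the whole object and store it under its key ---
kv_store = {}  # assumption: a dict standing in for a real key/value store


def save_user_kv(user):
    kv_store[f"user:{user.id}"] = json.dumps(asdict(user))


def load_user_kv(user_id):
    return User(**json.loads(kv_store[f"user:{user_id}"]))


alice = User(1, "Alice", "alice@example.com")
save_user_sql(alice)
save_user_kv(alice)
```

Both paths round-trip the same object. The relational one needs the column list spelled out in three places (schema, insert, select), and each of those has to change whenever the class gains a field.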

Now I am still largely undecided on all of this.  I get why NoSQL can be appealing.  I get how it fits with today’s requirements; what I am unsure about is whether it is all very short sighted.  Applications being built today with NoSQL will themselves grow over time.  What may start off today as simple gets/puts within a soft schema’d datastore may over time gain certain reporting or analytics requirements unexpected when initial development began.  What might have taken a simple SQL query to meet such a requirement in RDBMS now might require data being extracted into something else, maybe Hadoop or MPP or maybe just a simple SQL RDBMS – where it can be processed and re-extracted back into the NoSQL store in a processed form.  It might make sense if you have huge volumes of data but for the small scale web app, this could be a lot of cost and overhead to summarize data for simple reporting needs.

Of course this is all still evolving.  And RDBMS vendors and NoSQL are both on some form of convergence path.  We have already started hearing noises about RDBMS looking to offer more NoSQL like interfaces to the underlying data stores as well as the NoSQL looking to offer more SQL like interfaces to their repositories.  They will meet up eventually, but by then we will all be talking about something new like stream processing :)

Thanks Conor for the thought provoking post.
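A small sketch of that reporting gap, again in Python with sqlite3 standing in for the RDBMS and a dict of JSON strings standing in for the NoSQL store. The order data and the `kv_report` helper are made up for illustration.

```python
import json
import sqlite3
from collections import defaultdict

# Hypothetical sample data.
orders = [
    {"id": 1, "customer": "acme", "total": 120.0},
    {"id": 2, "customer": "acme", "total": 80.0},
    {"id": 3, "customer": "globex", "total": 50.0},
]

# RDBMS: the late-arriving report is a single declarative query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (:id, :customer, :total)", orders)
sql_report = dict(
    conn.execute("SELECT customer, SUM(total) FROM orders GROUP BY customer")
)

# Key/value store: values are opaque blobs, so the application has to scan
# every key, deserialize each value and aggregate by hand.
kv_store = {f"order:{o['id']}": json.dumps(o) for o in orders}


def kv_report(store):
    totals = defaultdict(float)
    for key, raw in store.items():
        if key.startswith("order:"):
            order = json.loads(raw)
            totals[order["customer"]] += order["total"]
    return dict(totals)
```

At three rows the difference is trivial. The concern in the paragraph above is what the second path turns into once the scan has to run against a production store, or the data has to be exported somewhere else first.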

 



Building data startups: Fast, big, and focused

August 9th, 2011

This is a written follow-up to a talk presented at a recent Strata online event.

A new breed of startup is emerging, built to take advantage of the rising tides of data across a variety of verticals and the maturing ecosystem of tools for its large-scale analysis.

These are data startups, and they are the sumo wrestlers on the startup stage. The weight of data is a source of their competitive advantage. But like their sumo mentors, size alone is not enough. The most successful of data startups must be fast (with data), big (with analytics), and focused (with services).

Setting the stage: The attack of the exponentials

The question of why this style of startup is arising today, versus a decade ago, owes to a confluence of forces that I call the Attack of the Exponentials. In short, over the past five decades, the cost of storage, CPU, and bandwidth has been exponentially dropping, while network access has exponentially increased. In 1980, a terabyte of disk storage cost $14 million. Today, it's at $30 and dropping. Classes of data that were previously economically unviable to store and mine, such as machine-generated log files, now represent prospects for profit.
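As a rough check on what those two storage prices imply, the short Python sketch below computes how often the cost per terabyte would have had to halve. It assumes a constant exponential decline and takes "today" to be 2011, the year this was written; the function name is ours.

```python
import math


def halving_time_years(cost_start, cost_end, years):
    """Years per halving of cost, assuming a constant exponential decline."""
    halvings = math.log2(cost_start / cost_end)
    return years / halvings


# Figures from the text: $14 million per terabyte in 1980, $30 "today".
# Assumption: "today" is 2011, the year of the article.
halvings = math.log2(14_000_000 / 30)
period = halving_time_years(14_000_000, 30, 2011 - 1980)
print(f"{halvings:.1f} halvings, one every {period:.2f} years")
```

Under those assumptions the price halved roughly 19 times in 31 years, or about once every 1.6 years.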

Attack of the exponentials

At the same time, these technological forces are not symmetric: CPU and storage costs have fallen faster than that of network and disk IO. Thus data is heavy; it gravitates toward centers of storage and compute power in proportion to its mass. Migration to the cloud is the manifest destiny for big data, and the cloud is the launching pad for data startups.

Leveraging the big data stack


As the foundational layer in the big data stack, the cloud provides the scalable persistence and compute power needed to manufacture data products.

At the middle layer of the big data stack is analytics, where features are extracted from data, and fed into classification and prediction algorithms.

Finally, at the top of the stack are services and applications. This is the level at which consumers experience a data product, whether it be a music recommendation or a traffic route prediction.

Let's take each of the layers and discuss the competitive axes at each.

The emerging big data stack
The competitive axes and representative technologies on the Big Data stack are illustrated here. At the bottom tier of data, free tools are shown in red (MySQL, Postgres, Hadoop), and we see how their commercial adaptations (InfoBright, Greenplum, MapR) compete principally along the axis of speed, offering faster processing and query times. Several of these players are pushing up towards the second tier of the data stack, analytics. At this layer, the primary competitive axis is scale: few offerings can address terabyte-scale data sets, and those that do are typically proprietary. Finally, at the top layer of the big data stack lie the services that touch consumers and businesses. Here, focus within a specific sector, combined with depth that reaches downward into the analytics tier, is the defining competitive advantage.

Fast data

At the base of the big data stack — where data is stored, processed, and queried — the dominant axis of competition was once scale. But as cheaper commodity disks and Hadoop have effectively addressed scalable persistence and processing, the focus of competition has shifted toward speed. The demand for faster disks has led to an explosion in interest in solid-state disk firms, such as Fusion-IO, which went public recently. And several startups, most notably MapR, are promising faster versions of Hadoop.

Fusion-IO and MapR represent another trend at the data layer: commercial technologies that challenge open source or commodity offerings on an efficiency basis, namely watts or CPU cycles consumed. With energy costs driving between one-third and one-half of data center operating costs, these efficiencies have a direct financial impact.

Finally, just as many large-scale, NoSQL data stores are moving from disk to SSD, others have observed that many traditional, relational databases will soon be entirely in memory. This is particularly true for applications that require repeated, fast access to a full set of data, such as building models from customer-product matrices. This brings us to the second tier of the big data stack, analytics.

Big analytics

At the second tier of the big data stack, analytics is the brains to cloud computing's brawn. Here, however, the speed is less of a challenge; given an addressable data set in memory, most statistical algorithms can yield results in seconds. The challenge is scaling these out to address large datasets, and rewriting algorithms to operate in an online, distributed manner across many machines.
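One small illustration of what rewriting an algorithm to run online and distributed can look like is mean and variance. The Python sketch below uses Welford's single-pass update plus a pairwise merge of partial summaries, so each machine summarizes its own partition and only the summaries travel. The partitions and the `RunningStats` name are hypothetical; real systems apply the same idea to far richer models.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class RunningStats:
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Online step: fold in one value without storing the data set."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        """Distributed step: combine two partial summaries into one."""
        if self.n == 0:
            return other
        if other.n == 0:
            return self
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningStats(n, mean, m2)

    @property
    def variance(self):
        """Population variance of everything seen so far."""
        return self.m2 / self.n if self.n else 0.0


def summarize(partition):
    stats = RunningStats()
    for x in partition:
        stats.update(x)
    return stats


# Hypothetical partitions, one per machine.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
total = reduce(lambda a, b: a.merge(b), (summarize(p) for p in partitions))
```

Each summary is three numbers regardless of how large the partition is, which is what makes this shape attractive when the data does not fit on one machine.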

Because data is heavy, and algorithms are light, one key strategy is to push code deeper to where the data lives, to minimize network IO. This often requires a tight coupling between the data storage layer and the analytics, and algorithms often need to be re-written as user-defined functions (UDFs) in a language compatible with the data layer. Greenplum, leveraging its Postgres roots, supports UDFs written in both Java and R. Following Google's BigTable, HBase is introducing coprocessors in its 0.92 release, which allow Java code to be associated with data tablets and minimize data transfer over the network. Netezza pushes even further into hardware, embedding an array of functions into FPGAs that are physically co-located with the disks of its storage appliances.

The field of what's alternatively called business or predictive analytics is nascent, and while a range of enabling tools and platforms exist (such as R, SPSS, and SAS), most of the algorithms developed are proprietary and vertical-specific. As the ecosystem matures, one may expect to see the rise of firms selling analytical services (such as recommendation engines) that interoperate across data platforms. But in the near-term, consultancies like Accenture and McKinsey are positioning themselves to provide big analytics via billable hours.
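The UDF idea can be tried on a laptop. The Python sketch below uses SQLite from the standard library as a stand-in (it is not Greenplum, HBase or Netezza) and registers a Python function so that the query engine applies it while scanning rows. The `risk_score` function, the table and the threshold are hypothetical.

```python
import sqlite3


def risk_score(amount, is_foreign):
    """Hypothetical scoring function standing in for a real model."""
    return amount * (2.0 if is_foreign else 1.0)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE charges (id INTEGER PRIMARY KEY, amount REAL, is_foreign INTEGER)"
)
conn.executemany(
    "INSERT INTO charges (amount, is_foreign) VALUES (?, ?)",
    [(20.0, 0), (900.0, 1), (35.0, 0), (600.0, 0)],
)

# Register the function with the engine so SQL can call it by name.
conn.create_function("risk_score", 2, risk_score)

# Pull: ship every row to the application, then score and filter there.
all_rows = conn.execute(
    "SELECT id, amount, is_foreign FROM charges ORDER BY id"
).fetchall()
flagged_pull = [row[0] for row in all_rows if risk_score(row[1], row[2]) > 500]

# Push: the engine calls the UDF during the scan and returns only the matches.
flagged_push = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM charges WHERE risk_score(amount, is_foreign) > 500 ORDER BY id"
    )
]
```

SQLite runs in-process, so nothing here actually crosses a network; the sketch only shows the shape of the two approaches. In a distributed store the pull version moves every row to the client, while the push version moves the function to the rows and sends back only the results.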

Outside of consulting, firms with analytical strengths push upward, surfacing focused products or services to achieve success.


Focused services

The top of the big data stack is where data products and services directly touch consumers and businesses. For data startups, these offerings more frequently take the form of a service, offered as an API rather than a bundle of bits.

BillGuard is a great example of a startup offering a focused data service. It monitors customers' credit card statements for dubious charges, and even leverages the collective behavior of users to improve its fraud predictions.

Several startups are working on algorithms that can crack the content relevance nut, including Flipboard and News.me. Klout delivers a pure data service that uses social media activity to measure online influence. My company, Metamarkets, crunches server logs to provide pricing analytics for publishers.

For data startups, data processes and algorithms define their competitive advantage. Poor predictions — whether of fraud, relevance, influence, or price — will sink a data startup, no matter how well-designed their web UI or mobile application.

Focused data services aren't limited to startups: LinkedIn's People You May Know and FourSquare's Explore feature enhance engagement of their companies' core products, but only when they correctly suggest people and places.

Democratizing big data

The axes of strategy in the big data stack show analytics to be squarely at the center. Data platform providers are pushing upwards into analytics to differentiate themselves, touting support for fast, distributed code execution close to the data. Traditional analytics players, such as SAS and SAP, are expanding their storage footprints and challenging the need for alternative data platforms as staging areas. Finally, data startups and many established firms are creating services whose success hinges directly on proprietary analytics algorithms.

The emergence of data startups highlights the democratizing consequences of a maturing big data stack. For the first time, companies can successfully build offerings without deep infrastructure know-how and focus at a higher level, developing analytics and services. By all indications, this is a democratic force that promises to unleash a wave of innovation in the coming decade.



Related:




PlanetMySQL Voting: Vote UP / Vote DOWN

Building data startups: Fast, big, and focused

Август 9th, 2011

This is a written follow-up to a talk presented at a recent Strata online event.

A new breed of startup is emerging, built to take advantage of the rising tides of data across a variety of verticals and the maturing ecosystem of tools for its large-scale analysis.

These are data startups, and they are the sumo wrestlers on the startup stage. The weight of data is a source of their competitive advantage. But like their sumo mentors, size alone is not enough. The most successful of data startups must be fast (with data), big (with analytics), and focused (with services).

Setting the stage: The attack of the exponentials

This style of startup is arising today, rather than a decade ago, because of a confluence of forces that I call the Attack of the Exponentials. In short, over the past five decades, the costs of storage, CPU, and bandwidth have been dropping exponentially, while network access has increased exponentially. In 1980, a terabyte of disk storage cost $14 million. Today, it's at $30 and dropping. Classes of data that were previously economically unviable to store and mine, such as machine-generated log files, now represent prospects for profit.
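As a rough back-of-the-envelope check on the storage figures above, the sketch below computes the halving time implied by a fall from $14 million to $30 per terabyte. It assumes that "today" means 2011, the year this post appeared, and that the decline was a smooth exponential. Both are simplifying assumptions, and the function name is purely illustrative.

```python
import math


def implied_halving_time(start_cost, end_cost, years):
    """Years needed for the cost to halve, assuming a smooth exponential decline."""
    halvings = math.log2(start_cost / end_cost)  # how many times the cost halved
    return years / halvings


COST_1980 = 14_000_000   # dollars per terabyte in 1980 (figure from the text)
COST_TODAY = 30          # dollars per terabyte "today" (figure from the text)
YEARS = 2011 - 1980      # assumption: "today" is 2011

halving_years = implied_halving_time(COST_1980, COST_TODAY, YEARS)
print(f"Cost per terabyte halved roughly every {halving_years:.2f} years")
```

Under those assumptions, the cost of a terabyte halved a little more often than once every two years, which is what makes storing low-value data such as log files economical.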

[Figure: Attack of the exponentials]

At the same time, these technological forces are not symmetric: CPU and storage costs have fallen faster than those of network and disk IO. Thus data is heavy; it gravitates toward centers of storage and compute power in proportion to its mass. Migration to the cloud is the manifest destiny for big data, and the cloud is the launching pad for data startups.

Leveraging the big data stack


As the foundational layer in the big data stack, the cloud provides the scalable persistence and compute power needed to manufacture data products.

At the middle layer of the big data stack is analytics, where features are extracted from data, and fed into classification and prediction algorithms.

Finally, at the top of the stack are services and applications. This is the level at which consumers experience a data product, whether it be a music recommendation or a traffic route prediction.

Let's take each of these layers in turn and discuss the competitive axes at each.

[Figure: The emerging big data stack]
The competitive axes and representative technologies on the big data stack are illustrated here. At the bottom tier of data, free tools are shown in red (MySQL, Postgres, Hadoop), and we see how their commercial adaptations (InfoBright, Greenplum, MapR) compete principally along the axis of speed, offering faster processing and query times. Several of these players are pushing up towards the second tier of the data stack, analytics. At this layer, the primary competitive axis is scale: few offerings can address terabyte-scale data sets, and those that do are typically proprietary. Finally, at the top layer of the big data stack lie the services that touch consumers and businesses. Here, focus within a specific sector, combined with depth that reaches downward into the analytics tier, is the defining competitive advantage.

Fast data

At the base of the big data stack, where data is stored, processed, and queried, the dominant axis of competition was once scale. But as cheaper commodity disks and Hadoop have effectively addressed scalable persistence and processing, the focus of competition has shifted toward speed. The demand for faster disks has led to an explosion of interest in solid-state disk firms, such as Fusion-io, which went public recently. And several startups, most notably MapR, are promising faster versions of Hadoop.

Fusion-io and MapR represent another trend at the data layer: commercial technologies that challenge open source or commodity offerings on an efficiency basis, namely watts or CPU cycles consumed. With energy costs driving between one-third and one-half of data center operating costs, these efficiencies have a direct financial impact.

Finally, just as many large-scale NoSQL data stores are moving from disk to SSD, others have observed that many traditional relational databases will soon be entirely in memory. This is particularly true for applications that require repeated, fast access to a full set of data, such as building models from customer-product matrices. This brings us to the second tier of the big data stack, analytics.

Big analytics

At the second tier of the big data stack, analytics is the brains to cloud computing's brawn. Here, however, speed is less of a challenge; given an addressable data set in memory, most statistical algorithms can yield results in seconds. The challenge is scaling these algorithms out to address large datasets, and rewriting them to operate in an online, distributed manner across many machines.
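To make "online, distributed" concrete, here is a minimal sketch of one such rewrite: computing a mean and variance one value at a time (Welford's method) and then combining partial results from separate machines (the standard pairwise merge formula). The class name RunningStats and the two "shards" are illustrative assumptions, not taken from any particular platform.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Mean/variance state that can be updated per value and merged across machines."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Online step: fold in one value without keeping earlier values around.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningStats") -> "RunningStats":
        # Distributed step: combine two partial summaries into one.
        if self.n == 0:
            return RunningStats(other.n, other.mean, other.m2)
        if other.n == 0:
            return RunningStats(self.n, self.mean, self.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningStats(n, mean, m2)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 for an empty summary.
        return self.m2 / self.n if self.n else 0.0


# Two hypothetical machines, each seeing only its own shard of the data.
shard_a, shard_b = RunningStats(), RunningStats()
for value in [2, 4, 4, 4]:
    shard_a.update(value)
for value in [5, 5, 7, 9]:
    shard_b.update(value)

combined = shard_a.merge(shard_b)
print(combined.n, combined.mean, combined.variance)
```

Each machine ships only three numbers to the coordinator, no matter how many values it has processed, which is the property that lets this kind of algorithm scale out.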

Because data is heavy and algorithms are light, one key strategy is to push code deeper to where the data lives, minimizing network IO. This often requires a tight coupling between the data storage layer and the analytics, and algorithms often need to be rewritten as user-defined functions (UDFs) in a language compatible with the data layer. Greenplum, leveraging its Postgres roots, supports UDFs written in both Java and R. Following Google's BigTable, HBase is introducing coprocessors in its 0.92 release, which allow Java code to be associated with data tablets and so minimize data transfer over the network. Netezza pushes even further into hardware, embedding an array of functions into FPGAs that are physically co-located with the disks of its storage appliances.
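The trade-off can be illustrated with a toy simulation. The sketch below is not the API of Greenplum, HBase, or Netezza; StorageNode, fetch_all, and run_local are hypothetical names invented for this example, and JSON-encoded bytes stand in for network traffic. It compares pulling every row to the client against shipping a small function to each node and returning only a summary.

```python
import json


class StorageNode:
    """Hypothetical storage node holding rows locally (illustration only)."""

    def __init__(self, rows):
        self._rows = list(rows)

    def fetch_all(self) -> bytes:
        # Pull model: every row crosses the "network".
        return json.dumps(self._rows).encode()

    def run_local(self, fn) -> bytes:
        # Push model: fn runs next to the data; only its result crosses the "network".
        return json.dumps(fn(self._rows)).encode()


def partial_sum(rows):
    # The "UDF": a small aggregate computed where the rows live.
    return {"count": len(rows), "total": sum(r["amount"] for r in rows)}


# Three nodes with 1,000 made-up rows each.
nodes = [
    StorageNode({"id": i, "amount": i % 10} for i in range(start, start + 1000))
    for start in (0, 1000, 2000)
]

# Strategy A: move the data to the code.
pulled = [node.fetch_all() for node in nodes]
bytes_pulled = sum(len(p) for p in pulled)
total_a = sum(r["amount"] for p in pulled for r in json.loads(p))

# Strategy B: move the code to the data.
partials = [node.run_local(partial_sum) for node in nodes]
bytes_pushed = sum(len(p) for p in partials)
total_b = sum(json.loads(p)["total"] for p in partials)

print(f"pull: {bytes_pulled} bytes, push: {bytes_pushed} bytes, totals match: {total_a == total_b}")
```

Both strategies give the same answer, but the second transfers a few dozen bytes per node instead of the full row set, and the gap widens as the data grows.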

The field of what's alternatively called business or predictive analytics is nascent, and while a range of enabling tools and platforms exist (such as R, SPSS, and SAS), most of the algorithms developed are proprietary and vertical-specific. As the ecosystem matures, one may expect to see the rise of firms selling analytical services, such as recommendation engines, that interoperate across data platforms. But in the near term, consultancies like Accenture and McKinsey are positioning themselves to provide big analytics via billable hours.

Outside of consulting, firms with analytical strengths push upward, surfacing focused products or services to achieve success.


Focused services

The top of the big data stack is where data products and services directly touch consumers and businesses. For data startups, these offerings more frequently take the form of a service, offered as an API rather than a bundle of bits.

BillGuard is a great example of a startup offering a focused data service. It monitors customers' credit card statements for dubious charges, and even leverages the collective behavior of users to improve its fraud predictions.

Several startups are working on algorithms that can crack the content relevance nut, including Flipboard and News.me. Klout delivers a pure data service that uses social media activity to measure online influence. My company, Metamarkets, crunches server logs to provide pricing analytics for publishers.

For data startups, data processes and algorithms define their competitive advantage. Poor predictions — whether of fraud, relevance, influence, or price — will sink a data startup, no matter how well-designed their web UI or mobile application.

Focused data services aren't limited to startups: LinkedIn's People You May Know and Foursquare's Explore feature enhance engagement with their companies' core products, but only when they correctly suggest people and places.

Democratizing big data

The axes of strategy in the big data stack show analytics to be squarely at the center. Data platform providers are pushing upwards into analytics to differentiate themselves, touting support for fast, distributed code execution close to the data. Traditional analytics players, such as SAS and SAP, are expanding their storage footprints and challenging the need for alternative data platforms as staging areas. Finally, data startups and many established firms are creating services whose success hinges directly on proprietary analytics algorithms.

The emergence of data startups highlights the democratizing consequences of a maturing big data stack. For the first time, companies can successfully build offerings without deep infrastructure know-how and focus at a higher level, developing analytics and services. By all indications, this is a democratic force that promises to unleash a wave of innovation in the coming decade.




IA Ventures — Jobs shout out

August 4th, 2011

My friends over at IA Ventures are looking for both an Analyst and an Associate to join their team. If Big Data, New York, and start-ups are in your blood, then I can’t think of a better VC to be involved with.

From the IA blog:

"IA Ventures funds early-stage Big Data companies creating competitive advantage through data and we’re looking for two start-up junkies to join our team – one full-time associate / community manager and one full time analyst. Because there are only four of us (we’re a start-up ourselves, in fact), we’ll need you to help us investigate companies, learn about industries, develop investment theses, perform internal operations, organize community events, and work with portfolio companies—basically, you can take on as much responsibility as you can handle."

Roger, Brad, and the team continue to impress with their focus on Big Data, their strategic investments in monetizing data, and their knowledge of the industry in general.

