Archive for the ‘data’ Category

Can the People’s House become a social platform for the people?

December 12th, 2011

Congressional hackathon
InSourceCode developers work on "Madison" with volunteers.

There wasn't a great deal of hacking, at least in the traditional sense, at the "first congressional hackathon." Given the general shiver that the word still evokes in many a Washingtonian in 2011, that might be for the best. The attendees gathered together in the halls of the United States House of Representatives didn't create a more interactive visualization of how laws are made or a mobile health app. As open government advocate Carl Malamud observed, the "hack" felt like something even rarer in the "Age of the App for That:"

Impressed @MattLira pulled off a truly bipartisan tech event on the hill. *that* is a true hack. #inhackwetrust

— Carl Malamud (@carlmalamud) December 7, 2011

In a time when partisanship and legislative gridlock have defined Congress for many citizens, seeing the leadership of the United States House of Representatives agree on the importance of using the power of data and social networking to open government was an early Christmas present.

"Increased access, increased connection with our constituents, transparency, openness is not a partisan issue," said House Majority Leader Eric Cantor.

"The Republican leader and I may debate vigorously on many issues, but one area where we strongly agree is on making Congress more transparent and accessible," said House Democratic Whip Steny Hoyer in his remarks. "First, Congress took steps to open up the Capitol building so citizens can meet with their representatives and see the home of their legislature. In the same way, Congress is now taking steps to update how it connects with the American people online."

An open House

While the event was branded as a "Congressional Facebook Developer Hackathon," what emerged more closely resembled a loosely organized conference or camp.

Facebook executives and developers shared the stage with members of Congress to give keynotes to the 200 or so attendees before everyone broke into discussion groups to talk about constituent communications, press relations and legislative data. The event might be more aptly described as a "wonk-a-thon," as Sunlight Foundation's Daniel Schuman put it last week.

This "hackathon" was organized to have some of the feel of an unconference, in the view of Matt Lira, digital director for the House Majority Leader. Lira sat down for a follow-up interview last Thursday.

"There's a real model to CityCamp," he said. "We had 'curators' for the breakout. Next time, depending on how we structure it, we might break out events that are designed specifically for programming, with others clustered around topics. We want to keep it experimental."

Why? "When Aneesh Chopra and I did that session at SXSW, that personally for me was what tripped my thinking here," said Lira. "We came down from the stage and formed a circle. I was thinking the whole time that it would have been a waste of intellectual talent to have Tim O'Reilly and Clay Shirky in the audience instead of engaging in the conversation. I was thinking I never want to do a panel again. I want it to be like this."

Part of the challenge, so to speak, of Congress hosting a hackathon in the traditional sense, with judging and prizes, lies in procurement rules, said Lira. "There are legal issues around challenges or prizes for Congress," he explained. "They're allowed in the executive branch, under DARPA, and now every agency under the COMPETES Act. We can't choose winners or losers, or give out prizes under procurement rules."

Whatever you call it, at the end of the event, discussion leaders from the groups came back and presented on the ideas and concepts that had been hashed out. You can watch a short video that EngageDC produced for the House Majority Leader's office below:

What came out of this unprecedented event, in other words, won't necessarily be measured in lines of code. It's that Congress got geekier. It's that the House is opening its doors to transparency through technology.

Given the focus on Facebook, it's not surprising that social media took center stage in many of the discussions. The idea for it came from a trip to Silicon Valley, where Representative Cantor said he met with Facebook founder Mark Zuckerberg and COO Sheryl Sandberg, and discussed how to make the House more social. After that conversation, Lira and Steve Dwyer, director of online communications and technology for the House Democratic Whip, organized the event.

For a sense of the ideas shared by the working groups, read the story of the first congressional "hackathon" on Storify.

"For government, I don't think we could have done anything more purposeful than this as a first meeting," said Lira in our interview. "Next, we'll focus on building this group of people, strengthening the trust, which will prove instrumental when we get into the pure coding space. I have 100% confidence that we could do a programming-only event now and would have attendance."

A Likeocracy in alpha

As the Sunlight Foundation's John Wonderlich observed earlier this year, access to legislative data brings citizens closer to their representatives.

"When developers and programmers have better access to the data of Congress, they can better build the databases and tools that let the rest of us connect with the legislature," he wrote.

If more open legislative data goes online, when we talk about what's trending in Congress, those conversations will be based upon insight into how the nation is reacting to them on social networks, including Facebook, Twitter, and Google+.

Facebook developers Roddy Lindsay, Tyler Brock, Eric Chaves, Porter Bayne, and Blaise DiPersia coded up a simple proof of concept of what making legislative data social might look like. "LikeOcracy" pulls legislation from a House XML feed and makes it more social. The first version added Facebook's ubiquitous "Like" buttons to bill elements. A second version of the app adds more opportunities for reaction by integrating ReadrBoard, which enables users to rate sections or individual lines as "Unnecessary, Problematic, Great Idea or Confusing." You can try it out on three sample bills, including the Stop Online Piracy Act.
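As a rough illustration of the first step, pulling bill elements out of XML so that each one has an address a reaction button could point to, here is a minimal Python sketch. The XML layout, the `bill_elements` helper, and the `example.org` base URL are all assumptions made for illustration; the text does not describe the House feed's actual schema or how LikeOcracy was implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical bill markup; the real House XML schema is not described in the text.
SAMPLE_BILL = """
<bill id="hr-0000">
  <title>Example Act</title>
  <section id="s1"><heading>Short title</heading><text>This Act may be cited as the Example Act.</text></section>
  <section id="s2"><heading>Definitions</heading><text>In this Act, the following terms apply.</text></section>
</bill>
"""

def bill_elements(xml_text, base_url):
    """Return one dict per section, each with a stable URL a reaction widget could target."""
    root = ET.fromstring(xml_text.strip())
    bill_id = root.get("id")
    elements = []
    for section in root.iter("section"):
        elements.append({
            "heading": section.findtext("heading", default=""),
            "text": section.findtext("text", default=""),
            # One addressable URL per bill element (assumed URL scheme).
            "url": f"{base_url}/{bill_id}#{section.get('id')}",
        })
    return elements

elements = bill_elements(SAMPLE_BILL, "https://example.org/bills")
for item in elements:
    print(item["url"], "-", item["heading"])
```

Giving every section its own URL is the design choice that matters here: a "Like" or rating widget attaches to an address, so finer-grained addresses allow finer-grained reactions.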

Would "social legislation" in a Facebook app catch on? The growth of civic startups like PopVox, OpenCongress and Votizen suggests that the idea has legs. [Disclosure: Tim O'Reilly was an early angel investor in PopVox.]

Likeocracy doesn't tap into Facebook's Open Graph, but it does hint at what integration might look like in the future. Justin Osofsky, Facebook's director of platform partnerships, described how the interests of constituents could be integrated with congressional data under Facebook's new Timeline. Citizens might potentially be able to simply "subscribe" to a bill, much like they can now for any web page, if Facebook's "Subscribe" plug-in was applied to the legislative process.

Opening bill markup online

The other app presented at the hackathon came not from the attendees but from the efforts of InSourceCode, a software development firm that's also coded for Congressman Mike Pence and the Republican National Committee.

Rep. Darrell Issa, chairman of the House Committee on Oversight and Government Reform, introduced the beta version of MADISON on Wednesday, a new online tool to crowdsource legislative markup. The vision is that MADISON will work as a real-time markup engine to let the public comment on bills as they move through the legislative process. "The assumption is that legislation should be open in Congress," said Issa. "It should be posted, interoperable and commented upon."

As Nick Judd reported at techPresident, the first use of MADISON is to host Issa and Sen. Ron Wyden's "OPEN bill," which debuted on the app. Last week, the congressmen released the Online Protection and Enforcement of Digital Trade Act (OPEN) at Keepthewebopen.com. The OPEN legislation removes one of the most controversial aspects of SOPA, using the domain name system for enforcement, and instead places authority with the International Trade Commission to address enforcement of IP rights on websites that are primarily infringing upon copyright.

Issa said that his team had looked at the use of wikis by Rep. John Culberson, who put the healthcare reform bill online in a wiki. "There are some problems with editors who are not transparent to all of us," said Issa. "That's one of the challenges. We want to make sure that if you're an editor, you're a known editor."

MADISON includes two levels of authentication: email for simple commenting and a more thorough vetting process for organizations or advocacy groups that wish to comment. "Like most things that are a 1.0 or beta, our assumption is that we'll learn from this," said Issa. "Some members may choose to have an active dialog. Others may choose to have it be part of pre-markup record."

Issa fielded a number of questions on Wednesday, including one from web developer Brett Stubbs: "Will there be open access or an API? What we really want is just data." Issa indicated that future versions might include that.

Jayson Manship, the "chief nerd" at InSourceCode, said that MADISON was built in four days. According to Manship, the idea came from conversations with Issa and Seamus Kraft, director of digital strategy for the House Committee on Oversight and Government Reform. MADISON is built with PHP and MySQL, and hosted in RackSpace's cloud so it can scale with demand, said Manship.

"It's important to be entrepreneurial," said Lira in our interview. "There are partners throughout institutions that would be willing to do projects of different sizes and scopes. MADISON is something that Issa and Seamus wanted to do. They took it upon themselves to get the ball rolling. That's the attitude we need."

"We're working to hold the executive accountable to taxpayers," said Kraft last week. "Opening up what we do here in these two halls of Congress is equally important. MADISON is our first shot at it. We're going to need a lot of help to make it better."

Kraft invited the remaining developers present to come to the Rayburn Office Building, where Manship and his team had brought in half a dozen machines, to help get MADISON ready for launch. While I was there, there were conversations about decisions, plug-ins and ideas about improving the interface or functionality, representing a bona fide collaboration to make the app better.

There's a larger philosophical issue relating to open government that Nick Judd touched upon over at techPresident in a follow-up post on MADISON:

The terms for the site warn the user that anything they write on it will become public domain — but the code itself is proprietary. Meanwhile, OpenCongress' David Moore points out that the code that powers his organization's website, which also allows users to comment on individual provisions of bill text, is open source and has been available for some time. In theory, this means the Oversight staff could have started from that code and built on it instead of beginning from scratch. The code being proprietary means that while people like Moore might be able to make suggestions, they can't just download it, make their own changes and submit them for community review — which they'd happily do at little or no cost for a project released under an open-source license.

As Moore put it, "Get that code on GitHub, we'll do OpenID, fix the design."

When asked about whether the team had considered making MADISON code open source, Manship said that "he didn't know, although they weren't opposed to it."

While Moore welcomed MADISON, he also observed that Open Congress has had open-source code for bill text commenting for years.

@seamuskraft @mattlira glad to chat, will email. We see first step as liberating full #opengovdata (API & bulk) for MADISON & OC & open Web.

— David Moore (@ppolitics) December 9, 2011

The decision by Issa's office to fund the creation of an app that was already available as open-source software is one that's worth noting, so I asked Kraft why they didn't fork OpenCongress' code, as Judd suggests. "While there was no specific budget expense for MADISON, it was developed by the Oversight Committee," said Kraft.

"While we like and support OpenCongress' code, it didn't fit the needs for MADISON," Kraft wrote in an emailed statement.

What's next is, so to speak, an "OPEN" question, both in terms of the proposed SOPA alternative and the planned markup of SOPA itself on December 15. The designers of OPEN are actively looking for feedback from the civic software development community, both in terms of what functionality exists now and what could be built in future iterations.

THOMAS.gov as a platform

What Moore and long-time open-government advocates like Carl Malamud want to see from Congress is more structural change:

Re: #hackwetrust, while we do seek leg. version control, public bill markup isn't ultimate goal. Exhaustive #opengovdata & open API is (2/2)

— David Moore (@ppolitics) December 8, 2011

@MattLira @DarrellIssa @SeamusKraft MADISON is much-welcomed, but PPF's #opengov ultimate goal is open API for @THOMASdotgov -cc @digiphile.

— David Moore (@ppolitics) December 9, 2011

They're not alone. Dan Schuman listed many other ways the House has yet to catch up with 21st century technology:

We have yet to see bulk access to THOMAS or public access to CRS reports, important legislative and ethics documents are still unavailable in digital format, many committee hearings still are not online, and so on.

As Schuman highlighted, the Sunlight Foundation has been focused on opening up Congress through technology since the organization was founded. To wit: "There have been several previous collaborative efforts by members of the transparency community to outline how the House of Representatives can be more open and accountable, of which an enduring touchstone is the Open House Project Report, issued in May 2007," wrote Schuman.

The notion of making THOMAS.gov into a platform received high-level endorsement from a congressional leader when House Minority Whip Steny Hoyer remarked on how technology is affecting Congress, his caucus and open government in the executive branch:

For Congress, there is still a lot of work to be done, and we have a duty to make the legislative process as open and accessible as possible. One thing we could do is make THOMAS.gov — where people go to research legislation from current and previous Congresses — easier to use, and accessible by social media. Imagine if a bill in Congress could tweet its own status.

The data available on THOMAS.gov should be expanded and made easily accessible by third-party systems. Once this happens, developers, like many of you here today, could use legislative data in innovative ways. This will usher in new public-private partnerships that will empower new entrepreneurs who will, in turn, yield benefits to the public sector.

One successful example is how cities have made public transit data accessible so developers can use it in apps and websites. The end result has been commuters saving time every day and seeing more punctual trains and buses as a result of the transparency. Legislative data is far more complex, but the same principles apply. If we make the information available, I am confident that smart people like you will use it in inventive ways.

Hoyer's specific citation of the growth of open data in cities and an ecosystem of civic applications based upon it is further confirmation that the Gov 2.0 meme is moving into the mainstream.

Making THOMAS.gov into a platform for bulk data would change what's possible for all civic developers. What I really want is "data on everything," Stubbs told me last week. "THOMAS is just a visual viewer of the internal stuff. If we could have all of this, we could do something with it. What I would like is a data broker. I'd like a RESTful API with all of the data that I could just query. That's what the government could learn from Facebook. From my point of view, I just want to pull information and compile it."

If Hoyer and the House leadership would like to see THOMAS.gov act as a platform, several attendees at the hackathon suggested to me that Congress could take a specific action: collaborate with the Senate and send the Library of Congress a letter instructing it to provide bulk legislative data access to THOMAS.gov in structured formats so that developers, designers and citizens around the nation can co-create a better civic experience for everyone.

"The House administration is working on standards called for by the rule and the letter sent earlier this year," said Lira. "We think they will be satisfactory to people. The institutions of the House have been following through since the day they were issued. The first step was issuing an XML feed daily. Next year, there will be a steady series of incremental process improvements. When the House Administrative Committee issues standards, the House Clerk will work on them. "

Despite the abysmal public perception of Congress, genuine institutional changes in the House of Representatives driven by the GOP embracing innovation and transparency are incrementally happening. As Tim O'Reilly observed earlier this year, the current leadership of the House on transparency is doing a better job than their predecessors.

In April, Speaker Boehner and Majority Leader Cantor sent a letter to the House Clerk regarding legislative data release. Then, in September, a live XML feed for the House floor went online. Yes, there's a long way to go on open legislative data quality in Congress. That said, there's support for open-government data from both the White House and the House.

"My personal view is that what's important right now is that the House create the right precedents," said Lira. "If we create or adopt a data standard, it's important that it be the right standard."

Even if open government is in beta, there needs to be more tolerance for experiments and risks, said Lira. "I made a mistake in attacking We the People as insufficient. I still believe it is, but it's important to realize that the precedent is as important as the product in government. In technology in general, you'll never reach an end. We The People is a really good precedent, and I look forward to seeing what they do. They've shown a real commitment, and it's steadily improving."

A social Congress

While Sean Parker may predict that social media will determine the outcome of the 2012 election, governance is another story entirely. Meaningful use of social media by Congress remains challenged by a number of factors, not least an online identity ecosystem that has not provided Congress with ideal means to identify constituents online. The reality remains that when it comes to which channels influence Congress, in-person visits and individual emails or phone calls are far more influential with congressional staffers than social media.

As with any set of tools, success shouldn't be measured solely by media reports or press releases but by the outcomes from their use. The hard work of bipartisan compromise between the White House and Congress, to the extent it occurs, might seem unlikely to be publicly visible in 140 characters or less.

"People think it's always an argument in Washington," said Lira in our interview. "Social media can change that. We're seeing a decentralization of audiences that is built around their interests rather than the interests of editors. Imagine when you start streaming every hearing and making information more digestible. All of a sudden, you get these niche audiences. They're not enough to sustain a network, but you'll get enough of an audience to sustain the topic. I believe we will have a more engaged citizenry as a result."

Lira is optimistic. "Technology enables our republic to function better. In ancient Greece, you could only sustain a democracy in the size of city. Transportation technology limited that scope. In the U.S., new technologies enabled global democracy. As we entered the age of mass communication, we lost mass participation. Now with the Internet, we can have people more engaged again."

There may be a 30-year cycle at play here. Lira suggested looking back to radio in the 1920s, television in the 1950s, and cable in the 1980s. "It hasn't changed much since; we're essentially using the same rulebook since the '80s. The changes made in those periods of modernization were unique."

Thirty years on from the introduction of cable news, will the Internet help reinvigorate the founders' vision of a nation of, by and with the people? "I do think that this is a transformational moment," said Lira. "It will be for the next couple of years. When you talk to people, both Republicans and Democrats, you sense we're on the cusp of some kind of change, where it's not just communicating about projects but making projects better. Hearings, legislative government and executive government will all be much more participatory a decade from now."

In that sweep of history, the "People's House" may prove to be a fulcrum of change. "If any place in government is going to do it, it's the House" said Lira. "It's our job to be close to the public in a way that no other part of government is. In the Federalist Papers, that's the role of the House. We have an obligation to lead the way in terms of incorporating technology into real processes. We're not replacing our system of representative government. We're augmenting it with what's now possible, like when the telegraph let people know what the votes were faster."

Strata 2012 — The 2012 Strata Conference, being held Feb. 28-March 1 in Santa Clara, Calif., will offer three full days of hands-on data training and information-rich sessions. Strata brings together the people, tools, and technologies you need to make data work.

Save 20% on registration with the code RADAR20


Visualization of the Week: A better U.S. migration map

November 18th, 2011

Jon Bruner's "American Migration" visualization, based on IRS data, demonstrates how "Americans are enormously mobile: 37.5 million people moved from one house to another last year, with 4.3 million of them moving between states." Bruner's interactive map lets you click on a specific county and see both the immigration and emigration data for that location — where folks move from and where they move to.

American Migration
Screenshot from the "American Migration" visualization (click for full interactive version).

As Flowing Data points out, this migration map is "much improved" over the map Bruner created with the same data last year: "The colors are more subtle and more meaningful, and you can turn off the lines so that it's easier to see highlighted counties when the selected county had a lot of traffic during a selected year."

On his own blog, Bruner lists what he sees as the improvements:

It's got five years of data instead of one; a brand-new layout; and some much-requested features, like a search tool and the ability to switch off the lines. But the upgrade that I'm most excited about is in the code: I built the map using nothing but open-source software, from Python and MySQL to handle the data right down to JavaScript to display the map. I've been steadily moving much of my data handling to Python and MySQL, but this is the first map I've made using JavaScript, and interactive JS maps are still rare elsewhere, too.

Found a great visualization? Tell us about it

This post is part of an ongoing series exploring visualizations. We're always looking for leads, so please drop a line if there's a visualization you think we should know about.


CAOS Theory Podcast 2011.11.11

November 11th, 2011

Topics for this podcast:

* Continuent extends MySQL replication to Oracle Database
* CFEngine updates server automation software
* Devops moving mainstream
* Neo Technology integrates with Spring
* 451 CAOS report from Hadoop World

iTunes or direct download (26:56, 4.6MB)



Oracle’s NoSQL

October 7th, 2011

Oracle's turn-about announcement of a NoSQL product wasn't really surprising. When Oracle spends time and effort putting down a technology, you can bet that it's secretly impressed, and trying to re-implement it in its back room. So Oracle's paper "Debunking the NoSQL Hype" should really have been read as a backhanded product announcement. (By the way, don't click that link; the paper appears to have been taken down. Surprise.)

I have to agree with DataStax and other developers in the NoSQL movement: Oracle's announcement is a validation, more than anything else. It's certainly a validation of NoSQL, and it's worth thinking about exactly what that means. It's long been clear that NoSQL isn't about any particular architecture. When databases as fundamentally different as MongoDB, Cassandra, and Neo4J can all be legitimately characterized as "NoSQL," it's clear that NoSQL isn't a "thing." We've become accustomed to talking about the NoSQL "movement," but what does that mean?

As Justin Sheehy, CTO of Basho Technologies, said, the NoSQL movement isn't about any particular architecture, but about architectural choice. For as long as I can remember, application developers have debated software architecture choices with gusto. There were many choices for the front end; many choices for middleware; and careers rose and fell based on those choices. Somewhere along the way, "Software Architect" even became a job title. But for the backend, for the past 20 years there has really been only one choice: a relational database that looks a lot like Oracle (or MySQL, if you'd prefer). And choosing between Oracle, MySQL, PostgreSQL, or some other relational database just isn't that big a choice.

Did we really believe that one size fits all for database problems? If we ever did, the last three years have made it clear that the model was broken. I've got nothing against SQL (well, actually, I do, but that's purely personal), and I'm willing to admit that relational databases solve many, maybe even most, of the database problems out there. But just as it's clear that the universe is a more complicated place than physicists thought it was in 1990, it's also clear that there are data problems that don't fit 20-year-old models. NoSQL doesn't use any particular model for storing data; it represents the ability to think about and choose your data architecture. It's important to see Oracle recognize this. The company's announcement isn't just a validation of key-value stores, but of the entire discussion of database architecture.

Of course, there's more to the announcement than NoSQL. Oracle is selling a big data appliance: an integrated package including Hadoop and R. The software is available standalone, though Oracle clearly hopes that the package will be running on its Exadata Database hardware (or equivalent), which is an impressive monster of a database machine (though I agree with Mike Driscoll that machines like these are on the wrong side of history). There are other bits and pieces to solve ETL and other integration problems. And it's fair to say that Oracle's announcement validates more than just NoSQL; it validates the "startup stack" or "data stack" that we've seen in many of the most exciting new businesses that we watch. Hadoop plus a non-relational database (often MongoDB, HBase, or Cassandra), with R as an analytics platform, is a powerful combination. If nothing else, Oracle has given more conservative (and well-funded) enterprises permission to make the architectural decisions that the startups have been making all along, and to work with data that goes beyond what traditional data warehouses and BI technologies allow. That's a good move, and it grows the pie for everyone.

I don't think many young companies will be tempted to invest millions in Oracle products. Some larger enterprises should, and will, question whether investing in Oracle products is wise when there are much less expensive solutions. And I am sure that Oracle will take its share of the well-funded enterprise business. It's a win all around.

Web 2.0 Summit, being held October 17-19 in San Francisco, will examine "The Data Frame" — focusing on the impact of data in today's networked economy.

Save $300 on registration with the code RADAR



Building data startups: Fast, big, and focused

August 9th, 2011

This is a written follow-up to a talk presented at a recent Strata online event.

A new breed of startup is emerging, built to take advantage of the rising tides of data across a variety of verticals and the maturing ecosystem of tools for its large-scale analysis.

These are data startups, and they are the sumo wrestlers on the startup stage. The weight of data is a source of their competitive advantage. But as with their sumo mentors, size alone is not enough. The most successful data startups must be fast (with data), big (with analytics), and focused (with services).

Setting the stage: The attack of the exponentials

This style of startup is arising today, rather than a decade ago, because of a confluence of forces that I call the Attack of the Exponentials. In short, over the past five decades, the cost of storage, CPU, and bandwidth has been exponentially dropping, while network access has exponentially increased. In 1980, a terabyte of disk storage cost $14 million. Today, it's at $30 and dropping. Classes of data that were previously economically unviable to store and mine, such as machine-generated log files, now represent prospects for profit.
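To make the "exponential" claim concrete, the two storage prices quoted above can be turned into an average annual rate of decline. The sketch below assumes a smooth exponential between the two data points and takes "today" to be 2011, the year of this post; both are simplifying assumptions, not figures from the talk.

```python
import math

# Figures quoted in the text; the 2011 end year is an assumption based on the post date.
cost_1980 = 14_000_000.0   # dollars per terabyte in 1980
cost_2011 = 30.0           # dollars per terabyte "today"
years = 2011 - 1980

# Model: cost(t) = cost_1980 * exp(-k * t); solve for the decay constant k.
k = math.log(cost_1980 / cost_2011) / years

annual_decline = 1.0 - math.exp(-k)   # fraction of the price lost each year
halving_time = math.log(2) / k        # years for the price to halve

print(f"average annual price decline: {annual_decline:.0%}")
print(f"price halves roughly every {halving_time:.1f} years")
```

Under these assumptions, the price of a terabyte fell by about a third every year, which is what moves a data set such as machine-generated logs from "too expensive to keep" to "cheap to mine" within a few years.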

Attack of the exponentials

At the same time, these technological forces are not symmetric: CPU and storage costs have fallen faster than those of network and disk IO. Thus data is heavy; it gravitates toward centers of storage and compute power in proportion to its mass. Migration to the cloud is the manifest destiny for big data, and the cloud is the launching pad for data startups.

Leveraging the big data stack


As the foundational layer in the big data stack, the cloud provides the scalable persistence and compute power needed to manufacture data products.

At the middle layer of the big data stack is analytics, where features are extracted from data, and fed into classification and prediction algorithms.

Finally, at the top of the stack are services and applications. This is the level at which consumers experience a data product, whether it be a music recommendation or a traffic route prediction.

Let's take each of the layers and discuss the competitive axes at each.

The emerging big data stack
The competitive axes and representative technologies on the Big Data stack are illustrated here. At the bottom tier of data, free tools are shown in red (MySQL, Postgres, Hadoop), and we see how their commercial adaptations (InfoBright, Greenplum, MapR) compete principally along the axis of speed, offering faster processing and query times. Several of these players are pushing up towards the second tier of the data stack, analytics. At this layer, the primary competitive axis is scale: few offerings can address terabyte-scale data sets, and those that do are typically proprietary. Finally, at the top layer of the big data stack lie the services that touch consumers and businesses. Here, focus within a specific sector, combined with depth that reaches downward into the analytics tier, is the defining competitive advantage.

Fast data

At the base of the big data stack, where data is stored, processed, and queried, the dominant axis of competition was once scale. But as cheaper commodity disks and Hadoop have effectively addressed scalable persistence and processing, the focus of competition has shifted toward speed. The demand for faster disks has led to an explosion of interest in solid-state disk firms, such as Fusion-io, which went public recently. And several startups, most notably MapR, are promising faster versions of Hadoop.

Fusion-io and MapR represent another trend at the data layer: commercial technologies that challenge open source or commodity offerings on an efficiency basis, namely watts or CPU cycles consumed. With energy costs driving between one-third and one-half of data center operating costs, these efficiencies have a direct financial impact.

Finally, just as many large-scale, NoSQL data stores are moving from disk to SSD, others have observed that many traditional, relational databases will soon be entirely in memory. This is particularly true for applications that require repeated, fast access to a full set of data, such as building models from customer-product matrices. This brings us to the second tier of the big data stack, analytics.

Big analytics

At the second tier of the big data stack, analytics is the brains to cloud computing's brawn. Here, however, speed is less of a challenge; given an addressable data set in memory, most statistical algorithms can yield results in seconds. The challenge is scaling these out to address large data sets, and rewriting algorithms to operate in an online, distributed manner across many machines.
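As one illustration of what "online, distributed" can mean in practice, the sketch below computes a mean and variance with Welford's one-pass update and a merge step for combining partial results from separate partitions. The class name, the partitions, and the values are all made up for the example; it is a minimal sketch, not a description of any particular vendor's implementation.

```python
from dataclasses import dataclass

@dataclass
class RunningStats:
    """Count, mean and sum of squared deviations (m2) for one partition."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        # Welford's online update: one value at a time, constant memory.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningStats") -> "RunningStats":
        # Combine two partial summaries without revisiting the raw data.
        n = self.n + other.n
        if n == 0:
            return RunningStats()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningStats(n, mean, m2)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 for an empty summary.
        return self.m2 / self.n if self.n else 0.0


def summarize(values):
    stats = RunningStats()
    for v in values:
        stats.update(v)
    return stats

# Two hypothetical partitions, as if held on different machines.
part_a = [2.0, 4.0, 4.0, 4.0]
part_b = [5.0, 5.0, 7.0, 9.0]
combined = summarize(part_a).merge(summarize(part_b))
print(combined.n, combined.mean, combined.variance)
```

Each machine only needs to send three numbers to be merged, which is what lets the computation scale with the number of partitions rather than the number of rows.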

Because data is heavy and algorithms are light, one key strategy is to push code deeper to where the data lives, to minimize network IO. This often requires a tight coupling between the data storage layer and the analytics, and algorithms often need to be rewritten as user-defined functions (UDFs) in a language compatible with the data layer. Greenplum, leveraging its Postgres roots, supports UDFs written in both Java and R. Following Google's BigTable, HBase is introducing coprocessors in its 0.92 release, which allow Java code to be associated with data tablets and so minimize data transfer over the network. Netezza pushes even further into hardware, embedding an array of functions into FPGAs that are physically co-located with the disks of its storage appliances.
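The difference between pulling data to the code and pushing code to the data can be sketched with a toy cluster. Everything here is hypothetical: the node names, the rows, and the functions are invented for illustration, and "items shipped" simply counts what would cross the network in each approach.

```python
# Hypothetical cluster: each node holds a partition of (customer, amount) rows.
partitions = {
    "node-1": [("alice", 20.0), ("bob", 5.0), ("alice", 7.5)],
    "node-2": [("bob", 12.0), ("carol", 3.0)],
    "node-3": [("alice", 1.0), ("carol", 9.0), ("carol", 2.0), ("bob", 4.0)],
}

def total_by_customer(rows):
    """The 'UDF': a small aggregation that can run wherever the rows live."""
    totals = {}
    for customer, amount in rows:
        totals[customer] = totals.get(customer, 0.0) + amount
    return totals

def pull_then_compute(parts):
    """Ship every row to a coordinator, then aggregate there."""
    shipped = [row for rows in parts.values() for row in rows]
    return total_by_customer(shipped), len(shipped)

def push_down(parts):
    """Run the function next to each partition; ship only partial totals."""
    result, shipped = {}, 0
    for rows in parts.values():
        partial = total_by_customer(rows)
        shipped += len(partial)
        for customer, amount in partial.items():
            result[customer] = result.get(customer, 0.0) + amount
    return result, shipped

pulled, rows_moved = pull_then_compute(partitions)
pushed, partials_moved = push_down(partitions)
print(rows_moved, partials_moved)
```

Both approaches give the same totals. On this tiny data set the saving is small, but the pushed-down version ships one entry per distinct customer per node, while the pulled version ships every row, so the gap widens as the raw data grows.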

The field of what's alternatively called business or predictive analytics is nascent, and while a range of enabling tools and platforms exists (such as R, SPSS, and SAS), most of the algorithms developed are proprietary and vertical-specific. As the ecosystem matures, one may expect to see the rise of firms selling analytical services, such as recommendation engines, that interoperate across data platforms. But in the near term, consultancies like Accenture and McKinsey are positioning themselves to provide big analytics via billable hours.

Outside of consulting, firms with analytical strengths push upward, surfacing focused products or services to achieve success.


Focused services

The top of the big data stack is where data products and services directly touch consumers and businesses. For data startups, these offerings more frequently take the form of a service, offered as an API rather than a bundle of bits.

BillGuard is a great example of a startup offering a focused data service. It monitors customers' credit card statements for dubious charges, and even leverages the collective behavior of users to improve its fraud predictions.

Several startups are working on algorithms that can crack the content relevance nut, including Flipboard and News.me. Klout delivers a pure data service that uses social media activity to measure online influence. My company, Metamarkets, crunches server logs to provide pricing analytics for publishers.

For data startups, data processes and algorithms define their competitive advantage. Poor predictions — whether of fraud, relevance, influence, or price — will sink a data startup, no matter how well-designed their web UI or mobile application.

Focused data services aren't limited to startups: LinkedIn's People You May Know and Foursquare's Explore feature enhance engagement with their companies' core products, but only when they correctly suggest people and places.

Democratizing big data

The axes of strategy in the big data stack show analytics to be squarely at the center. Data platform providers are pushing upwards into analytics to differentiate themselves, touting support for fast, distributed code execution close to the data. Traditional analytics players, such as SAS and SAP, are expanding their storage footprints and challenging the need for alternative data platforms as staging areas. Finally, data startups and many established firms are creating services whose success hinges directly on proprietary analytics algorithms.

The emergence of data startups highlights the democratizing consequences of a maturing big data stack. For the first time, companies can successfully build offerings without deep infrastructure know-how and focus at a higher level, developing analytics and services. By all indications, this is a democratic force that promises to unleash a wave of innovation in the coming decade.





What VMware’s Cloud Foundry announcement is about

April 13th, 2011

I chatted today about VMware's Cloud Foundry with Roger Bodamer, the EVP of products and technology at 10gen. 10gen's MongoDB is one of three back ends (along with MySQL and Redis) supported from the start by Cloud Foundry.

If I understand Cloud Foundry and VMware's declared "Open PaaS" strategy, it should fill a gap in services. Suppose you are a developer who wants to loosen the bonds between your programs and the hardware they run on, for the sake of flexibility, fast ramp-up, or cost savings. Your choices are:

  • An IaaS (Infrastructure as a Service) product, which hands you an emulation of bare metal where you run an appliance (which you may need to build up yourself) combining an operating system, application, and related services such as DNS, firewall, and a database.

    You can implement IaaS on your own hardware using a virtualization solution such as VMware's products, Azure, Eucalyptus, or RPM. Alternatively, you can rent space on a service such as Amazon's EC2 or Rackspace.

  • A PaaS (Platform as a Service) product, which operates at a much higher level. The vendor handles all the back-end services and just exposes an API to which you program.

By now, the popular APIs for IaaS have been satisfactorily emulated so that you can move your application fairly easily from one vendor to another. Some APIs, notably OpenStack, were designed explicitly to eliminate the friction of moving an app and increase the competition in the IaaS space.

Until now, the PaaS situation was much more closed. VMware claims to do for PaaS what Eucalyptus and OpenStack want to do for IaaS. VMware has a conventional cloud service called Cloud Foundry, but will offer the code under an open source license. RightScale has already announced that you can use it to run a Cloud Foundry application on EC2. And a large site could run Cloud Foundry on its own hardware, just as it runs VMware.

Cloud Foundry is aggressively open middleware, offering a flexible way to administer applications with a variety of options on the top and bottom. As mentioned already, you can interact with MongoDB, MySQL, or Redis as your storage. (However, you have to use the particular API offered by each back-end; there is no common Cloud Foundry interface that can be translated to the chosen back end.) You can use Spring, Rails, or Node.js as your programming environment.
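Since there is no common storage interface, an application that wants to stay portable across back ends would have to define its own. The sketch below shows one way that might look. The `KeyValueStore` interface and its methods are hypothetical and are not part of Cloud Foundry or of any back end's API; the in-memory class stands in for a real adapter that would wrap a vendor client.

```python
from abc import ABC, abstractmethod

class KeyValueStore(ABC):
    """Hypothetical application-level interface; not part of Cloud Foundry."""

    @abstractmethod
    def put(self, key: str, value: str) -> None: ...

    @abstractmethod
    def get(self, key: str, default=None): ...

class InMemoryStore(KeyValueStore):
    """Stand-in back end for tests; a real adapter would wrap a vendor client."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

def record_visit(store: KeyValueStore, user: str) -> int:
    """Application code depends only on the interface, not on the back end."""
    count = int(store.get(f"visits:{user}", "0")) + 1
    store.put(f"visits:{user}", str(count))
    return count

store = InMemoryStore()
record_visit(store, "ada")
visits = record_visit(store, "ada")
print(visits)
```

The cost of this approach is that the interface can only expose what every back end has in common, which is presumably one reason a platform might choose not to offer one.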

So open source Cloud Foundry may prove to be a step toward more openness in the cloud arena, as many people call for and I analyzed in a series of articles last year. VMware will, if the gamble pays off, gain more customers by hedging against lock-in and will sell its tools to those who host PaaS on their own servers. The success of the effort will depend on the robustness of the solution, ease of management, and the rate of adoption by programmers and sites.




Outliers and coexistence are the new normal for big data

April 1st, 2011

Letting data speak for itself through analysis of entire data sets is eclipsing modeling from subsets. In the past, all too often, what were disregarded as "outliers" on the far edges of a data model turned out to be the telltale signs of a micro-trend that became a major event. To enable this kind of advanced analytics and integrate it in real time with operational processes, companies and public sector organizations are evolving their enterprise architectures to incorporate new tools and approaches.

Whether you prefer "big," "very large," "extremely large," "extreme," "total," or another adjective for the "X" in the "X Data" umbrella term, what's important is accelerated growth in three dimensions: volume, complexity and speed.

Big data is not without its limitations. Many organizations need to revisit business processes, solve data silo challenges, and invest in visualization and collaboration tools to make big data understandable and actionable across an extended organization.

"Sampling is dead"

When huge data volumes can be processed and analyzed in their entirety at scale, "sampling is dead," says Abhishek Mehta, former Bank of America (BofA) managing director and Tresata co-founder, and speaker at last year's Hadoop World. Potential applications include risk default analysis of every loan in a bank's portfolio and analysis of granular data for targeted advertising.
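A quick calculation shows why sampling struggles with rare records. The sketch assumes the simplest sampling scheme, in which each record is kept independently with a fixed probability; the function name and the numbers are illustrative only.

```python
def prob_sample_misses_all(outliers: int, sample_fraction: float) -> float:
    """Chance that none of `outliers` records land in a random sample,
    assuming each record is kept independently with probability `sample_fraction`."""
    return (1.0 - sample_fraction) ** outliers

for k in (1, 20, 100):
    p = prob_sample_misses_all(k, 0.01)
    print(f"{k:>3} outlier records, 1% sample: missed entirely with probability {p:.3f}")
```

Under that assumption, a 1% sample is more likely than not to contain none of a group of 20 unusual records, which is exactly the kind of early signal that analysis of the full data set would keep.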

The BofA corporate investments group adopted a SAS high performance risk management solution together with IBM BladeCenter grid and XIV storage to power credit-risk modeling, scoring and loss forecasting. As explained in a recent call with the SAS high-performance computing team, this new enterprise risk management system reduced calculation times at BofA for forecasting the probability of loan defaults from 96 hours to four hours. In addition to speeding up loan processing and hedging decisions, Bank of America can aggregate bottom-up data from individual loans for perhaps a more accurate picture of total risk than what was possible previously by testing models on just subsets of data.

nPario holds an exclusive license from Yahoo for technology based on columnar storage. Within Yahoo's internal infrastructure, that technology handles over eight petabytes of data for advertising and promotion, per a February 2011 discussion with nPario President and CEO Bassel Y. Ojjeh. nPario has basically forked the code, so that Yahoo can continue its internal use while nPario goes to market with a commercial offering for external customers. The nPario technology enables analysis at the granular level, not just of aggregated or sampled data. In addition to supporting a range of other data sources, nPario offers full integration with Adobe Omniture, including APIs that can pull data from Omniture (although Omniture charges a fee for this download).

Electronic Arts uses nPario for an "insights suite" that details how gamers engage with advertising. The nPario-powered EA analytics suite tracks clicks, impressions, demographic profiles, social media buzz and other data across EA's online, console game, mobile and social channels. The result is a much more precise understanding of consumer intent, and a better ability to micro-target ads, than was previously possible either with sampled data or with data limited to just online or shrink-wrapped channels rather than the complete range of EA's customer engagement.

Multiple big data technologies coexist in many enterprise architectures

In many cases, organizations will use a mix-and-match combination of relational database management systems (RDBMS), Hadoop/MapReduce, R, columnar databases such as HP Vertica or ParAccel, or document-oriented databases. Also, adoption of complex event processing (CEP) and related real-time or near-real-time technologies is growing this year beyond the financial services industry and government, as organizations take action on web, IT, sensor and other streaming data.
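For readers unfamiliar with event processing, one of the simplest patterns such systems evaluate is a count over a sliding time window. The sketch below is a toy version with invented names and timestamps; real CEP engines offer far richer pattern languages, and nothing here reflects a specific product.

```python
from collections import deque

class SlidingWindowAlert:
    """Fires when at least `threshold` events arrive within `window_seconds`."""

    def __init__(self, window_seconds: float, threshold: int):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._times = deque()

    def observe(self, timestamp: float) -> bool:
        # Timestamps are assumed to arrive in non-decreasing order.
        self._times.append(timestamp)
        # Drop events that have fallen out of the window.
        while self._times and timestamp - self._times[0] > self.window_seconds:
            self._times.popleft()
        return len(self._times) >= self.threshold

alert = SlidingWindowAlert(window_seconds=10, threshold=3)
fired = [alert.observe(t) for t in (0, 4, 30, 33, 38, 60)]
print(fired)
```

The decision is made as each event arrives, using only the events still inside the window, rather than by querying stored data after the fact.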

At the same time that outliers are the new normal in data science, coexistence is quickly becoming the new normal for big data infrastructure and service architectures. For many enterprises and public sector organizations, the focus is "the right tool for the job" to manage structured, unstructured and semi-relational data from disparate sources. A few examples:

  • AOL Advertising integrated two data management systems: one optimized for high-throughput data analysis (the "analytics" system), the other for low-latency random access (the "transactional" system). After evaluating alternatives, AOL Advertising combined Cloudera Distribution for Apache Hadoop (CDH) with Membase (now Couchbase). This pairs Hadoop's capability for handling large, complex data volumes with Membase's sub-millisecond latency for making optimized decisions in real-time ad placement.
  • At LinkedIn, large-scale data computations over more than 100 billion relationships a day, along with low-latency site serving, are powered by a combination of Hadoop to process massive batch workloads, Project Voldemort as a NoSQL key/value storage engine, and the Azkaban open-source workflow system. Further, LinkedIn developed a real-time, persistent messaging system named Kafka for log aggregation and activity processing.
  • The Walt Disney Co. Technology Shared Services Group extended its existing data warehouse architecture with a Hadoop cluster to provide an integration mashup for diverse departmental data, most of which is stored separately by Disney's many business units and subsidiaries. With a Hadoop cluster that went into production for shared service internal business units last October, this data can now be analyzed for patterns across different but connected customer activities, such as attendance at a theme park, purchases from Disney stores, and viewership of Disney's cable television programming. (Disney case study summarized from PricewaterhouseCoopers, Technology Forecast, Big Data issue, 2010).
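The pairing these examples share, a throughput-oriented batch system feeding a low-latency serving store, can be sketched in a few lines. This is only an illustration of the pattern: the ad identifiers, the log, and the `ServingStore` class are invented, a plain dictionary stands in for the key-value store, and none of it describes how any of the companies above actually built their systems.

```python
from collections import defaultdict

def batch_job(click_log):
    """Throughput-oriented pass over the full log: click-through rate per ad."""
    shown, clicked = defaultdict(int), defaultdict(int)
    for ad_id, was_clicked in click_log:
        shown[ad_id] += 1
        clicked[ad_id] += int(was_clicked)
    return {ad: clicked[ad] / shown[ad] for ad in shown}

class ServingStore:
    """Stand-in for a low-latency key-value store, backed by a dict here."""

    def __init__(self):
        self._table = {}

    def bulk_load(self, table):
        # Swap in the freshly computed results from the batch side.
        self._table = dict(table)

    def lookup(self, key, default=0.0):
        return self._table.get(key, default)

def choose_ad(store, candidates):
    """Request-time decision: one lookup per candidate, no scan of the log."""
    return max(candidates, key=store.lookup)

log = [("ad-1", True), ("ad-1", False), ("ad-2", False),
       ("ad-2", False), ("ad-3", True), ("ad-3", True)]
store = ServingStore()
store.bulk_load(batch_job(log))
best = choose_ad(store, ["ad-1", "ad-2", "ad-3"])
print(best)
```

The heavy computation happens offline over all of the data, while the request path touches only precomputed values, which is why the two halves can be built on different technologies.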

Centralization and coexistence at eBay

Even companies whose enterprise architecture more closely aligns with the enterprise data warehouse (EDW) vision associated with Bill Inmon than the federated model popularized by Ralph Kimball are finding themselves migrating their architectures toward greater coexistence to empower business growth. eBay offers an instructive example.

"A data mart can't be cheap enough to justify its existence," says Oliver Ratzesberger, eBay's senior director of architecture and operations. eBay has migrated to a coexistence architecture featuring Teradata as the core EDW, a Teradata offshoot named Singularity for behavioral analysis and clickstream semi-relational data, and Hadoop for image processing and deep data mining. All three store multiple petabytes of data.

Named after Ray Kurzweil's thought-provoking book "The Singularity is Near," the Singularity system at eBay is running in production for managing and analyzing semi-relational data, using the same Teradata SQL user interfaces that are already widely understood and liked by many eBay staff. eBay's Hadoop instances still require separate management tools and, to date, still come with fewer capabilities for workload management than eBay receives with its Teradata architecture.

Using this tripartite architecture, on eBay's consumer online marketplace, there are no static pages. Every page is dynamic, and many if not yet all ads are individualized. These technical innovations at eBay are helping to empower eBay's corporate resurgence, as highlighted in the March 2011 Harvard Business Review "How eBay Developed a Culture of Experimentation" interview with eBay CEO John Donahoe.

Coexistence at Bank of America

Bank of America operates a Teradata data warehouse architecture with Hadoop, R and columnar extensions along with: IBM Cognos business intelligence, InfoSphere Foundation Tools and InfoSphere DataStage; Tableau reporting; SAP global ERP reporting system; and Cisco telepresence for internal collaboration; among other technologies and systems.

R-specialist Revolution Analytics cites a Bank of America reference. In it, Mike King, a quantitative analyst at Bank of America, describes how he uses R to write programs for capital adequacy modeling, decision systems design and predictive analytics:

R allows you to take otherwise overwhelmingly complex data and view it in such a way that, all of a sudden, the choice becomes more intuitive because you can picture what it looks like. Once you have that visual image of the data in your mind, it's easier to pick the most appropriate quantitative techniques.

While Revolution Analytics is sponsoring a SAS to R Challenge for SAS customers to consider converting to R, coexistence between enterprise-grade software such as SAS and emerging tools such as R is a more common outcome than a replacement or cutback in the number of current or future SAS licenses, as shown by Bank of America's recent investment, described above, in the SAS risk management offering.

For its part, SAS indicates that SAS/IML Studio (formerly known as SAS Stat Studio) provides one existing capability to interface with the R language. According to Radhika Kulkarni, vice president of advanced analytics at SAS, in a discussion about SAS-R integration on the SAS website: "We are busy working on an R interface that can be surfaced in the SAS server or via other SAS clients. In the future, users will be able to interface with R through the IML procedure."

To quote Bob Rodriguez, senior director of statistical development at SAS, from that website discussion: "R is a leading language for developing new statistical methods. Our new PhD developers learned R in their graduate programs and are quite versed in it." The SAS article added that: "Both R and SAS are here to stay, and finding ways to make them work better with each other is in the best interests of our customers."

Recent evolutions in big data vendors

As 10gen CEO and co-founder Dwight Merriman and new President Max Schireson described in a call March 8: "There have been periodic rebellions against the RDBMS." Intuit's small business division uses document-oriented MongoDB from 10gen for real-time tracking of website user engagement and user activities. Document-oriented CouchDB supporter CouchOne merged with key value store and memcached specialist Membase to form Couchbase; their customers include AOL and social gaming leader Zynga.

Customers had asked DataStax (previously named Riptano) for a roadmap for integrated Cassandra and Hadoop management, per an O'Reilly Strata conference discussion with DataStax CEO and co-founder Matt Pfeil and products VP Ben Werther. In March 2011, DataStax announced the Brisk integrated Hadoop, Hive and Cassandra platform, to support high-volume, high-velocity websites and complex event processing, among other applications that require real-time or near-real-time processing. According to DataStax VP of Products Ben Werther in a March 29 email: "Cassandra is at the core of Brisk and eliminates the need for HBase because it natively provides low-latency access and everything you'd get in HBase without the complexity."

Originating at Facebook and with commercial backing from DataStax, Cassandra is in use at Cisco, Facebook, Ooyala, Rackspace/Cloudkick, SimpleGeo, Twitter and other organizations that have large, active data sets. It's basically a BigTable data model running on an Amazon Dynamo-like infrastructure. DataStax's largest Cassandra production cluster has more than 700 nodes. Cloudkick, acquired by Rackspace, offers a good discussion of the selection process that led to its use of Cassandra: 4 months with Cassandra, a love story.

While EMC/Greenplum and Teradata/Aster Data started with PostgreSQL and moved forward from there, EnterpriseDB has continued to incorporate PostgreSQL updates. EnterpriseDB CEO Ed Boyajian and VP Karen Tegan Padir explained in a call last month that while much of the initial PostgreSQL work was to build databases for sophisticated users, EnterpriseDB has done more to improve manageability and ease of use, including a 1-click installer for PostgreSQL similar to the Red Hat installer for Linux. EnterpriseDB envisions becoming for PostgreSQL what Cloudera has become for Hadoop: an integrated solution provider aimed at commercial, enterprise and public-sector accounts.

MicroStrategy is one of Cloudera's key partners for visualization and collaboration, and Informatica is quickly becoming a strong partner for ETL. To speed up what can be slow transfers in ODBC, Cloudera is building an optimized version of Sqoop. Flume agents support CEP applications, but it's not a big use case yet for Hadoop, per a call in February with Dr. Amr Awadallah, co-founder and VP of engineering, and marketing VP John Kreisa.

The following are additional examples of big data integration and coexistence efforts based on phone and in-person discussions with vendor executives in February and March 2011:

  • Adobe acquired data management platform vendor Demdex to integrate with Omniture in the Adobe Online Marketing Suite. Demdex helps advertisers shift dollars and focus from buying content-driven placements to buying specific audiences.
  • Appistry extended its CloudIQ Storage with a Hadoop edition and partnership with Accenture for a Cloud MapReduce offering for private clouds. This joint offering runs MapReduce jobs on top of the Appistry CloudIQ Platform for behind-the-firewall corporate applications.
  • Together with its siblings Cassandra and Project Voldemort, Riak is an Amazon.com Dynamo-inspired database that Comcast, Mozilla and others use to prototype, test and deploy applications, with commercial support and services from Basho Technologies.
  • At CloudScale, CEO Bill McColl and his team offer a platform to help developers create applications designed for real-time distributed architectures.
  • Clustrix's clustered database system looks like a MySQL database "on the wire," but without MySQL code, to combine key-value stores with relational database functionality, with a focus on online transaction processing (OLTP) applications.
  • Concurrent supports an open source abstraction for MapReduce called Cascading that allows applications to integrate with Hadoop through a Java API.
  • Within an enterprise and extending to its SaaS or social media data, Coveo offers integrated search tools for finding information quickly. For example, a Coveo user can search Microsoft SharePoint files or pull up data from Salesforce.com, all from within her Outlook email client.
  • Germany-based Exasol added a bulk-loader and increased integration capabilities for SAP clients.
  • Based on Big Table and other Google technologies, Fusion Tables are a service for managing large collections of tabular data in the cloud. You can upload tables of up to 100MB and share them with collaborators, or make them public. You can apply filters and aggregation to your data, visualize it on maps and other charts, merge data from multiple tables, and export it to the web or csv files.
  • Yale's Daniel Abadi and several of his colleagues unveiled Hadapt to run large and ad-hoc SQL queries with high velocity on both structured and unstructured data in Hadoop, to commercialize a project that began in the Yale computer science department.
  • IBM Netezza has partnered with R specialist Revolution Analytics to add built-in R capabilities to the IBM Netezza TwinFin Data Warehouse Appliance. While Revolution Analytics has challenged SAS, they see more of a partner model with IBM Netezza and IBM SPSS. This may in part reflect the career of Revolution Analytics President and CEO Norman Nie; prior to his current role, he co-invented SPSS.
  • MapR targets speeding up Hadoop/MapReduce through a proprietary replacement for HDFS that can integrate with the rest of the Apache Hadoop ecosystem. (For a backgrounder on that ecosystem, refer to Meet the Big Data Equivalent of the LAMP Stack.)
  • MarkLogic offers a purpose-built database using an XML data model for unstructured information for Simon & Schuster, Pearson Education, Boeing, the U.S. Federal Aviation Administration and other customers.
  • Microsoft Dryad offers a programming model to write parallel and distributed programs to scale from a small cluster to a large data center.
  • Pentaho offers an open source BI suite integrating capabilities for ETL, reporting, OLAP analysis, dashboards and data mining.
  • With its SpringSource and Wavemaker acquisitions, VMware is offering and expanding a suite of tools for developers to program applications that take advantage of virtualized cloud delivery environments. VMware's cloud application strategy is to empower developers to run modern applications that share information with underlying infrastructure to maximize performance, quality of service and infrastructure utilization. This extends VMware's virtualization business farther up into the software development lifecycle and provides incremental revenue for VMware while VMware positions itself for desktop virtualization to take off.

Data in the cloud

Cloud computing and big data technologies overlap. As Judith Hurwitz at Hurwitz & Associates explained in a call on February 22: "Amazon has definitely blazed the trail as the pioneer for compute services." Amazon found they had extra capacity and started renting it out, but with little or no service level guarantees.

As Judith Hurwitz discussed, the data in the cloud market is starting to bifurcate. Private clouds are advancing the enterprise shared services model with workload management, self-provisioning and other automation of shared services. IBM, Unisys, Microsoft Azure, HP, NaviSite (Time Warner) and others are offering enterprise-grade services. While data in Amazon is pretty portable -- most services link with Amazon -- many APIs and tools are still specific to one environment, or reflect important dependencies; for example, Microsoft Azure basically assumes a .NET infrastructure.

At the 1000 Genomes Project, medical researchers are benefiting from a cloud architecture to access data for genomics research, including the ability to download a public dataset through Amazon Web Services. For medical researchers on limited budgets, using the cloud capacity for analytics can save investment dollars. However, Amazon pricing can be deceptive as CPU hours can add up to quite a lot of money over time. To speed data transfers from the cloud, the project participants are using Aspera and its fasp protocol.

The University of Washington, Monterey Bay Aquarium Research Institute and Microsoft have collaborated on Project Trident to provide a scientific workflow workbench for oceanography. Trident, implemented with Windows Workflow Foundation, .NET, Silverlight and other Microsoft technologies, allows scientists to explore and visualize oceanographic data in real-time. They can use Trident to compose, run and catalog oceanography experiments from any web browser.

Pervasive DataCloud adds a data services layer to Amazon Web Services for integration and transformation capabilities. An enterprise with multiple CRM systems can synchronize application data from Oracle/Siebel, Salesforce.com and Force.com partner applications within a Pervasive DataCloud2 process. They can then use the feeds from that DataCloud process to power executive dashboards or business analytics. Likewise, an enterprise with Salesforce.com data can use DataCloud2 to synch with an on-premise relational database, or synch data between Salesforce.com and Intuit QuickBooks accounting software.

Big data jobs

All of this activity is welcome news for software engineers and other technical staff whose jobs may have been affected by overseas outsourcing. The monthly Hadoop user group meetups at the Yahoo campus now feature hundreds of attendees and even some job offers: many big data mega vendors and startups are hiring. For example, while Yahoo ended its own distribution of Hadoop, it has some interesting work underway with its Cloud Data Platform and Services including job openings there.

Cloudera counts 85 employees and continues to hire. Cloudera's Hadoop training courses are consistently sold out, including big demand from public sector organizations; the venture capital arm of the CIA, In-Q-Tel, became a Cloudera investor last month.

Recognizing big data's limits

To temper enthusiasm just a bit, 2011 is also a good time for a reality check to put big data into perspective. To benefit from big data, many enterprises and public sector organizations need to revisit business processes, solve data silo challenges, invest in virtualization and collaboration tools to help make big data understandable and actionable across an extended organization, and encourage more staff to develop "T-shaped" skills that combine deep technical experience (the T's vertical line) and wide business skills (the T's horizontal line).

Also, big data applications such as risk management software will not by themselves prevent the next sub-prime mortgage meltdown or the previous generation's savings and loan industry crisis. Decision-makers at financial institutions will need to make the right risk decisions, and regulatory oversight such as the new Basel rules for minimum capital requirements may play an important role too.

For more on big data technology and business trends, including a longer discussion on big data limitations, take a look at my recently published Putting Big Data to Work: Opportunities for Enterprises report on GigaOM Pro.



What is this MySQL file used for?

February 17th, 2011

MySQL keeps many different files. Some contain real data; some contain meta data. Which ones are important? Which can you throw away?

This is my attempt to create a quick reference of all the files used by MySQL: what's in them, what you can do if they are missing, and what you can do with them.

When I was working for Dell doing Linux support my first words to a customer were “DO YOU HAVE A COMPLETE AND VERIFIED BACKUP?” Make one now before you think about doing anything I suggest here.

You should always try to manage your data through a MySQL client.  If things have gone very bad this may not be possible. MySQL may not start. If your file system gets corrupted you may have missing files. Sometimes people create other files in the MySQL directory (BAD).  This should help you understand what is safe to remove.

Before you try to work with one of these files make sure you have the file permissions set correctly.

This may not be a complete list of files used by MySQL.  It most certainly doesn’t describe everything each file is used for. If you know of ways to replace a missing file, or what happens to MySQL when a file is missing that I haven’t described here, please leave me a comment or email me.  I’ll update this document and give you a reference.

my.cnf

This file alters the default configuration settings. MySQL looks in the /etc directory for my.cnf. You should review this file to ensure you are looking in the right place for all other MySQL files.  MySQL WILL run without it.  If you have trouble getting MySQL to start, read the error log, then try moving or renaming this file.
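As a rough illustration of reviewing my.cnf to see where the other files live, here is a minimal Python sketch that pulls the datadir setting out of the [mysqld] section. The sample text, the function name find_datadir and the fallback to /var/lib/mysql are my own assumptions for illustration; a real my.cnf can contain things this simple parser does not handle.

```python
import configparser

# Assumed Linux default location, used when my.cnf does not set datadir.
DEFAULT_DATADIR = "/var/lib/mysql"


def find_datadir(cnf_text, default=DEFAULT_DATADIR):
    """Return the datadir set in the [mysqld] section, or the default."""
    # my.cnf may contain "!include" / "!includedir" lines, which
    # configparser does not understand, so drop them before parsing.
    lines = [ln for ln in cnf_text.splitlines()
             if not ln.lstrip().startswith("!")]
    parser = configparser.ConfigParser(
        allow_no_value=True,   # bare options such as skip-external-locking
        strict=False,          # tolerate repeated sections and options
        interpolation=None,    # treat "%" in values literally
    )
    parser.read_string("\n".join(lines))
    if parser.has_section("mysqld"):
        value = parser.get("mysqld", "datadir", fallback=None)
        if value:
            return value.strip()
    return default


# Hypothetical file contents, for illustration only.
sample = """
[mysqld]
datadir = /data/mysql
skip-external-locking

!includedir /etc/mysql/conf.d/
"""
print(find_datadir(sample))
```

This only reads text; it never changes the configuration. Files pulled in through include directives are ignored here, so treat the answer as a hint and confirm it against the running server when you can.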

mysql <directory>

On Linux servers the default location for MySQL files is /var/lib/mysql.  This directory is controlled by the “datadir” variable.

Do I need to say, deleting this directory deletes everything?

ibdata1

If you remove this file your InnoDB DATA IS GONE and MySQL will recreate an empty file.

If you are not using the innodb_file_per_table option (default), this file holds almost ALL of your data in InnoDB tables.  This file is all but useless without its corresponding  ‘.frm’ file for each table in the right database directory. If all you have is the .frm files you can recreate the structure of your tables. (See below.)

ibdata1 can get really big. The default size is 10MB. MySQL will automatically extend it as needed, in increments set by the innodb_autoextend_increment variable. If MySQL crashed, some of your InnoDB data may be in your transaction logs (ib_logfile*).

By design the InnoDB file does not shrink. The safest way to shrink this file is to take a complete backup, stop MySQL, remove the ib* files, start MySQL and restore all your data.  REALLY.

DatabaseName <directory in mysql>

Each MySQL database has a directory named after it.  Each directory holds the meta data for the database. Your MyISAM data is in this directory. InnoDB tables may be here if the innodb_file_per_table variable is used. By default InnoDB tables are in the ibdata1 file (see above).

If you delete the directory your data may be gone. MyISAM data WILL be lost.  InnoDB tables may still be in the ibdata1.  If so, you will need to recreate the meta data files to recover your data.

Creating a directory is almost equivalent to ‘create database’.  If a directory exists MySQL will show you have a database by that name. The create database command may also create a db.opt file.

<TableName>.frm

This file is key to both InnoDB and MyISAM tables. It is the meta data that points to your data. It contains the table column definitions.

If you remove this file MySQL will tell you your DATA doesn’t exist.  It does. InnoDB data is still in the table’s .ibd file or in the ibdata1 file, and MyISAM data is still in the .MYD file. You need to recreate the table to recreate this file. If you don’t know the exact structure of the table you’re out of luck.

For a MyISAM table: stop MySQL and move the .MYD and .MYI files to another directory.  (You might also make a backup copy.) Start MySQL and recreate the table. Stop MySQL, copy the .MYD and .MYI files back to the database directory and restart MySQL.

<TableName>.MYD

THIS IS YOUR MyISAM DATA. If this is all you have, and you know the data structure of the table, all is not lost.  (See .frm above.)  You may also need to recreate the .MYI index file.

<TableName>.MYI

This file contains the indexes for your table.  If it becomes corrupt or is deleted you can recreate it using the ‘REPAIR TABLE table_name USE_FRM;’ command.

<TableName>.ibd

THIS IS YOUR InnoDB DATA. If this is all you have, and you know the data structure of the table, all is not lost.  (See .frm above.) Unlike MyISAM tables, the indexes are contained in this file with your data.

MySQL doesn’t create this file unless you are using the innodb_file_per_table option.

<TableName>.CSV

THIS IS YOUR CSV DATA.  This file contains comma separated text data. These files do not have indexes.

<TableName>.CSM

This file contains meta data for CSV and archive tables.  I have not found what is stored here. I do know it tells MySQL if you are logging to the general logs.
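The per-table file types above fit naturally into a lookup table. The following is a small Python sketch, not an official tool: the FILE_ROLES wording is my own summary of the descriptions in this post, and the function name and sample file names are made up for illustration.

```python
import os

# Roles summarized from the descriptions in this post.
FILE_ROLES = {
    ".frm": "table definition (InnoDB and MyISAM)",
    ".myd": "MyISAM data",
    ".myi": "MyISAM indexes",
    ".ibd": "InnoDB data and indexes (innodb_file_per_table)",
    ".csv": "CSV table data",
    ".csm": "CSV table meta data",
    ".opt": "database options (db.opt)",
}


def describe(filename):
    """Return a short role for a file found in a database directory."""
    ext = os.path.splitext(filename)[1].lower()
    return FILE_ROLES.get(ext, "unknown: investigate before removing")


# Hypothetical directory listing, for illustration only.
for name in ["orders.frm", "orders.MYD", "orders.MYI", "notes.txt"]:
    print(f"{name:12} {describe(name)}")
```

Anything that comes back as unknown is a candidate for the "files other people created in the MySQL directory" problem mentioned earlier, but check before you delete.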

ib_logfile*

These files are InnoDB’s transaction logs. They record recent changes that may not yet have been written to the data files, and MySQL uses them to recover from a crash.

If you shut down InnoDB cleanly, you can remove them. MySQL will recreate them.

If you change the value of innodb_log_file_size, you will need to recreate these files by stopping MySQL cleanly and deleting them.

mysql-bin.*

These “Bin Log” files contain every change made to any database. Each transaction is assigned a position (MASTER_LOG_POS).  These files are not created by default. They are used for replication and point-in-time recovery.

You can stop the server and remove these files IF you remove the mysql-bin.index file as well.  MySQL creates a new bin log file each time it starts or the logs are flushed.  Deleting these files while the server is running will break replication.

mysql-bin.index

This file is used by MySQL to keep a list of the Bin Logs.  It is a simple text file like:

./mysql-bin.000001

./mysql-bin.000002

./mysql-bin.000003

If you remove this file, MySQL will recreate it with only the newest bin log name. If you need to remove old bin logs use the command “PURGE BINARY LOGS TO 'mysql-bin.######'” or “PURGE BINARY LOGS BEFORE 'yyyy-mm-dd'”.

You can control the number of bin logs using the expire_logs_days variable.
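To make the index file format concrete, here is a minimal Python sketch that reads an index like the one shown above and reports which logs come before a given one, that is, the logs a purge up to that file would be expected to remove. It only works on text and never touches real files; both function names are hypothetical.

```python
import os


def parse_binlog_index(index_text):
    """Return bin log file names in the order the index lists them."""
    return [os.path.basename(line.strip())
            for line in index_text.splitlines() if line.strip()]


def logs_before(names, target):
    """Return the names listed before `target`; the target itself is kept."""
    if target not in names:
        raise ValueError(f"{target} is not listed in the index")
    return names[:names.index(target)]


# Same three entries as the example index above.
index_text = "./mysql-bin.000001\n./mysql-bin.000002\n./mysql-bin.000003\n"
names = parse_binlog_index(index_text)
print(logs_before(names, "mysql-bin.000003"))
```

Running a check like this before a purge is a cheap way to confirm that the log a replication slave still needs is not in the list.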

mysqld.log

This is MySQL’s primary administration log. MySQL reports starts and stops as well as some warnings and errors in this file.  If MySQL crashes, mysqld_safe will restart it, and this log will report the restart.

You can delete this file if needed.

slow.log

The slow query log consists of all SQL statements that took more than long_query_time seconds to execute and (as of MySQL 5.1.21) required at least min_examined_row_limit rows to be examined.

You can delete this file if needed.
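The two conditions for a statement to land in the slow query log can be written as a one-line test. This Python sketch only restates the rule above; the function name is made up, and the default values (10 seconds and 0 rows) are the server defaults as I understand them, so check your own settings.

```python
def is_logged_as_slow(query_time, rows_examined,
                      long_query_time=10.0, min_examined_row_limit=0):
    """Apply the slow query log rule described above to one statement."""
    return (query_time > long_query_time
            and rows_examined >= min_examined_row_limit)


# A 12.5 second statement that examined 100 rows.
print(is_logged_as_slow(12.5, 100))
# The same statement once min_examined_row_limit is raised to 1000.
print(is_logged_as_slow(12.5, 100, min_examined_row_limit=1000))
# A fast statement is never logged, however many rows it examined.
print(is_logged_as_slow(2.0, 50000))
```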

db.opt

Database characteristics, like the CHARACTER SET clause, are stored in the db.opt file. You may have strange query results if this file is missing. MySQL will use its default. You can recreate this file by altering the database with the correct settings.

mysql.pid

The PID file holds the process ID number for the running server. MySQL creates this file, and scripts that start and stop MySQL use it to control MySQL.

MySQL will remove this file when it stops. You should not delete it if MySQL is running. If MySQL is NOT running and the file exists, MySQL may have crashed and you should delete this file.
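One way to tell whether a leftover PID file is stale is to ask the operating system if that process still exists. Here is a minimal Python sketch for Linux, assuming a PID file path of /var/lib/mysql/mysql.pid; the path and the function name are illustrative, not something MySQL guarantees.

```python
import os


def pid_file_status(path):
    """Classify a PID file as 'missing', 'unreadable', 'running' or 'stale'."""
    try:
        with open(path) as fh:
            pid = int(fh.read().strip())
    except FileNotFoundError:
        return "missing"
    except (OSError, ValueError):
        return "unreadable"
    if pid <= 0:
        return "unreadable"
    try:
        os.kill(pid, 0)  # signal 0 sends nothing; it only checks existence
    except ProcessLookupError:
        return "stale"
    except PermissionError:
        return "running"  # the process exists but belongs to another user
    return "running"


# Assumed location; check the pid-file setting on your own server.
print(pid_file_status("/var/lib/mysql/mysql.pid"))
```

A "running" answer only means some process has that ID. Process IDs get reused, so confirm it is really mysqld before you decide to leave the file alone.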

References:

FULL DISK
http://dev.mysql.com/doc/refman/5.1/en/full-disk.html

MySQL Database Backup .MYI and .MYD
http://www.aeonity.com/frost/mysql-database-backup-myi-myd

Recovering from Crashes
http://dev.mysql.com/tech-resources/articles/recovering-from-crashes.html
