<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>PlanetMysql.ru - information about the MySQL DBMS &#187; Solr</title>
	<atom:link href="http://planetmysql.ru/category/solr/feed/" rel="self" type="application/rss+xml" />
	<link>http://planetmysql.ru</link>
	<description>A blog about MySQL, the most popular DBMS</description>
	<lastBuildDate>Fri, 10 Feb 2012 22:53:14 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>The SMAQ stack for big data</title>
		<link>http://feedproxy.google.com/~r/oreilly/radar/atom/~3/KeHWOvewc4s/the-smaq-stack-for-big-data.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=the-smaq-stack-for-big-data</link>
		<comments>http://feedproxy.google.com/~r/oreilly/radar/atom/~3/KeHWOvewc4s/the-smaq-stack-for-big-data.html#comments</comments>
		<pubDate>Wed, 22 Sep 2010 13:00:00 +0000</pubDate>
		<dc:creator>Tim O'Reilly</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[smaq]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[strataconf]]></category>

		<guid isPermaLink="false"></guid>
		<description><![CDATA[
SMAQ report sections
&#8594; MapReduce
&#8594; Storage
&#8594; Query
&#8594; Conclusion

"Big data" is data that becomes large enough that it cannot be processed using conventional methods. Creators of web search engines were among the first to confront this problem. Today, social networks, mobile phones, sensors and science contribute to petabytes of data created daily.

To meet the challenge of processing such large data sets, Google created MapReduce. Google's work and Yahoo's creation of the Hadoop MapReduce implementation have spawned an ecosystem of big data processing tools.

As MapReduce has grown in popularity, a stack for big data systems has emerged, comprising layers of Storage, MapReduce and Query (SMAQ). SMAQ systems are typically open source, distributed, and run on commodity hardware.



In the same way the commodity LAMP stack of Linux, Apache, MySQL and PHP changed the landscape of web applications, SMAQ systems are bringing commodity big data processing to a broad audience. SMAQ systems underpin a new era of innovative data-driven products and services, in the same way that LAMP was a critical enabler for Web 2.0.

Though dominated by Hadoop-based architectures, SMAQ encompasses a variety of systems, including leading NoSQL databases. This paper describes the SMAQ stack and where today's big data tools fit into the picture.



MapReduce

Created at Google in response to the problem of creating web search indexes, the MapReduce framework is the powerhouse behind most of today's big data processing. The key innovation of MapReduce is the ability to take a query over a data set, divide it, and run it in parallel over many nodes. This distribution solves the issue of data too large to fit onto a single machine.



To understand how MapReduce works, look at the two phases suggested by its name. In the map phase, input data is processed, item by item, and transformed into an intermediate data set. In the reduce phase, these intermediate results are reduced to a summarized data set, which is the desired end result.



A simple example of MapReduce is the task of counting how many times each unique word occurs in a document. In the map phase, each word is identified and given the count of 1. In the reduce phase, the counts are added together for each word.

If that seems like an obscure way of doing a simple task, that's because it is. In order for MapReduce to do its job, the map and reduce phases must obey certain constraints that allow the work to be parallelized. Translating queries into one or more MapReduce steps is not an intuitive process. Higher-level abstractions have been developed to ease this, discussed under Query below.
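
The two phases can be tried out without a cluster. The following is a minimal single-process sketch in plain Java, not Hadoop code: the class name WordCountSketch and its methods map, reduce and run are made up for this illustration, and the grouping step in run stands in for the shuffle that a real framework performs between the phases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, single-process illustration of the map and reduce phases.
final class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one line of input.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: add up all the counts emitted for a single word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Driver: map every line, group the pairs by word, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {dog=1, end=1, fox=1, lazy=1, quick=1, the=3}
        System.out.println(run(List.of("the quick fox", "the lazy dog", "the end")));
    }
}
```

Because map looks at one line at a time and reduce at one word at a time, neither depends on the rest of the data set, and that independence is what allows the work to be spread across many machines.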

An important way in which MapReduce-based systems differ from conventional databases is that they process data in a batch-oriented fashion. Work must be queued for execution, and may take minutes or hours to process.

Using MapReduce to solve problems entails three distinct operations:


 Loading the data -- This operation is more properly called Extract, Transform, Load (ETL) in data warehousing terminology. Data must be extracted from its source, structured to make it ready for processing, and loaded into the storage layer for MapReduce to operate on it.

MapReduce -- This phase will retrieve data from storage, process it, and return the results to the storage.

Extracting the result -- Once processing is complete, for the result to be useful to humans, it must be retrieved from the storage and presented.


Many SMAQ systems have features designed to simplify the operation of each of these stages.

Hadoop MapReduce

Hadoop is the dominant open source MapReduce implementation. Funded by Yahoo, it emerged in 2006 and, according to its creator Doug Cutting, reached &#8220;web scale&#8221; capability in early 2008.

The Hadoop project is now hosted by Apache. It has grown into a large endeavor, with multiple subprojects that together comprise a full SMAQ stack.

Since it is implemented in Java, Hadoop's MapReduce implementation is accessible from the Java programming language. Creating MapReduce jobs involves writing functions to encapsulate the map and reduce stages of the computation. The data to be processed must be loaded into the Hadoop Distributed Filesystem.

Taking the word-count example from above, a suitable map function might look like the following (taken from the Hadoop MapReduce documentation). The key operations are the two calls inside the while loop, which emit each word with a count of 1.



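// Requires java.io.IOException, java.util.StringTokenizer,
// IntWritable, LongWritable and Text from org.apache.hadoop.io,
// and org.apache.hadoop.mapreduce.Mapper.
public static class Map
	extends Mapper&#60;LongWritable, Text, Text, IntWritable&#62; {

	// Reused for every emitted pair, so no new objects are allocated per word.
	private static final IntWritable one = new IntWritable(1);
	private Text word = new Text();

	// Called once per input record; with plain text input, value is one line.
	public void map(LongWritable key, Text value, Context context)
	     throws IOException, InterruptedException {

		String line = value.toString();
		StringTokenizer tokenizer = new StringTokenizer(line);
		while (tokenizer.hasMoreTokens()) {
			// Emit (word, 1) for each whitespace-separated token.
			word.set(tokenizer.nextToken());
			context.write(word, one);
		}
	}
}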


The corresponding reduce function sums the counts for each word.


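// Requires java.io.IOException, IntWritable and Text from
// org.apache.hadoop.io, and org.apache.hadoop.mapreduce.Reducer.
public static class Reduce
		extends Reducer&#60;Text, IntWritable, Text, IntWritable&#62; {

	// Called once per distinct word, with every count emitted for it.
	public void reduce(Text key, Iterable&#60;IntWritable&#62; values,
		Context context) throws IOException, InterruptedException {

		int sum = 0;
		for (IntWritable val : values) {
			sum += val.get();
		}
		// Emit (word, total).
		context.write(key, new IntWritable(sum));
	}
}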


The process of running a MapReduce job with Hadoop involves the following steps:


 Defining the MapReduce stages in a Java program

 Loading the data into the filesystem

Submitting the job for execution

 Retrieving the results from the filesystem


Run via the standalone Java API, Hadoop MapReduce jobs can be complex to create, and necessitate programmer involvement. A broad ecosystem has grown up around Hadoop to make the task of loading and processing data more straightforward.

 Other implementations

MapReduce has been implemented in a variety of other programming languages and systems, a list of which may be found in Wikipedia's entry for MapReduce. Notably, several NoSQL database systems have integrated MapReduce, and are described later in this paper.


Storage

MapReduce requires storage from which to fetch data and in which to store the results of the computation. The data expected by MapReduce is not relational data, as used by conventional databases. Instead, data is consumed in chunks, which are then divided among nodes and fed to the map phase as key-value pairs. This data does not require a schema, and may be unstructured. However, the data must be available in a distributed fashion, to serve each processing node.
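
To make the chunking concrete, here is a minimal sketch in plain Java, assuming newline-terminated text with single-byte characters. The names InputSplitSketch, toRecords and partition are hypothetical. The (offset, line) shape mirrors the LongWritable and Text input types of the Hadoop map function shown earlier, but real input splitting in Hadoop is more involved: it works on file blocks and has to handle lines that straddle block boundaries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: schema-free text becomes key-value records,
// which are then dealt out to processing nodes in contiguous chunks.
final class InputSplitSketch {

    // Turn raw text into (offset, line) pairs; empty lines are skipped.
    // Assumes '\n' line endings and one byte per character.
    static List<Map.Entry<Long, String>> toRecords(String text) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : text.split("\n", -1)) {
            if (!line.isEmpty()) {
                records.add(Map.entry(offset, line));
            }
            offset += line.length() + 1; // +1 accounts for the newline
        }
        return records;
    }

    // Divide the records among a fixed number of nodes.
    static List<List<Map.Entry<Long, String>>> partition(
            List<Map.Entry<Long, String>> records, int nodes) {
        if (nodes <= 0) {
            throw new IllegalArgumentException("nodes must be positive");
        }
        List<List<Map.Entry<Long, String>>> chunks = new ArrayList<>();
        int size = (records.size() + nodes - 1) / nodes; // ceiling division
        for (int i = 0; i < nodes; i++) {
            int from = Math.min(i * size, records.size());
            int to = Math.min(from + size, records.size());
            chunks.add(new ArrayList<>(records.subList(from, to)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Map.Entry<Long, String>> records = toRecords("a b\nc\nd e f\n");
        // Each inner list is the input one node would feed to its map phase.
        System.out.println(partition(records, 2));
    }
}
```

Nothing in the sketch knows what the lines mean, which is the sense in which the data needs no schema: structure is imposed only by the map function that later consumes each pair.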



The design and features of the storage layer are important not just because of the interface with MapReduce, but also because they affect the ease with which data can be loaded and the results of computation extracted and searched.

 Hadoop Distributed File System

The standard storage mechanism used by Hadoop is the Hadoop Distributed File System, HDFS. A core part of Hadoop, HDFS has the following features, as detailed in the HDFS design document.


 Fault tolerance -- Assuming that failure will happen allows HDFS to run on commodity hardware. 

Streaming data access -- HDFS is written with batch processing in mind, and emphasizes high throughput rather than random access to data.

Extreme scalability -- HDFS will scale to petabytes; such an installation is in production use at Facebook.

Portability -- HDFS is portable across operating systems.

Write once -- By assuming a file will remain unchanged after it is written, HDFS simplifies replication and speeds up data throughput.

Locality of computation -- Due to data volume, it is often much faster to move the program near to the data, and HDFS has features to facilitate this.


HDFS provides an interface similar to that of regular filesystems. Unlike a database, HDFS can only store and retrieve data, not index it. Simple random access to data is not possible. However, higher-level layers have been created to provide finer-grained functionality to Hadoop deployments, such as HBase.

 HBase, the Hadoop Database

One approach to making HDFS more usable is HBase. Modeled after Google's BigTable database, HBase is a column-oriented database designed to store massive amounts of data. It belongs to the NoSQL universe of databases, and is similar to Cassandra and Hypertable.



HBase uses HDFS as a storage system, and thus is capable of storing a large volume of data through fault-tolerant, distributed nodes. Like similar column-store databases, HBase provides REST and Thrift based API access.

Because it creates indexes, HBase offers fast, random access to its contents, though with simple queries. For complex operations, HBase acts as both a source and a sink (destination for computed data) for Hadoop MapReduce. HBase thus allows systems to interface with Hadoop as a database, rather than the lower level of HDFS.

 Hive

Data warehousing, or storing data in such a way as to make reporting and analysis easier, is an important application area for SMAQ systems. Developed originally at Facebook, Hive is a data warehouse framework built on top of Hadoop. Similar to HBase, Hive provides a table-based abstraction over HDFS and makes it easy to load structured data. In contrast to HBase, Hive can only run MapReduce jobs and is suited for batch data analysis. Hive provides a SQL-like query language to execute MapReduce jobs, described in the Query section below.

 Cassandra and Hypertable

Cassandra and Hypertable are both scalable column-store databases that follow the pattern of BigTable, similar to HBase.

An Apache project, Cassandra originated at Facebook and is now in production in many large-scale websites, including Twitter, Facebook, Reddit and Digg. Hypertable was created at Zvents and spun out as an open source project.



Both databases offer interfaces to the Hadoop API that allow them to act as a source and a sink for MapReduce. At a higher level, Cassandra offers integration with the Pig query language (see the Query section below), and Hypertable has been integrated with Hive.

 NoSQL database implementations of MapReduce

The storage solutions examined so far have all depended on Hadoop for MapReduce. Other NoSQL databases have built-in MapReduce features that allow computation to be parallelized over their data stores. In contrast with the multi-component SMAQ architectures of Hadoop-based systems, they offer a self-contained system comprising storage, MapReduce and query all in one.

Whereas Hadoop-based systems are most often used for batch-oriented analytical purposes, the usual function of NoSQL stores is to back live applications. The MapReduce functionality in these databases tends to be a secondary feature, augmenting other primary query mechanisms. Riak, for example, has a default timeout of 60 seconds on a MapReduce job, in contrast to the expectation of Hadoop that such a process may run for minutes or hours.

These prominent NoSQL databases contain MapReduce functionality:


CouchDB is a distributed database, offering semi-structured document-based storage. Its key features include strong replication support and the ability to make distributed updates. Queries in CouchDB are implemented using JavaScript to define the map and reduce phases of a MapReduce process.

MongoDB is very similar to CouchDB in nature, but with a stronger emphasis on performance, and less suitability for distributed updates, replication, and versioning. MongoDB MapReduce operations are specified using JavaScript.

Riak is another database similar to CouchDB and MongoDB, but places its emphasis on high availability. MapReduce operations in Riak may be specified with JavaScript or Erlang.


 Integration with SQL databases

In many applications, the primary source of data is in a relational database using platforms such as MySQL or Oracle. MapReduce is typically used with this data in two ways:


 Using relational data as a source (for example, a list of your friends in a social network).

 Re-injecting the results of a MapReduce operation into the database (for example, a list of product recommendations based on friends' interests).


It is therefore important to understand how MapReduce can interface with relational database systems. At the most basic level, delimited text files serve as an import and export format between relational databases and Hadoop systems, using a combination of SQL export commands and HDFS operations. More sophisticated tools do, however, exist.

The Sqoop tool is designed to import data from relational databases into Hadoop. It was developed by Cloudera, an enterprise-focused distributor of Hadoop platforms. Sqoop is database-agnostic, as it uses the Java JDBC database API. Tables can be imported either wholesale, or using queries to restrict the data import.

Sqoop also offers the ability to re-inject the results of MapReduce from HDFS back into a relational database. As HDFS is a filesystem, Sqoop expects delimited text files and transforms them into the SQL commands required to insert data into the database.
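
As an illustration of that transformation only, and not of how Sqoop itself is implemented, the sketch below turns one delimited line into an INSERT statement. The class DelimitedToSqlSketch, the table name, and the choice to quote every field as a string are assumptions made for this example; production code would normally bind values through JDBC prepared statements rather than build SQL text.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of mapping a delimited text record to SQL.
final class DelimitedToSqlSketch {

    // Build an INSERT statement from one delimited line.
    // Every field is emitted as a quoted string, with single quotes doubled.
    static String toInsert(String table, String line, String delimiter) {
        // Pattern.quote makes the delimiter literal, so "|" or "." work too.
        String[] fields = line.split(Pattern.quote(delimiter), -1);
        StringBuilder sql = new StringBuilder("INSERT INTO ")
                .append(table)
                .append(" VALUES (");
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sql.append(", ");
            }
            sql.append('\'').append(fields[i].replace("'", "''")).append('\'');
        }
        return sql.append(");").toString();
    }

    public static void main(String[] args) {
        // Prints INSERT INTO recommendations VALUES ('42', 'O''Reilly Radar');
        System.out.println(toInsert("recommendations", "42\tO'Reilly Radar", "\t"));
    }
}
```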

For Hadoop systems that utilize the Cascading API (see the Query section below) the cascading.jdbc and cascading-dbmigrate tools offer similar source and sink functionality. 

 Integration with streaming data sources

In addition to relational data sources, streaming data sources, such as web server log files or sensor output, constitute the most common source of input to big data systems. The Cloudera Flume project aims at providing convenient integration between Hadoop and streaming data sources. Flume aggregates data from both network and file sources, spread over a cluster of machines, and continuously pipes these into HDFS. The Scribe server, developed at Facebook, also offers similar functionality.

 Commercial SMAQ solutions

Several massively parallel processing (MPP) database products have MapReduce functionality built in. MPP databases have a distributed architecture with independent nodes that run in parallel. Their primary application is in data warehousing and analytics, and they are commonly accessed using SQL.


	
 The Greenplum database is based on the open source PostgreSQL DBMS, and runs on clusters of distributed hardware. The addition of MapReduce to the regular SQL interface enables fast, large-scale analytics over Greenplum databases, reducing query times by several orders of magnitude. Greenplum MapReduce permits the mixing of external data sources with the database storage. MapReduce operations can be expressed as functions in Perl or Python.

 Aster Data's nCluster data warehouse system also offers MapReduce functionality. MapReduce operations are invoked using Aster Data's SQL-MapReduce technology. SQL-MapReduce enables the intermingling of SQL queries with MapReduce jobs defined using code, which may be written in languages including C#, C++, Java, R or Python.


Other data warehousing solutions have opted to provide connectors with Hadoop, rather than integrating their own MapReduce functionality.



 Vertica, famously used by Farmville creator Zynga, is an MPP column-oriented database that offers a connector for Hadoop.

 Netezza is an established manufacturer of hardware data warehousing and analytical appliances. Recently acquired by IBM, Netezza is working with Hadoop distributor Cloudera to enhance the interoperation between their appliances and Hadoop. While it solves similar problems, Netezza falls outside of our SMAQ definition, lacking both the open source and commodity hardware aspects.


 Although a Hadoop-based system can be built entirely from open source components, integrating them takes some effort. Cloudera aims to make Hadoop enterprise-ready, and has created a unified Hadoop distribution in its Cloudera Distribution for Hadoop (CDH). CDH parallels the work of Red Hat or Ubuntu in creating Linux distributions. CDH comes in both a free edition and an Enterprise edition with additional proprietary components and support. CDH is an integrated and polished SMAQ environment, complete with user interfaces for operation and query. Cloudera's work has resulted in some significant contributions to the Hadoop open source ecosystem.



Query

Specifying MapReduce jobs in terms of defining distinct map and reduce functions in a programming language is unintuitive and inconvenient, as is evident from the Java code listings shown above. To mitigate this, SMAQ systems incorporate a higher-level query layer to simplify both the specification of the MapReduce operations and the retrieval of the result.




Many organizations using Hadoop will have already written in-house layers on top of the MapReduce API to make its operation more convenient. Several of these have emerged either as open source projects or commercial products.

Query layers typically offer features that handle not only the specification of the computation, but the loading and saving of data and the orchestration of the processing on the MapReduce cluster. Search technology is often used to implement the final step in presenting the computed result back to the user.

Pig

Developed by Yahoo, Pig provides a new high-level language, Pig Latin, for describing and running Hadoop MapReduce jobs. It is intended to make Hadoop accessible for developers familiar with data manipulation using SQL, and provides an interactive interface as well as a Java API. Pig integration is available for the Cassandra and HBase databases.

The word-count example in Pig is shown below, including both the data loading and storing phases (the notation $0 refers to the first field in a record).


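-- Load each line of the input file as a single-field record.
lines = LOAD 'input/sentences.txt' USING TextLoader();
-- Split every line into words, one word per record.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE($0));
-- Collect identical words together, then count each group.
grouped = GROUP words BY $0;
counts = FOREACH grouped GENERATE group, COUNT(words);
-- Sort by word and write the result out.
ordered = ORDER counts BY $0;
STORE ordered INTO 'output/wordCount' USING PigStorage();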


Pig is very expressive on its own, and developers can extend it further by writing custom steps as User Defined Functions (UDFs), in the same way that many SQL databases support the addition of custom functions. These UDFs are written in Java against the Pig API.

Though much simpler to understand and use than the MapReduce API, Pig suffers from the drawback of being yet another language to learn. It is SQL-like in some ways, but it is sufficiently different from SQL that it is difficult for users familiar with SQL to reuse their knowledge.

 Hive

As introduced above, Hive is an open source data warehousing solution built on top of Hadoop. Created by Facebook, it offers a query language very similar to SQL, as well as a web interface that offers simple query-building functionality. As such, it is suited for non-developer users, who may have some familiarity with SQL.

Hive's particular strength is in offering ad-hoc querying of data, in contrast to the compilation requirement of Pig and Cascading. Hive is a natural starting point for more full-featured business intelligence systems, which offer a user-friendly interface for non-technical users.

The Cloudera Distribution for Hadoop integrates Hive, and provides a higher-level user interface through the HUE project, enabling users to submit queries and monitor the execution of Hadoop jobs.

 Cascading, the API Approach

The Cascading project provides a wrapper around Hadoop's MapReduce API to make it more convenient to use from Java applications. It is an intentionally thin layer that makes the integration of MapReduce into a larger system more convenient. Cascading's features include:


 A data processing API that aids the simple definition of MapReduce jobs.

 An API that controls the execution of MapReduce jobs on a Hadoop cluster.

 Access via JVM-based scripting languages such as Jython, Groovy, or JRuby.

 Integration with data sources other than HDFS, including Amazon S3 and web servers.

 Validation mechanisms to enable the testing of MapReduce processes.


Cascading's key feature is that it lets developers assemble MapReduce operations as a flow, joining together a selection of &#8220;pipes&#8221;. It is well suited for integrating Hadoop into a larger system within an organization.

While Cascading itself doesn't provide a higher-level query language, a derivative open source project called Cascalog does just that. Using the Clojure JVM language, Cascalog implements a query language similar to that of Datalog. Though powerful and expressive, Cascalog is likely to remain a niche query language, as it offers neither the ready familiarity of Hive's SQL-like approach nor Pig's procedural expression. The listing below shows the word-count example in Cascalog: it is significantly terser, if less transparent.


	(defmapcatop split [sentence]
		(seq (.split sentence "\\s+")))

	(?&#60;- (stdout) [?word ?count] 
		(sentence ?s) (split ?s :&#62; ?word)
		(c/count ?count))


 Search with Solr

An important component of large-scale data deployments is retrieving and summarizing data. The addition of database layers such as HBase provides easier access to data, but does not provide sophisticated search capabilities.

To solve the search problem, the open source search and indexing platform Solr is often used alongside NoSQL database systems. Solr uses Lucene search technology to provide a self-contained search server product.

For example, consider a social network database where MapReduce is used to compute the influencing power of each person, according to some suitable metric. This ranking would then be reinjected into the database. Indexing the result with Solr then allows searches over the social network, such as finding the most influential people whose interest profiles mention mobile phones.

Originally developed at CNET and now an Apache project, Solr has evolved from being just a text search engine to supporting faceted navigation and results clustering. Additionally, Solr can manage large data volumes over distributed servers. This makes it an ideal solution for result retrieval over big data sets, and a useful component for constructing business intelligence dashboards. 

Conclusion

MapReduce, and Hadoop in particular, offers a powerful means of distributing computation among commodity servers. Combined with distributed storage and increasingly user-friendly query mechanisms, the resulting SMAQ architecture brings big data processing within reach of even small teams and solo developers.

It is now economical to conduct extensive investigation into data, or create data products that rely on complex computations. The resulting explosion in capability has forever altered the landscape of analytics and data warehousing systems, lowering the bar to entry and fostering a new generation of products, services and organizational attitudes - a trend explored more broadly in Mike Loukides' "What is Data Science?" report.

The emergence of Linux gave power to the innovative developer with merely a small Linux server at their desk: SMAQ has the same potential to streamline data centers, foster innovation at the edges of an organization, and enable new startups to cheaply create data-driven businesses.


]]></description>
			<content:encoded><![CDATA[<div>
<h3>SMAQ report sections</h3>
<p><a href="http://radar.oreilly.com/#map-reduce">&rarr; MapReduce</a></p>
<p><a href="http://radar.oreilly.com/#storage">&rarr; Storage</a></p>
<p><a href="http://radar.oreilly.com/#query">&rarr; Query</a></p>
<p><a href="http://radar.oreilly.com/#conclusion">&rarr; Conclusion</a></p>
</div>
<p>"Big data" is data that becomes large enough that it cannot be processed using conventional methods. Creators of web search engines were among the first to confront this problem. Today, social networks, mobile phones, sensors and science contribute to petabytes of data created daily.</p>

<p>To meet the challenge of processing such large data sets, Google created MapReduce. Google's work and Yahoo's creation of the Hadoop MapReduce implementation have spawned an ecosystem of big data processing tools.</p>

<p>As MapReduce has grown in popularity, a stack for big data systems has emerged, comprising layers of Storage, MapReduce and Query (SMAQ). SMAQ systems are typically open source, distributed, and run on commodity hardware.</p>

<p align="center"><img src="http://radar.oreilly.com/upload/2010/09/smaq-overview-m.png" alt="SMAQ Stack" /></p>

<p>In the same way the commodity <a href="http://en.wikipedia.org/wiki/LAMP_(software_bundle)">LAMP</a> stack of Linux, Apache, MySQL and PHP changed the landscape of web applications, SMAQ systems are bringing commodity big data processing to a broad audience. SMAQ systems underpin <a href="http://strataconf.com/">a new era of innovative data-driven products and services</a>, in the same way that LAMP was a critical enabler for <a href="http://oreilly.com/web2/archive/what-is-web-20.html">Web 2.0</a>.</p>

<p>Though dominated by Hadoop-based architectures, SMAQ encompasses a variety of systems, including leading NoSQL databases. This paper describes the SMAQ stack and where today's big data tools fit into the picture.</p>

<hr />
<h2>MapReduce</h2>

<p><a href="http://labs.google.com/papers/mapreduce.html">Created at Google</a> in response to the problem of creating web search indexes, the MapReduce framework is the powerhouse behind most of today's big data processing. The key innovation of MapReduce is the ability to take a query over a data set, divide it, and run it in parallel over many nodes. This distribution solves the issue of data too large to fit onto a single machine.</p>

<p align="center"><img src="http://radar.oreilly.com/upload/2010/09/smaq-mr-m.png" alt="SMAQ Stack - MapReduce" /></p>

<p>To understand how MapReduce works, look at the two phases suggested by its name. In the map phase, input data is processed, item by item, and transformed into an intermediate data set. In the reduce phase, these intermediate results are reduced to a summarized data set, which is the desired end result.</p>

<p align="center"><img src="http://radar.oreilly.com/upload/2010/09/mr-example-m.png" alt="MapReduce example" /></p>

<p>A <a href="http://en.wikipedia.org/wiki/MapReduce#Example">simple example</a> of MapReduce is the task of counting how many times each unique word occurs in a document. In the map phase, each word is identified and given the count of 1. In the reduce phase, the counts are added together for each word.</p>

<p>If that seems like an obscure way of doing a simple task, that's because it is. In order for MapReduce to do its job, the map and reduce phases must obey certain constraints that allow the work to be parallelized. Translating queries into one or more MapReduce steps is not an intuitive process. Higher-level abstractions have been developed to ease this, discussed under Query below.</p>

<p>An important way in which MapReduce-based systems differ from conventional databases is that they process data in a batch-oriented fashion. Work must be queued for execution, and may take minutes or hours to process.</p>

<p>Using MapReduce to solve problems entails three distinct operations:</p>

<ul>
<li><strong>Loading the data</strong> -- This operation is more properly called Extract, Transform, Load (ETL) in data warehousing terminology. Data must be extracted from its source, structured to make it ready for processing, and loaded into the storage layer for MapReduce to operate on it.</li>
<li><strong>MapReduce</strong> -- This phase will retrieve data from storage, process it, and return the results to the storage.</li>
<li><strong>Extracting the result</strong> -- Once processing is complete, for the result to be useful to humans, it must be retrieved from the storage and presented.</li>
</ul>

<p>Many SMAQ systems have features designed to simplify the operation of each of these stages.</p>

<h3>Hadoop MapReduce</h3>

<p>Hadoop is the dominant open source MapReduce implementation. Funded by Yahoo, it emerged in 2006 and, <a href="http://research.yahoo.com/files/cutting.pdf">according to its creator Doug Cutting</a>, reached &#8220;web scale&#8221; capability in early 2008.</p>

<p>The Hadoop project is now hosted by Apache. It has grown into a large endeavor, with <a href="http://hadoop.apache.org/#What+Is+Hadoop?">multiple subprojects</a> that together comprise a full SMAQ stack.</p>

<p>Since it is implemented in Java, Hadoop's <a href="http://hadoop.apache.org/mapreduce/docs/current/">MapReduce implementation</a> is accessible from the Java programming language. Creating MapReduce jobs involves writing functions to encapsulate the map and reduce stages of the computation. The data to be processed must be loaded into the Hadoop Distributed Filesystem.</p>

<p>Taking the word-count example from above, a suitable map function might look like the following (taken from the Hadoop MapReduce documentation, the key operations shown in bold).</p>

<pre>
// Requires java.io.IOException, java.util.StringTokenizer,
// IntWritable, LongWritable and Text from org.apache.hadoop.io,
// and org.apache.hadoop.mapreduce.Mapper.
public static class Map
	extends Mapper&lt;LongWritable, Text, Text, IntWritable&gt; {

	// Reused for every emitted pair, so no new objects are allocated per word.
	private static final IntWritable one = new IntWritable(1);
	private Text word = new Text();

	// Called once per input record; with plain text input, value is one line.
	public void map(LongWritable key, Text value, Context context)
	     throws IOException, InterruptedException {

		String line = value.toString();
		StringTokenizer tokenizer = new StringTokenizer(line);
		while (tokenizer.hasMoreTokens()) {
			// Emit (word, 1) for each whitespace-separated token.
			<strong>word.set(tokenizer.nextToken());
			context.write(word, one);</strong>
		}
	}
}
</pre>

<p>The corresponding reduce function sums the counts for each word.</p>

<pre>
// Requires java.io.IOException, IntWritable and Text from
// org.apache.hadoop.io, and org.apache.hadoop.mapreduce.Reducer.
public static class Reduce
		extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt; {

	// Called once per distinct word, with every count emitted for it.
	public void reduce(Text key, Iterable&lt;IntWritable&gt; values,
		Context context) throws IOException, InterruptedException {

		int sum = 0;
		<strong>for (IntWritable val : values) {
			sum += val.get();
		}
		context.write(key, new IntWritable(sum));</strong>
	}
}
</pre>

<p>The process of running a MapReduce job with Hadoop involves the following steps:</p>

<ul>
<li>Defining the MapReduce stages in a Java program</li>
<li>Loading the data into the filesystem</li>
<li>Submitting the job for execution</li>
<li>Retrieving the results from the filesystem</li>
</ul>
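<p>To make the data flow behind these steps concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases for the word-count example. It uses only the Java standard library and is not Hadoop code: the class <code>WordCountSketch</code> and its method names are hypothetical, and a real job would use the Mapper and Reducer classes shown above, with input and output held in HDFS.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-process illustration of map -> shuffle -> reduce.
// Nothing here is part of the Hadoop API.
class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every token in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(Map.entry(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle phase: group the emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Run all three phases over the input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("the quick fox", "the lazy dog")));
    }
}
```

<p>In a real cluster the map calls run in parallel on many nodes and the framework performs the shuffle; the sketch only shows which data each phase receives and produces.</p>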

<p>Run via the standalone Java API, Hadoop MapReduce jobs can be complex to create, and necessitate programmer involvement. A broad ecosystem has grown up around Hadoop to make the task of loading and processing data more straightforward.</p>

<h3> Other implementations</h3>

<p>MapReduce has been implemented in a variety of other programming languages and systems, a list of which may be found in <a href="http://en.wikipedia.org/wiki/MapReduce#Implementations">Wikipedia's entry for MapReduce</a>. Notably, several NoSQL database systems have integrated MapReduce, and are described later in this paper.</p>

<p></p><hr>
<h2>Storage</h2><p></p>

<p>MapReduce requires storage from which to fetch data and in which to store the results of the computation. The data expected by MapReduce is not relational data, as used by conventional databases. Instead, data is consumed in chunks, which are then divided among nodes and fed to the map phase as key-value pairs. This data does not require a schema, and may be unstructured. However, the data must be available in a distributed fashion, to serve each processing node.</p>
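<p>As a rough illustration of how unstructured text can become key-value pairs, the sketch below keys each line by its byte offset, which mirrors the <code>LongWritable</code> key and <code>Text</code> value accepted by the map function shown earlier, and then divides the records into chunks. The class <code>InputSplitSketch</code>, its methods and the fixed chunk size are assumptions made for illustration; Hadoop's own splitting works on filesystem blocks and is considerably more involved.</p>

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Hadoop's input-splitting code.
class InputSplitSketch {

    // Turn raw text into (byte offset, line) records.
    static Map<Long, String> toRecords(String text) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            records.put(offset, line);
            // Advance past the line and its newline terminator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    // Divide the records into chunks of at most chunkSize records,
    // one chunk per worker task.
    static List<Map<Long, String>> toChunks(Map<Long, String> records, int chunkSize) {
        List<Map<Long, String>> chunks = new ArrayList<>();
        Map<Long, String> current = new LinkedHashMap<>();
        for (Map.Entry<Long, String> record : records.entrySet()) {
            current.put(record.getKey(), record.getValue());
            if (current.size() == chunkSize) {
                chunks.add(current);
                current = new LinkedHashMap<>();
            }
        }
        if (!current.isEmpty()) {
            chunks.add(current);
        }
        return chunks;
    }

    public static void main(String[] args) {
        Map<Long, String> records = toRecords("big data\nis big\nindeed");
        System.out.println(toChunks(records, 2));
    }
}
```

<p>No schema is involved at any point: each record is just an offset and a line of text, and it is left to the map function to interpret the contents.</p>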

<p align="center"><img src="http://radar.oreilly.com/upload/2010/09/smaq-storage-m.png" alt="SMAQ Stack - Storage" /></p>

<p>The design and features of the storage layer are important not just because of the interface with MapReduce, but also because they affect the ease with which data can be loaded and the results of computation extracted and searched.</p>

<h3> Hadoop Distributed File System</h3>

<p>The standard storage mechanism used by Hadoop is the <a href="http://hadoop.apache.org/hdfs/">Hadoop Distributed File System</a>, HDFS. A core part of Hadoop, HDFS has the following features, as detailed in the <a href="http://hadoop.apache.org/hdfs/docs/current/hdfs_design.html">HDFS design document</a>.</p>

<ul>
<li><strong>Fault tolerance</strong> -- Assuming that failure will happen allows HDFS to run on commodity hardware.</li>
<li><strong>Streaming data access</strong> -- HDFS is written with batch processing in mind, and emphasizes high throughput rather than random access to data.</li>
<li><strong>Extreme scalability</strong> -- HDFS will scale to petabytes; such an installation is in production use at Facebook.</li>
<li><strong>Portability</strong> -- HDFS is portable across operating systems.</li>
<li><strong>Write once</strong> -- By assuming a file will remain unchanged after it is written, HDFS simplifies replication and speeds up data throughput.</li>
<li><strong>Locality of computation</strong> -- Due to data volume, it is often much faster to move the program near to the data, and HDFS has features to facilitate this.</li>
</ul>

<p>HDFS provides an interface similar to that of regular filesystems. Unlike a database, HDFS can only store and retrieve data, not index it. Simple random access to data is not possible. However, higher-level layers have been created to provide finer-grained functionality to Hadoop deployments, such as HBase.</p>

<h3> HBase, the Hadoop Database</h3>

<p>One approach to making HDFS more usable is HBase. Modeled after Google's <a href="http://labs.google.com/papers/bigtable.html">BigTable</a> database, <a href="http://hbase.apache.org/">HBase</a> is a column-oriented database designed to store massive amounts of data. It belongs to the NoSQL universe of databases, and is similar to Cassandra and Hypertable.</p>

<p align="center"><img src="http://radar.oreilly.com/upload/2010/09/storage-hbase-m.png" alt="HBase and MapReduce" /></p>

<p>HBase uses HDFS as a storage system, and thus is capable of storing a large volume of data through fault-tolerant, distributed nodes. Like similar column-store databases, HBase provides <a href="http://en.wikipedia.org/wiki/Representational_State_Transfer">REST</a>- and <a href="http://incubator.apache.org/thrift/">Thrift</a>-based API access.</p>

<p>Because it creates indexes, HBase offers fast, random access to its contents, though with simple queries. For complex operations, HBase acts as both a <em>source</em> and a <em>sink</em> (destination for computed data) for Hadoop MapReduce. HBase thus allows systems to interface with Hadoop as a database, rather than the lower level of HDFS.</p>

<h3> Hive</h3>

<p>Data warehousing, or storing data in such a way as to make reporting and analysis easier, is an important application area for SMAQ systems. Developed originally at Facebook, <a href="http://hadoop.apache.org/hive/">Hive</a> is a data warehouse framework built on top of Hadoop. Similar to HBase, Hive provides a table-based abstraction over HDFS and makes it easy to load structured data. In contrast to HBase, Hive can only run MapReduce jobs and is suited for batch data analysis. Hive provides a SQL-like query language to execute MapReduce jobs, described in the Query section below.</p>

<h3> Cassandra and Hypertable</h3>

<p><a href="http://cassandra.apache.org/">Cassandra</a> and <a href="http://hypertable.org/">Hypertable</a> are both scalable column-store databases that follow the pattern of BigTable, similar to HBase.</p>

<p>An Apache project, Cassandra originated at Facebook and is now in production in many large-scale websites, including Twitter, Facebook, Reddit and Digg. Hypertable was created at <a href="http://www.zvents.com/">Zvents</a> and spun out as an open source project.</p>

<p align="center"><img src="http://radar.oreilly.com/upload/2010/09/storage-cassandra-m.png" alt="Cassandra and MapReduce" /></p>

<p>Both databases offer interfaces to the Hadoop API that allow them to act as a source and a sink for MapReduce. At a higher level, Cassandra offers <a href="http://wiki.apache.org/cassandra/HadoopSupport">integration with the Pig query language</a> (see the Query section below), and Hypertable has been <a href="http://code.google.com/p/hypertable/wiki/HiveExtension">integrated with Hive</a>.</p>

<h3> NoSQL database implementations of MapReduce</h3>

<p>The storage solutions examined so far have all depended on Hadoop for MapReduce. Other NoSQL databases have built-in MapReduce features that allow computation to be parallelized over their data stores. In contrast with the multi-component SMAQ architectures of Hadoop-based systems, they offer a self-contained system comprising storage, MapReduce and query all in one.</p>

<p>Whereas Hadoop-based systems are most often used for batch-oriented analytical purposes, the usual function of NoSQL stores is to back live applications. The MapReduce functionality in these databases tends to be a secondary feature, augmenting other primary query mechanisms. Riak, for example, has a default timeout of 60 seconds on a MapReduce job, in contrast to the expectation of Hadoop that such a process may run for minutes or hours.</p>

<p>These prominent NoSQL databases contain MapReduce functionality:</p>

<ul>
<li><a href="http://couchdb.apache.org/">CouchDB</a> is a distributed database, offering semi-structured document-based storage. Its key features include strong replication support and the ability to make distributed updates. Queries in CouchDB are implemented using JavaScript to define the map and reduce phases of a MapReduce process.</li>
<li><a href="http://www.mongodb.org/">MongoDB</a> is very similar to CouchDB in nature, but with a stronger emphasis on performance, and less suitability for distributed updates, replication, and versioning. <a href="http://www.mongodb.org/display/DOCS/MapReduce">MongoDB MapReduce operations</a> are specified using JavaScript.</li>
<li><a href="https://wiki.basho.com/display/RIAK/Riak">Riak</a> is another database similar to CouchDB and MongoDB, but places its emphasis on high availability. <a href="https://wiki.basho.com/display/RIAK/MapReduce">MapReduce operations in Riak</a> may be specified with JavaScript or Erlang.</li>
</ul>

<h3> Integration with SQL databases</h3>

<p>In many applications, the primary source of data is in a relational database using platforms such as MySQL or Oracle. MapReduce is typically used with this data in two ways:</p>

<ul>
<li>Using relational data as a source (for example, a list of your friends in a social network).</li>
<li>Re-injecting the results of a MapReduce operation into the database (for example, a list of product recommendations based on friends' interests).</li>
</ul>

<p>It is therefore important to understand how MapReduce can interface with relational database systems. At the most basic level, delimited text files serve as an import and export format between relational databases and Hadoop systems, using a combination of SQL export commands and HDFS operations. More sophisticated tools do, however, exist.</p>

<p>The <a href="http://wiki.github.com/cloudera/sqoop/">Sqoop</a> tool is designed to import data from relational databases into Hadoop. It was developed by <a href="http://www.cloudera.com/">Cloudera</a>, an enterprise-focused distributor of Hadoop platforms. Sqoop is database-agnostic, as it uses the Java JDBC database API. Tables can be imported either wholesale, or using queries to restrict the data import.</p>

<p>Sqoop also offers the ability to re-inject the results of MapReduce from HDFS back into a relational database. As HDFS is a filesystem, Sqoop expects delimited text files and transforms them into the SQL commands required to insert data into the database.</p>
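<p>The transformation from delimited text to SQL can be sketched in a few lines. The example below is an illustration of the idea only and is not how Sqoop is implemented: the class <code>DelimitedExportSketch</code>, the method <code>toInserts</code> and the table name <code>word_counts</code> are all hypothetical, and it assumes tab-delimited lines whose fields map onto the table's columns in order.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; Sqoop's real export path is more involved.
class DelimitedExportSketch {

    // Build one INSERT statement per tab-delimited line.
    // Values are quoted as SQL strings, doubling any embedded single quotes.
    static List<String> toInserts(String table, List<String> lines) {
        List<String> statements = new ArrayList<>();
        for (String line : lines) {
            // A limit of -1 keeps trailing empty fields.
            String[] fields = line.split("\t", -1);
            List<String> quoted = new ArrayList<>();
            for (String field : fields) {
                quoted.add("'" + field.replace("'", "''") + "'");
            }
            statements.add("INSERT INTO " + table + " VALUES (" + String.join(", ", quoted) + ");");
        }
        return statements;
    }

    public static void main(String[] args) {
        // Example reducer output: word <TAB> count
        for (String sql : toInserts("word_counts", List.of("hadoop\t3", "o'reilly\t1"))) {
            System.out.println(sql);
        }
    }
}
```

<p>Building SQL by string concatenation keeps the sketch short; code that talks to a live database would more safely bind the values through JDBC prepared statements.</p>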

<p>For Hadoop systems that utilize the Cascading API (see the Query section below), the <a href="http://github.com/cwensel/cascading.jdbc/">cascading.jdbc</a> and <a href="http://github.com/backtype/cascading-dbmigrate">cascading-dbmigrate</a> tools offer similar source and sink functionality.</p>

<h3> Integration with streaming data sources</h3>

<p>In addition to relational data sources, streaming data sources, such as web server log files or sensor output, constitute the most common source of input to big data systems. The Cloudera <a href="http://github.com/cloudera/flume">Flume</a> project aims at providing convenient integration between Hadoop and streaming data sources. Flume <a href="http://archive.cloudera.com/cdh/3/flume-0.9.1+1/UserGuide.html">aggregates data</a> from both network and file sources, spread over a cluster of machines, and continuously pipes these into HDFS. The <a href="http://github.com/facebook/scribe">Scribe</a> server, developed at Facebook, also offers similar functionality.</p>

<h3> Commercial SMAQ solutions</h3>

<p>Several massively parallel processing (MPP) database products have MapReduce functionality built in. MPP databases have a distributed architecture with independent nodes that run in parallel. Their primary application is in <a href="http://en.wikipedia.org/wiki/Data_warehouse">data warehousing</a> and analytics, and they are commonly accessed using SQL.</p>

<ul>
<li>The <a href="http://www.greenplum.com/">Greenplum</a> database is based on the open source PostgreSQL DBMS, and runs on clusters of distributed hardware. The addition of <a href="http://www.greenplum.com/technology/mapreduce/">MapReduce</a> to the regular SQL interface enables fast, large-scale analytics over Greenplum databases, reducing query times by several orders of magnitude. Greenplum MapReduce permits the mixing of external data sources with the database storage. MapReduce operations can be expressed as functions in Perl or Python.</li>
<li>Aster Data's <a href="http://www.asterdata.com/product/index.php">nCluster</a> data warehouse system also offers MapReduce functionality. MapReduce operations are invoked using Aster Data's <a href="http://www.asterdata.com/resources/mapreduce.php">SQL-MapReduce</a> technology. SQL-MapReduce enables the intermingling of SQL queries with MapReduce jobs defined using code, which may be written in languages including C#, C++, Java, R or Python.</li>
</ul>

<p>Other data warehousing solutions have opted to provide connectors with Hadoop, rather than integrating their own MapReduce functionality.</p>

<ul>
<li><a href="http://www.vertica.com/">Vertica</a>, famously used by Farmville creator Zynga, is an MPP column-oriented database that offers a <a href="http://www.vertica.com/MapReduce">connector for Hadoop</a>.</li>
<li><a href="http://www.netezza.com/">Netezza</a> is an established manufacturer of hardware data warehousing and analytical appliances. Recently acquired by IBM, Netezza is <a href="http://www.netezza.com/releases/2010/release071510.htm">working with Hadoop distributor Cloudera</a> to enhance the interoperation between their appliances and Hadoop. While it solves similar problems, Netezza falls outside of our SMAQ definition, lacking both the open source and commodity hardware aspects.</li>
</ul>

<p>Although creating a Hadoop-based system can be done entirely with open source, it requires some effort to integrate such a system. <a href="http://www.cloudera.com/">Cloudera</a> aims to make Hadoop enterprise-ready, and has created a unified Hadoop distribution in its <a href="http://www.cloudera.com/hadoop/">Cloudera Distribution for Hadoop</a> (CDH). CDH parallels the work of Red Hat or Ubuntu in creating Linux distributions. CDH comes in both a free edition and an <a href="http://www.cloudera.com/products-services/enterprise/">Enterprise</a> edition with additional proprietary components and support. CDH is an integrated and polished SMAQ environment, complete with user interfaces for operation and query. Cloudera's work has resulted in some <a href="http://www.cloudera.com/company/open-source/">significant contributions to the Hadoop open source ecosystem</a>.</p>

<p></p>
<hr>
<h2>Query</h2><p></p>

<p>Specifying MapReduce jobs in terms of defining distinct map and reduce functions in a programming language is unintuitive and inconvenient, as is evident from the Java code listings shown above. To mitigate this, SMAQ systems incorporate a higher-level query layer to simplify both the specification of the MapReduce operations and the retrieval of the result.</p>

<p align="center"><img src="http://radar.oreilly.com/upload/2010/09/smaq-query-m.png" alt="SMAQ Stack - Query" />
</p>

<p>Many organizations using Hadoop will have already written in-house layers on top of the MapReduce API to make its operation more convenient. Several of these have emerged either as open source projects or commercial products.</p>

<p>Query layers typically offer features that handle not only the specification of the computation, but the loading and saving of data and the orchestration of the processing on the MapReduce cluster. Search technology is often used to implement the final step in presenting the computed result back to the user.</p>

<h3>Pig</h3>

<p>Developed by Yahoo, Pig provides a new high-level language, Pig Latin, for describing and running Hadoop MapReduce jobs. It is intended to make Hadoop accessible for developers familiar with data manipulation using SQL, and provides an interactive interface as well as a Java API. Pig integration is available for the Cassandra and HBase databases.</p>

<p>The word-count example in Pig is shown below, including both the data loading and storing phases (the notation <em>$0</em> refers to the first field in a record).</p>

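<pre>
-- Load each line of the input file as a single-field record
lines = LOAD 'input/sentences.txt' USING TextLoader();
-- Split lines into words, group identical words, and count each group
<strong>words = FOREACH lines GENERATE FLATTEN(TOKENIZE($0));
grouped = GROUP words BY $0;
counts = FOREACH grouped GENERATE group, COUNT(words);</strong>
-- Sort by word and write the result out
ordered = ORDER counts BY $0;
STORE ordered INTO 'output/wordCount' USING PigStorage();
</pre>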

<p>Pig is very expressive on its own, and developers can also write custom steps as <a href="http://hadoop.apache.org/pig/docs/r0.7.0/udf.html">User Defined Functions (UDFs)</a>, in the same way that many SQL databases support the addition of custom functions. These UDFs are written in Java against the Pig API.</p>

<p>Though much simpler to understand and use than the MapReduce API, Pig suffers from the drawback of being yet another language to learn. It is SQL-like in some ways, but it is sufficiently different from SQL that it is difficult for users familiar with SQL to reuse their knowledge.</p>

<h3> Hive</h3>

<p>As introduced above, <a href="http://hadoop.apache.org/hive/">Hive</a> is an open source data warehousing solution built on top of Hadoop. Created by Facebook, it offers a query language very similar to SQL, as well as a web interface that offers simple query-building functionality. As such, it is suited for non-developer users, who may have some familiarity with SQL.</p>

<p>Hive's particular strength is in offering ad-hoc querying of data, in contrast to the compilation requirement of Pig and Cascading. Hive is a natural starting point for more full-featured business intelligence systems, which offer a user-friendly interface for non-technical users.</p>

<p>The Cloudera Distribution for Hadoop integrates Hive, and provides a higher-level user interface through the <a href="http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-hue/">HUE</a> project, enabling users to submit queries and monitor the execution of Hadoop jobs.</p>

<h3> Cascading, the API Approach</h3>

<p>The <a href="http://www.cascading.org/">Cascading</a> project provides a wrapper around Hadoop's MapReduce API to make it more convenient to use from Java applications. It is an intentionally thin layer that makes the integration of MapReduce into a larger system more convenient. Cascading's features include:</p>

<ul>
<li>A data processing API that aids the simple definition of MapReduce jobs.</li>
<li>An API that controls the execution of MapReduce jobs on a Hadoop cluster.</li>
<li>Access via JVM-based scripting languages such as Jython, Groovy, or JRuby.</li>
<li>Integration with data sources other than HDFS, including Amazon S3 and web servers.</li>
<li>Validation mechanisms to enable the testing of MapReduce processes.</li>
</ul>

<p>Cascading's key feature is that it lets developers assemble MapReduce operations as a flow, <a href="http://www.cascading.org/1.1/userguide/html/ch03s02.html">joining together a selection of &#8220;pipes&#8221;</a>. It is well suited for integrating Hadoop into a larger system within an organization.</p>
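<p>The idea of assembling a computation from joined stages can be illustrated with plain Java function composition. The sketch below does not use the Cascading API, and every name in it (<code>PipeFlowSketch</code>, <code>TOKENIZE</code>, <code>COUNT</code>, <code>FLOW</code>) is hypothetical; it only shows how small, independently defined stages can be connected end to end into a single flow.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical illustration of assembling a flow from small stages.
// None of these names come from the Cascading API.
class PipeFlowSketch {

    // Stage 1: split every line into lower-cased words.
    static final Function<List<String>, List<String>> TOKENIZE = lines -> {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    words.add(word.toLowerCase());
                }
            }
        }
        return words;
    };

    // Stage 2: group identical words and count each group.
    static final Function<List<String>, Map<String, Integer>> COUNT = words -> {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    };

    // The flow is the stages joined end to end.
    static final Function<List<String>, Map<String, Integer>> FLOW = TOKENIZE.andThen(COUNT);

    public static void main(String[] args) {
        System.out.println(FLOW.apply(Arrays.asList("Pipes join pipes", "join stages")));
    }
}
```

<p>Because each stage is a separate value, stages can be tested on their own and recombined into different flows, which is the property that makes this style convenient when embedding data processing in a larger application.</p>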

<p>While Cascading itself doesn't provide a higher-level query language, a derivative open source project called <a href="http://github.com/nathanmarz/cascalog">Cascalog</a> does just that. Using the <a href="http://clojure.org/">Clojure</a> JVM language, Cascalog implements a query language similar to that of <a href="http://en.wikipedia.org/wiki/Datalog">Datalog</a>. Though <a href="http://nathanmarz.com/blog/introducing-cascalog-a-clojure-based-query-language-for-hado.html">powerful and expressive</a>, Cascalog is likely to remain a niche query language, as it offers neither the ready familiarity of Hive's SQL-like approach nor Pig's procedural expression. The listing below shows the word-count example in Cascalog: it is significantly terser, if less transparent.</p>

<pre>
	(defmapcatop split [sentence]
		(seq (.split sentence "\\s+")))

	(?&lt;- (stdout) [?word ?count]
		(sentence ?s) (split ?s :&gt; ?word)
		<strong>(c/count ?count)</strong>)
</pre>

<h3> Search with Solr</h3>

<p>An important component of large-scale data deployments is retrieving and summarizing data. The addition of database layers such as HBase provides easier access to data, but does not provide sophisticated search capabilities.</p>

<p>To solve the search problem, the open source search and indexing platform <a href="http://lucene.apache.org/solr/">Solr</a> is often used alongside NoSQL database systems. Solr uses <a href="http://lucene.apache.org/">Lucene</a> search technology to provide a self-contained search server product.</p>

<p>For example, consider a social network database where MapReduce is used to compute the influencing power of each person, according to some suitable metric. This ranking would then be re-injected into the database. Indexing the result with Solr then allows searches over the social network, such as finding the most influential people whose interest profiles mention mobile phones.</p>
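<p>The kind of lookup described here can be sketched with a tiny in-memory inverted index. This is an illustration of the concept only: it does not use Solr or Lucene, the class <code>InfluenceSearchSketch</code> and its methods are hypothetical, and the influence scores are made-up values standing in for the output of a MapReduce job.</p>

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory illustration; Solr and Lucene are far more capable.
class InfluenceSearchSketch {

    // term -> names of people whose interest profile contains that term
    private final Map<String, Set<String>> index = new HashMap<>();
    // name -> influence score, as computed elsewhere (for example by MapReduce)
    private final Map<String, Double> influence = new HashMap<>();

    // Index one person's profile text and record their precomputed score.
    void add(String name, String interests, double score) {
        influence.put(name, score);
        for (String term : interests.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(name);
            }
        }
    }

    // Return the people matching a term, most influential first.
    List<String> search(String term) {
        List<String> hits = new ArrayList<>(index.getOrDefault(term.toLowerCase(), Set.of()));
        hits.sort(Comparator.comparing((String name) -> influence.get(name)).reversed());
        return hits;
    }

    public static void main(String[] args) {
        InfluenceSearchSketch sketch = new InfluenceSearchSketch();
        sketch.add("alice", "mobile phones, cycling", 0.4);
        sketch.add("bob", "cooking", 0.9);
        sketch.add("carol", "Mobile gaming", 0.7);
        System.out.println(sketch.search("mobile"));
    }
}
```

<p>The index answers the text-matching part of the question, while the precomputed score supplies the ranking, which is the division of labor between the search layer and MapReduce that the example describes.</p>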

<p>Originally developed at CNET and now an Apache project, Solr has evolved from being just a text search engine to supporting faceted navigation and results clustering. Additionally, Solr can manage large data volumes over distributed servers. This makes it an ideal solution for result retrieval over big data sets, and a useful component for constructing business intelligence dashboards. </p>

<p></p><hr><h2>Conclusion</h2><p></p>

<p>MapReduce, and Hadoop in particular, offers a powerful means of distributing computation among commodity servers. Combined with distributed storage and increasingly user-friendly query mechanisms, the resulting SMAQ architecture brings big data processing within reach of even small teams and solo developers.</p>

<p>It is now economic to conduct extensive investigation into data, or create data products that rely on complex computations. The resulting explosion in capability has forever altered the landscape of analytics and data warehousing systems, lowering the bar to entry and fostering a new generation of products, services and organizational attitudes - a trend explored more broadly in Mike Loukides' "<a href="http://radar.oreilly.com/2010/06/what-is-data-science.html">What is Data Science?</a>" report.</p>

<p>The emergence of Linux gave power to the innovative developer with merely a small Linux server at their desk: SMAQ has the same potential to streamline data centers, foster innovation at the edges of an organization, and enable new startups to cheaply create data-driven businesses.</p>

<p><strong>Related:</strong></p>
<ul>
<li><a href="http://radar.oreilly.com/2010/06/what-is-data-science.html">What is data science?</a></li>
<li><a href="http://cdn.oreilly.com/radar/2010/06/What_is_Data_Science.pdf">PDF Edition: What is data science?</a></li>
<li><a href="http://answers.oreilly.com/topic/1571-a-data-science-cheat-sheet/">A data science cheat sheet</a></li>
</ul>

]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2010/09/22/the-smaq-stack-for-big-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Disrupting IT with Open Source &amp; Cloud</title>
		<link>http://www.theopenforce.com/2010/06/disrupting-it.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=disrupting-it-with-open-source-cloud</link>
		<comments>http://www.theopenforce.com/2010/06/disrupting-it.html#comments</comments>
		<pubDate>Wed, 02 Jun 2010 04:46:05 +0000</pubDate>
		<dc:creator>Zack Urlocker</dc:creator>
				<category><![CDATA[Apache]]></category>
		<category><![CDATA[business]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[disruption]]></category>
		<category><![CDATA[eurocon]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[prague]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false"></guid>
		<description><![CDATA[A couple of weeks ago I gave a presentation at the Apache Lucene Eurocon in Prague. It was a good conference focused on Lucene/Solr open source search technology and sponsored by Lucid Imagination.  

I&#39;ve posted the bulk of the presentation below.  (I omitted a couple of slides that were MySQL specific.) Even though it was a technical conference, I got positive feedback from the attendees and organizers that the information was useful in helping folks think about where to focus their efforts.  


The slides have been posted to Box.net and are shown using their new &#34;embedded preview&#34; feature which is pretty cool. You can also use the short URL www.tinyurl.com/box-disr. Thanks to the folks at Lucid Imagination as well as those who gave input and feedback on the presentation.




Conference: Apache Lucene Eurocon, Agenda, Training
Lucid Imagination: Main site, Blog, Training, Services






]]></description>
			<content:encoded><![CDATA[<p>A couple of weeks ago I gave a presentation at the <a href="http://lucene-eurocon.org/" >Apache Lucene Eurocon</a> in Prague. It was a good conference focused on Lucene/Solr open source search technology and sponsored by <a href="http://www.lucidimagination.com" >Lucid Imagination</a>.  </p>

<p>I&#039;ve posted the bulk of the presentation below.  (I omitted a couple of slides that were MySQL specific.) Even though it was a technical conference, I got positive feedback from the attendees and organizers that the information was useful in helping folks think about where to focus their efforts.  </p>

<p></p>
<p>The slides have been posted to <a href="http://www.box.net/shared/us1k4bj5at" >Box.net</a> and are shown using their new &quot;<a href="http://blog.box.net/2010/06/01/embed-box-net-files-anywhere-on-the-web/" >embedded preview</a>&quot; feature which is pretty cool. You can also use the short URL <a href="http://www.tinyurl.com/box-disr" >www.tinyurl.com/box-disr</a>.</p><p>Thanks to the folks at Lucid Imagination as well as those who gave input and feedback on the presentation.</p>

<p></p>

<ul>
<li><strong>Conference: </strong><a href="http://lucene-eurocon.org/">Apache Lucene Eurocon</a>, <a href="http://lucene-eurocon.org/agenda.html" >Agenda</a>, <a href="http://lucene-eurocon.org/training.html" >Training</a></li>
<li><strong>Lucid Imagination: </strong><a href="http://www.lucidimagination.com/" >Main site</a>, <a href="http://www.lucidimagination.com/blog/" >Blog</a>, <a href="http://www.lucidimagination.com/solutions/services/training" >Training</a>, <a href="http://www.lucidimagination.com/solutions/services" >Services</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2010/06/02/disrupting-it-with-open-source-cloud/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Comparison Between Solr And Sphinx Search Servers (Solr Vs Sphinx – Fight!)</title>
		<link>http://beerpla.net/2009/09/03/comparison-between-solr-and-sphinx-search-servers-solr-vs-sphinx-fight/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=comparison-between-solr-and-sphinx-search-servers-solr-vs-sphinx-%25e2%2580%2593-fight</link>
		<comments>http://beerpla.net/2009/09/03/comparison-between-solr-and-sphinx-search-servers-solr-vs-sphinx-fight/#comments</comments>
		<pubDate>Thu, 03 Sep 2009 17:00:00 +0000</pubDate>
		<dc:creator>Artem Russakovskii</dc:creator>
				<category><![CDATA[backend]]></category>
		<category><![CDATA[compare]]></category>
		<category><![CDATA[comparison]]></category>
		<category><![CDATA[Databases]]></category>
		<category><![CDATA[ENGINE]]></category>
		<category><![CDATA[enterprise]]></category>
		<category><![CDATA[fulltext]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[Search]]></category>
		<category><![CDATA[server]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[sphinx]]></category>

		<guid isPermaLink="false">http://beerpla.net/2009/09/03/detailed-comparison-between-solr-and-sphinx/</guid>
		<description><![CDATA[
		
		
		
		In the past few weeks I&#39;ve been implementing advanced search at Plaxo, working quite closely with Solr enterprise search server. Today, I saw this relatively detailed comparison between Solr and its main competitor Sphinx (full credit goes to StackOverflow user mausch who had been using Solr for the past 2 years). For those still confused, Solr and Sphinx are similar to MySQL FULLTEXT search, or for those even more confused, think Google (yeah, this is a bit of a stretch, I know).
Similarities

Both Solr and Sphinx satisfy all of your requirements. They&#39;re fast and designed to index and search large bodies of data efficiently. 
Both have a long list of high-traffic sites using them (Solr, Sphinx) 
Both offer commercial support. (Solr, Sphinx) 
Both offer client API bindings for several platforms/languages (Sphinx, Solr) 
Both can be distributed to increase speed and capacity (Sphinx, Solr) 

Here are some differences

Solr, being an Apache project, is obviously Apache2-licensed. Sphinx is GPLv2. This means that if you ever need to embed or extend (not just &#34;use&#34;) Sphinx in a commercial application, you&#39;ll have to buy a commercial license. 
Solr is easily embeddable in Java applications. 
Solr is built on top of Lucene, which is a proven technology over 7 years old with a huge user base (this is only a small part). Whenever Lucene gets a new feature or speedup, Solr gets it too. Many of the devs committing to Solr are also Lucene committers. 
Sphinx integrates more tightly with RDBMSs, especially MySQL. 
Solr can be integrated with Hadoop to build distributed applications. 
Solr can be integrated with Nutch to quickly build a fully-fledged web search engine with crawler. 
Solr can index proprietary formats like Microsoft Word, PDF, etc. Sphinx can&#39;t. 
Solr comes with a spell-checker out of the box. 
Solr comes with facet support out of the box. Faceting in Sphinx takes more work. 
Sphinx doesn&#39;t allow partial index updates for field data. 
In Sphinx, all document ids must be unique unsigned non-zero integer numbers. Solr doesn&#39;t even require a unique key for many operations, and unique keys can be either integers or strings. 
Solr supports field collapsing to avoid duplicating similar results. Sphinx doesn&#39;t seem to provide any feature like this. 

Related questions

http://stackoverflow.com/questions/1284083/choosing-a-stand-alone-full-text-search-server-sphinx-or-solr
http://stackoverflow.com/questions/1132284/full-text-searching-with-rails
http://stackoverflow.com/questions/737275/pros-cons-of-full-text-search-engine-lucene-sphinx-postgresql-full-text-searc

Conclusion
In my experience, Solr is very fast on the query side. It is also very powerful. Indexing, on the other hand, is very CPU- and memory-intensive, an unfortunate side effect of having such a feature-rich, fast application. Nevertheless, I highly recommend Solr.
As a disclaimer, I have not had much experience with Sphinx and, again, all credit for this comparison goes to mausch.
]]></description>
			<content:encoded><![CDATA[<p>In the past few weeks I&#039;ve been implementing advanced search at <a href="http://www.plaxo.com" rel="nofollow">Plaxo</a>, working quite closely with the <a href="http://lucene.apache.org/solr/" rel="nofollow">Solr</a> enterprise search server. Today, I saw this relatively detailed comparison between Solr and its main competitor <a href="http://www.sphinxsearch.com/" rel="nofollow">Sphinx</a> (full credit goes to StackOverflow user <a href="http://stackoverflow.com/users/21239/mausch" rel="nofollow">mausch</a>, who has been using Solr for the past 2 years). For those still confused, Solr and Sphinx are similar to MySQL FULLTEXT search or, for those even more confused, think Google (yeah, this is a bit of a stretch, I know).</p>
<h2>Similarities</h2>
<ul>
<li>Both Solr and Sphinx are fast and designed to index and search large bodies of data efficiently. </li>
<li>Both have a long list of high-traffic sites using them (<a href="http://wiki.apache.org/solr/PublicServers" rel="nofollow">Solr</a>, <a href="http://www.sphinxsearch.com/powered.html" rel="nofollow">Sphinx</a>) </li>
<li>Both offer commercial support. (<a href="http://www.lucidimagination.com/" rel="nofollow">Solr</a>, <a href="http://www.sphinxsearch.com/consulting.html" rel="nofollow">Sphinx</a>) </li>
<li>Both offer client API bindings for several platforms/languages (<a href="http://www.sphinxsearch.com/contribs.html" rel="nofollow">Sphinx</a>, <a href="http://wiki.apache.org/solr/#head-ab1768efa59b26cbd30f1acd03b633f1d110ed47" rel="nofollow">Solr</a>) </li>
<li>Both can be distributed to increase speed and capacity (<a href="http://www.sphinxsearch.com/docs/current.html#distributed" rel="nofollow">Sphinx</a>, <a href="http://wiki.apache.org/solr/DistributedSearch" rel="nofollow">Solr</a>) </li>
</ul>
<h2>Differences</h2>
<ul>
<li>Solr, being an Apache project, is Apache2-licensed. <a href="http://www.sphinxsearch.com/licensing.html" rel="nofollow">Sphinx is GPLv2</a>. This means that if you ever need to embed or extend (not just &quot;use&quot;) Sphinx in a commercial application, you&#039;ll have to buy a commercial license. </li>
<li>Solr is <a href="http://wiki.apache.org/solr/Solrj#head-02003c15f194db1a691f8b9bb909145a60ccf498" rel="nofollow">easily embeddable</a> in Java applications. </li>
<li>Solr is built on top of Lucene, which is a proven technology over <a href="http://svn.apache.org/viewvc/lucene/java/tags/LUCENE_1_0_1/" rel="nofollow">7 years old</a> with a <a href="http://wiki.apache.org/lucene-java/PoweredBy" rel="nofollow">huge user base</a> (this is only a small part). Whenever Lucene gets a new feature or speedup, Solr gets it too. Many of the devs committing to Solr are also Lucene committers. </li>
<li>Sphinx integrates more tightly with RDBMSs, especially MySQL. </li>
<li>Solr can be <a href="http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data" rel="nofollow">integrated with Hadoop to build distributed applications</a>. </li>
<li>Solr can be <a href="http://stackoverflow.com/questions/211411/using-nutch-crawler-with-solr" rel="nofollow">integrated with Nutch to quickly build a fully-fledged web search engine with crawler</a>. </li>
<li>Solr can <a href="http://wiki.apache.org/solr/ExtractingRequestHandler" rel="nofollow">index proprietary formats like Microsoft Word, PDF, etc</a>. Sphinx <a href="http://stackoverflow.com/questions/1207995/indexing-word-documents-and-pdfs-with-sphinx" rel="nofollow">can&#039;t</a>. </li>
<li>Solr comes with a <a href="http://wiki.apache.org/solr/SpellCheckComponent" rel="nofollow">spell-checker out of the box</a>. </li>
<li>Solr comes with <a href="http://wiki.apache.org/solr/SolrFacetingOverview" rel="nofollow">facet support out of the box</a>. Faceting in Sphinx <a href="http://api-meal.eu/memo/128-faceted-search-with-sphinx-and-php/" rel="nofollow">takes more work</a>. </li>
<li><a href="http://stackoverflow.com/questions/737275/pros-cons-of-full-text-search-engine-lucene-sphinx-postgresql-full-text-searc/737931#737931" rel="nofollow">Sphinx doesn&#039;t allow partial index updates for field data</a>. </li>
<li>In Sphinx, <a href="http://www.sphinxsearch.com/docs/current.html#data-restrictions" rel="nofollow">all document ids must be unique unsigned non-zero integer numbers</a>. Solr <a href="http://wiki.apache.org/solr/UniqueKey" rel="nofollow">doesn&#039;t even require a unique key for many operations</a>, and unique keys can be either integers or strings. </li>
<li>Solr supports <a href="http://wiki.apache.org/solr/FieldCollapsing">field collapsing</a> to avoid duplicating similar results. Sphinx doesn&#039;t seem to provide any feature like this. </li>
</ul>
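<p>To make the out-of-the-box faceting and spell-checking points concrete, here is a minimal Python sketch that builds a Solr query URL asking for both. The host, port, path, and the field names <code>category</code> and <code>author</code> are assumptions for illustration and do not come from the original comparison. The spell-checker will only respond if the SpellCheckComponent is configured on the request handler.</p>

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; host, port and path are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_search_url(text, facet_fields):
    """Build a Solr select URL that requests facet counts and spelling suggestions."""
    params = [
        ("q", text),             # the user's query string
        ("wt", "json"),          # ask for a JSON response
        ("facet", "true"),       # turn faceting on
        ("spellcheck", "true"),  # ask the spell-check component for suggestions
    ]
    # One facet.field parameter per field to count on.
    params.extend(("facet.field", name) for name in facet_fields)
    return SOLR_SELECT_URL + "?" + urlencode(params)


# Illustrative field names; a real schema will differ.
url = build_search_url("full text serch", ["category", "author"])
print(url)
```

<p>The sketch only assembles the request. Sending it and reading the facet counts and suggestions out of the response depends on how the Solr schema and handlers are set up.</p>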
<h2>Related questions</h2>
<ul>
<li><a href="http://stackoverflow.com/questions/1284083/choosing-a-stand-alone-full-text-search-server-sphinx-or-solr" title="http://stackoverflow.com/questions/1284083/choosing-a-stand-alone-full-text-search-server-sphinx-or-solr" rel="nofollow">http://stackoverflow.com/questions/1284083/choosing-a-stand-alone-full-text-search-server-sphinx-or-solr</a></li>
<li><a href="http://stackoverflow.com/questions/1132284/full-text-searching-with-rails" rel="nofollow">http://stackoverflow.com/questions/1132284/full-text-searching-with-rails</a></li>
<li><a href="http://stackoverflow.com/questions/737275/pros-cons-of-full-text-search-engine-lucene-sphinx-postgresql-full-text-searc" rel="nofollow">http://stackoverflow.com/questions/737275/pros-cons-of-full-text-search-engine-lucene-sphinx-postgresql-full-text-searc</a></li>
</ul>
<h2>Conclusion</h2>
<p>In my experience, Solr is very fast on the query side. It is also very powerful. Indexing, on the other hand, is very CPU- and memory-intensive, an unfortunate side effect of having such a feature-rich, fast application. Nevertheless, I highly recommend Solr.</p>
<p>As a disclaimer, I have not had much experience with Sphinx and, again, all credit for this comparison goes to <a href="http://stackoverflow.com/users/21239/mausch" rel="nofollow">mausch</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://planetmysql.ru/2009/09/03/comparison-between-solr-and-sphinx-search-servers-solr-vs-sphinx-%e2%80%93-fight/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

