Archive for the ‘data’ Category

Energy data in action

Октябрь 26th, 2010

One of the Department of Energy's flagship open government initiatives, Open Energy Information (OpenEI), was on display at the recent National Science Festival. Ryan Ness, an engineer at the National Renewable Energy Laboratory, talks about how OpenEI provides access to data, models and tools in the following clip:

OpenEI is "open" in any number of important ways, including open source and open linked data, as Debbie Brodt-Giles, the OpenEI project leader, notes in an Amazon case study (OpenEI is hosted on Amazon's Web Services platform). "Key platform software for OpenEI includes Apache, Semantic MediaWiki, MySQL, and OpenLink Virtuoso," she wrote. "Customization to meet the specific needs of OpenEI has been performed primarily through PHP. Common deployment and operations for OpenEI have been automated using various AWS command-line tools."

As Ness pointed out in the interview, a recent mashathon pulled all of that open data into a mashup that shows residents the cost of energy, usage rates and potential savings around the country. The mashup also provides access to incentive information. For instance, the incentive programs in the District of Columbia are all available with a click. That addresses a concern Micah Sifry articulated: people need a way to take personal action when they are presented with information.

Chris Davis, a Ph.D. researcher at Delft University of Technology, took an in-depth look at the applications and insights that can be mined from Department of Energy data. "One of the points I wish to get across in this exercise, is that it's not just an analysis of renewable energy trends, but rather it's an analysis of U.S. Government data, grabbed live from a wiki that has been opened up to public contributions," he wrote. "This is truly revolutionary, although perhaps under-appreciated or unnoticed by many."



Related:





PlanetMySQL Voting: Vote UP / Vote DOWN

Four short links: 21 October 2010

Октябрь 21st, 2010

  1. Using MysQL as NoSQL -- 750,000+ qps on a commodity MySQL/InnoDB 5.1 server from remote web clients.
  2. Making an SLR Camera from Scratch -- amazing piece of hardware devotion. (via hackaday.com)
  3. Mac App Store Guidelines -- Apple announce an app store for the Macintosh, similar to its app store for iPhones and iPads. "Mac App" no longer means generic "program", it has a new and specific meaning, a program that must be installed through the App store and which has limited functionality (only one can run at a time, it's full-screen, etc.). The list of guidelines for what kinds of programs you can't sell through the App Store is interesting. Many have good reasons to be, but It creates a store inside itself for selling or distributing other software (i.e., an audio plug-in store in an audio app) is pure greed. Some are afeared that the next step is to make the App store the only way to install apps on a Mac, a move that would drive me away. It would be a sad day for Mac-lovers if Microsoft were to be the more open solution than Apple. cf the Owner's Manifesto.
  4. Privacy Aspects of Data Mining -- CFP for an IEEE workshop in December. (via jschneider on Twitter)


PlanetMySQL Voting: Vote UP / Vote DOWN

LCA Miniconf Call for Papers: Data Storage: Databases, Filesystems, Cloud Storage, SQL and NoSQL

Сентябрь 29th, 2010

This miniconf aims to cover many of the current methods of data storage and retrieval and attempt to bring order to the universe. We’re aiming to cover what various systems do, what the latest developments are and what you should use for various applications.

We aim for talks from developers of and developers using the software in question.

Aiming for some combination of: PostgreSQL, Drizzle, MySQL, XFS, ext[34], Swift (open source cloud storage, part of OpenStack), memcached, TokyoCabinet, TDB/CTDB, CouchDB, MongoDB, Cassandra, HBase….. and more!

Call for Papers open NOW (Until 22nd October).


PlanetMySQL Voting: Vote UP / Vote DOWN

The SMAQ stack for big data

Сентябрь 22nd, 2010

"Big data" is data that becomes large enough that it cannot be processed using conventional methods. Creators of web search engines were among the first to confront this problem. Today, social networks, mobile phones, sensors and science contribute to petabytes of data created daily.

To meet the challenge of processing such large data sets, Google created MapReduce. Google's work and Yahoo's creation of the Hadoop MapReduce implementation has spawned an ecosystem of big data processing tools.

As MapReduce has grown in popularity, a stack for big data systems has emerged, comprising layers of Storage, MapReduce and Query (SMAQ). SMAQ systems are typically open source, distributed, and run on commodity hardware.

SMAQ Stack

In the same way the commodity LAMP stack of Linux, Apache, MySQL and PHP changed the landscape of web applications, SMAQ systems are bringing commodity big data processing to a broad audience. SMAQ systems underpin a new era of innovative data-driven products and services, in the same way that LAMP was a critical enabler for Web 2.0.

Though dominated by Hadoop-based architectures, SMAQ encompasses a variety of systems, including leading NoSQL databases. This paper describes the SMAQ stack and where today's big data tools fit into the picture.


MapReduce

Created at Google in response to the problem of creating web search indexes, the MapReduce framework is the powerhouse behind most of today's big data processing. The key innovation of MapReduce is the ability to take a query over a data set, divide it, and run it in parallel over many nodes. This distribution solves the issue of data too large to fit onto a single machine.

SMAQ Stack - MapReduce

To understand how MapReduce works, look at the two phases suggested by its name. In the map phase, input data is processed, item by item, and transformed into an intermediate data set. In the reduce phase, these intermediate results are reduced to a summarized data set, which is the desired end result.

MapReduce example

A simple example of MapReduce is the task of counting the number of unique words in a document. In the map phase, each word is identified and given the count of 1. In the reduce phase, the counts are added together for each word.

If that seems like an obscure way of doing a simple task, that's because it is. In order for MapReduce to do its job, the map and reduce phases must obey certain constraints that allow the work to be parallelized. Translating queries into one or more MapReduce steps is not an intuitive process. Higher-level abstractions have been developed to ease this, discussed under Query below.

An important way in which MapReduce-based systems differ from conventional databases is that they process data in a batch-oriented fashion. Work must be queued for execution, and may take minutes or hours to process.

Using MapReduce to solve problems entails three distinct operations:

  • Loading the data -- This operation is more properly called Extract, Transform, Load (ETL) in data warehousing terminology. Data must be extracted from its source, structured to make it ready for processing, and loaded into the storage layer for MapReduce to operate on it.
  • MapReduce -- This phase will retrieve data from storage, process it, and return the results to the storage.
  • Extracting the result -- Once processing is complete, for the result to be useful to humans, it must be retrieved from the storage and presented.

Many SMAQ systems have features designed to simplify the operation of each of these stages.

Hadoop MapReduce

Hadoop is the dominant open source MapReduce implementation. Funded by Yahoo, it emerged in 2006 and, according to its creator Doug Cutting, reached “web scale” capability in early 2008.

The Hadoop project is now hosted by Apache. It has grown into a large endeavor, with multiple subprojects that together comprise a full SMAQ stack.

Since it is implemented in Java, Hadoop's MapReduce implementation is accessible from the Java programming language. Creating MapReduce jobs involves writing functions to encapsulate the map and reduce stages of the computation. The data to be processed must be loaded into the Hadoop Distributed Filesystem.

Taking the word-count example from above, a suitable map function might look like the following (taken from the Hadoop MapReduce documentation, the key operations shown in bold).



public static class Map
extends Mapper<LongWritable, Text, Text, IntWritable> {

private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {

String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);

}
}
}

The corresponding reduce function sums the counts for each word.

public static class Reduce
		extends Reducer<Text, IntWritable, Text, IntWritable> {

public void reduce(Text key, Iterable<IntWritable> values,
Context context) throws IOException, InterruptedException {

int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));

}
}

The process of running a MapReduce job with Hadoop involves the following steps:

  • Defining the MapReduce stages in a Java program
  • Loading the data into the filesystem
  • Submitting the job for execution
  • Retrieving the results from the filesystem

Run via the standalone Java API, Hadoop MapReduce jobs can be complex to create, and necessitate programmer involvement. A broad ecosystem has grown up around Hadoop to make the task of loading and processing data more straightforward.

Other implementations

MapReduce has been implemented in a variety of other programming languages and systems, a list of which may be found in Wikipedia's entry for MapReduce. Notably, several NoSQL database systems have integrated MapReduce, and are described later in this paper.


Storage

MapReduce requires storage from which to fetch data and in which to store the results of the computation. The data expected by MapReduce is not relational data, as used by conventional databases. Instead, data is consumed in chunks, which are then divided among nodes and fed to the map phase as key-value pairs. This data does not require a schema, and may be unstructured. However, the data must be available in a distributed fashion, to serve each processing node.

SMAQ Stack - Storage

The design and features of the storage layer are important not just because of the interface with MapReduce, but also because they affect the ease with which data can be loaded and the results of computation extracted and searched.

Hadoop Distributed File System

The standard storage mechanism used by Hadoop is the Hadoop Distributed File System, HDFS. A core part of Hadoop, HDFS has the following features, as detailed in the HDFS design document.

  • Fault tolerance -- Assuming that failure will happen allows HDFS to run on commodity hardware.
  • Streaming data access -- HDFS is written with batch processing in mind, and emphasizes high throughput rather than random access to data.
  • Extreme scalability -- HDFS will scale to petabytes; such an installation is in production use at Facebook.
  • Portability -- HDFS is portable across operating systems.
  • Write once -- By assuming a file will remain unchanged after it is written, HDFS simplifies replication and speeds up data throughput.
  • Locality of computation -- Due to data volume, it is often much faster to move the program near to the data, and HDFS has features to facilitate this.

HDFS provides an interface similar to that of regular filesystems. Unlike a database, HDFS can only store and retrieve data, not index it. Simple random access to data is not possible. However, higher-level layers have been created to provide finer-grained functionality to Hadoop deployments, such as HBase.

HBase, the Hadoop Database

One approach to making HDFS more usable is HBase. Modeled after Google's BigTable database, HBase is a column-oriented database designed to store massive amounts of data. It belongs to the NoSQL universe of databases, and is similar to Cassandra and Hypertable.

HBase and MapReduce

HBase uses HDFS as a storage system, and thus is capable of storing a large volume of data through fault-tolerant, distributed nodes. Like similar column-store databases, HBase provides REST and Thrift based API access.

Because it creates indexes, HBase offers fast, random access to its contents, though with simple queries. For complex operations, HBase acts as both a source and a sink (destination for computed data) for Hadoop MapReduce. HBase thus allows systems to interface with Hadoop as a database, rather than the lower level of HDFS.

Hive

Data warehousing, or storing data in such a way as to make reporting and analysis easier, is an important application area for SMAQ systems. Developed originally at Facebook, Hive is a data warehouse framework built on top of Hadoop. Similar to HBase, Hive provides a table-based abstraction over HDFS and makes it easy to load structured data. In contrast to HBase, Hive can only run MapReduce jobs and is suited for batch data analysis. Hive provides a SQL-like query language to execute MapReduce jobs, described in the Query section below.

Cassandra and Hypertable

Cassandra and Hypertable are both scalable column-store databases that follow the pattern of BigTable, similar to HBase.

An Apache project, Cassandra originated at Facebook and is now in production in many large-scale websites, including Twitter, Facebook, Reddit and Digg. Hypertable was created at Zvents and spun out as an open source project.

Cassandra and MapReduce

Both databases offer interfaces to the Hadoop API that allow them to act as a source and a sink for MapReduce. At a higher level, Cassandra offers integration with the Pig query language (see the Query section below), and Hypertable has been integrated with Hive.

NoSQL database implementations of MapReduce

The storage solutions examined so far have all depended on Hadoop for MapReduce. Other NoSQL databases have built-in MapReduce features that allow computation to be parallelized over their data stores. In contrast with the multi-component SMAQ architectures of Hadoop-based systems, they offer a self-contained system comprising storage, MapReduce and query all in one.

Whereas Hadoop-based systems are most often used for batch-oriented analytical purposes, the usual function of NoSQL stores is to back live applications. The MapReduce functionality in these databases tends to be a secondary feature, augmenting other primary query mechanisms. Riak, for example, has a default timeout of 60 seconds on a MapReduce job, in contrast to the expectation of Hadoop that such a process may run for minutes or hours.

These prominent NoSQL databases contain MapReduce functionality:

  • CouchDB is a distributed database, offering semi-structured document-based storage. Its key features include strong replication support and the ability to make distributed updates. Queries in CouchDB are implemented using JavaScript to define the map and reduce phases of a MapReduce process.
  • MongoDB is very similar to CouchDB in nature, but with a stronger emphasis on performance, and less suitability for distributed updates, replication, and versioning. MongoDB MapReduce operations are specified using JavaScript.
  • Riak is another database similar to CouchDB and MongoDB, but places its emphasis on high availability. MapReduce operations in Riak may be specified with JavaScript or Erlang.

Integration with SQL databases

In many applications, the primary source of data is in a relational database using platforms such as MySQL or Oracle. MapReduce is typically used with this data in two ways:

  • Using relational data as a source (for example, a list of your friends in a social network).
  • Re-injecting the results of a MapReduce operation into the database (for example, a list of product recommendations based on friends' interests).

It is therefore important to understand how MapReduce can interface with relational database systems. At the most basic level, delimited text files serve as an import and export format between relational databases and Hadoop systems, using a combination of SQL export commands and HDFS operations. More sophisticated tools do, however, exist.

The Sqoop tool is designed to import data from relational databases into Hadoop. It was developed by Cloudera, an enterprise-focused distributor of Hadoop platforms. Sqoop is database-agnostic, as it uses the Java JDBC database API. Tables can be imported either wholesale, or using queries to restrict the data import.

Sqoop also offers the ability to re-inject the results of MapReduce from HDFS back into a relational database. As HDFS is a filesystem, Sqoop expects delimited text files and transforms them into the SQL commands required to insert data into the database.

For Hadoop systems that utilize the Cascading API (see the Query section below) the cascading.jdbc and cascading-dbmigrate tools offer similar source and sink functionality.

Integration with streaming data sources

In addition to relational data sources, streaming data sources, such as web server log files or sensor output, constitute the most common source of input to big data systems. The Cloudera Flume project aims at providing convenient integration between Hadoop and streaming data sources. Flume aggregates data from both network and file sources, spread over a cluster of machines, and continuously pipes these into HDFS. The Scribe server, developed at Facebook, also offers similar functionality.

Commercial SMAQ solutions

Several massively parallel processing (MPP) database products have MapReduce functionality built in. MPP databases have a distributed architecture with independent nodes that run in parallel. Their primary application is in data warehousing and analytics, and they are commonly accessed using SQL.

  • The Greenplum database is based on the open source PostreSQL DBMS, and runs on clusters of distributed hardware. The addition of MapReduce to the regular SQL interface enables fast, large-scale analytics over Greenplum databases, reducing query times by several orders of magnitude. Greenplum MapReduce permits the mixing of external data sources with the database storage. MapReduce operations can be expressed as functions in Perl or Python.
  • Aster Data's nCluster data warehouse system also offers MapReduce functionality. MapReduce operations are invoked using Aster Data's SQL-MapReduce technology. SQL-MapReduce enables the intermingling of SQL queries with MapReduce jobs defined using code, which may be written in languages including C#, C++, Java, R or Python.

Other data warehousing solutions have opted to provide connectors with Hadoop, rather than integrating their own MapReduce functionality.

  • Vertica, famously used by Farmville creator Zynga, is an MPP column-oriented database that offers a connector for Hadoop.
  • Netezza is an established manufacturer of hardware data warehousing and analytical appliances. Recently acquired by IBM, Netezza is working with Hadoop distributor Cloudera to enhance the interoperation between their appliances and Hadoop. While it solves similar problems, Netezza falls outside of our SMAQ definition, lacking both the open source and commodity hardware aspects.

Although creating a Hadoop-based system can be done entirely with open source, it requires some effort to integrate such a system. Cloudera aims to make Hadoop enterprise-ready, and has created a unified Hadoop distribution in its Cloudera Distribution for Hadoop (CDH). CDH for Hadoop parallels the work of Red Hat or Ubuntu in creating Linux distributions. CDH comes in both a free edition and an Enterprise edition with additional proprietary components and support. CDH is an integrated and polished SMAQ environment, complete with user interfaces for operation and query. Cloudera's work has resulted in some significant contributions to the Hadoop open source ecosystem.


Query

Specifying MapReduce jobs in terms of defining distinct map and reduce functions in a programming language is unintuitive and inconvenient, as is evident from the Java code listings shown above. To mitigate this, SMAQ systems incorporate a higher-level query layer to simplify both the specification of the MapReduce operations and the retrieval of the result.

SMAQ Stack - Query

Many organizations using Hadoop will have already written in-house layers on top of the MapReduce API to make its operation more convenient. Several of these have emerged either as open source projects or commercial products.

Query layers typically offer features that handle not only the specification of the computation, but the loading and saving of data and the orchestration of the processing on the MapReduce cluster. Search technology is often used to implement the final step in presenting the computed result back to the user.

Pig

Developed by Yahoo, Pig provides a new high-level language, Pig Latin, for describing and running Hadoop MapReduce jobs. It is intended to make Hadoop accessible for developers familiar with data manipulation using SQL, and provides an interactive interface as well as a Java API. Pig integration is available for the Cassandra and HBase databases.

Below is shown the word-count example in Pig, including both the data loading and storing phases (the notation $0 refers to the first field in a record).

input = LOAD 'input/sentences.txt' USING TextLoader();
words = FOREACH input GENERATE FLATTEN(TOKENIZE($0));
grouped = GROUP words BY $0;
counts = FOREACH grouped GENERATE group, COUNT(words);
ordered = ORDER counts BY $0;
STORE ordered INTO 'output/wordCount' USING PigStorage();

While Pig is very expressive, it is possible for developers to write custom steps in User Defined Functions (UDFs), in the same way that many SQL databases support the addition of custom functions. These UDFs are written in Java against the Pig API.

Though much simpler to understand and use than the MapReduce API, Pig suffers from the drawback of being yet another language to learn. It is SQL-like in some ways, but it is sufficiently different from SQL that it is difficult for users familiar with SQL to reuse their knowledge.

Hive

As introduced above, Hive is an open source data warehousing solution built on top of Hadoop. Created by Facebook, it offers a query language very similar to SQL, as well as a web interface that offers simple query-building functionality. As such, it is suited for non-developer users, who may have some familiarity with SQL.

Hive's particular strength is in offering ad-hoc querying of data, in contrast to the compilation requirement of Pig and Cascading. Hive is a natural starting point for more full-featured business intelligence systems, which offer a user-friendly interface for non-technical users.

The Cloudera Distribution for Hadoop integrates Hive, and provides a higher-level user interface through the HUE project, enabling users to submit queries and monitor the execution of Hadoop jobs.

Cascading, the API Approach

The Cascading project provides a wrapper around Hadoop's MapReduce API to make it more convenient to use from Java applications. It is an intentionally thin layer that makes the integration of MapReduce into a larger system more convenient. Cascading's features include:

  • A data processing API that aids the simple definition of MapReduce jobs.
  • An API that controls the execution of MapReduce jobs on a Hadoop cluster.
  • Access via JVM-based scripting languages such as Jython, Groovy, or JRuby.
  • Integration with data sources other than HDFS, including Amazon S3 and web servers.
  • Validation mechanisms to enable the testing of MapReduce processes.

Cascading's key feature is that it lets developers assemble MapReduce operations as a flow, joining together a selection of “pipes”. It is well suited for integrating Hadoop into a larger system within an organization.

While Cascading itself doesn't provide a higher-level query language, a derivative open source project called Cascalog does just that. Using the Clojure JVM language, Cascalog implements a query language similar to that of Datalog. Though powerful and expressive, Cascalog is likely to remain a niche query language, as it offers neither the ready familiarity of Hive's SQL-like approach nor Pig's procedural expression. The listing below shows the word-count example in Cascalog: it is significantly terser, if less transparent.

	(defmapcatop split [sentence]
		(seq (.split sentence "\\s+")))

(?<- (stdout) [?word ?count]
(sentence ?s) (split ?s :> ?word)
(c/count ?count))

Search with Solr

An important component of large-scale data deployments is retrieving and summarizing data. The addition of database layers such as HBase provides easier access to data, but does not provide sophisticated search capabilities.

To solve the search problem, the open source search and indexing platform Solr is often used alongside NoSQL database systems. Solr uses Lucene search technology to provide a self-contained search server product.

For example, consider a social network database where MapReduce is used to compute the influencing power of each person, according to some suitable metric. This ranking would then be reinjected to the database. Using Solr indexing allows operations on the social network, such as finding the most influential people whose interest profiles mention mobile phones, for instance.

Originally developed at CNET and now an Apache project, Solr has evolved from being just a text search engine to supporting faceted navigation and results clustering. Additionally, Solr can manage large data volumes over distributed servers. This makes it an ideal solution for result retrieval over big data sets, and a useful component for constructing business intelligence dashboards.


Conclusion

MapReduce, and Hadoop in particular, offers a powerful means of distributing computation among commodity servers. Combined with distributed storage and increasingly user-friendly query mechanisms, the resulting SMAQ architecture brings big data processing within reach for even small- and solo-development teams.

It is now economic to conduct extensive investigation into data, or create data products that rely on complex computations. The resulting explosion in capability has forever altered the landscape of analytics and data warehousing systems, lowering the bar to entry and fostering a new generation of products, services and organizational attitudes - a trend explored more broadly in Mike Loukides' "What is Data Science?" report.

The emergence of Linux gave power to the innovative developer with merely a small Linux server at their desk: SMAQ has the same potential to streamline data centers, foster innovation at the edges of an organization, and enable new startups to cheaply create data-driven businesses.



Related:



PlanetMySQL Voting: Vote UP / Vote DOWN

MySQL GIS – Part 4

Сентябрь 21st, 2010

WHAT CAN YOU DO WITH GEO DATA?

Geo spatial indexes are what make this type of data valuable.  With shape and point data you can find relationships between object in our physical world.  How close is the lightning in the storm front?  What homes where hailed on? (WDT) What schools are in my city?  With a list of homes for sale, how fare are they from their nearest school?  What picture where take in this area. (TwitPic)

Lets start with a simple grid of coordinates by creating a table for it call geom, adding our data points in and out of our grid and then searching with a small bounding box. The grid looks like this.

0,0 0,1 0,2
1,0 1,1 1,2
2,0 2,1 2,2
CREATE TABLE geom (`OGR_FID` int(11) NOT NULL AUTO_INCREMENT, name varchar(100), geo GEOMETRY NOT NULL,
UNIQUE KEY `OGR_FID` (`OGR_FID`),   SPATIAL KEY geo (geo) ) ENGINE=MyISAM ;
INSERT INTO geom VALUES (NULL,"place1",GeomFromText('Point(1 1)'));
INSERT INTO geom VALUES (NULL,"place2",GeomFromText('Point(2 1)'));
INSERT INTO geom VALUES (NULL,"place3",GeomFromText('Point(3 3)'));
SELECT name,AsText(geo) FROM geom WHERE MBRContains(GeomFromText('POLYGON((0 0, 0 2, 2 2, 2 0, 0 0))'),geo);
+--------+------------+
| name   | AsText(geo)|
+--------+------------+
| place2 | POINT(2 1) |
| place1 | POINT(1 1) |
+--------+------------+

The first statement creates the database with three fields, OGR_FID, name and geo.  The next three lines insert the grid data into the database. The last line looks for any point in our grid by drawing a bounding box of its four corners by “drawing” all four sides.

The MySQL function MBRContains returns a 1 or 0 (True or False) to indicate whether the MBR hold the data you are checking.   This select statement returns a (1).

select MBRContains(GeomFromText(‘POLYGON((-98 36, -97 36, -97 35, -98 35,-98 36))’), point( -97.5,35.5) )  ;

NOTE: If you search for data and don’t get any results  try reversing LAT and LON.  Coordinate data can be confusing and I’ve found not every database uses data in the same order. Also, Graphical tools like HeidiSQL, SQLyog and Toad for MySQL don’t display GIS data well.  Of these HeidiSQL did the best job.

Let’s try this on some real data?  The data in this example was download in MySQL Part 3.

select name, category, Y(SHAPE) as lat, X(SHAPE) as lng  from oklahoma_poi
where MBRContains(GeomFromText(‘POLYGON((-98 36, -97 36, -97 35, -98 35, -98 36))’), SHAPE ) limit 3;

 +-------------+------------+------------+-------------+
 | name        | category   | lat        | lng         |
 +-------------+------------+------------+-------------+
 | "Stop" sign | Automotive |  35.682066 |  -97.674018 |
 | "Stop" sign | Automotive | 35.7259863 | -97.6739644 |
 | "Stop" sign | Automotive | 35.6820227 | -97.6563292 |
 +-------------+------------+------------+-------------+

Now for some real FUN.  With two searchs, one for counties starting with “Oklahoma” and second for points of interest where the name starts with “School:”.

 set @LOC = (select astext(ogc_geom) from oklahoma_county where COUNTY_NAM like "Oklahoma%"  ) ;
 select   name, category, Y(SHAPE) as lat, X(SHAPE) as lng  from oklahoma_poi
 where  MBRContains(GeomFromText(@LOC), SHAPE ) and name like 'School:%' ;
+---------------------------------------+--------------------------------+------------+-------------+
| name                                  | category                       | lat        | lng         |
+---------------------------------------+--------------------------------+------------+-------------+
| School:Deer Creek School              | Government and Public Services | 35.6825513 | -97.6211545 |
| School:Northridge School              | Government and Public Services |  35.557277 | -97.6667095 |
| School:Harvest Hills School           | Government and Public Services |  35.576721 | -97.6605986 |
| School:Saint Francis DeSales Seminary | Government and Public Services | 35.5628328 | -97.6453203 |
~~~~~~~
| School:Clegern School                 | Government and Public Services | 35.6481103 | -97.4767064 |
| School:Risner School                  | Government and Public Services | 35.6278333 | -97.4750394 |
| School:Central Mid High School        | Government and Public Services | 35.6450549 | -97.4733729 |
| School:Central State University       | Government and Public Services | 35.6553323 | -97.4719841 |
+---------------------------------------+--------------------------------+------------+-------------+
 296 rows in set (0.07 sec)

This is a lot faster then searching each each row with a clause like “WHERE (POW(lat – 35.123456, 2)+POW(lon – -97.123456, 2))“.  How much faster? You’ll have to wait for for another posting.  Here is whats coming.

  • Viewing our GIS data.
  • How to collect your own GIS data.
  • Good and bad examples of searching GIS data.
  • Optimizing MySQL GIS.  Is it really worth using?
  • Data sources shared by users.
  • Mark Grennan

    Tweet


    PlanetMySQL Voting: Vote UP / Vote DOWN

    MySQL GIS – Part 3

    Сентябрь 7th, 2010

    Tweet

    What data is available?

    GEO data is expensive to create, so has been created by governments.  In the past governments charged for this data.  In 1980 the USGS was charging $300 (usd) per county for Oklahoma GEO data. (I complained to my congressman.) Today, a quick Internet search turns up lots of free GIS data.

    I was hoping to find a neat collection of basic GEO data.   It would be nice if there was one place you could get world political borders (Polygons), postal codes (Polygons) and  points of interest like hospitals and airports.  What you can find is lots of  lists, often collections of odd data created for a virility of complex political purpose.  For example, The Global Change Master Directory is a large list of data sources on earth and climate change, but you will not find the data here.

    Remember, as you dig for data there are two types Vector (text) and Raster (pictures).  Most sites don’t  distinguish between them or combine them for you making them less useful your your own uses.

    If you find good sources of GEO data PLEASE share them with me so I can share them with everyone else.

    Here is a short list of the sites I found and used to create my test data in my series of posts.

    I live in the state of Oklahoma. Because I know it well, I’m using it for my examples.  My search for Oklahoma GIS data turned up these sources. This should give you some idea of the data you might find in your searching. I found county borders (polygons), a list of hospitals (points), city borders (polygons),  points of interest (points) and zip codes (polygons).   Each of these came as shape files so the process was simple to get the data into MySQL.

    Adding these to my ‘geo’ database was simple.  I ran each .shp file through the ogr2ogr program.

     wget http://www.okladot.state.ok.us/hqdiv/p-r-div/maps/shp-files/munibnd.zip
     unzip munibnd.zip
     ogr2ogr -f "MySQL" MySQL:"geo,user=root,host=localhost,password=" -nln oklahoma_cities -lco engine=MYISAM munibnd.shp
  • More examples on what you can do with GIS data and MySQL.
  • Viewing our GIS data.
  • How to collect your own GIS data.
  • Good and bad examples of searching GIS data.
  • Optimizing MySQL GIS.  Is it really worth using?
  • Data sources shared users.
  • Mark Grennan

    PlanetMySQL Voting: Vote UP / Vote DOWN

    MySQL GIS – Part 1

    Август 24th, 2010

    Tweet

    In my business (weather) we use lots map based (Geo) information.  Almost every table has latitude and longitude. Working with this kind of data can be exciting and frustrating.  This should give you a quick start into GIS with MySQL.

    “A geographic information system (GIS), or geographical information system, is any system that captures, stores, analyzes, manages, and presents data that are linked to location. In the simplest terms, GIS is the merging of cartography, statistical analysis, and database technology. GIS systems are used in cartography, remote sensing, land surveying, utility management, natural resource management, photogrammetry, geography, urban planning, emergency management, navigation, and localized search engines.”Wikipedia

    GIS / Mapping Systems work with both text data and graphical data.  Applications and utilities often blur the lines between the two types and make understanding difficult.  Map servers blend raster images, with point or polygon data, and bitmap images to make complete images to display in the user’s client application.  For this post I will concentrate on the text type “data”.  The type we can index in a MySQL database.

    THE SEARCH

    After months of reading, [1]  I’m writing this post to describes what I have learned about how to get started using GEO coding data as quickly as possible.  I found very little piratical information on GIS and MySQL.  The MySQL manual covers the functions but doesn’t supply much practical information on GEO.  Anders Karlsson wrote a nice and short story about GIS that give me a good start.

    The best information has be written by Florin Duroiu in his post titled “Political boundaries overlay in Google maps”. A good part of my post is based on his work.

    STEP BY STEP

    Below are the detailed needed to to produce a MySQL database with the Points of Interest (POI).  This is based on CentOS 5.5 with MySQL 5.1.

    yum install gdal
    mkdir geo
    mkdir data
    mkdir data/Oklahoma
    cd geo/data/Oklahoma
    wget http://downloads.cloudmade.com/north_america/united_states/oklahoma/oklahoma.shapefiles.zip
    unzip oklahoma.shapefiles.zip
    mysql -e 'create database geo'
    ogr2ogr -f "MySQL" MySQL:"geo,user=root,host=localhost,password=" -nln oklahoma_poi -lco engine=MYISAM oklahoma_poi.shp
    mysql geo -e 'desc oklahoma_poi'
    
    +----------+--------------+------+-----+---------+----------------+
    | Field    | Type         | Null | Key | Default | Extra          |
    +----------+--------------+------+-----+---------+----------------+
    | OGR_FID  | int(11)      | NO   | PRI | NULL    | auto_increment |
    | SHAPE    | geometry     | NO   | MUL | NULL    |                |
    | category | varchar(30)  | YES  |     | NULL    |                |
    | name     | varchar(113) | YES  |     | NULL    |                |
    +----------+--------------+------+-----+---------+----------------+
    
    mysql geo -e "select name, category, Y(SHAPE) as lat, X(SHAPE) as lng  from oklahoma_poi where NAME like 'School:Putnam%'"
    
    +-------------------------------------+--------------------------------+------------+-------------+
    | name                                | category                       | lat        | lng         |
    +-------------------------------------+--------------------------------+------------+-------------+
    | School:Putnam City West High School | Government and Public Services |  35.492557 | -97.6605975 |
    | School:Putnam City North School     | Government and Public Services | 35.5892209 | -97.6372648 |
    | School:Putnam City School           | Government and Public Services | 35.5122794 | -97.6142079 |
    | School:Putnam High School           | Government and Public Services | 35.5214459 | -97.6086523 |
    | School:Putnam Heights Academy       | Government and Public Services | 35.5081143 | -97.5397619 |
    +-------------------------------------+--------------------------------+------------+-------------+
    

    In a coming set of post I’ll go over:

    • The “Data” types your will find and how to convert between them.
    • What data is available and where can you find it?
    • More examples on what you can do with GIS data.
    • Viewing our GIS data.
    • How to collect your own GIS data.
    • Good and bad examples of searching GIS data.
    • Optimizing MySQL GIS.  Is it really worth using?

    [1] Books: GIS for DummiesAuthor: Michael N. DeMers – John Wiley & Sons (2009) – ISBN: 0470236825
    Open Source GIS: A GRASS GIS Approach. Third Edition.
    Author: Markus Neteler and Helena Mitasova – ISBN: 978-0-38735767-6
    Web Mapping Illustrated: Using Open Source GIS Toolkits
    Author: Tyler Mitchell – ISBN: 9780596008659

    


    PlanetMySQL Voting: Vote UP / Vote DOWN

    Does Size or Type Matter?

    Июль 27th, 2010

    MySQL seems to be happy to convert types for you. Developers are rushed to complete their project and if the function works they just move on. But what is the costs of mixing your types? Does it matter if your are running across a million rows or more? Lets find out.

    Here is what the programmers see.

    mysql> select 1+1;
    +-----+
    | 1+1 |
    +-----+
    |   2 |
    +-----+
    1 row in set (0.00 sec)
    
    mysql> select "1"+"1";
    +---------+
    | "1"+"1" |
    +---------+
    |       2 |
    +---------+
    1 row in set (0.00 sec)

    Benchmark

    What if we do a thousand simple loops?  How long does the looping itself take?

    The BENCHMARK() function executes the expression expr repeatedly count times. It may be used to time how quickly MySQL processes the expression. The result value is always 0.

    mysql> select benchmark(1000000000, 1);
    +--------------------------+
    | benchmark(1000000000, 1) |
    +--------------------------+
    |                        0 |
    +--------------------------+
    1 row in set (5.42 sec)
    
    mysql> select benchmark(1000000000, "1" );
    +-----------------------------+
    | benchmark(1000000000, "1" ) |
    +-----------------------------+
    |                           0 |
    +-----------------------------+
    1 row in set (5.40 sec)

    So maybe type doesn’t matter? About five seconds just to loop but the type didn’t change it.   What if we add 1+”1″?

    mysql> select benchmark(1000000000, 1+1);
    +----------------------------+
    | benchmark(1000000000, 1+1) |
    +----------------------------+
    |                          0 |
    +----------------------------+
    1 row in set (12.65 sec)
    
    mysql> select benchmark(1000000000, 1+"1");
    +------------------------------+
    | benchmark(1000000000, 1+"1") |
    +------------------------------+
    |                            0 |
    +------------------------------+
    1 row in set (35.58 sec)
    
    mysql> select benchmark(1000000000, "1"+"1");
    +--------------------------------+
    | benchmark(1000000000, "1"+"1") |
    +--------------------------------+
    |                              0 |
    +--------------------------------+
    1 row in set (51.59 sec)

    It looks like type does matter.  But does it always matter?

    mysql> select benchmark(1000000000, sum(1+1));
    +---------------------------------+
    | benchmark(1000000000, sum(1+1)) |
    +---------------------------------+
    |                               0 |
    +---------------------------------+
    1 row in set (9.69 sec)
    
    mysql> select benchmark(1000000000, sum("1"+"1"));
    +-------------------------------------+
    | benchmark(1000000000, sum("1"+"1")) |
    +-------------------------------------+
    |                                   0 |
    +-------------------------------------+
    1 row in set (9.94 sec)
    
    mysql> select benchmark(1000000000, sum("1.23456789"+"1.23456789"));
    +-------------------------------------------------------+
    | benchmark(1000000000, sum("1.23456789"+"1.23456789")) |
    +-------------------------------------------------------+
    |                                                     0 |
    +-------------------------------------------------------+
    1 row in set (10.32 sec)

    So, not all functions are the same.  But it looks like size might matter!

    mysql> select benchmark(1000000000, 1.1+1.1);
    +--------------------------------+
    | benchmark(1000000000, 1.1+1.1) |
    +--------------------------------+
    |                              0 |
    +--------------------------------+
    1 row in set (34.90 sec)
    
    mysql> select benchmark(1000000000, "1.1"+"1.1");
    +------------------------------------+
    | benchmark(1000000000, "1.1"+"1.1") |
    +------------------------------------+
    |                                  0 |
    +------------------------------------+
    1 row in set (1 min 15.32 sec)
    
    mysql> select  benchmark(1000000000, "1.123456789"+"1.123456789");
    +----------------------------------------------------+
    | benchmark(1000000000, "1.123456789"+"1.123456789") |
    +----------------------------------------------------+
    |                                                  0 |
    +----------------------------------------------------+
    1 row in set (1 min 53.32 sec)

    Sorry.  Looks like size does matter.
    This doesn't seem logical.

    mysql> select benchmark(1000000000, 1=1);
    +----------------------------+
    | benchmark(1000000000, 1=1) |
    +----------------------------+
    |                          0 |
    +----------------------------+
    1 row in set (12.75 sec)
    
    mysql> select benchmark(1000000000, 1="1");
    +------------------------------+
    | benchmark(1000000000, 1="1") |
    +------------------------------+
    |                            0 |
    +------------------------------+
    1 row in set (40.78 sec)
    mysql> select benchmark(1000000000, 1=true);
    +-------------------------------+
    | benchmark(1000000000, 1=true) |
    +-------------------------------+
    |                             0 |
    +-------------------------------+
    1 row in set (12.73 sec)
    
    mysql> select benchmark(1000000000, 1="true");
    +---------------------------------+
    | benchmark(1000000000, 1="true") |
    +---------------------------------+
    |                               0 |
    +---------------------------------+
    1 row in set, 65535 warnings (3 min 5.72 sec)
    mysql> select benchmark(1000000000, "true"="true");
    +--------------------------------------+
    | benchmark(1000000000, "true"="true") |
    +--------------------------------------+
    |                                    0 |
    +--------------------------------------+
    1 row in set (57.25 sec)

    Maybe we should CAST our work?

    mysql> select benchmark(1000000000, cast("1" as unsigned));
    +----------------------------------------------+
    | benchmark(1000000000, cast("1" as unsigned)) |
    +----------------------------------------------+
    |                                            0 |
    +----------------------------------------------+
    1 row in set (32.27 sec)
    
    mysql> select benchmark(1000000000, cast("1" as unsigned) + cast("1" as unsigned));
    +----------------------------------------------------------------------+
    | benchmark(1000000000, cast("1" as unsigned) + cast("1" as unsigned)) |
    +----------------------------------------------------------------------+
    |                                                                    0 |
    +----------------------------------------------------------------------+
    1 row in set (1 min 7.24 sec)

    Maybe not!
    Conclusion:  Be careful with your data types.  If you are taking user input, do the type conversion ONCE in your program.  Don’t let MySQL do the type conversions for you.
    query = “SELECT * FROM table where $INPUT = 1″;   could be doing your wrong.

    References:
    
    http://dev.mysql.com/doc/refman/5.0/en/information-functions.html#function_benchmark
    
    http://dev.mysql.com/doc/refman/5.0/en/numeric-type-overview.html
    
    http://dev.mysql.com/doc/refman/5.0/en/cast-functions.html

    PlanetMySQL Voting: Vote UP / Vote DOWN

    Is OpenStack Cloud Computing Rocket Science?

    Июль 19th, 2010

    There’s a real explosion of cloud platforms and management tools, it seems you can’t swing a dead cat without hitting one these days. In the commercial proprietary solutions space you have – CA’s 3Terra AppLogic, Enomaly, Nimbula, RightScale. In open source there are EucalyptusCloud.com, Open Nebula and Ubuntu Enterprise Cloud. There are a bunch more that I failed to mention. It makes you wonder do we really need another one? How much different can they be?Rackspace Champion's Open Source Cloud I am not sure but the newest one appears to be rather significant.

    Today Rackspace has thrown their hat in the ring with their new OpenStack initiative in collaboration with NASA — as in rocket scientists, smartest guys in the world. Unlike Amazon’s EC2 which preaches open APIs, Rackspace is working to develop an open source platform that compliments their hosted cloud offering. They also have a strong open source partner in NASA who has been working on their own cloud computing platform, NASA Nebula. NASA Nebula will now become the cornerstone for the OpenStack initiative.

    The goal of OpenStack is to allow any organization to create and offer cloud computing capabilities using open source software running on standard hardware. The project boasts both a compute and storage component. OpenStack Compute is software for automatically creating and managing large groups of virtual private servers and is available as a developer’s preview with a release target of October. OpenStack Storage is software for creating redundant, scalable object storage using clusters of commodity servers to store terabytes or even petabytes of data. Also available as a developer preview the OpenStack Storage project expects to release a production ready version in mid-September.

    Adding the Rackspace hosting model to a strong open source project makes this approach to cloud computing especially interesting. Giving private cloud users a logical migration path to public cloud use. The question effect will this initiative have to truly drive open cloud computing standards.

    OpenStack, A Foundation for Hybrid Clouds?

    This initiative while founded on open source is not necessarily the cure for lock-in but it does go much farther than anyone else offering a fully accessible reference architecture available as open source. The closest comparison I see is Eucalyptus that is mimicks the Amazon EC2 cloud compute architecture (though not AmazOpen Stack - Open Source Cloud Computingon S3), though Amazon and Eucalyptus don’t seem to share a commonly agreed upon road map but rather a leader-follower relationship.

    OpenStack’s formula is more coordinated and with a respectable user to champion it, NASA. The U.S space agency has one of the most compelling publicly documented private cloud computing stories.  NASA has gone so far as to package their solution in small footprint shipping containers to distribute among NASA research centers. These portable data centers are a model that many organizations looking to build private clouds are watching with interest.

    I like that private clouds built on the OpenStack reference architecture should be fully compatible with Rackspace hosting services. Giving users the choice to run their own cloud or host or adopt a hybrid model. It’s not unlike open source adoption models were users download a free software version that has compatibility with a commercially supported version.  Plus this is not Rackspace’s only foray into open source distributed computing as they support the Apache-hosted Cassandra project, a highly scalable distributed database, and have been showing their support at numerous cloud and open source events.

    OpenStack Web Interface

    The OpenStack Web Interface

    The strong message accompanying the launch is one of open standards and prevention of cloud lock-in. Lew Moorman, President, Cloud and CSO at Rackspace states this clearly that OpenStack wants to prevent vendor lock-in:

    “We are founding the OpenStack initiative to help drive industry standards, prevent vendor lock-in and generally increase the velocity of innovation in cloud technologies.”

    This is not unlike VMware who echoed that sentiment with an announcement this spring to collaborate with Google AppEngine.  VMware’s CTO Steve Herrod stated that they too were committed to open standards and preventing login:

    “Our shared vision is to make it easy to build, run, and manage applications for the cloud, and to do so in a way that makes the applications portable across clouds. The rich applications should be able to run in an enterprise’s private cloud, on Google’s AppEngine, or on other public clouds committed to similar openness.”

    So with all this openness and commitment to open standards is the ability to move from cloud to cloud seamlessly just around the corner? This remains to be seen it but the initiatives all seem to be well-intentioned and moving in the right direction.

    Does Open Source Prevent Cloud Lock-in?

    I don’t know whether Rackspace’s OpenStack will truly prevent cloud lock-in but it does seem to be well-intentioned. Though I  believe the following things need to happen to insure cloud lock-in doesn’t become a rampant problem:

    • Virtualization Portability – At a very simple level users need to be able to move from virtualization technologies including those hosted in the cloud need to be able to migrate seamlessly, that includes VMs running in VMware , Xen, HyperV and KVM. Then once in the cloud they need to be able to move across clouds both public and private unencumbered — Amazon, Rackspace, Eucalyptus, Ubuntu Enterprise Cloud and others. Adoption of a widespread virtualization standard like Open Virtualization Format (OVF) could help (OpenStack does seem to already support OVF, a good sign).
    • Data Portability – Just as compute environments move so should data, but not only move but be accessible across network and cloud infrastructures with high fidelity.  Security of that data goes without saying but adds another layer of complexity.
    • Cross Environment Tools – Finally, the tools to managing these environments need to manage both cloud and legacy architectures to insure that  the management of these new computing paradigms don’t make things even more complicated.

    I hope OpenStack helps drive this vision. However to deliver on the true vision of true portability across cloud platforms other cloud providers and vendors other than Rackspace will have to participate.

    Related Articles

    Technorati Tags: , , , , , , , , , , , ,


    PlanetMySQL Voting: Vote UP / Vote DOWN

    Federated Tables

    Июль 7th, 2010

    Your searching for how to create a join across two databases on two different servers and it can’t be done directly.   select  d1.a, d2.b from db1@server1 join db2@server2 where db1.c = db2.c; does not work.

    You learn about federated databases.  The federated storage engine allows accesses data in tables of remote databases.  Now how do you make it work?

    1) Check if the federated storage engine is supported.  Federation is OFF by default!

    mysql> show engines;
    +------------+---------+----------------------------------------------------------------+
    | Engine     | Support | Comment                                                        |
    +------------+---------+----------------------------------------------------------------+
    | InnoDB     | YES     | Supports transactions, row-level locking, and foreign keys     |
    | MyISAM     | DEFAULT | Default engine as of MySQL 3.23 with great performance         |
    | BLACKHOLE  | YES     | /dev/null storage engine (anything you write to it disappears) |
    | CSV        | YES     | CSV storage engine                                             |
    | MEMORY     | YES     | Hash based, stored in memory, useful for temporary tables      |
    | FEDERATED  | YES     | Federated MySQL storage engine                                 |
    | ARCHIVE    | YES     | Archive storage engine                                         |
    | MRG_MYISAM | YES     | Collection of identical MyISAM tables                          |
    +------------+---------+----------------------------------------------------------------+
    

    If it is not “Support”ed (on) you need to add ‘federated=ON‘ to the [mysqld] section of your /etc/my.cnf file.  I found this section to be a bit troublesome.  It must be ‘=ON’ not ‘=YES” or even ‘=on’.   Most options allow these but the federated options is picky.  I’m running MySQL Enterprise 5.1.37.sp1.

    2) If you don’t already have the database created, create the database on the storage server.  By ‘storage server’ I mean the one where the data will be written to disk.

    I like to create a user just for the purpose of connection the federated copy of the database to the true database.  This way, if the password gets changed or the user deleted, the federated system can continue to connect.

    mysql> CREATE DATABASE xfiles;
    mysql> USE xfiles;
    mysql> CREATE TABLE cases(
     Name VARCHAR(20),
     case TINYINT(3),
    ) ENGINE = INNODB;

    3) Now you can create the federated version of your data on the remote system.

    mysql> CREATE DATABASE xfiles;
    mysql> USE xfiles;
    mysql> CREATE TABLE cases(
     Name VARCHAR(20),
     case TINYINT(3),
    ) ENGINE = FEDERATED
    CONNECTION = 'mysql://skiner:c0nsper@fbi/xfiles/cases';

    4) Check your work. The table status should show Engine: FEDERATED.

    mysql> use xfiles;
    mysql> show table status\G

    Now you can add records to the table and the data should show up in select on either server.

    Enjoy.


    PlanetMySQL Voting: Vote UP / Vote DOWN