by Merv Adrian | July 10, 2013 | 7 Comments
I had the privilege of keynoting this year’s Hadoop Summit, so I may be a bit prejudiced when I say the event confirmed my assertion that we have arrived at a turning point in Hadoop’s maturation. The large number of attendees (2500, a solid increase – and more “suits”) and sponsors (70, also a significant uptick) made it clear that growth is continuing. Gartner’s data confirms this – my own inquiry rate continues to grow, and my colleagues covering big data and Hadoop are all seeing steady growth too. But it’s not all sweetness and light. There are issues. Here we’ll look at the centerpiece of the technical messaging: YARN. Much is expected – and we seem to be doomed to wait a while longer.
Here is a great summary of YARN, also known as Hadoop 2.0, posted after the Summit:
MapReduce is great for batch processing large volumes of distributed data, but it’s less than ideal for real-time data processing, graph processing and other non-batch methods. YARN is the open source community’s effort to overcome this limitation and transform Hadoop from a One Trick Pony to a truly comprehensive Big Data management and analytics platform.
Sounds great, doesn’t it? Problem is this was posted by Jeff Kelly last August, after Hadoop Summit 2012. Now, YARN is being used – on Yahoo’s 30,000 nodes, for example – but Apache still calls it Alpha as of this writing (July 9, 2013.) Next announcement, when it comes, will be beta. Some distributions, like Cloudera CDH 4.2, are already supporting it anyway. Hortonworks HDP 2.0, which includes YARN, is in Community Preview (what we enterprise guys like to call beta). MapR doesn’t list it yet – the search engine on their site comes up empty if you search for it. So we aren’t quite there yet.
One other note: confusion continues – I see it in my inquiries about “what IS Hadoop?” Two of the three components Apache lists as the Hadoop core will be substitutable now – you can already swap out HDFS for IBM’s GPFS, Intel’s Lustre, or MapR’s storage layer. As YARN comes to market, other engines will be swappable for MapReduce. Graph engines and “closer to real-time” processing are next on the horizon, as Storm is getting great traction and several Summit presenters of real-world case studies alluded to their use of it. Yahoo! has open sourced the Storm-YARN code it runs internally, so expect more productionization ahead. So the answer to “what is Hadoop, exactly?” will become even more complicated.
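To make the pluggability point concrete, here is a minimal sketch of what “another engine on YARN” looks like from the client side, using the Hadoop 2.x YarnClient API: the client asks the ResourceManager for an application and describes how to launch an ApplicationMaster, and everything framework-specific lives in that AM. The ToyApplicationMaster class here is hypothetical, and a real submission would also ship jars and set up the classpath – treat this as an illustration, not a recipe.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class ToyEngineSubmit {
  public static void main(String[] args) throws Exception {
    // Connect to the cluster's ResourceManager
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Ask for a new application id and fill in the submission context
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("toy-engine");

    // Describe the container that will run our ApplicationMaster.
    // ToyApplicationMaster is hypothetical: it would request containers
    // and run whatever engine logic we like - MapReduce-shaped or not.
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList(
        "java -Xmx256m com.example.ToyApplicationMaster"));
    appContext.setAMContainerSpec(amContainer);

    // Tell the scheduler what the AM container needs
    Resource amResource = Records.newRecord(Resource.class);
    amResource.setMemory(512);
    amResource.setVirtualCores(1);
    appContext.setResource(amResource);

    // Hand it to YARN; the framework-specific smarts live in the AM, not in YARN
    ApplicationId appId = yarnClient.submitApplication(appContext);
    System.out.println("Submitted application " + appId);
  }
}
```

The resource layer neither knows nor cares whether that AM runs MapReduce, Storm or a graph engine – which is precisely the flexibility, and the confusion, described above.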
Will this confusion hurt the market, and slow adoption? Hard to say. The uncertain part of the market remains so: Gartner’s 2012 Research Circle found 31% of enterprises had no plans for Big Data investment, and in 2013 the number was the same. YARN will broaden the set of possible use cases, and raise many questions. Let’s hope it’s ready to start answering them soon.
by Merv Adrian | March 9, 2013 | 22 Comments
I don’t often do a pure opinion piece but I feel compelled to weigh in on a question I’ve been asked several times since EMC released its Pivotal HD recently. The question is whether it is somehow inappropriate, even “evil,” for EMC to enter the market without having “enough” committers to open source Apache projects. More broadly, it’s about whether other people can use, incorporate, add to and profit from Apache Hadoop.
The fact is, there is an entire industry building products atop Apache open source code – and that is the point of having Apache license its projects and provide the other services it does for the open source community. The license permits such use, and companies using the Apache web server, Lucene and SOLR, Cassandra and CouchDB, and many others are everywhere. Others are building BI tools or DI tools that integrate with Apache Hadoop, or selling consulting to incorporate it into solutions. Again – that is the point. Having some components of your solution stack provided by the open source community is a fact of life and a benefit for all. So are roads, but nobody accuses FedEx or your pizza delivery guy of being evil for using them without contributing some asphalt. Commercial entities (including software and IT services providers) provide needed products and services, employ people and pay taxes. We might want them to do more charitable work or make more open source contributions, and some do, but they are not morally obligated to do so. Some IT companies make huge commitments to charitable activities and some don’t – the same is true in all sectors of the economy.
I understand why open source advocates think they are defending their turf, and I know it’s a core belief that it matters how many committers you have. But I don’t believe the market will care as Hadoop moves into the mainstream. Buyers will choose the solutions that fit their needs, from suppliers who support them at a price they are comfortable with – and will do so whether the vendors have “enough” committers or not.
For clarity’s sake, this wasn’t a new market entry. EMC was already a purveyor of Hadoop-based solutions with their Greenplum HD and with a version based on MapR. That itself is a topic worth a sentence or two. EMC’s decision to offer a MapR-based distribution early on was very much a market choice – they did it for customers who demanded those features NOW (then) and couldn’t get them any other way. I don’t think EMC fooled those buyers, who asked for what EMC provided. Nor do I think EMC is morally reprehensible for building their own solution by leveraging something in their product portfolio (in this case, Isilon as a potential substitute for HDFS) and thus “abandoning” those customers.
Now, if EMC stops supporting those buyers, forces them to move to a new product to keep their support – well, then we can talk. But just to be clear, virtually every software company has an end of life policy on support for versions of its products. And again, some are more “oppressive” about it than others – and the topic is often very contentious. I get inquiries on it all the time. That topic has not even come up with EMC and MapR yet.
So a few deep breaths, please.
Dial it back.
Support open source. It’s a good thing. In fact, it’s transformative – it changes your choices, and often for the better, especially economically.
If you sell, by all means appeal to people who value purity. But let’s not try to have our cake and eat it too: if you sell a product based only on open source, or services that help people implement and profit from it, you’re part of the same economy as those who blend it with other pieces. Let’s compete on the basis of satisfying our customers at a fair price. The rest – well, that’s marketing. And we all know how much some people like that, and how seriously they take it.
by Merv Adrian | March 8, 2013 | 1 Comment
The first three posts in this series talked about performance, projects and platforms as key themes in what is beginning to feel like a watershed year for Hadoop. All three are reflected in the surprising emergence of a number of new players on the scene, as well as some new offerings from additional ones, which I’ll cover in another post. Intel, WANdisco, and Data Delivery Networks recently entered the distribution game, making it clear that capitalizing on potential differentiators (real or perceived) in a hot market is still a powerful magnet. And in a space where much of the IP in the stack is open source, why not go for it? These introductions could all fall into the performance theme as well – they are all driven by innovations intended to improve Hadoop speed.
Intel is by far the biggest of the new entrants. I discuss them along with my colleague Arun Chandrasekaran in a recent Gartner First Take: Hadoop Distribution Seeks to Leverage Intel’s Microprocessor Strengths. Net: processor-level exploitation and expertise in memory and IO architectures can drive great improvements. There’s more: Intel made several key partnership deals:
- An agreement with SAP around the HANA in-memory DBMS to collaborate on both a technology roadmap and go to market plans, including both streaming and bulk data movement between the two environments. The two firms plan to have a single-install deliverable later this year that will provide direct bidirectional queries and integrate management as well, building atop the already demonstrable support for SAP Data Services to pull data from Intel’s Hadoop distribution, and to deliver both on the same hardware platform. (Note that appliances that combine DBMS and Hadoop on a single rack are already available from EMC, HP and IBM – but that is another post.)
- A deal with MarkLogic Enterprise NoSQL to incorporate and support Intel’s distribution, seeing Intel’s chip-based encryption as a good complement to MarkLogic’s role-based NIAP and CCEVS-compliant security system. And of course to expand the reach of the MarkLogic engine to HDFS.
- An OEM agreement with Pentaho – the latter will be in the box. Its data mining, reporting, data discovery/visualizations, predictive analytics, and data integration will help round out the offering and make it easier to build without deep Java MapReduce skills – making an interesting foil to Hortonworks’ similar arrangement with Talend.
- Numerous other partnerships – over 20 – that include hardware, network and systems integrator deals.
WANdisco may be an unfamiliar name to some data management folks, but not to those using Apache Subversion, the open source version control system. As a leader in Wide Area Network Distributed Computing (there is an acronym-based name in there if you look) with patent-protected active-active peer-to-peer replication, WANdisco sees a performance-driven opening too. With a leadership team that made foundational contributions to Apache HDFS and Apache BigTop and helped build out Yahoo’s infrastructure, WANdisco comes to the table with the WDD distribution, based on Apache Hadoop 2.0 with support for WANs across data centers, including mirroring and auto-recovery. (It joins MapR in its ability to provide the latter, but without requiring the use of its own filesystem.) It supports Amazon S3 storage as well as HBase, and provides a console for wizard-based deployment, monitoring and management on both virtualized (VMware) and dedicated physical infrastructure. It also provides the usual support and consulting services and plans aggressive moves to ramp up following last year’s IPO on the London Stock Exchange and acquisition of Altostor.
Data Direct Networks (DDN), whose presence in the high performance computing (HPC) market may also be less well known to the typical Hadoop prospect, is targeting the mid- to upper end of the market with its hScaler appliance. Above 100 nodes, where significant enterprise production workloads run, and in the multi-thousand-node space where government and data center customers operate, DDN is already often familiar for its Lustre-based ExaScaler and GridScaler filesystem plays. hScaler is pointed at the fact that by some estimates, 30% or more of a job on large clusters takes place on data that is not local to the node, despite the value proposition of Hadoop putting processing “next to the data.” This is the “shuffle” phase, which takes place between Map and Reduce – multiple times in a multi-step workflow, and most workflows are multi-step. This is a performance play that gets more attractive with larger size and more complexity. DDN touts its pipelining of Hadoop, and its ability to scale compute and storage independently, as key differentiators. [edited] hScaler includes the Hortonworks HDP, which DDN supports, and bundles Pentaho as its “ETL graphical designer” along with the DirectMon management console (like the ones I discussed in Part One), compatible with ExaScaler and GridScaler.
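For anyone who hasn’t stared at a MapReduce job lately, here is roughly where that shuffle sits. The sketch below is the canonical word count against the standard Hadoop Java API – nothing DDN-specific, just an illustration – with comments marking the phase where map output is partitioned, copied across the network and merge-sorted before the reducers see it.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs (ideally) on the node that holds the input block.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        word.set(token);
        context.write(word, ONE);   // emitted pairs are spilled locally, then shuffled
      }
    }
  }

  // Between map and reduce sits the shuffle: map output is partitioned by key,
  // pulled over the network to reducer nodes, and merge-sorted there. That
  // cross-node copy is the non-local data movement the hScaler pitch targets.

  // Reduce phase: sees all values for a key, after the shuffle/sort.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);   // a combiner trims what the shuffle must move
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Every pair emitted in map() has to cross that boundary (minus whatever the combiner trims), which is why the shuffle dominates at scale.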
There you have it: 3 new players, all of whom are focusing on the performance dimension as their reason for entering the market. This is a maturation step; in the earliest days of a new technology, it’s enough to be able to do it at all. Some of the tire-kickers are now moving on to evaluate alternatives against one another not just on what functions they have, but on how well they do them. POCs are appearing in my Gartner inquiries, and the results, not surprisingly, vary by workloads, by the data types, and by the skills of the field staff involved in the tests. We’ll see lots of benchmarketing in the months ahead. Just remember that your mileage may vary, and ALWAYS test with a POC bakeoff.
by Merv Adrian | February 23, 2013 | 4 Comments
In the first two posts in this series, I talked about performance and projects as key themes in Hadoop’s watershed year. As it moves squarely into the mainstream, organizations making their first move to experiment will have to make a choice of platform. And – arguably for the first time in the early mainstreaming of an information technology wave – that choice is about more than who made the box where the software will run, and the spinning metal platters the bits will be stored on. There are three options, and choosing among them will have dramatically different implications on the budget, on the available capabilities, and on the fortunes of some vendors seeking to carve out a place in the IT landscape with their offerings.
First up is the cloud. It’s extraordinarily attractive to first timers, because there is no capital expenditure (read: no procurement process, no IT Standards Committees, minimal budget impact, etc.) It’s easy. Maybe too easy; using it outside IT can undermine years of careful work on governance and dramatically increase risk. But that’s another post; for more detail, clients can consult the Gartner report ‘Big Data’ Is Only the Beginning of Extreme Information Management. My point here is that it’s no accident that Amazon has reported that it started 2 million Elastic MapReduce (EMR) clusters – in a single year. And if that many are already on Amazon, think of the other platforms – bet there are more than a few there? You’d be right, but I won’t belabor that here – they aren’t hard to find. The growth of cloud platforms for big data is not likely to slow, and it will remain a great choice for early uses, speculative projects that may need to spin down as quickly as they spun up if they don’t prove to be useful, and ones whose economics just plain work. And for many projects, the cloud will remain the most reasonable economic choice. That’s why some of this year’s key announcements will focus there – stay tuned.
Second is our default choice so far: buy some nodes. Then buy some more. Rinse and repeat. Early adopters, even the mammoth new web-based firms that got all this started, did this, and still do. Some sites literally have people who spend the day going up and down rows of racks pulling failed inexpensive disks and replacing them. It’s the source of the HDFS “3 copies, one on a separate rack” default, and it works. You can buy a quad-core Supermicro data node (or several) with twelve 500 GB hard disks in it – 6 TB – for $7K or so, and a name node for $4K with more memory and less disk (why? that’s another post too) and you’ll be working with a Cloudera-certified platform.
You can spend less – and you can spend more – but the numbers are still as compelling as ever. Buy some racks, fill ‘em up with a few dozen nodes and you’re into a couple of hundred terabytes for a couple of hundred thousand (insert currency of your choice here). More expensive than the cloud, but nothing like the big server/storage combination bucks your brethren in the data center are spending for those big RDBMS platforms. Be warned: if you don’t know how to deploy, operate and optimize a cluster, you have a lot to learn. And there is a good chance your data center folks, if you have some, will need new skills even if they are already good at operating what is in there today.
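For the skeptics, here is the napkin math behind that claim, sketched in code. The per-node prices and disk counts come from the paragraphs above; the 40-node count and the 25% non-HDFS overhead are my illustrative assumptions, so treat the output as ballpark, not a quote.

```java
// Rough cluster sizing sketch using the ballpark figures in the post.
// Node count and overhead factor are illustrative assumptions, not quotes.
public class ClusterNapkin {
  public static void main(String[] args) {
    int dataNodes = 40;                 // "a few dozen nodes" (assumed)
    double diskTb = 0.5;                // 500 GB disks
    int disksPerNode = 12;              // 12 x 500 GB = 6 TB raw per node
    int replication = 3;                // HDFS default: 3 copies, one on a separate rack
    double nonHdfsOverhead = 0.25;      // temp/shuffle/OS working space (assumed)

    double rawTb = dataNodes * disksPerNode * diskTb;
    double usableTb = rawTb * (1 - nonHdfsOverhead) / replication;

    double dataNodeCost = 7_000;        // quad-core Supermicro data node, per the post
    double nameNodeCost = 4_000;        // name node: more memory, less disk
    double hardwareCost = dataNodes * dataNodeCost + nameNodeCost;

    System.out.printf("Raw capacity:    %.0f TB%n", rawTb);        // ~240 TB
    System.out.printf("Usable (3x rep): %.0f TB%n", usableTb);     // ~60 TB of unique data
    System.out.printf("Hardware:        $%,.0f%n", hardwareCost);  // ~$284,000
  }
}
```

The “couple of hundred terabytes” above is raw capacity; after three-way replication and working space, the unique data you can actually hold is a good deal smaller – worth remembering when you line this up against an RDBMS quote.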
Finally, and of most financial, vendor leadership and internal standards import, is the newest choice: appliances. At least 6 plays are making their way to market for Hadoop users: EMC Greenplum’s Data Computing Appliance, the Cisco Platform for NetApp Open Solution for Hadoop, HP’s AppSystem for Apache Hadoop, IBM’s PureData System for Analytics, Oracle’s Big Data Appliance, and Teradata’s Aster Analytic Appliance. (I’m sure there are others I’ve left out here, and there are data warehouse appliances and specialty plays like Yarc’s Urika for graph data applications [not Hadoop], but this is a good start.)
The big questions to ask here are: whose software are you running, what else do you get in the package beside metal and a Hadoop distribution, how much easier will it be to operate than buying your own nodes, what support will you get – and the big one: how much does it cost? In 2013, the market will begin to decide if the value proposition of appliances will play here – is the premium (and make no mistake, there is one) you pay worth the quicker time to deployment, operational and management help, and agility you get? That discussion is deep and detailed, and beyond our scope here, but I’m looking forward to continued conversations with Gartner clients who are making these choices as the market develops.
Next time: players. And there are some new ones. Don’t miss it.
by Merv Adrian | February 21, 2013 | 1 Comment
In Part One of this series, I pointed out how much attention is being lavished on performance in 2013. In this installment, the topic is projects, which are proliferating precipitously. One of my most frequent client inquiries is “which of these pieces make up Hadoop?” As recently as a year ago, the question was pretty simple for most people: MapReduce, HDFS, maybe Sqoop and even Flume, Hive, Pig, HBase, Lucene/Solr, Oozie, Zookeeper. When I published the Gartner piece How to Choose the Right Apache Hadoop Distribution, that was pretty much it.
Since then, more projects have matured. More have entered incubator status. And alternatives to Apache projects have gained more traction in distributions and in customer sites whose portfolio is more expansive. I’ve talked before about my ongoing stack model that attempts to sort this out – you may have seen it in an earlier blog post. I’ve updated it a little, and in this version, you can see that the “original core” projects are bolded. A few others are too, to be discussed in my planned Hadoop Tutorial presentation at the upcoming Gartner BI Summit, March 18-20 in Grapevine, Texas, where I’ll drill into the bolded ones in more detail.
Projects (and alternatives) for the Hadoop stack
In 2013, the list of projects, alternatives, and supporting technology to watch will change as commercial distributions continue to expand what they contain and support, as more and more use cases focus on issues like machine learning (Mahout) or text search and analytics (Lucene and Solr), and as new processing paradigms begin to compete with MapReduce under Apache Hadoop 2.0. Metadata will matter, so HCatalog will turn a lot of heads. Graph processing may begin to show up if Giraph gets some traction. And there’s more:
Apache Avro - the interest in data serialization is expanding with sensor and other machine generated data. Just ask Splunk.
Apache Accumulo – a secure datastore built by guys from the NSA, investigated by the Senate? Of course you’re interested.
Apache Ambari – covered in the last post. An open source management platform.
Apache Bigtop – packaging and testing a collection of your own? This is for you.
Apache Blur (incubating) for search in cloud environments – Doug Cutting is a committer on this one.
Apache Cassandra – an alternative, distributed datastore that has won POCs against pure Hadoop in some use cases I’ve seen.
Apache Chukwa – data collection on your system, for monitoring.
Apache Crunch (incubating) – a “quicker to implement than MapReduce programming” choice, for building, testing and running pipelines.
Apache Drill (incubating) – one of several entrants in the “real-time analytics” sweepstakes – and there will be others.
Apache Giraph (incubating) for graph processing uses – one of the first examples of the changes Yarn will enable.
Apache Hama for Bulk Synchronous Parallel computing in scientific computations.
Apache Kafka – a publish and subscribe system.
Apache Mahout - already being supported by several distributions – machine learning is a key new use.
Apache Whirr – a library for running services in the cloud (including a Hadoop cluster, of course.)
Cascading – not really a project but a development platform, commercialized by Concurrent.
DataFu – also not an Apache project, but a collection of Pig UDFs developed at LinkedIn.
Dataguise DG for Hadoop – a security offering of great value in an insecure platform, which Hadoop certainly is today.
Hadapt – another “alternative datastore” contender, not open source, but offering a relational store right on your cluster.
HStreaming – along with IBM’s inclusion of InfoSphere Streams in its BigInsights distribution, Twitter’s Storm and the well established SQLstream, we’ll see more interest in realtime streaming operational processing as a counterpoint to the interest in realtime analytics that will be another key development this year.
Rainstor – again, not open source, but highly compressed Hadoop sounds pretty appealing. Check it out.
VMware Serengeti – aimed at creating virtualized, highly available, multi-tenant Hadoop. Big possibilities for this one.
I haven’t gone into the various analytics plays here. That’s a post for another time, and it’s arguably a “layer above.” (Or in the case of my diagram, below.) There’s only so much you can fit into a reasonable post and it’s time to end this one. Next time: platforms.
by Merv Adrian | February 16, 2013 | 11 Comments
It’s no surprise that we’ve been treated to many year-end lists and predictions for Hadoop (and everything else IT) in 2013. I’ve never been that much of a fan of those exercises, but I’ve been asked so much lately that I’ve succumbed. Herewith, the first of a series of posts on what I see as the 4 Ps of Hadoop in the year ahead: performance, projects, platforms and players.
Performance concerns are inevitable as technologies move from early adopters, who are already tweaking everything they build as a matter of course, to mainstream firms, where the value of the investment is always expected to be validated in part by measuring and demonstrating performance superiority. It also becomes an issue when the 3rd or 4th project comes along with a workload profile different from those that came before – and it doesn’t perform as well as those heady first experiments. Getting it right with Hadoop is as much art as science today – the tools are primitive or nonexistent, the skills are more scarce than the tools, and experience – and therefore comparative measurement – is hard to come by.
What’s coming: newly buffed up versions of key management tools. It’s one method of differentiating distributions in a largely common set of software – Hortonworks doubling down on open source Apache Ambari, Cloudera enhancing Cloudera Manager, MapR’s updated Control System (as well as their continued touting of DIY favorites Nagios and Ganglia.) EMC, HP, IBM and other megavendors are continuing to instrument their existing, and familiar, enterprise tools to reach this exploding market. It will be a busy bazaar.
Resources are proliferating to help: published work like Eric Sammer’s Hadoop Operations (somewhat Cloudera-centric but very well organized and useful). A plethora of Slideshare presentations designed to help navigate the arcana of cluster optimization, workload management and configuration tuning is appearing.
Performance has figured in a number of proof of concept (POC) tests pitting distributions against one another that I’ve heard about from Gartner clients. Some have been inconclusive; some have had clear winners. As we’ve seen in DBMS POCs over the years, your data and your workloads matter, and your results may differ from others’. I’ve seen replacements of “first distributions” by another, as performance or differing functionality comes to the fore. I’ve even seen a case where a Cassandra-based alternative won out over the Hadoop distributions.
Next time: projects proliferate.
by Merv Adrian | February 10, 2013 | 15 Comments
“Hadoop people” and “RDBMS people” – including some DBAs who have contacted me recently – clearly have different ideas about what Data Integration is. And both may differ from what Ted Friedman (twitter: @ted_friedman) and I (@merv) were talking about in our Gartner research note Hadoop Is Not a Data Integration Solution, although I think the DBAs’ concept is far closer to ours.
We went to some lengths to precisely map Gartner criteria from the Magic Quadrant for Data Integration Tools (see below) to the capabilities of what most people would consider the Hadoop stack – Apache projects that are supported in a commercial distribution. Many of those capabilities were simply absent, with nothing currently available to perform them.
Moreover, even to the degree that some pieces/projects might meet some of the needs, there is nothing that ties them together into a “solution,” which itself was a carefully chosen word. Today, with Hadoop projects in general, we very often see bespoke, self-integrated, “build it yourself and good luck operating it” structures. By contrast, solutions, including those for data integration, provide the relevant pieces coherently in a way that ties together design, operation, optimization and governance. Leaving aside the absence of data quality tools or profiling tools of any kind in today’s supported Hadoop project stack, we don’t see that yet. And Ted and I note in our piece that Hortonworks, for example, implicitly acknowledged that by bundling Talend into its distribution. Talend itself places rather well in the Gartner Magic Quadrant for DI tools.
Hadoop is very useful for a lot of things – including analytics of some kinds, and ETL of some kinds, and for low-cost exploitation of data that is unsuitable for persisting in RDBMSs for a variety of reasons. It’s maturing, and steadily adding more capabilities, and is driving an economic refactoring of data storage and processing which will result in some (increasing amounts of) data being kept there and some (increasing amounts of) processes being performed there. In Gartner’s Logical Data Warehouse model, it occupies the spot for Distributed Process use cases. The size of that part of the landscape relative to repositories and to virtualization is yet to be determined. It will take some years to sort out, and it won’t stand still.
But platforms are not solutions. Hadoop can very much be a platform on which a DI solution can be built. But a solution? Not yet. For that, talk to the folks in the MQ referenced above. [added 2/13] Thanks for your comments and tweets – and keep them coming!
by Merv Adrian | January 30, 2013 | 8 Comments
2013 promises to be a banner year for Apache Hadoop, platform providers, related technologies – and analysts who try to sort it out. I’ve been wrestling with ways to make sense of it for Gartner clients bewildered by a new set of choices, and for them and myself, I’ve built a stack diagram that describes possible functional layers of a Hadoop-based model.
The model is not exhaustive, and it continually evolves. In my own continuous collection and update of market, usage and technical data, it serves as a scratchpad I use – every project/product name in the layers is a link to a separate slide in a large deck I use to follow developments. As you can see below, it contains many Apache and non-Apache pieces – projects, products, vendors – open and closed source. Some are quite low level – for example Trevni can be thought of as a format used inside Avro – but I include them at least in part because I keep track of “moving parts,” and in the world of open source, that means a lot of pieces that are independent of one another.
Part of the effort so far has been on relating this model to Gartner’s Information Capabilities Framework, an enormously useful view of the verbs we use to compose our semantic use cases in building business applications. My colleague Ted Friedman and I just used the two models to assess how Hadoop stacks up as a Data Integration solution. Not surprisingly, I suppose, we found it wanting. You can see our research here if you’re a Gartner client.
I expect further refinement of this stack in the weeks ahead, and more offerings at each layer as it evolves as well. I’m trying to keep it simple – at 6 layers it’s already getting heavy, and I’d hate to add more. But that may be unavoidable. Your feedback here will be helpful – please offer comments if you have any! As a guide to choice, simplicity is a much-desired, but often unobtainable, objective.
by Merv Adrian | December 27, 2012 | 6 Comments
I had an inquiry today from a client using packaged software for a business system that is built on a proprietary, non-relational datastore (in this case an object-oriented DBMS.) They have an older version of the product – having “failed” with a recent upgrade attempt.
The client contacted me to ask about ways to integrate this OODBMS-based system with others in their environment. They said the vendor-provided utilities were not very good and hard to use, and the vendor has not given them any confidence it will improve. The few staff programmers who have learned enough internals have already built a number of one-off connections using multiple methods, and were looking for a more generalizable way to create a layer for other systems to use when they need data from the underlying database. They expect more such requests, and foresee chaos, challenges hiring and retaining people with the right skills, and cycles of increasing cost and operational complexity.
My reply: “you’re absolutely right.”
Their only recourse, absent utilities that can extract data when needed (via federation, virtualization or distributed processing models as identified in Gartner’s Information Capabilities Framework), is to build and materialize an intermediate datastore – ODS- or DW-style. That means an exhaustive effort to anticipate all likely future requests for data. It’s unlikely they will be able to do so, and any change to that system will be as difficult as the one-offs they build now. They could literally recreate the entire database in an alternative architecture that their other systems can access via conventional methods, but the cost of building and maintaining such a redundant “layer” might rival replacement cost.
Thus, my conclusion: their vendor has demonstrated itself to be an architectural cul-de-sac they might want to consider exiting. If you have any similar systems in your portfolio, here’s a New Years’ resolution for you: back out of that dead end. As soon as you can.
by Merv Adrian | December 8, 2012 | 7 Comments
At its first re:Invent conference in late November, Amazon announced Redshift, a new managed service for data warehousing. Amazon also offered details and customer examples that made AWS’ steady inroads toward enterprise, mainstream application acceptance very visible.
Redshift is made available via MPP nodes of 2TB (XL) or 16TB (8XL), running the Paraccel PADB high-performance columnar, compressed DBMS, scaling to 100 8XL nodes, or 1.6PB of compressed data. XL nodes have 2 virtual cores, with 15GB of memory, while 8XL nodes have 16 virtual cores and 120 GB of memory and operate on 10Gigabit Ethernet.
Reserved pricing (the more likely scenario, involving a commitment of 1 year or 3 years) is set at “under $1000 per TB per year” for a 3 year commitment, combining upfront and hourly charges. Continuous, automated backup for up to 100% of the provisioned storage is free. Amazon does not charge for data transfer into or out of the data clusters. Network connections, of course, are not free - see Doug Henschen’s Information Week story for details.
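A quick sanity check on those numbers, again as a hedged sketch: the per-TB figure below is Amazon’s advertised “under $1,000 per TB per year” ceiling for a 3-year reservation, and the 10-node cluster is an arbitrary example of mine, not an Amazon configuration.

```java
// Redshift ceiling math from the figures above; the per-TB-per-year number
// is Amazon's advertised "under $1,000" bound for a 3-year reservation.
public class RedshiftNapkin {
  public static void main(String[] args) {
    double eightXlNodeTb = 16;          // 8XL node: 16 TB of compressed storage
    int maxEightXlNodes = 100;          // current ceiling per cluster

    double maxCompressedPb = maxEightXlNodes * eightXlNodeTb / 1000.0;  // 1.6 PB
    double perTbPerYearCeiling = 1000;  // 3-yr reserved, upfront + hourly blended

    double clusterTb = 10 * eightXlNodeTb;      // a 10-node 8XL cluster (assumed example)
    double annualCeiling = clusterTb * perTbPerYearCeiling;

    System.out.printf("Max cluster: %.1f PB compressed%n", maxCompressedPb);
    System.out.printf("10 x 8XL:    %.0f TB, under $%,.0f/year on a 3-year term%n",
        clusterTb, annualCeiling);
  }
}
```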
This is a dramatic thrust in pricing, but it does not come without giving up some things. For example, Amazon has not licensed Paraccel’s high-speed data import utilities; it is far more focused at this point on enabling movement between its own Elastic MapReduce, DynamoDB and S3 storage and Redshift. Thus the early focus, and likely early adoption, is Amazon’s customers’ data already in the cloud. Movement from existing data warehouses will come later. Today, that would require exporting data into S3 and then moving it into a (designed) Redshift data warehouse using Amazon’s data movement utilities, which were not shown in detail. Design doesn’t disappear, and it’s not free. As my colleague Mark Beyer said in an email discussion:
Data warehouse and analytics expertise is harder to come by than many believe. With Amazon Redshift providing services to initiate and operate the data warehouse in lieu of Paraccel’s management interface and tools, it is left up to the Redshift implementer to “provide the data warehouse chops.” While I’m sure that any good Cloud application jockey knows their stuff, any data warehouse veteran on the planet knows that letting the apps guys write analytics is like asking your doctor to be the striker on your football team (what we call Soccer here). It is entirely likely that an entire cottage industry of “expert implementers for analytics in the Cloud” will appear on the near horizon.
It’s also not clear how much database (not deployment and operating) control will be made available. Paraccel offers plenty of knobs and buttons. Tweaking performance by configuring memory, pinning tables there, looking at how data is packed inside the “slices” – it does not appear any of that will be exposed in the Redshift version. Nor is it obvious how to build ongoing update for a Redshift data warehouse yet.
Another missing “feature” is the support model one gets from a software firm like Paraccel – the level and nature of support in an Amazon environment today is quite different. Still, this is a work in progress. It was evident at re:Invent that Amazon is building up and enhancing its enterprise-facing team, and I had an interesting conversation with them about how the engagement model for an enterprise that has had several individuals “unofficially” contracting for projects on their own transitions to a corporate model. They have seen this play out a number of times now, and it’s becoming a better understood play for them.
One final comment about the vendors’ relationship: it is not as close as I suspect Paraccel would have hoped. After a multi-million dollar investment and over a year of joint work, it was surprising not to see Paraccel’s CEO on stage for the announcement, or even a synchronized press release. This reflects the relative arm’s-length nature of this arrangement. In my subsequent conversations with them, it became clear that Amazon expects their offering to diverge from Paraccel’s over time as they add their own pieces around the part they have licensed for use in Redshift. And there was no publicized joint marketing or sales initiative.
It remains to be seen if the whole elasticity value proposition (scale up, scale down) proves as relevant to data marts and data warehouses as it does to the apps that Amazon is more accustomed to hosting, or how quickly enterprises will move their data to a public cloud. Warehouses don’t scale down. But analytic platforms used for experimenting will, and this may create a great opportunity for Amazon. Gartner clients can see our position on other dimensions of this announcement in a First Take Mark Beyer and I just published.