by Merv Adrian | October 3, 2013 | 1 Comment
My Gartner APAC colleague Darryl Carlton and I recently discussed the obscene ratio between CEO pay and average worker pay in the US. And this IS about the US – we are supporting an astonishing gap compared to the rest of the world, and high tech vendors like Oracle are not the only ones at the top of the list – Larry Ellison comes in only number 4 on this Bloomberg list, pulling down 1,287 times what an average Oracle worker (not impoverished at nearly $75K per year) collects.
Personally, I was surprised that the banking community was not represented “better” here – but perhaps that’s because there seem to be dozens of executives, not just the CEOs, in every firm currently hauling cash home in wheelbarrows as mortgages remain underwater and pension funds and healthcare plans get de-funded. At least they share with each other.
In my conversation with Darryl, we discussed the link to company performance. Pick your measure – stock price, revenue improvement, etc. – in general those gorging at the trough are not dramatically outperforming their peers – and they certainly are not doing 10 times, or a hundred times, better than their counterparts in other countries. But look at how the US plutocrats compare to their brethren in other countries – Japan at 11:1, Britain at 22:1 – and the US at 475:1.
It’s all in the table linked below.
CEO Pay Ratio
Darryl notes that
Drucker argues that CEO pay which exceeds 20-25 times the average paid in that company is bad for business. The new law in Australia is that if 25% of the stockholders vote down executive pay twice (two strike rule) at successive AGM’s then the Board can be replaced. The CEO of Lenovo in China has once again distributed his bonus to the lowest paid factory workers.
Not here, I’m afraid. And I want to say so much more, but only one word is needed, as far as I’m concerned:
Category: Industry trends Tags: Industry trends
by Merv Adrian | September 6, 2013 | 12 Comments
Many things have changed in the software industry in an era when the use of open source software has pervaded the mainstream IT shop. One of them is the significance – and descriptive adequacy – of the word “proprietary.” Merriam-Webster defines it as “something that is used, produced, or marketed under exclusive legal right of the inventor or maker.” In the Hadoop marketplace, it has come to be used – even by me, I must admit – to mean “not Apache, even though it’s open source.”
But Hadoop is not a thing, it’s a collection of things. Apache says Hadoop is only 4 things – HDFS, MapReduce, YARN, and Common. Distributions are variable collections that may contain non-Apache things, and may be called “proprietary” even when all the things in them are open source – even marketed under Apache licenses, if they are not entirely Apache Projects.
To the most aggressive proponents of open source software, this use of “proprietary” equates to “impure,” even “evil.” And some marketing stresses the importance of “true openness” as an unmitigated good as opposed to the less-good “proprietary,” even when the described offering is in fact open source software.
Some vendors respond to market needs with distributions containing open source-licensed components that come from a community process outside the Apache model – where a committee of people from other companies with their own agendas might delay, or seek to change, the implementation or even functionality of “their” software. Many Apache projects have started in other communities, like git, and were proposed later to Apache. Some call these distributions or components “proprietary,” hence less pure, less good, less worthy of consideration.
To the consumer – the enterprise that acquires its Hadoop from a distributor that is the only provider offering that particular capability – it’s a dilemma, and a mixed bag. If they need the functionality, and no Apache project with similar functionality (if one exists) is supported by some other distributor, but the distributor offers and supports it, it seems obvious to go there. Apache, after all, does not support its projects. Vendors (distributors) do.
On the other hand, if and when Apache gets there and someone supports that Project, it will potentially be more difficult to move from one distribution to another. Organizations are increasingly wary of switching costs, even as they embrace the rapid innovation – and resultant disruption – of the open source revolution.
“Purity” is not the question; timing and availability are. Consumers need to buy what will work, and want to buy what will be supported. And the use of “proprietary” here is both inaccurate and not to the point.
So I’m proposing to use a different expression in my future discussions: “distribution-specific.” It can apply today to an Apache project if only a specific distributor includes it. And it will apply to vendor enhancements, even if API-compatible and open source, where the same is true.
I’d love to hear your thoughts – and your nominations for the list. Some are obvious: Cloudera Impala, MapR’s file system, IBM’s JAQL. Or Apache Cassandra, listed on the Apache Hadoop page linked above as a “Hadoop-related project” along with HBase, Pig and others. Only one company commercializes a software distribution with it – Datastax. And they don’t even call themselves a Hadoop distribution. What do you think of all this? Please leave your comments.
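To make the proposed term concrete, here is a minimal sketch of the "distribution-specific" test: a component qualifies when exactly one distributor includes it. The mapping below is illustrative only. It uses names mentioned on this blog, it is not a complete or current market survey, and the function names are hypothetical.

```python
# Illustrative data only: component -> distributors named on this blog as shipping it.
# This is an assumption-laden sample, not a market survey.
SHIPPED_BY = {
    "Impala": {"Cloudera"},
    "MapR file system": {"MapR"},
    "JAQL": {"IBM"},
    "Cassandra": {"Datastax"},
    "YARN": {"Cloudera", "Hortonworks"},
}


def is_distribution_specific(component, shipped_by=SHIPPED_BY):
    """Return True when exactly one distributor includes the component."""
    return len(shipped_by.get(component, set())) == 1


def distribution_specific(shipped_by=SHIPPED_BY):
    """Return the sorted names of all distribution-specific components."""
    return sorted(
        name for name in shipped_by if is_distribution_specific(name, shipped_by)
    )


if __name__ == "__main__":
    for name in distribution_specific():
        (vendor,) = SHIPPED_BY[name]  # exactly one distributor by definition
        print(f"{name}: distribution-specific ({vendor})")
```

Note that the test says nothing about licensing: an Apache project and a vendor enhancement are treated the same way, which is the point of the term.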
Category: Apache Apache Yarn Big Data BigInsights Cassandra Cloudera Hadoop Hbase IBM MapR MapReduce open source OSS Pig YARN Tags: Apache, big data, BigInsights, Cassandra, Cloudera, Datastax, Hadapt, Hadoop, Hbase, HDFS, IBM, open source, OSS, Pig, Yarn
by Merv Adrian | July 15, 2013 | 10 Comments
Probably the most widespread, and commercially imminent, theme at the Summit was “SQL on Hadoop.” Since last year, many offerings have been touted, debated, and some have even shipped. In this post, I offer a brief look at where things stood at the Summit and how we got there. To net it out: offerings today range from the not-even-submitted to GA – if you’re interested, a bit of familiarity will help. Even more useful: patience.
The earliest uses of Hadoop were – and most still are – ETL-style batch workloads of Java MapReduce code for extracting information from hitherto unused data sets, sometimes new, and sometimes simply unutilized – what Gartner has called “dark data.” But Hadapt had already begun talking in 2011 about putting SQL tables inside HDFS files – what my friend Tony Baer has called “shoehorn[ing] tables into a file system with arbitrary data block sizes.” We highlighted them in early 2012. But this thread, though it continues, was about pre-creating relational tables – suitable for the known problems, like the existing data warehouses and data marts, and thus not a complete answer for the new, exploratory and unpredictable jobs Hadoop advocates envisioned.
Pre-defined or not, SQL access would need metadata, and the indispensable DDL it would provide at runtime if not in advance. The HCatalog project was incubating – first outside, then inside SQL interface project Apache Hive (itself first developed at Facebook in 2009), getting significant support from Teradata and Microsoft partnering with Hortonworks. RDBMS vendors were doing what they always do with new wrinkles at the innovative market edges – co-opting them with “embrace and extend” strategies. Thus, HP Vertica, IBM DB2 and Netezza, Kognitio, EMC Greenplum, Microsoft SQL Server, Oracle, Paraccel, Rainstor and Teradata all offered SQL-oriented ways to call out to the Hadoop stack, using external table functions, import pipes, etc. But actual SQL, for use directly against Hive (or HCatalog) metadata from inside their engine was so far a “not exactly, but…” kind of proposition.
At Strata/Hadoop World in October 2012, Cloudera announced its Impala, an already visible SQL engine bypassing MapReduce and executing directly against HDFS and HBase; it is now available in the extra cost add-on RTQ (real time query) option to CDH Enterprise. Although Apache Drill (based on Google’s Dremel) had been announced a few months before, it was and is still listed as Incubating by Apache. That is a pre-Project, arguably pre-alpha state, though MapR’s plan to support it commercially is imminent. Microsoft publicly described its plans for Polybase, which is included in SQL Server 2012 Parallel Data Warehouse V2. Platfora, a BYOH (bring your own Hadoop) BI player, announced its interactive tool with a memory-cached embedded store, opening another front in the battles for customers to muddle over – a database or a tool? Which is a better fit for me?
In February, Hortonworks had thrown the “open, Apache, Hadoop” hat into the ring by announcing the Stinger initiative, promising a 100x speedup – and a few months later the labors of 55 developers from several companies progressed Hive to release – wait for it – 0.11. This is a platform for further development; in its upcoming stages, Stinger will make use of yet another Incubator project, in this case Tez.
Shortly after that, EMC Greenplum (by then using Pivotal as a product, not corporate, name) announced its HAWQ (“Hadoop With Query”). It added some wrinkles by grafting the commercially developed and market-hardened Greenplum MPP Postgres-based DBMS directly to HDFS (or EMC’s Isilon OneFS) – and promising in interviews to be half the price of leading alternatives.
In April, IBM announced its BigSQL, and Teradata unveiled SQL-H, both making real “SQL against Hadoop” part of their portfolios (Teradata already had Aster’s SQL-MR, first introduced in 2008.) Oracle announced an updated Big Data Appliance and some software component upgrades. Oracle again left its MySQL unit out of the conversation, relegated to making its own announcement of a MySQL Applier for Hadoop, which replicates events to HDFS via Hive, extending its existing Sqoop capability. MySQL continues to be remarkably invisible in Oracle big data messaging, despite the widespread presence of the LAMP stack in big data circles.
There you have it – the state of SQL on Hadoop at the Summit. This discussion doesn’t even include things that have not even been entered into the races yet, such as Facebook’s yet-to-be-submitted-to-Apache Presto interface. A closing thought or two:
- None of this is real-time, no matter what product branding is applied. On the continuum between batch and true real-time, these offerings fall in between – they are interactive. And the future of Hadoop in the enterprise includes both interactive and real-time. As Hadoop will be increasingly used for operational purposes, critical real-time applications will require continuous availability for successful deployment in large global organizations. There are stirrings on the horizon, but nonstop operation is still aspirational, not easily available.
- There continues to be much hype about the advantages of open source community innovation. In this case, it’s often innovating what has been in SQL since standards like SQL92 were agreed upon and deployed by those slowpoke commercial vendors.
- The reality is, the open source community struggles to be complete and timely with its offerings. YARN’s lateness was discussed in my first post on the Summit. Two weeks before that, at HBaseCon I watched a speaker asking his audience to volunteer for features identified as needed for the roadmap. Sometimes, a commercial product manager is a blessing indeed.
P.S. Yes, I know the syntax of the title is not correct. Call it literary license.
Category: Apache Apache Drill Apache Yarn Aster Big Data Cloudera data warehouse DBMS Gartner Hadapt Hadoop HCatalog HDFS Hive Hortonworks IBM MapR MapReduce Microsoft Netezza Oozie Oracle Rainstor RDBMS Real-time SQL Server Sqoop Teradata YARN Tags: Apache, Aster, big data, BigSQL, CDH, Cloudera, data warehouse, DB2, Drill, EMC, ETL, Greenplum, Hadapt, Hadoop, HAWQ, Hbase, HCatalog, HDFS, Hive, Hortonworks, HP, Impala, Isilon, Kognitio, MapR, MapReduce, MPP, MySQL, OneFS, Oracle, Paraccel, Platfora, Polybase, Postgres, Rainstor, SQL, Sqoop, Stinger, Teradata, Tez, Vertica
by Merv Adrian | July 13, 2013 | 3 Comments
A brief rant here: I am asked with great frequency how this RDBMS will hold off that big data play, how data warehouses will survive in a world where Hadoop exists, or whether Apple is done now that Android is doing well. There is a fundamental fallacy implicit in these questions.
Comparing what someone new and shiny may be claiming they will do a year from now with what someone established is already doing today is foolish. The established vendor being compared is not likely to stand still. In fact, it may well have got where it is precisely because it has learned to sustain innovation. In the big data world, to acknowledge that, say, the uniqueness of MapR’s current storage solution compared to HDFS will likely erode over time is accurate. But to assume MapR will stand still while that happens is not; they are several releases, and several different innovations, in. They still may fall behind – but not because they stood still.
How do I handle these questions as an analyst? By sticking with what is shipping, in production, with referenceable customers. To advise someone who has a need for technology that they should wait until some uncertain point in time when an open source provider may have some technology ready that will compete with today’s enterprise-ready, supported product strikes me as very poor advice. If they don’t need it now, they should wait anyway, and evaluate the options when they do.
This ties closely to my often-offered comment that it is the Silly Con Valley (thanks to Paul Kent at SAS for that one) disease to believe that once we write it on the whiteboard it’s ready. It’s bad enough to compare to what we know will go GA at a relatively predictable time (like a SQL Server release) but to compare to something whose feature list is on a request for volunteers at an open source meetup is entirely different.
Category: Big Data data warehouse Gartner Hadoop HDFS MapR RDBMS SQL Server Tags: data warehouse, Hadoop, HDFS, MapR, R, RDBMS, SAS, SQL Server
by Merv Adrian | July 10, 2013 | 7 Comments
I had the privilege of keynoting this year’s Hadoop Summit, so I may be a bit prejudiced when I say the event confirmed my assertion that we have arrived at a turning point in Hadoop’s maturation. The large number of attendees (2500, a solid increase – and more “suits”) and sponsors (70, also a significant uptick) made it clear that growth is continuing. Gartner’s data confirms this – my own inquiry rate continues to grow, and my colleagues covering big data and Hadoop are all seeing steady growth too. But it’s not all sweetness and light. There are issues. Here we’ll look at the centerpiece of the technical messaging: YARN. Much is expected – and we seem to be doomed to wait a while longer.
Here is a great summary of YARN, also known as Hadoop 2.0, posted after the Summit:
MapReduce is great for batch processing large volumes of distributed data, but it’s less than ideal for real-time data processing, graph processing and other non-batch methods. YARN is the open source community’s effort to overcome this limitation and transform Hadoop from a One Trick Pony to a truly comprehensive Big Data management and analytics platform.
Sounds great, doesn’t it? Problem is this was posted by Jeff Kelly last August, after Hadoop Summit 2012. Now, YARN is being used – on Yahoo’s 30,000 nodes, for example – but Apache still calls it Alpha as of this writing (July 9, 2013.) Next announcement, when it comes, will be beta. Some distributions, like Cloudera CDH 4.2, are already supporting it anyway. Hortonworks HDP 2.0, which includes YARN, is in Community Preview (what we enterprise guys like to call beta). MapR doesn’t list it yet – the search engine on their site comes up empty if you search for it. So we aren’t quite there yet.
One other note: confusion continues – I see it in my inquiries about “what IS Hadoop?” Two of what Apache lists as the 3 core Hadoop components will be substitutable now – you can already swap out HDFS for IBM’s GPFS, Intel’s Lustre, or MapR’s storage layer. As YARN comes to market, other engines will be swappable for MapReduce. Graph engines and “closer to real-time” processing are next on the horizon, as Storm is getting great traction and several Summit presenters of real world case studies alluded to their use of it. Yahoo! has open sourced its Storm-YARN code, which it runs internally, so expect more productionization ahead. So the answer to “what is Hadoop, exactly?” will become even more complicated.
Will this confusion hurt the market, and slow adoption? Hard to say. The uncertain part of the market remains so; Gartner’s 2012 Research Circle found 31% of enterprises had no plans for Big Data investment. In 2013, the number was the same. YARN will broaden the set of possible use cases, and raise many questions. Let’s hope it’s ready to start answering them soon.
Category: Apache Apache Yarn Big Data Cloudera Gartner graph databases Hadoop HDFS Hortonworks IBM Intel MapR MapReduce Storm Yahoo! YARN Tags: Apache, big data, Cloudera, Hadoop, HDFS, Hortonworks, IBM, MapR, MapReduce, Yahoo!
by Merv Adrian | March 9, 2013 | 22 Comments
I don’t often do a pure opinion piece but I feel compelled to weigh in on a question I’ve been asked several times since EMC released its Pivotal HD recently. The question is whether it is somehow inappropriate, even “evil,” for EMC to enter the market without having “enough” committers to open source Apache projects. More broadly, it’s about whether other people can use, incorporate, add to and profit from Apache Hadoop.
The fact is, there is an entire industry building products atop Apache open source code – and that is the point of having Apache license its projects and provide the other services it does for the open source community. The license permits such use, and companies using the Apache web server, Lucene and SOLR, Cassandra and CouchDB, and many others are everywhere. Others are building BI tools or DI tools that integrate with Apache Hadoop, or selling consulting to incorporate it into solutions. Again – that is the point. Having some components of your solution stack provided by the open source community is a fact of life and a benefit for all. So are roads, but nobody accuses FedEx or your pizza delivery guy of being evil for using them without contributing some asphalt. Commercial entities (including software and IT services providers) provide needed products and services, employ people and pay taxes. We might want them to do more charitable work or make more open source contributions, and some do, but they are not morally obligated to do so. Some IT companies make huge commitments to charitable activities and some don’t – the same is true in all sectors of the economy.
I understand why open source advocates think they are defending their turf, and I know it’s a core belief that it matters how many committers you have. But I don’t believe the market will care as Hadoop moves into the mainstream. Buyers will choose the solutions that fit their needs, from suppliers who support them at a price they are comfortable with – and will do so whether the vendors have “enough” committers or not.
For clarity’s sake, this wasn’t a new market entry. EMC was already a purveyor of Hadoop-based solutions with their Greenplum HD and with a version based on MapR. That itself is a topic worth a sentence or two. EMC’s decision to offer a MapR-based distribution early on was very much a market choice – they did it for customers who demanded those features NOW (then) and couldn’t get them any other way. I don’t think EMC fooled those buyers, who asked for what EMC provided. Nor do I think EMC is morally reprehensible for building their own solution by leveraging something in their product portfolio (in this case, Isilon as a potential substitute for HDFS) and thus “abandoning” those customers.
Now, if EMC stops supporting those buyers, forces them to move to a new product to keep their support – well, then we can talk. But just to be clear, virtually every software company has an end of life policy on support for versions of its products. And again, some are more “oppressive” about it than others – and the topic is often very contentious. I get inquiries on it all the time. That topic has not even come up with EMC and MapR yet.
So a few deep breaths, please.
Dial it back.
Support open source. It’s a good thing. In fact, it’s transformative – it changes your choices, and often for the better, especially economically.
If you sell, by all means appeal to people who value purity. But let’s not try to have our cake and eat it too: if you sell a product based only on open source, or services that help people implement and profit from it, you’re part of the same economy as those who blend it with other pieces. Let’s compete on the basis of satisfying our customers at a fair price. The rest – well, that’s marketing. And we all know how much some people like that, and how seriously they take it.
Category: Apache Big Data Cassandra EMC Hadoop Lucene MapR open source Tags: Apache, big data, Cassandra, EMC, Hadoop, MapR, OSS
by Merv Adrian | March 8, 2013 | 1 Comment
The first three posts in this series talked about performance, projects and platforms as key themes in what is beginning to feel like a watershed year for Hadoop. All three are reflected in the surprising emergence of a number of new players on the scene, as well as some new offerings from additional ones, which I’ll cover in another post. Intel, WANdisco, and Data Direct Networks recently entered the distribution game, making it clear that capitalizing on potential differentiators (real or perceived) in a hot market is still a powerful magnet. And in a space where much of the IP in the stack is open source, why not go for it? These introductions could all fall into the performance theme as well – they are all driven by innovations intended to improve Hadoop speed.
Intel is by far the biggest of the new entrants. I discuss them along with my colleague Arun Chandrasekaran in a recent Gartner First Take: Hadoop Distribution Seeks to Leverage Intel’s Microprocessor Strengths. Net: processor-level exploitation and expertise in memory and IO architectures can drive great improvements. There’s more: Intel made several key partnership deals:
- An agreement with SAP around the HANA in-memory DBMS to collaborate on both a technology roadmap and go to market plans, including both streaming and bulk data movement between the two environments. The two firms plan to have a single-install deliverable later this year that will provide direct bidirectional queries and integrate management as well, building atop the already demonstrable support for SAP Data Services to pull data from Intel’s Hadoop distribution, and to deliver both on the same hardware platform. (Note that appliances that combine DBMS and Hadoop on a single rack are already available from EMC, HP and IBM – but that is another post.)
- A deal with MarkLogic Enterprise NoSQL to incorporate and support Intel’s distribution, seeing Intel’s chip-based encryption as a good complement to MarkLogic’s role-based NIAP and CCEVS-compliant security system. And of course to expand the reach of the MarkLogic engine to HDFS.
- An OEM agreement with Pentaho – the latter will be in the box. Its data mining, reporting, data discovery/visualizations, predictive analytics, and data integration will help round out the offering and make it easier to build without deep Java MapReduce skills – making an interesting foil to Hortonworks’ similar arrangement with Talend.
- Numerous other partnerships – over 20 – that include hardware, network and systems integrator deals.
WANdisco may be an unfamiliar name to some data management folks, but not to those using Apache Subversion, the open source version control system. As a leader in Wide Area Network Distributed Computing (there is an acronym-based name in there if you look) with patent-protected active-active peer-to-peer replication, WANdisco sees a performance-driven opening too. With a leadership team that made foundational contributions to Apache HDFS and Apache BigTop and helped build out Yahoo’s infrastructure, WANdisco comes to the table with the WDD distribution, based on Apache Hadoop 2.0 with support for WANs across data centers, including mirroring and auto-recovery. (It joins MapR in its ability to provide the latter, but without requiring the use of its own filesystem.) It supports Amazon S3 storage as well as HBase, and provides a console for wizard-based deployment, monitoring and management on both virtualized (VMware) and dedicated physical infrastructure. It also provides the usual support and consulting services and plans aggressive moves to ramp up following last year’s IPO on the London Stock Exchange and acquisition of Altostor.
Data Direct Networks (DDN), whose presence in the high performance computing (HPC) market may also be less well known to the typical Hadoop prospect, is targeting the mid- to upper end of the market with its hScaler appliance. Above 100 nodes, where significant enterprise production workloads run, and in the multi-thousand node space where government and data center customers operate, DDN is already often familiar for its Lustre-based ExaScaler and GridScaler filesystem plays. hScaler is pointed at the fact that by some estimates, 30% or more of a job on large clusters takes place on data that is not local to the node, despite the value proposition of Hadoop putting processing “next to the data.” This is the “shuffle” phase, which takes place between Map and Reduce – multiple times in a multi-job step workflow – and most are multi-step. This is a performance play that gets more attractive with larger size and more complexity. DDN touts its pipelining of Hadoop, and its ability to scale compute and storage independently as key differentiators. [edited] hScaler includes, and DDN supports, Hortonworks’ HDP, and has Pentaho for its “ETL graphical designer” and the DirectMon management console (like the ones I discussed in Part One), compatible with ExaScaler and GridScaler.
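For readers who have not looked inside a MapReduce job, here is a minimal single-process sketch of where the shuffle sits. It is plain Python, not Hadoop code, and the function names are hypothetical. On a real cluster the grouping step is the one that moves data between nodes, which is why it is a target for performance work.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in records:
        for word in line.split():
            yield word.lower(), 1


def shuffle_phase(pairs):
    """Shuffle: group all values by key.

    In this sketch it is an in-memory dictionary. On a cluster, this is
    the step where map output travels over the network to the node that
    runs the reducer for each key.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(records):
    """Run the three phases in order: map, then shuffle, then reduce."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    print(word_count(["Hadoop moves data", "data moves"]))
```

A multi-step workflow chains several such jobs, so the shuffle cost is paid once per job.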
There you have it: 3 new players, all of whom are focusing on the performance dimension as their reason for entering the market. This is a maturation step; in the earliest days of a new technology, it’s enough to be able to do it at all. Some of the tire-kickers are now moving on to evaluate alternatives against one another not just on what functions they have, but on how well they do them. POCs are appearing in my Gartner inquiries, and the results, not surprisingly, vary by workloads, by the data types, and by the skills of the field staff involved in the tests. We’ll see lots of benchmarketing in the months ahead. Just remember that your mileage may vary, and ALWAYS test with a POC bakeoff.
Category: Amazon Apache Big Data Gartner Hadoop Hbase HDFS Lucene MapR MapReduce Tags: Apache, DBMS, DDN, EMC, Gartner, HANA, Hbase, HDFS, Hortonworks, HP, HPC, IBM, Intel, Lustre, MapR, MapReduce, MarkLogic, Pentaho, S3, SAP, SAS, Talend, VMware, WANdisco
by Merv Adrian | February 23, 2013 | 4 Comments
In the first two posts in this series, I talked about performance and projects as key themes in Hadoop’s watershed year. As it moves squarely into the mainstream, organizations making their first move to experiment will have to make a choice of platform. And – arguably for the first time in the early mainstreaming of an information technology wave – that choice is about more than who made the box where the software will run, and the spinning metal platters the bits will be stored on. There are three options, and choosing among them will have dramatically different implications on the budget, on the available capabilities, and on the fortunes of some vendors seeking to carve out a place in the IT landscape with their offerings.
First up is the cloud. It’s extraordinarily attractive to first timers, because there is no capital expenditure (read: no procurement process, no IT Standards Committees, minimal budget impact, etc.) It’s easy. Maybe too easy; using it outside IT can undermine years of careful work on governance and dramatically increase risk. But that’s another post; for more detail, clients can consult the Gartner report ‘Big Data’ Is Only the Beginning of Extreme Information Management. My point here is that it’s no accident that Amazon has reported that it started 2 million Elastic MapReduce (EMR) clusters – in a single year. And if that many are already on Amazon, think of the other platforms – bet there are more than a few there? You’d be right, but I won’t belabor that here – they aren’t hard to find. The growth of cloud platforms for big data is not likely to slow, and it will remain a great choice for early uses, speculative projects that may need to spin down as quickly as they spun up if they don’t prove to be useful, and ones whose economics just plain work. And for many projects, the cloud will remain the most reasonable economic choice. That’s why some of this year’s key announcements will focus there – stay tuned.
Second is our default choice so far: buy some nodes. Then buy some more. Rinse and repeat. Early adopters, even the mammoth new web-based firms that got all this started, did this, and still do. Some sites literally have people who spend the day going up and down rows of racks pulling failed inexpensive disks and replacing them. It’s the source of the HDFS “3 copies, one on a separate rack” default, and it works. You can buy a quad-core Supermicro data node (or several) with 12 500 GB hard disks in it – 6 TB – for $7K or so, and a name node for $4K with more memory and less disk (why? that’s another post too) and you’ll be working with a Cloudera-certified platform.
You can spend less – and you can spend more – but the numbers are still as compelling as ever. Buy some racks, fill ‘em up with a few dozen nodes and you’re into a couple of hundred terabytes for a couple of hundred thousand (insert currency of your choice here). More expensive than the cloud, but nothing like the big server/storage combination bucks your brethren in the data center are spending for those big RDBMS platforms. Be warned: if you don’t know how to deploy, operate and optimize a cluster, you have a lot to learn. And there is a good chance your data center folks, if you have some, will need new skills even if they are already good at operating what is in there today.
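The back-of-the-envelope numbers above can be checked with a short sketch. The per-node figures (12 disks of 500 GB, roughly $7K per data node, $4K for a name node) and the 3-copy default come from the text. Everything else is an assumption for illustration: 36 nodes as "a few dozen", a single name node, decimal terabytes, and a usable figure that ignores operating system space, temporary files and compression.

```python
def cluster_estimate(
    data_nodes,
    disks_per_node=12,      # from the text: 12 disks per data node
    disk_gb=500,            # from the text: 500 GB per disk
    data_node_cost=7_000,   # from the text: about $7K per data node
    name_node_cost=4_000,   # from the text: about $4K for a name node
    replication=3,          # HDFS default of 3 copies
):
    """Rough raw capacity, usable capacity and hardware cost for a cluster.

    Assumes one name node and decimal units (1 TB = 1000 GB). Usable
    capacity is simply raw capacity divided by the replication factor.
    """
    raw_tb = data_nodes * disks_per_node * disk_gb / 1000
    usable_tb = raw_tb / replication
    cost = data_nodes * data_node_cost + name_node_cost
    return {"raw_tb": raw_tb, "usable_tb": usable_tb, "cost": cost}


if __name__ == "__main__":
    # 36 nodes is an assumed stand-in for "a few dozen".
    estimate = cluster_estimate(36)
    print(estimate)  # {'raw_tb': 216.0, 'usable_tb': 72.0, 'cost': 256000}
```

Under these assumptions the "couple of hundred terabytes" is raw disk; with three copies of every block, the space left for distinct data is about a third of that.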
Finally, and of most financial, vendor leadership and internal standards import, is the newest choice: appliances. At least 6 plays are making their way to market for Hadoop users: EMC Greenplum’s Data Computing Appliance, the Cisco Platform for NetApp Open Solution for Hadoop, HP’s AppSystem for Apache Hadoop, IBM’s PureData System for Analytics, Oracle’s Big Data Appliance, and Teradata’s Aster Analytic Appliance. (I’m sure there are others I’ve left out here, and there are data warehouse appliances and specialty plays like Yarc’s Urika for graph data applications [not Hadoop], but this is a good start.)
The big questions to ask here are: whose software are you running, what else do you get in the package beside metal and a Hadoop distribution, how much easier will it be to operate than buying your own nodes, what support will you get – and the big one: how much does it cost? In 2013, the market will begin to decide if the value proposition of appliances will play here – is the premium (and make no mistake, there is one) you pay worth the quicker time to deployment, operational and management help, and agility you get? That discussion is deep and detailed, and beyond our scope here, but I’m looking forward to continued conversations with Gartner clients who are making these choices as the market develops.
Next time: players. And there are some new ones. Don’t miss it.
Category: Amazon Apache Aster Big Data BigInsights Cisco Cloudera data warehouse appliance Elastic MapReduce EMC Gartner graph databases Hadoop HP IBM MapReduce NetApp Oracle Teradata Yarc Tags: Amazon, Apache, Aster, big data, Cisco, Cloudera, Elastic MapReduce, EMC, EMR, graph database, Hadoop, HP, IBM, MapReduce, NetApp, Oracle, Teradata, Yarc
by Merv Adrian | February 21, 2013 | 1 Comment
In Part One of this series, I pointed out how much attention is being lavished on performance in 2013. In this installment, the topic is projects, which are proliferating precipitously. One of my most frequent client inquiries is “which of these pieces make Hadoop?” As recently as a year ago, the question was pretty simple for most people: MapReduce, HDFS, maybe Sqoop and even Flume, Hive, Pig, HBase, Lucene/Solr, Oozie, Zookeeper. When I published the Gartner piece How to Choose the Right Apache Hadoop Distribution, that was pretty much it.
Since then, more projects have matured. More have entered incubator status. And alternatives to Apache projects have gained more traction in distributions and in customer sites whose portfolio is more expansive. I’ve talked before about my ongoing stack model that attempts to sort this out – you may have seen it in an earlier blog post. I’ve updated it a little, and in this version, you can see that the “original core” projects are bolded. A few others are too, to be discussed in my planned Hadoop Tutorial presentation at the upcoming Gartner BI Summit, March 18-20 in Grapevine, Texas, where I’ll drill into the bolded ones in more detail.
Projects (and alternatives) for the Hadoop stack
In 2013, the list of projects, alternatives, and supporting technology to watch will change as commercial distributions continue to expand what they contain and support, as more and more use cases focus on issues like machine learning (Mahout) or text search and analytics (Lucene and Solr), and as new processing paradigms begin to compete with MapReduce under Apache Hadoop 2.0. Metadata will matter, so HCatalog will turn a lot of heads. Graph processing may begin to show up if Giraph gets some traction. And there’s more:
Apache Avro – the interest in data serialization is expanding with sensor and other machine generated data. Just ask Splunk.
Apache Accumulo – a secure datastore built by guys from the NSA, investigated by the Senate? Of course you’re interested.
Apache Ambari – covered in the last post. An open source management platform.
Apache Bigtop – packaging and testing a collection of your own? This is for you.
Apache Blur (incubating) for search in cloud environments – Doug Cutting is a committer on this one.
Apache Cassandra – an alternative, distributed datastore that has won POCs against pure Hadoop in some use cases I’ve seen.
Apache Chukwa – data collection on your system, for monitoring.
Apache Crunch (incubating) – a “quicker to implement than MapReduce programming” choice, for building, testing and running pipelines.
Apache Drill (incubating) – one of several entrants in the “real-time analytics” sweepstakes – and there will be others.
Apache Giraph (incubating) for graph processing uses – one of the first examples of the changes Yarn will enable.
Apache Hama for Bulk Synchronous Parallel computing in scientific computations.
Apache Kafka – a publish and subscribe system.
Apache Mahout – already being supported by several distributions – machine learning is a key new use.
Apache Whirr – a library for running services in the cloud (including a Hadoop cluster, of course.)
Cascading – not really a project but a development platform, commercialized by Concurrent.
DataFu – also not an Apache project, but a collection of Pig UDFs developed at LinkedIn.
Dataguise DG for Hadoop – a security offering of great value in an insecure platform, which Hadoop certainly is today.
Hadapt – another “alternative datastore” contender, not open source, but offering a relational store right on your cluster.
HStreaming – along with IBM’s inclusion of InfoSphere Streams in its BigInsights distribution, Twitter’s Storm and the well established SQLstream, we’ll see more interest in realtime streaming operational processing as a counterpoint to the interest in realtime analytics that will be another key development this year.
Rainstor – again, not open source, but highly compressed Hadoop sounds pretty appealing. Check it out.
VMware Serengeti – aimed at creating virtualized, highly available, multi-tenant Hadoop. Big possibilities for this one.
I haven’t gone into the various analytics plays here. That’s a post for another time, and it’s arguably a “layer above.” (Or in the case of my diagram, below.) There’s only so much you can fit into a reasonable post and it’s time to end this one. Next time: platforms.
Category: Accumulo Ambari Apache Apache Drill Apache Yarn BigInsights Cassandra Cloudera Dataguise EMC Gartner Giraph graph databases Hadapt Hadoop Hbase HCatalog HDFS Hive Hortonworks Hstreaming IBM InfoSphere Lucene MapReduce Mahout Oozie open source Pig Rainstor Serengeti Solr SQLstream Sqoop VMware Zookeeper Tags: Apache, BigInsights, Cassandra, Cloudera, Flume, Hadapt, Hadoop, Hbase, HDFS, Hive, Hortonworks, Hstreaming, IBM, InfoSphere, MapR, MapReduce, Oozie, Pig, SQLStream, Sqoop, zookeeper
by Merv Adrian | February 16, 2013 | 11 Comments
It’s no surprise that we’ve been treated to many year-end lists and predictions for Hadoop (and everything else IT) in 2013. I’ve never been that much of a fan of those exercises, but I’ve been asked so much lately that I’ve succumbed. Herewith, the first of a series of posts on what I see as the 4 Ps of Hadoop in the year ahead: performance, projects, platforms and players.
Performance concerns are inevitable as technologies move from early adopters, who are already tweaking everything they build as a matter of course, to mainstream firms, where the value of the investment is always expected to be validated in part by measuring and demonstrating performance superiority. It also becomes an issue when the 3rd or 4th project comes along with a workload profile different from those that came before – and it doesn’t perform as well as those heady first experiments. Getting it right with Hadoop is as much art as science today – the tools are primitive or nonexistent, the skills are more scarce than the tools, and experience – and therefore comparative measurement – is hard to come by.
What’s coming: newly buffed up versions of key management tools. It’s one method of differentiating distributions in a largely common set of software – Hortonworks doubling down on open source Apache Ambari, Cloudera enhancing Cloudera Manager, MapR’s updated Control System (as well as their continued touting of DIY favorites Nagios and Ganglia.) EMC, HP, IBM and other megavendors are continuing to instrument their existing, and familiar, enterprise tools to reach this exploding market. It will be a busy bazaar.
Resources are proliferating to help: published work like Eric Sammer’s Hadoop Operations (somewhat Cloudera-centric but very well organized and useful), and a plethora of Slideshare presentations designed to help navigate the arcana of cluster optimization, workload management and configuration optimization.
Performance has figured in a number of proof of concept (POC) tests pitting distributions against one another that I’ve heard about from Gartner clients. Some have been inconclusive; some have had clear winners. As we’ve seen in DBMS POCs over the years, your data and your workloads matter, and your results may differ from others’. I’ve seen replacements of “first distributions” by another, as performance or differing functionality comes to the fore. I’ve even seen a case where a Cassandra-based alternative won out over the Hadoop distributions.
Next time: projects proliferate.
Category: Big Data BigInsights Cloudera EMC Hadoop Hbase HDFS Hortonworks IBM MapReduce Sqoop Tags: Apache, BigInsights, Cloudera, EMC, Flume, Hadoop, Hbase, HDFS, Hive, Hortonworks, IBM, MapR, MapReduce, Pig, Sqoop, zookeeper