by Merv Adrian | July 16, 2014 | 4 Comments
One of the more interesting conversations I had at the Microsoft Worldwide Partners Conference this week concerned an initiative they have launched to help IT understand – and get under control – proliferating ungoverned SaaS applications. Brad Anderson, Corporate VP for Cloud and Mobility, told the 16,000 attendees that enterprises need help. “We ask them how many SaaS apps they have in their environment and they usually tell us 30-40. We audit with the Cloud App Discovery tool and find, on average, over 300.” And are these managed? One can only imagine…
The tool is in preview now, and a link to try it out for free is provided in Microsoft’s blog. It offers more than discovery – it will permit managers to monitor usage, identify users, integrate apps into Azure Active Directory, and more.
This is part of a larger story about governance and optimization in a hybrid cloud- and on-premises world that enterprises will live in for this decade and the next. Anderson also pointed out that 3.1M smartphones were stolen and another 1.4M lost. How many of these had corporate data on them? Would you know if it happened to one of your users? Can you govern access to corporate data in the apps there, and prevent it from being pasted into emails by someone who gets that phone and uses the saved logins to get at it? Some of these challenges can be handled by policy-based tools.
Getting the apps your users want into Azure, managing them there, and linking the on-premises Active Directory used by the overwhelming majority of enterprises to Azure Active Directory offers the possibility of getting corporate data security under better control before you find out how you look in orange. One of my favorite scenarios Microsoft showed its Enterprise Mobility Suite detecting is “impossible logins” – an hour ago you logged in from Australia and now you’re apparently in Chicago. Software can stop that? Yes.
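To make the “impossible logins” idea concrete, here is a minimal sketch of how such a check could work. It is not Microsoft’s implementation, which was not described in detail; the Login record, the 1,000 km/h threshold and the coordinates are all assumptions chosen for illustration. The idea is simply to compare the distance between two consecutive logins with the time between them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
# Assumed threshold: roughly the cruising speed of a commercial jet.
MAX_PLAUSIBLE_KMH = 1000.0


@dataclass
class Login:
    """Hypothetical login event: who, when, and where (degrees)."""
    user: str
    when: datetime
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def is_impossible(prev: Login, curr: Login, max_kmh: float = MAX_PLAUSIBLE_KMH) -> bool:
    """True if travelling from prev to curr would need a speed above max_kmh."""
    hours = (curr.when - prev.when).total_seconds() / 3600
    distance = haversine_km(prev.lat, prev.lon, curr.lat, curr.lon)
    if hours <= 0:
        # Simultaneous or out-of-order logins: flag only if the places differ.
        return distance > 1.0
    return distance / hours > max_kmh


# Approximate coordinates for Sydney and Chicago, one hour apart.
sydney = Login("alice", datetime(2014, 7, 16, 9, 0), -33.87, 151.21)
chicago = Login("alice", sydney.when + timedelta(hours=1), 41.88, -87.63)
print(is_impossible(sydney, chicago))  # True
```

A real product would presumably weigh other signals as well (VPNs, shared accounts, unreliable IP geolocation), so a rule like this is better treated as one risk input than as a verdict.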
The context here was Microsoft telling its partners about the opportunities for them to sell these capabilities to customers – and it’s hard to imagine them not wanting to, especially with the incentives, certification, training and co-marketing efforts Microsoft is launching. Expect this to be a major theme, leveraging the power of the crown jewel that Active Directory is in the portfolio in many additional ways to come.
Category: Active Directory Industry trends Microsoft mobility SaaS Security Tags: Microsoft, mobility, SaaS, Security
by Merv Adrian | June 28, 2014 | 4 Comments
In February 2012, Gartner published How to Choose The Right Apache Hadoop Distribution (available to clients). At the time, the leading distributors were Cloudera, EMC (now Pivotal), Hortonworks (pre-GA), IBM, and MapR. These players all supported six Apache projects: HDFS, MapReduce, Pig, Hive, HBase, and Zookeeper. Things have changed.
[updated June 29] We included Datastax (a distributor of Apache Cassandra) then, but they did not, and still don’t, consider themselves part of the Hadoop ecosystem. And they are not alone in having a reductive view of the answer to the question What Is Hadoop? Doug Cutting, pioneer in creating it and Chief Architect at Cloudera and former president of the Apache Software Foundation, considers the Hadoop Project to be HDFS, MapReduce and some common utilities. He made that point clear during a panel of luminaries my colleague Nick Heudecker conducted recently – the video is linked to Nick’s blog here. Everything else is “related projects.” Arun Murthy of Hortonworks, who has driven the creation of YARN, prefers to say that HDFS and YARN are “kernel” now, likening the description to the way most of us think of Linux. The Apache page continues to use the older description, including HDFS, MapReduce and YARN. (June 29, 2014)
To users, and especially buyers, the definition is more expansive. Hadoop is what they use to compose a useful stack of software to execute a business process of some sort. And distributors agree: in a little over two years, the set of projects included in all commercial distributions has grown from six to fifteen, two and a half times as many. The list now includes Accumulo, Avro, Cascading, Flume, Mahout, Oozie, Spark, Sqoop, and YARN.
Others are likely to join this stack long before the next two years are up: the candidates include Falcon, Knox, Giraph, Hue, Lucene, Storm, Tez, and others. Hadoop has moved from a coarse-grained blunt instrument for largely ETL-style workloads to an expanding stack for virtually any IT task big data professionals will want to undertake. What is Hadoop now? It’s a candidate to be the alternative universe for data processing, with over 20 components that span a wide array of functions. More money continues to flow into the ecosystems, more companies form, more programmers take up the challenges, and the big players are scrambling to get aboard the train.
What is Hadoop?
It’s what’s next.
Category: Accumulo Apache Apache Yarn Avro Cascading Cloudera Falcon Flume Gartner Giraph Hadoop Hbase HDFS Hive Hortonworks Hue IBM Knox Lucene Mahout MapR MapReduce Oozie Pig Pivotal Spark Sqoop Storm Tez YARN Zookeeper Tags:
by Merv Adrian | March 24, 2014 | 11 Comments
This post was jointly authored by Merv Adrian (@merv) and Nick Heudecker (@nheudecker) and appears on both blogs.
In the early days of Hadoop (versions up through 1.x), the project consisted of two primary components: HDFS and MapReduce. One thing to store the data in an append-only file model, distributed across an arbitrarily large number of inexpensive nodes with disk and processing power; another to process it, in batch, with a relatively small number of available function calls. And some other stuff called Commons to handle bits of the plumbing. But early adopters demanded more functionality, so the Hadoop footprint grew. The result was an identity crisis that grows progressively more challenging for decisionmakers with almost every new announcement.
This expanding footprint included a sizable group of “related projects,” mostly under the Apache Software Foundation. When Gartner published How to Choose the Right Apache Hadoop Distribution in early February 2012 the leading vendors we surveyed (Cloudera, MapR, IBM, Hortonworks, and EMC) all included Pig, Hive, HBase, and Zookeeper. Most were willing to support Flume, Mahout, Oozie, and Sqoop. Several other projects were supported by some, but not all. If you were asked at the time, “What is Hadoop?” this set of ten projects, the commercially supported ones, would have made a good response.
In 2013, Hadoop 2.0 arrived, and with it a radical redefinition. YARN muddied the clear definition of Hadoop by introducing a way for multiple applications to use the cluster resources. You have options. Instead of just MapReduce, you can run Storm (or S4 or Spark Streaming), Giraph, or HBase, among others. The list of projects with abstract names goes on. At least fewer of them are animals now.
During the intervening time, vendors have selected different projects and versions to package and support. To a greater or lesser degree all of these vendors call their products Hadoop – some are clearly attempting to move “beyond” that message. Some vendors are trying to break free from the Hadoop baggage by introducing new, but just as awful, names. We have data lakes, hubs, and no doubt more to come.
But you get the point. The vague names indicate the vendors don’t know what to call these things either. If they don’t know what they’re selling, do you know what you’re buying? If the popular definition of Hadoop has shifted from a small conglomeration of components to a larger, increasingly vendor-specific conglomeration, does the name “Hadoop” really mean anything anymore?
The list of projects supported by leading vendors (now Cloudera, Hortonworks, MapR, Pivotal and IBM) numbers 13 today: HDFS, YARN, MapReduce, Pig, Hive, HBase, Zookeeper, Flume, Mahout, Oozie, and Sqoop – plus Cascading and HCatalog. Coming up fast are Spark, Storm, Accumulo, Sentry, Falcon, Knox, Whirr… and maybe Lucene and Solr. Numerous others are only supported by their distributor and are likely to remain so, though perhaps MapR’s support for Cloudera Impala will not be the last time we see an Apache-licensed, but not Apache project, break the pattern. All distributions have their own unique value-add. The answer to the question, “What is Hadoop?” and the choice buyers must make will not get easier in the year ahead – it will only become more difficult.
Category: Accumulo Ambari Apache Apache Drill Apache Yarn Big Data BigInsights Cloudera Elastic MapReduce Gartner Giraph Hadoop Hbase HCatalog HDFS Hive Hortonworks IBM Intel Lucene MapR MapReduce Oozie open source OSS Pig Solr Sqoop Storm YARN Zookeeper Tags: Apache, big data, BigInsights, CDH, Cloudera, Datastax, EMC, Flume, Hadoop, Hbase, HDFS, Hive, Hortonworks, IBM, InfoSphere, Isilon, MapR, MapReduce, Oozie, open source, OSS, Pig, Pivotal, Sqoop, zookeeper
by Merv Adrian | February 23, 2014 | 5 Comments
In my post about the BYOH market last October, I noted that increasing numbers of existing players are connecting their offerings to Apache Hadoop, even as upstarts enter their markets with a singular focus. And last month, I pointed out that Nick Heudecker and I detected a surprising lack of concern about security in a recent Hadoop webinar. Clearly, these two topics have an important intersection – both Hadoop specialists (including distribution vendors) and existing security vendors will need to expand their efforts to drive awareness if they are to capture an opportunity that is clearly going begging today. Security for big data will be a key issue in 2014 and beyond.
Other analysts at Gartner have tracked many of these products, and in my own followup I’ve been catching up on the work of Joseph Feiman and Brian Lowans, among others. Their Magic Quadrant for Data Masking, published in December, offers useful discussion of that capability (both static and dynamic) and which existing players have already added Hadoop support. Axis Technology’s DMSuite, Dataguise (who partners with Compuware), IBM InfoSphere Optim Data Privacy and InfoSphere Guardium Data Activity Monitor, Informatica Dynamic Data Masking and Persistent Data Masking, and Voltage SecureData Enterprise are all mentioned in the MQ.
There are other offerings, of course – for example Feiman and Lowans note that masking of big data is available for the Oracle Big Data Appliance with its installed Cloudera distribution, but added that it requires the use of Oracle consulting services, or the services of Oracle’s numerous service partners. Similarly, there are several emerging Hadoop focused firms I’ve mentioned elsewhere and will cover in an upcoming piece of Gartner research I’m doing with Neil MacDonald. With RSA coming up this week (unfortunately, I can’t attend), I expect to see more heat – and perhaps light as well – on the issue ahead.
Category: Apache Big Data Cloudera Dataguise Gartner Hadoop IBM Magic Quadrant Oracle Security Tags: Apache, Axis Technology, big data, Cloudera, Compuware, Dataguise, Hadoop, IBM, Informatica, open source, Oracle, OSS, Security, Voltage
by Merv Adrian | February 19, 2014 | 3 Comments
In the most profound change of leadership in Microsoft’s history, Satya Nadella, who was head of the Cloud and Enterprise division, has taken the helm, succeeding Steve Ballmer. Nadella’s “insider” understanding of Microsoft’s culture and his effectiveness in cross-team communication and collaboration could help him reshape Microsoft for the digital era — which will be key for the company to attain the visionary technical leadership to which it aspires.
Nadella’s main challenge consists of evolving Microsoft’s existing businesses (including its enterprise offerings, which represent half of its current revenue) while reinventing Microsoft to make it relevant in mobile and cloud-centric markets.
Unlike his sales-focused predecessor, Steve Ballmer, Nadella has an engineering background; thus, his selection has reinstated the model of a technically minded CEO driving the company’s technical vision. But Nadella also must overcome several challenges.
He lacks direct experience in the mobile market. His insider status raises the risk of his being overly respectful of existing businesses, and hanging back from tough decisions that potentially threaten them but are critical to generating innovation. He will also need to shake up what is widely viewed as a culturally dysfunctional management structure.
Nadella must quickly demonstrate that he is not backing a “business as usual” strategy, and that he recognizes that design is front and center in client computing for both consumers and enterprise users and that a mobilized environment has replaced the desktop. The next six months will show how well Nadella and Gates collaborate to determine Microsoft’s technical direction.
We expect Microsoft’s trajectory will be clear by year-end 2014. Do not expect radical changes in the company’s overall “devices and services” strategy; instead watch for organizational shifts, product design changes and updated product road maps to address a mobile- and cloud-dominant world.
To succeed, Microsoft must:
- Establish a vision of itself as an innovative, disruptive force in IT. Concentrating on mobile technology and leveraging lessons learned from gaming can help Microsoft appeal to the next generation.
- Emphasize design to enhance ease of use for consumers, and apply these lessons to its considerable assets in IT infrastructure to change its image of a legacy enterprise vendor competing in a consumerized market.
- Enable entrepreneurs and developers to develop new business value atop a common Windows client environment with unified, cross-platform services. Microsoft must enable a complete, compelling set of apps that attracts developers and can compete with and within iOS and Android environments.
- Acknowledge its customers’ heterogeneity by supporting Google and Apple client environments, the Linux/Java environment on servers, and cloud-based services in general.
- Deliver compelling experiences and solutions to both IT and to non-IT buyers.
For clients, my research note “Nadella Must Disrupt Microsoft Models to Establish Market Leadership,” co-authored with David Cearley, can be found on the Gartner website at http://www.gartner.com/doc/2664432.
Category: Gartner Microsoft Tags: Microsoft
by Merv Adrian | January 21, 2014 | 19 Comments
“Not looking” at security and privacy seems to be the posture of people implementing Hadoop, based on recent data Gartner has collected. This is troubling, and paradoxical. In an era when the privacy of data, from government surveillance to medical record-keeping to “creepy” marketing initiatives and password breaches, has been in the news regularly, it is hard to understand why professionals implementing Hadoop are not paying attention.
The data here comes from a recent webinar I conducted with my colleague Nick Heudecker. We had over 600 attendees, and during the discussion we offered several polling questions. One had to do with barriers to Hadoop adoption. We had 213 responses to that question.
You can see the results below and two things leap out: only 2% of the respondents see lack of robust security as a barrier, and half of the respondents feel that they do not have a sufficiently defined value proposition. More on the latter in another post.
For me, the nearly non-existent response to the security issue is shocking. Can it be that people believe Hadoop is secure? Because it certainly is not. At every layer of the stack, vulnerabilities exist, and at the level of the data itself there are numerous concerns. These include the use of external unveiled data and of data in file systems that lack any protection, and the separation of Hadoop initiatives in most organizations from IT governance. Add to that the kinds of use cases Hadoop is being pointed at: sensitive health care information; personal data in retail systems; telephone usage; social media connection and sentiment analytics – all of them give us pause.
I’ve pointed to security as a key issue facing the Hadoop community in 2014 for some time now. The fact that awareness of the problem is not getting attention only reinforces my belief that we will see major problems as Hadoop goes mainstream.
Category: Big Data Gartner Hadoop Security Tags: big data, data security, Hadoop, Security
by Merv Adrian | January 17, 2014 | 3 Comments
In the Hadoop community there is a great deal of talk of late about its positioning as an Enterprise Data Hub. My description of this is “aspirational marketing;” it addresses the ambition its advocates have for how Hadoop will be used, when it realizes the vision of capabilities currently in early development. There’s nothing wrong with this, but it does need to be kept in perspective. It’s a long way off.
Start here: A Gartner Research Circle survey revealed that in 2013, big data projects went into production in less than 8% of enterprises – and Hadoop was by no means the only technology included in their count. In 2014, many more enterprises will go into production with their first Hadoop project or two, perhaps even pushing deployed Hadoop shops into low double digit percentages.
In those same shops, there are thousands of significant database instances, and tens of thousands of applications – and those are conservative numbers. So the first few Hadoop applications will represent a toehold in their information infrastructure. It will be a significant beachhead, and it will grow as long as the community of vendors and open source committers delivers on the exciting promise of added functionality we see described in the budding Hadoop 2.0 era, adding to its early successes in some analytics and data integration workloads.
So “Enterprise Data Hub?” Not yet. At best in 2014, Hadoop will begin to build a role as part of an Enterprise Data Spoke in some shops. Aspirations are good, but perspective helps. Don’t confuse vision with strategy. The enterprise data warehouse has a long life ahead of it; the synergistic addition of Hadoop to its evolution into the logical data warehouse is just beginning, and Hadoop’s role in operational workloads and event processing has yet to launch.
Category: Apache Big Data data warehouse DBMS Gartner Hadoop RDBMS Tags: Apache, big data, data warehouse, Hadoop
by Merv Adrian | January 13, 2014 | 11 Comments
Talk to security folks, especially network ones, and AAA will likely come up. It stands for authentication, authorization and accounting (sometimes audit). There are even protocols such as RADIUS (Remote Authentication Dial In User Service, much evolved from its first uses) and Diameter, its significantly expanded (and punnily named) newer cousin, implemented in commercial and open source versions, included in hardware for networks and storage. AAA is and will remain a key foundation of security in the big data era, but as a longtime information management person, I believe it’s time to acknowledge that it’s not enough, and we need a new A – anonymization.
I realize I’m speaking out of turn here. I’m not a security guy myself, and I don’t pretend to be deep in the disciplines that decide whether you are who you claim to be (authentication) and govern whether you can get to the network. Nor do I know the detailed nuances, spread across many different resources, that grant me permission to do what I will be allowed to do with those resources when I get there (authorization). I don’t understand the various protections that assure breaches do not occur or have not occurred, which depend on the audit capabilities (the latter, as accounting, also provides the mechanism to report on all of the above).
What I do spend some time on is what happens within the resource that holds the data, when an authenticated, authorized person who is appropriately audited gets to it. For example, we need to distinguish what DBAs can see from what an analyst can – financial types call that “separation of concerns,” and it’s typically managed by a DBMS, which has mechanisms to interact with authorization capabilities to implement policy. It can be coarse- or quite fine-grained, and it’s one of the reasons we analysts always like to remind people that we talk about database management systems, not just databases.
But here’s the problem: in the big data era, much of the data we work with is not in DBMSs – and more and more of it will not be, as file-based systems like Hadoop gain broader and broader use. File systems don’t provide that granular control, so intervening layers will be required. They too can be coarse – we can encrypt/decrypt everything, for example. Or they too can be fine-grained, offering selective, policy-based decryption – in memory, after the bits come off the disk, before handing to the requester.
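The difference between coarse and fine-grained control is easier to see in a small sketch. The one below stands in for the “intervening layer”: the roles, field names and policy table are hypothetical, and masking is used in place of real encryption, which would need a proper cryptographic library and key management. What it shows is the policy decision being made per field, per requester, before anything is handed back.

```python
# Hypothetical policy: which fields each role may see in the clear.
POLICY = {
    "dba": set(),                               # manages storage, sees no values
    "analyst": {"zip", "diagnosis"},            # sees analytic fields only
    "clinician": {"name", "zip", "diagnosis"},  # sees the full record
}

MASK = "****"


def read_record(record: dict, role: str, policy: dict = POLICY) -> dict:
    """Return a copy of record with every field the role may not see masked.

    In a real system the masked fields would simply never be decrypted
    for this requester; masking here only illustrates the decision.
    """
    allowed = policy.get(role, set())  # unknown roles see nothing
    return {key: (value if key in allowed else MASK) for key, value in record.items()}


row = {"name": "Pat Doe", "zip": "60601", "diagnosis": "J45"}
print(read_record(row, "analyst"))
# {'name': '****', 'zip': '60601', 'diagnosis': 'J45'}
```

The coarse alternative would be a single yes-or-no decision on the whole file: either the requester gets every field or none of them.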
Personally, I hope people who model disease vectors, or even purchase behaviors, can build effective predictive models that describe what happens to people with certain characteristics. I just don’t want that process to result in my name being “on their list.” If they can intuit and classify what I am by my behavior and assign me to a category in some separate process, that is a different issue.
One approach that matters a great deal is obfuscation, which replaces a field like name or SSN with valid characters, but not the original data. Its value is that if properly implemented, it maintains mathematical cohesion and permits statistical analysis, aggregation, model building, etc to proceed without individuation of the records they are performed over. This is a privacy concern. Redaction – the familiar “blacking out” of content, is also used – but in some policy scenarios, being able to peer into the redacted data might subsequently be of value, and redaction typically doesn’t permit this.
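A small sketch may help show why obfuscation preserves analytic value where redaction does not. It is an illustration under stated assumptions, not a description of any vendor’s product: the key, the sample data and the function names are made up, and a keyed hash (HMAC) is used as a simple stand-in for commercial masking or format-preserving techniques. The same input always yields the same SSN-shaped token, so records can still be grouped per person without exposing who the person is.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical secret; in practice it would live in a key store, not in code.
SECRET_KEY = b"example-key-not-for-production"


def pseudonymize_ssn(ssn: str, key: bytes = SECRET_KEY) -> str:
    """Map an SSN to a stable, SSN-shaped token (NNN-NN-NNNN)."""
    digest = hmac.new(key, ssn.encode("utf-8"), hashlib.sha256).hexdigest()
    digits = f"{int(digest, 16) % 10**9:09d}"  # nine digits derived from the HMAC
    return f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"


def redact(_value: str) -> str:
    """Redaction: every value collapses to the same blacked-out marker."""
    return "XXX-XX-XXXX"


# Made-up sample data: (SSN, purchase amount).
purchases = [("123-45-6789", 20), ("987-65-4321", 5), ("123-45-6789", 7)]

# Obfuscated records still group per person, so aggregates survive.
per_person = Counter()
for ssn, amount in purchases:
    per_person[pseudonymize_ssn(ssn)] += amount
print(sorted(per_person.values()))  # [5, 27]

# Redacted records all look alike, so per-person structure is lost.
print(len({redact(ssn) for ssn, _ in purchases}))  # 1
```

Note that this token is one-way: unlike some masking schemes, it cannot be reversed to recover the original value, and everything depends on keeping the key secret, since the space of possible SSNs is small enough to enumerate.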
Both approaches, however, can be classified as anonymization (or data deidentification, but I prefer to add another A to AAA for consistency!), and in an era where big data will increasingly be used to track human behaviors for medical, commercial and security reasons, I believe it’s time for anonymization to join the other 3 As. Perhaps it’s time to talk of authentication, authorization, anonymization and audit as the true foundation for data security.
Thanks to my esteemed Gartner colleague Neil MacDonald for commenting on and improving this post.
Category: Big Data data integration data warehouse DBMS Gartner Hadoop HDFS Industry trends Security Tags: big data, data integration, data security, data warehouse, Hadoop, HDFS, Security
by Merv Adrian | October 9, 2013 | 9 Comments
When is a technology offering a platform? Arguably, when people build products assuming it will be there. Or extend their existing products to support it, or add versions designed to run on it. Hadoop is there. The age of Bring Your Own Hadoop (BYOH) is clearly upon us. Specific support for components such as Pig and Hive varies, as do capabilities and levels of partnership in development, integration and co-marketing. Some vendors are in many categories – for example, Pentaho and IBM at opposite ends of the size spectrum interact with Hadoop in development tools, data integration, BI, and other ways. A few category examples, by no means exhaustive:
Analytic Platforms: Kognitio – an analytics-focused in-memory database with SQL 2011 support – offers significant Hadoop support. Jethrodata adds indexes for SQL and stores them in HDFS to accelerate BI tools, while Splunk’s offering called – gotta love it – Hunk has a similar approach. SAS recently added a partnership with Hortonworks, extending existing capabilities. For integrated marketing, RedPoint Global offers a true YARN-enabled engine with a rich array of capabilities that many tool vendors would envy.
Application Performance Management – longtime stalwart Compuware is bringing its portfolio to both Hadoop and leading NoSQL offerings.
BI Tools: specialists like Alpine Data Labs, Datameer, Karmasphere and Platfora position themselves as targeted for Hadoop environments. Traditional players like SAP Business Objects may represent themselves as connecting via Hive, and increasingly we will see some, like Alteryx, Qlikview and Tableau, partnering with emerging distribution-specific stack components like Cloudera’s Impala.
Database: some vendors, like IBM and Teradata, offer their own distributions, and even appliances. Others, like Actian, Calpont, Oracle and Microsoft, partner with pureplay vendors. All provide connectors, management interfaces to their own management tools, etc. MarkLogic adds an “enterprise NoSQL” flavor; Rainstor adds an archiving solution for a highly compressed Hadoop environment.
Data Integration: Informatica and Talend both support HDFS and even have specific offerings for ETL, data quality, etc. Revelytix Loom offers data prep and metadata creation capabilities to shorten time to use cycles.
Development platforms: Continuuity is out to an early lead here, but is hardly alone and won’t be the last. For example, SQL Server development tool player Red Gate will enter the market soon.
Hadoop as a Service: Altiscale, Amazon, Qubole, Rackspace, Savvis, and Xplenty (who mask Hadoop development complexity) offer varying degrees of control and surrounding capabilities – and marketing, as the links demonstrate.
In-memory data grid (IMDG) engine: Gridgain offers GGFS, one of several HDFS substitutes, and like ScaleOut hServer offers an in-memory grid for execution of MapReduce code. Longtime IMDG player Terracotta has added a Hadoop connector.
Lifecycle Management: WANdisco is offering ALM and support for highly available distributed network deployments, and has recently partnered with Hortonworks.
Platform Performance Management: Appfluent, already providing visibility and performance optimization for Oracle, Teradata and IBM DB2 and PureData for Analytics (aka Netezza) platforms, has now added a Hadoop offering as well.
Search: numerous approaches and players here. One of the more interesting is LucidWorks, leveraging Lucene and Solr for search based use cases on a Hadoop infrastructure.
Security: Dataguise, Gazzang, Protegrity, and Zettaset offer varying components of a full-stack security hardening for a Hadoop deployment.
Stream Processing: DataTorrent, Tibco Streambase, Vitria and Zoomdata are among the early players here.
Workload Automation: BMC’s Control-M is already in place for a need that will become more significant as adoption rises and efficiency becomes more of an issue.
This is just a bare smattering of the evolving ecosystem, and I’d be delighted to have your additions, recommendations and comments. I will update this post to include them. Please jump in.
Category: Big Data BigInsights Cloudera DBMS Hadoop HDFS Hive Hortonworks IBM Lucene MapR MapReduce Microsoft Oracle Pig Rainstor RDBMS Security Solr SQL Server SQLstream Talend Teradata YARN Tags: Apache, big data, BigInsights, CDH, Cloudera, Compuware, Hadoop, Hbase, HDFS, Hive, Hortonworks, IBM, Jethrodata, Karmasphere, Kognitio, metadata, Oracle, Pentaho, Pig, Qubole, Revelytix, SAS, Splunk, Teradata
by Merv Adrian | October 3, 2013 | 1 Comment
My Gartner APAC colleague Darryl Carlton and I recently discussed the obscene ratio between CEO pay and average worker pay in the US. And this IS about the US – we are supporting an astonishing gap compared to the rest of the world, and high tech vendors like Oracle are not the only ones at the top of the list – Larry Ellison comes in only number 4 on this Bloomberg list, pulling down 1,287 times what an average Oracle worker (not impoverished at nearly $75K per year) collects.
Personally, I was surprised that the banking community was not represented “better” here – but perhaps that’s because there seem to be dozens of executives, not just the CEOs, in every firm currently hauling cash home in wheelbarrows as mortgages remain underwater and pension funds and healthcare plans get de-funded. At least they share with each other.
In my conversation with Darryl, we discussed the link to company performance. Pick your measure – stock price, revenue improvement, etc. – in general those gorging at the trough are not dramatically outperforming their peers – and they certainly are not doing 10 times, or a hundred times, better than their counterparts in other countries. But look at how the US plutocrats compare to their brethren in other countries – Japan at 11:1, Britain at 22:1 -
and the US at 475:1.
It’s all in the table linked below.
CEO Pay Ratio
Darryl notes that
Drucker argues that CEO pay which exceeds 20-25 times the average paid in that company is bad for business. The new law in Australia is that if 25% of the stockholders vote down executive pay twice (two strike rule) at successive AGM’s then the Board can be replaced. The CEO of Lenovo in China has once again distributed his bonus to the lowest paid factory workers.
Not here, I’m afraid. And I want to say so much more, but only one word is needed, as far as I’m concerned:
Category: Industry trends Tags: Industry trends