by Merv Adrian | December 31, 2014 | 1 Comment
As an information management software analyst, I don’t spend a great deal of time looking at hardware, but when I want a more holistic view, I occasionally check in with Gartner colleagues. Recently I fielded a few questions about Oracle’s hardware mix during inquiries, so I compared notes with my colleague Errol Rasit on Gartner Quarterly Market Statistics to find out how the hardware recovery I keep hearing about was going. What I discovered surprised me, especially in light of the messages I hear from the vendor.
There is no “recovery.” It appears that the picture remains rather bleak, especially on the SPARC side.
At the turn of the century, SPARC was a $10B business. But it has been in steady decline since then (with a bubble in 2005-2007) and there is no end in sight. The next generation is supposed to fix all that for Oracle, but several preceding ones didn’t do the trick, so one has to be skeptical. Here’s what the last few years have looked like:
This annual chart ends at 2013 because 2014 is not complete yet. Not shown is a slower rate of losses in the first half of 2014 – and a slightly positive Q3 for SPARC (its first in a few years) suggests the bottom may be near, but it’s too soon to tell. The question now is whether the Intel-based business is growing quickly enough to offset the continuing decline in SPARC. And at a glance, it doesn’t appear to be. Here’s a look at the last few years:
Again, this annual chart ends with 2013. The 2014 x86 numbers (not shown) have been mixed, going slightly negative in Q3, contrasting with the SPARC upturn. The Intel-based business is now generating just about as much revenue as the SPARC business is, but has been growing more slowly than the SPARC business was shrinking.
Are we at an inflection point? Will Oracle’s next generation of x86-based systems accelerate growth? How motivated will Oracle be to continue pushing SPARC forward? Those are very good questions, and I suspect we will keep asking them throughout 2015. Rhetoric notwithstanding, it’s not a very pretty picture so far.
Category: Gartner Industry trends Intel Oracle SPARC x86 Tags: Gartner, Intel, Oracle, SPARC
by Merv Adrian | December 30, 2014 | 5 Comments
OK, I admit it – I stole the title from a much smarter man. I thought that man was Yogi Berra, but maybe not – more about that at the end of this post.
Every year, Gartner issues a series of Predicts documents. This year I had the pleasure of doing one for my team on Information Infrastructure Technology. Now, I’m a software guy, and the team I’m on is all software people, so a document assigned to our team would typically be about – well, information software technology. But that would have missed the point rather dramatically, so I connected with a few colleagues and got their OK to use some of their predictions in the small set any document can include. (After that I had to argue with our methodology people about having them appear in two places, but that’s a story you don’t want to hear.)
It was an enjoyable exercise, all things considered, and I was able to collect predictions in several domains I think will have dramatic effects on the information management space. Here are the Planning Assumptions I featured:
- By 2017, IMDBMS as a differentiated DBMS market category will disappear. In-memory will be a “permission to play” capability for DBMS. (Teri Palanca)
- By 2018, HDDs will still account for 75% to 85% of all petabytes shipped to the server and ECB storage markets, but solid-state solutions will expand to account for 15% to 25% of all mission-critical, near-line and archived data. (John Monroe and Joseph Unsworth.)
- Through 2020, there will be no dominant IoT ecosystem platform; IT leaders will still need to compose solutions from multiple providers. (Al Velosa)
- By 2018, 30% of streaming, near-real-time data integration and data management use cases will be supported by stacks that include Apache Spark. (Lakshmi Randall, Nick Heudecker and me.)
- Through 2018, 90% of deployed data lakes will be useless as they are overwhelmed with information assets captured for uncertain use cases. (Regina Casonato, Nick Heudecker, Mark Beyer, and me again.)
The list, confined as it was to 5 predictions, leaves out many important topics: cloud, security, Hadoop (my favorite, if you follow my research), emerging categories now collected under NoSQL, the economic impact of open source, the expiration of Oracle’s consent decree on MySQL, and more. But it’s not a bad list, and it touches on memory, storage, the IoT, software and process and governance issues. So I’m pretty happy with it. Gartner clients can find the full document here.
What about Yogi? Most Valuable Player of the American League three times (only 3 other people ever did that) and he got to the World Series as a manager in both leagues. As player, coach, or manager, he appeared in 21 World Series, winning 13 of them (thanks, Wikipedia, for the fast facts). He was a hero of mine as a kid, and learning of his wit and wisdom over the years has only enhanced that. But it appears he did not originate the quote. A little work with Bing turned up a whole raft of investigations that ascribe the quote to Danish folk wisdom, even to Niels Bohr. But that’s OK – I’ll always link it to Yogi. After all, even he said (and I’m sure of this one) “I never said most of the things I said.”
Category: data lake data warehouse DBMS Gartner Hadoop IMDBMS Industry trends MySQL Oracle Tags: data lake, data warehouse, DBMS, Gartner, Hadoop, IMDBMS, MySQL, Oracle
by Merv Adrian | December 29, 2014 | 2 Comments
In September, I posted Microsoft’s Portfolio – a Formidable Mix, with a perspective on several dozen Magic Quadrants that feature Microsoft offerings. As Gartner’s Vendor Lead, I’m a mandatory peer reviewer for those and other documents. For my own edification, I decided to map the Magic Quadrants that feature Microsoft onto a quadrant-style picture showing where Microsoft appears in each piece of research. This post updates those results with a few changes. I’ve bolded the MQs published since July, and marked any changes in the quadrant they appear in with arrows.
For Microsoft, the news was pretty good in Q414 – in 3 categories, Microsoft’s positioning on the Execution axis has improved. Enterprise Information Archiving, Integrated Marketing Management and Corporate Telephony all moved up, with the latter entering the Leaders quadrant.
The value of this periodic refresh seems pretty clear, and by mid-next year, we should have a much clearer view of the progress being made as Satya Nadella drives both vision and execution for one of the world’s largest and most complex organizations. I’ll continue to do quarterly updates.
Category: Microsoft Tags: Microsoft
by Merv Adrian | December 18, 2014 | 3 Comments
Donald Feinberg (@Brazingo) & Merv Adrian (@merv)
Every so often, there’s a wave of interest in the “imminent retirement” of one or more legacy database management systems (DBMS). Usually, it’s because someone with very little knowledge of the actual use and distribution of the products becomes enthusiastic about someone’s sales pitch, or an anecdote or two. Sometimes it’s the result of a “replacement” marketing campaign by a competitor. DBMS technology takes longer than 40 years to die, and for a (competing) marketer, it’s like the villain in a horror story who just keeps coming back. And so far, its demise is usually as elusive – and as far off – as the “death of the mainframe”.
Recently, a financial analyst report stated that in 2015, the industry would begin retiring Sybase products (owned now by SAP) and Informix (owned now by IBM). We and our colleagues have since had several inquiries about this and our response is simple: poppycock. DBMS market data, and our thousands of interactions with customers, do not support any such assertions.
Let’s start with Sybase, or specifically, SAP ASE and SAP IQ, acquired by SAP from Sybase in 2010. (Full disclosure: Merv worked at Sybase in the 1990s.)
Since its acquisition of Sybase, SAP has released several enhanced versions of both SAP ASE and SAP IQ (including recently in 2014), and there’s no reason to question its intent to continue development and support of both.
Generally, the customers using these products are happy, and are not looking to replace them. We receive a steady stream of inquiries from Gartner clients asking about them, and these have not changed in character or volume. Customers do ask about the products’ future, but they do not question the vendor’s intent. The inquiries are not typically or disproportionately about removing these products, though we regularly get questions about replacing all the “legacy” RDBMS offerings with new products.
SAP IQ is the oldest and most widely installed column-store DBMS on the market. It is used for both analytics and as a general purpose data warehouse; it’s also part of the SAP HANA infrastructure, used as a near-line storage engine for cooler data not required in-memory in SAP HANA.
SAP ASE has retained a sizable loyal customer base on Wall Street, where it is part of the infrastructure used for trading systems, and elsewhere. It’s been certified as a DBMS platform for SAP Applications for about two years, and its use there is growing: Gartner estimates over 6000 instances of SAP Applications using SAP ASE as a platform at the beginning of 2014. [Edited Dec 19 to change number to 6000 – see below for comment from SAP]. That rate of growth for SAP ASE is actually faster than it had been in the 10 years before SAP acquired it – most likely because now SAP ASE is an alternative to Oracle, as a platform for SAP Applications.
Given the SAP sales force’s focus on SAP HANA, and the minimal marketing of SAP ASE and SAP IQ, we do understand how a misconception around the future of these products could happen. But it is just that – a misconception.
What about Informix, acquired by IBM in 2001? Over a decade later, it remains an integral part of an IBM information management portfolio that includes three primary DBMSs – DB2, IMS and Informix – and newer entrants such as Cloudant. IBM has continued to release new, enhanced versions of Informix since the acquisition; for example, it recently added JSON support with MongoDB JSON drivers. Due to its implementation of embedded indexes, Informix is a good choice for audio and video indexing. Finally, the number of IBM Informix customers has continued to increase, and its user base is very loyal, with one of the largest and most active user groups.
IBM positions Informix for three primary use cases:
- High-speed processing in verticals like retail (point of sale systems) and manufacturing
- Time-series DBMS – one of its primary features, and a “timely” one
- The Internet of Things, where its high-speed ingest capabilities and small footprint are well-suited
So, it’s our opinion that the report referenced above is erroneous and not based in fact. At the end of the day, one of the most powerful forces in DBMS is inertia. Just ask Oracle, whose 2Q15 financial results press release on 17-Dec-2014 noted that “software updates and product support revenues drove nearly half of total company revenue.” Legacies are sticky – if it works, people don’t take changing it lightly. In all these cases, legacy products are not only holding their own, but finding new markets in the hands of large companies with loyal customer bases.
Don’t believe everything you read (unless, of course, we wrote it.)
Category: DBMS Gartner IBM Informix SAP Tags: Gartner, IBM, SAP, Sybase
by Merv Adrian | December 5, 2014 | 3 Comments
How have Hadoop deployments grown this year? Slowly.
Here’s a little anecdata for you:
During 2014, my colleague Nick Heudecker and I conducted quarterly webinars on the State of Hadoop, and in the Q2, Q3 and Q4 sessions we asked our (steadily growing) audience about their deployments via online polls. These results should not be considered definitive: they’re unqualified – though attendees do have to jump through a hoop or two to attend, we don’t keep extensive firmographics, titles, etc.
What we saw was a decrease, by yearend, in the percentage of respondents who said they had not deployed at all. As for the other categories, which asked how many nodes were deployed, only one showed much change – the “fewer than 10” group grew to 27% at yearend, 50% growth over the Q2 result of 18%. This suggests a growing number of pilots, which accords with our expectations. The other groups were essentially flat, suggesting no dramatic growth so far in substantial projects undertaken, or in additional projects being added to the same clusters and driving growth.
Percentage of webinar respondents reporting cluster sizes, 2014
Nick and I expect to continue these webinars. See you next year, and we’ll see how things have progressed.
Category: Apache Gartner Hadoop Tags: Apache, Gartner, Hadoop
by Merv Adrian | November 17, 2014 | 9 Comments
Last week, many observers were surprised when Hortonworks’ S-1 for an initial public offering (IPO) was filed. And there are good reasons to be surprised. Why now? CEO Rob Bearden told VentureWire not long ago that he expected to exit 2014 “at a strong $100 million run rate” in preparation for a 2015 IPO. What changed? Perhaps that question is best answered by asking another: for whom?
Is the filing for Hortonworks to raise cash? That is not obvious. The filing is listed as being for an offering of $100M. In the context of other fundraising by Hadoop vendors – with Cloudera’s $900M or so of a few months ago at the top of the list – it will hardly create a war chest suitable for aggressively out-expanding its competitors. And there are cheaper and easier ways to raise $100M in Silicon Valley than an IPO.
In fact, a look at the numbers – now public for the first time because of the filing – makes it all the more puzzling. Hortonworks’ $33.4 million in revenue for the nine months ending Sept. 30 was up sharply from last year, only its second full year since HDP went GA in June 2012. Revenue for the last quarter was $12M, essentially flat against the prior quarter (also $12M). But expenses are several times that – $29M and $41M respectively – so the gap is widening. Put another way, losses are growing faster than revenues, at an accelerating rate. That $100M, plus the $111M the company has in cash now, gives it a year or two’s worth of runway to improve matters. Presumably, that’s the bet. But why only $100M if more might be available?
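A rough back-of-the-envelope sketch, using only the figures cited above, shows how that runway estimate falls out. It assumes the most recent quarter is representative of the run rate and that the offering nets the full $100M – optimistic, given that the gap has been widening:

```python
# Back-of-the-envelope runway estimate from the S-1 figures quoted above.
# Assumption: losses hold steady at the most recent quarter's rate.

quarterly_revenue = 12   # $M, most recent quarter
quarterly_expenses = 41  # $M, most recent quarter
cash_on_hand = 111       # $M, per the filing
ipo_proceeds = 100       # $M, assuming the offering nets the full amount

quarterly_burn = quarterly_expenses - quarterly_revenue  # net loss per quarter
runway_quarters = (cash_on_hand + ipo_proceeds) / quarterly_burn

print(f"Burn: ${quarterly_burn}M per quarter")
print(f"Runway: {runway_quarters:.1f} quarters (~{runway_quarters / 4:.1f} years)")
```

At a $29M quarterly burn, $211M buys roughly seven quarters – consistent with the “year or two’s worth of runway” above, and shorter if losses keep accelerating.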
Is it for Hortonworks’ investors? Let’s see who they are. Here’s a table of their stakes (a table stakes stakes table, if you will):
Benchmark and Index are successful funds, and it’s unlikely they are in a hurry to cash in on their investment. Yahoo might care, but urgency seems unlikely, particularly after the Alibaba windfall. There is little reason to think HP is driving this. Teradata? OK, if they were betting on Hortonworks as the key element in their big data strategy, maybe they have decided to hedge, but it’s hard to imagine they really feel the need to worry about this – and they’ve already hedged by announcing a new partnership with Cloudera. They have a fair number of joint customers with MapR as well, so one can’t rule them out as a future partner too. Teradata’s role here is not likely to be the motivating factor.
Personal gain? Doubtful. The stakes owned by Hortonworks’ CEO and President are nice and will certainly help them – but there is no obvious reason for them to have accelerated this for their own gain, even if they could do so.
Is it to help build the market for Hadoop? This seems to have been the party line on general motivation till now. But they are one vendor among several – some truly megavendors, and some similar in size, and evidently in prospects, for the near term. They are major contributors to the open source code in the Apache stack and are driving substantial innovation. Being able to keep paying engineers (R&D is 28% of expenses, and has doubled over the past year) is a good use of funds – and $100M will fund a couple of years at the current run rate, which one might expect to level off a bit. But it won’t be the only use of funds: sales and marketing is 48%, and more is better. Still, let’s face it: because of Hortonworks’ business model, everything they build is Apache open source code. Their R&D spend enables their competitors too. Spending much more on either or both won’t separate them quickly and dramatically from the pack.
It’s been 10 years since the first Google paper on MapReduce. Hortonworks will be the first new public company descended from that work, and they want HDP as their ticker symbol. They were formed 3 years after Cloudera, so they can at least grab the Hadoop label for themselves. But with an open source stack, value is likely to be determined by how well the company is seen to run, how many customers it has, how likely its revenue is to track growth in Hadoop usage at those and at new customer sites, and so on. Hortonworks’ business is services and support. Neither is particularly high margin. Nor is it clear how customer spending on either will scale with their Hadoop usage.
Hortonworks’ 3 largest customers (Yahoo, Teradata, and Microsoft) account for 37.4% of its revenue – and two are investors. The biggest is Microsoft, at 22.4% now – it was 55.3% for the year ended April 30, 2013. That sort of concentration never makes investors too happy, and though it is declining it’s still sizable. The Microsoft deal, like all others, is renewable – it expires in July 2015. And like Teradata, Microsoft has added other partnerships to what was an exclusive with Hortonworks till recently. Is the possible “window” closing a reason to accelerate the IPO? According to Fortune magazine, to actually list in the 2014 calendar year, this was basically the last week for Hortonworks to make the S-1 public (due to a combination of holidays and regulatory waiting periods).
Ultimately, it’s unlikely that Hortonworks will be alone as a public company for long. MapR told the Wall Street Journal they want to IPO next year, and they claim to have more customers, high margins and “efficient cash management.” Cloudera says they “are not ready yet,” though they have a lower rate of losses, and also claim more customers. At the end of the day, the answer may be rather simple. And again, answering a question with a question: if not now, when? There may not be a better time.
Category: Apache Big Data Cloudera Gartner Hadoop Hortonworks HP Industry trends IPO MapR Microsoft Teradata Yahoo! Tags: Apache, big data, Cloudera, Gartner, Hadoop, Hortonworks, initial public offering, IPO, MapR, Microsoft, Teradata, Yahoo!
by Merv Adrian | October 31, 2014 | 10 Comments
New York’s Javits Center is a cavernous triumph of form over function. Giant empty spaces were everywhere at this year’s empty-though-sold-out Strata/Hadoop World, but the strangely numbered, hard-to-find, typically inadequately sized rooms were packed. Some redesign will be needed next year, because the event was huge in impact and demand will only grow. A few of those big-tent pavilions you see at Oracle OpenWorld or Dreamforce would drop into the giant halls without a trace – I’d expect to see some next year to make some usable space available.
So much happened, I’ll post a couple of pieces here. Last year’s news was all about promises: Hadoop 2.0 brought the promise of YARN enabling new kinds of processing, and there was promise in the multiple emerging SQL-on-HDFS plays. The Hadoop community was clearly ready to crown a new hype king for 2014.
This year, all that noise had jumped the Spark.
If you have not kept up, Apache Spark bids to supplement – some would say replace – MapReduce with a more general-purpose engine, combining interactive processing and streaming along with MapReduce-like batch capabilities, leveraging YARN to enable a new, much broader set of use cases. (See Nick Heudecker’s blog for a recent assessment.) It has a commercializer in Databricks, which has shown great skill in assembling an ecosystem of support from a set of partners enabling it to work with multiple key Hadoop stack projects at an accelerating pace. That momentum was reflected in the rash of announcements at Hadoop World, across categories from Analytics to Wrangling (couldn’t come up with a Z). There were more than I’ll list here – their vendors are welcome to add themselves via comments, and I’ll curate this post for a while to put them in.
Hadoop analytics pioneer Platfora announced its version 4.0 with enhanced visualizations, geo-analytics capabilities and collaboration features, and revealed it has “plans for integration” with Spark.
Tableau was a little more ready, delivering a beta version of its Spark Connector, claiming its in-memory offering delivered up to 100x the performance of Hadoop MapReduce. Tableau is also broadening its ecosystem reach, adding a beta version of its connector for Amazon EMR, and support for IBM BigSQL and MarkLogic.
Tresata extended the analytics wave to analytic applications, enhancing its customer intelligence management software for financial data by adding real-time execution of analytical processes using Spark. Tresata is an early mover, and believes one of its core advantages derives from having been architected to run entirely in Hadoop early on. It supports its own data wrangling with Automated Data Ontology Discovery and entity resolution – cleaning, de-duping, and parsing data.
(For developers, Tresata is also open sourcing Scalding-on-Spark – a library that adds support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark.)
Appliances were represented by Dell, which introduced a new in-memory box (one of many Hadoop appliances that represented another 2014 trend) integrating Spark with Cloudera Enterprise. (Dell is all in on the new datastores – it has built architectures with DataStax for Cassandra, and with MongoDB, as well.) And Cray, having completed its spinback of YarcData, unveiled its Urika-XA platform with Hadoop and Spark pre-installed, leveraging its HPC expertise to exploit SSDs, parallel file systems, and high-speed interconnects – a test run to see if there is a high-end performance market yet.
Cloud was brought to the party by BlueData, packaging Spark with its EPIC™ private-cloud deployment platform. Standalone Spark clusters can run Spark-Scala, MLLib or SparkSQL jobs against data stored in HDFS, NFS and other storage. Note “standalone” – Spark can, and will, be used by shops that are not running Hadoop. Once it is actually running production jobs, that is.
Rackspace is in both games with its OnMetal – an appliance-based cloud you don’t have to own, with a high-performance design using 3.2 TB per data node. They provision the other services. Rackspace is partnering with Hortonworks to deliver HDP 2.1 or – you guessed it – Spark. This is all built on a thin virtualization layer on another emerging hot platform: Openstack.
The distributions were represented, of course: Cloudera jumped in back in February, accompanied by strong statements from Mike Olson that helped put Spark on the map. Hortonworks followed in May with a tech preview. It is still in preview – Hortonworks, for good reasons, is not quite prepared to call it production-ready yet. Pivotal support was announced in May – oddly, in the Databricks blog, reflecting its on-again, off-again marketing motions. In New York, MapR, on the bandwagon since April as well, announced that Drill – itself barely out of the gate – will also run on Spark.
It was intriguing to note that many of the emerging data wrangling/munging/harmonizing/preparing/curating players started early. ClearStory CEO Sharmila Mulligan was quick to note during her keynote appearance that her offering has been built on Spark from the outset. Paxata, another of the new players, with a couple of dozen licensed customers already, has also built its in-memory, columnar, parallel enterprise platform on top of Apache Spark. It connects directly to HDFS, RDBMSs, and web services like Salesforce.com, and publishes to Apache Hive or Cloudera Impala. Trifacta, already onto its v2, has now officially named its language Wrangle, added native support for more complex data formats, including JSON, Avro, ORC and Parquet, and, yes, is focusing on delivering scale for its data transformation through native use of both Spark and MapReduce.
Even the conference organizers got into the act. O’Reilly has made a big investment with Cloudera to make Strata a leading conference. It’s added a European conference, making Doug Cutting the new conference Chair. In New York, O’Reilly announced a partnership with Databricks for Spark developer certification, expanding the franchise before someone else jumps in.
There is far more to come from Spark: a memory-centric file system called Tachyon that will add new capabilities above today’s disk-oriented ones; the MLlib machine learning library that will leverage Spark’s superior iterative performance; GraphX for the long-awaited graph performance that today is best served by commercial vendors like Teradata Aster; and of course, Spark Streaming. But much of that is simply not demonstrably production-ready just yet – much is still in beta. Or even alpha. We’ll be watching. For now, it’s the new hype king.
Category: Accumulo Amazon Apache Apache Yarn Aster Avro Big Data BigInsights Cascading Cassandra Cloudera Cray Elastic MapReduce Gartner Hadoop HDFS Hive Hortonworks IBM MapR MapReduce Microsoft Spark Uncategorized YARN Tags: Apache, Aster, Avro, big data, BigInsights, BigSQL, BlueData, Cassandra, CDH, Cloudera, Databricks, Datastax, EMR, Gartner, Hadoop, Hbase, HDFS, Hive, Hortonworks, IBM, Impala, JSON, MapR, MapReduce, MarkLogic, Microsoft, MLlib, MongoDB, Openstack, ORC, Parquet, Paxata, Platfora, Rackspace, Scalding, Spark, SQL, Tableau, Tachyon, Tresata, Trifacta, Yarn
by Merv Adrian | October 13, 2014 | 3 Comments
Hopefully, that title got your attention. A recursive acronym – the term first appeared in the book Gödel, Escher, Bach: An Eternal Golden Braid and is likely more familiar to tech folks who know GNU – is self-referential (as in “GNU’s Not Unix”). So how did I conclude that Hadoop, whose name origin we know, fits the definition? Easy – like everyone else, I’m redefining Hadoop to suit my own purposes.
Let’s start with the obvious one. Of course, Doug Cutting named Hadoop after his child’s toy elephant, seen here.
Photo: Merv Adrian
And in its early days, as I discussed in my post about the changing composition of distributions a few months back, the story was simpler. Hadoop was HDFS, MapReduce and some utilities. As those utilities got formalized and became projects themselves and were supported by commercial distributors, the list grew: Pig, Hive, HBase, and Zookeeper were Hadoop too. And a few months ago, as I noticed, Accumulo, Avro, Cascading, Flume, Mahout, Oozie, Spark, Sqoop, and YARN had joined the list.
YARN is the one that really matters here, not just because the list of components will change, but because in its wake the new components will change Hadoop’s meaning. YARN enables Hadoop to be more than a brute-force, batch blunt instrument for analytics and ETL jobs. It can be an interactive analytic tool, an event processor, a transactional system, a governed, secure system for complex, mixed workloads. At Strata this week, we’ll talk about its integration with Red Hat’s middleware, its cautious alliance with Spark for MapReduce replacement, its alliance with data wrangling tools from startups and Teradata, its connection, via Sentry, to security stacks… and more.
So yes, many of us are redefining Hadoop as we add new pieces – new use cases, new projects that change its very nature. My answer to “What is Hadoop”?
OK – it’s a bit cute. But hopefully, it got your attention. Hadoop’s journey is just beginning, and there is much more change ahead.
Category: Accumulo Apache Apache Yarn Big Data Cascading Flume Gartner Hadoop Hbase HDFS Hive Mahout MapReduce Oozie Pig Spark Sqoop Teradata YARN Zookeeper Tags: Apache, Flume, Gartner, Hadoop, Hbase, HDFS, Hive, MapReduce, Oozie, Pig, Sqoop, Teradata, zookeeper
by Merv Adrian | October 10, 2014 | Comments Off
From my esteemed colleague Mark Beyer
“Unstructured data” is a misnomer – everyone finally agrees on that much, at least. It is a term often applied to information assets that are not relational. Sometimes it is applied to machine data generated by operational technologies. Sometimes it is applied to content, like documents, or even less specific text such as email or Twitter feeds. The term unstructured creates fear, loathing and desire across the IT landscape. If it generates meaningful discussions about innovative processing, or challenges the re-use of existing infrastructure without simply defaulting to a mindset of replacing components, then that is a good discussion. But that is about all the term is good for.
But I’ve been growing ever more fond of “semi-structured” – though not for the reason many who are enamored of the term might assume. Semi actually means half, but that doesn’t mean it is halfway between structured and unstructured, because that would be halfway between myth and reality, which is a “place” that doesn’t actually exist. Consider the vernacular American use of the term “semi-truck”, which taken literally would mean “half truck”. But, of course, that isn’t what semi-truck means. The truck vehicle – properly referred to as the “tractor” – is “half” of the truck, and the trailer is the other half. Separately, neither can move cargo. Put them together and you have two halves that make a whole: a complete, useful delivery vehicle. So a semi-truck is actually made up of a semi-tractor and a semi-trailer, which makes, simply, a whole truck.
Now consider semi-structured data. What it actually means is that “half” of the governance and schema instructions (defined as physical and logical definition plus applied governance) are in the data and the other half are in the using application. Semi-structured data means that you MUST write an application to complete the schema. That also means that with each different application, you can impose different governance and schema instructions (also the processing rules), theoretically making the data more flexible.
However, with this capability comes the danger of disparate governance rules regarding the same data – differences that may be so significant the result is effectively different data. As data moves further away from containing its own schema, flexibility increases and more and more of the entire schema must be imposed by the using application – and ever greater diversity is introduced. Of course application developers think this is a grand idea, but users of that data in a second environment are not equally fond of it.
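A small, hypothetical sketch of that point: two “applications” reading the same JSON record, each completing the schema in its own way. The record and field names here are invented for illustration; only the field names travel with the data, while types, units and governance come from each consuming application:

```python
import json

# One semi-structured record: half the "schema" (the field names) travels
# with the data; the other half (types, units, rules) is in the application.
record = json.loads('{"id": "42", "amount": "19.99", "ts": "2014-10-09"}')

# Application A completes the schema one way: id is an integer,
# amount is a float in dollars.
def app_a(rec):
    return {"id": int(rec["id"]), "amount_usd": float(rec["amount"])}

# Application B completes it differently: id stays a string,
# amount becomes an integer number of cents.
def app_b(rec):
    return {"id": rec["id"], "amount_cents": round(float(rec["amount"]) * 100)}

# Same bytes on disk, two different datasets once each application's half
# of the schema is imposed -- exactly the divergence risk described above.
print(app_a(record))
print(app_b(record))
```

Run both against a shared store and the “same” data really has become different data, which is why the second environment’s users are rarely as fond of the flexibility as the first application’s developers.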
In a data lake, we drop all of these various degrees of structure into a common pool or body. Each asset is deposited with a different level of dilution regarding the schema instructions, thus requiring an oceanographer’s level of expertise to determine when the data in a data lake has “high alkalinity” or “high acidity” or too much “copper” and so on. A data scientist who understands the “alchemy” that exists in the fetid waters of a polluted lake will have no trouble cleaning the data up and discerning pollutants from di-hydrogen oxide (that’s water). But a novice or dilettante may find that they are drinking polluted data water from the lake long after they are infected with the equivalent of data E.coli and are making continuous visits to the data latrine.
To be completely fair, comparing the waters of a lake to the data lake requires some departures from the metaphor, but there are further extensions that do work (beyond those obnoxious points I raise above). For example, a lake that is crystal clear somewhere in the mountains of upstate New York would be a dead lake. There would be no fish, no microscopic organisms, no snakes, no mosquitos flying over its surface. It would be crystal clear and completely dead. So, if you desire a completely clean lake, you will find a sanitized data environment that defeats the entire purpose of storing information in something near its native form. You want a vibrant, living data lake. But you must keep the trucking metaphor in mind and remember that unstructured data does not exist. Instead, semi-structured data, which intentionally has only SOME of the schema embedded in the information asset, requires a knowledge of how to swim. And boats (tools for exploring the data lake) don’t really work either – anyone who cannot swim must be very careful to NOT fall out of the boat. But anyone who can swim can safely use a boat, confident that if they fall over the side they can make it back to shore.
Data lakes are for data scientists to conduct science, they are not for casual analytics users or even advanced business analysts who generally scuff their code-writing and script-writing toes on ignorance. But those data miners that understand business process and systems analysis, or those data scientists who understand model theory and statistical primitives? Well, they can swim all day. So, buy a boat (a tool), explore away, and be comfortable in the knowledge that some of the structure is waiting for you in the lake, but not all of it—your job is to figure out the rest including sometimes reaching over the side of the boat and getting a little wet because you never know when you will fall in.
Merv’s comment: this has been a topic of some discussion on our team. My own view is very close to Mark’s, and is expressed in my closing slide from the Hadoop Summit.
Category: Big Data data lake data warehouse Gartner Hadoop metadata Tags: data lake, data warehouse, Hadoop, metadata
by Merv Adrian | October 9, 2014 | 1 Comment
At Gartner Symposium, Drue Reeves and I had the opportunity to interview Microsoft CEO Satya Nadella. Here’s a brief clip from the closing, in which I summarize and Satya, passionate as he was throughout the conversation, lays out his vision of mobility that crosses the personal and professional: mobility of the individual and of the app experiences. “Have my work and life wherever – that’s the true form of mobility.”
Category: Uncategorized Tags: