This is a joint post authored with Nick Heudecker
There were many questions asked after the last quarterly Hadoop webinar, and Nick and I have picked a few that were asked several times to respond to here. Just to be clear, we don’t generally give deep advice in blog posts – our blogs are descriptive, not prescriptive. Gartner clients submit inquiries, and justifiably think that since they are paying for the privilege, we should reserve that for them.
1. What was that document you referred to that listed a lot of Hadoop use cases?
2. Is Apache Spark replacing Hadoop or complementing existing Hadoop practice?
Both are already happening. Given the ongoing uncertainty about “what is Hadoop,” there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures. At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
3. Does Hadoop only work on structured data, where the data schema is strictly defined in detail? Or is Hadoop used widely on unstructured data today?
Far more the latter in its earliest uses – clickstreams, log files, machine readings, and the text data sitting in “dark data” archives were among the earliest poster children. Parsing and extracting data from such sources was at the root of the earliest creators’ objectives. But it doesn’t need to stop there, and it hasn’t.
4. “Don’t Microsoft’s products like PowerPivot have a long term advantage over those of Apache?”
That’s a false dichotomy. Microsoft is an enthusiastic participant with its HDInsight offering and Hortonworks and Cloudera partnership – its tools are already designed to work with data in Hadoop (in the cloud or on-premises, or – and this is exciting – both.) Complementary is the word here. More on cloud below.
5. How much Hadoop do we see in the cloud?
A great deal, and we hope to have better metrics soon. As the webinar mentioned, Amazon got there first with a commercial MapReduce offering (Elastic MapReduce), has hosted millions of clusters since then, and is likely to expand its partnerships with distribution vendors beyond the already promising one with MapR.
6. How is NoSQL related to Hadoop?
This question comes up frequently. The easiest classification is to think of NoSQL as OLTP-like and Hadoop as OLAP-like. Neither classification is entirely fair to either category, but it can be a helpful way to get started. The two technology categories can be integrated, just as relational DBMSs and Hadoop can be integrated. A number of vendors have announced partnerships along these lines.
7. What is the best way to get some hands on experience with Hadoop?
First figure out what you want to do – operate a cluster or use a cluster. Then start. If you want to use a cluster, nearly every vendor, if not every vendor, offers something like a preconfigured virtual machine instance that you can simply download and use. Many of these include tutorials on data loading and processing. If you don’t have data, check out data.gov, data.worldbank.org, or this list of thousands of public data sources (https://bitly.com/bundles/bigmlcom/4). If you want to operate a cluster, download the free version from any vendor and get it working. You can also use cloud options if you want to spin up a cluster and experiment. Additionally, several vendors offer training and certification programs, some of which are free.
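Once you have a sandbox cluster running, a classic first exercise is word count via Hadoop Streaming, which lets any executable that reads stdin and writes stdout act as a mapper or reducer. The sketch below is illustrative, not from any vendor tutorial; the file name `wc.py` and the command-line `map`/`reduce` switch are our own conventions for keeping both phases in one script.

```python
# Word count for Hadoop Streaming: one script acting as mapper or reducer.
# Streaming passes text lines on stdin and expects tab-separated
# key/value pairs on stdout.
import sys


def map_words(lines):
    """Mapper phase: emit one 'word<TAB>1' pair per word."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word.lower()}\t1"


def reduce_counts(pairs):
    """Reducer phase: sum the counts per word. Hadoop's shuffle phase
    delivers pairs grouped by key; a dict also handles unsorted input."""
    counts = {}
    for pair in pairs:
        word, n = pair.strip().split("\t")
        counts[word] = counts.get(word, 0) + int(n)
    return counts


if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    if mode == "map":
        for out in map_words(sys.stdin):
            print(out)
    else:
        for word, n in sorted(reduce_counts(sys.stdin).items()):
            print(f"{word}\t{n}")
```

You would submit it with the streaming jar shipped in your distribution, along the lines of `hadoop jar hadoop-streaming.jar -mapper "python wc.py map" -reducer "python wc.py reduce" -input <in> -output <out>` (the exact jar path varies by distribution and version).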
8. Do you have any word on Apache Flink?
Nick is planning on a longer blog post on Flink, but so far none of our clients have asked and we’re not aware of any Hadoop vendors supporting it. Flink appears to address many of the same use cases as Spark. As we understand it, and we don’t claim our understanding is correct, Flink takes some inspiration from RDBMSs in how it handles data, allowing for more control over memory use and additional optimizations to iterative processing.
9. There are a lot of Hadoop products coming up/developed every day. Is there a common site which talks about the latest developments, giving accurate information?
We’re not aware of a common site, but we’ve found Hadoop Weekly (http://www.hadoopweekly.com/) very helpful at catching things we may have missed.