To allow our readers to compare Hadoop distribution vendors side by side, I forwarded the same set of questions to the participants of the recent Hadoop panel at the Gartner Catalyst conference. The questions come from Catalyst attendees and from Gartner analysts.
Today, I publish responses from Amr Awadallah, CTO of Cloudera.
1. How specifically are you addressing variety, not merely volume and velocity?
Variety is inherently addressed by the fact that Hadoop at its heart is a file system that can store anything. There is also additional functionality to map structure on top of that variety (e.g. HCatalog, Avro, Parquet). We are evolving Cloudera Navigator to aid with metadata tracking for all that variety.
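The schema-on-read pattern described in this answer — store raw bytes first, map structure on top at read time — can be sketched in a few lines of Python. This is an illustrative stand-in using the standard library, not HCatalog, Avro, or any Cloudera tooling:

```python
import csv
import io

# Raw bytes land in storage as-is; nothing enforces a schema on write
# (the "file system that can store anything" property described above).
raw = "2013-07-01,click,42\n2013-07-01,view,17\n"

# Structure is mapped on top only at read time (schema-on-read) --
# in Hadoop, the role played by layers such as HCatalog, Avro, or Parquet.
schema = ["date", "event", "count"]
records = [dict(zip(schema, row)) for row in csv.reader(io.StringIO(raw))]

# Each raw line is now a structured record, e.g. records[0]["event"] == "click"
```

The point of the sketch: a different schema (or a richer one) can be applied to the same stored bytes later, which is what makes "variety" cheap to absorb.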
2. How do you address security concerns?
Hadoop has a very strong security story. Hadoop supports authentication through Kerberos, fine-grained access control (through Sentry), native on-the-wire encryption, and perimeter access control through HttpFS and Oozie gateways. Furthermore, Cloudera Navigator supports audit reporting (to see who accessed what and when). We also have a number of technology partners offering on-disk encryption. CDH4 is already deployed in a number of financial, health, and federal organizations with very strict security requirements.
3. How do you provide multi-tenancy on Hadoop? How do you implement multiple unrelated workloads in Hadoop?
First, you have been able to run multiple MapReduce jobs concurrently in Hadoop for a number of years now. CDH ships with the fair scheduler, which arbitrates resources across a number of jobs with different priorities. Second, as of CDH4, and soon in CDH5, we are including YARN (from Hadoop 2), which allows for arbitration of resources across multiple frameworks (not just MapReduce). In preparation for CDH5, we are doing a lot of work in both CDH and Cloudera Manager to make it easy to arbitrate resources across multiple isolated workloads on the same hardware infrastructure.
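The core idea behind the fair scheduler mentioned above is weighted proportional sharing across pools of jobs. This toy calculation (my simplification, not Hadoop's actual scheduler code) shows the arithmetic:

```python
def fair_shares(total_slots, pool_weights):
    """Split a cluster's task slots across pools in proportion to
    their weights -- a toy sketch of weighted fair sharing, not
    the actual Hadoop fair scheduler implementation."""
    total_weight = sum(pool_weights.values())
    return {pool: total_slots * weight / total_weight
            for pool, weight in pool_weights.items()}

# Hypothetical pools: two high-priority workloads and one ad-hoc pool
# contending for 100 task slots.
shares = fair_shares(100, {"etl": 2, "analytics": 2, "adhoc": 1})
# etl and analytics each get 40 slots; adhoc gets 20
```

In the real scheduler, shares are recomputed continuously as jobs arrive and finish, so an idle pool's capacity flows to busy pools instead of sitting unused.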
4. What are the top 3 focus areas for the Hadoop community in the next 12 months to make the framework more ready for the enterprise?
- Resource Management for multiple workloads
- Even stronger security
- Overall stability/hardening.
5. What are the advancements in allowing data to be accessed directly from native source without requiring it to be replicated?
Hadoop is not a query federation engine; rather, it is a storage and processing engine. You can run query federation engines on top of Hadoop, which will allow you to mix data from inside Hadoop with data from external sources. Note, however, that you will only get the performance/scalability/fault-tolerance benefits of Hadoop when the data is stored in it (as it tries to move compute as close as possible to the disks holding the data). So while federation might be fine for small amounts of data, it isn't advised for big data per se.
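The "move compute close to the data" point in this answer is the locality preference at the heart of Hadoop scheduling. A minimal sketch of that preference (hypothetical node names, not Hadoop's actual placement code):

```python
def pick_node(block_replicas, nodes_with_free_slots):
    """Prefer a node that already holds a replica of the data block
    (data-local task), falling back to any free node. A simplified
    sketch of Hadoop's locality preference, not its real scheduler."""
    for node in nodes_with_free_slots:
        if node in block_replicas:
            return node, "data-local"
    return nodes_with_free_slots[0], "remote"

# Hypothetical cluster state: the block is replicated on three nodes.
replicas = {"nodeA", "nodeC", "nodeF"}
node, locality = pick_node(replicas, ["nodeB", "nodeC", "nodeD"])
# nodeC wins: it already stores the data, so no network copy is needed
```

Federated data that lives outside the cluster can never be data-local, which is why the answer cautions against federation for truly large datasets.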
6. What % of your customers are doing analytics in Hadoop? What types of analytics? What’s the most common? What are the trends?
I can't reveal exact percentages, as that is confidential. But it is fair to say that the most common use case is still batch transformations (the ETL offload use case). Frequently customers see ETL/ELT jobs speed up from hours to sub-minute once moved to Hadoop (for a fraction of the cost). The second most common use case is what we call Active Archive, i.e. the ability to run analytics over a long history of detail-level data (which requires an economical solution to justify keeping all that data accessible versus moving it to tape). While data science is typically thought of as synonymous with Hadoop/Big Data, it is actually not yet the most common use case, as it comes later in the maturity cycle of adopting Hadoop.
7. What does it take to get a CIO to sign off on a Hadoop deployment?
Typically it is easiest to argue the operational efficiency advantages to the CIO/CFO, i.e. you will be able to do your existing ETL in 1/10th the time at 1/10th the cost. Once Hadoop is deployed inside the enterprise, they start to see all the new capabilities that it brings (dynamic schemas, the ability to consolidate all data types, and the ability to go beyond SQL). That is when the Hadoop clusters start to expand, as the CIO sees that they can create more value for the business by doing things that couldn't be done before.
8. What are the Hadoop tuning opportunities that can improve performance and scalability without breaking user code and environments?
Tuning Hadoop is a very complicated task to do manually. One of the core features of Cloudera Manager, and of our support knowledge base, is helping you with exactly that problem. This can span changing the hardware itself, tuning operating system parameters, and adjusting Hadoop configurations.
9. How much of Hadoop 2.0 do you support, given that it is still Alpha pre Apache?
Hadoop 2.0 has two parts: all the improvements to HDFS, which are actually much more stable than Hadoop 1.0 (let's call that HDFS 2.0), and then the new additions of YARN and MapReduce NG (let's call that MapReduce 2.0). We have been shipping Hadoop 2.0 for more than a year now in CDH4. This allowed our customers to get the better fault-tolerance and higher-availability features of HDFS in Hadoop 2.0.
That said, we didn't replace MapReduce 1 with MapReduce 2; rather, we shipped both in CDH4 and marked MapReduce 2 as a technology preview. This allowed our partners (and early-adopter customers) to start experimenting and building newer applications on the MapReduce 2 API while continuing to use MapReduce 1 as is. As the Apache community releases new code improving YARN and MapReduce 2 stability, we continue to add that code to CDH4, and eventually it will be rolled out with CDH5.
10. What kind of RoI do your customers see on Hadoop investments – cost reduction or revenue enhancement?
This obviously varies by customer, especially in cases where you are extracting new value from your data vs. just saving costs. On a cost-saving basis, Hadoop systems typically cost a few hundred dollars per TB, which is 1/10th the typical cost of RDBMS systems.
11. Are Hadoop demands being fulfilled outside of IT? What’s the percentage? Is it better when IT helps?
The majority of our production Hadoop deployments are operated by the IT teams. We offer training specifically for system administrators and DBAs in IT departments to get comfortable managing CDH clusters.
12. How important will SQL become as mechanism for accessing data in Hadoop? How will this affect the broader Hadoop ecosystem?
SQL will be one of the key mechanisms for accessing data in Hadoop, given the wide ecosystem of tools and developers that know how to speak SQL. That said, it will not be the only way. For example, Cloudera now offers search on top of Hadoop (still in beta), which is much more suitable for textual data. We also have a strong partnership with SAS for statistical analysis, which again is better suited than SQL for such problems.
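The draw of SQL access is that an analyst's existing queries and skills carry over unchanged. As an illustration only — using Python's built-in sqlite3 as a stand-in engine, not Hive or Impala — this is the kind of aggregate an analyst would submit as-is against tables backed by files in Hadoop:

```python
import sqlite3

# Stand-in engine for illustration; in practice the same SQL text would
# be submitted to Hive or Impala over data stored in HDFS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, event TEXT, n INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)",
                 [("2013-07-01", "click", 42), ("2013-07-01", "view", 17)])

# A familiar aggregate query -- unchanged whether the engine is an
# RDBMS or a SQL-on-Hadoop layer.
total = conn.execute(
    "SELECT SUM(n) FROM events WHERE day = '2013-07-01'").fetchone()[0]
# total == 59
```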
13. For years, Hadoop was known for batch processing and primarily meant MapReduce and HDFS. Is that very definition changing with the plethora of new projects (such as YARN) that could potentially extend the use cases for Hadoop?
Absolutely, Hadoop (and CDH in particular) is evolving to be a platform that can support many types of workloads. We have MapReduce/Pig/Hive for batch processing and transformations. We have Impala for interactive SQL and analytics. We have Search for unstructured/textual data. We have partnerships with SAS and R for data mining and statistics. I expect us to see more workloads/applications move to the platform in the future.
14. What are some of the key implementation partners you have used?
TCS, Capgemini, Deloitte, Infosys, T-Systems and Accenture.
15. What factors most affect YOUR Hadoop deployments (eg SSDs; memory size.. and so on)? What are the barriers and opportunities to scale?
The most important factor is to make sure that the cluster is spec'ed correctly (both servers and network), the operating system is configured correctly, and all the Hadoop parameters are set to match the workloads that you will be running. We offer a zero-to-Hadoop consulting engagement that helps our customers get up and running in the best way possible. We also put a lot of our smarts about running clusters efficiently into Cloudera Manager, which tunes the cluster for you.
Hadoop scales really well as a technology once configured correctly. We have a number of customers running at the scale of thousands of nodes with tens of petabytes of data.
Amr Awadallah, Ph.D., Chief Technology Officer at Cloudera
Before co-founding Cloudera in 2008, Amr (@awadallah) was an Entrepreneur-in-Residence at Accel Partners. Prior to joining Accel he served as Vice President of Product Intelligence Engineering at Yahoo!, and ran one of the very first organizations to use Hadoop for data analysis and business intelligence. Amr joined Yahoo! after it acquired his first startup, VivaSmart, in July 2000. Amr holds Bachelor's and Master's degrees in Electrical Engineering from Cairo University, Egypt, and a Doctorate in Electrical Engineering from Stanford University.
Follow Svetlana on Twitter @Sve_Sic
The Gartner Blog Network provides an opportunity for Gartner analysts to test ideas and move research forward. Because the content posted by Gartner analysts on this site does not undergo our standard editorial review, all comments or opinions expressed hereunder are those of the individual contributors and do not represent the views of Gartner, Inc. or its management.