To allow our readers to compare Hadoop distribution vendors side by side, I forwarded the same set of questions to the participants of the recent Hadoop panel at the Gartner Catalyst conference. The questions came from Catalyst attendees and from Gartner analysts.
Today, I publish responses from Susheel Kaushik, Chief Architect of Pivotal Data Fabrics.
1. How specifically are you addressing variety, not merely volume and velocity?
The term Big Data, as used in the industry, is primarily associated with the volume of data. A longer historical perspective, or a very high velocity (the frequency at which events are generated), results in a large volume of data. Handling various types of data from various data sources is crucial to establishing a complete business application at enterprises. Customers implementing the data lake use case use the Pivotal HD and HAWQ platform for storing and analyzing all types of data – structured and unstructured. Some examples of data that can be stored and analyzed on the platform are:
- Events, Log Files, CDRs, Mobile Data, legacy system and app files
Stored in Sequence, Text, Comma Separated, JSON, Avro, Map, Set, Array Files and analyzed via the standard interfaces such as SQL, Pig, HiveQL, HBase and MapReduce.
- Data from other databases or structured sources
Stored in Hive RCFile/ORC/Parquet and HFile file formats and analyzed via the standard interfaces such as SQL, Pig, HiveQL, HBase and MapReduce.
- Social network feeds
Ability to access and store feeds from social network sources, such as Twitter, and to enable text analytics on the data.
Stored on HDFS and analyzed using image analytical algorithms.
Stored on HDFS and analyzed using complex video and image analytical algorithms.
- Time Series
Storage and analysis of time series data, and generation of insights from it.
In addition, Pivotal HD allows users to extend the supported formats by providing custom input and output formatters. Customers can also extend HAWQ to support proprietary data formats.
2. How do you address security concerns?
At a very high level, data security is about user management and data protection.
- User management is primarily focused on creation/deletion of user accounts, defining access policies and role permissions along with mechanisms for authorization and authentication.
- Data protection on the other hand is focused on
- Governing access to data.
Manage the various actions a user can take on data.
- Encrypting data at rest
Using the standard encryption techniques to encrypt data at rest – in this case the files on HDFS
- Encrypting data in motion
Apply standard encryption techniques to encrypt data in motion as well, when it is sent over the wire from one Java process to another.
- Masking/tokenizing data at load time.
With tokenizing, users can get back to the original data if they have access to the correct ‘key’, whereas masking is a one-way transformation and users cannot get back to the original data.
Pivotal HD controls access to the cluster using the Kerberos authentication mechanism. Pivotal HD, along with partner products, supports data encryption at rest and in motion as well as masking/tokenization. HAWQ provides security at the table and view level on HDFS, thereby bringing enterprise-class database security to Hadoop.
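The difference between tokenizing and masking described above can be illustrated with a small Python sketch. It is not how Pivotal HD or any partner product implements these features: the `TokenVault` class and the `mask` function are hypothetical names, and the in-memory dictionary stands in for whatever protected key store a real deployment would use.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory vault that maps random tokens back to originals."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the token for a repeated value so joins on the column still work.
        if value not in self._value_to_token:
            token = "tok_" + secrets.token_hex(8)
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def detokenize(self, token: str) -> str:
        # Only callers with access to the vault (the "key") can recover the value.
        return self._token_to_value[token]


def mask(value: str, visible: int = 4) -> str:
    """One-way masking: keep the last `visible` characters and hide the rest."""
    hidden = max(len(value) - visible, 0)
    return "*" * hidden + value[hidden:]


if __name__ == "__main__":
    vault = TokenVault()
    card = "4111111111111111"
    token = vault.tokenize(card)
    print(token)                    # random token, different on every run
    print(vault.detokenize(token))  # original value, recoverable via the vault
    print(mask(card))               # masked value, original not recoverable
```

The tokenized value can be reversed by anyone who can reach the vault, while the masked value carries no information from which the hidden characters could be restored.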
3. How do you provide multi-tenancy on Hadoop? How do you implement multiple unrelated workloads in Hadoop?
- Pivotal VRP allows fine-grained control of system resources (IO and CPU) for multi-tenancy management across Pivotal HD, HAWQ and Greenplum Database.
- HAWQ provides workload multi-tenancy. Queries run in interactive times (100x faster) which in turn allows for higher multi-tenancy.
- Cost based optimizer generates optimum query plans to deliver better query execution times.
- Scatter gather technology significantly improves data loading rates.
- Pivotal HD includes Hadoop Virtual Extensions to enable better execution on VMware vSphere. vSphere provides very strong multi-tenancy and resource isolation for virtual environments.
4. What are the top 3 focus areas for the Hadoop community in the next 12 months to make the framework more ready for the enterprise?
- Kerberos integration is a starting point for enterprises. It needs to be easy to scale and to integrate with their existing policy management tools (e.g. RSA NetWitness, monitoring for policy violations).
- Support for Access Control Lists rather than the traditional authorization models of security attributes at a user level.
- Unified Meta Data Repository
- Single metadata across HCatalog (HCatalog supports MapReduce, Pig and Hive) and HBase, HAWQ and other data.
- Current implementation of the metadata server has scale challenges that need to be resolved for adoption in enterprise environments.
- Namenode scalability
Enterprise environments need to store a larger number of files on the platform.
- The current NameNode has a limit on the number of files and directories (around 150 million objects).
- The current option is to run the NameNode server with more memory to work around the physical memory limitation.
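The relationship between object count and NameNode memory can be estimated with simple arithmetic. The Python sketch below assumes a commonly quoted rule of thumb of roughly 150 bytes of heap per namespace object (file, directory or block); that figure is an assumption for illustration, not a number from this interview, and real usage depends on the Hadoop version and path lengths.

```python
# Assumed rule of thumb: bytes of NameNode heap per namespace object.
# This is an illustrative estimate, not a measured or vendor-supplied value.
BYTES_PER_OBJECT = 150


def namenode_heap_bytes(num_objects: int,
                        bytes_per_object: int = BYTES_PER_OBJECT) -> int:
    """Estimate the heap needed to hold `num_objects` namespace objects."""
    return num_objects * bytes_per_object


def max_objects(heap_bytes: int,
                bytes_per_object: int = BYTES_PER_OBJECT) -> int:
    """Estimate how many namespace objects fit in `heap_bytes` of heap."""
    return heap_bytes // bytes_per_object


if __name__ == "__main__":
    needed = namenode_heap_bytes(150_000_000)
    print(f"150 million objects need about {needed / 10**9:.1f} GB of heap")
    print(f"64 GB of heap holds about {max_objects(64 * 10**9):,} objects")
```

Under this assumption, the object limit grows only linearly with the memory of a single server, which is why adding memory is a workaround rather than a fix for namespace scalability.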
5. What are the advancements in allowing data to be accessed directly from native source without requiring it to be replicated?
Data replication can be categorized in two ways: interplatform (copies of the data made across multiple platforms) and intraplatform (multiple copies of the data within a platform).
- Unified Storage Services (USS) allows users to access data residing on multiple platforms without copying it. USS does the metadata translation, and the data is streamed directly from the source platform. USS gives applications access to data residing on multiple platforms without increasing infrastructure costs at the customer end.
- HAWQ stores data directly on HDFS and eliminates the need for multiple copies of the data. Traditionally, customers had a data warehouse for their ETL and operational workload data and made a copy of the data on Hadoop for complex ad hoc analytical processing.
- The PXF framework (part of HAWQ) allows users to extend HAWQ SQL support to proprietary data formats. Native support for proprietary formats reduces the need to make multiple copies of the data for analytics.
- Hadoop has a storage efficiency of ~30%, as it maintains 3 copies of the data to prevent data loss in case of multiple node failures. EMC Isilon storage improves the raw storage efficiency for data stored on HDFS to 80%. Isilon OneFS natively supports multiple protocols (NFS, CIFS and many more) along with HDFS, thereby enabling the same storage platform to be used for multiple.
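The storage efficiency figures above follow from simple arithmetic: with three copies of every block, about one third of raw capacity holds unique data. The Python sketch below works that out; the function names are illustrative, and the 80% figure is taken from the text rather than derived.

```python
def replication_efficiency(replication_factor: int) -> float:
    """Fraction of raw capacity that holds unique data under N-way replication."""
    return 1.0 / replication_factor


def raw_capacity_needed(usable_tb: float, efficiency: float) -> float:
    """Raw capacity (TB) required to store `usable_tb` of unique data."""
    return usable_tb / efficiency


if __name__ == "__main__":
    hdfs_efficiency = replication_efficiency(3)  # default HDFS replication of 3
    isilon_efficiency = 0.80                     # figure quoted in the text

    print(f"HDFS efficiency with 3 copies: {hdfs_efficiency:.0%}")
    print(f"Raw TB for 100 TB on HDFS: "
          f"{raw_capacity_needed(100, hdfs_efficiency):.0f}")
    print(f"Raw TB for 100 TB at 80% efficiency: "
          f"{raw_capacity_needed(100, isilon_efficiency):.0f}")
```

For 100 TB of unique data, this works out to roughly 300 TB of raw capacity with three-way replication versus roughly 125 TB at 80% efficiency.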
6. What % of your customers are doing analytics in Hadoop? What types of analytics? What’s the most common? What are the trends?
All of our customers today are doing analytics on the data stored on Hadoop and more and more data is coming to the Hadoop platforms. We find customers are gravitating towards the data lake use case where all data is stored and analyzed on the same platform.
- ETL offload
Enterprise customers are looking for a scalable platform to reduce the time, resources and costs of ETL processing on their existing Enterprise Data Warehouse systems. The scalability and parallelism of Pivotal HD/HAWQ, along with SQL support, make it easy to migrate customer applications and reduce ETL processing time.
- Batch Analytics
Enterprise customers are analyzing events, log files, call data records, mobile data, legacy system and app files for security, fraud and usage insights. Earlier some of the data was not analyzed as the cost of analysis was prohibitive and scalable tools were not available. Pivotal HD/HAWQ allows customers to run batch analytical workloads.
- Interactive Analytical Applications
Advanced enterprises are now leveraging the availability of structured data and of unstructured data from other non-traditional sources to deliver next generation applications. The key enabler in this case is the data lake use case, where all kinds of data are available on a single platform with capabilities to perform advanced analytics.
The trends are:
- Cost savings
Enterprises are looking at ways to do more with less. They want to reduce their infrastructure spend and, at the same time, reduce the time for ETL processing.
- Enable new applications and insights
Advanced enterprises are building next generation applications merging legacy data along with non-traditional data. Agility (time to market) is the key business driver for these initiatives which we find to be more business driven.
- Real time decision making
Enterprises are experimenting with new business models that leverage the latency benefits along with the ability to analyze historical data.
7. What does it take to get a CIO to sign off on a Hadoop deployment?
Grassroots adoption is very high for Hadoop. Developers want to get current with Big Data platforms and skills and are experimenting with Hadoop.
CIOs are convinced that Hadoop is a scalable platform and understand the long term impact of the technology to their business. CIOs are more worried about the short term impact of Hadoop on their organizations and the technology integration costs. They are concerned about their cost structure and the integration and people training costs and the timing of their capital spend.
- Right partner
CIOs want to partner with a vendor that has experience in building enterprise applications, delivering solutions and supporting products, and that has existing enterprise customers. They are looking for a partner with a long term vision that supports their future business and technology needs.
- Minimize Disruption
CIOs are interested in reducing the migration and application rewrite costs at their end. We find that CIOs also take monitoring and management costs into account as part of the Return on Investment (RoI) analysis for the immediate use cases before signing off on Hadoop deployments.
Investment Extension is crucial for enterprises. Continuing to run existing applications without re-architecture or significant changes is very appealing to the CIOs.
- Scale at business pace
CIOs prefer the scaling paradigm of Pivotal HD/HAWQ – adding more hardware scales both the storage and the processing platforms.
8. What are the Hadoop tuning opportunities that can improve performance and scalability without breaking user code and environments?
Hadoop has many configuration parameters, and the optimum parameter values for a given cluster configuration are not easy to derive. These configuration parameters are spread all over: some are part of the application/job configuration, some are part of the job system configuration, some are storage configuration, and quite a few are environment variables and service configuration parameters. Some of these parameters are interrelated, and customers need to understand the cross-variable impact before making any updates. Pivotal provides the following options for customers to tune their Pivotal HD environments.
- Technical Consulting
The Pivotal professional services team helps our Hadoop customers optimize and tune their Pivotal HD and HAWQ environments.
- Pivotal VRP
Pivotal VRP allows enterprises to manage system resources at the physical level (IO and CPU) to optimize their environments.
- Pivotal HD Vaidya
Vaidya, an Apache Hadoop component and part of the Pivotal distribution, guides users in improving their MapReduce job performance.
- HAWQ is already optimized for PHD
HAWQ leverages the advanced workload and resource management capabilities to deliver interactive query performance.
- Cost based optimizer delivers optimized query plans for query execution
- Dynamic pipelining capabilities improve the data loading times.
9. How much of Hadoop 2.0 do you support, given that it is still Alpha per Apache?
Pivotal HD already supports Hadoop 2.0.
10. What kind of RoI do your customers see on Hadoop investments – cost reduction or revenue enhancement?
- Cost Reductions
Enterprises usually start their journey on Hadoop with cost reduction justifications: reductions in their infrastructure spend and reductions in execution time for their ETL jobs. Rather than focusing on overall RoI, we advise our customers to focus on a single use case from an end-to-end perspective (including resource leverage and application migration costs).
- Revenue Enhancements
Once the benefits are realized, customers focus on new revenue enhancement use cases and the associated RoI justification for scaling their environments. Once the data lake use case is available, customers find even more use cases and justifications for their existing investment.
11. Are Hadoop demands being fulfilled outside of IT? What’s the percentage? Is it better when IT helps?
Hadoop is a complex environment, and we advise our enterprise customers that deploying and leveraging Hadoop without assistance from their IT teams is a recipe for failure. Irrespective of where the business need for Hadoop originated, the IT teams are best placed to manage Pivotal HD and HAWQ, as opposed to the business attempting it themselves.
12. How important will SQL become as a mechanism for accessing data in Hadoop? How will this affect the broader Hadoop ecosystem?
SQL is the most expressive language to manipulate data and enterprises have a long history of managing structured data with it. Support of SQL is absolutely essential for the enterprise customers. Enterprises are looking for:
- Resource leverage
- Ability to leverage existing resources to deliver next generation business applications.
- SQL is the standard language for manipulating data within most of the enterprise.
Enterprises have an established ecosystem of tools for analytics and for delivering insights to their existing user base. They want Hadoop to integrate with these tools so that they can extend their existing investments.
- Existing Applications
Application migration is a very expensive operation for a business application. Enterprises are looking to scale and speed up their existing applications without making significant architectural and code changes.
13. For years, Hadoop was known for batch processing and primarily meant MapReduce and HDFS. Is that very definition changing with the plethora of new projects (such as YARN) that could potentially extend the use cases for Hadoop?
New projects (YARN, Hive, HBase) are making Hadoop easier to use. YARN specifically is improving the performance of the platform and enabling latency sensitive workloads on Hadoop. Enterprises are looking to move a significant portion of their ETL and SQL analysis on the platform. HAWQ along with in-database analytics makes it easier for enterprises to migrate their existing applications and start building new models and analytics.
Adding support for other execution frameworks (such as Message Passing Interface – MPI) will bring more analytical and scientific workloads to the platform.
14. What are some of the key implementation partners you have used?
We collaborate with many partners all across the world. In addition to using EMC specialized services, we have worked with third party partners such as Zaloni, Accenture, Think Big Analytics, Capgemini, Tata Consultancy Services, CSC and Impetus to assist our enterprise customers in their application implementations.
15. What factors most affect YOUR Hadoop deployments (eg SSDs; memory size.. and so on)? What are the barriers and opportunities to scale?
Pivotal HD, HAWQ and Isilon are all proven petabyte scale technologies.
We are investigating the impact of SSDs, larger memory, Remote Direct Memory Access (RDMA), high speed interconnects (InfiniBand) and TCP offload engines in a scaled 1000 node environment in order to optimize performance.
Susheel Kaushik leads the Technical Product Marketing team at Pivotal. At Pivotal he has helped many enterprise customers become predictive enterprises leveraging Big Data & Analytics in their decision making. Prior to Pivotal, he led the Hadoop Product Management team at Yahoo! and also led the Data Systems Product Management team for the online advertising systems at Yahoo!. He has extensive technical experience in delivering scalable, mission critical solutions for customers. Susheel holds an MBA from Santa Clara University and a B.Tech in Computer Science from the Institute of Technology – Banaras Hindu University.
Follow Svetlana on Twitter @Sve_Sic