
Hadoop FAQs – April Webinar Q&A

By Merv Adrian | April 16, 2017


Nick Heudecker and I received numerous questions during our April Hadoop webinar with several hundred attendees, and we have summarized and answered them below.

How can Hadoop-Spark interface with conventional RDBMS such as Oracle/UDB/Teradata vs NoSQL DBs?
Data can be moved to and from other data sources by a variety of methods, including the Hadoop stack components Sqoop and Flume, and existing products from DBMS and DI vendors like Informatica, Syncsort and Talend. The distribution vendors partner with these vendors and sometimes sell jointly with them.
Most, if not all, DBMS vendors also offer some type of Spark connector. This means you’re running Spark outside the DBMS process, in its own cluster or environment. We expect DBMS vendors to begin offering in-process Spark capabilities within 12 months.
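As a rough illustration of the out-of-process pattern described above, the sketch below uses Spark's generic JDBC data source from PySpark. It is not any vendor's connector, and the host, table, credentials and partition column are hypothetical placeholders. The helper names `jdbc_read_options` and `read_table` are ours, for illustration only.

```python
# Sketch only: Spark pulls rows from an RDBMS over JDBC, running outside the DBMS.
# The URL, table, credentials and partition column below are hypothetical.

def jdbc_read_options(url, table, user, password,
                      partition_column=None, lower=None, upper=None, partitions=None):
    """Build the option map for Spark's built-in "jdbc" data source."""
    opts = {"url": url, "dbtable": table, "user": user, "password": password}
    if partition_column is not None:
        # Spark expects these four options together for a parallel read.
        opts.update({
            "partitionColumn": partition_column,
            "lowerBound": str(lower),
            "upperBound": str(upper),
            "numPartitions": str(partitions),
        })
    return opts


def read_table(spark, opts):
    """Load the table as a DataFrame; `spark` is an existing SparkSession."""
    return spark.read.format("jdbc").options(**opts).load()


if __name__ == "__main__":
    opts = jdbc_read_options(
        "jdbc:oracle:thin:@//db.example.com:1521/ORCL",  # hypothetical host/service
        "SALES.ORDERS", "etl_user", "secret",
        partition_column="ORDER_ID", lower=1, upper=1_000_000, partitions=8,
    )
    print(opts)
```

Running `read_table` for real would also require the DBMS vendor's JDBC driver on the Spark classpath. A bulk batch transfer with Sqoop is the other common route for the same job.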

Can conventional RDBMSs be replaced totally with NoSQL DBMSs and HBase etc., for a typical ETL life cycle serving ODS and DSS applications?
ETL is a process whose target is a DBMS, whether relational or NoSQL (HBase is a NoSQL DBMS). So yes, they can be targets. Replacing RDBMSs will work for some workloads, but by no means all. Which workloads depends on which NoSQL offering we’re talking about, as well as the workload characteristics, the number of users, the other workloads sharing the same system, and the SLAs.

Will enterprise storage still have a presence in the big data analytics field? Do you have specific recommendations on a storage volume threshold where on-premises servers (e.g., a SAN array) would be de facto surpassed by cloud?
This is another “it depends” answer. The cost of storage is both in acquisition and in operation. In the cloud, the latter costs are substantially lower, of course. But compute costs vary by workload. “Spiky” workloads (think data science, where hours or days at the whiteboard are followed by minutes of compute time) can be markedly cheaper in the cloud. But production that “runs hot” in continuous operation must be modeled and tested to answer the question, and the answer will change as prices do.
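One way to start that modeling is a back-of-the-envelope comparison of a fixed monthly on-premises cost against pay-per-hour cloud compute. The rates and function names below are purely illustrative assumptions, not Gartner figures or any provider's actual prices. A real model would also need storage, data transfer and staffing costs.

```python
# Illustrative only: all rates are made-up assumptions, not real prices.
HOURS_PER_MONTH = 730  # approximate hours in an average month


def cloud_monthly_cost(busy_hours, nodes, rate_per_node_hour):
    """Cloud compute cost when you pay only for hours the cluster is running."""
    return busy_hours * nodes * rate_per_node_hour


def break_even_hours(onprem_monthly_cost, nodes, rate_per_node_hour):
    """Busy hours per month at which cloud compute equals the on-premises cost."""
    return onprem_monthly_cost / (nodes * rate_per_node_hour)


if __name__ == "__main__":
    onprem = 6000.0          # hypothetical fixed monthly cost of a 10-node cluster
    nodes, rate = 10, 1.50   # hypothetical cloud price per node-hour
    spiky = cloud_monthly_cost(40, nodes, rate)              # 40 busy hours a month
    hot = cloud_monthly_cost(HOURS_PER_MONTH, nodes, rate)   # always on
    print(f"spiky: {spiky:.0f}, runs hot: {hot:.0f}, on-prem: {onprem:.0f}")
    print(f"break-even: {break_even_hours(onprem, nodes, rate):.0f} busy hours/month")
```

With these made-up numbers, the spiky workload costs 600 a month in the cloud against 6,000 on-premises. The always-on workload costs 10,950, and the break-even point is 400 busy hours a month. Different prices move that point, which is why the answer has to be recomputed as prices change.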

How far do you see market consolidation? Is it possible in a heavily distributed business model to have winners in the short term?
Calling winners and losers is for financial analysts. Our focus is on what works for users of the technology, and the good news is that most of the Hadoop stack is still the same no matter who you get it from. If vendors go away, the likelihood is that the technology will persist in acquiring firms or others who simply use the code.

What about Google Cloud?
It shows great promise and has excellent technology. So far it has not generated a great deal of visible revenue in this space, but it will as it builds out the sales and support infrastructure that mainstream enterprises expect.

In the architectural patterns for Hadoop (for logical data warehouse), where do the Hadoop and integration/metadata tools sit with respect to cloud/on-prem/hybrid cloud?
In general, these tools are struggling with the transition to hybrid deployment at varying rates, just as users are.

Are there use cases for moving from an on-prem approach to a hybrid approach for the analytics platforms of a health care company that is cautious about the use of its members’ PHI?
In a word, yes. Many health care firms are working on these issues.

How much of your inquiry is about ‘Advanced Analytics’ on Hadoop vs. Traditional BI on Hadoop? There is a lot of buzz about Spark, but it sounds like most of the widespread customer needs are around traditional BI.
Data science inquiries are focusing on Spark, but so are “traditional BI” ones when ad hoc usage is the focus. Hadoop’s limitations for either are very visible in the questions Gartner hears. Scheduled reporting and ETL are more typically Hadoop-focused.

Where do data wrangling companies, compliance platforms, access, etc. sit? Is that part of the integration layer? What about the ability to translate analytical algorithms from R to Java, Ruby, etc.?
All of these categories sit “next to” the Hadoop stack, though open source projects to tackle many of them are under way. We expect little if any addition to the current set of projects from the distributors we discussed.

What do you mean by calling Spark part of the Hadoop stack? Previously you indicated that they were independent, and you made it sound as though organizations were simply adding Spark to their offerings.
Spark is completely independent from Hadoop, but it was created to run in the Hadoop environment (among others). This made it relatively easy for Hadoop distributors to ship Spark as part of an aggregate offering. When we describe Spark as part of the Hadoop stack, we’re referring to commercially available Hadoop distributions.

Is Spark going to have, in the near future, a graphical interface with ready-made and customizable components for ETL processing?
Unlikely. These types of refinements are usually available as commercial/proprietary offerings.

Did your phrase “store-centric” mean relating to a data store, or relating to data about a retail operation?
We were talking about data stores.

Are there alternatives to Hadoop?
Yes, in the same places Hadoop itself was developed: inside web-native firms with strong engineering cultures and a determination to create new solutions to their data issues. Some will emerge and become the next “shiny objects.” Stay tuned. Hadoop wasn’t the first platform and it won’t be the last.
