The number of SQL options for Hadoop has expanded substantially over the last 18 months. Most get a large amount of attention when announced, but a few slip under the radar. One of these low-flying options is Apache Tajo. I learned about Tajo in November of 2013 at a Hadoop User Group meeting.
Billed as a big data data warehousing system for Hadoop, Tajo development started in 2010 and moved to the Apache Software Foundation in March of 2013. Tajo is currently incubating. Its primary development sponsor is Gruter, a big data infrastructure startup in South Korea. Despite the lack of public awareness, Tajo has a fairly robust feature set:
- SQL compliance
- Fully distributed query processing against HDFS and other data sources
- ETL feature set
- User-defined functions
- Compatibility with HiveQL and Hive MetaStore
- Fault tolerance through a restart mechanism for failed tasks
- Cost-based query optimization and an extensible query rewrite engine
Things get interesting when comparing performance against Apache Hive and Cloudera Impala. SK Telecom, the largest telecommunications provider in South Korea, tested Tajo, Hive and Impala using five sample queries. Hive 0.10 and Impala 1.1.1 on CDH 4.3.0 were used for the test. Test data size was 1.7TB and query results were 8GB or less in size. (The following images were taken from the presentation in the previous link.)
Query 1: Heavy scan with 20 text matching filters
Query 2: 7 unions with joins
Query 3: Simple joins
Query 4: Group by and order by
Query 5: 30 pattern matching filters with OR conditions using group by, having and sorting
What do these results indicate? Clearly, different SQL-on-Hadoop implementations have different performance characteristics. Until these options mature to be truly multi-purpose, selecting a single option may not result in the best overall performance. Also, these benchmarks are for a specific set of use cases – not your use cases. The tested queries may have no relevance to your data and how you’re using it.
The other important takeaway is the absolute performance of these options. The sample data set and results are small in modern terms, yet none of the results are astounding relative to a modern data warehouse or RDBMS. There’s a difference between “fast” and “fast for Hadoop.” Cloudera appears to be making some headway, but a lot of ground must be covered before any Hadoop distribution is comparable with the systems vendors claim to be replacing.
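As a rough illustration of testing against your own use cases, the sketch below times a set of queries on several engines and reports the median wall-clock time per query. Everything in it is an assumption for illustration: `time_query`, `compare` and the engine callables are hypothetical names, and each callable stands in for whatever client you use to submit SQL to Tajo, Hive or Impala.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_query(run: Callable[[str], object], sql: str, repeats: int = 3) -> float:
    """Return the median wall-clock seconds over `repeats` runs of `sql`.

    `run` is a hypothetical callable that submits the SQL text to one
    engine and blocks until the query finishes.
    """
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        run(sql)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(
    engines: Dict[str, Callable[[str], object]],
    queries: Dict[str, str],
    repeats: int = 3,
) -> Dict[str, Dict[str, float]]:
    """Map each query name to {engine name: median seconds}."""
    return {
        query_name: {
            engine_name: time_query(run, sql, repeats)
            for engine_name, run in engines.items()
        }
        for query_name, sql in queries.items()
    }
```

To use it, pass a dictionary of engine names to submit functions and a dictionary of your own queries. The median is used rather than the mean so that a single slow run does not dominate the result.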
Like most SQL-on-Hadoop benchmarks, the testing leaves us with more questions than answers.
First, they tested against Hive 0.10 while the shipping version is Hive 0.12 + Tez + Yarn. 15 yard penalty and loss of down.
Second, in query 1 they tested 1TB scattered across six 3TB spindles times 6 nodes. That’s 108TB to hold 1.7TB of data. Kinda unrealistic. What we need to see is a Hadoop benchmark of 100TB of data or more. Scan all 100TB, then another query does a massive join of 50TB to 10TB. Customers should not assume real scalability based on easy benchmarks.
Last, with most of the queries spreading 8GB of data across 384GB of memory (6 nodes*64GB), well, all the test data sits in memory, around 1.3GB per node. This is testing code path speed but not disk access which we know is the slowest component.
IMO — Lacking audited TPC council benchmarks, customers should do their own comparisons — at scale.
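The capacity arithmetic in the comment above can be checked with a short sketch. The variable names are illustrative. The figures (6 nodes, six 3TB spindles per node, 64GB of memory per node, 8GB of query results, 1.7TB of test data) are the ones quoted in the post and the comment.

```python
# Cluster figures as quoted in the post and the comment above.
NODES = 6
SPINDLES_PER_NODE = 6
SPINDLE_TB = 3
MEMORY_PER_NODE_GB = 64
RESULT_GB = 8
DATA_TB = 1.7

# Raw disk capacity across the cluster, in TB.
raw_disk_tb = NODES * SPINDLES_PER_NODE * SPINDLE_TB

# Total memory across the cluster, in GB.
total_memory_gb = NODES * MEMORY_PER_NODE_GB

# Query result size spread evenly over the nodes, in GB.
result_per_node_gb = RESULT_GB / NODES

# Fraction of raw disk capacity occupied by the test data.
disk_fraction_used = DATA_TB / raw_disk_tb

print(f"raw disk: {raw_disk_tb} TB")
print(f"total memory: {total_memory_gb} GB")
print(f"results per node: {result_per_node_gb:.1f} GB")
print(f"disk fraction used: {disk_fraction_used:.1%}")
```

Note that the 1.3GB per node figure is derived from the 8GB result size, not from the 1.7TB input data set.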
I see your point about the Hive version to a certain extent, but Hive isn’t the story here and it won’t be until Tez matures. I’d like to see Shark as part of this comparison. Hopefully someone from Gruter will respond regarding your other comments.
Thanks for the discussion.
Thanks for all the good points and feedback on our test results.
I added inline comments hoping that they are helpful.
> First, they tested against Hive 0.10 while the shipping version is Hive 0.12 + Tez + Yarn. 15 yard penalty and loss of down.
The test was performed at the end of October last year. At that time, Hive on Tez was still under development, so there was no chance to test it, and the latest stable version in CDH was 0.10. Yes, absolutely, Hive on Tez is a must-test item in our test plan.
> Second, in query 1 they tested 1TB scattered across six 3TB spindles times 6 nodes. That’s 108TB to hold 1.7TB of data. Kinda unrealistic. What we need to see is a Hadoop benchmark of 100TB of data or more. Scan all 100TB, then another query does a massive join of 50TB to 10TB. Customers should not assume real scalability based on easy benchmarks.
> Last, with most of the queries spreading 8GB of data across 384GB of memory (6 nodes*64GB), well, all the test data sits in memory, around 1.3GB per node. This is testing code path speed but not disk access which we know is the slowest component.
Although it is not in the slides, I mentioned in my presentation at the HUG meetup that we dropped the cache before each test iteration to remove any memory effect from the previous iteration. Regarding the data set size, I agree with your point: as in most similar tests, the data set is still small relative to the hardware capacity. We will share more test results using a bigger data set soon.
> IMO — Lacking audited TPC council benchmarks, customers should do their own comparisons — at scale.
I understand that internal test results from a vendor, rather than from a trusted third party, may be viewed with doubt. Still, I hope they are helpful as a reference for anyone interested in the Tajo project. The test was actually done by one of our clients, SKT, and I introduced it as a case study. Fortunately, Apache Tajo is an open source project, so anyone can set up and run a benchmark for their own use case, and anyone can discuss or debate the test results. I believe such discussions are also essential for the Apache Tajo project.
First of all, I’d like to thank you for your post on Tajo.
Regarding Shark, we definitely would like to include it in our next test and share the results.
We are more than happy to hear any opinions and feedback on Tajo.