
The Hadoop Distributed File System (HDFS) is the core file system for storing huge volumes of data with high availability and fault tolerance. Apache Hive and Apache Drill are two of the many analytical engines well suited to processing petabytes of data with traditional Structured Query Language (SQL) on top of HDFS and other file systems.
Hive is a popular data warehouse application for storing and summarizing structured Big Data sets; it lets us write SQL-like queries to efficiently extract business data. Hive converts SQL queries into MapReduce (MR) programs and processes them in a distributed fashion. MR is its default execution engine, and it can process data stored in HDFS, S3, or virtually any distributed file system.
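As a rough illustration of that translation, an aggregation such as `SELECT category, COUNT(*) FROM t GROUP BY category` decomposes into a map phase that emits key/value pairs and a reduce phase that aggregates them per key. A minimal Python sketch of that decomposition (the sample rows are invented for illustration; real Hive jobs do this across many machines):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows, standing in for a Hive table scan.
rows = [
    {"category": "books", "price": 12},
    {"category": "toys", "price": 7},
    {"category": "books", "price": 30},
]

def map_phase(rows):
    # Map: emit (group-by key, 1) for each input row.
    return [(row["category"], 1) for row in rows]

def reduce_phase(pairs):
    # Shuffle/sort groups identical keys together; reduce sums the counts,
    # mirroring SELECT category, COUNT(*) FROM t GROUP BY category.
    pairs = sorted(pairs, key=itemgetter(0))
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

counts = reduce_phase(map_phase(rows))
print(counts)  # {'books': 2, 'toys': 1}
```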
Next Gen Product Development at your fingertips!
Read More
Apache Drill is another efficient SQL query engine, able to query structured, semi-structured, and unstructured data sets residing in almost any file system.
Drill's high-level architecture is a distributed execution environment built around a core service called the Drillbit, which accepts query requests from clients, processes them, and delivers the results back. To provide first-class support for distributed frameworks like Hadoop, the Drillbit service can be installed on every Hadoop node, preserving data locality when querying files stored in HDFS.
Drill uses ZooKeeper to maintain cluster membership and health-check information.
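In practice, each Drillbit discovers its peers through settings in drill-override.conf. A hedged sketch, where the cluster id and ZooKeeper hosts are placeholders for your own environment:

```
drill.exec: {
  # Drillbits sharing the same cluster-id form one cluster.
  cluster-id: "my-drill-cluster",
  # ZooKeeper quorum used for membership and health checks.
  zk.connect: "zk1:2181,zk2:2181,zk3:2181"
}
```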
MapReduce is the core component of Hadoop for processing data stored in its distributed file system, HDFS.
Hive queries HDFS by converting both simple and complex SQL queries into MR jobs, and it uses cluster resources efficiently to process historical data.
Drill can query almost any file system. It processes data in memory, making effective use of its allocated resources, and is especially effective against distributed file systems.
Both Hive and Drill are suitable for processing large data sets, but each has its own disadvantages.
Hive is best suited to structured data sets, preferably stored in HDFS, and uses the MR framework to compute the aggregates in a SQL statement.
Drill queries, by contrast, can be executed against structured, semi-structured, and unstructured data sets, and against files in both flat and distributed storage.
| Hive | Drill |
| --- | --- |
| MapReduce is the default execution engine | In-memory processing |
| Built on top of HDFS | Queries both flat and distributed file systems |
| Distributed data warehouse | SQL query engine |
| Supports both simple and complex data types | Does not support complex data types |
| Uses MySQL/PostgreSQL to store the schema | Uses ZooKeeper to store metadata information |
| High fault tolerance | Low fault tolerance |
| Supports almost all SQL queries | Has restrictions on writing subqueries |
| Queries only HDFS files and NoSQL data stored in HBase | Queries RDBMS and NoSQL databases such as MySQL, PostgreSQL, MongoDB, and HBase |
| Unable to query any RDBMS | Can query and join across RDBMS and Hive tables |
| Higher memory utilization when processing historical data | Efficient memory utilization; can even process data with less RAM |
| Wide range of built-in functions to manage timestamps | Fewer built-in functions for timestamp conversions |
| User-defined functions (UDFs) are supported | UDFs are supported; Hive UDFs can be used in Drill with ease |
| Simple Java programming structure to write UDFs | More complex programming structure to write UDFs |
| Distributed processing using gateways | Distributed processing using independent agents called Drillbits |
| Hive queries do not fail when a node goes offline; no rerun required | Drill queries fail when a Drillbit goes offline; rerun required |
| Kerberos authentication supported | Kerberos authentication supported |
| Hive can be queried using JDBC | Drill can be accessed using JDBC |
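Besides JDBC, Drill also exposes an HTTP REST endpoint (by default on port 8047) that accepts SQL. A minimal Python sketch of building such a request; the host name is a placeholder, and the actual submission is left commented out since it needs a running Drillbit:

```python
import json
from urllib.request import Request, urlopen  # stdlib only

DRILL_URL = "http://drillbit-host:8047/query.json"  # placeholder host

def drill_request(sql):
    """Build the JSON payload Drill's REST API expects for a SQL query."""
    payload = {"queryType": "SQL", "query": sql}
    return Request(
        DRILL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = drill_request("SELECT * FROM dfs.`/data/sample.json` LIMIT 10")
# With a live Drillbit, the query would be submitted like this:
# with urlopen(req) as resp:
#     rows = json.loads(resp.read())["rows"]
```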
Apache Drill offers a rich set of optimization parameters to effectively share and utilize the resources allocated to each Drillbit.
Hive supports Tez and Spark as alternatives to MapReduce to minimize disk I/O operations, but these in turn consume more memory when processing large tables.
Hive depends fully on YARN and is used for handling batch data. The use case below compares the resource utilization of Hive and Drill.
YARN settings (Hive):
yarn.nodemanager.resource.memory-mb = 60 GB
yarn.scheduler.maximum-allocation-mb = 40 GB

Drill settings:
planner.memory.max_query_memory_per_node = 4 GB
planner.width.max_per_node = 4
planner.width.max_per_query = 1000
exec.max_hash_table_size = 1 GB
Number of rows processed: 20,763,383
Query: a two-table join with ORDER BY timestamp DESC.
Drill execution:

| Stages | Memory Used | Time Taken |
| --- | --- | --- |
| Seven fragments in total | Max memory: 51 MB; total memory: 110 MB | 3 min 10.031 sec |
| Total time taken | | 3 min 10.031 sec |
Hive execution:

| Stage | Detail | Memory Used | Time Taken |
| --- | --- | --- | --- |
| Stage 1 | 60 mappers | Max memory: 519 GB | 13 min 24 sec |
| | 251 reducers | Max memory: 17 GB | |
| Stage 2 | 285 mappers | Max memory: 547 GB | 17 min 18 sec |
| | 75 reducers | Max memory: 17 GB | |
| Stage 3 | | Max memory: 130 GB | 42 sec |
| Total time taken | | | 20 min |
Drill outperforms Hive in both resource utilization and query latency. Hive performs well on more voluminous data but requires more memory.
With Drill, disk-spilling configurations must be set according to the data volume, especially when sorting data.
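Those spill settings live in drill-override.conf. A hedged sketch (exact option names vary by Drill version, and the file system and directory below are placeholders):

```
drill.exec.sort.external.spill: {
  # File system used for spill files; could also be a local path.
  fs: "hdfs:///",
  # One or more directories where sort operators spill when memory runs out.
  directories: [ "/drill/spill" ]
}
```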
Hive spends time during aggregation performing MapReduce operations; Drill eliminates the MapReduce step and quickly returns results for aggregation queries.
LakshmiNarasimman is a Big Data developer and admin at Indium Software who is passionate about Big Data technology. He is interested in troubleshooting issues in implementing Big Data solutions, including securing Hadoop services. His work spans productionizing Big Data clusters, fine-tuning, and performance benchmarking of the various applications built on top of Big Data.