Hadoop Distributed File System(HDFS), is the core file system to store huge volume as a highly available data with fault tolerance. Apache hive and Apache Drill are couple of analytical engines out of many which are well suited for processing petabytes of data using traditional Structured Query Language(SQL) on top of HDFS and other file systems.
What is Apache HIVE?
Hive is one of the popular data warehouse application to store and summarize structured Big data sets by facilitating us to write SQL like queries to efficiently extract business data. It converts SQL queries into Map reduce(MR) programs and process it distributive, hive uses MR as a default execution engine and it can process data available in HDFS, S3 any distributed file system per say.
Advantages or what hive is meant for?
- Hive is suitable for long running queries, doing aggregates and joining operation between two big hive and HBase tables.
- Hive has capability to query self-descriptive files formats like Parquet, ORC, Avro, JSON and so on. Allows access files stored in Hadoop Distributed File System (HDFS) or also similar other data storage systems such as Apache HBase.
- Hive is highly scalable and fault tolerance. Data growth can be handled in ease by adding more commodity hardware without compromising the performance and without any down time.
Disadvantages or what hive is not meant for?
- Hive is not suitable for querying from interactive applications like BI tools.
- Hive does not support updates and deletion of records.
- It’s not designed for Online transaction processing (OLTP), it is only used for the Online Analytical Processing (OLAP).
What is Apache DRILL?
Apache Drill is yet another efficient SQL query engine which queries any kind of structured and unstructured data sets which residing any of the file system.
Drill high-level architecture includes a distributed execution environment with a core service is called Drillbit.
Which is responsible for accepting query requests from the client, processing and delivering results back to client. For proving extensive support for distributed framework like Hadoop, this Drillbit service can be installed in all the Hadoop nodes to preserve data locality in querying files stored in HDFS.
Drill uses Zookeeper to maintain cluster membership and health-check information.
Advantages or what Drill is meant for?
- Drill is fairly suitable for querying files from interactive application, integrates with the BI/SQL tools such as Tableau, MicroStrategy, Pentaho and Jaspersoft using JDBC/ODBC drivers.
- Dynamic schema discovery enables Drill to start the query execution process without any schema, starts data processing in record-batches and discovers the schema during processing.
- Drill does not require any centralized database to store metadata.
- Drill supports Real SQL (2003 syntax) queries not SQl-Like queries. Drill can interactively query hive tables without MR job and reuses the UDFs created in HIVE.
- It can operate on cross data sources, like it can query and join tables between any RDBMS tables and hive/HBase tables.
- Scalable and high-performance engine, designed to provide high throughput and low latency.
Disadvantages or what Drill is not meant for?
- Drill is not suitable for long running queries.
- Drill automatically spills data into disk when processing data not fit in memory. This may require more disk space when querying against big data sets.
- Though it can query cross databases but still we can only use built in functions supported by Drill. Drill lacks in supporting many aggregate functions which are supported by MySQL /Oracle/Hive.
Mapreduce is the core component of Hadoop to process the data stored in its distributed file system called HDFS.
Hive queries HDFS by converting both simple and complex sql queries into MR jobs and efficiently uses resources in the cluster to process historical data.
Drill queries almost any of the file system by processing in memory which effectively uses allocated resources, especially runs against distributed file systems.
Both Hive and Drill suitable for processing large data sets but still it has its own disadvantages.
Basically, hive is suitable for structured data sets which uses MR framework to compute aggregates in SQL statement and files preferably stored in HDFS.
Where Drill queries are executed against both structure, semi-structured and unstructured data sets, it simply queries any file systems stored in both flat and distributed storage.
Hive and Drill Comparison
|Mapreduce is the default execution engine||In memory processing|
|Build on top of HDFS||Queries both Flat and Distributed file systems|
|Distributed data warehouse||SQL Query Engine|
|Supports both simple and complex data types||Does not support complex data types|
|Uses MySQL/PostgreSQL to store schema||Uses zookeepers to store metadata information|
|High Fault tolerance||Low Fault Tolerance|
|Hive supports almost all SQL queries||It has restrictions and writing subqueries|
|Queries only hdfs files and NOSQL dataset stored by hbase||Queries any RDBMS and NOSQL databases like mysql,postgres,mongodb,hbase|
|Unable to query any RDBMS||Queries and do join operation between RDBMS and hive tables|
|Higher memory utilization when processing historical data.||Efficient memory utilization and even process data with less RAM.|
|Hive has wide range of inbuilt functions to manage timestamps||Lesser number of inbuilt functions to manage timestamps conversions|
|User Defined Functions are supported.||User defined Functions are supported, hive UDFs can be used in drill with ease.|
|Simple Java programming structure to write UDFs.||Complex programming structure to write UDFs.|
|Distributed processing using Gateways||Distributed processing using independent agent called drill-bits.|
|Hive queries will not fail when any of the node goes offline, rerun not required.||Drill queries failed when any of the drill-bit goes offline, rerun required.|
|Kerberos authentication supported||Kerberos authentication supported|
|Hive can be queried using JDBC.||Drill can be accessed using JDBC.|
Hive vs Drill Comparative benchmark
Apache Drill has rich number of optimization configuration parameters to effectively share and utilize the resources individually allocated for the drill-bits.
Though hive supports Tez and spark to eliminate mapreduce and to minimize disk IO operations, which in turn takes more memory while processing large tables.
Hive fully depends on YARN and used for handling batched data. Below use case demonstrates resource utilization of both hive and drill.
Benchmarked in CDH cluster with below configuration:
- Number of nodes: 10
- Memory: 128GB
- Cores: 48 cores
- Disks: 13 TB/node
Configurations that affects Hive:
Configurations that affects Drill:
Test case executed:
Number of rows processed :20763383
Query : A query with order by timestamp desc and two tables join.
|Stages||Memory Used||Time Taken|
|Seven fragments in total|| Max Memory : 51 MB|
Total Memory : 110 MB
|03 min 10.031 sec|
|Total Time Taken||03 min 10.031 sec|
|Stages||Memory Used||Time Taken|
|Stage 1||No of Mappers||60||13 mins, 24sec|
|Max Memory||519 GB|
|No of Reducers||251|
|Max Memory||17 GB|
|Stage 2||No of Mappers||285||17mins, 18sec|
|Max Memory||547 GB|
|No of Reducers||75|
|Max Memory||17 GB|
|Stage 3||Max Memory||130 GB||42 secs|
|Total Time Taken||20 Minutes|
Drill outperforms Hive in both resource utilization and latency in executing queries. Hive performs well with more voluminous data but requires more memory.
In case of drill, it is required to set disk spilling configurations based on the volume especially sorting data.
Hive takes time during aggregation to perform map reduce operation. Drill eliminates map reducer and quickly generates results for the queries with aggregation.