How to Use HBase/Phoenix for Real-Time Data and Performance Tuning!

1. What is HBase?

As we all know, HBase is a NoSQL database: a distributed, strictly consistent storage system that sustains heavy writes (heavy enough to saturate the I/O channel) while still delivering excellent read performance. It also uses disk space efficiently, since compression algorithms can be enabled per column family based on the nature of the data stored there.
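
To illustrate the per-column-family compression point, here is a minimal sketch using the HBase 2.x Java client; the table name "events", the family "d", and the choice of Snappy are assumptions for illustration, and Snappy must be available on the cluster.

    // Minimal sketch: create a table whose column family uses Snappy compression.
    // Assumes the HBase 2.x client API; "events" and "d" are hypothetical names.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
    import org.apache.hadoop.hbase.io.compress.Compression;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CreateCompressedTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Admin admin = connection.getAdmin()) {
                // Compression is chosen per column family, so different families
                // in the same table can use different algorithms (or none).
                admin.createTable(TableDescriptorBuilder
                        .newBuilder(TableName.valueOf("events"))
                        .setColumnFamily(ColumnFamilyDescriptorBuilder
                                .newBuilder(Bytes.toBytes("d"))
                                .setCompressionType(Compression.Algorithm.SNAPPY)
                                .build())
                        .build());
            }
        }
    }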

2. Problem faced and how we overcame it!

Introduction

  • Our application ingests OLTP data into an RDBMS, which worked fine during the initial years after we built the application. But as transaction volume kept growing, our legacy RDBMS and its replication servers became too heavy for us from the following perspectives:

  1. Hardware investment
  2. Technology
  3. Capability of the database for the near future
  • So we thought, why not try Big Data technologies, and we chose HBase as the data store for our application.

Implementation

  • We spent a few days on API development to integrate our application's data-sink logic so that it also stores OLTP data in HBase (a minimal write sketch follows this list). Once this was in place in our development environment, we started to see data land in the HBase tables successfully, voila.
  • Once the dev effort was complete, we went live on the production cluster to see whether it would hold up:
    • The cluster worked properly during the application's off-peak hours.
    • As soon as the application reached peak usage, we started to see a lot more RED alerts in Cloudera Manager. When we dove into the alerts, we figured out that more things needed to be tweaked in the HBase and Hadoop configuration (the CDH defaults did not work out for most use cases) based on the read/write behavior of the application.
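
For context, here is a minimal sketch of the kind of write path such a data sink performs, assuming the HBase 2.x Java client; the table "transactions", the family "d", and the qualifiers are hypothetical names.

    // Minimal sketch of the OLTP write path into HBase (HBase 2.x client API).
    // Table "transactions", family "d", and the qualifiers are hypothetical names.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TransactionSink {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("transactions"))) {
                // Row key design matters for write distribution; a salted or reversed
                // key is commonly used to avoid hot-spotting a single region.
                Put put = new Put(Bytes.toBytes("txn-000123"));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("amount"), Bytes.toBytes("42.50"));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("status"), Bytes.toBytes("SETTLED"));
                table.put(put);
            }
        }
    }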

Challenges during implementation and the respective fixes

  • During peak hours of application usage, the velocity and volume of incoming data are huge, and all of it has to be ingested into the HBase cluster anyway. We initially went live with the default HBase parameters, as follows:
    • hbase.hregion.memstore.flush.size: 128MB
    • Compaction:
      • hbase.hstore.compactionThreshold: 3
      • hbase.hstore.compaction.max: 10
    • hbase.hregion.memstore.block.multiplier: 10
  • We learned that these parameters are tightly coupled, in the sense that changing just one of them will not improve write performance. For example, here is what happened when the application was producing more data:
    • Our memstore filled up so quickly that it had to flush data to disk constantly, which in turn created more store files per region.
      • To get fewer store files per region we increased the flush size to 512MB, and we did start to see fewer store files. But when application usage hit its absolute peak, the memstore again began flushing to disk frequently. It is not recommended to raise the memstore flush size beyond this, because node failure and recovery would then take a long time; so it is better to stick to this threshold, as we did.
    • Minor compaction starts as soon as there are 3 store files per store, which pushed the minor compaction queue size high.
      • To make minor compactions less frequent, it is suggested to raise hbase.hstore.compactionThreshold to a higher value, as we did (20), together with hbase.hstore.compaction.max: 40, so HBase won't start a minor compaction until it sees 20 uncompacted files per store.
    • The block multiplier blocks writes to a region once its memstore grows to block.multiplier × flush.size. This in turn caused the application's OLTP inserts into HBase to block. "It stops the world!"
      1. To avoid such scenarios we raised this to a maximum – 400. After this we didn't see writes being blocked, either in the memstore or at the application layer.
  • So tuning these parameters to your use case gives better write performance, which is exactly what we wanted from HBase as the data store for our application. The sketch below shows one way the overrides we ended up with could be applied.
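
Here is a minimal sketch of applying those overrides per table through the HBase 2.x Java admin API, assuming an existing table; the table name "transactions" is hypothetical, and the same keys can instead be set cluster-wide in hbase-site.xml.

    // Minimal sketch: apply the write-tuning overrides at the table level
    // (HBase 2.x client API). "transactions" is a hypothetical table name;
    // the same keys can also be set cluster-wide in hbase-site.xml.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.TableDescriptor;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

    public class ApplyWriteTuning {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Admin admin = connection.getAdmin()) {
                TableName table = TableName.valueOf("transactions");
                TableDescriptor tuned = TableDescriptorBuilder
                        .newBuilder(admin.getDescriptor(table))
                        // Flush the memstore at 512MB instead of the 128MB default.
                        .setValue("hbase.hregion.memstore.flush.size", String.valueOf(512L * 1024 * 1024))
                        // Start minor compaction only after 20 uncompacted store files.
                        .setValue("hbase.hstore.compactionThreshold", "20")
                        .setValue("hbase.hstore.compaction.max", "40")
                        // Effectively never block writes on memstore size.
                        .setValue("hbase.hregion.memstore.block.multiplier", "400")
                        .build();
                admin.modifyTable(tuned);
            }
        }
    }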

Usage: Write or Read or Both?

  • Based on our experience, it is highly recommended to decide how the data store is going to be used at the end of the day! Is it,
    • Write-intensive
      • See the write-tuning parameters above ("Challenges during implementation") to get better performance.
    • Read-intensive
      • Reads can be optimized in multiple ways, but before you start tuning parameters, make sure your read pattern is relatively random rather than purely random.
      • If the reads are relatively random, then tweaking the block cache / bucket cache will definitely help (a small read-side sketch follows this list); otherwise, don't even think of spending time on read performance tuning.
    • Both
      • Combining the write tuning above with block/bucket cache tweaking would help.
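
On the read side, here is a minimal client-side sketch with the HBase 2.x Java API; the table and family names are hypothetical, and sizing the block cache or bucket cache itself is a server-side configuration (for example hbase.bucketcache.ioengine and hbase.bucketcache.size) that is not shown here.

    // Minimal read-side sketch (HBase 2.x client API). Server-side block/bucket
    // cache sizing is separate; this only shows the client-side knobs on a Scan.
    // "transactions" and family "d" are hypothetical names.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ReadTuningExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("transactions"))) {
                Scan scan = new Scan()
                        .withStartRow(Bytes.toBytes("txn-000100"))
                        .withStopRow(Bytes.toBytes("txn-000200"))
                        .addFamily(Bytes.toBytes("d"))
                        // Keep scanned blocks in the block cache only if the same
                        // key ranges are read repeatedly ("relatively random" reads).
                        .setCacheBlocks(true)
                        // Rows fetched per RPC; larger values reduce round trips.
                        .setCaching(500);
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result result : scanner) {
                        System.out.println(Bytes.toString(result.getRow()));
                    }
                }
            }
        }
    }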

Why Phoenix?

  • Though we successfully tuned HBase, at the end of the day our dashboards and downstream applications needed data from HBase, with the caveat that they wanted to get it via SQL queries. As a best practice, we chose Apache Phoenix – OLTP and operational analytics for Apache HBase. Because Phoenix compiles SQL into native HBase calls, we get data back at low latency with only a few changes to its default parameters. That's all it needs.
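
As an illustration, here is a minimal sketch of querying HBase through the Phoenix JDBC driver; the ZooKeeper quorum and the TRANSACTIONS table and its columns are hypothetical.

    // Minimal sketch: querying HBase through Phoenix's JDBC driver.
    // The ZooKeeper quorum "zk1,zk2,zk3:2181" and the TRANSACTIONS table/columns
    // are hypothetical; Phoenix exposes them as ordinary SQL.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class PhoenixQueryExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181");
                 PreparedStatement stmt = conn.prepareStatement(
                         "SELECT TXN_ID, AMOUNT, STATUS FROM TRANSACTIONS WHERE STATUS = ? LIMIT 10")) {
                stmt.setString(1, "SETTLED");
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("TXN_ID") + " " + rs.getString("AMOUNT"));
                    }
                }
            }
        }
    }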

Conclusion

  • It would be good to work through the checklist below before starting the implementation:
    • Spend time on the "data model in HBase", based on your usage pattern.
    • What is the intended usage of the data store: read, write, or both? Then tune HBase accordingly.
    • If you need low-latency reads on top of HBase via SQL queries, choose Phoenix, certainly not Hive/Impala/Spark-SQL.


Author: ALEX MAILAJALAM
Alex is a Big Data Evangelist and a Certified Big Data Engineer with many years of experience. He has helped clients optimize custom Big Data implementations, migrate legacy systems to the Big Data ecosystem, and build integrated Big Data and Analytics solutions that help business leaders generate custom analytics without the need for IT.