Organizations are dealing with ever-increasing volumes of data. In a 2017 report, researchers from Aberdeen estimated that the average company’s data grows at a rate of 50% annually. Data holds business value, and the companies that succeed in extracting it will outshine their competition.
However, the proliferation and complexity of data can disrupt the efficiency of companies that rely heavily on their data assets. Traditional data management methods, such as relational databases and data warehouses, are not effective because enterprises today require solutions that help manage data across a varied but integrated data tier.
A data lake helps overcome the limitations of traditional data management in three ways: it provides the flexibility to change models and queries, it supports all users (data warehouses, for example, allow only specific users to report from the defined data), and it retains unstructured, semi-structured and structured data.
What is a Data Lake?
It’s a centralized storage repository that allows users to store raw data (neither processed nor analyzed) in its original format, including unstructured, semi-structured or structured data, at scale. It enables businesses to create visualizations and dashboards and perform big-data processing, machine learning and real-time analytics to make informed business decisions.
Advantages of a Data Lake
Some of the benefits of the technology are as follows:
Enhanced customer interactions: Connecting consumer data from a customer relationship management (CRM) platform with social media analytics helps identify the most profitable cohort, the possible causes of customer churn, the rewards and bonuses that could build customer loyalty, and so on.
Improved research and development: Data lakes enable R&D personnel to test their theories, refine assumptions, and evaluate results.
Increased operational efficiency: Up to 43 percent of companies surveyed by Aberdeen reported that implementing a data lake improved their operational efficiency. A data lake simplifies collecting and storing data from Internet of Things (IoT) devices and running analytics on it to identify ways to reduce operational costs and increase efficiency.
Data-driven decision-making: A data lake helps unify and analyze data from varied sources to gain deeper insights and accurate results. Together with Artificial Intelligence (AI) and real-time analytics, it enables organizations to seize new opportunities as they arise.
Data Lake Architecture
Organizations can establish a data lake on-premise (in their data center) or in the cloud, with multiple vendors offering the cloud-based service.
While data lakes were initially built on HDFS clusters on-premise, companies are migrating their data to the cloud as infrastructure-as-a-service (IaaS) gains popularity.
An on-premise data lake comes with challenges. Companies must tackle the complexity of building their own data pipelines and contend with ongoing management and operational costs, in addition to the initial investment in servers and storage equipment. They must also manually add and configure servers to scale the data lake for more users or growing data volumes.
A data lake in the cloud, by contrast, offers some major advantages. Even so, businesses must consider several key design aspects when opting for this mode of deployment.
Scalability and Durability
As a centralized data repository for an entire organization, a data lake must be scalable: it should accommodate data of any size while ingesting it in real time.
Durability is another essential aspect of a data lake: the core storage layer must provide consistent uptime while ensuring no loss or corruption of data.
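One common way to detect corruption is to record a checksum when an object is written and compare it on every read. The sketch below illustrates that idea only; the function names and the sample records are hypothetical, and managed storage services typically handle integrity checks internally.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    """Return the SHA-256 digest of a payload as a hex string."""
    return hashlib.sha256(payload).hexdigest()


def is_intact(payload: bytes, expected_digest: str) -> bool:
    """Report whether the payload still matches the digest recorded at write time."""
    return sha256_hex(payload) == expected_digest


# Hypothetical object written to the lake, with its digest recorded alongside it.
original = b'{"device_id": 7, "temperature": 21.5}'
recorded_digest = sha256_hex(original)

# A later read returns either the same bytes or a silently altered copy.
corrupted = b'{"device_id": 7, "temperature": 91.5}'

print(is_intact(original, recorded_digest))   # True
print(is_intact(corrupted, recorded_digest))  # False
```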
Support for Different Data
Among the major design considerations in a data lake is its capability to store unstructured, semi-structured and structured data. This flexibility enables organizations to transfer anything from raw, unprocessed data to fully-aggregated analytical outcomes.
Independent of Fixed Schema
Organizations must ensure their data lake can store data that doesn’t conform to a predefined schema. Data should be parsed and adapted into a schema only when it is read at processing time, as necessary. This approach saves enterprises the considerable time usually spent on defining a schema up front.
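To make the read-time approach concrete, here is a minimal sketch using only the Python standard library. The raw records, the Purchase schema and the read_purchases function are all hypothetical; a real lake would more likely rely on a query engine to do this.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional

# Raw records kept exactly as they arrived; nothing was validated at write time.
RAW_LINES = [
    '{"user": "ana", "amount": "19.90", "country": "PT"}',
    '{"user": "ben", "amount": 5}',
    'not json at all',
    '{"amount": "3.50"}',
]


@dataclass
class Purchase:
    user: str
    amount: float
    country: Optional[str]


def read_purchases(lines: Iterable[str]) -> Iterator[Purchase]:
    """Apply the Purchase schema at read time, skipping records that do not fit."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unparseable input stays in the lake; this reader ignores it
        if not isinstance(record, dict) or "user" not in record or "amount" not in record:
            continue  # missing a field this particular schema requires
        try:
            amount = float(record["amount"])
        except (TypeError, ValueError):
            continue
        yield Purchase(user=str(record["user"]), amount=amount, country=record.get("country"))


purchases = list(read_purchases(RAW_LINES))
for purchase in purchases:
    print(purchase)
```

Because the raw lines are never rewritten, a different reader could later apply another schema to the same data, for example one that also accepts records without a user.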
Decoupling Storage From Compute
Research from Forrester estimates that 60 to 73 percent of data gathered by organizations goes unused for business intelligence (BI) and analytics. A data lake architecture that combines compute and storage therefore pays for compute capacity that is under-utilized. By decoupling storage from compute, data teams can scale storage economically to keep pace with the proliferation of data sets.
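The effect can be illustrated with some simple arithmetic. In the sketch below, every figure (storage per node, compute per node, workload size) is hypothetical and chosen only to show how a coupled design accumulates idle compute as data grows while the analytical workload stays flat.

```python
import math

# All figures are hypothetical and chosen only to illustrate the arithmetic.
STORAGE_PER_NODE_TB = 10      # storage bundled with each coupled node
COMPUTE_PER_NODE_UNITS = 4    # compute bundled with each coupled node


def coupled_nodes(storage_tb: float, compute_units: float) -> int:
    """Nodes needed when storage and compute can only be bought together."""
    by_storage = math.ceil(storage_tb / STORAGE_PER_NODE_TB)
    by_compute = math.ceil(compute_units / COMPUTE_PER_NODE_UNITS)
    return max(by_storage, by_compute)


def idle_compute_units(storage_tb: float, compute_units: float) -> float:
    """Compute capacity paid for but not needed in the coupled design."""
    nodes = coupled_nodes(storage_tb, compute_units)
    return nodes * COMPUTE_PER_NODE_UNITS - compute_units


# Data volume doubles twice while the workload stays at 20 compute units.
for storage_tb in (100, 200, 400):
    nodes = coupled_nodes(storage_tb, compute_units=20)
    idle = idle_compute_units(storage_tb, compute_units=20)
    print(f"{storage_tb} TB: {nodes} coupled nodes, {idle} idle compute units")
```

In a decoupled design under the same assumptions, storage would grow from 100 TB to 400 TB on its own while compute stayed at the 20 units the workload actually needs.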
Similar to any cloud-based deployment, security for a data lake is a priority. Broadly speaking, the three domains of security relevant to a data lake in the cloud are encryption, network-level security and access control.
Encryption of stored data is essential, at least for data that is not publicly available. Encryption in transit is another key consideration; it is usually configured through the built-in options of each service or through TLS/SSL with the associated certificates.
Network-level security should be consistent with an organization’s overall security framework. It plays a critical role in a robust defense strategy by denying inappropriate access at the network level.
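As a small illustration of encryption in transit, the sketch below builds a client-side TLS context with Python's standard ssl module. The TLS 1.2 minimum is an assumed policy choice, not something the article prescribes, and cloud services usually expose equivalent settings of their own.

```python
import ssl

# Build a client-side TLS context with the standard library's secure defaults.
context = ssl.create_default_context()

# Refuse protocol versions older than TLS 1.2 (an assumed policy choice).
context.minimum_version = ssl.TLSVersion.TLSv1_2

# The default context validates the server certificate and its hostname.
print(context.verify_mode == ssl.CERT_REQUIRED)
print(context.check_hostname)
```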
Authentication and authorization are the key focus areas of access control.
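The difference between the two can be sketched in a few lines: authentication confirms who a user is, and authorization decides what that user may do. The user store, roles, zone names and permissions below are all hypothetical; a cloud deployment would normally delegate both steps to the provider's identity service.

```python
import hashlib
import hmac
import os


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a salted password hash with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


# Hypothetical user store: a salt, a password hash and a role per user.
_salt = os.urandom(16)
USERS = {
    "ana": {"salt": _salt, "hash": hash_password("s3cret", _salt), "role": "analyst"},
}

# Hypothetical mapping from role to allowed (zone, action) pairs.
PERMISSIONS = {
    "analyst": {("curated", "read")},
    "engineer": {("raw", "read"), ("raw", "write"), ("curated", "read"), ("curated", "write")},
}


def authenticate(username: str, password: str) -> bool:
    """Authentication: verify the caller is who they claim to be."""
    user = USERS.get(username)
    if user is None:
        return False
    candidate = hash_password(password, user["salt"])
    return hmac.compare_digest(candidate, user["hash"])


def authorize(username: str, zone: str, action: str) -> bool:
    """Authorization: verify the user's role permits the action on the zone."""
    user = USERS.get(username)
    if user is None:
        return False
    return (zone, action) in PERMISSIONS.get(user["role"], set())


print(authenticate("ana", "s3cret"))        # True
print(authorize("ana", "curated", "read"))  # True
print(authorize("ana", "raw", "write"))     # False
```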
A data lake design must incorporate metadata storage so that users can search and learn about the data sets in the lake.
Two key principles help ensure metadata is created and maintained: enforcing a metadata requirement and automating the creation of metadata.
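Both principles can be sketched with a toy in-memory catalog. Everything here is hypothetical, including the required fields, the CatalogEntry layout and the function names: registration is refused when user-supplied metadata is missing, and technical metadata (size, checksum, timestamp) is derived automatically.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List

REQUIRED_FIELDS = ("owner", "description")  # hypothetical minimum set


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    owner: str
    description: str
    size_bytes: int
    sha256: str
    registered_at: str


CATALOG: Dict[str, CatalogEntry] = {}


def register_dataset(name: str, payload: bytes, **user_metadata: str) -> CatalogEntry:
    """Refuse data sets without required metadata; derive the rest automatically."""
    missing = [field for field in REQUIRED_FIELDS if not user_metadata.get(field)]
    if missing:
        raise ValueError(f"missing required metadata: {', '.join(missing)}")
    entry = CatalogEntry(
        name=name,
        owner=user_metadata["owner"],
        description=user_metadata["description"],
        size_bytes=len(payload),                       # automated
        sha256=hashlib.sha256(payload).hexdigest(),    # automated
        registered_at=datetime.now(timezone.utc).isoformat(),  # automated
    )
    CATALOG[name] = entry
    return entry


def search(term: str) -> List[CatalogEntry]:
    """Find catalog entries whose name or description mentions the term."""
    needle = term.lower()
    return [
        entry for entry in CATALOG.values()
        if needle in entry.name.lower() or needle in entry.description.lower()
    ]


register_dataset("iot_readings", b"7,21.5\n8,19.0\n",
                 owner="ops-team", description="Raw IoT sensor readings")
print([entry.name for entry in search("sensor")])
```

Calling register_dataset without an owner or a description raises a ValueError, which is how this sketch enforces the metadata requirement.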
A data lake offers some key advantages: it provides faster query results on low-cost storage, supports unstructured, semi-structured and structured data, and more. It’s essential, however, that organizations implement a robust data lake architecture to meet enterprise-wide analytical needs.