Let’s start with what a data warehouse is – Integrated historical & current data in a central repository! This data repository is derived from external data sources and operational systems. A data warehouse, being a central component of business intelligence allows enterprises to cover a rather wide range of business decisions. These decisions may include – business expansion, production method improvements, product pricing so on and so forth.
Apart from the huge role that a data warehouse plays in analysis and reporting, a data warehouse provides the following benefits to an organization:
- It allows you to keep data analysis separate from production systems. Complex analytical queries cannot be run by operational databases used by organizations every day. This is where a data warehouse lets the organization to run such queries without there being any ramifications on the production systems.
- Data warehouses bring consistency to disparate data sources.
- Data warehouses have an optimized design for analytical queries.
The popularity of Data Warehouse-as-a-service has increased tremendously over the past five years. This is primarily because of the impact that cloud computing has had on big data architecture. Let’s now have a look at the major differences between cloud-based data warehouses and traditional data warehouses.
A Traditional Data Warehouse
The traditional on-premise data warehouse requires on-premise IT resources like software and servers for the functions of the data warehouse to be delivered. Infrastructure needs to be maintained effectively when organizations run their own on-premise data warehouse.
The 3 tier structure of a traditional data warehouse:
- The data warehouse server is what occupies the bottom tier. This contains data pulled from various sources and is integrated in a sole repository.
- The OLAP servers occupy the middle tier. This allows the data to be more accessible for the different types of queries that will be used on it.
- The front-end BI tools occupy the top layer. These tools are primarily used for querying, reporting and analytics.
In order to pull data into the data warehouse, ETL tools are usually used. These tools obtain data from various sources, process it and apply the relevant business rules to get the data into the right format based on the data model. After this, the data is finally loaded into the data warehouse.
Bill Inmon and Ralph Kimball, two computer science pioneers have contrasting opinions when it comes to traditional warehouse design –
Bill Inmon suggested a top-down approach which meant that all enterprise data is stored in the data warehouse which is the central repository. From this data warehouse, dimensional data marts which serve particular lines of business are created.
On the other hand, the bottom-up approach according to Ralph Kimball suggests that the result of the combining data marts is the data warehouse.
Cloud Data Warehouse
The concept of the cloud-based data warehouse approach stems from leveraging data warehouse services provided by the public cloud providers like Google BigQuery or Amazon Redshift or Azure SQL DW.
With data warehousing services accessible over the internet, public cloud providers allow companies to cut down heavily on their initial set up costs required for a traditional on-premise data warehouse. Adding to that, the cloud data warehouses are fully managed by the providers. Hence, the service providers manage and assume entire responsibility of the required data warehouse functionalities. This includes updates and patches to the system.
In comparison – Traditional vs Cloud
The traditional data warehouse approaches differ from the cloud architectures. Cloud Architectures are somewhat different from traditional Data Warehouse approaches. Take the case of Amazon Redshift – The operations of Redshift are designed where you are required to provision a cluster of cloud-based computing nodes. A few of these nodes compile queries whereas a few of them execute these queries. Google on the other hand provides a serverless service. This means that the allocation of machine resources is managed by Google dynamically. These decisions are taken by Google thereby freeing up the user’s bandwidth. In the case of Azure, it is a solution that is relatively cheaper with an ability to scale and compute storage. In Azure, you have the advantage of pausing and resuming your databases in minutes.
When it comes to cloud data warehouse, the level of optimization it offers is very tough to match by the traditional on-premise setup. Another advantage of cloud over on-premise is columnar storage. This is when table values are stored by column and not rows. This allows for faster aggregate queries in line with the type of queries you need to run in a data warehouse. Another feature that drastically improves query speeds is massively parallel processing. This is done by using many machines to coordinate query processing for large datasets.
When it comes to scalability, in the cloud, it is just as simple as provisioning additional resources from the cloud provider. On-premise scalability is expensive and time consuming as the need to purchase more hardware arises.
The tricky aspect of having a cloud data warehouse is security – transmitting terabytes of data over the internet brings about a security concern which includes compliance concerns as well. This is because the data may carry sensitive information. An on-premise setup holds the edge here as these security concerns are totally avoided because the organization controls everything.
Summing it up
For medium and small-sized companies, the cloud makes data warehousing more accessible than before due to the low barriers to entry. Cloud data warehouses entice even the biggest enterprise due to their lower costs – reduction in infrastructure management costs and easy scalability. Putting things in perspective, the cloud does have its issues when it comes to security. However, the benefits clearly outweigh the negatives. Legacy on-premise setups are not entirely obsolete. However, the volume and velocity of data is growing at the rate of knots today and cloud services are designed to handle this sort of data. As it stands today, more and more workload is moving to the cloud and more and more companies have started providing cloud-based data warehousing services. This trend tells us that the cloud is the future of data warehousing!
Abhimanyu is a sportsman, an avid reader with a massive interest in sports. He is passionate about digital marketing and loves discussions about Big Data.