Into the Future with Cloud: A Comparison with On-premises Data Warehouse
Iman Noor and Saad Bin Tariq
Institute of Geographic Information Systems
National University of Sciences and Technology
Islamabad, Pakistan
imannoor.kr2016@gmail.com, saadbintariq01@gmail.com
Aisha Shabbir
NUST Institute of Civil Engineering, School of Civil and Environmental Engineering
National University of Sciences and Technology
Islamabad, 44000, Pakistan
aisha.shabbir@nice.nust.edu.pk
Mary Aksa
Department of Software Engineering
Foundation University
Islamabad, 44000, Pakistan
Abstract
The demand for data is growing steeply in today's digital world, to the point where terms like "big data" are becoming a thing of the past. This growth calls for modern data handling techniques that let users and researchers analyze vast amounts of data and make predictions from it efficiently. Data warehouses are centralized repositories of data used for business intelligence activities such as analysis and reporting. This paper compares two types of data warehouse: on-premises and cloud. On-premises data warehouses are physically housed inside an organization's infrastructure, whereas cloud data warehouses are online-accessible repositories for data stored on cloud platforms. The paper provides a comparative analysis of both types in terms of deployment, scalability, flexibility, query management, cost analysis, access and integration, data security, data storage, data recovery, self-service capabilities, and speed and performance. It further highlights the evolution of data warehouses onto the cloud and the growing demand for an efficient data warehouse, driven by the increasing volume, velocity, variety, value, and veracity of incoming data in all domains. Finally, it analyzes the advantages of the more suitable data warehouse and discusses the limitations of both.
Keywords
Cloud data warehouse, on-premises data warehouse, cloud computing, big data
1. Introduction
The concept of data warehousing dates to the 1970s, when organizations, businesses, and enterprises began to recognize the importance of handling, storing, and managing growing amounts of data. To address these problems, Bill Inmon, often called the father of the data warehouse, introduced the term “data warehouse” in the late 1980s. Similarly, an IBM Systems Journal article, ‘An architecture for a business information system’, published in 1988, coined the term Business Data Warehouse. This journey opened the gates to endless possibilities for handling rapidly escalating volumes of data.
Today, data warehousing is regarded as one of the most significant parts of a business, organization, or company. The world of business intelligence has firmly embraced the concept of data warehousing, using it for decision-making, for managing business reports, and for data analytics. In current times, when data is seen as the latest weapon and the most precious resource, the value of finding a suitable data warehouse solution is increasing rapidly. (Vaisman, A., & Zimányi, E. (2014). Data Warehouse Systems) A data warehouse is defined as a subject-oriented, integrated, time-variant, and non-volatile repository used to support the decision-making processes of an organization. (Chen, X. (2004). E-Business Data Warehouse Design and Implementation, Inmon, W. H. (2005). Building the Data Warehouse.) Data warehousing is the practice of assembling and storing large amounts of data from multiple sources in such a manner that they can be efficiently retrieved and analyzed. A data warehouse can also be described as a collection of databases used for storing, analyzing, and manipulating data as needed. Analysts use this data to produce online analyses and reports and to support the decision-making of an organization. (Giannakouris, K., & Smihily, M. (2014). Cloud Computing Statistics.) The data warehouse therefore occupies a central place in the business intelligence world. Data warehouses collect data from various sources across various platforms and then, by running different decision support queries, produce analytical reports. (Abdelaziz, E., & Mohamed, O. (2015). Optimization of Queries Execution Plans in Cloud Data Warehouses.)
Traditional data warehouses are known to scale poorly. Despite the support that a traditional data warehouse provides, the time required for optimization, configuration, and management cannot keep pace with the ever-growing inflow of data. (Rehman, K. U. U., & Mahmood, S. (2018). A Comparative Analysis of Traditional and Cloud Data Warehouse.) The load from large queries and traffic clustering has eventually led to a decline in data warehouse performance. (Abdelaziz, E., & Mohamed, O. (2015). Optimisation of queries execution plans in cloud data warehouses) A resolution to this problem has been provided by the emergence of cloud-based solutions. Factors such as cost-effectiveness, scalability, and the services provided by cloud data warehouse solutions have drawn wide attention. Data warehouses have continued to grow in this era of complex data generation, coping with the expanding demands of business intelligence. The new paradigm of the cloud data warehouse stores data remotely and provides solutions to the business world. (Abdelaziz, E., & Mohamed, O. (2015). Optimisation of queries execution plans in Cloud Data Warehouses.) A vast number of organizations have moved their data warehouses from an on-premises approach to hosting them on the cloud, and many more are still in the process of making this transition.
In this paper, we provide a comparative analysis of on-premises and cloud data warehouses. The paper is structured as follows: the literature review is provided in Section 2. The comparison is laid out in Section 3. The approaches used to decide between on-premises and cloud data warehouses are discussed in Section 4. Section 5 presents the discussion. Finally, the conclusion is drawn in the last section of the paper.
2. Literature Review
Cloud data warehouses have lately been reshaping the world of business intelligence. Recent studies have emphasized how to migrate a data warehouse from a traditional approach to a cloud-based one. In a world where big data has created hype, much greater emphasis is being placed on approaches for handling the velocity, variety, volume, and veracity attributed to big data. Previous research has classified big data by these four attributes and examined each characteristic in further detail, concluding that big data plays a highly significant role in various applications [7]. In parallel, attention has been drawn to the possible methods of handling the immense amounts of data produced daily in the business environment. Among these, cloud data warehousing is emerging as the focus of the future of data warehousing.
At the moment, attention is drifting towards cloud computing, cloud-based data warehouses, and their solutions. At the same time, recent studies have been laying the foundation for a comparison between on-premises data warehouses and cloud data warehouses. (Yadav, P. K., Sharma, S., & Singh, A. (2019). Big Data and Cloud Computing: An Emerging Perspective and Future Trends, Inmon, W. H. (2005). Building the Data Warehouse.)
In (Rehman, K. U. U., & Mahmood, S. (2018). A Comparative Analysis of Traditional and Cloud Data Warehouse.) the authors compare traditional and cloud data warehouses on several parameters, including the possibility of scaling the data warehouse up or down, the flexibility of each approach, cost, the ability to handle diverse data types (semi-structured and structured), and others. The study states that cloud data warehouses outperformed traditional data warehouses on the parameters considered in the comparative analysis. Furthermore, it notes that cloud data warehouses are free from burdens such as maintaining indexes, cleaning files, and updating metadata files, which supports its conclusion that cloud data warehouses are taking over from traditional warehouses with respect to decision support and business analytics. A similar comparative analysis has been provided by another study. (Golec, D., Strugar, I., & Belak, D. (2021). The benefits of Enterprise Data Warehouse Implementation in Cloud vs. On-Premises.) That study drew attention to several advantages of cloud data warehouses, including improved access and integration, improved speed and performance, low cost of ownership, leveraged cloud and elasticity, and others, and presents these advantages in the form of a figure and a table. It also proposes a cloud strategy, defined as the transition from a traditional data warehouse to a cloud data warehouse. (Cloud computing. (n.d.). Shaping Europe’s Digital Future.)
Although a great deal of attention has been paid to the comparison between traditional and modern warehouses, some studies have shifted the focus to the advantages and challenges of cloud data warehouses, among which data security is highlighted as a major issue. (Boyko, N., & Shakhovska, N. (2018). Prospects for Using Cloud Data Warehouses in Information Systems.) In the midst of the transition to cloud data warehouses, organizations and businesses are reluctant because of concerns about the security of personal data in the cloud. According to other research, organizations such as healthcare bodies, government departments, and financial institutions deal with sensitive information and are therefore skeptical about whether to trust cloud solutions with this data. (Shaikh, A. H., & Meshram, B. (2021). Security Issues in Cloud Computing.) In counterpoint, another study has stated that cloud data warehouse providers are adopting sets of policies to ensure data security and to convince a larger community of businesses to entrust their data to cloud data warehouses. (Vaishnav, J., & Prasad, N. (2021). Security Aspects in Cloud Tools and Its Analysis—A Study.) To ensure the security of personal as well as non-personal information, the European Union has helped create a platform on which its member countries have signed a joint pact on cloud computing. This platform aims to create a European cloud in which the member countries would be assured of a regulated free flow of non-personal data, cybersecurity, and data protection in the cloud. (Cloud computing. (n.d.). Shaping Europe’s Digital Future.)
As a result, research has stated that 41% of European Union enterprises had adopted cloud computing by the year 2021. High adoption rates were observed in Sweden (75%), Finland (75%), the Netherlands (65%), and Denmark (65%). (Giannakouris, K., & Smihily, M. (2014). Cloud computing statistics on the use by enterprises.)
With all the talk revolving around the advantages of cloud-based data warehouses and the ease they have provided to users and the business community, the challenges of cloud-based data warehouses are often overlooked. The study in (Kurunji, S., et al. (2012). Communication cost optimization for cloud Data Warehouse queries.) addresses one such challenge by providing a thorough model for communication cost optimization. According to the authors, in a cloud data warehouse the relevant partitions are never guaranteed to be stored on the same physical machine, so executing a query requires communication across several nodes, which increases inter-node traffic. The size of the communication messages grows in proportion to the number of nodes and the amount of data, which leads to performance degradation. The study therefore proposes a PK-map-based storage structure alongside a query processing algorithm, which is reported to minimize inter-node communication and decrease the workload of joins during query execution. Such studies have paved the way for moving from the traditional data warehouse to the cloud data warehouse, and they indicate how research is taking further steps towards the cloud computing approach.
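As a rough illustration of why partition placement affects communication cost, the sketch below counts the rows that must cross the network during a join when matching partitions sit on different nodes. It is not the PK-map structure from the cited study: the function `rows_shipped`, the node names, the placements, and the partition sizes are all hypothetical.

```python
def rows_shipped(fact_placement, dim_placement, fact_rows):
    """Count fact rows that must be sent to another node to meet their
    matching dimension partition.

    fact_placement / dim_placement: partition key -> node id (hypothetical)
    fact_rows: partition key -> number of fact rows in that partition
    """
    shipped = 0
    for key, node in fact_placement.items():
        if dim_placement[key] != node:
            shipped += fact_rows[key]
    return shipped


# Hypothetical partition sizes.
fact_rows = {"p0": 1000, "p1": 1500, "p2": 800}

# Layout 1: matching partitions are scattered across nodes.
fact_nodes = {"p0": "node_a", "p1": "node_b", "p2": "node_c"}
scattered_dim_nodes = {"p0": "node_b", "p1": "node_b", "p2": "node_a"}

# Layout 2: each dimension partition is co-located with its fact partition.
colocated_dim_nodes = dict(fact_nodes)

scattered_cost = rows_shipped(fact_nodes, scattered_dim_nodes, fact_rows)
colocated_cost = rows_shipped(fact_nodes, colocated_dim_nodes, fact_rows)
print(scattered_cost, colocated_cost)
```

In this toy layout only partition p1 is joined locally when the partitions are scattered, whereas co-location removes the transfers entirely. Real systems also account for message size, network topology, and skew, which this sketch ignores.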
Research did not stop at the optimization of communication costs. The authors of (Abdelaziz, E., & Mohamed, O. (2015). Optimisation of queries execution plans in Cloud Data Warehouses.) laid out a plan for optimizing queries in a cloud data warehouse, with an emphasis on improving the cloud data warehouse's performance. Their work highlights that data in a cloud data warehouse is stored across nodes, and that optimizing inter-node communication is necessary to improve processing and response time. They therefore propose an approach to enhancing the performance of cloud data warehouses based on a classification technique and an algorithm built on the MapReduce programming model, and this methodology is claimed to minimize inter-node communication and hence query processing time. As with the optimization model described above, this research shows how cloud data warehouses can be moved a step further. Beyond the general advantages, such optimization approaches add a further reason for migrating from on-premises data warehouses to cloud data warehouses.
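For readers unfamiliar with the MapReduce programming model mentioned above, a minimal in-process imitation of its three phases (map, shuffle, reduce) is sketched here for a grouped sum. This is not the algorithm proposed in the cited work; the `sales` records and function names are illustrative assumptions, and a real framework would distribute each phase across nodes.

```python
from collections import defaultdict


def map_phase(rows):
    """Emit one (key, value) pair per input record."""
    for region, amount in rows:
        yield region, amount


def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine each key's values into a single aggregate."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical (region, amount) sales records.
sales = [("north", 120), ("south", 80), ("north", 30), ("east", 50), ("south", 20)]

totals = reduce_phase(shuffle(map_phase(sales)))
print(totals)
```

The shuffle step is where data moves between nodes in a distributed setting, which is why reducing inter-node communication matters for query processing time.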
Moving further, studies have advanced from the advantages and optimization of cloud data warehouses to cloud data warehouse solutions. The authors of (Guermazi, E., BEN AYED, M., & Ben-Abdallah, H. (2014). A survey of data warehouse security. Ingénierie des Systèmes d'Information, 19(5), 75-96.) have listed up to ten cloud-based data warehouse solutions, including Panoply, which tops the list, Teradata Integrated Data Warehouse, Yellowbrick Data, Oracle Autonomous Warehouse, IBM Db2 Warehouse, Google BigQuery, Amazon Redshift, and a few others. These solutions have also received further attention in (Cloud computing. (n.d.). Shaping Europe’s Digital Future.).
The authors of (Gupta, A., et al. (2015). Amazon Redshift and the case for simpler data warehouses.) have presented Amazon Redshift as a cloud-based data warehousing solution. Their work compares traditional data warehouses with Amazon Redshift, arguing that traditional data warehouse systems are complex and costly, whereas Amazon Redshift is a cost-effective alternative. In terms of simplicity, flexibility, and scalability, the Amazon Redshift cloud-based data warehouse is said to be well suited to a business environment and capable of handling big data.
In the world of big data, where Amazon Redshift is a hot topic, technologies such as Oracle Database Cloud Service are also capturing the public's attention with their capabilities. Research on the Oracle Database Cloud Service states that only Oracle offers a complete platform as a service (PaaS) environment that allows a blended approach covering hardware and software together. Additionally, this technology offers users the option of using the Oracle Database Exadata Cloud Service for boosted performance. (Cloud Data Warehouse | Solutions Providers | EM360. (n.d.). EM360 Tech, Onyebuchi, A., et al. (2022). Business Demand for a Cloud Enterprise Data Warehouse in Electronic Healthcare Computing: Issues and Developments in E-Healthcare Cloud Computing.)
To convince enterprises that the future lies with cloud computing and that cloud data warehouses should be adopted, a thorough comparison between on-premises and cloud data warehouses is necessary. Such a comparison can be made using an extended set of parameters. This paper attempts to bridge that gap by examining both types of warehouse in greater depth.
3. Comparative Analysis of On-Premises and Cloud-based Data Warehouse
The most prominent property that distinguishes an on-premises data warehouse from a cloud data warehouse is where the data is located. In an on-premises data warehouse, the data, as well as the software, hardware, and applications required to handle it, are kept on-site, and the repository is managed solely by the organization itself. In a cloud data warehouse, by contrast, the data is hosted off-premises and is handled by another organization. The data centers are run and managed by the organization responsible for the cloud data warehouse, yet business users can use the data in real time without interruption or hindrance. In other words, a cloud data warehouse is a repository that stores the data remotely, while the data is used in the same way as in any other data warehouse.
Furthermore, cloud services are available to users in various forms, according to their needs. The first is a service in which applications hosted virtually on the cloud can be installed and run. This service is called Infrastructure as a Service (IaaS), and notable providers include Amazon Elastic Compute Cloud (EC2) and Cisco Vblock. (Bogdándy, B., Kovács, Á., & Tóth, Z. (2020). Case Study of an On-premise Data Warehouse Configuration.) The second is Platform as a Service (PaaS), where the cloud provides platforms such as web servers, mail servers, and databases. An example of this service is Microsoft Windows Azure. Lastly, the cloud provides Software as a Service (SaaS), where users are granted access to ready-to-use applications that can be used by multiple users simultaneously. (Abdelaziz, E., & Mohamed, O. (2015). Optimisation of queries execution plans in Cloud Data Warehouses.)
The two types of data warehouse can be compared on the following parameters:
• Deployment:
The process of installation, configuration, and implementation of the data warehouse.
• Scalability:
The ability to handle increasing amounts of data without compromising performance.
• Flexibility:
The ability to adapt to changes without disrupting the functionality of the system.
• Query Management:
The process of optimizing and managing queries.
• Cost Analysis:
The cost of setting up the data warehouse.
• Access and Integration:
The ability to cater to different data types in a data warehouse.
• Data Security:
The measures taken to protect sensitive data.
• Data Storage:
The capacity and potential to store large volumes of data.
• Data Recovery:
The ability to recover data in case of data loss.
• Self-service capabilities:
The ability of the user to independently access, analyze, and visualize the data.
• Speed and performance:
The ability to process and retrieve data quickly and efficiently.
3.1 Deployment
The deployment of an on-premises data warehouse is tedious compared to that of a cloud data warehouse. For a traditional data warehouse, the planning phase is handled solely by the business organization itself, which makes deployment an enormous task. For a cloud data warehouse, by contrast, the service provider caters to and fulfils these requirements on the organization's behalf.
3.2 Scalability
An on-premises data warehouse cannot readily be scaled up or down, and it lacks the capability to cater for a growing number of users. Moreover, studies have stated that configuring the hardware, software, and infrastructure of an on-premises data warehouse can take days. In contrast, a cloud data warehouse is known for scaling up and down as needed and is capable of handling increasing usage. An on-premises data warehouse also lacks the capacity to downsize when usage falls, whereas a cloud data warehouse downsizes automatically when underutilized. (Boyko, N., & Shakhovska, N. (2018). Prospects for using cloud data warehouses in information systems.)
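The scale-up and scale-down behavior described above can be pictured with a simple threshold rule. The sketch below is an assumption for illustration only and does not reproduce any vendor's scaling policy: the function `target_nodes`, the utilization thresholds, and the node limits are all hypothetical.

```python
def target_nodes(current_nodes, utilization,
                 low=0.30, high=0.75, min_nodes=1, max_nodes=16):
    """Return the node count a threshold-based autoscaler would aim for.

    utilization is the fraction of cluster capacity in use (0.0 to 1.0).
    The thresholds and limits are hypothetical defaults.
    """
    if utilization > high:
        desired = current_nodes * 2      # scale up under heavy load
    elif utilization < low:
        desired = current_nodes // 2     # scale down when underutilized
    else:
        desired = current_nodes          # stay put inside the target band
    return max(min_nodes, min(max_nodes, desired))


for load in (0.90, 0.50, 0.10):
    print(load, target_nodes(4, load))
```

An on-premises deployment has no equivalent of the scale-down branch, since hardware that has already been purchased remains in place when usage falls.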
3.3 Flexibility
A cloud data warehouse is known to be flexible with respect to growing usage and incoming data, and it adjusts automatically as the requirements of the data warehouse change. This ability to adapt quickly to change ultimately reduces the cost of implementing a cloud data warehouse. An on-premises data warehouse, on the other hand, is not flexible with respect to the requirements of the user, which makes it more expensive than a cloud data warehouse. (Rehman, K. U. U., & Mahmood, S. (2018). A Comparative Analysis of Traditional and Cloud Data Warehouse.)
3.4 Query Management
In an on-premises data warehouse, queries are known to be affected as the data expands, whereas in a cloud data warehouse queries are not affected in that scenario. (Ali, M. H., Hosain, M. S., & Hossain, M. A. (2021). Big Data analysis using BigQuery on cloud computing platform.)
3.5 Cost Analysis
Owning and setting up an on-premises data warehouse is more expensive than using a cloud data warehouse, because every component required to set it up, such as the server, the hardware elements, the software, and the storage devices, adds to the setup cost. Moreover, maintenance and regular check-ups further increase the cost of implementing an on-premises data warehouse. The cloud data warehouse, in contrast, works on a rental principle, and users are billed only for the services they use [8]. Furthermore, maintenance, upgrades, and regular check-ups are handled by the service provider, which cuts the cost to a great extent.
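The difference between the two cost models can be made concrete with a small calculation. The sketch below compares cumulative cost under an upfront-plus-maintenance model and a pay-as-you-go model; every figure is a made-up placeholder rather than a measured or quoted price, and the function names are hypothetical.

```python
def on_prem_cost(months, upfront, monthly_maintenance):
    """Cumulative cost: one-off setup plus recurring maintenance."""
    return upfront + months * monthly_maintenance


def cloud_cost(months, monthly_usage_fee):
    """Cumulative cost: only the services used are billed each month."""
    return months * monthly_usage_fee


def first_month_cloud_not_cheaper(upfront, monthly_maintenance,
                                  monthly_usage_fee, horizon_months=120):
    """First month at which cumulative cloud cost reaches on-premises cost,
    or None if that never happens within the horizon."""
    for month in range(1, horizon_months + 1):
        if cloud_cost(month, monthly_usage_fee) >= on_prem_cost(
                month, upfront, monthly_maintenance):
            return month
    return None


# Placeholder figures, chosen only to make the arithmetic easy to follow.
UPFRONT, MAINTENANCE, USAGE_FEE = 100_000, 2_000, 6_000

for month in (12, 24, 36):
    print(month, on_prem_cost(month, UPFRONT, MAINTENANCE),
          cloud_cost(month, USAGE_FEE))

break_even = first_month_cloud_not_cheaper(UPFRONT, MAINTENANCE, USAGE_FEE)
print(break_even)
```

With these placeholder numbers the cloud model avoids the entire upfront outlay and stays cheaper for the first two years. The actual outcome depends on the usage fees, maintenance costs, and time horizon of a given deployment.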
3.6 Access and Integration
Access and integration refer to the ability to support both structured and unstructured data types. A traditional on-premises data warehouse is inadequate for the wide variety of data types available, as it is only capable of handling, storing, and managing structured data. The cloud data warehouse, however, is recognized as capable of managing and storing diverse data types, including unstructured, semi-structured, and structured data. (Golec, D., Strugar, I., & Belak, D. (2021). The benefits of enterprise data warehouse implementation in cloud vs. on-premises.)
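To show what handling semi-structured data involves, the sketch below flattens nested JSON records with differing fields into a single tabular layout, filling missing fields with `None`. It uses only Python's standard `json` module; the sample events and the `flatten` helper are hypothetical and do not reflect how any particular warehouse ingests data.

```python
import json

# Hypothetical semi-structured events: fields differ between records.
raw_events = [
    '{"user": "u1", "action": "login", "device": {"os": "android", "version": "13"}}',
    '{"user": "u2", "action": "purchase", "amount": 49.5}',
]


def flatten(record, prefix=""):
    """Turn nested objects into dotted column names, e.g. device.os."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


rows = [flatten(json.loads(line)) for line in raw_events]

# The column set is the union of all fields seen across the records.
columns = sorted({column for row in rows for column in row})
table = [[row.get(column) for column in columns] for row in rows]

print(columns)
for line in table:
    print(line)
```

A strictly structured store would require the schema to be fixed in advance, whereas here the column set is derived from the data itself.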
3.7 Data Security
Users have been skeptical about adopting cloud data warehouses because of concerns over the security they provide. Over time, however, cloud solutions have become more secure. Cloud data warehouses are stated to provide multi-factor authentication facilities, which have paved the way for more secure communication and data handling. (Vaishnav, J., & Prasad, N. (2021). Security aspects in cloud tools and its analysis—a study. Proceedings of Inventive Systems and Control: ICISC 2021. Springer.)
3.8 Data Storage
As already stated, an on-premises data warehouse cannot be scaled as needed, whereas a cloud data warehouse can. Similarly, a cloud data warehouse readily provides additional storage to users who need it, whereas a traditional data warehouse has a fixed upper limit on data storage that cannot be exceeded.
3.9 Data Recovery
In the event of a disaster, recovering an on-premises data warehouse is nearly impossible. In a cloud data warehouse, the data is stored on multiple nodes that back up the data asynchronously, so the data remains available continuously and without interruption. Moreover, the requirement to set up a secondary data storage center is smaller for a cloud data warehouse than for an on-premises data warehouse. (Golec, D., Strugar, I., & Belak, D. (2021). The benefits of enterprise data warehouse implementation in cloud vs. on-premises.)
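The idea of keeping copies on multiple nodes can be sketched with a toy in-memory store. This is an illustrative assumption, not a description of any provider's replication mechanism: the class `ReplicatedStore` and the node names are hypothetical, and for brevity the copies are written synchronously rather than asynchronously as described above.

```python
class ReplicatedStore:
    """Toy key-value store that keeps a full copy of the data on every node."""

    def __init__(self, node_names):
        self.nodes = {name: {} for name in node_names}
        self.failed = set()

    def put(self, key, value):
        # Write to every healthy node (synchronously, to keep the sketch short).
        for name, data in self.nodes.items():
            if name not in self.failed:
                data[key] = value

    def fail(self, name):
        # Simulate losing a node together with its local copy of the data.
        self.failed.add(name)
        self.nodes[name].clear()

    def get(self, key):
        # Serve the read from the first healthy node that holds the key.
        for name, data in self.nodes.items():
            if name not in self.failed and key in data:
                return data[key]
        raise KeyError(key)


store = ReplicatedStore(["node_1", "node_2", "node_3"])
store.put("q1_sales", 1250)
store.fail("node_1")
recovered = store.get("q1_sales")
print(recovered)
```

The read still succeeds after one node is lost because the remaining nodes hold copies. A single-site deployment without a secondary storage center has no such fallback.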
Table 1. Summary of comparisons of Data Warehouses
Parameters | On-Premises Data Warehouse | Cloud Data Warehouse
Deployment | |
Scalability | |
Flexibility | |
Query Management | |
Cost Analysis | |