Skip to content
charmnailspa

charmnailspa

Technological development

Primary Menu
  • Computer & Technology
  • internet marketing
  • Web Design
  • Technology
  • computer
  • Business
  • About Us
    • Advertise Here
    • Contact Us
    • Privacy Policy
    • Sitemap
  • Home
  • Comparison of database architectures: data warehouse, data lake and data lakehouse
  • Computer & Technology

Comparison of database architectures: data warehouse, data lake and data lakehouse

Lisa H. Shelton May 19, 2022

[ad_1]

Database architectures have experience constant innovation, evolving with the appearence of new use cases, technical constraints, and requirements. From the three database structures we are comparing, the first one to appear was the data warehouses, introduced in the 80’s with the support of Online Analytical Processing (OLAP) systems, helping organizations face the rise of diverse applications in the 90’s by centralizing and supporting historical data to gain competitive business analytics. Later in the early 2000s data lakes appeared, thanks to innovations in cloud computing and storage, enabling to save an exorbitant amounts of data in different formats for future analysis.

To this day both solutions remain popular depending on different business needs. For instance, data warehouses allow for high-performance Business Analytics and fine grained data governance. However, they lack on affordable scalability for petabytes of data. On the other side of the pitch, data lakes enable high throughput and low latency, but they have issues with data governance leading to unmanageable “data swamps”. In addition, data is considered immutable, which leads to additional integration efforts.

This is why we can find modern data lake and data warehouse ecosystems converging, both getting inspiration, borrowing concepts, and addressing use cases from each other. In this landscape we find a new architecture emerge: the data lakehouse, which tries to combine the key benefits of both competing architectures, offering low-cost storage accessible by multiple data processing engines such as Apache Spark, raw access to the data, data manipulation, and extra flexibility. Let’s review each one of them in detail.

Table of Contents

Toggle
  • Methodology of Comparison
  • What is a data warehouse?
  • What is a data lake?
  • What is a data lakehouse?
  • Which architecture to use?
  • Sources
  • About the Author
      • Lisa H. Shelton

Methodology of Comparison

There are multiple indicators to consider when selecting a database architecture. To be more comprehensive, we pre-selected a set of common concerns.

What is a data warehouse?

A data warehouse is a centralized system designed to store present and historical data. Their focus is to provide readily available data for advanced querying and analysis. The main motivation for the emergence of data warehouses relied on solving the inconsistent data of RDBMS by transforming data from operational systems to analytical-processing support systems. The data warehouse is considered as the primary source of truth on business operations.

Data warehouse architecture

A standard data warehouse architecture (image above) uses Extract, Transform and Load (ETL) for data transit through three different layers, data staging, data core, and data access. The first layer sees raw data format transition to a fully transformed set. Then data schema-on-write is used to optimize the data model for downstream BI consumption. Finally, the access layer lets users retrieve the translated and organized data through SQL queries.

Data warehouse powers reports, dashboards, and analysis tools by storing data efficiently. They minimize input and output (I/O), allowing query results to be delivered faster and to multiple users simultaneously. Additionally, subsets of the data warehouse, called data marts, can be provided to answer specialized analytical needs.

Let’s review data warehouses by our selected indicators:

As a result, data warehouses excel at providing quality and consistency of the data; they allow for performant data analytics and BI thanks to their ACID compliance.

The limitations of data warehouses start when there is the need to use varied types of data like textual, streaming IoT or multimedia. In addition, the rise of machine learning and AI requires iterative computations that need direct raw data access.

What is a data lake?

Data lakes first appeared thanks to the inception of Hadoop which allows for multiple machines to work as one, allowing compute and storage to scale horizontally. Additionally, Hadoop allows data to be ‘thrown in’ without worrying about its structure. This works thanks to a schema-on-read process, where data is ingested in its raw format and only transformed when needed for analysis, enabling for semi-structured and unstructured data support. Lastly, data lake architecture relies on a metadata catalog (e.g., Hive Metastore, Apache Glue). Without the metadata catalog, data lakes quickly become unmanageable and drift into data swamps.

Later on, cloud services, AWS being the first one, introduced capabilities to decouple compute and storage, which mean they could scale independently. Moreover, many upgrades were possible thanks to Hadoop ecosystem being open-source. This include big data frameworks (e.g., Hortonworks, Cloudera, mapR) which aid on Hadoop components handling, as well as the addition of new open-source tools like Apache Spark, which started to be used as the processing engine, enabling performant ETL to transform raw data into refined and structured data for different uses cases.

Nevertheless, data lakes have difficulties providing data management, guaranteeing quality, security, and data governance. On top of that, if you need high performant SQL queries on petabytes of data and return complex analytical results fast, or the use of BI tools and features such as schema enforcement data and versioning, then data lakes are not enough. That’s why the current use of data lakes has evolved to a two-tier architecture alongside data warehouses.

Data lake

In this two-tier architecture, structured data is traditionally stored in raw format in the data lake, but is later on processed and stored in tabular format in the data warehouses, as we can see in the image above. The data stored in data warehouses are used for data analytics and Business Intelligence, while semi-structured and unstructured data is stored in the data lake and used for Data Science and Machine Learning.

Let’s review the two-tier architecture with the preselected indicators:

Indicator Description
Accessibility BI analysts have limited access to data warehouses in the two-tier architectures, relying on data engineers to structure data into the warehouse. Lastly, data lakes often become data swamps when there is incorrectly cataloged metadata, making that data unmanageable.
Lineage Raw data is accessible through data lakes, but often BI analysts will only have access to the data warehouse where engineers load selected data.
Data Types Structured, semi-structured and unstructured support
Ingestion Data lakes perform ELT (Extract, Load, Transform), meaning they don’t transform data before loading it, as they do not implement a schema for the data to fit. Instead, the transformation and schema are verified when a query is required. This is the process previously mentioned as a schema-on-read approach.
Governance & Security All the data is stored and managed as files. This doesn’t provide fine-grained access control on the contents of files, but only coarse-grained access control.
Upserts & purging Any data manipulation language (DML) operation on a data lake results in a file modification. A new file is created, and additional metadata operations must be made to keep track of the changes. Managing and deleting file versions is a complex task in a data lake.
Scalability Thanks to decoupled compute and storage, data lakes can scale independently. Cloud-based data lakes offer both storage and computing, making them quite scalable in comparison to Data Warehouses. Additionally, distributed file systems allow for increased storage capacity. On the negative side, due to the nature of the two-tier architecture, the scalability issues of data warehouses arise. On top of that, data is constantly transformed and processed towards data warehouses, which adds further costs and duplication of the data in both data lakes and data warehouses.
Performance Data lakes can’t match the performances of data warehouses even with engines like Spark. Additionally, the two-tier architecture is highly complex for users because data goes first into data lakes and then into data warehouses, creating complexities, delays, and new failure modes. This brings considerable performance issues when compared to plain data warehouses.
Reliability Since data lakes lack fine-grained governance and ACID compliance, there are potential issues with the consistency of the data. This is especially an issue when there are multiple readers and writers. There’s also the complexity of schema mismatch due to the lack of schema enforcement in an environment that relies on batch and streaming data from multiple heterogeneous sources. Additionally, in the case of two-tier architecture, continuous engineering is required to ETL data between data warehouses and data lakes. Each transformation incurs risks of failures that reduce data quality. This also results in stale data in the data warehouse, as transformations from data lakes may take days to load.
Applications ML & DS applications, thanks to open data format (such as parquet and orc) and engines like Spark, are directly accessible to a wide range of other analytics engines, such as machine learning systems. For BI and high performant SQL applications, it is required to do ETL pipelines towards a Data warehouse.

As a result, data lakes bring efficient capabilities to the openness of the data and the cost of data storage. Additionally, they are suitable for machine learning and artificial intelligence algorithms, thanks to their support of various processing frameworks (allowing the use of python libraries) and give access to large amounts of raw data.

On the downside, the two-tier architecture results in complex ETL pipelines due to the significant movement, processing, and duplication of data towards data warehouses. Operationalizing and governance of this data architecture also become a challenge due to cost and complexity. All these results in data swamps and stale data.

What is a data lakehouse?

In 2019 Databricks released a paper introducing the concept of data lakehouse. They intended to add a storage layer alongside Apache Spark, allowing for transactions and enforcing schema-on-write in an object-store. On the side, Netflix and Uber had released similar capabilities through Apache Iceberg and Apache Hudi, respectively. This would make the use of data warehouses redundant.

In its architecture, a data lakehouse aims to provide data governance capabilities to a data lake while also reducing the operational costs of the aforementioned two-tier architecture. To achieve this, two features become vital. The first is the use of open file formats, such as Parquet and ORC, to facilitate essential statistics and allow for predefined data schemas. The second is the low-cost data storage system of a data lake, as decoupled compute and storage will allow for multiple processing engines.

But this doesn’t provide warehouse capabilities such as in-depth data management, versioning, or schema enforcement. ACID transactions are required, which were previously impossible in a distributed system where everything is on object storage.

Data lake

Lakehouse architecture (image above) embraces this ACID paradigm by leveraging a metadata layer (e.g., Hive metastore, HDFS) and, more precisely, a storage abstraction framework (Apache Iceberg, Apache Hudi, Delta Tables). These open table formats will allow the metadata layer to register changes as transactions while handling concurrency.

Let’s dive into each point of comparison:

This architecture allows key warehouse capabilities to exist on a data lake. Furthermore, lakehouse solutions implement other optimizations on the engine layer (through Spark or Flink) to optimize query performance, such as caching, auxiliary data structures (indexes and statistics), and data layout optimizations. In comparison to data lakes, they reduced data redundancy and data staleness due to a single all-purpose data storage, reduced the so-called data swamps as data is now versioned, and added governance and security layers on top of it.

On the negative side, data lakehouse architecture is relatively new and immature, and some added features are still on a to-do list. There are also complaints on different topics, like Delta Lake’s reliance on querying uniquely through Delta Lake tables and not external tables or the complexities of using notebooks compared to the simple interface of modern Data warehouses.

Which architecture to use?

The convergence of data warehouses and lakes towards each other has brought us the new lakehouse architecture, but let’s summarize how each one stacks against the data lakehouse:

  • If you need high-performance Business Analytics while having access to fine-grained data governance, data warehouses are your choice. The high performance of a warehouse is unmatched by the others. Nevertheless, they are challenging and expensive to scale and lack the flexibility to efficiently handle all data types. If high transaction throughput and different data types are a requirement, lakehouse architecture can be a solution. The main argument against this transition is the complex migration of such different architectures.
  • If you want to scale and process petabytes of data affordably, having storage while preserving computing and provide high throughput (in opposition to low latency access), data lakes are your choice. On the other hand, data lakes do not provide fine-grained access control or high-performance business analytics. In case you need this, a transition to data lakehouse might be possible and easier to perform due to these architecture reliance on similar distributed technology.

We also want to briefly mention modern cloud data warehouses (such as Snowflakes, Clickhouse, Azure Synapse) as they provide comparable solutions to lakehouses. Nevertheless, they are slightly different, as they behave more as a data warehouse trying to adopt data lake properties like decoupled compute and storage.

Sources

[ad_2]

Source link

About the Author

Lisa H. Shelton

Administrator

Visit Website View All Posts

Post navigation

Previous: Windows 11 Hidden Features That’ll Change How You Use Your Computer
Next: DHS Suspends Its New Disinformation Board, Following Criticism

Related News

The Changing Job Roles in Silicon Valley with AI
  • Computer & Technology

The Changing Job Roles in Silicon Valley with AI

Lisa H. Shelton May 20, 2025
Is AI Really Replacing Jobs in Silicon Valley?
  • Computer & Technology

Is AI Really Replacing Jobs in Silicon Valley?

Lisa H. Shelton May 13, 2025
The Impact of Natural Language Processing on AI
  • Computer & Technology

The Impact of Natural Language Processing on AI

Lisa H. Shelton May 8, 2025
October 2025
M T W T F S S
 12345
6789101112
13141516171819
20212223242526
2728293031  
« Sep    

Archives

  • September 2025
  • May 2025
  • April 2025
  • March 2025
  • November 2024
  • October 2024
  • September 2024
  • August 2024
  • July 2024
  • June 2024
  • May 2024
  • April 2024
  • March 2024
  • February 2024
  • January 2024
  • December 2023
  • November 2023
  • October 2023
  • September 2023
  • August 2023
  • July 2023
  • June 2023
  • May 2023
  • April 2023
  • March 2023
  • February 2023
  • January 2023
  • December 2022
  • November 2022
  • October 2022
  • September 2022
  • August 2022
  • July 2022
  • June 2022
  • May 2022
  • April 2022
  • March 2022
  • February 2022
  • January 2022
  • December 2021
  • November 2021
  • October 2021
  • September 2021
  • August 2021
  • July 2021
  • June 2021
  • May 2021
  • April 2021
  • March 2021
  • February 2021
  • January 2021
  • December 2020
  • November 2020
  • October 2020
  • September 2020
  • August 2020
  • July 2020
  • June 2020
  • March 2020
  • February 2020
  • November 2018
  • October 2018
  • January 2017

Categories

  • Business
  • computer
  • Computer & Technology
  • internet
  • internet marketing
  • Technology
  • Web Design

Recent Posts

  • Wigs for Patients Designed for Natural Beauty
  • Aluminum Fence Installation Port St. Lucie: Elegance, Security, and Lasting Durability
  • The Changing Job Roles in Silicon Valley with AI
  • Is AI Really Replacing Jobs in Silicon Valley?
  • The Impact of Natural Language Processing on AI

Fiverr

Fiverr Logo   

BL

Seedbl

Seedbacklink

Tags

2021 Acura Rdx Technology Package 2021 Acura Tlx Technology Package 2022 Acura Mdx Technology Package Align Technology Stock Applied Racing Technology Artificial Intelligence Technology Solutions Inc Assisted Reproductive Technology Battery Technology Stocks Benjamin Franklin Institute Of Technology Chief Technology Officer Color Star Technology Craft Design Technology Definition Of Technology Definitive Technology Speakers Element Materials Technology Health Information Technology Salary Ice Mortgage Technology Information Technology Definition Information Technology Degree Information Technology Salary Interactive Response Technology International Game Technology Lacrosse Technology Atomic Clock La Crosse Technology Weather Station Luokung Technology Stock Marvell Technology Stock Price Maytag Commercial Technology Washer Microchip Technology Stock Micron Technology Stock Price Mrna Technology History Mrna Vaccine Technology Nyc College Of Technology Penn College Of Technology Recombinant Dna Technology Rlx Technology Stock Robert Half Technology Science And Technology Sharif University Of Technology Smart Home Technology Stevens Institute Of Technology Ranking Symphony Technology Group Technology In The Classroom Technology Readiness Level Technology Stores Near Me Thaddeus Stevens College Of Technology
putritransport
theramaxx

PONDOK

gemholiday
staybeyondinn

PL

shinelia
dresssky

You may have missed

Wigs for Patients Designed for Natural Beauty
  • Technology

Wigs for Patients Designed for Natural Beauty

Lisa H. Shelton September 9, 2025
Aluminum Fence Installation Port St. Lucie: Elegance, Security, and Lasting Durability
  • Technology

Aluminum Fence Installation Port St. Lucie: Elegance, Security, and Lasting Durability

Lisa H. Shelton September 3, 2025
The Changing Job Roles in Silicon Valley with AI
  • Computer & Technology

The Changing Job Roles in Silicon Valley with AI

Lisa H. Shelton May 20, 2025
Is AI Really Replacing Jobs in Silicon Valley?
  • Computer & Technology

Is AI Really Replacing Jobs in Silicon Valley?

Lisa H. Shelton May 13, 2025
charmnailspa.com | MoreNews by AF themes.

WhatsApp us