Cloud Data Warehouse Landing Zone Discussion

Introduction

In this article I want to discuss some different layout options for a landing zone in a modern cloud data warehouse architecture. By landing zone, I mean a storage account where raw data lands directly from its source system (not to be confused with a landing zone used to move a system or application into the cloud).

Traditional data warehouse

In a traditional data warehouse, a typical high-level architecture looks like this:

Traditional DWH Architecture Overview

There are a lot of organizational aspects which vary from company to company. Every source system owner may be responsible for preparing data exports and making them available in a shared folder over the company network, or for pushing them into an FTP folder. In other cases, the DWH team is able to access the data directly in the source system database or API and pull it into the staging area. Hard data quality checks are implemented in the interface between the staging area and the DWH (duplicates, missing keys, null values, time zones, datetime formats, etc.). The data is then integrated into a highly normalized DWH model, which serves as the single source of truth for the whole organization. Data marts are built from the central DWH. The main reason is to have a star schema, which is more convenient and intuitive and shows better performance for reporting tools and business analysts. There are some variations to this architecture, including data vault, or skipping the highly normalized model and creating data marts with conformed dimensions directly from the staging area. These topics are out of scope for this article.

The staging area

The staging area is of particular interest since it has undergone various modifications in modern (or rather, recent) data warehouse architectures and generates a lot of discussion.

Typically, a logical division is implemented by source system. I’ve seen different variations, but you can have a single database called “Staging” and a schema per source system (or multiple databases, or one database with one schema and table prefixes; I don’t want to open this discussion 😉):

Data Warehouse Staging Area

At this stage a discussion with business stakeholders must take place to document the definition of every business entity, i.e., what a customer or a product is, and how to conform it across the whole enterprise. This is a fundamental activity to design the data model and implement important concepts like conformed dimensions. Remember, the DWH model will contain a customer dimension table which is valid for finance, CRM, retail, and all other departments.

Hadoop-like file systems – The early days

One of the first use cases for data lakes I ever implemented was to replace the staging database(s) with a cheaper, highly redundant, and scalable storage system: Hadoop. Technologies like PolyBase or Hive allowed me to successfully deliver Hadoop-based solutions and implement the cold storage/hot storage concept.

Hadoop-like file system as staging area

Cloud native DWH architectures and the data lakehouse debut

Traditional data warehouse concepts like dimensional modelling have not changed and are fundamental for a solid and long-lasting DWH implementation. However, the predilection for cloud-first solutions and the introduction of new technologies like the Delta format have reshaped what a DWH project looks like.

I’ve participated in never-ending discussions around what a landing zone should look like and how many layers are required before the DWH core model.

Minimalistic approach

Some experienced folks opt for a minimalistic approach: they only want to use a landing zone to park all source system data and then, using tools like Databricks notebooks, program all required transformations to bring the data into the DWH. These kinds of solutions look like this:

Minimalistic Cloud DWH Architecture

Even though I’m a huge fan of minimalism, the solution above comes with several caveats, and I DO NOT recommend it for serious projects.

Multi-layered landing zone

One important aspect of a data solution is the ability to explore, troubleshoot and debug delivered data. Having different data file formats in a landing zone is the first obstacle to performing these tasks easily.

The first change here is to introduce a second layer into the landing zone to normalize the data formats. You could use Parquet, since it provides a schema, is storage efficient, and is a well-accepted standard. However, I would use Delta, since it additionally provides ACID transactions and time travel capabilities (and we are still using an open standard):

Landing zone layout with a normalized layer
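As a rough sketch of this format normalization, assuming PySpark with Delta Lake on Databricks and invented storage paths and column names, landing a raw CSV export as a Delta table could look like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In a Databricks notebook the SparkSession already exists as `spark`
spark = SparkSession.builder.getOrCreate()

# Hypothetical landing-zone paths: one folder per source system and export
raw_path = "abfss://landing@mydatalake.dfs.core.windows.net/crm/customers/2024-01-15/"
normalized_path = "abfss://normalized@mydatalake.dfs.core.windows.net/crm/customers/"

# Read the raw export (a CSV here; JSON or other formats are handled the same way)
raw_df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv(raw_path)
)

# Write it to the normalized layer as a Delta table, with basic load metadata,
# so every source system ends up in the same open format
(
    raw_df
    .withColumn("_ingested_at", F.current_timestamp())
    .write
    .format("delta")
    .mode("append")
    .save(normalized_path)
)
```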

The normalized source system layer now looks quite similar to a traditional staging area, and we can map it one-to-one to the bronze layer in the medallion architecture. In this layer, data is loaded incrementally, that is, only new, updated, or deleted records should be processed. The resulting Delta tables should include a watermark (a timestamp column) to support the incremental load and other required metadata columns. Hard data quality rules are also applied in this layer: duplicate cleansing, NULL value handling, datetime formats, time zones, datatype unification, etc.
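A minimal sketch of such an incremental load with hard data quality rules, again assuming PySpark with Delta Lake and made-up table and column names (the watermark here is the `_loaded_at` column):

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Continues the previous sketch: `spark` is the active session, paths are invented
bronze_path = "abfss://normalized@mydatalake.dfs.core.windows.net/crm/customers/"
latest_export = "abfss://landing@mydatalake.dfs.core.windows.net/crm/customers/2024-01-16/"

# Hard data quality rules on the incoming batch: deduplicate on the business key,
# drop records without a key, unify the time zone, and add a watermark column
incoming = (
    spark.read.option("header", True).csv(latest_export)
    .dropDuplicates(["customer_id"])
    .filter(F.col("customer_id").isNotNull())
    .withColumn("created_at", F.to_utc_timestamp("created_at", "Europe/Berlin"))
    .withColumn("_loaded_at", F.current_timestamp())
)

# Incremental load: only new or changed records are merged into the bronze table
# (handling deletes would additionally require a delete flag from the source)
bronze = DeltaTable.forPath(spark, bronze_path)
(
    bronze.alias("t")
    .merge(incoming.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

The `_loaded_at` watermark makes it possible to see which records were processed in a given run, which also helps with the troubleshooting and reload scenarios discussed later.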

The next layer can be controversial, and you can implement it or not, depending on your overall architecture, your use cases, and the scope of the project(s). If you want to keep a traditional DWH architecture, then you can use the normalized landing zone as the classical staging area and load the data into the target data model.

Normalized layer as staging area

If you have a broader ecosystem, where data is consumed not only by business intelligence tools, but also by data science teams, other enterprise applications, and power users, for example, data analysts who would like to build their own reports, then you can build the next layer still in the data lake (represented as a cloud storage account in the diagrams). In this layer the source system segregation ends, and it contains an enterprise-wide view. It can be mapped to the silver layer of the medallion architecture.
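To illustrate (all table and column names are invented), building a conformed customer entity in this silver layer can boil down to harmonizing and unioning the bronze tables of the individual source systems:

```python
from pyspark.sql import functions as F

# Bronze tables are still segregated by source system (hypothetical mount paths)
crm_customers = spark.read.format("delta").load("/mnt/datalake/bronze/crm/customers/")
erp_customers = spark.read.format("delta").load("/mnt/datalake/bronze/erp/customers/")

# Harmonize both sources into one enterprise-wide customer schema
conformed_customers = (
    crm_customers.select(
        F.col("customer_id").cast("string").alias("source_customer_id"),
        F.lit("CRM").alias("source_system"),
        F.col("full_name").alias("customer_name"),
        F.col("country_code"),
    )
    .unionByName(
        erp_customers.select(
            F.col("cust_no").cast("string").alias("source_customer_id"),
            F.lit("ERP").alias("source_system"),
            F.concat_ws(" ", "first_name", "last_name").alias("customer_name"),
            F.col("country_code"),
        )
    )
)

# In the silver layer the source system segregation is gone
conformed_customers.write.format("delta").mode("overwrite").save("/mnt/datalake/silver/customer/")
```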

This architecture offers a lot of flexibility, and thanks to an extensive tool set you can build different types of applications around it:

  • Data scientists can copy versions of tables in the silver layer to experiment on their own environments.
  • You can create a golden layer containing department specific views (including their own business logic) or provide another level of aggregation.
  • You can build a view layer on top of the silver layer using technologies that allow it; this view layer can be consumed by business analysts or BI tools (see the sketch after this list).
  • You can bring the data (similar to the golden layer) into a relational database and build your data marts from there. You may want to use an MPP architecture if the amount of data requires it.
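As a sketch of the golden layer and view layer ideas above, assuming Spark SQL on Databricks and hypothetical paths, schemas, and column names, a department-specific golden table plus a view on top of it could look like this:

```python
from pyspark.sql import functions as F

silver_sales = spark.read.format("delta").load("/mnt/datalake/silver/sales/")

# Department-specific golden table: monthly revenue per customer for the finance team
finance_monthly_revenue = (
    silver_sales
    .groupBy("customer_key", F.date_trunc("month", "order_date").alias("order_month"))
    .agg(F.sum("net_amount").alias("monthly_revenue"))
)

(
    finance_monthly_revenue.write
    .format("delta")
    .mode("overwrite")
    .save("/mnt/datalake/gold/finance/monthly_revenue/")
)

# A simple view layer on top for business analysts and BI tools
spark.sql("CREATE SCHEMA IF NOT EXISTS finance")
spark.sql("""
    CREATE OR REPLACE VIEW finance.v_monthly_revenue AS
    SELECT customer_key, order_month, monthly_revenue
    FROM delta.`/mnt/datalake/gold/finance/monthly_revenue/`
""")
```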

I have intentionally avoided naming cloud providers/software vendors throughout the article, but here is a diagram showing how I would build it in Azure:

Other technical considerations

Source systems don’t always deliver data as you need

You may want to have a list of products, customers and their transactions from a source system. It could be the case that you only get a denormalized dataset with all transactions. You may need to extract the customers and products out of it and combine them with the customers and products coming from other source systems. That speaks in favour of a multi-layer landing zone.
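A small sketch of this entity extraction, with invented paths and column names:

```python
# Denormalized export: one row per transaction with embedded customer and product attributes
transactions = spark.read.format("delta").load("/mnt/datalake/bronze/retail/transactions/")

# Derive the entities the source system does not deliver separately
customers = (
    transactions
    .select("customer_id", "customer_name", "customer_country")
    .dropDuplicates(["customer_id"])
)

products = (
    transactions
    .select("product_id", "product_name", "product_category")
    .dropDuplicates(["product_id"])
)

# These derived tables can later be conformed with customers/products from other sources
customers.write.format("delta").mode("overwrite").save("/mnt/datalake/silver/retail/customers/")
products.write.format("delta").mode("overwrite").save("/mnt/datalake/silver/retail/products/")
```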

Troubleshooting and debugging

It is easier to identify data quality problems and to troubleshoot issues if you implement different layers. With a minimalistic approach some issues can be hidden and remain difficult to debug.

Data reloads

Suppose the end users find some errors in the business reports. You discover that some columns in the incident table have weird values. You call your colleague Markus from the CRM team, and it turns out he unintentionally mapped some columns to the wrong fields and will correct them in the next export. If you have a minimalistic all-in-one approach, you will need to reload everything. With different layers and source system segregation you are able to reload only what you need, saving time and computational costs.
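Such a targeted reload could be as simple as re-merging only the corrected CRM export into its bronze table, leaving every other source system and layer untouched (paths and key columns are invented):

```python
from delta.tables import DeltaTable

# Re-read only the corrected CRM export
corrected = (
    spark.read.option("header", True)
    .csv("/mnt/datalake/landing/crm/incidents/2024-01-20_corrected/")
)

# Merge the corrected records into the CRM incident bronze table only
incidents = DeltaTable.forPath(spark, "/mnt/datalake/bronze/crm/incidents/")
(
    incidents.alias("t")
    .merge(corrected.alias("s"), "t.incident_id = s.incident_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```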

Final disclaimer

  • Some architectures presented in this article are not as cost-effective as just using an orchestration tool (like Azure Data Factory) and implementing data transformations in SQL embedded in stored procedures.
  • Processing data in a data lake requires the usage of technologies like Spark, which is intended for big data and, depending on your case, could add unjustifiable overhead if your data volumes are small.
  • Data governance and access management become more complex every time you add a new layer to the architecture.
  • Additional data layers mean additional data movement, which introduces a higher probability of data quality issues and programming errors.

You should always keep your requirements on top of the technical solution and decide which architecture serves your project better.

Conclusion

In this article I described different approaches to implementing a cloud data warehouse, especially the landing zone/staging area, and tried to depict how I arrived at them. There is no “one size fits all” design, and you must take into consideration the size of your project, your key requirements and priorities, your budget, team expertise, and other business and financial constraints.

I hope you liked the article, and I would love to read what you think in the comments section.
