The following article was written by my co-worker Elgar Brunott and edited by me. Elgar is a long-time data warehouse expert who joined IT-Logix AG last autumn. We’ve worked together on various projects so far and have enjoyed many exchanges of ideas since then. So it’s a pleasure to see parts of our joint work published in the subsequent blog post.
When building a new data warehouse or a data lake, one of the questions that will arise is whether you want to build a Persistent Staging Area (from now on PSA). In this blog post we will show you what a PSA is, what benefits to expect from it, typical design patterns to build it and finally an evaluation of how you can implement these patterns using a data warehouse automation toolset like the one from WhereScape.
Benefits of a Persistent Staging Area
Let’s start off with explaining what a PSA might look like:
- A database or schema where all the needed tables from one or multiple source systems are transferred one to one (no transformations allowed, but adding some columns like an insert timestamp or batch key is perfectly fine)
- In the PSA you have a historical overview of data changes in the source system. This means you can run so-called “as of” queries which return the data as it looked at a specific moment in time (depending, of course, on how frequently you load new / changed data into the PSA)
- This important feature of supporting “as of” queries comes with the requirement that you keep track of rows that are deleted in the source system and mark those in the PSA. Otherwise those “as of” queries will not return exactly the same results the source system would have returned at that time. This is a tricky requirement which we will elaborate on further down in this article.
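To make the idea concrete, here is a minimal sketch of such an “as of” query against an SCD2-style PSA table, using SQLite for illustration. The column names (valid_from, valid_to, deleted_in_source) are our assumptions for this sketch, not a fixed convention:

```python
import sqlite3

# Minimal sketch of an "as of" query against an SCD2-style PSA table.
# Column and table names are illustrative assumptions only.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE psa_customer (
        customer_id INTEGER,
        name        TEXT,
        valid_from  TEXT,   -- load timestamp when this version became current
        valid_to    TEXT,   -- '9999-12-31' while the version is still current
        deleted_in_source TEXT DEFAULT 'No'
    )""")
con.executemany(
    "INSERT INTO psa_customer VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Alice",   "2023-01-01", "2023-06-01", "No"),  # superseded version
        (1, "Alice B", "2023-06-01", "9999-12-31", "No"),  # current version
        (2, "Bob",     "2023-01-01", "2023-03-01", "No"),
        (2, "Bob",     "2023-03-01", "9999-12-31", "Yes"), # deleted in source
    ],
)

def as_of(ts):
    """Return the customer table as it looked at timestamp ts."""
    return con.execute(
        """SELECT customer_id, name FROM psa_customer
           WHERE valid_from <= ? AND ? < valid_to
             AND deleted_in_source = 'No'
           ORDER BY customer_id""",
        (ts, ts),
    ).fetchall()

print(as_of("2023-02-15"))  # [(1, 'Alice'), (2, 'Bob')]  -- the old truth
print(as_of("2023-07-01"))  # [(1, 'Alice B')]            -- Bob is deleted by now
```

Note how the second query only returns the correct result because the deleted row was explicitly flagged; this is exactly why delete detection matters.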
Having such a PSA will give you all the historical changes in data since you started loading data. This has the following benefits:
- Loading new data into the PSA can be automated to a large degree as we will show in this article. Hence it doesn’t take much effort to load (and historize) data. Now you can take your time to model your (core) data warehouse properly and load new data from the PSA once it is appropriate.
- Historical data is available in the PSA which is possibly not available anymore in the source system itself.
- Being able to respond to changing requirements over following a plan. For example, you need to massively extend one of the modules in the data warehouse because of changed business requirements. When adding the requested features you can initially load the adapted model with the historical data already available in the PSA.
- Decreasing development time and being able to run initial loads without burdening the source system with heavy data requests.
Basic Design Pattern
For implementing a PSA we separated the corresponding design patterns into basic and advanced. In the below diagram you see a basic implementation of a PSA where the following assumptions apply:
- We only load data in full, not in delta.
- For every table we handle the technical historization in the PSA (technical historization can be done in different ways, e.g. doing insert only, snapshots or SCD2. In this blog post we will not go into the details of these historization types)
- The source system will inform us in case data is physically deleted (for example with a delete flag in the source table itself).
In this diagram we took the table “customer” from source system X as an example.
Let’s explain the different objects you see in the diagram:
- src_customer: Data from the source system is transferred one to one. No business transformations are allowed. The table will always be emptied before starting a new load.
- v_src_customer_input: A view that is used as an interface for loading the data into the (persistent) PSA table. Normally this view only selects the data without adding any logic. But for some source systems it is necessary to add basic data quality checks like checking on duplicate primary keys.
- psa_customer: This table holds the historized data of the customer table. In this table you will find all the changes for every data set loaded by the ETL/ELT process, e.g. using a SCD2 pattern.
- v_psa_customer_cid: This view is used as an interface for subsequent layers, for example the core warehouse. CID stands for “current records including deletes”, that means all currently valid records (including those which have the delete flag set to Yes).
- v_psa_customer_cur: same as above, but without deleted records.
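The two interface views can be sketched as thin SQL views over the historized table. Again a minimal illustration using SQLite, with assumed column names (valid_to, deleted_in_source):

```python
import sqlite3

# Sketch of the v_psa_customer_cid / v_psa_customer_cur interface views.
# The SCD2 column names are assumptions for illustration; the real naming
# depends on your historization convention.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE psa_customer (
        customer_id INTEGER,
        name        TEXT,
        valid_to    TEXT,              -- '9999-12-31' marks the current version
        deleted_in_source TEXT
    )""")
con.executemany("INSERT INTO psa_customer VALUES (?, ?, ?, ?)", [
    (1, "Alice",   "2023-06-01", "No"),   # superseded version
    (1, "Alice B", "9999-12-31", "No"),   # current
    (2, "Bob",     "9999-12-31", "Yes"),  # current, but deleted in source
])

# CID: current records *including* those flagged as deleted
con.execute("""
    CREATE VIEW v_psa_customer_cid AS
    SELECT customer_id, name, deleted_in_source
    FROM psa_customer WHERE valid_to = '9999-12-31'
""")
# CUR: same, but without the deleted records
con.execute("""
    CREATE VIEW v_psa_customer_cur AS
    SELECT customer_id, name
    FROM v_psa_customer_cid WHERE deleted_in_source = 'No'
""")

cid = con.execute(
    "SELECT customer_id FROM v_psa_customer_cid ORDER BY customer_id").fetchall()
cur = con.execute("SELECT customer_id FROM v_psa_customer_cur").fetchall()
print(cid)  # [(1,), (2,)]  -- Bob still visible in CID
print(cur)  # [(1,)]        -- Bob filtered out in CUR
```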
As described in the assumptions above, we assume that delete detection is handled by the source system itself. And that data is loaded in full over and over again. For some projects this will work perfectly fine. But in many situations you will have the need to load the delta from big tables only. In addition, many source systems sadly do not support delete detection on their own. In such cases we will need to extend the PSA with more advanced logic as described in the next paragraph.
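To round off the basic pattern, here is a simplified sketch of one SCD2 historization step (close changed current versions, then insert new versions). All names and the SCD2 layout are assumptions for illustration; a real implementation would also handle deletes, hash-based change comparison, and so on:

```python
import sqlite3

# Hedged sketch of one SCD2 load step from src into the PSA.
# Table/column names are illustrative assumptions.
HIGH_DATE = "9999-12-31"

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src_customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
con.execute("""CREATE TABLE psa_customer (
    customer_id INTEGER, name TEXT, valid_from TEXT, valid_to TEXT)""")

def load_scd2(load_date):
    # 1. Close current versions whose attributes changed in the source.
    con.execute("""
        UPDATE psa_customer SET valid_to = :d
        WHERE valid_to = :hi AND EXISTS (
            SELECT 1 FROM src_customer s
            WHERE s.customer_id = psa_customer.customer_id
              AND s.name <> psa_customer.name)""",
        {"d": load_date, "hi": HIGH_DATE})
    # 2. Insert new versions for new or changed business keys.
    con.execute("""
        INSERT INTO psa_customer
        SELECT s.customer_id, s.name, :d, :hi FROM src_customer s
        WHERE NOT EXISTS (
            SELECT 1 FROM psa_customer p
            WHERE p.customer_id = s.customer_id AND p.valid_to = :hi)""",
        {"d": load_date, "hi": HIGH_DATE})

con.execute("INSERT INTO src_customer VALUES (1, 'Alice')")
load_scd2("2023-01-01")
con.execute("UPDATE src_customer SET name = 'Alice B'")  # attribute changed in source
load_scd2("2023-06-01")

rows = con.execute(
    "SELECT name, valid_from, valid_to FROM psa_customer ORDER BY valid_from").fetchall()
print(rows)  # old 'Alice' version is closed, 'Alice B' becomes current
```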
Advanced Design Pattern
The advanced pattern extends the basic pattern with several database objects:
Let’s have a closer look at these objects:
- v_src_customer_dct: This view handles the delete detection (dct). For tables which are loaded in full mode the view does the following – the logic for incrementally loaded tables is explained further down:
- Check all rows which are coming from the v_psa_customer_cur view and compare these with all the rows which are loaded into the src_customer table. For records which are only present in the PSA view but not in the src table anymore, we can assume they were physically deleted from the source system.
- The view will return only the missing rows and mark the field “deleted_in_source” with “Yes”.
- stg_customer_dct: The rows which are marked as deleted in the v_src_customer_dct view are temporarily stored in this staging table and loaded into the psa_customer table afterwards (with the “deleted_in_source” flag set to “Yes”)
- src_customer_bk: This table is only created for tables which are regularly loaded in delta mode. Its only goal is enabling the option to implement delete detection:
- For incrementally loaded tables you cannot detect deleted records in the source as explained above.
- Therefore we create this bk table which consists only of the business key columns of the given source table (as the PSA is very source system driven, the business key here is often the same as the primary key of the source system table, which uniquely identifies the source row).
- After loading all the data the delete detection view (v_src_customer_dct) will use the bk table as input to check if any rows were deleted in the source system.
- This bk table is loaded in full mode only. Yet loading only one or a few key columns is typically much faster than doing the same for the complete table. Proper indexing can also speed up the delete detection process.
- In cases where a full load of the bk table still takes too long, you may decide to execute this load only once a week or even less often.
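The core of the delete detection view is essentially an anti-join on the business key: rows present in the PSA’s current view but missing from the freshly loaded source data are flagged as deleted. For delta-loaded tables you would compare against the bk table instead of the src table. A minimal sketch with assumed names:

```python
import sqlite3

# Sketch of the v_src_customer_dct delete-detection logic for a full load.
# Table/column names are illustrative assumptions.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE v_psa_customer_cur (customer_id INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE TABLE src_customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
con.executemany("INSERT INTO v_psa_customer_cur VALUES (?, ?)",
                [(1, "Alice"), (2, "Bob"), (3, "Carol")])
# The new full load no longer contains Bob:
con.executemany("INSERT INTO src_customer VALUES (?, ?)",
                [(1, "Alice"), (3, "Carol")])

# Anti-join on the business key; only the "missing" rows come back, flagged.
deleted = con.execute("""
    SELECT cur.customer_id, 'Yes' AS deleted_in_source
    FROM v_psa_customer_cur cur
    LEFT JOIN src_customer src ON src.customer_id = cur.customer_id
    WHERE src.customer_id IS NULL
""").fetchall()
print(deleted)  # [(2, 'Yes')] -- Bob was physically deleted in the source
```

The flagged rows would then land in stg_customer_dct and be merged into psa_customer.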
Minimizing the PSA investment with data warehouse automation
Given the explanations above, you realize that such a PSA needs a lot of database objects and logic for each and every source table. Creating them manually results in a high investment with a lot of repetitive work. Nevertheless, from a long-term architecture perspective, building up a PSA is a very good investment, as explained in the introduction of this article. On the other hand, defending it to stakeholders can be hard because your investment in the starting phase of your project is higher without a directly visible business benefit.
This challenge can be addressed using data warehouse automation tools. Our goal was to generate a PSA including delete detection, historization and delta handling with a minimal amount of manual effort. We selected WhereScape as the right tool suite for this and developed such a solution. In WhereScape you’ll find two applications that help you generate the PSA: WhereScape 3D (3D = data driven design) and WhereScape RED.
WhereScape RED is the actual development environment. You define the metadata for every needed object (like the tables and views shown above) and RED will generate the DDL and SQL statements for you and deploy them to the target database. Even though this already speeds up the process tremendously, you still need to add every object manually.
In WhereScape 3D you start with a model of your source system’s tables. Afterwards you’ll define “model conversion rules”. In those rules you can define what should be done with every source table using a rich set of rules and features, e.g. “copy entity” to derive the necessary objects for every source table. This allows for generating all needed database objects automatically. Afterwards you can export these objects to RED and let them be deployed to your target database. Whereas RED automates the code generation, 3D “automates the automation” by feeding RED with the necessary metadata. Here the true power of data warehouse automation becomes visible. But it’s not only about the decrease in development time. The following aspects are also very important selling points for using a data warehouse automation approach:
- Quality & consistency: Because everything is generated from a specific set of design patterns, you will not have any problems caused by manually written code.
- Transparency & maintenance: As a direct consequence of the previous point, the solution is transparent and easier to maintain.
- Data lineage: Because you use a metadata driven approach, you will always have complete data lineage from the source system throughout your data warehouse.
- Documentation: Using the metadata you can automatically generate technical and user documentation for the data warehouse.
Findings & Challenges
WhereScape 3D gives you all the tools for defining all the layers of a data warehouse. But it doesn’t give you a complete set of “ready to use” model conversion rules which will do this for you (at least not for the PSA requirements described in this article). We had to implement this part ourselves.
For generating the basic version of the PSA, setting up the generation process was fairly easy. When we made the step to the advanced PSA we had several challenges which needed some out-of-the-box thinking. These were related to topics like:
- Different ETL flow for delta loads and full loads
- The PSA table gets data from two different sources (delete detection and delta data)
- The delete detection view has different logic based on delta or full load pattern
- Some technical challenges, for example data type mappings
Learning to work with 3D is hard in the beginning (we highly recommend taking some training!). But once you have overcome this first hurdle, you can set up a very powerful and flexible data warehouse automation environment. For the generation of the PSA layer we found that generating the objects this way is the fastest approach. Using data warehouse automation for the core data warehouse itself is going to be our next challenge.
The fact that WhereScape is platform independent also makes the investment very valuable. For example, you can migrate from Oracle to Azure SQL by simply choosing a different target platform. The needed ETL and database logic will then be generated in Azure-optimized SQL code.
Thanks a lot for reading this article. We hope it was informative. On many topics we purposely only scratched the surface and did not go into too much detail; otherwise the article would have become a lot longer than it already is :-).
Please leave any comments / questions below, we’d love to hear from you!