Generating a Persistent Staging Area using Data Warehouse Automation

When building a new data warehouse or a data lake, one of the questions that will arise is whether you want to build a Persistent Staging Area (from now on PSA). In this blog post we will show you what a PSA is, what benefits to expect from it, typical design patterns to build it and finally an evaluation of how you can implement these patterns using a data warehouse automation toolset like the one of WhereScape.

The following article was written by my co-worker Elgar Brunott and edited by myself. Elgar is a long-time data warehouse expert who joined IT-Logix AG last autumn. We’ve worked together on various projects so far and have enjoyed a lot of thought exchange since then. So it’s a pleasure to see parts of our joint work published in the subsequent blog post.

Benefits of a Persistent Staging Area

Let’s start off with explaining what a PSA might look like:

DWH_Architecture

  • A database or schema where all the needed tables from one or multiple source systems are transferred one to one (no transformations allowed, but adding some columns like an insert timestamp or a batch key is perfectly fine).
  • In the PSA you have a historical overview of data changes in the source system. This means you can run so-called “as of” queries which return the data as it looked at a specific moment, depending of course on how frequently you load new or changed data into the PSA (a small query sketch follows right after this list).
  • This important feature of supporting “as of” queries comes with the requirement that you have to keep track of rows that are deleted in the source system and mark them in the PSA. Otherwise those “as of” queries would not return exactly the same results. This is a tricky requirement which we will elaborate on further down in this article.
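
To make the “as of” idea a bit more concrete, here is a minimal sketch of such a query against a hypothetical PSA table with SCD2-style validity columns. The table and column names (psa_customer, dss_start_datetime, dss_end_datetime, deleted_in_source) are assumptions for illustration only, not generated code:

    -- "As of" query: return the customer data exactly as it looked on 2017-06-30,
    -- excluding rows that had already been deleted in the source at that point in time.
    SELECT *
    FROM   psa_customer
    WHERE  dss_start_datetime <= TIMESTAMP '2017-06-30 23:59:59'  -- version was already valid
    AND    dss_end_datetime   >  TIMESTAMP '2017-06-30 23:59:59'  -- ...and not yet superseded
    AND    deleted_in_source  =  'No';                            -- row still existed in the source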

Having such a PSA will give you all the historical changes in data since you started loading data. This has the following benefits:

  • Loading new data into the PSA can be automated to a large degree as we will show in this article. Hence it doesn’t take much effort to load (and historize) data. Now you can take your time to model your (core) data warehouse properly and load new data from the PSA once it is appropriate.
  • Historical data is available in the PSA which is possibly not available anymore in the source system itself.
  • Being able to respond to changing requirements over following a plan. For example, you may need to massively extend one of the modules in the data warehouse because of changed business requirements. When implementing these requirements, you can initially load the adapted model with the historical data already available in the PSA.
  • Decreasing development time and being able to run initial loads without burdening the source system with heavy data requests.

Basic Design Pattern

For implementing a PSA we separated the corresponding design patterns into a basic and an advanced one. In the diagram below you see a basic implementation of a PSA where the following assumptions apply:

  • We only load data in full, not in delta.
  • For every table we handle the technical historization in the PSA (technical historization can be done in different ways, e.g. insert only, snapshots or SCD2; in this blog post we will not go into the details of these historization types).
  • The source system will inform us in case data is physically deleted (for example with a delete flag in the source table itself).

PSA Basic

In this diagram we took the table “customer” from source system X as an example.

Let’s explain the different objects you see in the diagram:

  • src_customer: Data from the source system is transferred one to one. No business transformations are allowed. The table will always be emptied before starting a new load.
  • v_src_customer_input: A view that is used as an interface for loading the data into the (persistent) PSA table. Normally this view only selects the data without adding any logic. But for some source systems it is necessary to add basic data quality checks like checking on duplicate primary keys.
  • psa_customer: This table holds the historized data of the customer table. In this table you will find all the changes for every data set loaded by the ETL/ELT process, e.g. using a SCD2 pattern.
  • v_psa_customer_cid: This view is used as an interface for subsequent layers, for example the core warehouse. CID stands for “current records including deletes”, i.e. all currently valid records (including those whose delete flag is set to Yes).
  • v_psa_customer_cur: Same as above, but without the deleted records (a SQL sketch of both views follows right after this list).
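
As an illustration, the following SQL sketches what the two interface views could look like on top of an SCD2-style psa_customer table. The column names dss_current_flag and deleted_in_source are assumptions; the code generated by an automation tool will differ in detail:

    -- "cid": the latest version of every customer, regardless of whether
    -- the record still exists in the source system.
    CREATE VIEW v_psa_customer_cid AS
    SELECT *
    FROM   psa_customer
    WHERE  dss_current_flag = 'Y';

    -- "cur": same as above, but records deleted in the source are filtered out.
    CREATE VIEW v_psa_customer_cur AS
    SELECT *
    FROM   psa_customer
    WHERE  dss_current_flag  = 'Y'
    AND    deleted_in_source = 'No';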

As described in the assumptions above, we assume that delete detection is handled by the source system itself and that data is loaded in full over and over again. For some projects this will work perfectly fine. But in many situations you will need to load only the delta from big tables. In addition, many source systems sadly do not support delete detection on their own. In such cases we will need to extend the PSA with more advanced logic as described in the next paragraph.

Advanced Design Pattern

The advanced pattern extends the basic pattern with several database objects:

PSA Advanced

Let’s have a closer look at these objects:

  • v_src_customer_dct: This view handles the delete detection (dct). For tables which are loaded in full mode the view does the following – the logic for incrementally loaded tables is explained further down:
    • Check all rows coming from the v_psa_customer_cur view and compare them with all the rows loaded into the src_customer table. Records which are only present in the PSA view but no longer in the src table can be assumed to have been physically deleted from the source system.
    • The view will return only the missing rows and mark the field “deleted_in_source” with “Yes”.
  • stg_customer_dct: The rows which are marked as deleted in the v_src_customer_dct view are temporarily stored in this staging table and loaded into the psa_customer table afterwards (with the “deleted_in_source” flag set to “Yes”).
  • src_customer_bk: This table is only created for tables which are regularly loaded in delta mode. Its only purpose is to enable delete detection:
    • For incrementally loaded tables you cannot detect deleted records in the source as explained above.
    • Therefore we create this bk table which consists only of the business key columns of the given source table (as the PSA is very source-system driven, the business key here is often the same as the primary key of the source table, which uniquely identifies the source row).
    • After loading all the data, the delete detection view (v_src_customer_dct) will use the bk table as input to check whether any rows were deleted in the source system (a simplified sketch of this view follows right after this list).
    • This bk table is loaded in full mode only. Yet loading only one or a few columns is typically much faster than doing the same for the complete table. Proper indexing can also speed up the delete detection process.
    • In cases where a full load of the bk table still takes too long, you may decide to execute this load only once a week or even less often.
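
To illustrate the delete detection logic, here is a simplified SQL sketch of what v_src_customer_dct could look like for a full load; for incrementally loaded tables, src_customer would be replaced by src_customer_bk. The single-column business key customer_id is an assumption for illustration:

    -- Business keys that are still current in the PSA but no longer present in the
    -- source extract are assumed to have been physically deleted in the source system.
    CREATE VIEW v_src_customer_dct AS
    SELECT cur.customer_id,
           'Yes' AS deleted_in_source
    FROM   v_psa_customer_cur cur
    WHERE  NOT EXISTS (
             SELECT 1
             FROM   src_customer src              -- for delta loads: src_customer_bk
             WHERE  src.customer_id = cur.customer_id
           );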

Minimizing the PSA investment with data warehouse automation

Given the explanations above, you realize that such a PSA needs a lot of database objects and logic for each and every source table. Creating them manually results in a high investment with a lot of repetitive work. Nevertheless, from a long-term architecture perspective, building up a PSA is a very good investment, as explained in the introduction of this article. On the other hand, defending it to stakeholders can be hard because the investment in the starting phase of your project is higher without a directly visible business benefit.

This challenge can be addressed using data warehouse automation tools. Our goal was to generate a PSA including delete detection, historization and delta handling with a minimal amount of manual effort. We selected WhereScape as the right tool suite for this and developed such a solution. In WhereScape you’ll find two applications that help you generate the PSA: WhereScape 3D (3D = data driven design) and WhereScape RED.

WhereScape RED is the actual development environment. You define the metadata for every needed object (like the tables and views shown above) and RED will generate the DDL and SQL statements for you and deploy them to the target database. Even though this already speeds up the process tremendously, you still need to add every object manually.

In WhereScape 3D you start with a model of your source system’s tables. Afterwards you’ll define “model conversion rules”. In those rules you can define what should be done with every source table using a rich set of rules and features, e.g. “copy entity” to derive the necessary objects for every source table. This allows for generating all needed database objects automatically. Afterwards you can export these objects to RED and let them be deployed to your target database. Whereas RED automates the code generation, 3D “automates the automation” by feeding RED with the necessary metadata. Here the true power of data warehouse automation becomes visible. But it’s not only about the decrease in development time. The following aspects are also very important selling points for using a data warehouse automation approach:

  • Quality & consistency: While everything is generated using a specific number of design patterns, you will not have any problems caused by manually written code.
  • Transparency & maintenance: As a direct consequence of the previous point, the solution is transparent and easier to maintain.
  • Data lineage: While using a metadata driven approach, you will always have a complete data lineage from the source system throughout your data warehouse.
  • Documentation: Using the metadata you can automatically generate technical and user documentation fo the data warehouse.

Findings & Challenges

WhereScape 3D gives you all the tools for defining all the layers of a data warehouse. But it doesn’t give you a complete set of “ready to use” model conversion rules which will do this for you (at least not for the PSA requirements described in this article). We had to implement this part ourselves.

For generating the basic version of the PSA, setting up the generation process was fairly easy. When we made the step to the advanced PSA, we faced several challenges which needed some out-of-the-box thinking. These were related to topics like:

  • Different ETL flow for delta loads and full loads
  • The PSA table gets data from two different sources (delete detection and delta data)
  • The delete detection view has different logic based on the delta or full load pattern
  • Some technical challenges, for example data type mappings

Learning to work with 3D is hard in the beginning (we highly recommend taking some training!). But once you have overcome this first hurdle, you can set up a very powerful and flexible data warehouse automation environment. For the PSA layer we found that generating the objects is by far the fastest way. Using data warehouse automation for the core data warehouse itself is going to be our next challenge.

The fact that WhereScape is platform independent also makes the investment very valuable. For example, you can migrate from Oracle to Azure SQL by simply choosing a different target platform. The needed ETL and database logic will then be generated as Azure-optimized SQL code.

Thanks a lot for reading this article. We hope it was informative. On many topics we purposely only scratched the surface and did not go into too much detail; otherwise the article would have become a lot longer than it already is :-).

Please leave any comments / questions below, we’d love to hear from you!

My life as a BI consultant: Update Spring 2017

Spring 2017 in Provence (France)

Obviously I haven’t had time to write much on my blog during the last nine months. Let me share with you what topics kept me busy:

For the upcoming months I’ll be visiting and speaking at various events:

  • IBCS Annual Conference Barcelona: June 2nd, Discussion of the next version of the International Business Communication Standards.
  • TDWI 2017 Munich: June 26-28, Half-day workshop about practical gathering of requirements for a dashboard.
  • MAKEBI 2017 Zurich: July 3rd, I’ll be presenting a new keynote around alternatives to traditional estimation practices
  • BOAK 2017 Zurich: September 12th, same as with MAKEBI, I’ll be presenting a new keynote around alternatives to traditional estimation practices
  • WhereScape Test Drive / AgileBI introduction Zurich: September 13th: During this practical half-day workshop you learn hands-on how to use a DWH automation tool and you’ll get an introduction to the basics of Agile BI.
  • My personal highlight as of today: I’ll be speaking at Agile Testing Days 2017, where I’ll run a 2.5-hour workshop on introducing Agile BI in a sustainable way.

It would be a pleasure to meet you during one of these events – in case you’ll join, send me a little heads-up!

Last but not least, let me mention the Scrum Breakfast Club which I’m visiting on a regular basis. We gather once a month using the Open Space format to discuss practical issues around the application of agile methods in all kinds of projects, including Business Intelligence and Data Warehousing. The Club has chapters in Zurich and Bern as well as in Milan and Lisbon.

Teradata & WhereScape Test Environment in the Cloud

In this post I outline how I managed to get a cloud-based training environment ready in which WhereScape RED, a data warehouse automation tool, connects to a Teradata database test machine.

A few weeks ago I had to organize a so-called “test drive” for a local WhereScape prospect. The prospect uses a Teradata database appliance, hence they wanted to evaluate WhereScape RED on Teradata as well. As the local Swiss WhereScape partner we received a virtual machine containing a SQL Server based WhereScape RED environment. The training had to be run onsite at the customer’s location; IT-Logix provided its set of training laptops, each with 4GB of RAM. These were my starting conditions.

First of all I thought about how to deal with Teradata for a training setup. Fortunately, Teradata provides a set of preconfigured VMs here. You can easily download them as zipped files and run them using the free VM Player.

Based on my previous experience with organizing hands-on sessions, e.g. during our local Swiss SAP BusinessObjects user group events, I wanted to use Cloudshare. This makes it much easier (and faster!) to clone an environment for multiple training participants compared to copying tons of gigabytes to multiple laptops. In addition, 4GB of RAM wouldn’t have been enough to run Teradata and WhereScape in a performant way. So I had two base VMs (one from WhereScape, one from Teradata) – a perfect use case to try the VM upload feature in Cloudshare for the first time.

I started with this support note which explains how to prepare your local VM and upload it to your Cloudshare FTP folder. From there you can simply add it to an environment:

01_UploadVM1

After having uploaded both VMs it looks like this in Cloudshare:

02_CloudshareEnvironment

I increased the RAM and CPU power a bit and, more importantly, configured the network between the two machines:

Go to “Edit Environment” -> “Edit Networks”:

03_NetworkSettings

Here I had to specify to which virtual network I wanted to connect the VMs. Please keep in mind that this doesn’t provide an automatic DHCP server or similar. Either you create one within your machines or – as in my case – set static IPs within the individual VMs (both were delivered with a dynamic IP provided by the VM Player). Changing the IP wasn’t a big deal, neither on Windows nor on Linux.

04_TD_Setting1

But I quickly found out that the Teradata service didn’t run properly anymore afterwards.

First of all I had to create a simple test case to check whether I could connect from the WhereScape VM to the Teradata machine. Besides a simple ping (which worked) I installed the Teradata Tools & Utilities on the WhereScape machine. As I couldn’t establish a proper connection, I had to google a bit. The following article gave me the hint to add a “cop” entry to the hosts file:

04_TD_Setting2
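
For illustration, such a “cop” entry in the client machine’s hosts file maps the Teradata system name plus a “cop1” suffix to the Teradata VM’s static IP address; the IP and system name below are made up:

    # /etc/hosts on Linux, C:\Windows\System32\drivers\etc\hosts on Windows
    192.168.1.10    tdvm    tdvmcop1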

After a restart of the machine, Teradata was up and running again. By the way, you can verify this with the command “pdestate -a”:

04_TD_Setting3

The next step in WhereScape was to create a new metadata repository on the Teradata database. For this I created a new schema and user in Teradata first and then created the metadata repository using the WhereScape Administrator:

06_WhereScapeSetup
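
For reference, creating such a schema and user boils down to a couple of SQL statements on the Teradata side; the name, password and space quotas below are made-up examples, not WhereScape requirements:

    -- Dedicated user/database to hold the WhereScape metadata repository.
    CREATE USER ws_meta FROM dbc AS
        PERMANENT = 2e9 BYTES,     -- permanent space for the metadata tables
        SPOOL     = 2e9 BYTES,
        PASSWORD  = ws_meta_pwd;

    -- Allow the user to create and modify objects in its own database.
    GRANT ALL ON ws_meta TO ws_meta;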

In WhereScape RED I created a connection to point to the new Teradata database:

05_WhereScapeConnection

… and finally loaded a few tables from the SQL Server to Teradata:

07_WhereScape_Data

Once I had finished the work, the most important step was to create a snapshot:

08_Snapshot

Based on this snapshot I finally cloned the environment for the number of test drive participants with just a few clicks. In the end, every participant had their own (isolated) environment consisting of a full stack: source database (SQL Server), WhereScape and the target DWH database (Teradata).

Agile Business Intelligence Maturity Model

As outlined in my previous blog post, agility in business intelligence projects can’t be produced directly. Instead you should invest in professionalism, standardization and automation. In this post I show an overview of concrete building blocks to support you on this path.

In my Agile Business Intelligence Maturity Model (ABIMM) I’ve collected many building blocks and arranged them in a practical sequence. You can find an overview in the following illustration:

Agile Business Intelligence Maturity Model

We can extract a few key messages from this model:

  1. You can’t increase agility directly – you can only reduce the amount of needed upfront design. By doing this, agility increases automatically.
  2. A reduction of upfront design inevitably leads to higher risks – risks you need to deal with actively, e.g. by using a version control system or solutions for test automation (cf. my blog post here). As long as such basic infrastructure elements aren’t available you should be very cautious about introducing iterative, incremental procedures like Scrum. (A very illustrative presentation about “Agility requires Safety” can be found here.)
  3. All beginnings are difficult: The building block “Agile Basics & Mindset” represents an enormous hurdle in many cases. As long as an organization doesn’t experience a top-down transformation towards agile values and principles (cf. e.g. the Agile Manifesto), it doesn’t make much sense to start with it bottom-up.
  4. The gulf can be overcome by buying the necessary tools for test automation and version control and by training employees. This can typically happen within the boundaries of the already existing infrastructure. But to overcome the chasm, today’s often heterogeneous, multi-layered BI tool landscapes aren’t well suited. That’s one reason why I’ve become a big fan of data warehouse automation and tools like WhereScape. Products like WhereScape RED institutionalize the usage of design patterns in an integrated development environment. This is what makes refactoring at the level of the data model – and hence iterative data modeling – feasible with realistic effort. At the same time tools like WhereScape provide an ultra-high degree of automation for the deployment of new and changed artefacts.

A more detailed explanation of the Agile BI Maturity Model can be found in my recent article in the German TDWI journal “BI-Spektrum“.

Here you find the English translation of my article!

Many thanks to my company IT-Logix and all the great staff working with me. You are the indispensable foundation for all my BI related work. Don’t forget, you can hire me as a consultant 😉

Below you’ll find the literature on which the different building blocks and the model itself are based:

[AmL12] Ambler Scott W., Lines Mark: Disciplined Agile Delivery: A Practitioner’s Guide to Agile Software Delivery in the Enterprise, IBM Press, 2012

[AmS06] Ambler Scott W., Sadalage Pramod J.: Refactoring Databases: Evolutionary Database Design, Addison-Wesley Professional, 2006

[Bel] Belshee Arlo: Agile Engineering Fluency http://arlobelshee.github.io/AgileEngineeringFluency/Stages_of_practice_map.html

[BiM] Memorandum für Agile Business Intelligence: http://www.tdwi.eu/wissen/agile-bi/memorandum/

[Col12] Collier Ken: Agile Analytics, Addison-Wesley, 2012

[CoS11] Corr Lawrence, Stagnitto Jim: Agile Data Warehouse Design: Collaborative Dimensional Modeling, from Whiteboard to Star Schema, DecisionOne Press, 2011

[Hug12] Hughes Ralph: Agile Data Warehousing Project Management: Business Intelligence Systems Using Scrum, Morgan Kaufmann, 2012

[HuR09] Humble Jez, Russell Rolf: The Agile Maturity Model – Applied to Building and Releasing Software, http://info.thoughtworks.com/rs/thoughtworks2/images/agile_maturity_model.pdf, 2009

[Kra14] Krawatzeck Robert, Zimmer Michael, Trahasch Stephan, Gansor Tom: Agile BI ist in der Praxis angekommen, in: BI-SPEKTRUM 04/2014

[Sch13] Schweigert Tomas, Vohwinkel Detlef, Korsaa Morten, Nevalainen Risto, Biro Miklos: Agile maturity model: analysing agile maturity characteristics from the SPICE perspective, in Journal of Software: Evolution and Process, 2013 (http://www.sqs.com/de/_download/agile_maturity_wiley_2013_final.pdf)

Parts of this blog post were first published in German here.