Generating a Persistent Staging Area using Data Warehouse Automation

When building a new data warehouse or a data lake, one of the questions that will arise is whether you want to build a Persistent Staging Area (from now on PSA). In this blog post we will show you what a PSA is, what benefits to expect from it, typical design patterns to build it and finally an evaluation of how you can implement these patterns using a data warehouse automation toolset like the one of WhereScape.

The following article was written by my co-worker Elgar Brunott and edited by me. Elgar is a long-time data warehouse expert who joined IT-Logix AG last autumn. We’ve worked together on various projects so far and have enjoyed many exchanges of ideas since then. So it’s a pleasure to see parts of our joint work published in the subsequent blog post.

Benefits of a Persistent Staging Area

Let’s start off with explaining what a PSA might look like:

DWH_Architecture

  • A database or schema where all the needed tables from one or multiple source systems are transferred one to one (no transformations allowed, but adding some columns like an insert timestamp or a batch key is perfectly fine)
  • In the PSA you have a historical overview of data changes in the source system. This means you can run so-called “as of” queries which return the data as it looked at a specific moment (depending, of course, on how frequently you load new / changed data into the PSA)
  • This important feature of supporting “as of” queries comes with the requirement that you keep track of rows that are deleted in the source system and mark them in the PSA. Otherwise those “as of” queries will not return exactly the same results. This is a tricky requirement which we will elaborate on further down in this article.
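To make the idea of an “as of” query concrete, here is a minimal sketch in Python (not WhereScape-generated code). It assumes an SCD2-style PSA table where each row version carries `valid_from` / `valid_to` timestamps and a `deleted_in_source` flag; all table and column names are illustrative:

```python
from datetime import datetime

# Illustrative PSA table content: two versions of customer 1 and one
# customer that was physically deleted in the source system.
psa_customer = [
    {"customer_id": 1, "name": "Acme",      "valid_from": datetime(2023, 1, 1),
     "valid_to": datetime(2023, 6, 1),   "deleted_in_source": "No"},
    {"customer_id": 1, "name": "Acme Ltd.", "valid_from": datetime(2023, 6, 1),
     "valid_to": datetime(9999, 12, 31), "deleted_in_source": "No"},
    {"customer_id": 2, "name": "Globex",    "valid_from": datetime(2023, 1, 1),
     "valid_to": datetime(9999, 12, 31), "deleted_in_source": "Yes"},
]

def as_of(rows, ts, include_deleted=True):
    """Return the rows as they looked at timestamp ts (an "as of" query)."""
    result = [r for r in rows if r["valid_from"] <= ts < r["valid_to"]]
    if not include_deleted:
        result = [r for r in result if r["deleted_in_source"] == "No"]
    return result

# State of the customer table as of 1 March 2023:
march_rows = as_of(psa_customer, datetime(2023, 3, 1))
```

In a real PSA the same filter would of course be expressed in SQL against the historized table; the point is that without the `deleted_in_source` flag, deleted rows would wrongly appear in every “as of” result.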

Having such a PSA will give you all the historical changes in data since you started loading data. This has the following benefits:

  • Loading new data into the PSA can be automated to a large degree as we will show in this article. Hence it doesn’t take much effort to load (and historize) data. Now you can take your time to model your (core) data warehouse properly and load new data from the PSA once it is appropriate.
  • Historical data is available in the PSA which is possibly not available anymore in the source system itself.
  • Being able to respond to changing requirements over following a plan. For example, suppose you need to massively extend one of the modules in the data warehouse because of changed business requirements. When implementing the new requirements you can initially load the adapted model with the historical data already available in the PSA.
  • Decreasing the development time and being able to initially load without burdening the source system with heavy data requests.

Basic Design Pattern

For implementing a PSA we separated the corresponding design patterns into basic and advanced. In the diagram below you see a basic implementation of a PSA where the following assumptions apply:

  • We only load data in full, not in delta.
  • For every table we handle the technical historization in the PSA (technical historization can be done in different ways, e.g. doing insert only, snapshots or SCD2. In this blog post we will not go into the details of these historization types)
  • The source system will inform us in case data is physically deleted (for example with a delete flag in the source table itself).

PSA Basic

In this diagram we took the table “customer” from source system X as an example.

Let’s explain the different objects you see in the diagram:

  • src_customer: Data from the source system is transferred one to one. No business transformations are allowed. The table will always be emptied before starting a new load.
  • v_src_customer_input: A view that is used as an interface for loading the data into the (persistent) PSA table. Normally this view only selects the data without adding any logic. But for some source systems it is necessary to add basic data quality checks like checking on duplicate primary keys.
  • psa_customer: This table holds the historized data of the customer table. In this table you will find all the changes for every data set loaded by the ETL/ELT process, e.g. using a SCD2 pattern.
  • v_psa_customer_cid: This view is used as an interface for subsequent layers, for example the core warehouse. CID stands for “current records including deletes”, meaning all currently valid records (including those which have the delete flag set to Yes).
  • v_psa_customer_cur: same as above, but without deleted records.
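The load from v_src_customer_input into psa_customer can be sketched as a simplified SCD2 merge. This is a conceptual illustration only, not the code WhereScape generates: for brevity it compares a single attribute (`name`), whereas real implementations typically compare a hash over all non-key attributes; all names and the string-based load date are assumptions:

```python
HIGH_DATE = "9999-12-31"  # open-ended validity for current versions

def scd2_merge(psa_rows, src_rows, load_date, key="customer_id"):
    """Simplified SCD2 historization: close the current version of each
    changed record and insert a new version; unchanged records stay as-is."""
    current = {r[key]: r for r in psa_rows if r["valid_to"] == HIGH_DATE}
    for src in src_rows:
        cur = current.get(src[key])
        if cur is not None and cur["name"] == src["name"]:
            continue                        # no change detected
        if cur is not None:
            cur["valid_to"] = load_date     # close the old version
        psa_rows.append({**src, "valid_from": load_date,
                         "valid_to": HIGH_DATE, "deleted_in_source": "No"})
    return psa_rows

psa = []
scd2_merge(psa, [{"customer_id": 1, "name": "Acme"}], "2023-01-01")
scd2_merge(psa, [{"customer_id": 1, "name": "Acme Ltd."}], "2023-06-01")
# psa now holds two versions: the closed "Acme" row and the current "Acme Ltd." row.
```

The same pattern underlies the generated ETL/ELT code: the tool emits the equivalent SQL merge for every PSA table based on its metadata.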

As described in the assumptions above, we assume that delete detection is handled by the source system itself and that data is loaded in full over and over again. For some projects this will work perfectly fine. But in many situations you will need to load only the delta from big tables. In addition, many source systems sadly do not support delete detection on their own. In such cases we will need to extend the PSA with more advanced logic as described in the next paragraph.

Advanced Design Pattern

The advanced pattern extends the basic pattern with several database objects:

PSA Advanced

Let’s have a closer look at these objects:

  • v_src_customer_dct: This view handles the delete detection (dct). For tables which are loaded in full mode the view does the following – the logic for incrementally loaded tables is explained further down:
    • Check all rows which are coming from the v_psa_customer_cur view and compare these with all the rows which are loaded in the src_customer table. For records which are only present in the PSA view but not the src table anymore, we can assume they were physically deleted from the source system.
    • The view will return only the missing rows and mark the field “deleted_in_source” with “Yes”.
  • stg_customer_dct: The rows which are marked as deleted in the v_src_customer_dct view are temporarily stored in this staging table and loaded into the psa_customer table afterwards (with the “deleted_in_source” flag set to “Yes”)
  • src_customer_bk: This table is only created for tables which are regularly loaded in delta mode. Its only goal is enabling the option to implement delete detection:
    • For incrementally loaded tables you cannot detect deleted records in the source as explained above.
    • Therefore we create this bk table which consists only of the business key columns of the given source table (as the PSA is very source system driven, often the business key is here the same as the primary key of the source system table, which uniquely identifies the source row).
    • After loading all the data the delete detection view (v_src_customer_dct) will use the bk table as input to check if any rows were deleted in the source system.
    • This bk table is loaded in full mode only. Yet loading only one or a few attributes is typically faster than doing the same for the complete table. Proper indexing can also speed up the delete detection process.
    • In cases where a full load of the bk table still takes too long, you may decide to execute this load only once a week or even less often.
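The core of the delete detection view can be boiled down to a set difference over business keys. The following is a hedged sketch, not the actual generated view: for full loads the source keys come from src_customer, for delta loads from the src_customer_bk table; the function name and key values are illustrative:

```python
def detect_deletes(psa_current_keys, src_keys):
    """Business keys present in the PSA current view (v_psa_customer_cur)
    but missing from the freshly loaded source data are assumed to have
    been physically deleted in the source system."""
    return sorted(set(psa_current_keys) - set(src_keys))

# Keys 1-3 are current in the PSA; the latest load only contains 1 and 3,
# so key 2 is flagged and its PSA row gets deleted_in_source = "Yes".
deleted_keys = detect_deletes([1, 2, 3], [1, 3])
```

In SQL this corresponds to an anti-join (e.g. NOT EXISTS) between the current PSA view and the staged source keys, which is why indexing the bk table pays off.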

Minimizing the PSA investment with data warehouse automation

Given the explanations above, you will realize that such a PSA needs a lot of database objects and logic for each and every source table. Creating them manually results in a high investment with a lot of repetitive work. Nevertheless, from a long-term architectural perspective, building up a PSA is a very good investment, as explained in the introduction of this article. On the other hand, defending it to stakeholders can be hard because your investment in the starting phase of your project is higher without a directly visible business benefit.

This challenge can be addressed using data warehouse automation tools. Our goal was to generate a PSA including delete detection, historization and delta handling with a minimal amount of manual effort. We selected WhereScape as the right tool suite for this and developed such a solution. In WhereScape you’ll find two applications to help you generate the PSA: WhereScape 3D (3D = data driven design) and WhereScape RED.

WhereScape RED is the actual development environment. You define the metadata for every needed object (like the tables and views shown above) and RED will generate the DDL and SQL statements for you and deploy them to the target database. Even though this already speeds up the process tremendously, you still need to add every object manually.

In WhereScape 3D you start with a model of your source system’s tables. Afterwards you’ll define “model conversion rules”. In those rules you can define what should be done with every source table using a rich set of rules and features, e.g. “copy entity” to derive the necessary objects for every source table. This allows for generating all needed database objects automatically. Afterwards you can export these objects to RED and let them be deployed to your target database. Whereas RED automates the code generation, 3D “automates the automation” by feeding RED with the necessary metadata. Here the true power of data warehouse automation becomes visible. But it’s not only about the decrease in development time. The following aspects are also very important selling points for using a data warehouse automation approach:

  • Quality & consistency: Because everything is generated from a fixed set of design patterns, you will not have any problems caused by manually written code.
  • Transparency & maintenance: As a direct consequence of the previous point, the solution is transparent and easier to maintain.
  • Data lineage: Because the approach is metadata driven, you will always have complete data lineage from the source system throughout your data warehouse.
  • Documentation: Using the metadata you can automatically generate technical and user documentation of the data warehouse.

Findings & Challenges

WhereScape 3D gives you all the tools for defining all the layers of a data warehouse. But it doesn’t give you a complete set of “ready to use” model conversion rules which will do this for you (at least not for the PSA requirements described in this article). We had to implement this part ourselves.

For generating the basic version of the PSA, setting up the generation process was fairly easy. When we made the step to the advanced PSA, we had several challenges which needed some out-of-the-box thinking. These were related to topics like

  • Different ETL flow for delta loads and full loads
  • The PSA table gets data from two different sources (delete detection and delta data)
  • The delete detection view has different logic depending on whether the delta or full load pattern is used
  • Some technical challenges, for example data type mappings

Learning to work with 3D is hard in the beginning (we highly recommend taking some training!). But once you have overcome this first hurdle, you can set up a very powerful and flexible data warehouse automation environment. For the generation of the PSA layer we found that generating the objects is by far the fastest way. Using data warehouse automation for the core data warehouse itself is going to be our next challenge.

Also, the fact that WhereScape is platform independent makes the investment very valuable. For example, you can migrate from Oracle to Azure SQL by simply choosing a different target platform. The needed ETL and database logic will then be generated in Azure-optimized SQL code.

Thanks a lot for reading this article. We hope it was informative. On many topics we purposely only scratched the surface and did not go into too much detail. Otherwise the article would have been a lot longer than it already is :-).

Please leave any comments / questions below, we’d love to hear from you!

(User) Stories for Analytics Projects – Part 2

You can’t spend long in agile project management without encountering the term “user story”. I want to use this two-part article series to summarize how the concept, which originated in software development, can be applied to analytics projects, also known as business intelligence and data warehousing. In the first part, I introduced the basics of stories. Now I would like to explain how we can structure or tailor stories to suit our circumstances in analytics systems and development projects and meet these three principles:

  1. The story is the smallest planning unit I use to plan a project.
  2. The story covers a timespan limited to one or two working days.
  3. A story always has an end-to-end characteristic. In analytics, this often means that a story has a data source as its start and a form of data evaluation as its end.

From feature to story

Let’s take a look at this graphic and go through it point by point:

On the left, we see the simplified architecture stack of an analytics solution. For requirements elicitation and subsequent agile project structuring, the three levels shown are sufficient in most cases:

  • BI application: These are the most visible aspects of an analytics solution. They incorporate information products of all kinds, from reports to dashboards and data interfaces to other systems. Often, the finished product at this level is composed of data from several entities on the level below, such as data from a sales data mart and a logistics data mart.
  • DWH or data basis: Here I deliberately combine the core data warehouse and data mart levels. Even if they are technically different things, a Core DWH alone hardly adds any value to the practical use of a story. It’s only through the implementation of initial key figures and the application of case-specific business rules that the data really becomes useful.
  • Connectivity and infrastructure: Developing a new data source can be a costly business. The same applies to providing the initial infrastructure. Clearly a maximum implementation time for a story of one to two days requires an established infrastructure. But this also means that we need our own stories to develop this foundation.

In many cases, an analytics solution consists of all three levels. At this point, the technology plays a subordinate role. For instance, the DWH aspect might be mapped purely virtually in the BI front-end tool. Nevertheless, it must be considered equally in terms of content (cf. also the model of T-shirt size architecture). These three levels represent for me the macro variant of end to end, from the actual source system to the finished information product. At this macro level, I find it useful to speak of “features”. A feature is formulated analogously to a user story; you can find an example in the upper right corner of the graphic. A feature is often implemented within two to three weeks. But this is too long for the daily monitoring of success in a project. A feature must then be cut further into “epics” and stories on the three levels of BI application, DWH, and connectivity and infrastructure. Epics are merely a means of bracketing stories together. Let’s assume that Feature 1 in the fictitious club dashboard introduced in Part 1 can be outlined in four stories at the BI application level. Then I would gather these four stories into an epic to emphasize the coherence of the stories.

The central aspect of the division into the three architectural levels is that each level also has end-to-end characteristics, but this time at the micro level. Moreover, these three levels overlap somewhat. Let’s look at the club dashboard example and the story there: “As a TDWI back-office employee, I’d like a diagram showing the participants registered per round table in a city that I can select”. To implement this story, the underlying data model is our data source. If necessary, this model may need to be adapted or extended as part of the story. The final product is part of the finished dashboard and potentially generates immediate benefits for the end user.

At the DWH level, we build on an established interface to the source system and a functioning data infrastructure. But what does the other end look like, where we also expect some form of evaluation? We don’t need to build a complete dashboard for this either; there’s a story for that. Instead, we can generate a simple analysis on top of any data model, for instance with Excel, and make data visible. And immediately the product owner can start a conversation with the DWH developer about whether these initial results meet expectations.

This approach can even be applied at the connectivity level. Here, the starting point is the source system. In our example, it’s a fictitious, web-based system that allows either a CSV export or a web service query. In a first user story, a CSV export is loaded into a load layer of the DWH. The result the product owner sees is not just a table definition but a simple evaluation of the raw data. This enables the product owner to detect possible data quality problems early.

As you can see from these example stories, I like to use the traditional story pattern “As a <role>, I want <goal> so that <benefit>” at the feature and BI application levels. At the DWH and connectivity and infrastructure levels, I prefer an alternative pattern: <action> <result> <after/for/from/in> <object>.

Summary and outlook

In this second article on using stories in analytics, I showed how small stories can be reconciled with the demand for an end-to-end story. It’s important not to lose sight of the big picture, the feature, and to always work towards concrete visible analysis at the epic and story levels.
Even with this structure, it is still a challenge to tailor stories so that they can be implemented in one or two days. There are also many practical approaches to this—and material for another blog 😉

(This article was first published in German on my IT-Logix blog. Translated to English by Simon Milligan)

(User) Stories for Analytics Projects – Part 1

You can’t spend long in agile project management without encountering the term “user story”. I want to use this two-part article series to summarize how the concept, which originated in software development, can be applied to analytics projects, also known as business intelligence and data warehousing.

User stories or general stories are a tool for structuring requirements. Roman Pichler summarizes their use like this:

 “(User) stories are intended as a lightweight technique that allows you to move fast. They are not a specification, but a collaboration tool. Stories should never be handed off to a development team. Instead, they should be embedded in a conversation: The product owner and the team should discuss the stories together. This allows you to capture only the minimum amount of information, reduce overhead, and accelerate delivery.”

To keep my explanations a bit less abstract, I’ll introduce a simple, fictitious case study. Let’s assume that a club like TDWI, of which I am an active member, wants a new club dashboard. For this purpose, the project manager has already sketched their first ideas as a mockup:

Basics

All user stories follow a particular pattern, of which there are different basic variations. The most common pattern is as follows: “As a <role>, I want <goal> so that <benefit>”.

Some projects involve systems that aren’t designed for end users. Let’s assume we build a persistent staging area and a data interface from it that delivers integrated data to another system. In such cases, it seems a bit artificial to speak of “user stories”. Here’s an alternative pattern for formulating a story (source): <action> <result> <after/for/from/in> <object>.

One of the central questions when using stories is always how they are tailored. How broad should a story be? I follow three principles:

  1. The story is the smallest planning unit I can use to plan a project. Of course, you can also divide a story into tasks and technical development steps. But that’s up to the people who implement it. As long as we are at the level where the product owner, and thus a specialist, is involved, we remain at the story level.
  2. The story covers a timespan limited to one or two working days. This includes the specification of requirements (a reminder to have a conversation about it), their technical implementation, and integrated testing. This forces us to create small stories of a more or less uniform size. This principle makes it possible to measure a flow in implementation daily (by now, it should be clear that I don’t think much of story points, but that’s another story).
  3. A story always has an end-to-end characteristic, and it should provide something usable in the implementation of the overall system. In analytics, this often means that a story has a data source as its start and a form of data evaluation as its end.

Let’s take a more concrete look at this using the case study. First, we focus on a specific part of the dashboard – let’s call it Feature 1 – where the number of participants is evaluated:

Independently of technological implementation, we can deduce two things here: a simple, analytical data model, such as a star schema, and the data sources required. From the principles formulated above, the question now arises what “end to end” means. In simple cases, it may well be that the scope of a story extends from the data source to the data model, including the actual dashboard, and that this scope can easily be implemented in one or two working days. But what if the technical development of the data source connectivity alone means several days of work? Or the calculation of a complex key figure requires several workshops?

Particularly in larger organizations, I hear people say: “This end-to-end idea is nice, but it doesn’t work for us. We prefer to write ‘technical’ stories”. This means, for example, that I as data architect model the fact table XY and event dimension, I as ETL developer load the fact table XY and the event dimension, or I as data protection officer define the row-level security for fact table XY. Although each of these examples provides a benefit to the overall system, on its own each does not add value and success is difficult to test. Another common response is: “This end-to-end story makes perfect sense, but it doesn’t work with a lead time of one to two working days. We just work on a story for two or three weeks”. That’s just as problematic. If a story takes several weeks to implement, then monitoring success and progress becomes very difficult. In the worst case, we don’t realize that we’re on the wrong track after one or two days but only after three or even four weeks.
Both technical stories and long stories should be avoided. How to do this exactly is explained in the second part of this series.

(This article was first published in German on my IT-Logix blog. Translated to English by Simon Milligan)

Positioning Architecture T-Shirt-Sizes

In my previous post, I introduced the idea of architecture t-shirt sizes to depict the idea that your BI architecture should grow with your requirements. In this blog post I position the four example t-shirt sizes on Damhof’s Data Management Quadrants.

T-Shirt Sizes in the Context of Data Management Quadrants

If you haven’t read my previous blog post, you should do so first.

In [Dam], Ronald Damhof describes a simple model for the positioning of data management projects in an organization. Here, he identifies two dimensions and four quadrants (see also Figure 2). On the x-axis, Damhof uses the push and pull strategy terms known from business economics. These express how strongly the production process is controlled and individualized by demand. On the right or pull side, topic-specific data marts, and from them information products such as reports and dashboards, are developed in response to purely business requirements. Agility and specialist knowledge are the key here. The first two T-shirt sizes, S and M, can be categorized as belonging on this side. On the left or push side, the BI department connects various source systems and prepares the data in a data warehouse. The focus here is on economies of scale and deploying a stable basic infrastructure for BI in the company. Here we can see the other two T-shirt sizes, L and XL.

Figure2
Figure 2: Damhof’s Data Quadrant Model

On the y-axis, Damhof shows how an information system or product is produced. In the lower half, development is opportunistic. Developers and users are often identical here. For example, a current problem with data in Excel or with other tools is evaluated directly by the business user. This corresponds to the S-size T-shirt. As can be seen in my own case, the flexibility gained for research, innovation, and prototyping, for example, is at the expense of the uniformity and maintainability of results. If a specialist user leaves the company, knowledge about the analysis and the business rules applied is often lost.

In contrast, development in the upper half is systematic: developers and end users are typically different people. The data acquisition processes are largely automated and so do not depend on the presence of a specific person in daily operations. The data is highly reliable due to systematic quality assurance, and key figures are uniformly defined. The L- and XL-size T-shirts can be placed here in most cases.

The remaining T-shirt, the M-size, is somewhere “on the way” between quadrants IV and II. This means it is certainly also possible for a business user without IT support to implement a data mart. If the solution’s development and operation is also systematized, this approach can also be found in the second quadrant. This also shows that the architecture sizes are not only growing in terms of the number of levels used.

The positioning of the various T-shirt sizes in the quadrant model (see Figure 3) indicates two further movements.

  • The movement from bottom to top: We increase systematization by making the solution independent of the original professional user. In my own dashboard, for example, this was expressed by the fact that at some point data was no longer accessed using my personal SAP user name but using a technical account. Another aspect of systematization is the use of data modeling: while my initial dashboard simply imported a wide table, in the tabular model the data was already dimensionally modelled.
  • The movement from right to left: While the first two T-shirt sizes are clearly dominated by business requirements and the corresponding domain knowledge, further left increasing technical skills are required, for example to manage different data formats and types and to automate processes.
Figure3
Figure 3: T-Shirt Sizes in the Data Quadrant Model

Summary and Outlook

Let’s recap: BI solutions have to grow with their requirements. The architectural solutions shown as T-shirt sizes illustrate how this growth path can look in concrete terms. The DWH solution is built, so to speak, from top to bottom – we start with the pure information product and then build step by step towards the complete data warehouse architecture. The various architectural approaches can also be positioned in Ronald Damhof’s quadrant model: A new BI solution is often created in the fourth quadrant, where business users work exploratively with data and create the first versions of information products. If these prove successful, it is of particular importance to systematize and standardize the approach. At first, a data mart serves as a guarantor for a common language used by various information products. Decoupling data from the source systems also allows further scaling of the development work. Finally, a data warehouse can be added to the previous levels to permanently merge data from different sources and, if required, make it permanently and historically available.

Organizations should aim to institutionalize the growth process of a BI solution. Business users can’t wait for every new data source to be integrated across multiple layers before it’s made available for reporting. On the other hand, individual solutions must be continuously systematized, gradually placed on a stable data foundation, and operated properly. The architecture approaches shown in T-shirt sizes provide some hints as to what this institutionalization could look like.

This article was first published in TDWI’s BI-SPEKTRUM 3/2019

Growing a BI Solution Architecture Step by Step

The boss needs a new evaluation based on a new data source. There isn’t time to load the data into an existing data warehouse, let alone into a new one. In this article I introduce the idea of architectural T-shirt sizes. Of course, different requirements lead to different architectural approaches, but at the same time, it should be possible that architecture can grow as a BI system is expanded.

T-shirt sizes for BI solutions

The boss needs a new evaluation based on a new data source. There isn’t time to load the data into an existing data warehouse, let alone into a new one. It seems obvious to reach for “agile” BI tools such as Excel or Power BI. After several months and more “urgent” evaluations, a maintenance nightmare threatens. But there is another way: In this article I introduce the idea of architectural T-shirt sizes, illustrated using our own dashboard. Of course, different requirements lead to different architectural approaches, but at the same time, it should be possible that architecture can grow as a BI system is expanded.

The summer holidays were just over, and I was about to take on a new management function within our company. It was clear to me: I wanted to make my decisions in a data-driven way and to offer my employees this opportunity. From an earlier trial, I already knew how to extract the required data from our SAP system. As a BI expert, I would love to have a data warehouse (DWH) with automated loading processes and all the rest. The problem was that, as a small business, we ourselves had no real BI infrastructure. And alongside attending to ongoing customer projects, my own time was pretty tight. So did the BI project die? Of course not. I just helped myself with the tool I had to hand, in this case Microsoft Power BI. Within a few hours the first dashboard was ready, and it was published in our cloud service even faster.

The Drawbacks of Quick and Dirty

My problem was solved in the short term. I could make daily-updated data available to my employees. Further requirements followed, so I copied the Power BI file and began adapting it here and there. Over the next few weeks, I added new key figures and made more copies. However, I was only partly able to keep the various dashboard copies “in sync”. In addition, operational problems came up. Of course, I had connected the SAP system using my personal username, whose password has to be changed regularly. This in turn led to interruptions to the data updates and required manual effort to reconfigure the new password in Power BI.

T-Shirt Sizes for Step-by-Step Development of BI Architecture

I guess a lot of professional users have been in my shoes. You have to find a solution on the quick – and before you know it, you’re in your own personal maintenance nightmare. At the same time, a fully developed BI solution is a distant prospect, usually for organizational or financial reasons.

As an expert in the adaptation of agile methods for BI projects, I have been dealing with this problem for a long time: How can we both address the short-term needs of the professional user and create the sustainability of a “clean” solution? As a father of two daughters, I remembered how my children grew up – including the fact that you regularly have to buy bigger clothes. The growth process is continuous, but from time to time we have to get a new dress size. It is exactly this principle that can also be applied to BI solutions and their architecture. Figure 1 shows four example T-shirt sizes, which are explained in more detail below.

Figure1
Figure 1: Architectural solutions in T-shirt sizes at a glance
Think Big, Start Small: The S-Size T-Shirt

This approach corresponds to my first set of dashboards. A BI front-end tool connects directly to a source. All metadata, such as access data for the source systems, business rules, and key figure definitions, are developed and stored directly in the context of the information products created.

This T-shirt size is suitable for a BI project that is still in its infancy. Often a broad aim is formulated at this stage – everything that you would eventually like to analyze and evaluate – and this is where “thinking big” begins. In practice, however, little more than the data source is known. At the same time, you would like to be explorative and produce first results promptly, so it makes sense to begin small and technically undemanding.

However, this approach reaches its limits very quickly. Here is a list, by no means complete, of criteria for deciding when the time has come to think about the next architectural dress size:

  • Several similar information products or data queries exist that use the same key figures and attributes over and over again.
  • Different users regularly access the information products, but access to the data source is through the personal access account of the original developer.
  • The source system suffers from multiple, and mostly very similar, data queries of the various information products.
Developing a Common Language: The M-Size T-Shirt

In this size, we try to store commonly useful metadata, such as key figure definitions and the access information for the source systems, at a level of its own rather than in the information product itself. This level is often called the data mart or the semantic layer. In the case of my own dashboard, we developed a tabular model for it in Azure Analysis Services (AAS). The various specifications or copies of the dashboards largely remained as such – only the substructure changed. However, all variants now rested on the same central foundation. The advantages of this T-shirt size compared to the previous one are clear: your maintenance effort is considerably reduced, because the basic data is stored centrally once and does not need to be maintained for every single information product. At the same time, you bring consistency to the definition and designation of the key figures. In a multilingual environment, the added value becomes even more apparent, because translations are centralized and maintained uniformly in the semantic layer. All your dashboards thus speak a common language.

In this M-size t-shirt, we still do not store any data permanently outside the source. Even if the source data is transferred to the tabular model in AAS, it must be re-imported for larger adjustments to the model. Other manufacturers’ products come completely without data storage, for example, the Universes in SAP BusinessObjects. This means that a high load is sometimes applied to the source systems, especially during the development phase. Here is a list of possible reasons to give your “BI child” the next largest dress size:

  • The load on the source systems is still too large despite the semantic layer and should be further reduced.
  • The BI solution is also to be used as an archive for the source data, for example in the case of web-based data sources where the history is only available for a limited period of time.
  • If the source system itself does not historize changes to data records, this can be interpreted as another reason for using the BI solution as a data archive.
Decoupling from the Source: The L-Size T-Shirt

The next larger T-shirt for our BI architecture, the L-size, replaces the data mart’s direct access to the source. To do this, the data is extracted from the source and permanently stored in a separate database. This level corresponds to the concepts of a persistent staging area (PSA) (see also [Kim04] pp. 31ff. and [Vos]) or an actively managed data lake (see also [Gor19], for example p. 9). They all have one thing in common: the data is taken from the source with as little change as possible and stored permanently. This means that the existing data mart can be repointed to this new source relatively easily. In my own dashboard example, we’re not at this stage yet, but as a next step we have planned to extract the SAP data using Azure Data Factory and store it permanently in an Azure SQL database. For people who, like us, use their ERP as a cloud solution, this layer reduces the lock-in effect of the ERP cloud provider. Other advantages of this persistent data storage outside the source systems include the archiving and historization function it brings with it: both new and changed data from the source are continuously stored, and deleted records can be marked accordingly. Technically, the data model stays very close to that of the source, although under certain circumstances we already harmonize some data types. While this layer can be implemented pragmatically and fast, there are again indicators of when to jump to the next T-shirt size:

  • The desired information products require integrated and harmonized data from several data sources.
  • The data quality of the source data is not sufficient and can’t simply be improved in the source system.
  • Key calculations should be saved permanently, for example for audit purposes.
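The historization behavior described above for the PSA layer – store new and changed rows, flag deletions – can be sketched in a few lines. The following is a simplified, hypothetical Python illustration (dictionaries standing in for tables and a load process), not how Azure Data Factory or any specific tool implements it:

```python
from datetime import datetime, timezone

def load_into_psa(psa: dict, source_rows: dict) -> dict:
    """Merge one source snapshot into the PSA.

    psa maps a business key to a list of row versions; each version holds
    the unchanged source row plus load metadata. Hypothetical structure,
    for illustration only.
    """
    now = datetime.now(timezone.utc).isoformat()
    for key, row in source_rows.items():
        versions = psa.setdefault(key, [])
        # Insert new keys, and add a version when a row changed or reappeared.
        if not versions or versions[-1]["row"] != row or versions[-1]["deleted"]:
            versions.append({"row": row, "loaded_at": now, "deleted": False})
    # Flag keys that vanished from the source, so "as of" queries stay correct.
    for key, versions in psa.items():
        if key not in source_rows and not versions[-1]["deleted"]:
            versions.append({"row": versions[-1]["row"], "loaded_at": now, "deleted": True})
    return psa

psa = {}
load_into_psa(psa, {"4711": {"name": "Alice"}, "4712": {"name": "Bob"}})
load_into_psa(psa, {"4711": {"name": "Alicia"}})  # key 4712 vanished from the source
```

After the second load, key 4711 has two versions and key 4712 carries a deletion marker – exactly the information an “as of” query needs.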
Integrated, Harmonized and Historicized Data: The XL-Size T-Shirt

The next, and for the time being last, T-shirt, the XL-size, extends the existing architecture with a classical data warehouse layer between the PSA and the data mart. It foregrounds the integration and harmonization of master data from the various source systems. This is done using a central data model that exists independently of the sources used (for instance, a data vault model or a dimensional model). This enables systematic data quality controls to be carried out when initially loading and processing data. Historization concepts can also be integrated here as required. The persistent storage of data in this DWH layer means it is also permanently available for audit purposes.

The various T-shirt sizes don’t only differ in the number of levels they use. To characterize the four architectural approaches more comprehensively, it’s worth taking a brief look at Damhof’s model of Data Management Quadrants [Dam]. I’ll do this in the next blog post.

Literature

[Dam] Damhof, Ronald: “Make data management a live issue for discussion throughout the organization”. https://prudenza.typepad.com/files/english—the-data-quadrant-model-interview-ronald-damhof.pdf , accessed on 22.11.2019

[Gor19] Gorelik, Alex: The Enterprise Big Data Lake. O’Reilly 2019

[Kim04] Kimball, Ralph: The data warehouse ETL toolkit: practical techniques for extracting, cleaning, conforming, and delivering data. Wiley 2004

[Vos] Vos, Roelant: “Why you really want a Persistent Staging Area in your Data Vault architecture”. 25.06.2016, http://roelantvos.com/blog/why-you-really-want-a-persistent-staging-area-in-your-data-vault-architecture/, accessed on 22.11.2019

This article was first published in TDWI’s BI-SPEKTRUM 3/2019

Debugging access to REST data sources in Power BI

Recently I was involved in building a Power BI report connecting to a REST API as the data source. In this blog post I show you what tools made my life easier working with REST services.

The Error

In order to access the REST service, we first needed to create a token and pass it to the service in the HTTP header. We followed the advice here to achieve this directly in Power BI. But it didn’t work, and at some point we always got the following error: “Specified value has invalid HTTP Header characters”

PBIErrorREST

As far as I could observe, Power BI is pretty opaque about what happens behind the scenes. So first of all I wanted to know whether the REST service itself works as expected.

Testing the REST service outside of Power BI

There are various options to make a REST call; I’ll mention two of them.

PowerShell

A low installation footprint approach is using PowerShell. With the following command you can send a REST request and retrieve the result:

Invoke-RestMethod -Uri <URL> -Body ('{"systemId": "it-logix", "apiToken": "abcd"}') -Method POST
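If PowerShell is not at hand, the same kind of test request can be sketched with plain Python. The URL and the credentials below are placeholders mirroring the command above; for illustration the request is only built, not sent, so the sketch runs without network access:

```python
import json
import urllib.request

# Placeholder endpoint and credentials, mirroring the PowerShell example above.
url = "https://example.com/rest/token"
body = json.dumps({"systemId": "it-logix", "apiToken": "abcd"}).encode("utf-8")

req = urllib.request.Request(
    url,
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would send the request and return the response.
```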

SoapUI

At least for me a more convenient way is using a non-command-line tool, for example SoapUI.

Using SoapUI I could prove that the REST service works as expected. In the second screenshot, on the right-hand side, you can see the actual data returned by the REST service.

SOAPUI

SOAPUI_Response

But why would Power BI throw an error anyway?

Bringing light into what happens behind the scene of Power BI with Fiddler

Looking at the error message, my first guess was that something was wrong with how we set the request header in Power Query. To shed more light on this I used Fiddler. Fiddler hooks into all HTTP traffic on your computer and makes it visible for analysis.

The Power BI traffic can be found in the microsoft.mashup… process.

Fiddler1

To my surprise I found that Power BI called the REST service successfully, including getting the correct data back, similar to what I got in SoapUI. So why did Power BI still throw an error?

Compare with a generic REST service and get the riddle solved

As always in a debugging process, it is helpful to have a second example to check whether an error is generic or specific to one case. For that I used a service called JSON Placeholder. Here the data import into Power BI worked like a charm. Once again using Fiddler, my interest turned to the response headers. I compared the headers of the generic service with those in my project. Can you see it?

JSONResponse1
The HTTP response header of the actual REST service
JSONResponse2
The HTTP response header of the JSON placeholder service

Finally we asked the REST service developer to remove the header line starting with an apostrophe – and suddenly it worked in Power BI 🙂
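In hindsight, the comparison that Fiddler enabled can also be done programmatically: collect both sets of response headers as dictionaries and diff them. A minimal sketch with hypothetical header values – the made-up name starting with an apostrophe stands in for the real offending header:

```python
def diff_headers(a: dict, b: dict) -> dict:
    """Return the header names whose values differ between two responses."""
    names = set(a) | set(b)
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}

# Hypothetical headers: a well-behaved service vs. one with a malformed name.
good = {"Content-Type": "application/json", "Server": "nginx"}
broken = {"Content-Type": "application/json", "Server": "nginx", "'X-Custom": "1"}

suspicious = diff_headers(good, broken)
```

Any entry in the result – here the header whose name starts with an apostrophe – is a candidate for what makes a strict client like Power BI reject the response.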

Summer News 2019

It’s been quiet here – no new blog post since the end of 2017. Let me share with you what has happened so far and what’s coming next.

One of the reasons I had no time for blogging were my various customer projects. Whereas SAP BusinessObjects was my world till 2017, in 2018 I definitely moved into the Microsoft BI world more and more. I had the pleasure to work with much of the cool stuff around Power BI, SSAS Tabular, SSRS and various Azure services. Besides that I further pushed WhereScape data warehouse automation in our customer projects.

2018 was an intensive year in my company IT-Logix too. I had to jump in as an ad interim COO last summer and run operations for eight months – besides my role as Chief Knowledge Officer and my regular project work. In April 2019 I could finally hand over all my management activities and step back from the executive committee after more than ten years. On the other hand, I was elected to the board of directors. This new role fits much better with my continued interest in exciting projects, developing new fields of business and sharing knowledge with the community.

Community is key – my concern in 2019 is to invest in the next generation of BI and analytics folks. Therefore I’m investing my time in building up the Swiss branch of the TDWI Young Guns. The Young Guns community serves as a networking platform between students, young professionals and more experienced people – all sharing a common interest in data and what you can do with it. After a successful first barcamp back in May, there are some promising upcoming events you shouldn’t miss if you’re around Switzerland in September and October. Subscribe to the Young Guns mailing list or join us on LinkedIn to learn more about the details.

I also continued my conceptual work with an updated version of my requirements framework IBIREF, including practical guidance on how to write and slice BI-specific user stories. In addition, I worked on a new model around different (or better: growing) T-shirt sizes for your BI architecture. The basic concept will be published in TDWI’s BI-SPEKTRUM by the end of August – you’ll find a copy here too. So stay tuned.

A glimpse at my presentation about T-Shirt Sizes for your BI architecture

Covering the same topic, I had the pleasure to give a keynote together with Ronald Damhof at the MAKE BI conference at the beginning of July. I’ll give a similar presentation during BOAK and the TDWI Schweiz conference this autumn, unfortunately without Ronald. In addition, I’ll share some of my (disciplined) agile project experience during DADay 2019, a virtual conference organised by the Disciplined Agile Consortium.

Finally, I have many ideas for further blog posts. Bear with me until I find the time to write them down.

Agile BI Building Blocks 2.0

Quite a while ago, I published a blog post about my Agile BI Maturity Model. In this post I’d like to show you the current state of the model.

First of all I renamed the model to “Agile BI Building Blocks”. I don’t like the “maturity” term anymore as it somehow values the way you are doing things. Building blocks are more neutral. I simply want to show what aspects you need to take into consideration to introduce Agile BI in a sustainable way. The following picture shows the current model:

What changed compared to version 1? Let me go through the individual building blocks:

  1. Agile Basics & Mindset. No change – still very important: You need to start with agile basics and the agile mindset. A good starting point is still the Agile Manifesto or the newer Modern Agile.
  2. Envision Cycle & Inception Phase. No change – this is about the importance of the Inception Phase, especially for BI projects. Simply don’t jump straight into development but do some minimal upfront work, like setting up the infrastructure, creating a high-level release scope and securing funding.
  3. BI specific User Stories. Changed the term from simply User Stories to “BI specific User Stories”. Unfortunately I haven’t managed to write a blog post about this yet, but you’ll find some ideas in my recent workshop materials.
  4. No / Relative Estimating. Changed from Relative Estimating (which is mainly about Story Points) to include also No Estimating which is basically about the #NoEstimates movement. I held a recent presentation at TDWI Schweiz 2017 about this topic (in German only for now)
  5. (Self Organizing) Teams. Changed and put the term “Self Organizing” in brackets as this building block is about teams and team roles in general.
  6. Workspace & Co-Location. Added “Workspace” as this building block is not only about co-location (though this is an important aspect of the agile workspace in general)
  7. Agile Contracting. No change, in my recent presentation at TDWI Schweiz 2017 I talked about Agile Contracting including giving an overview of the idea of the “Agiler Festpreis”, more details you can find in the (German) book here.
  8. New: Data Modeling & Metadata Mgt. Data modeling tool support and the question of how to deal with metadata are crucial, and not only for AgileBI. In combination with Data Warehouse Automation these elements become even more important in the context of AgileBI.
  9. New: Data Warehouse Automation. The more I work with Data Warehouse Automation tools like WhereScape, the more I wonder how we could work previously without it. These kind of tools are an important building block on your journey of becoming a more agile BI environment. You can get a glimpse at these tools in my recent TDWI / BI-Spektrum article (again, in German only unfortunately)
  10. Version Control. No change here – still a pity that version control and integration into common tools like Git are not standard in the BI world.
  11. Test Automation. No change here, a very important aspect. Glad to see finally some DWH specific tools emerging like BiGeval.
  12. Lean & Fast processes. No change here – this block refers to introducing an iterative-incremental process. There are various kinds of process frameworks available. I personally favor Disciplined Agile providing you with a goal-centric approach and a choice of different delivery lifecycles.
  13. Identify & Apply Design Patterns. No change except that I removed “Development Standards” as a separate building block as these are often tool or technology specific formings of a given design pattern. Common design patterns in the BI world range from requirements modeling patterns (e.g. the BEAM method by Lawrence Corr as well as the IBIREF framework) to data modeling patterns like Data Vault or Dimensional Modeling and design patterns for data visualization like the IBCS standards.
  14. New: Basic Refactoring. Refactoring is a crucial skill to become more agile and continuously improve already existing artefacts. Basic refactoring means that you are able to do a refactoring within the same technology or tool type, e.g. within the database using database refactoring patterns.
  15. New: Additive Iterative Data Modeling. At a certain step in your journey to AgileBI you can’t draw the full data model upfront but want to design the model more iteratively. A first step into that direction is the additive way, that means you typically enhance your data model iteration by iteration, but you model in a way that the existing model isn’t changed much. A good resource around agile / iterative modeling can be found here.
  16. Test Driven Development / Design (TDD). No change here. On the data warehouse layer, tools like BiGeval simplify the application of this approach tremendously. There are also materials available online to learn more about TDD in the database context.
  17. Sandbox Development Infrastructure. No change, but also not much progress since version 1.0. Most BI systems I know still work with a three- or four-system landscape. No way that every developer gets their own full stack.
  18. Datalab Sandboxes. No change. The idea here is that (power) business users can get their own, temporary data warehouse copy to run their own analysis and add and integrate their own data. I see this currently only in the data science context, where a data scientist uses such a playground to experiment with data of various kinds.
  19. Scriptable BI/DWH toolset. No change. Still a very important aspect. If your journey to AgileBI takes you to this third stage of “Agile Infrastructure & Patterns” which includes topics like individual developer environments and subsequently Continuous Integration, a scriptable BI/DWH toolset is an important precondition. Otherwise automation will become pretty difficult.
  20. Continuous Integration. No change. Still a vision for me – will definitely need some more time to invest into this in the BI context.
  21. Push Button Deployments. No change. Data Warehouse Automation tools (cf. building block 9) can already help with this a lot. Still, a lot of manual automation coding is needed to link with test automation or to coordinate deployments across multiple technology and tool layers.
  22. New: Multilayer Refactoring. In contrast to Basic Refactoring (cf. building block 14) this is the vision that you can refactor your artefacts across multiple technologies and tools. Clearly a vision (and not yet reality) for me…
  23. New: Heavy Iterative Data Modeling. In contrast to Additive Iterative Data Modeling (cf. building block 15) this is about the vision that you can constantly evolve your data model, including existing aspects of it. Having multilayer refactoring capabilities is an important precondition to achieve this.

Looking at my own journey towards more agility in BI and data warehouse environments, I’m in the midst of the second phase about basic infrastructure and basic patterns & standards. Looking forward to an exciting year 2018. Maybe the jump over the chasm will work 😉

What about your journey? Where are you now? Do you have experience with the building blocks in the third phase about agile infrastructure & patterns? Let me know in the comments section!

Click & Ready: Setup a virtual training classroom

For our upcoming event “BusinessObjects Arbeitskreis”, or simply BOAK (www.boak.ch), we are providing hands-on sessions to participants. We have used Cloudshare as a platform for multiple years now to provide every participant with their own virtual machine environment. This year we have the chance to use the Training module of the Cloudshare platform. This module simplifies the setup of user environments massively and gives us – the instructors – better capabilities to assist our participants. In this article I quickly walk you through how to set up a virtual training classroom in Cloudshare:

Create a new class

Open the Training module in the web interface of Cloudshare and choose “Create a New Class”. Give it a name and choose which blueprint and snapshot of your VM environment you want to use for the class.


In addition, choose an environment policy. The policy controls how long an environment can be used and after which idle time an environment is suspended. Here you see an example of an environment policy:

Next, you have to choose a time for your class to start. Important: This starting time is used by Cloudshare to prepare all the necessary VM environments in the background so that your students don’t have to wait once they log in:

Finally specify a passphrase and a maximum number of students. You can also allow non-registered users to join – an option we are grateful for during BOAK.

Manage students & run the class

After having created the class, you can register students and optionally send out an invitation email to them.

What students need to join a class is only the Student Login Link as well as the passphrase:

If you allow unregistered users to join the class, you can force them to provide some details like name and company:

Once you are logged in, your personal environment is already up and running – just click “Start Using This Environment”:

Click on “View VM”

Now the HTML5 based remote desktop session is opened directly within the browser – please note: No addons are needed for this! This helps us to keep costs for additional configurations on our training laptops to an absolute minimum.

As a student I can ask for help, chat with the instructor or with the whole class.

As an instructor I can switch to the Instructor Console where I see a thumbnail of every student’s screen:

If a student asks for help, I can quickly “zoom in” and see the student’s screen live:

If needed, the instructor can take over control of a student’s screen:

Conclusion

In this blog post I’ve shown how you can use Cloudshare’s Training Module to set up and run a virtual training class. I have to admit that I’m pretty amazed at how easy it was to use this feature. So far everybody is happy with it:

  • Our IT administrators because it reduces laptop-side configurations.
  • Me as an instructor: I can easily invite students and run a class without technical knowledge around virtualization, VM management etc.
  • Our students: So far they have been very happy with the solution. It is easy to use and the performance is as if you were working on your local PC.


BI-specific analysis of BI requirements

Problems of requirement analysis

Practically every BI project is about requirements, because requirements communicate “what the client wants”. There are essentially two problems with this communication: the first is that clients often do not end up with what they really need. This is illustrated in the famous drawing in Figure 1: What the customer really needs.


Figure 1: What the customer really needs (Source: unknown, with additional material by Raphael Branger)

The second problem is that requirements can change over time. Thus, it can be that, especially in the case of long implementation cycles, the client and the contractor share a close consensus about what is wanted at the time of the requirement analysis. By the time the solution goes into operation, however, essential requirements may have changed.

Figure 2: Requirements can change over time

Of course, there is no simple remedy for these challenges in practice. Various influencing factors need to be optimized. In particular, the demand for speed calls for an agile approach, especially in BI projects. I have already written various articles on this, including “Steps towards more agility in BI projects”. In that article, among other things, I describe the importance of standardization. This also applies to requirement analysis. Unfortunately, the classic literature on requirement management is not very helpful; it is either too general or too strongly focused on software development. At IT-Logix, we have developed a framework over the last ten years that helps us and our customers in BI projects to standardize requirements and generate BI-specific results. Every child needs a name: our framework is called IBIREF (the IT-Logix Business Intelligence Requirements Engineering Framework).

Overview of IBIREF

IBIREF is divided into three areas:

Figure 3: Areas of IBIREF
  • The area of requirement topics addresses the question of what subjects should be considered at all as requirements in a BI project. I’ll go into a little more detail about this later in this article.
  • In the requirements analysis process, the framework defines possible procedures for collecting requirements. Our preferred form is an iterative-incremental (i.e. agile) process; I have dealt with the subject of an agile development process using user stories here. It is, of course, equally possible to collect the requirements upfront in a classic waterfall process.
  • We have also created a range of tools to simplify and speed up the requirement collection process, depending on the process variant. This includes various checklists, forms and slides.

Overview of requirement topics

Now I would like to take a first look at the structuring of possible requirement topics.

Figure 4: Overview of possible requirement topics

Here are a few points about each topic:

  1. The broad requirements that arise from the project environment need to be considered to integrate a BI project properly. Which business processes should be supported by the BI solution to be developed? What are the basic professional, organizational or technical conditions? What are the project aims and the project scope?
  2. If the BI solution to be created includes a data warehouse (DWH), the requirements for this system component must be collected. We split the data requirements into two groups: the target perspective provides information about the key figures, dimensions and related requirements, such as historization or the need for hierarchies. This is all well and good, but the source perspective should not be forgotten either. Many requirements for the DWH arise from the nature of the source data. In addition, requirements for metadata and security in the DWH have to be clarified.
  3. The BI application area includes all front-end requirements. This starts with the definition of the information products required (reports, dashboards, etc.), their target publication, purpose and data contents. One can then consider how the users navigate to and within the information products and what logic the selection options follow. One central consideration is the visualization of the data, whether in the form of tables or of diagrams. In this area, advanced standards such as the IBCS provide substantial support for the requirement analysis process (read an overview of my blog contributions to IBCS and Information Design here). The functionalities sub-item concerns requirements such as exporting and commenting. When it comes to distribution, it is interesting to know the channels through which the information products are made available to the users. And it is important to ask what security is required in the area of BI application too.
  4. The issue of requirement metadata is often neglected; however, it is useful to clarify this as early as possible in the project. This concerns the type of additional information to be collected about a requirement: Does one know who is responsible for a requirement? When was it raised, and when was it changed again? Are acceptance criteria also being collected as part of the requirement analysis?
  5. Lastly, requirements need to be collected for the documentation and training required for the use and administration of the BI system.

Summary

In this article, I have indicated that requirement analysis presents a challenge, both in general and especially in BI projects. Our IBIREF framework enables us to apply a standardized approach with the help of BI-specific tools. This allows both our customers and us to capture requirements more precisely, more completely and more quickly, thus enhancing the quality of the BI solution to be created.

Upcoming event: Please visit my team and me at our workshop at the TDWI Europe Conference in Munich in late June 2017. The theme is “Practice Makes Perfect: Practical Analysis of Requirements for a Dashboard” (though the workshop will be held in German). We will use the IBIREF framework, focusing on the BI application part, in roleplays and learn how to apply it. Register now – the number of seats for this workshop is limited!

(This article was first published by me in German on http://blog.it-logix.ch/bi-anforderungen-bi-spezifisch-erheben/)