Data Vault 2.0 Automation Solutions for Commercial Use

Faculty of Information Technology and Communication Sciences (ITC)
Master's thesis
September 2020


Master's thesis
Tampere University
Master's Degree Programme in Computational Big Data Analytics
September 2020

As the amount of data and the need for its processing and storage have increased, methods for its management and reporting have been developed intensively. However, these methods require considerable skill, time, and manual work.

Efforts have been made to fully automate data warehousing solutions in various areas, such as loading data at different stages of data warehousing. However, few solutions automate data warehouse construction, and learning how to use these data warehouse automation solutions requires a certain amount of expertise and time.

In this research, we discuss different solution options for automating data warehouse construction. From the point of view of organizations, the study identifies different options such as purchasing the solution, collaborating with other organizations to obtain it, or building it. In addition to a market analysis, we also create and implement an automated tool for building a Data Vault 2.0 type data warehouse by leveraging metadata as well as source RDBMS relationships to predict critical components of Data Vault 2.0 data warehousing, most of which are usually defined by experts.

Based on the metadata collected and processed, the classification algorithm was able to correctly classify an average of 85.89% of all given observations, and 55.11% of the business keys alone. The algorithm classified the observations that were not business keys more accurately than the business keys themselves. However, the correctness of the classification has the most significant impact on what the automation tool building the Data Vault 2.0 inserts into the target tables of the data model, rather than on what kinds of tables are created and which source tables they consist of. The model generated by the tool corresponded well to the target model implemented at the beginning of the study. As for hubs and satellites, disregarding a couple of missing hubs and the content of some hubs caused by shortcomings in the classification of business keys, the model could have been used as an enterprise data warehouse. The links differed more from the original target, but after testing, the link variations produced by the tool worked well either way.

There are still many shortcomings and areas for development in the tool created and implemented in this research, which have, however, been considered in the tool's logic.

Keywords: Data Vault 2.0, Automation solution, Commercial, Data warehouse, SQL server, Data modeling, Automated relationship transform, Metadata, Cloud solution

The originality of this thesis has been checked using the Turnitin Originality Check service.


Thanks to my employer Digia Plc and my supervisor there, Teemu Hämäläinen, who served as the father of the idea and provided the tools, resources, training, and constructive feedback to conduct the research. Thanks also to my thesis supervisor Martti Juhola, who has always been available to support me during the research whenever needed, and to auditor Marko Junkkari, whose instructions taught me a lot.

However, my greatest gratitude goes to my loved ones, who have managed to be enthusiastic, encouraging, and to listen to my problem-solving self-talk, even though the topic is not part of their daily life at all. If nothing else, we have at least got lots of confused expressions and plenty of laughter out of the research process.


Table of figures
List of tables
List of abbreviations used

1 Introduction
2 Background
2.1 Literature review
2.1.1 Conceptual model
2.1.2 Building a Data Vault 2.0 Type Data Warehouse
2.1.3 Automation Tools in literature
2.2 Market analysis
2.3 Summary and prediscussion
3 Research methods
3.1 The Data
3.1.1 AdventureWorks
3.1.2 The Training Data
3.2 The Basics of Data Vault 2.0
3.2.1 Hubs
3.2.2 Links
3.2.3 Satellites
3.2.4 Advanced Components of Data Vault 2.0
3.2.5 Conclusions of Structures and Best Loading Practices
3.3 Used Cloud Services and Resources
3.4 The Data Vault 2.0 Automation Tool
3.4.1 Gathering the metadata and feature selection
3.4.2 Training and testing the classification of the business keys
3.4.3 Building the Data Vault 2.0
4 Results
5 Conclusions and future development
Bibliography
APPENDIX A. The productized tools in the market analysis comparison
APPENDIX B. The Sample Data sets used in the training of the tool

2.1 The Physical Data Vault automation tool phases [24].
2.2 The Physical Data Vault automation tool logical path [24].
2.3 The façade measures of the automation tool organizations.
3.1 AdventureWorks sample database provided by Microsoft on a general level. A more observable version of the figure in the GitHub [36].
3.2 AdventureWorks sample database ordinary relationships represented by SQL Server database diagram.
3.3 AdventureWorks sample database exceptional relationships represented by SQL Server database diagram.
3.4 The Data Vault 2.0 Architecture [5].
3.5 The correlation matrix of all of the variables to be used in the classification of the business keys.
3.6 Iris data decision tree.
3.7 AdventureWorks database as the Data Vault 2.0 type data warehouse modeled by an expert of the field. Blue objects represent hubs, green links and yellow satellites. A more observable version of the figure in the GitHub [37].
3.8 The ordinary transition of one table into a hub containing the business key and the satellite holding the descriptive data.
3.9 The ordinary transition of two related tables into hubs containing the business keys, the satellites holding the descriptive data and the link containing the foreign keys.
3.10 The transition of one table related to itself into a hub containing the business key, the satellite holding the descriptive data and the link containing the foreign keys.
3.11 The transition of one table that includes only foreign keys as its business keys to the satellite holding the descriptive data and the link containing the foreign keys.
4.1 The decision tree of the classifier of the automation tool created from the numerical training metadata. A more observable version of the figure in the GitHub [39].
4.2 AdventureWorks database as the Data Vault 2.0 type data warehouse modeled after the results of the Data Vault 2.0 Creation Automation Tool. Blue objects represent hubs, green links and yellow satellites. A more observable version of the figure in the GitHub [38].

2.1 Features of productized tools on the market advertised to the Data Vault 2.0 automation. Table 1 out of 3.
2.2 Features of productized tools on the market advertised to the Data Vault 2.0 automation. Table 2 out of 3.
2.3 Features of productized tools on the market advertised to the Data Vault 2.0 automation. Table 3 out of 3.
2.4 Compatibilities of productized tools on the market advertised to the Data Vault 2.0 automation. Table 1 out of 2.
2.5 Compatibilities of productized tools on the market advertised to the Data Vault 2.0 automation. Table 2 out of 2.
4.1 The 10 iterations' results of the classification algorithm part of the Data Vault 2.0 Creation Automation Tool.

ETL / ELT - Extract, transform, load / Extract, load, transform
3NF - 3rd Normal Form
BI - Business Intelligence
CART - Classification and regression tree
CASE - Computer-aided software engineering
CRM - Customer Relationship Management
DB - Database
DV - Data Vault
DW/EDW - Data warehouse / Enterprise data warehouse
ER - Entity-relationship
ERP - Enterprise Resource Planning
ID3 - Iterative Dichotomiser 3
IoT - Internet of Things
IP - Internet Protocol
IT - Information Technology
JSON - JavaScript Object Notation
MD - Message-digest
MPP - Massively Parallel Processing
NA - Not applicable, not available or no answer
OLTP - Online transaction processing
OS - Operating System
PDV - Physical Data Vault
PIT - Point in Time
RDBMS - Relational database management system
SQL - Structured Query Language
VM - Virtual machine
VNet - Virtual network
XML - Extensible Markup Language


1 Introduction

The amount of data is increasing by 2.5 quintillion bytes every day at the current rate, and it will only increase further as the number of Internet of Things devices grows [1]. Today, different business areas need to consider not only the four V's of big data (volume, variety, velocity, and veracity) but also variability, visualization, and value [2]. Data and its utilization in business have never been as crucial as they are today. Businesses and their stakeholders, partners, and customers continually produce valuable data that could be left unutilized if the data cannot be effectively processed, stored, and analyzed. This utilization can be a very time- and resource-consuming activity. Therefore, at this stage, the automation of data processing becomes crucial.

A data warehouse (DW, also known as an Enterprise Data Warehouse, EDW) is a hub for data collection and management from one or more different data sources, and it serves as the core of Business Intelligence. The data warehouse is designed for queries and analysis instead of transaction processing. The data warehouse approach enables, among other things, integrating data from many different data sources into a single database and model, isolating the source system from the burden of large queries, presenting organizational information consistently, and facilitating decision support. Data warehouses store a copy of the current data from their data sources as well as maintain historical data, even if the source system does not maintain it.

Different roles, such as data analysts and data engineers, use various analytics tools to utilize the data warehouse and the data it contains. Many businesses rely on data and analytics powered by data warehouses.

There are many services and tools available on the market for automating the build of a data warehouse. However, not all services and tools are fully universal, meaning that they are not completely customizable, as each data warehousing project is different and has its own characteristics. The use of specific tools and services may also require the purchase of other services and tools that are needed for the particular solution to work but may not be suitable for the project owner's use. Even if the purchased service or tool turns out to be a workable and functional solution, it has its maintenance costs and operating expenses (for example, personnel, monthly fees, and licenses), which can accrue over time. Free open-source tools may be an alternative to a purchased service if the business already has the skills to carry out the integration of the tool itself. However, open-source tools carry their own security and ethical risks, which one needs to be prepared for when implementing the tool.

Moreover, many companies refuse to partner on light terms and costs. As evidence of this, a general lack of partnership offerings emerged among the companies specializing in data warehousing automation while conducting the market analysis. Furthermore, creating and negotiating a working partnership can take a lot of the company's resources and may not achieve the desired result.

Therefore, the purpose of this study is to explore the basics of Data Vault 2.0 data storage, discuss the best options for automated data warehouse construction, and create one possible solution by building a tool that can build a Data Vault 2.0 model data warehouse based on the characteristics of the source data and identify its essential business keys without human intervention. Data Vault 2.0 is a database modeling method designed to provide a long-term solution for historical data storage from multiple source systems, enabling, among many other things, data auditing. Data Vault 2.0 is a hybrid approach that combines the best of 3rd Normal Form (3NF) and dimensional modeling [41]. Its advantages include storing historical data, integrating several data sources, and making it easier to validate the data. These data sources could be, for example, an organization's operational, customer relationship management, or other systems, from which the data are extracted with an ETL tool.

Data Vault 2.0 is based on the hub, link, and satellite concepts. The hub is the key concept of the Data Vault 2.0 method, and it holds the business keys of the data. Links connect the hubs containing the business keys together, creating a flexible structure for the model. Satellites hold all the descriptive data of the business keys and can be attached either to a hub or a link. However, the main advantage of the Data Vault 2.0 architecture is that, due to its structure, Data Vault 2.0 is much easier to automate than other data warehouse approaches. The Data Vault 2.0 structure is more flexible and adaptable to changes and additions, as no rework is required when adding additional information to the data warehouse model. For this reason, Data Vault 2.0 has been selected as the target data model to be automated. Here the target data model means a data model in which the data models of multiple source systems are brought together as one. While turning existing data models automatically into the Data Vault 2.0 model, we will also evaluate, in the second section of this thesis, the challenges that have arisen in automating the creation of the data vault in general and in other studies related to the area.

After this, we take a look at the market analysis conducted in the spring of 2020, which provides information on the level and scope of existing Data Vault automation tools and potential collaboration opportunities. After the background section, the training and testing data are introduced. Using the presented training and testing data, the rudiments of Data Vault 2.0 are demonstrated and visualized. Once the data and basic methods are known, the cloud service resources used are reviewed. Azure is the provider of the cloud service resources used in the study.

The tool is built using the data, methods, and resources mentioned in the sections above. The details and creation of the tool are presented in section 3.4.

Once we have reached a satisfactory tool according to the research aim stated above, the performance, efficiency, and accuracy of the solution will be compared with a concrete Data Vault 2.0 model made by an expert in the field in the results section.


2 Background

In the recent past, one of the most significant developments in data warehousing has been the development and generalization of data warehouse automation tools.

The goal of data warehouse automation is to significantly speed up the construction and maintenance of the data warehouse and reduce the amount of manual work.

With data warehouse automation, it is possible to do iterative development work efficiently. This is especially true if data warehousing is approached with the Data Vault 2.0 model. Data warehousing and its management are also moving mainly to the cloud, which brings a new dimension to data warehousing solutions. However, the idea of automation is that while every data warehouse project is different, there are abundant similarities in the design and loading of data warehouses that can be exploited across project boundaries. In the following section, we review the current state of automation tools in both the literature and the market.

2.1 Literature review

Building a Data Vault 2.0 is both an interesting and tricky concept in literature.

There are several studies and articles on the subject, but when looking for a concrete example of Data Vault construction automation through the literature, sources are scarce. Most of the existing literature portrays how Data Vault 2.0 should be built and automated, but not how it is implemented in practice. For example, the loading processes and basics have been described in an accurate and comprehensible way, but the implementation and selection or construction of tools has been left in the hands of the literature reviewer. Often, the content of the literature remains only at the construction phase or the automation of loads, but not the automation of the construction of the Data Vault 2.0 itself (such as [5]; [7]; [13]; [14]; [15]; [16]; [18]; [22]; [25]; [27]; [28]; [29]; [30]; [31]).

In the following sections, the literature on automating the construction of the Data Vault 2.0 data warehousing model, its challenges, best practices, and practical implementations of the tools is reviewed. We aim to take advantage of the solutions reviewed here and resolve the issues that have arisen in the past when creating the automation tool in section 3.4.

2.1.1 Conceptual model

As the first step in building a data warehouse of any model, it is essential first to create a conceptual model. A conceptual model presents an existing ensemble through different concepts that make it easier for people to understand what the ensemble represents. Evidence of the conceptual Data Vault model and its importance is provided by Jovanovic et al. [11]. They propose, among many other researchers [13]; [14]; [15], that the biggest weakness of the first generation of data warehouse approaches in modeling was the lack of a conceptual model. However, the suggestions for conceptual models mentioned in their study pay little or no attention to the staging area of data warehousing, which is one of the starting points for data warehousing.

Nonetheless, Jovanovic et al. note that several advances in information technology have led to a revision of second-generation data warehousing and the inclusion of conceptual modeling in data warehousing. Second-generation data warehouse architectures include, among others, Inmon's top-down approach, where the data warehouse is built with normalized enterprise data modeling [16], which has also influenced other data modelers in their data warehouse models. The best known of these models are the Anchor model [17] developed by Lars Rönnbäck and the Data Vault 2.0 [5] developed by Dan Linstedt, used in this thesis.

Data Vault 2.0 emphasizes the value of historical data in, for example, data auditing, data discovery, load speed, and flexibility regarding changes in the warehouse [5]; [7]. The Data Vault 2.0 model is a good alternative for exploiting the conceptual model due to, among other things, its logical model and its resolving of the challenges of first-generation data warehousing. Challenges identified in first-generation (data warehouse 1.0) data models include complex updates, coordination and integration of source data, increasing delays and incomplete retrieval due to incompatibilities in data, slow loading, lack of data integration and data traceability, and convoluted schema rebuilds [11]. Data Vault 1.0 and 2.0, as data warehouse 2.0 data models, have successfully solved all the previous challenges in their models, so they are the most optimal data models to use [11].

The Data Vault 2.0 model already specifies the data warehouse and the information mart layer separately [5], but officially the transformation of the model from the staging area to the Data Vault 2.0 data warehouse itself has not been conceptually modeled [11]. This step is the most crucial part of the tool being worked on in this research, so the transformation must be modellable from a dimensional model to a Data Vault 2.0 model.

The tool created in the research does not automatically create a conceptual model, but a conceptual model is implemented for the data warehouse to be created so that the outcome of the tool can be compared to the desired result and utilized when building the logic of the tool. Here, the logic developed by Jovanovic et al. [11]; [12] in modeling various relations is utilized. Reasons to leverage their logic include the need to present and analyze data needs at an early stage, to develop an existing ontology independently of available data sources, and to demonstrate transformations between the source database and the target database [11].

2.1.2 Building a Data Vault 2.0 Type Data Warehouse

According to Inmon: "A Data Warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process" [18]. Indeed, data warehousing systems support decision making at the core of most enterprise database applications [19].

Watson states that data warehousing consists of three components [20]. According to Watson, data warehousing includes enterprise data warehousing, data marts (or, in the case of the Data Vault 2.0 model, information marts), and applications that extract, convert, and upload data to a data warehouse or database. These applications can be various ETL tools or reporting applications such as Power BI.

Although Watson previously listed three different areas as part of data storage, an essential part of data storage is also the area of data pre-storage before the EDW layer, i.e., the staging area.

Metadata, including definitions and rules for processing data, are also a critical part of the data warehouse [21]. Metadata are data about data; in other words, metadata are a description or summary of data and its features. While metadata play a significant role in data warehousing in general, they also play a significant role in identifying the business keys in the tool created in this research.
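To make this concrete, below is a minimal sketch of the kind of column-level metadata such a tool could collect from a SQL Server staging database; the pyodbc driver, the connection string handling, and the exact set of columns read from INFORMATION_SCHEMA are illustrative assumptions, not the thesis tool's actual implementation.

```python
# Minimal sketch: collecting column-level metadata ("data about data") from a
# SQL Server staging database. The connection string and pyodbc usage are
# illustrative assumptions, not the exact implementation of the thesis tool.
import pyodbc

QUERY = """
SELECT c.TABLE_SCHEMA,
       c.TABLE_NAME,
       c.COLUMN_NAME,
       c.DATA_TYPE,
       c.CHARACTER_MAXIMUM_LENGTH,
       c.IS_NULLABLE,
       c.ORDINAL_POSITION
FROM INFORMATION_SCHEMA.COLUMNS AS c
ORDER BY c.TABLE_SCHEMA, c.TABLE_NAME, c.ORDINAL_POSITION;
"""

def collect_column_metadata(connection_string: str) -> list[dict]:
    """Return one metadata record per column of the staging database."""
    with pyodbc.connect(connection_string) as conn:
        rows = conn.cursor().execute(QUERY).fetchall()
    return [
        {
            "schema": r.TABLE_SCHEMA,
            "table": r.TABLE_NAME,
            "column": r.COLUMN_NAME,
            "data_type": r.DATA_TYPE,
            "max_length": r.CHARACTER_MAXIMUM_LENGTH,
            "is_nullable": r.IS_NULLABLE == "YES",
            "ordinal_position": r.ORDINAL_POSITION,
        }
        for r in rows
    ]
```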

Building and developing a data warehouse to meet a company's needs requires knowledge of best practices and solutions. The challenges and characteristics of different types of data warehouse approaches must also be considered when building a data warehouse. Different data warehouse models can be identified based on their architecture. Inmon proposed the Corporate Information Factory as a data warehousing solution [18]. The Corporate Information Factory is a 3rd normal form (3NF) database, from which multidimensional data marts are produced [18]. Another data warehouse architecture is the so-called bus architecture developed by Kimball [22]. In Kimball's bus architecture, the data warehouse consists entirely of data marts and the dimensions they require.

As discussed earlier in the Conceptual Modeling section, Inmon’s and Kimball’s data warehousing models are among the first-generation data warehousing models.

Second-generation data warehouses introduced new advantages to data warehousing, such as the ability to support changes in data over time. In particular, the Data Vault 2.0 model developed by Linstedt has addressed the problems of scalability, flexibility, and efficiency compared to the first-generation models.

The Data Vault 2.0 model is also recognizable by its architecture. Data Vault 2.0 is mainly modeled using hub, link, and satellite tables [5]. Unique business keys are stored in hubs, foreign key references between different tables are stored in links, and data describing the business keys are stored in satellites [5]. In addition to the tables mentioned above, the Data Vault 2.0 model may have other, less common table types, which are discussed later in the Data Vault 2.0 basics in section 3.2.4.

Data Vault 2.0 is an excellent solution for integrating, storing, and protecting data [23]. However, as Krneta et al. note, it is neither intended nor suitable for intensive queries or reports, and therefore the data warehouse architecture also includes a data mart or information mart data-sharing layer, which generally follows the star pattern [23]. This is because the Data Vault 2.0 data model is an insert-only data model, i.e., each data item is exported as such, with errors and omissions, to the data warehouse. Erroneous and incomplete data are only processed in the data distribution layer, for example by taking the data to error marts, which can be used, among other things, to improve data quality. As we have already said, in practice there are not yet many options for automating the construction of a data warehouse. Krneta et al. note that one of the practical problems in building organizations' data warehouses is the lack of automation in the identification of various metrics, dimensions, and entities (hubs, links, and satellites in the case of Data Vault 2.0) [23]; [24].

Since the design of a data warehouse model is a vital part of building a data warehouse (without planning it is not possible to build a good one), there are shortcomings in the overall picture of automatic data warehouse construction. Krneta et al. also argue in favor of the previous statement when examining physical data vault design while building their CASE tool, which we discuss more in the next section [24]. Krneta et al. compare eight different approaches to data warehouse design from several different researchers against the design and implementation tool they were developing. All approaches supported sources of structured data and a few supported sources of semi-structured data. Only one approach followed the rules set for the data model. Most of the compared approaches supported the generation of the conceptual and physical data model, but none supported the construction of the Data Vault except the CASE tool they built. This supports the view of the lack of tools to build a Data Vault 2.0 type data warehouse.

Before building the data warehouse itself or implementing it from a conceptual model, it is a good idea to have an entity-relationship (ER) model for the data coming from the source. If the ER model does not exist but the data exists, it is good to build an ER model, since the ER model gives an overall picture of the concepts of the target area and the relationships between them. Once the schemas have been identified, distinguished, and created, the business keys, in other words the facts, must be identified. Next, the necessary dimensions, metrics, and aggregate functions are identified. In the last step before building the data model, a logical and physical data warehouse schema is produced. This order of implementation is also used in the work of Krneta et al. [23]; [24]. This study also utilizes the same data storage features as Krneta et al. used in automating their data mart designs.

The automatic construction of Data Vault 2.0 is also approached based on metadata and especially table relationships.

There are also specific requirements for building a Data Vault data warehouse. These requirements are presented, among others, by Linstedt et al. and Krneta et al. [5]; [24].

Krneta et al. find data warehouse design a difficult task that involves several challenges for the architect. The first challenge is the choice of the target data warehouse data model, discussed earlier. In this work, Data Vault has been chosen as the data model, as Jovanovic et al. have already defined it as the most optimal data model to use [11].

The next challenge concerns the difference in data modeling between the source system and the data warehouse. Therefore, when creating a solution, the modeler should consider whether the modeling follows a demand-driven data model determined by the source systems and needs, or whether a data-driven data model is created in the target data warehouse based on the data. The hybrid model, on the other hand, acknowledges the properties of both previous models as independent of each other (a sequential approach).

Data modeling should also consider existing and possible new separate source information systems. According to Linstedt, Data Vault 2.0 is the most sensible choice when it comes to distributed data sources [25]. In addition, a significant feature for automating Data Vault 2.0 data warehouse construction is the ability to build a Data Vault 2.0 data model directly based on source Relational Database Management System (RDBMS) schemas, due to the characteristics of Data Vault 2.0 entities (hub business keys, link foreign keys, and satellite descriptive data storage).

The first version of the data warehouse is considered to be the first version of the data model built from scratch. The first version does not necessarily meet all the requirements and needs, but subsequent versions of the data warehouse are no longer created from scratch. Either new source systems are added to the existing data warehouse or changes from existing source systems are added.

After automating the construction of a Data Vault 2.0 type data warehouse, the next challenge concerns the evolution of existing schemas and how these changes are handled in the construction tool, either when building a new one or modifying an old one.

Managing and monitoring change in data warehousing is also discussed by Subotic et al., who see these changes and keeping the data warehouse up to date as a core challenge. Schema evolution assumes that there is one version of a schema at a time. When changes occur in source systems, for example when new attributes are added to a table or entity, the data item is stored in the current version but transferred to the new version to which those changes have been made. However, Subotic et al. did not yet reach a concrete solution in terms of schema versioning and change management, but they mapped out the state of schema evolution and possible future development targets and ideas [26].

2.1.3 Automation Tools in literature

In the literature, concrete solutions for data warehouse construction, especially for the Data Vault 2.0 model, have not been presented much. This is partly because building a data warehouse is not a completely simple process: it is not enough only to identify the data warehouse schema from the sources; the logic of the conceptual business model must also be taken into account.

Krneta et al. noticed a lack of automation tools in the industry literature when building their prototype [24]. Influenced by the findings of other researchers, they found that several of the sources they studied did not develop concrete automation of the modeling process ([25]; [27]; [28]; [29]; [30]; [31]). However, these sources consist of fundamental information concerning Data Vault 2.0, while the automation of the modeling process is already a more advanced subject.

Figure 2.1 The Physical Data Vault automation tool phases [24].

Krneta et al. present a direct algorithm for the incremental design of a physical Data Vault data warehouse [24]. The algorithm utilizes the meta-model and rules of the existing data schemas. The algorithm also acknowledges some unstructured and semi-structured sources. Based on the algorithm, Krneta et al. built a prototype CASE tool for Data Vault modeling. Their prototype supports vital data needs, source data retention, and the scalability of the Data Vault model.


However, Krneta et al. argue that data warehouse modeling cannot be fully automated, but that some of the measures required to build a data warehouse must be kept manual. Krneta et al. consider these measures to be the identification of business keys and the creation of metrics [24]. Indeed, the metrics used to support business decision-making are most often customized and implemented for the organization's needs in the data warehouse information layer, which operates as a layer separate from the raw data warehouse. Nonetheless, the identification of business keys as a manual measure can be argued against. Although the business keys are unique to each organization, the concepts in the different areas of business are very similar across organizations. Utilizing enough metadata and features of different concepts, it can be assumed that business key identification is possible with classification algorithms trained to identify business keys.
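As a rough illustration of that assumption, the sketch below trains a decision tree on a handful of numerical column-metadata features to flag business key columns; the feature names, the tiny hand-labeled training set, and the scikit-learn settings are assumptions for demonstration and do not reproduce the feature selection of the thesis tool.

```python
# Illustrative sketch: training a decision tree to flag business key columns
# from numerical column metadata. The feature names and training data are
# assumptions for demonstration, not the thesis tool's actual feature set.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# One row per column of the training databases; is_business_key is the label
# assigned by hand for the training samples.
columns_metadata = pd.DataFrame(
    {
        "is_primary_key": [1, 0, 0, 1, 0, 1, 0, 0],
        "is_foreign_key": [0, 0, 1, 0, 0, 0, 1, 0],
        "is_nullable": [0, 1, 0, 0, 1, 0, 0, 1],
        "distinct_ratio": [1.0, 0.3, 0.8, 1.0, 0.1, 0.99, 0.7, 0.2],
        "name_contains_id": [1, 0, 1, 1, 0, 1, 1, 0],
        "is_business_key": [1, 0, 0, 1, 0, 1, 0, 0],
    }
)

X = columns_metadata.drop(columns="is_business_key")
y = columns_metadata["is_business_key"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

clf = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test), zero_division=0))
```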

Figure 2.2 The Physical Data Vault automation tool logical path [24].

Krneta et al.'s Physical Data Vault (PDV) automation tool and its phasing follow the steps of figure 2.1. The logic of the tool, in its simplicity, follows the steps of figure 2.2. The tool was developed using Larman's practices [32]. Larman's practices include that the attributes of modeling needs can be presented verbally, which means that some sort of user interface, e.g., an application, is created to support the design. Krneta et al. created a user interface where, among other things, the source is selected and the business keys are given as parameters. Because of the conciseness of this study, a user interface for the tool to be created is not implemented in the empirical part, but the implementation of the tool supports the creation of a possible user interface.

In Krneta et al.'s tool, the structured data are loaded directly into the data warehouse, the semi-structured data go through the staging area for additional structuring before the load into the data warehouse, and the unstructured data go through textual analysis into the staging database for further structuring and from there into the data warehouse [24]. In the tool of this study, all the data are imported into the staging area at some point before the data warehouse, because the tool collects, among other things, the metadata to be utilized and the relationships between the tables using RDBMS schemas from the tables coming to the staging area.

In their dissertation, Nataraj successfully provided a tool to transform heterogeneous data, such as XML and JSON, into a Data Vault 2.0 model [33]. In addition, Nataraj took advantage of the Data Vault 2.0 Big Data features and also integrated image, video, and IoT data into the data warehouse. However, the tool concerned the automatic transformation of unstructured and semi-structured data into a Data Vault 2.0 model and not the construction of the entire Data Vault 2.0 data model. Nataraj's work utilized the same rules to build the Data Vault 2.0 repository as Krneta et al. [24], which can be exploited from several perspectives in approaching the challenges of this research as well.

2.2 Market analysis

As for the research question, in addition to building the tool oneself, purchasing and partnership options should also be considered. The current state of the latter two options is clarified by examining the market of Data Vault 2.0 automation tools through a market analysis.

The market analysis of this thesis was carried out in the spring of 2020. The market analysis includes organizations and their already productized tools, rather than comprehensive solutions from the perspective of different data warehousing solutions, as the study is intended to focus on automating the construction of a data warehouse that takes place within a database dedicated to it. The database is one component in the overall data warehousing solution, which includes many other components such as ETL orchestration, the data mart layer, and reporting. Of particular interest is the tool's compatibility with various products and features, either as part of a database or as a stand-alone unit.

The comparable tools included in the analysis are products on the market that ranked high in the search results when searching for tools in the topic area. In addition, some of the tools have been included in the evaluation both on the basis of the search results and on the basis of their visibility, for example in courses related to the topic [3] or in the work community.


The first observation in the analysis emerged even before the features and compatibility of the products were compared. Some of the tools in the topic area that ranked high in the search results were not related to automating the building of the Data Vault 2.0 model data warehouse itself or even general-level workloads. Instead, their main job was as an ETL, data lineage, or data insight focused component for either an existing data warehouse or a data warehouse being built. This already says a commendable amount about the market situation and the room there is for tools like the one being created and implemented in this study.

The next observation concerned the opportunities for development cooperation in the market. The tools and organizations studied showed at a very early stage that there is no opportunity for collaboration in the development of their tools, or it is not publicly advertised. The only tools that could be developed by users were those that were open source. However, almost all organizations in the market analysis offered a technological or consulting partnership, i.e., the tool could be used and implemented in customer projects, but the price of this cooperation was not disclosed by any organization, so the terms of such a partnership would have to be clarified. The possibility of cooperation must also consider supplier-related products: is it possible to use or develop the product on behalf of other third parties once it has been acquired? In other words, is the ownership of a product easy to pass on when necessary? Then the other party would not be dependent on the creator of the product as both a supplier and a developer, but could use and develop the product themselves.

In terms of features and compatibility, the results of the analysis are shown in tables 2.1, 2.2, 2.3, 2.4 and 2.5. In tables 2.1, 2.2 and 2.3, the features reflect the characteristics of each organization's tool as reported on their public website. The variations, number, and details of the features varied greatly depending on the organization. Self-explanatory features are not marked unless they have been mentioned by the organization. Other compatible products of the organization have not been acknowledged; only the features of the Data Vault 2.0 building-related tool are considered. In addition, the tables include those tools that are not related to the automation of Data Vault 2.0 building in the way this study intended but were found very high in the search results when retrieving automation tools, as mentioned earlier in this section.

In tables 2.4 and 2.5, compatibility is also considered only at the general level and with the most commonly used tools. If, for example, only SQL Server without the associated cloud service is mentioned as compatible on the organization's website, compatibility cannot be fully ensured with certain cloud resource providers, nor with all the other internal options of the candidate cloud services.


Table 2.1 Features of productized tools on the market advertised to the Data Vault 2.0 automation. Table 1 out of 3.

Table 2.2 Features of productized tools on the market advertised to the Data Vault 2.0 automation. Table 2 out of 3.

Table 2.3 Features of productized tools on the market advertised to the Data Vault 2.0 automation. Table 3 out of 3.


Table 2.4 Compatibilities of productized tools on the market advertised to the Data Vault 2.0 automation. Table 1 out of 2.

Table 2.5 Compatibilities of productized tools on the market advertised to the Data Vault 2.0 automation. Table 2 out of 2.

Each comparable tool has its strengths and characteristics. From the tools represented in the analysis, the tool(s) that meet the needs of both the intended use and the available technologies should be selected. However, every tool, as well as the tool in this study, needs a human user in the first steps, but the goal is to find or produce a tool that minimizes human user training and the time spent on the tool.

Depending on the project and the organization, the tool needs to be chosen to suit one's budget. The tools whose prices were publicly shared varied around an average monthly fee of €731.75 (standard deviation 241.01). In addition, some of the published monthly prices applied to only one user. However, some organizations billed according to usage and the amount of data, making the price of the tool more flexible and better suited for smaller organizations. Nevertheless, most organizations did not include prices on their pages but instead wanted to be contacted.

In the case of fixed annual or monthly licenses, the lifespan of the tool should be recognized. It must be assessed whether the tool assists in setting up the data warehouse or whether the tool controls data warehousing around the clock and has the same life cycle as the data warehouse itself. Additional costs are also incurred for training. Many organizations also offer costly courses for their tools. However, given the fixed quoted prices and possible additional costs (training, consultants), there is room in the market for a more affordable alternative, if it meets the needs of the user at a certain level and depending on the customer organization's internal technical capital. If budget is not an issue, there are already options available in the market, depending on the features sought.

In addition to features, compatibility, and price, the reliability and functionality of the tool must be examined. When considering the market situation, the turnover of the organization offering the tool, the number of employees, the country of origin, and the existing references must also be acknowledged. There are no guarantees that the tool will work unless it is demonstrably successful. In addition, the organization's turnover and number of employees say a lot about work productivity. The country of origin has been taken into account because some organizations may face either a language or a geographical location barrier in cooperation or acquisition.

Figure 2.3 shows the factors measuring the façade of the organizations whose tools were studied earlier. For the other factors, the choice is quite a matter of opinion, as different people prefer organizations of different sizes for different reasons, and for most, the location of the organization is not a stumbling block. However, the market supply can be perceived as slightly worrying in that more than a third of the automation tool providers have provided either no references at all to the tool's operation or customer cases, or only very weak ones, for example by telling about new features of the tool in their blog posts without giving any concrete use cases. Once the features, compatibility, and reliability of the tool have been ascertained, the continuation and type of cooperation can be agreed between the parties.

2.3 Summary and prediscussion

After going through the automation tools in the literature and on the market, a clear difference can be noticed. There is not much literature about the development or creation of automation tools, but there are lots of tools on the market. The literature seeks to broaden the understanding and perspective of its target group regarding the automation of data warehouse modeling and its possibilities. Thus, the literature provides skills and theory that can be applied to develop automation tools for different areas of data warehousing.

The market, on the other hand, puts the information provided by the literature into practice. With data warehousing and effectively utilizing data being a very topical issue, it is very understandable that nothing related to it should be freely distributed, such as accurately describing the construction of a working automation tool in a publicly distributed document. On the other hand, creating an automation tool based on ready-made instructions also requires one's own knowledge if there is no previous experience on the subject. However, there are also exceptions in the market, meaning that free open source tools are available. Although these open source tools performed well in comparison to other tools, they were still not as advanced as the licensed tools.


Figure 2.3 The façade measures of the automation tool organizations.


3 Research methods

In the following subsections, we cover each part of the empirical research and development of the Data Vault 2.0 building automation tool. First, we consider the data used as training and testing data; second, the Data Vault 2.0 model itself and its basics; third, the cloud resources used; and lastly, the Data Vault 2.0 Automation Tool and each development block it consists of.

3.1 The Data

In the following subsections, the dimensional database on which the research tool is to be tested is presented. In addition, a brief review of the training databases, which have been used in training the tool's ability to find correct business keys, is introduced.

3.1.1 AdventureWorks

A simple but extensive AdventureWorks data set has been chosen to test the tool created in this study. AdventureWorks is an online transaction processing (OLTP) sample database provided by Microsoft [10]. The AdventureWorks database can be loaded into an existing database in several different ways: it can be restored from a publicly distributed database snapshot backup, downloaded from GitHub, or downloaded as a light version directly as part of a SQL database from Azure [10]. Microsoft is continuously updating the sample database by releasing new versions. In this study, we use the 2017 version.

The AdventureWorks sample database contains the database of the fictional international company Adventure Works Cycles. Figure 3.1 shows the different schemas of the database in different colors. Due to the size of the database, figure 3.1 is only a descriptive graph of the contents and relationships of the database. A better quality, zoomable image of the database can be found on GitHub [36]. Production-related tables are shown in yellow, purchasing in pink, sales in gray, person-related in blue, human resources in green, and dbo (database owner) tables in purple. The relationships between the tables are represented by relational lines accompanied by a three-part identifier representing the table in which the attribute acts as a foreign key (1), the original table of the attribute (2), and the attribute itself (3).

Figure 3.1 AdventureWorks sample database provided by Microsoft on a general level. A more observable version of the figure in the GitHub [36].

An example of this is the relationship between UnitMeasure (Production) and ProductVendor (Purchasing), represented in figure 3.2. Sometimes, however, the tables are joined by an attribute that does not have a uniform name in the source and destination tables. An example of this is the relationship between the Vendor (Purchasing) and PurchaseOrderHeader (Purchasing) tables, represented in figure 3.3, where the attribute is named according to its role as a foreign key in the target table. Thus, instead of BusinessEntityID, this relation is labeled VendorID.

Figure 3.2 AdventureWorks sample database ordinary relationships represented by SQL Server database diagram.

Figure 3.3 AdventureWorks sample database exceptional relationships represented by SQL Server database diagram.
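The naming mismatch above does not prevent automatic detection of the relationship, because SQL Server's catalog views expose both ends of each foreign key. The sketch below is one illustrative way to read those relationships; for the Vendor and PurchaseOrderHeader example it would return a row mapping PurchaseOrderHeader.VendorID to Vendor.BusinessEntityID. The pyodbc usage and the helper function are assumptions, not the thesis tool's code.

```python
# Illustrative sketch: reading foreign key relationships from SQL Server
# catalog views. For the Vendor / PurchaseOrderHeader example this returns a
# row mapping PurchaseOrderHeader.VendorID -> Vendor.BusinessEntityID, even
# though the column names differ. Connection handling is an assumption.
import pyodbc

FK_QUERY = """
SELECT fk.name                   AS fk_name,
       SCHEMA_NAME(pt.schema_id) AS parent_schema,
       pt.name                   AS parent_table,
       pc.name                   AS parent_column,
       SCHEMA_NAME(rt.schema_id) AS referenced_schema,
       rt.name                   AS referenced_table,
       rc.name                   AS referenced_column
FROM sys.foreign_key_columns AS fkc
JOIN sys.foreign_keys AS fk ON fk.object_id = fkc.constraint_object_id
JOIN sys.tables  AS pt ON pt.object_id = fkc.parent_object_id
JOIN sys.columns AS pc ON pc.object_id = fkc.parent_object_id
                      AND pc.column_id = fkc.parent_column_id
JOIN sys.tables  AS rt ON rt.object_id = fkc.referenced_object_id
JOIN sys.columns AS rc ON rc.object_id = fkc.referenced_object_id
                      AND rc.column_id = fkc.referenced_column_id;
"""

def collect_relationships(connection_string: str) -> list[tuple]:
    """Return (parent_table, parent_column, referenced_table, referenced_column) tuples."""
    with pyodbc.connect(connection_string) as conn:
        return [
            (r.parent_table, r.parent_column, r.referenced_table, r.referenced_column)
            for r in conn.cursor().execute(FK_QUERY).fetchall()
        ]
```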

This database is now modeled with a dimensional model, and adding new tables and data to it is tedious and affects previous tables. This model is to be transformed by automation from a dimensional database to a Data Vault 2.0 modeled database at the end of this research.

3.1.2 The Training Data

However, to identify the business keys, the business key identifying portion of the automation tool must be trained to recognize the correct business key in each table. The training databases also take into account most of the features of Data Vault 2.0, such as multiple sources, including business keys from different sources that define the same business concept but might have different names. Several publicly distributed sample databases have been selected for the training. The sources of the databases used in the training are specified separately in the appendix. Below is a brief introduction to each database used for the training.

New York City Taxi Sample

The NYC Taxi sample includes information on, among other things, the pick-up time, the drop-off time, the pick-up and drop-off locations, the length of the trip, and separate fares. This sample data set contains approximately 1.7 million rows of data. The data were collected and provided by technology providers authorized under the Taxicab Livery Passenger Enhancement Programs.

School Database

This database contains one schema for a school database. The database is elementary and fundamental but contains several tables for the intended purpose.

Dofactory Database

Dofactory's database is an upgraded version of Microsoft's Northwind sample database, which AdventureWorks replaced in 2005. The database contains a basic description of products ordered from supplier to customer, including the Supplier, Product, OrderItem, Order, and Customer tables.

Bikestores

The Bikestores database contains a simple description of the stores and production of a bike retailer. It contains two schemas: sales and production.

Wide World Importers sample database v1.0

Wide World Importers is a fictitious organization that imports and distributes wholesale novelty goods such as toys and chilli chocolates. The sample database consists of Wide World Importers' daily operational data on selling products to retail customers across the United States. If needed, Wide World Importers has its own documented workflow from the beginning to the end of the supply chain.

Microsoft Contoso BI Demo Dataset for Retail Industry

Contoso is a fictitious retail organization demo dataset used to demonstrate data warehouse and business intelligence solutions and functionalities in the Microsoft Office product family. The dataset consists of sales and marketing, IT, finance, and many other scenarios for a retail organization, including transactions from online transaction processing, aggregations from online analytical processing, references, and dimensional data.

Northwind and pubs sample databases for Microsoft SQL Server


The Northwind and pubs sample databases include the sales data of the fictitious organization Northwind Traders, which specializes in food export and import. Similar to Contoso, the Northwind databases are also used by Microsoft to demonstrate the features of some of its products.

Lahman’s Baseball Database

Lahman's Baseball database consists of batting and pitching data from 1871 to 2019. In addition to batting and pitching, the database includes fielding, standings, team, managerial, post-season, and other statistics. Data have been gathered from the American League, National League, American Association, Union Association, Players League, Federal League, and the National Association of 1871-1875.

SalesDB Sample Database

The SalesDB database offers a straightforward database structure with only four tables, representing a small organization's employees, customers, products, and sales. This database is a natural benchmark test for the classification tool, since if there were problems in identifying business keys in this database, there would have to be more significant problems with the classification tool or the meta-attributes chosen to classify the data.

3.2 The Basics of Data Vault 2.0

There are two well-known approaches to data warehouse modeling. In Ralph Kimball's modeling, the data warehouse solution is made entirely of star models. In Bill Inmon's data warehouse solution, the data warehouse side is formed in a normalized way, and the data marts formed from it are formed as star models. However, change management has been found problematic in both models, and both require a large number of skilled personnel to maintain them [4]. In addition, the Inmon model can also be a Data Vault model; thus, Data Vault 2.0 is based on the Inmonian architecture [6]. However, Data Vault 2.0 modeling has considered the weaknesses of previous models, including its predecessor Data Vault 1.0. Data Vault 2.0, used in this thesis (established 2013), works as an extension to version 1.0.

While Data Vault 2.0 is a technique to model a data warehouse, it is much more multidimensional. In addition to modeling, Data Vault 2.0 covers methodology as much as architecture. The methodology also includes an implementation that includes rules, best practices, and processes, among other things. This allows for flexible data integration from multiple sources, storing, historizing, and data distribution [5]. Data Vault was developed by Dan Linstedt as early as the 1990s but did not enter public distribution until 2000. The Data Vault 2.0 method is platform-independent and supports both relational and NoSQL environments. The method also allows hybrid solutions of the previous two as well as Big Data solutions.

The Data Vault 2.0 method is based on generally accepted methods of process development, project management, and application development such as Scrum and Kanban. The cornerstones of the method are coherence, repeatability, and the utilization of business models [5]. Data Vault 2.0 enables agile, iterative development, and data modeling according to the architecture and methodology guarantees the model's flexibility, scalability, and fault tolerance. The modeling approach has been made flexible for changes in source systems by separating structured data from descriptive data [7]. Of particular importance in the data model are the traceability of the data (auditing the data) and the possibility to add data sources so that the existing implementation is not disrupted [5]. In addition, the integration and harmonization of data from different systems are essential.

Allowing handling of and interaction with NoSQL and Big Data structures and following Scrum and other agile best practices were among the new characteristics added to Data Vault 2.0 compared to Data Vault 1.0. However, the main distinction between 2.0 and 1.0 is replacing the sequence numbers with hash keys [34]. The sequence numbers worked as primary keys, just as hash keys do. However, they had several limitations, such as dependencies slowing down data loading (specific tables needed to be loaded before others due to dependencies between tables), synchronization problems (a sequence generator could cause duplication without syncing), erroneous satellite-link connections (different sequence numbers for the same data), scalability issues, and preventing Massively Parallel Processing (MPP). To overcome all the sequence number limitations, hash keys were included in version 2.0. Hash keys are generated by hash functions, which ensure that the same hash key is created for the same data every time. Besides, hash keys are created automatically and independently during loading, so no lookups into other entities are needed, which overcomes the dependency and parallel loading problems.
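As a minimal sketch of this idea, the business key values can be normalized, concatenated, and hashed deterministically; the delimiter, trimming, and upper-casing rules below follow common Data Vault 2.0 conventions but are assumptions rather than the exact rules of the thesis tool.

```python
# Minimal sketch: deterministic hash key generation from business key values.
# The delimiter, trimming, and upper-casing rules are common Data Vault 2.0
# conventions and are assumptions here, not the thesis tool's exact rules.
import hashlib

def hash_key(*business_key_values: object, delimiter: str = "||") -> str:
    """Return a 32-character MD5 hash key for the given business key values."""
    normalized = delimiter.join(
        str(value).strip().upper() if value is not None else ""
        for value in business_key_values
    )
    return hashlib.md5(normalized.encode("utf-8")).hexdigest().upper()

# The same input always produces the same key, so hubs, links, and satellites
# can be loaded in parallel without lookups or a shared sequence generator.
print(hash_key("AW00000001"))          # hub hash key from one business key
print(hash_key("AW00000001", "SO437")) # link hash key from two business keys
```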

In the Data Vault 2.0 model, data storage and information generation are implemented separately. In our research, we focus on the area of data storage, not the information mart layer, also known as the data mart layer. Separating data storage from the information mart layer enables the automation of the data storage workflow, as data are loaded and stored as raw data. Indeed, a data warehouse provides "a single version of the facts" because it does not in itself define bad and good data. Information is added to the data to facilitate auditing and updating, such as load time, checksum, hash key, and source system. However, the data in the data storage are not changed at any stage. An incorrect data item is also left as it is for traceability, and because the incorrect data item must be corrected in the source system itself in any case. Data editing operations are performed only at the information mart level.

The Data Vault 2.0 solution is based on business concepts that vary from organization to organization. Before starting a data warehouse project, it is a good idea to find out the fundamental concepts of the business and the relations between them. A business key is defined for each concept. The business key is discussed in more detail in the hub subsection.

The implementation of the data warehouse is based on business keys and data decentralization. The hash keys, previously mentioned in this section as an add-on to the data to facilitate updating, are calculated for the data to be stored based on the unique identifiers. When data are loaded into hub, satellite, and link structures, the related data are combined with the same hash key value. Hubs, satellites, and links, as well as their features and loads, are discussed in more detail in the separate subsections of this section. The Data Vault 2.0 method is an insert-only method. Thus, if, for example, sales order data have changed in the source system, the checksums on the old row of the sales order in the data warehouse and on the new row entering the data warehouse do not match. This saves the new version of the data alongside the previous version. Different versions of the data can be distinguished, for example, based on the load time. The insert-only method using checksums and hash keys also allows for faster updates and a lower chance of errors. The use of hash keys, on the other hand, enables direct references, streamlines both loads and searches, enables parallel runs (including massively parallel processing, MPP), and allows hybrid solutions combining relational databases and NoSQL data [8], [9]. The latest versions of the data loaded into the information mart are usually retrieved for reporting, but sometimes historical data are also used for reporting.
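As a hedged illustration of the checksum idea, the sketch below computes a row checksum (often called a hash diff) over the descriptive columns of a hypothetical staging table. The table and column names and the semicolon delimiter are assumptions; the delimiter is only there to keep column boundaries unambiguous before hashing.

-- Illustrative sketch: a row checksum ("hash diff") over descriptive columns.
-- All names are hypothetical; CONCAT treats NULL values as empty strings,
-- so missing attributes do not break the hash.
SELECT
    SalesOrderNumber,
    CONVERT(CHAR(32),
            HASHBYTES('MD5',
                CONCAT(OrderStatus, ';',
                       CONVERT(VARCHAR(30), OrderDate, 126), ';',
                       CONVERT(VARCHAR(30), TotalAmount))),
            2) AS HashDiff
FROM stg.SalesOrder;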

Potentially incorrect data can be compiled into Error marts, based on which, among other things, the quality of the data can be improved.

3.2.1 Hubs

Data Vault seeks to address the challenge of changes in source systems by separating the business keys associated with business concepts, and the associations between them, from the attributes that describe them [5]. For each key business concept, a business key is defined that can be used to identify the information. Thus, business keys are unique but may consist of many attributes. A business key is identified by the fact that it is a column that is continuously used to locate and identify data.

Business keys have a definite meaning in the source system, such as a customer number or sales order number. Business keys must be true business keys that are not tied to any source system. Surrogate keys generated by source systems are not good business keys, as they may not be usable at all in the integration of different source systems and, in the worst case, may change due to changes or upgrades to the source system. Business keys are the most stable and central parts of Data Vault modeling and form the skeleton of the model. Therefore, the selection of business keys is the most critical step in building a stable and efficient model [5].

Individualized and infrequently changing business keys related to the business concept are stored in the hub. In addition to the business keys, the hub also stores the surrogate key, load time, source, and metadata corresponding to each key regarding the origin of the business key [5]. However, in Data Vault 2.0 modeling, the hub’s surrogate primary key has been replaced with a hash key calculated from the hub’s business key columns. A modern choice is to generate the hash key, both in the hub and in other parts of the Data Vault 2.0 model, using the MD5 algorithm. The MD5 algorithm results in a 128-bit hash, which is typically represented as a 32-character key. The hub must not have duplicates of the business keys, except when two systems supply the same business key. Hubs are always modeled first and should always be associated with at least one satellite, and they do not historize data [5].
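A minimal sketch of what such a hub table could look like on SQL Server is given below; the schema, table, and column names as well as the data types are assumptions, not the structures generated by the automation tool discussed in this research.

-- Illustrative hub sketch; names and data types are assumptions.
CREATE TABLE dv.HubCustomer (
    CustomerHashKey CHAR(32)      NOT NULL,  -- MD5 hash of the business key
    CustomerNumber  NVARCHAR(50)  NOT NULL,  -- business key from the source system(s)
    LoadDate        DATETIME2     NOT NULL,  -- load time
    RecordSource    NVARCHAR(100) NOT NULL,  -- source system
    CONSTRAINT PK_HubCustomer PRIMARY KEY (CustomerHashKey)
);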

With hubs as the backbone of the model, hubs can be used to perform data profiling and basic analysis based on a business key. Based on the content of the hub, it can be concluded, among other things, how many keys related to the concept exist, for example, how many customers the organization has. In addition, it can be determined, for example, how many sources the customer data consists of. Data quality problems can also be noticed in hubs. Hubs can have several nearly identical customers, but from different sources. This may be related to master data management issues [35].

3.2.2 Links

Link tables are the main reason for the flexibility and scalability of Data Vault modeling. They are modeled so that any changes and additions to the model are easy to add without affecting previous components. In Data Vault 2.0 modeling, link tables store the relationships of business concepts, such as references to sales orders belonging to a customer. These relations can be considered as the foreign keys of source system tables. In hubs, foreign keys cannot be represented because of the modeling rules, so they are presented in links. The purpose of the link is to capture and store the relationship of the data elements at the lowest possible granularity.

A link between hubs records only unambiguous connections, and there can also be transactional links. The relationship stored in a link is always many-to-many (m:n), because this creates flexibility in the model [5]. Usually, many-to-many relationships should be avoided, but if the business model is done right, the only thing that can change is the cardinality. Most likely, a city can only be located in one country, but even if this changed, the ratio of one to one (1:1) could change to the ratio of one to many. However, link tables can handle this change without anyone modifying the model or load process.

Links can be linked to other links to deal with the granularity of the data because, for example, adding a new key to a source table would change the grain of that table. The organization can have a link between the customer and the address, to which a reference to a link between the product and transport company hubs could be added. However, this is not recommended.

The link, like the hub, is a simple structure. The link contains the primary key of the link, which is a hash key. In addition, the link includes the hash keys of its parent hubs, the load time, and the source. As in hubs, the link only saves the connection the first time it appears in Data Vault 2.0 (no historizing) [5].
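The following is a minimal sketch of a link table between hypothetical customer and sales order hubs; all names and data types are assumptions.

-- Illustrative link sketch; names and data types are assumptions.
CREATE TABLE dv.LinkCustomerSalesOrder (
    CustomerSalesOrderHashKey CHAR(32)      NOT NULL,  -- hash over both business keys
    CustomerHashKey           CHAR(32)      NOT NULL,  -- reference to dv.HubCustomer
    SalesOrderHashKey         CHAR(32)      NOT NULL,  -- reference to dv.HubSalesOrder
    LoadDate                  DATETIME2     NOT NULL,
    RecordSource              NVARCHAR(100) NOT NULL,
    CONSTRAINT PK_LinkCustomerSalesOrder PRIMARY KEY (CustomerSalesOrderHashKey)
);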

3.2.3 Satellites

Hubs and links form the structure of the model, but do not contain any descriptive data. Satellites contain descriptive information related to business concepts. [5]

Satellites contain historical information, i.e., information on which changes are actively monitored. The satellite contains date attributes for the start and end dates of the validity of the data, which are either provided by the source system or generated by logical procedures and effectivity tables in the data warehouse [5]. However, often no end date is available, and creating logical procedures and additional tables is unnecessarily cumbersome, so row changes are tracked using checksums computed over the entire row data, for example, with the aforementioned MD5 algorithm, which is also often used to generate hash keys. Thus, if the data describing the key of a business concept change, a new version of the row is stored in the satellite when the checksums do not match. If the checksums match, the new version is not stored in the data warehouse. This speeds up the loading process considerably because, instead of comparing the values of several columns, only one column value needs to be compared in the load.

The satellites also include a link that connects them to their parent hub or link. This connection is a hash key based on the same information as in the hub or link. Thus, a satellite can join either a hub or a link, and multiple satellites can join to the same hub. If more than one satellite is connected to the hub, this may be due to, for example, the frequency of satellite updates (such as bank current account transactions compared to savings account transactions). It is also good practice to separate satellite data related to the same issue from different source systems into different satellites, but these may also be in the same satellite because of the column identifying the source system. [5]
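A minimal satellite sketch for the hypothetical customer hub is shown below; the descriptive columns, names, and data types are assumptions. The primary key combines the parent hash key and the load date, so every new version of a row becomes a new entry.

-- Illustrative satellite sketch; names, columns, and types are assumptions.
CREATE TABLE dv.SatCustomerDetails (
    CustomerHashKey CHAR(32)      NOT NULL,  -- hash key of the parent hub
    LoadDate        DATETIME2     NOT NULL,  -- load time of this version
    RecordSource    NVARCHAR(100) NOT NULL,  -- source system
    HashDiff        CHAR(32)      NOT NULL,  -- checksum over the descriptive columns
    CustomerName    NVARCHAR(200) NULL,      -- descriptive attributes
    City            NVARCHAR(100) NULL,
    CONSTRAINT PK_SatCustomerDetails PRIMARY KEY (CustomerHashKey, LoadDate)
);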


3.2.4 Advanced Components of Data Vault 2.0

The Data Vault 2.0 model also includes tables other than hubs, satellites, and links. These tables include the Reference, Effectivity, Point in Time (PIT), and Bridge tables, among others. [5]

Reference tables are shared among the other tables and are not linked to other tables by relations in the model. In general, reference tables are used by satellites. Reference tables include, for example, calendars and codebooks that exist independently without information from other tables, such as a country or language codebook. Reference tables also contain frequently referenced data and are used to curb unnecessary storage usage.

The Effectivity table is a satellite that stores the start and end dates of the temporal validity of rows of links and hubs. Effectivity satellites can also store deletion dates and other metadata that describe the parent information within or across a specified timeline.

Bridge and PIT tables exist to improve performance. Bridge tables are placed between the Data Vault and the information mart and are essential in those instances where joins and aggregations against the raw data vault cause performance problems. So, without sacrificing the required granularity or performance, it is a good idea to take advantage of Bridge tables. Bridge tables store a cross-combination of hub and link keys that is controlled by a WHERE clause. The number of rows is thus managed based on business needs. [5]

If there is a need to view the data at different times, snapshot-type PIT data structures can be formed for reporting based on the load times, in which direct reference keys are compiled for the versions of the data valid at the desired times. PIT tables store hub and link hash keys, hub business keys, and primary key values for the surrounding satellites.
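To illustrate the idea, a minimal PIT sketch for the hypothetical customer hub and two of its satellites is given below; the table and column names are assumptions. Each row records, per snapshot date, which satellite versions were valid, so information mart views can use simple equi-joins instead of outer joins.

-- Illustrative PIT sketch; names are assumptions. Each row points to the
-- satellite versions (load dates) that were valid at the snapshot date.
CREATE TABLE dv.PitCustomer (
    CustomerHashKey            CHAR(32)  NOT NULL,
    SnapshotDate               DATE      NOT NULL,
    SatCustomerDetailsLoadDate DATETIME2 NOT NULL,
    SatCustomerFinanceLoadDate DATETIME2 NOT NULL,
    CONSTRAINT PK_PitCustomer PRIMARY KEY (CustomerHashKey, SnapshotDate)
);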

Bridge and PIT tables are both snapshot-based, timed loads on the business data vault side, including raw data and critical elements. The purpose of both tables is to eliminate outer joins and to provide full scalability for views over the Data Vault. In addition, they are intended to improve performance and data partitioning.

3.2.5 Conclusions of Structures and Best Loading Practices

Hubs and links describe the structure of the data model and are used to describe how data items are related to each other. Satellites, in turn, represent the descriptive information. Data Vault 2.0 is intended for enterprise data warehouse design, located between the operational systems and the information marts. Data Vault 2.0 integrates data from different sources, linking them while preserving the context of the source systems. Data Vault stores the facts in their raw form, while the data are interpreted in the information marts.

Hubs, links, and satellites have several attributes in common. These include hash keys, natural business keys, load date, and source. Hubs, links, and satellites can also have an end date for the validity of the row, but this is not mandatory and nowadays is often not even included.

In data modeling, the connection between hubs is only allowed via links. No reference keys (also known as foreign keys) or descriptive information can be found in the hubs. The reference keys can be found in the links and the descriptive information in the satellites. There must be at least two reference keys in a link. Both hubs and links have their own surrogate key as the primary key. Recursive connections are also established with a link structure. Link-to-link connections are possible but not recommended. Event-type tables are usually represented as link tables.

Satellites should always have one parent table, either a hub or a link. The satellite primary key is a combination of the parent table’s key and the load date. Satellites cannot have satellites, but multiple satellites can be connected to one hub. A satellite associated with a single hub can be divided into multiple satellites depending on the type, frequency of change, or data source.

In Data Vault 2.0 modeling, the hubs are modeled first, based on the business keys selected for the business concepts. Next, the relations between the hubs, and possibly between links, are modeled with link tables. Third, satellites are modeled for hubs and links, which is where the data warehouse historization takes place. Finally, possible independent tables, such as reference tables, are modeled.

The data must not be changed when loaded to the data warehouse; the data are loaded with errors and in raw format [5]. Only technical changes and additions to facilitate auditing may be made. The data are only reviewed and processed at the data distribution layer. Errors can also be driven into error marts to improve data quality in the future. A simple insert-only load reduces costs and minimizes changes. Business rules are brought closer to the users.

The ETL loading of Data Vault 2.0 is relatively straightforward and follows the Data Vault 2.0 modeling itself. First, all hubs are loaded, and new surrogate keys are created for possible new business keys. Because the hubs do not attach directly to each other, they can be loaded in parallel.
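As a hedged sketch of an insert-only hub load, the T-SQL below inserts only those business keys that do not yet exist in the hypothetical hub used in the earlier examples; the staging table, column names, and record source value are assumptions.

-- Illustrative insert-only hub load; all names are assumptions.
INSERT INTO dv.HubCustomer (CustomerHashKey, CustomerNumber, LoadDate, RecordSource)
SELECT DISTINCT
    CONVERT(CHAR(32), HASHBYTES('MD5', UPPER(LTRIM(RTRIM(s.CustomerNumber)))), 2),
    s.CustomerNumber,
    SYSUTCDATETIME(),
    'SalesSystem'
FROM stg.Customer AS s
WHERE s.CustomerNumber IS NOT NULL
  AND NOT EXISTS (
        SELECT 1
        FROM dv.HubCustomer AS h
        WHERE h.CustomerNumber = s.CustomerNumber
      );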

Next, all links between the hubs are explored and loaded, and new surrogates are created for possible new relations. Since the links should not be directly related to each other, these can also be loaded in parallel. At the same time, the satellites that attach directly to hubs, and their surrogates, can be created and updated.
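A corresponding sketch for an insert-only link load is given below, reusing the hypothetical link table from the links subsection; the staging table and columns are assumptions, and both hub hash keys are recomputed from the business keys so that no lookups into other tables are needed.

-- Illustrative insert-only link load; all names are assumptions.
WITH src AS (
    SELECT DISTINCT
        CONVERT(CHAR(32), HASHBYTES('MD5',
            CONCAT(UPPER(LTRIM(RTRIM(s.CustomerNumber))), ';',
                   UPPER(LTRIM(RTRIM(s.SalesOrderNumber))))), 2) AS LinkHashKey,
        CONVERT(CHAR(32), HASHBYTES('MD5',
            UPPER(LTRIM(RTRIM(s.CustomerNumber)))), 2)           AS CustomerHashKey,
        CONVERT(CHAR(32), HASHBYTES('MD5',
            UPPER(LTRIM(RTRIM(s.SalesOrderNumber)))), 2)         AS SalesOrderHashKey
    FROM stg.SalesOrder AS s
)
INSERT INTO dv.LinkCustomerSalesOrder
       (CustomerSalesOrderHashKey, CustomerHashKey, SalesOrderHashKey, LoadDate, RecordSource)
SELECT src.LinkHashKey, src.CustomerHashKey, src.SalesOrderHashKey,
       SYSUTCDATETIME(), 'SalesSystem'
FROM src
WHERE NOT EXISTS (
    SELECT 1
    FROM dv.LinkCustomerSalesOrder AS l
    WHERE l.CustomerSalesOrderHashKey = src.LinkHashKey
);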

Once all the links and any new connections have been established, the satellites that are attached to the links can be added and updated. These satellites can also be loaded in parallel.
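Finally, a hedged sketch of a checksum-based satellite load: a new row version is inserted only when the incoming checksum differs from the latest stored one. The prepared staging view (stg.CustomerPrepared, assumed to already contain the hash key and hash diff) and all other names are assumptions.

-- Illustrative insert-only satellite load with change detection; names are assumptions.
INSERT INTO dv.SatCustomerDetails
       (CustomerHashKey, LoadDate, RecordSource, HashDiff, CustomerName, City)
SELECT s.CustomerHashKey, SYSUTCDATETIME(), s.RecordSource, s.HashDiff,
       s.CustomerName, s.City
FROM stg.CustomerPrepared AS s
OUTER APPLY (
    SELECT TOP (1) sat.HashDiff
    FROM dv.SatCustomerDetails AS sat
    WHERE sat.CustomerHashKey = s.CustomerHashKey
    ORDER BY sat.LoadDate DESC
) AS latest
WHERE latest.HashDiff IS NULL           -- business key not seen before
   OR latest.HashDiff <> s.HashDiff;    -- descriptive data has changed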
