
Utilization of Data Mesh Framework as a Part of Organization’s Data Management

Simo Hokkanen

Master’s thesis

School of Computing
Computer Science

September 2021


UNIVERSITY OF EASTERN FINLAND, Faculty of Science and Forestry, Joensuu
School of Computing

Computer Science

Hokkanen Simo Santeri: Utilization of the data mesh framework as a part of an enterprise's data management

Master's thesis, 63 p., 1 appendix (3 p.)

Supervisors of the Master's thesis: PhD Virpi Hotti and Antti Loukiala
September 2021

Digital technology, constantly evolving information systems, and various software packages produce vast amounts of data. We produce, consume, and utilize data in more and more ways. Every modern organization that wants to keep up with rapid technological development utilizes and develops its business with the help of data. The diversifying services of the data industry and the speed of development require new operating models. The latest innovations and trends in the data industry interest researchers as well as companies. The aim of this thesis is to elucidate the main features and characteristics of the data mesh framework. In addition, the thesis examines how a domain is defined and the challenges associated with defining it. The research methods are a literature review and a survey. The literature was used to investigate the characteristics of the data mesh and the challenges of applying it. Based on the literature review, the following questions were answered: how is a domain defined, and do the common data model (CDM) and data mesh work together? Defining a domain is challenging, but it should be consistent, and the common data model (CDM) does not support data mesh principles. The empirical study tests the functionality of a data mesh tool created for this thesis through theme interviews. The survey examined the suitability of the data mesh framework for organizations and sought to identify different data management operating models. Based on the survey, it can be stated that organizations already exhibit various features of distributed architecture, and all case organizations are able to utilize the characteristics of the data mesh in the way they want. The study shows that in organizations where data mesh was already applied, data utilization had become more streamlined. The thesis also points out various challenges in enterprises' data management situations and highlights factors that hinder data mesh, such as an ambiguous domain definition, unclear data ownership, highly centralized data solutions, and a low level of data literacy.

Keywords: data mesh, data management, data analytics, distribution, theme interview, data product

ACM classes (ACM Computing Classification System, 1998 version): C.2.4, D.2.11, E.0, H.2.0 & K.6.3.


UNIVERSITY OF EASTERN FINLAND, Faculty of Science and Forestry, Joensuu
School of Computing

Computer Science

Hokkanen Simo Santeri: Utilization of Data Mesh Framework as a Part of Organization's Data Management

Master’s Thesis, 63 p., 1 appendix (3 p.)

Supervisors of the master’s Thesis: PhD Virpi Hotti and Antti Loukiala September 2021

Digital technology, constantly evolving information systems, and various software packages produce vast amounts of data. We produce, consume, and utilize data in more and more ways. Every modern organization that wants to keep up with the rapid technological development is utilizing and developing its business with the help of data.

The diversifying services of the data industry and the speed of development require new operating models. The latest innovations and trends in the data industry are of interest not only to companies but also to researchers. The aim of this thesis is to elucidate the main features and principles of the data mesh framework. In addition, the thesis explains how to define a domain and the challenges associated with defining it.

The research method is a literature review and a survey. The characteristics of the data mesh and the challenges of its application were investigated with the help of literature.

Based on the literature review, the following questions were answered: how is the domain defined, and do the common data model (CDM) and data mesh work together? Defining a domain is challenging, but it should be consistent. The common data model (CDM) does not support data mesh principles. In the empirical study, the functionality of the data mesh tool created for this thesis is tested through theme interviews. The survey examined the suitability of the data mesh framework for organizations and sought to find different information management operating models. Based on the survey, it can be stated that organizations already have different features of a distributed architecture, and all case organizations are able to utilize the principles of the data mesh in the way they want. The study shows that in organizations where data mesh was already applied, data utilization was more streamlined. The thesis also points out various challenges in enterprise information management situations and highlights factors that prevent data mesh, such as an ambiguous domain definition, unclear data ownership, highly centralized data solutions, and low data literacy.

Keywords: data mesh, data management, data analytics, distribution, theme interview, data product

CR Categories (ACM Computing Classification System, 1998 version): C.2.4, D.2.11, E.0, H.2.0 & K.6.3.


Acknowledgment

This thesis has been carried out in collaboration with Solita Ltd in the summer of 2021. I would like to express my greatest appreciation to my thesis supervisors, PhD Virpi Hotti and Antti Loukiala, for helping and pushing me forward during my thesis journey.


List of abbreviations

ETL Extract, Transform & Load
CDM Microsoft Common Data Model
IT Information Technology
API Application Programming Interface
AI Artificial Intelligence
ML Machine Learning
ERP Enterprise Resource Planning
CRM Customer Relationship Management
SOA Service-Oriented Architecture
DDD Domain-Driven Design

RQ Research Question

DM Data Mesh


Table of Contents

1 Introduction ... 1

1.1 Research Questions ... 4

1.2 Data Management ... 5

1.3 Data Models ... 6

1.3.1 Microsoft Common Data Model ... 7

1.3.2 Conformed Dimensions ... 9

1.4 The Importance of Data ... 10

2 Data Mesh Framework ... 12

2.1 Domain-Driven Design ... 16

2.1.1 Domain ... 17

2.1.2 Context Mapping ... 19

2.1.3 Ubiquitous language ... 19

2.2 Service-Oriented Architecture ... 20

2.2.1 Microservices ... 21

2.2.2 DataOps Culture ... 22

2.2.3 Distributed Systems ... 22

2.3 Non-Invasive Data Governance ... 23

2.3.1 Data Ownership ... 24

2.3.2 Reshaping Data Teams ... 24

2.4 Data as a Product ... 26

3 Data Mesh Question Battery for Hype Landing ... 28

3.1 Data Mesh Suitability Report ... 29

3.2 Question Layouts ... 30

3.2.1 Organization Questions ... 31

3.2.2 Data Management Questions ... 32

3.3 Study Reliability ... 32

4 Case Study of Possible Data Mesh Organizations ... 34

4.1 Case Companies ... 35

4.2 Theme Interview Study ... 36

4.2.1 1st and 2nd Dimension Questions ... 36

4.2.2 3rd and 4th Dimension Questions ... 38

4.2.3 5th and 6th Dimension Questions ... 41

4.3 Results & Observations ... 44

5 Conclusion ... 47

References ... 51
Appendix

Appendix 1: Theme Interview Questions Form


1 Introduction

Digital technology is present in almost every consumer’s or enterprise’s daily actions.

For example, people have more and more technological innovations carried with them:

smartphones, smart watches, and laptops are a good instance of this. The rapid development of software and systems engineering requires ever more data to reach the ambitions that technology aims to fulfil. Therefore, the massive amounts of data gathered through different systems have led the data business to grow rapidly over the past few years.

We stand on the brink of a technological revolution that will fundamentally alter the way we work, live, and relate to others around us. After three industrial revolutions introducing us to steam power, electricity, and automated production, we are transforming to the next industrial revolution: the fourth industrial revolution. This fourth revolution is the digital revolution that has been occurring since the middle of the last century. Its main characteristics are the internet of things (IoT), autonomous robots, cloud computing, and, overall, the digital transformation towards the world of information systems (Schwab, 2016).

For the past few years, big data has been one of the most exciting and stimulating trends in the data engineering world. Big data has made enterprises develop their data strategies and projects to more complex levels. Almost every business has data to benefit from, which is why data engineering and information management are becoming a larger part of a successful business. However, for some companies, data is still an unclear object they are trying to tackle. Therefore, an efficient data architecture is required for enterprises to realize the full potential of the data they own.

Artificial intelligence (AI) is all around us, and we are already experiencing drones and self-driving cars with digital assistants and software that helps us in our daily tasks.

Impressive progress has been made in AI research in recent years. The rapid growth of computing power and availability of vast amounts of data contribute together to developing the digital environment we live in (Schwab, 2016).


questions and the theoretical work behind this thesis. The literature used and the information retrieval are decoded and explained. The main research keywords are reviewed. The first section likewise introduces the basics of data management and data models, and reflects on why data is so important.

The thesis proceeds in the following order. The second section aims to give a strong theoretical background in the data mesh paradigm. The main principles of data mesh, distributed systems, and service-oriented architecture are covered. Previous studies are also examined, and the downsides and upsides of the data mesh (DM) architecture are explained. The foundation of data mesh lies deep in domain-driven design (DDD).

Domain-driven design will be tackled in numerous parts of this thesis. One research question is also formed around what a domain is and how to define domains in an organization. The data industry is full of different and versatile terms.

This thesis does not aim to explain every possible data term. The third section builds the core framework of data mesh into the questions formed for the theme interviews. The question battery is formed around the most pivotal questions of data mesh and distributed architecture. The theme interview results will give a comprehensive insight into how professionals in this field comprehend data mesh. The fourth chapter will define when and how data mesh could be applied in the customer organizations, and will showcase these organizations. The case study will be carried out through interviews with the selected organizations. The question battery and the Data Mesh Suitability Reporting tool will be put to the test. The results and the most interesting answers will be highlighted in the fourth chapter. The last chapter of this thesis will discuss the results, dive into the most important findings, and draw conclusions on the research questions. The last chapter will also look at the future of data mesh and go through possible follow-up research.

The theoretical background for this study is assembled from the latest and most relevant studies and reports on the data mesh paradigm. Data mesh as a concept is moderately new, making the theoretical point of view fractionally narrow and challenging to execute effectively. The concept of data mesh also keeps changing and restructuring at present, and this causes the viewpoints behind the concept to reform. It also means that something we write or state just now could be superseded in moments or by a few newly written articles. The literature part includes official reports,


statistics, whitepapers, and current news. Solita Ltd also provided great amounts of literature and statistics from data management and data engineering projects. Different hype trends and topics in the software and data engineering fields usually gather people to the same place for open discussion and learning. One vendor-independent community in Slack was formed at the very early stages of the data mesh hype. This "Data Mesh Learning" Slack group performed a crucial role in gathering new information, standpoints, and opinions about the new paradigm.

Attention has been paid to the quality of references by examining how widely the publications used have been cited. The Julkaisufoorumi.fi website, which offers a level classification for academic publications, has also been used. The thesis background material has been searched through the Google Scholar search engine. The ACM, Scopus, and IEEE digital libraries have also been used to find previous literature. The most frequent keywords were data mesh, data management, information architecture, data product, domain, and data ownership.

It proved to be exceedingly difficult to cover the subjects entirely via published academic literature throughout the thesis. After all, there is not much official academic literature published yet. Overall, the field of data engineering is young, and it is moving closer to software engineering. Because the field of data engineering is so nascent, much of the conversation on current challenges and the state of the art takes place in what is commonly known as "grey literature".

Grey literature is, however, quite common in software engineering-related fields, including data engineering and computer science. Grey literature usually includes different sources (e.g., blog posts, videos, podcasts, and whitepapers). A multivocal literature review (MLR) recognizes the need for several different sources of opinions or voices to be heard. Instead of constructing the evidence from only the knowledge published in formal and official academic settings, a multivocal literature review also uses all accessible writings or other publications around a popular, often current topic (Garousi et al., 2016).


engineering and processing more structured and efficacious (Dehghani, 2020a), making data mesh a good research topic to explore further.

Figure 1: The data science hierarchy of needs, in the form of a pyramid, describes different data correlations. Credit Monica Rogati, Hackernoon.

Understanding the data we use is the key principle and the aspect that must change in the data industry. Far too often, ETL (Extract, Transform, Load) processes keep failing due to the constantly growing complexity of the labyrinth of data pipelines (Dehghani, 2020a). Data mesh is a high-level solution with decentralized and distributed responsibility given to the people nearest to the data, to support continuous transformation and scalability (Dehghani, 2020a).

1.1 Research Questions

This thesis aims to answer the most ponderable questions about the data mesh paradigm. We are interested in studying how data mesh applies to different companies in different data management situations. Leaning on this sentiment, the following research questions have been formed:

Research Question 1: In what situations or organizations can data mesh be applied, and how should an organization proceed towards data mesh?


Research Question 2: How to define a domain in your organization? (What is a domain?)

Sub-questions:

• Is data engineering more streamlined when using the data mesh procedure?

• What are the key challenges and benefits of data mesh?

• Does the common data model (CDM) support the data mesh framework?

The research questions consist of two main questions and three sub-questions. These questions provide an exploratory look into what requirements and challenges are found in the context of data mesh and distributed domain-driven design (DDD).

1.2 Data Management

Galetto (2016) defines data management as an administrative process that contains validating, acquiring, storing, securing, and processing the data to ensure the reliability and accessibility of the data to its users. Bourque & Fairley (2014) state that one key concept in data management and database systems is the data schema. Bourque & Fairley (2014) define a schema as "the relationships between the various entities that compose a database". Therefore, we can see a schema as a description of the entire database structure and the blueprints attached to the data.

Business questions dictate the data required to answer them. Eventually, data needs to answer these business questions to generate the insights needed for data-driven decision making (Galetto, 2016). With the help of data management platforms, organisations can gather, sort, and house their information and then repackage it in various ways to achieve the demanded analytics or insights. This way of information management will eventually create value for the company to grow its data business in the right direction.

A data pipeline is the structure and mapping of how the data is processed towards its use cases, such as building machine learning models. Data is produced and consumed and must be altered by a line of operations called a data pipeline (Quemy, 2019).
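To make this concrete, the "line of operations" can be sketched as plain function composition. This is an illustrative sketch only, not from the thesis; the step names and the toy comma-separated record format are assumptions.

```python
from functools import reduce

# A pipeline is an ordered line of operations applied to raw data.

def extract(raw_rows):
    """Parse raw comma-separated strings into records."""
    return [dict(zip(("customer", "amount"), r.split(","))) for r in raw_rows]

def transform(rows):
    """Clean and type the records."""
    return [{**r, "amount": float(r["amount"])} for r in rows]

def load(rows):
    """'Load' by aggregating into a consumable result."""
    return sum(r["amount"] for r in rows)

def run_pipeline(raw, steps):
    """Thread the data through each step in order."""
    return reduce(lambda data, step: step(data), steps, raw)

total = run_pipeline(["alice,10.0", "bob,5.5"], [extract, transform, load])
print(total)  # 15.5
```

Each step only needs to agree with its neighbours on the shape of the data, which is exactly where growing pipeline labyrinths become fragile.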

Modern technologies typically do not create bottlenecks for data management, as these new technologies usually scale both horizontally and vertically. One of the key aspects of good data management is to optimize functions and data processing in rapidly growing, large software projects. Continuously scaling large software development is not a new problem to solve. However, a basic foundation of clear and effective data management is necessary to scale with modern technologies and customers' needs.

1.3 Data Models

A data model can be seen as a high-level abstract premise that organizes data features and elements. These features include data entities and attributes. The model defines the data elements and the relationships between the elements and attributes. The goal of a data model is to show how data is stored, connected, updated, and accessed. Figure 2 highlights a simple example of a data model between a customer and an address.

Figure 2: A simple logical example of the data model. Credit Scott W. Ambler, 2006.

The data model is built from the viewpoint of the raw information used in the specific concept. Data tables and the relationships between the data, such as entities and attributes, define data models (Bourque & Fairley, 2014).
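A customer–address model like the one in Figure 2 could be sketched as entities with attributes and a one-to-many relationship. The attribute names below are assumptions for illustration, not taken from the figure.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Address:
    """An entity with three attributes."""
    street: str
    city: str
    postal_code: str

@dataclass
class Customer:
    """An entity related to Address: one customer may have many addresses."""
    customer_id: int
    name: str
    addresses: List[Address] = field(default_factory=list)

home = Address("1 Main St", "Joensuu", "80100")
customer = Customer(1, "Simo", [home])
assert customer.addresses[0].city == "Joensuu"
```

The entities, their attributes, and the relationship between them are exactly what a logical data model captures, independent of how the rows are eventually stored.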

Programs and applications work on data. Data must be organized and defined within computers, and only then can systems run applications or programs. Modern software development and data development practices can automate multiple data modelling steps, making the data available faster for consumption and causing a need for efficient data management. The development of data practices is a continuous task for organizations.

An algorithm is a set of precise instructions for computers on how to complete a certain task. Algorithms are used to execute complicated programs and applications more cogently (Bourque & Fairley, 2014). Most AI and ML (Machine Learning) models require high-level and swift algorithms to execute their tasks in a preferred way.

1.3.1 Microsoft Common Data Model

Adobe, Microsoft, and SAP published an "Open Data" initiative at the Microsoft Ignite 2018 event. The result of this initiative, the Common Data Model (CDM), aims to model common business concepts into one homogeneous data model. Applications and systems can use this model as such or with small extensions (Hansen, 2020).

CDM defines a group of commonly used business objects (entities), attributes, and relations between the objects. Typical entities in this data model are, for example, Account, Contact, Activity, Owner, Task, Product, and Order. The complete model includes roughly 700 entities, with about 100 fields per entity (Microsoft - CDM, 2020).


The announcement published by these three companies reveals the main practical ambitions as follows: a) getting rid of data silos, and b) creating one united data model, which illustrates the basic business concepts and their relations with each other (Hansen, 2020).

Figure 3: Microsoft CDM featured as a complete-scale example. Credit https://docs.microsoft.com/en-us/common-data-model/

CDM does not dictate the actual storage method of the data. Information can be saved in any format or structure needed. The model just defines the structure of the data (the schema) in which data must be saved to guarantee compatibility. For example, if the data is stored in a traditional SQL (Structured Query Language) relational database, CDM would define the structure of the database, such as the names of the tables, the columns, and the foreign-key references between the tables (Hansen, 2020).
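As a hedged illustration of that last point, a much-simplified, hypothetical fragment of such a shared schema might be mapped to relational table definitions like this. The real CDM uses JSON entity definitions and is far larger; the entity and column names here are only loosely inspired by it.

```python
# Hypothetical, tiny fragment of a CDM-style shared schema: entity names,
# columns, and foreign-key references between entities.
schema = {
    "Account": {"columns": {"accountId": "INT", "name": "TEXT"},
                "foreign_keys": {}},
    "Contact": {"columns": {"contactId": "INT", "fullName": "TEXT", "accountId": "INT"},
                "foreign_keys": {"accountId": "Account(accountId)"}},
}

def to_ddl(entity, spec):
    """Render one entity of the shared schema as a CREATE TABLE statement."""
    cols = [f"{col} {typ}" for col, typ in spec["columns"].items()]
    fks = [f"FOREIGN KEY ({col}) REFERENCES {ref}"
           for col, ref in spec["foreign_keys"].items()]
    return f"CREATE TABLE {entity} ({', '.join(cols + fks)});"

for entity, spec in schema.items():
    print(to_ddl(entity, spec))
```

The point is that the schema definition, not the storage engine, is what the model standardizes: any database built from the same definition ends up with compatible tables, columns, and references.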

CDM strives to ease the issues caused by data silos. A practical example of this silo effect is a company with three massive operative systems: marketing, sales, and customer service applications. Every application has a data structure for the Customer, which is almost equal to the one in its sister applications. These systems have been built by different administrators in different periods. If these applications used the same Common Data Model, they would all have the same understanding of the Customer: what the Customer is and what data is associated with it. If these applications are correctly built, one storage structure, one Customer view, one interface, and one technological tool would be enough to use all this data (Hansen, 2020).

CDM should offer a solution for always having up-to-date data in use. The second benefit concerns the integrations and conversions between multiple systems. These conversions would be more efficient, and the general view easier to understand, with one united Common Data Model (Microsoft - CDM, 2020). Overall, CDM and similar concepts represent a somewhat old-fashioned, traditional understanding of data management. CDM shows us the classical and orthodox view of how data models, attributes, and entities should be treated. These classic reckonings and the new, pivotal data mesh differ in many ways. Based on this literature about CDM, we can say that data mesh applies different patterns and logic to data than CDM does. Data mesh heavily relies on domain-driven design; therefore, a united customer definition is not the solution. Domain-driven design and data mesh use domain-specific customer definitions that can be mapped together in the future, if even needed.

1.3.2 Conformed Dimensions

A conformed dimension is a dimension that has the same meaning for every fact it relates to in data warehousing. Therefore, using a conformed dimension can make the whole ETL process more efficient, as it does not have to perform various tasks to process the same dimension-related information more than once (Serra, 2011).

McHugh (2017) defines conformed dimensions as those dimensions that have been blueprinted so that the dimension can be used across many tables in distinct subject areas of the data warehouse or lake. Conformed dimensions can provide the customer with insights into their data that exceed the initial needs and expectations. Eventually, this is exactly the main point behind strong and effective information management.
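A minimal sketch of the idea, with invented data: one shared customer dimension is joined to facts from two subject areas, so the dimension-related information is processed once and means the same thing everywhere.

```python
# One conformed "customer" dimension, shared by two subject areas.
customer_dim = {
    1: {"name": "Acme", "segment": "enterprise"},
    2: {"name": "Bea", "segment": "consumer"},
}

# Fact tables from distinct subject areas referencing the same dimension key.
sales_facts = [{"customer_key": 1, "revenue": 100.0},
               {"customer_key": 2, "revenue": 20.0}]
support_facts = [{"customer_key": 1, "tickets": 3}]

def enrich(facts, dim, key="customer_key"):
    """Join facts to the shared dimension instead of keeping per-area copies."""
    return [{**fact, **dim[fact[key]]} for fact in facts]

# Both subject areas now agree on what a customer row means.
assert enrich(sales_facts, customer_dim)[0]["segment"] == "enterprise"
assert enrich(support_facts, customer_dim)[0]["name"] == "Acme"
```

Because both fact tables resolve the customer through the same dimension, revenue and support metrics can be compared per segment without reconciling two competing customer definitions first.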


1.4 The Importance of Data

Data is increasingly seen as great wealth, like oil or gold. However, data is more than just bits or information gathered in a specific form or structure. Data is units of information, usually expressed in numeric form, and it is collected through observation. In modern business, data can be gathered through the applications and systems customers consume. Data itself does not always have high value in it; the key thing is how to process, inspect, and analyse the data.

Companies built specifically on data have been around for a long time. For example, gathering customer information and using it to make better decisions, products, and services is an age-old strategy, but the complete process used to be slow and difficult to scale up (Hagiu & Wright, 2020). This low-speed process changed dramatically with the advent of cloud technologies and new IT innovations that allow companies to rapidly process and make sense of the vast amounts of data in their hands (Hagiu & Wright, 2020).

Data is becoming the main driving force of digitalization, and it has been lifted in many companies to the position of a pivotal key asset. Data is being used on an ever-increasing scale in applications, system development, and decision making. Therefore, there is a constant demand for more data, at better quality, and over a wider time horizon.

This is where the data business gets interesting. Even when the data is proprietary or unique and produces valuable insights, it is difficult to build a durable scaling advantage if competitors can match the resulting improvements even without similar data. Another interesting factor is how fast the insights from customer data change: the faster they do, the harder they are for others to imitate (Hagiu & Wright, 2020). These small but very notable factors make the data business complex and competitive. The changes in data business complexity and competitiveness result in enterprises investing in data consultants and technological experts. Data is on its way to being the main driving force in any business.

Though the data business and industry are growing rapidly, they are also going through different interesting variations. In recent years, the vast amount of raw data produced has dramatically increased. In contrast, the use of such raw data for creating new value and insights by organizations has been limited (Rodriguez et al., 2020). How organizations seek new value from data is a keen topic we will dive into during this thesis.

Data is the most important tool for a company's administration that seeks to execute sustainable and cost-efficient business. Experienced data teams, state-of-the-art analytics, and technology solutions are an essential part of enterprises' data pipelines. The previously mentioned factors are not always enough when the available data is to be used in the most productive way (Etlia, 2021). The key parts of the process must be sharpened together, and data must be seen in a new light.

Data projects need to become more cost-efficient and effective. Although companies spend major parts of their resources on data tools and projects, technology usually is not the issue, because it bends to how users need it. Data mesh can be a solution for more effective data architecture and management to deliver completed data projects.


2 Data Mesh Framework

Transforming into a successful data-driven enterprise remains one of the key strategic goals for modern companies. These companies increasingly value a static and efficient data architecture from the organizations offering them data consultation. Data mesh is a new, decision-oriented paradigm with architectural data features. Data mesh was first introduced by the Thoughtworks technology consultant Zhamak Dehghani on Martin Fowler's highly appreciated blog (Dehghani, 2019). After this first impression of the data mesh paradigm, the concept has gathered a lot of enthusiasm around it. It became the most relevant new data engineering topic in late 2020 and early 2021. Dehghani (2019) describes data mesh as a paradigm shift in managing analytical data. The complexity and size of software can rapidly scale out of hand. More complicated data requirements need a clear design for enterprises to scale with the needs of data-hungry customers. The domain-driven-design-based data mesh is a new solution for this evolution.

More and more information systems and software require precise and comprehensive design for data utilization. Designing and planning just the software itself is not enough anymore. The goal of data architecture is to design the organization's information on different levels. Another objective is an overall picture of the data held in centralized silos, at an explicit level. The objective of data architecture is to show the organization's crucial data content beyond organization and system borders.

Organizations must tackle multi-faceted complexity and challenges in the transformation to become more data-driven. Competing business priorities, migrating different legacy systems, and a culture relying on data are factors behind the data-driven movement. Being data-driven means the transformation towards, and the keenness for, creating more value with data, placing data in a core business position. Dehghani (2019) now suggests a new way to build a distributed data architecture at scale and focuses on the importance of data domains. She introduces a new enterprise data architecture that aims to solve the current issues with centralized data architecture. Data mesh offers a new viewpoint to tackle the challenges of monolithic architecture (Dehghani, 2019). Commonly, it is seen that operative business software, such as ERP systems, is the main product. Data mesh wants to focus on the data itself and remove the habit of seeing data just as a "side product" (Hovi, 2021). Seeing data as a side product is still a common vision in many businesses. Data is just an obligatory matter within acquired IT systems, a secondary business development requirement. This shift is the fundamental key of data mesh.

Data mesh is presented as a framework (Dehghani, 2019). According to ISO (2006), a framework is a specific structure expressed in text, diagrams, or formal rules, which relates the components of an abstract entity to each other. The framework is important to define here, because data mesh aims to create new impact as a future framework for data management.

Data platforms are environments or applications that bring data together and serve it across different business units. A data warehouse is data storage used, for example, for reporting and analytics. It is the key central repository of data integrated from different information sources. It remodels data into a common schema that enables easy data usage. Using this methodology can lead to more effective analytics and other valuable data products.

A data lake is also a central repository, but of structured as well as unstructured data in any format required. The key benefit of a data lake is that it can speed up data availability, because data is usually stored in raw format. Furthermore, the data schema is defined at usage time, allowing flexibility towards the data.
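The schema-at-usage-time point can be contrasted with warehouse-style schema-on-write in a short sketch; the record layout and field names here are invented for illustration.

```python
import json

# Lake-style schema-on-read: store the raw payload untouched, including
# fields nobody has asked for yet.
lake = [json.dumps({"customer": "alice", "amount": "10.0", "source": "web"})]

def read_with_schema(stored, fields):
    """Apply a schema only at usage time, chosen by the consumer."""
    payload = json.loads(stored)
    return {f: payload[f] for f in fields}

# Warehouse-style schema-on-write: shape records into a fixed schema
# before they are stored.
warehouse = []

def store_in_warehouse(raw, table):
    table.append({"customer": raw["customer"], "amount": float(raw["amount"])})

store_in_warehouse({"customer": "bob", "amount": "5"}, warehouse)
assert read_with_schema(lake[0], ["customer", "amount"]) == {"customer": "alice", "amount": "10.0"}
assert warehouse[0]["amount"] == 5.0
```

The lake keeps everything and lets each consumer decide the shape later, while the warehouse pays the modelling cost up front in exchange for a consistent, queryable schema.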

Data warehouses and lakes being the structured and meticulously conceived data man- agement, data swamp is the opposite. According to Knight (2018), a data swamp has no clear organization form or system built around the enterprises’ data. Data swamps have narrow curation, including little to no active management throughout the life cy- cle of data. Also, the metadata and data governance are usually poorly executed. We need to remember that someone’s data lake can be another’s data swamp, and this is because of the variety of data and businesses we face. Data swamps usually have the issue of being unusable and frustrating for data consultants and engineers (Knight,


In data mesh, the data infrastructure is technically centralized, the same for everyone in the organization, but data pipelines are built in a distributed domain-driven fashion.

Following this principle, every single data pipeline can be optimized for the specific needs of that business domain, for example, marketing or customer service (Hovi, 2021). Having unique and optimized data pipelines for domains does not mean that every business domain has to build its own data lake or warehouse; that side of the infrastructure stays monolithic. Instead, these domains will be in charge of, and have full ownership of, the data they consume.

Following the previous example operating model, the same specialists and employees are accountable for the complete data pipeline in its full form, all the way from the production of the data to its final usage. Using this model, business domain employees witness and understand the data they use at a completely new level (Hovi, 2021).

When data pipelines are unique and understood by the same people that consume the data on the business domain side, massive value can be created continuously.

Zhamak Dehghani (2020a) ponders what we mean by data; she explains that we can divide the data landscape into analytical and operational data. Operational data lies in databases that support business capabilities through microservices and APIs. Operational data, for example, serves the needs of applications running the day-to-day business and has a transactional nature. Operational data typically comes from transactions between organizations and their customers (Dehghani, 2020a).

Analytical data is usually temporal and supports views of the business situation over time. Analytical data is traditionally modelled in some way, and future-perspective insights can be built with various available technological tools. For example, engineers can train machine learning models and create plots to support functions all around the business (Dehghani, 2020a). Figure 4 shows the cooperation of operational and analytical data, with ETL processes as the connector for an effective data pipeline.


Figure 4. ETL Data pipeline. Credit Zhamak Dehghani, 2020. https://martinfowler.com/articles/data-mesh-principles.html

In her presentation, Dehghani (2020b) explains that operational data creates API-based access to data, captures the current state of running applications, and serves later parts of ETL pipelines with graph and relational databases. The operational data plane creates a platform for the analytical side of data utilization to scale towards different insights and future visions.
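The operational-to-analytical ETL flow discussed above can be sketched minimally in code. The order records and the revenue view below are invented examples, not part of Dehghani's material:

```python
# Operational (transactional) records, as they might sit in a day-to-day system.
operational_db = [
    {"order_id": 1, "customer": "acme", "total": 100.0},
    {"order_id": 2, "customer": "acme", "total": 50.0},
    {"order_id": 3, "customer": "globex", "total": 70.0},
]

def extract(db):
    """In practice this would read from an operational database or API."""
    return list(db)

def transform(rows):
    """Aggregate to a temporal, analytical view: revenue per customer."""
    revenue = {}
    for row in rows:
        revenue[row["customer"]] = revenue.get(row["customer"], 0.0) + row["total"]
    return revenue

analytical_store = {}

def load(view):
    """Load the derived view into the analytical store."""
    analytical_store["revenue_per_customer"] = view

load(transform(extract(operational_db)))
```

The sketch shows why ETL is the connector between the two planes: the operational side keeps the current transactional state, while the analytical side accumulates derived views over time.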

To summarize the core message of data mesh, we can highlight four main principles (Dehghani, 2020a):

1. Domain-oriented decentralized data ownership and architecture

2. Data as a Product

3. Self-serve data infrastructure as a platform

4. Federated computational governance & Data Governance

These four main principles are explained and tackled in various sections of this thesis. The domain-oriented decentralized data architecture principle is discussed in the following sections, and the governance part is processed in section 2.3, Non-Invasive Data Governance. That section is an apropos place to determine how data mesh observes data governance and how non-invasive actions help company data management. The meaning of domain and the idea behind Domain-Driven Design will be observed first. Dehghani (2020a) states that data mesh has taken a lot of influence from DDD, which makes it natural to inspect the theories from Eric Evans first.

2.1 Domain-Driven Design

Before facing the current situation, pitfalls, and challenges of centralized monolithic architecture, we must focus on defining the meaning of the domain and the core principles of domain-driven design. What does the data domain mean, and do organizations vary in their views on the domain? This question will also be tackled in later parts of this thesis, during the interviews performed with the selected customer organizations.

Who could potentially apply data mesh thinking to their data architecture?

In the context of data mesh, the domain does not mean a certain group of computers that can be accessed and administered with a common set of processes. Here, the domain does not touch the concept of domain names, network domains or Internet Protocol (IP) resources. However, in deeper levels of data management and distributed domain architectures, the domain concept has rather different and multidimensional definitions. These different definitions and viewpoints will be covered throughout the following section.

Domain-driven design is an approach in the software development industry that focuses on programming a clear domain model. This model has a rich understanding of the processes and rules of a domain. Domain-driven design takes its name from an extensively honored book: Domain-Driven Design – Tackling Complexity in the Heart of Software by Eric Evans. Evans (2004) describes the approach through a catalogue of patterns. This approach is particularly suited for complex domains, where a lot of often-messy logic needs to be organized properly.

The concept of software systems based on a carefully developed domain model has been around since the software industry emerged (Fowler, 2020). Moreover, throughout the 1980s and 1990s, representing the underlying domain was a fundamental part of much object-oriented and database development (Fowler, 2020). Overall, Evans made a comprehensive contribution by developing a common vocabulary and identifying the main conceptual elements beyond the diverse modeling notations that dominated the domain discussion.

Large-scale agile software development often neglects proper architectural support in such development projects (Uludağ et al., 2018). Domain-driven design addresses solutions for an increasing number of large organizations developing ever more complex software systems while adopting agile and lean methods in their software development processes. Moreover, we can inspect the underlying theory of domain-driven design: the structure and language used in software code should match the business domain (Evans, 2004). Definitive examples of repeatedly used software code structures are class names, class variables and methods. Domain-driven design is overall a very broad and heavy concept, and it includes its own terms and abstractions. Bounded Context, for example, is a central pattern for making strategic decisions in DDD when large domains and teams are involved.
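As a hedged illustration of the idea that class names, variables, and methods should mirror the business language, consider a minimal sketch of a hypothetical airline-booking domain (all names are invented for this example):

```python
from dataclasses import dataclass, field

@dataclass
class Passenger:
    """A passenger, named exactly as the business names it."""
    name: str

@dataclass
class Flight:
    """A flight with a bookable capacity; attributes use domain vocabulary."""
    flight_number: str
    seats_available: int
    passengers: list = field(default_factory=list)

    def book_seat(self, passenger: Passenger) -> bool:
        """'Booking a seat' is a term the business side uses as well."""
        if self.seats_available > 0:
            self.passengers.append(passenger)
            self.seats_available -= 1
            return True
        return False
```

Here a domain expert can read `Flight.book_seat` and recognize the business rule directly, which is the essence of matching code structure to domain language.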

2.1.1 Domain

Evans (2004) explains that every software program relates to some activity or interest of its user. That subject area to which the user applies the program is the domain of the software. For example, an airline-booking program involves a domain of real people getting on a real aircraft. On the other hand, some domains are immaterial: an accounting program’s domain is money and finance, and IT system domains often have little to do with computers. Of course, there are a few exceptions; in a source-code control system, for example, the domain is software development itself (Evans, 2004).

The ability to solve domain-related tasks for its users is the core of any software. Software and systems have multiple functions, often vital ones, when looking at the bigger domain to understand the business the software is implemented into. To serve that domain, developers must sharpen their modelling skills and master domain-driven design (Evans, 2004).

Domain modelling, however, is not one of the key priorities on most IT projects. Commonly, developers take little interest in understanding the specific domain in which they operate, much less in making a significant effort to understand domain modelling.

Most highly technical software developers enjoy solving quantifiable problems that train technical skills and understanding (Evans, 2004). Computer scientists’ capabilities and common interests do not seem to extend to messy domain work. These preferences could come from the way software development and programming are taught. Developers see their task not on the domain side but rather on the pure programming side. Talented developers can also have multiple projects running simultaneously, decreasing the time and interest available for defining specific domain issues. Thus, there is a clear gap between the development and business units.

Also, Evans (2004) condenses that a domain is “a sphere of knowledge, influence, or activity”. This shows that a domain can be difficult to define, and the lack of a definition for domains can cause obstacles in the software development process.

Nowadays, Evans’s domain definition is seen in a variety of ways. In this study, the interviewed organizations can express how the domain is seen from their point of view. We will also aim to determine whether some organizations lack a clear meaning or definition for the domain. We might quickly find whether the data mesh framework could be applied to their data architecture at all.

Vaughn (2013) supports the domain definition from Eric Evans (2004) and considers the domain to be one of the most important aspects of efficient software development and data processing. To design high-quality software products that meet core business objectives, tactical and strategic modelling tools are required to form a clear vision of the domains.

We need to dig deeper into the core difference between a business domain and a data domain. Distinguishing between business and data domains is one of the most crucial tasks for any modern data-utilizing enterprise.


2.1.2 Context Mapping

As previously explained, Bounded Context has different tools built around it. For example, Vaughn (2013) describes different ways to connect several bounded contexts, context maps being the most explicit.

Context mapping is simply a tool that makes it possible to recognize the relationships between bounded contexts and between the business units or teams responsible for them. Vaughn (2013) specifies that context maps are not a technique limited to drawing a diagram of a specific system architecture in use. Rather, context mapping is about understanding the relationships between the different Bounded Contexts in a business and the patterns used to map objects from one model to another.

Overall, context mapping takes the bounded context further into the notion of strategic design and how to organize large domains. The context mapping principle for organizing large domains is one of the first hints that domain-driven design, and likewise data mesh, might be intended primarily for use in large organizations.
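A context map can be represented very simply in code. The sketch below, with invented context names and relationship patterns, records which bounded contexts relate to each other and how:

```python
# Hypothetical context map: (upstream, downstream) pairs mapped to the DDD
# relationship pattern used between the two bounded contexts.
context_map = {
    ("Sales", "Billing"): "customer-supplier",
    ("Legacy ERP", "Billing"): "anticorruption-layer",
    ("Sales", "Marketing"): "shared-kernel",
}

def downstream_of(context):
    """List the bounded contexts that consume from the given context."""
    return [pair[1] for pair in context_map if pair[0] == context]
```

Even this toy map makes the team relationships explicit: one can see at a glance which contexts depend on "Sales" and which pattern governs each integration, which is the purpose Vaughn (2013) describes.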

Alternative ways of context modelling and mapping have also arrived in the IT industry. Event storming is a good example: it is a workshop-based method for highlighting what is going on in the heart of the software program, the domain. Compared to other methods, event storming is supremely lightweight and, on purpose, does not require computer support. Results or an example process are attached to a wall with sticky notes. Event storming focuses on domain events, and the method has aspects similar to brainstorming (Brandolini, 2013).

2.1.3 Ubiquitous language

Fowler (2020) presents ubiquitous language as a key part of domain-driven design. This language is a major part of effective software development that specialises in programming and application development around domain models. Ubiquitous language gives all parties a unified language for understanding the developed products in the same way. Evans (2004) states that domain terminology must be embedded straight into the software systems.

The importance of domain terminology is one of the core roots of Domain-Driven Design. Figure 5 below shows how ubiquitous language can tie together various parties in an organization. Although Evans originally formed DDD and ubiquitous language for software development, they fit well into the concept of data development and the data industry.

Figure 5: The Ubiquitous Language. Credit InfoQ, 2009.

Ubiquitous language connects multiple different participants, and together they can form a united and efficient way of working. In addition, ubiquitous language touches organizational culture, everyday conversations, and technical factors like code and documentation.

2.2 Service-Oriented Architecture

Architecture is one of the most intriguing and, at the same time, most common terms in software development and data management. This part seeks answers to what architecture means and how service-oriented architecture touches the data mesh concept. But first, we need a proper answer to the question: what does architecture mean in the context of IT and data?


IT architecture can be seen as the set of structures needed to reason about a system, comprising software elements, the relations among them, and data properties. However, software architecture is not the same as data architecture, and they should be seen as separate architecture domains (Zhu, 2013). These architectures tackle slightly different concerns from their respective aspects.

Service-oriented architecture (SOA), in turn, means a logical way of blueprinting a software system to provide services either to applications that end-users consume or to other distributed services in a network (Papazoglou et al., 2007). The distribution side clearly plays a vital role in a service-oriented architecture. New software applications and data systems are often seen as a service to end-users, which needs a strong architectural concept to rest on.

2.2.1 Microservices

The microservice architectural style is an approach to developing a single application as a suite of many small services, each running its own process and communicating through lightweight mechanisms, often an API (Lewis & Fowler, 2014). Microservices are commonly seen as the go-to method in modern software development.

Lewis & Fowler (2014) explain that the strength of microservices can be seen through a simple example comparing them to a monolithic, single-unit style. Of course, software development differs clearly from data applications. However, the same mindset can be applied in both development environments.

Data mesh essentially refers to the concept of breaking down data silos into smaller, more decentralized portions. Much like the shift from monolithic applications toward microservices architectures in the world of software development, data mesh can be described as a data-centric version of microservices (Furia, 2021). Data applications and solutions usually lag a step behind the software business, which leads the direction of technology development in the industry.


2.2.2 DataOps Culture

DataOps culture takes its name from DevOps, the combination of Software Development (Dev) and IT Operations (Ops). DevOps aims to streamline the development cycle, boost continuous integration, and ensure high software quality, and it includes aspects of the Agile methodology. DataOps is a group of technical practices, cultural norms, workflows, and architectural patterns (DataKitchen, 2021). In short, DataOps pursues more effective tools for data analytics and communication.

The overall goal of data mesh is not to vaporize the benefits and utilization of data lakes and warehouses. Instead, the goal is to enhance productivity and to develop the teams consuming data. A clear objective on the horizon is that technical experts, data production, and business units work together more efficiently. These same principles touch the overall work culture that DataOps wants to propel.

Rodriguez et al. (2020) also point out that DataOps is just one of the many tools or frameworks that emerge to meet the demanding requirements of a data-driven process covering everything from data collection to analysis and decision making.

2.2.3 Distributed Systems

As a simple definition, a distributed system is a group of computers working together while appearing as one computer unit to the end-user. The core aspects of the data mesh framework rely on distribution and decentralization. Monolithic systems need to be replaced with microservices to apply data mesh thinking. A distributed system is a complete set of computers, networks, and processes, connected by a network, working in unison to collectively execute a specific group of services (Neuman, 1994). This definition of a distributed system fits the distribution aspect of data architecture that data mesh strives to fulfil. Data mesh creates distribution in data ownership and data processing. The mindset of distributed data architecture is also a vital part of the mesh. Distribution sets a new and refreshing phase in the data world.

Distribution is a strong tool for enhancing computing ability amid the business globalisation around us. Distributed systems are used in various cloud databases and data systems. Distributed data systems provide distribution for data storage, infrastructure, and cloud computing. These distributed data systems help companies with the continually growing need to model and analyse massive amounts of data.

2.3 Non-Invasive Data Governance

Data mesh follows distributed system architecture patterns with independent data products, self-serving data platform infrastructure, and various deploying teams working with data. This data can include vital information from key processes, transactions, or customer engagements. These principles create a demanding requirement to implement a strong governance model for data (Dehghani, 2020a).

Dehghani (2020a) states that data mesh has different priorities regarding data governance models than the traditional governance of analytical information management systems. In contrast, federated computational governance in data mesh embraces change management and multiple interpretative connections (Dehghani, 2020a).

Different governance models, laws, and standards challenge the data industry to have more transparent data usage and ownership. Data governance has multiple definitions and is widely seen in various ways. Commonly, data governance is seen as the process of managing the availability, integrity, security, and usability of data in the enterprise systems used (Stedman & Vaughan, 2020). Data governance is based on internal data policies and standards that also control company data usage. It is seen as increasingly critical for organizations that face new data privacy regulations. These companies usually rely ever more on data analytics and knowledge management to improve operative systems and decision-making (Stedman & Vaughan, 2020).

Non-invasive data governance means a set of practices applying formal behaviour and accountability to secure the effective use, security, compliance, and quality of data (Seiner, 2016). A non-invasive mindset helps companies get a better grip on their data.


A broader understanding of data politics and governance helps organizations scale their data product development and analysis.

Metadata management and standardization fix a common problem in organizations: data are often too difficult to find, almost as if locked away in a system the business cannot access (Shahrin, 2021).

Data governance capabilities ensure the state of data. However, we should always remember that even after various high-quality checks, there will usually still be a node that is contaminated. The danger of contaminated data is not a special case to remember but rather general knowledge to keep in mind while working in the world of data (Shahrin, 2021).

2.3.1 Data Ownership

Domain data ownership lies at the core of data mesh, where overall decentralization and the distribution of responsibility to the people nearest to the data they consume are key aspects. Moreover, responsibility distribution is included to support scalability and the continuous change of the data business (Dehghani, 2020a).

Ownership of a specific business domain in the data mesh model means ownership of its data as well. However, data ownership creates different responsibilities to follow. Domain data owners must understand who is consuming the data, how it is being used, and which native methods users are comfortable carrying out (Dehghani, 2020a). Understanding these aspects creates a foundation for ethical working methods and also drives data-as-a-product thinking onwards.

2.3.2 Reshaping Data Teams

Data mesh strongly strives towards modern and agile data teams. These teams would be decentralized across the organization’s domains, and they would serve the needs of business domain professionals. Dehghani (2020a) describes that when reshaping data teams and focusing on domain data, we need to accelerate the movement towards new data roles.


A data mesh implementation should support domain data being considered a product. These changes in data teams also create a need for new data roles that organizations should introduce, such as data product developers and domain data product owners. Data product developers and domain data product owners are responsible in operative organizations that want to ensure data is delivered as a product (Dehghani, 2020a). These new data roles divide future organizations between the data-as-a-product (DaaP) and data-as-a-service (DaaS) operating models. For example, data product developers can be similar to data engineers but aim for better data products.

Figure 6 (adapted from Dehghani, 2020a) shows new domain data nodes as cubes, along with the different notations and actors affecting the bigger picture.

Figure 6. Data Mesh as a Software Architecture. Adapted from Zhamak Dehghani, 2020. https://martinfowler.com/articles/data-mesh-principles.html

Figure 6 shows us the architectural point of view behind data mesh. Software and information systems architecture is typically represented in such illustrations. This view shows the domains that can self-serve data from the monolithic infrastructure (data warehouse, data lake, etc.). New domains can also be created and work as cross-functional teams owning their specific data. Data should be seen as a product, and the complete organization can work together towards more efficient and operative data products.

2.4 Data as a Product

One of the core principles of the data mesh framework is the data-as-a-product mindset. This view of how data should be treated is one of the most criticized parts of data mesh. Data-as-a-product thinking leads to seeing data as an asset, even a possible product. The concept is criticized because product thinking most definitely does not fit all businesses and data use cases. Data is a highly versatile commodity, and it varies largely between organizations.

Dehghani (2020b) points out that data needs to be easily discoverable, and a common implementation is to have a certain registry, a data catalogue, for example. This registry shows all available data products with their meta-information, such as source of origin, lineage, owners, and sample datasets.
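Such a registry can be sketched minimally in code. The product names, owners, and fields below are hypothetical; a real catalogue would be a shared service rather than an in-memory dictionary:

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """A catalogue entry carrying the meta-information Dehghani lists."""
    name: str
    origin: str
    owners: list
    lineage: list = field(default_factory=list)
    sample_rows: list = field(default_factory=list)

catalogue = {}

def register(product):
    """Make a data product discoverable by name in the shared registry."""
    catalogue[product.name] = product

register(DataProduct(
    name="marketing.campaign_results",
    origin="crm_system",
    owners=["marketing-domain-team"],
    lineage=["crm_system.raw_events"],
    sample_rows=[{"campaign": "spring", "clicks": 120}],
))
```

A consumer can then look a product up by name and inspect its owners and lineage before use, which is precisely the discoverability the registry is meant to provide.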

Data mesh focuses highly on the efficient use of analytical data. Analytical data provided by the domains must be treated as a product, and consumers of that data should be treated as customers (Dehghani, 2020a). We also have to think about the fundamentals of a product and the features used to create one. Hovi (2021) states that a good product has a clear concept, is produced through a certain production process, has a standard value, and has an end-user or a customer.

Hovi (2021) defines data products as information formed from company data that provides value to customers as a standard product; building a data product has to start with the customer’s requirements. Data products should always help a person, streamline operations, or do something that has value (Hovi, 2021).

Data as a product allows organizations to overcome the existing challenges of analytical data architectures. These challenges concern the high cost and friction of discovering, trusting, understanding, and ultimately using quality data (Dehghani, 2020a).


The data-as-a-product mindset could possibly fit only technological organizations, because most classic businesses have a physical product or operation to bring forth. This perception also came to prominence during our theme interviews, and it supports the statement from Hovi (2021) that agile technological organizations will be the first to implement this concept.


3 DATA MESH QUESTION BATTERY FOR HYPE LANDING

In this chapter, we focus on examining the purpose and objectives of the study on a deeper level. In addition, we will also look at qualitative research, which is utilized as a vital part of the data collection of this study. The chapter also describes in detail the execution of the theme interviews and assesses the reliability of the study. This chapter aims to provide the reader with a clear idea of the steps through which the implementation of the research and its results have emerged.

Interviews are a typical way to collect data in qualitative research. This study uses theme interviews as a tool to find answers to the designed question battery. Theme interviews also provide a useful view into the situation of the case study organizations.

The goal of this case study is to find answers to our research questions from the Introduction chapter. Research question 1: “In what situations or organizations can data mesh be applied, and how does one proceed to data mesh?” and research question 2: “How is a domain defined in your organization?”. These research questions help us understand where data mesh could fit and the main principles to consider when moving towards distributed architecture.

The basic purpose of new research is to create something new for science and for the people consuming the study. This study aims to create value through the new data mesh framework. In addition, the study seeks to advance know-how around data management, information architecture, and the distributed data mesh paradigm. The case study supports this and gives practical proof.

Because data mesh is a very young framework, we need to focus on creating the first steps towards understanding the concept from an academic perspective. This study fills the void of data mesh studies and creates a path for other studies to continue afterwards. We aim to find a few starting points to review and lift them to the podium for further research. Our interview questions are justified based on the data mesh principles by Zhamak Dehghani (2020a) and general assumptions on how distributed architecture would fit organizations.

Before the interviews, we designed a question battery with a smaller set of questions and statements. However, we soon noticed that we needed to dig deeper into the core of organizations and their data management issues to find out whether the data mesh framework would fit. So, we structured the questions into six dimensions, and this way we found a way to justify why these exact questions should be asked. Similar questions and debates also came up on different social media platforms, such as Twitter, LinkedIn, and the Data Mesh Learning Slack group.

3.1 Data Mesh Suitability Report

Data mesh is a very new term, and the massive hype around it also challenges the suggested architectural framework. The Data Mesh Suitability Reporting tool was built to organize and justify the questions asked during the theme interviews.


Figure 7 shows the tool built to assess how well data mesh and distributed architecture fit an organization’s situation. This reporting tool includes six dimensions (Domain, Maturity Level, Ways of Working, Technology/Products/Services, Data as a Product, and Data Ownership). Each dimension includes 2-5 questions, forming the total of 29 questions in the set. Figure 7 shows three levels (General, Middle, and Executive) on the left side, each including two dimensions. Every interview started with the general questions and ended with the more specific executive questions about ways of working and data ownership.

After going through all the pre-prepared questions, we could find some indication of whether the organization could adopt data mesh. The possible finding that data mesh would be difficult to adopt is also extremely important, if obtained. The questions can be found in Appendix 1.

3.2 Question Layouts

The questions for this study are formed from various insights and standpoints of different professionals in the academic and business worlds. The questions are set in a neutral form while posing a little challenge at the same time. They aim to be eye-opening for interviewees, who may learn something and find new viewpoints on their organization’s data situation.

The questions are built to be answerable even if the interviewed person does not profoundly understand data mesh architecture. The interviewed persons are IT and data professionals from different companies with varied industry backgrounds. The variety of industries is a strong factor, and it gives us a broad outlook on the world of data at practical and executive levels. The theme interviews included the previously mentioned and categorized 29 questions. The layout of these questions was rather neutral, and interviewees could answer them during an open-minded conversation, which supports common qualitative research methods.

The most central questions are highlighted in Table 1. These questions are opened up more closely in chapter 4, where we inspect the answers from the interviews and envision how they affect the bigger picture of data mesh.


Domain (DQ) – Organization

DQ1: How is the domain defined in your organization? How many domains are there?

DQ2: Do all business areas (domains) get data utilized at the level they require? Do certain domains make more use of data than others?

General Data (GD) – Data Management

GD1: Does your company utilize data? From how many different sources does your company collect data?

GD2: Who in your company utilizes and consumes this data?

Technology / Products / Services (TQ) – Data Management

TQ1: Are your business products and services primarily physical or digital?

TQ2: What is the general situation of the company’s digitalization? Is there a designed data strategy?

Maturity Level (ML) – Organization

ML1: Do you feel that the company’s data literacy/maturity level is high enough for a distributed model?

ML2: Are data team professionals (e.g., data engineers) overladen? Is the competence focused on a very small area, or even on individuals, at the moment?

Data as a Product (DP) – Data Management

DP1: Does the company provide data for external use, or does it only utilize its own data?

DP2: Do company processes/operations generate data that could be utilized but is not yet in use?

Ways of Working (WW) – Organization

WW1: Is the development team responsible for the product being created, and is the business involved in this?

WW2: What are the approximate sizes of the data teams (how many data engineers, project managers, etc.)?

Data Ownership (DO) – Data Management

DO1: Who owns the data in the company? Is data ownership centralized in solutions or decentralized in business areas?

Table 1: The core questions of the thematic interview according to the framework.

3.2.1 Organization Questions

As Figure 8 shows, the data mesh suitability tool has its organizational questions formed into three dimensions. These three dimensions are seen as important parts of the organizational side of data mesh, and the organizational questions tie together the organization-defining side of the assessment.

3.2.2 Data Management Questions

The technological questions are set out on the right side of Figure 7. Data management and its three dimensions form a strong base for the questions focusing on the data-technical side. We need to point out clearly that data mesh is not a technological framework, nor a new technical solution to supplant data lakes or warehouses; the goal is to strengthen the way data is handled. Data mesh nevertheless has data management and technological viewpoints, so these need to be a solid part of our question battery. These data management related questions also help us find essential information about the interviewed organizations and their data architecture situation. Data mesh creator Zhamak Dehghani (2019) states that data mesh is a new paradigm shift for organizations to adapt to the changing data world.

3.3 Study Reliability

Research studies commonly include various error and distraction factors that can affect the study results and thereby endanger the reliability of the whole study. Therefore, reliability assessment is a key part of scientific research, as research has certain standards and values it should strive for (Saaranen-Kauppinen & Puusniekka, 2006).

In qualitative research, it is essential to assess the credibility and reliability of the research. For example, the results of a qualitative study must not be random, and the methods used in the study must be able to examine what the study is intended to investigate. In addition, the concepts used must fit the content of the research problem. One aspect of the reliability of qualitative research is generalizability or transferability: whether the research results can be generalized or transferred to other objects or situations (Tutkimuksen toteuttaminen - Jyväskylän yliopisto, 2010). As this study is qualitative research, attention has been paid to reliability.

Theme interviews naturally involve similar issues and error factors. Theme interviews suit research when the issue to be explored is not well known and the research design is not completely fixed; the design may be clarified as the project progresses, and additional questions can be asked in light of the answers received during the interviews. Data mesh is not yet extensively known or defined. Although questions were prepared beforehand, keeping the interview situations open-minded and interactive allowed just the right amount of conversation with each interviewee (Routio, 2020). Every interview situation was a little different, and the question battery prepared in advance could be modified to follow the direction of the conversation.

The following figure shows the progress process of this thesis. Four highlighted stages are explained, and the thesis progress bar is accompanied by the schedule followed during the thesis work.

Figure 8: Thesis progress process

A well-documented and clear piece of writing is a key factor in achieving the trust of the reader. The study aims to be explicit and easy to follow throughout.

4 CASE STUDY OF POSSIBLE DATA MESH ORGANIZATIONS

A total of seven interviews were conducted for the implementation of the thesis. The results from these interviews serve as the material for the empirical part of the research.

As mentioned previously, theme interviews were chosen as the form of data acquisition. This section focuses on the course of the interviews, along with the findings and results. The chapter highlights the companies interviewed and critiques the method used during the interviews. The most important questions and answers are inspected, and the results from the interviews and their connections to the data mesh framework are presented in verbal form.

As this thesis was done in collaboration with a business, thanks to Solita Ltd, we obtained a significant and respectable sample of organizations for research use. Organizations were contacted via email with an invitation to an interview. The seven organizations were found rapidly, and interviews were booked with each organization’s representative. Interviews were carried out on a business communication platform, Microsoft Teams. Each interview had a 1–2-hour videoconference booking, and the average duration of the interviews was 1 hour and 10 minutes. Interviews were recorded for later review, and the interviewer took notes during the conversation.

The interviewees were professionals who carry great responsibility for developing their organization’s data efficiency further. These people demonstrably know the significance and importance of data utilization and see the pain points and challenges of their organization at first hand. The interviewees hold different responsibilities and job titles, such as Data Management Manager, Data Lead, and Head of Data & AI. All these positions aim to enhance the use of operational and analytical data. That the interviewees were mostly senior professionals and executives is a great advantage in gaining comprehensive insight into the case organizations.
