
FEASIBILITY OF DATABASE QUERY COMPOSITION AND REUSE WITH ECTO

Faculty of Information Technology and Communication Sciences
Master’s Thesis
Examiners: University Lecturer Timo Mäkinen, Researcher Mikko Nurminen
April 2021


ABSTRACT

Mika Kunnas: Feasibility of database query composition and reuse with Ecto
Master’s Thesis
Tampere University
Degree Programme in Information Technology
April 2021

Software reuse is a common practice in software development, and it is a widely studied topic. However, there is little research on how to take advantage of reuse from the perspective of SQL queries and data. When many features of a piece of software require a similar set of data from the database, there is an opportunity for reuse.

The workforce optimization software at RELEX combines data from different database queries to avoid writing similar queries from scratch. This is done by chaining calls to functions that each execute a database query, passing the data to the next query as an input. The downsides of this approach are that the queries are not tailored to each use case and that the number of queries executed is not explicit. The queries are built with Ecto, which is a query generator for Elixir.

This thesis investigated query composition as an alternative way of query reuse. The idea of query composition is to build small query blocks that can be combined in different ways to form the final single query. Two case queries in the application were chosen and rewritten using query composition. Based on the predefined criteria, such as performance, and the experience of working with query composition, the proposed method was evaluated against the current method to determine whether query composition could generally be recommended over the current method. The research was based on the engineering method.

The new implementation outperformed the current implementation, on average, by over 35% in three of the four performance test cases, being 12% slower in the last case. The new implementation required 48 source lines of code versus 53 lines in the current implementation. Based on the implementation project, query composition provides many important benefits over the current method, such as removing the need for manual transaction control and enabling use case specific tailored queries. Thus, query composition can be recommended over the current method of query reuse. Future research topics include investigating the validity of the results of this research with other technologies similar to Ecto.

Keywords: database query composition, database query reuse, Elixir, Ecto

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


TIIVISTELMÄ (ABSTRACT IN FINNISH)

Mika Kunnas: Feasibility of database query reuse and composition with Ecto
Master’s Thesis
Tampere University
Degree Programme in Information Technology
April 2021

Software reuse is common in software development, and the topic has been studied extensively. A less studied topic is reuse from the perspective of SQL queries and data. When different features of a piece of software require similar data from the database, an opportunity for reuse opens up.

The workforce optimization application at RELEX reuses data from different queries to avoid writing similar queries from scratch. This is implemented by chaining functions, each of which executes a database query and feeds its result to the next query. The downsides of this solution are that the queries are not designed per use case and that the number of executed queries is not explicit. The queries are built with Ecto, which is a query generator for Elixir.

This thesis investigated query composition as an alternative means of reuse. Its idea is to build small query pieces and combine them in different ways into a single larger query. Two query cases were selected for the study and rewritten using query composition. Based on the requirements set for the new implementation, such as performance, and on the development experience, the query composition method was compared with the current method to evaluate whether query composition can be recommended over the current method. The research was based on the engineering method.

The new implementation was on average over 35% faster than the current implementation in three of the four performance tests and 12% slower in the last test. The new implementation required 48 lines of source code, whereas the current implementation needed 53 lines. Based on the implementation project, query composition offers many important benefits over the current approach, such as removing the need for manual transaction control and enabling use case specific queries. According to the study, query composition is therefore recommended over the current method of query reuse.

Future research includes, for example, examining the validity of these results with other technologies similar to Ecto.

Keywords: database query composition, database query reuse, Elixir, Ecto

The originality of this publication has been checked with the Turnitin OriginalityCheck service.


PREFACE

Writing this thesis has been a long journey lasting for over a year. The world was thrown upside down due to the COVID-19 pandemic and the search for an instructor turned out to be a mission of its own. Even though I mostly wrote the thesis on my own, I never felt like I was alone.

I want to thank especially my dear wife, Elsa, for all the love and support she provided me every day throughout the process. Not only did she allow me to spend the time that I needed, but also gave me invaluable ideas whenever I felt I was stuck or needed another opinion. I don’t think I would have gotten very far without her.

I also want to thank my instructor Timo Mäkinen under whose professional and very supportive guidance the work saw tremendous improvement. The thesis was already around 70% complete when our co-operation began, but he took the challenge and was providing insightful comments in no time. He and the other examiner, Mikko Nurminen, created a relaxed but professional atmosphere that really helped me to push a bit further.

I also want to thank Mikko for being such a good sport.

Finally, I want to thank RELEX and especially Antti Alakiikonen for helping me find a meaningful topic. Antti also gave me regular gentle pushes that helped me keep working on the thesis while I was still looking for an instructor.

Hyvinkää, 27th April 2021

Mika Kunnas


CONTENTS

1 Introduction
2 Background
3 Elixir
3.1 Basics
3.2 Functions and modules
3.3 Data types and data structures
3.4 Pipe operator
4 Database queries with Ecto
4.1 Schema
4.2 Query
4.3 Repository
5 Query cases
5.1 Case 1: fetching employees for the optimizer
5.2 Case 2: fetching shifts for planning
6 Query composition
6.1 Composition for case 1
6.1.1 Primary employees
6.1.2 Secondary employees
6.1.3 Transferred employees
6.1.4 The final query
6.2 Composition for case 2
7 Performance evaluation methods
7.1 Test environment
7.2 Test setup
8 Results
8.1 Criterion 2: performance
8.1.1 Case 1
8.1.2 Case 2
8.2 Criterion 5: number of code lines
8.3 Criterion 7: query correctness
8.4 Observations
9 Conclusion
References
Appendix A Code for query composition in case 1 and 2
Appendix B Case 2 SQLs
Appendix C Test functions for case 1
Appendix D Test functions for case 2


LIST OF TERMS AND ABBREVIATIONS

API Application Programming Interface: an interface through which an external application can use the services of the application providing the interface

Arity The number of arguments a function takes

Benchee A benchmarking library for Elixir

CRUD Create, read, update, delete: the basic set of operations on an information entity

Docker A popular container platform

DSL Domain Specific Language: a language created for the purpose of a specific application domain

Ecto An Elixir library for database interaction

Elixir A functional programming language

SQL Structured Query Language: a language for interacting with relational databases

UTF-8 8-bit Unicode Transformation Format: a popular character encoding


1 INTRODUCTION

In modern software development, software reuse is the common practice of using previously written software in another program. The level of reuse can vary from reusing source code to reusing whole microservices. Software reuse has been a widely studied topic since the 1980s. [19]

Storing data is another important concept in software development. There are many ways to store data, one of which is to leverage a database. When building software, it is common that multiple use cases require a similar set of data. As an example, consider a blogging system with the following requirements:

1. provide a list of posts that were published this week

2. provide a list of published comments for the posts from the previous requirement

3. in addition to the previous requirement, provide also the name of the publisher.

The second requirement requires data from the first requirement. Similarly, the third requirement requires data from the previous two. Instead of writing three separate database queries from scratch, there is an opportunity for reuse.

One option for reuse is to create a function for each of the requirements and chain the function calls providing data from the previous call to the next. For example, a function for the first requirement could execute a query to fetch the list of posts. The function for the second requirement would then accept the list of posts and execute a query to return the comments for those posts. This method is used in the Workforce Optimization solution at RELEX Solutions, a Finnish company for which this thesis is written.
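To make the chaining described above concrete, the following is a minimal sketch of how such functions might look for the blogging example. The Blog.Post and Blog.Comment schemas, their fields and the Blog.Repo module are hypothetical illustrations, not the actual RELEX code.

defmodule Blog.Queries do
  import Ecto.Query

  # Requirement 1: one query for this week's posts
  # (Blog.Post, Blog.Repo and the published_on field are assumed to exist)
  def posts_published_this_week do
    week_ago = Date.add(Date.utc_today(), -7)
    Blog.Repo.all(from p in Blog.Post, where: p.published_on >= ^week_ago)
  end

  # Requirement 2: a second query, driven by the result of the first
  def published_comments_for(posts) do
    post_ids = Enum.map(posts, & &1.id)
    Blog.Repo.all(from c in Blog.Comment, where: c.post_id in ^post_ids and c.published == true)
  end
end

# Chained usage: two separate database queries are executed.
Blog.Queries.posts_published_this_week() |> Blog.Queries.published_comments_for()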

There are three issues with this approach:

1. The number of database queries executed is not explicit but hidden behind many chained function calls. This can easily lead to chatty and inefficient communication over the network between the software logic and the database.

2. Since many queries are implicitly executed, explicit transaction control is required to avoid any data inconsistencies.

3. The queries are not tailored for the specific use case. Function A might need data from two tables, but function B calling function A might only need data from one of the tables. This can cause unnecessary work for the database.

Another option is to build the queries themselves from reusable pieces. With this method,


for example, an SQL (Structured Query Language) WHERE clause could be shared by many different queries that require the same condition. In recent years, communities of various programming languages have come up with technologies for constructing database queries in this fashion. Such technologies include LINQ for C#, jOOQ for Java, Knex.js for JavaScript and Ecto for Elixir [20][12][17][6]. The existing software uses Elixir and Ecto. Ecto supports what are called composable queries to compose a single larger query from reusable pieces.

This thesis will investigate using composable queries as a solution to the aforementioned issues with the current solution while still supporting reuse. Two case queries built with the current method are rebuilt using composable queries. The new solution is evaluated based on the following criteria:

• C1: The new queries must return the same data as the current ones.

• C2: The performance of the new solution must be comparable to or better than the performance of the current solution, unless meeting this criterion would add significant complexity to the query composition.

• C3: The number of queries executed must be explicit.

• C4: Queries for both cases must run in a single transaction.

• C5: The number of code lines required for the new solution must not increase compared to the old solution.

• C6: The granularity of reuse must be at least at the same level as in the current solution.

• C7: Small queries that return correct data must be easily composable to form a larger query returning correct data as well.

Based on the fulfillment of the aforementioned criteria and additional observations during the implementation, the goal is to determine whether the usage of query composition is feasible and if it can generally be recommended over the current method.

Previous research on the topic focuses mostly on the design of language-integrated query, such as T-LINQ, SQUR and QUEΛ [3][18][39]. Language-integrated query aims to solve the paradigm mismatch between programming languages and database query languages by allowing the usage of the building blocks provided by the programming languages to construct queries instead of having to deal with two different types of notations [3][18]. These studies, however, focus on designing the tools, while this thesis focuses on the usage of one such tool, Ecto, for query composition.

In terms of existing research specifically on query composition and reuse, Allen and Parsons found that query reuse can lead to incorrect queries more often than writing queries from scratch [1]. For this reason, each query built is run through a set of integration tests at the application level to ensure that the criterion C1 is met.

This thesis uses the engineering method as the basis for the study. Basili defines the engineering method as follows.


"The engineering method: observe existing solutions, propose better solutions, build/de- velop, measure and analyze, and repeat the process until no more improvements appear possible." [2]

Chapter2discusses the background for the current solution and the issues with it. After chapters3and4that introduce Elixir and Ecto, the existing solution is presented in chap- ter5. A better solution is proposed and developed in chapter 6, which is measured and analyzed in chapter8after describing the performance evaluation methods in chapter7.

Finally, conclusions are drawn in chapter9. The iteration step of the engineering method is left for further research discussed in the last chapter.


2 BACKGROUND

RELEX Solutions provides planning, management and optimization solutions for the retail market. One of the key products is the workforce optimization and management software for optimizing the staff needed for the forecasted workload. [37]

The product uses PostgreSQL for storing shifts, employees and other relevant domain entities. The main backend application using the database is written in Elixir, leveraging Ecto as the database interfacing library. The shifts are optimized by a separate microservice that calculates an optimized roster based on the data provided by the main backend.

Generating the data for the optimizer starts by building a list of employees with a working contract. This sounds like a fairly simple query to the database. However, the current implementation executes 14 database queries in the process of building that list. Furthermore, as the queries are not run in a transaction, there is a potential data consistency issue.

On closer inspection, it turns out that fetching data from the database often causes an excessive number of queries to be executed. This is because a function that executes a database query often does some or all of the following:

• accepts a list of database row IDs

• calls a function that executes one or more database queries

• preloads data from one or more associated tables

The number of executed queries piles up as these functions are chained together. The main reason for this is code sharing and not having to build each query from scratch. The downside is that explicit control of the number of queries executed to fetch the needed data is lost. Also, since functions hard-code the preloads, each function call causes the preloads to be executed regardless of whether they are actually needed.

This thesis investigates query composition using Ecto as a solution to these problems based on the criteria described in chapter 1. Another possible solution to these issues would be to leverage database views, which are supported by many relational databases. This method is not studied in this thesis, as

1. the engineering team prefers to have the control of the queries on the application side


2. extensive reuse with database views could potentially lead to maintenance problems

3. it is likely that in the future queries will need to be built dynamically at run time.


3 ELIXIR

Elixir is a functional and compiled programming language that runs on top of the Erlang Virtual Machine to benefit from its fault-tolerance and distributed characteristics [25] [27].

It is a relatively new language with the first stable release appearing in 2014 [41]. Some features, libraries and syntax of Elixir are introduced in this chapter as necessary for the thesis.

Section 3.1 starts with the basics of Elixir. Section 3.2 discusses functions and modules. Data types and data structures are discussed in section 3.3. Finally, the pipe operator is introduced in section 3.4.

3.1 Basics

The syntax of Elixir resembles the syntax of Ruby, but in contrast to Ruby, Elixir is a functional programming language. In Elixir, functions are first-class citizens, meaning that they can be stored in variables, passed to other functions as arguments and returned from functions. Variables are not "real variables" in the sense that they can only be assigned once. Data structures in Elixir are immutable.

Elixir has a few control structures: if...else, case and cond. The if statement behaves similarly to mainstream languages with a couple of twists: there is no else if, and it returns a value like all control structures in Elixir. The case statement is similar to the switch statement in mainstream languages, but with a different syntax. It also automatically returns after the first match. [13]

cond is used to run multiple expressions, finding the first one that evaluates to something other than false or nil. The following examples illustrate the difference between case and cond:

case 2 + 2 do

1 -> "this never matches, evaluating the next one"

4 -> "this always matches"

5 -> "this line is never reached"

end

cond do

2 + 2 == 1 -> "this never matches"

1 + 1 == 2 -> "this always matches"

true == true -> "this is never reached"

end


The with statement is convenient when there are dependencies between conditions [15].

Consider the following contrived example:

with 10 = ten <- a + b,
     15 <- b + ten do
  IO.puts("a + b == 10 and b + ten == 15, thus a == 5 and b == 5")
else
  _ -> IO.puts("a or b is unequal to 5")
end

Elixir has no for loop or similar loop structures. Instead, iterations are implemented via recursion. Luckily, most of the daily iteration needs are already covered by the Enum module, which contains functions for running through different data structures. Elixir also has the for special form, which implements a comprehension [14].
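As a quick illustration (a sketch, not code from the thesis), the same transformation can be written with Enum.map/2 or with the for special form:

# Squaring a list with Enum.map/2 and with a for comprehension
Enum.map([1, 2, 3], fn n -> n * n end)  # => [1, 4, 9]

for n <- [1, 2, 3], do: n * n           # => [1, 4, 9]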

Elixir has the basic set of operators familiar from most other languages. A few of the familiar looking operators have a different function in Elixir [33]:

• ++ concatenates lists

• = is a match operator used for both assignment and pattern matching.

In addition to the familiar operators, there are many more exotic operators such as [33]:

• <> for binary concatenation

• =~ is used to check whether the left-hand side string contains the string on the right-hand side

• \\ is used to specify default arguments

• |> is the pipe operator, which is discussed in section 3.4.

3.2 Functions and modules

Elixir has two types of functions: named and anonymous. Named functions can be either public or private and are defined using def and defp, respectively. Anonymous functions are defined using the fn keyword. Anonymous functions are not discussed further in this thesis.

Named functions are identified by the combination of a name and an arity (the number of arguments). To identify a function, the notation name/arity is used. The following two functions are different due to a different arity. [26][31]

def multiply(a, b), do: a * b

def multiply(a, b, c), do: a * b * c

Named functions can also have default values for arguments [26]. The following code defines the function increment/2. The value given as the first argument is incremented by the value of the second argument or by one, if not given.

def increment(a, b \\ 1), do: a + b


In the previous examples, the body of the function has been on a single line. When the body spans multiple lines, the do/end block syntax must be used [31]. The following two functions are equivalent.

def multiply(a, b), do: a * b

def multiply(a, b) do
  a * b
end

Notice that there is no return statement in Elixir; the value of the last statement is automatically returned.

A module can be thought of simply as a container for named functions. An example of a module in the Elixir standard library is String, which contains functions for working with UTF-8 (8-bit Unicode Transformation Format) encoded binaries. Modules can also be nested using the dot (.) operator. [26][31]

Module names are commonly defined as aliases. Aliases must be written in CamelCase, like DateTime. Function names must be in snake_case. [32]

The following code identifies the function called to_string with an arity of 1 in the String.Chars module:

String.Chars.to_string/1

Module names can be aliased [22]. Consider the following module:

defmodule MyApp.SomeModule.SomeSubmodule do
  def some_function() do
    # some logic here
  end
end

To call the some_function/0 function, one would have to type:

MyApp.SomeModule.SomeSubmodule.some_function()

It is possible to call the function with only SomeSubmodule.some_function() by aliasing it first:

alias MyApp.SomeModule.SomeSubmodule

SomeSubmodule.some_function()

The alias keyword accepts an :as option for giving the alias a custom name:

alias MyApp.SomeModule.SomeSubmodule, as: Submodule

Submodule.some_function()


3.3 Data types and data structures

In addition to the standard data types like integers, strings and booleans, Elixir has a few types and data structures worth explaining mostly to understand their syntax.

An atom is a constant whose value starts with a colon (e.g. :an_atom). The name of the atom is also its value. [23] Atoms are often used for tagging and labeling other values.

A tuple is a collection of values separated by commas and wrapped in curly brackets (e.g. {:error, "Invalid integer", 1.0}). The previous tuple has three elements: an atom, a string and a float. Accessing an element in a tuple is fast while adding a new element is slow. [23]

Lists are linked lists containing comma-separated values wrapped in square brackets (e.g. [1, 2, 3]). Accessing an element in a list is a linear operation, so accessing the first element or adding a new element at the beginning of the list is very fast. [23]

A keyword list is a list that consists of 2-element tuples where the first element is an atom. Elixir provides an alternative syntax for keyword lists, so the following two lines are equivalent. Keyword lists can contain the same key more than once. [29]

[{:key, "value"}, {:key2, "value2"}]

[key: "value", key2: "value2"]

A map is an unordered key-value store where the key can be any value. Key and value are separated with a double arrow (=>). When the key is an atom, either of the following syntaxes can be used. [29]

%{:key => "value"}

%{key: "value"}

Structs are an extension of maps with compile-time checks and a predefined set of keys with default values. Structs are defined inside a module. The following code defines a User struct with a first and last name that default to nil. [35]

defmodule User do
  defstruct first_name: nil, last_name: nil
end

The following code creates a user struct.

%User{first_name: "John", last_name: "Doe"}

Structs can be distinguished from maps by the module name appearing after the % character. [35]

3.4 Pipe operator

The pipe operator (|>) makes it possible to turn nested function calls into a pipeline-like series of operations [28]. As an example, consider a string consisting of values separated by commas, such as first,second,third. The requirement is to reverse each value while keeping their order unchanged.

One possibility to achieve this is to implement the following steps:

1. Split the string into a list of values.

2. Reverse each value in the list.

3. Join each value in the list back to a string using comma as a separator.

This can be implemented using the pipe operator as follows

"first,second,third"

|> String.split(",") # step 1

|> Enum.map(&String.reverse/1) # step 2

|> Enum.join(",") # step 3

This could also be written in one line. The pipe operator passes the expression on the left-hand side as the first argument of the function call on the right-hand side [28].
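For comparison, the same three steps can be written as nested calls (an equivalent sketch); the piped version reads in the order the steps are performed, while the nested version reads inside out:

# Equivalent to the piped version above, with the calls nested instead
Enum.join(
  Enum.map(
    String.split("first,second,third", ","),
    &String.reverse/1
  ),
  ","
)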


4 DATABASE QUERIES WITH ECTO

Ecto is a data mapping library for Elixir that is commonly used to interact with a database [6]. Based on the number of stars on GitHub and the total number of downloads, it is by far the most popular library in the Elixir ecosystem for working with databases [16][38].

Many of the features in Ecto are implemented as macros. Macros are essentially compile-time constructs that receive, manipulate and return code [30]. They allow Ecto to, for example, provide clean DSLs (Domain Specific Languages) for many functionalities that are turned into normal Elixir code during compilation [24].
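As a rough illustration of the mechanism (a minimal sketch, not Ecto code), a macro receives the code of its arguments as data and returns new code built with quote and unquote:

defmodule MyMacros do
  # Runs at compile time: receives the AST of its arguments and
  # returns new AST built with quote/unquote.
  defmacro my_unless(condition, do: block) do
    quote do
      if unquote(condition), do: nil, else: unquote(block)
    end
  end
end

# Usage: the defining module must be required before the macro is called.
require MyMacros

MyMacros.my_unless 1 > 2 do
  "one is not greater than two"
end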

Ecto contains multiple modules for data mapping and database interaction. For this thesis, the most important concepts are schema, query and repository. A schema represents a database table; it is discussed in section 4.1. Queries, discussed in section 4.2, are used to manipulate and fetch data from a repository. Finally, a repository is a mapping to a data store, and it is discussed in section 4.3.

4.1 Schema

A schema is a mapping between an Elixir struct and a data source [10]. This section defines an example schema User that is used in sections 4.2 and 4.3.

defmodule User do
  use Ecto.Schema

  schema "users" do
    field :email, :string
    field :age, :integer
    has_one :account, Account
  end
end

The name of the module becomes the name of the schema. The macro schema/2 is used to create a schema that usually maps to a database table, such as users in this case. A column of the table is represented by the field/3 macro, indicating the name and the type of the field. [10]

Schemas can have associations. The has_one/3 macro defines a one-to-one relationship with another schema. In this case, the User schema has zero or one Account associated with it. [10] The Account schema is not defined here for brevity.


Schemas are very flexible and have many different features. For the purpose of this thesis, it is enough to have a high-level understanding of what schemas are. In this thesis, schemas are used with queries, which are introduced next.

4.2 Query

Queries are used to manipulate and retrieve data from a database. The module Ecto.Query represents queries by implementing a DSL with two different APIs (Application Programming Interfaces): keyword-based and macro-based. [8]

Following is an example of the keyword-based syntax. It produces a query to get the e-mail addresses of users from the users table who are at least 30 years old.

import Ecto.Query

from u in "users", where: u.age >= 30, select: u.email

The same query can be produced using the macro-based API as follows:

import Ecto.Query

"users"

|> where([u], u.age >= 30)

|> select([u], u.email)

The functions and macros in Ecto.Query must be imported before usage. The import statement will be left out of the following examples for brevity.

The keyword-based API resembles SQL. The macro from accepts an expression as its first argument and a keyword list as the second argument [8]. This can be made more explicit by including parentheses and square brackets.

from(u in "users", [where: u.age >= 30, select: u.email])

In the previous examples, u is a binding to users and u.age is field access. There can be many bindings, as joins create bindings as well. Bindings are positional, so in the macro-based API they are given in a list ([u]). The bindings are added to the list in the order of creation. [8]

The keyword-based and macro-based syntaxes are equivalent [8]. The focus in this thesis will be on the macro-based syntax, so it is used in the following code samples in this chapter.

Instead of building the query directly against the database table, schemas can be leveraged. The following example uses the User schema described in section 4.1.

User

|> where([u], u.age >= 30)


|> select([u], u.email)

Leaving out the select call would return a list of full User structs instead of just the e-mails.

So far, the wanted value of u.age has been hard coded. The value of a variable can be used with the help of the pin operator (^) [11].

min_age = 30

User

|> where([u], u.age >= ^min_age)

|> select([u], u.email)

The pin operator protects against SQL injection by instructing the query builder to use parameterized queries. If the pin operator is left out, the compiler will emit an error. [11]
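One way to see the effect of the pin operator is Ecto.Adapters.SQL.to_sql/3 from the ecto_sql package, which returns the generated SQL together with its parameter list. The sketch below assumes a repository module called MyApp.Repo (repositories are introduced in the next section), and the exact SQL formatting depends on the adapter:

min_age = 30

query =
  User
  |> where([u], u.age >= ^min_age)
  |> select([u], u.email)

# Returns {sql, params}: the pinned value ends up in the parameter list
# instead of being interpolated into the SQL string.
Ecto.Adapters.SQL.to_sql(:all, MyApp.Repo, query)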

It is worth noting that Ecto.Query is only responsible for constructing the query. The execution of the query is handled by the repository, which is discussed in the next section.

4.3 Repository

A repository maps to a data store via an adapter. The adapter handles the communication between the application and the data store. Ecto comes with built-in adapters for PostgreSQL and MySQL databases. [7][9]

Ecto.Repo defines multiple callbacks for executing CRUD (create, read, update and delete) operations against the data store. The module is not used directly. It provides the behavior and a default implementation for different operations that are executed via a custom module, usually called ApplicationName.Repo. [9]

For the purpose of this thesis, neither the underlying semantics nor the vast majority of the provided callbacks are important. Instead, the focus will be on two callbacks: all/2 and preload/3.

The all/2 callback is used to fetch all entries from the database matching the given query.

It takes a queryable, such as a query or a schema, as its first argument and optional options as the second argument. [9]

Imagining an application called MyApp, the following example would fetch a list of users who are 20 years old.

User

|> where([u], u.age == 20)

|> MyApp.Repo.all()

Section 4.1 described the User schema, which had an association with Account. How does one fetch the account associated with the user? The following code shows how to leverage preloads to accomplish that.


User

|> where([u], u.age == 20)

|> MyApp.Repo.all()

|> MyApp.Repo.preload([:account])

The preload/3 preloads the provided associations (in this case account) into each User in the result [9]. An important trait of preloads is that each preload in the list creates one database query in addition to the base query executed by all/2 [8]. The previous example would execute two database queries in total. The preloads are run automatically in the same transaction as the main query.
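A related technique, relevant to the query counts discussed in the following chapters, is to preload through a join so that the association is loaded as part of the base query. The sketch below assumes the User/Account association from section 4.1 and a MyApp.Repo repository:

# A single query: the accounts are fetched via the join instead of a
# separate preload query.
User
|> where([u], u.age == 20)
|> join(:left, [u], a in assoc(u, :account))
|> preload([u, a], account: a)
|> MyApp.Repo.all()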


5 QUERY CASES

The following sections introduce the two cases investigated in this thesis. The first case, fetching employees for the shift optimizer, is discussed in section 5.1. The second case, introduced in section 5.2, is about fetching shifts to be displayed for shift planning.

5.1 Case 1: fetching employees for the optimizer

This section introduces the current algorithm for fetching the employees needed by the shift optimizer. Before that, however, a few terms related to employees are explained.

An employee works for a market which is a unit of shift planning within a company or a store. For very small stores, one market usually covers the whole store. For large stores, a department can be divided into multiple markets.

An employee works primarily for one market but can be loaned to another market. When an employee works primarily for market A but can be loaned to market B, the employee is a primary employee for market A and a secondary employee for market B. Technically, an employee can simultaneously be a primary and a secondary employee for the same market.

The primary market of an employee can change. Suppose that the primary market of an employee had been market A since January 1st, 2020, but the employee was transferred to market B on May 20th, 2020. From the perspective of market A, the employee is a transferred employee from January 1st to May 19th if market A is neither the primary nor the secondary market of that employee. The employee is considered transferred in market A because she or he did shifts in market A in the past but does not work there anymore.

The list of employees in a market on a given date consists of the following types of employees:

1. Primary employees

2. Secondary employees

3. Transferred employees

The relationship between the different types of employees is shown in figure 5.1. The darkest circle illustrates employees who have done shifts in a market during a certain date range. These employees are transferred employees if they are not primary or secondary employees of that market.


Figure 5.1. Venn diagram of different types of employees.

Figure 5.2. Current algorithm for fetching employees for the optimizer.


Figure 5.2 shows the current algorithm for fetching employees from the database. Bolded shapes depict database queries. At a high level, the IDs of primary and secondary employees are summed with the IDs of transferred employees. The resulting list of IDs is used to query the employees from the database, and the list of employees is finally filtered to include only employees with contracts.

The most interesting part is the flow to get the list of transferred employees. First, all employees in a date range are queried, and the employees that are either primary or secondary are removed to get the transferred employees. Four preloads are then executed. Finally, only the IDs of the transferred employees are taken on the application side. Notice that primary and secondary employee IDs are both fetched twice: first to be summed with the transferred employees and then to calculate the transferred employees.

For this flow, the preloads are unnecessary, since only the IDs of the transferred employees are used. The preloads exist because there are other parts of the system that use the same function, which returns the transferred employees themselves with the necessary data preloaded. This means that there are currently two possibilities for fetching transferred employees without affecting the existing implementation:

1. Use the dedicated function that executes 7 database queries (primary employees, secondary employees, employees in date range, four preloads).

2. Create a new implementation from scratch.

Similarly, fetching the list of all employees using the function that implements the algorithm in figure 5.2 executes 14 database queries, of which 8 are preloads.

In this thesis, queries for each of the employee sets depicted in figure 5.1 are built. The queries are then composed to build a larger query for the flow in figure 5.2. The new implementation will provide the same employee data for the optimizer as the current solution to meet the criterion C1.

5.2 Case 2: fetching shifts for planning

The shift planning happens through a web-based user interface. The shifts to be planned are fetched from the backend, where the algorithm shown in figure 5.3 is triggered. Notice that the algorithm uses the same subroutine for fetching primary and secondary employee IDs as the algorithm in figure 5.2.

There are two lists of shifts that are combined at the end:

1. shifts for primary and secondary employees

2. shifts for transferred employees.

The first list is constructed by querying the shifts table for shifts where the employee ID is in the list of primary and secondary employee IDs. The shifts returned will contain all the shifts of those employees in the date range regardless of the market. This is because employees can be borrowed, so some shifts must be displayed for multiple markets.


Figure 5.3. Current algorithm for fetching shifts for the planning view.

The second list is constructed by querying the shifts table for shifts where the employee ID is not in the list of primary and secondary employee IDs. The list will only contain shifts done in the selected market.

The subroutine on the left in figure 5.3 is called twice. The algorithm fetches two preloads twice, leading to four preloads, while only one of them is actually used. The total number of queries executed is ten, as seen in the figure as bolded shapes.


6 QUERY COMPOSITION

Ecto provides the ability to build database queries piece by piece with composition [4].

Section 4.2 described the basics of building a query with Ecto. As a reminder, the following is an example of a query to get the emails of users who are 30 years old:

User

|> where([u], u.age == 30)

|> select([u], u.email)

Leveraging composition, the previous query can be written as:

query =
  User
  |> where([u], u.age == 30)

query
|> select([u], u.email)

The only difference here is that the return value of the where macro call is stored in a variable. Most functions and macros in the Ecto.Query module accept a queryable, such as another query, as the first argument and return a query.

The true power of composition comes when these pieces of query are defined inside functions and reused in many queries. For maximum reusability, each query builder introduced in this thesis will obey the following constraints:

1. No order_by clauses: the full control of ordering is left to the specific use case.

2. No preloads: preloads are use case specific and should be determined as close to the actual query execution as possible. This is required to meet the criterion C3.

3. Do not depend on specific clauses in the provided query: for example, a builder must not assume that there is a join clause in the provided query.

4. Avoid setting constraints for further usage: the caller should be able to add clauses freely to the returned query without errors or unexpected side effects.

The complete code for this chapter can be found in appendix A. Sections 6.1 and 6.2 discuss the queries to satisfy the two cases described in sections 5.1 and 5.2, respectively.


6.1 Composition for case 1

6.1.1 Primary employees

Employees are stored in the database in the employees table. There is also the corresponding Vision.Employees.Employee schema that maps to the database table. Vision is the main module of the application, as the original name of the application is still in use in the code. Vision.Employees contains functions mostly for retrieving various employee related data from the database. The employee schema has a market_id field which contains the ID of the primary market.

The following code would build a query for fetching primary employees of the market with the ID of 1:

Vision.Employees.Employee

|> where([e], e.market_id == 1)

The where clause can be moved into a function, parameterizing both the query and the market_id. This and other query building functions will be defined in the Employee module:

defmodule Vision.Employees.Employee do
  import Ecto.Query

  # schema definition here

  def with_market_id(query, market_id) do
    where(query, [e], e.market_id == ^market_id)
  end
end

The function name follows the naming convention used by Avi Turkewitz in his blog post [40]. Starting the function name with with provides a generic convention for naming functions that add various filters to the query, whether it is a where or a join clause. There are many ways to reuse pieces of a query in Ecto, and one can even define dynamic fragments [5]. In this thesis, however, the focus is on defining simple functions for readability.
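For comparison, a dynamic fragment version of the same filter could look roughly like the sketch below (an illustration of the technique mentioned above, not code used in the thesis). The fragment would live in the same Employee module, where Ecto.Query is already imported:

# A reusable filter built with dynamic/2 instead of a query-returning function
def market_filter(market_id) do
  dynamic([e], e.market_id == ^market_id)
end

# The fragment is interpolated into a where clause when the query is built
Employee
|> where(^market_filter(1))
|> Vision.Repo.all()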

The following code would fetch all primary employees of the market with ID of 1:

alias Vision.Employees.Employee

Employee

|> Employee.with_market_id(1)

|> Vision.Repo.all()

This has, however, a small issue: the responsibility for the base query for fetching all employees is given to the caller. Here, Employee is the base query for all employees to which the with_market_id filter is added. The query to fetch all employees might not mean the same as fetching all rows from the employees table. For example, there might later be a requirement to be able to hide employees from the system but retain them in the database. In that case, it would be beneficial to have a separate base query to which a visibility filter could be added. The following code shows a builder function for such a base query:

def all() do
  Vision.Employees.Employee
end

Instead of using Employee directly, the caller can then use the base query builder function:

Employee.all()

|> Employee.with_market_id(1)

|> Vision.Repo.all()

Still, it would be easy to forget to use the all/0 function. The solution is to provide a default query if one is not provided [40]. The all/0 function can be called by default if no other query is provided, as shown in the next code:

def with_market_id(query \\ all(), market_id) do
  where(query, [e], e.market_id == ^market_id)
end

The first argument is now optional and can be omitted:

Employee.with_market_id(1)

|> Vision.Repo.all()

The with_market_id/2 function can be called with just one argument, as Elixir will bind the value to the second argument and use the default value for the first argument. Executing with_market_id(1) returns a query that produces the following SQL:

SELECT e0.*
FROM employees AS e0
WHERE e0.market_id = 1

6.1.2 Secondary employees

In the database, secondary employees are defined as a many-to-many relationship between markets and employees. There is the employees_secondary_markets table containing market_id and employee_id columns. The columns form a composite primary key, so each row in the table is unique. On the application side, there is also the corresponding schema Vision.Employees.EmployeeSecondaryMarket.

To get the secondary employees of a market, an inner join between employees and employees_secondary_markets can be used. Ecto has the join/5 macro for this purpose [8]. The code in program 6.1 adds to the query a filter that includes only employees that are secondary employees of the provided market.


1 alias Vision.Employees.EmployeeSecondaryMarket, as: Esm
2
3 def with_secondary_market_id(query \\ all(), market_id) do
4   query
5   |> join(:inner, [e], esm in Esm, on: e.id == esm.employee_id)
6   |> where([..., esm], esm.market_id == ^market_id)
7 end

Program 6.1. Function for filtering secondary employees.

The function joins the provided query with the EmployeeSecondaryMarket schema, aliased as Esm, using the matching employee ID as the join condition. Finally, the where clause keeps only the employees whose secondary market matches the provided one.

It is worth noting that the where call on line 6 contains at least two bindings: the first binding in the initial query and the binding from the join on the line above. There can be more if the query contains other join clauses. Since the bindings are positional and only the last binding from the join is of interest, the expression [..., esm] is used. The ... notation can be used to ignore bindings that come before the next one matched (esm in this case) [8].

The join clause matches only the first binding, which represents the original employee query. If the provided query has more bindings, the notation [e] simply matches the first binding and ignores the rest [8].

Executing with_secondary_market_id(1) builds a query that produces the following SQL:

SELECT e0.*

FROM employees AS e0

INNER JOIN employee_secondary_markets AS e1 ON e0.id = e1.employee_id

WHERE e1.market_id = 1

6.1.3 Transferred employees

As shown in figure 5.2, the current algorithm for fetching transferred employees is to get the employees that have shifts in a date range and remove both primary and secondary employees from that list. As stated in section 5.1, an employee is a transferred employee for market A on a specific date if the employee has a shift on that date in market A but does not do shifts there anymore.

All shifts in the system are stored in the shifts table. The schema corresponding to that table is Vision.Shifts.Shift. Each shift contains date, employee and market information to determine who is working where and when.

To get the employees that have shifts in a market in a date range, two query builder functions will be defined. Program 6.2 shows the first attempt at a function to filter employees based on having shifts in a specific market.


1 alias Vision.Shifts.Shift
2
3 def with_shifts_in_market(query \\ all(), market_id) do
4   query
5   |> join(:inner, [e], s in Shift, on: e.id == s.employee_id)
6   |> where([..., s], s.market_id == ^market_id)
7   |> distinct([e], e.id)
8 end

Program 6.2. First attempt to filter employees based on having shifts in a market.

The query is very similar to the one built for filtering secondary employees in program 6.1. The two differences are that employees are joined with shifts and that there is an additional distinct clause. Since an employee can, and probably does, have multiple shifts in a market, the distinct clause makes sure that duplicate employees are removed. It is enough to check the ID of the employee in the distinct clause, since the IDs are unique.

The query produced by this function with a market_id of 1 creates the following SQL:

SELECT DISTINCT ON (e0.id) e0.*
FROM employees AS e0
INNER JOIN shifts AS s1
  ON e0.id = s1.employee_id
WHERE s1.market_id = 1

The initial function for filtering employees based on having shifts in a date range is shown in program 6.3.

1 def with_shifts_in_date_range(query \\ all(), from, to) do
2   query
3   |> join(:inner, [e], s in Shift, on: e.id == s.employee_id)
4   |> where([..., s], s.date >= ^from and s.date <= ^to)
5   |> distinct([e], e.id)
6 end

Program 6.3. First attempt to filter employees based on having shifts in a date range.

The function body is identical to the function in program 6.2 except for the conditions in the where clause.

To build a query for employees that have shifts in the market with the ID of 1 in January 2020, the code in program 6.4 could be executed.

1 alias Vision.Employees.Employee
2
3 Employee.with_shifts_in_market(1)
4 |> Employee.with_shifts_in_date_range(~D[2020-01-01], ~D[2020-01-31])

Program 6.4. Query for employees with shifts in market and in date range.

Here, the dates are created with the ~D sigil, which is an alternative representation of a %Date{} struct [34]. Executing the code above, however, throws an error:


** (Ecto.Query.CompileError) only one distinct expression is allowed in query

The error clearly states that only one distinct clause can be defined in a query. Both functions in programs 6.2 and 6.3 define distinct, so chaining the functions causes the error. Removing one of the clauses would solve the issue but would cause the function to return duplicate values if used separately. A better approach is to add the clause to the query only if it does not already exist. This can be achieved with the macro shown in program 6.5.

A macro is defined instead of a function, because no way was found to pass bindings to a function. This is because if a function were given [e] as an argument, Elixir would try to find a variable called e.

Queries in Ecto are stored as the %Ecto.Query{} struct and the macro adds a distinct clause if it is not already defined. Otherwise, the original query is returned unchanged.

1 defmacro distinct_once(query, binding \\ [], expr) do
2   quote do
3     unquoted_query = unquote(query)
4
5     case unquoted_query do
6       %Ecto.Query{distinct: nil} ->
7         distinct(unquoted_query, unquote(binding), unquote(expr))
8
9       _ ->
10         unquoted_query
11     end
12   end
13 end

Program 6.5. A macro to conditionally add a distinct clause.

With the macro in place, distinct in both functions can be replaced with distinct_once.

Now program 6.4 runs without errors. However, another issue can be found by turning the returned query into SQL:

SELECT DISTINCT ON (e0.id) e0.*
FROM employees AS e0
INNER JOIN shifts AS s1
  ON e0.id = s1.employee_id
INNER JOIN shifts AS s2
  ON e0.id = s2.employee_id
WHERE s1.market_id = 1
  AND s2.date >= '2020-01-01'
  AND s2.date <= '2020-01-31'

The join with shifts is declared twice, which leads to a significant performance penalty in PostgreSQL. This can be verified by analyzing the execution times with EXPLAIN ANALYZE [36].

Both query building functions have identical joins, and Ecto retains both joins instead of removing one of them. As with the distinct clause, the join will be added only once, but using a different technique. There can be different types of joins in a query, so simply checking for an existing join clause is not enough. Fortunately, it is possible to give a binding a name and check via the has_named_binding?/2 function whether a binding with that name already exists [8][40].

Following [40], program 6.6 shows the initial version of a private function to join employees with shifts. It gives the binding the name :shifts when building the join. If the binding already exists, the original query is returned unmodified.

It might, however, become quite tedious to write this kind of a function with a manual check for all joins across the application. To solve this, a macro is written, as shown in program 6.7.

1 defp join_shifts(query) do
2   if has_named_binding?(query, :shifts) do
3     query
4   else
5     join(query, :inner, [e], s in Shift,
6       as: :shifts,
7       on: e.id == s.employee_id
8     )
9   end
10 end

Program 6.6. Initial function for joining employees with shifts.

1 defmacro join_named(query, qual, binding, expr, opts) do
2   binding_name = Keyword.get(opts, :as)
3
4   if binding_name == nil do
5     error_msg =
6       "missing named binding for join_named/5, :as option required"
7     raise ArgumentError, error_msg
8   end
9
10   quote do
11     unquoted_query = unquote(query)
12
13     if has_named_binding?(unquoted_query, unquote(binding_name)) do
14       unquoted_query
15     else
16       unquoted_query
17       |> join(
18         unquote(qual),
19         unquote(binding),
20         unquote(expr),
21         unquote(opts)
22       )
23     end
24   end
25 end

Program 6.7. A macro to help with named joins.

In program 6.7, the contents of the quote block are essentially the same as the function in program 6.6. The macro contains the necessary unquoting, and the arguments to join are parameterized. The macro accepts the same arguments as the standard join/5. On line 4, an additional check is made to make sure that the :as option is given, to enforce the creation of a named binding.

With the macro in place, the join_shifts/1 function becomes simpler:

defp join_shifts(query) do
  join_named(query, :inner, [e], s in Shift,
    as: :shifts,
    on: e.id == s.employee_id
  )
end

Now, the functions in programs 6.2 and 6.3 can be adjusted as shown in program 6.8.

1 def with_shifts_in_market(query \\ all(), market_id) do
2   query
3   |> join_shifts()
4   |> where([..., s], s.market_id == ^market_id)
5   |> distinct_once([e], e.id)
6 end
7
8 def with_shifts_in_date_range(query \\ all(), from, to) do
9   query
10   |> join_shifts()
11   |> where([..., s], s.date >= ^from and s.date <= ^to)
12   |> distinct_once([e], e.id)
13 end

Program 6.8. Final functions to filter employees by shifts in market and date range.

It is worth noting that the where clauses on lines 4 and 11 could also be written as where([shifts: s], ...) by leveraging the created named binding. However, since the name of the binding is hardcoded in the join_shifts/1 function, using the named binding in the where clauses would create an awkward dependency. One option would be to pass the binding name as an argument to the join_shifts function, but that is not straightforward, because the binding name must be a compile-time atom and cannot be replaced with a dynamic variable. As the latest join is referenced, it is simple to use the last positional binding, which guarantees the correct mapping.
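For illustration, the named binding form discussed above would look roughly like this (a sketch only; the thesis keeps the positional form):

def with_shifts_in_market(query \\ all(), market_id) do
  query
  |> join_shifts()
  # references the :shifts binding by name instead of by position
  |> where([shifts: s], s.market_id == ^market_id)
  |> distinct_once([e], e.id)
end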


The SQL generated by the query in program 6.4 is now as intended:

SELECT DISTINCT ON (e0.id) e0.*
FROM employees AS e0
INNER JOIN shifts AS s1
  ON e0.id = s1.employee_id
WHERE s1.market_id = 1
  AND s1.date >= '2020-01-01'
  AND s1.date <= '2020-01-31'

To build the final query for transferred employees, another where clause is added to the query in program 6.4 to filter out the employees that are either primary or secondary. Program 6.9 shows the complete function for the transferred employees' query.

On line 15, the where clause removes primary and secondary employees from all the employees that have shifts in the market during the date range. The where clause needs the IDs of the employees to be removed, so the queries for these are built between lines 2 and 10. It is worth noting that these queries are the ones that were built in sections 6.1.1 and 6.1.2, but with the added select clause to select only the IDs.

1 def all_transferred(market_id, from, to) do
2   primary_employees_q =
3     market_id
4     |> with_market_id()
5     |> select([e], e.id)
6
7   secondary_employees_q =
8     market_id
9     |> with_secondary_market_id()
10     |> select([e], e.id)
11
12   market_id
13   |> with_shifts_in_market()
14   |> with_shifts_in_date_range(from, to)
15   |> where(
16     [e],
17     e.id not in subquery(primary_employees_q) and
18       e.id not in subquery(secondary_employees_q)
19   )
20 end

Program 6.9. Functions to build a query for transferred employees.

The SQL generated by the function in program 6.9 is shown in program 6.10.

1 SELECT DISTINCT ON (e0.id) e0.*
2 FROM employees AS e0
3 INNER JOIN shifts AS s1
4   ON e0.id = s1.employee_id
5 WHERE s1.market_id = 1
6   AND s1.date >= '2020-01-01'
7   AND s1.date <= '2020-01-31'
8   AND NOT e0.id IN (
9     SELECT se0.id AS id
10     FROM employees AS se0
11     WHERE se0.market_id = 1
12   )
13   AND NOT e0.id IN (
14     SELECT se0.id AS id
15     FROM employees AS se0
16     INNER JOIN employee_secondary_markets AS se1
17       ON se0.id = se1.employee_id
18     WHERE se1.market_id = 1
19   )

Program 6.10. SQL generated by the all_transferred query.

6.1.4 The final query

In this subsection, the queries built in subsections 6.1.1, 6.1.2 and 6.1.3 are combined to form the final query for all employees in the specified market. As stated in section 5.1, all employees in a market consist of primary employees, secondary employees and transferred employees. The final query, shown in program 6.11, builds a union of these sets of employees, removing duplicates.

1 def all_in_market(market_id, from, to) do
2   primary_emps_q = with_market_id(market_id)
3   secondary_emps_q = with_secondary_market_id(market_id)
4
5   final_q =
6     market_id
7     |> all_transferred(from, to)
8     |> union_field(primary_emps_q, :id)
9     |> union_field(secondary_emps_q, :id)
10
11   all()
12   |> where([e], e.id in subquery(final_q))
13 end

Program 6.11. The query builder function for case 1.

union_field/3 is a helper function built to simplify unions by a single column. The implementation is shown in program 6.12.

IDs are selected before the unions instead of running complete employee rows through the unions. This is because it is enough, for performance, to compare the IDs for duplicates rather than whole rows. Also, if this function returned a query ending with a union, the caller could not order it with an order_by clause without wrapping it in a subquery first. The function would then have to return a subquery, which means that a subquery is inevitable anyway.
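As a brief sketch of that point (illustrative only): because all_in_market/3 hides the union inside a subquery, a caller that needs a specific order can simply append an order_by clause to the returned query:

all_in_market(1, ~D[2020-01-01], ~D[2020-01-31])
|> order_by([e], asc: e.id)
|> Vision.Repo.all()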

1 def union_field(%Ecto.Query{select: nil} = query1, query2, field) do
2   query2 =
3     query2
4     |> select([q], field(q, ^field))
5
6   query1
7   |> select([q], field(q, ^field))
8   |> union(^query2)
9 end
10
11 def union_field(%Ecto.Query{select: _} = query1, query2, field) do
12   query2 =
13     query2
14     |> select([q], field(q, ^field))
15
16   query1
17   |> union(^query2)
18 end

Program 6.12. Helper function for unions by a column.

The generated SQL with a market_id of 1 and a date range of January 2020 is shown in program 6.13.

1 SELECT e0.*
2 FROM employees AS e0
3 WHERE e0.id IN (
4   SELECT DISTINCT ON (se0.id) se0.id AS id
5   FROM employees AS se0
6   INNER JOIN shifts AS ss1
7     ON se0.id = ss1.employee_id
8   WHERE ss1.market_id = 1
9     AND ss1.date >= '2020-01-01' AND ss1.date <= '2020-01-31'
10     AND NOT se0.id IN (
11       SELECT se0.id AS id
12       FROM employees AS se0
13       WHERE se0.market_id = 1
14     )
15     AND NOT se0.id IN (
16       SELECT se0.id AS id
17       FROM employees AS se0
18       INNER JOIN employee_secondary_markets AS se1
19         ON se0.id = se1.employee_id
20       WHERE se1.market_id = 1
21     )
22   UNION
