

AN APPLICATION WITH DOCKER AND AMAZON WEB SERVICES

Duc Nguyen

Bachelor’s thesis

Autumn 2019

Information Technology

Oulu University of Applied Sciences


ABSTRACT

Oulu University of Applied Sciences

Information Technology, Software Engineering

Author: Duc Nguyen

Title of Bachelor's thesis: An application with Docker and Amazon Web Services

Supervisor: Kari Laitinen

Term and year of completion: Autumn 2019

Number of pages: 41 pages + 3-page appendix

The main purpose of this thesis is to broaden the selection of cloud services by suggesting a guideline that can be utilized to deploy applications on Amazon Web Services. The reasons behind the decision to choose AWS Batch over AWS Lambda and other services are discussed. The author's case company, Tori, has been using AWS Batch to process high volumes of data in batches. Thus, the application must be scalable, resilient, nimble, and able to handle heavy loads.

Additionally, this thesis demonstrates the steps needed to deploy an application using AWS and Docker, which are modern technologies commonly used in software development today. The release pipeline in this project was built to meet the demands of modern software development and to improve efficiency and robustness for the developers.

The result of this project is fully functional software that plays an essential role in the case company and processes a significant amount of data every day. The requirements for the software, such as elasticity and consistency, are satisfied. Further improvements are planned and will be implemented in the near future.

Keywords: PHP, Docker, AWS


VOCABULARY ... 6

1 INTRODUCTION ... 7

1.1 Thesis objectives and research question ... 7

1.2 The Company TORI ... 7

2 TECHNOLOGY ... 10

2.1 Object-oriented Programming ... 10

2.2 Docker ... 11

2.3 Amazon Web Services ... 13

2.3.1 AWS CloudFormation ... 13

2.3.2 AWS Elastic Compute Cloud ... 14

2.3.3 AWS Lambda... 15

2.3.4 AWS Batch ... 17

2.3.5 Relational Database Service ... 18

2.3.6 Elastic Container Registry ... 18

2.3.7 Simple Storage Service ... 19

2.4 Continuous Integration, Delivery & Deployment ... 20

3 IMPLEMENTATION ... 21

3.1 Analysis and design... 21

3.1.1 Building the System with AWS Services ... 21

3.1.2 Sputnik components ... 23

3.2 Development process ... 25

3.2.1 Important concepts in advertisements ... 25

3.2.2 Installing dependencies ... 26

3.2.3 Dockerization ... 27

3.2.4 Building Docker container ... 28

3.2.5 CloudFormation ... 29

3.3 Usage ... 30

3.3.1 Command ... 30

3.3.2 Architecture... 32

3.3.3 Database model ... 33

3.3.4 Testing ... 35

3.3.5 CI/CD ... 36

3.4 Further improvement ... 37

3.4.1 Implement unit tests ... 37

3.4.2 Speed up Parser ... 37

3.4.3 Improve locking mechanism ... 38

4 CONCLUSION ... 39

5 REFERENCES ... 40

LIST OF FIGURES AND TABLES ... 41

APPENDICES ... 42


VOCABULARY

API    Application Programming Interface
AWS    Amazon Web Services
CD     Continuous Delivery
CI     Continuous Integration
EC2    Elastic Compute Cloud
ECR    Elastic Container Registry
HTML   Hypertext Markup Language
JSON   JavaScript Object Notation
OOP    Object Oriented Programming
PHP    PHP: Hypertext Preprocessor
RDS    Relational Database Service
S3     Simple Storage Service
XML    Extensible Markup Language


1 INTRODUCTION

1.1 Thesis objectives and research question

RESEARCH OBJECTIVE

The goal of this thesis is to document and suggest a guideline that can be used for future reference when developing and deploying applications with Amazon Web Services and Docker. Technologies are continuously moving forward, enabling a vast variety of possibilities that were once impossible due to technical limitations. The aim of this project is to utilize cutting-edge technologies for developing software, describe the process, and point out the advantages and disadvantages. The author's inspiration is Sputnik, an integration project that processes thousands of advertisements to be published on the Tori.fi website.

RESEARCH QUESTIONS

Research questions act as a lodestar which helps authors stay on the right path when researching and writing. In this thesis, the author focuses on answering the following research questions:

1. Which platform and Amazon Web Services components can be used to implement and deploy an integration application seamlessly, especially for a project written in PHP?

2. What software architecture is needed to develop an integration program?

3. How can Docker be set up and used with Amazon Web Services?

4. What is Continuous Deployment, and how can it be used in this specific case?

1.2 The Company TORI

Tori.fi is the biggest and most prominent customer-to-customer (C2C) marketplace in Finland. However, there are many competitors, for example Facebook Marketplace, Nettiauto, and Huuto.net, which indicates that there is still room to grow. On Tori.fi, there are buyers and sellers of all kinds of goods: furniture, vehicles such as cars or motorbikes, electronics, hobby equipment, etc.

C2C online marketplaces are common selling channels for end users around the world, giving online vendors a platform to reach a broad customer base. Tori.fi plays an important role in the circular economy by raising awareness about reusing and repairing products, as well as by reducing the consumption of non-essential goods to minimize waste.

Although C2C marketplaces face competition from more recently arrived companies such as Facebook or Amazon, they still hold strong positions in people's daily life. For example, in France, Leboncoin (leboncoin.fr) is the leader in the C2C market across all trading categories, and the sister company of Tori.fi in Sweden, Blocket (blocket.se), has about 5 million unique visitors per week, which is 70% of Sweden's population.

Tori.fi ranks first in general trading and second in vehicle trading. More than 2.4 million unique visitors use Tori.fi every month, and over 1 million advertisements are on the website at any given time. In 2018, more than 10.3 million advertisements were published in total, along with 3.05 million successful transactions valued at about €626 million. Tori.fi has become an important part of Finnish life in ten years and is the seventh most visited website in Finland, with nearly 270 million total visits in 2018 (data extracted from the company's private annual report). The growth in the number of annual visits is shown in Figure 1.

Figure 1 Annual visits to Tori.fi Year-on-Year growth (company private report)


As Tori.fi grows, the connection between Tori and other businesses needs to be strengthened. There are many ways of cooperation; one of them is affiliate marketing, which is the act of advertising other companies' products to Tori's customers. Thus, the need to fetch data from those companies and publish it on the Tori.fi website is pressing, and that is the reason why the application in this thesis was developed.

2 TECHNOLOGY

In this chapter, the author introduces the technologies used in this project, defines them, and explains why they were chosen.

2.1 Object-oriented Programming

Object-oriented programming (OOP) is a programming paradigm organized around data, or objects, rather than functions and logic. An object can be described as a data field that has unique attributes and behavior. Objects can range from physical entities, such as a person described by properties like name and address, down to small computer programs, such as widgets. This contrasts with the older approach to programming, in which the priority was placed on how the logic was written rather than on how the data within the logic is defined.

The first step in OOP is to identify all the objects a programmer wants to manipulate and to determine how these objects relate to one another, an exercise often known as 'data modeling'.

Once an object is identified, it is generalized as a class of objects that defines the kind of data it contains and the logic sequences that can manipulate it. Each distinct logic sequence is known as a 'method', and objects communicate through well-defined interfaces called 'messages'.

Essentially, OOP focuses on the objects that developers want to manipulate rather than on the logic required to manipulate them. This approach suits programs that are large, complex, and actively scaled or maintained. Because of the way an object-oriented program is organized, the technique is also helpful for collaborative development, where work can be divided between teams. Additional benefits include shorter programs, modularity, and flexibility. (Gabriel, Steele & Kessler, 2013)

OOP is based on the following principles:

Encapsulation: the implementation and state of each object are held privately inside a defined boundary, or class. Other objects do not have access to this class and cannot make changes to it; they can only call a list of public functions, or methods. This characteristic of data hiding provides greater program security and avoids unintended data corruption.

Abstraction: objects only expose the internal mechanisms that are relevant to the use of other objects, hiding any unnecessary implementation code. This concept helps developers make changes and additions over time more easily.

Inheritance: relationships and subclasses between objects can be assigned, enabling developers to reuse common logic while maintaining a unique hierarchy. This property of OOP encourages a more thorough data analysis, reduces implementation time, and ensures a higher level of accuracy.

Polymorphism: objects are allowed to take on more than one form depending on the context. The program determines which meaning or usage is necessary for each execution of that object, eliminating the need to duplicate code.

Although Simula is credited as the earliest OOP language, the most frequently used OOP languages today include Java, JavaScript, Python, C++, Visual Basic.NET, Ruby, Scala, and PHP (Madsen, Møller-Pedersen & Nygaard, 1993). In this thesis, the author mainly works with PHP, so the other languages are not covered here.

PHP (a recursive abbreviation for PHP: Hypertext Preprocessor) is an open-source scripting language specialized for web development that can be embedded into HTML. It was initially created by Rasmus Lerdorf in 1994 (php.net). The language can communicate with a server and generate dynamic web pages for the user. What distinguishes PHP from something like client-side JavaScript is that the code is executed on the server, generating HTML which is then sent to the client. The client receives the result of running that script without knowing what the underlying code was. It is even possible to render the entire HTML document object model with PHP.
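As a small illustration (not taken from the thesis), the following minimal snippet runs a loop on the server and sends only the resulting HTML to the client; the item list is a made-up example:

<?php
// The loop executes on the server; the client only receives the generated <ul> markup.
$items = ['furniture', 'vehicles', 'electronics'];
?>
<ul>
<?php foreach ($items as $item): ?>
    <li><?= htmlspecialchars($item) ?></li>
<?php endforeach; ?>
</ul>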

2.2 Docker

Docker debuted to the public at PyCon in Santa Clara in 2013 and was released as open source in March 2013. At that time, it used Linux Containers (LXC) as its default execution environment. A year later, with the release of version 0.9, Docker replaced LXC with its own component, written in the Go programming language.

Docker is an open platform for developing, delivering, and running applications. Docker enables users to separate their applications from the infrastructure so they can deliver software quickly. With Docker, users can control their infrastructure in the same ways they manage the applications (Docker Overview). Docker virtualization allows software to be executed in an isolated and controlled environment referred to as a container.

"In Docker containers, dependencies are provided exactly as intended by the developer and, consequently, they simplify the distribution of scientific software and foster reproducible research." (List, 2017)

All containers are run by a single operating-system kernel and are therefore more lightweight than virtual machines.

Objects

Docker objects are the different entities used to assemble an application in Docker. The primary classes of Docker objects are images, containers, and services.

Containers are built from Docker images; an image is a read-only template used to store and ship applications (Docker Overview). Docker is by far the most well-known implementation of operating-system virtualization; at present, its online registry service (Docker Hub) stores more than 4.5 million software images. Using that registry, it is possible to download and deploy Docker images as software containers.

A Docker container is a standardized, encapsulated environment that runs applications (Ellingwood, 2015). A container is managed using the Docker Application Programming Interface (API) or Command-line Interface (CLI).

A Docker service allows containers to be scaled across several Docker daemons (a daemon is a persistent process that manages Docker containers and handles container objects). The result is known as a 'swarm': a set of cooperating daemons that communicate through the Docker API.

Tools

There are two orchestration tools commonly used with Docker applications: Docker Compose and Docker Swarm. Docker Compose is the tool relevant to the author's work, so it is described below.

Docker Compose is a tool for defining and running multi-container Docker applications. It uses YAML Ain't Markup Language (YAML) files to configure the application's services and performs the creation and start-up of all the containers with a single command. The docker-compose CLI utility enables users to run commands on multiple containers at once, for instance building images, scaling containers, restarting stopped containers, and more.

Commands related to image manipulation, or user-interactive options, are not relevant in Docker Compose because they address a single container. The docker-compose.yml file is used to define an application's services and includes various configuration options. For instance, the build option defines build configuration such as the Dockerfile path, and the command option allows one to override the default Docker command. The first public beta version of Docker Compose (version 0.0.1) was released on December 21, 2013, and the first production-ready version (1.0) was made available on October 16, 2014.


2.3 Amazon Web Services

Amazon Web Services (AWS) is a subsidiary of Amazon that provides on-demand cloud computing platforms and APIs to individuals, companies, and governments on a metered pay-as-you-go basis (Performance Analysis of High Performance Computing Applications on the Amazon Web Service Cloud, 2010). In other words, these cloud computing web services provide a set of primitive, abstract technical infrastructure and distributed computing building blocks and tools. One of these services is Amazon Elastic Compute Cloud (EC2), which allows users to have a cluster of virtual computers available at all times through the Internet. AWS's virtual computers emulate most of the attributes of a real computer, including hardware central processing units (CPUs) and graphics processing units (GPUs), hard-disk/SSD storage, and RAM; a choice of operating systems; networking; and pre-loaded application software such as web servers, databases, customer relationship management (CRM), and so on.

The AWS technology is implemented at server centers throughout the world and maintained by the Amazon subsidiary. Fees are based on usage and on the hardware, operating system, software, and networking features chosen by the subscriber, as well as on the required availability, redundancy, security, and service options. Subscribers can pay for a single virtual AWS computer, a dedicated physical computer, or clusters of either. As part of the subscription agreement, Amazon provides security for the subscribers' systems (aws.amazon.com, n.d.).

In 2017, AWS comprised more than 90 services (165 as of 2019) spanning a wide range including computing, storage, networking, databases, and more. The most well-known include Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (Amazon S3). Most services are not exposed directly to end users but are instead offered through APIs for developers to use in their applications. Amazon Web Services can be accessed over Hypertext Transfer Protocol (HTTP), using the Representational State Transfer (REST) architectural style and the Simple Object Access Protocol (SOAP).

Amazon markets AWS as a way of obtaining large-scale computing capacity more quickly and cheaply than building a physical server farm. All services are billed based on usage, but each service measures usage in its own way.

2.3.1 AWS CloudFormation

CloudFormation is an AWS service that enables users to model and provision infrastructure easily. Users describe every resource they need AWS to spin up, either in a text file or with a programming language, and AWS then takes care of creating everything. The CloudFormation templates can be stored in Amazon Simple Storage Service and then deployed by AWS CloudFormation to set up the infrastructure (see Figure 2).

Figure 2 AWS CloudFormation

CloudFormation ensures that dependent resources in the template are all created in the proper order. For instance, suppose one needs to create a Route 53 DNS record and an EC2 instance, with the DNS record pointing to the EC2 instance. CloudFormation will configure the EC2 instance first, wait until it is ready, and only then create the DNS record. In this way AWS CloudFormation orchestrates the provisioning of the necessary infrastructure.

So instead of writing a script with many AWS API calls, waits, loops, and retry logic, users simply describe what they need and let CloudFormation do it for them.
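To make the ordering example concrete, below is a minimal illustrative CloudFormation template, not taken from this project; the resource names, AMI id, and hosted zone are placeholders. Because the record set references the instance with GetAtt, CloudFormation creates the instance first and the DNS record afterwards.

# Illustrative template only; all names and ids are placeholders.
AWSTemplateFormatVersion: '2010-09-09'
Description: Example stack with an EC2 instance and a Route 53 record pointing at it

Resources:
  ExampleInstance:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: ami-0123456789abcdef0   # placeholder AMI id
      InstanceType: t3.micro

  ExampleRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.     # placeholder hosted zone
      Name: app.example.com.
      Type: A
      TTL: '300'
      ResourceRecords:
        - !GetAtt ExampleInstance.PublicIp   # creates the instance before the record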

2.3.2 AWS Elastic Compute Cloud

Amazon Elastic Compute Cloud (EC2) is a web service that provides resizable computing capacity in the AWS cloud (aws.amazon.com). It is intended to give software engineers full control over web-scale computing resources.

EC2 instances can be resized, and the number of instances can be scaled up or down according to users' requirements. The instances can be launched in one or more geographical regions and Availability Zones (AZs). Each region contains several AZs at distinct sites, connected by low-latency networks within the same region. EC2 also enables users to build applications that scale automatically according to changing needs and peak periods, and it makes it easy to deploy virtual servers and adjust storage, reducing the need to invest in hardware and streamlining deployment procedures.

Setting up EC2 involves creating an Amazon Machine Image (AMI), which includes an operating system, applications, and configurations. The AMI is loaded into Amazon Simple Storage Service (S3) and registered with EC2, after which users can launch virtual machines as needed.

Amazon offers different EC2 purchasing options for various requirements and budgets, including hourly (on-demand), dedicated, and spot rates.

According to aws.amazon.com, there are many benefits of using cloud computing:

• Increased speed and agility: cloud computing allows users to focus on more valuable IT assets rather than on infrastructure and physical data centers. With EC2, it is easy to deploy a large number of servers in minutes and turn them off when they are no longer in use.

• Trading capital expense for variable expense: there is no need to invest in data centers and servers; EC2 allows users to pay only for what they have used.

• Benefiting from massive economies of scale: users get lower costs than they could normally achieve on their own.

• No more guessing capacity: with cloud computing, users can start with little capacity and then scale up and down as needed.

• Saving money on running and maintaining data centers: cloud computing lets users focus on more important matters than data centers.

• Globalization: it is simple to deploy to multiple regions with Amazon EC2.

Although cloud computing still has some risks, for example that security, data protection, and the privacy agreement are handled by a third party, or that data may be located beyond national borders, the advantages still greatly outweigh the disadvantages. Therefore, more and more companies are moving to the cloud.

2.3.3 AWS Lambda

AWS Lambda is an event-driven, serverless computing service in Amazon Web Services. It is a compute service that runs code in response to events and automatically manages the computing resources required by that code. It was first introduced in November 2014.

AWS Lambda was originally used for cases such as reacting to images or objects being uploaded to Amazon S3, updates to DynamoDB tables, website clicks, or readings from Internet of Things (IoT) sensors. AWS Lambda can also be used for back-end services triggered by custom HTTP requests, and these services can be shut down when not in use (see Figure 3).


Figure 3 AWS Lambda

Unlike Amazon EC2, which is priced per second, AWS Lambda is priced in increments of 100 milliseconds. Usage below a certain level falls within the AWS Lambda free tier, which, unlike the free tier of some other AWS services, does not expire a year after registration.

From the users' point of view, serverless services are typically characterized by the following capabilities (Infrastructure Cost Comparison of Running Web Applications in the Cloud Using AWS Lambda and Monolithic and Microservice Architectures, 2016):

• No server management: there is no software or runtime to provision, maintain, or administer.

• Flexible scaling: users can scale their application automatically or by adjusting its capacity through changing the units of consumption (for example throughput or memory) rather than units of individual servers.

• High availability: serverless applications have built-in availability and fault tolerance. Users do not have to architect for these capabilities because the services running the application provide them by default.

• No idle capacity: users do not need to pay for idle capacity. There is no need to pre-provision or over-provision capacity for things like compute and storage, and there is no charge when the code is not running.

Lambda can be described as Function as a Service (FaaS). As with any other FaaS, the management, provisioning, scaling, and reliability are handled automatically by AWS. This lets users work at a more abstract level, concentrating entirely on the business logic and (almost) ignoring the resources underneath.


2.3.4 AWS Batch

As the name suggests, AWS Batch enables users to run their tasks on the Amazon Web Services cloud in batches. Batch computing makes it easy to obtain a large amount of computing capacity. Cloud computing services are generally easy to use in terms of usability and efficiency, and AWS Batch is no exception: it lets users benefit from the service without worrying about configuring and managing the underlying infrastructure. AWS Batch can quickly allocate the necessary resources when new jobs are submitted, and the allocation is done in a way that keeps the overall compute cost as low as possible without affecting the delivery time of the results.

With AWS Batch, users can run batch computing workloads of any size. The service is fully elastic in its operation: it automatically allocates the necessary resources depending on the size of the workload, and every allocation is optimized. Users are not required to install any batch computing software on their own systems, which lets them concentrate on solving the problem at hand and analyzing the results.

The four main components of AWS Batch are jobs, job definitions, job queues, and compute environments.

Jobs: a job is a unit of work submitted to AWS Batch, such as a Linux executable or a shell script. A job in AWS Batch has a name and runs on an Amazon EC2 instance. An AWS Batch job can also refer to other jobs and identify them by name or ID, and one job can depend on the state of another job.

Job definitions: a job definition contains the details of a given job and describes how the job is to be run. Each job requires resources to complete its execution, and the definition specifies them; for example, users can set the memory required for the job. The definition can also handle other aspects of an AWS Batch job, such as persistent storage, container settings, and environment variables (an illustrative sketch of a job definition is shown after this list).

Job queues: jobs are submitted to a specific queue before they are executed. A scheduling algorithm runs and schedules a given job onto a compute environment for execution. When submitting AWS Batch jobs, priorities can be assigned: a job can be placed in a higher-priority queue, or in a low-priority queue so that it executes when resources are cheaper.

Compute environments: a compute environment comprises all the resources needed to run AWS Batch jobs. Users can choose their preferred type of compute instance or manage their own environment in an Amazon ECS cluster.
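To make the job definition concept concrete, below is an illustrative definition in the JSON form accepted by the AWS Batch register-job-definition API; the repository URL, resource values, command, and environment variable are placeholders, not the definitions used by Sputnik.

{
  "jobDefinitionName": "sputnik-parser",
  "type": "container",
  "containerProperties": {
    "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/sputnik:latest",
    "vcpus": 1,
    "memory": 2048,
    "command": ["run", "parser"],
    "environment": [
      { "name": "ENVIRONMENT", "value": "production" }
    ]
  }
}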

2.3.5 Relational Database Service

Amazon Relational Database Service (Amazon RDS) is a service that makes it simpler to set up, run, and scale a relational database in the AWS Cloud. It provides cost-efficient, resizable capacity for an industry-standard relational database and manages common database administration tasks.

According to Amazon (https://aws.amazon.com/rds), some of the significant features are:

• Multi-Availability Zone (AZ) deployment: Amazon RDS Multi-AZ maintains a synchronous standby copy of the database in a different location. This allows the database to keep working without manual intervention in case of a database outage.

• Read replicas: read replicas can be used, for example, to scale out read-heavy databases. The feature is available for MySQL, MariaDB, and PostgreSQL, with up to five replicas.

• Performance metrics and monitoring: available through the Amazon CloudWatch API.

• RDS costs: RDS has a pricing model similar to Amazon Elastic Compute Cloud (EC2). RDS is billed per hour and there are two options: either paying the normal hourly rate, or paying an up-front fee and receiving a discount on the hourly rate. In addition, users also pay other fees such as data transfer, Input/Output (I/O) operations, and so on.

• Backups: Amazon RDS has automated backups that create and store database snapshots for a maximum of 35 days.

• Operation: database instances can be managed from the AWS Management Console, with the Amazon RDS APIs, or with the AWS Command Line Interface. Increasing the allocated database storage is supported, but decreasing it is not.

2.3.6 Elastic Container Registry

Amazon Elastic Container Registry (ECR) is a Docker container registry that can be used to store, manage, and deploy Docker container images. Amazon ECR is integrated with Amazon Elastic Container Service (ECS), simplifying the development-to-production workflow. Amazon ECR hosts users' images in a highly available and scalable architecture, allowing them to reliably deploy containers for their applications. Integration with AWS Identity and Access Management provides resource-level control of each repository. With Amazon ECR, there are no upfront fees or commitments; users pay only for the amount of data they store in their repositories and the data transferred to the Internet (see Figure 4).

Figure 4 AWS Elastic Container Registry

2.3.7 Simple Storage Service

Amazon Simple Storage Service (Amazon S3) is a service that offers storage and protection for data of any size to customers of any scale. Amazon S3 provides simple management features which allow users to organize and access the data with just a few clicks. It gives software engineers access to fast, reliable, scalable, and inexpensive data storage infrastructure around the world, with the goal of maximizing the benefits of scale and passing those benefits on to developers (Robinson, 2008).

Amazon S3 can also be used to store artifacts for other AWS services, for example AWS Lambda function packages or AWS CloudFormation templates (see Figure 5).

Figure 5 AWS Simple Storage Service

Amazon S3 emphasizes simplicity and robustness, which is the reason for its minimal feature set, and Amazon advertises it as providing 99.999999999% durability (aws.amazon.com).

2.4 Continuous Integration, Delivery & Deployment

Continuous Integration (CI): a practice that requires developers to integrate code into a common code repository several times a day. Each commit is checked by an automated build, including for example code quality checks, syntax style review, and snapshot testing. CI helps improve code quality and detect errors early so they can be fixed quickly. Many teams find that this approach leads to significantly reduced integration problems and allows a team to develop cohesive software more rapidly (Fowler, Continuous Integration, 2006). Benefits of implementing CI include smaller code changes, faster error identification and reaction time, and more reliable tests (Meyer, 2014).

Continuous Delivery (CD): the ability to produce software in short cycles, with new features and changes, while ensuring that the software is always deployable. The deployment itself, however, is triggered manually. Continuous delivery is achieved by continuously integrating the software, building executables, and running automated tests on them to detect problems, and then pushing the executables into increasingly production-like environments to ensure the software will work in production (Fowler, Continuous Delivery, 2016). Continuous delivery brings many benefits (https://continuousdelivery.com/):

• Low-risk releases: the main purpose of continuous delivery. The process makes software deployments trouble-free and low-risk, so that releases can be carried out at any time, on demand.

• Faster time to market: the time-consuming integration and testing phase of software delivery can be eliminated by embedding it into the developers' daily work. This shortens the time needed to deliver software to the market significantly.

• Higher quality: automated testing helps assure the quality of the code, so developers can focus on other kinds of testing that improve the quality of the software, such as usability and performance testing.

• Lower cost: by reducing the time to market and improving program quality, companies reduce both the cost of fixes at release time and personnel costs.

Continuous Deployment: a practice where the software is released whenever code is committed. The changes pass through an automated testing stage and are then automatically released to the production environment. This method removes the human safeguard against erroneous code in exchange for fast deployment. The process aims to deploy software to customers as soon as new code is developed, and it can bring many benefits to an organization, such as new business opportunities, reduced risk for each release, and avoiding the development of wasted software (Information and Software Technology, 2015). However, many IT organizations are unable to adopt continuous deployment due to regulatory compliance or legal restrictions.


3 IMPLEMENTATION

Tori.fi imports a sizeable chunk of its advertisements from external sources. At the time of writing in November 2019, 58,000 of the 1.2 million advertisements on Tori are imported from external feeds (from company private report).

Sputnik is the integration tool that is used to synchronize the advertisements from the feeds with Tori.fi: it fetches the data from the feeds, transforms it, and finally pushes the data to Tori. It runs different jobs frequently to make sure that the data on Tori is as up to date as possible. The data and information used in this thesis are mainly based on Tori's private annual report.

3.1 Analysis and design

3.1.1 Building the System with AWS Services

There are two main options for building the system with AWS services: basing the solution around either AWS Batch or AWS Lambda. Both have the benefit of not needing to manage individual servers, except for the UI and the database. However, AWS Batch is more flexible and is not limited in terms of memory, which can be a limiting factor here since large amounts of data are handled; AWS Lambda instances can use at most 3,008 MB of memory.

AWS Batch: the service is ideal for short-running tasks, with no extra cost on top of the computing resources. It runs short-lived Docker containers on ECS instances and requires no server management, as it starts up and shuts down servers on demand; therefore, it is very easy to operate. Furthermore, it is possible to leverage spot instances with AWS Batch, which can lower server costs significantly.

The fact that it only requires Docker containers also makes it more flexible than something like AWS Lambda in terms of technology choices. Furthermore, relying only on Docker keeps the application easily testable locally, as there is no need to use a real or simulated AWS environment for testing. Some of the feeds require the application to contact them from a specific set of IPs; with AWS Batch, however, the IPs keep changing as it spins up new machines. This problem can be overcome by setting up a NAT gateway that makes all outbound requests look like they originate from the same IP.

Other considered solutions: AWS Lambda combined with AWS Simple Queue Service, Simple Notification Service, or other AWS services could also have worked for this application. The idea was to have a Lambda function processing the feed that would then split the feed into small work items, such as create/update/delete operations, to be placed on a queue and handled by further Lambda functions. However, the queues do not support 'First In, First Out' processing for Lambda, and they offer at-least-once delivery, which means that duplicate messages would have to be handled. This could have been done by attaching a sequence number to the messages and saving the sequence number to the database so that the ordering of the messages could be understood. However, this option seemed overly complex, and it would also have limited the choice of languages, since Lambda supports only a limited set of languages, while the legacy code was written in PHP, which cannot run on Lambda without significant effort. Furthermore, Lambda deployment would have been more complicated compared to simply pushing Docker images.

AWS Step Functions: this could be used to orchestrate the workflow of processing individual advertisements.

With Step Functions, a state machine can be defined to keep track of the work state and then decide the next action based on the current state. AWS Lambda and AWS Batch jobs can be launched as a response to a state change. The service can help with decomposing offline processing operations into several discrete steps which helps with retries and removes the need to use queues to distribute the workloads to workers.

However, this service is too expensive for what it provides. For example, in April 2019 over 5 million advertisements were processed by Sputnik (company private report), and each state transition would cost $0.000025, making the cost at least 5,000,000 * $0.000025 = $125 per month, assuming only a single state transition per ad. On average, however, more than one state transition per ad would most likely be needed, which would make the cost at least $250-$500 per month. Moreover, this cost is solely for Step Functions and does not include the Lambda cost. On the other hand, it would be possible to use services like AWS Batch or AWS Lambda and call Step Functions for each advertisement that is new, deleted, or changed, but even then, the benefits of Step Functions are not clear.

CloudWatch/Lambda Scheduling: The application scheduling model is shown in Figure 6.

The Lambda Scheduler schedules new jobs based on the job definitions, the job queue, currently running jobs, and finished jobs. The scheduler reads the list of job definitions from AWS Batch and the job states from CloudWatch, and uses them to decide whether it should schedule more jobs. In this project, the application should not run more than one instance of each job type at a time. Therefore, the scheduler first checks which jobs are in the job queue and in the compute environment (running jobs), and those job types are immediately excluded from scheduling. The scheduler then reads the finished jobs and schedules every job type whose last run completed more than WAIT_AFTER_LAST_RUN minutes ago (the value comes from the job definition parameters).
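A minimal Python sketch of this scheduling logic is shown below. It assumes a single job queue and a WAIT_AFTER_LAST_RUN parameter on each job definition; the queue name is a placeholder, and for simplicity the sketch reads finished jobs from the AWS Batch API rather than from CloudWatch, so it is not the project's actual scheduler.

# Illustrative Lambda scheduler sketch (assumptions noted above).
import datetime
import boto3

batch = boto3.client("batch")
JOB_QUEUE = "sputnik-job-queue"  # placeholder queue name

ACTIVE_STATUSES = ["SUBMITTED", "PENDING", "RUNNABLE", "STARTING", "RUNNING"]


def last_finished_at(job_name):
    """Return the latest stop time (seconds) of a succeeded job with this name, or 0."""
    jobs = batch.list_jobs(jobQueue=JOB_QUEUE, jobStatus="SUCCEEDED")["jobSummaryList"]
    stops = [j["stoppedAt"] / 1000.0 for j in jobs
             if j["jobName"] == job_name and "stoppedAt" in j]
    return max(stops, default=0)


def lambda_handler(event, context):
    # Job types that are already queued or running are never scheduled again.
    busy = set()
    for status in ACTIVE_STATUSES:
        for job in batch.list_jobs(jobQueue=JOB_QUEUE, jobStatus=status)["jobSummaryList"]:
            busy.add(job["jobName"])

    now = datetime.datetime.utcnow().timestamp()
    for definition in batch.describe_job_definitions(status="ACTIVE")["jobDefinitions"]:
        name = definition["jobDefinitionName"]
        if name in busy:
            continue
        wait_minutes = int(definition.get("parameters", {}).get("WAIT_AFTER_LAST_RUN", "10"))
        if now - last_finished_at(name) > wait_minutes * 60:
            batch.submit_job(jobName=name, jobQueue=JOB_QUEUE,
                             jobDefinition=definition["jobDefinitionArn"])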

3.1.2 Sputnik components

The architecture of the Sputnik application is shown in the component diagram in Figure 7, and each component is responsible for a different task, as described in Table 1.

Figure 7 Application components diagram


Table 1 Application components usage

Bash scripts: Arranges the running of the various components.

ParserController: Controls the handling of new information from the feeds.

Parsers: Read the feed files which contain the ads and parse each ad into an array for further processing.

Mappers: Transform the parsed data into a common representation.

ImageHandler: Gets image information from the database and uploads new images. Only images from active companies are pushed.

Deleter: Deletes ads that are no longer available in the feeds.

AdPusher: Pushes advertisements to Tori and processes the responses in order to link advertisements in Tori and the application together.

The responsibility of Sputnik is to fetch data from external sources, process the data, and push it to the database to publish advertisements on the Tori.fi website. The data can be in XML or JSON format, depending on the provider, so the software must be able to handle both formats and must be extensible to process other file types when required.

The application must have following characteristics:

• Fast and stable: More than 60,000 advertisements are imported from this application to be shown on the website that serves more than 2.1 million people every month.

• Resilient: Customers from other companies want to get their advertisements published on the website without any delay. Hence it is important to make the application as durable and as fast to recover as possible.

• Scalable: The number of advertisements coming from companies and the number of supported categories are increasing remarkably. Therefore, one vital trait of the program is that it must be easy to scale. The application should follow the Open-Closed Principle, which means it should be open for extension but closed for modification. (Martin, 2017, p. 70)


3.2 Development process

3.2.1 Important concepts in advertisements

There are some important advertisement concepts that are needed in order to implement the integration application:

Identifier: a piece of information that identifies the ad. For example, a license plate number can be used to identify a vehicle. Sputnik uses the identifier to link advertisements coming from the feed with advertisements in Sputnik.

Each feed contains a different set of advertisements linked to a different set of identifiers. Some feeds might have their own internal identifiers associated with each ad. For example, one company gives us their own id for each ad:

{
    ...
    "co2Emissions": null,
    "isEco": false,
    "id": "13179390",
    "name": "Hyundai IONIQ electric ComfortEco Holmgrens Edition",
    "vehicleType": "car",
    "bodyType": "Halvkombi",
    ...
}

As there are various kinds of identifiers and the identifiers of different feeds may conflict with each other, it is not enough to identify the ad based on just the identifier. Therefore, the ad is identified based on a combination of the feed name and the identifier. For example, the complete identifier for the above example would be ("bytbil", "13179390").

Hashes: Once Sputnik has fetched the latest data from the feed, it compares each advertisement from the feed with the data in the Sputnik database by comparing a newly calculated hash with the one found in the database. There are two different hashes: the ad hash (ads.hash) and the image hash (ads.image_hash). The ad hash is calculated over the whole ad except for the images and some other fields. The image hash, on the other hand, is constructed from the list of image URLs in the advertisement, taking their order into account. The split into two hashes stems from the fact that a single hash cannot tell us what has changed in the advertisement when the hash differs; it can only tell that something has changed. With two hashes it is possible to tell which part of the data has changed.
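A minimal PHP sketch of this two-hash idea is shown below; the field names and the choice of md5 as the hashing function are assumptions, not Sputnik's actual implementation.

<?php
// Illustrative ad hash and image hash calculation (assumed field names and hash function).

function adHash(array $ad): string
{
    // The ad hash ignores the images (and would ignore other volatile fields as well).
    unset($ad['images']);
    ksort($ad);
    return md5(json_encode($ad));
}

function imageHash(array $imageUrls): string
{
    // The image hash depends on the list of image URLs and their order.
    return md5(json_encode(array_values($imageUrls)));
}

$ad = [
    'id'     => '13179390',
    'name'   => 'Hyundai IONIQ electric',
    'images' => ['https://example.com/1.jpg', 'https://example.com/2.jpg'],
];

$hash      = adHash($ad);
$imageHash = imageHash($ad['images']);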

Status: Each ad in the ads table has a status field that indicates its state. First of all, each ad is related to a company (ads.company_id → companies.company_id) which can be either active or not (companies.active). Advertisements that do not belong to active companies are not pushed to Tori, nor are their images uploaded to Yams. The status of an ad can be:

• Synched: The ad has been pushed to Tori and is up to date.

• New: The ad has never been pushed to Tori.

• Modified: The ad has changed in the feed. It will be pushed to Tori if the company it belongs to has active='t'.

• Discardednew: The ad was new, and the pushing has failed.

• Discardedmodified: The ad was modified, and the pushing has failed.

• Deleted: The ad is missing from the feed and has been deleted from Tori.

• To_be_deleted: The ad is waiting to be deleted from Tori.

• Delete_failure: Failed to delete the ad from Tori.

These are the most important concepts in Sputnik. Every advertisement carries the attributes mentioned above, and the application operates based on them.

3.2.2 Installing dependencies

As mentioned previously, the required dependency packages were installed using a dependency management tool called Composer. Installed packages are stored locally in a directory inside the project, which can be included in the Docker build without any unnecessary global installations.

After setting up Composer, an initialization command needs to be run to generate the composer.json file before installing packages.

$ composer init

All packages can be installed using the following command:

$ composer require {package-name}

Any package that is installed will be saved to composer.json. In this project, six packages are used: the Human Language and Character Encoding Supporter (ext-mbstring), a JSON web token library (php-jwt), the Amazon Web Services Software Development Kit (aws-sdk-php), an arbitrary-precision integer arithmetic library (phpseclib), the PHP unit testing framework (phpunit), and a PHP unit snapshot testing library. The resulting composer.json is shown in Figure 8.

Figure 8 Packages in composer.json

A simple command can be run with the purpose of installing required dependencies:

$ php composer.phar install
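As Figure 8 is not reproduced in this text version, a composer.json along the lines described above might look like the following sketch; the vendor names and version constraints are assumptions, not the project's actual file.

{
    "name": "tori/sputnik",
    "description": "Illustrative composer.json sketch; vendor names and versions are assumptions",
    "require": {
        "ext-mbstring": "*",
        "firebase/php-jwt": "^5.0",
        "aws/aws-sdk-php": "^3.0",
        "phpseclib/phpseclib": "^2.0"
    },
    "require-dev": {
        "phpunit/phpunit": "^8.0",
        "spatie/phpunit-snapshot-assertions": "^2.0"
    }
}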

3.2.3 Dockerization

The first step to dockerize an application is to create a Dockerfile, which acts as a set of instructions for how the image is built. The steps to build the application need to be written down in the Dockerfile, as shown for example in Figure 9.

Figure 9 Dockerfile content

First, all the dependencies must be downloaded and installed:

• FROM: selects a base image from Docker Hub, where Docker images are stored globally and can be fetched for local use. In this case, an image that contains PHP version 7.3.6 is specified.

• RUN: runs the installation commands in the image's shell. Composer, PHP extensions, and other dependencies are downloaded and installed inside the Docker container.

• The working directory inside the container is set to the application's home directory.

After that, all the files of the application must be copied into the Docker container. The COPY command below copies the files of the current directory into the container:

COPY --chown=sputnik:sputnik . /home/sputnik/

Finally, the container is instructed how to start the program:

ENTRYPOINT [ "/bin/bash", "./entrypoint.sh" ]

The ENTRYPOINT instruction above defines how the application is initialized. The command can be understood as running /bin/bash ./entrypoint.sh in the terminal.
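Since Figure 9 is not reproduced in this text version, the following is a minimal sketch of what such a Dockerfile could look like, assuming the PHP 7.3.6 base image, a sputnik user, and the COPY and ENTRYPOINT lines described above; the package and extension choices are assumptions, not the project's actual file.

# Illustrative Dockerfile sketch (assumptions noted above).
FROM php:7.3.6-cli

# System packages and a PHP database extension needed by the application (assumed)
RUN apt-get update \
    && apt-get install -y --no-install-recommends git unzip libpq-dev \
    && docker-php-ext-install pdo_pgsql \
    && rm -rf /var/lib/apt/lists/*

# Composer, copied from the official Composer image
COPY --from=composer:1 /usr/bin/composer /usr/bin/composer

# Unprivileged user and working directory
RUN useradd --create-home sputnik
WORKDIR /home/sputnik

# Application code and PHP dependencies
COPY --chown=sputnik:sputnik . /home/sputnik/
RUN composer install --no-dev --optimize-autoloader

USER sputnik
ENTRYPOINT [ "/bin/bash", "./entrypoint.sh" ]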

3.2.4 Building Docker container

There are two stages needed to get the application fully operational within a Docker container:

Create Docker image: a Docker image is created with the following command when inside the application directory:

$ docker build .

After the first run, Docker caches the individual file layers, which significantly improves the build speed afterwards. A success message with the image ID is shown in the terminal.

Figure 10 Docker build successful message

The image can be named and tagged by using the -t flag in the build command, which makes it easier to handle.

$ docker build -t {name:tag} .

When the build is done, the list of existing images can be obtained by running the command:

$ docker images

Figure 11 List of Docker image

Docker Compose: The next step after successfully creating the image is to install and connect to the database. This is where Docker Compose comes into play. With Docker Compose, it is possible to describe and run multiple Docker containers at once by creating a YAML file which includes all the configuration of the application's services. All the services from that configuration can then be built and started with a single command (a sketch of such a file is shown after the list below):

$ docker-compose build


Figure 12 docker-compose.yml content

In Figure 12 above, there are two services that will be built when running the build command: PostgreSQL version 11.3 and Sputnik. The configuration file includes:

• Build: The context and the name of the build can be set in Docker Compose.

• Environment: Environment variables for the build are given if needed.

• Image: The image can be built from scratch or pulled from Docker Hub.

• Networks: Networks can be created, or an existing network can be specified for the service to join.

• Ports: Defines which external port corresponds to which internal port in the container.

• Volumes: Assigns Docker volumes.
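Because Figure 12 is not reproduced in this text version, the sketch below shows what such a docker-compose.yml could look like; the service names, credentials, ports, and volume names are placeholders, not the project's actual configuration.

# Illustrative docker-compose.yml sketch (all values are placeholders).
version: "3.7"

services:
  db:
    image: postgres:11.3
    environment:
      POSTGRES_USER: sputnik
      POSTGRES_PASSWORD: example      # placeholder, not a real credential
      POSTGRES_DB: sputnik
    ports:
      - "5432:5432"
    volumes:
      - dbdata:/var/lib/postgresql/data
    networks:
      - sputnik-net

  sputnik:
    build:
      context: .
    image: sputnik:latest
    environment:
      DB_HOST: db
      DB_NAME: sputnik
    depends_on:
      - db
    networks:
      - sputnik-net

volumes:
  dbdata:

networks:
  sputnik-net: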

3.2.5 CloudFormation

As mentioned above, CloudFormation is an Infrastructure as Code tool provided by AWS that allows its users to manage their infrastructure effectively. The CloudFormation template is a file that contains all the AWS services and parameters required for the application to work properly. Hence, CloudFormation makes it easier to deploy the application in another environment and simpler to modify the infrastructure.

In this project there are two templates: the first one describes the resources needed to run the application, such as the Lambda Scheduler, the database, and so on; the second one describes the job definitions that are used by AWS Batch to declare the appropriate command, variables, waiting interval, etc.

3.3 Usage

3.3.1 Command

A Docker container can be executed by running the command:

$ docker run {image:tag} [COMMAND] [ARG...]

This starts a container which holds the program and runs the command inside it. In this project, Docker Compose is used to start and run multiple connected containers at once with the command:

$ docker-compose run SERVICE [COMMAND] [ARG...]

To ease the execution of the application, a list of commands was created that can be run easily within the Docker container. The commands live in a Bash script written in Unix shell language, which comprises all the procedures, functions, and commands used to run the application. The list of commands is shown in Figure 13:

Figure 13 List of application commands

To implement these commands, they are passed as arguments when the Bash file is launched. The script reads the input from the terminal as an argument and activates the correct command, as implemented in Figure 14. If the argument is empty, the application terminates.


Figure 14 Script to read command in Shell

A case (switch) statement checks the user input and reacts accordingly if the command is in the list. Running a command executes all the script code inside its branch.

Figure 15 Example implementation of command

The command above is triggered when users execute the following line:

$ {bash_file} run [ARG…]

Here bash_file is the name of the script file, run is the command keyword, and ARG is passed as parameters to the script. It is also possible to implement functions inside a Bash file; for example, when the functions below are called, the application exits with either a 'Success' or an 'Error' code, as in Figure 16.

Figure 16 Implement function in Shell Script

There are other commands included in the file, such as database cleaning, testing, or accessing the database. They are implemented in the same way as the 'run' command; an illustrative sketch of the whole dispatcher is shown below.
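The sketch below illustrates such a dispatcher script in the spirit of Figures 13-16; the command names, helper functions, and the programs they invoke are assumptions rather than Sputnik's actual script.

#!/bin/bash
# Illustrative command dispatcher sketch (command names and helpers are assumptions).

exit_success() { echo "Success"; exit 0; }
exit_error()   { echo "Error: $1" >&2; exit 1; }

COMMAND="$1"
shift

if [ -z "$COMMAND" ]; then
    exit_error "No command given"
fi

case "$COMMAND" in
    run)
        # Run the main application with any extra arguments (entry script name is hypothetical)
        php run.php "$@" && exit_success || exit_error "run failed"
        ;;
    test)
        vendor/bin/phpunit "$@" && exit_success || exit_error "tests failed"
        ;;
    psql)
        # DATABASE_URL is a hypothetical connection string provided by the environment
        psql "$DATABASE_URL" && exit_success || exit_error "psql failed"
        ;;
    *)
        exit_error "Unknown command: $COMMAND"
        ;;
esac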

3.3.2 Architecture

The application in this thesis runs completely on Amazon Web Services and has the components shown in Figure 17.

Figure 17 Application architecture in AWS

The process begins with the Scheduler, which is a Python script run by AWS Lambda. The scheduling script runs every minute to get job states from AWS CloudWatch and job definitions from AWS Batch. After that, the script determines which job needs to run next and adds the necessary task to the job queue. AWS Batch then spins up an Amazon ECS instance using the Docker image from Amazon ECR and secret keys from AWS Secrets Manager.

It is possible to adjust the maximum vCPUs of AWS Batch to limit the number of ECS instances running simultaneously. Once started, the ECS instance runs the command given in the job definition, with all the environment variables and secrets; the job processes the data from the feeds and pushes it to external services according to the job definition. The instance also connects to the PostgreSQL database on Amazon RDS to store data and sends log lines to Amazon CloudWatch.


The whole infrastructure of the application is described and provisioned by CloudFormation which makes it easier to manage and deploy the project in another environment.

3.3.3 Database model

In this project, AWS RDS is used to set up a PostgreSQL database. The database model is shown in Figure 18, and Table 2 lists the database tables.

Figure 18 Application database model

Table 2 Database table description

ads: Information on the parsed advertisements.
  - identifier: identifies the product that is being sold. It can be, for example, the license plate number of a car.
  - hash: used to tell if the advertisement has changed after the last time it was seen.
  - image_hash: used to tell if the array of images in the ad has changed.
  - image_status: tells if all the images in the ad have been fetched.

images: Information about the images relating to the fetched advertisements.
  - fetched: tells if Sputnik has ever tried to download the image.

ad_params: Advertisement parameters as understood by Tori.

trans_errors: Errors returned while pushing the advertisements.

companies: Company data extracted from the files.

companies_overwrite_information: Overwrites the information found in the companies table when pushing the advertisements.

ad_archive: Old advertisements are moved here after deletion. Not shown anywhere.

ad_params_archive: Parameters from deleted advertisements are moved here. Not shown anywhere.

Several indexes have been implemented in the database to increase the speed of the data retrieval queries (see Figure 19).

Figure 19 List of indexes in database

However, the indexes also slow down update queries, so they need to be used carefully.
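As an illustration, index definitions of this kind could look like the SQL sketch below; the index names and some column names (for example a feed column) are assumptions, and the actual indexes are those listed in Figure 19.

-- Illustrative PostgreSQL index definitions (names and columns are assumptions).
CREATE INDEX ads_status_idx ON ads (status);
CREATE INDEX ads_company_id_idx ON ads (company_id);
CREATE UNIQUE INDEX ads_feed_identifier_idx ON ads (feed, identifier);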


3.3.4 Testing

Testing is an important part of developing software applications. The core of the application in this thesis was written in 2013, and there was hardly any need for testing at that time. As time has gone on, testing has become a crucial part of the development process. Despite the need for testability, there are some difficulties in implementing tests for this application, as it requires many external dependencies and connections, which limits the selection of test methods that can be used.

At present, snapshot testing has been implemented to test the Parsers and Mappers parts of the application. It generates a JSON snapshot from the test data using a Parser and a Mapper, and then compares the new snapshot with the existing one. The test succeeds only when both snapshots are identical, as shown in Figure 20.

Figure 20 Successful test result

If the test fails, an error message is shown with the difference between the two snapshots, as in Figure 21.

Figure 21 Fail test result

The snapshots can be updated from the command line and should be included when committing to version control.

Figure 22 Update snapshot test result
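To illustrate the approach, below is a minimal PHP sketch of such a snapshot test using a PHPUnit snapshot-assertion library; the parser class name, fixture path, and the specific library are assumptions, not Sputnik's actual test code.

<?php
// Illustrative snapshot test sketch (class names, fixture path and library are assumptions).
use PHPUnit\Framework\TestCase;
use Spatie\Snapshots\MatchesSnapshots;

class ParserSnapshotTest extends TestCase
{
    use MatchesSnapshots;

    public function testParsedFeedMatchesSnapshot(): void
    {
        $parser = new BytbilParser();   // hypothetical parser class
        $ads = $parser->parse(file_get_contents(__DIR__ . '/fixtures/feed.json'));

        // Fails when the generated JSON differs from the stored snapshot;
        // snapshots can be regenerated with: vendor/bin/phpunit -d --update-snapshots
        $this->assertMatchesJsonSnapshot(json_encode($ads));
    }
}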

3.3.5 CI/CD

After generation, the generated template is sent to the s3://sputnik-cloudformation bucket to be used by CloudFormation. However, the stack is not updated automatically by this process; it has to be updated manually by giving CloudFormation the URL of the uploaded template and updating the parameters if necessary.

New Sputnik code is deployed automatically to production/dev by pushing code to master/dev, respectively.

Travis builds the image based on the latest code and runs static analysis and tests to make sure that everything is in order before the final push to ECR. The updated image is automatically used by the jobs that start after the image has been pushed.

In this project, continuous deployment is carried out by Travis CI. This is done by adding a '.travis.yml' configuration file to the project and adding a webhook to GitHub (see Figure 23).

Figure 23 travis.yml content
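Since Figure 23 is not reproduced in this text version, the sketch below shows what such a .travis.yml could look like; the PHP version, deploy script path, and branch condition are assumptions, not the project's actual configuration.

# Illustrative .travis.yml sketch (values are assumptions).
language: php
php:
  - '7.3'

services:
  - docker

install:
  - composer install

script:
  # Static analysis and tests must pass before anything is deployed
  - vendor/bin/phpunit
  - docker build -t sputnik:latest .

deploy:
  provider: script
  script: bash ./deploy.sh   # hypothetical script that generates templates and pushes the image
  on:
    all_branches: true
    condition: $TRAVIS_BRANCH =~ ^(master|dev)$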

Whenever new code is committed to GitHub, Travis builds a Docker image from the commit and runs static analysis and tests to ensure there are no errors in the change. After that, new CloudFormation templates are generated by a Python script and sent to the S3 bucket if the branch is 'master' or 'dev', as shown in Figure 24.

Figure 24 Push CloudFormation to AWS S3

However, the CloudFormation stack is not applied automatically by this process. The templates need to be applied manually by giving CloudFormation the S3 URL and updating the parameters if required.

Finally, the Docker image is built and pushed to AWS ECR. The build-and-push script is shown in Figure 25.

Figure 25 Push Docker image to AWS ECR
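As Figure 25 is not reproduced in this text version, a build-and-push script in this spirit might look like the sketch below; the account id, region, repository name, and the use of TRAVIS_BRANCH as the image tag are assumptions.

#!/bin/bash
# Illustrative build-and-push sketch (account id, region, repository and tag are assumptions).
set -euo pipefail

ECR_REPO="123456789012.dkr.ecr.eu-west-1.amazonaws.com/sputnik"
TAG="${TRAVIS_BRANCH:-dev}"

# Log in to ECR (AWS CLI v1 syntax), then build and push the image
$(aws ecr get-login --no-include-email --region eu-west-1)
docker build -t "${ECR_REPO}:${TAG}" .
docker push "${ECR_REPO}:${TAG}"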

The updated image is used immediately by the production or testing environment after it has been pushed.

3.4 Further improvement

3.4.1 Implement unit tests

Although snapshot tests are already in place, implementing unit tests is an important improvement, as each approach has its own advantages and disadvantages. While snapshot testing is easy to set up, it is hard to understand what is being tested and what the expected result is. This raises the need for unit testing, which can be used to test smaller parts of the code and find bugs. Moreover, unit testing makes it possible to test functions and classes that cannot be checked without complicated setup.

3.4.2 Speed up Parser

With the current setup, the Parser has become the bottleneck of the application. While multiple instances of the other components can run at the same time, only one Parser can run at a time. This could be improved by first comparing the incoming data with the existing information (id, hashes, ...) in the database and then only processing further the information that actually needs it. This would make the Parser significantly faster, but it requires changes in both the code and the database to get everything working properly.

3.4.3 Improve locking mechanism

At the moment, the locking mechanism in the database is based on the ads.image_status and ads.locked columns in the ads table. This is excessive, as ads.locked has a built-in unlocking mechanism and there is no specific reason to use ads.image_status as a lock when choosing the ads to update.

Furthermore, there are cases where images.fetched is set even when nothing has been done with the images. This happens because images.fetched is set to 'true' at the beginning of the image handling process. This can be fixed by querying the database directly and setting the value only when the image processing has finished.


4 CONCLUSION

The main objective of this thesis is to provide a guideline and share experiences about implementing an application in the cloud. While working on this project, the author learned about Docker and the Amazon Web Services components, and about their advantages and disadvantages, in order to make suitable choices for the software.

Due to the scope of the thesis, some topics have not been covered. One of them is the alerting and monitoring system, which is handled by internal software and is too broad for the scope of a thesis.

The application was successfully deployed and is fully functional in the cloud. The chosen components meet the expectations and have proved to operate consistently without any major issues. However, there is still room for improvement in the future, as mentioned in section 3.4.

When AWS is mentioned, people usually think of common services such as EC2 or Lambda. This project, however, uses another service, AWS Batch, to run scheduled tasks. The author hopes this will provide a basis for an alternative way to implement software applications using AWS.

All in all, the aim of this thesis, which was to describe the procedure for application deployment using Docker and AWS, has been accomplished. The application is running in production at the moment and plays an essential role in the biggest online marketplace in Finland.

5 REFERENCES

aws.amazon.com. (n.d.). Retrieved from https://aws.amazon.com.

aws.amazon.com. (n.d.). Amazon RDS. Retrieved from https://aws.amazon.com/rds.

aws.amazon.com. (n.d.). Amazon S3. Retrieved from https://aws.amazon.com/s3/.

continuousdelivery.com. (n.d.). Retrieved from https://continuousdelivery.com/.

Docker Overview. (n.d.).

docs.aws.amazon.com. (n.d.).

Ellingwood. (2015). The Docker Ecosystem: An Introduction to Common Components.

Fowler, M. (2006). Continuous Integration.

Fowler, M. (2016). Continuous Delivery.

Gabriel, R. P., Steele, G. L., & Kessler, R. R. (2013). LISP and Symbolic Computation.

Information and Software Technology. (2015).

Infrastructure Cost Comparison of Running Web Applications in the Cloud Using AWS Lambda and Monolithic and Microservice Architectures. (2016).

List, M. (2017). Using Docker Compose for the Simple Deployment of an Integrated Drug Target Screening Platform. Journal of Integrative Bioinformatics.

Madsen, O. L., Møller-Pedersen, B., & Nygaard, K. (1993). Object-Oriented Programming in the BETA Programming Language.

Martin, R. C. (2017). Clean Architecture. Pearson.

Meyer, M. (2014). Continuous Integration and Its Tools.

Performance Analysis of High Performance Computing Applications on the Amazon Web Service Cloud. (2010).

php.net. (n.d.). Retrieved from php.net.

Replication for Availability & Durability with MySQL and Amazon RDS. (2011).

Robinson, D. (2008). Amazon Web Services Made Simple: Learn how Amazon EC2, S3, SimpleDB and SQS Web Services enables you to reach business goals faster.

Yi, S., Kondo, D., & Andrzejak, A. (2010). Reducing Costs of Spot Instances via Checkpointing in the Amazon Elastic Compute Cloud.


LIST OF FIGURES AND TABLES

Figure 1 Annual visits to Tori.fi Year-on-Year growth (company private report) ... 8

Figure 2 AWS CloudFormation ... 14

Figure 3 AWS Lambda ... 16

Figure 4 AWS Elastic Container Registry ... 19

Figure 5 AWS Simple Storage Service ... 19

Figure 6 Scheduling UML ... 22

Figure 7 Application components diagram ... 23

Figure 8 Packages in composer.json ... 27

Figure 9 Dockerfile content ... 27

Figure 10 Docker build successful message... 28

Figure 11 List of Docker image ... 28

Figure 12 docker-compose.yml content ... 29

Figure 13 List of application commands ... 30

Figure 14 Script to read command in Shell ... 31

Figure 15 Example implementation of command ... 31

Figure 16 Implement function in Shell Script... 31

Figure 17 Application architecture in AWS... 32

Figure 18 Application database model ... 33

Figure 19 List of indexes in database ... 34

Figure 20 Successful test result ... 35

Figure 21 Fail test result ... 35

Figure 22 Update snapshot test result ... 35

Figure 23 travis.yml content ... 36

Figure 24 Push CloudFormation to AWS S3... 36

Figure 25 Push Docker image to AWS ECR ... 37

Table 1 Application components usage ... 24

Table 2 Database table description ... 33

APPENDICES

1. Parser and Mapper UML diagram


2. ImagePusher UML diagram

3. AdPusher UML diagram
