
The Development of a Content Management System for Small-Scale Voice Controlled Websites

Johanna Laitila

Master’s thesis

School of Computing
Computer Science

October 2021


University of Eastern Finland, Faculty of Science and Forestry, Kuopio
School of Computing

Computer Science

Laitila, Johanna: The Development of a Content Management System for Small-Scale Voice Controlled Websites

Master’s thesis, 55 p.

Supervisors: Tapani Toivonen and Ren Ohmura
October 2021

Abstract: Using voice to control computers has become more common in recent years, especially on mobile devices. Voice as an input method makes computing more accessible to groups of users such as the visually impaired, who might not be able to fully use traditional graphical user interfaces.

Voice controls have not been widely implemented in the context of websites due to the lack of ready-made tools and the web development experience required from developers.

Thus, we propose a new Content Management System that supports creating voice-controlled websites with ease. Such a Content Management System would make the development of websites easier, faster, cheaper, and more accessible to the general public.

The proposed system is targeted at small-scale websites for small business and personal use, with support for creating informational websites. End users can navigate the created websites using the graphical user interface as well as voice commands, and the site can answer their voice commands using speech synthesis to make the website experience more conversational.

We found that combining a Content Management System with support for voice-controlled websites can make the development of accessible websites easier and potentially increase the number of accessible websites on the internet. The proposed system is the first stage of developing a full Content Management System, and with further development it could be put into use on the internet.

Keywords: Web Development; CMS; Content Management Systems; Speech Recognition; Web Speech API; Accessibility

ACM CCS (2012)

Information systems → Web applications; Human-centered computing → Web-based interaction; Sound-based input / output


University of Eastern Finland, Faculty of Science and Forestry, Kuopio
School of Computing

Computer Science

Laitila, Johanna: The Development of a Content Management System for Voice-Controlled Websites
Master's thesis, 55 p.

Supervisors: Tapani Toivonen and Ren Ohmura
October 2021

Abstract: Speech recognition is becoming increasingly common on computers and mobile devices. Speech recognition improves the accessibility of computing, especially for the visually impaired, who may not be able to fully use graphical user interfaces. The use of speech recognition in the context of the internet and websites is nevertheless rare.

Adopting speech recognition on websites is difficult due to the lack of ready-made tools and the amount of expertise required from developers. In response to these problems, this thesis develops a Content Management System that enables voice-operated websites to be created easily. The Content Management System enables accessible websites to be developed more easily, faster and more efficiently than previous solutions.

The system presented in the thesis is aimed at developing small-scale websites, which can be used in personal and business contexts to present informational and text-based content. The sites can be used both through a graphical user interface and through voice commands based on speech recognition, and the site can respond to the user through speech synthesis.

The results of the thesis show that the Content Management System has the potential to make the development of voice-controlled websites easier and thereby increase the accessibility of the internet. The system developed in the thesis is the first stage towards implementing a complete Content Management System for creating voice-controlled websites, and with further development it could be deployed on the public internet.

Keywords: Web Development; CMS; Content Management Systems; Speech Recognition; Web Speech API; Accessibility

ACM CCS (2012)

Information systems → Web applications; Human-centered computing → Web-based interaction; Sound-based input / output


Preface

This thesis was written for the University of Eastern Finland and the Toyohashi University of Technology during the summer and autumn of 2021. Even though the COVID-19 pandemic changed my original plans of an exchange year in Japan to remote studies, I am thankful to both universities for giving me the opportunity to complete my studies in the Double Degree Master's Programme.

I would like to thank my supervisors Tapani Toivonen and Ren Ohmura for their guidance, supervision, and good feedback on the thesis work. I would also like to thank Marko Jäntti for his supervision at the early stages of the research. Finally, I would like to thank my friends and family for their support during the whole writing process.


Acronyms

API      Application Programming Interface
CMS      Content Management System
GUI      Graphical User Interface
NLP      Natural Language Processing
STT      Speech-To-Text
SUS      System Usability Scale
TTS      Text-To-Speech
VUI      Voice User Interface
WYSIWYG  What You See Is What You Get


Contents

1 Introduction
  1.1 Structure of the Thesis
  1.2 Research Questions and Methodology
    1.2.1 Research Questions
    1.2.2 Research Methodology
2 Background
  2.1 Functionality of Content Management Systems
  2.2 Voice-based Accessible Computer Interaction
  2.3 Web Speech API
    2.3.1 Drawbacks of Web Speech API
  2.4 Previous Research in Voice Usage in Web Applications
    2.4.1 A Voice Controlled E-Commerce Web Application
    2.4.2 A Voice User Interface for football event tagging applications
    2.4.3 Speech Oriented Virtual Restaurant Clerk using Web Speech API and Natural Language Processing
  2.5 Shortcomings in Existing Solutions
3 Approach to the System Design
  3.1 Target Use Cases
  3.2 Target Audience
  3.3 System Requirements
  3.4 Levels of Voice Interaction
  3.5 System Architecture
    3.5.1 Visitor View
    3.5.2 Editor View
    3.5.3 Backend
  3.6 Demonstration of the System
    3.6.1 Visitor View
    3.6.2 Editor View
4 Technical Implementation
  4.1 Data Storage Format
  4.2 Frontend
  4.3 Visitor View
  4.4 Editor View
    4.4.1 Dashboard
    4.4.2 Page Editor
  4.5 Backend
    4.5.1 Database
    4.5.2 REST API
  4.6 Known Issues
5 Evaluation
  5.1 Evaluation of the Usability of the Editor Interface
  5.2 Evaluation of a Website Created with the System
6 Conclusions
  6.1 Limitations of the System and Future Research
  6.2 Summary
References


1. Introduction

Controlling computers with voice has become more widespread over the recent years due to advancements in technology. Virtual assistants such as Google Assistant, Apple’s Siri or Amazon’s Alexa have become very commonly used on mobile devices to accomplish various tasks, and smart speakers have gained popularity in household applications.

Virtual assistants are often used to answer questions, control smart appliances, and get information on subjects such as the weather. An example of a conversation between a user and a smart speaker can be seen in figure 1.1.

Even though voice-controlled software through virtual assistants has become popular, voice controls have not been as widely implemented in desktop computing applications, especially on websites, where Graphical User Interfaces (GUIs) are the de facto standard way of interaction.

Figure 1.1: An example of a conversation between a user and a smart speaker running the Google Assistant software, where the user (on the right) asks questions of the software (on the left). (Google, 2019)


Voice is the most natural form of interaction. It does not require learning new behaviours, unlike GUIs, which require users to learn to operate a mouse to click things (unless using a touchscreen), to learn the layout of a keyboard for typing, and to learn interaction patterns such as double clicking, dragging and dropping, and pressing the Enter key to complete actions. Using voice is the fastest way of interaction and can be operated hands-free, which makes it ideal for accessible computing (Dasgupta, 2018, p. 6). Voice controls can also be combined with visual interfaces: in cases where hands-free operation is not required, actions such as searching for items can be faster by voice than by visual identification or by typing into a search field (Adorf, 2013).

By using voice to control computers, groups who have more limited opportunities to use technology, such as the visually impaired, gain access to computing and websites. For some individuals, the internet might be the only way to be connected with the world and to be included in different communities. Many countries around the world have added accessibility requirements to their laws, at least for governmental and other official websites, but often these requirements are ignored or unenforced. (Zilak, Keselj & Besjedica, 2019)

Voice controls require a way to translate human speech into commands understood by computers. This happens via speech recognition systems (often called Speech-To-Text (STT)). The conversion process works by recording the user's speech through a microphone on their device and sending the recorded audio to an external STT service, typically over the internet. The STT service transcribes the speech into text, which is sent back to the user's device, where the desired commands are interpreted from the sentence. The software can reply to the user with speech by using speech synthesis software (often called Text-To-Speech (TTS)), which creates speech from text. Fully conversational systems can be accomplished by using a combination of speech recognition and speech synthesis. (Adorf, 2013)
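As a minimal sketch of this round trip in a browser supporting the Web Speech API (introduced in section 2.3): the command vocabulary and spoken replies below are purely illustrative, and error handling is omitted.

```typescript
// Minimal STT -> command -> TTS round trip with the Web Speech API.
// The browser records microphone audio and sends it to an external
// STT service; the transcribed text arrives in the onresult handler.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionImpl();
recognition.lang = "en-US";
recognition.interimResults = false;

recognition.onresult = (event: any) => {
  const transcript: string = event.results[0][0].transcript.toLowerCase();

  // Interpret the desired command from the transcribed sentence.
  let reply = "Sorry, I did not understand.";
  if (transcript.includes("next")) reply = "Moving to the next page.";
  if (transcript.includes("back")) reply = "Going back.";

  // Reply to the user through speech synthesis (TTS).
  window.speechSynthesis.speak(new SpeechSynthesisUtterance(reply));
};

recognition.start(); // begin capturing speech from the microphone
```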

As more and more mobile devices have included virtual assistants and other similar software, voice controls have become popular in mobile applications. Outside of mobile devices, voice controls are rare, especially on websites. This is partially due to the technical difficulty of implementing voice controls in websites.

From an accessibility perspective, the implementation of voice controls in websites would be very important. As websites are often among the most important channels for communicating official information, advertising businesses and conducting e-commerce, they need to be accessible to everyone.


The development of (non-voice-controlled) websites is often done by using a Content Management System (CMS), which allows users to create websites and to edit their contents easily from a single interface without much technological expertise. Websites made with CMSs often use pre-made layouts and templates for the design and appearance of the site whereas the content is the responsibility of the user utilising the system.

However, when ready-made templates are used, accessibility requirements can often be skipped, as web designers might not have the knowledge to develop properly accessible websites, and the responsibility for ensuring accessibility is left to the final user, who might have even less knowledge of the requirements.

CMSs lack support for voice controls out of the box, and research shows creating voice-controlled applications often requires considerable knowledge in web development. Typically, voice controls are enabled by the JavaScript Web Speech API or other similar solutions. Thus, this thesis aims to make the development of voice-controlled websites easier by combining a simple CMS solution with support for speech recognition and synthesis through Web Speech API.

The goal is to propose a system where the user can add pages to a website, edit their contents, and let end users navigate between pages using voice commands. The system would make developing voice-navigable websites easier, and the website development experience should be no more difficult than in existing CMSs such as WordPress.

The problems of current systems that the solution proposed by this thesis addresses are:

• the need to make custom solutions for each website the developers want to include voice controls in,

• the need to have considerable knowledge in web development to include voice controls in a website and

• the lack of support for easy implementation of voice controls in existing Content Management Systems.

The CMS proposed by this thesis is developed using the TypeScript programming language, a superset of JavaScript, which is one of the most popular web programming languages according to the Stack Overflow Developer Survey (Stack Overflow, 2020). The front-end user interfaces utilise the React framework1, which is also one of the most popular frameworks for building user interfaces for websites (Stack Overflow, 2020).

1Sometimes also called React.js.


React also allows quick integration with Web Speech API through plugins such as react-speech-recognition by Brill (2021), which works as an abstraction layer over the Web Speech API, making it easier for the developers of the system to use and allowing React components to be controlled through speech recognition. The contents of the CMS are stored in a MongoDB database as JSON objects, and the frontend accesses them through a REST API.
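A minimal sketch of wiring such a plugin into a React component follows. The command patterns and the navigation callback are illustrative, and the exact plugin API may differ between versions of react-speech-recognition.

```tsx
import React from "react";
import SpeechRecognition, { useSpeechRecognition } from "react-speech-recognition";

// Illustrative voice navigation component: "go to <page>" and a few
// synonyms for "next" trigger the navigation callback.
function VoiceNav({ onNavigate }: { onNavigate: (target: string) => void }) {
  const commands = [
    { command: "go to *", callback: (page: string) => onNavigate(page) },
    { command: ["next", "continue"], callback: () => onNavigate("next") },
  ];
  const { transcript, listening } = useSpeechRecognition({ commands });

  return (
    <div>
      <button onClick={() => SpeechRecognition.startListening({ continuous: true })}>
        {listening ? "Listening…" : "Start voice control"}
      </button>
      <p>Heard: {transcript}</p>
    </div>
  );
}
```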

Problems that might occur when developing the system are:

• making sure the voice recognition system knows exactly when it is being talked to,

• making the voice recognition system distinguish between words used to activate commands and key words used to make searches and queries,

• making sure the system can understand synonyms and alternative ways of saying commands and

• making sure the system can filter out stop words from the sentences.

1.1 Structure of the Thesis

Chapter 2 discusses the previous academic background in voice controls as a method of controlling computers, as well as the benefits of voice controls from an accessibility perspective. It investigates implementations of voice controls in websites, answers research question 1 and identifies shortcomings in existing solutions.

Chapter 3 proposes the development of a Content Management System for small-scale voice-controlled websites in response to these shortcomings. It defines the requirements, use cases, target audience and architecture of the system, and answers research questions 2 and 3.

Chapter 4 discusses the development and technical implementation of the CMS proposed in chapter 3. Chapter 5 evaluates the user experience of the developed system through usability testing, and chapter 6 contains conclusions about the thesis.


1.2 Research Questions and Methodology

1.2.1 Research Questions

The research questions of this thesis are:

1. How are voice-controlled websites currently developed?

2. What functionalities are required from a Content Management System to enable easier development of voice-controlled websites?

3. What are the limitations of a CMS aimed at the development of voice-controlled websites that implements the functionalities identified in research question 2?

1.2.2 Research Methodology

This thesis is based on the design science research methodology, a research paradigm in which problems are answered by the creation of new artifacts, which contribute new knowledge to science. The term artifact describes something artificial or constructed. In information technology, artifacts can include algorithms, implemented and prototype systems, or practices. (Hevner & Chatterjee, 2010, p. 5-6)

The design science research methodology process consists of six steps, which are identifying the problem, defining the objectives for a solution, creating the artifact, demonstrating the use of the artifact, evaluating the artifact, and communicating the problem and its importance (Hevner & Chatterjee, 2010, p. 28-30).

Problem identification is used to develop the artifact and to justify the value of the solution (Hevner & Chatterjee, 2010, p. 28). The first step of identifying the problem is addressed in chapter 2, where the background knowledge of Content Management Systems, speech recognition systems and voice-controlled websites is presented and problems in existing systems are identified from the literature. Keywords used for the literature search included browser speech recognition, Web Speech API, content management system and WordPress. The literature search was performed using the Scopus and Microsoft Academic databases.


The objectives for a solution are defined in chapter 3 based on the problem definition in chapter 2. According to Hevner and Chatterjee (2010, p. 29), objectives can be quantitative, describing how a desirable solution would be better than the current solutions, or qualitative, describing an artifact that supports solutions to problems which have not previously been addressed. The objectives in this research are qualitative, as a Content Management System with built-in voice recognition capabilities has not been built before, thus creating a solution for previously unanswered problems.

The creation of the artifact consists of determining the desired functionality of a system, designing the architecture, and creating the final artifact (Hevner & Chatterjee, 2010, p. 29). Chapter 3 discusses determining the desired functionality and designing the architecture, while chapter 4 discusses the technical implementation and development of the system.

Section 3.6 discusses the demonstration of the system by demonstrating the creation of a website with pages and voice commands using the system. Chapter 5 discusses the evaluation of the system, comparing how the objectives of the solution measure against the measured usability of the system.

Table 1.1 highlights the research methodology steps in relation to the chapters and sections of the thesis.

Step                            Chapter/Section
Identifying the Problem         2
Defining the Objective          3
Creation of the Artifact        3, 4
Demonstration of the Artifact   3.6
Evaluation of the Artifact      5

Table 1.1: The Design Science process steps in relation to the chapters of the thesis.


2. Background

The number of websites in the world is growing at a fast pace. Typically, new websites are set up by small and medium businesses to boost their online presence and increase sales. As studies show that 60% of these companies do not yet have their own websites, a lot of room for growth exists. Websites are often developed using Content Management Systems, which are software tools allowing websites to be developed easily and cost-effectively without requiring programming skills. (Martinez-Caro, Aledo-Hernandez, Guillen-Perez, Sanchez-Iborra & Cano, 2018)

CMSs allow users to publish, edit and organise the contents of a website from a single interface. They handle the technical details of displaying the website, leaving the users in charge of only the contents of the site. WordPress is the most popular CMS in the world, and it is used to develop websites ranging from personal blogs to e-commerce sites (Halim, Hebrard, Hartono, Halim & Russel, 2020). Around 33% of websites on the internet are powered by WordPress (Zilak et al., 2019).

The importance of accessibility on the internet has been recognised globally, and many countries around the world have legal requirements for public web pages to be accessible. Accessibility helps many groups, such as the elderly or people with disabilities, browse and communicate through the internet, as for some it is the only way of being in contact with the world and being included in a community. However, despite these legal requirements, many websites are still not accessible, as the requirements are often ignored or unenforced. (Zilak et al., 2019)

When developing websites with CMSs, ensuring accessibility falls into the hands of the people utilising the CMS to create websites. In many cases they might not know the accessibility requirements. The developers of CMSs should therefore design their systems to automatically create accessible websites, so that end users can create accessible websites with ease.


Freitas and Kouroupetroglou (2008) defined a voice portal as a World Wide Web portal which can be accessed entirely by speech. Typical use cases for voice portals include weather information, email, and financial or government transactions. However, the term is quite rarely used in research. Voice portals exist, but implementing them often requires a lot of specialised knowledge in web development. Customised solutions such as the voice-controlled e-commerce application proposed by Kandhari, Zulkernine and Isah (2018) have been developed, but these solutions are only suitable for their specific use cases and have not had their source code publicly released.

Section 2.1 discusses the basic functionality and components of Content Management Systems. Section 2.2 discusses the reasons for using voice as a method of interacting with computers from an accessibility perspective. Section 2.3 discusses Web Speech API, a JavaScript Application Programming Interface enabling speech recognition and synthesis in modern web browsers for voice-based interaction with websites. Section 2.4 discusses previous research on web applications with support for voice interaction, and section 2.5 discusses the shortcomings found in these solutions.

2.1 Functionality of Content Management Systems

Content Management Systems allow easy development of websites by users who are only skilled in basic office software1. They work by separating the contents of the website (e.g., text, images or videos shown on the website) from the presentation (including the styling, layout, colours, and navigation of the website). Typically, the user would choose a ready-made template or have an external party develop the presentation while providing the contents of the website themselves. (Martinez-Caro et al., 2018)

A CMS is often hosted on a web hosting provider that manages the servers and the infrastructure the web applications and databases run on. The user of a CMS can manage the contents of the website, create, edit, or delete pages, change the styling, add users, and access all the other functionalities of the system through the content manager interface. (Martinez-Caro et al., 2018) Figure 2.1 shows the interface of WordPress when a user is editing the contents of a web page.

1Excluding the installation process of a CMS, which is often done by an experienced web developer with knowledge of web servers and databases.


Figure 2.1: The user interface of WordPress where the user is editing the contents of a page called Home. WordPress is What You See Is What You Get (WYSIWYG) software, meaning the contents of pages look the same to the editor as they do to the end user accessing the website. The user can add text and media to the pages and change the styling of the contents without having to know anything about web development.

The contents of the website are stored in a database such as MySQL, independently from the CMS. When a visitor accesses the website, the contents are retrieved from the database. The CMS then combines the contents from the database with the programming code and style information that make up the visual interface, producing a fully functional website. (Martinez-Caro et al., 2018)

While the process of editing the contents of web pages (as in WordPress, shown in figure 2.1) is very simple, the process of setting up a functional website from scratch using a CMS is not. It requires knowledge of selecting a suitable web host, selecting a suitable CMS, creating a database for the CMS, downloading the CMS installer package from the internet, uploading it to a server and running the installation wizard, where connections to the database must be set up. (Martinez-Caro et al., 2018) This is a major obstacle to lowering the bar for creating new websites, but this thesis does not aim to investigate the issue further, as it is out of the scope of the research.


2.2 Voice-based Accessible Computer Interaction

Traditionally, data entry to a computer has been accomplished with a keyboard and a mouse, and information is displayed via text or graphical symbols on a screen. This causes issues for groups of users who have special accessibility requirements, such as the visually impaired or those who cannot fully operate a mouse or a keyboard. (Kandhari et al., 2018)

Voice controls using speech recognition have been proposed as a solution for interacting with computers without reading or typing. Speech recognition is the process of converting spoken language into text readable by computers. Through speech recognition, the user can issue commands with their voice to trigger actions on the computer, which allows software interaction to be designed not just through visual interfaces but around natural language and speech, supporting many kinds of users with various accessibility requirements. (Kandhari et al., 2018)

Early solutions for voice-based web interaction focused on creating voice-based web browsers. In these browsers a speech recognition system is included for issuing commands, and a screen reader reads the contents of the page. This causes problems on websites containing a large number of links, such as web stores, as the screen reader would read every single link aloud. Screen readers also suffer from issues of misinterpretation and poor usability (Kandhari et al., 2018). They do not give the contents of the page any consideration or give the user a chance to get an early overview of the contents of the page (Freitas & Kouroupetroglou, 2008).

Including specially designed voice interaction in websites can increase accessibility and user convenience drastically by allowing users to navigate pages faster, skip through information they do not need, search for specific pages and control the functionality of the web page using voice. (Kandhari et al., 2018)

In addition to accessibility benefits, voice commands can be combined with hand gestures or mouse and keyboard input for faster control of web applications. As voice controls have gained popularity through virtual assistants in devices such as smart speakers and mobile phones, users might come to demand them in web applications as well, as web and mobile applications become less separated. (Adorf, 2013)


Difficulties in building voice-controlled systems include interference from acoustics and background noise, which can make speech unintelligible; a lack of privacy from other people; and the slower time to communicate messages in comparison to visual elements, which allow multiple pieces of information to be shown at once. (Freitas & Kouroupetroglou, 2008)

Other difficulties might include detecting the correct speaker to prevent unwanted commands by other people around the user, having a large enough vocabulary to interpret commands and detecting meaning from long sentences spoken by the users.

Thus, voice controls are most suitable for simple, limited interactions which contain few words and raise no special privacy concerns, such as websites for businesses and public offices displaying contact information and business hours. More complicated voice interactions require greater use of Natural Language Processing functionality, systems to remember conversations and considerably more development time.

2.3 Web Speech API

Many modern web browsers support Web Speech API, a JavaScript Application Programming Interface enabling speech recognition and speech synthesis in web browsers. Web Speech API is the method selected for enabling speech recognition in the system proposed by this thesis, described in chapter 3. As it is built into web browsers, it has the advantage of not requiring any third-party installations or the paid licences that commercial solutions might need. Web Speech API supports a large number of languages, but the supported languages differ from browser to browser, as it is not a universal solution but more of a guideline which each browser may implement individually. Web Speech API is still considered an experimental, cutting-edge technology. (Web Platform Incubator Community Group, 2020)

The speech recognition functionality of Web Speech API allows speech from the user to be captured as audio through a microphone on their device. The audio is then converted to text, which software can use to recognise the intents of the user and call the desired functionality. This is done by an external web service which handles the processing of the speech, while the developer only receives the processed text from the service. (Adorf, 2013)

Web Speech API allows grammars to be specified to constrain the commands and sentences it can detect, improving speech recognition performance (Adorf, 2013). This makes it ideal for creating software with a limited number of voice user interactions but higher command detection accuracy, as opposed to solutions such as virtual assistants, which are built around a universal use case. Restricting the number of words the speech recognition can recognise can increase the accuracy of the system and lead to a better user experience.
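A hedged sketch of the grammar mechanism follows. SpeechGrammarList is exposed under a webkit prefix in some browsers, and an implementation is free to accept a grammar without strictly enforcing it; the command vocabulary here is illustrative.

```typescript
// Constrain recognition to a small command vocabulary with a JSGF grammar.
const GrammarListImpl =
  (window as any).SpeechGrammarList ?? (window as any).webkitSpeechGrammarList;
const RecognitionImpl =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

const grammar =
  "#JSGF V1.0; grammar commands; public <command> = next | previous | home | help ;";

const grammarList = new GrammarListImpl();
grammarList.addFromString(grammar, 1); // weight 1 = highest priority

const recognition = new RecognitionImpl();
recognition.grammars = grammarList; // hint the recogniser towards the vocabulary
recognition.start();
```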

2.3.1 Drawbacks of Web Speech API

The speech recognition functionality of Web Speech API is implemented in Google's Chrome and Microsoft's Edge browsers, but many other browsers, including Firefox, Opera and Safari, do not support it (MDN Web Docs, 2021a). These browsers make up at least 25% of the global browser market share (StatCounter, 2021). Until these browsers include speech recognition support, the lack of browser support is a major drawback of using Web Speech API.

A browser-independent solution would be to use an external speech-to-text API. Multiple different APIs exist from proprietary solutions provided by Google, Amazon, and IBM to open-source solutions such as Kaldi. Kandhari et al. (2018) found IBM’s Watson speech-to-text service to have the greatest word accuracy. However, all proprietary services offer only a limited amount of speech recognition time and users per month for free.

Whereas proprietary tools such as IBM Watson include NLP tools that allow developers to differentiate command words, key words and search items from the sentences spoken by users, Web Speech API does not contain any built-in tools for language processing. NLP allows the processing of natural languages (like English or Chinese, as opposed to programming or mathematical languages) to separate command words from search terms, to find the intents of users behind the words instead of interpreting sentences literally, and so on. When using Web Speech API, language processing is left to the responsibility of the developers, who then require proficiency in and knowledge of language processing systems.
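Without built-in NLP, even separating a command word from its search argument must be hand-written by the developer. A naive sketch follows; the pattern and example sentence are illustrative.

```typescript
// Naive separation of a command word from its argument, the kind of
// processing a proprietary NLP service would otherwise provide.
function parseSearch(sentence: string): { command: string; query: string } | null {
  const match = sentence.toLowerCase().match(/\bsearch (?:for )?(.+)/);
  return match ? { command: "search", query: match[1] } : null;
}

parseSearch("Please search for opening hours");
// => { command: "search", query: "opening hours" }
```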

When using Web Speech API, some browsers, including Google Chrome, send the recorded audio from the web page to a server which handles the speech recognition processing (MDN Web Docs, 2021a). The same applies to all proprietary speech-to-text solutions. This has been criticised by Hu, Pierron, Vincent and Jouvet (2020) and Lee, Tang and Lin (2019) as a data protection and user privacy issue, where the user's voice is transmitted to and stored by unknown third parties.


2.4 Previous Research in Voice Usage in Web Applications

Academic research into voice-controlled web applications is still at a very early stage, but some examples exist of applications implementing STT and TTS services to improve the accessibility and user task performance of websites. The articles discussed here are A Voice Controlled E-Commerce Web Application by Kandhari et al. (2018), A Voice User Interface for football event tagging applications by Barra, Carcangiu, Carta, Podda and Riboni (2020) and Speech Oriented Virtual Restaurant Clerk using Web Speech API and Natural Language Processing by Gautam, Akshay, Dhavan, Kumawat and Ajina (2020). These articles were chosen as some of the most relevant and comprehensive references on creating voice-controlled software with web technologies, and each takes a slightly different approach from the others.

2.4.1 A Voice Controlled E-Commerce Web Application

Kandhari et al. (2018) developed a voice-controlled e-commerce application, a prototype of a web store where the user can browse the store and buy products using their voice. The application allows users to browse the store, search for items and add and remove items from their shopping cart. The motivation for the research was to create a more accessible web store, as the authors cited findings of poor accessibility across web stores and the growing e-commerce sector in the world. (Kandhari et al., 2018)

The application used IBM Watson for its speech recognition and speech synthesis capabilities. The choice of Watson was mainly motivated by IBM providing access to the software for the researchers. The frontend part of the application was developed using React, a JavaScript framework for creating GUIs. The backend of the application was developed using JavaScript, Python, and a MongoDB database for storing and retrieving information about the products available on the web store. However, the prototype is quite limited and other functionalities such as completing the orders were only listed as ongoing work at the time of the publication. (Kandhari et al., 2018) The user interface of the system is shown in figure 2.2.


Figure 2.2: The GUI of the voice-controlled e-commerce application by Kandhari, Zulkernine and Isah (2018).

When the user uses their voice to control the application, the user's speech is sent to the Watson STT service, which finds keywords in the speech; these are then used to activate functionality in the application. For example, when the user says Search for brownies, the STT would find the keywords search and brownies, do a database search for an item named brownies and return the item if found. If a match was not found, the TTS service would inform the user via speech. However, it was not mentioned how the keywords received from the STT were connected with React. (Kandhari et al., 2018)

The researchers found potential in the application for developing future voice-controlled e-commerce applications. The research noted the prototype could help in increasing the accessibility and usability of applications across all kinds of user groups, and that similar designs could be used in the future in fields such as online education, government services and emergency assistance. The article also discussed comparing other cloud-based STT services for use in the application, implementing translation APIs for multi-language support, and improving the prototype further by making it fully conversational, extending the usage of TTS in the application and allowing users to complete purchases. (Kandhari et al., 2018)

However, at the time of writing this thesis, no further research or source code has been published on the topic by the authors, preventing future development based on the prototype. The authors only discussed the concept of a voice-controlled e-commerce site and its benefits, so the implementation of such websites must be started from scratch when further investigation into the topic is required.


2.4.2 A Voice User Interface for football event tagging applications

Barra et al. (2020) developed a Voice User Interface as an addition to the GUI of a football match event tagging application called FooTAPP, as a way of improving user performance and reducing the error rate of user actions. In addition to the accessibility benefits for individuals with disabilities, the authors cited voice controls as a way to eliminate the repetitive and protracted physical actions GUIs might require when inputting large amounts of information manually into a system. (Barra et al., 2020)

Instead of the approach taken by Kandhari et al. (2018), where a completely new system was developed from the ground up with a design based on voice user interaction, the authors created a VUI as a complement to an existing application's GUI. The authors used Web Speech API to implement the voice control features of the application. The user interface of the application is shown in figure 2.3.

Figure 2.3: The GUI of FooTAPP developed by Barra, Carcangiu, Carta, Podda and Riboni (2020)

The authors noticed that using only a VUI to interact with the application actually reduced user performance, but a combination of a VUI and a GUI improved performance drastically. Some interaction patterns designed for visual interfaces might be unsuitable for use with voice. (Barra et al., 2020)

However, the poor performance of using only the VUI is not a major concern, as the VUI offers the possibility of using the application to those who otherwise would not be able to access it at all. Those without visual impairments can gain the benefits of the improved performance of using the combination of a VUI and a GUI, as they are highly unlikely to use only voice controls without the graphical user interface. The research thus shows voice is a viable supplementary input method for simple applications which require a lot of repeated tasks, such as marking individual data points from a group, as shown in the application's football tagging use case.

For optimal user performance, VUIs should not be designed directly around the commands used with a mouse and a keyboard. This might cause challenges if the software is designed around inputting data and accessing commands in a particular order which would be less than ideal in a VUI. If rewriting the software is not a viable option, compromises in the usability of the VUI must be made. Therefore, it might be better to create completely new software to be used with a VUI instead of extending existing applications, if the circumstances allow for it.

2.4.3 Speech Oriented Virtual Restaurant Clerk using Web Speech API and Natural Language Processing

Gautam et al. (2020) developed a virtual restaurant ordering application which allows the users to order food from a virtual restaurant interface using voice. The goal of the research was to create an ordering system for a restaurant to improve efficiency and to reduce human errors while ordering food. The main interface of the system can be seen in figure 2.4. (Gautam et al., 2020)

Figure 2.4: The GUI of the system developed by Gautam, Akshay, Dhavan, Kumawat and Ajina (2020)


The developed system used Web Speech API for its TTS and STT functionalities. Additionally, features such as face recognition were included in the system for automated customer recognition, but these features are not discussed in this thesis due to being outside the subject area. (Gautam et al., 2020)

The general idea of the system is similar to A Voice Controlled E-Commerce Web Application proposed by Kandhari et al. (2018), but the systems differ in their implementations of speech recognition features. Whereas A Voice Controlled E-Commerce Web Application used IBM Watson services for speech recognition and processing of the user's sentences, Web Speech API does not have these features built in, so an external NLP toolkit was combined with the system.

Unlike A Voice Controlled E-Commerce Web Application, the integration of voice controls in Speech Oriented Virtual Restaurant Clerk using Web Speech API and Natural Language Processing did not explicitly focus on improving accessibility; improved accessibility comes as a by-product of including new technologies in the application. The authors found the system has potential to improve ordering performance and customer satisfaction in restaurant applications, and that voice is a useful tool for supplementary user actions, making small requests and actions in the application faster.

The system shows voice controls can be implemented seamlessly with GUIs when the system is designed to include voice recognition from the start. However, the system might pose accessibility challenges for groups such as the visually impaired when using the face recognition features, as they might be unaware of how the facial recognition system works or whether they are facing the correct way.

2.5 Shortcomings in Existing Solutions

The solutions mentioned in section 2.4 demonstrate the technical potential of voice controls for many types of web applications, but the solutions are highly technical and require knowledge of web development, JavaScript, and tools such as NLP and speech recognition frameworks. No source code from the applications mentioned in the research articles has been published, meaning further development of voice-controlled web applications must be started from scratch.

Implementing a custom-made website instead of using a more universal solution, such as combining CMS functionality with voice controls, makes the development time of websites longer and the costs higher, and often prohibits the re-use of components. These shortcomings could be addressed with a Content Management System that would support the use of voice to navigate and control actions on websites and allow support for conversational interfaces.

A CMS with support for voice controls would allow non-technical users to create their own voice-controlled websites with ease, potentially increasing and promoting the accessibility of the internet. Inaccessible websites may prevent certain groups of users from using websites at all. This limits the afflicted users from using public services, prohibits them from communicating and makes businesses lose potential customers. As the current solutions for implementing voice controls require expensive technical knowledge from experts, accessibility concerns are often skipped to save money. Making the development of websites easier saves money in development costs and promotes the accessibility of the internet at the same time.

Research has shown Web Speech API to be a potential tool for implementing voice controls in web applications, but as it does not have built-in tools for language processing, at least some knowledge of NLP tools is required from developers who want to include complicated voice commands in their websites instead of simple operations such as go forward or go back. Wrapper APIs for Web Speech API exist, such as react-speech-recognition, but selecting the correct API to use requires extra knowledge from the developer. The problem could be solved by using a speech recognition library which includes language processing tools, such as the IBM Watson used by Kandhari et al. (2018), but these libraries are not free to use. Commercial libraries could pose a barrier to developing voice-controlled websites for many smaller businesses, who are the most typical group to set up a website, as mentioned in the beginning of chapter 2.

In addition to issues relating to the difficulty of website development from scratch, only Kandhari et al. (2018) were notably concerned with improving the accessibility of web pages using voice interfaces. Voice controls are more often seen as a way of improving the performance of (repetitive) tasks, and while they are a good solution for eliminating repetitive mouse clicks and the like, the design of software should also heavily consider accessibility requirements from the early stages of design and development.

Barra et al. (2020) encountered the problem of needing to include accessibility requirements in the fundamental design of software through their implementation of voice controls in software designed to be used with a GUI. While the combination of a VUI and a GUI boosted user performance in repetitive tasks, using only voice to control the software gave worse performance results, as the action patterns in the software were heavily designed around visual actions. Adding voice controls to existing applications increases the inclusivity of software, but there might be a performance and usability bias against those who only use the VUI, and some actions might be unintuitive to use with voice. From an accessibility perspective, it might be best to design software from scratch to support voice controls, but this is often not a realistic option.

The solutions mentioned in section 2.4 used only STT systems and did not include TTS functionality for making conversational systems that would allow websites to be used without the need for visual interfaces. Kandhari et al. (2018) mentioned conversational functionality would be added to their proposed system in the future, but thus far no further articles on the topic have been published.


3. Approach to the System Design

As a solution to the shortcomings in developing voice-controlled websites mentioned in section 2.5, this chapter describes the design of a Content Management System which allows easy development of websites supporting voice controls and conversational interfaces. A CMS would reduce the problem of having to develop completely new websites from scratch when wanting to include voice controls, as the system would provide reusable technical parts handling the display of web pages and supporting speech recognition. When using a CMS, the user would only have to handle the installation of the system and the addition of content to the website using a WYSIWYG interface.

A CMS with built-in support for voice controls (the system from now on) would allow the creation of simple, small-scale websites with predefined templates for personal and small business use. The websites intended to be created with the system could range from a small landing page for a company or a store, containing a few pages of contact information, directions and services offered, to websites for public institutions. The voice controls could navigate between the pages of the website and easily answer questions on topics such as opening hours, contact information, a restaurant menu, information about a product and so on.

The system would provide a simple layout for placing website contents, and functionalities for adding or editing pages on the site, changing text on the pages, adding images to predefined positions on the site and adding voice commands for navigating the site. The user can create speech responses using TTS to respond to the visitor's voice commands, giving the website conversational functionality, with the aim that the website can be used without a GUI to navigate between pages, find information or make search queries through the VUI.


Section 3.1 discusses the target use cases for the system, section 3.2 discusses the target audience, section 3.3 discusses the requirements of the system arising from the target use case, section 3.4 discusses the levels of voice interaction the system should target, section 3.5 discusses the basic architecture of the system and its components, and section 3.6 demonstrates how websites are created with the system.

3.1 Target Use Cases

The system is targeted towards personal and small business use, as small businesses in particular are the typical group to set up a new website, as identified in chapter 2, and as the requirements of personal and small business websites are small in scale.

Being targeted towards personal and small business use means the system would be used to create small-scale websites, referring to websites which have a small number of web pages consisting mostly of text, pictures, or videos. In contrast, large-scale websites would additionally contain much more interactive content, such as an online store, a blog, or a booking application for appointments.

Small-scale websites are most often used by small and medium enterprises to display information critical to their business. Critical information which often needs to be on a business website includes the name of the company, a general description of its business activities, available products and prices, contact information, interactive maps and postal addresses. (Al-Hawari, Al-Yamani & Izwawa, 2008) Small-scale websites could include sites for small businesses, local news, public services, or other kinds of organisations mainly aiming to inform the public about their services and knowledge without needing much interaction on the website itself. These kinds of websites are suitable for supporting voice controls, as the information they contain can be relayed through speech synthesis without the need to support long and complicated conversations between the website and the user.

3.2 Target Audience

CMSs are often used by non-technical personnel (Liduo & Yan, 2010). According to Martinez-Caro et al. (2018), typical users of CMSs have computer skills in basic office software such as Microsoft Word, but they are not skilled in web development. Halim et al. (2020) identified WordPress as the most popular CMS in the world, as well as very beginner-friendly and fast to set up. Thus, the usability of the system should be compared to the designs of Microsoft Office software and WordPress.

The target audience of the system are the developer users, who utilise the development environment to create websites. The websites created with the system have the visitors as their end users, who use their voice to make selections on the web pages, search for items or change pages. These interactions can be accomplished by voice only, by voice in combination with visual interfaces, or through visual interfaces only. For clarity, this thesis will refer to the developer users as the users and to the end users as the visitors.

The visitors are the ones with the strictest accessibility requirements, while the developers can be expected to have better computer skills and to have normal vision and hearing, so that they can develop all the different aspects of the websites. The development environment should of course adhere to some accessibility requirements, but these requirements are mostly targeted at, and significant for, the end products created with the environment and the needs of their visitors.

The system is targeted towards single-user environments, where only one person is assumed to control the contents of a website, instead of multi-user environments. Liduo and Yan (2010) mentioned the ability for other users to review and approve contents before publication as an important business requirement of a CMS, but in a small-scale system built for single-user environments this requirement can be omitted.

3.3 System Requirements

The basic requirements for a CMS are:

• The ability to display web pages to the visitor.

• The ability to have multiple web pages on the website and to navigate between them.

• The user of the system has to be able to edit the contents of the web pages via a simple GUI.

• The user must be able to add, delete and update web pages via the GUI.


• The contents of the website must be stored on a database independent of the visuals of the website.

The special requirements arising from the need for voice controls are:

• The ability for the visitor to interact with the web pages via voice commands. The commands can include actions such as navigating to the next page, going back, asking to repeat the previously said sentence, asking for help or other custom commands set up by the developer of the application.

• The system must respond to voice commands through speech synthesis and be able to read the contents of the website to the user.

• All the contents of the website must be accessible via voice as well as through the visual layout of the website.

3.4 Levels of Voice Interaction

When discussing voice-controlled systems, the targeted level of interaction needs to be specified, referring to the complexity of the commands and conversation the system can understand. Levels of interaction can be split into stages ranging from the lowest (1) to the highest (5), where the higher stages of interaction also implement all the features of the lower stages. The levels of interaction are:

1: The system can understand single-word voice commands such as next, previous or stop. The system understands the commands spoken by the user literally and can reply to the user's commands with simple responses.

2: The system can filter out stop words and find key words from relatively simple sentences, such as recognising the command next from the sentence Go to next page.

3: The system can detect synonyms for a voice command set up by the developer creating the command. An example would be detecting next as an alternative for continue.

4: The system can remember the previous commands and requests spoken by the user to make conversations, instead of forgetting all the previous sentences the user has spoken. The system could offer alternative ways of asking questions to the user.

5: The system can understand long and complicated commands with multiple different parameters and can use conversation to prompt the user on things which require clarification.

The minimum required level of interaction for the system developed in this thesis is level 1, but levels 2 and 3 can be targeted and development started towards fulfilling them. Levels 4 and 5 are out of the scope of this thesis and are more the realm of virtual assistants such as Siri or Alexa.
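To illustrate the difference between levels 1-3, the following sketch layers stop-word filtering (level 2) and synonym mapping (level 3) on top of literal matching (level 1). The word lists are illustrative, not the system's actual vocabulary.

```typescript
// Levels of interaction 1-3 as a small command matcher.
const STOP_WORDS = new Set(["go", "to", "the", "page", "please"]);
const SYNONYMS: Record<string, string> = { continue: "next", forward: "next" };
const COMMANDS = new Set(["next", "previous", "stop"]);

function interpret(utterance: string): string | undefined {
  const tokens = utterance
    .toLowerCase()
    .split(/\s+/)
    .filter((t) => !STOP_WORDS.has(t)) // level 2: drop stop words
    .map((t) => SYNONYMS[t] ?? t);     // level 3: map synonyms
  return tokens.find((t) => COMMANDS.has(t)); // level 1: literal match
}

interpret("next");                // level 1 => "next"
interpret("Go to the next page"); // level 2 => "next"
interpret("continue please");     // level 3 => "next"
```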

3.5 System Architecture

The system is a CMS which allows the creation of voice-controlled websites, where the visitors can access contents through voice commands. Developers of the websites utilise a separate editor view of the system, which allows them to change the contents of the website and create new voice commands easily through a Graphical User Interface. As CMSs rely heavily on the separation of content and presentation, the system architecture aims to have a distinct separation of components. The basic system architecture will consist of:

• the Visitor View shown to the visitors, responsible for displaying the contents of the page as well as containing the voice interaction,

• the Editor View, where the user of the system can modify the contents of the website through the GUI in a WYSIWYG view,

• and the Backend of the system, consisting of a database and behind-the-scenes business logic. Both the Visitor and Editor Views connect to the backend to retrieve the data displayed on the website, and the backend retrieves the data from the database and converts it to a suitable format.

The design of the system architecture follows the three-tier architecture pattern, where the contents of the CMS are stored in a database and the database's contents are accessed through a separate backend layer. The Visitor and Editor Views handle the business logic of editing the data, navigating between the pages, and the final presentation (both visual and auditory) of the website. (Christianson & Cochran, 2009, p. 79) The relations between the different components of the system are visualised in figure 3.1.


Figure 3.1: The relations and connections between the different main components of the system displaying the separation of content and presentation.

3.5.1 Visitor View

The Visitor View is shown to the visitor when they access a site created with the system through their web browser. The functionality of the view consists of a connection to the backend, a graphical user interface displaying the data retrieved from the backend, and a speech recognition and synthesis system handling the voice command functionality.

Graphical User Interface

The websites created with the system should have a simple Graphical User Interface where the visitor can see the title of the website, navigate between the pages of the site by clicking links below the title, and see the contents of the page below the navigation. In future development versions, the developer of the website could include images, videos, maps, or other visual content next to the text as additional information. A prototype of the GUI of the Visitor View is seen in figure 3.2.


Figure 3.2: An example of what the Visitor View would look like to the end user.

Functionalities required from the GUI are:

• parsing a single page object received from the backend and mapping the contents of the page into appropriate HTML/JavaScript elements (a minimal sketch follows this list),

• creating a dynamic navigation menu from the list of all pages on the website received from the backend,

• allowing the navigation between different pages of the website by clicking items on the menu bar,

• displaying the voice commands recognised through the VUI in text format,

• displaying a loading message while the system is requesting data from the backend,

• and displaying an error message to the user if a connection to the backend could not be made, and thus the site is not usable.
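As a minimal sketch of the first two requirements, a page object could be rendered roughly as follows. The field names (title, content, pageId) and element IDs are assumptions made for illustration, and loadPage refers to the navigation function sketched in the next subsection.

```javascript
// Illustrative sketch of rendering a page object into the GUI.
// Field names and element IDs are assumptions, not the implemented system.
function renderPage(page) {
  document.getElementById("page-title").textContent = page.title;
  document.getElementById("page-content").textContent = page.content;
}

// Build the dynamic navigation menu from the list of all pages.
function renderMenu(pages) {
  const menu = document.getElementById("menu");
  menu.innerHTML = "";
  for (const page of pages) {
    const link = document.createElement("a");
    link.textContent = page.title;
    link.addEventListener("click", () => loadPage(page.pageId));
    menu.appendChild(link);
  }
}
```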


Communication with Backend

When the visitor enters the website, a connection to the backend is made by sending a request to retrieve the front page and its contents. Pages are identified by giving them a unique pageId number, the front page always having a pageId of 0.

Communication between the Visitor View and the backend happens via a REST API. The frontend sends a GET request with the desired pageId number to the backend. If the pageId is found in the database, the backend responds with a JSON object containing the contents of the page. The frontend then reads the JSON object and updates the contents of the page with the data received from the backend. The communication via HTTP GET requests enables the frontend and backend to be separated completely, allowing for greater flexibility and modification of either component individually.

The navigation between different pages on the Visitor View does not happen by retrieving different HTML pages from the server, but by sending another GET request to the backend requesting a new page. The received JSON object is used to change the contents of the existing, already loaded page. This makes the Visitor View a single-page application, allowing for better performance and a more dynamic user experience, with the trade-off of worse Search Engine Optimisation performance (MDN Web Docs, 2021b).
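A minimal sketch of this navigation flow is given below, assuming a REST endpoint of the form /api/pages/{pageId}; the URL scheme is an assumption made for illustration, and the helper functions for the loading and error messages are hypothetical names for the GUI behaviour listed earlier.

```javascript
// Illustrative sketch of single-page navigation: fetch a page object
// by its pageId and update the already loaded page in place.
async function loadPage(pageId) {
  showLoadingMessage(); // hypothetical helper: show a loading message
  try {
    const response = await fetch(`/api/pages/${pageId}`);
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    const page = await response.json();
    renderPage(page); // update contents without reloading the HTML page
  } catch (error) {
    showErrorMessage(); // hypothetical helper: the backend is unreachable
  }
}
```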

Voice Controls

The Visitor View is equipped with speech recognition and synthesis functionalities. Each page may have its own voice commands for enabling specific functionalities within that page. The commands are mainly used to navigate between the different pages on the website. For example, the user could ask the website to navigate to the home page by saying "Go Home" to the website.
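As a sketch of how such a command can be recognised and answered in the browser, the Web Speech API can be used roughly as follows. The matching against "go home" is a single illustrative command, not the full command-handling logic of the system.

```javascript
// Illustrative sketch of continuous speech recognition and a spoken
// response using the Web Speech API.
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.continuous = true; // keep listening for further commands
recognition.lang = "en-US";

recognition.onresult = (event) => {
  const result = event.results[event.results.length - 1];
  const transcript = result[0].transcript.trim().toLowerCase();
  if (transcript === "go home") {
    loadPage(0); // the front page always has a pageId of 0
    // Speak a confirmation back to the user with speech synthesis.
    window.speechSynthesis.speak(
      new SpeechSynthesisUtterance("Navigating home.")
    );
  }
};

recognition.start();
```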

Using speech recognition raises the problem of recognising which parts of the speech are meant to be commands for the computer, and which parts should be filtered out. If the system is continuously listening for commands, it might falsely detect commands from background noise such as other people talking in the background. The problem is highly context sensitive, as it mostly appears in shared, public places but not when the user is using the system alone in a private environment.


Virtual assistants solve this problem by using a wake word¹ before the commands given to them. The assistant listens for wake words continuously. After detecting a wake word, the assistant keeps listening for a complete sentence and tries to interpret a command from the speech, reducing the number of accidental commands from background noise.

It should be considered whether a wake word would be useful in a website context. None of the applications mentioned in section 2.4 mentioned using wake words, instead preferring to listen for commands continuously. The issue with wake words lies in requiring the users to know the wake word in order to use them. While a virtual assistant is a universal solution, a website is not, and therefore the user cannot be expected to know a universal wake word for accessing websites, as none exists yet.

The system design can quite easily be changed to include wake word activation for voice commands, but this is left for future research instead of being implemented at the first stage of development.
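If wake word activation were added later, a simple prefix check in the recognition handler would be one possible starting point. The wake word "website" below is purely hypothetical.

```javascript
// Hypothetical sketch of wake word activation: treat speech as a command
// only if it starts with the (assumed) wake word "website".
const WAKE_WORD = "website";

function extractCommand(transcript) {
  const text = transcript.trim().toLowerCase();
  if (!text.startsWith(WAKE_WORD)) {
    return null; // no wake word: ignore as background speech
  }
  return text.slice(WAKE_WORD.length).trim(); // the command after the wake word
}
```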

The system will only detect one command at a time. If the user says multiple commands at once, the system will fail to detect an appropriate command featuring all the specified commands and will activate nothing. Due to this design constraint, the user should state only one command at a time, and this should be made clear by the designers of the websites.

3.5.2 Editor View

The Editor View allows the user to modify the contents of the website. It is accessed by navigating to the /editor sub-directory of the website. In a production-grade system the editor interface would ask for the user’s credentials to allow logging in to the system, but for the purposes of this thesis user authentication was not implemented.

The Editor View consists of two main components: Dashboard and Page Editor. Dashboard (figure 3.3) is the main component shown when the user enters the Editor View. Dashboard keeps a list of all the pages on the site and contains buttons for creating new pages, editing existing pages, or deleting a page. Currently, Dashboard is not used for anything besides changing the title of the website, listing all the pages on the website, and accessing them, but it might be used in the future for functionality such as manually ordering the pages of the website.

¹ Wake word is the term given by Amazon, used in the context of Amazon Alexa virtual assistants. As no unified terminology exists for this functionality, this term is selected here.


Figure 3.3: The user interface of Dashboard, with four pages called Home, Lunch, Menu and Reservations added to the website.

Clicking a button on the Dashboard to add or edit a page takes the user into the Page Editor, which allows the user to modify the contents of a page. Adding and editing pages share the same interface; the only difference between these functionalities is whether the request sent to the backend from the Page Editor marks the page as a new one or as a replacement for the contents of an existing page in the database.

A user interface prototype of the Page Editor is seen in figure 3.4. The editor contains the main text contents of the page in a text area where they can be modified. Next to the text box containing the content is the area for editing the voice commands of the page, where the user can create voice commands by inserting the utterance, the response, and the desired action for each command.


Figure 3.4: The user interface of the Page Editor displaying a page with some contents and a voice command added.

Validating the contents of the pages as well as the validity of the voice commands is the responsibility of the Editor View. The program should check that the page has contents that can be saved into the database and that each voice command has both a clear utterance and a response; a voice command used for navigation additionally requires a navigation target.
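A minimal sketch of this validation step could look as follows; the field names (title, content, utterance, response, action, target) are assumptions made for illustration, not the implemented data format.

```javascript
// Illustrative sketch of Editor View validation before saving a page.
// All field names are assumptions.
function validatePage(page) {
  const errors = [];
  if (!page.title || !page.content) {
    errors.push("The page needs a title and contents before saving.");
  }
  for (const command of page.voiceCommands || []) {
    if (!command.utterance || !command.response) {
      errors.push("Each voice command needs an utterance and a response.");
    }
    if (command.action === "navigate" && command.target == null) {
      errors.push("A navigation command needs a target page.");
    }
  }
  return errors; // an empty list means the page can be saved
}
```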

3.5.3 Backend

The backend acts as an intermediary between the Visitor View and the Editor View. The main purpose of the backend is to connect to the database and retrieve data for the editor and the presentation layer to use. The backend abstracts the original data source, which both improves the security of the system, as the frontend code does not interact directly with the database but through an intermediary that can detect wrong or harmful requests, and allows the database to be changed to a different type if needed without having to rewrite the presentation and editor layers (Christianson & Cochran, 2009, p. 85).

The backend is very simple and limited in functionality, consisting only of a REST API receiving requests from the Visitor and Editor views. When a REST request is received, the backend makes a connection to the database, performs the appropriate actions such as getting or adding objects, and sends a response to the requesting party either as an HTTP message or as a JSON object.
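A minimal sketch of such an endpoint is given below, written with Express purely as an example; the actual framework, URL scheme, and database access are assumptions, with an in-memory map standing in for the real database connection.

```javascript
// Illustrative sketch of a backend REST endpoint (Express used only as
// an example framework; paths and storage are assumptions).
const express = require("express");
const app = express();

// Hypothetical in-memory stand-in for the database connection.
const pages = new Map([[0, { pageId: 0, title: "Home", content: "..." }]]);

app.get("/api/pages/:pageId", (req, res) => {
  const page = pages.get(Number(req.params.pageId));
  if (!page) {
    return res.status(404).send("Page not found"); // plain HTTP message
  }
  res.json(page); // respond with the page contents as a JSON object
});

app.listen(3000);
```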


3.6 Demonstration of the System

This section demonstrates the usage of the system in the Visitor and Editor views. Section 3.6.1 describes the use of voice controls, and section 3.6.2 describes creating new pages and adding voice commands to them.

3.6.1 Visitor View

When entering the website, visitors see the website as it is shown in figure 3.5. The title of the website and navigation links to the pages are shown at the top of the page, where the user can click the links to navigate between the pages of the website. The contents are shown in the middle of the page.

Figure 3.5: TheHomepage of a sample website created with the system.


At the bottom of the page there is the Recognised-bar showing which words the STT system has recognised. When a word or a sentence is recognised, it gets displayed at the bottom of the page as shown in figure 3.6. The bar is used to visually show the user which commands the system has recognised, as an additional way of confirming the voice command actions.

Figure 3.6: The Recognised-bar, where the STT system has recognised the sentence "What’s for lunch?".

3.6.2 Editor View

As mentioned in section 3.5.2, the Editor View does not yet have a log-in functionality implemented, meaning that when the user accesses the Editor, they directly access the Dashboard as seen in figure 3.7, which shows a new website with no pages yet.

Figure 3.7: The Dashboard of an empty site with no pages added.


To add new pages to the website, the user has to press the blue "Add A New Page" button on the Dashboard, which takes the user into the Page Editor (figure 3.8) for creating a new page.

Figure 3.8: Page Editor for a new page with no contents.

The user can change the name of the page by filling in the "Title" text field and add the contents of the page to the "Contents" text field as shown in figure 3.9. The page can be saved by pressing the "Add New Page" button. After saving the new page for the first time, the button changes to "Save" to indicate that the page already exists and has been published in the system.


Figure 3.9: Page Editor with contents and page title filled in.

Finally, the user can create voice commands for the page by pressing the "New Voice Command" button, which creates a new command specific to the opened page. Figure 3.10 demonstrates the creation of a new command. After pressing the button, the voice command interface expands in the Page Editor, revealing the options for the voice command (figure 3.11). The voice command editor allows the user to set the action of the voice command², the utterance spoken by the user that activates the command, the response spoken back to the user by the TTS system, and, in the case of a navigation command, the page the action will navigate to.

² See table 4.3 and section 4.1 for the different actions and types of voice commands.


Figure 3.10: New command.

Figure 3.11: The toolbox for editing voice commands in Page Editor.


4. Technical Implementation

This chapter discusses the technical implementation of the system design described in chapter 3. First, the way of storing data in the system is described in section 4.1. Then the common implementation of the frontend, which consists of the Visitor and Editor views, is discussed in section 4.2. Further implementation of the Visitor View is discussed in section 4.3, the implementation of the Editor View in section 4.4, and the implementation of the backend in section 4.5. Finally, issues noticed in the current technical implementation are discussed in section 4.6.

4.1 Data Storage Format

As the system is based around the design of a separate Visitor View for displaying the web pages to the visitors and an Editor View for changing the contents of the pages through a GUI, a common data storage format for storing the pages is needed. JSON was chosen as the storage format because it is easily accessible programmatically by all parts of the system, as explained later in this chapter.

Each page of the website is stored in an individual JSON object. As each page may have its own voice commands, the commands are stored inside the page objects and are called voice command objects in the context of the system. The data structure of a page object is listed in table 4.1.
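For illustration, a page object with a single voice command might look roughly like the following. The exact field names here are assumptions based on the terminology used in this chapter and should be read against table 4.1, not as the authoritative format.

```json
{
  "pageId": 0,
  "title": "Home",
  "content": "Welcome to the restaurant.",
  "voiceCommands": [
    {
      "utterance": "What's for lunch?",
      "response": "Navigating to the lunch menu.",
      "action": "navigate",
      "target": 1
    }
  ]
}
```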
