
MULTI-LABEL CLASSIFICATION FOR INDUSTRY-SPECIFIC TEXT DOCUMENT SETS

Master of Science Thesis
Faculty of Engineering and Natural Sciences
Examiners: Prof. (tenure track) Roel Pieters, Prof. Jyrki Nummenmaa
June 2021


ABSTRACT

Juhana Rajala: Multi-label classification for industry-specific text document sets
Master of Science Thesis

Tampere University

Master’s Programme in Mechanical Engineering
June 2021

Multi-label classification is a generalization of multi-class classification, in which a single instance acting as a piece of data can belong to one and only one class. In multi-labeling problems, the mapping between the class space and the instances is less restricted, since one instance may be assigned an arbitrary number of classes. This is especially common in the categorization of movies that are influenced by multiple genres, yet the same idea is applicable to other types of media, too. In this work the media type is text documents, especially documents that are created and stored by companies from various specialized areas of industry. In the scope of this study, the class space of each company's document set is very different, and the document set sizes are relatively small.

To benefit from machine learning, a suitable model and training method need to be found that are robust to the lack of training data yet still easy to deploy and use. To find a solution to this problem, this study carries out a literature study of the common theories and practices used in the domain of multi-labeling problems, empirical research on an algorithmic multi-label classifier, and a performance analysis of the implemented classifier.

As a result of this study, a NetClass-based multi-label classifier suitable for relatively small text document data sets was implemented. This classifier performs effectively with pure multi-label data sets but loses some of its effectiveness with low-cardinality data sets. Also, due to the algorithmic nature of the classifier, the training process was noticed to be reversible, which is a rare feature for machine learning applications. Even though this opportunity was briefly examined in this thesis, a more detailed consideration was left for future work.

Keywords: multi-label classification, text classification, document storages, NLP, natural language processing

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


TIIVISTELMÄ (ABSTRACT IN FINNISH)

Juhana Rajala: Multi-label classification of industry-specific text documents
Master of Science Thesis

Tampere University
Master's Programme in Mechanical Engineering
June 2021

Multi-label classification generalizes the concept of multi-class classification, in which a single instance can belong to only one class. In multi-label classification problems, the restrictions between the class space and the instances are looser, since a single instance may be assigned an arbitrary number of classes. Such a procedure is common especially in the categorization of movies that have been influenced by several different genres. Multi-label classification is not, however, limited to movies, but can also be applied to other forms of media. In this work the form of media is text documents, especially those created and stored by very different companies specialized in their own fields of industry. The scope of this work covers in particular relatively small document storages that differ from each other in terms of their class spaces.

To be able to exploit machine learning in this problem, a suitable model and the methods used to train it must be found. They must be easy to use, but also able to work with little training data. To find such methods, this study carries out a literature review of the theories and practices commonly used in multi-label classification, empirical research based on an algorithmic multi-label classifier, and an analysis of the performance of the implemented classifier.

As the end result of this work, a multi-label classifier based on the NetClass model was created that is also suitable for use with little training data. This classifier achieves good results with genuinely multi-label data sets, but loses effectiveness with documents that are single-class by nature. Due to its algorithmic properties, the training of the classifier was noticed to be a reversible process, which is rare for machine learning applications. Although a quick look at this property was taken in this work, its further investigation is left for future work.

Keywords: multi-label classification, text classification, document storages, NLP, natural language processing

The originality of this publication has been checked using the Turnitin OriginalityCheck service.


PREFACE

This Master of Science thesis was carried out and written during my employment at Vertex Systems Oy. I would like to express my gratitude to Anssi Männistö, who acted as my thesis supervisor at Vertex Systems. His guidance during and prior to this thesis was essential for my achievements.

I would also like to thank Timo Tulisalmi from Vertex Systems for providing the initiating force for this thesis and for his genuine interest in the process. In addition to that, I feel grateful to my thesis supervisors from Tampere University, Roel Pieters and Jyrki Nummenmaa, for their support and feedback.

However, the greatest support of all came from Enni and my friends, who provided invaluable advice, assistance and peer support throughout this process.

Tampere, 25th June 2021

Juhana Rajala


CONTENTS

1. Introduction . . . 1

1.1 Background and Motivation . . . 2

1.2 Research structure . . . 2

1.2.1 Thesis structure . . . 3

2. Industry-specific document sets . . . 4

2.1 Traditional categorization . . . 4

2.2 Document formats . . . 6

2.3 Non-constant class space . . . 8

3. Multi-label classification . . . 10

3.1 Document processing . . . 10

3.2 Multi-label data sets . . . 12

3.3 Multi-label classification methods . . . 13

3.4 Performance metrics . . . 14

4. A novel graph-based multi-label classification method . . . 19

4.1 NetClass based approach . . . 19

4.1.1 Model formation . . . 20

4.1.2 Classification . . . 23

4.2 Finding features . . . 31

4.3 Incremental training . . . 34

4.4 Computational complexity . . . 36

5. Results and analysis . . . 40

5.1 Effects of data set language . . . 40

5.2 Performance on data sets . . . 42

5.2.1 Effect of A1, A2 and A3 algorithms . . . 45

5.2.2 Comparison with LRN-WRM . . . 48

5.2.3 Memory consumption . . . 49

5.3 Discussion . . . 54

6. Conclusion . . . 57

References . . . 59

Appendix A: English stop word list . . . 62

Appendix B: English number words . . . 65

Appendix C: Finnish number words . . . 66


LIST OF FIGURES

2.1 The difference between multi-class and multi-label classification problem domain. Based on [15] . . . 9

3.1 Traditional classifier taxonomy . . . 10

3.2 A dimension based taxonomy to organize documents on multi-label feature selection. Adapted from [4] . . . 15

3.3 Different measure types based on problem type and measure type. Adapted from [30] . . . 16

4.1 An example structure of a term graph formed from three documents, yielding a total of eleven vertices and 29 edges . . . 24

4.2 Neighbor selection method A1_all including all neighbor edges of inspected vertex "wood" . . . 26

4.3 Neighbor selection method A1_threshold including only neighbor edges of inspected vertex "wood" above threshold t = 2 . . . 28

4.4 Neighbor selection method A1_co-occur including only one neighbor edge connecting inspected vertex "wood" and vertex "building" . . . 29

4.5 An example of a table representation of class score aggregation for a test document . . . 32

4.6 A first co-occurrence of terms "wood" and "beam" in a document from class "building design" . . . 33

4.7 A local view of NetClass based term graph . . . 35

4.8 Adjacency matrix representation of a six-node graph . . . 38

4.9 Adjacency list representation of a six-node graph . . . 39

5.1 Comparing results achieved by LRN-WRM against traditional ADC algorithms [10] . . . 49

5.2 Comparison of the execution time of LRN-WRM, run on a machine with an i5 @ 3.40GHz and 16GB of DDR3 RAM, with traditional and lazy ADC algorithms [10] . . . 49

5.3 Number of vertices being formed into the model graph . . . 51

5.4 Number of edges being formed into the model graph . . . 51

5.5 Ratio between the formed vertices and edges in the model graph . . . 52

5.6 Time consumption for training and writing of the model graph . . . 52

5.7 Model graph size evolution . . . 53


LIST OF TABLES

3.1 The five most popular performance metrics used in the literature for multi-label classifiers. Total number of papers 64. Adapted from [30] . . . 15

5.1 Accuracy on data set 20NG with sub-algorithm combination A1_co-occur - A2 - A3_TF-IDF with and without stop word removal . . . 42

5.2 Six data sets and their characteristics used in model validation. Adapted from [21] . . . 43

5.3 Performance on different data sets with sub-algorithm combination A1_all - A2 - A3_sum . . . 46

5.4 Performance on different data sets with sub-algorithm combination A1_all - A2 - A3_TF-IDF . . . 46

5.5 Performance on different data sets with sub-algorithm combination A1_threshold - A2 - A3_sum and threshold value of 0.05 . . . 46

5.6 Performance on different data sets with sub-algorithm combination A1_threshold - A2 - A3_TF-IDF . . . 47

5.7 Performance on different data sets with sub-algorithm combination A1_co-occur - A2 - A3_sum . . . 47

5.8 Performance on different data sets with sub-algorithm combination A1_co-occur - A2 - A3_TF-IDF . . . 47

5.9 Absolute time consumption with sub-algorithm combination A1_co-occur - A2 - A3_TF-IDF on a system with i7-7700HQ @ 2.80GHz and 32GB of RAM. *with 20,000 randomly selected documents . . . 49

5.10 Absolute time consumption & memory usage of training process with 20NG dataset on a system with i7-7700HQ @ 2.80GHz and 32GB of RAM . . . 50


LIST OF ALGORITHMS

1 Multi-label NetClass graph creation algorithm . . . 22

2 Creating new and updating existing edges between vertices . . . 23

3 Multi-label NetClass classification algorithm . . . 25

4 Edge selection algorithm A1_all . . . 26

5 Threshold-based edge selection algorithm A1_threshold . . . 27

6 Term co-occurrence based edge selection algorithm A1_co-occur . . . 27

7 Sum-based term score aggregation algorithm A2_sum . . . 30

8 Sum-based final score aggregation algorithm A3_sum . . . 30

9 TF-IDF-based term score aggregation algorithm A3_TF-IDF . . . 31


LIST OF SYMBOLS AND ABBREVIATIONS

.NET Core Framework for developing and running .NET based applications
On-premise Locally hosted, non-cloud service

ADC Automatic document classification

ASCII American Standard Code for Information Interchange

BFS Breadth-first search

BMP Bitmap

C# Programming language based on .NET runtime

CAD Computer-aided Design

CAE Computer-aided Engineering

DFS Depth-first search

DIA Document image analysis

IDF Inverse document frequency

IIS Internet Information Services

JPEG Joint Photographic Experts Group

LRN-WRM Lazy Reduced Neighborhood Weighted Relational Model

Multi-tenant service Centralized computing service that is shared among multiple companies or other customers

NLP Natural language processing

OCR Optical character recognition

PDF Portable Document Format

PDM Product Data Management

PLM Product Lifecycle Management

PNG Portable Network Graphics

RAM Random Access Memory

Regex Regular expression

REST API Representational State Transfer Application Programming Interface

SaaS Software as a Service


SQLite Lightweight SQL database engine

SSD Solid-state drive

STEP STandard for the Exchange of Product model data

TAU Tampere University

TF Term frequency

TIFF Tagged Image File Format

TUNI Tampere Universities

URL Uniform Resource Locator

XNOR Logical exclusive-NOR operator


1. INTRODUCTION

A large variety of documents is constantly being stored in various types of document storages by companies from various industries [1]. This is also the case with the customers of Vertex Systems Oy, which use different kinds of Product Data Management and Product Lifecycle Management (PDM/PLM) systems and other document storages provided by Vertex Systems Oy. These document storages may be based on simple file-system-based flat directories or complex database management systems [2]. Almost always the stored documents are organized or categorized in some way to make it easier to search and access them later [3]. This organization or categorization process is often done manually by a human who, at some point of the document life cycle, identifies the document and places it in a suitable slot in the document storage. [3] [4]

This categorization process does not always need to be manual work. To alleviate this problem, different kinds of machine learning applications are being developed in various domains in order to reduce the need for manual work. This is necessary especially when dealing with large amounts of unstructured data [5]. However, applying machine learning in the form of automatic document classification (ADC) places limitations on document types: data fed into machine learning applications must still be in a somewhat standardized format, and because of that, not all kinds of documents can be handled with one machine learning application [6]. Because of these requirements, standardizing the document contents needs various processing steps that help extract the important discriminative features that eventually indicate the categories, or the classes, of these documents. Implementing these processing steps is thus highly document format and language dependent, and only certain formats of documents can be processed with the same processing pipeline. [7]

Regardless of well performed document processing and feature extraction, it is not always clear how to assign a document into a single category, since its content may overlap with the descriptions of multiple other categories, too [8] [9]. This is a common problem not only for text document categorization but also for other types of media. For example, movies and songs are often influenced by multiple genres and thus are described with multiple labels. This leads to a situation where a movie can be of types comedy, emotional and thriller at the same time. This problem is better known as multi-label classification, in which a single instance of data may contain features from various partially overlapping classes. This problem of multi-labeling also applies to companies from various industries storing text documents labeled with their own sets of predefined and specialized labels, not shareable with other companies [4]. This introduces the concept of industry-specific text document sets, and this thesis aims to find a multi-labeling ADC solution for handling documents from such data sets.

1.1 Background and Motivation

Vertex Systems' already existing PDM/PLM-related document storages and the cloud-based document storing solutions currently under development require solutions for making the document classification process less labor-intensive and more flexible. Commissioned by Vertex Systems Oy, this thesis was done to find suitable methods for crafting a multi-labeling ADC solution that suggests suitable labels for documents stored in the document storages. Also, since most of the documentation stored inside these document storages contains text, and since document processing is very document format dependent, the scope of this thesis focuses on documents with direct textual content and documents with content that can be converted into text with the help of optical character recognition (OCR).

1.2 Research structure

Research questions for this thesis were refined from the problems and needs of Vertex Systems Oy presented in Section 1.1. Since the company did not have any internal knowledge of machine learning applications, nor experience of how such solutions would benefit the process of software development and production, the following questions were chosen to be answered in this thesis:

1. What easy-to-deploy multi-label classification methods are suitable for handling relatively small industry-specific text document sets?

2. How well would a model pre-trained with common NLP text datasets perform with industry-specific text document sets?

Three methods were chosen in this thesis to assemble the information needed to answer the research questions: a literature study, empirical research, and a performance analysis of the implementation.

The literature study was carried out in order to gain and document knowledge on multi-labeling machine learning applications. Because of this, the information on multi-labeling solutions will not remain tacit but will be spread across Vertex Systems personnel for further use. To extend the theoretical approach and to test the theory in practice, empirical research was performed in the form of implementing a multi-labeling ADC solution.


The purpose of the implementation phase was to craft a robust, training-data-skew tolerant multi-label classifier that provides extensible transfer learning capabilities. This classifier implementation is also a continuation of the work done by Mourão et al. in [10], since it extends the classifier's capabilities from the single-label to the multi-label domain. After the implementation phase, the performance of the implemented ADC solution was measured and analyzed in terms of a few common performance metrics and compared, insofar as it is relevant, with the LRN-WRM method presented in [10].

1.2.1 Thesis structure

This thesis is divided into six chapters, the first of which is 1 Introduction, presenting the motivations and background as well as the research questions and methods of this thesis. The second one, 2 Industry-specific document sets, focuses on clarifying the concept and special properties of data sets strongly related to a certain area of industry. The third chapter, 3 Multi-label classification, presents the steps and methods of document processing and multi-label classification, commonly used multi-label data sets and their differences, and the performance metrics commonly used in the multi-label classification domain.

In the fourth chapter, 4 A novel graph-based multi-label classification method, a NetClass-based method and the process of finding the features used in the implementation are presented. Also, the concept and benefits of incremental training and facts about the computational complexity caused by the algorithmic NetClass method are shown. In chapter 5 Results and analysis, the effects of data set language and the performance of the implemented classifier are evaluated and discussed. In this chapter, issues related to storing the trained models and excessive memory consumption are also addressed. Finally, in chapter 6 Conclusion, the main results and their importance are discussed in addition to the recommendations, suggestions and limitations of the multi-label ADC implemented in this work.


2. INDUSTRY-SPECIFIC DOCUMENT SETS

Companies from various industries are handling and archiving lots of documents related to all sorts of business areas. Most companies share the same subset of core document types used as internal documentation, like business and marketing reports or contracts of employment [1]. In addition to that, companies, especially in the manufacturing industry, have their own kinds of documents related to their own areas of expertise. These documents are usually about various processes, products and manuals. [11]

In this chapter we will present the concept of industry-specific document sets and the threats, but also the opportunities, that such document sets provide. In addition to that, the limitations of processing a broad range of different document formats and the effect of varying class spaces among different companies are addressed.

2.1 Traditional categorization

The wide variety of documents stored in document storages is an ongoing issue also in Vertex Systems Oy, which serves customers from a wide range of industries. Some of these companies manufacture their own products, while some act as engineering offices or consultants for other businesses. [11] Much depending on the field of industry, the documents stored by the companies may differ a lot. This introduces the concept of industry-specific document sets, which are groups of document content types common to a certain area of industry. This implies that companies within the same area of industry most likely share a great deal of the same kinds of documents with each other. Also, a highly specialized company may have developed its own unique set of document content types that overlaps poorly with the sets of document content types of any other company.

These industry-specific documents are often handled and categorized by external product data management (PDM) or product life-cycle management (PLM) systems to maintain links between related products and documents [3]. Some companies also use their own methods to archive the documents, for example plain file systems and directory structures hosted locally as on-premises storage or, more often, on cloud platforms like Dropbox or Microsoft OneDrive [11].


A common factor for all stored documents, whether stored by companies or individuals, is that they all need to be classified and organized in some way in order to make searching for those documents later on feasible. Documents without any identification information, lying in unrelated flat directories, are difficult to utilize later on. [3] Such classification and organization of documents often requires much manual work, which was one motivational factor for this work to take place.

The strategies behind the organization and categorization of various types of documents may differ a lot between organizations and individuals, but the common principle in each of those strategies is trying to assign each document to one or more pre-defined slots that represent the content or the properties of the document as accurately as possible. Usually it is more intuitive for humans to label a certain piece of data as belonging to more than one pre-defined slot. [4]

These pre-defined slots have usually been directories within a physical or a virtual computer file system. File systems not only present a flexible way to manually handle different types of documents but also provide a possibility to form hierarchical tree-like directory structures to organize the documents depending on their mutual relations [3] [12]. This hierarchical mapping makes it possible to rank the documents by some taxonomy where the root node, or in this case directory, of the tree represents the top-level concept and each of its child nodes represents some lower-level concept, recursively [12]. With this approach, the path of a document in the file system carries the information about the classification of that particular document. [3]
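The idea that a document's path encodes its classification can be sketched as follows; this is a minimal illustration in Python, where the root directory and the file layout are hypothetical:

```python
from pathlib import PurePosixPath

def labels_from_path(path: str, root: str = "/documents") -> list:
    """Interpret each directory level below the storage root as an
    increasingly specific class label for the document."""
    relative = PurePosixPath(path).relative_to(root)
    # All directory components, excluding the file name itself.
    return list(relative.parts[:-1])

# A document stored under a hierarchical taxonomy:
print(labels_from_path("/documents/engineering/structural/beam_calc.pdf"))
# -> ['engineering', 'structural']
```

Here the top-level directory acts as the most general concept and each nested directory as a more specific one, mirroring the taxonomy described above.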

Forming such hierarchical directory structures does not always have to be manual work, and there are various kinds of tree traversal algorithms and solutions exploiting visitor patterns that automatically form and trace these tree nodes. One benefit of these nested file system directories is also that the structure they form is usually human-readable, and it is easy for a human to manually add and search for certain documents if the naming convention of those directory nodes is reasonable. [2] [5] Such file-system-based categorization is, however, mostly used on a smaller scale by individuals rather than by companies. Yet still, even for individuals, document storing techniques have been under research, and solutions to relieve the manual work have already been created [2].

Even though file-system-based categorization is popular and often meets the requirements even in company-wide solutions, it often introduces additional maintenance work to keep the directory structure in shape and also demands reasonable and consistent naming conventions for the directories to keep it human-readable [3]. Also, the process, automated or manual, of finding a certain document in the nested directories suffers from overhead caused by constantly opening one directory, checking its content, closing it and moving to the next directory node [5]. Such a process done programmatically is memory speed dependent and thus is often too heavy for handling large document masses without the help of a separate indexing of documents.

Also, non-heuristic graph algorithms for tree traversal, like depth-first search (DFS) or breadth-first search (BFS), have a worst-case time complexity proportional to the sum of the vertices and edges in the whole graph, meaning that in the worst case all directories have to be checked when searching for a specific document. Of course, this is the case when there is no clue of the whereabouts of the target document in the directory structure. This time complexity can be reduced to a logarithmic scale by naming the directories accordingly, so that most of the paths can be rejected by the name of the parent directory alone.
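Such an exhaustive depth-first traversal can be sketched over a directory tree modeled as a nested dictionary; the storage layout below is hypothetical:

```python
def dfs_find(tree: dict, target: str, path=()):
    """Depth-first search over a nested dict modeling a directory tree.
    In the worst case every node is visited, which gives the O(V + E)
    bound discussed above."""
    for name, child in tree.items():
        current = path + (name,)
        if isinstance(child, dict):      # a sub-directory: descend into it
            found = dfs_find(child, target, current)
            if found:
                return found
        elif name == target:             # a file: check for a match
            return current
    return None

storage = {
    "invoices": {"2021": {"inv_001.pdf": None}},
    "drawings": {"beam.dwg": None},
}
print(dfs_find(storage, "inv_001.pdf"))
# -> ('invoices', '2021', 'inv_001.pdf')
```

With descriptive directory names, whole subtrees could be skipped before descending into them, which is the pruning idea mentioned above.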

Another option for storing such large document masses is to introduce a database that holds a record for each document, storing its id, type, name and other useful attributes. With this approach the documents themselves do not have to be ordered in any way, since their path in the file system does not represent their class information anymore but rather just acts as a breadcrumb pointer for the database to fetch the document needed. [3] An implementation based on the semantics of the document content rather than traditional path-based storing was exploited by Eck & Schaefer in [3] to store and manage CAD/CAE files. With this approach, searching for a certain document is easier and faster with the help of separate indices, which can be either iterated over or, better, searched with an O(log(n)) binary search algorithm.
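The index-based lookup described above can be sketched with Python's standard bisect module; the document records here are hypothetical:

```python
import bisect

# A hypothetical document index: (name, storage path) records,
# kept sorted by name so that lookups run in O(log n) time.
index = sorted([
    ("beam_calc.pdf", "/store/7f/beam_calc.pdf"),
    ("contract.txt", "/store/a1/contract.txt"),
    ("invoice_001.pdf", "/store/c9/invoice_001.pdf"),
])

def find_document(name: str):
    """Binary search over the sorted index; the returned path is only
    a pointer to the stored file, not class information."""
    names = [record[0] for record in index]
    position = bisect.bisect_left(names, name)
    if position < len(names) and names[position] == name:
        return index[position][1]
    return None

print(find_document("contract.txt"))
# -> /store/a1/contract.txt
```

A real system would keep the index in a database rather than in memory, but the logarithmic lookup principle is the same.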

2.2 Document formats

The documents saved in these storages by companies usually have very few restrictions on their file formats, which makes it easy for the end-users to upload almost anything related to the matter they are dealing with [5] [11]. The flexibility of unrestricted file formats comes with a price tag: it is harder to support, maintain and search over a wide scale of different types of files [5]. The wide variety of different file formats also causes compatibility and interoperability problems among the software that processes these files. This can often be seen not only when importing a file into a non-native reader software but also within a different version of the same software that created the file in the first place.

In addition to the diversity of document formats, each of the stored documents can be either structured data, like Excel sheets, or unstructured data, like emails, textual documents and CAD files, which adds another level of complexity to the documents [5]. Even though structured data is naturally easier to process both for humans and computers, in an average document storage structured data accounts for only approximately 15% and unstructured data for approximately 85% of the document contents [6] [5]. This raises the question of how to effectively make use of the latent information hidden inside the majority of the documents.


Processing such a large variety of different file formats is, if not impossible, very hard to support. Thus, many PDM or PLM systems often convert the uploaded files into more widely supported formats. [3] This conversion process is also implemented in Vertex Flow PLM, which has an option to convert uploaded 3D models and 2D drawings into, for example, STEP or PDF formats, which are then easier for other programs to digest [13] [11].

To categorize files of all formats by their content, manual work needs to be done either during the uploading process or later, by manually opening a file, reviewing its content and assigning it to a certain pre-defined slot. These pre-defined slots can be, for example, file system directories or database records. Since this manual classification and categorization process takes a lot of human labor, it easily becomes impractical when dealing with large document masses based on a large variety of file formats [5].

On the contrary, when approaching this classification problem with the help of automated, data-intensive solutions, such as pattern recognition and machine learning, large document masses become a valuable resource that provides data to observe and learn from. However, to make it possible for machine learning algorithms to digest such information, the information inside the documents, structured or unstructured, needs to be represented in a standardized form so that it can be used effectively [5]. Even for machine learning applications, handling a large variety of file formats without any standardization is impractical.

Since a vast majority of the documents stored by companies from various industries have some textual content, or at least a few words representing natural language, the processing of these documents is reasonable to do by means of natural language processing, or NLP for short. A minority of the documents in these storages are not textual documents, but rather images, audio files or files produced by computer-aided design (CAD) software or some other domain-specific software. [11] These documents are not reasonable to treat as text, and even though the encoding of these files can often be represented as ASCII characters, classifying such documents is a separate problem. It is still worth mentioning that a comprehensive document classification can also take advantage of the images in a document by using tools of document image analysis, or DIA for short [1].

However, some files that are not natively based on text but might contain textual content in some form, like PDFs or images, can be converted into text by extracting the textual content with optical character recognition, or OCR for short. [1] OCR processing needs to be taken into consideration especially when a significant share of a company's documents are based on some kind of predefined layout or form, like invoices, technical drawings or similar. Such documents often contain key-value pairs, tables or annotations that can be treated as text. [1]
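As an illustration of the last point, key-value pairs in OCR output can be treated as plain text and picked up with a regular expression; the sketch below assumes a hypothetical "key: value" line format in the recognized text:

```python
import re

# Hypothetical OCR output from a form-like document such as an invoice.
ocr_text = """Invoice number: 1042
Customer: Vertex Systems Oy
Total: 250.00 EUR"""

# Treat every 'key: value' line as an extractable key-value pair.
pairs = dict(re.findall(r"^([^:\n]+):\s*(.+)$", ocr_text, flags=re.MULTILINE))
print(pairs["Customer"])
# -> Vertex Systems Oy
```

Real OCR output is noisier than this, so a production pipeline would need tolerance for recognition errors, but the principle of treating form fields as text is the same.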


With this delineation, only documents of certain formats are taken into account in this work and used as training and testing documents, while the rest are ignored. Such accepted document formats are plain text files and derivatives of bitmap-based image files that can be OCR processed. The recognition of suitable documents is done by observing their file format suffix. The supported suffixes for processing are .txt, .pdf, .jpeg, .tiff, .png and .bmp.
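This suffix-based recognition can be sketched as a simple filter; the file names below are made up for illustration:

```python
# Suffixes accepted for processing in this work: plain text plus
# bitmap-derived image formats that can be OCR processed.
SUPPORTED_SUFFIXES = {".txt", ".pdf", ".jpeg", ".tiff", ".png", ".bmp"}

def is_processable(filename: str) -> bool:
    """Recognize a suitable document by its file format suffix."""
    if "." not in filename:
        return False
    suffix = "." + filename.rsplit(".", 1)[-1].lower()
    return suffix in SUPPORTED_SUFFIXES

print([f for f in ["report.txt", "model.step", "scan.TIFF", "photo.bmp"]
       if is_processable(f)])
# -> ['report.txt', 'scan.TIFF', 'photo.bmp']
```

The comparison is case-insensitive, since file systems do not enforce a consistent casing for suffixes.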

2.3 Non-constant class space

These industry-specific documents undeniably contain a wide scale of different matters and topics. Documents produced by an architect office and by a company making structural design and strength calculations for buildings will likely have at least some common building-related content. Yet still, the documents about structural design and strength calculations contain more mathematical and physical aspects of the building process, like material selection and durability analysis, than the documents produced by an architect office, even though both companies operate in the construction industry. [11]

This difference becomes even more pronounced when comparing two or more companies operating in even more distant industries: A company specialized in electrical design and another company specialized in the production of medical instruments will have very different topics and minimal amount of common factors in their documentation.

Even though most of the internal core documentation, like marketing or business reports, often resembles each other between different companies, the most important documents, those related to a company's specialty, often differ much in content between companies from different industries. Each company's own area of expertise and adopted processes affect what kind of documents are being created and handled.

This diffusion of topics among the documents results in the ground truth classes of these documents varying a lot. In addition, not all classes in these document storages are of equal size; instead, the number of documents per class might differ by several orders of magnitude. Such skew in data often causes problems in machine learning applications [14].

Also, the documents stored are obviously not limited to containing content from only one topic [8]. This means that the classes in the available class space are not mutually exclusive.

For example, a document containing a contract of employment may be classified as belonging to the classes personal information, contract, recruitment and many others, depending on the company's habits and practices. Because of this, there exists a wide scale of suitable labels per document, resulting in a vast number of possible label combinations [8]. This is just what multi-label classification is about: mapping multiple non-mutually exclusive labels to each piece of test data [15].

Figure 2.1. The difference between multi-class and multi-label classification problem domains. Based on [15]

There is also a common misunderstanding that multi-class and multi-label classification are synonyms, which is not the case. Multi-class classification refers to a situation where the class space consists of at least three classes and each classifiable instance is assigned to exactly one of them. Multi-label classification, on the other hand, means that each classifiable instance can be assigned to multiple classes simultaneously.

This difference is also shown in Figure 2.1. The possibility of assigning one instance to multiple classes simultaneously reduces the size of the class space needed to describe the nature of all instances. This is because differences in each and every property of an instance do not have to have their own specialized pre-defined class; instead, a few suitable higher-level classes can be combined in proper proportions to describe the same characteristics. [15]

Such a multi-labeling approach is well known from, for example, movie descriptions, in which one movie can be assigned to multiple categories simultaneously, like horror, drama and comedy. In the same way, documents containing images, text or other forms of information may be classified with multiple labels. [15]


3. MULTI-LABEL CLASSIFICATION

As a high-level description, document classification is a sequence of actions trying to extract and associate the features of a document with the document's ground truth class [1]. This sequence of actions will, however, vary a lot based on the classification problem and the methods used to solve it. Actions used in traditional single-label classification may not apply to multi-labeled situations at all, and because of that, other approaches need to be considered. [8]

In this chapter the theory and common practices related to multi-label classification problems are presented by diving into document processing, the taxonomy of classifiers and the performance metrics used. From the aspect of document processing, we will concentrate on text normalization, noise reduction, and feature extraction and selection. Finally, we will present some standardized and popular metrics useful for classifier performance analysis and comparison.

3.1 Document processing

Properly processing the textual content of documents used in machine learning is a key element of a successful classification process. Proper processing means that the raw human-written text is converted into a more standardized, feature-rich form which can be used as training data for machine learning applications. [16] [1] Each document that represents a certain class should have a sufficient amount of discriminative features that are common for that particular class. Otherwise, the lack of discriminative features may lead to inadequate accuracy, and to prevent this from happening, one can augment the features of a feature-poor document by injecting synthetic ones to make the document usable. [16]

Figure 3.1. Traditional classifier taxonomy

Document processing aims to find and extract these discriminative features from the text for the classifier to actually consume those when defining the class for a test document.

This processing phase can be divided roughly into two stages: pre-processing and post-processing [17]. In the first, pre-processing, stage, the document contents are read and various conditions and techniques are applied in order to reduce noisiness by filtering out the words or characters that are not relevant for further inspection [5]. These techniques may include noise removal, normalization, lowercasing, stop word removal and stemming or lemmatization.

The post-processing stage, on the other hand, is meant for further refining the data by manipulating existing features or adding completely new features to the text. This stage may involve investigating word co-occurrences or the sequential appearance of specific words. A common approach that exploits the sequential appearance of words is collecting sequences of adjacent words, often called word N-grams. [16]

The steps needed in pre- and post-processing are heavily dependent on the problem being solved. For example, solving one problem may rely strongly on observing the numerical data in the documents, while most natural language processing classifiers tend to filter any numerical data out [17]. The steps needed can also be language dependent. Therefore, there are no rules that apply to each and every situation; one must instead consider and test which processing methods are suitable for the problem at hand.

A common pipeline for text processing would be of the following kind [17]:

1. Remove special characters
2. Lowercase words
3. Remove stop words
4. Transform words to their common base forms

The first two steps, special character removal and lowercasing, are quite straightforward techniques of iterating over the words and manipulating or removing each word based on a condition. In order to apply special character and stop word removal, one must first obtain a rule, list or some other information based on which the removal is done. There are no universal rules or lists for these removal steps, but some community-created open source materials are available, for example here [18]. [17]
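As a sketch, the four-step pipeline above could look like the following; the stop word list and the base-form dictionary here are illustrative stand-ins, not a universal resource:

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "of", "and"}  # illustrative, not exhaustive

def preprocess(text: str, base_forms: dict) -> list:
    # 1. Remove special characters, keeping only letters, digits and whitespace.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # 2. Lowercase the words.
    words = text.lower().split()
    # 3. Remove stop words.
    words = [w for w in words if w not in STOP_WORDS]
    # 4. Transform words to their base forms (a simple dictionary lookup here).
    return [base_forms.get(w, w) for w in words]
```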

The last step, transforming the words to their common base form, is the most difficult of these steps. This is because most words, regardless of the language, have multiple inflected forms that need to be transformed by following some logic. [7] The options for this step are either stemming or lemmatization, both of which aim to make the base form transformation but which differ greatly in their implementations: a stemmer tries to recognize the unchangeable part of a word and cut the rest off, while a lemmatizer makes a dictionary search and fetches the correct base form from it [7].

By stemming, the inflected forms of a word with mutable suffixes most likely won't match the actual dictionary base form, but it is often good enough to reduce multiple word variants into one or a couple of base candidates [7]. Lemmatizing the words, on the other hand, gives the perfect base form, but is naturally a more complex process and thus performs slower than simple stemming. For machine learning applications finding the perfect base forms is not important as long as the inflected forms are reduced to the same base form, no matter whether it is grammatically correct or not. [7]
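A minimal suffix-stripping stemmer in this spirit might look as follows; the suffix list is purely illustrative, and a real implementation would use an established stemmer such as Porter's:

```python
# Suffixes to strip, longest first so longer matches take priority.
SUFFIXES = ["ations", "ation", "ing", "ed", "es", "s"]

def stem(word: str) -> str:
    """Strip one known suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

Note that "calculations" and "calculation" both reduce to the same (grammatically incorrect) base "calcul", which is exactly the behaviour that suffices for machine learning purposes.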

Even though text processing in NLP applications aims to convert the words of the documents into their base forms in order to provide them as discriminative features, not all words serve as good features individually. Instead, making so-called compound features from word combinations or word relations may provide more discriminative features in some cases. [16] Such techniques exploiting term relations may be based on syntactic phrases, which are groups of words gathered in a grammatical sense; word N-grams, which represent a sequential appearance of words in text; or non-adjacent word co-occurrences, which extend the N-grams approach by exploring not only the local but the global word neighborhood [16].

Using word co-occurrences or relations between words can be a very powerful way of generating good discriminative features and injecting them into the original text as if they had always been there. [16] Indeed, the relational information extracted from the co-occurring words is not new information, as it has always been there as latent information. It is common that neural networks and other deep learning classifiers tend to learn to exploit this information naturally, while finding and exploiting it is harder for classifiers based on statistics and algorithmics [19] [20].
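Word N-grams and document-level co-occurrence pairs of the kind described above can be extracted along these lines (illustrative helper functions):

```python
from itertools import combinations

def word_ngrams(tokens, n):
    """Sequences of n adjacent words, here joined with '_' into single features."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cooccurrences(tokens):
    """Unordered pairs of distinct terms co-occurring anywhere in the document."""
    return set(combinations(sorted(set(tokens)), 2))
```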

3.2 Multi-label data sets

Data sets for multi-label classification problems differ from those for single-label classification, just like the performance metrics presented in Section 3.4. The major difference is in the number of ground truth labels per instance in a data set. This number is a constant one (1) for single-label instances, but for multi-labeled instances it may theoretically be any positive integer. The number of labels per instance forms the concept of cardinality of the data set, which represents the average number of ground truth labels per instance of that particular data set. Thus, the cardinality of a single-label data set is always exactly one (1), while for multi-label data sets it is a positive decimal number, making cardinality the property that separates single- and multi-label data sets from each other. [8] [21]

There are lots of multi-label data sets available, most of which are collected in an online data set repository [21]. This repository lists several data set characterization metrics for each available data set to help in selecting the ones with desired properties. Such data set characterization metrics are the domain, the number of instances or m-value, the number of attributes or d-value, the number of labels or q-value, the label cardinality Card, the label density Dens, the labelset diversity Div, the average imbalance ratio per label or avgIR for short, and the ratio of unconditionally dependent label pairs by chi-square test, abbreviated rDep. [21]

The m-value represents the number of instances in the data set, thus defining the data set size. Most of the available multi-label data sets have at least thousands or tens of thousands of samples, like 20NG, Bookmarks and Imdb, yet some are very concise, like 3s-bbc1000, 3s-inter3000 and Stackex_coffee, containing only a few hundred samples each. The number of labels or q-value is another important metric that states how many classes are present in the data set. The q-value, however, does not reveal anything about how evenly the instances in the data set are distributed across these labels. The label distribution is given by avgIR, for which the value one (1) represents a perfect distribution, while greater values indicate an imbalanced and skewed data set.

Cardinality Card measures the average number of ground truth labels per instance, which again does not tell anything about how balanced the number of labels is among all instances. Label density Dens, on the other hand, expresses the ratio of the data set's cardinality and the total number of labels in that data set. This is a measure that seems to be important for multi-label classifiers, at least in solutions based on neural networks [22].
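Given these definitions, Card and Dens can be computed directly from the ground truth label sets; a sketch:

```python
def cardinality(label_sets):
    """Card: average number of ground truth labels per instance."""
    return sum(len(labels) for labels in label_sets) / len(label_sets)

def density(label_sets):
    """Dens: cardinality divided by the total number of distinct labels."""
    all_labels = set().union(*label_sets)
    return cardinality(label_sets) / len(all_labels)
```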

3.3 Multi-label classification methods

Generally, text classification algorithms are often categorized with a simple taxonomy that partitions general classification methods into more specialized sub-classification methods recursively, as shown in Figure 3.1. This way, the general concept of a classifier consists of single-label and multi-label classifiers, both of which again have their own sub-types.

Single-label classifiers map one and only one class from the class space to each instance, either from two possible classes (binary) or from multiple possible classes (multi-class). [23]

Multi-label classifiers, on the other hand, can map up to |C| classes to each instance, where |C| is the number of classes in the whole class space, either by transforming the problem into multiple single-label problems (problem transformation) or by using an algorithm that can determine multiple suitable classes directly (algorithm adaptation) [24] [23]. Problem transformation methods decompose the multi-label classification problem into numerous binary classification problems by creating |C| binary classifiers that can each decide whether the classifiable instance is or isn't assigned to that particular class [8]. Since problem transformation classification is based on a sequence of binary classifiers, its weakness is that those binary classifiers do not interact with each other, thus missing the information of term and class relations [23] [8].

This weakness of problem transformation methods can be bypassed by using algorithm adaptation methods that can utilize the term and class relations [10]. This makes algorithm adaptation methods more complex than the problem transformation ones, but being able to exploit that extra piece of information often compensates for the effort.

For example, in the case of |C| = 5, we would need either five (5) independently trained binary classifiers, each individually determining whether a test instance is or isn't assigned to the class it represents, or one pure algorithm adaptation approach that can take advantage of combining the term and class relation information. [25]
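A minimal sketch of the problem transformation (binary relevance) idea; the keyword-based base classifier here is a toy stand-in for any real binary learner:

```python
class KeywordBinaryClassifier:
    """Toy binary classifier: predicts positive if any trained keyword occurs."""
    def __init__(self):
        self.keywords = set()

    def fit(self, documents, positives):
        for doc, positive in zip(documents, positives):
            if positive:
                self.keywords.update(doc.split())

    def predict(self, document):
        return bool(self.keywords & set(document.split()))

class BinaryRelevance:
    """One independent binary classifier per class in the class space."""
    def __init__(self, classes):
        self.models = {c: KeywordBinaryClassifier() for c in classes}

    def fit(self, documents, label_sets):
        # Each binary classifier is trained in isolation: no interaction,
        # hence no term/class relation information is shared between them.
        for c, model in self.models.items():
            model.fit(documents, [c in labels for labels in label_sets])

    def predict(self, document):
        return {c for c, model in self.models.items() if model.predict(document)}
```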

In addition to the taxonomy-based classifier categorization, other approaches exist for distinguishing different types of classifiers. For example, according to Newton Spolaôr et al. in [4], classification algorithms can also be categorized by reviewing three different classifier properties, shown in Figure 3.2: the level of exploiting label dependencies, the scope of exploiting multiple labels and the way the features used are extracted.

An algorithm adaptation method based on a graph network model was presented as a single-label classifier by Mourão et al. in [10]; it is based on associative classification that exploits the term relationships. This relational model works not only in single-label problems but is applicable to multi-labeling problems with slight modifications.

The same kind of idea of analysing a network of concepts was presented in [26]. Unlike [10], this approach exploited the label relations rather than the term relations. Despite that difference, these two approaches closely resemble each other.

3.4 Performance metrics

The performance metrics used for multi-label classification algorithms differ from those for single-label ones, since the classes in a traditional single-label classification problem are mutually exclusive [27] [28] [15]. Common ways to illustrate how well a multi-label classifier performs with a certain dataset are Hamming loss, precision, accuracy, recall and the F1-measure [29] [27] [25]. These are the five most popular multi-labeling metrics used in the literature, determined by counting how many times different performance metrics appeared in 64 papers written about multi-labeling problems [30]. This statistic can be seen in Table 3.1.


Figure 3.2. A dimension based taxonomy to organize documents on multi-label feature selection. Adapted from [4]

Metric          Number of papers
Hamming loss    55
Accuracy        26
F-Measure       18
Precision       18
Recall          18

Table 3.1. The five most popular performance metrics used in the literature for multi-label classifiers. Total number of papers: 64. Adapted from [30]


Figure 3.3. Different measure types based on problem type and measure type. Adapted from [30]

The popularity of these five metrics is probably due to their nature: all of them are so-called example-based classification metrics. This means that they assess the overall performance of the classifier by processing each test instance separately and then taking a mean value across the whole test data set. This differs from label-based methods that evaluate the classifier performance label-wise by calculating macro- and micro-averaged values. [30] These performance metric categories are presented in Figure 3.3.

Each of these metrics is popularly used to highlight a different aspect of the classifier's performance, and thus we will also use these metrics in this work. Next, let's have a look at how each of these performance metrics is calculated and how their values are interpreted.

In each of the following equations, $q$ is the number of labeled instances in a testing set, and for each possible label $l_j$ the value $y^j$ is either 1 or 0 depending on whether that particular label is present in an instance (1) or not (0). $y_i$ represents the ground truth label set and $\hat{y}_i$ the predicted label set of instance $i$. Also, $m$ is the number of labels in the label space, so an instance's labels form an $m$-dimensional binary vector $[y^1, y^2, ..., y^m]$.

First, the Hamming loss tells the ratio of misclassified labels to the total number of labels. Naturally, lower values of Hamming loss indicate better accuracy and higher values indicate poor classification accuracy. Since for the same classification problem the total number of labels is constant and the number of misclassified labels is somewhere between zero (0) and the total number of labels, the Hamming loss value is always a fraction between zero (0) and one (1). [29] [25] The formula to calculate Hamming loss can be seen in Equation 3.1. In this formula the indicator term $\mathbf{1}[y_i^j \neq \hat{y}_i^j]$ evaluates to one (1) if the ground truth value $y_i^j$ and the predicted value $\hat{y}_i^j$ differ, and to zero (0) otherwise.

$$ \mathrm{HammingLoss} = \frac{1}{mq} \sum_{i=1}^{q} \sum_{j=1}^{m} \mathbf{1}\left[ y_i^j \neq \hat{y}_i^j \right] \quad (3.1) $$
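Computed over m-dimensional binary label vectors, Hamming loss can be sketched as follows (an illustrative implementation, not the evaluation code of this work):

```python
def hamming_loss(y_true, y_pred):
    """Ratio of misclassified labels to the total number of labels.

    y_true and y_pred are q lists of m binary label indicators each.
    """
    q, m = len(y_true), len(y_true[0])
    mismatches = sum(
        1
        for yi, pi in zip(y_true, y_pred)
        for yj, pj in zip(yi, pi)
        if yj != pj
    )
    return mismatches / (m * q)
```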

The second metric, precision, tells the ratio of correctly classified ground truth positive examples to all the examples the classifier classified as positive [28]. This means that if the classifier never classified a negative test instance as positive, in other words produced no false positives, the precision value would be 1, which is a perfect score. The equation for precision can be formulated either by means of set theory, shown in Equation 3.2, or purely as the ratio of true positives $TP$ to the sum of true positives $TP$ and false positives $FP$, shown in Equation 3.3.

$$ \mathrm{Precision} = \frac{1}{q} \sum_{i=1}^{q} \frac{|y_i \cap \hat{y}_i|}{|\hat{y}_i|} \quad (3.2) $$

$$ \mathrm{Precision} = \frac{1}{q} \sum_{i=1}^{q} \frac{|TP|}{|TP| + |FP|} \quad (3.3) $$

The third metric, accuracy, depicts the ratio of correctly labeled instances among all instances. Such a value is easily calculated in a single-label situation, as we can simply divide the number of correct predictions $TP$ by the total number of predicted true positives $TP$, false positives $FP$ and false negatives $FN$, as in Equation 3.5. However, accuracy can also be calculated for multi-label problems in a slightly different way by exploiting set theory, as shown in Equation 3.4 [31]. Also for this metric, higher values mean better classifier performance.

$$ \mathrm{Accuracy} = \frac{1}{q} \sum_{i=1}^{q} \frac{|y_i \cap \hat{y}_i|}{|y_i \cup \hat{y}_i|} \quad (3.4) $$

$$ \mathrm{Accuracy} = \frac{1}{q} \sum_{i=1}^{q} \frac{|TP|}{|TP| + |FP| + |FN|} \quad (3.5) $$

The fourth metric, recall, gives the ratio of correctly classified positive examples $TP$ to the total number of positive examples existing. This ratio represents how large a share of the positive examples is found, as shown in Equations 3.6 and 3.7. A recall value of 1 means that all positive examples were classified correctly, and a value of 0, on the contrary, means that not a single positive example was classified correctly. Because of this, recall values near 1 are indicators of better performance. [28]

$$ \mathrm{Recall} = \frac{1}{q} \sum_{i=1}^{q} \frac{|y_i \cap \hat{y}_i|}{|y_i|} \quad (3.6) $$

$$ \mathrm{Recall} = \frac{1}{q} \sum_{i=1}^{q} \frac{|TP|}{|TP| + |FN|} \quad (3.7) $$

The fifth and last metric, the F1-measure, combines recall and precision by calculating a weighted harmonic mean. In multi-label classification the formula for the F1-measure represents exactly the same idea as in the single-label situation but is again slightly more complicated. [27] This can be seen in Equation 3.8. Again, larger values of the F1-measure indicate better classifier performance. In the multi-label domain the F1-measure is usually used to compare the effectiveness of different classifiers, since it conveniently wraps the recall and precision properties in itself.

$$ F_1 = \frac{1}{q} \sum_{i=1}^{q} \frac{2 \, |y_i \cap \hat{y}_i|}{|y_i| + |\hat{y}_i|} \quad (3.8) $$
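The example-based forms of precision, recall, accuracy and F1 can be implemented directly on label sets; here F1 uses the per-instance harmonic mean 2|yi ∩ ŷi| / (|yi| + |ŷi|). This is a sketch, not the evaluation code used in this work:

```python
def example_based_metrics(y_true, y_pred):
    """Example-based precision, recall, accuracy and F1 over q label sets."""
    q = len(y_true)
    precision = recall = accuracy = f1 = 0.0
    for yi, pi in zip(y_true, y_pred):
        inter = len(yi & pi)
        precision += inter / len(pi) if pi else 0.0
        recall += inter / len(yi) if yi else 0.0
        accuracy += inter / len(yi | pi) if (yi | pi) else 1.0
        f1 += 2 * inter / (len(yi) + len(pi)) if (yi or pi) else 1.0
    return {"precision": precision / q, "recall": recall / q,
            "accuracy": accuracy / q, "f1": f1 / q}
```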

In addition, micro-averaging all of these metrics, especially the F1-measure, is commonly used to better address training data imbalance. In micro-averaging, each predicted class is weighted individually based on how many examples of that class were present in the data set. If the balance between the data set's classes is perfect, micro and macro averages result in exactly the same values. [28]


4. A NOVEL GRAPH-BASED MULTI-LABEL CLASSIFICATION METHOD

As mentioned in Section 3.2, many statistical properties exist for multi-label data sets. These properties introduce restrictions on using certain types of multi-label classifiers. Of these statistical properties, especially the values of cardinality Card and label distribution avgIR will cause the exclusion of certain types of classifiers. Classifier suitability is affected also by class imbalance and the size of the available training data. In addition, variance in the formats of the documents causes problems of its own, as pointed out in Section 2.2.

In this section we propose a novel multi-label classification approach based on the NetClass model presented by Mourão et al. in [10] that tackles most of these problems and leads to satisfactory results. This section also presents how to find discriminative features in text documents, how to train a model and how to use it as a classification tool. Additionally, the benefits and drawbacks of this model are presented.

4.1 NetClass based approach

The network-based model originally presented by Mourão et al. in [10] was used in that paper to solve a single-label classification problem. At its core, this model represents the terms found in training documents and their relations as an undirected cyclic graph, in which the vertices represent the terms and the edges represent the relations between those terms. In addition, the edges have two properties: the first is the class in which the relation is most dominant and the second is the dominance value itself.

A very similar model was used as the core of a classifier called INFORMys, which was one of the first automatic invoice processing systems, published already in 1998 [32] [1].

By creating such a graph, the relational information of the terms appearing in the training documents is compressed into a compact graph data structure, which can be further analyzed in the classification phase. Also, because the model is in the form of a graph, all graph algorithms are at our disposal if needed, providing efficient traversal methods.

The original process of creating such a graph, depicted in [10], is very similar to what was implemented in this work, with one major difference: the original NetClass model had only the most dominant class c_x and its dominance value dom_cx assigned per edge, while in our method an edge has assigned all the classes in which that term relation occurs, together with their strength metric values [dom_c1, dom_c2, dom_c3, ..., dom_ck]. This modification makes it possible in the classification phase to use the graph to provide a prediction not only for a single class but for multiple classes. The possibility of extending the NetClass model from the single-label to the multi-label problem domain was already acknowledged in [10], but the authors left the implementation for future work.

4.1.1 Model formation

The graph creation phase starts by reading in the training documents and splitting all the words in them into arrays, with a tag that tells the ground truth class from which these words came. Then some pre-processing needs to be done on the arrays of words. The need for different pre-processing steps is document type dependent, and more on those steps can be read in Section 3.1. The implementation done in this work uses stemming instead of lemmatization, simply because good lemmatizers are hard to find, especially for the C# language.

After sufficient pre-processing, the actual features of the text need to be found. Finding effective features is a key element in graph formation and will heavily affect classification performance. In most text classification methods, and also in the scope of this work, the pre-processed words are considered features themselves. This does not always have to be the case, since other linguistic properties, or for example sequences of certain words like word N-grams, can also be used as features. The justification for feature selection is further discussed in Section 4.2.

After the features, or terms in other words, are found, any duplicates are removed among them. This is because the vertices in the graph soon to be constructed from the terms found must be unambiguous, meaning the graph can contain a vertex representing a feature once and only once. Breaking this rule would break the normalization of the graph and lead to inconsistent behaviour in the classification phase. However, this duplicate information is not completely thrown away, since the term occurrence rate in a document is valuable information. This information is preserved by each term keeping an occurrence counter, which will eventually be exploited in the classification phase.

After the duplicates are removed, the actual graph generation can be initiated. This process starts by looping over the terms found in a training document and creating a vertex for each of them if that vertex does not already exist in the graph. Notice that while the terms within the same training document cannot contain duplicates, there can be, and usually are, duplicate terms among the whole training document set. This duplicate occurrence among different training documents is also valuable information that is stored as a property of a vertex for the classifier to exploit later.


Each time a vertex is added to the graph, edges are formed or updated to the other vertices of the graph that co-occur in the same training document from which the added vertex is coming. Also, if the vertex to be added already exists in the graph, edges are formed or updated from that vertex to the other vertices, either created or already existing in the graph, that appear as features in the same training document. The fact that all vertices co-occurring in the same document are connected with an edge results in each training document appearing as its own fully connected sub-graph, or clique, in the complete graph. These cliques may share multiple vertices with other cliques, and that is actually the most important aspect of this model, since it contains the information of how strongly documents from different classes are related to each other.

Creating or updating an edge between two vertices means that a new edge is created if the two connected vertices are not already connected, or, if the connection already exists, that edge is updated. During edge creation or updating, the edge is assigned a class appearance list. This list provides the information about the classes in which this particular edge occurs and, in addition, how many times the edge appears in each of these classes. At its core, this class appearance list represents the most essential information for the classification stage: the class appearance list of an edge tells how many times the two terms represented by the vertices at both of the edge's ends co-occur in each of the classes present in that particular class appearance list.

Formulation of this graph creation process into an algorithm can be seen in Algorithm 1.

As input, this algorithm takes a list of documents, each belonging to its ground truth class, and loops over each of those documents in order to form their representative sub-graph cliques. During this process, new edges are created and existing ones updated between the vertices. As a result, this algorithm returns a fully formed graph containing |V| vertices and |E| edges, where |V| is the number of distinct terms and |E| is the number of distinct term relations in the training data set. In theory, all the vertices in the graph could be related to each other, resulting in a maximum number of formed edges of |V| * (|V| - 1) / 2. Even though in real situations this formula does not apply to the whole graph, in the case of each individual document clique the edge count will comply with this rule.

The structure of a complete graph will look similar to what can be seen in Figure 4.1. In this figure there has been a total of three training documents, d1, d2 and d3, with a total of eleven different terms and 29 different edges. The sub-graph cliques formed by each training document are colored with different colors. Terms t2, t3, t8 and t9 are shared by more than one training document: t2 appears in documents d1 and d2, t3 in documents d1 and d3, t8 in documents d2 and d3, and t9 in documents d1 and d3, respectively.


Function CreateMultiLabelNetClassGraph
Input : list of documents
Output: Graph graph

foreach document in documents do
    foreach term in document.Terms do
        existingVertex = graph.Vertices.Find(term);
        inThisDocument = document.Vertices.Contains(existingVertex);
        if existingVertex != null then
            if !inThisDocument then
                newEdges = FormEdges(existingVertex, document.Vertices, document.ClassName);
                graph.Edges.AddRange(newEdges);
                existingVertex.AddClassAppearance(document.ClassName, term.DocumentAppearance);
                document.Vertices.Add(existingVertex);
            else
                existingVertex.AddClassAppearance(document.ClassName, 1);
            end
        else if !inThisDocument then
            newVertex = new Vertex();
            newVertex.Term = term;
            newVertex.AddClassAppearance(document.ClassName, 1);
            newEdges = FormEdges(newVertex, document.Vertices, document.ClassName);
            graph.Edges.AddRange(newEdges);
            graph.Vertices.Add(newVertex);
            document.Vertices.Add(newVertex);
        end
    end
end
return graph

Algorithm 1: Multi-label NetClass graph creation algorithm

All but one of the edges formed between the terms each appear in documents from just one class. The one exception, e10, connecting terms t3 and t9, appears two times in the training data set: one time in document d1 and another time in document d3. If these two documents, d1 and d3, are from different classes, X and Y, the edge e10 would have one (1) as the class appearance value for both of these classes X and Y. Likewise,


Function FormEdges
Input : vertex, list of vertices, className
Output: list of edges

createdEdges;
foreach loopVertex in vertices do
    existingEdge = loopVertex.Edges.Find(vertex);
    if existingEdge != null then
        existingEdge.AddClassAppearance(className, 1);
    else
        newEdge = new Edge();
        newEdge.Left = vertex;
        newEdge.Right = loopVertex;
        newEdge.AddClassAppearance(className, 1);
        vertex.Edges.Add(newEdge);
        loopVertex.Edges.Add(newEdge);
        createdEdges.Add(newEdge);
    end
end
return createdEdges

Algorithm 2: Creating new and updating existing edges between vertices

if documents d1 and d3 were from the same class X, the edge e10 would have two (2) as a class appearance value for the class X. In the latter case, since the edge e10 appears more frequently in class X, it will have a stronger voting power for class X than for class Y. The voting power and the classification algorithms are discussed further in Section 4.1.2.
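The two cases above can be illustrated with a small helper that counts an edge's class appearances from the class labels of the documents it occurs in. The document and class names are the illustrative ones from the text.

```python
from collections import Counter

def class_appearances(occurrences):
    """Count how many times an edge appears in documents of each class.

    occurrences: list of (document_name, class_name) pairs, one per
    occurrence of the edge in the training set.
    """
    return Counter(cls for _doc, cls in occurrences)

# Edge e10 occurring in d1 (class X) and d3 (class Y): one appearance each.
print(class_appearances([("d1", "X"), ("d3", "Y")]))  # Counter({'X': 1, 'Y': 1})

# If d1 and d3 were both from class X, e10's count for X would be two,
# giving the edge a stronger vote for class X.
print(class_appearances([("d1", "X"), ("d3", "X")]))  # Counter({'X': 2})
```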

4.1.2 Classification

In the classification stage the same pre-processing and feature extraction methods discussed in Sections 4.1.1 and 3.1 are applied to the test documents. As a result, each test document is transformed into an array of terms. One test document's array at a time, the terms in the array are looped through, and the graph created in the model formation stage is utilized to provide class-wise values for each term. Then, by somehow combining the class-wise values of each term, the final prediction of the classes for the test document is deduced. The core power of this model is indeed in the decision making algorithms that give a concrete definition for somehow combining the class-wise values. These algorithms define how to aggregate the class-wise values of the terms into a multi-label prediction.

These decision making algorithms deducing the final scores for each class consist of three main sub-algorithms:


Figure 4.1.An example structure of a term graph formed from three documents, yielding a total of eleven vertices and 29 edges

1. Neighbor selection algorithm A1
2. Term score aggregation algorithm A2
3. Final score aggregation algorithm A3

The implementation of each of these sub-algorithms heavily affects the result of the classification task. Some implementation alternatives may be genuinely analytical, some numerical, and some may be based on approximations or heuristics. Let us next have a look at how these decision making algorithms are utilized in order to deduce suitable classes for the test document.
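One possible way the three sub-algorithms could compose into a classifier is sketched below. The concrete choices here (A1 as top-k neighbor selection by total edge weight, A2 as summation of class-wise edge counts, A3 as summation with a score threshold) are illustrative assumptions only, not the implementations selected in this work; the point is the pluggable three-stage structure.

```python
def select_neighbors_a1(vertex, graph, k=3):
    # A1 (assumed): pick the k neighbors with the largest total edge counts.
    neighbors = graph["edges"].get(vertex, {})
    return sorted(neighbors, key=lambda n: -sum(neighbors[n].values()))[:k]

def term_score_a2(vertex, neighbors, graph):
    # A2 (assumed): sum the class-wise edge counts over selected neighbors.
    scores = {}
    for n in neighbors:
        for cls, count in graph["edges"][vertex][n].items():
            scores[cls] = scores.get(cls, 0) + count
    return scores

def final_score_a3(term_scores):
    # A3 (assumed): sum the per-term class scores into final class scores.
    totals = {}
    for scores in term_scores:
        for cls, s in scores.items():
            totals[cls] = totals.get(cls, 0) + s
    return totals

def classify(test_terms, graph, threshold=1):
    """Multi-label prediction: every class whose score reaches the threshold."""
    term_scores = []
    for term in test_terms:
        if term not in graph["edges"]:
            continue  # unknown terms carry no information and are discarded
        neighbors = select_neighbors_a1(term, graph)
        term_scores.append(term_score_a2(term, neighbors, graph))
    totals = final_score_a3(term_scores)
    return {cls for cls, s in totals.items() if s >= threshold}
```

Here `graph["edges"]` is assumed to map each vertex to its neighbors and their class-wise edge counts, as built during model formation; raising the threshold trades recall for precision in the multi-label decision.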

First, when iterating through the terms found in the test document, we try to find whether a vertex Vcurr corresponding to the term is present in the graph. If the graph does not include such a vertex, the term being handled is useless and discarded, since it will not have any effect on the class selection process. However, if such a vertex Vcurr is found in the graph, some of its neighbors {V2, V3, ..., VN} are selected with algorithm A1 for further inspection.

The neighbor selection criteria used by algorithm A1 are based solely on the properties of the
