Automated text sentiment analysis for Finnish language using deep learning


VILLE NUKARINEN

AUTOMATED TEXT SENTIMENT ANALYSIS FOR FINNISH LANGUAGE USING DEEP LEARNING

Master of Science thesis

Examiner: Prof. Outi Sievi-Korte

Examiner and topic approved by the Faculty Council of the Faculty of Computing and Electrical Engineering on 30th August 2017


ABSTRACT

VILLE NUKARINEN: Automated text sentiment analysis for Finnish language using deep learning

Tampere University of Technology

Master of Science thesis, 42 pages

May 2018

Master’s Degree Programme in Information Technology

Major: Pervasive Systems

Examiner: Prof. Outi Sievi-Korte

Keywords: machine learning, neural network, deep learning, natural language processing, sentiment analysis

The growing amount of information online opens new possibilities for companies to gather valuable data about their products and services. However, such a large amount of information must be collected and analyzed before it becomes useful. Gathering and classifying millions of data items by hand is not an option. Therefore, information collection and analysis should be automated for the process to be viable.

In this thesis, the focus is on automatic analysis of the data, specifically sentiment analysis of product reviews written in Finnish. The feasibility of sentiment analysis for the Finnish language is studied using recent deep learning techniques from the field of machine learning.

The results show that sentiment analysis for the Finnish language is feasible and that, with some improvements to the classifier, it could be used for automated sentiment analysis. However, the difference between the validation and test set results shows that, for example, further regularization could be needed to train a classifier that generalizes better to unseen data.


TIIVISTELMÄ

VILLE NUKARINEN: Automaattinen sentimenttianalyysi suomen kielelle käyttäen syväoppimista

Tampereen teknillinen yliopisto

Diplomityö, 42 sivua

Toukokuu 2018

Tietotekniikan koulutusohjelma

Pääaine: Pervasive Systems

Tarkastajat: Prof. Outi Sievi-Korte

Avainsanat: koneoppiminen, neuroverkko, syväoppiminen, sentimenttianalyysi

Kasvava informaation määrä Internetissä avaa uusia mahdollisuuksia yrityksille kerätä arvokasta tietoa heidän tuotteistaan ja palveluistaan. Kuitenkin tämä valtava määrä dataa pitäisi ensin kerätä ja analysoida, jotta siitä voitaisiin hyötyä. Tosin suuren datamäärän kokoaminen ja luokittelu käsin ei ilmiselvästi ole vaihtoehto. Siksi informaation kerääminen ja analysointi pitäisi automatisoida, että prosessi olisi kannattava.

Tässä diplomityössä keskitytään tiedon automaattiseen analysointiin, erityisesti suomen kielellä kirjoitettujen tuotearvostelujen sentimenttianalyysiin. Sentimenttianalyysin soveltuvuutta suomen kielelle tutkitaan uusimmilla syväoppimisen menetelmillä koneoppimisen alalla.

Tulokset näyttävät, että sentimenttianalyysi suomen kielelle on toteutettavissa ja pienten parannusten jälkeen luokittelijaa voitaisiin käyttää automaattiseen sentimenttianalyysiin. Kuitenkin validointi- ja testiaineiston tulosten välinen ero näyttää, että esimerkiksi regularisaation lisääminen voisi parantaa luokittelijan kykyä yleistyä uuteen materiaaliin.


PREFACE

This thesis was written in Tampere, Finland between August 2017 and May 2018. I would like to thank my friend Sami Jokela for coming up with the idea of collecting, analyzing and aggregating data to provide feedback for companies, which also led me to finally find an interesting topic to write about. I am also grateful to my friend Lauri Niskanen for checking the contents of the thesis. Lastly, I want to thank Professor Outi Sievi-Korte from Tampere University of Technology for supervising the work and giving valuable feedback.

Tampere, 18.5.2018


LIST OF FIGURES

2.1 An example of a neural network.

2.2 Illustration of a simple CNN architecture.

2.3 Max pooling applied with 2x2 filters and a stride of 2.

2.4 An example of a RNN neural network.

2.5 A standard RNN with only a single layer.

2.6 A LSTM network module contains 4 layers.

2.7 LSTM network notation explained.

3.1 Bag-of-words model representing a text document.

4.1 The distribution of product reviews by rating and recommended fields in the whole data set.

4.2 The distribution of product reviews in the whole data set.

4.3 Pre-processing presented shortly.

4.4 Visualization of the structure of the neural network.

5.1 Visualization of true and false negatives and positives.

5.2 The effect of vocabulary size on accuracy and loss of the neural network model.

5.3 The effect of embedding vector size on accuracy and loss of the neural network model.

5.4 The effect of review length on accuracy and loss of the neural network model.

5.5 The effect of LSTM output dimensions on accuracy and loss of the neural network model.

5.6 Visualization of the precision and recall for neural and dummy classifier.

5.7 Visualization of the F1 score for neural and dummy classifier.


LIST OF TABLES

2.1 Activation functions commonly used in neural networks.

3.1 An example of the rich morphological structure of Finnish language using one-word question ”Autoissammekinko?”.

3.2 An example of lemmatization.

3.3 Challenges of sentiment analysis.

3.4 Examples of n-grams by letter.

4.1 Product review fields and example values.

4.2 The whole data set is split into training, validation and test sets.

5.1 An example of true and false negatives and positives in the context of good and bad reviews.

5.2 Summary of the used metrics.

5.3 Hyperparameters and their values for grid search.

5.4 Accuracies and losses for the hyperparameter combination with the best result.

5.5 The best combination of hyperparameter values found using grid search.

5.6 Results of neural classifier using the validation set.

5.7 Results of neural classifier using the test set.

5.8 Results of dummy classifier, simple classifier, using the validation set.


LIST OF ABBREVIATIONS AND SYMBOLS

API    Application programming interface
CNN    Convolutional neural network
LSTM   Long short-term memory
NB     Naive Bayes classifier
NLP    Natural language processing
ReLU   Rectified linear unit
RNN    Recurrent neural network
SVM    Support vector machines


1. INTRODUCTION

One of the ways that companies can improve their products and services is being able to efficiently collect and analyze the vast amount of information available online. People write reviews in different online shops, recommend products to each other through social media, and provide feedback in several other ways online. For example, there is a huge volume of customer reviews [28] available online.

Aggregating all the customer reviews to find out the sentiment of customers could be a valuable information source for a company. This way manufacturers could get detailed information about their products and the opinions of customers. The information would not only be beneficial to manufacturers but also to retailers and potential customers. However, the problem is that as the number of reviews expands, analyzing the information by hand requires more and more manpower and resources. Thus, developing a tool for automatic sentiment analysis is highly desirable.

Fortunately, modern machine learning techniques offer an alternative approach for sentiment analysis. Analyzing large amounts of customer reviews could be possible if humans are not needed in the process. This can indeed be accomplished utilizing machine learning.

Sentiment analysis of online customer reviews has attracted researchers of machine learning [2] in the past decade. A lot of research has been done on the topic of sentiment analysis but the amount of research targeting Finnish language seems small.

The aim of this thesis is to explore whether it is feasible to automatically analyze the sentiment of text written in Finnish. The objective of the study is not to compare different machine learning algorithms but to find out whether sentiment analysis for the Finnish language is generally viable. If automatic sentiment analysis is found to be feasible for Finnish, it would become possible to create a service that, for example, automatically gathers and classifies information online and provides feedback to companies about customer satisfaction in Finland.

This thesis presents what sentiment analysis and deep learning are, how deep learning can be applied to sentiment analysis, and what should be considered when doing sentiment analysis for the Finnish language.

First, the theoretical background in machine learning and natural language processing is presented in chapters 2 and 3. Based on the theoretical background, chapter 4 will then cover all steps of the study. Last, the results of the study are presented in chapter 5 and conclusions in chapter 6.


2. MACHINE LEARNING

Machine learning can be defined as the ability to acquire knowledge by extracting patterns from raw data. This capability enables computers to tackle real-world problems which are hard or nearly impossible for systems that rely on hard-coded knowledge. [20]

A simple machine learning algorithm can be represented as a function y(x) which takes in a data vector x and produces a vector y as the result. For example, consider handwritten digits presented as images. Each image consists of 28x28 pixels, making a total of 784 values as one entry in the training data. The output in turn is a vector y that encodes which digit from 0 to 9 appears in the corresponding image. [5]

The form of the function y(x), also known as a model, is decided in the training phase based on the given training data. Once the model is trained, it can be used to identify digits in new images which were not in the original training data. This ability is called generalization. [5]

Some traditional machine learning algorithms, like logistic regression [13] and support vector machines (SVM) [45], can be used to solve various problems. However, these traditional algorithms rely strongly on the representation of the data they are given [20]. The data usually requires heavy pre-processing, and only a selected set of features is passed to the machine learning algorithm.

Sometimes it is not known which features should be extracted, or extracting the features is as hard as solving the original problem. This happens, for example, when designing an algorithm to determine if there is a car in a picture. The car can have multiple colors, the picture can be taken from many different angles which changes the shape of the car, the car might be almost black at night, and so on. [20]

Manual feature engineering is an essential part of traditional machine learning approaches. Feature extraction, feature selection and selecting a classification algorithm are fundamental questions in the machine learning driven methods [1]. However, feature engineering is time-consuming.

Fortunately, there is a new family of machine learning algorithms called deep learning which has shown great performance in natural language processing tasks, including sentiment analysis [12]. The benefit of these deep learning techniques is that they do not require manually extracted features but can automatically learn the features from the data [3]. The downside of deep learning is that it requires a lot of data to perform well [31].

The relevant topics of deep learning for this study are covered in the following sections. First, the basics of neural networks are covered in section 2.1. Second, a closer look at a specific kind of neural network called convolutional neural network is taken in section 2.2. Next, recurrent neural networks are presented in section 2.3. Last, regularization is discussed in section 2.4.

2.1 Basics of neural networks

Neural networks are computing systems that were first inspired by the structure and function of the brain [15]. Even though the origins of the term ’neural network’ are linked to finding mathematical representations of information processing in biological systems [46] [39] [40], we will focus our attention on a certain kind of neural network called the multilayer perceptron.

A neural network is made of one or more layers, each layer consists of one or more neurons, and there are connections between neurons. Neural networks can be trained to recognize certain patterns, for example to classify objects by features. For an example of a neural network see figure 2.1. [5]

The basic neural network model can be viewed as a series of functional transformations. First, $M$ linear combinations are constructed of the input variables $x_1, \dots, x_n$ in the following way:

$$a_j = \sum_{i=1}^{n} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \tag{2.1}$$

where $j = 1, \dots, M$, and the superscript $(1)$ implies that the parameters are in the first layer of the network. Parameters $w_{ji}^{(1)}$ are called weights and parameters $w_{j0}^{(1)}$ biases. Each of the quantities $a_j$ is then transformed using a nonlinear activation function $h(\cdot)$, which gives

$$z_j = h(a_j). \tag{2.2}$$

Figure 2.1 An example of a neural network. There are 3 neurons in the input layer, 4 in the hidden layer, and 2 in the output layer.

The quantities $z_j$ are the outputs of the hidden units of the neural network. Usually the nonlinear functions $h(\cdot)$ are chosen to be sigmoidal functions, for example the logistic sigmoid. The quantities $z_j$ are again linearly combined to give output unit activations

$$a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)} \tag{2.3}$$

where $k = 1, \dots, K$, and $K$ is the total number of outputs. Last, after putting the output unit activations through an activation function, we get a set of network outputs $y_k$. Choosing the activation function relies on the type of the data and other properties of the target variables. Commonly used activation functions in neural networks are presented in table 2.1. [5]

Table 2.1 Activation functions commonly used in neural networks.

Name                           f(x)
Rectified linear unit (ReLU)   $f(x) = 0$ for $x < 0$, $f(x) = x$ for $x \geq 0$
Logistic sigmoid               $f(x) = \frac{1}{1 + e^{-x}}$
Hyperbolic tangent (TanH)      $f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
Softmax                        $f_i(\vec{x}) = \frac{e^{x_i}}{\sum_{j=1}^{J} e^{x_j}}$ for $i = 1, \dots, J$

When choosing an activation function for a multilayer perceptron, the function chosen should be nonlinear. If the activation function was linear then the whole network would remain a linear function of its input. It would also mean that there is no benefit of adding more layers to the network because any number of linear layers could be replaced by a single linear layer. The default recommendation for modern neural networks is to use the rectified linear unit (ReLU) which is nonlinear. [20]

ReLU performs well in many problems in spite of its simplicity. When combined with the dropout regularization technique, covered in section 2.4, ReLU performs particularly well. ReLU is most suitable for networks with multiple layers, and it can help prevent the vanishing gradient problem. This is because ReLU does not saturate like the sigmoid or tanh activation functions. [19]

ReLU is commonly used for the hidden layers of the network; however, the choice of the output layer activation function depends on the problem at hand. For binary classification it is common to use the logistic sigmoid and for multiclass classification softmax [5]. For example, when classifying product reviews with ratings from 1 to 5, using softmax in the output layer ensures that the probabilities of the ratings always add up to 1.
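The activation functions of table 2.1 can be written out in a few lines of code. The following is a minimal sketch using only the Python standard library; the five example scores are made up for illustration and stand for hypothetical output unit activations, one per rating from 1 to 5.

```python
import math

def relu(x):
    """Rectified linear unit: 0 for negative inputs, x otherwise."""
    return x if x >= 0 else 0.0

def sigmoid(x):
    """Logistic sigmoid, maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent, maps any real number into the interval (-1, 1)."""
    return math.tanh(x)

def softmax(xs):
    """Softmax over a list of scores.

    Subtracting the maximum before exponentiating is a common trick for
    numerical stability and does not change the result.
    """
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical output activations, one per rating 1..5 (illustrative values).
scores = [0.2, -1.0, 0.5, 2.0, 1.1]
probabilities = softmax(scores)
```

Summing `probabilities` gives 1 up to floating-point rounding, which is the property that makes softmax suitable for the rating example above.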

For the neural network to be able to learn anything, the network needs a way to adjust its weights. First the network accepts an input x and produces an output y. The flow of information from the input to the output of the network is called forward propagation. However, forward propagation alone is not enough to train a neural network since there is no way to get information backward to adjust the weights based on the performance of the network. [20]
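Forward propagation through the two-layer network of equations 2.1 to 2.3 can be sketched as follows. This is an illustration only: the layer sizes follow figure 2.1 (3 inputs, 4 hidden units, 2 outputs), while the random weights, the zero biases, the logistic sigmoid in the hidden layer and the softmax in the output layer are assumptions chosen for the example.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def softmax(activations):
    m = max(activations)
    exps = [math.exp(a - m) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

def layer_activations(weights, biases, inputs):
    """Linear combinations a_j = sum_i w_ji * x_i + w_j0 (equations 2.1 and 2.3)."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(x, w1, b1, w2, b2):
    a_hidden = layer_activations(w1, b1, x)   # equation 2.1
    z = [sigmoid(a) for a in a_hidden]        # equation 2.2
    a_out = layer_activations(w2, b2, z)      # equation 2.3
    return softmax(a_out)                     # network outputs y_k

rng = random.Random(0)  # fixed seed so that the sketch is repeatable
n_in, n_hidden, n_out = 3, 4, 2
w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
w2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
b2 = [0.0] * n_out

y = forward([0.5, -1.2, 3.0], w1, b1, w2, b2)
```

With untrained weights the two outputs carry no meaning yet; the sketch only shows how information flows from the input to the output.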

Back-propagation [40] allows the information to flow backwards through the network so that the weights can be adjusted. Back-propagation is only used to calculate the gradient, while another algorithm, such as stochastic gradient descent, performs learning based on the gradient. [20]
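The division of labour between back-propagation and gradient descent can be illustrated with a single sigmoid neuron. The sketch below is a simplification under stated assumptions: one training example, a squared-error loss, and plain (not stochastic) gradient descent; the starting weights and the learning rate are arbitrary illustrative values.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def loss_and_gradients(w, b, x, target):
    # Forward propagation: input -> activation -> output -> loss.
    a = w * x + b
    y = sigmoid(a)
    loss = 0.5 * (y - target) ** 2
    # Back-propagation: apply the chain rule from the loss back to w and b.
    dloss_dy = y - target
    dy_da = y * (1.0 - y)
    dloss_da = dloss_dy * dy_da
    return loss, dloss_da * x, dloss_da   # loss, dL/dw, dL/db

w, b = 0.1, 0.0            # arbitrary starting parameters
x, target = 2.0, 1.0       # a single made-up training example
learning_rate = 0.5
losses = []
for _ in range(200):
    loss, grad_w, grad_b = loss_and_gradients(w, b, x, target)
    losses.append(loss)
    # Gradient descent: move the parameters against the gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
```

The function `loss_and_gradients` only computes the gradient; the parameter update in the loop is the separate learning step.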

2.2 Convolutional neural networks

Convolutional neural networks (CNNs) [27] are a family of neural networks for processing data that has a known, grid-like topology. For example, a sentence can be thought of as a 1D grid of words or characters. As the name suggests, the mathematical operation called convolution is used in convolutional neural networks. Simply put, convolutional neural networks are neural networks which use convolution in at least one of their layers instead of general matrix multiplication. [20]

Typically a convolutional network has three steps. First, convolutions are performed in parallel to produce a set of linear activations. Next, each activation is transformed using a nonlinear activation function, for example ReLU. Lastly, a pooling operation is applied to modify the output of the layer. [20]
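The three steps can be sketched for a one-dimensional input such as a sentence. The numbers below are made up: each value stands for one word, and the three-element filter is hand-picked rather than learned. As is common in deep learning, the sliding dot product is computed without flipping the filter.

```python
def conv1d(sequence, kernel):
    """Slide the kernel over the sequence (no padding, stride 1)."""
    k = len(kernel)
    return [sum(kernel[j] * sequence[i + j] for j in range(k))
            for i in range(len(sequence) - k + 1)]

def relu(values):
    return [max(0.0, v) for v in values]

# Hypothetical 1D input: one number per word of a six-word sentence.
sentence = [0.1, -0.4, 0.9, 0.3, -0.2, 0.7]
kernel = [1.0, 0.0, -1.0]   # a filter spanning three adjacent words

linear_activations = conv1d(sentence, kernel)   # step 1: convolution
feature_map = relu(linear_activations)          # step 2: nonlinear activation
pooled = max(feature_map)                       # step 3: max pooling over the map
```

A six-word input and a filter of width three give four linear activations, because the filter fits into the sentence at four positions.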

The first two steps prior to the pooling operation are presented in figure 2.2. In the example, only one filter is used to simplify the structure of the CNN to make it easier to understand. In reality, there can be multiple filters with different sizes.

Since convolutional neural networks can be used in sentence classification tasks [49], they can also be utilized in sentiment analysis.

Figure 2.2 Illustration of a simple CNN architecture utilizing one filter for sentence classification [49].

After the first two steps, a pooling operation is run on the produced feature maps. Max pooling [50] is a pooling operation that takes the maximum output within a rectangular area. There are also other popular pooling functions such as the $L^2$ norm of a rectangular neighborhood, or a weighted average based on the distance from the central pixel [20]. Max pooling is demonstrated in figure 2.3 using 2x2 filters and a stride of 2.

Max pooling can be especially useful when we care more about whether some feature is present than about its exact location [20]. Thus, max pooling could also be helpful in textual sentiment analysis since words do not always appear in the same position in a sentence.

The size of the filter determines how large the sliding window is, and the stride determines how far the window is moved in each step. For example, a max pooling operation with 2x2 filters and a stride of 1 applied to a 10x10 input would result in an output of size 9x9. Increasing the stride decreases the size of the output.

Figure 2.3 Max pooling applied with 2x2 filters and a stride of 2. The 2x2 window moves left-to-right and top-to-bottom; the length of each move is determined by the stride.
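The effect of the window size and the stride on the output size can be checked with a small sketch. The 4x4 grid below is an invented example and is not taken from figure 2.3.

```python
def max_pool_2d(grid, size=2, stride=2):
    """Max pooling over a 2D grid given as a list of equally long rows."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for top in range(0, rows - size + 1, stride):
        out_row = []
        for left in range(0, cols - size + 1, stride):
            window = [grid[top + r][left + c]
                      for r in range(size) for c in range(size)]
            out_row.append(max(window))
        out.append(out_row)
    return out

grid = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 0, 8],
]
pooled_stride_2 = max_pool_2d(grid, size=2, stride=2)   # 2x2 output
pooled_stride_1 = max_pool_2d(grid, size=2, stride=1)   # 3x3 output
```

For a square input with side length n, a window of side f and a stride s, the output side length is floor((n - f) / s) + 1, so a larger stride gives a smaller output.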

2.3 Recurrent neural networks

Recurrent neural networks (RNNs) [40] are a family of neural networks for processing sequential data. RNNs are specialized in processing a sequence of values $x^{(1)}, \dots, x^{(T)}$ [20], which could be words in a sentence, for example.

As seen in figure 2.4, recurrent neural networks can be viewed as multiple copies of the same network, each passing a message to a successor. This allows information to persist between inputs, which is especially beneficial for sentiment analysis because earlier content can change the meaning of later content. For example, the sentences ”I am happy” and ”I am not happy” have opposite meanings but both contain the word ”happy”.
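How a state carried from one word to the next can change the reading of a later word is shown in the toy sketch below. Everything in it is an assumption made for illustration: the one-number word values, the two weights and the tanh activation are hand-picked, not trained, and a real recurrent network would learn vectors and weight matrices instead.

```python
import math

# Hypothetical one-number "embeddings"; a real model would learn vectors.
EMBEDDING = {"i": 0.0, "am": 0.0, "not": 1.0, "happy": 1.0}

# Hand-picked weights for illustration only, not trained values.
W_INPUT, W_STATE = 1.0, -3.0

def rnn_final_state(sentence):
    """Feed the words one by one through the same cell and return the last state."""
    h = 0.0  # initial hidden state
    for word in sentence.lower().split():
        x_t = EMBEDDING[word]
        # The previous state h is the message passed on to the next step.
        h = math.tanh(W_INPUT * x_t + W_STATE * h)
    return h

state_happy = rnn_final_state("I am happy")
state_not_happy = rnn_final_state("I am not happy")
```

With these particular weights the final state is positive for the first sentence and negative for the second, even though the last word is the same, because the state left behind by ”not” enters the computation for ”happy”.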

Recurrent neural networks offer one important benefit: they can use contextual information when mapping between input and output sequences [20]. However, the range of context for standard RNN architectures is limited because of the vanishing gradient problem [23].

Figure 2.4 An example of a RNN neural network. Each network is passing a message to a successor. [35]¹

The vanishing gradient problem is a common problem in recurrent networks which use back-propagation [4]. Learning algorithms usually try to lower the cost value of the network by changing weights until the lowest possible value is obtained. The gradient tells the rate at which the cost changes with respect to the weights of the network. Since in back-propagation the gradients of the earlier layers (the left side of the network) are computed from the gradients of the later layers (the right side), the gradient can become vanishingly small if the previous gradients are less than one. Alternatively, if the previous gradients are larger than one, the gradient can grow increasingly large, which is called the exploding gradient problem. The vanishing gradient problem results in very long training times as well as inaccurate results as the number of layers in a network grows. [20]
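The arithmetic behind vanishing and exploding gradients can be seen by multiplying the same factor repeatedly. The sketch below is a deliberate simplification: it assumes that every layer contributes the same local gradient, which is not the case in a real network, and the values 0.5, 1.5 and 50 are chosen only for illustration.

```python
def backpropagated_gradient(local_gradient, num_layers):
    """Multiply the same local gradient once per layer, as the chain rule would."""
    gradient = 1.0
    for _ in range(num_layers):
        gradient *= local_gradient
    return gradient

vanishing = backpropagated_gradient(0.5, 50)   # factors below one shrink the gradient
exploding = backpropagated_gradient(1.5, 50)   # factors above one blow it up
```

After 50 layers the first value is smaller than 1e-12 and the second is larger than 1e6, which is why weight updates in the earliest layers become either negligible or unstable.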

One solution to the vanishing gradient problem is using a Long Short-term Memory (LSTM) [24] [17] network. LSTMs have one significant difference to standard RNNs: they have a more complex structure in the repeating module. Instead of a single neural network layer in the repeating module as presented in figure 2.5, LSTMs have four neural network layers. The LSTM module is presented in figure 2.6 and the notation used in the figure is explained in figure 2.7.

As seen in figure 2.6, LSTMs have a cell state which flows through the network. The LSTM can remove information from or add information to the cell state, which is regulated by gates. Gates are made of a sigmoidal neural network layer and a pointwise multiplication operation. [20]
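One step of an LSTM module can be sketched with scalars in place of vectors. The gate equations below follow the commonly used LSTM formulation (forget gate, input gate, candidate values, output gate), which is assumed here since the text does not spell the equations out; the parameter values are arbitrary and the names in `params` are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps a layer name to (w_x, w_h, bias)."""
    def layer(name, activation):
        w_x, w_h, bias = p[name]
        return activation(w_x * x_t + w_h * h_prev + bias)

    forget_gate = layer("forget", sigmoid)      # how much old cell state to keep
    input_gate = layer("input", sigmoid)        # how much new information to add
    candidate = layer("candidate", math.tanh)   # the new information itself
    output_gate = layer("output", sigmoid)      # how much of the state to expose

    c_t = forget_gate * c_prev + input_gate * candidate   # updated cell state
    h_t = output_gate * math.tanh(c_t)                    # output of the module
    return h_t, c_t

# Arbitrary illustrative parameters, identical for all four layers.
params = {name: (0.5, 0.5, 0.0)
          for name in ("forget", "input", "candidate", "output")}
h, c = 0.0, 0.0
for x_t in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x_t, h, c, params)
```

The four calls to `layer` correspond to the four neural network layers of the module, and the two multiplications in the cell state update are the pointwise operations through which the gates regulate the cell state.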

LSTMs have been successfully used in many natural language tasks, for example in speech recognition [22], handwriting generation [21], and in Google’s smart assistant Allo [10]. Overall, it seems that of all the RNN architectures, LSTM is currently the most used.

¹ Permission to use the image granted by the author.

Figure 2.5 A standard RNN with only a single layer. [35]¹

Figure 2.6 A LSTM network module contains 4 layers. [35]¹

2.4 Regularization

One of the core problems in machine learning is creating an algorithm that not only works well on training data but also generalizes to new unseen data. The act of decreasing test error at the expense of possibly increasing training error is called regularization. [20]

One of the methods of regularization is a technique named dropout [43]. The term ”dropout” refers to the act of randomly dropping out units in a neural network, that is, temporarily removing units from the network along with all their outgoing and incoming connections.

Often the main motivation for utilizing dropout in a deep neural network is that it helps prevent overfitting. An overfitted model is a model that contains more unknown parameters than can be justified by the data [14]. If the model is overfitted, it will no longer give good results on new unseen data.

Figure 2.7 LSTM network notation explained. [35]¹

The dropout neural network model can be presented in a mathematical form. Consider a network with $M$ hidden layers, where $m \in \{1, \dots, M\}$ indexes the hidden layers of the network, $z^{(m)}$ denotes the vector of inputs into layer $m$, and $y^{(m)}$ denotes the vector of outputs from layer $m$. $W^{(m)}$ and $b^{(m)}$ are the weights and biases of layer $m$. In this case, the feed-forward operation of a standard neural network can be described as

$$z_i^{(m+1)} = w_i^{(m+1)} y^{(m)} + b_i^{(m+1)},$$
$$y_i^{(m+1)} = f\left(z_i^{(m+1)}\right),$$

where $f$ can be any activation function and $i$ any hidden unit. With dropout, this operation becomes

$$r_j^{(m)} \sim \mathrm{Bernoulli}(p),$$
$$s^{(m)} = r^{(m)} * y^{(m)},$$
$$z_i^{(m+1)} = w_i^{(m+1)} s^{(m)} + b_i^{(m+1)},$$
$$y_i^{(m+1)} = f\left(z_i^{(m+1)}\right),$$

where $r^{(m)}$ is a vector of independent Bernoulli random variables, each equal to 1 with probability $p$, and $s^{(m)}$ the thinned outputs created by multiplying $r^{(m)}$ element-wise with the outputs of that layer, $y^{(m)}$ [43]. As the equations show, dropout temporarily removes random units from the network.
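The thinning step of dropout can be sketched directly from the equations above. The layer outputs and the value p = 0.5 are illustrative assumptions, and a seeded random generator is used only to make the sketch repeatable.

```python
import random

def dropout(outputs, p, rng):
    """Return the thinned outputs s = r * y and the mask r, with r_j ~ Bernoulli(p).

    Each unit is kept (r_j = 1) with probability p and dropped otherwise.
    """
    mask = [1 if rng.random() < p else 0 for _ in outputs]
    thinned = [r * y for r, y in zip(mask, outputs)]
    return thinned, mask

rng = random.Random(42)                 # fixed seed for repeatability
y = [0.8, 0.1, 0.5, 0.9, 0.3, 0.7]      # made-up outputs of one layer
s, r = dropout(y, p=0.5, rng=rng)
```

Every element of `s` is either the original output or zero, so the next layer sees the dropped units as if they were absent from the network.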

Another common form of regularization is called early stopping. In early stopping, the validation loss is checked for improvement after every iteration. If the validation loss has improved, meaning it has decreased from the previous best validation loss, a checkpoint of the model is saved. Training is stopped once the validation loss has not improved over some pre-specified number of iterations. Early stopping can also be combined with other regularization methods. [20]
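The stopping rule can be sketched as a loop over validation losses. In the sketch below the losses are a made-up list standing in for values that a real training loop would compute after each iteration, and the name `patience` for the pre-specified number of iterations is an assumption borrowed from common usage.

```python
def train_with_early_stopping(validation_losses, patience):
    """Return (best_iteration, best_loss, stopped_at) for per-iteration losses.

    stopped_at is None if the loop runs through all iterations without stopping.
    """
    best_loss = float("inf")
    best_iteration = None
    stopped_at = None
    for iteration, loss in enumerate(validation_losses):
        if loss < best_loss:
            # Improvement: this is where a checkpoint of the model would be saved.
            best_loss, best_iteration = loss, iteration
        elif iteration - best_iteration >= patience:
            # No improvement for `patience` iterations: stop training.
            stopped_at = iteration
            break
    return best_iteration, best_loss, stopped_at

# Made-up validation losses: improving at first, then getting worse.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.56, 0.60]
result = train_with_early_stopping(losses, patience=3)
```

Here the best loss is reached at iteration 3 and training stops at iteration 6, after three iterations without improvement; the model saved at iteration 3 is the one that would be used.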

This chapter covered the basics of neural networks, including neurons, activation functions, forward propagation and back-propagation. In addition to the multilayer perceptron, basics of convolutional and recurrent neural networks were also presented. Last, two regularization techniques, namely dropout and early stopping, were covered. Combining all of the above gives the foundation to design functional neural networks. In the next chapter, natural language processing and how it relates to machine learning is presented.


3. NATURAL LANGUAGE PROCESSING

Natural language processing (NLP) is an interdisciplinary field which focuses on the interactions between computers and human languages. NLP is a subdiscipline of artificial intelligence and computational linguistics. [7]

The study of natural language processing started in the 1950s, in the first days of functional computers; however, little progress was achieved in the first decades. Only recently has significant progress been made, and machines have become better at understanding language like humans, even though they are still far from human level. [34]

The arrival of the latest machine learning techniques, especially deep learning, is also changing the field of NLP. For over a decade, linear models such as support vector machines or logistic regression were among the most used NLP techniques. However, recently neural networks have shown great performance in fields such as image recognition and speech processing. [19]

With the help of natural language processing, text can be transformed into a suitable format for machine learning algorithms. First, the meaning of language morphology and how it affects sentiment analysis is covered in section 3.1. Second, lemmatization, a step in pre-processing, is covered in section 3.2. Last, a deeper look into sentiment analysis and its current state is taken in section 3.3.

3.1 Morphology

Language morphology deals with words, their internal structure, and how they are formed. Words are made of components called morphemes, the smallest elements with identifiable meanings. For example, words ”walked” and ”walker” both have the same first morpheme ”walk”, second morphemes being ”ed” and ”er” respectively. [34]

The complexity of morphology differs by language. For example, English has a quite simple morphological structure while Japanese has a very rich morphological structure [47]. The Finnish language also has a very rich morphological structure.

The rich morphological structure of Finnish can be seen well in the one-word question ”Autoissammekinko?” which means ”Do you mean in our cars, too?”. The word can be broken apart in the following way: auto / i / ssa / mme / kin / ko. This form is further explained in table 3.1.

Table 3.1 An example of the rich morphological structure of Finnish language using one-word question ”Autoissammekinko?”.

Morpheme   Explanation
auto       the root word meaning ”car”
i          a plural marker
ssa        an inessive case ending, meaning ”in” in this example
mme        a possessive suffix meaning ”our”
kin        a particle meaning ”also”
ko         a particle which indicates a question

Rich morphological structure can make sentiment analysis more difficult because the algorithm should somehow know that words which look different actually have the same or a similar meaning. However, this problem can be at least partially solved by using lemmatization, which is presented in section 3.2.

Another way to improve sentiment analysis in morphologically rich languages is to extract language-specific features [25]. Language-specific features can be for example a list of positive and negative words which can be used to classify single words as negative or positive.

3.2 Lemmatization

Lemmatization is the process of grouping different grammatical forms of the same word together [30]. The context of the word is important in analyzing the lemma of a word. In table 3.2 the importance of the surrounding words to a word’s lemma can be seen. For example, word ”meeting” can either be a noun or a verb depending on the context.

Lemmatization can be used to simplify content written in languages with rich morphological structure. For example, the Finnish word ”autossani” meaning ”in my car” becomes ”auto” meaning ”car” after lemmatization. Because of this change, the sentiment analysis algorithm can link the words ”auto” and ”autossani” to the same thing, ”auto”. In order to implement a lemmatization algorithm, the morphological structure of the word must be analyzed. This makes it nearly impossible to implement lemmatization without knowing the morphological structure of the language as well.

Table 3.2 An example of lemmatization.

Sentence                    Word to lemmatize   Lemma
Is the meeting tomorrow?    meeting             meeting (noun)
Are we meeting tomorrow?    meeting             to meet (verb)
Is it bad?                  bad                 bad (adjective)
Was it better?              better              good (adjective)
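The dependence of the lemma on context can be illustrated with a toy lookup. The sketch below is not a real lemmatizer: the table is hand-written from the examples in this section, the part of speech is supplied by the caller instead of being inferred from the sentence, and the verb lemma is stored as ”meet” without the infinitive marker.

```python
# Hypothetical, hand-written lookup table keyed by (word form, part of speech).
LEMMAS = {
    ("meeting", "noun"): "meeting",
    ("meeting", "verb"): "meet",
    ("bad", "adjective"): "bad",
    ("better", "adjective"): "good",
    ("autossani", "noun"): "auto",
}

def lemmatize(word, part_of_speech):
    """Return the lemma if it is in the table, otherwise the lower-cased word."""
    key = (word.lower(), part_of_speech)
    return LEMMAS.get(key, word.lower())

noun_lemma = lemmatize("meeting", "noun")
verb_lemma = lemmatize("meeting", "verb")
```

A practical lemmatizer replaces the lookup table with a morphological analysis of the language, which is exactly the part that requires language-specific knowledge.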

3.3 Sentiment analysis

Sentiment analysis is the process of interpreting emotions and other subjective information. Sentiment analysis can be applied for example in voice and text analysis. One of the basic forms of sentiment analysis is called polarity classification. In polarity classification the data is classified as one of two opposing sentiments, e.g. ”like” or ”dislike” [9].

Combining machine learning with sentiment analysis is one way to classify content automatically. For example, machine learning algorithms have been widely used for analyzing the sentiment of tweets. Commonly used algorithms in past research include support vector machines and naive Bayes classifiers (NB), among others. These algorithms usually require features to be extracted, which can be done for example using n-grams. [18]

In addition to traditional machine learning algorithms like SVM and NB, deep learning has also been used in sentiment analysis in recent years. There have been several studies, such as [37] and [11], where deep neural networks, both convolutional and recurrent, have been used successfully. Overall, deep learning looks promising for sentiment analysis.

One key difference between deep learning and older machine learning approaches is that deep learning seems to require fewer pre-processed features to achieve good results. While SVM and NB mostly use n-gram generated features, deep learning can do well even without having to rely on n-grams or other pre-processed features. [18]

Sentiment analysis presents many challenges. Many things affect the overall sentiment of the sentence. Table 3.3 presents some of the common challenges in sentiment analysis in English which also apply in Finnish language.

Table 3.3 Challenges of sentiment analysis.

Sentence                                             Challenge
I do not like music.                                 Handling the negation
Sony looks better than LG.                           Finding the target of the sentiment
I hardly like movies.                                Adverbial modifies the sentiment
I love this but would not recommend it to anyone.    Difficult to categorize
This is as cheap as a diamond.                       Sarcasm shifts the sentiment

3.3.1 n-gram

An n-gram [6] is a sequence of n adjacent items. Depending on the application, the items can be for example letters or words.

The number of items in an n-gram determines its name. An n-gram with one item is called a ”unigram”, with two items a ”bigram”, and with three items a ”trigram”. Examples of n-grams with different lengths are presented in table 3.4.

Table 3.4 Examples of n-grams by letter.

Sample sequence   Unigram (n = 1)        Bigram (n = 2)            Trigram (n = 3)
morning           m, o, r, n, i, n, g    mo, or, rn, ni, in, ng    mor, orn, rni, nin, ing
first             f, i, r, s, t          fi, ir, rs, st            fir, irs, rst
movie             m, o, v, i, e          mo, ov, vi, ie            mov, ovi, vie
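The letter n-grams of table 3.4 can be reproduced with a few lines of Python. This is only a minimal sketch; the function name `ngrams` is an illustrative choice and not something defined in this thesis.

```python
def ngrams(items, n):
    """Return all runs of n adjacent items, in order of appearance."""
    return [items[i:i + n] for i in range(len(items) - n + 1)]


# Letter n-grams of a single word.
bigrams = ngrams("morning", 2)
trigrams = ngrams("first", 3)

# The same function works on word sequences when given a list of words.
word_bigrams = ngrams("this is not good".split(), 2)
```

Because slicing works on both strings and lists, the same helper covers letter-level and word-level n-grams.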

One of the upsides of using n-grams is that the model also retains some information about word order. However, this only applies to bigrams and larger n-grams.

n-grams have been applied at Google Research, for example, in statistical machine translation, speech recognition, and spelling correction [16].

n-grams are also often used in sentiment analysis studies [18]. However, with the coming of deep learning, it is possible that n-gram models will become obsolete in sentiment analysis since recurrent neural networks seem to mimic the behavior of n-gram models quite well. Both models retain the information about word order.

3.3.2 Bag-of-words model

The bag-of-words model is a simplified representation of text. It is mainly used to generate features for machine learning algorithms. From the simple representation of a bag-of-words model, it is possible to calculate various metrics, such as term frequency, which tells how many times a word appears in a text. [8]

One of the downsides of the bag-of-words model is that it disregards both word order and grammar. The bag-of-words model can also be thought of as a special case of the n-gram model, with n = 1.

Using the bag-of-words model on a document simply means calculating the frequency of each word in the document. This is demonstrated in figure 3.1. It can be seen that the order of the words is lost in the process of converting a document to a bag-of-words model. Disregarding the word order can result in losing relevant information about the sentiment of the text. For example, ”is good” and ”is not good” have opposite meanings, but this would not be clearly visible in the bag-of-words model.

Figure 3.1 Bag-of-words model representing a text document.
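The loss of word order can be made concrete with a small sketch. It assumes lowercase, whitespace-separated words; the helper `bag_of_words` and the example sentences are illustrative only.

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercase, whitespace-separated word occurs."""
    return Counter(text.lower().split())


positive = bag_of_words("the phone is good")
negative = bag_of_words("the phone is not good")

# Both bags contain "good" once; only the extra "not" tells them apart,
# and nothing records which word it negates.
shared = positive & negative

# Word order is lost: a reordered sentence gives an identical bag.
reordered = bag_of_words("good is the phone")
```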

3.3.3 Word embedding

Machine learning requires the data to be in numerical format. The mapping of words or sentences to vectors of real numbers is called word embedding in natural language processing.

One of the traditional ways of doing word embedding is using n-gram language models. However, one problem with n-gram models is that they have no concept of similarity and therefore cannot take advantage of the fact that similar words occur in similar contexts. [33]

Many of the new techniques of word embedding rely on neural networks instead of n-gram models [33]. One of these new techniques is called Word2vec [32].

Word2vec produces a vector space from a large corpus of text. The produced vector space typically consists of several hundred dimensions and each unique word is assigned a vector in the space. The position of a word in the vector space is based on the surrounding words in the corpus. [32]

The benefit of using Word2vec over n-grams is that Word2vec does not lose information about words sharing common contexts. Moreover, the dimensions of the embedding are much smaller compared to other methods. For example, in one-hot encoding the dimensionality is the same as the size of the vocabulary, which could be several thousand.

This chapter presented topics centering around NLP, specifically sentiment analysis.

Different techniques used in transforming textual data into numerical format were covered, among them lemmatization, bag-of-words and n-gram models, as well as word embedding. In the next chapter, it is shown how sentiment analysis can be done by combining both machine learning and natural language processing techniques.


4. METHODS

We propose a case study to determine if and how the sentiment of a text written in Finnish can be analyzed automatically without human interaction. The study consists of analyzing the sentiment of product reviews written in Finnish.

First, case study as a research method is presented in section 4.1. Second, the software used in this study is explained in section 4.2. Third, the structure and type of the data used for training and classification is presented in section 4.3. Fourth, data pre-processing is described in section 4.4. Last, training and classification are covered in section 4.5.

4.1 Case study

Case study is a research method that investigates a phenomenon in a narrowed down context. Broader generalizations can be made based on a case study. Case study can be based on any mix of quantitative or qualitative evidence. [48]

Case studies are often utilized in social sciences, such as business, sociology, and psychology. However, case studies can also be conducted in software engineering. [48]

The purpose of this case study can be seen as exploratory. An exploratory case study is defined as trying to get an overview of the situation where it is unsure what we should be looking for [38]. With respect to this study, it is uncertain if and how well it is possible to analyze the sentiment of Finnish language automatically without human interaction. Therefore an exploratory case study is a good fit.

Carrying out a case study includes five steps: case study design, preparation for data collection, collecting evidence, analysis of collected data, and reporting [41].

All these steps are covered in the following chapters.

The objective of this study is to answer the question ”Is it feasible to do sentiment analysis for Finnish language automatically?”. Due to the difficulty of finding suitable data from multiple sources, only one source of data is used to answer this question, even though it is recommended to use multiple data sources for case studies [41]. However, in this case it can be argued that even one source of data is enough as long as there is enough data. It is also probably not possible to get statistically meaningful results in this study, but the purpose after all is to explore the feasibility of sentiment analysis for Finnish language, not to determine the exact accuracy of any given method.

4.2 Software

A variety of software is required in all steps of the study. For collecting the data a simple Python script is used.

For natural language processing tasks, such as lemmatization covered in section 3.2, the Omorfi [36] library is used. Omorfi can be utilized for stemming, lemmatization, and shallow morphological analysis, for example. Given a word like ”kokouksen”, meaning ”meeting’s” in English, Omorfi is able to tell that the basic form of this word is ”kokous” and that it is a singular noun in the genitive case. Choosing Omorfi for NLP tasks was easy since there are hardly any alternatives for Finnish language.

The Keras [26] library is used together with the TensorFlow [44] library to train neural networks and to classify data with them. Keras offers a high-level neural networks application programming interface (API) in Python that can run on top of TensorFlow. TensorFlow is an open source library for machine intelligence, or more precisely for numerical computation using data flow graphs.

Keras was mainly selected because the API is easy to understand and use, and because it is written in Python, which seems to be one of the main languages for machine learning. Keras also seems ideal for fast prototyping and testing of ideas due to its simplicity and high level of abstraction [26].

The barrier of entry for using Keras also seems to be quite low: setting up a neural network takes only a few lines of Python code. Accomplishing the same in TensorFlow is much more complicated. All these features make Keras the most practical choice for this study.

4.3 Data

The increase in the amount of available training data has made deep learning more useful [20]. However, finding large amounts of suitable prelabelled data for sentiment analysis in Finnish language proved to be difficult. This is the main reason why the online shop www.verkkokauppa.com was chosen as the source of the data set.

It was possible to gather over 50 000 product reviews using the API of the shop by utilizing a simple script. This saved a lot of time and made it possible to entirely skip any hand-labeling of data.

The shop sells many kinds of products, especially electronic devices and appliances.

The fields in a product review and example values are presented in table 4.1.

Table 4.1 Product review fields and example values.

ReviewText                                           Rating (1-5)   IsRecommended
Tuote toimii tosi hyvin. Voin suositella kaikille.   5              true
Puhelimessa on ollut ongelmia. En suosittele.        2              false

The product review contains a rating from 1 to 5 as well as whether the product is recommended by the user or not. These fields should at least somewhat correlate with the sentiment of the text. Naturally, negative product reviews should seem less positive than praising reviews.

The data set consists of 53 873 product reviews. One of the downsides of the data set is that there are significantly more positive than negative product reviews, which can be seen in figure 4.1. This might make it harder to generalize the classifier to a data set with a more even distribution of ratings. However, in this study we will only measure the generalization to the data from the same source, so an uneven distribution should not be an issue.

Most of the product reviews have a short text of 0 to 50 words. Only a few product reviews have a length of 200 words or more. This can be seen in figure 4.2.

4.4 Pre-processing

The size of the data set is quite large but still limited, which might result in difficulty to get good results by only feeding the network a list of characters as it appears in the text. Therefore, simple pre-processing consisting of lemmatization, fixing the size of the vocabulary, and padding is applied.

First, all review texts are lemmatized as described in section 3.2. This gives a list of words in their basic forms. Lemmatization is beneficial since a Finnish word can have many different forms. For example, the word ”dog” has the forms ”koira”, ”koiran”, ”koiraa”, ”koirassa”, ”koirasta”, ”koiraan”, ”koiralla”, ”koiralta”, ”koiralle”, and so on in Finnish. Without lemmatization, the vocabulary of the review texts would most likely be too large, and the relation between different forms of the same word would not be conveyed to the neural network.

Figure 4.1 The distribution of product reviews by rating and recommended fields in the whole data set.

However, it should be noted that relevant information might be lost by only feeding the network basic forms of the words. The endings of the words can in some cases significantly change the meaning of the sentence. This can be seen by looking at the Finnish sentence ”Tuotteenne ovat huonoja, mutta tuotteemme ovat hyviä”, meaning ”Your products are bad but our products are good”. The lemmatized version of this sentence is ”Tuote olla huono, mutta tuote olla hyvä”, which roughly means ”product is bad but product is good”. Clearly, the meaning of the sentence was lost in lemmatization. Thus, it could also be beneficial to feed the network information about the form of the word.
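The effect can be illustrated with a toy lookup table built from the example sentence. The table `LEMMAS` is a hand-written stand-in that covers only these words; it is an assumption made for illustration and does not reflect how Omorfi works internally.

```python
# Hand-written lemma table for the example sentence only (an assumption for
# illustration; the study itself uses Omorfi for lemmatization).
LEMMAS = {
    "tuotteenne": "tuote",   # "your products"
    "tuotteemme": "tuote",   # "our products"
    "ovat": "olla",
    "huonoja": "huono",
    "hyviä": "hyvä",
    "mutta": "mutta",
}


def lemmatize(sentence):
    """Replace each known word with its basic form; keep unknown words."""
    words = sentence.lower().replace(",", "").split()
    return [LEMMAS.get(word, word) for word in words]


lemmas = lemmatize("Tuotteenne ovat huonoja, mutta tuotteemme ovat hyviä")
# The possessive endings -nne ("your") and -mme ("our") are gone, so both
# subjects collapse into the same token "tuote".
```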

Second, the whole data set is used to form a vocabulary of the most common m words where each word corresponds to an integer. The value m is determined in the training phase by trying out different values. This vocabulary is later used to map words to integers. The purpose of limiting the vocabulary to only the most common words is to try to minimize overfitting the model to the data. The significance of words present only in a few samples should be quite small when classifying the whole data set. In addition, many of these rarely occurring words can be spelling mistakes, so having them in the training data set does not make sense.

Figure 4.2 The distribution of number of words in a product review in the whole data set.

Since the neural network expects numeric data, the words are mapped to integers using the vocabulary. The words that are not in the vocabulary are given a fixed value outside the range of the vocabulary, which means that all words not in the vocabulary look the same to the neural network.
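A minimal sketch of the vocabulary step, assuming each review is already a list of lemmas. The names `build_vocabulary` and `encode`, and the choice of index m + 1 for out-of-vocabulary words, are illustrative assumptions; the text only states that the value lies outside the range of the vocabulary.

```python
from collections import Counter


def build_vocabulary(documents, m):
    """Map the m most common words to integers 1..m (0 is left for padding)."""
    counts = Counter(word for doc in documents for word in doc)
    return {word: i for i, (word, _) in enumerate(counts.most_common(m), start=1)}


def encode(doc, vocabulary, m):
    """Replace words by their integer; unknown words all share index m + 1."""
    return [vocabulary.get(word, m + 1) for word in doc]


docs = [
    ["tuote", "olla", "hyvä"],
    ["tuote", "olla", "huono"],
    ["tuote", "toimia"],
]
vocab = build_vocabulary(docs, m=2)
encoded = encode(["tuote", "olla", "hyvä", "toimia"], vocab, m=2)
```

With m = 2 only ”tuote” and ”olla” receive their own integers, so ”hyvä” and ”toimia” become indistinguishable to the network.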

Next, review texts are pre-processed one by one. The words, now represented as a vector of integers, are transformed to a fixed length n, which is chosen in the training phase by comparing the performance of different lengths. By looking at figure 4.2 it can be seen that most reviews have fewer than 150 words, so a good upper limit for review length might be 150. Review texts that are too short are padded with a fixed value to a length of n, and texts containing more than n words are truncated to n words. The fixed value used in the padding should not be included in the vocabulary. The truncating and padding happen at the beginning of the sentence. This means that truncating vector (1, 2, 3) to a length of 2 becomes (2, 3), and respectively padding vector (2, 3) to a length of 3 becomes (0, 2, 3) if 0 is chosen as the value outside the vocabulary.

Figure 4.3 Pre-processing in brief.

All this results in a vector of n integers which represents the text of a product review.

The whole pre-processing flow for one product review is presented in figure 4.3.
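The padding and truncating rule can be sketched in plain Python as follows. The function name `fix_length` is an illustrative choice. Keras' `pad_sequences` helper pads and truncates at the beginning by default and gives the same results, although the text does not state whether that helper was used in the study.

```python
def fix_length(sequence, n, pad_value=0):
    """Pad or truncate at the beginning so the result has exactly n items."""
    if len(sequence) >= n:
        return sequence[len(sequence) - n:]      # drop items from the front
    return [pad_value] * (n - len(sequence)) + sequence


truncated = fix_length([1, 2, 3], 2)
padded = fix_length([2, 3], 3)
```

Cutting from the front keeps the end of a long review, which is the part closest to the final output of a recurrent layer.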

In principle, the neural network could be given input vectors of different lengths, but this is not possible in this case because some of the Keras layers used in the neural network model require a fixed input vector size.

4.5 Training and classification

Training and classification require the data to be split which is described in section 4.5.1. The structure of the neural network is presented in section 4.5.2.

4.5.1 Training, validation and test sets

The data set is usually split into two or three parts: training, validation and test set. Validation set should be chosen to be large enough, otherwise the estimate of the predictive performance can be relatively noisy. Many iterations of model design can also cause overfitting to the validation set, so it might be good to hold out a third test set until the final evaluation of the algorithm. [5]

The training set is used to train the neural network. The validation set is utilized to fine-tune the neural network structure and parameters. Lastly, the test set is held out from the development phase, and will only be used in the final step to evaluate the accuracy of the selected neural network model. This type of test set which does not contain any training data is also referred to as holdout set [42]. In addition, no changes to the model are made after using the test set because otherwise the test set would be no different from the validation set.

The whole data set is split into three parts which are presented in table 4.2. The data set is split randomly, and all three sets contain the same proportions of ratings.

Table 4.2 The whole data set is split into training, validation and test sets.

Data set     Portion of the whole data set
Training     60%
Validation   20%
Test         20%
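One way to obtain a random 60/20/20 split that keeps the rating proportions is to shuffle and split each rating separately. The sketch below is an assumption about how such a split could be done, not the code used in the study; `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return index lists (train, validation, test) with similar label mixes."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train, validation, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        validation += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, validation, test


# Toy labels: 10 reviews with rating 1 and 40 reviews with rating 5.
ratings = [1] * 10 + [5] * 40
train, validation, test = stratified_split(ratings)
```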

4.5.2 Structure of the neural network

The actual implementation of the neural network is done using Keras. All layer names in this section refer to Keras documentation [26].

The first layer of the neural network is an embedding layer which takes care of turning the integers corresponding to words into dense vectors of fixed size. The input to the embedding layer is the end result of pre-processing presented in figure 4.3.

An alternative solution to using an embedding layer is using one-hot encoding where categorical variables are transformed into a more suitable form for machine learning algorithms. Both of the ways, embedding layer and one-hot encoding, solve the same problem: categorical values do not relate to each other numerically. For example, assign words ”car”, ”person”, ”house” integers 1, 2 and 3. In this case, the average of ”car” and ”house” would be ”person” which obviously does not make sense. One-hot encoding transforms the three words into three features: ”is car”, ”is house” and ”is person” which are all binary. So car could be presented as a vector of (1, 0, 0), house as a vector of (0, 1, 0), and person as a vector of (0, 0, 1). In this case, all the values have the same distance from each other meaning that there is no relation between them, ”car” is as far from ”house” as from ”person”.
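The equal-distance property of one-hot vectors can be checked with a short sketch. The helper `one_hot` is illustrative; the feature order follows the example above.

```python
import math


def one_hot(index, size):
    """Return a vector of zeros with a single 1 at the given position."""
    return [1 if i == index else 0 for i in range(size)]


words = ["car", "house", "person"]
vectors = {word: one_hot(i, len(words)) for i, word in enumerate(words)}

# Every pair of distinct one-hot vectors is the same distance apart.
d_car_house = math.dist(vectors["car"], vectors["house"])
d_car_person = math.dist(vectors["car"], vectors["person"])
```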

The next layers of the network are Conv1D, MaxPooling1D, Dropout, LSTM, and lastly a Dense layer. The network structure is presented in figure 4.4.

Figure 4.4 Visualization of the structure of the neural network. The network consists of 7 layers, each carrying out its own task.

Convolution layer, Conv1D, helps to narrow down the amount of data together with MaxPooling1D layer. This speeds up the training of the network but does not seem to affect the accuracy of the model. This is especially useful when changing the network parameters and wanting to get quick feedback on the effects. The theory behind convolution layer is presented in section 2.2.
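How much the convolution and pooling layers shorten the sequence can be estimated by hand. The sketch below assumes the Keras defaults (no padding and stride 1 for Conv1D, pooling stride equal to the pool size). The kernel size 5 and pool size 4 are made-up values, since they are not stated in this section.

```python
def conv1d_length(length, kernel_size):
    """Output length of a 1D convolution without padding and with stride 1."""
    return length - kernel_size + 1


def max_pool1d_length(length, pool_size):
    """Output length of non-overlapping max pooling without padding."""
    return length // pool_size


review_length = 50                      # an input length of 50 words
after_conv = conv1d_length(review_length, kernel_size=5)
after_pool = max_pool1d_length(after_conv, pool_size=4)
```

Under these assumptions the LSTM layer would process 11 time steps instead of 50, which is where the speed-up comes from.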

Dropout layer helps to prevent overfitting of the model. Dropout layer does this by randomly shutting down some of the input units at each update during training time as described in section 2.4.

The LSTM layer, covered in section 2.3, can use the context of the whole review text when deciding on the rating. This together with the convolution layer enables us to give the network comparatively raw data of the sentence. Without these layers another approach would most likely be required, for example generating features with n-grams.

And lastly, the Dense layer transforms the output to a human readable format with 5 probabilities between 0 and 1, one probability for each rating. For example, the Dense layer could output values (0.05, 0.1, 0.15, 0.2, 0.5) which would mean that rating 1 has a probability of 5% and rating 5 a probability of 50%.
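Outputs of this form are commonly produced with a softmax activation, although the activation is not named here. A minimal sketch, using made-up raw scores, of turning five scores into probabilities and picking the predicted rating:

```python
import math


def softmax(scores):
    """Turn arbitrary real scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]   # avoids overflow in exp
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


raw_scores = [-1.0, -0.3, 0.1, 0.4, 1.3]          # hypothetical layer outputs
probabilities = softmax(raw_scores)
# Ratings run from 1 to 5, so the predicted rating is the argmax plus one.
predicted_rating = probabilities.index(max(probabilities)) + 1
```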


5. RESULTS

In this chapter, the results of the study are presented. First, the metrics used to evaluate the trained models are presented in section 5.1. Next, the process of finding the best hyperparameters is described in section 5.2. Last, the results of the best model for both validation and test sets are presented in section 5.3 and discussed in section 5.4.

5.1 Metrics

Metrics are useful for comparing the performance of neural networks to each other.

Without metrics, it is hard to say that one model is better than the other. In this study, four different metrics are used to evaluate the performance of the classifier.

The first metric is called accuracy $a$, which is the proportion of examples for which the model produces the correct output. Accuracy is given by

$$a = \frac{t_p + t_n}{t_p + t_n + f_p + f_n}$$

where $t_p$ is the number of true positives, $t_n$ the number of true negatives, $f_p$ the number of false positives, and $f_n$ the number of false negatives. [20]

Accuracy is one of the most common metrics used in machine learning. However, accuracy alone does not work well when the classes are unbalanced. For example, if there are 99 positive samples and only 1 negative sample in a binary classification task, then even a model which classifies everything as positive would receive an accuracy of 99%. One solution to this problem is to use other metrics like precision and recall. [20]
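The 99-to-1 example can be verified with a few lines of Python. The variable names are illustrative; the second measure, the share of negative samples that are recognized, shows what accuracy hides.

```python
# 99 positive samples and 1 negative sample, as in the example above.
labels = [1] * 99 + [0]
# A classifier that ignores its input and always answers "positive".
predictions = [1] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Share of the negative samples that were recognized as negative.
negatives_found = sum(p == 0 and y == 0 for p, y in zip(predictions, labels))
negative_recall = negatives_found / labels.count(0)
```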

The second metric, precision $p$ (also called positive predictive value), is given by

$$p = \frac{t_p}{t_p + f_p}$$

where $t_p$ is the number of true positives and $f_p$ the number of false positives. Precision tells the ability of the classifier not to label negative samples as positive. [42]

Another metric, recall $r$ (also called true positive rate or sensitivity), can be calculated in the following way:

$$r = \frac{t_p}{t_p + f_n}$$

where $t_p$ is the number of true positives and $f_n$ the number of false negatives. The recall is the ability of the classifier to find all the positive samples. [42]

The fourth metric, the F1 score (also called F-measure), can be calculated from the precision and recall values. The F1 score $f_1$ is given by

$$f_1 = \frac{2pr}{p + r}$$

where $p$ is the precision and $r$ the recall. The F1 score is the harmonic mean of precision and recall. [42]
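The three formulas translate directly into code. The sketch below uses hypothetical counts for a single class; the function names are illustrative.

```python
def precision(tp, fp):
    """Share of positive predictions that are actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that were found."""
    return tp / (tp + fn)


def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)


# Hypothetical counts for one class: 80 hits, 20 false alarms, 40 misses.
p = precision(tp=80, fp=20)
r = recall(tp=80, fn=40)
f1 = f1_score(p, r)
```

Applied to the rating 1 row of table 5.6 (precision 0.602, recall 0.537), `f1_score` gives 0.568 after rounding, which matches the value reported in that table.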

In table 5.1 it is demonstrated what the terms true and false positives, and true and false negatives mean assuming that a product review is either good or bad. In figure 5.1 a more general example is shown. All the used metrics are summarized in table 5.2.

Table 5.1 An example of true and false negatives and positives in the context of good and bad reviews.

Measure          Explanation
True positive    Good reviews correctly identified as good.
False positive   Bad reviews incorrectly identified as good.
True negative    Bad reviews correctly identified as bad.
False negative   Good reviews incorrectly identified as bad.

Table 5.2 Summary of the used metrics.

Metric      Formula
Accuracy    $a = \frac{t_p + t_n}{t_p + t_n + f_p + f_n}$
Precision   $p = \frac{t_p}{t_p + f_p}$
Recall      $r = \frac{t_p}{t_p + f_n}$
F1 score    $f_1 = \frac{2pr}{p + r}$


Figure 5.1 Visualization of true and false negatives and positives. In the context of this study, the black dots represent good reviews and the white dots bad reviews.

5.2 Hyperparameter optimization

Grid search was used to optimize the hyperparameters of the neural network. In grid search, every combination of the chosen hyperparameter values is tested, which leads to a huge number of iterations if the number of parameter values to optimize is large [20]. The following parameters were optimized: vocabulary size, review length, dropout, embedding vector size, and LSTM output dimensions. This led to 4×4×3×4×3 = 576 iterations.

The optimized hyperparameters are presented in table 5.3. The values of hyperparameters are typically picked on an approximately logarithmic scale, for example 2, 4, and 8 [20].

Table 5.3 Hyperparameters and their values for grid search.

Hyperparameter           Values
Vocabulary size          100, 1000, 2000, 4000
Review length            5, 50, 100, 150
Dropout                  0%, 25%, 50%
Embedding vector size    32, 64, 128, 256
LSTM output dimensions   32, 64, 128
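The grid of table 5.3 can be enumerated with `itertools.product`. The dictionary keys below are illustrative names, not identifiers from the study's code; the values are those of the table.

```python
from itertools import product

# Values from table 5.3.
grid = {
    "vocabulary_size": [100, 1000, 2000, 4000],
    "review_length": [5, 50, 100, 150],
    "dropout": [0.0, 0.25, 0.5],
    "embedding_size": [32, 64, 128, 256],
    "lstm_units": [32, 64, 128],
}

# One dict per combination; each would be used to train one model.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
```

The length of `combinations` is 576, in agreement with the iteration count given above.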

Vocabulary size and review length were covered in section 4.4. Dropout is the percentage of units to drop as described in section 2.4. Embedding vector size tells the dimensionality of the word vector, which was covered in section 3.3.3. Last, LSTM output dimension determines how many dimensions there are in the output space of the LSTM layer.

Each model was trained until the validation loss had not decreased for 5 consecutive epochs. This method is called early stopping. An epoch means a training iteration over the full data set [20]. The checkpoint with the lowest validation loss was then chosen for comparison against other models. The selected models were compared by looking at their validation accuracy.
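The stopping rule can be sketched as follows, assuming that an epoch counts as an improvement only if its validation loss is strictly lower than the best loss so far. The function `early_stopping` and the loss values are made up for illustration; the exact rule used in the experiments may differ in such details.

```python
def early_stopping(validation_losses, patience=5):
    """Return (stop_epoch, best_epoch), both counted from 1.

    Training stops once the best loss has not improved for `patience`
    consecutive epochs, or when the list of losses runs out.
    """
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(validation_losses), best_epoch


# Made-up validation losses, not taken from the experiments.
losses = [0.90, 0.80, 0.70, 0.75, 0.72, 0.71, 0.74, 0.73, 0.60]
stop_epoch, best_epoch = early_stopping(losses)
```

In this toy run training stops after epoch 8 and the checkpoint from epoch 3 is kept, so the later improvement at epoch 9 is never seen. A larger patience trades training time for a lower risk of stopping too early.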

Since there are too many combinations of hyperparameters, only the results of some selected combinations are presented. First, the effect of the vocabulary size on the network is presented in figure 5.2. It can be seen that vocabulary size of 4000 gives the highest accuracies and size 100 the lowest. However, when also considering the loss of the model, it can be seen that size 4000 leads to overfitting the fastest as the loss increases first and the most.

Meanwhile, train and validation values of a vocabulary size of 100 follow each other closely which generally means that the model is not overfitting. However, a vocabulary size of 100 leads to underfitting as the accuracy never reaches over 60% even for the training set. The best value of vocabulary size was 2000 when following the early stopping method described earlier.

The next visualized hyperparameter is embedding vector size. Four different values (32, 64, 128, 256) are tried. The results are presented in figure 5.3. When looking at the accuracies, it can be seen that accuracy is going up as the embedding vector size increases. It can also be seen that using embedding vector size of 128 gives the lowest validation loss while also giving the best validation accuracy around epoch 4.

It seems that the same pattern is visible here as in vocabulary size: as the dimensions of the data get smaller, the accuracy decreases. The exception is that, surprisingly, embedding vector size 256 did not give better accuracies than size 128.

The third visualized hyperparameter is review length. Four different values (5, 50, 100, 150) are tried. The effect of review length on accuracy and loss of the model is presented in figure 5.4.

Again it can be seen that accuracy goes up as there is more data, especially in the training set. The validation accuracies of review lengths of 50, 100, and 150 are surprisingly similar, as are the validation losses. Review length of 5 clearly leads to underfitting as there is not enough data to generalize to the validation set.

Figure 5.2 The effect of vocabulary size on accuracy and loss of the neural network model. The used hyperparameters and values are: review length 50, dropout 50%, embedding vector size 128, and LSTM output dimensions 64.

Figure 5.3 The effect of embedding vector size on accuracy and loss of the neural network model. The used hyperparameters and values are: vocabulary size 2000, review length 50, dropout 0%, and LSTM output dimensions 64.

The last visualized hyperparameter is LSTM output dimensions which is tried with three different values (32, 64, 128). The effect of LSTM output dimensions turned out to be least significant of all the selected hyperparameters. LSTM output dimensions accuracies and losses are presented in figure 5.5.

Overall, it is seen that if the dimension of the input data is too small, then the model tends to underfit since there is not enough data to make predictions. In addition, too high dimensions of input data can cause overfitting and worse results than smaller dimensions.

The best iteration yielded a validation accuracy of 72.7%. The accuracies and losses of the iteration are presented in table 5.4. The best hyperparameter values are presented in table 5.5.

Table 5.4 Accuracies and losses for the hyperparameter combination with the best result. The used hyperparameters and values are: vocabulary size 2000, review length 50, dropout 50%, embedding vector size 128, and LSTM output dimensions 64. The row with the lowest validation loss is used for comparison as it is the best according to the early stopping criterion. The selected row (epoch 16) is marked with an asterisk.

Epoch   Accuracy   Loss    Validation accuracy   Validation loss
1       0.492      1.096   0.535                 0.970
2       0.574      0.929   0.601                 0.890
3       0.625      0.850   0.619                 0.872
4       0.658      0.798   0.633                 0.843
5       0.679      0.756   0.643                 0.832
6       0.696      0.718   0.645                 0.828
7       0.719      0.682   0.668                 0.821
8       0.729      0.653   0.676                 0.820
9       0.745      0.626   0.684                 0.796
10      0.760      0.597   0.696                 0.797
11      0.773      0.572   0.701                 0.800
12      0.788      0.540   0.709                 0.799
13      0.793      0.523   0.711                 0.803
14      0.803      0.499   0.713                 0.835
15      0.812      0.485   0.725                 0.810
16*     0.820      0.462   0.727                 0.793
17      0.829      0.444   0.731                 0.802
18      0.833      0.432   0.736                 0.811
19      0.840      0.417   0.736                 0.840
20      0.848      0.405   0.739                 0.821
21      0.847      0.398   0.740                 0.831
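Picking the checkpoint with the lowest validation loss from table 5.4 can be written in a few lines. The lists below copy the two validation columns of the table; the variable names are illustrative.

```python
# Validation columns of table 5.4, epochs 1 to 21.
val_accuracy = [0.535, 0.601, 0.619, 0.633, 0.643, 0.645, 0.668,
                0.676, 0.684, 0.696, 0.701, 0.709, 0.711, 0.713,
                0.725, 0.727, 0.731, 0.736, 0.736, 0.739, 0.740]
val_loss = [0.970, 0.890, 0.872, 0.843, 0.832, 0.828, 0.821,
            0.820, 0.796, 0.797, 0.800, 0.799, 0.803, 0.835,
            0.810, 0.793, 0.802, 0.811, 0.840, 0.821, 0.831]

# Index of the smallest validation loss; epochs are counted from 1.
best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
best_epoch = best_index + 1
selected_accuracy = val_accuracy[best_index]
```

The selection is made on validation loss, and the validation accuracy of the selected epoch is then the figure used to compare this model against the others.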

Figure 5.4 The effect of review length on accuracy and loss of the neural network model. The used hyperparameters and values are: vocabulary size 2000, embedding vector size 128, dropout 50%, and LSTM output dimensions 64.

Figure 5.5 The effect of LSTM output dimensions on accuracy and loss of the neural network model. The used hyperparameters and values are: vocabulary size 2000, review length 50, embedding vector size 128, and dropout 50%.

Table 5.5 The best combination of hyperparameter values found using grid search.

Hyperparameter           Value
Vocabulary size          2000
Review length            50
Dropout                  50%
Embedding vector size    128
LSTM output dimensions   64

5.3 Validation and test results

The validation results for the neural classifier are presented in table 5.6. The final results for neural classifier obtained using the test set are presented in table 5.7.

These results can then be compared to a dummy classifier, a simple classifier that only uses the frequency of ratings for prediction. The results for dummy classifier are presented in table 5.8.
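How the dummy classifier was implemented is not specified here. The sketch below assumes that predictions are drawn at random in proportion to the rating frequencies of the training data, which is similar to what scikit-learn's `DummyClassifier` does with the ”stratified” strategy. All names and the toy labels are hypothetical.

```python
import random
from collections import Counter


def fit_frequencies(training_ratings):
    """Return the ratings seen in training and how often each one occurred."""
    counts = Counter(training_ratings)
    ratings = sorted(counts)
    return ratings, [counts[r] for r in ratings]


def dummy_predict(ratings, weights, n, seed=0):
    """Draw n predictions at random, proportional to the training counts."""
    return random.Random(seed).choices(ratings, weights=weights, k=n)


# Toy training labels: mostly 4 and 5 star reviews.
training = [1] * 4 + [2] * 3 + [3] * 7 + [4] * 37 + [5] * 49
ratings, weights = fit_frequencies(training)
predictions = dummy_predict(ratings, weights, n=10000)
share_of_fives = predictions.count(5) / len(predictions)
```

Such a classifier never looks at the review text, so any classifier that learns from the text should beat it.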

Table 5.6 Results of neural classifier using the validation set.

Rating   Precision   Recall   F1 score   Support
1        0.602       0.537    0.568      389
2        0.450       0.387    0.416      302
3        0.551       0.465    0.504      794
4        0.718       0.682    0.700      3998
5        0.777       0.837    0.806      5292

Table 5.7 Results of neural classifier using the test set.

Rating   Precision   Recall   F1 score   Support
1        0.417       0.204    0.274      406
2        0.190       0.129    0.154      333
3        0.218       0.190    0.203      777
4        0.490       0.480    0.485      3973
5        0.621       0.679    0.649      5286

Table 5.8 Results of the dummy (simple) classifier using the validation set.

Rating   Precision   Recall   F1 score   Support
1        0.031       0.031    0.031      389
2        0.034       0.040    0.036      302
3        0.076       0.081    0.078      794
4        0.368       0.357    0.362      3998
5        0.491       0.492    0.491      5292

It can be seen that the performance of the neural classifier is much better than the performance of the dummy classifier. For example, comparing the F1 scores of both classifiers shows that the dummy classifier has severe difficulty predicting ratings 1-3, with the score being less than 8% for all cases, while the neural classifier scores between 15% and 27% on the test set.

It can also be noticed that the precision of dummy classifier correlates with the frequency of ratings in the training data, which is of course expected because dummy classifier uses the frequency of ratings in the training data to make predictions.

However, this pattern is not really seen in the neural classifier. While the F1 score of the neural classifier for ratings 4-5 is higher than for ratings 1-3, the difference is not as significant as for the dummy classifier. Surprisingly, the F1 score of the neural classifier is even relatively high for rating 1, reaching about 27%, while the dummy classifier only receives an F1 score of 3% for rating 1.

The precision and recall scores for the neural classifier using validation and test sets, as well as for the dummy classifier using the validation set, are presented in figure 5.6. The difference in precision and recall between the neural classifier and the dummy classifier is clearly visible in ratings 1-3, while for ratings 4-5 the difference is not as significant.

Finally, a visualization of F1 scores for both neural classifier and dummy classifier is presented in figure 5.7. This visualization shows the overall performance of the classifiers well. The same pattern can be seen again, especially for ratings 1-3 the performance of the neural classifier is much better.

Comparing the performance of neural classifier using both the validation and the test set, it seems that the classifier overfitted to the validation data. The overall accuracy of neural classifier using the validation set is 72.7% while for the test set only 53.6%. However, it should also be noted that the neural classifier performed a lot better on the test set than the dummy classifier on the validation set.

5.4 Discussion

The results show it is feasible to do sentiment analysis for Finnish language automatically. Especially when comparing the accuracy and recall of neural and dummy classifier, it can be seen that the neural classifier is learning something from the data to predict ratings.

However, the neural classifier clearly overfitted to the validation data despite the attempts at regularization. Also, since there is only a single source of data, it is not possible to verify if the classifier could also predict content from a different context, like social media.

Figure 5.6 Visualization of the precision and recall for neural and dummy classifier.

Some hyperparameters affected the performance of the model more than others. For example, while LSTM output dimensions made almost no difference, changing the vocabulary size greatly impacted the performance. Therefore choosing the right hyperparameters to optimize is also important.

Figure 5.7 Visualization of the F1 score for neural and dummy classifier.
