

AESTHETICS-BASED IMAGE CLASSIFICATION

Faculty of Information Technology and Communication Sciences

Bachelor’s thesis

April 2020


ABSTRACT

Tiitus Hiltunen: Aesthetics-based Image Classification
Bachelor's thesis
Tampere University
Information Technology
April 2020

In recent years, aesthetics-based image classification has been an active research topic in computer vision. Images can be divided into those of low aesthetic quality and those of high aesthetic quality. The goal of this research is to develop methods that classify images in a similar way as a human typically would.

The subjectivity of the topic creates many obstacles, such as the difficulty of selecting a dataset and of finding a proper model to use. This thesis considers popular methods used in the discipline, the most common datasets and the attributes that make the problem of aesthetics-based image classification unique. Within the scope of this research, two separate convolutional neural networks based on EfficientNet were compiled and tested, one of which achieved state-of-the-art accuracy on the AVA2 dataset.

Keywords: image aesthetic assessment, image classification, deep learning, convolutional neural networks, transfer learning

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


PREFACE

Selecting the subject for my Bachelor's thesis was not a difficult task. Visual aesthetics have always intrigued me, and combining that with my interest in computer vision and machine learning was something I could not pass up. I would like to thank my supervisor Ymir for his advice and help throughout the whole process, and my friend Hung, with whom I had great conversations about the topic.

Tampere, 30th April 2020

Tiitus Hiltunen


CONTENTS

1 Introduction . . . 1

2 Related Work . . . 2

2.1 Deep Neural Networks . . . 2

2.2 Convolutional Neural Networks . . . 3

2.3 Image Aesthetic Assessment . . . 4

2.4 Features of Aesthetic Images . . . 5

2.5 Transfer Learning . . . 6

2.5.1 Fully Connected Layer . . . 7

2.5.2 Fine-tuning . . . 7

2.6 EfficientNet . . . 7

2.6.1 Initial Structure . . . 8

2.6.2 Compound Scaling . . . 8

3 Datasets . . . 9

3.1 Aesthetic Visual Analysis Dataset . . . 9

3.2 CUHK-PhotoQuality Dataset . . . 10

3.3 Photo.net Dataset . . . 10

3.4 Comparison of Datasets . . . 10

4 Experiments . . . 12

4.1 Structure of Model 1 . . . 12

4.2 Structure of Model 2 . . . 13

4.3 Setting Up the Dataset . . . 14

4.4 Selecting Hyperparameters . . . 15

4.5 Results . . . 15

5 Conclusions . . . 18

Appendix A Tested models and hyperparameters . . . 21

Appendix B Summaries of the base networks . . . 22

Appendix C Code used for Model 1 training . . . 23

Appendix D Code used for Model 2 training . . . 24


LIST OF FIGURES

2.1 3-5-2 neural network architecture (LeNail 2019) . . . 2

2.2 Simple 2x2 Max-Pooling . . . 4

2.3 High and low aesthetic quality images from the AVA dataset (Murray et al. 2012) . . . 5

2.4 Learning process of transfer learning . . . 6

2.5 Architecture of EfficientNet-B0 . . . 8

3.1 Distribution of average ratings in the AVA dataset . . . 9

4.1 Original image and its horizontally flipped copy (Murray et al. 2012) . . . 14

4.2 Model 1 training progress . . . 16

4.3 Model 2 training progress . . . 16


LIST OF TABLES

3.1 Comparison of datasets . . . 11

4.1 Summary of Model 1 . . . 12

4.2 Summary of Model 2 . . . 13

4.3 Hyperparameters of the final models . . . 15

4.4 Accuracies of recent models on the AVA2 dataset . . . 17

A.1 Some of the attempts with different hyperparameters (AVA2) . . . 21

B.1 EfficientNet-B7 structure with input shape (600, 600, 3) . . . 22

B.2 EfficientNet-B3 structure with input shape (500, 500, 3) . . . 22


LIST OF SYMBOLS AND ABBREVIATIONS

L Loss function

W Weight matrix

x Input vector

σ Activation function

a Input activation

b Bias term

exp Exponential function

t Objective output

z Activation output

AVA Aesthetic Visual Analysis dataset

CNN Convolutional Neural Network

CUHK-PQ CUHK-PhotoQuality dataset

DNN Deep Neural Network

FCN Fully Convolutional Network

NN Neural Network


1 INTRODUCTION

Given the idea of aesthetics-based image classification, one would be tempted to assume that it is something a computer cannot do. However, this assumption is far from the truth, since there are attributes in aesthetic images that can be learned by computers.

Aesthetic image classification research sparks many questions: What features make highly aesthetic images aesthetic? How can one gather a good dataset that is not too vague to use for training? What kinds of models perform well in this field and why? This thesis attempts to answer all of these questions and give a good overview of what makes aesthetics-based image classification unique.

Applications of aesthetic image classification focus on automatically selecting aesthetically appealing images for users of different products and services. These can include, for instance, real-time camera applications that capture the best-looking photos, or algorithms that find the most suitable portrait pictures for social media profiles. In addition, advertisement and streaming services are heavily reliant on visual aesthetics, and doing the assessment automatically could establish new possibilities for these lines of business.

Chapter 2 explores the theory behind aesthetic image classification and discusses deep neural networks, features of aesthetic images and transfer learning. Common image aesthetics datasets are analyzed and compared in Chapter 3. Chapter 4 includes the methodology and experiments of two different aesthetic image classifiers, and Chapter 5 brings it all together and concludes the results of the thesis.


2 RELATED WORK

This chapter discusses the theory needed for aesthetic image classification and the definition of the term itself.

2.1 Deep Neural Networks

The structure of a neural network (NN) is heavily inspired by the human brain. There is a mathematical model for this structure, which is discussed in this section.

Neural networks are made of artificial neurons called perceptrons. One perceptron takes multiple inputs $x_1, x_2, \ldots, x_n$ and produces a single binary output $y$. The inputs can have different significance, and for this reason the inputs have weights $w_1, w_2, \ldots, w_n$. Each perceptron has its own bias term $b$, which determines the threshold for outputting 1 instead of 0. This allows some perceptrons to be more sensitive to certain inputs than others (Nielsen 2018, p. 3-4).

Figure 2.1. 3-5-2 neural network architecture (LeNail 2019)

The architecture of a traditional neural network consists of dense layers, which are connected to one another. One layer includes multiple perceptrons. There is one input layer, one output layer and one or more hidden layers in the middle. When there are two or more hidden layers, the network is called a deep neural network (DNN). A visualization of a simple 3-5-2 network is shown in Figure 2.1.

When talking about DNNs, the network is a so-called feedforward neural network. This means that the network does not have any loops in it and the only direction the information can move is forward. For these kinds of networks, each layer has its own forward operation $F(\cdot)$. For a layer with learnable weights $W$, this forward operation is of the form

$$y = F(x) = Wx + b = \sum_i w_i x_i + b_i. \qquad (2.1)$$

This is followed by an activation function, for example the sigmoid

$$z = \sigma(y) = \frac{1}{1 + \exp(-y)} \qquad (2.2)$$

or the rectified linear unit (ReLU), $z = \max(0, y)$.

DNNs also have a loss function, which gives feedback about the current performance of the network.

This loss function is typically a squared error function of the form

$$L = \frac{1}{2}\, \lVert z - t \rVert^{2}, \qquad (2.3)$$

where $t$ is the objective output for the input $x$ (Nielsen 2018, p. 11-12, 42). The feedback gathered with the loss function is then used to adjust the weights of the network using backpropagation (Nielsen 2018, p. 49-57).
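As a concrete illustration of equations 2.1 to 2.3, the following minimal NumPy sketch (not part of the thesis code; the layer sizes and target vector are arbitrary choices for the example) computes the forward operation of a single dense layer, the sigmoid activation and the squared error loss.

import numpy as np

def sigmoid(y):
    # Equation 2.2: z = 1 / (1 + exp(-y))
    return 1.0 / (1.0 + np.exp(-y))

# Example layer with 3 inputs and 5 outputs (shapes are illustrative)
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))   # weight matrix
b = rng.normal(size=5)        # bias terms
x = rng.normal(size=3)        # input vector

y = W @ x + b                 # forward operation, equation 2.1
z = sigmoid(y)                # activation output

t = np.array([0.0, 1.0, 0.0, 1.0, 0.0])   # objective output
L = 0.5 * np.sum((z - t) ** 2)            # squared error loss, equation 2.3
print(L)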

2.2 Convolutional Neural Networks

There is a subset of DNNs called convolutional neural networks (CNNs) that are especially well suited for image classification. These networks take advantage of the spatial structure of the input for more accurate and faster training.

In a typical DNN architecture, each neuron takes all of the inputs, manipulates them with the weights and biases and forwards the information to the next layer. With CNNs, one has to think of the input as a matrix instead of a vector. The network is still a feedforward network, but instead of processing the whole input, a neuron takes a small localized region of the input matrix and manipulates that. These regions are called local receptive fields. Since a single neuron does not process all of the inputs, the information moves much faster inside the network compared to typical DNNs. (Nielsen 2018, p. 170-171)

Instead of adjusting weights and biases for all of the neurons individually, all of the weights and biases in a particular CNN layer are the same. This means that for a hidden neuron in row $j$ and column $k$:

$$y = \sigma\left( b + \sum_{t=1}^{m} \sum_{i=1}^{n} w_{t,i}\, a_{t+j,\, i+k} \right) \qquad (2.4)$$

where $m$ and $n$ correspond to the height and width of the localized region and $a_{x,y}$ is the input activation at position $(x, y)$. One can see that for any localized region, there is only one weight matrix of the same size for the whole layer. In more practical terms, this means that each neuron in the layer detects the same feature, only at a different location of the input; the outputs of such a layer form a feature map. In addition to convolutional layers, CNNs typically contain layers called pooling layers. These layers try to simplify the information that is gathered into the feature maps. There is also global pooling, which means that the layer describes the feature map with only a single value. This effectively reduces the number of parameters in the network and makes the network lighter. (Nielsen 2018, p. 174)

Figure 2.2. Simple 2x2 Max-Pooling

The pooling layer can describe the information, for example, by simply taking the maximum activation from the region, which is called max-pooling. Simple max-pooling is demonstrated in Figure 2.2. Another method is L2-pooling, where one calculates the square root of the sum of the squares of the activations from a given feature map. These methods are the most commonly used, but there are other types of descriptions as well. (Nielsen 2018, p. 174-175)
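To make the shared-weight convolution of equation 2.4 and the 2x2 max-pooling of Figure 2.2 concrete, here is a minimal NumPy sketch (not from the thesis; the 4x4 input, the 2x2 weight matrix and the choice of ReLU as the activation are illustrative assumptions).

import numpy as np

def conv2d_shared(a, w, b, sigma=lambda v: np.maximum(0.0, v)):
    # Equation 2.4: one shared weight matrix w and bias b for the whole layer (ReLU activation)
    m, n = w.shape
    H, W = a.shape
    out = np.empty((H - m + 1, W - n + 1))
    for j in range(out.shape[0]):
        for k in range(out.shape[1]):
            out[j, k] = sigma(b + np.sum(w * a[j:j + m, k:k + n]))
    return out

def max_pool2x2(fmap):
    # Simple 2x2 max-pooling as in Figure 2.2
    H, W = fmap.shape
    return fmap[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

a = np.arange(16, dtype=float).reshape(4, 4)   # illustrative 4x4 input activations
w = np.array([[1.0, 0.0], [0.0, -1.0]])        # shared 2x2 weight matrix
feature_map = conv2d_shared(a, w, b=0.1)       # 3x3 feature map
pooled = max_pool2x2(a)                        # 2x2 pooled version of the 4x4 input
print(feature_map, pooled, sep="\n")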

2.3 Image Aesthetic Assessment

When talking about Image Aesthetic Assessment, one should always first define the term. The term is defined by Y. Deng et al. 2017 as follows:

"Image aesthetic assessment aims at computationally distinguishing high-quality photos from low- quality ones based on photographic rules, typically in the form of binary classification or quality scoring."

These approaches are more often referred to in the literature as aesthetic classification and aesthetic regression, respectively (Bianco et al. 2016; Jin et al. 2016; Sheng et al. 2019). These terms can be defined as follows:

• Aesthetic classification

Images are either of low aesthetic quality or of high aesthetic quality. The more correctly classified images, the better.

• Aesthetic regression

Images are rated, e.g. between 1 and 10, and the model is trained to rate the images on the same scale. The closer the ratings are to the ground truth, the better.

Usually, comparisons of different methods are done using the classification approach. That is because there is a lot of variance with aesthetic regression due to the subjectivity of aesthetic assessment, which makes comparing methods difficult. One also has to note that the variance problem would be present between two people as well because of differences in preference. For example, a passionate motorsport fan would probably rate motorbike images higher than someone who is not interested in motorsports. This implies that the variance is not an issue inherent to the models themselves.

Most datasets used for image aesthetic assessment include ratings for the images and not a binary class. However, there are still ways to divide the dataset into two categories in terms of their aesthetic quality, which is discussed more deeply in Chapter 3.

2.4 Features of Aesthetic Images

The group of aesthetic images is diverse: images of a sky full of stars, nature, beautifully designed houses, animals, sports events and so on. This makes it all the harder for a neural network to sort images into high or low aesthetic quality. However, there are still patterns in the images that make them belong to one quality or the other.

What the network tries to find in the images are aesthetic features that are present across the aforementioned categories. Among the global features the network gathers, vivid colors, shallow depth of field and object emphasis all push the evaluation toward high aesthetic quality. In contrast, the absence of these features lowers the aesthetic rating of the image. (Malu et al. 2017)

Figure 2.3. High and low aesthetic quality images from the AVA dataset (Murray et al. 2012)

Figure 2.3 includes six images that are extreme examples of either aesthetic or non-aesthetic images. Most people would find the images on the first row more aesthetically pleasing than the images on the second row. However, all of the images are quite different in terms of their content, and the division between aesthetic and non-aesthetic images is not entirely self-evident, since there might be highly abstract images that have a lot of negative features but still look aesthetically pleasing. Regarding the training of a model, this means that it is important to have a diverse dataset that includes all kinds of images, from Lamborghini photos to modern art pieces.

2.5 Transfer Learning

Neural networks are advancing very rapidly, which means that it is becoming harder to create competitive models from scratch. On the other hand, well-tested deep models seem to yield excellent results and are available online for anyone to use. How could one make use of these models?

Figure 2.4. Learning process of transfer learning

The main idea of transfer learning is to use a known deep learning network that has already been trained on some data. The weights of the network are adjusted for that dataset, but if the dataset is diverse and large enough, the network gains great feature extraction capabilities. This makes training with other datasets more effective, since the network extracts high-quality features from the image and draws more accurate conclusions from them. The learning process is visualized in Figure 2.4.

The most common dataset used for transfer learning is ImageNet, which contains over 14 million images; the subset commonly used for pretraining is divided into 1000 classes, making it enormously diverse compared to other popular datasets (J. Deng et al. 2009).


2.5.1 Fully Connected Layer

One still has to adjust the pretrained model to the new data. In the case of DNNs, this is usually done by taking out the fully connected layer at the end of the network and replacing it with new layers. These layers are selected so that they are able to classify the images from the new dataset. This can be done in multiple different ways.

The more traditional approach is to use one or multiple dense layers with flattening or global pooling. In binary classification, the last layer has a dimension of two and is activated with softmax, although a layer of dimension one with sigmoid activation is also commonly used. This layer is called a prediction layer. The number of trainable parameters can be changed by altering the dimensions of the other dense layers. If the dataset is not very large, it is better to use fewer dense layers so that the network will not remember all the details from the training data and cause the model to overfit (Zhao 2017). On the other hand, if the dataset is large, one might not utilize the whole potential of the network if there are not enough parameters to adjust.

Another and more recent approach is to make the whole network a so-called fully convolutional network (FCN). Instead of using dense layers, one uses convolutional layers to do the classification. This method was first utilized in the field of semantic segmentation (Long et al. 2014), but it has been found to be useful in image classification as well (Apostolidis and Mezaris 2019; Y. Li et al. 2018).

One of the benefits of using this approach is that FCNs can preserve the aspect ratio of the input images, unlike most conventional CNNs. The other benefit is that since there are only convolutional layers and pooling layers in the network, one can take in images that are bigger than the maximum input size of the original network.
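As an illustration of the two head types described above, the following Keras sketch (not from the thesis; the EfficientNet-B0 base, the 224x224 input size and the layer names are arbitrary choices for the example) builds a dense classification head and an alternative fully convolutional head on top of a frozen, ImageNet-pretrained base.

import tensorflow as tf
import efficientnet.tfkeras as efn

# Pretrained base without its original fully connected layer
base = efn.EfficientNetB0(input_shape=(224, 224, 3), weights='imagenet', include_top=False)
base.trainable = False   # conventional transfer learning: freeze the pretrained base

features = base.output   # feature maps produced by the frozen base

# Variant 1: traditional dense head with global pooling and a softmax prediction layer
x = tf.keras.layers.GlobalAveragePooling2D()(features)
dense_out = tf.keras.layers.Dense(2, activation='softmax', name='dense_predictions')(x)
dense_model = tf.keras.Model(base.input, dense_out)

# Variant 2: fully convolutional head, a 1x1 convolution followed by global pooling
y = tf.keras.layers.Conv2D(2, (1, 1), activation='softmax', name='conv_predictions')(features)
fcn_out = tf.keras.layers.GlobalMaxPooling2D()(y)
fcn_model = tf.keras.Model(base.input, fcn_out)

In the experiments of this thesis, the dense head corresponds to Model 1 and the convolutional head to Model 2, described in Chapter 4.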

2.5.2 Fine-tuning

When the classification problem is different from the original problem, but the contents of the datasets overlap to some extent, it is common to use something called fine-tuning (Jin et al. 2016; Kornblith et al. 2018; H. Li et al. 2020). In conventional transfer learning one freezes all the layers except the fully connected layer, but in fine-tuning some of the early layers, or all of them, are set to be trainable.

One can think of fine-tuning as setting up good initial values for the network and then finding an optimum quite close to those initial values. The downside of this approach is that if trained for too long, the model can easily overfit due to the huge number of trainable parameters.
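A minimal sketch of this idea, continuing the hypothetical example above: the frozen base is unfrozen and the whole model is recompiled with a small learning rate, so the pretrained weights only move slightly from their initial values.

import tensorflow as tf

# Fine-tuning (continuing the sketch above): unfreeze the base and recompile with a small learning rate
base.trainable = True
dense_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                    loss='categorical_crossentropy',
                    metrics=['categorical_accuracy'])
# All layers are now updated during fit(), so training should be stopped early to avoid overfitting.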

2.6 EfficientNet

A convolutional neural network called EfficientNet shocked the computer vision world when it came out in early 2019. Not only did it achieve 84.4% top-1 accuracy on the ImageNet classification task, but it did so with many times fewer parameters than the earlier state-of-the-art models (Tan and Le 2019). The network is still a stranger to aesthetic image classification, which is why it was chosen as the base model for the experiments in this thesis. This section explains what makes EfficientNet so effective.

2.6.1 Initial Structure

There are multiple sizes of EfficientNets, named from B0 to B7 in ascending order of network complexity. EfficientNet-B0 is the baseline model, which is then scaled to obtain EfficientNets B1-B7.

Figure 2.5. Architecture of EfficientNet-B0

The architecture of the baseline model B0 is visualized in Figure 2.5. The network starts with one convolutional layer, after which it has 16 mobile inverted bottleneck (MBConv) layers (Sandler et al. 2018; Tan, Chen et al. 2018). These MBConv layers are the foundation of the whole network and make it cost efficient and fast by featuring depthwise separable convolutions (Chollet 2016), inverted residuals and linear bottlenecks (Sandler et al. 2018).

2.6.2 Compound Scaling

EfficientNet uses so-called compound scaling to systematically scale the network's depth $d$, width $w$ and resolution $r$. The scaling is done under the following constraints:

$$d = \alpha^{\phi} \qquad (2.5)$$

$$w = \beta^{\phi} \qquad (2.6)$$

$$r = \gamma^{\phi} \qquad (2.7)$$

$$\text{s.t.}\quad \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2 \qquad (2.8)$$

$$\alpha \geq 1,\; \beta \geq 1,\; \gamma \geq 1. \qquad (2.9)$$

Fixing $\phi = 1$, the best values under the constraints of equations 2.8 and 2.9 were found to be $\alpha = 1.20$, $\beta = 1.10$ and $\gamma = 1.15$; these are the values found for the EfficientNet-B0 baseline. If one now fixes $\alpha$, $\beta$ and $\gamma$ as constants and increases $\phi$, one obtains the networks EfficientNet-B1 to B7 using equations 2.5, 2.6 and 2.7. (Tan and Le 2019)

Making the network wider, deeper or of higher resolution contributes to how well the network is able to extract features from images. Doing this in a systematic way across all dimensions makes the network capable of spotting more and more complex features in images. This makes EfficientNet successful with many different datasets.
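The following small Python sketch (not from the thesis) shows how the scaling factors of equations 2.5-2.7 grow with the compound coefficient phi, using the values alpha = 1.2, beta = 1.1 and gamma = 1.15 reported above; the mapping of integer phi values to the B1-B7 variants is only indicative.

# Compound scaling factors for depth, width and resolution (equations 2.5-2.7)
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scaling(phi):
    depth = ALPHA ** phi        # d = alpha^phi
    width = BETA ** phi         # w = beta^phi
    resolution = GAMMA ** phi   # r = gamma^phi
    return depth, width, resolution

for phi in range(1, 8):   # roughly corresponding to EfficientNet-B1 ... B7
    d, w, r = compound_scaling(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")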


3 DATASETS

Aesthetic image classification is still a narrow field of computer vision, which means that the number of different datasets used for training and testing is quite small. This chapter analyzes three common datasets and their differences.

3.1 Aesthetic Visual Analysis Dataset

Without a doubt, the most commonly used dataset is the Aesthetic Visual Analysis (AVA) dataset (Murray et al. 2012). This dataset includes 255 530 images and, for each image, the distribution of ratings from 1 to 10. All of the data is from dpchallenge.com, a digital photography contest website. The dataset also includes semantic tags for the images, should one wish to analyze different types of aesthetic images.

Figure 3.1. Distribution of average ratings in the AVA dataset

Binning the images by their average rating at 0.25 intervals, one can see from Figure 3.1 that the distribution is quite similar to a normal distribution. This means that most of the images are stacked up in the middle: out of 255 530 images, 197 234 have a rating between 4.0 and 6.0. This is troublesome for training a classification model, where the division between the categories should be clear.

In the recent literature, the division of the images into low and high aesthetic quality categories is done in two different ways:

• AVA1

Images are divided based on their mean rating: images with a mean rating of at least 5 are labeled as being of high aesthetic quality and the rest as being of low aesthetic quality. The dataset is randomly divided into a training set (234 599 images) and a testing set (19 930 images).

• AVA2

Images are ranked by their mean rating. By taking the top 10% and the bottom 10% of images, one divides them into high and low aesthetic quality classes. The selected images are then randomly divided into equally sized training and testing sets.

Recently, training and testing in aesthetic image classification have shifted towards AVA2, since it provides a clearer division between the two classes (Apostolidis and Mezaris 2019; Fu et al. 2019; Jin et al. 2016). A full comparison of these datasets is given in Section 3.4.
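As an illustration of how the AVA2 split described above can be derived from the raw AVA ratings, the following pandas sketch (not from the thesis; the file name and the columns image_id and mean_rating are hypothetical) ranks images by mean rating and keeps only the top and bottom 10%.

import pandas as pd

# Hypothetical table with one row per AVA image and its mean rating
ava = pd.read_csv('ava_mean_ratings.csv')   # columns: image_id, mean_rating
ava = ava.sort_values('mean_rating')

n = len(ava)
low = ava.head(n // 10).assign(label=0)     # bottom 10% -> low aesthetic quality
high = ava.tail(n // 10).assign(label=1)    # top 10% -> high aesthetic quality

ava2 = pd.concat([low, high])
# Equally sized, random training and testing sets
train = ava2.sample(frac=0.5, random_state=42)
test = ava2.drop(train.index)
print(len(train), len(test))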

3.2 CUHK-PhotoQuality Dataset

Another commonly used dataset for aesthetic image classification is the CUHK-PQ dataset (Wei Luo et al. 2011). It includes 17 690 images with binary labels. The images were chosen using hard positives and negatives (very highly rated and very lowly rated images). All the images are either from dpchallenge.com or directly from photographers. The images are grouped into 7 scene categories: "animal", "architecture", "human", "landscape", "night", "plant" and "static". The dataset is divided into equally sized training and testing sets.

The ratio of positive examples to negative examples is 1:3. This ratio is not too skewed in and of itself, but given the small size of the dataset, it means that there are fewer than 5000 positive examples, which is quite limited for aesthetic image classification.

3.3 Photo.net Dataset

Photo.net dataset was one of the earliest aesthetic image classification datasets (Datta et al. 2008).

It contains 20 278 images gathered from photo.net, a website meant for both amateur and professional photographers to share and rate photos. When using a mean rating threshold of 5, the ratio of positive examples to negative examples is around 7:5.

The dataset includes mean ratings for all the images and, for most images, the number of votes per rating, both gathered from the website. The weakness of the dataset is that the number of votes per image can be as low as 10, which could affect the accuracy of a model trained on these images.

3.4 Comparison of Datasets

Selecting a good dataset for aesthetic image classification is a multidimensional problem. One has to take into account the size of the dataset, as well as how balanced the dataset is in terms of positive and negative examples. This section compares the datasets discussed in the earlier sections. All the results have been gathered in Table 3.1.

Table 3.1. Comparison of datasets

Dataset Amount of images Positive to negative ratio Extreme samples

AVA1 255 530 12:5 No

AVA2 51 106 1:1 Yes

CUHK-PQ 17 690 1:3 Yes

Photo.net 20 278 7:5 No

When analyzing the information about the different datasets, one has to acknowledge that AVA2 is the most balanced dataset of the four. It contains a decent number of images, and the numbers of positives and negatives are exactly the same. In addition, all of its samples are extreme examples of either high or low aesthetic quality. This reduces mislabeling and helps the neural network spot the patterns of aesthetic images more easily.

Most publications in the field of aesthetic image classification use AVA2 for training and testing their models (Apostolidis and Mezaris 2019; Fu et al. 2019; Jin et al. 2016). All the other datasets seem to have some sort of weakness: the positive-to-negative ratio in AVA1 is questionable and it does not use extreme positives or negatives, which is the case with Photo.net as well. The positive-to-negative ratio in the CUHK-PQ dataset is not very good given that the dataset is the smallest of the four. When deciding between AVA1 and AVA2 from the viewpoint of aesthetic image classification, it is better to sacrifice the greater number of images in AVA1 for the balance and stability of AVA2.


4 EXPERIMENTS

This chapter discusses the experiments and results of this thesis. Two different models based on EfficientNet are compiled, trained and analyzed. Most of the tested models and hyperparameters of this thesis can be found in Appendix A.

4.1 Structure of Model 1

The model used as the base network in the Model 1 experiment was EfficientNet-B7. It includes roughly 64 million parameters and can take in images of up to 600×600 pixels. The model was pretrained with ImageNet (J. Deng et al. 2009). After the base model, a global average pooling layer was added, which summarizes each feature map with its average value. This pooling layer was then followed by a softmax-activated dense layer with a two-dimensional output to perform the classification.

All the parameters in the base model were set to be non-trainable, which meant that training the whole model would not change the weights of the pretrained EfficientNet-B7 base model. The dense layer was set to be trainable. This division resulted in 5122 trainable parameters and 64 097 680 non-trainable parameters. The summary of the compiled Model 1 is in Table 4.1.

Model: Sequential

Layer (type) Output Shape Number of Parameters
EfficientNet-B7 (Model) (None, 19, 19, 2560) 64 097 680
Global Average Pooling 2D (None, 2560) 0
Predictions (Dense) (None, 2) 5122

Total parameters: 64 102 802
Trainable parameters: 5122
Non-trainable parameters: 64 097 680

Table 4.1. Summary of Model 1

The architecture of the EfficientNet-B7 base network includes dozens of layers, which is why it was not added to Table 4.1 in greater detail. A short summary of the structure of the base network is in Appendix B.

If all the layers had been set to be trainable, there would have been a huge risk of overfitting the model due to the base model having over 64 million parameters. On the other hand, the ImageNet-pretrained EfficientNet-B7 already has great feature extraction capabilities, and changing the weights of the early layers with a much smaller dataset would most likely decrease those capabilities.
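The construction described above corresponds to the following condensed excerpt of the Model 1 code (the full training script is in Appendix C):

import tensorflow as tf
import efficientnet.tfkeras as efn

model = tf.keras.Sequential()
model.add(efn.EfficientNetB7(input_shape=(600, 600, 3), weights='imagenet', include_top=False))
model.add(tf.keras.layers.GlobalAveragePooling2D())
model.add(tf.keras.layers.Dense(2, activation='softmax', name='predictions'))
model.layers[0].trainable = False   # freeze the pretrained base, only the dense layer is trained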

4.2 Structure of Model 2

The second model used an ImageNet-pretrained EfficientNet-B3 as its base network. More complex versions of EfficientNet were not chosen because the goal was to create an FCN architecture. Such a network automatically adds a lot of parameters, which could make training computationally infeasible if larger base networks were used. EfficientNet-B3 has performed excellently on the ImageNet classification task, which makes it a valid base network for this model.

After the base network, there was one convolutional layer with two output filters and a filter size of 1, which made its output of size 16×16×2. A global max pooling layer was added to perform the classification. The summary of the compiled Model 2 is in Table 4.2.

Model: Sequential

Layer (type) Output Shape Number of Parameters
EfficientNet-B3 (Model) (None, 16, 16, 1536) 10 783 528
Predictions (Convolution 2D) (None, 16, 16, 2) 3074
Global Max Pooling 2D (None, 2) 0

Total parameters: 10 786 602
Trainable parameters: 10 699 306
Non-trainable parameters: 87 296

Table 4.2. Summary of Model 2

Given the relatively complicated architecture of EfficientNet-B3, it was not included in Table 4.2. A summary of the base network's structure with the input size used in this experiment can be found in Appendix B.

None of the layers of the model were frozen. The reason there are still non-trainable parameters is that EfficientNets use batch normalization layers, which have both trainable and non-trainable parameters by default. This should not be changed.

One could make the output shape of the convolution layer smaller by using a larger filter size, but that would drastically reduce the trainable parameters in the classification block. In contrast, adding more convolutional layers would make the model more capable of remembering the training images and hence more prone to overfitting.
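The structure described in this section corresponds to the following condensed excerpt of the Model 2 code (the full training script is in Appendix D):

import tensorflow as tf
import efficientnet.tfkeras as efn

model = tf.keras.Sequential()
model.add(efn.EfficientNetB3(input_shape=(500, 500, 3), weights='imagenet', include_top=False))
model.add(tf.keras.layers.Conv2D(2, (1, 1), strides=(1, 1), activation='softmax',
                                 name='predictions', padding='valid'))
model.add(tf.keras.layers.GlobalMaxPooling2D())
# No layers are frozen; only the batch normalization layers keep non-trainable parameters by default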

4.3 Setting Up the Dataset

Both experiments used the AVA2 dataset, which was divided into equally sized training and testing sets, and a tenth of the training images were further set aside as a validation set. For some reason, a couple of images were missing from the dataset, which meant that there were 22 997 images in the training set, 2555 images in the validation set and 25 550 images in the testing set. This should not affect the results in any way.

Model 1 training did not make use of any data augmentation methods to increase the size of the dataset, because there was no fear of overfitting due to the small number of trainable parameters. All the images were resized to 600×600 pixels, since this is the maximum input size of the B7 network.

In the case of Model 2, there was a need to augment the data in some way due to the high chance of overfitting. However, the diversity of the images in the dataset makes aesthetic images problematic to augment: vertical flipping cannot be used, since for some images it removes the aesthetic component, for instance by putting the sky at the bottom of the image instead of at the top. On the other hand, random cropping might crop out the very object that is the biggest factor in determining the aesthetic quality of the image.

Figure 4.1. Original image and its horizontally flipped copy (Murray et al. 2012)

The only data augmentation method that does not suffer from the issues discussed above is horizontal flipping, so it was the only augmentation method used in the Model 2 experiment. A demonstration of a horizontal flip is in Figure 4.1. All of the images were flipped with respect to the y-axis and added to the training set, which effectively doubled the number of images used for training. The validation and testing sets were not augmented. After this procedure, all the images were resized to 500×500 pixels.
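In the actual training script (Appendix D), horizontal flipping is handled by the Keras data generator, which flips training images at random on the fly; a condensed excerpt:

from tensorflow.keras.preprocessing.image import ImageDataGenerator
import efficientnet.tfkeras as efn

# Training generator flips images horizontally; validation and testing generators do not
train_data_generator = ImageDataGenerator(preprocessing_function=efn.preprocess_input,
                                          horizontal_flip=True)
val_test_generator = ImageDataGenerator(preprocessing_function=efn.preprocess_input)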


4.4 Selecting Hyperparameters

Given that the models had differing structures, the hyperparameters chosen to train the networks were also quite different. All the information about the parameters chosen for the final models is gathered in Table 4.3.

Model name Image size Batch size Optimizer Learning rate Epochs

Model 1 600×600 32 Adam 0.001 20

Model 2 500×500 32 SGD 0.01-0.001 40

Table 4.3. Hyperparameters of the final models

Both experiments used a batch size of 32 during training. A smaller batch size would add extra noise to the training, since the weights would be updated more often, and hence make it harder to find a good optimum. On the other hand, there is some evidence that using a significantly larger batch size makes it harder for the model to generalize (Keskar et al. 2016).

Model 1 had only 5122 trainable parameters, which indicated that training for an extensive number of epochs would improve the outcome only marginally or not at all. Adam was chosen as the optimizer since it converges faster than stochastic gradient descent, which is beneficial when training for fewer epochs.

The learning rate 0.01-0.001 in the case of Model 2 means that the learning rate starts at 0.01 and is reduced by a factor of 0.5 whenever the validation loss plateaus. However, it cannot get lower than 0.001. Stochastic gradient descent was selected since the model had plenty of time to converge over the course of 40 epochs.
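This schedule is implemented in the Model 2 training script (Appendix D) with Keras' ReduceLROnPlateau callback; a condensed excerpt:

import tensorflow as tf

# Halve the learning rate when validation loss plateaus, but never go below 0.001
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3,
                                                 min_delta=1e-3, cooldown=0, min_lr=0.001)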

The Model 2 image size of 500×500 would not have been possible without the network being a fully convolutional network, since the maximum input size with EfficientNet-B3 is 300×300. Given that EfficientNets have been optimized in terms of their width, depth and resolution, increasing the image size any further would not be useful. Furthermore, there are many image classification CNNs with small input sizes that have performed well with the dataset, such as ResNet50 (224×224) (Fu et al. 2019) and Inception V3 (299×299) (Jin et al. 2016). This implies that image size does not play a huge role in AVA2 accuracy.

4.5 Results

The models were trained with the hyperparameters chosen in Section 4.4. TensorFlow implementations for the models can be found in Appendices C and D. Losses and accuracies of the models were tracked during training. Model 1 development is in Figure 4.2 and Model 2 development in Figure 4.3.


Figure 4.2. Model 1 training progress

The training process of Model 1 went as expected. Training accuracy followed a logarithmic curve, which is typical for models that do not clearly overfit. The same trend can be seen in the training loss as well. This means that there would be no point in training the model any further, since it would not improve the measurements significantly. Validation accuracy had much larger variance, probably because there were almost ten times fewer samples in the validation set than in the training set. Validation loss followed the training loss quite closely, implying that the model fit well.

Figure 4.3. Model 2 training progress

The beginning of Model 2 training was very promising, as there was no sign of overfitting at all. All the measurements got slightly better each epoch. The development slowed down as the training progressed, as is to be expected. Around the halfway point of the training, the validation loss started to increase a bit while the training loss kept decreasing. This implies that if the training had been continued, the model would have severely overfitted.

The smoother training in comparison to Model 1 is due to the dynamic learning rate: the Model 2 learning rate was reduced by a factor of 0.5 whenever the validation loss was about to plateau. This made the learning rate drop by half at epochs 26, 29 and 32. After epoch 35 the learning rate stayed at 0.001, as it could not be halved any further. This procedure helped Model 2 find a more exact optimum.

After training, both models were evaluated with the same AVA2 dataset. The results and a comparison with other recent models are gathered in Table 4.4. All the results for the other models are from their respective papers.

Model AVA2 Accuracy (%)

Dong et al. 2015 83.52

Wang et al. 2016 84.88

Model 1 85.61

Jin et al. 2016 85.62

Fu et al. 2019 90.01

Apostolidis and Mezaris 2019 91.01

Model 2 91.66

Table 4.4. Accuracies of recent models on the AVA2 dataset

As one can see from Table 4.4, Model 2 outperformed all of the existing methods on the AVA2 dataset. The closest method was that of Apostolidis and Mezaris 2019, which also utilized fully convolutional networks. This shows that convolutional classification layers work very well in the field of aesthetic image classification.

Model 1 clearly did not perform as well as Model 2, but the result is still satisfactory in comparison to other recent models. Looking at the training progress of the model, one can see that there is not much potential to do better. This implies that more fundamental changes to the model would be needed to reach an accuracy in the high 80s.

A well-selected base network made both models reach decent levels of accuracy, but the selection of the classification layer made the real difference. The traditional dense layer of Model 1 could only learn very general features due to its small number of trainable parameters, which resulted in an accuracy of only 85.61%. Model 2, on the other hand, was more flexible with its convolutional layer and made use of the whole network by fine-tuning nearly all of its parameters. This might inspire future research to utilize FCN architectures more in aesthetic image classifier implementations.


5 CONCLUSIONS

Classification of images based on aesthetics is a demanding task that requires knowledge of multiple subfields of artificial intelligence such as deep learning and computer vision. DNNs have attained the best results so far in the field, and of these networks, CNNs offer the flexibility to process and analyze aesthetic images by introducing local receptive fields, shared weights and pooling.

Transfer learning provides a way to use known CNN architectures for aesthetic image classification. The fully connected layer of the pretrained network is replaced with suitable layers, which can be dense layers or convolutional layers. One can either freeze the whole network except the added layers or allow fine-tuning of all the layers. Given that the dataset used for pretraining is large and diverse, the weights of the transferred network provide great feature extraction capabilities. This gives the network a solid foundation for classifying novel images accurately.

An ideal aesthetic classification dataset makes a clear division between high aesthetic quality images and low aesthetic quality images. The AVA1, CUHK-PQ and Photo.net datasets are either too small or too unbalanced to be great datasets for aesthetic image classification. AVA2 includes only extreme examples and exactly the same number of low and high quality images due to how it is compiled. Current aesthetics-based image classification research uses AVA2 as a benchmark dataset exactly for these reasons.

This thesis experimented with two models based on the reputable EfficientNet convolutional network. Model 1 used a rather simple but surprisingly effective fully connected layer with global average pooling and a dense layer, which resulted in an accuracy of 85.61% on the AVA2 dataset. Model 2 used the more recent approach of turning the convolutional neural network into a fully convolutional network with a convolutional classification block. This model achieved a state-of-the-art accuracy of 91.66% on the AVA2 dataset, despite having a smaller input image size and a simpler network architecture than its counterpart. The results imply that, regarding accuracy, the complexity of the base network is not as important as the selection of a good fully connected layer.

The problem of aesthetic image classification does not end with this thesis. There is still a long way to go until models can classify all of the images correctly. However, given the abstractness and subjectivity of the topic, a goal that ambitious might turn out to be next to impossible to reach.

Future research should focus on finding out the true potential of FCNs and great transfer learning networks such as EfficientNet for improved results.


REFERENCES

Apostolidis, K. and Mezaris, V. (2019). Image Aesthetics Assessment Using Fully Convolutional Neural Networks. MultiMedia Modeling. Ed. by I. Kompatsiaris, B. Huet, V. Mezaris, C. Gurrin, W.-H. Cheng and S. Vrochidis. Cham: Springer International Publishing, 361–373. isbn: 978-3-030-05710-7.

Bianco, S., Celona, L., Napoletano, P. and Schettini, R. (2016). Predicting Image Aesthetics with Deep Learning. Advanced Concepts for Intelligent Vision Systems. Ed. by J. Blanc-Talon, C. Distante, W. Philips, D. Popescu and P. Scheunders. Cham: Springer International Publishing, 117–125. isbn: 978-3-319-48680-2.

Chollet, F. (2016). Xception: Deep Learning with Depthwise Separable Convolutions. CoRR abs/1610.02357. arXiv: 1610.02357. url: http://arxiv.org/abs/1610.02357.

Datta, R., Li, J. and Wang, J. (Dec. 2008). Algorithmic inferencing of aesthetics and emotion in natural images: An exposition. 2008 IEEE International Conference on Image Processing, ICIP 2008 - Proceedings, 105–108. isbn: 1424417643. doi: 10.1109/ICIP.2008.4711702.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K. and Fei-Fei, L. (2009). ImageNet: A Large-Scale Hierarchical Image Database. CVPR09.

Deng, Y., Loy, C. C. and Tang, X. (2017). Image Aesthetic Assessment: An Experimental Survey. IEEE Signal Processing Magazine 34.4, 80–106.

Dong, Z., Shen, X., Li, H. and Tian, X. (Jan. 2015). Photo Quality Assessment with DCNN that Understands Image Well. 524–535. isbn: 978-3-319-14441-2. doi: 10.1007/978-3-319-14442-9_57.

Fu, X., Yan, J. and Fan, C. (2019). Image Aesthetics Assessment Using Composite Features from off-the-Shelf Deep Models. CoRR abs/1902.08546. arXiv: 1902.08546. url: http://arxiv.org/abs/1902.08546.

Jin, X., Chi, J., Peng, S., Tian, Y., Ye, C. and Li, X. (2016). Deep image aesthetics classification using inception modules and fine-tuning connected layer. 8th International Conference on Wireless Communications Signal Processing, WCSP 2016, Yangzhou, China, October 13-15, 2016, 1–6. doi: 10.1109/WCSP.2016.7752571. url: http://dx.doi.org/10.1109/WCSP.2016.7752571.

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M. and Tang, P. T. P. (2016). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. CoRR abs/1609.04836. arXiv: 1609.04836. url: http://arxiv.org/abs/1609.04836.

Kornblith, S., Shlens, J. and Le, Q. V. (2018). Do Better ImageNet Models Transfer Better? CoRR abs/1805.08974. arXiv: 1805.08974. url: http://arxiv.org/abs/1805.08974.

LeNail, A. (2019). NN-SVG: Publication-Ready Neural Network Architecture Schematics. Journal of Open Source Software 4.33, 747. doi: 10.21105/joss.00747. url: https://doi.org/10.21105/joss.00747.

Li, Y. et al. (2018). … PolSAR Image Classification. Remote Sensing 10, 1984.

Long, J., Shelhamer, E. and Darrell, T. (2014). Fully Convolutional Networks for Semantic Segmentation. CoRR abs/1411.4038. arXiv: 1411.4038. url: http://arxiv.org/abs/1411.4038.

Malu, G., Bapi, R. S. and Indurkhya, B. (2017). Learning Photography Aesthetics with Deep CNNs. CoRR abs/1707.03981. arXiv: 1707.03981. url: http://arxiv.org/abs/1707.03981.

Murray, N., Marchesotti, L. and Perronnin, F. (2012). AVA: A Large-Scale Database for Aesthetic Visual Analysis.

Nielsen, M. A. (2018). Neural Networks and Deep Learning. url: http://neuralnetworksanddeeplearning.com/ (visited on 12/02/2020).

Sandler, M., Howard, A. G., Zhu, M., Zhmoginov, A. and Chen, L. (2018). Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation. CoRR abs/1801.04381. arXiv: 1801.04381. url: http://arxiv.org/abs/1801.04381.

Sheng, K., Dong, W., Chai, M., Wang, G., Zhou, P., Huang, F., Hu, B.-G., Ji, R. and Ma, C. (2019). Revisiting Image Aesthetic Assessment via Self-Supervised Feature Learning. arXiv: 1911.11419 [cs.CV].

Tan, M., Chen, B., Pang, R., Vasudevan, V. and Le, Q. V. (2018). MnasNet: Platform-Aware Neural Architecture Search for Mobile. CoRR abs/1807.11626. arXiv: 1807.11626. url: http://arxiv.org/abs/1807.11626.

Tan, M. and Le, Q. V. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. CoRR abs/1905.11946. arXiv: 1905.11946. url: http://arxiv.org/abs/1905.11946.

Wang, W., Zhao, M., Wang, L., Jiexiong, H., Cai, C. and Xu, X. (May 2016). A multi-scene deep learning model for image aesthetic evaluation. Signal Processing: Image Communication 47. doi: 10.1016/j.image.2016.05.009.

Wei Luo, Xiaogang Wang and Tang, X. (Nov. 2011). Content-based photo quality assessment. 2011 International Conference on Computer Vision, 2206–2213. doi: 10.1109/ICCV.2011.6126498.

Zhao, W. (2017). Research on the deep learning of the small sample data based on transfer learning. AIP Conference Proceedings 1864.1, 020018. doi: 10.1063/1.4992835. url: https://aip.scitation.org/doi/abs/10.1063/1.4992835.


A TESTED MODELS AND HYPERPARAMETERS

Model name Epochs Learning rate Optimizer Image size Batch size Accuracy (%)

Model 1 20 0.01 Adam 600 32 83.84

Model 1 20 0.01 Adam 600 32 85.61

Model 1 20 0.001 Adam 600 64 84.58

Model 1 30 0.001 Adam 600 8 85.11

Model 1 30 0.001 Adam 600 32 83.59

Model 1 100 0.0001 Adam 600 32 83.92

Model 1 150 0.00001 Adam 600 32 80.96

Model 2 10 0.01-0.001 SGD 300 32 89.41

Model 2 30 0.01-0.001 SGD 400 32 89.51

Model 2 30 0.01-0.001 SGD 450 32 91.46

Model 2 40 0.01-0.001 SGD 500 32 91.66

Model A 30 0.001 Adam 600 32 84.30

Model B 30 0.001 Adam 600 32 84.30

Model C 10 0.01-0.001 SGD 380 32 89.03

Model C 20 0.01-0.001 SGD 450 32 91.00

Model C 30 0.01-0.001 SGD 570 32 90.22

Model C 40 0.01-0.001 SGD 380 8 90.16

Model D 20 0.01-0.001 SGD 456 32 86.07

Table A.1. Some of the attempts with different hyperparameters (AVA2)

Model A had EfficientNet-B7 as its base network. The fully connected layer was replaced with a Flatten layer and two dense layers of dimensions 64 and 2, respectively, the last of which was activated with softmax. Only the dense layers were set to be trainable. Model B was similar to Model A in every way except that the dimensionality of the first dense layer was 128 instead of 64.

Model C had EfficientNet-B4 as its base network. The fully connected layer was replaced with a convolutional layer that had a filter size of 1 and an output size of 2. All layers were set to be trainable, and the same data augmentation as in Model 2 was used. Model D was similar to Model C in every way except that the base network was EfficientNet-B5 instead of EfficientNet-B4.

Even more models were tested, but they did not produce any noteworthy results, which is why they were not added to Table A.1.


B SUMMARIES OF THE BASE NETWORKS

Block (type) Output Shape Number of Layers
Input Layer (None, 600, 600, 3) 1
Conv3x3 (None, 300, 300, 64) 1
MBConv1, k3x3 (None, 300, 300, 32) 4
MBConv6, k3x3 (None, 150, 150, 48) 7
MBConv6, k5x5 (None, 75, 75, 80) 7
MBConv6, k3x3 (None, 38, 38, 160) 10
MBConv6, k5x5 (None, 38, 38, 224) 10
MBConv6, k5x5 (None, 19, 19, 384) 13
MBConv6, k3x3 (None, 19, 19, 640) 4
Conv1x1 (None, 19, 19, 2560) 1

Table B.1. EfficientNet-B7 structure with input shape (600, 600, 3)

Block (type) Output Shape Number of Layers
Input Layer (None, 500, 500, 3) 1
Conv3x3 (None, 250, 250, 40) 1
MBConv1, k3x3 (None, 250, 250, 24) 2
MBConv6, k3x3 (None, 125, 125, 32) 3
MBConv6, k5x5 (None, 63, 63, 48) 3
MBConv6, k3x3 (None, 32, 32, 96) 5
MBConv6, k5x5 (None, 32, 32, 136) 5
MBConv6, k5x5 (None, 16, 16, 232) 6
MBConv6, k3x3 (None, 16, 16, 384) 2
Conv1x1 (None, 16, 16, 1536) 1

Table B.2. EfficientNet-B3 structure with input shape (500, 500, 3)


C CODE USED FOR MODEL 1 TRAINING

# Import all necessary libraries
import tensorflow as tf

tf.keras.backend.clear_session()

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import optimizers
import efficientnet.tfkeras as efn

# Allow utilization of multiple GPUs
mirrored_strategy = tf.distribute.MirroredStrategy()

with mirrored_strategy.scope():

    image_size = 600
    classes = 2
    trainpath = 'efficientnet/images/train'
    valpath = 'efficientnet/images/val'
    testpath = 'efficientnet/images/test'

    # Build the model
    model = tf.keras.Sequential()
    model.add(efn.EfficientNetB7(input_shape=(image_size, image_size, 3), weights='imagenet', include_top=False))
    model.add(tf.keras.layers.GlobalAveragePooling2D())
    model.add(tf.keras.layers.Dense(2, activation='softmax', name='predictions'))

    # Set the base network to be non-trainable
    model.layers[0].trainable = False

    model.summary()

    # Define hyperparameters
    learning_rate = 0.001
    batch_size = 32
    epochs = 20

    # Define the generator for training, validation and testing
    data_generator = ImageDataGenerator(preprocessing_function=efn.preprocess_input)

    train_generator = data_generator.flow_from_directory(trainpath, target_size=(image_size, image_size),
                                                         batch_size=batch_size, class_mode='categorical')

    val_generator = data_generator.flow_from_directory(valpath, target_size=(image_size, image_size),
                                                       batch_size=8 * batch_size, class_mode='categorical')

    test_generator = data_generator.flow_from_directory(testpath, target_size=(image_size, image_size),
                                                        batch_size=8 * batch_size, class_mode='categorical')

    # Compile the model
    model.compile(loss='categorical_crossentropy',
                  optimizer=optimizers.Adam(learning_rate), metrics=['categorical_accuracy'])

    # Fit the model
    model.fit(train_generator, epochs=epochs, verbose=2, validation_data=val_generator, validation_steps=1,
              callbacks=[tf.keras.callbacks.CSVLogger(filename='Model-1-training')])

    # Evaluate the model
    loss, acc = model.evaluate(test_generator, verbose=2)

    # Print results and save weights
    print("Accuracy of Model 1: ", 100 * acc, " %.")
    print("Epochs: ", epochs)
    print("Batch size: ", batch_size)
    print("Image size: ", image_size)
    print("Learning rate: ", learning_rate)
    model.save_weights('Model-1')


D CODE USED FOR MODEL 2 TRAINING

# Import all necessary libraries
import tensorflow as tf

tf.keras.backend.clear_session()

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import optimizers
import efficientnet.tfkeras as efn

# Allow utilization of multiple GPUs
mirrored_strategy = tf.distribute.MirroredStrategy()

with mirrored_strategy.scope():

    image_size = 500
    classes = 2
    trainpath = 'efficientnet/images/train'
    valpath = 'efficientnet/images/val'
    testpath = 'efficientnet/images/test'

    # Build the model
    model = tf.keras.Sequential()
    model.add(efn.EfficientNetB3(input_shape=(image_size, image_size, 3), weights='imagenet', include_top=False))
    model.add(tf.keras.layers.Conv2D(classes, (1, 1), strides=(1, 1), activation='softmax', name='predictions', padding='valid'))
    model.add(tf.keras.layers.GlobalMaxPooling2D())
    model.summary()

    # Define hyperparameters
    learning_rate = 0.01
    batch_size = 32
    epochs = 40

    # Define the generator for training (horizontal flipping as data augmentation)
    train_data_generator = ImageDataGenerator(preprocessing_function=efn.preprocess_input, horizontal_flip=True)

    train_generator = train_data_generator.flow_from_directory(trainpath, target_size=(image_size, image_size),
                                                               batch_size=batch_size, class_mode='categorical')

    # Define the generators for validation and testing
    val_test_generator = ImageDataGenerator(preprocessing_function=efn.preprocess_input)

    val_generator = val_test_generator.flow_from_directory(valpath, target_size=(image_size, image_size),
                                                           batch_size=8 * batch_size, class_mode='categorical')

    test_generator = val_test_generator.flow_from_directory(testpath, target_size=(image_size, image_size),
                                                            batch_size=8 * batch_size, class_mode='categorical')

    # Compile the model
    model.compile(loss='categorical_crossentropy',
                  optimizer=optimizers.SGD(learning_rate), metrics=['categorical_accuracy'])

    # ReduceLROnPlateau callback: halve the learning rate when validation loss plateaus
    reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, min_delta=1e-3,
                                                     cooldown=0, min_lr=0.001)

    # Fit the model
    model.fit(train_generator, epochs=epochs, verbose=2, validation_data=val_generator, validation_steps=1,
              callbacks=[reduce_lr, tf.keras.callbacks.CSVLogger(filename='Model-2-training')])

    # Evaluate the model
    loss, acc = model.evaluate(test_generator, verbose=2)

    # Print results and save weights
    print("Accuracy of Model 2: ", 100 * acc, " %.")
    print("Epochs: ", epochs)
    print("Batch size: ", batch_size)
    print("Image size: ", image_size)
    print("Learning rate: ", learning_rate)
    model.save_weights('Model-2')
