
4. Network Design

4.4 Training

The neural networks were trained using TensorFlow [58]. Each network was set to be trained for 50 epochs with the Adam optimizer [11] and a learning rate of 0.001. Once the training loss had not improved for 3 consecutive epochs, training was continued with a learning rate of 0.0001 for another 50 epochs and finished when the training loss again showed no improvement for 3 consecutive epochs.
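As a sketch, this two-stage schedule corresponds roughly to the following tf.keras training loop; the function name, the use of EarlyStopping on the training loss and the "mae" loss string are assumptions based on the description above and section 4.4.1, not the actual training code of this thesis.

import tensorflow as tf

def train_two_stage(model, dataset):
    # Stop a stage once the training loss has not improved for 3 consecutive epochs.
    def early_stop():
        return tf.keras.callbacks.EarlyStopping(monitor="loss", patience=3)

    # Stage 1: Adam with learning rate 0.001, at most 50 epochs.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mae")
    model.fit(dataset, epochs=50, callbacks=[early_stop()])

    # Stage 2: continue from the same weights with learning rate 0.0001.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss="mae")
    model.fit(dataset, epochs=50, callbacks=[early_stop()])
    return model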

Figure 4.2. Training losses for the 32 spp training set first, then for 8 spp, 16 spp and 64 spp using the 32 spp trained model as a base. As the models for different sample counts share similar input sets with different noise levels, using a model trained with a different spp as a base significantly decreases the training time for the consecutive models.

As the training images from each scene only differed by the sample count, with different input RGB values and the corresponding sample variances, each of the different sample-count datasets was retrained from one base network. So, for one scene a network was first trained with the 32 spp input dataset, and the corresponding networks for 8, 16 and 64 spp were trained by retraining the 32 spp network with the respective spp dataset using only the learning rate of 0.0001. This decreased the training time considerably for all the networks. An example training run for the networks is shown as training loss over epochs in figure 4.2.
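Reusing the 32 spp network as a base could then look roughly as follows; build_model and the dataset dictionary are hypothetical helpers, and train_two_stage refers to the earlier sketch.

import tensorflow as tf

def train_scene_models(build_model, datasets_by_spp):
    # Full two-stage training only for the 32 spp base network.
    base = train_two_stage(build_model(), datasets_by_spp[32])

    models = {32: base}
    for spp in (8, 16, 64):
        model = build_model()
        model.set_weights(base.get_weights())  # start from the 32 spp weights
        # Retrain with only the lower 0.0001 learning rate.
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss="mae")
        model.fit(datasets_by_spp[spp], epochs=50,
                  callbacks=[tf.keras.callbacks.EarlyStopping(monitor="loss", patience=3)])
        models[spp] = model
    return models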

For all the CNNs the input features included the noisy RGB (3 channels), sample variance (1 channel), view normal (2 channels) and depth (1 channel), so 7 input channels altogether. The gloss or roughness input also used in [3] was dropped from the final input sets, as the path tracer used simple and almost uniform values for these in the models used. The inputs used for each CNN are shown in figure 4.3.
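As an illustration, the 7-channel input can be formed by concatenating the per-pixel feature buffers; the buffer names below are illustrative and each buffer is assumed to be a float32 tensor of shape (H, W, C).

import tensorflow as tf

def make_network_input(noisy_rgb, variance, view_normal, depth):
    # noisy_rgb:   (H, W, 3)  noisy RGB with the first-bounce albedo divided out
    # variance:    (H, W, 1)  per-pixel sample variance
    # view_normal: (H, W, 2)  view-space normal, two components
    # depth:       (H, W, 1)  depth
    return tf.concat([noisy_rgb, variance, view_normal, depth], axis=-1)  # (H, W, 7)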

Figure 4.3. Training images for each input and output. The input consisted of 7 channels: noisy RGB without first-bounce albedo (3 channels), variance (1 channel), view normal (2 channels) and depth (1 channel). The albedo demodulation was done in the network as a multiplication, and the output was compared against the ground-truth 4096 samples per pixel converged path traced image.

4.4.1 Loss Function

The loss function for the path tracing denoising network was chosen based on experiments from previous work on multiple samples per pixel by Bako et al. [12]. In that work the l1 loss achieved the best results. They argued that the l1 loss is less prone to the outliers that low-sample path tracing produces than, for example, the l2 loss. For pixel-wise prediction the l1 or MAE loss in equation 2.6 can be written as:

\ell_1 = \frac{1}{HW} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| f_{x,y}(I_n) - I^{r}_{x,y} \right|,

where the loss is calculated for the whole denoised image f_{x,y}(I_n) compared to the reference I^{r}_{x,y} over the whole height H and width W.
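As a concrete sketch, the same pixel-wise l1 loss can be written with TensorFlow ops; the function name is illustrative and the mean is taken over all pixels and color channels of the image.

import tensorflow as tf

def l1_loss(reference, denoised):
    # Mean absolute difference between the denoised image and the reference,
    # averaged over all pixels (and color channels).
    return tf.reduce_mean(tf.abs(denoised - reference))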

4.4.2 Activation Functions

Each model has a nonlinear activation function after each convolutional layer except the last, which has a linear activation. The linear activation is used so that the output is not limited and can conform to the High Dynamic Range (HDR) of path tracing.
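As an illustration of this placement, a sketch of such a convolutional stack is given below; the layer count and filter width are placeholders rather than the architecture used in this thesis.

import tensorflow as tf

def build_denoiser(hidden_layers=4, filters=32, activation="relu"):
    # 7 input channels: noisy RGB, variance, view normal and depth.
    x = inputs = tf.keras.Input(shape=(None, None, 7))
    for _ in range(hidden_layers):
        # Nonlinear activation after every hidden convolution.
        x = tf.keras.layers.Conv2D(filters, 3, padding="same",
                                   activation=activation)(x)
    # Last convolution is linear so the HDR output range is not limited.
    outputs = tf.keras.layers.Conv2D(3, 3, padding="same", activation=None)(x)
    return tf.keras.Model(inputs, outputs)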

Six different activation functions were tested for one scene: tanh, ReLU, Swish, Scaled Exponential Linear Unit (SELU), Leaky Rectified Linear Unit (Leaky ReLU) and identity (linear). Previous work suggests that for shallow networks the tanh activation function works better [20]. Moreover, the single image super-resolution problem and other image processing problems do not usually consider high dynamic range images but image problems where the values are in the range of 0.0 to 1.0. The work done by Bako et al. in [12] suggests applying a logarithmic transform to the specular component for the denoising to reduce the range of the color values with the function:

\tilde{c}_{\mathrm{specular}} = \log(1 + c_{\mathrm{specular}}).    (4.3)

In our work we do not separate the diffuse and specular components for the denoising because of real-time requirements, so this logarithmic function would have to be used for the diffuse component as well. Moreover, the high dynamic range, with or without the logarithmic transform, has an effect on the activation function behavior in the neural network. The six activation functions were tested with and without the logarithmic transform. Instead of applying the transformation only to the specular component, the logarithmic transform was tested after the albedo division of the color values as \tilde{c} = \log(1 + c):

\tilde{c} = \log\bigl(1 + (c \oslash (f_{\mathrm{albedo}} + \epsilon))\bigr),    (4.4)

and after the network the inverse was applied:

c = (\exp(\tilde{c}) - 1) \otimes (f_{\mathrm{albedo}} + \epsilon).    (4.5)
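The tested transform pair can be sketched as follows, assuming the first-bounce albedo buffer is available per pixel; the epsilon value below is only illustrative, as it is not specified here.

import tensorflow as tf

EPS = 1e-2  # illustrative value for epsilon; not specified in the text

def log_transform(color, albedo):
    # Equation 4.4: albedo division followed by logarithmic compression.
    return tf.math.log(1.0 + color / (albedo + EPS))

def inverse_log_transform(transformed, albedo):
    # Equation 4.5: inverse of the transform, applied after the network.
    return (tf.math.exp(transformed) - 1.0) * (albedo + EPS)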

The tests for each activation function are shown in figure 4.4.

Figure 4.4. Losses for the small convolutional neural network (a) without and (b) with the logarithmic transformation described in equation 4.4.

The ReLU, Leaky ReLU and linear convolutional kernels are initialized with HeUniform [59], tanh and Swish with the Xavier method [27], and SELU with LecunNormal [60].
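As an illustration of this pairing, the initializer could be selected per activation as in the sketch below; the mapping table and helper function are assumptions, not the thesis code.

import tensorflow as tf

# Illustrative mapping from each tested activation to its kernel initializer.
KERNEL_INIT = {
    "relu":       tf.keras.initializers.HeUniform(),
    "leaky_relu": tf.keras.initializers.HeUniform(),
    "linear":     tf.keras.initializers.HeUniform(),
    "tanh":       tf.keras.initializers.GlorotUniform(),  # Xavier
    "swish":      tf.keras.initializers.GlorotUniform(),  # Xavier
    "selu":       tf.keras.initializers.LecunNormal(),
}

def conv_layer(filters, activation):
    # Note: in older TensorFlow versions Leaky ReLU is added as a separate
    # tf.keras.layers.LeakyReLU layer instead of an activation string.
    return tf.keras.layers.Conv2D(
        filters, 3, padding="same", activation=activation,
        kernel_initializer=KERNEL_INIT[activation])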

From these experiments, ReLU has the best convergence of the loss. Furthermore, the poor performance of the tanh activation function contradicts the results for a similar network architecture in related work [20], where tanh was found to perform better than ReLU. However, in path tracing the input range is not limited between 0 and 1, which may be the cause for the poor performance in this test, as the output of the tanh function is limited to the range −1 to 1. This complicates the propagation of information inside the network, as the final output must be scaled back to the high dynamic range.

The logarithmic transformation does not seem to help the learning of the network, as shown in figure 4.4, where the losses in image (b) do not converge as well as the losses in (a). This may be caused by the transformation compressing the range of the diffuse components where the range is already small, thus making the learning process more difficult. Furthermore, this was tested with only one scene (Classroom), and scenes with more specular components and higher dynamic range values may benefit more from the transform. In addition, the tested network has fewer nonlinear units than the one in [12]. The transformation could still be useful even in small networks if it were applied only to the specular component, either with separate networks for the specular and diffuse components or with the specular component as a separate input for the same network. Nevertheless, as the specular and diffuse component separations were not readily available as extra data, further testing was outside the scope of this thesis.

Furthermore, an interesting case for the linear activation function is seen in case (b), where the logarithmic transform caused the training loss to explode, which is seen as the absence of its training loss curve. This may be caused by the combination of unbounded inputs and outputs, without non-linear activations to bound them inside the network, and the inverse transformation function 4.5 making the gradient explode.

Based on these tests, the ReLU activation function was chosen for all the consecutive small networks in our problem. Moreover, the tests show that Swish converges as well as ReLU, but the simplicity of computing each ReLU unit better conforms to our real-time requirements. The tested logarithmic transformation was also omitted from the subsequent testing, as it did not seem to improve the learning process in the test case.