
FAST CONVOLUTIONAL NEURAL NETWORKS FOR REAL-TIME PATH TRACING DENOISING

Master of Science Thesis
Faculty of Information Technology and Communication Sciences
Examiners: Dr. Markku Mäkitalo, Asst. Prof. Pekka Jääskeläinen
May 2021

(2)

ABSTRACT

Atro Lotvonen: Fast Convolutional Neural Networks for Real-Time Path Tracing Denoising
Master of Science Thesis

Tampere University

Signal Processing and Machine Learning
May 2021

Path tracing is a method for generating photorealistic images with physically based effects such as reflections, shadows, refractions and global illumination. Path tracing in real-time requires a large amount of computational power, and it is often more sensible to use efficient post-processing methods to improve the quality of the output than to use the computational power to increase the number of samples and thus decrease the error. However, the ease of parallelizing path tracing offers a good way to improve the results in real-time when a high amount of computational power is attainable, for example in a server cluster.

The advancements in machine learning for image-based problems and the evolving inference hardware for neural networks enable the reconstruction of multiple samples per pixel path tracing in real-time using machine learning based methods. However, most of the previous machine learning based methods do not consider real-time inference, and this becomes even more prevalent with real-time path tracing, where the path tracing takes most of the computational time of a single frame.

In this thesis, the performance of fast convolutional neural networks is tested for denoising path traced images with multiple samples per pixel. The fast convolutional neural networks can achieve better error metrics than state-of-the-art analytical bilateral filters in most cases. Moreover, for real-time performance the fast convolutional neural networks can be processed with almost the same computational power requirements as the analytical filters. Also, with 8 samples per pixel inputs, the fast convolutional neural networks achieve better quality in almost all cases than plain path tracing with 64 samples per pixel, which requires 8 times the computational power or level of parallelization.

Keywords: Path Tracing, Machine Learning, Real-Time, Denoising, Convolutional Neural Networks

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


TIIVISTELMÄ

Atro Lotvonen: Fast Convolutional Neural Networks for Real-Time Path Tracing Denoising
Master of Science Thesis

Tampere University

Signal Processing and Machine Learning
May 2021

Path tracing is a method for creating photorealistic images with physically based effects such as reflections, shadows, refractions and global illumination. Path tracing in real-time requires a lot of computational power, and it is often more sensible to use efficient post-processing methods to improve the image quality than to use the computational power to increase the number of samples and thus reduce the error. However, the ease of parallelizing path tracing is a good way to improve the results in real-time when high computational power is available, for example in a server cluster.

The development of machine learning for image-based problems and of computing units designed for neural networks enables the reconstruction of multiple samples per pixel path tracing in real-time with machine learning based methods. However, most of the previous machine learning based methods do not take real-time computation into account, which is even more prevalent in real-time path tracing, where the path tracing takes most of the computation time of a single frame.

This thesis tests the performance of fast convolutional neural networks for removing noise from path traced images when multiple samples per pixel are available.

In most cases, fast convolutional neural networks can achieve better values in error metrics than the latest analytical bilateral filters. In addition, to achieve real-time performance, fast convolutional neural networks can be used with almost the same computational power requirements as analytical filters. Fast convolutional neural networks can also achieve better quality in almost all cases with 8 samples per pixel compared to the plain path tracing result with 64 samples per pixel, which requires 8 times the computational power.

Keywords: Path Tracing, Machine Learning, Real-Time, Denoising, Convolutional Neural Networks

The originality of this publication has been checked with the Turnitin OriginalityCheck program.


PREFACE

This thesis was done as part of the research of the Virtual Reality and Graphics Architectures (VGA) group at Tampere University and as part of the FitOptiVis project. This project has received funding from the ECSEL Joint Undertaking (JU) under grant agreement No 783162. The JU receives support from the European Union's Horizon 2020 research and innovation programme and Netherlands, Czech Republic, Finland, Spain, Italy.

Thank you to both of my supervisors, Dr. Markku Mäkitalo and Asst. Prof. Pekka Jääskeläinen, for guiding me during the thesis. I would also like to thank Dr. Matias Koskela, who guided me for a long time when I started as a research assistant in computer graphics related topics. Big thanks also to all my colleagues in the CPC-VGA group, especially Julius Ikkala for many fruitful comments related to this thesis.

Special thanks to my family: Laila, Toivo, Henri, Jarno, Ari Leppälä and especially to my mother Päivi, who raised three boys by herself — you are a big inspiration for me.

Tampere, 31st May 2021

Atro Lotvonen


CONTENTS

1. Introduction
2. Convolutional Neural Networks
2.1 Neural Networks
2.2 Training and Backpropagation
2.3 Convolutional Neural Networks
2.3.1 UNet
2.3.2 Dilated Convolutions
2.4 Neural Networks in Real-Time
2.4.1 Depthwise Separable Convolution
2.4.2 Pruning
2.4.3 Quantization
3. Real-Time Path Tracing
3.1 Ray Tracing
3.2 Path Tracing
3.3 Feature Buffers
3.4 Real-Time Path Tracing
3.5 Denoising
3.5.1 Bilateral Filter
3.5.2 Á Trous
4. Network Design
4.1 Data Collection
4.2 Starting Points
4.3 Network Inputs
4.4 Training
4.4.1 Loss Function
4.4.2 Activation Functions
4.5 Architectures
4.5.1 Convolutions and Feature Maps
4.6 Pruning
4.7 Quantization
5. Evaluation
5.1 Quality Comparison
5.2 Timings
5.3 Analysis
5.4 Real-Time Analysis
6. Future Work
7. Conclusions
References


LIST OF SYMBOLS AND ABBREVIATIONS

2D Two-Dimensional
3D Three-Dimensional
AdaGrad Adaptive Gradient Algorithm
Adam Adaptive Moment Estimation
BMFR Blockwise Multi-order Feature Regression
BSDF Bidirectional Scattering Distribution Function
CNN Convolutional Neural Network
CUDA Compute Unified Device Architecture
DCNN Dilated Convolutional Neural Network
fps frames per second
GAN Generative Adversarial Network
GPU Graphics Processing Unit
HDR High Dynamic Range
ImageNet A large image database for machine learning
Leaky ReLU Leaky Rectified Linear Unit
MAE Mean Absolute Error, also known as l1 loss in Machine Learning
MR-KP Multi-Resolution Kernel Prediction CNN
MSE Mean Squared Error, also known as l2 loss in Machine Learning
ONND OptiX Neural Network Denoiser
ReLU Rectified Linear Unit
RGB Red-Green-Blue additive color model
RMSE Root-Mean-Square Error
SCNN Simple Convolutional Neural Network
SELU Scaled Exponential Linear Unit
SGD Stochastic Gradient Descent
spp Samples Per Pixel
SSIM Structural Similarity Index Measure
SUNet Small UNet
SVGF Spatiotemporal Variance-Guided Filtering
Swish Sigmoid Linear Unit
TAA Temporal Anti-Aliasing
tanh Hyperbolic Tangent
UNet Fully Convolutional Neural Network with Encoder and Decoder parts


1. INTRODUCTION

Path tracing is a rendering technique that generates photorealistic images by physically emulating light traversing three-dimensional (3D) scenes. Even though new consumer hardware for path tracing has started to appear, path tracing in real-time is still a challenging problem. In order to generate path traced images in real-time, the number of paths traced must be limited, which produces highly noisy images. But because path tracing itself is an 'embarrassingly parallel' problem, one possible solution for generating more samples is to increase the computational power with multiple devices and do the path tracing in a server cluster. Even when distributing path tracing across multiple devices, it is still very hard to generate fully converged images (several thousand paths per pixel) in real-time. Moreover, in path tracing, halving the error of the output requires quadrupling the number of samples, which means the computational power must grow in the same proportion. However, fast reconstruction methods have been developed to enhance the final image, and there is a point where it is more efficient to denoise the path traced image instead of increasing the number of samples. Therefore, the best approach still seems to be to limit the number of paths traced and utilize pre- and post-processing methods to reconstruct the final image.

Most real-time solutions for path tracing have concentrated on using only one sample per pixel (spp), temporal accumulation of samples, and post-processing to denoise the highly noisy image. For example, in other work the solution has been to use the temporal accumulation of 5 samples per pixel and wavelet-based denoising on the output [1]. But new consumer-level Graphics Processing Units (GPUs) have started to appear with dedicated hardware for ray traversal [2]. This is an important prerequisite for generating better quality images more quickly, which in turn enables more samples per pixel in real-time implementations as well. This is an interesting direction for decreasing the noise of the generated image, which helps to simplify the denoising method.

The evolution of modern machine vision and learning based methods has also been an interesting topic for classifying, generating and reconstructing images. Many Convolutional Neural Network (CNN) based solutions have surpassed manual and model based methods in many image classification and reconstruction problems. They have also been successfully applied to reconstructing path traced images but have usually lacked the computational speed for real-time implementations. These problems are


Figure 1.1. Results of convolutional neural network denoising of path traced inputs with different samples per pixel. The left parts of the images are the input images with different samples per pixel used in this thesis, the middle parts are the 4096 spp fully converged reference images, and the right parts are the outputs of a fast convolutional neural network used to denoise the low sample per pixel input.

being solved by the increasingly faster dedicated inference hardware for machine learning and by the development of new, faster network architectures and layers. Still, the speed of these methods lags behind many of the analytical models created for the same problems.

Moreover, an interesting direction has been to combine the learning-based methods with the model-based methods.

In this thesis we study the use of convolutional neural networks for denoising path traced images with multiple samples per pixel in real-time. Path tracing with 8, 16, 32 and 64 samples per pixel was considered. The main idea of the thesis is to test the real-time performance of machine learning reconstruction for path tracing and to study the effect of this post-processing in relation to increasing the number of samples in rendering. This requires limiting the number of parameters and the complexity of the network to meet the demanding requirements of real-time path tracing. For convolutional neural networks this means trading off the receptive field when decreasing the depth of the network. In this work the depth of the networks is decreased to only two hidden layers, compared to for example the ten hidden layers used in other work [3]. To increase the receptive field, different methods are tested: different sizes of convolutional kernels, pooling and dilated convolutions. Also, pruning and separable convolutions are utilized to try to improve the performance of the networks. When designing the fast networks, a few different nonlinear activation functions are tested for the small reconstruction networks. After this, three different small networks are evaluated and compared in three different scenes with a fast analytical reconstruction method, also with varying receptive field, and with a slower network. The denoising results achieved with a small CNN are illustrated in figure 1.1.

In Chapter 2, the basics of Convolutional Neural Networks (CNN) are presented first, along with how they can be implemented for image reconstruction in real-time. In Chapter 3, path tracing is introduced and the constraints of rendering path traced images in real-time are described. In Chapter 4, the different designed networks are introduced and tested with different hyperparameters and optimizations. After this, in Chapter 5, the designed networks for path tracing reconstruction with CNNs are evaluated and compared with an analytical real-time denoiser. Chapter 6 discusses future work related to the thesis, and Chapter 7 presents the conclusions.


2. CONVOLUTIONAL NEURAL NETWORKS

Deep learning and neural networks have started to produce state-of-the-art results for a multitude of audio-visual tasks in recent years. The main problems for the efficient utilization of these data-based methods in various image and sound related topics are still the amount of data and computational power required by the algorithms. These problems are being solved by generating large multipurpose datasets [4] and creating dedicated hardware for training and running these networks, as done with NVIDIA's Tensor Cores [2]. Moreover, new types of layers, nonlinear activations and training methods are being developed to improve the training and performance of the neural networks.

2.1 Neural Networks

A conventional fully connected neural network is a cascade of neurons where each neuron is connected to multiple inputs, as shown in figure 2.1. The inputs are often referred to as features; the input can be for example an audio signal or an image. Feature selection is also important for reducing the complexity of the learning process. The inputs or features are fed into the hidden layers of the network. These inputs are usually connected to a neuron with different weights. The weighted inputs are summed together and passed through a nonlinear activation function such as the hyperbolic tangent (tanh) [5]:

\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}},    (2.1)

or the Rectified Linear Unit (ReLU) [6]:

\mathrm{ReLU}(x) = \begin{cases} 0 & \text{if } x \leq 0 \\ x & \text{if } x > 0 \end{cases} = \max\{0, x\}.    (2.2)

The activation functions are used to enable the network to model the problem nonlinearly. Unlike with a linear activation, a network with nonlinear activations can be proven to be a universal function approximator for at least two-layer networks [7], which is also known as the Universal Approximation Theorem.
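For illustration, the activation functions above can be sketched in a few lines of NumPy:

```python
import numpy as np

def tanh(x):
    # Equation 2.1; numerically equivalent to np.tanh(x).
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    # Equation 2.2: max{0, x}.
    return np.maximum(0.0, x)

def swish(x):
    # Swish [8]: x * sigmoid(x).
    return x / (1.0 + np.exp(-x))
```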


Figure 2.1. Simple fully connected neural network.

Figure 2.2. Different nonlinear activation functions: ReLU, tanh and Swish [5, 6, 8].

New activation functions have been developed to create more interesting patterns and to make the networks learn more efficiently. For example, the Sigmoid Linear Unit (Swish) is an interesting function which combines the linear properties of ReLU and the smooth fall of a sigmoid function [8]. The previously mentioned activation functions are illustrated in figure 2.2.

After the hidden layer, the final output layer gives the result of the network. The output can be for example a probability or a prediction of a color for a pixel.

As the input and network sizes increase, the complexity of a fully connected network grows rapidly. The complexity of a neural network can be characterized by its number of parameters w and its computational cost. The computational cost of a single neural network layer can be calculated simply with:

I \cdot O,    (2.3)

where I is the number of input features and O is the number of neurons, or the size of the output features in the layer. For instance, consider an arbitrary network layer with a 100x100 Red-Green-Blue (RGB) image as input; if no loss of information is allowed, the output size should be at least the same as the input. The computational cost would be (100 · 100 · 3)^2 = 900 000 000, and the number of parameters w is the same, as each neuron considers each color value as a single feature. Now, in many problems the whole feature space may not be needed for a single neuron, and as such it is more sensible to consider only a part of the features per neuron. In neural networks this is usually done by only using a small window of samples from the high number of features, in the so-called Convolutional Neural Networks (CNN).

2.2 Training and Backpropagation

The original advancement for training neural networks with back-propagation [9] and the development of processing units were crucial for neural networks to work outside of theory.

The idea behind machine learning is to minimize the error of a chosen quality function Q with respect to the weights w for the input from n datapoints:

Q(w) = \frac{1}{n} \sum_{i=1}^{n} Q_i(w).    (2.4)

This is a sum-minimization problem, where the quality function must be evaluated in each neuron when minimizing the error of the whole network by propagating the error backwards through the network. Furthermore, an increasing number of neurons or weights w and datapoints n quickly makes the minimization problem extremely difficult to calculate efficiently in practice. Therefore, stochastic methods are used to evaluate an intermediate loss for a smaller number of datapoints i, also called a batch. This is done in Stochastic Gradient Descent (SGD) [10] by iteratively evaluating the gradient of the loss for i samples:

w := w - \eta \nabla Q(w) = w - \frac{\eta}{n} \sum_{i=1}^{n} \nabla Q_i(w),    (2.5)

where \eta is the step size, also called the learning rate. There have also been multiple updates to the SGD algorithm, such as the Adaptive Gradient Algorithm (AdaGrad) and Adaptive Moment Estimation (Adam) [11]. AdaGrad introduces individualized learning rates for parameters, and Adam combines the variable learning rates with second moment gradients of the loss function.
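A minimal NumPy sketch of one SGD step over a batch (equation 2.5); grad_Qi here is a hypothetical per-datapoint gradient function supplied by the user:

```python
import numpy as np

def sgd_step(w, batch, grad_Qi, lr=0.01):
    # Equation 2.5 evaluated on a mini-batch: average the per-datapoint
    # gradients and take a step of size lr against them.
    grads = [grad_Qi(w, x) for x in batch]
    return w - lr * np.mean(grads, axis=0)
```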

For image based problems, common quality functions to minimize are, for example, the


Mean Absolute Error (MAE):

MAE = \frac{1}{n} \sum_{i=1}^{n} |Y_i - \hat{Y}_i|,    (2.6)

and Mean Squared Error (MSE):

MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2,    (2.7)

where Y_i are the observed values and \hat{Y}_i are the predictions. MAE is also often called the l1 loss and MSE the l2 loss in machine learning.
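Both losses are direct to express in NumPy (a sketch of equations 2.6 and 2.7):

```python
import numpy as np

def mae(y, y_hat):
    # Equation 2.6, the l1 loss.
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    # Equation 2.7, the l2 loss.
    return np.mean((y - y_hat) ** 2)
```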

However, the choice of loss function is not simple and is an important part of optimization.

Furthermore, optimizing for a loss function in neural networks may not yield the best results when evaluating the model for that specific loss. For example, in the work by Bako et al. [12] on denoising Monte Carlo rendered images, five different loss functions were tested: l1, relative l1, l2, relative l2 and the Structural Similarity Index Measure (SSIM) [13].

It was found that using the l1 loss in training resulted in the best training convergence and validation performance for each metric, even when the model was optimized for a specific loss. This is an interesting topic, and many innovative new loss functions are being found and used for a variety of tasks. One interesting loss function is the so-called Generative Adversarial Network (GAN) loss, employed e.g. in the work by Ledig et al. [14], in which a perceptual loss function was derived from the high-level feature maps of another image based neural network.

The parameters such as the learning rate, the loss function and the different topological designs of the model are also called hyperparameters.

2.3 Convolutional Neural Networks

A specialized type of neural network is the Convolutional Neural Network (CNN) [15]. Unlike conventional fully connected layers, convolutional layers are inspired by human vision [16]. The main difference from fully connected layers is that convolutional layers are only partly connected, to a few inputs in a small window. An example CNN is illustrated in figure 2.3.

The convolutional layers first became popular for image recognition and classification tasks such as ImageNet [17]. The typical CNN architecture used convolutional layers first to extract interesting features from the input images, after which the feature set was concatenated into a dense fully connected layer. After this, architectures for image generation and reconstruction have also appeared, for example the UNet [18].


Figure 2.3. Convolutional neural network example for hand-written digit classification. First there are several convolutional layers with different resolutions achieved by pooling, and this is later flattened to a dense fully connected network with ten outputs giving class probabilities for the digits 0-9 [15].

Traditionally the image size has been quite small for image recognition and classification problems, and the inputs to the networks have usually been downscaled versions of the original images. For example, in ImageNet classification tasks the images are usually cropped to 256x256 or 224x224. Also, for classification and recognition tasks, these convolutional features are further computed at multiple resolutions, usually by using pooling to downscale the original resolution in later convolutional layers. The downscaling is also illustrated in figure 2.3, where after the first layer of convolutions the resolution of the features is halved in two dimensions for each subsequent convolutional layer.

However, in image denoising or super-resolution problems, downscaling the original image would make it hard for the network to perform well on the task. Moreover, the input size affects the computational cost, and the receptive field must be thought out for the task. The receptive field of a single standard convolutional layer is the size of the convolutional kernel, and subsequent layers increase it only by the new kernel size. For classification tasks the pooling layer works to increase the receptive field, but for reconstruction problems, where the output resolution is the same as the input image, this leads to loss of information in the low-resolution pooled layers. There are different solutions to this problem. UNet [18] addresses it with pooling layers and skip connections.

Dilated convolutions [19] address it by using convolutions in a dithering pattern, in the same way as Á Trous, which is described in section 3.5.2. Some methods simply increase the convolutional kernel size to increase the receptive field, as done by Shi et al. in their work on image super-resolution [20].

As the receptive field describes the area the CNN layer 'sees' [16], the receptive field r_i of a single-layer CNN is just the size of the kernel k itself.


Figure 2.4. Illustration of the receptive field of stacked standard convolutions: a 5x5 convolution followed by two 3x3 convolutions.

After the first layer, each new convolution increases the receptive field by the kernel radius, as the receptive fields of the samples on the edge of the kernel are also captured. Furthermore, the receptive field of convolutional layer i can be written as:

r_i = r_{i-1} + (k - 1).    (2.8)

The receptive field is also illustrated in figure 2.4, where the first layer has 5x5 convolutions and the two subsequent layers 3x3 convolutions, so the receptive field can be calculated with equation 2.8 as r_1 = 5, r_2 = 5 + (3 - 1) = 7 and r_3 = 7 + (3 - 1) = 9. From this example it can be seen that just stacking standard convolutions may be an inefficient way to increase the receptive field.
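The recurrence of equation 2.8 can be checked with a small sketch:

```python
def receptive_field(kernels):
    # Equation 2.8: r_i = r_{i-1} + (k - 1), starting from r_0 = 1.
    r = 1
    for k in kernels:
        r += k - 1
    return r

# The stack from figure 2.4: a 5x5 layer followed by two 3x3 layers.
print([receptive_field([5]),
       receptive_field([5, 3]),
       receptive_field([5, 3, 3])])  # [5, 7, 9]
```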

As convolutional neural networks can be thought of as only partially connected neural networks, they greatly reduce the complexity of the network compared to a fully connected network. The computational cost of standard convolutions for a single layer can be calculated by [21]:

C_K \cdot C_K \cdot I_W \cdot I_H \cdot I_D \cdot O_D,    (2.9)

where C_K is the dimension of the convolutional kernel, I_W, I_H, I_D are the input feature width, height and depth respectively (equivalent to I in the fully connected equation 2.3) and O_D is the number of output channels. The number of parameters in the convolutional neural network can be calculated with C_K · C_K · I_D · O_D, as the convolutional kernels are used as a sliding window over the two spatial dimensions I_W · I_H. Now considering the example where a 100x100 RGB input image is used as input features, the output features are the same size and a convolutional kernel of size C_K = 3 is used, the computational complexity for the CNN is 3 · 3 · (100 · 100 · 3) · 3 = 810 000, compared to 900 000 000 in the fully connected case. The number of parameters is reduced to 3 · 3 · 3 · 3 = 81, compared to the 900 000 000 of the fully connected network.


This reduction in complexity comes with the consideration that the receptive field for a single feature is a 3x3 window over the N input feature maps. In addition, because this window is slid across the feature maps, the M output feature maps cannot have spatially discriminative features, which means that the network cannot, for example, locally apply different feature selection in different parts of the image.

Equation 2.9 is valid for 1- and 2-dimensional convolutions. Furthermore, convolutions which extend the dimensionality to three dimensions are called 3D Convolutional Neural Networks and can be used to consider features in three dimensions, such as temporal data [22]. However, in this thesis the temporal aspect is not considered in the experiments, and 3D CNNs are therefore omitted. For future work, 3D CNNs are an interesting direction, especially when considering adding temporal data to the input set.

2.3.1 UNet

UNet was first introduced by Ronneberger et al. for biomedical image segmentation [18].

UNet is a fully convolutional neural network which has encoder and decoder parts, illustrated in figure 2.5. In the encoding phase the network increases the receptive field of the convolutions by using pooling to reduce the resolution of each intermediate layer, as done in multiple recognition networks. After the encoding phase, when the receptive field has been increased appropriately, the decoder phase reconstructs the final image. In each decoding phase the CNN takes the lower resolution input and upsamples it either analytically, e.g. with bilinear interpolation, or uses so-called deconvolutional layers to upscale the input. Deconvolutional layers can be thought of as the backwards operation of convolution [18].

This kind of encoder-decoder architecture can also be used by itself, for example for compression, as done by Rippel and Bourdev for real-time adaptive compression in [23]. But for good quality image reconstruction, such networks lose a lot of spatial information in each consecutive pooled layer. UNet solves this problem with skip connections, which propagate the information from the same-resolution layers of the encoding phase to the decoding phase. Moreover, for training purposes the skip connections also propagate the error better to the first layers, helping with the vanishing gradient problem. Furthermore, the need for skip connections and deconvolutions forces using multiple convolutional layers for each separate resolution, so as not to lose spatial information when advancing in the network, and for a real-time application this requires more calculations and extra work.
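A minimal PyTorch sketch of a one-level encoder-decoder with a single skip connection; the channel counts are illustrative assumptions, not the architecture used in this thesis:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, out_ch=3, feat=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)          # encoder: halve the resolution
        self.mid = nn.Sequential(nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        # The decoder sees the upsampled features concatenated with the skip.
        self.dec = nn.Conv2d(feat * 2, out_ch, 3, padding=1)

    def forward(self, x):
        e = self.enc(x)                      # full-resolution features (skip)
        m = self.mid(self.pool(e))           # low-resolution features
        return self.dec(torch.cat([self.up(m), e], dim=1))
```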

The receptive field with pooling can be calculated as per [19]:

r_i = r_{i-1} + (k - 1) \cdot p_i,    (2.10)


Figure 2.5. UNet with skip connections. UNet increases the receptive field of the network by reducing the resolution of the input image utilizing pooling, and later reuses the intermediate outputs in the decoding phase by using skip connections [18].

Figure 2.6. Illustration of the UNet receptive field for three 3x3 layers. The reduction of the pixels, or the increasing grid size, illustrates the reduction of resolution after pooling layers.

where p_i is the pooling factor; this is also illustrated in figure 2.6, where a pooling of size 2 is used in two dimensions.

2.3.2 Dilated Convolutions

Dilated convolutions follow the same idea as Á Trous, which is introduced in section 3.5.2. The convolutions are 'sparse' in the sense that the distance between the kernel units is controlled with a dilation factor. The effect of the dilation factor can be seen in figure 2.7. This is an interesting way to increase the receptive field inside a CNN without losing spatial information.

The receptive field with dilated convolutions can be calculated with the formula derived from [19]:

r_i = r_{i-1} + (k - 1) \cdot d_i \prod_{j=1}^{i-1} s_j,    (2.11)


Figure 2.7. Dilated convolutions with dilation rates 1, 2 and 4, where each dilated convolution is applied to the output of the previous step, increasing the effective spatial field size, shown in red opaque [19].

where r_i is the receptive field of the layer, d_i is the dilation factor of the layer, k is the size of the convolutional kernel and s is the stride of the convolution. For example, in the case shown in figure 2.7, where the convolution kernel k = 3, d = 1, 2, 4 and there is no stride, the receptive field equation simplifies to:

r_i = r_{i-1} + (k - 1) \cdot d_i,    (2.12)

and the receptive field would be r_1 = 3, r_2 = 7 and r_3 = 15, as can be seen in red opaque in figure 2.7. It can be noted that the receptive fields from equations 2.10 and 2.12 look the same, but the significant difference is that dilated convolutions do not decrease the resolution of the input: they are thus computationally more expensive, but in exchange they do not lose spatial information between the layers.

2.4 Neural Networks in Real-Time

Neural networks have not conventionally been used for time-critical problems. The inherent nature of neural networks is that for the network to learn meaningful representations of the data, the network must be deep. At first the development of networks was towards deeper and slower networks [17, 24, 25]. Not only does adding more layers to a network run into the vanishing/exploding gradient problem [26, 27], but after a point adding more depth saturates the accuracy, which then degrades rapidly. This is called the degradation problem, and adding more layers to a sufficiently deep network complicates the learning process for the first layers [28]. Moreover, even if the depth of the network is optimized to avoid the degradation problem, faster network methods with less depth, or less complex computation models for convolutions like depthwise separable convolutions [29], have been developed to achieve good results in real-time [20, 21] for image processing problems where the previous approach had been to deepen the net.

The real-time requirements for neural networks can be characterized by the size of the input, the complexity of the architecture and the size of the output. The sizes of the input and output are often determined by the task. For example, in image classification tasks the actual input image may be a rescaled version of the original image and the output just one binary value or a probability vector over a thousand classes [4]. In another task like super-resolution, the input is a low resolution image and the output an upscaled version of it [20, 30]. The difficulty of the task and the sizes of the inputs and outputs of the network affect the complexity of the architectural choices, but a lot of variation is possible inside the network to trade quality for speed [20, 21].

Input size depends on the problem. For image recognition problems, smaller input sizes have helped to train the network by resizing the original input to a smaller image, but there have also been methods where the image resolution is scaled up [31]. For image super-resolution the usual way has been to first upscale the low-resolution image with an analytical interpolation method and then feed the image to a UNet [30]. But Shi et al. suggest that the network itself should do the upscaling with an efficient sub-pixel convolutional layer, and learning these filters within the network reduces the computational cost [20].

For image denoising problems the resolution of the input image is usually the same as that of the output. Moreover, for fast path tracing reconstruction the usual way has been to accumulate samples and use the statistics of the sampling, such as the mean and variance, to reconstruct the final image [3, 12]. This is also called PixelGather in [32], and this method is later used in this thesis for all the denoising methods. The work in [32] suggests doing the denoising for path tracing in sample space by splatting the samples for a CNN, but that method increases the inference time almost linearly with the number of accumulated samples, whereas for the PixelGather method it stays constant.
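A sketch of PixelGather-style accumulation (an assumed helper, not the thesis code): the per-pixel mean color becomes the network input and the per-pixel luminance variance becomes a feature buffer:

```python
import numpy as np

def pixel_gather(samples):
    # samples: (spp, H, W, 3) radiance samples for a single frame.
    mean = samples.mean(axis=0)                         # (H, W, 3) color input
    # Per-sample luminance (Rec. 709 weights), then per-pixel variance.
    lum = samples @ np.array([0.2126, 0.7152, 0.0722])
    var = lum.var(axis=0)                               # (H, W) variance buffer
    return mean, var
```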

2.4.1 Depthwise Separable Convolution

One way to accelerate CNN computations is the so-called depthwise separable convolution [29]. This type of convolution separates the convolution into two parts: first a depthwise pass for filtering, and second a 1x1 pointwise convolution for combining.

They were used successfully for mobile and embedded vision applications in [21]. The computational cost of depthwise separable convolutions can be calculated with [21]:

C_K \cdot C_K \cdot I_W \cdot I_H \cdot I_D + I_D \cdot O_D \cdot I_W \cdot I_H,    (2.13)

where the first part before the sum is the depthwise filtering and the second part is the pointwise combination. The number of parameters for depthwise separable convolutions can be calculated with C_K · C_K · I_D + I_D · O_D.

Now again considering the example where a 100x100 RGB input image is used as input features, the output features are the same size and a convolutional kernel of size C_K = 3 is used, the computational complexity for the depthwise separable CNN is 3 · 3 · (100 · 100 · 3) + 3 · 3 · 100 · 100 = 360 000, compared to 810 000 for standard convolutions. The number of parameters is 3 · 3 · 3 + 3 · 3 = 36, compared to 81 for standard convolutions.
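The two cost models (equations 2.9 and 2.13) reproduce the numbers above:

```python
def conv_cost(ck, iw, ih, idepth, od):
    # Standard convolution, equation 2.9.
    return ck * ck * iw * ih * idepth * od

def separable_cost(ck, iw, ih, idepth, od):
    # Depthwise filtering plus 1x1 pointwise combination, equation 2.13.
    return ck * ck * iw * ih * idepth + idepth * od * iw * ih

print(conv_cost(3, 100, 100, 3, 3))       # 810000
print(separable_cost(3, 100, 100, 3, 3))  # 360000
```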

The simplified complexity of the network comes with a trade-off in model accuracy compared to standard convolutions, and this trade-off is further explored by Howard et al. in [21]. Furthermore, in a practical application the depthwise separable convolutions may even be slower than standard convolutions for convolutional layers that already have few parameters and low complexity. This is further explored in this thesis in section 4.5.

2.4.2 Pruning

In neural network pruning, the neurons with small weights are removed, reducing the complexity of the computation and the memory requirements of the network. This can often be done without sacrificing the performance of the network, as the network is often over-parametrized and there is often redundancy in the models [33]. Pruning was also used in early networks to reduce the complexity and over-fitting of the network [15, 34]. More recently it has been successfully used on state-of-the-art CNN models without a loss of accuracy [35].

In practice, pruning is an iterative process [35]. In each step some of the network weights are removed by greedily finding the best connections and removing the most insignificant ones. After this the network is retrained. These iterations are continued until a sufficient compression rate is achieved. Moreover, the accuracy of the network might suffer from too large a compression rate, and it can be beneficial to observe the loss and end the pruning. The compression rate and its effect on the accuracy of the model are also highly dependent on the network. For example, in the work by Han et al. [35], neural networks used for classifying ImageNet [4] were pruned, achieving a 9x compression rate for AlexNet [24] and a 13x compression rate for VGG [25] without loss of accuracy. Moreover, they showed that both the convolutional and fully connected layers could be pruned. Also, for inference, NVIDIA's new Ampere architecture is able to accelerate sparse matrix multiplications up to 2 times compared to dense matrices [36].
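A minimal sketch of one magnitude-based pruning step; real pipelines interleave such steps with retraining [35]:

```python
import numpy as np

def prune_smallest(weights, fraction):
    # Zero out the given fraction of weights with the smallest magnitude.
    threshold = np.quantile(np.abs(weights), fraction)
    return np.where(np.abs(weights) < threshold, 0.0, weights)
```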


2.4.3 Quantization

Quantization is a method to increase the speed of networks by reducing the precision of the weights inside the net [33, 37]. Furthermore, neural networks seem to be very robust to quantization: reduced-precision weights are often sufficient inside the neural network, producing almost no loss in accuracy [37].

While there has been a lot of previous work on lower precision and mixed precision training [37, 38], the main interest in this thesis is accelerating the inference with post-training quantization [39] and quantization aware training [40]. Moreover, the idea is to target GPUs with dedicated hardware support for lower precision arithmetic, such as Tensor Cores, which support 16-bit floats and 8- and 4-bit integers [2]. However, lowering the precision does affect the performance depending on the network and problem, and should be evaluated.
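A sketch of symmetric post-training quantization of a weight tensor to 8-bit integers with a single per-tensor scale (one simple scheme among many):

```python
import numpy as np

def quantize_int8(w):
    # Map float weights to int8 so that the largest magnitude hits 127.
    scale = max(float(np.abs(w).max()), 1e-12) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    # Approximate reconstruction of the original weights.
    return q.astype(np.float32) * scale
```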


3. REAL-TIME PATH TRACING

Path tracing is a physically based method to generate photorealistic images from 3D scenes. It simulates the interaction of photons within the scene and can generate realistic effects such as soft shadows, reflections, refractions and global illumination. Previously the method was deemed to work only in an offline setting, but this has changed in the past few years. Evolving computational hardware has started to enable the generation of good quality path tracing in real-time.

However, the real-time budget for high resolution content at high frame rates still sets limitations on the quality of the path traced image and requirements for the needed reconstruction. In this chapter, first the principles of path tracing are introduced. After this, the limitations of the method for real-time use are shown and the requirements for reconstruction are discussed. In section 3.5 some state-of-the-art reconstruction methods are introduced, including machine learning solutions to the problem. In sections 3.5.1 and 3.5.2 the bilateral blur and its multi-resolution variant Á Trous are discussed in more detail, as they are used in state-of-the-art real-time path tracing denoisers [1] and are also later used as comparison methods for the machine learning based method in this work. Moreover, the receptive field problem and dilated convolutions in section 2.3.2 are related to the Á Trous.

3.1 Ray Tracing

Ray tracing methods have become the standard way to generate photorealistic images [41]. Ray tracing is an umbrella term for different methods emulating physically based light traversal in scenes. A simple case is called ray casting, where the image plane is divided into pixels and rays are traced from the 'eye', or camera, through the pixels. The rays go through the pixel centers into the three-dimensional (3D) scene and intersect with the nearest surface. Ray casting evaluates the pixel color based only on the first surface point and does not accumulate other rays in the scene.

Ray tracing expands this so that more rays are calculated in the scene. The simplest form of ray tracing is called Whitted-style [42], where after the first ray intersection new rays can be created from the intersection point in new directions. These can be secondary rays, shadow rays or reflection rays. With these rays, ray tracing can simulate reflections, shadows and refractions.


In comparison, in rasterization, the primary method for real-time 3D rendering, the idea is to project the primitives of the scene onto the screen, filling the grid of pixels. This can be done by transforming the 3D points of a primitive by a transformation matrix acquired from the model and camera [43, p. 21-22]. Even though computational devices have evolved a lot, the computational requirements of ray tracing are the main limiting factor keeping it from replacing rasterization as the real-time rendering technique of choice.

3.2 Path Tracing

Path tracing is a Monte Carlo method for approximating the rendering equation [44]:

L_o(x, \omega_o) = L_e(x, \omega_o) + \int_\Omega f_r(x, \omega_i, \omega_o) \, L_i(x, \omega_i) \, (\omega_i \cdot n) \, d\omega_i,    (3.1)

where

• x is a 3D point on a surface
• \omega_i, \omega_o are the directions of incoming and outgoing light
• \Omega is the sphere of all directions from the point on the surface
• \int_\Omega \ldots \, d\omega_i is the integral over the sphere
• f_r is a bidirectional scattering distribution function (BSDF) described by the material of the surface
• L_e is the luminance emitted from the surface point
• L_i, L_o are the incoming and outgoing luminance
• n is the surface normal
• \omega_i \cdot n is the weakening factor of outward luminance due to the incident angle, also written as \cos(\theta_i).

On the right side of the equation, the integral over every direction of the sphere is also recursive. This means that a ray bouncing in the scene has to be integrated over all possible directions at every bounce. Therefore, the equation does not have a closed form solution. The light transport equation is also illustrated in figure 3.1.

In Monte Carlo path tracing the integral over \Omega is approximated with the estimator

F_N = \frac{1}{N} \sum_{i=1}^{N} \frac{f(X_i)}{p(X_i)},    (3.2)

where F_N converges towards the integral as N grows, X_i is a random direction from the sphere \Omega and p(X_i) is the probability density function of X_i over the sphere \Omega. Furthermore, because path tracing only approximates the rendering


Figure 3.1. Light transport equation. The outgoing radiance L_o is calculated by the integral over the sphere \Omega of all incoming radiance L_i, scaled by the BSDF of the surface material and the weakening factor \omega_i \cdot n acquired from the incoming direction and the surface normal.

equation, it results in a noisy final image. It has been shown that the error decreases at a rate of O(N^{-1/2}), where N is the number of samples [45, p. 643]. So, for example, to halve the error of the Monte Carlo estimate, the number of samples has to be quadrupled. Furthermore, the convergence rate of the Monte Carlo estimator is independent of the dimensionality of the integrand [45, p. 643].

One way to improve the efficiency of the Monte Carlo method is to utilize importance sampling [45, p. 688]. The idea is to use a probability function p(X_i) similar to f(X_i) in the integrand of equation 3.2. This way, the samples where the value of the integrand is larger are more 'important' for the estimate. For example, if a direction is sampled nearly perpendicular to the surface normal, the weakening factor calculated with \omega_i \cdot n, or \cos(\theta_i), is nearly 0, and thus the contribution to the final value is small [45, p. 688]. In practice, in path tracing the samples could be directed toward light sources.
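An illustrative one-dimensional sketch of equation 3.2: estimating ∫ cos(θ) dθ = 1 over [0, π/2] with uniform sampling and with cosine-proportional importance sampling (here p matches f exactly, the ideal case, so the variance vanishes). Quadrupling N roughly halves the uniform estimator's error:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_uniform(n):
    # Equation 3.2 with X_i ~ Uniform(0, pi/2), so p(X_i) = 2/pi.
    x = rng.uniform(0.0, np.pi / 2.0, n)
    return np.mean(np.cos(x) * (np.pi / 2.0))

def mc_importance(n):
    # Importance sampling with p(x) = cos(x); inverse CDF is arcsin(u).
    x = np.arcsin(rng.uniform(0.0, 1.0, n))
    return np.mean(np.cos(x) / np.cos(x))   # f/p = 1 for every sample

for n in (64, 256, 1024):
    print(n, abs(mc_uniform(n) - 1.0), abs(mc_importance(n) - 1.0))
```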

In summary, Monte Carlo path tracing is an approximation of the light transport equation 3.1 and can create realistic images with soft shadows, reflections, refractions and global illumination. Still, the approximation is noisy, and to decrease the noise the number of samples for the integral must be increased. This is computationally expensive, and to do it in real-time the number of samples must be limited. Therefore, the output must be post-processed to achieve better quality. In short, the quality of the real-time output becomes a trade-off between the number of samples calculated per pixel and the complexity of the post-processing method. This trade-off is further explored in this thesis.

3.3 Feature Buffers

With path tracing it is possible to generate feature buffers, as shown in figure 3.2. Moreover, in a system with only one primary sample per pixel, the feature buffers and primary rays can be calculated with rasterization, as done in [1, 46, 47]. This also means that the feature buffer from the first ray is 'noise-free'. From this first ray intersection, for example the depth, 3D world position, texture and normal of the surface can be collected and used for post-processing. Furthermore, unlike in photographic image denoising, these feature buffers can be used to guide edge avoidance when reconstructing the final image when the sample count is low.

Surface normal is a 3D value describing the normal at the hit point of the primary ray.

The normal buffer can also be projected to the camera as a Two-Dimensional (2D) 'view space surface normal'. The surface normal alone is usually not enough to avoid edges in a 3D scene, as different objects at different depths can have the same normal, as seen in figure 3.2 where the normals of the tables blend with the floor.

Depth is the distance from the surface point of the primary ray to the camera. Depth is related to the world space coordinates of the hit point and can be mapped back to them by an inverse transform using the view space x and y coordinates. As the depth values may differ a lot even on the same surface, the depth by itself does not help edge avoidance much. A better way to utilize the depth values is, for example, to calculate the gradient \nabla Z of the depth values and use that, as done in the work by Schied et al. [1].

Albedo describes the texture color at the hit point. It is usually helpful to save the texture information in a separate buffer: it can for example be removed from the noisy image so that the untextured irradiance is denoised, and added back to the image afterwards, as done in related work [1, 46, 48].

The variance buffer describes the statistics of the sample variance. In a system with only one sample per pixel, the variance cannot be estimated from the single sample, and thus it is estimated for example using either spatial estimation from neighboring samples or samples from previous frames, as done in other work by Schied et al. [1]. In a system with multiple samples per pixel, the sample variance can be estimated directly.

The estimation is usually done only with the luminance of the RGB values, which yields a one-channel value.


Figure 3.2. Examples of feature buffers generated with path tracing: normal, depth, albedo and variance buffers.


In systems where there are multiple primary samples and the samples are also randomized, the g-buffer is also noisy. But for a system like the one described in the next section, it is better to use just one primary sample per pixel to achieve real-time performance.

3.4 Real-Time Path Tracing

Current real-time path tracing implementations have usually been based on very low sample counts, for example only 1 sample per pixel, using a fast blurring denoising filter to reconstruct the final image [1, 46, 48]. With dedicated hardware for ray traversal and, for example, the power of cloud computing, an interesting path for real-time implementations would be to generate more samples per pixel. For offline rendering without a time budget this could also mean that the primary sample from the camera varies in space, with which the final fully converged image is anti-aliased. For real-time systems this would require more ray traversal calculations and noisy feature buffers, which would further complicate the post-processing methods. Moreover, there are already real-time anti-aliasing methods that produce good quality anti-aliased results, for example Temporal Anti-Aliasing (TAA) [49], and it can be successfully applied in real-time after other denoising methods, as done in [1, 46].

Furthermore, temporal reuse of samples is also used in real-time path tracing, as


Figure 3.3. Example path tracing setup. The primary rays from the camera are illustrated as red arrows and the secondary rays as blue arrows. The dashed arrows are shadow rays. In this example there is one primary ray per pixel and two secondary rays.

in related work [1, 46, 48]. This method requires post-processing of the samples by moving the samples with motion vectors and dropping occluded samples from reuse.

Temporal reuse of samples causes problems with dynamic environments, such as moving lights and animations, where the samples might change heavily in intensity between frames.

Furthermore, a method to preprocess the reuse of samples has also been proposed by Schied et al. in [50]. Also, in a distributed setting where the path tracing is distributed between different computing units, the temporal data is more difficult to reuse, for example with a moving camera where the smaller rendered tiles do not share the temporal data.

However, in this master's thesis temporal pre- and post-processing methods are not used, to simplify the denoising process, and the denoising is applied only to a single frame of samples. Temporal denoising is an interesting topic and the direction is further explored in the future work Chapter 6. For a real-time path tracing setup without temporal reuse of samples, the setup could be as follows: one primary sample per pixel and two secondary paths from the first intersection. From each of these intersections a shadow ray is cast to a random point in a random light. This kind of setup is illustrated in figure 3.3.

In the example setup, one consideration is that the primary rays are rasterized so as to generate the noise-free feature buffers and to decrease the computational workload of multiple primary samples. As such, multiple samples per pixel in this work means that the multiple samples are generated only after the first primary ray. So, for example, 8 spp in


Figure 3.4. Noisy path tracing outputs at 8, 16, 32 and 64 samples per pixel.

this setup means that 8 secondary rays are cast after the first primary ray. This is opposed to an offline setup, for example the one used by Bako et al. in [12], where the primary rays are also randomized, which generates noisy feature buffers for reconstruction. Path traced images for different amounts of samples per pixel with the previously described online setup are shown in figure 3.4.

The Monte Carlo integration of path tracing is a time consuming process. Moreover, as each sample is independent from the others, the problem is 'embarrassingly parallel', meaning that the computation of each sample can be distributed. For example, given a system which is able to compute 1 spp path tracing in real-time at 60 fps (~16 ms per frame), a distributed system with 64 similar units is able to generate an effective 64 spp when the results are accumulated. However, in practice, the time for path tracing more samples per pixel may instead be used for denoising a lower spp result to achieve better quality with the same computation time. The trade-off between spp and the denoising results is further explored in this thesis.

3.5 Denoising

As stated previously, the computational demands of real-time path tracing are too high to generate an output image without noise. Therefore, many methods have been proposed to reduce the noise and variance of a noisy path traced image with only a few samples per pixel. The focus of this thesis is the real-time denoising of these samples, so most of the offline methods are not explored. But as dedicated machine learning inference hardware has appeared, some of the work done on machine learning denoisers is visited, as the usage of CNNs for denoising in real-time has become possible.

As halving the error of the output requires quadrupling the number of samples [45, p. 643], there is a point where using a denoising filter can create a better quality image with less complexity than just increasing the number of samples. Moreover, the time requirements for real-time path tracing, and the time consumed just generating the noisy image, place even more demanding time restrictions on the denoising algorithm. For example, the time constraint of 60 frames per second (fps) gives the path tracing and reconstruction algorithm ~16 ms per frame.

Methods for denoising the path traced image include simple blurring filters such as the bilateral blur [51] and its multi-resolution variant Á Trous [52], which has been successfully implemented for real-time path tracing denoising in Spatiotemporal Variance-Guided Filtering (SVGF) [1]. As the name suggests, SVGF uses the spatio-temporal sample variance as the edge avoiding statistic for the bilateral filter. The bilateral filter and the Á Trous filter are introduced more comprehensively in sections 3.5.1 and 3.5.2. Also, a blockwise linear regression method is used by Koskela et al. in Blockwise Multi-Order Feature Regression (BMFR) [46] for real-time denoising.

A real-time machine learning based solution using neural bilateral grids has been proposed by Meng et al. [48]. As dedicated machine learning inference hardware has appeared, Meng also compares the proposed method with some previous machine learning methods, such as a Multi-Resolution variant of Kernel Prediction CNN (MR-KP) [53] and the OptiX Neural Network Denoiser (ONND) derived from the work by Chaitanya et al. [3], which have previously been labeled as 'interactive' rather than real-time. Meng suggests these as interesting comparison points for real-time path tracing denoising. Moreover, the implementations of MR-KP and ONND run in the order of tens of milliseconds on state-of-the-art GPUs for a frame size of 1280x720 and are not able to produce real-time denoising.

One interesting way to denoise the Monte Carlo approximation is to split equation 3.1 into different components. One way is to separate the direct and indirect illumination components, as done in other work [1, 47]. Another way is to separate the diffuse and specular components, as done by Bako et al. in [12]. In these cases the components are reconstructed in separate pipelines and added together afterwards. This enables, for example, using a different scale for the input samples, as done by Bako by denoising the specular component in logarithmic scale, with better results than in linear scale. However, for real-time use the separation of the components has not been found to be efficacious in other work [46, 50].


3.5.1 Bilateral Filter

The bilateral filter is an edge-preserving nonlinear local filter [51]. It is essentially an extension of a Gaussian blur:

w(p) = e^{-\frac{(p-q)^2}{2\sigma^2}},    (3.3)

where p is the filtered pixel coordinate, q is the sample coordinate in the kernel and \sigma is the standard deviation of the Gaussian distribution. The bilateral filter extends this by adding a color intensity difference to the Gaussian filter:

w_b(p) = e^{-\frac{(p-q)^2}{2\sigma_d^2} - \frac{|I(p)-I(q)|}{2\sigma_l^2}}.    (3.4)

Here I(p) and I(q) are the color intensity values of the samples, and \sigma_d and \sigma_l are the standard deviations for the spatial distance and the color intensity values, respectively.

In path tracing, for example, the depth buffer and the normals of the first rays offer good information for edge preservation. The usage of these separate features for edge preserving is called cross-bilateral filtering [54, 55]. In path tracing these can be added to the filter kernel, as done in [1, 56], for example as:

w_n(p) = \max(0, n(p) \cdot n(q))^{\sigma_n},    (3.5)

where n(p) and n(q) are the surface normals of the filtered pixel and the sample. The depth can be added as:

w_z(p) = e^{-\frac{|Z(p)-Z(q)|}{\sigma_z |\nabla Z(q) \cdot (p-q)| + \epsilon}},    (3.6)

where Z(p) is the depth value of the sample, \nabla Z(q) is the gradient of the depth, and \epsilon is used to avoid division by zero in cases where the gradient of the depth values is zero.
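A sketch combining the weights of equations 3.4-3.6 for a single pixel pair p, q (an assumed helper with illustrative parameter defaults, not the thesis implementation):

```python
import numpy as np

def cross_bilateral_weight(p, q, I, n, Z, gradZ,
                           sigma_d=4.0, sigma_l=4.0, sigma_n=128.0,
                           sigma_z=1.0, eps=1e-6):
    dp = np.asarray(p, dtype=np.float64) - np.asarray(q, dtype=np.float64)
    # Spatial and luminance terms (equation 3.4).
    w = np.exp(-dp @ dp / (2 * sigma_d**2) - abs(I[p] - I[q]) / (2 * sigma_l**2))
    # Normal term (equation 3.5).
    w *= max(0.0, float(np.dot(n[p], n[q]))) ** sigma_n
    # Depth term (equation 3.6), with eps guarding a zero depth gradient.
    w *= np.exp(-abs(Z[p] - Z[q]) / (sigma_z * abs(np.dot(gradZ[q], dp)) + eps))
    return w
```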

The bilateral filter has also been extended into a grid format for real-time image processing by Chen et al. [57]. The bilateral grid extends the 2D image with a third dimension that captures the intensity range of the bilateral filter for each 2D pixel. This idea is used in the work by Meng et al. [48] to combine a machine learning filter with bilateral filtering.


Figure 3.5. Á Trous with step sizes 1, 2 and 3. Each step size increases the receptive field by 2^n, where n is the step size - 1 [52].

3.5.2 Á Trous

Á Trous ("algorithm with holes") is a specialized bilateral filter. The idea is to run multiple passes of the same bilateral blur filter in different frequencies [52]. Á Trous is also called the discrete wavelet transform where the bilateral blur filter is run in a dithering pattern.

The dithering pattern is used to increase the receptive field of the bilateral filter consid- ering only a small set of the samples in a larger window. The kernel with the dithering pattern is illustrated in figure 3.5.
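A minimal sketch of how the dithered tap pattern spreads with the step size, matching figure 3.5; the 5x5 tap layout is an assumption commonly used with Á Trous kernels.

```python
def atrous_offsets(step_size):
    # Offsets of a 5x5 tap pattern dilated by 2^(step_size - 1): the
    # number of taps stays at 25 while the covered window grows with
    # each consecutive pass.
    spread = 2 ** (step_size - 1)
    taps = range(-2, 3)
    return [(dy * spread, dx * spread) for dy in taps for dx in taps]
```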

Most notably, the Á Trous algorithm has been used successfully to denoise real-time path tracing in SVGF. For path tracing at low sample counts, SVGF must scale the intensity weight of the samples. It does this by modifying the intensity weight in equation 3.4:

w_l(p) = e^{-\frac{|I(p)-I(q)|}{\sigma_l \sqrt{g_{3\times3}(Var(I(p)))} + \epsilon}},    (3.7)

where \sqrt{g_{3\times3}(Var(I(p)))} is the sample variance acquired from the variance buffer, prefiltered with a 3x3 Gaussian kernel g_{3x3}.

In addition, SVGF updates the variance estimation Var(I(p)) for every iteration of the Á Trous filter. SVGF also uses the values σ_z = 1, σ_n = 128 and σ_l = 4 as estimates of the standard deviations for the bilateral filter. These variables control the edge detection weight of the features. For real-time path tracing with one sample per pixel, SVGF uses 5 iterations of Á Trous with step sizes of 2^n.
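A sketch of the variance-scaled intensity weight of equation 3.7; var_p is assumed to already be the 3x3 Gaussian prefiltered sample variance for the pixel.

```python
import numpy as np

def svgf_luminance_weight(I_p, I_q, var_p, sigma_l=4.0, eps=1e-8):
    # Equation 3.7: noisy regions (high variance) get a looser
    # edge-stopping criterion, while converged regions keep a strict one.
    return np.exp(-abs(I_p - I_q) / (sigma_l * np.sqrt(var_p) + eps))
```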


4. NETWORK DESIGN

In this chapter, three different kinds of fast neural networks were built for path tracing reconstruction. The main difference between the networks was the method they used to increase the receptive field for denoising. The first was a Simple Convolutional Neural Network (SCNN), which increases the receptive field by using two consecutive convolutional layers with a larger kernel first. The second model was a Dilated Convolutional Neural Network (DCNN) [19], where the receptive field is increased with dilation in the convolutional layers. The third model was a smaller implementation of the UNet (SUNet) [18], where the receptive field is increased by pooling operations between the layers.
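As a rough illustration of the three receptive-field strategies, the following Keras sketch builds minimal stand-ins for the SCNN, DCNN and SUNet variants; the layer counts, filter widths and kernel sizes here are illustrative assumptions, not the exact architectures used in this thesis.

```python
import tensorflow as tf
from tensorflow.keras import layers

def scnn(inp):
    # Receptive field grows via two consecutive larger-kernel convolutions.
    x = layers.Conv2D(32, 5, padding='same', activation='relu')(inp)
    x = layers.Conv2D(32, 5, padding='same', activation='relu')(x)
    return layers.Conv2D(3, 3, padding='same')(x)

def dcnn(inp):
    # Receptive field grows via increasing dilation rates.
    x = layers.Conv2D(32, 3, dilation_rate=1, padding='same', activation='relu')(inp)
    x = layers.Conv2D(32, 3, dilation_rate=2, padding='same', activation='relu')(x)
    x = layers.Conv2D(32, 3, dilation_rate=4, padding='same', activation='relu')(x)
    return layers.Conv2D(3, 3, padding='same')(x)

def sunet(inp):
    # Receptive field grows via pooling and upsampling with a skip
    # connection; assumes even spatial dimensions.
    e = layers.Conv2D(32, 3, padding='same', activation='relu')(inp)
    d = layers.MaxPooling2D(2)(e)
    d = layers.Conv2D(64, 3, padding='same', activation='relu')(d)
    d = layers.UpSampling2D(2)(d)
    x = layers.Concatenate()([e, d])
    return layers.Conv2D(3, 3, padding='same')(x)
```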

4.1 Data Collection

The data in this work was collected by rendering 8, 16, 32 and 64 spp path traced images in three scenes: Classroom, Sponza and GlossySponza. The path tracing setup used was the same as described in section 3.4, so that there is one primary ray per pixel followed by 8, 16, 32 and 64 secondary rays from the first intersection, with shadow rays from each hit point to a random light.

For each scene, about 600 frames were collected for training and 120 for evaluating the methods. The training data was collected for each scene specifically so that the networks can learn scene specific information. This could be utilized in more complex cases, for example by making the network specialization part of a 'baking' step for a game level so that it works better in those specific areas or scenes, as proposed in the work by Chaitanya et al. [3].

4.2 Starting Points

For our problem, the starting points were the following. The input frame was of size 1280x720 with floating point values in 3 RGB channels, which was the accumulated mean from multiple samples per pixel. For guiding the reconstruction, multiple G-buffers were also available: the variance buffer for the luminance of the pixel computed from the RGB value (1 channel), the noise-free albedo of the texture (3 channels), the world space normal (3 channels), the distance from the camera, i.e. depth (1 channel), and the roughness or gloss map of the material (1 channel). The hardware targeted was a single NVIDIA RTX 2080 Ti with Tensor Cores.

Figure 4.1. As an input optimization the albedo was removed from the noisy RGB and the 3D world space normals were transformed to the 2D view space normals.

4.3 Network Inputs

The first two simplifications for the denoising network were to remove the noise-free albedo from the RGB channels by dividing the RGB values by the albedo, and to calculate the view space normal from the world space normal. The albedo division was done the same way as in [12]:

\tilde{c}_{diffuse} = c_{diffuse} \oslash (f_{albedo} + \epsilon),    (4.1)

where ⊘ is the elementwise (Hadamard) division and ε = 0.00316. The efficiency of this division for neural networks has been tested in the work by Chaitanya et al. [3], and other real-time path tracing denoisers do this as well [1, 46]. The view space normals were acquired by transforming the world space normals with the inverse of the camera matrix. The view space normals simplify the input for the network by having only two channels instead of the three of the world space normals. These two optimizations for the inputs were done before other testing to define the input space for the problem. The input feature optimizations are illustrated in figure 4.1.
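A minimal TensorFlow sketch of the albedo demodulation of equation 4.1 and its inverse; the function names are illustrative.

```python
import tensorflow as tf

EPS = 0.00316  # epsilon from equation 4.1

def demodulate(noisy_rgb, albedo):
    # Divide out the first-bounce albedo so the network sees a
    # texture-free, lower-frequency signal.
    return noisy_rgb / (albedo + EPS)

def remodulate(denoised, albedo):
    # Multiply the albedo back in after denoising.
    return denoised * (albedo + EPS)
```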

4.4 Training

The neural networks were trained using TensorFlow [58]. Each network was set to train for 50 epochs with the Adam optimizer [11] and a 0.001 learning rate. Once the training loss had not changed for 3 consecutive epochs, the training was continued with a 0.0001 learning rate for another 50 epochs and finished after the training loss again showed no change for 3 consecutive epochs. As the training images from each scene differed only by the sample count, with different input RGB values and corresponding sample variances, each of the different sample count datasets was retrained from one base network. So, for each scene a network was first trained with the 32 spp input dataset, and the corresponding networks for 8, 16 and 64 spp were trained by retraining the 32 spp network with the respective spp dataset using only the 0.0001 learning rate. This decreased the training time considerably for all the networks. An example training run for the networks is shown as a function of training loss and epochs in figure 4.2.

Figure 4.2. Training losses for the 32 spp training set first, then for 8 spp, 16 spp and 64 spp using the 32 spp trained model as a base. As the different spp models share similar input sets with different noise levels, using a model trained with a different spp as a base significantly decreases the training time for the consecutive models.
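A hedged TensorFlow sketch of this two-stage schedule; the early-stopping callback approximates the "no change in training loss for 3 consecutive epochs" rule, and the model and dataset variables are placeholders.

```python
import tensorflow as tf

def train_two_stage(model, dataset):
    # Stop when the training loss has not improved for 3 epochs.
    stop = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)
    # Stage 1: up to 50 epochs with Adam at lr = 0.001.
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss='mae')
    model.fit(dataset, epochs=50, callbacks=[stop])
    # Stage 2: continue with lr = 0.0001 under the same stopping rule.
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss='mae')
    model.fit(dataset, epochs=50, callbacks=[stop])
    return model
```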

For all the CNNs, the input features included the noisy RGB (3 channels), sample variance (1 channel), view normal (2 channels) and depth (1 channel), so 7 input channels altogether. The gloss or roughness input, also used in [3], was dropped from the final input sets as the path tracer used simple and almost uniform values for it in the used models. The inputs used for each CNN are shown in figure 4.3.


Figure 4.3. Training images for each input and output. The input consisted of 7 channels: noisy RGB without first bounce albedo (3 channels), variance (1 channel), view normal (2 channels) and depth (1 channel). The albedo was reapplied in the network as a multiplication, and the output was compared against the training truth of a 4096 samples per pixel converged path traced image.

4.4.1 Loss Function

The loss function for the path tracing denoising network was chosen based on experiments in previous work on multiple samples per pixel denoising by Bako et al. [12]. In their work the l1 loss achieved the best results; they argued that the l1 loss is less sensitive to the outliers produced by low sample count path tracing than, for example, the l2 loss. For pixel-wise prediction the l1 or MAE loss in equation 2.6 can be written as:

\sum_{y=0}^{H} \sum_{x=0}^{W} |I^r_{x,y} - f_{x,y}(I^n)|,    (4.2)

where the loss is calculated for the whole denoised image f_{x,y}(I^n) compared to the reference I^r_{x,y} over the whole height H and width W.
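In TensorFlow, a minimal version of this loss can be written as follows; reducing by the mean instead of the sum only rescales the loss and keeps it independent of the frame resolution.

```python
import tensorflow as tf

def l1_loss(reference, denoised):
    # Mean absolute per-pixel error over the frame (equation 4.2).
    return tf.reduce_mean(tf.abs(reference - denoised))
```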

4.4.2 Activation Functions

Each model has a nonlinear activation function after each convolutional layer except the last, which has a linear activation. The linear activation is used so that the output is not limited and is able to conform to the High Dynamic Range (HDR) of path tracing.

Six different activation functions were tested for one scene: tanh, ReLU, Swish, Scaled Exponential Linear Unit (SELU), Leaky Rectified Linear Unit (Leaky ReLU) and identity (linear). Previous work suggests that for shallow networks the tanh activation function works better [20]. Moreover, the single image super-resolution problem and other image processing problems do not usually consider high dynamic range images, but image problems where the values are in the range of 0.0 to 1.0. The work done by Bako et al. in [12] suggests applying a logarithmic transform to the specular component before denoising to reduce the range of the color values with the function:

\tilde{c}_{specular} = \log(1 + c_{specular}).    (4.3)

In our work, we do not separate the diffuse and specular components for the denoising because of real-time requirements, so this logarithmic function has to be used for the diffuse component also. Moreover, the high dynamic range, with or without the logarithmic transform, has an effect on the activation function behavior of the neural network. The six activation functions were tested with and without the logarithmic transform. Instead of using the transformation for only the specular component, the logarithmic transform was tested after the albedo division from the color values as \tilde{c} = \log(1 + c):

\tilde{c} = \log(1 + (c \oslash (f_{albedo} + \epsilon))),    (4.4)

and after the network the inverse was applied:

c = (\exp(\tilde{c}) - 1) \otimes (f_{albedo} + \epsilon).    (4.5)
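A minimal TensorFlow sketch of the transform pair in equations 4.4 and 4.5; the epsilon is the same value as in the albedo demodulation, and the function names are illustrative.

```python
import tensorflow as tf

EPS = 0.00316

def to_log_space(color, albedo):
    # Equation 4.4: demodulate the albedo, then compress the HDR range.
    return tf.math.log1p(color / (albedo + EPS))

def from_log_space(pred, albedo):
    # Equation 4.5: expand back to linear HDR and remodulate the albedo.
    return tf.math.expm1(pred) * (albedo + EPS)
```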

The tests for each activation function are shown in figure 4.4. The ReLU, Leaky ReLU
