
Learning-based methods are interesting in the domain of path tracing reconstruction as they can produce non-linear models and, in theory, they can learn how many weights are needed for a linear model in the given situation. Despite the interesting characteristics of machine learning-based methods, the runtime of the inference seems to be the problem. Specifically, at the time of writing, there are no published real-time machine learning-based approaches for path tracing reconstruction. However, the authors of [All+17] have indicated that their method runs in real time on current hardware [Kel+18]. In any case, this section covers some of the most interesting interactive and offline machine learning approaches to reconstruction.

3.4.1 Dataset Generation

Setting up a framework for researching neural networks for path tracing reconstruction is easy. Firstly, it is an image-based problem, so one can use some of the thoroughly researched image neural network designs. In addition, it is easy to generate a lot of training data with a path tracer: all that is needed is a path tracer, 3D models, and camera paths. Other than generating meaningful camera paths, there is no requirement to have humans labeling the data.

There are many ways in which data augmentation can be done with path tracing reconstruction [All+17]. For instance, the network can be trained with cropped frames, and it is also possible to randomly rotate and flip the inputs. Moreover, better generalization to different camera paths is achieved by randomly stopping the camera paths and randomly changing the direction in which they are played.
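As a pure-Python sketch, these augmentations might look like the following; the frame layout (a row-major list of pixel rows) and the function names are illustrative choices, not details from [All+17]:

```python
import random

def augment(frame):
    """Randomly flip and rotate a square frame (a list of pixel rows)."""
    if random.random() < 0.5:             # horizontal flip
        frame = [list(reversed(row)) for row in frame]
    if random.random() < 0.5:             # vertical flip
        frame = list(reversed(frame))
    for _ in range(random.randrange(4)):  # rotate by a multiple of 90 degrees
        frame = [list(row) for row in zip(*frame[::-1])]
    return frame

def random_crop(frame, size):
    """Train on a random sub-window instead of the full frame."""
    h, w = len(frame), len(frame[0])
    y = random.randrange(h - size + 1)
    x = random.randrange(w - size + 1)
    return [row[x:x + size] for row in frame[y:y + size]]
```

Since flips and 90-degree rotations only rearrange pixels, the noise statistics of the path-traced frame are preserved, which is what makes them safe augmentations here.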

Many previous works [All+17; Bak+17; KBS15; Vog+18] use noise-free converged frames as the target frames. However, if the network training uses a suitable loss function, the recent Noise2Noise idea can be used [Leh+18]. There, both the sample and the target frames are noisy. Generating two noisy frames with different random seeds is thousands of times faster than generating a pair of noisy and noise-free frames. This idea could even be utilized in a system that learns to reconstruct the 3D scene at interactive frame rates while the user is flying through it.

3.4.2 Network Designs

The first neural network design proposed for path tracing reconstruction was fully connected [KBS15]. However, lately Convolutional Neural Networks (CNN) have been preferred over fully-connected networks, because they have significantly fewer parameters to learn and they also do not overfit to screen space locations [All+17; Bak+17; Vog+18].

Figure 3.6 Example of a U-Net design which can be used to reconstruct path tracing at interactive frame rates [All+17]. Purple shade means convolutions, pink downsampling, and blue upsampling. Note the recurrent connections on every encoder stage and the skip connections from the encoder stages to the decoder stages.

Typically, feature buffers are given as extra input channels to the network, and the network is allowed to learn how to best utilize them. However, filtering the feature buffers with extra convolutional layers and using separate encoder stages just for them has also been proposed [Yan+19].

Instead of directly outputting the final color, the network can output kernel weights for every pixel, which are then used to construct the final color [Bak+17]. The sum of the weights is forced to be exactly one, and each weight is forced into the range from zero to one, by using a softmax activation. The advantage of this is that the final color is always a weighted sum of its neighborhood and not something completely different. In addition, the learning is faster because the kernel weights are scale independent.
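A minimal sketch of this kernel-prediction reconstruction for a single pixel, with plain Python lists standing in for tensors; the names and the flattened-window layout are illustrative, not from [Bak+17]:

```python
import math

def softmax(logits):
    """Softmax guarantees the kernel weights lie in (0, 1) and sum to one."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def apply_kernel(neighborhood, logits):
    """Reconstruct one pixel as a weighted sum of its noisy neighborhood.

    `neighborhood` is the flattened k*k window of noisy colors around the
    pixel; `logits` are the raw kernel values the network predicted for it.
    """
    weights = softmax(logits)
    return sum(w * c for w, c in zip(weights, neighborhood))
```

Because the output is a convex combination of nearby noisy colors, it can never leave the range spanned by the neighborhood, which is the robustness property described above.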

The kernel prediction process is inverted in a recent kernel-splatting network [Gha+19]. Instead of predicting every output pixel as a weighted sum of noisy pixels, one can predict which pixels should be affected by the noisy samples. This is done individually for every path tracing sample instead of for the averaged color of many samples. It is easier to learn to silence outlier samples with the kernel-splatting approach than with kernel prediction. However, the computational requirement is higher, because every sample is processed individually.
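The splatting direction can be sketched as follows; the 3×3 kernel size and the sample tuple layout are simplifications chosen for illustration, not details of [Gha+19]:

```python
def splat(samples, width, height):
    """Accumulate per-sample contributions into the output image.

    Each sample carries its pixel position, its color, and the splat weights
    predicted for a 3x3 neighborhood. An outlier sample can be silenced by
    predicting near-zero weights for it, so it affects no pixel.
    """
    color = [[0.0] * width for _ in range(height)]
    weight = [[0.0] * width for _ in range(height)]
    for (x, y, c, kernel) in samples:
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                px, py = x + dx, y + dy
                if 0 <= px < width and 0 <= py < height:
                    w = kernel[(dy + 1) * 3 + (dx + 1)]
                    color[py][px] += w * c
                    weight[py][px] += w
    # Normalize: each pixel becomes a weighted average of the samples
    # that were splatted onto it.
    return [[c / w if w > 0 else 0.0 for c, w in zip(cr, wr)]
            for cr, wr in zip(color, weight)]
```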

One possibility from the general image neural network literature is to use autoencoders, which are neural networks that combine encoder and decoder stages into one network. Autoencoders automatically learn to compress the data while preserving its most important features [LHY08]. Better handling of the high-frequency details in an autoencoder can be achieved with skip connections [RFB15]. The idea of the skip connections is to connect the same-sized layers from the encoder to the decoder.

The connections remove the possibility of using the autoencoder as a data compressor, but they work well in the denoising use case, since adding these connections makes the learning faster, reduces the vanishing gradient problem, and improves the generalization of the network [He+16; RVL12; Yan+18]. In some sources, an autoencoder with skip connections is called a U-Net because it can be visualized as a U-shaped graph.
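The requirement that skip connections join equal-sized layers can be illustrated with simple shape bookkeeping; the halving/doubling scheme and the channel counts below are illustrative assumptions, not those of any published network:

```python
def unet_shapes(input_res, enc_channels):
    """Trace (resolution, channels) through a U-Net-style autoencoder.

    Each encoder stage halves the resolution; each decoder stage doubles it
    and concatenates the same-resolution encoder output (the skip
    connection), which doubles the channel count at that stage.
    """
    skips = []
    res = input_res
    for ch in enc_channels:                  # encoder: downsample
        skips.append((res, ch))
        res //= 2
    dec = []
    for res_enc, ch_enc in reversed(skips):  # decoder: upsample + skip
        res *= 2
        assert res == res_enc                # skips join equal-sized layers
        dec.append((res, ch_enc + ch_enc))   # concatenation doubles channels
    return dec
```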

The temporal stability of the network can be improved using a so-called Recurrent CNN (RCNN), which adds temporal recurrent connections to the convolutional network design [All+17]. An example of this kind of network design can be seen in Figure 3.6. The idea of a recurrent connection is to feed the state of a layer from the previous frame as an input to the same layer, or to some earlier layer, on the next frame. Recurrent connections can also use analytical reprojection to move the most relevant data to the correct place in the layer [Vog+18]. In both cases the network learns, for example based on the roughness feature buffer, not to use temporal data when the material would suffer from temporal lag.

Also, other ideas can be used directly from image processing networks. One example of this is dropout [Sri+14], which improves generalization by randomly shutting down some of the neurons. Another example is transfer learning, where a network design and weights known to be good are used as initial values when training a network tuned for the task at hand. Interestingly, the base network can even be targeted at a very different task [TS07].

3.4.3 Loss Function

The selection of the loss function affects the learning of the network significantly.

The loss selection is interesting in the real-time context because the loss is computed only when training the network. More specifically, even a very complicated loss will not slow down the inference timings.

S. Bako et al. [Bak+17] tested different loss functions and found out that the absolute value loss function, in other words the L1 loss, is the most robust and closest to perceptual difference. However, the analysis was limited to simple loss functions that can be evaluated quickly and locally. In the case of Noise2Noise training, an L2-style loss must be used because its minimum is found at the mean of the samples, which makes it converge towards the correct result [Leh+18].
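Why the L2 loss suits Noise2Noise training can be demonstrated numerically: minimizing it against several noisy targets recovers their mean. This is a toy one-value sketch, not a training loop from [Leh+18]:

```python
def l2_loss(pred, targets):
    """Mean squared error of one predicted value against noisy targets."""
    return sum((pred - t) ** 2 for t in targets) / len(targets)

def minimize(targets, lr=0.1, steps=500):
    """Gradient descent on the L2 loss; it converges to the targets' mean."""
    pred = 0.0
    for _ in range(steps):
        # d/dpred of the mean squared error
        grad = sum(2 * (pred - t) for t in targets) / len(targets)
        pred -= lr * grad
    return pred
```

Because unbiased path tracing noise averages out to the converged color, a network that predicts the mean of noisy targets is predicting the noise-free result.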

Temporal errors such as flickering can be penalized with an L1 loss which compares the absolute difference of the temporal derivatives of the network output and the reference [All+17].
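A possible sketch of such a temporal term, with frames simplified to flat per-pixel lists; the function name and frame layout are illustrative, not from [All+17]:

```python
def temporal_l1(output_seq, reference_seq):
    """L1 loss on temporal derivatives, penalizing flickering.

    Compares the frame-to-frame differences of the network output against
    those of the reference. A temporally steady output incurs zero loss
    against a steady reference, even if it is biased; only changes over
    time that the reference does not exhibit are penalized.
    """
    loss, count = 0.0, 0
    for t in range(1, len(output_seq)):
        for o0, o1, r0, r1 in zip(output_seq[t - 1], output_seq[t],
                                  reference_seq[t - 1], reference_seq[t]):
            loss += abs((o1 - o0) - (r1 - r0))
            count += 1
    return loss / count
```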

One idea to enhance the fine details of the image is to use the L1 loss in the gradient domain [All+17]. However, better results can be achieved by comparing the internal representations of an image classification neural network and using their difference as the loss [KHL19a; Zha+18].

Loss calculations can be improved by applying some non-linearity before the calculation. For instance, a logarithm makes the loss more robust against bright outliers [Bak+17; KHL19a; Vog+18], and a modified gamma correction improves quality in the dark areas of the frame [All+17]. In the case of comparing the classification networks' internal representations, the loss calculation can be made more robust by applying random transformations to the denoiser output [KHL19b].

3.4.4 Optimizing Network for Fast Inference

Faster inference can be achieved by simplifying the overall network design[Ian+16].

Examples of simplifying the network include reducing the sizes of the convolution kernels and working with less input and output data on every layer of the network. For example, C. Alla Chaitanya et al. [All+17] use only 7 channels of data as input to the network. Moreover, every convolutional layer uses a kernel size of only 3×3. In addition, the simplification can be done automatically with so-called pruning [LDS90].

A single convolution layer can be made faster by replacing it with a combination of a purely spatial convolution and a pointwise matrix multiplication with weights along the depth axis [Sif14; Van14]. In addition, explicit upsampling layers can be made faster with subpixel convolution, which increases the size of the output while applying the convolution kernels [Shi+16].

3.4.5 Optimizing Inference of Existing Network

CNNs operate on rather big sets of data and the internal operations are multiplications and additions. Therefore, CNNs can be made faster with dedicated hardware using a data type that saves memory bandwidth and simplifies the multiplication units by using fewer bits [CBD14]. Accordingly, major hardware manufacturers have released their own dedicated machine learning cores.

On general purpose hardware, the IEEE standard's half-precision floating-point numbers can be used for accelerating CNNs [CBD14; IEE08]. In addition, a dedicated 16-bit floating-point format called bfloat16, with three extra exponent bits and three fewer mantissa bits, has been proposed [HTH19]. The main motivation of bfloat16 is reduced power consumption and physical chip area with the same network accuracy compared to IEEE's original 16-bit format. Also, the conversion to and from the 32-bit single-precision floating-point format is simple, since there are as many exponent bits in both formats.
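Because the exponent widths match, a float32 can be narrowed to bfloat16 simply by keeping the top 16 bits of its bit pattern. A sketch using truncation (real hardware typically rounds to nearest even instead):

```python
import struct

def float_to_bfloat16_bits(x):
    """Narrow a float to bfloat16 by truncating its float32 bit pattern.

    bfloat16 shares float32's sign bit and 8 exponent bits, so dropping
    the low 16 mantissa bits is all the conversion requires.
    """
    bits32 = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits32 >> 16

def bfloat16_bits_to_float(bits16):
    """Widen bfloat16 back to float32 by padding the mantissa with zeros."""
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]
```

Only 7 mantissa bits survive, so the relative precision is about 2^-8, which is the accuracy trade-off the format makes in exchange for float32's full exponent range.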