
4. Network Design

4.5 Architectures

After the input space and other hyperparameters were defined, the next step was to define the architectures for the CNNs. Many machine learning approaches have recently started to target real-time applications as well, but for path tracing the real-time budget is very small considering the ∼16 ms per frame budget. In some recent work on real-time denoising, such as that of Meng et al. [48], the machine learning approach is still behind the state-of-the-art real-time denoising. The 8 ms achieved in that work [48] with the same hardware fits the real-time budget in theory, but in practice the rest of the path tracing work takes a larger slice out of the frame budget depending on the model.

Therefore, state-of-the-art denoising timings were used as reference points for real-time path tracing in this work. The target inference time was considered to be somewhere from 1 to 8 ms based on related work [1, 46, 48], which have timings of 1.6 ms, 4.4 ms and 8 ms respectively, as measured by Meng et al. with state-of-the-art GPUs [48]. This would leave the rest of the path tracing work 6-15 ms per frame.

The first thing to consider about the 1-8 ms timing is that it has a big impact on the choice of architectures. Even many real-time solutions for video denoising, such as the one by Shi et al. [20], cannot be considered as such here, because the real-time constraints in those implementations are not as strict. Another thing, which is beneficial when multiple paths are traced per pixel, is that the effective accumulation of samples already makes the output less noisy (this can be seen in figure 3.4). This can simplify the denoising network: fewer convolutions and a smaller receptive field are needed, which decreases the depth of the CNN, as the receptive field in image processing CNNs is usually increased by adding more hidden layers.

Three main CNNs were considered. The first was a Simple Convolutional Neural Network (SCNN), inspired by the work done in [20], where the super-resolution problem was solved in real time without first upsampling the image before the network: the low-resolution image is fed to the network as is, and the upscaling is done as the last layer of the CNN. For our problem upscaling is not needed, but the simple idea of increasing the receptive field for the reconstruction maps inside the network, by applying two consecutive convolutional layers with a larger kernel first, is useful. The SCNN used is depicted in figure 4.5.

Figure 4.5. SCNN
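To make the layout concrete, the following is a minimal Keras sketch of such a network. It assumes the 5x5-3x3-3x3 kernel layout of Shi et al. [20], an input of 1280x720 pixels with 7 feature channels, and a 3-channel output; these are assumptions, and the actual configuration is the one in figure 4.5.

    import tensorflow as tf

    # A minimal sketch of an SCNN-style network, assuming a 5x5 -> 3x3 -> 3x3
    # layout as in Shi et al. [20]; see figure 4.5 for the actual architecture.
    def build_scnn(f1=32, f2=16):
        inputs = tf.keras.Input(shape=(720, 1280, 7))  # noisy frame + auxiliary features
        x = tf.keras.layers.Conv2D(f1, 5, padding="same", activation="relu")(inputs)
        x = tf.keras.layers.Conv2D(f2, 3, padding="same", activation="relu")(x)
        outputs = tf.keras.layers.Conv2D(3, 3, padding="same")(x)  # reconstructed RGB
        return tf.keras.Model(inputs, outputs)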

The second model is a Dilated Convolutional Neural Network (DCNN) [19], where the receptive field is increased with dilation in the convolutional layers. The dilation rates are set to follow the same spacing pattern as in the À-Trous algorithm. The DCNN is depicted in figure 4.6.

Figure 4.6. DCNN
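As an illustration, a hypothetical sketch of such a dilated network is shown below, with the dilation rate doubling per layer (1, 2, 4) in the À-Trous fashion; the widths and depth here are assumptions, and the actual configuration is the one in figure 4.6.

    import tensorflow as tf

    # A minimal sketch of a DCNN-style network with À-Trous-like dilation
    # rates (1, 2, 4); widths and depth are assumptions, see figure 4.6.
    def build_dcnn(f1=32, f2=16):
        inputs = tf.keras.Input(shape=(720, 1280, 7))
        x = tf.keras.layers.Conv2D(f1, 3, padding="same", dilation_rate=1,
                                   activation="relu")(inputs)
        x = tf.keras.layers.Conv2D(f2, 3, padding="same", dilation_rate=2,
                                   activation="relu")(x)
        outputs = tf.keras.layers.Conv2D(3, 3, padding="same", dilation_rate=4)(x)
        return tf.keras.Model(inputs, outputs)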

The third tested model was a smaller implementation of the UNet (SUNet) [18], where the receptive field is increased by pooling operations between the layers. The problem for denoising with a UNet is the need for skip connections because of the subsampling done by the pooling layers. As our target was a shallow CNN, the original motivation for the skip connections, better backpropagation and learning of the network [24], is not as relevant, because in shallow networks the vanishing gradient is not a problem. But because the skip connections increase the computational complexity, this must be considered for the timing of the network. The tested SUNet architecture is depicted in figure 4.7.

Figure 4.7. SUNet
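A hypothetical sketch of such a small encoder-decoder with one pooling level and a single skip connection is shown below; the actual depth and widths are those of figure 4.7.

    import tensorflow as tf

    # A minimal sketch of a small UNet-style network (SUNet) with one pooling
    # level and one skip connection; depth and widths are assumptions, see
    # figure 4.7 for the actual architecture.
    def build_sunet(f1=32, f2=16):
        inputs = tf.keras.Input(shape=(720, 1280, 7))
        e = tf.keras.layers.Conv2D(f1, 3, padding="same", activation="relu")(inputs)
        x = tf.keras.layers.MaxPooling2D(2)(e)    # subsampling grows the receptive field
        x = tf.keras.layers.Conv2D(f2, 3, padding="same", activation="relu")(x)
        x = tf.keras.layers.UpSampling2D(2)(x)
        x = tf.keras.layers.Concatenate()([x, e])  # skip connection restores detail
        outputs = tf.keras.layers.Conv2D(3, 3, padding="same")(x)
        return tf.keras.Model(inputs, outputs)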

Table 4.1. Inference times for the SCNN with different convolutions and input size 1280x720x7. The computational costs were calculated as per equations 2.9 and 2.13.

Layer              Feature maps (layer 1)   Feature maps (layer 2)   Computational cost (10^9)   GeForce GTX 1070
Conv2D             32                       16                       9.8                         9.2 ms
Conv2D             64                       32                       28                          15.8 ms
Conv2D             256                      256                      590                         154.3 ms
SeparableConv2D    32                       16                       1.3                         14.3 ms
SeparableConv2D    64                       32                       3.3                         24.5 ms
SeparableConv2D    256                      256                      67                          132.0 ms

4.5.1 Convolutions and Feature Maps

The main driving factor for the choice of convolutions and feature map sizes was the speed of the network. Therefore, the original sizes of the networks were based on previous work [20, 61] and on rough estimates of the computational complexity of the problem, considering the input size and the hardware used. In addition, a few test runs were done to get an estimate of the inference speeds for different kinds of networks. Timings for different sizes of networks are shown in table 4.1 for the SCNN architecture with different numbers and kinds of convolutions in each layer.
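These rough estimates can be reproduced with a short script. The sketch below counts multiply-accumulate operations per layer, assuming the standard per-layer cost formulas (cf. equations 2.9 and 2.13), the 5x5-3x3-3x3 layout and a 3-channel output; with these assumptions it matches the cost column of table 4.1 to within rounding.

    # Rough cost estimates for the SCNN variants of table 4.1, assuming the
    # standard per-layer multiply-accumulate counts (cf. equations 2.9, 2.13):
    #   Conv2D:          H * W * k^2 * c_in * c_out
    #   SeparableConv2D: H * W * (k^2 * c_in + c_in * c_out)
    H, W, C_IN, C_OUT = 720, 1280, 7, 3

    def conv_cost(k, c_in, c_out):
        return H * W * k * k * c_in * c_out

    def sep_conv_cost(k, c_in, c_out):
        return H * W * (k * k * c_in + c_in * c_out)

    for f1, f2 in [(32, 16), (64, 32), (256, 256)]:
        dense = (conv_cost(5, C_IN, f1) + conv_cost(3, f1, f2)
                 + conv_cost(3, f2, C_OUT))
        sep = (sep_conv_cost(5, C_IN, f1) + sep_conv_cost(3, f1, f2)
               + sep_conv_cost(3, f2, C_OUT))
        print(f"{f1}/{f2}: Conv2D {dense / 1e9:.1f}e9, "
              f"SeparableConv2D {sep / 1e9:.1f}e9")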

Depthwise separable convolutions were also tested for the SCNN architecture, but on the NVIDIA GTX 1070 GPU the separable convolutions were slower than normal convolutions. This is related to the small number of convolutions in each layer; at these sizes the normal convolutions were faster. The SCNN with 32 convolutions in the first layer and 16 in the second layer runs in 9.2 ms, but the same network with separable convolutions runs in 13.9 ms. The network started to benefit from the separable convolutions only when the number of convolutions was increased for each hidden layer. For example, if the first and second SCNN layers both had 256 convolutions, the separable convolutions were faster: the same size of network without separable convolutions runs in 154 ms, while the network with depthwise separable convolutions runs in 132 ms.

Furthermore, based on these tests, feature map sizes of 32 in the first layer and 16 in the second layer were adopted for each subsequent SCNN, DCNN and SUNet model. This is half the feature map size used in previous work [20] and [61], but the first feature map size is the same as in [3]. The computational complexities, parameter counts and receptive fields of the designed models with these convolutions and feature map sizes are shown in table 4.2.

Table 4.2. Summary of the designed models. CC = Computational Cost, P = Parameters, RF = Receptive Field.

4.6 Pruning

The effect of pruning was also tested for the SCNN network. The network was pruned by setting insignificant weights to 0. Pruning levels of 50%, 60%, 70%, 80% and 90% were tested for a trained network. The pruning was done iteratively over 10 epochs for each pruning level. The effect of the different pruning levels on the SCNN network can be seen in table 4.3.
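The tooling is not named here; as one plausible realization, the sketch below uses the TensorFlow Model Optimization toolkit to ramp magnitude-based pruning up to a target sparsity over 10 epochs.

    import tensorflow_model_optimization as tfmot

    # A sketch of iterative magnitude pruning with the TensorFlow Model
    # Optimization toolkit; the actual tooling used in this work may differ.
    def prune_model(model, target_sparsity, train_ds, steps_per_epoch):
        schedule = tfmot.sparsity.keras.PolynomialDecay(
            initial_sparsity=0.0,
            final_sparsity=target_sparsity,   # e.g. 0.5, 0.6, ..., 0.9
            begin_step=0,
            end_step=10 * steps_per_epoch)    # ramp up over 10 epochs
        pruned = tfmot.sparsity.keras.prune_low_magnitude(
            model, pruning_schedule=schedule)
        pruned.compile(optimizer="adam", loss="mse")
        pruned.fit(train_ds, epochs=10,
                   callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
        # Remove pruning wrappers, leaving the zeroed weights in a plain model.
        return tfmot.sparsity.keras.strip_pruning(pruned)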

Pruning 70% of the base model's weights effectively decreased the size of the network by 57% without a loss in validation Root-Mean-Square Error (RMSE). After pruning beyond 70% of the weights to zero, the model performance started to suffer. Furthermore, in the tests on the NVIDIA GeForce GTX 1070 GPU, the sparsity of the convolutions did not speed up the inference of the network, but the decreased model size made loading the model faster. The model sizes and loading times were measured with TFLite models. As the model size of the SCNN was already quite small, the speed-up in model loading was not significant. For future work, the sparsity of the networks is an interesting direction for AI accelerators, as the new NVIDIA Ampere architecture is able to accelerate sparse matrix multiplications by up to a factor of two [36].
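The loading times reported in tables 4.3 and 4.4 can be measured along the following lines; the file name below is a placeholder.

    import time
    import tensorflow as tf

    # A sketch of measuring TFLite model loading time;
    # "scnn_pruned_70.tflite" is a placeholder file name.
    start = time.perf_counter()
    interpreter = tf.lite.Interpreter(model_path="scnn_pruned_70.tflite")
    interpreter.allocate_tensors()
    elapsed_ms = (time.perf_counter() - start) * 1e3
    print(f"Model loading time: {elapsed_ms:.3f} ms")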

Table 4.3. Summary of pruning with different levels for the SCNN model. The model could be pruned by setting 70% of the weights to 0 with no loss of accuracy.

Pruning level   Model size (bytes)   Loading time (ms)   RMSE
Base (0 %)      43844                0.092               0.027
50 %            27413                0.091               0.027

Table 4.4. Summary of quantization and of pruning combined with quantization for the SCNN model.

Model                           Model size (bytes)   Loading time (ms)   RMSE
Base                            43844                0.092               0.027
Only quantization               22461                0.084               0.027
70 % pruned with quantization   11639                0.080               0.027

4.7 Quantization

Unfortunately, at the time of the thesis, computational hardware with acceleration for 16-bit floating-point and 8-bit integer arithmetic, such as Tensor Cores [2], was unavailable. Tensor Cores have previously been used for denoising Monte Carlo images, as in [48]. This acceleration is important for running CNNs in real time in this use case, and utilizing it is postponed to future work. In related work, Tensor Cores on a Tesla V100 achieved three times the performance of CUDA Cores with 32-bit floating-point arithmetic [62]. However, quantization of the weights to 16-bit floats was tested for the SCNN, effectively halving the model size. The quantization of the model weights increased the loading speed of the model with no loss in validation RMSE. The benefit of quantization and of quantization with pruning is shown in table 4.4.
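Float16 weight quantization of this kind is available in the TFLite converter; a minimal sketch is shown below, where scnn_model stands in for the trained SCNN.

    import tensorflow as tf

    # A sketch of float16 post-training quantization with the TFLite
    # converter; "scnn_model" is a placeholder for the trained SCNN.
    converter = tf.lite.TFLiteConverter.from_keras_model(scnn_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.target_spec.supported_types = [tf.float16]  # store weights as fp16
    tflite_fp16 = converter.convert()

    with open("scnn_fp16.tflite", "wb") as f:
        f.write(tflite_fp16)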

From these results for the SCNN, DCNN and SUNet models, pruning 70 % of the weights seems to be a good target for effective model size reduction without a loss of performance. However, the effect of pruning and quantization should be validated separately for different scenes and models; these results offer a good starting point for optimizing each subsequent model. Furthermore, for the smaller CNNs the speed gain in model loading from quantization and pruning was not significant in the test setup. Larger models and different systems may benefit more from these optimizations.