
2. Convolutional Neural Networks

2.4 Neural Networks in Real-Time

Neural networks have not conventionally been used for time-critical problems. The inherent nature of neural networks is that for the network to learn meaningful representations of the data, the network must be deep. At first the development of networks focused on creating ever deeper, and thus slower, networks [17, 24, 25]. Not only does adding more layers to a network run into the vanishing/exploding gradient problem [26, 27], but after a certain point adding more depth saturates the accuracy, which then degrades rapidly.

This is called the degradation problem, and adding more layers to a sufficiently deep network complicates the learning process for the first layers [28]. Moreover, even if the depth of the network is optimized to avoid the degradation problem, faster methods with less depth or computationally cheaper convolution models, such as depthwise separable convolutions [29], have been developed to achieve good results in real time [20, 21] for image processing problems where the previous approach had been to simply deepen the net.

The real-time requirements for neural networks can be characterized by the size of the input, the complexity of the architecture and the size of the output. The sizes of the input and output are often dictated by the task. For example, in image classification tasks the actual input image may be a rescaled version of the original image and the output just a single binary value or a probability distribution over a thousand classes [4]. In another task like super-resolution, the input is a low-resolution image that is upscaled to the resolution of the original image [20, 30]. The difficulty of the task and the sizes of the inputs and outputs constrain the choices for the architecture of the network, but many variations can be made to simplify the architecture inside the network and trade quality for speed [20, 21].

Input size depends on the problem. For image recognition problems, resizing the original input to a smaller image has helped to train the network, but there have also been methods where the image resolution is scaled up [31]. For image super-resolution the usual way has been to first upscale the low-resolution image with an analytical interpolation method and then feed the image to a UNet [30]. However, Shi et al. suggest that the network itself should do the upscaling with an efficient sub-pixel convolutional layer, and learning these filters within the network reduces the computational cost [20].
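
To make the idea of the sub-pixel convolutional layer concrete, the following is a minimal PyTorch sketch in the spirit of [20]. The feature counts, kernel sizes and upscale factor are illustrative assumptions, not the exact architecture of the reference.

```python
import torch
import torch.nn as nn

class SubPixelUpscale(nn.Module):
    """Minimal sub-pixel upscaling block: the network operates in low
    resolution and the final convolution produces r*r channel groups
    that are rearranged into an image upscaled by factor r."""

    def __init__(self, in_channels=3, features=64, upscale=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, features, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # The last convolution outputs in_channels * r^2 feature maps.
            nn.Conv2d(features, in_channels * upscale ** 2,
                      kernel_size=3, padding=1),
        )
        # Periodic shuffling: (C*r^2, H, W) -> (C, H*r, W*r).
        self.shuffle = nn.PixelShuffle(upscale)

    def forward(self, x):
        return self.shuffle(self.body(x))

low_res = torch.randn(1, 3, 100, 100)   # low-resolution input
high_res = SubPixelUpscale()(low_res)   # shape: (1, 3, 200, 200)
```

Because all convolutions run at the low resolution and only the final shuffle produces the high-resolution output, the per-layer cost stays small compared to upscaling the image before the network.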

For image denoising problems the resolution of the input image is usually the same as that of the output. Moreover, for fast path tracing reconstruction the usual way has been to accumulate samples and use statistics of the sampling, such as the mean and variance, to reconstruct the final image [3, 12]. This is also called PixelGather in [32], and this method is later used in this thesis for all the denoising methods. The work in [32] suggests denoising path tracing in sample space by splatting the samples for a CNN, but this method increases the inference time almost linearly with the number of accumulated samples, whereas for the PixelGather method it stays constant.
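
As an illustration of this kind of per-pixel accumulation, the following is a minimal sketch that reduces accumulated path tracing samples to per-pixel mean and variance. The array shapes and the function name are illustrative assumptions, not the exact pipeline of [32] or of the methods used later in this thesis.

```python
import numpy as np

def accumulate_pixel_statistics(sample_frames):
    """PixelGather-style accumulation: the samples landing in each pixel
    are reduced to summary statistics (here mean and variance) that are
    then fed to the denoiser. `sample_frames` has shape
    (num_samples, height, width, 3)."""
    samples = np.asarray(sample_frames, dtype=np.float32)
    mean = samples.mean(axis=0)       # noisy accumulated image
    variance = samples.var(axis=0)    # per-pixel sample variance
    return mean, variance

# Illustrative usage: 8 one-sample-per-pixel frames of a 100x100 image.
frames = np.random.rand(8, 100, 100, 3).astype(np.float32)
mean_img, var_img = accumulate_pixel_statistics(frames)
```

Since the network input is always these fixed-size statistics, the inference time stays constant regardless of how many samples have been accumulated.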

2.4.1 Depthwise Separable Convolution

One way to accelerate CNN computations is the so-called depthwise separable convolution [29]. This type of convolution separates the convolution into two parts: first a depthwise pass for filtering and second a 1x1 pointwise convolution for combining.

They were used successfully for mobile and embedded vision applications in [21]. The computational cost of depthwise separable convolutions can be calculated with [21]:

C_K · C_K · I_W · I_H · I_D + I_D · O_D · I_W · I_H,    (2.13)

where the first part before the sum is the depthwise filtering and the second part is the pointwise combination. The number of parameters for depthwise separable convolutions can be calculated with C_K · C_K · I_D + I_D · O_D.

Now again consider the example where a 100x100 RGB input image is used as input features, the output features are the same size and a convolution kernel of size C_K = 3 is used. The computational complexity for the depthwise separable convolution is (3 · 3 · 100 · 100 · 3) + (3 · 3 · 100 · 100) = 360 000, compared to 810 000 for standard convolutions. The number of parameters for the example is 3 · 3 · 3 + 3 · 3 = 36, compared to 81 for standard convolutions.
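
The example numbers can be verified with a short sketch. The cost functions below follow equation (2.13) and the corresponding standard convolution cost, and the PyTorch block shows how a depthwise separable convolution is commonly expressed as a grouped convolution followed by a 1x1 convolution; the layer sizes simply reproduce the illustrative example above.

```python
import torch
import torch.nn as nn

def depthwise_separable_cost(k, w, h, in_d, out_d):
    """Multiply counts from equation (2.13): depthwise filtering
    followed by the 1x1 pointwise combination."""
    depthwise = k * k * w * h * in_d
    pointwise = in_d * out_d * w * h
    return depthwise + pointwise

def standard_cost(k, w, h, in_d, out_d):
    return k * k * in_d * out_d * w * h

# The 100x100 RGB example from the text (C_K = 3, I_D = O_D = 3).
print(depthwise_separable_cost(3, 100, 100, 3, 3))  # 360000
print(standard_cost(3, 100, 100, 3, 3))             # 810000

# A matching PyTorch block: groups=in_channels makes the first
# convolution depthwise; the second is the 1x1 pointwise combination.
block = nn.Sequential(
    nn.Conv2d(3, 3, kernel_size=3, padding=1, groups=3, bias=False),
    nn.Conv2d(3, 3, kernel_size=1, bias=False),
)
params = sum(p.numel() for p in block.parameters())
print(params)  # 36, compared to 81 for a standard 3x3 convolution
```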

The reduced complexity of the network comes with a trade-off in model accuracy compared to standard convolutions, and this trade-off is further explored by Howard et al. in [21]. Furthermore, in a practical application depthwise separable convolutions may also be slower than standard convolutions for convolutional layers that already have few parameters and low complexity. This is further explored in this thesis in Section 4.5.

2.4.2 Pruning

In neural network pruning, neurons with small weights are removed, thus reducing the computational complexity and memory requirements of the network. This can often be done without sacrificing the performance of the network, as networks are often over-parametrized and there is often redundancy in the models [33]. Pruning was also used in early networks to reduce the complexity and over-fitting of the network [15, 34]. More recently it has been successfully used on state-of-the-art CNN models without a loss of accuracy [35].

In practice, pruning is an iterative process [35]. In each step some of the network weights are removed by greedily keeping the most important connections and removing the most insignificant ones, after which the network is retrained. These iterations are continued until a sufficient compression rate is achieved. Moreover, the accuracy of the network might suffer from too large a compression rate, so it can be beneficial to observe the loss and stop the pruning early. The compression rate and its effect on the accuracy of the model are also highly dependent on the network. For example, Han et al. [35] pruned neural networks used for classifying ImageNet [4] and achieved a 9x compression rate for AlexNet [24] and a 13x compression rate for VGG [25] without loss of accuracy. Moreover, they showed that both the convolutional and fully connected layers can be pruned. Also, for inference, NVIDIA's new Ampere architecture can accelerate sparse matrix multiplications up to 2 times compared to dense matrices [36].
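
As a sketch of this iterative procedure, the following shows one possible magnitude-based pruning loop in PyTorch. The threshold selection, the pruning schedule and the assumed retrain_fn callback are illustrative and not the exact method of [35]; a full implementation would also store the pruning mask and re-apply it during retraining so that pruned weights stay zero.

```python
import torch
import torch.nn as nn

def magnitude_prune_(layer, fraction):
    """Zero out the `fraction` of weights with the smallest magnitude
    in a single layer (one greedy pruning step)."""
    with torch.no_grad():
        magnitudes = layer.weight.abs().flatten()
        k = int(fraction * magnitudes.numel())
        if k == 0:
            return
        threshold = magnitudes.kthvalue(k).values
        mask = layer.weight.abs() > threshold
        layer.weight.mul_(mask.to(layer.weight.dtype))

def iterative_pruning(model, retrain_fn, steps=5, fraction=0.2):
    """Prune, then retrain, until the desired compression rate is reached.
    `retrain_fn` is an assumed callback that fine-tunes the pruned model."""
    for _ in range(steps):
        for module in model.modules():
            if isinstance(module, (nn.Conv2d, nn.Linear)):
                magnitude_prune_(module, fraction)
        retrain_fn(model)   # recover accuracy before the next pruning step
    return model
```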

2.4.3 Quantization

Quantization is a method to increase the speed of networks by reducing the precision of the weights inside the net [33, 37]. Furthermore, neural networks seem to be very robust to quantization, and lower-precision weights can be used inside the network with almost no loss of accuracy [37].

While there has been a lot of previous work on lower-precision and mixed-precision training [37, 38], the main interest in this thesis is accelerating inference with post-training quantization [39] and quantization-aware training [40]. Moreover, the idea is to target GPUs with dedicated hardware support for lower-precision arithmetic, such as Tensor Cores, which support 16-bit floats and 8- and 4-bit integers [2]. However, lowering the precision does affect the performance depending on the network and problem, and it should be evaluated.
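
As an illustration, the following sketch shows a minimal symmetric per-tensor post-training quantization of one layer's weights and, for FP16-capable GPUs, half-precision inference. The scaling scheme, layer sizes and function names are illustrative assumptions, not the exact methods of [39, 40].

```python
import torch
import torch.nn as nn

def quantize_weights_int8(weight):
    """Symmetric per-tensor post-training quantization: map float
    weights to 8-bit integers plus a single float scale factor."""
    scale = weight.abs().max() / 127.0
    q = torch.clamp((weight / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    return q.to(torch.float32) * scale

# Illustrative use on one convolution layer: the round-trip error is
# typically small relative to the weight magnitudes [37].
conv = nn.Conv2d(3, 16, kernel_size=3)
q, scale = quantize_weights_int8(conv.weight.data)
error = (dequantize(q, scale) - conv.weight.data).abs().max()
print(float(scale), float(error))

# For GPUs with FP16 support, the simplest post-training option is
# to run inference in half precision.
model_fp16 = nn.Sequential(conv, nn.ReLU()).half()
output = model_fp16(torch.randn(1, 3, 100, 100).half())
```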