
Differentiable Camera Model for Learning-Based Optimizations

Faculty of Information Technology and Communication Sciences (ITC)
Bachelor of Science thesis
October 2021


Melander Ville: Differentiable Camera Model for Learning-Based Optimizations
Bachelor of Science thesis
Tampere University
Information Technology
October 2021

This work presents a mathematical model of a digital camera to accurately simulate and optimize the image acquisition stage in numerous image processing applications. To account for various physical phenomena in the acquisition process and accurately model real-world imagery, the proposed method relies on the wave optics-based formalism of light. It is further illustrated that the model is differentiable with respect to the optical parameters to be tuned, making it integrable with current machine learning methods based on standard gradient descent algorithms. This benefit is demonstrated through an example application, where the image acquisition is manipulated in accordance with a certain desired aesthetic taste.

The thesis contains two main parts. In the theoretical discussion, the foundations of wave optics, and in particular incoherent imaging, are introduced. The image acquisition module is presented through the formation of the point spread function (PSF), as well as other related optical functions. The sensor image obtained from the camera module is then compared to the target style image through a perceptual loss for the aesthetic style transfer. For this purpose, the perceptual loss functions presented in the literature for image transformation are reviewed. The theoretical discussion is followed by the implementation and the simulation results. During both training and testing, the camera model uses multispectral (MS) data as input to maximize the design degrees of freedom.

It was found that the layered camera model structure can be combined with an image processing application, and that the camera model itself is very robust even with just a few optical elements. On the other hand, the style transfer layer added on top did not produce the expected results: instead of changing the stylistic details of the image, only the colour scheme was affected in a major way. With more optimization of the layered structure and loss functions, a more robust style transfer could be achieved as part of the camera model, resulting in the desired end-to-end imaging system.

Keywords: Differentiable camera model, style transfer loss, machine learning

The originality of this thesis has been checked using the Turnitin Originality Check service.


Melander Ville: Differentioituva kameramalli oppimiseen perustuville optimoinneille
Bachelor of Science thesis
Tampere University
Information Technology
October 2021

This work presents an accurate mathematical model of a digital camera whose parameters and components can be optimized with machine learning algorithms for many different image processing applications. The presented method utilizes the wave-optics formalism of light, so that the real-world physical properties of light present in the imaging stage can be modelled as precisely as possible. In addition, the model is adaptable with respect to the optical parameters, meaning it is easy to combine with current machine learning methods based on standard gradient descent. This benefit is demonstrated in an example application in which the image acquisition process is manipulated to achieve a certain aesthetic style.

The thesis consists of two parts. The theoretical part introduces the foundations of optics, focusing in particular on incoherent imaging. The image acquisition model is presented by first forming the point spread function (PSF) and the other related optical functions. The image formed by the sensor of the camera module is then compared to a target style image using a perception-based aesthetic style transfer. For this purpose, related loss functions found in the current literature are presented and evaluated. The theoretical part is followed by a presentation of the implementation of the derived model and an evaluation of the results obtained with it. During the optimization and testing stages, the camera model uses multispectral image data as input, i.e. the final image is formed by combining images formed over several wavelength ranges of light. This allows the optical modelling to be refined and maximizes the design degrees of freedom.

It was found that combining the layered camera model with an image processing application is feasible, and the camera model itself produces valid results with only a few optical elements. On the other hand, the style transfer on top of the camera model did not produce the expected results, and instead of clearer structures and textures it seemed to change only the colour scheme of the image towards that of the style image. With further optimization of the layer structure and the loss functions, it could be possible to make a more comprehensive style transfer a working part of the camera model, which would yield the desired end-to-end imaging system.

Keywords: differentiable camera model, style transfer, machine learning

The originality of this publication has been checked using the Turnitin Originality Check service.


Contents

1 Introduction
2 Background
3 Method
   3.1 The camera model
      3.1.1 Optics module
      3.1.2 Colour filter module
   3.2 Neural style transfer
      3.2.1 VGG16 network
      3.2.2 Loss functions
4 Implementation and results
   4.1 Structure
   4.2 Training
   4.3 Results
   4.4 Discussion
5 Conclusions
References


Abbreviations

CI      Computational imaging
CNN     Convolutional neural network
DagNN   Directed acyclic graph neural network
DOE     Diffractive optical element
HS      Hyperspectral
ML      Machine learning
MS      Multispectral
PSF     Point spread function
ReLU    Rectified linear unit
VGG     Visual Geometry Group


1 Introduction

After the first proper cameras in the early 19th century and the first digital cameras in the late 20th century, both camera optics and software have been developing at an accelerating rate [1, p. 5]. This has allowed many different image manipulation processes to be added to cameras. On the other hand, these kinds of processes are very often applied to images that have already been taken. In such approaches the aim may be to correct some mistakes produced by the camera, for example deblurring or noise removal. These applications are relatively easy to make and thus very common, making up the bulk of the computer vision methods used today. [2, pp. 32, 83], [3, p. 2]

Even with the rapid development of optics, some noticeable aberrations and limitations still exist in modern cameras, which brings us to the second way signal processing is used in imaging: the so-called coded aperture approach. Here the whole imaging process is manipulated, from the light source to the resulting image, to make sure optimal results are achieved. These methods are often designed with a specific task in mind and can be very robust and useful for various applications. [4] However, they can be very hard to produce in practice, since arbitrary apertures and other optical components can be impossible to reproduce outside simulations. Therefore some physical constraints are sometimes taken into account when designing the imaging setup.

In this work the aim was to combine the two approaches mentioned above, which would produce a differentiable and thoroughly optimizable end-to-end imaging system. By modelling the camera as multiple layers, each consisting of its own diffractive optical elements (DOEs), the camera as a whole can be easily parametrized and optimized. With this kind of layered camera model, different kinds of applications can be easily added on top to suit different purposes. For this model, neural style transfer, which has so far mainly been applied to already taken images or video (see chapter 2), will serve as an example application.

The camera model used in this work was completely digital, and multispectral (MS) image data was used as the input description. The aim was then to train the model for a certain style image with the help of an MS image database. After training, the model could then be used to digitally take pictures that retained the physical structure of the imaged object but were stylistically similar to the style image. For the style transfer loss functions, a pre-trained image classification network was used to extract style information, which greatly reduced the time and data required for training. With this kind of imaging system, rather unique artistic results could be achieved without manual effort.


This document is structured as follows. Chapter 2 presents the background for the different physics and machine learning based methods used in this implementation. Papers related to the style loss functions used in this work are also discussed, and some other relevant camera models are presented. In chapter 3 the optical functions used in this work are derived, as well as the loss functions used. Chapter 4 discusses how the model was implemented in practice, presenting the layered structure used as well as the results and experiments. Finally, the functionality of the implementation is discussed in chapter 5.


2 Background

At the beginning of the 20th century, many new discoveries were made, including the confirmation that the speed of light is constant and a better understanding of light as an electromagnetic waveform. This required a complete re-imagining of the way our universe works, while also answering great questions in physics, such as the diffraction of light. [5, pp. 31–32] The basis of this traces back to the 1860s, when James Clerk Maxwell presented the electromagnetic wave theory, which states that wave-like perturbations travelling at a constant speed can exist in an electromagnetic field [6, p. 35][5, pp. 29–30].

When light is considered as an electromagnetic wave, a few properties of light which geometrical optics struggles to explain, such as diffraction, can be accounted for with good accuracy [6, p. 96]. Therefore the understanding of light as an electromagnetic wave forms the basis of how the propagation of light is modelled in the camera model of this thesis. We will later look at how diffraction is modelled in this work, and it will become clearer why mere geometrical optics would not suffice. First, however, we take a look at the overall framework of the other part of this work, the machine learning component.

In recent years, neural networks have become more and more popular across a wide range of applications, as they make it possible to robustly solve complex problems [7]. The fundamental principle of neural networks is to mimic the way the human brain processes information and produces solutions for different tasks [8, p. 1]. In the human brain there are billions of neurons interconnected through neural pathways, which together form a complex set of networks. Information travels through these networks as electric signals whose intensity can be changed by every neuron to match previously learned behaviour, meaning certain pathways are activated more when a specific action is performed [9, pp. 3–4]. The basic idea of this functionality has been carried over directly to artificial neural networks, where the problem is often to search for the optimal neuron weights that best approximate the relation between the desired input-output pairs [8, pp. 1–2]. This idea can be utilized across multiple applications; specifically, in this work we aim to combine a trainable camera model with another neural network to transform the images from the camera.

The main part of this work, the differentiable layered camera model, is a relatively new development, since most recent image transformation tasks are done in post-processing. The basis for this kind of computational imaging (CI) has existed for quite some time, just like machine learning (ML), but their true power has become more and more apparent recently. CI together with ML can be very robust when it comes to getting good-quality images with imperfect optics and/or conditions. [10] This is why in this work we aim to combine a trainable camera model with an application, so that simulation with the simplest optics can yield good results, while also allowing a more thorough optimization of the model as a whole.

When it comes to different image processing tasks, convolutional neural networks (CNNs), like the one used in this work, have been shown to give very accurate results with enough depth in the network [11]. The name convolutional comes from the fact that CNNs apply various convolutions with different learned filters (detecting, for example, corners, or even things like eyes and faces) when going deeper into the network [12]. Therefore a CNN is also used in this work for tasks which require retrieving information, such as shapes and textures, from an image.

While this work mostly focuses on the camera model itself, an application in the form of style transfer was added on top of the model as an example. The basic idea of neural style transfer is to modify an input image in such a way that its style matches a given style image, but the overall construction of the input image remains the same. In recent years there have been quite a few works related to neural style transfer, for example [13][14][15][16][17][18]. In these works, a pre-trained VGG network is used for transforming an existing image, along with a loss network for the perceptual loss. It has been found that this kind of transformation works best when using the high-level content found by the network instead of per-pixel losses [13][14].

Since this area is relatively new, there has been a lot of development over recent years, although already in 2016 Johnson et al. [14] had an implementation robust enough to handle real-time style transfer for video. Style transfer for real-time or almost real-time video has seen further developments by Ruder et al. [16] in 2018, and has even been extended to 360-degree videos by Zabaleta & Bertalmío [18] in 2021.

This work, however, focuses on trying to implement a more complete style transfer pipeline from source to the final image.

In the neural style transfer part of this work, we use the loss functions proposed by Johnson et al. in [14], which were based on work done by Gatys et al. in [13]. Unlike previous works in this area, in this implementation the camera itself is also a part of the process, and the neural style transfer is implemented as a layer of the camera model structure. In other words, the properties of the camera model can also be changed and optimized together with the style transfer. The aim is then to be able to take pictures, using the implemented camera model, that have been optimized for style transfer all the way from the image acquisition to the resulting output image.

A few differentiable and optimizable camera models like the one in this work have also been developed in recent years for different applications. For example, Akpinar et al. developed a camera structure for extended depth of field imaging [19], and computational optics with deep learning is also discussed in [10] and [20]. The camera part of this work is largely based on these, but with a completely new application.


3 Method

Figure 3.1 illustrates the overall structure of the proposed method, including an image acquisition (camera) layer, a demosaicing layer, and a network-based loss layer. We utilize hyperspectral (HS) image cubes as the scene input to rigorously model and fully manipulate the image acquisition setup. The camera module then performs the physically accurate simulation of the sensor image formation. We further introduce a demosaicing step to convert the (monochromatic) sensor image into a colour image with three channels. The demosaiced output image is then compared with the style and content images to calculate the loss. In the following, we discuss each step in more detail.

Figure 3.1 General view of the setup

3.1 The camera model

We incorporate the wave optics formalism of light for two reasons. First, as stated before, the simulation model is more accurate, including complex phenomena such as diffraction. Second, it is then possible to include and manipulate sophisticated optical elements in the system. In addition, such a model offers a fully differentiable set of equations. We can then optimize, for instance, the profile of an arbitrary DOE.

Figure 3.2 illustrates a simplified inner structure of a digital camera, as well as the blocks of the corresponding mathematical model. In the most simplistic terms, we can describe a camera via two planes, namely the optics (lens) plane and the sensor plane. More specifically, for this work we consider the camera optics as one part, from the incoming light all the way to a monochromatic sensor image captured through the point spread function (PSF), and the process of converting this sensor image into an RGB image as the second part. The style transfer that uses the RGB image is then the final step.

Figure 3.2 Inner structure of the proposed camera model. Light from a point source at a distance z first propagates to the lens plane. The wavefront at the lens plane is modified via the refractive lens and the DOE, from which it propagates to the sensor plane to create the PSF.

When considering an imaging system as in figure 3.1, and also taking into account the sensor noise, the final monochromatic sensor image $I_s(x, y)$ can be found as

\[ I_s(x, y) = \int_z \int_\lambda \big( I_{z,\lambda}(x, y) * h_{z,\lambda}(x, y) \big)\, \kappa_\lambda(x, y)\, d\lambda\, dz + \eta_s, \qquad (3.1) \]

where $I_{z,\lambda}(x, y)$ is the single channel (wavelength $\lambda$) of the MS image at depth $z$, $h_{z,\lambda}(x, y)$ is the depth- and wavelength-dependent PSF, $*$ is the convolution operator, $\kappa_\lambda(x, y)$ is the spectral response of the sensor colour filter at the pixel location $(x, y)$, and $\eta_s \sim N(0, \sigma_s^2)$ is the zero-mean Gaussian sensor noise with standard deviation $\sigma_s$. In the following, the derivation of the PSF through various optical functions, as well as the sensor colour filter design, are given.
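Before that, as an illustration of how equation 3.1 can be evaluated numerically in its discrete, single-depth form, a base-MATLAB sketch is given below (the implementation described in chapter 4 is written in MATLAB). The variable names I_ms, psfs, kappa and sigma_s are placeholders assumed here, not names from the thesis code, and the integrals are replaced by a sum over the discrete wavelength bands.

```matlab
% Discrete, single-depth sketch of equation 3.1. I_ms is an H-by-W-by-L
% multispectral cube, psfs an H-by-W-by-L stack of wavelength-dependent PSFs,
% kappa an H-by-W-by-L per-pixel colour filter response and sigma_s the
% standard deviation of the sensor noise.
function I_s = sensor_image(I_ms, psfs, kappa, sigma_s)
    [H, W, L] = size(I_ms);
    I_s = zeros(H, W);
    for l = 1:L
        % Convolve the l-th spectral band with its PSF and weight the result
        % by the colour filter response at every pixel.
        blurred = conv2(I_ms(:, :, l), psfs(:, :, l), 'same');
        I_s = I_s + blurred .* kappa(:, :, l);
    end
    I_s = I_s + sigma_s * randn(H, W);   % zero-mean Gaussian sensor noise
end
```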

3.1.1 Optics module

We derive the incoherent PSF of the system subject to a point source located at a distance $z$ from the lens plane, emanating light with a single wavelength $\lambda$. Under the paraxial approximation, the wavefront of such a point source just before the lens plane, $U^{-}_{0,\lambda,z}(s, t)$, is defined as [6]

\[ U^{-}_{0,\lambda,z}(s, t) = \frac{\exp(jkz)}{j\lambda z} \exp\!\left[ \frac{jk}{2z}\left(s^2 + t^2\right) \right], \qquad (3.2) \]

where $k = 2\pi/\lambda$ is the wave number, and the negative sign is used to differentiate the right and left sides of the lens. The transparent optical element at the lens plane then performs a multiplicative modulation on this incoming wave. Assuming a thin optical element with a complex transmittance $A(s, t)\exp[j\Phi(s, t)]$, the wave field just after the lens, $U^{+}_{0,\lambda,z}(s, t)$, is

\[ U^{+}_{0,\lambda,z}(s, t) = U^{-}_{0,\lambda,z}(s, t)\, A(s, t)\, \exp[j\Phi(s, t)]. \qquad (3.3) \]

Using Fresnel propagation from the lens plane to the sensor plane, the wavefront at the sensor plane, $U(x, y)$, is calculated via a Fourier transform as

\[ U(x, y) = \frac{\exp(jkz)}{j\lambda z} \exp\!\left[ \frac{jk}{2z}\left(x^2 + y^2\right) \right] \iint \left\{ U^{+}_{0,\lambda,z}(s, t)\, \exp\!\left[ \frac{jk}{2z}\left(s^2 + t^2\right) \right] \right\} \exp\!\left[ -j\frac{2\pi}{\lambda z}(xs + yt) \right] ds\, dt. \qquad (3.4) \]

Finally, the incoherent PSF at the sensor plane, $h_{\lambda,z}(x, y)$, is the intensity of $U(x, y)$,

\[ h_{\lambda,z}(x, y) = |U(x, y)|^2. \qquad (3.5) \]

This describes how a point source with a given wavelength behaves when passing through the imaging system. How the whole image is formed across all the different wavelengths in the MS imaging data is discussed in later chapters.
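To make the derivation concrete, the following sketch evaluates such a single-wavelength PSF numerically with a single FFT. It is an illustration rather than the thesis implementation: the grid size N, sampling pitch and phase profile Phi are assumed inputs, and constant or unit-modulus factors that do not affect the normalized intensity are dropped.

```matlab
% Minimal numerical sketch of equations 3.2-3.5 for one wavelength. The same
% distance z is used on both sides of the lens, following the equations as
% written above.
function psf = incoherent_psf(Phi, lambda, z, pitch, N)
    k      = 2 * pi / lambda;
    coords = ((-N / 2):(N / 2 - 1)) * pitch;      % lens-plane coordinates
    [S, T] = meshgrid(coords, coords);
    R2     = S.^2 + T.^2;

    U_minus = exp(1j * k / (2 * z) * R2);          % eq. 3.2 (up to a constant factor)
    A       = double(sqrt(R2) <= N * pitch / 2);   % fully open circular aperture
    U_plus  = U_minus .* A .* exp(1j * Phi);       % eq. 3.3, modulation by lens + DOE

    % Eq. 3.4: Fresnel propagation to the sensor plane evaluated with an FFT.
    U_sensor = fftshift(fft2(U_plus .* exp(1j * k / (2 * z) * R2)));

    psf = abs(U_sensor).^2;                        % eq. 3.5, incoherent PSF
    psf = psf / sum(psf(:));                       % normalize to unit energy
end
```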

Equation 3.3 formulates the effect the complex optical transmittance function has on the final PSF. In the proposed model, we set the phase value $\Phi(s, t)$ as the optimization parameter, whereas the amplitude $A(s, t)$ is assumed to be a simple circular function corresponding to a fully open aperture. More precisely, we divide the phase expression into two components: a fixed phase value, $\Phi_{l,\lambda}(s, t)$, corresponding to simple imaging through a refractive thin lens, and an arbitrary phase $\Phi_{0,\lambda}(s, t)$, to be optimized and implemented through a DOE. The final phase is then $\Phi_\lambda(s, t) = \Phi_{l,\lambda}(s, t) + \Phi_{0,\lambda}(s, t)$.

For any diffractive or refractive optical element, the phase transition function is related to the thickness (height) of the element as

\[ \Phi_\lambda(s, t) = k(n_\lambda - 1)\,\Delta(s, t), \qquad (3.6) \]

where $n_\lambda$ is the wavelength-dependent refractive index of the material. A spherical thin lens, for instance, has a height profile of the form $\Delta_l(s, t) = d_0 - \big(R - \sqrt{R^2 - (s^2 + t^2)}\big)$, with $d_0$ being the central thickness and $R$ the spherical radius of the lens [6]. Using such a profile in equation 3.6, we can formulate the phase delay through the refractive lens, $\Phi_{l,\lambda}(s, t)$, as

\[ \Phi_{l,\lambda}(s, t) = k(n_\lambda - 1)\sqrt{R^2 - (s^2 + t^2)}, \qquad (3.7) \]

where the constant terms $d_0$ and $R$ are omitted for simplicity.
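On the same grid as in the PSF sketch above, the refractive lens phase of equation 3.7 might be evaluated roughly as follows; R and n_lambda are assumed to come from the lens specifications, and Phi0_lambda denotes the DOE phase discussed next, so this is a sketch rather than the thesis code.

```matlab
% Sketch of equation 3.7 on the same (S, T) grid as in the PSF sketch.
% R is the spherical radius of the lens and n_lambda its refractive index at
% the current wavelength; both are assumed inputs here.
k        = 2 * pi / lambda;
Phi_lens = k * (n_lambda - 1) * sqrt(max(R^2 - (S.^2 + T.^2), 0));
% Total phase at the lens plane: refractive lens plus the DOE contribution.
Phi      = Phi_lens + Phi0_lambda;
```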

During the training procedure, we aim to optimize the phase delay through the DOE, $\Phi_{0,\lambda}(s, t)$, for each wavelength $\lambda$. However, equation 3.6 dictates a physical relation between the phase delays at different $\lambda$ values which has to be taken into account. One simple way to account for this correspondence is to set the DOE height profile, $\Delta_0(s, t)$, as the optimization parameter, from which $\Phi_{0,\lambda}(s, t)$ can be calculated directly at each iteration using equation 3.6. Unfortunately, such an approach may result in poor optimization performance, as the typical values of $\Delta_0(s, t)$ are in the micrometre range and can suffer from numerical instability. As an alternative, we first choose a so-called nominal wavelength, $\lambda_0$, and set the phase delay at $\lambda_0$, $\Phi_{0,\lambda_0}(s, t)$, as the optimization parameter. Since equation 3.6 is invertible, the phase delay at any $\lambda$, $\Phi_{0,\lambda}(s, t)$, can be calculated from $\Phi_{0,\lambda_0}(s, t)$ alone:

\[ \Phi_{0,\lambda}(s, t) = \frac{\lambda_0 (n_\lambda - 1)}{\lambda (n_{\lambda_0} - 1)}\, \Phi_{0,\lambda_0}(s, t). \qquad (3.8) \]

It has been experimentally observed that optimizing $\Phi_{0,\lambda_0}(s, t)$ provides more stable results, as the phase delay typically varies in proportion to $2\pi$ [19].
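In code, the wavelength scaling of equation 3.8 reduces to a single scalar factor. The sketch below assumes the refractive indices of the DOE material, n_lambda and n_lambda0, are known; the variable names are illustrative.

```matlab
% Sketch of equation 3.8: the DOE phase is optimized only at the nominal
% wavelength lambda0 and scaled to any other wavelength lambda.
scale       = (lambda0 * (n_lambda - 1)) / (lambda * (n_lambda0 - 1));
Phi0_lambda = scale * Phi0_lambda0;
% If the physical element is needed, its height follows from equation 3.6:
delta0      = Phi0_lambda0 / ((2 * pi / lambda0) * (n_lambda0 - 1));
```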

3.1.2 Colour filter module

The camera model proposed in this work takes MS images as a rich input scene description, which contains the scene reflectance information in a dense set of spectral bands. In practice, the visible spectrum of light from around 420 nm to 720 nm is divided into 28 bands, the intensities of which are stored in the MS image data cube (see figure 3.1). Since the PSF can be considered the impulse response of the imaging system to a point source, the complete image of an object produced by said system can be calculated through convolution between the object and the PSF [6, p. 4]. In this case this means that for each of the 28 wavelength bands, a wavelength-dependent PSF is first calculated, and then convolved with the corresponding MS data image.

Now that we have the functions to model the propagation of light of a certain wavelength through the optical system, we need to look at how the information carried by these different wavelengths can be combined at the sensor to produce a colour image. After all, when using MS imaging data as the scene input, a slice of the HS image cube represents the presence of one wavelength band in the scene. In practice we first need to pass all 28 wavelength bands through the optical system, and then combine the resulting monochromatic MS images into an RGB image. In conventional cameras this process is similar, and we will now look at how an RGB image is typically produced from the sensor image inside a camera.

Since having a different sensor for each of the primary colours of light would not be cost-effective for average cameras, a colour filter is often used on top of the sensor. The sensitivity of the human eye peaks around green light, which is why the most commonly used Bayer filter has two green filters for each blue and red filter [21, p. 150]. These filters on top of the sensor mean that each pixel in the sensor originally contains a luminance value of only one of the primary colours. To get to the final RGB image, a demosaicing algorithm is needed to interpolate the remaining RGB values for each pixel. The need for this interpolation naturally causes the image to have some artefacts, as well as losses in quality, which is why new approaches are being developed. [22]

Multiple other types of colour filters exist as well, some using different ways of combining light. Many filter designs utilize the basic idea of the Bayer filter by combining the RGB filters in different ways, for example vertically or diagonally [22]. On the other hand, filters like the Kodak filters use different combinations of white filters with RGB filters and intermediate colours (cyan, magenta, and yellow). However, it has been found that these designs fare rather poorly when considering the spectral and luminance performance, and much better designs can be formulated through a more mathematical approach. [22][23] For the MS formulation used in this implementation, an approach based on the optimization of Gaussian basis functions is used.

Figure 3.3 Sensor colour filtering: (a) the Gaussian basis functions of the sensor colour filter over wavelength (nm); (b) the Kodak filter response (R, G, B) over wavelength (nm)

Figure 3.3 (a) presents the 12 Gaussian-shaped basis functions used as the basis to form the final Kodak sensor colour response of figure 3.3 (b). The aim is to combine these basis functions as a weighted average to form the colour response of the sensor. As can be seen, the Gaussian basis allows for the smoothness and non-negativity that is present in the Kodak filter response. The mathematical basis of this kind of approach has been studied more closely in e.g. [23] and [24]. After forming the colour response, it can be used to form the final RGB image. First, a monochromatic image for each wavelength band of the MS image is calculated according to equation 3.1, and after that all the formed images are combined as a weighted average based on the colour response. The colour response function $\kappa$, using the Gaussian basis functions, can be formulated as

\[ \kappa_\lambda(x, y) = \sum_i \alpha_i(x, y)\, \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{1}{2} \left( \frac{\lambda - \mu_i}{\sigma} \right)^2 \right), \qquad (3.9) \]

where $\alpha_i(x, y)$ are the coefficients we aim to train for the basis functions.
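A sketch of equation 3.9 with placeholder names (alpha for the trainable coefficients, lambdas for the band centres, mu and sigma for the Gaussian basis parameters) is given below; these names are assumptions for illustration, not variables from the thesis implementation.

```matlab
% Sketch of equation 3.9: the per-pixel colour response is a weighted sum of
% fixed Gaussian basis functions. alpha is H-by-W-by-B (trainable weights),
% mu is 1-by-B (basis means), sigma is a common width and lambdas is 1-by-L
% holding the centres of the wavelength bands.
function kappa = colour_response(alpha, lambdas, mu, sigma)
    [H, W, B] = size(alpha);
    L = numel(lambdas);
    kappa = zeros(H, W, L);
    for l = 1:L
        % Value of every Gaussian basis function at this wavelength band.
        g = exp(-0.5 * ((lambdas(l) - mu) / sigma).^2) / (sigma * sqrt(2 * pi));
        for b = 1:B
            kappa(:, :, l) = kappa(:, :, l) + alpha(:, :, b) * g(b);
        end
    end
end
```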

With this, we have the mathematical derivations for obtaining an RGB image from MS image data through a specified optical system. Next we look at the neural style transfer layer added to the camera model, which uses a convolved RGB image obtained from the camera model as the input image whose style we aim to change.

3.2 Neural style transfer

Neural style transfer was chosen as an example application because of the interesting results it has offered in the literature in recent years. The goal was to see whether this kind of camera model could be combined with style transfer, which could enable more robust and complete style transfer solutions that also include the optics. More details follow later, but at a basic level style transfer works with the help of the Gram matrix, which is used to describe how different colours and patterns relate to each other in an image. These relationships can be said to constitute the style of an image, which can then be applied to another image. [13][14]

As mentioned, the focus of this implementation was more on the camera model, and the style transfer functions constitute an example of what the layered camera structure can be applied to. Nevertheless, the following subsections briefly go through the loss functions used for neural style transfer in this work and present the pre-trained image classification network used by the loss functions.

3.2.1 VGG16 network

VGG16, named after the Visual Geometry Group, is a convolutional image classification network with, as the name suggests, 16 layers of trainable weights [25]. In this work, a pre-trained version of the network is used, which has been trained with over a million images from the ImageNet database. The amount of training data, together with the 1000 categories the network can classify, makes VGG16 a robust option for image classification. [26]

When it comes to CNNs trained for image classification, it can be said that the connections in the lower levels resemble the human visual system. After all, lower layers tend to produce the input image as-is, whereas with higher layers the reconstruction becomes more and more abstract. This relation and the use of CNNs for perceptual tasks are discussed more closely in e.g. [27] and [28]. Here the latter concludes that a network trained for visual tasks will also learn to represent the real world in a way that correlates well with humans. This is also why VGG16 with its convolutional layers is a good choice for tasks dealing with perceptual assessments.

In this work, however, the VGG16 network is used for the feature extraction needed in the calculation of the style transfer loss, which is presented in the following subsection. This feature extraction means that instead of using the network to describe an image with words, it is used to capture image data from various levels of the network; in practice this might mean, for example, corners or circles. While the VGG16 network consists of multiple convolutional layers with max pooling and ReLUs, along with fully connected layers at the end, only a few of these layers are used in this implementation. [25][26] The full structure of the network can be found at [26].
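The thesis accesses the pre-trained VGG16 through MatConvNet; purely as an illustration, comparable relu-layer activations can be extracted with the MATLAB vgg16 model referenced in [26] roughly as shown below. The layer names and input preprocessing here are assumptions based on that documentation page and should be checked against the installed model, so this is a sketch rather than the thesis code.

```matlab
% Illustrative feature extraction with the MATLAB Deep Learning Toolbox VGG16 [26].
net    = vgg16;                                    % pre-trained VGG16
img    = imresize(uint8(255 * I_rgb), [224 224]);  % assuming I_rgb is in [0, 1]
layers = {'relu1_2', 'relu2_2', 'relu3_3', 'relu4_3'};
feats  = cell(1, numel(layers));
for j = 1:numel(layers)
    feats{j} = activations(net, img, layers{j});   % H_j-by-W_j-by-C_j feature map
end
```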

3.2.2 Loss functions

Calculating the style transfer loss requires both a loss for reconstructing the style content of the style image and a loss for reconstructing the spatial features of the input. In this case, style means for example the colour scheme, textures, patterns, and shapes of the style image, whereas content is the more straightforward part describing where things are in an image. [13][14]

When it comes to reconstructing images from a neural network, it has been found that low layers produce very similar images compared to the input, but higher layers will only retain the general spatial structure of the input image. These higher layers will then allow more freedom when it comes to colour, texture, and smaller shapes.

This is one of the reasons why only a few layers of the VGG16 network are used in style transfer. [14] We use the approach of Johnson et al., derived from the work by Gatys et al., where the layers relu1_2, relu2_2, relu3_3 and relu4_3 are used for the loss.

Using the functions presented by Johnson et al. [14], derived from work done by Gatys et al. [13], the content reconstruction loss between the input image $\hat{y}$ and the target image $y$ can be calculated with the help of the VGG16 network $\phi$. The feature representation of size $C_j \times H_j \times W_j$ is obtained from the activations $\phi_j(\hat{y})$ and $\phi_j(y)$ of a convolutional layer $j$ of the VGG16 network. The final content reconstruction loss is the squared and normalized Euclidean distance between the feature representations of the two images,

\[ l_{\mathrm{cont.}}(\hat{y}, y) = \frac{1}{C_j H_j W_j} \left\| \phi_j(\hat{y}) - \phi_j(y) \right\|_2^2, \qquad (3.10) \]

where $\phi$ represents the network [14].
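For a single layer $j$, equation 3.10 is simply a normalized sum of squared differences over the feature maps. A minimal sketch, assuming feature maps phi_yhat and phi_y as extracted in the earlier snippet:

```matlab
% Sketch of equation 3.10 for one layer j; phi_yhat and phi_y are
% H_j-by-W_j-by-C_j feature maps of the two images.
[Hj, Wj, Cj] = size(phi_y);
l_content    = sum((phi_yhat(:) - phi_y(:)).^2) / (Cj * Hj * Wj);
```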

For the style transfer, a Gram matrix is used with the feature representation obtained from the network as before. The Gram matrix aims to capture which features found by the network occur together in the $C_j$-dimensional, $H_j \times W_j$-sized feature map, and it is proportional to the uncentered covariance of the feature representation [14]. This is done with the function

\[ G_j(x)_{c,c'} = \frac{1}{C_j H_j W_j} \sum_{h=1}^{H_j} \sum_{w=1}^{W_j} \phi_j(x)_{h,w,c}\, \phi_j(x)_{h,w,c'}, \qquad (3.11) \]

for both the input image and the style image. In practice, equation 3.11 is calculated as a matrix product between the reshaped feature representation $\phi_j(x)$ and its transpose [14].

With the Gram matrix representations of the input image and the output image ($G_j(y)$ and $G_j(\hat{y})$, respectively), the final style reconstruction loss can be calculated as the squared Frobenius norm of the difference between the matrices:

\[ l_{\mathrm{style}}(\hat{y}, y) = \left\| G_j(\hat{y}) - G_j(y) \right\|_F^2. \qquad (3.12) \]

After calculating the loss functions, the actual style transfer is performed by forming a new image through the optimization

\[ \hat{y} = \arg\min_y\; \lambda_c\, l_{\mathrm{feat}}(y, y_c) + \lambda_{l2}\, l_{l2}(y, y_{l2}) + \lambda_s\, l_{\mathrm{style}}(y, y_s), \qquad (3.13) \]

where $\lambda_c$, $\lambda_{l2}$ and $\lambda_s$ are scalar multipliers, and $y$ is initialized with white noise [14].
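A compact sketch of how equations 3.11–3.13 can be assembled is given below, assuming feature maps phi_yhat and phi_y as above, the content loss l_content from the previous sketch, placeholder images y_hat and y_gt for the per-pixel L2 term, and a weight vector w that is purely illustrative rather than the values used in the thesis.

```matlab
function objective = style_transfer_objective(phi_yhat, phi_y, y_hat, y_gt, l_content, w)
% Sketch of equations 3.11-3.13. phi_yhat and phi_y are H_j-by-W_j-by-C_j
% feature maps from one VGG16 layer, y_hat and y_gt are the images used for
% the per-pixel L2 term, l_content is the loss of equation 3.10, and w holds
% the weights [lambda_c, lambda_l2, lambda_s].
    l_style   = norm(gram_matrix(phi_yhat) - gram_matrix(phi_y), 'fro')^2;  % eq. 3.12
    l_l2      = mean((y_hat(:) - y_gt(:)).^2);                              % L2 term
    objective = w(1) * l_content + w(2) * l_l2 + w(3) * l_style;            % eq. 3.13
end

% Equation 3.11: the Gram matrix as a matrix product of the reshaped feature
% map with its transpose.
function G = gram_matrix(phi)
    [H, W, C] = size(phi);
    psi = reshape(phi, H * W, C);      % each column holds one feature channel
    G   = (psi' * psi) / (C * H * W);  % C-by-C Gram matrix
end
```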


4 Implementation and results

We will now consider the implemented model from a more practical point of view, presenting the overall layer structure used, as well as the parameters used for training the network. After presenting the model, some results will be shown and compared with each other. In this work we attempted to reproduce the results of the paper by Johnson et al. [14], just with this more thorough end-to-end pipeline. Even though the expected results were not reached, the model did produce visually unique images based on the given style and content images.

4.1 Structure

The practical implementation of the models presented was done in two parts. First, a model of a camera was made in MATLAB and after that style transfer was added as a layer to the camera. As said before, the model of the camera only included one lens and an aperture to simplify calculations as much as possible, but still produced good results thanks to the differentiable and optimizable structure. The neural style transfer was done in MatConvNet.

The model as a whole was implemented as a directed acyclic graph neural network (DagNN), where all the different elements were added as layers. The MS imaging data and the depth are given as inputs to the network, from which the PSF and the sensor image are calculated. An additional convolutional layer is added to mimic the demosaicing.

The nominal phase value, $\Phi_{0,\lambda_0}(x, y)$, and the coefficients of the colour filter basis functions, $\alpha_i(x, y)$, are set as the optimization parameters within the PSF and sensor layers, respectively. The full network model can be seen in figure 4.1.

The lens model was based on a plano-convex lens, technical details of which can be accessed at [29], and with the DOE the properties of different parts of the lens could be changed (see figure 3.2 for an overview). Using the equations presented in [6] and derived in section 3.1, the optical functionality of the camera was modelled with good accuracy.

As seen in figure 4.1, the MS sensor image, IsensorMS, was simulated by convolving the input MS imaging data with the calculated PSF. This 28-channel image then went through the colour filter sensor part, which produces the RGB image according to the derivations in section 3.1.2.

After that the image was clipped and convolved to finally be used as an input to the style loss function.

Figure 4.1 DagNN network structure

From the RGB sensor image Isensor, which already had the colour filters applied to it, the content loss was also calculated. For the total objective loss, both the content reconstruction loss, in this case a simple L2 loss, and the style loss (equation 3.12) between the input image and the style image were calculated. To calculate the content loss over the sensor image Isensor, we construct a so-called ground-truth sensor image in which the optics are neglected. This image is a weighted average, formed with the colour filter, across the wavelengths in the MS data. After calculating the total loss according to equation 3.13, the style transfer could be performed, producing an output image which had the style of the style image but the content of the input image.
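As a minimal sketch of this ground-truth image, reusing the placeholder names from the earlier snippets (I_ms for the MS cube and kappa for the colour filter response):

```matlab
% Ground-truth image for the content loss: the MS cube is collapsed with the
% colour filter response, without the PSF convolution or the sensor noise.
I_gt = sum(I_ms .* kappa, 3);   % weighted combination over the wavelength bands
```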

4.2 Training

As mentioned in section 3.1.1, different wavelengths behave differently when passing through the lens. In this model, that was accounted for by dividing the visible spectrum of light into 28 slices with the help of MS image data, which required an MS image database. Using the KAIST MS image dataset as training data, the model was trained for different style images one at a time. The dataset and its related paper can be found at [30]. The training parameters are given in Table 4.1.

Parameter       Value
Batch size      2
Epochs          50
Learning rate   1e-3
Momentum        0.5
Patch size      256×256
Solver          Adam
Weight decay    0.1e-3

Table 4.1 Regularization and training parameters
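Purely as an illustration, these hyper-parameters could be gathered into a single options struct; the field names below are generic placeholders rather than the exact option names of the MatConvNet training routine used in the thesis.

```matlab
% Training hyper-parameters of Table 4.1 collected into one struct (field
% names are illustrative, not the MatConvNet option names).
opts.batchSize    = 2;
opts.numEpochs    = 50;
opts.learningRate = 1e-3;
opts.momentum     = 0.5;
opts.patchSize    = [256 256];
opts.solver       = 'adam';
opts.weightDecay  = 0.1e-3;
```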

The overall pipeline then consists of selecting a desired style image and initializing the training of the network. During the experiments, the style loss and especially the L2 loss seemed to converge rather quickly (see figure 4.2 for details), and the results did not change greatly after around 20 epochs. After training, the trained style loss network could then be added on top of the existing camera model. This allowed the style transfer of an input image from MS data to a style-transferred RGB image.

The next section presents the results for the four different style images used, which were selected rather arbitrarily. However, all of them do have distinct stylistic features, such as the clear and strong colouring and linework in The Great Wave, and the rolling paint strokes in The Starry Night. The more impressionistic style is also something that could have had a great impact on the smaller-scale details of the input image, in the form of paint strokes instead of just the original textures.


Figure 4.2 Loss during net training with ’The Starry Night’ as the style image

4.3 Results

The results can be seen in figures 4.3–4.6. It can be seen that while the results mostly failed to change the stylistic content of the images, the colours were captured better. The model did produce some unconventional effects with regard to the contrast of the images: light elements in the original image turned dark in the transferred image and vice versa, giving the style-transferred images somewhat of a negative-image feeling. Also, the image in figure 4.3 (a) has become very dark, to the point of being mostly black. This is most likely because the butterfly image has very different colours compared to the style image, with the colour chart image (figure 4.3 (d)) giving a good indication of how the colours are actually transferred by this model.

The content of the images did not properly change along with the colours, and the only real change happened on a very small scale, giving the images a noticeable texture. This was probably caused by the VGG loss proposed in the literature not producing the expected results, which is why an L2 loss was used as the content loss for the final results, setting the VGG loss of equation 3.13 to zero. As was found by Johnson et al., per-pixel losses are not as good at capturing stylistic patterns, which is most likely why this model's style transfer only really worked for colour. The style images with more distinct and prevalent features did produce some noticeable patterns, as can be seen in figures 4.3 and 4.6. With Hokusai there are some cyclical patterns, probably coming from the higher-frequency content of the painting. Similarly, The Starry Night seemed to produce a circular pattern from the stars in the painting.

Figure 4.3 The Starry Night (van Gogh, 1889): (a) Butterfly, (b) CD, (c) Flowers, (d) Colour chart

Figure 4.4 Wheat Field With Cypresses (van Gogh, 1889): (a) Butterfly, (b) CD, (c) Flowers, (d) Colour chart

On closer inspection it can be seen that the model at least slightly changed the spatial features of the images. However, these details are very small compared to the expected results, where the style transfer can mimic brush strokes and even broader elements from the style images. Even getting these kinds of results with the complete MS imaging pipeline involved some problems, which are briefly discussed in the following section.

Figure 4.5 San Giorgio Maggiore at Dusk (Monet, 1908–1912): (a) Butterfly, (b) CD, (c) Flowers, (d) Colour chart

Figure 4.6 The Great Wave off Kanagawa (Hokusai, 1829–1832): (a) Butterfly, (b) CD, (c) Flowers, (d) Colour chart

4.4 Discussion

Initially, the model was successful only in transferring the general colour scheme of the input image to somewhat match the style image's most prominent colour. The overall colour scheme of the results was somewhat similar to the style image, albeit monochromatic. With these initial results, the content of the input image did not change at all, which is why some changes were made to the implementation for subsequent test runs. The first experiment was adding another loss function based on the L2 loss, to try to get the content of the image to change. However, it was quickly noticed that with two content loss functions the model was not converging at all with regard to the content losses.

The model was then run with both of the content loss functions (the VGG loss of equation 3.10 and the L2 loss) separately, but even then, the results did not come out as expected. Now the content of the image was changing with the overall structure retained, but the transfer seemed to create very bizarre cyclical patterns that did not visually correspond to the style image. From these two initial results it was clear that there would be a lot to optimize and experiment with in the model in order to reach the results reported in the literature.

Figure 4.7 Initial result

Figure 4.8 Experiment with two content losses

One possible way to better optimize the network for style transfer was to get more imaging data. Using MS data as input made this a bit harder, but the results seemed to improve after using a database five times as large as the one used for the first results. However, this naturally increased the already long training times, making different experiments less viable. Then again, with this improvement and some bug fixes the results came out looking a lot better, as seen in the previous section in figures 4.3–4.5. For these, the L2 loss was used instead of the VGG loss, because with the VGG loss the model did not converge. While the style transfer again worked as more of a colour filter, the colours are no longer monochromatic and can be seen to resemble the colour content of the style image.


Overall, it can be noted that during the relatively short development of the model for this work, there was already clear progress when comparing the first results with the latest ones. This suggests that with future development the desired results could be achieved even with this more unique pipeline.


5 Conclusions

From the results of this experiment, a few things about the model can be noted. It is clear that the model failed to produce the expected results of e.g. Johnson et al. [14], but it did still create a unique output image whose colours were rather clearly derived from the style image. Even then, the overall transferred colour scheme seemed to be almost a negative of the original image's colours. Also, the colour transfer worked a bit unexpectedly in many cases, as can be seen from the colour responses in the colour chart images. In some cases this produced better-looking results, but some images struggled greatly with overall clarity.

In terms of moving what is often done in post-processing into the imaging process itself, the results were still promising. It is clear that applications can be added relatively easily to a camera model like this, even though the results were not ideal.

With more time to fine-tune the loss functions, as well as the whole network, more accurate results might be achievable. Optimizing the speed of the model would also be welcome, as training often took days instead of hours. This was at least partly due to the fact that MS data was used, which required far more computation than RGB data. There are still many fixes and improvements to be made in the optics and sensor modules, as well as in the overall layered structure, meaning a more robust implementation is possible in the future.


References

1. Robinson Linda. Art of Professional Photography. 1st ed. Chandni Chowk, Delhi: Global Media, 2007.
2. Ghosh S. K. Digital Image Processing. Oxford, England: Alpha Science International Ltd., 2013.
3. McAndrew Alasdair. A Computational Introduction to Digital Image Processing. 2nd ed. Boca Raton, FL: Chapman and Hall/CRC, an imprint of Taylor and Francis, 2015.
4. Sitzmann Vincent, Diamond Steven, Peng Yifan, Dun Xiong, Boyd Stephen, Heidrich Wolfgang, Heide Felix, and Wetzstein Gordon. End-to-end optimization of optics and image processing for achromatic extended depth of field and super-resolution imaging. ACM Transactions on Graphics 2018; 37:1–13.
5. Hawking Stephen. Ajan lyhyt historia. Revised and expanded illustrated edition. Porvoo: WSOY, 1997.
6. Goodman Joseph W. Introduction to Fourier Optics. 2nd ed. New York: McGraw-Hill, 1996.
7. Venkatesan Ragav and Li Baoxin. Convolutional Neural Networks in Visual Computing: A Concise Guide. 1st ed. Data-Enabled Engineering. Portland: CRC Press, 2018.
8. Abdi Herve. Neural Networks. Sage University Papers Series: Quantitative Applications in the Social Sciences, 07-0124. Thousand Oaks, CA: SAGE, 1999.
9. Arbib Michael A. The Handbook of Brain Theory and Neural Networks. 2nd ed. A Bradford Book. Cambridge, MA: MIT Press, 2003.
10. Barbastathis George, Ozcan Aydogan, and Situ Guohai. On the use of deep learning for computational imaging. Optica 2019; 6:921–43.
11. Simonyan Karen and Zisserman Andrew. Very Deep Convolutional Networks for Large-Scale Image Recognition. 2015. arXiv: 1409.1556 [cs.CV].
12. Habibi Aghdam Hamed. Guide to Convolutional Neural Networks: A Practical Application to Traffic-Sign Detection and Classification. Cham, 2017.
13. Gatys Leon A., Ecker Alexander S., and Bethge Matthias. A Neural Algorithm of Artistic Style. 2015. Available from: https://arxiv.org/abs/1508.06576
14. Johnson Justin, Alahi Alexandre, and Fei-Fei Li. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. 2016. Available from: https://arxiv.org/abs/1603.08155
15. Elad Michael and Milanfar Peyman. Style Transfer Via Texture Synthesis. IEEE Transactions on Image Processing 2017; 26:2338–51.
16. Ruder Manuel, Dosovitskiy Alexey, and Brox Thomas. Artistic Style Transfer for Videos and Spherical Images. International Journal of Computer Vision 2018; 126:1199–219.
17. Cheng Ming-Ming, Liu Xiao-Chang, Wang Jie, Lu Shao-Ping, Lai Yu-Kun, and Rosin Paul L. Structure-Preserving Neural Style Transfer. IEEE Transactions on Image Processing 2020; 29:909–20.
18. Zabaleta Itziar and Bertalmío Marcelo. Photorealistic style transfer for video. Signal Processing: Image Communication 2021; 95:116240.
19. Akpinar Ugur, Sahin Erdem, Meem Monjurul, Menon Rajesh, and Gotchev Atanas. Learning Wavefront Coding for Extended Depth of Field Imaging. 2020. arXiv: 1912.13423 [eess.IV].
20. Peng Yifan, Veeraraghavan Ashok, Heidrich Wolfgang, and Wetzstein Gordon. Deep optics: joint design of optics and image recovery algorithms for domain-specific cameras. In: ACM SIGGRAPH 2020 Courses. ACM, 2020:1–133.
21. Langford Michael John. Langford's Advanced Photography. 7th ed. Boston: Focal, 2008.
22. Hirakawa K. and Wolfe P. J. Spatio-Spectral Color Filter Array Design for Optimal Image Recovery. IEEE Transactions on Image Processing 2008; 17:1876–90. doi: 10.1109/TIP.2008.2002164.
23. Shimano Noriyuki. Optimization of spectral sensitivities with Gaussian distribution functions for a color image acquisition device in the presence of noise. Optical Engineering 2006; 45:1–8. doi: 10.1117/1.2159480.
24. Vora P. L. and Trussell H. J. Mathematical methods for the design of color scanning filters. IEEE Transactions on Image Processing 1997; 6:312–20. doi: 10.1109/83.551700.
25. Koonce Brett. Convolutional Neural Networks with Swift for Tensorflow: Image Recognition and Dataset Categorization. Berkeley, CA: Apress, 2021.
26. VGG-16 convolutional neural network - MATLAB vgg16. (Accessed 22.4.2021). Available from: https://se.mathworks.com/help/deeplearning/ref/vgg16.html
27. Mahendran Aravindh and Vedaldi Andrea. Understanding deep image representations by inverting them. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015:5188–96.
28. Zhang Richard, Isola Phillip, Efros Alexei A., Shechtman Eli, and Wang Oliver. The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018:586–95.
29. Edmund Optics Inc. 2021. (Accessed 3.4.2021). Available from: https://www.edmundoptics.com/p/6mm-dia-x-36mm-fl-vis-0-coated-uv-plano-convex-lens/37087/
30. Choi Inchang, Jeon Daniel S., Nam Giljoo, Gutierrez Diego, and Kim Min H. High-Quality Hyperspectral Reconstruction Using a Spectral Prior. ACM Transactions on Graphics (Proc. SIGGRAPH Asia 2017) 2017 (Accessed 7.9.2021); 36:218:1–13. doi: 10.1145/3130800.3130810.
