
COMPUTATIONAL MULTIFOCAL NEAR-EYE DISPLAY WITH HYBRID REFRACTIVE-DIFFRACTIVE OPTICS

Ugur Akpinar, Erdem Sahin, Atanas Gotchev

Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland

ABSTRACT

We present a multifocal computational near-eye display that employs a static diffractive optical element (DOE) in tandem with a refractive lens. The DOE is co-optimized with convolutional neural network-based preprocessing to achieve the desired multifocal display point spread function in an optimal manner. In simulations, we demonstrate a multifocal display that can deliver sharp images at three distinct depths, sampling the dioptric depth range uniformly from 3 diopters to infinity.

Index Terms— Computational display, Optics, Neural network

1. INTRODUCTION

An ideal three-dimensional (3D) near-eye display (NED) aims at delivering all physiological depth cues of the human visual system (HVS) accurately, to ensure a realistic as well as comfortable viewing experience. In most of the existing NEDs, the stereo cues (binocular disparity and vergence) are simply created by projecting the corresponding two-dimensional (2D) image, with correct perspective, to each eye, and motion parallax is provided via head tracking. However, it has been challenging to deliver the correct focus cues (accommodation and retinal defocus blur). The NEDs that fail to overcome this challenge suffer from the well-known vergence-accommodation conflict (VAC), which has been reported to be an important factor that can cause visual discomfort and fatigue [1].

There exist mainly two broad categories of approaches addressing the VAC problem: enabling the focus cues or making the display accommodation-invariant (AI) [2, 3]. The approaches enabling focus cues include: varifocal, where the focal depth of the display is dynamically altered to match the converged depth, e.g., using a focus-tunable lens or varying the lens-to-sensor distance [4, 5]; multifocal, where a stack of 2D displays is spatially multiplexed or multiple depths are covered by time multiplexing through fast varifocal optics [6, 7]; and light field and holographic, where the complete field of light due to a given 3D scene is created, in the former case geometrically as a collection of all desired rays [8, 9] or, in the latter case, through wavefronts relying on wave diffraction and interference [10]. The AI displays, on the other hand, aim to remove the optical defocus blur cue. In such a case, the accommodation is expected to be (cross-)driven only by the binocular disparity, and thus the VAC is avoided.

In Maxwellian-type displays, the effective entrance pupils of the eyes are reduced by projecting images through small pinholes, so that the retinal images appear sharp over a large depth range without perceivable defocus blur [11]. Although such an approach is simple to implement, it significantly reduces the light throughput. A more efficient approach is to achieve a depth-invariant display point spread function (PSF) at the retina, which is implemented, for instance, by employing a focus-tunable lens that changes the focal depth of the display and scans the scene depth range faster than the temporal resolution of the eye [12]. In this approach, unlike the above-mentioned multifocal technique, the display content is fixed. In particular, the display image is deconvolved with an estimated (time-averaged) display PSF, which results in AI perceived images when used with the focus-tunable optics.

In this paper, we present a computational multifocal NED, which can provide a set of focal planes at intended discrete depths. Unlike most of the existing multifocal NEDs, our approach relies on fixed static optics, in particular an eyepiece consisting of a refractive lens and a diffractive optical element (DOE). This is a significant advantage in terms of form factor and system complexity (e.g., it avoids issues related to adaptive optics, such as the need for very high frame-rate displays). Additionally, such a custom design enables arbitrary PSF engineering, significantly increasing the design degrees of freedom. An additional preprocessing stage further helps achieve the desired system response. We utilize a convolutional neural network (CNN) based preprocessing deconvolution algorithm for this purpose. By combining a differentiable display model with the preprocessing, we are able to engineer and achieve the desired (fixed) multifocal PSF in an optimal manner. Such an end-to-end optimization framework has previously been utilized in various imaging problems, though mostly for image capture, such as extended depth of field imaging [13, 14].

At this point, it is also worth mentioning the multifocal cameras that are the image-capture analogues of multifocal NEDs, especially those employing spatial multiplexing of the imaging lens [15, 16], as well as multifocal contact and intraocular lenses [17] that also combine (multiplex) lenses with different focal powers in a single optical element. It is obvious that, in the latter case, the HVS has to deal with the necessary postprocessing to better interpret the multifocal image, whereas in the former case, such deconvolution can be applied as part of the computational imaging approach. Nevertheless, independently designed (multifocal) optics and deconvolution algorithms, as applied in [15], are likely to be suboptimal.

2. METHOD

Fig. 1: Overall representation of the proposed method.

Fig. 1 schematizes the computational multifocal NED system, illustrating the end-to-end iterative optimization of the optics (i.e., the DOE) and the CNN-based preprocessing deconvolution layer (D-CNN) based on the desired perceived image $I_p$ (modelled through the Blur block) and a defined quality metric (Loss). Here the preprocessing stage can be thought of as analogous to the postprocessing in computational cameras, e.g., the deconvolution operation in extended depth of field imaging [13, 14], which aims to computationally complement the display optics in achieving the desired characteristics in the final visualized image $\hat{I}_p$. During optimization (training), we provide an all-in-focus image $I$ and the accommodation distance $z$, which is picked randomly from a set of discrete depth values within the target scene depth range, as inputs to the network. First, the D-CNN takes the input image $I$ and outputs the image to be driven to the display, $I_d$. Then, a physically realizable optics model simulates the perceived image $\hat{I}_p$ based on the display image $I_d$ and the accommodation distance $z$.

As we aim at creating a multifocal display that outputs sharp (i.e., focused) images at certain depth values, we divide the discrete set of $z$ values into two subsets, namely main depths and intermediate depths. The conditional blurring block (Blur) passes the all-sharp input image if $z$ corresponds to one of the main (focal) depths; otherwise, for an intermediate depth, it blurs the input image based on the amount of defocus with respect to the nearest main depth. Finally, the loss function is calculated between the desired perceived image $I_p$ and the network output $\hat{I}_p$, and the system is optimized in an end-to-end manner using gradient-based optimization. In the following subsections, each component of the end-to-end system is described in detail.
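The end-to-end optimization described above can be summarized in a few lines of training-loop pseudocode. The following PyTorch-style sketch is illustrative only: `dcnn`, `optics`, `conditional_blur`, and `multifocal_loss` stand in for the D-CNN, the differentiable display model of Sec. 2.1, the Blur block, and the loss of Sec. 2.3, and are assumed to be supplied by the caller; the paper does not publish an implementation.

```python
# Minimal sketch of one iteration of the end-to-end optimization of Fig. 1.
# All callables passed in are placeholders for the blocks described in Sec. 2.
import random

def train_step(dcnn, optics, conditional_blur, multifocal_loss, optimizer,
               image, main_depths, intermediate_depths):
    """Sample an accommodation depth z, render the perceived image through the
    differentiable display model, and compare against the conditionally blurred target."""
    z = random.choice(main_depths + intermediate_depths)      # accommodation distance (D)
    target = image if z in main_depths else conditional_blur(image, z)
    display_image = image + dcnn(image)                       # residual D-CNN output I_d
    perceived = optics(display_image, z)                      # blur with PSF h_{lambda,z}(Phi), Eqs. (1)-(4)
    loss = multifocal_loss(target, perceived, display_image)  # quality metric E(I_p, I^_p), Eq. (9)
    optimizer.zero_grad()
    loss.backward()    # gradients reach both the CNN weights and the DOE phase Phi
    optimizer.step()
    return loss.item()
```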

2.1. Optical design

Fig. 2: Proposed computational multifocal NED setup.

Fig. 2 illustrates the optical layout of the multifocal display. In particular, we employ the DOE together with the refractive lens, which is expected to focus the input image at multiple depths while decreasing the chromatic aberrations exhibited by the refractive-lens-only or DOE-only cases. Assuming a thin lens model for the eye and a planar retina, and given the focus distance of the eye from the lens plane, $z$, there exists a reference plane $(x, y)$, which is the conjugate plane of the retina (e.g., the black curve in Fig. 2). In the proposed implementation, we reconstruct images at the reference plane. The image formation can then be described via the PSFs.

Under monochromatic illumination with wavelength $\lambda$, and within the limits of paraxial optics, the incoherent PSF at the reference plane $(x, y)$, $h_{\lambda,z}(x, y)$, is described as [18]

$$h_{\lambda,z}(x, y) \propto \left| \mathcal{F}\{Q(s, t)\} \right|^2_{\left( \frac{x}{\lambda z},\, \frac{y}{\lambda z} \right)}, \qquad (1)$$

where $Q(s, t)$ represents the generalized pupil function at the lens plane, and $\mathcal{F}\{\cdot\}$ is the Fourier transform operator. Assuming a thin lens model for the refractive lens, the generalized pupil function $Q(s, t)$ is derived as

$$Q_{\lambda,z}(s, t) = A(s, t)\, \exp\!\big(j\Phi_{\lambda}(s, t)\big)\, \exp\!\left(j\Psi_{\lambda,z}\, \frac{s^2 + t^2}{r^2}\right), \qquad (2)$$

where $A(s, t)$ is the circular aperture function, $\Phi_{\lambda}(s, t)$ is the phase delay due to the DOE,

$$\Psi_{\lambda,z} = \frac{k}{2} \left( -\frac{1}{z} + \frac{1}{z_d} - \frac{1}{f_{\lambda}} \right) r^2, \qquad (3)$$

is the defocus coefficient, $k = 2\pi/\lambda$ is the wavenumber, $f_{\lambda}$ is the wavelength-dependent effective focal length of the underlying refractive lens, and $r$ is the aperture radius. Finally, the perceived image at $z$, $\hat{I}_p(x, y)$, is the convolution between the display image and the incoherent PSF,

$$\hat{I}^{\lambda}_{p,z}(x, y) = I^{\lambda}_d(x, y) * h_{\lambda,z}(x, y). \qquad (4)$$

In order to enforce the DOE to create the multifocal effect, we employ conditional blurring on the input image to derive the desired (ground-truth) perceived image. That is, we apply blur on input images with a pre-calculated ground-truth PSF $h_b(x, y)$ if the reconstruction depth is one of the predefined intermediate depths. In such a case, the ground-truth image, $I_p(x, y)$, is defined to be

$$I^{\lambda}_p(x, y) = I^{\lambda}(x, y) * h_b(x, y). \qquad (5)$$

The PSF of Eq. 5 is derived using Eq. 1 and Eq. 2, where $\Phi_{\lambda}(s, t)$ is set to zero and $\Psi_{\lambda,z} = 74$ for all $\lambda$. Such a defocus value corresponds to the amount of blur when the object is 0.5 diopters (D) away from the focus plane of the refractive lens for the wavelength $\lambda = 530$ nm.
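As a quick sanity check of this quoted value, the defocus coefficient of Eq. 3 can be evaluated for a 0.5 D focus error at $\lambda = 530$ nm using the 5 mm aperture radius given later in Sec. 3; the short computation below (not from the paper) indeed gives $\Psi \approx 74$.

```python
# Sanity check of the stated defocus value Psi = 74 (Eq. 3) for a 0.5 D focus
# error at lambda = 530 nm, with the aperture radius r = 5 mm used in Sec. 3.
import numpy as np

lam = 530e-9                 # wavelength (m)
r = 5e-3                     # aperture radius (m)
k = 2 * np.pi / lam          # wavenumber
delta_dioptric = 0.5         # |(-1/z + 1/z_d - 1/f)| expressed in diopters (1/m)
psi = (k / 2) * delta_dioptric * r**2
print(round(psi, 1))         # ~74.1, matching the Psi = 74 quoted in the text
```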

In the training process, the phase delay due to the DOE, $\Phi_{\lambda}(s, t)$, is optimized. A physically realizable transparent optical element, such as a DOE, exhibits a phase delay that can be formulated via its thickness function $d(s, t)$ as

$$\Phi_{\lambda}(s, t) = k (n_{\lambda} - 1)\, d(s, t), \qquad (6)$$

where $n_{\lambda}$ is the refractive index of the material at $\lambda$. Then, if the phase delay for a nominal wavelength $\lambda_0$, $\Phi_{\lambda_0}(s, t)$, is to be optimized, the phase delay for a given $\lambda$ can be derived using Eq. 6 as

$$\Phi_{\lambda}(s, t) = \Phi_{\lambda_0}(s, t)\, \frac{\lambda_0 (n_{\lambda} - 1)}{\lambda (n_{\lambda_0} - 1)}. \qquad (7)$$

Similarly, the wavelength-dependent focal length of the refractive lens, $f_{\lambda}$, is modeled through $f_{\lambda_0}$ as

$$f_{\lambda} = f_{\lambda_0}\, \frac{n_{\lambda_0} - 1}{n_{\lambda} - 1}. \qquad (8)$$
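For illustration, the image-formation model of Eqs. 1-3 can be sampled numerically as in the NumPy sketch below. The grid size, the zero DOE phase in the example call, and the approximate 29.8 mm green-channel focal length are assumptions for demonstration; the actual system optimizes the DOE phase via gradient descent and applies the proper coordinate scaling $x/(\lambda z)$ of Eq. 1.

```python
# Minimal NumPy sketch of Eqs. (1)-(3): build the generalized pupil function on a
# sampled aperture and obtain the incoherent PSF as the squared magnitude of its
# Fourier transform (up to the x/(lambda z) coordinate scaling of Eq. 1).
import numpy as np

def incoherent_psf(phi, lam, z, z_d, f_lam, r, n_samples=512):
    """PSF h_{lambda,z} (up to scale) for a DOE phase delay `phi` (n_samples x n_samples)."""
    k = 2 * np.pi / lam
    s = np.linspace(-r, r, n_samples)
    S, T = np.meshgrid(s, s)
    A = (S**2 + T**2 <= r**2).astype(float)                 # circular aperture A(s, t)
    psi = (k / 2) * (-1 / z + 1 / z_d - 1 / f_lam) * r**2   # defocus coefficient, Eq. (3)
    Q = A * np.exp(1j * phi) * np.exp(1j * psi * (S**2 + T**2) / r**2)  # Eq. (2)
    h = np.abs(np.fft.fftshift(np.fft.fft2(Q)))**2           # Eq. (1)
    return h / h.sum()

# Example (all values assumed): lens only (phi = 0), eye accommodated at 1 D,
# display at z_d = 28.57 mm, green-channel focal length ~29.8 mm, r = 5 mm.
psf = incoherent_psf(phi=np.zeros((512, 512)), lam=530e-9,
                     z=1.0, z_d=28.57e-3, f_lam=29.8e-3, r=5e-3)
```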

2.2. Preprocessing CNN

Fig. 3: The preprocessing network (D-CNN), based on the U-net architecture [19] and equipped with skip connections.

The proposed network for the preprocessing step is shown in Fig. 3. In particular, we employ the U-net architecture [19] as a residual network; that is, the output of the network is the difference image between the display image $I_d$ and the input image $I$. U-net is a multi-scale network consisting of encoding and decoding parts. In the encoding stage, $3 \times 3$ convolutions followed by rectified linear unit (ReLU) activation functions are utilized, shown as graded yellow in Fig. 3. At the end of each scale, the output is downsampled by 2 using a max pooling layer (red in Fig. 3). The decoding stage starts with upsampling the output of the lower scale by a transposed convolution (shaded blue) and concatenation with the output of the encoding stage at the corresponding level (skip connections). After that, the data is further processed via $3 \times 3$ convolution and ReLU layers. The channel sizes of the filters are 32, 64, 128, and 256 at each scale, respectively, as given under each block in Fig. 3. The final output of the decoding stage is processed by a $1 \times 1$ convolution layer with the channel size equal to the input channel size of 3. After that, we add a summation layer that takes the all-sharp image $I$ and the output of the U-net and gives the display image $I_d$. Such a connection enables the network to learn the difference image, which is expected to be sparse.
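A compact PyTorch rendering of this residual U-net is sketched below. The per-scale channel widths follow the description above; the number of convolution blocks per scale (one here) and other minor details are assumptions, since the paper does not specify them. With the $256 \times 256$ training patches used in Sec. 3, all skip connections line up.

```python
# Minimal PyTorch sketch of the residual D-CNN: U-net encoder/decoder with 3x3
# conv + ReLU blocks, 2x max pooling, transposed-conv upsampling, skip
# concatenation, a final 1x1 conv, and a global residual connection I_d = I + U(I).
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True))

class DCNN(nn.Module):
    def __init__(self, channels=(32, 64, 128, 256)):
        super().__init__()
        c1, c2, c3, c4 = channels
        self.enc1, self.enc2, self.enc3 = conv_block(3, c1), conv_block(c1, c2), conv_block(c2, c3)
        self.bottleneck = conv_block(c3, c4)
        self.pool = nn.MaxPool2d(2)
        self.up3, self.dec3 = nn.ConvTranspose2d(c4, c3, 2, stride=2), conv_block(2 * c3, c3)
        self.up2, self.dec2 = nn.ConvTranspose2d(c3, c2, 2, stride=2), conv_block(2 * c2, c2)
        self.up1, self.dec1 = nn.ConvTranspose2d(c2, c1, 2, stride=2), conv_block(2 * c1, c1)
        self.out = nn.Conv2d(c1, 3, 1)                        # 1x1 conv back to 3 channels

    def forward(self, image):
        e1 = self.enc1(image)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        b = self.bottleneck(self.pool(e3))
        d3 = self.dec3(torch.cat([self.up3(b), e3], dim=1))  # skip connection
        d2 = self.dec2(torch.cat([self.up2(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return image + self.out(d1)                           # residual: display image I_d
```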

2.3. Loss function

We train the network using a regularized loss function, which minimizes the L1 loss and maximizes the structural similarity (SSIM) [20] between the network output $\hat{I}_p$ and the ground truth $I_p$. Such a loss function provides a good compromise between texture detail and perceptual quality. An additional regularization term is employed on the display image $I_d$ in order to keep the pixel values within the display dynamic range of $[0, 1]$. The loss function is mathematically described as

$$E(I_p, \hat{I}_p) = L_{l1}(I_p, \hat{I}_p) + L_{ssim}(I_p, \hat{I}_p) + \alpha R(I_p, \hat{I}_p) + \gamma R_d(I_d), \qquad (9)$$

where $L_{l1}(I_p, \hat{I}_p)$ is the L1 loss and $L_{ssim}(I_p, \hat{I}_p)$ is the SSIM loss [21]

$$L_{ssim}(I_p, \hat{I}_p) = 1 - \mathrm{SSIM}(I_p, \hat{I}_p), \qquad (10)$$

$R(I_p, \hat{I}_p)$ is the regularization on the network output, and $R_d(I_d)$ is the regularization on the display image. $R_d(I_d)$ is the indicator function which gives 0 for the pixels within $[0, 1]$ and 1 elsewhere. By setting $\gamma \to \infty$, it approaches a hard constraint on the display image; in practice we set $\gamma = 150$. We utilize the dark channel prior [22] as the network regularizer $R(I_p, \hat{I}_p)$. The dark channel of a color image, $J$, is defined as the minimum of the color channels within each image patch, i.e.,

$$J(\mathbf{x}) = \min_{\lambda \in \{R, G, B\}} \; \min_{\mathbf{y} \in \Omega(\mathbf{x})} I^{\lambda}(\mathbf{y}), \qquad (11)$$

where $\mathbf{x}$, $\mathbf{y}$ are pixel indices and $\Omega(\mathbf{x})$ is the neighborhood of $\mathbf{x}$. In the proposed method, the regularizer is chosen to be the weighted L1 norm of the dark channel of the network output, that is,

$$R(I_p, \hat{I}_p) = \big| \exp(-\beta J_p)\, \hat{J}_p \big|_1, \qquad (12)$$

where $\exp(-\beta J_p)$ decreases the weights of the pixels within the bright regions of the ground-truth image. During training, we set $\alpha = 0.005$, $\beta = 10$, and a patch size of 17 pixels.
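The loss of Eqs. 9-12 maps to a few tensor operations, as in the hedged PyTorch sketch below. The SSIM term is left as a user-supplied callable, the dark-channel minimum over a 17-pixel patch is implemented via negated max pooling, and the range term is written as a literal indicator; since a hard indicator has zero gradient almost everywhere, a practical implementation may substitute a differentiable surrogate. The mean-based reductions are a simplification of the L1 norms in the text.

```python
# Minimal PyTorch sketch of the regularized loss in Eqs. (9)-(12); hyperparameters
# alpha, beta, gamma and the 17-pixel patch follow the values quoted in the text.
import torch
import torch.nn.functional as F

def dark_channel(img, patch=17):
    """Eq. (11): per-pixel minimum over color channels and a patch x patch neighborhood."""
    min_c = img.min(dim=1, keepdim=True).values                        # min over R, G, B
    return -F.max_pool2d(-min_c, patch, stride=1, padding=patch // 2)  # min-pool via negated max-pool

def multifocal_loss(target, perceived, display_image, ssim_fn,
                    alpha=0.005, beta=10.0, gamma=150.0):
    l1 = (target - perceived).abs().mean()
    l_ssim = 1.0 - ssim_fn(target, perceived)                 # Eq. (10), SSIM supplied externally
    # Eq. (12): weighted L1 norm of the dark channel of the network output.
    j_target, j_out = dark_channel(target), dark_channel(perceived)
    reg = (torch.exp(-beta * j_target) * j_out).abs().mean()
    # Indicator-style penalty keeping the display image inside [0, 1].
    out_of_range = ((display_image < 0) | (display_image > 1)).float().mean()
    return l1 + l_ssim + alpha * reg + gamma * out_of_range
```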


3. SIMULATION RESULTS

During training, we assume a commercially available plano-convex lens as the underlying refractive lens. The effective focal length is $f_{\lambda_s} = 30$ mm at the specification wavelength $\lambda_s = 587.6$ nm, whereas the aperture radius is taken to be $r = 5$ mm. The materials of both the refractive lens and the DOE are assumed to be fused silica, with wavelength-dependent refractive indices of $n_{\lambda_R} = 1.458$, $n_{\lambda_G} = 1.461$, and $n_{\lambda_B} = 1.466$. The network is trained with color images, accounting for the wavelengths $\lambda_R = 600$ nm, $\lambda_G = 530$ nm, and $\lambda_B = 450$ nm for the red, green, and blue channels, respectively. The lens-to-OLED distance is set as $z_d = 28.57$ mm, focusing the refractive lens at 1.5 D for the nominal wavelength $\lambda_0 = \lambda_G$. The pixel pitch of the OLED is 8.7 µm, which corresponds to an angular resolution of 1 arcmin.
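These design values can be cross-checked with the thin-lens relations of Sec. 2.1. The short computation below assumes a refractive index of about 1.458 at the 587.6 nm specification wavelength (not given in the paper) and neglects the eye relief; under those assumptions the virtual image indeed falls near 1.5 D and one display pixel subtends roughly 1 arcmin.

```python
# Consistency check of the quoted design values (assumed index at 587.6 nm, eye
# relief neglected for the angular-resolution estimate).
import numpy as np

f_spec, n_spec, n_green = 30e-3, 1.458, 1.461
f_green = f_spec * (n_spec - 1) / (n_green - 1)   # green-channel focal length via Eq. (8)
z_d = 28.57e-3
image_dist = 1 / (1 / f_green - 1 / z_d)          # thin-lens image distance (negative: virtual)
print(abs(1 / image_dist))                         # ~1.45-1.5 D virtual image depth
print(np.degrees(8.7e-6 / z_d) * 60)               # ~1.05 arcmin per display pixel
```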

The proposed computational multifocal display is optimized using a mixture of natural images from [23] and synthetic images from [24]. The images are divided into patches of $256 \times 256$ pixels, and the batch size is set to 2. The target accommodation range is set as 0-3 D, which is divided into a discrete set of depths with a 0.5 D interval. At each iteration, the reconstruction depth is chosen randomly. We set the intermediate depth values as 0 D, 1 D, 2 D, and 3 D, at which the ground-truth images $I_p$ are created by applying blur on the input image, as described in Sec. 2. For the main reconstruction depth values, i.e., 0.5 D, 1.5 D, and 2.5 D, we set $I_p = I$. Such an arrangement is based on previous studies [25] indicating that a difference of ±0.5 D between the vergence and accommodation depths is within the zone of comfort. To speed up the training, we initialize $\Phi_{\lambda_0}(s, t)$ with the superposition of three diffractive lenses such that the hybrid refractive-diffractive system focuses at the main reconstruction depths for $\lambda_0$.
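The initialization of $\Phi_{\lambda_0}(s, t)$ as a superposition of three diffractive lenses is not detailed further in the paper; one plausible reading, sketched below, takes the wrapped phase of the complex sum of three quadratic lens phases whose added powers (about -1, 0, and +1 D on top of the 1.5 D refractive focus) place the hybrid foci at the main reconstruction depths. This should be read as an assumed realization, not the authors' exact recipe.

```python
# Illustrative DOE phase initialization: wrapped phase of the complex sum of three
# diffractive lenses. Powers, grid size, and wavelength are assumed example values.
import numpy as np

def multifocal_doe_phase(added_powers_diopters=(-1.0, 0.0, 1.0),
                         lam=530e-9, r=5e-3, n_samples=512):
    k = 2 * np.pi / lam
    s = np.linspace(-r, r, n_samples)
    S, T = np.meshgrid(s, s)
    rho2 = S**2 + T**2
    # A diffractive lens of power P (diopters) contributes a quadratic phase -k*P*rho^2/2.
    field = sum(np.exp(-1j * k * P * rho2 / 2) for P in added_powers_diopters)
    return np.angle(field)                                    # wrapped phase Phi_{lambda0}(s, t)

phi_init = multifocal_doe_phase()
```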

Fig. 4: The optimized height map at a fabrication resolution of 3 µm (a); one-dimensional cross-sections of the PSFs across the accommodation range (b).

Using Eq. 3, the limit of the defocus coefficient is $|\Psi| \leq 328$ for the target accommodation range of 0-3 D. Following the discussion in [14], we set the optimum mask sampling rate based on the defocus range, which corresponds to $\Delta s = 6$ µm. Please note, however, that such a sampling rate is utilized only during training.

Fig. 5: One-dimensional cross-sections of the MTFs at the accommodation distance of 0.5 D (a), and the MTFs within the assumed scene depth range for 1 CPD (b), 5 CPD (c), and 10 CPD (d). The MTF of the green channel of a conventional display without a phase mask (lens-only) is also plotted.

During the simulations, we upsample the optimized phase element to 3 µm via bicubic interpolation, which accounts for a typical fabrication resolution of the DOE. Fig. 4 illustrates the optimized DOE at the upsampled resolution, together with the one-dimensional PSFs at varying depths. We also present the modulation transfer function (MTF) analysis in Fig. 5, including the one-dimensional MTFs at 0.5 D, as well as the cross-sections of the MTFs throughout the scene depth range at three different spatial frequencies, namely 1, 5, and 10 CPD. As can be inferred from the PSFs and the MTFs, we observe a sharper response at the aimed focus depths of 0.5 D, 1.5 D, and 2.5 D and broader PSFs at the intermediate depths. However, compared to a conventional refractive lens with a focal depth of 1.5 D (dashed green plots), such improvement comes at the cost of degradation in the MTFs around the focal depth. One main objective of the multifocal display is to relax such a resolution-depth trade-off in an optimal manner. We aim at sacrificing as little as possible of the overall image quality while keeping the conflict between accommodation and convergence inside the zone of comfort [25]. In addition, the preprocessing is intended to further compensate for such degradation. However, it should also be noted that the D-CNN output is limited to the display dynamic range. In particular, we assume the display minimum and maximum brightness values to be 0 and 1, respectively. The preprocessing aims at finding the optimum display image within this range, which may not correspond to the global optimum.


Fig. 6: Comparison of the conventional stereoscopic display with a single refractive lens (top), the AI computational near-eye display proposed by Konrad et al. [12] (middle), and the proposed multifocal computational near-eye display (bottom). The PSNR/SSIM values are given under each image:

Depth    Conventional     AI display [12]   Proposed
2.5 D    20.25 / 0.695    22.18 / 0.783     23.23 / 0.795
2 D      22.70 / 0.735    22.08 / 0.785     23.52 / 0.789
1.5 D    27.18 / 0.849    23.15 / 0.809     24.09 / 0.810
1 D      23.75 / 0.796    22.14 / 0.794     23.30 / 0.784
0.5 D    20.65 / 0.754    22.32 / 0.799     24.73 / 0.827
0 D      19.17 / 0.724    20.25 / 0.754     21.43 / 0.756

It is also worth mentioning here that both the PSFs in Fig. 4 and the MTFs in Fig. 5 are derived by taking into account the finite size of the display pixels. That is, the PSF $h_{\lambda,z}(x, y)$ of Eq. 1 is convolved with the square display pixel. Therefore, the display resolution of 1 arcmin inherently constitutes an upper bound for the MTFs, corresponding to a maximum frequency of 30 CPD. During the simulations, the sampling step of the reconstruction grid is taken to be 0.5 arcmin (corresponding to 60 CPD), which accounts for the assumed maximum retinal resolution.

The delivered image quality of the computational multifocal NED is further compared with the conventional lens-only stereoscopic display focused at 1.5 D and with the AI computational display proposed by Konrad et al. [12], which utilizes focus-tunable lenses to sweep through the scene. We simulate the AI display in the discrete mode, where the lens is focused at 0.5 D, 1.5 D, and 2.5 D, and the preprocessing is done via Wiener deconvolution with the average PSF. The simulations are performed on a synthetic image from the TAU agent data set [24], which is reconstructed at six accommodation depths from 2.5 D to 0 D. The results are illustrated in Fig. 6, with the peak signal-to-noise ratio (PSNR) and SSIM values reported as a quantitative comparison for each reconstruction depth. As can be seen from the figure, the amount of blur significantly increases in the conventional display as the accommodation distance gets further away from the lens image plane. The AI display achieves significantly higher image quality at such distances, especially at the main reconstruction depths, and the proposed method achieves slightly better results than the AI display. The main advantage of the proposed method over the AI display is that the optics is composed of static elements, which significantly reduces the system complexity. However, in both methods we observe degradation in the overall image quality compared to the focal depth (1.5 D) of the conventional display, due to the above-mentioned resolution-depth trade-off. The degradation is especially visible as a milky haze effect due to reduced contrast. One main source of such artifacts is the limitation enforced by the display dynamic range, as previously discussed. That is, the output of the preprocessing step is bounded within [0, 1], which in turn limits the boosting of high-frequency components in the image. Nevertheless, in the quantitative analysis the maximum SSIM value of the proposed method is comparable with the maximum SSIM of the conventional method.

4. CONCLUSION

We present a computational multifocal NED to address the VAC. The proposed method takes advantage of a co-designed (preprocessing) deconvolution and hybrid refractive-diffractive optics to create multiple accommodation distances. In addition, the resolution-depth trade-off inherent to our optical setup is optimized by selecting the focus depths with a 1 D interval, which satisfies the zone-of-comfort condition. A critical advantage of our computational multifocal NED is that we utilize static optical elements, which significantly reduces the system complexity and further eases the integration of the proposed algorithm into commercially available headsets. As future work, we plan to implement a prototype display and perform subjective experiments to rigorously characterize the display, especially in terms of the VAC.


5. REFERENCES

[1] David M. Hoffman, Ahna R. Girshick, Kurt Akeley, and Martin S. Banks, "Vergence–accommodation conflicts hinder visual performance and cause visual fatigue," Journal of Vision, vol. 8, no. 3, pp. 33–33, Mar. 2008.

[2] Hong Hua, "Enabling focus cues in head-mounted displays," Proceedings of the IEEE, vol. 105, no. 5, pp. 805–824, 2017.

[3] Gregory Kramida, "Resolving the vergence-accommodation conflict in head-mounted displays," IEEE Transactions on Visualization and Computer Graphics, vol. 22, pp. 1912–1931, 2015.

[4] Nitish Padmanaban, Robert Konrad, Tal Stramer, Emily A. Cooper, and Gordon Wetzstein, "Optimizing virtual reality for all users through gaze-contingent and adaptive focus displays," Proceedings of the National Academy of Sciences, vol. 114, no. 9, pp. 2183–2188, 2017.

[5] Kaan Akşit, Ward Lopes, Jonghyun Kim, Peter Shirley, and David Luebke, "Near-eye varifocal augmented reality display using see-through screens," ACM Trans. Graph., vol. 36, no. 6, Nov. 2017.

[6] Kurt Akeley, Simon J. Watt, Ahna Reza Girshick, and Martin S. Banks, "A stereo display prototype with multiple focal distances," ACM Transactions on Graphics (TOG), vol. 23, no. 3, pp. 804–813, 2004.

[7] Gordon D. Love, David M. Hoffman, Philip J. W. Hands, James Gao, Andrew K. Kirby, and Martin S. Banks, "High-speed switchable lens enables the development of a volumetric stereoscopic display," Opt. Express, vol. 17, no. 18, pp. 15716–15725, Aug. 2009.

[8] Douglas Lanman and David Luebke, "Near-eye light field displays," ACM Transactions on Graphics (TOG), vol. 32, no. 6, pp. 1–10, 2013.

[9] F. Huang, K. Chen, and G. Wetzstein, "The light field stereoscope: Immersive computer graphics via factored near-eye light field displays with focus cues," ACM Trans. Graph. (SIGGRAPH), no. 4, 2015.

[10] Andrew Maimone, Andreas Georgiou, and Joel S. Kollin, "Holographic near-eye displays for virtual and augmented reality," ACM Transactions on Graphics (TOG), vol. 36, no. 4, pp. 1–16, 2017.

[11] Takahisa Ando, Koji Yamasaki, Masaaki Okamoto, Toshiaki Matsumoto, and Eiji Shimizu, "Retinal projection display using holographic optical element," in Practical Holography XIV and Holographic Materials VI. International Society for Optics and Photonics, 2000, vol. 3956, pp. 211–216.

[12] Robert Konrad, Nitish Padmanaban, Keenan Molner, Emily A. Cooper, and Gordon Wetzstein, "Accommodation-invariant computational near-eye displays," ACM Transactions on Graphics (TOG), vol. 36, no. 4, pp. 88, 2017.

[13] Vincent Sitzmann, Steven Diamond, Yifan Peng, Xiong Dun, Stephen Boyd, Wolfgang Heidrich, Felix Heide, and Gordon Wetzstein, "End-to-end optimization of optics and image processing for achromatic extended depth of field and super-resolution imaging," ACM Transactions on Graphics, vol. 37, no. 4, 2018.

[14] U. Akpinar, E. Sahin, and A. Gotchev, "Learning optimal phase-coded aperture for depth of field extension," in 2019 IEEE International Conference on Image Processing (ICIP), Sep. 2019, pp. 4315–4319.

[15] Anat Levin, Samuel W. Hasinoff, Paul Green, Frédo Durand, and William T. Freeman, "4D frequency analysis of computational cameras for depth of field extension," ACM Trans. Graph., vol. 28, no. 3, July 2009.

[16] Eyal Ben-Eliezer, Emanuel Marom, Naim Konforti, and Zeev Zalevsky, "Experimental realization of an imaging system with an extended depth of field," Appl. Opt., vol. 44, no. 14, pp. 2792–2798, May 2005.

[17] Daniel Carson, Warren Hill, Xin Hong, and Mutlu Karakelle, "Optical bench performance of AcrySof IQ ReSTOR, AT LISA tri, and FineVision intraocular lenses," Clinical Ophthalmology (Auckland, N.Z.), vol. 8, pp. 2105–2113, 2014.

[18] J. W. Goodman, Introduction to Fourier Optics, Roberts and Company Publishers, 2005.

[19] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.

[20] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.

[21] Hang Zhao, Orazio Gallo, Iuri Frosio, and Jan Kautz, "Loss functions for neural networks for image processing," arXiv preprint arXiv:1511.08861, 2015.

[22] Kaiming He, Jian Sun, and Xiaoou Tang, "Single image haze removal using dark channel prior," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 12, pp. 2341–2353, 2010.

[23] Hervé Jégou, Matthijs Douze, and Cordelia Schmid, "Hamming embedding and weak geometric consistency for large scale image search," in European Conference on Computer Vision. Springer, 2008, pp. 304–317.

[24] Harel Haim, Shay Elmalem, Raja Giryes, Alex Bronstein, and Emanuel Marom, "Depth estimation from a single image using deep learned phase coded mask," IEEE Transactions on Computational Imaging, pp. 298–310, 2018.

[25] Takashi Shibata, Joohwan Kim, David M. Hoffman, and Martin S. Banks, "The zone of comfort: Predicting visual discomfort with stereo displays," Journal of Vision, vol. 11, no. 8, pp. 11, 2011.
