
Minimum description length sparse modeling and region merging for lossless plenoptic image compression

Petri Helin, Pekka Astola, Student Member, IEEE, Bhaskar Rao, Fellow, IEEE, Ioan Tabus, Senior Member, IEEE

Abstract—This paper proposes a complete lossless compression method for exploiting the redundancy of rectified light-field data. The light-field data consists of an array of rectified subaperture images, called views for short, which are segmented into regions according to an optimized partition of the central view. Each region of a view is predictively encoded using a specifically designed sparse predictor, exploiting the smoothness of each color component in the current view and the cross-similarities with the other color components and the already encoded neighbor views. The views are encoded sequentially, in a spiral scanning order, each view being predicted from several similar neighbor views. The essential challenge for each predictor is choosing the most relevant regressors from a large number of possible regressors belonging to the neighbor views. The solution proposed here couples sparse predictor design with the minimum description length (MDL) principle, where the data description length is measured by an implementable code length, optimized for a class of probability models. The paper introduces a region merging segmentation under an MDL criterion for partitioning the views into regions, each having its own specific sparse predictor. In experiments, several fast sparse design methods are considered. The proposed scheme is evaluated over a database of plenoptic images, achieving better lossless compression ratios than straightforward usage of standard image and video compression methods on the spiral sequence of views.

Keywords—lossless compression, light-field coding, plenoptics, sparse prediction, minimum description length segmentation

I. INTRODUCTION

The field of plenoptics emerged recently as a prominent research area, although its roots date back to 1908 [1]. The initial, mostly academic, optics studies have evolved theoretically and technologically to such an extent that nowadays plenoptic cameras are commercially available for industrial applications and even for the consumer market.

Recently, the compressibility of captured light-fields has attracted attention in the scientific community as an important topic. Several studies of the lossy compression of images captured with recent plenoptic camera technologies have been published, see e.g. [2] and [3].

In view of the wide availability of the cameras, standardization efforts for the compression of plenoptic camera lenslet images are taking place under the JPEG-PLENO initiative of the JPEG standardization committee, and two challenges took place at major conferences, ICME 2016 and ICIP 2017.

The main focus of the standardization call for proposals and of the challenges is lossy compression over a wide range of compression rates. Several publications were submitted in response to the challenges [4], [5], [6], [7], [8], and a first comparative study of their results was published in [9]. The performance is illustrated in terms of the objective compression improvement with respect to the JPEG legacy compressor, or with respect to a straightforward usage of the HEVC compressor, and in terms of the subjective quality of the resulting stack of subaperture images, as in [9]. Lossy compression effects on refocused photographs were studied in [10].

Petri Helin, Pekka Astola, and Ioan Tabus are with Tampere University of Technology, Finland. Bhaskar Rao is with the Department of Electrical and Computer Engineering, University of California, San Diego, CA, USA.

The main topic of this paper is the lossless compression of a light-field data structure, which consists of a stack of subaperture images. Lossless, or near-lossless, compression is also among the goals of the standardization initiative, since in some applications, such as medical imaging, only lossless compression is permitted for legal reasons. In our study the main goal is to optimize the objective compression performance; since the images are compressed without any losses, there is no need for a subjective evaluation of compression effects.

Previous work on the lossless compression of plenoptic images was done by Perra [11], who studied the compressibility of the raw lenslet image captured at the sensor of a plenoptic camera. We address here the compressibility of the final light-field data structure, resulting after the pipeline of processing the raw lenslet image with a typical light-field toolbox, which is an array of subaperture images representing the views of the scene from different angular points. By encoding and transmitting this final processed information, the decoder does not itself need to perform the pipeline of transforming a raw lenslet image into a light-field data structure.

The light-field data structure is an N×N array of subaperture images, each representing a different angular view of the scene. The central view is obtained in the direction of the camera axis; the other views, called side views, are shifted both horizontally and vertically from the camera axis.

In this paper we encode the views sequentially, starting from the central view and advancing in a spiral order through the side views, exploiting the already encoded views when encoding a new side view.

We presented a similar but simpler compression scheme in the conference paper [12]. This paper proposes a more refined method, building on the initial scheme and adding new features in the parts that most affect performance, so that the newly reported results are better than those in [12]. A detailed discussion of the differences from [12] is presented in Section VI-C, after all details of the new method have been presented.


Fig. 1: The sequential processing order of the views is the shown spiral path, starting from the center view (8,8). The running index, shown in a few selected views, is denoted by ζ.


In essence, the partitioning method is MDL based, being driven by the final compression performance; secondly, the sparse predictor design is improved by considering alternative sparse design algorithms and by using a more complete MDL cost formula for selecting the optimal sparsity mask.

In Section II, we first present the main elements of the encoding scheme, including partition propagation, linear model setup, and encoding of coefficients and residuals, which are similar to [12], but now there is also cross-color prediction and the entropy coding strategy is more refined. In Section III, we investigate multiple methods for sparse predictor design (instead of opting for only one as in [12]) and for selecting the final structure of the predictor by the MDL criterion.

In Section IV, we introduce several ways of obtaining the regions used for grouping similar pixels into a single region with a dedicated predictor, and propose a parameterless region merging method for obtaining an MDL-optimal set of regions. Section V shows experimental results obtained with the presented method, and Section VI discusses the algorithm complexity and the connections to [12]. Finally, Section VII presents the conclusions.

II. COMPRESSION ARCHITECTURE

A. Definitions and notations

A light-field structure LF is an array with five dimensions, where an element LF(i, j, x, y, c) represents the color c at the pixel with coordinates (x, y) in the view image corresponding to the angular direction (i, j). The angular direction of a view is indexed by the horizontal and vertical shifts of the viewing direction, (i, j) ∈ {1, ..., N}². The size of a subaperture view is ξ×η, hence (x, y) ∈ {1, ..., ξ} × {1, ..., η}.

The set of values of the color c is defined for simplicity of notation as {0, 1, 2}, standing for the RGB color space, so {R, G, B} ≡ {0, 1, 2}. The experimental results in this paper are given for the Lytro Illum plenoptic camera, where (ξ, η) = (434, 625) and N = 15. An individual view in the light-field structure is an RGB color image A_{ij} with elements A_{ij}(x, y, c). We call central view the view A_{i_c j_c}, all other views being side views. For the Lytro Illum, i_c = j_c = 8, hence A_{8,8} is the central view.

The partition P_{i_c j_c} = {Ω^0_{i_c j_c}, ..., Ω^{L−1}_{i_c j_c}} of A_{i_c j_c} can be represented as a matrix Z_{i_c j_c} of labels (with labels from 0 to L−1), where Ω^0_{i_c j_c} is the closest region and Ω^{L−1}_{i_c j_c} the farthest. The set of pixels of any view is denoted Ω = {1, ..., ξ} × {1, ..., η}. The partition label image Z_{ij} for the view A_{ij} induces a partition P_{ij} = {Ω^0_{ij}, ..., Ω^{L−1}_{ij}} of Ω by the relation Ω^k_{ij} = {(x, y) | Z_{ij}(x, y) = k}. The encoding scheme encodes the side views in a spiral order, illustrated in Fig. 1, with the linear index for encoding order denoted ζ(i, j), e.g., ζ(8,8) = 1, ζ(8,7) = 2, ..., ζ(15,1) = 225.
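For concreteness, the following sketch (not from the paper; the grid size, the start view, and the W-N-E-S turn order are inferred from Fig. 1 and from the values ζ(8,8) = 1, ζ(8,7) = 2, ζ(15,1) = 225) generates this visiting order:

```python
# Sketch: reproduce the spiral visiting order of Fig. 1 for the N x N array
# of views, starting at the central view (8, 8).
def spiral_order(N=15, ic=8, jc=8):
    order = {(ic, jc): 1}
    zeta, i, j = 1, ic, jc
    moves = [(0, -1), (-1, 0), (0, 1), (1, 0)]  # W, N, E, S (i = row, j = col)
    d, run = 0, 1
    while zeta < N * N:
        for _ in range(2):                      # each run length is used twice
            di, dj = moves[d % 4]
            for _ in range(run):
                i, j = i + di, j + dj
                if 1 <= i <= N and 1 <= j <= N and (i, j) not in order:
                    zeta += 1
                    order[(i, j)] = zeta        # out-of-grid steps are skipped
            d += 1
        run += 1
    return order

assert spiral_order()[(8, 7)] == 2 and spiral_order()[(15, 1)] == 225
```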

B. Partitions of views into regions

A central role in the encoding scheme is played by the partition of each side view into regions. The central partition matrix Z_{i_c j_c} in this work is designed by one of the methods described in Section IV. The label image Z_{ij} associated with the side view (i, j) is obtained with Alg. 1, as explained below.

We assume fronto-planar regions, in order to simplify the process of finding the partitions induced in the side views by the partition of the central view. A region Ω^k_{i_c j_c} is displaced by δ^k_{ij} pixels in the horizontal direction and by μ^k_{ij} pixels in the vertical direction to define the region Ω^k_{ij} in the partition of A_{ij}. For estimating the displacements we use the approximate relation at non-occluded pixels A_{i_c j_c}(x, y) ≈ A_{ij}(x + δ^k_{ij}, y + μ^k_{ij}) and obtain the displacements as the solutions of the minimization

\[ (\delta^k_{ij}, \mu^k_{ij}) = \arg\min_{\delta, \mu \in \Delta} \sum_{(x,y) \in \Omega^k_{i_c j_c}} \sum_{c=0}^{2} \big( A_{i_c j_c}(x, y, c) - A_{ij}(x + \delta, y + \mu, c) \big)^2. \tag{1} \]

The set of allowed displacements in this paper is Δ = {−15, ..., 16}, hence five bits are needed for representing each displacement.
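A brute-force search implementing (1) could look as follows (a sketch with assumed array conventions: 0-based indexing, views as (H, W, 3) float arrays, the region given as a boolean mask; pixels displaced outside the image are simply dropped from the sum):

```python
import numpy as np

# Sketch of the per-region displacement search of Eq. (1) over
# Delta = {-15, ..., 16}.
def region_displacement(A_c, A_s, region_mask, shifts=range(-15, 17)):
    H, W, _ = A_c.shape
    xs, ys = np.nonzero(region_mask)
    best, best_ssd = (0, 0), np.inf
    for delta in shifts:                 # horizontal displacement
        for mu in shifts:                # vertical displacement
            x2, y2 = xs + delta, ys + mu
            ok = (x2 >= 0) & (x2 < H) & (y2 >= 0) & (y2 < W)
            if not ok.any():
                continue
            diff = A_c[xs[ok], ys[ok]] - A_s[x2[ok], y2[ok]]
            ssd = float(np.sum(diff ** 2))
            if ssd < best_ssd:
                best_ssd, best = ssd, (delta, mu)
    return best                          # (delta_k, mu_k) for this region
```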

The displacements for all 224 side views and all regions are sent to the decoder, which also receives the central partition Z_{i_c j_c} and is thus able to construct the side-view partitions in the same way as the encoder, as described in Alg. 1.

When propagating all regions Ω^0_{i_c j_c}, ..., Ω^{L−1}_{i_c j_c} according to the found set of displacements for one view A_{ij}, there will be situations of conflicting labels at one pixel, corresponding to an occlusion; these are solved simply by assigning the label of the closest region in the conflict (these assignments are done implicitly by processing the regions from farthest to closest in Steps 2 to 7 of Alg. 1). There will also be situations of disocclusion, where a pixel of A_{ij} receives no region label (the pixels in the set Φ at Step 8). We use a simple method known as neighbor's disparity assignment [13], where the missing pixels are allocated to the nearby region (with respect to the coordinates x and y) located at the farthest depth, as described in Steps 8-14 of Alg. 1. An example showing the regions in a partition and the results of partition propagation can be seen in Figs. 2 and 4.

C. Encoding scheme

The general scheme of the encoder is presented in Alg. 2.

The initially encoded elements are: the central view A_{i_c j_c}, which is encoded losslessly by JPEG 2000; and the central partition label image Z_{i_c j_c}, which has labels in the range 0, ..., L−1 and is encoded by the lossless compressor CERV [14]. The impact of sending Z_{i_c j_c} on the total code length is small (as will be discussed in Section V, the lossless code length needed is about 0.01% of the total code length, see Table II), so no major benefit could be obtained by using a lossy depth compression method such as [15].

Fig. 2: The proposed method is able to remove the redundancy of segmented regions by prediction, due to the great similarity between neighboring views. One can note that the vertical and horizontal displacements of a region between two neighboring views are very small for pictures taken with handheld plenoptic cameras. Here we illustrate this observation by showing in the upper left corner one center view of the test images and its segmentation into regions in pseudo color. The blue color component of the highlighted area is shown in the green-bordered image titled view i = 8, j = 8, c = 3. When moving along the spiral (see the upper right corner), at index ζ = 69 (i.e., at i = 8, j = 12) the same, slightly shifted, area is shown in the red-bordered image titled i = 8, j = 12, c = 3. The neighboring views that are already available for use in prediction and that belong to the set Γ_{8,12} are shown bordered in blue at their corresponding locations in the spiral. The color components that were already sent are also shown (i = 8, j = 12, c = 2 and i = 8, j = 12, c = 1). The contour of one region is highlighted in black in view i = 8, j = 12, c = 3 after region propagation, and the very same contour is overlaid on the neighboring views (blue) and on the center view (green). It can be seen that it fits the neighboring views within one pixel but is clearly shifted compared to the center view. The area marked with a black square in view i = 8, j = 12, c = 3 corresponds to the patch in Fig. 3, where the structure of the linear model for the sparse predictor dedicated to the green color for this particular region is shown.

Then the displacements for propagating the central partition to each side view are transmitted, and Alg. 1 is used to construct the partition P_{ij} for every view.

After the initial stage, the algorithm proceeds to encode all side views in the spiral order presented in Fig. 1, by performing the loop at Steps 6-11 of Alg. 2 over all views, all regions, and finally all color components.

Each side view A_{ij} is encoded using the side views that were already encoded, as illustrated in Fig. 2 and described in the following. We restrict the set of views used when encoding the view A_{ij} to only the five closest side views, having indices (i', j') for which |i − i'| + |j − j'| takes the five smallest values. In Fig. 2 the set of views for prediction is Γ_{8,12} = {(8,11), (7,12), (7,11), (9,11), (8,10)}.

For each view (i, j), the pixels in the region Ω^k_{ij} are encoded by predictive coding, using a sparse linear predictor specifically designed for the region Ω^k_{ij} and sent to the decoder as side information. The linear prediction model uses as regressors pixels from the already encoded side views and from the current side view. The detailed model description is presented in Section II-D.

D. Linear regression model setting for one region of one view

In this section we describe the structure of the regression problems that are solved in Step 5 of Alg. 2. Suppose that the decoder has already received and decoded completely the views A_{8,8}, ..., A_{7,12} along the spiral from Fig. 1, and the encoder needs to encode the view A_{8,12}. The situation is depicted in Fig. 2.


Algorithm 1 Partition propagation from central view (i_c, j_c) to side view (i, j).

1: Input: The center view partition P_{i_c j_c} = {Ω^0_{i_c j_c}, ..., Ω^{L−1}_{i_c j_c}}; disparities δ^k_{ij} and μ^k_{ij}.
2: for k = L−1, ..., 0 do
3:   for all (x, y) ∈ Ω^k_{i_c j_c} do
4:     Set x' = x + δ^k_{ij} and y' = y + μ^k_{ij}.
5:     Insert (x', y') into Ω^k_{ij}.
6:   end for
7: end for
8: Fill missing pixels Φ = Ω \ {Ω^0_{ij}, ..., Ω^{L−1}_{ij}} to form the partition:
9: for k = L−1, ..., 0 do
10:   Dilate the set Ω^k_{ij} with a disk of radius r to obtain region Φ1.
11:   Φ2 = Φ ∩ Φ1 (pick from the dilated region only the missing pixels).
12:   Ω^k_{ij} = Ω^k_{ij} ∪ Φ2 (allocate to set Ω^k_{ij}).
13:   Φ = Φ \ Φ2 (discard Φ2 from the missing pixels).
14: end for
15: Output: P_{ij} = {Ω^0_{ij}, ..., Ω^{L−1}_{ij}}.
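A compact sketch of Alg. 1 under assumed conventions (label images as integer arrays, −1 marking missing pixels, delta[k] and mu[k] the per-region displacements, a discrete disk as the structuring element; the dilation radius r is a free parameter, as in the text):

```python
import numpy as np
from scipy.ndimage import binary_dilation

# Sketch of Alg. 1: propagate labels per region, resolve occlusions by
# processing far-to-near, then fill disocclusions by iterated dilation.
def propagate_partition(Z_c, delta, mu, L, radius=2):
    H, W = Z_c.shape
    Z = np.full((H, W), -1, dtype=int)
    for k in range(L - 1, -1, -1):            # farthest to closest, so closer
        xs, ys = np.nonzero(Z_c == k)         # regions overwrite conflicting
        x2, y2 = xs + delta[k], ys + mu[k]    # labels (Steps 2-7)
        ok = (x2 >= 0) & (x2 < H) & (y2 >= 0) & (y2 < W)
        Z[x2[ok], y2[ok]] = k
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    disk = xx ** 2 + yy ** 2 <= radius ** 2   # disk structuring element
    while (Z == -1).any():                    # disocclusion fill (Steps 8-14):
        for k in range(L - 1, -1, -1):        # grow regions, farthest first
            grown = binary_dilation(Z == k, structure=disk)
            Z[grown & (Z == -1)] = k
    return Z
```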

Algorithm 2 Encoder.

1: Input: The 5D light-field LF and the partition of the center view P_{i_c j_c} = {Ω^0_{i_c j_c}, ..., Ω^{L−1}_{i_c j_c}}.
2: Find the region displacements δ^k_{ij} and μ^k_{ij} for all regions Ω^k_{ij}, i = 1, ..., N, j = 1, ..., N, k = 0, ..., L−1, according to (1). Write the displacements into the bitstream using 2 × 5 × L × N² bits.
3: Obtain the partitions for the side views using Alg. 1.
4: Encode the center view A_{i_c j_c} using the lossless mode of JPEG 2000. Encode the center partition label image Z_{i_c j_c} with CERV.
5: Advance in sequential order along the spiral and encode each view exploiting the already encoded views, as follows:
6: for all i, j, c, k do
7:   (for each side view (i, j), color c, and region Ω^k_{ij})
8:   Design, using Alg. 4, a sparse predictor Θ^k_{ijc}, by selecting the regression matrix D and the desired vector y, as described in Section II-D.
9:   Encode each sparse predictor Θ^k_{ijc} as described in Section II-E.
10:  Encode the prediction residuals ε = y − ⌊DΘ^k_{ijc}⌉ for each color c and region Ω^k_{ij}, as described in Section II-F.
11: end for


The regions Ω^0_{8,8}, ..., Ω^{L−1}_{8,8} of the central view are propagated using the displacements (δ^0_{8,12}, μ^0_{8,12}), ..., (δ^{L−1}_{8,12}, μ^{L−1}_{8,12}), respectively, to get the corresponding regions Ω^0_{8,12}, ..., Ω^{L−1}_{8,12}. The contour of one region is highlighted by a black border in Fig. 2.


Algorithm 3 Decoder.

1: Input: Encoded displacements, encoded predictors, encoded residuals.
2: Read the region displacements δ^k_{ij} and μ^k_{ij} for all i = 1, ..., N, j = 1, ..., N, k = 0, ..., L−1.
3: Decode the center view A_{i_c j_c} using the lossless mode of JPEG 2000. Decode the center partition P_{i_c j_c} = {Ω^0_{i_c j_c}, ..., Ω^{L−1}_{i_c j_c}} with CERV.
4: Obtain the partitions for the side views using Alg. 1.
5: Advance in sequential order along the spiral and decode each view as follows:
6: for all i, j, c, k do
7:   (for each side view (i, j), color c, and region Ω^k_{ij})
8:   Decode each sparse predictor Θ^k_{ijc}.
9:   for all ℓ do
10:    Form the ℓth row d_ℓ^T of the regression matrix D from the already decoded side views, as described in Section II-D.
11:    Decode the prediction residual ε_ℓ (the ℓth residual for color c and region Ω^k_{ij}).
12:    Obtain the reconstructed value y_ℓ = ⌊d_ℓ^T Θ^k_{ijc}⌉ + ε_ℓ, which represents the value of color c at the ℓth pixel of region Ω^k_{ij}.
13:  end for
14: end for

The view A_{8,12} is encoded region by region, selecting the regions in their row-wise scanning order. Next, we describe the encoding of a generic region, Ω^k_{8,12}. The color components at all pixels in the region are encoded sequentially: first the red component (c = 0) is transmitted for all pixels in Ω^k_{8,12}, then the green component (c = 1), and last the blue component (c = 2); e.g., in Fig. 2 we encode color component c = 2, hence the first two color components have already been sent.

A separate linear regression model is built for every color component c ∈ {0, 1, 2}. Let the 3×3 square centered at (x, y) be N_{3×3}(x, y) = {(x + x_1, y + y_1) | (x_1, y_1) ∈ {−1, 0, 1}²}. The model expresses the value A_{8,12}(x, y, c) at a pixel (x, y) ∈ Ω^k_{8,12} as a linear combination of the color values at all pixels (x', y') ∈ N_{3×3}(x, y) in the view A_{8,12} that were already transmitted to the decoder, and of the color component c at all pixels (x', y') ∈ N_{3×3}(x, y) in the selected neighbor views A_{i',j'} with (i', j') ∈ Γ_{ij}. A highlighted square of pixels of the encoded view in Fig. 2 is enlarged in Fig. 3, where (x, y) is denoted by a red × and the pixels surrounded by a black border belong to the model.

The linear model is expressed in matrix form as y = DΘ + ε, where y ∈ R^n, D ∈ R^{n×m}, Θ ∈ R^m, and ε ∈ R^n. Every value A_{8,12}(x, y, c) with (x, y) ∈ Ω^k_{8,12} appears as an entry in the vector y, hence the size of y equals the cardinality of the region Ω^k_{8,12} (in our experiments the cardinality |Ω^k_{8,12}| ranges from a few hundred up to several tens of thousands). The regression matrix D contains the elements of the linear combinations, given explicitly below, and the linear regression coefficients are collected in the vector of unknowns, Θ. The elements of the c-th color plane in the region (x, y) ∈ Ω^k_{8,12} are inserted in y according to the row-wise scanning order.

All these elements are inserted in the regression matrix D because they are potentially relevant to the values of y. However, their relevance, and whether they become part of the final linear model, is decided by the sparse design algorithm.

We give explicitly the structure of the linear system y = DΘ + ε by describing its generic row r. The rth element of y is A_{i,j}(x, y, c) for some (x, y) ∈ Ω^k_{i,j}, and it is what we aim to predict by a linear combination of already known pixel color values. The rth row of D consists of the elements considered relevant to A_{i,j}(x, y, c), listed in the groups below:

• The self-prediction group contains the color-c values at the west, north-west, north, and north-east neighbors, i.e., include in the row of D the four elements A_{ij}(x, y−1, c), A_{ij}(x−1, y−1, c), A_{ij}(x−1, y, c), and A_{ij}(x−1, y+1, c). They form the causal set denoted N_c(x, y). The set N_c(x, y) is causal in the row-wise scanning of the region Ω^k_{ij}.

• The cross-prediction group across the other color planes of the same view, whenever the color component c−1 or c−2 exists: include in the row of D the nine elements A_{ij}(x', y', c−1) if c ≥ 1, and additionally A_{ij}(x', y', c−2) if c = 2, with (x', y') ∈ N_{3×3}(x, y).

• The cross-prediction group across the other views, only at the color component c: for every view (i', j') ∈ Γ_{ij}, include in the row of D the nine elements A_{i'j'}(x', y', c), with (x', y') ∈ N_{3×3}(x, y).

• The first element in each row is a constant, 1, for modeling the possible bias of the model.

After building the matrix D as above, one adjustment step is done, with the goal of constraining all elements in the regression matrix to be elements of the region Ω^k_{ij}, as in [16]. When a pixel (x', y') ∈ N_{3×3}(x, y) does not belong to the region Ω^k_{ij}, the color value at that pixel is replaced with the corresponding color value of the nearest pixel in the neighborhood that does belong to the region (in the same way as in [16]). Similarly, if (x', y') ∈ N_c(x, y) does not belong to the region Ω^k_{ij}, it is replaced by the closest pixel from the causal template N_c(x, y). If this is not possible, i.e., if no pixel in the causal template N_c(x, y) belongs to the region Ω^k_{ij}, the encoder (and the decoder as well) will not predict (x, y) with the specific predictor for color c and region Ω^k_{ij}; instead, (x, y) is treated by a general predictor, one for each view, designed especially for these “border” pixels, which uses only the cross-prediction terms listed above.

For views where five previous views were already encoded, the matrix D has the following number m of columns (possible regressors): m = 50 for the red component (c = 0); m = 59 for the green component (c = 1); and m = 68 for the blue component (c = 2), as verified in the sketch below.
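As a quick sanity check (a sketch, not part of the codec), the column budget of D follows directly from the groups of Section II-D:

```python
# One bias column, four causal same-color neighbors, nine cross-color
# regressors per already-sent color plane, and nine regressors per
# neighbor view in Gamma_ij (here |Gamma_ij| = 5).
def num_regressors(c, n_neighbor_views=5):
    return 1 + 4 + 9 * c + 9 * n_neighbor_views

assert [num_regressors(c) for c in (0, 1, 2)] == [50, 59, 68]
```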

The task of the sparse predictor design is to select only the relevant elements of the regression model, keeping only those columns that contribute to increasing the prediction accuracy; otherwise their presence in the model may be detrimental, because they incur a bit cost for being encoded along in the bitstream. Especially for small regions, this cost may be a relatively high part of the overall residual-plus-coefficients cost.

We denote by Θ a sparse predictor, where only the elements Θ_i with indices i in the support I ⊂ {1, ..., m} are nonzero. The nonzero elements of the vector are denoted θ = Θ_I, and the model prediction can be written as ŷ = DΘ = D_I θ, where D_I selects only the columns of D with indices belonging to I.

The matrix D in sparse modeling is sometimes called a dictionary, especially when it is determined using training data different from the data on which the regression model is finally used. Here, however, we construct D specifically for the data to be encoded, and we have one such D for each region, each view, and each color component, resulting in a large number (on the order of 70000) of different regression matrices, each specific to the data where it is used.

The regression matrix D formed in this way includes columns with high mutual coherence, which is known to be a challenging situation for sparse recovery algorithms.

We introduce the investigated algorithms in Section III and show their performance in Section V. We do not normalize and center the regression matrix D, since that would require sending additional information about the means and variances of the columns: the decoder does not initially know the full matrix D (which is recovered row by row in the process of decoding y) and cannot compute all the column means and variances. However, we observed that the lack of centering and normalizing does not deteriorate the performance of the sparse recovery algorithms presented in Section III.

E. Encoding the predictor coefficients

We consider a sparse predictor Θ with m real-valued elements, having the support I of k nonzero elements Θ_I = θ. In order to encode Θ, the nonzero elements are first rounded to keep only N_b bits in the fractional part, resulting in the quantized vector θ̄ with elements

\[ \bar\theta_i = \lfloor \theta_i 2^{N_b} \rceil / 2^{N_b}, \quad i = 1, \dots, k, \tag{2} \]

which are transmitted to the decoder in the form of the integer values m_i = ⌊θ_i 2^{N_b}⌉, i = 1, ..., k. If quantization turns some coefficients to zero, this is handled by changing the definition of the support I to point only to nonzero coefficients and setting k to the size of the new I.
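In code, the quantizer (2) and the support shrinking can be sketched as follows (N_b = 6 is an arbitrary placeholder, not a value fixed by the paper; np.rint is one concrete choice of the rounding operator ⌊·⌉):

```python
import numpy as np

# Sketch of the coefficient quantizer (2): round to Nb fractional bits,
# return the transmitted integers m_i, the quantized values, and a mask
# marking the (possibly shrunk) support.
def quantize_coeffs(theta, Nb=6):
    m_ints = np.rint(np.asarray(theta) * 2 ** Nb).astype(int)
    keep = m_ints != 0                   # drop coefficients quantized to zero
    return m_ints[keep], m_ints[keep] / 2 ** Nb, keep
```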

1) First the value of k ∈ {1, ..., K} is encoded, using ⌈log₂ K⌉ bits. The signs of the nonzero coefficients are then sent, without coding, as a k-long binary string.

2) The indices of the support I are transmitted as follows. A binary vector γ = [γ_1, ..., γ_m] is created, having ones at the indices from I. This binary vector is encoded using log₂ \(\binom{m}{k}\) bits, which amounts to assuming that all m-long binary sequences of weight k (i.e., having k symbols one) are equally likely, each having probability 1/\(\binom{m}{k}\). One way to implement this code is to enumerate all such strings (e.g., using the indexing scheme presented in [17]) and then to encode, using arithmetic coding, the resulting global index of the particular γ (an integer between 1 and \(\binom{m}{k}\)). However, we implement this in a simpler way, by sequentially encoding the binary symbols of γ = [γ_1, ..., γ_m], hence avoiding the use of global indices, which may be too large. For the first element, γ_1, the probability distribution is defined by P(γ_1 = 1) = k/m, and for the ith element P(γ_i = 1) = (Σ_{j=i}^{m} γ_j)/(m − i + 1), where i = 1, ..., m. In fact at most m−1 binary symbols need to be encoded; the encoding process stops whenever P(γ_j = 1) = 1 or P(γ_j = 1) = 0, which may already happen at a j less than m−1. One can easily show that overall this encoding process results in the assignment P(γ) = 1/\(\binom{m}{k}\).

Fig. 3: Illustration of sparse predictor design for a region of view A_{ijc} = A_{8,12,3}. An (11×11) square where all pixels belong to the same region is shown in subfigures (a)-(b). (a) The available regressors and the ones selected by OMP and OOMP; illustration of the OOMP template chosen by MDL. The pixel to be predicted, marked as a red × in “view i = 8, j = 12, c = 3”, is set as the tth element of the vector y. Its causal neighbors inside the black-bordered regions are elements of the tth row of the candidate regressor matrix D, together with all pixels inside the black contours in all shown views, which are also candidate regressors. OOMP chooses sequentially to include in the prediction template the pixels in the order marked by the overlaid blue numbers; here MDL chooses to stop after selecting the first k = 7 regressors. The OMP template chosen by MDL is denoted with black numbers; the optimal number of selected regressors is k = 15. Hence the OMP predictor is about twice as slow as the OOMP predictor, which implies an about twice slower decoding speed. (b) The evolution of the MDL criterion (description length in bits versus order) for the whole region, for the algorithms OOMP and OMP. OOMP chooses the regressor pixels more effectively, decreasing the MDL at the highest rate.

3) The absolute values are encoded using a Golomb-Rice code of parameter p, where a positive integer m_i is encoded by writing the quotient ⌊m_i/2^p⌋ in unary and sending the p bits of the remainder uncoded. The unary part of the code is formed of ⌊m_i/2^p⌋ one-bits followed by a 0 separator, requiring in total k + Σ_{i=1}^{k} ⌊m_i/2^p⌋ = k + Σ_{i=1}^{k} ⌊⌊|θ_i| 2^{N_b}⌉/2^p⌋ bits, while the uncoded remainders need kp bits.

Summarizing, the overall code length for encoding the coefficients is

\[ L_\theta = \sum_{i=1}^{k} \left\lfloor \frac{\lfloor |\theta_i| 2^{N_b} \rceil}{2^p} \right\rfloor + \log_2 \binom{m}{k} + k(p+2) + \lceil \log_2 K \rceil. \tag{3} \]

The optimal Golomb-Rice parameter p is found by evaluating (3) over the predefined set p ∈ {N_b − i_1, ..., N_b + i_2} and picking the p that minimizes L_θ. In our experiments i_1 = 2 and i_2 = 5, so the value of p is encoded using 3 bits; this constant is neglected in the code-length expression for brevity, resulting in the optimal code length for the coefficients

\[ L_\theta = \sum_{i=1}^{k} \left\lfloor \frac{\lfloor |\theta_i| 2^{N_b} \rceil}{2^p} \right\rfloor + \log_2 \binom{m}{k} + k(p+2) + \lceil \log_2 K \rceil, \tag{4} \]

now evaluated at the optimal p. The expression (4) is evaluated quickly, at every stage of the outer loop of Alg. 4 and in Step 5 of Alg. 5, as part of the MDL criterion.

F. Encoding the prediction errors

The encoder uses for prediction the predictor coefficients that are also available to the decoder, i.e., the coefficients θ̄ from (2), and the residual vector

\[ \varepsilon = y - D_I \bar\theta, \tag{5} \]

having n elements ε_1, ..., ε_n, needs to be encoded for lossless reconstruction.

The final encoding of the prediction errors (also called residuals for short) into the bitstream is done using a rather involved context-coding machinery, typical for most lossless image compression applications. The contexts are defined, in a manner similar to [18], [19], according to the quantized values of the differences between neighbor pixels in the causal neighborhood of the current pixel (see [16] for the exact definitions, also used here). At each context, the minimum and maximum of the value to be encoded are denoted s_min and s_max, respectively, and are encoded as side information. A histogram is adaptively maintained at each context, where the current counts N(s_min), ..., N(s_max) of all symbols define the probability p(ε_t = s) = (N(s) + 1) / Σ_{i=s_min}^{s_max} (N(i) + 1) used for encoding the current symbol ε_t = s, after which N(s) is incremented.

This nonparametric way of encoding ensures a robust performance, irrespective of how well the real distribution of the residuals matches any assumed model.

Although in practice the above context coding is used for creating the bitstream, during the sparse design of the linear predictor one needs to make assumptions about the ideal distribution of the residuals. We assume a Gaussian distribution for the residuals, for its convenience in allowing recursive-in-order least squares equations and hence a fast predictor design. Similar to the MDL scheme used in [20], we assume the distribution of the residuals

\[ p(\varepsilon \mid \hat\sigma^2) = \frac{1}{(2\pi\hat\sigma^2)^{n/2}} \, e^{-\sum_{t=1}^{n} \varepsilon_t^2/(2\hat\sigma^2)}, \qquad \text{where } \hat\sigma^2 = \frac{1}{n}\sum_{t=1}^{n}\varepsilon_t^2. \]

The optimal code length for the residuals equals the negative maximized log-likelihood

\begin{align*}
L_\varepsilon(\varepsilon) &= -\log_2 \left[ \frac{1}{(2\pi\hat\sigma^2)^{n/2}} \, e^{-\sum_{t=1}^{n} \varepsilon_t^2/(2\hat\sigma^2)} \right] \\
&= \frac{n}{2}\log_2(2\pi) + \frac{n}{2}\log_2 \hat\sigma^2 + \frac{n}{2}\log_2 e \\
&= \frac{n}{2}\log_2 \sum_{t=1}^{n} \varepsilon_t^2 + \frac{n}{2}\log_2(2\pi e/n) \\
&= \frac{n}{2}\log_2 \|y - D_I\bar\theta\|^2 + \frac{n}{2}\log_2(2\pi e/n). \tag{6}
\end{align*}

This description length is achievable in practice by using arithmetic coding, so we call it an implementable description length; in the final stage of constructing the bitstream, however, we use the usually more efficient code length given by the context-based coding described above.
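A direct transcription of (6) follows (a sketch; the small guard constant for an all-zero residual is an implementation detail, not from the paper):

```python
import numpy as np

# Sketch of the implementable Gaussian code length (6), used inside the
# MDL criterion during sparse design (the bitstream uses the context coder).
def L_eps(residuals):
    eps = np.asarray(residuals, dtype=float)
    n = eps.size
    rss = max(float(np.sum(eps ** 2)), 1e-12)   # guard the log at rss = 0
    return 0.5 * n * np.log2(rss) + 0.5 * n * np.log2(2 * np.pi * np.e / n)
```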

III. SPARSE MODELING

A. The implementable MDL for model structure selection

One instance of the sparse design problem is defined by the n×m matrix D and the n-vector y, and the task is to find a (sub)optimal sparse predictor with support I_k ⊂ {1, ..., m} and nonzero predictor parameters θ ∈ Q_{N_b}^k ⊂ R^k, where Q_{N_b} = {i/2^{N_b} | i ∈ Z} is the subset of real numbers whose fractional part is represented with N_b bits. The optimality criterion is the overall description length, MDL_k, which includes the code lengths for the prediction coefficients θ, the prediction support I_k, and the prediction residuals

\[ \varepsilon = y - D_{I_k} \theta. \tag{7} \]

According to the considered encoding scheme, the implementable minimum description length can be computed using (4) and (6) as

\[ \mathrm{MDL}(I, \bar\theta) = L_\varepsilon(\varepsilon) + L_\theta = \frac{n}{2} \log_2 \|y - D_I \bar\theta\|^2 + \frac{n}{2} \log_2(2\pi e/n) + \sum_{i=1}^{k} \left\lfloor \frac{\lfloor |\theta_i| 2^{N_b} \rceil}{2^p} \right\rfloor + \log_2 \binom{m}{k} + k(p+2) + \lceil \log_2 K \rceil, \tag{8} \]

where k is the cardinality of the support I ⊂ {1, ..., m}, and p depends on θ̄ as presented in Sec. II-E.

Minimizing the description length is clearly the main goal in our application; additionally, a sparse solution having a small support size k = ‖Θ‖₀ is desirable, since it leads to a fast decoding time. Typical sparse designs based on greedy algorithms are convenient here, as they provide sparse solutions Θ_k minimizing ‖y − DΘ‖² for each candidate support size k = ‖Θ_k‖₀, from 1 up to an order K fixed by the user; the minimizing MDL_k is then found by inspecting the criterion evaluated at the sparse solutions, MDL_1, ..., MDL_K.
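Combining the two code-length sketches above, the criterion (8) can be evaluated as follows (a sketch; support, theta_q, and m_ints follow the notation of Sec. II-E):

```python
# Sketch: the MDL criterion (8) as the sum of the residual and coefficient
# code lengths L_eps and L_theta sketched earlier.
def MDL(y, D, support, theta_q, m_ints, K, Nb=6):
    eps = y - D[:, support] @ theta_q
    return L_eps(eps) + L_theta(m_ints, D.shape[1], K, Nb=Nb)
```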

B. The generic structure of the algorithm for sparse design and model structure selection

We need to find the optimal sparse predictors for all regions in all side views, which in typical cases requires tens of thousands of predictor designs for one plenoptic image (in our experiments there are 224 side views and each view contains hundreds of regions). For this reason we consider here only sparse design methods that are very fast for the low-dimensional models encountered in this application, where typically n is in the thousands while m is in the tens.

We investigated the performance of three algorithms, orthogonal matching pursuit (OMP), optimized orthogonal matching pursuit (OOMP), and least angle regression (LARS), which were found computationally feasible in the context of our problem. The first two belong to the greedy class of algorithms and have a similar algorithmic structure, where the solutions are constructed recursively in the sparsity k of the predictor, i.e., in the number of nonzero elements of the predictor. The solutions are constructed for all k = 1, 2, ..., K, where K is the maximum number of regressor variables allowed by the user in the model, out of all m existing regressors. The optimal value of k is then selected according to the minimum description length criterion, which in this lossless compression application is the only genuinely relevant criterion.

In a greedy algorithm the set {1, ..., m} of all regressor indices is split at each stage k into the chosen regressor set I_k and the available regressor set A_k. The algorithm progresses from stage k−1 by enlarging the predictor support I_{k−1} (found best at sparsity k−1) with a new regressor, to obtain the support I_k of cardinality k. Alg. 4 shows the common structure of the sparse predictor design, where the inner loop distinguishes the more complex OOMP algorithm from the faster OMP algorithm.


Algorithm 4 Greedy sparse predictor design algorithm with MDL structure selection, having a common structure for the OOMP and OMP algorithms. OOMP solves a full LS problem in the inner loop at step 7, while OMP basically computes the scalar product between r_{k−1} and D_i in the inner loop at step 9 (all D_i^T D_i are precomputed outside the loop).

1: Input: y, D
2: Initialize: I_0 = ∅; A_0 = {1, ..., m}
   // Outer loop
3: for k = 1 to K do
4:   // Inner loop
5:   for i ∈ A_{k−1} do
6:     if Algorithm = OOMP then
7:       J_i = min_θ ‖y − D_{I_{k−1} ∪ i} θ‖²
8:     else if Algorithm = OMP then
9:       J_i = ‖r_{k−1}‖² − (r_{k−1}^T D_i)² / (D_i^T D_i)
10:    end if
11:  end for
12:  i = arg min_i J_i
13:  I_k = I_{k−1} ∪ i
14:  A_k = A_{k−1} \ i
15:  θ_k = arg min_θ ‖y − D_{I_k} θ‖²
16:  r_k = y − D_{I_k} θ_k
17:  θ̄_k = Q(θ_k) (see (2))
18:  MDL_k = MDL(I_k, θ̄_k) (see (8))
19: end for
20: k = arg min_k MDL_k
21: Output: k, I_k, θ̄_k

After the inner loop terminates, the best index i for extending the support is found and used to update the support I_k and the set A_k of available indices for the next outer-loop stage. For the selected support I_k, the LS solution θ_k is computed at line 15, and the corresponding residual r_k at line 16 (strictly needed only for the OMP algorithm). The LS parameters θ_k are then rounded to keep only N_b bits in their fractional part, by the uniform quantizer (2) described in Section II-E. Finally, MDL_k is evaluated. After terminating the outer loop, the best k is picked by inspecting MDL_1, ..., MDL_K. The algorithms OOMP, OMP, and LARS are presented briefly in the next subsections.
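To make the OMP branch of Alg. 4 concrete, here is a sketch under assumed NumPy conventions; it reuses the MDL and quantizer sketches above, and omits for brevity the support shrinking for coefficients that quantize to zero:

```python
import numpy as np

# Sketch of Alg. 4 with the OMP inner loop (step 9); OOMP would instead
# solve a full LS problem for every candidate column.
def omp_mdl(y, D, K, Nb=6):
    n, m = D.shape
    support, results = [], []
    r = y.astype(float).copy()
    norms = np.sum(D ** 2, axis=0)                 # precomputed D_i^T D_i
    for k in range(1, K + 1):
        avail = [i for i in range(m) if i not in support]
        gains = [(r @ D[:, i]) ** 2 / norms[i] for i in avail]  # Eq. (9)
        support.append(avail[int(np.argmax(gains))])
        theta, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)  # line 15
        r = y - D[:, support] @ theta              # LS residual (line 16)
        m_ints = np.rint(theta * 2 ** Nb).astype(int)
        theta_q = m_ints / 2 ** Nb                 # quantizer (2)
        results.append((MDL(y, D, support, theta_q, m_ints, K, Nb),
                        list(support), theta_q))
    return min(results, key=lambda t: t[0])        # (MDL_k, I_k, theta_k)
```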

C. The algorithm OOMP

The sparse design algorithm OOMP, also known under various names, e.g., LS-OMP, Greedy LS, Fast OLS, and order-recursive forward selection, basically solves the problem of forward selection in least squares modeling, which appears in many application fields: econometrics [21], control and system identification [22], optimization [23], and sparse representations [24]. Despite the different guises of the problem, its mathematical structure is the same, and the fast solutions presented in, e.g., [21]-[24] share the same main algorithmic steps, differing mainly in the trade-off between using more or fewer variables (and memory space) versus faster or slower execution. In this paper we use the fast and space-efficient algorithm Fast Orthogonal Least Squares (FOLS) from [25].

Algorithm 5 LARS-LASSO sparse predictor design algorithm with MDL structure selection.

1: Input: y, D
2: Run the LARS algorithm and obtain the sequence {I_k, θ_k}_{k=1,...,K}.
3: for k = 1 to K do
4:   θ̄_k = Q(θ_k) (see Eq. (2))
5:   MDL_k = MDL(I_k, θ̄_k) (see Eq. (8))
6: end for
7: k = arg min_k MDL_k
8: Output: k, I_k, θ̄_k


D. The algorithm OMP

The second investigated algorithm is OMP, which evaluates scalar products at each inner iteration (line 9 in Alg. 4) and can be seen as a sped-up version of OOMP. The new regressor index being selected is the one which gives the maximum in

\[ i = \arg\max_{i \in A_{k-1}} \frac{(r_{k-1}^T D_i)^2}{D_i^T D_i}, \tag{9} \]

where the residual vector at stage k−1 is r_{k−1} = y − D_{I_{k−1}} θ_{k−1}.

E. The algorithm LARS

Finally, we also consider the sparse design method LARS, which bridges the two main classes of sparse design, the greedy algorithms and the convex relaxation algorithms, being a method able to provide solutions to the LASSO problem

\[ \min_{I \subset \{1, \dots, m\}} \min_{\theta} \|y - D_I \theta\|^2 + \lambda \|\theta\|_1 \tag{10} \]

for all values of the regularization factor λ. The regularization path for LASSO is a sequence of predictor supports I_1, ..., I_K having sparsity k = |I_k|, where the regularization factor λ is iteratively set by the LARS algorithm so that the resulting sparsity is k.

For details we refer to the original paper describing LARS [26] and to the implementation of the algorithm from [27]. The program from [27] has the option of running the LASSO type of LARS algorithm, in which case the support I_k does not necessarily include I_{k−1} and does not necessarily have cardinality k; hence I_1, ..., I_K are not nested. Alg. 5 shows the steps necessary to include the LARS-LASSO algorithm in the MDL predictor structure selection framework, where the condition of nested supports is not assumed.
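As an illustration of Alg. 5, a sketch built on scikit-learn's lars_path (an assumed stand-in for the implementation of [27]; it reuses the MDL sketch above and skips the all-zero start of the path):

```python
import numpy as np
from sklearn.linear_model import lars_path

# Sketch of Alg. 5: quantize every solution along the LARS-LASSO
# regularization path and keep the MDL-best one (supports need not be nested).
def lars_mdl(y, D, K, Nb=6):
    _, _, coefs = lars_path(D, y, method='lasso', max_iter=K)
    best = None
    for theta in coefs.T:                      # one solution per path step
        support = list(np.nonzero(theta)[0])
        if not support:
            continue
        m_ints = np.rint(theta[support] * 2 ** Nb).astype(int)
        theta_q = m_ints / 2 ** Nb
        score = MDL(y, D, support, theta_q, m_ints, K, Nb)
        if best is None or score < best[0]:
            best = (score, support, theta_q)
    return best                                # (MDL_k, I_k, theta_k)
```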

IV. PARTITIONING METHODS

A. Simple partitioning methods

In [12] the partition of the central view P_{i_c j_c} = {Ω^0_{i_c j_c}, ..., Ω^{L−1}_{i_c j_c}} that is input to Alg. 2 was obtained simply by quantizing the disparity map. In this paper we consider more refined means of obtaining the partition.

Fig. 4: (a) shows the 10th side view, (9,6), of the data set image Bikes. The cropped area marked with a black frame is zoomed in panels (d) and (e). (b) shows the C-segmentation of the side view, based on mean shift and partition propagation. The color of each region visualizes the value of the self-to-cross ratio ρ(Θ) introduced in (11), in percent (see the color-to-value bar at the right of the panel). It can be seen that regions with uniform color, or regions with high displacement (either in front or far back), have a high self-to-cross ratio, i.e., the causal mask for these regions is more important than the neighboring views. Regions with texture or low displacement obtain a low self-to-cross ratio. (c) shows the final compression performance in bits per pixel (per color component), represented by the color of the region (see the color-to-value bar at the right of the panel). The areas most difficult to compress are those far away, where the relative displacement between views is largest. (d) shows a zoom of (a) and overlays in black the contours of the regions after partition propagation. (e) is a zoom into the corresponding area of (b), revealing the high self-to-cross ratio over almost uniform color regions.

Any arbitrary segmentation of A_{i_c j_c} can be used directly in our compression scheme, when denoting the segments {Ω^0_{i_c j_c}, ..., Ω^{L−1}_{i_c j_c}}, ordered according to the average depth over the region, so that Ω^{k−1}_{i_c j_c} corresponds to a region that is closer in depth than the region Ω^k_{i_c j_c}. This re-labeling can be done using an explicitly estimated depth map, or simply using the estimated disparities discussed as part of the encoding method in Section II.

The selected self regressors in the sparse predictor (those belonging to the current view to be encoded, see Section II-D) contribute the best interpolation of the color from the neighboring pixels, so the homogeneity of a region can be evaluated in terms of the similarity of the colors or the texture of the region; hence a color segmentation is a good candidate to investigate. If only the self part were present, a human subjective segmentation would be very good. However, the cross part of the predictor (which exploits the cross-correlation of the current pixel with regressors from neighbor views) implements an interpolation process based on the geometrically relevant pixels, according to the relative position of the lenslets and depending also on the distance of the object in the 3D scene, leading to ideal regions carved in terms of the (quantized) depth of the pixels. Hence an optimal partition is not necessarily a good human segmentation of the color image, but also involves the optimal carving of regions based on the depth values.

We quantify the relative importance between self regressors and cross regressors, as established by the optimal sparse design, by splitting the support I of the sparse predictor into two subsets: the self-support I_self, which points to pixels in the causal prediction mask from the view itself, and the cross-support I_cross, which points to pixels in the neighbor views, and defining the self-to-cross ratio

\[ \rho(\Theta) = \frac{\sum_{i \in I_{\text{self}}} |\Theta_i|}{\sum_{i \in I_{\text{cross}}} |\Theta_i|}. \tag{11} \]

Our MDL criterion automatically selects the nonzero elements of the predictor for an optimal prediction, hence one can see ρ(Θ) as giving hints about which type of local homogeneity is more important: that based on color, when ρ is high, or that based on depth, when ρ is low.

Fig. 4 shows the self-to-cross ratios over each region of a mean-shift segmentation of the color view; the values of ρ are represented according to the vertical color bar next to the image.
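Computing (11) for a designed predictor is straightforward; in the sketch below, is_self is an assumed per-support-entry boolean flag marking the columns drawn from the causal mask of the current view:

```python
import numpy as np

# Sketch of the self-to-cross ratio of Eq. (11).
def self_cross_ratio(theta_q, is_self):
    a = np.abs(np.asarray(theta_q, dtype=float))
    s = np.asarray(is_self, dtype=bool)
    cross = a[~s].sum()
    return np.inf if cross == 0 else a[s].sum() / cross
```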

1) Disparity-map partitioning method, D: When predicting the color value at a pixel (x, y) of view (i, j), one can see the optimal sparse predictor as an optimal interpolator of the values present in the prediction mask, which may belong to a neighbor view (i', j'), among others. In the case of a view (i, j) shifted by a fractional disparity with respect to the neighbor view (i', j'), the interpolator will adjust itself according to the disparity, and if the disparity is constant over a certain region, the same predictor will be optimal for the whole region. Hence segmentation based on disparity is a good basis.

We adopted this strategy earlier in [12], and in Section V we show how it compares to other segmentation alternatives. The depth map of a plenoptic image can be estimated with numerous methods, see e.g. [28]. In this paper we use the depth map provided by the camera manufacturer's software, namely Lytro Desktop. Given a pixel with disparity d, we quantize it to d_q = ⌈N_d (d − d_min)/(d_max − d_min)⌉, where N_d is the number of constant disparity levels and d_max and d_min correspond to the range of disparities. The final quantized disparity map emerges from the pixels that have the same quantized disparity and belong to the same group of connected pixels, as in the sketch below. In the results we denote the disparity-map partitioning method with the suffix D.
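A sketch of this partitioning (assumed conventions: d as a 2D float array, 4-connected components via scipy.ndimage.label; a constant disparity map is not handled):

```python
import numpy as np
from scipy.ndimage import label

# Sketch of the D-partitioning: quantize the disparity map to Nd levels,
# then split each quantization level into connected components so that every
# region is a connected set of pixels with equal quantized disparity.
def disparity_partition(d, Nd):
    dq = np.ceil(Nd * (d - d.min()) / (d.max() - d.min())).astype(int)
    regions = np.zeros(d.shape, dtype=int)
    next_label = 0
    for level in np.unique(dq):
        comps, ncomp = label(dq == level)    # connected components per level
        for c in range(1, ncomp + 1):
            regions[comps == c] = next_label
            next_label += 1
    return regions
```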

In practice, the regions obtained with quantized disparity maps lack in quality when it comes to object contours, due to the fact that light-field depth estimation methods, such as [29], rely on mixing the results of two approaches: first,
