
Contents lists available at ScienceDirect

Pattern Recognition

journal homepage: www.elsevier.com/locate/patcog

Cumulative attribute space regression for head pose estimation and color constancy

Ke Chen a, Kui Jia b, Heikki Huttunen a,∗, Jiri Matas a,c, Joni-Kristian Kämäräinen a

a Laboratory of Signal Processing, Tampere University of Technology, Finland
b School of Electronic and Information Engineering, South China University of Technology, China
c Department of Cybernetics, Czech Technical University, Prague

ARTICLE INFO

Article history:
Received 11 April 2018
Revised 21 September 2018
Accepted 9 October 2018
Available online 10 October 2018

Keywords:
Multivariate regression
Cumulative attribute space
Head pose
Color constancy

ABSTRACT

Two-stage Cumulative Attribute (CA) regression has been found effective in regression problems of computer vision such as facial age and crowd density estimation. The first stage regression maps input features to cumulative attributes that encode correlations between target values. The previous works have dealt with single output regression. In this work, we propose cumulative attribute spaces for 2- and 3-output (multivariate) regression. We show how the original CA space can be generalized to multiple outputs by the Cartesian product (CartCA). However, for target spaces with more than two outputs the CartCA becomes computationally infeasible and therefore we propose an approximate solution, multi-view CA (MvCA), where CartCA is applied to output pairs. We experimentally verify improved performance of the CartCA and MvCA spaces in 2D and 3D face pose estimation and three-output (RGB) illuminant estimation for color constancy.

© 2018 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Multiple output regression predicts several continuous variables simultaneously. One of the emerging topics within regression problems is visual regression. Regression has many applications in vision, such as 2D and 3D head pose estimation and landmark detection [1–3] (see Fig. 1), illumination estimation for color constancy [4], as well as apparent age estimation [5].

A straightforward solution is to learn individual regressors for each target variable separately using traditional techniques (e.g. ridge regression, random forest regression [6] and support vector regression [7]). However, independent regressors discard the interdependence between the target variables, which can be substantial in vision problems. There are more advanced approaches for multivariate regression, such as joint learning of regressors in a multi-task fashion [8] and structured learning [9], but even these generic approaches cannot effectively model cross-target correlations of visual data and are often inferior to problem-specific methods.

Most of the above methods apply the traditional single layer regression architecture, where the multivariate output is estimated either directly from image features, or by optimizing a tailored score function. During the recent years there have been multiple successful attempts to replace the single layer model with two layer (two stage) architectures [10–12]. The first layer output represents an "attribute space" where attribute features have an important semantic meaning for the regression or classification task solved by the second layer output.

∗ Corresponding author. E-mail address: heikki.huttunen@tut.fi (H. Huttunen).

In this work, we focus on the concept of cumulative attribute (CA) space mapping that was proposed in our previous work [12]. The main idea behind the cumulative attributes is the intuitive fact that low level features for certain vision tasks, such as age estimation or crowd counting, are cumulative by nature. In this work, we show that this hypothesis holds for a wider class of vision problems.

Inspired by the success of CA for scalar-valued regression [13], we extend CA to the multivariate output setting. A straightforward extension is to apply CA regression to each output variable independently. This approach is the baseline in our work, the Independent Cumulative Attribute space (IndepCA). The drawback of IndepCA is its limited ability to exploit the multi-dimensional nature of the target space, thus omitting the correlations of the output variables (such as visual similarity of faces between adjacent pitch and yaw bins in Fig. 1).

To overcome this limitation we generalize CA to the 2-output case by adopting a mapping based on the Cartesian product (Fig. 1), the Cartesian Cumulative Attribute space (CartCA). The CartCA divides the multi-dimensional space into disjoint regions. For a landmark point anchored in a multi-dimensional target space, i.e. a single regression label, CartCA forms uniquely different binary partitions of training samples. CartCA is a generalization of the original CA for two-dimensional target spaces. The number of binary partitions grows exponentially w.r.t. the label space dimensionality, making CartCA impractical beyond two outputs.

https://doi.org/10.1016/j.patcog.2018.10.015

Fig. 1. Cartesian Cumulative Attribute space (CartCA) for 2-output regression. CA-based regression has three processing stages: i) feature extraction, ii) mapping from feature space to Cumulative Attribute space (Attribute Learning) and iii) mapping from CA space to a two-dimensional output space (Target Regression: head yaw and pitch angles).

To avoid the combinatorial explosion, we propose an approximation by projecting training samples into various 2D sub-spaces to which CartCA is applied. We call this approach Multi-View Cumulative Attribute (MvCA) regression. In the experimental part, we study these methods in three different multivariate visual regression problems: 2D head pose estimation, 3D head pose estimation and 3D illumination (RGB) estimation for color constancy. In all experiments, our method provides competitive performance and consistently outperforms methods that do not construct a cumulative attribute space layer for regression.

Our main contributions are summarized as follows:

• We extend the scalar value cumulative attribute (CA) regression to 2-output cumulative regression by adopting the Cartesian product to partition output spaces (CartCA).

• We propose an approximation approach for CA with ≥ 3 outputs by partitioning output spaces into multiple 2D views, Multi-view Cumulative Attribute (MvCA). This approximation avoids the exponential growth of CartCA.

• We demonstrate the effectiveness of multi-output CA regression in several computer vision applications (2D and 3D head pose estimation and RGB illumination estimation for color constancy) where CartCA and MvCA achieve competitive accuracies compared to the state-of-the-art.

2. Related work

In this section, we provide a short survey of the recent and related works in visual regression and attribute learning. Since our experiments are performed on 2D and 3D targets, we also survey related works on these applications (namely, head pose estimation and color constancy estimation).

Multivariate Regression — For the standard univariate regression problems in computer vision, we seek a mapping f: R^N → R, where the input x ∈ R^N corresponds to N extracted image features and the output y ∈ R is a real-valued regression target. Traditional methods include L2 regularized (ridge) regression, L1 regularized (LASSO) regression [14], random forest regression [6] and support vector regression [7], to name a few. These regression methods can be applied to multivariate regression problems f: R^N → R^D by independently learning univariate regressors f: R^N → R for each target variable y_1, y_2, ..., y_D separately. This approach, however, omits interdependencies between output variables and for that purpose there are other generic approaches such as jointly learning regressors in a multi-task fashion [8] or structured learning methods [9]. For example, structured multivariate regression is applied in a number of computer vision applications [15].

Mid-layer attributes have been adopted in certain recent works [10–12,16–18]. These methods learn a D1-dimensional feature representation, which is used in a two-layer learning architecture f: R^N → R^D1 → R^D or (concatenation of features and attributes) f: R^N → R^D1, R^N → R^D. Indeed, it has been shown in many cases that the two-layer structure improves the accuracy. Inspired by the success of cumulative attributes (CAs) for scalar-valued regression [13], we generalize CA to the 2-output (D = 2) and 3-output (D = 3) settings in this work. For this work, we adopt the Partial Least Squares (PLS) regression [19] and NIPALS [20] for estimating the regression score (and loading) matrices due to their simplicity (for more details see Section 3.3).

Attribute Learning — Visual attributes, which can be either manually defined according to prior knowledge [17,18] or discovered from data [10,16], have been widely applied to a number of classification problems in computer vision, e.g., image categorisation [11,17], person re-identification [18], and action and video event recognition [16]. These classification problems, however, are different from the regression problems since they rarely exhibit a natural cumulative correlation, such as a person's age or the number of people, and often require manual annotation. Yang et al. [21] proposed correlation analysis for two-view image reconstruction.

Recently, the concept of cumulative attributes [12] was proposed for regression problems, as those classification-oriented attributes cannot be utilized directly to explore the cumulative dependency across regression labels. However, CA developed for scalar-valued regression problems can only be applied to multivariate regression problems at the price of missing the multi-dimensional nature of the target space (IndepCA in this work).

Head Pose Estimation — In this case, the regression target is either two-dimensional (yaw and pitch angles) or 3D (+ roll). The challenges reside in feature inconsistency and label ambiguity. In particular, for the same head pose, feature variations between different persons are large due to varying facial appearance. Moreover, the pose labels are noisy as the exact ground truth is difficult to acquire. As head pose estimation is challenging due to uncertain labels, it is considered a good testbed for evaluating the robustness of the proposed attributes. The recent algorithms for head pose estimation can be categorized into two groups: classification-based [22] and regression-based [1,15,23,24]. Moreover, deep architectures have been proposed for human pose recovery [25].

If the head pose estimation problem is cast as a classification problem, the implicit assumption is that pose labels are independent, which discards the ordered dependency across the label space [22]. In view of this, regression-based algorithms have recently become more popular for both 2D [15,26,27] and 3D head pose estimation [23,24].

In [27], a partial least squares regression model was adopted to cope with the misalignment problem when estimating the head pose. Foytik and Asari [26] introduced a two-layer regression framework in a coarse-to-fine manner, which first determines the range of the prediction (i.e. a coarse estimation to robustify against ambiguous labels) and then learns a regression function to estimate the final pose value. Recently, Geng et al. [1] introduced the concept of soft labelling by using adjacent labels around the true pose label in a multi-label learning fashion. This reduces the negative effect of ambiguous targets and helps to capture correlations between neighbouring targets. However, the soft labelling suffers from the invalid assumption that label correlations exist only locally.

On the contrary, the goal of our CartCA and MvCA is to represent the target correlations globally across the whole pose space. Beyond multivariate label distribution, regression forests [23] and their variants [15,24] have proven their effectiveness and real-time efficiency in 2D and 3D head pose estimation.

Illumination Estimation — Another experimental case in our paper considers the estimation of the illumination of color images. This is a 3-output regression problem, where the goal is to estimate the R, G and B values of the scene illumination.

Existing algorithms for illumination estimation can be categorised into two main groups: statistics based [28,29] and learning based [30–32]. In [32], a five-layer ad-hoc CNN was designed, combining feature generation and multi-channel regression to estimate illumination in an end-to-end manner. Qian et al. [4] employed an implicit structured output regression on the output of a fully-connected layer of VGG-Net to discover inter-output correlations.

3. Methodology

This section first introduces cumulative attribute (CA) regression [12] (Section 3.1). Next, a two-variate generalization of CA is proposed (CartCA), and then multi-view CA (MvCA), which is more practical for D > 2 target outputs (Section 3.2). In Section 3.3 the two-stage regression is discussed in more detail.

3.1. Cumulative attribute space

Consider a standard scalar value visual regression problem, with I training examples {x_i, y_i}, where x_i ∈ R^N are N extracted image features for the image indexed by i and y_i ∈ R is the corresponding scalar target. Chen et al. [12] introduce a mid-level mapping to a_i ∈ R^D1, which is termed a "cumulative attribute" vector of x_i.

The main workflow is based on two stage regression, where the first regressor provides the attribute mapping f1: R^N → R^D1 and the second regressor provides the target output mapping f2: R^D1 → R. It is noteworthy that the best performance is achieved by concatenating the original features and the estimated attribute vector in the second stage, i.e. f2: (x, a) → R.

During the training stage, the mid-level attribute values a_i ∈ R^D1 are generated by thresholding the regression target y_i ∈ R using the following CA rule:

    a_{i,j} = 1, when y_i ≤ τ_j
              0, when y_i > τ_j,                                    (1)

for j = 1, 2, ..., D1. In other words, the regression problem is decomposed into D1 binary classification problems by thresholding the target at τ_j. The dimension of the attribute space D1 and the corresponding thresholds are problem specific; for example, in age estimation an obvious choice is to set τ_1 = 1, τ_2 = 2, ..., τ_99 = 99 when D1 = 99.

The attribute mapping f1 is learned using ridge regression, meaning that we learn D1 attribute functions corresponding to the D1 mid-level binary targets. Ideally the mapping should look like a step function with the change located at the true target value, but the estimated attributes â_i are actually real valued vectors that are not binarized but directly used in the next stage regressor f2. This means that binary values are used only during the training stage; in the testing stage the real valued cumulative attributes are used for the final regressor.
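As a concrete illustration, the CA rule (1) and the ridge-regression attribute learning described above can be sketched as follows. This is a minimal NumPy/scikit-learn sketch with synthetic data; the variable names, the random data, and the age-style integer thresholds are our own illustrative choices, not the authors' code:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
I, N, D1 = 200, 16, 99                  # samples, feature dim, attribute dim
X = rng.normal(size=(I, N))             # image features x_i (synthetic)
y = rng.integers(1, 100, size=I)        # scalar targets, e.g. ages 1..99
tau = np.arange(1, D1 + 1)              # thresholds tau_1 = 1, ..., tau_99 = 99

# CA rule (1): a_{i,j} = 1 if y_i <= tau_j, else 0 -> cumulative binary vector
A = (y[:, None] <= tau[None, :]).astype(float)      # shape (I, D1)

# First-stage mapping f1: ridge regression onto the D1 binary targets
# (a single multi-output Ridge fits all D1 attribute functions at once)
f1 = Ridge(alpha=1.0).fit(X, A)

# At test time the attributes stay real-valued ("soft") and are NOT binarized
A_hat = f1.predict(X)
print(A_hat.shape)                      # (200, 99)
```

Each row of A is a step-shaped vector of zeros then ones, which is exactly the cumulative structure the rule encodes.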


3.2. 2- and 3-output cumulative attribute spaces

We will now propose three variants of generalizing the univariate case to the multivariate one.

IndepCA — A straightforward multivariate (D ≥ 2) extension of CA is to treat all output dimensions as independent and use the standard CA for each output variable. We denote this straightforward extension as IndepCA. If, for simplicity, we assume that all D output dimensions are similar, then their corresponding cumulative attribute spaces can be represented by D1-dimensional attribute vectors. IndepCA learns a D1-dimensional attribute mapping for each of the D dimensions of the target space y_i ∈ R^D. For the final stage regression we concatenate the D D1-dimensional attribute vectors into a single vector of length D1 × D. The second stage regressor is a multi-variate regressor or D univariate regressors that provide the target output y_i = (y_1, y_2, ..., y_D). More details about the practical computation are in Section 3.3.
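The IndepCA construction above amounts to repeating the scalar CA rule per output dimension and concatenating. A sketch under the same illustrative assumptions as before (synthetic data, our own names):

```python
import numpy as np

def ca_labels(y, tau):
    """Scalar CA rule (1): a_j = 1 if y <= tau_j, else 0."""
    return (y[:, None] <= tau[None, :]).astype(float)

rng = np.random.default_rng(1)
I, D, D1 = 100, 3, 64
Y = rng.uniform(0, 255, size=(I, D))        # e.g. RGB illuminant targets
tau = np.linspace(0, 255, D1)               # shared thresholds per dimension

# IndepCA: one D1-dim attribute vector per output dimension, concatenated
A_indep = np.concatenate([ca_labels(Y[:, d], tau) for d in range(D)], axis=1)
print(A_indep.shape)                        # (100, 192) = (I, D1 * D)
```

Note that each output dimension is thresholded in isolation, which is precisely why IndepCA cannot capture cross-target correlations.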

For scalar-valued regression, an important advantage of CA comes from its more effective use of the 1D target space than traditional regression learning settings. In particular, with all the available training samples, each attribute function in CA is trained to output either positive (i.e. one) or negative (i.e. zero) values, and a collection of such trained attribute functions, corresponding to a range of landmark points anchored in the 1D target space (e.g. integer ages), provides strong evidence for estimation of the target output. In contrast, regressors in traditional settings are trained to give a complete range of values in the target space, while regression fidelity for any specific target value is taken care of only by a (usually small) subset of training samples. This advantage of CA is particularly important for many regression problems in computer vision, such as human age estimation and crowd density estimation, which often suffer from sparse and imbalanced training data.

The aforementioned collective evidence provided by trained attribute mapping functions, and the attribute vector representation where each entry corresponds to a "landmark" (e.g. age) in a target space, are intuitive and easy to manually select for 1D cases. However, the multivariate setting is more complex as there is no similarly unique way to divide the output space into "zeros" and "ones".

We have already defined a multivariate model based on multiple CA regressors (IndepCA), but its main weakness is that it does not exploit the multi-dimensional nature of the target space in multivariate regression, i.e. cross-correlations and interdependencies of output variables.

CartCA — The main problem in generalizing CA to multivariate cases is how to partition the D-dimensional space such that it naturally represents the cumulative nature of attributes with their mutual dependency. As a novel solution, we propose a model termed Cartesian Cumulative Attributes (CartCA).

Assume again that we have I training samples {x_i, y_i}. Considering a D-dimensional target y_i ∈ R^D, each component y_j, j = 1, 2, ..., D, will partition the training samples into two subsets as defined in (1). Now, if this is done for all j variables and their superpositions added by the Cartesian product, the vector entries of y_i collectively partition the training samples into 2^D subsets, which we denote as {S_1, ..., S_{2^D}}. These subsets of training samples suggest that we can learn 2^D different attribute functions anchored at the position y in the target space. For k = 1, ..., 2^D, CartCA assigns attribute labels {a_i^k} to the training samples {x_i} based on the following rule:

    a_i^k = 1, when y_i ∈ S_k
            0, otherwise.                                           (2)

Consider, for example, the particular case of two-dimensional targets, i.e., D = 2. Then, the above rule for constructing the 2^D (in this case 4) attribute tensors is given as follows:

    a^(1)_{i,j} = 1, when y^(1)_i ≤ τ^(1)_j and y^(2)_i ≤ τ^(2)_j; 0, otherwise,
    a^(2)_{i,j} = 1, when y^(1)_i ≤ τ^(1)_j and y^(2)_i > τ^(2)_j; 0, otherwise,
    a^(3)_{i,j} = 1, when y^(1)_i > τ^(1)_j and y^(2)_i ≤ τ^(2)_j; 0, otherwise,
    a^(4)_{i,j} = 1, when y^(1)_i > τ^(1)_j and y^(2)_i > τ^(2)_j; 0, otherwise,  (3)

where τ^(1)_j and τ^(2)_j are set similarly to the original CA and have a clear semantic meaning. For a training example, the two-dimensional output sets an anchor point that partitions the training samples among the four attribute tensors. An illustration of the above attribute label assignment rule is shown in Fig. 1, where the goal is to estimate the head pose yaw and pitch angles.
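The four "quadrant" attribute planes of rule (3) can be sketched for D = 2 as below. The quadrant ordering and the yaw/pitch anchor grid are illustrative assumptions on our part, not the authors' implementation; the point is that each anchor (τ^(1)_j, τ^(2)_j) splits the 2D target space into four mutually exclusive regions:

```python
import numpy as np

def cartca_planes(Y, tau1, tau2):
    """Rule (3) sketch for D = 2: four binary 'attribute planes' per
    threshold pair, one per quadrant around the anchor (tau1_j, tau2_j).
    Returns an array of shape (I, num_anchors, 4)."""
    b1 = Y[:, 0:1] <= tau1[None, :]          # y^(1) <= tau^(1)_j
    b2 = Y[:, 1:2] <= tau2[None, :]          # y^(2) <= tau^(2)_j
    planes = np.stack([b1 & b2,              # quadrant (<=, <=)
                       b1 & ~b2,             # quadrant (<=, >)
                       ~b1 & b2,             # quadrant (>, <=)
                       ~b1 & ~b2], axis=-1)  # quadrant (>, >)
    return planes.astype(float)

rng = np.random.default_rng(2)
Y = rng.uniform(-90, 90, size=(50, 2))       # yaw/pitch targets (synthetic)
tau = np.arange(-90, 91, 15).astype(float)   # 13 anchor thresholds per axis
A = cartca_planes(Y, tau, tau)
print(A.shape)                               # (50, 13, 4)
# Each sample falls in exactly one quadrant per anchor:
assert np.allclose(A.sum(axis=-1), 1.0)
```

The final assertion makes the "disjoint regions" property explicit: for every anchor, the four quadrant indicators sum to one.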

MvCA — One may notice that the number of attributes in CartCA increases exponentially with the dimensionality of the target space, which makes learning of CartCA impractical in cases of a high-dimensional target space and a small amount of data. In our experiments we found CartCA impractical for D > 2. As a remedy, we propose an approximate CartCA termed Multi-view Cumulative Attributes (MvCA). The MvCA attribute construction rule is based on CartCA, which is still practical for D = 2 using (2).

More specifically, for training samples {x_i, y_i} in the D-dimensional target space, we first select an output dimension pair (j1, j2) with j1, j2 ∈ {1, ..., D}, j1 ≠ j2, and project all the training samples into this CartCA subspace. For a fixed anchor point y_{i,{j1,j2}} ∈ R^2 in the CartCA sub-space, its entries partition the output space into 4 subsets (like those of Fig. 1), based on which MvCA uses 4 different "attribute planes" by following the rules in (3).

For studying the complexity of CartCA and MvCA we may assume that the D1 attribute spaces are similar. In this case, we have a total of D1^2 possible anchor points in the attribute space. MvCA learns 4 attribute planes associated with each of the landmark points, and there are in total D(D−1)/2 such dimension pairs (j1, j2). MvCA learns attribute functions in the same way for each of the pairs, producing a total of 2·D1^2·D(D−1) attribute planes. For D ≥ 3, this is significantly less than the corresponding number (2·D1)^D for the CartCA.
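The attribute counts discussed above can be checked with a few lines of plain arithmetic, under the stated assumption of D1 anchor thresholds per output dimension (D1 = 15 here is an arbitrary illustrative value):

```python
def mvca_count(D, D1):
    # 4 planes per anchor, D1^2 anchors per pair, D*(D-1)/2 dimension pairs
    return 4 * D1**2 * D * (D - 1) // 2      # = 2 * D1^2 * D * (D-1)

def cartca_count(D, D1):
    # 2^D attribute functions per anchor, D1^D anchors
    return (2 * D1) ** D

for D in (2, 3, 4):
    print(D, mvca_count(D, 15), cartca_count(D, 15))
# D=2: both 900; D=3: 2700 vs 27000; D=4: 5400 vs 810000
```

For D = 2 the two counts coincide, matching the statement below that CartCA and MvCA are equivalent in the two-output case, while for D ≥ 3 MvCA grows polynomially rather than exponentially.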

In the case that the target space of multivariate regression is two-dimensional (a plane), i.e. D = 2, CartCA and MvCA are equivalent and give the same number of attribute features. In the case D = 1 all of the original CA, IndepCA, CartCA and MvCA are equivalent. There are also recent works that could be used for dimensionality reduction [33], but these are beyond the scope of this work.

Geometric Interpretation of CartCA and MvCA. We take CartCA as an example, but MvCA can be similarly analyzed. The attribute label assignment rule (2) suggests that each attribute function in CartCA is learned based on a unique binary partition of training samples. Each attribute function trained this way serves as a hyperplane (footnote 1), giving an indicative measure of the position (i.e. multi-variate regression label) of test samples in the target space. In the following, we consider a particular test sample x with a ground-truth label y = ŷ.

• A group of 2^D attribute functions learned by the rule (2) (referring to rule (3) for samples on the boundary), anchored at the position ŷ in the label space, ideally provides an exact indication of the target of x: the attributes given by these functions form a vector 1 ∈ R^{2^D} with all entry values of 1 (any zero-valued entry in this vector indicates y ≠ ŷ). When such a group of attribute functions is not available, attribute functions anchored at neighboring positions of ŷ form polytopes in the target space, which provide different levels of refined position information for the estimation of y.

• Based on different (and unique) binary partitions of the target space, other attribute functions provide different half-space constraints for the estimation of y. When these attributes are concatenated into the vector a_CartCA, they collectively provide rich (and redundant) information for the estimation of y.

An illustration of the above geometric interpretation is presented in Fig. 2. In summary, CartCA (or MvCA) encodes in the attribute vector a_CartCA (or a_MvCA) strong information about the underlying position of any test sample in the target space, which can be exploited for final label estimation.

3.3. Two-stage regression

Given training samples {x_i, y_i} with input features x_i ∈ R^N and output target vector y_i ∈ R^D, we construct the training attribute targets a_i ∈ R^D1 based on the attribute construction rules in the previous sections.

To this end, we employ the Partial Least Squares (PLS) regression [19] for its capability to cope with the multicollinearity problem; it has recently been applied to a number of visual regression problems [27]. A typical solution for estimating the score (and loading) matrices is NIPALS [20], which we adopt for its low computational complexity (O(N^2)). Alternatively, other multivariate regression models can also be employed, such as multivariate ridge regression [12] and regression forests [6]. Partial least squares regression is adopted owing to its simplicity in implementation and computational efficiency. PLS learns a mapping function f: R^N → R^D1 from the training data, which is used to estimate an attribute feature vector ã ∈ R^D1 for an unseen test sample x and is the first stage regressor in the proposed CartCA and MvCA regression methods.

1 Alternative to the regression based attribute functions in our work, also any two-class (binary) classifier can be trained for the attribute assignments defined in (2). However, during our experiments we have found the real valued outputs of regressors, soft attributes, more effective. This can be explained by the fact that no information is lost in binary decisions and the whole pipeline is regression based.

Fig. 2. Geometric intuition of the proposed Cartesian Cumulative Attributes. Attribute functions/hyperplanes (blue lines) form polytopes in the target space, which provide different levels of indicative position information on the target (dark star point) of a test sample. In the weaker form certain attributes provide half-space constraints (red lines) on the target of the test sample. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

To perform the second stage target estimation, we first estimate ã_i = f(x_i) and then concatenate x_i with ã_i. The concatenated vectors are used as the training data for the second stage multivariate regression. To learn a mapping function from the concatenated feature space to the multivariate target space, we adopt a few recent state-of-the-art methods, e.g. KPLS [27], KRF [15], and MLD [1], and compare them in our experiments. Our use of the existing methods is mainly to verify the effectiveness of our proposed CartCA and MvCA attribute features, by removing contributions from other factors.

4. Experiments

In the following, the proposed multi-output cumulative attribute space regression methods, IndepCA, CartCA and MvCA, are evaluated in multiple vision problems: 2D head pose estimation (2 pose angles), 3D head pose estimation (3 pose angles) and illumination estimation for color constancy (3 color correction terms for the red, green and blue channels).

4.1. Datasets and settings

Datasets — For 2D head pose estimation, we used the popular Pointing'04 benchmark dataset [34] which contains face images of 15 persons captured with varying appearance in a controlled indoor environment. For 3D head pose estimation, we used the Biwi Kinect Head Pose Estimation dataset [35], which contains depth images of 20 persons. As a visual regression problem distinct from head pose estimation, we also evaluated our model with two illumination estimation datasets [30,36] where the illuminant tri-stimulus value (Red, Green, Blue) is estimated to correct a color biased input image. The SFU Indoor dataset [36] contains 321 images captured in 11 different controlled lighting conditions. The SFU Color Checker dataset [30] contains 568 12-bit dynamic range images which all include the Macbeth Color Checker chart as ground truth. Details of the datasets are given in Table 1.

Features — For 2D head pose estimation, after cropping the foreground of faces with manually-annotated bounding boxes, the facial images are normalized into 32×32 pixels from which we extract a 2511-dimensional histogram of oriented gradients (HoG) feature vector [37], which is widely employed in recent works [1,15,26,27]. Encouraged by the significant advances with Convolutional Neural Networks (CNNs) in facial recognition [38], we also extract CNN features from the "fc6" layer of the pre-trained 16-layer VGG-net model [39].

For 3D head pose estimation, we first remove the background using the provided foreground masks by cropping a 96×96 facial region anchored in the center of the foreground masks. The cropped facial patches are then resized into 32×32 pixels. Inspired by the features used in [23,24], the depth value of each pixel in the 32×32 patches was used as a low-level feature, after which the non-zero pixel intensities (i.e. depth distances) were normalized into [0, 1].

Finally, for the illumination estimation problem, we used the pre-trained 19-layer VGG-net without fine-tuning as described in [4]. For both the SFU Indoor and Color Checker datasets, we follow the settings in [4] to extract 4096-dimensional CNN "fc6" features from images resized to 224×224.

Settings — For the Pointing'04 dataset, two experiments were conducted according to the settings of the data split. In the first experiment, we followed the same training and testing partition as [1,15,26,27], i.e. five-fold cross-validation. An alternative setting, i.e. two image sequences of the same person evenly split into training and testing data, was adopted for the second experiment as in [15]. For the Biwi Kinect dataset, two experiments were conducted by 1) dividing the data into a training part containing the images of the first 18 persons and a testing part with the remaining images [23,24] and 2) adopting five-fold cross-validation [23], respectively. For the SFU Indoor and Color Checker datasets, we followed the standard 3-fold cross-validation protocol in [4,29,31,32,40,41].

Comparative Methods — We collected most of the results of competitive approaches from the corresponding papers. For the ablation study with the 2D dataset we implemented several state-of-the-art methods including linear/kernel partial least squares regression (PLS/KPLS) [27], k-cluster regression forests (KRF) [15], and multivariate label distribution learning (MLD) [1].

For 3D head pose estimation, we adopted standard regression forests (RF) [6] for the second layer multi-variate regression model owing to their strong performance in recent works [23,24].

For illumination estimation, we implemented the comparative multi-output support vector regression [4] in light of its competitive performance. The number of factors for PLS and KPLS with the RBF kernel is 25 and 40, respectively.

For KRF, we followed the setting in [15]: the minimal size in each leaf node is 5 and we grew 20 regression trees. Following [1], MLD adopts the weighted Jeffrey's divergence and a two-dimensional Gaussian distribution with the finest granularity of head pose μ = 15. The regression forests for 3D head pose estimation have at least a sample size of 5 in each leaf node and grow 20 regression trees. For illumination estimation, we used multi-output support vector regression (MSVR) [4] with the RBF kernel. The trade-off parameter C and the γ of the RBF kernel were tuned by three-fold cross-validation.

We adopted the class labels to generate CartCA for 2D head pose estimation, while the 3D head pose angles rounded to the nearest integers are employed to generate CartCA and MvCA. For illumination estimation, we first normalised the ground truth illuminations into [0, 255] levels, which are quantised into 64 bins in a cumulatively and continuously changing manner. The class label of each bin on each colour channel was adopted to generate CartCA and MvCA.
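One plausible reading of the binning step above can be sketched as follows. The per-channel normalization scheme is our own guess (the paper does not spell it out), so treat this only as an illustration of mapping [0, 255] levels into 64 bin labels:

```python
import numpy as np

rng = np.random.default_rng(4)
illum = rng.uniform(0.1, 1.0, size=(10, 3))          # ground-truth RGB illuminants
levels = 255.0 * illum / illum.max(axis=0)           # normalize into [0, 255] (assumed per-channel)
bins = np.floor(levels / 256.0 * 64).astype(int)     # quantize into 64 bins
bins = np.clip(bins, 0, 63)                          # bin index per colour channel
print(bins.min() >= 0, bins.max() <= 63)             # True True
```

The resulting integer bin index per channel then plays the role of the class label from which the CartCA/MvCA thresholds are anchored.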

Performance Metrics — For evaluating the performance of head pose estimation, we employed two types of performance metrics: a regression metric, the Mean Absolute Error (MAE), and a classification metric. Considering the different data characteristics of the labels (i.e. integer angles in the Pointing'04 dataset and scalar values in the Biwi Kinect dataset), we report the classification accuracy of


Table 1
Details of the datasets used in the experiments. D(i) = range of the i-th output dimension (2D face pose: yaw, pitch; 3D face pose: + roll; color constancy: color corrections cR, cG, cB).

Data | # of imgs | Resolution | D(1) | D(2) | D(3) | Note

Face pose
Pointing'04 [34] | 2790 | 384×288 | [−90, 90] | [−90, 90] | – | 13 yaw and 9 pitch angles
Biwi Kinect [35] | 15,677 | 640×480 | [−67, 77] | [−84, 54] | [−70, 63] | float values

Color constancy
SFU Indoors | 321 | 224×224 | [0, 255] | [0, 255] | [0, 255] | RGB values
SFU Color Checker | 568 | 224×224 | [0, 255] | [0, 255] | [0, 255] | RGB values

Table 2
Comparison with the state-of-the-art on 2D head pose estimation with the Pointing'04 dataset (5-fold cross-validation). For MAE a smaller number is better and for classification accuracy a larger number is better. Note that for 2-output regression CartCA and MvCA are equivalent.

Method | MAE Yaw | MAE Pitch | MAE Yaw+Pitch | Acc. Yaw | Acc. Pitch | Acc. Yaw+Pitch

Various feature combinations
Fenzi [43] | 5.9° | 6.7° | – | – | – | –
AKRF-V [44] | 5.5° | 2.8° | – | – | – | –
SDL [45] | 4.12°±0.17° | 2.09°±0.12° | – | – | – | –
PLS [27] | 8.97°±0.87° | 9.27°±0.41° | 15.51°±0.53° | 49.25%±3.37% | 46.38%±3.19% | 23.15%±1.04%

HoG features
KPLS [27] | 5.89°±0.83° | 5.76°±0.25° | 10.28°±0.70° | 64.87%±4.30% | 65.34%±2.08% | 44.34%±2.58%
KRF [15] | 5.49°±0.27° | 3.90°±0.65° | 8.79°±0.61° | 64.52%±1.97% | 76.67%±3.73% | 47.53%±2.90%
MLD [1] | 4.41°±0.57° | 2.83°±0.62° | 6.74°±0.70° | 71.61%±3.12% | 84.98%±2.19% | 61.76%±3.84%
IndepCA | 4.31°±0.83° | 2.76°±0.66° | 6.53°±0.76° | 72.87%±4.30% | 85.34%±2.08% | 63.84%±4.34%
CartCA/MvCA | 4.09°±0.70° | 2.60°±0.69° | 6.22°±0.80° | 74.01%±3.94% | 86.95%±2.47% | 65.59%±4.12%

VGG-Net features
CNN | 4.81°±0.23° | 1.85°±0.17° | 6.67°±0.16° | 68.96%±1.08% | 89.93%±1.24% | 61.58%±1.22%
KPLS | 4.72°±0.29° | 4.45°±0.39° | 8.38°±0.44° | 71.25%±1.51% | 72.11%±2.26% | 51.79%±2.51%
KRF | 5.37°±0.67° | 3.76°±0.51° | 8.71°±0.23° | 65.60%±4.12% | 76.95%±2.76% | 48.52%±1.15%
MLD | 3.53°±0.34° | 2.13°±0.22° | 5.37°±0.37° | 77.49%±2.22% | 88.71%±1.25% | 69.10%±1.72%
IndepCA | 3.44°±0.26° | 2.18°±0.31° | 5.33°±0.48° | 77.81%±2.53% | 88.71%±2.23% | 69.32%±2.64%
CartCA/MvCA (ours) | 3.25°±0.34° | 2.04°±0.45° | 5.01°±0.69° | 78.96%±2.04% | 89.21%±2.29% | 70.93%±2.90%

Results are slightly different from those reported in the original papers because of using our own implementation.

predicted poses with respect to the ground truth [1] for 2D head pose estimation, and used the Cumulative Score (CS) defined in [42] for 3D head pose estimation as the classification metrics, respectively.

Following [30,36], for illumination estimation we measured the angular error (cosine distance) ε between the estimated illumination I ∈ R^3 and the ground truth I_gt ∈ R^3:

    ε(I, I_gt) = arccos( I^T I_gt / (||I|| ||I_gt||) ),        (7)

where ||·|| is the Euclidean norm. We report the median and mean value of ε(I, I_gt) over all test samples.
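For concreteness, the angular error defined above can be computed as follows. This is a minimal NumPy sketch (the function name and the clipping safeguard are ours, not from the paper):

```python
import numpy as np

def angular_error(I, I_gt):
    """Angle in degrees between an estimated and a ground-truth illuminant."""
    cos = np.dot(I, I_gt) / (np.linalg.norm(I) * np.linalg.norm(I_gt))
    # clip guards against tiny numerical overshoots outside [-1, 1]
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Illuminants with identical chromaticity (equal up to scale) have zero error.
print(round(angular_error(np.array([2.0, 2.0, 2.0]),
                          np.array([1.0, 1.0, 1.0])), 3))  # 0.0
```

Note that the error is invariant to the overall intensity of the illuminant, which is why it is the standard metric in color constancy evaluation.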

4.2. Comparative evaluation

2D Head Pose Estimation — We compared our IndepCA, CartCA and MvCA with a number of recent methods on the Pointing'04 dataset. The results of these experiments are shown in Table 2. Among the methods, PLS [27], KPLS [27], KRF [15], and MLD [1] use identical HoG and VGG-Net features as our approach. Since our models can use any general-purpose regressor, we selected MLD, as it performed well both in the original paper and in our experiments. Interestingly, our multivariate baseline IndepCA is on par with the existing methods using traditional features (HoG) and clearly superior with the deep CNN features. However, in both cases the proposed CartCA/MvCA is more accurate.

In order to further assess the significance of the feature set, we also fine-tuned the VGG-Net end-to-end in the same evaluation setting. More specifically, we used the VGG convolutional pipeline with two output layers in place of the original 1000-class output layer. The parallel output layers predict the yaw and the pitch angle, encoded as two independent classification problems. The network was trained using the negative log-likelihood loss and softmax activations individually for both the yaw and pitch targets. Moreover, we tested alternative network structures: the ResNet50 base network as well as alternative target encodings. It turned out that clearly the best results are obtained using the VGG-Net structure and the classification encoding (each yaw and pitch angle is one class) instead of the regression target (the two output layers have linear activation and directly predict the yaw and pitch angles).
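The two-head output arrangement (shared features feeding two independent softmax classifiers, one per angle) can be sketched as follows. This is an illustrative NumPy mock-up, not the paper's training code; the feature dimension, weight initialization, and batch are our assumptions:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)

n_features = 512        # size of the shared feature vector (illustrative)
n_yaw, n_pitch = 13, 9  # discrete yaw/pitch classes as in Pointing'04

# Two independent linear heads on top of the shared convolutional features.
W_yaw = rng.standard_normal((n_features, n_yaw)) * 0.01
W_pitch = rng.standard_normal((n_features, n_pitch)) * 0.01

features = rng.standard_normal((4, n_features))  # a mini-batch of 4 samples
p_yaw = softmax(features @ W_yaw)                # shape (4, 13)
p_pitch = softmax(features @ W_pitch)            # shape (4, 9)

# The joint loss is the sum of the per-head negative log-likelihoods.
yaw_labels = np.array([0, 5, 12, 3])
pitch_labels = np.array([1, 4, 8, 0])
nll = -(np.log(p_yaw[np.arange(4), yaw_labels]).mean()
        + np.log(p_pitch[np.arange(4), pitch_labels]).mean())
```

In the actual experiments the heads sit on top of the VGG convolutional pipeline and are optimized jointly by backpropagation.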

It can be seen that in most cases the end-to-end network is inferior to the proposed approach. The network is able to predict the pitch (vertical) angle better than the alternative methods, but performs poorly on yaw angle prediction, rendering the yaw+pitch metric inferior as well. The inferior performance in horizontal angle prediction may be due to the larger number of classes in this direction (13 yaw angles, 7+2 pitch angles), which decreases the number of training samples per class and causes the network to overfit to the relatively small training set.

Finally, in order to assess the general suitability of a CNN for multivariate regression problems, we also considered using the original VGG-Net features with a neural network classifier. More specifically, we trained the described network architecture with frozen convolutional layers, forcing the network to use exactly the same features as the other methods. The results are discouraging, as the errors are up to three times higher than the best ones in Table 2. This is an indication that a plain dense neural network may not be ideal for multivariate regression tasks (note, however, successful results in related tasks with e.g. autoencoder structures [25]), and even better results could be obtained by coupling the fine-tuned convolutional pipeline with the proposed CartCA/MvCA.


Table 3
Comparison with state-of-the-art on 3D head pose estimation with the Biwi Kinect database (data split 1: 18 persons for training and the remaining for testing; data split 2: five-fold cross-validation).

| Method | Split 1: Yaw | Pitch | Roll | Y+P+R | Split 2: Yaw | Pitch | Roll | Y+P+R |
|---|---|---|---|---|---|---|---|---|
| HF [46] | 3.79° | 9.27° | 6.62° | 13.48° | 8.9° | 8.5° | 7.9° | – |
| ADF [24] | 3.54° | 7.87° | 5.39° | 11.48° | – | – | – | – |
| ARF [47] | 3.52° | 8.18° | 4.77° | 11.17° | – | – | – | – |
| RF [35]* | – | – | – | – | 3.80° | 3.50° | 5.40° | – |
| KPLS [19] | 1.90° | 1.48° | 1.80° | 3.47° | 2.01°±0.06° | 1.63°±0.03° | 1.80°±0.06° | 3.65°±0.06° |
| RF-i** | 1.95° | 1.50° | 1.94° | 3.72° | 2.00°±0.07° | 1.49°±0.04° | 1.96°±0.05° | 3.77°±0.10° |
| RF-s** | 1.59° | 1.20° | 1.39° | 2.84° | 1.79°±0.11° | 1.31°±0.07° | 1.47°±0.05° | 3.11°±0.15° |
| IndepCA | 1.51° | 1.23° | 1.37° | 2.80° | 1.77°±0.13° | 1.34°±0.14° | 1.45°±0.04° | 3.10°±0.18° |
| CartCA | 1.42° | 1.29° | 1.40° | 2.74° | 1.71°±0.15° | 1.30°±0.11° | 1.46°±0.06° | 3.05°±0.18° |
| MvCA | 1.39° | 1.15° | 1.35° | 2.64° | 1.63°±0.10° | 1.24°±0.06° | 1.43°±0.06° | 2.92°±0.14° |

* uses foreground detection; ** is based on our implementation of [6].

Table 4
Comparison with state-of-the-art on color constancy with the SFU Indoor and Color Checker datasets. Median and mean angular errors between the estimated and ground truth illuminant (RGB) are reported (smaller is better). We use identical deep features to MSVR [4].

| Method | SFU Indoor: Median | Mean | SFU Color Checker: Median | Mean |
|---|---|---|---|---|
| second-order Gray Edge (2nd GE) [48] | 2.7 | 5.2 | 4.4 | 5.1 |
| Weighted Gray Edge (WGE) [49] | 2.4 | 5.6 | – | – |
| Gamut Mapping (GM-pixel) [50] | 2.3 | 3.7 | 2.3 | 4.2 |
| Natural Image Statistics (NIS) [40] | – | – | 3.1 | 4.2 |
| Exemplar [31] | – | – | 2.3 | 2.9 |
| Grey Pixel (std) [29] | 2.5 | 5.7 | 3.2 | 4.7 |
| Grey Pixel (edge) [29] | 2.3 | 5.3 | 3.1 | 4.6 |
| MSVR [4] | 1.9 | 3.1 | 2.8 | 4.3 |
| IndepCA | 1.8 | 3.0 | 2.6 | 4.2 |
| CartCA | 1.8 | 3.0 | 2.7 | 4.2 |
| MvCA | 1.6 | 2.8 | 2.6 | 4.1 |

3D Head Pose Estimation — Two experiments were conducted using different settings for data splitting and the results are shown in Table 3. Since the original random forest regression (RF-i and RF-s) in [6] performed well with the selected depth features, we used RF as the regressor with our methods as well. Similarly to the previous 2D head pose estimation, IndepCA is on par with the state-of-the-art (RF-i/s), with better results in 6 out of the 8 possible measures. However, the two proposed extensions that better exploit output inter-dependencies, CartCA and MvCA, provide the best results. MvCA performed better than CartCA, which can be explained by the limited amount of training data: the 2D projections of MvCA seem to robustify regression as compared to the full CartCA.
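The difference between the full Cartesian encoding and its pairwise multi-view approximation can be illustrated with a small sketch. Here each output pair gets a 2D cumulative attribute map whose entry (i, j) is 1 iff both quantized targets reach at least levels i and j; the bin counts and the exact quantization below are purely illustrative assumptions and do not match the settings used in the experiments:

```python
import numpy as np
from itertools import combinations

def cumulative_attribute(level, n_levels):
    """1D cumulative attribute: ones up to (and including) the target level."""
    a = np.zeros(n_levels)
    a[: level + 1] = 1.0
    return a

def mvca_encode(levels, n_levels):
    """Multi-view CA sketch: one 2D cumulative map per output pair, concatenated."""
    views = []
    for i, j in combinations(range(len(levels)), 2):
        # outer product of the two 1D cumulative vectors gives the 2D map
        view = np.outer(cumulative_attribute(levels[i], n_levels[i]),
                        cumulative_attribute(levels[j], n_levels[j]))
        views.append(view.ravel())
    return np.concatenate(views)

# Three outputs (e.g. yaw, pitch, roll) quantized to coarse illustrative bins.
n_levels = [10, 10, 10]
code = mvca_encode([3, 7, 5], n_levels)
print(code.shape)  # (300,): three 10x10 pairwise views
```

The pairwise views grow as the sum of products n_i·n_j rather than the full product n_1·n_2·n_3 of the Cartesian construction, which is why MvCA stays tractable for three outputs.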

Illumination Estimation — Table 4 compares our methods with the state-of-the-art illumination estimation algorithms on the SFU Indoor and Color Checker datasets. Our method achieves the best performance on both metrics on the SFU Indoor dataset, and our result is comparable to the state-of-the-art on the SFU Color Checker. It is noteworthy that our results are always better than MSVR [4], which uses identical deep features. Again, IndepCA performed well and MvCA was the best of the three proposed methods.

Computational Cost — The additional complexity of the proposed CA models arises from the mid-layer representation, the attribute vector, for which two regressors need to be trained. In traditional visual regression there is a single regressor which maps N input variables to D output variables. The computational complexities (sizes of the attribute vectors) and the actual numbers for the three problems are shown in Table 5.

Table 5
The CA space sizes for the proposed models. Note that only CartCA and MvCA can represent cross-correlations between the output dimensions.

| Model | 2D Head | 3D Head | Color constancy |
|---|---|---|---|
| IndepCA | 22 | 418 | 192 |
| CartCA | 186 | 2.7·10^7 | 2.0·10^6 |
| MvCA | 186 | 2.3·10^5 | 4.9·10^4 |

Table 6
Comparison of the proposed CA spaces with various regressors for the second regression stage. Results correspond to the Yaw+Pitch MAE and classification accuracies with the Pointing'04 benchmark.

| Method | MAE | Accuracy |
|---|---|---|
| *KPLS [27] + HoG* | | |
| IndepCA | 10.80°±0.68° | 41.72%±3.62% |
| MvCA | 7.52°±0.74° | 56.77%±5.23% |
| *KRF [15] + HoG* | | |
| IndepCA | 8.85°±0.76° | 51.54%±4.41% |
| MvCA | 7.92°±0.69° | 53.33%±3.15% |
| *MLD [1] + HoG* | | |
| IndepCA | 6.53°±0.76° | 63.84%±4.34% |
| MvCA | 6.22°±0.80° | 65.59%±4.12% |

4.3. Ablation study

CA Mapping — In order to validate the claim that the proposed Cartesian cumulative attribute multivariate regression (CartCA) and its multi-view projection based approximation (MvCA) provide an accuracy improvement over the straightforward IndepCA, we conducted an ablation study where the different CA spaces were compared using different regressors but with the same visual features.

The results are shown in Table 6. In all cases the higher-dimensional CA spaces provided superior accuracy. However, it is obvious that this finding is most evident with more traditional regressors such as KPLS [27]. The more advanced regressors, such as KRF [15] and MLD [1], exploit output correlations more efficiently and therefore the differences between IndepCA and CartCA/MvCA are less significant.

Concatenating with Imagery Features — During the experiments, we found that the best performance was achieved by concatenating the original imagery features and the cumulative attributes for the second stage regression. In this experiment this finding was verified with both the face pose and the color constancy datasets. The results are shown in Table 7, which clearly indicates that the concatenation provides a small but systematic improvement in all cases.
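The two-stage pipeline with feature concatenation can be sketched as follows: the second-stage regressor sees [image features; predicted CA vector] as its input. In this sketch a closed-form ridge regressor stands in for the actual regressors (MLD, RF) used in the experiments, and all dimensions and synthetic data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def ridge_fit(X, Y, lam=1e-2):
    """Closed-form ridge regression: W = (X^T X + lam*I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Synthetic stand-ins: 200 samples, 32-d image features, 16-d CA vector,
# 2 output dimensions (e.g. yaw and pitch).
n, d_feat, d_ca, d_out = 200, 32, 16, 2
features = rng.standard_normal((n, d_feat))
ca_targets = features @ rng.standard_normal((d_feat, d_ca))  # stage-1 targets
targets = features @ rng.standard_normal((d_feat, d_out))    # final targets

# Stage 1: map image features to the cumulative attribute vector.
ca_pred = features @ ridge_fit(features, ca_targets)

# Stage 2: regress the final targets from the concatenation
# [image features; predicted CA vector].
X2 = np.hstack([features, ca_pred])
pred = X2 @ ridge_fit(X2, targets)
print(np.abs(pred - targets).mean())  # small on this synthetic training data
```

The concatenation lets the second stage fall back on the raw features wherever the attribute prediction is noisy, which is consistent with the small but systematic improvement observed in Table 7.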
