
Contents lists available at ScienceDirect

Pattern Recognition

journal homepage: www.elsevier.com/locate/patcog

Cumulative attribute space regression for head pose estimation and color constancy

Ke Chen a, Kui Jia b, Heikki Huttunen a,∗, Jiri Matas a,c, Joni-Kristian Kämäräinen a

a Laboratory of Signal Processing, Tampere University of Technology, Finland
b School of Electronic and Information Engineering, South China University of Technology, China
c Department of Cybernetics, Czech Technical University, Prague

ARTICLE INFO

Article history:
Received 11 April 2018
Revised 21 September 2018
Accepted 9 October 2018
Available online 10 October 2018

Keywords:
Multivariate regression
Cumulative attribute space
Head pose
Color constancy

ABSTRACT

Two-stage Cumulative Attribute (CA) regression has been found effective in regression problems of computer vision such as facial age and crowd density estimation. The first stage regression maps input features to cumulative attributes that encode correlations between target values. The previous works have dealt with single output regression. In this work, we propose cumulative attribute spaces for 2- and 3-output (multivariate) regression. We show how the original CA space can be generalized to multiple outputs by the Cartesian product (CartCA). However, for target spaces with more than two outputs the CartCA becomes computationally infeasible and therefore we propose an approximate solution, multi-view CA (MvCA), where CartCA is applied to output pairs. We experimentally verify improved performance of the CartCA and MvCA spaces in 2D and 3D face pose estimation and three-output (RGB) illuminant estimation for color constancy.

© 2018 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Multiple output regression predicts several continuous variables simultaneously. One of the emerging topics within regression problems is visual regression. Regression has many applications in vision, such as 2D and 3D head pose estimation and landmark detection [1–3] (see Fig. 1), illumination estimation for color constancy [4], as well as apparent age estimation [5].

A straightforward solution is to learn individual regressors for each target variable separately using traditional techniques (e.g. ridge regression, random forest regression [6] and support vector regression [7]). However, independent regressors discard the interdependence between the target variables, which can be substantial in vision problems. There are more advanced approaches for multivariate regression, such as joint learning of regressors in a multi-task fashion [8] and structured learning [9], but even these generic approaches cannot effectively model cross-target correlations of visual data and are often inferior to problem-specific methods.

Most of the above methods apply the traditional single layer regression architecture, where the multivariate output is estimated either directly from image features, or by optimizing a tailored score function. During the recent years there have been multiple successful attempts to replace the single layer model with two layer (two stage) architectures [10–12]. The first layer output represents an "attribute space" where attribute features have an important semantic meaning for the regression or classification task solved by the second layer output.

∗ Corresponding author. E-mail address: heikki.huttunen@tut.fi (H. Huttunen).

In this work, we focus on the concept of cumulative attribute (CA) space mapping that was proposed in our previous work [12]. The main idea behind the cumulative attributes is the intuitive fact that low level features for certain vision tasks, such as age estimation or crowd counting, are cumulative by nature. In this work, we show that this hypothesis holds for a wider class of vision problems.

Inspired by the success of CA for scalar-valued regression [13], we extend CA to the multivariate output setting. A straightforward extension is to apply CA regression to each output variable independently. This approach is the baseline in our work, the Independent Cumulative Attribute space (IndepCA). The drawback of IndepCA is its limited ability to exploit the multi-dimensional nature of the target space, thus omitting the correlations of the output variables (such as visual similarity of faces between adjacent pitch and yaw bins in Fig. 1).

To overcome this limitation we generalize CA to the 2-output case by adopting a mapping based on the Cartesian product (Fig. 1), the Cartesian Cumulative Attribute space (CartCA). The CartCA divides the multi-dimensional space into disjoint regions. For a landmark point anchored in a multi-dimensional target space, i.e. a single regression label, CartCA forms uniquely different binary partitions of training samples. CartCA is a generalization of the original CA for two-dimensional target spaces. The number of binary partitions grows exponentially w.r.t. the label space dimensionality, making CartCA impractical beyond two outputs.

https://doi.org/10.1016/j.patcog.2018.10.015

Fig. 1. Cartesian Cumulative Attribute space (CartCA) for 2-output regression. CA-based regression has three processing stages: i) feature extraction, ii) mapping from feature space to Cumulative Attribute space (Attribute Learning) and iii) mapping from CA space to a two-dimensional output space (Target Regression: head yaw and pitch angles).

To avoid the combinatorial explosion, we propose an approximation by projecting training samples into various 2D sub-spaces to which CartCA is applied. We call this approach Multi-View Cumulative Attribute (MvCA) regression. In the experimental part, we study these methods in three different multivariate visual regression problems: 2D head pose estimation, 3D head pose estimation and 3D illumination (RGB) estimation for color constancy. In all experiments, our method provides competitive performance and consistently outperforms methods that do not construct a cumulative attribute space layer for regression.

Our main contributions are summarized as follows:

• We extend the scalar value cumulative attribute (CA) regression to 2-output cumulative regression by adopting the Cartesian product to partition output spaces (CartCA).

• We propose an approximation approach for CA with ≥ 3 outputs by partitioning output spaces into multiple 2D views, Multi-view Cumulative Attribute (MvCA). This approximation avoids the exponential growth of CartCA.

• We demonstrate the effectiveness of multi-output CA regression in several computer vision applications (2D and 3D head pose estimation and RGB illumination estimation for color constancy) where CartCA and MvCA achieve competitive accuracies compared to the state-of-the-art.

2. Related work

In this section, we provide a short survey of the recent and related works in visual regression and attribute learning. Since our experiments are performed on 2D and 3D targets, we also survey related works on these applications (namely, head pose estimation and color constancy estimation).

Multivariate Regression — For the standard univariate regression problems in computer vision, we seek a mapping f: R^N → R, where the input x ∈ R^N corresponds to N extracted image features and the output y ∈ R is a real-valued regression target. Traditional methods include L2 regularized (ridge) regression, L1 regularized (LASSO) regression [14], random forest regression [6] and support vector regression [7], to name a few. These regression methods can be applied to multivariate regression problems f: R^N → R^D by independently learning univariate regressors f: R^N → R for each target variable y_1, y_2, ..., y_D separately. This approach, however, omits interdependencies between output variables and for that purpose there are other generic approaches such as jointly learning regressors in a multi-task fashion [8] or structured learning methods [9]. For example, structured multivariate regression is applied in a number of computer vision applications [15].

Mid-layer attributes have been adopted in certain recent works [10–12,16–18]. These methods learn a D1-dimensional feature representation, which is used in a two-layer learning architecture f: R^N → R^D1 → R^D or (concatenation of features and attributes) f: R^N → R^D1, R^N → R^D. Indeed, it has been shown in many cases that the two-layer structure improves the accuracy. Inspired by the success of cumulative attributes (CAs) for scalar-valued regression [13], we generalize CA to the 2-output (D = 2) and 3-output (D = 3) settings in this work. For this work, we adopt the Partial Least Squares (PLS) regression [19] and NIPALS [20] for estimating the regression score (and loading) matrices due to their simplicity (for more details see Section 3.3).

Attribute Learning — Visual attributes, which can be either manually defined according to prior knowledge [17,18] or discovered from data [10,16], have been widely applied to a number of classification problems in computer vision, e.g., image categorisation [11,17], person re-identification [18], and action and video event recognition [16]. These classification problems, however, are different from the regression problems since they rarely exhibit a natural cumulative correlation, such as a person's age or the number of people, and often require manual annotation. Yang et al. [21] proposed correlation analysis for two-view image reconstruction.

Recently, the concept of cumulative attributes [12] was proposed for regression problems, as those classification-oriented attributes cannot be utilized directly to explore the cumulative dependency across regression labels. However, CA developed for scalar-valued regression problems can only be applied to multivariate regression problems at the price of missing the multi-dimensional nature of the target space (IndepCA in this work).

Head Pose Estimation — In this case, the regression target is either two-dimensional (yaw and pitch angles) or 3D (+ roll). The challenges reside in feature inconsistency and label ambiguity. In particular, for the same head pose, feature variations between different persons are large due to varying facial appearance. Moreover, the pose labels are noisy as the exact ground truth is difficult to acquire. As head pose estimation is challenging due to uncertain labels, it is considered a good testbed for evaluating the robustness of the proposed attributes. The recent algorithms for head pose estimation can be categorized into two groups: classification-based [22] and regression-based [1,15,23,24]. Moreover, deep architectures have been proposed for human pose recovery [25].

If the head pose estimation problem is cast as a classification problem, the implicit assumption is that pose labels are independent, which discards the ordered dependency across the label space [22]. In view of this, regression-based algorithms have recently become more popular for both 2D [15,26,27] and 3D head pose estimation [23,24].

In [27], a partial least squares regression model was adopted to cope with the misalignment problem when estimating the head pose. Foytik and Asari [26] introduced a two-layer regression framework in a coarse-to-fine manner, which first determines the range of the prediction (i.e. a coarse estimation to robustify against ambiguous labels) and then learns a regression function to estimate the final pose value. Recently, Geng et al. [1] introduced the concept of soft labelling by using adjacent labels around the true pose label in a multi-label learning fashion. This reduces the negative effect of ambiguous targets and helps to capture correlations between neighbouring targets. However, the soft labelling suffers from the invalid assumption that label correlations exist only locally.

On the contrary, the goal of our CartCA and MvCA is to represent the target correlations globally across the whole pose space. Beyond multivariate label distribution, regression forests [23] and their variants [15,24] have proven their effectiveness and real-time efficiency in 2D and 3D head pose estimation.

Illumination Estimation — Another experimental case in our paper considers the estimation of the illumination of color images. This is a 3-output regression problem, where the goal is to estimate the R, G and B values of the scene illumination.

Existing algorithms for illumination estimation can be categorised into two main groups: statistics based [28,29] and learning based [30–32]. In [32], a five-layer ad-hoc CNN was designed, combining feature generation and multi-channel regression to estimate illumination in an end-to-end manner. Qian et al. [4] employed an implicit structured output regression on the output of a fully-connected layer of VGG-Net to discover inter-output correlations.

3. Methodology

This section first introduces cumulative attribute (CA) regression [12] (Section 3.1). Next, a two-variate generalization of CA is proposed (CartCA), and then multi-view CA (MvCA), which is more practical for D > 2 target outputs (Section 3.2). In Section 3.3 the two-stage regression is discussed in more detail.

3.1. Cumulative attribute space

Consider a standard scalar value visual regression problem, with I training examples {x_i, y_i}, where x_i ∈ R^N are N extracted image features for the image indexed by i and y_i ∈ R is the corresponding scalar target. Chen et al. [12] introduce a mid-level mapping to a_i ∈ R^D1, which is termed a "cumulative attribute" vector of x_i.

The main workflow is based on two stage regression, where the first regressor provides the attribute mapping f1: R^N → R^D1 and the second regressor provides the target output mapping f2: R^D1 → R. It is noteworthy that the best performance is achieved by concatenating the original features and the estimated attribute vector in the second stage, i.e. f2: (x, a) → R.

During the training stage, the mid-level attribute values a_i ∈ R^D1 are generated by thresholding the regression target y_i ∈ R using the following CA rule:

    a_{i,j} = 1, when y_i ≤ τ_j
              0, when y_i > τ_j,                                    (1)

for j = 1, 2, ..., D1. In other words, the regression problem is decomposed into D1 binary classification problems by thresholding the target at τ_j. The dimension of the attribute space D1 and the corresponding thresholds are problem specific; for example, in age estimation an obvious choice is to set τ_1 = 1, τ_2 = 2, ..., τ_99 = 99 when D1 = 99.

The attribute mapping f1 is learned using ridge regression, meaning that we learn D1 attribute functions corresponding to the D1 mid-level binary targets. Ideally the mapping should look like a step function with the change located at the true target value, but the estimated attributes â_i are actually real valued vectors that are not binarized but directly used in the next stage regressor f2. This means that binary values are used only during the training stage; in the testing stage the real valued cumulative attributes are used for the final regressor.
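As a concrete illustration, the CA rule (1) and the ridge-regression attribute learning described above can be sketched as follows. This is a minimal NumPy/scikit-learn sketch with synthetic data; the variable names, the random data, and the age-style integer thresholds are our own illustrative choices, not the authors' code:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
I, N, D1 = 200, 16, 99                  # samples, feature dim, attribute dim
X = rng.normal(size=(I, N))             # image features x_i (synthetic)
y = rng.integers(1, 100, size=I)        # scalar targets, e.g. ages 1..99
tau = np.arange(1, D1 + 1)              # thresholds tau_1 = 1, ..., tau_99 = 99

# CA rule (1): a_{i,j} = 1 if y_i <= tau_j, else 0 -> cumulative binary vector
A = (y[:, None] <= tau[None, :]).astype(float)      # shape (I, D1)

# First-stage mapping f1: ridge regression onto the D1 binary targets
# (a single multi-output Ridge fits all D1 attribute functions at once)
f1 = Ridge(alpha=1.0).fit(X, A)

# At test time the attributes stay real-valued ("soft") and are NOT binarized
A_hat = f1.predict(X)
print(A_hat.shape)                      # (200, 99)
```

Each row of A is a step-shaped vector of zeros then ones, which is exactly the cumulative structure the rule encodes.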


3.2. 2- and 3-output cumulative attribute spaces

We will now propose three variants of generalizing the univariate case to the multivariate one.

IndepCA — A straightforward multivariate (D ≥ 2) extension of CA is to treat all output dimensions as independent and use the standard CA for each output variable. We denote this straightforward extension as IndepCA. If, for simplicity, we assume that all D output dimensions are similar, then their corresponding cumulative attribute spaces can be represented by D1-dimensional attribute vectors. IndepCA learns a D1-dimensional attribute mapping for each of the D dimensions of the target space y_i ∈ R^D. For the final stage regression we concatenate the D D1-dimensional attribute vectors into a single vector of length D1 × D. The second stage regressor is a multi-variate regressor or D univariate regressors that provide the target output y_i = (y_1, y_2, ..., y_D). More details about the practical computation are in Section 3.3.
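The IndepCA construction above amounts to repeating the scalar CA rule per output dimension and concatenating. A sketch under the same illustrative assumptions as before (synthetic data, our own names):

```python
import numpy as np

def ca_labels(y, tau):
    """Scalar CA rule (1): a_j = 1 if y <= tau_j, else 0."""
    return (y[:, None] <= tau[None, :]).astype(float)

rng = np.random.default_rng(1)
I, D, D1 = 100, 3, 64
Y = rng.uniform(0, 255, size=(I, D))        # e.g. RGB illuminant targets
tau = np.linspace(0, 255, D1)               # shared thresholds per dimension

# IndepCA: one D1-dim attribute vector per output dimension, concatenated
A_indep = np.concatenate([ca_labels(Y[:, d], tau) for d in range(D)], axis=1)
print(A_indep.shape)                        # (100, 192) = (I, D1 * D)
```

Note that each output dimension is thresholded in isolation, which is precisely why IndepCA cannot capture cross-target correlations.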

For scalar-valued regression, an important advantage of CA comes from its more effective use of the 1D target space than traditional regression learning settings. In particular, with all the available training samples, each attribute function in CA is trained to output either positive (i.e. one) or negative (i.e. zero) values, and a collection of such trained attribute functions, corresponding to a range of landmark points anchored in the 1D target space (e.g. integer ages), provides strong evidence for estimation of the target output. In contrast, regressors in traditional settings are trained to give a complete range of values in the target space, while regression fidelity for any specific target value is taken care of only by a (usually small) subset of training samples. This advantage of CA is particularly important for many regression problems in computer vision, such as human age estimation and crowd density estimation, which often suffer from sparse and imbalanced training data.

The aforementioned collective evidence provided by trained attribute mapping functions, and the attribute vector representation where each entry corresponds to a "landmark" (e.g. age) in a target space, are intuitive and easy to manually select for 1D cases. However, the multivariate setting is more complex as there is no similarly unique way to divide the output space into "zeros" and "ones".

We have already defined a multivariate model based on multiple CA regressors (IndepCA), but its main weakness is that it does not exploit the multi-dimensional nature of the target space in multivariate regression, i.e. cross-correlations and interdependencies of output variables.

CartCA — The main problem in generalizing CA to multivariate cases is how to partition the D-dimensional space such that it naturally represents the cumulative nature of attributes with their mutual dependency. As a novel solution, we propose a model termed Cartesian Cumulative Attributes (CartCA).

Assume again that we have I training samples {x_i, y_i}. Considering a D-dimensional target y_i ∈ R^D, each component y_j, j = 1, 2, ..., D, will partition the training samples into two subsets as defined in (1). Now, if this is done for all j variables and their superpositions added by the Cartesian product, the vector entries of y_i collectively partition the training samples into 2^D subsets, which we denote as {S_1, ..., S_{2^D}}. These subsets of training samples suggest that we can learn 2^D different attribute functions anchored at the position y in the target space. For k = 1, ..., 2^D, CartCA assigns attribute labels {a_i^k} to the training samples {x_i} based on the following rule:

    a_i^k = 1, when y_i ∈ S_k
            0, otherwise.                                           (2)

Consider, for example, the particular case of two-dimensional targets, i.e., D = 2. Then, the above rule for constructing the 2^D (in this case 4) attribute tensors is given as follows:

    a^(1)_{i,j} = 1, when y^(1)_i ≤ τ^(1)_j and y^(2)_i ≤ τ^(2)_j; 0, otherwise,
    a^(2)_{i,j} = 1, when y^(1)_i ≤ τ^(1)_j and y^(2)_i > τ^(2)_j; 0, otherwise,
    a^(3)_{i,j} = 1, when y^(1)_i > τ^(1)_j and y^(2)_i ≤ τ^(2)_j; 0, otherwise,
    a^(4)_{i,j} = 1, when y^(1)_i > τ^(1)_j and y^(2)_i > τ^(2)_j; 0, otherwise,  (3)

where τ^(1)_j and τ^(2)_j are set similarly to the original CA and have a clear semantic meaning. For a training example, the two-dimensional output sets an anchor point that partitions the training samples among the four attribute tensors. An illustration of the above attribute label assignment rule is shown in Fig. 1, where the goal is to estimate the head pose yaw and pitch angles.
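The four "quadrant" attribute planes of rule (3) can be sketched for D = 2 as below. The quadrant ordering and the yaw/pitch anchor grid are illustrative assumptions on our part, not the authors' implementation; the point is that each anchor (τ^(1)_j, τ^(2)_j) splits the 2D target space into four mutually exclusive regions:

```python
import numpy as np

def cartca_planes(Y, tau1, tau2):
    """Rule (3) sketch for D = 2: four binary 'attribute planes' per
    threshold pair, one per quadrant around the anchor (tau1_j, tau2_j).
    Returns an array of shape (I, num_anchors, 4)."""
    b1 = Y[:, 0:1] <= tau1[None, :]          # y^(1) <= tau^(1)_j
    b2 = Y[:, 1:2] <= tau2[None, :]          # y^(2) <= tau^(2)_j
    planes = np.stack([b1 & b2,              # quadrant (<=, <=)
                       b1 & ~b2,             # quadrant (<=, >)
                       ~b1 & b2,             # quadrant (>, <=)
                       ~b1 & ~b2], axis=-1)  # quadrant (>, >)
    return planes.astype(float)

rng = np.random.default_rng(2)
Y = rng.uniform(-90, 90, size=(50, 2))       # yaw/pitch targets (synthetic)
tau = np.arange(-90, 91, 15).astype(float)   # 13 anchor thresholds per axis
A = cartca_planes(Y, tau, tau)
print(A.shape)                               # (50, 13, 4)
# Each sample falls in exactly one quadrant per anchor:
assert np.allclose(A.sum(axis=-1), 1.0)
```

The final assertion makes the "disjoint regions" property explicit: for every anchor, the four quadrant indicators sum to one.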

MvCA — One may notice that the number of attributes in CartCA increases exponentially with the dimensionality of the target space, which makes learning of CartCA impractical in cases of a high-dimensional target space and a small amount of data. In our experiments we found CartCA impractical for D > 2. As a remedy, we propose an approximate CartCA termed Multi-view Cumulative Attributes (MvCA). The MvCA attribute construction rule is based on CartCA, which is still practical for D = 2 using (2).

More specifically, for training samples {x_i, y_i} in the D-dimensional target space, we first select an output dimension pair (j1, j2) with j1, j2 ∈ {1, ..., D}, j1 ≠ j2, and project all the training samples into this CartCA subspace. For a fixed anchor point y_{i,{j1,j2}} ∈ R^2 in the CartCA sub-space, its entries partition the output space into 4 subsets (like those of Fig. 1), based on which MvCA uses 4 different "attribute planes" by following the rules in (3).

For studying the complexity of CartCA and MvCA we may assume that the D1 attribute spaces are similar. In this case, we have a total of D1^2 possible anchor points in the attribute space. MvCA learns 4 attribute planes associated with each of the landmark points, and there are in total D(D−1)/2 such dimension pairs (j1, j2). MvCA learns attribute functions in the same way for each of the pairs, producing a total of 2·D1^2·D(D−1) attribute planes. For D ≥ 3, this is significantly less than the corresponding number (2·D1)^D for the CartCA.
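The attribute counts discussed above can be checked with a few lines of plain arithmetic, under the stated assumption of D1 anchor thresholds per output dimension (D1 = 15 here is an arbitrary illustrative value):

```python
def mvca_count(D, D1):
    # 4 planes per anchor, D1^2 anchors per pair, D*(D-1)/2 dimension pairs
    return 4 * D1**2 * D * (D - 1) // 2      # = 2 * D1^2 * D * (D-1)

def cartca_count(D, D1):
    # 2^D attribute functions per anchor, D1^D anchors
    return (2 * D1) ** D

for D in (2, 3, 4):
    print(D, mvca_count(D, 15), cartca_count(D, 15))
# D=2: both 900; D=3: 2700 vs 27000; D=4: 5400 vs 810000
```

For D = 2 the two counts coincide, matching the statement below that CartCA and MvCA are equivalent in the two-output case, while for D ≥ 3 MvCA grows polynomially rather than exponentially.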

In the case that the target space of multivariate regression is two-dimensional (a plane), i.e. D = 2, CartCA and MvCA are equivalent and give the same number of attribute features. In the case D = 1 all of the original CA, IndepCA, CartCA and MvCA are equivalent. There are also recent works that could be used for dimensionality reduction [33], but these are beyond the scope of this work.

Geometric Interpretation of CartCA and MvCA. We take CartCA as an example, but MvCA can be similarly analyzed. The attribute label assignment rule (2) suggests that each attribute function in CartCA is learned based on a unique binary partition of training samples. Each attribute function trained this way serves as a hyperplane (footnote 1), giving an indicative measure of the position (i.e. multi-variate regression label) of test samples in the target space. In the following, we consider a particular test sample x with a ground-truth label y = ŷ.

• A group of 2^D attribute functions learned by the rule (2) (referring to rule (3) for samples on the boundary), anchored at the position ŷ in the label space, ideally provides an exact indication of the target of x: the attributes given by these functions form a vector 1 ∈ R^{2^D} with all entry values of 1 (any zero-valued entry in this vector indicates y ≠ ŷ). When such a group of attribute functions is not available, attribute functions anchored at neighboring positions of ŷ form polytopes in the target space, which provide different levels of refined position information for the estimation of y.

• Based on different (and unique) binary partitions of the target space, other attribute functions provide different half-space constraints for the estimation of y. When these attributes are concatenated into the vector a_CartCA, they collectively provide rich (and redundant) information for the estimation of y.

An illustration of the above geometric interpretation is presented in Fig. 2. In summary, CartCA (or MvCA) encodes in the attribute vector a_CartCA (or a_MvCA) strong information about the underlying position of any test sample in the target space, which can be exploited for final label estimation.

3.3. Two-stage regression

Given training samples {x_i, y_i} with input features x_i ∈ R^N and output target vector y_i ∈ R^D, we construct the training attribute targets a_i ∈ R^D1 based on the attribute construction rules in the previous sections.

To this end, we employ the Partial Least Squares (PLS) regression [19] for its capability to cope with the multicollinearity problem; it has recently been applied to a number of visual regression problems [27]. A typical solution for estimating the score (and loading) matrices is NIPALS [20], which we adopt for its low computational complexity (O(N^2)). Alternatively, other multivariate regression models can also be employed, such as multivariate ridge regression [12] and regression forests [6]. Partial least squares regression is adopted owing to its simplicity in implementation and computational efficiency. PLS learns a mapping function f: R^N → R^D1 from the training data, which is used to estimate an attribute feature vector ã ∈ R^D1 for an unseen test sample x and is the first stage regressor in the proposed CartCA and MvCA regression methods.

1 Alternative to the regression based attribute functions in our work, also any two-class (binary) classifier can be trained for the attribute assignments defined in (2). However, during our experiments we have found the real valued outputs of regressors, soft attributes, more effective. This can be explained by the fact that no information is lost in binary decisions and the whole pipeline is regression based.

Fig. 2. Geometric intuition of the proposed Cartesian Cumulative Attributes. Attribute functions/hyperplanes (blue lines) form polytopes in the target space, which provide different levels of indicative position information on the target (dark star point) of a test sample. In the weaker form certain attributes provide half-space constraints (red lines) on the target of the test sample. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

To perform the second stage target estimation, we first estimate ã_i = f(x_i) and then concatenate x_i with ã_i. The concatenated vectors are used as the training data for the second stage multivariate regression. To learn a mapping function from the concatenated feature space to the multivariate target space, we adopt a few recent state-of-the-art methods, e.g. KPLS [27], KRF [15], and MLD [1], and compare them in our experiments. Our use of the existing methods is mainly to verify the effectiveness of our proposed CartCA and MvCA attribute features, by removing contributions from other factors.

4. Experiments

In the following, the proposed multi-output cumulative attribute space regression methods, IndepCA, CartCA and MvCA, are evaluated in multiple vision problems: 2D head pose estimation (2 pose angles), 3D head pose estimation (3 pose angles) and illumination estimation for color constancy (3 color correction terms for the red, green and blue channels).

4.1. Datasets and settings

Datasets — For 2D head pose estimation, we used the popular Pointing'04 benchmark dataset [34] which contains face images of 15 persons captured with varying appearance in a controlled indoor environment. For 3D head pose estimation, we used the Biwi Kinect Head Pose Estimation dataset [35], which contains depth images of 20 persons. As a visual regression problem distinct from head pose estimation, we also evaluated our model with two illumination estimation datasets [30,36] where the illuminant tri-stimulus value (Red, Green, Blue) is estimated to correct a color biased input image. The SFU Indoor dataset [36] contains 321 images captured in 11 different controlled lighting conditions. The SFU Color Checker dataset [30] contains 568 12-bit dynamic range images which all include the Macbeth Color Checker chart as ground truth. Details of the datasets are given in Table 1.

Features — For 2D head pose estimation, after cropping the foreground of faces with manually-annotated bounding boxes, the facial images are normalized into 32×32 pixels from which we extract a 2511-dimensional histogram of oriented gradients (HoG) feature vector [37], which is widely employed in recent works [1,15,26,27]. Encouraged by the significant advances with Convolutional Neural Networks (CNNs) in facial recognition [38], we also extract CNN features from the "fc6" layer of the pre-trained 16-layer VGG-net model [39].

For 3D head pose estimation, we first remove the background using the provided foreground masks by cropping a 96×96 facial region anchored in the center of the foreground masks. The cropped facial patches are then resized into 32×32 pixels. Inspired by the features used in [23,24], the depth value of each pixel in the 32×32 patches was used as a low-level feature, after which the non-zero pixel intensities (i.e. depth distances) were normalized into [0, 1].

Finally, for the illumination estimation problem, we used the pre-trained 19-layer VGG-net without fine-tuning as described in [4]. For both the SFU Indoor and Color Checker datasets, we follow the settings in [4] to extract 4096-dimensional CNN "fc6" features from images resized to 224×224.

Settings — For the Pointing'04 dataset, two experiments were conducted according to the settings of the data split. In the first experiment, we followed the same training and testing partition as [1,15,26,27], i.e. five-fold cross-validation. An alternative setting, i.e. two image sequences of the same person evenly split into training and testing data, was adopted for the second experiment as in [15]. For the Biwi Kinect dataset, two experiments were conducted by 1) dividing the data into a training part containing the images of the first 18 persons and a testing part with the remaining images [23,24] and 2) adopting five-fold cross-validation [23], respectively. For the SFU Indoor and Color Checker datasets, we followed the standard 3-fold cross-validation protocol in [4,29,31,32,40,41].

Comparative Methods — We collected most of the results of competitive approaches from the corresponding papers. For the ablation study with the 2D dataset we implemented several state-of-the-art methods including linear/kernel partial least squares regression (PLS/KPLS) [27], k-cluster regression forests (KRF) [15], and multivariate label distribution learning (MLD) [1].

For 3D head pose estimation, we adopted standard regression forests (RF) [6] for the second layer multi-variate regression model owing to their strong performance in recent works [23,24].

For illumination estimation, we implemented the comparative multi-output support vector regression [4] in light of its competitive performance. The number of factors for PLS and KPLS with the RBF kernel is 25 and 40, respectively.

For KRF, we followed the setting in [15]: the minimal size in each leaf node is 5 and we grew 20 regression trees. Following [1], MLD adopts the weighted Jeffrey's divergence and a two-dimensional Gaussian distribution with the finest granularity of head pose μ = 15. The regression forests for 3D head pose estimation have at least a sample size of 5 in each leaf node and grow 20 regression trees. For illumination estimation, we used multi-output support vector regression (MSVR) [4] with the RBF kernel. The trade-off parameter C and the γ of the RBF kernel were tuned by three-fold cross-validation.

We adopted the class labels to generate CartCA for 2D head pose estimation, while the 3D head pose angles rounded to the nearest integers are employed to generate CartCA and MvCA. For illumination estimation, we first normalised the ground truth illuminations into [0, 255] levels, which are quantised into 64 bins in a cumulatively and continuously changing manner. The class label of each bin on each colour channel was adopted to generate CartCA and MvCA.
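One plausible reading of the binning step above can be sketched as follows. The per-channel normalization scheme is our own guess (the paper does not spell it out), so treat this only as an illustration of mapping [0, 255] levels into 64 bin labels:

```python
import numpy as np

rng = np.random.default_rng(4)
illum = rng.uniform(0.1, 1.0, size=(10, 3))          # ground-truth RGB illuminants
levels = 255.0 * illum / illum.max(axis=0)           # normalize into [0, 255] (assumed per-channel)
bins = np.floor(levels / 256.0 * 64).astype(int)     # quantize into 64 bins
bins = np.clip(bins, 0, 63)                          # bin index per colour channel
print(bins.min() >= 0, bins.max() <= 63)             # True True
```

The resulting integer bin index per channel then plays the role of the class label from which the CartCA/MvCA thresholds are anchored.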

Performance Metrics — For evaluating the performance of head pose estimation, we employed two types of performance metrics: a regression metric, the Mean Absolute Error (MAE), and a classification metric. Considering the different data characteristics of the labels (i.e. integer angles in the Pointing'04 dataset and scalar values in the Biwi Kinect dataset), we report the classification accuracy of


Table 1
Details of the datasets used in the experiments. D(i) = range of the i-th output dimension (2D face pose: yaw, pitch; 3D face pose: + roll; color constancy: color corrections cR, cG, cB).

Data | # of imgs | Resolution | D(1) | D(2) | D(3) | Note

Face pose
Pointing'04 [34] | 2790 | 384×288 | [−90, 90] | [−90, 90] | – | 13 yaw and 9 pitch angles
Biwi Kinect [35] | 15,677 | 640×480 | [−67, 77] | [−84, 54] | [−70, 63] | float values

Color constancy
SFU Indoors | 321 | 224×224 | [0, 255] | [0, 255] | [0, 255] | RGB values
SFU Color Checker | 568 | 224×224 | [0, 255] | [0, 255] | [0, 255] | RGB values

Table 2
Comparison with the state-of-the-art on 2D head pose estimation with the Pointing'04 dataset (5-fold cross-validation). For MAE a smaller number is better and for classification accuracy a larger number is better. Note that for 2-output regression CartCA and MvCA are equivalent.

Method | MAE Yaw | MAE Pitch | MAE Yaw+Pitch | Acc. Yaw | Acc. Pitch | Acc. Yaw+Pitch

Various feature combinations
Fenzi [43] | 5.9° | 6.7° | – | – | – | –
AKRF-V [44] | 5.5° | 2.8° | – | – | – | –
SDL [45] | 4.12°±0.17° | 2.09°±0.12° | – | – | – | –
PLS [27] | 8.97°±0.87° | 9.27°±0.41° | 15.51°±0.53° | 49.25%±3.37% | 46.38%±3.19% | 23.15%±1.04%

HoG features
KPLS [27] | 5.89°±0.83° | 5.76°±0.25° | 10.28°±0.70° | 64.87%±4.30% | 65.34%±2.08% | 44.34%±2.58%
KRF [15] | 5.49°±0.27° | 3.90°±0.65° | 8.79°±0.61° | 64.52%±1.97% | 76.67%±3.73% | 47.53%±2.90%
MLD [1] | 4.41°±0.57° | 2.83°±0.62° | 6.74°±0.70° | 71.61%±3.12% | 84.98%±2.19% | 61.76%±3.84%
IndepCA | 4.31°±0.83° | 2.76°±0.66° | 6.53°±0.76° | 72.87%±4.30% | 85.34%±2.08% | 63.84%±4.34%
CartCA/MvCA | 4.09°±0.70° | 2.60°±0.69° | 6.22°±0.80° | 74.01%±3.94% | 86.95%±2.47% | 65.59%±4.12%

VGG-Net features
CNN | 4.81°±0.23° | 1.85°±0.17° | 6.67°±0.16° | 68.96%±1.08% | 89.93%±1.24% | 61.58%±1.22%
KPLS | 4.72°±0.29° | 4.45°±0.39° | 8.38°±0.44° | 71.25%±1.51% | 72.11%±2.26% | 51.79%±2.51%
KRF | 5.37°±0.67° | 3.76°±0.51° | 8.71°±0.23° | 65.60%±4.12% | 76.95%±2.76% | 48.52%±1.15%
MLD | 3.53°±0.34° | 2.13°±0.22° | 5.37°±0.37° | 77.49%±2.22% | 88.71%±1.25% | 69.10%±1.72%
IndepCA | 3.44°±0.26° | 2.18°±0.31° | 5.33°±0.48° | 77.81%±2.53% | 88.71%±2.23% | 69.32%±2.64%
CartCA/MvCA (ours) | 3.25°±0.34° | 2.04°±0.45° | 5.01°±0.69° | 78.96%±2.04% | 89.21%±2.29% | 70.93%±2.90%

Results are slightly different from those reported in the original papers because of using our own implementation.

predicted poses with respect to the ground truth [1] for 2D head pose estimation, and used the Cumulative Score (CS) defined in [42] for 3D head pose estimation as the classification metrics, respectively.

Following [30,36], for illumination estimation we measured the angular error (cosine distance) ε between the estimated illumination I ∈ R^3 and the ground truth I_gt ∈ R^3:

    ε(I, I_gt) = arccos( I^T I_gt / (||I|| ||I_gt||) ),        (7)

where ||·|| is the Euclidean norm. We report the median and mean value of ε(I, I_gt) over all test samples.
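For concreteness, the angular error defined above can be computed as follows. This is a minimal NumPy sketch (the function name and the clipping safeguard are ours, not from the paper):

```python
import numpy as np

def angular_error(I, I_gt):
    """Angle in degrees between an estimated and a ground-truth illuminant."""
    cos = np.dot(I, I_gt) / (np.linalg.norm(I) * np.linalg.norm(I_gt))
    # clip guards against tiny numerical overshoots outside [-1, 1]
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Illuminants with identical chromaticity (equal up to scale) have zero error.
print(round(angular_error(np.array([2.0, 2.0, 2.0]),
                          np.array([1.0, 1.0, 1.0])), 3))  # 0.0
```

Note that the error is invariant to the overall intensity of the illuminant, which is why it is the standard metric in color constancy evaluation.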

4.2. Comparative evaluation

2D Head Pose Estimation — We compared our IndepCA, CartCA and MvCA with a number of recent methods on the Pointing'04 dataset. The results of these experiments are shown in Table 2. Among the methods, PLS [27], KPLS [27], KRF [15], and MLD [1] use identical HoG and VGG-Net features as our approach. Since our models can use any general-purpose regressor, we selected MLD, as it performed well both in the original paper and in our experiments. Interestingly, our multivariate baseline IndepCA is on par with the existing methods using traditional features (HoG) and clearly superior with the deep CNN features. However, in both cases the proposed CartCA/MvCA is more accurate.

In order to further assess the significance of the feature set, we also fine-tuned the VGG-Net end-to-end in the same evaluation setting. More specifically, we used the VGG convolutional pipeline with two output layers in place of the original 1000-class output layer. The parallel output layers predict the yaw and the pitch angle, encoded as two independent classification problems. The network was trained using the negative log-likelihood loss and softmax activations individually for both the yaw and pitch targets. Moreover, we tested alternative network structures: the ResNet50 base network as well as alternative target encodings. It turned out that clearly the best results are obtained using the VGG-Net structure and the classification encoding (each yaw and pitch angle is one class) instead of the regression target (the two output layers have linear activation and directly predict the yaw and pitch angles).
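The two-head output arrangement (shared features feeding two independent softmax classifiers, one per angle) can be sketched as follows. This is an illustrative NumPy mock-up, not the paper's training code; the feature dimension, weight initialization, and batch are our assumptions:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)

n_features = 512        # size of the shared feature vector (illustrative)
n_yaw, n_pitch = 13, 9  # discrete yaw/pitch classes as in Pointing'04

# Two independent linear heads on top of the shared convolutional features.
W_yaw = rng.standard_normal((n_features, n_yaw)) * 0.01
W_pitch = rng.standard_normal((n_features, n_pitch)) * 0.01

features = rng.standard_normal((4, n_features))  # a mini-batch of 4 samples
p_yaw = softmax(features @ W_yaw)                # shape (4, 13)
p_pitch = softmax(features @ W_pitch)            # shape (4, 9)

# The joint loss is the sum of the per-head negative log-likelihoods.
yaw_labels = np.array([0, 5, 12, 3])
pitch_labels = np.array([1, 4, 8, 0])
nll = -(np.log(p_yaw[np.arange(4), yaw_labels]).mean()
        + np.log(p_pitch[np.arange(4), pitch_labels]).mean())
```

In the actual experiments the heads sit on top of the VGG convolutional pipeline and are optimized jointly by backpropagation.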

It can be seen that in most cases the end-to-end network is inferior to the proposed approach. The network is able to predict the pitch (vertical) angle better than the alternative methods, but performs poorly on yaw angle prediction, rendering the yaw+pitch metric inferior as well. The inferior performance in horizontal angle prediction may be due to the larger number of classes in this direction (13 yaw angles, 7+2 pitch angles), which decreases the number of training samples per class and causes the network to overfit to the relatively small training set.

Finally, in order to assess the general suitability of a CNN for multivariate regression problems, we also considered using the original VGG-Net features with a neural network classifier. More specifically, we trained the described network architecture with frozen convolutional layers, forcing the network to use exactly the same features as the other methods. The results are discouraging, as the errors are up to three times higher than the best ones in Table 2. This is an indication that a plain dense neural network may not be ideal for multivariate regression tasks (note, however, successful results in related tasks with e.g. autoencoder structures [25]), and even better results could be obtained by coupling the fine-tuned convolutional pipeline with the proposed CartCA/MvCA.


Table 3
Comparison with state-of-the-art on 3D head pose estimation with the Biwi Kinect database (data split 1: 18 persons for training and the remaining for testing; data split 2: five-fold cross-validation).

| Method | Split 1: Yaw | Pitch | Roll | Y+P+R | Split 2: Yaw | Pitch | Roll | Y+P+R |
|---|---|---|---|---|---|---|---|---|
| HF [46] | 3.79° | 9.27° | 6.62° | 13.48° | 8.9° | 8.5° | 7.9° | – |
| ADF [24] | 3.54° | 7.87° | 5.39° | 11.48° | – | – | – | – |
| ARF [47] | 3.52° | 8.18° | 4.77° | 11.17° | – | – | – | – |
| RF [35]* | – | – | – | – | 3.80° | 3.50° | 5.40° | – |
| KPLS [19] | 1.90° | 1.48° | 1.80° | 3.47° | 2.01°±0.06° | 1.63°±0.03° | 1.80°±0.06° | 3.65°±0.06° |
| RF-i** | 1.95° | 1.50° | 1.94° | 3.72° | 2.00°±0.07° | 1.49°±0.04° | 1.96°±0.05° | 3.77°±0.10° |
| RF-s** | 1.59° | 1.20° | 1.39° | 2.84° | 1.79°±0.11° | 1.31°±0.07° | 1.47°±0.05° | 3.11°±0.15° |
| IndepCA | 1.51° | 1.23° | 1.37° | 2.80° | 1.77°±0.13° | 1.34°±0.14° | 1.45°±0.04° | 3.10°±0.18° |
| CartCA | 1.42° | 1.29° | 1.40° | 2.74° | 1.71°±0.15° | 1.30°±0.11° | 1.46°±0.06° | 3.05°±0.18° |
| MvCA | 1.39° | 1.15° | 1.35° | 2.64° | 1.63°±0.10° | 1.24°±0.06° | 1.43°±0.06° | 2.92°±0.14° |

* uses foreground detection; ** is based on our implementation of [6].

Table 4
Comparison with state-of-the-art on color constancy with the SFU Indoor and Color Checker datasets. Median and mean angular errors between the estimated and ground truth illuminant (RGB) are reported (smaller is better). We use identical deep features to MSVR [4].

| Method | SFU Indoor: Median | Mean | SFU Color Checker: Median | Mean |
|---|---|---|---|---|
| second-order Gray Edge (2nd GE) [48] | 2.7 | 5.2 | 4.4 | 5.1 |
| Weighted Gray Edge (WGE) [49] | 2.4 | 5.6 | – | – |
| Gamut Mapping (GM-pixel) [50] | 2.3 | 3.7 | 2.3 | 4.2 |
| Natural Image Statistics (NIS) [40] | – | – | 3.1 | 4.2 |
| Exemplar [31] | – | – | 2.3 | 2.9 |
| Grey Pixel (std) [29] | 2.5 | 5.7 | 3.2 | 4.7 |
| Grey Pixel (edge) [29] | 2.3 | 5.3 | 3.1 | 4.6 |
| MSVR [4] | 1.9 | 3.1 | 2.8 | 4.3 |
| IndepCA | 1.8 | 3.0 | 2.6 | 4.2 |
| CartCA | 1.8 | 3.0 | 2.7 | 4.2 |
| MvCA | 1.6 | 2.8 | 2.6 | 4.1 |

3D Head Pose Estimation — Two experiments were conducted using different settings for data splitting and the results are shown in Table 3. Since the original random forest regression (RF-i and RF-s) in [6] performed well with the selected depth features, we used RF as the regressor with our methods as well. Similarly to the previous 2D head pose estimation, IndepCA is on par with the state-of-the-art (RF-i/s), with better results in 6 out of the 8 possible measures. However, the two proposed extensions that better exploit output inter-dependencies, CartCA and MvCA, provide the best results. MvCA performed better than CartCA, which can be explained by the limited amount of training data: the 2D projections of MvCA seem to robustify regression as compared to the full CartCA.
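The difference between the full Cartesian encoding and its pairwise multi-view approximation can be illustrated with a small sketch. Here each output pair gets a 2D cumulative attribute map whose entry (i, j) is 1 iff both quantized targets reach at least levels i and j; the bin counts and the exact quantization below are purely illustrative assumptions and do not match the settings used in the experiments:

```python
import numpy as np
from itertools import combinations

def cumulative_attribute(level, n_levels):
    """1D cumulative attribute: ones up to (and including) the target level."""
    a = np.zeros(n_levels)
    a[: level + 1] = 1.0
    return a

def mvca_encode(levels, n_levels):
    """Multi-view CA sketch: one 2D cumulative map per output pair, concatenated."""
    views = []
    for i, j in combinations(range(len(levels)), 2):
        # outer product of the two 1D cumulative vectors gives the 2D map
        view = np.outer(cumulative_attribute(levels[i], n_levels[i]),
                        cumulative_attribute(levels[j], n_levels[j]))
        views.append(view.ravel())
    return np.concatenate(views)

# Three outputs (e.g. yaw, pitch, roll) quantized to coarse illustrative bins.
n_levels = [10, 10, 10]
code = mvca_encode([3, 7, 5], n_levels)
print(code.shape)  # (300,): three 10x10 pairwise views
```

The pairwise views grow as the sum of products n_i·n_j rather than the full product n_1·n_2·n_3 of the Cartesian construction, which is why MvCA stays tractable for three outputs.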

Illumination Estimation — Table 4 compares our methods with the state-of-the-art illumination estimation algorithms on the SFU Indoor and Color Checker datasets. Our method achieves the best performance on both metrics on the SFU Indoor dataset, and our result is comparable to the state-of-the-art on the SFU Color Checker. It is noteworthy that our results are always better than MSVR [4], which uses identical deep features. Again, IndepCA performed well and MvCA was the best of the three proposed methods.

Computational Cost — The additional complexity of the proposed CA models arises from the mid-layer representation, the attribute vector, for which two regressors need to be trained. In traditional visual regression there is a single regressor which maps N input variables to D output variables. The computational complexities (sizes of the attribute vectors) and the actual numbers for the three problems are shown in Table 5.

Table 5
The CA space sizes for the proposed models. Note that only CartCA and MvCA can represent cross-correlations between the output dimensions.

| Model | 2D Head | 3D Head | Color constancy |
|---|---|---|---|
| IndepCA | 22 | 418 | 192 |
| CartCA | 186 | 2.7·10^7 | 2.0·10^6 |
| MvCA | 186 | 2.3·10^5 | 4.9·10^4 |

Table 6
Comparison of the proposed CA spaces with various regressors for the second regression stage. Results correspond to the Yaw+Pitch MAE and classification accuracies with the Pointing'04 benchmark.

| Method | MAE | Accuracy |
|---|---|---|
| *KPLS [27] + HoG* | | |
| IndepCA | 10.80°±0.68° | 41.72%±3.62% |
| MvCA | 7.52°±0.74° | 56.77%±5.23% |
| *KRF [15] + HoG* | | |
| IndepCA | 8.85°±0.76° | 51.54%±4.41% |
| MvCA | 7.92°±0.69° | 53.33%±3.15% |
| *MLD [1] + HoG* | | |
| IndepCA | 6.53°±0.76° | 63.84%±4.34% |
| MvCA | 6.22°±0.80° | 65.59%±4.12% |

4.3. Ablation study

CA Mapping — In order to validate the claim that the proposed Cartesian cumulative attribute multivariate regression (CartCA) and its multi-view projection based approximation (MvCA) provide an accuracy improvement over the straightforward IndepCA, we conducted an ablation study where the different CA spaces were compared using different regressors but with the same visual features.

The results are shown in Table 6. In all cases the higher-dimensional CA spaces provided superior accuracy. However, it is obvious that this finding is most evident with more traditional regressors such as KPLS [27]. The more advanced regressors, such as KRF [15] and MLD [1], exploit output correlations more efficiently and therefore the differences between IndepCA and CartCA/MvCA are less significant.

Concatenating with Imagery Features — During the experiments, we found that the best performance was achieved by concatenating the original imagery features and the cumulative attributes for the second stage regression. In this experiment this finding was verified with both the face pose and the color constancy datasets. The results are shown in Table 7, which clearly indicates that the concatenation provides a small but systematic improvement in all cases.
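The two-stage pipeline with feature concatenation can be sketched as follows: the second-stage regressor sees [image features; predicted CA vector] as its input. In this sketch a closed-form ridge regressor stands in for the actual regressors (MLD, RF) used in the experiments, and all dimensions and synthetic data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def ridge_fit(X, Y, lam=1e-2):
    """Closed-form ridge regression: W = (X^T X + lam*I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Synthetic stand-ins: 200 samples, 32-d image features, 16-d CA vector,
# 2 output dimensions (e.g. yaw and pitch).
n, d_feat, d_ca, d_out = 200, 32, 16, 2
features = rng.standard_normal((n, d_feat))
ca_targets = features @ rng.standard_normal((d_feat, d_ca))  # stage-1 targets
targets = features @ rng.standard_normal((d_feat, d_out))    # final targets

# Stage 1: map image features to the cumulative attribute vector.
ca_pred = features @ ridge_fit(features, ca_targets)

# Stage 2: regress the final targets from the concatenation
# [image features; predicted CA vector].
X2 = np.hstack([features, ca_pred])
pred = X2 @ ridge_fit(X2, targets)
print(np.abs(pred - targets).mean())  # small on this synthetic training data
```

The concatenation lets the second stage fall back on the raw features wherever the attribute prediction is noisy, which is consistent with the small but systematic improvement observed in Table 7.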
