Multimodal subspace support vector data description
Fahad Sohrab a,*, Jenni Raitoharju a,b, Alexandros Iosifidis c, Moncef Gabbouj a

a Faculty of Information Technology and Communication Sciences, Tampere University, FI-33720 Tampere, Finland
b Programme for Environmental Information, Finnish Environment Institute, FI-40500 Jyväskylä, Finland
c Department of Engineering, Electrical and Computer Engineering, Aarhus University, DK-8200 Aarhus, Denmark
Article history: Received 21 August 2019; Revised 13 July 2020; Accepted 6 September 2020; Available online 10 September 2020.

Keywords: Feature transformation; Multimodal data; One-class classification; Support vector data description; Subspace learning
Abstract

In this paper, we propose a novel method for projecting data from multiple modalities to a new subspace optimized for one-class classification. The proposed method iteratively transforms the data from the original feature space of each modality to a new common feature space along with finding a joint compact description of data coming from all the modalities. For data in each modality, we define a separate transformation to map the data from the corresponding feature space to the new optimized subspace by exploiting the available information from the class of interest only. We also propose different regularization strategies for the proposed method and provide both linear and non-linear formulations. The proposed Multimodal Subspace Support Vector Data Description outperforms all the competing methods using data from a single modality or fusing data from all modalities in four out of five datasets.
© 2020 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
1. Introduction

In our surroundings, on a daily basis, we are exposed to information from many different sources. Different sensors are used to gather information about similar objects. Our brains usually perform well in combining the information from different sources to make a concise analysis of a particular entity. A single source of information might be enough to analyze an entity, but to make critical decisions it is important to combine information from different sources in a systematic way. For example, if a person is walking in a crowd, the main information needed to avoid collisions comes from visual cues, but people can also warn each other by voice or even by touch, and this extra information helps in understanding the environment better. The smell can help to avoid unpleasant spots, too. As another example, while watching a movie, the visual information of the scenes alone may not be enough to understand the whole scenario, but the audio and/or captions combined with the visual information will provide the full picture.
* Corresponding author.
E-mail addresses: fahad.sohrab@tuni.fi (F. Sohrab), jenni.raitoharju@tuni.fi (J. Raitoharju), alexandros.iosifidis@eng.au.dk (A. Iosifidis), moncef.gabbouj@tuni.fi (M. Gabbouj).
In machine learning techniques for predictive data modeling, training data are used to form a model that can accurately classify future instances into a predefined number of classes. In many cases, data come from sensors and can be further processed to extract different features. The term multimodal is used to describe data coming from different sensors (also referred to as modes or modalities); however, it is also used as a synonym for multi-view when different features are extracted from the same sensor or when there are multiple similar sensors, e.g., cameras. The aim of multimodal machine learning algorithms is to build models that can process and relate information from more than one modality (or view).

Examples of multimodal representations are prevalent in different application areas. In [1], an active multimodal sensor system for target recognition and tracking is studied, where information from three different sensors (visual, infrared, and hyperspectral) is used. In [2], a framework for vehicle tracking with multimodal data (velocity and images) is proposed, where the outcome of the velocity modality, estimated by applying a Kalman filter to data obtained from motion sensors, is fused with features learned from the image modality by the color-faster R-CNN method. In [3], a multimodal data collection framework for mental stress monitoring is studied. In the proposed framework, physiological and motion sensor data of people under stress are collected.
The data in multimodal applications come from different modalities, where each modality has its own statistical properties and contains specific information. The different modalities usually share high-level concepts and semantic information, and all together contain more information than any single-modal data [4]. If we build a model separately for each modality, the relationship between the modalities cannot be exploited efficiently. In multimodal subspace learning, the goal is to infer a shared latent representation that can accurately model data from each original modality and exploit the relationship between the modalities.
In traditional multiclass machine learning, an adequate amount of data is available for all the categories during training and, hence, the algorithm takes advantage of all available training data from all classes to train a model [5]. However, it is possible that the training data are highly imbalanced, or that the only data available are from a single class. In such cases, one-class classification techniques are used. They are useful in many different settings, such as outlier detection, predicting specific events, or, in general, predicting a specific target class. While much effort has been put into solving one-class classification tasks for data of a single modality [6], much less effort has been put into solving one-class multimodal challenges in general, and we are not aware of any prior work in the field of multimodal learning for one-class classification. In one-class multimodal tasks, it is assumed that the only data available are from a single class in many different modalities.
In this paper, we propose a novel method for solving multimodal one-class classification tasks. The proposed method, Multimodal Subspace Support Vector Data Description (MS-SVDD), finds a transformation for each modality along with defining a common model for all modalities in a lower-dimensional subspace optimized for one-class classification. The rest of the paper is organized as follows. In Section 2, an overview of related work is presented. In Section 3, the newly proposed MS-SVDD is derived and discussed. In Section 4, we present the experimental setup and results, and finally, in Section 5, conclusions are drawn.
2. Background and related work

In this section, we briefly discuss the principles of multimodal learning, along with subspace learning. We also provide an overview of traditional methods used for multiclass multimodal data description and one-class unimodal data description.
2.1. Multimodal learning

The availability of many different modalities can be a blessing if it increases the performance of the machine learning model. However, if the data description algorithm fails to make a strong connection between the different available modalities, the performance can be degraded. To ensure better performance of a model combining data from different modalities, mainly two principles should be ensured, i.e., the consensus and complementary principles [7]:
• The consensus principle aims at minimizing the disagreement between data available from different modes. Maximizing the agreement will reduce the error rate, and better modeling of data is achieved when combining data from different modalities.
• The complementary principle in the context of multimodal learning means that data from each modality may contain some knowledge not contained by the other ones. It is therefore necessary to exploit information from all the available modes to make an accurate description of the data.
Multimodal machine learning techniques can be described by three main properties: two-view vs. multi-view, linear vs. non-linear, and unsupervised vs. supervised [8]. As the name indicates, in two-view learning, the number of views is limited to two. In multi-view learning, the number of views is not limited. The difference between supervised and unsupervised learning is that, in supervised learning, the information on output labels of the training data is taken into account when training the model, while in unsupervised methods, the labels are not used to model the underlying structure or distribution of the data [9]. Linear techniques for multimodal subspace learning may be too simple to provide a representative model. Hence, kernel methods have been proposed to capture non-linear patterns in data.
Multimodal learning techniques have mainly been applied in four application domains [10]: audio-visual speech recognition [11], multimedia content indexing and retrieval [12], understanding human multimodal behaviors [13], and language and vision media description [14]. Recently, there has been a rising trend of applying multimodal machine learning algorithms to other applications as well. For example, in [15], a multimodal data fusion technique is used for the prediction of soybean yield from an unmanned aerial vehicle.
In multimodal learning, the main goal is to develop a process for fusing information from various modalities. In [16], fusion strategies are divided into two categories: model-agnostic and model-based approaches. In model-agnostic approaches, the fusion is either early, late, or hybrid. In early fusion, the data or extracted features are fused together at the very initial phase of modeling. A new feature vector is usually formed by concatenating all the available data from different modes, and the model is trained with the new feature vector. In late fusion, multiple models are trained, and the fusion is done on the scores generated by each model for the corresponding modality. The score generated by each model can be a threshold or some probability used in decision making. Hybrid fusion exploits the advantages of both early and late fusion. Model-based approaches fuse data explicitly during model construction, as in kernel-based approaches, graphical models, and neural networks. In this work, we present a model-based approach for data fusion.
2.2. Subspace learning

In the current era of data science, where high-dimensional multimodal big data are generated every minute in different industries, there is a need to get essential insights and mine knowledge from this high-dimensional data. Subspace learning aims at representing data in a lower-dimensional space while keeping intact all the information available in the original higher-dimensional space.
Algorithms developed for linear subspace learning find a projection matrix for labeled training data (represented by vectors) satisfying some optimality criteria. Principal Component Analysis (PCA) is one of the first subspace learning methods mentioned in the literature. In PCA, a subspace is learned by orthogonally projecting data to a subspace so that the variance of the data is maximized. PCA works only with a single mode of data, i.e., all data should have the same dimension. Another traditional subspace learning method is Linear Discriminant Analysis (LDA), which finds a linear transformation by exploiting the class information.
Analogous to PCA, but used for two-view learning, is canonical correlation analysis (CCA) [17]. CCA is a classic and conventional method for subspace learning, which aims at relating two sets of data by finding the pairs of directions that provide a maximum correlation between the two sets. It has recently become one of the popular methods for unsupervised subspace learning because of its generalization capability and has been used extensively for multimodal data fusion and cross-media retrieval [18]. In subspace learning, state-of-the-art results are achieved by methods which have embraced some stimulus from conventional subspace learning methods [19].
As an extension of methods for linear transformation, kernel methods have been introduced to describe non-linear functions or decision boundaries. In kernel methods, the data are mapped to a typically higher-dimensional kernel space using a kernel function, where they exhibit linear patterns [20,21]. For example, in [22], kernel-PCA, performing a non-linear form of PCA, is proposed.
2.3. One-class classification

In one-class classification, the parameters of the model are estimated using data from the positive class only, because data from the other classes are either not available at all or too diverse in nature to be modeled statistically [23]. The positive class is also called the target class, and the data from the other classes, which are not available during training, are called the negative or outlier class. For example, a unimodal biometric system uses a single biometric trait for verification or identification [24].
Support Vector Data Description (SVDD) [25] is among the most widely used one-class classification methods for anomaly detection and other related applications. SVDD obtains a spherical boundary around the target data, which can be made flexible by using the kernel trick. The obtained boundary is used to detect outliers during testing, i.e., anything inside the closed boundary is classified as the target class and otherwise as an outlier. The Lagrangian of SVDD is given as follows:

L = \sum_{i=1}^{N} \alpha_i \mathbf{x}_i^T \mathbf{x}_i - \sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \mathbf{x}_i^T \mathbf{x}_j \alpha_j,   (1)

where x_i is an input target training instance, and maximizing (1) gives a set of α_i corresponding to each instance. The instances with α_i > 0 define the data description. Another common one-class classification method is One-Class Support Vector Machine (OC-SVM) [26].

Techniques for enhancing the performance of one-class classification methods, mainly extensions of SVDD, can be categorized into four main categories: methods based on data structure, kernel issues, boundary shape, and non-stationary data [27]. As the name indicates, in the data structure category, the main focus is on the structure of the data. For example, in [28], a confidence coefficient is associated with each training sample to deal with the uncertainty of the data. In kernel issue extensions, the main focus is on reducing the complexity or proposing new kernels for one-class classification. For example, in [29], a new kernel is proposed to improve the accuracy of SVDD for time series classification. Proposing changes in the boundary enclosing the target data falls under the third category. For example, in [30], an ellipse shape is used for encapsulating target data instead of the traditional sphere used in SVDD. In [31], it is shown that both SVDD and OC-SVM lead to the same solution when exploiting the elliptical shape of the class. The last category of algorithms for improving one-class classifier performance attempts to handle non-stationary data. For example, in [32], Incremental-SVDD (I-SVDD) is proposed to handle non-stationary or increasing data. Recently, in [33], an algorithm developed for reducing the effect of uncertain data around the hypersphere of SVDD achieved state-of-the-art results on many UCI [34] datasets. In this paper, we consider baseline SVDD combined with multimodal subspace learning. However, in the future, the method can be further extended using similar ideas.
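As a concrete reference point, the sketch below solves the SVDD dual (1) with the usual SVDD constraints (0 ≤ α_i ≤ C, Σ_i α_i = 1) using a generic SciPy solver; the solver choice, function names, and toy data are our illustrative assumptions, not the implementation used in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def svdd_dual(X, C=0.2):
    """Maximize (1): sum_i a_i x_i^T x_i - sum_{i,j} a_i a_j x_i^T x_j,
    s.t. 0 <= a_i <= C and sum_i a_i = 1. Generic QP sketch, not optimized."""
    N = X.shape[0]
    G = X @ X.T                                   # Gram matrix of x_i^T x_j
    neg_dual = lambda a: -(a @ np.diag(G) - a @ G @ a)   # negate: scipy minimizes
    res = minimize(neg_dual, np.full(N, 1.0 / N),
                   bounds=[(0.0, C)] * N,
                   constraints=({'type': 'eq', 'fun': lambda a: a.sum() - 1.0},),
                   method='SLSQP')
    alpha = res.x
    center = alpha @ X                            # hypersphere center a = sum_i alpha_i x_i
    return alpha, center

# toy usage on random 2-D target data
rng = np.random.default_rng(0)
alpha, a = svdd_dual(rng.normal(size=(40, 2)))
print(np.sum((alpha > 1e-6) & (alpha < 0.2 - 1e-6)), 'support vectors')
```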
In the area of multimodal one-class classification, researchers have mainly focused on fusing the output labels of multiple models trained for each type of feature independently, i.e., without taking into account information from other feature types for one model [35].
3. Multimodal subspace support vector data description

MS-SVDD maps data from high-dimensional feature spaces to a low-dimensional feature space optimized for one-class classification. The optimized subspace is shared by data coming from all modalities. MS-SVDD is an extension of Subspace Support Vector Data Description (S-SVDD), which was proposed for unimodal data in [36]. The main novelty of MS-SVDD is using the multimodal approach for one-class classification. Here, we first derive the linear MS-SVDD. Then we derive two non-linear versions using the kernel trick [20] and the Nonlinear Projection Trick (NPT) [37], respectively.
3.1. Linear MS-SVDD
Let us assume that the items to be modeled are represented by M different modalities. The instances in each modality m, m = 1, ..., M, are represented by X_m = [x_{m,1}, x_{m,2}, ..., x_{m,N}], x_{m,i} ∈ R^{D_m}, where N is the total number of instances and D_m is the dimensionality of the feature space in modality m. MS-SVDD tries to find a projection matrix Q_m ∈ R^{d×D_m} for each modality, which will project the corresponding instances to a lower (d)-dimensional optimized subspace shared by all modalities. Thus, a feature vector x_{m,i} is projected to a d-dimensional vector y_{m,i} as

\mathbf{y}_{m,i} = \mathbf{Q}_m \mathbf{x}_{m,i}, \quad \forall m \in \{1, \dots, M\}, \; \forall i \in \{1, \dots, N\}.   (2)

To obtain a common description of all the data transformed from their corresponding modalities to the new common subspace, we exploit Support Vector Data Description (SVDD) [25] to form a closed boundary around the target class data in the new subspace. The center and radius of the hypersphere are denoted by a ∈ R^d and R, respectively. Fig. 1 depicts the basic idea of the proposed method.
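For intuition about the shapes involved in (2), the following toy sketch projects two modalities with random (untrained) matrices Q_m; all names and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3                                   # instances, subspace dimension
dims = [10, 7]                                 # D_1, D_2 of the two modalities
X = [rng.normal(size=(D, N)) for D in dims]    # X_m is (D_m, N), as in the paper
Q = [rng.normal(size=(d, D)) for D in dims]    # Q_m is (d, D_m); learned in MS-SVDD
Y = [Qm @ Xm for Qm, Xm in zip(Q, X)]          # Eq. (2): y_{m,i} = Q_m x_{m,i}
Y_all = np.hstack(Y)                           # (d, M*N) pooled input for joint SVDD
print(Y_all.shape)                             # -> (3, 100)
```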
In order to find a compact hypersphere which encloses all the target data from all the modalities in the new subspace, we minimize

F(R, \mathbf{a}) = R^2 \quad \text{s.t.} \quad \|\mathbf{Q}_m \mathbf{x}_{m,i} - \mathbf{a}\|_2^2 \le R^2, \;\; \forall m \in \{1, \dots, M\}, \; \forall i \in \{1, \dots, N\}.   (3)

By introducing slack variables ξ_{m,i}, such that most of the training data from all the modalities in the new common space should lie inside the hypersphere, the above criterion becomes

F(R, \mathbf{a}) = R^2 + C \sum_{m=1}^{M} \sum_{i=1}^{N} \xi_{m,i} \quad \text{s.t.} \quad \|\mathbf{Q}_m \mathbf{x}_{m,i} - \mathbf{a}\|_2^2 \le R^2 + \xi_{m,i}, \;\; \xi_{m,i} \ge 0, \;\; \forall m \in \{1, \dots, M\}, \; \forall i \in \{1, \dots, N\}.   (4)

The Lagrange function corresponding to (4) can be given as

L = R^2 + C \sum_{m=1}^{M} \sum_{i=1}^{N} \xi_{m,i} - \sum_{m=1}^{M} \sum_{i=1}^{N} \gamma_{m,i} \xi_{m,i} - \sum_{m=1}^{M} \sum_{i=1}^{N} \alpha_{m,i} \left( R^2 + \xi_{m,i} - \mathbf{x}_{m,i}^T \mathbf{Q}_m^T \mathbf{Q}_m \mathbf{x}_{m,i} + 2 \mathbf{a}^T \mathbf{Q}_m \mathbf{x}_{m,i} - \mathbf{a}^T \mathbf{a} \right).   (5)

The Lagrangian function should be maximized with respect to α_{m,i} ≥ 0 and γ_{m,i} ≥ 0 and minimized with respect to R, a, ξ_{m,i}, and Q_m. By setting the partial derivatives to zero, we get

\frac{\partial L}{\partial R} = 0 \;\Rightarrow\; \sum_{m=1}^{M} \sum_{i=1}^{N} \alpha_{m,i} = 1,   (6)

Fig. 1. Depiction of the proposed MS-SVDD: data from two modalities in their corresponding feature spaces are mapped to a common subspace, where positive class instances are enclosed inside a (hyper)sphere.
\frac{\partial L}{\partial \mathbf{a}} = 0 \;\Rightarrow\; \mathbf{a} = \sum_{m=1}^{M} \sum_{i=1}^{N} \alpha_{m,i} \mathbf{Q}_m \mathbf{x}_{m,i},   (7)

\frac{\partial L}{\partial \xi_{m,i}} = 0 \;\Rightarrow\; C - \alpha_{m,i} - \gamma_{m,i} = 0,   (8)

\frac{\partial L}{\partial \mathbf{Q}_m} = 0 \;\Rightarrow\; \mathbf{Q}_m = \mathbf{a} \sum_{i=1}^{N} \alpha_{m,i} \mathbf{x}_{m,i}^T \left( \sum_{i=1}^{N} \alpha_{m,i} \mathbf{x}_{m,i} \mathbf{x}_{m,i}^T \right)^{-1}.   (9)
It is clear from (6)–(9) that the parameters α and Q are interrelated and cannot be jointly optimized. Hence, we apply a two-step iterative optimization process where, in each step, we fix one parameter and optimize the other. Substituting (2), (6), (7) and (8) into the Lagrangian function (5), we get
L = \sum_{m=1}^{M} \sum_{i=1}^{N} \alpha_{m,i} \mathbf{y}_{m,i}^T \mathbf{y}_{m,i} - \sum_{m=1}^{M} \sum_{i=1}^{N} \sum_{n=1}^{M} \sum_{j=1}^{N} \alpha_{m,i} \mathbf{y}_{m,i}^T \mathbf{y}_{n,j} \alpha_{n,j}.   (10)

We see that optimizing (10) for α corresponds to the traditional SVDD applied in the subspace. Maximizing (10) for a particular set of data will give us the α_{m,i} corresponding to each sample. The value of α_{m,i} for the corresponding sample defines its position with respect to the hypersphere (a small sketch of this partition follows the list):

• Samples with 0 < α_{m,i} < C define the data description and lie on the boundary of the hypersphere; they are referred to as support vectors.
• Samples with α_{m,i} = C are outside the boundary.
• Samples with α_{m,i} = 0 lie inside the boundary.
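This partition can be read directly off the α vector; a minimal sketch (the function name and tolerance are ours):

```python
import numpy as np

def partition_by_alpha(alpha, C, tol=1e-8):
    """Split training instances by their position w.r.t. the hypersphere."""
    support = (alpha > tol) & (alpha < C - tol)   # 0 < alpha < C: on the boundary
    outlier = alpha >= C - tol                    # alpha = C: outside the boundary
    inside = alpha <= tol                         # alpha = 0: inside the boundary
    return support, outlier, inside
```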
In the second step, we fix α and update Q_m for each modality. For this step, we add a regularization term ω:

L = \sum_{m=1}^{M} \sum_{i=1}^{N} \alpha_{m,i} \mathbf{x}_{m,i}^T \mathbf{Q}_m^T \mathbf{Q}_m \mathbf{x}_{m,i} - \sum_{m=1}^{M} \sum_{i=1}^{N} \sum_{n=1}^{M} \sum_{j=1}^{N} \alpha_{m,i} \mathbf{x}_{m,i}^T \mathbf{Q}_m^T \mathbf{Q}_n \mathbf{x}_{n,j} \alpha_{n,j} + \beta\omega.   (11)

The regularization term ω expresses the covariance of data from different modalities in the new low-dimensional space, and β is a regularization parameter controlling the significance of ω. We propose different settings for ω:

\omega_0 = 0,   (12)

\omega_1 = \sum_{m=1}^{M} \mathrm{tr}\left( \mathbf{Q}_m \mathbf{X}_m \mathbf{X}_m^T \mathbf{Q}_m^T \right),   (13)

\omega_2 = \sum_{m=1}^{M} \mathrm{tr}\left( \mathbf{Q}_m \mathbf{X}_m \boldsymbol{\alpha}_m \boldsymbol{\alpha}_m^T \mathbf{X}_m^T \mathbf{Q}_m^T \right),   (14)

\omega_3 = \sum_{m=1}^{M} \mathrm{tr}\left( \mathbf{Q}_m \mathbf{X}_m \boldsymbol{\lambda}_m \boldsymbol{\lambda}_m^T \mathbf{X}_m^T \mathbf{Q}_m^T \right),   (15)

\omega_4 = \sum_{m=1}^{M} \sum_{n=1}^{M} \mathrm{tr}\left( \mathbf{Q}_m \mathbf{X}_m \mathbf{X}_n^T \mathbf{Q}_n^T \right),   (16)

\omega_5 = \sum_{m=1}^{M} \sum_{n=1}^{M} \mathrm{tr}\left( \mathbf{Q}_m \mathbf{X}_m \boldsymbol{\alpha}_m \boldsymbol{\alpha}_n^T \mathbf{X}_n^T \mathbf{Q}_n^T \right),   (17)

\omega_6 = \sum_{m=1}^{M} \sum_{n=1}^{M} \mathrm{tr}\left( \mathbf{Q}_m \mathbf{X}_m \boldsymbol{\lambda}_m \boldsymbol{\lambda}_n^T \mathbf{X}_n^T \mathbf{Q}_n^T \right),   (18)
where α_m ∈ R^N in (14) and (17) is a vector having the elements α_{m,1}, ..., α_{m,N}. Thus, α_m has non-zero values for support vectors and outliers. λ_m ∈ R^N in (15) and (18) is a vector having the elements of α_m that are smaller than C. Values of α_m corresponding to the outliers (i.e., α_{m,i} = C) are replaced with zeros in λ_m. Thus, λ_m has non-zero values only for the support vectors. For ω_0, the regularization term becomes obsolete and is not used in the optimization process. In ω_1, the regularization term only uses representations coming from the respective modality, and no representations from the other modalities are used to describe the variance of the positive class. In ω_2, all support vectors, i.e., representations at the hypersphere boundary, and outliers are used to describe the class variance for the update of the corresponding Q_m. In ω_3, only support vectors of the respective modality are used to describe the variance of the class to be modeled. In ω_4, data from all the modalities are used to describe the covariance and regularize the update of Q_m. In ω_5, the instances belonging to the hypersphere boundary and outliers from all modalities are used to describe the covariance. In ω_6, only the support vectors belonging to the class boundary from all modalities are used to update Q_m and describe the covariance of the positive class.

Note that the MS-SVDD formulation reduces to S-SVDD [36] if data from only one modality (M = 1) are taken into account for data description. In S-SVDD, a single projection matrix Q is determined for mapping the data X from the higher-dimensional space to a lower-dimensional space. A regularization term ψ, which expresses the class variance in the low-dimensional space, is added to the Lagrangian function of S-SVDD:

\psi = \mathrm{tr}\left( \mathbf{Q} \mathbf{X} \boldsymbol{\lambda} \boldsymbol{\lambda}^T \mathbf{X}^T \mathbf{Q}^T \right),   (19)

where λ can take different forms as described in [36]. The regularization terms ω_0, ω_1, ω_2, and ω_3 for MS-SVDD become equivalent to the regularization terms proposed for S-SVDD when M = 1. Hence, MS-SVDD is a more generalized form of S-SVDD, which can form a data description by considering data from multiple modalities.

We update Q_m by using the gradient of L in (11) with respect to Q_m,
\mathbf{Q}_m \leftarrow \mathbf{Q}_m - \eta \Delta L,   (20)

where η is the learning rate parameter and the gradient of L is calculated as

\frac{\partial L}{\partial \mathbf{Q}_m} = 2 \sum_{i=1}^{N} \alpha_{m,i} \mathbf{Q}_m \mathbf{x}_{m,i} \mathbf{x}_{m,i}^T - 2 \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{n=1}^{M} \mathbf{Q}_n \mathbf{x}_{n,j} \mathbf{x}_{m,i}^T \alpha_{m,i} \alpha_{n,j} + \beta \Delta\omega,   (21)

where Δω is the derivative of the regularization term with respect to Q_m:

\Delta\omega_0 = 0,   (22)

\Delta\omega_1 = 2 \mathbf{Q}_m \mathbf{X}_m \mathbf{X}_m^T,   (23)

\Delta\omega_2 = 2 \mathbf{Q}_m \mathbf{X}_m \boldsymbol{\alpha}_m \boldsymbol{\alpha}_m^T \mathbf{X}_m^T,   (24)

\Delta\omega_3 = 2 \mathbf{Q}_m \mathbf{X}_m \boldsymbol{\lambda}_m \boldsymbol{\lambda}_m^T \mathbf{X}_m^T,   (25)

\Delta\omega_4 = 2 \sum_{n=1}^{M} \mathbf{Q}_n \mathbf{X}_n \mathbf{X}_m^T,   (26)

\Delta\omega_5 = 2 \sum_{n=1}^{M} \mathbf{Q}_n \mathbf{X}_n \boldsymbol{\alpha}_n \boldsymbol{\alpha}_m^T \mathbf{X}_m^T,   (27)

\Delta\omega_6 = 2 \sum_{n=1}^{M} \mathbf{Q}_n \mathbf{X}_n \boldsymbol{\lambda}_n \boldsymbol{\lambda}_m^T \mathbf{X}_m^T.   (28)
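Since every term of (21) factorizes into rank-one pieces, the gradient can be assembled with a few matrix products. The sketch below does this for the ω_0, ω_3, and ω_6 variants (the others follow the same pattern); the array layout and tolerance are our assumptions, not the authors' reference code.

```python
import numpy as np

def grad_L(Q, X, alpha, beta, C, omega=3, tol=1e-8):
    """Gradient of (11) w.r.t. each Q_m (Eq. (21)) plus the regularizer
    derivative from (22)-(28). Q and X are lists over modalities with
    Q[m]: (d, D_m), X[m]: (D_m, N); alpha is (M, N)."""
    M = len(X)
    Y = [Q[m] @ X[m] for m in range(M)]                    # projected data (d, N)
    # lambda_m: copy of alpha_m with outlier entries (alpha = C) zeroed out
    lam = [np.where(alpha[m] >= C - tol, 0.0, alpha[m]) for m in range(M)]
    # shared factor of the second term: sum_{n,j} alpha_{n,j} Q_n x_{n,j}
    s = sum((Y[n] * alpha[n]).sum(axis=1) for n in range(M))
    grads = []
    for m in range(M):
        t1 = 2.0 * (Y[m] * alpha[m]) @ X[m].T              # first term of (21)
        t2 = -2.0 * np.outer(s, (X[m] * alpha[m]).sum(axis=1))
        if omega == 0:                                     # Eq. (22)
            dw = 0.0
        elif omega == 3:                                   # Eq. (25)
            v = X[m] @ lam[m]
            dw = 2.0 * np.outer(Q[m] @ v, v)
        elif omega == 6:                                   # Eq. (28)
            dw = 2.0 * np.outer(sum(Q[n] @ (X[n] @ lam[n]) for n in range(M)),
                                X[m] @ lam[m])
        else:
            raise ValueError("sketch covers omega in {0, 3, 6} only")
        grads.append(t1 + t2 + beta * dw)
    return grads
```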
We initialize Q_m using PCA. At every iteration, the projection matrix is orthogonalized and normalized so that

\mathbf{Q}_m \mathbf{Q}_m^T = \mathbf{I},   (29)

where I is an identity matrix. We use QR decomposition for orthogonalizing and normalizing the projection matrix Q_m. Algorithm 1 describes the overall MS-SVDD algorithm.
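In NumPy, this QR-based step is a two-liner; a sketch with our naming:

```python
import numpy as np

def orthonormalize_rows(Qm):
    """Enforce Q_m Q_m^T = I (Eq. (29)) after a gradient step: QR-decompose
    Q_m^T so its columns become orthonormal, then transpose back."""
    Qt, _ = np.linalg.qr(Qm.T)   # columns of Qt are orthonormal
    return Qt.T                  # rows of the result are orthonormal
```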
3.2. Non-linear MS-SVDD

For non-linear mapping from the original feature spaces to a new shared feature space, we use two approaches. The first approach is based on the standard kernel trick [20] and the second on the Nonlinear Projection Trick (NPT) [37], which is used as a computationally lighter alternative to the kernel trick.
3.2.1. Non-linear MS-SVDD with the standard kernel trick

In the non-linear data description, the original data are mapped to a kernel space F using a non-linear function φ(·) such that x_{m,i} ∈ R^{D_m} → φ(x_{m,i}) ∈ F. The kernel space dimensionality can possibly be infinite. Then the data are projected from the kernel space to R^d as

\mathbf{y}_{m,i} = \mathbf{Q}_m \phi(\mathbf{x}_{m,i}), \quad \forall i \in \{1, \dots, N\}.   (30)
Algorithm 1: MS-SVDD optimization.

Inputs:  Z_m for each m = 1, ..., M  // input data from all modalities
         β  // regularization parameter controlling the significance of ω
         η  // learning rate parameter
         d  // dimensionality of the joint subspace
         C  // regularization parameter in SVDD
         M  // total number of modalities
Outputs: S_m for each m = 1, ..., M  // projection matrices for different modalities
         R  // radius of the hypersphere
         α  // defines the data description

Z_m = X_m for the linear and NPT cases (K_m for the kernel case)
S_m = Q_m for the linear and NPT cases (W_m for the kernel case)

for m = 1 : M do
    Initialize S_m via linear PCA (kernel PCA);
end
for iter = 1 : max_iter do
    For each m, map Z_m to Y_m using Eq. (2) (Eq. (31));
    Form Y by combining all Y_m's;
    Solve SVDD in the subspace to obtain α in Eq. (10);
    for m = 1 : M do
        Calculate ΔL using Eq. (21) (Eq. (35));
        Update S_m ← S_m − ηΔL;
        Orthogonalize and normalize S_m using QR decomposition (eigendecomposition);
    end
end
For each m, compute Y_m using Eq. (2) (Eq. (31));
Form Y by combining all Y_m's;
Solve SVDD to obtain the final data description;
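To connect the pieces, here is a compact, self-contained sketch of the linear branch of Algorithm 1 with the ω_0 regularizer (so the β-term vanishes); the SLSQP-based SVDD solver, the PCA-by-SVD initialization, and the toy dimensions are our assumptions rather than the authors' reference implementation.

```python
import numpy as np
from scipy.optimize import minimize

def solve_svdd(Y, C):
    """SVDD dual of Eq. (10) in the shared subspace; Y is (d, M*N)."""
    G = Y.T @ Y
    n = G.shape[0]
    neg_dual = lambda a: -(a @ np.diag(G) - a @ G @ a)
    res = minimize(neg_dual, np.full(n, 1.0 / n), bounds=[(0.0, C)] * n,
                   constraints=({'type': 'eq', 'fun': lambda a: a.sum() - 1.0},),
                   method='SLSQP')
    return res.x

def ms_svdd_linear(X, d=2, C=0.1, eta=0.01, iters=10):
    """Linear MS-SVDD loop (Algorithm 1) with the omega_0 regularizer."""
    M, N = len(X), X[0].shape[1]
    Q = []
    for Xm in X:  # initialize each Q_m from the top-d principal directions
        U, _, _ = np.linalg.svd(Xm - Xm.mean(axis=1, keepdims=True),
                                full_matrices=False)
        Q.append(U[:, :d].T)
    for _ in range(iters):
        Y = np.hstack([Q[m] @ X[m] for m in range(M)])     # Eq. (2)
        a = solve_svdd(Y, C).reshape(M, N)                 # alphas, modality-major
        # gradients below use the Q's from the start of this iteration
        s = sum(((Q[n] @ X[n]) * a[n]).sum(axis=1) for n in range(M))
        for m in range(M):
            t1 = 2.0 * ((Q[m] @ X[m]) * a[m]) @ X[m].T     # Eq. (21), omega_0 case
            t2 = -2.0 * np.outer(s, (X[m] * a[m]).sum(axis=1))
            Q[m] = Q[m] - eta * (t1 + t2)                  # Eq. (20)
            Qt, _ = np.linalg.qr(Q[m].T)                   # Eq. (29)
            Q[m] = Qt.T
    Y = np.hstack([Q[m] @ X[m] for m in range(M)])
    return Q, solve_svdd(Y, C)

# toy usage: two modalities describing the same 60 target items
rng = np.random.default_rng(1)
Q, alpha = ms_svdd_linear([rng.normal(size=(8, 60)), rng.normal(size=(5, 60))])
print([q.shape for q in Q], alpha.shape)   # [(2, 8), (2, 5)] (120,)
```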
In order to calculate y_{m,i}, we use the so-called kernel trick by expressing the projection matrix Q_m as a linear combination of the training data representations of the respective modality in the kernel space F, leading to

\mathbf{y}_{m,i} = \mathbf{W}_m \boldsymbol{\Phi}_m^T \phi(\mathbf{x}_{m,i}) = \mathbf{W}_m \mathbf{k}_{m,i}, \quad \forall i \in \{1, \dots, N\},   (31)

where Φ_m ∈ R^{|F|×N} is a matrix formed in F containing the training data representations of modality m, W_m ∈ R^{d×N} is a matrix containing the weights for Φ_m needed to form Q_m, and k_{m,i} is the i-th column of the Gramian matrix, also called the kernel matrix, K_m ∈ R^{N×N}, having elements equal to K_{m,ij} = φ(x_{m,i})^T φ(x_{m,j}). In our experiments, we use the Radial Basis Function (RBF) kernel, given by

K_{m,ij} = \exp\left( -\frac{\|\mathbf{x}_{m,i} - \mathbf{x}_{m,j}\|_2^2}{2\sigma^2} \right),   (32)

where σ > 0 is a hyperparameter that determines the width of the kernel.
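Eq. (32) in NumPy/SciPy, with the data stored column-wise as in the paper (a sketch; σ is a free hyperparameter):

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel(X, sigma=1.0):
    """RBF kernel matrix of Eq. (32); X is (D_m, N), returns (N, N)."""
    sq = cdist(X.T, X.T, metric='sqeuclidean')   # pairwise ||x_i - x_j||^2
    return np.exp(-sq / (2.0 * sigma ** 2))
```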
The augmented version of the Lagrangian function now takes the following form:

L = \sum_{m=1}^{M} \sum_{i=1}^{N} \alpha_{m,i} \mathbf{k}_{m,i}^T \mathbf{W}_m^T \mathbf{W}_m \mathbf{k}_{m,i} - \sum_{m=1}^{M} \sum_{i=1}^{N} \sum_{n=1}^{M} \sum_{j=1}^{N} \alpha_{m,i} \mathbf{k}_{m,i}^T \mathbf{W}_m^T \mathbf{W}_n \mathbf{k}_{n,j} \alpha_{n,j} + \beta\omega.   (33)

The α's are calculated by optimizing (10) with the W_m's fixed, i.e., by applying SVDD in the subspace. In the second step, the α's are fixed and the W_m's are updated with gradient descent:

\mathbf{W}_m \leftarrow \mathbf{W}_m - \eta \Delta L,   (34)

where the gradient is calculated as
\frac{\partial L}{\partial \mathbf{W}_m} = 2 \sum_{i=1}^{N} \alpha_{m,i} \mathbf{W}_m \mathbf{k}_{m,i} \mathbf{k}_{m,i}^T - 2 \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{n=1}^{M} \mathbf{W}_n \mathbf{k}_{n,j} \mathbf{k}_{m,i}^T \alpha_{m,i} \alpha_{n,j} + \beta \Delta\omega.   (35)

The gradient of the regularization term, Δω, now takes the following forms:

\Delta\omega_0 = 0,   (36)

\Delta\omega_1 = 2 \mathbf{W}_m \mathbf{K}_m \mathbf{K}_m^T,   (37)

\Delta\omega_2 = 2 \mathbf{W}_m \mathbf{K}_m \boldsymbol{\alpha}_m \boldsymbol{\alpha}_m^T \mathbf{K}_m^T,   (38)

\Delta\omega_3 = 2 \mathbf{W}_m \mathbf{K}_m \boldsymbol{\lambda}_m \boldsymbol{\lambda}_m^T \mathbf{K}_m^T,   (39)

\Delta\omega_4 = 2 \sum_{n=1}^{M} \mathbf{W}_n \mathbf{K}_n \mathbf{K}_m^T,   (40)

\Delta\omega_5 = 2 \sum_{n=1}^{M} \mathbf{W}_n \mathbf{K}_n \boldsymbol{\alpha}_n \boldsymbol{\alpha}_m^T \mathbf{K}_m^T,   (41)

\Delta\omega_6 = 2 \sum_{n=1}^{M} \mathbf{W}_n \mathbf{K}_n \boldsymbol{\lambda}_n \boldsymbol{\lambda}_m^T \mathbf{K}_m^T.   (42)

We initialize the matrix W_m for each mode using kernel-PCA.
We orthogonalize and normalize W_m at every iteration so that

\mathbf{W}_m \boldsymbol{\Phi}_m^T \boldsymbol{\Phi}_m \mathbf{W}_m^T = \mathbf{I}.   (43)

We decompose (43) using eigendecomposition as

\mathbf{W}_m \boldsymbol{\Phi}_m^T \boldsymbol{\Phi}_m \mathbf{W}_m^T = \mathbf{V}_m \boldsymbol{\Lambda}_m \mathbf{V}_m^T,   (44)

where Φ_m^T Φ_m is K_m, Λ_m is a diagonal matrix containing the eigenvalues of W_m K_m W_m^T, and V_m contains the corresponding eigenvectors. After further simplification, the normalized projection matrix Ŵ_m can be computed as

\hat{\mathbf{W}}_m = \left( \boldsymbol{\Lambda}_m^{\frac{1}{2}} \right)^{+} \mathbf{V}_m^T \mathbf{W}_m,   (45)

where the + sign denotes the pseudo-inverse. For notational simplicity, we set W_m = Ŵ_m.
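A sketch of this normalization, using the identity Φ_m^T Φ_m = K_m so that only the N × N kernel matrix is needed; the eigenvalue floor tol is our numerical safeguard:

```python
import numpy as np

def normalize_W(W, K, tol=1e-10):
    """Enforce W_m K_m W_m^T = I via Eqs. (43)-(45): eigendecompose
    W K W^T = V Lambda V^T and rescale with (Lambda^{1/2})^+ V^T."""
    evals, V = np.linalg.eigh(W @ K @ W.T)                     # Eq. (44)
    inv_sqrt = np.where(evals > tol,
                        1.0 / np.sqrt(np.clip(evals, tol, None)), 0.0)
    return np.diag(inv_sqrt) @ V.T @ W                         # Eq. (45)
```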
3.2.2. Non-linear MS-SVDD with the nonlinear projection trick

The non-linear MS-SVDD using the kernel trick requires computing the eigendecomposition (44) at every iteration. This is computationally expensive and, therefore, we propose an alternative non-linear approach using NPT [37]. Here, a non-linear mapping is applied only at the beginning of the process, while the optimization follows the linear MS-SVDD. In the NPT-based MS-SVDD, we first compute the kernel matrix K_m using (32). In the next step, the computed kernel matrix is centralized as

\hat{\mathbf{K}}_m = \left( \mathbf{I} - \mathbf{E}_N \right) \mathbf{K}_m \left( \mathbf{I} - \mathbf{E}_N \right),   (46)

where K̂_m is the centralized kernel matrix and E_N is an N × N matrix defined as

\mathbf{E}_N = \frac{1}{N} \mathbf{1}_N \mathbf{1}_N^T.   (47)

Here 1_N ∈ R^N is a vector with each element having the value 1. The centralized matrix K̂_m is decomposed by using eigendecomposition,

\hat{\mathbf{K}}_m = \mathbf{U}_m \mathbf{A}_m \mathbf{U}_m^T,   (48)

where A_m contains the non-negative eigenvalues of the centered kernel matrix and U_m contains the corresponding eigenvectors. The data in the reduced-dimensional kernel space are obtained as

\boldsymbol{\Phi}_m = \left( \mathbf{A}_m^{\frac{1}{2}} \right)^{+} \mathbf{U}_m^{+} \hat{\mathbf{K}}_m.   (49)

Since we consider NPT as a pure preprocessing step, we continue by considering Φ_m as our input data, i.e., we set X_m = Φ_m. Then we follow the linear MS-SVDD. Note that in cases where the number of training samples is high, this preprocessing step can be highly accelerated by following approximations, like the Nyström-based Approximate Kernel Subspace Learning method in [38].
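A sketch of the NPT preprocessing (46)–(49); because U_m has orthonormal columns, U_m^+ = U_m^T, which the code exploits, and the strictly-positive eigenvalue cut-off is our assumption:

```python
import numpy as np

def npt_features(K, tol=1e-10):
    """Map a kernel matrix to explicit NPT features Phi_m (Eqs. (46)-(49));
    the (r x N) result is then used as the new input data X_m."""
    N = K.shape[0]
    J = np.eye(N) - np.full((N, N), 1.0 / N)      # I - E_N, with E_N from Eq. (47)
    Kc = J @ K @ J                                # Eq. (46): centered kernel matrix
    evals, U = np.linalg.eigh(Kc)                 # Eq. (48)
    keep = evals > tol                            # keep strictly positive eigenvalues
    A, U = evals[keep], U[:, keep]
    return (U / np.sqrt(A)).T @ Kc                # Eq. (49): (A^{1/2})^+ U^T Kc
```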
3.3. Test phase

During the test phase, an instance x_{m*} ∈ R^{D_m} (the * in the subscript denotes a test instance) coming from modality m is projected to the common d-dimensional subspace using (2) for the linear case. For the kernel case, first the kernel vector is computed as

\mathbf{k}_{m*} = \boldsymbol{\Phi}_m^T \phi(\mathbf{x}_{m*})   (50)

and then projected to the common d-dimensional subspace using (31). For NPT, first the kernel vector k_{m*} is computed and then centralized as

\hat{\mathbf{k}}_{m*} = \left( \mathbf{I} - \mathbf{E}_N \right) \left( \mathbf{k}_{m*} - \frac{1}{N} \mathbf{K}_m \mathbf{1}_N \right).   (51)

The centralized kernel vector is mapped to

\boldsymbol{\phi}_{m*} = \left( \boldsymbol{\Phi}_m^T \right)^{+} \hat{\mathbf{k}}_{m*}   (52)

and then to the d-dimensional subspace using (2) (for notational simplicity, φ_{m*} is considered as x_{m*}). The decision to classify the test instance y_{m*} as positive or negative is taken on the basis of its distance from the center of the hypersphere, i.e.,
\|\mathbf{y}_{m*} - \mathbf{a}\|_2^2 = \mathbf{y}_{m*}^T \mathbf{y}_{m*} - 2 \sum_{k=1}^{M} \sum_{i=1}^{N} \alpha_{k,i} \mathbf{y}_{m*}^T \mathbf{y}_{k,i} + \sum_{k=1}^{M} \sum_{i=1}^{N} \sum_{n=1}^{M} \sum_{j=1}^{N} \alpha_{k,i} \alpha_{n,j} \mathbf{y}_{k,i}^T \mathbf{y}_{n,j}.   (53)

The representation y_{m*} is assigned to the positive class when ‖y_{m*} − a‖_2^2 ≤ R^2 and to the negative class if ‖y_{m*} − a‖_2^2 > R^2, where R^2 is the distance from the center a to any support vector on the boundary,

R^2 = \mathbf{v}^T \mathbf{v} - 2 \sum_{m=1}^{M} \sum_{i=1}^{N} \alpha_{m,i} \mathbf{y}_{m,i}^T \mathbf{v} + \sum_{m=1}^{M} \sum_{i=1}^{N} \sum_{n=1}^{M} \sum_{j=1}^{N} \alpha_{m,i} \alpha_{n,j} \mathbf{y}_{m,i}^T \mathbf{y}_{n,j},   (54)

where v is any support vector in the training set with the corresponding α having a value 0 < α < C. Since the items are represented by M different modalities, the final decision for assigning the item to a particular class (either positive or negative) can be taken using different strategies, explained in Section 4.3.
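For the linear case, the decision rule (53)–(54) reduces to a few inner products once the training data have been projected and pooled; a sketch with our variable layout:

```python
import numpy as np

def is_target(y_star, Y, alpha, C, tol=1e-8):
    """Decide the class of a projected test instance y_star (d,) via
    Eqs. (53)-(54); Y is (d, M*N) projected training data, alpha (M*N,)."""
    cross = alpha @ (Y.T @ Y) @ alpha                 # double-sum term of (53)/(54)
    dist2 = y_star @ y_star - 2.0 * alpha @ (Y.T @ y_star) + cross   # Eq. (53)
    sv = np.flatnonzero((alpha > tol) & (alpha < C - tol))[0]        # a support vector
    v = Y[:, sv]
    R2 = v @ v - 2.0 * alpha @ (Y.T @ v) + cross                     # Eq. (54)
    return dist2 <= R2                                # positive class if inside
```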
3.4. Complexity analysis

The linear version of the proposed method has the following main steps: 1) initializing the projection matrices via PCA, 2) mapping data from all modalities to a lower d-dimensional shared space, 3) SVDD for obtaining the α values and the final data description for all data points coming from M different modalities, 4) computing the gradient ΔL for each modality, 5) updating the projection matrices, and 6) QR decomposition for orthogonalizing and normalizing the projection matrices. We analyze each of these steps and then compute the overall complexity of the algorithm:
1. PCA of a matrix is computed by the eigenvalue decomposition of its covariance matrix, so it involves two steps, i.e., computing the covariance matrix and then the eigenvalue decomposition of the obtained covariance matrix. The complexities of calculating the covariance matrix and the corresponding eigenvalue decomposition for a single modality are O(N D_m min(N, D_m)) and O(D_m^3), respectively [39]. The complexity of computing PCA for all modalities is O((min(N^2 D_1, D_1^2 N) + D_1^3) + (min(N^2 D_2, D_2^2 N) + D_2^3) + ... + (min(N^2 D_M, D_M^2 N) + D_M^3)). We denote the sum of the dimensions of all modalities as ΣD = D_1 + D_2 + ... + D_M, the sum of squared dimensions as ΣD^2 = D_1^2 + D_2^2 + ... + D_M^2 (note that ΣD^2 ≠ (ΣD)^2), and the sum of cubed dimensions as ΣD^3 = D_1^3 + D_2^3 + ... + D_M^3. Hence, the complexity of initializing the projection matrices via PCA becomes O(min(N^2 ΣD, ΣD^2 N) + ΣD^3).
2. The complexity of mapping data from the original D_m-dimensional space to the lower d-dimensional space is the complexity of multiplying a d × D_m matrix with a D_m × N matrix, which is O(d D_m N). Repeating this for all modalities, we get O(d ΣD N).
3. The complexity of SVDD for N data points is O(N^3) [40]. For all data points coming from M different modalities, it becomes O(M^3 N^3).
4. The gradient ΔL used to update Q_m is computed using (21), where the second term has the highest complexity (equally high as regularization terms 4–6). Its complexity is O(2 d N^2 D_m ΣD). As this step is repeated for all modalities, the total complexity becomes O(2 d N^2 (ΣD)^2).
5. Updating the projection matrices has O(d ΣD) complexity.
6. The complexity of QR decomposition for a single modality is O(d D_m^2) [41]. Thus, the overall complexity of the QR decompositions for all the modalities is O(d ΣD^2).
Dropping the relatively less intensive computational steps and adding the rest, the full complexity of the proposed method reduces to O(min(N^2 ΣD, ΣD^2 N) + ΣD^3 + M^3 N^3). Assuming that the total number of samples M·N is always greater than ΣD and M ≪ N, the time complexity of (a single iteration of) our proposed algorithm in terms of big O notation is O(N^3). In the testing phase, each representation of a test sample in each modality is projected to the d-dimensional subspace and then its distance is compared to R. This has the total complexity of O(d ΣD + M d).
For the non-linear version with NPT, the kernel matrix K_m is first formed, which has the complexity of O(D_m N^2). Then the kernel matrix is centralized and decomposed by using eigendecomposition. Both of these steps have the complexity of O(N^3). As the data dimensionality in the remaining steps of the proposed method changes from D_m to N, the total complexity of the remaining steps becomes O(M N^3 + M^3 N^3). Thus, the overall complexity in terms of big O notation remains at O(N^3) for M ≪ N, while in practice the computational complexity is higher (by a scalar multiplier c) than for the linear version. Also for the non-linear version with the standard kernel trick, the overall complexity remains the same, but the kernel mapping is repeated at every iteration and, thus, the scalar c becomes larger for the overall training process. The testing complexity of the non-linear methods increases to O(N ΣD + d M N + M d).
4. Experiments

4.1. Datasets and preprocessing

To evaluate the proposed method, we performed different sets of experiments over 5 datasets. The Robot Execution Failures dataset, the Single Proton Emission Computed Tomography (SPECTF) heart dataset, and the Ionosphere dataset were downloaded from the UC Irvine (UCI) machine learning repository [34]. The Caltech-7 dataset and the Handwritten dataset were downloaded from a repository for multi-view learning [42]. The details of the datasets and experiments are as follows.
The first set of experiments was performed on the Robot Execution Failures dataset [43]. In the Robot Execution Failures dataset, force and torque measurements are collected at regular intervals of time after a task failure is detected. The dataset is divided into five different learning problems (LP) corresponding to different triggering events:
• LP1: Failures in approach to grasp position
• LP2: Failures in the transfer of a part
• LP3: Position of the part after a transfer failure
• LP4: Failures in approach to ungrasp position
• LP5: Failures in motion with part
The total number of instances and the distribution of the classes are given in Table 1. All instances are given as 15 samples collected at 315 ms regular time intervals for each sensor. For this dataset, we consider all the instances belonging to the normal class as the target class and the remaining classes as the non-target data. Hence, we have two modalities (torque and force measurements), and we consider the dataset as a one-class classification problem.
The second set of experiments was performed on the SPECTF heart dataset [44]. The SPECTF heart dataset consists of two sets of features corresponding to rest and stress condition SPECTF images of different subjects. The training set consists of 40 examples diagnosed as healthy heart muscle perfusions and 40 diagnosed as pathological perfusions. The test set consists of 15 instances of healthy heart muscle perfusions and 172 instances diagnosed as pathological perfusions. We convert this to a multimodal one-class classification problem by considering the rest and stress conditions as different modalities and by selecting the healthy heart muscle perfusions as our target class.
The third set of experiments was performed on the Caltech-7 dataset. We used Gabor features and wavelet moments as our two different modalities. The dataset contains 1474 samples in total from 7 different classes. We selected faces (435 samples) as our target class and the rest of the classes all together (1039 samples) as the outlier class.
We used the Ionosphere dataset for the fourth set of experiments. The categories in this dataset are described by two attributes per pulse number, resulting from the complex electromagnetic signal processed by an autocorrelation function. We used the two attributes (real and complex) for each pulse as two different modalities and the attribute "good" as our target class. The total number of samples in this dataset is 351, out of which 225 are from the target class (good), and the remaining 126 samples are from the outlier class (bad).
For the fifth set of experiments, we used the Handwritten dataset. We considered the samples of numeral 0 as the target. In the Handwritten dataset, the total number of samples is 2000, out of which 200 are from the target class. The remaining 1800 samples are considered as the outlier class. We used the Zernike moment (ZER) and morphological (MOR) features as our two different modalities.