
Multimodal subspace support vector data description

Fahad Sohrab (a,*), Jenni Raitoharju (a,b), Alexandros Iosifidis (c), Moncef Gabbouj (a)

(a) Faculty of Information Technology and Communication Sciences, Tampere University, FI-33720 Tampere, Finland
(b) Programme for Environmental Information, Finnish Environment Institute, FI-40500 Jyväskylä, Finland
(c) Department of Engineering, Electrical and Computer Engineering, Aarhus University, DK-8200 Aarhus, Denmark

Article info

Article history: Received 21 August 2019; Revised 13 July 2020; Accepted 6 September 2020; Available online 10 September 2020.

Keywords: Feature transformation; Multimodal data; One-class classification; Support vector data description; Subspace learning

Abstract

In this paper, we propose a novel method for projecting data from multiple modalities to a new subspace optimized for one-class classification. The proposed method iteratively transforms the data from the original feature space of each modality to a new common feature space along with finding a joint compact description of data coming from all the modalities. For data in each modality, we define a separate transformation to map the data from the corresponding feature space to the new optimized subspace by exploiting the available information from the class of interest only. We also propose different regularization strategies for the proposed method and provide both linear and non-linear formulations. The proposed Multimodal Subspace Support Vector Data Description outperforms all the competing methods using data from a single modality or fusing data from all modalities in four out of five datasets.

© 2020 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). https://doi.org/10.1016/j.patcog.2020.107648

1. Introduction

In our surroundings, on a daily basis, we are exposed to information from many different sources. Different sensors are used to gather information about similar objects. Our brains usually perform well in combining the information from different sources to make a concise analysis of a particular entity. In order to analyze an entity, even a single source of information might be enough, but to make critical decisions it is important to combine information from different sources in a systematic way. For example, if a person is walking in a crowd, the main information for not hitting anything comes from visual cues, but people can also warn each other by voice or even by touch, and this extra information helps in understanding the environment in a better way. Smell can help to avoid unpleasant spots, too. As another example, while watching a movie, the visual information of the scenes alone may not be enough to understand the whole scenario, but the audio and/or captions combined with the visual information provide the full picture.

(*) Corresponding author.
E-mail addresses: fahad.sohrab@tuni.fi (F. Sohrab), jenni.raitoharju@tuni.fi (J. Raitoharju), alexandros.iosifidis@eng.au.dk (A. Iosifidis), moncef.gabbouj@tuni.fi (M. Gabbouj).

In machine learning techniques for predictive data modeling, training data are used to form a model that can accurately classify future instances into a predefined number of classes. In many cases, the data come from sensors and can be further processed to extract different features. The term multimodal is used to describe data coming from different sensors (also referred to as modes or modalities); however, it is also used as a synonym for multi-view when different features are extracted from the same sensor or when there are multiple similar sensors, e.g., cameras. The aim of multimodal machine learning algorithms is to build models that can process and relate information from more than one modality (or view).

Examples of multimodal representations are prevalent in different application areas. In [1], an active multimodal sensor system for target recognition and tracking is studied, where information from three different sensors (visual, infrared, and hyperspectral) is used. In [2], a framework for vehicle tracking with multimodal data (velocity and images) is proposed, where the outcome of the velocity modality, estimated by using a Kalman filter on the data obtained from motion sensors, is fused with features learned from the image modality by the color-faster R-CNN method. In [3], a multimodal data collection framework for mental stress monitoring is studied. In the proposed framework, physiological and motion sensor data of people under stress are collected.


The data in multimodal applications come from different modalities, where each modality has its own statistical properties and contains specific information. The different modalities usually share high-level concepts and semantic information, and all together contain more information than any single-modal data [4]. If we build a model separately for each modality, the relationship between the modalities cannot be exploited efficiently. In multimodal subspace learning, the goal is to infer a shared latent representation that can accurately model data from each original modality and exploit the relationship between the modalities.

In traditional multiclass machine learning, an adequate amount of data is available for all the categories during training and, hence, the algorithm takes advantage of all available training data from all classes to train a model [5]. However, it is possible that, during training, the data are highly imbalanced or the only data available are from a single class. In such cases, one-class classification techniques are used. They are useful in many different scenarios, such as outlier detection, predicting specific events, or, in general, predicting a specific target class. While much effort has been put into solving one-class classification tasks for data of a single modality [6], much less effort has been put into solving one-class multimodal challenges in general, and we are not aware of any prior work in the field of multimodal learning for one-class classification. In one-class multimodal tasks, it is assumed that the only data available are from a single class in many different modalities.

In this paper, we propose a novel method for solving multimodal one-class classification tasks. The proposed method, Multimodal Subspace Support Vector Data Description (MS-SVDD), finds a transformation for each modality along with defining a common model for all modalities in a lower-dimensional subspace optimized for one-class classification. The rest of the paper is organized as follows. In Section 2, an overview of related work is presented. In Section 3, the newly proposed MS-SVDD is derived and discussed. In Section 4, we present the experimental setup and results, and finally, in Section 5, conclusions are drawn.

2. Background and related work

In this section, we briefly discuss the principles of multimodal learning along with subspace learning. We also provide an overview of traditional methods used for multiclass multimodal data description and one-class unimodal data description.

2.1. Multimodal learning

The availability of many different modalities can be a blessing if it increases the performance of the machine learning model. However, if the data description algorithm fails to make a strong connection between the different available modalities, the performance can be degraded. To ensure better performance of a model combining data from different modalities, two main principles should be observed, i.e., the consensus and complementary principles [7]:

The consensus principle aims at minimizing the disagreement between data available from different modes. Maximizing the agreement reduces the error rate, and better modeling of the data is achieved when combining data from different modalities.

The complementary principle, in the context of multimodal learning, means that data from each modality may contain some knowledge not contained by the other ones. It is therefore necessary to exploit information from all the available modes to make an accurate description of the data.

Multimodal machine learning techniques can be described by three main properties: two-view vs. multi-view, linear vs. non-linear, and unsupervised vs. supervised [8]. As the name indicates, in two-view learning, the number of views is limited to two. In multi-view learning, the number of views is not limited. The difference between supervised and unsupervised learning is that, in supervised learning, the information on the output labels of the training data is taken into account when training the model, while in unsupervised methods, the labels are not used to model the underlying structure or distribution of the data [9]. Linear techniques for multimodal subspace learning may be too simple to provide a representative model. Hence, kernel methods are proposed to capture non-linear patterns in data.

Multimodal learning techniques have mainly been applied in four application domains [10]: audio-visual speech recognition [11], multimedia content indexing and retrieval [12], understanding human multimodal behaviors [13], and language and vision media description [14]. Recently, there has been a rising trend in applying multimodal machine learning algorithms to other applications as well. For example, in [15], a multimodal data fusion technique is used for the prediction of soybean yield from an unmanned aerial vehicle.

In multimodal learning, the main goal is to develop a process for fusing information from various modalities. In [16], fusion strategies are divided into two categories: model-agnostic and model-based approaches. In model-agnostic approaches, the fusion is either early, late, or hybrid. In early fusion, the data or extracted features are fused together at the very initial phase of modeling: a new feature vector is usually formed by concatenating all the available data from the different modes, and the model is trained with the new feature vector (see the sketch below). In late fusion, multiple models are trained, and the fusion is done on the scores generated by each model for the corresponding modality. The score generated by each model can be a threshold or some probability used in decision making. Hybrid fusion exploits the advantages of both early and late fusion. Model-based approaches explicitly fuse the data during model construction, for example via kernel-based approaches, graphical models, and neural networks. In this work, we present a model-based approach for data fusion.
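For illustration, the following minimal sketch shows early fusion by feature concatenation; the modality names and dimensions are hypothetical, not taken from the paper.

```python
import numpy as np

# Toy features from two hypothetical modalities describing the same N items.
N = 100
X_audio = np.random.randn(N, 20)   # modality 1: 20-dimensional features
X_video = np.random.randn(N, 50)   # modality 2: 50-dimensional features

# Early fusion: concatenate the per-item feature vectors into a single
# representation; one model is then trained on the fused vectors.
X_fused = np.concatenate([X_audio, X_video], axis=1)   # shape (N, 70)
```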

2.2. Subspace learning

In the current era of data science, where high-dimensional multimodal big data are generated every minute in different industries, there is a need to gain essential insights and mine knowledge from this high-dimensional data. Subspace learning aims at representing data in a lower-dimensional space while keeping intact the information available in the original higher-dimensional space.

Algorithms developed for linear subspace learning find a projection matrix for labeled training data (represented by vectors) satisfying some optimality criteria. Principal Component Analysis (PCA) is one of the first subspace learning methods mentioned in the literature. In PCA, a subspace is learned by orthogonally projecting the data onto a subspace so that the variance of the data is maximized. PCA works only with a single mode of data, i.e., all data should have the same dimensionality. Another traditional subspace learning method is Linear Discriminant Analysis (LDA), which finds a linear transformation by exploiting the class information.

Analogous to PCA, but used for two-view learning, is canonical correlation analysis (CCA) [17]. CCA is a classic and conventional method for subspace learning, which aims at relating two sets of data by finding the pairs of directions that provide the maximum correlation between the two sets. It has recently become one of the popular methods for unsupervised subspace learning because of its generalization capability and has been used extensively for multimodal data fusion and cross-media retrieval [18]. In subspace learning, state-of-the-art results are achieved by methods which have embraced some stimulus from conventional subspace learning methods [19].


As an extension of methods for linear transformation, kernel methods have been introduced to describe non-linear functions or decision boundaries. In kernel methods, the data are mapped to a typically higher-dimensional kernel space using a kernel function, where they exhibit linear patterns [20,21]. For example, in [22], kernel PCA, performing a non-linear form of PCA, is proposed.

2.3. One-class classification

In one-class classification, the parameters of the model are estimated using data from the positive class only, because data from the other classes are either not available at all or too diverse in nature to be modeled statistically [23]. The positive class is also called the target class, and the data from the other classes, which are not available during training, are called the negative, or outlier, class. For example, a unimodal biometric system uses a single biometric trait for verification or identification [24].

Support Vector Data Description (SVDD) [25] is among the most widely used one-class classification methods for anomaly detection and other related applications. SVDD obtains a spherical boundary around the target data, which can be made flexible by using the kernel trick. The obtained boundary is used to detect outliers at test time, i.e., anything inside the closed boundary is classified as the target class and anything outside as an outlier. The Lagrangian of SVDD is given as follows:

$$L = \sum_{i=1}^{N} \alpha_i\, \mathbf{x}_i^T \mathbf{x}_i - \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i\, \mathbf{x}_i^T \mathbf{x}_j\, \alpha_j, \quad (1)$$

where x_i is an input target training instance, and maximizing (1) gives a set of α_i corresponding to each instance. The instances with α_i > 0 define the data description. Another common one-class classification method is the One-Class Support Vector Machine (OC-SVM) [26].
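Eq. (1) is a quadratic program in the α_i. As an illustration, the sketch below solves it numerically with a generic constrained optimizer; a dedicated QP solver would normally be used, and the function and variable names are ours, not the paper's.

```python
import numpy as np
from scipy.optimize import minimize

def svdd_dual(X, C=0.1):
    """Solve the SVDD dual of Eq. (1): maximize sum_i a_i x_i^T x_i
    - sum_{i,j} a_i a_j x_i^T x_j, subject to sum_i a_i = 1 and
    0 <= a_i <= C. X holds the N target training samples as rows."""
    K = X @ X.T                          # linear Gram matrix
    N = X.shape[0]
    diag = np.diag(K)

    def neg_dual(a):                     # negated objective (scipy minimizes)
        return -(a @ diag - a @ K @ a)

    cons = {"type": "eq", "fun": lambda a: np.sum(a) - 1.0}
    res = minimize(neg_dual, np.full(N, 1.0 / N), method="SLSQP",
                   bounds=[(0.0, C)] * N, constraints=cons)
    alpha = res.x
    center = alpha @ X                   # hypersphere center a = sum_i a_i x_i
    return alpha, center
```

Note that the standard SVDD dual constraints (the sum of the α_i equal to one and the box bounds 0 ≤ α_i ≤ C) require C ≥ 1/N for the problem to be feasible.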

Techniques for enhancing the performance of one-class classification methods, mainly extensions of SVDD, can be categorized into four main categories: methods based on data structure, kernel issues, boundary shape, and non-stationary data [27]. As the name indicates, in the data structure category, the main focus is on the structure of the data. For example, in [28], a confidence coefficient is associated with each training sample to deal with the uncertainty of the data. In kernel issue extensions, the main focus is on reducing the complexity or proposing new kernels for one-class classification. For example, in [29], a new kernel is proposed to improve the accuracy of SVDD for time series classification. Proposing changes in the boundary enclosing the target data comes under the third category for improving one-class classification accuracy. For example, in [30], an ellipse shape is used for encapsulating the target data instead of the traditional sphere used in SVDD. In [31], it is shown that both SVDD and OC-SVM lead to the same solution when exploiting the elliptical shape of the class. The last category of algorithms for improving one-class classifier performance attempts to handle non-stationary data. For example, in [32], Incremental SVDD (I-SVDD) is proposed to handle non-stationary or increasing data. Recently, in [33], an algorithm developed for reducing the effect of uncertain data around the hypersphere of SVDD achieved state-of-the-art results on many UCI [34] datasets. In this paper, we consider baseline SVDD combined with multimodal subspace learning; however, in the future, the method can be further extended using similar ideas.

In the area of multimodal one-class classification, researchers have mainly focused on fusing the output labels of multiple models trained for each type of feature independently, i.e., without taking into account information from the other feature types for one model [35].

3. Multimodal subspace support vector data description

MS-SVDD maps data from high-dimensional feature spaces to a low-dimensional feature space optimized for one-class classification. The optimized subspace is shared by the data coming from all modalities. MS-SVDD is an extension of Subspace Support Vector Data Description (S-SVDD), which was proposed for unimodal data in [36]. The main novelty of MS-SVDD is using the multimodal approach for one-class classification. Here, we first derive the linear MS-SVDD. We then derive two non-linear versions using the kernel trick [20] and the Nonlinear Projection Trick (NPT) [37], respectively.

3.1. Linear MS-SVDD

Let us assume that the items to be modelled are represented by M different modalities. The instances in each modality m, m = 1,...,M, are represented by X_m = [x_{m,1}, x_{m,2}, ..., x_{m,N}], x_{m,i} ∈ R^{D_m}, where N is the total number of instances and D_m is the dimensionality of the feature space in modality m. MS-SVDD tries to find a projection matrix Q_m ∈ R^{d×D_m} for each modality, which projects the corresponding instances to a lower d-dimensional optimized subspace shared by all modalities. Thus, a feature vector x_{m,i} is projected to a d-dimensional vector y_{m,i} as

$$\mathbf{y}_{m,i} = \mathbf{Q}_m \mathbf{x}_{m,i}, \quad \forall m \in \{1,\dots,M\}, \; \forall i \in \{1,\dots,N\}. \quad (2)$$

To obtain a common description of all the data transformed from their corresponding modalities to the new common subspace, we exploit Support Vector Data Description (SVDD) [25] to form a closed boundary around the target class data in the new subspace.

The center and radius of the hypersphere are denoted by a ∈ R^d and R, respectively. Fig. 1 depicts the basic idea of the proposed method.

In order to find a compact hypersphere which encloses all the target data from all the modalities in the new subspace, we minimize

$$F(R,\mathbf{a}) = R^2 \quad \text{s.t.} \quad \|\mathbf{Q}_m \mathbf{x}_{m,i} - \mathbf{a}\|_2^2 \le R^2, \;\; \forall m \in \{1,\dots,M\}, \; \forall i \in \{1,\dots,N\}. \quad (3)$$

By introducing slack variables ξ_{m,i}, such that most of the training data from all the modalities in the new common space should lie inside the hypersphere, the above criterion becomes

$$F(R,\mathbf{a}) = R^2 + C \sum_{m=1}^{M}\sum_{i=1}^{N} \xi_{m,i} \quad \text{s.t.} \quad \|\mathbf{Q}_m \mathbf{x}_{m,i} - \mathbf{a}\|_2^2 \le R^2 + \xi_{m,i}, \;\; \xi_{m,i} \ge 0, \;\; \forall m \in \{1,\dots,M\}, \; \forall i \in \{1,\dots,N\}. \quad (4)$$

The Lagrange function corresponding to (4) can be given as

$$L = R^2 + C \sum_{m=1}^{M}\sum_{i=1}^{N} \xi_{m,i} - \sum_{m=1}^{M}\sum_{i=1}^{N} \gamma_{m,i}\,\xi_{m,i} - \sum_{m=1}^{M}\sum_{i=1}^{N} \alpha_{m,i}\left(R^2 + \xi_{m,i} - \mathbf{x}_{m,i}^T \mathbf{Q}_m^T \mathbf{Q}_m \mathbf{x}_{m,i} + 2\mathbf{a}^T \mathbf{Q}_m \mathbf{x}_{m,i} - \mathbf{a}^T\mathbf{a}\right). \quad (5)$$

The Lagrangian function should be maximized with respect to α_{m,i} ≥ 0 and γ_{m,i} ≥ 0 and minimized with respect to R, a, ξ_{m,i}, and Q_m. By setting the partial derivatives to zero, we get

$$\frac{\partial L}{\partial R} = 0 \;\Rightarrow\; \sum_{m=1}^{M}\sum_{i=1}^{N} \alpha_{m,i} = 1 \quad (6)$$


Fig. 1. Depiction of proposed MS-SVDD: Data from two modalities in their corresponding feature space are mapped to a common subspace, where positive class instances are enclosed inside a (hyper)sphere.

$$\frac{\partial L}{\partial \mathbf{a}} = 0 \;\Rightarrow\; \mathbf{a} = \sum_{m=1}^{M}\sum_{i=1}^{N} \alpha_{m,i}\,\mathbf{Q}_m \mathbf{x}_{m,i} \quad (7)$$

$$\frac{\partial L}{\partial \xi_{m,i}} = 0 \;\Rightarrow\; C - \alpha_{m,i} - \gamma_{m,i} = 0 \quad (8)$$

$$\frac{\partial L}{\partial \mathbf{Q}_m} = 0 \;\Rightarrow\; \mathbf{Q}_m = \left(\mathbf{a}\sum_{i=1}^{N}\alpha_{m,i}\,\mathbf{x}_{m,i}^T\right)\left(\sum_{i=1}^{N}\alpha_{m,i}\,\mathbf{x}_{m,i}\mathbf{x}_{m,i}^T\right)^{-1} \quad (9)$$

It is clear from (6)–(9) that the parameters α and Q_m are interrelated and cannot be jointly optimized. Hence, we apply a two-step iterative optimization process where, in each step, we fix one parameter and optimize the other. Substituting (2), (6), (7) and (8) in the Lagrangian function (5), we get

$$L = \sum_{m=1}^{M}\sum_{i=1}^{N} \alpha_{m,i}\,\mathbf{y}_{m,i}^T\mathbf{y}_{m,i} - \sum_{m=1}^{M}\sum_{i=1}^{N}\sum_{n=1}^{M}\sum_{j=1}^{N} \alpha_{m,i}\,\mathbf{y}_{m,i}^T\mathbf{y}_{n,j}\,\alpha_{n,j}. \quad (10)$$

We see that optimizing (10) for α corresponds to the traditional SVDD applied in the subspace. Maximizing (10) for a particular set of data will give us an α_{m,i} corresponding to each sample. The value of α_{m,i} for the corresponding sample defines its position with respect to the hypersphere:

- Samples with 0 < α_{m,i} < C define the data description and lie on the boundary of the hypersphere; they are referred to as support vectors.
- Samples with α_{m,i} = C are outside the boundary.
- Samples with α_{m,i} = 0 lie inside the boundary.

In the second step, we fix α and update Q_m for each modality. For this step, we add a regularization term ω:

$$L = \sum_{m=1}^{M}\sum_{i=1}^{N} \alpha_{m,i}\,\mathbf{x}_{m,i}^T\mathbf{Q}_m^T\mathbf{Q}_m\mathbf{x}_{m,i} - \sum_{m=1}^{M}\sum_{i=1}^{N}\sum_{n=1}^{M}\sum_{j=1}^{N} \alpha_{m,i}\,\mathbf{x}_{m,i}^T\mathbf{Q}_m^T\mathbf{Q}_n\mathbf{x}_{n,j}\,\alpha_{n,j} + \beta\omega. \quad (11)$$

The regularization term ω expresses the covariance of data from different modalities in the new low-dimensional space, and β is a regularization parameter controlling the significance of ω. We propose different settings for ω as

$$\omega_0 = 0, \quad (12)$$

$$\omega_1 = \sum_{m=1}^{M} \operatorname{tr}\left(\mathbf{Q}_m\mathbf{X}_m\mathbf{X}_m^T\mathbf{Q}_m^T\right), \quad (13)$$

$$\omega_2 = \sum_{m=1}^{M} \operatorname{tr}\left(\mathbf{Q}_m\mathbf{X}_m\boldsymbol{\alpha}_m\boldsymbol{\alpha}_m^T\mathbf{X}_m^T\mathbf{Q}_m^T\right), \quad (14)$$

$$\omega_3 = \sum_{m=1}^{M} \operatorname{tr}\left(\mathbf{Q}_m\mathbf{X}_m\boldsymbol{\lambda}_m\boldsymbol{\lambda}_m^T\mathbf{X}_m^T\mathbf{Q}_m^T\right), \quad (15)$$

$$\omega_4 = \sum_{m=1}^{M}\sum_{n=1}^{M} \operatorname{tr}\left(\mathbf{Q}_m\mathbf{X}_m\mathbf{X}_n^T\mathbf{Q}_n^T\right), \quad (16)$$

$$\omega_5 = \sum_{m=1}^{M}\sum_{n=1}^{M} \operatorname{tr}\left(\mathbf{Q}_m\mathbf{X}_m\boldsymbol{\alpha}_m\boldsymbol{\alpha}_n^T\mathbf{X}_n^T\mathbf{Q}_n^T\right), \quad (17)$$

$$\omega_6 = \sum_{m=1}^{M}\sum_{n=1}^{M} \operatorname{tr}\left(\mathbf{Q}_m\mathbf{X}_m\boldsymbol{\lambda}_m\boldsymbol{\lambda}_n^T\mathbf{X}_n^T\mathbf{Q}_n^T\right), \quad (18)$$

where α_m ∈ R^N in (14) and (17) is a vector having the elements α_{m,1},...,α_{m,N}. Thus, α_m has non-zero values for support vectors and outliers. λ_m ∈ R^N in (15) and (18) is a vector having the elements of α_m that are smaller than C; the values of α_m corresponding to the outliers (i.e., α_{m,i} = C) are replaced with zeros in λ_m. Thus, λ_m has non-zero values only for the support vectors. For ω_0, the regularization term becomes obsolete and is not used in the optimization process. In ω_1, the regularization term only uses representations coming from the respective modality, and no representations from the other modalities are used to describe the variance of the positive class. In ω_2, all support vectors, i.e., representations at the hypersphere boundary, and outliers are used to describe the class variance for the update of the corresponding Q_m. In ω_3, only the support vectors of the respective modality are used to describe the variance of the class to be modelled. In ω_4, data from all the modalities are used to describe the covariance and regularize the update of Q_m. In ω_5, the instances belonging to the hypersphere boundary and the outliers from all modalities are used to describe the covariance. In ω_6, only the support vectors belonging to the class boundary from all modalities are used to update Q_m and describe the covariance of the positive class.
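As a concrete reference, the sketch below evaluates the regularizers of Eqs. (13)–(18) with NumPy; the list-of-arrays data layout is our own assumption, not prescribed by the paper.

```python
import numpy as np

def omega_terms(Q, X, alpha, C):
    """Evaluate omega_1 ... omega_6 of Eqs. (13)-(18).
    Q[m]: d x D_m projection, X[m]: D_m x N data, alpha[m]: length-N duals."""
    M = len(Q)
    # lambda_m: alpha_m with outliers (alpha = C) zeroed -> support vectors only
    lam = [np.where(np.isclose(a, C), 0.0, a) for a in alpha]
    P = [Q[m] @ X[m] for m in range(M)]          # projected data, d x N each

    w1 = sum(np.trace(P[m] @ P[m].T) for m in range(M))
    w2 = sum(np.trace(P[m] @ np.outer(alpha[m], alpha[m]) @ P[m].T)
             for m in range(M))
    w3 = sum(np.trace(P[m] @ np.outer(lam[m], lam[m]) @ P[m].T)
             for m in range(M))
    w4 = sum(np.trace(P[m] @ P[n].T) for m in range(M) for n in range(M))
    w5 = sum(np.trace(P[m] @ np.outer(alpha[m], alpha[n]) @ P[n].T)
             for m in range(M) for n in range(M))
    w6 = sum(np.trace(P[m] @ np.outer(lam[m], lam[n]) @ P[n].T)
             for m in range(M) for n in range(M))
    return w1, w2, w3, w4, w5, w6
```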

Note that the MS-SVDD formulation reduces to S-SVDD [36] if data from only one modality (M = 1) are taken into account for the data description. In S-SVDD, a single projection matrix Q is determined for mapping the data X from the higher-dimensional space to a lower-dimensional space. A regularization term ψ, which expresses the class variance in the low-dimensional space, is added to the Lagrangian function of S-SVDD:

$$\psi = \operatorname{tr}\left(\mathbf{Q}\mathbf{X}\boldsymbol{\lambda}\boldsymbol{\lambda}^T\mathbf{X}^T\mathbf{Q}^T\right), \quad (19)$$


where λ can take different forms as described in [36]. The regularization terms ω_0, ω_1, ω_2, and ω_3 for MS-SVDD become equivalent to the regularization terms proposed for S-SVDD when M = 1. Hence, MS-SVDD is a more generalized form of S-SVDD, which can form a data description by considering data from multiple modalities.

We update Q_m by using the gradient of L in (11) with respect to Q_m,

$$\mathbf{Q}_m \leftarrow \mathbf{Q}_m - \eta\,\Delta L, \quad (20)$$

where η is the learning rate parameter and the gradient of L is calculated as

$$\Delta L = \frac{\partial L}{\partial \mathbf{Q}_m} = 2\sum_{i=1}^{N} \alpha_{m,i}\,\mathbf{Q}_m\mathbf{x}_{m,i}\mathbf{x}_{m,i}^T - 2\sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{n=1}^{M} \mathbf{Q}_n\mathbf{x}_{n,j}\mathbf{x}_{m,i}^T\,\alpha_{m,i}\alpha_{n,j} + \beta\,\Delta\omega, \quad (21)$$

where Δω is the derivative of the regularization term with respect to Q_m:

$$\Delta\omega_0 = 0, \quad (22)$$

$$\Delta\omega_1 = 2\mathbf{Q}_m\mathbf{X}_m\mathbf{X}_m^T, \quad (23)$$

$$\Delta\omega_2 = 2\mathbf{Q}_m\mathbf{X}_m\boldsymbol{\alpha}_m\boldsymbol{\alpha}_m^T\mathbf{X}_m^T, \quad (24)$$

$$\Delta\omega_3 = 2\mathbf{Q}_m\mathbf{X}_m\boldsymbol{\lambda}_m\boldsymbol{\lambda}_m^T\mathbf{X}_m^T, \quad (25)$$

$$\Delta\omega_4 = 2\sum_{n=1}^{M}\mathbf{Q}_n\mathbf{X}_n\mathbf{X}_m^T, \quad (26)$$

$$\Delta\omega_5 = 2\sum_{n=1}^{M}\mathbf{Q}_n\mathbf{X}_n\boldsymbol{\alpha}_n\boldsymbol{\alpha}_m^T\mathbf{X}_m^T, \quad (27)$$

$$\Delta\omega_6 = 2\sum_{n=1}^{M}\mathbf{Q}_n\mathbf{X}_n\boldsymbol{\lambda}_n\boldsymbol{\lambda}_m^T\mathbf{X}_m^T. \quad (28)$$

We initialize the Q_m using PCA. At every iteration, the projection matrix is orthogonalized and normalized so that

$$\mathbf{Q}_m\mathbf{Q}_m^T = \mathbf{I}, \quad (29)$$

where I is an identity matrix. We use QR decomposition for orthogonalizing and normalizing the projection matrix Q_m. Algorithm 1 describes the overall MS-SVDD algorithm.
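A minimal sketch of one update step, combining the gradient of Eq. (21) with the QR-based orthogonalization of Eq. (29), could look as follows; the list-based layout and the precomputed regularizer gradient d_omega are our assumptions.

```python
import numpy as np

def update_Q(Q, X, alpha, d_omega, m, eta, beta):
    """One gradient step on Q_m per Eqs. (20)-(21), then re-orthogonalization
    per Eq. (29). Q[m]: d x D_m, X[m]: D_m x N, alpha[m]: length-N duals;
    d_omega is the gradient of the chosen regularizer, Eqs. (22)-(28)."""
    M = len(Q)
    # First term of Eq. (21): 2 * sum_i alpha_{m,i} Q_m x_{m,i} x_{m,i}^T
    grad = 2.0 * Q[m] @ (X[m] * alpha[m]) @ X[m].T
    # Second term: -2 * sum_{i,j,n} Q_n x_{n,j} x_{m,i}^T alpha_{m,i} alpha_{n,j}
    for n in range(M):
        grad -= 2.0 * np.outer(Q[n] @ X[n] @ alpha[n], X[m] @ alpha[m])
    grad += beta * d_omega
    Q_new = Q[m] - eta * grad                    # Eq. (20)
    # Eq. (29): QR on the transpose yields orthonormal rows, Q_m Q_m^T = I.
    q, _ = np.linalg.qr(Q_new.T)
    return q.T
```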

3.2. Non-linear MS-SVDD

For the non-linear mapping from the original feature spaces to a new shared feature space, we use two approaches. The first approach is based on the standard kernel trick [20] and the second on the Nonlinear Projection Trick (NPT) [37], which is used as a computationally lighter alternative to the kernel trick.

3.2.1. Non-linear MS-SVDD with the standard kernel trick

In the non-linear data description, the original data are mapped to a kernel space F using a non-linear function φ(·) such that x_{m,i} ∈ R^{D_m} → φ(x_{m,i}) ∈ F. The kernel space dimensionality can possibly be infinite. The data are then projected from the kernel space to R^d as

$$\mathbf{y}_{m,i} = \mathbf{Q}_m\,\phi(\mathbf{x}_{m,i}), \quad \forall i \in \{1,\dots,N\}. \quad (30)$$

Algorithm 1: MS-SVDD optimization.

Inputs:
  Z_m for each m = 1,...,M  // Input data from all modalities
  β  // Regularization parameter controlling the significance of ω
  η  // Learning rate parameter
  d  // Dimensionality of the joint subspace
  C  // Regularization parameter in SVDD
  M  // Total number of modalities
Outputs:
  S_m for each m = 1,...,M  // Projection matrices for the different modalities
  R  // Radius of the hypersphere
  α  // Defines the data description

Z_m = X_m for the linear and NPT cases (K_m for the kernel case)
S_m = Q_m for the linear and NPT cases (W_m for the kernel case)

for m = 1:M do
  Initialize S_m via linear PCA (kernel PCA);
end
for iter = 1:max_iter do
  For each m, map Z_m to Y_m using Eq. (2) (Eq. (31));
  Form Y by combining all Y_m's;
  Solve SVDD in the subspace to obtain α in Eq. (10);
  for m = 1:M do
    Calculate ΔL using Eq. (21) (Eq. (35));
    Update S_m ← S_m − η ΔL;
    Orthogonalize and normalize S_m using QR decomposition (eigendecomposition);
  end
end
For each m, compute Y_m using Eq. (2) (Eq. (31));
Form Y by combining all Y_m's;
Solve SVDD to obtain the final data description;
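Pulling the pieces together, a sketch of the linear variant of Algorithm 1 could look as follows; it reuses the hypothetical svdd_dual and update_Q helpers sketched earlier and, for simplicity, uses the ω_0 regularizer (zero gradient). It assumes d ≤ min(N, D_m) for the PCA initialization.

```python
import numpy as np

def ms_svdd_train(X, d=2, C=0.1, eta=0.01, beta=0.0, max_iter=50):
    """Linear MS-SVDD training loop (sketch). X[m] is a D_m x N data matrix;
    returns the learned projection matrices Q[m] (d x D_m)."""
    M, N = len(X), X[0].shape[1]
    Q = []
    for m in range(M):                       # PCA initialization per modality
        Xc = X[m] - X[m].mean(axis=1, keepdims=True)
        _, _, Vt = np.linalg.svd(Xc.T, full_matrices=False)
        Q.append(Vt[:d, :])                  # top-d principal directions
    for _ in range(max_iter):
        # Map every modality to the shared subspace and pool the projections.
        Y = np.hstack([Q[m] @ X[m] for m in range(M)])    # d x (M*N)
        a, _ = svdd_dual(Y.T, C=C)                        # SVDD in the subspace
        alpha = [a[m * N:(m + 1) * N] for m in range(M)]  # split per modality
        for m in range(M):
            Q[m] = update_Q(Q, X, alpha, np.zeros_like(Q[m]), m, eta, beta)
    return Q
```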

In order to calculate y_{m,i}, we use the so-called kernel trick by expressing the projection matrix Q_m as a linear combination of the training data representations of the respective modality in the kernel space F, leading to

$$\mathbf{y}_{m,i} = \mathbf{W}_m\boldsymbol{\Phi}_m^T\,\phi(\mathbf{x}_{m,i}) = \mathbf{W}_m\mathbf{k}_{m,i}, \quad \forall i \in \{1,\dots,N\}, \quad (31)$$

where Φ_m ∈ R^{|F|×N} is a matrix formed in F containing the training data representations of modality m, W_m ∈ R^{d×N} is a matrix containing the weights for Φ_m needed to form Q_m, and k_{m,i} is the ith column of the Gramian matrix, also called the kernel matrix, K_m ∈ R^{N×N}, having elements equal to K_{m,ij} = φ(x_{m,i})^T φ(x_{m,j}). In our experiments, we use the Radial Basis Function (RBF) kernel, given by

$$K_{m,ij} = \exp\left(-\frac{\|\mathbf{x}_{m,i} - \mathbf{x}_{m,j}\|_2^2}{2\sigma^2}\right), \quad (32)$$

where σ > 0 is a hyperparameter that determines the width of the kernel.
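The RBF Gram matrix of Eq. (32) can be computed as in the short sketch below (our own helper, with samples stored as columns to match the paper's notation):

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    """Gram matrix of Eq. (32): K_ij = exp(-||x_i - z_j||^2 / (2 sigma^2)).
    X (D x N1) and Z (D x N2) hold samples as columns."""
    # Squared distances via ||x - z||^2 = ||x||^2 + ||z||^2 - 2 x^T z.
    sq = (X ** 2).sum(0)[:, None] + (Z ** 2).sum(0)[None, :] - 2.0 * X.T @ Z
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))
```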

The augmented version of the Lagrangian function now takes the following form:

$$L = \sum_{m=1}^{M}\sum_{i=1}^{N} \alpha_{m,i}\,\mathbf{k}_{m,i}^T\mathbf{W}_m^T\mathbf{W}_m\mathbf{k}_{m,i} - \sum_{m=1}^{M}\sum_{i=1}^{N}\sum_{n=1}^{M}\sum_{j=1}^{N} \alpha_{m,i}\,\mathbf{k}_{m,i}^T\mathbf{W}_m^T\mathbf{W}_n\mathbf{k}_{n,j}\,\alpha_{n,j} + \beta\omega. \quad (33)$$


The α's are calculated by optimizing (10) with the W_m's fixed, i.e., by applying SVDD in the subspace. In the second step, the α's are fixed and the W_m's are updated with gradient descent:

$$\mathbf{W}_m \leftarrow \mathbf{W}_m - \eta\,\Delta L, \quad (34)$$

where the gradient is calculated as

$$\Delta L = \frac{\partial L}{\partial \mathbf{W}_m} = 2\sum_{i=1}^{N} \alpha_{m,i}\,\mathbf{W}_m\mathbf{k}_{m,i}\mathbf{k}_{m,i}^T - 2\sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{n=1}^{M} \mathbf{W}_n\mathbf{k}_{n,j}\mathbf{k}_{m,i}^T\,\alpha_{m,i}\alpha_{n,j} + \beta\,\Delta\omega. \quad (35)$$

The gradient of the regularization term, Δω, now takes the following forms:

$$\Delta\omega_0 = 0, \quad (36)$$

$$\Delta\omega_1 = 2\mathbf{W}_m\mathbf{K}_m\mathbf{K}_m^T, \quad (37)$$

$$\Delta\omega_2 = 2\mathbf{W}_m\mathbf{K}_m\boldsymbol{\alpha}_m\boldsymbol{\alpha}_m^T\mathbf{K}_m^T, \quad (38)$$

$$\Delta\omega_3 = 2\mathbf{W}_m\mathbf{K}_m\boldsymbol{\lambda}_m\boldsymbol{\lambda}_m^T\mathbf{K}_m^T, \quad (39)$$

$$\Delta\omega_4 = 2\sum_{n=1}^{M}\mathbf{W}_n\mathbf{K}_n\mathbf{K}_m^T, \quad (40)$$

$$\Delta\omega_5 = 2\sum_{n=1}^{M}\mathbf{W}_n\mathbf{K}_n\boldsymbol{\alpha}_n\boldsymbol{\alpha}_m^T\mathbf{K}_m^T, \quad (41)$$

$$\Delta\omega_6 = 2\sum_{n=1}^{M}\mathbf{W}_n\mathbf{K}_n\boldsymbol{\lambda}_n\boldsymbol{\lambda}_m^T\mathbf{K}_m^T. \quad (42)$$

We initialize the matrix W_m for each modality using kernel PCA. We orthogonalize and normalize W_m at every iteration so that

$$\mathbf{W}_m\boldsymbol{\Phi}_m^T\boldsymbol{\Phi}_m\mathbf{W}_m^T = \mathbf{I}. \quad (43)$$

We decompose (43) using eigendecomposition as

$$\mathbf{W}_m\boldsymbol{\Phi}_m^T\boldsymbol{\Phi}_m\mathbf{W}_m^T = \mathbf{V}_m\boldsymbol{\Lambda}_m\mathbf{V}_m^T, \quad (44)$$

where Φ_m^T Φ_m is K_m, Λ_m is a diagonal matrix containing the eigenvalues of W_m Φ_m^T Φ_m W_m^T, and V_m contains the corresponding eigenvectors. After further simplification, the normalized projection matrix Ŵ_m can be computed as

$$\hat{\mathbf{W}}_m = \left(\boldsymbol{\Lambda}_m^{\frac{1}{2}}\right)^{+}\mathbf{V}_m^T\mathbf{W}_m, \quad (45)$$

where the + sign denotes the pseudo-inverse. For notational simplicity, we set W_m = Ŵ_m.

3.2.2. Non-linear MS-SVDD with the nonlinear projection trick

The non-linear MS-SVDD using the kernel trick requires computing the eigendecomposition (44) at every iteration. This is computationally expensive and, therefore, we propose an alternative non-linear approach using NPT [37]. Here, a non-linear mapping is applied only at the beginning of the process, while the optimization follows the linear MS-SVDD. In the NPT-based MS-SVDD, we first compute the kernel matrix K_m using (32). In the next step, the computed kernel matrix is centralized as

$$\hat{\mathbf{K}}_m = (\mathbf{I} - \mathbf{E}_N)\,\mathbf{K}_m\,(\mathbf{I} - \mathbf{E}_N), \quad (46)$$

where K̂_m is the centralized kernel matrix and E_N is an N × N matrix defined as

$$\mathbf{E}_N = \frac{1}{N}\mathbf{1}_N\mathbf{1}_N^T. \quad (47)$$

Here, 1_N ∈ R^N is a vector with each element having the value 1. The centralized matrix K̂_m is decomposed using eigendecomposition,

$$\hat{\mathbf{K}}_m = \mathbf{U}_m\mathbf{A}_m\mathbf{U}_m^T, \quad (48)$$

where A_m contains the non-negative eigenvalues of the centered kernel matrix and U_m contains the corresponding eigenvectors. The data in the reduced-dimensional kernel space are obtained as

$$\hat{\boldsymbol{\Phi}}_m = \left(\mathbf{A}_m^{\frac{1}{2}}\right)^{+}\mathbf{U}_m^{+}\hat{\mathbf{K}}_m. \quad (49)$$

Since we consider NPT as a pure preprocessing step, we continue by considering Φ̂_m as our input data, i.e., we set X_m = Φ̂_m. Then we follow the linear MS-SVDD. Note that in cases where the number of training samples is high, this preprocessing step can be highly accelerated by following approximations, such as the Nyström-based Approximate Kernel Subspace Learning method in [38].
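A sketch of this NPT preprocessing step (our own helper; it keeps only the strictly positive eigenvalues) could look as follows:

```python
import numpy as np

def npt_features(K, tol=1e-12):
    """NPT preprocessing of Eqs. (46)-(49): center the N x N kernel matrix K
    and return explicit features Phi_hat whose columns replace the samples."""
    N = K.shape[0]
    J = np.eye(N) - np.full((N, N), 1.0 / N)   # I - E_N, with E_N of Eq. (47)
    Kc = J @ K @ J                             # Eq. (46): centered kernel
    evals, U = np.linalg.eigh(Kc)              # Eq. (48): Kc = U A U^T
    keep = evals > tol                         # positive eigenpairs only
    A, U = evals[keep], U[:, keep]
    # Eq. (49): (A^{1/2})^+ U^+ Kc = A^{1/2} U^T, since U^+ = U^T and
    # U^T Kc = A U^T for the retained eigenpairs.
    return np.sqrt(A)[:, None] * U.T           # kernel dimensions x samples
```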

3.3. Test phase

During the test phase, an instance x_{m∗} ∈ R^{D_m} (the ∗ in the subscript denotes a test instance) coming from modality m is projected to the common d-dimensional subspace using (2) in the linear case. In the kernel case, the kernel vector is first computed as

$$\mathbf{k}_{m*} = \boldsymbol{\Phi}_m^T\,\phi(\mathbf{x}_{m*}) \quad (50)$$

and then projected to the common d-dimensional subspace using (31). For NPT, the kernel vector k_{m∗} is first computed and then centralized as

$$\hat{\mathbf{k}}_{m*} = (\mathbf{I} - \mathbf{E}_N)\left[\mathbf{k}_{m*} - \frac{1}{N}\mathbf{K}_m\mathbf{1}_N\right]. \quad (51)$$

The centralized kernel vector is mapped to

$$\boldsymbol{\phi}_{m*} = \left(\hat{\boldsymbol{\Phi}}_m^T\right)^{+}\hat{\mathbf{k}}_{m*} \quad (52)$$

and then to the d-dimensional subspace using (2) (for notational simplicity, φ_{m∗} is considered as x_{m∗}).

The decision to classify the test instance y_{m∗} as positive or negative is taken on the basis of its distance from the center of the hypersphere, i.e.,

$$\|\mathbf{y}_{m*} - \mathbf{a}\|_2^2 = \mathbf{y}_{m*}^T\mathbf{y}_{m*} - 2\sum_{k=1}^{M}\sum_{i=1}^{N} \alpha_{k,i}\,\mathbf{y}_{m*}^T\mathbf{y}_{k,i} + \sum_{k=1}^{M}\sum_{i=1}^{N}\sum_{n=1}^{M}\sum_{j=1}^{N} \alpha_{k,i}\alpha_{n,j}\,\mathbf{y}_{k,i}^T\mathbf{y}_{n,j}. \quad (53)$$

The representation y_{m∗} is assigned to the positive class when ‖y_{m∗} − a‖₂² ≤ R² and to the negative class if ‖y_{m∗} − a‖₂² > R², where R² is the distance from the center a to any support vector on the boundary,

$$R^2 = \mathbf{v}^T\mathbf{v} - 2\sum_{m=1}^{M}\sum_{i=1}^{N} \alpha_{m,i}\,\mathbf{y}_{m,i}^T\mathbf{v} + \sum_{m=1}^{M}\sum_{i=1}^{N}\sum_{n=1}^{M}\sum_{j=1}^{N} \alpha_{m,i}\alpha_{n,j}\,\mathbf{y}_{m,i}^T\mathbf{y}_{n,j}, \quad (54)$$

where v is any support vector in the training set with a corresponding α having a value 0 < α < C. Since the items are represented by M different modalities, the final decision for assigning the item to a particular class (either positive or negative) can be taken using different strategies explained in Section 4.3.

3.4. Complexity analysis

The linear version of the proposed method has the following main steps: 1) initializing the projection matrices via PCA, 2) mapping data from all modalities to a lower d-dimensional shared space, 3) SVDD for obtaining the α values and final data description for all data points coming from M different modalities, 4) computing the gradient (ΔL) for each modality, 5) updating the projection matrices, and 6) QR decomposition for orthogonalizing and normalizing the projection matrices. We analyze each of these steps and then compute the overall complexity of the algorithm:

1. The PCA of a matrix is computed by the eigenvalue decomposition of its covariance matrix, so it involves two steps, i.e., computing the covariance matrix and then the eigenvalue decomposition of the obtained covariance matrix. The complexities of calculating the covariance matrix and the corresponding eigenvalue decomposition for a single modality are O(N D_m × min(N, D_m)) and O(D_m³), respectively [39]. The complexity of computing PCA for all modalities is O((min(N²D_1, D_1²N) + D_1³) + (min(N²D_2, D_2²N) + D_2³) + ... + (min(N²D_M, D_M²N) + D_M³)). We denote the sum of the dimensions of all modalities as D_Σ = D_1 + D_2 + ... + D_M and, similarly, the sum of squared dimensions as D²_Σ = D_1² + D_2² + ... + D_M² (note that D²_Σ ≠ (D_Σ)²) and the sum of cubed dimensions as D³_Σ = D_1³ + D_2³ + ... + D_M³. Hence, the complexity of initializing the projection matrices via PCA becomes O(min(N²D_Σ, D²_Σ N) + D³_Σ).

2. The complexity of mapping data from the original D_m-dimensional space to the lower d-dimensional space is the complexity of multiplying d × D_m and D_m × N matrices, which is O(d D_m N). Repeating this for all modalities, we get O(d D_Σ N).

3. The complexity of SVDD for N data points is O(N³) [40]. For all data points coming from M different modalities, it becomes O(M³N³).

4. The gradient ΔL used to update Q_m is computed using (21), where the second term has the highest complexity (equally high as regularization terms 4–6). Its complexity is O(2dN²D_m D_Σ). As this step is repeated for all modalities, the total complexity becomes O(2dN²(D_Σ)²).

5. Updating the projection matrices has O(d D_Σ) complexity.

6. The complexity of QR decomposition for a single modality is O(d D_m²) [41]. Thus, the overall complexity of the QR decompositions for all the modalities is O(d D²_Σ).

Dropping the relatively less intensive computational steps and adding the rest, the full complexity of the proposed method reduces to O(min(N²D_Σ, D²_Σ N) + D³_Σ + M³N³). Assuming that the total number of samples MN is always greater than D_Σ and M << N, the time complexity of (a single iteration of) our proposed algorithm in terms of the big O notation is O(N³). In the testing phase, each representation of a test sample in each modality is projected to the d-dimensional subspace and then its distance is compared to R. This has the total complexity of O(d D_Σ + Md).

For the non-linear version with NPT, the kernel matrix K_m is first formed, which has the complexity of O(D_m N²). Then the kernel matrix is centralized and decomposed using eigendecomposition. Both of these steps have the complexity of O(N³). As the data dimensionality in the remaining steps of the proposed method changes from D_m to N, the total complexity of the remaining steps becomes O(MN³ + M³N³). Thus, the overall complexity in terms of the big O notation remains at O(N³) for M << N, while in practice the computational complexity is higher (by a scalar multiplier c) than for the linear version. Also for the non-linear version with the standard kernel trick, the overall complexity remains the same, but the kernel mapping is repeated at every iteration and, thus, the scalar c becomes larger for the overall training process. The testing complexity of the non-linear methods increases to O(N D_Σ + dMN + Md).

4. Experiments

4.1. Datasets and preprocessing

To evaluate the proposed method, we performed different sets of experiments on five datasets. The Robot Execution Failures dataset, the Single Proton Emission Computed Tomography (SPECTF) heart dataset, and the Ionosphere dataset were downloaded from the UC Irvine (UCI) machine learning repository [34]. The Caltech-7 dataset and the Handwritten dataset were downloaded from a repository for multi-view learning [42]. The details of the datasets and experiments are as follows.

The first set of experiments was performed on the Robot Execution Failures dataset [43]. In this dataset, force and torque measurements are collected at regular intervals of time after a task failure is detected. The dataset is divided into five different learning problems (LP) corresponding to different triggering events:

LP1: Failures in approach to grasp position
LP2: Failures in the transfer of a part
LP3: Position of the part after a transfer failure
LP4: Failures in approach to ungrasp position
LP5: Failures in motion with part

The total number of instances and the distribution of the classes are given in Table 1. All instances are given as 15 samples collected at 315 ms regular time intervals for each sensor. For this dataset, we consider all the instances belonging to the normal class as the target class and the remaining classes as the non-target data. Hence, we have two modalities (torque and force measurements), and we consider the dataset as a one-class classification problem.

The second set of experiments was performed on the SPECTF heart dataset [44]. The SPECTF heart dataset consists of two sets of features corresponding to rest and stress condition SPECTF images of different subjects. The training set consists of 40 examples diagnosed as healthy heart muscle perfusions and 40 diagnosed as pathological perfusions. The test set consists of 15 instances of healthy heart muscle perfusions and 172 instances diagnosed as pathological perfusions. We convert this to a multimodal one-class classification problem by considering the rest and stress conditions as different modalities and by selecting the healthy heart muscle perfusions as our target class.

The third set of experiments was performed on the Caltech-7 dataset. We used Gabor features and wavelet moments as our two different modalities. The dataset contains 1474 samples in total from 7 different classes. We selected faces (435 samples) as our target class and the rest of the classes all together (1039 samples) as the outlier class.

We used the Ionosphere dataset for the fourth set of experiments. The categories in this dataset are described by two attributes per pulse number resulting from the complex electromagnetic signal, processed by an autocorrelation function. We used the two attributes (real and complex) for each pulse as two different modalities and the attribute "good" as our target class. The total number of samples in this dataset is 351, out of which 225 are from the target class (good), and the remaining 126 samples are from the outlier class (bad).

For the fifth set of experiments, we used the Handwritten dataset. We considered the samples of numeral 0 as the target. In the Handwritten dataset, the total number of samples is 2000, out of which 200 are from the target class. The remaining 1800 samples are considered as an outlier class. We used the Zernike moment (ZER) and morphological (MOR) features as our two different modalities.
