Multimodal subspace support vector data description
Fahad Sohrab a,*, Jenni Raitoharju a,b, Alexandros Iosifidis c, Moncef Gabbouj a

a Faculty of Information Technology and Communication Sciences, Tampere University, FI-33720 Tampere, Finland
b Programme for Environmental Information, Finnish Environment Institute, FI-40500 Jyväskylä, Finland
c Department of Engineering, Electrical and Computer Engineering, Aarhus University, DK-8200 Aarhus, Denmark
Article history: Received 21 August 2019; Revised 13 July 2020; Accepted 6 September 2020; Available online 10 September 2020.

Keywords: Feature transformation; Multimodal data; One-class classification; Support vector data description; Subspace learning
Abstract

In this paper, we propose a novel method for projecting data from multiple modalities to a new subspace optimized for one-class classification. The proposed method iteratively transforms the data from the original feature space of each modality to a new common feature space along with finding a joint compact description of data coming from all the modalities. For data in each modality, we define a separate transformation to map the data from the corresponding feature space to the new optimized subspace by exploiting the available information from the class of interest only. We also propose different regularization strategies for the proposed method and provide both linear and non-linear formulations. The proposed Multimodal Subspace Support Vector Data Description outperforms all the competing methods using data from a single modality or fusing data from all modalities in four out of five datasets.
© 2020 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
1. Introduction

In our surroundings, on a daily basis, we are exposed to information from many different sources. Different sensors are used to gather information about similar objects. Our brains usually perform well in combining the information from different sources to make a concise analysis of a particular entity. A single source of information might be enough to analyze an entity, but to make critical decisions it is important to combine information from different sources in a systematic way. For example, if a person is walking in a crowd, the main information needed to avoid collisions comes from visual cues, but people can also warn each other by voice or even by touch, and this extra information helps in understanding the environment better. The smell can help to avoid unpleasant spots, too. As another example, while watching a movie, the visual information of the scenes alone may not be enough to understand the whole scenario, but the audio and/or captions combined with the visual information will provide the full picture.
* Corresponding author.
E-mail addresses: fahad.sohrab@tuni.fi (F. Sohrab), jenni.raitoharju@tuni.fi (J. Raitoharju), alexandros.iosifidis@eng.au.dk (A. Iosifidis), moncef.gabbouj@tuni.fi (M. Gabbouj).
In machine learning techniques for predictive data modeling, training data are used to form a model that can accurately classify future instances into a predefined number of classes. In many cases, data come from sensors and can be further processed to extract different features. The term multimodal is used to describe data coming from different sensors (also referred to as modes or modalities); however, it is also used as a synonym for multi-view when different features are extracted from the same sensor or when there are multiple similar sensors, e.g., cameras. The aim of multimodal machine learning algorithms is to build models that can process and relate information from more than one modality (or view).

Examples of multimodal representations are prevalent in different application areas. In [1], an active multimodal sensor system for target recognition and tracking is studied, where information from three different sensors (visual, infrared, and hyperspectral) is used. In [2], a framework for vehicle tracking with multimodal data (velocity and images) is proposed, where the outcome of the velocity modality, estimated by applying a Kalman filter to data obtained from motion sensors, is fused with features learned from the image modality by the color-faster R-CNN method. In [3], a multimodal data collection framework for mental stress monitoring is studied. In the proposed framework, physiological and motion sensor data of people under stress are collected.
The data in multimodal applications come from different modalities, where each modality has its own statistical properties and contains specific information. The different modalities usually share high-level concepts and semantic information, and all together contain more information than any single-modal data [4]. If we build a model separately for each modality, the relationship between the modalities cannot be exploited efficiently. In multimodal subspace learning, the goal is to infer a shared latent representation that can accurately model data from each original modality and exploit the relationship between the modalities.
In traditional multiclass machine learning, an adequate amount of data is available for all the categories during training and, hence, the algorithm takes advantage of all available training data from all classes to train a model [5]. However, it is possible that the training data are highly imbalanced, or that the only data available are from a single class. In such cases, one-class classification techniques are used. They are useful in many different settings, such as outlier detection, predicting specific events, or, in general, predicting a specific target class. While much effort has been put into solving one-class classification tasks for data of a single modality [6], much less effort has been put into solving one-class multimodal challenges in general, and we are not aware of any prior work in the field of multimodal learning for one-class classification. In one-class multimodal tasks, it is assumed that the only data available are from a single class in many different modalities.
In this paper, we propose a novel method for solving multimodal one-class classification tasks. The proposed method, Multimodal Subspace Support Vector Data Description (MS-SVDD), finds a transformation for each modality along with defining a common model for all modalities in a lower-dimensional subspace optimized for one-class classification. The rest of the paper is organized as follows. In Section 2, an overview of related work is presented. In Section 3, the newly proposed MS-SVDD is derived and discussed. In Section 4, we present the experimental setup and results, and finally, in Section 5, conclusions are drawn.
2. Background and related work

In this section, we briefly discuss the principles of multimodal learning, along with subspace learning. We also provide an overview of traditional methods used for multiclass multimodal data description and one-class unimodal data description.
2.1. Multimodal learning

The availability of many different modalities can be a blessing if it increases the performance of the machine learning model. However, if the data description algorithm fails to make a strong connection between the different available modalities, the performance can be degraded. To ensure better performance of a model combining data from different modalities, mainly two principles should be ensured, i.e., the consensus and complementary principles [7]:
• The consensus principle aims at minimizing the disagreement between data available from different modes. Maximizing the agreement will reduce the error rate, and better modeling of data is achieved when combining data from different modalities.
• The complementary principle in the context of multimodal learning means that data from each modality may contain some knowledge not contained by the other ones. It is therefore necessary to exploit information from all the available modes to make an accurate description of the data.
Multimodal machine learning techniques can be described by three main properties: two-view vs. multi-view, linear vs. non-linear, and unsupervised vs. supervised [8]. As the name indicates, in two-view learning, the number of views is limited to two. In multi-view learning, the number of views is not limited. The difference between supervised and unsupervised learning is that, in supervised learning, the information on output labels of the training data is taken into account when training the model, while in unsupervised methods, the labels are not used to model the underlying structure or distribution of the data [9]. Linear techniques for multimodal subspace learning may be too simple to provide a representative model. Hence, kernel methods have been proposed to capture non-linear patterns in data.
Multimodal learning techniques have mainly been applied in four application domains [10]: audio-visual speech recognition [11], multimedia content indexing and retrieval [12], understanding human multimodal behaviors [13], and language and vision media description [14]. Recently, there has been a rising trend of applying multimodal machine learning algorithms to other applications as well. For example, in [15], a multimodal data fusion technique is used for the prediction of soybean yield from an unmanned aerial vehicle.
In multimodal learning, the main goal is to develop a process for fusing information from various modalities. In [16], fusion strategies are divided into two categories: model-agnostic and model-based approaches. In model-agnostic approaches, the fusion is either early, late, or hybrid. In early fusion, the data or extracted features are fused together at the very initial phase of modeling. A new feature vector is usually formed by concatenating all the available data from different modes, and the model is trained with the new feature vector. In late fusion, multiple models are trained, and the fusion is done on the scores generated by each model for the corresponding modality. The score generated by each model can be a threshold or some probability used in decision making. Hybrid fusion exploits the advantages of both early and late fusion. Model-based approaches fuse data explicitly during model construction, as in kernel-based approaches, graphical models, and neural networks. In this work, we present a model-based approach for data fusion.
2.2. Subspace learning

In the current era of data science, where high-dimensional multimodal big data are generated every minute in different industries, there is a need to get essential insights and mine knowledge from this high-dimensional data. Subspace learning aims at representing data in a lower-dimensional space while keeping intact all the information available in the original higher-dimensional space.
Algorithms developed for linear subspace learning find a projection matrix for labeled training data (represented by vectors) satisfying some optimality criteria. Principal Component Analysis (PCA) is one of the first subspace learning methods mentioned in the literature. In PCA, a subspace is learned by orthogonally projecting data to a subspace so that the variance of the data is maximized. PCA works only with a single mode of data, i.e., all data should have the same dimension. Another traditional subspace learning method is Linear Discriminant Analysis (LDA), which finds a linear transformation by exploiting the class information.
Analogous to PCA, but used for two-view learning, is canonical correlation analysis (CCA) [17]. CCA is a classic and conventional method for subspace learning, which aims at relating two sets of data by finding the pairs of directions that provide a maximum correlation between the two sets. It has recently become one of the popular methods for unsupervised subspace learning because of its generalization capability and has been used extensively for multimodal data fusion and cross-media retrieval [18]. In subspace learning, state-of-the-art results are achieved by methods which have embraced some stimulus from conventional subspace learning methods [19].
As an extension of methods for linear transformation, kernel methods have been introduced to describe non-linear functions or decision boundaries. In kernel methods, the data are mapped to a typically higher-dimensional kernel space using a kernel function, where they exhibit linear patterns [20,21]. For example, in [22], kernel-PCA, performing a non-linear form of PCA, is proposed.
2.3. One-class classification

In one-class classification, the parameters of the model are estimated using data from the positive class only, because data from the other classes are either not available at all or too diverse in nature to be modeled statistically [23]. The positive class is also called the target class, and the data from the other classes, which are not available during training, are called the negative or outlier class. For example, a unimodal biometric system uses a single biometric trait for verification or identification [24].
Support Vector Data Description (SVDD) [25] is among the most widely used one-class classification methods for anomaly detection and other related applications. SVDD obtains a spherical boundary around the target data, which can be made flexible by using the kernel trick. The obtained boundary is used to detect outliers during testing, i.e., anything inside the closed boundary is classified as the target class and otherwise as an outlier. The Lagrangian of SVDD is given as follows:

L = \sum_{i=1}^{N} \alpha_i \mathbf{x}_i^T \mathbf{x}_i - \sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \mathbf{x}_i^T \mathbf{x}_j \alpha_j,   (1)

where x_i is an input target training instance, and maximizing (1) gives a set of α_i corresponding to each instance. The instances with α_i > 0 define the data description. Another common one-class classification method is One-Class Support Vector Machine (OC-SVM) [26].

Techniques for enhancing the performance of one-class classification methods, mainly extensions of SVDD, can be categorized into four main categories: methods based on data structure, kernel issues, boundary shape, and non-stationary data [27]. As the name indicates, in the data structure category, the main focus is on the structure of the data. For example, in [28], a confidence coefficient is associated with each training sample to deal with the uncertainty of the data. In kernel issue extensions, the main focus is on reducing the complexity or proposing new kernels for one-class classification. For example, in [29], a new kernel is proposed to improve the accuracy of SVDD for time series classification. Proposing changes in the boundary enclosing the target data falls under the third category. For example, in [30], an ellipse shape is used for encapsulating target data instead of the traditional sphere used in SVDD. In [31], it is shown that both SVDD and OC-SVM lead to the same solution when exploiting the elliptical shape of the class. The last category of algorithms for improving one-class classifier performance attempts to handle non-stationary data. For example, in [32], Incremental-SVDD (I-SVDD) is proposed to handle non-stationary or increasing data. Recently, in [33], an algorithm developed for reducing the effect of uncertain data around the hypersphere of SVDD achieved state-of-the-art results on many UCI [34] datasets. In this paper, we consider baseline SVDD combined with multimodal subspace learning. However, in the future, the method can be further extended using similar ideas.
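As a concrete reference point, the sketch below solves the SVDD dual (1) with the usual SVDD constraints (0 ≤ α_i ≤ C, Σ_i α_i = 1) using a generic SciPy solver; the solver choice, function names, and toy data are our illustrative assumptions, not the implementation used in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def svdd_dual(X, C=0.2):
    """Maximize (1): sum_i a_i x_i^T x_i - sum_{i,j} a_i a_j x_i^T x_j,
    s.t. 0 <= a_i <= C and sum_i a_i = 1. Generic QP sketch, not optimized."""
    N = X.shape[0]
    G = X @ X.T                                   # Gram matrix of x_i^T x_j
    neg_dual = lambda a: -(a @ np.diag(G) - a @ G @ a)   # negate: scipy minimizes
    res = minimize(neg_dual, np.full(N, 1.0 / N),
                   bounds=[(0.0, C)] * N,
                   constraints=({'type': 'eq', 'fun': lambda a: a.sum() - 1.0},),
                   method='SLSQP')
    alpha = res.x
    center = alpha @ X                            # hypersphere center a = sum_i alpha_i x_i
    return alpha, center

# toy usage on random 2-D target data
rng = np.random.default_rng(0)
alpha, a = svdd_dual(rng.normal(size=(40, 2)))
print(np.sum((alpha > 1e-6) & (alpha < 0.2 - 1e-6)), 'support vectors')
```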
In the area of multimodal one-class classification, researchers have mainly focused on fusing the output labels of multiple models trained for each type of feature independently, i.e., without taking into account information from other feature types for one model [35].
3. Multimodal subspace support vector data description

MS-SVDD maps data from high-dimensional feature spaces to a low-dimensional feature space optimized for one-class classification. The optimized subspace is shared by data coming from all modalities. MS-SVDD is an extension of Subspace Support Vector Data Description (S-SVDD), which was proposed for unimodal data in [36]. The main novelty of MS-SVDD is using the multimodal approach for one-class classification. Here, we first derive the linear MS-SVDD. Then we derive two non-linear versions using the kernel trick [20] and the Nonlinear Projection Trick (NPT) [37], respectively.
3.1. Linear MS-SVDD
Let us assume that the items to be modeled are represented by M different modalities. The instances in each modality m, m = 1, ..., M, are represented by X_m = [x_{m,1}, x_{m,2}, ..., x_{m,N}], x_{m,i} ∈ R^{D_m}, where N is the total number of instances and D_m is the dimensionality of the feature space in modality m. MS-SVDD tries to find a projection matrix Q_m ∈ R^{d×D_m} for each modality, which will project the corresponding instances to a lower (d)-dimensional optimized subspace shared by all modalities. Thus, a feature vector x_{m,i} is projected to a d-dimensional vector y_{m,i} as

\mathbf{y}_{m,i} = \mathbf{Q}_m \mathbf{x}_{m,i}, \quad \forall m \in \{1, \dots, M\}, \; \forall i \in \{1, \dots, N\}.   (2)

To obtain a common description of all the data transformed from their corresponding modalities to the new common subspace, we exploit Support Vector Data Description (SVDD) [25] to form a closed boundary around the target class data in the new subspace. The center and radius of the hypersphere are denoted by a ∈ R^d and R, respectively. Fig. 1 depicts the basic idea of the proposed method.
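For intuition about the shapes involved in (2), the following toy sketch projects two modalities with random (untrained) matrices Q_m; all names and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3                                   # instances, subspace dimension
dims = [10, 7]                                 # D_1, D_2 of the two modalities
X = [rng.normal(size=(D, N)) for D in dims]    # X_m is (D_m, N), as in the paper
Q = [rng.normal(size=(d, D)) for D in dims]    # Q_m is (d, D_m); learned in MS-SVDD
Y = [Qm @ Xm for Qm, Xm in zip(Q, X)]          # Eq. (2): y_{m,i} = Q_m x_{m,i}
Y_all = np.hstack(Y)                           # (d, M*N) pooled input for joint SVDD
print(Y_all.shape)                             # -> (3, 100)
```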
In order to find a compact hypersphere which encloses all the target data from all the modalities in the new subspace, we minimize

F(R, \mathbf{a}) = R^2 \quad \text{s.t.} \quad \|\mathbf{Q}_m \mathbf{x}_{m,i} - \mathbf{a}\|_2^2 \le R^2, \;\; \forall m \in \{1, \dots, M\}, \; \forall i \in \{1, \dots, N\}.   (3)

By introducing slack variables ξ_{m,i}, such that most of the training data from all the modalities in the new common space should lie inside the hypersphere, the above criterion becomes

F(R, \mathbf{a}) = R^2 + C \sum_{m=1}^{M} \sum_{i=1}^{N} \xi_{m,i} \quad \text{s.t.} \quad \|\mathbf{Q}_m \mathbf{x}_{m,i} - \mathbf{a}\|_2^2 \le R^2 + \xi_{m,i}, \;\; \xi_{m,i} \ge 0, \;\; \forall m \in \{1, \dots, M\}, \; \forall i \in \{1, \dots, N\}.   (4)

The Lagrange function corresponding to (4) can be given as

L = R^2 + C \sum_{m=1}^{M} \sum_{i=1}^{N} \xi_{m,i} - \sum_{m=1}^{M} \sum_{i=1}^{N} \gamma_{m,i} \xi_{m,i} - \sum_{m=1}^{M} \sum_{i=1}^{N} \alpha_{m,i} \left( R^2 + \xi_{m,i} - \mathbf{x}_{m,i}^T \mathbf{Q}_m^T \mathbf{Q}_m \mathbf{x}_{m,i} + 2 \mathbf{a}^T \mathbf{Q}_m \mathbf{x}_{m,i} - \mathbf{a}^T \mathbf{a} \right).   (5)

The Lagrangian function should be maximized with respect to α_{m,i} ≥ 0 and γ_{m,i} ≥ 0 and minimized with respect to R, a, ξ_{m,i}, and Q_m. By setting the partial derivatives to zero, we get

\frac{\partial L}{\partial R} = 0 \;\Rightarrow\; \sum_{m=1}^{M} \sum_{i=1}^{N} \alpha_{m,i} = 1,   (6)

Fig. 1. Depiction of the proposed MS-SVDD: data from two modalities in their corresponding feature spaces are mapped to a common subspace, where positive class instances are enclosed inside a (hyper)sphere.
\frac{\partial L}{\partial \mathbf{a}} = 0 \;\Rightarrow\; \mathbf{a} = \sum_{m=1}^{M} \sum_{i=1}^{N} \alpha_{m,i} \mathbf{Q}_m \mathbf{x}_{m,i},   (7)

\frac{\partial L}{\partial \xi_{m,i}} = 0 \;\Rightarrow\; C - \alpha_{m,i} - \gamma_{m,i} = 0,   (8)

\frac{\partial L}{\partial \mathbf{Q}_m} = 0 \;\Rightarrow\; \mathbf{Q}_m = \mathbf{a} \sum_{i=1}^{N} \alpha_{m,i} \mathbf{x}_{m,i}^T \left( \sum_{i=1}^{N} \alpha_{m,i} \mathbf{x}_{m,i} \mathbf{x}_{m,i}^T \right)^{-1}.   (9)
It is clear from (6)–(9) that the parameters α and Q are interrelated and cannot be jointly optimized. Hence, we apply a two-step iterative optimization process where, in each step, we fix one parameter and optimize the other. Substituting (2), (6), (7) and (8) into the Lagrangian function (5), we get
L = \sum_{m=1}^{M} \sum_{i=1}^{N} \alpha_{m,i} \mathbf{y}_{m,i}^T \mathbf{y}_{m,i} - \sum_{m=1}^{M} \sum_{i=1}^{N} \sum_{n=1}^{M} \sum_{j=1}^{N} \alpha_{m,i} \mathbf{y}_{m,i}^T \mathbf{y}_{n,j} \alpha_{n,j}.   (10)

We see that optimizing (10) for α corresponds to the traditional SVDD applied in the subspace. Maximizing (10) for a particular set of data will give us the α_{m,i} corresponding to each sample. The value of α_{m,i} for the corresponding sample defines its position with respect to the hypersphere (a small sketch of this partition follows the list):

• Samples with 0 < α_{m,i} < C define the data description and lie on the boundary of the hypersphere; they are referred to as support vectors.
• Samples with α_{m,i} = C are outside the boundary.
• Samples with α_{m,i} = 0 lie inside the boundary.
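This partition can be read directly off the α vector; a minimal sketch (the function name and tolerance are ours):

```python
import numpy as np

def partition_by_alpha(alpha, C, tol=1e-8):
    """Split training instances by their position w.r.t. the hypersphere."""
    support = (alpha > tol) & (alpha < C - tol)   # 0 < alpha < C: on the boundary
    outlier = alpha >= C - tol                    # alpha = C: outside the boundary
    inside = alpha <= tol                         # alpha = 0: inside the boundary
    return support, outlier, inside
```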
In the second step, we fix α and update Q_m for each modality. For this step, we add a regularization term ω:

L = \sum_{m=1}^{M} \sum_{i=1}^{N} \alpha_{m,i} \mathbf{x}_{m,i}^T \mathbf{Q}_m^T \mathbf{Q}_m \mathbf{x}_{m,i} - \sum_{m=1}^{M} \sum_{i=1}^{N} \sum_{n=1}^{M} \sum_{j=1}^{N} \alpha_{m,i} \mathbf{x}_{m,i}^T \mathbf{Q}_m^T \mathbf{Q}_n \mathbf{x}_{n,j} \alpha_{n,j} + \beta\omega.   (11)

The regularization term ω expresses the covariance of data from different modalities in the new low-dimensional space, and β is a regularization parameter controlling the significance of ω. We propose different settings for ω:

\omega_0 = 0,   (12)

\omega_1 = \sum_{m=1}^{M} \mathrm{tr}\left( \mathbf{Q}_m \mathbf{X}_m \mathbf{X}_m^T \mathbf{Q}_m^T \right),   (13)

\omega_2 = \sum_{m=1}^{M} \mathrm{tr}\left( \mathbf{Q}_m \mathbf{X}_m \boldsymbol{\alpha}_m \boldsymbol{\alpha}_m^T \mathbf{X}_m^T \mathbf{Q}_m^T \right),   (14)

\omega_3 = \sum_{m=1}^{M} \mathrm{tr}\left( \mathbf{Q}_m \mathbf{X}_m \boldsymbol{\lambda}_m \boldsymbol{\lambda}_m^T \mathbf{X}_m^T \mathbf{Q}_m^T \right),   (15)

\omega_4 = \sum_{m=1}^{M} \sum_{n=1}^{M} \mathrm{tr}\left( \mathbf{Q}_m \mathbf{X}_m \mathbf{X}_n^T \mathbf{Q}_n^T \right),   (16)

\omega_5 = \sum_{m=1}^{M} \sum_{n=1}^{M} \mathrm{tr}\left( \mathbf{Q}_m \mathbf{X}_m \boldsymbol{\alpha}_m \boldsymbol{\alpha}_n^T \mathbf{X}_n^T \mathbf{Q}_n^T \right),   (17)

\omega_6 = \sum_{m=1}^{M} \sum_{n=1}^{M} \mathrm{tr}\left( \mathbf{Q}_m \mathbf{X}_m \boldsymbol{\lambda}_m \boldsymbol{\lambda}_n^T \mathbf{X}_n^T \mathbf{Q}_n^T \right),   (18)
where α_m ∈ R^N in (14) and (17) is a vector having the elements α_{m,1}, ..., α_{m,N}. Thus, α_m has non-zero values for support vectors and outliers. λ_m ∈ R^N in (15) and (18) is a vector having the elements of α_m that are smaller than C. Values of α_m corresponding to the outliers (i.e., α_{m,i} = C) are replaced with zeros in λ_m. Thus, λ_m has non-zero values only for the support vectors. For ω_0, the regularization term becomes obsolete and is not used in the optimization process. In ω_1, the regularization term only uses representations coming from the respective modality, and no representations from the other modalities are used to describe the variance of the positive class. In ω_2, all support vectors, i.e., representations at the hypersphere boundary, and outliers are used to describe the class variance for the update of the corresponding Q_m. In ω_3, only support vectors of the respective modality are used to describe the variance of the class to be modeled. In ω_4, data from all the modalities are used to describe the covariance and regularize the update of Q_m. In ω_5, the instances belonging to the hypersphere boundary and outliers from all modalities are used to describe the covariance. In ω_6, only the support vectors belonging to the class boundary from all modalities are used to update Q_m and describe the covariance of the positive class.

Note that the MS-SVDD formulation reduces to S-SVDD [36] if data from only one modality (M = 1) are taken into account for data description. In S-SVDD, a single projection matrix Q is determined for mapping the data X from the higher-dimensional space to a lower-dimensional space. A regularization term ψ, which expresses the class variance in the low-dimensional space, is added to the Lagrangian function of S-SVDD:

\psi = \mathrm{tr}\left( \mathbf{Q} \mathbf{X} \boldsymbol{\lambda} \boldsymbol{\lambda}^T \mathbf{X}^T \mathbf{Q}^T \right),   (19)

where λ can take different forms as described in [36]. The regularization terms ω_0, ω_1, ω_2, and ω_3 for MS-SVDD become equivalent to the regularization terms proposed for S-SVDD when M = 1. Hence, MS-SVDD is a more generalized form of S-SVDD, which can form a data description by considering data from multiple modalities.

We update Q_m by using the gradient of L in (11) with respect to Q_m,
\mathbf{Q}_m \leftarrow \mathbf{Q}_m - \eta \Delta L,   (20)

where η is the learning rate parameter and the gradient of L is calculated as

\frac{\partial L}{\partial \mathbf{Q}_m} = 2 \sum_{i=1}^{N} \alpha_{m,i} \mathbf{Q}_m \mathbf{x}_{m,i} \mathbf{x}_{m,i}^T - 2 \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{n=1}^{M} \mathbf{Q}_n \mathbf{x}_{n,j} \mathbf{x}_{m,i}^T \alpha_{m,i} \alpha_{n,j} + \beta \Delta\omega,   (21)

where Δω is the derivative of the regularization term with respect to Q_m:

\Delta\omega_0 = 0,   (22)

\Delta\omega_1 = 2 \mathbf{Q}_m \mathbf{X}_m \mathbf{X}_m^T,   (23)

\Delta\omega_2 = 2 \mathbf{Q}_m \mathbf{X}_m \boldsymbol{\alpha}_m \boldsymbol{\alpha}_m^T \mathbf{X}_m^T,   (24)

\Delta\omega_3 = 2 \mathbf{Q}_m \mathbf{X}_m \boldsymbol{\lambda}_m \boldsymbol{\lambda}_m^T \mathbf{X}_m^T,   (25)

\Delta\omega_4 = 2 \sum_{n=1}^{M} \mathbf{Q}_n \mathbf{X}_n \mathbf{X}_m^T,   (26)

\Delta\omega_5 = 2 \sum_{n=1}^{M} \mathbf{Q}_n \mathbf{X}_n \boldsymbol{\alpha}_n \boldsymbol{\alpha}_m^T \mathbf{X}_m^T,   (27)

\Delta\omega_6 = 2 \sum_{n=1}^{M} \mathbf{Q}_n \mathbf{X}_n \boldsymbol{\lambda}_n \boldsymbol{\lambda}_m^T \mathbf{X}_m^T.   (28)
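Since every term of (21) factorizes into rank-one pieces, the gradient can be assembled with a few matrix products. The sketch below does this for the ω_0, ω_3, and ω_6 variants (the others follow the same pattern); the array layout and tolerance are our assumptions, not the authors' reference code.

```python
import numpy as np

def grad_L(Q, X, alpha, beta, C, omega=3, tol=1e-8):
    """Gradient of (11) w.r.t. each Q_m (Eq. (21)) plus the regularizer
    derivative from (22)-(28). Q and X are lists over modalities with
    Q[m]: (d, D_m), X[m]: (D_m, N); alpha is (M, N)."""
    M = len(X)
    Y = [Q[m] @ X[m] for m in range(M)]                    # projected data (d, N)
    # lambda_m: copy of alpha_m with outlier entries (alpha = C) zeroed out
    lam = [np.where(alpha[m] >= C - tol, 0.0, alpha[m]) for m in range(M)]
    # shared factor of the second term: sum_{n,j} alpha_{n,j} Q_n x_{n,j}
    s = sum((Y[n] * alpha[n]).sum(axis=1) for n in range(M))
    grads = []
    for m in range(M):
        t1 = 2.0 * (Y[m] * alpha[m]) @ X[m].T              # first term of (21)
        t2 = -2.0 * np.outer(s, (X[m] * alpha[m]).sum(axis=1))
        if omega == 0:                                     # Eq. (22)
            dw = 0.0
        elif omega == 3:                                   # Eq. (25)
            v = X[m] @ lam[m]
            dw = 2.0 * np.outer(Q[m] @ v, v)
        elif omega == 6:                                   # Eq. (28)
            dw = 2.0 * np.outer(sum(Q[n] @ (X[n] @ lam[n]) for n in range(M)),
                                X[m] @ lam[m])
        else:
            raise ValueError("sketch covers omega in {0, 3, 6} only")
        grads.append(t1 + t2 + beta * dw)
    return grads
```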
We initialize Q_m using PCA. At every iteration, the projection matrix is orthogonalized and normalized so that

\mathbf{Q}_m \mathbf{Q}_m^T = \mathbf{I},   (29)

where I is an identity matrix. We use QR decomposition for orthogonalizing and normalizing the projection matrix Q_m. Algorithm 1 describes the overall MS-SVDD algorithm.
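In NumPy, this QR-based step is a two-liner; a sketch with our naming:

```python
import numpy as np

def orthonormalize_rows(Qm):
    """Enforce Q_m Q_m^T = I (Eq. (29)) after a gradient step: QR-decompose
    Q_m^T so its columns become orthonormal, then transpose back."""
    Qt, _ = np.linalg.qr(Qm.T)   # columns of Qt are orthonormal
    return Qt.T                  # rows of the result are orthonormal
```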
3.2. Non-linear MS-SVDD

For non-linear mapping from the original feature spaces to a new shared feature space, we use two approaches. The first approach is based on the standard kernel trick [20] and the second on the Nonlinear Projection Trick (NPT) [37], which is used as a computationally lighter alternative to the kernel trick.
3.2.1. Non-linear MS-SVDD with the standard kernel trick

In the non-linear data description, the original data are mapped to a kernel space F using a non-linear function φ(·) such that x_{m,i} ∈ R^{D_m} → φ(x_{m,i}) ∈ F. The kernel space dimensionality can possibly be infinite. Then the data are projected from the kernel space to R^d as

\mathbf{y}_{m,i} = \mathbf{Q}_m \phi(\mathbf{x}_{m,i}), \quad \forall i \in \{1, \dots, N\}.   (30)
Algorithm 1: MS-SVDD optimization.

Inputs:  Z_m for each m = 1, ..., M  // input data from all modalities
         β  // regularization parameter controlling the significance of ω
         η  // learning rate parameter
         d  // dimensionality of the joint subspace
         C  // regularization parameter in SVDD
         M  // total number of modalities
Outputs: S_m for each m = 1, ..., M  // projection matrices for different modalities
         R  // radius of the hypersphere
         α  // defines the data description

Z_m = X_m for the linear and NPT cases (K_m for the kernel case)
S_m = Q_m for the linear and NPT cases (W_m for the kernel case)

for m = 1 : M do
    Initialize S_m via linear PCA (kernel PCA);
end
for iter = 1 : max_iter do
    For each m, map Z_m to Y_m using Eq. (2) (Eq. (31));
    Form Y by combining all Y_m's;
    Solve SVDD in the subspace to obtain α in Eq. (10);
    for m = 1 : M do
        Calculate ΔL using Eq. (21) (Eq. (35));
        Update S_m ← S_m − ηΔL;
        Orthogonalize and normalize S_m using QR decomposition (eigendecomposition);
    end
end
For each m, compute Y_m using Eq. (2) (Eq. (31));
Form Y by combining all Y_m's;
Solve SVDD to obtain the final data description;
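To connect the pieces, here is a compact, self-contained sketch of the linear branch of Algorithm 1 with the ω_0 regularizer (so the β-term vanishes); the SLSQP-based SVDD solver, the PCA-by-SVD initialization, and the toy dimensions are our assumptions rather than the authors' reference implementation.

```python
import numpy as np
from scipy.optimize import minimize

def solve_svdd(Y, C):
    """SVDD dual of Eq. (10) in the shared subspace; Y is (d, M*N)."""
    G = Y.T @ Y
    n = G.shape[0]
    neg_dual = lambda a: -(a @ np.diag(G) - a @ G @ a)
    res = minimize(neg_dual, np.full(n, 1.0 / n), bounds=[(0.0, C)] * n,
                   constraints=({'type': 'eq', 'fun': lambda a: a.sum() - 1.0},),
                   method='SLSQP')
    return res.x

def ms_svdd_linear(X, d=2, C=0.1, eta=0.01, iters=10):
    """Linear MS-SVDD loop (Algorithm 1) with the omega_0 regularizer."""
    M, N = len(X), X[0].shape[1]
    Q = []
    for Xm in X:  # initialize each Q_m from the top-d principal directions
        U, _, _ = np.linalg.svd(Xm - Xm.mean(axis=1, keepdims=True),
                                full_matrices=False)
        Q.append(U[:, :d].T)
    for _ in range(iters):
        Y = np.hstack([Q[m] @ X[m] for m in range(M)])     # Eq. (2)
        a = solve_svdd(Y, C).reshape(M, N)                 # alphas, modality-major
        # gradients below use the Q's from the start of this iteration
        s = sum(((Q[n] @ X[n]) * a[n]).sum(axis=1) for n in range(M))
        for m in range(M):
            t1 = 2.0 * ((Q[m] @ X[m]) * a[m]) @ X[m].T     # Eq. (21), omega_0 case
            t2 = -2.0 * np.outer(s, (X[m] * a[m]).sum(axis=1))
            Q[m] = Q[m] - eta * (t1 + t2)                  # Eq. (20)
            Qt, _ = np.linalg.qr(Q[m].T)                   # Eq. (29)
            Q[m] = Qt.T
    Y = np.hstack([Q[m] @ X[m] for m in range(M)])
    return Q, solve_svdd(Y, C)

# toy usage: two modalities describing the same 60 target items
rng = np.random.default_rng(1)
Q, alpha = ms_svdd_linear([rng.normal(size=(8, 60)), rng.normal(size=(5, 60))])
print([q.shape for q in Q], alpha.shape)   # [(2, 8), (2, 5)] (120,)
```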
In order to calculate y_{m,i}, we use the so-called kernel trick by expressing the projection matrix Q_m as a linear combination of the training data representations of the respective modality in the kernel space F, leading to

\mathbf{y}_{m,i} = \mathbf{W}_m \boldsymbol{\Phi}_m^T \phi(\mathbf{x}_{m,i}) = \mathbf{W}_m \mathbf{k}_{m,i}, \quad \forall i \in \{1, \dots, N\},   (31)

where Φ_m ∈ R^{|F|×N} is a matrix formed in F containing the training data representations of modality m, W_m ∈ R^{d×N} is a matrix containing the weights for Φ_m needed to form Q_m, and k_{m,i} is the i-th column of the Gramian matrix, also called the kernel matrix, K_m ∈ R^{N×N}, having elements equal to K_{m,ij} = φ(x_{m,i})^T φ(x_{m,j}). In our experiments, we use the Radial Basis Function (RBF) kernel, given by

K_{m,ij} = \exp\left( -\frac{\|\mathbf{x}_{m,i} - \mathbf{x}_{m,j}\|_2^2}{2\sigma^2} \right),   (32)

where σ > 0 is a hyperparameter that determines the width of the kernel.
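Eq. (32) in NumPy/SciPy, with the data stored column-wise as in the paper (a sketch; σ is a free hyperparameter):

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel(X, sigma=1.0):
    """RBF kernel matrix of Eq. (32); X is (D_m, N), returns (N, N)."""
    sq = cdist(X.T, X.T, metric='sqeuclidean')   # pairwise ||x_i - x_j||^2
    return np.exp(-sq / (2.0 * sigma ** 2))
```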
The augmented version of the Lagrangian function now takes the following form:

L = \sum_{m=1}^{M} \sum_{i=1}^{N} \alpha_{m,i} \mathbf{k}_{m,i}^T \mathbf{W}_m^T \mathbf{W}_m \mathbf{k}_{m,i} - \sum_{m=1}^{M} \sum_{i=1}^{N} \sum_{n=1}^{M} \sum_{j=1}^{N} \alpha_{m,i} \mathbf{k}_{m,i}^T \mathbf{W}_m^T \mathbf{W}_n \mathbf{k}_{n,j} \alpha_{n,j} + \beta\omega.   (33)

The α's are calculated by optimizing (10) with the W_m's fixed, i.e., by applying SVDD in the subspace. In the second step, the α's are fixed and the W_m's are updated with gradient descent:

\mathbf{W}_m \leftarrow \mathbf{W}_m - \eta \Delta L,   (34)

where the gradient is calculated as
\frac{\partial L}{\partial \mathbf{W}_m} = 2 \sum_{i=1}^{N} \alpha_{m,i} \mathbf{W}_m \mathbf{k}_{m,i} \mathbf{k}_{m,i}^T - 2 \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{n=1}^{M} \mathbf{W}_n \mathbf{k}_{n,j} \mathbf{k}_{m,i}^T \alpha_{m,i} \alpha_{n,j} + \beta \Delta\omega.   (35)

The gradient of the regularization term, Δω, now takes the following forms:

\Delta\omega_0 = 0,   (36)

\Delta\omega_1 = 2 \mathbf{W}_m \mathbf{K}_m \mathbf{K}_m^T,   (37)

\Delta\omega_2 = 2 \mathbf{W}_m \mathbf{K}_m \boldsymbol{\alpha}_m \boldsymbol{\alpha}_m^T \mathbf{K}_m^T,   (38)

\Delta\omega_3 = 2 \mathbf{W}_m \mathbf{K}_m \boldsymbol{\lambda}_m \boldsymbol{\lambda}_m^T \mathbf{K}_m^T,   (39)

\Delta\omega_4 = 2 \sum_{n=1}^{M} \mathbf{W}_n \mathbf{K}_n \mathbf{K}_m^T,   (40)

\Delta\omega_5 = 2 \sum_{n=1}^{M} \mathbf{W}_n \mathbf{K}_n \boldsymbol{\alpha}_n \boldsymbol{\alpha}_m^T \mathbf{K}_m^T,   (41)

\Delta\omega_6 = 2 \sum_{n=1}^{M} \mathbf{W}_n \mathbf{K}_n \boldsymbol{\lambda}_n \boldsymbol{\lambda}_m^T \mathbf{K}_m^T.   (42)

We initialize the matrix W_m for each mode using kernel-PCA.
We orthogonalize and normalize W_m at every iteration so that

\mathbf{W}_m \boldsymbol{\Phi}_m^T \boldsymbol{\Phi}_m \mathbf{W}_m^T = \mathbf{I}.   (43)

We decompose (43) using eigendecomposition as

\mathbf{W}_m \boldsymbol{\Phi}_m^T \boldsymbol{\Phi}_m \mathbf{W}_m^T = \mathbf{V}_m \boldsymbol{\Lambda}_m \mathbf{V}_m^T,   (44)

where Φ_m^T Φ_m is K_m, Λ_m is a diagonal matrix containing the eigenvalues of W_m K_m W_m^T, and V_m contains the corresponding eigenvectors. After further simplification, the normalized projection matrix Ŵ_m can be computed as

\hat{\mathbf{W}}_m = \left( \boldsymbol{\Lambda}_m^{\frac{1}{2}} \right)^{+} \mathbf{V}_m^T \mathbf{W}_m,   (45)

where the + sign denotes the pseudo-inverse. For notational simplicity, we set W_m = Ŵ_m.
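A sketch of this normalization, using the identity Φ_m^T Φ_m = K_m so that only the N × N kernel matrix is needed; the eigenvalue floor tol is our numerical safeguard:

```python
import numpy as np

def normalize_W(W, K, tol=1e-10):
    """Enforce W_m K_m W_m^T = I via Eqs. (43)-(45): eigendecompose
    W K W^T = V Lambda V^T and rescale with (Lambda^{1/2})^+ V^T."""
    evals, V = np.linalg.eigh(W @ K @ W.T)                     # Eq. (44)
    inv_sqrt = np.where(evals > tol,
                        1.0 / np.sqrt(np.clip(evals, tol, None)), 0.0)
    return np.diag(inv_sqrt) @ V.T @ W                         # Eq. (45)
```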
3.2.2. Non-linear MS-SVDD with the nonlinear projection trick

The non-linear MS-SVDD using the kernel trick requires computing the eigendecomposition (44) at every iteration. This is computationally expensive and, therefore, we propose an alternative non-linear approach using NPT [37]. Here, a non-linear mapping is applied only at the beginning of the process, while the optimization follows the linear MS-SVDD. In the NPT-based MS-SVDD, we first compute the kernel matrix K_m using (32). In the next step, the computed kernel matrix is centralized as

\hat{\mathbf{K}}_m = \left( \mathbf{I} - \mathbf{E}_N \right) \mathbf{K}_m \left( \mathbf{I} - \mathbf{E}_N \right),   (46)

where K̂_m is the centralized kernel matrix and E_N is an N × N matrix defined as

\mathbf{E}_N = \frac{1}{N} \mathbf{1}_N \mathbf{1}_N^T.   (47)

Here 1_N ∈ R^N is a vector with each element having the value 1. The centralized matrix K̂_m is decomposed by using eigendecomposition,

\hat{\mathbf{K}}_m = \mathbf{U}_m \mathbf{A}_m \mathbf{U}_m^T,   (48)

where A_m contains the non-negative eigenvalues of the centered kernel matrix and U_m contains the corresponding eigenvectors. The data in the reduced-dimensional kernel space are obtained as

\boldsymbol{\Phi}_m = \left( \mathbf{A}_m^{\frac{1}{2}} \right)^{+} \mathbf{U}_m^{+} \hat{\mathbf{K}}_m.   (49)

Since we consider NPT as a pure preprocessing step, we continue by considering Φ_m as our input data, i.e., we set X_m = Φ_m. Then we follow the linear MS-SVDD. Note that in cases where the number of training samples is high, this preprocessing step can be highly accelerated by following approximations, like the Nyström-based Approximate Kernel Subspace Learning method in [38].
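A sketch of the NPT preprocessing (46)–(49); because U_m has orthonormal columns, U_m^+ = U_m^T, which the code exploits, and the strictly-positive eigenvalue cut-off is our assumption:

```python
import numpy as np

def npt_features(K, tol=1e-10):
    """Map a kernel matrix to explicit NPT features Phi_m (Eqs. (46)-(49));
    the (r x N) result is then used as the new input data X_m."""
    N = K.shape[0]
    J = np.eye(N) - np.full((N, N), 1.0 / N)      # I - E_N, with E_N from Eq. (47)
    Kc = J @ K @ J                                # Eq. (46): centered kernel matrix
    evals, U = np.linalg.eigh(Kc)                 # Eq. (48)
    keep = evals > tol                            # keep strictly positive eigenvalues
    A, U = evals[keep], U[:, keep]
    return (U / np.sqrt(A)).T @ Kc                # Eq. (49): (A^{1/2})^+ U^T Kc
```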
3.3. Test phase

During the test phase, an instance x_{m*} ∈ R^{D_m} (the * in the subscript denotes a test instance) coming from modality m is projected to the common d-dimensional subspace using (2) for the linear case. For the kernel case, first the kernel vector is computed as

\mathbf{k}_{m*} = \boldsymbol{\Phi}_m^T \phi(\mathbf{x}_{m*})   (50)

and then projected to the common d-dimensional subspace using (31). For NPT, first the kernel vector k_{m*} is computed and then centralized as

\hat{\mathbf{k}}_{m*} = \left( \mathbf{I} - \mathbf{E}_N \right) \left( \mathbf{k}_{m*} - \frac{1}{N} \mathbf{K}_m \mathbf{1}_N \right).   (51)

The centralized kernel vector is mapped to

\boldsymbol{\phi}_{m*} = \left( \boldsymbol{\Phi}_m^T \right)^{+} \hat{\mathbf{k}}_{m*}   (52)

and then to the d-dimensional subspace using (2) (for notational simplicity, φ_{m*} is considered as x_{m*}). The decision to classify the test instance y_{m*} as positive or negative is taken on the basis of its distance from the center of the hypersphere, i.e.,
\|\mathbf{y}_{m*} - \mathbf{a}\|_2^2 = \mathbf{y}_{m*}^T \mathbf{y}_{m*} - 2 \sum_{k=1}^{M} \sum_{i=1}^{N} \alpha_{k,i} \mathbf{y}_{m*}^T \mathbf{y}_{k,i} + \sum_{k=1}^{M} \sum_{i=1}^{N} \sum_{n=1}^{M} \sum_{j=1}^{N} \alpha_{k,i} \alpha_{n,j} \mathbf{y}_{k,i}^T \mathbf{y}_{n,j}.   (53)

The representation y_{m*} is assigned to the positive class when ‖y_{m*} − a‖_2^2 ≤ R^2 and to the negative class if ‖y_{m*} − a‖_2^2 > R^2, where R^2 is the distance from the center a to any support vector on the boundary,

R^2 = \mathbf{v}^T \mathbf{v} - 2 \sum_{m=1}^{M} \sum_{i=1}^{N} \alpha_{m,i} \mathbf{y}_{m,i}^T \mathbf{v} + \sum_{m=1}^{M} \sum_{i=1}^{N} \sum_{n=1}^{M} \sum_{j=1}^{N} \alpha_{m,i} \alpha_{n,j} \mathbf{y}_{m,i}^T \mathbf{y}_{n,j},   (54)

where v is any support vector in the training set with the corresponding α having a value 0 < α < C. Since the items are represented by M different modalities, the final decision for assigning the item to a particular class (either positive or negative) can be taken using different strategies, explained in Section 4.3.
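For the linear case, the decision rule (53)–(54) reduces to a few inner products once the training data have been projected and pooled; a sketch with our variable layout:

```python
import numpy as np

def is_target(y_star, Y, alpha, C, tol=1e-8):
    """Decide the class of a projected test instance y_star (d,) via
    Eqs. (53)-(54); Y is (d, M*N) projected training data, alpha (M*N,)."""
    cross = alpha @ (Y.T @ Y) @ alpha                 # double-sum term of (53)/(54)
    dist2 = y_star @ y_star - 2.0 * alpha @ (Y.T @ y_star) + cross   # Eq. (53)
    sv = np.flatnonzero((alpha > tol) & (alpha < C - tol))[0]        # a support vector
    v = Y[:, sv]
    R2 = v @ v - 2.0 * alpha @ (Y.T @ v) + cross                     # Eq. (54)
    return dist2 <= R2                                # positive class if inside
```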
3.4. Complexity analysis

The linear version of the proposed method has the following main steps: 1) initializing the projection matrices via PCA, 2) mapping data from all modalities to a lower d-dimensional shared space, 3) SVDD for obtaining the α values and the final data description for all data points coming from M different modalities, 4) computing the gradient ΔL for each modality, 5) updating the projection matrices, and 6) QR decomposition for orthogonalizing and normalizing the projection matrices. We analyze each of these steps and then compute the overall complexity of the algorithm:
1. PCA of a matrix is computed by the eigenvalue decomposition of its covariance matrix, so it involves two steps, i.e., computing the covariance matrix and then the eigenvalue decomposition of the obtained covariance matrix. The complexities of calculating the covariance matrix and the corresponding eigenvalue decomposition for a single modality are O(N D_m min(N, D_m)) and O(D_m^3), respectively [39]. The complexity of computing PCA for all modalities is O((min(N^2 D_1, D_1^2 N) + D_1^3) + (min(N^2 D_2, D_2^2 N) + D_2^3) + ... + (min(N^2 D_M, D_M^2 N) + D_M^3)). We denote the sum of the dimensions of all modalities as ΣD = D_1 + D_2 + ... + D_M, the sum of squared dimensions as ΣD^2 = D_1^2 + D_2^2 + ... + D_M^2 (note that ΣD^2 ≠ (ΣD)^2), and the sum of cubed dimensions as ΣD^3 = D_1^3 + D_2^3 + ... + D_M^3. Hence, the complexity of initializing the projection matrices via PCA becomes O(min(N^2 ΣD, ΣD^2 N) + ΣD^3).
2. The complexity of mapping data from the original D_m-dimensional space to the lower d-dimensional space is the complexity of multiplying a d × D_m matrix with a D_m × N matrix, which is O(d D_m N). Repeating this for all modalities, we get O(d ΣD N).
3. The complexity of SVDD for N data points is O(N^3) [40]. For all data points coming from M different modalities, it becomes O(M^3 N^3).
4. The gradient ΔL used to update Q_m is computed using (21), where the second term has the highest complexity (equally high as regularization terms 4–6). Its complexity is O(2 d N^2 D_m ΣD). As this step is repeated for all modalities, the total complexity becomes O(2 d N^2 (ΣD)^2).
5. Updating the projection matrices has O(d ΣD) complexity.
6. The complexity of QR decomposition for a single modality is O(d D_m^2) [41]. Thus, the overall complexity of the QR decompositions for all the modalities is O(d ΣD^2).
Dropping the relatively less intensive computational steps and adding the rest, the full complexity of the proposed method reduces to O(min(N^2 ΣD, ΣD^2 N) + ΣD^3 + M^3 N^3). Assuming that the total number of samples M·N is always greater than ΣD and M ≪ N, the time complexity of (a single iteration of) our proposed algorithm in terms of big O notation is O(N^3). In the testing phase, each representation of a test sample in each modality is projected to the d-dimensional subspace and then its distance is compared to R. This has the total complexity of O(d ΣD + M d).
For the non-linear version with NPT, the kernel matrix K_m is first formed, which has the complexity of O(D_m N^2). Then the kernel matrix is centralized and decomposed by using eigendecomposition. Both of these steps have the complexity of O(N^3). As the data dimensionality in the remaining steps of the proposed method changes from D_m to N, the total complexity of the remaining steps becomes O(M N^3 + M^3 N^3). Thus, the overall complexity in terms of big O notation remains at O(N^3) for M ≪ N, while in practice the computational complexity is higher (by a scalar multiplier c) than for the linear version. Also for the non-linear version with the standard kernel trick, the overall complexity remains the same, but the kernel mapping is repeated at every iteration and, thus, the scalar c becomes larger for the overall training process. The testing complexity of the non-linear methods increases to O(N ΣD + d M N + M d).
4. Experiments

4.1. Datasets and preprocessing

To evaluate the proposed method, we performed different sets of experiments over 5 datasets. The Robot Execution Failures dataset, the Single Proton Emission Computed Tomography (SPECTF) heart dataset, and the Ionosphere dataset were downloaded from the UC Irvine (UCI) machine learning repository [34]. The Caltech-7 dataset and the Handwritten dataset were downloaded from a repository for multi-view learning [42]. The details of the datasets and experiments are as follows.
The first set of experiments was performed on the Robot Execution Failures dataset [43]. In the Robot Execution Failures dataset, force and torque measurements are collected at regular intervals of time after a task failure is detected. The dataset is divided into five different learning problems (LP) corresponding to different triggering events:
• LP1: Failures in approach to grasp position
• LP2: Failures in the transfer of a part
• LP3: Position of the part after a transfer failure
• LP4: Failures in approach to ungrasp position
• LP5: Failures in motion with part
The total number of instances and the distribution of the classes are given in Table 1. All instances are given as 15 samples collected at 315 ms regular time intervals for each sensor. For this dataset, we consider all the instances belonging to the normal class as the target class and the remaining classes as the non-target data. Hence, we have two modalities (torque and force measurements), and we consider the dataset as a one-class classification problem.
The second set of experiments was performed on the SPECTF heart dataset [44]. The SPECTF heart dataset consists of two sets of features corresponding to rest and stress condition SPECTF images of different subjects. The training set consists of 40 examples diagnosed as healthy heart muscle perfusions and 40 diagnosed as pathological perfusions. The test set consists of 15 instances of healthy heart muscle perfusions and 172 instances diagnosed as pathological perfusions. We convert this to a multimodal one-class classification problem by considering the rest and stress conditions as different modalities and by selecting the healthy heart muscle perfusions as our target class.
The third set of experiments was performed on the Caltech-7 dataset. We used Gabor features and wavelet moments as our two different modalities. The dataset contains 1474 samples in total from 7 different classes. We selected faces (435 samples) as our target class and the rest of the classes all together (1039 samples) as the outlier class.
We used the Ionosphere dataset for the fourth set of experiments. The categories in this dataset are described by two attributes per pulse number, resulting from the complex electromagnetic signal processed by an autocorrelation function. We used the two attributes (real and complex) for each pulse as two different modalities and the attribute "good" as our target class. The total number of samples in this dataset is 351, out of which 225 are from the target class (good), and the remaining 126 samples are from the outlier class (bad).
For the fifth set of experiments, we used the Handwritten dataset. We considered the samples of numeral 0 as the target. In the Handwritten dataset, the total number of samples is 2000, out of which 200 are from the target class. The remaining 1800 samples are considered as the outlier class. We used the Zernike moment (ZER) and morphological (MOR) features as our two different modalities.