Automated joint skull-stripping and segmentation with Multi-Task U-Net in large mouse brain MRI databases


2021


De Feo, Riccardo

Elsevier BV


© 2021 The Authors

CC BY http://creativecommons.org/licenses/by/4.0/

http://dx.doi.org/10.1016/j.neuroimage.2021.117734

https://erepo.uef.fi/handle/123456789/25864


NeuroImage


Automated joint skull-stripping and segmentation with Multi-Task U-Net in large mouse brain MRI databases

Riccardo De Feo ^a,b,c,∗^, Artem Shatillo ^d^, Alejandra Sierra ^c^, Juan Miguel Valverde ^c^, Olli Gröhn ^c^, Federico Giove ^b,e^, Jussi Tohka ^c^

^a^ Sapienza Università di Roma, Rome 00184, Italy

^b^ Centro Fermi–Museo Storico della Fisica e Centro Studi e Ricerche Enrico Fermi, Rome 00184, Italy

^c^ A.I. Virtanen Institute for Molecular Sciences, University of Eastern Finland, Kuopio 70210, Finland

^d^ Charles River Discovery Services, Kuopio, Finland

^e^ Fondazione Santa Lucia IRCCS, Rome 00179, Italy

Article info

Keywords:

MRI Brain Segmentation Deep learning U-Net Mice

Abstract

Skull-stripping and region segmentation are fundamental steps in preclinical magnetic resonance imaging (MRI) studies, and these common procedures are usually performed manually. We present Multi-task U-Net (MU-Net), a convolutional neural network designed to accomplish both tasks simultaneously. MU-Net achieved higher segmentation accuracy than state-of-the-art multi-atlas segmentation methods with an inference time of 0.35 s and no pre-processing requirements.

We trained and validated MU-Net on 128 T2-weighted mouse MRI volumes as well as on the publicly available MRM NeAt dataset of 10 MRI volumes. We tested MU-Net with an unusually large dataset combining several independent studies consisting of 1782 mouse brain MRI volumes of both healthy and Huntington animals, and measured average Dice scores of 0.906 (striati), 0.937 (cortex), and 0.978 (brain mask). Further, we explored the effectiveness of our network in the presence of different architectural features, including skip connections and recently proposed framing connections, and the effects of the age range of the training set animals.

These high evaluation scores demonstrate that MU-Net is a powerful tool for segmentation and skull-stripping, decreasing inter- and intra-rater variability of manual segmentation. The MU-Net code and the trained model are publicly available at https://github.com/Hierakonpolis/MU-Net.

1. Introduction

Preclinical imaging studies serve a fundamental role in biological and medical research, relating research results at the molecular level to clinical application in diagnosis and therapy. Magnetic Resonance Imaging (MRI) represents approximately 23% of all small-animal imaging studies, providing the opportunity to monitor the development of pathological conditions and responses to treatment in a non-invasive way (Cunha et al., 2014). Its unique qualities also include the availability of different imaging contrasts, rendering MRI extremely useful in the context of preclinical neuroscience with applications from drug development (Matthews et al., 2013) to basic research (Febo and Foster, 2016).

Skull-stripping and region segmentation represent an integral part of processing pipelines in murine MR imaging (Anderson et al., 2019; Calabrese et al., 2015). Skull-stripping refers to the identification of the brain within the MRI volume, and region segmentation refers to the labeling of specific anatomical regions of interest (ROIs) within the brain. In preclinical MRI, these tasks are often performed manually. While manual segmentation represents the gold standard and is employed as the ground truth when evaluating automated segmentation algorithms, it is time-consuming and depends on the expertise of the annotators performing the segmentation. Furthermore, manual segmentation suffers from both intra- and inter-rater variability, both in small animal (Ali et al., 2005) and human MRI (Entis et al., 2012; Yushkevich et al., 2006).

∗ Corresponding author at: Sapienza Università di Roma, 00184 Rome, Italy. E-mail address: riccardo.defeo@uniroma1.it (R. De Feo).

Received 17 August 2020; Received in revised form 9 December 2020; Accepted 7 January 2021; Available online 14 January 2021

In preclinical MRI, state-of-the-art automated region segmentation pipelines are based on atlas registration: individual MRI volumes are aligned with a labeled template (atlas) and the labels propagated to the individual volumes (De Feo and Giove, 2019; Lerch et al., 2011; Pagani et al., 2016; Schwarz et al., 2006; Sharief et al., 2008). The accuracy of registration-based segmentation depends on both the suitability of the template and the registration algorithm. The segmentation accuracy can be improved by multi-atlas strategies, where multiple atlases are registered to the same volume and the resulting segmentation maps are combined, for example, via majority voting. Regarding multi-atlas strategies in mouse MRI, Bai et al. (2012) compared different single and multi-atlas methods for atlas-based segmentation of the mouse brain and reported that the combination of a diffeomorphic registration algorithm and multi-atlas segmentation provided the most accurate results.

Ma et al. (2014) demonstrated that the multi-atlas methods are superior to single-atlas methods and the STEPS procedure for combining segmentations (Cardoso et al., 2013) brings advantages over earlier combination methodologies. While multi-atlas segmentation accounts for individual variability more effectively than single-atlas segmentation, it also requires multiple labeled atlases and multiple registration steps, significantly increasing the segmentation time. Multi-atlas segmentation can be further combined with the construction of a Minimum Deformation Template (MDT) as an intermediate step in the processing pipeline (Avants et al., 2010; De Feo and Giove, 2019; Kovačević et al., 2004). An MDT minimizes the deformation required to adapt it to each individual volume, thus reducing errors when its labels are propagated to each target scan.

Instead of directly employing one or more manually segmented atlases, deep neural networks (DNNs) (LeCun et al., 2015) can use these as training data to learn a mapping function from the images to the segmentation maps. In this way, the anatomical information is not explicitly represented in a set of maps but implicitly encoded in the trained network. DNNs, and in particular Convolutional Neural Networks (CNNs), have been successfully applied in a large number of computer vision tasks in medical imaging. For example, Wachinger et al. (2018) developed a region segmentation CNN significantly outperforming state-of-the-art, registration-based methods for the healthy human brain MRI, both in terms of inference time and accuracy. Roy et al. (2018a) further improved on both aspects with a network based on the U-Net architecture (Ronneberger et al., 2015), with a reported segmentation time of 20 s per brain scan. However, within small-animal MRI, the applications of CNNs have been limited to skull-stripping: Roy et al. (2018b) trained a CNN algorithm based on Google Inception (Szegedy et al., 2015) for the skull-stripping in humans and mice after traumatic brain injury, achieving better performance than other state-of-the-art methods (3D Pulse Coupled Neural Networks (3D-PCNN) (Chou et al., 2011) and Rapid Automatic Tissue Segmentation (RATS) (Oguz et al., 2014)).

A specific type of CNN architecture, U-Net, has proved to be valuable in biomedical image segmentation. U-Net is based on the encoder/decoder structure, adding skip connections between the encoder and the decoder branches, allowing it to easily integrate multi-scale information and better propagate the gradient during training. This architecture has been shown to generalize even from a limited amount of annotated data (Xie et al., 2015), and as such is well suited for medical imaging, where datasets as large as the ones commonly used for CNNs are rare. Valverde et al. (2019) recently demonstrated the effectiveness of U-Net-like architectures in preclinical research, designing the first DNN for the segmentation of ischemic lesions in rodents and achieving segmentation accuracy comparable to or better than inter-rater agreement in manual segmentation.

In this work, we introduce multi-task U-Net (MU-Net) to simultaneously perform skull-stripping and region segmentation of the mouse brain, based on the U-Net architecture. We refer to our approach as multi-task as we consider skull-stripping and region segmentation as separate tasks, allowing for the complete delineation of the brain volume regardless of the choice of ROIs. While these tasks are often considered as separate in the context of murine brain segmentation, they are strongly related. Therefore, our approach is not multi-task learning in the stronger sense of providing two fundamentally different outputs, e.g., segmentation and classification (Yang et al., 2017).

Our main train and validation data consisted of 128 T₂ MRI volumes from 32 mice at 4 different ages as well as five manually annotated regions (cortex, hippocampi, ventricles, striati and brain mask) from these images. This dataset represents MR images typically employed in drug development. We demonstrate that with this data MU-Net achieves a significantly higher accuracy than state-of-the-art multi-atlas segmentation methods (Cardoso et al., 2013; Ma et al., 2014) in a fraction of the segmentation time (approximately 0.35 s). We trained MU-Net on 128 MRI volumes and tested on an independent dataset of 1782 volumes acquired over the course of four years from both wild type (WT) and Huntington (HT) C57BL/6J mice, allowing us to evaluate MU-Net in a variety of experimental conditions. Additionally, we trained MU-Net for the segmentation of mouse brain MRI with isotropic voxels into 37 ROIs and demonstrate that the segmentation accuracy of MU-Net was equal to or better than that of a state-of-the-art multi-atlas segmentation method (Ma et al., 2014).

Table 1

Summary characteristics of the three datasets employed in this study. BM refers to brain mask. The test dataset included various genotypes of both sexes (see Supplementary Table S1 for details).

| Dataset name | # Animals | # MRIs | # ROIs | Type |
|---|---|---|---|---|
| Train and validation | 32 | 128 | 4 + BM | WT males |
| Test | 817 | 1,782 | 2 + BM | various |
| MRM NeAt | 10 | 10 | 37 + BM | WT males |

2. Materials and methods

2.1. Materials

We utilized three different datasets in this work as summarized in Table 1 and detailed in the following subsections.

2.1.1. Animals: train, validation and test sets

A total of 849 mice (Charles River Laboratories, Germany) were used: 32 mice for the train and validation set and 817 mice for the test set. Train and validation set animals were scanned at four different ages (5 weeks, 12 weeks, 16 weeks, 32 weeks) resulting in 128 volumes. All train and validation set animals were WT males.

The test set animals were part of 10 studies scanned at a single or multiple ages from 4 up to 60 weeks, and included both WT and several HT genotypes: R6/2, Q175, Q175DN, Q111, Q50 and Q20 (Supplementary Table S1), for a total of 1782 MRI scans. The groups included both males and females. These volumes were acquired as part of ten studies of Huntington's disease, kindly provided by the CHDI 'Cure Huntington's Disease Initiative' foundation.

All mice were housed in groups of up to 4 per cage (single sex) in a temperature (22 ± 1 °C) and humidity (30–70%) controlled environment with a normal light-dark cycle (7:00–20:00).

2.1.2. MRI: train, validation and test sets

Mice were anesthetized using isoflurane (5% for induction, 1.5–2% maintenance) in 70%/30% mix of N₂/O₂ carrying gas, fixed to a head holder and positioned in the magnet bore in a standard orientation relative to gradient coils. Respiration rate and temperature were monitored using PC-SAMS software and Model 1030 Monitoring & Gating System, Small Animal Instruments, Inc., Stony Brook, NY. The temperature was maintained at ∼37 °C using Small Animal Instruments feedback water heating system.

All acquisitions were performed using a horizontal 11.7 T magnet with a bore size of 160 mm, equipped with a gradient set capable of maximum gradient strength of 750 mT/m and interfaced to a Bruker Avance III console (Bruker Biospin GmbH, Ettlingen, Germany). A volume coil (Bruker Biospin GmbH, Ettlingen, Germany) was used for transmission and a surface phased array coil for receiving (Rapid Biomedical GmbH, Rimpar, Germany). T₂ weighted anatomical images were acquired using a TurboRARE sequence with effective TR/TE = 2500/36 ms, 8 echoes, 12 ms inter-echo distance, matrix size 256 x 256, FOV 20.0 x 20.0 mm², 31 coronal slices of 0.6 mm thickness, −0.15 mm interslice gap, and 8 averages. Concerning the test data, MRI experimental parameters only differed in acquiring 19 contiguous coronal slices of 0.7 mm thickness.

Fig. 1. General outline of the architectural features implemented and compared in the networks discussed, varying according to the presence or absence of the in-block dense connections (purple arrows in the convolutional block), presence or absence of the layer subtraction connections (black), and the use of 2D or 3D filters.

Volumes within each study were manually segmented by an experienced rater, who had received training and passed the qualification tests according to SOP (Standard Operating Procedure) for volumetric analysis in mice. Different studies were analyzed by different raters. Each training volume was manually segmented by a single rater drawing the brain mask and delineating 4 regions of interest: cortex, hippocampi, striati and ventricles. The brain mask did not include the olfactory bulb or the cerebellum. For the test set, only 3 regions were manually labeled: brain mask, cortex and striati. As each image was only segmented once by a single rater, intra- and inter-rater overlap statistics are not available for our dataset. Manual segmentation required from 10 to 15 min per ROI per image.

2.1.3. MRM NeAt dataset

The MRM NeAt dataset includes atlases of 10 individual T₂*-weighted in vivo brain MR images of 12–14 weeks old C57BL/6J mice, each with 37 labelled anatomical structures (listed in Fig. 4) in addition to the brain mask (Ma et al., 2008). This dataset was downloaded from https://github.com/dancebean/mouse-brain-atlas, where an improved atlas is available (bias correction has been applied, left and right labels have been separated and 4th ventricle label added). This dataset was used to evaluate the STEPS algorithm by Ma et al. (2014) and is used here for the purpose of comparing MU-Net and STEPS on a larger number of ROIs on isotropic resolution MRI. As detailed in Ma et al. (2008), T₂-weighted MR data with a voxel-size of 0.1 mm³ requiring about 2.8 h of scan time were acquired with a 3D large flip angle spin echo sequence using a super-conducting 9.4 T/210 mm horizontal bore magnet (Magnex) controlled by an ADVANCE console (Bruker) and equipped with an actively shielded 11.6 cm gradient set (Bruker, Billerica, MA).

2.2. MU-Nets

2.2.1. Architectures

MU-Net (Fig. 1) presents an encoder-decoder U-Net-like architecture, with each branch articulated in four convolutional blocks. Unlike U-Net, the final block of the decoder branch further bifurcates into two different output maps representing our two tasks, sharing the same feature representation. Each convolutional block on the encoding path is followed by a 2x2 max-pooling layer. The last feature map feeds into the bottleneck layer, a 64-channel 5x5 convolutional layer with batch normalization (Ioffe and Szegedy, 2015) connecting the deepest layer of the encoding path with the decoding path.

The decoding path is composed of 4 more blocks alternating one un-pooling layer (Noh et al., 2015) and one convolutional block. Un-pooling operations effectively replace up-convolution layers in U-Net without any learnable parameters, while preserving spatial information. These layers operate by simply placing the elements of the un-pooled feature maps in the position of the respective maximum activation from the corresponding pooling operation, and setting the rest to zero. Skip connections concatenate the output of each dense layer in the encoding path with the respective un-pooled feature map of the same size before feeding it as input to the decoding convolutional block.
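As an illustration of the un-pooling step described above, the following NumPy sketch pools a single 2D feature map with a 2x2 window, stores the position of each maximum, and then places values back at those positions while setting the rest to zero. The function names `max_pool_2x2` and `max_unpool_2x2` are hypothetical and are not taken from the MU-Net code, and the sketch assumes even image sides. In the network, the values placed back would come from the decoder feature map rather than from the pooled encoder map itself.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max-pooling of a 2D array with even sides.

    Returns the pooled map and, for each window, the flat index (0..3)
    of the maximum inside that window.
    """
    h, w = x.shape
    # Group pixels into (h/2, w/2) windows of 4 values each.
    blocks = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3)
    blocks = blocks.reshape(h // 2, w // 2, 4)
    idx = blocks.argmax(axis=-1)
    pooled = np.take_along_axis(blocks, idx[..., None], axis=-1)[..., 0]
    return pooled, idx

def max_unpool_2x2(values, idx):
    """Place each value at the stored argmax position; all other entries are zero."""
    h2, w2 = values.shape
    blocks = np.zeros((h2, w2, 4), dtype=values.dtype)
    np.put_along_axis(blocks, idx[..., None], values[..., None], axis=-1)
    # Undo the window grouping to recover the (2*h2, 2*w2) layout.
    return blocks.reshape(h2, w2, 2, 2).transpose(0, 2, 1, 3).reshape(2 * h2, 2 * w2)

feature_map = np.array([[1., 2., 0., 0.],
                        [3., 4., 0., 5.],
                        [9., 0., 1., 1.],
                        [0., 0., 1., 7.]])
pooled, positions = max_pool_2x2(feature_map)
restored = max_unpool_2x2(pooled, positions)
```

Because the operation only reuses stored indices, it adds no learnable parameters, which matches the motivation given in the text.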

The output of the last decoding layer acts as the input of two different classification layers, which share the same feature representation up to this point: a 1x1 single-channel convolution with a sigmoid activation function, and a 1x1 5-channel layer followed by a softmax activation function, for the skull-stripping task and the region classification task, respectively.

Convolutional block

Each convolutional block includes 3 convolutional layers preceded by leaky ReLU activation (Maas et al., 2013) layers and batch normalization. All 3 convolutions are padded and result in 64 output channels, in analogy with Roy et al. (2018a). The first and second convolutions employ 5x5 filters, while the third uses a 1x1 filter. This becomes especially relevant in the presence of dense connections, acting as a bottleneck for the 64x3 channels of the concatenated inputs and compressing the size of the feature maps.

2.2.2. Architectural variants

We study several variations to the basic network architecture.

Dense connections

In the models including dense connections (Huang et al., 2017) we modify each convolutional block by concatenating to the input of each convolution the outputs of the previous convolutions within the same block (Fig. 1).

Dual framing connections

Dual framing connections refer to additional skip connections in the Dual Frame U-Net model. Han and Ye (2018) proposed this architecture for computed tomography reconstruction from sparse data based on signal processing arguments to reduce artifacts and improve recovery of high frequency edges. Dual framing connections consist in the subtraction of the input of each convolutional block on the encoding path from the output of the respective convolutional block of the same size on the decoding path, and as such the implementation of these connections does not increase the number of model parameters.

3D implementation

A 3D implementation could, in principle, provide better results by taking into account the features of the adjacent slices, whereas a 2D network evaluates each coronal slice independently. However, the larger number of parameters also increases the risk of overfitting, and the lower resolution in the anterior-posterior axis compared to the in-plane resolution might constitute confounding factors in the presence of 3D pooling operations.

For these reasons, we compared 2D and 3D implementations of our network, using 5x5x5 filters and 2x2x2 max-pooling layers, replacing the filters and pooling layers described above. This results in 16,008,076 and 10,286,344 parameters for the 3D networks with and without in-block skip connections, respectively. Corresponding 2D networks contain 3,297,676 and 2,087,944 parameters, respectively. Thus, opting for a 3D architecture increases the number of parameters by factors of 4.85 and 4.93 as compared to the 2D architectures. The total number of parameters was measured by using the PyTorch instruction `sum(p.numel() for p in model.parameters())`. A complete breakdown of model parameters for each network is available in supplementary Table S2.

2.2.3. Loss function

Recent literature suggests that Dice-based loss functions (Milletari et al., 2016; Roy et al., 2018a; Sudre et al., 2017) would constitute an improvement over cross-entropy losses for the segmentation of medical images (Karimi and Salcudean, 2019). We optimized a joint loss function $L$, that is the sum of two Dice loss functions corresponding to the skull-stripping ($L_{SS}$) and the region classification task ($L_{RS}$). Let $p(i)$ be the predicted probability of voxel $i$ of belonging to the brain mask, and $g(i)$ the ground truth for voxel $i$ ($g(i) = 1$ if the voxel is in the brain mask). Further, let $p_l(i)$ and $g_l(i)$ be the same quantities for label $l$ ($l = 1, \ldots, K$) encoding the ground truth as a one-hot vector. Then, the loss function can be written as:

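$$L = L_{SS} + L_{RS}, \tag{1}$$

$$L_{SS} = -\frac{2\sum_i p(i)\,g(i)}{\sum_i p^2(i) + \sum_i g^2(i)}, \tag{2}$$

$$L_{RS} = -\sum_{l=1}^{K} \frac{2\sum_i p_l(i)\,g_l(i)}{\sum_i p_l^2(i) + \sum_i g_l^2(i)}, \tag{3}$$

where $K$ is the number of labels (ROIs) plus the background class.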
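The joint loss of Eqs. (1)–(3) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the training code: the function names `dice_term` and `joint_dice_loss` are hypothetical, and the small constant `eps` in the denominator is an assumption added here to avoid division by zero for empty inputs; it does not appear in the equations above.

```python
import numpy as np

def dice_term(p, g, eps=1e-8):
    """Negative Dice term: -2 * sum(p * g) / (sum(p^2) + sum(g^2)).

    `eps` is an assumed stabiliser, not part of Eqs. (2)-(3).
    """
    p = np.asarray(p, dtype=np.float64).ravel()
    g = np.asarray(g, dtype=np.float64).ravel()
    return -2.0 * np.sum(p * g) / (np.sum(p ** 2) + np.sum(g ** 2) + eps)

def joint_dice_loss(p_mask, g_mask, p_labels, g_onehot):
    """L = L_SS + L_RS, with the label axis first in `p_labels` and `g_onehot`."""
    l_ss = dice_term(p_mask, g_mask)
    l_rs = sum(dice_term(p_l, g_l) for p_l, g_l in zip(p_labels, g_onehot))
    return l_ss + l_rs

# Toy example: 4 voxels, K = 2 classes (one ROI plus background).
g_mask = np.array([1.0, 1.0, 0.0, 0.0])
g_onehot = np.array([[0.0, 0.0, 1.0, 1.0],   # background
                     [1.0, 1.0, 0.0, 0.0]])  # ROI
perfect = joint_dice_loss(g_mask, g_mask, g_onehot, g_onehot)
```

With a perfect prediction each Dice term equals −1, so the loss approaches −(K + 1); a prediction with no overlap gives 0 for the corresponding term.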

2.2.4. Training

The networks were implemented using the PyTorch framework and trained with stochastic gradient descent using the Adam optimizer (Kingma and Ba, 2014) with the default parameters (the initial learning rate of 0.001, β₁ = 0.9, β₂ = 0.999 and no weight decay) on an NVIDIA GeForce GTX 1080 GPU for up to 12 h (train and validation) or on an NVIDIA Volta V100 GPU for up to 24 h (MRM NeAt). Each network was trained with a batch size of one. Qualitatively, the training pace of 2D and 3D networks was substantially the same, as evidenced in supplementary Fig. S1.

We augmented the data online each time an image was loaded by scaling the volumes by a factor α randomly drawn from the interval [0.95, 1.01] and rotating them around each axis by a random angle between −5° and 5°. Scaling factors smaller than one were preferred to decrease memory requirements. Each transformation was applied with 50% probability. To further decrease memory requirements, a bounding box was created for each volume using the annotated brain mask as a reference. Each volume was individually normalized to 0 mean and unit variance. Hyperparameters, optimizer and data augmentation scheme were fixed before training ensuring that each architecture would fit into memory, and applied to each network with no additional fine tuning.
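The sampling of augmentation parameters and the per-volume normalization described above might look as follows. This is only a sketch under stated assumptions: the names `sample_augmentation` and `normalize_volume` are hypothetical, the scaling and the rotation are each assumed to be switched on independently with probability 0.5, and the three rotation angles are assumed to be drawn independently. The resampling of the volume itself (interpolation) is not shown.

```python
import numpy as np

def sample_augmentation(rng):
    """Draw one scale factor and three rotation angles (degrees).

    Assumption: scaling and rotation are each applied with 50% probability,
    independently of one another; when skipped, the identity value is returned.
    """
    scale = rng.uniform(0.95, 1.01) if rng.random() < 0.5 else 1.0
    angles = rng.uniform(-5.0, 5.0, size=3) if rng.random() < 0.5 else np.zeros(3)
    return scale, angles

def normalize_volume(volume):
    """Normalize a single volume to zero mean and unit variance.

    A constant volume (zero variance) is not handled in this sketch.
    """
    volume = np.asarray(volume, dtype=np.float64)
    return (volume - volume.mean()) / volume.std()

rng = np.random.default_rng(0)
scale, angles = sample_augmentation(rng)
normalized = normalize_volume(np.arange(24.0).reshape(2, 3, 4))
```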

2.2.5. Auxiliary bounding-box network

As MU-Net was trained after cropping the volumes to a bounding box, we trained a lighter 2D network to run a first estimate for the brain mask at inference time from the complete volume. This was then used to draw a bounding box around the brain with one voxel margin. This auxiliary network follows exactly the same architecture of MU-Net, omitting any framing or dense connections, and limiting the number of channels to 4, 8, 16 and 32, from the shallowest to the deepest layer. This results in a network with a total number of 122,455 trained parameters.
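Deriving the crop from an estimated brain mask can be sketched as below. The helper name `bounding_box` is hypothetical, and clipping the margin at the volume border is an assumption of this sketch; the text only specifies a one-voxel margin.

```python
import numpy as np

def bounding_box(mask, margin=1):
    """Return a tuple of slices enclosing all non-zero voxels plus a margin.

    The box is clipped to the array bounds (an assumption of this sketch).
    """
    mask = np.asarray(mask, dtype=bool)
    coords = np.argwhere(mask)
    if coords.size == 0:
        raise ValueError("mask is empty, no bounding box can be drawn")
    lower = np.maximum(coords.min(axis=0) - margin, 0)
    upper = np.minimum(coords.max(axis=0) + margin + 1, mask.shape)
    return tuple(slice(int(lo), int(hi)) for lo, hi in zip(lower, upper))

# Toy volume: a 2x2x2 "brain" inside a 6x8x8 image.
estimated_mask = np.zeros((6, 8, 8), dtype=bool)
estimated_mask[2:4, 3:5, 1:3] = True
box = bounding_box(estimated_mask, margin=1)
cropped = estimated_mask[box]
```

The same tuple of slices can be applied to the image volume, so that the main network only sees the cropped region.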

2.3. STEPS multi-atlas segmentation

STEPS is a state of the art label fusion algorithm to combine multiple registered templates to label a target volume (Cardoso et al., 2013). It takes into account the local and global image matching, combining an expectation-maximization approach with Markov Random Fields to improve on the segmentation based on the quality of the registration itself.

The registrations were performed as follows: before registration, each volume underwent non-parametric N3 bias field correction (Sled et al., 1998) implemented within the ANTS toolset (Avants et al., 2009). Taking each volume as reference, all other volumes were then registered with an affine transformation using FSL FLIRT (Jenkinson and Smith, 2001) and then nonlinearly registered via FSL FNIRT (Andersson et al., 2007; Jenkinson et al., 2012) with the aid of the manually drawn brain mask. Label fusion was achieved with the STEPS algorithm distributed in the NiftySeg package (Cardoso et al., 2013; 2012).

We used correlation ratio (corratio) as the cost function in FLIRT and FNIRT. We used the default FLIRT and FNIRT parameters with the following exceptions. The search range of angles in FLIRT was [−70°, 70°] instead of the default [−90°, 90°], because the orientations of the volumes were similar. In FNIRT, we used spline interpolation instead of the default linear interpolation.

STEPS depends on the number of templates employed and the standard deviation of its Gaussian kernel. We performed a grid search to select the optimal parameters, randomly selecting 10 volumes and labeling them using STEPS. We sampled the standard deviation of the Gaussian kernels between 0.5 and 6 with a stride of 0.5, and the number of templates ranged between 1 and 20 randomly selected volumes. This same process has been performed both using diffeomorphic registration and using affine registration only (supplementary Fig. S2), selecting 16 templates and kernel standard deviation of 1.5 for the diffeomorphic case, and 18 templates with kernel standard deviation of 2.5 for the affinely registered volumes. Exploring both grids required in total 287 h.

Each volume was then segmented using these parameters, randomly selecting an appropriate number of mice as templates for the STEPS algorithm as emerged from the parameter grid search outlined above. We repeated this procedure randomly selecting the same number of templates from mice of the same age only. The mice randomly selected as reference atlases were selected from the training set associated to each volume according to the same 5-fold cross validation scheme used to train the CNNs as outlined in Section 2.5.

When evaluating STEPS on the MRM NeAt dataset, we used scripts provided by Ma et al. (2014) at https://github.com/dancebean/multi-atlas-segmentation as this implementation is optimized using this dataset.

The computations described here for the training and validation dataset were executed on a workstation equipped with a 6-core, 12-thread Intel Core i7-8700K CPU running at 3.70 GHz. To accelerate the computations generating several intermediate file outputs, we used a RAM disk to reduce the number of disk operations. For the NeAt dataset, computations were performed on a 12-core, 24-thread AMD Ryzen 9 3900X Processor.

2.4. Post-processing

The only post-processing steps applied on the segmentation maps were the filling of holes in the resulting 3D volume, the selection of the largest connected component as the brain mask for the skull-stripping task, and assigning all voxels predicted as non-brain to the background class.
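These three steps could be sketched with `scipy.ndimage` as follows. The function name `postprocess` is hypothetical, and several details are assumptions of this sketch rather than statements about the MU-Net code: the largest component is selected before holes are filled, hole filling is applied to the brain mask only, the default face connectivity of `ndimage.label` is used, and the background class is encoded as 0.

```python
import numpy as np
from scipy import ndimage

def postprocess(brain_mask, region_labels, background=0):
    """Keep the largest connected component, fill holes, and reset non-brain labels."""
    mask = np.asarray(brain_mask, dtype=bool)
    components, n_components = ndimage.label(mask)
    if n_components > 1:
        # Component sizes, skipping label 0 (the background of the labelling).
        sizes = np.bincount(components.ravel())[1:]
        mask = components == (1 + int(np.argmax(sizes)))
    mask = ndimage.binary_fill_holes(mask)
    # Voxels outside the final brain mask are assigned to the background class.
    regions = np.where(mask, region_labels, background)
    return mask, regions

# Toy prediction: a 3x3x3 cube with a one-voxel hole plus an isolated false positive.
predicted_mask = np.zeros((7, 7, 7), dtype=bool)
predicted_mask[1:4, 1:4, 1:4] = True
predicted_mask[2, 2, 2] = False
predicted_mask[6, 6, 6] = True
predicted_regions = np.full((7, 7, 7), 3)
clean_mask, clean_regions = postprocess(predicted_mask, predicted_regions)
```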

2.5. Validation and metrics

To assess the overlap between the ground truth and the predicted segmentation masks, we used the Dice coefficient as the primary performance measure (Dice, 1945). The Dice coefficient is defined as two times the size of the intersection over the sum of the sizes of the two regions:

$$D = \frac{2\,|Y_t \cap Y|}{|Y_t| + |Y|},$$

where by $Y$ we indicate our prediction and by $Y_t$ the ground truth. This coefficient ranges from 0, meaning no overlap, to 1, indicating a complete overlap between the two regions.

We further evaluated our results using the 95th percentile of the symmetric Hausdorff distance (HD95) (Huttenlocher et al., 1993). HD95 indicates the magnitude of the largest segmentation error compared to the ground truth, expressed in millimeters. We additionally computed precision (defined as $|Y_t \cap Y| / |Y|$) and recall (defined as $|Y_t \cap Y| / |Y_t|$). These measures provide complementary information to the Dice overlap.
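The overlap measures defined above translate directly into array operations on binary masks. The following is a minimal sketch; the function name `overlap_metrics` is hypothetical, and empty masks (which would lead to a division by zero) are not handled.

```python
import numpy as np

def overlap_metrics(prediction, truth):
    """Dice coefficient, precision and recall for two binary masks."""
    y = np.asarray(prediction, dtype=bool)
    y_t = np.asarray(truth, dtype=bool)
    intersection = np.logical_and(y, y_t).sum()
    dice = 2.0 * intersection / (y_t.sum() + y.sum())
    precision = intersection / y.sum()    # |Y_t ∩ Y| / |Y|
    recall = intersection / y_t.sum()     # |Y_t ∩ Y| / |Y_t|
    return dice, precision, recall

# Toy example: 3 predicted voxels, 4 true voxels, 2 in common.
dice, precision, recall = overlap_metrics([1, 1, 1, 0, 0, 0], [0, 1, 1, 1, 1, 0])
```

In this toy case the Dice coefficient is 4/7, while precision (2/3) and recall (1/2) show that the error is dominated by missed voxels, which is the kind of complementary information the text refers to.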

Each experiment on the train and validation dataset as well as the NeAt dataset (see Table 1) was validated according to a 5-fold cross validation (CV) scheme. Volumes were distributed in each fold according to the individual identity of each animal, preventing the use of the volumes from the validation animals for training. The animals were randomly assigned to each fold once, and the same animals remained assigned to their respective folds through all experiments. For the train and validation dataset, this resulted in a training set of 25 or 26 mice and a validation set of 6 or 7 mice in each fold. For the MRM NeAt dataset, 5-fold CV resulted in 8 volumes used for training (or as registration atlases) and 2 for testing in each fold. The test dataset was used as an external test set to evaluate MU-Net trained on the train and validation dataset.

Unless otherwise specified, we used a paired permutation test to evaluate the significance of differences between the Dice scores obtained by different methods, pairing the Dice scores obtained on the same MRI volumes. The unpaired permutation test was used instead when comparing results obtained on different volumes, for example, when comparing the accuracy of a model on volumes from younger mice with that of the same model on older mice, and for all comparisons on the test set. We performed permutation tests using 100,000 iterations, and considered average differences to be significant when p was smaller than 0.05. The unpaired permutation tests of Dice coefficients between different animal groups were performed by permuting animals (not images) between the two groups. This ensures exchangeability when several images of the same animal existed due to longitudinal designs in the test set.
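A paired permutation test of the kind described above can be sketched by randomly flipping the sign of each paired difference. The function name `paired_permutation_test` is hypothetical, and the two-sided formulation, the add-one correction of the p-value and the small numerical tolerance are assumptions of this sketch; the text specifies only the pairing and the number of iterations.

```python
import numpy as np

def paired_permutation_test(scores_a, scores_b, n_iter=100_000, seed=0):
    """Two-sided paired permutation (sign-flip) test on the mean difference."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(scores_a, dtype=np.float64) - np.asarray(scores_b, dtype=np.float64)
    observed = abs(diff.mean())
    # Under the null hypothesis the two scores of a pair are exchangeable,
    # so each difference is equally likely to carry either sign.
    signs = rng.choice(np.array([-1.0, 1.0]), size=(n_iter, diff.size))
    permuted = np.abs((signs * diff).mean(axis=1))
    # Add-one correction (an assumption) keeps the estimate strictly positive.
    return (np.sum(permuted >= observed - 1e-12) + 1) / (n_iter + 1)

# Toy Dice scores for two methods evaluated on the same 12 volumes.
method_b = np.linspace(0.80, 0.91, 12)
method_a = method_b + np.linspace(0.02, 0.05, 12)
p_value = paired_permutation_test(method_a, method_b, n_iter=5000)
```

For the unpaired, animal-level test described in the text, the group labels of whole animals would be shuffled instead of flipping signs; that variant is not shown here.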

3. Results

Using the train and validation dataset, we compared the performance of different network architectures. Furthermore, we compared MU-Net with multi-atlas segmentation on both our data and the MRM NeAt dataset, and evaluated the impact of mouse age on the accuracy of our segmentation maps. The experiments reported in Sections 3.1–3.3 are based on 5-fold CV on the train and validation set, and experiments in Section 3.4 on 5-fold CV on the MRM NeAt dataset. Finally, in Section 3.5, we tested MU-Net trained on the train and validation set on an independent test set that included 1782 MRI volumes from 817 mice.

3.1. Architecture comparison

We compared the performance of different networks trained with and without dense connections and dual framing connections, in both 2D and 3D implementations.

As shown in Table 2, all MU-Nets achieved Dice scores with the ground truth comparable to or higher than the typical inter-rater variability of manual segmentation in the mouse brain (Dice scores from 0.80 to 0.90 (Ali et al., 2005)). The skull-stripping task achieved an excellent Dice score of 0.984. The ventricles were characterized by the lowest segmentation performance (average Dice score 0.907), while the cortex displayed the highest overlap with the ground truth (average Dice score 0.966). Dice scores for each animal in all ROIs are provided as supplementary Table S3.

The network displaying the highest average Dice scores was, in fact, the simplest one, including neither in-block skip connections nor framing connections, and using 2D convolutions. The accuracy of this network was significantly higher than the accuracy of all other 2D networks (p < 0.00003). Because of its excellent performance and simplicity this network is our choice for the MU-Net architecture, which is the architecture we used for all experiments detailed in Sections 3.2 and 3.3.

The choice between 2D and 3D architectures was the most important factor in increasing performance, resulting in a marked increase in mean Dice scores for both tasks (p < 0.00001) between all 2D networks compared to the 3D ones. We further compared MU-Net with one featuring fewer channels per filter (49, 49, 50, 50, from the shallowest to the deepest convolutional block) to match the number of parameters to the number of parameters of the simplest 2D network. We measured a slightly (but not significantly, p = 0.077) lower accuracy compared to MU-Net, indicated as 2D SLP in Table 2.

To test whether the increased performance of 2D architectures compared to the 3D implementation depended on the reduced number of parameters or on an excessive loss of information when pooling in the anterior-posterior direction, we trained a network using 3D filters while limiting pooling operations to the coronal plane. This network achieved a segmentation accuracy in between the 3D and 2D implementations (Table 2), suggesting that both abovementioned aspects were relevant in increasing the algorithm's performance.

We studied the effect of bias field correction on the performance of MU-Net by training it on images without bias correction and, separately, on N3 bias-corrected MR images (Sled et al., 1998). The validation accuracy achieved with bias correction was indistinguishable from the accuracy of MU-Net trained without bias correction (see Table 2).

3.2. Age stratified training sets

We evaluated the performance of MU-Net when restricting the training set to mice of a specific age. Networks trained on data from mice of 12, 16 and 32 weeks achieved higher accuracy, both on their respective validation set and the overall ground truth, compared to the networks trained on 5 weeks mice (p < 0.00001). As shown in Fig. 5, while all networks trained on one specific age displayed a statistically significant (p < 0.05, unpaired) decrease in mean accuracy when validated on animals of a different age, this difference was highest between the 5 weeks data and the other datasets.

Limitingthetrainingdatatoonespeciﬁcageimpliesthatthesenet- worksweretrainedonlyonaquarterofthedatausedtotrainthenet- worksinSection3.1.Irrespectiveofthat,thesenetworksstillachieved averageDicescoreonthemixed-agevalidationdatasetcomparablewith

(7)

Table2

CNNandSTEPSaccuraciesmeasuredusingDicecoeﬃcientacrossdiﬀerentmethodologicalchoices.Cross-validation resultsonthetrainandvalidationdataset.

Dim SC FC Brain mask Cortex Hippocampi Ventricles Striati ROI mean 2D 0.984 ± 0.005 0.966 ± 0.009 0.925 ± 0.017 0.907 ± 0.020 0.939 ± 0.010 0.935 ± 0.026 2D x x 0.984 ± 0.006 0.963 ± 0.010 0.924 ± 0.016 0.905 ± 0.022 0.937 ± 0.009 0.932 ± 0.026 2D x 0.984 ± 0.006 0.963 ± 0.011 0.924 ± 0.017 0.905 ± 0.022 0.938 ± 0.009 0.932 ± 0.026 2D x 0.984 ± 0.005 0.964 ± 0.011 0.923 ± 0.018 0.905 ± 0.024 0.937 ± 0.010 0.932 ± 0.027 3D x x 0.982 ± 0.007 0.956 ± 0.016 0.914 ± 0.033 0.900 ± 0.025 0.926 ± 0.045 0.924 ± 0.038 3D x 0.982 ± 0.007 0.958 ± 0.016 0.916 ± 0.032 0.900 ± 0.025 0.928 ± 0.029 0.925 ± 0.034 3D x 0.982 ± 0.006 0.957 ± 0.016 0.913 ± 0.041 0.899 ± 0.028 0.926 ± 0.042 0.924 ± 0.040 3D 0.982 ± 0.007 0.957 ± 0.013 0.916 ± 0.033 0.899 ± 0.026 0.926 ± 0.039 0.924 ± 0.036 3DConv 2DPool 0.983 ± 0.006 0.961 ± 0.010 0.919 ± 0.026 0.902 ± 0.026 0.934 ± 0.014 0.929 ± 0.030 2D SLP 0.984 ± 0.005 0.965 ± 0.009 0.924 ± 0.016 0.907 ± 0.021 0.939 ± 0.010 0.934 ± 0.026 2D + N3 0.984 ± 0.005 0.965 ± 0.009 0.924 ± 0.020 0.907 ± 0.020 0.939 ± 0.009 0.934 ± 0.026 STEPS (affine) \ 0.920 ± 0.058 0.827 ± 0.079 0.761 ± 0.090 0.873 ± 0.062 0.845 ± 0.093 STEPS (diffeo) \ 0.948 ± 0.036 0.844 ± 0.048 0.812 ± 0.090 0.871 ± 0.045 0.869 ± 0.070 STEPS ^∗(affine) \ 0 . 936 ± 0 . 013 0 . 831 ± 0 . 029 0 . 781 ± 0 . 049 0 . 887 ± 0 . 019 0 . 859 ± 0 . 066 STEPS ^∗(diffeo) \ 0 . 954 ± 0 . 009 0 . 848 ± 0 . 025 0 . 826 ± 0 . 039 0 . 885 ± 0 . 016 0 . 879 ± 0 . 
055 Majority Voting \ 0.889 ± 0.179 0.780 ± 0.232 0.677 ± 0.208 0.816 ± 0.245 0.791 ± 0.230 ListedvaluesaretheaveragevalidationDicescoresbetweenautomaticandmanualsegmentation±standardde- viationsoftheseDicescoresin5-foldCV.ROImeancolumnreferstothemeanDicecoefficientofthecortex,the hippocampi,theventriclesandthestriati.SCandFCindicatethepresenceofskipconnectionandframingconnec- tions.MU-Netresultsaredisplayedinthefirstrow.STEPSreferstoSTEPSusingrandomlyselectedtemplates;STEPS^∗ referstoSTEPSrunsusingrandomlyselectingmiceofthesameageonly;affineindicatesthatonlyaffineregistration wasused,whereasdiffeoindicatesthiswasfollowedbyadiffeomorphicregistrationstep;Majorityvotingrefers totheselectionofthemostoccuringlabelafterdiffeomorphicregistration;3DConv2DPool:networkfeaturingno in-blockskipconnectionsorframingconnections,with3Dfilteringand2Dpoolinginthecoronalplane;2DSLP:

2Dnetworkwithin-blockskipconnectionsandalimitednumberofparameters;2D+N3:2Dnetworktrainedon databias-correctedusingtheN3algorithm.Boldfacecharactersindicatethebestperformingnetwork,achieving signiﬁcantlyhigherDicescoresthanallothernetworksforthatROI.

the accuracy of manual segmentation. The worst performing CNN was the network trained on 5 weeks old mice. Training on the 12, 16 and 32 weeks data and validating on mice of the same age, we observed Dice scores comparable with the overall performance of MU-Net trained on the entire dataset (𝑝 > 0.15, unpaired). However, we measured a lower overall performance when including mice of all ages in the validation data (𝑝 < 0.00001), slightly overfitting for each specific age.

3.3. Comparison with multi-atlas segmentation

We compared MU-Net with multi-atlas segmentation, applying the state-of-the-art STEPS (Cardoso et al., 2013; 2012) label fusion method to combine the labels obtained from the registration of multiple labeled volumes. This was implemented using the Niftyseg package as described in Section 2.3. We repeated this procedure using both diffeomorphic and affine registration methods, with randomly-selected templates restricted to same-age mice. The brain mask segmentation was not evaluated as the manually drawn mask was used during the diffeomorphic registration procedure.

MU-Net achieved higher Dice coefficients than all STEPS implementations (𝑝 < 0.00001, Cohen's 𝑑: 4.39, see Table 2). Also, there was a marked qualitative difference between STEPS segmentation and MU-Net (Fig. 2), the latter achieving results visually indistinguishable from manual segmentation. The HD95 distances we computed further confirmed this difference, with an average of 0.084 ± 0.019 mm for MU-Net against 0.251 ± 0.064 mm for STEPS (𝑝 < 0.00001). We measured a mean precision of 0.962 ± 0.008 (MU-Net) vs 0.820 ± 0.025 (STEPS) (𝑝 < 0.00001) and a mean recall of 0.951 ± 0.011 (MU-Net) vs 0.952 ± 0.013 (STEPS) (𝑝 = 0.65).
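The overlap metrics quoted in this comparison can be illustrated with a minimal sketch. It assumes binary masks flattened into plain Python lists (1 = voxel inside the ROI); the function name `overlap_metrics` and the toy masks are hypothetical, and the convention of returning 1.0 when both masks are empty is an assumption of this sketch rather than something stated in the paper. Plain Python is used only to keep the example self-contained.

```python
def overlap_metrics(pred, truth):
    """Return (dice, precision, recall) for two equally long binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Count true positives, false positives and false negatives voxel by voxel.
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    # Assumed convention: empty denominators yield a perfect score of 1.0.
    dice = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return dice, precision, recall


# Toy flattened masks, for illustration only.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 1, 0, 0, 0]
dice, precision, recall = overlap_metrics(pred, truth)
```

In this toy case one voxel is missed and one is wrongly added, so precision and recall are penalized equally; a method that over-segments would lose precision while keeping recall high.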

MU-Net had an inference time of about 0.35 s and a training time of 12 h. The STEPS segmentation procedure required a total inference time of 117 min for each labeled volume (on average 440 s for each pairwise diffeomorphic registration and 7.85 s for label fusion). Implementing STEPS segmentation using only templates of the same age led to a small but significant improvement in Dice coefficients over randomly choosing templates of any age (𝑝 < 0.0007, Cohen's 𝑑: 0.296). The employment of diffeomorphic registration was the most important factor affecting the performance of STEPS, as displayed in Table 2. A simple majority voting strategy led to significantly lower performance in all ROIs compared to all other label fusion strategies (𝑝 < 0.003).
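As an illustration of the majority voting baseline, the following sketch fuses several candidate label maps voxel by voxel. It assumes that the registered template labels are already resampled onto the target grid and flattened into lists of integer labels; the function name `majority_vote` and the tie-breaking rule (smallest label wins) are choices made for this sketch, not details taken from the paper.

```python
from collections import Counter


def majority_vote(label_maps):
    """Fuse several candidate segmentations by picking the most frequent label per voxel."""
    if not label_maps:
        raise ValueError("at least one label map is required")
    n_voxels = len(label_maps[0])
    if any(len(m) != n_voxels for m in label_maps):
        raise ValueError("all label maps must have the same length")
    fused = []
    for votes in zip(*label_maps):
        counts = Counter(votes)
        best = max(counts.values())
        # Assumed tie-break: the smallest label among the most frequent ones.
        fused.append(min(label for label, c in counts.items() if c == best))
    return fused


# Three hypothetical registered label maps over five voxels.
candidates = [
    [0, 1, 1, 2, 0],
    [0, 1, 2, 2, 0],
    [0, 0, 1, 2, 3],
]
fused = majority_vote(candidates)
```

Unlike STEPS, this rule gives every template the same weight at every voxel, regardless of how well it matches the target image locally.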

Furthermore, we trained MU-Net on the outputs of the implemented STEPS procedures featuring diffeomorphic registration, and measured the Dice scores of each network's output with the ground truth (Table 3). As evidenced in Tables 2 and 3, and Fig. 3, MU-Net trained on STEPS segmentations achieved higher Dice scores with the ground truth than the same STEPS segmentations constituting the training sets of MU-Net (𝑝 < 0.00001). With the exception of the network trained on 5 weeks old mice, these hybrid networks were still under-performing compared to training on manually segmented data (𝑝 < 0.00001).

3.4. Evaluation on a large number of ROIs with MRM NeAt dataset

We trained and evaluated MU-Net on the MRM NeAt dataset, which includes atlases of 10 individual T2∗-weighted in vivo brain MR images of 12–14 weeks old C57BL/6J mice, each with 37 manually labelled anatomical structures (Ma et al., 2008). This same database was selected by Ma et al. (2014) to evaluate the STEPS multi-atlas segmentation algorithm on mouse brain MRI. To compare MU-Net with STEPS, we followed the STEPS implementation by Ma et al. (2014) as released by the authors.

Fig. 2. Segmentation comparison in four slices from a single animal: (a) STEPS, (b) MU-Net, and (c) manual annotation. In (a)–(c), the regions highlighted are the cortex (blue), ventricles (green), striati (red), and hippocampi (yellow). Panel (d) shows the inferred brain mask by MU-Net.

Table 3

Mean and standard deviation of average Dice scores evaluating the accuracy of MU-Net trained on volumes segmented via STEPS.

| Training Set | Cortex | Hippocampus | Ventricles | Striatum | ROI mean |
|---|---|---|---|---|---|
| STEPS∗ | 0.954 ± 0.011 | 0.867 ± 0.027 | 0.866 ± 0.035 | 0.898 ± 0.017 | 0.896 ± 0.043 |
| STEPS | 0.953 ± 0.009 | 0.872 ± 0.022 | 0.849 ± 0.041 | 0.885 ± 0.016 | 0.890 ± 0.046 |

Fig. 3. Average Dice score comparison between different segmentation methods, across all ROIs. MU-Net: MU-Net trained on the manually segmented data; MU-Net-STEPS: MU-Net trained on volumes segmented employing same-age diffeomorphic STEPS; STEPS: same-age diffeomorphic STEPS segmentation. The error bar represents standard deviation.

We used a 5-fold cross validation scheme for evaluation (8 templates for training and 2 templates for testing in each fold). The only adaptation required to train MU-Net on the MRM NeAt dataset was to expand the number of output channels to 37 (plus one for the brain mask) to match the number of ROIs. As displayed in Fig. 4, the Dice coefficients of MU-Net were greater than or comparable to those of STEPS: while in a majority of regions MU-Net's accuracy was higher than the accuracy of STEPS, this was statistically significant only for the brain mask, external capsule, hypothalamus and brain stem. In the left inferior colliculi, STEPS achieved a significantly higher Dice coefficient than MU-Net. Averaging the Dice coefficients across all ROIs, we measured an average Dice score of 0.820 ± 0.031 for MU-Net and 0.814 ± 0.023 for STEPS. While this average Dice coefficient for MU-Net was higher, the difference was not statistically significant (𝑝 = 0.170, Cohen's 𝑑: 0.134). Similarly, we measured a higher (but not statistically significant, 𝑝 = 0.07) average HD95 distance for MU-Net (0.360 ± 0.252 mm vs 0.240 ± 0.038 mm). In contrast, we measured a significantly higher average precision with MU-Net (0.823 ± 0.033 vs 0.786 ± 0.024, 𝑝 = 0.0009) and a significantly lower recall (0.815 ± 0.032 vs 0.853 ± 0.023, 𝑝 = 0.001). A full breakdown of these metrics is available in supplementary Fig. S3. The computation time required by STEPS to segment a single volume was approximately 20 min, while MU-Net required less than one second per volume.
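The effect sizes reported alongside the p-values can be illustrated with a short sketch of Cohen's d. The variant used in the paper is not specified in this section; the version below assumes two unpaired samples and a pooled standard deviation, and the function name `cohens_d` and the Dice values are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two unpaired samples, using a pooled standard deviation."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each sample needs at least two values")
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Made-up per-volume Dice scores for two methods, for illustration only.
method_a = [0.82, 0.85, 0.80, 0.83]
method_b = [0.81, 0.80, 0.79, 0.82]
d = cohens_d(method_a, method_b)
```

A small d with a non-significant p-value, as reported for the average Dice coefficient on this dataset, indicates that the two score distributions largely overlap.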

3.5. Evaluation with a large test dataset

We optimized the MU-Net model on the train and validation dataset and tested on a large test set of 1782 MRI volumes, acquired from 817 mice with ages ranging from 4 to 60 weeks, and including both WT and HT mice. As the 5-fold cross-validation experiment produced five different MU-Net models, the segmentation maps for the test set were obtained by averaging the five prediction maps produced by the five models. To outline the brain mask, we averaged sigmoid-activated predictions from five networks and thresholded them at 0.5. For region segmentation, we averaged the softmax-activated output maps, and for each voxel, we selected the class yielding the maximal averaged value as our predicted label.
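The ensembling rule described above can be sketched as follows. This is an illustrative reimplementation in plain Python, not the released MU-Net code: it assumes raw network outputs (logits) stored as nested lists, the function names `ensemble_mask` and `ensemble_labels` are hypothetical, and treating the threshold as a strict inequality is an assumption.

```python
from math import exp


def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_mask(mask_logits_per_model, threshold=0.5):
    """mask_logits_per_model[m][v]: brain-mask logit of model m at voxel v."""
    n_models = len(mask_logits_per_model)
    fused = []
    for voxel_logits in zip(*mask_logits_per_model):
        avg = sum(sigmoid(x) for x in voxel_logits) / n_models
        fused.append(1 if avg > threshold else 0)  # assumed strict threshold
    return fused


def ensemble_labels(class_logits_per_model):
    """class_logits_per_model[m][v][c]: logit of class c for model m at voxel v."""
    n_models = len(class_logits_per_model)
    labels = []
    for voxel_logits in zip(*class_logits_per_model):
        probs = [softmax(l) for l in voxel_logits]
        n_classes = len(probs[0])
        avg = [sum(p[c] for p in probs) / n_models for c in range(n_classes)]
        # Pick the class with the largest averaged probability.
        labels.append(max(range(n_classes), key=avg.__getitem__))
    return labels


# Two hypothetical models, three voxels (mask) and two voxels with three classes.
mask = ensemble_mask([[2.0, -1.0, 0.5], [1.0, -3.0, -2.0]])
labels = ensemble_labels([
    [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[1.0, 0.0, 0.0], [0.0, 0.0, 3.0]],
])
```

Averaging probabilities rather than hard labels lets a confident model outweigh an uncertain one, as in the second voxel of the toy label example, where the two models disagree.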

Out of the entire test set, segmentation failed completely on two volumes, where no brain mask was detected. The remaining 1780 volumes were successfully segmented with an average Dice score of 0.978 ± 0.012 for the brain mask, 0.906 ± 0.041 for the striati, and 0.937 ± 0.035 for the cortex, distributed as illustrated in Fig. 7. There was no significant difference between the segmentation accuracy of male and female animals (𝑝 > 0.1, unpaired). However, there was a significant difference in accuracy between HT and WT mice (𝑝 < 0.00001, unpaired) for all ROIs. Dice scores of WT animals were 0.4% higher for the brain mask, 1.7% higher for the cortex, and 1.9% higher for the striati. Applying N3 bias correction on all volumes before segmentation did not result in a significant Dice score difference. A detailed list of Dice scores, HD95, precision and recall, for each animal and each ROI, is available in supplementary Table S4.


Fig. 4. Comparison between the average Dice coefficients of MU-Net and STEPS multi-atlas algorithm by Ma et al. Error bars correspond to standard deviation for the average accuracy. Permutation-test based p-values for each comparison are provided in parentheses after the ROI name, + indicates that the average Dice coefficient for MU-Net was higher and - indicates that the average Dice coefficient for STEPS was higher, ∗ indicates a statistically significant difference.
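To illustrate what a permutation-test based p-value involves, the sketch below implements an exact two-sided sign-flip test on paired differences. The exact permutation scheme behind Fig. 4 is not described in this section, so the paired design, the test statistic (absolute sum of differences), and the function name `paired_permutation_test` are all assumptions; the scores are made up.

```python
from itertools import product


def paired_permutation_test(a, b):
    """Exact two-sided sign-flip permutation test on paired differences."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    # Enumerate every sign assignment; feasible for small samples (2**n cases).
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            count += 1
    return count / total


# Made-up per-volume Dice scores for one ROI, for illustration only.
mu_net_scores = [0.83, 0.81, 0.84, 0.80, 0.82]
steps_scores = [0.81, 0.82, 0.80, 0.79, 0.82]
p_value = paired_permutation_test(mu_net_scores, steps_scores)
```

With only ten volumes in the dataset the enumeration stays small (at most 1024 sign patterns), so no random sampling of permutations would be needed under these assumptions.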


Fig. 5. Mean accuracy ± standard deviation for the average accuracy of MU-Net trained and evaluated on different datasets according to mouse age. Networks exclusively trained on older animals achieved lower accuracy when attempting to generalize to the youngest animals, and vice-versa.

Fig. 6. MU-Net segmentation compared to the manual segmentation in four slices of four volumes of the test set. Blue and red indicate, respectively, ground truth and inferred segmentation, purple their overlap (striati and cortex); yellow ROIs (ventricles and hippocampi) are inferred ROIs for which manual annotations were not available. Rows indicate (a) the highest performing volume (mean Dice 0.964, 8 weeks old R6/2 mouse); (b) the lowest performing volume (mean Dice 0.685, 12 weeks old R6/2 mouse); (c) the volume displaying performance closest to the mean performance on the entire test set (Dice 0.923, 12 weeks old Q175DN mouse); (d) one randomly selected volume (Dice 0.919, 8 weeks old Q175DN mouse).

A visual inspection of the segmentation maps (Fig. 6) revealed that ROIs were qualitatively similar to those obtained on the validation set and displayed in Fig. 2. We observed, however, a visible decrease in performance in the presence of strong ringing artifacts (Fig. 6b). This is further reflected in the higher average HD95 distances in the test dataset than in the validation dataset (Table 4).
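Because HD95 is sensitive to boundary errors rather than to volume overlap, a small sketch may help in reading these numbers. The definition used in the paper is given in its Methods and is not reproduced in this section; the version below assumes the maximum of the two directed 95th-percentile (nearest-rank) surface distances, takes boundary voxel coordinates in mm as input, and uses a brute-force search. The names `hd95` and `percentile_95` are hypothetical.

```python
from math import ceil, dist


def percentile_95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, ceil(0.95 * len(ordered)))
    return ordered[rank - 1]


def hd95(points_a, points_b):
    """Symmetric 95th-percentile Hausdorff distance between two point sets."""
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")
    # Distance from each point to the closest point of the other set.
    a_to_b = [min(dist(p, q) for q in points_b) for p in points_a]
    b_to_a = [min(dist(q, p) for p in points_a) for q in points_b]
    return max(percentile_95(a_to_b), percentile_95(b_to_a))


# Two parallel toy contours, 1 mm apart.
contour_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
contour_b = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
shifted = hd95(contour_a, contour_b)

# A single stray voxel does not dominate the 95th percentile.
line = [(float(i), 0.0) for i in range(20)]
with_outlier = hd95(line, line + [(0.0, 10.0)])
```

Under this definition an isolated misclassified voxel has little effect, whereas a systematic boundary shift, such as one caused by ringing artifacts, raises the distance for most surface points.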

4. Discussion

We have presented a multi-task deep neural network, MU-Net, for the simultaneous skull-stripping and segmentation of mouse brain MRI. We selected the best performing network among a number of architectures and found it to achieve better segmentation accuracy on the validation set compared to state-of-the-art multi-atlas segmentation procedures, with a markedly lower segmentation time (0.35 s vs 117 min). We then evaluated the performance of MU-Net on a large and heterogeneous test set of 1782 volumes from 10 different studies of Huntington disease, with varying ages and genetic backgrounds (WT as well as HT Q175 and R6/2 variants). In this test set, we measured average Dice scores of 0.978, 0.906 and 0.937 for the brain mask, striati and cortex, rivaling human-level performance. We additionally trained MU-Net for the segmentation of high resolution mouse MRIs of the MRM NeAt atlas into 37 ROIs, measuring an average Dice score of 0.820. Hence, we argue that the employment of deep neural networks for the segmentation of animal MRI is a promising strategy for the reduction of both rater bias and segmentation time.

Fig. 7. Test set Dice score distribution for the brain mask, cortex and striati ROIs. Males and Females include all mice of each gender, both WT and TG. Likewise, WT and TG include both males and females.

Table 4

Average test set metrics (see Supplementary Table S4 for details).

| Metric | Brain Mask | Cortex | Striati |
|---|---|---|---|
| Dice | 0.978 ± 0.012 | 0.937 ± 0.035 | 0.906 ± 0.041 |
| HD95 (mm) | 0.345 ± 0.303 | 0.223 ± 0.231 | 0.180 ± 0.167 |
| Precision | 0.989 ± 0.006 | 0.939 ± 0.050 | 0.929 ± 0.045 |
| Recall | 0.969 ± 0.022 | 0.939 ± 0.054 | 0.888 ± 0.062 |

To put the Dice scores we have reported in context, Dice scores between two human experts have ranged from 0.80 to 0.90, depending on ROI, for mouse brain MRI segmentation (Ali et al., 2005). For different segmentation tasks in brain MRI in general, including human data, inter- and intra-rater Dice scores have ranged between 0.75 and 0.96 (Ali et al., 2005; Entis et al., 2012; Yushkevich et al., 2006). The Dice scores of MU-Net exceeded the above-mentioned scores between two human experts, suggesting human-level segmentation performance. In addition, the Dice score of MU-Net for skull-stripping was higher than the Dice score from the skull-stripping CNN implemented by Roy et al. (2018b) (0.949). Obviously, comparing previously reported Dice scores to our segmentation accuracy measures must be done with care as these vary across different studies, segmentation tasks, and datasets, and the confounding factors include image resolution, presence of artifacts and noise, rater expertise, and the choice of ROIs.

While Roy et al. (2018b) proposed a CNN for skull-stripping for mouse MRI, to our knowledge this work represents the first CNN performing both region segmentation and skull-stripping in mouse brain MRI. The advantages of CNNs with respect to atlas-based region segmentation (Bai et al., 2012; De Feo and Giove, 2019; Ma et al., 2014) are clear. First, compared to atlas-based segmentation, MU-Net is much faster and produces accurate results without pre-processing. Second, we found MU-Net to be significantly more accurate than the state-of-the-art STEPS multi-atlas segmentation (Ma et al., 2014) on anisotropic, relatively quick to acquire MR images favored in pre-clinical drug and biomarker discovery applications. Third, we found MU-Net to perform better than or equally well compared to STEPS on isotropic, high-resolution MR images with relatively long acquisition times, favored in basic research.

We observed that the segmentation accuracy of atlas-based methods can vary markedly with the specific use case, depending on the number of manually drawn ROIs, voxel size, and image quality. The best performance was achieved using advanced registration-based methods (Ma et al., 2014) on the high resolution data (Ma et al., 2008) with a densely labeled atlas of 37 ROIs, and the lowest using a majority voting rule on a sparsely outlined atlas with a low resolution along the fronto-caudal direction.


With a dense segmentation of high resolution images (MRM NeAt dataset), we measured slightly higher average Dice coefficients with MU-Net than with STEPS, but the difference was not statistically significant. Therefore, it appears that for this case the main advantage of MU-Net over STEPS would be in terms of segmentation time. The performance of MU-Net on the MRM NeAt dataset was likely hampered by the small number of training images available (8 images for training in each fold). This also provides an explanation for the higher standard deviation for HD95 distances for MU-Net compared to STEPS. Interestingly, MU-Net achieved Dice coefficients similar to STEPS with a larger average precision but a lower average recall. This would indicate that STEPS predictions contained more false positives, labeling background voxels as belonging to ROIs, and conversely MU-Net's predictions favored false negatives. For sparsely segmented images, typical in drug development, where only specific structures are of interest, STEPS appears to be markedly less effective than MU-Net, and the time required for manual annotation is notably decreased. This also means that it might be feasible to annotate a small number of volumes as required by the specific study, and then use MU-Net to automate the segmentation of the remaining data.
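The link between a similar Dice coefficient and a different precision and recall trade-off can be made explicit with a short derivation. Here TP, FP and FN denote the numbers of true positive, false positive and false negative voxels for one ROI in one volume, and P and R denote precision and recall (symbols introduced here for illustration). The numerical lines plug in the average precision and recall reported in Section 3.4; since the identity holds per ROI and per volume, not for averages, the results only approximate the reported average Dice scores of 0.820 (MU-Net) and 0.814 (STEPS).

```latex
\begin{align*}
P &= \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \\
\mathrm{Dice} &= \frac{2\,TP}{2\,TP + FP + FN}
              = \frac{2}{\frac{1}{P} + \frac{1}{R}}
              = \frac{2PR}{P + R}, \\
\text{MU-Net:}\quad & \frac{2 \times 0.823 \times 0.815}{0.823 + 0.815} \approx 0.819, \\
\text{STEPS:}\quad & \frac{2 \times 0.786 \times 0.853}{0.786 + 0.853} \approx 0.818.
\end{align*}
```

The Dice coefficient is the harmonic mean of precision and recall, so a method that trades recall for precision, or the reverse, can reach nearly the same Dice value.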

Interestingly, MU-Nets trained on automatic STEPS multi-atlas segmentations achieved higher Dice scores with the ground truth than STEPS, highlighting the generalization ability of MU-Net. This supports the use of atlas based segmentation methods to augment MRI segmentation datasets suggested in Roy et al. (2018a), leveraging unlabeled data. The results obtained by training on STEPS segmentations alone remain, however, of insufficient quality to eliminate the need for manual annotations in the training data, as the CNN attempts to replicate any form of systematic error present in the atlas-based labeling procedure.

In the literature, both 3D and 2D implementations of CNNs are available for different segmentation tasks (Çiçek et al., 2016; Milletari et al., 2016; Roy et al., 2018a), and other architectural variants have been proposed: Roy et al. (2018a) added dense connections (Huang et al., 2017) in the convolution blocks of U-Net while keeping the number of output channels constant; Han and Ye (2018) proposed two variants based on signal processing arguments for the reduction of artifacts in a sparse image reconstruction task. We, however, found that a more complex model did not improve and in fact lowered the accuracy of our results, perhaps given the simplicity of the task. Thus, in agreement with Isensee et al. (2018), we found that a 2D approach was preferable to a 3D approach in the presence of anisotropic voxels. We also found the Dice loss to be sufficient to effectively train our model without the addition of a cross-entropy loss. As we did not perform any fine tuning of hyper-parameters for any of our models, it is possible that after sufficient fine tuning the performance of one of these alternative approaches might be improved.
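For readers unfamiliar with the Dice loss, the sketch below shows one common soft formulation for a single binary output channel. The exact loss used to train MU-Net is defined in the Methods and is not reproduced in this section, so this form, the smoothing constant `eps`, and the function name `soft_dice_loss` are assumptions made for illustration.

```python
def soft_dice_loss(probs, target, eps=1e-6):
    """1 - soft Dice between predicted probabilities and a binary target."""
    if len(probs) != len(target):
        raise ValueError("inputs must have the same length")
    # Soft intersection: probabilities are used directly instead of hard labels.
    intersection = sum(p * t for p, t in zip(probs, target))
    denominator = sum(probs) + sum(target)
    # eps (an assumed smoothing term) avoids division by zero on empty masks.
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


target = [1, 0, 1]
perfect = soft_dice_loss([1.0, 0.0, 1.0], target)
wrong = soft_dice_loss([0.0, 1.0, 0.0], target)
uncertain = soft_dice_loss([0.5, 0.5], [1, 0])
```

Because the loss is computed on the overlap as a whole, it is less dominated by the large number of background voxels than a voxel-wise cross-entropy, which is one common motivation for using it in segmentation.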

Much like the human eye, MU-Net was not significantly affected by the presence of the bias field, and did not benefit from N3 bias correction. Correcting for the bias field might still be beneficial as it depends on the specific experimental setup, and thus N3 bias correction might avoid specializing the network to one particular acquisition procedure. For this reason, we release the trained parameters of the model for MU-Net trained on both the non-corrected and the N3-corrected data.

To ensure the network generalizes to a wide age range, our results indicate that the distinctive features present before adulthood need to be adequately represented in the training data. This is evidenced by the degraded performance observed when testing networks trained on 5-week old mice on the volumes acquired from older ones, and vice-versa. As mice are typically weaned at 3–4 weeks and attain sexual maturity at 8–12 weeks (Dutta and Sengupta, 2016), 5-week old mice are not adults. In contrast, training solely on male mice did not significantly influence MU-Net performance on female animals. We studied why the Dice coefficient distributions were bi-modal with the large test set (see Fig. 7). The bi-modal nature of the distributions appears not to be explained by differences between different studies, genders, or genotypes (see supplementary Figs. S4 and S5). We cannot offer a definitive explanation for the cause of these bi-modal distributions; however, we speculate that it is a sum of several factors, including intra-rater segmentation variability.

An obvious limitation of our approach is its specialization for the specific MRI contrast the algorithm is trained on. Making MU-Net more robust to marked changes in the image acquisition could be achieved by expanding the training data to be more variable and/or utilizing techniques such as domain adaptation, transfer learning or image translation to minimize the amount of new training data needed for the model to generalize to a new type of MRI acquisition (Armanious et al., 2020; Zhuang et al., 2020). This research line is one of the most important areas for future research in MRI segmentation with deep learning. However, MU-Net successfully generalized to a variety of transgenic mice in an age range wider than that of the training set, thus offering a valuable way to automate segmentation tasks. Another limitation of this study is the number of ROIs, as mouse brain atlases with extremely detailed segmentation featuring over 700 ROIs currently exist (Nie et al., 2019). However, atlases such as Nie et al. (2019) are constructed by specialized procedures and do not contain manual segmentations of all images used in the atlas construction. Therefore, these atlases are not directly applicable for training segmentation neural networks.

The employment of CNNs for the segmentation of mouse brain MRI provides a number of benefits for preclinical researchers. Beyond allowing for the employment of large datasets in a time-efficient manner, the ability to generalize and abstract from the training data results in more robust and reproducible predictions. We can thus expect these methods to reduce the confounding effect of intra- and inter-rater variability inherent in manual segmentation procedures while streamlining animal MRI experimental pipelines.

Declarations

Data availability statement

MU-Net code and trained models are freely available at https://github.com/Hierakonpolis/MU-Net. A tutorial on the usage of MU-Net is available at https://github.com/Hierakonpolis/NN4Kubiac. The training and validation dataset is property of Charles River Discovery Services, and the test dataset is property of CHDI 'Cure Huntington's Disease Initiative' foundation. The MRM NeAt dataset is freely available at https://github.com/dancebean/mouse-brain-atlas. All the Dice scores between MU-Net and manual segmentations are available as supplementary files to this manuscript.

Ethics statement

All animal experiments were carried out according to the United States National Institute of Health (NIH) guidelines for the care and use of laboratory animals, and approved by the National Animal Experiment Board.

Credit authorship contribution statement

Riccardo De Feo: Methodology, Software, Formal analysis, Writing - original draft. Artem Shatillo: Data curation. Alejandra Sierra: Methodology, Formal analysis. Juan Miguel Valverde: Methodology. Olli Gröhn: Conceptualization. Federico Giove: Conceptualization. Jussi Tohka: Conceptualization, Software, Writing - original draft.

Acknowledgments

R.D.F.'s work has received funding from the European Union's Horizon 2020 Framework Programme under the Marie Skłodowska Curie grant agreement No #691110 (MICROBRADAM) and J.M.V.'s work was funded from Marie Skłodowska Curie grant agreement No #740264 (GENOMMED). The content is solely the responsibility of the