
Hric, Darko; Kaski, Kimmo; Kivelä, Mikko Stochastic block model reveals maps of citation patterns and their evolution in time


This material is protected by copyright and other intellectual property rights, and duplication or sale of all or part of any of the repository collections is not permitted, except that material may be duplicated by you for your research use or educational purposes in electronic or print form. You must obtain permission for any other use. Electronic or print copies may not be offered, whether for sale or otherwise to anyone who is not an authorised user.

Published in:

Journal of Informetrics

DOI:

10.1016/j.joi.2018.05.004 Published: 01/08/2018

Document Version

Publisher's PDF, also known as Version of record

Published under the following license:

CC BY

Please cite the original version:

Hric, D., Kaski, K., & Kivelä, M. (2018). Stochastic block model reveals maps of citation patterns and their evolution in time. Journal of Informetrics, 12(3), 757-783. https://doi.org/10.1016/j.joi.2018.05.004


Regular article

Stochastic block model reveals maps of citation patterns and their evolution in time

Darko Hric, Kimmo Kaski, Mikko Kivelä

Department of Computer Science, Aalto University School of Science, P.O. Box 12200, FI-00076, Finland

Article info

Article history:
Received 2 May 2017
Received in revised form 30 May 2018
Accepted 30 May 2018

Keywords:
Web of science
Citation networks
Evolution of science
Stochastic block model

Abstract

In this study we map out the large-scale structure of citation networks of science journals and follow their evolution in time by using stochastic block models (SBMs). The SBM fitting procedures are principled methods that can be used to find hierarchical grouping of journals that show similar incoming and outgoing citation patterns. These methods work directly on the citation network without the need to construct auxiliary networks based on similarity of nodes. We fit the SBMs to the networks of journals we have constructed from the dataset of around 630 million citations and find a variety of different types of groups, such as communities, bridges, sources, and sinks. In addition we use a recent generalization of SBMs to determine how much a manually curated classification of journals into subfields of science is related to the group structure of the journal network and how this relationship changes in time. The SBM method tries to find a network of blocks that is the best high-level representation of the network of journals, and we illustrate how these block networks (at various levels of resolution) can be used as maps of science.

© 2018 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

The process of creating scientific knowledge relies on publications that are often stored and archived, with the primary purpose of preserving and distributing the knowledge obtained through research. These archives can also be used to study the science making itself, for example, by extracting information on collaborations, citations, or keywords of the published articles. Research in this field has a fairly long and rich history with a wide range of research topics, like the assessment and prediction of performance and quality of individual papers, researchers, institutions, journals, fields, and even countries (Althouse, West, Bergstrom, & Bergstrom, 2009; Lehmann, Jackson, & Lautrup, 2008; Nerur, Sikora, Mangalaraj, & Balijepally, 2005), as well as identification of various large scale structures of science (Boyack & Klavans, 2014; Carpenter & Narin, 1973; de Solla Price, 1965; Leydesdorff, Carley, & Rafols, 2013; Small, 1999; Waltman, van Eck, & Noyons, 2010), journal classification (Janssens, Zhang, Moor, & Glänzel, 2009; Leydesdorff, 2006; Wang & Waltman, 2016; Zhang, Liu, Janssens, Liang, & Glänzel, 2010), following research trends (Chen, 2013; Persson, 2010; Porter & Rafols, 2009), and recognizing the emerging fields or researchers (Cozzens et al., 2010; Lambiotte & Panzarasa, 2009; Shibata, Kajikawa, Takeda, Sakata, & Matsushima, 2011; Small, Boyack, & Klavans, 2014; Small & Greenlee, 1989).

Bibliographic databases, like Web of Science, Scopus, and Google Scholar, store metadata of scientific publications, which can be used to analyse science making at all levels, from large scale structure to performance of individual papers. The number

Corresponding author.

E-mail addresses: darko.hric@aalto.fi (D. Hric), kimmo.kaski@aalto.fi (K. Kaski), mikko.kivela@aalto.fi (M. Kivelä).

1751-1577/© 2018 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

and reliably is becoming even more challenging as the networks under study continue to grow.

Conventional data analysis tools, such as clustering or dimension reduction methods, can be used to simplify the data about the complex relationships between the data entities. Representing the entities as vectors of their features is a common and practical abstraction that allows the use of clustering methods in the space of features, in which the most similar entities are grouped based on the similarity of the used features. These vectors can contain citation information between the entities, and one can define similarity measures, like bibliographic coupling, co-citation, distance between citation vectors (Euclidean, cosine, Jaccard, etc.), and correlation coefficients between the citation vectors or publication texts (abstracts, keywords, etc.) (Boyack et al., 2005; Carpenter & Narin, 1973; Janssens et al., 2009; Kessler, 1963; Leydesdorff & Rafols, 2012; Marshakova, 1973; Small, 1973; Wang & Koopman, 2017).
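As a rough illustration of three of the similarity measures named above, the following sketch computes bibliographic coupling, co-citation, and cosine similarity on a toy citation table. The data and the function names are hypothetical and chosen only for this example; the cosine similarity is taken over binary outgoing-citation vectors, which is one of several possible conventions.

```python
from math import sqrt

# Toy citation data (made up): each key cites the entities in its set.
cites = {
    "A": {"C", "D", "E"},
    "B": {"D", "E", "F"},
    "C": {"E"},
    "D": {"E"},
}

def bibliographic_coupling(a, b):
    """Number of references shared by a and b (looks at what they cite)."""
    return len(cites.get(a, set()) & cites.get(b, set()))

def co_citation(a, b):
    """Number of entities that cite both a and b (looks at who cites them)."""
    return sum(1 for refs in cites.values() if a in refs and b in refs)

def cosine_similarity(a, b):
    """Cosine between the binary outgoing-citation vectors of a and b."""
    refs_a, refs_b = cites.get(a, set()), cites.get(b, set())
    if not refs_a or not refs_b:
        return 0.0  # assumption: an entity without references is similar to nothing
    return len(refs_a & refs_b) / sqrt(len(refs_a) * len(refs_b))
```

Bibliographic coupling and cosine similarity here look only at outgoing citations, while co-citation looks only at incoming ones, which is the past/future separation discussed in the text.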

The data of scientific progress can be analysed with a variety of methods once the data has been preprocessed. The dimensionality reduction techniques project the vectors into the most significant subspaces revealing groups of correlated entities (multidimensional scaling, factor analysis, etc.) (Leydesdorff et al., 2013; Small, 1999). Classical clustering techniques, e.g. hierarchical clustering and k-means, operate on the full space of features, and provide clusters of similar entities, based on an implicitly or explicitly defined similarity measure or distance (Boyack et al., 2005; Modha & Spangler, 2000; Punj & Stewart, 1983; Silva, Rodrigues, Oliveira, da, & Costa, 2013; Wang & Koopman, 2017). The factor analysis applied separately to the citing and cited direction of the complete citation matrix enables further specialization into the types of groups it finds, since by using only one direction at a time, it detects groups based on past and future citations, separately (Leydesdorff & Rafols, 2009). The co-citation and bibliographic coupling use similarities in citations in the future and past respectively, and thus provide a separation naturally (Weinberg, 1974). The results of this type of analysis depend on the preprocessing step of constructing the data vectors and similarities, and great care is needed in interpreting the results (Boyack et al., 2005; Gläser, Glänzel, & Scharnhorst, 2017; van Eck & Waltman, 2009).

The bibliometric data can also be analysed by constructing networks (such as the citation network between journals) and directly finding structure in them using the general purpose tools for analysing the networks. The development of such methods within network science has exploded since massive amounts of data on a large variety of networks, such as social and transportation networks, have become available (Boccaletti, Latora, Moreno, Chavez, & Hwang, 2006; Newman, 2003). A prominent way of finding structure in citation networks using these methods is to investigate network clusters or communities (Fortunato, 2010; Fortunato & Hric, 2016; Porter, Onnela, & Mucha, 2009), which are subnetworks that have a large number of links inside them (Chen & Redner, 2010; Lambiotte & Panzarasa, 2009; Lancichinetti & Fortunato, 2012; Radicchi, Fortunato, & Vespignani, 2012; Rosvall & Bergstrom, 2008). The assumption with most of these methods is that the network is constructed from densely connected cores of nodes or journals that have a relatively small number of citations to the rest of the network. This is in contrast to the methods based on similarity of journals that can find groups with a strong preference for receiving or giving citations from a certain subset of journals, for instance work of applied research can cite theoretical works, without being cited back.

Even if one would accept the premise that the community-like structures are relevant in citation networks, many community detection methods are besieged with intrinsic problems. Very often they detect structures even in case of random networks by mistaking noise for data, they might be very sensitive to small perturbations (noise), and they possess a “resolution limit”, i.e. they suffer from the inability to identify communities below a certain size that depends on the total size of the network (Fortunato & Barthélemy, 2007; Guimerà, Sales-Pardo, & Amaral, 2004). The performance, reliability, and even the results to some extent depend on the choice of a method from the large set of currently available methods.

The problems with community detection methods are well-known in the network science literature, and the need to find richer structure in networks than that obtained by partitioning nodes to communities has been acknowledged for many types of networks (Leskovec, Lang, Dasgupta, & Mahoney, 2009; Palla, Derényi, Farkas, & Vicsek, 2005; Rombach, Porter, Fowler, & Mucha, 2014; Wang & Hopcroft, 2010; Xie, Kelley, & Szymanski, 2013). Very recently, as a solution to this problem, the old idea of using stochastic block models (SBMs) as models of network structure (Holland, Laskey, & Leinhardt, 1983; Lorrain & White, 1971; Wasserman & Anderson, 1987) has received renewed attention, because of the theoretical and algorithmic advances that enabled their use in a reliable and scalable way (Bianconi, 2009; Karrer & Newman, 2011; Peixoto, 2012a). SBM is a model in which nodes belong to blocks (the name for groups in the SBM paradigm) and edges are created between (and within) the blocks with some fixed probabilities for each pair of blocks. The methods based on SBMs work by finding the model which best explains the network data. The best explanation is not necessarily the model that would have most likely produced the data: the simplicity of the model must also be taken into account, and principled and powerful ideas from the statistical inference literature are used to avoid overfitting. One can consider the blocks as “super nodes” that are connected with weighted edges, and SBM methods then, by definition, try to find the “super network” that is the best simplification of the original network.


Here we take advantage of the recent advances in SBM methods found in the network science literature and apply them to large scale citation networks between journals. We use journal citation networks from Thomson Reuters Citation Index® for the years ranging from 1900 to 2013, which contain hundreds of millions of citations. Many studies concentrated on small subsets of the citation network (An, Janssen, & Milios, 2004; Grossman, 2002; Nerur et al., 2005; Pieters, Baumgartner, Vermunt, & Bijmolt, 1999; Porter & Rafols, 2009; Shibata et al., 2011; Zhang et al., 2010), while others were interested in large-scale patterns (Boyack et al., 2005; de Moya-Anegón et al., 2007; Leydesdorff & Rafols, 2009). We focus on the large scale citation networks that are constructed using all articles in this bibliographic dataset. First we divide the full time period into time windows of 5 or 10 years and use the articles in those windows to construct networks of the journals active in each window. That is, we take snapshots of the contemporary science at different points of time and track the important developments by fitting them with hierarchical SBMs. We visualize the resulting block structure at multiple levels of hierarchy, and illustrate the presence of blocks that are not community-like by categorizing them as sources, sinks, bridges, and communities. Moreover, we follow the evolution of these block categories in the 16 largest fields of science in time and report the large-scale changes in them over an observation period of more than a hundred years.

The citation networks can be studied in isolation but they can also be augmented and compared with many other data sources such as journal categorizations, article keywords, and author information. Previous studies have, for example, compared predetermined journal categories to network clusters (Boyack et al., 2005; Janssens et al., 2009) or to factors from factor analysis (Leydesdorff & Rafols, 2009). They have also constructed networks using categories as nodes (Zhang et al., 2010) and evaluated the quality of categorisations using criteria that favour community-like categories (Wang & Waltman, 2016). Here we will utilize a recently developed generalization of the SBM method that allows the inclusion of any “tag” information about the nodes (Hric, Peixoto, & Fortunato, 2016) and use it to analyse how much information the predetermined journal categorizations carry about the group structure of the citation networks. This approach does not assume that the journal classifications are the ground truth, but determines the suitability of subject categories for describing citation structure by asking how much better we can do in estimating the citation flows with the classifications than we can do without the knowledge of the classifications. The construction of contemporary citation networks allows us to track the congruity of the subject categories with citation patterns throughout the last century.

The paper is organized as follows. The process of building annotated journal networks from raw citation data is described in Section 2. The stochastic block models are introduced and described in Section 3. Then the visualization of the citation networks is described and a selection of results is presented in Section 4. More detailed analysis of the properties of journal groups is done in Section 5, while Section 6 deals with their evolution in time. Next a comparison between the subject categories and citation structure is developed and presented in Section 7. Conclusions are made in Section 8. Some basic properties of the data and additional results are presented in Appendices A and D.

2. Data

All the networks constructed in this paper are based on data on articles and citations extracted from three Thomson Reuters Citation Index® datasets (Science Citation Index Expanded™, Social Sciences Citation Index®, and Arts & Humanities Citation Index®). This database contains information about the publishing year and the venue (journal, proceeding, conference, etc.) of articles, and each venue (from now on called journal) is assigned to none, one, or several subfields. We join the subfields into larger fields similar to Parolo et al. (2015). The dataset spans from year 1900 to 2013 and contains about 76,000 journals, approximately 5.5 M articles, and about 630 M citations in total. A more detailed description of the data can be found in Appendix A.

As the full dataset spans more than a hundred years, it includes information needed to track the development of modern science. We aim to investigate how the citation patterns have evolved during this time period and to that end we split the data into multiple time windows, each of which is then used to construct a contemporary network of journals. The total volume of publications and citations is growing exponentially in time (Pan et al., 2016), and because of this we set the time window length to ten years before the 1970s and to five years afterwards.
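The window scheme can be sketched as follows. The exact window boundaries are not stated in the text, so the half-open intervals, the switch exactly at 1970, and the function name `time_windows` are assumptions made for illustration; the last window simply extends past the end of the data.

```python
def time_windows(first_year=1900, last_year=2013, switch_year=1970):
    """Return half-open windows [start, end): 10 years long before
    switch_year and 5 years long from switch_year onwards (assumed boundaries)."""
    windows = []
    start = first_year
    while start <= last_year:
        length = 10 if start < switch_year else 5
        windows.append((start, start + length))
        start += length
    return windows

windows = time_windows()
```

With these assumed boundaries the list contains windows such as (1900, 1910) and (1995, 2000), which match the periods named in the figure captions.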

2.1. Network construction

In each time window, a node corresponds to an active journal that has publications in the given time period. The connections between the journals are constructed using outgoing citations from these journals such that there is a directed link from journal a to journal b if an article in journal a cites an article in journal b, and the weight of this link is taken to be the number of such citations. For each time window we only count the contemporary citations satisfying the following two criteria: (1) the cited article is published in a journal that is active in the time window, and (2) the time difference between the citing and the cited article is shorter than the length of the window. This procedure ensures that all articles in the time window contribute equally (with their citations) to network links.
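A minimal sketch of this construction for a single window is given below. The input layout (an `articles` dictionary mapping an article id to a journal and a year, and a list of citing/cited id pairs), the half-open window, and the function name are assumptions for illustration and are not taken from the paper; the citing article is assumed to lie inside the window.

```python
from collections import Counter

def build_journal_network(articles, citations, start, end):
    """Weighted directed journal network for the window [start, end).

    articles:  {article_id: (journal, year)}        (hypothetical layout)
    citations: iterable of (citing_id, cited_id)
    Returns (active_journals, weights) with weights[(a, b)] = number of
    contemporary citations from journal a to journal b.
    """
    length = end - start

    def in_window(year):
        return start <= year < end

    # Nodes: journals with at least one publication inside the window.
    active = {journal for journal, year in articles.values() if in_window(year)}

    weights = Counter()
    for citing, cited in citations:
        journal_from, year_from = articles[citing]
        journal_to, year_to = articles[cited]
        if not in_window(year_from):
            continue  # only citations made by articles in the window
        if journal_to not in active:
            continue  # criterion (1): cited journal is active in the window
        if abs(year_from - year_to) >= length:
            continue  # criterion (2): time difference shorter than the window
        weights[(journal_from, journal_to)] += 1
    return active, weights

# Toy data, made up for the example.
articles = {
    1: ("J1", 1996), 2: ("J2", 1993), 3: ("J2", 1997),
    4: ("J3", 1985), 5: ("J1", 1998),
}
citations = [(1, 2), (5, 3), (5, 2), (1, 4), (5, 1)]
active, weights = build_journal_network(articles, citations, 1995, 2000)
```

In the toy data, the citation from article 5 to article 2 is dropped because the two articles are five years apart, and the citation to article 4 is dropped because its journal has no publications in the window.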

We have also tested a different approach for selecting the contemporary citations where both the citing and the cited article were required to be within the fixed time window. The more strict filtering of contemporary citations brings imbalance to incoming and outgoing citations of articles depending on whether they are published at the beginning or towards the end of the window: those at the beginning have a larger pool of articles they can receive citations from than the pool they can cite,


version of graph-tool we used (2.19dev), in Section 7 we had to use simplified networks (undirected, unweighted, and without self-loops). A naive method of discarding link directions and weights, and removing self-loops, leaves the networks very dense, and is a poor approximation because it regards all links as equally important, irrespective of their direction or weight. A usual approach is to set a global threshold on the link weights that keeps only the strongest links, or to use only the links that form a maximum spanning tree (Kruskal, 1956; Macdonald, Almaas, & Barabási, 2005). Both of these are global methods, meaning that the decision on whether a link will be kept or removed depends on the weight distribution and the structure of the full network. We use a local thresholding method, in which the statistical significance of the weights of the links of every node is calculated based on a null model defined for each node separately (Serrano, Boguñá, & Vespignani, 2009). This significance measure is the probability that the link weight is compatible with the null hypothesis, in statistical inference known as the p value, but here denoted with α. By keeping only the links that have an α value lower than a certain threshold, we are dismissing all links that do not significantly differ from ones created randomly, while those that are kept can be considered significant (not random), and this “significance” is controlled with the value of the threshold. We tested the range of thresholds and found that the results are robust against the change of the threshold value α (see Section 7). The results are shown for α in the range 0.05, ..., 0.25 that preserve about 6% to about 21% of the most important links (representing about 23% to about 45% of citations) and about 51% to about 99% of nodes, respectively.
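The local thresholding step can be illustrated with the standard form of the disparity filter of Serrano et al. (2009), in which a link of weight w attached to a node of degree k and strength s gets the value (1 − w/s)^(k−1). The sketch below is not the authors' code: it assumes the weights have already been made undirected, it keeps a link when it is significant for at least one of its endpoints, and it treats nodes with a single link as untestable, all of which are choices made for this example.

```python
from collections import defaultdict

def disparity_filter(weights, alpha):
    """Return the set of links kept at significance threshold alpha.

    weights: {(u, v): w} for an undirected weighted network without
             self-loops, each link listed once (assumed input layout).
    """
    strength = defaultdict(float)
    degree = defaultdict(int)
    for (u, v), w in weights.items():
        for node in (u, v):
            strength[node] += w
            degree[node] += 1

    def p_value(node, w):
        k = degree[node]
        if k < 2:
            return 1.0  # assumption: one link cannot be tested from this side
        # Probability of a normalized weight at least this large under the
        # null model of k uniformly random weights summing to one.
        return (1.0 - w / strength[node]) ** (k - 1)

    return {
        (u, v)
        for (u, v), w in weights.items()
        if min(p_value(u, w), p_value(v, w)) < alpha
    }

# Toy star network: one dominant link and three weak ones.
toy = {("h", "a"): 100, ("h", "b"): 1, ("h", "c"): 1, ("h", "d"): 1}
backbone = disparity_filter(toy, alpha=0.05)
```

Because the test is made per node, a weak link of a small journal can survive while a numerically heavier link of a very large journal is dropped, which is the difference from a global weight threshold.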

3. Stochastic block model

Networks and graphs can be measured and summarized at many levels of granularity, starting from global or macroscopic measures (such as the total number of links or diameter) to local or microscopic measures such as node degree or the clustering coefficient (Newman, 2010). Here we concentrate on describing networks in a mesoscopic scale that is between these two extremes. Network analysis methods that work at this level of granularity almost exclusively deal with sets of nodes and links called communities or, depending on the field of research, clusters, groups, modules, etc. (Boccaletti et al., 2006; Fortunato, 2010; Schaeffer, 2007; Wasserman & Faust, 1994). There is not a single, precise definition of community, but most often it is described as a set of nodes with more connections between them than to the rest of the network (Fortunato, 2010; Porter et al., 2009). The community paradigm assumes that a network can be described as a collection of tightly knit sets of nodes, which are loosely connected to each other.

The stochastic block model relaxes the assumption about the nature of constituent sets of nodes such that they only need to be equivalent in the way they connect to other groups (called blocks in the SBM paradigm), which in effect allows for a description beyond the community structure, like bipartite, core-periphery, etc. (Barucca & Lillo, 2016; Karrer & Newman, 2011). SBM is a generative model, meaning that it assumes a model of the underlying structure and prescribes a procedure for building networks that have this structure in common. The model is defined by assigning all nodes to disjoint sets¹ and setting the number of links between and within blocks. Obeying the above described constraints, a network is generated by randomly placing links between nodes. An alternative description is to set the probabilities for placing a link between any two blocks, but we used the link counts following the approach laid out in Peixoto (2014b).
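The generative procedure described above can be sketched in a few lines. This is a simplified illustration rather than the model variant fitted in the paper: it is not degree-corrected, it allows multiple links and self-loops, and the function and argument names are hypothetical.

```python
import random

def generate_sbm(block_of, link_counts, seed=0):
    """Generate a directed multigraph with a prescribed block structure.

    block_of:    {node: block label}, a partition into disjoint blocks
    link_counts: {(r, s): number of links from block r to block s}
    Each link picks its endpoints uniformly at random inside the two blocks.
    """
    rng = random.Random(seed)
    members = {}
    for node, block in block_of.items():
        members.setdefault(block, []).append(node)

    edges = []
    for (r, s), count in link_counts.items():
        for _ in range(count):
            edges.append((rng.choice(members[r]), rng.choice(members[s])))
    return edges

block_of = {"a": 0, "b": 0, "c": 1, "d": 1}
link_counts = {(0, 0): 3, (0, 1): 5, (1, 0): 1}
edges = generate_sbm(block_of, link_counts)
```

Every network produced this way has exactly the requested number of links between each pair of blocks, while the placement of individual links inside the blocks is random.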

Nodes within blocks share the probabilities for links towards the nodes in other blocks, and also towards nodes in their own block. In journal citation networks this means that all the journals in a block have the same citation patterns to other blocks. They can, for instance, receive most of their citations from one set of journals, and give them out to another set, or have higher-than-average probability to exchange citations with some blocks and lower-than-average probability with other blocks. Two blocks could also have identical citation patterns to other blocks, but different numbers of internal citations. All this tells us that this model groups nodes (journals) into classes by their role in the network, whichever those are.

Once the model is known, building networks with the prescribed block structure is straightforward. However, the more common situation is opposite to this: only a single realization of the model of the empirical network at hand is known, and parameters of the model that most likely produce this network need to be inferred. Finding the most likely parameters is a highly non-trivial task, and many approaches to solve it have been used (Wasserman & Anderson, 1987). All approaches use an objective function, in one form or the other, that measures the probability of the given parameters to be the ones that produced the observed network. The problem with this naive approach is that the best fitting model will be too detailed and will reproduce the observed network with very high accuracy, which goes against the purpose of the models in providing a good simplification of the reality. The cause for this is the fact that the simple approach uses all available data, including noise, for fitting the parameters. In the extreme case a highly detailed model ends up putting all the nodes in their own blocks, since this reproduces the network perfectly. A simple solution to prevent this from happening is to introduce the number of blocks as a constraint in the fitting procedure (Karrer & Newman, 2011). This works fine in cases where the number of blocks is known, otherwise it needs to be inferred from data, for instance by finding a balance between the model description length and its goodness-of-fit to data. Some of the approaches in this direction are listed in Latouche, Birmele, and Ambroise (2012). In this work we are taking advantage of the recent progress with this topic that estimates the number of blocks by including the information necessary for describing the model parameters into the total information amount being minimized such that it penalizes a too lengthy description (Peixoto, 2013, 2015).

¹ The assumption about blocks being disjoint sets of nodes can be relaxed (Peixoto, 2015).

One of the main issues with the classical SBMs, in addition to overfitting, is that they assume Poisson-distributed node degrees, so any deviation from the expected structure is considered a feature of the data and the fitting algorithm tries to find a model that would adequately describe it. Fitting this model to a network with a different degree distribution would yield blocks that represent classes of nodes based on their degrees, not just on which other nodes they connect to. So, for instance, highly cited journals would be separated from journals publishing review articles, which presumably cite in large volumes but do not get cited as much. Also, journals that play a central role in their respective communities would be put together, even if the communities might not be otherwise related. The degree-corrected stochastic block model is “blind” to this kind of structure: it separates nodes into classes based on their degree, which removes unfair competition for links between nodes of largely different degrees (Karrer & Newman, 2011). This effectively includes the degree sequence into the model, which does increase the information needed to describe the model, but the benefit of a simpler model for describing the data lowers the total description length in most cases.

The SBM and its degree-corrected variant can be expanded to describe hierarchical structure of blocks (Peixoto, 2014b). In this case, each level constitutes a network of blocks and the best fit of SBM of the level below it, starting from the network itself at the bottom level, to the trivial one-block level on top. Fitting all the levels is done simultaneously to obtain the minimum description length of the whole structure. This approach shows several advantages, like allowing for even smaller blocks that can reliably be inferred and providing a multi-resolution view of the network. Hierarchical SBM can be viewed as a stack of progressively simpler weighted networks: each level in the hierarchy is a zoomed-out version of the level below it. This means that we can inspect connectivity patterns of blocks of nodes at the desired level of detail/block sizes.

The existence of the smallest identifiable blocks is in community detection literature known as the resolution limit and it is defined as the smallest group that a method is able to identify (Fortunato & Barthélemy, 2007). One of the most commonly used objective functions is modularity (Newman & Girvan, 2004), for which it has been shown that it is not able to separate small groups, even in obvious cases, if the number of links inside them is too small compared to the number of links in the rest of the network (Fortunato & Barthélemy, 2007). It should be noted that SBMs are not completely immune to resolution limit problems (Choi, Wolfe, & Airoldi, 2012; Peixoto, 2013), which presents a problem in the comparative analysis of networks spanning two orders of magnitude, as is the case with our time slices. The smallest detectable blocks scale with the network size, which means that high-resolution levels for large networks would remain hidden from us. Fitting the blocks at all resolutions for the hierarchical SBM at the same time reduces the limit, because each lower level uses blocks from the level above as a constraint, so block inference at the lower levels is in effect done locally in each higher-level block (Peixoto, 2014b).

The criteria and algorithms described above are implemented in the graph-tool Python module, which we use to do all SBM fitting in this work (Peixoto, 2014a).

4. Visualizing citation flows between blocks

Hierarchical SBM levels can be visualized as networks that provide us with a multi-resolution map of the citation network. Network visualization is a powerful tool for visual inspection of complex network data, but its readability and thus usefulness depends on the level of details presented and the total size of the network. In addition, if the network is dense, i.e. average node degree is large, it is even harder to make a clear picture of it. Lower levels contain a lot of information in fine detail, but they are often vast and too dense for visualization to be readable. But not all weighted links are equally important if we are interested in large-scale patterns. This means we can depict only the most important ones, making the network significantly less dense. This is again done by the procedure outlined in Section 2, which keeps only the links that form the backbone of the network (Serrano et al., 2009). For the purpose of this analysis the directions of links are preserved.

Three examples of networks of SBM blocks for the time periods of 1920s, 1960s, and 1995–2000, at the levels most similar to subfields and fields (i.e. with the closest matching number of blocks), are depicted in Fig. 1.

Clustering of blocks of similar fields is quite visible in all six networks. Medicine forms the biggest cluster, followed by Biology, Chemistry, and Physics. Large-scale structure remains similar for all three time periods: Medicine is tightly connected to Biology, which is then connected to Chemistry and Physics. This structure is in accordance with the previously published maps (Leydesdorff & Rafols, 2009; Rosvall & Bergstrom, 2008). The figures also include interesting small-scale details that vary between the resolutions and time windows. For instance, the interdisciplinary blocks are often located at the boundaries between other fields, but this effect is more visible in the subfield resolution level.

The fact that there are multiple blocks with the same dominant field in levels having the most similar number of blocks to the number of assigned fields is a consequence of large heterogeneity of the field sizes and importance. Large fields also have rich internal structure that overshadows small, simpler fields in the process of inferring the best blocks at each hierarchical level. As a first approximation, we have chosen the number of fields as a guide for choosing the most appropriate level as it is simple and intuitive, while another, more complex approach comparing the full size distributions did not provide significantly different results (see Appendix B).

Fig. 1. Networks of SBM blocks for the time periods of 1920s, 1960s, and 1995–2000. The top row shows resolution levels most similar to subfields, bottom to fields. The node shapes and colours represent the dominant field in that block and the node sizes are proportional to the number of articles. Directed links are coloured and sized logarithmically, according to the number of citations they carry. Node and link sizes are normalized per pairs of networks from the same time window, so that these sizes can be compared between the two resolution levels but not across the time windows. The link colours are normalized for each network separately.

5. Connectivity patterns of blocks

In contrast to traditional community detection, blocks in SBM are not limited to “dense subgroups” where the citations stay inside the blocks, but a block can also represent a structure where the citations flow out of or into the block, as long as all journals in the block behave in a similar way. We summarize the type of block in terms of citation flows by counting the number of citations entering the block, $s_{in}$ (articles in the block are being cited), leaving the block, $s_{out}$ (they cite articles in other blocks), and internal citations, $s_{int}$ (citing articles in the same block)².

² One can view the system of flows and blocks as a weighted network. In the literature of weighted complex networks (Barrat, Barthélemy, Pastor-Satorras, & Vespignani, 2004), the weighted sum of a node’s links is called strength, and for the directed networks it can be in-, out- and internal strength: $s_{in}$, $s_{out}$ and $s_{int}$.

Fig. 2. Properties of blocks depending on their location on the connectivity pattern plot. $\bar{s}_{in}$ and $\bar{s}_{out}$ are relative in- and out-flows of citations to/from the block. Due to the relation $\bar{s}_{in} + \bar{s}_{out} \le 1$, points are confined to regions outlined by grey lines. Regions, as well as red points at special locations, are marked with arrows and annotated with descriptions of the properties of blocks at those locations.

Note that the sum of the three flow counts represents the total activity of the journals in the block. Here we are not interested in the total activities but in the type of flows. We investigate these types by separating the total flow from our flow measures and concentrate on relative in-, out-, and internal flows. These normalized flows are defined as:

$$\bar{s}_{in} = \frac{s_{in}}{s_{tot}}, \qquad \bar{s}_{out} = \frac{s_{out}}{s_{tot}}, \qquad \bar{s}_{int} = \frac{s_{int}}{s_{tot}}, \qquad \text{where } s_{tot} = s_{in} + s_{out} + s_{int}. \tag{1}$$

By normalising the flows by the total activity we reduce the number of free parameters needed to describe blocks’ flow patterns from three to two. That is, the sum of the three relative flows equals one, and knowing two of them is enough as the third one can always be calculated based on them. This means that we can report the two out of the three flow measures that are the most convenient for us.
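As a small worked example of Eq. (1), the sketch below counts the in-, out-, and internal flows of each block from a weighted list of journal-to-journal citations and then normalizes them. The input layout and the function names are assumptions made for this illustration.

```python
from collections import defaultdict

def block_flows(weights, block_of):
    """Count s_in, s_out and s_int for every block.

    weights:  {(citing_journal, cited_journal): number of citations}
    block_of: {journal: block label}
    """
    flows = defaultdict(lambda: {"in": 0, "out": 0, "int": 0})
    for (source, target), w in weights.items():
        b_source, b_target = block_of[source], block_of[target]
        if b_source == b_target:
            flows[b_source]["int"] += w   # citation stays inside the block
        else:
            flows[b_source]["out"] += w   # the block cites another block
            flows[b_target]["in"] += w    # the block is cited by another block
    return dict(flows)

def normalize(flow):
    """Relative flows of Eq. (1); they sum to one by construction."""
    total = flow["in"] + flow["out"] + flow["int"]
    if total == 0:
        raise ValueError("block has no citations")
    return {key: value / total for key, value in flow.items()}

# Toy data, made up for the example.
weights = {("a", "b"): 4, ("a", "c"): 2, ("c", "a"): 1, ("c", "d"): 3}
block_of = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
flows = block_flows(weights, block_of)
relative_x = normalize(flows["X"])
```

For block X in the toy data the counts are 1 incoming, 2 outgoing, and 4 internal citations, so the relative flows are 1/7, 2/7, and 4/7.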

For visualizing the connectivity patterns of blocks, the choice of $\bar{s}_{in}$ and $\bar{s}_{out}$ as x and y coordinates makes it easy to visually assess the block’s properties from its position on the plot as illustrated in Fig. 2. Since the sum $\bar{s}_{in} + \bar{s}_{out}$ must be ≤ 1, points can lie only in the triangle bounded by the diagonal (0,1)−(1,0). Proximity to the origin tells us how much “self-centred” or “community-like” the block is, while the distances from the axes signify the balance between receiving and giving out citations. It helps to consider four extremal cases for a block, marked with red points in Fig. 2:

(0, 0) Pure community. Journals in this block are isolated from the rest of the network. Articles published in these journals only cite and get cited by articles in journals from this block.
(1, 0) Pure sink. Journals in this block only receive citations and do not cite at all.
(0, 1) Pure source. Journals in this block do not get cited, but cite others.
(0.5, 0.5) Pure bridge. Journals in this block cite and get cited equally, but there are no citations within the block.

In most cases, the values lie somewhere between these extremes. The triangular space can be divided into three regions outlined by grey lines in Fig. 2 that contain blocks with the following properties:

Inner triangle is community-like (“a community in a weak sense” [Radicchi, Castellano, Cecconi, Loreto, & Parisi, 2004]). More than half of the citations pertaining to this block stay within it.
Upper wing mostly cites others.
Lower wing mostly receives citations.
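The three regions can be expressed as a simple rule on the two coordinates. In the sketch below the inner triangle is taken to be the set of points where the relative in- and out-flows sum to less than 0.5 (equivalently, more than half of the citations are internal), and the wings are separated by comparing the two flows. The treatment of points that lie exactly on a boundary, and the function name, are assumptions of this example and are not specified in the text.

```python
def classify_block(s_in, s_out, tol=1e-12):
    """Name the region of the connectivity pattern plot for a block with
    relative in-flow s_in (x coordinate) and out-flow s_out (y coordinate)."""
    if s_in < -tol or s_out < -tol or s_in + s_out > 1 + tol:
        raise ValueError("relative flows must be non-negative and sum to at most 1")
    if s_in + s_out < 0.5:
        return "inner triangle"   # more than half of the citations are internal
    if s_out > s_in:
        return "upper wing"       # mostly cites others
    if s_in > s_out:
        return "lower wing"       # mostly receives citations
    return "balanced"             # assumption: ties fall on the line between the wings

corners = {
    "pure community": classify_block(0.0, 0.0),
    "pure sink": classify_block(1.0, 0.0),
    "pure source": classify_block(0.0, 1.0),
    "pure bridge": classify_block(0.5, 0.5),
}
```

Applied to the four extremal cases, the rule places the pure community in the inner triangle, the pure sink in the lower wing, and the pure source in the upper wing, while the pure bridge sits on the line separating the two wings.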

In Fig. 3 we visualize the types of connectivity patterns of blocks found in the citation networks. We display the data for three time periods (1920s, 1960s, and 1995–2000) and for three levels of resolution. This gives us an overview of the types of mesoscale structures one can find in the citation networks. A large number of blocks detected by the SBM method fall outside the inner triangle, and are thus not community-like structures even in the weak sense (Radicchi et al., 2004). This is especially evident for high-resolution levels where the vast majority of inferred blocks are not communities. Note that the inclination towards non-community-like structures is a feature of the data as the SBM fitting method we use does not have a preference for any particular block structure that is plausible within the SBM framework.

Fig. 3. Connectivity patterns of structural and classification blocks, for the time periods of 1900–1910, 1950–1960, and 1995–2000. Each row corresponds to a different block type: journals themselves; the first SBM level; subfields classification; SBM level with the closest matching number of blocks to the number of subfields (SBM: ∼subfields); fields classification; and SBM level with the closest matching number of blocks to the number of fields (SBM: ∼fields). Blocks are represented as points (coloured and shaped according to the dominant field in a block) with coordinates being the incoming and outgoing citations as fractions of total citations for each block ($\bar{s}_{\mathrm{in}}$ and $\bar{s}_{\mathrm{out}}$, Eq. (1)). Point areas are proportional to the total citations (strength $s_{\mathrm{tot}}$) pertinent to the block. Blocks with fewer than 10 total citations are not shown. Blocks containing a selected set of journals are annotated. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)

Note that, in the hierarchical structure of blocks, the level that contains larger blocks cannot have less community-like blocks than a level that contains smaller blocks, and for the one block at the top of the hierarchy all citations are internal.

This can be illustrated by considering the merger of two blocks: their internal citations remain internal, but the part of their external citations that goes between them becomes internal to the merged block (cf. Appendix E). In consequence, the average fraction of internal citations in the block at the higher level of hierarchy can only be equal to or higher than the weighted average of the two blocks at the lower level of hierarchy.
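The merger argument can be checked numerically with a small sketch. It assumes that a block is summarised by three citation counts (internal, incoming, outgoing) and that the internal fraction is the internal count divided by the sum of the three; Eq. (1) is not reproduced in this section, so this normalisation, the function names, and the numbers are illustrative assumptions rather than the authors' implementation.

```python
def internal_fraction(block):
    """Share of a block's citations that stay inside it (assumed normalisation)."""
    internal, incoming, outgoing = block
    return internal / (internal + incoming + outgoing)


def merge_blocks(a, b, a_to_b, b_to_a):
    """Merge two blocks given as (internal, incoming, outgoing) counts.

    a_to_b and b_to_a are the citations passing between the two blocks.
    They are external before the merger and internal after it.
    """
    between = a_to_b + b_to_a
    internal = a[0] + b[0] + between
    incoming = a[1] + b[1] - between
    outgoing = a[2] + b[2] - between
    return internal, incoming, outgoing


# Hypothetical counts, chosen only for illustration.
block_a = (10, 20, 30)
block_b = (5, 15, 10)
merged = merge_blocks(block_a, block_b, a_to_b=8, b_to_a=4)

# Average of the two internal fractions, weighted by total citations.
total_a, total_b = sum(block_a), sum(block_b)
weighted_average = (
    total_a * internal_fraction(block_a) + total_b * internal_fraction(block_b)
) / (total_a + total_b)

print(merged, internal_fraction(merged), weighted_average)
```

With these counts the merged block has a higher internal fraction than the weighted average of its two parts, as the argument requires; equality holds only when no citations pass between the merged blocks.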

The procedure outlined in this section can be used with any kind of partition of the journals into “blocks”, not just ones inferred by SBM. The journals are explicitly partitioned in the data by subfield and field classification, and we want to compare these partitionings to ones given by the blocks found by the SBM method. For our purposes it is useful to view both of these partitions as block structures, but to avoid confusion we name the blocks as determined by the classification data “classification blocks”, and the ones inferred by SBM from the citation patterns “structural blocks”. Journals themselves form elementary blocks, which can be viewed as the “zeroth” level of hierarchical SBM or any other hierarchical block structure.

These zeroth level blocks give us properties of individual journals. Journals are assigned to subfields (in the dataset), which are in turn grouped into fields. One would expect this classification to be reflected in citation patterns, since it should group similar journals together. We are able to test this assumption by comparing the properties of artificial blocks, defined by subfields and fields, with blocks inferred from citations. Hierarchical SBM provides us with many levels at different resolutions, and in Fig. 3 we compare the level with the most similar number of blocks to the number of subfields and fields in that network.

Individual journals span the whole space of in- and outflow balance (there are many strong sinks as well as strong sources), while the overwhelming majority do not predominantly cite themselves. There are annotated journals of higher prominence that are found in the lower wing, which is to be expected since they receive more citations than they give out.

The first SBM level determines the smallest non-trivial structural blocks, and it is the smallest grouping of journals that cite, and are cited, in a similar way, as they have similar citation patterns towards the rest of the network. Compared with the level of journals the spread of points is reduced in all time periods, although to a varying degree. The reduction becomes stronger with time: there is almost no change in 1920s, some change in 1960s, and a significant constriction of values for 1995–2000. This could be a consequence of the resolution limit, where smaller details are increasingly hard to capture as the size of the network increases, or the outlying journals carry less information in later years, so they are combined with more moderate ones.

At the level of subfields (the 3rd and 4th rows in Fig. 3), both structural blocks and classification blocks display a pattern where multidisciplinary blocks are located mostly on the cited side. This is expected given that these blocks are dominated by high impact journals publishing high quality articles from a wide spectrum of fields. Comparing the distribution of points for classification blocks versus structural blocks we see that the latter ones are more evenly spread out in 1960s and 1995–2000, while this is not the case in 1920s. There are also more community-like blocks in the structural case, in particular in the fields of Economics, Physics, and Mathematics and Geosciences in 1920s. This means that SBM captures a broader spectrum of block types, while classification blocks tend to be more similar to each other or they have a preference for certain properties.

The highest level of hierarchy we focus on, for both classification blocks and structural blocks, is the level of fields (5th and 6th rows in Fig. 3). Constriction in classification blocks is again present, albeit not as strong as for the subfields. With time, the Multidisciplinary field strongly separates from the bulk, which remains more elongated in the community-bridge direction and leans slightly towards source-like behaviour. Similar to the level of subfields, the Multidisciplinary block tends to become more cited with time for both classification and structural blocks, with this behaviour being more pronounced for structural blocks and the last time period. Medicine and Physics get separated into a community-like block, meaning that most citations remain within their field, which was not the case for the subfields. The SBM finds multiple blocks of mixed fields that are source- and community-like for 1960s, and only community-like for 1995–2000. Most of the mixed blocks in 1960s are comprised of unclassified journals, the number of which rises in the second half of the century (see Appendix A for details). For classification blocks these journals are all collected under the “mixed/unknown” field (large white square).

Note that one needs to be careful when comparing blocks across different panels in Fig. 3, as there is no guarantee that blocks with similar qualities in different panels consist of the same journals. Deeper analysis in this direction would require listing all journals in a block, or annotating journals of interest (which is done for a few journals in Fig. 3).

6. Evolution of block connectivity patterns in time

In the previous section we summarized the types of citation flows of individual blocks. We will next build on this summarization method, and quantify the evolution of citation flows of specific fields. To be more precise, we ask the question: what is the expected type of citation flow of the block to which a randomly chosen article of a given field belongs?

The expected flows for a journal in a field can be calculated by collecting all journals belonging to a field and taking a weighted average of the flows of the blocks they belong to. To calculate average flows for articles, the averaging needs to be weighted by the fraction of each journal’s publications out of the total number in the field:

$$\bar{s}^{f}_{x} = \frac{\sum_{j} a_j \bar{s}_x}{\sum_{j} a_j}, \tag{2}$$

where $\bar{s}_x$ is any of the three average flows (in-, out-, and internal) from Eq. (1), $a_j$ is the number of articles published by the journal $j$, and the sums go over all journals of the field $f$. If we do this for networks in all time windows, in addition to differentiating fields among themselves, we can follow the evolution of citation flow patterns of individual fields over time.
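As a minimal sketch of Eq. (2), the article-weighted average can be computed from a list of journals, each carrying its article count and the flow of the block it belongs to. The function name and the input layout are assumptions made for illustration, and the numbers are invented.

```python
def field_average_flow(journals):
    """Article-weighted average of block flows for one field, as in Eq. (2).

    `journals` is a list of (articles, block_flow) pairs, where `articles`
    is the number of articles a journal published (a_j) and `block_flow`
    is one of the flows of the block that journal belongs to.
    """
    total_articles = sum(articles for articles, _ in journals)
    if total_articles == 0:
        raise ValueError("the field has no published articles")
    weighted_sum = sum(articles * flow for articles, flow in journals)
    return weighted_sum / total_articles


# Two hypothetical journals of one field: a small one in a block with
# internal flow 0.2 and a larger one in a block with internal flow 0.6.
example_field = [(100, 0.2), (300, 0.6)]
average_internal_flow = field_average_flow(example_field)
print(average_internal_flow)
```

Because the larger journal publishes three times as many articles, the result lies closer to 0.6 than to 0.2, which is the intended effect of weighting by articles rather than by journals.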


Fig. 4. Plot of the evolution of the average citation flows, for the example field Engineering. Time (in years) is on the horizontal axis. Relative citation flows of blocks containing journals from the respective field [see Eq. (2)] are shown as shaded regions: bottom part (the darkest) is for internal flow $\bar{s}^{f}_{\mathrm{int}}$, middle (lighter) for incoming flow $\bar{s}^{f}_{\mathrm{in}}$, and top part (the lightest) for outgoing flow $\bar{s}^{f}_{\mathrm{out}}$. The average value of each flow is marked with a black line, and the error of the mean is shown as a transitional shade between regions. The central horizontal line marks 50% of the flow.

Similar to the previous section, we repeat the analysis for structural blocks inferred by the SBM and classification blocks given as classifications of the data (fields and subfields). That is, the average flows $\bar{s}^{f}_{x}$ (in-, out- and internal) of a field $f$ are calculated over all the field’s journals, and the values $\bar{s}_x$ are taken from the blocks, either the structural blocks or the classification blocks, that the journals belong to.

We illustrate the approach used to plot the evolution of the average citation flows of block connectivities in Fig. 4. Time (in years) is on the horizontal axis, while the vertical axis is split by relative citation flows (Eq. (2)) of blocks containing journals from the respective field: bottom part (the darkest) represents internal flow $\bar{s}^{f}_{\mathrm{int}}$, middle (lighter) incoming flow $\bar{s}^{f}_{\mathrm{in}}$, and top part (the lightest) outgoing flow $\bar{s}^{f}_{\mathrm{out}}$. The average value of each flow is marked with a black line, and the error of the mean is shown as a transitional shade between regions. The central horizontal line marks 50% of the flow.

The evolution of the average citation flows or block connectivities for the 16 most prominent fields is shown in Fig. 5. Fields are ordered by the surface area under the curve gained by taking logarithms of numbers of journals in each field. Logarithmic scale is used to mitigate the impact of the exponential increase of the number of journals (cf. Appendix A).

For half of the 16 fields, the internal flows of structural blocks are larger than the internal flows of classification blocks. Mathematics, Economics, Geosciences, Law, Education, Political Science, and Anthropology journals are found to reside in structural blocks that are more community-like than the corresponding classification blocks. This could mean that SBM is able to find their “natural communities”, ones that capture the most of the citation flows, which the classifications alone are not able to achieve. The strength of this effect varies, with the most striking examples being Mathematics, Law, and Political Science. Classification blocks of Economics, and partially Geosciences and Law, are themselves quite community-like. The opposite effect is present for Medicine and Biology, meaning that the structural blocks are less community-like than classification ones. This can be explained by the fact that these fields have a large number of both subfields and journals, so they contain a rich internal structure of blocks with a large flow of citations between them. Their internal citation structure might also be different from other fields, as they contain highly cited papers describing methods and procedures (Small & Griffith, 1974).

The in-flows $\bar{s}^{f}_{\mathrm{in}}$ and out-flows $\bar{s}^{f}_{\mathrm{out}}$ are quite balanced for most fields, with a few notable exceptions. The Multidisciplinary field has noticeably more incoming citations than outgoing, for all hierarchy levels and for both types of blocks. This does not necessarily mean that all journals in the Multidisciplinary field attract citations, but it can be due to the fact that it contains several high profile journals,³ like Nature, Science, and PNAS. Individual journals in the Political Science field receive more citations than they give out, while this difference vanishes for the blocks they belong to. The opposite is true for Medicine, Health, and Environmental (they give out more citations than they receive) and this behaviour remains present for blocks in higher levels.

In the time domain, fields exhibit a wealth of behaviours. Some have relatively stable patterns (Medicine, Chemistry, Economics, Multidisciplinary, and to some degree Health and Political Science), most of the others have large changes around World War II, while some have uniform and steady shifts (Education and Anthropology are the most notable).

The most striking feature is the sudden rise in the share of outgoing flows at the time of WWII, mostly for classification blocks, and to some degree for journals of some fields. The largest rises are, in decreasing order, for: Law, Geosciences, Mathematics, Anthropology, Physics, Biology, and Environmental. In Biology and Chemistry a faint effect is also visible at the level of journals, while in the case of Environmental it is mostly a drop in internal flow. Given that this effect is almost invisible for structural blocks, the observed changes most likely do not arise from a change in journals’ citation patterns, but from a change in the way they are classified. This sudden change correlates with the large increase of the number of subfields (cf. Appendix A).

Some fields exhibit slow but steady change over time, predominantly in the structural blocks. Anthropology journals are citing more external literature and less themselves with time, which is also visible in the smallest structural blocks, to a lesser extent in the structural blocks of the size of subfields, and not at all in the structural blocks of the size of fields. This means that a considerable amount of the citation flow that is increasingly going out of the journals is nevertheless retained inside the structural blocks. Biology, Chemistry, Multidisciplinary, and Education journals show similar behaviour, but their outgoing flows remain stable also in the structural blocks of the size of subfields. A plausible explanation is that as a new journal appears in the field, it “steals” some of the citation flows from the old journals while mimicking their citation pattern towards the rest of the network, which means that they will all nevertheless be put into the same block by SBM. Proving such explanations would require a more detailed analysis.

³ For this reason some authors have excluded these journals altogether in similar analyses (Zhang et al., 2010).

Fig. 5. Evolution of structural and classification block connectivities in time, for the 16 largest fields. Fields are grouped in columns (colours are the same as in Fig. 3) and each row is for a different type of blocks, from top: journals themselves; the first SBM level; subfields classification; SBM level with the closest matching number of blocks to the number of subfields (SBM: ∼subfields); fields classification; and SBM level with the closest matching number of blocks to the number of fields (SBM: ∼fields). Time (in years) is on the horizontal axis, while the vertical axis is split by relative citation flows of blocks containing journals from the respective field: outgoing, incoming and internal flows (from top to bottom, respectively). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)


Fig. 6. Schematic representation of the joint journal-annotations model. Journals with citations connecting them (blue circles and black lines) are augmented with annotations (red squares and grey lines). SBM is fitted onto the whole network such that blocks of journals are separate from blocks of annotations (blue circles and red squares respectively). Note that a journal can have multiple annotations, or it can be unannotated. Here we use subfield and field classifications as the annotations of journals, but any other data on journals can be used. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)

7. Predictive power of subject categorisations

Citation networks can be augmented with a wealth of information about articles, journals, and authors. These include subfields, tags, keywords, author affiliations, etc. In this work we use the classification of journals into subject categories (subfields) provided in the dataset, and we are interested in how much this classification corresponds to the structural blocks found in the citation patterns.

The comparison of two partitions (such as classifications, clusterings, or block structures) is in general often done using some comparison measure, such as the Jaccard index, Omega index, or Variation of Information (Meilă, 2007). It is typical to compare partitions arising from a classification given in the data and the groups arising from the network structure as returned by some community detection method (Bommarito, Katza, & Zelnerd, 2010; Chen & Redner, 2010; Hric, Darst, & Fortunato, 2014; Lancichinetti & Fortunato, 2009; Yang & Leskovec, 2015). This is a viable option in our case as well, but we would have to make a choice of a comparison measure. Because the question of how similar two partitions are is ill-defined, each comparison measure realises it differently and can even return different results (Fortunato & Hric, 2016; Meilă, 2007; Traud, Kelsic, Mucha, & Porter, 2011).

Instead of asking how similar the two partitions are, we ask the question: what can we know about the citations of a journal from its classification? Exactly this question is answered by including node annotations into SBM as is done by Hric et al. (2016), which is based on the notion that annotations on nodes are just meta-information one has about the network: there is no principled difference between the data about connections between two nodes (links) and between a node and its annotations. In literature dealing with community detection in networks this distinction between data and annotations is often made explicit, either by treating annotations as a sort of “ground truth” for groups (Yang & Leskovec, 2012a, 2012b, 2015), or as features that need to be learned by the model (Newman & Clauset, 2016). Here instead, annotations are treated as nodes of a bipartite network consisting of “data” nodes (journals in the citation network) and “annotation” nodes (subfields or fields of the journals), and a connection exists between a data node and each of its annotations (there can be any number of them, including zero).

Fig. 6 illustrates the resulting combined network that consists of two kinds of nodes (journals and annotations) and two kinds of links (citations and journal-annotation assignment). SBM is then fitted with a constraint that each inferred block must contain only one kind of nodes. The benefit of this procedure is that the node annotations contribute to the inferred node blocks, and annotations are also grouped into blocks of “equivalence”.
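The construction of the combined network can be sketched in a few lines. The data layout below (nodes as `(kind, name)` tuples, edges tagged with a layer name) and the function name are assumptions chosen for illustration; the sketch only builds the input and does not perform the constrained SBM fit itself.

```python
def build_joint_network(citations, annotations):
    """Combine citation links and journal-annotation links into one edge list.

    citations   -- iterable of (citing_journal, cited_journal) pairs
    annotations -- dict mapping a journal to a list of its annotations
                   (the list may be empty for unannotated journals)
    Nodes are (kind, name) tuples, so a journal and an annotation that
    happen to share a name remain distinct nodes.
    """
    nodes = set()
    edges = []
    for citing, cited in citations:
        u, v = ("journal", citing), ("journal", cited)
        nodes.update((u, v))
        edges.append((u, v, "citation"))
    for journal, labels in annotations.items():
        u = ("journal", journal)
        nodes.add(u)
        for label in labels:
            v = ("annotation", label)
            nodes.add(v)
            edges.append((u, v, "assignment"))
    return nodes, edges


# Hypothetical toy data: J2 has two annotations, J3 has none.
citations = [("J1", "J2"), ("J2", "J3"), ("J3", "J1")]
annotations = {"J1": ["Physics"], "J2": ["Physics", "Mathematics"], "J3": []}
nodes, edges = build_joint_network(citations, annotations)
print(len(nodes), len(edges))
```

Keeping the node kind inside the node identifier makes it straightforward to enforce, at fitting time, that a block never mixes journals with annotations.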

Working in this framework, the question from the beginning of this section can be formulated as: how much information gain does one get about links of a single node, after learning the node’s annotations? To answer it the following procedure is used. A small fraction of nodes is removed from the network (5% or 100, whichever is smaller), turning them into “extra nodes”: the nodes we are missing the information on, and for which we would like to know our chances of correctly guessing where their links connect to. Then, the blocks are inferred for both the original network (without annotations), and the one described above (with annotations included in an additional layer). The probabilities for a node’s links to connect to nodes that belong to existing blocks are defined only by the block the node belongs to. Without annotations to tell us which block the extra node belongs to, the only thing we know about the node is its degree, and thus our best guess for the probability of this node to belong to a block is to use the size distribution of the blocks, as it is the only information we have. In case we do have the node’s annotations, we know its links in the annotations layer, which narrows our choice of blocks it can belong to and thus raises the probabilities of guessing the correct links. If we denote the probability of guessing all links of node $i$ without knowing annotations by $P_i$ and the same by using annotations by $P_i(\mathrm{ann})$, we can quantify the relative improvement with the predictive likelihood ratio $\lambda_i$:

$$\lambda_i = \frac{P_i(\mathrm{ann})}{P_i + P_i(\mathrm{ann})}. \tag{3}$$


Fig. 7. Node prediction performance, measured by the average predictive likelihood ratio $\langle\lambda\rangle$ for subfields and fields [see Eq. (3)]. The values are calculated for simplified networks (see Section 2.2) for the 14 time slices used previously, with five threshold values $\alpha$: 0.05, ..., 0.25. Each bar corresponds to a single time window, with the height of the bar being the average over $\alpha$ values shown as dots on top of each bar. For each $\alpha$ value, the average and standard deviation over ten samples is shown. Each sample is formed by randomly removing 5% or 100 nodes, whichever is smaller.

The predictive likelihood ratio $\lambda_i$ takes values from $[0, 1]$ and is above 0.5 if annotations improve link prediction power, around 0.5 if they do not change it, and below 0.5 if they decrease it. The average $\langle\lambda\rangle$ of $\lambda_i$ over all sample nodes is the average predictive likelihood ratio for a dataset.
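A small sketch of the hold-out step and of the ratio in Eq. (3) follows. Since the probability of guessing all links of a node is typically tiny, the sketch assumes that both probabilities are available as log-probabilities and evaluates the ratio in a numerically stable way. The function names, the integer rounding of the 5% rule, and the use of log-probabilities are assumptions for illustration; computing the probabilities themselves requires the fitted SBM and is not shown.

```python
import math
import random


def sample_extra_nodes(nodes, rng, percent=5, cap=100):
    """Pick the nodes to hide: `percent` of all nodes or `cap`, whichever is smaller."""
    nodes = list(nodes)
    count = min(len(nodes) * percent // 100, cap)  # rounding down is an assumption
    return rng.sample(nodes, count)


def predictive_likelihood_ratio(log_p_ann, log_p_plain):
    """Ratio P(ann) / (P + P(ann)) from Eq. (3), computed from log-probabilities."""
    diff = log_p_ann - log_p_plain
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    scaled = math.exp(diff)
    return scaled / (1.0 + scaled)


def average_ratio(log_prob_pairs):
    """Mean ratio over all sampled nodes, given (log_p_ann, log_p_plain) pairs."""
    ratios = [predictive_likelihood_ratio(a, p) for a, p in log_prob_pairs]
    return sum(ratios) / len(ratios)


rng = random.Random(0)
hidden = sample_extra_nodes(range(1000), rng)
print(len(hidden))
print(predictive_likelihood_ratio(-10.0, -10.0))
print(predictive_likelihood_ratio(-5.0, -10.0))
```

Equal probabilities give exactly 0.5, and a higher probability with annotations pushes the ratio above 0.5, matching the interpretation given in the text.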

This measure is not sensitive to the total number of blocks or annotations, as it depends only on the “power” of those annotations to predict the block structure. The only important thing is how aligned the annotations are with the structural blocks.

We use the average predictive likelihood ratio to measure the ability of subfield and field classifications to predict journals’ citations in the simplified networks (see Section 2.2). Predicting the links of the simplified networks is equivalent to predicting which journals are the most important sources and destinations for the citations of the extra nodes, because these networks only include the most important links for each node and do not include the actual citation counts for the links. The values for $\langle\lambda\rangle$ are presented in Fig. 7 for each individual time slice, and for five threshold levels $\alpha$: 0.05, 0.1, 0.15, 0.2, and 0.25.

Overall, both subfield and field classification correlate positively with the citation structure. The only exception is the low threshold networks for 1900s, in which knowing the journal’s (sub)field does not help in predicting its citations. Low threshold values in this already small network caused the loss of a large fraction of links and nodes, which lowered the quality of the approximation by the simplified and thresholded network. Higher threshold values do not have this problem.

Subfields become more predictive with time, although there is a slight decline for the last 15 years, a possible reason being some sort of over-specialization of subfields, which does not necessarily correspond to the citation patterns of the journals being classified. Fields, on the other hand, remain a good proxy for large-scale citation structure throughout the whole time period.

7.1. Predictability of individual fields

The method described in the previous section answers the question of how much information we gain about a journal’s citations if we know what subfields it is classified into. We will next divide this question into smaller parts, and ask how much information we gain by knowing that a journal belongs to a specific subfield.

Using the model from Hric et al. (2016) it is possible to calculate how much information gain (for guessing a node’s links) a single annotation provides, in comparison to a case where annotations are assigned randomly. Information gain relative to the random case is defined as predictiveness $\mu_a$; it is defined per annotation block $a$, and it is not affected by the total number of blocks or annotations. Further details and formulas can be found in Hric et al. (2016).


Fig. 8. Predictiveness, measured by $\mu_a$, of the top 16 fields, for classification into fields, for all available years. The values are ten-year sliding averages of the average $\mu_a$ for all time windows using that year. Each panel contains a group of four fields in the order of decreasing $\mu_a$ after 1960. Shaded region in the background is the total span of values.

Here we again consider subfields and fields as annotations of journals. After fitting the SBM onto the two-layered network of citations and annotations, the predictiveness of each block of subfields (or fields) is calculated. Fields in the same block inherit the predictiveness of the block, which follows from the SBM assumption that all annotations in a block are equivalent. By calculating the field predictiveness for all time windows, on top of being able to compare fields to each other, their change in time can also be tracked both relatively and absolutely.

Because the implementation of the SBM fitting algorithm is probabilistic, variations in results are to be expected with small differences in the networks (for instance two 10-year windows with 9-year overlap), and even in different runs of the fitting function on the same network. These variations cause the $\mu_a$ values for networks from adjacent time windows to vary considerably, obscuring more general trends. By taking an average $\langle\mu_a\rangle$ over all time windows that a year belongs to, we are able to overcome these fluctuations. Additional clarity is achieved by ten-year sliding averages of these values.
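The two smoothing steps can be sketched as follows. The sketch assumes that each time window is given by inclusive start and end years together with one predictiveness value, and that the sliding average is a trailing one over the years available; the text does not state whether the sliding window is trailing or centred, so that choice, the function names, and the toy numbers are all illustrative assumptions.

```python
from collections import defaultdict


def average_over_windows(window_values):
    """Average, for every year, the values of all windows containing that year.

    `window_values` is a list of ((start_year, end_year), value) pairs with
    inclusive bounds.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (start, end), value in window_values:
        for year in range(start, end + 1):
            sums[year] += value
            counts[year] += 1
    return {year: sums[year] / counts[year] for year in sorted(sums)}


def sliding_average(series, width=10):
    """Trailing sliding average over the last `width` years that have data."""
    smoothed = {}
    for year in series:
        values = [series[y] for y in range(year - width + 1, year + 1) if y in series]
        smoothed[year] = sum(values) / len(values)
    return smoothed


# Two hypothetical overlapping windows with different predictiveness values.
windows = [((1900, 1901), 1.0), ((1901, 1902), 3.0)]
per_year = average_over_windows(windows)
smoothed = sliding_average(per_year, width=2)
print(per_year)
print(smoothed)
```

The year shared by both windows receives the mean of their two values, which is what dampens the run-to-run fluctuations between adjacent windows.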

Here we present the predictiveness of individual fields, for the case where classification into fields was used. The results for the case where subfields were used instead are presented in Appendix D.

Based on field predictiveness in Fig. 8, the time can be split into three periods: before 1940s, the transition period, and after 1970s. Before 1940s the fields have on average higher predictiveness, but since the data is quite scarce for this period, one needs to be careful when drawing conclusions. In the transition period all the fields have very poor predictiveness, experiencing a rebound after 1970 for all but a handful of fields in the last panels (Engineering, Environmental, Medicine, Biology, Multidisciplinary, and Health). This can be a sign of the major changes science has gone through after WWII.

Mathematics has the highest predictiveness in the third period by a visible margin, while before 1940s the best scoring fields are Engineering and Multidisciplinary. Mathematics rose sharply from the bottom in the transition period to the top in just 25 years, i.e. from 1955 to 1980. This means that citation patterns of Mathematics papers became more characteristic after 1970s, which is picked up by SBM, and Mathematics journals end up in a small number of exclusive blocks. For Engineering it is the opposite: it scored very high before 1940s, did not suffer much in the transition, but never recovered. The same can be said about the Multidisciplinary field: it had an even sharper drop and did not really recover.

On the other side of the spectrum are often large fields (Engineering, Medicine, Biology), or fields related to a large field (Environmental, Anthropology, and Health). Their large size means that they contain rich structure within themselves, which gets detected by the SBM as a large number of blocks. Hence knowing just the field label tells little about the small blocks within the field.
