Published in:
Journal of Informetrics
DOI:
10.1016/j.joi.2018.05.004
Published:
01/08/2018
Document Version
Publisher's PDF, also known as Version of record
Published under the following license:
CC BY
Please cite the original version:
Hric, D., Kaski, K., & Kivelä, M. (2018). Stochastic block model reveals maps of citation patterns and their
evolution in time. Journal of Informetrics, 12(3), 757-783. https://doi.org/10.1016/j.joi.2018.05.004
Regular article
Stochastic block model reveals maps of citation patterns and their evolution in time
Darko Hric, Kimmo Kaski, Mikko Kivelä
∗ Department of Computer Science, Aalto University School of Science, P.O. Box 12200, FI-00076, Finland
Article info
Article history:
Received 2 May 2017
Received in revised form 30 May 2018
Accepted 30 May 2018
Keywords:
Web of science
Citation networks
Evolution of science
Stochastic block model
Abstract
In this study we map out the large-scale structure of citation networks of science journals and follow their evolution in time by using stochastic block models (SBMs). The SBM fitting procedures are principled methods that can be used to find hierarchical grouping of journals that show similar incoming and outgoing citation patterns. These methods work directly on the citation network without the need to construct auxiliary networks based on similarity of nodes. We fit the SBMs to the networks of journals we have constructed from the data set of around 630 million citations and find a variety of different types of groups, such as communities, bridges, sources, and sinks. In addition we use a recent generalization of SBMs to determine how much a manually curated classification of journals into subfields of science is related to the group structure of the journal network and how this relationship changes in time. The SBM method tries to find a network of blocks that is the best high-level representation of the network of journals, and we illustrate how these block networks (at various levels of resolution) can be used as maps of science.
© 2018 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
1. Introduction
The process of creating scientific knowledge relies on publications that are often stored and archived, with the primary purpose of preserving and distributing the knowledge obtained through research. These archives can also be used to study the science making itself, for example, by extracting information of collaborations, citations, or keywords of the published articles. Research in this field has a fairly long and rich history with a wide range of research topics, like the assessment and prediction of performance and quality of individual papers, researchers, institutions, journals, fields, and even countries (Althouse, West, Bergstrom, & Bergstrom, 2009; Lehmann, Jackson, & Lautrup, 2008; Nerur, Sikora, Mangalaraj, & Balijepally, 2005), as well as identification of various large scale structures of science (Boyack & Klavans, 2014; Carpenter & Narin, 1973; de Solla Price, 1965; Leydesdorff, Carley, & Rafols, 2013; Small, 1999; Waltman, van Eck, & Noyons, 2010), journal classification (Janssens, Zhang, Moor, & Glänzel, 2009; Leydesdorff, 2006; Wang & Waltman, 2016; Zhang, Liu, Janssens, Liang, & Glänzel, 2010), following research trends (Chen, 2013; Persson, 2010; Porter & Rafols, 2009), and recognizing the emerging fields or researchers (Cozzens et al., 2010; Lambiotte & Panzarasa, 2009; Shibata, Kajikawa, Takeda, Sakata, & Matsushima, 2011; Small, Boyack, & Klavans, 2014; Small & Greenlee, 1989).
Bibliographic databases, like Web of Science, Scopus, and Google Scholar, store metadata of scientific publications, which can be used to analyse science making at all levels, from large scale structure to performance of individual papers. The number
∗ Corresponding author.
E-mail addresses: darko.hric@aalto.fi (D. Hric), kimmo.kaski@aalto.fi (K. Kaski), mikko.kivela@aalto.fi (M. Kivelä).
and reliably is becoming even more challenging as the networks under study continue to grow.
Conventional data analysis tools, such as clustering or dimension reduction methods, can be used to simplify the data about the complex relationships between the data entities. Representing the entities as vectors of their features is a common and practical abstraction that allows the use of clustering methods in the space of features, in which the most similar entities are grouped based on the similarity of the used features. These vectors can contain citation information between the entities, and one can define similarity measures, like bibliographic coupling, co-citation, distance between citation vectors (Euclidean, cosine, Jaccard, etc.), and correlation coefficients between the citation vectors or publication texts (abstracts, keywords, etc.) (Boyack et al., 2005; Carpenter & Narin, 1973; Janssens et al., 2009; Kessler, 1963; Leydesdorff & Rafols, 2012; Marshakova, 1973; Small, 1973; Wang & Koopman, 2017).
The data of scientific progress can be analysed with a variety of methods once the data has been preprocessed. The dimensionality reduction techniques project the vectors into the most significant subspaces revealing groups of correlated entities (multidimensional scaling, factor analysis, etc.) (Leydesdorff et al., 2013; Small, 1999). Classical clustering techniques, e.g. hierarchical clustering and k-means, operate on the full space of features, and provide clusters of similar entities, based on an implicitly or explicitly defined similarity measure or distance (Boyack et al., 2005; Modha & Spangler, 2000; Punj & Stewart, 1983; Silva, Rodrigues, Oliveira, da, & Costa, 2013; Wang & Koopman, 2017). The factor analysis applied separately to the citing and cited direction of the complete citation matrix enables further specialization into the types of groups it finds, since by using only one direction at a time, it detects groups based on past and future citations, separately (Leydesdorff & Rafols, 2009). The co-citation and bibliographic coupling use similarities in citations in the future and past respectively, and thus provide a separation naturally (Weinberg, 1974). The results of this type of analysis depend on the preprocessing step of constructing the data vectors and similarities, and great care is needed in interpreting the results (Boyack et al., 2005; Gläser, Glänzel, & Scharnhorst, 2017; van Eck & Waltman, 2009).
The bibliometric data can also be analysed by constructing networks, such as the citation network between journals, and directly finding structure in them using the general purpose tools for analysing the networks. The development of such methods within network science has exploded since massive amounts of data on a large variety of networks, such as social and transportation networks, have become available (Boccaletti, Latora, Moreno, Chavez, & Hwang, 2006; Newman, 2003). A prominent way of finding structure in citation networks using these methods is to investigate network clusters or communities (Fortunato, 2010; Fortunato & Hric, 2016; Porter, Onnela, & Mucha, 2009), which are subnetworks that have a large number of links inside them (Chen & Redner, 2010; Lambiotte & Panzarasa, 2009; Lancichinetti & Fortunato, 2012; Radicchi, Fortunato, & Vespignani, 2012; Rosvall & Bergstrom, 2008). The assumption with most of these methods is that the network is constructed from densely connected cores of nodes or journals that have a relatively small number of citations to the rest of the network. This is in contrast to the methods based on similarity of journals that can find groups with a strong preference for receiving or giving citations from a certain subset of journals; for instance, work of applied research can cite theoretical works without being cited back.
Even if one would accept the premise that the community-like structures are relevant in citation networks, many community detection methods are besieged with intrinsic problems. Very often they detect structures even in case of random networks by mistaking noise for data, they might be very sensitive to small perturbations (noise), and they possess a “resolution limit”, i.e. they suffer from the inability to identify communities below a certain size that depends on the total size of the network (Fortunato & Barthélemy, 2007; Guimerà, Sales-Pardo, & Amaral, 2004). The performance, reliability, and even the results to some extent depend on the choice of a method from the large set of currently available methods.
The problems with community detection methods are well-known in the network science literature, and the need to find richer structure in networks than that obtained by partitioning nodes to communities has been acknowledged for many types of networks (Leskovec, Lang, Dasgupta, & Mahoney, 2009; Palla, Derényi, Farkas, & Vicsek, 2005; Rombach, Porter, Fowler, & Mucha, 2014; Wang & Hopcroft, 2010; Xie, Kelley, & Szymanski, 2013). Very recently, as a solution to this problem, the old idea of using stochastic block models (SBMs) as models of network structure (Holland, Laskey, & Leinhardt, 1983; Lorrain & White, 1971; Wasserman & Anderson, 1987) has received renewed attention, because of the theoretical and algorithmic advances that enabled their use in a reliable and scalable way (Bianconi, 2009; Karrer & Newman, 2011; Peixoto, 2012a). SBM is a model in which nodes belong to blocks (the name for groups in the SBM paradigm) and edges are created between (and within) the blocks with some fixed probabilities for each pair of blocks. The methods based on SBMs work by finding the model which best explains the network data. The best explanation is not necessarily the model that would have most likely produced the data, but the simplicity of the model must also be taken into account, and the principled and powerful ideas from statistical inference literature are used to avoid overfitting. One can consider the blocks as “super nodes” that are connected with weighted edges, and SBM methods then, by definition, try to find the “super network” that is the best simplification of the original network.
Here we take advantage of the recent advances in SBM methods found in the network science literature and apply them to large scale citation networks between journals. We use journal citation networks from Thomson Reuters Citation Index® for the years ranging from 1900 to 2013, which contain hundreds of millions of citations. Many studies concentrated on small subsets of the citation network (An, Janssen, & Milios, 2004; Grossman, 2002; Nerur et al., 2005; Pieters, Baumgartner, Vermunt, & Bijmolt, 1999; Porter & Rafols, 2009; Shibata et al., 2011; Zhang et al., 2010), while others were interested in large-scale patterns (Boyack et al., 2005; de Moya-Anegón et al., 2007; Leydesdorff & Rafols, 2009). We focus on the large scale citation networks that are constructed using all articles in this bibliographic data set. First we divide the full time period into the time windows of 5 or 10 years and use the articles in those windows to construct networks of the journals active in each window. That is, we take snapshots of the contemporary science at different points of time and track the important developments by fitting them with hierarchical SBMs. We visualize the resulting block structure at multiple levels of hierarchy, and illustrate the presence of blocks that are not community-like by categorizing them as sources, sinks, bridges, and communities. Moreover, we follow the evolution of these block categories in the 16 largest fields of science in time and report the large-scale changes in them over an observation period of more than a hundred years.
The citation networks can be studied in isolation but they can also be augmented and compared with many other data sources such as journal categorizations, article keywords, and author information. Previous studies have, for example, compared predetermined journal categories to network clusters (Boyack et al., 2005; Janssens et al., 2009) or to factors from factor analysis (Leydesdorff & Rafols, 2009). They have also constructed networks using categories as nodes (Zhang et al., 2010) and evaluated the quality of categorisations using criteria that favour community-like categories (Wang & Waltman, 2016). Here we will utilize a recently developed generalization of the SBM method that allows the inclusion of any “tag” information about the nodes (Hric, Peixoto, & Fortunato, 2016) and use it to analyse how much information the predetermined journal categorizations carry about the group structure of the citation networks. This approach does not assume that the journal classifications are the ground truth, but determines the suitability of subject categories for describing citation structure by asking how much better we can do in estimating the citation flows with the classifications than we can do without the knowledge of the classifications. The construction of contemporary citation networks allows us to track the congruity of the subject categories with citation patterns throughout the last century.
The paper is organized as follows. The process of building annotated journal networks from raw citation data is described in Section 2. The stochastic block models are introduced and described in Section 3. Then the visualization of the citation networks is described and a selection of results is presented in Section 4. More detailed analysis of the properties of journal groups is done in Section 5, while Section 6 deals with their evolution in time. Next a comparison between the subject categories and citation structure is developed and presented in Section 7. Conclusions are made in Section 8. Some basic properties of the data and additional results are presented in Appendices A and D.
2. Data
All the networks constructed in this paper are based on data on articles and citations extracted from three Thomson Reuters Citation Index® data sets (Science Citation Index Expanded™, Social Sciences Citation Index®, and Arts & Humanities Citation Index®). This database contains information about the publishing year and the venue (journal, proceeding, conference, etc.) of articles, and each venue (from now on called journal) is assigned to none, one, or several subfields. We join the subfields into larger fields similar to Parolo et al. (2015). The data set spans from year 1900 to 2013 and contains about 76,000 journals, approximately 5.5M articles, and about 630M citations in total. A more detailed description of the data can be found in Appendix A.
As the full data set spans more than a hundred years, it includes information needed to track development of modern science. We aim to investigate how the citation patterns have evolved during this time period and to that end we split the data into multiple time windows, each of which is then used to construct a contemporary network of journals. The total volume of publications and citations is growing exponentially in time (Pan et al., 2016), and because of this we set the time window length to ten years before 1970s and to five years afterwards.
2.1. Network construction
In each time window, a node corresponds to an active journal that has publications in the given time period. The connections between the journals are constructed using outgoing citations from these journals such that there is a directed link from journal a to journal b if an article in journal a cites an article in journal b, and the weight of this link is taken to be the number of such citations. For each time window we only count the contemporary citations satisfying the following two criteria: (1) the cited article is published in a journal that is active in the time window, and (2) the time difference between the citing and the cited article is shorter than the length of the window. This procedure ensures that all articles in the time window contribute equally (with their citations) to network links.
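The construction rule above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the input layout (`articles` as a mapping from article id to journal and publication year, `citations` as pairs of article ids), the function name, the year-level time resolution, and the requirement that the citing article falls inside the window are all assumptions made for the example.

```python
from collections import Counter

def build_window_network(articles, citations, start, length):
    """Build a weighted, directed journal network for one time window.

    articles:  dict article_id -> (journal, year)   (hypothetical layout)
    citations: iterable of (citing_id, cited_id)    (hypothetical layout)
    The window covers the years start, ..., start + length - 1.
    """
    end = start + length
    # A journal is active if it publishes at least one article in the window.
    active = {journal for journal, year in articles.values() if start <= year < end}
    weights = Counter()
    for citing, cited in citations:
        if citing not in articles or cited not in articles:
            continue
        j_from, y_from = articles[citing]
        j_to, y_to = articles[cited]
        # Assumption: only articles published inside the window send citations.
        if not (start <= y_from < end):
            continue
        # Criterion (1): the cited journal must be active in the window.
        if j_to not in active:
            continue
        # Criterion (2): the citation must span less than one window length.
        if not (0 <= y_from - y_to < length):
            continue
        weights[(j_from, j_to)] += 1
    return active, dict(weights)
```

Because criterion (2) looks back a full window length from each citing article, a cited article may predate the window as long as its journal is still active in it.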
We have also tested a different approach for selecting the contemporary citations where both the citing and the cited article were required to be within the fixed time window. The more strict filtering of contemporary citations brings imbalance to incoming and outgoing citations of articles depending on whether they are published at the beginning or towards the end of the window: those at the beginning have a larger pool of articles they can receive citations from than the pool they can cite,
version of graph-tool we used (2.19dev), in Section 7 we had to use simplified networks (undirected, unweighted, and without self-loops). A naive method of discarding link directions and weights, and removing self-loops, leaves the networks very dense, and is a poor approximation because it regards all links equally important, irrespective of their direction or weight. A usual approach is to set a global threshold on the link weights that keeps only the strongest links, or to use only the links that form a maximum spanning tree (Kruskal, 1956; Macdonald, Almaas, & Barabási, 2005). Both of these are global methods, meaning that the decision on whether a link will be kept or removed depends on the weight distribution and the structure of the full network. We use a local thresholding method, in which statistical significance of weights of links of every node is calculated based on a null model defined for each node separately (Serrano, Boguñá, & Vespignani, 2009). This significance measure is the probability that link weight is compatible with the null hypothesis, in statistical inference known as the p value, but here denoted with α. By keeping only the links that have α value lower than a certain threshold, we are dismissing all links that do not significantly differ from ones created randomly, while those that are kept can be considered significant (not random), and this “significance” is controlled with the value of the threshold. We tested the range of thresholds and found that the results are robust against the change of the threshold value α (see Section 7).
The results are shown for α in the range 0.05, ..., 0.25 that preserve about 6% to about 21% of the most important links (representing about 23% to about 45% of citations) and about 51% to about 99% of nodes, respectively.
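To make the local thresholding concrete, the sketch below computes a per-link α in the style of the disparity filter of Serrano et al. (2009), where a link of weight w attached to a node of degree k and strength s gets α = (1 − w/s)^(k−1). It is an illustration under assumptions, not the authors' pipeline: the edge-dictionary layout and function names are hypothetical, the network is taken to be undirected without self-loops, and a link is kept when it is significant for at least one of its two end nodes.

```python
from collections import defaultdict

def disparity_alpha(edges):
    """Return {(u, v): alpha} for an undirected weighted network.

    edges: dict {(u, v): weight}, each undirected link listed once, u != v
           (hypothetical layout chosen for this example).
    """
    strength = defaultdict(float)
    degree = defaultdict(int)
    for (u, v), w in edges.items():
        for node in (u, v):
            strength[node] += w
            degree[node] += 1
    alpha = {}
    for (u, v), w in edges.items():
        per_node = []
        for node in (u, v):
            k = degree[node]
            if k > 1:
                # Probability of a weight share this large under the null model.
                per_node.append((1.0 - w / strength[node]) ** (k - 1))
            else:
                # A node with a single link gives no local evidence.
                per_node.append(1.0)
        # Assumption: significance at either end is enough to keep the link.
        alpha[(u, v)] = min(per_node)
    return alpha

def backbone(edges, threshold):
    """Keep only the links whose alpha is below the chosen threshold."""
    return {edge for edge, a in disparity_alpha(edges).items() if a < threshold}
```

The decision for each link depends only on the weights around its own end nodes, which is what distinguishes this local filter from a global weight threshold.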
3. Stochastic block model
Networks and graphs can be measured and summarized at many levels of granularity, starting from global or macroscopic measures, such as the total number of links or diameter, to local or microscopic measures such as node degree or the clustering coefficient (Newman, 2010). Here we concentrate on describing networks in a mesoscopic scale that is between these two extremes. Network analysis methods that work at this level of granularity almost exclusively deal with sets of nodes and links called communities or, depending on the field of research, clusters, groups, modules, etc. (Boccaletti et al., 2006; Fortunato, 2010; Schaeffer, 2007; Wasserman & Faust, 1994). There is not a single, precise definition of community, but most often it is described as a set of nodes with more connections between them than to the rest of the network (Fortunato, 2010; Porter et al., 2009). The community paradigm assumes that a network can be described as a collection of tightly knit sets of nodes, which are loosely connected to each other.
Stochastic block model relaxes the assumption about the nature of constituent sets of nodes such that they only need to be equivalent in the way they connect to other groups (called blocks in the SBM paradigm), which in effect allows for a description beyond the community structure, like bipartite, core-periphery, etc. (Barucca & Lillo, 2016; Karrer & Newman, 2011). SBM is a generative model, meaning that it assumes a model of the underlying structure and prescribes a procedure for building networks that have this structure in common. The model is defined by assigning all nodes to disjoint sets¹ and setting the number of links between and within blocks. Obeying the above described constraints, a network is generated by randomly placing links between nodes. An alternative description is to set the probabilities for placing a link between any two blocks, but we used the link counts following the approach laid out in Peixoto (2014b).
Nodes within a block share the probabilities for links towards the nodes in other blocks as well as in their own block. In journal citation networks this means that all the journals in a block have the same citation patterns to other blocks. They can, for instance, receive most of their citations from one set of journals, and give them out to another set, or have higher-than-average probability to exchange citations with some blocks and lower-than-average probability with other blocks. Two blocks could also have identical citation patterns to other blocks, but different numbers of internal citations. All this tells us that this model groups nodes (journals) into classes by their role in the network, whatever those roles are.
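The generative reading of the model can be illustrated with a small sketch that places a prescribed number of directed links between blocks uniformly at random. It is a simplified toy under stated assumptions: the function name and input layout are hypothetical, multi-links and self-loops are allowed, and no degree correction is applied, so it is not the exact ensemble used in the paper.

```python
import random

def generate_sbm_network(block_of, edge_counts, seed=0):
    """Sample a directed multigraph from a block structure with fixed link counts.

    block_of:    dict node -> block label
    edge_counts: dict (source_block, target_block) -> number of links to place
    Both argument layouts are assumptions made for this illustration.
    """
    rng = random.Random(seed)
    members = {}
    for node, block in block_of.items():
        members.setdefault(block, []).append(node)
    edges = []
    for (r, s), count in edge_counts.items():
        for _ in range(count):
            # Every node of a block is equally likely to be an endpoint,
            # which is what makes nodes in a block statistically equivalent.
            edges.append((rng.choice(members[r]), rng.choice(members[s])))
    return edges
```

Inference runs this procedure in reverse: given one observed network, it searches for the block assignment and link counts that explain it best.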
Once the model is known, building networks with the prescribed block structure is straightforward. However, the more common situation is opposite to this: only a single realization of the model of the empirical network at hand is known, and parameters of the model that most likely produce this network need to be inferred. Finding the most likely parameters is a highly non-trivial task, and many approaches to solve it have been used (Wasserman & Anderson, 1987). All approaches use an objective function, in one form or the other, that measures the probability of the given parameters to be the ones that produced the observed network. The problem with this naive approach is that the best fitting model will be too detailed and will reproduce the observed network with very high accuracy, which goes against the purpose of the models in providing a good simplification of the reality. The cause for this is the fact that the simple approach uses all available data, including noise, for fitting the parameters. In the extreme case a highly detailed model ends up putting all the nodes in their own blocks, since this reproduces the network perfectly. A simple solution to prevent this from happening is to introduce the number of blocks as a constraint in the fitting procedure (Karrer & Newman, 2011). This works fine in cases where the number of blocks is known, otherwise it needs to be inferred from data, for instance by finding a balance between the model description length and its goodness-of-fit to data. Some of the approaches in this direction are listed in Latouche, Birmele, and Ambroise (2012). In this work we are taking advantage of the recent progress with this topic that estimates the number of blocks by including the information necessary for describing the model parameters into the total information amount being minimized such that it penalizes a too lengthy description (Peixoto, 2013, 2015).
¹ The assumption about blocks being disjoint sets of nodes can be relaxed (Peixoto, 2015).
One of the main issues with the classical SBMs, in addition to overfitting, is that they assume Poisson-distributed node degrees, so any deviation from the expected structure is considered a feature of the data and the fitting algorithm tries to find a model that would adequately describe it. Fitting this model to a network with a different degree distribution would yield blocks that represent classes of nodes based on their degrees, not just on which other nodes they connect to. So, for instance, highly cited journals would be separated from journals publishing review articles, which presumably cite in large volumes but do not get cited as much. Also, journals that play a central role in their respective communities would be put together, even if communities might not be otherwise related. The degree-corrected stochastic block model is “blind” to this kind of structure: it separates nodes into classes based on their degree, which removes unfair competition for links between nodes of largely different degrees (Karrer & Newman, 2011). This effectively includes the degree sequence into the model that does increase the information needed to describe the model, but the benefit of a simpler model for describing the data lowers the total description length in most cases.
The SBM and its degree-corrected variant can be expanded to describe hierarchical structure of blocks (Peixoto, 2014b). In this case, each level constitutes a network of blocks and the best fit of SBM of the level below it, starting from the network itself at the bottom level, to the trivial one-block level on top. Fitting all the levels is done simultaneously to obtain the minimum description length of the whole structure. This approach shows several advantages, like allowing for even smaller blocks that can reliably be inferred and providing a multi-resolution view of the network. Hierarchical SBM can be viewed as a stack of progressively simpler weighted networks: each level in the hierarchy is a zoomed-out version of the level below it. This means that we can inspect connectivity patterns of blocks of nodes at the desired level of detail/block sizes.
The existence of the smallest identifiable blocks is in community detection literature known as the resolution limit and it is defined as the smallest group that a method is able to identify (Fortunato & Barthélemy, 2007). One of the most commonly used objective functions is modularity (Newman & Girvan, 2004), for which it has been shown that it is not able to separate small groups, even in obvious cases, if the number of links inside them is too small compared to the number of links in the rest of the network (Fortunato & Barthélemy, 2007). It should be noted that SBMs are not completely immune to resolution limit problems (Choi, Wolfe, & Airoldi, 2012; Peixoto, 2013), which presents a problem in the comparative analysis of networks spanning two orders of magnitude, as is the case with our time slices. The smallest detectable blocks scale with the network size, which means that high-resolution levels for large networks would remain hidden from us. Fitting the blocks at all resolutions for the hierarchical SBM at the same time reduces the limit, because each lower level uses blocks from the level above as a constraint, so block inference at the lower levels is in effect done locally in each higher-level block (Peixoto, 2014b).
The criteria and algorithms described above are implemented in the graph-tool Python module, which we use to do all SBM fitting in this work (Peixoto, 2014a).
4. Visualizing citation flows between blocks
Hierarchical SBM levels can be visualized as networks that provide us with a multi-resolution map of the citation network. Network visualization is a powerful tool for visual inspection of complex network data, but its readability and thus usefulness depends on the level of details presented and the total size of the network. In addition, if the network is dense, i.e. average node degree is large, it is even harder to make a clear picture of it. Lower levels contain a lot of information in fine detail, but they are often vast and too dense for visualization to be readable. But not all weighted links are equally important if we are interested in large-scale patterns. This means we can depict only the most important ones, making the network significantly less dense. This is again done by the procedure outlined in Section 2, which keeps only the links that form the backbone of the network (Serrano et al., 2009). For the purpose of this analysis the directions of links are preserved.
Three examples of networks of SBM blocks for the time periods of 1920s, 1960s, and 1995–2000, at the levels most similar to subfields and fields (i.e. with the closest matching number of blocks), are depicted in Fig. 1.
Clustering of blocks of similar fields is quite visible in all six networks. Medicine forms the biggest cluster, followed by Biology, Chemistry, and Physics. Large-scale structure remains similar for all three time periods: Medicine is tightly connected to Biology, which is then connected to Chemistry and Physics. This structure is in accordance with the previously published maps (Leydesdorff & Rafols, 2009; Rosvall & Bergstrom, 2008). The figures also include interesting small-scale details that vary between the resolutions and time windows. For instance, the interdisciplinary blocks are often located at the boundaries between other fields, but this effect is more visible in the subfield resolution level.
The fact that there are multiple blocks with the same dominant field in levels having the most similar number of blocks to the number of assigned fields is a consequence of large heterogeneity of the field sizes and importance. Large fields also have rich internal structure that overshadows small, simpler fields in the process of inferring the best blocks at each hierarchical level. As a first approximation, we have chosen the number of fields as a guide for choosing the most appropriate level as it is simple and intuitive, while another, more complex approach comparing the full size distributions did not provide significantly different results (see Appendix B).
Fig. 1. Networks of SBM blocks for the time periods of 1920s, 1960s, and 1995–2000. The top row shows resolution levels most similar to subfields, bottom to fields. The node shapes and colours represent the dominant field in that block and the node sizes are proportional to the number of articles. Directed links are coloured and sized logarithmically, according to the number of citations they carry. Node and link sizes are normalized per pairs of networks from the same time window, so that these sizes can be compared between the two resolution levels but not across the time windows. The link colours are normalized for each network separately. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)
5. Connectivity patterns of blocks
In contrast to traditional community detection, blocks in SBM are not limited to “dense subgroups” where the citations stay inside the blocks, but a block can also represent a structure where the citations flow out of or into the block, as long as all journals in the block behave in a similar way. We summarize the type of block in terms of citation flows by counting the number of citations entering the block $s_{\mathrm{in}}$ (articles in the block are being cited), leaving the block $s_{\mathrm{out}}$ (they cite articles in other blocks) and internal citations $s_{\mathrm{int}}$ (citing articles in the same block)².
² One can view the system of flows and blocks as a weighted network. In the literature of weighted complex networks (Barrat, Barthélemy, Pastor-Satorras, & Vespignani, 2004), weighted sum of a node’s links is called strength, and for the directed networks it can be in-, out- and internal strength: $s_{\mathrm{in}}$, $s_{\mathrm{out}}$ and $s_{\mathrm{int}}$.
Fig. 2. Properties of blocks depending on their location on the connectivity pattern plot. $\bar{s}_{\mathrm{in}}$ and $\bar{s}_{\mathrm{out}}$ are relative in- and out-flows of citations to/from the block. Due to relation $\bar{s}_{\mathrm{in}} + \bar{s}_{\mathrm{out}} \le 1$ points are confined to regions outlined by grey lines. Regions, as well as red points at special locations, are marked with arrows and annotated with descriptions of the properties of blocks at those locations. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)
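Note that the sum of the three flow counts represents the total activity of the journals in the block. Here we are not interested in the total activities but in the type of flows. We investigate these types by separating the total flow from our flow measures and concentrate on relative in-, out-, and internal flows. These normalized flows are defined as:
$$\bar{s}_{\mathrm{in}} = \frac{s_{\mathrm{in}}}{s_{\mathrm{tot}}}, \qquad \bar{s}_{\mathrm{out}} = \frac{s_{\mathrm{out}}}{s_{\mathrm{tot}}}, \qquad \bar{s}_{\mathrm{int}} = \frac{s_{\mathrm{int}}}{s_{\mathrm{tot}}}, \qquad \text{where } s_{\mathrm{tot}} = s_{\mathrm{in}} + s_{\mathrm{out}} + s_{\mathrm{int}}. \qquad (1)$$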
By normalising the flows by the total activity we reduce the number of free parameters needed to describe blocks’ flow patterns from three to two. That is, the sum of the three relative flows equals one, and knowing two of them is enough as the third one can always be calculated based on them. This means that we can report the two out of the three flow measures that are the most convenient for us.
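The normalized flows of Eq. (1) can be computed directly from a weighted link list and a partition of journals into blocks. The following is a minimal sketch; the function name and the input layout (`weights` keyed by citing and cited journal, `block_of` mapping journals to block labels) are assumptions made for the example.

```python
from collections import defaultdict

def block_flows(weights, block_of):
    """Return {block: (s_in_bar, s_out_bar, s_int_bar)} following Eq. (1).

    weights:  dict (citing_journal, cited_journal) -> number of citations
    block_of: dict journal -> block label
    """
    s_in = defaultdict(float)
    s_out = defaultdict(float)
    s_int = defaultdict(float)
    for (citing, cited), w in weights.items():
        r, s = block_of[citing], block_of[cited]
        if r == s:
            s_int[r] += w      # citation stays inside the block
        else:
            s_out[r] += w      # citation leaves the citing block
            s_in[s] += w       # and enters the cited block
    flows = {}
    for block in set(block_of.values()):
        s_tot = s_in[block] + s_out[block] + s_int[block]
        if s_tot == 0:
            continue           # blocks without any citations have no defined flows
        flows[block] = (s_in[block] / s_tot, s_out[block] / s_tot, s_int[block] / s_tot)
    return flows
```

Since the three values sum to one for every block, plotting only the first two loses no information.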
For visualizing the connectivity patterns of blocks, the choice of $\bar{s}_{\mathrm{in}}$ and $\bar{s}_{\mathrm{out}}$ as x and y coordinates makes it easy to visually assess the block’s properties from its position on the plot as illustrated in Fig. 2. Since the sum $\bar{s}_{\mathrm{in}} + \bar{s}_{\mathrm{out}}$ must be ≤ 1, points can lie only in the triangle bounded by the diagonal (0, 1)−(1, 0). Proximity to the origin tells us how much “self-centred” or “community-like” the block is, while the distances from the axes signify the balance between receiving and giving out citations. It helps to consider four extremal cases for a block, marked with red points in Fig. 2:
(0, 0) Pure community. Journals in this block are isolated from the rest of the network. Articles published in these journals only cite and get cited by articles in journals from this block.
(1, 0) Pure sink. Journals in this block only receive citations and do not cite at all.
(0, 1) Pure source. Journals in this block do not get cited, but cite others.
(0.5, 0.5) Pure bridge. Journals in this block cite and get cited equally, but there are no citations within the block.
In most cases, the values lie somewhere between these extremes. The triangular space can be divided into three regions outlined by grey lines in Fig. 2 that contain blocks with the following properties:
Inner triangle is community-like (“a community in a weak sense” [Radicchi, Castellano, Cecconi, Loreto, & Parisi, 2004]). More than half of the citations pertaining to this block stay within it.
Upper wing mostly cites others.
Lower wing mostly receives citations.
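The three regions can be turned into a simple rule of thumb. The sketch below is one possible reading and not the authors' definition: the inner triangle follows from the stated condition that more than half of the citations are internal, while the boundary between the two wings is assumed here to be the line where relative in- and out-flows are equal; the function name and the returned labels are hypothetical.

```python
def classify_block(s_in_bar, s_out_bar, tol=1e-12):
    """Assign a block to a region of the connectivity pattern plot.

    s_in_bar, s_out_bar: relative in- and out-flows of the block (Eq. (1)).
    The wing boundary s_in_bar == s_out_bar is an assumption of this sketch.
    """
    if s_in_bar < -tol or s_out_bar < -tol or s_in_bar + s_out_bar > 1 + tol:
        raise ValueError("relative flows must be non-negative and sum to at most 1")
    # Inner triangle: more than half of the citations stay inside the block,
    # i.e. the internal share 1 - s_in_bar - s_out_bar exceeds 0.5.
    if s_in_bar + s_out_bar < 0.5:
        return "community-like"
    if s_out_bar > s_in_bar:
        return "upper wing (mostly cites others)"
    if s_in_bar > s_out_bar:
        return "lower wing (mostly receives citations)"
    return "balanced (bridge-like)"
```

With this rule the four extremal points map to the community-like region, the lower wing, the upper wing, and the balanced case, in the order they are listed above.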
In Fig. 3 we visualize the types of connectivity patterns of blocks found in the citation networks. We display the data for three time periods (1920s, 1960s, and 1995–2000) and for three levels of resolution. This gives us an overview of the types of mesoscale structures one can find in the citation networks. A large number of blocks detected by the SBM method fall outside the inner triangle, and are thus not community-like structures even in the weak sense (Radicchi et al., 2004). This is especially evident for high-resolution levels where the vast majority of inferred blocks are not communities. Note that the inclination towards non-community-like structures is a feature of the data as the SBM fitting method we use does not have a preference for any particular block structure that is plausible within the SBM framework.
Fig. 3. Connectivity patterns of structural and classification blocks, for the time periods of 1900–1910, 1950–1960, and 1995–2000. Each row corresponds to a different block type: journals themselves; the first SBM level; subfields classification; SBM level with the closest matching number of blocks to the number of subfields (SBM: ∼subfields); fields classification; and SBM level with the closest matching number of blocks to the number of fields (SBM: ∼fields). Blocks are represented as points (coloured and shaped according to the dominant field in a block) with coordinates being the incoming and outgoing citations as fractions of total citations for each block ($\bar{s}_{\mathrm{in}}$ and $\bar{s}_{\mathrm{out}}$, Eq. (1)). Point areas are proportional to the total citations (strength $s_{\mathrm{tot}}$) pertinent to the block. Blocks with fewer than 10 total citations are not shown. Blocks containing a selected set of journals are annotated. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)
Note that, in the hierarchical structure of blocks, the level that contains larger blocks cannot have less community-like blocks than a level that contains smaller blocks, and for the one block at the top of the hierarchy all citations are internal. This can be illustrated by considering the merger of two blocks: their internal citations remain internal, but part of their external citations that go between them become internal to the merged block (cf. Appendix E). In consequence, the average fraction of internal citations in the block at the higher level of hierarchy can only be equal to or higher than the weighted average of the two blocks at the lower level of hierarchy.
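The merger argument can be written out in a few lines. The derivation below is a sketch in notation introduced here for illustration ($I_X$ for the internal citations and $T_X$ for the total citations $s_{\mathrm{tot}}$ of block $X$, and $c \ge 0$ for the citations exchanged between blocks $A$ and $B$ in both directions); it is not claimed to reproduce the argument of Appendix E.

```latex
% Notation introduced for this sketch (not from the paper):
%   I_X = s_int of block X,  T_X = s_tot of block X,
%   c   = number of citations between A and B, both directions, c >= 0.
% Each of the c citations is counted twice in T_A + T_B (as outgoing for one
% block and incoming for the other) and once, as internal, in the merged block.
\begin{align}
  I_{AB} &= I_A + I_B + c, \\
  T_{AB} &= T_A + T_B - c, \\
  \bar{s}_{\mathrm{int}}^{AB} = \frac{I_{AB}}{T_{AB}}
         &= \frac{I_A + I_B + c}{T_A + T_B - c}
         \;\ge\; \frac{I_A + I_B}{T_A + T_B}
         = \frac{T_A\,\bar{s}_{\mathrm{int}}^{A} + T_B\,\bar{s}_{\mathrm{int}}^{B}}{T_A + T_B}.
\end{align}
```

The inequality holds because adding $c$ to the numerator and subtracting it from the denominator can only increase the ratio, and equality is reached exactly when the two blocks exchange no citations.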
The procedure outlined in this section can be used with any kind of partition of the journals into “blocks”, not just ones inferred by SBM. The journals are explicitly partitioned in the data by subfield and field classification, and we want to compare these partitionings to ones given by the blocks found by the SBM method. For our purposes it is useful to view both of these partitions as block structures, but to avoid confusion we name the blocks as determined by the classification data classification blocks, and the ones inferred by SBM from the citation patterns structural blocks. Journals themselves form elementary blocks, which can be viewed as the “zeroth” level of hierarchical SBM or any other hierarchical block structure. These zeroth level blocks give us properties of individual journals. Journals are assigned to subfields (in the dataset), which are in turn grouped into fields. One would expect this classification to be reflected in citation patterns, since it should group similar journals together. We are able to test this assumption by comparing the properties of artificial blocks, defined by subfields and fields, with blocks inferred from citations. Hierarchical SBM provides us with many levels at different resolutions, and in Fig. 3 we compare the level with the most similar number of blocks to the number of subfields and fields in that network.
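Because the same summary applies to any partition, the flow fractions of classification blocks and structural blocks can be computed with one routine. The sketch below is an illustration under stated assumptions rather than the procedure used in the paper: Eq. (1) is not reproduced in this section, so the code assumes that each flow is normalised by the block total $s_{\mathrm{tot}} = s_{\mathrm{in}} + s_{\mathrm{out}} + s_{\mathrm{int}}$. The function name and data layout are hypothetical.

```python
def block_flow_fractions(citations, block_of):
    """Relative incoming, outgoing and internal citation flows per block.

    citations: dict mapping (citing_journal, cited_journal) -> citation count
    block_of:  dict mapping journal -> block label (any partition: one block
               per journal, subfields, fields, or an SBM level)
    """
    totals = {b: {"in": 0.0, "out": 0.0, "int": 0.0}
              for b in set(block_of.values())}
    for (citing, cited), count in citations.items():
        b_src, b_dst = block_of[citing], block_of[cited]
        if b_src == b_dst:
            totals[b_src]["int"] += count   # citation stays inside the block
        else:
            totals[b_src]["out"] += count   # leaves the citing block
            totals[b_dst]["in"] += count    # enters the cited block
    fractions = {}
    for b, t in totals.items():
        s_tot = t["in"] + t["out"] + t["int"]
        fractions[b] = {k: (v / s_tot if s_tot else 0.0) for k, v in t.items()}
    return fractions


# Toy network of three hypothetical journals in two blocks.
toy_citations = {("A", "B"): 2, ("A", "C"): 1, ("B", "A"): 1, ("C", "A"): 3}
toy_blocks = {"A": 0, "B": 0, "C": 1}
flows = block_flow_fractions(toy_citations, toy_blocks)
```

Passing a partition with one journal per block gives the “zeroth” level, where only self-citations count as internal.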
Individual journals span the whole space of in- and outflow balance (there are many strong sinks as well as strong sources), while the overwhelming majority do not predominantly cite themselves. There are annotated journals of higher prominence that are found in the lower wing, which is to be expected since they receive more citations than they give out.
The first SBM level determines the smallest non-trivial structural blocks, and it is the smallest grouping of journals that cite, and are cited, in a similar way, as they have similar citation patterns towards the rest of the network. Compared with the level of journals the spread of points is reduced in all time periods, although to a varying degree. The reduction becomes stronger with time: there is almost no change in the 1920s, some change in the 1960s and a significant constriction of values for 1995–2000. This could be a consequence of the resolution limit, where smaller details are increasingly hard to capture as the size of the network increases, or the outlying journals carry less information in later years, so they are combined with more moderate ones.
At the level of subfields (the 3rd and 4th rows in Fig. 3), both structural blocks and classification blocks display a pattern where multidisciplinary blocks are located mostly on the cited side. This is expected given that these blocks are dominated by high impact journals publishing high quality articles from a wide spectrum of fields. Comparing the distribution of points for classification blocks versus structural blocks we see that the latter ones are more evenly spread out in the 1960s and 1995–2000, while this is not the case in the 1920s. There are also more community-like blocks in the structural case, in particular in the fields of Economics, Physics, and Mathematics and Geosciences in the 1920s. This means that SBM captures a broader spectrum of block types, while classification blocks tend to be more similar to each other or they have a preference for certain properties.
The highest level of hierarchy we focus on, for both classification blocks and structural blocks, is the level of fields (5th and 6th rows in Fig. 3). Constriction in classification blocks is again present, albeit not so strong as for the subfields. With time, the Multidisciplinary field strongly separates from the bulk, which remains more elongated in the community-bridge direction and leans slightly towards source-like behaviour. Similar to the level of subfields, the Multidisciplinary block tends to become more cited with time for both classification and structural blocks, with this behaviour being more pronounced for structural blocks and the last time period. Medicine and Physics get separated into a community-like block, meaning that most citations remain within their field, which was not the case for the subfields. The SBM finds multiple blocks of mixed fields that are source- and community-like for the 1960s, and only community-like for 1995–2000. Most of the mixed blocks in the 1960s are comprised of unclassified journals, the number of which rises in the second half of the century (see Appendix A for details). For classification blocks these journals are all collected under the “mixed/unknown” field (large white square).
Note that one needs to be careful when comparing blocks across different panels in Fig. 3, as there is no guarantee that blocks with similar qualities in different panels consist of the same journals. Deeper analysis in this direction would require listing all journals in a block, or annotating journals of interest (which is done for a few journals in Fig. 3).
6. Evolution of block connectivity patterns in time
In the previous section we summarized the types of citation flows of individual blocks. We will next build on this summarization method, and quantify the evolution of citation flows of specific fields. To be more precise, we ask the question: what is the expected type of citation flow of the block to which a randomly chosen article of a given field belongs?
The expected flows for a journal in a field can be calculated by collecting all journals belonging to a field and taking a weighted average of the flows of the blocks they belong to. To calculate average flows for articles, the averaging needs to be weighted by the fraction of each journal's publications out of the total number in the field:

$$\bar{s}^{f}_{x} = \frac{\sum_{j} a_{j}\,\bar{s}_{x}}{\sum_{j} a_{j}}, \tag{2}$$

where $\bar{s}_{x}$ is any of the three average flows (in-, out-, and internal) from Eq. (1), $a_{j}$ is the number of articles published by the journal $j$, and the sums go over all journals of the field $f$. If we do this for networks in all time windows, in addition to differentiating fields among themselves, we can follow the evolution of citation flow patterns of individual fields over time.
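The article-weighted average of Eq. (2) can be sketched in a few lines. This is an illustrative sketch only; the function name and the input layout (one pair per journal of the field, holding its article count and the relative flow of the block that journal belongs to) are assumptions introduced here.

```python
def field_average_flow(journals):
    """Article-weighted average block flow for one field, as in Eq. (2).

    journals: list of (articles, block_flow) pairs, one per journal j of the
              field, where `articles` is a_j and `block_flow` is the relative
              flow (in-, out- or internal) of the block containing journal j.
    """
    total_articles = sum(articles for articles, _ in journals)
    if total_articles == 0:
        raise ValueError("the field has no articles in this time window")
    weighted_sum = sum(articles * flow for articles, flow in journals)
    return weighted_sum / total_articles


# Two hypothetical journals: the larger one dominates the field average.
example = field_average_flow([(100, 0.2), (300, 0.6)])
```

Applying the function separately to the in-, out- and internal flows, and repeating it for every time window, yields the three curves per field whose evolution is followed in this section.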
Fig. 4. Plot of the evolution of the average citation flows, for the example field Engineering. Time (in years) is on the horizontal axis. Relative citation flows of blocks containing journals from the respective field [see Eq. (2)] are shown as shaded regions: bottom part (the darkest) is for internal flow $\bar{s}^{f}_{\mathrm{int}}$, middle (lighter) for incoming flow $\bar{s}^{f}_{\mathrm{in}}$, and top part (the lightest) for outgoing flow $\bar{s}^{f}_{\mathrm{out}}$. The average value of each flow is marked with a black line, and the error of the mean is shown as a transitional shade between regions. The central horizontal line marks 50% of the flow.
Similar to the previous section, we repeat the analysis for structural blocks inferred by the SBM and classification blocks given as classifications of the data (fields and subfields). That is, the average flows $\bar{s}^{f}_{x}$ (in-, out- and internal) of a field $f$ are calculated over all the field's journals, and the values $\bar{s}_{x}$ are taken from the blocks, either the structural blocks or the classification blocks, the journals belong to.
We illustrate the approach used to plot the evolution of the average citation flows of block connectivities in Fig. 4. Time (in years) is on the horizontal axis, while the vertical axis is split by relative citation flows (Eq. (2)) of blocks containing journals from the respective field: the bottom part (the darkest) represents internal flow $\bar{s}^{f}_{\mathrm{int}}$, the middle (lighter) incoming flow $\bar{s}^{f}_{\mathrm{in}}$, and the top part (the lightest) outgoing flow $\bar{s}^{f}_{\mathrm{out}}$. The average value of each flow is marked with a black line, and the error of the mean is shown as a transitional shade between regions. The central horizontal line marks 50% of the flow.
The evolution of the average citation flows of block connectivities for the 16 most prominent fields is shown in Fig. 5. Fields are ordered by the area under the curve of the logarithm of the number of journals in each field. The logarithmic scale is used to mitigate the impact of the exponential increase of the number of journals (cf. Appendix A).
For half of the 16 fields, the internal flows of structural blocks are larger than the internal flows of classification blocks. Mathematics, Economics, Geosciences, Law, Education, Political Science, and Anthropology journals are found to reside in structural blocks that are more community-like than the corresponding classification blocks. This could mean that SBM is able to find their “natural communities” (ones that capture the most of the citation flows), which the classifications alone are not able to achieve. The strength of this effect varies, with the most striking examples being Mathematics, Law, and Political Science. Classification blocks of Economics, and partially Geosciences and Law, are themselves quite community-like. The opposite effect is present for Medicine and Biology, meaning that the structural blocks are less community-like than classification ones. This can be explained by the fact that these fields have a large number of both subfields and journals, so they contain a rich internal structure of blocks with large flow of citations between them. Their internal citation structure might also be different from other fields, as they contain highly cited papers describing methods and procedures (Small & Griffith, 1974).
The in-flows $\bar{s}^{f}_{\mathrm{in}}$ and out-flows $\bar{s}^{f}_{\mathrm{out}}$ are quite balanced for most fields, with a few notable exceptions. The Multidisciplinary field has noticeably more incoming citations than outgoing, for all hierarchy levels and for both types of blocks. This does not necessarily mean that all journals in the Multidisciplinary field attract citations, but it can be due to the fact that it contains several high profile journals,³ like Nature, Science, and PNAS. Individual journals in the Political Science field receive more citations than they give out, while this difference vanishes for the blocks they belong to. The opposite is true for Medicine, Health, and Environmental (they give out more citations than they receive) and this behaviour remains present for blocks in higher levels.
In the time domain, fields exhibit a wealth of behaviours. Some have relatively stable patterns (Medicine, Chemistry, Economics, Multidisciplinary, and to some degree Health and Political Science), most of the others have large changes around World War II, while some have uniform and steady shifts (Education and Anthropology are the most notable).
The most striking feature is the sudden rise in the share of outgoing flows at the time of WWII, mostly for classification blocks, and to some degree for journals of some fields. The largest rises are, in decreasing order, for Law, Geosciences, Mathematics, Anthropology, Physics, Biology, and Environmental. In Biology and Chemistry a faint effect is also visible at the level of journals, while in the case of Environmental it is mostly a drop in internal flow. Given that this effect is almost invisible for structural blocks, the observed changes most likely do not arise from a change in journals' citation patterns, but from a change in the way they are classified. This sudden change correlates with the large increase of the number of subfields (cf. Appendix A).
Some fields exhibit slow but steady change over time, predominantly in the structural blocks. Anthropology journals are citing more external literature and less themselves with time, which is also visible in the smallest structural blocks, to a lesser extent in the structural blocks of size of subfields, and not at all in the structural blocks of size of fields. This means that a considerable amount of the citation flow that is increasingly going out of the journals is nevertheless retained inside the structural blocks. Biology, Chemistry, Multidisciplinary, and Education journals show similar behaviour, but their outgoing flows remain stable also in the structural blocks of size of subfields. A plausible explanation is that as a new journal appears in the field, it “steals” some of the citation flows from the old journals while mimicking their citation pattern towards the rest of the network, which means that they will all be nevertheless put into the same block by SBM. Proving such explanations would require a more detailed analysis.

³ For this reason some authors have excluded these journals altogether in similar analyses (Zhang et al., 2010).

Fig. 5. Evolution of structural and classification block connectivities in time, for 16 largest fields. Fields are grouped in columns (colours are the same as in Fig. 3) and each row is for a different type of blocks, from top: journals themselves; the first SBM level; subfields classification; SBM level with the closest matching number of blocks to the number of subfields (SBM: ∼subfields); fields classification; and SBM level with the closest matching number of blocks to the number of fields (SBM: ∼fields). Time (in years) is on the horizontal axis, while the vertical axis is split by relative citation flows of blocks containing journals from the respective field: outgoing, incoming and internal flows (from top to bottom, respectively). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)
Fig. 6. Schematic representation of the joint journal-annotations model. Journals with citations connecting them (blue circles and black lines) are augmented with annotations (red squares and grey lines). SBM is fitted onto the whole network such that blocks of journals are separate from blocks of annotations (blue circles and red squares respectively). Note that a journal can have multiple annotations, or it can be unannotated. Here we use subfield and field classifications as the annotations of journals, but any other data on journals can be used. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)
7. Predictive power of subject categorisations
Citation networks can be augmented with a wealth of information about articles, journals, and authors. These include subfields, tags, keywords, author affiliations, etc. In this work we use the classification of journals into subject categories (subfields) provided in the dataset, and we are interested in how much this classification corresponds to the structural blocks found in the citation patterns.
The comparison of two partitions, such as classifications, clusterings, or block structures, is often done using some comparison measure, such as the Jaccard index, the Omega index, or the Variation of Information (Meilă, 2007). It is typical to compare partitions arising from a classification given in the data and the groups arising from the network structure as returned by some community detection method (Bommarito, Katza, & Zelnerd, 2010; Chen & Redner, 2010; Hric, Darst, & Fortunato, 2014; Lancichinetti & Fortunato, 2009; Yang & Leskovec, 2015). This is a viable option in our case as well, but we would have to make a choice of a comparison measure. Because the question of how similar two partitions are is ill-defined, each comparison measure realises it differently and can even return different results (Fortunato & Hric, 2016; Meilă, 2007; Traud, Kelsic, Mucha, & Porter, 2011).
Instead of asking how similar the two partitions are, we ask the question: what can we know about the citations of a journal from its classification? Exactly this question is answered by including node annotations into SBM as it is done by Hric et al. (2016), which is based on the notion that annotations on nodes are just meta-information one has about the network: there is no principled difference between the data about connections between two nodes (links) and between a node and its annotations. In literature dealing with community detection in networks this distinction between data and annotations is often made explicit, either by treating annotations as a sort of “ground truth” for groups (Yang & Leskovec, 2012a, 2012b, 2015), or as features that need to be learned by the model (Newman & Clauset, 2016). Here instead, annotations are treated as nodes of a bipartite network consisting of “data” nodes (journals in the citation network) and “annotation” nodes (subfields or fields of the journals), and a connection exists between a data node and all of its annotations (there can be any number of them, including zero).
Fig. 6 illustrates the resulting combined network that consists of two kinds of nodes (journals and annotations) and two kinds of links (citations and journal-annotation assignment). SBM is then fitted with a constraint that each inferred block must contain only one kind of node. The benefit of this procedure is that the node annotations contribute to the inferred node blocks, and annotations are also grouped into blocks of “equivalence”.
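The construction of the combined network can be sketched with plain data structures. The code below only builds the two-kind node set and the two-kind edge list and checks the single-kind constraint on a given block assignment; it does not fit an SBM, and it is not the implementation used in the paper. All names are hypothetical, and nodes are keyed by (kind, name) pairs so that a journal and an annotation sharing a name stay distinct.

```python
def build_joint_network(citations, annotations):
    """Combine citation links and journal-annotation links into one network.

    citations:   iterable of (citing_journal, cited_journal) pairs
    annotations: dict mapping journal -> list of annotation labels
                 (the list may be empty for unannotated journals)
    Returns (nodes, edges), where each edge is (u, v, kind_of_link).
    """
    nodes = set()
    edges = []
    for citing, cited in citations:
        u, v = ("journal", citing), ("journal", cited)
        nodes.update((u, v))
        edges.append((u, v, "citation"))
    for journal, labels in annotations.items():
        u = ("journal", journal)
        nodes.add(u)
        for label in labels:
            v = ("annotation", label)
            nodes.add(v)
            edges.append((u, v, "annotation"))
    return nodes, edges


def blocks_are_single_kind(block_of):
    """True if every block holds only journals or only annotations."""
    kinds_in_block = {}
    for (kind, _name), block in block_of.items():
        kinds_in_block.setdefault(block, set()).add(kind)
    return all(len(kinds) == 1 for kinds in kinds_in_block.values())


# Hypothetical toy data: J4 is unannotated and has no citations.
nodes, edges = build_joint_network(
    [("J1", "J2"), ("J2", "J1"), ("J3", "J1")],
    {"J1": ["Physics"], "J2": ["Physics"], "J3": ["Law"], "J4": []},
)
```

In practice the constraint would be imposed during the fit itself; the check above only verifies that a candidate partition respects it.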
Working in this framework, the question from the beginning of this section can be formulated as: how much information gain does one get about the links of a single node, after learning the node's annotations? To answer it the following procedure is used. A small fraction of nodes is removed from the network (5% or 100, whichever is smaller), turning them into “extra nodes”: the nodes we are missing the information on, and for which we would like to know our chances of correctly guessing where their links connect to. Then, the blocks are inferred for both the original network (without annotations), and the one described above (with annotations included in an additional layer). The probabilities for a node's links to connect to nodes that belong to existing blocks are defined only by the block the node belongs to. Without annotations to tell us which block the extra node belongs to, the only thing we know about the node is its degree, and thus our best guess for the probability of this node to belong to a block is to use the size distribution of the blocks, as it is the only information we have. In case we do have the node's annotations, we know its links in the annotations layer, which narrows our choice of blocks it can belong to and thus raises the probabilities of guessing the correct links. If we denote the probability for guessing all links of node $i$ without knowing annotations with $P_i$ and the same by using annotations as $P_i(\mathrm{ann})$, we can quantify the relative improvement with the predictive likelihood ratio $\lambda_i$:

$$\lambda_i = \frac{P_i(\mathrm{ann})}{P_i + P_i(\mathrm{ann})}. \tag{3}$$
Fig. 7. Node prediction performance, measured by the average predictive likelihood ratio $\langle\lambda\rangle$ for subfields and fields [see Eq. (3)]. The values are calculated for simplified networks (see Section 2.2) for 14 time slices used previously, with five threshold values $\alpha$: 0.05, ..., 0.25. Each bar corresponds to a single time window, with the height of the bar being the average over $\alpha$ values shown as dots on top of each bar. For each $\alpha$ value, the average and standard deviation over ten samples is shown. Each sample is formed by randomly removing 5% or 100 nodes, whichever is smaller.
The predictive likelihood ratio $\lambda_i$ takes values from [0, 1] and is above 0.5 if annotations improve link prediction power, around 0.5 if they do not change it, and below 0.5 if they decrease it. The average of $\lambda_i$ over all sample nodes is the average predictive likelihood ratio $\langle\lambda\rangle$ for a dataset.
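The ratio of Eq. (3) is straightforward to evaluate once the two probabilities are available. The sketch below assumes, as an implementation choice not stated in the text, that the probabilities are supplied as natural-log values, because the probability of guessing all links of a node is typically far too small to store directly. The function names are hypothetical.

```python
import math


def predictive_likelihood_ratio(log_p_plain, log_p_ann):
    """Eq. (3) evaluated from log-probabilities.

    log_p_plain: log of P_i, the probability of guessing all links of node i
                 without annotations
    log_p_ann:   log of P_i(ann), the same probability when annotations are used
    """
    # P_ann / (P + P_ann) = 1 / (1 + exp(log_p_plain - log_p_ann)).
    diff = log_p_plain - log_p_ann
    if diff > 0:
        # Rewrite to keep the exponent non-positive and avoid overflow.
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))


def average_ratio(log_prob_pairs):
    """Mean of the ratio over a sample of removed nodes."""
    ratios = [predictive_likelihood_ratio(p, q) for p, q in log_prob_pairs]
    return sum(ratios) / len(ratios)


# Equal probabilities give exactly 0.5; better prediction with annotations
# pushes the value above 0.5.
neutral = predictive_likelihood_ratio(-1000.0, -1000.0)
improved = predictive_likelihood_ratio(-1000.0, -999.0)
```

Working with the difference of log-probabilities keeps the result well defined even when both probabilities underflow to zero in floating point.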
This measure is not sensitive to the total number of blocks or annotations, as it depends only on the “power” of those annotations to predict the block structure. The only important thing is how aligned the annotations are to the structural blocks.
We use the average predictive likelihood ratio $\langle\lambda\rangle$ to measure the ability of subfield and field classifications to predict journals' citations in the simplified networks (see Section 2.2). Predicting the links of the simplified networks is equivalent to predicting which journals are the most important sources and destinations for the citations of the extra nodes, because these networks only include the most important links for each node and do not include the actual citation counts for the links. The values for $\langle\lambda\rangle$ are presented in Fig. 7 for each individual time slice, and for five threshold levels $\alpha$: 0.05, 0.1, 0.15, 0.2, and 0.25.
Overall, both subfield and field classification correlate positively with the citation structure. The only exception is the low threshold networks for the 1900s, in which knowing the journal's (sub)field does not help in predicting its citations. Low threshold values in this already small network caused the loss of a large fraction of links and nodes, which lowered the quality of the approximation by the simplified and thresholded network. Higher threshold values do not have this problem.
Subfields become more predictive with time, although there is a slight decline for the last 15 years, a possible reason being some sort of over-specialization of subfields, which does not necessarily correspond to the citation patterns of the journals being classified. Fields, on the other hand, remain a good proxy for large-scale citation structure throughout the whole time period.
7.1. Predictability of individual fields
The method described in the previous section answers the question of how much information we gain about a journal's citations if we know what subfields it is classified into. We will next divide this question into smaller parts, and ask how much information we gain by knowing that a journal belongs to a specific subfield.
Using the model from Hric et al. (2016) it is possible to calculate how much information gain (for guessing a node's links) a single annotation provides, in comparison to a case where annotations are assigned randomly. Information gain relative to the random case is defined as predictiveness $\mu_a$; it is defined per annotation block $a$, and it is not affected by the total number of blocks or annotations. Further details and formulas can be found in Hric et al. (2016).
Fig. 8. Predictivenesses measured by $\mu_a$ of the top 16 fields, for classification into fields, for all available years. The values are ten-year sliding averages of the average $\langle\mu_a\rangle$ for all time windows using that year. Each panel contains a group of four fields in the order of decreasing $\mu_a$ after 1960. The shaded region in the background is the total span of values.
Here we again consider subfields and fields as annotations of journals. After fitting the SBM onto the two-layered network of citations and annotations, the predictiveness of each block of subfields (or fields) is calculated. Fields in the same block inherit the predictiveness of the block, which follows from the SBM assumption that all annotations in a block are equivalent. By calculating the field predictivenesses for all time windows, on top of being able to compare fields to each other, their change in time can also be tracked both relatively and absolutely.
Because the implementation of the SBM fitting algorithm is probabilistic, variations in results are to be expected with small differences in the networks (for instance two 10-year windows with 9-year overlap), and even in different runs of the fitting function on the same network. These variations cause the $\mu_a$ values for the networks from adjacent time windows to vary considerably, obscuring more general trends. Taking an average $\langle\mu_a\rangle$ over all time windows that a year belongs to, we are able to overcome these fluctuations. Additional clarity is achieved by ten-year sliding averages of these values.
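The two smoothing steps can be sketched as follows. This is a hedged illustration: the text does not specify how the sliding window is aligned, so the code assumes a trailing window that is truncated at the start of the series, and it assumes time windows are given with inclusive first and last years. Function names and the data layout are hypothetical.

```python
def per_year_average(windows):
    """Average a per-window value over all time windows containing each year.

    windows: list of (first_year, last_year, value) with inclusive bounds,
             e.g. the predictiveness of one field in each time window.
    Returns a dict year -> mean value over the windows that include the year.
    """
    sums, counts = {}, {}
    for first, last, value in windows:
        for year in range(first, last + 1):
            sums[year] = sums.get(year, 0.0) + value
            counts[year] = counts.get(year, 0) + 1
    return {year: sums[year] / counts[year] for year in sorted(sums)}


def sliding_average(series, width=10):
    """Trailing sliding average over `width` years (an assumed alignment).

    Years missing from `series` are skipped, so the window is shorter at the
    start of the record.
    """
    smoothed = {}
    for year in sorted(series):
        values = [series[t] for t in range(year - width + 1, year + 1)
                  if t in series]
        smoothed[year] = sum(values) / len(values)
    return smoothed


# Two overlapping hypothetical windows sharing the year 1951.
yearly = per_year_average([(1950, 1951, 1.0), (1951, 1952, 3.0)])
smooth = sliding_average(yearly, width=2)
```

A centred window would be an equally plausible reading of "ten-year sliding averages"; the choice shifts the smoothed curve by a few years but does not change its overall shape.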
Here we present the predictivenesses of individual fields, for the case where classification into fields was used. The results for the case where subfields were used instead are presented in Appendix D.
Based on field predictiveness in Fig. 8, the time can be split into three periods: before the 1940s, the transition period, and after the 1970s. Before the 1940s the fields have on average higher predictiveness, but since the data is quite scarce for this period, one needs to be careful when drawing conclusions. In the transition period all the fields have very poor predictiveness, experiencing a rebound after 1970 for all but a handful of fields in the last panels (Engineering, Environmental, Medicine, Biology, Multidisciplinary, and Health). This can be a sign of the major changes science has gone through after WWII.
Mathematics has the highest predictiveness in the third period by a visible margin, while before the 1940s the best scoring fields are Engineering and Multidisciplinary. Mathematics rose sharply from the bottom in the transition period to the top in just 25 years, i.e. from 1955 to 1980. This means that citation patterns of Mathematics papers became more characteristic after the 1970s, which is picked up by SBM, and Mathematics journals end up in a small number of exclusive blocks. For Engineering it is the opposite: it scored very high before the 1940s, did not suffer much in the transition, but never recovered. The same can be said about the Multidisciplinary field: it had an even sharper drop and did not really recover.
On the other side of the spectrum are fields that are often large (Engineering, Medicine, Biology), or related to a large field (Environmental, Anthropology, and Health). Their large size means that they contain rich structure within themselves, which gets detected by the SBM as a large number of blocks. Hence knowing just the field label tells little about the small blocks within the field.