CLUSTER COMPUTING
Master of Science Thesis
Examiner: Professor Hannu-Matti Järvinen
Examiner and topic approved in the Faculty of Computing and Electrical Engineering Council meeting on 4th of May 2011
TIIVISTELMÄ
TAMPEREEN TEKNILLINEN YLIOPISTO
Degree Programme in Information Technology
Tuominen, Mikko: Energy Efficiency of Data Storage Systems in Cluster Computing
Master of Science Thesis, 42 pages, 5 appendix pages
June 2011
Major: Distributed Software
Examiner: Professor Hannu-Matti Järvinen
Keywords: cluster computing, energy efficiency, SSD, file server, GlusterFS, file system, disk scheduling

Energy efficiency is an important aspect of the development of any technology, and cluster computing is no exception. As the price of energy rises, the cost of running a cluster easily exceeds the cost of acquiring one. A euro saved is worth as much as a euro earned.

This thesis examines and compares different hardware- and software-level solutions used for storing data in cluster computing. Solid state drives are not commonly used in clusters, and one of the goals of this thesis is to determine whether this relatively new technology is suitable for cluster use. The main goal is to understand which factors affect the energy efficiency of a cluster from the data storage point of view. To reach these goals, the performance and energy consumption of a cluster are measured and evaluated with different configurations. The results can be used to optimise the energy efficiency of other clusters.

The thesis is divided into two parts. The literature part covers background related to energy efficiency, data storage models, drives, file systems and disk schedulers. The experimental part presents the test environment and reports and analyses the results. The tests use CERN's CMS software and data produced by the LHC to model heavy physics computing. Both SSDs and conventional hard disk drives are used together with three different data storage schemes: solutions based on a distributed file system, a file server and a local disk.

The results reveal that no significant advantage is gained by using SSDs. Another key result is that a considerable part of the cluster's capacity can remain unused if the file system and disk scheduler are not chosen carefully. The conclusion of the thesis is that although there is no obstacle to using SSDs, their use is not justified when both price and capacity are taken into account. As SSD development progresses, the situation should be re-evaluated. If prices fall and storage capacities grow, the mechanical hard disk may become history.
ABSTRACT
TAMPERE UNIVERSITY OF TECHNOLOGY
Master's Degree Programme in Information Technology
Tuominen, Mikko: Energy Efficiency of Data Storage Systems in Cluster Computing
Master of Science Thesis, 42 pages, 5 appendix pages
June 2011
Major: Distributed software
Examiner: Professor Hannu-Matti Järvinen
Keywords: cluster computing, energy efficiency, SSD, file server, GlusterFS, file system, I/O scheduling

Energy efficiency is an important part of the development of any technology, and cluster computing is no exception. As energy prices rise, the costs of running a cluster can easily overcome the costs of buying one. A euro saved is a euro earned.

This thesis examines and compares different hardware-level approaches and software-level configurations used in clusters to store data. Solid state drives are not commonly used in clusters, and one of the goals of this thesis is to study whether or not this relatively new technology is suitable to be used in clusters. The main goal is to understand what affects the energy efficiency of a cluster from a data storage point of view. To reach these goals, the performance and energy consumption of a cluster are measured and analysed with different system configurations. These results can further be used to optimise existing clusters.

The thesis is divided into two parts. In the literature study part, issues related to energy efficiency, data storage models, block devices, file systems and I/O schedulers are studied. In the experimental part, the test environment is introduced in detail and the results are reported and analysed. The tests are conducted using the CMS software with real LHC data to simulate heavy physics computing. During these tests, both hard disk and solid state drives are used with three different data storage schemes: a distributed approach with GlusterFS (a distributed file system) on compute nodes, a centralised approach with a dedicated file server, and a local approach with drives in the compute nodes of the cluster.

The test results reveal that no significant gain is achieved by using solid state drives. Another key result is that a cluster can suffer a major performance loss if the file system and I/O scheduler are not properly selected. The conclusion of this thesis is that although there is no fundamental reason why solid state drives should not be used in clusters, their use is not justifiable considering their multifold price and low capacity compared to hard disk drives. As the development of solid state drives progresses, a new study will be in order. If prices decline and storage capacities increase, solid state drives could abolish mechanical drives.
ACKNOWLEDGEMENTS
I was fortunate to have the opportunity to be a member of one of the greatest research communities in Europe while writing this thesis. The eight months I lived in France were among the best times of my life. I was able to learn a lot from my colleagues, both off and on duty.

First of all, I want to thank my instructor at CERN, the GreenIT Project Leader Tapio Niemi, who guided me through this process, offered his expertise and help, and was always ready for discussions about my work.

My warmest thanks go to Professor Hannu-Matti Järvinen from the Department of Information Technology, who helped me to finish this thesis.

I want to thank my co-workers Kalle Happonen and Jukka Kommeri for their expertise and technical support on Linux and cluster administration. I also want to thank Lauri Wendland, who was kind enough to share his time and knowledge on physics computing at CERN.

A special mention goes to my family and friends for their love and support, and of course, to my dear Riitta, who has been there for me all these years.

Tampere, May 19th 2011
Mikko Tuominen
mikko.r.tuominen@gmail.com
+358 400 121 332
CONTENTS
1. Introduction
2. Energy Efficiency Optimization of Cluster Computing
3. Data Storage
3.1 Disk types
3.1.1 Hard Disk Drive
3.1.2 Solid State Drive
3.2 Data Storage Schemes
3.2.1 Network-attached Storage (NAS)
3.2.2 Redundant Array of Independent Disks (RAID)
3.2.3 Distributed File System
3.3 Hard Disk and Solid State Drives in Linux
3.3.1 Linux File systems
3.3.2 GlusterFS
3.3.3 Linux I/O scheduling
3.3.4 Read-ahead
4. Previous work
4.1 SSDs on servers
4.2 Scheduling
5. Testing Energy Efficiency
5.1 Test Cluster
5.1.1 Operating System: Rocks 5.3
5.1.2 Hardware
5.2 Test Tools
5.2.1 Computing at CERN
5.2.2 CMSSW
5.2.3 ROOT framework and ROOT files
5.2.4 Measuring Tools
5.3 Conducting Tests
5.3.1 About the performance and energy efficiency
5.3.2 Running tests
6. Results
6.1 Slot size and performance
6.2 Slot size and energy efficiency
6.3 Read-ahead and performance
6.4 Read-ahead and energy efficiency
6.5 File system
6.6 I/O Scheduler
6.7 The best case
7. Conclusion
Bibliography
A. Shell script: Scheduling a set of test runs
B. Shell script: One test run
C. Shell script: Running and timing a CMS TauAnalysis job
D. Power use profile of NAS with both drive types
ABBREVIATIONS
AS Anticipatory Scheduling
CFQ Completely Fair Queuing
CMS Compact Muon Solenoid. A large general-purpose particle physics detector and the name of one of the LHC experiments.
CMSSW CMS Software. A physics software toolkit used to process data from the CMS detector of the LHC.
DFS Distributed File System
FIFO First In First Out
GlusterFS Distributed file system software. Developed by Gluster Inc. GlusterFS is free software, licensed under the GNU AGPLv3 license.
HDD Hard Disk Drive
LAN Local Area Network
LHC Large Hadron Collider. A particle accelerator at CERN.
NFS Network File System. The name of a network file system implementation.
RAID Redundant Array of Independent Disks
SSD Solid State Drive
SSTF Shortest Seek Time First
WAN Wide Area Network
1. INTRODUCTION
An energy efficient computing cluster can help you to save two types of green: money green and the nature type of green. Being energy efficient in any field is good for the environment, let alone in a line of business where thousands and thousands of machines run 24/7. The electricity bill also plays a major role in the financial side of running a computing facility. An environmentally aware image can also be considered good publicity for any company or corporation. Whatever your take on the subject, the fact is that one way to look at green IT is to make more with less: more data, more bandwidth, more calculations and more utilisation with less energy, less heat generated, less wasted resources and less money spent. Several studies, such as Tsirogiannis et al. [36] and Niemi et al. [27], have concluded that the best performing system is very likely also the most energy efficient. Energy efficiency equals savings in operational costs, and improved performance equals savings in hardware costs.

Cluster computing was designed to carry out calculations which would otherwise be too time consuming and practically impossible to perform with a single computer. A typical I/O-intensive computing job performs a relatively small set of operations on a very large set of data. It is also common for such applications to have very large file sizes, from tens of megabytes of data up to a terabyte scale. I/O easily becomes the bottleneck of such a system.

Cluster computing is a tool. It is a tool for thousands of scientists around the world. Like any other tool, it needs to be efficient and reliable. Although a computing cluster can have seemingly vast resources, these resources are constantly in use, as computing clusters are usually very highly utilised. It is common for these clusters to always have a computing job in the queue waiting to get some runtime. It is obvious that any performance gain is definitely considered a positive thing. However, the components used in computing clusters also need to be reliable, and new technologies are not adopted in a hurry. Much of the software used in clusters has already had many updates and newer versions, but cluster admins tend to stick with software that is already proven to be working. This calls for careful understanding of new technologies and thorough testing.
Solid state drives are alleged to be superior to hard disk drives. They consume less electricity, have greater bandwidth, can serve more requests per second and do not suffer from any mechanical delays. They are even resistant to vibrations. Solid state drives are a relatively new technology, which has quickly led to giant leaps forward in some areas of block device development. This quick development has even raised some problems, as some of the software is not keeping up with it. For example, hard disk drives can only serve a few hundred I/O operations per second, a limit imposed by the mechanical nature of the drive. Solid state drives can serve tens of thousands, as they are based on transistor technology and have no moving parts to slow them down. Unfortunately, solid state drives are also more expensive and have notably less storage capacity than existing hard disk drives. On the other hand, the development of solid state drives in these areas has also been quite rapid and they are catching up. There have been studies both for and against the question of whether solid state drives will ever overtake hard disk drives in every aspect.

However, solid state drives are nothing like hard disk drives. The whole technology behind them is fundamentally different. The problem is that current computer systems are designed with hard disk drives in mind. These two drive types cannot be compared without fully understanding their differences and how they might affect the system as a whole. In fact, some of the currently used software components can even degrade the performance of solid state drives. One such example is I/O scheduling, which is designed and implemented to fix some of the weaknesses of hard disk drives. I/O schedulers are an important piece of software: they really do improve the performance of hard disk drives, but they can also really hurt solid state drives.

The purpose of this study is to find out whether or not solid state drives are suitable to be used in cluster computing. Assuming they are, it is interesting to find out whether solid state drives are better than hard disk drives. One big question is also how to measure this. One metric is clearly not sufficient, and excessive optimisation for one type of workload may be of no use for a different kind of workload.

The goal of this study is to understand the significance of solid state drives for cluster computing: to understand what affects the energy efficiency of a computing cluster from a data storage point of view, what is important and what is not, and what the right target for optimisation is. This is evaluated from the perspective of energy efficiency without degrading performance.
The method of exploring these challenges is quite pragmatic: building and running a test cluster with different kinds of hardware setups and software configurations. The test cluster exploits the CMSSW, a physics software toolkit used at CERN to process the data generated by the LHC. A suitable test job is constructed for the test cluster. This test job is run against different sets of system configurations and each test run is measured. These results are analysed in an attempt to find the best possible configuration.

This study does not elaborate on or evaluate the total lifespan of either drive type. The ecological impact of manufacturing either a solid state drive or a hard disk drive is left outside the scope of this study. Another big issue that is not included is the economic aspect. Solid state drives are at the moment a lot more expensive than hard disk drives. Some basic numbers are provided, such as prices of these devices, but no comprehensive assessment is given.

This thesis is structured as follows. Chapter 2 discusses the background of energy efficiency as a concept. Chapter 3 discusses both drive types (block devices) and different data storage schemes. Software components attached to data storage are also covered in this chapter. Chapter 4 summarises the conclusions from previous studies. The information in these chapters (Chapters 2 to 4) is fully derived from different sources and produced by other people. Our role has been to gather these together to provide a sound base to evaluate and analyse the following results.

Chapter 5 discusses the test methods. The test cluster is described in detail, for both hardware and software. The actual practical side of conducting the tests is also presented thoroughly. In the next chapter, Chapter 6, the test results are illustrated and analysed. The chapter is divided into multiple sections and every section discusses a distinct configuration in detail. As the drive type is so essential in this study, both drive types are treated separately within these sections. Finally, in Chapter 7, the most important results of this study are summarised.
2. ENERGY EFFICIENCY OPTIMIZATION OF CLUSTER COMPUTING
Grid computing was designed to carry out calculations which would otherwise be too time consuming to perform with a single computer. Before further discussing how to improve the energy efficiency of a cluster, and especially of its data storage, a little introduction to what affects power consumption is in order.

First, there is the sheer size of the data storage. Data storage capacity and usage efficiency have a direct impact on power consumption. Usage efficiency can be understood as the ratio between the total data storage capacity and the utilised portion of it. The more data is stored, or the more disk space is inefficiently allocated, the more disks are needed and the more energy is consumed. Second, the data transfer rate (I/O bandwidth) and access time also have an effect. The easiest way to improve both is to use disks with a higher rotational speed. Higher I/O bandwidth or lower access times thus require more power than their less time-critical counterparts. Third and last, there are data availability and system reliability. Replicating or backing up a system requires additional components and appliances, which of course require additional power. Improving usage efficiency, minimising the energy consumption of current components and applying new technologies, such as solid state drives, are all potential approaches to a more energy efficient data storage solution. [30]

According to Tsirogiannis et al. [36], the most energy efficient system configuration is also the highest performing one. This is quite intuitive in a cluster environment, where high utilisation rates are expected and average job throughput is what matters. Niemi et al. [27] conducted a study with complementary results indicating that optimising system throughput also improves energy efficiency. They found that it is more efficient to run more than one simultaneous job per processor core on a compute node.

Measurement of energy efficiency is not a simple task. A single metric is not enough to create the full picture. For example, just looking at achieved storage space per unit of energy, i.e. GB/Wh, is not sufficient. Workload characteristics also need to be taken into the equation. Storage devices differ remarkably by performance metrics such as throughput (MB/s), access time or IOPS (I/O operations per second). Also, all distinct sources of energy consumption may not be easily quantifiable or measurable. [30]

Anderson and Tucek [10] also acknowledge the difficulty of measuring and comparing energy efficiency. They propose a scheme to calculate energy efficiency in proportion to alternative implementations. Conveniently, this approach also diminishes the effect of many components consuming constant power regardless of utilisation, which helps to evaluate the gains better. They also remind that micro-optimisation is feeble if order-of-magnitude increases can be obtained with alternative solutions.

One common metric for measuring energy efficiency at the data centre level in the spirit of green IT is the Power Usage Effectiveness (PUE). PUE is the ratio between the power delivered to the data centre and the power actually used by the IT equipment. The difference is explained by noticing that some power is always needed for cooling, lighting, etc., and some is also lost in the power distribution process, e.g. in UPS appliances. Historically PUE has been as high as 2.25 to 3.0, which translates into a 33-44% utilisation rate. Today, a PUE of 1.25 can be achieved by using modern best practices, where 80% of the total facility power is delivered to the IT equipment. This cascade effect of power consumption is illustrated in Figure 2.1. Facebook engineers have reported a PUE as low as 1.07 at full load in their state-of-the-art data centre, where energy efficiency was an important design goal [5]. [38]
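As a simple worked example of the definition above: PUE = (total facility power) / (power used by IT equipment). A PUE of 1.25 therefore means that 1/1.25 = 80% of the facility power actually reaches the IT equipment, whereas a PUE of 3.0 means that only about 33% does; the rest is spent on cooling, lighting and distribution losses.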
Figure 2.1: The Power Cascade Model. Source: SNIA [13]
Solid state drives (SSD) are known to consume less energy than hard disk drives (HDD) due to their non-mechanical design. What makes SSDs even more appealing is that their power consumption depends on the load in a linear fashion [36]. Narayanan et al. [26] criticise recent studies on SSDs for being interested only in performance and not providing any cost-based analysis. They are confident that SSDs will not achieve the capacity per dollar of HDDs. A totally opposite estimate is presented by Schmidt et al. [32], who argued that the annual growth rates in SSD performance and the decline of prices indicate SSDs outperforming HDDs in all aspects in the near future. They also point out that rising energy prices favour this development in a situation where operational costs dominate hardware costs.
3. DATA STORAGE
This chapter is divided into three sections. The first section, Disk types, introduces the physical devices where the bits are stored and the characteristics of these devices. The second section, Data Storage Schemes, discusses several different concepts and models needed to store data in a cluster environment. The third and last section, Hard Disk and Solid State Drives in Linux, discusses the software needed to make all this happen from an operating system point of view. All of these can have an effect on the overall performance and energy efficiency of a cluster.

3.1 Disk types
Both hard disk and solid state drives are used as block devices. A block device is a storage component that offers an interface for block-level operations. A block is an abstraction between a block number and the physical representation of data on the device. The operating system uses a Logical Block Address (LBA) as a parameter for targeting data in I/O operations. The block size of the device needs to be a multiple of the sector size of the HDD. This is discussed more thoroughly in Sections 3.1.1 and 3.3.
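On Linux, the sector size a drive reports can be checked from the command prompt with the blockdev utility; as a small illustration (the device path /dev/sda is only an example, and the physical block size query requires a reasonably recent util-linux), the logical sector size and physical block size are printed by:
blockdev --getss --getpbsz /dev/sda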
3.1.1 Hard Disk Drive
A Hard Disk Drive (HDD) is composed of multiple magnetic platters, which can be either read or written by using a disk head. It is common to refer to these magnetic platters as heads, as there is usually only one disk head per platter. The actual disk head is attached to a disk arm, which is used to move the disk head on top of the right track. A track is a collection of bits sharing the same radius from the centre of the disk, thus forming a circle on the platter. Tracks that share the same radius on different platters are said to form a cylinder. When the disk spins, the read head, while held stationary, can access the bits on the track in a sequential manner. A track is divided into sectors. A sector is the smallest unit of data that can be written to an HDD. Typically, the size of the sector is set by the manufacturer and cannot be changed. A very common sector size in the industry is 512 bytes, which has become the de facto standard. Recently manufacturers have also introduced HDDs with 4 kB sector sizes, but these have some severe compatibility issues.

When an HDD receives an I/O request, it transforms the logical block address into a physical address, e.g. a tuple of cylinder, head and sector numbers. Common consumer-grade HDDs and their capacities are presented in Table 3.1. The Velociraptor from Western Digital is a high-performance HDD and is listed as a point of reference for SSDs.
Table 3.1: HDD capacities. All drives are 3.5" and SATA II. Prices: www.newegg.com (cited 1-Feb-2011)
Manufacturer Model Size Price GB/$
Hitachi Deskstar 1 TB $54.99 18
Samsung EcoGreen 1 TB $38.99 26
Seagate Barracuda 2 TB $69.99 29
Western Digital Caviar Green 1 TB $44.99 22
Western Digital Caviar Green 2 TB $99.99 20
Western Digital Caviar Green 3 TB $209.99 14
Western Digital Velociraptor 300 GB $169.99 1.8
Table 3.2: SSD capacities. All drives are 2.5", MLC and SATA II. Prices: www.newegg.com (cited 1-Feb-2011)
Manufacturer Model Size Price GB/$
Corsair Force F40 40 GB $104.99 0.38
Corsair Force F120 240 GB $439.99 0.55
Intel X25-M 120 GB $229.99 0.52
Kingston SSDNow V Series 128 GB $224.99 0.57
OCZ Agility 2 160 GB $299.99 0.53
3.1.2 Solid State Drive
A Solid State Drive (SSD) is a mass storage device based on NAND flash memory technology. A flash memory consists of readable and reprogrammable transistors, i.e. memory cells. The memory cells preserve their state during a power outage. Data is stored in these cells as voltage levels. If the cell has only two voltage levels and thus represents only one bit, it is called a Single Level Cell (SLC). If the cell can distinguish four voltage levels (or more) and thus represent two bits (or more), it is called a Multi-Level Cell (MLC). Flash memory is discussed more thoroughly later, but first a little insight into how an SSD operates.

An SSD is composed of many flash memory chips. Each chip is composed of pages grouped into erase blocks, which should not be confused with the block layer blocks discussed earlier. An SSD has three basic operations: read, reprogram (write) and erase. The smallest unit of data for a read or write operation is the page, which is typically 512 - 4096 bytes. Only fresh pages can be reprogrammed, so every dirty page must be properly erased before it is reusable. The smallest erasable unit of data is the block, which can hold up to 128 pages or 512 kB of data. SSDs do not actually have physical sectors, but sometimes a page can be thought of as being divided into logical sectors. The reason is that, for historical reasons, applications assume that a block device has 512-byte sectors.
Reading a 4 kB page generally takes around tens of microseconds and writing hundreds of microseconds. SLC-based devices are generally faster than MLC-based ones. The real penalty comes from erasing a block, which takes 1.5 - 2 ms. Thus reading is an order of magnitude faster than writing and two orders of magnitude faster than a write that needs an erase operation. SSDs (and other flash memories) use a technique called the Flash Translation Layer (FTL) to overcome this problem.

The FTL reduces the effect of time-consuming write operations by reserving redundant blocks or pages and hence avoiding the costly erase operation when data is being updated. The downsides are increased overhead for address translation information and an increased amount of flash memory operations. Of course this does not solve the problem completely, as it only delays the erasing process [22]. This is why the trim operation was introduced on SSDs. Its purpose is to erase unused pages in the background. As mentioned earlier, a single page cannot be erased, as the smallest erasable unit is the block. So the drive has to read the data from a block into a cache, erase the whole block and then rewrite the data back into the block. [2]

Where SSDs really excel over HDDs is random access. The Intel X25-M SSD can reach up to 35,000 IOPS (I/O operations per second) for random reads and 8,600 IOPS for random writes [20]. For comparison, a high-performance HDD, the "WD Velociraptor", can perform fewer than 250 IOPS for both random reads and random writes. The relatively low IOPS count of HDDs derives from mechanical delays and cannot be significantly improved. SSDs can interleave read and write operations, and hence the overall performance of the device can be better than that of a single flash memory chip. [9]
SSDs have one clear technical weakness compared to HDDs: the write-erase cycle of a memory cell is limited. An SLC can be reprogrammed around 100,000 times and the more complicated MLC only 10,000 times before it wears out [31]. This is why modern SSDs come with something called wear leveling. Wear leveling allows the erase counts of blocks to be evenly distributed over the storage medium in an attempt to increase the endurance of an SSD. Dynamic wear leveling is an algorithm by which the controller in the SSD recycles blocks with small erase counts in the flash memory [3].

The biggest obstacle SSDs face on their way to becoming a widely adopted and respectable alternative to HDDs is their price. Comparing SSDs and HDDs just by looking at how many gigabytes a dollar can buy, an SSD is approximately 50 times as expensive as an HDD, as seen from Tables 3.1 (about 25 GB/$ for HDDs) and 3.2 (about 0.5 GB/$ for SSDs). However, fully electronic SSDs are known to consume less power than partly mechanical HDDs [22].
3.2 Data Storage Schemes
Before discussing the different options for data storage schemes, one terminological distinction needs to be pointed out. When the term distributed is used in the context of data storage, it deliberately refers to a data storage scheme where the actual data is distributed over multiple machine instances, in contrast to client/server-type distribution. The difference is vague, as in a distributed environment the backend implementation is not necessarily transparent to the client. For example, a simple file server can internally exploit other services which reside on other physical machines. Also, many schemes providing a distributed data model can have a frontend machine that works as a single entry point and makes them appear to be a single system. In fact, in some cases it can even be technically possible to run such a system on a single machine instance. So basically the definition is based on how the system is meant to be used.
3.2.1 Network-attached Storage (NAS)
A Network-attached Storage (NAS) is by definition a data storage accessible over the network. NAS is based on the client/server model and provides file-level data access. A NAS appliance is equipped with a high-speed network interface and hardware capable of storing vast amounts of data. Terminologically, subtle differences between a NAS appliance and a conventional file server can be distinguished. NAS is designed for high performance and usually offers customized and pre-configured software and vendor support, which make it easy to deploy and administer. Here the terms "NAS appliance" and "file server" are used interchangeably, as there is little pragmatic difference from the end user's point of view. NAS exploits network file system techniques to provide data access for client machines.

A network file system is a file system that is hosted on a remote machine and is accessible over the network. More precisely, it is a protocol for accessing the remote file system. Network file systems are based on the client/server model and are usually stateless, although stateful network file systems also exist. Stateless means that the server provides the file system as is and keeps no record of the state of individual files. This introduces a couple of pros and cons. A stateless design simplifies the system architecture, but it also brings out some synchronisation problems and degrades consistency. This can be a serious problem if a high level of reliability and data integrity is required. Statelessness must be acknowledged and handled at the application level. An approach with a centralised server makes it easier to control and back up files, as they all reside in a single system. The clients can mount the network file system like any other conventional file system. The presence of the network is hidden, and files are transferred to the local machine only when needed. Alas, it also makes the server a single point of failure, thus eliminating it as an option for applications with low fault tolerance for accessibility.

The best-known and most common network file system in the Linux environment is the Network File System (NFS). The basic idea of NFS is, from a client's point of view, to emulate the behaviour of a local, mounted file system even though the disks are not physically present. NFS is said to be inadequate to scale for systems over 100-1000 nodes, i.e. NFS clients. However, this heavily depends on the use profile of the system and the application characteristics. Read-intensive applications have better success than write-intensive ones. After all, NFS is not meant to serve applications which require high availability. There are also some concerns about the security of NFS. In a cluster environment this is rarely an issue, as clusters tend to reside in a private network, excluding the frontend machine. As a whole, NFS is popular, widespread, easy to install and widely supported, which makes it the best choice of a data access implementation technique for the NAS subsystem. [14], [33]
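As a minimal sketch of such a setup (the export path, subnet and host name below are only illustrative examples), the file server lists the exported directory in /etc/exports and a client then mounts it like a local file system:
/export/data 10.0.0.0/24(rw,sync)
mount -t nfs fileserver:/export/data /mnt/data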
3.2.2 Redundant Array of Independent Disks (RAID)
It is common for NAS appliances to exploit RAID technology. A Redundant Array of Independent Disks (RAID) is a scheme designed to improve both the reliability and performance of disk access. RAID can be implemented by using either a hardware- or software-based solution. In a hardware RAID, the server machine is equipped with a specific RAID controller, which receives the I/O requests from the OS and redirects them to the physical disks. For the OS, only one large block device is visible. With software-based RAID, the independent disks are visible to the OS and a virtual disk is created upon them. RAID is perceived to be reliable when it comes to storing data, but not necessarily in terms of accessibility. This is especially true when using the NFS protocol, but inaccessibility can also stem from network or power failures [14].

One important technique used by RAID is striping. Striping means that data is sliced into fixed-length chunks, which are dispersed over multiple disks. When the data is accessed, the I/O request can be handled in parallel by multiple disks, which improves performance significantly. There are many levels of RAID, each with its own trade-offs. RAID-0 provides plain striping with no data redundancy. RAID-1 is similar to RAID-0, but it also provides mirroring. Mirroring means that all data is sent to several (usually only two) disks as a safety precaution. This setup provides excellent data reliability and performance at the cost of disk space. RAID-5 provides data parity, which means that for every block striped, a parity block is calculated and stored on a different disk. If one disk fails, the data on the failed disk can be reconstructed by using the parity information. RAID-6 is similar, but it doubles the amount of parity and hence can tolerate two failed disks. [19]
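As an illustration of a software-based RAID (the device names /dev/sdb, /dev/sdc, /dev/sdd and /dev/md0 are only examples, and the commands overwrite the drives' contents), a RAID-5 array can be assembled in Linux with mdadm and then used like any single block device:
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
mkfs.ext4 /dev/md0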
3.2.3 Distributed File System
As mentioned earlier in Section 3.2, the definition used in this study for distributed data storage refers to truly distributed data. In contrast to network file systems, where all data is stored on a single machine, a distributed file system (DFS) runs on multiple machine instances. A DFS can have a centralized or decentralized architecture. In a centralized architecture, the client connects to a master server. The master server is responsible for keeping the file system metadata and redirects the I/O requests to other servers, i.e. data servers. The data server then provides the actual data for the client. This architecture of course makes the master server a single point of failure, and it easily becomes the bottleneck of such a system. Hence decentralized DFSs are also available. A decentralized architecture can be implemented, for example, in a peer-to-peer manner, where the file system metadata is distributed as well.

The distributed nature of DFSs varies, as a DFS can reside in a single server rack connected via a high-speed LAN or be geographically distributed over a WAN.

A DFS is said to be a parallel file system if the data of a single file is distributed to many different servers. This approach has its pros and cons. The performance of reading or writing, especially of big files, can be improved significantly as more servers can handle the I/O. On the other hand, a seemingly simple operation such as a directory listing can be extremely slow, as each server needs to be consulted. DFSs can also be configured to provide data replication to improve accessibility and reliability, or data striping (as in the RAID systems discussed in Section 3.2.2) to improve performance.
3.3 Hard Disk and Solid State Drives in Linux
To permanently store data, more is needed than just the physical devices: the presence of an operating system is required. A typical data storage scheme can be divided into four layers: the device layer, the kernel layer, the file system layer and the application layer. A block layer can also be distinguished between the kernel and the file system [28]. The goal is to provide abstraction between the layers, to hide the implementation and technical details from the user, and to provide interfaces to better support interoperability of a variety of hardware and software components. Optimization of such a system can take place on any of these layers. Hard disk and solid state drives can be seen as part of the device layer. Between the physical device and the kernel are device drivers, which are part of the kernel. The purpose of the device drivers is to hide the differences between the vast variety of devices from the kernel. The kernel can then treat any device in the same way through a device driver interface. [19]

The kernel and file system layers are the most interesting ones, as they provide most of the easily configurable parameters. In the Linux kernel there is a component called an I/O scheduler. An operating system does not really require any I/O scheduler to operate, as I/O requests can be serviced in a FIFO-like queue manner. This, however, is not the optimal solution in most cases, and use of an I/O scheduler can improve the performance of the I/O dramatically. The Linux I/O scheduler adds an interface between the block layer and the device layer. [28]

When discussing disk performance, two terms need to be distinguished: the response time and the access time of a disk. The response time is the time an I/O request needs to wait before it is served after it was submitted. The access time of an HDD is the sum of the seek time and the actual transmission time. The seek time consists of the disk arm transfer and the spin delay, or rotational latency. Before reading or writing can happen, the disk head needs to be positioned at the beginning of the right sector on the disk. The seek time derives from moving the disk arm onto the right track and then waiting for the disk to spin so that the correct sector is under the disk head. Transmission time is usually considerably less than seek time. Seek time can be minimised by intelligent positioning of the data on the disk and also by performing disk read or write requests in the best possible order. The former is done by the file system and the latter is called I/O scheduling. [19]
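As a back-of-the-envelope illustration of the spin delay: on a 7200 RPM drive one revolution takes 60 s / 7200 ≈ 8.3 ms, so the average rotational latency (half a revolution) is about 4.2 ms, which has to be paid on top of the arm transfer time for every random access.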
3.3.1 Linux File systems
A file system is an abstraction to map data blocks on a block device, such as an HDD or SSD, to meaningful files for the operating system. A file system uses data structures called inodes to save information about the files (metadata). An inode contains information about the owner of the file, an access control vector, timestamps for file creation and modification, the file size, the type of the file (e.g. directory, regular file, link, etc.) and pointers to the actual data on the device. [14]

The usability of a file system can be measured by two common metrics. The first is how efficiently a file system stores files, i.e. how much space is wasted. The second is how efficiently data can be transferred. Using bigger disk blocks can improve the transfer rate as more data is handled at once, but also more disk space is wasted, as the last block of a file is rarely completely full.

Most file systems today are journaling file systems. A journaling file system means that the file system keeps a journal of its writes in case of failures in the writing process. When data is written to a drive, the metadata also needs to be updated. If the data on the drive and the metadata are out of sync, the file system is said to be corrupted. This can occur, for example, in the case of a sudden power outage if only either the data or the metadata was updated, but not both. To increase throughput performance, drives usually exploit drive caches heavily, which can delay the writes and cause the drive to be out of sync. When the drive gets back online, the file system can go through its journal and replay every step to fix a possibly corrupted file system. Without journaling, the whole file system would need a consistency check, which would be a drastically slower operation. One of the primary concerns with all file systems is the speed at which a file system can be validated and recovered after corruption.

The most popular file system in Linux during the first decade of the 21st century was the Ext3 file system, which is still widely used. Ext3 is the default file system for the Rocks cluster software. Ext3 is a journaling file system with a maximum volume size of 16 terabytes.

Ext4 is the successor of the Ext3 file system. The main motivation for developing the new version was the 16 TB volume size limit of Ext3, which stems from 32-bit block numbers. Ext4 uses 48-bit block numbers and can have volumes up to 1 exabyte with a 4 kB block size. Ext4 also incorporates scalability and performance enhancements. The Ext4 developers provided benchmark results which show improvement especially in write I/O requests. The dominating role of Ext3 is acknowledged, and an upgrade to Ext4 is easy and can be made without losing data. Ext3 is however perceived as reliable and stable and is thus still the file system of choice in many systems which do not need support for larger volume sizes. [25]

XFS is a file system created in the mid-1990s by Silicon Graphics Inc. for their own IRIX OS, but it was later ported to Linux. XFS is also a journaling file system. XFS was designed to be scalable and to support large file and directory sizes. The maximum volume size of XFS is 16 exabytes. [35]
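Creating any of these file systems on a drive is a one-line operation; for example, either Ext4 or XFS can be created with the following commands (the partition /dev/sdb1 is only an example, and the command destroys its existing contents):
mkfs.ext4 /dev/sdb1
mkfs.xfs /dev/sdb1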
3.3.2 GlusterFS
GlusterFS is a distributed file system, developed by Gluster Inc. and provided under the GNU AGPL v3 license. The GlusterFS architecture is based on a peer-to-peer model. Server machines share part of their disk space, called a brick, into a collective data pool. These bricks are then used to create virtual data volumes. Data mirroring and data striping are both supported. On the servers, data is stored on local file systems. Actually, what a server shares is a directory, which becomes the root of the brick on that partition. Note that any free space can therefore be used either by GlusterFS or by the local file system, and the size of a GlusterFS volume thus changes dynamically. Bricks can be added and removed on the fly without disturbing the system. In case of resource removal, the data hosted by that node is migrated to another location automatically.

To access these data volumes, client software is needed. The data volume is mounted as part of a local file system with FUSE, Filesystem in Userspace. FUSE is an API emulating the behaviour of a conventional file system. Each client has a dummy version of the directory tree of the volume (a file system). It contains the metadata (inode) information, but the file size is zero. The actual data is distributed by using a hash calculated from the name and directory path of the file. Each file is thus mapped to a particular virtual subvolume. These virtual subvolumes are mapped to specific bricks, i.e. hosts. Using the hash algorithm, a file name can now be connected with the host storing the actual file data. When a file is renamed, a pointer is created on the new host to redirect to the old location while the data is migrated to the new location in the background. When the data transfer is complete, the pointer can be removed.

Any particular machine can act both as a server and as a client at the same time, i.e. run both the server and the client software. Other features of Gluster are load balancing, failover recovery, I/O scheduling, caching and quotas. Gluster supports InfiniBand and Ethernet (TCP/IP) for networking. [4]
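As a rough sketch of this workflow with the Gluster command line tools of that era (the host names, brick paths, volume name and mount point are only examples), a replicated volume is created from two bricks, started, and then mounted on a client via FUSE:
gluster volume create testvol replica 2 server1:/export/brick1 server2:/export/brick1
gluster volume start testvol
mount -t glusterfs server1:/testvol /mnt/gluster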
3.3.3 Linux I/O scheduling
An I/O scheduler is a kernel component which controls the I/O queue and uses a scheduler-specific algorithm to arrange incoming I/O requests. When an I/O request is received from a file system through the block layer interface, the I/O scheduler inserts it into the queue and eventually passes it to the disk controller through the device driver interface. [28]

Linux can be said to be optimised for magnetic disks [21]. This section discusses primarily the scheduling of HDDs in a Linux environment. Scheduling with SSDs is discussed in Section 4.2.

The current Linux kernel 2.6 has four built-in schedulers. They are called noop, anticipatory, deadline and cfq. The cfq is the current default scheduler. These schedulers are discussed later in detail, but first a little insight into how the disk controller operates.

Disk usage can be optimised by trying to minimise the disk arm transfer, i.e. the seek time. Common algorithms are called FIFO, SSTF, SCAN and C-SCAN. FIFO (First In First Out) does no optimization. SSTF (Shortest Seek Time First) always serves the request closest to the current disk head position. This can lead to a situation, especially on a device under heavy load, where the disk head keeps servicing requests on nearby disk blocks while other requests on the outer edges of the disk face long waiting periods or even starvation. Starvation is a state where a process is waiting for an event that never happens. SCAN simply scans the disk from one edge to another, turns back whenever it reaches the inner or outer edge of the disk and starts to scan the disk in the other direction. SCAN is sometimes referred to as the elevator algorithm due to its operation logic being similar to that of elevators. C-SCAN is like SCAN, but with the difference that it always scans the disk in the same direction. When the arm reaches the edge of the disk, it is moved to the opposite edge with one long disk arm transfer. SCAN and C-SCAN are not affected by starvation. Of the Linux 2.6 basic schedulers, the noop is based on FIFO and the others on SCAN-type disk arm transfer algorithms [21]. [19]
The purpose of an I/O scheduler is to improve performance either by increasing the total bandwidth of the disk or by reducing the access time of individual I/O requests. I/O schedulers use operations called sorting and merging of I/O requests as tools to minimize disk seek times. The sorting operation orders requests based on their sector number and inserts incoming requests into their right place in the queue. This way, if the disk is used with either a SCAN- or C-SCAN-based scheduler, no unnecessary disk arm movement has to be made. Merging merely means that requests from different processes to the same data block are recognised and served together. It also has to be noted that read operations are usually synchronous, as a process is waiting for them to finish. Write operations, on the other hand, are usually asynchronous, which means they do not need to be served immediately and can be stored temporarily in a cache. [28]

The simplest I/O scheduler in the default Linux kernel 2.6 is the noop I/O scheduler. Noop has minimal overhead and does only basic merging and sorting of I/O requests. Noop can be a good choice when an HDD is not used directly: either the scheduling is done somewhere else than inside the Linux kernel, or a non-mechanical drive, such as an SSD, is used. RAID controllers do their own scheduling, and the Linux kernel does not have any knowledge of the actual disk states in a RAID array. Therefore the Linux kernel can only interfere by doing additional I/O request sorting. Merging of requests is of course desirable. SSDs, on the other hand, have no moving parts and therefore do not suffer from seek time delays. [28]

The Deadline scheduler implements sorting of requests, but also implements an expiry time for each request. The basic idea is aggressive reordering of requests while at the same time making sure no request has to wait too long to be served. If a request is about to expire before it is served, deadline starts to serve that request immediately. Read requests are given higher priority than write requests, but nonetheless deadline mixes write requests in with read requests even when there are more pending read requests. Deadline makes a compromise between high throughput and low I/O request response time. [12, 28]

The Anticipatory (AS) I/O scheduler behaves like deadline, but adds a feature called anticipation. Anticipation derives from a situation called deceptive idleness. Deceptive idleness happens when a read operation finishes and the process which requested it continues execution only to make a consecutive read request. Normally the disk arm would already have moved to another position, but now the disk waits for a small period of time in case the process makes another I/O request. Naturally this behaviour has a negative effect on performance if the process does not make another sequential read request. On some workloads, however, the overall performance can be improved. There are mechanisms, such as cost-benefit analysis or statistical analysis of the probability of such a request arriving, which reduce the negative effect of this behaviour. AS tries to reduce the read response time for each thread. [28]

Finally, the current default Linux I/O scheduler is the Completely Fair Queuing (CFQ) I/O scheduler. The basic idea of CFQ is to provide fair treatment among different processes and to share the I/O bandwidth evenly between the I/O requests. Internally CFQ has many I/O queues, which are operated in a strict FIFO manner. Each process is given its own queue, derived from the process' PID with a hash algorithm. CFQ selects I/O requests from these queues in a round-robin manner and moves them into a dispatch queue, which is then sorted and sent out to the device driver. [28]

It is important to note that while AS and CFQ are implemented as Linux kernel components, anticipatory and completely fair queuing are mere scheduling algorithms. Anticipation can be built on any scheduling scheme, not just on deadline. Also, completely fair queuing does not need a hash algorithm to operate; any other desired technique can be used to allocate the I/O queues for processes.

Changing the scheduler in Linux can be done from a command prompt. For example, setting the noop scheduler for the drive /dev/sdb:
echo noop > /sys/block/sdb/queue/scheduler
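The currently active scheduler can be checked from the same sysfs file; the kernel lists the available schedulers and marks the active one in brackets:
cat /sys/block/sdb/queue/scheduler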
3.3.4 Read-ahead
Read-ahead is a mechanism to improve the performance of a block device. The idea of read-ahead is that for every read request served, an additional amount of data is also read from the block device into a cache. It is likely that this data will be requested soon after. When such a request is received, the data can be provided directly from the cache, avoiding the costly seek operation. The size of the read-ahead window is configurable per device.

Changing the read-ahead value in Linux can be done from a command prompt. For example, setting the read-ahead to 4 kB for the drive /dev/sdb:
echo 4 > /sys/block/sdb/queue/read_ahead_kb
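The value currently in use can be read back from the same file:
cat /sys/block/sdb/queue/read_ahead_kb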
4. PREVIOUS WORK
This chapter discusses previous research related to the topic of this thesis. The first section discusses how SSDs are used in a server environment. The second section discusses research on I/O scheduling.

4.1 SSDs on servers
Lee et al. [23] conducted a study whose objective was to identify the areas where SSDs can best be utilized in enterprise database applications. They concluded that using SSDs for transaction log, rollback and temporary data storage is superior to HDDs. They argued that the performance of transactional database applications is more limited by disk latency than by disk bandwidth, and that writing log records is a significant performance bottleneck. They pointed out that the I/O pattern of a workload trace collected from a commercial database server is favourable to SSDs. By implementing these changes on their test server, they managed to transform it from I/O bound to CPU bound. Their tests showed an order of magnitude improvement in transaction throughput and response time. Also, the time for processing complicated database operations that required the use of a temporary data area dropped to one third.

Schmidt et al. [32] conducted a study on using SSDs in a database environment in an attempt to increase efficiency and reduce costs. They concluded that SSDs outperformed HDDs both in performance and energy efficiency, but the overall cost analysis still favoured HDDs. They argued that the only suitable usage for SSDs is in applications with a high IOPS demand, where IOPS/$ or capacity/$ are of minor importance. In their benchmark tests, they used the rate of transactions per second to measure performance. The tests showed that with small database sizes (10 MB), HDDs and SSDs were equal for read-only workloads, with HDDs having a slight edge for mixed workloads. However, the performance of the HDDs quickly decreased by as much as 50% when the size of the database was grown tenfold (100 MB), while this had little effect on the SSDs. Growing the size of the database another ten times (1000 MB) dropped the performance of the HDDs by another 25%, while still not affecting the SSDs. All of this applied to both read-only and mixed workloads.
Narayanan et al. [26] reported similar results in their study, where they performed a cost-benefit analysis for a range of workloads. They used 49 different workload traces collected from 15 different server machines (storage size ranging from 22 GB to 6.7 TB) to compare SSDs and HDDs. They found out that in all cases the dominating factor was either the storage capacity or the random-read IOPS requirement. However, due to the low capacity/$ of SSDs, the HDDs always provided the cheapest solution. They presented calculations showing that, depending on the workload, the capacity per dollar of SSDs needs to improve by a factor of 3-3000. They also argued that energy efficiency is not as important a reason to make the transition to SSDs, as low-speed SATA disks are competitive in terms of performance and capacity per watt.

According to Leventhal [24], SSDs should be used as a complement to an existing storage system, not as a replacement. He argued that SSDs "fall in a sweet spot" between HDD and RAM and that the characteristics of flash make it ideal for certain applications, e.g. logging and caching for databases. He pointed out that replacing part of the RAM with SSDs for caching, where applicable, can turn out to be an economically better alternative. He even implied that having SSDs as an intermediate layer also justifies a system with slower spinning disks. He believed that the right balance of cost and performance could be found for any workload.
4.2 Scheduling
Pratt and Heger [28] conducted a performance evaluation of the Linux 2.6 I/O schedulers. In their tests, they simulated I/O patterns on different hardware setups, including both single-disk and RAID configurations. They used the Ext3 and XFS file systems and various workload scenarios. They concluded that selecting an I/O scheduler has to be based on the workload pattern, the hardware setup and the file system used, or as they put it, "there is no silver bullet". Carroll [15] conducted a similar study on I/O schedulers in a RAID environment. He also found the selection of the I/O scheduler to be workload dependent and that I/O scheduling improves performance only on small to medium size RAID arrays (six disks or fewer).

Kim et al. [21] conducted a study analysing I/O schedulers on SSDs. They argued that scheduling itself does not improve the read performance of an SSD, but preferring read requests over write requests does. They presented and implemented a scheduling scheme that exploits the characteristics of the SSD. The scheme is quite simple: it just bundles write requests together to match the logical block size and schedules read requests independently in a FIFO manner. Their benchmark tests showed up to 17% improvements over the existing Linux schedulers (presented in Section 3.3.3). The test results also showed that the schedulers did not make a notable difference under read-oriented workloads on SSDs. On a side note, the anticipatory scheduler seemed to outperform the other existing schedulers. This is quite strange, as anticipation was designed around the sequential placement of data on HDDs and thus keeps the device idle for short periods of time. This should not improve the performance of an SSD but, on the contrary, degrade it. The phenomenon can be explained by noting that an individual process can benefit from getting exclusive service for bursty I/O requests, thus improving the overall performance. However, this is more a matter of process optimisation than of I/O optimisation.
5. TESTING ENERGY EFFICIENCY
This chapter discusses the test environment and the actual tests conducted. The first section describes the test cluster in detail. The second section presents the test tools used; the physics software and the software and hardware instruments used to gather data are discussed in this section. The third and last section discusses the practical side of running the tests and describes how the tests were conducted.

5.1 Test Cluster
5.1.1 Operating System: Rocks 5.3
The choice for the operating system of the test cluster is Rocks 5.3, an open-source Linux cluster distribution. Rocks is developed by the Rocks Cluster Group at the San Diego Supercomputer Center at the University of California, San Diego and its contributors. Rocks is a fully stand-alone system and cannot be installed on top of an existing system. Rocks is basically a Red Hat Linux bundled together with a whole set of cluster software. The driving motivation behind Rocks is to make clusters easy to deploy, manage, upgrade and scale. This does not mean that Rocks would be inadequate or inefficient for high-performance cluster computing. On the contrary, Rocks is used in many universities and institutions around the world.

Installing and maintaining Rocks is easy. First you have to install the frontend machine. This does not differ much from a normal Linux installation. Rocks contains many optional packages, called rolls, which you can pick to go with your basic installation. These rolls contain additional software you may want to install. For example, the Sun Grid Engine (SGE) roll was included and used as the batch-queuing system for the test cluster. After installing the frontend, a cluster also needs compute nodes. Installation of a compute node is easy: all that is needed is to configure the compute node to boot from the network. A compute node registers itself to the frontend database, downloads a system image from the frontend (or from other compute nodes) and performs a quick installation. In fact, Rocks even deals with errors simply by re-installing the compute node rather than trying to fix it. If the default configuration setup and system image are not sufficient for your needs, or you want to modify your compute nodes later, all you need to do is configure some text files on the frontend, possibly add some additional packages to be installed on the compute nodes, assemble a new system image and re-install the nodes.

Rocks also comes with many software tools that make the administration and management of a cluster easy. Most notable is Ganglia, a web-based cluster monitoring software. [33]
With SGE it is possible to configure the slot size for each compute node. The slot size defines how many simultaneous jobs can be submitted to a single compute node. The name actually derives from the number of CPU slots a machine has, and it suggests that the number of CPU cores should equal the number of simultaneous compute jobs. However, this study wanted to examine what kind of effect this has on performance. This study uses the term relative slot size to refer to the ratio of the slot size to the number of actual CPU cores. For example, in the test cluster, with quad-core machines, a slot size of eight would equal a relative slot size of two.
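As a sketch of how this could be set in SGE (the queue name, host name and slot count below are only examples), the slots attribute of a single queue instance can be changed with qconf, here to eight slots, i.e. a relative slot size of two on a quad-core node; alternatively, the whole queue configuration can be edited interactively with qconf -mq all.q:
qconf -rattr queue slots 8 all.q@compute-0-0.local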
5.1.2 Hardware
The test environment consists of a computing cluster and a dedicated file server. The cluster is composed of four machines: a frontend and three compute nodes. Detailed specifications are presented in Table 5.1. Detailed specifications of the drives used are presented in Table 5.2.

Table 5.1: Test Cluster
Frontend Nodes File Server
Model Dell server Dell R210 Dell R710
Processor Intel Xeon 2.8 GHz Intel Xeon 2.4 GHz Intel Xeon 2 GHz
CPU cores 2 4 4
RAM 2 GB 8 GB 2 GB
Disk (OS) 160 GB SATA (7.2k) 250 GB SATA (7.2k) 146 GB SAS (10k)
Ethernet 2x 1 Gb 2x 1 Gb 4x 1 Gb
The SSDs are Corsair CSSD-F40GB-2 drives with a SATA II 3.0 Gb/s interface. The Corsair F40 utilises MLC NAND flash technology. According to the manufacturer's own specifications, the Corsair F40 can reach read and write speeds of 280 MB/s and 270 MB/s respectively and perform 50k IOPS. [17]

The HDDs are Scorpio Black WD3200BEKT drives from Western Digital, with a 7200 RPM spindle speed and a SATA II 3.0 Gb/s interface. According to a review by the Tom's Hardware web site, to give a rough estimate of the performance of the HDD, the WD3200BEKT was benchmarked with an access time of 15.4 ms (including spin delay), a maximum read speed of 84.3 MB/s, a maximum write speed of 83 MB/s and a peak power of 3.26 W [1]. Western Digital [37] announces the WD3200BEKT to have an average latency of 4.2 ms and an average seek time of 12 ms, which converge quite well with the numbers from the Tom's Hardware review. However, the power consumption does not converge, as Western Digital announces the WD3200BEKT to have an idle power of 0.85 W and an average power consumption of 2.1 W. The manufacturer's numbers for HDD bandwidth also differ considerably, as Western Digital claims the disk can put out up to 108 MB/s for both read and write.
Table 5.2: Manufacturer specifications of the drives. Prices: www.newegg.com (cited 1-Feb-2011)
HDD SSD
Model WD Scorpio Black Corsair F40
Size 320 GB 40 GB
Price $59.99 $104.99
GB/$ 5.3 0.38
Random access time 16 ms 0.02 ms
Read speed 108 MB/s 280 MB/s
Write speed 108 MB/s 270 MB/s
IOPS - 50000
Idle power 0.8 W 0.5 W
Active power 1.75 W 2.0 W
5.2 Test Tools
5.2.1 Computing at CERN
The Large Hadron Collider (LHC) is a particle accelerator at CERN. The four main
detectors of the LHC can produce 15 petabytes of data a year [6]. The distributed
computing and data storage infrastructure built to process this vast amount of data
is called the Worldwide LHC Computing Grid (WLCG). As of February 2011, the
WLCG had 246,000 processing cores and 142 petabytes of disk space [8].
The CERN computing infrastructure is divided into three levels of tier centres.
The Tier-0 centre is located at CERN and is responsible for storing the first copy of
the RAW experiment data from the LHC. It is also responsible for producing the first
reconstruction pass and for distributing the data to the Tier-1 centres. The Tier-1 centres together
are responsible for storing the second copies of the data stored in Tier-0. Tier-1
centres also further reprocess the data and distribute it to the Tier-2 centres. Tier-2
centres are responsible for serving the analysis requirements of the physicists and
also for producing and reprocessing the simulated data. The simulated data is also
distributed to the Tier-1 centres. As of February 2011, besides the Tier-0 centre, there
are 11 Tier-1 centres and 164 Tier-2 centres in the world [7]. [18]
5.2.2 CMSSW
The Compact Muon Solenoid (CMS) is one of the four big research projects attached
to the LHC. CMS can also refer to the actual particle detector. The Compact Muon
Solenoid Software (CMSSW) is a physics software toolkit for analysing the data
from the CMS detector.
A central concept within CMSSW is the event. An Event is a C++ object
container. An Event contains many data tiers for all RAW and reconstructed data
related to a particular collision. The RAW data is the full event information,
collected directly from the LHC. The RAW data is unmanipulated and is not used
for analysis. The reconstructed or RECO data is reconstructed into physics objects
and still contains most of the event information. This RECO data can be used for
analysis, but it is not convenient for any substantial data sample. Analysis Object
Data (AOD) is a subset of RECO data. AOD is expected to be used in analysis, as
AODs are essentially events screened beforehand. All objects in the Event may be
individually or collectively stored in ROOT files. Event data can also be stored
in different files to limit the size of a file and to prevent transferring unnecessary
data. This data tier model of an Event is illustrated in Figure 5.1.
Figure 5.1: Data model used in CMSSW. Source: CMS WorkBook [16]
CMSSW also supports Monte Carlo simulation. Data samples generated by Monte Carlo are used to simulate the physics
signal under investigation. Monte Carlo can also be used for creating sample data for personal
use.
CMSSW consists of many modules, which contain general-purpose code for
analysing the events. The goal is to minimise the amount of code a physicist has to write
himself. A configuration file is needed to tell CMSSW which modules to load
and where the data can be found. The executable is called cmsRun. [16]
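The actual job scripts used in this work are attached as appendices; as a hedged sketch only, a minimal SGE job script that sets up the CMSSW runtime and runs cmsRun could look like the following, where the installation path, working directory and configuration file name are hypothetical:

    #!/bin/bash
    #$ -N cms-test-job        # job name shown in the SGE queue
    #$ -cwd                   # run in the submission directory
    #$ -j y                   # join stdout and stderr into one log
    # Set up the CMSSW runtime environment (paths are assumptions)
    source /opt/cms/cmsset_default.sh
    cd $HOME/CMSSW/src
    eval `scramv1 runtime -sh`
    # Run the analysis; the configuration file names the input ROOT file to read
    cmsRun analysis_cfg.py

A script like this would be submitted to the queue with qsub, for example "qsub cms_job.sh".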
5.2.3 ROOT framework and ROOT files
ROOT is a C++ framework designed for large-scale data analysis and data mining.
ROOT was first created at CERN, the project starting in 1995, and is still used at
CERN for analysing the particle physics data created by the LHC. One of the fundamental
design principles was that although the programs analysing the data may
change as time passes, the actual data does not. It was also designed to scale to
handle petabytes of data. ROOT relies on a "write once, read many" model due to
the nature of the data, and makes it possible to compress the data efficiently.
A ROOT file is a compressed binary file which can store any instance of a C++
class. Data is stored in a ROOT file with a data description so that it can be read
even if the original program used to store the data is lost. Data can be stored in
a ROOT file in both a row-wise and a column-wise manner. If the data is stored by columns,
reading the same data member from multiple instances speeds up considerably, as
unwanted pieces of data can be skipped. For example, in one instance, when a 280 MB
ROOT file was analysed, only 6.6 MB of data was transferred over the network.
ROOT even implements an auto-adaptive pre-fetch mechanism, reading the next
entry while the previous entry is still being processed.
ROOT supports an XML representation of data, but does not actually save data
in XML form due to the verbose nature of XML. A database abstraction layer is also
provided, making it possible to store data in a ROOT file in a database-like manner.
[29], [11]
5.2.4 Measuring Tools
During the tests, performance data was collected from the cluster by using both
hardware and software tools. The actual power consumption was measured with
a WattsUp? electricity meter, which was attached to the frontend machine via
USB. A shell script was used to read the meter information once every second
and to write the information into a log file. The electricity meter also provided a
cumulative reading for the watt hours (Wh) consumed. The power consumption was
measured separately for the file server and for all of the compute nodes. The power
consumption of the frontend machine was not measured. A grid monitoring software
called Ganglia was also used. Ganglia operates by constantly receiving status reports
from the other machines in the cluster. Ganglia has a browser user interface to display
cluster performance metrics, such as network traffic, CPU utilisation of individual
machines, the job queue, etc. The server logs were collected and stored together with
the other output data.
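The logging script itself is not reproduced here; the following is only a hedged sketch of the polling idea it implements. The reader command wattsup_read and the device path are hypothetical stand-ins for however the meter is actually read over its USB serial interface:

    #!/bin/bash
    # Poll the electricity meter once per second and append a timestamped
    # reading to a log file.
    LOG=/var/log/wattage.log
    while true; do
        READING=$(wattsup_read /dev/ttyUSB0)   # hypothetical helper command
        echo "$(date +%s) $READING" >> "$LOG"
        sleep 1
    done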
5.3 Conducting Tests
5.3.1 About performance and energy efficiency
We distinguish performance and energy efficiency as two different optimisation
goals. Performance is measured by the average processing times of the
CMS jobs. Energy efficiency is measured by the energy in watt hours needed
by an individual CMS job on average. These two can be highly dependent on each
other. After all, by definition, energy equals time × power. However, the power does
not need to be constant. It is possible that increasing the performance also
has some kind of effect on the power usage. Thus, these two need to be studied
separately.
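As a purely illustrative calculation (the numbers are not measurements from this work): if a CMS job runs for 30 minutes on a node drawing 120 W on average, it consumes E = P × t = 120 W × 0.5 h = 60 Wh. A configuration that finishes the same job in 20 minutes but raises the average power to 200 W consumes about 200 W × 0.33 h ≈ 67 Wh, so the faster configuration is less energy efficient despite its better performance.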
5.3.2 Running tests
We reated someLinux shellsripts both toautomatiseand standardisethe testing
proess. Shell sripts were responsible for submitting the jobs, hanging ongura-
tions whereappliable (forexampleshedulingalgorithm),learingahes, starting
and stopping wattage measurement and writing log entries. The shell sripts are
attahed asappendies. Appendix A shows the mainsript, Appendix B shows the
sript used for an individual test run and appendix C shows the sript responsible
for initialisingand runningthe atualCMS job. Installingthe drives and hanging
thelesystemneeded tobedonemanually. Ashellsriptwasalsousedforreating
thetest inputdata onthetarget storagefor theCMS jobs. Toensurehomogeneous
of the test data between dierent test ongurations and between individual jobs,
the test datawasopied from the frontend for eah time a lesystem was reated.
The driveahes both onompute hostand data host wasleared between the test
runs with shell ommand:
syn; eho 3 > /pro/sys/vm/drop_ahes
Every test run was identical. The shell script first cleared the caches and then set the
scheduling algorithm. Then the slot size of the SGE was configured. Each compute
node had 4 CPU cores, as shown in Table 5.1. Slot sizes of 2, 4, 8 and 12 (relative slot
sizes of 0.5, 1, 2 and 3) were used to assign loads of 50-300% to each compute node.
After the cluster was configured, the script submitted CMS jobs via SGE to each
compute node, as many as the current slot size of the node. Just before the jobs were
submitted, another script was started to log the wattage as mentioned in Section
5.2.4. When all the jobs were finished, the logging script was also terminated. Using
the log file, the starting and finishing time of a CMS job can be determined, as well as
how much energy (watt hours) was consumed. After the first set of CMS jobs was
finished, the script increased the slot size and ran a new set of jobs. When finished
with a slot size of 12, the scheduler was changed and the slot size was set back to 2. This
was repeated until all combinations of the four different slot sizes and the four different
schedulers had been used. All in all, one such test run submitted 312 CMS jobs and
took about 8-10 hours to finish.
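Appendices A and B contain the actual scripts; the following is only a simplified, hedged sketch of the nested loop they implement. The scheduler names, the device name sdb, the node and queue names and the job script name run_cms_job.sh are assumptions made for illustration, not details taken from the appendices:

    #!/bin/bash
    # Simplified sketch of one test run: iterate over I/O schedulers and SGE
    # slot sizes, clear caches, and submit one CMS job per available slot.
    for SCHED in noop anticipatory deadline cfq; do
        for SLOTS in 2 4 8 12; do
            for NODE in compute-0-0 compute-0-1 compute-0-2; do
                # Select the I/O scheduler and clear caches on every node
                ssh $NODE "echo $SCHED > /sys/block/sdb/queue/scheduler"
                ssh $NODE "sync; echo 3 > /proc/sys/vm/drop_caches"
                # Set the SGE slot count for this node
                qconf -mattr queue slots "[$NODE.local=$SLOTS]" all.q
            done
            # Submit as many CMS jobs as there are slots on the three nodes
            for JOB in $(seq 1 $((3 * SLOTS))); do
                qsub run_cms_job.sh
            done
            # Wait until the queue is empty before the next configuration
            while qstat | grep -q cms; do sleep 60; done
        done
    done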
First, the tests were conducted with the NAS. A RAID-5 configuration of 6 HDDs (320
GB) and of 4 SSDs (40 GB) was set up, creating volumes of 1.6 TB and 120 GB,
respectively. The ROOT file used was 656 MB in size and it was copied to the NAS
a total of 72 times, thus allocating 47 GB of the total volume. The
files were renamed to "data-01-01.root"..."data-06-12.root", where the first number
represented the node number and the second number represented the job number. This
ensured that no two CMS jobs were using the same data file. Also, the value of the
read-ahead was altered to test the effect it had on the performance. Read-ahead
values of 4 kB, 8 kB, 16 kB and 32 kB were used. After a test run of 312 CMS jobs
finished, a new test run was started after changing the read-ahead value, the file
system or the RAID "disks" from HDDs to SSDs. All in all, the test run was conducted
a total of 24 times: 3 file systems × 4 read-ahead values × 2 different RAID "disk" types equals 24.
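One way to change the read-ahead value of a block device is through the sysfs interface; whether the test scripts used this route or another utility is not specified here. A hedged example, assuming the RAID volume appears as the device md0 on the file server:

    # Set the read-ahead of the block device to 16 KB and verify it
    # (md0 is an assumed device name)
    echo 16 > /sys/block/md0/queue/read_ahead_kb
    cat /sys/block/md0/queue/read_ahead_kb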
At this point, taking a quick look over the results, a pattern was perceived that
indicated that increasing the read-ahead value had a negative impact on the performance.
The reason most likely was that the ROOT file is a binary file and the AOD
within the file is scattered. It was decided not to use the read-ahead value
as a configuration parameter anymore. Also at this point, one test run was performed by
using only 4 HDDs for an easier comparison against the 4 SSDs. Again, based on the
preliminary results, the best performing configuration of the 6 HDDs was picked
and one more test run with 4 HDDs was performed with that configuration. Also,
the energy consumption of the idle compute nodes and the NAS appliance was measured,
both with and without the RAID pack. The idle tests logged an idle machine for
one hour from startup. These results are presented in Appendix D.
Next, the SSDs were installed on the compute nodes and configured as one big
GlusterFS volume. With three nodes and without any striping or mirroring, the 40
GB SSDs created a volume of 120 GB. The test run was also conducted with this
configuration before dismounting the Gluster configuration and running the tests
directly from the local drives. Because the test data totalled 47 GB, all of it
could not be fitted onto the 40 GB drives, so only half of it was used, copying 24
GB of test data to each drive. This way, plenty of free space was left on the devices,
as had been the case also on the earlier test runs.
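As a hedged illustration of such a setup (the commands come from the GlusterFS command line interface, but the brick paths, volume name and mount point are assumptions), a distributed volume over the three compute nodes could be created roughly as follows:

    # Create a distributed (no striping or mirroring) GlusterFS volume from
    # one brick per compute node, then start it and mount it on a client.
    gluster volume create testvol \
        compute-0-0:/export/ssd compute-0-1:/export/ssd compute-0-2:/export/ssd
    gluster volume start testvol
    mount -t glusterfs compute-0-0:/testvol /mnt/gluster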
Finally, the SSDs were changed to HDDs inside the compute nodes. As with the
SSDs, a GlusterFS volume was created first. With 320 GB in each node, a volume
of 960 GB could be hosted by the nodes. After running the tests on Gluster, the
same tests were conducted again with the local drives. This time, though, the whole 47
GB of test data was copied to each HDD.