CLUSTER COMPUTING
Master of Science Thesis
Examiner: Professor Hannu-Matti Järvinen
Examiner and topic approved in the Faculty of Computing and Electrical Engineering Council meeting on 4th of May 2011
TIIVISTELMÄ
TAMPEREEN TEKNILLINEN YLIOPISTO
Degree Programme in Information Technology
Tuominen, Mikko: Energy Efficiency of Data Storage Systems in Cluster Computing
Master of Science Thesis, 42 pages, 5 appendix pages
June 2011
Major: Distributed Software
Examiner: Professor Hannu-Matti Järvinen
Keywords: cluster computing, energy efficiency, SSD, file server, GlusterFS, file system, disk scheduling

Energy efficiency is an important aspect of the development of any technology, and cluster computing is no exception. As the price of energy rises, the cost of running a cluster easily exceeds the cost of acquiring one. A euro saved is worth as much as a euro earned.

This thesis examines and compares different hardware- and software-level solutions used for storing data in cluster computing. Solid state drives are not commonly used in clusters, and one of the goals of this thesis is to determine whether this relatively new technology is suitable for cluster use. The main goal is to understand which factors affect the energy efficiency of a cluster from the data storage point of view. To reach these goals, the performance and energy consumption of a cluster are measured and evaluated with different configurations. The results can be used to optimise the energy efficiency of other clusters.

The thesis is divided into two parts. The literature part covers background related to energy efficiency, data storage models, drives, file systems and disk schedulers. The experimental part presents the test environment and reports and analyses the results. The tests use CERN's CMS software and data produced by the LHC to model heavy physics computing. Both SSDs and conventional hard disk drives are used together with three different data storage schemes: solutions based on a distributed file system, a file server and a local disk.

The results reveal that no significant advantage is gained by using SSDs. Another key result is that a considerable part of the cluster's capacity can remain unused if the file system and disk scheduler are not chosen carefully. The conclusion of the thesis is that although there is no obstacle to using SSDs, their use is not justified when both price and capacity are taken into account. As SSD development progresses, the situation should be re-evaluated. If prices fall and storage capacities grow, the mechanical hard disk may become history.
ABSTRACT
TAMPERE UNIVERSITY OF TECHNOLOGY
Master's Degree Programme in Information Technology
Tuominen, Mikko: Energy Efficiency of Data Storage Systems in Cluster Computing
Master of Science Thesis, 42 pages, 5 appendix pages
June 2011
Major: Distributed software
Examiner: Professor Hannu-Matti Järvinen
Keywords: cluster computing, energy efficiency, SSD, file server, GlusterFS, file system, I/O scheduling

Energy efficiency is an important part of the development of any technology, and cluster computing is no exception. As energy prices rise, the costs of running a cluster can easily overcome the costs of buying one. A euro saved is a euro earned.

This thesis examines and compares different hardware-level approaches and software-level configurations used in clusters to store data. Solid state drives are not commonly used in clusters, and one of the goals of this thesis is to study whether or not this relatively new technology is suitable to be used in clusters. The main goal is to understand what affects the energy efficiency of a cluster from a data storage point of view. To reach these goals, the performance and energy consumption of a cluster are measured and analysed with different system configurations. These results can further be used to optimise existing clusters.

The thesis is divided into two parts. In the literature study part, issues related to energy efficiency, data storage models, block devices, file systems and I/O schedulers are studied. In the experimental part, the test environment is introduced in detail and the results are reported and analysed. The tests are conducted using the CMS software with real LHC data to simulate heavy physics computing. During these tests, both hard disk and solid state drives are used with three different data storage schemes: a distributed approach with GlusterFS (a distributed file system) on compute nodes, a centralised approach with a dedicated file server, and a local approach with drives in the compute nodes of the cluster.

The test results reveal that no significant gain is achieved by using solid state drives. Another key result is that a cluster can suffer a major performance loss if the file system and I/O scheduler are not properly selected. The conclusion of this thesis is that although there is no fundamental reason why solid state drives should not be used in clusters, their use is not justifiable considering their multifold price and low capacity compared to hard disk drives. As the development of solid state drives progresses, a new study will be in order. If prices decline and storage capacities increase, solid state drives could abolish mechanical drives.
ACKNOWLEDGEMENTS
I was fortunate to have the opportunity to be a member of one of the greatest research communities in Europe while writing this thesis. The eight months I lived in France were among the best times of my life. I was able to learn a lot from my colleagues, both off and on duty.

First of all, I want to thank my instructor at CERN, the GreenIT Project Leader Tapio Niemi, who guided me through this process, offered his expertise and help, and was always ready for discussions about my work.

My warmest thanks go to Professor Hannu-Matti Järvinen from the Department of Information Technology, who helped me to finish this thesis.

I want to thank my co-workers Kalle Happonen and Jukka Kommeri for their expertise and technical support on Linux and cluster administration. I also want to thank Lauri Wendland, who was kind enough to share his time and knowledge on physics computing at CERN.

A special mention goes to my family and friends for their love and support, and of course, to my dear Riitta, who has been there for me all these years.

Tampere, May 19th 2011
Mikko Tuominen
mikko.r.tuominen@gmail.com
+358 400 121 332
CONTENTS
1. Introduction
2. Energy Efficiency Optimization of Cluster Computing
3. Data Storage
3.1 Disk types
3.1.1 Hard Disk Drive
3.1.2 Solid State Drive
3.2 Data Storage Schemes
3.2.1 Network-attached Storage (NAS)
3.2.2 Redundant Array of Independent Disks (RAID)
3.2.3 Distributed File System
3.3 Hard Disk and Solid State Drives in Linux
3.3.1 Linux File systems
3.3.2 GlusterFS
3.3.3 Linux I/O scheduling
3.3.4 Read-ahead
4. Previous work
4.1 SSDs on servers
4.2 Scheduling
5. Testing Energy Efficiency
5.1 Test Cluster
5.1.1 Operating System: Rocks 5.3
5.1.2 Hardware
5.2 Test Tools
5.2.1 Computing at CERN
5.2.2 CMSSW
5.2.3 ROOT framework and ROOT files
5.2.4 Measuring Tools
5.3 Conducting Tests
5.3.1 About the performance and energy efficiency
5.3.2 Running tests
6. Results
6.1 Slot size and performance
6.2 Slot size and energy efficiency
6.3 Read-ahead and performance
6.4 Read-ahead and energy efficiency
6.5 File system
6.6 I/O Scheduler
6.7 The best case
7. Conclusion
Bibliography
A. Shell script: Scheduling a set of test runs
B. Shell script: One test run
C. Shell script: Running and timing a CMS TauAnalysis job
D. Power use profile of NAS with both drive types
ABBREVIATIONS
AS Anticipatory Scheduling
CFQ Completely Fair Queuing
CMS Compact Muon Solenoid. A large general-purpose particle physics detector and the name of one of the LHC experiments.
CMSSW CMS Software. A physics software toolkit used to process data from the CMS detector of the LHC.
DFS Distributed File System
FIFO First In First Out
GlusterFS Distributed file system software. Developed by Gluster Inc. GlusterFS is free software, licensed under the GNU AGPLv3 license.
HDD Hard Disk Drive
LAN Local Area Network
LHC Large Hadron Collider. A particle accelerator at CERN.
NFS Network File System. The name of a network file system implementation.
RAID Redundant Array of Independent Disks
SSD Solid State Drive
SSTF Shortest Seek Time First
WAN Wide Area Network
1. INTRODUCTION
An energy efficient computing cluster can help you to save two types of green: money green and the nature type of green. Being energy efficient in any field is good for the environment, let alone in a line of business where thousands and thousands of machines run 24/7. The electricity bill also plays a major role in the financial side of running a computing facility. An environmentally aware image can also be considered good publicity for any company or corporation. Whatever your take on the subject, the fact is that one way to look at green IT is to make more with less: more data, more bandwidth, more calculations and more utilisation with less energy, less heat generated, less wasted resources and less money spent. Several studies, such as Tsirogiannis et al. [36] and Niemi et al. [27], have concluded that the best performing system is very likely also the most energy efficient. Energy efficiency equals savings in operational costs, and improved performance equals savings in hardware costs.

Cluster computing was designed to carry out calculations which would otherwise be too time consuming and practically impossible to perform with a single computer. A typical I/O-intensive computing job performs a relatively small set of operations on a very large set of data. It is also common for such applications to have very large file sizes, from tens of megabytes of data up to a terabyte scale. I/O easily becomes the bottleneck of such a system.

Cluster computing is a tool. It is a tool for thousands of scientists around the world. Like any other tool, it needs to be efficient and reliable. Although a computing cluster can have seemingly vast resources, these resources are constantly in use, as computing clusters are usually very highly utilised. It is common for these clusters to always have a computing job in the queue waiting to get some runtime. It is obvious that any performance gain is definitely considered a positive thing. However, the components used in computing clusters also need to be reliable, and new technologies are not adopted in a hurry. Much of the software used in clusters has already had many updates and newer versions, but cluster admins tend to stick with software that is already proven to be working. This calls for careful understanding of new technologies and thorough testing.
Solid state drives are alleged to be superior to hard disk drives. They consume less electricity, have greater bandwidth, can serve more requests per second and do not suffer from any mechanical delays. They are even resistant to vibrations. Solid state drives are a relatively new technology, which has quickly led to giant leaps forward in some areas of block device development. This quick development has even raised some problems, as some of the software is not keeping up with it. For example, hard disk drives can only serve a few hundred I/O operations per second, a limit imposed by the mechanical nature of the drive. Solid state drives can serve tens of thousands, as they are based on transistor technology and have no moving parts to slow them down. Unfortunately, solid state drives are also more expensive and have notably less storage capacity than existing hard disk drives. On the other hand, the development of solid state drives in these areas has also been quite rapid and they are catching up. There have been studies both for and against the question of whether solid state drives will ever overtake hard disk drives in every aspect.

However, solid state drives are nothing like hard disk drives. The whole technology behind them is fundamentally different. The problem is that current computer systems are designed with hard disk drives in mind. These two drive types cannot be compared without fully understanding their differences and how they might affect the system as a whole. In fact, some of the currently used software components can even degrade the performance of solid state drives. One such example is I/O scheduling, which is designed and implemented to fix some of the weaknesses of hard disk drives. I/O schedulers are an important piece of software: they really do improve the performance of hard disk drives, but they can also really hurt solid state drives.

The purpose of this study is to find out whether or not solid state drives are suitable to be used in cluster computing. Assuming they are, it is interesting to find out whether solid state drives are better than hard disk drives. One big question is also how to measure this. One metric is clearly not sufficient, and excessive optimisation for one type of workload may be of no use for a different kind of workload.

The goal of this study is to understand the significance of solid state drives for cluster computing: to understand what affects the energy efficiency of a computing cluster from a data storage point of view, what is important and what is not, and what the right target for optimisation is. This is evaluated from the perspective of energy efficiency without degrading performance.
The method of exploring these challenges is quite pragmatic: building and running a test cluster with different kinds of hardware setups and software configurations. The test cluster exploits the CMSSW, a physics software toolkit used at CERN to process the data generated by the LHC. A suitable test job is constructed for the test cluster. This test job is run against different sets of system configurations and each test run is measured. These results are analysed in an attempt to find the best possible configuration.

This study does not elaborate on or evaluate the total lifespan of either drive type. The ecological impact of manufacturing either a solid state drive or a hard disk drive is left outside the scope of this study. Another big issue that is not included is the economic aspect. Solid state drives are at the moment a lot more expensive than hard disk drives. Some basic numbers are provided, such as prices of these devices, but no comprehensive assessment is given.

This thesis is structured as follows. Chapter 2 discusses the background of energy efficiency as a concept. Chapter 3 discusses both drive types (block devices) and different data storage schemes. Software components attached to data storage are also covered in this chapter. Chapter 4 summarises the conclusions from previous studies. The information in these chapters (Chapters 2 to 4) is fully derived from different sources and produced by other people. Our role has been to gather these together to provide a sound base to evaluate and analyse the following results.

Chapter 5 discusses the test methods. The test cluster is described in detail, for both hardware and software. The actual practical side of conducting the tests is also presented thoroughly. In the next chapter, Chapter 6, the test results are illustrated and analysed. The chapter is divided into multiple sections and every section discusses a distinct configuration in detail. As the drive type is so essential in this study, both drive types are treated separately within these sections. Finally, in Chapter 7, the most important results of this study are summarised.
2. ENERGY EFFICIENCY OPTIMIZATION OF CLUSTER COMPUTING
Grid computing was designed to carry out calculations which would otherwise be too time consuming to perform with a single computer. Before further discussing how to improve the energy efficiency of a cluster, and especially of its data storage, a little introduction to what affects power consumption is in order.

First, there is the sheer size of the data storage. Data storage capacity and usage efficiency have a direct impact on power consumption. Usage efficiency can be understood as the ratio between the total data storage capacity and the utilised portion of it. The more data is stored, or the more disk space is inefficiently allocated, the more disks are needed and the more energy is consumed. Second, the data transfer rate (I/O bandwidth) and access time also have an effect. The easiest way to improve both is to use disks with a higher rotational speed. Higher I/O bandwidth or lower access times thus require more power than their less time-critical counterparts. Third and last, there are data availability and system reliability. Replicating or backing up a system requires additional components and appliances, which of course require additional power. Improving usage efficiency, minimising the energy consumption of current components and applying new technologies, such as solid state drives, are all potential approaches to a more energy efficient data storage solution. [30]

According to Tsirogiannis et al. [36], the most energy efficient system configuration is also the highest performing one. This is quite intuitive in a cluster environment, where high utilisation rates are expected and average job throughput is what matters. Niemi et al. [27] conducted a study with complementary results indicating that optimising system throughput also improves energy efficiency. They found that it is more efficient to run more than one simultaneous job per processor core on a compute node.

Measurement of energy efficiency is not a simple task. A single metric is not enough to create the full picture. For example, just looking at achieved storage space per unit of energy, i.e. GB/Wh, is not sufficient. Workload characteristics also need to be taken into the equation. Storage devices differ remarkably by performance metrics such as throughput (MB/s), access time or IOPS (I/O operations per second). Also, all distinct sources of energy consumption may not be easily quantifiable or measurable. [30]

Anderson and Tucek [10] also acknowledge the difficulty of measuring and comparing energy efficiency. They propose a scheme to calculate energy efficiency in proportion to alternative implementations. Conveniently, this approach also diminishes the effect of many components consuming constant power regardless of utilisation, which helps to evaluate the gains better. They also remind that micro-optimisation is feeble if order-of-magnitude increases can be obtained with alternative solutions.

One common metric for measuring energy efficiency at the data centre level in the spirit of green IT is the Power Usage Effectiveness (PUE). PUE is the ratio between the power delivered to the data centre and the power actually used by the IT equipment. The difference is explained by noticing that some power is always needed for cooling, lighting, etc., and some is also lost in the power distribution process, e.g. in UPS appliances. Historically PUE has been as high as 2.25 to 3.0, which translates into a 33-44% utilisation rate. Today, a PUE of 1.25 can be achieved by using modern best practices, where 80% of the total facility power is delivered to the IT equipment. This cascade effect of power consumption is illustrated in Figure 2.1. Facebook engineers have reported a PUE as low as 1.07 at full load in their state-of-the-art data centre, where energy efficiency was an important design goal [5]. [38]
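As a simple worked example of the definition above: PUE = (total facility power) / (power used by IT equipment). A PUE of 1.25 therefore means that 1/1.25 = 80% of the facility power actually reaches the IT equipment, whereas a PUE of 3.0 means that only about 33% does; the rest is spent on cooling, lighting and distribution losses.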
Figure 2.1: The Power Cascade Model. Source: SNIA [13]
Solid state drives (SSD) are known to consume less energy than hard disk drives (HDD) due to their non-mechanical design. What makes SSDs even more appealing is that their power consumption depends on the load in a linear fashion [36]. Narayanan et al. [26] criticise recent studies on SSDs for being interested only in performance and not providing any cost-based analysis. They are confident that SSDs will not achieve the capacity per dollar of HDDs. A totally opposite estimate is presented by Schmidt et al. [32], who argued that the annual growth rates in SSD performance and the decline of prices indicate SSDs outperforming HDDs in all aspects in the near future. They also point out that rising energy prices favour this development in a situation where operational costs dominate hardware costs.
3. DATA STORAGE
This chapter is divided into three sections. The first section, Disk types, introduces the physical devices where the bits are stored and the characteristics of these devices. The second section, Data Storage Schemes, discusses several different concepts and models needed to store data in a cluster environment. The third and last section, Hard Disk and Solid State Drives in Linux, discusses the software needed to make all this happen from an operating system point of view. All of these can have an effect on the overall performance and energy efficiency of a cluster.

3.1 Disk types
Both hard disk and solid state drives are used as block devices. A block device is a storage component that offers an interface for block-level operations. A block is an abstraction between a block number and the physical representation of data on the device. The operating system uses a Logical Block Address (LBA) as a parameter for targeting data in I/O operations. The block size of the device needs to be a multiple of the sector size of the HDD. This is discussed more thoroughly in Sections 3.1.1 and 3.3.
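On Linux, the sector size a drive reports can be checked from the command prompt with the blockdev utility; as a small illustration (the device path /dev/sda is only an example, and the physical block size query requires a reasonably recent util-linux), the logical sector size and physical block size are printed by:
blockdev --getss --getpbsz /dev/sda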
3.1.1 Hard Disk Drive
A Hard Disk Drive (HDD) is composed of multiple magnetic platters, which can be either read or written by using a disk head. It is common to refer to these magnetic platters as heads, as there is usually only one disk head per platter. The actual disk head is attached to a disk arm, which is used to move the disk head on top of the right track. A track is a collection of bits sharing the same radius from the centre of the disk, thus forming a circle on the platter. Tracks that share the same radius on different platters are said to form a cylinder. When the disk spins, the read head, while held stationary, can access the bits on the track in a sequential manner. A track is divided into sectors. A sector is the smallest unit of data that can be written to an HDD. Typically, the size of the sector is set by the manufacturer and cannot be changed. A very common sector size in the industry is 512 bytes, which has become the de facto standard. Recently manufacturers have also introduced HDDs with 4 kB sector sizes, but these have some severe compatibility issues.

When an HDD receives an I/O request, it transforms the logical block address into a physical address, e.g. a tuple of cylinder, head and sector numbers. Common consumer-grade HDDs and their capacities are presented in Table 3.1. The Velociraptor from Western Digital is a high-performance HDD and is listed as a point of reference for SSDs.
Table 3.1: HDD capacities. All drives are 3.5" and SATA II. Prices: www.newegg.com (cited 1-Feb-2011)
Manufacturer Model Size Price GB/$
Hitachi Deskstar 1 TB $54.99 18
Samsung EcoGreen 1 TB $38.99 26
Seagate Barracuda 2 TB $69.99 29
Western Digital Caviar Green 1 TB $44.99 22
Western Digital Caviar Green 2 TB $99.99 20
Western Digital Caviar Green 3 TB $209.99 14
Western Digital Velociraptor 300 GB $169.99 1.8
Table 3.2: SSD capacities. All drives are 2.5", MLC and SATA II. Prices: www.newegg.com (cited 1-Feb-2011)
Manufacturer Model Size Price GB/$
Corsair Force F40 40 GB $104.99 0.38
Corsair Force F120 240 GB $439.99 0.55
Intel X25-M 120 GB $229.99 0.52
Kingston SSDNow V Series 128 GB $224.99 0.57
OCZ Agility 2 160 GB $299.99 0.53
3.1.2 Solid State Drive
A Solid State Drive (SSD) is a mass storage device based on NAND flash memory technology. A flash memory consists of readable and reprogrammable transistors, i.e. memory cells. The memory cells preserve their state during a power outage. Data is stored in these cells as voltage levels. If the cell has only two voltage levels and thus represents only one bit, it is called a Single Level Cell (SLC). If the cell can distinguish four voltage levels (or more) and thus represent two bits (or more), it is called a Multi-Level Cell (MLC). Flash memory is discussed more thoroughly later, but first a little insight into how an SSD operates.

An SSD is composed of many flash memory chips. Each chip is composed of pages grouped into erase blocks, which should not be confused with the block layer blocks discussed earlier. An SSD has three basic operations: read, reprogram (write) and erase. The smallest unit of data for a read or write operation is the page, which is typically 512 - 4096 bytes. Only fresh pages can be reprogrammed, so every dirty page must be properly erased before it is reusable. The smallest erasable unit of data is the block, which can hold up to 128 pages or 512 kB of data. SSDs do not actually have physical sectors, but sometimes a page can be thought of as being divided into logical sectors. The reason is that, for historical reasons, applications assume that a block device has 512-byte sectors.
Reading a 4 kB page generally takes around tens of microseconds and writing hundreds of microseconds. SLC-based devices are generally faster than MLC-based ones. The real penalty comes from erasing a block, which takes 1.5 - 2 ms. Thus reading is an order of magnitude faster than writing and two orders of magnitude faster than a write that needs an erase operation. SSDs (and other flash memories) use a technique called the Flash Translation Layer (FTL) to overcome this problem.

The FTL reduces the effect of time-consuming write operations by reserving redundant blocks or pages and hence avoiding the costly erase operation when data is being updated. The downsides are increased overhead for address translation information and an increased amount of flash memory operations. Of course this does not solve the problem completely, as it only delays the erasing process [22]. This is why the trim operation was introduced on SSDs. Its purpose is to erase unused pages in the background. As mentioned earlier, a single page cannot be erased, as the smallest erasable unit is the block. So the drive has to read the data from a block into a cache, erase the whole block and then rewrite the data back into the block. [2]

Where SSDs really excel over HDDs is random access. The Intel X25-M SSD can reach up to 35,000 IOPS (I/O operations per second) for random reads and 8,600 IOPS for random writes [20]. For comparison, a high-performance HDD, the "WD Velociraptor", can perform fewer than 250 IOPS for both random reads and random writes. The relatively low IOPS count of HDDs derives from mechanical delays and cannot be significantly improved. SSDs can interleave read and write operations, and hence the overall performance of the device can be better than that of a single flash memory chip. [9]
SSDs have one clear technical weakness compared to HDDs: the write-erase cycle of a memory cell is limited. An SLC can be reprogrammed around 100,000 times and the more complicated MLC only 10,000 times before it wears out [31]. This is why modern SSDs come with something called wear leveling. Wear leveling allows the erase counts of blocks to be evenly distributed over the storage medium in an attempt to increase the endurance of an SSD. Dynamic wear leveling is an algorithm by which the controller in the SSD recycles blocks with small erase counts in the flash memory [3].

The biggest obstacle SSDs face on their way to becoming a widely adopted and respectable alternative to HDDs is their price. Comparing SSDs and HDDs just by looking at how many gigabytes a dollar can buy, an SSD is approximately 50 times as expensive as an HDD, as seen from Tables 3.1 (about 25 GB/$ for HDDs) and 3.2 (about 0.5 GB/$ for SSDs). However, fully electronic SSDs are known to consume less power than partly mechanical HDDs [22].
3.2 Data Storage Schemes
Before discussing the different options for data storage schemes, one terminological distinction needs to be pointed out. When the term distributed is used in the context of data storage, it deliberately refers to a data storage scheme where the actual data is distributed over multiple machine instances, in contrast to client/server-type distribution. The difference is vague, as in a distributed environment the backend implementation is not necessarily transparent to the client. For example, a simple file server can internally exploit other services which reside on other physical machines. Also, many schemes providing a distributed data model can have a frontend machine that works as a single entry point and makes them appear to be a single system. In fact, in some cases it can even be technically possible to run such a system on a single machine instance. So basically the definition is based on how the system is meant to be used.
3.2.1 Network-attached Storage (NAS)
A Network-attached Storage (NAS) is by definition a data storage accessible over the network. NAS is based on the client/server model and provides file-level data access. A NAS appliance is equipped with a high-speed network interface and hardware capable of storing vast amounts of data. Terminologically, subtle differences between a NAS appliance and a conventional file server can be distinguished. NAS is designed for high performance and usually offers customized and pre-configured software and vendor support, which make it easy to deploy and administer. Here the terms "NAS appliance" and "file server" are used interchangeably, as there is little pragmatic difference from the end user's point of view. NAS exploits network file system techniques to provide data access for client machines.

A network file system is a file system that is hosted on a remote machine and is accessible over the network. More precisely, it is a protocol for accessing the remote file system. Network file systems are based on the client/server model and are usually stateless, although stateful network file systems also exist. Stateless means that the server provides the file system as is and keeps no record of the state of individual files. This introduces a couple of pros and cons. A stateless design simplifies the system architecture, but it also brings out some synchronisation problems and degrades consistency. This can be a serious problem if a high level of reliability and data integrity is required. Statelessness must be acknowledged and handled at the application level. An approach with a centralised server makes it easier to control and back up files, as they all reside in a single system. The clients can mount the network file system like any other conventional file system. The presence of the network is hidden, and files are transferred to the local machine only when needed. Alas, it also makes the server a single point of failure, thus eliminating it as an option for applications with low fault tolerance for accessibility.

The best-known and most common network file system in the Linux environment is the Network File System (NFS). The basic idea of NFS is, from a client's point of view, to emulate the behaviour of a local, mounted file system even though the disks are not physically present. NFS is said to be inadequate to scale for systems over 100-1000 nodes, i.e. NFS clients. However, this heavily depends on the use profile of the system and the application characteristics. Read-intensive applications have better success than write-intensive ones. After all, NFS is not meant to serve applications which require high availability. There are also some concerns about the security of NFS. In a cluster environment this is rarely an issue, as clusters tend to reside in a private network, excluding the frontend machine. As a whole, NFS is popular, widespread, easy to install and widely supported, which makes it the best choice of a data access implementation technique for the NAS subsystem. [14], [33]
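As a minimal sketch of such a setup (the export path, subnet and host name below are only illustrative examples), the file server lists the exported directory in /etc/exports and a client then mounts it like a local file system:
/export/data 10.0.0.0/24(rw,sync)
mount -t nfs fileserver:/export/data /mnt/data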
3.2.2 Redundant Array of Independent Disks (RAID)
It is common for NAS appliances to exploit RAID technology. A Redundant Array of Independent Disks (RAID) is a scheme designed to improve both the reliability and performance of disk access. RAID can be implemented by using either a hardware- or software-based solution. In a hardware RAID, the server machine is equipped with a specific RAID controller, which receives the I/O requests from the OS and redirects them to the physical disks. For the OS, only one large block device is visible. With software-based RAID, the independent disks are visible to the OS and a virtual disk is created upon them. RAID is perceived to be reliable when it comes to storing data, but not necessarily in terms of accessibility. This is especially true when using the NFS protocol, but inaccessibility can also stem from network or power failures [14].

One important technique used by RAID is striping. Striping means that data is sliced into fixed-length chunks, which are dispersed over multiple disks. When the data is accessed, the I/O request can be handled in parallel by multiple disks, which improves performance significantly. There are many levels of RAID, each with its own trade-offs. RAID-0 provides plain striping with no data redundancy. RAID-1 is similar to RAID-0, but it also provides mirroring. Mirroring means that all data is sent to several (usually only two) disks as a safety precaution. This setup provides excellent data reliability and performance at the cost of disk space. RAID-5 provides data parity, which means that for every block striped, a parity block is calculated and stored on a different disk. If one disk fails, the data on the failed disk can be reconstructed by using the parity information. RAID-6 is similar, but it doubles the amount of parity and hence can tolerate two failed disks. [19]
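As an illustration of a software-based RAID (the device names /dev/sdb, /dev/sdc, /dev/sdd and /dev/md0 are only examples, and the commands overwrite the drives' contents), a RAID-5 array can be assembled in Linux with mdadm and then used like any single block device:
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
mkfs.ext4 /dev/md0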
3.2.3 Distributed File System
As mentioned earlier in Section 3.2, the definition used in this study for distributed data storage refers to truly distributed data. In contrast to network file systems, where all data is stored on a single machine, a distributed file system (DFS) runs on multiple machine instances. A DFS can have a centralized or decentralized architecture. In a centralized architecture, the client connects to a master server. The master server is responsible for keeping the file system metadata and redirects the I/O requests to other servers, i.e. data servers. The data server then provides the actual data for the client. This architecture of course makes the master server a single point of failure, and it easily becomes the bottleneck of such a system. Hence decentralized DFSs are also available. A decentralized architecture can be implemented, for example, in a peer-to-peer manner, where the file system metadata is distributed as well.

The distributed nature of DFSs varies, as a DFS can reside in a single server rack connected via a high-speed LAN or be geographically distributed over a WAN.

A DFS is said to be a parallel file system if the data of a single file is distributed to many different servers. This approach has its pros and cons. The performance of reading or writing, especially of big files, can be improved significantly as more servers can handle the I/O. On the other hand, a seemingly simple operation such as a directory listing can be extremely slow, as each server needs to be consulted. DFSs can also be configured to provide data replication to improve accessibility and reliability, or data striping (as in the RAID systems discussed in Section 3.2.2) to improve performance.
3.3 Hard Disk and Solid State Drives in Linux
To permanently store data, more is needed than just the physical devices: the presence of an operating system is required. A typical data storage scheme can be divided into four layers: the device layer, the kernel layer, the file system layer and the application layer. A block layer can also be distinguished between the kernel and the file system [28]. The goal is to provide abstraction between the layers, to hide the implementation and technical details from the user, and to provide interfaces to better support interoperability of a variety of hardware and software components. Optimization of such a system can take place on any of these layers. Hard disk and solid state drives can be seen as part of the device layer. Between the physical device and the kernel are device drivers, which are part of the kernel. The purpose of the device drivers is to hide the differences between the vast variety of devices from the kernel. The kernel can then treat any device in the same way through a device driver interface. [19]

The kernel and file system layers are the most interesting ones, as they provide most of the easily configurable parameters. In the Linux kernel there is a component called an I/O scheduler. An operating system does not really require any I/O scheduler to operate, as I/O requests can be serviced in a FIFO-like queue manner. This, however, is not the optimal solution in most cases, and use of an I/O scheduler can improve the performance of the I/O dramatically. The Linux I/O scheduler adds an interface between the block layer and the device layer. [28]

When discussing disk performance, two terms need to be distinguished: the response time and the access time of a disk. The response time is the time an I/O request needs to wait before it is served after it was submitted. The access time of an HDD is the sum of the seek time and the actual transmission time. The seek time consists of the disk arm transfer and the spin delay, or rotational latency. Before reading or writing can happen, the disk head needs to be positioned at the beginning of the right sector on the disk. The seek time derives from moving the disk arm onto the right track and then waiting for the disk to spin so that the correct sector is under the disk head. Transmission time is usually considerably less than seek time. Seek time can be minimised by intelligent positioning of the data on the disk and also by performing disk read or write requests in the best possible order. The former is done by the file system and the latter is called I/O scheduling. [19]
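As a back-of-the-envelope illustration of the spin delay: on a 7200 RPM drive one revolution takes 60 s / 7200 ≈ 8.3 ms, so the average rotational latency (half a revolution) is about 4.2 ms, which has to be paid on top of the arm transfer time for every random access.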
3.3.1 Linux File systems
A file system is an abstraction to map data blocks on a block device, such as an HDD or SSD, to meaningful files for the operating system. A file system uses data structures called inodes to save information about the files (metadata). An inode contains information about the owner of the file, an access control vector, timestamps for file creation and modification, the file size, the type of the file (e.g. directory, regular file, link, etc.) and pointers to the actual data on the device. [14]

The usability of a file system can be measured by two common metrics. The first is how efficiently a file system stores files, i.e. how much space is wasted. The second is how efficiently data can be transferred. Using bigger disk blocks can improve the transfer rate as more data is handled at once, but also more disk space is wasted, as the last block of a file is rarely completely full.

Most file systems today are journaling file systems. A journaling file system means that the file system keeps a journal of its writes in case of failures in the writing process. When data is written to a drive, the metadata also needs to be updated. If the data on the drive and the metadata are out of sync, the file system is said to be corrupted. This can occur, for example, in the case of a sudden power outage if only either the data or the metadata was updated, but not both. To increase throughput performance, drives usually exploit drive caches heavily, which can delay the writes and cause the drive to be out of sync. When the drive gets back online, the file system can go through its journal and replay every step to fix a possibly corrupted file system. Without journaling, the whole file system would need a consistency check, which would be a drastically slower operation. One of the primary concerns with all file systems is the speed at which a file system can be validated and recovered after corruption.

The most popular file system in Linux during the first decade of the 21st century was the Ext3 file system, which is still widely used. Ext3 is the default file system for the Rocks cluster software. Ext3 is a journaling file system with a maximum volume size of 16 terabytes.

Ext4 is the successor of the Ext3 file system. The main motivation for developing the new version was the 16 TB volume size limit of Ext3, which stems from 32-bit block numbers. Ext4 uses 48-bit block numbers and can have volumes up to 1 exabyte with a 4 kB block size. Ext4 also incorporates scalability and performance enhancements. The Ext4 developers provided benchmark results which show improvement especially in write I/O requests. The dominating role of Ext3 is acknowledged, and an upgrade to Ext4 is easy and can be made without losing data. Ext3 is however perceived as reliable and stable and is thus still the file system of choice in many systems which do not need support for larger volume sizes. [25]

XFS is a file system created in the mid-1990s by Silicon Graphics Inc. for their own IRIX OS, but it was later ported to Linux. XFS is also a journaling file system. XFS was designed to be scalable and to support large file and directory sizes. The maximum volume size of XFS is 16 exabytes. [35]
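Creating any of these file systems on a drive is a one-line operation; for example, either Ext4 or XFS can be created with the following commands (the partition /dev/sdb1 is only an example, and the command destroys its existing contents):
mkfs.ext4 /dev/sdb1
mkfs.xfs /dev/sdb1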
3.3.2 GlusterFS
GlusterFS is a distributed file system, developed by Gluster Inc. and provided under the GNU AGPL v3 license. The GlusterFS architecture is based on a peer-to-peer model. Server machines share part of their disk space, called a brick, into a collective data pool. These bricks are then used to create virtual data volumes. Data mirroring and data striping are both supported. On the servers, data is stored on local file systems. Actually, what a server shares is a directory, which becomes the root of the brick on that partition. Note that any free space can therefore be used either by GlusterFS or by the local file system, and the size of a GlusterFS volume thus changes dynamically. Bricks can be added and removed on the fly without disturbing the system. In case of resource removal, the data hosted by that node is migrated to another location automatically.

To access these data volumes, client software is needed. The data volume is mounted as part of a local file system with FUSE, Filesystem in Userspace. FUSE is an API emulating the behaviour of a conventional file system. Each client has a dummy version of the directory tree of the volume (a file system). It contains the metadata (inode) information, but the file size is zero. The actual data is distributed by using a hash calculated from the name and directory path of the file. Each file is thus mapped to a particular virtual subvolume. These virtual subvolumes are mapped to specific bricks, i.e. hosts. Using the hash algorithm, a file name can now be connected with the host storing the actual file data. When a file is renamed, a pointer is created on the new host to redirect to the old location while the data is migrated to the new location in the background. When the data transfer is complete, the pointer can be removed.

Any particular machine can act both as a server and as a client at the same time, i.e. run both the server and the client software. Other features of Gluster are load balancing, failover recovery, I/O scheduling, caching and quotas. Gluster supports InfiniBand and Ethernet (TCP/IP) for networking. [4]
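As a rough sketch of this workflow with the Gluster command line tools of that era (the host names, brick paths, volume name and mount point are only examples), a replicated volume is created from two bricks, started, and then mounted on a client via FUSE:
gluster volume create testvol replica 2 server1:/export/brick1 server2:/export/brick1
gluster volume start testvol
mount -t glusterfs server1:/testvol /mnt/gluster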
3.3.3 Linux I/O scheduling
An I/O scheduler is a kernel component which controls the I/O queue and uses a scheduler-specific algorithm to arrange incoming I/O requests. When an I/O request is received from a file system through the block layer interface, the I/O scheduler inserts it into the queue and eventually passes it to the disk controller through the device driver interface. [28]

Linux can be said to be optimised for magnetic disks [21]. This section discusses primarily the scheduling of HDDs in a Linux environment. Scheduling with SSDs is discussed in Section 4.2.

The current Linux kernel 2.6 has four built-in schedulers. They are called noop, anticipatory, deadline and cfq. The cfq is the current default scheduler. These schedulers are discussed later in detail, but first a little insight into how the disk controller operates.

Disk usage can be optimised by trying to minimise the disk arm transfer, i.e. the seek time. Common algorithms are called FIFO, SSTF, SCAN and C-SCAN. FIFO (First In First Out) does no optimization. SSTF (Shortest Seek Time First) always serves the request closest to the current disk head position. This can lead to a situation, especially on a device under heavy load, where the disk head keeps servicing requests on nearby disk blocks while other requests on the outer edges of the disk face long waiting periods or even starvation. Starvation is a state where a process is waiting for an event that never happens. SCAN simply scans the disk from one edge to another, turns back whenever it reaches the inner or outer edge of the disk and starts to scan the disk in the other direction. SCAN is sometimes referred to as the elevator algorithm due to its operation logic being similar to that of elevators. C-SCAN is like SCAN, but with the difference that it always scans the disk in the same direction. When the arm reaches the edge of the disk, it is moved to the opposite edge with one long disk arm transfer. SCAN and C-SCAN are not affected by starvation. Of the Linux 2.6 basic schedulers, the noop is based on FIFO and the others on SCAN-type disk arm transfer algorithms [21]. [19]
The purpose of an I/O scheduler is to improve performance either by increasing the total bandwidth of the disk or by reducing the access time of individual I/O requests. I/O schedulers use operations called sorting and merging of I/O requests as tools to minimize disk seek times. The sorting operation orders requests based on their sector number and inserts incoming requests into their right place in the queue. This way, if the disk is used with either a SCAN- or C-SCAN-based scheduler, no unnecessary disk arm movement has to be made. Merging merely means that requests from different processes to the same data block are recognised and served together. It also has to be noted that read operations are usually synchronous, as a process is waiting for them to finish. Write operations, on the other hand, are usually asynchronous, which means they do not need to be served immediately and can be stored temporarily in a cache. [28]

The simplest I/O scheduler in the default Linux kernel 2.6 is the noop I/O scheduler. Noop has minimal overhead and does only basic merging and sorting of I/O requests. Noop can be a good choice when an HDD is not used directly: either the scheduling is done somewhere else than inside the Linux kernel, or a non-mechanical drive, such as an SSD, is used. RAID controllers do their own scheduling, and the Linux kernel does not have any knowledge of the actual disk states in a RAID array. Therefore the Linux kernel can only interfere by doing additional I/O request sorting. Merging of requests is of course desirable. SSDs, on the other hand, have no moving parts and therefore do not suffer from seek time delays. [28]

The Deadline scheduler implements sorting of requests, but also implements an expiry time for each request. The basic idea is aggressive reordering of requests while at the same time making sure no request has to wait too long to be served. If a request is about to expire before it is served, deadline starts to serve that request immediately. Read requests are given higher priority than write requests, but nonetheless deadline mixes write requests in with read requests even when there are more pending read requests. Deadline makes a compromise between high throughput and low I/O request response time. [12, 28]

The Anticipatory (AS) I/O scheduler behaves like deadline, but adds a feature called anticipation. Anticipation derives from a situation called deceptive idleness. Deceptive idleness happens when a read operation finishes and the process which requested it continues execution only to make a consecutive read request. Normally the disk arm would already have moved to another position, but now the disk waits for a small period of time in case the process makes another I/O request. Naturally this behaviour has a negative effect on performance if the process does not make another sequential read request. On some workloads, however, the overall performance can be improved. There are mechanisms, such as cost-benefit analysis or statistical analysis of the probability of such a request arriving, which reduce the negative effect of this behaviour. AS tries to reduce the read response time for each thread. [28]

Finally, the current default Linux I/O scheduler is the Completely Fair Queuing (CFQ) I/O scheduler. The basic idea of CFQ is to provide fair treatment among different processes and to share the I/O bandwidth evenly between the I/O requests. Internally CFQ has many I/O queues, which are operated in a strict FIFO manner. Each process is given its own queue, derived from the process' PID with a hash algorithm. CFQ selects I/O requests from these queues in a round-robin manner and moves them into a dispatch queue, which is then sorted and sent out to the device driver. [28]

It is important to note that while AS and CFQ are implemented as Linux kernel components, anticipatory and completely fair queuing are mere scheduling algorithms. Anticipation can be built on any scheduling scheme, not just on deadline. Also, completely fair queuing does not need a hash algorithm to operate; any other desired technique can be used to allocate the I/O queues for processes.

Changing the scheduler in Linux can be done from a command prompt. For example, setting the noop scheduler for the drive /dev/sdb:
echo noop > /sys/block/sdb/queue/scheduler
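The currently active scheduler can be checked from the same sysfs file; the kernel lists the available schedulers and marks the active one in brackets:
cat /sys/block/sdb/queue/scheduler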
3.3.4 Read-ahead
Read-ahead is a mechanism to improve the performance of a block device. The idea of read-ahead is that for every read request served, an additional amount of data is also read from the block device into a cache. It is likely that this data will be requested soon after. When such a request is received, the data can be provided directly from the cache, avoiding the costly seek operation. The size of the read-ahead window is configurable per device.

Changing the read-ahead value in Linux can be done from a command prompt. For example, setting the read-ahead to 4 kB for the drive /dev/sdb:
echo 4 > /sys/block/sdb/queue/read_ahead_kb
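The value currently in use can be read back from the same file:
cat /sys/block/sdb/queue/read_ahead_kb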
4. PREVIOUS WORK
This chapter discusses previous research related to the topic of this thesis. The first section discusses how SSDs are used in a server environment. The second section discusses research on I/O scheduling.

4.1 SSDs on servers
Lee et al. [23] conducted a study whose objective was to identify the areas where SSDs can best be utilized in enterprise database applications. They concluded that using SSDs for transaction log, rollback and temporary data storage is superior to HDDs. They argued that the performance of transactional database applications is more limited by disk latency than by disk bandwidth, and that writing log records is a significant performance bottleneck. They pointed out that the I/O pattern of a workload trace collected from a commercial database server is favourable to SSDs. By implementing these changes on their test server, they managed to transform it from I/O bound to CPU bound. Their tests showed an order of magnitude improvement in transaction throughput and response time. Also, the time for processing complicated database operations that required the use of a temporary data area dropped to one third.

Schmidt et al. [32] conducted a study on using SSDs in a database environment in an attempt to increase efficiency and reduce costs. They concluded that SSDs outperformed HDDs both in performance and energy efficiency, but the overall cost analysis still favoured HDDs. They argued that the only suitable usage for SSDs is in applications with a high IOPS demand, where IOPS/$ or capacity/$ are of minor importance. In their benchmark tests, they used the rate of transactions per second to measure performance. The tests showed that with small database sizes (10 MB), HDDs and SSDs were equal for read-only workloads, with HDDs having a slight edge for mixed workloads. However, the performance of the HDDs quickly decreased by as much as 50% when the size of the database was grown tenfold (100 MB), while this had little effect on the SSDs. Growing the size of the database another ten times (1000 MB) dropped the performance of the HDDs by another 25%, while still not affecting the SSDs. All of this applied to both read-only and mixed workloads.
Narayanan et al. [26] reported similar results in their study, where they performed a cost-benefit analysis for a range of workloads. They used 49 different workload traces collected from 15 different server machines (storage size ranging from 22 GB to 6.7 TB) to compare SSDs and HDDs. They found out that in all cases the dominating factor was either the storage capacity or the random-read IOPS requirement. However, due to the low capacity/$ of SSDs, the HDDs always provided the cheapest solution. They presented calculations showing that, depending on the workload, the capacity per dollar of SSDs needs to improve by a factor of 3-3000. They also argued that energy efficiency is not as important a reason to make the transition to SSDs, as low-speed SATA disks are competitive in terms of performance and capacity per watt.

According to Leventhal [24], SSDs should be used as a complement to an existing storage system, not as a replacement. He argued that SSDs "fall in a sweet spot" between HDD and RAM and that the characteristics of flash make it ideal for certain applications, e.g. logging and caching for databases. He pointed out that replacing part of the RAM with SSDs for caching, where applicable, can turn out to be an economically better alternative. He even implied that having SSDs as an intermediate layer also justifies a system with slower spinning disks. He believed that the right balance of cost and performance could be found for any workload.
4.2 Scheduling
Pratt and Heger [28] conducted a performance evaluation of the Linux 2.6 I/O schedulers. In their tests, they simulated I/O patterns on different hardware setups, including both single-disk and RAID configurations. They used the Ext3 and XFS file systems and various workload scenarios. They concluded that selecting an I/O scheduler has to be based on the workload pattern, the hardware setup and the file system used, or as they put it, "there is no silver bullet". Carroll [15] conducted a similar study on I/O schedulers in a RAID environment. He also found the selection of the I/O scheduler to be workload dependent and that I/O scheduling improves performance only on small to medium size RAID arrays (six disks or fewer).

Kim et al. [21] conducted a study analysing I/O schedulers on SSDs. They argued that scheduling itself does not improve the read performance of an SSD, but preferring read requests over write requests does. They presented and implemented a scheduling scheme that exploits the characteristics of the SSD. The scheme is quite simple: it just bundles write requests together to match the logical block size and schedules read requests independently in a FIFO manner. Their benchmark tests showed up to 17% improvements over the existing Linux schedulers (presented in Section 3.3.3). The test results also showed that the schedulers did not make a notable difference under read-oriented workloads on SSDs. On a side note, the anticipatory scheduler seemed to outperform the other existing schedulers. This is quite strange, as anticipation was designed around the sequential placement of data on HDDs and thus keeps the device idle for short periods of time. This should not improve the performance of an SSD but, on the contrary, degrade it. The phenomenon can be explained by noting that an individual process can benefit from getting exclusive service for bursty I/O requests, thus improving the overall performance. However, this is more a matter of process optimisation than of I/O optimisation.
5. TESTING ENERGY EFFICIENCY
This chapter discusses the test environment and the actual tests conducted. The first section describes the test cluster in detail. The second section presents the test tools used; the physics software and the software and hardware instruments used to gather data are discussed in this section. The third and last section discusses the practical side of running the tests and describes how the tests were conducted.

5.1 Test Cluster
5.1.1 Operating System: Rocks 5.3
The choice for the operating system of the test cluster is Rocks 5.3, an open-source Linux cluster distribution. Rocks is developed by the Rocks Cluster Group at the San Diego Supercomputer Center at the University of California, San Diego and its contributors. Rocks is a fully stand-alone system and cannot be installed on top of an existing system. Rocks is basically a Red Hat Linux bundled together with a whole set of cluster software. The driving motivation behind Rocks is to make clusters easy to deploy, manage, upgrade and scale. This does not mean that Rocks would be inadequate or inefficient for high-performance cluster computing. On the contrary, Rocks is used in many universities and institutions around the world.

Installing and maintaining Rocks is easy. First you have to install the frontend machine. This does not differ much from a normal Linux installation. Rocks contains many optional packages, called rolls, which you can pick to go with your basic installation. These rolls contain additional software you may want to install. For example, the Sun Grid Engine (SGE) roll was included and used as the batch-queuing system for the test cluster. After installing the frontend, a cluster also needs compute nodes. Installation of a compute node is easy: all that is needed is to configure the compute node to boot from the network. A compute node registers itself to the frontend database, downloads a system image from the frontend (or from other compute nodes) and performs a quick installation. In fact, Rocks even deals with errors simply by re-installing the compute node rather than trying to fix it. If the default configuration setup and system image are not sufficient for your needs, or you want to modify your compute nodes later, all you need to do is configure some text files on the frontend, possibly add some additional packages to be installed on the compute nodes, assemble a new system image and re-install the nodes.

Rocks also comes with many software tools that make the administration and management of a cluster easy. Most notable is Ganglia, a web-based cluster monitoring software. [33]
With SGE it is possible to configure the slot size for each compute node. The slot size defines how many simultaneous jobs can be submitted to a single compute node. The name actually derives from the number of CPU slots a machine has, and it suggests that the number of CPU cores should equal the number of simultaneous compute jobs. However, this study wanted to examine what kind of effect this has on performance. This study uses the term relative slot size to refer to the ratio of the slot size to the number of actual CPU cores. For example, in the test cluster, with quad-core machines, a slot size of eight would equal a relative slot size of two.
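As a sketch of how this could be set in SGE (the queue name, host name and slot count below are only examples), the slots attribute of a single queue instance can be changed with qconf, here to eight slots, i.e. a relative slot size of two on a quad-core node; alternatively, the whole queue configuration can be edited interactively with qconf -mq all.q:
qconf -rattr queue slots 8 all.q@compute-0-0.local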
5.1.2 Hardware
The test environment consists of a computing cluster and a dedicated file server. The cluster is composed of four machines: a frontend and three compute nodes. Detailed specifications are presented in Table 5.1. Detailed specifications of the drives used are presented in Table 5.2.

Table 5.1: Test Cluster
Frontend Nodes File Server
Model Dell server Dell R210 Dell R710
Processor Intel Xeon 2.8 GHz Intel Xeon 2.4 GHz Intel Xeon 2 GHz
CPU cores 2 4 4
RAM 2 GB 8 GB 2 GB
Disk (OS) 160 GB SATA (7.2k) 250 GB SATA (7.2k) 146 GB SAS (10k)
Ethernet 2x 1 Gb 2x 1 Gb 4x 1 Gb
The SSDs are Corsair CSSD-F40GB-2 drives with a SATA II 3.0 Gb/s interface. The Corsair F40 utilises MLC NAND flash technology. According to the manufacturer's own specifications, the Corsair F40 can reach read and write speeds of 280 MB/s and 270 MB/s respectively and perform 50k IOPS. [17]

The HDDs are Scorpio Black WD3200BEKT drives from Western Digital, with a 7200 RPM spindle speed and a SATA II 3.0 Gb/s interface. According to a review by the Tom's Hardware web site, to give a rough estimate of the performance of the HDD, the WD3200BEKT was benchmarked with an access time of 15.4 ms (including spin delay), a maximum read speed of 84.3 MB/s, a maximum write speed of 83 MB/s and a peak power of 3.26 W [1]. Western Digital [37] announces the WD3200BEKT to have an average latency of 4.2 ms and an average seek time of 12 ms, which converge quite well with the numbers from the Tom's Hardware review. However, the power consumption does not converge, as Western Digital announces the WD3200BEKT to have an idle power of 0.85 W and an average power consumption of 2.1 W. The manufacturer's numbers for HDD bandwidth also differ considerably, as Western Digital claims the disk can put out up to 108 MB/s for both read and write.
Table 5.2: Manufacturer specifications of the drives. Prices: www.newegg.com (cited 1-Feb-2011)
HDD SSD
Model WD Scorpio Black Corsair F40
Size 320 GB 40 GB
Price $59.99 $104.99
GB/$ 5.3 0.38
Random access time 16 ms 0.02 ms
Read speed 108 MB/s 280 MB/s
Write speed 108 MB/s 270 MB/s
IOPS - 50000
Idle power 0.8 W 0.5 W
Active power 1.75 W 2.0 W
5.2 Test Tools
5.2.1 Computing at CERN
The Large Hadron Collider (LHC) is a particle accelerator at CERN. The four main
detectors of the LHC can produce 15 petabytes of data a year [6]. The distributed
computing and data storage infrastructure built to process this vast amount of data
is called the Worldwide LHC Computing Grid (WLCG). As of February 2011, the
WLCG had 246,000 processing cores and 142 petabytes of disk space [8].
The CERN computing infrastructure is divided into three levels of tier centres.
The Tier-0 centre is located at CERN and is responsible for storing the first copy of
the RAW experiment data from the LHC. It is also responsible for producing the first
reconstruction pass and for distributing the data to the Tier-1 centres. The Tier-1 centres together
are responsible for storing the second copies of the data stored in Tier-0. Tier-1
centres also further reprocess the data and distribute it to the Tier-2 centres. Tier-2
centres are responsible for serving the analysis requirements of the physicists and
also for producing and reprocessing the simulated data. The simulated data is also
distributed to the Tier-1 centres. As of February 2011, besides the Tier-0 centre, there
are 11 Tier-1 centres and 164 Tier-2 centres in the world [7]. [18]
5.2.2 CMSSW
The Compact Muon Solenoid (CMS) is one of the four big research projects attached
to the LHC. CMS can also refer to the actual particle detector. The Compact Muon
Solenoid Software (CMSSW) is a physics software toolkit for analysing the data
from the CMS detector.
A central concept within CMSSW is the event. An Event is a C++ object
container. An Event contains many data tiers for all RAW and reconstructed data
related to a particular collision. The RAW data is the full event information,
collected directly from the LHC. The RAW data is unmanipulated and is not used
for analysis. The reconstructed or RECO data is reconstructed into physics objects
and still contains most of the event information. This RECO data can be used for
analysis, but it is not convenient for any substantial data sample. Analysis Object
Data (AOD) is a subset of RECO data. AOD is expected to be used in analysis, as
AODs are essentially events screened beforehand. All objects in the Event may be
individually or collectively stored in ROOT files. Event data can also be stored
in different files to limit the size of a file and to prevent transferring unnecessary
data. This data tier model of an Event is illustrated in Figure 5.1.
Figure 5.1: Data model used in CMSSW. Source: CMS WorkBook [16]
CMSSW also supports Monte Carlo simulation. Data samples generated by Monte Carlo are used to simulate the physics
signal under investigation. Monte Carlo can also be used for creating sample data for personal
use.
CMSSW consists of many modules, which contain general-purpose code for
analysing the events. The goal is to minimise the amount of code a physicist has to write
himself. A configuration file is needed to tell CMSSW which modules to load
and where the data can be found. The executable is called cmsRun. [16]
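The actual job scripts used in this work are attached as appendices; as a hedged sketch only, a minimal SGE job script that sets up the CMSSW runtime and runs cmsRun could look like the following, where the installation path, working directory and configuration file name are hypothetical:

    #!/bin/bash
    #$ -N cms-test-job        # job name shown in the SGE queue
    #$ -cwd                   # run in the submission directory
    #$ -j y                   # join stdout and stderr into one log
    # Set up the CMSSW runtime environment (paths are assumptions)
    source /opt/cms/cmsset_default.sh
    cd $HOME/CMSSW/src
    eval `scramv1 runtime -sh`
    # Run the analysis; the configuration file names the input ROOT file to read
    cmsRun analysis_cfg.py

A script like this would be submitted to the queue with qsub, for example "qsub cms_job.sh".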
5.2.3 ROOT framework and ROOT files
ROOT is a C++ framework designed for large-scale data analysis and data mining.
ROOT was first created at CERN, the project starting in 1995, and is still used at
CERN for analysing the particle physics data created by the LHC. One of the fundamental
design principles was that although the programs analysing the data may
change as time passes, the actual data does not. It was also designed to scale to
handle petabytes of data. ROOT relies on a "write once, read many" model due to
the nature of the data, and makes it possible to compress the data efficiently.
A ROOT file is a compressed binary file which can store any instance of a C++
class. Data is stored in a ROOT file with a data description so that it can be read
even if the original program used to store the data is lost. Data can be stored in
a ROOT file in both a row-wise and a column-wise manner. If the data is stored by columns,
reading the same data member from multiple instances speeds up considerably, as
unwanted pieces of data can be skipped. For example, in one instance, when a 280 MB
ROOT file was analysed, only 6.6 MB of data was transferred over the network.
ROOT even implements an auto-adaptive pre-fetch mechanism, reading the next
entry while the previous entry is still being processed.
ROOT supports an XML representation of data, but does not actually save data
in XML form due to the verbose nature of XML. A database abstraction layer is also
provided, making it possible to store data in a ROOT file in a database-like manner.
[29], [11]
5.2.4 Measuring Tools
During the tests, performance data was collected from the cluster by using both
hardware and software tools. The actual power consumption was measured with
a WattsUp? electricity meter, which was attached to the frontend machine via
USB. A shell script was used to read the meter information once every second
and to write the information into a log file. The electricity meter also provided a
cumulative reading for the watt hours (Wh) consumed. The power consumption was
measured separately for the file server and for all of the compute nodes. The power
consumption of the frontend machine was not measured. A grid monitoring software
called Ganglia was also used. Ganglia operates by constantly receiving status reports
from the other machines in the cluster. Ganglia has a browser user interface to display
cluster performance metrics, such as network traffic, CPU utilisation of individual
machines, the job queue, etc. The server logs were collected and stored together with
the other output data.
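The logging script itself is not reproduced here; the following is only a hedged sketch of the polling idea it implements. The reader command wattsup_read and the device path are hypothetical stand-ins for however the meter is actually read over its USB serial interface:

    #!/bin/bash
    # Poll the electricity meter once per second and append a timestamped
    # reading to a log file.
    LOG=/var/log/wattage.log
    while true; do
        READING=$(wattsup_read /dev/ttyUSB0)   # hypothetical helper command
        echo "$(date +%s) $READING" >> "$LOG"
        sleep 1
    done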
5.3 Conducting Tests
5.3.1 About performance and energy efficiency
We distinguish performance and energy efficiency as two different optimisation
goals. Performance is measured by the average processing times of the
CMS jobs. Energy efficiency is measured by the energy in watt hours needed
by an individual CMS job on average. These two can be highly dependent on each
other. After all, by definition, energy equals time × power. However, the power does
not need to be constant. It is possible that increasing the performance also
has some kind of effect on the power usage. Thus, these two need to be studied
separately.
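As a purely illustrative calculation (the numbers are not measurements from this work): if a CMS job runs for 30 minutes on a node drawing 120 W on average, it consumes E = P × t = 120 W × 0.5 h = 60 Wh. A configuration that finishes the same job in 20 minutes but raises the average power to 200 W consumes about 200 W × 0.33 h ≈ 67 Wh, so the faster configuration is less energy efficient despite its better performance.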
5.3.2 Running tests
We reated someLinux shellsripts both toautomatiseand standardisethe testing
proess. Shell sripts were responsible for submitting the jobs, hanging ongura-
tions whereappliable (forexampleshedulingalgorithm),learingahes, starting
and stopping wattage measurement and writing log entries. The shell sripts are
attahed asappendies. Appendix A shows the mainsript, Appendix B shows the
sript used for an individual test run and appendix C shows the sript responsible
for initialisingand runningthe atualCMS job. Installingthe drives and hanging
thelesystemneeded tobedonemanually. Ashellsriptwasalsousedforreating
thetest inputdata onthetarget storagefor theCMS jobs. Toensurehomogeneous
of the test data between dierent test ongurations and between individual jobs,
the test datawasopied from the frontend for eah time a lesystem was reated.
The driveahes both onompute hostand data host wasleared between the test
runs with shell ommand:
syn; eho 3 > /pro/sys/vm/drop_ahes
Every test run was identical. The shell script first cleared the caches and then set the
scheduling algorithm. Then the slot size of the SGE was configured. Each compute
node had 4 CPU cores, as shown in Table 5.1. Slot sizes of 2, 4, 8 and 12 (relative slot
sizes of 0.5, 1, 2 and 3) were used to assign loads of 50-300% to each compute node.
After the cluster was configured, the script submitted CMS jobs via SGE to each
compute node, as many as the current slot size of the node. Just before the jobs were
submitted, another script was started to log the wattage as mentioned in Section
5.2.4. When all the jobs were finished, the logging script was also terminated. Using
the log file, the starting and finishing time of a CMS job can be determined, as well as
how much energy (watt hours) was consumed. After the first set of CMS jobs was
finished, the script increased the slot size and ran a new set of jobs. When finished
with a slot size of 12, the scheduler was changed and the slot size was set back to 2. This
was repeated until all combinations of the four different slot sizes and the four different
schedulers had been used. All in all, one such test run submitted 312 CMS jobs and
took about 8-10 hours to finish.
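Appendices A and B contain the actual scripts; the following is only a simplified, hedged sketch of the nested loop they implement. The scheduler names, the device name sdb, the node and queue names and the job script name run_cms_job.sh are assumptions made for illustration, not details taken from the appendices:

    #!/bin/bash
    # Simplified sketch of one test run: iterate over I/O schedulers and SGE
    # slot sizes, clear caches, and submit one CMS job per available slot.
    for SCHED in noop anticipatory deadline cfq; do
        for SLOTS in 2 4 8 12; do
            for NODE in compute-0-0 compute-0-1 compute-0-2; do
                # Select the I/O scheduler and clear caches on every node
                ssh $NODE "echo $SCHED > /sys/block/sdb/queue/scheduler"
                ssh $NODE "sync; echo 3 > /proc/sys/vm/drop_caches"
                # Set the SGE slot count for this node
                qconf -mattr queue slots "[$NODE.local=$SLOTS]" all.q
            done
            # Submit as many CMS jobs as there are slots on the three nodes
            for JOB in $(seq 1 $((3 * SLOTS))); do
                qsub run_cms_job.sh
            done
            # Wait until the queue is empty before the next configuration
            while qstat | grep -q cms; do sleep 60; done
        done
    done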
First, the tests were conducted with the NAS. A RAID-5 configuration of 6 HDDs (320
GB) and of 4 SSDs (40 GB) was set up, creating volumes of 1.6 TB and 120 GB,
respectively. The ROOT file used was 656 MB in size and it was copied to the NAS
a total of 72 times, thus allocating 47 GB of the total volume. The
files were renamed to "data-01-01.root"..."data-06-12.root", where the first number
represented the node number and the second number represented the job number. This
ensured that no two CMS jobs were using the same data file. Also, the value of the
read-ahead was altered to test the effect it had on the performance. Read-ahead
values of 4 kB, 8 kB, 16 kB and 32 kB were used. After a test run of 312 CMS jobs
finished, a new test run was started after changing the read-ahead value, the file
system or the RAID "disks" from HDDs to SSDs. All in all, the test run was conducted
a total of 24 times: 3 file systems × 4 read-ahead values × 2 different RAID "disk" types equals 24.
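One way to change the read-ahead value of a block device is through the sysfs interface; whether the test scripts used this route or another utility is not specified here. A hedged example, assuming the RAID volume appears as the device md0 on the file server:

    # Set the read-ahead of the block device to 16 KB and verify it
    # (md0 is an assumed device name)
    echo 16 > /sys/block/md0/queue/read_ahead_kb
    cat /sys/block/md0/queue/read_ahead_kb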
At this point, taking a quick look over the results, a pattern was perceived that
indicated that increasing the read-ahead value had a negative impact on the performance.
The reason most likely was that the ROOT file is a binary file and the AOD
within the file is scattered. It was decided not to use the read-ahead value
as a configuration parameter anymore. Also at this point, one test run was performed by
using only 4 HDDs for an easier comparison against the 4 SSDs. Again, based on the
preliminary results, the best performing configuration of the 6 HDDs was picked
and one more test run with 4 HDDs was performed with that configuration. Also,
the energy consumption of the idle compute nodes and the NAS appliance was measured,
both with and without the RAID pack. The idle tests logged an idle machine for
one hour from startup. These results are presented in Appendix D.
Next, the SSDs were installed on the compute nodes and configured as one big
GlusterFS volume. With three nodes and without any striping or mirroring, the 40
GB SSDs created a volume of 120 GB. The test run was also conducted with this
configuration before dismounting the Gluster configuration and running the tests
directly from the local drives. Because the test data totalled 47 GB, all of it
could not be fitted onto the 40 GB drives, so only half of it was used, copying 24
GB of test data to each drive. This way, plenty of free space was left on the devices,
as had been the case also on the earlier test runs.
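As a hedged illustration of such a setup (the commands come from the GlusterFS command line interface, but the brick paths, volume name and mount point are assumptions), a distributed volume over the three compute nodes could be created roughly as follows:

    # Create a distributed (no striping or mirroring) GlusterFS volume from
    # one brick per compute node, then start it and mount it on a client.
    gluster volume create testvol \
        compute-0-0:/export/ssd compute-0-1:/export/ssd compute-0-2:/export/ssd
    gluster volume start testvol
    mount -t glusterfs compute-0-0:/testvol /mnt/gluster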
Finally, the SSDs were changed to HDDs inside the compute nodes. As with the
SSDs, a GlusterFS volume was created first. With 320 GB in each node, a volume
of 960 GB could be hosted by the nodes. After running the tests on Gluster, the
same tests were conducted again with the local drives. This time, though, the whole 47
GB of test data was copied to each HDD.