4. PREVIOUS WORK
4.2 Scheduling
Pratt and Heger [28] conducted a study on performance evaluation of Linux 2.6 I/O schedulers. In their tests, they simulated I/O patterns on different hardware setups, including both single-disk and RAID configurations. They used Ext3 and XFS filesystems and various workload scenarios. They concluded that the selection of an I/O scheduler has to be based on the workload pattern, the hardware setup and the filesystem used, or as they put it, "there is no silver bullet". Carroll [15] conducted a similar study on I/O schedulers in a RAID environment. He also found the selection of the I/O scheduler to be workload dependent and that I/O scheduling improves performance only on small to medium size RAID arrays (six disks or less).
Kim et al. [21] conducted a study to analyse I/O schedulers on SSDs. They argued that scheduling itself does not improve the read performance of an SSD, but preferring read requests over write requests does. They presented and implemented a scheduling scheme that exploits the characteristics of the SSD. The scheme is quite simple: it just bundles write requests together to match the logical block size and schedules read requests independently in a FIFO manner. Their benchmark tests showed up to 17% improvements over the existing Linux schedulers (presented in Section 3.3.3). Test results also showed that the schedulers did not make a notable difference under read-oriented workloads on SSDs. As a side note, the anticipatory scheduler seemed to outperform the other existing schedulers. This is quite strange, as anticipatory scheduling is designed to exploit the spatial locality of data on HDDs and thus the device is kept idle for short periods of time. This should not improve the performance of an SSD, but on the contrary, degrade it. This phenomenon can be explained by noting that an individual process can benefit from getting exclusive service for bursty I/O requests, thus improving the overall performance. However, this is more a matter of process optimisation than I/O optimisation.
5. TESTING ENERGY EFFICIENCY
This chapter discusses the test environment and the actual tests conducted. The first section describes the test cluster in detail. The second section presents the test tools used: the physics software and the software and hardware instruments used to gather data are discussed in this section. The third and last section discusses the practical side of running the tests and describes how the tests were conducted.
5.1 Test Cluster
5.1.1 Operating System: Rocks 5.3
The choice for the operating system of the test cluster is Rocks 5.3, an open-source Linux cluster distribution. Rocks is developed by the Rocks Cluster Group at the San Diego Supercomputer Center at the University of California, San Diego and its contributors. Rocks is a fully stand-alone system and cannot be installed on top of an existing system. Rocks is basically Red Hat Linux bundled together with a whole set of cluster software. The driving motivation behind Rocks is to make clusters easy to deploy, manage, upgrade and scale. This does not mean that Rocks would be inadequate or inefficient for high performance cluster computing. On the contrary, Rocks is used in many universities and institutions around the world.
Installing and maintaining Rocks is easy. First you have to install the frontend machine. This does not differ much from a normal Linux installation. Rocks contains many optional packages, called rolls, which you can pick to go with your basic installation. These rolls contain additional software you may want to install. For example, the Sun Grid Engine (SGE) roll was included and used as the choice of the batch-queuing system for the test cluster. After installing the frontend, a cluster also needs compute nodes. Installation of a compute node is easy. All that is needed is to configure the compute node to boot from the network. A compute node registers itself to the frontend database, downloads a system image from the frontend (or from other compute nodes) and performs a quick installation. In fact, Rocks even deals with errors just by re-installing the compute node rather than trying to fix it. If the default configuration and system image are not sufficient for your needs, or you want to modify your compute nodes later, all you need to do is to configure some text files on the frontend, maybe add some additional packages to be installed on the compute nodes, assemble a new system image and re-install the nodes.
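On the command line this boils down to a couple of commands. The following is a minimal sketch for a default Rocks 5 installation; the node name and paths are illustrative and may differ per setup.

    # Rebuild the distribution after modifying configuration files or
    # adding packages (run on the frontend; default Rocks 5 install tree)
    cd /export/rocks/install
    rocks create distro

    # Flag a compute node for re-installation on next boot and reboot it
    rocks set host boot compute-0-0 action=install
    rocks run host compute-0-0 "reboot"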
Rocks also comes with many software tools that make the administration and management of a cluster easy. Most notable is Ganglia, a web-based cluster monitoring software. [33]
With SGE it is possible to configure the slot size for each compute node. A slot size defines how many simultaneous jobs can be submitted to a single compute node. The name actually derives from the number of CPU slots a machine has, and it suggests that the number of CPU cores should be equal to the number of simultaneous compute jobs. However, this study wanted to try what kind of effect this ratio has on the performance. This study uses the term relative slot size to refer to the ratio of the slot size and the number of actual CPU cores. For example, in the test cluster, with quad-core machines, a slot size of eight would equal a relative slot size of two.
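Setting the slot size can be scripted with the SGE configuration tool; a minimal sketch, assuming the default all.q queue that the Rocks SGE roll creates:

    # Give every host in the default queue eight slots, i.e. a relative
    # slot size of two on the quad-core compute nodes
    qconf -mattr queue slots 8 all.q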
5.1.2 Hardware
The test environment consists of a computing cluster and a dedicated file server. The cluster is composed of four machines: a frontend and three compute nodes. Detailed specifications are presented in Table 5.1. Detailed specifications of the drives used are presented in Table 5.2.
Table 5.1: Test Cluster

             Frontend              Nodes                 File Server
Model        Dell server           Dell R210             Dell R710
Processor    Intel Xeon 2.8 GHz    Intel Xeon 2.4 GHz    Intel Xeon 2 GHz
CPU cores    2                     4                     4
RAM          2 GB                  8 GB                  2 GB
Disk (OS)    160 GB SATA (7.2k)    250 GB SATA (7.2k)    146 GB SAS (10k)
Ethernet     2x 1 Gb               2x 1 Gb               4x 1 Gb
The SSDs are Corsair CSSD-F40GB-2 with a SATA II 3.0 Gb/s interface. The Corsair F40 utilises MLC NAND flash technology. According to the manufacturer's own specifications, the Corsair F40 can reach read and write speeds of 280 MB/s and 270 MB/s respectively, and perform 50k IOPS. [17]
The HDDs are Scorpio Black WD3200BEKT drives from Western Digital, with a 7200 RPM spindle speed and a SATA II 3.0 Gb/s interface. According to a review by the Tom's Hardware web site, just to give a rough estimate of the performance of the HDD, the WD3200BEKT was benchmarked with an access time of 15.4 ms (including spin delay), a maximum read speed of 84.3 MB/s, a maximum write speed of 83 MB/s and a peak power of 3.26 W [1]. Western Digital [37] announces the WD3200BEKT to have an average latency of 4.2 ms and an average seek time of 12 ms, which agree quite well with the numbers from the Tom's Hardware review. However, the power consumption figures do not agree, as Western Digital announces the WD3200BEKT to have an idle power of 0.85 W and an average power consumption of 2.1 W. Also the manufacturer's numbers for the HDD bandwidth differ considerably, as Western Digital claims the disk can reach up to 108 MB/s for both read and write.
Table 5.2: Manufacturer specifications of the drives. Prices: www.newegg.com (cited 1-Feb-2011)

                      HDD                 SSD
Model                 WD Scorpio Black    Corsair F40
Size                  320 GB              40 GB
Price                 $59.99              $104.99
GB/$                  5.3                 0.38
Random access time    16 ms               0.02 ms
Read speed            108 MB/s            280 MB/s
Write speed           108 MB/s            270 MB/s
IOPS                  -                   50000
Idle power            0.8 W               0.5 W
Active power          1.75 W              2.0 W
5.2 Test Tools
5.2.1 Computing at CERN
The Large Hadron Collider (LHC) is a particle accelerator at CERN. The four main detectors of the LHC can produce 15 petabytes of data a year [6]. The distributed computing and data storage infrastructure built to process this vast amount of data is called the Worldwide LHC Computing Grid (WLCG). As of February 2011, the WLCG had 246,000 processing cores and 142 petabytes of disk space [8].
The CERN computing infrastructure is divided into three levels of tier centres. The Tier-0 centre is located at CERN and is responsible for storing the first copy of the RAW experiment data from the LHC. It is also responsible for producing the first reconstruction pass and for distributing the data to Tier-1 centres. Tier-1 centres together are responsible for storing the second copies of the data stored in Tier-0. Tier-1 centres also further reprocess the data and distribute it to Tier-2 centres. Tier-2 centres are responsible for serving the analysis requirements of the physicists and also for producing and reprocessing the simulated data. The simulated data is also distributed to Tier-1 centres. As of February 2011, besides the Tier-0 centre, there are 11 Tier-1 centres and 164 Tier-2 centres in the world [7]. [18]
5.2.2 CMSSW
The Compact Muon Solenoid (CMS) is one of the four big research projects attached to the LHC. CMS can also refer to the actual particle detector. The Compact Muon Solenoid Software (CMSSW) is a physics software toolkit for analysing the data from the CMS detector.
A central concept within the CMSSW is an event. An Event is a C++ object container. An Event contains many data tiers for all RAW and reconstructed data related to a particular collision. The RAW data is the full event information collected directly from the LHC. The RAW data is unmanipulated and is not used for analysis. The reconstructed or RECO data is reconstructed into physics objects and still contains most of the event information. This RECO data can be used for analysis, but it is not convenient on any substantial data sample. Analysis Object Data (AOD) is a subset of RECO data. AOD is expected to be used in analysis, as AODs are basically beforehand screened events. All objects in the Event may be individually or collectively stored in ROOT files. Event data can also be stored in different files to limit the size of a file and to prevent transferring unnecessary data. This data tier model of an Event is illustrated in Figure 5.1.
Figure 5.1: Data model used in CMSSW. Source: CMS WorkBook [16]
CMSSW also provides tools for Monte Carlo simulation. Data samples generated by Monte Carlo are used to simulate the physics signal under investigation. It can also be used for creating sample data for personal use.
CMSSW consists of many modules, which contain general purpose code for analysing the events. The goal is to minimise the code a physicist has to write himself. A configuration file is needed to tell the CMSSW which modules to load and where the data can be found. The executable is called cmsRun. [16]
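As a minimal usage sketch, a job is launched by passing the configuration file to the executable (the file name here is illustrative):

    # Run a CMSSW job; the modules to load and the location of the input
    # data are defined in the configuration file
    cmsRun analysis_cfg.py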
5.2.3 ROOT framework and ROOT files
ROOT is a C++ framework designed for large scale data analysis and data mining. ROOT was first created at CERN, the project starting in 1995, and is still used at CERN for analysing the particle physics data created by the LHC. One of the fundamental design principles was that although the programs analysing the data may change as time passes, the actual data does not. It was also designed to scale to handle petabytes of data. ROOT relies on a "write once, read many" model due to the nature of the data, which makes it possible to compress the data efficiently.
A ROOT file is a compressed binary file, which can store any instance of a C++ class. Data is stored in a ROOT file with a data description so that it can be read even if the original program used to store the data is lost. Data can be stored in a ROOT file in both a row-wise and a column-wise manner. If the data is stored by columns, reading the same data member from multiple instances speeds up considerably, as unwanted pieces of data can be skipped. For example, in one instance where a 280 MB ROOT file was analysed, only 6.6 MB of data was transferred over the network. ROOT even implements an auto-adaptive pre-fetch mechanism, reading the next entry while the previous entry is still being processed.
ROOT supports an XML representation of data, but does not actually save data in XML form due to the verbose nature of XML. Also a database abstraction layer is provided, making it possible to store data in a ROOT file in a database-like manner. [29], [11]
5.2.4 Measuring Tools
During the tests, performance data was collected from the cluster by using both hardware and software tools. The actual power consumption was measured with a WattsUp? electricity meter, which was attached to the frontend machine via USB. A shell script was used to read the meter information once every second and to write the information into a log file. The electricity meter also provided a cumulative reading for the watt hours (Wh) consumed. The power consumption was measured separately for the file server and for all of the compute nodes. The power consumption of the frontend machine was not measured. A grid monitoring software called Ganglia was also used. Ganglia operates by constantly receiving status reports from the other machines in the cluster. Ganglia has a browser user interface to display cluster performance metrics, such as network traffic, CPU utilisation of individual machines, job queue, etc. The server logs were collected and stored together with the other output data.
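The logging script amounts to a simple polling loop. The following is a minimal sketch; read_meter is a hypothetical placeholder for whatever command returns the current wattage from the USB-attached meter, and the device path is an assumption:

    #!/bin/sh
    # Poll the meter once a second and append a timestamped reading
    # ("<epoch seconds> <watts>") to the log file
    LOG=/var/log/wattage.log
    while true; do
        echo "$(date +%s) $(read_meter /dev/ttyUSB0)" >> "$LOG"
        sleep 1
    done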
5.3 Conducting Tests
5.3.1 About the performance and energy efficiency
We distinguish the performance and the energy efficiency as two different optimisation goals. The performance is measured by the average processing times of the CMS jobs. The energy efficiency is measured by the energy in watt hours needed by an individual CMS job on average. These two can be highly dependent on each other. After all, by definition, energy equals time × power. However, the power does not need to be constant. It is possible that increasing the performance also has some kind of effect on the power usage. Thus, these two need to be studied separately. For example, a job that finishes in two hours at an average power of 100 W consumes the same 200 Wh as a job that finishes in one hour at an average power of 200 W.
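Given the one-second wattage log from Section 5.2.4, the energy consumed over a job's run time is obtained by summing the samples. A minimal sketch, assuming the log format "<epoch seconds> <watts>" sketched above, with one sample per second:

    # Integrate power over time: each sample covers one second, so the
    # energy in watt hours is the sum of the watt readings divided by 3600.
    # Usage: wh.sh <start-epoch> <end-epoch>
    awk -v s="$1" -v e="$2" \
        '$1 >= s && $1 <= e { wh += $2 / 3600 } END { print wh " Wh" }' \
        /var/log/wattage.log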
5.3.2 Running tests
We created Linux shell scripts both to automate and to standardise the testing process. The shell scripts were responsible for submitting the jobs, changing configurations where applicable (for example the scheduling algorithm), clearing caches, starting and stopping the wattage measurement and writing log entries. The shell scripts are attached as appendices. Appendix A shows the main script, Appendix B shows the script used for an individual test run and Appendix C shows the script responsible for initialising and running the actual CMS job. Installing the drives and changing the filesystem needed to be done manually. A shell script was also used for creating the test input data on the target storage for the CMS jobs. To ensure homogeneity of the test data between different test configurations and between individual jobs, the test data was copied from the frontend each time a filesystem was created. The drive caches on both the compute host and the data host were cleared between the test runs with the shell command:
sync; echo 3 > /proc/sys/vm/drop_caches
Every test run was identical. The shell script first cleared the caches and then set the scheduling algorithm. Then the slot size of the SGE was configured. Each compute node had 4 CPU cores, as shown in Table 5.1. Slot sizes of 2, 4, 8 and 12 (relative slot sizes of 0.5, 1, 2 and 3) were used to assign loads of 50-300% to each compute node. After the cluster was configured, the script submitted CMS jobs via SGE to each compute node, equal in number to the current slot size of the node. Just before the jobs were submitted, another script was started to log the wattage, as mentioned in Section 5.2.4. When all the jobs were finished, the logging script was also terminated. Using the log file, the starting and finishing times of a CMS job can be determined and also how much energy (watt hours) was consumed. After the first set of CMS jobs was finished, the script increased the slot size and ran a new set of jobs. When finished with a slot size of 12, the scheduler was changed and the slot size was set back to 2. This was repeated until all combinations of the four different slot sizes and the four different schedulers were used. All in all, one such test run submitted 312 CMS jobs and took about 8-10 hours to finish.
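A condensed single-host sketch of this loop, assuming the four standard Linux 2.6 schedulers of Section 3.3.3 and the default all.q SGE queue; submit_cms_jobs and wait_for_jobs are hypothetical stand-ins for the submission and polling logic of the actual scripts (Appendices A-C), and sda is a placeholder for the measured device:

    #!/bin/sh
    # One full test run: every combination of scheduler and slot size
    for sched in noop anticipatory deadline cfq; do
        for slots in 2 4 8 12; do
            sync; echo 3 > /proc/sys/vm/drop_caches         # clear caches
            echo "$sched" > /sys/block/sda/queue/scheduler  # set I/O scheduler
            qconf -mattr queue slots "$slots" all.q         # set SGE slot size
            submit_cms_jobs "$slots"                        # slot-size jobs per node
            wait_for_jobs                                   # block until batch done
        done
    done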
First, the test was conducted with the NAS. A RAID-5 configuration of 6 HDDs (320 GB) and one of 4 SSDs (40 GB) were set up, creating volumes of 1.6 TB and 120 GB, respectively. The ROOT file used was 656 MB in size and it was copied to the NAS a total of 72 times, thus allocating 47 GB of the total volume. The files were renamed to "data-01-01.root"..."data-06-12.root", where the first number represented the node number and the second number represented the job number. This ensured that no two CMS jobs were using the same data file. Also, the value of the read-ahead was altered to test the effect it had on the performance. Read-ahead values of 4 kB, 8 kB, 16 kB and 32 kB were used. After a test run of 312 CMS jobs finished, a new test run was started after changing the read-ahead value, the file system or the RAID "disks" from HDDs to SSDs. All in all, the test run was conducted a total of 24 times (3 file systems × 4 read-ahead values × 2 different RAID "disks" = 24).
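The data staging and the read-ahead setting could look roughly as follows; the source and mount paths are illustrative, while read_ahead_kb is the standard sysfs knob for the per-device read-ahead in kilobytes:

    # Stage 72 copies of the 656 MB ROOT file, named per node and per job
    for node in 01 02 03 04 05 06; do
        for job in 01 02 03 04 05 06 07 08 09 10 11 12; do
            cp /share/data.root /mnt/raid/data-$node-$job.root
        done
    done

    # Set the read-ahead of the RAID device (sdb is a placeholder) to
    # one of the tested values: 4, 8, 16 or 32 kB
    echo 8 > /sys/block/sdb/queue/read_ahead_kb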
At this point, taking a quick look over the results, a pattern was perceived indicating that increasing the read-ahead value had a negative impact on the performance. The reason most likely was that the ROOT file is a binary file and the AOD within the file is scattered. It was decided not to use the read-ahead value as a configuration parameter any more. Also at this point, one test run was performed using only 4 HDDs for easier comparison against the 4 SSDs. Again, based on the preliminary results, the best performing configuration of the 6 HDDs was picked and one more test run with 4 HDDs was performed with that configuration. Also, the energy consumption of the idle compute nodes and the NAS appliance was measured, both with and without the RAID pack. The idle tests logged an idle machine for one hour from startup. These results are presented in Appendix D.
Next, the SSDs were installed on the compute nodes and configured as one big GlusterFS volume. With three nodes and without any striping or mirroring, the 40 GB SSDs created a volume of 120 GB. The test run was also conducted with this configuration before dismounting the Gluster configuration and running the tests directly from the local drives. Because the test data totalled 47 GB, all of it could not be fitted onto the 40 GB drives, so only half of it was used, copying 24 GB of test data to each drive. This way, plenty of free space was left on the devices, as had been the case also on the earlier test runs.
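Creating such a distributed volume could be done along the following lines (a sketch for the GlusterFS command-line interface; host names, volume name and brick paths are illustrative):

    # One brick per compute node, no striping or mirroring, so the
    # capacities simply add up (3 x 40 GB = 120 GB)
    gluster volume create testvol \
        compute-0-0:/mnt/ssd/brick \
        compute-0-1:/mnt/ssd/brick \
        compute-0-2:/mnt/ssd/brick
    gluster volume start testvol

    # Mount the volume on each node
    mount -t glusterfs compute-0-0:/testvol /mnt/gluster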
Finally, the SSDs were changed to HDDs inside the compute nodes. As with the SSDs, a GlusterFS volume was created first. With 320 GB in each node, a volume of 960 GB could be hosted by the nodes. After running the tests on Gluster, the same tests were conducted again with the local drives. This time, though, the whole 47 GB of test data was copied to each HDD.
6. RESULTS
The results chapter discusses the findings of the study individually. The performance and the energy efficiency are distinguished as two different optimisation goals, as discussed in Section 5.3.1. However, this study also tries to evaluate the results as a whole. The performance gain is measured by comparing the average processing times