4. PREVIOUS WORK
4.2 Scheduling
Pratt and Heger [28] conducted a study on performance evaluation of Linux 2.6 I/O schedulers. In their tests, they simulated I/O patterns on different hardware setups, including both single-disk and RAID configurations. They used Ext3 and XFS filesystems and various workload scenarios. They concluded that the selection of an I/O scheduler has to be based on the workload pattern, the hardware setup and the filesystem used, or as they put it, "there is no silver bullet". Carroll [15] conducted a similar study on I/O schedulers in a RAID environment. He also found the selection of the I/O scheduler to be workload dependent and that I/O scheduling improves performance only on small to medium size RAID arrays (six disks or less).
Kim et al. [21] conducted a study to analyse I/O schedulers on SSDs. They argued that scheduling itself does not improve the read performance of an SSD, but preferring read requests over write requests does. They presented and implemented a scheduling scheme that exploits the characteristics of the SSD. The scheme is quite simple: it just bundles write requests together to match the logical block size and schedules read requests independently in a FIFO manner. Their benchmark tests showed up to 17% improvements over the existing Linux schedulers (presented in Section 3.3.3). Test results also showed that the schedulers did not make a notable difference under read-oriented workloads on SSDs. As a side note, the anticipatory scheduler seemed to outperform the other existing schedulers. This is quite strange, as anticipatory scheduling is designed to exploit the spatial locality of data on HDDs and thus the device is kept idle for short periods of time. This should not improve the performance of an SSD, but on the contrary, degrade it. This phenomenon can be explained by noting that an individual process can benefit from getting exclusive service for bursty I/O requests, thus improving the overall performance. However, this is more a matter of process optimisation than I/O optimisation.
5. TESTING ENERGY EFFICIENCY
This chapter discusses the test environment and the actual tests conducted. The first section describes the test cluster in detail. The second section presents the test tools used: the physics software and the software and hardware instruments used to gather data are discussed in this section. The third and last section discusses the practical side of running the tests and describes how the tests were conducted.
5.1 Test Cluster
5.1.1 Operating System: Rocks 5.3
The choice for the operating system of the test cluster is Rocks 5.3, an open-source Linux cluster distribution. Rocks is developed by the Rocks Cluster Group at the San Diego Supercomputer Center at the University of California, San Diego and its contributors. Rocks is a fully stand-alone system and cannot be installed on top of an existing system. Rocks is basically Red Hat Linux bundled together with a whole set of cluster software. The driving motivation behind Rocks is to make clusters easy to deploy, manage, upgrade and scale. This does not mean that Rocks would be inadequate or inefficient for high performance cluster computing. On the contrary, Rocks is used in many universities and institutions around the world.
Installing and maintaining Rocks is easy. First you have to install the frontend machine. This does not differ much from a normal Linux installation. Rocks contains many optional packages, called rolls, which you can pick to go with your basic installation. These rolls contain additional software you may want to install. For example, the Sun Grid Engine (SGE) roll was included and used as the choice of the batch-queuing system for the test cluster. After installing the frontend, a cluster also needs compute nodes. Installation of a compute node is easy. All that is needed is to configure the compute node to boot from the network. A compute node registers itself to the frontend database, downloads a system image from the frontend (or from other compute nodes) and performs a quick installation. In fact, Rocks even deals with errors just by re-installing the compute node rather than trying to fix it. If the default configuration and system image are not sufficient for your needs, or you want to modify your compute nodes later, all you need to do is to configure some text files on the frontend, maybe add some additional packages to be installed on the compute nodes, assemble a new system image and re-install the nodes.
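On the command line this boils down to a couple of commands. The following is a minimal sketch for a default Rocks 5 installation; the node name and paths are illustrative and may differ per setup.

    # Rebuild the distribution after modifying configuration files or
    # adding packages (run on the frontend; default Rocks 5 install tree)
    cd /export/rocks/install
    rocks create distro

    # Flag a compute node for re-installation on next boot and reboot it
    rocks set host boot compute-0-0 action=install
    rocks run host compute-0-0 "reboot"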
Rocks also comes with many software tools that make the administration and management of a cluster easy. Most notable is Ganglia, a web-based cluster monitoring software. [33]
With SGE it is possible to configure the slot size for each compute node. A slot size defines how many simultaneous jobs can be submitted to a single compute node. The name actually derives from the number of CPU slots a machine has, and it suggests that the number of CPU cores should be equal to the number of simultaneous compute jobs. However, this study wanted to try what kind of effect this ratio has on the performance. This study uses the term relative slot size to refer to the ratio of the slot size and the number of actual CPU cores. For example, in the test cluster, with quad-core machines, a slot size of eight would equal a relative slot size of two.
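Setting the slot size can be scripted with the SGE configuration tool; a minimal sketch, assuming the default all.q queue that the Rocks SGE roll creates:

    # Give every host in the default queue eight slots, i.e. a relative
    # slot size of two on the quad-core compute nodes
    qconf -mattr queue slots 8 all.q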
5.1.2 Hardware
The test environment consists of a computing cluster and a dedicated file server. The cluster is composed of four machines: a frontend and three compute nodes. Detailed specifications are presented in Table 5.1. Detailed specifications of the drives used are presented in Table 5.2.
Table 5.1: Test Cluster

             Frontend              Nodes                 File Server
Model        Dell server           Dell R210             Dell R710
Processor    Intel Xeon 2.8 GHz    Intel Xeon 2.4 GHz    Intel Xeon 2 GHz
CPU cores    2                     4                     4
RAM          2 GB                  8 GB                  2 GB
Disk (OS)    160 GB SATA (7.2k)    250 GB SATA (7.2k)    146 GB SAS (10k)
Ethernet     2x 1 Gb               2x 1 Gb               4x 1 Gb
The SSDs are Corsair CSSD-F40GB-2 with a SATA II 3.0 Gb/s interface. The Corsair F40 utilises MLC NAND flash technology. According to the manufacturer's own specifications, the Corsair F40 can reach read and write speeds of 280 MB/s and 270 MB/s respectively, and perform 50k IOPS. [17]
The HDDs are Scorpio Black WD3200BEKT drives from Western Digital, with a 7200 RPM spindle speed and a SATA II 3.0 Gb/s interface. According to a review by the Tom's Hardware web site, just to give a rough estimate of the performance of the HDD, the WD3200BEKT was benchmarked with an access time of 15.4 ms (including spin delay), a maximum read speed of 84.3 MB/s, a maximum write speed of 83 MB/s and a peak power of 3.26 W [1]. Western Digital [37] announces the WD3200BEKT to have an average latency of 4.2 ms and an average seek time of 12 ms, which agree quite well with the numbers from the Tom's Hardware review. However, the power consumption figures do not agree, as Western Digital announces the WD3200BEKT to have an idle power of 0.85 W and an average power consumption of 2.1 W. Also the manufacturer's numbers for the HDD bandwidth differ considerably, as Western Digital claims the disk can reach up to 108 MB/s for both read and write.
Table 5.2: Manufacturer specifications of the drives. Prices: www.newegg.com (cited 1-Feb-2011)

                      HDD                 SSD
Model                 WD Scorpio Black    Corsair F40
Size                  320 GB              40 GB
Price                 $59.99              $104.99
GB/$                  5.3                 0.38
Random access time    16 ms               0.02 ms
Read speed            108 MB/s            280 MB/s
Write speed           108 MB/s            270 MB/s
IOPS                  -                   50000
Idle power            0.8 W               0.5 W
Active power          1.75 W              2.0 W
5.2 Test Tools
5.2.1 Computing at CERN
The Large Hadron Collider (LHC) is a particle accelerator at CERN. The four main detectors of the LHC can produce 15 petabytes of data a year [6]. The distributed computing and data storage infrastructure built to process this vast amount of data is called the Worldwide LHC Computing Grid (WLCG). As of February 2011, the WLCG had 246,000 processing cores and 142 petabytes of disk space [8].
The CERN computing infrastructure is divided into three levels of tier centres. The Tier-0 centre is located at CERN and is responsible for storing the first copy of the RAW experiment data from the LHC. It is also responsible for producing the first reconstruction pass and for distributing the data to Tier-1 centres. Tier-1 centres together are responsible for storing the second copies of the data stored in Tier-0. Tier-1 centres also further reprocess the data and distribute it to Tier-2 centres. Tier-2 centres are responsible for serving the analysis requirements of the physicists and also for producing and reprocessing the simulated data. The simulated data is also distributed to Tier-1 centres. As of February 2011, besides the Tier-0 centre, there are 11 Tier-1 centres and 164 Tier-2 centres in the world [7]. [18]
5.2.2 CMSSW
The Compact Muon Solenoid (CMS) is one of the four big research projects attached to the LHC. CMS can also refer to the actual particle detector. The Compact Muon Solenoid Software (CMSSW) is a physics software toolkit for analysing the data from the CMS detector.
A central concept within the CMSSW is an event. An Event is a C++ object container. An Event contains many data tiers for all RAW and reconstructed data related to a particular collision. The RAW data is the full event information collected directly from the LHC. The RAW data is unmanipulated and is not used for analysis. The reconstructed or RECO data is reconstructed into physics objects and still contains most of the event information. This RECO data can be used for analysis, but it is not convenient on any substantial data sample. Analysis Object Data (AOD) is a subset of RECO data. AOD is expected to be used in analysis, as AODs are basically beforehand screened events. All objects in the Event may be individually or collectively stored in ROOT files. Event data can also be stored in different files to limit the size of a file and to prevent transferring unnecessary data. This data tier model of an Event is illustrated in Figure 5.1.
Figure 5.1: Data model used in CMSSW. Source: CMS WorkBook [16]
CMSSW also provides tools for Monte Carlo simulation. Data samples generated by Monte Carlo are used to simulate the physics signal under investigation. It can also be used for creating sample data for personal use.
CMSSW consists of many modules, which contain general purpose code for analysing the events. The goal is to minimise the code a physicist has to write himself. A configuration file is needed to tell the CMSSW which modules to load and where the data can be found. The executable is called cmsRun. [16]
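As a minimal usage sketch, a job is launched by passing the configuration file to the executable (the file name here is illustrative):

    # Run a CMSSW job; the modules to load and the location of the input
    # data are defined in the configuration file
    cmsRun analysis_cfg.py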
5.2.3 ROOT framework and ROOT files
ROOT is a C++ framework designed for large scale data analysis and data mining. ROOT was first created at CERN, the project starting in 1995, and is still used at CERN for analysing the particle physics data created by the LHC. One of the fundamental design principles was that although the programs analysing the data may change as time passes, the actual data does not. It was also designed to scale to handle petabytes of data. ROOT relies on a "write once, read many" model due to the nature of the data, which makes it possible to compress the data efficiently.
A ROOT file is a compressed binary file, which can store any instance of a C++ class. Data is stored in a ROOT file with a data description so that it can be read even if the original program used to store the data is lost. Data can be stored in a ROOT file in both a row-wise and a column-wise manner. If the data is stored by columns, reading the same data member from multiple instances speeds up considerably, as unwanted pieces of data can be skipped. For example, in one instance where a 280 MB ROOT file was analysed, only 6.6 MB of data was transferred over the network. ROOT even implements an auto-adaptive pre-fetch mechanism, reading the next entry while the previous entry is still being processed.
ROOT supports an XML representation of data, but does not actually save data in XML form due to the verbose nature of XML. Also a database abstraction layer is provided, making it possible to store data in a ROOT file in a database-like manner. [29], [11]
5.2.4 Measuring Tools
During the tests, performance data was collected from the cluster by using both hardware and software tools. The actual power consumption was measured with a WattsUp? electricity meter, which was attached to the frontend machine via USB. A shell script was used to read the meter information once every second and to write the information into a log file. The electricity meter also provided a cumulative reading for the watt hours (Wh) consumed. The power consumption was measured separately for the file server and for all of the compute nodes. The power consumption of the frontend machine was not measured. A grid monitoring software called Ganglia was also used. Ganglia operates by constantly receiving status reports from the other machines in the cluster. Ganglia has a browser user interface to display cluster performance metrics, such as network traffic, CPU utilisation of individual machines, job queue, etc. The server logs were collected and stored together with the other output data.
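The logging script amounts to a simple polling loop. The following is a minimal sketch; read_meter is a hypothetical placeholder for whatever command returns the current wattage from the USB-attached meter, and the device path is an assumption:

    #!/bin/sh
    # Poll the meter once a second and append a timestamped reading
    # ("<epoch seconds> <watts>") to the log file
    LOG=/var/log/wattage.log
    while true; do
        echo "$(date +%s) $(read_meter /dev/ttyUSB0)" >> "$LOG"
        sleep 1
    done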
5.3 Conducting Tests
5.3.1 About the performance and energy efficiency
We distinguish the performance and the energy efficiency as two different optimisation goals. The performance is measured by the average processing times of the CMS jobs. The energy efficiency is measured by the energy in watt hours needed by an individual CMS job on average. These two can be highly dependent on each other. After all, by definition, energy equals time × power. However, the power does not need to be constant. It is possible that increasing the performance also has some kind of effect on the power usage. Thus, these two need to be studied separately. For example, a job that finishes in two hours at an average power of 100 W consumes the same 200 Wh as a job that finishes in one hour at an average power of 200 W.
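Given the one-second wattage log from Section 5.2.4, the energy consumed over a job's run time is obtained by summing the samples. A minimal sketch, assuming the log format "<epoch seconds> <watts>" sketched above, with one sample per second:

    # Integrate power over time: each sample covers one second, so the
    # energy in watt hours is the sum of the watt readings divided by 3600.
    # Usage: wh.sh <start-epoch> <end-epoch>
    awk -v s="$1" -v e="$2" \
        '$1 >= s && $1 <= e { wh += $2 / 3600 } END { print wh " Wh" }' \
        /var/log/wattage.log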
5.3.2 Running tests
We created Linux shell scripts both to automate and to standardise the testing process. The shell scripts were responsible for submitting the jobs, changing configurations where applicable (for example the scheduling algorithm), clearing caches, starting and stopping the wattage measurement and writing log entries. The shell scripts are attached as appendices. Appendix A shows the main script, Appendix B shows the script used for an individual test run and Appendix C shows the script responsible for initialising and running the actual CMS job. Installing the drives and changing the filesystem needed to be done manually. A shell script was also used for creating the test input data on the target storage for the CMS jobs. To ensure homogeneity of the test data between different test configurations and between individual jobs, the test data was copied from the frontend each time a filesystem was created. The drive caches on both the compute host and the data host were cleared between the test runs with the shell command:
sync; echo 3 > /proc/sys/vm/drop_caches
Every test run was identical. The shell script first cleared the caches and then set the scheduling algorithm. Then the slot size of the SGE was configured. Each compute node had 4 CPU cores, as shown in Table 5.1. Slot sizes of 2, 4, 8 and 12 (relative slot sizes of 0.5, 1, 2 and 3) were used to assign loads of 50-300% to each compute node. After the cluster was configured, the script submitted CMS jobs via SGE to each compute node, equal in number to the current slot size of the node. Just before the jobs were submitted, another script was started to log the wattage, as mentioned in Section 5.2.4. When all the jobs were finished, the logging script was also terminated. Using the log file, the starting and finishing times of a CMS job can be determined and also how much energy (watt hours) was consumed. After the first set of CMS jobs was finished, the script increased the slot size and ran a new set of jobs. When finished with a slot size of 12, the scheduler was changed and the slot size was set back to 2. This was repeated until all combinations of the four different slot sizes and the four different schedulers were used. All in all, one such test run submitted 312 CMS jobs and took about 8-10 hours to finish.
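A condensed single-host sketch of this loop, assuming the four standard Linux 2.6 schedulers of Section 3.3.3 and the default all.q SGE queue; submit_cms_jobs and wait_for_jobs are hypothetical stand-ins for the submission and polling logic of the actual scripts (Appendices A-C), and sda is a placeholder for the measured device:

    #!/bin/sh
    # One full test run: every combination of scheduler and slot size
    for sched in noop anticipatory deadline cfq; do
        for slots in 2 4 8 12; do
            sync; echo 3 > /proc/sys/vm/drop_caches         # clear caches
            echo "$sched" > /sys/block/sda/queue/scheduler  # set I/O scheduler
            qconf -mattr queue slots "$slots" all.q         # set SGE slot size
            submit_cms_jobs "$slots"                        # slot-size jobs per node
            wait_for_jobs                                   # block until batch done
        done
    done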
First, the test was conducted with the NAS. A RAID-5 configuration of 6 HDDs (320 GB) and one of 4 SSDs (40 GB) were set up, creating volumes of 1.6 TB and 120 GB, respectively. The ROOT file used was 656 MB in size and it was copied to the NAS a total of 72 times, thus allocating 47 GB of the total volume. The files were renamed to "data-01-01.root"..."data-06-12.root", where the first number represented the node number and the second number represented the job number. This ensured that no two CMS jobs were using the same data file. Also, the value of the read-ahead was altered to test the effect it had on the performance. Read-ahead values of 4 kB, 8 kB, 16 kB and 32 kB were used. After a test run of 312 CMS jobs finished, a new test run was started after changing the read-ahead value, the file system or the RAID "disks" from HDDs to SSDs. All in all, the test run was conducted a total of 24 times (3 file systems × 4 read-ahead values × 2 different RAID "disks" = 24).
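The data staging and the read-ahead setting could look roughly as follows; the source and mount paths are illustrative, while read_ahead_kb is the standard sysfs knob for the per-device read-ahead in kilobytes:

    # Stage 72 copies of the 656 MB ROOT file, named per node and per job
    for node in 01 02 03 04 05 06; do
        for job in 01 02 03 04 05 06 07 08 09 10 11 12; do
            cp /share/data.root /mnt/raid/data-$node-$job.root
        done
    done

    # Set the read-ahead of the RAID device (sdb is a placeholder) to
    # one of the tested values: 4, 8, 16 or 32 kB
    echo 8 > /sys/block/sdb/queue/read_ahead_kb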
At this point, taking a quick look over the results, a pattern was perceived indicating that increasing the read-ahead value had a negative impact on the performance. The reason most likely was that the ROOT file is a binary file and the AOD within the file is scattered. It was decided not to use the read-ahead value as a configuration parameter any more. Also at this point, one test run was performed using only 4 HDDs for easier comparison against the 4 SSDs. Again, based on the preliminary results, the best performing configuration of the 6 HDDs was picked and one more test run with 4 HDDs was performed with that configuration. Also, the energy consumption of the idle compute nodes and the NAS appliance was measured, both with and without the RAID pack. The idle tests logged an idle machine for one hour from startup. These results are presented in Appendix D.
Next, the SSDs were installed on the compute nodes and configured as one big GlusterFS volume. With three nodes and without any striping or mirroring, the 40 GB SSDs created a volume of 120 GB. The test run was also conducted with this configuration before dismounting the Gluster configuration and running the tests directly from the local drives. Because the test data totalled 47 GB, all of it could not be fitted onto the 40 GB drives, so only half of it was used, copying 24 GB of test data to each drive. This way, plenty of free space was left on the devices, as had been the case also on the earlier test runs.
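Creating such a distributed volume could be done along the following lines (a sketch for the GlusterFS command-line interface; host names, volume name and brick paths are illustrative):

    # One brick per compute node, no striping or mirroring, so the
    # capacities simply add up (3 x 40 GB = 120 GB)
    gluster volume create testvol \
        compute-0-0:/mnt/ssd/brick \
        compute-0-1:/mnt/ssd/brick \
        compute-0-2:/mnt/ssd/brick
    gluster volume start testvol

    # Mount the volume on each node
    mount -t glusterfs compute-0-0:/testvol /mnt/gluster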
Finally, the SSDs were changed to HDDs inside the compute nodes. As with the SSDs, a GlusterFS volume was created first. With 320 GB in each node, a volume of 960 GB could be hosted by the nodes. After running the tests on Gluster, the same tests were conducted again with the local drives. This time, though, the whole 47 GB of test data was copied to each HDD.
6. RESULTS
The results chapter discusses the findings of the study individually. The performance and the energy efficiency are distinguished as two different optimisation goals, as discussed in Section 5.3.1. However, this study also tries to evaluate the results as a whole. The performance gain is measured by comparing the average processing times