Faculty of Technology
Department of Mathematics and Physics
Laboratory of Applied Mathematics

Stochastic approximation methods in stochastic optimization

The topic of this Master's thesis was approved by the council of the Department of Mathematics and Physics on the 25th of October, 2013.

The examiners of the thesis were: Prof. Heikki Haario and PhD Tuomo Kauranne.
The thesis was supervised by Professor Heikki Haario.

In Lappeenranta, 8.11.2013

Ilya Bobylev
Ruskonlahdenkatu 13-15 E6
53850 Lappeenranta
Phone: +3580417061366
Ilya.Bobylev[at]lut.fi
Lappeenranta University of Technology
Faculty of Technology
Department of Mathematics and Physics

Ilya Bobylev

Stochastic approximation methods in stochastic optimization

Thesis for the Degree of Master of Science in Technology

2013
72 pages, 33 figures, 6 tables
Supervisors: Prof. Heikki Haario, PhD Tuomo Kauranne.
Examiners: Prof. Heikki Haario, PhD Tuomo Kauranne.
Key words: stochastic optimization, deterministic equivalents, stochastic quasi-gradient algorithm (SQG), mean-value, value-at-risk (VaR), conditional value-at-risk (CVaR).
Stochastic approximation methods for stochastic optimization are considered. The main methods of stochastic approximation are reviewed: the stochastic quasi-gradient (SQG) algorithm, the Kiefer-Wolfowitz algorithm together with adaptive step-size rules for both, and the simultaneous perturbation stochastic approximation (SPSA) algorithm. A model and a solution of the electricity retailer's profit optimization problem are suggested, and an application of the SQG-algorithm to optimization problems with objective functions given in the form of an ordinary differential equation is considered.
First of all, I would like to express my gratitude to the Department of Mathematics for the scholarship during my studies; otherwise my stay in Finland would not have been possible. Secondly, I have to thank the Finnish atmosphere, peaceful and quiet, which gives wide possibilities for productive study and work. Finally, my sincere gratitude goes to professors Heikki Haario, Matti Heiliö and Tuomo Kauranne for their supervision and advice.

Lappeenranta, 2013.
Ilya Bobylev
• CDF - cumulative distribution function.
• PDF - probability density function.
• SQG - stochastic quasi-gradient algorithm.
• VaR - value-at-risk.
• CVaR - conditional value-at-risk.
• w.r.t. - with respect to.
• a.s. - convergence almost sure.
• σ{X_1, X_2, ..., X_n} - sigma-algebra generated by the distribution of the random variables X_1, X_2, ..., X_n.
• MC - Monte Carlo simulation.
• MCMC - Markov chain Monte Carlo simulation.
• LP - linear programming.
• SPSA - simultaneous perturbation stochastic approximation.
• SA - stochastic approximation.
• KW - Kiefer-Wolfowitz algorithm.
• ODE - ordinary differential equation.

1 Introduction
1.1 Objective of the thesis
1.2 Structure of the thesis
1.3 Scientific novelty
2 Optimization criteria used in stochastic programming
2.0.1 Mean-value criterion
2.0.2 Minimax criterion
2.0.3 Value-at-Risk (VaR) criterion
2.0.4 Conditional Value-at-Risk (CVaR) criterion
2.0.5 Demonstration example
3 Stochastic approximation
3.1 Conditional mathematical expectation
3.2 Discrete time martingales
3.3 Convergence of the supermartingales
3.4 Convergence of the random variables
3.5 Stochastic quasi-gradient method
3.6 Kiefer-Wolfowitz algorithm
3.7 Recipes for step sizes
3.7.1 Kesten rule
3.7.2 Adaptive rule of Uryasev
3.8 Simultaneous perturbation stochastic approximation
4 Application problem: Electricity Retailer Profit Optimization
4.2 Solution with mean-value criterion
4.2.1 Deterministic equivalent
4.2.2 SQG-solution
4.3 Solution with CVaR criterion
4.3.1 Deterministic equivalent
4.3.2 SQG-solution
4.4 Numerical examples
4.5 Analysis of the results
5 Application problem: SQG-algorithm for the objective functions in ODE-form
5.1 Problem formulation
5.2 Solution technique
5.2.1 Equation in variations
5.2.2 Application of the equation in variations in the considered case
5.3 Numerical examples
5.4 Analysis of the results
6 Conclusion and main results
References
7 Appendices
7.1 Implementation of the stochastic approximation algorithms
7.1.1 Implementation of the SQG-algorithm with explicitly defined gradient
7.1.2 Implementation of the SQG-algorithm with explicitly defined gradient and adaptive Kesten rule
7.1.3 Implementation of the SQG-algorithm with explicitly defined gradient and adaptive Uryasev rule
7.1.4 Implementation of the Kiefer-Wolfowitz algorithm
7.1.5 Implementation of the Kiefer-Wolfowitz algorithm with adaptive Kesten rule
7.1.6 Implementation of the Kiefer-Wolfowitz algorithm with adaptive Uryasev rule
7.1.7 Implementation of the SPSA algorithm
7.1.8 The function which calculates the finite difference approximation of the stochastic quasi-gradient
7.1.9 The function displaying the convergence rate
7.1.10 Electricity retailer profit optimization - solution with mathematical expectation
List of tables

1 Numerical values
2 Discrete distribution of the random demand
3 Discrete distribution of the random price
4 Measurements of the reaction
5 Measurements of the reaction
6 Calculation results
List of figures

1 Illustration of the solution
2 Convergence of the random variables in mean-square
3 Convergence of the random variables almost sure
4 Illustration of the SQG-algorithm
5 Illustration of the classical gradient descent method
6 Illustration of the stochastic gradient method
7 Optimization function
8 Small step size. Stochastic Gradient Descent (left). Stochastic Gradient Descent with adaptive Kesten rule (right)
9 Kiefer-Wolfowitz algorithm (left). Kiefer-Wolfowitz algorithm with adaptive Kesten rule (right)
10 Stochastic Gradient Descent (left). Stochastic Gradient Descent with adaptive Kesten rule (right)
11 Kiefer-Wolfowitz algorithm (left). Kiefer-Wolfowitz algorithm with adaptive Kesten rule (right)
12 Stochastic Gradient Descent (left). Stochastic Gradient Descent with adaptive Uryasev rule (right)
13 Kiefer-Wolfowitz algorithm (left). Kiefer-Wolfowitz algorithm with adaptive Uryasev rule (right)
14 Comparison of the SQG and SPSA convergence traces
15 Illustration of the SPSA convergence
16 Distribution of the imbalance
17 Solution with mean-value criterion for the discrete distribution
18 Solution with mean-value criterion for the continuous distribution
19 SQG-algorithm for the continuous distribution
21 Solution with CVaR criterion for the continuous distribution
22 SQG-solution
23 The scheme of the solution technique
24 Distribution of the parameters estimated by MCMC for the first example
25 Objective function and the trace by the algorithm for the first test-case
26 Convergence of the method for the first test-case
27 Distribution of the parameters estimated by MCMC for the second test-case
28 SQG with equation in variations enhancement for the second example
29 Convergence of the SQG-algorithm for the second example
30 Kiefer-Wolfowitz algorithm for the second test-case
31 Convergence of the KW-algorithm for the second example
32 SPSA-algorithm for the second example
33 Convergence of the SPSA-algorithm for the second test-case
1 Introduction

Optimization problems under uncertainty have been intensively studied by many researchers during the last 50 years. In fact, mathematical models based on stochastic theory are closer to real life than deterministic ones, because the random parameters considered in the objective function or in the constraints reflect the nature of uncertainty during decision making.

Stochastic models are widely used in insurance, financial and portfolio optimization, where the behavior of the market cannot be precisely forecasted, or in aerospace technologies, where it is required to obtain robust results.

Generally, the solution of a stochastic optimization problem proceeds as follows. Firstly, the researcher attempts to derive a deterministic equivalent and use a deterministic method of linear or nonlinear optimization, depending on the case. If the cost function is quite complicated, but the gradient can be explicitly found for a given realization of the random parameters or successfully approximated with finite differences, then stochastic approximation (SA) will be a good choice as a solution technique. Finally, if this method fails (for example, the objective function is multimodal, or contains both discrete and continuous arguments), then Monte Carlo simulation, genetic algorithms, simulated annealing or other methods of global optimization are used.

Robustness, simplicity and computation speed make stochastic approximation a very popular algorithm for a wide class of optimization problems; therefore the development of SA is an important and demanded research area.
1.1 Objective of the thesis

The main purpose of the thesis is to study and develop stochastic approximation algorithms which are used in stochastic optimization.

1.2 Structure of the thesis

The material of the thesis is organized as follows: the first three chapters are devoted to definitions and the theoretical background of stochastic approximation, and are mainly needed to prepare the reader for the application problems that are studied in chapters 4-5.
1.3 Scientific novelty

The thesis contains several new theoretical results: deterministic equivalents of the optimization problems with mean-value and CVaR criteria for the electricity retailer profit optimization problem (see the theorems in the corresponding section), and an interesting computational idea: the equation in variations used in the SQG-algorithm for objective functions given in ODE-form.

The first result gives an opportunity to substitute the solution of the stochastic problem with a deterministic one and instantly get the result using the well-known simplex method for linear programming.

The second result significantly increases the computational speed of the stochastic approximation procedure for the considered problem.
2 Optimization criteria used in stochastic programming

Further in this section, Φ(u, X) denotes a cost function, where u ∈ U ⊂ R^n is an optimization strategy and X is a random vector with realizations belonging to the set E ⊆ R^m.
2.0.1 Mean-value criterion

Historically the first criterion was the mean-value criterion, which can be defined as follows:

    min_{u∈U} E[Φ(u, X)].    (1)

Intuitively, the criterion means choosing the strategy which minimizes the average losses represented by the objective function.

The main properties of the mean-value criterion that allow one to derive a solution and build effective algorithms are:

• Linearity:

    E[aX + bY + c] = a E[X] + b E[Y] + c,    (2)

where X, Y are random variables and a, b, c ∈ R are parameters;

• A simple representation for a discrete distribution with a finite number of realizations:

    E[X] = Σ_{i=1}^{n} x_i p_i,    (3)

where x_i, i = 1, ..., n are the realizations of the random vector X and p_i, i = 1, ..., n are the corresponding probabilities, so that P(X = x_i) = p_i; a small numerical sketch of this formula is given below.
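To make formula (3) concrete, the following minimal Python sketch (the cost function and the distribution are invented for illustration and are not from the thesis) evaluates the mean-value criterion for a discrete distribution and minimizes it by a crude grid search:

    import numpy as np

    # Hypothetical discrete distribution: realizations x_i with probabilities p_i.
    x = np.array([-1.0, 0.0, 2.0])
    p = np.array([0.25, 0.50, 0.25])

    def phi(u, x):
        """Illustrative cost function Φ(u, X) = (u - X)**2."""
        return (u - x) ** 2

    def mean_value(u):
        # Formula (3): E[Φ(u, X)] = Σ_i Φ(u, x_i) p_i.
        return np.sum(phi(u, x) * p)

    # Crude grid search for min_u E[Φ(u, X)]; here the optimum is u* = E[X] = 0.25.
    grid = np.linspace(-2.0, 2.0, 401)
    u_star = grid[np.argmin([mean_value(u) for u in grid])]
    print(u_star, mean_value(u_star))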
Unfortunately, together with such attractive properties, the mean-value criterion has a number of cases where it fails. To get a basic idea, one can imagine a situation where decisions about the condition of the patients in a clinic are made based on the average temperature among them. The nonsense is obvious: it is possible to have a criterion value of 36.6°C while all the patients are about to die, the first half with ∼34°C and the rest with ∼40°C.
2.0.2 Minimax criterion

The minimax criterion comes from game theory, and in statistical analysis it means choosing the strategy which attains the minimum of the losses in the worst possible case:

    min_{u∈U} max_{X∈E} Φ(u, X).    (4)

The problem statement can be treated as a game against an aggressive nature, all the time forming the worst possible case for the decision maker. The positive side of a strategy chosen according to minimax is an assurance that the losses will always stay below the forecasted level. The drawbacks are also obvious: such a strategy can be too cautious and wasteful, and is often unachievable. For example, in the book [4] the chapter devoted to the optimization of an aircraft runway area contains a joke that in order to solve the problem with the minimax criterion it is required to concrete the whole globe together with the oceans and seas.
2.0.3 Value-at-Risk (VaR) criterion

To show the idea behind the VaR criterion let us define two concepts: the probability function P_φ(u) and the quantile function φ_α(u) [4]:

    P_φ(u) = P{Φ(u, X) ≤ φ},    (5)

    φ_α(u) = min{φ : P_φ(u) ≥ α}.    (6)

The probability function represents the probability that the cost function Φ(u, X) does not exceed the level φ for a chosen fixed strategy u, while the quantile function indicates the corresponding minimal level. Finally, to get the VaR-strategy, the following optimization task has to be solved:

    min_{u∈U} φ_α(u).    (7)

Unfortunately, VaR lacks some desirable theoretical properties. Firstly, it does not preserve convexity, i.e. convexity of the objective function w.r.t. the strategy does not always result in convexity of the quantile function. Secondly, there is quite a small number of cases when it is possible to derive a gradient of the VaR w.r.t. the strategy explicitly. This makes the solution process of VaR problems extremely complicated, and therefore Monte Carlo simulation is often used. One approach to dealing with the quantile criterion is to build an upper-value estimate based on a confidence set.
The quantile function can be represented via the following minimax problem:

    φ_α(u) = min_{E∈E_α} max_{x∈E} Φ(u, x),

where E_α is a family of confidence sets, E ∈ E_α with P(E) ≥ α. Obviously, if instead of the optimal confidence set E* an arbitrary one Ê is used, then an upper estimate of the optimal solution will be found. Based on that observation, many application problems were solved [23, 24, 25].
2.0.4 Conditional Value-at-Risk (CVaR) criterion

The criterion was created as an enhancement of VaR and represents the average rate of the losses exceeding the level φ_α(u):

    ψ_α(u) = E[Φ(u, X) | Φ(u, X) ≥ φ_α(u)] = (1/(1−α)) ∫_{Φ(u,X(w)) ≥ φ_α(u)} Φ(u, X(w)) dP(w).    (8)

According to [7], a minimization of CVaR over the strategy u ∈ U equals the solution of the following optimization problem:

    ψ* = min_{(u,φ)∈U×R} F_α(u, φ),    (9)

where

    F_α(u, φ) = φ + (1/(1−α)) · E[max{Φ(u, X) − φ, 0}].    (10)

This result opens wide possibilities for CVaR-optimization via the well-known techniques of mean-value minimization.
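As a sketch of how (9)-(10) can be exploited in practice, the following Python fragment (a sample-average approximation with the cost function of the demonstration example below; this is our illustration, not the thesis implementation) minimizes F_α(u, φ) jointly over u and φ by plain grid search:

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = 0.95
    X = rng.normal(size=20_000)           # sampled realizations of X ~ N(0, 1)

    def phi_cost(u, x):
        """Illustrative cost Φ(u, X) = u**2 + X (the demonstration example below)."""
        return u ** 2 + x

    def F_alpha(u, phi):
        # Sample-average version of formula (10).
        return phi + np.mean(np.maximum(phi_cost(u, X) - phi, 0.0)) / (1.0 - alpha)

    # Joint grid search over (u, φ); with enough samples the minimizer approaches
    # u* = 0 and φ* = x_alpha (the α-quantile of N(0,1), ≈ 1.645 for α = 0.95).
    us = np.linspace(-1.0, 1.0, 41)
    phis = np.linspace(0.0, 3.0, 301)
    best = min((F_alpha(u, f), u, f) for u in us for f in phis)
    print("F* ≈ %.3f at u ≈ %.2f, φ ≈ %.2f" % best)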
2.0.5 Demonstration example

Let us consider a simple example to show the difference between the foregoing criteria in practice. Assume that the cost function is defined as

    Φ(u, X) = u² + X    (11)

and the random parameter is normally distributed, X ∼ N(0, 1).

• Mean-value solution:

    min_{u∈R} {E[u² + X]}.    (12)

Since E[X] = 0, the corresponding deterministic equivalent is

    min_{u∈R} u²    (13)

and the corresponding solution is u* = 0, E[Φ(u*, X)] = 0.
• Minimax solution:

    min_{u∈R} max_X {u² + X}.    (14)

The normal distribution has a continuous CDF, therefore ∀C ∈ R: P(X > C) > 0. As a result, the inner maximum in the previous equation is unbounded and the problem does not have a solution.
• VaR-solution:

    min_{u∈R} {φ_α(u)}.    (15)

The probability function looks as follows:

    P_φ(u) = P(u² + X ≤ φ) = P(X ≤ φ − u²) = F_X(φ − u²),

where F_X is the CDF of X. The quantile function:

    φ_α(u) = min{φ : P_φ(u) ≥ α} = min{φ : F_X(φ − u²) ≥ α}.

By definition, a CDF is a monotonically non-decreasing function, therefore:

    F_X(φ − u²) ≥ α ⟺ φ − u² ≥ x_α ⟹ φ ≥ u² + x_α ⟹ φ_α(u) = u² + x_α,

where x_α is the α-quantile of N(0, 1). Finally, the deterministic equivalent assumes the form shown below:

    min_{u∈R} {u² + x_α}.    (16)

As can be clearly seen, u* = 0, φ_α(u*) = x_α.
• CVaR-solution. Considering the previous VaR solution:

    ψ_α(u) = E[Φ(u, X) | Φ(u, X) ≥ φ_α(u)] = E[u² + X | u² + X ≥ u² + x_α] = u² + E[X | X ≥ x_α] = u² + y_α,

where y_α = E[X | X ≥ x_α] is the α-CVaR of N(0, 1). Finally, the deterministic equivalent assumes the following form:

    min_{u∈R} {u² + y_α}    (17)

and the solution: u* = 0, ψ_α(u*) = y_α.

It is worth noting that in this particular example the strategy u = 0 is optimal simultaneously for the mean-value, VaR and CVaR criteria.
Figure 1: Illustration of the solution. (The loss PDF with the mean-value, α-VaR (≈ 1.96) and α-CVaR (≈ 2.34) levels marked; the maximal loss is unbounded.)
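The closed-form values above are easy to check numerically; the short script below (an illustrative check of ours, not part of the original thesis code) estimates the three criteria for Φ(u, X) = u² + X at u = 0 by simulation:

    import numpy as np
    from statistics import NormalDist

    rng = np.random.default_rng(1)
    alpha = 0.95
    u = 0.0
    loss = u ** 2 + rng.normal(size=1_000_000)   # realizations of Φ(u, X)

    mean_value = loss.mean()                     # ≈ 0
    var_alpha = np.quantile(loss, alpha)         # ≈ x_α = NormalDist().inv_cdf(alpha)
    cvar_alpha = loss[loss >= var_alpha].mean()  # ≈ y_α = E[X | X ≥ x_α]

    print(mean_value, var_alpha, NormalDist().inv_cdf(alpha))
    print(cvar_alpha)  # for α = 0.95: x_α ≈ 1.645, y_α ≈ 2.063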
3 Stochastic approximation

The history of stochastic approximation starts from the article of Robbins and Monro [20], where the authors applied a recursive SA procedure to find a root of an equation while observing noisy data. As was discovered, there is no need to know the exact values of the function, because during the solution we are only interested in the direction towards the root. As a result of this observation, the process can work with noisy values instead of the exact data.
Let us review several illustrative application problems from real life where stochastic approximation algorithms can be quite useful.

• During the testing of insecticides there is a problem of determining the maximum dose which guarantees the intensity of the reaction.

• The hardness of an iron-copper alloy depends on the duration of the temperature impact. Let u denote the time and Y(u) the hardness of the alloy. The computational problem is to determine such u that the hardness of the alloy Y(u) reaches α. It is also a well-known fact that the hardness of the alloy varies over the items.

• The sensitivity of an explosive substance to physical impact is measured by hitting: a small item of the substance is punched by a solid object thrown from a fixed height. Some samples will explode and some of them will not. The problem is to determine the critical height for the substance.

• A grain field is fertilized by some chemicals. Let u denote the amount of the fertilizer and Y(u) the corresponding amount of grain gathered from the field. It can be clearly seen that a lack of the fertilizer (as well as over-fertilization) yields a reduction of the crops; only some optimal amount of chemicals leads to a significant outcome. Moreover, the productivity of the field changes over the years even if the amount of the fertilizer u remains the same.
The solution of the first three cases can be organized according to the procedure of Robbins and Monro: a tester picks an arbitrary value x_1 from the admissible area, conducts an experiment and observes a realization y(x_1) of the random variable Y(x_1) with expectation M(x_1) = E{Y(x_1)}, where M is some increasing function. The tester also chooses a decreasing sequence a_n = c/n, where c is a positive constant and n is the number of the step. The problem to be solved is to determine such θ that M(θ) = α. For the next experiment he takes x_{n+1} according to the rule:

    x_{n+1} = x_n − a_n (y(x_n) − α).    (18)

To understand the main idea of the method assume α = 0. In this case the previous statement yields

    x_{n+1} = x_n − (c/n) y(x_n).    (19)

If y(x_n) > 0, then x_{n+1} < x_n, and y(x_n) < 0 correspondingly leads to x_{n+1} > x_n, meaning that equation (19) looks reasonable, since we are searching for θ : M(θ) = 0.
The solution of the last problem can be organized with the help of the stochastic quasi-gradient method, which is studied in detail further in the thesis.
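A minimal sketch of the Robbins-Monro iteration (18) in Python, with an invented noisy response Y(x) = M(x) + noise for an assumed increasing M, might look as follows:

    import numpy as np

    rng = np.random.default_rng(2)

    def y(x):
        """Noisy observation of the increasing regression function M(x) = 2x - 1."""
        return 2.0 * x - 1.0 + rng.normal(scale=0.5)

    alpha = 0.0   # target level: we search for θ with M(θ) = α, here θ = 0.5
    c = 1.0       # positive constant of the step sequence a_n = c / n
    x = 5.0       # arbitrary starting point x_1

    for n in range(1, 10_001):
        x = x - (c / n) * (y(x) - alpha)   # Robbins-Monro rule (18)

    print(x)      # ≈ 0.5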
Kiefer and Wolfowitz (in the 1d-case) [17] and Blum [21] (in the multidimensional case) applied stochastic approximation to optimize mean-value functionals.

The books of Ermoliev [9] and Powell [22] contain easily understandable proofs and motivation for using stochastic approximation.

Kan and Kibzun [4] have applied stochastic approximation to optimization problems with the VaR criterion.

Spall [19] has suggested using random perturbations in the classical KW-procedure to decrease the number of evaluations of the objective function. Moreover, his personal page [19] lists a large number of references regarding applications of SPSA.

Let us list several different areas of science where the SPSA-algorithms have been successfully applied [19]:

• Aircraft modeling and control
• Atmospheric and planetary modeling
• Cardiological data analysis
• Noise cancellation
• Queuing network design
• Robot control
• Sensor placement
• Traffic signal timing and other transportation problems
• Underground mine detection

Later works studied the convergence speed of the procedure and suggested several different versions of it.
The next section contains explanations of the stochastic approximation procedures which are used for the optimization of cost functions.

The presentation is organized as follows: firstly we introduce the conditional expectation, then on its basis some definitions from martingale theory will be shown. Having insight into the martingale convergence theory, we gradually start the main proof regarding the convergence of the SQG-procedure. At the end of the section we describe a relatively new area of stochastic approximation, the SPSA algorithm, and discuss adaptive procedures that practically enhance convergence.
3.1 Conditional mathematical expectation

Let {Ω, F, P} denote a probability space, G ⊆ F a σ-algebra of random events, and ξ a random variable with E[ξ] < ∞. A random variable E[ξ | G] is called the conditional mathematical expectation of ξ w.r.t. G if the following conditions are valid:

1. E[ξ | G] is a G-measurable function,

2. for every set A ∈ G,

    ∫_A ξ(ω) P(dω) = ∫_A E[ξ | G](ω) P(dω).    (20)
The main properties of the conditional mathematical expectation that are used in the next paragraphs are [13]:

1. if ξ = C = const, then

    E[ξ | G] = C (a.s.),    (21)

2. if ξ ≤ η (a.s.), then

    E[ξ | G] ≤ E[η | G] (a.s.),    (22)

3. if the random variable ξ is measurable with respect to the σ-algebra G, then

    E[ξ | G] = ξ (a.s.),    (23)

4. the full expectation property:

    E[E[ξ | G]] = E[ξ],    (24)

5. linearity: if a, b ∈ R and E[ξ] < ∞, E[η] < ∞, then

    E[aξ + bη | G] = a E[ξ | G] + b E[η | G] (a.s.),    (25)

6. if the random variable ξ does not depend on the σ-algebra G, then

    E[ξ | G] = E[ξ] (a.s.),    (26)

7. the conditional mathematical expectation with respect to a random variable is defined as the expectation w.r.t. the σ-algebra generated by this variable:

    E[ξ | η] = E[ξ | F_η], F_η = σ{η}.    (27)
To get an intuitive sense of the conditional mathematical expectation, one can imagine E[X | F] as a rough, averaged version of X: on an arbitrary set A ∈ F the random variable X can assume many values, but E[X | F] takes only one fixed value E[X | A], which is the average value of X on the set A ∈ F.
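This averaging intuition is easiest to see for a σ-algebra generated by a finite partition; the toy Python sketch below (our illustration, with an arbitrarily chosen partition) replaces X by its average on each partition cell:

    import numpy as np

    rng = np.random.default_rng(3)
    omega = rng.uniform(0.0, 1.0, size=10)      # sample points of Ω = [0, 1]
    X = omega ** 2                              # a random variable X(ω) = ω²

    # Partition Ω into two cells A0 = [0, 0.5), A1 = [0.5, 1]; the σ-algebra F
    # generated by this partition contains only unions of these cells.
    cell = (omega >= 0.5).astype(int)

    # E[X | F] is constant on each cell and equals the average of X there.
    cond_exp = np.empty_like(X)
    for a in (0, 1):
        cond_exp[cell == a] = X[cell == a].mean()

    print(np.column_stack([omega, X, cond_exp]))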
3.2 Discrete time martingales

Assume that F_0 ⊆ F_1 ⊆ ... ⊆ F is a non-decreasing family of σ-algebras defined on a probability space {Ω, F, P}. For example, if {ξ_n}, n ≥ 0 is a set of random variables given on {Ω, F, P} and F_n = σ{ξ_0, ..., ξ_n} is the sigma-algebra generated by {ξ_k, k = 0, ..., n}, then {F_n} is a non-decreasing family of σ-algebras.

Let {X_n, n ≥ 0} be a sequence of random variables defined on {Ω, F, P}. If for all n ≥ 0 X_n is F_n-measurable, then {X_n, F_n} is called a stochastic sequence. If for all n ≥ 0 X_n is F_{n−1}-measurable, then the sequence {X_n, F_{n−1}} is called a predictable stochastic sequence.
A stochastic sequence {X_n, F_n} with E[|X_n|] < ∞ is called:

• a martingale, if E[X_{n+1} | F_n] = X_n (a.s.),
• a submartingale, if E[X_{n+1} | F_n] ≥ X_n (a.s.),
• a supermartingale, if E[X_{n+1} | F_n] ≤ X_n (a.s.).

3.3 Convergence of the supermartingales

It can be clearly seen that supermartingales are stochastic analogues of non-increasing sequences of real numbers. As a result, supermartingales (as well as submartingales) under some conditions have a limit (which is in fact a random variable). The corresponding statement is shown in the following theorem [9].

Theorem 3.1 (Convergence theorem). If {X_n, F_n} is a supermartingale such that inf_n E[X_n^−] > −∞, where X_n^− = min{X_n, 0}, then

    |lim_{n→∞} X_n| < ∞ (a.s.).    (28)

From the theorem above it is possible to conclude: if {X_n, F_n} is a non-negative supermartingale, then a.s. (with probability 1) its limit exists.

3.4 Convergence of the random variables
Let {X_1, ..., X_n, ...} be a sequence of random variables defined on the same probability space {Ω, F, P}.

1. X_n →^P X (in probability) if ∀ε > 0: lim_{n→∞} P(|X_n − X| > ε) = 0;

2. X_n →^{m.s.} X (in mean square) if lim_{n→∞} E{|X_n − X|²} = 0;

3. X_n →^{a.s.} X (almost sure) if P{w ∈ Ω : lim_{n→∞} X_n(w) = X(w)} = 1;

4. X_n →^d X (in distribution, or weakly) if lim_{n→∞} F_n(x) = F(x) for every x ∈ R at which F(x) is continuous. Here F_n(x) and F(x) are the cumulative distribution functions (CDF) of the random variables X_n and X respectively.

Let X_n be a sequence of independent random variables distributed as follows: P(X_n = 0) = 1 − 1/n, P(X_n = 1) = 1/n. This sequence converges to zero in mean square and therefore in probability, but does not converge almost sure [26].
Figure 2: Convergence of the random variables in mean-square.
Let X_n be a sequence of random variables distributed as follows: P(X_n = 0) = 1 − 1/n², P(X_n = n) = 1/n². This sequence converges almost sure and therefore in probability, but does not converge in mean square [26].
Figure 3: Convergence of the random variables almost sure.
As can be seen from the figures, in the first case even after 5000 steps there can be outliers from the accumulation point, but in the case of almost sure convergence this does not happen, as for any w ∈ Ω we cannot have values beyond an arbitrarily given epsilon-neighborhood of the limit point.
3.5 Stochastic quasi-gradient method

Having an understanding of the foregoing concepts, we will proceed by describing the stochastic quasi-gradient algorithm for optimization problems with the mean-value criterion, which is formulated below:

    min_{u∈U} F(u), F(u) = E{Φ(u, X)}.    (29)

Here Φ(u, X) as before denotes an objective function dependent on a random vector X and the strategy u, which should be chosen from the set U ⊆ R^n.

If the gradient ∇F(u) were known, we could use the classical gradient descent method, starting from some initial point u_0. The main statement of the algorithm is written below:

    u_{ν+1} = Π_U(u_ν − ρ_ν ∇F(u_ν)),    (30)

where Π_U(u) denotes the projection operator onto the set U:

    Π_U(u) = arg min_{y∈U} ||u − y||.    (31)

The projection operator is used here to restrict the location of the points u_0, ..., u_n generated by the algorithm to the admissible area U.
The problem of finding the projection of a point onto a set is not always an easy task. Generally, the set U has to be closed and convex in order to have a unique projection (the projection might not exist at all if the set U is open, and it might be non-unique in the case of a non-convex U).

The projection can be found instantly for elementary types of the set U. Such cases are listed below; a small code sketch of these operators follows the list.

1. A non-negative orthant:

    U = E_+^n = {u : u ∈ R^n, u_j ≥ 0, j = 1, ..., n}.    (32)

Let us denote by p_j the j-th coordinate of the projection Π_U(u), then

    p_j = { 0, u_j < 0;  u_j, u_j ≥ 0 }.    (33)

2. An n-dimensional parallelepiped:

    U = {u : u ∈ R^n, a_j ≤ u_j ≤ b_j, j = 1, ..., n},    (34)

    p_j = { a_j, u_j < a_j;  u_j, a_j ≤ u_j ≤ b_j;  b_j, u_j > b_j },    (35)

where p_j as usual is the j-th coordinate of the projection Π_U(u).

3. A sphere with radius r:

    U = {u : u ∈ R^n, ||u|| ≤ r}.    (36)

The projection can be found as follows:

    Π_U(u) = { u, u ∈ U;  (r/||u||) u, u ∉ U }.    (37)

4. A hyperplane:

    U = {u : u ∈ R^n, ⟨c, u⟩ = b},    (38)

where c ∈ R^n, b ∈ R:

    Π_U(u) = u + ((b − ⟨c, u⟩)/||c||²) c.    (39)

5. A halfspace:

    U = {u : u ∈ R^n, ⟨c, u⟩ ≤ b},    (40)

    Π_U(u) = { u, u ∈ U;  u + ((b − ⟨c, u⟩)/||c||²) c, u ∉ U }.    (41)
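For reference, these five projections are one-liners in Python; the sketch below (function names are ours, not from the thesis appendix) implements formulas (33)-(41):

    import numpy as np

    def proj_orthant(u):
        """Projection onto the non-negative orthant, formula (33)."""
        return np.maximum(u, 0.0)

    def proj_box(u, a, b):
        """Projection onto the parallelepiped {a_j <= u_j <= b_j}, formula (35)."""
        return np.clip(u, a, b)

    def proj_ball(u, r):
        """Projection onto the ball ||u|| <= r, formula (37)."""
        norm = np.linalg.norm(u)
        return u if norm <= r else (r / norm) * u

    def proj_hyperplane(u, c, b):
        """Projection onto the hyperplane <c, u> = b, formula (39)."""
        return u + (b - c @ u) / (c @ c) * c

    def proj_halfspace(u, c, b):
        """Projection onto the halfspace <c, u> <= b, formula (41)."""
        return u if c @ u <= b else proj_hyperplane(u, c, b)

    u = np.array([2.0, -3.0])
    print(proj_orthant(u), proj_ball(u, 1.0),
          proj_halfspace(u, np.array([1.0, 0.0]), 1.0))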
Considering the remark regarding the projection operator, a gradient descent algorithm can be written:

1. Initialize the step ν = 0, choose the initial point u_0.

2. Compute the next point u_{ν+1} using the previous point u_ν and the equation:

    u_{ν+1} = Π_U(u_ν − ρ_ν ∇F(u_ν)).    (42)

To find a projection onto the set U, use the above-mentioned remarks regarding the projection operator.

3. Check the stopping criterion. If the stopping condition is satisfied, put u* = u_{ν+1}, otherwise assign ν = ν + 1 and go to step 2.
Let f(u) be a function defined on the set U ⊂ R^n. The vector ∂f(u_0) ∈ R^n is called a subgradient of the function f(u) at the point u_0 ∈ U if ∀u ∈ U:

    f(u) ≥ f(u_0) + ⟨∂f(u_0), u − u_0⟩.    (43)

Geometrically, the formula above means that the graph of the function f(u) is located above the linear function f(u_0) + ⟨∂f(u_0), u − u_0⟩, and at the point u_0 the graphs coincide.

Obviously, if the function is differentiable, there is only one such vector ∂f(u_0), equal to the gradient ∇f(u_0). For a non-differentiable function there may exist at a point u_0 a whole set of subgradients, which is called the subdifferential set (for example, for f(u) = |u| the subdifferential at u_0 = 0 is the interval [−1, 1]).

From the point of view of optimization methods, if the objective function F(u) is not differentiable, the gradient ∇F(u) is substituted by a subgradient ∂F(u), and the rest of the gradient descent method is used without any changes. Such a method is called the subgradient algorithm.
In the case of a stochastic objective function an exact gradient is unknown, and therefore its stochastic version is used:

    u_{ν+1} = Π_U(u_ν − ρ_ν ∇_u Φ(u_ν, x_ν)),    (44)

where x_ν, ν = 1, ..., is a realization of X at the step ν of the algorithm. Thus x_ν is a random variable, because we do not know which value X assumes at the step ν. Moreover, the point u_ν, ν = 1, ..., is also a random variable, because it depends on x_ν and the previous point u_{ν−1}.

Finally, note that the expression (44) describes a random process starting from the initial point u_0, and under some further formulated conditions it should lead us to the optimal value of the optimization problem (29).
The algorithm reads:

1. Initialize the step ν = 0, choose the initial point u_0.

2. Generate a realization x_ν of the random vector X. Compute a stochastic gradient ∇_u Φ(u_ν, x_ν) using the previous point u_ν and the realization x_ν.

3. Calculate the next point u_{ν+1} using the previous point u_ν and the equation:

    u_{ν+1} = Π_U(u_ν − ρ_ν ∇_u Φ(u_ν, x_ν)).    (45)

To find a projection onto the set U, use the mentioned remarks regarding the projection operator.

4. Check the stopping criterion. If the stopping condition is satisfied, put u* = u_{ν+1}, otherwise assign ν = ν + 1 and go to step 2.

If the function Φ(u_ν, x_ν) is not differentiable with respect to the variable u, the stochastic gradient is substituted by a quasi-gradient ζ_ν:

    E[ζ_ν | u_0, ..., u_ν] ∈ ∂_u E[Φ(u_ν, X)].    (46)

Intuitively, the above statement means that the conditional expectation of the stochastic quasi-gradient at the point u_ν with respect to all previous points should coincide almost sure (a.s.) with one of the subgradients of the objective function at the point u_ν (i.e. belong to the subdifferential set). The main statement of the stochastic quasi-gradient method looks as follows:

    u_{ν+1} = Π_U(u_ν − ρ_ν ζ_ν).    (47)
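As an illustration of steps 1-4 and iteration (47), here is a compact Python sketch of the SQG method for the mean-value problem min E[||u − X||²] (the objective, the box U and the step-size choice are ours, for demonstration only):

    import numpy as np

    rng = np.random.default_rng(5)

    def stoch_grad(u, x):
        """Stochastic gradient of Φ(u, x) = ||u - x||² w.r.t. u."""
        return 2.0 * (u - x)

    def proj_box(u, a=-10.0, b=10.0):
        """Projection onto the admissible box U = [a, b]^n, formula (35)."""
        return np.clip(u, a, b)

    u = np.array([8.0, -6.0])                  # initial point u_0
    for nu in range(1, 5001):
        x = rng.normal(loc=[1.0, 2.0])         # realization x_ν of X ~ N((1,2), I)
        rho = 1.0 / nu                         # step sizes satisfying (49)
        u = proj_box(u - rho * stoch_grad(u, x))   # iteration (47)

    print(u)   # ≈ E[X] = (1, 2), the minimizer of E||u - X||²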
When dealing with a deterministic problem, the step size ρ_{ν+1} can be obtained from the solution of a one-dimensional optimization problem:

    ρ_{ν+1} = min_ρ {F(u_ν − ρ ∇_u F(u_ν))},    (48)

meaning that at every step we should move along the gradient direction so as to minimize the objective function as much as possible.

In the stochastic version there are several problems which do not allow using the technique shown above. Firstly, we do not know exactly the expected value of Φ(u, X) for a given u (we know only Φ(u, x)). Secondly, the stochastic gradient at some steps can even give a wrong direction, so that any positive value of ρ leads to a worse point than the previous one.

The next figures show the classical gradient descent method and the stochastic version. In the first case we can explicitly find the gradient, and by computing the optimal step size we reach the optimum in one step. In the second case we only observe noisy data and get the optimum of the mean-value criterion by applying the stochastic gradient method, which of course requires more steps.
Figure 5: Illustration of the classical gradient descent method for F(u_1, u_2) = u_1² + u_2².
Figure 6: Illustration of the stochastic gradient method for F(u_1, u_2) = E{u_1² + u_2² + Xu_1 + Yu_2}.
To prove the convergence a.s. of the algorithm (47) we have to make several assumptions [11]:

• the function E[Φ(u, X)] is a convex finite-valued function;
• the set U is convex and compact;
• the step sizes ρ_ν and the quasi-gradient errors γ_ν satisfy a.s.:

    ρ_ν ≥ 0, Σ_{ν=0}^∞ ρ_ν = ∞, Σ_{ν=0}^∞ E[ρ_ν |γ_ν| + ρ_ν² ||ζ_ν||²] < ∞.    (49)

The first two conditions are classical requirements for the existence and uniqueness of the solution in convex optimization. The last requirements mostly affect the sequence of step sizes. The sequence has to be very special: it should converge to zero, but not too fast, so that the points generated by the algorithm are able to reach the optimum independently of the initial point where the algorithm starts.
In order to highlight the main steps of the proof [12], let us assume the following notation: u* ∈ U is an optimal solution of the problem (29) and F_ν = σ{u_0, ..., u_ν} is the sigma-algebra generated by the random variables {u_0, ..., u_ν}.

The main idea of the proof is to show that under the conditions of the theorem E[||u_{ν+1} − u*||² | F_ν] behaves as a non-negative supermartingale, so that a.s. it has some limit, and then to prove that the limit is 0, meaning u_ν → u* a.s.

• The projection implies:

    ||u_{ν+1} − u*||² ≤ ||u_ν − u*||² − 2ρ_ν ⟨ζ_ν, u_ν − u*⟩ + ρ_ν² ||ζ_ν||².

• Take the conditional expectation w.r.t. F_ν:

    E[||u_{ν+1} − u*||² | F_ν] ≤ ||u_ν − u*||² − 2ρ_ν ⟨E[ζ_ν | F_ν], u_ν − u*⟩ + ρ_ν² E[||ζ_ν||² | F_ν].

• By the conditions of the theorem (the quasi-gradient property (46) with error γ_ν):

    γ_ν + ⟨E[ζ_ν | F_ν], u_ν − u*⟩ ≥ E[Φ(u_ν, X)] − E[Φ(u*, X)] ≥ 0 ⟹ ⟨E[ζ_ν | F_ν], u_ν − u*⟩ ≥ −γ_ν.

• The two previous steps together result in:

    E[||u_{ν+1} − u*||² | F_ν] ≤ ||u_ν − u*||² + 2ρ_ν |γ_ν| + ρ_ν² E[||ζ_ν||² | F_ν] = ||u_ν − u*||² + R_ν,

where R_ν = 2ρ_ν |γ_ν| + ρ_ν² E[||ζ_ν||² | F_ν].

• Assume Y_ν = ||u* − u_ν||² + Σ_{k=ν}^∞ E[R_k | F_ν]; hence

    E[Y_{ν+1} | F_ν] ≤ Y_ν, Y_ν ≥ 0 ⟹ Y_ν → Y (a.s.)

by the convergence theorem for supermartingales.

• Recursively, taking the full expectation:

    E[||u_{ν+1} − u*||²] ≤ ||u_0 − u*||² + Σ_{k=0}^ν E[R_k] − 2 Σ_{k=0}^ν ρ_k (E[Φ(u_k, X)] − E[Φ(u*, X)]).

• Considering the conditions of the theorem, Σ_{k=0}^∞ E[R_k] < ∞ and E[Φ(u*, X)] ≤ E[Φ(u_k, X)], we have

    −∞ < Σ_{k=0}^∞ ρ_k (E[Φ(u*, X)] − E[Φ(u_k, X)]) ≤ 0,

and since Σ_{k=0}^∞ ρ_k = ∞, therefore E[Φ(u_k, X)] → E[Φ(u*, X)].
3.6 Kiefer-Wolfowitz algorithm

If the subgradient ∂_u Φ(u, x) or even the gradient ∇_u Φ(u, x) of the function Φ(u, x) w.r.t. the variable u can be explicitly found, then at every step of the SQG-algorithm ∂_u Φ(u, x) or ∇_u Φ(u, x) is usually used as a quasi-gradient estimate; otherwise a finite-difference approximation is employed:

    ζ_k = (1/(2δ_k)) Σ_{j=1}^n [Φ(u_k + δ_k e_j, x_k) − Φ(u_k − δ_k e_j, x_k)] e_j,    (50)

where x_k is, as usual, a realization of the random vector X at the step k of the algorithm, and e_j, j = 1, ..., n are the unit vectors directed along the coordinate axes.

Considering the formula above, the method of Kiefer and Wolfowitz assumes the following form:
1. Initialize the step k = 0, choose the initial point u_0.

2. Generate a realization x_k of the random vector X. Compute the stochastic quasi-gradient ζ_k via formula (50) using the previous point u_k and the realization x_k.

3. Calculate the next point u_{k+1} using the previous point u_k according to the formula:

    u_{k+1} = Π_U(u_k − ρ_k ζ_k).    (51)

To find a projection onto the set U, use the mentioned remarks regarding the projection operator.

4. Check the stopping criterion. If the stopping condition is satisfied, put u* = u_{k+1}, otherwise assign k = k + 1 and go to step 2.

As can be seen, the algorithm requires 2n evaluations of the objective function Φ(u, X) at every step of the method, where n is the dimension of the control variable u.
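A minimal Python sketch of the Kiefer-Wolfowitz quasi-gradient (50) and iteration (51), again on an invented test objective and with U = R² so that the projection is trivial, could look like this:

    import numpy as np

    rng = np.random.default_rng(6)

    def phi(u, x):
        """Illustrative noisy objective Φ(u, x) with minimizer u* = (0, 0)."""
        return u[0] ** 2 + u[1] ** 2 + x * u[0]

    def kw_quasi_gradient(u, x, delta):
        """Central finite-difference quasi-gradient, formula (50): 2n evaluations."""
        n = len(u)
        zeta = np.zeros(n)
        for j in range(n):
            e = np.zeros(n); e[j] = 1.0
            zeta[j] = (phi(u + delta * e, x) - phi(u - delta * e, x)) / (2.0 * delta)
        return zeta

    u = np.array([5.0, -4.0])
    for k in range(1, 5001):
        x = rng.normal()                        # realization x_k of X ~ N(0, 1)
        rho, delta = 1.0 / k, 1.0 / k ** 0.25   # sequences satisfying (52)
        u = u - rho * kw_quasi_gradient(u, x, delta)   # iteration (51), U = R²

    print(u)   # ≈ (0, 0)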
The convergence theorem of Kiefer-Wolfowitz [17] and its generalization can be found in [4, 9].

Theorem 3.2. If the conditions below are valid:

• the function f(u) = E[Φ(u, X)] has the only extremum u* ∈ int{U};
• the second derivatives of f(u) are continuous and bounded;
• the variance D[Φ(u, X)] ≤ C < ∞ for all u ∈ U;
• the sequences ρ_k and δ_k satisfy

    ρ_k ≥ 0, Σ_{k=1}^∞ ρ_k = ∞, Σ_{k=1}^∞ (ρ_k/δ_k)² < ∞, Σ_{k=1}^∞ ρ_k |δ_k| < ∞,    (52)

then the sequence generated by the Kiefer-Wolfowitz algorithm converges to the optimal solution of the mean-value problem almost sure (u_k → u* a.s.).
3.7 Recipes for step sizes

As was mentioned, the step size is a very important feature directly affecting the performance of the method. As often happens, the theoretical results regarding convergence of the method do not suggest a way to increase performance, and therefore empirical schemes are used. The flip side of such a process is that adaptive algorithms should be theoretically justified, otherwise there exists a risk of facing a special case where an inaccurate adaptive method fails. As an example: a stochastic approximation adaptive rule for the step size has to satisfy the step size conditions to guarantee convergence, otherwise the whole proof must be revised.

Additionally, adaptation should not significantly increase the complexity of the original method, and preferably should not have a lot of tuning parameters to be adjusted. Despite these challenges, several successful results regarding adaptive stochastic approximation are known.
3.7.1 Kesten Rule

The first idea of using an adaptive step size in stochastic approximation belongs to Kesten. His method is based on the simple observation that if we are far from the optimum, the errors tend to have the same sign, but when we are getting closer, the errors start to alternate. Under the error here is understood the difference between the previous point u_{n−1} and the current point u_n. Kesten suggested the following rule:

    ρ_n = a / (a + K_n − 1),    (53)

where a is a parameter to be calibrated and K_n is a parameter which counts the number of times that the sign of the errors has changed:

    K_n = { n, n = 1, 2;  K_{n−1} + 1_{⟨ε_n, ε_{n−1}⟩ < 0}, n > 2 },    (54)

where

    ε_n = u_{n−1} − u_n.    (55)
To assess the performance of the adaptive method let us have a look at the following example:

    min E[u_1² + u_2² + max(u_1², u_2²) + u_1 X], X ∼ N(0, 1).    (56)

Obviously, the optimum of the function is u_1* = 0, u_2* = 0, because for all u_1, u_2 we have u_1² + u_2² + max(u_1², u_2²) ≥ 0, as all the terms u_1² ≥ 0, u_2² ≥ 0, max(u_1², u_2²) ≥ 0.
.−10
−5 0 5 10
−10
−5 0
5 10 0
50 100 150 200 250 300
u 2 u 1
F(u 1 ,u 2 )=E{u 1 2 +u 2 2 +max(u 1 2 ,u 2 2 )+u 1 X}
Figure 7: Optimizationfuntion.
The stochastic gradient can be written explicitly in the form:

    ξ(u, X) = ( 2u_1 + 2u_1 · 1_{u_1² > u_2²} + X,  2u_2 + 2u_2 · 1_{u_1² ≤ u_2²} )ᵀ.    (57)

Let us choose the step size ρ_k according to the rule ρ_k = 0.1 · (1/k). In this standard case the steps will be too small, and the adaptive rule will preserve the step size from the rapid drop, enhancing convergence.
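A sketch of the Kesten rule (53)-(55) applied to this test example in Python (our own illustrative implementation; the calibration parameter a = 10 is an arbitrary choice) is shown below:

    import numpy as np

    rng = np.random.default_rng(7)

    def xi(u, x):
        """Stochastic gradient (57) of the test objective (56)."""
        ind = 1.0 if u[0] ** 2 > u[1] ** 2 else 0.0
        return np.array([2 * u[0] + 2 * u[0] * ind + x,
                         2 * u[1] + 2 * u[1] * (1.0 - ind)])

    a = 10.0                    # calibration parameter of rule (53)
    u = np.array([6.0, 6.0])
    K, eps_prev = 1, None
    for n in range(1, 1001):
        rho = a / (a + K - 1)               # Kesten step size (53)
        u_new = u - 0.1 * rho * xi(u, rng.normal())
        eps = u - u_new                     # error (55): ε_n = u_{n-1} - u_n
        if n <= 2:
            K = n                           # formula (54), starting values
        elif float(eps @ eps_prev) < 0.0:
            K += 1                          # the sign of the error changed
        eps_prev = eps
        u = u_new

    print(u)  # ≈ (0, 0)

The step stays large while the errors keep the same sign (far from the optimum) and decays roughly as a/K_n once the errors start to alternate near the optimum.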
In the next figures we can see how the control variables were changing towards the optimum over the steps. The initial value of the step size was chosen the same for the adaptive and non-adaptive methods. In fact, the non-adaptive algorithms even after 1000 iterations were relatively far from the optimum, while the adaptive schemes were able to reach the steady state after 100 steps.
Figure 8: Small step size. Stochastic Gradient Descent (left). Stochastic Gradient Descent with adaptive Kesten rule (right).
Figure 9: Kiefer-Wolfowitz algorithm (left). Kiefer-Wolfowitz algorithm with adaptive Kesten rule (right).
As can be seen, the test results agree with the theoretical reasoning.

Now let us test the adaptive rule with large step lengths on the same test example, so ρ_k = 10 · (1/k). At the beginning, when k is relatively small, the steps will be too large, and the points generated by the algorithms will try to leave the admissible area U, but due to the projection operator they will be returned to the border of the set U. When the step size assumes reasonable values, the procedures will start converging to the optimum. Here the adaptive rule does not enhance convergence, as it does not decrease the steps faster than the standard versions.
Figure 10: Stochastic Gradient Descent (left). Stochastic Gradient Descent with adaptive Kesten rule (right).
Figure 11: Kiefer-Wolfowitz algorithm (left). Kiefer-Wolfowitz algorithm with adaptive Kesten rule (right).
The test examples again agree with the theoretical reasoning. This example also demonstrates the behaviour of the stochastic approximation methods: the convergence traces w.r.t. the coordinate u_2 are smoother than w.r.t. u_1, as the stochasticity mostly affects the first coordinate due to the term u_1 X in the objective function.

Finally, regarding the Kesten rule we can conclude that it is reasonable to use it for relatively small or middle step lengths, as it combines a constant step size rule (when we are far from the optimum) with the usual diminishing step size rule (when we are close to the end).
3.7.2 Adaptive rule of Uryasev

Uryasev [28] suggested a more advanced adaptive rule which can decrease the steps as well as increase them, depending on the generated sequence of points:

    ρ_{k+1} = min(ρ̄, ρ_k a_k^{−⟨ξ_{k+1}, Δ_{k+1}⟩ − δρ_k}), Δ_{k+1} = u_{k+1} − u_k, a_k > 1, δ > 0.    (58)

To understand the meaning of the terms in the rule mentioned above, let us have a look at the main formula of the SQG-algorithm:

    u_{k+1} = u_k − ρ_k ξ_k.    (59)

It would be logical to choose the step that minimizes the function F_k(ρ) w.r.t. ρ:

    F_k(ρ) = E[Φ(u_k − ρξ_k) | F_k].    (60)

A computation of F_k(ρ) is a very difficult task, therefore let us differentiate Φ(u_k − ρξ_k) w.r.t. ρ at the point ρ_k:

    ∂_ρ Φ(u_k − ρξ_k)|_{ρ_k} = −⟨ξ_k, ∂Φ(u_k − ρ_k ξ_k)⟩ = −⟨ξ_k, ξ_{k+1}⟩.    (61)

Hence −E[⟨ξ_k, ξ_{k+1}⟩ | F_k] ∈ ∂F_k(ρ_k), and the following gradient procedure can be used to modify the step size:

    ρ_{k+1} = ρ_k + λ_k ⟨ξ_k, ξ_{k+1}⟩.    (62)

The adaptive procedure (58) directly comes from the previous formula and works as follows: the term ⟨ξ_{k+1}, Δ_{k+1}⟩ gives information on whether or not the minimum of F_k(ρ) w.r.t. ρ has been reached. If −⟨ξ_{k+1}, Δ_{k+1}⟩ > 0, then with high probability we can say that the minimum has not been achieved yet and the step ρ_k will be increased, otherwise it decreases.
Let us test this adaptive rule on the previous model example, as sketched below.
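A minimal Python sketch of the Uryasev rule (58), reusing the stochastic gradient (57) of the model example (the parameter values ρ̄ = 0.5, a = 1.05, δ = 0.1 are our own illustrative choices), could look as follows:

    import numpy as np

    rng = np.random.default_rng(8)

    def xi(u, x):
        """Stochastic gradient (57) of the test objective (56)."""
        ind = 1.0 if u[0] ** 2 > u[1] ** 2 else 0.0
        return np.array([2 * u[0] + 2 * u[0] * ind + x,
                         2 * u[1] + 2 * u[1] * (1.0 - ind)])

    rho_bar, a, delta = 0.5, 1.05, 0.1   # ρ̄, a_k ≡ a > 1, δ > 0 in rule (58)
    rho = 0.1
    u = np.array([2.0, 2.0])
    g = xi(u, rng.normal())
    for k in range(2000):
        u_new = u - rho * g                     # SQG step (59)
        g_new = xi(u_new, rng.normal())         # ξ_{k+1} at the new point
        d = u_new - u                           # Δ_{k+1} = u_{k+1} - u_k
        rho = min(rho_bar, rho * a ** (-(g_new @ d) - delta * rho))  # rule (58)
        u, g = u_new, g_new

    print(u)  # ≈ (0, 0) up to the remaining noise

While successive quasi-gradients stay aligned (⟨ξ_{k+1}, Δ_{k+1}⟩ < 0), the step grows towards ρ̄; once they start to oppose each other near the optimum, the exponent turns negative and the step shrinks, with the −δρ_k term providing a slow downward drift.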