Perceptron algorithm


(1)

Training the linear classifier

• A natural way to train the classifier is to minimize the number of classification errors on the training data, i.e. choosing w so that the training error

C(w) = |{ (x_i, y_i) : y_i ≠ sign(w^T x_i) }|

is minimized. (We use |A| to denote the cardinality of a set A; see the short sketch after this list.)

• However, minimizing C(w) is in the general case computationally intractable (NP-hard)

• If it is known that the classes are linearly separable, i.e. that there exists a w such that C(w) = 0, the problem can be solved efficiently, for example with the perceptron algorithm discussed next
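As an added illustration (not from the slides; X, y, and w are assumed NumPy arrays, with labels in {−1, +1}), the training error C(w) can be computed directly:

import numpy as np

def training_error(w, X, y):
    """Training error C(w): the number of points with y_i != sign(w^T x_i)."""
    predictions = np.sign(X @ w)          # predicted labels for all N points at once
    return int(np.sum(predictions != y))  # |A| where A is the set of misclassified points

Note that np.sign(0) = 0, so a point lying exactly on the decision boundary is counted as misclassified here.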

(2)

Perceptron algorithm

• The perceptron algorithm is a simple iterative method which can be used to train a linear classifier

• The algorithm converges if and only if the training data is linearly separable. Of course, the algorithm (or its variations) can also be used for non-separable data.

• Even if the data is linearly separable, the perceptron algorithm is only guaranteed to converge in some finite time. For difficult problems, it may be very slow. A better worst-case behavior is obtained by using linear programming.

• The following pseudocode goes through the data repeatedly until a linear classifier with zero training error is found, or until a predefined maximum number of iterations T is reached

(3)

Perceptron algorithm: Pseudocode

w := 0
for round = 1:T
    update := false
    for i = 1:N
        ŷ_i := sign(w^T x_i)
        if ŷ_i ≠ y_i
            w := w + y_i x_i
            update := true
    if update == false
        break
return w
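For concreteness, here is a small runnable Python/NumPy version of the same procedure (an added sketch following the pseudocode above; X is assumed to be an N x d array of inputs and y a vector of +1/−1 labels):

import numpy as np

def perceptron(X, y, T=100):
    """Perceptron training loop following the pseudocode above."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(T):                  # at most T sweeps (epochs) over the data
        update = False
        for i in range(N):
            y_hat = np.sign(w @ X[i])   # current prediction for example i
            if y_hat != y[i]:
                w = w + y[i] * X[i]     # turn w towards x_i (y_i = +1) or away from it (y_i = -1)
                update = True
        if not update:                  # a full sweep with no mistakes: training error is zero
            break
    return w

As a toy check, with X = np.array([[1., 2.], [-1., -1.]]) and y = np.array([1, -1]), perceptron(X, y) returns array([1., 2.]), which separates the two points.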

(4)

Perceptron algorithm: Main ideas

• The algorithm keeps track of and updates a weight vector w

• Each input item is shown once in a sweep. If a full sweep is completed without any misclassifications, then we are done. If T sweeps are reached, then we stop (without convergence)

• Whenever ŷ_i ≠ y_i we update w by adding y_i x_i. This turns w towards x_i if y_i = +1, and away from x_i if y_i = −1

• Note on terminology: a full sweep through the data is in this context often called an epoch

(5)

Perceptron algorithm: Illustration

[Figure: training examples of class +1 and class −1, with the current state of the weight vector w]

(6)

Perceptron algorithm: Illustration

[Figure: same setup. Red point classified correctly, no change to w]

(7)

Perceptron algorithm: Illustration

[Figure: same setup. Green point classified correctly, no change to w]

(8)

Perceptron algorithm: Illustration

[Figure: same setup. Green point misclassified, will change w as follows...]

(9)

Perceptron algorithm: Illustration

[Figure: same setup. Adding y_i x_i to the current weight vector w to obtain the new weight vector]

(10)

Perceptron algorithm: Convergence proof

• Assumption 1:

The training data is linearly separable with a margin γ > 0: there exists a w* ∈ R^n for which ||w*||_2 = 1 and y_i x_i^T w* ≥ γ for all i = 1, ..., N

[Figure: a separating hyperplane with unit normal w* and margin γ around it]

(Note: ||w||_2 = √(w^T w) is the regular Euclidean norm.)

(11)

• Assumption 2:

The training data fits into a sphere with radius r centered at the origin: ||x_i||_2 ≤ r for all i = 1, ..., N

For a finite number N of points this is of course always satisfied: let r equal the norm of the data point with the largest norm.

[Figure: data points enclosed in a sphere of radius r]

(12)

• Consequence 1:

Each update of w increases the inner product w^T w* by at least γ:

Let w denote the weight vector before the update, while w' = w + y_i x_i is the vector after the update. Then we have

w'^T w* = (w + y_i x_i)^T w* = w^T w* + y_i x_i^T w* ≥ w^T w* + γ

Note that since we started with w = 0 at the first iteration, we had w^T w* = 0 at the start, so after p updates we necessarily have w^T w* ≥ pγ.

(13)

• Consequence 2:

Each update of w increases the squared norm ||w||_2^2 of w by at most r^2:

Let w denote the weight vector before the update, while w' = w + y_i x_i is the vector after the update. Then we have

||w'||_2^2 = w'^T w' = (w + y_i x_i)^T (w + y_i x_i) = w^T w + 2 y_i x_i^T w + y_i^2 x_i^T x_i,

where y_i x_i^T w < 0 since the example was misclassified, y_i^2 = 1, and x_i^T x_i = ||x_i||_2^2 ≤ r^2 by Assumption 2. Hence ||w'||_2^2 ≤ ||w||_2^2 + r^2.

Note that since we started with w = 0 at the first iteration, we had ||w||_2^2 = 0 at the start, so after p updates we necessarily have ||w||_2^2 ≤ p r^2.

(14)

• Result:

Thus we obtain

1 ≥ (w^T w*) / (||w||_2 ||w*||_2) ≥ pγ / √(p r^2),

where p is the number of updates, ||w*||_2 = 1 by Assumption 1, and the first inequality is an instance of the well known inequality

(x^T y) / (||x||_2 ||y||_2) ≤ 1, for any two vectors x and y.

Rewriting the inequality we obtain:

p ≤ r^2 / γ^2
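A quick numerical illustration of the bound (added, not from the slides): if every training point satisfies ||x_i||_2 ≤ r = 10 and the margin is γ = 0.1, then the perceptron makes at most p ≤ r^2/γ^2 = 100/0.01 = 10 000 updates before converging, no matter how many training examples there are.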

(15)

Perceptron algorithm: Non-convergence

• We just proved that if the training data is linearly separable, the perceptron algorithm converges after a finite number of steps.

• However, the finite number can be very large if the margin is not wide.

• If the training data is not linearly separable, the algorithm does not converge, because it simply cannot (by definition) get through a full sweep of the data without performing an update.

• What then?

(16)

Other linear classifiers

• Linear Discriminant Analysis (LDA): classic statistical method

• Support Vector Machine (SVM)

  – often used with a kernel function to allow adding a very large number of new features

  – scaling to extremely large data sets problematic

  – not covered in this class, see Section 5.5 of the textbook

• Logistic regression

  – replaces the 0-1 loss with a continuous loss, does probabilistic predictions

  – covered a bit later in this class

(17)

Perceptron algorithm: dealing with non-convergence

• If the Perceptron algorithm does not converge, how do you pick a final w to use on new data?

• Some techniques that are generally useful for improving the convergence of iterative algorithms can help here:

  – Use a smaller learning rate, as in w := w + λ y_i x_i for misclassified points (x_i, y_i), with λ > 0 decreasing for each sweep

  – Take a running average of recent weight vectors

• The Pocket Algorithm (Gallant 1990) is a simple method to get a final w when the iterations don't converge

(18)

The Pocket algorithm

• Run the Perceptron algorithm, updating w as usual

• Keep a score for the current w by simply counting how many correct predictions w has made in a row

• The algorithm has a “pocket” where it stores the weight vector w_p that has had the highest score thus far, and its score score_p

• Initially w_p = 0 and score_p = 0

• When a full epoch passes without the contents of the pocket changing, the algorithm gives w_p as its final result and terminates

(19)

The Pocket algorithm (2)

Corresponding to each step i of the Perceptron algorithm:

• if the prediction was correct (sign(w^T x_i) = y_i)

  – increase the score: score := score + 1

• otherwise (if sign(w^T x_i) ≠ y_i)

  – if the previous high score was beaten, i.e. score > score_p, then set w_p := w and score_p := score

  – update w := w + y_i x_i as usual

  – reset score := 0.
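A minimal runnable Python/NumPy transcription of these steps (an added sketch; the function and variable names, and the cap of T sweeps, are assumptions not on the slides):

import numpy as np

def pocket(X, y, T=100):
    """Pocket algorithm: perceptron updates, keeping the longest-surviving w in the 'pocket'."""
    N, d = X.shape
    w = np.zeros(d)
    w_p, score_p = np.zeros(d), 0      # pocket contents: best weight vector and its run length
    score = 0                          # correct predictions in a row for the current w
    for _ in range(T):
        pocket_changed = False
        for i in range(N):
            if np.sign(w @ X[i]) == y[i]:
                score += 1             # correct prediction: extend the current run
            else:
                if score > score_p:    # current w beat the previous high score
                    w_p, score_p = w.copy(), score
                    pocket_changed = True
                w = w + y[i] * X[i]    # usual perceptron update
                score = 0              # start a new run for the updated w
        if not pocket_changed:         # a full epoch without the pocket changing: stop
            return w_p
    return w_p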

(20)

Logistic regression

• Instead of giving a ‘forced choice’ prediction ŷ = sign(w^T x), we could give probabilistic predictions:

P(y = +1 | x) = f(w^T x) = 1 / (1 + exp(−w^T x))

Here f is the logistic function, and such a classifier is known as logistic regression

• In what follows, we again use y ∈ {1, 0} for the labels

(21)

Logistic regression: Illustration

• The constant-probability surfaces of P(y = +1 | x) are hyperplanes, with the .5 hyperplane going through the origin:

[Figure: parallel constant-probability hyperplanes, e.g. the .9 level alongside the .5 level through the origin]

(22)

Logistic regression: Learning

• In principle, we could use any proper cost function, and minimize the cost on the training set

• In practice, the common choice is to maximize the likelihood, equivalent to minimizing the logarithmic cost. This gives the cost function

C(w) = − Σ_{i=1}^{N} [ y_i log f(w^T x_i) + (1 − y_i) log f(−w^T x_i) ]

• Need to minimize C(w) with respect to w

• Recommended: Use existing logistic regression fitting procedures (in Matlab/Octave/R)

• A simple, but not very efficient, alternative: gradient descent (if you want to implement something simple yourself)
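As a hedged illustration of the gradient-descent option (an added sketch, not the course's reference implementation; the names and the fixed step size are assumptions): differentiating the cost above gives ∇C(w) = −Σ_i (y_i − f(w^T x_i)) x_i, with labels y_i ∈ {0, 1}, which leads to the following minimal NumPy version:

import numpy as np

def logistic(z):
    """The logistic function f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, y, step=0.01, n_iter=1000):
    """Fit w by plain gradient descent on the logarithmic cost; y has labels in {0, 1}."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        p = logistic(X @ w)              # P(y = 1 | x_i) for every training point
        gradient = -X.T @ (y - p)        # gradient of C(w) = -sum_i (y_i - f(w^T x_i)) x_i
        w = w - step * gradient          # move against the gradient
    return w

Predicted probabilities for new inputs are then logistic(X_new @ w); in practice a library routine, as recommended above, will be faster and more robust.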

(23)

Logistic regression: Linearly separable case

• If the training set is linearly separable, the norm of w grows without bound

  – the cost on the training set approaches zero

  – P(y = 1 | x') ∈ {0, 1} for any new data point x'

• As in linear regression, we can regularize by adding a term...

  – λ||w||_2^2 (quadratic penalty), or

  – λ||w||_1 (absolute-value penalty, sparser solutions)

...to the cost function.
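Written out (an added restatement, combining the penalty with the cost defined earlier), the quadratic-penalty option corresponds to minimizing

C_reg(w) = − Σ_{i=1}^{N} [ y_i log f(w^T x_i) + (1 − y_i) log f(−w^T x_i) ] + λ ||w||_2^2,

whose gradient is the earlier one plus an extra 2λw term, so the gradient-descent sketch above needs only a one-line change.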

(24)

Logistic regression: Nonlinear features

• Again, by first computing features z = f(x) where f is a nonlinear function, and applying logistic regression to z, we can obtain nonlinear structure (see the sketch after the figure)

[Figure: weight vector w, inputs x_1, x_2, nonlinear features z_1, z_2, and constant-probability contours at levels .1, .3, .5, .7, .9]
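As a small illustration of the idea (an added sketch; this particular quadratic feature map is just one possible choice, not prescribed by the slides), the features can be computed first and then passed to the same fitting routine:

import numpy as np

def quadratic_features(X):
    """Map inputs (x1, x2) to nonlinear features z = (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

# Z = quadratic_features(X)
# w = logistic_regression_gd(Z, y)   # same linear method, now nonlinear in the original x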

(25)

Unsupervised learning

(26)

Supervised vs unsupervised learning

• In supervised learning, we are trying to learn a mapping from X to Y. The training data contains pairs (x_i, y_i), i.e. it includes the target outputs, hence the term supervised.

• In unsupervised learning, we are only given some data points x_i, and the goal is to perform some useful analysis on that data.

• Of course, what is useful varies from problem to problem...

• ...so there are many different types of unsupervised learning.

(see the examples on the next slides)

(27)

Unsupervised learning: Example 1

Clustering search results (http://search.carrot2.org/stable/search):

(28)

Unsupervised learning: Example 2

Learning the structure of natural language:

• Unigrams (‘bag of words’ model):

P(⟨word_1, ..., word_N⟩) = P(word_1) ··· P(word_N)

• Bigrams:

P(⟨word_1, ..., word_N⟩) = P(word_1) P(word_2 | word_1) ··· P(word_N | word_{N−1})

• n-grams: Condition on the previous n−1 words. Lots and lots of n-gram data available from Google!
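An added toy sketch of estimating bigram probabilities like the ones above by simple counting (maximum-likelihood estimates; a real language model would also need smoothing for unseen word pairs):

from collections import Counter

def bigram_probabilities(words):
    """Estimate P(next | previous) from a token list by relative frequencies."""
    pair_counts = Counter(zip(words[:-1], words[1:]))
    prev_counts = Counter(words[:-1])
    return {(prev, nxt): c / prev_counts[prev] for (prev, nxt), c in pair_counts.items()}

# Example: bigram_probabilities("the dog saw the cat".split())
# gives P(dog | the) = 0.5 and P(cat | the) = 0.5, among others.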

(29)

• Example: Text produced with a trigram model

Nonsense, Matriona will feed it.

It is always either vodka or brandy.

Yet I am sorry to leave.

You should not handle youthful egoism so roughly, sister.

What did I hurt my poor boy?

No, indeed, are ambition; for the first day.

Yes, they are singing across the water.

It is like a beggar beneath your window.

(see http://www.shiffman.net/teaching/a2z/generate/)

• Such a model may be quite helpful in statistical machine translation (because you want to produce good grammar in addition to a decent translation)

(30)

Unsupervised learning: Example 3

Process monitoring:

• Continuous-valued measurements of a factory process:

• We would like to automate the ‘staring at the curves’ job to detect fault conditions

• Standard classification problem? X is the measurements, Y the normal/fault label?

(31)

• Problem: Lots of measurements from ‘normal’ operating conditions, very few measurements from ‘fault’ conditions. No guarantee that new faults will ‘look like’ old ones...

• Solution? Create some form of model for what the ‘normal’ data looks like, and then, when monitoring, sound the alarm when a new vector is “too far” from any of the normal states (this is anomaly detection)
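One very simple way to realize this idea (an added sketch, not necessarily the approach intended on the slide; the threshold is an assumed tuning parameter): treat the stored ‘normal’ vectors themselves as the model and flag new vectors that are far from all of them.

import numpy as np

def is_anomaly(x_new, X_normal, threshold):
    """Flag x_new if its distance to the nearest 'normal' training vector exceeds the threshold."""
    distances = np.linalg.norm(X_normal - x_new, axis=1)   # distance to every normal example
    return distances.min() > threshold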

(32)

Unsupervised learning: Example 4

Association rule mining:

• In the NBA (U.S. basketball league), very detailed statistics are kept of players’ performance

• This results in a large database that can be searched for dependencies that otherwise may be missed

• The ‘Advanced Scout’ system looks for rules such as ‘when player X is on the field, the accuracy of player Y drops from 75% to 30%’

• Note: Beware of multiple testing effects (these have to be taken into account in the analysis)!

(33)

Unsupervised learning: Example 5

Solving the “cocktail-party problem”

• n sound sources (e.g. people speaking), n microphones at different points in the room


• Each microphone picks up a different combination of all the sound sources. The problem is to separate out the signals

• Online demo:
