What is machine learning?


(1)

582631 — 5 credits

Introduction to Machine Learning

Lecturer: Jyrki Kivinen

Assistants: Johannes Verwijnen and Amin Sorkhei

Department of Computer Science

University of Helsinki

earlier material created by Patrik Hoyer and others

27 October–11 December 2015

(2)

Introduction

I What is machine learning? Motivation & examples

I Definition

I Relation to other fields

I Examples

I Course outline and related courses

I Practical details of the course

I Lectures

I Exercises

I Exam

I Grading

(3)

What is machine learning?

I Definition:

machine = computer / computer program (in this course)

learning = improving performance on a given task, based on experience / examples

I In other words

I instead of the programmer writing explicit rules for how to solve a given problem, the programmer instructs the computer how to learn from examples

I in many cases the computer program can even become better at the task than the programmer is!

(4)

Example 1: tic-tac-toe

I How to program the computer to play tic-tac-toe?

I Option A: The programmer writes explicit rules, e.g. ‘if the opponent has two in a row, and the third is free, stop it by placing your mark there’, etc. (lots of work, difficult, not at all scalable!)

I Option B: Go through the game tree, choose optimally (for non-trivial games, this must be combined with some heuristics to restrict the tree size)

I Option C: Let the computer try out various strategies by playing against itself and others, and noting which strategies lead to winning and which to losing (=‘machine learning’)

(5)

I Arthur Samuel (1950s and 1960s):

I Computer program that learns to play checkers

I Program plays against itself thousands of times, learns which positions are good and which are bad (i.e. which lead to winning and which to losing)

I The computer program eventually becomes much better than the programmer.

(6)

Example 2: spam filter

I Programmer writes rules: “If it contains ‘viagra’ then it is spam.” (difficult, not user-adaptive)

I The user marks which mails are spam and which are legitimate, and the computer learns by itself which words are predictive

Example 2: spam filter

X is the set of all possible emails (strings)

Y is the set {spam, non-spam}

From: medshop@spam.com Subject: viagra cheap meds...

spam

From: my.professor@helsinki.fi Subject: important information

here’s how to ace the test... non-spam

... ...

From: mike@example.org Subject: you need to see this how to win $1,000,000...

?

(7)

Example 3: face recognition

I Face recognition is hot (facebook, apple; security; . . . )

I Programmer writes rules: “If short dark hair, big nose, then it is Mikko” (impossible! how do we judge the size of the nose?!)

I The computer is shown many (image, name) example pairs, and the computer learns which features of the images are predictive (difficult, but not impossible)

(Figure: example face images labelled patrik, antti, doris, patrik, . . . , followed by an unlabelled query image.)

(8)

Problem setup

I One definition of machine learning: A computer program improves its performance on a given task with experience (i.e. examples, data).

I So we need to separate

I Task: What is the problem that the program is solving?

I Performance measure: How is the performance of the program (when solving the given task) evaluated?

I Experience: What is the data (examples) that the program is using to improve its performance?

(9)

Related scientific disciplines (1)

I Artificial Intelligence (AI)

I Machine learning can be seen as ‘one approach’ towards implementing ‘intelligent’ machines (or at least machines that behave in a seemingly intelligent way).

I Artificial neural networks, computational neuroscience

I Inspired by and trying to mimic the function of biological brains, in order to make computers that learn from experience.

Modern machine learning really grew out of the neural networks boom in the 1980’s and early 1990’s.

I Pattern recognition

I Recognizing objects and identifying people in controlled or uncontrolled settings, from images, audio, etc. Such tasks typically require machine learning techniques.

(10)

Availability of data

I These days it is very easy to

I collect data (sensors are cheap, much information digital)

I store data (hard drives are big and cheap)

I transmit data (essentially free on the internet).

I The result? Everybody is collecting large quantities of data.

I Businesses: shops (market-basket data), search engines (web pages and user queries), financial sector (stocks, bonds, currencies etc), manufacturing (sensors of all kinds), social networking sites (facebook, twitter), anybody with a web server (hits, user activity)

I Science: genomes sequenced, gene expression data,

experiments in high-energy physics, images of remote galaxies, global ecosystem monitoring data, drug research and

development, public health data

I But how to benefit from it? Analysis is becoming key!

(11)

Big Data

I one definition: data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges (Oxford English Dictionary)

I 3V: volume, velocity, and variety (Doug Laney, 2001)

I a database may be able to handle a lot of data, but you can’t implement a machine learning algorithm as an SQL query

I on this course we do not consider technical issues relating to extremely large data sets

I basic principles of machine learning still apply, but many algorithms may be difficult to implement efficiently

(12)

Related scientific disciplines (2)

I Data mining

I Trying to identify interesting and useful associations and patterns in huge datasets

I Focus on scalable algorithms

I Example: shopping basket analysis

I Statistics

I historically, introductory courses on statistics tend to focus on hypothesis testing and some other basic problems

I however there’s a lot more to statistics than hypothesis testing

I there is a lot of interaction between research in machine learning, data mining and statistics

(13)

Example 4

I Prediction of search queries

I The programmer provides a standard dictionary (words and expressions change!)

I Previous search queries are used as examples!

(14)

Example 5

I Ranking search results:

I Various criteria for ranking results

I What do users click on after a given search?

Search engines can learn what users are looking for by collecting queries and the resulting clicks.

(15)

Example 6

I Detecting credit card fraud

I Credit card companies typically end up paying for fraud (stolen cards, stolen card numbers)

I Useful to try to detect fraud, for instance large transactions

I Important to be adaptive to the behaviors of customers, i.e. learn from existing data how users normally behave, and try to detect ‘unusual’ transactions

(16)

Example 7

I Self-driving cars:

I Sensors (radars, cameras) superior to humans

I How to make the computer react appropriately to the sensor data?

(17)

Example 8

I Character recognition:

I Automatically sorting mail (handwritten characters)

I Digitizing old books and newspapers into easily searchable format (printed characters)

(18)

Example 9

I Recommendation systems (‘collaborative filtering’):

I Amazon: ”Customers who bought X also bought Y”...

I Netflix: ”Based on your movie ratings, you might enjoy...”

Challenge: One million dollars ($1,000,000) prize money recently awarded!

(Figure: a partially filled ratings matrix for viewers Bill, Jack, Linda, John and Lucy and movies Fargo, Seven, Aliens, Leon and Avatar; ‘?’ marks ratings to be predicted.)

(19)

Example 10

I Machine translation:

I Traditional approach: Dictionary and explicit grammar

I More recently, statistical machine translation based on example data is increasingly being used

(20)

Example 11

I Online store website optimization:

I What items to present, what layout?

I What colors to use?

I Can significantly affect sales volume

I Experiment, and analyze the results! (lots of decisions on how exactly to experiment and how to ensure meaningful results)

(21)

Example 12

I Mining chat and discussion forums

I Breaking news

I Detecting outbreaks of infectious disease

I Tracking consumer sentiment about companies / products

(22)

Example 13

I Real-time sales and inventory management

I Picking up quickly on new trends (what’s hot at the moment?)

I Deciding on what to produce or order

(23)

Example 14

I Prediction of friends in Facebook, or prediction of who you’d like to follow on Twitter.

(24)

What about privacy?

I Users are surprisingly willing to sacrifice privacy to obtain useful services and benefits

I Regardless of what position you take on this issue, it is important to know what can and what cannot be done with various types of information (i.e. what the dangers are)

I ‘Privacy-preserving data mining’

I What type of statistics/data can be released without exposing sensitive personal information? (e.g. government statistics)

I Developing data mining algorithms that limit exposure of user data (e.g. ‘Collaborative filtering with privacy’, Canny 2002)

(25)

Course outline

I Introduction

I Ingredients of machine learning

I task

I models

I data

I Supervised learning

I classification

I regression

I evaluation and model selection

I Unsupervised learning

I clustering

I matrix decompositions

(26)

Related courses

I Various continuation courses at CS (spring 2015):

I Probabilistic Models (period III, plus optional project)

I Project in Practical Machine Learning (period III)

I Advanced Machine Learning (period IV)

I Data Mining (period III, plus optional project)

I Big Data Frameworks (period IV)

I A number of other specialized courses at CS department

I A number of courses at maths+stats

I Lots of courses at Aalto as well

(27)

Practical details (1)

I Lectures:

I 27 October (today) – 11 December

I Tuesdays and Fridays at 10:15–12:00 in Exactum C222

I Lecturer: Jyrki Kivinen

(Exactum B229a, jyrki.kivinen@cs.helsinki.fi)

I Language: English

I Based on parts of the course textbook (next slide)

I (previous instances of this course used a different textbook)

(28)

Practical details (2)

I Textbook:

I author: Peter Flach

I title: Machine Learning. The Art and Science of Algorithms that Make Sense of Data.

I publisher: Cambridge University Press (2012, first edition)

I author’s web page:

https://www.cs.bris.ac.uk/~flach/mlbook/

I we’ll cover Chapters 1–3, main ideas in Chapters 4–6, and quite a lot of Chapters 7–9

(29)

Practical details (3)

I Lecture material

I this set of slides (by JK, partly using previous years’ course materials) is intended for use as part of the actual lectures, together with the blackboard etc.

I the textbook author has a full set of much more polished slides available on his web page

I however the lectures will cover some issues in more detail than the textbook

I in particular some additional detail is needed for homework problems

I both the lectures and the assigned parts of the textbook are required material for the exam

(30)

Practical details (4)

I Exercises:

I course assistants: Johannes Verwijnen and Amin Sorkhei

I Learning by doing:

I mathematical exercises (pen-and-paper)

I computer exercises (support given in Python)

I Problem set handed out every Friday, focusing on topics from that week’s lectures

I Deadline for handing in your solutions is next Friday at 23:59.

I In the exercise session on the day before deadline (Thu 14:15–16:00), you can discuss the problems with the assistant and with other students.

I Attending exercise sessions is completely voluntary.

I Language of exercise sessions: English

I Exercise points make up 40% of your total grade; you must get at least half the points to be eligible for the course exam.

I Details will appear on the course web page.

(31)

Practical details (5)

I Exercises this week:

I No regular exercise session this week.

I Instead: instruction on Python and its libraries that are useful on this course

I Tuesday 27 October (today!) at 12:15–16:00 in B221

I Voluntary, no points awarded. Recommended for everyone not previously familiar with Python.

I Assistant will be available between 12:15–16:00; you don’t have to be there at 12 and may leave before 16

I You don’t have to use Python for the course work if you prefer some other language. Ask the course assistants about which alternatives are acceptable.

(32)

Practical details (6)

I Course exam:

I 16 December at 9:00 (double-check a few days before the exam)

I Constitutes 60% of your course grade

I Must get a minimum of half the points of the exam to pass the course

I Pen-and-paper problems, similar style as in the exercises (also ‘essay’ or ‘explain’ problems)

I (Note: To be eligible to take a ‘separate exam’ you need to first complete some programming assignments. These will be available on the course web page a bit later. However since you are here at the lecture, this probably does not concern you.)

I You may answer exam problems also in Finnish or Swedish.

(33)

Practical details (7)

I Grading:

I Exercises: (typically: 3 pen-and-paper and 1 programming problem per week)

I Programming problem graded to 0–15 points

I Pen-and-paper problems graded to 0–3 points

I First week’s Python exercises: Voluntary, no points

I Late homework policy will be explained on course web page

I Exam: (4–5 problems)

I Pen-and-paper: 0–6 points/problem (tentative)

I Rescaling done so that 40% of total points come from exercises, 60% from exam

I Half of all total points required for lowest grade, close to maximum total points for highest grade

I Note: Must get at least half the points of the exam, and must get at least half the points available from the exercises

(34)

Practical details (8)

I Prerequisites:

I Mathematics: Basics of probability theory and statistics, linear algebra and real analysis

I Computer science: Basics of programming (but no previous familiarity with Python necessary)

I Prerequisites quiz!

I For you to get a sense of how well you know the prerequisites

I For me to get a sense of how well you (in aggregate!) know the prerequisites. Fully anonymous!

(35)

Practical details (9)

I Course material:

I Webpage (public information about the course):

http://www.cs.helsinki.fi/en/courses/582631/2015/s/k/1

I Sign up in Ilmo (department registration system)

I Help?

I Ask the assistants/lecturer at exercises/lectures

I Contact assistants/lecturer separately

(36)

Questions?

(37)

Ingredients of machine learning

(38)

Ingredients of machine learning: Outline

I We take a first look into some basic components of a machine learning scenario, such as task, data, features, model.

I We will soon get back to all this on a more technical level.

I Read Prologue and Chapter 1 from textbook.

(39)

Summary of the setting

I Task is what an end user actually wants to do.

I Model is a (hopefully good) solution to the task.

I Data consists of objects in the domain the user is interested in, with perhaps some additional information attached.

I Machine learning algorithm produces a model based on data.

I Features are how we represent the objects in the domain.

(40)

Task

I Task is an actual data processing problem some end user needs to solve.

I Examples were given in lecture 1 (image recognition, fraud detection, ranking web pages, collaborative filtering, . . . )

I Typically, a task involves getting some input and then producing the appropriate output.

I For example, in hand-written digit recognition, the input is a pixel matrix representing an image of a digit, and the output is one of the labels ’0’, . . . , ’9’.

I Machine learning is a way to find a solution (basically, an algorithm) for this data processing problem when it’s too complicated or poorly understood for a programmer (or an application specialist) to figure it out.

(41)

Supervised learning

Many common tasks belong to the area of supervised learning, where we need to produce some target value (often called a label):

I binary classification: divide inputs into two categories

I Example: classify e-mail messages into spam and non-spam

I multiclass classification: more than two categories

I Example: classify an image of a digit into one of the classes

’0’, . . . , ’9’

I multilabel classification: multiple classes, of which more than one may match simultaneously

I Example: classify news stories based on their topics

I regression: output is a real value

I Example: predict the value of a house based on its size, location etc.

(42)

Unsupervised learning

In unsupervised learning, there is no specific target value of interest.

Examples of unsupervised learning tasks:

I clustering: partition the given set of data into clusters so that elements that belong to the same cluster are similar (in terms of some given similarity measure)

I association rules: for example, given shopping cart contents of different customers, find product combinations that are often bought together

I dimensionality reduction: if the data is high-dimensional (each data point is described by a large number of variables), find an alternative lower-dimensional representation that retains as much of the structure as possible

(43)

Semisupervised learning

Unsupervised learning can be used as pre-processing to help supervised learning

I for example, classifying images

I Internet has as many non-classified images as we could ever want

I it is less easy to label the data (assign the correct class to each image)

I solution: use the unlabelled images to find a low-dimensional representation, then use a smaller set of labelled images to solve the hopefully easier lower-dimensional classification problem

(44)

Predictive vs. descriptive model

I Predictive model: our ultimate goal is to predict some target variable

I Descriptive model: no specific target variable, we want to understand the data

I distinction between supervised vs. unsupervised learning is whether the target values are available to the learning algorithm

I Examples:

I predictive and supervised: classification, regression

I descriptive and unsupervised: clustering

I descriptive and supervised: find subgroups that behave differently with respect to the target variable

I predictive and unsupervised: clustering, with the clusters then used to assign classes

(45)

Evaluating performance on a task

I When we apply machine learning techniques, we have available some training data which we use to come up with a good model

I However, what really counts is performance on future unseen data that was not in the training set; this is called generalisation

I Overfitting is the mistake of building a too complicated model using too little training data

I results look very good on training data

I however performance on unseen data is bad

I to get an unbiased estimate of performance on unseen data, one can withhold part of the training data and use it as a test set when learning is finished (but not before!)

(46)

Latent variables

Latent (or hidden) variables describe underlying structure of the data that is not immediately obvious in the original input variables.

I Example: we have N persons and M movies, and an N×M matrix of scores where s_ij is the score given by person i to movie j.

I We might try to represent this using K latent variables (let’s call them genres, but remember that they are not given as part of the data):

I p_ik is 1 if person i likes genre k, and 0 otherwise

I q_kj is 1 if movie j belongs to genre k, and 0 otherwise

I w_k is the importance of belonging to genre k

(47)

Latent variables (2)

I if for some K ≪ min{M, N} we can find (p_ik), (q_kj) and (w_k) such that

∑_{k=1}^{K} w_k p_ik q_kj ≈ s_ij  for all i and j,

then we have found some interesting structure.

I In the new representation we have K(N+M+1) parameters, as opposed to the original NM.
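The reconstruction above can be sketched in plain Python; all numbers here are made-up toy values, not course data:

```python
# Toy check of the latent-variable representation: s_ij ≈ sum_k w_k * p_ik * q_kj.
N, M, K = 3, 4, 2
p = [[1, 0], [0, 1], [1, 1]]          # p[i][k]: person i likes genre k
q = [[1, 1, 0, 0], [0, 0, 1, 1]]      # q[k][j]: movie j belongs to genre k
w = [5, 3]                            # w[k]: importance of genre k

# Reconstruct the full N x M score matrix from the K latent variables.
s = [[sum(w[k] * p[i][k] * q[k][j] for k in range(K)) for j in range(M)]
     for i in range(N)]
print(s)  # [[5, 5, 0, 0], [0, 0, 3, 3], [5, 5, 3, 3]]

# Here K*(N+M+1) = 16 parameters versus N*M = 12, so the representation only
# compresses when K is much smaller than N and M.
```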

(48)

Features

I Features define the vocabulary we use to describe the objects in the domain of interest (the “original inputs” of the task).

I Finding the right features is extremely important.

I However it’s often also highly dependent on the application.

I Basic approaches to finding good features include

I ask someone who understands the domain

I use some standard set of features

I analyse the data

I On this course we see some examples of often used features, but mostly we don’t consider where the features come from.

(49)

Features (2)

I Examples of how features can be formed:

I take the Fourier transformation of an audio signal

I find edges, corners and other basic elements from an image

I scaling numerical inputs to have similar ranges

I representing a document as a bag of words (what words appear in the document, and how many times each)

I bag of words can be extended to consider pairs, triples etc. of words

I the kernel trick is a way of using certain types of features in certain types of learning algorithms in a computationally efficient manner

(50)

Similarity and dissimilarity

I Notions of similarity and dissimilarity between objects are important in machine learning

I clustering tries to group similar objects together

I many classification algorithms are based on the idea that similar objects tend to belong to same class

I etc.

I We cover the topic here a bit more thoroughly than the textbook since this is one of the topics for the first homework

I You should also read textbook Section 8.1 on distance measures

(51)

I Examples: think about suitable similarity measures for

I handwritten letters

I segments of DNA

I text documents

“Parliament overwhelmingly approved amendments to the Firearms Act on Wednesday. The new law requires physicians to inform authorities of individuals they consider unfit to own guns. It also increases the age for handgun ownership from 18 to 20.”

“Parliament's Committee for Constitutional Law says that persons applying for handgun licences should not be required to join a gun club in the future. Government is proposing that handgun users be part of a gun association for at least two years.”

“The cabinet on Wednesday will be looking at a controversial package of proposed changes in the curriculum in the nation's comprehensive schools.

The most divisive issue is expected to be the question of expanded language teaching.”

ACCTGTCG ATCCTGTG

TCGATTGC

(52)

I Similarity: s

I Numerical measure of the degree to which two objects arealike

I Higher for objects that are alike

I Typically between 0 (no similarity) and 1 (completely similar)

I Dissimilarity: d

I Numerical measure of the degree to which two objects are different

I Higher for objects that are different

I Typically between 0 (no difference) and ∞ (completely different)

I Transformations

I Converting from one to the other

I Use similarity or dissimilarity measures?

I Method-specific

(53)

Similarity for bit vectors

I Objects are often represented by d-bit vectors for some d

I This may arise naturally, or because we represent objects using d features which are all binary valued

I Consider now two d-bit vectors

I Let f_ab denote the number of positions where the first vector has a and the second has b

I Thus f_00 + f_01 + f_10 + f_11 = d

(54)

Similarity for bit vectors (2)

I Hamming distance (dissimilarity):

H = f_01 + f_10    (1)

I Simple matching coefficient:

SMC = (f_11 + f_00) / (f_11 + f_00 + f_01 + f_10)    (2)

I Jaccard coefficient:

J = f_11 / (f_11 + f_01 + f_10)    (3)
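The three bit-vector measures can be sketched in Python for two equal-length 0/1 lists (a minimal illustration, not course code):

```python
def pair_counts(x, y):
    # f[(a, b)] = number of positions where x has a and y has b
    f = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for a, b in zip(x, y):
        f[(a, b)] += 1
    return f

def hamming(x, y):
    f = pair_counts(x, y)
    return f[(0, 1)] + f[(1, 0)]          # H = f01 + f10

def smc(x, y):
    f = pair_counts(x, y)
    return (f[(1, 1)] + f[(0, 0)]) / len(x)  # denominator = d

def jaccard(x, y):
    f = pair_counts(x, y)
    return f[(1, 1)] / (f[(1, 1)] + f[(0, 1)] + f[(1, 0)])

x = [1, 0, 1, 1, 0, 0]
y = [1, 1, 0, 1, 0, 0]
print(hamming(x, y))  # 2
print(smc(x, y))      # 4/6 ≈ 0.667
print(jaccard(x, y))  # 0.5
```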

(55)

Dissimilarity in R^d

I We start by defining the p-norm (or L_p norm) for a d-dimensional real vector z = (z_1, . . . , z_d) ∈ R^d as

‖z‖_p = ( ∑_{i=1}^{d} |z_i|^p )^{1/p}

I For p ≥ 1 this actually satisfies the definition of a norm (see some book on real analysis)

I We then define the Minkowski distance between x and y as Dis_p(x, y) = ‖x − y‖_p

I p = 2: Euclidean distance

I p = 1: “Manhattan” or ”city block” distance
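A minimal Python sketch of the Minkowski distance defined above (toy points; not part of the course materials):

```python
def minkowski(x, y, p):
    # Dis_p(x, y) = ( sum_i |x_i - y_i|^p )^(1/p)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski(x, y, 2))  # 5.0  (Euclidean)
print(minkowski(x, y, 1))  # 7.0  (Manhattan)
```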

(56)

Dissimilarity in R^d (2)

I For p ≥ 1, Minkowski distance is a metric:

I Dis(x, x) = 0

I Dis(x, y) > 0 if x ≠ y

I Dis(x, y) = Dis(y, x)

I triangle inequality: Dis(x, z) ≤ Dis(x, y) + Dis(y, z)

I For p < 1 the triangle inequality does not hold

(57)

Dissimilarity in R^d (3)

I Considering the limit p → ∞ leads to defining also

‖z‖_∞ = max{ |z_i| : i = 1, . . . , d }

I We also define the 0-norm as the number of non-zero components:

‖z‖_0 = ∑_{i=1}^{d} I[z_i ≠ 0],

where I[φ] = 1 if φ is true and I[φ] = 0 otherwise.

(58)

Chess problem

Réti 1921: White to play and draw

(59)

L∞ distance in chess

I Textbook makes some points about distance measures in chess (pp. 232–233)

I Perhaps the most interesting point is that movements of a King follow the L∞ distance

I moves diagonally as fast as horizontally and vertically

I This leads to some counterintuitive results (see famous problem on previous slide)

(60)

Mahalanobis distance

I Given any positive definite symmetric matrix M we can define a new metric by

Dis_M(x, y) = √( (x − y)^T M (x − y) )

I Euclidean distance is the special case M = I (identity matrix)

I If M is diagonal, this is a rescaling that gives different weights to different coordinates

I More generally this can represent first changing the coordinates by rotating the vectors and then rescaling the new coordinates
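A small Python sketch, restricted to a diagonal M for simplicity (the helper name `mahalanobis_diag` and the weights are illustrative assumptions, not from the course):

```python
import math

def mahalanobis_diag(x, y, m_diag):
    # Dis_M(x, y) = sqrt((x-y)^T M (x-y)) for the diagonal case M = diag(m_diag)
    return math.sqrt(sum(mi * (xi - yi) ** 2
                         for mi, xi, yi in zip(m_diag, x, y)))

x, y = (0.0, 0.0), (3.0, 4.0)
print(mahalanobis_diag(x, y, (1.0, 1.0)))  # 5.0: M = I gives Euclidean distance
print(mahalanobis_diag(x, y, (4.0, 1.0)))  # sqrt(4*9 + 16) ≈ 7.211
```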

(61)

Similarity in R^d

I Cosine similarity (ranges from −1 to 1):

cos(x, y) = (x · y) / (‖x‖_2 ‖y‖_2), where x · y = ∑_{i=1}^{d} x_i y_i

I Pearson’s correlation coefficient (ranges from −1 to 1):

r = ∑_{i=1}^{d} (x_i − x̄)(y_i − ȳ) / ( √(∑_{i=1}^{d} (x_i − x̄)²) · √(∑_{i=1}^{d} (y_i − ȳ)²) ),

where x̄ = (1/d) ∑_{i=1}^{d} x_i, and ȳ similarly
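Both measures can be sketched in plain Python (toy vectors, not course data):

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def pearson(x, y):
    d = len(x)
    mx, my = sum(x) / d, sum(y) / d          # means x-bar, y-bar
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(cosine(x, y))   # ≈ 1.0 (same direction)
print(pearson(x, y))  # ≈ 1.0 (perfect linear relationship)
```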

(62)

Models: the output of machine learning

I We have a preliminary look into what kinds of models are commonly considered

I We cover this topic a bit more lightly than the textbook (Section 1.2) since it will be covered in much more detail when we get to actual machine learning algorithms

I The various classifications of models should not be taken too rigidly. The purpose is just to see some different ideas we can apply when thinking about models

(63)

Geometric models

I Instances are the objects in our domain that we wish to, say, classify.

I Instance space is the set of all conceivable instances our model should be able to handle

I Geometric models treat instances as vectors in Rd and use notions such as

I angle between vectors

I distance, length of a vector

I dot product

(64)

Nearest neighbour classifier

I Nearest-neighbour classifier is a simple geometric model based on distances:

I store all the training data

I to classify a new data point, find the closest one in the training set and use its class

I More generally, the k-nearest-neighbour classifier (for some integer k) finds the k nearest points in the training set and uses the majority class (ties are broken arbitrarily)

I Different notions of distance can be used, but Euclidean is the most obvious
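A minimal 1-nearest-neighbour sketch in Python with Euclidean distance (toy data; `nn_classify` is an illustrative helper, not course code):

```python
def nn_classify(train, x):
    # train: list of (point, label) pairs; x: query point as a tuple of floats
    def dist2(a, b):
        # squared Euclidean distance (ordering is the same as for the distance)
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(train, key=lambda pl: dist2(pl[0], x))
    return label

train = [((0.0, 0.0), -1), ((0.0, 1.0), -1),
         ((3.0, 3.0), 1), ((4.0, 3.0), 1)]
print(nn_classify(train, (3.5, 2.5)))  # 1
print(nn_classify(train, (0.5, 0.5)))  # -1
```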

(65)

Linear classifier

I Linear classifier is a simple non-local geometric model

I Any weight vector w ∈ R^d defines a classifier

f(x) = +1 if w · x > 0, −1 otherwise

I We divide instances into classes along the hyperplane {x | w · x = 0}

I If we want to consider hyperplanes that do not pass through the origin, we may replace the condition by w · x + b > 0, where b is another parameter (to be determined by the learning algorithm, like w)

(66)

Basic linear classifier

I We learn a simple binary classifier from training data {(x_i, y_i) | i = 1, . . . , n}, where x_i ∈ R^d and y_i ∈ {−1, +1}

I Let P = {x_i | y_i = +1} be the set of positive examples, with centre point

p = (1/|P|) ∑_{x∈P} x

I Similarly, let n be the centre of the negative examples

I Now w = p − n is the vector from n to p, and m = (p + n)/2 is the point in the middle

(67)

Basic linear classifier (2)

I The basic linear classifier is given by

f(x) = +1 if w · x > w · m, −1 otherwise

I It splits the instance space into a positive and a negative part through the mid-point between p and n

I We can also interpret the basic linear classifier as a distance-based model that classifies x depending on which of the prototypes p and n is closer
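A Python sketch of the basic linear classifier described on the previous slide: compute the centroids p and n, set w = p − n and threshold at w · m (toy points; not from the course materials):

```python
def centroid(points):
    d = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(d))

def basic_linear_classifier(pos, neg):
    p, n = centroid(pos), centroid(neg)
    w = tuple(pi - ni for pi, ni in zip(p, n))        # w = p - n
    m = tuple((pi + ni) / 2 for pi, ni in zip(p, n))  # midpoint m = (p + n)/2
    t = sum(wi * mi for wi, mi in zip(w, m))          # threshold w . m
    def f(x):
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) > t else -1
    return f

f = basic_linear_classifier(pos=[(2.0, 2.0), (3.0, 3.0)],
                            neg=[(0.0, 0.0), (1.0, 0.0)])
print(f((3.0, 2.0)))  # 1
print(f((0.0, 1.0)))  # -1
```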

(68)

Probabilistic models

I Probabilistic approach to machine learning is in particular used in generative learning

I For example in supervised learning, the task is to predict some target value Y given observed values X

I However in practice we don’t usually think that X somehow causes Y

I E.g. in medicine, we need to predict disease based on symptoms, but causally it’s the disease that causes the symptoms

(69)

Probabilistic models (2)

I A generative model is a joint distribution P(X, Y) for both X and Y

I Given P(X, Y), we can predict Y based on X:

P(Y | X) = P(X, Y) / P(X), where P(X) = ∑_Y P(X, Y)
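A toy Python illustration of predicting Y from a generative joint distribution using the formula above (the distribution and outcome names are made up for the example):

```python
# Joint table P(X, Y) over X in {"x1", "x2"} and Y in {+1, -1}.
joint = {("x1", +1): 0.3, ("x1", -1): 0.1,
         ("x2", +1): 0.2, ("x2", -1): 0.4}

def p_y_given_x(y, x):
    # P(Y=y | X=x) = P(X=x, Y=y) / P(X=x), with P(X=x) = sum over Y
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return joint[(x, y)] / p_x

print(p_y_given_x(+1, "x1"))  # 0.3 / 0.4 = 0.75
print(p_y_given_x(+1, "x2"))  # 0.2 / 0.6 ≈ 0.333
```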

(70)

Probabilistic models (3)

I In contrast to a generative model, a discriminative model just tries to find out what separates, say, the X values associated with Y = +1 from the X values associated with Y = −1

I In terms of probabilities, discriminative learning can be seen as modelling just P(Y | X) and ignoring P(X)

I Avoids solving a more difficult problem than we really need

I However, it throws away some information

(71)

Logical models

I Typical logical models are based on rules of the form “if condition then label”, where the condition contains feature values and possibly logical connectives

I Decision trees are a popular logical model class that can textually be represented as a nested if-then-else structure

I Logical models are often more easily understood by humans

I But large decision trees can be very confusing

(72)

Grouping and grading

I Linear models are typical grading models

I w · x gets arbitrary real values

I for any x_1 and x_2 with w · x_1 < w · x_2, we can find t such that w · x_1 < t < w · x_2

I Decision tree models are typical grouping models

I Usually a large number of instances end up in the same leaf of the tree

I If two instances are in the same leaf, they always receive the same label

I Grading models allow for more fine grained separation of instances than grouping models

(73)

Binary classification and related tasks

(74)

Binary classification: Outline

I We consider more closely the binary classification task

I In particular, we consider performance measures

I Ranking is closely related to binary classification

I Many learning algorithms actually solve a ranking problem

I Ranking can then be converted into a binary classifier

I We consider the notion of Bayes optimality in a bit more detail than the textbook

I Read Sections 2.1 and 2.2 from the textbook; we defer Section 2.3 until we have more to say about learning probabilistic models

(75)

Basic setting

I Instance space X: inputs of the model

I Output space Y: outputs of the model

I In supervised learning, the input to the learning algorithm typically consists of examples (x, y) ∈ X × Y that somehow exhibit the desired behaviour of the model

I We may also have a separate label space L ≠ Y, so that examples are from X × L

I for example, labels are classes, outputs class probabilities

(76)

Supervised learning tasks

I classification: L = Y = C, where C = {C_1, . . . , C_k} is the set of classes

I the case |C| = 2 is called binary classification

I multilabel classification: L = Y = 2^C, where 2^A denotes the power set of A (the set of all subsets of A)

I scoring and ranking: L = C and Y = R^{|C|}

I probability prediction: L = C and Y = [0, 1]^{|C|}

I regression: L = Y = R

(77)

Supervised learning

I Input of the learning algorithm is a training set Tr ⊂ X × L consisting of pairs (x, l) where l ∈ L

I Ideally the training set consists of pairs (x, l(x)) for some unknown function l : X → L which we would like to know (sometimes called the target function)

I Output of the learning algorithm is a function l̂ : X → L (often called a hypothesis) that hopefully is a good approximation of l

I In practice the training set does not contain perfect examples of a target function

I instance and label get corrupted by noise

I there may not be sufficient information to fully determine the label

(78)

Supervised learning (2)

I To evaluate the performance of the hypothesis l̂, we need a test set Te ⊂ X × L

I sometimes both Tr and Te are given

I more often you are just given some data and need to split it into Tr and Te yourself

I Ideally for all (x, l) ∈ Te we have l = l̂(x)

I We shall look into more realistic performance measures very soon

I In all cases it is important not to use the test data during training (i.e. until you have fully decided on the output l̂)

I we are interested in performance on new unseen data

I evaluating on the same data that was used for learning gives over-optimistic results and leads to too complicated hypotheses (overfitting)

(79)

Supervised learning (3)

I Usually the instances are given in terms of a finite number of features

I If we have d features, and feature i takes values in domain Fi, we have X = F1 × · · · × Fd.

I We don’t here consider where the features come from

I Notice that although any feature values can usually be encoded as real numbers, this may lead to problems if done carelessly

I suppose we encode feature Country so that Germany is 17, Finland is 18 and France is 19

I a geometric learning algorithm might interpret that Finland = (Germany + France)/2

(80)

Assessing classifier performance

I Notation:

I c(x) is the true label of instance x

I ĉ(x) is the label predicted for x by the classifier we are assessing

I In binary classification labels are in {−1, +1}

I Te+ is the set of all positive examples:
Te+ = {x | (x, +1) ∈ Te} = {x ∈ Te | c(x) = +1}

I Te− is the set of all negative examples

I Pos = |Te+| and Neg = |Te−|

(81)

Contingency table

I Also known as a confusion matrix

I Generally an |L| × |Y| matrix where cell (l, y) has the number of instances x such that c(x) = l and ĉ(x) = y

I We consider binary classification and hence 2 × 2 matrices:

             predict +   predict −   total
actual +     TP          FN          Pos
actual −     FP          TN          Neg
total        TP+FP       FN+TN       |Te|

I P/N: prediction is + or −

I T/F: prediction is correct (“true”) or incorrect (“false”)

I row and column sums are called marginals

(82)

Performance metrics for binary classification

I Accuracy = (TP+TN)/(TP+TN+FP+FN)

I Error rate = 1−Accuracy

I True positive rate (‘sensitivity’) = TP/(TP+FN)

I True negative rate (‘specificity’) = TN/(TN+FP)

I False positive rate = FP/(TN+FP)

I False negative rate = FN/(TP+FN)

I Recall = TP/(TP+FN)

I Precision = TP/(TP+FP)
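As a minimal sketch, the metrics above can be computed directly from the four contingency-table counts; the function and argument names below are illustrative, not from the course material:

```python
def binary_metrics(tp, fn, fp, tn):
    """Compute the binary-classification metrics listed above
    from the four contingency-table counts."""
    pos, neg = tp + fn, fp + tn          # actual positives / negatives
    return {
        "accuracy":  (tp + tn) / (pos + neg),
        "error":     (fp + fn) / (pos + neg),
        "tpr":       tp / pos,           # sensitivity = recall
        "tnr":       tn / neg,           # specificity
        "fpr":       fp / neg,
        "fnr":       fn / pos,
        "precision": tp / (tp + fp),
    }

# Example counts: 27 TP, 3 FN, 6 FP, 38 TN
m = binary_metrics(tp=27, fn=3, fp=6, tn=38)
```

Note that accuracy and error rate always sum to 1, and recall equals the true positive rate.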

(83)

Performance metrics (2)

I If we want to summarise classifier performance in one number (instead of the whole contingency table), accuracy is most commonly used

I However sometimes considering just accuracy is not enough

I unbalanced class distribution (e.g. information retrieval)

I different cost for false positive and false negative (e.g. spam filtering, medical diagnostics)

I We will soon consider more closely the idea of assigning different costs to different mistakes

(84)

Coverage plot

I Coverage plot is a way to visualise the performance of a binary classifier on a data set

I Horizontal axis ranges from 0 to Neg

I Vertical axis ranges from 0 to Pos

I An algorithm is represented by the point (FP,TP)

I up and left is good

I down and right is bad

(85)

Coverage plot (2)

I To compare two algorithms A and B on the same data set, we plot their corresponding points (FPA, TPA) and (FPB, TPB)

I If A is up and to the left of B, we say that A dominates B

I In this case FPA < FPB and TPA > TPB

I Since Pos = TP + FN, we also have FNA < FNB

I Hence A is more accurate than B on both classes

I If A is up and to the right of B, there is no clear comparison

I A makes fewer false negatives but more false positives than B

I In particular, if A and B are on the same line with 45 degree slope, they have the same accuracy

I TPA − TPB = FPA − FPB, so FPA + FNA = FPB + FNB

(86)

Coverage plot (3)

I Coverage plots in the range [0, Neg] × [0, Pos] are called unnormalised

I It is also common to scale the coordinates to range in [0,1]

I We then plot an algorithm at point (fpr,tpr)

I If Pos ≠ Neg, the slopes in the figure change

I Lines with 45 degree slope now connect algorithms with the same average recall, defined as (tpr + tnr)/2

I Normalised coverage plots are known as ROC plots (from Receiver Operating Characteristic in signal processing)

(87)

Cost function for classification

I Consider classification with k classes c1, . . . , ck

I Let L(ci, cj) be the cost or loss incurred when we predict ĉ(x) = ci but the correct label is c(x) = cj

I ideally these are actual costs related to the application

I in practice we may not have any real costs available

I The default choice is 0-1 loss, which just counts the number of mistakes:

L01(ci, cj) = 0 if ci = cj, and 1 otherwise

I Generally we may have an arbitrary k × k cost matrix where element (j, i) contains L(ci, cj)

I Usually we can assume that diagonal entries are 0

(88)

Total cost of classification

I Given a confusion matrix and a cost matrix, we can calculate the total cost of the classifications by elementwise multiplication

I Example:

Confusion matrix:
             predict +   predict −
actual +     27          3
actual −     6           38

Cost matrix:
             predict +   predict −
actual +     0           30
actual −     1           0

I Here the total cost is 27·0 + 3·30 + 6·1 + 38·0 = 96
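The elementwise multiply-and-sum above is a one-liner; here is a small sketch (names are illustrative) reproducing the calculation:

```python
def total_cost(confusion, cost):
    """Sum of the elementwise product of a confusion matrix and a cost matrix."""
    return sum(c * w
               for crow, wrow in zip(confusion, cost)
               for c, w in zip(crow, wrow))

# Rows: actual +/−, columns: predicted +/−, as in the tables above
confusion = [[27, 3], [6, 38]]
cost      = [[0, 30], [1, 0]]
result = total_cost(confusion, cost)   # 27·0 + 3·30 + 6·1 + 38·0 = 96
```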

(89)

Forced choice

I Suppose we have a probabilistic model that, for a given instance x, predicts p̂(+1) = a and p̂(−1) = 1 − a

I Suppose further that for the application we need to predict either +1 or−1 (say, to take a definite action); this is sometimes called forced choice

I Considering 0-1 loss, it seems intuitively clear that we should predict +1 if a > 1/2 and −1 if a < 1/2

I How about more general loss (say, the one from previous slide)?

I We choose the prediction that minimises the expected loss

I This is called the Bayes optimal prediction

(90)

Bayes optimal prediction: example

I So consider the cost matrix

             predict +   predict −
actual +     0           30
actual −     1           0

I and assume we have class probabilities p(+1) = a and p(−1) = 1 − a

I these can come from a probabilistic model we have learned, or from some other source; in any case, they represent what we believe

(91)

Bayes optimal prediction: example (2)

I If we predict +1, we have probability a of loss 0, and probability 1 − a of loss 1, so the expected loss is a · 0 + (1 − a) · 1 = 1 − a

I If we predict −1, we have probability a of loss 30, and probability 1 − a of loss 0, so the expected loss is a · 30 + (1 − a) · 0 = 30a

I So we should predict +1 if 1 − a < 30a, which holds if a > 1/31
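The expected-loss comparison can be checked numerically; this sketch (names are illustrative) encodes the cost matrix from the slide and picks the prediction with the smaller expected loss:

```python
def expected_loss(pred, a, cost):
    """Expected loss of predicting `pred` (+1 or -1) when P(+1) = a.
    cost[actual][predicted], with index 0 for class +1 and 1 for class -1."""
    i = 0 if pred == +1 else 1
    return a * cost[0][i] + (1 - a) * cost[1][i]

def bayes_optimal(a, cost):
    """Predict the class with the smaller expected loss."""
    return min((+1, -1), key=lambda pred: expected_loss(pred, a, cost))

cost = [[0, 30], [1, 0]]   # the cost matrix from the slide
# the crossover between the two expected losses occurs at a = 1/31
```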

(92)

Bayes optimality in general

I Bayes optimality applies to decision making also outside machine learning

I Consider the following scenario:

I we make an observation x

I we have a set of actions Y we can take

I for each action y ∈ Y and outcome l ∈ L there is a cost L(y, l) ∈ R

I We have a conditional probability P(l | x) over outcomes given x

I We want to choose the action y that minimises the expected loss

∑_{l ∈ L} L(y, l) P(l | x)

(93)

Bayesian classifier

I We now develop the idea of Bayes optimality a bit further for classifiers

I Suppose we have a probabilistic model P(x,y)

I Probabilistic learning algorithms often learn P(x, y) by separately learning P(y) and P(x | y), from which we get P(x, y) = P(x | y) P(y)

I From Bayes rule we get

P(y | x) = P(x, y) / P(x) = P(y) P(x | y) / P(x)

I Note that the denominator is a constant (so when maximizing the left-hand side with respect to y it can safely be ignored)

I We can also write it out as P(x) = ∑_y P(x, y)
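A dict-based sketch of this Bayes-rule computation (all names and the example numbers are illustrative): the joint P(x, y) is formed from prior and likelihood, and the denominator P(x) is just the sum of the joint over y.

```python
def posterior(prior, likelihood, x):
    """P(y | x) = P(y) P(x | y) / P(x), where P(x) = sum over y of P(x, y)."""
    joint = {y: prior[y] * likelihood[y][x] for y in prior}   # P(x, y)
    px = sum(joint.values())                                  # P(x)
    return {y: pxy / px for y, pxy in joint.items()}

# Illustrative numbers: uniform prior, likelihoods 0.8 and 0.2 for observation "x"
prior = {"+1": 0.5, "-1": 0.5}
likelihood = {"+1": {"x": 0.8}, "-1": {"x": 0.2}}
post = posterior(prior, likelihood, "x")
```

The normalisation by P(x) guarantees the posterior sums to one, which is why it can be ignored when only the arg max over y is needed.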

(94)

Bayesian classifier (2)

I Suppose we are now given probabilities P(x, y) and costs L(y, y′) for x ∈ X and y, y′ ∈ C

I We consider the expected cost of a classifier c : X → C, often called the risk:

Risk(c) = ∑_{(x, y) ∈ X × C} P(x, y) L(y, c(x))

I We want to find a minimum risk classifier

c* = arg min_c Risk(c)

I For historical reasons, the minimum cost classifier c* is called Bayes optimal, and its associated risk Risk(c*) the Bayes error or Bayes risk

(95)

Bayesian classifier (3)

I We can write the risk as

Risk(c) = ∑_{x ∈ X} P(x) ∑_{y ∈ C} P(y | x) L(y, c(x))

I We are choosing c(x) separately for each x, so

c*(x) = arg min_{y′} ∑_{y ∈ C} P(y | x) L(y, y′)

I In particular, for 0-1 loss we have

∑_{y ∈ C} P(y | x) L01(y, y′) = ∑_{y ≠ y′} P(y | x) = 1 − P(y′ | x)

so

c*(x) = arg max_y P(y | x)
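The pointwise arg min can be written as a short sketch (names are illustrative): given a posterior over classes and a loss function, pick the prediction minimising the expected loss; under 0-1 loss this reduces to the most probable class.

```python
def bayes_classifier(post, loss):
    """arg min over predictions yp of: sum over y of P(y|x) * L(y, yp)."""
    classes = list(post)
    def expected(yp):
        return sum(post[y] * loss(y, yp) for y in classes)
    return min(classes, key=expected)

# 0-1 loss: one unit of cost per mistake
zero_one = lambda y, yp: 0 if y == yp else 1
```

With `zero_one`, the classifier simply returns arg max_y P(y | x); with an asymmetric loss it can prefer a less probable class, as in the forced-choice example earlier.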

(96)

Cost functions for probabilistic classification

I Output of the model is now a probability distribution P̂(·) over C

I Labels are still individual classes c ∈ C

I Typical cost functions include

I logarithmic cost: L(P̂, c) = −log P̂(c)

I linear cost: L(P̂, c) = 1 − P̂(c)

(97)

Proper cost functions

I We can apply the idea of Bayes optimality also in probabilistic classification

I Suppose we have a “true” distribution P(·) over the classes

I Which probabilistic prediction P̂(·) should we choose to minimise the expected loss

∑_{c ∈ C} P(c) L(P̂, c) ?

I We call L a proper cost function if the answer is P̂ = P

I We next see that linear loss is not proper, whereas logarithmic loss is

(98)

Linear cost is not proper

I Say the true distribution is P(c = −1) = 1/4, P(c = +1) = 3/4

I Set the predicted distribution to P̂(c = −1) = b, P̂(c = +1) = 1 − b

I The expected cost is

E_c[L(P̂, c)] = E_c[1 − P̂(c)]
             = (1/4)(1 − b) + (3/4)(1 − (1 − b))
             = 1/4 + (1/2) b

I This is minimized for b = 0, not at the true probability b = 1/4.
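The calculation is easy to check numerically; in this sketch (names are illustrative) the true P(c = −1) is 1/4 and b is the predicted P̂(c = −1):

```python
def expected_linear_loss(b, true_minus=0.25):
    """E_c[1 - P̂(c)] when the true P(c=-1) = true_minus
    and the predicted P̂(c=-1) = b; note 1 - (1 - b) = b."""
    return true_minus * (1 - b) + (1 - true_minus) * b

# predicting b = 0 gives strictly lower expected loss than the truthful b = 1/4,
# which is exactly why linear cost is not proper
```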

(99)

Logarithmic cost is proper

I We show this for the binary case

I Say the true distribution is P(c = −1) = a and P(c = +1) = 1 − a for an arbitrary 0 ≤ a ≤ 1.

I Set the predicted distribution to P̂(c = −1) = b, P̂(c = +1) = 1 − b

I The expected logarithmic loss is

E_c[L(P̂, c)] = −a log b − (1 − a) log(1 − b)

I We claim this is minimised for b = a

(100)

Logarithmic cost is proper (2)

I Denote the expected logarithmic loss by f(b):

f(b) = −a log b − (1 − a) log(1 − b)

I Taking the derivative gives

f′(b) = −a/b + (1 − a)/(1 − b)

I Setting this to zero gives:

(1 − a)/(1 − b) = a/b ⟺ b − ab = a − ab ⟺ b = a

(101)

Logarithmic cost is proper (3)

I Take the second derivative:

f″(b) = a/b² + (1 − a)/(1 − b)² > 0

I Hence b=a is the unique minimum

I So the logarithmic cost is proper according to the definition.
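The derivation can be sanity-checked numerically: over a grid of candidate predictions b, the expected logarithmic loss is smallest at b = a. A minimal sketch (names are illustrative):

```python
import math

def expected_log_loss(a, b):
    """Expected logarithmic loss: -a log b - (1 - a) log(1 - b)."""
    return -a * math.log(b) - (1 - a) * math.log(1 - b)

a = 0.3
grid = [k / 100 for k in range(1, 100)]          # candidate b values in (0, 1)
best = min(grid, key=lambda b: expected_log_loss(a, b))
# since f is strictly convex with minimum at b = a, best equals 0.3 here
```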

(102)

Scoring

I A scoring classifier with k classes outputs for instance x a vector of k real-valued scores

ŝ(x) = (ŝ1(x), . . . , ŝk(x))

I A large value of ŝi(x) means x is likely to belong to class Ci

I In binary classification we usually just consider a single score, with large ŝ(x) denoting class +1 and small ŝ(x) denoting class −1

I For example, a linear model gives a scoring function

ŝ(x) = w · x − t

I We get a binary classifier naturally as ĉ(x) = sign(ŝ(x))
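A minimal sketch of the linear scoring function and the induced binary classifier (names and example numbers are illustrative; the tie-breaking at score 0 is an arbitrary convention, not specified on the slide):

```python
def linear_score(w, x, t):
    """Linear scoring function: s(x) = w · x - t."""
    return sum(wi * xi for wi, xi in zip(w, x)) - t

def classify(s):
    """Binary classifier from a score: the sign of s(x).
    The tie at s = 0 is broken toward +1 here."""
    return 1 if s >= 0 else -1

s = linear_score(w=[1.0, -2.0], x=[3.0, 1.0], t=0.5)   # 3 - 2 - 0.5 = 0.5
```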
