(1)

Decision trees and rule-based classifiers


(2)

Decision tree: An example

I Idea: Ask a sequence of questions (as in the ‘20 questions’

game) to infer the class

[Figure: a decision tree for classifying email as spam / not spam; the internal nodes test whether the message includes ‘netflix prize’, ‘viagra’, ‘millions’ or ‘meds’, and the yes/no branches lead to leaves labeled spam or not spam.]

(3)

Decision tree: A second example

(From Tan, Steinbach & Kumar, Introduction to Data Mining:)

Classification Techniques
- Decision Tree based Methods
- Rule-based Methods
- Memory based reasoning
- Neural Networks
- Naïve Bayes and Bayesian Belief Networks
- Support Vector Machines

Example of a Decision Tree

Training data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):

Tid  Refund  Marital Status  Taxable Income  Cheat
 1   Yes     Single          125K            No
 2   No      Married         100K            No
 3   No      Single           70K            No
 4   Yes     Married         120K            No
 5   No      Divorced         95K            Yes
 6   No      Married          60K            No
 7   Yes     Divorced        220K            No
 8   No      Single           85K            Yes
 9   No      Married          75K            No
10   No      Single           90K            Yes

Model (decision tree): the root splits on Refund (Yes → NO); for Refund = No, split on MarSt (Married → NO); for Single or Divorced, split on TaxInc (< 80K → NO, > 80K → YES).

I There can be many different trees that all work equally well!


(4)

Decision tree: Structure

I Structure of the tree:

I A single root node with no incoming edges, and zero or more outgoing edges

I Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges

I Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges

I Node contents:

I Each terminal node is assigned a prediction (here, for simplicity: a definite class label).

I Each non-terminal node defines a test, with the outgoing edges representing the various possible results of the test (here, for simplicity: a test only involves a single attribute)

(5)

Decision tree: 2D example

(From Tan, Steinbach & Kumar:)

Decision Boundary

[Figure: points in the unit square with continuous attributes x and y, partitioned by a small tree with the tests x < 0.43?, y < 0.47? and y < 0.33?, each branch labeled yes/no and ending in a class leaf; the resulting axis-parallel decision boundaries are drawn in the plot.]

- Border line between two neighboring regions of different classes is known as decision boundary
- Decision boundary is parallel to axes because test condition involves a single attribute at-a-time

Oblique Decision Trees

[Figure: an oblique split x + y < 1 separating Class = + from the other class.]

- Test condition may involve multiple attributes
- More expressive representation
- Finding optimal test condition is computationally expensive

I Notation: In this figure x and y are two continuous-valued attributes (i.e. y is not the class label in this figure!)

I Decision boundary consists of parts which all are parallel to the axes because each decision depends only on a single attribute


(6)

Learning a decision tree from data: General idea

I Simple idea: Recursively divide up the space into pieces which are as pure as possible

[Figure: training points in the (x1, x2) plane, x1 from 0 to 5 and x2 from 0 to 3, before any splits.]

(7)

Learning a decision tree from data: General idea

I Simple idea: Recursively divide up the space into pieces which are as pure as possible

[Figure: the same data after the first split, x1 > 2.8 (labeled A).]


(8)

Learning a decision tree from data: General idea

I Simple idea: Recursively divide up the space into pieces which are as pure as possible

[Figure: after the splits x1 > 2.8 (A) and x2 > 2.0 (B).]

(9)

Learning a decision tree from data: General idea

I Simple idea: Recursively divide up the space into pieces which are as pure as possible

[Figure: after the splits x1 > 2.8 (A), x2 > 2.0 and x2 > 1.2 (both labeled B).]


(10)

Learning a decision tree from data: General idea

I Simple idea: Recursively divide up the space into pieces which are as pure as possible

[Figure: after the splits x1 > 2.8 (A), x2 > 2.0 and x2 > 1.2 (B), and x1 > 4.3 (C).]

(11)

Learning a decision tree from data: Hunt’s algorithm

I Notation: Let Dt denote the set of records corresponding to node t. For the root node, Dt is the set of all training data.

I Hunt’s algorithm:

1. If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt

2. If Dt contains records that belong to more than one class, select an attribute test condition that partitions the records into smaller subsets. Create a child node for each outcome and distribute the records in Dt to the children. Apply the algorithm recursively to each child node.

If Dt is an empty set ⇒ use majority vote among parent records.
If all records in Dt are identical but the labels are not ⇒ use majority vote.

This general method is also known as TDIDT (Top-Down Induction of Decision Trees).
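To make steps 1 and 2 and the special cases concrete, here is a minimal sketch of Hunt's algorithm in Python. It is my own illustration rather than code from the course: the helper names, the Gini-based split search and the restriction to binary splits on numeric attributes are all simplifying assumptions.

```python
# Minimal TDIDT / Hunt's algorithm sketch (illustrative, not the course's code).
# Records are (x, y) pairs where x is a list of numeric attribute values.
from collections import Counter

def majority_class(labels):
    return Counter(labels).most_common(1)[0][0]

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(records):
    """Try all midpoints of all attributes; return (score, attr, threshold) with lowest weighted Gini."""
    best = None
    for a in range(len(records[0][0])):
        values = sorted({x[a] for x, _ in records})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [y for x, y in records if x[a] <= thr]
            right = [y for x, y in records if x[a] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(records)
            if best is None or score < best[0]:
                best = (score, a, thr)
    return best  # None if every attribute is constant over the records

def hunt(records, parent_labels=None):
    labels = [y for _, y in records]
    if not records:                       # empty D_t: majority vote among parent records
        return {"leaf": majority_class(parent_labels)}
    if len(set(labels)) == 1:             # all records in D_t share the same class y_t
        return {"leaf": labels[0]}
    split = best_split(records)
    if split is None:                     # identical attribute values, mixed labels: majority vote
        return {"leaf": majority_class(labels)}
    _, attr, thr = split
    left = [(x, y) for x, y in records if x[attr] <= thr]
    right = [(x, y) for x, y in records if x[attr] > thr]
    return {"attr": attr, "thr": thr,
            "left": hunt(left, labels), "right": hunt(right, labels)}

def predict(tree, x):
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["attr"]] <= tree["thr"] else tree["right"]
    return tree["leaf"]
```

On data such as the 2D example above, hunt produces a nested dictionary of threshold tests, and predict follows those tests down to a leaf label.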


(12)

Attribute test conditions

I Binary attributes: yes / no (two children only)

I Nominal (categorical) attributes with L states:

I Multiway split (L children)

I Binary split (2 children, any of the 2^(L−1) − 1 possible ways of splitting the states into two non-empty groups; see the sketch after this list)

I Ordinal attributes with L states:

I Multiway or binary split

I Must respect the ordering (only combine contiguous values)

I Continuous attributes:

I Multiway or binary split

I Defined using breakpoints
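To see where the 2^(L−1) − 1 count comes from, the following sketch (my own illustration) enumerates all binary splits of a nominal attribute into two non-empty groups; the marital-status values from the earlier example serve as test data.

```python
from itertools import combinations

def binary_splits(states):
    """All ways to split a set of nominal states into two non-empty groups.
    For L states there are 2**(L-1) - 1 such splits."""
    states = list(states)
    first, rest = states[0], states[1:]
    splits = []
    # Fix the first state on the left side so each unordered split is counted once.
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(states) - left
            if right:                      # both sides must be non-empty
                splits.append((left, right))
    return splits

splits = binary_splits(["Single", "Married", "Divorced"])
print(len(splits))   # 3 == 2**(3-1) - 1
for left, right in splits:
    print(left, "|", right)
```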

(13)

Impurity measures

How do we measure whether a subset of the records is ‘pure’ or ‘impure’?

I Denote by p(i|t) the fraction of records belonging to class i of all the records at node t.

I Impurity measures:

Entropy(t) = − ∑_{i=0}^{K−1} p(i|t) log₂ p(i|t)

Gini(t) = 1 − ∑_{i=0}^{K−1} p(i|t)²

Classification error(t) = 1 − max_i p(i|t)

where K is the total number of classes.
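For reference, the three measures are direct one-liners in Python (a sketch of my own; the argument p is assumed to be the vector of class fractions p(i|t) at a node):

```python
import math

def entropy(p):
    """Entropy(t) = -sum_i p(i|t) log2 p(i|t); 0*log 0 is taken as 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def gini(p):
    """Gini(t) = 1 - sum_i p(i|t)^2."""
    return 1.0 - sum(pi ** 2 for pi in p)

def classification_error(p):
    """Classification error(t) = 1 - max_i p(i|t)."""
    return 1.0 - max(p)

# All three are 0 for a pure node and largest for a uniform class distribution:
print(entropy([0.5, 0.5]), gini([0.5, 0.5]), classification_error([0.5, 0.5]))
# -> 1.0 0.5 0.5
```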


(14)

Impurity measures: Binary classification

[Figure: Entropy/2, Gini and misclassification error plotted as functions of p0 (the fraction of one class) in binary classification; all three are zero at p0 = 0 and p0 = 1 and peak at p0 = 0.5.]

I Qualitatively, the three measures agree. However, some differences in the selection of test attributes do occur.

(15)

Selecting the best split

I Test all valid splits

I Select the split which maximizes the gain ∆:

∆ = I(parent) − ∑_{j=1}^{k} (N(vj)/N) · I(vj),    (18)

where

I I(·) is the impurity of a given node

I N is the total number of records of the parent

I N(vj) is the number of records of child vj

I Note: Essentially just minimizing a weighted sum of the impurities of the children in the split
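Equation (18) translates directly into code. The sketch below is my own illustration: it uses the Gini index as the impurity I(·) and computes the gain ∆ from the class labels of the parent and of each child.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gain(parent_labels, children_labels, impurity=gini):
    """Delta = I(parent) - sum_j (N(v_j)/N) * I(v_j)   (eq. 18)."""
    n = len(parent_labels)
    weighted = sum(len(child) / n * impurity(child) for child in children_labels)
    return impurity(parent_labels) - weighted

parent = ["spam"] * 5 + ["ham"] * 5
split = [["spam"] * 4 + ["ham"], ["ham"] * 4 + ["spam"]]   # a fairly pure split
print(gain(parent, split))   # about 0.18: I(parent) = 0.5, weighted child impurity = 0.32
```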


(16)

I Computational issues:

I Binary attributes: Just one split to test per attribute

I Discrete attributes: Finite number of possible splits to test per attribute

I Continuous attributes: In principle, any value x ∈ R could be chosen as a breakpoint, but we only need to test midpoints between adjacent points. Computationally feasible by first sorting (complexity O(N log N)) the records, then stepping through in order while updating the necessary statistics.

(From Tan, Steinbach & Kumar:)

Continuous Attributes: Computing Gini Index...
- For efficient computation, for each attribute:
  - Sort the attribute on values
  - Linearly scan these values, each time updating the count matrix and computing the gini index
  - Choose the split position that has the least gini index

Cheat:              No    No    No    Yes   Yes   Yes   No    No    No    No
Taxable Income
(sorted values):    60    70    75    85    90    95    100   120   125   220
Split positions:  55    65    72    80    87    92    97    110   122   172   230
Yes (<=, >):      0,3   0,3   0,3   0,3   1,2   2,1   3,0   3,0   3,0   3,0   3,0
No  (<=, >):      0,7   1,6   2,5   3,4   3,4   3,4   3,4   4,3   5,2   6,1   7,0
Gini:             0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

(The lowest Gini, 0.300, is obtained at the split position 97.)

Alternative Splitting Criteria based on INFO
- Entropy at a given node t:  Entropy(t) = − ∑_j p(j|t) log p(j|t)
  (NOTE: p(j|t) is the relative frequency of class j at node t)
- Measures homogeneity of a node:
  - Maximum (log n_c) when records are equally distributed among all classes, implying least information
  - Minimum (0.0) when all records belong to one class, implying most information
- Entropy based computations are similar to the GINI index computations
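The sort-and-scan procedure can be sketched as follows (my own illustration, not the course's code): sort once in O(N log N), then sweep the candidate midpoints in order while updating the class counts incrementally. On the Taxable Income column of the earlier example it recovers the minimum weighted Gini of 0.300 (at the midpoint 97.5 between 95 and 100).

```python
def gini_from_counts(counts):
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

def best_threshold(values, labels):
    """Sort once (O(N log N)), then scan midpoints updating class counts in O(1) per step."""
    data = sorted(zip(values, labels))
    left = {}                              # class counts for records <= threshold
    right = {}
    for _, y in data:
        right[y] = right.get(y, 0) + 1
    best = (float("inf"), None)
    n = len(data)
    for i in range(n - 1):
        v, y = data[i]
        left[y] = left.get(y, 0) + 1       # move one record from right to left
        right[y] -= 1
        if data[i + 1][0] == v:
            continue                       # only test midpoints between distinct values
        thr = (v + data[i + 1][0]) / 2
        w = ((i + 1) * gini_from_counts(left) + (n - i - 1) * gini_from_counts(right)) / n
        best = min(best, (w, thr))
    return best                            # (weighted Gini, threshold)

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_threshold(income, cheat))       # -> (0.3, 97.5)
```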


(17)

Example: Web Robot Detection


(18)

Example: Web Robot Detection

I Resulting decision tree:

(more details in the book)

(19)

Characteristics of Decision Tree Induction

I Nonparametric approach:

I Can in the limit approximate any decision boundary to arbitrary precision

Approaches optimal performance (i.e. the Bayesian classifier with known distributions) in the infinite sample limit ... but requires regularization to avoid overlearning.

[Figure: the 2D example data in the (x1, x2) plane, shown over several animation steps as the fitted decision boundary is refined.]


(28)

Characteristics of Decision Tree Induction (cont.)

I Local, greedy learning to find a reasonable solution in reasonable time (as opposed to finding a globally optimal solution)

I Relatively easy to interpret (by experts or regular users of the system)

I cf. “Why was I recommended this?” on amazon.com (helps build confidence in the system)

(29)

Characteristics of Decision Tree Induction (cont.)

I Data fragmentation problem:

I The nodes far down the tree are based on a very small fraction of the data, even only on a few data points ⇒ typically not very reliable information

I Example: Divide any of the existing leaves into purer partitions:

A

B B

C x1

x2

0 1 2 3 4 5

0 1 2 3

x1>2.8

x2>2.0 x2>1.2

x1>4.3


(30)

Rule-based classifier

I e.g. Mac OS X ‘Mail’ application:

(31)

Rule-based classifier: Example

I Example:

r1: (‘webmail’) ∩ (‘give password’) → spam
r2: (‘important’) ∩ (sender = teacher) → not spam
r3: (‘viagra’) ∩ (‘extra strength’) → spam
r4: (‘millions’) ∩ (‘netflix challenge’) → not spam
r5: (‘you have won’) → spam

I Idea:

‘Many local classification models make up one global classifier’

I If we can find reliable rules of the form r1, ..., rn we can build a working classifier


(32)

Rule-based classifier: Definition

I Rule set R = {r1, ..., rn}

I Each rule ri is of the form

ri: (Conditioni) → yi,    (19)

where the left-hand side is the rule antecedent (a logical expression that evaluates to true or false depending on the data point) and the right-hand side is the rule consequence, i.e. a prediction (here, for simplicity, one of the classes).

I A rule ri is said to cover a record if its rule antecedent evaluates to true. Conversely, the record is said to trigger the rule.
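One simple way to represent such rules in code (a sketch with hypothetical names, not an API from the course material) is to store the antecedent as a predicate over the record and the consequence as a class label; a record triggers the rule exactly when the predicate evaluates to true on it.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    antecedent: Callable[[str], bool]   # evaluates to True/False on a record (here: an email text)
    consequence: str                    # the predicted class

    def covers(self, record: str) -> bool:
        return self.antecedent(record)

# r1: ('webmail') AND ('give password') -> spam
r1 = Rule(lambda email: "webmail" in email and "give password" in email, "spam")
# r5: ('you have won') -> spam
r5 = Rule(lambda email: "you have won" in email, "spam")

print(r1.covers("webmail alert: please give password"))   # True -> this email triggers r1
print(r5.covers("hello from an old friend"))               # False
```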

(33)

Properties of a single rule ri

I The coverage of a rule is the fraction of records (in the training data) which it covers

I The accuracy of a rule is the fraction of covered records for which the rule consequence equals the true class

For instance, if the rule

r5: (‘you have won’) → spam

has a coverage of 0.1 and an accuracy of 0.85, it means that 10% of all emails received included the phrase ‘you have won’, and out of all those emails 85% were truly spam and 15%

were non-spam.
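Both quantities are easy to compute from labeled training data. The sketch below is my own illustration (the toy email list is invented); a rule is passed in as a predicate plus its predicted class.

```python
def coverage_and_accuracy(predicate, predicted_class, records):
    """records is a list of (text, true_class) pairs.
    coverage = fraction of all records covered by the rule;
    accuracy = fraction of covered records whose true class matches the rule's prediction."""
    covered = [(x, y) for x, y in records if predicate(x)]
    coverage = len(covered) / len(records)
    accuracy = (sum(1 for _, y in covered if y == predicted_class) / len(covered)
                if covered else 0.0)
    return coverage, accuracy

emails = [
    ("you have won a prize", "spam"),
    ("you have won the lottery, claim now", "spam"),
    ("congrats, you have won our office raffle", "not spam"),
    ("meeting moved to 3pm", "not spam"),
    ("lunch tomorrow?", "not spam"),
]
print(coverage_and_accuracy(lambda e: "you have won" in e, "spam", emails))
# -> (0.6, 0.666...): 3 of 5 emails are covered, and 2 of those 3 are truly spam
```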


(34)

Properties of the rule set R as a whole

I The rules in rule set R are mutually exclusive if no two rules can be triggered by any single record.

I The rules in rule set R are exhaustive if at least one rule is triggered by any record.

(Note that we are here considering any possible record, not just the records in the training dataset.)

I Together, the two properties imply that any record can be classified into a unique class. However, not all rule sets R have these properties.

(35)

Example

I The rules:

r1: (‘webmail’) ∩ (‘give password’) → spam
r2: (‘important’) ∩ (sender = teacher) → not spam
r3: (‘viagra’) ∩ (‘extra strength’) → spam
r4: (‘millions’) ∩ (‘netflix challenge’) → not spam
r5: (‘you have won’) → spam

are not mutually exclusive, because the email “you have won the netflix challenge, please come to the bank tomorrow to collect your millions” triggers both rule r4 and rule r5.

They are also not exhaustive because the email “hello from an old friend” is not covered by any rule.
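On a finite sample of records the two properties can be checked mechanically; the sketch below (my own illustration) does this for the example above. Note that a check over sample records can refute the properties but not prove them, since they quantify over all possible records.

```python
def mutually_exclusive(rules, records):
    """No record may trigger more than one rule."""
    return all(sum(pred(x) for pred, _ in rules) <= 1 for x in records)

def exhaustive(rules, records):
    """Every record must trigger at least one rule."""
    return all(any(pred(x) for pred, _ in rules) for x in records)

rules = [
    (lambda e: "millions" in e and "netflix challenge" in e, "not spam"),  # r4
    (lambda e: "you have won" in e, "spam"),                               # r5
]
emails = ["you have won the netflix challenge, collect your millions",
          "hello from an old friend"]
print(mutually_exclusive(rules, emails))  # False: the first email triggers both rules
print(exhaustive(rules, emails))          # False: the second email triggers no rule
```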


(36)

2D example

[Figure: data points of two classes (red and green) in the (x1, x2) plane, together with three non-overlapping rectangular rule regions r1, r2 and r3.]

I Rule r1: (0.2 ≤ x1 ≤ 2.2) ∩ (1.3 ≤ x2 ≤ 4.2) → red
has a coverage of 8/40 = 0.2 and an accuracy of 7/8 = 0.875.

I The rules in rule set R = {r1, r2, r3} are mutually exclusive because the rectangles defining the rules are non-overlapping, but they are not exhaustive because there are areas of the space not covered by any of the rectangles

(37)

Default rule

I A simple way to handle a non-exhaustive rule set is to add a default rule which has an empty antecedent (always true):

rd: ( ) → yd

I Note that since this rule is always triggered, the resulting rule set is necessarily not mutually exclusive (as long as any other rule is sometimes triggered)

⇒ Some mechanism is needed to handle non-mutually exclusive cases (see next slide)


(38)

Ordering or voting

I When the rule set is not mutually exclusive, conflicts need to be handled by

I Rule ordering: Instead of an unordered collection of rules, the order is taken to be important. Hence, a new record is classified by the first rule that it triggers.

I Voting: Each triggered rule votes for its consequence class.

(The votes can also be weighted by the rule's accuracy.)

I Note that an ordered list of rules (a decision list) behaves quite a bit like a decision tree. In fact, it is easy to show that a decision tree can be written in the form of an ordered list of rules, and vice versa.
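A sketch of both conflict-resolution strategies (my own illustration; rules are represented as (predicate, class) pairs and the default class is an assumed fallback):

```python
from collections import Counter

def classify_ordered(rules, record, default="not spam"):
    """Rule ordering: the first triggered rule decides; fall back to a default rule."""
    for predicate, cls in rules:
        if predicate(record):
            return cls
    return default

def classify_voting(rules, record, default="not spam"):
    """Voting: every triggered rule votes for its consequence class."""
    votes = Counter(cls for predicate, cls in rules if predicate(record))
    return votes.most_common(1)[0][0] if votes else default

rules = [
    (lambda e: "millions" in e and "netflix challenge" in e, "not spam"),  # r4
    (lambda e: "you have won" in e, "spam"),                               # r5
]
email = "you have won the netflix challenge, come collect your millions"
print(classify_ordered(rules, email))  # 'not spam': r4 comes first in the ordered list
print(classify_voting(rules, email))   # one vote each here; ties could be broken by weighting votes by rule accuracy
```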

(39)

Sequential covering algorithm

I Learning a rule-based classifier can be performed by

identifying good rules and then removing the covered records from the training set:

(From Tan, Steinbach & Kumar:)

Example of Sequential Covering

[Figure, in four panels: (i) original data; (ii) Step 1; (iii) Step 2, with the first rule R1; (iv) Step 3, with rules R1 and R2.]
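As a concrete illustration of the covering loop, here is a minimal sketch (my own, not the course's procedure): each rule is a single 'word appears in the text' condition chosen greedily, and the records it covers are removed before the next rule is learned.

```python
def rule_quality(word, records, target_class):
    """Accuracy and coverage of the single-condition rule (word in text) -> target_class."""
    covered = [(t, y) for t, y in records if word in t]
    if not covered:
        return 0.0, 0
    accuracy = sum(y == target_class for _, y in covered) / len(covered)
    return accuracy, len(covered)

def sequential_covering(records, target_class, max_rules=3):
    """Repeatedly pick the best single-word rule for target_class and remove the records it covers."""
    rules, remaining = [], list(records)
    for _ in range(max_rules):
        if not any(y == target_class for _, y in remaining):
            break
        words = {w for t, _ in remaining for w in t.split()}
        best = max(words, key=lambda w: rule_quality(w, remaining, target_class))
        acc, cov = rule_quality(best, remaining, target_class)
        if cov == 0 or acc < 0.5:          # stop if no acceptable rule is left (arbitrary cutoff)
            break
        rules.append((best, target_class))
        remaining = [(t, y) for t, y in remaining if best not in t]   # remove covered records
    return rules

emails = [
    ("you have won millions", "spam"),
    ("viagra extra strength", "spam"),
    ("lecture notes attached", "not spam"),
    ("you have won the department bake-off", "not spam"),
]
print(sequential_covering(emails, "spam"))
# two single-word rules, one per spam email; which word wins among equally good candidates is arbitrary
```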


(40)

Selecting the next rule

I We want to select a rule that has both high coverage and high accuracy. (Of course, typically this is a trade-off which has to be made.)

I Searching for such a rule:

I Exhaustive search of all possible rules is infeasible for all but the very smallest of problems

I ‘General-to-specific’ greedy search: Start from an empty antecedent, add conditions one at a time to improve accuracy (but take care to still have decent coverage as well)

I ‘Specific-to-general’ greedy search: Start from a random record, then generalize by removing conditions one-by-one to obtain better coverage while not sacrificing too much in accuracy

(41)

Characteristics of rule-based classifiers

I Expressiveness similar to that of decision trees

I Rectilinear partitioning (though more complex decision boundaries with unordered rule sets and voting)

I Ordered rule sets can be written as decision trees and vice versa

I Similar also in terms of the underlying principles of learning the models

I Non-parametric, need for regularization

I Can produce descriptive models

I Easy to interpret (when the number of rules is not too large)


(42)

Decision trees and rule-based classifiers: Summary

I Can be easy to understand/interpret by experts/users (for small models)

I Classification generally very fast (worst case is linear in depth of decision tree or rule list length)

I Non-parametric method (can approximate optimal classifier for any underlying distribution) in the large sample limit (but may require a very large number of rules)

I Need to regularize / find a good stopping criterion when learning to avoid overfitting
