Decision trees and rule-based classifiers
Decision tree: An example
I Idea: Ask a sequence of questions (as in the ‘20 questions’
game) to infer the class
[Figure: a decision tree for spam filtering. The internal nodes test whether the message includes ‘netflix prize’, ‘viagra’, ‘millions’, or ‘meds’; each test branches on yes/no, and the leaves predict ‘spam’ or ‘not spam’.]
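The ‘sequence of questions’ idea can be sketched as a nested structure. The tree shape and keyword order below are illustrative stand-ins for the figure, not its exact topology:

```python
# Nested tuples: (keyword, subtree_if_present, subtree_if_absent);
# a bare string is a leaf carrying the predicted class.
TREE = ("viagra", "spam",
        ("netflix prize", "not spam",
         ("millions",
          ("meds", "spam", "not spam"),
          "not spam")))

def classify(tree, text):
    """Ask one 'question' per level until a leaf is reached."""
    if isinstance(tree, str):          # leaf: a class label
        return tree
    keyword, if_yes, if_no = tree
    return classify(if_yes if keyword in text else if_no, text)
```

For example, `classify(TREE, "extra strength viagra")` follows the yes-branch of the root and returns `"spam"`.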
Decision tree: A second example
© Tan, Steinbach, Kumar: Introduction to Data Mining, 4/18/2004
Classification techniques:
I Decision tree based methods
I Rule-based methods
I Memory based reasoning
I Neural networks
I Naïve Bayes and Bayesian belief networks
I Support vector machines
Example of a decision tree

Training data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (decision tree): Refund is the splitting attribute at the root. Refund = Yes → NO; Refund = No → test MarSt: Married → NO; Single or Divorced → test TaxInc: < 80K → NO, ≥ 80K → YES.
I There can be many different trees that all work equally well!
Decision tree: Structure
I Structure of the tree:
I A single root node with no incoming edges, and zero or more outgoing edges
I Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges
I Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges
I Node contents:
I Each terminal node is assigned a prediction (here, for simplicity: a definite class label).
I Each non-terminal node defines a test, with the outgoing edges representing the various possible results of the test (here, for simplicity: a test only involves a single attribute)
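The structure just described (one root, internal test nodes, terminal prediction nodes) maps directly onto a small data type. The field names here are our own choice for this sketch, not fixed terminology:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    prediction: Optional[str] = None          # set for leaf/terminal nodes
    test_attribute: Optional[str] = None      # set for internal nodes
    children: dict = field(default_factory=dict)  # test outcome -> child Node

    def is_leaf(self):
        # A leaf has no outgoing edges, i.e. no children.
        return not self.children
```

A decision tree is then just a root `Node` whose `children` dict maps each test outcome to a subtree.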
Decision tree: 2D example
[Figure: points in the unit square (axes x and y from 0 to 1) classified by a tree of threshold tests on x and y; the class counts at the leaves and the resulting axis-parallel decision regions are shown.]

I The border line between two neighboring regions of different classes is known as the decision boundary
I The decision boundary is parallel to the axes because each test condition involves a single attribute at a time

Oblique decision trees (e.g. a test x + y < 1 separating Class = + from Class = −):
I The test condition may involve multiple attributes
I More expressive representation
I Finding the optimal test condition is computationally expensive
I Notation: In this figure x and y are two continuous-valued attributes (i.e. y is not the class label in this figure!)
I Decision boundary consists of parts which all are parallel to the axes because each decision depends only on a single attribute
Learning a decision tree from data: General idea
I Simple idea: Recursively divide up the space into pieces which are as pure as possible

[Figure: a 2D dataset (x1 from 0 to 5, x2 from 0 to 3), split step by step: first x1 > 2.8, then x2 > 2.0 on one branch and x2 > 1.2 on the other, and finally x1 > 4.3; the resulting regions are labeled A, B, B, and C.]
Learning a decision tree from data: Hunt’s algorithm
I Notation: Let Dt denote the set of records corresponding to node t. For the root node, Dt is the set of all training data.
I Hunt’s algorithm:
1. If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt
2. If Dt contains records that belong to more than one class, select an attribute test condition that partitions the records into smaller subsets. Create a child node for each outcome and distribute the records in Dt to the children. Apply the algorithm recursively to each child node.
I Special cases:
I Dt is an empty set ⇒ use majority vote among the parent’s records
I All records in Dt are identical, but the labels are not ⇒ use majority vote
I This general method is also known as TDIDT (Top-Down Induction of Decision Trees).
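Hunt’s algorithm can be sketched roughly as follows. Two simplifications are assumed here: the test-selection step is replaced by ‘take the first remaining attribute’, and the split is multiway over the values actually observed at the node (so the empty-Dt case does not arise in this variant):

```python
from collections import Counter

def hunt(records, labels, attributes):
    """Simplified Hunt's algorithm; records are dicts, labels a parallel list."""
    if len(set(labels)) == 1:                  # case 1: pure node -> leaf
        return labels[0]
    if not attributes or all(r == records[0] for r in records):
        # identical records with mixed labels (or nothing left to test):
        # majority vote
        return Counter(labels).most_common(1)[0][0]
    attribute = attributes[0]                  # stand-in for real test selection
    children = {}
    for value in sorted(set(r[attribute] for r in records)):
        subset = [(r, y) for r, y in zip(records, labels) if r[attribute] == value]
        children[value] = hunt([r for r, _ in subset], [y for _, y in subset],
                               attributes[1:])
    return (attribute, children)
```

The returned tree is a nested `(attribute, {value: subtree})` structure with class labels at the leaves.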
Attribute test conditions
I Binary attributes: yes / no (two children only)
I Nominal (categorical) attributes with L states:
I Multiway split (L children)
I Binary split (2 children, any of the 2^(L−1) − 1 ways of splitting)
I Ordinal attributes with L states:
I Multiway or binary split
I Must respect the ordering (only combine contiguous values)
I Continuous attributes:
I Multiway or binary split
I Defined using breakpoints
Impurity measures
How do we measure whether a subset of the records is ‘pure’ or ‘impure’?
I Denote by p(i|t) the fraction of records belonging to class i of all the records at node t.
I Impurity measures:

Entropy(t) = − Σ_{i=0..K−1} p(i|t) log p(i|t)

Gini(t) = 1 − Σ_{i=0..K−1} p(i|t)²

Classification error(t) = 1 − max_i p(i|t)

where K is the total number of classes.
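The three measures translate directly into code. Log base 2 is assumed here (the formulas above leave the base unspecified; it only scales the entropy):

```python
from math import log2

def entropy(p):
    """Entropy of class fractions p = [p(0|t), ..., p(K-1|t)] at a node.
    By convention 0 * log 0 = 0, hence the pi > 0 filter."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def gini(p):
    """Gini impurity: 1 minus the sum of squared class fractions."""
    return 1 - sum(pi ** 2 for pi in p)

def classification_error(p):
    """1 minus the fraction of the majority class."""
    return 1 - max(p)
```

All three are 0 for a pure node, e.g. `gini([1.0, 0.0])`, and maximal at the uniform distribution.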
Impurity measures: Binary classification
[Figure: the three impurity measures for binary classification plotted against p0 ∈ [0, 1]: Entropy/2, Gini, and misclassification error. All three vanish at p0 = 0 and p0 = 1 and peak at p0 = 0.5.]
I Qualitatively, the three measures agree. However, some differences in the selection of test attributes do occur.
Selecting the best split
I Test all valid splits
I Select the split which maximizes the gain ∆:

∆ = I(parent) − Σ_{j=1..k} (N(vj) / N) I(vj),   (18)

where
I I(·) is the impurity of a given node
I N is the total number of records of the parent
I N(vj) is the number of records of child vj
I Note: Essentially just minimizing a weighted sum of the impurities of the children in the split
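The gain ∆ of equation (18) for a proposed split, sketched with Gini as the impurity I(·); either of the other measures could be substituted:

```python
def gini(p):
    return 1 - sum(pi ** 2 for pi in p)

def gain(parent_counts, children_counts):
    """Delta = I(parent) - sum_j N(v_j)/N * I(v_j), from per-class counts,
    e.g. parent_counts = [5, 5], children_counts = [[4, 1], [1, 4]]."""
    def fractions(counts):
        total = sum(counts)
        return [c / total for c in counts]
    N = sum(parent_counts)                      # total records of the parent
    children = sum(sum(c) / N * gini(fractions(c)) for c in children_counts)
    return gini(fractions(parent_counts)) - children
```

For a 5/5 parent split into (4, 1) and (1, 4) children, I(parent) = 0.5 and the weighted child impurity is 0.32, so ∆ = 0.18.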
I Computational issues:
I Binary attributes: Just one split to test per attribute
I Discrete attributes: Finite number of possible splits to test per attribute
I Continuous attributes: In principle, any value x ∈ R could be chosen as a breakpoint, but we only need to test midpoints between adjacent points. Computationally feasible by first sorting the records (complexity O(N log N)), then stepping through in order while updating the necessary statistics.
Continuous attributes: Computing the Gini index (cont.)

I For efficient computation, for each attribute:
I Sort the attribute on values
I Linearly scan these values, each time updating the count matrix and computing the Gini index
I Choose the split position that has the least Gini index

Cheat           No    No    No    Yes   Yes   Yes   No    No    No    No
Taxable Income  60    70    75    85    90    95    100   120   125   220
Split position  55    65    72    80    87    92    97    110   122   172   230
Yes (≤ | >)     0|3   0|3   0|3   0|3   1|2   2|1   3|0   3|0   3|0   3|0   3|0
No  (≤ | >)     0|7   1|6   2|5   3|4   3|4   3|4   3|4   4|3   5|2   6|1   7|0
Gini            0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

(The minimum Gini index, 0.300, is attained at split position 97.)
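The linear scan described above can be sketched as follows. One deviation from the table: where the table uses round split positions such as 97, this version tests exact midpoints between distinct adjacent values (hence 97.5), which gives the same weighted Gini:

```python
from collections import Counter

def best_gini_split(values, labels):
    """One sorted linear scan over candidate breakpoints (midpoints between
    distinct adjacent values), maintaining per-class counts on each side.
    Returns (best_threshold, best_weighted_gini)."""
    def gini_of(counts, total):
        if total == 0:
            return 0.0
        return 1 - sum((c / total) ** 2 for c in counts.values())

    pairs = sorted(zip(values, labels))
    n = len(pairs)
    left = {c: 0 for c in set(labels)}   # class counts below the breakpoint
    right = dict(Counter(labels))        # class counts above the breakpoint
    best = (None, float("inf"))
    for i in range(n - 1):
        v, y = pairs[i]
        left[y] += 1                     # move one record across the breakpoint
        right[y] -= 1
        if v == pairs[i + 1][0]:
            continue                     # no breakpoint between equal values
        nl, nr = i + 1, n - i - 1
        weighted = nl / n * gini_of(left, nl) + nr / n * gini_of(right, nr)
        if weighted < best[1]:
            best = ((v + pairs[i + 1][0]) / 2, weighted)
    return best
```

On the Taxable Income column of the table this finds the split between 95 and 100 with weighted Gini 0.300.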
Splitting criteria based on INFO

I Entropy at a given node t:

Entropy(t) = − Σ_j p(j|t) log p(j|t)

(NOTE: p(j|t) is the relative frequency of class j at node t.)
I Measures homogeneity of a node:
I Maximum (log nc) when records are equally distributed among all classes, implying least information
I Minimum (0.0) when all records belong to one class, implying most information
I Entropy-based computations are similar to the GINI index computations
Example: Web Robot Detection
I Resulting decision tree:
(more details in the book)
Characteristics of Decision Tree Induction
I Nonparametric approach:
I Can in the limit approximate any decision boundary to arbitrary precision
⇒ Approaches optimal performance (i.e. Bayesian classifier with known distributions) in the infinite sample limit ... but requires regularization to avoid overlearning.
[Figure: a 2D dataset (x1 from 0 to 5, x2 from 0 to 3) whose decision boundary is approximated by successively finer axis-parallel splits.]
Characteristics of Decision Tree Induction (cont.)
I Local, greedy learning to find a reasonable solution in reasonable time (as opposed to finding a globally optimal solution)
I Relatively easy to interpret (by experts or regular users of the system)
I cf. ”Why was I recommended this?” on amazon.com (helps build confidence in the system)
Characteristics of Decision Tree Induction (cont.)
I Data fragmentation problem:
I The nodes far down the tree are based on a very small fraction of the data, perhaps only a few data points ⇒ typically not very reliable information
I Example: Divide any of the existing leaves into purer partitions:
[Figure: the earlier 2D example with splits x1 > 2.8, x2 > 2.0, x2 > 1.2, x1 > 4.3 and regions A, B, B, C; any further split of an existing leaf would rest on only a handful of points.]
Rule-based classifier
I e.g. Mac OS X ‘Mail’ application:
Rule-based classifier: Example
I Example:
r1: (‘webmail’) ∩ (‘give password’) → spam
r2: (‘important’) ∩ (sender = teacher) → not spam
r3: (‘viagra’) ∩ (‘extra strength’) → spam
r4: (‘millions’) ∩ (‘netflix challenge’) → not spam
r5: (‘you have won’) → spam
I Idea:
‘Many local classification models make up one global classifier’
I If we can find reliable rules of the form r1, . . . ,rn we can build a working classifier
Rule-based classifier: Definition
I Rule set R={r1, . . . ,rn}
I Each rule ri is of the form

ri: (Conditioni) → yi,   (19)

where the left-hand side is the rule antecedent (a logical expression that evaluates to true or false depending on the data point) and the right-hand side is the rule consequence, i.e. a prediction (here, for simplicity, one of the classes).
I A rule ri is said to cover a record if its rule antecedent evaluates to true. Conversely, the record is said to trigger the rule.
Properties of a single rule ri
I The coverage of a rule is the fraction of records (in the training data) which it covers
I The accuracy of a rule is the fraction of covered records for which the rule consequence equals the true class
For instance, if the rule
r5: (‘you have won’)→ spam
has a coverage of 0.1 and an accuracy of 0.85, it means that 10% of all emails received included the phrase ‘you have won’, and out of all those emails 85% were truly spam and 15%
were non-spam.
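Coverage and accuracy can be computed directly from their definitions. A rule is represented here as a pair of an antecedent function and a predicted class, a representation chosen for this sketch:

```python
def coverage_and_accuracy(rule, records, labels):
    """rule = (antecedent, predicted_class); antecedent is a boolean function.
    Coverage: fraction of records covered.
    Accuracy: fraction of *covered* records whose label matches the prediction."""
    antecedent, predicted = rule
    covered = [y for r, y in zip(records, labels) if antecedent(r)]
    coverage = len(covered) / len(records)
    accuracy = (sum(1 for y in covered if y == predicted) / len(covered)
                if covered else 0.0)
    return coverage, accuracy
```

For a rule like r5, the antecedent would be a phrase test such as `lambda e: "you have won" in e`.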
Properties of the rule set R as a whole
I The rules in rule set R are mutually exclusive if no two rules can be triggered by any single record.
I The rules in rule set R are exhaustive if at least one rule is triggered by any record.
(Note that we are here considering any possible record, not just the records in the training dataset.)
I Together, the two properties imply that any record can be classified into a unique class. However, not all rule sets R have these properties.
Example
I The rules:
r1: (‘webmail’) ∩ (‘give password’) → spam
r2: (‘important’) ∩ (sender = teacher) → not spam
r3: (‘viagra’) ∩ (‘extra strength’) → spam
r4: (‘millions’) ∩ (‘netflix challenge’) → not spam
r5: (‘you have won’) → spam
are not mutually exclusive because the email “you have won the netflix challenge, please come to the bank tomorrow to collect your millions” triggers both rules r4 and r5.
They are also not exhaustive because the email “hello from an old friend” is not covered by any rule.
2D example
[Figure: a 2D dataset (x1 from 0 to 5, x2 from 0 to 3) of green and red points, with three non-overlapping rectangles r1, r2, r3, each predicting a class (r1: red, r2: green, r3: red).]
I Rule r1: (0.2≤x1 ≤2.2)∩(1.3≤x2 ≤4.2)→ red
has a coverage of 8/40 = 0.2 and an accuracy of 7/8 = 0.875.
I The rules in rule set R={r1,r2,r3} are mutually exclusive because the rectangles defining the rules are non-overlapping, but they are not exhaustive because there are areas of the space not covered by any of the rectangles
Default rule
I A simple way to handle a non-exhaustive rule set is to add a default rule which has an empty antecedent (always true):
rd: ()→yd
I Note that since this rule is always triggered, the resulting rule set is necessarily not mutually exclusive (as long as any other rule is sometimes triggered)
⇒ Some mechanism is needed to handle non-mutually exclusive cases (see next slide)
Ordering or voting
I When the rule set is not mutually exclusive, conflicts need to be handled by
I Rule ordering: Instead of an unordered collection of rules, the order is taken to be important. Hence, a new record is classified by the first rule that it triggers.
I Voting: Each triggered rule votes for its consequence class.
(The votes can also be weighted by the rule’s accuracy.)
I Note that an ordered list of rules (a decision list) behaves quite a bit like a decision tree. In fact, it is easy to show that a decision tree can be written in the form of an ordered list of rules, and vice versa.
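A rule-ordering classifier (decision list) with a default rule can be sketched as a simple closure; the rule representation (antecedent function plus class) is our own choice for this sketch:

```python
def decision_list(rules, default):
    """rules: an *ordered* list of (antecedent, class) pairs.
    The first triggered rule wins; the default rule rd fires when none do."""
    def classify(record):
        for antecedent, label in rules:
            if antecedent(record):
                return label
        return default                 # rd: () -> yd, always triggered last
    return classify
```

Placing the default rule conceptually at the end of the ordering is what resolves the non-mutual-exclusivity it introduces.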
Sequential covering algorithm
I Learning a rule-based classifier can be performed by
identifying good rules and then removing the covered records from the training set:
[Figure: Example of sequential covering: (i) the original data; (ii) Step 1: a first rule R1 is found; (iii) Step 2: the records covered by R1 are removed; (iv) Step 3: a second rule R2 is found on the remaining records.]
Selecting the next rule
I We want to select a rule that has both high coverage and high accuracy. (Of course, typically this is a trade-off which has to be made.)
I Searching for such a rule:
I Exhaustive search of all possible rules infeasible for all but the very smallest of problems
I ‘General-to-specific’ greedy search: Start from an empty antecedent, add conditions one at a time to improve accuracy (but take care to still have decent coverage as well)
I ‘Specific-to-general’ greedy search: Start from a random record, then generalize by removing conditions one-by-one to obtain better coverage while not sacrificing too much in accuracy
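The sequential covering loop can be sketched as follows, under the simplifying assumption that candidate rules come from a fixed pool rather than from the general-to-specific or specific-to-general searches just described:

```python
def sequential_covering(records, labels, candidate_rules, min_accuracy=0.8):
    """Greedy sketch: repeatedly pick the candidate rule with the best
    (accuracy, coverage) on the *remaining* records, then remove the
    records it covers. candidate_rules: list of (antecedent, class) pairs;
    min_accuracy is a hypothetical stopping threshold."""
    remaining = list(zip(records, labels))
    learned = []
    while remaining:
        best, best_score = None, (0.0, 0.0)
        for antecedent, cls in candidate_rules:
            covered = [y for r, y in remaining if antecedent(r)]
            if not covered:
                continue
            accuracy = sum(1 for y in covered if y == cls) / len(covered)
            score = (accuracy, len(covered) / len(remaining))
            if score > best_score:          # accuracy first, then coverage
                best, best_score = (antecedent, cls), score
        if best is None or best_score[0] < min_accuracy:
            break                           # no acceptable rule left
        learned.append(best)
        remaining = [(r, y) for r, y in remaining if not best[0](r)]
    return learned
```

Each iteration corresponds to one ‘step’ in the sequential covering figure: find a rule, strip out what it covers, repeat.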
Characteristics of rule-based classifiers
I Expressiveness similar to that of decision trees
I Rectilinear partitioning (though more complex decision boundaries with unordered rule sets and voting)
I Ordered rule sets can be written as decision trees and vice versa
I Similar also in terms of the underlying principles of learning the models
I Non-parametric, need for regularization
I Can produce descriptive models
I Easy to interpret (when the number of rules is not too large)
Decision trees and rule-based classifiers: Summary
I Can be easy to understand/interpret by experts/users (for small models)
I Classification generally very fast (worst case is linear in depth of decision tree or rule list length)
I Non-parametric method (can approximate the optimal classifier for any underlying distribution) in the large sample limit (but may require a very large number of rules)
I Need to regularize / find a good stopping criterion when learning to avoid overfitting